# Research for the mobile operator "Megaline". Tariff Recommendation.

The mobile operator "Megaline" has found that many customers are using outdated tariffs. The operator wants to build a system that analyzes customer behavior and suggests one of the new tariffs: "Smart" or "Ultra". We have data on the behavior of customers who have already switched to these tariffs. We need to build a classification model that selects the appropriate tariff. Data preprocessing is not required.

Attributes:

- calls: number of calls
- minutes: total duration of calls in minutes
- messages: number of SMS messages
- mb_used: internet traffic in megabytes
- is_ultra: the tariff used during the month ("Ultra" - 1, "Smart" - 0).

Research Steps:

1. Open the data file located at '/datasets/users_behavior.csv'.
2. Divide the original data into training, validation, and test sets.
3. Explore the quality of different models by adjusting hyperparameters.
4. Summarize the conclusions of the research.
5. Check the model's quality on the test set.
6. Additional task: check the models for adequacy.

## Let's open and examine the file.

Import all the necessary libraries.

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv('/Users/daniyardjumaliev/Jupyter/Projects/datasets/users_behavior.csv')
display(df)
display(df.describe())
df.info()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |


| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


## Let's split the data into samples.

First, let's separate the feature columns from the target column.

In [3]:
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

Now let's split the data into training (60%), validation (20%), and test (20%) sets.

In [4]:
features_train, features_temp, target_train, target_temp = train_test_split(
    features, target, test_size=0.4, random_state=12345)

features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp, target_temp, test_size=0.5, random_state=12345)

print('Train:', features_train.shape)
print('Validation:', features_valid.shape)
print('Test:', features_test.shape)

Train: (1928, 4)
Validation: (643, 4)
Test: (643, 4)
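The two-step split above can be wrapped in a small helper. The sketch below is illustrative only: the function name `split_60_20_20` and the synthetic data are assumptions, and the `stratify` option (which keeps the class ratio equal across the three parts) is an addition that the original cell does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_60_20_20(features, target, seed=12345, stratify=True):
    """Split into train/validation/test parts of roughly 60/20/20 (hypothetical helper)."""
    strat = target if stratify else None
    # Step 1: hold out 40% of the rows.
    f_train, f_temp, t_train, t_temp = train_test_split(
        features, target, test_size=0.4, random_state=seed, stratify=strat)
    strat_temp = t_temp if stratify else None
    # Step 2: cut the held-out 40% in half: 20% validation, 20% test.
    f_valid, f_test, t_valid, t_test = train_test_split(
        f_temp, t_temp, test_size=0.5, random_state=seed, stratify=strat_temp)
    return (f_train, t_train), (f_valid, t_valid), (f_test, t_test)


# Synthetic stand-in for the real data: 1000 rows, roughly 30% positive class.
rng = np.random.default_rng(0)
demo_features = pd.DataFrame(
    rng.normal(size=(1000, 4)),
    columns=['calls', 'minutes', 'messages', 'mb_used'])
demo_target = pd.Series((rng.random(1000) < 0.3).astype(int), name='is_ultra')

train, valid, test = split_60_20_20(demo_features, demo_target)
for name, (f, t) in [('train', train), ('valid', valid), ('test', test)]:
    # Print the shape and the share of the positive class in each part.
    print(name, f.shape, round(t.mean(), 3))
```

With stratification the share of "Ultra" rows stays nearly identical in all three parts, which makes validation and test accuracy easier to compare on an imbalanced target.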


## Let's explore the models.

Now, let's determine which learning model will be the most effective.

Decision tree.

In [5]:
best_depth = 0
best_result = 0

# Try tree depths 1 to 10 and keep the one with the best validation accuracy.
for depth in range(1, 11):
    tree_model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree_model.fit(features_train, target_train)
    # For classifiers, score() returns accuracy.
    result = tree_model.score(features_valid, target_valid)
    if result > best_result:
        best_result = result
        best_depth = depth

print(f'The quality of the decision tree on the validation set: {best_result}, max_depth: {best_depth}')

The quality of the decision tree on the validation set: 0.7853810264385692, max_depth: 3


Random forest.

In [6]:
best_model = None
best_result = 0
best_est = 0
best_depth = 0
# Sweep the number of trees (10 to 50) and the depth (1 to 10),
# keeping the combination with the best validation accuracy.
for est in range(10, 51, 10):
    for depth in range(1, 11):
        rand_forest_model = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)
        rand_forest_model.fit(features_train, target_train)
        result = rand_forest_model.score(features_valid, target_valid)
        if result > best_result:
            best_model = rand_forest_model
            best_result = result
            best_est = est
            best_depth = depth

print(f'The quality of the random forest on the validation set: {best_result}, n_estimators: {best_est}, max_depth: {best_depth}')

The quality of the random forest on the validation set: 0.8087091757387247, n_estimators: 40, max_depth: 8
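The nested loops above can also be written as a single loop over a parameter grid. The following is a minimal sketch of that idea using scikit-learn's `ParameterGrid`; the data is synthetic (generated with `make_classification`), and the smaller grid is an assumption chosen to keep the example fast, not the grid used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in data with four features, like the real table.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Every combination of the listed values is visited exactly once.
grid = ParameterGrid({'n_estimators': [10, 20], 'max_depth': [2, 4, 6]})

best_score, best_params = -1.0, None
for params in grid:
    model = RandomForestClassifier(random_state=12345, **params)
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)  # validation accuracy
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```

Adding another hyperparameter then only means adding a key to the dictionary rather than another level of loop nesting.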


Logistic regression.

In [7]:
logistic_model = LogisticRegression(random_state=12345, solver='lbfgs', max_iter=1000)
logistic_model.fit(features_train, target_train)
logistic_accuracy = logistic_model.score(features_valid, target_valid)
print(f'The quality of the logistic regression on the validation set: {logistic_accuracy}')

The quality of the logistic regression on the validation set: 0.7107309486780715
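One common adjustment for logistic regression is to standardize the features first, since columns such as `calls` and `mb_used` live on very different scales. The sketch below shows how that could be wired up with a pipeline. It runs on synthetic data with artificially rescaled columns, so it says nothing about whether scaling would actually raise the accuracy on the Megaline data; that remains untested here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the multipliers loosely mimic columns
# measured in very different units (an assumption for illustration).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X = X * np.array([1.0, 10.0, 1.0, 1000.0])
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The scaler is fitted on the training part only, then applied to
# whatever data is passed to predict() or score().
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, max_iter=1000))
scaled_logreg.fit(X_train, y_train)

scaled_accuracy = scaled_logreg.score(X_valid, y_valid)
print(f'Validation accuracy with scaling: {scaled_accuracy:.3f}')
```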


_Conclusion_: After training each model on the training set (60% of the original data) and scoring it on the validation set (20%), the following accuracies were obtained:

- Decision Tree - 0.7853810264385692, depth: 3
- Random Forest - 0.8087091757387247, number of trees: 40, depth: 8
- Logistic Regression - 0.7107309486780715

The most accurate of the tested models is the random forest. The decision tree is cheaper to train and to predict with, at the cost of about 2.3 percentage points of validation accuracy. Logistic regression needs further tuning, as its accuracy is below the required 0.75 threshold.

## Let's check our best model - Random forest on the test set.

In [8]:
rand_forest_model = RandomForestClassifier(random_state=12345, n_estimators=40, max_depth=8)
rand_forest_model.fit(features_train, target_train)
predictions_test = rand_forest_model.predict(features_test)
rand_accuracy_test = rand_forest_model.score(features_test, target_test)
print(f'The quality of the random forest on the test set: {rand_accuracy_test}')

The quality of the random forest on the test set: 0.7962674961119751


__Conclusion__: Accuracy on the test set (0.796) is slightly lower than on the validation set (0.809), but it still meets the customer's requirement of accuracy above 0.75. The random forest is more resource-intensive than a single tree, so if the amount of data grows significantly, processing time will grow as well, which could affect customer loyalty. In that scenario the decision tree would be the better choice. If no sharp increase in subscribers, and therefore in data, is expected, I recommend the random forest, since it was about 2.3 percentage points more accurate than the decision tree on the validation set (0.809 vs. 0.785).

## (Bonus) Let's check the models for adequacy.

We will assess the adequacy of all three models by comparing them with the baseline metric. To do this, first, let's find out the number of "1" (Ultra tariff) and "0" (Smart tariff) in the original data.

In [9]:
print(df['is_ultra'].value_counts())

is_ultra
0    2229
1     985
Name: count, dtype: int64


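So, our baseline metric, base_accuracy, is the number of "0"s (Smart, the majority class) divided by the total number of records (Smart + Ultra). This is the accuracy of a naive model that always predicts the most frequent class.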

In [10]:
base_accuracy = 2229 / (2229 + 985)
print(f'base_accuracy = {base_accuracy}')

base_accuracy = 0.693528313627878


In [11]:
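# Baseline accuracy from the previous cell.
base_accuracy = 0.693528313627878

# Model accuracies entered by hand, rounded to three decimals.
# 0.796 is the random forest's test-set accuracy from cell 8; the decision tree
# and logistic regression figures are not printed by any earlier cell.
decision_tree_accuracy = 0.779
random_forest_accuracy = 0.796
logistic_regression_accuracy = 0.684

# The improvements below are relative to the baseline, not percentage points.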

decision_tree_improvement = (decision_tree_accuracy - base_accuracy) / base_accuracy
random_forest_improvement = (random_forest_accuracy - base_accuracy) / base_accuracy
logistic_regression_improvement = (logistic_regression_accuracy - base_accuracy) / base_accuracy

print(f'Improvement in the accuracy of the decision tree: {decision_tree_improvement:.2%}')
print(f'Improvement in the accuracy of the random forest: {random_forest_improvement:.2%}')
print(f'Improvement in the accuracy of the logistic regression: {logistic_regression_improvement:.2%}')

Improvement in the accuracy of the decision tree: 12.32%
Improvement in the accuracy of the random forest: 14.78%
Improvement in the accuracy of the logistic regression: -1.37%
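The percentages above are relative improvements, i.e. the gain divided by the baseline. The short sketch below prints them next to the absolute difference in percentage points so the two are not confused. The helper name `improvement` is hypothetical; the accuracy values are the ones used in the cell above.

```python
def improvement(model_accuracy, base_accuracy):
    """Return (absolute, relative) improvement over the baseline."""
    absolute = model_accuracy - base_accuracy
    relative = absolute / base_accuracy
    return absolute, relative


base_accuracy = 2229 / (2229 + 985)
scores = {
    'decision tree': 0.779,
    'random forest': 0.796,
    'logistic regression': 0.684,
}

for name, accuracy in scores.items():
    absolute, relative = improvement(accuracy, base_accuracy)
    # 'pp' = percentage points (absolute difference in accuracy).
    print(f'{name}: {absolute * 100:+.2f} pp absolute, {relative:+.2%} relative')
```

For example, the decision tree's 12.32% relative improvement corresponds to about 8.5 percentage points of accuracy.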


_Conclusion_:

- Decision tree: 12.32% relative improvement over the baseline (0.779 vs. 0.694). The model predicts noticeably better than the naive strategy.

- Random forest: 14.78% relative improvement over the baseline (0.796 vs. 0.694). This is the most accurate of the three models.

- Logistic regression: -1.37%. With the 0.684 accuracy used here, the model is slightly worse than the naive strategy of always predicting the most frequent class. Its validation accuracy reported earlier (0.711) is only marginally above the baseline.

In conclusion, the decision tree and the random forest clearly beat the baseline, so both can be considered adequate. Logistic regression, in its current configuration, performs about as well as the naive strategy and therefore adds little.