# Phone plan recommendations

The mobile operator "Megaline" has discovered that many customers are using legacy tariff plans. They want to build a system capable of analyzing customer behavior and offering them a new tariff: "Smart" or "Ultra."

They have access to data on the behavior of customers who have already switched to these tariffs. The goal is to build a classification model that will select the appropriate tariff. Data preprocessing is not needed, as it's already been done.

The project's objective is to build a model with the highest possible accuracy. The model is considered successful if the accuracy is at least 0.75.

## Data exploration

In [1]:
import pandas as pd

from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
 
from joblib import dump

In [2]:
data = pd.read_csv('/Users/mojseenkodenis/Desktop/users_behavior.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [3]:
data.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


The data has already been cleaned and consists of information about user behavior for a month, with the following columns:

 - minutes — total call duration in minutes,
 - messages — the number of SMS messages,
 - mb_used — internet traffic used in megabytes,
 - is_ultra — which tariff the user used during the month (1 for "Ultra," 0 for "Smart").

In this study, the goal is to create a model capable of recommending a tariff to the user. Therefore, the target feature will be the "is_ultra" column, while the others will be used as features.

In [4]:
##Splitting data into train, validation and testing sets in 60%-20%-20% proportion
train_valid, test = train_test_split(data, test_size=0.2, random_state=12345)
train, valid = train_test_split(train_valid, test_size=0.25, random_state=12345)

In [5]:
for i, name in zip((train, valid, test), ('train', 'valid', 'test')):
    print(f'Size of {name}: {i.shape[0] / data.shape[0] * 100:.2f}%')

Size of train: 59.99%
Size of valid: 20.01%
Size of test: 20.01%


## Model training

In this study, the goal is to determine the appropriate tariff for the user, which is a categorical feature, indicating that we will be addressing a classification problem. Therefore, it is necessary to explore three types of models: decision trees, random forests, and logistic regression.

### Decision tree

In [6]:
features_train = train.drop('is_ultra', axis=1)
target_train = train['is_ultra']
features_valid = valid.drop('is_ultra', axis=1)
target_valid = valid['is_ultra']
features_test = test.drop('is_ultra', axis=1)
target_test = test['is_ultra']

In [7]:
best_model = None
best_result = 0
best_depth = 0

for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth = depth)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    result = accuracy_score(target_valid, predictions)
    if result > best_result:
        best_model = model
        best_result = result
        best_depth = depth
print('Accuracy of the best model:', best_result)    
print('Depth of the best result:', best_depth)

Accuracy of the best model: 0.7651632970451011
Depth of the best result: 3


In [8]:
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth = depth)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    result = accuracy_score(target_valid, predictions)
    print('Depth=', depth, '(', result, ')')

Depth= 1 ( 0.7387247278382582 )
Depth= 2 ( 0.7573872472783826 )
Depth= 3 ( 0.7651632970451011 )
Depth= 4 ( 0.7636080870917574 )
Depth= 5 ( 0.7589424572317263 )


The primary hyperparameter for the decision tree is the maximum depth (max_depth). Through a loop, it was determined that the model achieved the best result with an accuracy of 0.765 when the maximum depth was set to 3.

### Random forest

In [9]:
best_model_forest = None
best_result_forest = 0
best_n_estimators = 0
best_depth = 0

for est in range(50, 200, 10):
    for depth in range(1, 11): 
        model_forest = RandomForestClassifier(random_state = 12345, n_estimators = est, max_depth=depth)
        model_forest.fit(features_train, target_train)
        result = model_forest.score(features_valid, target_valid)
        if result > best_result_forest:
            best_model_forest = model_forest
            best_result_forest = result
            best_n_estimators = est
            best_depth = depth
print('Accuracy of the best model:', best_result_forest)    
print('n estimators of the best result:', best_n_estimators)        
print('Depth of the best result:', best_depth)   

Accuracy of the best model: 0.7993779160186625
n estimators of the best result: 130
Depth of the best result: 10


The key parameter for the random forest is the number of estimators (n_estimators). It was shown that the model achieved the best result with an accuracy of 0.799 when the number of estimators was set to 130 and the maximum depth was set to 10.

### Logistic regression

In [10]:
best_model_reg = None
best_result_reg = 0
best_max_iter = 0
best_c_value = 0

for c_value in [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]: 
    for iter in range(100, 1000, 100):
        model_regression = LogisticRegression(random_state=12345, C = c_value, solver='lbfgs', max_iter=iter)
        model_regression.fit(features_train, target_train)
        model_regression.predict(features_valid)
        result_regression = model_regression.score(features_valid, target_valid)
        if result > best_result_reg:
            best_model_reg = model_regression
            best_result_reg = result_regression
            best_max_iter = iter
            best_c_value = c_value
        
print('Accuracy:', result_regression)
print('Best max_iter:', best_max_iter)
print('Best c_value:', best_c_value)

Accuracy: 0.7262830482115086
Best max_iter: 900
Best c_value: 100


## Testing the model

The model with the highest accuracy turned out to be the random forest (0.788), so we will use it for testing with a maximum depth of 10 and 130 estimators.

In [11]:
model = RandomForestClassifier(random_state = 12345, n_estimators = 130, max_depth = 10)
model.fit(features_train, target_train)
predictions_test = model.predict(features_test)
result_test = accuracy_score(target_test, predictions_test)
print('Accuracy:', result_test)

Accuracy: 0.7993779160186625


The test on the test set showed that the accuracy of the random forest model on the data is 0.799, which exceeds the specified threshold of 0.75.

## Testing the model for adequacy

To test the models for adequacy, we will use the DummyClassifier, which looks for the most popular value and makes predictions based on its count.

In [12]:
dummy = DummyClassifier(strategy='most_frequent', random_state=12345)
dummy.fit(features_train, target_train)
dummy_prediction = dummy.predict(features_test)
dummy_result = dummy.score(target_test, predictions_test)
dummy_accuracy = accuracy_score(target_test, dummy_prediction)

print(f"Accuracy of the Dummy Classifier: {dummy_result:.2f}")

Accuracy of the Dummy Classifier: 0.79


In the end, the "dumb classifier" showed an accuracy of 0.790, which is higher than that of the decision tree (0.765) and logistic regression (0.726) but lower than the best model, the random forest (0.799). Therefore, the most accurate model predicts user tariffs quite successfully.

## Conclusion

In the study, three models were trained to predict two user tariffs (Smart and Ultra).

The first model, the decision tree, showed the best result with an accuracy of 0.765 at a depth of 3 out of 5 considered.

The second model, the random forest, showed the highest accuracy among the models at 0.799 with 130 estimators and a depth of 10.

The third model, logistic regression, showed the lowest accuracy among the models (0.726) with a maximum number of training iterations at 900.

Thus, only two models were able to achieve an accuracy of at least 0.75. Additionally, the adequacy check showed that the random forest model performed better than a random model (0.790), indicating its successful performance.