# Predicting the Best Mobile Plan for Megaline Subscribers

## Introduction

Megaline, a leading mobile carrier, has found that many of its subscribers still use legacy plans despite the availability of more feature-rich options. To address this, Megaline wants a model that analyzes subscriber behavior and recommends one of its newer plans: Smart or Ultra.

Using historical data from subscribers who have already switched to the new plans, the task is to build a classification model that predicts which plan best fits each subscriber. The target is an accuracy of at least 0.75 on the test set. A model that meets this target will help Megaline improve its plan recommendations, and with them customer satisfaction and retention.

In [90]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [91]:
df = pd.read_csv('/datasets/users_behavior.csv')

In [92]:
df

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0


In [93]:
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

The features are every column except `is_ultra`; the target is the `is_ultra` column (1 for Ultra, 0 for Smart).

In [94]:
features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size=0.40, random_state=12345)

I split the data into a training set and a temporary set, which I will then split into a test set and a validation set.

In [95]:
features_test, features_valid, target_test, target_valid = train_test_split(features_temp, target_temp, test_size=0.50, random_state=12345)

First, I split the full dataset into a training set (60%) and a temporary set (40%).

Then I split the temporary set in half, giving a test set (20%) and a validation set (20%).

Together these make up 100% of the data (a 60-20-20 split).
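The two-stage split described above can be wrapped in a small helper. This is a minimal sketch on toy data, not the notebook's own code: the function name `split_60_20_20` and the toy arrays are hypothetical, and the `stratify` argument is an optional variation that the cells above do not use. It keeps the class proportions the same in all three sets, which can matter when one plan is much rarer than the other.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_60_20_20(features, target, seed=12345):
    """Two-stage split: 60% train, then the remaining 40% cut in half."""
    f_train, f_temp, t_train, t_temp = train_test_split(
        features, target, test_size=0.40, random_state=seed,
        stratify=target)  # assumption: stratification is NOT in the original notebook
    f_valid, f_test, t_valid, t_test = train_test_split(
        f_temp, t_temp, test_size=0.50, random_state=seed,
        stratify=t_temp)
    return (f_train, t_train), (f_valid, t_valid), (f_test, t_test)


# Toy data: 100 rows, 30 of them in the positive class.
demo_features = np.arange(200).reshape(100, 2)
demo_target = np.array([1] * 30 + [0] * 70)

train, valid, test = split_60_20_20(demo_features, demo_target)
split_sizes = (len(train[0]), len(valid[0]), len(test[0]))
print(split_sizes)
```

With stratification, each of the three toy sets keeps the same 30% share of positive rows as the full toy dataset.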

In [96]:
print(len(df))

3214


I checked the length of the original dataframe.

In [97]:
print(len(features_train), len(features_test), len(features_valid))

print(len(target_train), len(target_test), len(target_valid))

1928 643 643
1928 643 643


I checked the sizes of the three splits. 60% of 3214 is about 1928 and 20% of 3214 is about 643, so the split was performed correctly.
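The sizes can also be reproduced by hand. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the other part; that rounding rule is my understanding of scikit-learn's behavior, not something shown in the notebook.

```python
import math

n_rows = 3214  # length of the original dataframe, from the output above

# First split: 40% is held out (assumed to round up); the rest is training data.
n_temp = math.ceil(0.40 * n_rows)
n_train = n_rows - n_temp

# Second split: the held-out rows are divided in half.
n_second = math.ceil(0.50 * n_temp)
n_first = n_temp - n_second

print(n_train, n_first, n_second)  # prints: 1928 643 643
```

The exact fractions are 1928.4 and 642.8 rows, so whole-row rounding accounts for the small difference from a perfect 60-20-20 split.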

In [98]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV

Next, I define a function that fits a model on the training set and reports its metrics on the validation set.

In [126]:
def eval_model(model, features_train, features_valid, target_train, target_valid):
    """Fit the model on the training set and return its metrics on the validation set."""
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    # Precision, recall, and F1 are computed for the positive class (is_ultra == 1).
    accuracy = accuracy_score(target_valid, predictions)
    precision = precision_score(target_valid, predictions)
    recall = recall_score(target_valid, predictions)
    f1 = f1_score(target_valid, predictions)
    return {'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1 Score': f1}

I will now train and tune the models. Each `GridSearchCV` selects hyperparameters by 5-fold cross-validation on the training set, and I then compare the tuned models on the validation set.

### Decision Tree

In [127]:
tree_params = {
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5]
}

tree_clf = GridSearchCV(DecisionTreeClassifier(random_state=12345), tree_params, cv=5) # cross validation

tree_metrics = eval_model(tree_clf, features_train, features_valid, target_train, target_valid)
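The notebook does not show which hyperparameters the search picked. As a minimal sketch of how that could be inspected, the example below runs a much smaller grid on synthetic data from `make_classification`; the names `X_demo`, `y_demo`, and `demo_search` are hypothetical and the numbers it produces say nothing about the Megaline data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the Megaline dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=12345)

demo_search = GridSearchCV(
    DecisionTreeClassifier(random_state=12345),
    {'max_depth': [2, 5, None]},
    cv=5,
)
demo_search.fit(X_demo, y_demo)

best_params = demo_search.best_params_    # winning hyperparameter combination
best_cv_score = demo_search.best_score_   # its mean cross-validated accuracy
n_candidates = len(demo_search.cv_results_['params'])

print(best_params, round(best_cv_score, 3), n_candidates)
```

Applied to the fitted `tree_clf` above, the same attributes (`best_params_`, `best_score_`) would show the selected tree settings and their cross-validated accuracy on the training set.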


### Random Forest

In [128]:
forest_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

forest_clf = GridSearchCV(RandomForestClassifier(random_state=12345), forest_params, cv=5)

forest_metrics = eval_model(forest_clf, features_train, features_valid, target_train, target_valid)


### Logistic Regression

In [129]:
log_params = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear']
}

log_clf = GridSearchCV(LogisticRegression(random_state=12345), log_params, cv=5)

log_metrics = eval_model(log_clf, features_train, features_valid, target_train, target_valid)
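To get a feel for how much work the three searches do, the number of model fits can be counted from the grids themselves. This sketch uses only the standard library; the grids are copied from the cells above, and the variable names are my own.

```python
from itertools import product

# Parameter grids copied from the cells above.
grids = {
    'Decision Tree': {
        'max_depth': [None, 5, 10, 15, 20],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 5],
    },
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 5, 10, 15],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2],
    },
    'Logistic Regression': {
        'C': [0.01, 0.1, 1, 10, 100],
        'solver': ['liblinear'],
    },
}
cv_folds = 5

fit_counts = {}
for name, grid in grids.items():
    # Every combination of values is one candidate; each is fit once per fold.
    n_candidates = len(list(product(*grid.values())))
    fit_counts[name] = n_candidates * cv_folds
    print(f"{name}: {n_candidates} candidates, {fit_counts[name]} cross-validation fits")
```

With its default `refit=True`, `GridSearchCV` also fits the winning candidate once more on the whole training set, which is the model that `best_estimator_` returns.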

### Compare metrics and select the best model

In [131]:
metrics = {
    'Decision Tree': tree_metrics,
    'Random Forest': forest_metrics,
    'Logistic Regression': log_metrics
}

# Use distinct loop names so the `metrics` dictionary is not overwritten.
for model_name, model_metrics in metrics.items():
    print(f"{model_name} Metrics:", model_metrics)

Decision Tree Metrics: {'Accuracy': 0.7822706065318819, 'Precision': 0.7787610619469026, 'Recall': 0.43349753694581283, 'F1 Score': 0.5569620253164557}
Random Forest Metrics: {'Accuracy': 0.80248833592535, 'Precision': 0.7567567567567568, 'Recall': 0.5517241379310345, 'F1 Score': 0.6381766381766382}
Logistic Regression Metrics: {'Accuracy': 0.7262830482115086, 'Precision': 0.7872340425531915, 'Recall': 0.18226600985221675, 'F1 Score': 0.296}


The Random Forest had the highest accuracy on the validation set (80.2%), so I will now evaluate it on the test set.

### Testing the Best Model (Random Forest) on the Test Set

In [132]:
# Best Random Forest model from GridSearchCV
best_rand_forest = forest_clf.best_estimator_
best_rand_forest

In [133]:
# Now predicting on the test set
test_predictions = best_rand_forest.predict(features_test)

In [134]:
# Test accuracy on the test set, this is a test on completely unseen data
test_accuracy = accuracy_score(target_test, test_predictions)
print('Test Accuracy of the best model (Random Forest):', test_accuracy)

Test Accuracy of the best model (Random Forest): 0.7978227060653188
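A single accuracy figure on 643 rows carries some sampling noise. As a rough sketch, the example below computes a normal-approximation 95% interval for the test accuracy. It assumes the test rows are independent draws from the same population as future subscribers, which the notebook does not verify.

```python
import math

test_accuracy = 0.7978227060653188  # from the output above
n_test = 643                        # test set size, from the split check

# Standard error of a proportion, then a 95% normal-approximation interval.
standard_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)
margin = 1.96 * standard_error
lower, upper = test_accuracy - margin, test_accuracy + margin

print(f"95% interval: {lower:.3f} to {upper:.3f}")
```

Under that assumption the interval runs from roughly 0.77 to 0.83, so even its lower end sits above the 0.75 target.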


In [138]:
from sklearn.metrics import confusion_matrix, classification_report

In [136]:
print("Confusion Matrix:\n", confusion_matrix(target_test, test_predictions))

Confusion Matrix:
 [[420  34]
 [ 96  93]]
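The metrics reported for the test set follow directly from these four counts. The sketch below recomputes them by hand, reading the matrix in scikit-learn's layout (rows are actual classes, columns are predicted classes); the variable names are my own.

```python
# Counts copied from the confusion matrix above (rows = actual, columns = predicted).
tn, fp = 420, 34   # actual Smart (0): predicted Smart, predicted Ultra
fn, tp = 96, 93    # actual Ultra (1): predicted Smart, predicted Ultra

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted Ultra users, how many are Ultra
recall = tp / (tp + fn)      # of the actual Ultra users, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# prints: accuracy=0.7978 precision=0.7323 recall=0.4921 f1=0.5886
```

The 96 false negatives are Ultra subscribers that the model assigns to Smart, which is why recall for the Ultra class is much lower than its precision.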


In [137]:
print("\nClassification Report:\n", classification_report(target_test, test_predictions))


Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.93      0.87       454
           1       0.73      0.49      0.59       189

    accuracy                           0.80       643
   macro avg       0.77      0.71      0.73       643
weighted avg       0.79      0.80      0.78       643



## Conclusion

In this project, I compared three classification models: a Decision Tree, a Random Forest, and a Logistic Regression.

I applied grid search with 5-fold cross-validation to find the best hyperparameters for each of the three models.

I looked at Accuracy, Precision, Recall, and F1 Score. The project goal was stated in terms of accuracy, and on that metric the Random Forest scored highest on the validation set at 80.2%, followed by the Decision Tree at 78.2%. The Logistic Regression scored lowest at 72.6%.

The Random Forest was selected for the final evaluation because it had the highest validation accuracy. On the test set it reached an accuracy of 79.8%, above the 0.75 target. Because the validation and test accuracies are so close (80.2% vs. 79.8%), the model appears to generalize well to new data rather than overfitting the training and validation data.

The model is likely to perform similarly on real-world data, provided that data resembles this sample. Note, however, that recall for the Ultra class is only 0.49 on the test set, so the model misses about half of the subscribers who are actually on Ultra.