# Predicting Optimal Mobile Plan for Subscribers using Behavior Data

<b>Introduction:</b>
    
In this project, we build a predictive model that recommends a mobile plan ("Smart" or "Ultra") for Megaline's subscribers based on their monthly behavior data. The dataset records each user's calls, call minutes, text messages, internet traffic, and current plan. The goal is a model that predicts whether a subscriber should be on the "Ultra" plan or the "Smart" plan with an accuracy of at least 0.75.

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split



# Data Exploration

In [2]:

# Load the dataset
data = pd.read_csv("/datasets/users_behavior.csv")

# Display the first few rows of the dataset
print(data.head())

# Check the data types and missing values
# (info() prints its report directly and returns None, so it is not wrapped in print)
data.info()

# Statistical summary of the dataset
print(data.describe())



   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
             calls      minutes     messages       mb_used     is_ultra
count  3214.000000  3214.000000  3214.000000   3214.000000  3214.000000
mean     63.038892   438.208787    38.281269  17207.673836     0.306472
std      33.236368   234.569872    36.148326   7570.968246  

Intermediate Conclusion:

The dataset contains 3,214 rows and the five expected columns: calls, minutes, messages, mb_used, and is_ultra. Every column has 3,214 non-null values, so there are no missing values to handle. The target is imbalanced: the mean of is_ultra is about 0.306, so roughly 31% of subscribers are on "Ultra" and 69% are on "Smart".
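The checks described above can be sketched roughly as follows. The `quick_checks` helper is a hypothetical name, not part of the notebook, and the five rows are copied from the `head()` output as a stand-in for the full file, so the class shares it reports describe only that sample.

```python
import pandas as pd

# First five rows copied from the head() output; a stand-in for the full dataset.
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.90, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    "is_ultra": [0, 0, 0, 1, 0],
})


def quick_checks(df, target="is_ultra"):
    """Hypothetical helper: count missing and negative values, and report class shares."""
    features = df.drop(columns=[target])
    return {
        "missing": int(df.isna().sum().sum()),
        "negative": int((features < 0).sum().sum()),
        "class_share": df[target].value_counts(normalize=True).to_dict(),
    }


report = quick_checks(sample)
print(report)
```

Running the same helper on `data` would give the figures for all 3,214 rows.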

# Data Splitting

In [3]:

# Split the data into features and target
X = data.drop('is_ultra', axis=1)
y = data['is_ultra']

# Hold out 20% of the data as the test set
X_train_valid, X_test, y_train_valid, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split the remaining 80% into training and validation sets;
# 25% of that 80% is 20% of the full data, giving a 60/20/20 split
X_train, X_valid, y_train, y_valid = train_test_split(X_train_valid, y_train_valid, test_size=0.25, random_state=42)
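
As a rough check on what the two-step split produces, the arithmetic below works out the subset sizes for the 3,214 rows reported by `info()`. It assumes that `train_test_split` rounds a fractional held-out share up to the next whole row, which is our understanding of scikit-learn's behavior; `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math


def split_sizes(n_samples, test_share=0.2, valid_share_of_rest=0.25):
    """Hypothetical helper: sizes of the train/validation/test subsets."""
    # Assumption: the held-out share is rounded up to a whole number of rows.
    n_test = math.ceil(test_share * n_samples)
    n_rest = n_samples - n_test
    n_valid = math.ceil(valid_share_of_rest * n_rest)
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test


n_train, n_valid, n_test = split_sizes(3214)
print(n_train, n_valid, n_test)  # 1928 643 643
```

Under that assumption the split is close to 60/20/20. Passing `stratify=y` to `train_test_split` would additionally keep the Smart/Ultra ratio the same in every subset, which the notebook does not do.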


# Model Training and Hyperparameter Tuning

In [4]:

# Define the model
rf = RandomForestClassifier(random_state=42)

# Define the hyperparameters to tune
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Perform Grid Search
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best model
best_rf = grid_search.best_estimator_

# Evaluate on the validation set
y_pred_valid = best_rf.predict(X_valid)
print(f'Validation Accuracy: {accuracy_score(y_valid, y_pred_valid)}')


Validation Accuracy: 0.8087091757387247
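
To get a feel for the cost of the search above, the grid can be enumerated by hand. This is only a sketch of the counting, not of how `GridSearchCV` is implemented; it assumes the default `refit=True`, which adds one final fit of the best candidate on all of `X_train`.

```python
from itertools import product

# Same grid and fold count as in the notebook cell above.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
}
cv_folds = 3

# Every combination of one value per hyperparameter is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per fold, plus one refit of the winner.
n_fits = len(candidates) * cv_folds + 1

print(len(candidates), n_fits)  # 36 109
```

The chosen combination itself can be read from `grid_search.best_params_`, and its mean cross-validated accuracy from `grid_search.best_score_`.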


Intermediate Conclusion:

The Random Forest model with tuned hyperparameters reached an accuracy of 80.87% on the validation set, above the project’s 75% threshold. Because about 69% of subscribers are on "Smart" (the mean of is_ultra is 0.306), accuracy alone does not show how well the minority "Ultra" class is recognized, so the figure is best read against that class balance.

# Model Evaluation on Test Set

In [5]:
# Evaluate on the test set
y_pred_test = best_rf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred_test)
print(f'Test Accuracy: {test_accuracy}')


Test Accuracy: 0.8118195956454122


Results:

Test Accuracy: The model achieved an accuracy of approximately 81.18% on the test set.
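
A single accuracy figure on a few hundred rows carries sampling noise. The sketch below puts an approximate 95% interval around the test accuracy using the normal approximation to a binomial proportion. The test-set size of 643 is our assumption, derived from a 20% hold-out of 3,214 rows, and `accuracy_interval` is a hypothetical helper.

```python
import math


def accuracy_interval(accuracy, n_samples, z=1.96):
    """Hypothetical helper: normal-approximation interval for an accuracy estimate."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n_samples)
    return accuracy - half_width, accuracy + half_width


# Test accuracy from the output above; n_samples=643 is an assumed test-set size.
low, high = accuracy_interval(0.8118195956454122, 643)
print(f"{low:.3f} to {high:.3f}")
```

With these inputs the interval runs from roughly 0.78 to 0.84, so the lower end still sits above the 0.75 threshold.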

Intermediate Conclusion:

Performance Consistency: The test accuracy (81.18%) is within about 0.3 percentage points of the validation accuracy (80.87%). This suggests that the tuned model generalizes to unseen data and that the hyperparameter choice did not overfit the validation split.

Threshold Achievement: The test accuracy comfortably exceeds the project's accuracy threshold of 75%, validating the effectiveness of our model in predicting the optimal mobile plan for subscribers.
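
One sanity check on the threshold claim is to compare against a constant classifier that always predicts the majority plan. A minimal sketch, assuming the class share from `describe()` (mean of `is_ultra` = 0.306472 over all rows) carries over roughly to the test split, which the notebook does not verify:

```python
# Share of "Ultra" subscribers, taken from the describe() output (all 3,214 rows).
ultra_share = 0.306472

# A constant model that always predicts the majority class is right this often.
baseline_accuracy = max(ultra_share, 1 - ultra_share)

# Test accuracy reported above.
test_accuracy = 0.8118195956454122
lift = test_accuracy - baseline_accuracy

print(f"baseline: {baseline_accuracy:.4f}, lift over baseline: {lift:.4f}")
```

On these figures the always-"Smart" baseline scores about 0.69, below the 0.75 threshold, and the Random Forest improves on it by roughly 12 percentage points.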

# Sanity Check and Final Assessment

In [6]:
# Feature importance for Random Forest
importances = best_rf.feature_importances_
feature_names = X.columns
feature_importance = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(feature_importance)


mb_used     0.322994
calls       0.242669
minutes     0.240264
messages    0.194073
dtype: float64


Intermediate Conclusion:

Interpretation of Feature Importances:

mb_used (Internet Traffic):

Importance: 32.30%
Conclusion: Internet traffic is the single most important predictor of the plan. The amount of data used contributes more to the forest's splits than any other feature.

calls (Number of Calls):

Importance: 24.27%
Conclusion: The number of calls made by a subscriber is the second most important feature. It is less influential than data usage but still carries about a quarter of the total importance.

minutes (Total Call Duration):

Importance: 24.03%
Conclusion: Total call duration is on par with the number of calls. Both the frequency and the duration of calls contribute to the prediction.

messages (Number of Text Messages):

Importance: 19.41%
Conclusion: The number of text messages is the least important feature but still contributes to the model's predictions. Text message usage matters less than data usage and call metrics.

Overall Conclusion:

Feature Significance: The model places the highest importance on internet traffic (mb_used), followed by the call metrics (calls and minutes), and then text messages (messages). These scores show how much the forest relies on each feature, not whether higher usage pushes a prediction toward "Smart" or "Ultra".

Insights for Business: Understanding feature importance can help Megaline tailor plans and marketing to the most influential factors. For instance, focusing on data-heavy users might be a key strategy for promoting the "Ultra" plan.
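
Because calls and minutes both describe voice usage, it can help to view the importances by usage type as well as per feature. The sketch below sums the values printed above. The group names (`internet`, `voice`, `sms`) are our own labels, and adding up impurity-based importances is only a rough aggregate, not a substitute for refitting the model or running permutation tests.

```python
# Feature importances copied from the output above.
importances = {
    "mb_used": 0.322994,
    "calls": 0.242669,
    "minutes": 0.240264,
    "messages": 0.194073,
}

# Hypothetical grouping of features by usage type.
groups = {
    "internet": ["mb_used"],
    "voice": ["calls", "minutes"],
    "sms": ["messages"],
}

grouped = {
    name: sum(importances[feature] for feature in features)
    for name, features in groups.items()
}

for name, value in sorted(grouped.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.6f}")
```

On these numbers the two voice features together (about 0.48) outweigh mb_used (about 0.32), so "data usage matters most" holds per feature but not per usage type.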

# Final Conclusion

Model Performance and Reliability:

Our RandomForestClassifier has demonstrated strong performance with an accuracy of approximately 81.18% on the test set, surpassing the project’s accuracy threshold of 75%. This indicates that the model is effective in predicting whether subscribers should switch to the "Ultra" plan or remain on the "Smart" plan. The consistency of performance between the validation and test sets further supports the model's robustness and generalizability.

Feature Importance Insights:

Internet Traffic Usage (mb_used):

Importance: 32.30%
Significance: This feature is the most influential in predicting the plan. High data usage plausibly indicates that a subscriber would benefit from the "Ultra" plan, which likely offers more data, although the importance score itself does not show the direction of the effect.

Number of Calls (calls):

Importance: 24.27%
Significance: The frequency of calls is a significant predictor, though less so than data usage. This suggests that frequent callers might also be more inclined toward a plan with higher allowances.

Total Call Duration (minutes):

Importance: 24.03%
Significance: This feature is nearly as important as the number of calls (24.03% versus 24.27%). Total call duration complements call frequency in determining the appropriate plan.

Number of Text Messages (messages):

Importance: 19.41%
Significance: While less critical compared to data and call metrics, text message usage still impacts the plan choice. Subscribers with high text usage might also consider a plan that offers more messaging benefits.

Business Implications:

Plan Optimization: Megaline should consider emphasizing data usage in their marketing and plan design. Since internet traffic has the highest importance, offering competitive data packages and promotions could attract more subscribers to the "Ultra" plan.

Targeted Marketing: For subscribers who frequently use calls and have high total call durations, Megaline could tailor offers that highlight the advantages of the "Ultra" plan in terms of call benefits.

Strategic Focus: While text message usage is less critical, understanding that it still plays a role in decision-making can help in crafting comprehensive plans that cater to a variety of usage patterns.

Overall Summary:

The tuned RandomForestClassifier meets the project's accuracy requirement, and its feature importances indicate which usage patterns drive its recommendations. By focusing on the most influential features, particularly data usage, Megaline can better align its offerings with subscriber needs and direct its marketing toward the most suitable plans.