### Introduction

This project applies machine learning models to analyze the behavior of mobile subscribers. The company Megaline found that many subscribers use legacy plans. They want to recommend one of two new plans to subscribers: Smart or Ultra, based on the behavior of subscribers who have already started using either Smart or Ultra. The goal is to develop a model with an accuracy greater than or equal to 0.75.

**Data Description**

Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:

- `calls` — number of calls,
- `minutes` — total call duration in minutes,
- `messages` — number of text messages,
- `mb_used` — Internet traffic used in MB,
- `is_ultra` — plan for the current month (Ultra - 1, Smart - 0).


#### Open and look through the data file

In [19]:
# load libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import sklearn



In [2]:
# load data

try:
    users_df = pd.read_csv('users_behavior.csv')
except FileNotFoundError:
    users_df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
# Examine a few rows

users_df.sample(5)

      calls  minutes  messages   mb_used  is_ultra
2014   74.0   456.81      27.0  20700.46         0
845   118.0    799.4       0.0  12828.73         1
962    35.0   236.72      72.0  13380.67         0
1559  106.0   781.37      26.0  37962.31         1
1116  101.0    694.2      41.0   9351.78         0


In [4]:
# Examine dataset

users_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


I think it is appropriate to change the columns `calls` and `messages` to int64 because decimals do not make sense here. I would change the column `is_ultra` to category dtype if I were doing general data analysis. Since I am working with machine learning, I will keep that column as it is.

In [5]:
# changing float64 to int64
users_df['calls'] = users_df['calls'].astype('int64')
users_df['messages'] = users_df['messages'].astype('int64')
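
Casting with `astype('int64')` silently drops any fractional part, so it can be worth confirming first that a column holds only whole numbers. Below is a minimal sketch of that check on plain Python lists. The helper name `all_whole_numbers` is a hypothetical illustration and not part of the original notebook; the sample values are taken from the rows displayed earlier.

```python
def all_whole_numbers(values):
    """Return True if every value has no fractional part."""
    return all(float(v).is_integer() for v in values)


# Values in the style of the `calls` and `minutes` columns (from the sampled rows)
calls_sample = [74.0, 118.0, 35.0, 106.0, 101.0]
minutes_sample = [456.81, 799.4, 236.72]

calls_ok = all_whole_numbers(calls_sample)      # whole numbers: safe to cast
minutes_ok = all_whole_numbers(minutes_sample)  # fractions: casting would lose data
print(calls_ok, minutes_ok)
```

On the DataFrame itself, an equivalent check would be `(users_df['calls'] % 1 == 0).all()`.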

In [6]:
# check missing values

users_df.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

No missing values.

In [7]:
# check duplicates

users_df.duplicated().sum()

0

No duplicates.

In [8]:
# create features and target

features = users_df.drop(['is_ultra'], axis=1)
target = users_df['is_ultra'] 

#### Split the source data into a training set, a validation set, and a test set

In [9]:
# Splitting the data.
# Before splitting the data, let's check the proportion of each target class.

proportion = users_df['is_ultra'].value_counts(normalize=True)*100
print(proportion.apply(lambda x: f"{x:.2f} %"))

is_ultra
0    69.35 %
1    30.65 %
Name: proportion, dtype: object


The target is clearly imbalanced: about 69% of users are on the Smart plan and about 31% on Ultra. To keep every subset representative of the original data, we need to preserve this class ratio when splitting.
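
To illustrate what preserving the class ratio means, here is a minimal pure-Python sketch of a stratified split. It is not how scikit-learn implements `train_test_split`; the function name `stratified_split` and the toy labels are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=123):
    """Split row indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class separately
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy target with a 70/30 imbalance (illustrative, not the real data)
labels = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

share_of_ones_train = sum(labels[i] for i in train_idx) / len(train_idx)
share_of_ones_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), share_of_ones_train, share_of_ones_test)
```

Because the sampling is done per class, both parts keep the same share of `1`s as the full toy target, which is the property that `stratify=target` provides in the notebook.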

In [10]:
# First split: keep 20 % of the original data as test data.
# This set will be used only for the final evaluation.
# The remaining 80 % (features_tr_val, target_tr_val) will be used for training and validation.
# To preserve the proportion of `0`s and `1`s in the test set, I use `stratify=target`.

features_tr_val, features_test, target_tr_val, target_test = train_test_split(
    features, target, test_size = 0.2, random_state = 123, stratify = target)



In [11]:
# Second split: divide the remaining 80 % into training (60 % of the original data) and validation (20 % of the original data).
# `features_tr` and `target_tr` will be used for model training.
# `features_val` and `target_val` will be used for hyperparameter tuning (validation).

features_tr, features_val, target_tr, target_val = train_test_split(
    features_tr_val, target_tr_val, test_size = 0.25, random_state = 123, stratify = target_tr_val)

# test_size = 0.25 because the validation set is 20 % of the original data, i.e. 20 / 80 = 25 % of the remaining data.
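
The arithmetic behind the two `test_size` values can be checked with exact fractions. This is a small sketch; the variable names are illustrative and the row counts it prints are approximate, since the actual rounding is left to `train_test_split`.

```python
from fractions import Fraction

total_rows = 3214  # row count reported by users_df.info()

test_share = Fraction(1, 5)              # first split: test_size = 0.2
remaining = 1 - test_share               # 80 % left for training + validation
val_share = remaining * Fraction(1, 4)   # second split: test_size = 0.25
train_share = remaining - val_share

for name, share in [("train", train_share), ("validation", val_share), ("test", test_share)]:
    print(f"{name}: {float(share):.0%} of the data, roughly {float(share) * total_rows:.0f} rows")
```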



#### Investigate the quality of different models by changing hyperparameters

In [12]:
# Let's work on the Decision Tree classifier

# First, start with the default model

default_model_tree = DecisionTreeClassifier(random_state = 123)
default_model_tree.fit(features_tr, target_tr)
predictions_default_tree = default_model_tree.predict(features_val)

# Print accuracy for the default Decision Tree model
print("Default Decision Tree Accuracy:", round(accuracy_score(target_val, predictions_default_tree)*100, 2), '%')




Default Decision Tree Accuracy: 72.47 %


In [13]:
# Hyperparameter tuning: Decision Tree with different max_depth values
for depth in range(1, 15):  # Testing max_depth from 1 to 14
    tuned_model_tree = DecisionTreeClassifier(random_state=123, max_depth = depth)
    tuned_model_tree.fit(features_tr, target_tr)
    predictions_tuned_tree = tuned_model_tree.predict(features_val)
    print("max_depth =", depth, ": ", end='')
    print(round(accuracy_score(target_val, predictions_tuned_tree)*100, 2), '%')

max_depth = 1 : 75.58 %
max_depth = 2 : 77.29 %
max_depth = 3 : 78.54 %
max_depth = 4 : 78.54 %
max_depth = 5 : 78.38 %
max_depth = 6 : 77.76 %
max_depth = 7 : 78.85 %
max_depth = 8 : 78.54 %
max_depth = 9 : 79.32 %
max_depth = 10 : 78.07 %
max_depth = 11 : 77.14 %
max_depth = 12 : 77.29 %
max_depth = 13 : 75.27 %
max_depth = 14 : 75.12 %


The best validation accuracy for the Decision Tree, 79.32%, was reached at max_depth = 9. Beyond that depth the validation accuracy declines (down to 75.12% at depth 14), which suggests overfitting, while very shallow trees (depth 1 or 2) score lower, which suggests underfitting. The default model places no limit on depth and reached only 72.47%, which points to overfitting rather than underfitting. Depths 3 through 9 all fall within about one percentage point of each other, so the choice of 9 is specific to this split and to random_state = 123.
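
As a small sketch of how the depth could be picked programmatically, the snippet below stores the validation accuracies printed by the loop above in a dictionary and selects the best one. It also lists the depths that come close to the best; the 1-point tolerance is an arbitrary illustrative choice, not something used in the notebook.

```python
# Validation accuracies (%) copied from the max_depth loop output
val_accuracy_by_depth = {
    1: 75.58, 2: 77.29, 3: 78.54, 4: 78.54, 5: 78.38, 6: 77.76, 7: 78.85,
    8: 78.54, 9: 79.32, 10: 78.07, 11: 77.14, 12: 77.29, 13: 75.27, 14: 75.12,
}

# Depth with the highest validation accuracy
best_depth = max(val_accuracy_by_depth, key=val_accuracy_by_depth.get)
best_accuracy = val_accuracy_by_depth[best_depth]

# Depths within 1 percentage point of the best (tolerance is an assumption)
near_best = [d for d, acc in val_accuracy_by_depth.items() if best_accuracy - acc <= 1.0]

print("best depth:", best_depth, "accuracy:", best_accuracy)
print("depths within 1 point of the best:", near_best)
```

When several depths perform almost equally, a shallower tree may be a reasonable choice because it is simpler and less likely to overfit.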

In [14]:
# Let's focus on the Random Forest

# Default Random Forest Classifier 
default_model_rf = RandomForestClassifier(random_state = 123)

# Train on training data
default_model_rf.fit(features_tr, target_tr)

# Predict on validation data
predictions_rf_default = default_model_rf.predict(features_val)

# Print accuracy for the default model
print("Default Random Forest Accuracy:", round(accuracy_score(target_val, predictions_rf_default)*100, 2), '%')

Default Random Forest Accuracy: 79.63 %


In [18]:
# Hyperparameter tuning: Random Forest with different n_estimators (number of trees)
best_score_rf = 0
best_estimators_rf = 0

for est in range(1, 100):  # Test n_estimators from 1 to 99
    model_rf = RandomForestClassifier(random_state = 123, n_estimators = est)
    model_rf.fit(features_tr, target_tr)  # Train on training data
    score_rf = model_rf.score(features_val, target_val)  # Evaluate on validation set
    if score_rf > best_score_rf:
        best_score_rf = score_rf  # Update best score
        best_estimators_rf = est  # Save the number of estimators for the best score

# Output the best model's accuracy and n_estimators
print(f"Accuracy of the best Random Forest model on the validation set (n_estimators = {best_estimators_rf}): {round(best_score_rf * 100, 2)} %")


Accuracy of the best Random Forest model on the validation set (n_estimators = 42): 80.4 %


The best validation accuracy, about 80.4%, was obtained with n_estimators = 42, compared with 79.63% for the default parameters. Forests with very few trees tend to give noisier, less accurate predictions. Adding trees beyond 42 did not improve the validation score here, although more trees generally do not make a Random Forest overfit, so the gain over the default model is small. The result is specific to random_state = 123.
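
To judge whether a gap such as 80.4% versus 79.63% is meaningful, a rough gauge is the binomial standard error of an accuracy measured on a finite validation set. The sketch below assumes a validation set of 643 rows (20% of 3214, rounded up); that size and the helper name `accuracy_standard_error` are assumptions, and the comparison ignores that both models were scored on the same rows.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimated on n_samples rows."""
    return math.sqrt(accuracy * (1 - accuracy) / n_samples)


n_val = 643           # assumed validation size: 20 % of 3214 rows, rounded up
best_acc = 0.804      # validation accuracy with n_estimators = 42
default_acc = 0.7963  # validation accuracy with default parameters

se = accuracy_standard_error(best_acc, n_val)
gap = best_acc - default_acc
print(f"standard error: {se * 100:.2f} points, gap: {gap * 100:.2f} points")
```

Under these assumptions the gap is smaller than one standard error, so the tuned and default forests are hard to tell apart on this validation set.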

In [15]:
# Logistic regression model

# Initialize Logistic Regression model with solver='liblinear'
model = LogisticRegression(random_state=123, solver='liblinear')

# Train the model using the training set
model.fit(features_tr, target_tr)

# Calculate accuracy on the training set
score_train = model.score(features_tr, target_tr)

# Calculate accuracy on the validation set
score_valid = model.score(features_val, target_val)

# Print the accuracy of the model on both the training and validation sets
print(
    "Accuracy of the logistic regression model on the training set:",
    round(score_train * 100 ,2), '%'
)
print(
    "Accuracy of the logistic regression model on the validation set:",
    round(score_valid * 100, 2), '%'
)

Accuracy of the logistic regression model on the training set: 74.48 %
Accuracy of the logistic regression model on the validation set: 75.0 %


Logistic regression reached about 75% on the validation set, which is higher than the default Decision Tree (72.47%) but lower than both the tuned Decision Tree (79.32%) and the Random Forest. Although logistic regression trains faster, it is roughly five percentage points less accurate than the Random Forest, whose 80.4% at n_estimators = 42 is the highest validation accuracy obtained with random_state = 123.

#### Check the quality of the model using the test set

In [16]:
# Best Random Forest Classifier with n_estimators=42 and random_state=123
best_model_rf = RandomForestClassifier(random_state=123, n_estimators=42)

# Train on training data
best_model_rf.fit(features_tr, target_tr)

# Predict on the test set
predictions_rf_test = best_model_rf.predict(features_test)

# Calculate accuracy of the model on the test set
model_rf_accuracy = accuracy_score(target_test, predictions_rf_test)

# Print accuracy for the best Random Forest model on the test set
print("Best Random Forest Accuracy on the test set (n_estimators=42):", round(model_rf_accuracy * 100, 2), "%")


Best Random Forest Accuracy on the test set (n_estimators=42): 79.94 %


The accuracy of the model on the test data is 79.94%, which is slightly lower than the accuracy on the validation set (80.40%) but still above the 75% target.

#### Sanity check the model

In our case, the target variable `is_ultra` has two values: 0 for the Smart plan and 1 for the Ultra plan. We found that 69.35% of the target values belong to the Smart plan. This means that if we always guessed Smart (value 0), we would be correct about 69.35% of the time. Our best model achieves 79.94% accuracy on the test data, roughly ten percentage points above this baseline. Therefore, we can say that the model performs meaningfully better than a constant guess.
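
The majority-class baseline used in this sanity check can be sketched in a few lines. The helper name `majority_baseline_accuracy` and the toy labels are illustrative assumptions; scikit-learn's `DummyClassifier(strategy="most_frequent")` would be one way to obtain the same kind of baseline on the real data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy labels with roughly the same imbalance as `is_ultra` (illustrative only)
toy_labels = [0] * 69 + [1] * 31
baseline = majority_baseline_accuracy(toy_labels)

model_test_accuracy = 0.7994  # test accuracy reported for the Random Forest
improvement = model_test_accuracy - baseline
print(f"baseline: {baseline:.2%}, model: {model_test_accuracy:.2%}, gain: {improvement:.2%}")
```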

#### Conclusion
In this project, I applied machine learning models to the behavior of Megaline mobile subscribers in order to recommend either the Smart or the Ultra plan. I checked the data for missing values and duplicates and found it clean. Then I split the data into training (60%), validation (20%) and test (20%) sets. Since the target classes are imbalanced, I used stratified splitting to keep the same class ratio in every set.

I compared Decision Tree, Random Forest and Logistic Regression models, tuning hyperparameters for the tree-based ones. The Random Forest model with 42 estimators and a random state of 123 had the highest accuracy on the validation set, 80.4%. Keep in mind that the results might differ slightly with a different random state. The sanity check showed that the model clearly beats the 69.35% accuracy of always predicting the Smart plan. Finally, the best Random Forest model achieved 79.94% accuracy on the test set, which is above the threshold of 0.75 (75%). Therefore, the model is effective in recommending either the Smart or Ultra plan to new customers.