# Sprint 7: Intro to Machine Learning Project

# Project Description

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra. 

I have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, I will need to develop a model that will pick the right plan. 

My goal is a model with the highest possible accuracy. The project's accuracy threshold is 0.75, and I will check the final model against it using the test dataset.

## Initializing 

Import Libraries and Data

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [2]:
# Importing data
df = pd.read_csv('/datasets/users_behavior.csv')

## Data Review and Preprocessing

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
df.head(10)

|   | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| 5 | 58.0 | 344.56 | 21.0 | 15823.37 | 0 |
| 6 | 57.0 | 431.64 | 20.0 | 3738.9 | 1 |
| 7 | 15.0 | 132.4 | 6.0 | 21911.6 | 0 |
| 8 | 7.0 | 43.39 | 3.0 | 2538.67 | 1 |
| 9 | 90.0 | 665.41 | 38.0 | 17358.61 | 0 |


In [5]:
df.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

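**The data looks clean: there are no missing values. Since 'calls' and 'messages' are counts, I will convert them to an integer data type, which better reflects what they represent. This is mainly for tidiness: both types use 8 bytes per value, so I do not expect a noticeable change in memory use or run speed.**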
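Casting with `astype('int64')` silently drops any fractional part, so it can be worth confirming first that the columns really hold whole numbers. The following is a minimal sketch, assuming a small made-up sample in place of the real data; `to_int_if_whole` is a hypothetical helper, not part of pandas.

```python
import pandas as pd


def to_int_if_whole(frame, columns):
    """Cast the given columns to int64, but only if every value is a whole number."""
    result = frame.copy()
    for col in columns:
        # value % 1 is 0 only for whole numbers (NaN also fails this check)
        if not (result[col] % 1 == 0).all():
            raise ValueError(f"column {col!r} has fractional or missing values; refusing to cast")
        result[col] = result[col].astype("int64")
    return result


# Made-up sample rows shaped like the first rows of the dataset
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0],
    "minutes": [311.9, 516.75, 467.66],
    "messages": [83.0, 56.0, 86.0],
})

converted = to_int_if_whole(sample, ["calls", "messages"])
print(converted.dtypes)
```

Calling the helper on `minutes` would raise a `ValueError` instead of truncating the values.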

In [6]:
df['calls'] = df['calls'].astype('int64')
df['messages'] = df['messages'].astype('int64')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   int64  
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   int64  
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 125.7 KB


## Splitting the data into a training set, validation set and a test set.

In [7]:
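# Shuffle the rows, then cut at the 60% and 80% marks (60/20/20 split)
shuffled = df.sample(frac=1, random_state=42)
train_end, validate_end = int(.6 * len(df)), int(.8 * len(df))
train = shuffled.iloc[:train_end]
validate = shuffled.iloc[train_end:validate_end]
test = shuffled.iloc[validate_end:]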

train_features = train.drop(['is_ultra'], axis=1)
train_target = train['is_ultra']

validate_features = validate.drop(['is_ultra'], axis=1)
validate_target = validate['is_ultra']

test_features = test.drop(['is_ultra'], axis=1)
test_target = test['is_ultra']

I shuffled the rows with `df.sample(frac=1, random_state=42)` and cut the shuffled frame at the 60% and 80% marks, giving 60% training, 20% validation, and 20% test data (a 3:1:1 ratio). With 3214 rows, that is 1928 training rows, 643 validation rows, and 643 test rows. The same proportions can be reached with two successive splits: set aside 20% first, then take 25% of the remaining 80%, since 0.8 * 0.25 = 0.2 of the full dataset.
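The cut-point arithmetic behind a 60/20/20 split can be checked without pandas. This is a small sketch using only the standard library; `three_way_split` is a hypothetical helper, and its shuffle order will not match the one produced by `df.sample`, only the resulting sizes.

```python
import random


def three_way_split(rows, cuts=(0.6, 0.8), seed=42):
    """Shuffle a copy of rows and cut it into train / validation / test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    train_end = int(cuts[0] * len(shuffled))
    valid_end = int(cuts[1] * len(shuffled))
    return shuffled[:train_end], shuffled[train_end:valid_end], shuffled[valid_end:]


# 3214 is the number of rows reported by df.info()
train_rows, valid_rows, test_rows = three_way_split(range(3214))
print(len(train_rows), len(valid_rows), len(test_rows))  # prints: 1928 643 643
```

The 643 test rows agree with the 444 + 199 class counts shown in the sanity check.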


## Investigating Different Models

In [71]:
# Declare variables for features and targets
# (alternative names for the splits created above; the cells below use train_features, validate_features, etc.)

features_train = train.drop(['is_ultra'], axis=1)
target_train = train['is_ultra']
features_valid = validate.drop(['is_ultra'], axis=1)
target_valid = validate['is_ultra']
features_test = test.drop(['is_ultra'], axis=1)
target_test = test['is_ultra']

**Decision Tree**

In [72]:
# No hyperparameter tuning
dt_model = DecisionTreeClassifier()
dt_model.fit(train_features, train_target)

dt_predict_train = dt_model.predict(train_features)
dt_predict_valid = dt_model.predict(validate_features)

print("Model accuracy score using train data is:", accuracy_score(train_target, dt_predict_train) * 100)
print("Model accuracy score using validation data is:", accuracy_score(validate_target, dt_predict_valid) * 100)

Model accuracy score using train data is: 100.0
Model accuracy score using validation data is: 72.93934681181959


In [73]:
# With hyperparameter tuning
for depth in range(1, 11):
    dtree_model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    dtree_model.fit(train_features, train_target)
    
    predict_train = dtree_model.predict(train_features) 
    predict_valid = dtree_model.predict(validate_features) 
    
    acc_train = accuracy_score(train_target, predict_train) * 100
    acc_valid = accuracy_score(validate_target, predict_valid) * 100
        
    print('max_depth =', depth, ': ', end='')
    print(f'Train data accuracy is {acc_train} and validation data accuracy is {acc_valid}')

max_depth = 1 : Train data accuracy is 75.4149377593361 and validation data accuracy is 72.31726283048211
max_depth = 2 : Train data accuracy is 78.7344398340249 and validation data accuracy is 76.98289269051321
max_depth = 3 : Train data accuracy is 79.92738589211619 and validation data accuracy is 78.53810264385692
max_depth = 4 : Train data accuracy is 79.9792531120332 and validation data accuracy is 78.53810264385692
max_depth = 5 : Train data accuracy is 81.58713692946058 and validation data accuracy is 78.53810264385692
max_depth = 6 : Train data accuracy is 82.88381742738589 and validation data accuracy is 78.69362363919129
max_depth = 7 : Train data accuracy is 83.97302904564316 and validation data accuracy is 78.0715396578538
max_depth = 8 : Train data accuracy is 85.47717842323651 and validation data accuracy is 78.53810264385692
max_depth = 9 : Train data accuracy is 86.35892116182573 and validation data accuracy is 79.00466562986003
max_depth = 10 : Train data accuracy is 8

Without tuning, the decision tree scores 100% on the training data but only 72.94% on the validation data. A gap this large suggests overfitting, which is why I tuned max_depth.

Limiting the depth narrows the gap. The best validation accuracy, 79.16%, comes at a max_depth of 10, where the training accuracy is 87.65%. Since 10 is the largest depth I tried, deeper trees were not evaluated.
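Picking the depth by validation accuracy, rather than training accuracy, can be made explicit in a few lines. The sketch below reuses the accuracies printed by the sweep, rounded to two decimals; the depth-10 figures are taken from the summary above because that output line is cut off.

```python
# (max_depth, train accuracy %, validation accuracy %), rounded from the sweep output
results = [
    (1, 75.41, 72.32),
    (2, 78.73, 76.98),
    (3, 79.93, 78.54),
    (4, 79.98, 78.54),
    (5, 81.59, 78.54),
    (6, 82.88, 78.69),
    (7, 83.97, 78.07),
    (8, 85.48, 78.54),
    (9, 86.36, 79.00),
    (10, 87.65, 79.16),
]

# Choose the depth with the best validation accuracy
best_depth, best_train, best_valid = max(results, key=lambda row: row[2])

# The remaining train/validation gap hints at how much the model still overfits
gap = best_train - best_valid
print(best_depth, best_valid, round(gap, 2))  # prints: 10 79.16 8.49
```

Training accuracy keeps rising with depth, while validation accuracy stays within about one point from depth 3 onward, so the extra depth mostly fits the training data more closely.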

**Random Forest**

In [74]:
# No hyperparameter tuning
rforest = RandomForestClassifier() 
rforest.fit(train_features, train_target) 

rf_predict_train = rforest.predict(train_features)
rf_predict_valid = rforest.predict(validate_features)

print("Model accuracy score using train data is:", accuracy_score(train_target, rf_predict_train) * 100)
print("Model accuracy score using validation data is:", accuracy_score(validate_target, rf_predict_valid) * 100)

Model accuracy score using train data is: 100.0
Model accuracy score using validation data is: 80.40435458786936


In [75]:
# With hyperparameter tuning

for est in range(10, 101, 10):
    rf_model = RandomForestClassifier(random_state=54321, max_depth=10, n_estimators=est)
    rf_model.fit(train_features, train_target) 

    rf_predict_train_tuned = rf_model.predict(train_features) 
    rf_predict_valid_tuned = rf_model.predict(validate_features) 

    acc_train = accuracy_score(train_target, rf_predict_train_tuned) * 100
    acc_valid = accuracy_score(validate_target, rf_predict_valid_tuned) * 100

    print('n_estimators =', est, ': ', end='')
    # Format to two decimals in the f-string, which works for both NumPy and built-in floats
    print(f'Train data accuracy is {acc_train:.2f} and validation data accuracy is {acc_valid:.2f}')

n_estimators = 10 : Train data accuracy is 88.95 and validation data accuracy is 80.25
n_estimators = 20 : Train data accuracy is 88.95 and validation data accuracy is 80.56
n_estimators = 30 : Train data accuracy is 89.16 and validation data accuracy is 80.72
n_estimators = 40 : Train data accuracy is 89.26 and validation data accuracy is 80.09
n_estimators = 50 : Train data accuracy is 89.21 and validation data accuracy is 79.78
n_estimators = 60 : Train data accuracy is 89.47 and validation data accuracy is 80.09
n_estimators = 70 : Train data accuracy is 89.26 and validation data accuracy is 80.25
n_estimators = 80 : Train data accuracy is 89.26 and validation data accuracy is 80.56
n_estimators = 90 : Train data accuracy is 89.21 and validation data accuracy is 80.25
n_estimators = 100 : Train data accuracy is 89.16 and validation data accuracy is 80.25


The untuned random forest also shows a large gap between training accuracy (100%) and validation accuracy (80.40%), which indicates overfitting. To address this, I applied hyperparameter tuning. Because a max_depth of 10 gave the best validation accuracy for the decision tree, I kept max_depth at 10 here and tuned n_estimators to find a suitable number of trees.
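The sweep above varies only n_estimators with max_depth fixed at 10. A possible extension is to tune both together, since the best depth for a single tree is not necessarily the best depth for a forest. The following is only a sketch under stated assumptions: it runs on synthetic data from `make_classification` rather than the Megaline data, and the grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 4 numeric features, binary target
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

best = {"accuracy": -1.0, "max_depth": None, "n_estimators": None}
for depth in (2, 4, 6, 8, 10):
    for n_trees in (10, 30, 50):
        model = RandomForestClassifier(
            max_depth=depth, n_estimators=n_trees, random_state=0
        )
        model.fit(X_train, y_train)
        # score() returns the accuracy on the given data
        accuracy = model.score(X_valid, y_valid)
        if accuracy > best["accuracy"]:
            best = {"accuracy": accuracy, "max_depth": depth, "n_estimators": n_trees}

print(best)
```

On the real data, the same loop would use train_features, train_target, validate_features, and validate_target in place of the synthetic arrays.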

**Logistic Regression**

In [76]:
# Create a logistic regression model
lr_model = LogisticRegression(random_state=54321, solver='liblinear') 
lr_model.fit(train_features, train_target) 
score_train = lr_model.score(train_features, train_target) 
score_valid = lr_model.score(validate_features, validate_target) 

print("Accuracy of logistic regression model based on training set:", score_train)
print("Accuracy of logistic regression model based on validation set:", score_valid)

Accuracy of logistic regression model based on training set: 0.7012448132780082
Accuracy of logistic regression model based on validation set: 0.7045101088646968


Unlike the previous two models, logistic regression has no tree depth or number-of-trees hyperparameters to set. I only set the solver parameter, using liblinear, and set random_state to 54321. At about 70% on both the training and validation sets, its accuracy is below the 0.75 threshold.

**The random forest model achieved the highest validation accuracy, 80.72%, using 30 trees and a max_depth of 10.**

**Given these findings, I chose the random forest model. It trains 30 trees instead of the decision tree's one, but the extra cost is manageable and the validation accuracy improves from 79.16% to 80.72%.**

## Testing the Model


In [77]:
# Create the final model
final_model = RandomForestClassifier(random_state=54321, max_depth=10, n_estimators=30)
final_model.fit(train_features, train_target)

RandomForestClassifier(max_depth=10, n_estimators=30, random_state=54321)

In [78]:
# Make predictions on the test_features dataset
test_predictions = final_model.predict(test_features)

#Test accuracy
print('Accuracy score:', accuracy_score(test_target, test_predictions))

Accuracy score: 0.8102643856920684


**The test accuracy of 81.03% is slightly higher than the validation accuracy of 80.72% and well above the 0.75 threshold, so the model is performing well on unseen data.**

## Sanity Check

In [79]:
# Counts the number of Smart (0) vs Ultra (1) on test_target
test_target.value_counts()

0    444
1    199
Name: is_ultra, dtype: int64

In [80]:
# Percentage of users
test_target.value_counts() / test_target.shape[0] * 100

0    69.051322
1    30.948678
Name: is_ultra, dtype: float64

The model achieved an accuracy of 81.03%, about 12 percentage points higher than the 69.05% that would result from always predicting the majority class (Smart). This indicates that the model is learning from the features rather than just guessing the most common plan.

However, there is a class imbalance: 69.05% of the customers in test_target are on the Smart plan. Even so, the model clearly beats the majority-class baseline.
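The comparison with the majority class can be reproduced from the numbers already printed. This short sketch uses only the class counts from `test_target.value_counts()` and the reported test accuracy.

```python
# Class counts from test_target.value_counts() and the reported test accuracy
smart_count, ultra_count = 444, 199
model_accuracy = 0.8102643856920684

total = smart_count + ultra_count

# Accuracy of a "model" that always predicts the most common class
baseline_accuracy = max(smart_count, ultra_count) / total

# Improvement over the baseline, in percentage points
lift_points = (model_accuracy - baseline_accuracy) * 100

print(f"baseline: {baseline_accuracy:.4f}, model: {model_accuracy:.4f}, lift: {lift_points:.2f} points")
# prints: baseline: 0.6905, model: 0.8103, lift: 11.98 points
```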

In [81]:
# Predicted share (%) of Smart (0) and Ultra (1) subscribers generated by the model:

test_prediction_df = pd.DataFrame(test_predictions)
test_prediction_df.value_counts() / test_prediction_df.shape[0] * 100


0    78.693624
1    21.306376
dtype: float64

**The model recommends the Smart plan for 78.69% of the customers in the test dataset and the Ultra plan for 21.31%. Like the actual distribution in test_target, the predictions favor Smart, but the model predicts Smart more often than it actually occurs (78.69% versus 69.05%), so it under-recommends Ultra.**
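The reported totals are enough to work out the full confusion matrix, which shows where the errors fall. The sketch below treats Ultra (1) as the positive class and relies on the accuracy and the predicted shares both coming from the same `test_predictions`; it is arithmetic on the printed results, not a new model run.

```python
# Figures reported above
actual_smart, actual_ultra = 444, 199
total = actual_smart + actual_ultra                # 643 test rows
correct = round(0.8102643856920684 * total)        # 521 correct predictions
predicted_ultra = round(0.21306376 * total)        # 137 rows predicted as Ultra

# correct         = tp + tn
# predicted_ultra = tp + fp = tp + (actual_smart - tn)
# Adding the two equations gives 2 * tp = correct + predicted_ultra - actual_smart
tp = (correct + predicted_ultra - actual_smart) // 2
tn = correct - tp
fp = actual_smart - tn
fn = actual_ultra - tp

print(f"tp={tp} tn={tn} fp={fp} fn={fn}")          # prints: tp=107 tn=414 fp=30 fn=92
print(f"Ultra recall: {tp / actual_ultra:.3f}")    # prints: Ultra recall: 0.538
print(f"Smart recall: {tn / actual_smart:.3f}")    # prints: Smart recall: 0.932
```

Read this way, the model identifies most Smart subscribers but only a little over half of the Ultra subscribers, which is consistent with it predicting Smart more often than Smart actually occurs.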



# Conclusion

**After evaluating the decision tree, random forest, and logistic regression models, I found that the random forest model has the highest accuracy. It reaches 81.03% on the test set, exceeds the 0.75 threshold, and passes the sanity check against the majority-class baseline.**

**Based on this analysis, I recommend that Megaline use the random forest model to make plan recommendations for its customers. On the test set, the model correctly identified whether a subscriber uses Smart or Ultra in about four out of five cases.**

**With this model, Megaline can analyze subscriber behavior and recommend the appropriate plan to customers who are still on legacy plans. This could help Megaline match its offerings to individual customers' needs, which may improve customer satisfaction and increase uptake of the recommended plans.**