# Recommendation of new mobile plan: Smart or Ultra 

Background:
- Many of our subscribers are still on legacy mobile plans.

Goal:
- Our goal is to analyze subscribers' behavior and recommend one of the new plans, Smart or Ultra, to each of them.

Stages:
- We have already performed the data processing step earlier, so we will focus on creating the model here.

- Since we are recommending one of two plans, Smart or Ultra, we will use classification models rather than regression models.

- We will set the accuracy threshold to 0.75 and split the data into a training set, a validation set and a test set.

<div class="alert alert-block alert-warning">
<b></b> <a class="tocSkip"></a>   
    
> # Contents <a id='back'></a>
> * [Introduction](#intro)
    * [Stage 1. Importing libraries](#Import-Libraries)
    * [Stage 2. Load data](#Load-data)
    * [Stage 3. Splitting data](#Splitting-data)
    * [Stage 4. Testing different binary classifications](#Testing-different-binary-classification)
        * [4.1 Logistic regression](#4.1-Logistic-regression)
        * [4.2 Decision Tree Classifier](#4.2-Decision-Tree-Classifier)
        * [4.3 Random Forest Classifier](#4.3-Random-Forest-Classifier)
    * [Stage 5. What model to choose](#What-model-to-choose?)
    * [Stage 6. Testing Logistic Regression on testing data](#Testing-Logistic-Regression-on-testing-data)
    * [Conclusion](#Conclusion)

# Import Libraries

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

# Load data

In [2]:
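try:
    # local copy of the dataset
    subscribers = pd.read_csv('/Users/dankeichow/Downloads/users_behavior.csv')
except FileNotFoundError:
    # fall back to the platform path when the local file is missing
    subscribers = pd.read_csv('/datasets/users_behavior.csv')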

In [3]:
subscribers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
subscribers.head(5)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


We have already completed EDA in a previous project, which included:

* Removing duplicates
* Checking summary statistics with the `describe()` function
* Various univariate and multivariate visualisations



There are 3,214 rows of data. The number of calls, total call duration in minutes, number of text messages and MB used will be used as features. The `is_ultra` column (the phone plan for the month) will be used as the target.

In [5]:
# Creating separate datasets to store features and target

features = subscribers.drop(['is_ultra'], axis=1)

target = subscribers['is_ultra']

# Splitting data 

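Since no separate test set exists, the source data has to be split into three parts: training, validation and test sets. The validation and test sets are usually of equal size, but the two-step split used here holds out 20% of the data for testing and then 20% of the remaining 80% for validation, which gives:

- Training data: 64%
- Validation data: 16%
- Test data: 20%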

We set a fixed random state value to ensure reproducibility, so that the splits remain the same even if the code is executed multiple times.
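The arithmetic behind a two-step split can be checked without touching the data. The sketch below is only an illustration: the helper `split_sizes` is hypothetical, and it assumes the held-out part is rounded up to a whole row, which reproduces the shapes printed later in this notebook. It also shows that a second step with `test_size=0.25` (an alternative, not what this notebook runs) would give an almost exact 60/20/20 split.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out part is rounded up to a whole row.
    """
    n_held_out = math.ceil(test_size * n_rows)
    return n_rows - n_held_out, n_held_out

N_ROWS = 3214  # number of rows reported by subscribers.info()

# Split as used in this notebook: 20% test, then 20% of the rest for validation
n_rest, n_test = split_sizes(N_ROWS, 0.2)
n_train, n_val = split_sizes(n_rest, 0.2)
current = (n_train, n_val, n_test)

# Alternative second step: 25% of the remaining 80% is 20% of the whole
n_train_alt, n_val_alt = split_sizes(n_rest, 0.25)
alternative = (n_train_alt, n_val_alt, n_test)

for label, sizes in [("current", current), ("alternative", alternative)]:
    shares = [round(size / N_ROWS, 2) for size in sizes]
    print(f"{label}: train/val/test rows = {sizes}, shares = {shares}")
```

The second step works on the rows left after the first one, so its `test_size` is a fraction of the remainder, not of the full dataset.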

In [6]:
# Splitting the data into train, validation and test sets

# First step, splitting the data into a training set and a test set; 80% training and 20% test

features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345 )


In [7]:
# Second step: take the 80% training data and split 20% of it off as the validation set

features_train, features_val, target_train, target_val = train_test_split(
    features_train, target_train, test_size=0.2, random_state=12345 )

features_train.shape


(2056, 4)

In [8]:
datasets = [features_train, target_train, features_val, target_val, features_test, target_test]

# print the shape of every split to confirm the sizes
for dataset in datasets:
    print(f'shape of data {dataset.shape}')


shape of data (2056, 4)
shape of data (2056,)
shape of data (515, 4)
shape of data (515,)
shape of data (643, 4)
shape of data (643,)


The datasets we have after splitting are:

`training`: features_train, target_train

`validation`: features_val, target_val

`test`: features_test, target_test

There are 2056 rows of data in our training set, 515 rows in our validation set and 643 rows in our test set.

# Testing different binary classification

# 4.1 Logistic regression

In [9]:
# 1st method: logistic regression 
# Imported above but just fyi: from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(random_state = 12345, solver='liblinear') #create an empty model
lr_model.fit(features_train,target_train) #train the model with training data



LogisticRegression(random_state=12345, solver='liblinear')

In [10]:
# Test the accuracy between training and validation

# training data
lr_train_predictions = lr_model.predict(features_train)
lr_train_accuracy = accuracy_score(target_train, lr_train_predictions)

# validation data
lr_val_predictions = lr_model.predict(features_val)
lr_val_accuracy = accuracy_score(target_val, lr_val_predictions)

print(f'Training set accuracy: {lr_train_accuracy}')
print(f'Validation set accuracy: {lr_val_accuracy}')

Training set accuracy: 0.745136186770428
Validation set accuracy: 0.7203883495145631


The logistic regression model correctly classified 74.5% of the data in the training set and 72.0% in the validation set.
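Accuracy here is simply the share of predictions that match the true label. As a minimal sketch of what `accuracy_score` measures, the function below computes it by hand; the labels are made up for illustration and are not taken from the dataset.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels: 1 = Ultra, 0 = Smart
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1]

example_accuracy = accuracy(y_true, y_pred)
print(f"Accuracy: {example_accuracy}")  # 6 of the 8 predictions match
```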

# 4.2 Decision Tree Classifier

In [11]:
# Imported above but just fyi: from sklearn.tree import DecisionTreeClassifier 
# To check which max depth is more suitable and what is the accuracy for training and validation set


for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)  # create an empty model
    model.fit(features_train, target_train)  # train the model with training data
    predictions_train = model.predict(features_train)
    predictions_valid = model.predict(features_val)  # get the model's predictions
    accuracy_score_train = accuracy_score(target_train, predictions_train)
    accuracy_score_val = accuracy_score(target_val, predictions_valid)
    print(f'Accuracy for max depth {depth} for training is {accuracy_score_train}.')
    print(f'Accuracy for max depth {depth} for validation is {accuracy_score_val}.')
    print()


Accuracy for max depth 1 for training is 0.7611867704280155.
Accuracy for max depth 1 for validation is 0.7223300970873786.

Accuracy for max depth 2 for training is 0.7923151750972762.
Accuracy for max depth 2 for validation is 0.7475728155339806.

Accuracy for max depth 3 for training is 0.811284046692607.
Accuracy for max depth 3 for validation is 0.7553398058252427.

Accuracy for max depth 4 for training is 0.8195525291828794.
Accuracy for max depth 4 for validation is 0.7533980582524272.

Accuracy for max depth 5 for training is 0.828307392996109.
Accuracy for max depth 5 for validation is 0.7572815533980582.

Accuracy for max depth 6 for training is 0.8336575875486382.
Accuracy for max depth 6 for validation is 0.7611650485436893.

Accuracy for max depth 7 for training is 0.8516536964980544.
Accuracy for max depth 7 for validation is 0.7650485436893204.

Accuracy for max depth 8 for training is 0.8706225680933852.
Accuracy for max depth 8 for validation is 0.7631067961165049.

[output truncated]

<b>Observation</b>

Different values of max depth were tested to see which is most suitable. Here are the findings:
- The accuracy on the training set increases steadily as the max depth increases
- The accuracy on the validation set increases much more slowly: among the depths shown, it peaks at 0.765 at max depth 7 and dips slightly at max depth 8

If the training accuracy is high but the validation accuracy is noticeably lower, this is a sign that the model is overfitting. Overfitting occurs when the model fits the training data so closely that it is not able to generalize to new data.

Therefore, max depth 6 seems to be the best parameter for the decision tree classifier. Its validation accuracy is close to the highest observed, and the gap between training and validation accuracy (about 7 percentage points) is still moderate, so the model is not overfitting heavily.

Accuracy for max depth 6 for training is 0.8336575875486382

Accuracy for max depth 6 for validation is 0.7611650485436893
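One way to make this choice explicit is to compute the gap between training and validation accuracy for each depth and pick the best validation score among depths whose gap stays within a tolerance. The sketch below uses the accuracies printed above for depths 1 to 8 (depths 9 and 10 are left out because their output is cut off). The tolerance `MAX_GAP = 0.08` and the helper `pick_depth` are assumptions made for illustration, not part of the original analysis.

```python
# (training accuracy, validation accuracy) per max_depth, copied from the output above
tree_scores = {
    1: (0.7611867704280155, 0.7223300970873786),
    2: (0.7923151750972762, 0.7475728155339806),
    3: (0.811284046692607, 0.7553398058252427),
    4: (0.8195525291828794, 0.7533980582524272),
    5: (0.828307392996109, 0.7572815533980582),
    6: (0.8336575875486382, 0.7611650485436893),
    7: (0.8516536964980544, 0.7650485436893204),
    8: (0.8706225680933852, 0.7631067961165049),
}

MAX_GAP = 0.08  # hypothetical tolerance for the train/validation gap

def pick_depth(scores, max_gap):
    """Depth with the best validation accuracy among depths within the gap tolerance."""
    eligible = {d: val for d, (train, val) in scores.items() if train - val <= max_gap}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)

for depth, (train_acc, val_acc) in tree_scores.items():
    print(f"max_depth={depth}: validation={val_acc:.4f}, gap={train_acc - val_acc:.4f}")

chosen_depth = pick_depth(tree_scores, MAX_GAP)
print(f"Chosen max_depth with gap <= {MAX_GAP}: {chosen_depth}")
```

With this tolerance the rule selects depth 6. A looser tolerance would select depth 7, which has the highest validation accuracy among the depths shown, so the result depends on how much gap is considered acceptable.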

# 4.3 Random Forest Classifier

In [12]:
# Imported above but just fyi: from sklearn.ensemble import RandomForestClassifier
# To check which n_estimator is more suitable and what is the accuracy for training and validation set

best_score = 0
best_est = 0

for est in range(1,11):
    model = RandomForestClassifier(random_state=12345, n_estimators = est) #create an empty model
    model.fit(features_train,target_train) # training the model with training data
    train_score = model.score(features_train, target_train)
    val_score = model.score(features_val, target_val)
    print(f'Accuracy for n_estimators {est} for training is {train_score}.')
    print(f'Accuracy for n_estimators {est} for validation is {val_score}.')
    print()     


Accuracy for n_estimators 1 for training is 0.9071011673151751.
Accuracy for n_estimators 1 for validation is 0.7223300970873786.

Accuracy for n_estimators 2 for training is 0.9051556420233463.
Accuracy for n_estimators 2 for validation is 0.7398058252427184.

Accuracy for n_estimators 3 for training is 0.9542801556420234.
Accuracy for n_estimators 3 for validation is 0.7436893203883496.

Accuracy for n_estimators 4 for training is 0.9460116731517509.
Accuracy for n_estimators 4 for validation is 0.7533980582524272.

Accuracy for n_estimators 5 for training is 0.9727626459143969.
Accuracy for n_estimators 5 for validation is 0.7398058252427184.

Accuracy for n_estimators 6 for training is 0.9640077821011673.
Accuracy for n_estimators 6 for validation is 0.7553398058252427.

Accuracy for n_estimators 7 for training is 0.9795719844357976.
Accuracy for n_estimators 7 for validation is 0.7553398058252427.

Accuracy for n_estimators 8 for training is 0.9747081712062257.
[output truncated]

<b> Observation </b>

Different values of n_estimators were tested to see which is most suitable. Here are the findings:
- The accuracy on the training set tends to increase as n_estimators increases, with some fluctuation
- The accuracy on the validation set also tends to increase but fluctuates as n_estimators increases
- The random forest classifier correctly classified at least 90% of the data in the training set for every value shown

The random forest classifier has an overfitting issue: the training accuracy is far higher than the validation accuracy, with a gap of roughly 17 to 23 percentage points.

n_estimators = 2 is the best parameter by this criterion, as the gap between training and validation accuracy is smallest there.
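The smallest-gap criterion and the highest-validation-accuracy criterion do not have to agree. As a rough check, the sketch below applies both to the accuracies printed above for n_estimators 1 to 7 (later values are left out because the output is cut off). The variable names are illustrative only.

```python
# (training accuracy, validation accuracy) per n_estimators, copied from the output above
rf_scores = {
    1: (0.9071011673151751, 0.7223300970873786),
    2: (0.9051556420233463, 0.7398058252427184),
    3: (0.9542801556420234, 0.7436893203883496),
    4: (0.9460116731517509, 0.7533980582524272),
    5: (0.9727626459143969, 0.7398058252427184),
    6: (0.9640077821011673, 0.7553398058252427),
    7: (0.9795719844357976, 0.7553398058252427),
}

# Gap between training and validation accuracy for each setting
gaps = {n: train - val for n, (train, val) in rf_scores.items()}

# Criterion 1: smallest train/validation gap
smallest_gap_n = min(gaps, key=gaps.get)

# Criterion 2: highest validation accuracy (the first setting wins a tie)
best_val_n = max(rf_scores, key=lambda n: rf_scores[n][1])

print(f"Smallest gap: n_estimators={smallest_gap_n} (gap {gaps[smallest_gap_n]:.4f})")
print(f"Best validation accuracy: n_estimators={best_val_n} ({rf_scores[best_val_n][1]:.4f})")
```

Here the smallest gap points to n_estimators = 2, while the highest validation accuracy is shared by 6 and 7, so the choice depends on which criterion is given priority.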

# What model to choose?

We have tested logistic regression, a decision tree classifier and a random forest classifier.

For classification metrics, we also have to decide whether we value precision or recall.

Recall is the priority when it is important to find all observations of the target class, while precision is the priority when the goal is to minimize false positives.
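To make the distinction concrete, here is a minimal sketch that computes both metrics by hand. The labels are hypothetical (1 = Ultra, 0 = Smart) and the helper `precision_recall` is written for illustration; neither comes from the original analysis.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: four true Ultra users, six true Smart users
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"Precision: {precision:.3f}")  # 2 of the 3 predicted Ultra users are correct
print(f"Recall: {recall:.3f}")        # 2 of the 4 true Ultra users are found
```

In this example the model is right about most users it labels as Ultra but misses half of the actual Ultra users, which is the trade-off the two metrics capture.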

The project goal is to recommend the newer plans to users. Therefore, accuracy seems to be less of a priority compared to being able to recommend a plan to every subscriber at Megaline.

The random forest's training accuracy is far above its validation accuracy, which means the model is overfitted, so we will choose between logistic regression and the decision tree classifier.

Logistic regression has the fastest run speed of the three models and moderate accuracy. Even though the decision tree classifier has higher accuracy than logistic regression, the gap between its training and validation accuracy is wider (about 7 percentage points versus about 2.5), which also suggests some overfitting.

The validation accuracies of logistic regression and the decision tree are only about 4 percentage points apart (0.720 versus 0.761). Therefore, I will pick logistic regression.

# Testing Logistic Regression on testing data

In [13]:
# The logistic regression model was already created and trained above, so those steps are not repeated

#lr_model = LogisticRegression(random_state = 12345, solver='liblinear') 

#lr_model.fit(features_train,target_train) 

# Test the accuracy on the testing data
lr_test_predictions = lr_model.predict(features_test)
lr_test_accuracy = accuracy_score(target_test, lr_test_predictions)

print(f'Training set accuracy: {lr_train_accuracy}')
print(f'Validation set accuracy: {lr_val_accuracy}')
print(f'Test set accuracy: {lr_test_accuracy}')


Training set accuracy: 0.745136186770428
Validation set accuracy: 0.7203883495145631
Test set accuracy: 0.749611197511664
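To relate these accuracies to the 0.75 threshold set at the start, the sketch below converts each printed accuracy back into a count of correct predictions and reports how many more would be needed to reach the threshold. It uses the set sizes printed earlier (2056, 515 and 643 rows); the helper `shortfall` is hypothetical.

```python
import math

THRESHOLD = 0.75  # accuracy threshold stated at the start of the project

# Set sizes and accuracies copied from the outputs in this notebook
set_sizes = {"training": 2056, "validation": 515, "test": 643}
accuracies = {
    "training": 0.745136186770428,
    "validation": 0.7203883495145631,
    "test": 0.749611197511664,
}

def shortfall(accuracy, n_rows, threshold=THRESHOLD):
    """Number of additional correct predictions needed to reach the threshold."""
    correct = round(accuracy * n_rows)        # recover the count of correct predictions
    needed = math.ceil(threshold * n_rows)    # smallest count that meets the threshold
    return max(0, needed - correct)

for name, n_rows in set_sizes.items():
    acc = accuracies[name]
    missing = shortfall(acc, n_rows)
    status = "meets" if missing == 0 else f"is {missing} correct prediction(s) short of"
    print(f"{name}: accuracy {acc:.4f} {status} the {THRESHOLD} threshold")
```

On the test set the model is a single correct prediction short of the threshold, so the result is close to the target but does not strictly meet it.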


# Conclusion

The goal of the project is to recommend the new mobile phone plans, Smart and Ultra, to existing subscribers. We use behavior data from subscribers who have already switched to the new plans to recommend a plan to the remaining subscribers.

This is a classification task because we are predicting a category rather than a number. We have tested logistic regression, random forest and decision tree models.

Because we want to make a recommendation to every user, speed is important and accuracy is less of a priority. Therefore, I chose logistic regression as our machine learning model after comparing the three.

The logistic regression model correctly predicted 74.5% of the targets in the training set. The validation set has a lower accuracy of 72.0%, while the test set has an accuracy of 75.0% (0.7496), which is higher than the validation set but marginally below the 0.75 threshold set at the start.

The scores on all three data sets are close, which suggests that the model is not overfitting and is able to generalize to new data.

The test score being higher than the validation score is consistent with a model that generalizes well, although a difference of this size can also come from random variation between the splits.