# Sprint 7 Project: Megaline Machine Learning Analysis

## Prompt: 
Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra. 

You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you've already performed the data preprocessing step, you can move straight to creating the model.  

Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.  

## Introduction
For this project, I will be working with the dataset 'users_behavior', which contains 5 columns: calls (number of calls), minutes (total duration of the calls), messages (number of text messages), mb_used (internet traffic used in MB), and is_ultra (0 = customer has the Smart plan, 1 = customer has the Ultra plan). 

The goal of this project is to design a model that can predict which plan (Smart or Ultra) a customer should register for, based on their usage of the features described above, using the data in the users_behavior dataset.

First I will split the dataset into three sets (train, validation, and test). The train set will be used to fit the models, and I will compare the train scores with the validation scores to tune the hyperparameters. Since this is a binary classification task, I will try the Decision Tree Classifier, Random Forest Classifier, and Logistic Regression. After determining which model gives the highest validation score relative to its train score, I will evaluate that model on the test data to see how well it performs. Lastly, I will perform a sanity check.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

Importing the necessary libraries.

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')

Reading the dataset.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


Since this data is from a previous sprint and has already been cleaned, I will just call info and head to take a look at the data. No additional changes are necessary before analysis.

In [4]:
df.head()

   calls  minutes  messages   mb_used  is_ultra
0   40.0    311.9      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [5]:
train_valid, test = train_test_split(df, test_size = .2, random_state=12345)
train, valid = train_test_split(train_valid, test_size = .25, random_state=12345)

Splitting the dataset into train, validation, and test sets (roughly 60/20/20). The train set will be used to fit the models.

In [6]:
features_train = train.drop(['is_ultra'], axis=1)

In [7]:
target_train = train['is_ultra']

In [8]:
features_valid = valid.drop(['is_ultra'], axis=1)

In [9]:
target_valid = valid['is_ultra']

In [10]:
features_test = test.drop(['is_ultra'], axis=1)

In [11]:
target_test = test['is_ultra']

In [12]:
print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)

(1928, 4)
(643, 4)
(643, 4)


Created features and targets for each split of the data, and printed the shapes to confirm the split was made as intended.
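The shapes printed above can be checked by hand. The sketch below is only an arithmetic illustration, not part of the original notebook: the helper `split_sizes` is hypothetical, and it assumes the test part is rounded up with the remainder going to train, which is how I understand `train_test_split` to behave when only `test_size` is given.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test part is rounded up and the rest goes to train.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_total = 3214  # number of rows reported by df.info()

# First split: 20% of all rows are held out as the test set.
n_train_valid, n_test = split_sizes(n_total, 0.2)

# Second split: 25% of the remaining 80% (about 20% overall) is validation.
n_train, n_valid = split_sizes(n_train_valid, 0.25)

print(n_train, n_valid, n_test)
for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(name, round(n / n_total, 3))
```

Taking 25% of the remaining 80% is what makes the validation set the same size as the test set, giving an overall split of roughly 60/20/20.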

## Tuning Models

### Decision Tree

In [13]:
print('Decision Tree')
for depth in range(1, 20):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    print('max_depth =', depth)
    print('Train:', model.score(features_train, target_train))
    print('Valid:', model.score(features_valid, target_valid))

Decision Tree
max_depth = 1
Train: 0.758298755186722
Valid: 0.7387247278382582
max_depth = 2
Train: 0.79201244813278
Valid: 0.7573872472783826
max_depth = 3
Train: 0.8117219917012448
Valid: 0.7651632970451011
max_depth = 4
Train: 0.8205394190871369
Valid: 0.7636080870917574
max_depth = 5
Train: 0.8272821576763485
Valid: 0.7589424572317263
max_depth = 6
Train: 0.8335062240663901
Valid: 0.7573872472783826
max_depth = 7
Train: 0.8506224066390041
Valid: 0.7744945567651633
max_depth = 8
Train: 0.8661825726141079
Valid: 0.7667185069984448
max_depth = 9
Train: 0.875
Valid: 0.7620528771384136
max_depth = 10
Train: 0.8910788381742739
Valid: 0.7713841368584758
max_depth = 11
Train: 0.9024896265560166
Valid: 0.7589424572317263
max_depth = 12
Train: 0.9154564315352697
Valid: 0.7558320373250389
max_depth = 13
Train: 0.9242738589211619
Valid: 0.749611197511664
max_depth = 14
Train: 0.9367219917012448
Valid: 0.7573872472783826
max_depth = 15
Train: 0.9439834024896265
Valid: 0.7527216174183515
max_dep

### Random Forest

In [15]:
print('Random Forest')
for estim in range(10, 101, 5):
    model = RandomForestClassifier(n_estimators=estim, random_state=12345)
    model.fit(features_train, target_train)
    print('n_estimators =', estim)
    print('Train:', model.score(features_train, target_train))
    print('Valid:', model.score(features_valid, target_valid))


Random Forest
n_estimators = 10
Train: 0.9797717842323651
Valid: 0.7884914463452566
n_estimators = 15
Train: 0.991701244813278
Valid: 0.7838258164852255
n_estimators = 20
Train: 0.9948132780082988
Valid: 0.7900466562986003
n_estimators = 25
Train: 0.9968879668049793
Valid: 0.7884914463452566
n_estimators = 30
Train: 0.9968879668049793
Valid: 0.7884914463452566
n_estimators = 35
Train: 0.9974066390041494
Valid: 0.7853810264385692
n_estimators = 40
Train: 0.9963692946058091
Valid: 0.7869362363919129
n_estimators = 45
Train: 0.9984439834024896
Valid: 0.7931570762052877
n_estimators = 50
Train: 0.9979253112033195
Valid: 0.7947122861586314
n_estimators = 55
Train: 0.9989626556016598
Valid: 0.7962674961119751
n_estimators = 60
Train: 0.9984439834024896
Valid: 0.7962674961119751
n_estimators = 65
Train: 0.9989626556016598
Valid: 0.7993779160186625
n_estimators = 70
Train: 0.9989626556016598
Valid: 0.7962674961119751
n_estimators = 75
Train: 1.0
Valid: 0.7978227060653188
n_estimators = 80
Trai

### Logistic Regression

In [20]:
print('Logistic Regression')
model =  LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
print('Train:', model.score(features_train, target_train))
print('Valid:', model.score(features_valid, target_valid))

Logistic Regression
Train: 0.7422199170124482
Valid: 0.7293934681181959


## Findings

- Decision Tree produced good results with limited overfitting at shallow depths, reaching a validation accuracy of about 0.774 at max_depth = 7. However, the validation scores are higher with Random Forest.
- Although the near-perfect train scores show overfitting, Random Forest produced the highest validation scores, all of which were above the given threshold of 75%. The peak was at n_estimators = 80, with a train score of 1.0 and a validation score of 0.799.
- Logistic Regression does not work well for this task. Neither the train nor the validation score reaches the 75% threshold.
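Reading the best setting off a long printout is error-prone, so the comparison can also be done programmatically. The sketch below is a minimal illustration, not the notebook's own code: the dictionary `valid_scores` is a name introduced here, and it holds only the Decision Tree validation scores copied from the output above for max_depth 1 to 15.

```python
# Validation accuracy by max_depth, copied from the Decision Tree output
# above (depths 1 to 15 only; deeper trees are not included here).
valid_scores = {
    1: 0.7387247278382582,
    2: 0.7573872472783826,
    3: 0.7651632970451011,
    4: 0.7636080870917574,
    5: 0.7589424572317263,
    6: 0.7573872472783826,
    7: 0.7744945567651633,
    8: 0.7667185069984448,
    9: 0.7620528771384136,
    10: 0.7713841368584758,
    11: 0.7589424572317263,
    12: 0.7558320373250389,
    13: 0.749611197511664,
    14: 0.7573872472783826,
    15: 0.7527216174183515,
}

# Pick the depth with the highest validation accuracy.
best_depth = max(valid_scores, key=valid_scores.get)
print("best max_depth:", best_depth)
print("validation accuracy:", round(valid_scores[best_depth], 3))

# Depths whose validation accuracy clears the 0.75 project threshold.
above_threshold = [d for d, s in valid_scores.items() if s >= 0.75]
print("depths at or above 0.75:", above_threshold)
```

The same idea works inside the tuning loops: storing each validation score in a dictionary as it is computed avoids copying numbers by hand.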

## Testing the model

In [22]:
features_full_train = train_valid.drop(['is_ultra'], axis=1)
target_full_train = train_valid['is_ultra']

In [23]:
model = RandomForestClassifier(random_state=12345, n_estimators=80)
model.fit(features_full_train, target_full_train)
model.score(features_test, target_test)

0.7713841368584758

Goal achieved: Random Forest with n_estimators=80 reaches about 77.1% accuracy on the test set, which is above the 75% threshold.

## Additional Task: Sanity Check

In [24]:
df['is_ultra'].value_counts()/df.shape[0]

0    0.693528
1    0.306472
Name: is_ultra, dtype: float64

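Sanity check: a model that always predicts the majority class (Smart, is_ultra = 0) would be about 69% accurate. The Random Forest test accuracy of about 77% is clearly above this baseline, while the Logistic Regression validation score of about 73% is only slightly above it, which indicates the logistic regression hasn't learned much.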
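The comparison against the majority-class baseline can be written out as a short calculation. This is a sketch under stated assumptions: the class counts are inferred from the proportions and row count printed above rather than read from the data, and the baseline is computed on the full dataset, so the share of Smart users in the test split alone may differ slightly.

```python
n_rows = 3214           # from df.info()
share_smart = 0.693528  # share of is_ultra == 0 from value_counts()

# Inferred class counts (an assumption derived from the rounded share).
n_smart = round(n_rows * share_smart)
n_ultra = n_rows - n_smart

# A constant model that always predicts the majority class is right
# exactly as often as that class occurs.
baseline_accuracy = max(n_smart, n_ultra) / n_rows

test_accuracy = 0.7713841368584758  # Random Forest score on the test set
logreg_valid = 0.7293934681181959   # Logistic Regression validation score

print("baseline:", round(baseline_accuracy, 3))
print("random forest gain:", round(test_accuracy - baseline_accuracy, 3))
print("logistic regression gain:", round(logreg_valid - baseline_accuracy, 3))
```

For a baseline measured on one specific split, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` could be fit on the train data and scored on the test data in the same way as the other models.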

## Conclusion

Out of the three models used, the Random Forest Classifier produced the best model, with n_estimators = 80. On the test set, it scored 0.771 (77.1% accuracy), which exceeds the goal of 75%. I recommend using this model to recommend plans to customers.