# ML model - Ultra or Smart?

Mobile carrier Megaline has found out that many of our subscribers use legacy plans. Management wants to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.
We have access to behavior data about subscribers who have already switched to the new plans. For this classification task, we need to develop a model that will pick the right plan. Since we’ve already performed the data preprocessing step, we can move straight to creating the model.

**Project Goal**: 


**Develop a model** that will recommend the right plan (Ultra or Smart), based on behavior data about subscribers who have already switched to the new plans, **with the highest possible accuracy**. Management defined that the **threshold for accuracy is 0.75** (checked on a test set that was not used at all for training). 

## Initialization and loading data

In [1]:
# Loading libraries
from scipy import stats as st
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import warnings
warnings.filterwarnings("ignore")
# pd.options.display.float_format = '{:.3f}'.format

In [2]:
# load the data into a DataFrame: 
df = pd.read_csv('/datasets/users_behavior.csv')

## Looking through the data

In [3]:
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [18]:
# checking for explicit duplicates: 
df.duplicated().sum()

0

Every observation in the dataset contains monthly behavior information about one user. The dataset has 3214 complete entries (no missing values, no explicit duplicates, and the data types are appropriate). The information given is as follows:
- calls — number of calls,
- minutes — total call duration in minutes,
- messages — number of text messages,
- mb_used — Internet traffic used in MB,
- is_ultra — plan for the current month (Ultra - 1, Smart - 0).
 

## Split the data into training, validation, and test sets

We have only these 3214 entries, so we will split them into 3 sets: a training set (60% of the data), a validation set (20%), and a test set (20%). We will use the training and validation sets to train and compare different models, and after choosing the model with the best validation accuracy (as requested by management) we will check its accuracy on the test set. To create random splits we will use the function 'train_test_split' twice, because each call splits the data randomly into 2 sets. 

For convenience we will denote 'features' as 'X' and 'target' as 'y'. The features of each set are all columns except 'is_ultra', which serves as the target. We will check the sizes ('shape') of the 3 sets after the splits. 

In [5]:
# split the data into 3 sets, and check their sizes('shape'):   
from sklearn.model_selection import train_test_split

# We want to split the data 60:20:20 into train:valid:test sets
train_size = 0.6

# For convenience we will denote 'features' as 'X' and 'target' as 'y'
X = df.drop(['is_ultra'], axis=1)
y = df['is_ultra']

# In the first step we split the data into a training set and the remaining data
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=train_size)

# We want the validation and test sets to be equal in size (20% of the overall data each),
# so we set test_size=0.5 (that is, 50% of the remaining data)
test_size = 0.5
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=test_size)

print('Training features shape:', X_train.shape)
print('Training target shape:', y_train.shape)
print('Validation features shape:', X_valid.shape)
print('Validation target shape:', y_valid.shape)
print('Test features shape:', X_test.shape)
print('Test target shape:', y_test.shape)

Training features shape: (1928, 4)
Training target shape: (1928,)
Validation features shape: (643, 4)
Validation target shape: (643,)
Test features shape: (643, 4)
Test target shape: (643,)

The sizes ('shape') of the 3 sets are as expected: 643 entries (20%) each for the validation and test sets, and 1928 entries (60%) for the training set. The features have 4 columns after 'is_ultra' was dropped.
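The split above does not fix a random seed, so the exact rows in each set change from run to run. A minimal sketch of a reproducible 60/20/20 split follows. The helper name `split_60_20_20`, the use of `stratify` to keep the class share similar in every set, and the synthetic stand-in data are our assumptions for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_60_20_20(X, y, seed=12345):
    """Hypothetical helper: reproducible, stratified 60/20/20 split."""
    # First split: 60% training, 40% remainder.
    X_tr, X_rem, y_tr, y_rem = train_test_split(
        X, y, train_size=0.6, random_state=seed, stratify=y)
    # Second split: halve the remainder into validation and test.
    X_va, X_te, y_va, y_te = train_test_split(
        X_rem, y_rem, test_size=0.5, random_state=seed, stratify=y_rem)
    return X_tr, X_va, X_te, y_tr, y_va, y_te


# Synthetic stand-in with the same column names as the real data.
# All values (including the 30% share of class 1) are made up.
rng = np.random.default_rng(12345)
n_rows = 3214
demo_df = pd.DataFrame({
    'calls': rng.integers(0, 200, n_rows).astype(float),
    'minutes': rng.uniform(0, 1500, n_rows),
    'messages': rng.integers(0, 200, n_rows).astype(float),
    'mb_used': rng.uniform(0, 45000, n_rows),
    'is_ultra': rng.binomial(1, 0.3, n_rows),
})

parts = split_60_20_20(demo_df.drop(columns=['is_ultra']), demo_df['is_ultra'])
demo_X_train, demo_X_valid, demo_X_test, demo_y_train, demo_y_valid, demo_y_test = parts

# Print the size and the class-1 share of each set.
for name, target in [('train', demo_y_train), ('valid', demo_y_valid), ('test', demo_y_test)]:
    print(name, target.shape[0], round(target.mean(), 3))
```

With the real data, the same helper could be called as `split_60_20_20(X, y)`. Note that fixing the seed would produce different sets than the ones used in this notebook, so the accuracies reported here would not carry over.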

## Training different models and investigating their accuracy

We will train 3 kinds of classification models: decision tree, random forest, and logistic regression. For each kind we will try to improve accuracy by changing hyperparameters. We will not use the test set at all at this stage; instead, we will use the validation set each time to measure the accuracy of each model with its specific hyperparameters. To make our comparisons more reliable, we will keep random_state at the same value (12345) in all models.

### Decision Tree 

The hyperparameter we will change will be the tree depth:

In [19]:
# Decision tree with depth of 3:

from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Define the model
model = DecisionTreeClassifier(random_state=12345, max_depth=3)

# Train the model on the training set:
model.fit(X_train, y_train)

# Investigate model accuracy for training and validation sets: 
train_predictions = model.predict(X_train) 
valid_predictions = model.predict(X_valid) 

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions)) 
print('Validation set:', accuracy_score(y_valid, valid_predictions)) 

Accuracy
Training set: 0.808091286307054
Validation set: 0.7744945567651633


A tree depth of 3 already gave us 77.45% accuracy on the validation set, which is above the defined threshold, but we were asked for the highest possible accuracy, so we will try a higher depth:

In [20]:
# Decision tree with depth of 4:

from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Define the model
model = DecisionTreeClassifier(random_state=12345, max_depth=4)

# Train the model on the training set:
model.fit(X_train, y_train)

# Investigate model accuracy for training and validation sets: 
train_predictions = model.predict(X_train) 
valid_predictions = model.predict(X_valid) 

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions)) 
print('Validation set:', accuracy_score(y_valid, valid_predictions)) 

Accuracy
Training set: 0.8112033195020747
Validation set: 0.7776049766718507


Tree depth of 4 gave us slightly higher accuracy on the validation set (77.76%), let's try to raise the depth once more: 

In [8]:
# Decision tree with depth of 5:

from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Define the model
model = DecisionTreeClassifier(random_state=12345, max_depth=5)

# Train the model on the training set:
model.fit(X_train, y_train)

# Investigate model accuracy for training and validation sets: 
train_predictions = model.predict(X_train) 
valid_predictions = model.predict(X_valid) 

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions)) 
print('Validation set:', accuracy_score(y_valid, valid_predictions)) 

Accuracy
Training set: 0.8241701244813278
Validation set: 0.7713841368584758


A tree depth of 5 kept raising the accuracy on the training set (depth 3: 80.81%, depth 4: 81.12%, depth 5: 82.42%), BUT the accuracy on the validation set, the important one, dropped to 77.14%. Training accuracy rising while validation accuracy falls is a sign of overfitting. So for now our best accuracy on the validation set is 77.76% (depth 4). Let's try other kinds of models:
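The three decision tree cells differ only in `max_depth`, so the same investigation can be written as a loop. This is a minimal sketch, not the notebook's actual code: the helper name `sweep_tree_depth` is hypothetical, and the data is a synthetic stand-in so that the block runs on its own.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption), 4 features and a binary target.
rng = np.random.default_rng(12345)
X_demo = rng.normal(size=(600, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 3]
          + rng.normal(scale=0.5, size=600) > 0).astype(int)
Xd_train, Xd_valid, yd_train, yd_valid = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=12345)


def sweep_tree_depth(X_tr, y_tr, X_va, y_va, depths):
    """Hypothetical helper: {depth: (train_accuracy, valid_accuracy)}."""
    scores = {}
    for depth in depths:
        tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
        tree.fit(X_tr, y_tr)
        scores[depth] = (
            accuracy_score(y_tr, tree.predict(X_tr)),
            accuracy_score(y_va, tree.predict(X_va)),
        )
    return scores


depth_scores = sweep_tree_depth(Xd_train, yd_train, Xd_valid, yd_valid, range(1, 11))

# Pick the depth with the highest validation accuracy.
best_depth = max(depth_scores, key=lambda d: depth_scores[d][1])

for depth, (train_acc, valid_acc) in depth_scores.items():
    print(f'depth={depth:2d}  train={train_acc:.4f}  valid={valid_acc:.4f}')
print('Best depth by validation accuracy:', best_depth)
```

Printing both accuracies side by side makes the point where validation accuracy stops following training accuracy easy to spot.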

### Random Forest 

The hyperparameter we will change will be the number of trees:

In [9]:
# Random Forest with 10 trees: 
from sklearn.ensemble import RandomForestClassifier

# Define the model
model = RandomForestClassifier(random_state=12345, n_estimators=10) 

# Train the model on the training set:
model.fit(X_train, y_train)

# Investigate model accuracy for training and validation sets: 
train_predictions = model.predict(X_train) 
valid_predictions = model.predict(X_valid) 

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions)) 
print('Validation set:', accuracy_score(y_valid, valid_predictions)) 

Accuracy
Training set: 0.9771784232365145
Validation set: 0.7791601866251944


10 trees already gave us 77.92% accuracy on the validation set, which is above the defined threshold and higher than our best decision tree model, but we were asked for the highest possible accuracy, so we will try more trees (although there is a tradeoff with speed, we are committed to the current project definition of best accuracy):

In [10]:
# Random Forest with 30 trees: 
from sklearn.ensemble import RandomForestClassifier

# Define the model
model = RandomForestClassifier(random_state=12345, n_estimators=30) 

# Train the model on the training set:
model.fit(X_train, y_train)

# Investigate model accuracy for training and validation sets: 
train_predictions = model.predict(X_train) 
valid_predictions = model.predict(X_valid) 

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions)) 
print('Validation set:', accuracy_score(y_valid, valid_predictions)) 

Accuracy
Training set: 0.9963692946058091
Validation set: 0.7884914463452566


Random Forest with 30 trees gave us slightly higher accuracy on the validation set (78.85%), let's try to raise the number of trees once more: 

In [11]:
# Random Forest with 50 trees: 
from sklearn.ensemble import RandomForestClassifier

# Define the model
model = RandomForestClassifier(random_state=12345, n_estimators=50) 

# Train the model on the training set:
model.fit(X_train, y_train)

# Investigate model accuracy for training and validation sets: 
train_predictions = model.predict(X_train) 
valid_predictions = model.predict(X_valid) 

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions)) 
print('Validation set:', accuracy_score(y_valid, valid_predictions)) 

Accuracy
Training set: 0.9989626556016598
Validation set: 0.7853810264385692


Random Forest with 50 trees kept raising the accuracy on the training set (10 trees: 97.72%, 30 trees: 99.64%, 50 trees: 99.9%) BUT the accuracy on the validation set - the important one - got slightly lower (only 78.54%). Since the default of RandomForestClassifier is 100 trees, let's investigate what accuracy it will produce:

In [12]:
# Random Forest with 100 trees: 
from sklearn.ensemble import RandomForestClassifier

# Define the model
model = RandomForestClassifier(random_state=12345, n_estimators=100)
                               

# Train the model on the training set:
model.fit(X_train, y_train)

# Investigate model accuracy for training and validation sets: 
train_predictions = model.predict(X_train) 
valid_predictions = model.predict(X_valid) 

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions)) 
print('Validation set:', accuracy_score(y_valid, valid_predictions)) 

Accuracy
Training set: 1.0
Validation set: 0.7807153965785381


As expected, with 100 trees the accuracy on the training set became perfect (100%), but the accuracy on the validation set - the important one - kept going down to 78.07%. Our best accuracy model so far is random forest with 30 trees (78.85%). 
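Training accuracy near 100% next to validation accuracy around 78% suggests the forests memorize the training set. One option we did not try above is to vary `max_depth` together with the number of trees. The sketch below shows how such a small grid could be run; the helper name `sweep_forest`, the grid values, and the synthetic stand-in data are assumptions for illustration, and nothing here says the grid would improve the results on the real data.

```python
from itertools import product

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption), 4 features and a binary target.
rng = np.random.default_rng(12345)
X_demo = rng.normal(size=(600, 4))
y_demo = (X_demo[:, 0] * X_demo[:, 1]
          + rng.normal(scale=0.5, size=600) > 0).astype(int)
Xd_train, Xd_valid, yd_train, yd_valid = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=12345)


def sweep_forest(X_tr, y_tr, X_va, y_va, tree_counts, depths):
    """Hypothetical helper: one row per (n_estimators, max_depth) pair."""
    rows = []
    for n_trees, depth in product(tree_counts, depths):
        forest = RandomForestClassifier(
            random_state=12345, n_estimators=n_trees, max_depth=depth)
        forest.fit(X_tr, y_tr)
        rows.append({
            'n_estimators': n_trees,
            'max_depth': str(depth),  # 'None' means unlimited depth
            'train_acc': accuracy_score(y_tr, forest.predict(X_tr)),
            'valid_acc': accuracy_score(y_va, forest.predict(X_va)),
        })
    table = pd.DataFrame(rows)
    # Best validation accuracy first.
    return table.sort_values('valid_acc', ascending=False).reset_index(drop=True)


forest_results = sweep_forest(
    Xd_train, yd_train, Xd_valid, yd_valid,
    tree_counts=(10, 30, 50), depths=(4, 8, None))
print(forest_results)
```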

### Logistic Regression

We will first try the model with default values for most of its hyperparameters:

In [13]:
# Logistic Regression with 'liblinear' solver:  

from sklearn.linear_model import LogisticRegression

# Define the model:
model = LogisticRegression(random_state=12345, solver='liblinear')

# Train the model on the training set:
model.fit(X_train, y_train)

# Investigate model accuracy for training and validation sets: 
train_predictions = model.predict(X_train) 
valid_predictions = model.predict(X_valid) 

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions)) 
print('Validation set:', accuracy_score(y_valid, valid_predictions)) 

Accuracy
Training set: 0.7121369294605809
Validation set: 0.7060653188180405


The accuracy is disappointing: only 70.61% on the validation set (and only 71.21% on the training set), which is well below our defined threshold. This may mean that this kind of model is a poorer fit for the current kind of data. Let's change 3 hyperparameters and see if we get any serious improvement:

In [14]:
# Logistic Regression with solver='newton-cg', penalty='l2', C=0.5: 

from sklearn.linear_model import LogisticRegression

# Define the model:
model = LogisticRegression(random_state=12345, solver='newton-cg', penalty='l2', C=0.5)

# Train the model on the training set:
model.fit(X_train, y_train)

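# Investigate model accuracy for training and validation sets: 
train_predictions = model.predict(X_train) 
# Predict on the validation features (not X_test), so they match y_valid below:
valid_predictions = model.predict(X_valid) 

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions)) 
print('Validation set:', accuracy_score(y_valid, valid_predictions))       

Accuracy
Training set: 0.7505186721991701
Validation set: 0.6625194401244168


The accuracy on the training set got higher (75.05%). The validation figure recorded above (66.25%) is not a valid validation accuracy, though: it was produced by predicting on X_test and scoring against y_valid, which pairs predictions with the labels of different rows. With predictions made on X_valid, the cell has to be rerun to get the real number. Either way, the training accuracy is still below the validation accuracy of our best random forest (78.85%), so it seems that logistic regression will not be our choice. 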
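One possible reason for the weak logistic regression results, which we did not check here, is feature scale: 'mb_used' is in the tens of thousands while the other features are in the tens or hundreds. The sketch below shows how standardizing the features before the model could be tried with a scikit-learn pipeline. The data is a synthetic stand-in with two made-up features on very different scales, and whether scaling helps on the real data is an open question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): two features on very different scales.
rng = np.random.default_rng(12345)
n_rows = 1000
messages = rng.uniform(0, 200, n_rows)
mb_used = rng.uniform(0, 45000, n_rows)
X_demo = np.column_stack([messages, mb_used])
y_demo = (messages / 200 + mb_used / 45000
          + rng.normal(scale=0.2, size=n_rows) > 1).astype(int)
Xd_train, Xd_valid, yd_train, yd_valid = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=12345)

# The scaler is fitted on the training set only, then applied before the model.
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver='liblinear'),
)
scaled_logreg.fit(Xd_train, yd_train)

scaled_valid_acc = accuracy_score(yd_valid, scaled_logreg.predict(Xd_valid))
print('Validation accuracy (scaled features):', scaled_valid_acc)
```

Wrapping the scaler and the model in one pipeline keeps the validation and test sets out of the scaling statistics.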

### Choosing our model 

As we saw, our best accuracy model so far was random forest with 30 trees, resulting in accuracy of 78.85% on the validation set. Before we check it on the test set - let's investigate if 40 trees will be better with accuracy on the validation set.

In [21]:
# Random Forest with 40 trees: 
from sklearn.ensemble import RandomForestClassifier

# Define the model
model = RandomForestClassifier(random_state=12345, n_estimators=40) 

# Train the model on the training set:
model.fit(X_train, y_train)

# Investigate model accuracy for training and validation sets: 
train_predictions = model.predict(X_train) 
valid_predictions = model.predict(X_valid) 

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions)) 
print('Validation set:', accuracy_score(y_valid, valid_predictions)) 

Accuracy
Training set: 0.9984439834024896
Validation set: 0.7822706065318819


No, slightly less (78.23%). Maybe 20 trees will beat 30 trees on accuracy?

In [22]:
# Random Forest with 20 trees: 
from sklearn.ensemble import RandomForestClassifier

# Define the model
model = RandomForestClassifier(random_state=12345, n_estimators=20) 

# Train the model on the training set:
model.fit(X_train, y_train)

# Investigate model accuracy for training and validation sets: 
train_predictions = model.predict(X_train) 
valid_predictions = model.predict(X_valid) 

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions)) 
print('Validation set:', accuracy_score(y_valid, valid_predictions)) 

Accuracy
Training set: 0.9901452282157677
Validation set: 0.7776049766718507


No, we are back to only 77.76% accuracy on the validation set. We will stick with 30 trees! Let's run it once again, to set our 'model' variable back to the chosen model before checking it on the test set.

In [23]:
# Running again our chosen model - Random Forest with 30 trees: 
from sklearn.ensemble import RandomForestClassifier

# Define the model
model = RandomForestClassifier(random_state=12345, n_estimators=30) 

# Train the model on the training set:
model.fit(X_train, y_train)

# Investigate model accuracy for training and validation sets: 
train_predictions = model.predict(X_train) 
valid_predictions = model.predict(X_valid) 

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions)) 
print('Validation set:', accuracy_score(y_valid, valid_predictions)) 

Accuracy
Training set: 0.9963692946058091
Validation set: 0.7884914463452566


## Check the quality of the model using the test set

Now we are ready to check the accuracy of our chosen model on the test set for the first time. 

In [24]:
# Checking the quality of our chosen model - first time using the test set: 
test_predictions = model.predict(X_test) 

print('Accuracy')
print('Test set:', accuracy_score(y_test, test_predictions))   

Accuracy
Test set: 0.80248833592535


The result meets the requirement: we needed at least 75% accuracy on the test set, and we got 80.25%. Our chosen model, a random forest classifier with 30 trees, is suitable for recommending a plan based on user behavior.
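To make the metric concrete: accuracy is the number of correct predictions divided by the total number of predictions. The sketch below checks this against `accuracy_score` on a tiny made-up example (the two arrays are illustrative, not real predictions), and then converts the reported test accuracy back into a count of correctly classified rows out of the 643 in the test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Tiny made-up example: 10 true labels and 10 predictions.
y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 0, 1, 0, 0])

# Accuracy by hand: correct predictions / all predictions.
n_correct = int((y_true == y_pred).sum())
manual_accuracy = n_correct / len(y_true)
print(n_correct, manual_accuracy, accuracy_score(y_true, y_pred))

# The reported test accuracy, turned back into a number of correct rows.
reported_correct = round(0.80248833592535 * 643)
print(reported_correct, 'of 643 test rows classified correctly')
```

By the same arithmetic, the 0.75 threshold corresponds to at least 483 correct rows out of 643.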

## General conclusion

1. **Project Goal**: **Develop a model** that will recommend the right plan (Ultra or Smart), based on behavior data about subscribers who have already switched to the new plans, **with the highest possible accuracy**. Management defined that the **threshold for accuracy is 0.75** (checked on a test set that was not used at all for training). 

2. **Data**: Every observation in the dataset contains monthly behavior information about one user. The dataset has 3214 complete entries (no missing values, no explicit duplicates, and the data types are appropriate). The information given is as follows:
- calls — number of calls,
- minutes — total call duration in minutes,
- messages — number of text messages,
- mb_used — Internet traffic used in MB,
- is_ultra — plan for the current month (Ultra - 1, Smart - 0).
 
3. **Data split**: the data was randomly split into 3 sets: a training set (60% of the data), a validation set (20%), and a test set (20%). 

4. **Investigating and developing models**: 3 kinds of classification models were investigated: decision tree, random forest, and logistic regression. For each kind, hyperparameters were changed to find which model gives the best accuracy on the validation set after being trained on the training set. 

5. **Chosen model**: the chosen model was random forest classifier with 30 trees, that created the best accuracy on the validation set (78.85%), and was above the defined threshold.

6. **Accuracy check on the test set**: the accuracy of our chosen model, checked on the test set, is 80.25% (about 0.80), which meets the project's requirements. 

7. **Overall conclusion**: a random forest classifier model was developed to recommend the right plan (Ultra or Smart), based on behavior data about subscribers who have already switched to the new plans. The model has accuracy of 0.8 and the project goal is achieved.   