# Project: Megaline's newer plans Smart or Ultra

## Table of Contents

* [Project description](#Project_description)
* [Description of the data](#data)
* [Open and look through the data file](#open)
* [Split the source data](#split)
* [Check the quality of the model using the test set](#test)
* [Conclusion](#conclusion)

## Project description <a class="anchor" id="Project_description"></a>

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.
Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

## Description of the data  <a class="anchor" id="data"></a>

Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:
* calls — number of calls,
* minutes — total call duration in minutes,
* messages — number of text messages,
* mb_used — Internet traffic used in MB,
* is_ultra — plan for the current month (Ultra - 1, Smart - 0).

In [1]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
from scipy import stats as st
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from joblib import dump
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LogisticRegression

## Open and look through the data file <a class="anchor" id="open"></a>

In [2]:
try:
    data = pd.read_csv('/datasets/users_behavior.csv')
except FileNotFoundError:
    print("Something went wrong when opening the file")

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
data.head(5)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [4]:
report = data.isna().sum().to_frame()
report = report.rename(columns = {0: 'missing_values'})
report['% of total'] = (report['missing_values'] / data.shape[0]).round(2)
report.sort_values(by = 'missing_values', ascending = False)

Unnamed: 0,missing_values,% of total
calls,0,0.0
minutes,0,0.0
messages,0,0.0
mb_used,0,0.0
is_ultra,0,0.0


In [5]:
print("duplicates number: {}".format(data.duplicated().sum()))

duplicates number: 0


<div style="border:solid black 2px; padding: 20px"> <b>Note:</b><br>
No duplicates or missing data were found. The data set is ready for training a model.
</div>


## Split the source data into a training set, a validation set, and a test set <a class="anchor" id="split"></a>


<div style="border:solid black 2px; padding: 20px"> <b>Note:</b><br>

A separate test set doesn't exist, so the source data has to be split into three parts: training, validation, and test. The validation set and the test set are usually the same size, which gives a 3:1:1 split of the source data (60% training, 20% validation, 20% test).
</div>

In [6]:
# hold out 20% of the data as the test set; the remaining 80% is split again in the next cell
df_train, df_test = train_test_split(data, test_size=0.20, random_state=12345)

In [7]:
# df_valid: 25% of the remaining 80% is 20% of the total; the other 75% of the 80% is 60% of the total, giving a 3:1:1 split
df_train_2, df_valid = train_test_split(df_train, test_size=0.25, random_state=12345)

<div style="border:solid black 2px; padding: 20px"> <b>Note:</b><br>
The split produces three datasets:

* df_train_2 (training)
* df_valid (validation)
* df_test (test)

</div>

In [8]:
#checking the shape of the data set
df_train_2.shape

(1928, 5)

In [9]:
#checking the shape of the data set
df_valid.shape

(643, 5)

In [10]:
#checking the shape of the data set
df_test.shape

(643, 5)
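The two-step split above can be checked independently of the real file. The following is a minimal sketch, assuming a synthetic stand-in frame (`demo`, with random values) that only shares the row count of the source data; it shows that holding out 20% and then 25% of the remainder yields roughly 60/20/20 shares.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the source data (3214).
# The values are random; only the row counts matter for this check.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "minutes": rng.uniform(0, 1000, size=3214),
    "is_ultra": rng.integers(0, 2, size=3214),
})

# Step 1: hold out 20% of all rows as the test set.
rest, test = train_test_split(demo, test_size=0.20, random_state=12345)
# Step 2: 25% of the remaining 80% is 20% of the total (0.25 * 0.8 = 0.2).
train, valid = train_test_split(rest, test_size=0.25, random_state=12345)

sizes = {"train": len(train), "valid": len(valid), "test": len(test)}
shares = {name: round(n / len(demo), 2) for name, n in sizes.items()}
print(sizes)
print(shares)
```

The row counts depend only on the total number of rows and the two `test_size` values, so they can be compared directly with the shapes printed above.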

<div style="border:solid black 2px; padding: 20px"> <b>Models for classification:</b><br>
There are three classification models that I'll check:

* RandomForestClassifier
* DecisionTreeClassifier
* LogisticRegression

</div>

In [11]:
#preparing data for the classification models 
#iloc[row slicing, column slicing]
#training set
train_features = df_train_2.iloc[:, 0:4]
train_target = df_train_2['is_ultra']

In [12]:
#validation set
valid_features = df_valid.iloc[:, 0:4]
valid_target = df_valid['is_ultra']
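Positional slicing with `iloc[:, 0:4]` works as long as the target stays in the last column. As an alternative, the features can be selected by name. This is only a sketch: `split_features_target` is a hypothetical helper that is not part of the notebook, and `demo` is a tiny illustrative frame built from the rows shown by `data.head()`.

```python
import pandas as pd

# Tiny illustrative frame with the same column layout as the dataset
# (values taken from the first rows shown by data.head()).
demo = pd.DataFrame({
    "calls": [40.0, 85.0, 106.0],
    "minutes": [311.9, 516.75, 745.53],
    "messages": [83.0, 56.0, 81.0],
    "mb_used": [19915.42, 22696.96, 8437.39],
    "is_ultra": [0, 0, 1],
})


def split_features_target(df, target="is_ultra"):
    """Hypothetical helper: return (features, target) selected by column name."""
    return df.drop(columns=[target]), df[target]


features, target = split_features_target(demo)
print(list(features.columns))
```

Selecting by name keeps working if the column order changes, whereas the positional slice would silently pick up the wrong columns.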

In [25]:
# iterate over max_depth for DecisionTreeClassifier and print the
# validation accuracy for each depth (sanity check)
for i in range(1, 11):
    # set the DecisionTreeClassifier parameters
    model = DecisionTreeClassifier(random_state=12345, max_depth=i)
    # train the model on the training set
    model.fit(train_features, train_target)
    # predict on the validation set
    predictions_valid = model.predict(valid_features)
    print("max_depth =", i, ": ", end='')
    print(accuracy_score(valid_target, predictions_valid))


max_depth = 1 : 0.7387247278382582
max_depth = 2 : 0.7573872472783826
max_depth = 3 : 0.7651632970451011
max_depth = 4 : 0.7636080870917574
max_depth = 5 : 0.7589424572317263
max_depth = 6 : 0.7573872472783826
max_depth = 7 : 0.7744945567651633
max_depth = 8 : 0.7667185069984448
max_depth = 9 : 0.7620528771384136
max_depth = 10 : 0.7713841368584758
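The loop above reports validation accuracy only. One common way to look for overfitting is to print the training accuracy next to it and watch the gap grow with depth. The sketch below assumes synthetic data from `make_classification` as a stand-in for the real features, so its numbers are not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-feature binary problem standing in for the real data (assumption).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

results = []
for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    # Accuracy on the data the tree was fitted on vs. on held-out data.
    train_acc = accuracy_score(y_train, model.predict(X_train))
    valid_acc = accuracy_score(y_valid, model.predict(X_valid))
    results.append((depth, train_acc, valid_acc))
    print(f"max_depth = {depth}: train = {train_acc:.3f}, valid = {valid_acc:.3f}")
```

A training accuracy that keeps rising while the validation accuracy stalls or drops suggests that the deeper trees are memorizing the training set.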


In [26]:
# iterate over n_estimators for RandomForestClassifier and print the
# validation accuracy for each value (sanity check)
for i in range(1, 11):
    # set the RandomForestClassifier parameters
    model = RandomForestClassifier(random_state=12345, n_estimators=i)
    # train the model on the training set
    model.fit(train_features, train_target)
    # predict on the validation set
    predictions_valid = model.predict(valid_features)
    print("n_estimators =", i, ": ", end='')
    print(accuracy_score(valid_target, predictions_valid))

n_estimators = 1 : 0.702954898911353
n_estimators = 2 : 0.7573872472783826
n_estimators = 3 : 0.744945567651633
n_estimators = 4 : 0.7651632970451011
n_estimators = 5 : 0.7620528771384136
n_estimators = 6 : 0.7698289269051322
n_estimators = 7 : 0.7713841368584758
n_estimators = 8 : 0.7869362363919129
n_estimators = 9 : 0.7838258164852255
n_estimators = 10 : 0.7884914463452566


In [27]:
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(train_features, train_target)
predictions_valid = model.predict(valid_features)
print(accuracy_score(valid_target, predictions_valid)) 

0.6967340590979783


<div style="border:solid black 2px; padding: 20px"> <b>Note:</b>
The RandomForestClassifier model appears to fit the task best: with random_state=12345 and n_estimators=10 it reaches an accuracy of ~0.788 on the validation dataset, ahead of the best decision tree (~0.774 at max_depth=7) and logistic regression (~0.697).
</div>
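Instead of reading the best value off the printed list, the loop can keep track of the best model itself. This is a minimal sketch under the assumption of synthetic stand-in data; the names `best_n`, `best_score`, and `best_model` are illustrative and do not appear in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 4-feature binary problem standing in for the real data (assumption).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

scores = {}
best_n, best_score, best_model = None, 0.0, None
for n in range(1, 11):
    model = RandomForestClassifier(random_state=12345, n_estimators=n)
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the given data for classifiers.
    score = model.score(X_valid, y_valid)
    scores[n] = score
    if score > best_score:
        best_n, best_score, best_model = n, score, model

print(f"best n_estimators = {best_n}, validation accuracy = {best_score:.3f}")
```

Keeping the fitted `best_model` means the same object that won on the validation set can be evaluated on the test set, which avoids retyping the hyperparameters by hand.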

## Check the quality of the model using the test set <a class="anchor" id="test"></a>

In [23]:
#preparing the data of the test set
test_features = df_test.iloc[:, 0:4]
test_target = df_test['is_ultra']

In [28]:
#using the RandomForestClassifier with random_state=12345,n_estimators=11
model = RandomForestClassifier(random_state=12345,n_estimators=11)
model.fit(train_features, train_target)
predictions_test = model.predict(test_features)
print("accuracy_score:",accuracy_score(test_target, predictions_test)) 

accuracy_score: 0.7822706065318819
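As a further sanity check, a test accuracy can be compared with a trivial baseline that always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data; the roughly 70/30 class balance is an assumption made for illustration and is not taken from the dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data (assumption: roughly 70/30 class shares).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.7, 0.3],
                           random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=12345)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Same forest settings as in the test cell above.
forest = RandomForestClassifier(random_state=12345, n_estimators=11)
forest.fit(X_train, y_train)
forest_acc = forest.score(X_test, y_test)

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"forest accuracy:   {forest_acc:.3f}")
```

A model is only useful if it clearly beats this baseline; on imbalanced data the baseline alone can already reach a fairly high accuracy.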


<div style="border:solid black 2px; padding: 20px"> <b>Note:</b><br>
On the test dataset, the RandomForestClassifier with random_state=12345 and n_estimators=11 reaches an accuracy of ~0.782, which is above the 0.75 threshold, so the choice made on the validation set holds.
</div>

## Conclusion <a class="anchor" id="conclusion"></a>

On the validation set, the random forest has the highest accuracy (~0.788 with n_estimators=10), most likely because it uses an ensemble of trees instead of just one.
The runner-up is the decision tree, with a validation accuracy of up to ~0.774 (max_depth=7).
Logistic regression has the lowest quality of prediction (~0.697). The model is simple, which limits overfitting but apparently also limits how well it fits this data.
The model I choose to analyze subscribers' behavior and recommend one of Megaline's newer plans, Smart or Ultra, is RandomForestClassifier. With random_state=12345 and n_estimators=11 it reaches an accuracy of ~0.782 on the test set, above the 0.75 threshold.