# Project description

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.


You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.


Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset. 

Importing the libraries and taking a first look at the data:

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
#from sklearn.metrics import mean_squared_error

In [5]:
try:
    data = pd.read_csv('C:\\Users\\aviv\\Downloads\\users_behavior.csv')
except FileNotFoundError:
    # fall back to the platform path when the local copy is missing
    data = pd.read_csv('/datasets/users_behavior.csv')

data.sample(10)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
2828,67.0,433.41,80.0,15662.73,0
873,71.0,556.84,71.0,19869.21,0
2940,66.0,430.35,24.0,19005.98,0
2731,70.0,414.64,0.0,21102.9,0
55,13.0,106.03,16.0,37328.45,1
1597,92.0,719.71,66.0,23827.53,0
1014,96.0,655.52,2.0,20432.78,0
846,114.0,734.12,134.0,25969.8,0
350,65.0,423.06,40.0,18625.97,0
466,71.0,408.79,70.0,22576.01,0


In [6]:
data.describe()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
count,3214.0,3214.0,3214.0,3214.0,3214.0
mean,63.038892,438.208787,38.281269,17207.673836,0.306472
std,33.236368,234.569872,36.148326,7570.968246,0.4611
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.575,9.0,12491.9025,0.0
50%,62.0,430.6,30.0,16943.235,0.0
75%,82.0,571.9275,57.0,21424.7,1.0
max,244.0,1632.06,224.0,49745.73,1.0


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and briefly inspected!

</div>

# Splitting the source data to training, validation and test sets

I first split the data into features and target.

Then I split it into a training set and a held-out set, 6:4.
Finally, I split the held-out set in half into validation and test sets, for a final ratio of 6:2:2.

In [8]:
features = data.drop(['is_ultra'], axis=1)
target = data['is_ultra']


In [9]:
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.4, random_state=33)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_valid, target_valid, test_size=0.5, random_state=33)

In [62]:

features_valid.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 643 entries, 2185 to 1108
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     643 non-null    float64
 1   minutes   643 non-null    float64
 2   messages  643 non-null    float64
 3   mb_used   643 non-null    float64
dtypes: float64(4)
memory usage: 25.1 KB


In [63]:
features_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1928 entries, 622 to 3092
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     1928 non-null   float64
 1   minutes   1928 non-null   float64
 2   messages  1928 non-null   float64
 3   mb_used   1928 non-null   float64
dtypes: float64(4)
memory usage: 75.3 KB


In [65]:
features_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 643 entries, 1724 to 2487
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     643 non-null    float64
 1   minutes   643 non-null    float64
 2   messages  643 non-null    float64
 3   mb_used   643 non-null    float64
dtypes: float64(4)
memory usage: 25.1 KB
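The row counts above (1928 / 643 / 643) follow directly from the two-step split. As a minimal sketch, the same two calls can be run on a synthetic stand-in of the same length to check the sizes and shares. The random arrays here are an assumption for illustration, not the Megaline data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 3214 rows and 4 feature columns,
# matching only the shape of the real table.
rng = np.random.default_rng(0)
features = rng.random((3214, 4))
target = (rng.random(3214) < 0.3).astype(int)

# First cut: 60% train, 40% held out.
features_train, features_rest, target_train, target_rest = train_test_split(
    features, target, test_size=0.4, random_state=33)
# Second cut: halve the held-out part into validation and test (20% each).
features_valid, features_test, target_valid, target_test = train_test_split(
    features_rest, target_rest, test_size=0.5, random_state=33)

sizes = {
    "train": len(features_train),
    "valid": len(features_valid),
    "test": len(features_test),
}
shares = {name: round(n / len(features), 3) for name, n in sizes.items()}
print(sizes)
print(shares)
```

Passing `stratify=target` to `train_test_split` would additionally keep the class ratio the same in every part; the notebook does not do this, so it is left out of the sketch.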


# Investigating the quality of different models by changing hyperparameters

Let's train a decision tree classifier and experiment with its max_depth parameter.

In [66]:

best_model = None
best_result = 0
best_depth = 0
for depth in range(1, 10):
    model = DecisionTreeClassifier(random_state=33, max_depth=depth)  # create a model with the given depth
    model.fit(features_train, target_train)  # train the model
    result = model.score(features_valid, target_valid)  # accuracy on the validation set
    if result > best_result:
        best_model = model
        best_result = result
        best_depth = depth

print("Accuracy of the best model:", best_result, "depth of best model:", best_depth)

Accuracy of the best model: 0.7916018662519441 depth of best model: 8


The search reports the best validation accuracy, about 0.79, at a max depth of 8. Shallower trees tend to underfit, while deeper trees risk overfitting.
The final check on the test set uses a shallower decision tree with a max_depth of 3, which is less prone to overfitting.
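A depth sweep can also record training accuracy next to validation accuracy, which makes underfitting and overfitting easier to spot. This is a minimal sketch on synthetic data from `make_classification` (an assumption for illustration; its numbers will not match the Megaline results):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption, not the Megaline table).
X, y = make_classification(n_samples=2000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=33)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=33)

# For every depth, keep (training accuracy, validation accuracy).
scores = {}
for depth in range(1, 10):
    model = DecisionTreeClassifier(random_state=33, max_depth=depth)
    model.fit(X_train, y_train)
    scores[depth] = (model.score(X_train, y_train),
                     model.score(X_valid, y_valid))

# Choose the depth by validation accuracy only.
best_depth = max(scores, key=lambda d: scores[d][1])
for depth, (train_acc, valid_acc) in scores.items():
    print(f"depth={depth}: train={train_acc:.3f}, valid={valid_acc:.3f}")
print("best depth on validation:", best_depth)
```

A growing gap between the two columns at larger depths is the usual sign of overfitting; low values in both columns at small depths point to underfitting.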

Let's train a random forest classifier next.

In [67]:
best_score = 0
best_est = 0
for est in range(1, 11):  # try forests of 1 to 10 trees
    model = RandomForestClassifier(random_state=33, n_estimators=est)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)  # accuracy on the validation set
    if score > best_score:
        best_score = score
        best_est = est
print("Accuracy of the best random forest classifier model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best random forest classifier model on the validation set (n_estimators = 6): 0.7884914463452566


Now let's try a logistic regression model.

In [68]:
model = LogisticRegression(random_state=33)
model.fit(features_train, target_train)
model.score(features_valid, target_valid)

0.76049766718507

# Check the quality of the model using the test set.

In [69]:
# the train/test splits from above are reused; only the final model is rebuilt
final_model = DecisionTreeClassifier(random_state=33, max_depth=3)
final_model.fit(features_train, target_train)
final_model.score(features_test, target_test)  # accuracy on the held-out test set

0.8040435458786936

# Sanity check

To sanity check the model in classification problems, we need to compare it with chance.

Let's assume we have a model that assigns "0" or "1" randomly with a 50/50 chance. What is the accuracy of the model?

The model's answers are not linked to correct answers, so the probability of guessing “1” is 50% (the same for “0“), meaning the accuracy is 0.5.
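The 0.5 figure does not depend on the class balance: if a share p of the labels is "1", a coin flip is right with probability p·0.5 + (1 − p)·0.5 = 0.5. A small simulation illustrates this. The label list below is hypothetical (about 31% ones, similar to the share of Ultra users), and `random_guess_accuracy` is a helper named here only for illustration:

```python
import random


def random_guess_accuracy(labels, seed=0):
    """Accuracy of a model that answers 0 or 1 with equal probability."""
    rng = random.Random(seed)
    guesses = [rng.randint(0, 1) for _ in labels]
    hits = sum(guess == label for guess, label in zip(guesses, labels))
    return hits / len(labels)


# Hypothetical imbalanced labels: 31% ones, 69% zeros.
labels = [1] * 31_000 + [0] * 69_000
accuracy = random_guess_accuracy(labels)
print(f"random-guess accuracy: {accuracy:.3f}")
```

With this many labels the printed value should land very close to 0.5.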

Our model scored an accuracy of about 0.80 on the test set, well above the 0.5 expected from chance.

We can also compare our model to a constant model that always predicts the majority class 0. Its accuracy equals that class's share, about 0.69, and our model exceeds that as well.
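The majority-class share can be recovered from the `data.describe()` output: with 3214 rows and a mean `is_ultra` of 0.306472, about 985 subscribers are on Ultra. The sketch below rebuilds a label list from those counts and computes the constant-model baseline. It uses the whole dataset as an approximation, since the exact class counts of the test split are not shown in the notebook:

```python
from collections import Counter

# Counts reconstructed from data.describe(): 3214 rows, mean is_ultra = 0.306472.
n_rows = 3214
n_ultra = round(0.306472 * n_rows)
labels = [1] * n_ultra + [0] * (n_rows - n_ultra)

# A constant model always answers with the most frequent class,
# so its accuracy equals that class's share of the labels.
majority_class, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / n_rows

test_accuracy = 0.8040435458786936  # test-set score reported in the notebook
print(majority_class, round(baseline_accuracy, 4))  # 0 0.6935
print("model beats constant baseline:", test_accuracy > baseline_accuracy)
```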
