## <font color="green"><b>Develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.</b></font>

<b>Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.
You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.
Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.</b>

### <font color="green"><b>Importing all libraries needed for this project.</b></font>

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib

### <font color="green"><b>1. Open and look through the data file. Path to the file: datasets/users_behavior.csv</b></font>

In [2]:
data = pd.read_csv('/datasets/users_behavior.csv')
print(data.head(10))

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0
5   58.0   344.56      21.0  15823.37         0
6   57.0   431.64      20.0   3738.90         1
7   15.0   132.40       6.0  21911.60         0
8    7.0    43.39       3.0   2538.67         1
9   90.0   665.41      38.0  17358.61         0


### <font color="green"><b>Checking the datatypes</b></font>

In [3]:
data.dtypes

calls       float64
minutes     float64
messages    float64
mb_used     float64
is_ultra      int64
dtype: object

#### `calls` and `messages` are counts, so they should be integers rather than floats, and `is_ultra` holds only 0/1 values. So `calls` and `messages` are converted to int and `is_ultra` to bool.

In [4]:
data['calls'] = data['calls'].astype(int)
data['messages'] = data['messages'].astype(int)
data['is_ultra'] = data['is_ultra'].astype(bool)
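
Casting with `astype(int)` silently truncates any fractional part, so it can help to confirm first that the float columns hold only whole numbers. The sketch below is a minimal illustration on a small toy frame shaped like the rows printed above (the toy values are an assumption, not the real dataset):

```python
import pandas as pd

# Toy frame shaped like the real data (assumption: illustrative values only).
data = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0],
    "messages": [83.0, 56.0, 86.0],
    "is_ultra": [0, 1, 0],
})

# Confirm the float columns hold only whole numbers before casting to int.
for column in ["calls", "messages"]:
    assert (data[column] % 1 == 0).all(), f"{column} has fractional values"
    data[column] = data[column].astype(int)

# Confirm the target holds only 0/1 before casting to bool.
assert data["is_ultra"].isin([0, 1]).all()
data["is_ultra"] = data["is_ultra"].astype(bool)

print(data.dtypes)
```

If either assertion fails on the real data, the cast would lose information and the column should be inspected before converting.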

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null int64
minutes     3214 non-null float64
messages    3214 non-null int64
mb_used     3214 non-null float64
is_ultra    3214 non-null bool
dtypes: bool(1), float64(2), int64(2)
memory usage: 103.7 KB


### <font color="green"><b>Checking Missing Values</b></font>

In [6]:
data.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

#### No missing values in the dataset.

## <font color="green"><b>2. Split the source data into a training set, a validation set, and a test set.</b></font>

In [7]:
features=data.drop(['is_ultra'], axis=1)
targets=data['is_ultra']

In [8]:
# Split the data into a combined train+validation set and a test set (75%:25%)
split_features, test_features, split_targets, test_targets = train_test_split(
    features, targets, test_size=0.25, shuffle=True, random_state=12345)


In [9]:
# Split the remaining 75% again into train and validation sets (75%:25%)
train_features, valid_features, train_targets, valid_targets = train_test_split(
    split_features, split_targets, test_size=0.25, shuffle=True, random_state=12345)


In [10]:
print(train_features.shape)
print(train_targets.shape)
print(valid_features.shape)
print(valid_targets.shape)
print(test_targets.shape)
print(test_features.shape)

(1807, 4)
(1807,)
(603, 4)
(603,)
(804,)
(804, 4)
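
Because the second split takes 25% of the remaining 75%, the final proportions are roughly 56% train, 19% validation and 25% test rather than 75/25. The sketch below reproduces the two splits on placeholder arrays with the same number of rows (3214) and prints each share; the zero-filled arrays are an assumption used only to make the example self-contained:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same row count as the dataset (assumption: contents are irrelevant here).
features = np.zeros((3214, 4))
targets = np.zeros(3214, dtype=int)

# Same two-step split as in the cells above.
split_features, test_features, split_targets, test_targets = train_test_split(
    features, targets, test_size=0.25, shuffle=True, random_state=12345)
train_features, valid_features, train_targets, valid_targets = train_test_split(
    split_features, split_targets, test_size=0.25, shuffle=True, random_state=12345)

# Report each subset's size and its share of the full dataset.
total = len(features)
for name, part in [("train", train_features), ("valid", valid_features), ("test", test_features)]:
    print(f"{name}: {len(part)} rows ({len(part) / total:.1%})")
```

Passing `stratify=targets` to `train_test_split` would additionally keep the Smart/Ultra ratio the same in every subset, which may be worth considering given the roughly 70/30 class balance.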


## <font color="green"><b>3. Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.</b></font>

## <font color="green"><b>DecisionTreeClassifier</b></font>

In [11]:
for depth in range(1,31, 3):
    decisionTreeClassifier=DecisionTreeClassifier(random_state=12345, max_depth=depth)
    decisionTreeClassifier.fit(train_features, train_targets)
    print("Accuracy Score for depth",depth,":",decisionTreeClassifier.score(valid_features, valid_targets))

Accuracy Score for depth 1 : 0.7495854063018242
Accuracy Score for depth 4 : 0.7893864013266998
Accuracy Score for depth 7 : 0.7827529021558872
Accuracy Score for depth 10 : 0.7844112769485904
Accuracy Score for depth 13 : 0.7728026533996684
Accuracy Score for depth 16 : 0.7396351575456053
Accuracy Score for depth 19 : 0.7280265339966833
Accuracy Score for depth 22 : 0.736318407960199
Accuracy Score for depth 25 : 0.7313432835820896
Accuracy Score for depth 28 : 0.7429519071310116


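#### Validation accuracy ranges from about 0.73 to 0.79, peaking at max_depth=4 (0.789). Scores drop for depths above 13, which suggests that deeper trees overfit.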
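
Instead of reading the best depth off the printed lines, the scores can be collected and the maximum picked programmatically. The sketch below is one possible way to do that; it uses synthetic data from `make_classification` as a stand-in for the real train and validation sets (an assumption so that the example runs on its own):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): in the notebook, use the real train/valid sets.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
train_features, valid_features, train_targets, valid_targets = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Same depth grid as above, but scores are stored instead of only printed.
scores = {}
for depth in range(1, 31, 3):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(train_features, train_targets)
    scores[depth] = model.score(valid_features, valid_targets)

depth_scores = pd.Series(scores, name="valid_accuracy").rename_axis("max_depth")
print(depth_scores.round(3))
print("Best max_depth:", depth_scores.idxmax())
```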

## <font color="green"><b>RandomForestClassifier</b></font>

In [12]:
for estimators in range(1,56,2):
    for depth in range(5, 30, 5):
        randomForestClassifier=RandomForestClassifier(random_state=12345, max_depth=depth, n_estimators=estimators)
        randomForestClassifier.fit(train_features, train_targets)
        print("Accuracy Score for depth", depth,":")
        print("Accuracy Score for estimators",estimators,":",randomForestClassifier.score(valid_features, valid_targets))
        

Accuracy Score for depth 5 :
Accuracy Score for estimators 1 : 0.7943615257048093
Accuracy Score for depth 10 :
Accuracy Score for estimators 1 : 0.7711442786069652
Accuracy Score for depth 15 :
Accuracy Score for estimators 1 : 0.7678275290215588
Accuracy Score for depth 20 :
Accuracy Score for estimators 1 : 0.736318407960199
Accuracy Score for depth 25 :
Accuracy Score for estimators 1 : 0.736318407960199
Accuracy Score for depth 5 :
Accuracy Score for estimators 3 : 0.7860696517412935
Accuracy Score for depth 10 :
Accuracy Score for estimators 3 : 0.7976782752902156
Accuracy Score for depth 15 :
Accuracy Score for estimators 3 : 0.802653399668325
Accuracy Score for depth 20 :
Accuracy Score for estimators 3 : 0.75787728026534
Accuracy Score for depth 25 :
Accuracy Score for estimators 3 : 0.7645107794361525
Accuracy Score for depth 5 :
Accuracy Score for estimators 5 : 0.802653399668325
Accuracy Score for depth 10 :
Accuracy Score for estimators 5 : 0.7910447761194029
Accuracy Scor

In [22]:
best_score_params = {"score": 0, "n_estimators": None, "depth": None}

for estimators in range(1,56,2):
    print(f"Number of estimators: {estimators}")
    for depth in range(5, 30, 5):
        randomForestClassifier=RandomForestClassifier(random_state=12345, max_depth=depth, n_estimators=estimators)
        randomForestClassifier.fit(train_features, train_targets)
        score = randomForestClassifier.score(valid_features, valid_targets)
        print(f"Accuracy Score for depth={depth}: {score}")
        
        if score > best_score_params["score"]:
            best_score_params["score"] = score
            best_score_params["n_estimators"] = estimators
            best_score_params["depth"] = depth
        
print(f"Best of the best: {best_score_params}")



Number of estimators: 1
Accuracy Score for depth=5: 0.7943615257048093
Accuracy Score for depth=10: 0.7711442786069652
Accuracy Score for depth=15: 0.7678275290215588
Accuracy Score for depth=20: 0.736318407960199
Accuracy Score for depth=25: 0.736318407960199
Number of estimators: 3
Accuracy Score for depth=5: 0.7860696517412935
Accuracy Score for depth=10: 0.7976782752902156
Accuracy Score for depth=15: 0.802653399668325
Accuracy Score for depth=20: 0.75787728026534
Accuracy Score for depth=25: 0.7645107794361525
Number of estimators: 5
Accuracy Score for depth=5: 0.802653399668325
Accuracy Score for depth=10: 0.7910447761194029
Accuracy Score for depth=15: 0.7910447761194029
Accuracy Score for depth=20: 0.7711442786069652
Accuracy Score for depth=25: 0.7893864013266998
Number of estimators: 7
Accuracy Score for depth=5: 0.8076285240464345
Accuracy Score for depth=10: 0.8076285240464345
Accuracy Score for depth=15: 0.8009950248756219
Accuracy Score for depth=20: 0.7860696517412935
Ac

#### This output is hard to read: I am not able to group the scores by depth (5, 10, 15, 20, 25) and compare the estimators for each depth separately. Please suggest how to sort this.
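
One possible way to make the grid easier to compare is to store every score in a DataFrame and pivot it, so that each depth becomes a column and each number of estimators a row. The sketch below assumes synthetic data from `make_classification` in place of the real sets and a coarser `n_estimators` grid to keep it quick; the names `records`, `results` and `table` are illustrative:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): in the notebook, use the real train/valid sets.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
train_features, valid_features, train_targets, valid_targets = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Collect one record per (n_estimators, max_depth) pair instead of printing.
records = []
for estimators in range(1, 56, 6):  # coarser than range(1, 56, 2) to keep the sketch fast
    for depth in range(5, 30, 5):
        model = RandomForestClassifier(random_state=12345, max_depth=depth,
                                       n_estimators=estimators)
        model.fit(train_features, train_targets)
        records.append({
            "n_estimators": estimators,
            "max_depth": depth,
            "accuracy": model.score(valid_features, valid_targets),
        })

results = pd.DataFrame(records)

# Rows: number of estimators, columns: depth, cells: validation accuracy.
table = results.pivot(index="n_estimators", columns="max_depth", values="accuracy")
print(table.round(3))

# Top combinations, sorted by accuracy.
print(results.sort_values("accuracy", ascending=False).head())

best = results.loc[results["accuracy"].idxmax()]
print("Best:", int(best["n_estimators"]), "estimators at depth", int(best["max_depth"]))
```

With the table in this shape, `table[5]` gives all scores for depth 5, and `table.max()` gives the best score per depth, which replaces the separate per-depth loops.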

In [13]:
for estimators in range(1,56,2):
    randomForestClassifier=RandomForestClassifier(random_state=12345, max_depth=5, n_estimators=estimators)
    randomForestClassifier.fit(train_features, train_targets)
    print("Accuracy Score for estimators",estimators,":",randomForestClassifier.score(valid_features, valid_targets))

Accuracy Score for estimators 1 : 0.7943615257048093
Accuracy Score for estimators 3 : 0.7860696517412935
Accuracy Score for estimators 5 : 0.802653399668325
Accuracy Score for estimators 7 : 0.8076285240464345
Accuracy Score for estimators 9 : 0.8092868988391376
Accuracy Score for estimators 11 : 0.8043117744610282
Accuracy Score for estimators 13 : 0.8059701492537313
Accuracy Score for estimators 15 : 0.8059701492537313
Accuracy Score for estimators 17 : 0.8092868988391376
Accuracy Score for estimators 19 : 0.8092868988391376
Accuracy Score for estimators 21 : 0.8092868988391376
Accuracy Score for estimators 23 : 0.8092868988391376
Accuracy Score for estimators 25 : 0.8076285240464345
Accuracy Score for estimators 27 : 0.8076285240464345
Accuracy Score for estimators 29 : 0.8076285240464345
Accuracy Score for estimators 31 : 0.8059701492537313
Accuracy Score for estimators 33 : 0.8076285240464345
Accuracy Score for estimators 35 : 0.8076285240464345
Accuracy Score for estimators 37 :

#### With depth 5 the accuracy score is in range 0.786 - 0.817

In [14]:
for estimators in range(1,56,2):
    randomForestClassifier=RandomForestClassifier(random_state=12345, max_depth=10, n_estimators=estimators)
    randomForestClassifier.fit(train_features, train_targets)
    print("Accuracy Score for estimators",estimators,":",randomForestClassifier.score(valid_features, valid_targets))

Accuracy Score for estimators 1 : 0.7711442786069652
Accuracy Score for estimators 3 : 0.7976782752902156
Accuracy Score for estimators 5 : 0.7910447761194029
Accuracy Score for estimators 7 : 0.8076285240464345
Accuracy Score for estimators 9 : 0.8109452736318408
Accuracy Score for estimators 11 : 0.8092868988391376
Accuracy Score for estimators 13 : 0.8059701492537313
Accuracy Score for estimators 15 : 0.8192371475953566
Accuracy Score for estimators 17 : 0.8225538971807629
Accuracy Score for estimators 19 : 0.8275290215588723
Accuracy Score for estimators 21 : 0.8275290215588723
Accuracy Score for estimators 23 : 0.8308457711442786
Accuracy Score for estimators 25 : 0.8325041459369817
Accuracy Score for estimators 27 : 0.8291873963515755
Accuracy Score for estimators 29 : 0.8291873963515755
Accuracy Score for estimators 31 : 0.8325041459369817
Accuracy Score for estimators 33 : 0.8308457711442786
Accuracy Score for estimators 35 : 0.8258706467661692
Accuracy Score for estimators 37 

#### With depth 10 the accuracy score is in range 0.771 - 0.832

In [15]:
for estimators in range(1,56,2):
    randomForestClassifier=RandomForestClassifier(random_state=12345, max_depth=15, n_estimators=estimators)
    randomForestClassifier.fit(train_features, train_targets)
    print("Accuracy Score for estimators",estimators,":",randomForestClassifier.score(valid_features, valid_targets))

Accuracy Score for estimators 1 : 0.7678275290215588
Accuracy Score for estimators 3 : 0.802653399668325
Accuracy Score for estimators 5 : 0.7910447761194029
Accuracy Score for estimators 7 : 0.8009950248756219
Accuracy Score for estimators 9 : 0.8059701492537313
Accuracy Score for estimators 11 : 0.8076285240464345
Accuracy Score for estimators 13 : 0.802653399668325
Accuracy Score for estimators 15 : 0.8092868988391376
Accuracy Score for estimators 17 : 0.8175787728026535
Accuracy Score for estimators 19 : 0.8159203980099502
Accuracy Score for estimators 21 : 0.8159203980099502
Accuracy Score for estimators 23 : 0.8208955223880597
Accuracy Score for estimators 25 : 0.8192371475953566
Accuracy Score for estimators 27 : 0.8175787728026535
Accuracy Score for estimators 29 : 0.8159203980099502
Accuracy Score for estimators 31 : 0.814262023217247
Accuracy Score for estimators 33 : 0.814262023217247
Accuracy Score for estimators 35 : 0.8175787728026535
Accuracy Score for estimators 37 : 0.

#### With depth 15 the accuracy score is in range 0.767 - 0.827

In [16]:
for estimators in range(1,56,2):
    randomForestClassifier=RandomForestClassifier(random_state=12345, max_depth=20, n_estimators=estimators)
    randomForestClassifier.fit(train_features, train_targets)
    print("Accuracy Score for estimators",estimators,":",randomForestClassifier.score(valid_features, valid_targets))

Accuracy Score for estimators 1 : 0.736318407960199
Accuracy Score for estimators 3 : 0.75787728026534
Accuracy Score for estimators 5 : 0.7711442786069652
Accuracy Score for estimators 7 : 0.7860696517412935
Accuracy Score for estimators 9 : 0.7860696517412935
Accuracy Score for estimators 11 : 0.7976782752902156
Accuracy Score for estimators 13 : 0.8109452736318408
Accuracy Score for estimators 15 : 0.8092868988391376
Accuracy Score for estimators 17 : 0.8159203980099502
Accuracy Score for estimators 19 : 0.8126036484245439
Accuracy Score for estimators 21 : 0.8159203980099502
Accuracy Score for estimators 23 : 0.814262023217247
Accuracy Score for estimators 25 : 0.8192371475953566
Accuracy Score for estimators 27 : 0.814262023217247
Accuracy Score for estimators 29 : 0.814262023217247
Accuracy Score for estimators 31 : 0.8175787728026535
Accuracy Score for estimators 33 : 0.8208955223880597
Accuracy Score for estimators 35 : 0.8225538971807629
Accuracy Score for estimators 37 : 0.82

#### With depth 20 the accuracy score is in range 0.736 - 0.822

## <font color="green"><b>Logistic Regression</b></font>

In [17]:
logisticRegression=LogisticRegression()
logisticRegression.fit(train_features, train_targets)
print("Accuracy Score is",logisticRegression.score(valid_features, valid_targets))

Accuracy Score is 0.7330016583747927




#### Accuracy score of Logistic Regression is 0.73

#### The highest validation accuracy among all the models is 0.8325, reached by Random Forest with max_depth=10 and n_estimators=31, so I am choosing it as the best model.

In [18]:
best_model=RandomForestClassifier(random_state=12345, max_depth=10, n_estimators=31)
best_model.fit(split_features, split_targets)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=10, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=31,
                       n_jobs=None, oob_score=False, random_state=12345,
                       verbose=0, warm_start=False)
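
`joblib` is imported at the top of the notebook but not used afterwards. One possible use is to persist the chosen model so it can be reloaded without retraining. The sketch below assumes synthetic stand-in data and a hypothetical file name, and writes to a temporary directory:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): in the notebook, use split_features / split_targets.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)

best_model = RandomForestClassifier(random_state=12345, max_depth=10, n_estimators=31)
best_model.fit(X, y)

# Hypothetical file name, saved to a temporary directory for this sketch.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "megaline_plan_model.joblib")
    joblib.dump(best_model, model_path)
    restored_model = joblib.load(model_path)

# The reloaded model should give the same predictions as the original.
same_predictions = bool((restored_model.predict(X) == best_model.predict(X)).all())
print("Restored model reproduces predictions:", same_predictions)
```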

## <font color="green"><b>4.Check the quality of the model using the test set.</b></font>

In [19]:
print("Test set Accuracy for best model is", best_model.score(test_features, test_targets))

Test set Accuracy for best model is 0.8034825870646766


## <font color="green"><b>5. Additional task: sanity check the model</b></font>

In [20]:
predictions=pd.Series(0, index=test_targets.index)
print("Base Accuracy if all predictions are 0 is", accuracy_score(test_targets, predictions))

Base Accuracy if all predictions are 0 is 0.7002487562189055


In [21]:
predictions=pd.Series(1, index=test_targets.index)
print("Base Accuracy if all predictions are 1 is", accuracy_score(test_targets, predictions))

Base Accuracy if all predictions are 1 is 0.2997512437810945

#### Since this is a classification problem, we check the accuracy obtained when every prediction is the same class, either all 0 or all 1.

For the sanity check, we take the accuracy of always predicting the largest class as the baseline.

Since 0 is the largest class here, the baseline accuracy is 70%. Our trained model should score higher than this baseline, and it does: 80.3% on the test set.
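
The same baseline can also be obtained with scikit-learn's `DummyClassifier`, which always predicts the most frequent class seen during training. The sketch below is a minimal illustration on toy targets with a 70/30 class split (an assumption that mirrors the proportions above, not the real data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy targets (assumption): 7 of 10 belong to class 0, mirroring the ~70/30 split.
train_targets = np.array([0] * 7 + [1] * 3)
test_targets = np.array([0] * 7 + [1] * 3)

# DummyClassifier ignores feature values, so placeholder features are enough.
train_features = np.zeros((10, 1))
test_features = np.zeros((10, 1))

# Always predict the majority class learned from the training targets.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(train_features, train_targets)

baseline_accuracy = accuracy_score(test_targets, baseline.predict(test_features))
print("Baseline accuracy:", baseline_accuracy)
```

In the notebook, fitting the dummy on `split_features` / `split_targets` and scoring it on the test set should give the same 70% figure as the all-zeros prediction, as long as class 0 is also the majority in the training data.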

## <font color="green"><b>Conclusion</b></font>

#### RandomForestClassifier with max_depth=10 and n_estimators=31 reached 0.8325 accuracy on the validation set and 0.8035 on the test set. The test accuracy is above both the project threshold of 0.75 and the 70% baseline.