# Recommending a new plan to Megaline users based on their behavior

## 1. Loading data file
All necessary Python libraries are imported in this step. Then the data file is loaded and checked for issues.

In [29]:
# required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns

In [30]:
df=pd.read_csv('users_behavior.csv')
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [32]:
df.describe()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
count,3214.0,3214.0,3214.0,3214.0,3214.0
mean,63.038892,438.208787,38.281269,17207.673836,0.306472
std,33.236368,234.569872,36.148326,7570.968246,0.4611
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.575,9.0,12491.9025,0.0
50%,62.0,430.6,30.0,16943.235,0.0
75%,82.0,571.9275,57.0,21424.7,1.0
max,244.0,1632.06,224.0,49745.73,1.0


In [33]:
df.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [34]:
df.duplicated().sum()

0

In [35]:
#check correlation between columns
df.corr(method='pearson').round(3).style.background_gradient(axis=1)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
calls,1.0,0.982,0.177,0.286,0.207
minutes,0.982,1.0,0.173,0.281,0.207
messages,0.177,0.173,1.0,0.196,0.204
mb_used,0.286,0.281,0.196,1.0,0.199
is_ultra,0.207,0.207,0.204,0.199,1.0


In [37]:
#correlation between the calls and minutes columns is very high (0.982), so one of them is dropped
df=df.drop('minutes', axis=1)
df.head()

Unnamed: 0,calls,messages,mb_used,is_ultra
0,40.0,83.0,19915.42,0
1,85.0,56.0,22696.96,0
2,77.0,86.0,21060.45,0
3,106.0,81.0,8437.39,1
4,66.0,1.0,14502.75,0


### 1.1 Conclusion
The data (5 columns and 3214 rows) has been loaded and checked for missing values, data types, duplicates and other possible issues. No critical issue has been observed. The `calls` and `minutes` columns are almost perfectly correlated (0.982), so `minutes` was dropped. We are ready to split our data into training, validation and testing datasets.
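
The correlation check that led to dropping `minutes` can also be done programmatically. The sketch below is only an illustration: the helper name `highly_correlated_pairs`, the threshold of 0.9 and the toy data are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for column pairs whose |Pearson r| exceeds threshold."""
    corr = frame.corr(method='pearson')
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, round(float(r), 3)))
    return pairs

# toy data (hypothetical, not the Megaline file): minutes is roughly proportional to calls
rng = np.random.default_rng(0)
calls = rng.integers(0, 200, size=500).astype(float)
toy = pd.DataFrame({
    'calls': calls,
    'minutes': calls * 7 + rng.normal(0, 20, size=500),
    'messages': rng.integers(0, 100, size=500).astype(float),
})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

On the real data, the same helper could be applied to the feature columns only, so that the target is never a candidate for dropping.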

## 2. Splitting into training, validation and testing sets

In [38]:
# get training data as 60% of full data
df_train, df_rest = train_test_split(df, test_size=0.4, random_state=234) #fixed random_state for reproducibility

# now divide rest into two parts to get validation and test datasets
df_valid, df_test = train_test_split(df_rest, test_size=0.5, random_state=234)

#check the ratio of each dataset to full dataset
ratios=(round(len(df_train)/len(df), 2),
        round(len(df_valid)/len(df), 2),
        round(len(df_test)/len(df), 2))
print('Ratios of each dataset to full data: \nTraining:', ratios[0], '\nValidation:', ratios[1], '\nTest:', ratios[2])

Ratios of each dataset to full data: 
Training: 0.6 
Validation: 0.2 
Test: 0.2


In [39]:
#prepare feature and target data for each dataset
#training dataset
try:
    features_train=df_train.drop('is_ultra', axis=1)
    target_train=df_train['is_ultra']

    #validation dataset
    features_valid=df_valid.drop('is_ultra', axis=1)
    target_valid=df_valid['is_ultra']


    #testing dataset
    features_test=df_test.drop('is_ultra', axis=1)
    target_test=df_test['is_ultra']
    print('Mission complete!')
except KeyError as err:
    #a missing column name is the only error expected here
    print('Something went wrong:', err)

Mission complete!


### 2.1 Conclusion
We have been provided with only one dataset and no separate test dataset. Hence we divide the whole data into three parts: training, validation and testing datasets with ratios of 3:1:1. The `train_test_split` function of the scikit-learn library divides data into two parts only. Therefore, we first produce a training dataset consisting of 60% of the whole data, then divide the rest into two equal parts, each making up 20% of the whole data.
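
The two-step split described above can be sketched on toy data as follows. The `stratify` argument is an addition that the notebook itself does not use; it is shown here only as an option for keeping the share of each class similar across the three parts. The toy frame and its column values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame (hypothetical): one feature and a binary target with about 30% ones
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'calls': rng.integers(0, 200, size=1000),
    'is_ultra': rng.choice([0, 1], size=1000, p=[0.7, 0.3]),
})

# first cut: 60% training, 40% held out
toy_train, toy_rest = train_test_split(
    toy, test_size=0.4, random_state=0, stratify=toy['is_ultra'])

# second cut: split the held-out 40% into two equal halves (20% + 20%)
toy_valid, toy_test = train_test_split(
    toy_rest, test_size=0.5, random_state=0, stratify=toy_rest['is_ultra'])

sizes = (len(toy_train), len(toy_valid), len(toy_test))
print(sizes)
```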

We have to build a model that suggests either the **Ultra** or the **Smart** plan, depending on how many calls, messages and megabytes of internet traffic a subscriber uses. So, the `is_ultra` column is our target, and the `calls`, `messages` and `mb_used` columns are our features (the `minutes` column was dropped in step 1 because it is almost perfectly correlated with `calls`). Features and target data have been prepared for the training, validation and testing datasets.

## 3. Applying different models
Our model should tell us 1 if a subscriber needs the **Ultra** plan, or 0 if a subscriber needs the **Smart** plan. Hence, we will build a classification model. We will use DecisionTreeClassifier, RandomForestClassifier and LogisticRegression classification models of sklearn. We will tune basic but important hyperparameters of each model and compare their accuracy.
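
As a rough illustration of the comparison planned here, the three model families can be fitted and scored with one small helper. This is a sketch on synthetic data: `fit_and_score` and the `make_classification` stand-in data are assumptions for illustration, and the hyperparameter values are arbitrary rather than tuned.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def fit_and_score(model, X_train, y_train, X_valid, y_valid):
    """Fit the model and return (training accuracy, validation accuracy)."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_valid, y_valid)

# synthetic stand-in data (hypothetical) with three features
X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

candidates = {
    'decision_tree': DecisionTreeClassifier(random_state=0, max_depth=4),
    'random_forest': RandomForestClassifier(random_state=0, n_estimators=20),
    'logistic_regression': LogisticRegression(random_state=0, solver='liblinear'),
}

results = {name: fit_and_score(model, X_train, y_train, X_valid, y_valid)
           for name, model in candidates.items()}

for name, (acc_train, acc_valid) in results.items():
    print(name, 'train:', round(acc_train, 3), 'valid:', round(acc_valid, 3))
```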

### 3.1 DecisionTreeClassifier

In [40]:
#build a decision tree model using different max_depth values and compare the accuracies
#instead of trying each one by one, do it in for loop
depth=[]      #list for max_depth values
dt_accu_t=[]  #list for accuracy of training data
dt_accu_v=[]  #list for accuracy of validation data
for i in range(1,10):
    dt_model=DecisionTreeClassifier(random_state=234, max_depth=i)
    dt_model.fit(features_train, target_train)
    dt_accu_v.append(dt_model.score(features_valid, target_valid))
    dt_accu_t.append(dt_model.score(features_train, target_train))
    depth.append(i)

In [41]:
# build a function that makes a table for easy visualization: hyperparameter values as columns,
# accuracy on training and validation data as rows
def scorer(train, valid, col):
    scores=pd.DataFrame([train, valid], columns=col, index=['train', 'valid'])\
        .round(3).style.background_gradient(axis=1)
    return scores

In [42]:
#table of max_depth and accuracy values
dt_scores=scorer(dt_accu_t, dt_accu_v, depth)
dt_scores

Unnamed: 0,1,2,3,4,5,6,7,8,9
train,0.767,0.795,0.81,0.822,0.829,0.835,0.85,0.864,0.874
valid,0.723,0.77,0.776,0.788,0.782,0.771,0.778,0.764,0.764


### 3.2 RandomForestClassifier

In [43]:
#build a RandomForest model using different number of trees (n_estimators), compare accuracy scores 
rf_accu_v=[]
rf_accu_t=[]
estimators=[1, 10 , 20, 30, 40, 50, 60, 70, 80, 90, 100]
for i in estimators:
    rf_model=RandomForestClassifier(random_state=234, n_estimators=i)
    rf_model.fit(features_train, target_train)
    rf_accu_v.append(rf_model.score(features_valid, target_valid))
    rf_accu_t.append(rf_model.score(features_train, target_train))

In [44]:
#use previously built function to make a table
rf_scores=scorer(rf_accu_t, rf_accu_v, estimators)
rf_scores

Unnamed: 0,1,10,20,30,40,50,60,70,80,90,100
train,0.898,0.981,0.993,0.997,0.999,0.999,1.0,1.0,1.0,1.0,1.0
valid,0.72,0.773,0.776,0.787,0.785,0.782,0.784,0.79,0.79,0.792,0.788


### 3.3 LogisticRegression

In [45]:
# build a logisticregression model and compare accuracy using training and validation data
lg_model=LogisticRegression(random_state=234, solver='liblinear')

#fit and check accuracy
lg_model.fit(features_train, target_train)
lg_accu_v=lg_model.score(features_valid, target_valid)
lg_accu_t=lg_model.score(features_train, target_train)


#print the results
print('Accuracy of logistic regression model using training data:', round(lg_accu_t, 3),
     'with validation data:', round(lg_accu_v, 3))

Accuracy of logistic regression model using training data: 0.741 with validation data: 0.708


### 3.4 Conclusion
Three different classification models were compared: `DecisionTree, RandomForest and LogisticRegression`. Basic hyperparameters of the models were tuned and the validation accuracies compared. The highest validation accuracy was achieved with the `RandomForest` model, **0.792**, with 90 trees. Beyond about 30 trees the validation accuracy changed very little (0.782 to 0.792), while training accuracy reached 1.0, which indicates overfitting to the training data. Runner-up was the `DecisionTree` model with an accuracy of **0.788** (max_depth of 4). The `LogisticRegression` model achieved an accuracy of only **0.708**.
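
The best hyperparameter value per model can be read off the score tables mechanically. The sketch below copies the validation accuracies from the tables in sections 3.1 and 3.2 into pandas Series and picks the maximum with `idxmax`; the variable names are illustrative only.

```python
import pandas as pd

# validation accuracies copied from the DecisionTree table (index = max_depth)
dt_valid = pd.Series(
    [0.723, 0.77, 0.776, 0.788, 0.782, 0.771, 0.778, 0.764, 0.764],
    index=range(1, 10))

# validation accuracies copied from the RandomForest table (index = n_estimators)
rf_valid = pd.Series(
    [0.72, 0.773, 0.776, 0.787, 0.785, 0.782, 0.784, 0.79, 0.79, 0.792, 0.788],
    index=[1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])

best_depth, best_dt_score = dt_valid.idxmax(), dt_valid.max()
best_trees, best_rf_score = rf_valid.idxmax(), rf_valid.max()

print('DecisionTree: max_depth =', best_depth, 'accuracy =', best_dt_score)
print('RandomForest: n_estimators =', best_trees, 'accuracy =', best_rf_score)
```

Differences of a few thousandths on a validation set of 643 rows correspond to only a handful of observations, so neighbouring settings are practically tied.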

## 4. Checking quality of the model
The RandomForest model performed best on the validation data. Hence we use it with 90 estimators to check its accuracy on the testing dataset. More training data generally helps a model, so we combine the previous training and validation datasets to train the final model.

In [46]:
#build a RandomForestClassifier model with n_estimators=90
final_model=RandomForestClassifier(random_state=234, n_estimators=90)

In [47]:
#number of observations in each previous datasets
display(len(df_train))
display(len(df_valid))
len(df_test)

1928

643

643

In [48]:
#combine training and validation datasets
train_big=pd.concat([df_train, df_valid])

#prepare features and target data
features_train_big=train_big.drop('is_ultra', axis=1)
target_train_big=train_big['is_ultra']

#train the model using training dataset
final_model.fit(features_train_big, target_train_big)


#check the accuracy of the model
accu_test=final_model.score(features_test, target_test)

print('Accuracy of model is:', round(accu_test, 3))

Accuracy of model is: 0.787


### 4.1 Conclusion
A model has been built using `RandomForestClassifier` with 90 trees. The amount of training data affects the accuracy of a model. Therefore, to increase the number of observations for training the model, the previous training dataset was combined with the validation dataset (2571 observations in total). The model was then evaluated on the testing dataset, where its **accuracy was 0.787**.

## 5. Sanity check
We will create fake answers made of ones and zeros and measure how accurate they are against the true test labels. These scores serve as baselines for the accuracy of our model.
Let's follow three options to make fake answers:
- random mixture of ones and zeros;
- only ones;
- only zeros.

In [49]:
#create random ones and zeros
size=len(target_test)  #length of array
np.random.seed(234)  #to replicate the results

#create the Pandas Series object
target_rand=pd.Series(np.random.randint(2, size=size))

#share of fake answers that match the true labels (true labels first, predictions second)
accu_random=accuracy_score(target_test, target_rand)

#print the result
print('If the model randomly retrieves ones and zeros, then its accuracy would be:', round(accu_random, 3))

If the model randomly retrieves ones and zeros, then its accuracy would be: 0.505


In [50]:
#create only ones
size=len(target_test)  #length of array

#create the Pandas Series object
target_ones=pd.Series(np.ones(size))

#share of fake answers that match the true labels (true labels first, predictions second)
accu_ones=accuracy_score(target_test, target_ones)


#print the result
print('If the model retrieves only ones, then its accuracy would be:', round(accu_ones, 3))

If the model retrieves only ones, then its accuracy would be: 0.331


In [51]:
#create only zeros
size=len(target_test)  #length of array

#create the Pandas Series object
target_zeros=pd.Series(np.zeros(size))

#share of fake answers that match the true labels (true labels first, predictions second)
accu_zeros=accuracy_score(target_test, target_zeros)


#print the result
print('If the model retrieves only zeros, then its accuracy would be:', round(accu_zeros, 3))

If the model retrieves only zeros, then its accuracy would be: 0.669


### 5.1 Conclusion
Sanity check of the model has been performed with three different approaches.
- If the model recommends the **Ultra** and **Smart** plans randomly, then the accuracy would be 0.505;
- If the model recommends only the **Ultra** plan for everyone, then the accuracy would be only 0.331;
- If the model recommends only the **Smart** plan for everyone, then the accuracy would be 0.669.

With an accuracy of 0.787 on the testing dataset, our model outperforms all three baselines, including the strongest one (always recommending **Smart**, 0.669).
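
The same three baselines can also be produced with scikit-learn's `DummyClassifier`, which the notebook does not use; it is shown here as an alternative sketch on synthetic labels (hypothetical data with roughly 30% ones). For simplicity the baselines are scored on the same data they were fitted on, which is harmless here because they ignore the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))             # features are ignored by DummyClassifier
y = (rng.random(1000) < 0.3).astype(int)   # about 30% ones, like the share of Ultra users

# always predict 0, always predict 1, and predict 0/1 with equal probability
always_zero = DummyClassifier(strategy='constant', constant=0).fit(X, y)
always_one = DummyClassifier(strategy='constant', constant=1).fit(X, y)
coin_flip = DummyClassifier(strategy='uniform', random_state=0).fit(X, y)

acc_zero = always_zero.score(X, y)
acc_one = always_one.score(X, y)
acc_coin = coin_flip.score(X, y)

print('only zeros:', round(acc_zero, 3))
print('only ones:', round(acc_one, 3))
print('random:', round(acc_coin, 3))
```

The accuracy of the "only zeros" baseline equals the share of zeros in the labels, and the "only ones" accuracy is its complement, which is why 0.669 and 0.331 in the results above sum to 1.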

## 6. Summary
- Data with 5 columns and 3214 rows has been loaded and checked for missing values, data types, duplicates and other possible issues. No critical issue has been observed. The `minutes` column was dropped because it is almost perfectly correlated with `calls` (0.982).


- The whole data has been divided into three parts: training, validation and testing datasets with ratios of 3:1:1 or 60%, 20% and 20% of whole data.


- A model that suggests either the **Ultra** or the **Smart** plan, depending on how many calls, messages and megabytes of internet traffic a subscriber uses, has been planned. So, the `is_ultra` column was assigned as the target, and the `calls`, `messages` and `mb_used` columns were assigned as features.


- Three different classification models have been compared: `DecisionTree, RandomForest and LogisticRegression`. Basic hyperparameters of the models have been tuned and the validation accuracies compared. The highest validation accuracy was achieved with the `RandomForest` model, **0.792**, with 90 trees. Beyond about 30 trees the validation accuracy changed very little. Runner-up was the `DecisionTree` model with an accuracy of **0.788** (max_depth of 4). The `LogisticRegression` model achieved an accuracy of only **0.708**.


- A model has been built using `RandomForestClassifier` with 90 trees, based on the previous step. The amount of training data affects the accuracy of a model. Therefore, to increase the number of observations for training the model, the previous training dataset has been combined with the validation dataset. The model has then been evaluated on the testing dataset, where its **accuracy was 0.787**.


- Sanity check of the model has been performed with three different approaches.
    - If the model recommends the **Ultra** and **Smart** plans randomly, then the accuracy would be 0.505;
    - If the model recommends only the **Ultra** plan for everyone, then the accuracy would be only 0.331;
    - If the model recommends only the **Smart** plan for everyone, then the accuracy would be 0.669.

    With an accuracy of 0.787 on the testing dataset, our model outperforms all three baselines, including the strongest one (always recommending **Smart**, 0.669).