# Recommendation for Megaline's plan

The telecom company Megaline has found that many of its subscribers still use legacy plans.
We want to develop a model that analyzes subscribers' behavior and recommends one of Megaline's newer plans: Smart or Ultra.
We have access to behavior data for subscribers who have already switched to the new plans.
The model should have the highest possible accuracy; for this project, the accuracy threshold is 0.75.

## Opening the data file

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score 
from sklearn.dummy import DummyClassifier

In [2]:
# open the data file
df = pd.read_csv('/datasets/users_behavior.csv')
df.head(10)

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| 5 | 58.0 | 344.56 | 21.0 | 15823.37 | 0 |
| 6 | 57.0 | 431.64 | 20.0 | 3738.9 | 1 |
| 7 | 15.0 | 132.4 | 6.0 | 21911.6 | 0 |
| 8 | 7.0 | 43.39 | 3.0 | 2538.67 | 1 |
| 9 | 90.0 | 665.41 | 38.0 | 17358.61 | 0 |


**Data Description:**  
`calls`: number of calls  
`minutes`: total call duration in minutes  
`messages`: number of text messages  
`mb_used`: Internet traffic used in MB  
`is_ultra`: plan for the current month (Ultra = 1, Smart = 0)

In [3]:
# show general info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
# show statistical summary
df.describe()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


Some data types should be changed: `calls` and `messages` hold whole numbers, so we'll convert them to integers.

In [5]:
# change data types
df = df.astype({"calls": int, "messages": int})
df.dtypes

calls         int64
minutes     float64
messages      int64
mb_used     float64
is_ultra      int64
dtype: object
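Before casting, it can be worth confirming that the float columns really hold whole numbers, since `astype(int)` silently truncates any fractional part. The following is a minimal sketch on a small made-up frame; the `sample` data and the `is_whole` helper are illustrative assumptions, not part of the project.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
sample = pd.DataFrame({"calls": [40.0, 85.0, 77.0], "messages": [83.0, 56.0, 86.0]})


def is_whole(series: pd.Series) -> bool:
    """Return True if every value in the series has no fractional part."""
    return bool((series % 1 == 0).all())


# Cast only the columns that pass the check
safe_columns = [col for col in ["calls", "messages"] if is_whole(sample[col])]
sample = sample.astype({col: int for col in safe_columns})
print(sample.dtypes)
```

If a column failed the check, it would be left as a float and could be inspected before deciding how to round it.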

In [6]:
# check for duplicates
df.duplicated().sum()

0

In [7]:
# the share of 'Ultra' users (the mean of a 0/1 column is the share of ones)
df['is_ultra'].mean().round(2)

0.31

In [8]:
# the share of 'Smart' users
(1 - df['is_ultra'].mean()).round(2)

0.69

The data types of two columns (`calls` and `messages`) were changed to integers.
There were no missing values or duplicates.
The plans are not balanced: about 31% of users are on Ultra and 69% on Smart. This may affect the results, since a model can lean toward predicting the more common plan.
Now we can move on to choosing a model for prediction.

## Choosing the best model

We will split the data into training, validation, and test sets in a 3:1:1 ratio.
For each model type we will find the most accurate configuration, and then we will choose the best type for our task.

In [9]:
# splitting the source data into train, validation and test sets
df_train, df_check = train_test_split(df, test_size=0.4, random_state=12345)
df_valid, df_test = train_test_split(df_check, test_size=0.5, random_state=12345)

features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']

features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']

features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']
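Because the classes are imbalanced (about 31% Ultra), one option is to stratify the split so that each subset keeps the same class ratio. The notebook's split above does not do this; the following is only a sketch on synthetic data, and the `toy` frame is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 70 Smart (0) and 30 Ultra (1) rows
toy = pd.DataFrame({"calls": range(100), "is_ultra": [0] * 70 + [1] * 30})

# First split off 40% of the rows, keeping the class ratio in both parts
toy_train, toy_rest = train_test_split(
    toy, test_size=0.4, random_state=12345, stratify=toy["is_ultra"]
)
# Then halve the remainder into validation and test sets, again stratified
toy_valid, toy_test = train_test_split(
    toy_rest, test_size=0.5, random_state=12345, stratify=toy_rest["is_ultra"]
)

# Report the size and the Ultra share of each subset
for name, part in [("train", toy_train), ("valid", toy_valid), ("test", toy_test)]:
    print(name, len(part), round(part["is_ultra"].mean(), 2))
```

Whether stratification changes the validation scores on the real data would have to be checked by rerunning the experiments.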

First we will check the Decision Tree model.

In [10]:
for i in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth = i)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    
    print("max_depth =", i, ": ", end='')
    print('{:.4f}'.format(accuracy_score(target_valid, predictions_valid)))

max_depth = 1 : 0.7543
max_depth = 2 : 0.7823
max_depth = 3 : 0.7854
max_depth = 4 : 0.7792
max_depth = 5 : 0.7792


The best validation score (0.7854) is at a depth of 3.
Now we'll check the Random Forest model.

In [11]:
for j in range(2, 11):
    print("max_depth =", j, ": ", end='\n')
    for i in range(1, 7):
        model = RandomForestClassifier(random_state=12345, n_estimators=i, max_depth = j)
        model.fit(features_train, target_train)
        predictions_valid = model.predict(features_valid)
    
        print("n_estimators =", i, ": ", end='')
        print('{:.4f}'.format(accuracy_score(target_valid, predictions_valid)))

max_depth = 2 : 
n_estimators = 1 : 0.7854
n_estimators = 2 : 0.7823
n_estimators = 3 : 0.7776
n_estimators = 4 : 0.7869
n_estimators = 5 : 0.7698
n_estimators = 6 : 0.7729
max_depth = 3 : 
n_estimators = 1 : 0.7854
n_estimators = 2 : 0.7854
n_estimators = 3 : 0.7854
n_estimators = 4 : 0.7838
n_estimators = 5 : 0.7838
n_estimators = 6 : 0.7854
max_depth = 4 : 
n_estimators = 1 : 0.7745
n_estimators = 2 : 0.7776
n_estimators = 3 : 0.7760
n_estimators = 4 : 0.7885
n_estimators = 5 : 0.7854
n_estimators = 6 : 0.7869
max_depth = 5 : 
n_estimators = 1 : 0.7760
n_estimators = 2 : 0.7729
n_estimators = 3 : 0.7885
n_estimators = 4 : 0.7854
n_estimators = 5 : 0.7885
n_estimators = 6 : 0.7885
max_depth = 6 : 
n_estimators = 1 : 0.7854
n_estimators = 2 : 0.7854
n_estimators = 3 : 0.7838
n_estimators = 4 : 0.7885
n_estimators = 5 : 0.7947
n_estimators = 6 : 0.7932
max_depth = 7 : 
n_estimators = 1 : 0.7776
n_estimators = 2 : 0.7776
n_estimators = 3 : 0.7869
n_estimators = 4 : 0.7869
n_estimators =

The best result comes from 6 estimators with a max_depth of 8:  
max_depth = 8, n_estimators = 6 : 0.7963  
Now we'll check the Logistic Regression model.

In [12]:
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
predictions_valid = model.predict(features_valid)
print('{:.4f}'.format(accuracy_score(target_valid, predictions_valid)))

0.7589
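The features here are on very different scales (`mb_used` is in the tens of thousands, `messages` in the tens), and logistic regression can be sensitive to that. One thing that could be tried is standardizing the features in a pipeline. This is a sketch on synthetic data, not a result from this project; the data and the rescaling of one column are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (assumption)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)
X[:, 3] *= 10000  # exaggerate one feature's scale, loosely mimicking mb_used

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# The scaler is fitted on the training data only, then applied to validation data
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver="liblinear"),
)
scaled_model.fit(X_train, y_train)
scaled_accuracy = accuracy_score(y_valid, scaled_model.predict(X_valid))
print("{:.4f}".format(scaled_accuracy))
```

Whether scaling helps on the Megaline data would need to be checked on the validation set.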


The best result for the Decision Tree was:  
max_depth = 3 : 0.7854  
The best result for the Random Forest was:  
max_depth = 8, n_estimators = 6 : 0.7963  
The best result for the Logistic Regression was:  
0.7589  

Of the three, the most accurate model on the validation set is the Random Forest with max_depth = 8 and n_estimators = 6.
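Rather than reading the best combination off a long printout, the search loop can keep track of the best validation score as it goes. This is a minimal sketch on synthetic data with a smaller grid than the one used above; the data and the variable names `best_score` and `best_params` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target (assumption)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

best_score = 0.0
best_params = None
for depth in range(2, 5):
    for n_trees in range(1, 4):
        candidate = RandomForestClassifier(
            random_state=12345, n_estimators=n_trees, max_depth=depth
        )
        candidate.fit(X_train, y_train)
        score = accuracy_score(y_valid, candidate.predict(X_valid))
        # Strict '>' keeps the first (simplest) configuration on ties
        if score > best_score:
            best_score = score
            best_params = {"max_depth": depth, "n_estimators": n_trees}

print(best_params, "{:.4f}".format(best_score))
```

The stored `best_params` can then be passed straight to the final model, which avoids copying hyperparameters by hand.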

In [13]:
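# the final model
# note: this cell trains with n_estimators=4 and max_depth=4, not the
# max_depth=8, n_estimators=6 selected on the validation set;
# the accuracies printed below belong to this configuration
model = RandomForestClassifier(random_state=12345, n_estimators=4, max_depth=4)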
features = pd.concat([features_train, features_valid], ignore_index=True)
target = pd.concat([target_train, target_valid], ignore_index=True)
model.fit(features, target)
predictions_test = model.predict(features_test)
predictions_train = model.predict(features)

print('Accuracy')
print('Training set: {:.4f}'.format(accuracy_score(target, predictions_train)))
print('Test set: {:.4f}'.format(accuracy_score(target_test, predictions_test)))

Accuracy
Training set: 0.7942
Test set: 0.7885


In [14]:
dummy_model = DummyClassifier(random_state=12345, strategy="most_frequent")
dummy_model.fit(features, target)
dummy_predictions_test = dummy_model.predict(features_test)
dummy_predictions_train = dummy_model.predict(features)

print('Accuracy')
print('Training set: {:.4f}'.format(accuracy_score(target, dummy_predictions_train)))
print('Test set: {:.4f}'.format(accuracy_score(target_test, dummy_predictions_test)))

Accuracy
Training set: 0.6958
Test set: 0.6843


The model scores slightly higher on the training set (0.7942) than on the test set (0.7885). A gap this small points to only mild overfitting, and the test accuracy is above the 0.75 threshold.
The sanity check shows that our model does better than a baseline that always predicts the most frequent plan (0.6843 on the test set).
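The `most_frequent` baseline always predicts the majority class, so its accuracy on a set equals that class's share of the labels. This is why the baseline's training accuracy (0.6958) sits close to the Smart share of about 0.69. A small sketch with made-up labels illustrates the equivalence:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 7 Smart (0) and 3 Ultra (1)
target = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
features = np.zeros((len(target), 1))  # features are ignored by this strategy

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(features, target)
baseline_accuracy = accuracy_score(target, baseline.predict(features))

# Share of the majority class, computed directly from the labels
majority_share = max(target.mean(), 1 - target.mean())

print(baseline_accuracy, majority_share)
```

Any useful model therefore has to beat the majority-class share, not 0.5.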

## Conclusions

In general, our dataset was of good quality.
There were no duplicates or missing values, but two columns could be stored with a more suitable data type. We found that the shares of the plans are not balanced, which might affect the results.

We checked three types of classification models and compared their performance in terms of accuracy:
1) Decision Tree
2) Random Forest
3) Logistic Regression
For each type we tried to find the best hyperparameters.

In the end, we chose **Random Forest with max_depth of 8 and n_estimators of 6** as the best model.
A sanity check confirmed that the results were better than always predicting the most frequent plan.