# Introduction to Machine Learning Project

# Project description

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.
You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.
Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

# Project instructions:

1. Open and look through the data file.
2. Split the source data into a training set, a validation set, and a test set.
3. Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.
4. Check the quality of the model using the test set.
5. Additional task: sanity check the model. This data is more complex than what you’re used to working with, so it's not an easy task. We'll take a closer look at it later.

# Data description:

 - calls — number of calls
 - minutes — total call duration in minutes
 - messages — number of text messages
 - mb_used — Internet traffic used in MB
 - is_ultra — plan for the current month (Ultra - 1, Smart - 0)

## Open and look through the data file.

In [32]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
import warnings
warnings.filterwarnings("ignore")

In [33]:
df = pd.read_csv('/datasets/users_behavior.csv')
df.sample(10)

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 2022 | 56.0 | 354.15 | 98.0 | 8278.77 | 0 |
| 2518 | 69.0 | 535.17 | 70.0 | 19913.49 | 0 |
| 3191 | 57.0 | 363.42 | 23.0 | 16282.29 | 0 |
| 853 | 104.0 | 704.71 | 30.0 | 43304.53 | 1 |
| 1146 | 58.0 | 400.82 | 15.0 | 19335.93 | 0 |
| 365 | 20.0 | 131.97 | 23.0 | 26720.17 | 1 |
| 989 | 58.0 | 373.58 | 63.0 | 14802.42 | 0 |
| 1340 | 32.0 | 244.8 | 13.0 | 7213.58 | 0 |
| 2946 | 65.0 | 410.56 | 0.0 | 26756.07 | 0 |
| 2598 | 19.0 | 145.13 | 6.0 | 5181.6 | 1 |


In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


The dataset has 5 columns and 3214 rows. Four columns are of type float64 and the last one, is_ultra, is int64. There are no missing values.

In [35]:
df.corr()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| calls | 1.0 | 0.982083 | 0.177385 | 0.286442 | 0.207122 |
| minutes | 0.982083 | 1.0 | 0.17311 | 0.280967 | 0.206955 |
| messages | 0.177385 | 0.17311 | 1.0 | 0.195721 | 0.20383 |
| mb_used | 0.286442 | 0.280967 | 0.195721 | 1.0 | 0.198568 |
| is_ultra | 0.207122 | 0.206955 | 0.20383 | 0.198568 | 1.0 |


The calls and minutes columns are very strongly correlated (0.98). Let's drop the one that correlates less with the plan column is_ultra, which is minutes (0.206955 vs. 0.207122 for calls).

In [36]:
df = df.drop(['minutes'], axis=1)

In [37]:
df.head()

| | calls | messages | mb_used | is_ultra |
|---|---|---|---|---|
| 0 | 40.0 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 1.0 | 14502.75 | 0 |


OK, now the dataset is ready to work with.
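The rule used above for choosing which of two strongly correlated columns to drop can be sketched in plain Python. The correlation values are copied from the `df.corr()` output; the helper name `column_to_drop` and the 0.9 threshold are assumptions made for illustration, not part of the original analysis.

```python
# Correlation values copied from the df.corr() output above.
corr = {
    'calls': {'minutes': 0.982083, 'is_ultra': 0.207122},
    'minutes': {'calls': 0.982083, 'is_ultra': 0.206955},
}


def column_to_drop(corr, a, b, target, threshold=0.9):
    """Return the weaker of two highly correlated features, or None.

    The 0.9 threshold is an illustrative assumption.
    """
    if abs(corr[a][b]) < threshold:
        return None  # the pair is not correlated strongly enough to drop either
    # Drop the feature that is less correlated with the target.
    return a if abs(corr[a][target]) < abs(corr[b][target]) else b


dropped = column_to_drop(corr, 'calls', 'minutes', 'is_ultra')
print(dropped)  # minutes
```

The gap between the two candidates is tiny here (0.206955 vs. 0.207122), so dropping either column would likely give similar results.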

## Split the source data into a training set, a validation set, and a test set.

In [38]:
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

In [39]:
features, features_test, target, target_test = train_test_split(
    features, target, test_size=.2, random_state=123)

In [40]:
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=.25, random_state=123)

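We split the source data into training, validation, and test sets in a 60/20/20 proportion: 20% is held out for testing first, then 25% of the remaining 80% (20% of the total) goes to validation.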
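The arithmetic behind the 60/20/20 proportion can be checked with a few lines of Python. The helper `split_fractions` is a hypothetical name introduced only for this sketch; it mirrors the two consecutive `train_test_split` calls with `test_size=.2` and `test_size=.25`.

```python
def split_fractions(test_size, valid_size):
    """Fractions of the full dataset after two consecutive splits.

    test_size is taken from the full data; valid_size is taken from
    what remains after the test set has been removed.
    """
    remainder = 1.0 - test_size
    valid = remainder * valid_size
    train = remainder - valid
    return train, valid, test_size


train, valid, test = split_fractions(test_size=0.2, valid_size=0.25)
print(round(train, 2), round(valid, 2), round(test, 2))  # 0.6 0.2 0.2
```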

## Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.

### Decision Tree model.

In [41]:
%%time
best_tree_model = None
best_tree_result = 0
best_tree_depth = 0
for depth in range(1, 21):
    model_dtc = DecisionTreeClassifier(random_state=123, max_depth=depth)
    model_dtc.fit(features_train, target_train)
    predictions_dtc = model_dtc.predict(features_valid)
    result_dtc = accuracy_score(target_valid, predictions_dtc)
    if result_dtc > best_tree_result:
        best_tree_model = model_dtc
        best_tree_result = result_dtc
        best_tree_depth = depth
print('The Best depth of model:', best_tree_depth)
print('The best result of decision tree model:', best_tree_result.round(2))

The Best depth of model: 7
The best result of decision tree model: 0.79
CPU times: user 102 ms, sys: 0 ns, total: 102 ms
Wall time: 106 ms


The best depth for the decision tree model is 7. At this depth the validation accuracy is 0.79, and the whole search ran quickly (about 0.1 s).
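To see why accuracy stops improving at larger depths, it can help to log training and validation accuracy side by side. The sketch below does this on synthetic data from `make_classification` (an assumption; it is not the Megaline dataset), so the numbers it prints will differ from the results above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 3 features, binary target.
X, y = make_classification(
    n_samples=1000, n_features=3, n_informative=3,
    n_redundant=0, random_state=123)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=123)

history = []  # rows of (depth, train accuracy, validation accuracy)
for depth in range(1, 21):
    model = DecisionTreeClassifier(random_state=123, max_depth=depth)
    model.fit(X_train, y_train)
    history.append((depth,
                    model.score(X_train, y_train),
                    model.score(X_valid, y_valid)))

# Pick the depth with the highest validation accuracy (first one on ties).
best_depth, best_train, best_valid = max(history, key=lambda row: row[2])
print('Best depth:', best_depth, 'validation accuracy:', round(best_valid, 2))
```

A training accuracy that keeps rising while validation accuracy flattens or drops is the usual sign of overfitting.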

### Random forest model.

In [42]:
%%time
best_forest_model = None
best_forest_result = 0
best_forest_est = 0
for est in range(1, 26):
    model_rfc = RandomForestClassifier(random_state=123, n_estimators=est)
    model_rfc.fit(features_train, target_train)
    result_rfc = model_rfc.score(features_valid, target_valid)
    if result_rfc > best_forest_result:
        best_forest_model = model_rfc
        best_forest_result = result_rfc
        best_forest_est = est
print('The best amount of est:', best_forest_est)
print('The best result:', best_forest_result.round(2))

The best amount of est: 22
The best result: 0.8
CPU times: user 813 ms, sys: 6.62 ms, total: 819 ms
Wall time: 824 ms


The best number of estimators is 22, which gives a validation accuracy of 0.80, slightly better than the decision tree. However, this search takes much longer (about 0.8 s). Let's see whether limiting the depth affects the result.

In [43]:
%%time
best_forest_model = None
best_forest_result = 0
best_forest_est = 0
best_forest_depth = 0

for est in range(1, 21):
    for depth in range(1, 11):
        model_rfc = RandomForestClassifier(random_state=123, n_estimators=est, max_depth=depth)
        model_rfc.fit(features_train, target_train)
        result_rfc = model_rfc.score(features_valid, target_valid)
        if result_rfc > best_forest_result:
            best_forest_model = model_rfc
            best_forest_result = result_rfc
            best_forest_est = est
            best_forest_depth = depth
print('The best amount of estimators:', best_forest_est)
print('Best depth:', best_forest_depth)
print('Best result of random forest:', best_forest_result.round(2))

The best amount of estimators: 9
Best depth: 6
Best result of random forest: 0.81
CPU times: user 3.97 s, sys: 31.8 ms, total: 4 s
Wall time: 4.01 s


Tuning the depth changes the result. With max_depth = 6 the best number of estimators is 9, and the validation accuracy rises to 0.81, but the search time increased substantially (about 4 s for 200 fits instead of 0.8 s for 25).

### Logistic Regression Model

In [44]:
%%time
model_lr = LogisticRegression(random_state=123)
model_lr.fit(features_train, target_train)
result_lr = model_lr.score(features_valid, target_valid)
print('Result of logistic regression model:', result_lr.round(2))

Result of logistic regression model: 0.71
CPU times: user 7.17 ms, sys: 0 ns, total: 7.17 ms
Wall time: 13 ms


Logistic regression trained the fastest (13 ms wall time for a single fit, with no hyperparameter search), but it gave the worst validation accuracy: 0.71.

## Check the quality of the model using the test set.

### The test of decision tree model

In [45]:
%%time
test_prediction_dtc = best_tree_model.predict(features_test)
test_result_dtc = accuracy_score(target_test, test_prediction_dtc)
print('Accuracy of decision tree model:', test_result_dtc.round(2))

Accuracy of decision tree model: 0.79
CPU times: user 3.34 ms, sys: 198 µs, total: 3.54 ms
Wall time: 2.25 ms


The decision tree model reaches a test accuracy of 0.79, above the 0.75 threshold, and predicts quickly.

### The test of random forest model

In [46]:
%%time
test_result_rfc = best_forest_model.score(features_test, target_test)
print('Accuracy of random forest model:', test_result_rfc.round(2))

Accuracy of random forest model: 0.81
CPU times: user 3.71 ms, sys: 0 ns, total: 3.71 ms
Wall time: 3.17 ms


The random forest gives a very good test accuracy of 0.81. Prediction takes slightly longer than for the decision tree (3.17 ms vs. 2.25 ms wall time), but the difference is small.

### Test of logistic regression model

In [47]:
%%time
test_result_lr = model_lr.score(features_test, target_test)
print('Accuracy of logistic regression model:', test_result_lr.round(2))

Accuracy of logistic regression model: 0.71
CPU times: user 1.67 ms, sys: 99 µs, total: 1.77 ms
Wall time: 1.51 ms


Logistic regression had the shortest prediction time but the lowest test accuracy of the three models, 0.71, which is below the 0.75 threshold.
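The comparison against the project's accuracy threshold can be written out explicitly. This is a small sketch using the test accuracies reported in the cells above; the variable names are introduced here only for illustration.

```python
# Test accuracies as reported in the cells above.
test_accuracy = {
    'decision tree': 0.79,
    'random forest': 0.81,
    'logistic regression': 0.71,
}
THRESHOLD = 0.75  # required accuracy from the project description

# Keep only the models that meet the threshold.
passing = {name: acc for name, acc in test_accuracy.items() if acc >= THRESHOLD}
best_name = max(test_accuracy, key=test_accuracy.get)

print('Models above threshold:', sorted(passing))
print('Highest test accuracy:', best_name)
```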

## Additional task: sanity check the model. 

In [48]:
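# Baseline that ignores the features; a useful model should score clearly higher.
dummy_clf = DummyClassifier(random_state=123)
# Note: the baseline is fitted on the test set here; fitting it on the
# training set is the more usual practice.
dummy_clf.fit(features_test, target_test)

predictions_dc = dummy_clf.predict(features_test)

# Accuracy of the baseline on the test set
result_dc = dummy_clf.score(features_test, target_test)

print('Result of sanity check:', result_dc.round(2))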
print('Accuracy of decision tree model:', test_result_dtc.round(2))

Result of sanity check: 0.59
Accuracy of decision tree model: 0.79
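A common variant of this sanity check fits the baseline on the training set and names the strategy explicitly, since the default `strategy` of `DummyClassifier` may depend on the installed scikit-learn version. The sketch below uses small toy labels (an assumption, not the Megaline data) and the `most_frequent` strategy, which always predicts the majority class.

```python
from sklearn.dummy import DummyClassifier

# Toy labels (assumption, not the Megaline data): class 0 is the majority.
X_train = [[0]] * 10  # features are ignored by the dummy model
y_train = [0] * 7 + [1] * 3
X_test = [[0]] * 10
y_test = [0] * 6 + [1] * 4

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)  # learns the majority class from training data
baseline_acc = baseline.score(X_test, y_test)
print('Majority-class baseline accuracy:', baseline_acc)  # 0.6
```

A model passes the sanity check when its test accuracy is clearly above this kind of baseline.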


# Conclusion.

The dataset has 5 columns and 3214 rows, with four float64 columns and one int64 column, and no missing values. The calls and minutes columns are very strongly correlated (0.98), so I dropped minutes, the one that correlates less with the plan column.

On the validation set, the decision tree did best at depth 7 (accuracy 0.79) and was quick to tune. The random forest reached 0.80 with 22 estimators and 0.81 with 9 estimators and a maximum depth of 6, but its hyperparameter search took much longer. Logistic regression was the fastest to train but gave the worst result, 0.71.

On the test set, the random forest scored 0.81, the decision tree 0.79, and logistic regression 0.71, so only the first two pass the 0.75 threshold. The random forest is the most accurate, but the decision tree is only slightly behind and much cheaper to tune, so I consider it the best balance of accuracy and speed for this task. The sanity check supports this: the decision tree's 0.79 is well above the dummy baseline's 0.59.