**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a great job! The project is accepted. Good luck on the next sprint!

# Introduction

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

For this classification task, you need to develop a model that will pick the right plan.
 
The threshold for accuracy is 0.75. Check the accuracy using the test dataset.

## Initialization

### Load necessary libraries

In [1]:
# Load up necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

### Create and view Dataframe

In [2]:
# Load Dataframe
df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
# Visualize Dataframe
df

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |


In [4]:
# Check df info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


The output above shows 3214 rows with no missing values and numeric data types in every column, so the data is ready for modelling.
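The checks behind that statement can be made explicit. The following is a minimal sketch, not the author's code: it rebuilds only the five rows visible in the preview (the name `sample` is hypothetical) and shows calls that could be run on the full `df` to count missing values, look at the class balance of `is_ultra`, and check whether the float-typed `calls` and `messages` columns hold whole numbers.

```python
import pandas as pd

# The five rows shown in the preview above, used as a stand-in for the full file
sample = pd.DataFrame({
    'calls': [40.0, 85.0, 77.0, 106.0, 66.0],
    'minutes': [311.90, 516.75, 467.66, 745.53, 418.74],
    'messages': [83.0, 56.0, 86.0, 81.0, 1.0],
    'mb_used': [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    'is_ultra': [0, 0, 0, 1, 0],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Share of each class in the target column
class_share = sample['is_ultra'].value_counts(normalize=True)

# True for a column if every value in it is a whole number
whole_number_counts = (sample[['calls', 'messages']] % 1 == 0).all()

print(missing_per_column)
print(class_share)
print(whole_number_counts)
```

On the full dataset the class shares are worth knowing, because they give the accuracy a model would get by always predicting the more common plan.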

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and briefly inspected!

</div>

## Splitting and visualizing split Dataframes

### Split Dataframe into train, test and validate

In [5]:
# Split Dataframe into train, validate and test in the ratio 3:1:1 (60%:20%:20%)
train, test = train_test_split(df, test_size=0.2, random_state=12345)
train, validate = train_test_split(train, test_size=0.25, random_state=12345)

### Visualizing split Dataframes

In [6]:
# Visualize train
train

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 2656 | 30.0 | 185.07 | 34.0 | 17166.53 | 0 |
| 823 | 42.0 | 290.69 | 77.0 | 21507.03 | 0 |
| 2566 | 41.0 | 289.83 | 15.0 | 22151.73 | 0 |
| 1451 | 45.0 | 333.49 | 50.0 | 17275.47 | 0 |
| 2953 | 43.0 | 300.39 | 69.0 | 17277.83 | 0 |
| ... | ... | ... | ... | ... | ... |
| 1043 | 106.0 | 796.79 | 23.0 | 42250.70 | 1 |
| 2132 | 18.0 | 117.80 | 0.0 | 10006.79 | 1 |
| 1642 | 87.0 | 583.02 | 1.0 | 11213.97 | 0 |
| 1495 | 63.0 | 408.68 | 63.0 | 24970.26 | 0 |


In [7]:
# Visualize test
test

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 1415 | 82.0 | 507.89 | 88.0 | 17543.37 | 1 |
| 916 | 50.0 | 375.91 | 35.0 | 12388.40 | 0 |
| 1670 | 83.0 | 540.49 | 41.0 | 9127.74 | 0 |
| 686 | 79.0 | 562.99 | 19.0 | 25508.19 | 1 |
| 2951 | 78.0 | 531.29 | 20.0 | 9217.25 | 0 |
| ... | ... | ... | ... | ... | ... |
| 2061 | 66.0 | 478.48 | 0.0 | 16962.58 | 0 |
| 1510 | 40.0 | 334.24 | 91.0 | 11304.14 | 0 |
| 2215 | 62.0 | 436.68 | 52.0 | 12311.24 | 0 |
| 664 | 117.0 | 739.27 | 124.0 | 22818.56 | 1 |


In [8]:
# Visualize validate
validate

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 2699 | 71.0 | 512.53 | 27.0 | 15772.68 | 0 |
| 242 | 183.0 | 1247.04 | 150.0 | 29186.41 | 1 |
| 2854 | 34.0 | 246.06 | 31.0 | 8448.76 | 0 |
| 1638 | 63.0 | 468.66 | 0.0 | 11794.34 | 0 |
| 1632 | 4.0 | 19.85 | 28.0 | 13107.42 | 0 |
| ... | ... | ... | ... | ... | ... |
| 2551 | 83.0 | 539.32 | 33.0 | 20967.28 | 0 |
| 1261 | 80.0 | 552.33 | 23.0 | 18457.85 | 0 |
| 1658 | 111.0 | 899.74 | 0.0 | 26866.61 | 0 |
| 353 | 68.0 | 493.00 | 29.0 | 20021.73 | 0 |


### Checking split ratio

In [9]:
# Checking train ratio
train_ratio = len(train)/len(df)
print(f'Train ratio: {train_ratio:.0%} ')

Train ratio: 60% 


In [10]:
# Checking test ratio
test_ratio = len(test)/len(df)
print(f'Test ratio: {test_ratio:.0%} ')

Test ratio: 20% 


In [11]:
# Checking validate ratio
validate_ratio = len(validate)/len(df)
print(f'Validate ratio: {validate_ratio:.0%} ')

Validate ratio: 20% 


The data has been split into train, test and validate sets in the intended 60%:20%:20% proportions. Now we can compare candidate models for the task.
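The reason two calls with `test_size=0.2` and `test_size=0.25` give a 60:20:20 split is that the second call only sees the 80% left over from the first. The arithmetic can be sketched with exact fractions; the variable names below are illustrative and not part of the notebook.

```python
from fractions import Fraction

first_test_size = Fraction(1, 5)   # test_size=0.2 in the first train_test_split call
second_test_size = Fraction(1, 4)  # test_size=0.25 in the second call

# Share of the full dataset that ends up in each subset
test_share = first_test_size
remaining_share = 1 - first_test_size                      # what the second call receives
validate_share = remaining_share * second_test_size        # 4/5 * 1/4
train_share = remaining_share * (1 - second_test_size)     # 4/5 * 3/4

print(train_share, validate_share, test_share)
```

Because row counts must be whole numbers, the actual subset sizes differ from these shares by at most a row or so, which is why the printed ratios above are rounded percentages.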

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data split is reasonable!

</div>

## Working with models

### Establish variables

In [12]:
# Variables for train df
features_train = train.drop('is_ultra', axis=1)
target_train = train['is_ultra']

In [13]:
# Variables for test df
features_test = test.drop('is_ultra', axis=1)
target_test = test['is_ultra']

In [14]:
# Variables for validate df
features_validate = validate.drop('is_ultra', axis=1)
target_validate = validate['is_ultra']

The feature and target variables for the train, test and validate Dataframes have been created.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Ok!

</div>

### Decision Tree Model

In [15]:
# Find the decision tree depth with the highest accuracy on the validation dataset
best_result = 0
best_tree_model = None
best_depth = 0
for depth in range(1, 11):
    tree_model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    tree_model.fit(features_train, target_train)
    prediction = tree_model.predict(features_validate)
    result = accuracy_score(target_validate, prediction)
    if result > best_result:
        best_result = result
        best_tree_model = tree_model
        best_depth = depth

In [16]:
# Check accuracy of tree model using validation dataset
print(f'Accuracy of best model: {best_result}')
print(f'Most efficient depth: {best_depth}')

Accuracy of best model: 0.7744945567651633
Most efficient depth: 7


In [17]:
# Check accuracy of tree model using the test dataset
tree_test_pred = best_tree_model.predict(features_test)
tree_test_res = accuracy_score(target_test, tree_test_pred)
print(f'Test results for Decision Tree Classifier: {tree_test_res}')

Test results for Decision Tree Classifier: 0.7884914463452566


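From the results above, the decision tree handles the task well. The best depth on the validation dataset is 7, with a validation accuracy of 77.4%, and that model reaches 78.8% accuracy on the test dataset, above the 0.75 threshold. The validation and test scores are close, which suggests the model generalizes reasonably well. Training accuracy was not measured here, so these numbers alone cannot rule out overfitting.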
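One way to look for overfitting or underfitting is to compare training and validation accuracy at each depth. The sketch below is an illustration under stated assumptions rather than a rerun of the notebook: it uses synthetic data from `make_classification` as a stand-in for the Megaline features, and the names `X_train`, `X_valid` and `depth_report` are hypothetical. A large positive gap at high depth points to overfitting, while low scores on both sets at small depth point to underfitting.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Megaline dataset)
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

depth_report = []  # one (depth, train accuracy, validation accuracy) tuple per depth
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    valid_acc = accuracy_score(y_valid, model.predict(X_valid))
    depth_report.append((depth, train_acc, valid_acc))
    print(f'depth={depth:2d}  train={train_acc:.3f}  '
          f'valid={valid_acc:.3f}  gap={train_acc - valid_acc:+.3f}')
```

In the notebook, the same comparison could be made by scoring each `tree_model` on `features_train` as well as `features_validate` inside the existing loop.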

### Random Forest Model

In [18]:
# Find the Random Forest settings with the highest accuracy on the validation dataset
best_score = 0
best_est = 0
best_f_depth = 0
best_rf_model = None
for est in range(1, 51, 10):  # tries 1, 11, 21, 31 and 41 trees
    for depth in range(1, 11):
        forest_model = RandomForestClassifier(random_state=54321, n_estimators=est, max_depth=depth)
        forest_model.fit(features_train, target_train)
        rf_prediction = forest_model.predict(features_validate)
        result = accuracy_score(target_validate, rf_prediction)
        if result > best_score:
            best_score = result
            best_est = est
            best_f_depth = depth
            best_rf_model = forest_model

In [19]:
# Check accuracy of forest model using validation dataset
print(f'Accuracy of best model: {best_score}')
print(f'Best number of estimators: {best_est}')
print(f'Most efficient depth: {best_f_depth}')

Accuracy of best model: 0.7931570762052877
Best number of estimators: 11
Most efficient depth: 10


In [20]:
# Check accuracy of forest model using test dataset
forest_test_pred = best_rf_model.predict(features_test)
forest_test_res = accuracy_score(target_test, forest_test_pred)
print(f'Test results of Random Forest Classifier: {forest_test_res}')

Test results of Random Forest Classifier: 0.7947122861586314


From the above results, the random forest model is also effective for this task, with an accuracy of 79.5% on the test dataset using max_depth=10 and n_estimators=11. However, that is only marginally better than the decision tree's 78.8%, and a forest of several trees takes longer to train and predict than a single tree, so it might not be the best model if speed is needed.
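The speed trade-off mentioned above was not measured in the notebook. A rough way to measure it is sketched below, assuming synthetic data in place of the Megaline features; the helper `time_fit_predict` and the `timings` dictionary are hypothetical names. Timings depend on the machine and the data size, so the output should be read as indicative only.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Megaline dataset)
X, y = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345,
)


def time_fit_predict(model, features, target):
    """Return (fit_seconds, predict_seconds) for one model on the given data."""
    start = time.perf_counter()
    model.fit(features, target)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(features)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds


# Hyperparameters mirror the best values reported in the notebook
timings = {
    'decision tree': time_fit_predict(
        DecisionTreeClassifier(max_depth=7, random_state=12345), X, y
    ),
    'random forest': time_fit_predict(
        RandomForestClassifier(n_estimators=11, max_depth=10, random_state=54321), X, y
    ),
}

for name, (fit_s, predict_s) in timings.items():
    print(f'{name}: fit {fit_s:.4f}s, predict {predict_s:.4f}s')
```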

### Logistic Regression Model

In [21]:
# Create Logistic Regression Model
lr_model = LogisticRegression(random_state=12345, solver='lbfgs')
lr_model.fit(features_train, target_train)

LogisticRegression(random_state=12345)

In [22]:
# Check accuracy of logistic regression model using validation dataset
acc_score = lr_model.score(features_validate, target_validate)
print(f'Accuracy score on validation dataset: {acc_score}')

Accuracy score on validation dataset: 0.7262830482115086


In [23]:
# Check accuracy of Logistic regression model using test dataset
acc_score_test = lr_model.score(features_test, target_test)
print(f'Accuracy score on test dataset: {acc_score_test}')

Accuracy score on test dataset: 0.7589424572317263


From the above results, the Logistic Regression Model reaches an accuracy of 72.6% on the validation dataset and 75.9% on the test dataset. It clears the 0.75 threshold on the test dataset only narrowly and is the least accurate of the three models.
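Logistic regression can be sensitive to features on very different scales, such as `mb_used` in the tens of thousands next to `calls` in the tens. The notebook does not scale its features. As a possible variation, and not something the notebook does, the sketch below wraps the same classifier in a pipeline with `StandardScaler`. It runs on synthetic data with one column deliberately stretched, so it says nothing about whether scaling would improve the Megaline scores.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the Megaline dataset)
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345,
)
# Stretch one column to mimic a large-valued feature such as mb_used (an assumption)
X[:, 3] = X[:, 3] * 10000

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# Standardize each feature, then fit the same kind of classifier as the notebook
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver='lbfgs'),
)
scaled_lr.fit(X_train, y_train)
scaled_score = scaled_lr.score(X_valid, y_valid)
print(f'Validation accuracy with scaling: {scaled_score:.3f}')
```

Fitting the scaler inside the pipeline keeps it from seeing the validation rows, so the validation score stays an honest estimate.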

<div class="alert alert-success">
<b>Reviewer's comment</b>

Great, you tuned the models' hyperparameters using the validation set and evaluated the final models on the test set!

</div>

## Conclusion

In summary, we split the dataset into train, validate and test sets in the ratio 3:1:1 (60%, 20% and 20% respectively) and built several machine learning models to decide which one is best at picking the right plan for each customer.

We built 3 models that pick a plan based on customer behaviour. Of the 3, the Random Forest Model is the most accurate, at 79.5% on the test dataset, and good accuracy is needed to drive up revenue.
We recommend this model if speed of execution is not a factor.


The Logistic Regression Model is the least accurate on our datasets, at 72.6% on validation and 75.9% on test, but it is typically the fastest to train and run. It still meets the 0.75 threshold on the test dataset.
We recommend this model if speed is very important.

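The Decision Tree Model is nearly as accurate as the Random Forest Model, at 78.8% on the test dataset, and it is also very fast. A single tree can overfit if it grows too deep or underfit if it is too shallow, which is why max_depth was tuned on the validation dataset.
Overall, this model offers the best balance of accuracy and speed, and we think it's the best model for this task.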
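The comparison behind these recommendations can be put side by side. The sketch below copies the accuracies printed in the cells above, rounded to four decimals, and checks each test score against the 0.75 threshold. The `results` dictionary and `THRESHOLD` constant are names introduced here for illustration.

```python
# Accuracies copied from the cell outputs above, rounded to four decimals
results = {
    'Decision Tree': {'validation': 0.7745, 'test': 0.7885},
    'Random Forest': {'validation': 0.7932, 'test': 0.7947},
    'Logistic Regression': {'validation': 0.7263, 'test': 0.7589},
}
THRESHOLD = 0.75  # accuracy threshold stated in the introduction

for name, scores in results.items():
    meets_threshold = scores['test'] >= THRESHOLD
    print(f"{name:<20} validation={scores['validation']:.4f}  "
          f"test={scores['test']:.4f}  meets threshold: {meets_threshold}")

# Model with the highest accuracy on the test dataset
best_on_test = max(results, key=lambda name: results[name]['test'])
print(f'Highest test accuracy: {best_on_test}')
```

All three models meet the threshold on the test dataset, and the gap between the decision tree and the random forest is under one percentage point.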

<div class="alert alert-success">
<b>Reviewer's comment</b>

Conclusions make sense!

</div>