**Review**

Hello Caleb!

I'm happy to review your project today.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a great job overall, but there are some small problems that need to be fixed before the project will be accepted. Let me know if you have any questions!


## Title: Using Classifying Models to Pick the Right Megaline Plan

# Introduction:  
In this project my goal is to use a model along with Megaline's data to recommend one of their new plans to customers. I will split the data into training, validation and test sets, train the models on the training set, tune them on the validation set and check the final model on the test set, with the goal of obtaining 75% accuracy or more. Since the target is categorical, I will use classification instead of regression, which is used for numerical targets. I will then perform a sanity check for peace of mind.

#revised

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

It's not the best idea to use header format for regular text. Header formats should be used only for titles and subtitles.

</div>

In [1]:
import pandas as pd                                #import modules/libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error           #not needed for this project

In [2]:
megaline_data_df = pd.read_csv('/datasets/users_behavior.csv')   #save df with appropriate name


In [3]:
megaline_data_df.info()                              #it's always a good idea to call info to get an idea of the data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
## Removed entire cell

<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Duplicate code. You did it below

</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Thank you!

</div>

In [5]:
#df_train, df_valid = train_test_split(megaline_data_df, test_size = 0.25, random_state = 12345) 
#choose the data and split the size

#review how to split the data 3 ways

#old split, new split is below

In [6]:
df_train, df_temp = train_test_split(megaline_data_df, test_size=0.4, random_state=12345)


In [7]:
df_valid, df_test = train_test_split(df_temp, test_size=0.5, random_state=12345)

In [8]:
print(f'Training set size: {len(df_train)}')
print(f'Validation set size: {len(df_valid)}')
print(f'Test set size: {len(df_test)}')

Training set size: 1928
Validation set size: 643
Test set size: 643


<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

You should split the data into 3 parts here: train, validation and test with the ratio 60/20/20 or 70/15/15

</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Good job!

</div>
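
The 60/20/20 split can be wrapped in a small helper. This is only a sketch: the function name `three_way_split` and the toy DataFrame are made up for illustration and are not part of the project. The second cut uses `test_share / holdout_share`, which is why a `test_size` of 0.5 on the 40% remainder gives two parts of 20% each.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, valid_share=0.2, test_share=0.2, random_state=12345):
    """Split df into train/validation/test parts (60/20/20 by default)."""
    holdout_share = valid_share + test_share
    # First cut: keep the training part, hold out the rest
    df_train, df_holdout = train_test_split(
        df, test_size=holdout_share, random_state=random_state
    )
    # Second cut: divide the held-out rows between validation and test
    df_valid, df_test = train_test_split(
        df_holdout, test_size=test_share / holdout_share, random_state=random_state
    )
    return df_train, df_valid, df_test


# Toy frame standing in for the real dataset (values are made up)
toy_df = pd.DataFrame({"mb_used": range(100), "is_ultra": [0, 1] * 50})
toy_train, toy_valid, toy_test = three_way_split(toy_df)
split_sizes = (len(toy_train), len(toy_valid), len(toy_test))
print(split_sizes)
```

If the classes are imbalanced, passing `stratify=df['is_ultra']` to `train_test_split` is an optional way to keep the class shares similar in each part.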

In [9]:
megaline_data_df.head(20) #it looks roughly from the data that users who use more than 10,000mb have a 0 is_ultra value
# I want to find the mean/median mb_used of the 'smart' (0) users and use it as a threshold value

   calls  minutes  messages   mb_used  is_ultra
0   40.0    311.9      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0
5   58.0   344.56      21.0  15823.37         0
6   57.0   431.64      20.0    3738.9         1
7   15.0    132.4       6.0   21911.6         0
8    7.0    43.39       3.0   2538.67         1
9   90.0   665.41      38.0  17358.61         0


In [10]:
smart_df_slice = megaline_data_df[megaline_data_df['is_ultra']==0] #getting the wanted data slice

In [11]:
mb_mean = smart_df_slice['mb_used'].mean() #better to use mean or median? median = 16506.94. mean = 16208.47

In [12]:
mb_mean

16208.46694930462

In [13]:
 #the threshold is the median


#removed entire cell

<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Why? You already have the column 'is_ultra' in your dataset. You don't need to recreate it.

</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Thank you:)

</div>

In [14]:
features_train = df_train.drop(['is_ultra'], axis=1)  #separate the features from the target for training set
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1) #pretty standard, same process as above but for validation
target_valid = df_valid['is_ultra']
#revised 

<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Column 'mb_used' shouldn't be removed

</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Thank you once again

</div>

In [15]:
for depth in range(1,20):                                #picking depth of decision tree
    model = DecisionTreeClassifier(random_state=12345, max_depth = depth) #accuracy peaks at a depth of 3 with 
    model.fit(features_train, target_train)                              #  a score of about 0.785
    predictions_valid = model.predict(features_valid)    #code taken from chapter 4, sprint 7
    print('max_depth =', depth, ': ', end= '')
    print(accuracy_score(target_valid,predictions_valid))  

max_depth = 1 : 0.7542768273716952
max_depth = 2 : 0.7822706065318819
max_depth = 3 : 0.7853810264385692
max_depth = 4 : 0.7791601866251944
max_depth = 5 : 0.7791601866251944
max_depth = 6 : 0.7838258164852255
max_depth = 7 : 0.7822706065318819
max_depth = 8 : 0.7791601866251944
max_depth = 9 : 0.7822706065318819
max_depth = 10 : 0.7744945567651633
max_depth = 11 : 0.7620528771384136
max_depth = 12 : 0.7620528771384136
max_depth = 13 : 0.7558320373250389
max_depth = 14 : 0.7589424572317263
max_depth = 15 : 0.7465007776049767
max_depth = 16 : 0.7340590979782271
max_depth = 17 : 0.7356143079315708
max_depth = 18 : 0.7309486780715396
max_depth = 19 : 0.7278382581648523


<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Correct

</div>

In [16]:
#best accuracy score for decision tree is .785 (at max_depth = 3). This is above the threshold but not by much
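
Reading the best depth off a long printout is easy to get wrong, so it may help to let the loop track it. The sketch below uses synthetic data from `make_classification` as a stand-in for the Megaline features, so its numbers will not match the project's; only the selection pattern is the point.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Megaline dataset)
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0, random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

best_depth, best_score = None, 0.0
for depth in range(1, 20):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)  # accuracy on the validation set
    if score > best_score:  # strict '>' keeps the shallowest depth on ties
        best_depth, best_score = depth, score

print(f"best max_depth = {best_depth}, validation accuracy = {best_score:.3f}")
```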

In [17]:
megaline_data_df.loc[megaline_data_df['mb_used'] > 16506, 'is_ultra'] = 0 #the threshold is the median
megaline_data_df.loc[megaline_data_df['mb_used'] <= 16506, 'is_ultra'] = 1 

 # data for features df and target df is split in cells 6 and 7


best_score = 0
best_est = 0
for est in range(1, 11): # choose hyperparameter range
    model = RandomForestClassifier(random_state=54321, n_estimators=est) # set number of trees
    model.fit(features_train,target_train) # train model on training set from cell 14
    score = model.score(features_valid,target_valid) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score # save best accuracy score on validation set
        best_est = est # save number of estimators corresponding to best accuracy score

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))



Accuracy of the best model on the validation set (n_estimators = 10): 0.7791601866251944


<div class="alert alert-block alert-warning">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

The first two rows should be removed. They do nothing useful. The rest of the code looks fine:)

</div>
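
A small sketch of why overwriting labels on the original DataFrame after the split has no effect on training. It assumes, as is the case in the pandas and scikit-learn versions I am aware of, that `train_test_split` returns new DataFrames rather than views of the input. The toy values are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with made-up values
df = pd.DataFrame({
    "mb_used": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
    "is_ultra": [0, 1, 0, 1, 0, 1],
})
train, rest = train_test_split(df, test_size=0.5, random_state=0)
train_targets_before = train["is_ultra"].tolist()

# Overwrite the labels on the original frame after the split was made
df.loc[df["mb_used"] > 300, "is_ultra"] = 0
df.loc[df["mb_used"] <= 300, "is_ultra"] = 1

train_targets_after = train["is_ultra"].tolist()
# The training labels are unchanged because 'train' holds its own copy
print(train_targets_before == train_targets_after)
```

Had the overwrite happened before the split, it would have replaced the real labels with a threshold rule, which is a second reason to drop those lines.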

In [18]:
features_test = df_test.drop(['is_ultra'], axis = 1) 
target_test = df_test['is_ultra']

In [19]:
final_model = RandomForestClassifier(random_state=54321, n_estimators = 10) # change n_estimators to get best model
final_model.fit(features_train, target_train)    #changed from 'valid' back to train    
print(final_model.score(features_test, target_test)) #use test features and target on best model

0.7822706065318819


<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

1. A lot of duplicate code and the same mistakes as above
2. You can train an ML model only on train data. You can't train it on validation or test data.
3. You need to choose the best model and test it on test data.
4. You need to achieve accuracy > 0.75 on test data

</div>

<div class="alert alert-block alert-success">
<b>Reviewer's comment V2</b> <a class="tocSkip"></a>

Good job! Everything is correct now

</div>

#The RandomForest classifier scores about 0.782 on the test set, which is close to the Decision Tree Classifier's best validation accuracy of about 0.785

In [20]:
from sklearn.dummy import DummyClassifier

# Create a baseline model that predicts the most frequent class
dummy_model = DummyClassifier(strategy="most_frequent")
dummy_model.fit(features_train, target_train)
dummy_predictions = dummy_model.predict(features_valid)
dummy_accuracy = accuracy_score(target_valid, dummy_predictions)

print("Baseline Model Accuracy (most frequent class):", dummy_accuracy)



Baseline Model Accuracy (most frequent class): 0.7060653188180405


<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Correct

</div>

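The sanity check shows that a baseline model that always predicts the most frequent class reaches about 70.6% accuracy on the validation set. The RandomForest model scores about 78% on both the validation and test sets, so it beats the baseline, which is what the sanity check needs to show.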
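
The baseline number is just the share of the evaluation labels that equal the most frequent training class, which is the rule `DummyClassifier(strategy="most_frequent")` applies. A minimal sketch of that arithmetic, with a hypothetical helper name and made-up labels rather than the Megaline data:

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for label in eval_labels if label == majority_label)
    return hits / len(eval_labels)


# Made-up labels: class 0 is the majority in the training labels,
# and 7 of the 10 evaluation labels are class 0
toy_train_labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
toy_eval_labels = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
baseline = majority_baseline_accuracy(toy_train_labels, toy_eval_labels)
print(baseline)  # 0.7
```

A trained model is only useful if its accuracy is clearly above this number, not merely above 50%.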

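Conclusion: The RandomForest Classifier reached an accuracy of about 0.782 on the test set, which is above the 0.75 target. On the validation set its best score (about 0.779 with 10 trees) was close to the DecisionTree Classifier's best score (about 0.785 at a depth of 3), so the two models perform about equally well here, and the RandomForest has the usual drawback of slower speed. As the lesson says, higher classification accuracy generates more profit. After the sanity check, where the model beats the baseline accuracy of about 0.706, I'm proud of the accuracy of the model. I have learned that each type of data has a specific use, and you can't leave out the data from the features that you are using for testing!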