<div style="border:solid blue 2px; padding: 20px"> 

<strong>Reviewer's Introduction</strong>

Hello Avon! 👋 

I'm happy to review your project today.

I will categorize my comments in green, blue or red boxes like this:

<div class="alert alert-success">
    <b>Success:</b> Everything is done successfully.
</div>
<div class="alert alert-warning">
    <b>Remarks:</b> Suggestions for optimizations or improvements.
</div>
<div class="alert alert-danger">
    <b>Needs fixing:</b> This must be fixed for a project to be approved.
</div>

Please don't remove my comments :) If you have any questions or comments, don't hesitate to respond to my comments by creating a box that looks like this: 
<div class="alert alert-info"> <b>Student's comment:</b> Your text here.</div>    
<br>


📌 Here's how to create code for student comments inside a Markdown cell:
    
    
    <div class="alert alert-info">
    <b> Student's comment</b>

    Your text here. 
    </div>

You can find out how to **format text** in a Markdown cell or how to **add links** [here](https://sqlbak.com/blog/jupyter-notebook-markdown-cheatsheet). 


<hr>
Reviewer: Han Lee <br>
</div>


<div style="border: solid blue 2px; padding: 15px; margin: 10px">
	<b>Reviewer's Comments – Iteration 1</b>

Congratulations! 

This project meets all requirements ✅, and is approved. 🎉


<b>Notable strengths:</b>  

✔️ Clear understanding of splitting dataset 

✔️ Use of baseline to compare against multiple models  

✔️ Use of custom hyperparameter tuning 

✔️ Concise and informative conclusion

Well done on your first machine learning project! You will continue to use these techniques in future sprints. Keep up the great work.


</div>



***known***
- data contains only new plan subscribers
- preprocessing done
- binary classification
- 3214 rows × 5 columns
- target = ['is_ultra']
- features = ['calls', 'minutes', 'messages', 'mb_used']
- 3:1:1 / 60%:20%:20% / train:valid:test - no external test set
- .75 accuracy threshold

***unknown***
- which model to use?
- hyperparameters?
- findings?

In [1]:
import pandas as pd

from sklearn.tree import DecisionTreeClassifier # decision tree
from sklearn.ensemble import RandomForestClassifier # random forest
from sklearn.linear_model import LogisticRegression # logistic regression

from sklearn.model_selection import train_test_split # train:valid:test

**creating a dataframe with the given .csv file, checking for any immediately observable trends**

In [2]:
data = pd.read_csv('/datasets/users_behavior.csv') 

#data.info()
#data.shape
#print(data.duplicated().sum())
#print(data.sample(17, random_state=12345))
#print(data[data['is_ultra'] == 1].sort_values(by='minutes'))
#print(data[data['is_ultra'] == 0].sort_values(by='minutes'))

In [3]:
#ultra = data[data['is_ultra'] == 1]
#smart = data[data['is_ultra'] == 0]

#display(ultra.describe())
#display(smart.describe())

**There are no immediately obvious differences in user behavior between the two plans. 'Ultra' has more than twice as many subscribers as 'Smart': 2229 versus 985.**

**Ultra ranges**
- Calls: (0, 244)
- Minutes: (0, 1632)
- Messages: (0, 224)
- MB Used: (0, 49745.73)

**Smart ranges**
- Calls: (0, 198)
- Minutes: (0, 1390)
- Messages: (0, 143)
- MB Used: (0, 38552.62)

***Ultra subscribers make up 69.35% of our dataset***

***Smart subscribers make up 30.65% of our dataset***
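The percentages above follow directly from the two subscriber counts. A minimal sketch of that arithmetic, using neutral placeholder labels (`group_a`, `group_b` are hypothetical names, not columns in the dataset) and a hypothetical helper `class_shares`:

```python
from collections import Counter

def class_shares(labels):
    """Return each label's share of the total, as a percentage rounded to 2 places."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}

# Placeholder labels standing in for the two plans; only the counts come from the notes above.
labels = ["group_a"] * 2229 + ["group_b"] * 985
shares = class_shares(labels)
print(shares)
```

On the real dataframe, `data['is_ultra'].value_counts(normalize=True)` should give the same shares and show which value of `is_ultra` each one belongs to.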

**Assigning features & target, splitting our data into 3 sets for training, validation, and testing**

In [4]:
# assigning features and target
features = data.drop('is_ultra', axis = 1)
target = data['is_ultra']

# splitting the data into training/temporary (temp for 2nd split) sets 60%/40%
features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size = .4, random_state = 12345)

# splitting our temporary set (remaining 40%), into validation/test sets 20%/20%
features_valid, features_test, target_valid, target_test = train_test_split(features_temp, target_temp, test_size = .5, random_state = 12345)

**double checking ratio after split**

In [5]:
print(f'Training set size: {len(features_train) / len(data) * 100}%')
print(f'Validation set size: {len(features_valid) / len(data) * 100}%')
print(f'Test set size: {len(features_test) / len(data) * 100}%')

Training set size: 59.98755444928439%
Validation set size: 20.00622277535781%
Test set size: 20.00622277535781%
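The sizes are not exactly 60/20/20 because 3214 rows cannot be divided that way. A rough sketch of the arithmetic, assuming `train_test_split` rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the other part (`split_sizes` is a hypothetical helper, not a scikit-learn function):

```python
import math

def split_sizes(n_rows, test_size):
    """Return (rows kept, rows held out) for a fractional test_size, rounding the held-out part up."""
    n_held_out = math.ceil(test_size * n_rows)
    return n_rows - n_held_out, n_held_out

# First split: 60% train, 40% temporary
n_train, n_temp = split_sizes(3214, 0.4)
# Second split: the temporary set is halved into validation and test
n_valid, n_test = split_sizes(n_temp, 0.5)

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(name, n, f"{n / 3214 * 100}%")
```

Under this assumption the three sets hold 1928, 643 and 643 rows, which matches the printed percentages.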

**spot-checking that the three sets start with different rows**

In [6]:
#print(features_train.head())
#print(features_valid.head())
#print(features_test.head())
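Looking at `head()` only compares the first few rows. A stricter check is to confirm that no row label appears in more than one set. The sketch below illustrates the idea on plain lists of row labels; `three_way_split` and `overlap_report` are hypothetical helpers, and the shuffle is a stand-in for the real split, not a reproduction of it:

```python
import random

def three_way_split(n_rows, seed=12345):
    """Shuffle row labels 0..n_rows-1 and cut them roughly 60/20/20 (illustrative only)."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_train = int(n_rows * 0.6)
    n_valid = (n_rows - n_train) // 2
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]

def overlap_report(train_idx, valid_idx, test_idx):
    """Count row labels shared between each pair of sets; all zeros means no overlap."""
    train, valid, test = set(train_idx), set(valid_idx), set(test_idx)
    return {
        "train_valid": len(train & valid),
        "train_test": len(train & test),
        "valid_test": len(valid & test),
    }

train_idx, valid_idx, test_idx = three_way_split(3214)
report = overlap_report(train_idx, valid_idx, test_idx)
print(report)
```

With the actual dataframes, the same function could be called as `overlap_report(features_train.index, features_valid.index, features_test.index)`.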

***DecisionTreeClassifier - Training with max_depth values from 1 to 10, i.e. range(1, 11)***

In [7]:
best_score_tree = 0
best_model_depth = 0
for depth in range(1,11):
    model = DecisionTreeClassifier(random_state = 12345, max_depth = depth)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    if score > best_score_tree:
        best_score_tree = score
        best_model_depth = depth

print('Accuracy of best model on validation set (depth = {}): {}'.format(best_model_depth, best_score_tree))

Accuracy of best model on validation set (depth = 3): 0.7853810264385692


**A depth value of 3 gave us the most accurate model, with 78.54% accuracy on the validation set**
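One detail of the selection loop worth noting: because it compares with a strict `>`, a deeper tree that only ties the current best score does not replace it, so the shallowest of the tied depths is kept. A small sketch with made-up scores (the numbers in `hypothetical_scores` are invented for illustration and are not results from this project):

```python
def pick_best(scores_by_depth):
    """Mimic the loop above: keep the first depth that reaches the highest score."""
    best_depth, best_score = 0, 0
    for depth, score in scores_by_depth.items():
        if score > best_score:  # strict comparison: later ties are ignored
            best_depth, best_score = depth, score
    return best_depth, best_score

# Invented validation scores, with a tie between depths 3 and 4
hypothetical_scores = {1: 0.75, 2: 0.78, 3: 0.79, 4: 0.79, 5: 0.77}
best = pick_best(hypothetical_scores)
print(best)
```

Preferring the shallower tree on a tie is a reasonable default, since it is the simpler model.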

In [8]:
final_tree_model = DecisionTreeClassifier(random_state = 12345, max_depth = 3)
final_tree_model.fit(features_train, target_train)

test_score = final_tree_model.score(features_test, target_test)
print('Accuracy of final model on test set:', test_score*100, end = '%')

Accuracy of final model on test set: 77.91601866251943%

**With a depth value of 3, our model scored 77.92% accuracy on the test set, slightly higher than our desired threshold of 75%**

***RandomForestClassifier - Training with n_estimators values from 10 to 50 in steps of 10, i.e. range(10, 51, 10), combined with max_depth values from 1 to 10, i.e. range(1, 11), for a total of 50 hyperparameter combinations***

In [9]:
best_score_forest = 0
best_model_depth = 0
best_model_est = 0
for est in range(10, 51, 10):
    for depth in range(1, 11):
        model = RandomForestClassifier(random_state = 12345, n_estimators = est, max_depth = depth)
        model.fit(features_train, target_train)
        score = model.score(features_valid, target_valid)
        if score > best_score_forest:
            best_score_forest = score
            best_model_depth = depth
            best_model_est = est

print('Accuracy of best model on validation set (depth = {}, est = {}): {}'.format(best_model_depth, best_model_est, best_score_forest*100), end = '%')

Accuracy of best model on validation set (depth = 8, est = 40): 80.87091757387248%

**A max_depth of 8 and n_estimators of 40 gave us the best model, with 80.87% accuracy on the validation set**
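The 50 combinations come from crossing five `n_estimators` values with ten `max_depth` values. As a quick sanity check on that count, the same grid can be enumerated with `itertools.product` (a sketch of the search space only, no models are trained here):

```python
from itertools import product

n_estimators_values = range(10, 51, 10)  # 10, 20, 30, 40, 50
max_depth_values = range(1, 11)          # 1 through 10

# Same order as the nested loops: estimators in the outer loop, depth in the inner loop
grid = list(product(n_estimators_values, max_depth_values))

print(len(grid))
print(grid[0], grid[-1])
```

Flattening the grid like this is equivalent to the nested loops and makes it easy to count or log the combinations being tried.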

In [10]:
final_forest_model = RandomForestClassifier(random_state = 12345, n_estimators = 40, max_depth = 8)
final_forest_model.fit(features_train, target_train)

test_score = final_forest_model.score(features_test, target_test)
print('Accuracy of final model on test set:', test_score*100, end = '%')

Accuracy of final model on test set: 79.62674961119751%

**The final model, built with our optimal hyperparameters, scored 79.63% accuracy on the test set. This is slightly lower than its score on the validation set, but better than our DecisionTreeClassifier model**

***LogisticRegression***

In [11]:
log_model = LogisticRegression(random_state = 12345, solver = 'liblinear')
log_model.fit(features_train, target_train)
log_score_train = log_model.score(features_train, target_train)
log_score_valid = log_model.score(features_valid, target_valid)
log_score_test = log_model.score(features_test, target_test)

print(f'Accuracy of model on training set: {log_score_train * 100}%')
print(f'Accuracy of model on valid set: {log_score_valid * 100}%')
print(f'Accuracy of model on test set: {log_score_test * 100}%')

Accuracy of model on training set: 71.57676348547717%
Accuracy of model on valid set: 70.91757387247279%
Accuracy of model on test set: 68.89580093312597%
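To put the validation-to-test drops in perspective, it can help to convert them into a number of rows. The sketch below uses the accuracies printed above and assumes both the validation and test sets hold 643 rows (about 20% of 3214); `gap_report` is a hypothetical helper:

```python
N_ROWS = 643  # assumed size of both the validation and the test set

# (validation accuracy, test accuracy), copied from the printed outputs above
scores = {
    "DecisionTreeClassifier": (0.7853810264385692, 0.7791601866251943),
    "RandomForestClassifier": (0.8087091757387248, 0.7962674961119751),
    "LogisticRegression": (0.7091757387247279, 0.6889580093312597),
}

def gap_report(valid_acc, test_acc, n_rows=N_ROWS):
    """Return the drop in percentage points and the equivalent number of rows."""
    gap = valid_acc - test_acc
    return round(gap * 100, 2), round(gap * n_rows)

report = {name: gap_report(v, t) for name, (v, t) in scores.items()}
for name, (points, rows) in report.items():
    print(f"{name}: -{points} points, about {rows} fewer correct rows")
```

Seen this way, each gap amounts to a handful of rows out of 643, which suggests the differences between validation and test are small for all three models.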

***Final Models and their Accuracy rating against the Test Set:***

**DecisionTreeClassifier: 77.92%**

**RandomForestClassifier: 79.63%**

**LogisticRegression: 68.90%**

***After training our different models, RandomForestClassifier gave us the best results when predicting the 'is_ultra' column on our test set, which is 'new data' from the perspective of our trained model***
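The rounded figures in the summary can be produced straight from the raw scores with Python's percent format specifier, which avoids rounding by hand. A minimal sketch, using the test accuracies printed earlier:

```python
# Raw test-set accuracies, copied from the printed outputs above
test_scores = {
    "DecisionTreeClassifier": 0.7791601866251943,
    "RandomForestClassifier": 0.7962674961119751,
    "LogisticRegression": 0.6889580093312597,
}

# ".2%" multiplies by 100, rounds to two decimals and appends the percent sign
formatted = {name: f"{score:.2%}" for name, score in test_scores.items()}

for name, text in formatted.items():
    print(f"{name}: {text}")
```

The same specifier could replace the `score*100, end = '%'` pattern in the print calls if rounded output is preferred.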

In [12]:
# best model with hyperparameters:

final_model = RandomForestClassifier(random_state = 12345, n_estimators = 40, max_depth = 8)
final_model.fit(features_train, target_train)

final_score = final_model.score(features_test, target_test)

#print(final_score)

**Performing Sanity Testing on our final model: Naive Baseline Check**

In [13]:
#print((target_test == 0).sum())
#print((target_test == 1).sum())
# take the majority class from the training labels so the baseline never looks at the test labels
most_frequent = target_train.mode()[0]
baseline_accuracy = (target_test == most_frequent).sum() / len(target_test)

print(f'Naive baseline accuracy of test set: {baseline_accuracy * 100}%')
print(f'Accuracy of our model against the test set: {final_score * 100}%')

Naive baseline accuracy of test set: 68.42923794712286%
Accuracy of our model against the test set: 79.62674961119751%
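The naive baseline simply predicts the most common class for every row. A dependency-free sketch of the idea, assuming the majority class is learned from the training labels and then scored on the test labels (`majority_baseline` is a hypothetical helper, and the class labels and training counts below are placeholders; only the 440-of-643 test count is implied by the 68.43% figure above):

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Predict the most frequent training label for every test row and return its accuracy."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for label in test_labels if label == majority)
    return majority, hits / len(test_labels)

# Placeholder training labels: an invented 1300/628 split of the 1928 training rows
train_labels = ["major"] * 1300 + ["minor"] * 628
# Test labels: 440 of 643 rows in the larger class reproduces the reported baseline
test_labels = ["major"] * 440 + ["minor"] * 203

majority, baseline = majority_baseline(train_labels, test_labels)
print(majority, f"{baseline:.2%}")
```

scikit-learn offers the same baseline as `DummyClassifier(strategy='most_frequent')` in `sklearn.dummy`, which can be fitted and scored like any other model.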

***Conclusion***

**Best Performing Model & Hyperparameters:**

**RandomForestClassifier, Depth = 8, Estimators = 40**

**Accuracy rating against test set: 79.63%**

**This value passes our expected accuracy threshold of .75, or 75%, and beats the naive baseline accuracy of 68.43%**