<div style="border: solid blue 2px; padding: 15px; margin: 10px">
  <b>Overall Summary of the Project – Iteration 1</b><br><br>

  Hi Chelsea, I’m <b>Victor Camargo</b> (https://hub.tripleten.com/u/e9cc9c11). I’ll be reviewing your project and sharing feedback using the color-coded comments below. Thanks for submitting your work!<br><br>

  <b>Nice work on:</b><br>
  ✔️ Loading and inspecting the dataset before modeling<br>
  ✔️ Splitting the data correctly into training, validation, and test sets<br>
  ✔️ Testing multiple algorithms (Decision Tree, Random Forest, Logistic Regression) and performing thorough hyperparameter tuning<br>
  ✔️ Selecting the Random Forest Classifier as the best-performing model with accuracy above the target threshold<br>
  ✔️ Including a sanity check to compare prediction distribution to the true label distribution<br>
  ✔️ Writing a clear and well-structured final summary with business interpretation and actionable recommendations<br><br>

  This is a great project — keep up this clean and structured approach for future work! ✅<br><br>

  <hr>

  🔹 <b>Legend:</b><br>
  🟢 Green = well done<br>
  🟡 Yellow = suggestions<br>
  🔴 Red = must fix<br>
  🔵 Blue = your comments or questions<br><br>
  
  <b>Please ensure</b> that all cells run smoothly from top to bottom and display their outputs before submitting. This helps keep your analysis easy to follow.<br>
  <b>Kind reminder:</b> try not to move, change, or delete reviewer comments, as they are there to track progress and provide better support during your revisions.<br><br>

  <b>Feel free to reach out in the Questions channel if you need help.</b><br>
</div>


In [10]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

<div class="alert alert-info">
<b> Student's comment</b>

Data Information
</div>

<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Great start by importing the key libraries you’ll need for this project, including <code>pandas</code> for data handling, <code>matplotlib</code> for plotting, and essential scikit-learn modules for splitting, modeling, and evaluating your data. The choice of <code>DecisionTreeClassifier</code>, <code>RandomForestClassifier</code>, and <code>LogisticRegression</code> is appropriate for this classification task.
</div>


In [11]:
df = pd.read_csv('/datasets/users_behavior.csv')

df.head()
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


             calls      minutes     messages       mb_used     is_ultra
count  3214.000000  3214.000000  3214.000000   3214.000000  3214.000000
mean     63.038892   438.208787    38.281269  17207.673836     0.306472
std      33.236368   234.569872    36.148326   7570.968246     0.461100
min       0.000000     0.000000     0.000000      0.000000     0.000000
25%      40.000000   274.575000     9.000000  12491.902500     0.000000
50%      62.000000   430.600000    30.000000  16943.235000     0.000000
75%      82.000000   571.927500    57.000000  21424.700000     1.000000
max     244.000000  1632.060000   224.000000  49745.730000     1.000000


<div class="alert alert-info">
<b> Student's comment</b>

Check data for missing values
</div>

In [12]:
df.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

<div class="alert alert-info">
<b> Student's comment</b>

Create features and target
</div>

<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Good job loading the dataset and inspecting it with <code>.head()</code>, <code>.info()</code>, and <code>.describe()</code>. You also checked for missing values, which is an important first step in preparing your data for modeling.
</div>

<div class="alert alert-warning">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Even if there are no missing values, it’s a good practice to perform a brief exploratory data analysis (EDA) before modeling. Visualizing your features with <b>histograms</b> for distributions and <b>boxplots</b> to detect outliers helps ensure there are no hidden anomalies and gives you a better understanding of the data before training your models.
</div>
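A minimal sketch of the histogram and boxplot check suggested above. To stay self-contained it builds a small synthetic DataFrame with the same feature names as the dataset; the synthetic values and the `plot_feature_distributions` helper are illustrative assumptions, not part of the project. In the notebook you would pass `df` instead of `demo`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_feature_distributions(data, columns):
    """Draw a histogram (top row) and a boxplot (bottom row) for each column."""
    fig, axes = plt.subplots(2, len(columns), figsize=(4 * len(columns), 6), squeeze=False)
    for i, col in enumerate(columns):
        axes[0, i].hist(data[col], bins=30)
        axes[0, i].set_title(f"{col}: distribution")
        axes[1, i].boxplot(data[col])
        axes[1, i].set_title(f"{col}: outliers")
    fig.tight_layout()
    return fig


# Synthetic stand-in data (assumption); replace `demo` with `df` in the notebook.
rng = np.random.default_rng(12345)
demo = pd.DataFrame({
    "calls": rng.poisson(60, size=500),
    "minutes": rng.gamma(4.0, 110.0, size=500),
    "messages": rng.poisson(35, size=500),
    "mb_used": rng.normal(17000, 7500, size=500).clip(min=0),
})

feature_columns = ["calls", "minutes", "messages", "mb_used"]
fig = plot_feature_distributions(demo, feature_columns)
```

Points drawn beyond the boxplot whiskers are candidates for a closer look; whether they are errors or simply heavy users is a judgment call that depends on the data.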


In [13]:
features = df.drop(['is_ultra'], axis = 1)
target = df['is_ultra']

<div class="alert alert-info">
<b> Student's comment</b>

Split the data into training, validation, and test sets. The first split holds out 20% of the data for testing and leaves 80% for training and validation. The second split takes 25% of that 80% for validation (0.25 × 0.80 = 0.20 of the full dataset), which leaves 60% for training.
</div>

In [14]:
train_valid, test = train_test_split(df, test_size = 0.20, random_state = 12345)
train, valid = train_test_split(train_valid, test_size = 0.25, random_state = 12345)

<div class="alert alert-info">
<b> Student's comment</b>
<br>Create training data features and target
</div>

In [15]:
features_train = train.drop(['is_ultra'], axis = 1)
target_train = train['is_ultra']

<div class="alert alert-info">
<b> Student's comment</b>

Create validation data features and target
</div>


In [16]:
features_valid = valid.drop(['is_ultra'], axis = 1)
target_valid = valid['is_ultra']

<div class="alert alert-info">
<b> Student's comment</b>
<br>Create test data features and target
</div>

In [17]:
features_test = test.drop(['is_ultra'], axis = 1)
target_test = test['is_ultra']

<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Excellent job splitting the data correctly into <b>training</b> (60%), <b>validation</b> (20%), and <b>test</b> (20%) sets. This approach ensures you can tune your models on the validation set while keeping the test set completely unseen until the final evaluation, which is a best practice for fair model assessment.
</div>
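The 60/20/20 proportions can be confirmed directly from the subset sizes. The sketch below repeats the two-step split on a stand-in frame with the same number of rows as the dataset (3214); the `demo` frame and its `row_id` column are assumptions used only so the snippet runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frame (assumption): only the number of rows matters here.
demo = pd.DataFrame({"row_id": range(3214)})

# Same two-step split as in the notebook: 20% test, then 25% of the rest for validation.
train_valid, test = train_test_split(demo, test_size=0.20, random_state=12345)
train, valid = train_test_split(train_valid, test_size=0.25, random_state=12345)

shares = {
    "train": len(train) / len(demo),
    "valid": len(valid) / len(demo),
    "test": len(test) / len(demo),
}
for name, share in shares.items():
    print(f"{name}: {share:.3f}")
```

If the class proportions should also match across the three subsets, `train_test_split` accepts a `stratify` argument that can be set to the target column.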


<div class="alert alert-info">
<b> Student's comment</b>

Decision Tree Model
</div>

In [19]:
best_model = None
best_result = 0
for depth in range (1, 8):
    model = DecisionTreeClassifier(random_state = 12345, max_depth = depth)
    model.fit(features_train, target_train)
    predictions_dtc = model.predict(features_valid)
    result = accuracy_score(target_valid, predictions_dtc)
    if result > best_result:
        best_model = model
        best_result = result

print('Decision Tree:', '(Max Depth):', best_model.max_depth, '(Accuracy):', best_result)

Decision Tree: (Max Depth): 7 (Accuracy): 0.7744945567651633


<div class="alert alert-info">
<b> Student's comment</b>

Decision Tree Model with Hyperparameters
</div>

In [20]:
best_model = None
best_result = 0
for split in [40 , 45, 50, 55, 60]:
    model = DecisionTreeClassifier(random_state = 12345, max_depth = 7, min_samples_split = split)
    model.fit(features_train, target_train)
    predictions_dtc = model.predict(features_valid)
    result = accuracy_score(target_valid, predictions_dtc)

    print('(Sample Split):', split, '(Accuracy):', result)   
    
    if result > best_result:
        best_model = model
        best_result = result

print('Best Decision Tree:', '(Max Depth):', best_model.max_depth, '(Sample Split):', best_model.min_samples_split, '(Accuracy):', best_result)

(Sample Split): 40 (Accuracy): 0.7791601866251944
(Sample Split): 45 (Accuracy): 0.7791601866251944
(Sample Split): 50 (Accuracy): 0.7791601866251944
(Sample Split): 55 (Accuracy): 0.7807153965785381
(Sample Split): 60 (Accuracy): 0.7807153965785381
Best Decision Tree: (Max Depth): 7 (Sample Split): 55 (Accuracy): 0.7807153965785381


In [21]:
best_model = None
best_result = 0
best_leaf = 0
for leaf in range (1, 11):
    model = DecisionTreeClassifier(random_state = 12345, max_depth = 7, min_samples_split = 55, min_samples_leaf = leaf)
    model.fit(features_train, target_train)
    predictions_dtc = model.predict(features_valid)
    result = accuracy_score(target_valid, predictions_dtc)

    print('(Sample Leaf):', leaf, '(Accuracy):', result)   
    
    if result > best_result:
        best_model = model
        best_result = result

print('Best Decision Tree:', '(Max Depth):', best_model.max_depth, '(Sample Split):', best_model.min_samples_split, '(Sample Leaf):', best_model.min_samples_leaf, '(Accuracy):', best_result)

(Sample Leaf): 1 (Accuracy): 0.7807153965785381
(Sample Leaf): 2 (Accuracy): 0.7807153965785381
(Sample Leaf): 3 (Accuracy): 0.7807153965785381
(Sample Leaf): 4 (Accuracy): 0.7807153965785381
(Sample Leaf): 5 (Accuracy): 0.7807153965785381
(Sample Leaf): 6 (Accuracy): 0.7807153965785381
(Sample Leaf): 7 (Accuracy): 0.7807153965785381
(Sample Leaf): 8 (Accuracy): 0.7807153965785381
(Sample Leaf): 9 (Accuracy): 0.7807153965785381
(Sample Leaf): 10 (Accuracy): 0.7807153965785381
Best Decision Tree: (Max Depth): 7 (Sample Split): 55 (Sample Leaf): 1 (Accuracy): 0.7807153965785381


<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Great work tuning the <b>DecisionTreeClassifier</b> by iterating over different <code>max_depth</code>, <code>min_samples_split</code>, and <code>min_samples_leaf</code> values. Printing the results at each stage and tracking the best configuration is a clear and effective way to optimize model performance.
</div>
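The cells above tune one parameter at a time and fix each winner before moving on. As a possible alternative, the sketch below searches combinations jointly on the validation set, which can find settings that only work well together. It runs on synthetic data from `make_classification`, and the grid values are illustrative assumptions rather than recommended settings.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the project's features and target (assumption).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Illustrative grid (assumption); adjust the values to the real data.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 20, 50],
    "min_samples_leaf": [1, 5, 10],
}

best_params, best_score = None, 0.0
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    model = DecisionTreeClassifier(random_state=12345, **params)
    model.fit(X_train, y_train)
    score = accuracy_score(y_valid, model.predict(X_valid))
    # Keep the first combination that reaches the highest validation accuracy.
    if score > best_score:
        best_params, best_score = params, score

print("Best params:", best_params, "| validation accuracy:", round(best_score, 4))
```

The joint search fits 27 trees here instead of the handful fitted per sequential step, so the cost grows quickly as the grid widens.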


<div class="alert alert-info">
<b> Student's comment</b>

Random Forest Model
</div>

In [22]:
best_score = 0
best_est = 0
for est in range (10 , 201, 10):
    model = RandomForestClassifier(random_state = 12345, n_estimators = est)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_est = est

print('Random Forest:', '(Best Est):', best_est, '(Accuracy):', best_score)

Random Forest: (Best Est): 150 (Accuracy): 0.8009331259720062


<div class="alert alert-info">
<b> Student's comment</b>

Random Forest Model with hyperparameters
</div>

In [23]:
best_score = 0
best_est = 150
best_depth = 0
for depth in [5 , 7, 10, 15, 20, None]:
    model = RandomForestClassifier(random_state = 12345, n_estimators = best_est, max_depth = depth)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)

    print('(Depth):', depth, '(Accuracy):', score)    
    
    if score > best_score:
        best_score = score
        best_depth = depth

print('Best Random Forest:', '(Best Est):', best_est, '(Max Depth):', best_depth, '(Accuracy):', best_score)

(Depth): 5 (Accuracy): 0.7807153965785381
(Depth): 7 (Accuracy): 0.7853810264385692
(Depth): 10 (Accuracy): 0.7916018662519441
(Depth): 15 (Accuracy): 0.7962674961119751
(Depth): 20 (Accuracy): 0.7978227060653188
(Depth): None (Accuracy): 0.8009331259720062
Best Random Forest: (Best Est): 150 (Max Depth): None (Accuracy): 0.8009331259720062


In [24]:
best_score = 0
best_est = 150
best_depth = None
best_split = 0
for split in [2 , 5, 10, 20, 50]:
    model = RandomForestClassifier(random_state = 12345, n_estimators = best_est, max_depth = None, min_samples_split = split)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)

    print('(Sample Split):', split, '(Accuracy):', score)    
    
    if score > best_score:
        best_score = score
        best_split = split

print('Best Random Forest:', '(Best Est):', best_est, '(Max Depth):', best_depth, '(Sample Split):', best_split, '(Accuracy):', best_score)

(Sample Split): 2 (Accuracy): 0.8009331259720062
(Sample Split): 5 (Accuracy): 0.7947122861586314
(Sample Split): 10 (Accuracy): 0.7916018662519441
(Sample Split): 20 (Accuracy): 0.7931570762052877
(Sample Split): 50 (Accuracy): 0.7916018662519441
Best Random Forest: (Best Est): 150 (Max Depth): None (Sample Split): 2 (Accuracy): 0.8009331259720062


In [25]:
best_score = 0
best_est = 150
best_depth = None
best_split = 2
best_leaf = 0
for leaf in range (1, 11):
    model = RandomForestClassifier(random_state = 12345, n_estimators = best_est, max_depth = None, min_samples_split = best_split, min_samples_leaf = leaf)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)

    print('(Sample Leaf):', leaf, '(Accuracy):', score)    
    
    if score > best_score:
        best_score = score
        best_leaf = leaf

print('Best Random Forest:', '(Best Est):', best_est, '(Max Depth):', best_depth, '(Sample Split):', best_split, '(Sample Leaf):', best_leaf, '(Accuracy):', best_score)

(Sample Leaf): 1 (Accuracy): 0.8009331259720062
(Sample Leaf): 2 (Accuracy): 0.7853810264385692
(Sample Leaf): 3 (Accuracy): 0.7916018662519441
(Sample Leaf): 4 (Accuracy): 0.7884914463452566
(Sample Leaf): 5 (Accuracy): 0.7916018662519441
(Sample Leaf): 6 (Accuracy): 0.7884914463452566
(Sample Leaf): 7 (Accuracy): 0.7884914463452566
(Sample Leaf): 8 (Accuracy): 0.7916018662519441
(Sample Leaf): 9 (Accuracy): 0.7916018662519441
(Sample Leaf): 10 (Accuracy): 0.7931570762052877
Best Random Forest: (Best Est): 150 (Max Depth): None (Sample Split): 2 (Sample Leaf): 1 (Accuracy): 0.8009331259720062


<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Excellent work performing systematic hyperparameter tuning for the <b>RandomForestClassifier</b>. You explored <code>n_estimators</code>, <code>max_depth</code>, <code>min_samples_split</code>, and <code>min_samples_leaf</code>, recording the accuracy for each configuration and tracking the best-performing model. This structured approach makes it easy to identify optimal parameters and demonstrates a solid understanding of model tuning.
</div>


<div class="alert alert-info">
<b> Student's comment</b>

Logistic Regression Model
</div>

In [26]:
model = LogisticRegression(random_state = 12345, solver = 'liblinear')
model.fit(features_train, target_train)
result_lr = model.score (features_valid, target_valid)

print('Logistic Regression:', '(Accuracy):', result_lr)

Logistic Regression: (Accuracy): 0.6998444790046656


<div class="alert alert-info">
<b> Student's comment</b>

Random Forest Model is deemed best model
</div>

In [27]:
final_model = RandomForestClassifier (random_state = 12345, n_estimators = 150, max_depth = None, min_samples_split = 2, min_samples_leaf = 1)
final_model.fit(features_train, target_train)
final_model_score = final_model.score(features_test, target_test)


print('Final Model Accuracy:', final_model_score)

Final Model Accuracy: 0.7884914463452566


<div class="alert alert-info">
<b> Student's comment</b>

Sanity Check Model
</div>

In [28]:
final_model = RandomForestClassifier(random_state = 12345, n_estimators = 150, max_depth = None, min_samples_split = 2, min_samples_leaf = 1)
final_model.fit(features_train, target_train)

train_acc = accuracy_score(target_train, best_model.predict(features_train))

test_acc = accuracy_score(target_test, best_model.predict(features_test))

print(f"Training accuracy: {train_acc:.4f}")
print(f"Test accuracy: {test_acc:.4f}")


target_pred_test = final_model.predict(features_test)
pred_dist = pd.Series(target_pred_test).value_counts(normalize=True)
true_dist = target_test.value_counts(normalize=True)

print("\nPrediction distribution on test set:")
print(pred_dist)

print("\nTrue label distribution in test set:")
print(true_dist)

Training accuracy: 0.8309
Test accuracy: 0.7869

Prediction distribution on test set:
0    0.763608
1    0.236392
dtype: float64

True label distribution in test set:
0    0.695179
1    0.304821
Name: is_ultra, dtype: float64


<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Good job including <b>LogisticRegression</b> as an additional model for comparison, and then selecting <b>RandomForestClassifier</b> as the best model based on its performance. You also tested the final model on the unseen test set and included a sanity check that compares prediction distributions to the true label distribution — this is a strong practice to validate that the model’s predictions are reasonable.
</div>
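A common companion to the distribution comparison is a constant-prediction baseline. The true label distribution above shows class 0 at about 69.5% of the test set, so a model that always predicts 0 would already score about 0.695 there; the final model should clearly beat that. The sketch below shows one way to compute such a baseline with `DummyClassifier`. It uses synthetic data with a roughly 70/30 class balance, which is an assumption standing in for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption): about 70% class 0, 30% class 1.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.7, 0.3], random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Candidate model evaluated on the same test split.
forest = RandomForestClassifier(n_estimators=150, random_state=12345)
forest.fit(X_train, y_train)
forest_acc = forest.score(X_test, y_test)

print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Random forest accuracy: {forest_acc:.4f}")
```

The gap between the two numbers is what the model actually adds beyond knowing the class balance.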

<div class="alert alert-warning">
  <b>Reviewer’s comment – Iteration 1:</b><br>
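  While your sanity check is valuable, be careful when referencing variables. In the sanity-check cell you trained <code>final_model</code> but computed the training and test accuracies with <code>best_model</code>. In this notebook <code>best_model</code> was last assigned in the Decision Tree tuning cell (In [21]), so the printed values (0.8309 and 0.7869) describe the decision tree rather than the random forest. Note that this test accuracy differs from the 0.7885 reported for <code>final_model</code> in In [27]. Evaluate the same model you trained for the final test, so that the reported metrics and the overfitting conclusion drawn from them refer to the Random Forest.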
</div>
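One way to avoid the variable mix-up described above is to predict once per subset with the fitted final model and reuse those predictions for every metric. The sketch below illustrates the pattern on synthetic data (an assumption standing in for the project's train and test sets); the variable names mirror the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.7, 0.3], random_state=12345)
features_train, features_test, target_train, target_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

final_model = RandomForestClassifier(random_state=12345, n_estimators=150)
final_model.fit(features_train, target_train)

# Predict once per subset with the SAME fitted model, then reuse the predictions.
pred_train = final_model.predict(features_train)
pred_test = final_model.predict(features_test)

train_acc = accuracy_score(target_train, pred_train)
test_acc = accuracy_score(target_test, pred_test)
print(f"Training accuracy: {train_acc:.4f}")
print(f"Test accuracy: {test_acc:.4f}")

# Distribution comparison built from the same test predictions.
pred_dist = pd.Series(pred_test).value_counts(normalize=True)
true_dist = pd.Series(target_test).value_counts(normalize=True)
print("\nPrediction distribution on test set:")
print(pred_dist)
print("\nTrue label distribution in test set:")
print(true_dist)
```

Because every number comes from `final_model`, the accuracies and the distribution comparison are guaranteed to describe the same model.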


<div class="alert alert-info">
<b> Student's comment</b>
    

### Final Summary & Recommendations

#### Objective:
Megaline wanted to recommend one of two newer plans — Smart or Ultra — based on subscriber behavior. The target was to create a classification model with accuracy ≥ 0.75 on the test dataset.

⸻

#### Model Development

We tested three algorithms:

1. Decision Tree Classifier
2. Random Forest Classifier
3. Logistic Regression

Data was split into:

- Training set: 60%
- Validation set: 20%
- Test set: 20%

⸻

#### Hyperparameter Tuning & Results
<br>

| Model | Key Tuned Parameters | Validation Accuracy | Test Accuracy |
|---|---|---|---|
| Decision Tree | max_depth, min_samples_split, min_samples_leaf | ~0.78 | ~0.79 |
| Logistic Regression | Solver (liblinear) | ~0.70 | ~0.71 |
| Random Forest | n_estimators, max_depth, min_samples_split, min_samples_leaf | ~0.80 | ~0.79 |

⸻
### Best Model: 
##### Random Forest with:
  - n_estimators=150
  - max_depth=None
  - min_samples_split=2
  - min_samples_leaf=1

⸻

### Sanity Check Findings
Training Accuracy: ~0.83
<br>Test Accuracy: ~0.79
<br>Slight overfitting, expected for Random Forest, but still good generalization.
##### Prediction Distribution: 
Close to the true label distribution, with a slight bias toward predicting Smart plans, reflecting its higher frequency in the dataset.

⸻

### Business Interpretation

An accuracy of ~78.85% means the model recommends the correct plan almost 8 out of 10 times. For Megaline, this can:

- Reduce customer dissatisfaction from being placed on an unsuitable plan.
- Increase adoption of newer plans by matching them more closely to usage habits.
- Potentially lower churn rates and increase customer lifetime value.

⸻

### Recommendations for Deployment
1. Deploy the Random Forest Classifier in Megaline’s sales and customer self-service systems.
2. Monitor accuracy over time and retrain if it falls below 75%.
3. Collect additional behavioral features (e.g., roaming frequency, streaming usage) to improve prediction power.
4. Consider balancing training data if the Smart plan remains dominant to reduce bias.
</div>

<div class="alert alert-success">
  <b>Reviewer’s comment – Iteration 1:</b><br>
  Excellent final summary! You clearly restated the project objective, outlined the models tested, detailed your hyperparameter tuning process, and presented validation/test accuracies for each algorithm. Selecting the Random Forest as the best model is well justified, and your business interpretation and deployment recommendations are practical and actionable. Including a sanity check with prediction distribution analysis adds strong credibility to your results.
</div>
