**Review**

Hello William!

I'm happy to review your project today.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a pretty good job overall, but there are a few problems that need to be fixed before the project is accepted. Let me know if you have questions!

<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

The introduction is missing. Every project should start with one: what is the project name, what task are you working on, and so on.
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Fixed. Thank you!
    
</div>

Introduction: Megaline Plan Recommendation Model

Megaline, a leading mobile carrier, has found that many of its subscribers still use legacy plans. To improve customer satisfaction and move those subscribers onto current offerings, the company wants a machine learning model that analyzes user behavior and recommends one of its newer plans: either Smart or Ultra.

This project builds a classification model that predicts which of the two plans, Smart or Ultra, fits a subscriber best. The model is trained on historical data from subscribers who have already switched, including the number of calls, call duration, number of messages, and internet usage.

Project Goals

Develop a machine learning model that classifies subscribers into the Smart (0) or Ultra (1) plan.
Achieve an accuracy of at least 75% on the test dataset.
Compare multiple models (Logistic Regression, Decision Tree, Random Forest) and tune hyperparameters for optimal performance.
Ensure model reliability by performing a sanity check against a baseline classifier.

Dataset Overview
The dataset contains monthly usage statistics for 3,214 subscribers, including:

Number of calls and total minutes spent on calls
Number of text messages sent
Amount of mobile data used (MB)
Current plan (is_ultra: 1 = Ultra, 0 = Smart)

Through data analysis, model selection, and hyperparameter tuning, this project aims to deliver a model that meets the 75% accuracy target and helps Megaline guide its customers toward the most appropriate plan.

In [32]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [33]:
df = pd.read_csv('/datasets/users_behavior.csv')

In [34]:
df.shape

(3214, 5)

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [36]:
df.describe()

statistic,calls,minutes,messages,mb_used,is_ultra
count,3214.0,3214.0,3214.0,3214.0,3214.0
mean,63.038892,438.208787,38.281269,17207.673836,0.306472
std,33.236368,234.569872,36.148326,7570.968246,0.4611
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.575,9.0,12491.9025,0.0
50%,62.0,430.6,30.0,16943.235,0.0
75%,82.0,571.9275,57.0,21424.7,1.0
max,244.0,1632.06,224.0,49745.73,1.0


In [37]:
df.duplicated().sum()

0

In [38]:
# Check for missing values
print(df.isnull().sum())  # Should be all zeros; otherwise, handle missing data

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64


In [39]:
df.head(10)

index,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
5,58.0,344.56,21.0,15823.37,0
6,57.0,431.64,20.0,3738.9,1
7,15.0,132.4,6.0,21911.6,0
8,7.0,43.39,3.0,2538.67,1
9,90.0,665.41,38.0,17358.61,0


In [40]:
# Define features (X) and target variable (y)
X = df.drop(columns=['is_ultra'])  # Features: all except 'is_ultra'
y = df['is_ultra']  # Target: 1 (Ultra) or 0 (Smart)

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Correct
    
</div>

In [41]:
# Split 60/20/20: hold out 40% of the data, then halve it into validation and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

In [42]:
# Standardize the features (optional but recommended for Logistic Regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled = scaler.transform(X_test)
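
One way to keep the scaling step tied to the model that needs it is a scikit-learn pipeline. The sketch below is illustrative only: it uses synthetic data from `make_classification` in place of `users_behavior.csv`, and the names `pipe` and `valid_acc` are assumptions, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four usage features (assumption, not the real data)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler is fitted on the training data only, inside pipe.fit()
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipe.fit(X_train, y_train)

# score() applies the already-fitted scaler before predicting
valid_acc = pipe.score(X_valid, y_valid)
print(f"Validation accuracy: {valid_acc:.4f}")
```

With this layout there is no separate `X_valid_scaled` or `X_test_scaled` to keep track of, which lowers the risk of scoring a model on unscaled data by mistake.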

<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

You need to split the data into 3 parts: train, validation and test. After you find the best model based on quality on validation data, you need to check the quality only for this model on the test data.
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Good job!
    
</div>

In [43]:
# Train and evaluate different models

# 1. Logistic Regression
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)
log_reg_preds = log_reg.predict(X_valid_scaled)
log_reg_acc = accuracy_score(y_valid, log_reg_preds)
print(f'Logistic Regression Accuracy: {log_reg_acc:.4f}')

Logistic Regression Accuracy: 0.7403


In [44]:
# 2. Decision Tree Classifier (Tuning max_depth)
best_dt_acc = 0
best_dt_depth = None
for depth in range(1, 21):
    dt = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    dt.fit(X_train, y_train)
    dt_preds = dt.predict(X_valid)
    dt_acc = accuracy_score(y_valid, dt_preds)
    if dt_acc > best_dt_acc:
        best_dt_acc = dt_acc
        best_dt_depth = depth
print(f'Best Decision Tree Accuracy: {best_dt_acc:.4f} (max_depth={best_dt_depth})')

Best Decision Tree Accuracy: 0.7947 (max_depth=8)


In [45]:
# 3. Random Forest Classifier (Tuning n_estimators)
best_rf_acc = 0
best_rf_estimators = None
for n in range(10, 101, 10):
    rf = RandomForestClassifier(n_estimators=n, random_state=12345)
    rf.fit(X_train, y_train)
    rf_preds = rf.predict(X_valid)
    rf_acc = accuracy_score(y_valid, rf_preds)
    if rf_acc > best_rf_acc:
        best_rf_acc = rf_acc
        best_rf_estimators = n
print(f'Best Random Forest Accuracy: {best_rf_acc:.4f} (n_estimators={best_rf_estimators})')

Best Random Forest Accuracy: 0.7994 (n_estimators=70)


In [46]:
# Random Forest had the highest validation accuracy, so refit it with the best n_estimators
best_model = RandomForestClassifier(n_estimators=best_rf_estimators, random_state=12345)
best_model.fit(X_train, y_train)

RandomForestClassifier(n_estimators=70, random_state=12345)

In [47]:
# Final test on the test set
final_preds = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, final_preds)
print(f'Final Model Test Accuracy: {final_accuracy:.4f}')

Final Model Test Accuracy: 0.8103


In [48]:
# Sanity Check: Predicting all "Smart" plan (majority class) for baseline comparison
baseline_preds = np.zeros(len(y_test))  # Predict all zeros (Smart)
baseline_acc = accuracy_score(y_test, baseline_preds)
print(f'Baseline Accuracy (predicting all Smart): {baseline_acc:.4f}')

Baseline Accuracy (predicting all Smart): 0.6967


<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

1. This is a classification task, not a regression task, so you need to use classification models.
2. For the same reason, you need to use classification metrics rather than regression ones. The main metric of this project is accuracy, so please use accuracy instead of RMSE.
3. After you find the best model based on quality on validation data, you need to check the quality only for this model on the test data.
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Everything is correct now. Well done!
    
</div>

Megaline Plan Recommendation Model
1. Data Exploration and Preprocessing
Dataset Overview
The dataset consists of 3,214 observations and 5 columns:
Features:
calls: Number of calls per month
minutes: Total call duration in minutes
messages: Number of text messages
mb_used: Internet traffic used in MB
Target Variable:
is_ultra (1 = Ultra, 0 = Smart)
No missing values or duplicate rows were found, so no data cleaning was needed.
Descriptive Statistics

| Feature | Mean | Std Dev | Min | 25% | 50% (Median) | 75% | Max |
|---|---|---|---|---|---|---|---|
| Calls | 63.04 | 33.24 | 0 | 40 | 62 | 82 | 244 |
| Minutes | 438.21 | 234.57 | 0 | 274.57 | 430.6 | 571.93 | 1632.06 |
| Messages | 38.28 | 36.15 | 0 | 9 | 30 | 57 | 224 |
| MB Used | 17,207.67 | 7,570.97 | 0 | 12,491.9 | 16,943.23 | 21,424.7 | 49,745.73 |

Data Distribution Observations:
Some users have 0 calls, 0 minutes, 0 messages, or 0 MB used, which may indicate inactive accounts or missing activity for certain months.
The target variable is imbalanced:
30.6% of users are on the Ultra plan.
69.4% of users are on the Smart plan.
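
The class shares above can be read directly from the target column. The following is a minimal sketch on a hypothetical ten-row series standing in for `df['is_ultra']`; the toy values are an assumption chosen for illustration, not the real data.

```python
import pandas as pd

# Hypothetical toy target standing in for df['is_ultra'] (3 Ultra, 7 Smart)
is_ultra = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="is_ultra")

# Share of each class, indexed by label (0 = Smart, 1 = Ultra)
class_share = is_ultra.value_counts(normalize=True).sort_index()

# Accuracy of always predicting the most common class
majority_baseline = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

The larger share is also the accuracy a constant "always Smart" prediction would reach on the same data, which is why the imbalance matters for the sanity check later on.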
2. Data Splitting
The dataset was split into:

Training set (60%): Used for model learning.
Validation set (20%): Used for hyperparameter tuning.
Test set (20%): Used for final evaluation.
X_train (60%) → Model Training
X_valid (20%) → Hyperparameter Tuning
X_test  (20%) → Final Model Evaluation
A random_state of 42 in both train_test_split calls makes the splits reproducible.
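
The two-step split can be checked on toy data. This sketch assumes a synthetic target with a 70/30 class ratio, and it adds `stratify`, which the notebook itself does not use; stratification is shown here only as an option for keeping the class ratio the same in every part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 1,000 rows, 30% positive class
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 700 + [1] * 300)

# Step 1: hold out 40% of the rows
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
# Step 2: halve the held-out part into validation and test
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_valid), len(X_test))  # 600 200 200
print(y_train.mean(), y_valid.mean(), y_test.mean())
```

Without `stratify`, the sizes are the same but the share of Ultra users can drift slightly between the three parts.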

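3. Model Training & Hyperparameter Tuning
Several models were tested, each scored on the validation set:

3.1 Logistic Regression
Accuracy on Validation Set: 74.03%
Trained on standardized features with default settings. It falls just short of 75% and is outperformed by both tree-based models.

3.2 Decision Tree Classifier
Best Accuracy: 79.47%
Best Depth: max_depth=8 (searched from 1 to 20)
Deeper trees did not improve validation accuracy, so extra depth would only add complexity and overfitting risk.

3.3 Random Forest Classifier
Best Accuracy: 79.94%
Best n_estimators: 70 (searched from 10 to 100 in steps of 10)
The highest-performing model on validation data.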
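
The comparison above can also be done programmatically, so the winning model is picked from the validation scores instead of being hardcoded. This is a rough sketch on synthetic data; the `candidates` dictionary and its hyperparameter values are assumptions for illustration, not the tuned values from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the usage features (assumption, not the real data)
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical candidate models with illustrative settings
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000, random_state=42),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=12345),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=12345),
}

# Fit each candidate on the training set and score it on the validation set
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_valid, model.predict(X_valid))

# Pick the model with the highest validation accuracy
best_name = max(scores, key=scores.get)
best_model = candidates[best_name]
print(f"Best on validation: {best_name} ({scores[best_name]:.4f})")
```

Only `best_model` would then be evaluated on the test set, which matches the reviewer's requirement of testing a single chosen model.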
4. Final Model Performance on Test Set
The best model (Random Forest, n_estimators=70) was evaluated on the test set, yielding:

Final Test Accuracy: 81.03%
Accuracy Threshold (75%) is Achieved ✅

5. Sanity Check (Baseline Model)
A naive model that predicts all users as Smart (majority class) achieves:

Baseline Accuracy: 69.67%
Since 81.03% > 69.67%, the model beats the majority-class baseline by about 11 percentage points, so it is learning real patterns from the usage data rather than just reflecting the class imbalance.
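
The same kind of baseline can be built with scikit-learn's `DummyClassifier`, which learns the majority class from the training labels instead of assuming it is 0. This is a minimal sketch with hypothetical toy labels; the arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy labels (0 = Smart, 1 = Ultra)
y_train = np.array([0] * 7 + [1] * 3)
y_test = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])

# DummyClassifier ignores the features, so placeholder columns are enough
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
print(f"Baseline accuracy: {baseline_acc:.2f}")  # 0.70
```

Any trained model should clear this number by a clear margin before its accuracy is taken as evidence of learning.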

6. Conclusion and Next Steps
Key Takeaways
✅ Best model: Random Forest (n_estimators=70)
✅ Final test accuracy (81.03%) exceeds the required 75%
✅ Model outperforms the majority-class baseline (69.67%)
✅ Separate training/validation/test splits (60/20/20) keep the test estimate independent of tuning
