# Project: Classifying Megaline Plans – Smart or Ultra
## 🎯 Objective

#### The goal of this project is to develop a binary classification model that recommends one of Megaline's current mobile plans, Smart or Ultra, based on subscriber usage behavior. The model must reach an accuracy of at least 0.75 on unseen test data.

#### The analysis uses historical monthly usage data for Megaline customers: number of calls, total call duration in minutes, messages sent, and internet traffic used in MB. The output is a class prediction: Smart (0) or Ultra (1).
 



# Initial Data Exploration (Light EDA)
#### • Load the dataset.
#### • Inspect it with info(), describe(), and head().
#### • Check for missing values, outliers, class balance, and feature correlations.

In [2]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [18]:
data = pd.read_csv('users_behavior.csv')


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [5]:
data.describe()

           calls     minutes   messages       mb_used  is_ultra
count     3214.0      3214.0     3214.0        3214.0    3214.0
mean   63.038892  438.208787  38.281269  17207.673836  0.306472
std    33.236368  234.569872  36.148326   7570.968246    0.4611
min          0.0         0.0        0.0           0.0       0.0
25%         40.0     274.575        9.0    12491.9025       0.0
50%         62.0       430.6       30.0     16943.235       0.0
75%         82.0    571.9275       57.0       21424.7       1.0
max        244.0     1632.06      224.0      49745.73       1.0
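
#### The checks listed at the start of this section (missing values, outliers, class balance, correlations) are not shown as code in this notebook. The sketch below is one possible way to run them. The helper name `light_eda`, the 1.5 × IQR outlier rule, and the toy DataFrame are assumptions made for illustration; only the column names come from the dataset. On the real data, `describe()` already reports a mean of about 0.31 for `is_ultra`, meaning roughly 31% of the rows are Ultra.

```python
import numpy as np
import pandas as pd


def light_eda(df, target="is_ultra"):
    """Return quick data-quality summaries for a usage DataFrame (hypothetical helper)."""
    features = df.drop(columns=[target])

    # Flag values outside 1.5 * IQR of each feature (a common rule of thumb, assumed here).
    q1 = features.quantile(0.25)
    q3 = features.quantile(0.75)
    iqr = q3 - q1
    outside = (features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)

    return {
        "missing": df.isna().sum(),                               # NaN count per column
        "class_share": df[target].value_counts(normalize=True),   # class balance
        "outlier_counts": outside.sum(),                          # flagged rows per feature
        "correlations": df.corr(),                                # pairwise Pearson correlations
    }


# Toy data (synthetic, NOT the Megaline dataset) with one missing value and one extreme row.
toy = pd.DataFrame({
    "calls": [40, 62, 82, 55, 70, 48, 66, 244],
    "minutes": [274.6, 430.6, 571.9, 380.0, 490.2, 310.5, 455.0, 1632.1],
    "messages": [9, 30, 57, 0, 41, 12, np.nan, 224],
    "mb_used": [12491.9, 16943.2, 21424.7, 15000.0, 18000.0, 9000.0, 20000.0, 49745.7],
    "is_ultra": [0, 0, 1, 0, 0, 0, 1, 1],
})

report = light_eda(toy)
for name, summary in report.items():
    print(f"--- {name} ---")
    print(summary)
```

#### With the real data, the same call would be `light_eda(data)`.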


## Data Preparation
 

In [None]:
# Define features and target variable
target = data['is_ultra']
features = data.drop(columns=['is_ultra'])

# Hold out 25% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(features, target, test_size=0.25, random_state=54321)

# Split the remaining 75% into training (67%) and validation (33%) sets
X_train, X_valid, y_train, y_valid = train_test_split(X_temp, y_temp, test_size=0.33, random_state=54321)
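
#### The two-step split above yields roughly a 50/25/25 train/validation/test partition (0.75 × 0.67 ≈ 0.50, 0.75 × 0.33 ≈ 0.25, 0.25). The sketch below checks this on a placeholder frame with the same number of rows as reported by `data.info()`. The frame contents are synthetic and the name `dummy` is an assumption; only the row count and the split parameters come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

n_rows = 3214  # row count reported by data.info()

# Placeholder frame: only the number of rows matters here, the values are synthetic.
dummy = pd.DataFrame({"x": np.arange(n_rows)})

# Same two-step split and random_state as in the cell above.
X_temp, X_test = train_test_split(dummy, test_size=0.25, random_state=54321)
X_train, X_valid = train_test_split(X_temp, test_size=0.33, random_state=54321)

sizes = {"train": len(X_train), "valid": len(X_valid), "test": len(X_test)}
for name, size in sizes.items():
    print(f"{name}: {size} rows ({size / n_rows:.1%})")
```

#### Because the classes are imbalanced, passing `stratify=` to `train_test_split` would keep the Smart/Ultra ratio equal across the three parts. The original cell does not do this, so the reported accuracies refer to unstratified splits.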




# Base Model Training

#### To establish a strong baseline, we selected three classification algorithms widely used for structured data tasks:

#### • Logistic Regression: A fast and interpretable linear model, useful for benchmarking.

#### • Decision Tree Classifier: A flexible, rule-based model that captures non-linear relationships and is easy to visualize.

#### • Random Forest Classifier: An ensemble method that combines multiple decision trees to improve accuracy and reduce overfitting.

#### These models offer a good mix of simplicity, interpretability, and predictive power. Comparing them allows us to understand how linear and non-linear approaches perform on this dataset.
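
#### A trivial reference point helps when reading the accuracies that follow. Since about 31% of the rows are Ultra (mean of `is_ultra` ≈ 0.31 in the `describe()` output), a model that always predicts Smart would already score close to 0.69 on the full dataset. The sketch below shows how such a majority-class baseline could be measured with scikit-learn's `DummyClassifier`. The data is synthetic with a similar class ratio, so the printed number is illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 4 random features, about 30% positive labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.3).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=54321)

# Always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_accuracy = accuracy_score(y_va, baseline.predict(X_va))

print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

#### Any candidate model should beat this baseline by a clear margin before its accuracy is considered meaningful.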

## Model: Decision Tree Classifier

In [14]:
# Train a Decision Tree Classifier
model = DecisionTreeClassifier(random_state=54321)
model.fit(X_train, y_train)

# Make predictions on the validation set
predictions = model.predict(X_valid)

# Evaluate the model's accuracy on the validation set
accuracy = accuracy_score(y_valid, predictions)

print(f"Model Accuracy: {accuracy:.2f}")
print("Model training and evaluation completed successfully.")

Model Accuracy: 0.73
Model training and evaluation completed successfully.


## Model: Random Forest Classifier

In [15]:
# Train a Random Forest Classifier
rf_model = RandomForestClassifier(random_state=54321)
rf_model.fit(X_train, y_train)

# Make predictions with the Random Forest model
rf_predictions = rf_model.predict(X_valid)
rf_accuracy = accuracy_score(y_valid, rf_predictions)

print(f"Random Forest Model Accuracy: {rf_accuracy:.2f}")
print("Model training and evaluation completed successfully.")

Random Forest Model Accuracy: 0.80
Model training and evaluation completed successfully.


## Model: Logistic Regression

In [16]:
# Train a Logistic Regression model
lr_model = LogisticRegression(random_state=54321, max_iter=1000)
lr_model.fit(X_train, y_train)

# Make predictions with the Logistic Regression model
lr_predictions = lr_model.predict(X_valid)
lr_accuracy = accuracy_score(y_valid, lr_predictions)

print(f"Logistic Regression Model Accuracy: {lr_accuracy:.2f}")
print("Model training and evaluation completed successfully.")

Logistic Regression Model Accuracy: 0.73
Model training and evaluation completed successfully.
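
#### The features live on very different scales: `mb_used` is in the tens of thousands while `messages` is in the tens (see the `describe()` output). Logistic regression is sensitive to this, so one variant worth trying is to standardize the features first. The sketch below wraps `StandardScaler` and `LogisticRegression` in a pipeline. It runs on synthetic data with the same column names and an invented label rule, so it does not show whether scaling would change the 0.73 validation score on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the usage data (assumption, not the Megaline dataset).
rng = np.random.default_rng(0)
n = 600
features = pd.DataFrame({
    "calls": rng.poisson(60, n).astype(float),
    "minutes": rng.gamma(4.0, 110.0, n),
    "messages": rng.poisson(38, n).astype(float),
    "mb_used": rng.gamma(5.0, 3400.0, n),
})
# Invented label rule, used only to give the model something to learn.
target = ((features["minutes"] > 600) | (features["mb_used"] > 24000)).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    features, target, test_size=0.25, random_state=54321
)

# The scaler is fitted on the training part only, then applied to validation data.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=54321, max_iter=1000),
)
scaled_lr.fit(X_tr, y_tr)
scaled_lr_accuracy = accuracy_score(y_va, scaled_lr.predict(X_va))

print(f"Scaled Logistic Regression accuracy (synthetic data): {scaled_lr_accuracy:.2f}")
```

#### Tree-based models such as the decision tree and random forest do not need this step, because their splits depend only on the ordering of feature values.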


## Tuning Hyperparameters
#### Focus on tuning the RandomForestClassifier, as it showed the best validation accuracy (0.80).

In [13]:
best_model = None
best_accuracy = 0
best_params = {}

# Try every combination of tree depth and forest size;
# keep the model with the highest validation accuracy.
for depth in [5, 10, 15]:
    for est in [50, 100, 150]:
        rf_model = RandomForestClassifier(max_depth=depth, n_estimators=est, random_state=54321)
        rf_model.fit(X_train, y_train)
        rf_predictions = rf_model.predict(X_valid)
        rf_accuracy = accuracy_score(y_valid, rf_predictions)
        if rf_accuracy > best_accuracy:
            best_accuracy = rf_accuracy
            best_model = rf_model
            best_params = {'max_depth': depth, 'n_estimators': est}

print(f"Best Random Forest Model Accuracy: {best_accuracy:.2f}")
print(f"Best Hyperparameters: {best_params}")


Best Random Forest Model Accuracy: 0.80
Best Hyperparameters: {'max_depth': 5, 'n_estimators': 100}
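
#### The manual loop above scores each combination on a single validation split. An alternative, not used in this notebook, is scikit-learn's `GridSearchCV`, which scores each combination with cross-validation on the training data. The sketch below applies the same grid to synthetic data from `make_classification`, so the selected parameters are illustrative and need not match the result above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 4-feature data (assumption, stands in for X_train / y_train).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

# Same grid as the manual loop.
param_grid = {"max_depth": [5, 10, 15], "n_estimators": [50, 100, 150]}

search = GridSearchCV(
    RandomForestClassifier(random_state=54321),
    param_grid=param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation instead of one fixed validation split
)
search.fit(X_demo, y_demo)

print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
print(f"Best hyperparameters: {search.best_params_}")
```

#### After fitting, `search.best_estimator_` holds a model refitted on all the data passed to `fit`, which plays the role of `best_model` in the loop.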


## Sanity Check
#### To validate the model beyond its accuracy score, we inspect its predictions for synthetic edge-case inputs: a user with zero activity, a user with very heavy calling and messaging, and a user with very light usage.

#### These checks help verify that the model's behavior is not only accurate but also reasonable in real-world scenarios.

In [11]:
# Test cases for prediction
test_cases = pd.DataFrame({
    'calls': [0, 150, 10],
    'minutes': [0, 1500, 50],
    'messages': [0, 300, 1],
    'mb_used': [0, 20000, 500]
})

# Make predictions on the test cases using the best model
predictions = best_model.predict(test_cases)

# Output the predictions for each test case
for i, pred in enumerate(predictions):
    plan = 'Ultra' if pred == 1 else 'Smart'
    print(f"Test Case {i+1}: Predicted Plan → {plan}")


Test Case 1: Predicted Plan → Ultra
Test Case 2: Predicted Plan → Ultra
Test Case 3: Predicted Plan → Smart
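
#### Hard labels hide how confident the model is. For example, the all-zero user above is predicted as Ultra, and the label alone does not say whether that is a near coin flip or a strong vote. One way to look closer is `predict_proba`, sketched below. The forest here is trained on synthetic data with an invented label rule, so the probabilities it prints say nothing about the real `best_model`; the pattern is what matters.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic training data (assumption, not the Megaline dataset).
rng = np.random.default_rng(0)
n = 500
train = pd.DataFrame({
    "calls": rng.poisson(60, n).astype(float),
    "minutes": rng.gamma(4.0, 110.0, n),
    "messages": rng.poisson(38, n).astype(float),
    "mb_used": rng.gamma(5.0, 3400.0, n),
})
# Invented label rule, used only so the forest has a pattern to learn.
labels = ((train["minutes"] > 600) | (train["mb_used"] > 24000)).astype(int)

demo_model = RandomForestClassifier(max_depth=5, n_estimators=100, random_state=54321)
demo_model.fit(train, labels)

# Same edge cases as in the sanity-check cell.
test_cases = pd.DataFrame({
    "calls": [0, 150, 10],
    "minutes": [0, 1500, 50],
    "messages": [0, 300, 1],
    "mb_used": [0, 20000, 500],
})

# Columns follow demo_model.classes_, i.e. [0, 1] = [Smart, Ultra].
probs = pd.DataFrame(
    demo_model.predict_proba(test_cases),
    columns=["p_smart", "p_ultra"],
)
print(probs.round(2))
```

#### On the real model, the equivalent call would be `best_model.predict_proba(test_cases)`.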


## Final Model Evaluation

#### • Retrain or reuse the final model.

#### • Evaluate on the test set using accuracy_score.

#### • Confirm the model meets the 0.75 threshold.

In [12]:
# Evaluate the best model on the test set
y_test_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)

print(f"Final Model Accuracy on Test Set: {test_accuracy:.2f}")

# Check if the model meets the required threshold   
if test_accuracy >= 0.75:
    print("✅ The model meets the required threshold.")
else:
    print("❌ The model does not meet the required threshold.")


Final Model Accuracy on Test Set: 0.80
✅ The model meets the required threshold.
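
#### Accuracy alone can hide weak performance on the minority class, and Ultra makes up only about 31% of the data. A confusion matrix and per-class recall give a fuller picture. The sketch below shows how they could be computed. It uses synthetic data from `make_classification` with a similar class ratio, so its numbers are not results for the Megaline model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): about 70% class 0, 30% class 1.
X_demo, y_demo = make_classification(
    n_samples=800, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=54321
)

clf = RandomForestClassifier(max_depth=5, n_estimators=100, random_state=54321)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes (order: 0 = Smart, 1 = Ultra).
cm = confusion_matrix(y_te, y_pred)
# Share of true Ultra rows that the model labels as Ultra.
ultra_recall = recall_score(y_te, y_pred, pos_label=1)

print(cm)
print(f"Recall for class 1 (Ultra): {ultra_recall:.2f}")
```

#### On the real data, the same two calls would take `y_test` and `y_test_pred` from the cell above.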


# Findings and Interpretation
#### • Of the three models tested, the Random Forest Classifier achieved the highest validation accuracy (0.80), ahead of Logistic Regression and the Decision Tree Classifier (0.73 each).

#### • Hyperparameter tuning over max_depth and n_estimators selected max_depth=5 and n_estimators=100. Validation accuracy stayed at 0.80 when rounded to two decimals, so tuning mainly confirmed that a shallow forest is sufficient.

#### • The selected model scored 0.80 on the test set, above the required 0.75 and in line with its validation accuracy, which suggests it generalizes well to unseen data.

#### • The small number of numeric features and the strength of ensemble methods likely both contributed to this performance.

#### • Future work could explore ensemble stacking or feature engineering to further enhance performance.
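
#### As a rough illustration of the stacking idea mentioned above, the sketch below combines a decision tree and a random forest with a logistic regression meta-model, using scikit-learn's `StackingClassifier`. The choice of base models, their settings, and the synthetic data are all assumptions. This was not run on the Megaline dataset and is not claimed to improve on the 0.80 result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-feature data (assumption, stands in for the usage features).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=54321
)

# Base models produce out-of-fold predictions; the final estimator learns to combine them.
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=54321)),
        ("forest", RandomForestClassifier(max_depth=5, n_estimators=50, random_state=54321)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)
stack.fit(X_tr, y_tr)
stack_accuracy = accuracy_score(y_va, stack.predict(X_va))

print(f"Stacked model accuracy (synthetic data): {stack_accuracy:.2f}")
```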