# Plan Recommendation Model for Megaline Subscribers

## Description:
This project focuses on building a machine learning model to recommend updated plans for Megaline subscribers based on their usage patterns. The goal is to analyze customer behavior and create a classification model that predicts whether a user should switch to the "Smart" or "Ultra" plan.

## Objective:
Develop a machine learning model with the highest possible accuracy to classify subscribers into one of two plans.
Ensure the model meets or exceeds the required accuracy threshold of 0.75 on the test dataset.

## Data Source:
The dataset users_behavior.csv includes monthly usage behavior for each subscriber, such as:

- calls: Number of calls made.
- minutes: Total call duration in minutes.
- messages: Number of text messages sent.
- mb_used: Internet traffic used in megabytes.
- is_ultra: Current plan (Ultra = 1, Smart = 0).

## Approach:
1. Data Preparation:
   - Load and inspect the dataset to understand its structure and content.
   - Split the dataset into training, validation, and test subsets.
2. Model Development:
   - Train various machine learning models with different hyperparameters.
   - Compare models based on validation accuracy to select the best-performing one.
3. Model Evaluation:
   - Evaluate the chosen model's performance on the test set.
   - Conduct a sanity check to ensure the model behaves logically and reliably.
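
The three steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the project's final pipeline: the data is synthetic (same column names as users_behavior.csv, made-up values and a hypothetical labelling rule), the hyperparameters are arbitrary, and the sanity check is assumed to mean a comparison against a majority-class baseline.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for users_behavior.csv (assumption: made-up values).
rng = np.random.default_rng(12345)
n = 1000
demo = pd.DataFrame({
    "calls": rng.poisson(60, n).astype(float),
    "minutes": rng.gamma(4.0, 100.0, n),
    "messages": rng.poisson(40, n).astype(float),
    "mb_used": rng.gamma(5.0, 3500.0, n),
})
# Hypothetical labelling rule with 10% label noise, for illustration only.
demo["is_ultra"] = ((demo["mb_used"] > 22000) | (demo["minutes"] > 600)).astype(int)
flip = rng.random(n) < 0.10
demo.loc[flip, "is_ultra"] = 1 - demo.loc[flip, "is_ultra"]

# 1. Data preparation: 60% train, 20% validation, 20% test.
X = demo.drop(columns="is_ultra")
y = demo["is_ultra"]
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=12345)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=12345)

# 2. Model development: fit each candidate and score it on the validation set.
candidates = {
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=12345),
    "random_forest": RandomForestClassifier(
        n_estimators=50, max_depth=5, random_state=12345),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
}
valid_scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    valid_scores[name] = accuracy_score(y_valid, clf.predict(X_valid))

# 3. Model evaluation: score the validation winner once on the test set,
#    then compare it with a baseline that always predicts the majority class.
best_name = max(valid_scores, key=valid_scores.get)
test_accuracy = accuracy_score(y_test, candidates[best_name].predict(X_test))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))

print("Validation accuracy:", valid_scores)
print(f"Best model: {best_name}, test accuracy: {test_accuracy:.3f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

The test set is touched only once, after model selection, so the reported test accuracy is not biased by the comparison on the validation set.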

## Tools:
Libraries: pandas, scikit-learn, and matplotlib.
Machine Learning Models: Decision Tree, Random Forest, Logistic Regression, and others as needed.

## Deliverables:
- A trained classification model that meets the accuracy requirement.
- Analysis of different models and their hyperparameters.
- Insights and recommendations based on model performance.

This project will help Megaline improve customer satisfaction by identifying and recommending the most suitable plan for their subscribers based on their monthly usage behavior.

In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [2]:
# Load the dataset
df = pd.read_csv('/datasets/users_behavior.csv')

# Preview the first few rows
print(df.head())

# Check dataset structure and basic info
df.info()

# Check for missing values and duplicates
print("\nMissing Values:")
print(df.isnull().sum())

print("\nNumber of Duplicate Rows:", df.duplicated().sum())

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB

Missing Values:
calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

Number of Duplicate Rows: 0


# Dataset Overview:
The dataset has 5 columns:<br>
calls: Number of calls (float).<br>
minutes: Total call duration in minutes (float).<br>
messages: Number of text messages (float).<br>
mb_used: Internet traffic used in MB (float).<br>
is_ultra: Target variable indicating the plan (0 for Smart and 1 for Ultra) (integer).<br>
Total rows: 3214.<br>
No missing values or duplicate rows are present, so no cleaning is necessary.<br>
The target variable is_ultra is already in the appropriate format (integer).<br>
Features such as calls, minutes, messages, and mb_used are all numeric (float64), making them ready for modeling.<br>
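
The overview covers types and completeness but not class balance, which matters because accuracy is the target metric: the share of the majority class is the accuracy of a model that always predicts that class. A minimal sketch of the check, using a tiny hypothetical frame with made-up values rather than the real data:

```python
import pandas as pd

# Tiny hypothetical frame (made-up values) with the same target column name.
demo = pd.DataFrame({"is_ultra": [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]})

# Share of each class; the largest share is the accuracy floor
# that any useful model has to beat.
class_share = demo["is_ultra"].value_counts(normalize=True)
majority_share = class_share.max()

print(class_share)
print(f"Accuracy of always predicting the majority class: {majority_share:.2f}")
```

On the real dataset the same call would be `df['is_ultra'].value_counts(normalize=True)`.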

In [3]:
# Define features and target
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

# Split the dataset into training (60%), validation (20%), and test (20%) sets
features_train, features_temp, target_train, target_temp = train_test_split(
    features, target, test_size=0.4, random_state=12345)  # Initial split: 60% train, 40% temp

features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp, target_temp, test_size=0.5, random_state=12345)  # Split temp into 20% valid, 20% test

# Display the shapes of the resulting subsets
print("Training set size:", features_train.shape, target_train.shape)
print("Validation set size:", features_valid.shape, target_valid.shape)
print("Test set size:", features_test.shape, target_test.shape)

Training set size: (1928, 4) (1928,)
Validation set size: (643, 4) (643,)
Test set size: (643, 4) (643,)


In [4]:
# Train a Decision Tree Classifier on the training set
model = DecisionTreeClassifier(random_state=12345)
model.fit(features_train, target_train)

# Make predictions on the validation set
predictions_valid = model.predict(features_valid)

# Calculate and print accuracy on the validation set
accuracy = accuracy_score(target_valid, predictions_valid)
print("Validation Set Accuracy:", accuracy)

Validation Set Accuracy: 0.713841368584759


# Result Analysis
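The Decision Tree Classifier with default settings achieved an accuracy of 71.38% on the validation set, below the required threshold of 75%. By default the tree has no depth limit and grows until its leaves are pure, so it is likely overfitting the training data. Limiting max_depth is the natural next step.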
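
One way to check whether an unconstrained tree is overfitting is to compare its accuracy on the training set with its accuracy on the validation set. The sketch below does this on synthetic, noisy data (an assumption made so the example runs on its own; it does not reproduce the notebook's numbers).

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on the first feature plus noise,
# so no model can classify it perfectly.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=600) > 0).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)

# No max_depth: the tree keeps splitting until every leaf is pure.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
valid_acc = accuracy_score(y_valid, tree.predict(X_valid))

print(f"Training accuracy:   {train_acc:.3f}")
print(f"Validation accuracy: {valid_acc:.3f}")
```

A training accuracy far above the validation accuracy indicates that the tree has memorized noise. In the notebook the equivalent check would score `model` on `features_train` as well as on `features_valid`.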

In [5]:
# Tune the 'max_depth' hyperparameter for the Decision Tree
best_accuracy = 0
best_depth = 0

for depth in range(1, 11):  # Try depths from 1 to 10
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    accuracy = accuracy_score(target_valid, predictions_valid)
    print(f"max_depth={depth}, Validation Accuracy: {accuracy}")
    
    # Track the best depth and accuracy
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_depth = depth

print(f"\nBest max_depth: {best_depth}, Best Validation Accuracy: {best_accuracy}")

max_depth=1, Validation Accuracy: 0.7542768273716952
max_depth=2, Validation Accuracy: 0.7822706065318819
max_depth=3, Validation Accuracy: 0.7853810264385692
max_depth=4, Validation Accuracy: 0.7791601866251944
max_depth=5, Validation Accuracy: 0.7791601866251944
max_depth=6, Validation Accuracy: 0.7838258164852255
max_depth=7, Validation Accuracy: 0.7822706065318819
max_depth=8, Validation Accuracy: 0.7791601866251944
max_depth=9, Validation Accuracy: 0.7822706065318819
max_depth=10, Validation Accuracy: 0.7744945567651633

Best max_depth: 3, Best Validation Accuracy: 0.7853810264385692


# Analysis of Results
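Optimal Depth:

- The best validation accuracy, 78.54%, was achieved with max_depth=3.

Performance Trend:

- Accuracy rises from 75.43% at max_depth=1 to its peak at max_depth=3.
- Beyond max_depth=3, accuracy fluctuates between 77.45% and 78.38% without exceeding the peak, so deeper trees add complexity with no gain on the validation set.
- The scores for depths 2 through 9 differ by less than one percentage point, which on 643 validation rows amounts to only a few predictions. The preference for depth 3 over its neighbors is therefore weak when judged from this single split.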
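
Because the differences between depths are small, a cross-validated search gives a steadier estimate than a single validation split. The following is a sketch of that alternative using scikit-learn's `GridSearchCV`; it swaps in 5-fold cross-validation for the notebook's hold-out loop and runs on synthetic data from `make_classification`, so its output says nothing about the Megaline dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-feature binary problem (assumption, for a runnable example).
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345)

# Try max_depth 1..10, scoring each depth by mean accuracy over 5 folds.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=12345),
    param_grid={"max_depth": list(range(1, 11))},
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)

print("Best max_depth:", search.best_params_["max_depth"])
print("Mean cross-validated accuracy:", round(search.best_score_, 4))
```

In the notebook this would be fitted on the training rows only, keeping the test set untouched for the final evaluation.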

In [6]:
# Train the Decision Tree with the best max_depth found above (3)
best_model = DecisionTreeClassifier(random_state=12345, max_depth=best_depth)
best_model.fit(features_train, target_train)

# Extract feature importances
importances = best_model.feature_importances_

# Create a DataFrame to display the feature importances
feature_importance = pd.DataFrame({
    'Feature': features_train.columns,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Display feature importance
print(feature_importance)

    Feature  Importance
3   mb_used    0.513756
1   minutes    0.274619
2  messages    0.193568
0     calls    0.018057


# Feature Importance Analysis
The depth-3 Decision Tree ranks the features by how much each one contributes to its splits. Importances measure how heavily a feature is used, not the direction of its effect, so the interpretations below are hypotheses rather than findings:

- mb_used (Internet usage in MB): importance 51.38%. Internet usage is the most influential feature. Users with higher internet usage may be better served by the "Ultra" plan.
- minutes (Total call duration in minutes): importance 27.46%. Total time spent on calls also contributes substantially. Higher call durations may indicate a preference for a more comprehensive plan like "Ultra."
- messages (Number of text messages): importance 19.36%. Text messaging is a moderately important feature. Users sending more messages may benefit from the "Ultra" plan.
- calls (Number of calls): importance 1.81%. The number of calls is the least influential feature. It provides some information but barely affects the prediction.
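
Impurity-based importances are computed from the training data, so it can help to cross-check them with permutation importance, which measures how much validation accuracy drops when one feature's values are shuffled. A minimal sketch using `sklearn.inspection.permutation_importance`, on synthetic data with a hypothetical labelling rule that depends only on mb_used (both are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the notebook's feature names (made-up values).
rng = np.random.default_rng(12345)
n = 800
demo = pd.DataFrame({
    "calls": rng.poisson(60, n).astype(float),
    "minutes": rng.gamma(4.0, 100.0, n),
    "messages": rng.poisson(40, n).astype(float),
    "mb_used": rng.gamma(5.0, 3500.0, n),
})
# Hypothetical rule: the label depends on mb_used alone.
target = (demo["mb_used"] > 20000).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    demo, target, test_size=0.25, random_state=12345)

tree = DecisionTreeClassifier(max_depth=3, random_state=12345)
tree.fit(X_train, y_train)

# Shuffle each feature 10 times on the validation set and record
# the mean drop in accuracy.
result = permutation_importance(
    tree, X_valid, y_valid, n_repeats=10, random_state=12345)
ranking = pd.Series(
    result.importances_mean, index=demo.columns).sort_values(ascending=False)

print(ranking)
```

In the notebook the same call would take `best_model`, `features_valid`, and `target_valid`.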


# Key Insights
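- The model relies most on internet usage (mb_used), which accounts for about half of the total importance, followed by call duration (minutes) at roughly 27%.
- Text messages (messages) play a secondary role at roughly 19%.
- The number of calls (calls) contributes under 2%. It may not differ much between the two plans, or it may carry largely the same information as minutes, in which case the tree gains little from splitting on both. Either explanation is a hypothesis that has not been tested here.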
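
Whether calls overlaps with minutes can be checked with a correlation matrix. The sketch below shows the mechanics on synthetic data in which minutes is constructed from calls (an assumption built into the example, so its output demonstrates the method only and is not evidence about the real dataset).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)
n = 500
calls = rng.poisson(60, n).astype(float)
# Assumption: each call lasts about 7 minutes on average, plus noise.
minutes = calls * 7 + rng.normal(0, 30, n)

demo = pd.DataFrame({
    "calls": calls,
    "minutes": minutes,
    "messages": rng.poisson(40, n).astype(float),  # independent of calls
})

# Pearson correlation between every pair of columns.
corr = demo.corr()
print(corr.round(2))
```

On the real data, `df.drop(columns='is_ultra').corr()` would produce the corresponding matrix. A calls/minutes correlation close to 1 would support the redundancy explanation.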