# Project 6: Introduction to Machine Learning - Megaline


# Introduction & Project Description

Megaline, a mobile carrier, has found that many of its subscribers still use legacy plans. The end goal of this project is to develop a model that analyzes subscribers' behavior and recommends one of Megaline's newer plans: (i) Smart or (ii) Ultra. 

The data scientist has been given access to behavior data about subscribers who have already switched to the new plans. This is a classification task: the model must pick the right plan for each subscriber. 

The goal is to develop the model with the highest possible accuracy. The accuracy threshold for this project is 0.75, and it is checked on the test dataset.  

# Data Quality Evaluation & Description of Data

The model construction will rely on the following dataset: **'datasets/users_behavior.csv'**.  Every observation in the dataset contains monthly behavior information about one user which includes the following data description: 
- *calls* — number of calls
- *minutes* — total call duration in minutes
- *messages* — number of text messages
- *mb_used* — Internet traffic used in MB
- *is_ultra* — plan for the current month (Ultra - 1, Smart - 0)

In [1]:
# import libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as st
from sklearn.tree import DecisionTreeClassifier #DecisionTreeRegressor

In [2]:
# import other models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
%matplotlib inline

**1. Open and look through the data file. Path to the file:*'datasets/users_behavior.csv'*.**

In [3]:
# Read the CSV file using pandas
data = pd.read_csv('/datasets/users_behavior.csv')

# Display the first few rows
data.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


**Insights:**
- The *'is_ultra'* column acts as a binary *target variable*, which makes this dataset suitable for a classification task (the target is categorical, not continuous).
- The other columns, *'calls', 'minutes', 'messages', 'mb_used'*, are numerical *features* that can be used directly as input to machine learning models. 

In [4]:
data.columns

Index(['calls', 'minutes', 'messages', 'mb_used', 'is_ultra'], dtype='object')

**Attribute Information:**
- *calls* — number of calls
- *minutes* — total call duration in minutes
- *messages* — number of text messages
- *mb_used* — Internet traffic used in MB
- *is_ultra* — plan for the current month
  - 1: Ultra
  - 0: Smart

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [6]:
# rows, columns
data.shape

(3214, 5)

In [7]:
data.describe()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
count,3214.0,3214.0,3214.0,3214.0,3214.0
mean,63.038892,438.208787,38.281269,17207.673836,0.306472
std,33.236368,234.569872,36.148326,7570.968246,0.4611
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.575,9.0,12491.9025,0.0
50%,62.0,430.6,30.0,16943.235,0.0
75%,82.0,571.9275,57.0,21424.7,1.0
max,244.0,1632.06,224.0,49745.73,1.0


**Investigate / Data Exploration** 

In [8]:
# Check the balance of the target variable (is_ultra).
data['is_ultra'].value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64

**Preprocessing**

Handle missing values if present. Scale or normalize features if necessary.

In [9]:
# Check for missing values
print(data.isnull().sum())

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64


In [10]:
# Calculate percentage of missing values
missing_percentage = data.isnull().mean() * 100
print(missing_percentage)

calls       0.0
minutes     0.0
messages    0.0
mb_used     0.0
is_ultra    0.0
dtype: float64


**Insights**

Missing data can distort analysis and lead to incorrect conclusions if not handled properly. Both checks above show zero missing values in every column, so no imputation or row removal is needed. 
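The preprocessing note above also mentions scaling. Tree-based models such as Random Forest do not need it, but a model like `LogisticRegression` usually benefits from standardized inputs. The sketch below is only an illustration under assumptions: `X_demo` and `y_demo` are synthetic stand-ins (not the Megaline data), and the pipeline is one possible way to combine the `StandardScaler` and `LogisticRegression` classes imported earlier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four usage features (assumption: not the real data).
rng = np.random.default_rng(42)
X_demo = rng.normal(
    loc=[60, 440, 40, 17000], scale=[30, 230, 35, 7500], size=(500, 4)
)
y_demo = (X_demo[:, 3] > 17000).astype(int)  # toy target tied to the last column

# The scaler is fit inside the pipeline, so it only sees the data passed to fit().
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_demo, y_demo)

# After scaling, each feature has (near) zero mean and unit standard deviation.
X_scaled = scaled_logreg.named_steps["standardscaler"].transform(X_demo)
print(X_scaled.mean(axis=0).round(2), X_scaled.std(axis=0).round(2))
```

Keeping the scaler inside a pipeline means that, when the pipeline is later fit on a training split only, no statistics from the validation or test data leak into the scaling step.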

In [11]:
# Calculate the class percentages of the target variable
percentages = data['is_ultra'].value_counts(normalize=True) * 100

# Display results
print(f"Percentage of A (0): {percentages[0]:.2f}%")
print(f"Percentage of B (1): {percentages[1]:.2f}%")


Percentage of A (0): 69.35%
Percentage of B (1): 30.65%


**Insights**

1. Class Imbalance: The dataset is imbalanced. The majority class (0) has 2229 samples, while the minority class (1) has only 985. The ratio of class 0 to class 1 is approximately 70:30, meaning about 70% of the data represents the **Smart** plan and about 30% represents the **Ultra** plan. 
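One practical consequence of the imbalance is that a trivial model which always predicts the majority class already scores about 0.69 accuracy, so the 0.75 threshold is only modestly above that baseline. The sketch below rebuilds the target from the counts reported above; `y_counts` is a hypothetical name used only for this illustration.

```python
import pandas as pd

# Rebuild the target distribution from the reported counts (2229 Smart, 985 Ultra).
y_counts = pd.Series([0] * 2229 + [1] * 985, name="is_ultra")

# A "model" that always predicts the majority class is correct this often.
majority_class = y_counts.mode()[0]
baseline_accuracy = (y_counts == majority_class).mean()
print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained model should be judged against this baseline as well as against the project threshold.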

**2. Split the source data into a training set, a validation set, and a test set.**

In [12]:
# Separate the features (independent variables) from the target (dependent variable).

X = data.drop(['is_ultra'], axis=1) # X contains all columns except the target variable.

y = data['is_ultra'] # y contains the 'is_ultra' column, which is the variable to be predicted.

In [13]:
# Running a check on the features 
X

Unnamed: 0,calls,minutes,messages,mb_used
0,40.0,311.90,83.0,19915.42
1,85.0,516.75,56.0,22696.96
2,77.0,467.66,86.0,21060.45
3,106.0,745.53,81.0,8437.39
4,66.0,418.74,1.0,14502.75
...,...,...,...,...
3209,122.0,910.98,20.0,35124.90
3210,25.0,190.36,0.0,3275.61
3211,97.0,634.44,70.0,13974.06
3212,64.0,462.32,90.0,31239.78


In [14]:
# Running a check on the target variable 
y

0       0
1       0
2       0
3       1
4       0
       ..
3209    1
3210    0
3211    0
3212    0
3213    1
Name: is_ultra, Length: 3214, dtype: int64

**Data Splitting**

In [15]:
# First, split into training + validation set and test set (80% train+val, 20% test)
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Next, split the training + validation set into training set and validation set (60% train, 20% validation)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42) # 0.25 x 0.8 = 0.2

**Insights**
- test_size=0.2 - reserves 20% of the data for testing.
- Splitting Train/Validation - The remaining 80% of the data is split further into training (60%) and validation (20%) subsets.
- random_state=42 - ensures reproducibility by setting a fixed seed for random shuffling. 
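Because the classes are imbalanced, one optional refinement is a stratified split, which keeps the class ratio the same in each subset. This is not what the notebook does; the sketch below only illustrates the `stratify` parameter of `train_test_split` on synthetic data (`X_demo`, `y_demo` are assumptions with a 70:30 class ratio).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with a 70:30 class ratio (assumption: synthetic, not the Megaline data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([0] * 700 + [1] * 300)

# stratify=... keeps the class ratio the same in every split.
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.25, random_state=42, stratify=y_tv
)

for name, part in [("train", y_tr), ("validation", y_va), ("test", y_te)]:
    print(f"{name}: {len(part)} samples, share of class 1 = {part.mean():.2f}")
```

Switching the real notebook to a stratified split would change which rows land in each subset, so the accuracy figures reported here would need to be recomputed.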

In [16]:
# Let's check data splits (After splitting, verify the sizes of the datasets.)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Validation set size: {X_val.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

Training set size: 1928 samples
Validation set size: 643 samples
Test set size: 643 samples


**Insights**

This method ensures proper segmentation of the dataset for training, tuning, and evaluation. 

In [17]:
X_train.shape

(1928, 4)

**Insight** 

Training set: 1928 samples, 4 features (independent variables) - Used for training the model to learn patterns in the data.

In [18]:
X_val.shape

(643, 4)

**Insight**

Validation Set: 643 samples, 4 features (independent variables) - Used for tuning the model's hyperparameters and monitoring performance during development. 

In [19]:
X_test.shape

(643, 4)

- Training set size: (1928, 4)
- Validation set size: (643, 4)
- Test set size: (643, 4)

**Insight**

Test Set: 643 samples, 4 features (independent variables) - Used for final evaluation to assess the model's real-world performance.

**3. Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.** 
- Here it is important to decide whether model training and validation will use regression or classification. As noted above, the *'is_ultra'* column acts as a binary *target variable*, which makes this a classification task. The other columns, *'calls', 'minutes', 'messages', 'mb_used'*, are numerical *features* that can be used directly as input to machine learning models.

1. **Model Training and Validation:**
- Train the model using the training set (X_train, y_train).
- Validate using the validation set (X_val, y_val) to ensure the model isn't overfitting.
   

2. **Final Evaluation:**
- Once the model is finalized, its performance is tested on the test set (X_test, y_test) to check whether the accuracy meets or exceeds the threshold of 0.75.  
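The train-then-validate loop described above can also be used to compare hyperparameter values by hand. The following is a minimal sketch, assuming synthetic data from `make_classification` rather than the Megaline dataset, that varies `max_depth` for the `DecisionTreeClassifier` imported earlier and records the validation accuracy of each setting.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the real features (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=42,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Try several tree depths and record the validation accuracy for each.
depth_scores = {}
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    depth_scores[depth] = accuracy_score(y_va, tree.predict(X_va))

best_depth = max(depth_scores, key=depth_scores.get)
print(f"Best max_depth: {best_depth}, "
      f"validation accuracy: {depth_scores[best_depth]:.3f}")
```

The same loop works for any estimator and hyperparameter; the test set stays untouched until a single final model has been chosen.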

**3.a. Train Random Forest Classifier and evaluate its quality.**

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [21]:
# Train the model - Initial Random Forest model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

RandomForestClassifier(random_state=42)

In [22]:
# Validate the model
val_predictions = model.predict(X_val)
val_accuracy = accuracy_score(y_val, val_predictions)
print(f"Validation Accuracy: {val_accuracy:.2f}")

Validation Accuracy: 0.79


**Insights**
- The stated requirement for the model is an accuracy of **at least 75%** on the test set.
- The initial Random Forest reaches a **validation accuracy of 0.79**, which already clears that threshold on held-out data.

- The **test accuracy**, measured in step 4, is 0.81. The validation accuracy (0.79) and the test accuracy (0.81) are close, which suggests the model performs consistently on data it was not trained on.
- A test accuracy of 0.81 is above the minimum requirement, which gives some confidence in the model's reliability. 
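A more direct overfitting check compares accuracy on the training set with accuracy on the validation set: a large gap means the model has memorized the training data. The sketch below shows the idea on synthetic data (an assumption, not the Megaline dataset); the names `train_acc`, `val_acc`, and `gap` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the real features (assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    random_state=42,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

forest = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Accuracy on data the model has seen versus data it has not seen.
train_acc = accuracy_score(y_tr, forest.predict(X_tr))
val_acc = accuracy_score(y_va, forest.predict(X_va))
gap = train_acc - val_acc
print(f"train={train_acc:.3f}  validation={val_acc:.3f}  gap={gap:.3f}")
```

A Random Forest with unrestricted depth often scores close to 1.0 on its own training data, so some gap is expected; limiting `max_depth` is one common way to shrink it.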

**3.b. Hyperparameter Tuning and evaluate its quality.**

**3.b.1 Grid Search**

In [23]:
from sklearn.model_selection import GridSearchCV

In [24]:
parameter_grid = {
    'n_estimators': [5, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10]
}

In [25]:
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), 
                           parameter_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42),
             param_grid={'max_depth': [10, 20, None],
                         'min_samples_split': [2, 5, 10],
                         'n_estimators': [5, 100, 200]},
             scoring='accuracy')

In [26]:
# Best hyperparameters and validation score
best_model = grid_search.best_estimator_
print(f"Best Hyperparameters: {grid_search.best_params_}")
print(f"Best Validation Score: {grid_search.best_score_:.2f}")

Best Hyperparameters: {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 200}
Best Validation Score: 0.81


**3.b.2 Randomized Search**

In [27]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

In [28]:
parameter_distributions = {
    'n_estimators': [5, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 10]
}

In [29]:
random_search = RandomizedSearchCV(RandomForestClassifier(random_state=42), 
                           parameter_distributions, cv=5, scoring='accuracy', n_iter=12, random_state=42)
random_search.fit(X_train, y_train)

RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(random_state=42),
                   n_iter=12,
                   param_distributions={'max_depth': [10, 20, None],
                                        'min_samples_split': [2, 10],
                                        'n_estimators': [5, 200]},
                   random_state=42, scoring='accuracy')

In [30]:
print("Best Hyperparameters:", random_search.best_params_)
print("Best Validation Score:", random_search.best_score_)

Best Hyperparameters: {'n_estimators': 200, 'min_samples_split': 2, 'max_depth': 10}
Best Validation Score: 0.8132736693358454


**Finding from Hyperparameter Tuning**

1. Best Hyperparameters (from Grid Search and Randomized Search):
   - Both Grid Search and Randomized Search identified the same optimal hyperparameters for the Random Forest Classifier. 
   
   
2. Best Validation Score:
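   - Both methods achieved a best score of about 0.813 (81.3%). Note that `best_score_` is the mean 5-fold cross-validated accuracy on the training set (`cv=5`), not the accuracy on the separate validation set (X_val, y_val).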
   

3. Model Performance: The best cross-validated score (81.3%) is well above the 75% threshold, suggesting the tuned model suits this dataset.
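The agreement between the two searches has a simple explanation. When every entry in `param_distributions` is a list, `RandomizedSearchCV` samples candidates without replacement, and the randomized search space here contains exactly 12 combinations, the same as `n_iter=12`. The sketch below counts the combinations with `ParameterGrid`; the dictionary names are illustrative copies of the grids defined above.

```python
from sklearn.model_selection import ParameterGrid

# Copies of the two search spaces defined in the notebook.
grid_search_space = {
    "n_estimators": [5, 100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
}
random_search_space = {
    "n_estimators": [5, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 10],
}

# ParameterGrid enumerates every combination, so len() gives the candidate count.
n_grid = len(ParameterGrid(grid_search_space))
n_random = len(ParameterGrid(random_search_space))
print(f"Grid search candidates: {n_grid}, randomized search space: {n_random}")
```

With 12 draws from a 12-element space, the randomized search evaluates every combination, and that space contains the grid search winner (`max_depth=10`, `min_samples_split=2`, `n_estimators=200`). Randomized search becomes a real shortcut only when `n_iter` is smaller than the space or when distributions such as `scipy.stats.randint` are used.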

**4. Check the quality of the model using the test set.**

In [31]:
# Evaluate on the test set (note: `model` is the initial Random Forest from In [21], not the tuned `best_model`)
test_predictions = model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Test Accuracy: {test_accuracy:.2f}")

Test Accuracy: 0.81


In [32]:
# Display results
print(f"Test Accuracy: {test_accuracy:.2f}")

Test Accuracy: 0.81


In [33]:
# Threshold check
if test_accuracy >= 0.75:
    print("The model meets the accuracy threshold of 0.75.")
else:
    print("The model does not meet the accuracy threshold.")

The model meets the accuracy threshold of 0.75.
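The test evaluation above uses `model`, the initial Random Forest. To score the tuned estimator instead, the same steps can be applied to `best_estimator_`. The following sketch shows that pattern on synthetic data with a deliberately small grid; the data, the grid values, and the name `tuned_model` are assumptions for illustration, and no result for the real dataset is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary data standing in for the real features (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1,
    random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Small illustrative grid to keep the example fast.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [20, 50], "max_depth": [5, 10]},
    cv=3, scoring="accuracy",
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on all of X_tr.
tuned_model = search.best_estimator_
tuned_test_accuracy = accuracy_score(y_te, tuned_model.predict(X_te))
print(f"Best params: {search.best_params_}")
print(f"Tuned model test accuracy: {tuned_test_accuracy:.2f}")
```

In the notebook, the equivalent change would be to call `best_model.predict(X_test)` and rerun the threshold check on the resulting accuracy.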


**Summary of Completed Steps**

*Model Training:*
- The Random Forest model was trained on the training set. 

*Validation:*
- The model was validated using the validation set, achieving a validation accuracy of 0.79 (79%). 

*Hyperparameter Tuning:* 
- Both Grid Search and Randomized Search were used to optimize hyperparameters. Both identified the same best hyperparameters (max_depth=10, min_samples_split=2, n_estimators=200), with a mean 5-fold cross-validated accuracy of 81.3%. 

*Test Set Evaluation:* 
- The test-set evaluation used `model`, the initial Random Forest with default hyperparameters, rather than the tuned `best_model`. It achieved a test accuracy of 0.81 (81%). 

*Threshold Check:*
- The model meets and exceeds the accuracy threshold of 0.75 (75%), confirming that it is suitable for the task. 

**Final Conclusion**

The evaluation of the model on the test set confirms that this part of the project is complete. The model demonstrates:
- Strong performance on both the validation and test datasets.
- Generalization ability to unseen data. 