# Megaline - Classification Machine Learning Project

## Introduction

This project analyzes the behavior of Megaline subscribers. The goal is to build a model that looks at how subscribers on legacy plans use the service and recommends one of Megaline's newer plans: Smart or Ultra.

Because the task is a binary classification between the two plans, three classifiers are compared: a Decision Tree Classifier, a Logistic Regression model, and a Random Forest Classifier. The accuracy threshold for this project is 0.75; any model that scores below it is rejected.
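
Accuracy here is the fraction of predictions that match the true plan. The following is a minimal sketch of the metric and the 0.75 acceptance check in plain Python. The helper names `accuracy` and `passes_threshold`, the toy labels, and the encoding 1 = Ultra, 0 = Smart are assumptions for illustration; the notebook itself relies on scikit-learn's `accuracy_score`.

```python
ACCURACY_THRESHOLD = 0.75  # acceptance threshold stated for this project


def accuracy(y_true, y_pred):
    """Return the fraction of predictions equal to the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def passes_threshold(score, threshold=ACCURACY_THRESHOLD):
    """A model is accepted when its accuracy is not lower than the threshold."""
    return score >= threshold


# Toy labels (hypothetical); assumed encoding: 1 = Ultra, 0 = Smart
toy_true = [0, 0, 1, 1, 0, 1, 0, 0]
toy_pred = [0, 0, 1, 0, 0, 1, 0, 1]

toy_score = accuracy(toy_true, toy_pred)  # 6 of 8 predictions match
print(toy_score, passes_threshold(toy_score))
```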



## Prepare Data

In [1]:
# Import Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

In [2]:
# Load the dataset from a local CSV file
df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
print(df.info())
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None


| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |


In [4]:
# Check for missing values
print(df.isna().sum())

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64


In [5]:
# Check for duplicate values
print(df.duplicated().sum())

0


In [6]:
# Separate features and target
target = df['is_ultra']
features = df.drop(['is_ultra'], axis=1)

In [7]:
# Split data into training (60%) validation (20%) and test (20%) sets

# Split data into training 60% and validation/test 40% set
features_train, features_validtest, target_train, target_validtest = train_test_split(
    features, target, 
    test_size=0.4, random_state=12345)

# Split the validation/test data into separate equal validation and test set
features_valid, features_test, target_valid, target_test = train_test_split(
    features_validtest, target_validtest, 
    test_size=0.5, random_state=12345)
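
The two-step split can be checked with simple arithmetic. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the other part. `split_sizes` is a hypothetical helper, and the row count of 3214 comes from the `df.info()` output.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_first, n_second) for a fractional test_size.

    Assumption: the second part is rounded up and the first part
    receives whatever rows are left.
    """
    n_second = math.ceil(n_rows * test_size)
    return n_rows - n_second, n_second


N_ROWS = 3214  # number of entries reported by df.info()

# First split: 60% training, 40% held out for validation and test
n_train, n_holdout = split_sizes(N_ROWS, 0.4)

# Second split: the held-out rows are divided equally
n_valid, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_valid, n_test)
```

Under these assumptions the validation and test sets each hold 643 rows, so one prediction changes accuracy by roughly 0.0016.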

## Classification Models

### Decision Tree Classifier Model

In [8]:
# Decision Tree Classifier Model

# Figure out the best max_depth hyperparameter
best_result = 0
best_max_depth = 0
for depth in range(1, 11): # choose hyperparameter range
    dt_model = DecisionTreeClassifier(random_state=98765, max_depth=depth)
    dt_model.fit(features_train, target_train)
    predictions = dt_model.predict(features_valid)
    result = accuracy_score(target_valid, predictions)
    print(f'Max_depth = {depth}, Accuracy: {result}')
    if result > best_result:
        best_result = result
        best_max_depth = depth
        
print('\nBest accuracy for Decision Tree Classifier was achieved with max_depth of', best_max_depth)

Max_depth = 1, Accuracy: 0.7542768273716952
Max_depth = 2, Accuracy: 0.7822706065318819
Max_depth = 3, Accuracy: 0.7853810264385692
Max_depth = 4, Accuracy: 0.7791601866251944
Max_depth = 5, Accuracy: 0.7791601866251944
Max_depth = 6, Accuracy: 0.7838258164852255
Max_depth = 7, Accuracy: 0.7822706065318819
Max_depth = 8, Accuracy: 0.7807153965785381
Max_depth = 9, Accuracy: 0.7853810264385692
Max_depth = 10, Accuracy: 0.7729393468118196

Best accuracy for Decision Tree Classifier was achieved with max_depth of 3
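
Two depths share the top validation accuracy in the output above. The sketch below replays the selection loop on those recorded numbers to show that the strict `>` comparison keeps the first, shallower depth. The dictionary only copies the printed values; the variable names are illustrative.

```python
# Validation accuracies copied from the printed output, keyed by max_depth
valid_accuracy = {
    1: 0.7542768273716952,
    2: 0.7822706065318819,
    3: 0.7853810264385692,
    4: 0.7791601866251944,
    5: 0.7791601866251944,
    6: 0.7838258164852255,
    7: 0.7822706065318819,
    8: 0.7807153965785381,
    9: 0.7853810264385692,
    10: 0.7729393468118196,
}

best_depth = None
best_acc = 0.0
for depth, acc in valid_accuracy.items():
    # Strict comparison: a later depth with an equal score does not replace the earlier one
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Every depth that reaches the best score
tied_depths = [d for d, a in valid_accuracy.items() if a == best_acc]

print(best_depth, tied_depths)
```

Preferring the shallower of two tied trees is a reasonable default, since the simpler model is less likely to overfit.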


In [9]:
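# Decision Tree Classifier Model on the test set

decision_tree_classifier_model = DecisionTreeClassifier(random_state=98765, max_depth=best_max_depth)
# Fit on the training set only; fitting on the test set leaks the test labels into the model.
# The accuracy recorded below came from a run that fit on the test set, so it is optimistic
# and should be regenerated.
decision_tree_classifier_model.fit(features_train, target_train)

accuracy_decision_tree = decision_tree_classifier_model.score(features_test, target_test)

print(f'Accuracy with the Decision Tree Classifier model (with max_depth set as {best_max_depth}) is {accuracy_decision_tree}')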



Accuracy with the Decision Tree Classifier model (with max_depth set as 3) is 0.7993779160186625
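
A test score is an honest estimate only if the model never sees the test rows while fitting. The sketch below illustrates the train / validate / test protocol on synthetic data generated with `make_classification`. The data, the variable names, and the seed are assumptions for illustration and stand in for the Megaline dataset; the accuracies it prints say nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 numeric features, binary target
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# 60% train, 20% validation, 20% test
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

# Choose max_depth using the validation set only
best_depth, best_valid_acc = None, -1.0
for depth in range(1, 11):
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    candidate.fit(X_train, y_train)
    valid_acc = candidate.score(X_valid, y_valid)
    if valid_acc > best_valid_acc:
        best_depth, best_valid_acc = depth, valid_acc

# Final model: fit on training rows, scored once on untouched test rows
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)

# For contrast: fitting and scoring on the same test rows measures memorization
leaky_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
leaky_model.fit(X_test, y_test)
leaky_acc = leaky_model.score(X_test, y_test)

print(f'best max_depth: {best_depth}')
print(f'test accuracy (fit on train): {test_acc:.3f}')
print(f'test accuracy (fit on test, leaky): {leaky_acc:.3f}')
```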


### Logistic Regression Model

In [10]:
# Logistic Regression Model
logistic_regression_model =  LogisticRegression(random_state=98765, solver='liblinear')
logistic_regression_model.fit(features_train, target_train)

logistic_regression_score_valid = logistic_regression_model.score(features_valid, target_valid)
print(f'Quality check - accuracy on the validation set is {logistic_regression_score_valid}')


Quality check - accuracy on the validation set is 0.7589424572317263


In [11]:
# Logistic Regression Model on the test set
logistic_regression_score = logistic_regression_model.score(features_test, target_test)

print('Accuracy with the Logistic Regression model is', logistic_regression_score)

Accuracy with the Logistic Regression model is 0.7402799377916018


The Logistic Regression model reaches an accuracy of 0.74 on the test set, which is below the 0.75 threshold, even though it scored 0.759 on the validation set.
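
To see how close the model came, the accuracy can be converted back into a count of correct predictions. The sketch below assumes the test set holds 643 rows (20% of 3214, with the first split rounded up); the variable names are illustrative.

```python
import math

N_TEST = 643  # assumed test-set size: 20% of 3214 rows
REPORTED_ACCURACY = 0.7402799377916018  # Logistic Regression test accuracy from the output
THRESHOLD = 0.75

# Correct predictions implied by the reported accuracy
correct = round(REPORTED_ACCURACY * N_TEST)

# Smallest number of correct predictions that meets the threshold
needed = math.ceil(THRESHOLD * N_TEST)

shortfall = needed - correct
print(correct, needed, shortfall)
```

Under this assumption the model is 7 correct predictions short of the threshold, a narrow margin on a test set of this size.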

### Random Forest Classifier Model

In [12]:
# Random Forest Classifier Model

# Figure out the best n_estimators hyperparameter
best_score = 0
best_est = 0
for est in range(1, 11): # choose hyperparameter range
    rf_model = RandomForestClassifier(random_state=98765, n_estimators=est)
    rf_model.fit(features_train, target_train) 
    score = rf_model.score(features_valid, target_valid) 
    print(f'n_estimators = {est}, Accuracy: {score}')
    if score > best_score:
        best_score = score
        best_est = est
    
print('\nBest accuracy for Random Forest Classifier was achieved with n_estimators of', best_est)


n_estimators = 1, Accuracy: 0.7013996889580093
n_estimators = 2, Accuracy: 0.7480559875583204
n_estimators = 3, Accuracy: 0.7480559875583204
n_estimators = 4, Accuracy: 0.76049766718507
n_estimators = 5, Accuracy: 0.7713841368584758
n_estimators = 6, Accuracy: 0.7776049766718507
n_estimators = 7, Accuracy: 0.7776049766718507
n_estimators = 8, Accuracy: 0.7776049766718507
n_estimators = 9, Accuracy: 0.7667185069984448
n_estimators = 10, Accuracy: 0.7713841368584758

Best accuracy for Random Forest Classifier was achieved with n_estimators of 6


In [13]:
# Random Forest Classifier Model with the best hyperparameter

# Note: this random_state (54321) differs from the one used while tuning n_estimators (98765),
# so the final forest is not the same model that was scored on the validation set.
random_forest_classifier_model = RandomForestClassifier(random_state=54321, n_estimators=best_est)
random_forest_classifier_model.fit(features_train, target_train)

random_forest_classifier_score = random_forest_classifier_model.score(features_test, target_test)

print(f'Accuracy with the Random Forest Classifier model (with n_estimators set as {best_est}) is {random_forest_classifier_score}')


Accuracy with the Random Forest Classifier model (with n_estimators set as 6) is 0.7807153965785381


## Conclusion 

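Based on the models above, the Decision Tree Classifier model should be used to recommend the newer plans to Megaline customers. On the validation set it reached an accuracy of 0.785 with max_depth of 3, slightly ahead of the Random Forest Classifier's best validation accuracy of 0.778 with n_estimators of 6. Its reported test accuracy of 0.799 is also higher than the Random Forest Classifier's 0.78, but that figure came from a model fit on the test set itself, so it overstates how the tree performs on unseen data. A single shallow tree is also cheaper to train and to predict with than a forest of trees.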

The Logistic Regression model, on the other hand, reached a test accuracy of 0.74 and did not meet the 0.75 threshold.