# MegaLine Subscriber Plan Prediction

## Purpose

The purpose of this project is to develop a model that predicts whether a customer uses the Smart or Ultra plan with an accuracy of at least 75%. This model will be used to recommend whichever of the two plans best suits a future customer's monthly usage.

## Table of Contents
<a href='#General Data Information'>General Data Information</a>

<a href='#Model Development'>Model Development</a>

<a href='#Dataset Separation'>Dataset Separation</a>

<a href='#Model Training'>Model Training</a>

<a href='#Model Testing'>Model Testing</a>

<a href='#Overall Conclusion'>Overall Conclusion</a>

<a id='General Data Information'></a>
## General Data Information

Initially, a general look at the data is performed and the necessary sklearn modules are imported. Since the target attribute is categorical (either ultra or not ultra), classification modules are imported for Decision Tree, Random Forest, and Logistic Regression.

In [14]:
#Import necessary libraries and modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy as sc
from scipy import stats as st
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier

In [15]:
user_data = pd.read_csv('/datasets/users_behavior.csv')

user_data.info()

user_data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


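| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |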


This data was already preprocessed during a previous effort; therefore, it was not expected to have any missing values. The `info()` output confirms this: all 3214 entries are non-null in every column.
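A minimal sketch of how an explicit missing-value check might look. It uses only the five rows shown in the `head()` output as stand-in data, since the full `users_behavior.csv` file is not reproduced here; the variable names `sample`, `missing_per_column`, and `total_missing` are illustrative.

```python
import pandas as pd

# Stand-in data: the five rows printed by user_data.head() above.
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.9, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    "is_ultra": [0, 0, 0, 1, 0],
})

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = sample.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

On the real data, the same two lines would be applied to `user_data` instead of `sample`.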

<a id='Model Development'></a>
## Model Development

<a id='Dataset Separation'></a>
### Dataset Separation

Since there is only one dataset available, it must be separated into a training dataset, a validation dataset, and a testing dataset in order to build, validate, and test the models. The original dataset is split in a 3:1:1 ratio (60%/20%/20%) for the training, validation, and testing datasets, respectively.

In [16]:
#Splits the original dataset into 60% training, 40% leftover
user_train, user_valid_test = train_test_split(user_data, test_size = 0.4, random_state=12345)
#Splits the leftover dataset into 50% validating, 50% testing for 20%/20% overall
user_valid, user_test = train_test_split(user_valid_test, test_size = 0.5, random_state=12345)

Since the goal of this model is to predict whether a customer is on the Ultra plan or not, the `is_ultra` attribute is set as the target, while the remainder of the attributes are selected as the features, as these all have an effect on whether a customer is on the Ultra plan or not.

In [17]:
#Creates features and targets of each dataset
features_train = user_train.drop('is_ultra', axis=1)
target_train = user_train['is_ultra']

features_valid = user_valid.drop('is_ultra', axis=1)
target_valid = user_valid['is_ultra']

features_test = user_test.drop('is_ultra', axis=1)
target_test = user_test['is_ultra']

With the training, validating, and testing datasets created, and targets and features defined for each, various models can now be trained and their accuracies compared.
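As a quick check of the two-step split described above, the sketch below repeats it on a synthetic stand-in frame with the same number of rows (3214) and reports the size of each part. The frame `demo` and its random values are assumptions made for illustration and are not the MegaLine data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the original dataset.
rng = np.random.default_rng(12345)
demo = pd.DataFrame({
    "minutes": rng.uniform(0, 1000, size=3214),
    "is_ultra": rng.integers(0, 2, size=3214),
})

# Same two-step split as above: 60% training, then the remainder halved.
demo_train, demo_valid_test = train_test_split(demo, test_size=0.4, random_state=12345)
demo_valid, demo_test = train_test_split(demo_valid_test, test_size=0.5, random_state=12345)

split_sizes = {"train": len(demo_train), "valid": len(demo_valid), "test": len(demo_test)}
for name, size in split_sizes.items():
    print("{}: {} rows ({:.1%})".format(name, size, size / len(demo)))

# Share of the positive class in each part, to see whether the parts are comparable.
for name, part in [("train", demo_train), ("valid", demo_valid), ("test", demo_test)]:
    print(name, "positive share: {:.3f}".format(part["is_ultra"].mean()))
```

Running the same size and class-share report on `user_train`, `user_valid`, and `user_test` would show whether the three real datasets have a similar proportion of Ultra users.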

<a id='Model Training'></a>
### Model Training

Three different forms of classification models will be trained: Logistic Regression, Decision Tree, and Random Forest. For each model, various hyperparameters will be altered to create the most accurate model for each. All models are trained with the training dataset and then the accuracy is determined against the validation dataset.

#### Logistic Regression Model

For the logistic regression model, two hyperparameters will be altered: the `C` parameter (the inverse of the regularization strength) and the `penalty` parameter. For the `C` parameter, a range of values from very small (0.01, strong regularization) to very large (100, weak regularization) is selected, including 1 (the scikit-learn default), to determine how it affects overfitting. For the `penalty` parameter, Lasso (`l1`) and Ridge (`l2`) regularization are the two options evaluated.
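To illustrate how `C` acts as an inverse regularization strength, the following minimal sketch fits the same model with three values of `C` and compares the size of the learned coefficients. The data is synthetic and standardized, made up purely for illustration; the `penalty` argument is left at its default (`l2`).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, standardized features with a known linear signal (illustration only).
rng = np.random.default_rng(12345)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

coef_norms = {}
for c_value in [0.01, 1, 100]:
    # The default penalty is l2 (Ridge); only C changes between fits.
    model = LogisticRegression(C=c_value, solver="liblinear", random_state=12345)
    model.fit(X, y)
    # Size of the learned weight vector: stronger regularization pulls it toward zero.
    coef_norms[c_value] = float(np.linalg.norm(model.coef_))
    print("C:", c_value, "  Coefficient norm: {:.3f}".format(coef_norms[c_value]))
```

A small `C` shrinks the coefficients strongly, which limits overfitting but can also underfit; a large `C` leaves them almost unconstrained.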

In [18]:
#Loops through 6 different c values and 2 different regression regularizations and outputs the accuracy of them
for c_value in [0.01, 0.1, 0.5, 1, 10, 100]:
    for penalty in ['l1','l2']:
        logistic_regression_model = LogisticRegression(random_state = 12345, penalty=penalty, C = c_value, solver = 'liblinear')
        logistic_regression_model.fit(features_train, target_train)
        logistic_regression_accuracy = logistic_regression_model.score(features_valid, target_valid)
        print("C Value:", c_value, "  Penalty:", penalty, "  Accuracy: {:.3f}".format(logistic_regression_accuracy))
    print('')

C Value: 0.01   Penalty: l1   Accuracy: 0.711
C Value: 0.01   Penalty: l2   Accuracy: 0.725

C Value: 0.1   Penalty: l1   Accuracy: 0.757
C Value: 0.1   Penalty: l2   Accuracy: 0.748

C Value: 0.5   Penalty: l1   Accuracy: 0.757
C Value: 0.5   Penalty: l2   Accuracy: 0.756

C Value: 1   Penalty: l1   Accuracy: 0.756
C Value: 1   Penalty: l2   Accuracy: 0.759

C Value: 10   Penalty: l1   Accuracy: 0.756
C Value: 10   Penalty: l2   Accuracy: 0.708

C Value: 100   Penalty: l1   Accuracy: 0.756
C Value: 100   Penalty: l2   Accuracy: 0.756



From the various logistic regression models trained, the Ridge (`l2`) model with the neutral C value of 1 had the best accuracy (0.759). However, for C values of 0.1 and greater, and for either regularization technique, the accuracies are all close together, showing very little variation from changing the two hyperparameters. The exception is the Ridge model with a C value of 10, which has a noticeably lower accuracy (0.708) than the rest of the models. This is likely an anomaly with the model/data.

#### Decision Tree Model

For the Decision Tree model, two hyperparameters will be altered: the `max_depth` parameter and the `min_samples_leaf` parameter. For the `max_depth` parameter, values from 1 to 7 were used, and for the `min_samples_leaf` parameter, values of 1 and 2 were used to determine the effects on model accuracy.

In [19]:
#Loops through a max_depth from 1 to 7 and a min_samples_leaf of 1 or 2. Calculates accuracy for each
for depth in range (1,8):
    for leaf in range(1,3):
        decision_tree_model = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=leaf, random_state=12345)
        decision_tree_model.fit(features_train, target_train)
        decision_tree_accuracy = decision_tree_model.score(features_valid, target_valid)
        print("max_depth:",depth,"  min_leaf_sample:", leaf, "  Accuracy: {:.3f}".format(decision_tree_accuracy))
    print('')

max_depth: 1   min_leaf_sample: 1   Accuracy: 0.754
max_depth: 1   min_leaf_sample: 2   Accuracy: 0.754

max_depth: 2   min_leaf_sample: 1   Accuracy: 0.782
max_depth: 2   min_leaf_sample: 2   Accuracy: 0.782

max_depth: 3   min_leaf_sample: 1   Accuracy: 0.785
max_depth: 3   min_leaf_sample: 2   Accuracy: 0.785

max_depth: 4   min_leaf_sample: 1   Accuracy: 0.779
max_depth: 4   min_leaf_sample: 2   Accuracy: 0.779

max_depth: 5   min_leaf_sample: 1   Accuracy: 0.779
max_depth: 5   min_leaf_sample: 2   Accuracy: 0.778

max_depth: 6   min_leaf_sample: 1   Accuracy: 0.784
max_depth: 6   min_leaf_sample: 2   Accuracy: 0.779

max_depth: 7   min_leaf_sample: 1   Accuracy: 0.782
max_depth: 7   min_leaf_sample: 2   Accuracy: 0.784



With the exception of a `max_depth` of 1, the accuracies of the various models were all fairly similar, with a `max_depth` of 3 having the highest accuracy regardless of `min_samples_leaf`. These models generally proved to be more accurate than the logistic regression models.

#### Random Forest Model

For the Random Forest model, two hyperparameters will be altered: the `n_estimators` parameter and the `max_depth` parameter. The number of trees (`n_estimators`) varied from 10 to 50 in increments of 10. The `max_depth` parameter varied from 2 to 10 in increments of 2 since the Decision Tree model showed little variation between changes to the max_depth past the first depth.

In [20]:
#Loops through n_estimators of 10, 20, 30, 40, and 50, and max_depth of 2, 4, 6, 8, and 10. Calculates accuracy for each
for tree in range(10, 51, 10):
    for depth in range(2, 11, 2):
        random_forest_model = RandomForestClassifier(n_estimators = tree, max_depth = depth, random_state=12345)
        random_forest_model.fit(features_train, target_train)
        random_forest_accuracy = random_forest_model.score(features_valid, target_valid)
        print("n_estimators:", tree, '  max_depth:', depth, '  Accuracy: {:.3f}'.format(random_forest_accuracy))
    print('')

n_estimators: 10   max_depth: 2   Accuracy: 0.778
n_estimators: 10   max_depth: 4   Accuracy: 0.790
n_estimators: 10   max_depth: 6   Accuracy: 0.801
n_estimators: 10   max_depth: 8   Accuracy: 0.796
n_estimators: 10   max_depth: 10   Accuracy: 0.792

n_estimators: 20   max_depth: 2   Accuracy: 0.784
n_estimators: 20   max_depth: 4   Accuracy: 0.788
n_estimators: 20   max_depth: 6   Accuracy: 0.799
n_estimators: 20   max_depth: 8   Accuracy: 0.798
n_estimators: 20   max_depth: 10   Accuracy: 0.792

n_estimators: 30   max_depth: 2   Accuracy: 0.784
n_estimators: 30   max_depth: 4   Accuracy: 0.787
n_estimators: 30   max_depth: 6   Accuracy: 0.801
n_estimators: 30   max_depth: 8   Accuracy: 0.799
n_estimators: 30   max_depth: 10   Accuracy: 0.795

n_estimators: 40   max_depth: 2   Accuracy: 0.785
n_estimators: 40   max_depth: 4   Accuracy: 0.790
n_estimators: 40   max_depth: 6   Accuracy: 0.802
n_estimators: 40   max_depth: 8   Accuracy: 0.809
n_estimators: 40   max_depth: 10   Accuracy:

For the Random Forest models, a higher number of trees generally gave slightly higher accuracy, although the impact was small. For `max_depth`, the highest accuracies were generally observed at values of 6 and 8. The most accurate model was the one with 40 trees and a `max_depth` of 8.
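Rather than reading the best combination off the printed output, the loop can keep track of it directly. The sketch below shows one way this might be done over the same grid; it runs on synthetic stand-in data, and the names `best_accuracy` and `best_params` are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustration only, not the MegaLine dataset).
rng = np.random.default_rng(12345)
X = rng.normal(size=(600, 4))
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(int)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

best_accuracy = 0.0
best_params = None
for tree in range(10, 51, 10):
    for depth in range(2, 11, 2):
        model = RandomForestClassifier(n_estimators=tree, max_depth=depth, random_state=12345)
        model.fit(X_train, y_train)
        accuracy = model.score(X_valid, y_valid)
        # Keep the first combination that reaches the highest validation accuracy.
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_params = {"n_estimators": tree, "max_depth": depth}

print("Best parameters:", best_params, "  Accuracy: {:.3f}".format(best_accuracy))
```

Because ties are resolved in favor of the first combination found, the smaller and cheaper model wins when two combinations score the same.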

#### Model Conclusion

Of the three types of models trained, logistic regression had the lowest validation accuracy at roughly 71-76%, followed by Decision Tree at 75-79% and Random Forest at 78-81%. Random Forest gave consistent results with little variation between the parameters chosen, whereas logistic regression had the highest variation in accuracy. The Random Forest model with 40 trees and a `max_depth` of 8 had the best accuracy, at 80.9%. This model will be selected for the testing dataset.
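The ranges quoted above can be recomputed from the printed validation accuracies. The sketch below copies those values by hand into plain lists; the Random Forest list covers only the values that appear in the output shown.

```python
# Validation accuracies copied from the printed outputs above.
reported = {
    "Logistic Regression": [0.711, 0.725, 0.757, 0.748, 0.757, 0.756,
                            0.756, 0.759, 0.756, 0.708, 0.756, 0.756],
    "Decision Tree": [0.754, 0.754, 0.782, 0.782, 0.785, 0.785, 0.779,
                      0.779, 0.779, 0.778, 0.784, 0.779, 0.782, 0.784],
    "Random Forest": [0.778, 0.790, 0.801, 0.796, 0.792,
                      0.784, 0.788, 0.799, 0.798, 0.792,
                      0.784, 0.787, 0.801, 0.799, 0.795,
                      0.785, 0.790, 0.802, 0.809],
}

# Lowest and highest validation accuracy per model family.
summary = {name: (min(scores), max(scores)) for name, scores in reported.items()}
for name, (low, high) in summary.items():
    print("{}: {:.3f} to {:.3f} (spread {:.3f})".format(name, low, high, high - low))

# Model family whose best configuration scored highest.
best_family = max(summary, key=lambda name: summary[name][1])
print("Highest validation accuracy:", best_family)
```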

<a id='Model Testing'></a>
### Model Testing

The most accurate Random Forest model from the training and validating dataset will now be tested with the testing dataset using the same hyperparameters.

In [21]:
#Calculate accuracy of the Random Forest Model using the testing dataset. Hyperparameters of the most accurate model used.
random_forest_model_test = RandomForestClassifier(n_estimators = 40, max_depth = 8, random_state=12345)
random_forest_model_test.fit(features_train, target_train)
random_forest_model_test_accuracy = random_forest_model_test.score(features_test, target_test)
print("The accuracy of the model is {:.1%}.".format(random_forest_model_test_accuracy))

The accuracy of the model is 79.6%.


Using the selected Random Forest model with the testing dataset resulted in an accuracy of 79.6%. This is within the range of accuracies observed for the Random Forest models on the validation dataset, so the model appears to perform as expected. However, a sanity check will still be performed to verify.

#### Model Sanity Check

A sanity check of the model will be performed by using a Dummy Classifier with two separate strategies: a most frequent strategy that always predicts the most frequent class in the training dataset, and a stratified strategy that draws predictions at random according to the class distribution of the training dataset. The accuracy of these two baselines on the testing dataset will be compared to that of the Random Forest model to ensure the model is working as expected.

In [22]:
most_frequent_test = DummyClassifier(strategy = 'most_frequent', random_state=12345)
most_frequent_test.fit(features_train, target_train)
most_frequent_test_accuracy = most_frequent_test.score(features_test, target_test)
print("The accuracy of the most frequent method is {:.1%}".format(most_frequent_test_accuracy))

The accuracy of the most frequent method is 68.4%


In [23]:
stratified_test = DummyClassifier(strategy = 'stratified', random_state=12345)
stratified_test.fit(features_train, target_train)
stratified_test_accuracy = stratified_test.score(features_test, target_test)
print("The accuracy of the stratified frequent method is {:.1%}".format(stratified_test_accuracy))

The accuracy of the stratified frequent method is 53.7%


The accuracies of the most frequent and stratified dummy classifiers are 68.4% and 53.7%, respectively. Both are well below the selected Random Forest model accuracy of 79.6%, which indicates that the Random Forest model has learned more than the class balance alone and is working as intended.
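The two baseline figures can be related to the class balance with a short calculation. The sketch below assumes that the training and testing datasets have the same class shares and takes the majority-class share from the 68.4% reported above; it is a rough plausibility check, not an exact derivation.

```python
# Share of the majority class in the testing dataset, taken from the
# most_frequent accuracy reported above (68.4%).
p_majority = 0.684

# most_frequent always predicts the majority class, so it is right
# exactly as often as that class occurs.
expected_most_frequent = p_majority

# stratified draws each prediction at random using the learned class shares.
# Assumption: training and testing sets share the same class shares, so a
# prediction matches when both are majority or both are minority.
expected_stratified = p_majority ** 2 + (1 - p_majority) ** 2

print("Expected most_frequent accuracy: {:.1%}".format(expected_most_frequent))
print("Expected stratified accuracy: {:.1%}".format(expected_stratified))
```

Under that assumption the stratified baseline is expected to land near 57%. The observed 53.7% is in the same region, which is plausible given that its predictions are random.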

<a id='Overall Conclusion'></a>
## Overall Conclusion

The purpose of this project was to develop a model that most accurately predicts whether a customer uses the Smart or Ultra plan based on previously provided data. The available data was split into three datasets: training, validation, and testing. Three separate classification models, Logistic Regression, Decision Tree, and Random Forest, were trained with varying hyperparameters on the training dataset. The accuracy of each model iteration was then calculated against the validation dataset.

The Random Forest models had the highest accuracy, closely followed by the Decision Tree models. Using the tuned hyperparameters, a model with 79.6% accuracy on the testing dataset was developed. This model was validated to be working correctly by performing two sanity checks on the testing dataset with a dummy classifier.

Using a Random Forest model with 40 `n_estimators` and a `max_depth` of 8 provides the best accuracy for the data in question and meets the accuracy requirement of 75%.
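As a rough sketch of how the selected model might be used to recommend a plan, the example below fits a Random Forest with the chosen hyperparameters and maps its prediction to a plan name. It is trained only on the five rows from the `head()` output as stand-in data, and the new customer's usage values are hypothetical, so the printed recommendation is illustrative only.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Stand-in training data: the five rows printed by user_data.head() above.
# A real model would be fit on the full training dataset instead.
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.9, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    "is_ultra": [0, 0, 0, 1, 0],
})
features = sample.drop("is_ultra", axis=1)
target = sample["is_ultra"]

# Hyperparameters of the selected model.
model = RandomForestClassifier(n_estimators=40, max_depth=8, random_state=12345)
model.fit(features, target)

# Hypothetical new customer; the usage values are made up for illustration.
new_customer = pd.DataFrame(
    [{"calls": 90.0, "minutes": 650.0, "messages": 60.0, "mb_used": 12000.0}]
)

# is_ultra == 1 means the Ultra plan, otherwise the Smart plan.
plan_names = {0: "Smart", 1: "Ultra"}
recommendation = plan_names[int(model.predict(new_customer)[0])]
print("Recommended plan:", recommendation)
```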