# Megaline Plan Prediction Model

## Introduction

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra. 

### Purpose

Develop a model that recommends the more suitable phone plan with an accuracy of at least 75% on a test set.

### Data Description

Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:

`calls:` Number of calls

`minutes:` Total call duration in minutes

`messages:` Number of text messages

`mb_used:` Internet traffic used in MB

`is_ultra:` Plan for the current month (Ultra - 1, Smart - 0)

### Process

The following steps will be taken:
- Read and Check Data
- Split Data
- Train Models:
    - Decision Trees
    - Random Forests
    - Logistic Regression
- Create Sanity Check
- Make Conclusions

## Read and Check Data

As the dataset has already been cleaned in the statistical data analysis project, the data will only quickly be checked for missing values and duplicates. Notes on the type of task will also be made.

### Import Packages

In [3]:
# Import relevant packages
import pandas as pd # to save dataframe
from sklearn.model_selection import train_test_split # to split data into training and testing sets
from sklearn.tree import DecisionTreeClassifier # to train model with decision tree
from sklearn.ensemble import RandomForestClassifier # to train model with random forest
from sklearn.linear_model import LogisticRegression # to train model with logistic regression
from sklearn.metrics import accuracy_score # to calculate accuracy score
import warnings # to ignore warnings
from sklearn.dummy import DummyClassifier # to train model with dummy classifier

### Save Dataframe
Use pandas to read the users_behavior.csv file

In [4]:
# Read csv file 
df = pd.read_csv('datasets/users_behavior.csv') 

# Show df head
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


### Confirm Data is Clean

#### Null Values

In [5]:
# Show info to check column types and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


This confirms that no null values exist.

#### Duplicate Values

In [6]:
# Check for duplicates
df.duplicated().sum()

0

This confirms that no duplicate values exist.

#### Unique Values

In [7]:
# Check is_ultra column to confirm only two unique values exist
df['is_ultra'].value_counts()

is_ultra
0    2229
1     985
Name: count, dtype: int64

Only 0 and 1 appear in this column, so every observation maps to one of the two plans. The classes are imbalanced: 2229 of the 3214 observations (about 69%) are Smart.

### Types and Subtypes

#### Supervised Learning

As both the features (calls, minutes, messages, mb_used) and the target plan type (Ultra or Smart) are given, this is a supervised learning task. This means an algorithm will learn from the features to predict the target value.

#### Classification

As the target is given in the form of a category, this will be a classification task. As the target only provides for two outcomes (ultra or smart), this will be a binary classification task.

## Split Data

As we do not have a separate test set, the data will be split into 3 parts: the training, validation and test sets. This split will occur in a 3:1:1 ratio, which keeps the validation and test sets the same size.

We can do this by first splitting the data in a 3:2 ratio to give two datasets. The first will serve as the training set. The second will be split again in a 1:1 ratio to give the validation and test sets.

Once this is done, the datasets will be split into features and target.

### Training Set

In [8]:
# Split data into the training set (60%) and a combined validation/test set (40%)
df_train, df_valid_test = train_test_split(df, test_size=0.4, random_state=12345)

### Validation and Test Sets

In [9]:
# Split df_valid_test into validation and test
df_valid, df_test = train_test_split(df_valid_test, test_size=0.5, random_state=12345)

### Split into Features and Target

In [10]:
# Split the training data into features and target
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']

# Split the validation data into features and target
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']

# Split the test data into features and target
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

In [11]:
# Confirm the size of the split data sets
print('x_train size:', df_train.shape)
print('x_test size:', df_test.shape)
print('x_val size:', df_valid.shape)

x_train size: (1928, 5)
x_test size: (643, 5)
x_val size: (643, 5)
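
The row counts above follow directly from the two-step split. As a minimal sketch, the same split can be reproduced on a synthetic stand-in for the data to check the proportions and the share of Ultra users in each part. The frame `demo` and its values are hypothetical; only the row count and class counts are taken from the outputs above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in (hypothetical values) with the same row count and class counts
rng = np.random.default_rng(12345)
demo = pd.DataFrame({
    'calls': rng.integers(0, 200, size=3214).astype(float),
    'is_ultra': np.array([0] * 2229 + [1] * 985),
})

# Same two-step split as above: 60% for training, then the remaining 40% halved
demo_train, demo_rest = train_test_split(demo, test_size=0.4, random_state=12345)
demo_valid, demo_test = train_test_split(demo_rest, test_size=0.5, random_state=12345)

# Report the size, share of the data, and Ultra share of each part
for name, part in [('train', demo_train), ('valid', demo_valid), ('test', demo_test)]:
    share = len(part) / len(demo)
    ultra_share = part['is_ultra'].mean()
    print(f'{name}: {len(part)} rows, {share:.0%} of data, Ultra share {ultra_share:.2f}')
```

Because the split is random rather than stratified, the Ultra share may differ slightly between the parts.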


## Model Testing

Now that the sets are split, 3 different models will be tested. These include:

1. Decision Trees
2. Random Forests
3. Logistic Regression

For each model, hyperparameters will be varied to find the best-performing version.

### Decision Trees

#### Description

Decision trees simplify the problem by making binary splits at multiple levels. For instance, the decision tree might first classify a user's messaging habits as high or low by setting a certain messaging threshold. If this is low, a follow-up split may look at the user's internet usage.

Where internet usage and messaging habits are both low, the user may have a higher chance of holding a plan that has tighter limits. However, the decision tree may go another level deeper and look at the number of phone calls a user makes.
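
To make the threshold idea concrete, here is a minimal sketch that fits a shallow tree on toy data and prints its learned rules with `export_text`. The data, the thresholds of 50 and 30, and the use of `messages` and `mb_used` as feature names are all hypothetical and chosen only for illustration; they are not taken from the Megaline dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: two features and a label defined by simple thresholds
rng = np.random.default_rng(12345)
toy_features = rng.uniform(0, 100, size=(200, 2))
toy_target = ((toy_features[:, 0] > 50) & (toy_features[:, 1] > 30)).astype(int)

# A tree limited to two levels of splits
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=12345)
toy_tree.fit(toy_features, toy_target)

# Print the learned splits as nested if/else rules
rules = export_text(toy_tree, feature_names=['messages', 'mb_used'])
print(rules)
```

Each printed line is one binary split of the form `feature <= threshold`, and each leaf ends in a predicted class.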

#### Hyperparameters

`max_depth`: The maximum number of levels the decision tree may grow.

`random_state`: Allows for experiment consistency by assigning a random state. '12345' will be used.

By iterating through a number of depth levels and checking the accuracy of the model on the validation set at each one, an optimal depth can be found. This will be done for the Megaline users to find the best decision tree model.

In [12]:
# Find the best accuracy score for the model and the corresponding max_depth
# Save the best model and its result as best_model and best_result
best_result = 0
best_model = None

# Create loop to find the best accuracy score with different max_depths
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth) # create a model with the given depth
    model.fit(features_train, target_train) # train the model
    predictions = model.predict(features_valid) # get the model's predictions
    result = accuracy_score(target_valid, predictions) # calculate accuracy against the validation set
    if result > best_result: # if the result is better than the previous best, save the model and its result
        best_model = model
        best_result = result

print("Accuracy of the best model:", round(best_result*100,2), "%")
print("Found with a depth of:", best_model.max_depth)

Accuracy of the best model: 78.54 %
Found with a depth of: 3


### Random Forest

#### Description
A random forest is another learning algorithm. It trains a group of independent decision trees, each on a random sample of the data, and combines their predictions into a single class prediction.
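
As a minimal sketch of how the trees are combined: scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by its individual trees (rather than counting hard votes) and predicts the class with the highest average. The toy data below is hypothetical and only serves to check that averaging the trees by hand reproduces the forest's own probabilities.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: four features, label driven by the first two
rng = np.random.default_rng(12345)
toy_features = rng.normal(size=(300, 4))
toy_target = (toy_features[:, 0] + toy_features[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=12345)
forest.fit(toy_features, toy_target)

# Average the class probabilities predicted by each individual tree
tree_probas = np.mean(
    [tree.predict_proba(toy_features) for tree in forest.estimators_], axis=0
)

# Compare the manual average with the forest's own probabilities
matches_forest = np.allclose(tree_probas, forest.predict_proba(toy_features))
print('Manual average matches forest:', matches_forest)
```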

#### Hyperparameters

`max_depth`: As with decision trees, depth will be iterated for values 1 to 10.

`n_estimators`: The number of trees that are built before voting takes place. This will be iterated from 10 to 50. 

`random_state`: '12345' will be used.

#### Testing

In [13]:
# Find the best accuracy score for the model and the corresponding max_depth and n_estimators
best_score = 0
best_est = 0
best_depth = 0

# Create loop to find the best accuracy score with different max_depths and n_estimators
for est in range(10, 51): # choose hyperparameter range
    for depth in range(1, 11):
        model = RandomForestClassifier(max_depth=depth, random_state=12345, n_estimators=est) # set number of trees
        model.fit(features_train, target_train) # train model on training set
        score = model.score(features_valid, target_valid) # calculate accuracy score on validation set
        if score > best_score: # if accuracy score is better than previous best, save model and its accuracy score
            best_score = score
            best_est = est
            best_depth = depth

print("Accuracy of the best model:", round(best_score*100,2), "%")
print("Found with",best_est, "estimators and a depth of", best_depth)

Accuracy of the best model: 80.87 %
Found with 40 estimators and a depth of 8


### Logistic Regression

#### Description

Logistic regression estimates the probability that an observation (calls, minutes, messages, mb_used) belongs to the positive class (Ultra). The observation is then assigned to Ultra or Smart based on that probability.
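
A minimal sketch of that probability step, on hypothetical toy data: the model computes a linear score $z = w \cdot x + b$ and passes it through the sigmoid $1 / (1 + e^{-z})$ to get the probability of class 1. The check below recomputes this by hand from the fitted coefficients and compares it with `predict_proba`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: four features and a noisy binary label
rng = np.random.default_rng(12345)
toy_features = rng.normal(size=(300, 4))
noise = rng.normal(scale=0.5, size=300)
toy_target = (toy_features[:, 0] - toy_features[:, 2] + noise > 0).astype(int)

logit = LogisticRegression(random_state=12345)
logit.fit(toy_features, toy_target)

# Linear score z = w . x + b, then the sigmoid maps it into (0, 1)
z = toy_features @ logit.coef_[0] + logit.intercept_[0]
manual_proba = 1 / (1 + np.exp(-z))

# Compare with the probability of class 1 reported by the model
matches_model = np.allclose(manual_proba, logit.predict_proba(toy_features)[:, 1])
print('Manual sigmoid matches predict_proba:', matches_model)
```

An observation is assigned to class 1 when this probability exceeds 0.5, which is the same as the linear score being positive.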

#### Hyperparameters

`solver`: The optimization algorithm used to fit the model. Five solvers will be tried and compared.

`random_state`: '12345' will be used.

In [14]:
# Save solvers to a list
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# Create a loop to find the best accuracy score with different solvers
for solver in solvers:
    # Ignore warnings
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        
        # Train model
        model = LogisticRegression(random_state=12345, solver=solver)
        model.fit(features_train, target_train)
        score_valid = round(model.score(features_valid, target_valid), 3) * 100
        print('Solver:', solver)
        print('Accuracy on the validation set:', score_valid, '%')
        print()


Solver: newton-cg
Accuracy on the validation set: 75.6 %

Solver: lbfgs
Accuracy on the validation set: 75.6 %

Solver: liblinear
Accuracy on the validation set: 70.8 %

Solver: sag
Accuracy on the validation set: 70.6 %

Solver: saga
Accuracy on the validation set: 70.6 %



Out of these models, the newton-cg and lbfgs solvers tie for the best accuracy at 75.6%.
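
One possible explanation for the weaker liblinear, sag and saga results is that the features sit on very different scales (for example, `messages` versus `mb_used`), which can slow the convergence of some solvers. This is an assumption: the warnings were suppressed above, so it was not checked in this project. A minimal sketch of how it could be tested, using a `StandardScaler` pipeline on hypothetical toy data with similarly mismatched scales:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one small-scale and one large-scale feature
rng = np.random.default_rng(12345)
n = 1000
toy_features = np.column_stack([
    rng.uniform(0, 100, n),    # small-scale feature, similar in range to messages
    rng.uniform(0, 40000, n),  # large-scale feature, similar in range to mb_used
])
signal = toy_features[:, 0] / 100 + toy_features[:, 1] / 40000
toy_target = (signal + rng.normal(scale=0.2, size=n) > 1).astype(int)

x_train, x_valid, y_train, y_valid = train_test_split(
    toy_features, toy_target, test_size=0.25, random_state=12345
)

# Standardize the features before fitting, then compare two solvers
scaled_scores = {}
for solver in ['lbfgs', 'saga']:
    scaled_model = make_pipeline(
        StandardScaler(), LogisticRegression(random_state=12345, solver=solver)
    )
    scaled_model.fit(x_train, y_train)
    scaled_scores[solver] = scaled_model.score(x_valid, y_valid)
    print(solver, round(scaled_scores[solver], 3))
```

Applying the same pipeline to `features_train` and `features_valid` would show whether scaling closes the gap between solvers on the real data.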

## Model Quality

The best model will be chosen based on both accuracy and how well it avoids overfitting.

### Accuracy

The random forest had the best accuracy on the validation set at 80.87%. This is followed by the decision tree at 78.54% and lastly by logistic regression at 75.6%.

### Fitting

Models may fit the training data closely but fail to generalize to the validation and test sets, a problem known as overfitting. This is particularly true for decision tree models.

Random forests and logistic regression models are less prone to this. As the random forest also has a higher validation accuracy than the other models, the random forest model will be used.
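
Overfitting can be checked by comparing accuracy on the training set with accuracy on the validation set: a large gap suggests the model has memorized the training data. The sketch below illustrates this on hypothetical toy data with noisy labels, comparing a shallow tree with an unrestricted one; it was not run on the Megaline data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the label depends on one feature plus heavy noise
rng = np.random.default_rng(12345)
toy_features = rng.normal(size=(1000, 4))
toy_target = (toy_features[:, 0] + rng.normal(scale=1.0, size=1000) > 0).astype(int)

x_train, x_valid, y_train, y_valid = train_test_split(
    toy_features, toy_target, test_size=0.25, random_state=12345
)

# Compare a shallow tree (max_depth=3) with an unrestricted tree (max_depth=None)
gaps = {}
for depth in [3, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    tree.fit(x_train, y_train)
    train_acc = tree.score(x_train, y_train)
    valid_acc = tree.score(x_valid, y_valid)
    gaps[depth] = train_acc - valid_acc
    print(f'max_depth={depth}: train {train_acc:.3f}, valid {valid_acc:.3f}')
```

The unrestricted tree can fit the noise in the training labels, so its training accuracy is far higher than its validation accuracy.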

## Chosen Model: Random Forest

With the model chosen, the training and validation sets are combined so that the final model has more data to learn from. The retrained model is then applied to the test data to determine the final accuracy.

In [15]:
# Combine training and validation sets
features_train_valid = pd.concat([features_train, features_valid], axis=0)

# Combine training and validation targets
target_train_valid = pd.concat([target_train, target_valid], axis=0)

# Create a RandomForestClassifier with the best hyperparameters
model = RandomForestClassifier(max_depth=8, random_state=12345, n_estimators=40)

# Train model on training and validation sets
model.fit(features_train_valid, target_train_valid)

# Calculate accuracy score on test set
score_test = round(model.score(features_test, target_test), 3) * 100

# Print accuracy score
print('Accuracy on the test set:', score_test, '%')


Accuracy on the test set: 79.9 %


### Analysis

The test accuracy is slightly lower than the validation accuracy (79.9% versus 80.87%), but it is still above the 75% threshold.

## Sanity Check

A sanity check will be made to ensure that the model performs better than chance. To do so, a dummy classifier will be trained.

In [16]:
# Train model
model = DummyClassifier(strategy='uniform', random_state=12345)
model.fit(features_train_valid, target_train_valid)

# Calculate accuracy score on test set
score_test = round(model.score(features_test, target_test), 2) * 100

# Print accuracy score
print('Accuracy on the test set:', score_test, '%')

Accuracy on the test set: 48.0 %


### Analysis

The sanity check is passed, as the random model only reached an accuracy of 48%, much lower than the 79.9% achieved by the random forest model.
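
A stricter baseline than uniform guessing is to always predict the majority class. Given the class counts in the `is_ultra` column (2229 Smart, 985 Ultra), such a baseline would be right about 69% of the time on the full dataset. The sketch below reproduces that figure with `DummyClassifier(strategy='most_frequent')` on labels with the same counts; the feature array is a placeholder, and the baseline's accuracy on the actual test split was not measured here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the same class counts as the is_ultra column
toy_target = np.array([0] * 2229 + [1] * 985)
# Placeholder features: the dummy model ignores them
toy_features = np.zeros((len(toy_target), 1))

# Always predict the most frequent class seen during fitting
majority = DummyClassifier(strategy='most_frequent')
majority.fit(toy_features, toy_target)

majority_accuracy = majority.score(toy_features, toy_target)
print('Majority-class accuracy:', round(majority_accuracy * 100, 2), '%')
```

The random forest's 79.9% test accuracy is also above this figure of roughly 69%.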

## Conclusion

### Dataset
In summary, the data was loaded and confirmed to be clean. The task was identified as supervised learning, specifically binary classification.

### Split

The data was split into training, validation and test sets in a 3:1:1 ratio. This ratio keeps the validation and test sets the same size.

### Model Testing

Three models were compared: decision trees, random forests, and logistic regression. Each was trained with different hyperparameters to find the best validation accuracy.

### Model Quality

Model quality was judged on accuracy and on resistance to overfitting. The random forest had the best accuracy and is less prone to overfitting than a decision tree. Thus, the random forest model was chosen.

### Chosen Model

The random forest model was retrained on the combined training and validation sets. On the test set, it produced an accuracy of 79.9%.

### Sanity Check

To ensure the model is sufficient, a sanity check was made. A model that randomly assigned classes was created and reached an accuracy of 48%. The sanity check is passed, as this accuracy is much lower than that of the random forest model.