# Megaline Plan Recommendation Analysis

## Project Description

This project analyzes preprocessed data on Megaline clients and their current subscription plans. The goal is to build a predictive model that can reliably recommend a subscription plan to a user based on their monthly usage data.

The data will be used to train several types of models and select the most accurate one. The dataset contains the following columns:

* calls — number of calls,
* minutes — total call duration in minutes,
* messages — number of text messages,
* mb_used — Internet traffic used in MB,
* is_ultra — plan for the current month (Ultra - 1, Smart - 0).

### Import Libraries

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

### Load data

In [2]:
# Load data into variable
df = pd.read_csv('/datasets/users_behavior.csv')

### View Data

We now inspect the data to get a sense of its overall structure by displaying the dataset and printing its general information. As expected, the data contains the five columns described above.

In [3]:
# Display data and info
display(df)
df.info()

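|  | calls | minutes | messages | mb_used | is_ultra |
| --- | --- | --- | --- | --- | --- |
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |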


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


The dataset contains 3214 rows and has **no missing values**. All columns also appear to have an appropriate data type.
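A quick programmatic check can back up the visual inspection. The sketch below uses a small stand-in frame (`sample`, a name introduced here) built from the first four rows of the preview above, since the full file is not bundled with this text; on the real data the same calls would be run on `df`.

```python
import pandas as pd

# Stand-in for the real dataset: the first four rows shown in the preview.
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0],
    "minutes": [311.90, 516.75, 467.66, 745.53],
    "messages": [83.0, 56.0, 86.0, 81.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39],
    "is_ultra": [0, 0, 0, 1],
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Share of each plan in the target column; useful context when reading accuracy.
plan_share = sample["is_ultra"].value_counts(normalize=True)
print(plan_share)
```

The plan shares printed here describe only the four sample rows, not the full dataset.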

## Data Preparation

We now focus on preparing the data to be used for modeling. Our first task will be to isolate our features and target data of concern. As our data shows, features available demonstrate service usage of each client as well as their current subscription plan. This subscription plan feature in the `is_ultra` column shows each of the two plans represented by either a 1 (user is subscribed to Ultra plan) or a 0 (user is subscribed to Smart plan). 

The focus then becomes building a model that can **classify** each client's usage data and predict which of these two plans to recommend.

### Feature Engineering

We separate the data into two variables: the features, which hold the usage data the model will use to make predictions, and the target, which holds the plan it must predict:

In [4]:
# Isolate features and target variables
features = df.drop('is_ultra', axis=1)
target = df['is_ultra']

### Splitting Data

Our next task is to split the data for training, validation and testing. Since we do not possess a separate test dataset, we will split the data in a **3:1:1** ratio: 60% for training, 20% for validation and 20% for testing.

In [5]:
# Split data into train, validation and test sets
features_main, features_test, target_main, target_test = train_test_split(features, target, random_state=12345, test_size=0.2)
features_train, features_valid, target_train, target_valid = train_test_split(features_main, target_main, random_state=12345, test_size=0.25)
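
The second call uses `test_size=0.25` because it operates on the remaining 80% of the rows, and 0.25 × 0.8 = 0.2 of the full dataset. The following sketch checks the resulting set sizes on placeholder data with the same number of rows (3214). The placeholder frame and the `demo_*` names are illustrative and not part of the project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with the same row count as the real dataset.
n_rows = 3214
demo_features = pd.DataFrame({"x": np.arange(n_rows)})
demo_target = pd.Series(np.arange(n_rows) % 2)

# First split: hold out 20% of all rows for testing.
demo_main, demo_test, y_main, y_test = train_test_split(
    demo_features, demo_target, random_state=12345, test_size=0.2)
# Second split: 25% of the remaining 80% becomes the validation set.
demo_train, demo_valid, y_train, y_valid = train_test_split(
    demo_main, y_main, random_state=12345, test_size=0.25)

sizes = {"train": len(demo_train), "valid": len(demo_valid), "test": len(demo_test)}
shares = {name: round(size / n_rows, 3) for name, size in sizes.items()}
print(sizes)
print(shares)
```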

## Model Training and Validation

Now that the data has been split, we will train different types of classification models for validation and testing. For this project we will use the `DecisionTreeClassifier`, `RandomForestClassifier` and `LogisticRegression` algorithms available in the Scikit-Learn library.

In this stage each model is trained repeatedly with different hyperparameter values to find the highest prediction accuracy on the validation set. To make the results reproducible, the `random_state` hyperparameter is set to `12345` for all models.

### Decision Tree 

The decision tree model is trained with the `random_state` value mentioned above, and `max_depth` is tuned over the range 1 to 9. Each candidate is then scored on the validation set with the built-in `score` method, which returns accuracy.

In [6]:
# Prime model variables for iteration.
best_modelDT = None
best_depthDT = 0
best_scoreDT = 0

# Train and iterate depth 
for depthDT in range(1, 10):
    modelDT = DecisionTreeClassifier(random_state=12345, max_depth=depthDT)
    modelDT.fit(features_train, target_train)
    scoreDT = modelDT.score(features_valid, target_valid)
    if scoreDT > best_scoreDT:
        best_modelDT = modelDT
        best_depthDT = depthDT
        best_scoreDT = scoreDT

# Print results
print("Best depth:", best_depthDT)
print("Best score:", best_scoreDT)

Best depth: 7
Best score: 0.7744945567651633


The results show that a depth of 7 gives the highest validation accuracy, with a score of about 0.77. This depth will inform the search range used for the following model.
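
For classifiers, scikit-learn's `score` method returns mean accuracy, that is, the fraction of rows whose predicted label equals the true label. The sketch below illustrates that equivalence on synthetic data generated with `make_classification`. The synthetic data is an assumption made for the demonstration and is not the Megaline data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

model = DecisionTreeClassifier(random_state=12345, max_depth=3)
model.fit(X_train, y_train)
predictions = model.predict(X_valid)

# Three equivalent ways to compute accuracy on the validation set.
score_builtin = model.score(X_valid, y_valid)
score_metric = accuracy_score(y_valid, predictions)
score_manual = float(np.mean(predictions == y_valid))
print(score_builtin, score_metric, score_manual)
```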

### Random Forest

We will now train a random forest. The random state stays the same, `max_depth` is tuned over the range 1 to 7 (`range(1, 8)`), based on the previous result from our decision tree model, and the number of estimators takes the values 10, 20, 30 and 40 (`range(10, 50, 10)`).

In [7]:
# Prime model variables for iteration.
best_modelRF = None
best_depthRF = 0
best_estRF = 0
best_scoreRF = 0

# Train and iterate depth and number of estimators
for estRF in range(10, 50, 10):
    for depthRF in range(1, 8):
        modelRF = RandomForestClassifier(random_state=12345, max_depth=depthRF, n_estimators=estRF)
        modelRF.fit(features_train, target_train)
        scoreRF = modelRF.score(features_valid, target_valid)
        if scoreRF > best_scoreRF:
            best_modelRF = modelRF
            best_depthRF = depthRF
            best_estRF = estRF
            best_scoreRF = scoreRF

# Print results
print("Best depth:", best_depthRF)
print("Best number of est:", best_estRF)
print("Best score:", best_scoreRF)

Best depth: 7
Best number of est: 10
Best score: 0.7869362363919129


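The best validation accuracy, about 0.79, is reached with 10 estimators and a depth of 7. This small improvement over the decision tree (0.774) may suggest that the single tree was slightly underfitted, although a gap of this size could also come from chance variation in the validation set.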
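
Because Python's `range` excludes its stop value, it is worth listing the exact grid that the loops above visit. A minimal sketch using only the standard library (the names `estimator_values`, `depth_values` and `grid` are introduced here for illustration):

```python
from itertools import product

# Same range() calls as in the random forest search loop.
estimator_values = list(range(10, 50, 10))
depth_values = list(range(1, 8))

# Every (n_estimators, max_depth) pair the nested loops evaluate.
grid = list(product(estimator_values, depth_values))

print(estimator_values)  # [10, 20, 30, 40]
print(depth_values)      # [1, 2, 3, 4, 5, 6, 7]
print(len(grid))         # 28
```

To include 50 estimators or a depth of 8, the stop values would need to be 51 and 9 respectively. Since the loop only replaces the best model on a strictly higher score, ties are resolved in favor of the combination visited first.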

### Logistic Regression

Our logistic regression model uses the `liblinear` solver with the same random state as the previous models.

In [8]:
# Train model using liblinear solver
modelLR = LogisticRegression(random_state=12345, solver='liblinear')
modelLR.fit(features_train, target_train)

# Calculate accuracy
scoreLR = modelLR.score(features_valid, target_valid)

# Print Results
print("Model score:", scoreLR) 

Model score: 0.7293934681181959


The logistic regression model underperforms both the decision tree and the random forest on the validation set. Whether the model is underfitted remains to be seen.

## Model Testing

We now have an understanding of the most accurate models and effective hyperparameters for testing. Using our obtained hyperparameters for the models, we will now use our test data on all three model types to see if results match our validation process and accuracy remains similar.  

### Decision Tree Testing

Validation selected a depth of 7, so the same depth is used for testing.

In [9]:
# Recreate and train model using obtained hyperparameter
final_modelDT = DecisionTreeClassifier(random_state=12345, max_depth=7)
final_modelDT.fit(features_train, target_train)

# Test model
final_modelDT_score = final_modelDT.score(features_test, target_test)

# Print score
print("Decision Tree test score:", final_modelDT_score)

Decision Tree test score: 0.7884914463452566


The test accuracy is about 0.79, slightly higher than the validation accuracy of 0.77. A test score above the validation score is not a sign of overfitting; overfitting would show up as a noticeably lower score on unseen data. The difference of roughly 0.014 is small and is plausibly due to chance variation between the validation and test samples. It is, however, too early to draw conclusions about the model's effectiveness.
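
One common way to probe for overfitting or underfitting is to compare accuracy on the training set with accuracy on held-out data: a large gap points toward overfitting, while low scores on both point toward underfitting. The helper `train_valid_scores` below is a hypothetical name, and the data is synthetic. It is a sketch of the diagnostic, not a result for the Megaline dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_valid_scores(model, X_train, y_train, X_valid, y_valid):
    """Fit the model and return (training accuracy, validation accuracy)."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_valid, y_valid)


# Synthetic data with some label noise (flip_y), for illustration only.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, flip_y=0.1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# A shallow tree versus a tree with no depth limit.
shallow = train_valid_scores(
    DecisionTreeClassifier(random_state=12345, max_depth=2),
    X_train, y_train, X_valid, y_valid)
deep = train_valid_scores(
    DecisionTreeClassifier(random_state=12345, max_depth=None),
    X_train, y_train, X_valid, y_valid)

print("shallow tree (train, valid):", shallow)
print("unlimited tree (train, valid):", deep)
```

The unlimited tree memorizes its training rows, so any shortfall on the validation set reflects overfitting. Running the same comparison on `features_train` and `features_valid` would show where the depth-7 tree sits between these two extremes.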

### Random Forest Testing

The random forest test uses a depth of 7 and 10 estimators, the hyperparameter values that gave the highest validation accuracy.

In [10]:
# Recreate and train model using obtained hyperparameters
final_modelRF = RandomForestClassifier(random_state=12345, max_depth=7, n_estimators=10)
final_modelRF.fit(features_train, target_train)

# Test model
final_modelRF_score = final_modelRF.score(features_test, target_test)

# Print score
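print("Random Forest test score:", final_modelRF_score)


The recorded run of this cell printed `final_modelDT_score` under a "Decision Tree" label, so its output (0.7884914463452566) repeated the decision tree result instead of showing the random forest's. With the print statement corrected as above, the cell reports `final_modelRF_score`; that value is not part of the recorded output, so no random forest test accuracy can be quoted here.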

### Logistic Regression Testing

Finally we will run a logistic regression model on the test set.

In [11]:
# Test model on test set
final_modelLR_score = modelLR.score(features_test, target_test)

# Print score
print("Logistic Regression test score:", final_modelLR_score)

Logistic Regression test score: 0.7511664074650077


The model scores about 0.75 on the test set, higher than its validation score (0.73) but below the decision tree's test score (0.79). A linear decision boundary on the four raw features may simply be too rigid for this data, which would point to underfitting relative to the tree-based models.

## Conclusion

Across the validation and test sets, the tree-based models produce the highest accuracy scores: the decision tree reaches about 0.79 on the test set and the random forest leads on the validation set with about 0.79, while logistic regression trails on both. The tree-based models are therefore more likely to classify client data effectively into a service plan to recommend.

The hyperparameters that produced these results are:

- Decision Tree: max_depth = 7
- Random Forest: max_depth = 7, n_estimators = 10

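A single decision tree is usually expected to generalize less well than a random forest. Here the two models score within about 0.01 of each other on the validation set, and the apparent tie in the recorded test output comes from the random forest testing cell printing the decision tree's score variable (`final_modelDT_score`), so it should not be read as a measured equality. If the random forest's test accuracy turns out to be similar, the decision tree would be the more practical choice, since it needs only one tuned hyperparameter to reach a comparable result.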
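
The scores reported in the cells above can be gathered into one table for comparison. In this sketch the random forest test entry is left empty because the recorded output of its testing cell shows the decision tree's score; the name `summary` is introduced here for illustration.

```python
import pandas as pd

# Accuracy values copied from the recorded cell outputs, rounded to four decimals.
# NaN marks a score that the recorded outputs do not contain.
summary = pd.DataFrame(
    {
        "validation": [0.7745, 0.7869, 0.7294],
        "test": [0.7885, float("nan"), 0.7512],
    },
    index=["decision tree", "random forest", "logistic regression"],
)

# Difference between test and validation accuracy for each model.
summary["test_minus_validation"] = summary["test"] - summary["validation"]
print(summary.round(4))
```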