<a href="https://colab.research.google.com/github/fkihu/Model-Quality-and-Improvement-Assignment/blob/main/Regression_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**PROBLEM** **STATEMENT**



Mobile carrier Megaline has found out that many of their subscribers use legacy plans.
They want to develop a model that would analyze subscribers' behavior and recommend
one of Megaline's newer plans: Smart or Ultra. This is using available subscriber behaviour data for some subscribers who're already on the two plans. In order to determine which plan is the better of the two, we'll explore the data and evaluate the accuracy of the regression models. The model that has the highest accuracy score is the best.


In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


Reading the data

In [None]:
telco_df = pd.read_csv('https://bit.ly/UsersBehaviourTelco')
telco_df.head()
telco_df.tail()
telco_df.sample(10)


Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
2360,93.0,679.07,61.0,22704.52,0
1194,105.0,830.37,21.0,21165.03,1
2635,47.0,341.32,0.0,13936.79,0
3060,42.0,277.25,49.0,15483.11,0
3176,58.0,401.49,71.0,28074.34,1
1608,101.0,708.27,0.0,16216.94,1
69,84.0,607.05,22.0,24875.03,0
2309,31.0,219.09,26.0,18407.09,0
258,86.0,589.43,9.0,21046.69,0
1928,64.0,520.31,0.0,18407.05,0


Checking the shape of the dataset

In [None]:
telco_df.shape #3214 observations with 5 features

(3214, 5)

Checking for null values

In [None]:
telco_df.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

Exploring the data types

In [None]:
telco_df.dtypes

calls       float64
minutes     float64
messages    float64
mb_used     float64
is_ultra      int64
dtype: object

Splitting the dataset into the training, validation and test sets. These will be split in the ratio 60%:20%:20% respectively. To do this, we will begin by splitting the dataset into two, the training set (80%) and test set (20%).

In [None]:
from sklearn.model_selection import train_test_split

telco_train, telco_test = train_test_split(telco_df, test_size=0.20, random_state=12345)

In [None]:
print(telco_train.shape) #2571 records
print(telco_test.shape) #643 records

(2571, 5)
(643, 5)


We will then split the training set into two in order to get a validation set (20%).

In [None]:
telco_train, telco_valid = train_test_split(telco_df, test_size=0.20, random_state=12345)

In [None]:
print(telco_train.shape) #2571 records
print(telco_valid.shape) #643 records
print(telco_test.shape) #643 records

(2571, 5)
(643, 5)
(643, 5)


Declaring the features and target for the telco_train

In [None]:
features_train = telco_train.drop(columns=['is_ultra'])
target_train = telco_train["is_ultra"]

In [None]:
# Creating the decision tree model and training it.

for depth in range(1, 10):
  model = DecisionTreeRegressor(random_state=12345, max_depth=depth)
  model.fit(features_train, target_train)
  features_valid = telco_valid.drop(columns=['is_ultra']) #Declaring the features of the validation dataset
  target_valid = telco_valid["is_ultra"]
  predictions_valid = model.predict(features_valid) # Finding the prediction of the decision tree regressor model using the validation dataset.
  model.score(features_valid, target_valid)
  print("max_depth =", depth, ": ", end='')
  print(model.score(features_valid, target_valid))



max_depth = 1 : 0.10750191611813276
max_depth = 2 : 0.20214605360034532
max_depth = 3 : 0.22014945955792686
max_depth = 4 : 0.228880600423233
max_depth = 5 : 0.239069247263892
max_depth = 6 : 0.21988045722525773
max_depth = 7 : 0.2095357717516576
max_depth = 8 : 0.16765025360907715
max_depth = 9 : 0.15838534731255238


In [None]:
# Creating the random forest model and training it.

# Tuning the Hyperparameters
# Best n_estimators is 69

for n_est in range(61, 71):
  model_rf = RandomForestRegressor(random_state=12345, n_estimators=n_est, max_depth=10)
  model_rf.fit(features_train, target_train)
  features_valid = telco_valid.drop(columns=['is_ultra']) #Declaring the features of the validation dataset
  target_valid = telco_valid["is_ultra"]
  predictions_valid = model.predict(features_valid) # Finding the prediction of the random forest regressor model using the validation dataset.
  model_rf.score(features_valid, target_valid)
  print("n_estimators =", n_est, ": ", end='')
  print(model_rf.score(features_valid, target_valid))

n_estimators = 61 : 0.2832786759353956
n_estimators = 62 : 0.2834528298371318
n_estimators = 63 : 0.2842233351673833
n_estimators = 64 : 0.28390078794126206
n_estimators = 65 : 0.28378896467690307
n_estimators = 66 : 0.2846588553909597
n_estimators = 67 : 0.2850307526591992
n_estimators = 68 : 0.2854163820558151
n_estimators = 69 : 0.28638250362871487
n_estimators = 70 : 0.2870155631637288


In [None]:

# Tuning the Hyperparameters
# Best n_estimators is 69
# Best max_depth is 8

for depth in range(1, 11):
  model_rf = RandomForestRegressor(random_state=12345, n_estimators=n_est, max_depth=depth)
  model_rf.fit(features_train, target_train)
  features_valid = telco_valid.drop(columns=['is_ultra']) #Declaring the features of the validation dataset
  target_valid = telco_valid["is_ultra"]
  predictions_valid = model.predict(features_valid) # Finding the prediction of the random forest regressor model using the validation dataset.
  model_rf.score(features_valid, target_valid)
  print("max_depth =", depth, ": ", end='')
  print(model_rf.score(features_valid, target_valid))

max_depth = 1 : 0.12328887094622974
max_depth = 2 : 0.22258699251126002
max_depth = 3 : 0.23499487097923155
max_depth = 4 : 0.27505826518584775
max_depth = 5 : 0.2916042504876165
max_depth = 6 : 0.29414080937418474
max_depth = 7 : 0.2986848487834123
max_depth = 8 : 0.2997348925396571
max_depth = 9 : 0.29439208894315305
max_depth = 10 : 0.2870155631637288


In [None]:
# Tuning the Hyperparameters
# Best n_estimators is 69
# Best max_depth is 8
# Best max_leaf_nodes is 72 0.29734392013858213

for max_leaf in range(70, 80):
  model_rf = RandomForestRegressor(random_state=12345, n_estimators=69, max_depth=8, max_leaf_nodes=max_leaf)
  model_rf.fit(features_train, target_train)
  features_valid = telco_valid.drop(columns=['is_ultra']) #Declaring the features of the validation dataset
  target_valid = telco_valid["is_ultra"]
  predictions_valid = model.predict(features_valid) # Finding the prediction of the random forest regressor model using the validation dataset.
  model_rf.score(features_valid, target_valid)
  print("max_leaf_nodes =", max_leaf, ": ", end='')
  print(model_rf.score(features_valid, target_valid))

max_leaf_nodes = 70 : 0.29685296008563833
max_leaf_nodes = 71 : 0.2967995726863245
max_leaf_nodes = 72 : 0.29734392013858213
max_leaf_nodes = 73 : 0.2972184166423115
max_leaf_nodes = 74 : 0.29699892685583384
max_leaf_nodes = 75 : 0.2971152860287125
max_leaf_nodes = 76 : 0.2970886418143497
max_leaf_nodes = 77 : 0.2970286585119314
max_leaf_nodes = 78 : 0.296744115954932
max_leaf_nodes = 79 : 0.29673662011426827


In [None]:
# Creating the Linear Regression Model

model_lr = LinearRegression()


In [None]:
# Training the model
model_lr.fit(features_train, target_train)

In [None]:
# Finding the prediction of the linear regression model using the validation dataset
predictions_valid3 = model_lr.predict(features_valid)

In [None]:
# Testing the accuracy of the model
model_lr.score(features_valid, target_valid)

0.07303309479416342

Findings:

The best model is the decision tree model, having achieved a best score of 0.29734392013858213