
# Prerequisite

In [1]:
# import initial libraries i.e. pandas and numpy
import pandas as pd
import numpy as np

The goal is to select the model that best predicts whether a customer should be recommended the Ultra or the Smart subscription plan, based on the customer's behaviour.

In view of the above, our problem is a classification problem.

The project will be successful if we are able to select the best model among the Random Forest, Decision Tree and Linear Regression models. Each model below is fitted as a regressor on the binary `is_ultra` target, and the models are compared by their RMSE on a held-out test set.

# Load and preview the megaline carrier dataset

In [2]:
# Import megaline subscriber behaviour dataset and preview the first few records
mega_df = pd.read_csv("https://bit.ly/UsersBehaviourTelco")
mega_df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


# Data Exploration

In [3]:
# check the structure of the dataframe
mega_df.shape

(3214, 5)

In [None]:
# check the column datatypes
mega_df.dtypes

calls       float64
minutes     float64
messages    float64
mb_used     float64
is_ultra      int64
dtype: object

In [None]:
# check the mb_used column
mega_df['mb_used'].describe()


count     3214.000000
mean     17207.673836
std       7570.968246
min          0.000000
25%      12491.902500
50%      16943.235000
75%      21424.700000
max      49745.730000
Name: mb_used, dtype: float64
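
Two further checks that are often useful at this stage are the number of missing values per column and the balance between the two plans in the target. The sketch below is illustrative only: it re-types the five preview rows shown earlier into a small dataframe (named `sample_df` here, a hypothetical name) so that it runs without downloading the dataset. On the full data, the same calls would be made on `mega_df`.

```python
import pandas as pd

# The five preview rows from mega_df.head(), re-typed by hand for illustration
sample_df = pd.DataFrame(
    {
        "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
        "minutes": [311.9, 516.75, 467.66, 745.53, 418.74],
        "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
        "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
        "is_ultra": [0, 0, 0, 1, 0],
    }
)

# Number of missing values in each column
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Share of each plan in the target column (0 = Smart, 1 = Ultra)
plan_share = sample_df["is_ultra"].value_counts(normalize=True)
print(plan_share)
```

A strongly unbalanced target would be worth keeping in mind when interpreting any error metric later on.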

# Data cleanup

In [4]:
# Check for outliers in the dataframe and decide what to do with them
# First compute the quartiles of every column using the quantile() function
Q1 = mega_df.quantile(0.25)
Q3 = mega_df.quantile(0.75)
IQR = Q3 - Q1

# Then select the rows that have at least one value outside
# the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
mega_df_iqr = mega_df[((mega_df < (Q1 - 1.5 * IQR)) | (mega_df > (Q3 + 1.5 * IQR))).any(axis=1)]

# One way of dealing with outliers is to remove them
# Check how many rows contain outliers before cleaning
mega_df_iqr.shape

(208, 5)

There are 208 records flagged as outliers. We remove these records before building our models.

In [5]:
# Let's drop the outliers and retain a clean dataframe
clean_df = mega_df[ ~((mega_df < (Q1 - 1.5 * IQR)) | (mega_df > (Q3 + 1.5 * IQR))).any(axis=1)]

# Checking the size of our final dataset.
clean_df.shape

(3006, 5)

Our clean dataframe has 3006 records after dropping the 208 outliers from the original 3214.
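
The same IQR rule can be wrapped in a small helper so that the outlier mask is computed once and reused for both selecting and dropping rows. This is a minimal sketch, not the code used above: the function name `iqr_outlier_mask`, the parameter `k` and the toy dataframe are all hypothetical and chosen only for illustration.

```python
import pandas as pd


def iqr_outlier_mask(df, k=1.5):
    """Return a boolean Series that is True for rows with at least one
    value outside [Q1 - k * IQR, Q3 + k * IQR] in any column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    outside = (df < (q1 - k * iqr)) | (df > (q3 + k * iqr))
    return outside.any(axis=1)


# Toy data (made up): the last row has an extreme number of calls
toy_df = pd.DataFrame(
    {
        "calls": [10.0, 12.0, 11.0, 13.0, 12.0, 200.0],
        "minutes": [100.0, 110.0, 105.0, 120.0, 115.0, 108.0],
    }
)

mask = iqr_outlier_mask(toy_df)
outliers = toy_df[mask]   # rows flagged as outliers
kept = toy_df[~mask]      # rows retained after cleaning

print(outliers.shape, kept.shape)
```

Computing the mask once avoids repeating the long boolean expression and guarantees that the flagged rows and the retained rows are exact complements.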

# Modelling

Random Forest Regression model

In [22]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# define features and target
features = clean_df.drop(['is_ultra'], axis=1)
target = clean_df['is_ultra']

# split the dataset into a training set and a test set, with the test set being 25% of the dataset
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=12345
)

# initialize the random forest model with 40 trees
rfr_model = RandomForestRegressor(n_estimators=40, random_state=12345)

# train model on training set
rfr_model.fit(features_train, target_train)

# get the test RMSE value
predictions = rfr_model.predict(features_test)
answers = target_test

mse= mean_squared_error(answers,predictions)
rmse = mse**0.5
print('MSE: ', mse)
print('RMSE:',rmse)

# sample customer: 90 calls, 500 minutes, 83 messages, 10000 MB of data
new_features = pd.DataFrame(
    [[90, 500, 83, 10000]],
    columns=features.columns
)

predicted = rfr_model.predict(new_features)  
print(predicted)

MSE:  0.15733460771276597
RMSE: 0.3966542672312577
[0.65]
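
The RMSE reported above is the square root of the mean squared difference between the true 0/1 labels and the predicted scores. The sketch below computes it by hand with NumPy on a tiny made-up example, and also computes the RMSE of a constant predictor that always outputs the mean of the labels. The helper name `rmse` and the toy values are assumptions for illustration; the constant predictor is only a suggested reference point, not part of the original analysis.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy labels and scores (made up for illustration)
toy_answers = [0, 1, 1, 0]
toy_predictions = [0.0, 1.0, 0.5, 0.5]

model_rmse = rmse(toy_answers, toy_predictions)

# Reference point: always predict the mean of the labels
constant_prediction = [float(np.mean(toy_answers))] * len(toy_answers)
baseline_rmse = rmse(toy_answers, constant_prediction)

print("model RMSE:   ", model_rmse)
print("baseline RMSE:", baseline_rmse)
```

An RMSE is easier to interpret next to such a reference value: a model is only useful if it scores clearly below the constant predictor on the same test set.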


Linear Regression Model

In [19]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# define features and target
features = clean_df.drop(['is_ultra'], axis=1)
target = clean_df['is_ultra']

# split the dataset into a training set and a test set, with the test set being 25% of the dataset
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=12345
)

# initialize the linear regression model
lr_model = LinearRegression()

# train model on training set
lr_model.fit(features_train, target_train)

# get the test RMSE value
predictions2 = lr_model.predict(features_test)
answers2 = target_test

mse2= mean_squared_error(answers2,predictions2)
rmse2 = mse2**0.5
print('MSE: ', mse2)
print('RMSE:',rmse2)

MSE:  0.187616316872702
RMSE: 0.4331469922240047


Decision Tree Regression model

In [18]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# define features and target
features = clean_df.drop(['is_ultra'], axis=1)
target = clean_df['is_ultra']

# split the dataset into a training set and a test set, with the test set being 25% of the dataset
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.25, random_state=12345
)

# initialize the decision tree model with a maximum depth of 40
dt_model = DecisionTreeRegressor(max_depth=40)

# train model on training set
dt_model.fit(features_train, target_train)

# get the test RMSE value
predictions3 = dt_model.predict(features_test)
answers3 = target_test

mse3= mean_squared_error(answers3,predictions3)
rmse3 = mse3**0.5
print('MSE: ', mse3)
print('RMSE:',rmse3)

MSE:  0.28856382978723405
RMSE: 0.5371813751306295


# Findings / Recommendations


The selected model for this project is the Random Forest Regression model. It gave the lowest test RMSE, about 0.397, compared with about 0.433 for Linear Regression and about 0.537 for the Decision Tree.

The sample prediction of 0.65, rounded to the nearest whole number, gives 1. This indicates that the business should recommend the Ultra plan to a customer who had 90 calls, 500 minutes, 83 messages and 10000 MB of data.
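
Rounding the sample prediction amounts to applying a cut-off to the regression score. A minimal sketch of that step is shown below, assuming a cut-off of 0.5 (which matches rounding to the nearest whole number for scores strictly between 0 and 1). The function name `scores_to_plans` and the `threshold` parameter are hypothetical.

```python
import numpy as np


def scores_to_plans(scores, threshold=0.5):
    """Map regression scores to plan names: scores at or above the
    threshold become "Ultra", all others become "Smart"."""
    scores = np.asarray(scores, dtype=float)
    return np.where(scores >= threshold, "Ultra", "Smart")


# The sample prediction reported above
sample_plan = scores_to_plans([0.65])
print(sample_plan)
```

The cut-off is a design choice: raising it makes the Ultra recommendation more conservative, while lowering it recommends Ultra to more customers.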