Mobile carrier Megaline has found out that many of their subscribers use legacy plans.
They want to develop a model that would analyze subscribers' behavior and recommend
one of Megaline's newer plans: Smart or Ultra.
You have access to behavior data about subscribers who have already switched to the
new plans (from the project for the Statistical Data Analysis course). For this
classification task, you need to develop a model that will pick the right plan. Since you’ve
already performed the data preprocessing step, you can move straight to creating the
model.
Develop a model with the highest possible accuracy. In this project, the threshold for
accuracy is 0.75. Check the accuracy using the test dataset.
1. Open and look through the data file.
2. Split the source data into a training set, a validation set, and a test set.
3. Investigate the quality of different models by changing hyperparameters. Briefly
describe the findings of the study.
4. Check the quality of the model using the test set.
5. Additional task: sanity check the model. This data is more complex than what
you’re used to working with, so it's not an easy task. We'll take a closer look at it
later.
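
One common way to approach the sanity check in step 5 is to compare the model with a constant baseline that always predicts the most frequent class. The sketch below is illustrative only: `majority_baseline_accuracy` is a hypothetical helper and `toy_target` is made-up data, not the Megaline labels.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical toy labels (0 = Smart, 1 = Ultra); not the real dataset.
toy_target = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]

baseline = majority_baseline_accuracy(toy_target)
print(baseline)  # 0.7
```

A trained model passes this check only if its test accuracy is clearly above the baseline computed on the same target. scikit-learn offers the same idea as `DummyClassifier(strategy='most_frequent')`.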

In [None]:
# Import Libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score


# Read Data
df_subs = pd.read_csv('https://bit.ly/UsersBehaviourTelco')

# Preview records
df_subs.head(5)

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


Data Cleaning

In [None]:
# Check for null values
df_subs.isna().sum()

# Calls and messages are whole-number counts, so store them as integers
df_subs['calls'] = df_subs['calls'].astype(int)
df_subs['messages'] = df_subs['messages'].astype(int)

# Review the column data types and non-null counts
df_subs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   int64  
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   int64  
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 125.7 KB


Split Dataset

In [None]:
# Hold out 20% of the rows for testing, then split the remaining 80% again
# (80/20) into training and validation sets: roughly 64% / 16% / 20% overall.
features = df_subs.drop(columns=['is_ultra'])
target = df_subs['is_ultra']
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.20, random_state=12345)
features_train, features_valid, target_train, target_valid = train_test_split(
    features_train, target_train, test_size=0.2, random_state=12345)

print(features_train.shape)
print(features_test.shape)
print(target_train.shape)
print(target_test.shape)

(2056, 4)
(643, 4)
(2056,)
(643,)
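
The printed shapes follow from simple arithmetic on the 3214 rows. The sketch below reproduces them, assuming `train_test_split` rounds the held-out share up to a whole row (which matches the shapes shown above); `two_step_split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def two_step_split_sizes(n_rows, test_size=0.2, valid_size=0.2):
    """Row counts after holding out a test set, then a validation set."""
    # Assumption: the held-out part is rounded up, the remainder is kept.
    n_test = math.ceil(n_rows * test_size)
    n_rest = n_rows - n_test
    n_valid = math.ceil(n_rest * valid_size)
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test


print(two_step_split_sizes(3214))  # (2056, 515, 643)
```

So the final proportions are about 64% training, 16% validation, and 20% test.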


In [None]:
# First, fit a logistic regression model and report its accuracy on the training set
LogRegMod = LogisticRegression(random_state=12345, solver='liblinear')
LogRegMod.fit(features_train, target_train)
LogRegMod.score(features_train, target_train)

0.745136186770428

In [None]:
# Next, fit a decision tree (max_depth=5) and report its accuracy on the training set
DecTreeMod = DecisionTreeClassifier(random_state=12345, max_depth=5)

DecTreeMod.fit(features_train, target_train)
DecTreeMod.score(features_train, target_train)

0.828307392996109
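
Both scores above are measured on the rows the models were trained on, so they tend to be optimistic. A minimal sketch of comparing hyperparameter values on a separate validation set follows. It uses synthetic stand-in data from `make_classification` (an assumption, not the Megaline data), and the names `X_train`, `X_valid`, and `results` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 numeric features, binary target.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Record training and validation accuracy for each tree depth.
results = {}
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    valid_acc = accuracy_score(y_valid, model.predict(X_valid))
    results[depth] = (train_acc, valid_acc)

# Choose the depth with the highest validation accuracy.
best_depth = max(results, key=lambda d: results[d][1])
print(best_depth, results[best_depth])
```

In the notebook, the same loop could be run with `features_train`/`target_train` and `features_valid`/`target_valid`, keeping the test set untouched until the final check.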

In [None]:
# Tune max_depth with a cross-validated grid search (GridSearchCV uses 5-fold CV by default)
depth_param = {'max_depth': range(1, 25)}
DecTreeMod = DecisionTreeClassifier(random_state=12345)
DecTreeModOpt = GridSearchCV(DecTreeMod, depth_param)
DecTreeModOpt.fit(features_train, target_train)
DecTreeModOpt.score(features_train, target_train)
print(DecTreeModOpt.best_estimator_)

DecisionTreeClassifier(max_depth=4, random_state=12345)


In [None]:
# Tune max_depth and n_estimators of a random forest with a cross-validated grid search
forest_params = {'max_depth': range(1, 10), 'n_estimators': range(1, 50)}
RandForestMod = RandomForestClassifier(random_state=12345)
RandForestOpt = GridSearchCV(RandForestMod, forest_params)
RandForestOpt.fit(features_train, target_train)
print(RandForestOpt.best_estimator_)
# Accuracy of the refitted best model on the training set
RandForestOpt.score(features_train, target_train)

RandomForestClassifier(max_depth=7, n_estimators=42, random_state=12345)


0.8599221789883269
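
The score printed above is computed on the same rows the search was fitted on. `GridSearchCV` also stores the mean cross-validated score of the winning combination in `best_score_`, which is usually a fairer estimate. The sketch below shows how to read it, using synthetic stand-in data and a deliberately small, illustrative grid (both are assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 4 numeric features, binary target.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)

# Small illustrative grid; the notebook's grid is much larger.
param_grid = {'max_depth': [2, 4, 6], 'n_estimators': [10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=12345),
                      param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print(search.best_params_)  # winning hyperparameter combination
print(search.best_score_)   # mean cross-validated accuracy of that combination
```

In the notebook, `RandForestOpt.best_score_` and `DecTreeModOpt.best_score_` would give the corresponding values for the fitted searches.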

In [None]:
# Accuracy of the tuned random forest on the held-out test set
features_test_accuracy = features_test
predictions_test_accuracy = RandForestOpt.predict(features_test_accuracy)
quality = accuracy_score(target_test, predictions_test_accuracy)
quality

0.7884914463452566

In [None]:
# Accuracy of the tuned decision tree on the held-out test set
features_test_accuracy = features_test
predictions_test_accuracy = DecTreeModOpt.predict(features_test_accuracy)
quality = accuracy_score(target_test, predictions_test_accuracy)
quality

0.7884914463452566

In [None]:
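# Recall for the Ultra class (is_ultra == 1) of the tuned random forest on the test set.
# scikit-learn metrics take the true labels first and the predictions second.
from sklearn.metrics import recall_score
recall = recall_score(target_test, RandForestOpt.predict(features_test))
recall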

0.46938775510204084

In [None]:
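# Recall for the Ultra class (is_ultra == 1) of the tuned decision tree on the test set.
# scikit-learn metrics take the true labels first and the predictions second.
from sklearn.metrics import recall_score
recall = recall_score(target_test, DecTreeModOpt.predict(features_test))
recall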

0.4642857142857143
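
Argument order matters for these metrics: scikit-learn's `precision_score` and `recall_score` expect the true labels first and the predictions second. The plain-Python sketch below uses hand-written `precision` and `recall` helpers and made-up labels (all hypothetical) to show that swapping the arguments of precision yields recall.

```python
def precision(y_true, y_pred):
    """Share of predicted positives that are truly positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(1 for p in y_pred if p == 1)
    return tp / predicted_pos


def recall(y_true, y_pred):
    """Share of true positives that the model found."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    return tp / actual_pos


# Hypothetical toy labels: 4 actual positives, 3 predicted positives, 2 hits.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

print(precision(y_true, y_pred))  # 2 / 3
print(recall(y_true, y_pred))     # 0.5
print(precision(y_pred, y_true))  # 0.5, identical to recall
```

A value near 0.47 for the Ultra class therefore means something different depending on which of the two metrics it is, so it is worth confirming the argument order before interpreting it.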