Problem Statement
Mobile carrier Megaline has found out that many of their subscribers use legacy plans.
They want to develop a model that would analyze subscribers' behavior and recommend
one of Megaline's newer plans: Smart or Ultra.
You have access to behavior data about subscribers who have already switched to the
new plans (from the project for the Statistical Data Analysis course). For this
classification task, you need to develop a model that will pick the right plan. Since you’ve
already performed the data preprocessing step, you can move straight to creating the
model.
Develop a model with the highest possible accuracy. In this project, the threshold for
accuracy is 0.75. Check the accuracy using the test dataset.
1. Open and look through the data file.
2. Split the source data into a training set, a validation set, and a test set.
3. Investigate the quality of different models by changing hyperparameters. Briefly
describe the findings of the study.
4. Check the quality of the model using the test set.
5. Additional task: sanity check the model. This data is more complex than what
you’re used to working with, so it's not an easy task. We'll take a closer look at it
later.


In [35]:
#Open and look through the data file.
#reading the data
import pandas as pd
df = pd.read_csv('https://bit.ly/UsersBehaviourTelco')

In [36]:
#Exploring the data
df.head(4)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1


In [37]:
df.shape

(3214, 5)

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [39]:
df.describe()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
count,3214.0,3214.0,3214.0,3214.0,3214.0
mean,63.038892,438.208787,38.281269,17207.673836,0.306472
std,33.236368,234.569872,36.148326,7570.968246,0.4611
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.575,9.0,12491.9025,0.0
50%,62.0,430.6,30.0,16943.235,0.0
75%,82.0,571.9275,57.0,21424.7,1.0
max,244.0,1632.06,224.0,49745.73,1.0


In [54]:
#Splitting the data

from sklearn.model_selection import train_test_split

df_features = df.drop(["is_ultra"], axis=1)
df_2 = df["is_ultra"]

train_ratio = 0.75
validation_ratio = 0.15
test_ratio = 0.10

# hold out 25% of the data (validation + test); the remaining 75% is the training set
x_train, x_test, y_train, y_test = train_test_split(df_features, df_2, test_size=1 - train_ratio)

# split the held-out 25% again:
# validation ends up with 15% and test with 10% of the initial data set
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio))
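
The split above draws a new random partition on every run, which may explain why the accuracies printed in different cells do not agree. A minimal sketch of a reproducible, stratified 75/15/10 split follows. It assumes a synthetic stand-in frame with the same column names (the values are random, not Megaline data), and `random_state=42` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Megaline frame (assumption: same columns, 3214 rows).
rng = np.random.default_rng(0)
n_rows = 3214
demo = pd.DataFrame({
    "calls": rng.integers(0, 245, n_rows).astype(float),
    "minutes": rng.uniform(0, 1632, n_rows),
    "messages": rng.integers(0, 225, n_rows).astype(float),
    "mb_used": rng.uniform(0, 49745, n_rows),
    "is_ultra": (rng.random(n_rows) < 0.31).astype(int),
})

features = demo.drop(columns=["is_ultra"])
target = demo["is_ultra"]

# First cut: 75% train, 25% held out. stratify keeps the Smart/Ultra ratio in both parts.
x_train, x_rest, y_train, y_rest = train_test_split(
    features, target, test_size=0.25, random_state=42, stratify=target
)
# Second cut: 40% of the held-out part becomes test (10% overall),
# the other 60% becomes validation (15% overall).
x_val, x_test, y_val, y_test = train_test_split(
    x_rest, y_rest, test_size=0.4, random_state=42, stratify=y_rest
)

split_sizes = {"train": len(x_train), "val": len(x_val), "test": len(x_test)}
print(split_sizes)
```

Fixing `random_state` makes every later cell see the same rows, so accuracies from separate cells become comparable.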





In [55]:
#Testing using different models
#models - RandomForest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

model_1 = RandomForestClassifier(random_state=222, max_depth=3, n_estimators=3)
model_1.fit(x_train,y_train)
pred_1 = model_1.predict(x_test)

acc_1 = accuracy_score(y_test,pred_1)

print('Accuracy: Random_Forest: %.3f' % acc_1)

#After altering the values of max_depth and n_estimators, the highest accuracy attained (measured on x_test) was 81.4%

Accuracy: Random_Forest: 0.814
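
The manual search over max_depth and n_estimators can also be written as a loop that scores every combination on the validation set, so the test set stays untouched until the final check. This is only a sketch: the data is a synthetic stand-in from `make_classification`, and the grid values (depths 1 to 7; 3, 10, or 30 trees) are assumptions, not the values tried in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 4 features, roughly 70/30 class balance.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)
x_train, x_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

best_params, best_acc = None, 0.0
for depth in range(1, 8):            # hypothetical grid of tree depths
    for n_trees in (3, 10, 30):      # hypothetical grid of forest sizes
        model = RandomForestClassifier(
            random_state=222, max_depth=depth, n_estimators=n_trees
        )
        model.fit(x_train, y_train)
        acc = accuracy_score(y_val, model.predict(x_val))
        if acc > best_acc:
            best_params, best_acc = (depth, n_trees), acc

print("best (max_depth, n_estimators):", best_params)
print("validation accuracy: %.3f" % best_acc)
```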


In [56]:
#models - DecisionTree
from sklearn.tree import DecisionTreeClassifier
model_2 = DecisionTreeClassifier(random_state=111,max_depth = 4)
model_2.fit(x_train,y_train)
pred_2 = model_2.predict(x_test)

acc_2 = accuracy_score(y_test,pred_2)

print('Accuracy: Decision Tree: %.3f' % acc_2)

#Setting the max_depth of the decision tree to 4 gave the highest accuracy

Accuracy: Decision Tree: 0.783


In [57]:
#models - LogisticRegression
from sklearn.linear_model import LogisticRegression
model_3 = LogisticRegression(random_state=12345, solver='liblinear')
model_3.fit(x_train,y_train)
pred_3 = model_3.predict(x_test)

acc_3 = accuracy_score(y_test,pred_3)

print('Accuracy: Logistic Regression: %.3f' % acc_3)

#Accuracy attained on x_test: 73.6%

Accuracy: Logistic Regression: 0.736


In [17]:
print('Accuracy: Random_Forest: %.3f' % acc_1)
print('Accuracy: Decision Tree: %.3f' % acc_2)
print('Accuracy: Logistic Regression: %.3f' % acc_3)

Accuracy: Random_Forest: 0.786
Accuracy: Decision Tree: 0.776
Accuracy: Logistic Regression: 0.699


In [49]:
#cross validation scores for the three models
from sklearn.model_selection import cross_val_score
score_1 = cross_val_score(model_1, x_train, y_train, cv=5)
score_2 = cross_val_score(model_2, x_train, y_train, cv=5)
score_3 = cross_val_score(model_3, x_train, y_train, cv=5)


print('Cross Validation Score: Random_Forest: %.3f' % score_1.mean())
print('Cross Validation Score: Decision Tree: %.3f' % score_2.mean())
print('Cross Validation Score: Logistic Regression: %.3f' % score_3.mean())

Cross Validation Score: Random_Forest: 0.791
Cross Validation Score: Decision Tree: 0.790
Cross Validation Score: Logistic Regression: 0.712
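
A mean score alone hides how much the folds disagree. The sketch below reports the mean together with the standard deviation across folds. The data is synthetic (an assumption), and the shuffled `StratifiedKFold` with `random_state=0` is one possible choice rather than the setup used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), not the Megaline file.
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Shuffled, stratified folds with a fixed seed so the scores are reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=111, max_depth=4),
    "Logistic Regression": LogisticRegression(random_state=12345, solver="liblinear"),
}

cv_summary = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=cv)  # one accuracy per fold
    cv_summary[name] = (scores.mean(), scores.std())
    print("Cross Validation Score: %s: %.3f +/- %.3f"
          % (name, scores.mean(), scores.std()))
```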


In [60]:
#Using validation data - all three models

model_1 = RandomForestClassifier(random_state=222, max_depth=3, n_estimators=3)
model_1.fit(x_train,y_train)
pred_1 = model_1.predict(x_val)

acc_1 = accuracy_score(y_val,pred_1)

print('Accuracy: Random_Forest: %.3f' % acc_1)

model_2 = DecisionTreeClassifier(random_state=111,max_depth = 4)
model_2.fit(x_train,y_train)
pred_2 = model_2.predict(x_val)

acc_2 = accuracy_score(y_val,pred_2)

print('Accuracy: Decision Tree: %.3f' % acc_2)

model_3 = LogisticRegression(random_state=12345, solver='liblinear')
model_3.fit(x_train,y_train)
pred_3 = model_3.predict(x_val)

acc_3 = accuracy_score(y_val,pred_3)

print('Accuracy: Logistic Regression: %.3f' % acc_3)

Accuracy: Random_Forest: 0.822
Accuracy: Decision Tree: 0.805
Accuracy: Logistic Regression: 0.766


In [None]:
#The highest accuracy on the validation data was 82.2%, obtained with the random forest algorithm with max_depth set to 3 and n_estimators set to 3
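
For the sanity check in step 5, a useful reference point is a model that always predicts the majority class. Since the mean of is_ultra is 0.306472, always answering Smart would be right about 69.4% of the time on the full data, so a useful model has to beat that figure as well as the 0.75 threshold. A minimal sketch of the comparison, assuming synthetic stand-in data with a similar class balance (the printed numbers will not match the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): roughly 70/30 class balance.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(x_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(x_test))

# Candidate: the random forest configuration chosen in this notebook.
forest = RandomForestClassifier(random_state=222, max_depth=3, n_estimators=3)
forest.fit(x_train, y_train)
forest_acc = accuracy_score(y_test, forest.predict(x_test))

print("Accuracy: Baseline (most frequent): %.3f" % baseline_acc)
print("Accuracy: Random_Forest: %.3f" % forest_acc)
```

If the forest's test accuracy is not clearly above the baseline, the model is adding little beyond the class imbalance.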