# Introduction

Our task is to build a classifier that predicts a user's phone plan (Standard or Ultra) from their usage behavior as accurately as possible. We compare a decision tree, a random forest, and a logistic regression, and report the results below.

# Models

In [1]:
import warnings
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
warnings.filterwarnings("ignore")

# discard a candidate if its training and validation accuracies differ by 0.05 (5 percentage points) or more
TRAIN_VALID_DELTA_THRESHOLD = 0.05

# array indices
ACC_TRAIN = 0
ACC_VALID = 1
DEPTH = 2
N_ESTIMATORS = 2
SOLVER = 2

In [3]:
users_data = pd.read_csv("/datasets/users_behavior.csv")

display(users_data.head())
display(users_data.info())  # info() prints its summary and returns None, so display() also shows "None"

data_train_model, data_test = train_test_split(users_data, test_size=0.2, random_state=54321)
data_train, data_valid = train_test_split(data_train_model, test_size=0.25, random_state=54321)

# For hyperparameter tuning
features_train = data_train.drop(["is_ultra"], axis=1)
target_train = data_train["is_ultra"]

features_valid = data_valid.drop(["is_ultra"], axis=1)
target_valid = data_valid["is_ultra"]

# For actual training
features_train_model = data_train_model.drop(["is_ultra"], axis=1)
target_train_model = data_train_model["is_ultra"]

features_test = data_test.drop(["is_ultra"], axis=1)
target_test = data_test["is_ultra"]

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


None

### Notes

We split the data into training, validation, and testing sets at a 3:1:1 ratio.
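
The 3:1:1 ratio follows from the two `test_size` values used above: 20% of the rows are held out for testing, then 25% of the remaining 80% (another 20% of the total) for validation. A minimal sketch of that arithmetic, assuming each held-out fraction is rounded up to a whole number of rows; the helper name `split_sizes` is hypothetical and not part of the notebook:

```python
import math

def split_sizes(n_rows, test_size, valid_size):
    """Return (train, valid, test) row counts for a two-stage split.

    Assumption: each held-out fraction is rounded up to a whole row,
    and the remainder stays in the larger part.
    """
    n_test = math.ceil(test_size * n_rows)
    n_rest = n_rows - n_test
    n_valid = math.ceil(valid_size * n_rest)
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test

sizes = split_sizes(3214, test_size=0.2, valid_size=0.25)
fractions = tuple(round(s / 3214, 2) for s in sizes)
print(sizes)
print(fractions)
```

Under that rounding assumption, the validation and test sets come out the same size, and the training set is three times as large.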

In [4]:
# Baseline
ultra_ratio = target_test.mean()  # share of test rows labelled Ultra (is_ultra == 1)
most_frequent = "Standard" if ultra_ratio < 0.5 else "Ultra"

print("Most frequent category is", most_frequent)

# Constant prediction: always Standard (0), the majority class found above
guesses_non_ultra = np.zeros_like(target_test)
accuracy_constant = accuracy_score(target_test, guesses_non_ultra)
print("'Choose most frequent' accuracy =", accuracy_constant)

Most frequent category is Standard
'Choose most frequent' accuracy = 0.6640746500777605


### Notes

We determine which category occurs most often and always predict it. This constant guess already achieves 66.4% accuracy on the test set, well above the 50% a coin flip would be expected to score.

Any model worth keeping must beat this baseline, and our target is an accuracy above 75%.
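
The "choose most frequent" baseline described above can be sketched without any modelling library. The class counts below (427 Standard, 216 Ultra) are an assumption: they are inferred from the reported accuracy and a test set of 643 rows, not read from the data. The helper name `majority_baseline_accuracy` is hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# Assumed counts: 427 Standard (0) and 216 Ultra (1) rows, inferred from
# the reported accuracy on a 643-row test set, not read from the data.
labels = [0] * 427 + [1] * 216
label, accuracy = majority_baseline_accuracy(labels)
print(label, round(accuracy, 4))
```

Any classifier that scores below this figure is doing worse than ignoring the features altogether.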

In [5]:
# Decision Tree
best_tree = [0, 0, 0] # accuracy_train, accuracy_valid, depth
for depth in range(1,10):
    dtc = DecisionTreeClassifier(max_depth=depth, random_state=54321)
    dtc.fit(features_train, target_train)
    dtc_preds_train = dtc.predict(features_train)    
    accuracy_train = accuracy_score(target_train, dtc_preds_train)
    dtc_preds_valid = dtc.predict(features_valid)
    accuracy_valid = accuracy_score(target_valid, dtc_preds_valid)
    if abs(accuracy_valid - accuracy_train) < TRAIN_VALID_DELTA_THRESHOLD and accuracy_valid > best_tree[ACC_VALID]:
        best_tree[ACC_TRAIN] = accuracy_train
        best_tree[ACC_VALID] = accuracy_valid
        best_tree[DEPTH] = depth
        
print("Best depth =", best_tree[DEPTH])
print("Best acc_train =", best_tree[ACC_TRAIN])
print("Best acc_valid =", best_tree[ACC_VALID])

Best depth = 5
Best acc_train = 0.8246887966804979
Best acc_valid = 0.8180404354587869


### Notes

After searching depths 1 through 9, we find that a tree of depth 5 achieves a validation accuracy of 81.8%, and it trains very quickly. Further, the training and validation accuracies differ by less than 1 percentage point.

This already exceeds the 75% goal.
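
The selection rule used in the search loop (keep a candidate only if its train/validation gap is under the threshold, then prefer the higher validation accuracy) can be sketched as a standalone function. The name `pick_best` and the depth-3 and depth-9 rows are hypothetical, included only for illustration; only the depth-5 row reflects the output above.

```python
def pick_best(candidates, max_gap=0.05):
    """Pick the candidate with the highest validation accuracy among
    those whose train/validation gap is below max_gap.

    candidates: iterable of (param, acc_train, acc_valid) tuples.
    Returns the winning tuple, or None if every candidate is rejected.
    """
    best = None
    for param, acc_train, acc_valid in candidates:
        if abs(acc_valid - acc_train) >= max_gap:
            continue  # gap too large: discard this candidate
        if best is None or acc_valid > best[2]:
            best = (param, acc_train, acc_valid)
    return best

# Hypothetical results; only the depth-5 row is taken from the notebook output.
results = [
    (3, 0.80, 0.79),
    (5, 0.8247, 0.8180),
    (9, 0.90, 0.83),  # highest validation score, but a gap of 0.07
]
print(pick_best(results))
```

In this sketch the depth-9 candidate is rejected despite its higher validation score, because its gap exceeds the threshold.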

In [6]:
# Random Forest
best_forest = [0, 0, 0] # accuracy_train, accuracy_valid, n_estimators
for n_estimators in range(5, 51):
    rfc = RandomForestClassifier(max_depth=10, n_estimators=n_estimators, random_state=54321)
    rfc.fit(features_train, target_train)
    accuracy_train = rfc.score(features_train, target_train)
    accuracy_valid = rfc.score(features_valid, target_valid)
    if abs(accuracy_valid - accuracy_train) < TRAIN_VALID_DELTA_THRESHOLD and accuracy_valid > best_forest[ACC_VALID]:
        best_forest[ACC_TRAIN] = accuracy_train
        best_forest[ACC_VALID] = accuracy_valid
        best_forest[N_ESTIMATORS] = n_estimators
        
print("Best n_estimators =", best_forest[N_ESTIMATORS])
print("Best acc_train =", best_forest[ACC_TRAIN])
print("Best acc_valid =", best_forest[ACC_VALID])

Best n_estimators = 6
Best acc_train = 0.8760373443983402
Best acc_valid = 0.8304821150855366


### Notes

With random forests (trees capped at a depth of 10), we find that 6 estimators give an even higher validation accuracy of 83.0%, with a gap of less than 5 percentage points from the training accuracy (87.6%).

In [7]:
# Logistic Regression
solvers = ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']

best_log_reg = [0, 0, ''] # accuracy_train, accuracy_valid, solver
for solver in solvers:
    lr = LogisticRegression(random_state=54321, solver=solver)
    lr.fit(features_train, target_train)
    accuracy_train = lr.score(features_train, target_train)
    accuracy_valid = lr.score(features_valid, target_valid)
    if abs(accuracy_valid - accuracy_train) < TRAIN_VALID_DELTA_THRESHOLD and accuracy_valid > best_log_reg[ACC_VALID]:
        best_log_reg[ACC_TRAIN] = accuracy_train
        best_log_reg[ACC_VALID] = accuracy_valid
        best_log_reg[SOLVER] = solver

        
print("Best solver =", best_log_reg[SOLVER])
print("Best acc_train =", best_log_reg[ACC_TRAIN])
print("Best acc_valid =", best_log_reg[ACC_VALID])

Best solver = newton-cg
Best acc_train = 0.7562240663900415
Best acc_valid = 0.7776049766718507


### Notes

The worst performer is the logistic regression, whose best solver (newton-cg) yields a tolerable 77.8% validation accuracy and, oddly, a higher validation accuracy than training accuracy (75.6%). It is above our 75% target, but still the lowest of the three models.
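
To compare the train/validation gaps quoted in the three notes side by side, the printed accuracies can be tabulated directly. This is only arithmetic on the figures shown in the outputs above; the dictionary name `reported` is an assumption for illustration.

```python
# (acc_train, acc_valid) as printed by the three search cells
reported = {
    "decision tree": (0.8246887966804979, 0.8180404354587869),
    "random forest": (0.8760373443983402, 0.8304821150855366),
    "logistic regression": (0.7562240663900415, 0.7776049766718507),
}

# Positive gap: training accuracy is higher; negative: validation is higher
gaps = {name: round(train - valid, 4) for name, (train, valid) in reported.items()}
for name, gap in gaps.items():
    print(f"{name}: train - valid = {gap:+.4f}")
```

All three gaps are smaller than the 0.05 threshold in absolute value, and only the logistic regression has a negative one.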

In [8]:
# Fit the best of each of the three models
# DECISION TREE
dtc = DecisionTreeClassifier(max_depth=5, random_state=54321)
dtc.fit(features_train_model, target_train_model)
dtc_preds = dtc.predict(features_test)
accuracy_tree = accuracy_score(target_test, dtc_preds)

print("Accuracy Best Tree = ", accuracy_tree)

# RANDOM FOREST
rfc = RandomForestClassifier(max_depth=10, n_estimators=6, random_state=54321)
rfc.fit(features_train_model, target_train_model)
accuracy_forest = rfc.score(features_test, target_test)

print("Accuracy Best Forest = ", accuracy_forest)

# LOGISTIC REGRESSION
lr = LogisticRegression(solver="newton-cg", random_state=54321)
lr.fit(features_train_model, target_train_model)
accuracy_log_reg = lr.score(features_test, target_test)

print("Accuracy Best Logistic Regression = ", accuracy_log_reg)

Accuracy Best Tree =  0.7713841368584758
Accuracy Best Forest =  0.7651632970451011
Accuracy Best Logistic Regression =  0.7076205287713841


# Conclusion

The Logistic Regression fails to live up to expectations: its accuracy drops to 70.8% on the test set, below the 75% target, which takes it out of the running altogether.

On the test set, we find that the highest performer is the Decision Tree model with an accuracy of 77.1% at a depth of 5. However, it only slightly outperforms the Random Forest model (76.5%), by about 0.6 percentage points.

Still, our goal is to find the model with the highest accuracy, and on this test set that is the Decision Tree.
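
To put the margin between the tree and the forest in concrete terms, the test accuracies can be converted back into counts of correctly classified rows. This sketch assumes the test set holds 643 rows (20% of 3214, rounded up); the accuracies are those printed above.

```python
N_TEST = 643  # assumed: 20% of 3214 rows, rounded up

test_accuracy = {
    "decision tree": 0.7713841368584758,
    "random forest": 0.7651632970451011,
    "logistic regression": 0.7076205287713841,
}

# Number of test rows each model classified correctly
correct = {name: round(acc * N_TEST) for name, acc in test_accuracy.items()}
margin = correct["decision tree"] - correct["random forest"]
print(correct)
print("tree beats forest by", margin, "test rows")
```

Under that assumption, the decision tree's lead over the random forest amounts to a handful of test rows.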