### Step 1. Open the data file

In [30]:
import pandas as pd

from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

import warnings
from IPython.display import display

try:
    df = pd.read_csv('../datasets/users_behavior.csv')
    display(df.sample(5))
except FileNotFoundError:
    # catch only the missing-file case so unrelated errors are not hidden
    print("Data not found")

      calls  minutes  messages   mb_used  is_ultra
921    93.0   722.64       0.0  14198.92         0
1342   43.0   326.98       0.0   7331.52         0
2804   60.0   370.47      27.0   15877.8         0
2143   32.0    232.3      63.0   11409.2         0
2425   21.0   100.66       5.0   8252.13         0


The data describes the call, SMS, and internet usage of Ultra and non-Ultra users.

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


All columns have an appropriate type, and there are no null values.

In [32]:
df[df.duplicated()].count()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

No duplicate data found.
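
The null and duplicate checks above can also be folded into one small helper. The following is a minimal sketch on a made-up toy frame; the function name `quality_report` and the toy values are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> dict:
    """Summarize row count, missing cells, and fully duplicated rows."""
    return {
        "rows": len(frame),
        "null_cells": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Hypothetical toy data: the third row repeats the first, the fourth has a missing value.
toy = pd.DataFrame({
    "calls": [40.0, 85.0, 40.0, None],
    "minutes": [311.9, 516.75, 311.9, 120.0],
    "is_ultra": [0, 0, 0, 1],
})

report = quality_report(toy)
print(report)
```

On the real dataset, both `null_cells` and `duplicate_rows` would be expected to come out as 0, in line with the outputs shown above.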

### Step 2. Data splitting

I will split the raw data into training, validation, and test sets using a common 3:1:1 ratio (60% / 20% / 20%).

In [33]:
# split data into training (60%), validation (20%), and test (20%) sets
df_train, df_valid_test = train_test_split(df, test_size=0.4, random_state=12345)
df_valid, df_test = train_test_split(df_valid_test, test_size=0.5, random_state=12345)
print("Size of training dataset:",len(df_train))
print("Size of validation dataset:",len(df_valid))
print("Size of testing dataset:",len(df_test))

Size of training dataset: 1928
Size of validation dataset: 643
Size of testing dataset: 643
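
To see why `test_size=0.4` followed by `test_size=0.5` gives a 3:1:1 split, the arithmetic can be reproduced without the data file. This sketch uses a plain array of 3214 row indices as a stand-in for the DataFrame; the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 3214 rows reported by df.info()
rows = np.arange(3214)

# First split: 60% training, 40% held out
train, rest = train_test_split(rows, test_size=0.4, random_state=12345)
# Second split: the held-out 40% is halved into validation and test
valid, test = train_test_split(rest, test_size=0.5, random_state=12345)

sizes = (len(train), len(valid), len(test))
shares = [round(size / len(rows), 2) for size in sizes]
print(sizes, shares)
```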


### Step 2. Tuning Hyperparameters

I will begin by declaring variables for the features and the target of each dataset. The target of the model will be 'is_ultra' because we are trying to recommend one of the newer plans: Smart or Ultra. Because the target is a category rather than a quantity, we need a model for a classification problem.

In [34]:
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

Now I will tune the hyperparameters of the decision tree, random forest, and logistic regression models and compare their accuracy on the validation set.

#### Decision Tree

In [35]:
# tuning max_depth, min_samples_split, and min_samples_leaf
for depth in range(1,10):
    for samples_split in range(2,72,10):
        for leaf_samples in range(2,72,10):
            model = DecisionTreeClassifier(random_state=12345, max_depth=depth,min_samples_split=samples_split,min_samples_leaf=leaf_samples)
            model.fit(features_train, target_train)
            predictions_valid = model.predict(features_valid)
            accuracy = accuracy_score(target_valid, predictions_valid)
            print("max_depth =", depth, ", ", end='')
            print("min_samples_split =", samples_split, ", ", end='')
            print("min_samples_leaf =", leaf_samples, ", ", end='')
            print("accuracy =","{:.2f}".format(accuracy))

max_depth = 1 , min_samples_split = 2 , min_samples_leaf = 2 , accuracy = 0.75
max_depth = 1 , min_samples_split = 2 , min_samples_leaf = 12 , accuracy = 0.75
max_depth = 1 , min_samples_split = 2 , min_samples_leaf = 22 , accuracy = 0.75
max_depth = 1 , min_samples_split = 2 , min_samples_leaf = 32 , accuracy = 0.75
max_depth = 1 , min_samples_split = 2 , min_samples_leaf = 42 , accuracy = 0.75
max_depth = 1 , min_samples_split = 2 , min_samples_leaf = 52 , accuracy = 0.75
max_depth = 1 , min_samples_split = 2 , min_samples_leaf = 62 , accuracy = 0.75
max_depth = 1 , min_samples_split = 12 , min_samples_leaf = 2 , accuracy = 0.75
max_depth = 1 , min_samples_split = 12 , min_samples_leaf = 12 , accuracy = 0.75
max_depth = 1 , min_samples_split = 12 , min_samples_leaf = 22 , accuracy = 0.75
max_depth = 1 , min_samples_split = 12 , min_samples_leaf = 32 , accuracy = 0.75
max_depth = 1 , min_samples_split = 12 , min_samples_leaf = 42 , accuracy = 0.75
max_depth = 1 , min_samples_split = 1

Many parameter combinations reach the highest accuracy. The highest accuracy achieved is 79%.
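
Rather than scanning the printed log by eye, the loop can keep track of the best score and every combination that reaches it. The sketch below runs the same grid on synthetic data from `make_classification`, which is an assumption made so the example is self-contained; its scores will not match the ones in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the four usage features (not the real dataset)
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

best_accuracy = 0.0
best_params = []  # every combination that ties for the best score

for depth in range(1, 10):
    for samples_split in range(2, 72, 10):
        for leaf_samples in range(2, 72, 10):
            params = {"max_depth": depth,
                      "min_samples_split": samples_split,
                      "min_samples_leaf": leaf_samples}
            model = DecisionTreeClassifier(random_state=12345, **params)
            model.fit(X_train, y_train)
            accuracy = accuracy_score(y_valid, model.predict(X_valid))
            if accuracy > best_accuracy:
                best_accuracy, best_params = accuracy, [params]
            elif accuracy == best_accuracy:
                best_params.append(params)

print("best accuracy: {:.2f}".format(best_accuracy))
print("combinations reaching it:", len(best_params))
```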

#### Random Forest

In [36]:
for estimators in range(6,43,6):
    for depth in range(1,10):
        for samples_split in range(5,15):
            model = RandomForestClassifier(random_state=12345, n_estimators=estimators, max_depth=depth, min_samples_split=samples_split)
            model.fit(features_train, target_train)
            predictions_valid = model.predict(features_valid)
            accuracy = accuracy_score(target_valid, predictions_valid)
            print("n_estimators =", estimators, ", ", end='')
            print("max_depth = ", depth, ", ", end='')
            print("min_samples_split = ", samples_split, ", ", end='')
            print("accuracy =","{:.2f}".format(accuracy))

n_estimators = 6 , max_depth =  1 , min_samples_split =  5 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  6 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  7 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  8 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  9 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  10 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  11 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  12 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  13 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  14 , accuracy = 0.76
n_estimators = 6 , max_depth =  2 , min_samples_split =  5 , accuracy = 0.77
n_estimators = 6 , max_depth =  2 , min_samples_split =  6 , accuracy = 0.77
n_estimators = 6 , max_depth =  2 , min_samples_split =  7 , accuracy =

The model achieves its highest accuracy with n_estimators of 36, a maximum depth of 7, and a minimum samples split of 5. The highest accuracy achieved is 81%.
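
The nested loops can also be expressed with scikit-learn's `ParameterGrid`, which enumerates every combination as a dictionary. This is a minimal sketch on synthetic data with a deliberately reduced grid; the grid values and the data are assumptions chosen to keep the example quick, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in data (not the real dataset)
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Reduced, illustrative grid
grid = ParameterGrid({
    "n_estimators": [6, 12],
    "max_depth": [3, 5, 7],
    "min_samples_split": [5, 10],
})

scores = []
for params in grid:
    model = RandomForestClassifier(random_state=12345, **params)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_valid, model.predict(X_valid))
    scores.append((accuracy, params))

# Pick the combination with the highest validation accuracy
best_accuracy, best_params = max(scores, key=lambda item: item[0])
print("best accuracy: {:.2f}".format(best_accuracy), best_params)
```

Selecting on the validation set, as here, keeps the test set untouched until the final check.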

#### Logistic Regression

In [37]:
warnings.filterwarnings("ignore")

solvers = ['newton-cg', 'lbfgs','liblinear','sag','saga']
# None means no class weighting; recent scikit-learn versions reject an empty string
weights = [None,'balanced']

for solver in solvers:
    for weight in weights:
        # named max_iter to avoid shadowing the built-in iter()
        for max_iter in range(25,226,25):
            model = LogisticRegression(random_state=12345, solver=solver,class_weight=weight,max_iter=max_iter)
            model.fit(features_train, target_train)
            predictions_valid = model.predict(features_valid)
            accuracy = accuracy_score(target_valid, predictions_valid)
            print("solver =", solver, ", ", end='')
            print("class_weight =", weight, ", ", end='')
            print("max_iter =", max_iter, ", ", end='')
            print("accuracy =","{:.2f}".format(accuracy))

solver = newton-cg , class_weight =  , max_iter = 25 , accuracy = 0.76
solver = newton-cg , class_weight =  , max_iter = 50 , accuracy = 0.76
solver = newton-cg , class_weight =  , max_iter = 75 , accuracy = 0.76
solver = newton-cg , class_weight =  , max_iter = 100 , accuracy = 0.76
solver = newton-cg , class_weight =  , max_iter = 125 , accuracy = 0.76
solver = newton-cg , class_weight =  , max_iter = 150 , accuracy = 0.76
solver = newton-cg , class_weight =  , max_iter = 175 , accuracy = 0.76
solver = newton-cg , class_weight =  , max_iter = 200 , accuracy = 0.76
solver = newton-cg , class_weight =  , max_iter = 225 , accuracy = 0.76
solver = newton-cg , class_weight = balanced , max_iter = 25 , accuracy = 0.59
solver = newton-cg , class_weight = balanced , max_iter = 50 , accuracy = 0.62
solver = newton-cg , class_weight = balanced , max_iter = 75 , accuracy = 0.62
solver = newton-cg , class_weight = balanced , max_iter = 100 , accuracy = 0.62
solver = newton-cg , class_weight = ba

The model achieves its highest accuracy with the newton-cg solver, no class weighting, and any of the tested max_iter values. The highest accuracy achieved is 76%.
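
One plausible reading of the lower accuracy with `class_weight='balanced'` is that balancing shifts predictions toward the minority class, which tends to raise recall for that class at the cost of overall accuracy when classes are imbalanced. The sketch below compares the two settings on synthetic data with an assumed 70/30 class split; it illustrates the trade-off only and does not reproduce the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumed 70/30 split, not the real dataset)
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.7, 0.3],
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

results = {}
for weight in [None, "balanced"]:
    model = LogisticRegression(random_state=12345, solver="lbfgs",
                               class_weight=weight, max_iter=200)
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)
    results[str(weight)] = {
        "accuracy": accuracy_score(y_valid, predictions),
        # recall for the minority class (label 1)
        "recall_class_1": recall_score(y_valid, predictions),
    }

for name, metrics in results.items():
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```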

### Step 3. Test the Model

We will use the tuned parameters that achieved the highest validation accuracy with the random forest model.

In [38]:
model = RandomForestClassifier(random_state=12345, n_estimators=36,max_depth=7,min_samples_split=5)
model.fit(features_train, target_train)

predictions_train = model.predict(features_train)
accuracy_train = accuracy_score(target_train,predictions_train)

predictions_test = model.predict(features_test)
accuracy_test = accuracy_score(target_test,predictions_test)

print("Model accuracy on training set: {:.2f}".format(accuracy_train))
print("Model accuracy on test set: {:.2f}".format(accuracy_test))

Model accuracy on training set: 0.85
Model accuracy on test set: 0.80


The model achieves 80% accuracy on the test set, above the 75% mark, and a somewhat higher 85% on the training set, which is expected because that is the data the model was trained on.

### Step 4. Sanity Check

In [40]:
model = DummyClassifier()
model.fit(features_train, target_train)

predictions_train = model.predict(features_train)
accuracy_train = accuracy_score(target_train,predictions_train)

predictions_test = model.predict(features_test)
accuracy_test = accuracy_score(target_test,predictions_test)

print("Dummy model accuracy on training set: {:.2f}".format(accuracy_train))
print("Dummy model accuracy on test set: {:.2f}".format(accuracy_test))

Dummy model accuracy on training set: 0.69
Dummy model accuracy on test set: 0.68
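
A dummy baseline that always predicts the most frequent class scores exactly the share of that class in the data it is evaluated on. The sketch below shows this on made-up labels with an assumed 69/31 split; it sets `strategy="most_frequent"` explicitly, whereas the notebook relies on the library default, which has changed between scikit-learn versions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 69 samples of class 0 and 31 of class 1
target = np.array([0] * 69 + [1] * 31)
# The dummy model ignores feature values, so a column of zeros is enough
features = np.zeros((len(target), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(features, target)

baseline_accuracy = accuracy_score(target, baseline.predict(features))
majority_share = np.bincount(target).max() / len(target)
print(baseline_accuracy, majority_share)
```

Any real model should beat this figure to be considered useful.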


In [41]:
model = RandomForestClassifier(random_state=12345)
model.fit(features_train, target_train)

predictions_train = model.predict(features_train)
accuracy_train = accuracy_score(target_train,predictions_train)

predictions_test = model.predict(features_test)
accuracy_test = accuracy_score(target_test,predictions_test)

print("Model accuracy on training set: {:.2f}".format(accuracy_train))
print("Model accuracy on test set: {:.2f}".format(accuracy_test))

Model accuracy on training set: 1.00
Model accuracy on test set: 0.79


### Conclusion

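The tuned random forest achieved better test accuracy (0.80) than both the dummy model (0.68) and its untuned counterpart (0.79). This indicates that we chose the correct model type (classification rather than regression) for the problem and that parameter tuning can produce a more accurate model that is less overfitted to the training set: the untuned forest scores 1.00 on the training set but only 0.79 on the test set, while the tuned forest scores 0.85 and 0.80.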
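
The overfitting comparison can be made explicit by computing the gap between training and test accuracy for each model. This is a minimal sketch on noisy synthetic data; the helper name `accuracy_gap` and the data are assumptions for illustration, and the resulting numbers will differ from those in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Noisy synthetic stand-in data (flip_y randomly reassigns about 10% of labels)
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, flip_y=0.1, random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)


def accuracy_gap(model):
    """Fit the model and return (train accuracy, test accuracy, gap)."""
    model.fit(X_train, y_train)
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    return train_accuracy, test_accuracy, train_accuracy - test_accuracy


candidates = {
    "untuned": RandomForestClassifier(random_state=12345),
    "tuned": RandomForestClassifier(random_state=12345, n_estimators=36,
                                    max_depth=7, min_samples_split=5),
}
gaps = {name: accuracy_gap(model) for name, model in candidates.items()}

for name, (train_accuracy, test_accuracy, gap) in gaps.items():
    print("{}: train={:.2f} test={:.2f} gap={:.2f}".format(
        name, train_accuracy, test_accuracy, gap))
```

A smaller gap suggests the model generalizes more consistently, although the test accuracy itself remains the primary figure to compare.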