## Table of Contents

- [1. Data Extraction](#etl)
- [2. Data Preprocessing](#dpp)
    - [2.1 Missing Data](#null)
    - [2.2 Duplicate Data](#dupe)
    - [2.3 Data Splitting](#split)
- [3. Model Training](#train)
    - [3.1 Decision Tree](#dt)
    - [3.2 Random Forest](#rf)
    - [3.3 Logistic Regression](#lr)
- [4. Model Testing](#test)
- [5. Sanity Check](#sanity)
- [Conclusion](#end)

### 1. Data Extraction <a class="anchor" id="etl"></a>

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

import warnings
from IPython.display import display

try:
    df = pd.read_csv('../datasets/users_behavior.csv')
    display(df.sample(5))
except FileNotFoundError:
    print("Data not found")

      calls  minutes  messages   mb_used  is_ultra
1254   37.0   256.66      35.0   8645.57         0
2641   69.0   512.66      14.0  14603.95         0
2075   87.0   578.57      22.0  26271.47         1
218    62.0   365.46     113.0  34224.57         1
176    65.0   424.16      23.0  11043.53         0


The dataset records phone, SMS, and internet usage for Ultra and non-Ultra users; the `is_ultra` column marks which group a user belongs to.

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


### 2. Data Preprocessing <a class="anchor" id="dpp"></a>

#### 2.1 Missing Data <a class="anchor" id="null"></a>

In [3]:
df.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

Every column has an appropriate data type and there are no null values.

#### 2.2 Duplicate Data <a class="anchor" id="dupe"></a>

In [4]:
df[df.duplicated()].count()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

No duplicate data found.

#### 2.3 Data Splitting <a class="anchor" id="split"></a>

I will split the raw data into training, validation, and test sets using a common 3:1:1 ratio (60%, 20%, and 20% of the rows).

In [5]:
# < split data into training, validation, and testing >
df_train, df_valid_test = train_test_split(df, test_size=0.4, random_state=12345)
df_valid, df_test = train_test_split(df_valid_test, test_size=0.5, random_state=12345)
print("Size of training dataset:",len(df_train))
print("Size of validation dataset:",len(df_valid))
print("Size of testing dataset:",len(df_test))

Size of training dataset: 1928
Size of validation dataset: 643
Size of testing dataset: 643
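The two-step split above produces the 3:1:1 ratio because the second call halves the 40% hold-out. Here is a minimal sketch of that arithmetic on a made-up 1000-row frame; `demo` and its column `x` are hypothetical and stand in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset; only the row count matters here.
demo = pd.DataFrame({"x": np.arange(1000)})

# Step 1: hold out 40% of the rows.
demo_train, demo_rest = train_test_split(demo, test_size=0.4, random_state=12345)
# Step 2: split the hold-out in half, giving 0.4 * 0.5 = 0.2 of the original each.
demo_valid, demo_test = train_test_split(demo_rest, test_size=0.5, random_state=12345)

# Share of the original rows that ends up in each part.
shares = [len(part) / len(demo) for part in (demo_train, demo_valid, demo_test)]
print("train / valid / test shares:", shares)
```

The three parts do not overlap, so every row is used exactly once.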


### 3. Model Training <a class="anchor" id="train"></a>

I will begin by declaring variables for the features and the target of each dataset. The target of the model is 'is_ultra' because we are trying to recommend one of the newer plans, Smart or Ultra. Since the target takes one of two values, we need a model for a classification problem.

In [6]:
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

Now I will tune the hyperparameters of the decision tree, random forest, and logistic regression models and compare their accuracy on the validation set.

#### 3.1 Decision Tree <a class="anchor" id="dt"></a>

In [7]:
# tuning max_depth, min_samples_split, and min_samples_leaf
for depth in range(1,10):
    for samples_split in range(2,72,10):
        for leaf_samples in range(2,72,10):
            model = DecisionTreeClassifier(random_state=12345, max_depth=depth,min_samples_split=samples_split,min_samples_leaf=leaf_samples)
            model.fit(features_train, target_train)
            predictions_valid = model.predict(features_valid)
            accuracy = accuracy_score(target_valid, predictions_valid)
            print("max_depth =", depth, ", ", end='')
            print("min_samples_split =", samples_split, ", ", end='')
            print("min_samples_leaf =", leaf_samples, ", ", end='')
            print("accuracy =","{:.2f}".format(accuracy))

max_depth = 1 , min_samples_split = 2 , min_samples_leaf = 2 , accuracy = 0.75
max_depth = 1 , min_samples_split = 2 , min_samples_leaf = 12 , accuracy = 0.75
max_depth = 1 , min_samples_split = 2 , min_samples_leaf = 22 , accuracy = 0.75
max_depth = 1 , min_samples_split = 2 , min_samples_leaf = 32 , accuracy = 0.75
max_depth = 1 , min_samples_split = 2 , min_samples_leaf = 42 , accuracy = 0.75
max_depth = 1 , min_samples_split = 2 , min_samples_leaf = 52 , accuracy = 0.75
max_depth = 1 , min_samples_split = 2 , min_samples_leaf = 62 , accuracy = 0.75
max_depth = 1 , min_samples_split = 12 , min_samples_leaf = 2 , accuracy = 0.75
max_depth = 1 , min_samples_split = 12 , min_samples_leaf = 12 , accuracy = 0.75
max_depth = 1 , min_samples_split = 12 , min_samples_leaf = 22 , accuracy = 0.75
max_depth = 1 , min_samples_split = 12 , min_samples_leaf = 32 , accuracy = 0.75
max_depth = 1 , min_samples_split = 12 , min_samples_leaf = 42 , accuracy = 0.75
max_depth = 1 , min_samples_split = 1

max_depth = 3 , min_samples_split = 12 , min_samples_leaf = 62 , accuracy = 0.78
max_depth = 3 , min_samples_split = 22 , min_samples_leaf = 2 , accuracy = 0.79
max_depth = 3 , min_samples_split = 22 , min_samples_leaf = 12 , accuracy = 0.79
max_depth = 3 , min_samples_split = 22 , min_samples_leaf = 22 , accuracy = 0.79
max_depth = 3 , min_samples_split = 22 , min_samples_leaf = 32 , accuracy = 0.79
max_depth = 3 , min_samples_split = 22 , min_samples_leaf = 42 , accuracy = 0.79
max_depth = 3 , min_samples_split = 22 , min_samples_leaf = 52 , accuracy = 0.79
max_depth = 3 , min_samples_split = 22 , min_samples_leaf = 62 , accuracy = 0.78
max_depth = 3 , min_samples_split = 32 , min_samples_leaf = 2 , accuracy = 0.79
max_depth = 3 , min_samples_split = 32 , min_samples_leaf = 12 , accuracy = 0.79
max_depth = 3 , min_samples_split = 32 , min_samples_leaf = 22 , accuracy = 0.79
max_depth = 3 , min_samples_split = 32 , min_samples_leaf = 32 , accuracy = 0.79
max_depth = 3 , min_samples_sp

max_depth = 5 , min_samples_split = 32 , min_samples_leaf = 52 , accuracy = 0.79
max_depth = 5 , min_samples_split = 32 , min_samples_leaf = 62 , accuracy = 0.78
max_depth = 5 , min_samples_split = 42 , min_samples_leaf = 2 , accuracy = 0.78
max_depth = 5 , min_samples_split = 42 , min_samples_leaf = 12 , accuracy = 0.78
max_depth = 5 , min_samples_split = 42 , min_samples_leaf = 22 , accuracy = 0.78
max_depth = 5 , min_samples_split = 42 , min_samples_leaf = 32 , accuracy = 0.78
max_depth = 5 , min_samples_split = 42 , min_samples_leaf = 42 , accuracy = 0.79
max_depth = 5 , min_samples_split = 42 , min_samples_leaf = 52 , accuracy = 0.79
max_depth = 5 , min_samples_split = 42 , min_samples_leaf = 62 , accuracy = 0.78
max_depth = 5 , min_samples_split = 52 , min_samples_leaf = 2 , accuracy = 0.78
max_depth = 5 , min_samples_split = 52 , min_samples_leaf = 12 , accuracy = 0.78
max_depth = 5 , min_samples_split = 52 , min_samples_leaf = 22 , accuracy = 0.78
max_depth = 5 , min_samples_sp

max_depth = 8 , min_samples_split = 2 , min_samples_leaf = 32 , accuracy = 0.78
max_depth = 8 , min_samples_split = 2 , min_samples_leaf = 42 , accuracy = 0.79
max_depth = 8 , min_samples_split = 2 , min_samples_leaf = 52 , accuracy = 0.79
max_depth = 8 , min_samples_split = 2 , min_samples_leaf = 62 , accuracy = 0.78
max_depth = 8 , min_samples_split = 12 , min_samples_leaf = 2 , accuracy = 0.78
max_depth = 8 , min_samples_split = 12 , min_samples_leaf = 12 , accuracy = 0.77
max_depth = 8 , min_samples_split = 12 , min_samples_leaf = 22 , accuracy = 0.78
max_depth = 8 , min_samples_split = 12 , min_samples_leaf = 32 , accuracy = 0.78
max_depth = 8 , min_samples_split = 12 , min_samples_leaf = 42 , accuracy = 0.79
max_depth = 8 , min_samples_split = 12 , min_samples_leaf = 52 , accuracy = 0.79
max_depth = 8 , min_samples_split = 12 , min_samples_leaf = 62 , accuracy = 0.78
max_depth = 8 , min_samples_split = 22 , min_samples_leaf = 2 , accuracy = 0.77
max_depth = 8 , min_samples_split 

Many parameter combinations tie for the highest accuracy on the validation set. The highest accuracy achieved is 79%.
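With this many printed rows, ties are easier to see if the loop collects its results and reports only the best combinations. The sketch below does that on synthetic data from `make_classification`, which is an assumption for illustration and not the users_behavior dataset; the names `results` and `best_combos` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative only; not the users_behavior dataset).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Same grid as the cell above, but store each result instead of printing it.
results = []
for depth in range(1, 10):
    for samples_split in range(2, 72, 10):
        for leaf_samples in range(2, 72, 10):
            model = DecisionTreeClassifier(random_state=12345, max_depth=depth,
                                           min_samples_split=samples_split,
                                           min_samples_leaf=leaf_samples)
            model.fit(X_train, y_train)
            accuracy = accuracy_score(y_valid, model.predict(X_valid))
            results.append((accuracy, depth, samples_split, leaf_samples))

# Keep every combination that reaches the top validation accuracy.
best_accuracy = max(row[0] for row in results)
best_combos = [row[1:] for row in results if row[0] == best_accuracy]

print("best validation accuracy: {:.4f}".format(best_accuracy))
print("combinations that reach it:", len(best_combos))
```

Comparing unrounded accuracies also shows whether combinations that print as the same two-decimal value are true ties.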

#### 3.2 Random Forest <a class="anchor" id="rf"></a>

In [None]:
for estimators in range(6,43,6):
    for depth in range(1,10):
        for samples_split in range(5,15):
            model = RandomForestClassifier(random_state=12345, n_estimators=estimators, max_depth=depth, min_samples_split=samples_split)
            model.fit(features_train, target_train)
            predictions_valid = model.predict(features_valid)
            accuracy = accuracy_score(target_valid, predictions_valid)
            print("n_estimators =", estimators, ", ", end='')
            print("max_depth = ", depth, ", ", end='')
            print("min_samples_split = ", samples_split, ", ", end='')
            print("accuracy =","{:.2f}".format(accuracy))

n_estimators = 6 , max_depth =  1 , min_samples_split =  5 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  6 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  7 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  8 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  9 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  10 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  11 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  12 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  13 , accuracy = 0.76
n_estimators = 6 , max_depth =  1 , min_samples_split =  14 , accuracy = 0.76
n_estimators = 6 , max_depth =  2 , min_samples_split =  5 , accuracy = 0.77
n_estimators = 6 , max_depth =  2 , min_samples_split =  6 , accuracy = 0.77
n_estimators = 6 , max_depth =  2 , min_samples_split =  7 , accuracy =

n_estimators = 12 , max_depth =  3 , min_samples_split =  5 , accuracy = 0.79
n_estimators = 12 , max_depth =  3 , min_samples_split =  6 , accuracy = 0.79
n_estimators = 12 , max_depth =  3 , min_samples_split =  7 , accuracy = 0.79
n_estimators = 12 , max_depth =  3 , min_samples_split =  8 , accuracy = 0.79
n_estimators = 12 , max_depth =  3 , min_samples_split =  9 , accuracy = 0.79
n_estimators = 12 , max_depth =  3 , min_samples_split =  10 , accuracy = 0.79
n_estimators = 12 , max_depth =  3 , min_samples_split =  11 , accuracy = 0.79
n_estimators = 12 , max_depth =  3 , min_samples_split =  12 , accuracy = 0.79
n_estimators = 12 , max_depth =  3 , min_samples_split =  13 , accuracy = 0.79
n_estimators = 12 , max_depth =  3 , min_samples_split =  14 , accuracy = 0.79
n_estimators = 12 , max_depth =  4 , min_samples_split =  5 , accuracy = 0.79
n_estimators = 12 , max_depth =  4 , min_samples_split =  6 , accuracy = 0.79
n_estimators = 12 , max_depth =  4 , min_samples_split =  7

n_estimators = 18 , max_depth =  4 , min_samples_split =  12 , accuracy = 0.79
n_estimators = 18 , max_depth =  4 , min_samples_split =  13 , accuracy = 0.79
n_estimators = 18 , max_depth =  4 , min_samples_split =  14 , accuracy = 0.79
n_estimators = 18 , max_depth =  5 , min_samples_split =  5 , accuracy = 0.79
n_estimators = 18 , max_depth =  5 , min_samples_split =  6 , accuracy = 0.79
n_estimators = 18 , max_depth =  5 , min_samples_split =  7 , accuracy = 0.79
n_estimators = 18 , max_depth =  5 , min_samples_split =  8 , accuracy = 0.79
n_estimators = 18 , max_depth =  5 , min_samples_split =  9 , accuracy = 0.79
n_estimators = 18 , max_depth =  5 , min_samples_split =  10 , accuracy = 0.79
n_estimators = 18 , max_depth =  5 , min_samples_split =  11 , accuracy = 0.79
n_estimators = 18 , max_depth =  5 , min_samples_split =  12 , accuracy = 0.79
n_estimators = 18 , max_depth =  5 , min_samples_split =  13 , accuracy = 0.79
n_estimators = 18 , max_depth =  5 , min_samples_split = 

n_estimators = 24 , max_depth =  6 , min_samples_split =  10 , accuracy = 0.80
n_estimators = 24 , max_depth =  6 , min_samples_split =  11 , accuracy = 0.80
n_estimators = 24 , max_depth =  6 , min_samples_split =  12 , accuracy = 0.80
n_estimators = 24 , max_depth =  6 , min_samples_split =  13 , accuracy = 0.81
n_estimators = 24 , max_depth =  6 , min_samples_split =  14 , accuracy = 0.80
n_estimators = 24 , max_depth =  7 , min_samples_split =  5 , accuracy = 0.80
n_estimators = 24 , max_depth =  7 , min_samples_split =  6 , accuracy = 0.80
n_estimators = 24 , max_depth =  7 , min_samples_split =  7 , accuracy = 0.80
n_estimators = 24 , max_depth =  7 , min_samples_split =  8 , accuracy = 0.81
n_estimators = 24 , max_depth =  7 , min_samples_split =  9 , accuracy = 0.80
n_estimators = 24 , max_depth =  7 , min_samples_split =  10 , accuracy = 0.80
n_estimators = 24 , max_depth =  7 , min_samples_split =  11 , accuracy = 0.80
n_estimators = 24 , max_depth =  7 , min_samples_split = 

n_estimators = 30 , max_depth =  8 , min_samples_split =  7 , accuracy = 0.79
n_estimators = 30 , max_depth =  8 , min_samples_split =  8 , accuracy = 0.79
n_estimators = 30 , max_depth =  8 , min_samples_split =  9 , accuracy = 0.79
n_estimators = 30 , max_depth =  8 , min_samples_split =  10 , accuracy = 0.80
n_estimators = 30 , max_depth =  8 , min_samples_split =  11 , accuracy = 0.79
n_estimators = 30 , max_depth =  8 , min_samples_split =  12 , accuracy = 0.80
n_estimators = 30 , max_depth =  8 , min_samples_split =  13 , accuracy = 0.80
n_estimators = 30 , max_depth =  8 , min_samples_split =  14 , accuracy = 0.80
n_estimators = 30 , max_depth =  9 , min_samples_split =  5 , accuracy = 0.79
n_estimators = 30 , max_depth =  9 , min_samples_split =  6 , accuracy = 0.80
n_estimators = 30 , max_depth =  9 , min_samples_split =  7 , accuracy = 0.80
n_estimators = 30 , max_depth =  9 , min_samples_split =  8 , accuracy = 0.79
n_estimators = 30 , max_depth =  9 , min_samples_split =  9

The model achieves its highest accuracy with n_estimators of 36, a maximum depth of 7, and a minimum samples split of 5. The highest accuracy achieved is 81%.

#### 3.3 Logistic Regression <a class="anchor" id="lr"></a>

In [None]:
# silence convergence warnings raised when max_iter is too small for a solver
warnings.filterwarnings("ignore")

solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
weights = [None, 'balanced']  # None means no class weighting

for solver in solvers:
    for weight in weights:
        for max_iter in range(25, 226, 25):
            model = LogisticRegression(random_state=12345, solver=solver, class_weight=weight, max_iter=max_iter)
            model.fit(features_train, target_train)
            predictions_valid = model.predict(features_valid)
            accuracy = accuracy_score(target_valid, predictions_valid)
            print("solver =", solver, ", ", end='')
            print("class_weight =", weight, ", ", end='')
            print("max_iter =", max_iter, ", ", end='')
            print("accuracy =", "{:.2f}".format(accuracy))

The model achieves its highest accuracy with the newton-cg solver and no class weight, regardless of the maximum number of iterations. The highest accuracy achieved is 76%.
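To see what the 'balanced' class weight option changes, the sketch below computes the weights scikit-learn derives from the label counts. The 70/30 label array is made up for illustration and is not the class balance of this dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 70 samples of class 0 and 30 of class 1.
labels = np.array([0] * 70 + [1] * 30)
classes = np.unique(labels)

# Weights that class_weight='balanced' would assign to each class.
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=labels)

# The same values by hand: n_samples / (n_classes * samples_per_class).
manual = len(labels) / (len(classes) * np.bincount(labels))

print("balanced weights:", dict(zip(classes.tolist(), balanced.tolist())))
print("manual weights:  ", dict(zip(classes.tolist(), manual.tolist())))
```

The rarer class gets the larger weight, which shifts predictions toward it and can lower plain accuracy when the classes are uneven.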

### 4. Model Testing <a class="anchor" id="test"></a>

We will test the random forest model with the tuned parameters that achieved the highest accuracy on the validation set.

In [None]:
model = RandomForestClassifier(random_state=12345, n_estimators=36,max_depth=7,min_samples_split=5)
model.fit(features_train, target_train)

predictions_train = model.predict(features_train)
accuracy_train = accuracy_score(target_train,predictions_train)

predictions_test = model.predict(features_test)
accuracy_test = accuracy_score(target_test,predictions_test)

print("Model accuracy on training set: {:.2f}".format(accuracy_train))
print("Model accuracy on test set: {:.2f}".format(accuracy_test))

The model achieves more than 75% accuracy on the test set. Its accuracy on the training set is far higher because that is the data the model was trained on.
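The gap between training and test accuracy can be reproduced on a small scale. The sketch below is only an illustration on synthetic, noisy data from `make_classification` (an assumption, not the users_behavior dataset); it compares a decision tree with no depth limit to one limited to depth 3.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (illustrative only).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, flip_y=0.1, random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

scores = {}
for depth in (None, 3):  # None lets the tree grow until its leaves are pure
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    scores[depth] = (
        accuracy_score(y_train, model.predict(X_train)),
        accuracy_score(y_test, model.predict(X_test)),
    )

for depth, (train_acc, test_acc) in scores.items():
    print("max_depth =", depth,
          ", train accuracy = {:.2f}".format(train_acc),
          ", test accuracy = {:.2f}".format(test_acc))
```

The unlimited tree memorizes the training rows, so its training accuracy says little about how it will do on unseen data.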

### 5. Sanity Check <a class="anchor" id="sanity"></a>

In [None]:
model = DummyClassifier()
model.fit(features_train, target_train)

predictions_train = model.predict(features_train)
accuracy_train = accuracy_score(target_train,predictions_train)

predictions_test = model.predict(features_test)
accuracy_test = accuracy_score(target_test,predictions_test)

print("Dummy model accuracy on training set: {:.2f}".format(accuracy_train))
print("Dummy model accuracy on test set: {:.2f}".format(accuracy_test))

In [None]:
model = RandomForestClassifier(random_state=12345)
model.fit(features_train, target_train)

predictions_train = model.predict(features_train)
accuracy_train = accuracy_score(target_train,predictions_train)

predictions_test = model.predict(features_test)
accuracy_test = accuracy_score(target_test,predictions_test)

print("Model accuracy on training set: {:.2f}".format(accuracy_train))
print("Model accuracy on test set: {:.2f}".format(accuracy_test))
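A dummy baseline is easiest to interpret when its strategy is set explicitly, since the default strategy of `DummyClassifier` has differed between scikit-learn versions. Here is a minimal sketch, on made-up labels, showing that the `most_frequent` strategy scores exactly the share of the majority class; `y_demo` and `X_demo` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels with roughly 30% positives (not the real class balance).
rng = np.random.default_rng(12345)
y_demo = (rng.random(500) < 0.3).astype(int)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores feature values

# Always predict the most common class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_accuracy = accuracy_score(y_demo, baseline.predict(X_demo))

# Share of the majority class, computed directly from the labels.
majority_share = np.bincount(y_demo).max() / len(y_demo)

print("baseline accuracy: {:.3f}".format(baseline_accuracy))
print("majority share:    {:.3f}".format(majority_share))
```

Any real model should beat this number before its accuracy is treated as meaningful.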

## Conclusion <a class="anchor" id="end"></a>

The tuned model achieved better accuracy than both a dummy model and its untuned counterpart. This performance indicates that we chose the correct model type (classification rather than regression) for the problem and that parameter tuning can produce a more accurate model that is less overfit to the training set.