## Content <a id='content'></a>

[Introduction](#intro)

[Datasets](#datasets)

[Data overview](#prep)

[ML](#ml)

[Conclusions](#conclusions)

# Introduction <a id='intro'></a>

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

The data preprocessing step has already been performed. The dataset includes behavior data on subscribers who have already switched to the new plans, so we can move straight to creating the model that will pick the right plan.

Thus, we have a classification task and need to develop a model with the highest possible accuracy. For the purposes of this project, we will take accuracy of 0.75 as a threshold. We will check the accuracy using the test dataset.

### I have a plan

Our journey will include the following steps:
    
1. Open and look through the data file.
2. Split the source data into a training set, a validation set, and a test set.
3. Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.
4. Check the quality of the model using the test set.
5. Additional task: sanity check the model.

# Datasets <a id='datasets'></a>

Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:

- `calls` — number of calls,
- `minutes` — total call duration in minutes,
- `messages` — number of text messages,
- `mb_used` — Internet traffic used in MB,
- `is_ultra` — plan for the current month (Ultra - 1, Smart - 0).

# Data overview <a id='prep'></a>

## Libraries

In [1]:
# import potentially necessary libraries and functions
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

## File upload

We will use a `load()` function to avoid potential problems with file paths.

In [2]:
# function `load`: read a csv with a fallback path and an optional row limit
def load(filename, sep = ',', nrow = None):
    """
    Read a csv file into a DataFrame.

    In addition to the file name, the function takes the `nrow` parameter:
    the number of rows to load. If None, the file is loaded fully.
    """
    try:
        df_raw = pd.read_csv(filename, sep = sep, nrows = nrow)
    except FileNotFoundError:
        # fall back to the same path taken from the filesystem root
        df_raw = pd.read_csv('/'+filename, sep = sep, nrows = nrow)
    return df_raw

In [3]:
# check the dataset data types and memory usage on the first 500 rows
users_raw = load('datasets/users_behavior.csv', nrow = 500)

In [4]:
users_raw.info(memory_usage = 'deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     500 non-null    float64
 1   minutes   500 non-null    float64
 2   messages  500 non-null    float64
 3   mb_used   500 non-null    float64
 4   is_ultra  500 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 19.7 KB


The dataset does not look memory-intensive and the data types are appropriate, so let's read it in full.

In [5]:
users_raw = load('datasets/users_behavior.csv')

In [6]:
users_raw.shape

(3214, 5)

The raw dataset contains data on 3214 customers, the target variable and the 4 features described above.

[Back to Content](#content)

## Data overview <a id='data_overview'></a>

Let's explore the given dataset; we will answer the following questions:
1. Does it contain any missing features or duplicates?
2. What are the basic stats for this dataset?
3. Are there some obvious differences between the two groups of customers - users of the plans?

### Checking for missing values and duplicates <a id="dup"></a>

In [7]:
users_raw.head()

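|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |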


In [8]:
users_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


The data types are correct (the boolean type could be used for the `is_ultra` column, but for our purposes we can leave it as is), and there are no missing values.

Let's make sure we do not have duplicated rows in the dataset.

In [9]:
users_raw.duplicated().sum()

0

No duplicates.

[Back to Content](#content)

### Basic stats <a id="basic_stats"></a>

In [10]:
# stats on numeric variables
users_raw.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|-------|------|-----|-----|-----|-----|-----|-----|
| calls | 3214.0 | 63.038892 | 33.236368 | 0.0 | 40.0 | 62.0 | 82.0 | 244.0 |
| minutes | 3214.0 | 438.208787 | 234.569872 | 0.0 | 274.575 | 430.6 | 571.9275 | 1632.06 |
| messages | 3214.0 | 38.281269 | 36.148326 | 0.0 | 9.0 | 30.0 | 57.0 | 224.0 |
| mb_used | 3214.0 | 17207.673836 | 7570.968246 | 0.0 | 12491.9025 | 16943.235 | 21424.7 | 49745.73 |
| is_ultra | 3214.0 | 0.306472 | 0.4611 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


Out of 3214 users, about 31% have the `Ultra` plan (the mean of `is_ultra` is 0.306). No obvious outliers stand out in the numerical variables.

[Back to Content](#content)

### Group stats <a id="gr_stats"></a>

In [11]:
users_raw.groupby(by = 'is_ultra').describe().T.round(2)

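| feature | statistic | is_ultra = 0 | is_ultra = 1 |
|---------|-----------|--------------|--------------|
| calls | count | 2229.0 | 985.0 |
| calls | mean | 58.46 | 73.39 |
| calls | std | 25.94 | 43.92 |
| calls | min | 0.0 | 0.0 |
| calls | 25% | 40.0 | 41.0 |
| calls | 50% | 60.0 | 74.0 |
| calls | 75% | 76.0 | 104.0 |
| calls | max | 198.0 | 244.0 |
| minutes | count | 2229.0 | 985.0 |
| minutes | mean | 405.94 | 511.22 |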


Looks like users of the `Ultra` plan make more calls, talk more minutes, send more messages, and even use more internet.

[Back to Content](#content)

### Summary on datasets  <a id='data_summary'></a>

Let's sum up:
1. We have loaded the dataset, checked the variable types, and checked the data for duplicates and missing values; the dataset at hand has no missing values, no duplicates, and no obvious outliers.
2. Out of 3214 users, 985 use the `Ultra` plan.
3. Users of the `Ultra` plan seem to make more calls, talk more minutes, send more messages, and even use more internet.

[Back to Content](#content)

## Machine Learning <a id='ml'></a>

In this part, we will build a **binary classification model** for users where the target feature is the user's plan. To this end, we will:
1. Split the source data into a training set, a validation set, and a test set. We will standardize the data as well.
2. Build a model to recommend one of the two plans; we will train the model on the training set with two methods:
    - logistic regression;
    - random forest.
3. Investigate the quality of different models by changing hyperparameters on the validation set.
4. Check the quality of the model using the test set.
5. Perform a sanity check.

### Split the data

Let's split the data (we use random splitting, since we are not dealing with a time series). One thing to take into account is the fairly pronounced class imbalance in the data (about 69% `Smart` vs. 31% `Ultra`), so we will use the `stratify` parameter to preserve these class proportions in every subset.

In [12]:
#divide the data into features (the X matrix) and a target variable (y)
X = users_raw.drop('is_ultra', axis = 1)
y = users_raw['is_ultra']

In [13]:
# divide the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 0, stratify = y)

In [14]:
# use the same function to split off the validation set from the remaining training data
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                                  test_size=0.25, random_state = 0, stratify = y_train) # 0.25 x 0.8 = 0.2
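
To see what the two-step stratified split produces, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` are made-up stand-ins (an assumption, not the Megaline data); only the splitting scheme mirrors the cells above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, roughly 30% positives (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = (rng.random(1000) < 0.3).astype(int)

# Same scheme as above: hold out 20% for test, then 25% of the rest for validation.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0, stratify=y_tmp)

# With stratification, each subset keeps (almost) the same share of class 1.
for name, part in [("train", y_tr), ("valid", y_va), ("test", y_te)]:
    print(f"{name}: {len(part)} rows, share of class 1 = {part.mean():.3f}")
```

The subsets come out as 60% / 20% / 20% of the data, and the class shares differ only by rounding.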

### Standardize the data

Now we will standardize it:

In [15]:
# create a StandardScaler object and apply it to the train set
scaler = StandardScaler() # we could try MinMaxScaler() to scale features to lie between zero and one
X_train_st = scaler.fit_transform(X_train) # train the scaler and transform the matrix for the train set

In [16]:
# apply standardization to the feature matrix for the validation set
X_val_st = scaler.transform(X_val) 

In [17]:
# apply standardization to the feature matrix for the test set
X_test_st = scaler.transform(X_test) 
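
Note that the modelling cells in this notebook pass `X_train` and `X_val` to the models, not the `_st` matrices. One possible way to make sure a model always sees scaled features is to bundle the scaler and the estimator into a pipeline. The sketch below uses synthetic data (`X_demo`, `y_demo` are assumptions) with two features on very different scales, loosely imitating `calls` and `mb_used`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): two features on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(60, 30, 500), rng.normal(17000, 7500, 500)])
noise = rng.normal(0, 1, 500)
y_demo = (X_demo[:, 0] / 30 + X_demo[:, 1] / 7500 + noise > 4.3).astype(int)

# The pipeline fits the scaler on the training rows only and reuses it at predict time.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=54321))
pipe.fit(X_demo[:400], y_demo[:400])

holdout_acc = pipe.score(X_demo[400:], y_demo[400:])
print(f"Hold-out accuracy: {holdout_acc:.3f}")
```

The same pipeline object could be passed to a search such as `GridSearchCV`, so that scaling is refitted inside every cross-validation fold.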

[Back to Content](#content)

### Building a model to recommend a plan

Let's train the models. For the logistic regression, we will use the `sklearn.linear_model.LogisticRegression()` model; for the random forest algorithm, the `sklearn.ensemble.RandomForestClassifier()` model. We will add `ROC_AUC` to the metrics as well and explain [below](#binary) why, in this case, more attention should be paid to `recall`.

In [18]:
# declare the list of models
models = [LogisticRegression(random_state = 54321),
          RandomForestClassifier(random_state = 54321)]

In [19]:
# create the function that takes the model and data as input and outputs metrics
# to compare the default model with default hyperparameters/threshold value against a tuned model,
# we will employ the validation set to calculate the model metrics.
def make_prediction(m, features_train, target_train, features_valid, target_valid, log_regression = False):
    model = m # the estimator passed in; it is fitted in place
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    probabilities = model.predict_proba(features_valid)
    if log_regression:
        print(model.intercept_, model.coef_)
    
    print(f'Accuracy:{model.score(features_valid, target_valid):.4f}')
    print('Precision:{:.2f} Recall:{:.2f}'.format(precision_score(target_valid,predictions),
                                                                recall_score(target_valid,predictions))
                                                                )
    print('ROC_AUC: {:.2f}\n'.format(roc_auc_score(target_valid, probabilities[:,1])))

In [20]:
# apply the function to each model
for i, model in enumerate(models):
    print(f"Model {i}: {model}")
    # print the intercept and coefficients for the logistic regression only
    make_prediction(model, X_train, y_train, X_val, y_val,
                    log_regression = isinstance(model, LogisticRegression))

Model 0: LogisticRegression(random_state=54321)
[-0.01071696] [[-1.77046025e-02  2.07214709e-03  4.86316874e-03 -3.00027398e-05]]
Accuracy:0.7092
Precision:0.81 Recall:0.07
ROC_AUC: 0.45

Model 1: RandomForestClassifier(random_state=54321)
Accuracy:0.7947
Precision:0.70 Recall:0.57
ROC_AUC: 0.80
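
The logistic regression output above (accuracy 0.7092, recall 0.07) shows how accuracy can look acceptable while the minority class is almost never found. The toy labels below are invented purely for illustration; they show how the three metrics are derived from the confusion matrix.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Toy labels (assumption): 4 positives, 6 negatives; the model finds only one positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

acc = accuracy_score(y_true, y_pred)    # (tp + tn) / all
prec = precision_score(y_true, y_pred)  # tp / (tp + fp)
rec = recall_score(y_true, y_pred)      # tp / (tp + fn)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"Accuracy: {acc:.2f}  Precision: {prec:.2f}  Recall: {rec:.2f}")
# Accuracy: 0.60  Precision: 0.50  Recall: 0.25
```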



[Back to Content](#content)

#### Finetuning the models  <a id='solver'></a>

Applying a different solver and setting the maximum number of iterations to 1000 fixes the convergence problem of the logistic regression, with the same values of the metrics. Model 0 ends up with a better result, although its `ROC_AUC` value is still lower than that of the Random Forest.

The logistic regression result is less convincing, so let's fine-tune the hyperparameters of `RandomForestClassifier()`. First we will consider [each one separately](https://www.geeksforgeeks.org/hyperparameters-of-random-forest-classifier/), starting, of course, with `n_estimators`, the number of trees in the forest:

In [21]:
best_score = 0
best_est = 0
for est in range(1, 20): # choose hyperparameter range
    model = RandomForestClassifier(random_state=54321, n_estimators=est) # set number of trees
    model.fit(X_train, y_train) # train model on training set
    score = model.score(X_val, y_val) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score # save best accuracy score on validation set
        best_est = est # save number of estimators corresponding to best accuracy score

print(f"Accuracy of the best model on the validation set (n_estimators = {best_est}): {best_score}")

Accuracy of the best model on the validation set (n_estimators = 17): 0.7931570762052877
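
The same search loop is repeated for every hyperparameter in this section. As a sketch, it could be folded into one helper. The function name `sweep` and the synthetic data are hypothetical and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def sweep(param_name, values, X_tr, y_tr, X_va, y_va, random_state=54321):
    """Try each value of one hyperparameter; return (best_value, best_accuracy)."""
    best_value, best_score = None, -1.0
    for value in values:
        model = RandomForestClassifier(random_state=random_state,
                                       **{param_name: value})
        model.fit(X_tr, y_tr)
        score = model.score(X_va, y_va)  # accuracy on the validation set
        if score > best_score:           # strict ">" keeps the first of tied values
            best_value, best_score = value, score
    return best_value, best_score

# Synthetic stand-in data (assumption), split into train and validation parts.
X_demo, y_demo = make_classification(n_samples=400, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=0, stratify=y_demo)

best_depth, best_acc = sweep("max_depth", range(1, 5), X_tr, y_tr, X_va, y_va)
print(f"max_depth = {best_depth}: accuracy {best_acc:.4f}")
```

With such a helper, each hyperparameter study would reduce to a single call, which also removes the risk of reusing a variable name from a previous cell.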


Now, `max_depth`, the maximum allowed length of the path between the root node and a leaf node.

In [22]:
best_score = 0
best_depth = 0
for depth in range(1, 5): # choose hyperparameter range
    model = RandomForestClassifier(random_state=54321, max_depth=depth) # set depth
    model.fit(X_train, y_train) # train model on training set
    score = model.score(X_val, y_val) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score # save best accuracy score on validation set
        best_depth = depth # save number corresponding to best accuracy score

print(f"Accuracy of the best model on the validation set (max_depth = {best_depth}): {best_score}")

Accuracy of the best model on the validation set (max_depth = 4): 0.8242612752721618


`min_samples_split`, the minimum required number of observations in any given node to split it:

In [23]:
best_score = 0
best_samples_split = 0
for split in range(2, 6): # choose hyperparameter range
    model = RandomForestClassifier(random_state=54321, min_samples_split=split) 
    model.fit(X_train, y_train) # train model on training set
    score = model.score(X_val, y_val) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score # save best accuracy score on validation set
        best_samples_split = split # save the number corresponding to best accuracy score

print(f"Accuracy of the best model on the validation set (min_samples_split = {best_samples_split}): {best_score}")

Accuracy of the best model on the validation set (min_samples_split = 5): 0.8040435458786936


`min_samples_leaf`, the minimum number of samples that should be present in the leaf node after splitting a node, 1 is the default:

In [24]:
best_score = 0
best_samples_leaf = 0
for leaf in range(1, 6): # choose hyperparameter range
    model = RandomForestClassifier(random_state=54321, min_samples_leaf=leaf) 
    model.fit(X_train, y_train) # train model on training set
    score = model.score(X_val, y_val) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score # save best accuracy score on validation set
        best_samples_leaf = leaf # save the number corresponding to best accuracy score

print(f"Accuracy of the best model on the validation set (min_samples_leaf = {best_samples_leaf}): {best_score}")

Accuracy of the best model on the validation set (min_samples_leaf = 4): 0.8180404354587869


`max_features`:

In [25]:
best_score = 0
best_max_features = 0
for max_f in range(1, 5): # as an integer, max_features takes values in [1, n_features]
    model = RandomForestClassifier(random_state=54321, max_features=max_f) 
    model.fit(X_train, y_train) # train model on training set
    score = model.score(X_val, y_val) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score # save best accuracy score on validation set
        best_max_features = max_f # save the number corresponding to best accuracy score

print(f"Accuracy of the best model on the validation set (max_features = {best_max_features}): {best_score}")

Accuracy of the best model on the validation set (max_features = 1): 0.80248833592535


`max_leaf_nodes`, limits the splitting of the nodes in the tree and hence restricts the growth of the tree:

In [26]:
best_score = 0
best_max_leaf_nodes = 0
for max_l in range(2, 100): # either None, or > 1
    model = RandomForestClassifier(random_state=54321, max_leaf_nodes=max_l) 
    model.fit(X_train, y_train) # train model on training set
    score = model.score(X_val, y_val) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score # save best accuracy score on validation set
        best_max_leaf_nodes = max_l # save the number corresponding to best accuracy score

print(f"Accuracy of the best model on the validation set (max_leaf_nodes = {best_max_leaf_nodes}): {best_score}")

Accuracy of the best model on the validation set (max_leaf_nodes = 23): 0.8258164852255054


`max_samples`, the number of samples drawn from the training set to train each individual tree:

In [27]:
best_score = 0
best_max_samples = 0
for max_s in range(1, 1000): # number of samples drawn for each tree
    model = RandomForestClassifier(random_state=54321, max_samples=max_s) 
    model.fit(X_train, y_train) # train model on training set
    score = model.score(X_val, y_val) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score # save best accuracy score on validation set
        best_max_samples = max_s # save the number corresponding to best accuracy score

print(f"Accuracy of the best model on the validation set (max_samples = {best_max_samples}): {best_score}")

Accuracy of the best model on the validation set (max_samples = 210): 0.8320373250388803


`criterion`:

In [28]:
best_score = 0
best_criterion =''
for cr in ['gini', 'entropy']: # split quality criteria to compare
    model = RandomForestClassifier(random_state=54321, criterion=cr) 
    model.fit(X_train, y_train) # train model on training set
    score = model.score(X_val, y_val) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score # save best accuracy score on validation set
        best_criterion = cr # save the criterion corresponding to best accuracy score

print(f"Accuracy of the best model on the validation set (criterion = {best_criterion}): {best_score}")

Accuracy of the best model on the validation set (criterion = entropy): 0.7993779160186625


N.B. We will leave the `bootstrap` parameter set to True (sampling with replacement).

Let's try to use all of the "best" hyperparameter values together:

In [29]:
final_model = RandomForestClassifier(random_state=54321,
                                     n_estimators = 17,
                                     max_depth = 4, # the same as max_features, due to small amount of features
                                     min_samples_split = 4,
                                     min_samples_leaf = 1,
                                     max_leaf_nodes = 23,
                                     max_samples = 210,
                                     criterion = 'entropy'
                                    ) 
final_model.fit(X_train, y_train)
score = final_model.score(X_val, y_val)
print(f"Accuracy of the final model on the validation set: {score}")

Accuracy of the final model on the validation set: 0.8102643856920684


Hm... The accuracy is better than that of the default random forest (0.7947), but lower than what some hyperparameters achieved on their own (e.g. with `max_samples = 210` alone we reached an accuracy of 0.8320373250388803). This illustrates that hyperparameters interact: values tuned one at a time are not necessarily the best in combination.
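
Because hyperparameters interact, one option is to evaluate combinations jointly on the validation set instead of one parameter at a time. Below is a small sketch with `ParameterGrid` on synthetic data. The data and the two-by-two grid are assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=0, stratify=y_demo)

# Every combination of the listed values is tried, not each parameter in isolation.
grid = ParameterGrid({"max_depth": [2, 4], "min_samples_leaf": [1, 4]})

scores = []
for params in grid:
    model = RandomForestClassifier(random_state=54321, n_estimators=20, **params)
    model.fit(X_tr, y_tr)
    scores.append((model.score(X_va, y_va), params))

best_joint_score, best_joint_params = max(scores, key=lambda item: item[0])
print(f"Best combination: {best_joint_params}, accuracy {best_joint_score:.4f}")
```

This is the same idea that `GridSearchCV` implements, except that `GridSearchCV` scores each combination with cross-validation rather than a single validation set.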


Let's try [grid search](https://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html#sphx-glr-auto-examples-model-selection-plot-multi-metric-evaluation-py). N.B. Grid search is resource-intensive, so we will limit the grid to three hyperparameters.

In [30]:
from sklearn.model_selection import GridSearchCV

In [31]:
#Multiple metric parameter search can be done by setting the scoring parameter to a list of metric scorer names
# or a dict mapping the scorer names to the scorer callables.
scoring = {"AUC": "roc_auc", "Accuracy": "accuracy"}

In [32]:
gs = GridSearchCV(
    RandomForestClassifier(random_state=54321),
    param_grid={"n_estimators": range(10,20),
                "max_depth": range(1,5),
                "max_leaf_nodes": range(2,50)
               
               },
    scoring=scoring,
    # Setting refit='AUC', refits an estimator on the whole dataset with the
    # parameter setting that has the best cross-validated AUC score.
    # That estimator is made available at ``gs.best_estimator_`` along with
    # parameters like ``gs.best_score_``, ``gs.best_params_`` and
    # ``gs.best_index_``
    refit="AUC",
    return_train_score=True,
)
gs.fit(X_train, y_train)
results = gs.cv_results_

Let's see what the `results` dictionary looks like.

In [33]:
results

{'mean_fit_time': array([0.01858377, 0.01676483, 0.01792421, ..., 0.0327651 , 0.03418932,
        0.03688989]),
 'std_fit_time': array([0.00465421, 0.00017308, 0.00022667, ..., 0.00071769, 0.00038687,
        0.00220144]),
 'mean_score_time': array([0.00715547, 0.00706534, 0.00730052, ..., 0.00920863, 0.00901952,
        0.00927348]),
 'std_score_time': array([0.00039725, 0.00045606, 0.0004648 , ..., 0.00100137, 0.00013904,
        0.00053901]),
 'param_max_depth': masked_array(data=[1, 1, 1, ..., 4, 4, 4],
              mask=[False, False, False, ..., False, False, False],
        fill_value='?',
             dtype=object),
 'param_max_leaf_nodes': masked_array(data=[2, 2, 2, ..., 49, 49, 49],
              mask=[False, False, False, ..., False, False, False],
        fill_value='?',
             dtype=object),
 'param_n_estimators': masked_array(data=[10, 11, 12, ..., 17, 18, 19],
              mask=[False, False, False, ..., False, False, False],
        fill_value='?',
            

The scores of all the scorers are available in the `cv_results_` dict at keys ending in `_<scorer_name>`; with the scorer names defined above, these are, for example, `'mean_test_Accuracy'` and `'rank_test_Accuracy'`.
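
A `cv_results_` dictionary is often easier to read as a table. The sketch below runs a tiny multi-metric grid search on synthetic data (the data, the small grid, and `cv=3` are assumptions made for speed) and loads the per-scorer columns into a DataFrame.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=0)

demo_gs = GridSearchCV(
    RandomForestClassifier(random_state=54321, n_estimators=10),
    param_grid={"max_depth": [2, 4]},
    scoring={"AUC": "roc_auc", "Accuracy": "accuracy"},
    refit="AUC",  # best_score_ and best_params_ will refer to the AUC scorer
    cv=3,
)
demo_gs.fit(X_demo, y_demo)

# Column names end in the scorer names used as keys of the scoring dict.
columns = ["params", "mean_test_AUC", "rank_test_AUC",
           "mean_test_Accuracy", "rank_test_Accuracy"]
table = pd.DataFrame(demo_gs.cv_results_)[columns].sort_values("rank_test_Accuracy")
print(table)
```

Sorting by `rank_test_Accuracy` or `rank_test_AUC` makes it easy to see whether the two metrics agree on the best parameter set.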

In [34]:
best_index_acc = np.nonzero(results["rank_test_%s" % 'Accuracy'] == 1)[0][0]
best_score_acc = results["mean_test_%s" % 'Accuracy'][best_index_acc]

We can also check the `best_score_` and the `best_params_` values of the `gs` estimator:

In [35]:
best_score = gs.best_score_
best_params = gs.best_params_

Let's take a look at all these values:

In [36]:
best_index_acc, best_score_acc

(1534, 0.7961725321310814)

In [37]:
best_score, best_params

(0.7874561465297056,
 {'max_depth': 4, 'max_leaf_nodes': 12, 'n_estimators': 19})

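Interestingly, the `best_score_` is lower than the `best_score_acc`. The two are not directly comparable, though: with `refit="AUC"`, `best_score_` is the mean cross-validated ROC AUC, while `best_score_acc` is the mean cross-validated accuracy.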

The set of hyperparameter values corresponding to the best accuracy score:

In [38]:
results['params'][best_index_acc]

{'max_depth': 4, 'max_leaf_nodes': 11, 'n_estimators': 14}

Let's show that the `best_score_auc` is in fact the `best_score_` when the estimator is run with `refit = "AUC"`.

In [39]:
best_index_auc = np.nonzero(results["rank_test_%s" % 'AUC'] == 1)[0][0]
best_score_auc = results["mean_test_%s" % 'AUC'][best_index_auc]

In [40]:
best_index_auc, best_score_auc

(1549, 0.7874561465297056)

In [41]:
results['params'][best_index_auc]

{'max_depth': 4, 'max_leaf_nodes': 12, 'n_estimators': 19}

It actually is :) The `best_score_auc` equals the `best_score_` when `refit = "AUC"` is used. This shows that the `refit` parameter should be chosen carefully [whenever there are considerations other than maximum score in choosing a best estimator](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

Obviously, we have not achieved the best result, but let's see what the final score is, if we employ these three hyperparameters for the `best_score_acc` and the `best_score_`:

In [42]:
final_model = RandomForestClassifier(random_state=54321,
                                     n_estimators = 14,
                                     max_depth = 4, 
                                     max_leaf_nodes = 11,
                                    ) 
final_model.fit(X_train, y_train)
score = final_model.score(X_val, y_val)
print(f"Accuracy of the final model on the validation set: {score}")

Accuracy of the final model on the validation set: 0.8164852255054432


In [43]:
final_model = RandomForestClassifier(random_state=54321,
                                     n_estimators = 19,
                                     max_depth = 4, 
                                     max_leaf_nodes = 12,
                                    ) 
final_model.fit(X_train, y_train)
score = final_model.score(X_val, y_val)
print(f"Accuracy of the final model on the validation set: {score}")

Accuracy of the final model on the validation set: 0.8211508553654744


Both cases perform better than all the "individually best" hyperparameter values applied together, but not much better than the initial random forest model (0.7947).

Let's try another estimator: 

In [44]:
from sklearn.model_selection import RandomizedSearchCV

Unlike grid search, randomized search evaluates only a fixed number of sampled hyperparameter combinations, which makes model tuning much faster and computationally cheaper; see [Bergstra and Bengio (2012)](https://jmlr.csail.mit.edu/papers/volume13/bergstra12a/bergstra12a.pdf).

In [45]:
from scipy.stats import randint as sp_randint  # for initializing random integer values

rs = RandomizedSearchCV(
    RandomForestClassifier(random_state=54321),
    param_distributions = {"max_depth": range(2,5),
              "min_samples_split": sp_randint(2, 50),
              "min_samples_leaf": sp_randint(2, 30),
              "max_leaf_nodes": sp_randint(2,50),
              "bootstrap": [True, False],
              "n_estimators": sp_randint(1,100), # at least one tree; a lower bound of 0 could sample an invalid value
              "criterion" : ["gini", "entropy"],
              'max_features': ['sqrt', 'log2', None]
             },
    scoring='accuracy',
    n_iter=50,
    # refit=True by default
    # return_train_score=False by default, to save the computing resources
)
rs.fit(X_train, y_train)

RandomizedSearchCV(estimator=RandomForestClassifier(random_state=54321),
                   n_iter=50,
                   param_distributions={'bootstrap': [True, False],
                                        'criterion': ['gini', 'entropy'],
                                        'max_depth': range(2, 5),
                                        'max_features': ['sqrt', 'log2', None],
                                        'max_leaf_nodes': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fcbcd21c250>,
                                        'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fcbcd26f0d0>,
                                        'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fcbcd27ddc0>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fcbcd2595b0>},
                   scoring='accuracy')

Let's take a look at `best_score_` and `best_params_` attributes of the `rs` estimator:

In [46]:
rs.best_score_

0.7946154363770944

In [47]:
rs.best_params_

{'bootstrap': True,
 'criterion': 'gini',
 'max_depth': 4,
 'max_features': 'log2',
 'max_leaf_nodes': 12,
 'min_samples_leaf': 2,
 'min_samples_split': 39,
 'n_estimators': 80}

It looks like this was not the best run. Let's narrow the parameter distributions and use more iterations...

In [48]:
rs = RandomizedSearchCV(
    RandomForestClassifier(random_state=54321),
    param_distributions = {"max_depth": [3,4],
              "min_samples_split": sp_randint(2, 20),
              "min_samples_leaf": sp_randint(2, 20),
              "max_leaf_nodes": sp_randint(2,20),
              "bootstrap": [True, False],
              "n_estimators": [10,20,30,40,50]
             },
    scoring='accuracy',
    n_iter=300,
    # refit=True by default
    # return_train_score=False by default, to save the computing resources
)
rs.fit(X_train, y_train)

RandomizedSearchCV(estimator=RandomForestClassifier(random_state=54321),
                   n_iter=300,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [3, 4],
                                        'max_leaf_nodes': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fcbcd21c5b0>,
                                        'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fcbcd259400>,
                                        'min_samples_split': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fcbcd26f880>,
                                        'n_estimators': [10, 20, 30, 40, 50]},
                   scoring='accuracy')

In [49]:
rs.best_score_

0.7956543974160555

In [50]:
rs.best_params_

{'bootstrap': True,
 'max_depth': 4,
 'max_leaf_nodes': 10,
 'min_samples_leaf': 4,
 'min_samples_split': 6,
 'n_estimators': 40}

No meaningful improvement was achieved, but at least the score is no worse than that of the initial random forest model.
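
The two randomized searches above do not fix the sampling seed, so rerunning them can give different `best_params_`. A minimal sketch of a reproducible search follows. The helper `run_search`, the synthetic data, and the small `n_iter` are assumptions for illustration only.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=0)

def run_search(seed):
    """Run a small randomized search whose parameter sampling is fixed by `seed`."""
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=54321, n_estimators=10),
        param_distributions={"max_depth": [3, 4],
                             "min_samples_leaf": randint(2, 20)},
        n_iter=5,
        cv=3,
        scoring="accuracy",
        random_state=seed,  # controls which parameter combinations are sampled
    )
    search.fit(X_demo, y_demo)
    return search.best_params_, search.best_score_

first = run_search(0)
second = run_search(0)
print(first)
print("Same result on rerun:", first == second)
```

Fixing `random_state` on the search makes the sampled combinations repeatable, so the reported `best_params_` can be matched to the model built from them later.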

[Back to Content](#content)

#### Comparing the model performance on the validation set

Let's sum up and compare five random forest models.

In [51]:
final_model_0 = RandomForestClassifier(random_state=54321, #independent optimization of all hyperparameters
                                     n_estimators = 17,
                                     max_depth = 4, 
                                     min_samples_split = 4,
                                     min_samples_leaf = 1,
                                     max_leaf_nodes = 23,
                                     max_samples = 210,
                                     criterion = 'entropy'
                                    )
final_model_0.fit(X_train, y_train)
score = final_model_0.score(X_val, y_val)
print(f"Accuracy of the final model on the validation set: {score}")

Accuracy of the final model on the validation set: 0.8102643856920684


In [52]:
final_model_1 = RandomForestClassifier(random_state=54321, # grid search with best accuracy
                                     n_estimators = 14,
                                     max_depth = 4, 
                                     max_leaf_nodes = 11,
                                    ) 
final_model_1.fit(X_train, y_train)
score = final_model_1.score(X_val, y_val)
print(f"Accuracy of the final model on the validation set: {score}")

Accuracy of the final model on the validation set: 0.8164852255054432


In [53]:
final_model_2 = RandomForestClassifier(random_state=54321, # grid search with best AUC
                                     n_estimators = 19,
                                     max_depth = 4, 
                                     max_leaf_nodes = 12,
                                    ) 
final_model_2.fit(X_train, y_train)
score = final_model_2.score(X_val, y_val)
print(f"Accuracy of the final model on the validation set: {score}")

Accuracy of the final model on the validation set: 0.8211508553654744


In [54]:
final_model_3 = RandomForestClassifier(random_state=54321, # random search
                                     n_estimators = 10,
                                     max_depth = 4, 
                                     max_leaf_nodes = 10,
                                     min_samples_leaf = 2,
                                     min_samples_split = 14,
                                    ) 
final_model_3.fit(X_train, y_train)
score = final_model_3.score(X_val, y_val)
print(f"Accuracy of the final model on the validation set: {score}")

Accuracy of the final model on the validation set: 0.8180404354587869


In [55]:
final_model_4 = RandomForestClassifier(random_state=54321, # a single hyperparameter tuned on its own
                                     max_samples = 210,
                                    ) 
final_model_4.fit(X_train, y_train)
score = final_model_4.score(X_val, y_val)
print(f"Accuracy of the final model on the validation set: {score}")

Accuracy of the final model on the validation set: 0.8320373250388803


The last one shows the best result on the validation set.

Let's compare the Logistic Regression model and the best Random Forest Classifier on the test set:

In [56]:
# choose the final models
models = [LogisticRegression(random_state = 54321),
          RandomForestClassifier(random_state=54321, max_samples=210)
         ]

In [57]:
# evaluate both models on the test set with make_prediction();
# for Logistic Regression, log_regression=True also prints the intercept and coefficients
for i, model in enumerate(models):
    print(f"Model {i}: {model}")
    if isinstance(model, LogisticRegression):
        make_prediction(model, X_train, y_train, X_test, y_test, log_regression=True)
    else:
        make_prediction(model, X_train, y_train, X_test, y_test)

Model 0: LogisticRegression(random_state=54321)
[-0.01071696] [[-1.77046025e-02  2.07214709e-03  4.86316874e-03 -3.00027398e-05]]
Accuracy:0.6998
Precision:0.70 Recall:0.04
ROC_AUC: 0.46

Model 1: RandomForestClassifier(max_samples=210, random_state=54321)
Accuracy:0.8134
Precision:0.77 Recall:0.56
ROC_AUC: 0.83



The Random Forest algorithm with a `max_samples` value of 210 gave the best result on the test set: accuracy of 0.81 and ROC-AUC of 0.83, versus 0.70 and 0.46 for Logistic Regression (though the reasons can be tricky to explain to the business). For a real business task of this kind, a further search for a better-performing model should be considered, given the potential financial gain from recommending the right plan.
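
One way to explain `max_samples` to a business audience is that every tree in the forest is trained on its own small random draw of rows rather than on the whole training set. Below is a minimal sketch of that idea in plain Python. It assumes the default bootstrap behavior (rows drawn with replacement) and an illustrative training-set size of 1928 rows; the helper `draw_bootstrap_rows` and the row count are assumptions for illustration, not taken from the notebook.

```python
import random


def draw_bootstrap_rows(n_rows, max_samples, seed):
    """Return row indices for one tree: max_samples draws with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(n_rows) for _ in range(max_samples)]


n_train = 1928     # assumed training-set size, for illustration only
max_samples = 210  # value used for the final Random Forest model

# Every tree receives a different small sample of the training rows.
samples = [draw_bootstrap_rows(n_train, max_samples, seed) for seed in range(3)]

for tree_id, rows in enumerate(samples):
    share_seen = len(set(rows)) / n_train
    print(f"tree {tree_id}: {len(rows)} draws, {share_seen:.1%} of training rows seen")
```

Because each tree sees only a small part of the data, individual trees tend to differ more from one another, and the forest averages over those differences.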

[Back to Content](#content)

### Sanity check

Let's check whether our best model performs better than a random one. Consider three cases:

(a) Assume a random model that predicts `1` or `0` with equal probability. It will assign `1` to half of the cases in which `0` is correct and to half of the cases in which `1` is correct. Thus, 50% of all cases are assigned `1` and the accuracy is 0.5. The precision of such a model equals the share of class `1` in the data (about 0.3 here), since the cases it labels `1` are a random sample of the dataset.

(b) Assume the random model assigns `1` in 30% of cases (matching the proportion in the source dataset), regardless of the true class. Then 0.7 of the 70% of `0` cases would be correctly assigned to class `0` and 0.3 of the 30% of `1` cases would be correctly assigned to class `1`, i.e. the accuracy of such a model would be 0.49 + 0.09 = 0.58.

(c) Another option can be suggested: if the model always says `0`, it will be correct in 70% of cases.
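
The arithmetic behind cases (a), (b) and (c) can be sketched with one formula: a guesser that ignores the features is correct when it says `1` for a true `1` or `0` for a true `0`. The function name `expected_accuracy` is an illustrative assumption, and the class share is rounded to 0.3 as in the text.

```python
def expected_accuracy(share_of_ones, predicted_share_of_ones):
    """Expected accuracy of a guesser that ignores the features.

    share_of_ones: true proportion of class 1 in the data.
    predicted_share_of_ones: probability with which the guesser outputs 1.
    """
    correct_ones = share_of_ones * predicted_share_of_ones
    correct_zeros = (1 - share_of_ones) * (1 - predicted_share_of_ones)
    return correct_ones + correct_zeros


p = 0.3  # rounded share of Ultra users (class 1)

case_a = expected_accuracy(p, 0.5)  # coin flip
case_b = expected_accuracy(p, p)    # says 1 as often as class 1 occurs
case_c = expected_accuracy(p, 0.0)  # always says 0

print(f"(a) {case_a:.2f}  (b) {case_b:.2f}  (c) {case_c:.2f}")
# prints: (a) 0.50  (b) 0.58  (c) 0.70
```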

To check this reasoning and build a sounder baseline, we will employ the [`DummyClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html). The specific behavior of the classifier is selected with the `strategy` parameter.

In [58]:
from sklearn.dummy import DummyClassifier

In [59]:
# ~case (a)
dummy_clf = DummyClassifier(strategy="uniform", random_state = 54321) # use random_state to make the results comparable
dummy_clf.fit(X_train, y_train)

dummy_clf.score(X_test, y_test)

0.5163297045101088

In [60]:
# ~case (b)
dummy_clf = DummyClassifier(strategy="stratified", random_state = 54321) # use random_state to make the results comparable
dummy_clf.fit(X_train, y_train)

dummy_clf.score(X_test, y_test)

0.5723172628304821

In [61]:
# case (c)
dummy_clf = DummyClassifier(strategy="most_frequent") # the model of always saying 0
dummy_clf.fit(X_train, y_train)

dummy_clf.score(X_test, y_test)

0.6936236391912908

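Our models still perform better on accuracy: the Random Forest (0.8134) clearly beats all three baselines, while Logistic Regression (0.6998) only marginally exceeds the most-frequent baseline (0.6936).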

[Back to Content](#content)

## Conclusions and recommendations <a id='conclusions'></a>

#### Summary on the dataset

1. We have uploaded the dataset, updated column names and variable types, and checked the data for duplicates and missing values; the dataset at hand has no missing values, duplicates, or outliers.
2. Out of 3214 users, 985 use the `Ultra` plan.
3. Users of the `Ultra` plan seem to make more calls, talk more minutes, send more messages, and even use more internet.

#### Results interpretation and model adjustment  <a id='binary'></a>

To interpret the results, let's reflect on some peculiarities of the metrics we use:
1. Accuracy is fully informative only when classes are balanced, i.e. when objects are distributed almost evenly between the classes, approximately 50/50. In our dataset, the classes are not balanced enough: about 30% of users are on `Ultra`, so a model that always predicts `0` already reaches an accuracy of about 0.69.
2. In our case, precision and ROC-AUC help to evaluate a model where accuracy alone can mislead:
    - Precision is the share of correct answers among the objects the model assigns to the target class (those marked as "1");
    - ROC-AUC measures whether the model ranks objects of the target class above the others.
3. A grid search with more computational resources can be applied to find the best combination of the hyperparameter values.
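
To make the metrics from the list above concrete, here they are written out from their definitions in plain Python. This is only a sketch: the labels and scores are made-up toy values with the same 70/30 class split, and the helper functions are illustrative stand-ins for what `sklearn.metrics` provides.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred):
    """Share of true 1s among the objects predicted as 1 (0.0 if none are)."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == 1]
    return sum(predicted_pos) / len(predicted_pos) if predicted_pos else 0.0


def roc_auc(y_true, y_score):
    """Probability that a random 1 is scored above a random 0 (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 7 objects of class 0 and 3 of class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.6, 0.35, 0.7, 0.8]
y_pred = [int(s >= 0.5) for s in y_score]  # threshold at 0.5
always_zero = [0] * len(y_true)            # the "always say 0" baseline

print(f"always 0:  accuracy={accuracy(y_true, always_zero):.2f} "
      f"precision={precision(y_true, always_zero):.2f}")
print(f"toy model: accuracy={accuracy(y_true, y_pred):.2f} "
      f"precision={precision(y_true, y_pred):.2f} "
      f"roc_auc={roc_auc(y_true, y_score):.2f}")
```

The baseline reaches an accuracy of 0.70 without finding a single `Ultra` user, which is why accuracy is read together with the other metrics.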

#### Summary on the model training

1. We have employed Logistic Regression and Random Forest Classifier to build the models (we used standardization as well to avoid possible differences in the feature scales).
2. The logistic regression showed a less promising accuracy value (other metrics were insufficient as well); thus, we focused on tuning the hyperparameters of the random forest model.
3. Tuning some of the hyperparameters individually improved accuracy.
4. We tried both grid search and random search to find a better combination of hyperparameter values; grid search requires more computing resources than random search, but neither of them resulted in any meaningful improvement of the model in terms of accuracy score.
5. The best accuracy score for the final model on the test subset (with `max_samples` value of 210) was 0.81.
6. The sanity check showed better performance of the final model as compared to a dummy classifier.
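
The difference in computing cost between grid search and random search noted in the summary can be illustrated by counting the candidate models each approach would fit. The search space and the budget of 50 iterations below are hypothetical, chosen only to resemble the hyperparameters tuned in this project; they are not the values used in the notebook.

```python
import itertools
import random

# Hypothetical search space (an assumption for illustration)
param_grid = {
    "n_estimators": list(range(10, 101, 10)),
    "max_depth": list(range(2, 11)),
    "max_leaf_nodes": list(range(5, 21)),
    "min_samples_leaf": [1, 2, 4, 8],
}

# Grid search tries every combination of the listed values.
grid_candidates = list(itertools.product(*param_grid.values()))

# Random search tries a fixed number of randomly drawn combinations.
rng = random.Random(54321)
n_iter = 50  # assumed budget
random_candidates = [
    {name: rng.choice(values) for name, values in param_grid.items()}
    for _ in range(n_iter)
]

print(f"Grid search candidates:   {len(grid_candidates)}")
print(f"Random search candidates: {len(random_candidates)}")
```

With cross-validation, each candidate is fitted once per fold, so the gap in the number of model fits grows accordingly.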

[Back to Content](#content)