<h1><center>Customer Satisfaction Classification</center></h1>
In this notebook, I explain
* how I addressed the imbalance between the two classes,
* which strategy I used to classify happy and unhappy customers.

In [2]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.utils import shuffle
from sklearn.model_selection import RandomizedSearchCV

from boruta import boruta_py

import pandas as pd
import matplotlib.pyplot as plt
import copy
from time import time

from utility import random_forest_classifer
from utility import random_forest_classifer_params

from imblearn.over_sampling import ADASYN

In [2]:
data_train = pd.read_csv('data/train2.csv',index_col=0)

In [3]:
# .values replaces as_matrix(), which was removed in pandas 1.0.
# The slice assumes TARGET is the last column.
matrix_features = data_train.values[:,:-1]
labels = data_train['TARGET'].values
f1_score_list,confusion_matrix_list = random_forest_classifer(matrix_features,labels)

# The scoring metric is the F1 score, i.e. the harmonic mean of precision and recall.
F1_accuracy_str="F1 accuracy: %0.3f (+/- %0.3f)" % (np.mean(f1_score_list),
                                                    np.std(f1_score_list) * 2)
print(F1_accuracy_str)

F1 accuracy: 0.544 (+/- 0.010)
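
As a reminder of what the F1 score measures, here is a minimal sketch that computes it by hand on a toy prediction vector. The helper name `f1_from_counts` and the toy labels are illustrative assumptions. They are not part of the `utility` module used in this notebook.

```python
import numpy as np

def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall (hypothetical helper)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 1 = unhappy (minority class), 0 = happy.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

tp = int(((y_true == 1) & (y_pred == 1)).sum())  # true positives
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # false positives
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # false negatives

f1 = f1_from_counts(tp, fp, fn)
print("precision=%.3f recall=%.3f F1=%.3f" % (tp / (tp + fp), tp / (tp + fn), f1))
```

Plain accuracy on this toy vector is 6/8, while the F1 score of the minority class is lower. This is why F1 is a more informative metric than accuracy when the classes are unbalanced.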


In [4]:
filtering_binary =data_train.apply(pd.Series.nunique) ==2
data_train_binary_feature=data_train.loc[:,filtering_binary]

filtering_nobinary =data_train.apply(pd.Series.nunique) >2
filtering_nobinary ['TARGET']=True
data_train_nobinary_feature=data_train.loc[:,filtering_nobinary]

Let us compute a classification baseline on the non-binary features, against which later improvements can be measured.

In [5]:
# .values replaces as_matrix(), which was removed in pandas 1.0.
matrix_features = data_train_nobinary_feature.values[:,:-1]
labels = data_train_nobinary_feature['TARGET'].values
f1_score_list,confusion_matrix_list = random_forest_classifer(matrix_features,labels)

# The scoring metric is the F1 score, i.e. the harmonic mean of precision and recall.
F1_accuracy_str="F1 accuracy: %0.3f (+/- %0.3f)" % (np.mean(f1_score_list),
                                                    np.std(f1_score_list) * 2)
print(F1_accuracy_str)

F1 accuracy: 0.544 (+/- 0.007)


Let us load the normalized features computed in the NumericFeatureAnalysis.ipynb Notebook.

In [16]:
data_train_4_classification = pd.read_csv('data/dataframe_train_4_classification.csv',index_col=0) 

<h1><center> Unbalanced Classes </center></h1>
* To address the class imbalance, I decided to use an oversampling strategy rather than an undersampling one, which could discard useful information.
* In particular, I used the Adaptive Synthetic Sampling Approach (ADASYN), implemented in imbalanced-learn, one of the [scikit-learn contrib projects (scikit-learn compatible projects)](http://contrib.scikit-learn.org/imbalanced-learn/stable/auto_examples/over-sampling/plot_adasyn.html)
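
To make the idea behind ADASYN concrete, here is a simplified NumPy sketch. It is not the imbalanced-learn implementation used in this notebook. The function name `adasyn_sketch`, its parameters, and the toy data are assumptions made for illustration. As a further simplification, the minority point that seeds each synthetic sample is drawn with probability proportional to its difficulty, instead of computing a rounded per-point count.

```python
import numpy as np

def adasyn_sketch(X, y, minority_label=1, k=5, seed=0):
    """Toy ADASYN-style oversampling (illustrative, not imbalanced-learn's code)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    n_minority = minority_idx.size
    n_new = int((y != minority_label).sum()) - n_minority
    if n_new <= 0:
        return X, y
    if n_minority < 2:
        raise ValueError("need at least two minority samples to interpolate")

    # Euclidean distances from every minority point to every point.
    diffs = X[minority_idx][:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    dists[np.arange(n_minority), minority_idx] = np.inf  # ignore self-matches

    # Difficulty: share of majority points among the k nearest neighbours.
    nearest = np.argsort(dists, axis=1)[:, :k]
    difficulty = (y[nearest] != minority_label).mean(axis=1)
    if difficulty.sum() > 0:
        weights = difficulty / difficulty.sum()
    else:
        weights = np.full(n_minority, 1.0 / n_minority)

    # Nearest minority neighbours, used as interpolation partners.
    k_min = min(k, n_minority - 1)
    minority_nearest = np.argsort(dists[:, minority_idx], axis=1)[:, :k_min]

    # Harder minority points seed more synthetic samples.
    seeds = rng.choice(n_minority, size=n_new, p=weights)
    partners = minority_nearest[seeds, rng.integers(0, k_min, size=n_new)]
    lam = rng.random((n_new, 1))
    base = X[minority_idx[seeds]]
    synthetic = base + lam * (X[minority_idx[partners]] - base)

    X_res = np.vstack([X, synthetic])
    y_res = np.concatenate([y, np.full(n_new, minority_label, dtype=y.dtype)])
    return X_res, y_res

# Toy data: 40 majority points and 8 minority points in two dimensions.
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(40, 2)),
                   rng.normal(2.0, 1.0, size=(8, 2))])
y_toy = np.array([0] * 40 + [1] * 8)
X_res, y_res = adasyn_sketch(X_toy, y_toy)
print(X_res.shape, np.bincount(y_res))
```

Each synthetic point lies on the segment between a minority sample and one of its minority neighbours. Minority samples surrounded by many majority points receive more synthetic neighbours, which is the adaptive part of the approach.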

In [17]:
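# .values replaces as_matrix(), which was removed in pandas 1.0.
matrix_features = data_train_4_classification.values[:,:-1]
labels = data_train_4_classification['TARGET'].values
ada = ADASYN()
# fit_resample replaces fit_sample, which was removed from recent imbalanced-learn releases.
matrix_features_resampled, labels_resampled = ada.fit_resample(matrix_features, labels)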

In [18]:
print(matrix_features_resampled.shape)
print(labels_resampled.shape)

(146323, 230)
(146323,)


In [19]:
matrix_features = matrix_features_resampled
labels = labels_resampled
f1_score_list,confusion_matrix_list = random_forest_classifer(matrix_features,labels)
# The scoring metric is the F1 score, i.e. the harmonic mean of precision and recall.
F1_accuracy_str="F1 accuracy: %0.3f (+/- %0.3f)" % (np.mean(f1_score_list),
                                                    np.std(f1_score_list) * 2)
F1_accuracy = np.mean(f1_score_list)
print(F1_accuracy_str)

F1 accuracy: 0.834 (+/- 0.001)


In [20]:
print('Extract (randomly) one confusion matrix (Real vs Prediction) from the previous run: ')
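# sklearn.utils.shuffle returns a shuffled copy, so the result must be assigned.
confusion_matrix_list = shuffle(confusion_matrix_list,random_state=15)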
cnf_matrix = confusion_matrix_list[0]
dataframe=pd.DataFrame(cnf_matrix,index=['Real happy',' Real unhappy'],columns=['Predicted happy',' Predicted unhappy'])
pd.set_option('display.float_format', lambda x: '%.4f' % x)
print(dataframe)

Extract (randomly) one confusion matrix (Real vs Prediction) from the previous run: 
               Predicted happy   Predicted unhappy
Real happy              0.8686              0.1314
 Real unhappy           0.2019              0.7981
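
Each row of the matrix above sums to 1, which suggests that the counts are normalised per true class. Here is a minimal sketch of that normalisation on toy labels. The variable names are illustrative, and the exact normalisation performed inside `random_forest_classifer` is an assumption.

```python
import numpy as np

# Toy labels: 0 = happy, 1 = unhappy.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 1])

# Raw counts: rows are the real class, columns the predicted class.
counts = np.zeros((2, 2), dtype=int)
np.add.at(counts, (y_true, y_pred), 1)

# Divide each row by its total, so every row sums to 1.
row_normalised = counts / counts.sum(axis=1, keepdims=True)
print(row_normalised)
```

With this convention the diagonal holds the per-class recall, so the two classes can be compared directly even when their sizes differ.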


In [12]:
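# .values replaces as_matrix(), which was removed in pandas 1.0.
matrix_features_original = data_train.values[:,:-1]
labels_original = data_train['TARGET'].values
ada = ADASYN()
# fit_resample replaces fit_sample, which was removed from recent imbalanced-learn releases.
matrix_features_resampled_original, labels_resampled_original = ada.fit_resample(matrix_features_original, labels_original)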

In [13]:
matrix_features = matrix_features_resampled_original
labels = labels_resampled_original
f1_score_list,confusion_matrix_list = random_forest_classifer(matrix_features,labels)
# The scoring metric is the F1 score, i.e. the harmonic mean of precision and recall.
F1_accuracy_str="F1 accuracy: %0.3f (+/- %0.3f)" % (np.mean(f1_score_list),
                                                    np.std(f1_score_list) * 2)
F1_accuracy = np.mean(f1_score_list)
print(F1_accuracy_str)

F1 accuracy: 0.959 (+/- 0.000)


In [14]:
print('Extract (randomly) one confusion matrix (Real vs Prediction) from the previous run: ')
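# sklearn.utils.shuffle returns a shuffled copy, so the result must be assigned.
confusion_matrix_list = shuffle(confusion_matrix_list,random_state=15)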
cnf_matrix = confusion_matrix_list[0]
dataframe=pd.DataFrame(cnf_matrix,index=['Real happy',' Real unhappy'],columns=['Predicted happy',' Predicted unhappy'])
pd.set_option('display.float_format', lambda x: '%.4f' % x)
print(dataframe)

Extract (randomly) one confusion matrix (Real vs Prediction) from the previous run: 
               Predicted happy   Predicted unhappy
Real happy              0.9638              0.0362
 Real unhappy           0.0458              0.9542


* From the above results, the oversampling strategy improves the F1 score both for the cleaned numeric features and for the original training data.

* The F1 score of 0.959 on the original training data is due to overfitting: in fact, submitting predictions obtained with the same strategy on the test data to Kaggle yields an AUC of only 54%.

* I would like to point out that oversampling is also prone to overfitting.
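
One common source of such optimistic scores is that synthetic samples derived from training points end up in the held-out split when oversampling happens before the split. Whether this is what happens inside `random_forest_classifer` is an assumption, since that helper is not shown here. The sketch below splits first and oversamples only the training partition. The helper `oversample_minority` is a hypothetical stand-in that duplicates minority rows at random instead of running ADASYN.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def oversample_minority(X, y, minority_label=1, seed=0):
    """Random duplication of minority rows (hypothetical stand-in for ADASYN)."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    n_extra = int((y != minority_label).sum()) - minority_idx.size
    if n_extra <= 0:
        return X, y
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

# Toy data with a 10% minority class.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.array([1] * 20 + [0] * 180)

# Split first, so the test partition keeps the original class distribution.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=11)
train_idx, test_idx = next(splitter.split(X, y))

# Oversample the training partition only.
X_train, y_train = oversample_minority(X[train_idx], y[train_idx])
X_test, y_test = X[test_idx], y[test_idx]

print("train class counts:", np.bincount(y_train))
print("test class counts: ", np.bincount(y_test))
```

Scores measured on `X_test` then reflect the real class balance, which is closer to what a Kaggle submission would see.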


<h1><center> Fine Tuning of the Random Forest </center></h1>

* Let us fine-tune the random forest by searching the hyperparameter space for the best configuration.
* I use random search as the parameter search technique.

In [25]:

# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")


# this is the parameter space
max_features = int(np.sqrt(matrix_features_resampled.shape[1]))
number_samples_features = int(max_features/2)

# same seed to ensure replicability of the experiments 
np.random.seed(35)

max_features_selection = np.random.choice(range(1,max_features),
                                          number_samples_features,
                                          replace=False)
min_samples_split_selection = np.random.choice(range(2,100),10,
                                               replace=False)
min_samples_leaf_selection = np.random.choice(range(2,100),10,
                                              replace=False)
estimator_selection = np.random.choice(range(20,500),10,
                                       replace=False)
depth_selection = [3,5,10,20,None]



param_dist = { "n_estimators":estimator_selection,
               "max_depth": depth_selection,
              "max_features": max_features_selection,
              "min_samples_split": min_samples_split_selection,
              "min_samples_leaf": min_samples_leaf_selection,
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}



# build a classifier
classifier_to_tune = RandomForestClassifier()

# run randomized search
n_iter_search = 50
random_search = RandomizedSearchCV(classifier_to_tune, 
                                   param_distributions=param_dist,
                                   n_iter=n_iter_search,random_state=32)

rounds_train_validation_test = StratifiedShuffleSplit(n_splits=1, 
                                                      test_size=0.2,
                                                      random_state=11)

for train_validation_index, test_index in rounds_train_validation_test.split(matrix_features_resampled,
                                                                             labels_resampled):
        
        matrix_train_validation = matrix_features_resampled[train_validation_index]
        classes_train_validation = labels_resampled[train_validation_index]
        matrix_test = matrix_features_resampled[test_index]
        classes_test = labels_resampled[test_index]


rounds_train_validation  = StratifiedShuffleSplit(n_splits=1, 
                                                  test_size=0.3,
                                                  random_state=79)
        
for train_index, validation_index in rounds_train_validation.split(matrix_train_validation,
                                                                   classes_train_validation):
        
        matrix_train= matrix_train_validation[train_index]
        classes_train= classes_train_validation[train_index]
        matrix_validation = matrix_train_validation[validation_index]
        classes_validation = classes_train_validation[validation_index]

start = time()
random_search.fit(matrix_validation, classes_validation)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)

RandomizedSearchCV took 626.89 seconds for 50 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.754 (std: 0.001)
Parameters: {'n_estimators': 436, 'min_samples_split': 12, 'min_samples_leaf': 51, 'max_features': 12, 'max_depth': 20, 'criterion': 'entropy', 'bootstrap': False}

Model with rank: 2
Mean validation score: 0.753 (std: 0.002)
Parameters: {'n_estimators': 274, 'min_samples_split': 37, 'min_samples_leaf': 54, 'max_features': 12, 'max_depth': None, 'criterion': 'entropy', 'bootstrap': False}

Model with rank: 3
Mean validation score: 0.752 (std: 0.001)
Parameters: {'n_estimators': 100, 'min_samples_split': 83, 'min_samples_leaf': 54, 'max_features': 12, 'max_depth': None, 'criterion': 'gini', 'bootstrap': False}
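
When `scoring` is not passed, `RandomizedSearchCV` ranks candidates with the estimator's own `score` method, which is accuracy for classifiers. The validation scores printed above are therefore not directly comparable with the F1 values reported elsewhere in this notebook. Here is a minimal sketch on toy data, with an illustrative and much smaller parameter grid, that ranks candidates by F1 and reuses `best_params_` instead of copying them by hand.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Toy unbalanced data; sizes and weights are arbitrary illustration values.
X_toy, y_toy = make_classification(n_samples=300, n_features=10,
                                   weights=[0.8, 0.2], random_state=0)

# Small illustrative grid, not the parameter space explored in this notebook.
toy_param_dist = {
    "n_estimators": [10, 20, 30],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

toy_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=toy_param_dist,
    n_iter=5,
    scoring="f1",      # rank candidates by F1 instead of the default accuracy
    cv=3,
    random_state=32,
)
toy_search.fit(X_toy, y_toy)
print("best F1: %.3f" % toy_search.best_score_)
print("best parameters:", toy_search.best_params_)

# Reuse the selected parameters directly when building the final model.
tuned_model = RandomForestClassifier(random_state=0, **toy_search.best_params_)
tuned_model.fit(X_toy, y_toy)
```

Unpacking `best_params_` avoids transcription mistakes when the winning configuration is passed on to a later cell.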



In [26]:
matrix_features = matrix_features_resampled
labels = labels_resampled
f1_score_list,confusion_matrix_list = random_forest_classifer_params(matrix_features,labels,
                                                                      number_rounds = 3,
                                                                     test_size_value = 0.5,
                                                                     n_estimators= 436,
                                                                     min_samples_split= 12,
                                                                     min_samples_leaf= 51,
                                                                     max_features= 12,
                                                                     max_depth=20, 
                                                                     criterion= 'entropy',
                                                                     bootstrap=False)
# The scoring metric is the F1 score, i.e. the harmonic mean of precision and recall.
F1_accuracy_str="F1 accuracy: %0.3f (+/- %0.3f)" % (np.mean(f1_score_list),
                                                    np.std(f1_score_list) * 2)
F1_accuracy = np.mean(f1_score_list)
print(F1_accuracy_str)



F1 accuracy: 0.768 (+/- 0.005)


* The parameters selected by the fine-tuning do not offer any improvement over the initial settings (F1 of 0.768 against 0.834).
* I should explore the parameter space more thoroughly, but this requires time.

In [27]:
matrix_features = matrix_features_resampled
labels = labels_resampled
f1_score_list,confusion_matrix_list = random_forest_classifer_params(matrix_features,labels,
                                                                      number_rounds = 3,
                                                                     test_size_value = 0.5,
                                                                     n_estimators= 100,
                                                                     min_samples_split= 2,
                                                                     min_samples_leaf= 1,
                                                                     max_features= 'sqrt',
                                                                     max_depth=None, 
                                                                     criterion= 'gini',
                                                                     bootstrap=True)
# The scoring metric is the F1 score, i.e. the harmonic mean of precision and recall.
F1_accuracy_str="F1 accuracy: %0.3f (+/- %0.3f)" % (np.mean(f1_score_list),
                                                    np.std(f1_score_list) * 2)
F1_accuracy = np.mean(f1_score_list)
print(F1_accuracy_str)


F1 accuracy: 0.834 (+/- 0.001)
