<h1><center>Customer Satisfaction Classification</center></h1>
In this notebook, I explain
* how I addressed the imbalance between the two classes,
* which strategy I used to classify happy and unhappy customers.

In [2]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.utils import shuffle
from sklearn.model_selection import RandomizedSearchCV

from boruta import boruta_py

import pandas as pd
import matplotlib.pyplot as plt
import copy
from time import time

from utility import random_forest_classifer
from utility import random_forest_classifer_params

from imblearn.over_sampling import ADASYN

In [2]:
data_train = pd.read_csv('data/train2.csv',index_col=0)

In [3]:
# Features are every column except the last one (TARGET); .values replaces as_matrix(), removed in pandas 1.0
matrix_features = data_train.values[:,:-1]
labels = data_train['TARGET'].values
f1_score_list,confusion_matrix_list = random_forest_classifer(matrix_features,labels)

# The scoring metric is the F1 score, the harmonic mean of precision and recall
F1_accuracy_str="F1 accuracy: %0.3f (+/- %0.3f)" % (np.mean(f1_score_list),
                                                    np.std(f1_score_list) * 2)
print(F1_accuracy_str)

F1 accuracy: 0.544 (+/- 0.010)
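
As a reminder of what the reported number means, here is a minimal sketch of the F1 score computed by hand, together with the same "mean (+/- 2 std)" summary used above. The helper `f1_from_counts` and the per-fold counts are hypothetical and only serve the illustration; they are not taken from `utility.py` or from the dataset.

```python
from statistics import mean, pstdev


def f1_from_counts(tp, fp, fn):
    """F1 score from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Hypothetical (tp, fp, fn) counts for three folds, for illustration only
fold_counts = [(40, 10, 20), (38, 12, 22), (42, 8, 18)]
fold_scores = [f1_from_counts(*counts) for counts in fold_counts]

# Same summary format as the notebook: mean and twice the standard deviation.
# pstdev is the population standard deviation, which is what np.std computes by default.
summary = "F1: %0.3f (+/- %0.3f)" % (mean(fold_scores), pstdev(fold_scores) * 2)
print(summary)
```

Because F1 ignores true negatives, it is less flattering than plain accuracy when one class dominates, which is why it is a reasonable choice for this unbalanced problem.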


In [4]:
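# Columns with exactly two distinct values are treated as binary features
filtering_binary = data_train.apply(pd.Series.nunique) == 2
data_train_binary_feature = data_train.loc[:,filtering_binary]

# Columns with more than two distinct values are the non-binary features; TARGET is kept as the label
filtering_nobinary = data_train.apply(pd.Series.nunique) > 2
filtering_nobinary['TARGET'] = True
data_train_nobinary_feature = data_train.loc[:,filtering_nobinary]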
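
To make the effect of the two masks concrete, the following sketch applies the same `nunique` logic to a toy frame. The column names `flag`, `amount` and `constant` are made up for the example and do not come from `train2.csv`.

```python
import pandas as pd

# Toy frame with hypothetical column names
toy = pd.DataFrame({
    "flag": [0, 1, 0, 1],
    "amount": [10.5, 3.2, 7.7, 1.0],
    "constant": [5, 5, 5, 5],
    "TARGET": [0, 0, 1, 0],
})

n_unique = toy.nunique()          # number of distinct values per column
binary_mask = n_unique == 2       # True for 'flag' and 'TARGET'
nobinary_mask = n_unique > 2      # True for 'amount' only
nobinary_mask["TARGET"] = True    # keep the label next to the non-binary features

binary_cols = list(toy.loc[:, binary_mask].columns)
nobinary_cols = list(toy.loc[:, nobinary_mask].columns)
print(binary_cols)
print(nobinary_cols)
```

Note that a column with a single distinct value, such as `constant` here, ends up in neither group, so this filtering also drops constant features as a side effect.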

Let us compute a classification baseline on the non-binary features, against which later improvements can be measured.

In [5]:
# .values replaces as_matrix(), which was removed in pandas 1.0
matrix_features = data_train_nobinary_feature.values[:,:-1]
labels = data_train_nobinary_feature['TARGET'].values
f1_score_list,confusion_matrix_list = random_forest_classifer(matrix_features,labels)

# The scoring metric is the F1 score, the harmonic mean of precision and recall
F1_accuracy_str="F1 accuracy: %0.3f (+/- %0.3f)" % (np.mean(f1_score_list),
                                                    np.std(f1_score_list) * 2)
print(F1_accuracy_str)

F1 accuracy: 0.544 (+/- 0.007)


Let us load the normalized features computed in the NumericFeatureAnalysis.ipynb Notebook.

In [6]:
matrix_nobinary_features_normalized = np.load('data/matrix_nobinary_features_normalized.npy')
labels = data_train_nobinary_feature['TARGET'].values
matrix_nobinary_features_normalized_with_labels=np.hstack([matrix_nobinary_features_normalized,labels.reshape(-1,1)])
data_train_nobinary_clean_normalized=pd.DataFrame(matrix_nobinary_features_normalized_with_labels,columns=data_train_nobinary_feature.columns.values)
print(data_train_nobinary_clean_normalized.shape)

(76020, 231)


<h1><center> UnBalanced Classes </center></h1>
* To address the class imbalance, I chose an oversampling strategy rather than undersampling, which could discard useful information.
* In particular, I used the Adaptive Synthetic Sampling approach (ADASYN), as implemented in [imbalanced-learn, a scikit-learn-contrib project](http://contrib.scikit-learn.org/imbalanced-learn/stable/auto_examples/over-sampling/plot_adasyn.html)
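
To give an intuition for what ADASYN does, here is a simplified NumPy sketch of its core idea: minority points surrounded by many majority neighbours receive more synthetic samples, and each synthetic sample is an interpolation between a minority point and one of its minority neighbours. This is not the imbalanced-learn implementation; the function name `adasyn_sketch`, its parameters and the toy data are assumptions made for the illustration.

```python
import numpy as np


def adasyn_sketch(X, y, minority_label=1, k=5, seed=0):
    """Simplified ADASYN-style oversampling (illustration only).

    Assumes more than k minority samples and no duplicated points.
    """
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    n_min = len(X_min)
    n_to_generate = (len(X) - n_min) - n_min  # majority count minus minority count

    # Step 1: share of majority samples among the k nearest neighbours
    # of every minority point (column 0 of the sort is the point itself).
    d_all = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
    nn_all = np.argsort(d_all, axis=1)[:, 1:k + 1]
    ratio = (y[nn_all] != minority_label).mean(axis=1)

    # Step 2: turn the ratios into a distribution and allocate the samples.
    if ratio.sum() == 0:
        weights = np.full(n_min, 1.0 / n_min)
    else:
        weights = ratio / ratio.sum()
    counts = np.round(weights * n_to_generate).astype(int)

    # Step 3: interpolate towards a random one of the k nearest minority neighbours.
    d_min = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    nn_min = np.argsort(d_min, axis=1)[:, 1:k + 1]
    synthetic = []
    for i, count in enumerate(counts):
        for _ in range(count):
            j = rng.choice(nn_min[i])
            lam = rng.random()
            synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))

    X_syn = np.array(synthetic).reshape(-1, X.shape[1])
    X_out = np.vstack([X, X_syn])
    y_out = np.concatenate([y, np.full(len(X_syn), minority_label)])
    return X_out, y_out


# Toy data: 200 majority points and 20 minority points in two dimensions
toy_rng = np.random.default_rng(42)
X_major = toy_rng.normal(0.0, 1.0, size=(200, 2))
X_minor = toy_rng.normal(1.5, 1.0, size=(20, 2))
X_toy = np.vstack([X_major, X_minor])
y_toy = np.array([0] * 200 + [1] * 20)

X_res, y_res = adasyn_sketch(X_toy, y_toy)
print(np.bincount(y_toy), np.bincount(y_res))
```

Because the per-point allocation is rounded, the two classes come out approximately, not exactly, balanced. The real `ADASYN` class handles more cases (ties, several classes, sampling strategies) and should be preferred in practice.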

In [7]:
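# .values replaces as_matrix(), which was removed in pandas 1.0
matrix_features = data_train_nobinary_clean_normalized.values[:,:-1]
labels = data_train_nobinary_clean_normalized['TARGET'].values
ada = ADASYN()
# fit_resample is the current name of fit_sample in imbalanced-learn
matrix_features_resampled, labels_resampled = ada.fit_resample(matrix_features, labels)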

In [9]:
print(matrix_features_resampled.shape)
print(labels_resampled.shape)

(145547, 230)
(145547,)


In [10]:
matrix_features = matrix_features_resampled
labels = labels_resampled
f1_score_list,confusion_matrix_list = random_forest_classifer(matrix_features,labels)
# The scoring metric is the F1 score, the harmonic mean of precision and recall
F1_accuracy_str="F1 accuracy: %0.3f (+/- %0.3f)" % (np.mean(f1_score_list),
                                                    np.std(f1_score_list) * 2)
F1_accuracy = np.mean(f1_score_list)
print(F1_accuracy_str)

F1 accuracy: 0.812 (+/- 0.001)


In [11]:
print('Extract (randomly) one confusion matrix (Real vs Prediction) from the previous run: ')
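# sklearn.utils.shuffle returns a shuffled copy, so the result must be assigned
confusion_matrix_list = shuffle(confusion_matrix_list,random_state=15)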
cnf_matrix = confusion_matrix_list[0]
dataframe=pd.DataFrame(cnf_matrix,index=['Real happy',' Real unhappy'],columns=['Predicted happy',' Predicted unhappy'])
pd.set_option('display.float_format', lambda x: '%.4f' % x)
print(dataframe)

Extract (randomly) one confusion matrix (Real vs Prediction) from the previous run: 
               Predicted happy   Predicted unhappy
Real happy              0.7962              0.2038
 Real unhappy           0.1721              0.8279
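
Each row of the matrix above sums to 1, which suggests that the counts were divided by the number of true samples in each class. The `utility` module is not shown here, so this is an assumption. Under that assumption, the normalization can be sketched as follows, with made-up counts:

```python
import numpy as np

# Hypothetical raw counts: rows are the true classes, columns the predictions
counts = np.array([[80, 20],
                   [10, 40]])

# Divide every row by its total so that each row sums to 1
rates = counts / counts.sum(axis=1, keepdims=True)
print(rates)
```

With this normalization the diagonal holds the per-class recall, so the two classes can be compared directly even when their sizes differ.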


In [12]:
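# Same oversampling applied to the original data; .values replaces as_matrix(), removed in pandas 1.0
matrix_features_original = data_train.values[:,:-1]
labels_original = data_train['TARGET'].values
ada = ADASYN()
# fit_resample is the current name of fit_sample in imbalanced-learn
matrix_features_resampled_original, labels_resampled_original = ada.fit_resample(matrix_features_original, labels_original)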

In [13]:
matrix_features = matrix_features_resampled_original
labels = labels_resampled_original
f1_score_list,confusion_matrix_list = random_forest_classifer(matrix_features,labels)
# The scoring metric is the F1 score, the harmonic mean of precision and recall
F1_accuracy_str="F1 accuracy: %0.3f (+/- %0.3f)" % (np.mean(f1_score_list),
                                                    np.std(f1_score_list) * 2)
F1_accuracy = np.mean(f1_score_list)
print(F1_accuracy_str)

F1 accuracy: 0.959 (+/- 0.000)


In [14]:
print('Extract (randomly) one confusion matrix (Real vs Prediction) from the previous run: ')
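# sklearn.utils.shuffle returns a shuffled copy, so the result must be assigned
confusion_matrix_list = shuffle(confusion_matrix_list,random_state=15)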
cnf_matrix = confusion_matrix_list[0]
dataframe=pd.DataFrame(cnf_matrix,index=['Real happy',' Real unhappy'],columns=['Predicted happy',' Predicted unhappy'])
pd.set_option('display.float_format', lambda x: '%.4f' % x)
print(dataframe)

Extract (randomly) one confusion matrix (Real vs Prediction) from the previous run: 
               Predicted happy   Predicted unhappy
Real happy              0.9638              0.0362
 Real unhappy           0.0458              0.9542


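* From the above results, oversampling improves the F1 score both for the cleaned, normalized numeric features (from 0.544 to 0.812) and for the original data (from 0.544 to 0.959).

* Note, however, that oversampling is prone to overfitting. In addition, the resampling above is applied to the whole dataset before `random_forest_classifer` evaluates it, so any split made inside that function can place synthetic samples in the evaluation data and their source points in the training data. The reported scores are therefore likely optimistic.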
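
One common way to limit this risk is to split the data first and oversample only the training part, so that the held-out samples stay untouched. The sketch below illustrates the ordering with plain random duplication as a stand-in for ADASYN; the function `split_then_oversample`, its parameters and the toy data are hypothetical, and the split is not stratified for the sake of brevity.

```python
import numpy as np


def split_then_oversample(X, y, test_fraction=0.25, minority_label=1, seed=0):
    """Hold out a test set first, then balance only the training set.

    Random duplication stands in for ADASYN; assumes the training part
    contains at least one minority sample.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]

    X_train, y_train = X[train_idx], y[train_idx]
    minority = np.flatnonzero(y_train == minority_label)
    majority = np.flatnonzero(y_train != minority_label)

    # Duplicate minority rows of the training set until both classes match
    n_extra = max(len(majority) - len(minority), 0)
    extra = rng.choice(minority, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y_train)), extra])

    return X_train[keep], y_train[keep], X[test_idx], y[test_idx]


# Toy data: 1000 distinct rows, 10% of them in the minority class
X_demo = np.arange(3000, dtype=float).reshape(1000, 3)
y_demo = np.zeros(1000, dtype=int)
y_demo[:100] = 1

X_tr, y_tr, X_te, y_te = split_then_oversample(X_demo, y_demo)
print(np.bincount(y_tr), np.bincount(y_te))
```

The test set keeps its original class proportions, so a score measured on it reflects the distribution the model would face in practice. With imbalanced-learn, placing the sampler and the classifier in an `imblearn.pipeline.Pipeline` gives the same ordering inside cross-validation.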


In [None]:
from sklearn.preprocessing import MinMaxScaler  # missing from the imports at the top

scaler = MinMaxScaler()