<h1 style="color:rgb(0,120,191);">Anomaly Detection</h1>
<br/>
<p>
    In this tutorial we will explore a machine learning use case: <strong style="color:rgb(0,120,191);">anomaly detection in bank transactions</strong>. In general, the datasets used in anomaly detection are highly imbalanced, that is, one class is over-represented compared to the other. Our use case is based on a fraud detection dataset available on <a href="https://www.kaggle.com/mlg-ulb/creditcardfraud">Kaggle</a>. <br/> 
    Here, we will build a predictive model to detect fraudulent cases, detail the ways to approach such a problem, and explain the key metrics of a classification task.
</p>
<p> 
    <strong>Prerequisites</strong>:<br/> knowledge of the following machine learning concepts: feature, label, training data, validation data 
</p>

In [None]:
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import plotly.graph_objs as go
import plotly.tools as tls
import seaborn as sns
import tensorflow as tf
import s3fs
import boto3

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly import tools
from imblearn import under_sampling, over_sampling
from imblearn.pipeline import make_pipeline as make_pipeline_imb
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, ComplementNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
from sklearn.metrics import average_precision_score, roc_auc_score, f1_score, roc_curve, precision_recall_curve
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, BaggingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.decomposition import PCA, TruncatedSVD
from xgboost import XGBClassifier


init_notebook_mode(connected=True)
%matplotlib inline

In [None]:
# Read File from a public bucket in Amazon S3
bucket = 'ml-sa-us-east-1'   # 
prefix = 'sagemaker/fraud-detection'
file_name = 'creditcard.csv'
src_file = 's3://{}/{}/{}'.format(bucket, prefix, file_name)
print('Reading source file  from S3: {}'.format(src_file))
df = pd.read_csv(src_file)
print('src_file = {} read!'.format(src_file))
# df = pd.read_csv('https://s3.amazonaws.com/ml-sa-us-east-1/sagemaker/fraud-detection/creditcard.csv')

<h1 style="color:rgb(0,120,191);">1) Exploratory Data Analysis</h1>
<h3>Overall shape, missing values, memory usage, distributions</h3>

In [None]:
print('DataFrame of size {} with {} duplicate row(s)'.format(df.shape, df.duplicated().sum()))
def df_mem(dataf):
    dataf_mem = np.round(dataf.memory_usage().sum()/1024/1024, 2)
    print('Memory usage of the dataframe: {} MB'.format(dataf_mem))
df_mem(df)
df.head()


In [None]:
# Dropping duplicate rows
print('Among the duplicated rows, {} are class 0, and {} are class 1'.format(df.loc[df.duplicated()==True, 'Class'].value_counts()[0],
                                                                             df.loc[df.duplicated()==True, 'Class'].value_counts()[1]))
print("Let's remove the duplicates")
df = df.drop_duplicates(keep='first')
df_mem(df)
print('DataFrame of size {} with {} duplicate row(s)'.format(df.shape, df.duplicated().sum()))



In [None]:
print('Missing values for each column: {}\n'.format(list(df.isnull().sum()) ))
print('Distinct values: ')
print([ (col, len(df[col].unique())) for col in df.columns ])

In [None]:
df.describe(percentiles=[0.50, 0.95]).loc[['min', '50%','95%','max', 'mean', 'std']]

<p>
    We don't know what the attributes stand for, but they have different ranges and different variances. 
    <br/>In the modelling part, the numeric inputs will be rescaled so that they can be used with algorithms which compute distances (KNN, K-means, ...) <br/>
</p>

<h3>Evaluating the statistical correlation of the attributes</h3>

In [None]:
corr = df.corr()  # compute the correlation matrix only once
data = [go.Heatmap(z=corr.values, x=corr.index.tolist(), y=corr.index.tolist())]
layout = go.Layout(title='Correlation heatmap', width=800, height=400)
fig = go.Figure(data=data, layout=layout)
iplot(fig)

<p>
    From the heatmap, we deduce:
    <ul>
        <li>V7 and Amount are positively correlated, with a correlation coefficient of about 0.4</li>
        <li>The other attributes are weakly correlated with each other</li>
    </ul>
</p>

In [None]:
# Build the layout first so that the title and axis labels are kept in the final figure
layout = go.Layout(title='Amount VS V7',
                   showlegend=False, width=900, height=350,
                   margin=go.layout.Margin(l=50, r=50, b=50, t=50, pad=10),
                   xaxis=dict(title='V7'), yaxis=dict(title='Amount'))
data = [ go.Scattergl(x=df['V7'], y=df['Amount'], name='Amount VS V7', mode='markers') ]
fig = go.Figure(data=data, layout=layout)
iplot(fig)

<h3>Plotting the distributions</h3>

In [None]:
data = [go.Histogram(x = df['Class'])]
xtick_labs = ['No Fraud', 'Fraud Occurred']
layout = go.Layout(title='Distribution of the fraud label', 
                   showlegend=False, 
                   width = 800,
                   height=350, 
                   margin=go.layout.Margin(l=50, r=50, b=50, t=50, pad=10), 
                   xaxis=dict(autorange=False, range=[-1,2], showticklabels=True, ticktext=xtick_labs, tickvals=[0,1]))
fig = go.Figure(data=data, layout=layout)
iplot(fig)

<p>
The dataset is <strong>severely imbalanced</strong>! Out of 283726 samples, 283253 are negative (i.e. no fraud) and only 473 are positive (i.e. fraud occurred). <br/>The ratio is about 599:1
    <br/> Let's plot the distribution of the features before discussing how we will work with this imbalanced dataset.    
</p>
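<p>
    As a quick sanity check, the imbalance figures quoted above can be recomputed from the two class counts. The snippet below is only a sketch that hard-codes the counts stated in the text; in the notebook they would come from <code>df['Class'].value_counts()</code>.
</p>

```python
# Class counts quoted in the text above (after duplicate removal)
n_negative = 283253  # no fraud
n_positive = 473     # fraud

n_total = n_negative + n_positive
imbalance_ratio = n_negative / n_positive      # negatives per positive
positive_share = 100 * n_positive / n_total    # percentage of fraud cases

print('Total samples: {}'.format(n_total))
print('Imbalance ratio: {:.0f}:1'.format(imbalance_ratio))
print('Share of fraud cases: {:.2f}%'.format(positive_share))
```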

In [None]:
nCols = 5
fig = tools.make_subplots(rows=1, cols=nCols, print_grid=False)
for idx, col in enumerate(df.columns[:nCols],1):
    data = go.Box(y=df[col], name = col)
    fig.append_trace(data,1,idx)
iplot(fig)

In [None]:
# Extract the rows with fraud occurrences in df_fail
df_fail = df.loc[df['Class']==1]

# draw a box plot
def draw_box(colName, position):
    fig = tools.make_subplots(rows=1, cols=5, print_grid=False)
    data = go.Box(y=df[colName], name=colName)  # use the function argument, not the global loop variable `col`
    fig.append_trace(data,1,position)
    return fig    
    
# draw a scatter plot 
def draw_scatter(colName):
    fig = tools.make_subplots(rows=1, cols=1, print_grid=False)
    fig['layout'].update(title='fraud = f(' + colName + ')',
                         showlegend=False, width = 900, height=350,
                         margin=go.layout.Margin(l=50, r=50, b=50, t=50, pad=10))
    data = go.Scattergl(x=df[colName], y=df['Class'], name=colName, mode='markers', 
                        marker=dict(color=df['Class']))     
    fig.append_trace(data, 1, 1)
    iplot(fig)
    return fig

def draw_distrib(colName):
    fig = tools.make_subplots(rows=1, cols=2, print_grid=False)
    fig['layout'].update(title='Distribution of the column "' + '<b>' + colName + '</b>' + '"''<br> <span style="color:rgb(0,120,191);"> OVER ALL SAMPLES</span>                            <span style="color:rgb(255,140,0);"> OVER SAMPLES WHERE FRAUD=1</span>',
                         showlegend=False, width = 900, height=350,
                         margin=go.layout.Margin(l=50, r=50, b=50, t=50, pad=10))
    data = go.Histogram(x=df[colName], autobinx=True, name=colName)    
    fig.append_trace(data, 1, 1)
    data = go.Bar(x=df_fail[colName].value_counts().index, y=df_fail[colName].value_counts().values, name=colName) 
#     data = go.Histogram(x=df_fail[colName], autobinx=True, name=colName) 
    fig.append_trace(data, 1, 2)
    iplot(fig)
    return fig

In [None]:
for col in df.columns[:5]:
    draw_distrib(col)


In [None]:
for col in df.columns[5:10]:
    draw_distrib(col)

In [None]:
for col in df.columns[10:15]:
    draw_distrib(col)

In [None]:
for col in df.columns[15:20]:
    draw_distrib(col)

In [None]:
for col in df.columns[20:25]:
    draw_distrib(col)

In [None]:
for col in df.columns[25:]:
    draw_distrib(col)

<p>
    <ul>
    <li>Fraudulent operations occur at any time; they do not exhibit any periodic pattern</li>
    <li>The V1 distribution is skewed: it is concentrated in the range [-10, 2], plus some outliers in the range [-55, -10) which don't correlate with the fraudulent cases</li>
    <li>V24 and V26 look like Gaussian mixtures, and exhibit some outliers which don't correlate with the fraudulent cases</li>
    <li>All other attributes (V2 to V23, V25, V27 and V28) show a Gaussian distribution plus some outliers which don't correlate with the fraudulent cases</li>
    <li>95% of the amounts are below 365 (Unit); nevertheless, fraudulent cases occur for all amounts</li>
    </ul>
</p>

<h3>Outlier removal</h3>
<p>
As the outliers do not seem to be correlated with the fraud occurrences, they will be removed from the dataset by capping the extreme values at the 99th percentile of their column
</p>    

In [None]:
def cap_outlier(dataf, cols, quant, thd, label):
    """
    Cap the upper tail of the selected columns at their 99th percentile.
    dataf (pandas dataframe)
    cols (list) of the columns of interest of the dataframe
    quant (list) of the quantiles to extract (must contain 0.99)
    thd (float)  when the relative gap (Max-99%)/Max exceeds this thd we cap the values
    label (string) name of the label column (currently unused)
    """
    # Use the dataframe passed as argument rather than the global `df`
    perc = dataf.describe(percentiles=quant).loc[['99%','max']]
    rel_Err = (perc.loc['max']-perc.loc['99%']) / perc.loc['max']
    cols_to_cap = rel_Err[rel_Err > thd]
    # Only cap the columns requested by the caller
    cols_to_cap = [col for col in cols_to_cap.index.tolist() if col in cols]

    for col in cols_to_cap:
        capped_val = perc.loc['99%',col]
        dataf.loc[dataf[col]>capped_val, col] = capped_val
    return dataf   

In [None]:
# Pass the names of the numeric feature columns (this dataset has no 'attribute' columns)
df = cap_outlier(df, df.columns.drop(['Time', 'Class']).tolist(), [0.99, 1], 0.95, 'Class')

<h3>Rescaling numeric features (i.e. V1 to V28 and Amount)</h3>
<p>
    MinMaxScaler will be used. 
    The motivations for using this scaling include robustness to very small standard deviations of features and preserving zero entries in sparse data.
</p>    
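<p>
    To make the rescaling concrete, here is a minimal sketch of the min-max formula x' = (x - min) / (max - min), which is what MinMaxScaler applies per column with its default range [0, 1]. The helper <code>min_max_scale</code> and the sample amounts are hypothetical and only serve the illustration.
</p>

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D array to [0, 1] with x' = (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()
    if v_max == v_min:
        # A constant column carries no information; map it to zeros
        return np.zeros_like(values)
    return (values - v_min) / (v_max - v_min)

# Hypothetical transaction amounts, for illustration only
amounts = [0.0, 10.0, 50.0, 200.0]
scaled = min_max_scale(amounts)
print(scaled)
```

<p>
    The smallest value is mapped to 0, the largest to 1, and the relative spacing between values is preserved.
</p>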

In [None]:
# scale_df must be defined before it is called in the next cell
def scale_df(dataf, columns, Bprint):
    """Rescale the given columns of dataf to the range [0, 1] with a MinMaxScaler."""
    min_max_scaler = MinMaxScaler()
    scaled_val = min_max_scaler.fit_transform(dataf[columns])
    if Bprint:
        print('min_max_scaler.scale_ = {}'.format(min_max_scaler.scale_))
    for idx, col in  enumerate(columns): 
        dataf[col] = scaled_val[:,idx]
    return dataf

In [None]:
col_to_scale = df.columns.drop(['Time', 'Class']).tolist()
df = scale_df(df, col_to_scale, False)
df_mem(df)
df.describe()

<h3>Dimensionality reduction techniques and clustering</h3>

In [None]:
def plotPCA(dataf, label, method,n_comp, leg0, leg1, mean_rem):
    """
    dataf is the source dataframe
    label is a string with the name of the label
    method is a string indicating which part of the dataframe to keep, method='us' for an undersampled version, method='all' for the whole dataset
    n_comp is an integer, number of components of the PCA   
    leg0 is a string for the legend of the Negative class
    leg1 is a string for the legend of the Positive class
    mean_rem is a boolean value, True for removing mean of the columns, False otherwise
    """
    
    df_ano = dataf[dataf[label]==1]
    if method=='us':
        X = dataf.loc[dataf[label]==0]    # Keep only negative samples
        X = X.sample(n=df_ano.shape[0])
        X = pd.concat([X[:df_ano.shape[0]], df_ano], axis=0)
    else:
        X = dataf
    
   
    # Remember which rows belong to each class, then drop the label
    # so that it does not leak into the decomposition
    c0_iloc = np.where(X[label] == 0)[0]    # Positions of the Negative class
    c1_iloc = np.where(X[label] == 1)[0]    # Positions of the Positive class
    X = X.drop(label, axis=1)
    if mean_rem:
        X = X - np.mean(X, axis=0)

    # PCA Implementation
    t0 = time.time()
    pca = PCA(n_components=n_comp, random_state=42)
    X_red = pca.fit_transform(X)
    print('PCA explained variance = {:1.2f}% {:1.2f}%'.format(pca.explained_variance_ratio_[0]*100, pca.explained_variance_ratio_[1]*100))
    print('PCA explained variance = {:1.6f} {:1.6f}'.format(pca.explained_variance_[0], pca.explained_variance_[1]))
    print('singular_values_ = {:1.6f} {:1.6f}'.format(pca.singular_values_[0], pca.singular_values_[1]))
    print('mean_ = {:1.6f} {:1.6f}'.format(pca.mean_[0], pca.mean_[1]))
    t1 = time.time()
    print('PCA computed in {:.2f} s'.format(t1 - t0))

    lay = go.Layout(xaxis=dict(title='Principal Component 1'), yaxis=dict(title='Principal Component 2'), title='PCA Decomposition')
    data = [ go.Scattergl(x=X_red[c0_iloc,0], y=X_red[c0_iloc,1], mode='markers', marker=dict(color='blue'), name=leg0)]
    data.append( go.Scattergl(x=X_red[c1_iloc,0], y=X_red[c1_iloc,1], mode='markers', marker=dict(color='red'), name=leg1) )
    fig = go.Figure(data=data,layout=lay)
    iplot(fig)
    
    return fig, X_red, pca


In [None]:
fig, X_red, pca = plotPCA(df, label='Class', method='all',n_comp=2, leg0='No Fraud', leg1='Fraud', mean_rem=True)

<p> The data is quite complex: it seems it cannot be separated along either of the 2 principal components. Nevertheless, fraudulent data are much more widely spread along the Y axis than non-fraudulent data.
</p>

<h1 style="color:rgb(0,120,191);">2) Some Feature Engineering</h1>
<p>From the observations made before:</p>
<ul>
    <li>The numeric features have different ranges, thus we will rescale them to fit in the range [0,1]</li>
    <li>As fraud occurrences do not exhibit any temporal pattern, we will be able to shuffle the dataset as needed. In addition, the Time column won't be used in the features</li>
    <li>Attributes V2, V3, V28 have several levels with low frequencies, and chances are they wouldn't be available in the test set, so these attributes will be binned (i.e. partitioned into intervals)</li>
</ul>  

<h3>Adding the Principal Components to the dataframe</h3>

In [None]:
pca = PCA(n_components=2, random_state=42)
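# Fit on the features only: 'Class' is the label and 'Time' is not used as a feature
X_red = pca.fit_transform(df.drop(columns=['Time', 'Class']))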
df['PC1'] = X_red[:,0]
df['PC2'] = X_red[:,1]
df.columns

<h1 style="color:rgb(0,120,191);">3) The Testing Procedure</h1>
<p>
    As said before, the dataset is highly imbalanced: <strong>99.8% of the samples belong to class 0 (no fraud)</strong> and <strong>0.2% to the positive class (fraud occurred)</strong>. In absolute numbers, we have <strong>283253 negative samples and 473 positive samples.</strong>
</p>
<p>
    To evaluate the performance of the classifiers we design, we ideally need a test set which was not used to train the classifiers. To build such a test set from the sample dataset provided, we will divide the dataset into 2 parts which keep the same class imbalance ratio. 
</p> 
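<p>
    As an illustration of what "keeping the same class imbalance ratio" means, the sketch below runs a stratified split on toy data. The arrays <code>X_toy</code> and <code>y_toy</code> are hypothetical and are not part of the fraud dataset; the <code>stratify</code> argument of <code>train_test_split</code> is what preserves the ratio.
</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 samples with 2% positives (hypothetical, for illustration only)
rng = np.random.RandomState(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 20 + [0] * 980)

# stratify=y_toy asks for the same class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=42, stratify=y_toy)

print('Positive share in train: {:.1f}%'.format(100 * y_tr.mean()))
print('Positive share in test:  {:.1f}%'.format(100 * y_te.mean()))
```

<p>
    Without stratification, a random split of such rare positives could leave one of the two parts with noticeably fewer fraud cases than the other.
</p>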
    
<p>
    While this procedure will allow us to leverage the test set for detecting overfitting issues, we must keep in mind that we will be "losing" valuable information. So, <strong>at first</strong>, we will train classifiers on the train set and validate them on the test set.
</p>
<p>We could also try cross-validation, but for now let's stick to the train/test split! 
</p>    
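<p>
    For reference, a cross-validated evaluation could look like the sketch below. It is not used in the rest of this tutorial; it only shows, on hypothetical toy data, how <code>StratifiedKFold</code> keeps the class ratio in each fold. The choice of logistic regression and of average precision as the score is an assumption made for the example.
</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced data (hypothetical): 950 negatives, 50 positives
rng = np.random.RandomState(0)
n_neg, n_pos = 950, 50
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(n_neg, 2)),
                   rng.normal(2.0, 1.0, size=(n_pos, 2))])
y_toy = np.array([0] * n_neg + [1] * n_pos)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
fold_positives = []
for train_idx, val_idx in skf.split(X_toy, y_toy):
    clf = LogisticRegression(class_weight='balanced', max_iter=1000)
    clf.fit(X_toy[train_idx], y_toy[train_idx])
    # Probability of the positive class on the held-out fold
    probas = clf.predict_proba(X_toy[val_idx])[:, 1]
    fold_scores.append(average_precision_score(y_toy[val_idx], probas))
    fold_positives.append(int(y_toy[val_idx].sum()))

print('Positives per validation fold: {}'.format(fold_positives))
print('Mean average precision: {:.3f}'.format(np.mean(fold_scores)))
```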

<p>
    In the next sections, we will define functions for dividing the dataset, selecting features, selecting a model and evaluating it.
    Then we will assemble all of these into a pipeline.
</p>    


<h3>Dividing the dataset into train & test</h3>

In [None]:
def create_test_set(dataf, test_siz):
    x_train, x_test, y_train, y_test = train_test_split(dataf.drop(columns='Class'), dataf['Class'], test_size=test_siz, random_state=42, stratify=dataf['Class'])
    return x_train, x_test, y_train, y_test


<h3>Selecting features</h3>

In [None]:
def select_features(dataf, mode):
    if mode == 'not_pca':
        feats = dataf.columns[~dataf.columns.str.contains('PC')].drop(['Time', 'Class']).tolist()       
    elif isinstance(mode, dict):
        feats = mode['attr']   
    elif mode == 'all':
        feats = dataf.columns.drop(['Time', 'Class']).tolist()   # use the dataframe passed as argument
    elif mode == 'pca':
        feats = ['PC1', 'PC2']        
    print('List of {} features: {}'.format(len(feats),feats))
    return feats

<h1 style="color:rgb(0,120,191);">4) Methods for working with an imbalanced dataset</h1>
<p>
A classifier is built to minimize classification errors. Since the probability of instances belonging to the majority class is very high in this imbalanced dataset, the algorithms will tend to assign unseen observations to the majority class. In our case, a classifier which systematically predicts the majority class (i.e. no fraud) would easily achieve 99.8% accuracy!
</p>
<p>
    To address an imbalanced dataset, we can take two approaches:
    <ul>
        <li><strong>algorithm approach</strong>: by default, ML algorithms penalize False Positives (FP) and False Negatives (FN) equally, but some implementations provide a hyper-parameter which weights each example in inverse proportion to the frequency of its class</li><br/>
        <li><strong>data approach</strong>: it consists of resampling the data so as to lower the imbalance ratio, using under-sampling and/or over-sampling.
            <ul>
                <li>Under-sampling balances the dataset by reducing the size of the abundant class. But, since it removes observations from the original dataset, it might discard useful information</li>
                <li>Over-sampling balances the dataset by increasing the number of rare samples. No information from the original training set is lost, as all observations from the minority and majority classes are kept. But it is prone to overfitting</li>
            </ul>
        </li>
    </ul>
</p>
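<p>
    To illustrate the algorithm approach, the sketch below computes class weights that are inversely proportional to the class frequencies, using the formula n_samples / (n_classes * class_count) that scikit-learn documents for <code>class_weight='balanced'</code>. The helper <code>balanced_class_weights</code> is hypothetical; the counts are those of the full dataset quoted earlier.
</p>

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

# Class counts of the full dataset (0 = no fraud, 1 = fraud)
weights = balanced_class_weights({0: 283253, 1: 473})
print(weights)
print('Weight ratio fraud / no fraud: {:.0f}'.format(weights[1] / weights[0]))
```

<p>
    With such weights, misclassifying one fraud case costs about as much as misclassifying several hundred legitimate transactions, which compensates for the imbalance during training.
</p>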

<p>
    Now let's implement under-sampling and over-sampling techniques with the <strong>imbalanced-learn</strong> package before discussing the ML algorithms we'll use
</p>

<h3>Under-sampling the data</h3>
<p>
    Let's wrap 3 under_sampling methods in a custom function:
    <ul>
        <li>RandomUnderSampler is a naive way which randomly selects a given number of samples from the targeted class</li>
        <li>EditedNearestNeighbours removes samples of the majority class whose class differs from that of their nearest neighbors</li>
        <li>NearMiss-1 selects the samples of the majority class for which the average distance to the k nearest samples of the minority class is the smallest</li>
</ul>
</p>
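<p>
    To show what the naive strategy does, here is a minimal NumPy sketch of random under-sampling. It is not the imbalanced-learn implementation: the helper <code>random_under_sample</code>, its arguments and the toy arrays are hypothetical, and binary labels with 1 as the minority class are assumed.
</p>

```python
import numpy as np

def random_under_sample(X, y, ratio_maj_over_min, seed):
    """Keep all minority rows and a random subset of the majority rows.

    ratio_maj_over_min is the desired nb_majority / nb_minority after resampling.
    Assumes binary labels where 1 is the minority class.
    """
    rng = np.random.RandomState(seed)
    idx_min = np.flatnonzero(y == 1)
    idx_maj = np.flatnonzero(y == 0)
    n_keep = min(len(idx_maj), int(ratio_maj_over_min * len(idx_min)))
    # Draw majority rows without replacement
    idx_keep = rng.choice(idx_maj, size=n_keep, replace=False)
    idx_all = np.sort(np.concatenate([idx_keep, idx_min]))
    return X[idx_all], y[idx_all], idx_all

# Toy data (hypothetical): 990 negatives and 10 positives
X_toy = np.arange(2000, dtype=float).reshape(1000, 2)
y_toy = np.array([0] * 990 + [1] * 10)
X_us, y_us, kept = random_under_sample(X_toy, y_toy, ratio_maj_over_min=2, seed=42)
print('After under-sampling: {} negatives, {} positives'.format((y_us == 0).sum(), (y_us == 1).sum()))
```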

In [None]:
def under_samp(method, seed, r_Majas_over_min, x_train, y_train):
    """ 
    method (string) chooses one of the under-sampling methods available in the imbalanced-learn package
    seed (int) controls the initialization of the random generator (only used by the 'random' method)
    r_Majas_over_min (float) is passed as sampling_strategy; note that imbalanced-learn interprets a float
                     as nb_minority_class / nb_majority_class after resampling
    x_train, y_train are the training features and labels to resample
    """
    if method == 'random':
        us = under_sampling.RandomUnderSampler(sampling_strategy=r_Majas_over_min, random_state=seed, replacement=False)
    elif method == 'ENN': 
        # ENN is a cleaning method: it does not accept a target ratio, so the default strategy is used
        us = under_sampling.EditedNearestNeighbours(sampling_strategy='auto', n_neighbors=3, kind_sel='all', n_jobs=1)
    elif method =='NM':
        us = under_sampling.NearMiss(sampling_strategy=r_Majas_over_min, version=1, n_neighbors=3, n_neighbors_ver3=3, n_jobs=1)        
    # Resample data
    print('under-sampling data!')
    x_train, y_train = us.fit_resample(x_train, y_train)
    # Recent imbalanced-learn versions expose the selected indices as an attribute
    ind_us = us.sample_indices_
    print('After under_sampling x_train.shape = {}, y_train.shape= {}'.format(x_train.shape, y_train.shape))  
    
    return us, x_train, y_train, ind_us

<h3>Over-sampling the data</h3>
<p>
    With over-sampling, more information is retained since we don't delete any rows, unlike in random under-sampling.
Training will take more time, since no rows are eliminated. <br/>
    Let's wrap 3 over_sampling methods in a custom function:
    <ul>
        <li>RandomOverSampler is a naive way which generates new samples by randomly sampling, with replacement, the currently available samples of the minority class</li>
        <li>SMOTE generates new samples by creating synthetic samples from the minority class instead of creating copies. SMOTE picks a minority sample and one of its nearest neighbors in the minority class, then creates a synthetic point on the segment between them</li>
        <li>ADASYN also creates synthetic data points by interpolation, but it generates more of them around the minority samples that are harder to learn, i.e. those with more majority-class samples among their nearest neighbors</li>
</ul>
</p>
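<p>
    The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation: the helper <code>smote_like</code>, its arguments and the toy points are hypothetical.
</p>

```python
import numpy as np

def smote_like(X_min, n_new, k, seed):
    """Create n_new synthetic points by interpolating between a minority sample
    and one of its k nearest minority neighbors."""
    rng = np.random.RandomState(seed)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.randint(len(X_min))           # pick a minority sample
        neigh = neighbors[base, rng.randint(k)]  # pick one of its k neighbors
        gap = rng.uniform(0.0, 1.0)              # position along the segment
        synthetic[i] = X_min[base] + gap * (X_min[neigh] - X_min[base])
    return synthetic

# Four minority points at the corners of the unit square (toy data)
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like(X_min, n_new=6, k=2, seed=42)
print(X_new)
```

<p>
    Every synthetic point lies on a segment joining two existing minority samples, so the new samples stay inside the region already covered by the minority class.
</p>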

In [None]:
def over_samp(method, seed, r_Maj_over_minas, x_train, y_train):
    """ 
    method (string) chooses one of the over-sampling methods available in the imbalanced-learn package
    seed (int) controls the initialization of the random generator
    r_Maj_over_minas (float) is passed as sampling_strategy; note that imbalanced-learn interprets a float
                     as nb_minority_class after resampling / nb_majority_class
    x_train, y_train are the training features and labels to resample
    """
    if method == 'random':
        # RandomOverSampler always samples with replacement and has no `replacement` argument
        os = over_sampling.RandomOverSampler(sampling_strategy=r_Maj_over_minas, random_state=seed)
    elif method == 'SMOTE': 
        os = over_sampling.SMOTE(sampling_strategy=r_Maj_over_minas, random_state=seed, k_neighbors=5)
    elif method =='ADASYN':
        os = over_sampling.ADASYN(sampling_strategy=r_Maj_over_minas, random_state=seed, n_neighbors=5)

    # Resample data
    print('over-sampling data!')
    n_fail_y_train = len(y_train[y_train==1])   
    print('Before over_sampling x_train.shape = {}, y_train.shape= {}'.format(x_train.shape, y_train.shape))
    x_train, y_train = os.fit_resample(x_train, y_train)
    print('After over_sampling x_train.shape = {}, y_train.shape= {}'.format(x_train.shape, y_train.shape))
                                  
    return os, n_fail_y_train, x_train, y_train                                  

<h2>Choosing the classification algorithm</h2>
<p>
    We'll try Logistic Regression and Support Vector Machines, as these two support hyperparameters for balancing the dataset.<br/> 
    We will also try Naïve Bayes, Random Forest, Gradient Boosting and a feed-forward Neural Network.
    Let's wrap the classifier selection into a custom function which selects any one of these
</p>

In [None]:
def select_clf(method, seed, feat_list):
    """
    method (string) specifying the type of classifier to use
    seed (int) specifying the initialisation state of the random generator
    """
    if method == 'log_reg':
        clf = LogisticRegression(class_weight=None, random_state=seed, solver='saga', max_iter=1000, multi_class='ovr')
    elif method == 'log_reg_w':
        clf = LogisticRegression(class_weight='balanced', random_state=seed, solver='saga', max_iter=1000, multi_class='ovr')
    elif method == 'rCV_log_reg_w':
        clf1 = LogisticRegression(class_weight='balanced', random_state=seed, solver='saga', max_iter=1000, multi_class='ovr')
        log_reg_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
        clf = RandomizedSearchCV(clf1, log_reg_params, n_iter=4, cv=3, scoring='precision')  #'average_precision', 'precision'

    elif method == 'GNB':
        clf = GaussianNB()
    elif method == 'CNB':
        clf = ComplementNB()        
    elif method == 'SVM':
        clf = LinearSVC(loss='squared_hinge', dual=False, tol=0.0001, multi_class='ovr', class_weight=None, random_state=seed, max_iter=1000)
    elif method == 'SVM_w':
        clf = LinearSVC(loss='squared_hinge', dual=False, tol=0.0001, multi_class='ovr', class_weight='balanced', random_state=seed, max_iter=1000)
    elif method == 'RF':
        clf = RandomForestClassifier(n_estimators=100, max_depth=None, max_features=None, max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=None, 
                                     random_state=seed, verbose=0, warm_start=False, class_weight=None)
    elif method == 'RF_w':
        clf = RandomForestClassifier(n_estimators=100, max_depth=None, max_features=None, max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=None, 
                                     random_state=seed, verbose=0, warm_start=False, class_weight='balanced')        
    elif method == 'xgb':
        clf = XGBClassifier(objective='binary:logistic', scale_pos_weight=1)
    elif method == 'rCV_xgb':
        clf1 = XGBClassifier(objective='binary:logistic')      
        xgb_params = {"eta": [0.01,0.1, 1], 'scale_pos_weight': np.linspace(0.5,141626/237, 10), 'max_depth': [6, 10, 15, 20]}
        clf = RandomizedSearchCV(clf1, xgb_params, n_iter=4, cv=3, scoring='precision')        
    elif method == 'MLP':
        clf = MLPClassifier(hidden_layer_sizes=(len(feat_list), len(feat_list)), activation='tanh', solver='adam', alpha=1e-5, batch_size='auto', max_iter=200, shuffle=True, random_state=seed)
    elif method == 'bagMLP':
        clf1 = MLPClassifier(hidden_layer_sizes=(len(feat_list), len(feat_list)), activation='tanh', solver='adam', alpha=1e-5, batch_size='auto', max_iter=200, shuffle=True, random_state=seed)
        clf = BaggingClassifier(clf1)
    return clf


def ens_clf(ml_clf, seed, feat_list, x_train, y_train, div_factor):
    """
    ml_clf (string, list or dictionary) if list, create an ensemble of classifiers all trained on the same dataset
                                        if dictionary, create an ensemble of classifiers where each classifier is trained
                                        on a different portion of the negative samples concatenated with all the positive samples
                                        if string, just get the corresponding classifier from the select_clf function defined above
    div_factor (integer), each portion of the negative samples of the train set contains (div_factor*number_of_minority samples) samples 
    """
    # Select a classifier
    if isinstance(ml_clf,list):  # Ensemble classifiers
        print('ml_clf is a list, Ensembling classifiers')
        list_estimators = []
        for idx, cur_clf in enumerate(ml_clf):
            clf_temp = select_clf(cur_clf, seed, feat_list)       
            list_estimators.append((cur_clf, clf_temp))
        clf = VotingClassifier(estimators=list_estimators, voting='hard')
        
    elif isinstance(ml_clf, dict):
        print('ml_clf is a dict', 'Ensembling classifiers trained on different samples of the train set')
        # Split the negative samples of the train set into chunks of N * n_fail_y_train samples (N = div_factor).
        # Each training set contains all the samples of the positive class
        # and N * n_fail_y_train different samples of the negative class
        x_train_pos = x_train[y_train==1]
        x_train_neg = x_train[y_train==0]
        divide_factor = div_factor*x_train_pos.shape[0]
        print('Divide factor = {:.1f}'.format(divide_factor))
        nbClass2Train = x_train_neg.shape[0] // divide_factor + 1
        print('Number of classifiers to train = {:.2f}'.format(nbClass2Train))
#         pred_probs = np.zeros((divide_factor+n_fail_y_train, nbClass2Train))
        list_estimators = []    
        all_estim = list(ml_clf.keys())
        for idx in range((nbClass2Train)):
            cur_clf = all_estim[idx % len(all_estim)]
            if idx == nbClass2Train - 1:   # last chunk: take all the remaining negative samples
                x_train_cur = x_train_neg[idx*divide_factor:]
            else:
                x_train_cur = x_train_neg[idx*divide_factor: (idx+1)*divide_factor] 
            y_train_cur = np.zeros((x_train_cur.shape[0],))  
            x_train_cur = np.concatenate((x_train_cur, x_train_pos), axis=0)
            y_train_cur = np.concatenate((y_train_cur, np.ones((x_train_pos.shape[0],))), axis=0)             
            clf_temp = select_clf(cur_clf, seed, feat_list)
            print('Training Classifier {}: {}'.format(str(idx+1), cur_clf))
            clf_temp.fit(x_train_cur, y_train_cur)
            list_estimators.append((cur_clf+str(idx), clf_temp))       
        clf = VotingClassifier(estimators=list_estimators, voting='soft')   
        clf.estimators_ = [estim[1] for estim in list_estimators]
        clf.le_ = LabelEncoder().fit(y_train_cur)
        clf.classes_ = clf.le_.classes_
        
    else:
        clf = select_clf(ml_clf, seed, feat_list)

    if not isinstance(ml_clf, dict):        
        clf.fit(x_train, y_train)
        
    if ml_clf == 'rCV_log_reg_w':
        clf = clf.best_estimator_

    return clf


<h2>The problem with accuracy</h2>
<p>
    First, let's recall the definition of a confusion matrix:<br/>
    Class 0 (no fraud) is called the negative class, class 1 (fraud occurred) is called the positive class.<br/>
    Given a new sample, a trained classifier will predict either class 0 or class 1. The 4 cases described below can then happen
    <img src="confusion_matrix.png" alt="Image of confusion_matrix" width="300px"/>
</p>
<p>
    A false positive (FP) occurs when our classifier predicts a fraud whereas there is none; at worst, a legitimate transaction is blocked or investigated needlessly, which costs time and money. 
</p>
<p>On the other hand, a false negative (FN) occurs when the classifier predicts no fraud whereas a fraud occurs; the fraudulent transaction then goes through undetected, so the defrauded amount is lost and the fraud may only be discovered long afterwards. 
</p>
<p>So in this sense, FN may be worse than FP. In any case, we want our classifier to minimize both false positives (FP) and false negatives (FN).
</p> 
<p>
        As stated before, the high imbalance ratio of this dataset makes accuracy a misleading metric for evaluating the performance of our models. <br/>
    The accuracy is defined as follows:  <strong>accuracy = (TP + TN) / (TP + TN + FP + FN)</strong> <br/>
    In our case, a classifier which systematically predicts the majority class (i.e. no fraud) would easily achieve 99.8% accuracy!
</p><br/>
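<p>
    The figure above follows directly from the accuracy formula. The sketch below plugs in the confusion-matrix counts of a classifier that always predicts class 0, using the class counts of the full dataset; evaluating on the full dataset rather than on a held-out test set is a simplification made for the example.
</p>

```python
# Confusion-matrix counts for a classifier that always predicts class 0,
# evaluated on the full dataset (283253 negatives, 473 positives)
TP, FP = 0, 0
TN, FN = 283253, 473

accuracy = (TP + TN) / (TP + TN + FP + FN)
detected_share = TP / (TP + FN)   # share of positive cases that were detected

print('Accuracy: {:.2f}%'.format(100 * accuracy))
print('Share of positive cases detected: {:.2f}%'.format(100 * detected_share))
```

<p>
    The accuracy is very high although not a single positive case is detected, which is why other metrics are needed.
</p>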

<h2>Choosing the right metrics</h2>

<ul> 
    <li><strong>Precision</strong> measures the fraction of examples classified as positive that are truly positive: <strong>precision = TP / (TP + FP) </strong></li><br/>
    <li><strong>Recall</strong> (also called True Positive Rate) evaluates the ability of the classifier to label positive examples correctly: <strong>recall = TP / (TP + FN)</strong></li><br/>
    <li><strong>F-Score</strong> is the harmonic mean of precision and recall: <strong>F-Score = 2 x precision x recall / (precision + recall)</strong></li><br/>
    <li><strong>ROC curve</strong> plots the TPR (True Positive Rate, also called Recall) as a function of the FPR (False Positive Rate). 
While the TPR measures the fraction of positive examples that are correctly labeled, the FPR measures the fraction of negative examples that are misclassified as positive: <strong>FPR = FP / (FP + TN)</strong>
    </li><br/>
    <li><strong>PR curve</strong> plots precision as a function of recall. In our case, the number of negative examples greatly exceeds the number of positive examples. Consequently, a large change in the number of false positives can lead to only a small change in the false positive rate used in ROC analysis. Precision, on the other hand, by comparing false positives to true positives rather than true negatives, captures the effect of the large number of negative examples on the algorithm’s performance.
    </li>    

</ul>
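<p>
    The formulas above can be checked on a small worked example. The helper <code>classification_metrics</code> and the two sets of counts are hypothetical; the sketch assumes that none of the denominators is zero.
</p>

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics defined above from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also the True Positive Rate
    f_score = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)             # False Positive Rate
    return {'precision': precision, 'recall': recall, 'f_score': f_score, 'fpr': fpr}

# Hypothetical counts, for illustration only: same TP and FN, 16 times more FP
few_fp = classification_metrics(tp=80, fp=20, tn=99880, fn=20)
many_fp = classification_metrics(tp=80, fp=320, tn=99580, fn=20)

for name, m in [('few FP', few_fp), ('many FP', many_fp)]:
    print('{:8s} precision={:.2f} recall={:.2f} F={:.2f} FPR={:.4f}'.format(
        name, m['precision'], m['recall'], m['f_score'], m['fpr']))
```

<p>
    Going from 20 to 320 false positives divides the precision by four (0.80 to 0.20), while the FPR stays below 0.4% in both cases. This is the reason why the PR curve is more informative than the ROC curve on a highly imbalanced dataset.
</p>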

In [None]:
def compute_metrics(y_true, y_pred, y_probas):
    dic_metrics = {}
    dic_metrics['accuracy'] = np.round(100*accuracy_score(y_true, y_pred), 2) 
    dic_metrics['precision'] = np.round(100*precision_score(y_true, y_pred), 2) 
    dic_metrics['recall'] = np.round(100*recall_score(y_true, y_pred), 2)
    dic_metrics['f1_score'] = np.round(100*f1_score(y_true, y_pred), 2)
    
    if not isinstance(y_probas, str):
        precision_vec, recall_vec, thresholds_vec = precision_recall_curve(y_true, y_probas)     
        dic_metrics['avg_precision'] = np.round(100*average_precision_score(y_true, y_probas), 2)
        trace = go.Scattergl(x=recall_vec, y=precision_vec, 
                    mode='lines',
                    line=dict(width=2),
                    name='Precision-Recall curve')
    else:
        trace = go.Scattergl(x=[0], y=[0])
        dic_metrics['avg_precision'] = None
    return dic_metrics, trace

In [None]:
def plot_confusion_matrix(y_true, y_pred, classes, title=None, cmap=plt.cm.Blues):
    """
    Plot the confusion matrix (raw counts, no normalization)
    and return the matplotlib Axes it is drawn on.
    """
    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

In [None]:
def plotPRCurve(traces_dict, perfos_dict):
    
    layout = go.Layout(
#             title='Precision-Recall example: AUC={0:0.2f}'.format(average_precision),
            title='Precision-Recall train and test set',
            xaxis=dict(title='Recall'),
            yaxis=dict(title='Precision'),
            height=350, width=800)

    traces_to_plot = []    
    for key, val in traces_dict.items():
        if not isinstance(val, str):
            traces_to_plot.append(val)
    
    fig = go.Figure(data=traces_to_plot, layout=layout)
    iplot(fig)    
    return fig

<h1 style="color:rgb(0,120,191);">5) Creating pipelines</h1>
<p>
    Let's write a function that combines all the functions we have built so far.
</p>

In [None]:
def myPipeline(dataf, seed_used, test_size, feat_sel_mode, cross_val_meth, under_samp_meth, 
               over_samp_meth, r_Majas_over_min, r_Maj_over_minas, ml_clf='log_reg_w', plot_prc=True, plot_cfm=True):
    """
    Run the full workflow: train/test split, feature selection, optional resampling,
    training (with or without cross-validation) and evaluation.

    dataf (DataFrame) full dataset before splitting into train/test sets
    seed_used (integer) controls the initialization state of the random generator
    test_size (float) fraction of the dataset kept for the test set, in the range [0.0, 1.0]
    feat_sel_mode (string)  'binned' or 'not_binned'
    cross_val_meth (string) 'cv' to run 5-fold stratified cross-validation on the train set, any other value to skip it
    under_samp_meth (string)  method used for under-sampling, None for no under-sampling
    over_samp_meth (string) method used for over-sampling, None for no over-sampling
    r_Majas_over_min (float) ratio nb_majority_class after resampling / nb_minority_class, considered only if under_samp_meth is not None
    r_Maj_over_minas (float) ratio nb_majority_class / nb_minority_class after resampling, considered only if over_samp_meth is not None
    ml_clf  If string: classifier to use.
            If list of strings: list of the classifiers to combine as an ensemble.
            If dictionary: keys are strings naming the algorithm, values are floats giving the number of classifiers of that type.
    plot_prc (boolean) True to plot the Precision-Recall curve
    plot_cfm (boolean)  True to plot the confusion matrix

    Returns the tuple (out, perf_dict, traces_dict).
    """
    
    # Divide dataset into train and test sets according to test_size, preserving the class imbalance ratio
    x_train, x_test, y_train, y_test = create_test_set(dataf, test_size)  # stratified division
    n_fail_df = dataf.loc[dataf.Class==1].shape[0]   # nb of anomalies in the full dataset
    n_fail_y_train = len(y_train[y_train==1])   # nb of anomalies (minority class) in the train set
    print('Number of anomalies in train set = {}\nNumber of regular samples in train set = {}'.format(n_fail_y_train, x_train.shape[0]-n_fail_y_train ))
    print('Dataset divided into train and test')
    print('x_train.shape = {}\nx_test.shape = {}\ny_train.shape = {}\ny_test.shape = {}\n'.format(x_train.shape, x_test.shape, y_train.shape, y_test.shape))

    # Select features
    feat_list = select_features(dataf, mode=feat_sel_mode)  # use the function argument, not the global df
    x_train, x_test = x_train[feat_list], x_test[feat_list]
    print('After selecting features, x_train.shape = {}\nx_test.shape = {}\n'.format(x_train.shape, x_test.shape))

    out = {}
#     Cross-validation
    traces_dict = {}
    perf_dict = {}
    if cross_val_meth == 'cv':
        # random_state only has an effect (and is only accepted by recent scikit-learn) when shuffle=True
        skf = StratifiedKFold(5, shuffle=True, random_state=seed_used)
        perf_cv, traces_cv = {} , {}  # Initialize dictionnary for storing the performance metrics
        acc_list, rec_list, prec_list, f1_score_list = [], [], [], []
        for train_index, test_index in skf.split(x_train, y_train):
            x_train_cv, y_train_cv = x_train.iloc[train_index], y_train.iloc[train_index]
            x_val_cv, y_val_cv =  x_train.iloc[test_index], y_train.iloc[test_index]  
            # Resample the training fold only; the validation fold keeps its original class ratio
            if under_samp_meth is not None:
                us, x_train_cv, y_train_cv, ind_us = under_samp(under_samp_meth, seed_used, r_Majas_over_min, x_train_cv, y_train_cv)
                out['us'] = us
            if over_samp_meth is not None:
                os, n_fail_y_train, x_train_cv, y_train_cv = over_samp(over_samp_meth, seed_used, r_Maj_over_minas, x_train_cv, y_train_cv)
                out['os'] = os
            clf = ens_clf(ml_clf, seed_used, feat_list, x_train_cv, y_train_cv, 100)
            clf.fit(x_train_cv, y_train_cv)
            y_pred_val_cv = clf.predict(x_val_cv)
            try:
                y_proba_val_cv = clf.predict_proba(x_val_cv)[:,1]
            except AttributeError:  # classifier exposes no predict_proba
                y_proba_val_cv = 'Classifier has no "predict_proba" method'
            perf_cv, traces_cv = compute_metrics(y_val_cv, y_pred_val_cv, y_proba_val_cv)
            # Store each metric in its own list (precision and recall were swapped before)
            acc_list.append(perf_cv['accuracy'])
            prec_list.append(perf_cv['precision'])
            rec_list.append(perf_cv['recall'])
            f1_score_list.append(perf_cv['f1_score'])
        # Mean and standard deviation of each metric over the folds
        mean_cv = {'accuracy': np.mean(acc_list), 'precision': np.mean(prec_list),
                   'recall': np.mean(rec_list), 'f1_score': np.mean(f1_score_list)}
        std_cv = {'accuracy': np.std(acc_list), 'precision': np.std(prec_list),
                  'recall': np.std(rec_list), 'f1_score': np.std(f1_score_list)}
        print('Validation set: Cross-validation average = {}'.format(mean_cv))
        print('Validation set: Cross-validation std = {:}'.format(std_cv))
           
    else:       
        #      Resample data
        if under_samp_meth is not None:
            us, x_train, y_train, ind_us = under_samp(under_samp_meth, seed_used, r_Majas_over_min, x_train, y_train)
        if over_samp_meth is not None:
            os, n_fail_y_train, x_train, y_train = over_samp(over_samp_meth, seed_used, r_Maj_over_minas, x_train, y_train)
        clf = ens_clf(ml_clf, seed_used, feat_list, x_train, y_train, 100)    
        # Make predictions on train set
        y_pred_train = clf.predict(x_train)
        out['y_pred_train'] = y_pred_train
        # keep probabilities for the positive outcome only
        try:
            y_proba_train = clf.predict_proba(x_train)[:,1]
        except AttributeError:  # classifier exposes no predict_proba
            y_proba_train = 'Classifier has no "predict_proba" method'
            print('Train set: Classifier ' + str(ml_clf) + ' has no "predict_proba" method')
        out['y_proba_train'] = y_proba_train
        
        # Compute performance metrics on train set
        perf_dict['train'], traces_dict['train'] = compute_metrics(y_train, y_pred_train, y_proba_train)
        # Print train metrics
        print('train_perf = \n{}'.format(perf_dict['train']))
        if plot_cfm :         
            plot_confusion_matrix(y_train, y_pred_train, classes=['0', '1'], title='Confusion matrix on TRAIN SET')
            plt.show()
                
    # Make predictions on test set
    y_pred_test = clf.predict(x_test)
    # keep probabilities for the positive outcome only
    try:           
        y_proba_test = clf.predict_proba(x_test)[:,1]
    except AttributeError:  # classifier exposes no predict_proba
        y_proba_test = 'Classifier has no "predict_proba" method'
        print('Test set: Classifier ' + str(ml_clf) + ' has no "predict_proba" method')                       
    # Compute performance metrics on test set
    perf_dict['test'], traces_dict['test'] = compute_metrics(y_test, y_pred_test, y_proba_test)

    # Print test metrics
    print('test_perf = \n{}'.format(perf_dict['test']))

    # Plot Confusion matrices
    if plot_cfm :         
        plot_confusion_matrix(y_test, y_pred_test, classes=['0', '1'], title='Confusion matrix on TEST SET')
        plt.show()
 
    # Plot PR Curve
    if plot_prc:
        fig = plotPRCurve(traces_dict, perf_dict)
    else:
        fig = None
        
    out['feat_list'] = feat_list
    out['clf'] = clf
    out['x_train'] = x_train
    out['x_test'] = x_test    
    out['y_pred_test'] = y_pred_test
    out['y_proba_test'] = y_proba_test
    out['fig'] = fig
    
    return out, perf_dict, traces_dict
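
<p>
    The cross-validation branch of the pipeline collects one set of metrics per fold and then summarizes each metric by its own mean and standard deviation. Here is a minimal sketch of that aggregation step in isolation, assuming made-up per-fold values and a hypothetical helper <code>summarize_folds</code> (neither is part of the pipeline above):
</p>

```python
import numpy as np

def summarize_folds(fold_metrics):
    """Return per-metric mean and standard deviation over a list of fold dictionaries."""
    mean_cv, std_cv = {}, {}
    for name in fold_metrics[0]:
        # Gather the values of this metric across all folds
        values = [fold[name] for fold in fold_metrics]
        mean_cv[name] = float(np.mean(values))
        std_cv[name] = float(np.std(values))
    return mean_cv, std_cv

# Made-up metrics (in %) for three folds, using a subset of the keys
# found in the dictionaries returned by compute_metrics.
fold_metrics = [
    {"accuracy": 99.9, "precision": 80.0, "recall": 70.0},
    {"accuracy": 99.9, "precision": 90.0, "recall": 60.0},
    {"accuracy": 99.9, "precision": 85.0, "recall": 80.0},
]
mean_cv, std_cv = summarize_folds(fold_metrics)
print(mean_cv)
print(std_cv)
```

<p>
    Looping over the metric names keeps each metric's values together, which avoids mixing up parallel lists when computing the summary statistics.
</p>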

<h3>Test6: Bag of Classifiers with  <=0 samples</h3>

In [None]:
t1 = time.time()
out_t6, perf_dict_t6, traces_dict_t6 = myPipeline(df, 2, test_size=0.5, feat_sel_mode='not_pca', cross_val_meth='cv', under_samp_meth=None, 
               over_samp_meth=None, r_Majas_over_min=1, r_Maj_over_minas=1, ml_clf= 'CNB', plot_prc=True, plot_cfm=True )
t2 = time.time()
elapsedTime = t2-t1
print('\nelapsed time = {}'.format(np.round(elapsedTime,2)))

<h3>Test1: single classifier</h3>

In [None]:
t1 = time.time()
# cross_val_meth has no default value, so it must be passed explicitly (None skips cross-validation)
out_t1, perf_dict_t1, traces_dict_t1 = myPipeline(df, 2, test_size=0.5, feat_sel_mode='not_pca', cross_val_meth=None, under_samp_meth=None, 
               over_samp_meth=None, r_Majas_over_min=1, r_Maj_over_minas=1, ml_clf='RF', plot_prc=True, plot_cfm=True )
t2 = time.time()
elapsedTime = t2-t1
print('\nelapsed time = {}'.format(np.round(elapsedTime,2)))

<h3>Test2: Ensemble of classifiers</h3>

In [None]:
t1 = time.time()
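# cross_val_meth has no default value, so it must be passed explicitly (None skips cross-validation)
out_t2, perf_dict_t2, traces_dict_t2 = myPipeline(df, 42, test_size=0.5, feat_sel_mode='not_pca', cross_val_meth=None, under_samp_meth=None, 
               over_samp_meth=None, r_Majas_over_min=1, r_Maj_over_minas=1, ml_clf= ['xgb', 'MLP'], plot_prc=True, plot_cfm=True )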
t2 = time.time()
elapsedTime = t2-t1
print('\nelapsed time = {}'.format(np.round(elapsedTime,2)))

In [None]:
tr_dict = {'t1':traces_dict_t1['test'],'t2':traces_dict_t2['test']}
pe_dict = {'t1':perf_dict_t1['test'],'t2':perf_dict_t2['test']}
plotPRCurve(tr_dict, pe_dict)