# Exercise 15

# Fraud Detection

**_Andrés Mauricio Obando Acevedo_**    
_am.obando_

## Introduction

- Fraud Detection Dataset from Microsoft Azure: [data](http://gallery.cortanaintelligence.com/Experiment/8e9fe4e03b8b4c65b9ca947c72b8e463)

Fraud detection is one of the earliest industrial applications of data mining and machine learning. Fraud detection is typically handled as a binary classification problem, but the class population is unbalanced because instances of fraud are usually very rare compared to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically takes measures to block the accounts from transacting to prevent further losses. 

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
#from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, fbeta_score

In [2]:
import pandas as pd

url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/15_fraud_detection.csv.zip'
df = pd.read_csv(url, index_col=0)
df.head()

Unnamed: 0,accountAge,digitalItemCount,sumPurchaseCount1Day,sumPurchaseAmount1Day,sumPurchaseAmount30Day,paymentBillingPostalCode - LogOddsForClass_0,accountPostalCode - LogOddsForClass_0,paymentBillingState - LogOddsForClass_0,accountState - LogOddsForClass_0,paymentInstrumentAgeInAccount,ipState - LogOddsForClass_0,transactionAmount,transactionAmountUSD,ipPostalCode - LogOddsForClass_0,localHour - LogOddsForClass_0,Label
0,2000,0,0,0.0,720.25,5.064533,0.421214,1.312186,0.566395,3279.574306,1.218157,599.0,626.16465,1.259543,4.745402,0
1,62,1,1,1185.44,2530.37,0.538996,0.481838,4.40137,4.500157,61.970139,4.035601,1185.44,1185.44,3.981118,4.921349,0
2,2000,0,0,0.0,0.0,5.064533,5.096396,3.056357,3.155226,0.0,3.314186,32.09,32.09,5.00849,4.742303,0
3,1,1,0,0.0,0.0,5.064533,5.096396,3.331154,3.331239,0.0,3.529398,133.28,132.729554,1.324925,4.745402,0
4,1,1,0,0.0,132.73,5.412885,0.342945,5.563677,4.086965,0.001389,3.529398,543.66,543.66,2.693451,4.876771,0


In [3]:
df.shape, df.Label.sum(), df.Label.mean()

((138721, 16), 797, 0.0057453449730033666)

In [4]:
df.groupby('Label')['Label'].count()

Label
0    137924
1       797
Name: Label, dtype: int64
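
The class counts above already show why accuracy alone can mislead on this dataset. As a minimal sketch (the label vector below is rebuilt only from the two counts printed above, not from the real features, and the always-negative baseline is hypothetical), a classifier that always predicts the majority class reaches very high accuracy while detecting no fraud at all:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Rebuild a label vector from the class counts reported above
n_neg, n_pos = 137924, 797
y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

# Hypothetical baseline: always predict the majority class (0, not fraud)
y_baseline = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_baseline)
# zero_division=0 avoids the warning raised when no positives are predicted
baseline_f1 = f1_score(y_true, y_baseline, zero_division=0)

print(f"accuracy = {baseline_accuracy:.4f}, F1 = {baseline_f1:.4f}")
```

This is why the exercises below report F1 and F-beta alongside accuracy.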

# Exercise 15.1

Estimate a Logistic Regression and a Decision Tree

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment on the results
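
For reference, the F-beta score combines precision $P$ and recall $R$ as $F_\beta = (1+\beta^2)\,P R / (\beta^2 P + R)$, so with $\beta = 10$ recall dominates the score. The sketch below checks this formula against `fbeta_score` on toy labels (the labels and the helper `f_beta` are invented for illustration and are not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Toy labels, invented for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

precision = precision_score(y_true, y_hat)  # 2 true positives / 3 predicted positives
recall = recall_score(y_true, y_hat)        # 2 true positives / 4 actual positives

def f_beta(p, r, beta):
    """F-beta from precision and recall; beta > 1 weights recall more."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

for beta in (1, 10):
    manual = f_beta(precision, recall, beta)
    library = fbeta_score(y_true, y_hat, beta=beta)
    print(f"beta={beta}: manual={manual:.4f}, sklearn={library:.4f}")
```

With $\beta = 10$ the score lands very close to the recall, which is the behaviour wanted when missing a fraud case is much more costly than a false alarm.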

In [5]:
#Import the libraries:
from sklearn.model_selection import train_test_split

In [6]:
#Metrics:
cols=['Model', 'Balance Type', 'Target percentage' ,'Accuracy','F1-score','Fbeta-Score']
metrics=pd.DataFrame(columns=cols,data=[])

In [7]:
#Define the features and the output:
X = df.drop('Label',axis=1)
y = df.Label

In [8]:
# Split the data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
n_samples = X_train.shape[0]

**Models**  
Defining the models to predict:

In [9]:
models = {'Log': LogisticRegression(solver='liblinear',C=1e9),
          'DTree': DecisionTreeClassifier()}

**Unbalanced**

In [10]:
for model in models.keys():
    models[model].fit(X_train, y_train)
    
y_pred = pd.DataFrame(index=X_test.index, columns=models.keys())
for model in models.keys():
    y_pred[model] = models[model].predict(X_test)



In [11]:
#Metrics
k = metrics.shape[0]
for model in models.keys():
    Accuracy = accuracy_score(y_test, y_pred[model])
    F1 = f1_score(y_test, y_pred[model])
    Fbeta = fbeta_score(y_test, y_pred[model], beta=10)
    metrics.loc[k] = [model,'Unblanced', 0.0 ,Accuracy,F1,Fbeta]
    k += 1
metrics

Unnamed: 0,Model,Balance Type,Target percentage,Accuracy,F1-score,Fbeta-Score
0,Log,Unblanced,0.0,0.993905,0.0,0.0
1,DTree,Unblanced,0.0,0.988204,0.111842,0.123379


# Exercise 15.2

Under-sample the negative class using random-under-sampling

Which value of target_percentage did you choose?
How do the results change?

**Only apply under-sampling to the training set, evaluate using the whole test set**

To test different values of target_percentage, let's use the function from class:

**Define a general function for Under Sampling**

In [12]:
def UnderSampling(X, y, target_percentage=0.5, seed=None):
    # Detect which label (0 or 1) is the minority class
    
    if (y == 0).sum() > (y == 1).sum():
        Minor = 1
        Mayor = 0
    else:
        Minor = 0
        Mayor = 1
    
    n_samples = y.shape[0]
    n_samples_may = (y == Mayor).sum()
    n_samples_min = (y == Minor).sum()

    n_samples_may_new =  n_samples_min / target_percentage - n_samples_min
    n_samples_may_new_per = n_samples_may_new / n_samples_may

    filter_ = y == Mayor

    np.random.seed(seed)
    rand_min = np.random.binomial(n=1, p=n_samples_may_new_per, size=n_samples)
    
    filter_ = filter_ & rand_min
    filter_ = filter_ | (y == Minor)
    filter_ = filter_.astype(bool)
    
    return X[filter_], y[filter_]
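
The function above keeps each majority row with probability `n_samples_may_new_per`, so the achieved class ratio only approximates `target_percentage`. As an alternative sketch (not the version used for the results in this notebook), the hypothetical helper `under_sample_exact` below draws exactly the required number of majority rows without replacement. It assumes binary labels held in NumPy arrays, with 1 as the minority class:

```python
import numpy as np

def under_sample_exact(X, y, target_percentage=0.5, seed=None):
    """Hypothetical helper: keep all minority rows (label 1) and an
    exact-size random subset of the majority rows (label 0)."""
    rng = np.random.default_rng(seed)
    idx_min = np.flatnonzero(y == 1)
    idx_may = np.flatnonzero(y == 0)

    # Solve n_min / (n_min + n_may_new) = target_percentage for n_may_new
    n_may_new = int(round(len(idx_min) / target_percentage - len(idx_min)))
    n_may_new = min(n_may_new, len(idx_may))  # cannot keep more rows than exist

    keep_may = rng.choice(idx_may, size=n_may_new, replace=False)
    keep = np.sort(np.concatenate([idx_min, keep_may]))
    return X[keep], y[keep]

# Toy data: 1000 rows with 5% positives (invented for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=int)
y_toy[:50] = 1

X_u_toy, y_u_toy = under_sample_exact(X_toy, y_toy, target_percentage=0.25, seed=1)
print(len(y_u_toy), y_u_toy.mean())  # rows kept and share of positives
```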

In [13]:
for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    X_u, y_u = UnderSampling(X_train, y_train, target_percentage, 1)
    y_pred = pd.DataFrame(index=X_test.index, columns=models.keys())
    for model in models.keys():
        models[model].fit(X_u, y_u)
        y_pred[model] = models[model].predict(X_test)
        # Metrics:
        Accuracy = accuracy_score(y_test, y_pred[model])
        F1 = f1_score(y_test, y_pred[model])
        Fbeta = fbeta_score(y_test, y_pred[model], beta=10)
        metrics.loc[k] = [model,'Under Sampling', target_percentage,Accuracy,F1,Fbeta]
        k += 1  
metrics

Unnamed: 0,Model,Balance Type,Target percentage,Accuracy,F1-score,Fbeta-Score
0,Log,Unblanced,0.0,0.993905,0.0,0.0
1,DTree,Unblanced,0.0,0.988204,0.111842,0.123379
2,Log,Under Sampling,0.1,0.991874,0.005348,0.00366
3,DTree,Under Sampling,0.1,0.918083,0.052552,0.33685
4,Log,Under Sampling,0.2,0.988619,0.015123,0.014556
5,DTree,Under Sampling,0.2,0.849578,0.043079,0.45481
6,Log,Under Sampling,0.3,0.977806,0.036053,0.06786
7,DTree,Under Sampling,0.3,0.789615,0.030599,0.413132
8,Log,Under Sampling,0.4,0.855455,0.027912,0.281941
9,DTree,Under Sampling,0.4,0.71139,0.026669,0.448075


# Exercise 15.3

Same analysis using random-over-sampling

In [14]:
def OverSampling(X, y, target_percentage=0.5, seed=None):
    # Detect which label (0 or 1) is the minority class
    if (y == 0).sum() > (y == 1).sum():
        Minor = 1
        Mayor = 0
    else:
        Minor = 0
        Mayor = 1    
    
    
    n_samples = y.shape[0]
    n_samples_min = (y == Minor).sum()
    n_samples_may = (y == Mayor).sum()

    n_samples_min_new =  -target_percentage * n_samples_may / (target_percentage- 1)
    
    np.random.seed(seed)
    filter_ = np.random.choice(X[y == Minor].shape[0], int(n_samples_min_new))
    # filter_ indexes rows within the minority class; map it to positions in the full array
    filter_ = np.nonzero(y == Minor)[0][filter_]
    
    filter_ = np.concatenate((filter_, np.nonzero(y == Mayor)[0]), axis=0)

    return X[filter_], y[filter_]
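
The expression for `n_samples_min_new` follows from requiring the minority share after resampling to equal the target percentage $p$. Writing $n_{may}$ for the majority count and $n_{min}^{new}$ for the minority count after over-sampling:

$$
p = \frac{n_{min}^{new}}{n_{min}^{new} + n_{may}}
\;\Longrightarrow\;
n_{min}^{new} = \frac{p \, n_{may}}{1 - p} = \frac{-p \, n_{may}}{p - 1}
$$

For example, with $p = 0.5$ the minority class is resampled up to the size of the majority class. Note that the function draws all $n_{min}^{new}$ minority rows with replacement, rather than keeping the original rows and adding duplicates on top.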

In [15]:
for target_percentage in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    X_o, y_o = OverSampling(np.array(X_train), np.array(y_train), target_percentage, 1)

    y_pred = pd.DataFrame(index=X_test.index, columns=models.keys())
    for model in models.keys():
        models[model].fit(X_o, y_o)
        y_pred[model] = models[model].predict(X_test)
        # Metrics:
        Accuracy = accuracy_score(y_test, y_pred[model])
        F1 = f1_score(y_test, y_pred[model])
        Fbeta = fbeta_score(y_test, y_pred[model], beta=10)
        metrics.loc[k] = [model,'Over Sampling', target_percentage,Accuracy,F1,Fbeta]
        k += 1  
metrics     

Unnamed: 0,Model,Balance Type,Target percentage,Accuracy,F1-score,Fbeta-Score
0,Log,Unblanced,0.0,0.993905,0.0,0.0
1,DTree,Unblanced,0.0,0.988204,0.111842,0.123379
2,Log,Under Sampling,0.1,0.991874,0.005348,0.00366
3,DTree,Under Sampling,0.1,0.918083,0.052552,0.33685
4,Log,Under Sampling,0.2,0.988619,0.015123,0.014556
5,DTree,Under Sampling,0.2,0.849578,0.043079,0.45481
6,Log,Under Sampling,0.3,0.977806,0.036053,0.06786
7,DTree,Under Sampling,0.3,0.789615,0.030599,0.413132
8,Log,Under Sampling,0.4,0.855455,0.027912,0.281941
9,DTree,Under Sampling,0.4,0.71139,0.026669,0.448075


# Exercise 15.4 (3 points)

Evaluate the results using SMOTE

Which parameters did you choose?

In [16]:
def SMOTE(X, y, target_percentage=0.5, k=5, seed=None):
    # Detect which label (0 or 1) is the minority class
    if (y == 0).sum() > (y == 1).sum():
        Minor = 1
        Mayor = 0
    else:
        Minor = 0
        Mayor = 1    
    
    n_samples = y.shape[0]
    n_samples_may = (y == Mayor).sum()
    n_samples_min = (y == Minor).sum()
       
    # New samples
    n_samples_min_new =  int(-target_percentage * n_samples_may / (target_percentage- 1) - n_samples_min)
    
    # A matrix to store the synthetic samples
    new = np.zeros((n_samples_min_new, X.shape[1]))
    
    # Create seeds
    np.random.seed(seed)
    seeds = np.random.randint(1, 1000000, 3)
    
    # Select examples to use as base
    np.random.seed(seeds[0])
    sel_ = np.random.choice(y[y==Minor].shape[0], n_samples_min_new)
    
    # Define random seeds (2 per example)
    np.random.seed(seeds[1])
    nn__ = np.random.choice(k, n_samples_min_new)
    np.random.seed(seeds[2])
    steps = np.random.uniform(size=n_samples_min_new)  

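    # For each selected example create one synthetic case
    X_min = X[y == Minor]  # compute the minority subset once, outside the loop
    for i, sel in enumerate(sel_):
        # NOTE: nn_ indexes one of the first k minority rows; it is not a
        # true nearest neighbour of the selected example
        nn_ = nn__[i]
        step = steps[i]
        # Interpolate between the selected example and the chosen row
        new[i, :] = X_min[sel] - step * (X_min[sel] - X_min[nn_])
    
    X = np.vstack((X, new))
    # Label the synthetic rows with the minority class (1 in this dataset)
    y = np.append(y, np.full(n_samples_min_new, Minor))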
    
    return X, y
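
In the original SMOTE formulation, each synthetic point is interpolated between a minority example and one of its k nearest minority neighbours, whereas the function above takes the second point from the first `k` minority rows. As a hedged sketch of the nearest-neighbour variant (the helper name `smote_knn` and the toy data are hypothetical, and this is not the code that produced the results in this notebook), using scikit-learn's `NearestNeighbors`:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_knn(X_min, n_new, k=5, seed=None):
    """Hypothetical helper: create n_new synthetic rows from the minority rows
    X_min by interpolating towards one of each base row's k nearest neighbours."""
    rng = np.random.default_rng(seed)

    # Ask for k + 1 neighbours: assuming no duplicate rows, each point is
    # returned as its own nearest neighbour, so the first column is dropped
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_min), size=n_new)  # base example per new row
    pick = rng.integers(0, k, size=n_new)           # which neighbour to use
    step = rng.uniform(size=(n_new, 1))             # interpolation weight in [0, 1)

    chosen = neighbours[base, pick]
    return X_min[base] + step * (X_min[chosen] - X_min[base])

# Toy minority sample, invented for illustration
rng = np.random.default_rng(0)
X_min_toy = rng.normal(size=(20, 2))
X_new_toy = smote_knn(X_min_toy, n_new=30, k=5, seed=1)
print(X_new_toy.shape)
```

Because every synthetic row is a convex combination of two minority rows, it always lies on the segment between them.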

In [17]:
for target_percentage in [0.25, 0.5]:
    # Try two values of k (number of candidate neighbours)
    for r in [5, 15]:
        X_sm, y_sm = SMOTE(np.array(X_train), np.array(y_train), target_percentage, r, seed=3)
        y_pred = pd.DataFrame(index=X_test.index, columns=models.keys())
        for model in models.keys():
            models[model].fit(X_sm, y_sm)
            y_pred[model] = models[model].predict(X_test)
            # Metrics:
            Accuracy = accuracy_score(y_test, y_pred[model])
            F1 = f1_score(y_test, y_pred[model])
            Fbeta = fbeta_score(y_test, y_pred[model], beta=10)
            metrics.loc[k] = [model,'SMOTE_r' + str(r), target_percentage,Accuracy,F1,Fbeta]
            k += 1  
metrics

Unnamed: 0,Model,Balance Type,Target percentage,Accuracy,F1-score,Fbeta-Score
0,Log,Unblanced,0.0,0.993905,0.0,0.0
1,DTree,Unblanced,0.0,0.988204,0.111842,0.123379
2,Log,Under Sampling,0.1,0.991874,0.005348,0.00366
3,DTree,Under Sampling,0.1,0.918083,0.052552,0.33685
4,Log,Under Sampling,0.2,0.988619,0.015123,0.014556
5,DTree,Under Sampling,0.2,0.849578,0.043079,0.45481
6,Log,Under Sampling,0.3,0.977806,0.036053,0.06786
7,DTree,Under Sampling,0.3,0.789615,0.030599,0.413132
8,Log,Under Sampling,0.4,0.855455,0.027912,0.281941
9,DTree,Under Sampling,0.4,0.71139,0.026669,0.448075


# Exercise 15.5 (3 points)

Evaluate the results using Adaptive Synthetic Sampling Approach for Imbalanced
Learning (ADASYN)

http://www.ele.uri.edu/faculty/he/PDFfiles/adasyn.pdf
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.ADASYN.html#rf9172e970ca5-1
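
In outline, ADASYN generates more synthetic samples around the minority examples that are harder to learn. With $m_l$ majority and $m_s$ minority examples, a balance parameter $\beta \in [0, 1]$ (unrelated to the $\beta$ of the F-beta score), and $\Delta_i$ the number of majority examples among the $K$ nearest neighbours of minority example $x_i$, the paper linked above allocates the synthetic samples as follows (a summary only; see the paper for the exact definitions):

$$
G = (m_l - m_s)\,\beta, \qquad
r_i = \frac{\Delta_i}{K}, \qquad
\hat{r}_i = \frac{r_i}{\sum_j r_j}, \qquad
g_i = \hat{r}_i \, G
$$

Each $x_i$ then receives about $g_i$ synthetic points, each interpolated between $x_i$ and a randomly chosen minority neighbour. In the `ADASYN` class used below, the neighbourhood size is set through the `n_neighbors` parameter, which is left at its default here.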

In [18]:
from imblearn.over_sampling import ADASYN

ada = ADASYN(random_state=42)

In [19]:
X_res, y_res = ada.fit_resample(X_train, y_train)
y_pred = pd.DataFrame(index=X_test.index, columns=models.keys())
for model in models.keys():
    models[model].fit(X_res, y_res)
    y_pred[model] = models[model].predict(X_test)
    # Metrics:
    Accuracy = accuracy_score(y_test, y_pred[model])
    F1 = f1_score(y_test, y_pred[model])
    Fbeta = fbeta_score(y_test, y_pred[model], beta=10)
    metrics.loc[k] = [model,'ADASYN', 0,Accuracy,F1,Fbeta]
    k += 1  
metrics     

Unnamed: 0,Model,Balance Type,Target percentage,Accuracy,F1-score,Fbeta-Score
0,Log,Unblanced,0.0,0.993905,0.0,0.0
1,DTree,Unblanced,0.0,0.988204,0.111842,0.123379
2,Log,Under Sampling,0.1,0.991874,0.005348,0.00366
3,DTree,Under Sampling,0.1,0.918083,0.052552,0.33685
4,Log,Under Sampling,0.2,0.988619,0.015123,0.014556
5,DTree,Under Sampling,0.2,0.849578,0.043079,0.45481
6,Log,Under Sampling,0.3,0.977806,0.036053,0.06786
7,DTree,Under Sampling,0.3,0.789615,0.030599,0.413132
8,Log,Under Sampling,0.4,0.855455,0.027912,0.281941
9,DTree,Under Sampling,0.4,0.71139,0.026669,0.448075


# Exercise 15.6 (3 points)

Compare and comment on the results

Let's compare the results from the different models.

In [20]:
table=pd.pivot_table(metrics,index=["Model","Balance Type","Target percentage"])
table

Model,Balance Type,Target percentage,Accuracy,F1-score,Fbeta-Score
DTree,ADASYN,0.0,0.983136,0.085308,0.129538
DTree,Over Sampling,0.2,0.989165,0.123675,0.127199
DTree,Over Sampling,0.3,0.989296,0.115523,0.116347
DTree,Over Sampling,0.4,0.990257,0.138996,0.13106
DTree,Over Sampling,0.5,0.989493,0.117431,0.116385
DTree,Over Sampling,0.6,0.989668,0.11257,0.109158
DTree,Over Sampling,0.7,0.989711,0.109641,0.105534
DTree,Over Sampling,0.8,0.989689,0.109434,0.105531
DTree,SMOTE_r15,0.25,0.981978,0.076148,0.122128
DTree,SMOTE_r15,0.5,0.979051,0.071636,0.132246


To see which type of sampling gives the best accuracy, let's sort the results above for each model, starting with the Logistic Regression.

In [21]:
table2 = table.loc['Log'].sort_values(by='Accuracy', ascending=False)
table2

Balance Type,Target percentage,Accuracy,F1-score,Fbeta-Score
Unblanced,0.0,0.993905,0.0,0.0
Under Sampling,0.1,0.991874,0.005348,0.00366
Under Sampling,0.2,0.988619,0.015123,0.014556
Over Sampling,0.2,0.981738,0.052154,0.082648
Under Sampling,0.3,0.977806,0.036053,0.06786
SMOTE_r15,0.25,0.959282,0.044103,0.14886
Over Sampling,0.3,0.951308,0.045396,0.181089
SMOTE_r5,0.25,0.949102,0.032392,0.132926
Over Sampling,0.4,0.915964,0.041366,0.268359
Under Sampling,0.4,0.855455,0.027912,0.281941


In this case, the highest accuracy belongs to the unbalanced model, but its F1 and $F_\beta$ scores are zero, which means that it does not detect any positive (fraud) cases at all. This is caused by the severe class imbalance: predicting the negatives is easy because the vast majority of cases (about 99.4%) are negative.  
The second model on the list, under-sampling with a target percentage of 10%, has the next-highest accuracy, but its F1 score is still close to zero, so it is not the best model either.

In [22]:
np.max(table2['F1-score'])

0.05215419501133786
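
`np.max` returns only the value, so the matching row still has to be looked up by eye. A small sketch with `idxmax`, using a hand-copied subset of the logistic regression rows shown above (the frame `log_rows` is built here just for illustration):

```python
import pandas as pd

# A few logistic regression rows copied from the sorted table above
log_rows = pd.DataFrame(
    {
        "Balance Type": ["Unblanced", "Under Sampling", "Over Sampling", "SMOTE_r15"],
        "Target percentage": [0.0, 0.1, 0.2, 0.25],
        "Accuracy": [0.993905, 0.991874, 0.981738, 0.959282],
        "F1-score": [0.0, 0.005348, 0.052154, 0.044103],
    }
).set_index(["Balance Type", "Target percentage"])

# idxmax returns the index label of the row with the highest F1-score
best_label = log_rows["F1-score"].idxmax()
print(best_label, log_rows.loc[best_label, "F1-score"])
```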

So a better choice is over-sampling with a target percentage of 20%, which has the highest F1 score (0.0522) and an accuracy of 0.9817.

Now, let's see what happens with the Decision Tree model:

In [23]:
table3 = table.loc['DTree'].sort_values(by='Accuracy', ascending=False)
table3

Balance Type,Target percentage,Accuracy,F1-score,Fbeta-Score
Over Sampling,0.4,0.990257,0.138996,0.13106
Over Sampling,0.7,0.989711,0.109641,0.105534
Over Sampling,0.8,0.989689,0.109434,0.105531
Over Sampling,0.6,0.989668,0.11257,0.109158
Over Sampling,0.5,0.989493,0.117431,0.116385
Over Sampling,0.3,0.989296,0.115523,0.116347
Over Sampling,0.2,0.989165,0.123675,0.127199
Unblanced,0.0,0.988204,0.111842,0.123379
SMOTE_r5,0.25,0.985036,0.082999,0.111933
SMOTE_r5,0.5,0.984927,0.08971,0.122717


In this case, the best accuracy is given by the model that uses over-sampling with a target percentage of 40%, which also has a comparatively high F1 score (0.1390). The unbalanced model also performs reasonably well here, even at predicting the positives. But does the over-sampled model also have the best F1 score?

In [24]:
np.max(table3['F1-score'])

0.138996138996139

In this case, over-sampling with a target percentage of 40% is the best model, because it has both the best accuracy and the best F1 score.