# Exercise 15

# Fraud Detection

## Introduction

- Fraud Detection Dataset from Microsoft Azure: [data](http://gallery.cortanaintelligence.com/Experiment/8e9fe4e03b8b4c65b9ca947c72b8e463)

Fraud detection is one of the earliest industrial applications of data mining and machine learning. Fraud detection is typically handled as a binary classification problem, but the class population is unbalanced because instances of fraud are usually very rare compared to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically takes measures to block the accounts from transacting to prevent further losses. 

In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

import warnings
warnings.filterwarnings("ignore")

In [3]:
import pandas as pd

url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/15_fraud_detection.csv.zip'
df = pd.read_csv(url, index_col=0)
df.head()

Unnamed: 0,accountAge,digitalItemCount,sumPurchaseCount1Day,sumPurchaseAmount1Day,sumPurchaseAmount30Day,paymentBillingPostalCode - LogOddsForClass_0,accountPostalCode - LogOddsForClass_0,paymentBillingState - LogOddsForClass_0,accountState - LogOddsForClass_0,paymentInstrumentAgeInAccount,ipState - LogOddsForClass_0,transactionAmount,transactionAmountUSD,ipPostalCode - LogOddsForClass_0,localHour - LogOddsForClass_0,Label
0,2000,0,0,0.0,720.25,5.064533,0.421214,1.312186,0.566395,3279.574306,1.218157,599.0,626.16465,1.259543,4.745402,0
1,62,1,1,1185.44,2530.37,0.538996,0.481838,4.40137,4.500157,61.970139,4.035601,1185.44,1185.44,3.981118,4.921349,0
2,2000,0,0,0.0,0.0,5.064533,5.096396,3.056357,3.155226,0.0,3.314186,32.09,32.09,5.00849,4.742303,0
3,1,1,0,0.0,0.0,5.064533,5.096396,3.331154,3.331239,0.0,3.529398,133.28,132.729554,1.324925,4.745402,0
4,1,1,0,0.0,132.73,5.412885,0.342945,5.563677,4.086965,0.001389,3.529398,543.66,543.66,2.693451,4.876771,0


In [4]:
df.shape, df.Label.sum(), df.Label.mean()

((138721, 16), 797, 0.0057453449730033666)

In [5]:
# Work on a copy so the original DataFrame is not modified by mistake

base = df.copy()

How many 0 (*aka negatives*) and 1 (*aka positives*) labels are in the dataset?

In [6]:
print("negatives, positives")

(base.Label == 0).sum(), (base.Label == 1).sum()

negatives, positives


(137924, 797)

In [7]:
print("negatives, positives [fraction]")

(base.Label == 0).sum()/base.Label.count(), (base.Label == 1).sum()/base.Label.count()

negatives, positives [fraction]


(0.9942546550269966, 0.0057453449730033666)
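
With such a small share of positives, accuracy alone says little. As a minimal sketch using only the counts printed above, a hypothetical classifier that always predicts `0` (not one of the models in this notebook) reaches an accuracy equal to the share of negatives while detecting no fraud at all:

```python
# Counts taken from the cells above
negatives, positives = 137924, 797

# Hypothetical "always predict 0" classifier
true_negatives = negatives    # every negative is classified correctly
false_negatives = positives   # every fraud case is missed
true_positives = 0            # no positive is ever predicted

accuracy = (true_positives + true_negatives) / (negatives + positives)
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.4f}, recall = {recall:.1f}")
```

This is why the exercises below also report F1 and F-beta.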

# Exercise 15.1

Estimate a Logistic Regression, a Decision Tree and a Random Forest

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment about the results
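
For reference, a sketch of the standard F-beta definition in terms of precision $P$ and recall $R$ (these symbols are introduced here for illustration and do not appear in the notebook):

$$
F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R}
$$

With $\beta = 1$ this is the F1-Score. With $\beta = 10$ it becomes $F_{10} = 101\,P R / (100\,P + R)$, which stays close to $R$ unless $P$ is very small, so recall dominates the score.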

#### Pre-processing

In [8]:
X = base.drop(['Label'], axis = 1)
y = base['Label']

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, train_size = 0.7, test_size = 0.3)

#### Model implementations

In [10]:
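from sklearn.linear_model import LogisticRegression
# Note: DecisionTreeRegressor is a regression model; for this binary task
# DecisionTreeClassifier (imported above) is the usual choice. The regressor is
# kept so that the code matches the recorded results; its outputs are cast to int when scored.
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier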

models = {'LogisticRegression': LogisticRegression(),
          'DecisionTreeRegressor': DecisionTreeRegressor(),
          'RandomForestClassifier': RandomForestClassifier()}

In [11]:
y_pred = pd.DataFrame(index=pd.DataFrame(X_test).index, columns=models.keys())

for model in models.keys():
    models[model].fit(X_train, y_train)
    y_pred[model] = models[model].predict(X_test)

In [13]:
from sklearn import metrics

accuracy = []
f1_score = []
fbeta_score = []
name = []
target = []

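# Note: scikit-learn metrics take (y_true, y_pred). The calls below pass the predictions first,
# which leaves accuracy and F1 unchanged but makes fbeta_score (beta=10) weight precision instead of recall.
for i in np.arange(len(models)):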
    name.append(pd.DataFrame.from_dict(models).columns[i])
    target.append('none')
    accuracy.append(metrics.accuracy_score(y_pred.iloc[:,[i]].astype(int), y_test))
    f1_score.append(metrics.f1_score(y_pred.iloc[:,[i]].astype(int), y_test))
    fbeta_score.append(metrics.fbeta_score(y_pred.iloc[:,[i]].astype(int), y_test, beta=10))

results = pd.DataFrame([accuracy, f1_score, fbeta_score, name, target]).T
results.rename(columns = {0 : 'Accuracy', 1 : 'F1', 2 : 'FBeta', 3 : 'Name', 4: 'Target_perc'}, inplace = True)
results.set_index('Name', inplace=True)
results.sort_values(by=['FBeta'], inplace=True, ascending=False)
results

Name,Accuracy,F1,FBeta,Target_perc
RandomForestClassifier,0.994089,0.115108,0.455853,none
DecisionTreeRegressor,0.989091,0.140152,0.130916,none
LogisticRegression,0.994113,0.0,0.0,none


`LogisticRegression` scores `0` on both `F1` and `FBeta` because it does not predict any positive (`1`) on the test set.

`RandomForestClassifier` is the best model here in the sense that it achieves the highest `FBeta`, the metric that gives more weight to performance on the positive class (`1`).
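
One caveat when reading the `FBeta` column: scikit-learn's metric functions are defined as `metric(y_true, y_pred)`, while the scoring cell above passes the predictions first. The following minimal sketch uses a hand-written `fbeta` helper and toy labels (both hypothetical, not part of the notebook) to illustrate that swapping the arguments leaves F1 unchanged but changes F-beta when `beta = 10`:

```python
def fbeta(y_true, y_pred, beta):
    """F-beta for the positive class (label 1), computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Toy labels: 4 real positives, only 1 of them detected, no false alarms
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

f1_correct = fbeta(y_true, y_pred, beta=1)
f1_swapped = fbeta(y_pred, y_true, beta=1)
f10_correct = fbeta(y_true, y_pred, beta=10)   # dominated by recall (0.25)
f10_swapped = fbeta(y_pred, y_true, beta=10)   # dominated by precision (1.0)

print(f"F1:  correct={f1_correct:.3f}  swapped={f1_swapped:.3f}")
print(f"F10: correct={f10_correct:.3f}  swapped={f10_swapped:.3f}")
```

Swapping the arguments exchanges the roles of precision and recall, so a model with high precision and low recall looks much better under the swapped order.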

# Exercise 15.2

Under-sample the negative class using random-under-sampling

Which value of `target_percentage` did you choose?
How do the results change?

**Only apply under-sampling to the training set, evaluate using the whole test set**

_________

The `UnderSampling` function defined in [15-Unbalanced_Datasets](https://github.com/albahnsen/PracticalMachineLearningClass/blob/master/notebooks/15-Unbalanced_Datasets.ipynb) is used.
_________

In [15]:
def UnderSampling(X, y, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()

    n_samples_0_new =  n_samples_1 / target_percentage - n_samples_1
    n_samples_0_new_per = n_samples_0_new / n_samples_0

    filter_ = y == 0

    np.random.seed(seed)
    rand_1 = np.random.binomial(n=1, p=n_samples_0_new_per, size=n_samples)
    
    filter_ = filter_ & rand_1
    filter_ = filter_ | (y == 1)
    filter_ = filter_.astype(bool)
    
    return X[filter_], y[filter_]
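
The function above keeps all positives and keeps each negative with probability `n_samples_0_new / n_samples_0`, where `n_samples_0_new = n_samples_1 / target_percentage - n_samples_1`, so the resulting share of positives is only approximately `target_percentage`. As a minimal standard-library sketch of the same idea, the hypothetical helper `undersample_indices` below draws an exact number of negatives instead, on toy labels that are not from the dataset:

```python
import random

def undersample_indices(y, target_percentage=0.5, seed=None):
    """Return indices of all positives plus a random subset of negatives."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    # Number of negatives needed so that positives make up target_percentage
    n_neg_keep = min(len(neg), round(len(pos) / target_percentage - len(pos)))
    rng = random.Random(seed)
    kept_neg = rng.sample(neg, n_neg_keep)
    return sorted(pos + kept_neg)

# Toy labels: 20 positives and 980 negatives (2% positives)
y_toy = [1] * 20 + [0] * 980
idx = undersample_indices(y_toy, target_percentage=0.25, seed=42)
kept = [y_toy[i] for i in idx]

print(len(kept), sum(kept) / len(kept))
```

With 20 positives and a target of 0.25, 60 negatives are kept, giving 80 rows in total.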

#### Model implementations

In [17]:
y_pred_u = pd.DataFrame(index=pd.DataFrame(X_test).index, columns=models.keys())

accuracy_u = []
f1_score_u = []
fbeta_score_u = []
name_u = []
target_u = []

for target_percentage in [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]:
    X_u_train, y_u_train = UnderSampling(X_train.values, y_train, target_percentage, 42)
    
    for model in models.keys():
        models[model].fit(X_u_train, y_u_train)
        y_pred_u[model] = models[model].predict(X_test)
        
    for i in np.arange(len(models)):
        name_u.append(str(pd.DataFrame.from_dict(models).columns[i]))
        target_u.append(target_percentage)
        accuracy_u.append(metrics.accuracy_score(y_pred_u.iloc[:,[i]].astype(int), y_test))
        f1_score_u.append(metrics.f1_score(y_pred_u.iloc[:,[i]].astype(int), y_test))
        fbeta_score_u.append(metrics.fbeta_score(y_pred_u.iloc[:,[i]].astype(int), y_test, beta=10))

results_u = pd.DataFrame([accuracy_u, f1_score_u, fbeta_score_u, name_u, target_u]).T
results_u.rename(columns = {0 : 'Accuracy', 1 : 'F1', 2 : 'FBeta', 3 : 'Model', 4: 'Target_perc'}, inplace = True)
results_u.set_index('Model', inplace=True)
results_u.sort_values(by=['FBeta'], inplace=True, ascending=False)
results_u.head(10)

Model,Accuracy,F1,FBeta,Target_perc
RandomForestClassifier,0.990605,0.231827,0.223644,0.05
RandomForestClassifier,0.981306,0.158009,0.108196,0.1
RandomForestClassifier,0.969892,0.130465,0.079219,0.15
RandomForestClassifier,0.952399,0.0917011,0.0521035,0.2
RandomForestClassifier,0.933465,0.081592,0.0448087,0.25
DecisionTreeRegressor,0.952183,0.0778499,0.0442925,0.05
RandomForestClassifier,0.913569,0.0693402,0.0373615,0.3
LogisticRegression,0.970084,0.0532319,0.0329619,0.2
RandomForestClassifier,0.880073,0.0552716,0.0292553,0.35
DecisionTreeRegressor,0.915035,0.0520107,0.0280922,0.1


The table above is sorted by `FBeta`. `RandomForestClassifier` takes the top five rows, and its best score (`FBeta = 0.223644`) is obtained with `target_percentage = 0.05`.

`RandomForestClassifier` still holds up when the training set is rebalanced!

An interesting finding is that, even though `UnderSampling` discards negatives, increasing `target_percentage` does not lead to a better `FBeta`; for `RandomForestClassifier` the score drops at every step from `0.05` to `0.35`.

*Note: choosing `target_percentage` looks like a calibration process similar to a hyperparameter search.*

# Exercise 15.3

Same analysis using random-over-sampling

_________
The `OverSampling` function defined in [15-Unbalanced_Datasets](https://github.com/albahnsen/PracticalMachineLearningClass/blob/master/notebooks/15-Unbalanced_Datasets.ipynb) is used.
_________

In [18]:
import random
def OverSampling(X, y, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()

    n_samples_1_new =  -target_percentage * n_samples_0 / (target_percentage- 1)

    np.random.seed(seed)
    filter_ = np.random.choice(X[y == 1].shape[0], int(n_samples_1_new))
    # filter_ is within the positives, change to be of all
    filter_ = np.nonzero(y == 1)[0][filter_]
    
    filter_ = np.concatenate((filter_, np.nonzero(y == 0)[0]), axis=0)
    
    # in case that exist any Missing Value on the y vector.
    aux = pd.DataFrame(y)
    aux.fillna(0, inplace=True)
    y = aux.values
    
    return X[filter_], y[filter_]
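
The function above keeps every negative and draws `n_samples_1_new = target_percentage * n_samples_0 / (1 - target_percentage)` positives with replacement. A minimal standard-library sketch of that logic, with a hypothetical helper `oversample_indices` and toy labels that are not from the dataset:

```python
import random

def oversample_indices(y, target_percentage=0.5, seed=None):
    """Return indices of all negatives plus positives drawn with replacement."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    # Positives needed so that they make up target_percentage of the result
    n_pos_new = int(target_percentage * len(neg) / (1 - target_percentage))
    rng = random.Random(seed)
    drawn_pos = rng.choices(pos, k=n_pos_new)  # sampling with replacement
    return drawn_pos + neg

# Toy labels: 10 positives and 900 negatives
y_toy = [1] * 10 + [0] * 900
idx = oversample_indices(y_toy, target_percentage=0.25, seed=42)
resampled = [y_toy[i] for i in idx]

print(len(resampled), sum(resampled) / len(resampled))
```

Only 10 distinct positives exist in the toy data, so the 300 drawn positives are repeated copies of them; no new information is created.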

In [19]:
y_pred_o = pd.DataFrame(index=pd.DataFrame(X_test).index, columns=models.keys())

accuracy_o = []
f1_score_o = []
fbeta_score_o = []
name_o = []
target_o = []

for target_percentage in [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]:
    X_o_train, y_o_train = OverSampling(X_train.values, y_train, target_percentage, 42)
    
    for model in models.keys():
        models[model].fit(X_o_train, y_o_train)
        y_pred_o[model] = models[model].predict(X_test)
        
    for i in np.arange(len(models)):
        name_o.append(str(pd.DataFrame.from_dict(models).columns[i]))
        target_o.append(target_percentage)
        accuracy_o.append(metrics.accuracy_score(y_pred_o.iloc[:,[i]].astype(int), y_test))
        f1_score_o.append(metrics.f1_score(y_pred_o.iloc[:,[i]].astype(int), y_test))
        fbeta_score_o.append(metrics.fbeta_score(y_pred_o.iloc[:,[i]].astype(int), y_test, beta=10))

results_o = pd.DataFrame([accuracy_o, f1_score_o, fbeta_score_o, name_o, target_o]).T
results_o.rename(columns = {0 : 'Accuracy', 1 : 'F1', 2 : 'FBeta', 3 : 'Model', 4: 'Target_perc'}, inplace = True)
results_o.set_index('Model', inplace=True)
results_o.sort_values(by=['FBeta'], inplace=True, ascending=False)
results_o.head(10)

Model,Accuracy,F1,FBeta,Target_perc
RandomForestClassifier,0.993753,0.166667,0.378114,0.25
RandomForestClassifier,0.99368,0.170347,0.366286,0.3
RandomForestClassifier,0.993632,0.153355,0.344074,0.5
RandomForestClassifier,0.993656,0.142857,0.339496,0.2
RandomForestClassifier,0.99356,0.1625,0.339057,0.1
RandomForestClassifier,0.99356,0.1625,0.339057,0.15
RandomForestClassifier,0.99344,0.175227,0.331148,0.05
RandomForestClassifier,0.993512,0.15625,0.326017,0.4
RandomForestClassifier,0.99344,0.138801,0.298455,0.45
RandomForestClassifier,0.993416,0.127389,0.282715,0.35


The `FBeta` results above, obtained with the `OverSampling` method to balance the unbalanced dataset, show `RandomForestClassifier` taking all ten top rows, which makes it the best model for predicting a positive (`1`) in this fraud detection setting.

In addition, the best `target_percentage` (`0.25`) is neither the largest nor the smallest of the values tried.

Keep in mind that only `target_percentage` values from 0.05 to 0.5 were evaluated: replicating positives without a clear limit is expected to introduce an unwanted bias that would give misleading results.

# Exercise 15.4 (3 points)

Evaluate the results using SMOTE

Which parameters did you choose?

_________
The `SMOTE` function defined in [15-Unbalanced_Datasets](https://github.com/albahnsen/PracticalMachineLearningClass/blob/master/notebooks/15-Unbalanced_Datasets.ipynb) is used.
_________

In [20]:
def SMOTE(X, y, target_percentage=0.5, k=5, seed=None):
    # Calculate the NearestNeighbors
    from sklearn.neighbors import NearestNeighbors
    nearest_neighbour_ = NearestNeighbors(n_neighbors=k + 1)
    nearest_neighbour_.fit(X[y==1])
    nns = nearest_neighbour_.kneighbors(X[y==1], 
                                    return_distance=False)[:, 1:]
    
    
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()
    
    # New samples
    n_samples_1_new =  int(-target_percentage * n_samples_0 / (target_percentage- 1) - n_samples_1)
    
    # A matrix to store the synthetic samples
    new = np.zeros((n_samples_1_new, X.shape[1]))
    
    # Create seeds
    np.random.seed(seed)
    seeds = np.random.randint(1, 1000000, 3)
    
    # Select examples to use as base
    np.random.seed(seeds[0])
    sel_ = np.random.choice(y[y==1].shape[0], n_samples_1_new)
    
    # Define random seeds (2 per example)
    np.random.seed(seeds[1])
    nn__=[]
    # Select one random neighbor for each example to use as base
    for i, sel in enumerate(sel_):
        nn__.append(np.random.choice(nns[sel]))
    
    np.random.seed(seeds[2])
    steps = np.random.uniform(size=n_samples_1_new)  

    # For each selected examples create one synthetic case
    for i, sel in enumerate(sel_):
        # Select neighbor
        nn_ = nn__[i]
        step = steps[i]
        # Create new sample
        new[i, :] = X[y==1][sel] - step * (X[y==1][sel] - X[y==1][nn_])
    
    X = np.vstack((X, new))
    y = np.append(y, np.ones(n_samples_1_new))
    
    return X, y
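
The core of the function above is the interpolation `X[sel] - step * (X[sel] - X[nn_])`, which places each synthetic point on the segment between a positive sample and one of its `k` nearest positive neighbors. A minimal standard-library sketch of that single step, with a hypothetical helper `interpolate` and made-up two-feature points:

```python
import random

def interpolate(sample, neighbor, step):
    """Move from `sample` towards `neighbor` by a fraction `step` in [0, 1]."""
    return [s - step * (s - n) for s, n in zip(sample, neighbor)]

sample = [1.0, 4.0]      # a positive example (made-up values)
neighbor = [3.0, 0.0]    # one of its nearest positive neighbors (made-up values)

rng = random.Random(3)
step = rng.uniform(0, 1)
synthetic = interpolate(sample, neighbor, step)

print(step, synthetic)
```

With `step = 0` the synthetic point coincides with the sample, and with `step = 1` it coincides with the neighbor; any value in between lies on the segment joining them.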

In [22]:
y_pred_s = pd.DataFrame(index=pd.DataFrame(X_test).index, columns=models.keys())

accuracy_s = []
f1_score_s = []
fbeta_score_s = []
name_s = []
target_s = []
k_s = []

for target_percentage in [0.25, 0.5]:
    for k in [5, 15]:
        X_s_train, y_s_train = SMOTE(X_train.values, y_train, target_percentage, k, seed=3)

        for model in models.keys():
            models[model].fit(X_s_train, y_s_train)
            y_pred_s[model] = models[model].predict(X_test)
        
        for i in np.arange(len(models)):
            name_s.append(str(pd.DataFrame.from_dict(models).columns[i]))
            target_s.append(target_percentage)
            k_s.append(k)
            accuracy_s.append(metrics.accuracy_score(y_pred_s.iloc[:,[i]].astype(int), y_test))
            f1_score_s.append(metrics.f1_score(y_pred_s.iloc[:,[i]].astype(int), y_test))
            fbeta_score_s.append(metrics.fbeta_score(y_pred_s.iloc[:,[i]].astype(int), y_test, beta=10))

results_s = pd.DataFrame([accuracy_s, f1_score_s, fbeta_score_s, name_s, target_s, k_s]).T
results_s.rename(columns = {0 : 'Accuracy', 1 : 'F1', 2 : 'FBeta', 3 : 'Model', 
                            4: 'Target_perc', 5: 'k'}, inplace = True)
results_s.set_index('Model', inplace=True)
results_s.sort_values(by=['FBeta'], inplace=True, ascending=False)
results_s.head(10)

Model,Accuracy,F1,FBeta,Target_perc,k
RandomForestClassifier,0.993248,0.175953,0.30777,0.25,5
RandomForestClassifier,0.992575,0.162602,0.23962,0.25,15
RandomForestClassifier,0.992551,0.162162,0.23774,0.5,5
RandomForestClassifier,0.990965,0.160714,0.176977,0.5,15
DecisionTreeRegressor,0.985439,0.12931,0.100232,0.25,5
DecisionTreeRegressor,0.982027,0.0966184,0.0690067,0.25,15
DecisionTreeRegressor,0.984429,0.0898876,0.0688465,0.5,5
DecisionTreeRegressor,0.979167,0.0864067,0.058617,0.5,15
LogisticRegression,0.962035,0.0458937,0.0271534,0.25,5
LogisticRegression,0.98527,0.0285261,0.0234007,0.25,15


From the results above, `target_percentage` appears to be the most influential parameter of the `SMOTE` function: for both tree models the ranking depends first on `target_percentage` and then on `k`.

In addition, the best result (`FBeta = 0.30777`) is lower than the best one found with `OverSampling` (`0.378114`).

`RandomForestClassifier` again takes the top rows, and `DecisionTreeRegressor` follows with its results ranked **in the same order of `target_percentage` and `k` combinations**!

# Exercise 15.5 (3 points)

Evaluate the results using Adaptive Synthetic Sampling Approach for Imbalanced
Learning (ADASYN)

http://www.ele.uri.edu/faculty/he/PDFfiles/adasyn.pdf
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.ADASYN.html#rf9172e970ca5-1

In [23]:
from imblearn.over_sampling import ADASYN

ada = ADASYN(random_state = 42)
X_adasyn, y_adasyn = ada.fit_resample(X_train, y_train)

In [24]:
y_pred_adasyn = pd.DataFrame(index=pd.DataFrame(X_test).index, columns=models.keys())

for model in models.keys():
    models[model].fit(X_adasyn, y_adasyn)
    y_pred_adasyn[model] = models[model].predict(X_test)

In [52]:
accuracy_adasyn = []
f1_score_adasyn = []
fbeta_score_adasyn = []
name_adasyn = []
target_adasyn = []


for i in np.arange(len(models)):
    name_adasyn.append(str(pd.DataFrame.from_dict(models).columns[i]))
    target_adasyn.append('none')
    accuracy_adasyn.append(metrics.accuracy_score(y_pred_adasyn.iloc[:,[i]].astype(int), y_test))
    f1_score_adasyn.append(metrics.f1_score(y_pred_adasyn.iloc[:,[i]].astype(int), y_test))
    fbeta_score_adasyn.append(metrics.fbeta_score(y_pred_adasyn.iloc[:,[i]].astype(int), y_test, beta=10))

results_adasyn = pd.DataFrame([accuracy_adasyn, f1_score_adasyn, fbeta_score_adasyn, name_adasyn, target_adasyn]).T
results_adasyn.rename(columns = {0 : 'Accuracy', 1 : 'F1', 2 : 'FBeta', 3 : 'Model', 4: 'Target_perc'}, inplace = True)
results_adasyn.set_index('Model', inplace=True)
results_adasyn.sort_values(by=['FBeta'], inplace=True, ascending=False)
results_adasyn.head(10)

Model,Accuracy,F1,FBeta,Target_perc
RandomForestClassifier,0.992551,0.179894,0.253525,none
DecisionTreeRegressor,0.983396,0.0823373,0.061338,none
LogisticRegression,0.572506,0.019077,0.00976448,none


In [54]:
y_pred_a = pd.DataFrame(index=pd.DataFrame(X_test).index, columns=models.keys())

accuracy_a = []
f1_score_a = []
fbeta_score_a = []
name_a = []
k_a = []

for k_ in [5, 15]:
    ada = ADASYN(n_neighbors = k_,random_state = 42)
    X_adasyn, y_adasyn = ada.fit_resample(X_train, y_train)
    
    for model in models.keys():
        models[model].fit(X_adasyn, y_adasyn)
        y_pred_a[model] = models[model].predict(X_test)
        
    for i in np.arange(len(models)):
        name_a.append(str(pd.DataFrame.from_dict(models).columns[i]))
        k_a.append(k_)
        accuracy_a.append(metrics.accuracy_score(y_pred_a.iloc[:,[i]].astype(int), y_test))
        f1_score_a.append(metrics.f1_score(y_pred_a.iloc[:,[i]].astype(int), y_test))
        fbeta_score_a.append(metrics.fbeta_score(y_pred_a.iloc[:,[i]].astype(int), y_test, beta=10))

results_a = pd.DataFrame([accuracy_a, f1_score_a, fbeta_score_a, name_a, k_a]).T
results_a.rename(columns = {0 : 'Accuracy', 1 : 'F1', 2 : 'FBeta', 3 : 'Model', 4: 'k'}, inplace = True)
results_a.set_index('Model', inplace=True)
results_a.sort_values(by=['FBeta'], inplace=True, ascending=False)
results_a.head(10)

Model,Accuracy,F1,FBeta,k
RandomForestClassifier,0.992431,0.136986,0.206207,5
RandomForestClassifier,0.991181,0.152425,0.175007,15
DecisionTreeRegressor,0.981162,0.0862471,0.0607198,15
DecisionTreeRegressor,0.983156,0.0788436,0.0584434,5
LogisticRegression,0.570752,0.0196466,0.0100554,15
LogisticRegression,0.572506,0.019077,0.00976448,5


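`RandomForestClassifier` is the best model so far in terms of `FBeta`, no matter which method is used to balance the dataset.

One relevant aspect of `ADASYN` is that it seemed to run faster than the other methods, although processing time was not measured.

On the other hand, `FBeta` did not improve; in fact, it decreased when `n_neighbors` was varied. The `k = 5` row uses the same resampled data as the first run (the `LogisticRegression` scores are identical), so the drop from `0.253525` to `0.206207` for `RandomForestClassifier` likely reflects the unseeded randomness of the forest rather than an effect of `n_neighbors`.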

# Exercise 15.6 (3 points)

Compare and comment about the results

*Note: each result was already discussed in its own section.*

In summary, balancing an unbalanced dataset can give better results than leaving it as obtained; in particular, it prevents conceptual issues such as `LogisticRegression` always predicting negatives (`0`) when trained without balancing. The exception in these runs is `RandomForestClassifier`, whose unbalanced baseline in Exercise 15.1 (`FBeta = 0.455853`) remains higher than any of its resampled results.

On the other hand, which method works best depends on the case; the same method does not give the same result in every case. Calibrating `target_percentage` for each case (similar to a hyperparameter search) is advised.

For this **fraud detection** case, the best resampling method was `OverSampling`, with `FBeta = 0.378114` at `target_percentage = 0.25`. That does not imply that `SMOTE` or `ADASYN` cannot give better results; in fact, the apparently lower computational cost of `ADASYN` suggests that it can be calibrated further.
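
One way such a calibration of `ADASYN` could be set up is sketched below, under the assumption that imbalanced-learn's `sampling_strategy` parameter, given as a float, is the desired minority-to-majority ratio after resampling (to be checked against the library documentation). The helper `target_percentage_to_ratio` is hypothetical and converts the `target_percentage` used in Exercises 15.2 to 15.4 into that ratio:

```python
def target_percentage_to_ratio(target_percentage):
    """Convert a positive-class share p into a minority/majority ratio p / (1 - p)."""
    if not 0 < target_percentage < 1:
        raise ValueError("target_percentage must be strictly between 0 and 1")
    return target_percentage / (1 - target_percentage)

ratios = {p: target_percentage_to_ratio(p) for p in (0.05, 0.25, 0.5)}
for p, ratio in ratios.items():
    print(f"target_percentage={p:.2f} -> ratio={ratio:.4f}")

# Assumed usage with imbalanced-learn (not run here):
# ada = ADASYN(sampling_strategy=ratios[0.25], random_state=42)
# X_res, y_res = ada.fit_resample(X_train, y_train)
```

Because `ADASYN` generates samples adaptively, the resulting class share may only approximate the requested value.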