# Exercise 15

# Fraud Detection

## Introduction

- Fraud Detection Dataset from Microsoft Azure: [data](http://gallery.cortanaintelligence.com/Experiment/8e9fe4e03b8b4c65b9ca947c72b8e463)

Fraud detection is one of the earliest industrial applications of data mining and machine learning. It is typically handled as a binary classification problem, but the classes are heavily imbalanced because fraud is very rare relative to the overall volume of transactions. Moreover, when fraudulent transactions are discovered, the business typically blocks the affected accounts from transacting to prevent further losses.

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier,  DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA

from collections import Counter

from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

import random

import warnings
warnings.filterwarnings('ignore')

In [2]:
url = 'https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/master/datasets/15_fraud_detection.csv.zip'
#df = pd.read_csv(url, index_col=0)
df = pd.read_csv('15_fraud_detection.csv', index_col=0,)
df.head()

Unnamed: 0,accountAge,digitalItemCount,sumPurchaseCount1Day,sumPurchaseAmount1Day,sumPurchaseAmount30Day,paymentBillingPostalCode - LogOddsForClass_0,accountPostalCode - LogOddsForClass_0,paymentBillingState - LogOddsForClass_0,accountState - LogOddsForClass_0,paymentInstrumentAgeInAccount,ipState - LogOddsForClass_0,transactionAmount,transactionAmountUSD,ipPostalCode - LogOddsForClass_0,localHour - LogOddsForClass_0,Label
0,2000,0,0,0.0,720.25,5.064533,0.421214,1.312186,0.566395,3279.574306,1.218157,599.0,626.16465,1.259543,4.745402,0
1,62,1,1,1185.44,2530.37,0.538996,0.481838,4.40137,4.500157,61.970139,4.035601,1185.44,1185.44,3.981118,4.921349,0
2,2000,0,0,0.0,0.0,5.064533,5.096396,3.056357,3.155226,0.0,3.314186,32.09,32.09,5.00849,4.742303,0
3,1,1,0,0.0,0.0,5.064533,5.096396,3.331154,3.331239,0.0,3.529398,133.28,132.729554,1.324925,4.745402,0
4,1,1,0,0.0,132.73,5.412885,0.342945,5.563677,4.086965,0.001389,3.529398,543.66,543.66,2.693451,4.876771,0


In [4]:
df.shape, df.Label.sum(), df.Label.mean()

((138721, 16), 797, 0.0057453449730033666)

# Exercise 15.1

Estimate a Logistic Regression and a Decision Tree

Evaluate using the following metrics:
* Accuracy
* F1-Score
* F_Beta-Score (Beta=10)

Comment about the results
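
The three metrics react very differently to class imbalance. As a reminder of the definitions, here is a minimal sketch that computes them from confusion-matrix counts. The counts are hypothetical illustration values, not results from this dataset. Since F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), a large beta such as 10 makes the score track recall much more closely than precision.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def fbeta(tp, fp, fn, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    precision, recall = precision_recall(tp, fp, fn)
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical counts: 200 frauds, of which the model catches 150,
# at the cost of 3000 false alarms among 34000 legitimate transactions.
tp, fn, fp, tn = 150, 50, 3000, 31000

print('Accuracy:', accuracy(tp, fp, fn, tn))
print('F1-Score:', fbeta(tp, fp, fn, beta=1))
print('F_Beta-Score (Beta=10):', fbeta(tp, fp, fn, beta=10))
```

With these counts precision is low and recall is high, so F1 stays small while the beta=10 score stays close to the recall.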

In [5]:
results = pd.DataFrame(columns=('Modelo','Técnica de balanceo','Accuracy','F1-Score','F_Beta-Score'))
results

Unnamed: 0,Modelo,Técnica de balanceo,Accuracy,F1-Score,F_Beta-Score


In [6]:
X = df.drop(['Label'], axis=1)
y = df['Label']

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### Logistic Regression

In [7]:
# train a logistic regression model

logreg = LogisticRegression(solver='liblinear', C=1e9).fit(X_train, y_train)

# make predictions for testing set
y_pred = logreg.predict(X_test)


# calculate testing accuracy
print('Accuracy:', metrics.accuracy_score(y_test, y_pred))
print('F1-Score:', metrics.f1_score(y_test, y_pred))
print('F_Beta-Score (Beta=10):', metrics.fbeta_score(y_test, y_pred, beta=10))


Accuracy: 0.9940313139759522
F1-Score: 0.0
F_Beta-Score (Beta=10): 0.0


In [8]:
results.loc[len(results)]=['LogisticRegression','-',metrics.accuracy_score(y_test, y_pred),metrics.f1_score(y_test, y_pred),metrics.fbeta_score(y_test, y_pred, beta=10)] 
results

Unnamed: 0,Modelo,Técnica de balanceo,Accuracy,F1-Score,F_Beta-Score
0,LogisticRegression,-,0.994031,0.0,0.0


### Decision Tree

In [9]:
treeclf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
y_pred = treeclf.predict(X_test)

# calculate testing accuracy
print('Accuracy:', metrics.accuracy_score(y_test, y_pred))
print('F1-Score:', metrics.f1_score(y_test, y_pred))
print('F_Beta-Score (Beta=10):', metrics.fbeta_score(y_test, y_pred, beta=10))

Accuracy: 0.9942908220639544
F1-Score: 0.029411764705882356
F_Beta-Score (Beta=10): 0.015298394425931535


In [10]:
results.loc[len(results)]=['DecisionTreeClassifier','-',metrics.accuracy_score(y_test, y_pred),metrics.f1_score(y_test, y_pred),metrics.fbeta_score(y_test, y_pred, beta=10)] 
results

Unnamed: 0,Modelo,Técnica de balanceo,Accuracy,F1-Score,F_Beta-Score
0,LogisticRegression,-,0.994031,0.0,0.0
1,DecisionTreeClassifier,-,0.994291,0.029412,0.015298


### Random Forest

In [11]:
rfclf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = rfclf.predict(X_test)

# calculate testing accuracy
print('Accuracy:', metrics.accuracy_score(y_test, y_pred))
print('F1-Score:', metrics.f1_score(y_test, y_pred))
print('F_Beta-Score (Beta=10):', metrics.fbeta_score(y_test, y_pred, beta=10))

Accuracy: 0.9943196562959545
F1-Score: 0.12444444444444444
F_Beta-Score (Beta=10): 0.07131689110808494


In [12]:
results.loc[len(results)]=['RandomForestClassifier','-',metrics.accuracy_score(y_test, y_pred),metrics.f1_score(y_test, y_pred),metrics.fbeta_score(y_test, y_pred, beta=10)] 
results

Unnamed: 0,Modelo,Técnica de balanceo,Accuracy,F1-Score,F_Beta-Score
0,LogisticRegression,-,0.994031,0.0,0.0
1,DecisionTreeClassifier,-,0.994291,0.029412,0.015298
2,RandomForestClassifier,-,0.99432,0.124444,0.071317


- The data is imbalanced, and the accuracy of 99.43% indicates that the model is predicting nearly every transaction as the same class. The RandomForestClassifier improves the classification, as the F1-Score and F_Beta-Score show.
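
To see why accuracy alone is misleading here, consider a classifier that labels every transaction as non-fraud. The sketch below uses only the class counts printed by `df.shape` and `df.Label.sum()` earlier. They refer to the full dataset rather than the test split, so the figure is only approximate for the test set.

```python
n_total = 138721   # rows in the dataset (df.shape[0])
n_fraud = 797      # positive labels (df.Label.sum())

# An "always predict 0" classifier gets every negative right and every positive wrong
tp, fp = 0, 0
tn, fn = n_total - n_fraud, n_fraud

baseline_accuracy = (tp + tn) / n_total
baseline_recall = tp / (tp + fn)
# F1 = 2*TP / (2*TP + FP + FN), which is zero whenever no fraud is caught
baseline_f1 = 2 * tp / (2 * tp + fp + fn)

print('Baseline accuracy:', round(baseline_accuracy, 4))
print('Baseline recall:', baseline_recall)
print('Baseline F1:', baseline_f1)
```

The three models in the table above score within a fraction of a percentage point of this baseline on accuracy, which is why F1-Score and F_Beta-Score are the more informative columns.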

# Exercise 15.2

Under-sample the negative class using random-under-sampling

Which value of target_percentage did you choose?
How do the results change?

**Only apply under-sampling to the training set, evaluate using the whole test set**

In [13]:
n_samples = y.shape[0]
n_samples_0 = (y == 0).sum()
n_samples_1 = (y == 1).sum()


print('n_samples:',n_samples)
print('n_samples_0:',n_samples_0)
print('n_samples_1:',n_samples_1)


n_samples: 138721
n_samples_0: 137924
n_samples_1: 797


In [14]:
n_samples_1 / n_samples

0.0057453449730033666

In [15]:
n_samples_0_new =  n_samples_1 / 0.5 - n_samples_1
n_samples_0_new

797.0

In [16]:
n_samples_0_new_per = n_samples_0_new / n_samples_0
n_samples_0_new_per

0.005778544705779994
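
The two cells above solve target = n1 / (n1 + n0_new) for n0_new, which gives n0_new = n1 / target - n1, and then express it as a fraction of the available negatives. Here is a small sketch of the same arithmetic as a function. The helper name `negatives_to_keep` is made up for illustration.

```python
def negatives_to_keep(n_pos, n_neg, target_percentage):
    """Return (count, fraction) of negatives to keep so positives make up target_percentage."""
    n_neg_new = n_pos / target_percentage - n_pos
    return n_neg_new, n_neg_new / n_neg


n_samples_0, n_samples_1 = 137924, 797  # class counts printed above

for target in [0.1, 0.2, 0.3, 0.4, 0.5]:
    count, fraction = negatives_to_keep(n_samples_1, n_samples_0, target)
    print(target, round(count), round(fraction, 5))
```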

In [17]:
# Select all negatives
filter_ = y == 0

# Random sample
np.random.seed(42)
rand_1 = np.random.binomial(n=1, p=n_samples_0_new_per, size=n_samples)

# Combine
filter_ = filter_ & rand_1

In [18]:
filter_.sum()

757

In [19]:
filter_ = filter_ | (y == 1)
filter_ = filter_.astype(bool)

In [20]:
def UnderSampling(X, y, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()

    n_samples_0_new =  n_samples_1 / target_percentage - n_samples_1
    n_samples_0_new_per = n_samples_0_new / n_samples_0

    filter_ = y == 0

    np.random.seed(seed)
    rand_1 = np.random.binomial(n=1, p=n_samples_0_new_per, size=n_samples)
    
    filter_ = filter_ & rand_1
    filter_ = filter_ | (y == 1)
    filter_ = filter_.astype(bool)
    
    return X[filter_], y[filter_]
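
`UnderSampling` keeps each negative independently with probability `n_samples_0_new_per`, so the number kept is only approximately the target (757 rather than 797 in the cell above). If an exact count matters, one alternative is to draw a fixed-size sample of negative indices without replacement. Below is a minimal sketch on plain Python lists. `under_sample_exact` is a hypothetical helper, not part of the notebook.

```python
import random


def under_sample_exact(y, target_percentage=0.5, seed=None):
    """Return sorted indices that keep all positives and an exact number of negatives."""
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]

    # Same formula as UnderSampling, rounded and capped at the available negatives
    n_neg_new = min(len(neg_idx), round(len(pos_idx) / target_percentage - len(pos_idx)))

    rng = random.Random(seed)
    kept_neg = rng.sample(neg_idx, n_neg_new)  # sampling without replacement
    return sorted(pos_idx + kept_neg)


# Toy labels: 10 positives among 1000 samples
y_toy = [1] * 10 + [0] * 990
idx = under_sample_exact(y_toy, target_percentage=0.25, seed=1)
y_sub = [y_toy[i] for i in idx]
print(len(y_sub), sum(y_sub) / len(y_sub))
```

The returned indices could be used with `X.iloc[idx]` and `y.iloc[idx]` on the training set.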

In [21]:
for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5]:
    X_u, y_u = UnderSampling(X_train, y_train, target_percentage, 1)
    logreg = LogisticRegression(solver='liblinear', C=1e9).fit(X_u, y_u)
    y_pred = logreg.predict(X_test)
    print('Target_percentage:', target_percentage,'Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Target_percentage: 0.1 Accuracy: 0.9937429716559499 F1-Score: 0.0 F_Beta-Score: 0.0
Target_percentage: 0.2 Accuracy: 0.9665234566477322 F1-Score: 0.04128819157720892 F_Beta-Score: 0.12131840676500265
Target_percentage: 0.3 Accuracy: 0.9257806868314062 F1-Score: 0.03450862715678919 F_Beta-Score: 0.20864020118555773
Target_percentage: 0.4 Accuracy: 0.8652864680949223 F1-Score: 0.03070539419087137 F_Beta-Score: 0.3060355417246745
Target_percentage: 0.5 Accuracy: 0.5885931778207087 F1-Score: 0.017490703759812695 F_Beta-Score: 0.37589379908568754


In [22]:
X_u, y_u = UnderSampling(X_train, y_train, target_percentage, 1)
logreg = LogisticRegression(solver='liblinear', C=1e9).fit(X_u, y_u)
y_pred = logreg.predict(X_test)
print('Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Accuracy: 0.5885931778207087 F1-Score: 0.017490703759812695 F_Beta-Score: 0.37589379908568754


In [23]:
results.loc[len(results)]=['LogisticRegression','UnderSampling',metrics.accuracy_score(y_test, y_pred),metrics.f1_score(y_test, y_pred),metrics.fbeta_score(y_test, y_pred, beta=10)] 
results

Unnamed: 0,Modelo,Técnica de balanceo,Accuracy,F1-Score,F_Beta-Score
0,LogisticRegression,-,0.994031,0.0,0.0
1,DecisionTreeClassifier,-,0.994291,0.029412,0.015298
2,RandomForestClassifier,-,0.99432,0.124444,0.071317
3,LogisticRegression,UnderSampling,0.588593,0.017491,0.375894


In [24]:
for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5]:
    X_u, y_u = UnderSampling(X_train, y_train, target_percentage, 1)
    treeclf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_u, y_u)
    y_pred = treeclf.predict(X_test)
    print('Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Accuracy: 0.9885528098959084 F1-Score: 0.13882863340563992 F_Beta-Score: 0.16109255844091117
Accuracy: 0.989244831463914 F1-Score: 0.10978520286396183 F_Beta-Score: 0.11602817042105788
Accuracy: 0.8623165421988985 F1-Score: 0.03710425489009881 F_Beta-Score: 0.3783233581694556
Accuracy: 0.8288688330786309 F1-Score: 0.03448836830974458 F_Beta-Score: 0.41578313720921195
Accuracy: 0.8243418586545948 F1-Score: 0.03454833597464343 F_Beta-Score: 0.4248610682309355


In [25]:
X_u, y_u = UnderSampling(X_train, y_train, target_percentage, 1)
treeclf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_u, y_u)
y_pred = treeclf.predict(X_test)
print('Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Accuracy: 0.8243418586545948 F1-Score: 0.03454833597464343 F_Beta-Score: 0.4248610682309355


In [26]:
results.loc[len(results)]=['DecisionTreeClassifier','UnderSampling',metrics.accuracy_score(y_test, y_pred),metrics.f1_score(y_test, y_pred),metrics.fbeta_score(y_test, y_pred, beta=10)] 
results

Unnamed: 0,Modelo,Técnica de balanceo,Accuracy,F1-Score,F_Beta-Score
0,LogisticRegression,-,0.994031,0.0,0.0
1,DecisionTreeClassifier,-,0.994291,0.029412,0.015298
2,RandomForestClassifier,-,0.99432,0.124444,0.071317
3,LogisticRegression,UnderSampling,0.588593,0.017491,0.375894
4,DecisionTreeClassifier,UnderSampling,0.824342,0.034548,0.424861


In [27]:
for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5]:
    X_u, y_u = UnderSampling(X_train, y_train, target_percentage, 1)
    rfclf = RandomForestClassifier(random_state=42).fit(X_u, y_u)
    y_pred = rfclf.predict(X_test)
    print('Target_percentage:', target_percentage,'Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Target_percentage: 0.1 Accuracy: 0.9814884230558519 F1-Score: 0.1640625 F_Beta-Score: 0.31237113402061856
Target_percentage: 0.2 Accuracy: 0.9490787462875926 F1-Score: 0.08875128998968007 F_Beta-Score: 0.40324976787372335
Target_percentage: 0.3 Accuracy: 0.9162077218073297 F1-Score: 0.06679511881824021 F_Beta-Score: 0.4624053530551154
Target_percentage: 0.4 Accuracy: 0.850119662062801 F1-Score: 0.04553800954829232 F_Beta-Score: 0.5
Target_percentage: 0.5 Accuracy: 0.7748046480781985 F1-Score: 0.03341584158415842 F_Beta-Score: 0.4925583411603207


In [28]:
target_percentage = 0.4

X_u, y_u = UnderSampling(X_train, y_train, target_percentage, 1)
rfclf = RandomForestClassifier(random_state=42).fit(X_u, y_u)
y_pred = rfclf.predict(X_test)
print('Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Accuracy: 0.850119662062801 F1-Score: 0.04553800954829232 F_Beta-Score: 0.5


In [29]:
results.loc[len(results)]=['RandomForestClassifier','UnderSampling',metrics.accuracy_score(y_test, y_pred),metrics.f1_score(y_test, y_pred),metrics.fbeta_score(y_test, y_pred, beta=10)] 
results

Unnamed: 0,Modelo,Técnica de balanceo,Accuracy,F1-Score,F_Beta-Score
0,LogisticRegression,-,0.994031,0.0,0.0
1,DecisionTreeClassifier,-,0.994291,0.029412,0.015298
2,RandomForestClassifier,-,0.99432,0.124444,0.071317
3,LogisticRegression,UnderSampling,0.588593,0.017491,0.375894
4,DecisionTreeClassifier,UnderSampling,0.824342,0.034548,0.424861
5,RandomForestClassifier,UnderSampling,0.85012,0.045538,0.5


- For the models used, the target_percentage that gives the best F_Beta-Score for logistic regression and the decision tree is 0.5; for the RandomForest, however, the best target_percentage is 0.4.
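
The choice above amounts to taking, for each model, the target_percentage with the highest F_Beta-Score. A short sketch that does this from the values printed by the three loops in this exercise (rounded to four decimals here):

```python
# F_Beta-Score (beta=10) per target_percentage, copied from the loop outputs above
fbeta_by_target = {
    'LogisticRegression': {0.1: 0.0, 0.2: 0.1213, 0.3: 0.2086, 0.4: 0.3060, 0.5: 0.3759},
    'DecisionTreeClassifier': {0.1: 0.1611, 0.2: 0.1160, 0.3: 0.3783, 0.4: 0.4158, 0.5: 0.4249},
    'RandomForestClassifier': {0.1: 0.3124, 0.2: 0.4032, 0.3: 0.4624, 0.4: 0.5000, 0.5: 0.4926},
}

# For each model, pick the target_percentage whose score is largest
best = {model: max(scores, key=scores.get) for model, scores in fbeta_by_target.items()}

for model, target in best.items():
    print(model, target, fbeta_by_target[model][target])
```

Selecting on F1-Score instead would favour much smaller target percentages, because F1 drops as the number of false alarms grows.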

# Exercise 15.3

Same analysis using random-over-sampling

In [30]:
def OverSampling(X, y, target_percentage=0.5, seed=None):
    # Assuming minority class is the positive
    n_samples = y.shape[0]
    n_samples_0 = (y == 0).sum()
    n_samples_1 = (y == 1).sum()
   
    n_samples_1_new =  -target_percentage * n_samples_0 / (target_percentage- 1)
    np.random.seed(seed)
    filter_ = np.random.choice(X[y == 1].shape[0], int(n_samples_1_new))
    # filter_ is within the positives, change to be of all
    filter_ = np.nonzero(y == 1)[0][filter_]
    filter_ = np.concatenate((filter_, np.nonzero(y == 0)[0]), axis=0)
    
    return X.iloc[filter_], y.iloc[filter_]
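
`OverSampling` solves target = n1_new / (n1_new + n0) for n1_new, which gives n1_new = target * n0 / (1 - target), and then draws that many positives with replacement. Here is a minimal sketch of the same idea on plain Python lists. The helper `over_sample_indices` is hypothetical.

```python
import random


def over_sample_indices(y, target_percentage=0.5, seed=None):
    """Return indices with positives resampled (with replacement) to reach target_percentage."""
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]

    # n1_new = target * n0 / (1 - target), the same expression used in OverSampling
    n_pos_new = int(target_percentage * len(neg_idx) / (1 - target_percentage))

    rng = random.Random(seed)
    resampled_pos = rng.choices(pos_idx, k=n_pos_new)  # sampling with replacement
    return resampled_pos + neg_idx


# Toy labels: 5 positives among 100 samples
y_toy = [1] * 5 + [0] * 95
idx = over_sample_indices(y_toy, target_percentage=0.5, seed=1)
y_over = [y_toy[i] for i in idx]
print(len(y_over), sum(y_over) / len(y_over))
```

Every added positive is an exact copy of an existing one, so the resampled training set contains many duplicate rows.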

In [31]:
for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5]:
    X_u, y_u = OverSampling(X_train, y_train, target_percentage, 1)
    logreg = LogisticRegression(solver='liblinear', C=1e9).fit(X_u, y_u)
    y_pred = logreg.predict(X_test)
    print('Target_percentage:', target_percentage,'Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Target_percentage: 0.1 Accuracy: 0.9942043193679536 F1-Score: 0.0 F_Beta-Score: 0.0
Target_percentage: 0.2 Accuracy: 0.993137452783945 F1-Score: 0.0 F_Beta-Score: 0.0
Target_percentage: 0.3 Accuracy: 0.9446382745595571 F1-Score: 0.03614457831325302 F_Beta-Score: 0.1683801055848847
Target_percentage: 0.4 Accuracy: 0.9041838470632335 F1-Score: 0.037648421662322615 F_Beta-Score: 0.28475384949034915
Target_percentage: 0.5 Accuracy: 0.5649491075805196 F1-Score: 0.01821967725143155 F_Beta-Score: 0.40434658278524443


In [32]:
X_u, y_u = OverSampling(X_train, y_train, target_percentage, 1)
logreg = LogisticRegression(solver='liblinear', C=1e9).fit(X_u, y_u)
y_pred = logreg.predict(X_test)
print('Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Accuracy: 0.5649491075805196 F1-Score: 0.01821967725143155 F_Beta-Score: 0.40434658278524443


In [33]:
results.loc[len(results)]=['LogisticRegression','OverSampling',metrics.accuracy_score(y_test, y_pred),metrics.f1_score(y_test, y_pred),metrics.fbeta_score(y_test, y_pred, beta=10)] 
results

Unnamed: 0,Modelo,Técnica de balanceo,Accuracy,F1-Score,F_Beta-Score
0,LogisticRegression,-,0.994031,0.0,0.0
1,DecisionTreeClassifier,-,0.994291,0.029412,0.015298
2,RandomForestClassifier,-,0.99432,0.124444,0.071317
3,LogisticRegression,UnderSampling,0.588593,0.017491,0.375894
4,DecisionTreeClassifier,UnderSampling,0.824342,0.034548,0.424861
5,RandomForestClassifier,UnderSampling,0.85012,0.045538,0.5
6,LogisticRegression,OverSampling,0.564949,0.01822,0.404347


In [34]:
for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5]:
    X_u, y_u = OverSampling(X_train, y_train, target_percentage, 1)
    treeclf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_u, y_u)
    y_pred = treeclf.predict(X_test)
    print('Target_percentage:', target_percentage,'Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Target_percentage: 0.1 Accuracy: 0.9901963611199216 F1-Score: 0.11458333333333334 F_Beta-Score: 0.11117782447713401
Target_percentage: 0.2 Accuracy: 0.9900810241919207 F1-Score: 0.10416666666666666 F_Beta-Score: 0.1010707495246673
Target_percentage: 0.3 Accuracy: 0.9705890833597647 F1-Score: 0.07441016333938295 F_Beta-Score: 0.20000965996908812
Target_percentage: 0.4 Accuracy: 0.8157780917505262 F1-Score: 0.03504002416553391 F_Beta-Score: 0.4467833581207337
Target_percentage: 0.5 Accuracy: 0.8157780917505262 F1-Score: 0.03504002416553391 F_Beta-Score: 0.4467833581207337


In [35]:
X_u, y_u = OverSampling(X_train, y_train, target_percentage, 1)
treeclf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_u, y_u)
y_pred = treeclf.predict(X_test)
print('Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Accuracy: 0.8157780917505262 F1-Score: 0.03504002416553391 F_Beta-Score: 0.4467833581207337


In [36]:
results.loc[len(results)]=['DecisionTreeClassifier','OverSampling',metrics.accuracy_score(y_test, y_pred),metrics.f1_score(y_test, y_pred),metrics.fbeta_score(y_test, y_pred, beta=10)] 
results

Unnamed: 0,Modelo,Técnica de balanceo,Accuracy,F1-Score,F_Beta-Score
0,LogisticRegression,-,0.994031,0.0,0.0
1,DecisionTreeClassifier,-,0.994291,0.029412,0.015298
2,RandomForestClassifier,-,0.99432,0.124444,0.071317
3,LogisticRegression,UnderSampling,0.588593,0.017491,0.375894
4,DecisionTreeClassifier,UnderSampling,0.824342,0.034548,0.424861
5,RandomForestClassifier,UnderSampling,0.85012,0.045538,0.5
6,LogisticRegression,OverSampling,0.564949,0.01822,0.404347
7,DecisionTreeClassifier,OverSampling,0.815778,0.03504,0.446783


In [37]:
for target_percentage in [0.1, 0.2, 0.3, 0.4, 0.5]:
    X_u, y_u = OverSampling(X_train, y_train, target_percentage, 1)
    rfclf = RandomForestClassifier(random_state=42).fit(X_u, y_u)
    y_pred = rfclf.predict(X_test)
    print('Target_percentage:', target_percentage,'Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Target_percentage: 0.1 Accuracy: 0.9936564689599493 F1-Score: 0.15384615384615383 F_Beta-Score: 0.10170174201993756
Target_percentage: 0.2 Accuracy: 0.9935699662639486 F1-Score: 0.14559386973180075 F_Beta-Score: 0.09661179076675223
Target_percentage: 0.3 Accuracy: 0.9935411320319484 F1-Score: 0.15789473684210525 F_Beta-Score: 0.1067545802295148
Target_percentage: 0.4 Accuracy: 0.9934834635679479 F1-Score: 0.15037593984962405 F_Beta-Score: 0.10167102879001409
Target_percentage: 0.5 Accuracy: 0.9937141374239498 F1-Score: 0.16153846153846152 F_Beta-Score: 0.10678682912093446


In [38]:
X_u, y_u = OverSampling(X_train, y_train, target_percentage, 1)
rfclf = RandomForestClassifier(random_state=42).fit(X_u, y_u)
y_pred = rfclf.predict(X_test)
print('Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Accuracy: 0.9937141374239498 F1-Score: 0.16153846153846152 F_Beta-Score: 0.10678682912093446


In [39]:
results.loc[len(results)]=['RandomForestClassifier','OverSampling',metrics.accuracy_score(y_test, y_pred),metrics.f1_score(y_test, y_pred),metrics.fbeta_score(y_test, y_pred, beta=10)] 
results

Unnamed: 0,Modelo,Técnica de balanceo,Accuracy,F1-Score,F_Beta-Score
0,LogisticRegression,-,0.994031,0.0,0.0
1,DecisionTreeClassifier,-,0.994291,0.029412,0.015298
2,RandomForestClassifier,-,0.99432,0.124444,0.071317
3,LogisticRegression,UnderSampling,0.588593,0.017491,0.375894
4,DecisionTreeClassifier,UnderSampling,0.824342,0.034548,0.424861
5,RandomForestClassifier,UnderSampling,0.85012,0.045538,0.5
6,LogisticRegression,OverSampling,0.564949,0.01822,0.404347
7,DecisionTreeClassifier,OverSampling,0.815778,0.03504,0.446783
8,RandomForestClassifier,OverSampling,0.993714,0.161538,0.106787


- With OverSampling, logistic regression reaches its highest F1-Score at target_percentage = 0.4 and its highest F_Beta-Score at 0.5. For the decision tree, target_percentage 0.4 and 0.5 give the same F_Beta-Score. For the RandomForestClassifier the best target_percentage is 0.5.

# Exercise 15.4 (3 points)

Evaluate the results using SMOTE

Which parameters did you choose?

In [40]:
def SMOTE(X, y, target_percentage=0.5, k=5, seed=None):
    # Calculate the NearestNeighbors
    from sklearn.neighbors import NearestNeighbors
    nearest_neighbour_ = NearestNeighbors(n_neighbors=k + 1)
    nearest_neighbour_.fit(X[y==1])
    nns = nearest_neighbour_.kneighbors(X[y==1], 
                                    return_distance=False)[:, 1:]
    
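    # New samples
    # NOTE: n_samples_0 and n_samples_1 are not defined in this function; they are read
    # from the notebook globals set in In [13], i.e. counts over the full dataset
    n_samples_1_new =  int(-target_percentage * n_samples_0 / (target_percentage- 1) - n_samples_1)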
    
    # A matrix to store the synthetic samples
    new = np.zeros((n_samples_1_new, X.shape[1]))
    
    # Create seeds
    np.random.seed(seed)
    seeds = np.random.randint(1, 1000000, 3)
    
    # Select examples to use as base
    np.random.seed(seeds[0])
    sel_ = np.random.choice(y[y==1].shape[0], n_samples_1_new)
    
    # For each selected example, pick one of its k nearest neighbors at random
    np.random.seed(seeds[1])
    nn__=[]
    for i, sel in enumerate(sel_):
        nn__.append(np.random.choice(nns[sel]))
    
    np.random.seed(seeds[2])
    steps = np.random.uniform(size=n_samples_1_new)  

    # For each selected examples create one synthetic case
    for i, sel in enumerate(sel_):
        # Select neighbor
        nn_ = nn__[i]
        step = steps[i]
        # Create new sample
        new[i, :] = X[y==1].iloc[sel] - step * (X[y==1].iloc[sel] - X[y==1].iloc[nn_])
    
    X = np.vstack((X, new))
    y = np.append(y, np.ones(n_samples_1_new))
    
    return X, y
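
The core of the function above is the interpolation step: each synthetic sample is a point on the line segment between a minority sample and one of its k nearest minority neighbours. Below is a minimal sketch of that single step with plain lists. The two points are made-up values.

```python
import random


def interpolate(sample, neighbor, step):
    """Return sample - step * (sample - neighbor), the expression used inside SMOTE above."""
    return [s - step * (s - n) for s, n in zip(sample, neighbor)]


rng = random.Random(3)
sample = [1.0, 10.0]     # a minority-class point (hypothetical)
neighbor = [3.0, 20.0]   # one of its nearest minority neighbours (hypothetical)

step = rng.uniform(0, 1)  # position along the segment
synthetic = interpolate(sample, neighbor, step)
print(step, synthetic)
```

A step of 0 reproduces the original sample and a step of 1 reproduces the neighbour; anything in between is a new point that did not exist in the data.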

In [41]:
for target_percentage in [0.25, 0.5]:
    for k in [5, 15]:
        X_u, y_u = SMOTE(X_train, y_train, target_percentage, k, seed=3)
        logreg = LogisticRegression(solver='liblinear', C=1e9).fit(X_u, y_u)
        y_pred = logreg.predict(X_test)
        print('Target_percentage:', target_percentage,'k ', k,'Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Target_percentage: 0.25 k  5 Accuracy: 0.9337677690954701 F1-Score: 0.03527929441411173 F_Beta-Score: 0.1929672929081563
Target_percentage: 0.25 k  15 Accuracy: 0.9320377151754563 F1-Score: 0.033620336203362036 F_Beta-Score: 0.18787713806088654
Target_percentage: 0.5 k  5 Accuracy: 0.46062685620368504 F1-Score: 0.01723232110959336 F_Beta-Score: 0.4287193291230976
Target_percentage: 0.5 k  15 Accuracy: 0.4473631094835789 F1-Score: 0.017128205128205128 F_Beta-Score: 0.43135900976932123


In [42]:
X_u, y_u = SMOTE(X_train, y_train, target_percentage, k, seed=3)
logreg = LogisticRegression(solver='liblinear', C=1e9).fit(X_u, y_u)
y_pred = logreg.predict(X_test)
print('Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Accuracy: 0.4473631094835789 F1-Score: 0.017128205128205128 F_Beta-Score: 0.43135900976932123


In [43]:
results.loc[len(results)]=['LogisticRegression','SMOTE',metrics.accuracy_score(y_test, y_pred),metrics.f1_score(y_test, y_pred),metrics.fbeta_score(y_test, y_pred, beta=10)] 
results

Unnamed: 0,Modelo,Técnica de balanceo,Accuracy,F1-Score,F_Beta-Score
0,LogisticRegression,-,0.994031,0.0,0.0
1,DecisionTreeClassifier,-,0.994291,0.029412,0.015298
2,RandomForestClassifier,-,0.99432,0.124444,0.071317
3,LogisticRegression,UnderSampling,0.588593,0.017491,0.375894
4,DecisionTreeClassifier,UnderSampling,0.824342,0.034548,0.424861
5,RandomForestClassifier,UnderSampling,0.85012,0.045538,0.5
6,LogisticRegression,OverSampling,0.564949,0.01822,0.404347
7,DecisionTreeClassifier,OverSampling,0.815778,0.03504,0.446783
8,RandomForestClassifier,OverSampling,0.993714,0.161538,0.106787
9,LogisticRegression,SMOTE,0.447363,0.017128,0.431359


In [44]:
for target_percentage in [0.25, 0.5]:
    for k in [5, 15]:
        X_u, y_u = SMOTE(X_train, y_train, target_percentage, k, seed=3)
        treeclf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_u, y_u)
        y_pred = treeclf.predict(X_test)
        print('Target_percentage:', target_percentage,'k ', k,'Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Target_percentage: 0.25 k  5 Accuracy: 0.903261151639226 F1-Score: 0.02781802376122863 F_Beta-Score: 0.21029800893593026
Target_percentage: 0.25 k  15 Accuracy: 0.9007237392232058 F1-Score: 0.027126306866346424 F_Beta-Score: 0.2094982930729009
Target_percentage: 0.5 k  5 Accuracy: 0.5211787434041695 F1-Score: 0.020179372197309416 F_Beta-Score: 0.472530779753762
Target_percentage: 0.5 k  15 Accuracy: 0.538623453764309 F1-Score: 0.020086961847020638 F_Beta-Score: 0.46099468425593504


In [45]:
X_u, y_u = SMOTE(X_train, y_train, target_percentage, k, seed=3)
treeclf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_u, y_u)
y_pred = treeclf.predict(X_test)
print('Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Accuracy: 0.538623453764309 F1-Score: 0.020086961847020638 F_Beta-Score: 0.46099468425593504


In [46]:
results.loc[len(results)]=['DecisionTreeClassifier','SMOTE',metrics.accuracy_score(y_test, y_pred),metrics.f1_score(y_test, y_pred),metrics.fbeta_score(y_test, y_pred, beta=10)] 
results

Unnamed: 0,Modelo,Técnica de balanceo,Accuracy,F1-Score,F_Beta-Score
0,LogisticRegression,-,0.994031,0.0,0.0
1,DecisionTreeClassifier,-,0.994291,0.029412,0.015298
2,RandomForestClassifier,-,0.99432,0.124444,0.071317
3,LogisticRegression,UnderSampling,0.588593,0.017491,0.375894
4,DecisionTreeClassifier,UnderSampling,0.824342,0.034548,0.424861
5,RandomForestClassifier,UnderSampling,0.85012,0.045538,0.5
6,LogisticRegression,OverSampling,0.564949,0.01822,0.404347
7,DecisionTreeClassifier,OverSampling,0.815778,0.03504,0.446783
8,RandomForestClassifier,OverSampling,0.993714,0.161538,0.106787
9,LogisticRegression,SMOTE,0.447363,0.017128,0.431359


In [47]:
for target_percentage in [0.25, 0.5]:
    for k in [5, 15]:
        X_u, y_u = SMOTE(X_train, y_train, target_percentage, k, seed=3)
        rfclf = RandomForestClassifier(random_state=42).fit(X_u, y_u)
        y_pred = rfclf.predict(X_test)
        print('Target_percentage:', target_percentage,'k ', k,'Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Target_percentage: 0.25 k  5 Accuracy: 0.9929644473919437 F1-Score: 0.15277777777777776 F_Beta-Score: 0.11171442936148818
Target_percentage: 0.25 k  15 Accuracy: 0.9926184366079409 F1-Score: 0.17948717948717946 F_Beta-Score: 0.14201064577684042
Target_percentage: 0.5 k  5 Accuracy: 0.9927626077679421 F1-Score: 0.1716171617161716 F_Beta-Score: 0.13192665159507663
Target_percentage: 0.5 k  15 Accuracy: 0.9908595484559268 F1-Score: 0.15466666666666665 F_Beta-Score: 0.14661861140311358


In [48]:
target_percentage = 0.5
k = 5

X_u, y_u = SMOTE(X_train, y_train, target_percentage, k, seed=3)
rfclf = RandomForestClassifier(random_state=42).fit(X_u, y_u)
y_pred = rfclf.predict(X_test)
print('Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Accuracy: 0.9927626077679421 F1-Score: 0.1716171617161716 F_Beta-Score: 0.13192665159507663


In [49]:
results.loc[len(results)]=['RandomForestClassifier','SMOTE',metrics.accuracy_score(y_test, y_pred),metrics.f1_score(y_test, y_pred),metrics.fbeta_score(y_test, y_pred, beta=10)] 
results

Unnamed: 0,Modelo,Técnica de balanceo,Accuracy,F1-Score,F_Beta-Score
0,LogisticRegression,-,0.994031,0.0,0.0
1,DecisionTreeClassifier,-,0.994291,0.029412,0.015298
2,RandomForestClassifier,-,0.99432,0.124444,0.071317
3,LogisticRegression,UnderSampling,0.588593,0.017491,0.375894
4,DecisionTreeClassifier,UnderSampling,0.824342,0.034548,0.424861
5,RandomForestClassifier,UnderSampling,0.85012,0.045538,0.5
6,LogisticRegression,OverSampling,0.564949,0.01822,0.404347
7,DecisionTreeClassifier,OverSampling,0.815778,0.03504,0.446783
8,RandomForestClassifier,OverSampling,0.993714,0.161538,0.106787
9,LogisticRegression,SMOTE,0.447363,0.017128,0.431359


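- After evaluating these combinations, the parameters chosen are target_percentage = 0.5 and k = 5. A target_percentage of 0.5 gives the highest F_Beta-Score for all three models, while the choice of k has only a small effect.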

# Exercise 15.5 (3 points)

Evaluate the results using Adaptive Synthetic Sampling Approach for Imbalanced
Learning (ADASYN)

http://www.ele.uri.edu/faculty/he/PDFfiles/adasyn.pdf
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.ADASYN.html#rf9172e970ca5-1

In [50]:
%%time
pd.options.mode.chained_assignment = None  # default='warn'
adsn = ADASYN(n_neighbors=5, sampling_strategy='auto', random_state=42, n_jobs=-1)

X_u, y_u = adsn.fit_resample(X_train, y_train)
print('Original data shape: {}'.format(Counter(y_train)))
print('Reshaped data shape: {}'.format(Counter(y_u)))

Original data shape: Counter({0: 103441, 1: 599})
Reshaped data shape: Counter({1: 103473, 0: 103441})
Wall time: 11 s
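
ADASYN differs from SMOTE in how it distributes the synthetic samples: minority points whose neighbourhoods contain more majority-class points receive more of them. The following is a rough sketch of that allocation step as described in the ADASYN paper linked above, using made-up neighbour counts. It is not imbalanced-learn's implementation, and the function name is hypothetical.

```python
def adasyn_allocation(majority_neighbor_counts, k, n_majority, n_minority, beta=1.0):
    """Return the number of synthetic samples to generate for each minority point.

    majority_neighbor_counts[i] is how many of the k nearest neighbours of
    minority point i belong to the majority class.
    """
    # Total number of synthetic samples needed (beta=1.0 means fully balanced)
    g_total = (n_majority - n_minority) * beta

    # r_i: share of majority points in each neighbourhood, then normalised to sum to 1
    r = [count / k for count in majority_neighbor_counts]
    r_sum = sum(r)
    if r_sum == 0:
        return [0] * len(r)  # no minority point borders the majority class
    return [round(r_i / r_sum * g_total) for r_i in r]


# Hypothetical example: 4 minority points, k = 5, 104 majority points
counts = [5, 3, 2, 0]
alloc = adasyn_allocation(counts, k=5, n_majority=104, n_minority=4)
print(alloc)
```

Because each per-point allocation is rounded, the total need not match the exact balance, which may be why the resampled counts above are close but not identical (103473 positives against 103441 negatives).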


In [51]:
#adsn = ADASYN(ratio={1: 797, 0: 137924}, random_state=42, n_neighbors=3, n_jobs= -1)

X_u, y_u = adsn.fit_resample(X_train, y_train)
logreg = LogisticRegression(solver='liblinear', C=1e9).fit(X_u, y_u)
y_pred = logreg.predict(X_test)
print('Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Accuracy: 0.5667079957325336 F1-Score: 0.017522066034651847 F_Beta-Score: 0.38782703384245065


In [52]:
results.loc[len(results)]=['LogisticRegression','ADASYN',metrics.accuracy_score(y_test, y_pred),metrics.f1_score(y_test, y_pred),metrics.fbeta_score(y_test, y_pred, beta=10)] 
results


In [53]:
X_u, y_u = adsn.fit_resample(X_train, y_train)  # `fit_resample` replaces the deprecated `fit_sample`
treeclf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_u, y_u)
y_pred = treeclf.predict(X_test)
print('Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Accuracy: 0.7447593783339581 F1-Score: 0.025324818321955517 F_Beta-Score: 0.4049295774647887


In [54]:
results.loc[len(results)]=['DecisionTreeClassifier','ADASYN',metrics.accuracy_score(y_test, y_pred),metrics.f1_score(y_test, y_pred),metrics.fbeta_score(y_test, y_pred, beta=10)] 
results


In [55]:
X_u, y_u = adsn.fit_resample(X_train, y_train)  # `fit_resample` replaces the deprecated `fit_sample`
rfclf = RandomForestClassifier(random_state=42).fit(X_u, y_u)
y_pred = rfclf.predict(X_test)
print('Accuracy:', metrics.accuracy_score(y_test, y_pred),'F1-Score:', metrics.f1_score(y_test, y_pred),'F_Beta-Score:', metrics.fbeta_score(y_test, y_pred, beta=10))

Accuracy: 0.9929356131599435 F1-Score: 0.18060200668896323 F_Beta-Score: 0.13702829003567657


In [56]:
results.loc[len(results)]=['RandomForestClassifier','ADASYN',metrics.accuracy_score(y_test, y_pred),metrics.f1_score(y_test, y_pred),metrics.fbeta_score(y_test, y_pred, beta=10)] 
results


# Exercise 15.6 (3 points)

Compare and comment on the results

In [57]:
results

Unnamed: 0,Modelo,Técnica de balanceo,Accuracy,F1-Score,F_Beta-Score
0,LogisticRegression,-,0.994031,0.0,0.0
1,DecisionTreeClassifier,-,0.994291,0.029412,0.015298
2,RandomForestClassifier,-,0.99432,0.124444,0.071317
3,LogisticRegression,UnderSampling,0.588593,0.017491,0.375894
4,DecisionTreeClassifier,UnderSampling,0.824342,0.034548,0.424861
5,RandomForestClassifier,UnderSampling,0.85012,0.045538,0.5
6,LogisticRegression,OverSampling,0.564949,0.01822,0.404347
7,DecisionTreeClassifier,OverSampling,0.815778,0.03504,0.446783
8,RandomForestClassifier,OverSampling,0.993714,0.161538,0.106787
9,LogisticRegression,SMOTE,0.447363,0.017128,0.431359
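
To make the Accuracy, F1-Score and F_Beta-Score columns easier to read, here is a minimal sketch that computes the three metrics from confusion-matrix counts using $F_\beta = (1+\beta^2)\,PR / (\beta^2 P + R)$, with `beta=10` as in the `metrics.fbeta_score` calls in this notebook. The helper `scores` and the two sets of counts are made up for illustration and are not taken from the dataset.

```python
def scores(tp, fp, fn, tn, beta=10):
    """Return (accuracy, F1, F_beta) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    def f_beta(b):
        denom = b ** 2 * precision + recall
        return (1 + b ** 2) * precision * recall / denom if denom else 0.0

    return accuracy, f_beta(1), f_beta(beta)

# Hypothetical counts: 10,000 transactions, 60 of them fraudulent.
# Model A labels everything as legitimate; model B flags aggressively.
always_legit = scores(tp=0, fp=0, fn=60, tn=9940)
high_recall = scores(tp=45, fp=4000, fn=15, tn=5940)

for name, (acc, f1, fb) in [("always legit", always_legit), ("high recall", high_recall)]:
    print(f"{name}: accuracy={acc:.4f}  F1={f1:.4f}  F_beta={fb:.4f}")
```

The first model has near-perfect accuracy but zero F1 and F-beta, while the second trades accuracy for recall and scores far higher on F-beta. That is the same pattern as the unbalanced versus resampled rows in the table.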


- The final results show that, for the models trained on the imbalanced data, accuracy is nearly perfect while the F1 score is low, so accuracy alone cannot reveal whether the model struggles with false positives or false negatives. In this fraud problem about 99% of purchases are legitimate and under 1% are fraudulent (599 of the 104,040 training transactions), which means a model can reach roughly 99% accuracy simply by predicting that every purchase is legitimate. The data therefore has to be balanced before the model can learn the fraud patterns.

    * For logistic regression, the technique with the highest F1-Score is OverSampling, which oversamples the data to increase the number of fraud cases; it reaches an accuracy of 56.5% and an F1-Score of 0.018220. However, the Fβ-score is the better guide here, because recall matters more: in fraud detection a false positive costs less than a false negative. By that measure the best technique is SMOTE, with an Fβ-score of 0.431359.
    * For decision trees the conclusion is similar: SMOTE (Synthetic Minority Oversampling Technique), which picks a minority-class instance and one of its nearest minority-class neighbors and generates a synthetic sample between them, gives the best result, with an Fβ-score of 0.460995.
    * For random forests, UnderSampling is preferable: it randomly selects majority-class instances and removes them, without replacement, until both classes are balanced. It reaches an Fβ-score of 0.5.

- Along these lines, Under-Sampling removes objects from the majority class to create a balanced dataset, but it can discard representative objects from the training set and so degrade the model the classifier builds. Over-Sampling instead creates new minority-class objects to produce a dataset with a balanced class distribution; its main drawback is that it can add many artificial objects, which can cause overfitting.

    * For the fraud case, the SMOTE sampling method is proposed for training the model, taking two issues into account: the class distribution and the difficulty of classifying each sample correctly. However, this technique requires more computation time.
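
The trade-off between the two basic strategies can be seen in a small pure-Python sketch. It implements random under-sampling and random over-sampling. Random over-sampling duplicates existing minority rows, whereas SMOTE and ADASYN synthesize new ones. The function names, the seed and the 95/5 toy split are assumptions for illustration. This is not the code used to produce the results above.

```python
import random
from collections import Counter

def random_under_sample(X, y, majority=0, seed=42):
    """Drop majority rows at random (without replacement) until classes match."""
    rng = random.Random(seed)
    maj = [i for i, label in enumerate(y) if label == majority]
    mino = [i for i, label in enumerate(y) if label != majority]
    kept = rng.sample(maj, len(mino))
    idx = sorted(kept + mino)
    return [X[i] for i in idx], [y[i] for i in idx]

def random_over_sample(X, y, majority=0, seed=42):
    """Repeat minority rows at random (with replacement) until classes match."""
    rng = random.Random(seed)
    maj = [i for i, label in enumerate(y) if label == majority]
    mino = [i for i, label in enumerate(y) if label != majority]
    extra = rng.choices(mino, k=len(maj) - len(mino))
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

# Toy data (made up): 95 majority rows and 5 minority rows, one feature each
X_toy = [[i] for i in range(100)]
y_toy = [0] * 95 + [1] * 5

X_under, y_under = random_under_sample(X_toy, y_toy)
X_over, y_over = random_over_sample(X_toy, y_toy)

print(Counter(y_under))                     # 5 rows per class: 90 majority rows discarded
print(Counter(y_over))                      # 95 rows per class
print(len({tuple(r) for r in X_over}))      # still only 100 distinct rows
```

Under-sampling throws away 90 of the 95 majority rows, which is how representative objects can be lost. Over-sampling keeps everything but adds only exact copies of the 5 minority rows, which is one way a model can end up overfitting them.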
