# Training Fraud Detection Data

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import auc
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import imblearn
import matplotlib.pyplot as plt
import seaborn as sns



%matplotlib inline

## Prepare Training/Testing Data

In [12]:
dat = pd.read_csv('./../data/explored.csv')

In [14]:
final_cols = ['merchantidentifier','first_transaction_amount','time_to_first_txn','loc_hmg_coeff','org_fraud_prevalence','is_fraud']

In [25]:
dat_sub = dat[final_cols]

In [34]:
# One-hot encode loc_hmg_coeff so that the trees can split on each similarity category
dat_sub = pd.get_dummies(dat_sub, columns=['loc_hmg_coeff'])

In [36]:
dat_sub.sample(3)

| | merchantidentifier | first_transaction_amount | time_to_first_txn | org_fraud_prevalence | is_fraud | loc_hmg_coeff_0.25 | loc_hmg_coeff_0.5 | loc_hmg_coeff_0.75 | loc_hmg_coeff_1.0 |
|---|---|---|---|---|---|---|---|---|---|
| 41824 | 38708331 | 301.24 | 3 | 0.024647 | 0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 116222 | 41581539 | 19.0 | 2 | 0.046667 | 0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 103663 | 43394609 | 19.99 | 1 | 0.025276 | 0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [38]:
Xtrain, Xtest, ytrain, ytest = train_test_split(dat_sub[['first_transaction_amount','time_to_first_txn','loc_hmg_coeff_0.25','loc_hmg_coeff_0.5','loc_hmg_coeff_0.75','loc_hmg_coeff_1.0','org_fraud_prevalence']],
                                                dat_sub[['is_fraud']], train_size=0.75)

In [59]:
Xtrain.to_csv('./../data/xtrain.csv', index=False)
ytrain.to_csv('./../data/ytrain.csv', index=False)
Xtest.to_csv('./../data/xtest.csv', index=False)
ytest.to_csv('./../data/ytest.csv', index=False)

## Cost-Sensitive Learning

Cost-sensitive learning assigns a high cost to misclassifying the minority class and then minimizes the overall cost. The passages below are quoted from this reference: http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf

We propose two ways to deal with the problem of extreme imbalance, both based on the random forest
(RF) algorithm (Breiman, 2001). One incorporates class weights into the RF classifier, thus making it cost
sensitive, and it penalizes misclassifying the minority class. The other combines the sampling technique
and the ensemble idea. It down-samples the majority class and grows each tree on a more balanced data
set. A majority vote is taken as usual for prediction.  

However, similar to most classifiers, RF can also suffer from the curse of learning
from an extremely imbalanced training data set. As it is constructed to minimize the overall error rate, it
will tend to focus more on the prediction accuracy of the majority class, which often results in poor accuracy
for the minority class. To alleviate the problem, we propose two solutions: balanced random forest (BRF)
and weighted random forest (WRF).

### Balanced Random Forest

As recent research shows (e.g., Ling & Li (1998), Kubat & Matwin (1997), Drummond & Holte (2003)), for the tree
classifier, artificially making class priors equal, either by down-sampling the majority class or over-sampling
the minority class, is usually more effective with respect to a given performance measurement, and down-sampling
seems to have an edge over over-sampling. However, down-sampling the majority class may result
in loss of information, as a large part of the majority class is not used. Random forest inspired us to ensemble
trees induced from balanced down-sampled data. The Balanced Random Forest (BRF) algorithm is shown
below:

1. For each iteration in random forest, draw a bootstrap sample from the minority class. Randomly draw
the same number of cases, with replacement, from the majority class.
2. Induce a classification tree from the data to maximum size, without pruning. The tree is induced with
the CART algorithm, with the following modification: At each node, instead of searching through all
variables for the optimal split, only search through a set of mtry randomly selected variables.
3. Repeat the two steps above for the number of times desired. Aggregate the predictions of the ensemble
and make the final prediction.
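
The three steps above can be sketched with scikit-learn's `DecisionTreeClassifier`. This is a minimal illustration, not the reference implementation from the paper: the function names `fit_balanced_forest` and `predict_balanced_forest` are hypothetical, labels are assumed to be binary with `1` as the minority class, and `max_features="sqrt"` is an assumed choice for `mtry`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_balanced_forest(X, y, n_trees=50, seed=0):
    """Grow n_trees unpruned trees, each on a balanced bootstrap sample.

    Assumes binary labels where 1 marks the minority class.
    """
    X, y = np.asarray(X), np.ravel(y)
    rng = np.random.RandomState(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    trees = []
    for _ in range(n_trees):
        # Step 1: bootstrap the minority class, then draw the same number
        # of majority cases, also with replacement.
        boot_min = rng.choice(minority_idx, size=len(minority_idx), replace=True)
        boot_maj = rng.choice(majority_idx, size=len(minority_idx), replace=True)
        idx = np.concatenate([boot_min, boot_maj])
        # Step 2: grow an unpruned CART tree; max_features plays the role
        # of mtry (random subset of variables tried at each node).
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=rng.randint(2**31 - 1))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_balanced_forest(trees, X):
    """Step 3: aggregate the individual tree predictions by majority vote."""
    X = np.asarray(X)
    votes = np.mean([tree.predict(X) for tree in trees], axis=0)
    return (votes >= 0.5).astype(int)
```

With the split from earlier in this notebook, a call would look like `trees = fit_balanced_forest(Xtrain, ytrain)` followed by `predict_balanced_forest(trees, Xtest)`. The imbalanced-learn package also provides a `BalancedRandomForestClassifier` in `imblearn.ensemble`, which is likely the better choice outside of a quick experiment.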

### Weighted Random Forest

Another approach to make random forest more suitable for learning from extremely imbalanced data follows
the idea of cost sensitive learning. Since the RF classifier tends to be biased towards the majority class, we
shall place a heavier penalty on misclassifying the minority class. We assign a weight to each class, with the
minority class given larger weight (i.e., higher misclassification cost). The class weights are incorporated
into the RF algorithm in two places. In the tree induction procedure, class weights are used to weight
the Gini criterion for finding splits. In the terminal nodes of each tree, class weights are again taken into
consideration. The class prediction of each terminal node is determined by “weighted majority vote”; i.e.,
the weighted vote of a class is the weight for that class times the number of cases for that class at the
terminal node. The final class prediction for RF is then determined by aggregating the weighted vote from
each individual tree, where the weights are average weights in the terminal nodes. Class weights are an
essential tuning parameter to achieve desired performance. The out-of-bag estimate of the accuracy from
RF can be used to select weights.
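
A rough way to try this idea is scikit-learn's `RandomForestClassifier` with its `class_weight` and `oob_score` parameters. The sketch below is an approximation under stated assumptions: scikit-learn applies class weights as sample weights during tree induction, which is close in spirit to the WRF described above but not guaranteed to match it exactly. The helper name `oob_scores_by_weight` and the candidate weights are hypothetical, and labels are assumed to be binary with `1` as the minority class.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def oob_scores_by_weight(X, y, minority_weights, n_trees=100, seed=0):
    """Fit one class-weighted forest per candidate minority weight.

    Returns a dict mapping each weight to the forest's out-of-bag accuracy.
    """
    y = np.ravel(y)
    scores = {}
    for w in minority_weights:
        forest = RandomForestClassifier(
            n_estimators=n_trees,
            class_weight={0: 1.0, 1: w},  # heavier penalty on the minority class
            oob_score=True,               # keep the out-of-bag accuracy estimate
            random_state=seed,
        )
        forest.fit(X, y)
        scores[w] = forest.oob_score_
    return scores
```

For example, `oob_scores_by_weight(Xtrain, ytrain, [1, 5, 10, 50])` would give one out-of-bag accuracy per weight. Because overall accuracy is dominated by the majority class, it may be worth checking minority-class recall for each candidate as well before settling on a weight.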

In [3]:
# np.bincount counts how often each non-negative integer occurs (here: two 0s, four 1s, one 2)
np.bincount(np.array([0,0,1,1,2,1,1]))

array([2, 4, 1])

In [4]:
y = [1, 0, 1, 1, 1, 1, 0, 0, 1, 1]

In [7]:
# n_classes * per-class counts: 2 classes, with three 0s and seven 1s in y
(2*np.bincount(y))

array([ 6, 14])

In [11]:
# "balanced" class weights: n_samples / (n_classes * per-class counts)
10.0/np.array([6,14])

array([ 1.66666667,  0.71428571])
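
The cells above compute class weights by hand as `n_samples / (n_classes * np.bincount(y))`, so the rarer class `0` gets the larger weight. As a cross-check, the sketch below compares that manual calculation with scikit-learn's `compute_class_weight` using the `"balanced"` setting, on the same toy `y`. The variable names `manual` and `builtin` are illustrative only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([1, 0, 1, 1, 1, 1, 0, 0, 1, 1])
classes = np.unique(y)

# Manual calculation: n_samples / (n_classes * count of each class)
manual = float(len(y)) / (len(classes) * np.bincount(y))

# scikit-learn's built-in "balanced" heuristic
builtin = compute_class_weight(class_weight="balanced", classes=classes, y=y)

print(manual)
print(builtin)
```

Both arrays should agree with the result of the last cell above. Passing `class_weight="balanced"` to a scikit-learn classifier applies the same weights without computing them explicitly.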