# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when we have noisy data and are not careful, we can end up fitting our model to the noise in the data rather than the 'signal', that is, the factors that actually determine the outcome. This is called overfitting: the model performs well in training but poorly when it is applied to real data. Conversely, a model can be too simplistic to capture the signal (underfitting), which produces a model that never works well.


### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?

In [1]:
# Your code here
import pandas as pd
import numpy as np 

In [2]:
pay = pd.read_csv(r'C:\Users\tiina\Downloads\1069_1940_bundle_archive\fraud.csv')

### What is the distribution of the outcome? 

In [3]:
# Your response here
pay.isFraud.value_counts(normalize=True)

0    0.998709
1    0.001291
Name: isFraud, dtype: float64

Only about 0.13% of transactions are fraudulent, so the classes are heavily imbalanced.
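With a fraud rate this low, plain accuracy says very little. The sketch below illustrates the point on synthetic labels (not the PaySim data): 100,000 made-up rows with 129 positives, roughly the same 0.13% rate. A baseline that always predicts "not fraud" scores about 99.87% accuracy while catching no fraud at all. The array sizes and the use of `DummyClassifier` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels (assumption): 129 fraud cases in 100,000 rows, about 0.13%
n_rows, n_fraud = 100_000, 129
y_demo = np.zeros(n_rows, dtype=int)
y_demo[:n_fraud] = 1
X_demo = np.zeros((n_rows, 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class (0 = not fraud)
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_pred = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, baseline_pred)
baseline_recall = recall_score(y_demo, baseline_pred)
print(baseline_accuracy, baseline_recall)
```

Any model evaluated later in this lab should therefore be judged on recall, precision and F1 for the fraud class, not on accuracy alone.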

### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [4]:
# Your code here
pay.shape

(6362620, 11)

In [5]:
pay.describe()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0
mean,243.3972,179861.9,833883.1,855113.7,1100702.0,1224996.0,0.00129082,2.514687e-06
std,142.332,603858.2,2888243.0,2924049.0,3399180.0,3674129.0,0.0359048,0.001585775
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,156.0,13389.57,0.0,0.0,0.0,0.0,0.0,0.0
50%,239.0,74871.94,14208.0,0.0,132705.7,214661.4,0.0,0.0
75%,335.0,208721.5,107315.2,144258.4,943036.7,1111909.0,0.0,0.0
max,743.0,92445520.0,59585040.0,49585040.0,356015900.0,356179300.0,1.0,1.0


In [6]:
pay.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [7]:
pay.dtypes

step                int64
type               object
amount            float64
nameOrig           object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest           object
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [8]:
# Step data
print('The step column represents 1-hour intervals. There are about 30 days of data, but we do not know exactly what time the recording started, so it is hard to convert step to a datetime. We could instead map each step to an hour of the day and check whether that makes a difference.')

The step column represents 1-hour intervals. There are about 30 days of data, but we do not know exactly what time the recording started, so it is hard to convert step to a datetime. We could instead map each step to an hour of the day and check whether that makes a difference.
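One way to try the hour-of-day idea is sketched below on a toy frame. It assumes that step 1 corresponds to hour 0 of day 0, which the dataset does not confirm, so the derived hour is only a cyclic position and not a real clock time. The column names `hour_of_day` and `day` are hypothetical.

```python
import pandas as pd

# Toy frame with a few step values; in the lab this would be pay['step'] (1 to 743)
steps = pd.DataFrame({"step": [1, 24, 25, 743]})

# ASSUMPTION: step 1 is hour 0 of day 0; the true start time is unknown
steps["hour_of_day"] = (steps["step"] - 1) % 24
steps["day"] = (steps["step"] - 1) // 24

print(steps)
```

The same two lines could be applied to `pay` before the train/test split if the feature turns out to be useful.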


In [9]:
pay.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [10]:
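# Restrict to numeric columns; corr() raises on object columns in recent pandas versions
pay.select_dtypes('number').corr()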

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
step,1.0,0.022373,-0.010058,-0.010299,0.027665,0.025888,0.031578,0.003277
amount,0.022373,1.0,-0.002762,-0.007861,0.294137,0.459304,0.076688,0.012295
oldbalanceOrg,-0.010058,-0.002762,1.0,0.998803,0.066243,0.042029,0.010154,0.003835
newbalanceOrig,-0.010299,-0.007861,0.998803,1.0,0.067812,0.041837,-0.008148,0.003776
oldbalanceDest,0.027665,0.294137,0.066243,0.067812,1.0,0.976569,-0.005885,-0.000513
newbalanceDest,0.025888,0.459304,0.042029,0.041837,0.976569,1.0,0.000535,-0.000529
isFraud,0.031578,0.076688,0.010154,-0.008148,-0.005885,0.000535,1.0,0.044109
isFlaggedFraud,0.003277,0.012295,0.003835,0.003776,-0.000513,-0.000529,0.044109,1.0


In [11]:
pay.isFraud.value_counts()

0    6354407
1       8213
Name: isFraud, dtype: int64

In [12]:
# Deal with categorical data
dummies_type = pd.get_dummies(pay['type'], prefix='type')

In [13]:
pay = pd.concat([pay, dummies_type], axis=1)

In [14]:
# Drop columns 
pay.drop(columns=['nameOrig', 'nameDest', 'type'], inplace=True)

# Train Test Split and Resampling

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
X = pay.drop(columns=['isFraud'])
y = pay['isFraud']

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
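With so few positive cases, a purely random split can leave slightly different fraud rates in the train and test sets, and the results change on every run. A minimal sketch of a stratified, reproducible split on synthetic data follows; the toy frame, the 1% fraud rate and the seed value are assumptions for illustration, and the outputs reported in this notebook were produced with the unstratified split above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 10,000 rows, 100 of them fraud
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"amount": rng.random(10_000)})
y_demo = pd.Series([1] * 100 + [0] * 9_900, name="isFraud")

# stratify keeps the class ratio identical in both parts;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(y_tr.mean(), y_te.mean())
```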

In [17]:
from sklearn.utils import resample

In [18]:

# concatenate our training data back together
X = pd.concat([X_train, y_train], axis=1)

# separate minority and majority classes
not_fraud = X[X.isFraud==0]
fraud = X[X.isFraud==1]

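# upsample minority
# note: 10,000 rows is still tiny next to the majority class, so the training set stays heavily imbalanced
fraud_upsampled = resample(fraud,
                          replace=True, # sample with replacement
                          n_samples=10000
                          ) 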

# combine majority and upsampled minority
upsampled = pd.concat([not_fraud, fraud_upsampled])

X_train = upsampled.drop('isFraud', axis=1)
y_train = upsampled.isFraud

In [19]:
# downsampling 
not_fraud_downsampled = resample(not_fraud,
                                replace = False, # sample without replacement
                                n_samples = len(fraud)
                                )

# combine minority and downsampled majority
downsampled = pd.concat([not_fraud_downsampled, fraud])
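For comparison, the sketch below shows both resampling directions taken all the way to a 50/50 class balance on a toy frame. It is not the setting used in this notebook (which upsamples fraud to a fixed 10,000 rows); the toy data and the `random_state` values are assumptions added so that the example is reproducible.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training frame (assumption): 10 fraud rows, 990 non-fraud rows
toy = pd.DataFrame({"amount": range(1000), "isFraud": [1] * 10 + [0] * 990})
majority = toy[toy.isFraud == 0]
minority = toy[toy.isFraud == 1]

# Upsample: draw minority rows with replacement until they match the majority
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
balanced_up = pd.concat([majority, minority_up])

# Downsample: draw majority rows without replacement down to the minority size
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=0)
balanced_down = pd.concat([majority_down, minority])

print(balanced_up.isFraud.value_counts().to_dict())
print(balanced_down.isFraud.value_counts().to_dict())
```

Resampling should only ever be applied to the training portion, as is done above, so that the test set keeps the real class distribution.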

### Run a logistic regression classifier and evaluate its accuracy.

In [20]:
# Your code here
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import r2_score, confusion_matrix, accuracy_score, f1_score, recall_score


In [21]:
# Upsampled model
X_train = upsampled.drop('isFraud', axis=1)
y_train = upsampled.isFraud

upsample = LogisticRegression().fit(X_train, y_train)

upsampled_pred = upsample.predict(X_test)

# Checking accuracy
print(accuracy_score(y_test, upsampled_pred))
    
# f1 score
print(f1_score(y_test, upsampled_pred))
    
print(recall_score(y_test, upsampled_pred))

print(confusion_matrix(y_test, upsampled_pred))

print(r2_score(y_test, upsampled_pred))

0.9979238112601413
0.37423022264329703
0.4767652383826192
[[1269092    1775]
 [    867     790]]
-0.596526692969874
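The printed metrics can be reproduced by hand from the confusion matrix, which also yields the precision that the cell above does not report. The sketch assumes scikit-learn's layout for binary labels, where rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
import numpy as np

# Confusion matrix printed above for the upsampled logistic regression
cm = np.array([[1269092, 1775],
               [867, 790]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all predictions that are correct
precision = tp / (tp + fp)           # share of flagged transactions that are fraud
recall = tp / (tp + fn)              # share of fraud cases that were caught
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Here precision comes out at roughly 0.31, so about two out of three transactions flagged by this model are false alarms, and fewer than half of the actual fraud cases are detected.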


In [22]:
# Downsampled model
X_train = downsampled.drop('isFraud', axis=1)
y_train = downsampled.isFraud

downsample = LogisticRegression().fit(X_train, y_train)

downsampled_pred = downsample.predict(X_test)

# Checking accuracy
print(accuracy_score(y_test, downsampled_pred))
    
# f1 score
print(f1_score(y_test, downsampled_pred))
    
print(recall_score(y_test, downsampled_pred))

print(confusion_matrix(y_test, downsampled_pred))

print(r2_score(y_test, downsampled_pred))

0.9102350918332385
0.025474772638080775
0.9010259505129753
[[1156803  114064]
 [    164    1493]]
-68.02651441505026


### Now pick a model of your choice and evaluate its accuracy.

In [23]:
# Your code here

from sklearn.ensemble import RandomForestClassifier



In [24]:
# train model
# Upsampled model
X_train = upsampled.drop('isFraud', axis=1)
y_train = upsampled.isFraud

rfc = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)

# predict on test set
rfc_pred = rfc.predict(X_test)

print(accuracy_score(y_test, rfc_pred))


print(f1_score(y_test, rfc_pred))

print(recall_score(y_test, rfc_pred))
      
print(confusion_matrix(y_test, rfc_pred))

print(r2_score(y_test, rfc_pred))


0.9996707331256621
0.8582064297800338
0.7652383826191913
[[1270837      30]
 [    389    1268]]
0.7468036773829003


In [25]:
# Downsampled model
# train model
X_train = downsampled.drop('isFraud', axis=1)
y_train = downsampled.isFraud

rfc = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)

# predict on test set
rfc_pred = rfc.predict(X_test)

print(accuracy_score(y_test, rfc_pred))


print(f1_score(y_test, rfc_pred))

print(recall_score(y_test, rfc_pred))
      
print(confusion_matrix(y_test, rfc_pred))

print(r2_score(y_test, rfc_pred))

0.9886784060654259
0.18655073118400994
0.9969824984912492
[[1256465   14402]
 [      5    1652]]
-7.705965202731633


### Which model worked better and how do you know?

In [26]:
# Your response here
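print('Random forest with upsampling works better than logistic regression: its F1 score is 0.86, versus 0.37 for logistic regression trained on the same upsampled data. The r2 score is also far worse for logistic regression, although r2 is a regression metric and not a natural fit for classification.')

Random forest with upsampling works better than logistic regression: its F1 score is 0.86, versus 0.37 for logistic regression trained on the same upsampled data. The r2 score is also far worse for logistic regression, although r2 is a regression metric and not a natural fit for classification.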
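To make the comparison easier to read, the four confusion matrices printed earlier can be collected into one table. This is only a convenience sketch: the numbers are copied from the outputs above, and the model labels and column names (`false_alarms`, `missed_fraud`) are hypothetical.

```python
import pandas as pd

# Confusion matrices copied from the outputs above, in [[TN, FP], [FN, TP]] layout
matrices = {
    "LogReg, upsampled": [[1269092, 1775], [867, 790]],
    "LogReg, downsampled": [[1156803, 114064], [164, 1493]],
    "RandomForest, upsampled": [[1270837, 30], [389, 1268]],
    "RandomForest, downsampled": [[1256465, 14402], [5, 1652]],
}

rows = []
for name, ((tn, fp), (fn, tp)) in matrices.items():
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    rows.append({
        "model": name,
        "false_alarms": fp,    # legitimate transactions flagged as fraud
        "missed_fraud": fn,    # fraud cases the model did not catch
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary.round(4))
```

The table shows the trade-off behind the answer above: the downsampled models catch the most fraud but raise many more false alarms, while the upsampled random forest has the best balance of the two as measured by F1.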
