# Imbalanced Classes
## In this lab, we are going to explore a case of imbalanced classes. 


As we discussed in class, when the data are noisy and we are not careful, we can end up fitting our model to the noise rather than the 'signal': the factors that actually determine the outcome. This is called overfitting; it produces good results in training and poor results when the model is applied to new data. Conversely, a model can be too simplistic to capture the signal (underfitting), in which case it performs poorly on both the training data and new data.


### First, download the data from: https://www.kaggle.com/ntnu-testimon/paysim1. Import the dataset and provide some descriptive statistics and plots. What do you think will be the important features in determining the outcome?

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from scipy import stats
from sklearn.linear_model import LinearRegression
import seaborn as sns

In [3]:
# Your code here
data = pd.read_csv('/Users/erinberardi/Downloads/PS_20174392719_1491204439457_log.csv')
data.head()

Unnamed: 0,step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
2,1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
3,1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0


In [4]:
data.shape

(6362620, 11)

In [5]:
data.describe()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
count,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0,6362620.0
mean,243.3972,179861.9,833883.1,855113.7,1100702.0,1224996.0,0.00129082,2.514687e-06
std,142.332,603858.2,2888243.0,2924049.0,3399180.0,3674129.0,0.0359048,0.001585775
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,156.0,13389.57,0.0,0.0,0.0,0.0,0.0,0.0
50%,239.0,74871.94,14208.0,0.0,132705.7,214661.4,0.0,0.0
75%,335.0,208721.5,107315.2,144258.4,943036.7,1111909.0,0.0,0.0
max,743.0,92445520.0,59585040.0,49585040.0,356015900.0,356179300.0,1.0,1.0


In [51]:
data.dtypes

step                int64
type               object
amount            float64
oldbalanceOrg     float64
newbalanceOrig    float64
oldbalanceDest    float64
newbalanceDest    float64
isFraud             int64
isFlaggedFraud      int64
dtype: object

In [6]:
data.isnull().sum()

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64

In [None]:
# Box plots only make sense for numeric columns; 'type', 'nameOrig', and
# 'nameDest' hold strings and would raise an error in boxplot().
numeric_cols = data.select_dtypes(include='number').columns

fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(12, 9))
for ax, col in zip(axes.ravel(), numeric_cols):
    ax.boxplot(data[col])
    ax.set_title(col)

# Hide any panels left over after the numeric columns are plotted.
for ax in axes.ravel()[len(numeric_cols):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()


In [21]:
data['type'].value_counts()
data_dummy = pd.get_dummies(data['type'],drop_first=True)

In [31]:
data_dummy.head()

Unnamed: 0,CASH_OUT,DEBIT,PAYMENT,TRANSFER
0,0,0,1,0
1,0,0,1,0
2,0,0,0,1
3,1,0,0,0
4,0,0,1,0
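
With `drop_first=True`, `pd.get_dummies` omits the first category in sorted order, so a row of all zeros stands for that baseline category. The sketch below uses hypothetical toy rows and assumes the dropped category is `CASH_IN`, which is consistent with the four columns shown above. The `dtype=int` argument is an addition that keeps the 0/1 output on newer pandas versions, which otherwise return booleans.

```python
import pandas as pd

# Toy data (hypothetical rows); the five labels mirror the 'type' values.
toy = pd.Series(['PAYMENT', 'TRANSFER', 'CASH_OUT', 'CASH_IN', 'DEBIT'], name='type')

# drop_first=True drops the alphabetically first category (CASH_IN here);
# dtype=int forces 0/1 output instead of True/False.
toy_dummy = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(toy_dummy.columns))
# The CASH_IN row (index 3) is encoded as all zeros.
print(toy_dummy.loc[3].tolist())
```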


In [30]:
# Drop the account identifier columns; errors='ignore' makes the cell safe to re-run.
data.drop(columns=['nameOrig', 'nameDest'], inplace=True, errors='ignore')
data.head()

Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,1,PAYMENT,9839.64,170136.0,160296.36,0.0,0.0,0,0
1,1,PAYMENT,1864.28,21249.0,19384.72,0.0,0.0,0,0
2,1,TRANSFER,181.0,181.0,0.0,0.0,0.0,1,0
3,1,CASH_OUT,181.0,181.0,0.0,21182.0,0.0,1,0
4,1,PAYMENT,11668.14,41554.0,29885.86,0.0,0.0,0,0


In [37]:
data_merge = data.merge(data_dummy,left_index=True, right_index=True)
data_merge

Unnamed: 0,step,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,CASH_OUT,DEBIT,PAYMENT,TRANSFER
0,1,PAYMENT,9839.64,170136.00,160296.36,0.00,0.00,0,0,0,0,1,0
1,1,PAYMENT,1864.28,21249.00,19384.72,0.00,0.00,0,0,0,0,1,0
2,1,TRANSFER,181.00,181.00,0.00,0.00,0.00,1,0,0,0,0,1
3,1,CASH_OUT,181.00,181.00,0.00,21182.00,0.00,1,0,1,0,0,0
4,1,PAYMENT,11668.14,41554.00,29885.86,0.00,0.00,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6362615,743,CASH_OUT,339682.13,339682.13,0.00,0.00,339682.13,1,0,1,0,0,0
6362616,743,TRANSFER,6311409.28,6311409.28,0.00,0.00,0.00,1,0,0,0,0,1
6362617,743,CASH_OUT,6311409.28,6311409.28,0.00,68488.84,6379898.11,1,0,1,0,0,0
6362618,743,TRANSFER,850002.52,850002.52,0.00,0.00,0.00,1,0,0,0,0,1


In [38]:
data_merge.drop(['type'],inplace = True, axis = 1)

In [39]:
data_merge.head()

Unnamed: 0,step,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,CASH_OUT,DEBIT,PAYMENT,TRANSFER
0,1,9839.64,170136.0,160296.36,0.0,0.0,0,0,0,0,1,0
1,1,1864.28,21249.0,19384.72,0.0,0.0,0,0,0,0,1,0
2,1,181.0,181.0,0.0,0.0,0.0,1,0,0,0,0,1
3,1,181.0,181.0,0.0,21182.0,0.0,1,0,1,0,0,0
4,1,11668.14,41554.0,29885.86,0.0,0.0,0,0,0,0,1,0


### What is the distribution of the outcome? 

In [53]:
# Your response here
print(data_merge['isFraud'].value_counts())
print(data_merge['isFlaggedFraud'].value_counts())

0    6354407
1       8213
Name: isFraud, dtype: int64
0    6362604
1         16
Name: isFlaggedFraud, dtype: int64
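
To put these counts in perspective, the short calculation below turns them into a fraud rate and into the accuracy of a trivial classifier that labels every transaction as non-fraud. It is only arithmetic on the numbers printed above; the variable names are illustrative.

```python
# Class counts copied from the isFraud value_counts() output above.
n_legit = 6354407
n_fraud = 8213
n_total = n_legit + n_fraud

fraud_rate = n_fraud / n_total
# Accuracy of a trivial "model" that predicts non-fraud for every row.
baseline_accuracy = n_legit / n_total

print(f"fraud rate:        {fraud_rate:.6f}")
print(f"baseline accuracy: {baseline_accuracy:.6f}")
print(f"legit : fraud  ~ {n_legit / n_fraud:.0f} : 1")
```

Any accuracy score on this dataset should be read against that baseline rather than against 50%.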


### Clean the dataset. How are you going to integrate the time variable? Do you think the step (integer) coding in which it is given is appropriate?

In [None]:
# Your code here
'''I am keeping the time step coding as is'''
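
One alternative to keeping `step` as a raw integer is to derive calendar-style features from it. The sketch below assumes that one step corresponds to one hour of simulated time (as the dataset description on Kaggle states); the column names `hour_of_day` and `day` are hypothetical, and which clock hour step 1 maps to is not known from the data shown here.

```python
import pandas as pd

# Toy frame with hypothetical step values; in the lab this would be data_merge.
toy = pd.DataFrame({'step': [1, 23, 24, 25, 743]})

# Assumption: one step is one hour, so the hour wraps around every 24 steps.
toy['hour_of_day'] = toy['step'] % 24
toy['day'] = toy['step'] // 24

print(toy)
```

Whether these features help a given model is an empirical question; they mainly let a linear model pick up daily patterns that a monotone `step` value cannot express.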

### Run a logistic regression classifier and evaluate its accuracy.

In [47]:
# Your code here
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [54]:
#split data to X and y
X = data_merge.loc[:,data_merge.columns !='isFraud']
y = data_merge['isFraud']

In [55]:
#split further into train test split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2)
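
With so few positive cases, a purely random split can leave the train and test sets with slightly different fraud rates, and the results change from run to run. A common remedy is a stratified, seeded split. The sketch below is a minimal illustration on hypothetical toy arrays, not the split used for the results in this lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows with 10 positives (1%), standing in for X and y.
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([0] * 990 + [1] * 10)

# stratify=y_toy preserves the class ratio in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print("positives in train:", y_tr.sum(), "of", len(y_tr))
print("positives in test: ", y_te.sum(), "of", len(y_te))
```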

In [56]:
#fit model
data_model = LogisticRegression()
data_model.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [60]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score

#make predictions on test and check accuracy
y_pred = data_model.predict(X_test)
acc = accuracy_score(y_test,y_pred)
f1 = f1_score(y_test,y_pred)
recall = recall_score(y_test,y_pred)

print('WOAH!  That is a pretty high accuracy score!',acc)
print('f1 score = ',f1)
print('recall score =  ',recall)

WOAH!  That is a pretty high accuracy score! 0.998306515240577
f1 score =  0.3900367959241438
recall score =   0.42426108374384236


### Now pick a model of your choice and evaluate its accuracy.

In [61]:
# Your code here
from sklearn.ensemble import RandomForestClassifier

# train model
rfc = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)

# predict on test set
rfc_pred = rfc.predict(X_test)

acc2 = accuracy_score(y_test, rfc_pred)
f12 =  f1_score(y_test, rfc_pred)
recall2 = recall_score(y_test, rfc_pred)

print('Accuracy =  ',acc2)
print('f1 score = ',f12)
print('recall score =  ',recall2)

Accuracy =   0.9996903791205509
f1 score =  0.8643250688705234
recall score =   0.7727832512315271
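
Both models above are trained with the default `class_weight=None`, so every misclassified fraud costs the same as a misclassified legitimate transaction. One standard option for imbalanced data is `class_weight='balanced'`, which `LogisticRegression` and `RandomForestClassifier` both accept. The sketch below compares the two settings on synthetic data generated with `make_classification`; it is an illustration under that assumption and says nothing about how the setting would perform on the PaySim data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data: roughly 2% positives.
X_syn, y_syn = make_classification(
    n_samples=5000, n_features=8, n_informative=4,
    weights=[0.98, 0.02], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

results = {}
for label, weight in [("unweighted", None), ("balanced", "balanced")]:
    clf = LogisticRegression(class_weight=weight, max_iter=1000)
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    # zero_division=0 avoids a warning if a model predicts no positives.
    results[label] = (
        recall_score(y_te, pred),
        precision_score(y_te, pred, zero_division=0),
    )
    print(label, "recall = %.3f, precision = %.3f" % results[label])
```

Balanced weighting typically raises recall at the expense of precision, so the two should be reported together.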


### Which model worked better and how do you know?

In [None]:
# Your response here
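'''Random Forest worked better: its f1 score (0.86 vs. 0.39) and recall (0.77 vs. 0.42)
are much higher than those of logistic regression. Accuracy is nearly identical for the
two models (0.9997 vs. 0.9983) and is not informative here, because about 99.87% of the
transactions are not fraud, so a model that never predicts fraud would score about 0.9987.'''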