# IEEE-CIS Fraud Detection



### Data glossary
### Transaction table

- TransactionDT: timedelta from a given reference datetime (not an actual timestamp)

- TransactionAmt: transaction payment amount in USD

- ProductCD: product code, the product for each transaction

- card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.

- addr: address

- dist: distance

- P_emaildomain and R_emaildomain: purchaser and recipient email domain

- C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.

- D1-D15: timedelta, such as days between previous transaction, etc.

- M1-M9: match, such as names on card and address, etc.

- Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.

### Identity Table
- DeviceType
- DeviceInfo
- id_12 - id_38

# 1. Loading Libraries

In [2]:
import numpy as np
import pandas as pd
import random
from collections import Counter 

# visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# suppress warnings in the notebook
import warnings
warnings.filterwarnings('ignore')


# 2. Loading Data

In [4]:
fraud_data=pd.read_csv("https://raw.githubusercontent.com/dphi-official/Imbalanced_classes/master/fraud_data.csv")

# 3. Exploratory Data Analysis

In [5]:
fraud_data.head()

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38,DeviceType,DeviceInfo
0,2994681,0,242834,25.0,H,9803,583.0,150.0,visa,226.0,...,firefox 56.0,24.0,1920x1080,match_status:2,T,F,T,T,desktop,rv:56.0
1,3557242,0,15123000,117.0,W,7919,194.0,150.0,mastercard,166.0,...,,,,,,,,,,
2,3327470,0,8378575,73.773,C,12778,500.0,185.0,mastercard,224.0,...,,,,,,,,,,
3,3118781,0,2607840,400.0,R,12316,548.0,150.0,visa,195.0,...,mobile safari generic,32.0,1136x640,match_status:2,T,F,T,F,mobile,iOS Device
4,3459772,0,12226544,31.95,W,9002,453.0,150.0,visa,226.0,...,,,,,,,,,,


In [6]:
fraud_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59054 entries, 0 to 59053
Columns: 434 entries, TransactionID to DeviceInfo
dtypes: float64(385), int64(18), object(31)
memory usage: 195.5+ MB


There are 434 columns with 59054 observations.

In [7]:
fraud_data.describe()

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,card1,card2,card3,card5,addr1,addr2,...,id_17,id_18,id_19,id_20,id_21,id_22,id_24,id_25,id_26,id_32
count,59054.0,59054.0,59054.0,59054.0,59054.0,58139.0,58896.0,58610.0,52326.0,52326.0,...,14061.0,4511.0,14059.0,14054.0,525.0,525.0,487.0,524.0,525.0,7715.0
mean,3282166.0,0.033952,7368220.0,134.142888,9910.36636,362.438054,153.264551,199.104231,290.653939,86.806616,...,189.89958,14.23875,351.767622,404.594777,385.257143,15.748571,12.73922,326.225191,148.794286,26.499028
std,170257.3,0.181107,4612063.0,233.112295,4893.704524,157.360648,11.395609,41.296438,101.796538,2.639572,...,30.34787,1.524658,141.600677,152.201538,213.565534,6.496154,2.275238,97.662855,31.168092,3.73914
min,2987019.0,0.0,86730.0,0.292,1008.0,100.0,100.0,100.0,100.0,13.0,...,100.0,11.0,100.0,100.0,114.0,14.0,11.0,100.0,100.0,0.0
25%,3135748.0,0.0,3074217.0,42.95,6019.0,215.0,150.0,166.0,204.0,87.0,...,166.0,13.0,266.0,256.0,252.0,14.0,11.0,321.0,119.0,24.0
50%,3282062.0,0.0,7288450.0,68.017,9749.0,361.0,150.0,226.0,299.0,87.0,...,166.0,15.0,339.0,484.0,252.0,14.0,11.0,321.0,147.0,24.0
75%,3429699.0,0.0,11239180.0,117.0,14223.0,512.0,150.0,226.0,330.0,87.0,...,225.0,15.0,427.0,533.0,554.0,14.0,15.0,361.0,169.0,32.0
max,3577536.0,1.0,15811050.0,5279.95,18390.0,600.0,229.0,237.0,536.0,102.0,...,225.0,29.0,670.0,660.0,854.0,43.0,24.0,548.0,216.0,32.0


In [8]:
# Take a look at the target variable
fraud_data.isFraud.value_counts()

0    57049
1     2005
Name: isFraud, dtype: int64

There are 2,005 fraudulent transactions.

In [9]:
# normalize=True returns the proportion of each class instead of the raw counts
fraud_data.isFraud.value_counts(normalize=True)

0    0.966048
1    0.033952
Name: isFraud, dtype: float64

#sns.countplot(fraud_data.isFraud)

Only about 3.4% of the transactions are fraudulent; the remaining 96.6% are not. This is a clear class imbalance problem.
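To make the imbalance concrete, the short sketch below recomputes the class proportions from the counts printed above and derives the accuracy of a trivial model that labels every transaction as non-fraud. The variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
n_legit = 57049
n_fraud = 2005
total = n_legit + n_fraud

# Share of fraudulent transactions in the data.
fraud_rate = n_fraud / total

# Accuracy of a trivial model that always predicts "not fraud".
baseline_accuracy = n_legit / total

print(f"fraud rate: {fraud_rate:.4f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any accuracy reported later is best compared against this baseline rather than against 0%.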

# 4. Data Preparation

In [11]:
# Missing value 

def miss_val_info(df):
  """
  Take a dataframe and return the count and percentage of missing values for each column that has at least one missing value.
  """
  missing_count = df.isnull().sum().sort_values(ascending = False)
  missing_percent = round(missing_count / len(df) * 100, 2)
  missing_info = pd.concat([missing_count, missing_percent], axis = 1, keys=['Missing Value Count','Percent of missing values'])
  return missing_info[missing_info['Missing Value Count'] != 0]


In [12]:
miss_val_info(fraud_data) 

Unnamed: 0,Missing Value Count,Percent of missing values
id_24,58567,99.18
id_25,58530,99.11
id_27,58529,99.11
id_21,58529,99.11
id_22,58529,99.11
...,...,...
V309,3,0.01
V308,3,0.01
V307,3,0.01
V306,3,0.01


Out of 434 columns, 414 have at least one missing value.

In [14]:
# Keep only the columns with less than 20% missing values

fraud_data = fraud_data[fraud_data.columns[fraud_data.isnull().mean() < 0.2]]

In [15]:
# fill missing values in numerical columns with the column mean
num_cols= fraud_data.select_dtypes(include=np.number).columns
fraud_data[num_cols]= fraud_data[num_cols].fillna(fraud_data[num_cols].mean())

In [16]:
# fill missing values in categorical columns with the column mode
cat_cols = fraud_data.select_dtypes(include="object").columns
fraud_data[cat_cols] = fraud_data[cat_cols].fillna(fraud_data[cat_cols].mode().iloc[0])
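The three preparation steps above (drop sparse columns, fill numeric gaps with the mean, fill categorical gaps with the mode) can be checked on a small frame where the expected result is easy to see. This is a sketch on a hypothetical toy dataframe; the column names `amount`, `sparse`, and `card` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
toy = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 20.0, 40.0, 50.0],        # 1 of 6 missing: kept
    "sparse": [np.nan, np.nan, np.nan, 1.0, np.nan, 2.0],    # 4 of 6 missing: dropped
    "card": ["visa", "visa", None, "amex", "visa", "amex"],  # 1 of 6 missing: kept
})

# Keep columns whose missing fraction is below 20%.
kept = toy.loc[:, toy.isnull().mean() < 0.2].copy()

num_cols = kept.select_dtypes(include=np.number).columns
cat_cols = kept.select_dtypes(exclude=np.number).columns

# Mean for numeric columns, most frequent value for categorical columns.
kept[num_cols] = kept[num_cols].fillna(kept[num_cols].mean())
kept[cat_cols] = kept[cat_cols].fillna(kept[cat_cols].mode().iloc[0])

print(kept)
```

Here the fill values are computed on the whole frame, as in the notebook. A stricter setup would compute them on the training split only, so that no test-set statistics reach the model.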

In [17]:
miss_val_info(fraud_data)

Unnamed: 0,Missing Value Count,Percent of missing values


In [18]:
fraud_data.head()

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V312,V313,V314,V315,V316,V317,V318,V319,V320,V321
0,2994681,0,242834,25.0,H,9803,583.0,150.0,visa,226.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,3557242,0,15123000,117.0,W,7919,194.0,150.0,mastercard,166.0,...,234.0,0.0,225.5,0.0,288.0,1707.0,1707.0,0.0,0.0,0.0
2,3327470,0,8378575,73.773,C,12778,500.0,185.0,mastercard,224.0,...,0.0,0.0,73.772797,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3118781,0,2607840,400.0,R,12316,548.0,150.0,visa,195.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,3459772,0,12226544,31.95,W,9002,453.0,150.0,visa,226.0,...,0.0,0.0,99.900002,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
fraud_data = pd.get_dummies(fraud_data, columns=cat_cols) 

In [20]:
fraud_data.head()

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,card1,card2,card3,card5,addr1,addr2,...,P_emaildomain_web.de,P_emaildomain_windstream.net,P_emaildomain_yahoo.co.jp,P_emaildomain_yahoo.co.uk,P_emaildomain_yahoo.com,P_emaildomain_yahoo.com.mx,P_emaildomain_yahoo.de,P_emaildomain_yahoo.es,P_emaildomain_yahoo.fr,P_emaildomain_ymail.com
0,2994681,0,242834,25.0,9803,583.0,150.0,226.0,269.0,87.0,...,0,0,0,0,1,0,0,0,0,0
1,3557242,0,15123000,117.0,7919,194.0,150.0,166.0,181.0,87.0,...,0,0,0,0,0,0,0,0,0,0
2,3327470,0,8378575,73.773,12778,500.0,185.0,224.0,284.0,60.0,...,0,0,0,0,0,0,0,0,0,0
3,3118781,0,2607840,400.0,12316,548.0,150.0,195.0,441.0,87.0,...,0,0,0,0,0,0,0,0,0,0
4,3459772,0,12226544,31.95,9002,453.0,150.0,226.0,264.0,87.0,...,0,0,0,0,1,0,0,0,0,0


In [23]:
# Separate input features and output feature
X = fraud_data.drop(columns = ['isFraud'])       # input features
Y = fraud_data.isFraud      # output feature

from sklearn.model_selection import train_test_split

# Split randomly into 70% train data and 30% test data
X_train, X_Test, Y_train, Y_Test = train_test_split(X, Y, test_size = 0.3, random_state = 123)

In [26]:
!pip install imblearn

Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Collecting imbalanced-learn
  Downloading imbalanced_learn-0.7.0-py3-none-any.whl (167 kB)
Installing collected packages: imbalanced-learn, imblearn
Successfully installed imbalanced-learn-0.7.0 imblearn-0.0


In [28]:
# Dealing with imbalanced data
# import SMOTE

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state = 25, sampling_strategy = 1.0)   # 1.0 oversamples the minority class until both classes are the same size

In [29]:
# resample the training data only (fit_resample replaces the deprecated fit_sample)
X_train, Y_train = sm.fit_resample(X_train, Y_train)

In [30]:
np.unique(Y_train, return_counts=True)

(array([0, 1], dtype=int64), array([39944, 39944], dtype=int64))
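For intuition on where the extra minority rows come from, here is a minimal sketch of the interpolation idea behind SMOTE: pick a minority point, pick one of its nearest minority neighbours, and create a new point somewhere on the segment between them. This is a simplified illustration, not imbalanced-learn's implementation; the function `smote_like_sample` and the toy points are hypothetical.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point between a minority point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))

# Hypothetical two-dimensional minority class.
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
rng = random.Random(25)
synthetic = [smote_like_sample(minority_points, k=2, rng=rng) for _ in range(5)]

for point in synthetic:
    print(point)
```

Because every synthetic point lies between two real minority points, the new rows stay inside the region already occupied by the minority class.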

# 5. Building Random Forest Model

In [31]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(criterion='entropy')

In [33]:
rfc.fit(X_train, Y_train)

RandomForestClassifier(criterion='entropy')

In [34]:
rfc.score(X_train, Y_train)

0.9999874824754656

# 6. Feature Selection

In [35]:
from sklearn.feature_selection import SelectKBest, f_classif

In [36]:
selector = SelectKBest(f_classif, k=10)

In [37]:
X_new = selector.fit_transform(X, Y)

In [39]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_new, Y, test_size = 0.2, random_state=42)

In [40]:
rfc.fit(X_train, Y_train)

RandomForestClassifier(criterion='entropy')

In [41]:
rfc.score(X_train,Y_train)

0.9696674639629151

Note that the model trained on all 249 features scored about 99.99% on its SMOTE-resampled training data, while the model retrained on only the 10 selected features scored about 96.97% on its training split. Both figures are training accuracy, and a model that always predicts "not fraud" would already reach about 96.6% on this data, so they should be read with caution. Even so, going from 249 features to 10 greatly reduces the computational cost, which matters because our aim is not only to increase model performance at any cost but also to keep the model computationally manageable.
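In the cells above, the selector is fitted on the full `X` and `Y` before the split, so the held-out rows influence which features are chosen. One common way to avoid this is to put the selector and the classifier in a scikit-learn `Pipeline` and fit it on the training split only. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the fraud data; the dataset shape and the `n_estimators=50` setting are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic, imbalanced stand-in for the fraud data (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=40, n_informative=8,
    weights=[0.95, 0.05], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The selector is fitted inside the pipeline, on the training split only.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("rf", RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0)),
])
pipe.fit(X_tr, y_tr)

train_acc = pipe.score(X_tr, y_tr)
test_acc = pipe.score(X_te, y_te)
n_selected = int(pipe.named_steps["select"].get_support().sum())

print("selected features:", n_selected)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

Reporting the score on `X_te` rather than on the training split also gives a fairer picture of how the reduced model generalizes.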

# 7. Cross Validation

In [44]:
# Use k-fold cross-validation
from sklearn.model_selection import cross_validate

In [45]:
cv_result = cross_validate(rfc,X_new, Y, cv=10, scoring=['accuracy','precision','recall'])
cv_result

{'fit_time': array([5.30673194, 4.97993612, 4.7560699 , 4.53021145, 4.84601331,
        4.95895386, 4.6641314 , 4.39828944, 5.1588192 , 4.57318163]),
 'score_time': array([0.15794945, 0.21586943, 0.17989731, 0.1589036 , 0.17789006,
        0.16889381, 0.16590357, 0.16090584, 0.15990639, 0.16489935]),
 'test_accuracy': array([0.9693532 , 0.9685066 , 0.96918388, 0.9685066 , 0.96782388,
        0.96883997, 0.96799323, 0.96833192, 0.9700254 , 0.96900931]),
 'test_precision': array([0.85714286, 0.75862069, 0.78787879, 0.8       , 0.64705882,
        0.75      , 0.67741935, 0.68571429, 0.81081081, 0.8       ]),
 'test_recall': array([0.11940299, 0.10945274, 0.12935323, 0.09950249, 0.11      ,
        0.12      , 0.105     , 0.12      , 0.15      , 0.11940299])}

In [46]:
print('Accuracy :', cv_result['test_accuracy'].mean())

Accuracy : 0.9687573996564295


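With 10-fold cross-validation, the mean accuracy is approximately 96.88%. Mean precision is about 0.76, but mean recall is only about 0.12, so most fraudulent transactions are still missed.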
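The `cross_validate` output also contains per-fold precision and recall, which say more about the fraud class than accuracy does. The sketch below averages the values printed above; the lists are copied from that output and the variable names are illustrative.

```python
from statistics import mean

# Per-fold scores copied from the cross_validate output above.
test_precision = [0.85714286, 0.75862069, 0.78787879, 0.8, 0.64705882,
                  0.75, 0.67741935, 0.68571429, 0.81081081, 0.8]
test_recall = [0.11940299, 0.10945274, 0.12935323, 0.09950249, 0.11,
               0.12, 0.105, 0.12, 0.15, 0.11940299]

mean_precision = mean(test_precision)
mean_recall = mean(test_recall)

print(f"mean precision: {mean_precision:.4f}")
print(f"mean recall:    {mean_recall:.4f}")
```

In the notebook itself, the same numbers are available as `cv_result['test_precision'].mean()` and `cv_result['test_recall'].mean()`.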

# 8. Hyperparameter Tuning

In [48]:
from sklearn.model_selection import GridSearchCV

In [50]:
from sklearn.ensemble import RandomForestClassifier

In [51]:
# Candidate hyperparameter values for the random forest

criterion = ['gini', 'entropy']        # measure of split quality

n_estimators = [100, 200, 300]       # number of trees in the forest

max_features = ['auto', 'sqrt']       # number of features to consider at every split ('auto' is the same as 'sqrt' for classifiers and was removed in scikit-learn 1.3)

max_depth = [10, 20]      # maximum number of levels in each tree

max_depth.append(None)     # None means no maximum depth

params = {'criterion': criterion,
          'n_estimators': n_estimators,
          'max_features': max_features,
          'max_depth': max_depth}

In [52]:
params

{'criterion': ['gini', 'entropy'],
 'n_estimators': [100, 200, 300],
 'max_features': ['auto', 'sqrt'],
 'max_depth': [10, 20, None]}
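Grid search fits one model per parameter combination per cross-validation fold, so the grid size determines the cost. The sketch below counts the combinations in the grid above with `itertools.product`. The fold count of 5 is an assumption: it is the `GridSearchCV` default in scikit-learn 0.22 and later, and the notebook does not set `cv` explicitly.

```python
from itertools import product

# Same grid as in the notebook.
params = {
    "criterion": ["gini", "entropy"],
    "n_estimators": [100, 200, 300],
    "max_features": ["auto", "sqrt"],
    "max_depth": [10, 20, None],
}

# Every combination of one value per hyperparameter.
combinations = list(product(*params.values()))

n_folds = 5  # assumed default for GridSearchCV (scikit-learn >= 0.22)
total_fits = len(combinations) * n_folds

print("parameter combinations:", len(combinations))
print("model fits during the search:", total_fits)
```

Since `'auto'` and `'sqrt'` behave the same for a classifier, half of these combinations are duplicates, and dropping one of the two values would halve the search time.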

In [53]:
gs = GridSearchCV(rfc, param_grid=params, n_jobs=2)

In [54]:
gs.fit(X_train,Y_train)

GridSearchCV(estimator=RandomForestClassifier(criterion='entropy'), n_jobs=2,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [10, 20, None],
                         'max_features': ['auto', 'sqrt'],
                         'n_estimators': [100, 200, 300]})

In [55]:
gs.best_params_

{'criterion': 'gini',
 'max_depth': 10,
 'max_features': 'auto',
 'n_estimators': 300}

In [56]:
gs.best_score_

0.968418616846677

In [57]:
gs.score(X_test,Y_test)

0.9689272711878757
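By default `GridSearchCV` ranks parameter combinations by the estimator's own `score` method, which is accuracy for a classifier. On imbalanced data it may be worth ranking by a metric that focuses on the minority class instead, using the `scoring` parameter. The sketch below is one possible setup on a synthetic dataset; the data, the tiny grid, and the choice of `"f1"` are assumptions for illustration, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the fraud data (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Rank candidates by F1 on the positive class instead of accuracy.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [5, None]},
    scoring="f1",
    cv=3,
)
search.fit(X_demo, y_demo)

best_depth = search.best_params_["max_depth"]
best_f1 = search.best_score_

print("best max_depth:", best_depth)
print("best mean F1:", best_f1)
```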

### Conclusion

- The dataset contained missing values. We removed the columns with 20% or more missing values and filled the remaining gaps with the mean for numerical columns and the mode for categorical columns.
- We observed that the dataset was imbalanced. We used SMOTE to generate synthetic minority-class samples and balance the training data.
- We built a Random Forest model that reached over 99.99% accuracy on the SMOTE-resampled training data.
- We then selected the 10 most important features using SelectKBest and f_classif. This reduced the model complexity considerably with only a small decrease in accuracy.
- Cross-validation and hyperparameter tuning gave roughly 96.9% accuracy. This is close to the 96.6% that a model predicting "not fraud" for every transaction would reach, and cross-validated recall was only about 12%, so accuracy alone overstates how well fraud is detected. Tuning changed the score very little compared with the untuned model, which suggests that the default hyperparameters were already adequate for this feature set.