# IEEE-CIS Credit Card Fraud Detection
Capstone Project 1 for Springboard.

Boris Iskra

## Description

Credit card fraud is an inclusive term for fraud committed using a payment card, such as a credit card or debit card. The purpose may be to obtain goods or services, or to make payment to another account which is controlled by a criminal. The Payment Card Industry Data Security Standard (PCI DSS) is the data security standard created to help businesses process card payments securely and reduce card fraud.

Credit card fraud can be authorised, where the genuine customer themselves processes a payment to another account which is controlled by a criminal, or unauthorised, where the account holder does not provide authorisation for the payment to proceed and the transaction is carried out by a third party. In 2018, unauthorised financial fraud losses across payment cards and remote banking totalled £844.8 million in the United Kingdom, whereas banks and card companies prevented £1.66 billion in unauthorised fraud in the same year. That is equivalent to £2 in every £3 of attempted fraud being stopped.

Credit cards are more secure than ever, with regulators, card providers and banks taking considerable time and effort to collaborate with investigators worldwide to ensure fraudsters aren't successful. Cardholders' money is usually protected from scammers by regulations that make the card provider and bank accountable. The technology and security measures behind credit cards are becoming increasingly sophisticated, making it harder for fraudsters to steal money.

(Source: Wikipedia https://en.wikipedia.org/wiki/Credit_card_fraud)

## The Data

The data comes from the Kaggle competition found at:

        https://www.kaggle.com/c/ieee-fraud-detection/data

Description of the data was given as follows:

The data is broken into two files, identity and transaction, which are joined by TransactionID. Not all transactions have corresponding identity information.


- TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
- TransactionAmt: transaction payment amount in USD
- ProductCD: product code, the product for each transaction
- card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
- addr: address
- dist: distance
- P_emaildomain and R_emaildomain: purchaser and recipient email domain
- C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
- D1-D15: timedelta, such as days between previous transaction, etc.
- M1-M9: match, such as names on card and address, etc.
- Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.

The Vxxx features are numbered V1 through V339.
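The join on TransactionID described above can be sketched with a left merge. This is only an illustration on two tiny made-up frames (the values and the single `DeviceType` row are invented for the example); it shows one way to keep every transaction while flagging the ones that have no identity record.

```python
import pandas as pd

# Tiny stand-in frames; the real files are far larger.
transaction = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "TransactionAmt": [68.5, 29.0, 59.0],
})
identity = pd.DataFrame({
    "TransactionID": [2],
    "DeviceType": ["mobile"],
})

# A left join keeps every transaction; rows without identity data get NaN.
# indicator=True adds a "_merge" column saying where each row came from.
merged = transaction.merge(identity, on="TransactionID", how="left", indicator=True)
has_identity = merged["_merge"].eq("both")

print(merged)
print("transactions with identity info:", int(has_identity.sum()))
```

A left join is used here because the transaction file carries the target column, so dropping unmatched transactions would discard labelled rows.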

        
### Data Wrangling and Cleaning

The data comes as .csv files, which are easily loaded into pandas dataframes.

For the moment I will look at the *Vxxx: Vesta engineered features* to find what kind of data they hold.

All these features contain numerical data, apart from missing values.

First, I'll try to separate categorical from non-categorical data. To do this, I'll look at the features that contain only integer values and consider those categorical; the rest will be considered non-categorical.
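As a rough sketch of this integer-only rule, the check can be written as a small vectorized helper. The helper name `is_integer_valued` and the toy columns `V_a` and `V_b` are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

def is_integer_valued(ser):
    """Return True if every non-missing value in the Series is a whole number."""
    vals = ser.dropna().to_numpy(dtype=float)
    return bool(np.all(vals == np.round(vals)))

toy = pd.DataFrame({
    "V_a": [0.0, 1.0, np.nan, 3.0],   # whole numbers plus a missing value
    "V_b": [0.5, 1.0, 2.25, np.nan],  # contains fractional values
})

categorical = [c for c in toy.columns if is_integer_valued(toy[c])]
non_categorical = [c for c in toy.columns if c not in categorical]

print("categorical:", categorical)
print("non_categorical:", non_categorical)
```

Integer-valued columns can also be counts rather than categories, so this rule is best read as a first-pass heuristic.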

Missing values in the categorical variables will be treated as a class of their own within each variable.
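One possible way to give missing values their own class is to replace them with a sentinel before converting to a categorical dtype. The sketch below assumes that -1 never occurs as a real value in the column, which would need to be checked for each feature; the Series `V_toy` is invented for the example.

```python
import numpy as np
import pandas as pd

ser = pd.Series([0.0, 1.0, np.nan, 1.0, np.nan], name="V_toy")

# Assumption: -1 is not a legitimate value here, so it can stand for "missing".
filled = ser.fillna(-1).astype(int).astype("category")

print(filled.cat.categories.tolist())
print(filled.value_counts().sort_index())
```

After this step the column has no NaNs left, and the sentinel appears as one more category alongside the observed values.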

### Feature Selection and Engineering




In [2]:
%matplotlib notebook
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np
from sklearn import tree
from functools import reduce
from collections import Counter

%matplotlib inline

In [3]:
df = pd.read_csv('ieee-data/train_transaction.csv') #,nrows=50000)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590540 entries, 0 to 590539
Columns: 394 entries, TransactionID to V339
dtypes: float64(376), int64(4), object(14)
memory usage: 1.7+ GB


In [4]:
#list(df.columns)

In [5]:
# Categorical features from the transaction file
original_categorical = ['TransactionID', 'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6', 'addr1', 'addr2', 'P_emaildomain', 'R_emaildomain', 'M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9']
# Categorical features from the identity file: DeviceType, DeviceInfo, id_12 - id_38
#categorical.extend(['DeviceType', 'DeviceInfo', 'id12 - id38']) need to fix
df[original_categorical].head()

Unnamed: 0,TransactionID,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,...,R_emaildomain,M1,M2,M3,M4,M5,M6,M7,M8,M9
0,2987000,W,13926,,150.0,discover,142.0,credit,315.0,87.0,...,,T,T,T,M2,F,T,,,
1,2987001,W,2755,404.0,150.0,mastercard,102.0,credit,325.0,87.0,...,,,,,M0,T,T,,,
2,2987002,W,4663,490.0,150.0,visa,166.0,debit,330.0,87.0,...,,T,T,T,M0,F,F,F,F,F
3,2987003,W,18132,567.0,150.0,mastercard,117.0,debit,476.0,87.0,...,,,,,M0,T,F,,,
4,2987004,H,4497,514.0,150.0,mastercard,102.0,credit,420.0,87.0,...,,,,,,,,,,


In [6]:
df.describe()

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,card1,card2,card3,card5,addr1,addr2,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
count,590540.0,590540.0,590540.0,590540.0,590540.0,581607.0,588975.0,586281.0,524834.0,524834.0,...,82351.0,82351.0,82351.0,82351.0,82351.0,82351.0,82351.0,82351.0,82351.0,82351.0
mean,3282270.0,0.03499,7372311.0,135.027176,9898.734658,362.555488,153.194925,199.278897,290.733794,86.80063,...,0.775874,721.741883,1375.783644,1014.622782,9.807015,59.16455,28.530903,55.352422,151.160542,100.700882
std,170474.4,0.183755,4617224.0,239.162522,4901.170153,157.793246,11.336444,41.244453,101.741072,2.690623,...,4.727971,6217.223583,11169.275702,7955.735482,243.861391,387.62948,274.57692,668.486833,1095.034387,814.946722
min,2987000.0,0.0,86400.0,0.251,1000.0,100.0,100.0,100.0,100.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3134635.0,0.0,3027058.0,43.321,6019.0,214.0,150.0,166.0,204.0,87.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3282270.0,0.0,7306528.0,68.769,9678.0,361.0,150.0,226.0,299.0,87.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,3429904.0,0.0,11246620.0,125.0,14184.0,512.0,150.0,226.0,330.0,87.0,...,0.0,0.0,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,3577539.0,1.0,15811130.0,31937.391,18396.0,600.0,231.0,237.0,540.0,102.0,...,55.0,160000.0,160000.0,160000.0,55125.0,55125.0,55125.0,104060.0,104060.0,104060.0


In [7]:
df.head(6)

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2987005,0,86510,49.0,W,5937,555.0,150.0,visa,226.0,...,,,,,,,,,,


In [9]:
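def have_same_nans(ser1, ser2):
    """
    Return True if two pandas Series have NaNs in exactly the same positions.
    """
    # XOR is True wherever exactly one of the two Series is missing a value
    mismatch = ser1.isna() ^ ser2.isna()
    # `not` always yields a plain bool; `~` on a Python bool gives -1 or -2
    return not mismatch.any()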
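A quick check on three made-up Series shows how the comparison behaves. The function is restated here with `not` so the snippet runs on its own; the Series `a`, `b` and `c` are invented for the example.

```python
import numpy as np
import pandas as pd

def have_same_nans(ser1, ser2):
    """Return True if both Series have NaNs in exactly the same positions."""
    return not (ser1.isna() ^ ser2.isna()).any()

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([9.0, np.nan, 7.0])   # same NaN position as a, different values
c = pd.Series([np.nan, 2.0, 3.0])   # NaN in a different position

print(have_same_nans(a, b))  # True
print(have_same_nans(a, c))  # False
```

Only the positions of the missing values matter; the non-missing values themselves are never compared.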

In [11]:
for x in df.columns:
    print(x,end=', ')

TransactionID, isFraud, TransactionDT, TransactionAmt, ProductCD, card1, card2, card3, card4, card5, card6, addr1, addr2, dist1, dist2, P_emaildomain, R_emaildomain, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11, C12, C13, C14, D1, D2, D3, D4, D5, D6, D7, D8, D9, D10, D11, D12, D13, D14, D15, M1, M2, M3, M4, M5, M6, M7, M8, M9, V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28, V29, V30, V31, V32, V33, V34, V35, V36, V37, V38, V39, V40, V41, V42, V43, V44, V45, V46, V47, V48, V49, V50, V51, V52, V53, V54, V55, V56, V57, V58, V59, V60, V61, V62, V63, V64, V65, V66, V67, V68, V69, V70, V71, V72, V73, V74, V75, V76, V77, V78, V79, V80, V81, V82, V83, V84, V85, V86, V87, V88, V89, V90, V91, V92, V93, V94, V95, V96, V97, V98, V99, V100, V101, V102, V103, V104, V105, V106, V107, V108, V109, V110, V111, V112, V113, V114, V115, V116, V117, V118, V119, V120, V121, V122, V123, V124, V125, V126, V127, V128, V129, V130,

In [12]:
# Making lists for the different column groups:
# card_cols, addr_cols, dist_cols, email_cols, C_cols, D_cols, M_cols, V_cols

# Substring match for the lowercase groups
card_cols = [x for x in df.columns if 'card' in x]
addr_cols = [x for x in df.columns if 'addr' in x]
dist_cols = [x for x in df.columns if 'dist' in x]
email_cols = [x for x in df.columns if 'email' in x]
# Prefix match for the single-letter groups (C1-C14, D1-D15, M1-M9, V1-V339)
C_cols = [x for x in df.columns if x.startswith('C')]
D_cols = [x for x in df.columns if x.startswith('D')]
M_cols = [x for x in df.columns if x.startswith('M')]
V_cols = [x for x in df.columns if x.startswith('V')]

In [18]:
# Group the V columns into lists whose missing values (NaNs) fall in
# the same rows; such columns are treated here as "correlated".
# Each list starts with the NaN count, followed by the column names.
cols = V_cols.copy()
same_nans_lst = []
similar0 = [0]  # columns with no missing values at all
while cols:
    x = cols.pop(0)
    total = int(df[x].isna().sum())
    if total == 0:
        similar0.append(x)
    else:
        similar = [total, x]
        # Iterate over a copy because matching columns are removed from cols
        for y in cols.copy():
            if have_same_nans(df[x], df[y]):
                similar.append(y)
                cols.remove(y)
        same_nans_lst.append(similar)
same_nans_lst.append(similar0)
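The pairwise comparison in the loop above can also be done in a single pass by using each column's NaN mask as a dictionary key. This is an alternative sketch, not the method used in this notebook; the function name `group_by_nan_pattern` and the toy frame are made up for illustration.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

def group_by_nan_pattern(frame, columns):
    """Group columns that have NaNs in exactly the same rows."""
    groups = defaultdict(list)
    for col in columns:
        mask = frame[col].isna().to_numpy()
        # The raw bytes of the boolean mask are hashable, so columns with
        # an identical pattern land under the same dictionary key.
        groups[mask.tobytes()].append(col)
    # Report each group as [NaN count, col, col, ...]
    return [[int(frame[cols[0]].isna().sum()), *cols] for cols in groups.values()]

toy = pd.DataFrame({
    "V_a": [1.0, np.nan, 3.0],
    "V_b": [4.0, np.nan, 6.0],
    "V_c": [np.nan, 2.0, 3.0],
    "V_d": [1.0, 2.0, 3.0],
})

result = group_by_nan_pattern(toy, toy.columns)
print(result)
```

Unlike the loop above, this version keeps the group of columns without missing values in order of first appearance rather than appending it last, and it visits each column only once.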


In [19]:
for lst in same_nans_lst:
    print(lst)

[279287, 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11']
[76073, 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34']
[168969, 'V35', 'V36', 'V37', 'V38', 'V39', 'V40', 'V41', 'V42', 'V43', 'V44', 'V45', 'V46', 'V47', 'V48', 'V49', 'V50', 'V51', 'V52']
[77096, 'V53', 'V54', 'V55', 'V56', 'V57', 'V58', 'V59', 'V60', 'V61', 'V62', 'V63', 'V64', 'V65', 'V66', 'V67', 'V68', 'V69', 'V70', 'V71', 'V72', 'V73', 'V74']
[89164, 'V75', 'V76', 'V77', 'V78', 'V79', 'V80', 'V81', 'V82', 'V83', 'V84', 'V85', 'V86', 'V87', 'V88', 'V89', 'V90', 'V91', 'V92', 'V93', 'V94']
[314, 'V95', 'V96', 'V97', 'V98', 'V99', 'V100', 'V101', 'V102', 'V103', 'V104', 'V105', 'V106', 'V107', 'V108', 'V109', 'V110', 'V111', 'V112', 'V113', 'V114', 'V115', 'V116', 'V117', 'V118', 'V119', 'V120', 'V121', 'V122', 'V123', 'V124', 'V125', 'V126', 'V127', 'V128', 'V129', 'V130', 'V131', 'V132', 'V13

In [20]:
# Separating categorical and non-categorical variables:
# a column counts as categorical when all its non-missing values are integers
V_categorical = []
V_non_categorical = []
for lst in same_nans_lst:
    print(lst)
    for col in lst[1:]:  # lst[0] is the NaN count, not a column name
        is_cat = True
        for val in df[col].unique():
            if np.isnan(val):
                continue
            # Any fractional part, positive or negative, rules the column out
            if val != int(val):
                is_cat = False
                break
        if is_cat:
            V_categorical.append(col)
        else:
            V_non_categorical.append(col)

[279287, 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11']
[76073, 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34']
[168969, 'V35', 'V36', 'V37', 'V38', 'V39', 'V40', 'V41', 'V42', 'V43', 'V44', 'V45', 'V46', 'V47', 'V48', 'V49', 'V50', 'V51', 'V52']
[77096, 'V53', 'V54', 'V55', 'V56', 'V57', 'V58', 'V59', 'V60', 'V61', 'V62', 'V63', 'V64', 'V65', 'V66', 'V67', 'V68', 'V69', 'V70', 'V71', 'V72', 'V73', 'V74']
[89164, 'V75', 'V76', 'V77', 'V78', 'V79', 'V80', 'V81', 'V82', 'V83', 'V84', 'V85', 'V86', 'V87', 'V88', 'V89', 'V90', 'V91', 'V92', 'V93', 'V94']
[314, 'V95', 'V96', 'V97', 'V98', 'V99', 'V100', 'V101', 'V102', 'V103', 'V104', 'V105', 'V106', 'V107', 'V108', 'V109', 'V110', 'V111', 'V112', 'V113', 'V114', 'V115', 'V116', 'V117', 'V118', 'V119', 'V120', 'V121', 'V122', 'V123', 'V124', 'V125', 'V126', 'V127', 'V128', 'V129', 'V130', 'V131', 'V132', 'V13

In [22]:
print("categorical:\n",V_categorical)
print('\nnon_categorical:\n',V_non_categorical)

categorical:
 ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40', 'V41', 'V42', 'V43', 'V44', 'V45', 'V46', 'V47', 'V48', 'V49', 'V50', 'V51', 'V52', 'V53', 'V54', 'V55', 'V56', 'V57', 'V58', 'V59', 'V60', 'V61', 'V62', 'V63', 'V64', 'V65', 'V66', 'V67', 'V68', 'V69', 'V70', 'V71', 'V72', 'V73', 'V74', 'V75', 'V76', 'V77', 'V78', 'V79', 'V80', 'V81', 'V82', 'V83', 'V84', 'V85', 'V86', 'V87', 'V88', 'V89', 'V90', 'V91', 'V92', 'V93', 'V94', 'V95', 'V96', 'V97', 'V98', 'V99', 'V100', 'V101', 'V102', 'V103', 'V104', 'V105', 'V106', 'V107', 'V108', 'V109', 'V110', 'V111', 'V112', 'V113', 'V114', 'V115', 'V116', 'V117', 'V118', 'V119', 'V120', 'V121', 'V122', 'V123', 'V124', 'V125', 'V138', 'V139', 'V140', 'V141', 'V142', 'V146', 'V147', 'V148', 'V149', 'V153', 'V154', 'V155