# Credit Card Fraud Detection

## **First part: Data Wrangling**

## **Introduction**
This project focuses on detecting fraudulent credit card transactions using a dataset that contains transaction, card usage, and customer details. The goal is to explore transaction patterns and build a machine learning model to predict fraud.

## **Importing Libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

## **Data Extraction**

In [2]:
df = pd.read_json('/content/transactions.txt', lines=True)
df.head()

Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,...,echoBuffer,currentBalance,merchantCity,merchantState,merchantZip,cardPresent,posOnPremises,recurringAuthInd,expirationDateKeyInMatch,isFraud
0,737265056,737265056,5000,5000.0,2016-08-13T14:27:32,98.55,Uber,US,US,2,...,,0.0,,,,False,,,False,False
1,737265056,737265056,5000,5000.0,2016-10-11T05:05:54,74.51,AMC #191138,US,US,9,...,,0.0,,,,True,,,False,False
2,737265056,737265056,5000,5000.0,2016-11-08T09:18:39,7.47,Play Store,US,US,9,...,,0.0,,,,False,,,False,False
3,737265056,737265056,5000,5000.0,2016-12-10T02:14:50,7.47,Play Store,US,US,9,...,,0.0,,,,False,,,False,False
4,830329091,830329091,5000,5000.0,2016-03-24T21:04:46,71.18,Tim Hortons #947751,US,US,2,...,,0.0,,,,True,,,False,False


## **Data Exploration**

In [3]:
df.shape

(786363, 29)

In [4]:
df.columns

Index(['accountNumber', 'customerId', 'creditLimit', 'availableMoney',
       'transactionDateTime', 'transactionAmount', 'merchantName',
       'acqCountry', 'merchantCountryCode', 'posEntryMode', 'posConditionCode',
       'merchantCategoryCode', 'currentExpDate', 'accountOpenDate',
       'dateOfLastAddressChange', 'cardCVV', 'enteredCVV', 'cardLast4Digits',
       'transactionType', 'echoBuffer', 'currentBalance', 'merchantCity',
       'merchantState', 'merchantZip', 'cardPresent', 'posOnPremises',
       'recurringAuthInd', 'expirationDateKeyInMatch', 'isFraud'],
      dtype='object')

In [5]:
print(df.dtypes)

accountNumber                 int64
customerId                    int64
creditLimit                   int64
availableMoney              float64
transactionDateTime          object
transactionAmount           float64
merchantName                 object
acqCountry                   object
merchantCountryCode          object
posEntryMode                 object
posConditionCode             object
merchantCategoryCode         object
currentExpDate               object
accountOpenDate              object
dateOfLastAddressChange      object
cardCVV                       int64
enteredCVV                    int64
cardLast4Digits               int64
transactionType              object
echoBuffer                   object
currentBalance              float64
merchantCity                 object
merchantState                object
merchantZip                  object
cardPresent                    bool
posOnPremises                object
recurringAuthInd             object
expirationDateKeyInMatch    

## **Data Cleaning**

### 1. Remove Missing or Invalid Data

How missing data is handled in each column depends on the percentage of missing values.
* Columns in which more than 70% of the values are missing carry too little information to impute reliably, so they are dropped.
* Columns with a smaller share of missing values are kept, and the gaps are filled with the column's mode.
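
A threshold-based rule of this kind can be sketched on a toy frame. The 70% cutoff is the one used in the cells below; the helper name `split_by_missing` and the sample data are hypothetical and only serve as an illustration.

```python
import pandas as pd


def split_by_missing(frame: pd.DataFrame, drop_above: float = 70.0):
    """Return (columns to drop, columns to impute) based on percent missing."""
    percent_missing = frame.isnull().mean() * 100
    to_drop = percent_missing[percent_missing > drop_above].index.tolist()
    to_impute = percent_missing[
        (percent_missing > 0) & (percent_missing <= drop_above)
    ].index.tolist()
    return to_drop, to_impute


# Hypothetical toy data: 'city' is entirely empty, 'country' has one gap.
toy = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0],
    "country": ["US", None, "US", "CAN"],
    "city": [None, None, None, None],
})

to_drop, to_impute = split_by_missing(toy)
cleaned = toy.drop(columns=to_drop)
# Fill each remaining gap with the most frequent value of its column.
cleaned = cleaned.fillna({col: cleaned[col].mode()[0] for col in to_impute})
```

Mode imputation suits the categorical columns in this dataset; for numeric columns a mean or median would be the more usual choice.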

In [6]:
# Replace empty or whitespace-only strings with NaN
df = df.replace(r'^\s*$', np.nan, regex=True)
df.head()

Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,...,echoBuffer,currentBalance,merchantCity,merchantState,merchantZip,cardPresent,posOnPremises,recurringAuthInd,expirationDateKeyInMatch,isFraud
0,737265056,737265056,5000,5000.0,2016-08-13T14:27:32,98.55,Uber,US,US,2,...,,0.0,,,,False,,,False,False
1,737265056,737265056,5000,5000.0,2016-10-11T05:05:54,74.51,AMC #191138,US,US,9,...,,0.0,,,,True,,,False,False
2,737265056,737265056,5000,5000.0,2016-11-08T09:18:39,7.47,Play Store,US,US,9,...,,0.0,,,,False,,,False,False
3,737265056,737265056,5000,5000.0,2016-12-10T02:14:50,7.47,Play Store,US,US,9,...,,0.0,,,,False,,,False,False
4,830329091,830329091,5000,5000.0,2016-03-24T21:04:46,71.18,Tim Hortons #947751,US,US,2,...,,0.0,,,,True,,,False,False
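
To see what the pattern `^\s*$` actually catches, here is a small sketch on made-up values (the sample frame is an assumption, not taken from the dataset). Only cells that are empty or consist entirely of whitespace become NaN; ordinary strings are left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: an empty string, a blank string, and real values.
sample = pd.DataFrame({
    "merchantCity": ["", "   ", "Austin"],
    "posEntryMode": ["02", "", "09"],
})

# ^\s*$ matches only cells that are empty or whitespace-only.
normalized = sample.replace(r"^\s*$", np.nan, regex=True)
missing_counts = normalized.isnull().sum()
```

After this step `isnull()` can count the blanks, which it would otherwise treat as valid strings.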


In [7]:
# Number and percentage of missing values in each column
total = df.isnull().sum()
percent = df.isnull().mean() * 100
missing = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing

Unnamed: 0,Total,Percent
accountNumber,0,0.0
customerId,0,0.0
creditLimit,0,0.0
availableMoney,0,0.0
transactionDateTime,0,0.0
transactionAmount,0,0.0
merchantName,0,0.0
acqCountry,4562,0.580139
merchantCountryCode,724,0.092069
posEntryMode,4054,0.515538


In [8]:
# Keep columns with at most 70% missing values
columns_to_keep = missing[missing['Percent'] <= 70].index
# Kept columns that still contain missing values and need imputation
columns_with_missing = missing[(missing['Percent'] > 0) & (missing['Percent'] <= 70)].index
# Create a new DataFrame with only the columns to keep
df_filtered = df[columns_to_keep]
df_filtered.head()

Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,...,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud
0,737265056,737265056,5000,5000.0,2016-08-13T14:27:32,98.55,Uber,US,US,2,...,2015-03-14,2015-03-14,414,414,1803,PURCHASE,0.0,False,False,False
1,737265056,737265056,5000,5000.0,2016-10-11T05:05:54,74.51,AMC #191138,US,US,9,...,2015-03-14,2015-03-14,486,486,767,PURCHASE,0.0,True,False,False
2,737265056,737265056,5000,5000.0,2016-11-08T09:18:39,7.47,Play Store,US,US,9,...,2015-03-14,2015-03-14,486,486,767,PURCHASE,0.0,False,False,False
3,737265056,737265056,5000,5000.0,2016-12-10T02:14:50,7.47,Play Store,US,US,9,...,2015-03-14,2015-03-14,486,486,767,PURCHASE,0.0,False,False,False
4,830329091,830329091,5000,5000.0,2016-03-24T21:04:46,71.18,Tim Hortons #947751,US,US,2,...,2015-08-06,2015-08-06,885,885,3143,PURCHASE,0.0,True,False,False


In [9]:
print(columns_with_missing)
print(columns_with_missing.dtype)

Index(['acqCountry', 'merchantCountryCode', 'posEntryMode', 'posConditionCode',
       'transactionType'],
      dtype='object')
object


In [10]:
# Replace NaN with the mode of each column.
# Reassigning the result avoids the chained in-place fillna, which raised a SettingWithCopyWarning.
modes = {column: df_filtered[column].mode()[0] for column in columns_with_missing}
df_filtered = df_filtered.fillna(modes)


In [11]:
df_filtered.isnull().sum()

Unnamed: 0,0
accountNumber,0
customerId,0
creditLimit,0
availableMoney,0
transactionDateTime,0
transactionAmount,0
merchantName,0
acqCountry,0
merchantCountryCode,0
posEntryMode,0


### 2. Checking for Duplicate Rows


In [12]:
# Drop duplicate rows, then verify that none remain
df_filtered = df_filtered.drop_duplicates()
duplicates = df_filtered.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

Number of duplicate rows: 0
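
Because the count above is taken after `drop_duplicates()`, it can only confirm that no duplicates remain. If the number of duplicates originally present is of interest, it has to be counted first. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data with one exact duplicate row.
toy = pd.DataFrame({
    "accountNumber": [1, 1, 2],
    "transactionAmount": [7.47, 7.47, 71.18],
})

# duplicated() marks every row that repeats an earlier row.
duplicates_before = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates()
duplicates_after = int(deduplicated.duplicated().sum())

print(f"Duplicates before: {duplicates_before}, after: {duplicates_after}")
```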


### 3. Fix Column Types

In [13]:
# Convert date columns to datetime; transactionDateTime keeps only the date part (the time after 'T' is dropped)
df_filtered['transactionDateTime'] = pd.to_datetime(df_filtered['transactionDateTime'].str.split('T').str[0], format='%Y-%m-%d')
df_filtered['dateOfLastAddressChange'] = pd.to_datetime(df_filtered['dateOfLastAddressChange'], format='%Y-%m-%d')
df_filtered['accountOpenDate'] = pd.to_datetime(df_filtered['accountOpenDate'], format='%Y-%m-%d')
df_filtered['currentExpDate'] = pd.to_datetime(df_filtered['currentExpDate'], format='%m/%Y')

print(df_filtered['transactionDateTime'].dtype)
print(df_filtered['dateOfLastAddressChange'].dtype)
print(df_filtered['accountOpenDate'].dtype)
print(df_filtered['currentExpDate'].dtype)

datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
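
The conversion above discards the time of day. If that information is wanted later, for example as a model feature, the full timestamp can be parsed and the date derived from it. The sketch below assumes the raw values follow the `YYYY-MM-DDTHH:MM:SS` layout seen in the first rows; `transaction_hour` is a hypothetical feature name.

```python
import pandas as pd

# Sample timestamps in the same ISO 8601 layout as the raw data.
raw = pd.Series(["2016-08-13T14:27:32", "2016-10-11T05:05:54"])

# Parse the full timestamp instead of splitting on 'T'.
parsed = pd.to_datetime(raw, format="%Y-%m-%dT%H:%M:%S")

transaction_date = parsed.dt.normalize()  # midnight of the same day, still a datetime
transaction_hour = parsed.dt.hour         # hypothetical extra feature
```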


### 4. Outlier Detection

In [14]:
from scipy import stats

# Flag transactions whose amount lies 5 or more standard deviations from the mean
numeric_columns = ['transactionAmount']

z_scores = stats.zscore(df_filtered[numeric_columns])
abs_z_scores = np.abs(z_scores)
outliers = df_filtered[(abs_z_scores >= 5).any(axis=1)]

outliers

Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,...,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud
91,574788567,574788567,2500,2307.09,2016-12-24,1091.13,Shell Tire,US,US,05,...,2015-10-13,2015-10-13,206,206,8522,PURCHASE,192.91,False,False,False
177,984504651,984504651,50000,39749.00,2016-02-03,1007.69,Lyft,US,US,05,...,2015-07-27,2015-07-27,640,640,8332,PURCHASE,10251.00,False,False,False
448,984504651,984504651,50000,15521.41,2016-05-27,1112.37,Uber,US,US,09,...,2015-07-27,2016-05-05,640,640,8332,PURCHASE,34478.59,False,False,False
783,984504651,984504651,50000,42047.30,2016-10-14,1158.35,NYSC #127559,US,US,05,...,2015-07-27,2016-06-25,640,640,8332,PURCHASE,7952.70,False,False,False
890,984504651,984504651,50000,28202.85,2016-11-29,1041.75,Uber,US,US,09,...,2015-07-27,2016-11-26,640,640,8332,PURCHASE,21797.15,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
784698,473474510,473474510,10000,4415.27,2016-12-27,1080.67,AMC #692956,US,US,02,...,2013-01-26,2013-01-26,496,496,2472,PURCHASE,5584.73,True,False,False
784977,841351704,841351704,50000,18031.52,2016-04-04,903.14,Los Angeles News,US,US,02,...,2015-12-17,2015-12-17,651,651,7831,PURCHASE,31968.48,False,False,False
785401,841351704,841351704,50000,41840.77,2016-10-03,884.77,Washington News,US,US,05,...,2015-12-17,2015-12-17,651,651,7831,PURCHASE,8159.23,True,False,False
785534,841351704,841351704,50000,22099.55,2016-12-04,880.16,Washington News,US,US,05,...,2015-12-17,2015-12-17,651,651,7831,PURCHASE,27900.45,False,False,False
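
For reference, the z-score used above is $z = (x - \mu) / \sigma$, where $\mu$ is the column mean and $\sigma$ the population standard deviation (`ddof=0`, the default of `scipy.stats.zscore`). The sketch below reproduces it by hand on hypothetical amounts and applies the same cutoff of 5.

```python
import numpy as np
from scipy import stats

# Hypothetical amounts: many small purchases and one very large one.
amounts = np.array([10.0] * 99 + [1000.0])

# z = (x - mean) / std with the population standard deviation (ddof=0).
manual_z = (amounts - amounts.mean()) / amounts.std(ddof=0)
scipy_z = stats.zscore(amounts)

# Same rule as in the notebook: |z| >= 5 marks an outlier.
outlier_mask = np.abs(manual_z) >= 5
```

Transaction amounts are typically right-skewed, so a z-score cutoff mostly flags large purchases; whether such rows should be removed or kept as a fraud signal is a separate modelling decision.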


In [15]:
df_filtered.head()

Unnamed: 0,accountNumber,customerId,creditLimit,availableMoney,transactionDateTime,transactionAmount,merchantName,acqCountry,merchantCountryCode,posEntryMode,...,accountOpenDate,dateOfLastAddressChange,cardCVV,enteredCVV,cardLast4Digits,transactionType,currentBalance,cardPresent,expirationDateKeyInMatch,isFraud
0,737265056,737265056,5000,5000.0,2016-08-13,98.55,Uber,US,US,2,...,2015-03-14,2015-03-14,414,414,1803,PURCHASE,0.0,False,False,False
1,737265056,737265056,5000,5000.0,2016-10-11,74.51,AMC #191138,US,US,9,...,2015-03-14,2015-03-14,486,486,767,PURCHASE,0.0,True,False,False
2,737265056,737265056,5000,5000.0,2016-11-08,7.47,Play Store,US,US,9,...,2015-03-14,2015-03-14,486,486,767,PURCHASE,0.0,False,False,False
3,737265056,737265056,5000,5000.0,2016-12-10,7.47,Play Store,US,US,9,...,2015-03-14,2015-03-14,486,486,767,PURCHASE,0.0,False,False,False
4,830329091,830329091,5000,5000.0,2016-03-24,71.18,Tim Hortons #947751,US,US,2,...,2015-08-06,2015-08-06,885,885,3143,PURCHASE,0.0,True,False,False


In [16]:
df_filtered.dtypes

Unnamed: 0,0
accountNumber,int64
customerId,int64
creditLimit,int64
availableMoney,float64
transactionDateTime,datetime64[ns]
transactionAmount,float64
merchantName,object
acqCountry,object
merchantCountryCode,object
posEntryMode,object


In [17]:
df_filtered.to_csv('cleaned_data.csv', index=False)

After cleaning the data and saving it as a CSV file, I imported it into a PostgreSQL database to compute descriptive statistics.
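
The import step itself is not shown in this notebook. As a rough illustration of the idea, the sketch below loads a small hypothetical frame into an in-memory SQLite database, used here only as a stand-in for PostgreSQL, and runs a descriptive query. The table name `transactions` and the sample rows are assumptions; with PostgreSQL, `con` would instead be a SQLAlchemy engine pointing at the server.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the cleaned data (isFraud stored as 0/1).
cleaned = pd.DataFrame({
    "transactionAmount": [98.55, 74.51, 7.47, 7.47],
    "isFraud": [0, 0, 1, 0],
})

# In-memory SQLite database as a stand-in for PostgreSQL.
con = sqlite3.connect(":memory:")
cleaned.to_sql("transactions", con, index=False)

# Count and average amount per fraud label.
summary = pd.read_sql_query(
    "SELECT isFraud, COUNT(*) AS n, AVG(transactionAmount) AS avg_amount "
    "FROM transactions GROUP BY isFraud ORDER BY isFraud",
    con,
)
con.close()
```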