In [1]:
import pandas as pd
pd.options.display.max_columns = None
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(rc = {'figure.figsize':(30,40)})

In [2]:
df = pd.read_csv("../data/whole_fraud.csv")

FileNotFoundError: [Errno 2] No such file or directory: '../data/whole_fraud.csv'

**Data Description**
* **index** - Unique Identifier for each row
* **trans_date_trans_time** - Transaction DateTime
* **cc_num** - Credit Card Number of Customer
* **merchant** - Merchant Name
* **category** - Category of Merchant
* **amt** - Amount of Transaction
* **first** - First Name of Credit Card Holder
* **last** - Last Name of Credit Card Holder
* **gender** - Gender of Credit Card Holder
* **street** - Street Address of Credit Card Holder
* **city** - City of Credit Card Holder
* **state** - State of Credit Card Holder
* **zip** - Zip of Credit Card Holder
* **lat** - Latitude Location of Credit Card Holder
* **long** - Longitude Location of Credit Card Holder
* **city_pop** - Credit Card Holder's City Population
* **job** - Job of Credit Card Holder
* **dob** - Date of Birth of Credit Card Holder
* **trans_num** - Transaction Number
* **unix_time** - UNIX Time of transaction
* **merch_lat** - Latitude Location of Merchant
* **merch_long** - Longitude Location of Merchant
* **is_fraud** - Fraud Flag <--- Target Class


In [None]:
df.shape

In [None]:
df.head(3)

In [None]:
df.info()

In [None]:
#check whether there are any missing values (total count across all columns)
df.isnull().sum().sum()

In [None]:
def transform_col_todate(df,col_names):
    for col in col_names:
        #convert trans_date_trans_time , dob to datetime
        df[col] = pd.to_datetime(df[col])
    return df

In [None]:
col_todate=["trans_date_trans_time","dob"]
df =  transform_col_todate(df,col_todate)

In [None]:
#create new columns day,month,year
df["year"]=df["trans_date_trans_time"].dt.year
df["month"]=df["trans_date_trans_time"].dt.month
df["day"]=df["trans_date_trans_time"].dt.day

In [None]:
#Extract month_name,day_name
df["month_name"]=df["trans_date_trans_time"].dt.month_name()
df["day_name"]=df["trans_date_trans_time"].dt.day_name()

In [None]:
#Extract hour,minute and second
df["hour"]=df["trans_date_trans_time"].dt.hour
df["minute"]=df["trans_date_trans_time"].dt.minute
df["sec"]=df["trans_date_trans_time"].dt.second

**Let's start to examine how different features relate to the target column (fraud)**

## 1. Relation between transaction amount and fraud :

In [None]:
#let's check the distribution of the amount transaction col
df.amt.describe().round()

In [None]:
#99th percentile of the transaction amount
np.percentile(df.amt , 99)

**The 99th percentile is around 546 USD, i.e. 99% of transactions are at or below this amount.**
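To make the percentile idea concrete, here is a minimal sketch on a handful of made-up amounts (the values are hypothetical, not taken from the dataset). It shows how a single extreme value stretches the upper tail, which is why the histogram further down is limited to amounts of at most 1000 USD.

```python
import numpy as np

# Hypothetical transaction amounts, for illustration only.
amounts = np.array([5.0, 12.5, 20.0, 33.0, 47.0, 60.0, 75.0, 90.0, 120.0, 2500.0])

# Value below which 99% of the amounts fall (NumPy interpolates linearly by default).
p99 = np.percentile(amounts, 99)

# Fraction of observations at or below that cutoff.
share_at_or_below = (amounts <= p99).mean()

print(p99, share_at_or_below)
```

With so few points the cutoff is interpolated between the two largest values, so the real dataset gives a far more stable estimate.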

In [None]:
from IPython.display import Image
Image(filename="images/FIG 6 Example.jpg")

In [None]:
#we keep only transactions with an amount of at most 1000 USD to get a more readable graph
ax=sns.histplot(x='amt',data=df[df.amt<=1000],hue='is_fraud',stat='probability',multiple='dodge',common_norm=False,bins=25)
ax.set_ylabel("Percentage in Each Type")
ax.set_xlabel("Transaction Amount in USD")

**The result is very interesting! While normal transactions tend to be around 200 USD or less, fraudulent transactions peak around 300 USD and again in the 800-1000 USD range. There is a very clear pattern here!**
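A numeric summary can back up what the histogram shows. The sketch below computes the quartiles of the amount within each class on a small made-up frame; the column names `amt` and `is_fraud` mirror the notebook, while the values and the `toy` frame are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the notebook.
toy = pd.DataFrame({
    "amt": [10.0, 25.0, 40.0, 80.0, 150.0, 300.0, 320.0, 850.0, 900.0, 950.0],
    "is_fraud": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Quartiles of the amount within each class; unstack puts one quantile per column.
amt_quantiles = (
    toy.groupby("is_fraud")["amt"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(amt_quantiles)
```

On the real data, the same call with `df` in place of `toy` would give the per-class quartiles without clipping at 1000 USD.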

## 2. Relation between Gender and fraud :

In [None]:
#Gender vs Fraud
ax=sns.histplot(x='gender',data=df, hue='is_fraud',stat='probability',multiple='dodge',common_norm=False)
ax.set_ylabel('Percentage')
ax.set_xlabel('Credit Card Holder Gender')

**In this case, we do not see a clear difference between the genders. The data suggest that female and male cardholders are almost equally represented (roughly a 50/50 split) among both fraudulent and normal transactions, so gender is not very indicative of a fraudulent transaction.**

## 3. Relation between Spending category and fraud :

Now, we will examine in which spending categories fraud happens most predominantly. To do this, we first calculate the distribution of categories in normal transactions and then the distribution in fraudulent ones. The difference between the two distributions shows which categories are over-represented among fraudulent transactions.

In [None]:
#calculate the percentage difference
a=df[df['is_fraud']==0]['category'].value_counts(normalize=True).to_frame().reset_index()
a.columns=['category','not fraud percentage']

b=df[df['is_fraud']==1]['category'].value_counts(normalize=True).to_frame().reset_index()
b.columns=['category','fraud percentage']
ab=a.merge(b,on='category')
ab['diff']=ab['fraud percentage']-ab['not fraud percentage']

ax=sns.barplot(y='category',x='diff',data=ab.sort_values('diff',ascending=False))
ax.set_xlabel('Percentage Difference')
ax.set_ylabel('Transaction Category')
plt.title('The Percentage Difference of Fraudulent over Non-Fraudulent Transactions in Each Spending Category')

**Some spending categories indeed see more fraud than others! Fraud tends to happen more often in 'shopping_net', 'grocery_pos' and 'misc_net', while 'home' and 'kids_pets', among others, tend to see more normal transactions than fraudulent ones.**
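The same per-class distribution difference can also be sketched with `pd.crosstab`. This is an alternative formulation rather than the notebook's own method, and the `toy` frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the notebook.
toy = pd.DataFrame({
    "category": ["shopping_net", "shopping_net", "home", "home",
                 "home", "grocery_pos", "grocery_pos", "home"],
    "is_fraud": [1, 1, 0, 0, 0, 1, 0, 1],
})

# Column-normalized crosstab: each column sums to 1, so it holds the
# category distribution within the non-fraud (0) and fraud (1) class.
dist = pd.crosstab(toy["category"], toy["is_fraud"], normalize="columns")

# Positive values: the category is over-represented among fraudulent transactions.
diff = (dist[1] - dist[0]).sort_values(ascending=False)

print(diff)
```

One difference worth noting: the crosstab keeps a category that is missing from one class (with a share of 0), whereas an inner merge of two `value_counts` tables drops it.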

## 4. Relation between age and fraud :

In [None]:
#age vs fraud (age is approximated as the current calendar year minus the birth year)
import datetime as dt
df['age']=dt.date.today().year-pd.to_datetime(df['dob']).dt.year
ax=sns.kdeplot(x='age',data=df, hue='is_fraud', common_norm=False)
ax.set_xlabel('Credit Card Holder Age')
ax.set_ylabel('Density')
plt.xticks(np.arange(0,110,5))
plt.title('Age Distribution in Fraudulent vs Non-Fraudulent Transactions')
plt.legend(title='Type', labels=['Fraud', 'Not Fraud'])

**In normal transactions, there are two peaks, at ages 37-38 and 49-50, while in fraudulent transactions the age distribution is a little smoother and the second peak covers a wider age group, from 50 to 65. This suggests that older people are potentially more prone to fraud.**
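Because the age above is measured against today's date, the numbers shift every time the notebook is re-run in a later year. A possible alternative, sketched here on made-up rows, is to compute the cardholder's age at the time of each transaction. The column name `age_at_trans` and the `toy` frame are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows; column names follow the notebook's datetime columns.
toy = pd.DataFrame({
    "trans_date_trans_time": pd.to_datetime(["2019-06-15 12:00:00", "2020-01-02 23:30:00"]),
    "dob": pd.to_datetime(["1980-06-16", "1970-01-01"]),
})

t = toy["trans_date_trans_time"]
b = toy["dob"]

# True when the birthday has already occurred in the transaction's calendar year.
had_birthday = (t.dt.month > b.dt.month) | (
    (t.dt.month == b.dt.month) & (t.dt.day >= b.dt.day)
)

# Year difference, minus one if the birthday is still ahead in that year.
toy["age_at_trans"] = t.dt.year - b.dt.year - (~had_birthday).astype(int)

print(toy)
```

Using this column would move the peaks described above by a few years, so the reported ages should be re-read from the new plot rather than carried over.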

## 5. Cyclicality of Credit Card Fraud:

How do fraudulent transactions distribute on the temporal spectrum? Is there an hourly, monthly, or seasonal trend? We can use the transaction time column to answer this question.

### Hourly trend :

In [None]:
ax=sns.histplot(data=df, x="hour", hue="is_fraud", common_norm=False,stat='probability',multiple='dodge')
ax.set_ylabel('Percentage')
ax.set_xlabel('Time (Hour) in a Day')
plt.xticks(np.arange(0,24,1))
plt.show()

**There is a clear pattern! While normal transactions are distributed more or less evenly throughout the day, fraudulent payments happen disproportionately around midnight, when most people are asleep!**
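The "around midnight" observation can be turned into a single number per class. The sketch below uses a hypothetical definition of late night (22:00 through 03:59) and made-up rows; neither the window nor the `late_night` column comes from the notebook.

```python
import pandas as pd

# Hypothetical toy data; only the column names `hour` and `is_fraud` come from the notebook.
toy = pd.DataFrame({
    "hour": [0, 1, 23, 22, 12, 9, 14, 18, 23, 10],
    "is_fraud": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

# Assumed late-night window: 22:00 through 03:59.
toy["late_night"] = toy["hour"].isin([22, 23, 0, 1, 2, 3])

# Share of each class's transactions that fall inside the window.
late_share = toy.groupby("is_fraud")["late_night"].mean()

print(late_share)
```

A much larger share for the fraud class than for the normal class would confirm numerically what the histogram shows.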

### Weekly trend :

In [None]:
ax=sns.histplot(data=df, x="day_name", hue="is_fraud", common_norm=False,stat='probability',multiple='dodge')
ax.set_ylabel('Percentage')
ax.set_xlabel('day name')

**Normal transactions tend to happen more often on Monday and Sunday while fraudulent ones tend to spread out more evenly throughout the week.**

### Monthly trend :

In [None]:
ax=sns.histplot(data=df, x="month_name", hue="is_fraud", common_norm=False,stat='probability',multiple='dodge')
ax.set_ylabel('Percentage')
ax.set_xlabel('month name')
plt.show()

**Very interesting results! While normal payments peak around December (Christmas), and then late spring to early summer, fraudulent transactions are more concentrated in Jan-May. There is a clear seasonal trend.**

### Yearly trend :

In [None]:
d19 = df[df["year"]==2019]["is_fraud"].to_frame()

In [None]:
#is_fraud is a 0/1 flag, so its mean is the fraction of fraudulent transactions
perc_frau_2019= d19["is_fraud"].mean()*100

In [None]:
perc_frau_2019

In [None]:
ax = sns.countplot(x="is_fraud", data=d19)

In [None]:
d20 = df[df["year"]==2020]["is_fraud"].to_frame()

In [None]:
#is_fraud is a 0/1 flag, so its mean is the fraction of fraudulent transactions
perc_frau_2020= d20["is_fraud"].mean()*100

In [None]:
perc_frau_2020

In [None]:
ax = sns.countplot(x="is_fraud", data=d20)

**The percentage of fraudulent transactions is higher in 2020 than in 2019.**
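The per-year comparison can also be done in one step with a `groupby`, which avoids repeating the cells for each year. This is a sketch on made-up rows; the `toy` frame and the output column names `n_fraud`, `n_total` and `fraud_pct` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column names `year` and `is_fraud` come from the notebook.
toy = pd.DataFrame({
    "year": [2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020],
    "is_fraud": [0, 0, 0, 1, 0, 0, 1, 1],
})

# Count of fraudulent and total transactions per year.
yearly = toy.groupby("year")["is_fraud"].agg(n_fraud="sum", n_total="count")

# Fraud percentage per year.
yearly["fraud_pct"] = 100 * yearly["n_fraud"] / yearly["n_total"]

print(yearly)
```

Keeping the counts next to the percentage also makes it easy to see whether a year with few transactions is driving the difference.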