<div><span style="background-color: #9e4244; padding-top: 80px; padding-right: 20px; padding-bottom: 50px; padding-left: 20px; color: white; font-size: 22px; font-weight: bold">Session 1: Introduction to Credit Card Fraud Analysis</span></div>

by BYJ Cirio

<div class="alert alert-danger alert-info">
     This notebook gives an overview of the credit card fraud dataset. The topics covered are:<br>
    <ol>
        <li> Exploratory Data Analysis</li>
        <li>Cleaning and Pre-processing</li>
        <li>Baselining</li>
        <li><i>Exercise: Generating Insights through EDA</i></li>
    </ol>
</div>

In [1]:
# general libraries
import warnings
import numpy as np
import pandas as pd
from tqdm import tqdm
from collections import Counter
warnings.filterwarnings("ignore")

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from nltk.corpus import stopwords

<div><span style="background-color: #ff0257; padding-top: 100px; padding-right: 20px; padding-bottom: 50px; padding-left: 20px; color: #FFFAF0; font-size: 18px; font-weight: bold">Data Cleaning and Preprocessing </span></div>

In [2]:
cc_fraud = pd.read_csv("full_fraud_dataset.csv", nrows=2500000)
display(cc_fraud.shape, cc_fraud.head())

(2500000, 25)

Unnamed: 0,ssn,cc_num,first,last,gender,street,city,state,zip,lat,...,trans_num,trans_date,trans_time,unix_time,category,amt,is_fraud,merchant,merch_lat,merch_long
0,8013-2690062-6,4895039978433579,Harry,Mckee,M,"B20 L43 11th Road, Fitzpatrick Estates",Tagbilaran City,PH,105051,9.65,...,b7e590b6def607cf89d91a9909985b82,2021-02-13,22:39:49,1613255989,misc_net,825.04,1,MedStoreRx,9.265518,123.368859
1,8013-2690062-6,4895039978433579,Harry,Mckee,M,"B20 L43 11th Road, Fitzpatrick Estates",Tagbilaran City,PH,105051,9.65,...,d248c10143858b069bb776065646ac48,2021-02-12,23:01:09,1613170869,grocery_pos,306.55,1,Ever Supermarket,9.236416,124.337292
2,8013-2690062-6,4895039978433579,Harry,Mckee,M,"B20 L43 11th Road, Fitzpatrick Estates",Tagbilaran City,PH,105051,9.65,...,a4fb2e02583977d4afc46b09506611f8,2021-02-12,14:45:57,1613141157,entertainment,346.48,1,Nine Media Corporation,9.460644,124.461767
3,8013-2690062-6,4895039978433579,Harry,Mckee,M,"B20 L43 11th Road, Fitzpatrick Estates",Tagbilaran City,PH,105051,9.65,...,f315ea3abc262b459360f62bd3619c12,2021-02-12,23:08:59,1613171339,shopping_net,919.5,1,Zalora,10.60602,123.930862
4,8013-2690062-6,4895039978433579,Harry,Mckee,M,"B20 L43 11th Road, Fitzpatrick Estates",Tagbilaran City,PH,105051,9.65,...,ac3503258c2fff1c16a381d3633b1e0d,2021-02-13,22:56:43,1613257003,shopping_net,1131.45,1,Ubuy Co.,8.806183,123.719528


In [3]:
cc_fraud['full_name'] = cc_fraud['first'] + ' ' + cc_fraud['last']
cc_fraud['full_name'].value_counts()

Ryan Allen          6340
Robert Johnson      5659
Robert Jones        4953
Mark Brown          4251
Kevin Smith         4244
                    ... 
Patrick Reynolds       7
Brad Nielsen           7
Jason Jenkins          7
Richard Whitaker       7
Allen Mccarty          7
Name: full_name, Length: 2204, dtype: int64

### 1. Drop Unnecessary Variables

In [4]:
# drop personal identifiers and other columns not used in the analysis
to_drop = ['ssn', 'cc_num', 'first', 'last', 'street', 'state', 'zip', 'acct_num',
           'trans_num', 'unix_time', 'full_name']
cc_clean = cc_fraud.drop(to_drop, axis=1)
cc_clean.head()

Unnamed: 0,gender,city,lat,long,city_pop,job,dob,trans_date,trans_time,category,amt,is_fraud,merchant,merch_lat,merch_long
0,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,1960-08-31,2021-02-13,22:39:49,misc_net,825.04,1,MedStoreRx,9.265518,123.368859
1,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,1960-08-31,2021-02-12,23:01:09,grocery_pos,306.55,1,Ever Supermarket,9.236416,124.337292
2,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,1960-08-31,2021-02-12,14:45:57,entertainment,346.48,1,Nine Media Corporation,9.460644,124.461767
3,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,1960-08-31,2021-02-12,23:08:59,shopping_net,919.5,1,Zalora,10.60602,123.930862
4,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,1960-08-31,2021-02-13,22:56:43,shopping_net,1131.45,1,Ubuy Co.,8.806183,123.719528


### 2. Clean Date and Time

In [5]:
# transaction date
cc_clean['trans_datetime'] = pd.to_datetime(cc_clean['trans_date'])
cc_clean['trans_date'] = cc_clean['trans_datetime'].dt.date
cc_clean['trans_year'] = cc_clean['trans_datetime'].dt.year.astype(str)
cc_clean['trans_month'] = cc_clean['trans_datetime'].dt.month
cc_clean['trans_day'] = cc_clean['trans_datetime'].dt.day

# transaction time
cc_clean['trans_hour'] = cc_clean['trans_time'].str[:2].astype(int)

# convert month to string
month_map = {1: 'Jan',
             2: 'Feb',
             3: 'Mar',
             4: 'Apr',
             5: 'May',
             6: 'Jun',
             7: 'Jul',
             8: 'Aug',
             9: 'Sep',
             10: 'Oct',
             11: 'Nov',
             12: 'Dec'}

cc_clean['trans_month_str'] = cc_clean['trans_month'].map(month_map)

In [6]:
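def get_part_of_day(hour):
    """Bucket an hour (0-23): 23-6 early morning, 7-11 breakfast, 12-14 lunch, 15-17 afternoon, 18-22 evening."""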
    if (hour > 22) or (hour <= 6):
        return "early morning"
    elif hour <= 11:
        return "breakfast"
    elif hour <= 14:
        return "lunch"
    elif hour <= 17:
        return "afternoon"
    else:
        return "evening"

cc_clean.loc[:, "part_of_day"] = cc_clean['trans_hour'].apply(get_part_of_day)
cc_clean.head()

Unnamed: 0,gender,city,lat,long,city_pop,job,dob,trans_date,trans_time,category,...,merchant,merch_lat,merch_long,trans_datetime,trans_year,trans_month,trans_day,trans_hour,trans_month_str,part_of_day
0,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,1960-08-31,2021-02-13,22:39:49,misc_net,...,MedStoreRx,9.265518,123.368859,2021-02-13,2021,2,13,22,Feb,evening
1,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,1960-08-31,2021-02-12,23:01:09,grocery_pos,...,Ever Supermarket,9.236416,124.337292,2021-02-12,2021,2,12,23,Feb,early morning
2,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,1960-08-31,2021-02-12,14:45:57,entertainment,...,Nine Media Corporation,9.460644,124.461767,2021-02-12,2021,2,12,14,Feb,lunch
3,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,1960-08-31,2021-02-12,23:08:59,shopping_net,...,Zalora,10.60602,123.930862,2021-02-12,2021,2,12,23,Feb,early morning
4,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,1960-08-31,2021-02-13,22:56:43,shopping_net,...,Ubuy Co.,8.806183,123.719528,2021-02-13,2021,2,13,22,Feb,evening
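
On 2.5 million rows, a row-wise `apply` can be slow. The sketch below shows one possible vectorized equivalent of `get_part_of_day` using `np.select`. The `hours` Series is a hypothetical stand-in for `cc_clean['trans_hour']`, and the boundaries are copied from the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of hours standing in for cc_clean['trans_hour']
hours = pd.Series([0, 6, 7, 11, 12, 14, 15, 17, 18, 22, 23])

def get_part_of_day(hour):
    # Same rules as the notebook's function, repeated here for comparison
    if (hour > 22) or (hour <= 6):
        return "early morning"
    elif hour <= 11:
        return "breakfast"
    elif hour <= 14:
        return "lunch"
    elif hour <= 17:
        return "afternoon"
    else:
        return "evening"

# np.select takes the first condition that is True for each element,
# so the order mirrors the if/elif chain above
conditions = [
    (hours > 22) | (hours <= 6),
    hours <= 11,
    hours <= 14,
    hours <= 17,
]
choices = ["early morning", "breakfast", "lunch", "afternoon"]
part_of_day = pd.Series(
    np.select(conditions, choices, default="evening"), index=hours.index
)

# Check against the row-wise version
print((part_of_day == hours.apply(get_part_of_day)).all())
```

Both versions should give the same labels. The vectorized one avoids a Python-level function call per row.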


### 3. Age

In [7]:
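# parse date of birth, then keep plain dates to match trans_date
cc_clean['dob_datetime'] = pd.to_datetime(cc_clean['dob'])
cc_clean['dob'] = cc_clean['dob_datetime'].dt.date
# approximate age in whole years: elapsed days / 365 (leap days ignored), floored by .days
cc_clean['age'] = (cc_clean['trans_date'] - cc_clean['dob'])/365
cc_clean['age'] = cc_clean['age'].apply(lambda x: x.days)
cc_clean.head()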

Unnamed: 0,gender,city,lat,long,city_pop,job,dob,trans_date,trans_time,category,...,merch_long,trans_datetime,trans_year,trans_month,trans_day,trans_hour,trans_month_str,part_of_day,dob_datetime,age
0,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,1960-08-31,2021-02-13,22:39:49,misc_net,...,123.368859,2021-02-13,2021,2,13,22,Feb,evening,1960-08-31,60
1,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,1960-08-31,2021-02-12,23:01:09,grocery_pos,...,124.337292,2021-02-12,2021,2,12,23,Feb,early morning,1960-08-31,60
2,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,1960-08-31,2021-02-12,14:45:57,entertainment,...,124.461767,2021-02-12,2021,2,12,14,Feb,lunch,1960-08-31,60
3,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,1960-08-31,2021-02-12,23:08:59,shopping_net,...,123.930862,2021-02-12,2021,2,12,23,Feb,early morning,1960-08-31,60
4,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,1960-08-31,2021-02-13,22:56:43,shopping_net,...,123.719528,2021-02-13,2021,2,13,22,Feb,evening,1960-08-31,60
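
Dividing the elapsed days by 365 ignores leap days, so the computed age can be off by one for transactions close to a birthday. The sketch below shows one way to compute age in completed years instead. The `toy` frame and its dates are made up for illustration; only the column names `dob` and `trans_date` come from the notebook.

```python
import pandas as pd

# Hypothetical rows; the first one matches the cardholder shown above
toy = pd.DataFrame({
    "dob": pd.to_datetime(["1960-08-31", "2000-02-29", "1990-12-31"]),
    "trans_date": pd.to_datetime(["2021-02-13", "2021-02-28", "2021-12-31"]),
})

# True when the birthday has not yet occurred in the transaction year
before_birthday = (
    (toy["trans_date"].dt.month < toy["dob"].dt.month)
    | (
        (toy["trans_date"].dt.month == toy["dob"].dt.month)
        & (toy["trans_date"].dt.day < toy["dob"].dt.day)
    )
)

# Completed years: year difference, minus one if the birthday is still ahead
toy["age"] = (
    toy["trans_date"].dt.year - toy["dob"].dt.year - before_birthday.astype(int)
)
print(toy)
```

For the first row this gives 60, the same value as the approximation above. The two methods only differ within a few weeks of a birthday.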


### 4. Retain final columns

In [8]:
to_drop2 = ['dob', 'trans_date', 'trans_month', 'trans_datetime', 'dob_datetime', 'trans_time', 'trans_hour']
cc_final = cc_clean.drop(to_drop2, axis=1)
cc_final.head()

Unnamed: 0,gender,city,lat,long,city_pop,job,category,amt,is_fraud,merchant,merch_lat,merch_long,trans_year,trans_day,trans_month_str,part_of_day,age
0,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,misc_net,825.04,1,MedStoreRx,9.265518,123.368859,2021,13,Feb,evening,60
1,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,grocery_pos,306.55,1,Ever Supermarket,9.236416,124.337292,2021,12,Feb,early morning,60
2,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,entertainment,346.48,1,Nine Media Corporation,9.460644,124.461767,2021,12,Feb,lunch,60
3,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,shopping_net,919.5,1,Zalora,10.60602,123.930862,2021,12,Feb,early morning,60
4,M,Tagbilaran City,9.65,123.85,105051,Planning and development surveyor,shopping_net,1131.45,1,Ubuy Co.,8.806183,123.719528,2021,13,Feb,evening,60


In [9]:
cc_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500000 entries, 0 to 2499999
Data columns (total 17 columns):
 #   Column           Dtype  
---  ------           -----  
 0   gender           object 
 1   city             object 
 2   lat              float64
 3   long             float64
 4   city_pop         int64  
 5   job              object 
 6   category         object 
 7   amt              float64
 8   is_fraud         int64  
 9   merchant         object 
 10  merch_lat        float64
 11  merch_long       float64
 12  trans_year       object 
 13  trans_day        int64  
 14  trans_month_str  object 
 15  part_of_day      object 
 16  age              int64  
dtypes: float64(5), int64(4), object(8)
memory usage: 324.2+ MB


### 5. One-hot encode categorical variables

In [10]:
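# one-hot encode every object-typed column, then drop the original columns
to_drop3 = []
for col in tqdm(cc_final.columns):
    if cc_final[col].dtype == 'O':
        # drop_first=False keeps one indicator column per category level
        dummies = pd.get_dummies(cc_final[col], prefix=col, drop_first=False)
        cc_final = pd.concat([cc_final, dummies], axis=1)
        to_drop3.append(col)
cc_final = cc_final.drop(to_drop3, axis=1)
cc_final.head()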

100%|██████████████████████████████████████████████████████████████████████████████████| 17/17 [00:13<00:00,  1.28it/s]


Unnamed: 0,lat,long,city_pop,amt,is_fraud,merch_lat,merch_long,trans_day,age,gender_F,...,trans_month_str_Mar,trans_month_str_May,trans_month_str_Nov,trans_month_str_Oct,trans_month_str_Sep,part_of_day_afternoon,part_of_day_breakfast,part_of_day_early morning,part_of_day_evening,part_of_day_lunch
0,9.65,123.85,105051,825.04,1,9.265518,123.368859,13,60,0,...,0,0,0,0,0,0,0,0,1,0
1,9.65,123.85,105051,306.55,1,9.236416,124.337292,12,60,0,...,0,0,0,0,0,0,0,1,0,0
2,9.65,123.85,105051,346.48,1,9.460644,124.461767,12,60,0,...,0,0,0,0,0,0,0,0,0,1
3,9.65,123.85,105051,919.5,1,10.60602,123.930862,12,60,0,...,0,0,0,0,0,0,0,1,0,0
4,9.65,123.85,105051,1131.45,1,8.806183,123.719528,13,60,0,...,0,0,0,0,0,0,0,0,1,0
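
One-hot encoding creates one column per category level, so columns with many distinct values (possibly `city`, `job`, and `merchant` here) can make the frame very wide. The sketch below shows one way to check cardinality first and encode only the low-cardinality columns in a single `pd.get_dummies` call. The `toy` frame and the `max_levels` threshold are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature of cc_final before encoding
toy = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "part_of_day": ["evening", "lunch", "evening", "breakfast"],
    "merchant": ["A", "B", "C", "D"],
    "amt": [825.04, 306.55, 346.48, 919.5],
})

# Number of distinct levels in each non-numeric column
cat_cols = toy.select_dtypes(exclude="number").columns
cardinality = toy[cat_cols].nunique()
print(cardinality)

# Assumed threshold: encode only columns with at most this many levels
max_levels = 3
low_card = cardinality[cardinality <= max_levels].index.tolist()

# Encode the selected columns in one call; other columns pass through unchanged
encoded = pd.get_dummies(toy, columns=low_card)
print(encoded.columns.tolist())
```

High-cardinality columns left out this way still need some other treatment (grouping rare levels, for example) before modeling. Which one to use depends on the model.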


<div><span style="background-color: #ff0257; padding-top: 100px; padding-right: 20px; padding-bottom: 50px; padding-left: 20px; color: #FFFAF0; font-size: 18px; font-weight: bold">Exploratory Data Analysis </span></div>

### Valid vs Fraud 

### Gender

### Location

### Jobs

### Merchant Category

### Date

### Transaction Amount

### Correlation

<div><span style="background-color: #ff0257; padding-top: 100px; padding-right: 20px; padding-bottom: 50px; padding-left: 20px; color: #FFFAF0; font-size: 18px; font-weight: bold">Baselining </span></div>

<div><span style="background-color: #ff0257; padding-top: 100px; padding-right: 20px; padding-bottom: 50px; padding-left: 20px; color: #FFFAF0; font-size: 18px; font-weight: bold">Exercise </span></div>

We have explored and interpreted some characteristics of the credit card fraud dataset. Now it's your turn to generate interesting insights! Here are some things you can explore, but feel free to add anything else you can think of :)
- If we group the cities by region, which region has the highest number of transactions? Of valid transactions? Of fraudulent transactions? The largest population? The highest transaction amount?
- Which year, day of week, and time of day have the highest number of transactions? The highest transaction amount?
- Which merchant has the highest number of transactions? The highest transaction amount?
- What is the distribution of age?
- What are the descriptive statistics (mean, max, min) and distributions of the quantitative variables?
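
As a possible starting point, most of these questions reduce to a `groupby` followed by an aggregation. The sketch below uses a small made-up frame with the notebook's column names `merchant`, `amt`, and `is_fraud`. On the real data you would likely start from `cc_clean`, since the categorical columns in `cc_final` are already one-hot encoded.

```python
import pandas as pd

# Hypothetical transactions, for illustration only
toy = pd.DataFrame({
    "merchant": ["Zalora", "Zalora", "MedStoreRx", "Ubuy Co.", "MedStoreRx", "Zalora"],
    "amt": [919.5, 100.0, 825.04, 1131.45, 50.0, 20.0],
    "is_fraud": [1, 0, 1, 1, 0, 0],
})

# Per-merchant transaction count, total amount, and share of fraud
summary = (
    toy.groupby("merchant")
    .agg(
        n_trans=("amt", "size"),
        total_amt=("amt", "sum"),
        fraud_rate=("is_fraud", "mean"),
    )
    .sort_values("n_trans", ascending=False)
)
print(summary)

# Descriptive statistics of a quantitative variable
print(toy["amt"].describe())
```

Swapping `merchant` for another grouping column (a region, a day of week, or `part_of_day`) follows the same pattern.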