## Datasets

The raw dataset, [Synthetic Credit Card Transaction](https://datafabrica.info/products/synthetic-credit-card-transaction-data?variant=45263134621986), is provided by [DataFabrica](https://datafabrica.info/). It contains synthetic credit card transaction amounts, credit card information, transaction IDs, and more.

|Feature|Type|Description|
|---|---|---|
|cardholder_name|Object|The individual's name associated with the card used for the transaction.|
|card_number|Numeric(Discrete)|A unique number linked to the credit card utilized in the transaction. Note: Sensitive data; handle with privacy measures.|
|card_type|Object(Categorical)|The brand or type of the credit card used (e.g., Visa, Mastercard, American Express).|
|merchant_name|Object|The name of the merchant or business where the transaction took place.|
|merchant_category|Object(Categorical)|The sector or industry classification of the merchant (e.g., retail, dining, entertainment).|
|merchant_state|Object(Categorical)|The state in which the merchant is located.|
|merchant_city|Object(Categorical)|The city where the merchant operates.|
|transaction_amount|Numeric(Continuous)|The monetary value spent during the transaction.|
|merchant_category_code|Object(Categorical)|An alphanumeric or numerical code denoting the specific category of the merchant according to a predefined system.|
|fraud_flag|Numeric(Discrete)|A binary indicator marking the transaction as fraudulent (1) or non-fraudulent (0).|
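
The table flags `card_number` as sensitive. One possible privacy measure is to mask all but the last few digits before printing or sharing samples. The following is a minimal sketch only: `mask_card_number` and the `card_number_masked` column are hypothetical names, not part of the dataset or of any library, and the two card numbers are copied from the synthetic sample output shown later in this notebook.

```python
import pandas as pd


def mask_card_number(card_number, visible=4):
    """Replace all but the last `visible` digits with '*' (hypothetical helper)."""
    digits = str(card_number)
    if len(digits) <= visible:
        return digits
    return "*" * (len(digits) - visible) + digits[-visible:]


# Tiny illustrative frame using two synthetic card numbers from the sample output
sample = pd.DataFrame({"card_number": [4408914864277480, 4533948622139044]})

# Assumed column name for the masked copy; the raw column is left untouched
sample["card_number_masked"] = sample["card_number"].map(mask_card_number)
print(sample["card_number_masked"].tolist())
```

Masking only changes how the value is displayed. Whether the raw column should also be dropped or hashed depends on how the data is used downstream.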

## Imports

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score, accuracy_score, confusion_matrix

## Functions

In [95]:
def fraud_barchart(column_name, xaxix_name, title, xaxix_angle):
    """Plot the number of fraudulent transactions per value of `column_name`.

    Relies on the global dataframe `df` loaded in the "Load Data" section.
    """
    # Keep only the fraudulent transactions
    fraud_df = df[df['fraud_flag'] == 1]

    # Count the frauds for each value of the chosen column
    column_vc = fraud_df[column_name].value_counts()

    # Bar chart of the fraud counts per column value
    fig = px.bar(x=column_vc.index, y=column_vc.values,
                 labels={'x': xaxix_name, 'y': 'Fraud Count'},
                 title=title)

    # Rotate the x-axis tick labels for readability
    fig.update_layout(xaxis_tickangle=xaxix_angle)

    # Show the bar chart
    fig.show()

## Load Data

In [54]:
# Path to the raw dataset
data_path = "../data/synthetic_transaction_data_Dining.csv"

In [55]:
df = pd.read_csv(data_path)

In [56]:
df.shape

(100000, 13)

In [57]:
# Get some basic information (df was already loaded above)
print(f"Sample data:\n\n {df.head()}")
print(f"\n--------------------\n\n Columns: {list(df.columns)}")
# shape[0] is the number of rows, shape[1] the number of columns
print(f"\n--------------------\n\n Number of rows: {df.shape[0]}")
print(f"\n--------------------\n\n Number of columns (features): {df.shape[1]}")
print(f"\n--------------------\n\n Number of numerical features: {df.select_dtypes(include='number').shape[1]}")
print(f"\n--------------------\n\n Number of categorical features: {df.select_dtypes(include=[object]).shape[1]}")


Sample data:

        transaction_id     transaction_date         cardholder_name  \
0  PJ82FF-U65G-T9SH9V  2018-09-18 23:40:00            Meagan Smith   
1  5WFUH5-VMPB-RHOXKV  2020-01-17 16:30:00  Miss Vanessa Briggs MD   
2  TW1LD6-3CGZ-MQW4KN  2022-08-26 22:41:00             Casey Lyons   
3  DKKV7Q-H2LA-406TWY  2018-04-18 20:25:00           Cynthia Munoz   
4  0CK2EQ-K275-9AC5UT  2023-05-26 21:55:00               Lynn Pham   

        card_number card_type   merchant_name merchant_category  \
0  4408914864277480      visa             KFC         Fast Food   
1  4533948622139044      visa      McDonald's         Fast Food   
2  4350240875308199      visa  Domino's Pizza         Fast Food   
3  4756687869818916      visa      McDonald's         Fast Food   
4  4813038430033752      visa     Papa John's         Fast Food   

  merchant_state merchant_city transaction_status  transaction_amount  \
0     New Jersey   Jersey City           Declined           38.055684   
1        Montan

In [58]:
# Check for the null values
df.isna().sum()

transaction_id            0
transaction_date          0
cardholder_name           0
card_number               0
card_type                 0
merchant_name             0
merchant_category         0
merchant_state            0
merchant_city             0
transaction_status        0
transaction_amount        0
merchant_category_code    0
fraud_flag                0
dtype: int64

In [66]:
# Show the distinct values and their counts for each categorical (object) feature
for i in df.select_dtypes(include=[object]).columns:
    print(df[i].value_counts())

transaction_id
PJ82FF-U65G-T9SH9V    1
L5X4RM-IMCC-RQKIX3    1
1LMGRU-0OHQ-FKJRBT    1
CGKXN3-BZ2P-E4MDPF    1
UNRDJF-4X3F-6U2AHA    1
                     ..
DM9E04-C0XO-1TRKVH    1
B06AL7-LS8A-GLIU1M    1
RQ05E0-LYXD-HTXPNV    1
N08U3L-ZL2Y-5AX8VA    1
BUGFV7-VY1D-65SNYM    1
Name: count, Length: 100000, dtype: int64
transaction_date
2023-06-27 23:17:00    3
2022-09-09 17:45:00    3
2022-04-12 23:18:00    3
2021-05-21 10:18:00    3
2023-04-29 07:33:00    3
                      ..
2021-04-30 06:37:00    1
2019-11-17 22:06:00    1
2021-12-23 20:28:00    1
2022-07-18 07:37:00    1
2020-10-22 06:31:00    1
Name: count, Length: 97884, dtype: int64
cardholder_name
Michael Johnson        39
Jennifer Smith         35
Michael Davis          34
Christopher Johnson    34
Michael Brown          33
                       ..
Alexandra Nelson        1
Robert Rodgers          1
Melanie Roth            1
Eugene Torres           1
Shannon Kerr            1
Name: count, Length: 70980, dtype: int64
car

### Distribution of transaction amounts by fraud flag

In [59]:
# Box plot of the transaction amount for each fraud flag value
fig = px.box(df, x='fraud_flag', y='transaction_amount', title='Transaction Amount by Fraud Flag')

# Label the axis
fig.update_xaxes(title_text='Fraud Flag')
fig.update_yaxes(title_text='Transaction Amount')

fig.show()
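
The box plot compares transaction amounts across the two `fraud_flag` values visually. A numeric summary per group can complement it. The sketch below uses a small hypothetical `toy` frame in place of `df`, so the numbers are illustrative and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for `df`; values are made up for illustration
toy = pd.DataFrame({
    "fraud_flag": [0, 0, 0, 1, 1],
    "transaction_amount": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Count, median and mean of the transaction amount for each fraud flag value
summary = toy.groupby("fraud_flag")["transaction_amount"].agg(["count", "median", "mean"])
print(summary)
```

Running the same `groupby` on `df` would give the values behind the two boxes in the plot.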

## Exploratory Data Analysis (EDA)

In [94]:
# Make new dataframe where frauds happen
fraud_df = df[df['fraud_flag'] == 1]

# Count the frauds in each merchant category
fraud_merchant_cat = fraud_df['merchant_category'].value_counts()

# Bar chart of the number of frauds in each merchant category
fig = px.bar(x=fraud_merchant_cat.index, y=fraud_merchant_cat.values, labels={'x':'Merchant Category', 'y':'Fraud Count'},
            title='Number of Frauds in Each Merchant Category')

fig.update_layout(xaxis_tickangle=0)

# Show the barchart
fig.show()

In [92]:
# Count the frauds for each card type
fraud_card_type = fraud_df['card_type'].value_counts()

# Bar chart of the number of frauds by card type
fig = px.bar(x=fraud_card_type.index, y=fraud_card_type.values, labels={'x':'Card Type', 'y':'Fraud Count'},
            title='Number of Frauds By Card Type')

# Show the barchart
fig.show()

In [98]:
# Make a barchart for visualizing number of frauds by state
fraud_barchart(column_name='merchant_state', xaxix_name='State', title='Number of Frauds in Each State', xaxix_angle=-45)

In [96]:
# Make a barchart for visualizing number of frauds by city
fraud_barchart(column_name='merchant_city', xaxix_name='City', title='Number of Frauds in Each City', xaxix_angle=-45)
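
The bar charts above show raw fraud counts, which tend to be highest wherever there are the most transactions. A fraud rate (frauds divided by all transactions in the group) may be a useful complement. The following is a sketch under assumptions: `fraud_rate_by` is a hypothetical helper, not part of this notebook, and the `toy` frame is made-up data rather than the real dataset.

```python
import pandas as pd


def fraud_rate_by(frame, column_name):
    """Fraud count, total count and fraud rate per value of `column_name` (hypothetical helper)."""
    grouped = frame.groupby(column_name)["fraud_flag"].agg(fraud_count="sum", total="count")
    grouped["fraud_rate"] = grouped["fraud_count"] / grouped["total"]
    # Highest fraud rate first
    return grouped.sort_values("fraud_rate", ascending=False)


# Made-up data: Texas has more frauds in absolute terms, Ohio has the higher rate
toy = pd.DataFrame({
    "merchant_state": ["Texas"] * 8 + ["Ohio"] * 2,
    "fraud_flag": [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

rates = fraud_rate_by(toy, "merchant_state")
print(rates)
```

In this toy example Texas has two frauds against one in Ohio, yet Ohio ranks first by rate. Calling `fraud_rate_by(df, 'merchant_state')` on the real data would show whether the same effect appears there.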