# **Factors of Fraud**

## Objectives
* Answer business requirement 1:
    * ACB would like to understand the patterns in the transaction data to better understand the most relevant variables correlated to a fraudulent transaction.

## Inputs

* outputs/datasets/collection/card_transactions.csv

## Outputs

* Generate code and visualisations that fulfil business requirement 1, above.

___

## Set up the Working Directory

Change the working directory to the parent of the current folder and confirm the result.

In [None]:
import os
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
current_dir

___

## Load Collected Data

In [None]:
import pandas as pd
df_raw_path = "outputs/datasets/collection/card_transactions.csv"
df = pd.read_csv(df_raw_path)
df.head()

___

## Correlation Study

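Credit: Code Institute Walkthrough 2 - Churnometer

Pearson's coefficient measures the strength of the linear relationship between two variables. Spearman's coefficient measures the strength of a monotonic relationship and is computed on ranks, so it is less sensitive to outliers and skewed distributions. Both range from -1 to 1.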

In [None]:
corr_spearman = df.corr(method='spearman')['fraud'].sort_values(key=abs, ascending=False)[1:]
corr_spearman

In [None]:
corr_pearson = df.corr(method='pearson')['fraud'].sort_values(key=abs, ascending=False)[1:]
corr_pearson

For both Spearman and Pearson, the correlation between fraud and any single variable is only weak to moderate. However, the top four variables given by both methods (ratio_to_median_purchase_price, online_order, distance_from_home and used_pin_number) seem worthy of further investigation.
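To illustrate why the two methods can disagree, the following is a minimal, self-contained sketch using only the standard library. It treats Spearman as Pearson applied to ranks. The helper names (`pearson`, `average_ranks`, `spearman`) and the toy data are assumptions made for illustration; they are not part of the notebook or of pandas.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson computed on the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (illustrative only): y grows monotonically but not linearly with x.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

pearson_r = pearson(x, y)
spearman_r = spearman(x, y)
print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

On this toy data the relationship is perfectly monotonic, so Spearman reaches 1, while Pearson stays below 1 because the relationship is not a straight line. This is one reason to compare both coefficients on skewed variables such as distances and price ratios.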

## Exploratory Data Analysis (EDA) on Selected Variables

In [None]:
vars_to_study = ['ratio_to_median_purchase_price', 'online_order', 'distance_from_home','used_pin_number']

In [None]:
df_eda = df.filter(vars_to_study + ['fraud'])
df_eda.head()

For this EDA, it is more useful to view the encoded variables as objects/strings.

In [None]:
df_eda['online_order'] = df_eda['online_order'].replace({1: 'Online', 0: 'Not Online'})
df_eda['used_pin_number'] = df_eda['used_pin_number'].replace({1: 'Pin Used', 0: 'No Pin'})
df_eda['fraud'] = df_eda['fraud'].replace({1.0: 'Fraud', 0: 'No Fraud'})
df_eda.head()

### Variable Distribution by Fraud

#### Categorical Variables - Online Order & Used PIN Number

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')


def plot_categorical(df, col, target_var):
    plt.figure(figsize=(12, 5))
    sns.countplot(data=df, x=col, hue=target_var,
                  order=df[col].value_counts().index)
    plt.xticks(rotation=45)
    plt.ylabel("Count")
    plt.xlabel(f"{col.replace('_',' ').title()}")
    plt.title(f"{col.replace('_',' ').title()}", fontsize=20, y=1.05)
    plt.legend(title="Fraud")
    plt.show()

target_var = 'fraud'
for col in vars_to_study:
    if df_eda[col].dtype == 'object':
        plot_categorical(df_eda, col, target_var)
        print("\n\n")

These graphs suggest that:
1. A transaction is more likely to be fraudulent if no PIN is used.
2. Online transactions are more likely to be fraudulent than offline transactions.

#### Numerical/Continuous Variables - Ratio to Median Purchase Price & Distance from Home

In [None]:
def plot_numerical(df, col, target_var, xlim=None):
    plt.figure(figsize=(8, 5))
    sns.histplot(data=df, x=col, hue=target_var, kde=True, element="step")
    plt.ylabel("Count")
    plt.xlabel(f"{col.replace('_',' ').title()}")
    plt.title(f"{col.replace('_',' ').title()}", fontsize=20, y=1.05)
    if xlim is not None:
        plt.xlim(xlim)
    plt.show()


target_var = 'fraud'
for col in vars_to_study:
    if df_eda[col].dtype == 'float64':
        plot_numerical(df_eda, col, target_var)
        print("\n\n")

These graphs are difficult to interpret because the ranges of the variables are so wide. The graphs below therefore cap the x-axis at the 95th percentile.

In [None]:
nth_percentile = 95
for col in vars_to_study:
    if df_eda[col].dtype == 'float64':
        variable_percentile = df_eda[col].quantile(nth_percentile/100)
        plot_numerical(df_eda, col, target_var, xlim=[0, variable_percentile])
        print("\n\n")

However, these graphs are still inconclusive and require further investigation.

## Parallel Plot

In order to better visualise the data, Ratio to Median Purchase Price and Distance from Home will be discretised.

In [None]:
from feature_engine.discretisation import EqualFrequencyDiscretiser

n_classes = 10
disc = EqualFrequencyDiscretiser(q=n_classes, variables=['distance_from_home','ratio_to_median_purchase_price'])
df_parallel = disc.fit_transform(df_eda)
df_parallel.head()
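As a rough illustration of what equal-frequency discretisation does in general, the sketch below bins a skewed list of numbers into ten groups of equal size using only the standard library. It is not feature_engine's implementation; the toy data, the variable names and the choice of `bisect_left` for assigning bins are assumptions made for this example.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles

# Illustrative skewed data: squares of 1..100 (not taken from the dataset).
values = [i ** 2 for i in range(1, 101)]
n_bins = 10

# quantiles() returns the n_bins - 1 inner cut points (here, the deciles).
cut_points = quantiles(values, n=n_bins)

# A value's bin index is the number of cut points strictly below it.
bin_index = [bisect_left(cut_points, v) for v in values]

# Count how many values landed in each bin.
bin_counts = Counter(bin_index)
for b in sorted(bin_counts):
    print(f"bin {b}: {bin_counts[b]} values")
```

Although the values are far from evenly spaced, each bin receives the same number of observations, because the cut points follow the data's quantiles rather than fixed-width intervals. That property is what makes this kind of binning convenient for heavily skewed variables.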

The discretised data will also be re-labelled with more meaningful labels.

In [None]:

classes_ranges = disc.binner_dict_['distance_from_home'][1:-1]

distance_map = {}
for n in range(0, n_classes):
    if n == 0:
        distance_map[n] = f"<{round(classes_ranges[0],2)}"
    elif n == n_classes-1:
        distance_map[n] = f"+{round(classes_ranges[-1],2)}"
    else:
        distance_map[n] = f"{round(classes_ranges[n-1],2)} to {round(classes_ranges[n],2)}"

In [None]:
classes_ranges = disc.binner_dict_['ratio_to_median_purchase_price'][1:-1]

ratio_labels_map = {}
for n in range(0, n_classes):
    if n == 0:
        ratio_labels_map[n] = f"<{round(classes_ranges[0],2)}"
    elif n == n_classes-1:
        ratio_labels_map[n] = f"+{round(classes_ranges[-1],2)}"
    else:
        ratio_labels_map[n] = f"{round(classes_ranges[n-1],2)} to {round(classes_ranges[n],2)}"
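The two cells above apply the same labelling logic to different variables. One possible refactor, shown here as a sketch, is a single helper that takes the inner bin edges (with the infinite outer edges already removed) and returns the label map. The function name `build_bin_labels`, its `ndigits` parameter and the example edges are hypothetical and not part of the notebook.

```python
def build_bin_labels(inner_edges, ndigits=2):
    """Map bin index -> readable label, given the inner bin edges.

    Assumes inner_edges is sorted and excludes the -inf/+inf outer edges,
    so there are len(inner_edges) + 1 bins in total.
    """
    n_classes = len(inner_edges) + 1
    labels = {}
    for n in range(n_classes):
        if n == 0:
            # First bin: everything below the lowest inner edge.
            labels[n] = f"<{round(inner_edges[0], ndigits)}"
        elif n == n_classes - 1:
            # Last bin: everything above the highest inner edge.
            labels[n] = f"+{round(inner_edges[-1], ndigits)}"
        else:
            lower = round(inner_edges[n - 1], ndigits)
            upper = round(inner_edges[n], ndigits)
            labels[n] = f"{lower} to {upper}"
    return labels


# Illustrative edges only; real edges would come from the fitted discretiser.
example_labels = build_bin_labels([1.0, 2.5, 7.0])
print(example_labels)
```

Under these assumptions, each map could be produced with one call per variable, for example by passing `disc.binner_dict_['distance_from_home'][1:-1]` as `inner_edges`.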

In [None]:
df_parallel['distance_from_home'] = df_parallel['distance_from_home'].replace(distance_map)
df_parallel['ratio_to_median_purchase_price'] = df_parallel['ratio_to_median_purchase_price'].replace(ratio_labels_map)

For the parallel plot, the Fraud variable is converted back to a numeric encoding so that it can be used to colour the lines.

In [None]:
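# map() covers both labels and avoids the downcasting FutureWarning that replace() can raise in recent pandas
df_parallel['fraud'] = df_parallel['fraud'].map({'Fraud': 1, 'No Fraud': 0})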

In [None]:
import plotly.graph_objects as go

ratio_dim = go.parcats.Dimension(
    values=df_parallel.ratio_to_median_purchase_price,
    categoryorder='category ascending', label="Ratio to Median Purchase Price"
)

online_dim = go.parcats.Dimension(
    values=df_parallel.online_order, label="Online Order")
pin_dim = go.parcats.Dimension(
    values=df_parallel.used_pin_number, label="Used PIN Number")

distance_dim = go.parcats.Dimension(
    values=df_parallel.distance_from_home, label="Distance from Home", categoryorder='category ascending')

fraud_dim = go.parcats.Dimension(
    values=df_parallel.fraud, label="Fraud", categoryarray=[0, 1], ticktext=['No Fraud', 'Fraud'])

colorscale = [[0, 'lightsteelblue'], [1, 'red']]

fig = go.Figure(data=[go.Parcats(
    dimensions=[ratio_dim, distance_dim, online_dim, pin_dim, fraud_dim],
    line={'color': df_parallel['fraud'], 'colorscale': colorscale},
    hoveron='color', hoverinfo='count',
    labelfont={'size': 18, 'family': 'Arial'},
    tickfont={'size': 16, 'family': 'Arial'},
    arrangement='freeform',
)])

fig.show()