# **Factors of Fraud**

## Objectives
* Answer business requirement 1:
    * ACB would like to understand the patterns in the transaction data to better understand the most relevant variables correlated to a fraudulent transaction.

## Inputs

* outputs/datasets/collection/card_transactions.csv

## Outputs

* Generate code and visualisations that fulfil business requirement 1, above.

___

## Set up the Working Directory

Move the working directory up one level, from the notebook's folder to its parent, and confirm the result.

In [1]:
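import os

# Move up one level from the notebook's folder to its parent directory.
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))

# Confirm the new working directory.
current_dir = os.getcwd()
current_dir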

'/Users/edsmacbook/Library/CloudStorage/OneDrive-Personal/code_institute/project-5/fraud-detection'
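
Re-running the cell above moves the working directory up one more level each time. A minimal sketch of a re-runnable alternative is shown here, assuming the project root can be recognised by a marker such as the `outputs` folder. The helper name `find_project_root` and the choice of marker are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start, marker="outputs"):
    """Walk upwards from `start` until a directory containing `marker` is found.

    `marker` is an assumed name; use any file or folder that only
    exists at the project root.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains {marker!r}")


# Usage in a notebook cell (safe to run more than once):
#   import os
#   os.chdir(find_project_root(os.getcwd()))
```

Because the search starts at the current directory itself, calling it again after the directory has already been changed returns the same root.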

___

## Load Collected Data

In [2]:
import pandas as pd
df_raw_path = "outputs/datasets/collection/card_transactions.csv"
df = pd.read_csv(df_raw_path)
df.head()

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
0,57.877857,0.31114,1.94594,1,1,0,0,0
1,10.829943,0.175592,1.294219,1,0,0,0,0
2,5.091079,0.805153,0.427715,1,0,0,1,0
3,2.247564,5.600044,0.362663,1,1,0,1,0
4,44.190936,0.566486,2.222767,1,1,0,1,0


___

## Correlation Study

Credit: Code Institute Walkthrough 2 - Churnometer
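Pearson correlation measures the strength of a linear relationship between two variables. Spearman correlation measures the strength of a monotonic relationship by correlating the ranks of the values. Both coefficients range from -1 to 1.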
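
To make the difference between the two methods concrete, here is a minimal pure-Python sketch on toy data (not taken from the dataset). The helper names `pearson`, `ranks` and `spearman` are illustrative only; the notebook itself relies on `df.corr`.

```python
from math import exp, sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 to n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Toy data: y grows monotonically, but not linearly, with x.
x = list(range(1, 11))
y = [exp(v) for v in x]

pearson_r = pearson(x, y)    # well below 1: the relationship is not linear
spearman_r = spearman(x, y)  # 1.0: the ordering of x and y is identical
```

A variable whose effect on fraud is monotonic but strongly non-linear can therefore score differently under the two methods, which is one reason to compute both.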

In [3]:
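# Spearman correlation of every variable with `fraud`, sorted by absolute strength.
# [1:] drops the first entry, which is `fraud` correlated with itself (1.0).
corr_spearman = df.corr(method='spearman')['fraud'].sort_values(key=abs, ascending=False)[1:]
corr_spearman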

ratio_to_median_purchase_price    0.342838
online_order                      0.191973
used_pin_number                  -0.100293
distance_from_home                0.095032
used_chip                        -0.060975
distance_from_last_transaction    0.034661
repeat_retailer                  -0.001357
Name: fraud, dtype: float64

In [4]:
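# Pearson correlation of every variable with `fraud`, sorted by absolute strength.
# [1:] drops the first entry, which is `fraud` correlated with itself (1.0).
corr_pearson = df.corr(method='pearson')['fraud'].sort_values(key=abs, ascending=False)[1:]
corr_pearson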

ratio_to_median_purchase_price    0.462305
online_order                      0.191973
distance_from_home                0.187571
used_pin_number                  -0.100293
distance_from_last_transaction    0.091917
used_chip                        -0.060975
repeat_retailer                  -0.001357
Name: fraud, dtype: float64

For both Spearman and Pearson, the correlation between fraud and any single variable is at most weak to moderate (the largest coefficient is 0.46). However, the top four variables under both methods (ratio_to_median_purchase_price, online_order, distance_from_home and used_pin_number) seem worthy of further investigation.
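
The selection of the top variables can also be derived in code rather than read off by eye. The following is a sketch under stated assumptions: the two Series are rebuilt from the printed outputs above so the block runs on its own, and the cut-off `top_n = 4` and the name `selected_vars` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Correlations with `fraud`, copied from the two outputs above.
corr_spearman = pd.Series({
    "ratio_to_median_purchase_price": 0.342838,
    "online_order": 0.191973,
    "used_pin_number": -0.100293,
    "distance_from_home": 0.095032,
    "used_chip": -0.060975,
    "distance_from_last_transaction": 0.034661,
    "repeat_retailer": -0.001357,
})
corr_pearson = pd.Series({
    "ratio_to_median_purchase_price": 0.462305,
    "online_order": 0.191973,
    "distance_from_home": 0.187571,
    "used_pin_number": -0.100293,
    "distance_from_last_transaction": 0.091917,
    "used_chip": -0.060975,
    "repeat_retailer": -0.001357,
})

top_n = 4  # assumed cut-off, matching the four variables named above

# Rank by absolute value so strong negative correlations are not overlooked.
top_spearman = corr_spearman.abs().nlargest(top_n).index
top_pearson = corr_pearson.abs().nlargest(top_n).index

# Union of both lists, kept in a stable (sorted) order.
selected_vars = sorted(set(top_spearman) | set(top_pearson))
```

Here both methods agree on the same four variables, only in a different order, so the union contains exactly four names.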

## Exploratory Data Analysis (EDA) on Selected Variables

In [6]:
vars_to_study = ['ratio_to_median_purchase_price', 'online_order', 'distance_from_home','used_pin_number']

In [13]:
df_eda = df.filter(vars_to_study + ['fraud'])
df_eda.head()

Unnamed: 0,ratio_to_median_purchase_price,online_order,distance_from_home,used_pin_number,fraud
0,1.945940,0,57.877857,0,0
1,1.294219,0,10.829943,0,0
2,0.427715,1,5.091079,0,0
3,0.362663,1,2.247564,0,0
4,2.222767,1,44.190936,0,0


For this EDA, the binary-encoded variables are easier to read as descriptive string labels than as 0/1 codes.

In [14]:
# Replace the 0/1 codes with readable labels for the EDA.
df_eda['online_order'] = df_eda['online_order'].replace({1: 'Online', 0: 'Not Online'})
df_eda['used_pin_number'] = df_eda['used_pin_number'].replace({1: 'Pin Used', 0: 'No Pin'})
df_eda['fraud'] = df_eda['fraud'].replace({1: 'Fraud', 0: 'No Fraud'})
df_eda

Unnamed: 0,ratio_to_median_purchase_price,online_order,distance_from_home,used_pin_number,fraud
0,1.945940,Not Online,57.877857,No Pin,No Fraud
1,1.294219,Not Online,10.829943,No Pin,No Fraud
2,0.427715,Online,5.091079,No Pin,No Fraud
3,0.362663,Online,2.247564,No Pin,No Fraud
4,2.222767,Online,44.190936,No Pin,No Fraud
...,...,...,...,...,...
999995,1.626798,Not Online,2.207101,No Pin,No Fraud
999996,2.778303,Not Online,19.872726,No Pin,No Fraud
999997,0.218075,Online,2.914857,No Pin,No Fraud
999998,0.475822,Online,4.258729,No Pin,No Fraud
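
With the categories labelled, one possible next step is to compare the share of fraudulent transactions within each category. The sketch below uses `pd.crosstab` on a small hypothetical table; the rows are invented for illustration and are not taken from `card_transactions.csv`, and the names `toy` and `fraud_share` are assumptions rather than part of the original notebook.

```python
import pandas as pd

# Hypothetical toy rows, NOT taken from card_transactions.csv.
toy = pd.DataFrame({
    "online_order": ["Online", "Online", "Online", "Online",
                     "Not Online", "Not Online", "Not Online", "Not Online"],
    "fraud": ["Fraud", "Fraud", "No Fraud", "No Fraud",
              "Fraud", "No Fraud", "No Fraud", "No Fraud"],
})

# Share of each fraud label within each category (every row sums to 1).
fraud_share = pd.crosstab(toy["online_order"], toy["fraud"], normalize="index")
```

Applied to `df_eda`, the same call with `df_eda["online_order"]` or `df_eda["used_pin_number"]` as the first argument would give the fraud share per category for the real data.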
