# Fraud Detection

## Functions and methods used in this notebook

In [None]:
import imblearn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

pd.options.display.float_format = "{:.2f}".format

# Features Dictionary
The dataset contains a record of the transactions carried out over a six-month period.  
All transactions were carried out by a user with the card physically present (no Internet transactions).  

Feature | Description
:- | :-
CustomerID | Customer ID
isFraud | 1=True; 0=False
Value | Value of the transaction
Time | Hour of the transaction*
Max_Dist_Nat | Maximum distance between transactions
Date | Date when transaction was carried out
CountryISOCode | Country where the transaction was carried out
BusinessChannel | ATM or Payment-Terminal**
WeekDay | Day of the week***
MonthDay | Day of the month the transaction was carried out
VincDate | Date corresponding to account creation
VincOffice | Vinculation office
Gender | M=Male; F=Female
Segment | Segment client belongs to
Age | Client's age
Income | Income
Expenditures | Expenditures
VisitedCountries | Number of visited countries
Dist_Sum_Int | Total International distance traveled by client
Dist_Mean_Int | Average International distance traveled by client
NatVisitedCities | Number of national cities visited by client
Dist_Mean_Nat | Average National distance traveled by client
Dist_Today | Difference in distance between the last transaction and the current one
Dist_Sum_Nat | Total National distance traveled by client

_All distances are in km_  
_Missing distances cannot be calculated; these values correspond to new clients_  
_Value, Income and Expenditures are in USD_  
_*No minutes, no seconds_  
_**Includes payment terminal types_  
_***0=Sunday, 1=Monday ... 6=Saturday_  
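
The `WeekDay` convention above (0=Sunday) differs from the pandas default, where Monday is 0. As a minimal sketch, assuming the weekday ever needs to be rebuilt from a date column, the conversion could look like this. The sample dates are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample dates: 2021-01-03 was a Sunday, 2021-01-09 a Saturday.
dates = pd.Series(pd.to_datetime(["2021-01-03", "2021-01-04", "2021-01-09"]))

# pandas encodes Monday=0 ... Sunday=6.
# Shifting by one and wrapping with modulo 7 gives Sunday=0 ... Saturday=6.
weekday_sunday0 = (dates.dt.dayofweek + 1) % 7
print(weekday_sunday0.tolist())
```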

## Data Loading & Overview

In [None]:
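# Combine the 'Date' and 'Time' columns into a single 'Date_Time' column while reading.
df = pd.read_csv("dataset/Fraud.csv", parse_dates=[['Date','Time'],'VincDate'])
# 'Time' holds only the hour, so append ':00' before converting to a full timestamp.
df['Date_Time'] = pd.to_datetime(df['Date_Time'].astype(str) + ':00')
print(f'Dataset has {df.shape[0]} rows and {df.shape[1]} columns.')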

In [None]:
df.head()

## Data Exploration

In [None]:
df.info()

### Missing values

In [None]:
df.isnull().sum()[df.isnull().sum()>0].to_frame('Nulls')

According to the business rules, missing distances can be neither calculated nor imputed, so they will be replaced with zeros.  
Rows with missing values in any other column will be dropped.

In [None]:
# Distances are undefined for new clients; per the business rule, fill them with zeros.
dist_cols = ['Dist_Sum_Int', 'Dist_Mean_Int', 'Dist_Max_Int', 'Dist_Mean_Nat']
df[dist_cols] = df[dist_cols].fillna(0.0)
# Drop rows that still have missing values in any other column.
df.dropna(axis=0, inplace=True)

Let's focus on the target feature `isFraud` and its correlations with the other features.

In [None]:
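# Restrict to numeric columns; pandas >= 2.0 raises an error if text columns are included.
corr_matrix = df.corr(numeric_only=True)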
corr_matrix[['isFraud']].sort_values(by ='isFraud',ascending=False)

### Balance of target variable

In [None]:
print("Proportion:", round(df['isFraud'].value_counts()[0]/df['isFraud'].value_counts()[1],1),": 1")
df['isFraud'].value_counts().to_frame()

The target is imbalanced: non-fraudulent transactions outnumber fraudulent ones.
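
As a minimal sketch of how the imbalance can be quantified, the class shares and the majority-to-minority ratio can be read from `value_counts`. The labels below are hypothetical stand-ins for `df['isFraud']`, not values from the dataset.

```python
import pandas as pd

# Hypothetical labels standing in for df['isFraud'].
is_fraud = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="isFraud")

counts = is_fraud.value_counts()                 # absolute count per class
shares = is_fraud.value_counts(normalize=True)   # fraction of rows per class
ratio = counts.loc[0] / counts.loc[1]            # non-fraud rows per fraud row

print(shares)
print(f"Non-fraud to fraud ratio: {ratio:.1f} : 1")
```

Reporting the shares alongside the ratio makes it easier to judge later whether a metric such as accuracy would be misleading on this target.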

In [None]:
sns.countplot(x='isFraud', data=df)
plt.show()

Next, check the occurrences of fraud by `BusinessChannel`.

In [None]:
sns.countplot(x='BusinessChannel', hue='isFraud', data=df)
plt.show()

Notice that most of the frauds were carried out at ATMs.

In [None]:
fig,ax = plt.subplots(2,1, figsize=(6, 6), sharex=True,sharey=True)
ax[0].set_title('Non-Fraudulent')
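# distplot is deprecated in recent seaborn; histplot with kde=True draws the histogram plus density curve.
sns.histplot(x=df.loc[df['isFraud']==0, 'Value'], kde=True, stat='density', ax=ax[0], color='orange')
ax[1].set_title('Fraudulent')
sns.histplot(x=df.loc[df['isFraud']==1, 'Value'], kde=True, stat='density', ax=ax[1], color='blue')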
plt.tight_layout()

The value of Non-Fraudulent and Fraudulent shows a similar distribution, but Fraudulent values do not 

### Check the correlation between features

In [None]:
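def graph_corr_matrix(df: pd.DataFrame):
    """Plot a heatmap of absolute pairwise correlations between the numeric columns."""
    corr = df.corr(numeric_only=True).abs()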
    fig, ax = plt.subplots(figsize=(10,10))

    cmap = sns.diverging_palette(250, 15, s=75, l=40,n=9,center="light",as_cmap=True)
    img_corr = sns.heatmap(corr, cmap=cmap, vmax=1, center=0,square=True, linewidths=.7, 
                           cbar_kws={"shrink": .7},ax=ax)
    return img_corr
graph_corr_matrix(df)
plt.show()

## Categorical Features

Most machine learning algorithms work with numbers, so let's convert these categories from text to numeric indicator columns.

In [None]:
to_dummies = ['CountryISOCode','BusinessChannel','Gender','Segment']
dummies = pd.get_dummies(df[to_dummies])
df_filtered = pd.concat([df,dummies],axis = 1,sort=False)
df_filtered.drop(labels = to_dummies, axis=1,inplace = True)
df_filtered.head()
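
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a hypothetical three-row frame (the values are illustrative only). Each category becomes its own indicator column named `<feature>_<category>`.

```python
import pandas as pd

# Hypothetical toy frame; only the column names mirror the dataset.
toy = pd.DataFrame({"Gender": ["M", "F", "M"], "Value": [10.0, 25.5, 7.0]})

# Encode only the listed columns; the remaining columns are kept unchanged.
encoded = pd.get_dummies(toy, columns=["Gender"])
print(encoded.columns.tolist())
```

Passing `drop_first=True` would drop one indicator per feature, which removes the redundant column when the encoded features feed a linear model.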

The National and International distance features are correlated with each other, as are Income and Expenditures. Within each correlated group, only the feature most correlated with `isFraud` is kept and the others are dropped, in order to avoid multicollinearity.

In [None]:
df[['isFraud','Max_Dist_Nat','Dist_Mean_Nat','Dist_Sum_Nat']].corr().abs()

In [None]:
df[['isFraud','Dist_Sum_Int','Dist_Mean_Int','Dist_Max_Int']].corr().abs()

In [None]:
df[['isFraud','Income', 'Expenditures']].corr().abs()

In [None]:
df.drop(labels=['Dist_Mean_Nat','Dist_Sum_Nat','Dist_Sum_Int', 'Dist_Mean_Int','Income'],axis=1,inplace=True)
graph_corr_matrix(df)
plt.show()
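
The manual selection above can also be sketched as a small helper. This is only an illustration under stated assumptions: `features_to_drop` is a hypothetical name, and the toy frame is made up; it returns the members of a correlated group to drop, keeping the one with the highest absolute correlation with the target.

```python
import pandas as pd

def features_to_drop(frame: pd.DataFrame, target: str, group: list) -> list:
    """Return the columns in `group` to drop, keeping the one most correlated with `target`."""
    # Absolute correlation of each group member with the target.
    corr = frame[group + [target]].corr()[target].drop(target).abs()
    keep = corr.idxmax()
    return [col for col in group if col != keep]

# Hypothetical toy data: 'a' tracks the target closely, 'b' barely at all.
toy = pd.DataFrame({
    "isFraud": [0, 0, 1, 1, 0, 1],
    "a": [1, 2, 9, 8, 1, 9],
    "b": [5, 1, 4, 2, 6, 3],
})
print(features_to_drop(toy, "isFraud", ["a", "b"]))
```

In the notebook this would be called once per group, for example with the three National distance columns, and the result passed to `df.drop`.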

In [None]:
sns.countplot(x='isFraud', data=df)
plt.show()

In [None]:
sns.lmplot(x='Age',y='Value',data=df,hue='isFraud',fit_reg=False)

In [None]:
df[df['Age']==0]['Age'].value_counts()

In [None]:
df.isnull().sum()