# Credit Card Fraud Detection

## Libraries and data

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Feature Selection
from sklearn.feature_selection import mutual_info_classif, SelectKBest

# Standardization
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

# Train/test split
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# Classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import PrecisionRecallDisplay
from yellowbrick.classifier.rocauc import roc_auc
from yellowbrick.classifier import ConfusionMatrix
from sklearn.metrics import RocCurveDisplay  # replaces plot_roc_curve, which was removed in scikit-learn 1.2
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score
from sklearn.metrics import auc  # plot_precision_recall_curve was removed in scikit-learn 1.2; PrecisionRecallDisplay covers it
from sklearn.metrics import roc_auc_score
from sklearn import metrics

# Hyperparameter selection
from skopt import gp_minimize

In [None]:
data_fraude = pd.read_csv('creditcard_dataset.csv')

In [None]:
data_fraude

## EDA

The objective is to get to know the data and explore it, looking for patterns and relationships between the variables.

In [None]:
data_fraude.info()

The dataset has 31 numeric columns: 30 features and the target.
- V1-V28: principal components obtained with PCA (the original features are withheld for confidentiality reasons).
- Time: seconds elapsed between each transaction and the first transaction in the dataset.
- Amount: the transaction amount.
- Class: the response variable; it takes the value 1 in case of fraud and 0 otherwise.

In [None]:
# Checking for missing values (there are none)
data_fraude.isnull().sum()

In [None]:
# Checking data that has been classified as fraud
data_fraude[data_fraude['Class'] ==1][:20]

Many transactions in the Fraud class have an Amount of 1, that is, a very small financial value.

In [None]:
# Statistical summary
pd.set_option('display.max_columns', None) # see all columns
data_fraude.describe(percentiles = [.25, .5, .75, .90, .95, .99]) 

As stated on Kaggle, the real variables are not shared for security reasons; they were transformed with PCA. As a result, the variables show little discrepancy among themselves, with the exception of Amount and Time, which were not transformed by PCA.

The Amount variable has a mean of 88, but the mean is a poor summary of this data: the median is 22 and the third quartile is 77, so more than 75% of the values are below the mean. The mean is being pulled up by high values (outliers) of Amount. This also shows in the standard deviation, which is 250.
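To illustrate why the mean is misleading for skewed amounts, here is a minimal sketch on made-up values (the `amounts` array is hypothetical, not taken from the dataset). Two very large transactions are enough to push the mean far above the median.

```python
import numpy as np

# Hypothetical amounts: mostly small purchases plus two very large ones
amounts = np.array([5, 10, 12, 20, 22, 25, 30, 77, 5000, 25000], dtype=float)

mean_amount = amounts.mean()
median_amount = np.median(amounts)
# Fraction of transactions whose amount is below the mean
share_below_mean = (amounts < mean_amount).mean()

print(f"mean={mean_amount:.1f}, median={median_amount:.1f}, "
      f"share below mean={share_below_mean:.0%}")
```

On skewed data like this, the median and the quartiles describe a typical transaction much better than the mean does.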

Let's look at these metrics by class.

In [None]:
# Colors for the graphics
cores = ['#436DA9', '#E73788']

In [None]:
# Distribution of Amount by class
plt.figure(figsize=(10,10))
sns.boxplot(data=data_fraude, x='Class', y='Amount', palette=cores)
plt.show()

In [None]:
# Distribution of Amount by class, restricted to amounts below 100 to zoom in on the boxes
plt.figure(figsize=(10,5))
sns.boxplot(x='Class', y='Amount', data=data_fraude[data_fraude.Amount < 100], palette=cores)
plt.show()

The graphs above corroborate the conclusions of the statistical summary of the data. Most of the Amount values are below 100, but we have outliers, which are as high as 25,000 in the Non-Fraud Class and are generating a mean much higher than the median of the data, as well as a high standard deviation.

For non-fraud transactions, the range between the second and third quartiles runs from about 10 to 40 reais. For fraud, the same range runs from about 1 to 25 reais.

In [None]:
# Statistical summary of amount by class
fraude = data_fraude[data_fraude['Class'] == 1]
naofraude = data_fraude[data_fraude['Class'] == 0]

print("Fraud - statistical summary")
print(fraude["Amount"].describe())
print("\nNon-fraud - statistical summary")
print(naofraude["Amount"].describe())

As seen in the boxplots above, both categories have a similar mean value, but it does not represent the data well, as 75% of them have a value below the mean. Both categories have outliers, but non-fraudulent transactions have outliers with much higher values.

In [None]:
# Time covers the 2 days of recorded transactions. Let's look at how transactions are distributed over the hours of the day

# Converting Time to hours
timedelta = pd.to_timedelta(data_fraude['Time'], unit='s')
data_fraude['Time_h'] = (timedelta.dt.components.hours).astype(int)

data_fraude.drop(['Time'], axis=1, inplace=True)

# Plotting the 24-hour transaction graph
bins=24
data_fraude[(data_fraude['Class'] == 0)].hist(column="Time_h",color="#436DA9",bins=bins)
data_fraude[(data_fraude['Class'] == 1)].hist(column ="Time_h",color="#E73788",bins=bins)
plt.show()

In [None]:
# For a better view of how transactions occur throughout the day
plt.figure(figsize=(12,6))
target_0 = data_fraude.loc[data_fraude['Class'] == 0]
target_1 = data_fraude.loc[data_fraude['Class'] == 1]


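# distplot is deprecated in seaborn; kdeplot + rugplot draw the same density curve and rug
sns.kdeplot(x=target_0['Time_h'], color="#436DA9")
sns.rugplot(x=target_0['Time_h'], color="#436DA9")
sns.kdeplot(x=target_1['Time_h'], color="#E73788")
sns.rugplot(x=target_1['Time_h'], color="#E73788")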

plt.show()

Fraudulent transactions are spread more evenly over the 24 hours of the day. Normal transactions are less frequent at night.
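As a quick sanity check of the hour-of-day conversion, the sketch below applies it to a few hand-picked second counts (the `seconds` values are illustrative) and compares it with plain integer arithmetic. Both approaches wrap around after 24 hours, which is why the two days of data collapse onto a single 0-23 axis.

```python
import pandas as pd

# Illustrative elapsed times in seconds, including values past the first day
seconds = pd.Series([0, 3599, 3600, 86399, 86400, 90000])

# Same conversion as in the notebook: hour component of the timedelta
hours_components = pd.to_timedelta(seconds, unit='s').dt.components.hours

# Equivalent integer arithmetic: whole hours elapsed, modulo 24
hours_arithmetic = (seconds // 3600) % 24

print(hours_components.tolist())
print(hours_arithmetic.tolist())
```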

In [None]:
# Shape before removing duplicate rows
data_fraude.shape

In [None]:
data_fraude.drop_duplicates(inplace=True) # deleting duplicate rows

In [None]:
data_fraude.shape # approximately 1000 duplicate rows were removed
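Instead of comparing shapes by eye, the number of duplicate rows can be counted directly. The sketch below does this on a small hypothetical frame (`toy` is made up for illustration); the same two calls would apply to `data_fraude`.

```python
import pandas as pd

# Hypothetical frame: the first two rows are exact duplicates of each other
toy = pd.DataFrame({
    'V1': [0.1, 0.1, 0.5, 0.5, 0.9],
    'Amount': [10.0, 10.0, 3.0, 4.0, 1.0],
    'Class': [0, 0, 0, 0, 1],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates()

print(n_duplicates, toy.shape, deduplicated.shape)
```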

This is a supervised binary classification problem: the model is trained on labeled data, where each transaction already carries the desired answer (fraud or non-fraud). Let's analyze the target variable.

In [None]:
# Checking the distribution of the target class
plt.figure(figsize=(8,10))
g = sns.countplot(x='Class', data=data_fraude, palette=cores)  # x must be passed by keyword in recent seaborn
g.set_title('Class distribution')
g.set_ylabel('Number of occurrences')
g.set_xlabel('Class')
g.set_xticklabels(['Non-fraud', 'Fraud'])

In [None]:
# Checking the Fraud Percentage
total = len(data_fraude)
fraude = data_fraude['Class'].value_counts()[1] / total * 100
print('Fraud percentage:', fraude)

In [None]:
# Checking the data in numbers
total = len(data_fraude)
normal = len(data_fraude[data_fraude.Class == 0])
fraudes = len(data_fraude[data_fraude.Class == 1])
print('Total number of transactions {}'.format(total))
print('Number of normal transactions {}'.format(normal))
print('Number of fraudulent transactions {}'.format(fraudes))

The target class has two outputs: non-fraud (0) and fraud (1). As expected in banking transactions, fraud cases are far fewer than normal transactions. Here, frauds represent only about 0.168% of the data (473 out of 280922), which makes the target variable highly imbalanced.
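The counts and percentages of each class can also be obtained in one step with `value_counts`. A minimal sketch on a hypothetical label series (`labels` is made up, with 2 frauds in 1000 transactions):

```python
import pandas as pd

# Hypothetical labels: 998 normal transactions and 2 frauds
labels = pd.Series([0] * 998 + [1] * 2, name='Class')

counts = labels.value_counts()
shares = labels.value_counts(normalize=True)  # fractions that sum to 1

summary = pd.DataFrame({'count': counts, 'percent': shares * 100})
print(summary)
```

With this level of imbalance, a model that always predicts "non-fraud" would already reach very high accuracy, which is why accuracy alone is not a useful metric here.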

#### Let's check the financial impact of fraud cases.

In [None]:
# Amount transacted in 2 days
data_fraude['Amount'].sum()

In [None]:
# Amount of fraud
fraude = data_fraude[data_fraude["Class"] == 1]
fraude['Amount'].sum()

In [None]:
# Amount of non-fraud
naofraude = data_fraude[data_fraude["Class"] == 0]
naofraude['Amount'].sum()

About R$ 25 million were transacted in 2 days, of which R$ 58,000 came from fraudulent transactions. That represents a financial loss of 0.23% through customer reimbursement.
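The per-class totals and the fraud share of the volume can be computed together with a `groupby`. The sketch below uses a small hypothetical frame (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical transactions: four normal, one fraudulent
toy = pd.DataFrame({
    'Amount': [10.0, 20.0, 30.0, 38.0, 2.0],
    'Class': [0, 0, 0, 0, 1],
})

# Total amount per class, then the fraud share of the overall volume
amount_by_class = toy.groupby('Class')['Amount'].sum()
fraud_share = amount_by_class[1] / amount_by_class.sum() * 100

print(amount_by_class)
print(f"Fraud share of volume: {fraud_share:.2f}%")
```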

## Feature selection

We will select the most important features to reduce the risk of overfitting.

In [None]:
# Feature matrix: every column except the target
data = data_fraude.drop(columns=['Class'])
data

In [None]:
alvo = data_fraude['Class']  # target as a 1-D Series, the shape scikit-learn expects for y
alvo

Since we don't know what the features mean (they went through PCA), let's select the ones that contribute most to predicting the target, using mutual information.

In [None]:
mi = mutual_info_classif(data, alvo)
mi

In [None]:
# Creating series to better visualize the most important variables
mi = pd.Series(mi)
mi.index = data.columns
mi = mi.sort_values(ascending=False)

In [None]:
# Plotting the results
mi.plot.bar(figsize=(22,10))

In [None]:
# Selecting the most important features
sel = SelectKBest(mutual_info_classif, k=10).fit(data, alvo)
data.columns[sel.get_support()]

We chose the top 10 variables to build the model.
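Mutual information is estimated with a randomized nearest-neighbor method, so the scores can change slightly between runs. One way to make the selection reproducible, sketched here on synthetic data (`X_demo`, `y_demo` and the column names are assumptions for illustration), is to fix `random_state` through `functools.partial` before handing the scorer to `SelectKBest`.

```python
from functools import partial

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the real data: 8 features, 3 of them informative
X_arr, y_demo = make_classification(n_samples=500, n_features=8, n_informative=3,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(8)])

# Fix the seed of the mutual information estimator
score_func = partial(mutual_info_classif, random_state=0)

selector = SelectKBest(score_func, k=3).fit(X_demo, y_demo)
selected = list(X_demo.columns[selector.get_support()])

# Fitting again with the same seed gives the same selection
selector_again = SelectKBest(score_func, k=3).fit(X_demo, y_demo)
selected_again = list(X_demo.columns[selector_again.get_support()])

print(selected)
```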

In [None]:
selecionadas = data_fraude[['V3', 'V4', 'V10', 'V11', 'V12', 'V14', 'V16', 'V17', 'V18', 'Time_h', 'Class']]
selecionadas

## Split train and test

In [None]:
X = selecionadas.drop(columns = ['Class'])
y = selecionadas['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# Stratify to ensure both groups contain the same percentage of fraud cases

We split the data into a training set (70%) and a test set (30%) so that overfitting can be detected. If the model were evaluated on the same data used for training, it would appear to perform well simply because it memorized that data, and the score would say nothing about its accuracy on unseen data.
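The effect of `stratify` can be checked on a small synthetic example (the arrays below are made up, with 1% positives): both splits end up with the same fraud proportion as the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 990 normal transactions and 10 frauds (1%)
y_demo = np.array([0] * 990 + [1] * 10)
X_demo = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# With stratification, the 10 frauds are divided 7 / 3 between train and test
print(len(X_tr), y_tr.sum(), y_tr.mean())
print(len(X_te), y_te.sum(), y_te.mean())
```

Without `stratify`, a rare class could by chance end up under-represented in one of the splits.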

In [None]:
# Training dataset size
len(X_train)

In [None]:
# Test dataset size
len(X_test)

In [None]:
# Checking the percentage of classes in datasets
print('Non-fraud', round(
        y_train.value_counts()[0]/len(y_train)*100, 2), '% of the training dataset')
print('Fraud', round(
        y_train.value_counts()[1]/len(y_train)*100, 2), '% of the training dataset')

In [None]:
print('Non-fraud', round(
        y_test.value_counts()[0]/len(y_test)*100, 2), '% of the test dataset')
print('Fraud', round(
        y_test.value_counts()[1]/len(y_test)*100, 2), '% of the test dataset')

## Standardization

Let's standardize the features so that they share a common scale. In models that rely on distances or gradient-based optimization, a feature with a larger order of magnitude can dominate simply because of its scale rather than its actual importance, and standardization prevents that. Tree-based models such as Random Forest split on thresholds and are largely insensitive to this rescaling, so the step matters mainly if other model families are tried on the same data.

In [None]:
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
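Note that the scaler is fitted on the training set only and then applied to both sets, so no information from the test set leaks into the preprocessing. A minimal sketch on random data (the `*_demo` arrays are made up) shows the consequence: the training features end up with mean 0 and standard deviation 1 exactly, while the test features are only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Made-up data: train and test drawn from the same distribution
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

# Learn mean and standard deviation from the training data only
scaler_demo = StandardScaler().fit(X_train_demo)
train_scaled = scaler_demo.transform(X_train_demo)
test_scaled = scaler_demo.transform(X_test_demo)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.mean(axis=0), test_scaled.std(axis=0))
```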

## Random Forest for Classification

In [None]:
RF = RandomForestClassifier(random_state=0)

RF.fit(X_train_scaled, y_train)

predictions = RF.predict(X_test_scaled)

In [None]:
print(classification_report(y_test, predictions))

In [None]:
matrix = ConfusionMatrix(RF, classes=['Non-fraud', 'Fraud'], cmap=['#436DA9', '#E73788'])
matrix.fit(X_train_scaled, y_train)
matrix.score(X_test_scaled, y_test)
matrix.show();

In [None]:
# As the classes are imbalanced, the precision-recall curve (summarized by average precision, AP) is used to evaluate the model
display2 = PrecisionRecallDisplay.from_estimator(RF, X_test_scaled, y_test, name="Random Forest test", color='#436DA9')
display2 = PrecisionRecallDisplay.from_estimator(RF, X_train_scaled, y_train, name="Random Forest train", color='#E73788', ax=display2.ax_ )
display2.figure_.suptitle("Random Forest")
plt.show()

In [None]:
# Just looking at the ROC curve out of curiosity.
# RocCurveDisplay.from_estimator replaces plot_roc_curve, which was removed in scikit-learn 1.2
rff = metrics.RocCurveDisplay.from_estimator(RF, X_test_scaled, y_test, name='Random Forest test', color='#436DA9')
rff = metrics.RocCurveDisplay.from_estimator(RF, X_train_scaled, y_train, name='Random Forest train', color='#E73788', ax=rff.ax_)
rff.figure_.suptitle("Random Forest")

plt.show()

The results look good, so let's use cross-validation to check that they hold across different splits of the training data.

## Cross validation

In [None]:
# Cross-validation of recall (5 folds)
recallRF = cross_val_score(RF, X_train_scaled, y_train, cv=5, scoring='recall')

In [None]:
print(recallRF)

In [None]:
print(recallRF.mean())

In [None]:
print(recallRF.std())

In [None]:
def intervalo(recallRF):
    # Mean recall and an approximate interval of +/- 2 standard deviations across the folds
    mean = recallRF.mean()
    dv = recallRF.std()
    print('Mean recall: {:.2f}%'.format(mean*100))
    print('Recall interval: [{:.2f}% ~ {:.2f}%]'
           .format((mean - 2*dv)*100, (mean + 2*dv)*100))
intervalo(recallRF)
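One caveat: cross-validating on data that was already scaled means each validation fold influenced the scaler. A common way around this, sketched below on synthetic data (`X_demo`, `y_demo` and the small forest size are assumptions chosen to keep the example fast), is to put the scaler and the classifier in a `Pipeline` so that scaling is re-fitted inside every fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the unscaled training set
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.9, 0.1], random_state=0)

# The scaler is fitted on the training part of each fold only
pipeline = make_pipeline(StandardScaler(),
                         RandomForestClassifier(n_estimators=50, random_state=0))

# Stratified folds keep the class ratio in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
recall_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring='recall')

print(recall_scores)
print('Mean recall: {:.2f}%'.format(recall_scores.mean() * 100))
```

For a Random Forest the difference is usually negligible, but the pattern carries over unchanged to models that are sensitive to scaling.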

## Hyperparameters

In [None]:
# Checking parameters RandomForest
?RandomForestClassifier

Let's try to improve the model by tuning some of the Random Forest hyperparameters.

## Bayesian Optimization

In [None]:
# Searching for the hyperparameters that maximize average precision (AP)

def treinar_modelo(params):
    max_features = params[0]
    n_estimators = params[1]
    max_depth = params[2]

    print(params, '\n') # prints the parameters being tested in each iteration

    mdl = RandomForestClassifier(max_features=max_features, n_estimators=n_estimators,
                        max_depth=max_depth, random_state=0, n_jobs=-1)
    mdl.fit(X_train_scaled, y_train)

    # Average precision needs scores, not hard labels: use the fraud-class probability
    scores = mdl.predict_proba(X_test_scaled)[:, 1]

    # gp_minimize minimizes its objective, so return the negative AP
    return -average_precision_score(y_test, scores)
# AP is the metric chosen to judge the model because the classes are imbalanced

# Minimum and maximum parameter values
space = [(0.1, 1.0), # max_features
         (100, 500), # n_estimators
         (1, 8)] # max_depth

In [None]:
resultados_gp = gp_minimize(treinar_modelo, space,
                            random_state=1,
                            verbose=1, # show progress
                            n_calls=30, # number of evaluations of the objective
                            n_random_starts=10) # evaluates 10 random points first; after that, the optimizer
# proposes the points that look most promising instead of sampling at random,
# balancing exploration of new regions against exploitation of the best regions found so far

In [None]:
# Checking the best hyperparameters
resultados_gp.x
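Because the objective above is scored on the test set, the tuned hyperparameters have indirectly seen that data, and the final test metrics may be slightly optimistic. A possible alternative, sketched here on synthetic data (`X_demo`, `y_demo`, `cv_objective` and the example parameter values are hypothetical), is to score each candidate by cross-validated average precision on the training data only. A function with this shape could be passed to `gp_minimize` in place of `treinar_modelo`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data standing in for the training set
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     weights=[0.9, 0.1], random_state=0)

def cv_objective(params):
    """Negative mean cross-validated average precision for one candidate."""
    max_features, n_estimators, max_depth = params
    model = RandomForestClassifier(max_features=max_features,
                                   n_estimators=int(n_estimators),
                                   max_depth=int(max_depth),
                                   random_state=0)
    # The 'average_precision' scorer uses predicted probabilities, not hard labels
    scores = cross_val_score(model, X_demo, y_demo, cv=3,
                             scoring='average_precision')
    return -scores.mean()

# Evaluate a single hypothetical candidate: max_features, n_estimators, max_depth
loss = cv_objective([0.5, 20, 4])
print(loss)
```

The trade-off is run time: every candidate is now fitted once per fold instead of once.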

## Running the model with the hyperparameters

In [None]:
RF2 = RandomForestClassifier(random_state=0, max_features=0.55, n_estimators=498, max_depth=7)

RF2.fit(X_train_scaled, y_train)

predictions2 = RF2.predict(X_test_scaled)

In [None]:
print(confusion_matrix(y_test, predictions2))
print(classification_report(y_test, predictions2))  # report for the tuned model RF2, not the baseline RF

In [None]:
display2 = PrecisionRecallDisplay.from_estimator(RF2, X_test_scaled, y_test, name="Random Forest test", color='#436DA9')
display2 = PrecisionRecallDisplay.from_estimator(RF2, X_train_scaled, y_train, name="Random Forest train", color='#E73788', ax=display2.ax_ )
display2.figure_.suptitle("Random Forest")
plt.show()

Even after searching for better hyperparameters, the model showed little improvement. Still, the end result is considered good.

## Conclusions

- The model correctly classified 84131 non-fraud transactions
- The model classified only 4 legitimate transactions as fraud (this keeps the number of customers with blocked transactions to a minimum, as one of our objectives was to avoid flagging good customers as suspicious as much as possible)
- The model correctly classified 115 frauds
- The model classified 27 frauds as legitimate transactions
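The four counts above form the confusion matrix of the test set, and the usual metrics for the fraud class follow directly from them. A short worked calculation (the variable names are only for illustration):

```python
# Confusion matrix counts reported above
tn = 84131  # non-fraud correctly classified
fp = 4      # legitimate transactions flagged as fraud
tp = 115    # frauds correctly detected
fn = 27     # frauds missed

precision = tp / (tp + fp)   # share of fraud alerts that were real frauds
recall = tp / (tp + fn)      # share of real frauds that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```

In other words, almost every alert raised by the model is a real fraud, while roughly one in five frauds still goes undetected.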