# Credit Card Fraud Detection

Credit card fraud has been a major problem in recent years, especially during the pandemic. In that period we had to buy almost everything online: food, medicine, clothes, books, and more. As a result, criminals found many opportunities to commit different crimes. Phishing was one of the crimes that grew. It consists of stealing a user's data through calls, emails, and messages. Through these channels, criminals send messages posing as financial companies to convince users to reply with their data.


<center><img width="60%" src="https://github.com/brayannmb/Data-Science/blob/main/credit_card_fraud/images/banner_credit_card.png?raw=true"
></center>

## Obtaining the Data

Using data collected from Kaggle, we'll build a Machine Learning model capable of predicting fraudulent transactions. Beyond that, such a model can help financial companies minimize problems with their clients.

This dataset contains 284,807 rows and 31 columns. Its features are the result of PCA, a method that reduces the dimensionality of a dataset.
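
To make PCA-based dimensionality reduction concrete, here is a minimal sketch on synthetic data. It assumes scikit-learn's `PCA`; the array `raw_features` and the choice of three components are illustrative and say nothing about how the Kaggle dataset was actually produced.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the original raw features (hypothetical data)
rng = np.random.default_rng(42)
raw_features = rng.normal(size=(1000, 10))

# Project the 10 original columns onto 3 principal components
pca = PCA(n_components=3)
components = pca.fit_transform(raw_features)

print(components.shape)  # (1000, 3)

# Share of the original variance kept by the 3 components
kept_variance = pca.explained_variance_ratio_.sum()
print(f"variance kept: {kept_variance:.2%}")
```

The transformed columns no longer carry the meaning of the original ones, which is why most variables in a PCA-transformed dataset are hard to interpret.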


In [56]:
#libraries

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import SMOTEN

sns.set_style('dark')


#obtaining the data
df = pd.read_csv("https://www.dropbox.com/s/b44o3t3ehmnx2b7/creditcard.csv?dl=1")


A good practice is to check the first five rows and the last five rows.

In [None]:
# checking first five lines

df.head(5)

In [None]:
# checking last five lines

df.tail()

A good sign in the results above is that we can't see any missing values. Let's confirm this hypothesis.

## Are there any missing values in this dataset?


In [None]:
# checking missing values, data types, and memory usage

df.info()

We were right: there are no missing values in this dataset.
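
`df.info()` reports non-null counts, so missing values have to be inferred by comparing them with the number of rows. A more direct check is to count nulls per column. The sketch below does this on a small hypothetical frame (the values are made up and only the column names mirror the dataset).

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame with one missing value in "Amount"
toy = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0],
    "Amount": [149.62, np.nan, 378.66],
    "Class": [0, 0, 1],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Total number of missing values in the whole frame
total_missing = int(missing_per_column.sum())
print(total_missing)  # 1
```

On the real data, the same call on `df` should return zero for every column if the conclusion above holds.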


## Basic Statistics

A good practice at the beginning of an analysis is to check basic statistics for the numerical variables.

In [None]:
# checking basic statistics

df.describe().T

This result is hard to interpret because only three variables (Time, Amount, and Class) have a clear meaning. However, there is one thing we can analyze: the **minimum value** of the Amount variable.

I noticed that the minimum value is zero, but a transaction with a value of zero doesn't make sense.

With that in mind, I'll take a closer look at this variable.

## Understanding the Amount Variable

In [None]:
# checking Amount variable for non-fraud transactions

df.loc[(df.Class == 0) & (df.Amount == 0)]

We can see that there are many non-fraud transactions with a value of zero, almost **1,800 transactions**, although we don't have much information about why this happened.

In [None]:
# checking Amount variable for fraudulent transactions

df.loc[(df.Class == 1) & (df.Amount == 0)]

Among fraudulent transactions, zero-value amounts are much rarer than among non-fraud transactions: only **27 transactions**.
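
Instead of reading the counts off the displayed rows, they can be computed per class in one step. The sketch below shows the idea on a small hypothetical frame that only borrows the `Amount` and `Class` column names; the values are made up.

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset
toy = pd.DataFrame({
    "Amount": [0.0, 10.5, 0.0, 99.9, 0.0, 1.0],
    "Class": [0, 0, 0, 1, 1, 0],
})

# Keep only zero-amount rows, then count them separately for each class
zero_counts = toy.loc[toy["Amount"] == 0, "Class"].value_counts().sort_index()
print(zero_counts)
```

Applied to `df`, the same expression would give the number of zero-amount transactions for class 0 and class 1.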



Still looking at the Amount variable, the output of `describe` shows that the mean is **88 USD** and the median is **22 USD**, while the maximum value is **25,691.16 USD**. It looks like **an outlier**.
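
One common way to back up the impression that a value is an outlier is the 1.5 × IQR rule. The sketch below applies it to a small hypothetical series of amounts; the values and the 1.5 multiplier are assumptions for illustration, not results from this dataset.

```python
import pandas as pd

# Hypothetical transaction amounts; the last one is far from the rest
amounts = pd.Series([5.0, 12.0, 22.0, 30.0, 45.0, 88.0, 120.0, 25000.0])

# Interquartile range: distance between the 25th and 75th percentiles
q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1

# Values above the upper fence are flagged as potential outliers
upper_fence = q3 + 1.5 * iqr
outliers = amounts[amounts > upper_fence]

# A mean far above the median is another hint of a long right tail
print(f"mean={amounts.mean():.2f}, median={amounts.median():.2f}")
print(outliers.tolist())  # [25000.0]
```

A large gap between mean and median, as in the `describe` output, is typical of a distribution with a few very large values.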

## Are outliers non-fraud transactions or fraudulent transactions?

So, let's analyze the distribution of the Amount variable.

In [None]:
# Checking the distribution of the Amount variable for non-fraud transactions (histogram)

fig, ax = plt.subplots(figsize=(10,5))

ax.hist(df.Amount[df.Class == 0], bins=40);

In [None]:
# Checking the distribution of the Amount variable for fraudulent transactions (histogram)

fig, ax = plt.subplots(figsize=(10,5))

ax.hist(df.Amount[df.Class == 1], bins=40);

In [None]:
# Checking the distribution of the Amount variable for non-fraud transactions (boxplot)

fig, ax = plt.subplots(figsize=(15,5))

non_fraud = df.loc[df.Class == 0].copy()

non_fraud.Amount.plot(kind='box', ax=ax, vert=False);

In [None]:
# Checking the distribution of the Amount variable for fraudulent transactions (boxplot)

fig, ax = plt.subplots(figsize=(15,5))

frauds = df.loc[df.Class == 1].copy()

frauds.Amount.plot(kind='box', ax=ax, vert=False);

Looking only at the values between **10,000 and 25,000**, they appear to be outliers. However, I won't make any changes, because this dataset is still very **unbalanced**.

## Class Variable

As introduced in the scope of the project, the Class variable is very unbalanced. In this section, I will examine this problem and how we can address it.

In [None]:
# checking distribution of the Class variable

ax = sns.countplot(x="Class", data=df)

In [None]:
frauds_percentage = (len(frauds) / df.shape[0]) * 100

print(f'Non-fraud transactions: {len(non_fraud)}')
print(f'Fraudulent transactions: {len(frauds)}')
print(f'Fraud percentage: {frauds_percentage:.4f}%')

When working with credit card fraud, it is normal to have far more non-fraud transactions than frauds.

However, this imbalance will harm the Machine Learning model.
If we train a model on this dataset without correcting the problem, the results will look very good on the training data but will be poor on new data. This happens because the model doesn't learn the variability of the dataset; it just memorizes the majority class.
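
A small sketch can show why a good-looking score is misleading on unbalanced data. It uses hypothetical labels (998 legitimate transactions and 2 frauds) and a stand-in "model" that always predicts the majority class; neither comes from this dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 998 legitimate transactions and 2 frauds
y_true = np.array([0] * 998 + [1] * 2)

# A "model" that has only learned the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
fraud_recall = recall_score(y_true, y_pred)

print(f"accuracy: {accuracy:.3f}")          # accuracy: 0.998
print(f"fraud recall: {fraud_recall:.3f}")  # fraud recall: 0.000
```

The accuracy is close to perfect even though no fraud is detected, which is why recall on the fraud class is a more informative metric here.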

## Preparing Data

As mentioned in the problem description, the dataset's features were already transformed by the PCA method. However, the Amount and Time variables were not scaled, which can become a problem later when applying Logistic Regression to predict the output.
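
Standardization rescales each column to zero mean and unit variance, z = (x − mean) / std. The sketch below is one possible way to do it with scikit-learn's `StandardScaler`, assuming the scaler is fitted on the training rows only; the `train` and `test` frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test rows with unscaled columns
train = pd.DataFrame({"Time": [0.0, 100.0, 200.0, 300.0],
                      "Amount": [1.0, 20.0, 50.0, 2500.0]})
test = pd.DataFrame({"Time": [150.0], "Amount": [75.0]})

scaler = StandardScaler()

# Learn mean and standard deviation from the training data only
train_scaled = scaler.fit_transform(train)

# Reuse the training statistics on the test data
test_scaled = scaler.transform(test)

print(np.allclose(train_scaled.mean(axis=0), 0))  # True
print(np.allclose(train_scaled.std(axis=0), 1))   # True
```

Fitting on the training data alone keeps information from the test set out of the preprocessing step.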

The dataset is also very unbalanced. In this section, I will apply undersampling, oversampling, and standardization methods to improve the dataset.

The steps: 

* Split the dataset into training and test sets.
* Undersampling using the Near-Miss method.
* Oversampling using the SMOTEN method.
* Standardization using the StandardScaler method.
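
To illustrate what balancing the classes means, the sketch below uses plain random undersampling with pandas. This is a simpler stand-in, not the Near-Miss method named in the steps, and the `train` frame is hypothetical.

```python
import pandas as pd

# Hypothetical unbalanced training data: 8 legitimate rows, 2 frauds
train = pd.DataFrame({
    "Amount": [10, 25, 3, 80, 41, 7, 19, 60, 500, 1],
    "Class": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

minority = train[train["Class"] == 1]
majority = train[train["Class"] == 0]

# Keep only as many majority rows as there are minority rows
majority_sample = majority.sample(n=len(minority), random_state=42)

# Combine both classes and shuffle the balanced training set
balanced = pd.concat([minority, majority_sample]).sample(frac=1, random_state=42)

class_counts = balanced["Class"].value_counts().to_dict()
print(class_counts)
```

Resampling of this kind is usually applied to the training set only, so that the test set keeps the original class distribution.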

### Separating Train and Test Datasets

In [None]:
# split the variables into features (X) and target (y)
X = df.drop('Class', axis=1)
y = df['Class']

# split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, shuffle=True)
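
The effect of `stratify` can be checked on a small synthetic example. The arrays `X_demo` and `y_demo` below are hypothetical, and `random_state=42` is added only to make the sketch reproducible; the call above does not set it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical unbalanced data: 1,000 rows, 2% positive class
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 20 + [0] * 980)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, shuffle=True, random_state=42
)

# The default test_size is 0.25: 750 training rows and 250 test rows
print(len(X_tr), len(X_te))  # 750 250

# Stratification keeps the positive-class share the same in both parts
print(int(y_tr.sum()), int(y_te.sum()))  # 15 5
```

Without `stratify`, a random split of such rare positives could leave the test set with very few or even no frauds.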

In [None]:
X_train