# Credit Card Fraud Detection

The previous exercises had you take a closer look at the different parts of a neural network: the architecture, the compilation, and the fitting.

Let's now work on a real-life dataset that has **a lot of data**!

## The data
For this open challenge, you will work with data extracted from credit card transactions. Because this data is sensitive, only 3 of the 31 columns are given as-is: the others have been transformed to anonymize them (they are PCA projections of the original data).

The three known columns are:

- "TIME": the time elapsed between the transaction and the first transaction in the dataset
- "AMOUNT": the amount of the transaction
- "CLASS" (our target): 0 means that the transaction is valid whereas 1 means that it is a fraud.

❓ **Question** ❓ Start by downloading the data from the Kaggle website [here](https://www.kaggle.com/mlg-ulb/creditcardfraud), then load it to create `X` and `y`.
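As a starting point, the loading step could look like the sketch below. It assumes the Kaggle CSV names its target column `Class` and that the file was saved locally as `creditcard.csv` (the file name is an assumption). The helper name `load_xy` is hypothetical, and the in-memory rows are made up so that the snippet runs without the download.

```python
import io

import pandas as pd


def load_xy(csv_source, target_col="Class"):
    """Read the transactions CSV and separate the features from the target."""
    df = pd.read_csv(csv_source)
    X = df.drop(columns=[target_col]).to_numpy()
    y = df[target_col].to_numpy()
    return X, y


# Tiny made-up stand-in for the real file, for illustration only.
sample_csv = io.StringIO(
    "Time,V1,Amount,Class\n"
    "0,-1.0,10.0,0\n"
    "1,0.5,2.5,0\n"
    "2,-2.0,0.0,1\n"
)
X_demo, y_demo = load_xy(sample_csv)
print(X_demo.shape, y_demo.tolist())
```

With the real dataset, the call would be `X, y = load_xy("creditcard.csv")`, adjusting the path to wherever the file was saved.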

## 1. Rebalancing classes

In [4]:
# Let's check class balance
pd.Series(y).value_counts()

☝️ In this fraud detection challenge, the classes are extremely imbalanced:
* 99.8 % of normal transactions
* 0.2 % of fraudulent transactions

We won't be able to detect frauds unless we apply some serious rebalancing strategies!
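To see why this imbalance matters, consider a trivial "model" that always predicts class 0. The sketch below uses simulated labels with the 0.2 % fraud rate quoted above (the sample size of 100,000 is an arbitrary choice for illustration) and shows that accuracy looks excellent while not a single fraud is caught.

```python
# Simulated labels with the proportions quoted above: 0.2 % fraud.
n_total = 100_000
n_fraud = 200  # 0.2 % of 100,000
y_true_sim = [1] * n_fraud + [0] * (n_total - n_fraud)

# A useless "model" that always predicts the majority class (0 = valid).
y_pred_sim = [0] * n_total

baseline_accuracy = sum(t == p for t, p in zip(y_true_sim, y_pred_sim)) / n_total
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true_sim, y_pred_sim))
baseline_recall = true_positives / n_fraud

print(f"accuracy = {baseline_accuracy:.3f}")  # 0.998
print(f"recall   = {baseline_recall:.3f}")    # 0.000
```

A network trained on the raw data can drift toward this same degenerate solution, which is one motivation for rebalancing the training set.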

❓ **Question** ❓
1. **First**, create three separate Train/Val/Test splits from your dataset. It is extremely important to keep the validation and test sets **not rebalanced** so as to evaluate your model in true conditions, without data leakage. Keep your test set for the very last cell of this notebook only.

2. **Second**, rebalance your training set (and only this one). You have many choices:

- Simply oversample the minority class randomly using plain numpy functions.
- Or use the [Synthetic Minority Oversampling Technique](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/) (SMOTE) to generate new datapoints by interpolating between existing minority-class points.
- In addition, you can also try [RandomUnderSampler](https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/) to slightly downsample the majority class.
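The two steps above can be sketched as follows, using the first option (random oversampling with plain numpy). This is one possible approach, not the expected solution. The data is synthetic, the split ratios (60/20/20) are arbitrary, and `random_oversample` is a hypothetical helper that assumes class 1 is the minority.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for (X, y): 2,000 rows with 2 % positives (made-up data).
X_all = rng.normal(size=(2000, 5))
y_all = np.zeros(2000, dtype=int)
y_all[:40] = 1

# Step 1: stratified Train/Val/Test splits, so each keeps the original class ratio.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
)


def random_oversample(X, y, rng):
    """Duplicate minority rows (class 1) at random until both classes have equal size."""
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    n_extra = len(majority_idx) - len(minority_idx)  # assumes class 1 is the minority
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    all_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
    rng.shuffle(all_idx)
    return X[all_idx], y[all_idx]


# Step 2: rebalance the training set only; validation and test stay untouched.
X_train_bal, y_train_bal = random_oversample(X_train, y_train, rng)

print("train before:", np.bincount(y_train))
print("train after: ", np.bincount(y_train_bal))
print("val:         ", np.bincount(y_val))
```

Splitting before oversampling matters: if duplicated rows were created first, copies of the same fraud could end up in both the training and validation sets.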

In [5]:
# YOUR CODE HERE

## 2. Neural Network iterations

Now that you have rebalanced your classes, try to fit a neural network to optimize your test score. Feel free to use the following hints:

- Normalize your inputs!
    - Preferably, use a [`Normalization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/Normalization) layer inside the model, so that your preprocessing is "pipelined" within the model itself.
    - Or use sklearn's [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) outside of your model: fit it on `X_train` only, then apply it to `X_train`, `X_val` and `X_test`.
- Make your model overfit, then regularize it using:
    - an Early Stopping criterion
    - [`Dropout`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout) layers
    - or [`regularizers`](https://www.tensorflow.org/api_docs/python/tf/keras/regularizers) applied to the weights of your layers
- 🚨 Think carefully about the metric you want to track and the loss you want to use!
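The hints above could be combined as in the following minimal sketch. Everything specific here is an assumption made for illustration: the layer sizes, dropout rate, L2 strength, patience, and the choice to monitor the area under the precision-recall curve (named `prc` below). The data is synthetic, and `build_model` is a hypothetical helper, not the expected solution.

```python
import numpy as np
import tensorflow as tf


def build_model(X_train):
    """Small binary classifier with normalization built into the model."""
    # The Normalization layer learns mean and variance from the training set only.
    normalizer = tf.keras.layers.Normalization()
    normalizer.adapt(X_train)

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(X_train.shape[1],)),
        normalizer,
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(
            16, activation="relu",
            kernel_regularizer=tf.keras.regularizers.l2(1e-4),
        ),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        # Accuracy is uninformative on imbalanced data, so track these instead.
        metrics=[
            tf.keras.metrics.Precision(name="precision"),
            tf.keras.metrics.Recall(name="recall"),
            tf.keras.metrics.AUC(curve="PR", name="prc"),
        ],
    )
    return model


# Synthetic stand-in data (made up): 400 rows, 5 features.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(400, 5)).astype("float32")
y_syn = (X_syn[:, 0] > 1.0).astype("float32")
X_tr, y_tr = X_syn[:300], y_syn[:300]
X_va, y_va = X_syn[300:], y_syn[300:]

model = build_model(X_tr)

# Stop when the validation PR-AUC stops improving and keep the best weights.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_prc", mode="max", patience=5, restore_best_weights=True
)
history = model.fit(
    X_tr, y_tr,
    validation_data=(X_va, y_va),
    epochs=3, batch_size=64, verbose=0,
    callbacks=[early_stopping],
)
proba_va = model.predict(X_va, verbose=0)
```

In the challenge itself, `X_tr` and `y_tr` would be the rebalanced training set, while `X_va` and `y_va` would be the untouched validation set.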


In [15]:
# YOUR CODE HERE

## 3. Score your model on unseen Test set

❓ **Question** ❓: Compute your confusion matrix and classification report on the test set
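A minimal sketch of this step with scikit-learn is shown below. The labels and probabilities are made up and stand in for `y_test` and the output of `model.predict(X_test)`. The 0.5 decision threshold is an assumption that you may want to tune.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels and predicted probabilities, for illustration only.
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_proba_demo = np.array([0.1, 0.2, 0.7, 0.05, 0.3, 0.4, 0.9, 0.8, 0.35, 0.6])

# Turn probabilities into class predictions (threshold of 0.5 is an assumption).
y_pred_demo = (y_proba_demo >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()
print(cm)

precision_demo = tp / (tp + fp)
recall_demo = tp / (tp + fn)
print(f"precision = {precision_demo:.2f}, recall = {recall_demo:.2f}")

print(classification_report(y_true_demo, y_pred_demo, target_names=["valid", "fraud"]))
```

Lowering the threshold typically raises recall at the cost of precision, so the right value depends on how costly a missed fraud is compared to a false alarm.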

### 🧪 Test your score

Store below your real test performance, measured on an (`X_test`, `y_test`) sample that is representative of the original unbalanced dataset.

In [32]:
precision = 0 # ??
recall = 0 # ??

In [39]:
from nbresult import ChallengeResult

result = ChallengeResult('solution',
    precision=precision,
    recall=recall,
    fraud_number=len(y_test[y_test == 1]),
    non_fraud_number=len(y_test[y_test == 0]),
)

result.write()
print(result.check())

## 🏁 Optional: Read Google's solution for this challenge
Congratulations on finishing all the challenges for this session!

To conclude, take some time to read Google's own solution directly [on Colab here](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/structured_data/imbalanced_data.ipynb). You will discover interesting techniques and best practices.
