# Neural Networks with Keras

Note: Please watch the video for this NoteBook, explaning how Keras works will help you build your own Neural Networks! 

Here we are using Keras library to implement Neural Networks. Keras is a deep learning API which runs on top of TensorFlow. You can find more information about Keras and TensorFlow in the following links:

https://keras.io/about/

https://www.tensorflow.org

Here, we would like to make a credit card fraud detection model. This was another Kaggle competition; the data is huge so you have to make an account and take the file from Kaggle ;) 

https://www.kaggle.com/mlg-ulb/creditcardfraud/. 

The dataset presents here is a two-day credit card transactions. Out of 284,807 transactions, only 492 of them are fraud. Thus the data is highly skewed (unbalanced dataset). All the features in this example are numerical, but for most of the feaures we dont know the names and only some alias of V1 to V28. However, we do know name of two features. There is a feature called "Time" and another which is called "Amount". Time presents the time elapsed (in unit of seconds) between two transactions and the "Amount" is the amount of each transaction. 


Note: Data is uploaded in OWL under the resources/ Unit 5 video files Folder

In [1]:
# import libraries that we need
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

- In the cell above we imported the libraries we needed from which we have already talked about pandas in details. Numpy might be a new library for you! NumPy is a scientific computing package of Python designed for fast operations on arrays (including mathematical and logical processes). Moreover, it has basic linear algebra and statistical operation helper functions. The well-written documentation of Numpy includes simple and usefull examples:
https://numpy.org/doc/stable/user/whatisnumpy.html

- Sklearn has a data preprocessing section too! https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing it is highly recommanded to take a look at all its options. 

- As mentioned earlier Keras is implemented on top of TensorFlow and can be easily be called from tensorflow. 



In [2]:
# read the data
data = pd.read_csv("creditcard.csv")

In [3]:
data

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


This is a Classification task, this means it is a supervised method (have features and targets). Here, the target is a binary (fraud or not fraud) label. We are going to make the target and features dataframe, and we are going to make a numpy array (instead of pandas data frame).

In [4]:
# make target and feature seperate from each other  
target = data["Class"]
features = data.drop(columns = ["Class"])
# change the data frame to be a numpy array
target = np.array(target)
features = np.array(features)

We should have a training set and a validation set (train the model and see how robust it is using the validation set). We assign 20% of data as validation and the rest for training. We used a sklearn helper function in the previous codes to split our data; here we show another method:


In [5]:
# find the length of 20% of the data
num_val_samples = int(len(features) * 0.2)

# split data into train and validation
train_features = features[:-num_val_samples]
train_target = target[:-num_val_samples]
val_features = features[-num_val_samples:]
val_target = target[-num_val_samples:]

# Imbalanced data: 

- Our data-set is highly skewed, we want to see in our data-set how many of the cased are positive (fraudulent) and how many are negative (law-full transactions). So, we can just look at our "train_target" array and count all 0s and 1s (numpy provides an easy way to do so).

- As, it is important to be able to detect as many positive (fraudulent) cases as possible it is a good practice to give the positive cases more weight! 


In [6]:
count_0 = len(np.where(train_target ==0)[0])
count_1 = len(np.where(train_target ==1)[0])

weight_for_0 = 1.0 / count_0
weight_for_1 = 1.0 / count_1



# Preprocessing of data:

It is a good practice to scale the data. For many algorithms the estimators might behave poorly if the features are not scaled. When the features have different ranges for example one feature varies between 0 and 1 and the other between 0 and 100000; the algorithm might put more weight (significance) on the larger numbers however, they are not more important! There are different methods to scale a data-set. One commen practice is to standardize features by removing the mean and scaling to unit variance. Python provides a comprehensive set of options:
https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py

In this notebook, we are goona use the StandardScaler method which is explained in the link below: 
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html


In [7]:
# Here we use StandardScaler
scaler = StandardScaler()
# We "fit" and "transform" our data as follows:
train_feature_normal = scaler.fit_transform(train_features)
val_features_normal = scaler.fit_transform(val_features)



# The Sequential Class:

We are using Sequential class within Keras. If you recall from Lesson 1, Neural Networks contain linear layers which are connected to each other via a nonlinear (activation) function.  Sequential groups a linear stack of layers and build a model. To build each layer, we can use Keras Layers API (Keras.layers) and we use a regular densly connected network (keras.layers.Dense) where we can define which activation function we want to use (exe: relu, tanh, and many more options) and we can define how many neurons each layer should have Input layer has the shape of the length of features; and the outout layer has the shape of output which for the binary classfication is 1. Layers between the input and output layers (the hidden layers) can have what ever number of neurons we choose (a hyperparameter choice). Between each layer we put anothe layer of Dropput. This step is not necessary (although recommended) and it randomly sets some input values to the new layer to 0; this helps preventing overfitting! 

Below is a set of useful links to learn more about Keras layers. 
- Sequential explained: https://keras.io/api/models/sequential/
- Layer explained: https://keras.io/api/layers/
- Dense Layer explained: https://keras.io/api/layers/core_layers/dense/
- Dropout Layer: https://keras.io/api/layers/regularization_layers/dropout/


In [8]:

model = keras.Sequential(
    [
        keras.layers.Dense(
            256, activation="relu", input_shape=(train_features.shape[-1],)
        ),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 256)               7936      
_________________________________________________________________
dense_1 (Dense)              (None, 256)               65792     
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 256)               65792     
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 257       
Total params: 139,777
Trainable params: 139,777
Non-trainable params: 0
__________________________________________________

# Let the training begin!

- This is a classification task so to evaluate the perfomance we should take a look at False Negatives, False Positives, True Negatives, True Positives, Precision and Recall. So, we can use Keras.metric to calculate the above metrics! 

- After building the model; we need to train the model which is simple using Keras! model.compile and model.fit are the only things we need to use! In model.compile we choose what loss function we want to use, which optimization method and ...

- model.fit will have many features: train features and labels, batch-size, epoch and many more! 
- We made sure to put the class weight as a parameter (remember we want to put more weight to the posotive cases as they are rare but critical to catch!) 



In [9]:
metrics = [
    keras.metrics.FalseNegatives(name="fn"),
    keras.metrics.FalsePositives(name="fp"),
    keras.metrics.TrueNegatives(name="tn"),
    keras.metrics.TruePositives(name="tp"),
    keras.metrics.Precision(name="precision"),
    keras.metrics.Recall(name="recall"),
]

model.compile(
    optimizer=keras.optimizers.Adam(1e-2), loss="binary_crossentropy", metrics=metrics
)

class_weight = {0: weight_for_0, 1: weight_for_1}

model.fit(
    train_feature_normal,
    train_target,
    batch_size=2048,
    epochs=30,
    verbose=2,
    class_weight=class_weight,
    validation_data=(val_features_normal, val_target)
)

Epoch 1/30
112/112 - 15s - loss: 2.1961e-06 - fn: 45.0000 - fp: 27388.0000 - tn: 200041.0000 - tp: 372.0000 - precision: 0.0134 - recall: 0.8921 - val_loss: 0.1471 - val_fn: 6.0000 - val_fp: 2588.0000 - val_tn: 54298.0000 - val_tp: 69.0000 - val_precision: 0.0260 - val_recall: 0.9200
Epoch 2/30
112/112 - 9s - loss: 1.3889e-06 - fn: 32.0000 - fp: 6687.0000 - tn: 220742.0000 - tp: 385.0000 - precision: 0.0544 - recall: 0.9233 - val_loss: 0.1923 - val_fn: 6.0000 - val_fp: 3782.0000 - val_tn: 53104.0000 - val_tp: 69.0000 - val_precision: 0.0179 - val_recall: 0.9200
Epoch 3/30
112/112 - 8s - loss: 1.1753e-06 - fn: 28.0000 - fp: 7480.0000 - tn: 219949.0000 - tp: 389.0000 - precision: 0.0494 - recall: 0.9329 - val_loss: 0.2308 - val_fn: 3.0000 - val_fp: 5154.0000 - val_tn: 51732.0000 - val_tp: 72.0000 - val_precision: 0.0138 - val_recall: 0.9600
Epoch 4/30
112/112 - 8s - loss: 9.8716e-07 - fn: 24.0000 - fp: 6267.0000 - tn: 221162.0000 - tp: 393.0000 - precision: 0.0590 - recall: 0.9424 - val_

Epoch 30/30
112/112 - 6s - loss: 1.6606e-07 - fn: 1.0000 - fp: 2201.0000 - tn: 225228.0000 - tp: 416.0000 - precision: 0.1590 - recall: 0.9976 - val_loss: 0.0382 - val_fn: 10.0000 - val_fp: 735.0000 - val_tn: 56151.0000 - val_tp: 65.0000 - val_precision: 0.0812 - val_recall: 0.8667


<tensorflow.python.keras.callbacks.History at 0x7f9fd3afbf40>

# Conclusion 

Looking at the training result (focusing on validation part): 

- FP: 735
- FN: 10
- TP: 65
- TN: 56151

This means that we correctly identified 65 numbers of fraudulant activities and we missed 10 of them. This cost us flagging 735 of legitimate activities as fraudulant which is a price we pay! Knowing that False Negatives are more important (wrongfully classify a fraudulant transaction as a legitimate one) this result is quite satisfactory. And, in the real world, they even put more weight on the positive cases which causes even more False Negatives!  


# More Stuff to Read (Optional):

- Binary Cross Entropy: loss function for binary classification: 
https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a

- Adam optimization: original paper: https://arxiv.org/abs/1412.6980

- Adam optimization: nice video explaning it: https://www.youtube.com/watch?v=JXQT_vxqwIs
