# A simple demonstration of an Artificial Neural Network applied to KDDCUP data

### Intro

The dataset can be found at <a href="http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html">this address</a>.

The objective is to determine whether the traffic is normal or not (last column of the dataset).

In [1]:
#We start with the imports.

import pandas as pd #Used to read the dataset file
import numpy as np #Used for quick manipulation of the columns

from tqdm.notebook import tqdm #Displays a progress bar while looping.
from keras_tqdm import TQDMNotebookCallback #Progress-bar callback for Keras (not used below).
import urllib.request #Used to download the list of feature names.

from tensorflow import keras #Used for our neural network.
from tensorflow.keras import layers #Defines what is inside the neural network.

from sklearn.model_selection import train_test_split #Used to split the data into training and testing sets.
from sklearn.metrics import confusion_matrix #Used to evaluate the neural net.

Using TensorFlow backend.


In [2]:
#We start by loading the dataset we will work on. I work here on the 10% dataset.
df = pd.read_csv("kddcup.data_10_percent_corrected", header = None)
df.head() #We display the first 5 rows.

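| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | http | SF | 181 | 5450 | 0 | 0 | 0 | 0 | ... | 9 | 1.0 | 0.0 | 0.11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 1 | 0 | tcp | http | SF | 239 | 486 | 0 | 0 | 0 | 0 | ... | 19 | 1.0 | 0.0 | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 2 | 0 | tcp | http | SF | 235 | 1337 | 0 | 0 | 0 | 0 | ... | 29 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 3 | 0 | tcp | http | SF | 219 | 1337 | 0 | 0 | 0 | 0 | ... | 39 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |
| 4 | 0 | tcp | http | SF | 217 | 2032 | 0 | 0 | 0 | 0 | ... | 49 | 1.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal. |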


## Data Processing

### Extracting labels and formatting them
Neural networks do not understand inputs like "normal", "http", or "udp". We need to work with numerical values. The purpose of the data processing is to reshape the data so that it can be fed to the neural network.

We will use a feedforward neural network with a single output neuron, which fires 1 if the traffic is normal and 0 if not.

In [3]:
#We separate the expected output of the network (the answer we look for, which we will call a "label")
Y = df[41] #our labels.

#Next, we take the inputs: the information we will pass through the network to predict whether the traffic is normal or not.
X = df.drop(columns=[41])

#Next, we convert the labels from the string "normal." to 1, and anything else to 0.
Y = np.where(Y.values == "normal.", 1, 0) #.values replaces get_values(), which was removed in pandas 1.0

#Next, we check that the dataset contains different types of outputs (i.e. not only "normal." elements).
if np.sum(Y) != Y.shape[0]:
    print("Different objects confirms")

Different objects confirms


### Data Cleaning

If you look at http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names, which lists the features, you will see that there are two types of features.
- Continuous: Numeric values, but not necessarily between 0 and 1 (the range we prefer for training the neural network).
- Symbolic: Each value stands for a distinct category. There may be more than 2 categories, so we cannot use a single input firing values between 0 and 1 to represent this concept. Instead, we will use a <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html">one-hot encoding with the get_dummies() function from pandas</a>.
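
To make the two treatments concrete, here is a minimal sketch on a toy table. The column names and values are made up for illustration and are not taken from the KDD file; the continuous column is min-max scaled and the symbolic column is one-hot encoded.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the KDD file).
toy = pd.DataFrame({
    "src_bytes": [100, 300, 500],       # continuous feature
    "protocol": ["tcp", "udp", "tcp"],  # symbolic feature
})

# Min-max scaling maps the continuous column onto [0, 1].
col = toy["src_bytes"]
toy["src_bytes"] = (col - col.min()) / (col.max() - col.min())

# One-hot encoding creates one 0/1 column per category.
one_hot = pd.get_dummies(toy["protocol"], dtype=int)
toy = pd.concat([toy.drop(columns=["protocol"]), one_hot], axis=1)

print(toy)
```

Each row ends up with exactly one 1 among the `tcp` and `udp` columns, which is what "one-hot" refers to.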


In [4]:
#First : we collect the list of continuous features.
bases = urllib.request.urlopen("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names").read(10000).decode("utf-8") 
bases = bases.split("\n")[1:-1]
cols_continuous = []

for bi, bcontent in enumerate(bases):
    if "continuous" in bcontent:
        cols_continuous.append(bi)

#We now have the list. We are about to convert the inputs to the right format.

#Lists the columns to remove
#(those containing symbolic elements, because they will be replaced by new one-hot columns)
col_to_rem = []

#Since we modify the dataframe, we keep a copy of the original column list so that we do not iterate over our newly created columns.
copy_columns = X.columns

#We go through the columns, one by one.
for col in tqdm(copy_columns):
    if col in cols_continuous: #This is continuous data. We rescale it to the range [0;1]
        X[col] = (X[col]-X[col].min())/(X[col].max() - X[col].min()) #Rescaling through "Min Max Scaling"
    else: #Symbolic data.
        col_to_rem.append(col) #We will remove this column
        X = pd.concat([X, pd.get_dummies(X[col])], axis=1) #And we add the one-hot encoded version.

print("Removing columns:")
print(col_to_rem)


Removing columns:
[1, 2, 3, 6, 11, 20, 21]


### Unnecessary data and NaN

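We now have the columns to remove in the col_to_rem variable.
However, our dataset now contains NaN values, which mark missing or undefined information. They can appear during min-max scaling: for a column whose value never changes, max - min is 0 and the division 0/0 is undefined.
We replace all NaN values with 0.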
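
One way NaN values can arise in a pipeline like this one is min-max scaling of a constant column. The sketch below uses a hypothetical constant series (not a column read from the dataset) to show the effect and the `fillna(0)` repair.

```python
import pandas as pd

# Hypothetical column whose value never changes.
constant = pd.Series([0.0, 0.0, 0.0])

# max - min is 0, so every entry becomes 0/0, which pandas stores as NaN.
scaled = (constant - constant.min()) / (constant.max() - constant.min())
print(scaled.isna().all())

# Replacing NaN with 0 keeps the column usable (it simply carries no information).
cleaned = scaled.fillna(0)
print(cleaned.tolist())
```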

In [5]:
X = X.drop(columns=col_to_rem) #Cleaning columns
X = X.fillna(0) #Cleaning missing values.
X.shape

(494021, 118)

Our dataset contains 494021 samples of 118 dimensions.
Each dimension corresponds to one piece of information we have about each sample.

We now know that our neural network will have 118 dimensions as input.
We also know that it will have 1 dimension as output.

### Creating the Neural Network

I chose to use a very small neural network of 3 layers: the input layer, the output layer, and a layer in between that we call the "hidden layer".

<img src="img/neural_network.png">

I chose to put 16 neurons in the hidden layer. 
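
To illustrate how data flows through such a 118 → 16 → 1 network, here is a rough NumPy sketch of a forward pass. The weights are random placeholders rather than trained values, the input batch is fake, and the sketch leaves out batch normalization, so it only shows the shapes and activations involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialised weights: placeholders, not trained values.
W1, b1 = rng.normal(scale=0.1, size=(118, 16)), np.zeros(16)  # input -> hidden
W2, b2 = rng.normal(scale=0.1, size=(16, 1)), np.zeros(1)     # hidden -> output

def forward(x):
    """Forward pass: a 16-neuron ReLU layer followed by a 1-neuron sigmoid layer."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU activation
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid squashes the output between 0 and 1

batch = rng.random(size=(5, 118))  # 5 fake samples with features in [0, 1]
probs = forward(batch)
print(probs.shape)  # one prediction per sample
```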

We could add more layers and more neurons, or use even fewer. The more layers you add, the more complexity the network can capture. This approach, however, tends to require more computation time and more data. It is also more prone to overfitting.

As a rule of thumb, the smallest network that solves the problem is the best choice.<br><br><br>

<a href="https://www.youtube.com/watch?v=woa34ugDSwY"><img src="img/meme1.jpg">
    <center>https://www.youtube.com/watch?v=woa34ugDSwY</center></a>

In [6]:
#We define our input layer.
inputs = keras.Input(shape=(118,))

#We connect 16 neurons to this input. The data will flow to these neurons.
x = layers.Dense(16, activation="relu")(inputs)

#We add a BatchNormalization layer.
#It normalizes the activations of each batch (roughly zero mean and unit variance), which makes training more stable.
x = layers.BatchNormalization()(x)

#Next, we create an output layer which will give the prediction of the neural network.
out = layers.Dense(1, activation="sigmoid")(x)

#The model is now ready to be created.
model = keras.Model(inputs=inputs, outputs=out, name="simple_ann")

#We display the structure
model.summary()



Model: "simple_ann"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 118)]             0         
_________________________________________________________________
dense (Dense)                (None, 16)                1904      
_________________________________________________________________
batch_normalization (BatchNo (None, 16)                64        
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 1,985
Trainable params: 1,953
Non-trainable params: 32
_________________________________________________________________


The model contains around 2000 parameters, which is a small number. This will help the network train faster. If the model does not reach a good enough accuracy, we will try to increase the number of neurons/layers.

(Do not hesitate to compare the "Output Shape" column with the diagram of the network above!)
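
The counts in the summary can be reproduced by hand. The helper functions below are hypothetical names written for this check: a dense layer has one weight per input/output pair plus one bias per output, and a batch normalization layer keeps four values per feature (two trainable, two running statistics).

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output neuron."""
    return n_in * n_out + n_out

def batch_norm_params(n_features):
    """gamma and beta (trainable) plus moving mean and moving variance (not trainable)."""
    return 4 * n_features

hidden = dense_params(118, 16)
bn = batch_norm_params(16)
output = dense_params(16, 1)

total = hidden + bn + output
non_trainable = 2 * 16            # moving mean and moving variance
trainable = total - non_trainable

print(hidden, bn, output, total)  # 1904 64 17 1985
print(trainable, non_trainable)   # 1953 32
```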

In [7]:
#We now "compile" the network, defining how we want to train it.

#Binary crossentropy defines the loss. It measures how wrong the network's predictions are (and thus, how to converge to the solution).
model.compile(loss=keras.losses.binary_crossentropy,
    optimizer=keras.optimizers.SGD(), #The strategy used to reduce the error. Here, SGD (stochastic gradient descent).
    metrics=["accuracy"]) #We display the accuracy of the neural network.



### Splitting the dataset : A system to detect overfitting
When we train the network, we show it samples `S`.
The network behaves as a function `N`, generating a prediction `P` for each sample.

We can write `N(S) = P`

We know what the answer `A` should be (this is the label we extracted into the `Y` variable!).
To know how much the network is mistaken, we calculate the error `L`, which we call the loss.

There are many different losses. I chose binary_crossentropy, but to give you a simpler one for description purposes, imagine that we have `L = P - A`.
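
For comparison with the simplified loss, here is a small sketch of binary cross-entropy for a single sample. The function name and the clipping constant `eps` are choices made for this illustration, not the exact Keras implementation.

```python
import math

def binary_crossentropy(answer, prediction, eps=1e-7):
    """Loss for one sample: answer is 0 or 1, prediction is a probability."""
    p = min(max(prediction, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(answer * math.log(p) + (1 - answer) * math.log(1.0 - p))

loss_right = binary_crossentropy(1, 0.9)  # confident and right: small loss
loss_wrong = binary_crossentropy(1, 0.1)  # confident and wrong: large loss
print(loss_right, loss_wrong)
```

The loss grows quickly when the network is confidently wrong, which is what pushes the predictions toward the answers during training.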

The purpose of training is to reduce `L` to its minimum, which reduces the difference between our prediction `P` and the answer `A`.
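
As a rough illustration of how a loss gets reduced step by step, the sketch below trains a single sigmoid neuron with mini-batch gradient descent on a made-up one-feature problem. The data, learning rate, and batch size are assumptions chosen for the example; this is not the Keras optimizer itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (made up for illustration): the answer is 1 when the feature exceeds 0.5.
S = rng.random(200)
A = (S > 0.5).astype(float)

w, b, lr = 0.0, 0.0, 0.5  # one weight, one bias, learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mean_loss(w, b):
    """Average binary cross-entropy over the whole toy dataset."""
    P = np.clip(sigmoid(w * S + b), 1e-7, 1 - 1e-7)
    return float(np.mean(-(A * np.log(P) + (1 - A) * np.log(1 - P))))

loss_before = mean_loss(w, b)
for epoch in range(100):
    order = rng.permutation(len(S))          # shuffle the samples each epoch
    for start in range(0, len(S), 20):       # mini-batches of 20 samples
        idx = order[start:start + 20]
        grad = sigmoid(w * S[idx] + b) - A[idx]  # gradient of the loss w.r.t. the pre-activation
        w -= lr * np.mean(grad * S[idx])
        b -= lr * np.mean(grad)
loss_after = mean_loss(w, b)

print(loss_before, loss_after)
```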

However, if we were to show it <b>all</b> the samples, the network might just pick up statistical anomalies to make its decisions. In a way, you could compare it to a student learning by heart the answers to all the questions of a test. That can work for some subjects like history... but for math, imagine a student who learned this way, without understanding the mathematical objects themselves. As soon as you change the values in the equations, the student will not be able to solve the given problems.

A similar problem may appear here, where our network learns all the questions by heart. This is what we call <b>overfitting</b>. And overfitting is bad. To check whether we are learning from the dataset or simply memorizing it (in which case the neural network is of no use...), we set aside a part of our samples (20%) that the model will not train on.

If our network can solve the 20% it has never seen, it means it has learned to solve the problem. Of course, this held-out part should also contain a balanced representation of the different outputs. If the model only returns "this is a normal connection" and you only test on normal connections, you will see good results... and you will not notice that your model is bad.
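
One optional way to keep the class ratio identical in both parts is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels to show the effect; it is a possible variation, not what this notebook does in its own split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 80% class 0, 20% class 1.
labels = np.array([0] * 800 + [1] * 200)
features = np.arange(len(labels)).reshape(-1, 1)  # dummy feature column

# stratify=labels asks scikit-learn to keep the class ratio in both parts.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=1234, stratify=labels
)

print(l_train.mean(), l_test.mean())  # share of class 1 in each part
```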

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1234)

In [9]:
#We check the class balance of the test set: comparing y_test with itself puts the class counts on the diagonal.
print(confusion_matrix(y_test, y_test))

[[79496     0]
 [    0 19309]]


We see that the "normal" and "abnormal" classes do not have the same number of samples (19309 normal against 79496 abnormal). However, we have enough samples of each to check whether the network has learned or not.
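
The imbalance also gives a useful reference point. As a small sketch using the two counts from the matrix above, we can compute the accuracy a model would reach by always answering "abnormal"; the variable names are chosen for this illustration.

```python
# Class counts read from the diagonal of the matrix above.
n_abnormal = 79496  # label 0
n_normal = 19309    # label 1
total = n_abnormal + n_normal

# A model that always answers "abnormal" would already be right this often.
baseline_accuracy = n_abnormal / total
print(f"{total} test samples, majority-class baseline: {baseline_accuracy:.4f}")
```

Any accuracy reported later is only meaningful if it is clearly above this baseline.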

### Training the network
We are now ready to train the network. We could use the testing set we have made to check that the model does not overfit while we train it... but I decided to use a validation dataset instead. This is 20% of the 80% we have in the training dataset. Its results will be displayed as we train.

Afterwards, we will be able to see how we perform on the testing dataset.
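
The sizes of the three parts can be worked out from the 494021 samples. This sketch rests on two assumptions about rounding: that the test part is rounded up, as scikit-learn's `train_test_split` does, and that Keras keeps the first `int(n * (1 - validation_split))` rows for training.

```python
import math

n_samples = 494021

# Assumption: the test part is rounded up.
n_test = math.ceil(n_samples * 0.2)
n_train_full = n_samples - n_test

# Assumption: the training part is rounded down, the rest goes to validation.
n_train = int(n_train_full * (1 - 0.2))
n_val = n_train_full - n_train

print(n_test, n_train, n_val)  # 98805 316172 79044
```

These figures agree with the test-set counts shown earlier and with the sample counts Keras prints when training starts.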

In [10]:
model.fit(X_train, y_train, 
          batch_size=128, #We use 128 samples at a time to compute the loss.
          epochs=10, #And we go through the dataset 10 times.
          validation_split = 0.2) #The 20% we take from the training set to form the validation set
#(The network is not trained on the validation set. It is trained only on the remaining training samples.)

Train on 316172 samples, validate on 79044 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10



Our model is trained, and Keras tells us that we have more than 99.88% accuracy on the validation set (which it never trained on!).

Let's test it with our testing set.

In [11]:
y_pred = model.predict(X_test) #Probabilities between 0 and 1.
y_pred = np.round(y_pred) #Rounded to the nearest class: 0 (abnormal) or 1 (normal).

print("confusion matrix ANN:")
print(confusion_matrix(y_test, y_pred))

confusion matrix ANN:
[[79418    78]
 [   37 19272]]


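We see a small number of false positives and false negatives: 78 abnormal connections were predicted as normal and 37 normal connections were predicted as abnormal, out of 98805 test samples. It seems our model has indeed correctly classified the dataset.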
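
To put numbers on this, the usual metrics can be derived from the four cells of the matrix. The sketch below copies the values printed above and treats "normal" (label 1) as the positive class, which is a convention chosen for this example.

```python
# Values copied from the confusion matrix above (rows: true class, columns: predicted class).
tn, fp = 79418, 78   # true class 0 (abnormal)
fn, tp = 37, 19272   # true class 1 (normal)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the connections predicted normal, how many really are
recall = tp / (tp + fn)     # of the truly normal connections, how many were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```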