<a href="https://colab.research.google.com/github/dcpatton/Structured-Data/blob/main/kdd_cup_1999.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objective

Below is an exploration of the KDD Cup 1999 computer network intrusion detection dataset (https://www.kdd.org/kdd-cup/view/kdd-cup-1999). I first treat it as a multiclass classification problem with 23 outcome classes (22 attack types plus normal traffic), and then as a binary classification problem (normal versus abnormal network use). The dataset has a large class imbalance, which is less severe in the binary case.

In [1]:
import tensorflow as tf
import pandas as pd
import random

import warnings
warnings.filterwarnings("ignore")
tf.get_logger().setLevel('ERROR')

seed = 52
random.seed(seed)
tf.random.set_seed(seed)

tf.__version__

'2.3.0'

In [2]:
from google.colab import files
uploaded = files.upload()

Saving kaggle.json to kaggle.json


In [3]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json

In [4]:
# !kaggle datasets list -s KDD
# !kaggle datasets download -d galaxyh/kdd-cup-1999-data -f corrected.gz
!kaggle datasets download -d galaxyh/kdd-cup-1999-data

Downloading kdd-cup-1999-data.zip to /content
 92% 81.0M/87.8M [00:01<00:00, 47.4MB/s]
100% 87.8M/87.8M [00:01<00:00, 69.9MB/s]


In [5]:
!unzip kdd-cup-1999-data.zip

Archive:  kdd-cup-1999-data.zip
  inflating: corrected.gz            
  inflating: corrected/corrected     
  inflating: kddcup.data.corrected   
  inflating: kddcup.data.gz          
  inflating: kddcup.data/kddcup.data  
  inflating: kddcup.data_10_percent.gz  
  inflating: kddcup.data_10_percent/kddcup.data_10_percent  
  inflating: kddcup.data_10_percent_corrected  
  inflating: kddcup.names            
  inflating: kddcup.newtestdata_10_percent_unlabeled.gz  
  inflating: kddcup.newtestdata_10_percent_unlabeled/kddcup.newtestdata_10_percent_unlabeled  
  inflating: kddcup.testdata.unlabeled.gz  
  inflating: kddcup.testdata.unlabeled/kddcup.testdata.unlabeled  
  inflating: kddcup.testdata.unlabeled_10_percent.gz  
  inflating: kddcup.testdata.unlabeled_10_percent/kddcup.testdata.unlabeled_10_percent  
  inflating: training_attack_types   
  inflating: typo-correction.txt     


In [6]:
data_df = pd.read_csv('kddcup.data.corrected', header=None)
data_df.columns = [
    'duration',
    'protocol_type',
    'service',
    'flag',
    'src_bytes',
    'dst_bytes',
    'land',
    'wrong_fragment',
    'urgent',
    'hot',
    'num_failed_logins',
    'logged_in',
    'num_compromised',
    'root_shell',
    'su_attempted',
    'num_root',
    'num_file_creations',
    'num_shells',
    'num_access_files',
    'num_outbound_cmds',
    'is_host_login',
    'is_guest_login',
    'count',
    'srv_count',
    'serror_rate',
    'srv_serror_rate',
    'rerror_rate',
    'srv_rerror_rate',
    'same_srv_rate',
    'diff_srv_rate',
    'srv_diff_host_rate',
    'dst_host_count',
    'dst_host_srv_count',
    'dst_host_same_srv_rate',
    'dst_host_diff_srv_rate',
    'dst_host_same_src_port_rate',
    'dst_host_srv_diff_host_rate',
    'dst_host_serror_rate',
    'dst_host_srv_serror_rate',
    'dst_host_rerror_rate',
    'dst_host_srv_rerror_rate',
    'outcome'
]

In [7]:
data_df.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,outcome
0,0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal.
1,0,tcp,http,SF,162,4528,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1,1,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,normal.
2,0,tcp,http,SF,236,1228,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2,2,1.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,normal.
3,0,tcp,http,SF,233,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,3,3,1.0,0.0,0.33,0.0,0.0,0.0,0.0,0.0,normal.
4,0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3,3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,4,4,1.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,normal.


In [8]:
data_df.outcome.value_counts()

smurf.              2807886
neptune.            1072017
normal.              972781
satan.                15892
ipsweep.              12481
portsweep.            10413
nmap.                  2316
back.                  2203
warezclient.           1020
teardrop.               979
pod.                    264
guess_passwd.            53
buffer_overflow.         30
land.                    21
warezmaster.             20
imap.                    12
rootkit.                 10
loadmodule.               9
ftp_write.                8
multihop.                 7
phf.                      4
perl.                     3
spy.                      2
Name: outcome, dtype: int64
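
To put a number on the imbalance, the counts above can be summarised directly. This is a minimal sketch that copies the counts from the output above into a plain dictionary (rather than reading them from `data_df`) so that it runs on its own; the 100-record cut-off for "rare" is an arbitrary choice for illustration.

```python
# Record counts copied from the value_counts() output above.
counts = {
    'smurf.': 2807886, 'neptune.': 1072017, 'normal.': 972781,
    'satan.': 15892, 'ipsweep.': 12481, 'portsweep.': 10413,
    'nmap.': 2316, 'back.': 2203, 'warezclient.': 1020,
    'teardrop.': 979, 'pod.': 264, 'guess_passwd.': 53,
    'buffer_overflow.': 30, 'land.': 21, 'warezmaster.': 20,
    'imap.': 12, 'rootkit.': 10, 'loadmodule.': 9, 'ftp_write.': 8,
    'multihop.': 7, 'phf.': 4, 'perl.': 3, 'spy.': 2,
}

total = sum(counts.values())

# Fraction of all records that belong to the three largest classes.
top3_share = sum(sorted(counts.values(), reverse=True)[:3]) / total

# Classes with fewer than 100 records (arbitrary threshold for "rare").
rare = [name for name, n in counts.items() if n < 100]

print(f"total records: {total}")
print(f"share of the three largest classes: {top3_share:.4f}")
print(f"classes with fewer than 100 records: {len(rare)}")
```

With the notebook's DataFrame in memory, `data_df.outcome.value_counts(normalize=True)` gives the per-class shares directly.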

In [9]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
# LabelEncoder expects a 1-D array of labels, so pass the column as a Series.
data_df['outcome'] = label_encoder.fit_transform(data_df['outcome'])

In [10]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898431 entries, 0 to 4898430
Data columns (total 42 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   duration                     int64  
 1   protocol_type                object 
 2   service                      object 
 3   flag                         object 
 4   src_bytes                    int64  
 5   dst_bytes                    int64  
 6   land                         int64  
 7   wrong_fragment               int64  
 8   urgent                       int64  
 9   hot                          int64  
 10  num_failed_logins            int64  
 11  logged_in                    int64  
 12  num_compromised              int64  
 13  root_shell                   int64  
 14  su_attempted                 int64  
 15  num_root                     int64  
 16  num_file_creations           int64  
 17  num_shells                   int64  
 18  num_access_files             int64  
 19  

In [11]:
assert 1 == len(data_df.num_outbound_cmds.unique())  # only one unique value, so drop it
data_df.drop('num_outbound_cmds', axis='columns', inplace=True)

In [12]:
data_df.isna().sum()

duration                       0
protocol_type                  0
service                        0
flag                           0
src_bytes                      0
dst_bytes                      0
land                           0
wrong_fragment                 0
urgent                         0
hot                            0
num_failed_logins              0
logged_in                      0
num_compromised                0
root_shell                     0
su_attempted                   0
num_root                       0
num_file_creations             0
num_shells                     0
num_access_files               0
is_host_login                  0
is_guest_login                 0
count                          0
srv_count                      0
serror_rate                    0
srv_serror_rate                0
rerror_rate                    0
srv_rerror_rate                0
same_srv_rate                  0
diff_srv_rate                  0
srv_diff_host_rate             0
dst_host_c

In [13]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(data_df, test_size=0.2, random_state=seed, stratify=data_df['outcome'])
print(train_df.shape)
print(test_df.shape)

(3918744, 41)
(979687, 41)


In [14]:
train_df.outcome.value_counts()

18    2246308
9      857614
11     778225
17      12714
5        9985
15       8330
10       1853
0        1762
21        816
20        783
14        211
3          42
1          24
6          17
22         16
4          10
16          8
7           7
8           6
2           6
13          3
12          2
19          2
Name: outcome, dtype: int64

In [15]:
test_df.outcome.value_counts()

18    561578
9     214403
11    194556
17      3178
5       2496
15      2083
10       463
0        441
21       204
20       196
14        53
3         11
1          6
6          4
22         4
16         2
7          2
4          2
2          2
13         1
12         1
8          1
Name: outcome, dtype: int64

In [16]:
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('outcome')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=1024)
  ds = ds.batch(batch_size).prefetch(tf.data.experimental.AUTOTUNE)
  return ds

In [17]:
for column in data_df.columns:
  print(column + ': ' + str(data_df[column].nunique()))

duration: 9883
protocol_type: 3
service: 70
flag: 11
src_bytes: 7195
dst_bytes: 21493
land: 2
wrong_fragment: 3
urgent: 6
hot: 30
num_failed_logins: 6
logged_in: 2
num_compromised: 98
root_shell: 2
su_attempted: 3
num_root: 93
num_file_creations: 42
num_shells: 3
num_access_files: 10
is_host_login: 2
is_guest_login: 2
count: 512
srv_count: 512
serror_rate: 96
srv_serror_rate: 87
rerror_rate: 89
srv_rerror_rate: 76
same_srv_rate: 101
diff_srv_rate: 95
srv_diff_host_rate: 72
dst_host_count: 256
dst_host_srv_count: 256
dst_host_same_srv_rate: 101
dst_host_diff_srv_rate: 101
dst_host_same_src_port_rate: 101
dst_host_srv_diff_host_rate: 76
dst_host_serror_rate: 101
dst_host_srv_serror_rate: 100
dst_host_rerror_rate: 101
dst_host_srv_rerror_rate: 101
outcome: 23


In [18]:
# Categorical columns (number of unique values, from the previous cell):
#   protocol_type (3), service (70), flag (11), land (2), logged_in (2),
#   root_shell (2), su_attempted (3), is_host_login (2), is_guest_login (2)

from tensorflow import feature_column

feature_columns = []

# numeric cols
for column in ['duration','src_bytes','dst_bytes','wrong_fragment','urgent','hot',
               'num_failed_logins','num_compromised','num_root','num_file_creations',
               'num_shells','num_access_files','count','srv_count','serror_rate',
               'srv_serror_rate','rerror_rate','srv_rerror_rate','same_srv_rate',
               'diff_srv_rate','srv_diff_host_rate','dst_host_count','dst_host_srv_count',
               'dst_host_same_srv_rate','dst_host_diff_srv_rate','dst_host_same_src_port_rate',
               'dst_host_srv_diff_host_rate','dst_host_serror_rate','dst_host_srv_serror_rate',
               'dst_host_rerror_rate','dst_host_srv_rerror_rate']:
  feature_columns.append(feature_column.numeric_column(column))

# indicator_columns
indicator_column_names = ['protocol_type', 'service', 'flag', 'land', 'logged_in', 
                          'root_shell', 'su_attempted', 'is_host_login', 'is_guest_login']
for col_name in indicator_column_names:
  categorical_column = feature_column.categorical_column_with_vocabulary_list(
      col_name, data_df[col_name].unique())
  indicator_column = feature_column.indicator_column(categorical_column)
  feature_columns.append(indicator_column)

# embedding columns
# diagnosis = feature_column.categorical_column_with_vocabulary_list(
#       'diagnosis', data_df.diagnosis.unique())
# diagnosis_embedding = feature_column.embedding_column(diagnosis, dimension=16)
# feature_columns.append(diagnosis_embedding)

In [19]:
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

In [20]:
batch_size = 128
train_ds = df_to_dataset(train_df, batch_size=batch_size)
test_ds = df_to_dataset(test_df, shuffle=False, batch_size=batch_size)

In [21]:
from tensorflow.keras.layers import Dense

def create_model():
  tf.keras.backend.clear_session()
  model = tf.keras.Sequential([
    feature_layer,
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(23, activation='softmax')
  ])

  model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
  return model

model = create_model()

In [22]:
filepath = 'model.h5'

mc = tf.keras.callbacks.ModelCheckpoint(filepath, monitor='val_loss', save_best_only=True, 
                                        save_weights_only=True, mode='auto')

es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, verbose=1, mode='auto')

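# Note: the test split doubles as the validation set here, so the checkpoint is
# selected on the same data that the later evaluation cells report on.
history = model.fit(train_ds, epochs=200, validation_data=test_ds, callbacks=[mc, es],
                    verbose=2)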

Epoch 1/200
30616/30616 - 201s - loss: 147.9262 - acc: 0.9959 - val_loss: 44.8872 - val_acc: 0.9983
Epoch 2/200
30616/30616 - 208s - loss: 0.1981 - acc: 0.9983 - val_loss: 0.0115 - val_acc: 0.9984
Epoch 3/200
30616/30616 - 211s - loss: 0.0182 - acc: 0.9984 - val_loss: 0.0124 - val_acc: 0.9983
Epoch 4/200
30616/30616 - 220s - loss: 0.8055 - acc: 0.9986 - val_loss: 0.0091 - val_acc: 0.9987
Epoch 5/200
30616/30616 - 212s - loss: 11.8164 - acc: 0.9986 - val_loss: 0.0124 - val_acc: 0.9985
Epoch 6/200
30616/30616 - 205s - loss: 0.1578 - acc: 0.9986 - val_loss: 72.1837 - val_acc: 0.9987
Epoch 7/200
30616/30616 - 202s - loss: 0.3010 - acc: 0.9986 - val_loss: 0.0089 - val_acc: 0.9987
Epoch 8/200
30616/30616 - 205s - loss: 1.5822 - acc: 0.9986 - val_loss: 0.0110 - val_acc: 0.9986
Epoch 9/200
30616/30616 - 209s - loss: 0.0116 - acc: 0.9987 - val_loss: 0.0101 - val_acc: 0.9987
Epoch 10/200
30616/30616 - 216s - loss: 0.8230 - acc: 0.8770 - val_loss: 241.5107 - val_acc: 0.9699
Epoch 11/200
30616/306

In [23]:
model.load_weights('model.h5')
model.evaluate(test_ds)



[0.008916427381336689, 0.9987149238586426]

In [24]:
import numpy as np
y_preds = model.predict(test_ds, verbose=1)
print(y_preds.shape)
y_preds = np.argmax(y_preds, axis=1)
from sklearn.metrics import classification_report
y_true = test_df['outcome']
# print(classification_report(y_true, y_preds, target_names=label_encoder.classes_))
print(classification_report(y_true, y_preds))

(979687, 23)
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       441
           1       0.00      0.00      0.00         6
           2       0.00      0.00      0.00         2
           3       0.00      0.00      0.00        11
           4       0.00      0.00      0.00         2
           5       0.93      0.98      0.96      2496
           6       0.00      0.00      0.00         4
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         1
           9       1.00      1.00      1.00    214403
          10       0.88      0.51      0.64       463
          11       0.99      1.00      1.00    194556
          12       0.00      0.00      0.00         1
          13       0.00      0.00      0.00         1
          14       0.00      0.00      0.00        53
          15       0.99      0.91      0.95      2083
          16       0.00      0.00      0.00         2
          17  

# Class weights

In [25]:
from sklearn.utils import class_weight
# Keyword arguments are required by recent scikit-learn releases.
class_weights = class_weight.compute_class_weight(
    class_weight='balanced', classes=np.unique(train_df['outcome']), y=train_df['outcome'])
class_weights

array([9.66970340e+01, 7.09917391e+03, 2.83966957e+04, 4.05667081e+03,
       1.70380174e+04, 1.70636128e+01, 1.00223632e+04, 2.43400248e+04,
       2.83966957e+04, 1.98667669e-01, 9.19482860e+01, 2.18934336e-01,
       8.51900870e+04, 5.67933913e+04, 8.07488976e+02, 2.04538024e+01,
       2.12975217e+04, 1.34009890e+01, 7.58489815e-02, 8.51900870e+04,
       2.17599200e+02, 2.08799233e+02, 1.06487609e+04])
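
For reference, the `'balanced'` heuristic sets each class weight to `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces a few of the values in the array above from the training-set counts printed earlier; it uses plain Python so that it runs without the notebook state, and the helper name `balanced_weight` is made up for illustration.

```python
# Training-set size and a few per-class counts, copied from the outputs above.
n_samples = 3918744
n_classes = 23
train_counts = {18: 2246308, 9: 857614, 11: 778225, 12: 2}

def balanced_weight(count, n_samples=n_samples, n_classes=n_classes):
    """Weight given by the 'balanced' heuristic: n_samples / (n_classes * count)."""
    return n_samples / (n_classes * count)

weights = {label: balanced_weight(count) for label, count in train_counts.items()}

for label, weight in sorted(weights.items()):
    print(label, weight)
```

The most frequent class (18) ends up with a weight below 0.1, while a class with only two training records (12) gets a weight above 85,000.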

In [26]:
class_keys = np.unique(train_df['outcome'])
class_keys

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22])

In [27]:
class_weight_dict = dict(zip(class_keys,class_weights))

In [28]:
model = create_model()

In [29]:
filepath = 'model.h5'

mc = tf.keras.callbacks.ModelCheckpoint(filepath, monitor='val_loss', save_best_only=True, 
                                        save_weights_only=True, mode='auto')

es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, verbose=1, mode='auto')

history = model.fit(train_ds, epochs=200, validation_data=test_ds, callbacks=[mc, es], 
                    class_weight=class_weight_dict, verbose=2)

Epoch 1/200
30616/30616 - 215s - loss: 207184.6250 - acc: 0.9150 - val_loss: 3373.8640 - val_acc: 0.9666
Epoch 2/200
30616/30616 - 220s - loss: 1229869.2500 - acc: 0.8999 - val_loss: 103643.5781 - val_acc: 0.9235
Epoch 3/200
30616/30616 - 218s - loss: 2159102.0000 - acc: 0.9054 - val_loss: 6858.8359 - val_acc: 0.9826
Epoch 4/200
30616/30616 - 219s - loss: 4601909.5000 - acc: 0.9219 - val_loss: 483246.2500 - val_acc: 0.8992
Epoch 5/200
30616/30616 - 219s - loss: 5335862.5000 - acc: 0.9187 - val_loss: 151306.9688 - val_acc: 0.9426
Epoch 6/200
30616/30616 - 221s - loss: 5539400.0000 - acc: 0.9312 - val_loss: 380728.2188 - val_acc: 0.9193
Epoch 7/200
30616/30616 - 216s - loss: 7947741.0000 - acc: 0.9150 - val_loss: 861471.0625 - val_acc: 0.9228
Epoch 8/200
30616/30616 - 215s - loss: 7574225.0000 - acc: 0.9250 - val_loss: 163218.5938 - val_acc: 0.9555
Epoch 9/200
30616/30616 - 214s - loss: 15989376.0000 - acc: 0.9284 - val_loss: 860234.1250 - val_acc: 0.9107
Epoch 10/200
30616/30616 - 211s 

In [30]:
model.load_weights('model.h5')
model.evaluate(test_ds)



[3373.864013671875, 0.966621994972229]
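
The balanced weights span roughly six orders of magnitude (about 0.076 to 85,190), which may be one reason the weighted loss values are so large. One possible mitigation, which this notebook does not evaluate, is to cap the weights before passing them to `fit`. The sketch below is only an illustration under that assumption: the cap of 100 and the names `max_weight` and `capped_weights` are hypothetical, and only three of the 23 weights are shown.

```python
# Three of the balanced weights computed earlier (class id -> weight).
class_weight_sample = {18: 0.0758489815, 11: 0.218934336, 12: 85190.0870}

# Hypothetical upper bound on any single class weight (an assumption, not tuned).
max_weight = 100.0

# Leave small weights untouched and clip the very large ones.
capped_weights = {label: min(weight, max_weight)
                  for label, weight in class_weight_sample.items()}

print(capped_weights)
```

A dictionary built this way for all 23 classes could be passed as `class_weight=` in place of `class_weight_dict`; whether it actually helps here would need to be checked by retraining.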

In [31]:
y_preds = model.predict(test_ds, verbose=1)
y_preds = np.argmax(y_preds, axis=1)
y_true = test_df['outcome']
print(classification_report(y_true, y_preds))

              precision    recall  f1-score   support

           0       0.08      1.00      0.15       441
           1       0.00      0.00      0.00         6
           2       0.00      0.00      0.00         2
           3       0.00      0.00      0.00        11
           4       0.00      0.00      0.00         2
           5       0.91      0.94      0.93      2496
           6       0.00      0.00      0.00         4
           7       0.00      0.00      0.00         2
           8       0.00      0.00      0.00         1
           9       1.00      0.94      0.97    214403
          10       0.12      0.46      0.18       463
          11       0.99      0.92      0.95    194556
          12       0.00      0.00      0.00         1
          13       0.00      0.00      0.00         1
          14       0.00      0.00      0.00        53
          15       0.15      0.98      0.25      2083
          16       0.00      0.00      0.00         2
          17       0.24    

# Binary classification

In [32]:
label_encoder.classes_[11]

'normal.'
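
`LabelEncoder` stores its classes in sorted order, which is why `normal.` lands at index 11. As a sketch, the index can be looked up by name instead of being hard-coded; the list below is copied from the outcome names shown earlier so that the example runs standalone.

```python
# The 23 outcome strings listed by value_counts() earlier in the notebook.
outcomes = [
    'smurf.', 'neptune.', 'normal.', 'satan.', 'ipsweep.', 'portsweep.',
    'nmap.', 'back.', 'warezclient.', 'teardrop.', 'pod.', 'guess_passwd.',
    'buffer_overflow.', 'land.', 'warezmaster.', 'imap.', 'rootkit.',
    'loadmodule.', 'ftp_write.', 'multihop.', 'phf.', 'perl.', 'spy.',
]

# Sorting the names reproduces the integer codes the encoder assigns.
classes = sorted(outcomes)
normal_index = classes.index('normal.')

print(normal_index)
```

With the fitted encoder from this notebook, `list(label_encoder.classes_).index('normal.')` should give the same value.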

In [33]:
# Collapse the 23 encoded classes into two: 0 = normal traffic (encoded label 11), 1 = any attack.
train_df['outcome'] = (train_df['outcome'] != 11).astype(int)
test_df['outcome'] = (test_df['outcome'] != 11).astype(int)

In [34]:
train_df.outcome.value_counts()

1    3140519
0     778225
Name: outcome, dtype: int64

In [35]:
train_ds = df_to_dataset(train_df, batch_size=batch_size)
test_ds = df_to_dataset(test_df, shuffle=False, batch_size=batch_size)

In [36]:
tf.keras.backend.clear_session()
model = tf.keras.Sequential([
  feature_layer,
  Dense(256, activation='relu'),
  Dense(128, activation='relu'),
  Dense(64, activation='relu'),
  Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[tf.keras.metrics.AUC()])

In [37]:
filepath = 'model.h5'

mc = tf.keras.callbacks.ModelCheckpoint(filepath, monitor='val_loss', save_best_only=True, 
                                        save_weights_only=True, mode='auto')

es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, verbose=1, mode='auto')

history = model.fit(train_ds, epochs=200, validation_data=test_ds, callbacks=[mc, es], 
                    verbose=2)

Epoch 1/200
30616/30616 - 240s - loss: 70.7713 - auc: 0.9938 - val_loss: 293.6857 - val_auc: 0.9991
Epoch 2/200
30616/30616 - 241s - loss: 12.0578 - auc: 0.9948 - val_loss: 33.7645 - val_auc: 0.9992
Epoch 3/200
30616/30616 - 238s - loss: 0.8356 - auc: 0.9984 - val_loss: 10.4264 - val_auc: 0.9999
Epoch 4/200
30616/30616 - 237s - loss: 0.0082 - auc: 0.9998 - val_loss: 0.0063 - val_auc: 0.9994
Epoch 5/200
30616/30616 - 237s - loss: 0.0065 - auc: 0.9994 - val_loss: 0.0053 - val_auc: 0.9995
Epoch 6/200
30616/30616 - 239s - loss: 0.0067 - auc: 0.9998 - val_loss: 1.0749 - val_auc: 0.9999
Epoch 7/200
30616/30616 - 240s - loss: 0.0138 - auc: 0.9997 - val_loss: 0.0055 - val_auc: 0.9996
Epoch 8/200
30616/30616 - 237s - loss: 0.0089 - auc: 0.9997 - val_loss: 8.4735 - val_auc: 0.9999
Epoch 9/200
30616/30616 - 241s - loss: 5.1598 - auc: 0.9998 - val_loss: 3.2222 - val_auc: 0.9999
Epoch 10/200
30616/30616 - 248s - loss: 0.3308 - auc: 0.9999 - val_loss: 0.0045 - val_auc: 0.9999
Epoch 11/200
30616/3061

In [38]:
model.load_weights('model.h5')
model.evaluate(test_ds)



[0.004479814320802689, 0.9998740553855896]

In [39]:
y_preds = model.predict(test_ds, verbose=1)
# Threshold the sigmoid outputs at 0.5 to get hard 0/1 labels.
y_preds = (y_preds > 0.5).astype(int).ravel()
y_true = test_df['outcome']
print(classification_report(y_true, y_preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00    194556
           1       1.00      1.00      1.00    785131

    accuracy                           1.00    979687
   macro avg       1.00      1.00      1.00    979687
weighted avg       1.00      1.00      1.00    979687



In [40]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_true, y_preds))

[[194527     29]
 [   867 784264]]
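
The classification report rounds every score to 1.00, so the confusion matrix is more informative. The sketch below derives the usual rates from the four counts above, assuming scikit-learn's layout (rows are true labels, columns are predicted labels, with 0 = normal and 1 = attack).

```python
# Counts copied from the confusion matrix above.
tn, fp = 194527, 29      # true label 0 (normal): predicted normal, predicted attack
fn, tp = 867, 784264     # true label 1 (attack): predicted normal, predicted attack

total = tn + fp + fn + tp

recall = tp / (tp + fn)                # share of attacks that were detected
precision = tp / (tp + fp)             # share of attack predictions that were correct
false_positive_rate = fp / (fp + tn)   # share of normal traffic flagged as an attack
accuracy = (tp + tn) / total

print(f"recall:              {recall:.6f}")
print(f"precision:           {precision:.6f}")
print(f"false positive rate: {false_positive_rate:.6f}")
print(f"accuracy:            {accuracy:.6f}")
print(f"missed attacks: {fn}, false alarms: {fp}")
```

Most of the 896 errors are attacks classified as normal traffic (867) rather than false alarms (29).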
