<a href="https://colab.research.google.com/github/archiegoodman2/neural_net/blob/main/models_UNSW_NB15.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practice run of analysing/testing different models on the UNSW_NB15 dataset, before trying Deep Learning.

Prior research suggests this dataset is largely non-linear and not easily separable, so deep learning may be necessary. I will try simpler, more interpretable models first, both for the sake of completeness and to obtain variable importances.

Let's load our packages and data

In [None]:
#import packages:
from google.colab import drive
import pandas as pd
import sklearn as sk
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("New run: Packages loaded")

New run: Packages loaded


In [None]:
#if using Colab, you will first need to mount your Drive, e.g. drive.mount('/content/drive')

#change these for different users
test_set_filepath = '/content/drive/MyDrive/Colab_Notebooks/Data/UNSW_NB15_testing-set.parquet'
training_set_filepath = '/content/drive/MyDrive/Colab_Notebooks/Data/UNSW_NB15_training-set.parquet'

# Load the test and training sets (Parquet files)
test_set = pd.read_parquet(test_set_filepath)
train_set = pd.read_parquet(training_set_filepath)

#drop the label column to get our features (X), and keep the labels as our targets (y)
X = train_set.drop('label', axis=1)
y = train_set['label']

# List only the categorical columns (pandas 'category' dtype)
categorical_cols = train_set.select_dtypes(include=['category']).columns.tolist()


The next cell does some basic analysis and picks out which `proto` categories are common enough to be worth one-hot encoding:

In [None]:

#print number of records in our data
print(f"Number of records in training set: {len(train_set)}")
print(f"Number of records in test set: {len(test_set)}")

#lets see which ones are categorical etc
#print(f'''
#The columns and datatypes are:
#{train_set.dtypes}
#''')

print("Categorical Columns are :", categorical_cols)

#print out number of classifications
print(f"Number of categories in 'label' category: {len(train_set['label'].unique())}")

#print out labels
print(f"Labels: {train_set['label'].unique()}")

#print out how many unique values we have for each categorical variable - if we have too many we may need an embeddings layer
for col in categorical_cols:
    print(f"Number of categories in '{col}' category: {len(train_set[col].unique())}")

#there seem to be over 100 possible values of proto - let's see how common each one is
category_percentages = train_set['proto'].value_counts(normalize=True) * 100

#define a dict of the categories and their percentage of occurrence. we want to group any that occur less than 0.5% of the time into an 'other' category
category_percentages_dict = category_percentages.to_dict()

#init empty list to store the categories that we will bother one-hot encoding
target_categories_to_encode = []

for key in category_percentages_dict:
    if category_percentages_dict[key] > 0.5:
        target_categories_to_encode.append(key)

#we now have a list of values that we want to one-hot encode. the rest will simply be grouped into an 'other' category



Number of records in training set: 175341
Number of records in test set: 82332
Categorical Columns are : ['proto', 'service', 'state', 'attack_cat']
Number of categories in 'label' category: 2
Labels: [0 1]
Number of categories in 'proto' category: 133
Number of categories in 'service' category: 13
Number of categories in 'state' category: 9
Number of categories in 'attack_cat' category: 10
{'tcp': 45.59458426722786, 'udp': 36.091387638943544, 'unas': 6.891713860420552, 'arp': 1.6305370677708007, 'ospf': 1.4799733091518812, 'sctp': 0.6558648576202941, 'any': 0.17109518024877238, 'gre': 0.1283213851865793, 'sun-nd': 0.11463377076667751, 'swipe': 0.11463377076667751, 'pim': 0.11463377076667751, 'mobile': 0.11463377076667751, 'ipv6': 0.11463377076667751, 'rsvp': 0.1140634534991816, 'sep': 0.11007123262671023, 'ib': 0.05760204401708671, 'pri-enc': 0.0570317267495908, 'qnx': 0.0570317267495908, 'rvd': 0.0570317267495908, 'pvp': 0.0570317267495908, 'sat-expak': 0.0570317267495908, 'sat-mon':

Given the high number of categories in the `proto` column (133), we may want to consider an embeddings layer for the deep learning that we plan to undertake later. However, since LR/DT/RF can perform poorly on sparse, high-dimensional inputs (such as one-hot encoded ones), we will group all the extremely rare categories into an 'other' category.
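
As a rough sketch of that grouping step, the snippet below applies the same 0.5% cut-off to a small made-up `proto` column and then one-hot encodes the result. The toy counts and the names `RARE_THRESHOLD_PCT`, `frequent`, `proto_grouped` and `proto_encoded` are illustrative assumptions, not part of the notebook. Because the real column has the pandas `category` dtype, the sketch casts to strings first so that the new 'other' value can be assigned.

```python
import pandas as pd

# Toy stand-in for train_set['proto']; these counts are made up for illustration.
proto = pd.Series(
    ["tcp"] * 600 + ["udp"] * 390 + ["gre"] * 4 + ["pim"] * 3 + ["swipe"] * 3,
    name="proto",
)

# Cast to plain strings so that 'other' can be assigned even if the
# original column is a pandas categorical.
proto = proto.astype(str)

RARE_THRESHOLD_PCT = 0.5  # hypothetical name for the 0.5% cut-off

# Percentage share of each category, then keep those above the cut-off.
pct = proto.value_counts(normalize=True) * 100
frequent = pct[pct > RARE_THRESHOLD_PCT].index

# Keep frequent values as they are; replace everything else with 'other'.
proto_grouped = proto.where(proto.isin(frequent), "other")

# One-hot encode the reduced set of categories.
proto_encoded = pd.get_dummies(proto_grouped, prefix="proto")
print(proto_encoded.columns.tolist())
```

If this approach were used on the real data, the list of frequent categories would probably be derived from the training set only and then reused unchanged on the test set, so that both end up with the same columns.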
