<a href="https://colab.research.google.com/github/archiegoodman2/neural_net/blob/main/models_UNSW_NB15.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practice run of analysing/testing different models on the UNSW_NB15 dataset, before trying Deep Learning.

Prior research suggests this dataset is largely non-linear and not easily separable, so deep learning may be necessary. I will try simpler, more interpretable models first, both for completeness and to obtain variable importances.

Let's load our packages and data

In [41]:
#import packages:
from google.colab import drive
import pandas as pd
import sklearn as sk
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score




print("New run: Packages loaded")

New run: Packages loaded


In [36]:
#if using Colab, mount your Drive first
drive.mount('/content/drive')

#change these for different users
test_set_filepath = '/content/drive/MyDrive/Colab_Notebooks/Data/UNSW_NB15_testing-set.parquet'
training_set_filepath = '/content/drive/MyDrive/Colab_Notebooks/Data/UNSW_NB15_training-set.parquet'

# Load the two Parquet files
test_set = pd.read_parquet(test_set_filepath)
train_set = pd.read_parquet(training_set_filepath)



The next cell does some basic analysis of the data; the cell after it one hot encodes some of the features:

In [33]:
"""
#print number of records in our data
print(f"Number of records in training set: {len(train_set)}")
print(f"Number of records in test set: {len(test_set)}")

#lets see which ones are categorical etc
print(f'''
The columns and datatypes are:
{train_set.dtypes}
''')

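#list the categorical columns (pandas 'category' dtype)
categorical_cols = train_set.select_dtypes(include=['category']).columns.tolist()
print("Categorical Columns are :", categorical_cols)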

#print out number of classifications
print(f"Number of categories in 'label' category: {len(train_set['label'].unique())}")

#print out labels
print(f"Labels: {train_set['label'].unique()}")

#print out how many unique values we have for each categorical variable - if we have too many we may need an embeddings layer
for col in categorical_cols:
    print(f"Number of categories in '{col}' category: {len(train_set[col].unique())}")

"""

Number of records in training set: 175341
Number of records in test set: 82332
Categorical Columns are : ['proto', 'service', 'state', 'attack_cat']
Number of categories in 'label' category: 2
Labels: [0 1]
Number of categories in 'proto' category: 133
Number of categories in 'service' category: 13
Number of categories in 'state' category: 9
Number of categories in 'attack_cat' category: 10


In [37]:

def preprocess_data(data_set):
  """
  Function to preprocess data. Groups the rare values of 'proto' into 'other' and
  one hot encodes the top 6 most common values, then turns boolean columns into 1s and 0s.

  Args:
  data_set (dataframe) : test or train set to be processed

  Returns:
  data_set (dataframe) : processed data set

  """
  # List only the categorical columns (pandas 'category' dtype)
  categorical_cols = data_set.select_dtypes(include=['category']).columns.tolist()

  #there seem to be over 100 possible values of proto - let's see how common they all are
  category_percentages = data_set['proto'].value_counts(normalize=True) * 100

  #dict of the categories and their percentage of occurrence, which can be printed to view the distribution. We want to group any category that occurs less than 0.5% of the time into an 'other' category
  category_percentages_dict = category_percentages.to_dict()

  # After looking at the distribution of the values of proto, only the top 6 occur more than 0.5% of the time - all others are very rare
  # So we take the 6 most common values. This value of 6 has to be hardcoded
  top_6_categories = category_percentages.head(6).index.tolist()

  #we now have a list of values that we want to one hot encode. we want to simply group the others into an 'other column'
  data_set['proto_grouped'] = data_set['proto'].apply(lambda x: x if x in top_6_categories else 'other')

  #now we one hot encode this column
  data_set = pd.get_dummies(data_set, columns=['proto_grouped'])

  #drop the original columns if still present
  if 'proto' in data_set.columns:
    data_set = data_set.drop('proto', axis=1)
  if 'proto_grouped' in data_set.columns:
    data_set = data_set.drop(['proto_grouped'], axis=1)

  #encode all boolean columns (including the new one hot columns) as 1s and 0s
  binary_cols = data_set.select_dtypes(include=['bool']).columns

  #convert to int. An earlier version used an apply that returned a dataframe instead of a series when a column held more than one unique value (e.g. some `False` among expected `True`s); casting the columns with astype avoids this
  data_set[binary_cols] = data_set[binary_cols].astype(int)

  return data_set

train_set = preprocess_data(train_set)




NOTE TO SELF -
1. optimise the lambda function used in data preprocessing?
2. do I also have to preprocess the test set?

Given the high number of categories in the 'proto' column (133 in the training set), we may want to consider an embeddings layer in the deep learning that we plan to undertake later. However, since decision trees and random forests (DT/RF) perform somewhat poorly on sparse data sets (such as one hot encoded ones), we group all the extremely rare categories into an 'other' category.
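On the two notes to self above: the per-row lambda can be replaced by a vectorised `isin`/`where`, and the test set does need the same preprocessing, but with the list of common categories learned from the training set so that both frames end up with identical columns. Below is a minimal sketch under those assumptions. The helper names `fit_top_categories` and `group_and_encode` and the toy frames are hypothetical and not part of this notebook:

```python
import pandas as pd


def fit_top_categories(train_col, n_top=6):
    """Return the n_top most frequent values of a training column."""
    return train_col.value_counts().head(n_top).index.tolist()


def group_and_encode(df, col, top_categories):
    """Group values outside top_categories into 'other', then one hot encode."""
    # Cast to str first: a pandas 'category' column rejects values it has not seen
    values = df[col].astype(str)
    grouped = values.where(values.isin(top_categories), 'other')
    # A fixed category list gives the same dummy columns for every data set
    grouped = pd.Series(
        pd.Categorical(grouped, categories=list(top_categories) + ['other']),
        index=df.index,
    )
    dummies = pd.get_dummies(grouped, prefix=f'{col}_grouped').astype(int)
    return pd.concat([df.drop(columns=[col]), dummies], axis=1)


# Toy stand-ins for the real train and test sets
toy_train = pd.DataFrame({
    'proto': ['tcp', 'tcp', 'udp', 'udp', 'arp', 'ospf'],
    'label': [0, 1, 0, 1, 1, 0],
})
toy_test = pd.DataFrame({'proto': ['tcp', 'icmp', 'udp'], 'label': [0, 1, 0]})

# Learn the common categories on the training set only, then reuse them
top = fit_top_categories(toy_train['proto'], n_top=2)
toy_train_enc = group_and_encode(toy_train, 'proto', top)
toy_test_enc = group_and_encode(toy_test, 'proto', top)
print(toy_test_enc)
```

Because the category list is fixed before encoding, a value that appears only in the test set (here 'icmp') falls into the 'other' column instead of creating a column the model never saw.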


In [39]:
def run_models(train_set, model_type):
  """
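  Runs an LR (logistic regression), DT (decision tree) or RF (random forest) model on a dataframe

  Args:
  train_set (dataframe) : training set, raw or already preprocessed
  model_type (str) : 'LR', 'DT' or 'RF'
  """

  #only preprocess if this has not been done already - preprocess_data drops 'proto', so calling it a second time would raise a KeyError
  if 'proto' in train_set.columns:
    train_set = preprocess_data(train_set)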

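  #drop the label and define our target. 'attack_cat' names the attack type, so it would leak the label and is dropped as well
  X = train_set.drop(['label', 'attack_cat'], axis=1, errors='ignore')
  y = train_set['label']

  # List the remaining categorical columns (pandas 'category' dtype) and one hot encode them, since the models need numeric input
  categorical_cols = X.select_dtypes(include=['category']).columns.tolist()
  X = pd.get_dummies(X, columns=categorical_cols)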

  #we plan to use nested k fold cross validation to tune the hyperparameters (HPs), so let's define a dict of lists containing sensible guesses for each of our three models
  param_grid_lr = {
    'C': [0.001, 0.01, 0.1, 1, 10],   # List of regularization strengths
    'solver': ['lbfgs', 'saga']         # List of solvers to try
  }

  param_grid_dt = {
    'max_depth': [3, 5, 10],             # List of max depth values
    'min_samples_split': [2, 10, 20],    # List of minimum samples to split
    'min_samples_leaf': [1, 5, 10]       # List of minimum samples per leaf
  }

  param_grid_rf = {
    'n_estimators': [50, 100, 200],      # List of number of trees to use
    'max_depth': [5, 10, 20],            # List of maximum depths for trees
    'min_samples_split': [2, 10, 20]     # List of minimum samples for splitting a node
  }

  if model_type == 'LR':

    # Standardize features (for Logistic Regression)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    #init our model
    LR_model = LogisticRegression(max_iter=1000)

  if model_type == 'DT':

    DT_model = DecisionTreeClassifier(random_state=42)

  if model_type == 'RF':

    RF_model = RandomForestClassifier(random_state=42)
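
The hyperparameter grids above are intended for nested k fold cross validation. Below is a minimal sketch of how that could be wired up for the logistic regression case. It assumes a synthetic data set from `make_classification` in place of the real features, and a `Pipeline` (not used in this notebook) so that the scaler is refit inside each fold instead of on the full data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y; the real features would come from train_set
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Scaling inside the pipeline means each fold is scaled using its own training split only
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as '<step name>__<parameter>'
param_grid = {'clf__C': [0.01, 0.1, 1, 10]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Inner loop: choose the hyperparameters by ROC AUC
search = GridSearchCV(pipeline, param_grid, scoring='roc_auc', cv=inner_cv)

# Outer loop: estimate how the tuned model generalises to unseen folds
outer_scores = cross_val_score(search, X_demo, y_demo, scoring='roc_auc', cv=outer_cv)
print(f"Nested CV ROC AUC: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

The same pattern should carry over to the decision tree and random forest grids by swapping the final pipeline step and prefixing the grid keys with the step name; the scaler step is not needed for tree models.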


