<a href="https://colab.research.google.com/github/alik604/cyber-security/blob/master/Intrusion-Detection/UNSW_NB15%20-%20PyTorch%20feature%20selection%20via%20L1%20regularization%20on%20layer_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 ## References 

 *note that this is binary classification*
 
 MLP with pytorch at end
 
 * Data source: https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/
 * Sample/starter code: https://github.com/Nir-J/ML-Projects/blob/master/UNSW-Network_Packet_Classification/unsw.py

In [3]:
%%capture
!pip install mlxtend

In [4]:
%config IPCompleter.greedy=True
import pandas as pd
import seaborn as sns
import numpy as np

import matplotlib as matplot
import matplotlib.pyplot as plt
%matplotlib inline

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import warnings
warnings.filterwarnings("ignore")

from keras import Sequential
from keras.models import Model, load_model
from keras.layers import *
from keras.callbacks import ModelCheckpoint
from keras import regularizers

from sklearn.metrics import *
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder,normalize


import xgboost, lightgbm
from mlxtend.classifier import EnsembleVoteClassifier

# Preprocessing (transformation/scaling) 

In [5]:
train = pd.read_csv('https://raw.githubusercontent.com/Nir-J/ML-Projects/master/UNSW-Network_Packet_Classification/UNSW_NB15_training-set.csv')
test = pd.read_csv('https://raw.githubusercontent.com/Nir-J/ML-Projects/master/UNSW-Network_Packet_Classification/UNSW_NB15_testing-set.csv')
combined_data = pd.concat([train, test]).drop(['id'],axis=1)

In [6]:
# Contaminsation mean pollution (outliers) in data
tmp = train.where(train['attack_cat'] == "Normal").dropna()
contamination = round(1 - len(tmp)/len(train), 2)
print("train contamination ", contamination)

tmp = test.where(test['attack_cat'] == "Normal").dropna()
print("test  contamination ", round(1 - len(tmp)/len(test),2),'\n')

if contamination > 0.5:
    print(f'contamination is {contamination}, which is greater than 0.5. Fixing...')
    contamination = round(1-contamination,2)
    print(f'contamination is now {contamination}')

train contamination  0.68
test  contamination  0.55 

contamination is 0.68, which is greater than 0.5. Fixing...
contamination is now 0.32


In [7]:
le1 = LabelEncoder()
le = LabelEncoder()

vector = combined_data['attack_cat']

print("attack cat:", set(list(vector))) # use print to make it print on single line 

combined_data['attack_cat'] = le1.fit_transform(vector)
combined_data['proto'] = le.fit_transform(combined_data['proto'])
combined_data['service'] = le.fit_transform(combined_data['service'])
combined_data['state'] = le.fit_transform(combined_data['state'])

vector = combined_data['attack_cat']
print('\nDescribing attack_type: ')
print("min", vector.min())
print("max", vector.max())
print("mode",vector.mode(), "Which is,", le1.inverse_transform(vector.mode()))
print("mode", len(np.where(vector.values==6)[0])/len(vector),"%")

attack cat: {'Generic', 'Normal', 'Exploits', 'Analysis', 'Worms', 'Reconnaissance', 'Fuzzers', 'Backdoor', 'Shellcode', 'DoS'}

Describing attack_type: 
min 0
max 9
mode 0    6
dtype: int32 Which is, ['Normal']
mode 0.3609225646458884 %


In [8]:
le1.inverse_transform([0,1,2,3,4,5,6,7,8,9])
combined_data.head(3)

array(['Analysis', 'Backdoor', 'DoS', 'Exploits', 'Fuzzers', 'Generic',
       'Normal', 'Reconnaissance', 'Shellcode', 'Worms'], dtype=object)

Unnamed: 0,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,sttl,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,0.121478,113,0,4,6,4,258,172,74.08749,252,...,1,1,0,0,0,1,1,0,6,0
1,0.649902,113,0,4,14,38,734,42014,78.473372,62,...,1,2,0,0,0,1,6,0,6,0
2,1.623129,113,0,4,8,16,364,13186,14.170161,62,...,1,3,0,0,0,2,6,0,6,0


In [9]:
## OMITTED: For statistical feature removal

lowSTD = list(combined_data.std().to_frame().nsmallest(6, columns=0).index)
# this is stupid. suppose a feature has a 1.0 (spearman or pearson) correlation, OR conditional probability, when not 0.... That a very useful feature  

lowCORR = list(combined_data.corr().abs().sort_values('attack_cat')['attack_cat'].nsmallest(3).index) # .where(lambda x: x < 0.005).dropna()
# This might be stupid. A Deep MLP (feed forward neural net) may see patterns

drop = set( lowCORR + lowSTD)
drop = {'ackdat', 'ct_ftp_cmd', 'djit', 'is_ftp_login', 'is_sm_ips_ports', 'response_body_len', 'sjit', 'synack', 'tcprtt'}
# print(f'Before {combined_data.shape}')
combined_data_reduced=combined_data # .drop(drop,axis=1)
# print(f'After {combined_data.shape}')

In [10]:
# # transform = list(combined_data_reduced.columns.values[4:])
# transform.append('dur')
# transform.remove('attack_cat')
# # transform min-max norm 
# combined_data_reduced[transform] = combined_data_reduced[transform].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

In [11]:
data_x = combined_data_reduced.drop(['attack_cat','label'], axis=1) # droped label
data_y = combined_data_reduced.loc[:,['label']]
# del combined_data # free mem
X_train, X_test, y_train, y_test = train_test_split(data_x, data_y, test_size=.20, random_state=42) # TODO

In [12]:
#combined_data_reduced.where(combined_data_reduced['label'] == 1.0).dropna().tail(20)

In [13]:
X_train.shape
y_train.shape
X_test.shape # test is larger... good 
y_test.shape

(206138, 42)

(206138, 1)

(51535, 42)

(51535, 1)

# Benchmark before feature removal

In [349]:
DTC = DecisionTreeClassifier()
RFC = RandomForestClassifier(n_estimators=150, random_state=42, n_jobs=-1)
ETC = ExtraTreesClassifier(n_estimators=200, random_state=42, n_jobs=-1)
XGB = xgboost.XGBClassifier(n_estimators=150, n_jobs=-1)
GBM = lightgbm.LGBMClassifier(objective='binary', n_estimators= 500) # multiclass

list_of_CLFs_names = []
list_of_CLFs = [DTC, RFC, ETC, XGB, GBM]
ranking = []

for clf in list_of_CLFs:
    _ = clf.fit(X_train, y_train)
    pred = clf.score(X_test, y_test)
    name = str(type(clf)).split(".")[-1][:-2]
    print("Acc: %0.5f for the %s" % (pred, name))

    ranking.append(pred)
    list_of_CLFs_names.append(name)

Acc: 0.93752 for the DecisionTreeClassifier
Acc: 0.95240 for the RandomForestClassifier
Acc: 0.95133 for the ExtraTreesClassifier
Acc: 0.93653 for the XGBClassifier
Acc: 0.95269 for the LGBMClassifier


In [350]:
eclf = EnsembleVoteClassifier(clfs=list_of_CLFs, refit=False, voting='soft')
_ = eclf.fit(X_train, y_train)
pred = eclf.score(X_test, y_test)
print("Acc: %0.5f for the %s" % (pred, str(type(eclf)).split(".")[-1][:-2]))


pred = eclf.predict(X_test)
probas = eclf.predict_proba(X_test)

Acc: 0.95077 for the EnsembleVoteClassifier


In [351]:
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA, TruncatedSVD, PCA
from sklearn.svm import LinearSVC
n = 10 

### Try RFE

In [352]:
rfe = RFE(DecisionTreeClassifier(), n).fit(X_train, y_train)

desiredIndices = np.where(rfe.support_==True)[0]
whitelist = X_train.columns.values[desiredIndices]
X_train_RFE, X_test_RFE = X_train[whitelist], X_test[whitelist]

print('new shape', X_train_RFE.shape) 

for clf in list_of_CLFs:
    _ = clf.fit(X_train_RFE,y_train)
    pred = clf.score(X_test_RFE,y_test)
    name = str(type(clf)).split(".")[-1][:-2]
    print("Acc: %0.5f for the %s" % (pred, name))

    ranking.append(pred)
    list_of_CLFs_names.append(name)


eclf = EnsembleVoteClassifier(clfs=list_of_CLFs, refit=False, voting='soft')
_ = eclf.fit(X_train_RFE, y_train)
pred = eclf.score(X_test_RFE, y_test)
print("Acc: %0.5f for the %s" % (pred, str(type(eclf)).split(".")[-1][:-2]))


pred = eclf.predict(X_test_RFE)
probas = eclf.predict_proba(X_test_RFE)

new shape (206138, 10)
Acc: 0.93705 for the DecisionTreeClassifier
Acc: 0.95027 for the RandomForestClassifier
Acc: 0.94846 for the ExtraTreesClassifier
Acc: 0.93379 for the XGBClassifier
Acc: 0.94804 for the LGBMClassifier
Acc: 0.94902 for the EnsembleVoteClassifier


### Try SVD and PCA

In [353]:
svd = TruncatedSVD(n_components=n).fit(X_train)
X_train_svd, X_test_svd = svd.transform(X_train), svd.transform(X_test)

for clf in list_of_CLFs:
    _ = clf.fit(X_train_svd, y_train)
    pred = clf.score(X_test_svd, y_test)
    name = str(type(clf)).split(".")[-1][:-2]
    print("Acc: %0.5f for the %s" % (pred, name))

    ranking.append(pred)
    list_of_CLFs_names.append(name)

eclf = EnsembleVoteClassifier(clfs=list_of_CLFs, refit=False, voting='soft')
_ = eclf.fit(X_train_svd, y_train)
pred = eclf.score(X_test_svd, y_test)
print("Acc: %0.5f for the %s" % (pred, str(type(eclf)).split(".")[-1][:-2]))


pred = eclf.predict(X_test_svd)
probas = eclf.predict_proba(X_test_svd)

Acc: 0.87987 for the DecisionTreeClassifier
Acc: 0.89908 for the RandomForestClassifier
Acc: 0.89821 for the ExtraTreesClassifier
Acc: 0.87024 for the XGBClassifier
Acc: 0.88848 for the LGBMClassifier
Acc: 0.89512 for the EnsembleVoteClassifier


In [354]:
pca = PCA(n_components=n).fit(X_train)
X_train_pca, X_test_pca = pca.transform(X_train), pca.transform(X_test)

for clf in list_of_CLFs:
    _ = clf.fit(X_train_pca, y_train)
    pred = clf.score(X_test_pca, y_test)
    name = str(type(clf)).split(".")[-1][:-2]
    print("Acc: %0.5f for the %s" % (pred, name))

    ranking.append(pred)
    list_of_CLFs_names.append(name)

eclf = EnsembleVoteClassifier(clfs=list_of_CLFs, refit=False, voting='soft')
_ = eclf.fit(X_train_pca, y_train)
pred = eclf.score(X_test_pca, y_test)
print("Acc: %0.5f for the %s" % (pred, str(type(eclf)).split(".")[-1][:-2]))


pred = eclf.predict(X_test_pca)
probas = eclf.predict_proba(X_test_pca)

Acc: 0.87896 for the DecisionTreeClassifier
Acc: 0.89388 for the RandomForestClassifier
Acc: 0.88905 for the ExtraTreesClassifier
Acc: 0.86518 for the XGBClassifier
Acc: 0.88235 for the LGBMClassifier
Acc: 0.89176 for the EnsembleVoteClassifier


# Lets' try another way

# MLP with L1 loss for feature selection  


In [14]:
import torch
import torch.nn as nn
import torch.nn.functional as F
np.unique(y_train)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

array([0, 1], dtype=int64)

device(type='cuda')

In [16]:
# device = 'cpu'
input_size = 42
hidden_size = 32 
hidden_size_2 = 10
num_classes = np.unique(y_train) # faster to code like a dumbass... len(set(y_train.values.flatten().tolist()))
print(f'Number of classes: {np.unique(y_train)}')
num_epochs = 5
batch_size = 16
learning_rate = 0.001

# Fully connected neural network with one hidden layer
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.input_size = input_size # ?? 
        self.l1 = nn.Linear(input_size, hidden_size) 
        self.l2 = nn.Linear(hidden_size, hidden_size_2)  
        self.l3 = nn.Linear(hidden_size_2, num_classes)
        self.relu = nn.ReLU()
        self.elu = nn.ELU()
    
    def forward(self, x):
        out = self.l1(x)
        out = self.relu(out)
        out = self.l2(out)
        out = self.relu(out)
        out = self.l3(out)
        # no activation and no softmax at the end
        return out

Number of classes: [0 1]


In [17]:
factor = 0.005 #0.00005 # reg term coefficient/multiplier/weight  
model = NeuralNet(input_size, hidden_size, num_classes).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss() # This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class.
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)  

# L1 Regularizer
l1_reg_criterion = nn.L1Loss() # size_average=False 

X_train_vals= X_train.values # _RFE
y_train_vals = y_train.values.flatten()

X_test_vals= X_test.values
y_test_vals = y_test.values.flatten()

# Train the model
for epoch in range(1, num_epochs + 1):
    # for i in range(len(X_train_RFE_vals)//100 + 1): #, batch_size

    n_correct = 0
    n_samples = 0 
    for i in range(0, X_train_vals.shape[0], batch_size):

        x = torch.as_tensor(X_train_vals[i:i+batch_size], dtype=torch.float).to(device)
        y = torch.as_tensor(y_train_vals[i:i+batch_size], dtype=torch.long).to(device)
        
        outputs = model(x)
        loss = criterion(outputs, y)

        reg_loss = 0 
        for name, param in model.l1.state_dict().items(): # L1 is the first layer 
          if name == 'weight':
            # print(param.size())
            # print((param-param).sum().item()); throw_for_bar_after_print_stop
            reg_loss = torch.norm(model.l1.state_dict()['weight'], p=1).item() # l1_reg_criterion(param, param-param)
        
        # print(f'loss: {loss}, reg_loss: {reg_loss}'); throw_for_bar_after_print_stop

        loss_pre = loss
        loss = loss + (factor * reg_loss)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Train epoch accuracy
        _, predicted = torch.max(outputs.data, dim=1)
        n_samples += y.size(0)
        n_correct += (predicted == y).sum().item()
    print(f'Epoch [{epoch}/{num_epochs}], Step [{i+1}/{n_total_steps}], Loss: {loss.item():.4f}, Acc: {100.0 * n_correct / (n_samples+1):.4f}\t loss: {loss_pre:.4f}, reg_loss: {factor * reg_loss:.5f}')

TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType), but expected one of:
 * (tuple of ints size, *, tuple of names names, torch.memory_format memory_format, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (tuple of ints size, *, torch.memory_format memory_format, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)


In [336]:
# Test the model
# In test phase, we don't need to compute gradients (for memory efficiency)
X_test_vals= X_test.values # _RFE
y_test_vals = y_test.values.flatten()
with torch.no_grad():
    n_correct = 0
    n_samples = 0 
    # for i in range(len(X_train_RFE_vals)//100 + 1):   
    for i in range(0, X_test_RFE_vals.shape[0], batch_size):
        x = torch.as_tensor(X_test_vals[i:i+batch_size], dtype=torch.float).to(device)
        y = torch.as_tensor(y_test_vals[i:i+batch_size], dtype=torch.long).to(device)
        
        outputs = model(x)
        if len(outputs.data) > 0:
          # max returns (value ,index)
          _, predicted = torch.max(outputs.data, dim=1)
          
          n_samples += y.size(0)
          n_correct += (predicted == y).sum().item()

        else:
          print("what???")
          print(x, outputs.data)
    acc = 100.0 * n_correct / (n_samples+1)
    print(f'Accuracy of the network: {acc} %')

Accuracy of the network: 63.88155852219808 %


> Accuracy of the network: 79.643%

In [356]:
# print(model)
# print(model.l1)
# for name, param in model.l1.state_dict().items(): # L1 is the first layer 
#   if name == 'weight': 
#     for param_ in param.T:
#       print(param_.sum().item())

weights = [weight.sum().item() for weight in model.l1.state_dict()['weight'].T]
weights = [round(abs(i), 5) for i in weights]
weights = np.array(weights/np.sum(weights))
weights_idx = np.argsort(fi)[::-1]


print(f"[MLP] Top ten feature weights:  {weights[:10]}")
print(f"[MLP] Top ten feature indexes:  {weights_idx[:10]}")
print(f"[MLP] Last ten feature indexes: {weights_idx[-10:]}\n")


clf = DecisionTreeClassifier()
_ = clf.fit(X_train, y_train)
fi = [round(i, 5 ) for i in clf.feature_importances_] # round 
fi = np.array(fi) # to array 
fi = fi/np.sum(fi) # ensure it's normalize
fi_idx = np.argsort(fi)[::-1] # from largest to smallest

# print(fi[fi_idx])
print(f"[DecisionTreeClassifier] Top ten feature indexes:  {fi_idx[:10]}")
print(f"[DecisionTreeClassifier] Last ten feature indexes: {fi_idx[-10:]}")

[MLP] Top ten feature weights:  [0.01252026 0.00658208 0.02404366 0.04574006 0.01802584 0.02314406
 0.02250805 0.04280177 0.01996318 0.03054257]
[MLP] Top ten feature indexes:  [ 9 24 26 35  6 40 30  7 27 15]
[MLP] Last ten feature indexes: [28  3 14 10 31 37 19 36 22 41]

[DecisionTreeClassifier] Top ten feature indexes:  [ 9 24 26 35  6 40 30  7 27 15]
[DecisionTreeClassifier] Last ten feature indexes: [28  3 14 31 10 36 22 37 19 41]


```
factor = 0.0
[MLP] Top ten feature indexes:  [ 9 24 26 35  6 40 30  7 27 15]
[MLP] Top ten feature weights:  [0.06196863 0.02227446 0.04971818 0.02841482 0.04204971 0.02587458 0.04431207 0.02130464 0.01099255 0.00912548]
[MLP] Last ten feature indexes: [ 5  3 14 28 36 10 37 19 22 41]
```

```
factor = 0.00005
[MLP] Top ten feature indexes:  [ 9 24 26 35  6 40 30  7 27 15]
[MLP] Top ten feature weights:  [0.03054257 0.0136881  0.03092274 0.00804256 0.02250805 0.006705 0.00787411 0.04280177 0.05547793 0.03282484]
[MLP] Last ten feature indexes: [31  3  5 14 37 36 10 19 22 41]
```
Overkill... 0.0005 is too high
```
Epoch [1/5], Step [206129/206138], Loss: 0.6786, Acc: 72.8474	 loss: 0.4932, reg_loss: 0.18541
Epoch [3/5], Step [206129/206138], Loss: 0.7730, Acc: 65.8260	 loss: 0.5133, reg_loss: 0.25978

factor = 0.0005
[MLP] Top ten feature indexes:  [ 9 24 26 35  6 40 30  7 27 15]
[MLP] Top ten feature weights:  [0.04053201 0.0279744  0.03267631 0.00767643 0.0375806  0.00558694 0.01319869 0.04616321 0.00876322 0.01025393]
[MLP] Last ten feature indexes: [ 5  3 14 31 10 36 37 19 22 41]
```

# Conclusion

Doesn't seem to be very useful here. Maybe its because we have **A lot** of data.

* Too bad I made these chances by hand. I could explore more
* I should have set the seed
* Future Ali askes that i should tried a higher `lambda`, whch in my code is `factor = 0.00005`. Now in 2022 this code fails to run.
