# Anomaly Detection Homework

This notebook is the anomaly detection homework for Applied AI Week 2. The dataset is available at [this link](https://drive.google.com/file/d/1cZGOZu_zdKLXnH-Ap1w9SMffYXZqa2Ot/view?usp=sharing). If you have problems with the link, contact me: safak@inzva.com

## Dataset Description
"KDD CUP 99 data set is used mainly to analyze the different
attacks. It consists of nearly 4,900,000 samples with 41
features and each sample is classified as either normal or
attack" [explanation from this source](https://www.ripublication.com/ijaer18/ijaerv13n7_81.pdf)

## Task Description

The dataset has been prepared and preprocessed for the anomaly detection task. It contains two targets: "Probe", which is the anomaly class, and "Normal", which is the normal class.

**You are supposed to build an anomaly detection model** with a **Vanilla Autoencoder**, a **Variational Autoencoder** and a **Denoising Autoencoder**. However, you are not restricted to autoencoders; you can implement a fancy state-of-the-art ensemble 1000B parameter model. It is really up to you.

We don't really want you to do sloppy homework.

The variable descriptions:

- train set: kdd_train_probe
- validation set (for hyperparam tuning): kdd_valid_probe
- test set: kdd_test_v2_probe

## What will you report?
Report your macro-averaged F1 score on the test set:

```python
from sklearn.metrics import f1_score
f1 = f1_score(y_true, y_pred, average = "macro")
print(f1)
```
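
To make the metric concrete, here is a minimal sketch that computes the macro F1 score by hand with NumPy. The helper name `macro_f1` and the toy labels are made up for illustration and are not part of the homework data. Macro averaging takes the plain mean of the per-class F1 scores, so the rare "Probe" class counts as much as the "Normal" class. On labels like these it should agree with `f1_score(..., average="macro")`.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (hypothetical helper)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for cls in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Toy labels: 0 = normal, 1 = anomaly (made up for illustration)
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score = macro_f1(y_true, y_pred)
print(score)
```

Here the normal class scores 0.75 and the anomaly class 0.5, so the macro average is 0.625.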


# Preparation (do not edit this part)

In [None]:
import pandas as pd
import numpy as np
import warnings
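try:
    from pandas.errors import SettingWithCopyWarning  # pandas >= 1.5
except ImportError:
    from pandas.core.common import SettingWithCopyWarning  # older pandas versions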

import torch.nn as nn
import torch
import sys
from torch.utils.data import DataLoader, Dataset
from collections import defaultdict
from tqdm.auto import tqdm

import seaborn as sns
from pylab import rcParams

from sklearn.metrics import f1_score, accuracy_score, classification_report

In [None]:
%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style="whitegrid", palette="muted", font_scale=1.2)
HAPPY_COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]
sns.set_palette(sns.color_palette(HAPPY_COLORS_PALETTE))

rcParams['figure.figsize'] = 10, 4

In [None]:

warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

kdd = pd.read_csv('/content/drive/MyDrive/applied_ai_enes_safak/homework/kdd.csv')
kdd = kdd.iloc[:,1:43]
kdd = kdd.drop(['Protocol Type', 'Service', 'Flag'], axis = 1)

kdd_train = kdd.iloc[0:102563, :]
kdd_test = kdd.iloc[102563:183737, :]

kdd_train_probe = kdd_train[(kdd_train.Type_Groups == 'Normal') | (kdd_train.Type_Groups == 'Probe')]
kdd_test_probe = kdd_test[(kdd_test.Type_Groups == 'Normal') | (kdd_test.Type_Groups == 'Probe')]

kdd_train_probe['Type_Groups'] = np.where(kdd_train_probe['Type_Groups'] == 'Normal', 0, 1)
kdd_test_probe['Type_Groups'] = np.where(kdd_test_probe['Type_Groups'] == 'Normal', 0, 1)

kdd_valid_probe = kdd_test_probe.iloc[14000:34000,:]
kdd_test_v2_probe = pd.concat([kdd_test_probe.iloc[0:14000,:], kdd_test_probe.iloc[34001:64759,:]])


# classify anomalies and normals
# train set: kdd_train_probe
# validation set (for hyperparam tuning): kdd_valid_probe
# test set: kdd_test_v2_probe
# avg. macro f1 score on test set

## PyTorch DataLoaders

In [None]:
# create our dataloaders for the train and validation sets
# we will use autoencoders to detect anomalies, so we don't need the anomaly class for training
# remove the anomaly samples and train the autoencoder to reconstruct normal samples only

class TabularDataset(Dataset):
    def __init__(self, df):
        super(TabularDataset, self).__init__()
        self.df = df
    
    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # use every column except the last one, which is assumed to be the Type_Groups label
        data = self.df.iloc[idx, :-1].to_numpy()
        return {
            "sample": torch.Tensor(data)
        }


BATCH_SIZE = 128

train_normal = kdd_train_probe[kdd_train_probe.Type_Groups == 0]
val_normal = kdd_valid_probe[kdd_valid_probe.Type_Groups == 0]
test_normal = kdd_test_v2_probe[kdd_test_v2_probe.Type_Groups == 0]


train_data = TabularDataset(train_normal)
val_data = TabularDataset(val_normal)
test_data = TabularDataset(test_normal)
test_data_all = TabularDataset(kdd_test_v2_probe)

train_dataloader = DataLoader(train_data, shuffle = True, batch_size = BATCH_SIZE) # for training
val_dataloader = DataLoader(val_data, shuffle = False, batch_size = BATCH_SIZE) # for validation
test_dataloader = DataLoader(test_data, shuffle = False, batch_size = BATCH_SIZE) # for testing
test_all_dataloader = DataLoader(test_data_all, shuffle = False, batch_size = BATCH_SIZE) # contains both anomaly and normal samples to test

# VAE

In [None]:
# VAE implementation in PyTorch

class LinearVAE(nn.Module):
    def __init__(self, n_features, latent_dim):
        super(LinearVAE, self).__init__()
        self.n_features = n_features

        self.encoder = nn.Sequential(
            nn.Linear(n_features, 20),
            nn.Tanh()
        )

        self.encoder2mean = nn.Linear(20, latent_dim)
        self.encoder2logvar = nn.Linear(20, latent_dim)

        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 20),
            nn.ReLU(),
            nn.Linear(20, n_features)
        )
    
    def forward(self, x):
        out = self.encoder(x)
        mu = self.encoder2mean(out)           # mean of q(z|x)
        log_var = self.encoder2logvar(out)    # log-variance of q(z|x)
        z = self.reparameterize(mu, log_var)  # sample z = mu + eps * std
        out = self.decoder(z)
        return out, mu, log_var
        
    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5*log_var)
        eps = torch.randn_like(std)
        sample = mu + (eps * std)
        return sample
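
The notebook does not prescribe a loss for `LinearVAE`. A common choice, sketched below as an assumption rather than the required solution, is a reconstruction term plus the closed-form KL divergence between the encoder's Gaussian and a standard normal prior. The function name `vae_loss` and the `beta` weight are hypothetical; the inputs match what `LinearVAE.forward` returns (`out, mu, log_var`).

```python
import torch
import torch.nn.functional as F

def vae_loss(recon, x, mu, log_var, beta=1.0):
    """Reconstruction error plus beta * KL term, averaged over the batch."""
    # Squared error summed over features, then averaged over samples
    recon_term = F.mse_loss(recon, x, reduction="none").sum(dim=1).mean()
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ), with log_var = log(sigma^2)
    kl_term = (-0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(dim=1)).mean()
    return recon_term + beta * kl_term

# Sanity check on made-up tensors: perfect reconstruction and a posterior
# equal to the prior (mu = 0, log_var = 0) give zero loss
x = torch.ones(4, 3)
mu = torch.zeros(4, 2)
log_var = torch.zeros(4, 2)
loss = vae_loss(x, x, mu, log_var)
print(loss.item())
```

Whether to sum or average the reconstruction term over features, and how to set `beta`, are tuning decisions; they change the balance between the two terms.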

In [None]:
def evaluate(...):
    # implement an evaluation function over `val_dataloader`, to be used inside the training function
    ...

def train(...):
    # implement a training function over `train_dataloader`
    ...

def calculate_f1_score(...):
    # implement a metric function that calculates the macro F1 score over `test_all_dataloader`
    # using a predefined threshold value:
    # if a sample's overall loss > threshold, it is an anomaly; otherwise it is normal
    ...
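
The thresholding rule described in the comments above can be sketched as follows. This is a minimal illustration on made-up per-sample reconstruction errors, not the expected solution: the helper `predict_from_errors` and the "maximum error on normal validation samples" heuristic are assumptions. In practice the threshold would more likely be tuned on `kdd_valid_probe`, for example with a percentile.

```python
import numpy as np

def predict_from_errors(errors, threshold):
    """Label a sample 1 (anomaly) when its reconstruction error exceeds the threshold."""
    return (np.asarray(errors) > threshold).astype(int)

# Made-up per-sample reconstruction errors, for illustration only
val_normal_errors = np.array([0.10, 0.12, 0.09, 0.15, 0.11])
test_errors = np.array([0.08, 0.13, 0.90, 0.14, 1.20])

# Assumed heuristic: the largest error observed on normal validation samples
threshold = val_normal_errors.max()
y_pred = predict_from_errors(test_errors, threshold)
print(threshold, y_pred)
```

The resulting `y_pred` can be passed to `f1_score(y_true, y_pred, average="macro")` together with the true `Type_Groups` labels.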

# Vanilla AE

In [None]:
class VanillaAE(nn.Module):
    # implement vanilla autoencoder in PyTorch
    ...
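
As a starting point, a plain autoencoder could mirror the layer sizes of `LinearVAE` above without the mean and log-variance heads. The sketch below is one possible shape, not the reference solution. The class name `VanillaAESketch`, the default `latent_dim=8`, and `n_features=38` are placeholders; use the actual number of feature columns in `train_normal`.

```python
import torch
import torch.nn as nn

class VanillaAESketch(nn.Module):
    """Plain autoencoder: compress to a bottleneck, then reconstruct the input."""
    def __init__(self, n_features, latent_dim=8, hidden_dim=20):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Shape check on random data; 38 features is a placeholder value
model = VanillaAESketch(n_features=38)
batch = torch.randn(128, 38)
recon = model(batch)
print(recon.shape)
```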

In [None]:
def evaluate(...):
    # implement an evaluation function over `val_dataloader`, to be used inside the training function
    ...

def train(...):
    # implement a training function over `train_dataloader`
    ...

def calculate_f1_score(...):
    # implement a metric function that calculates the macro F1 score over `test_all_dataloader`
    # using a predefined threshold value:
    # if a sample's overall loss > threshold, it is an anomaly; otherwise it is normal
    ...
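
One way to structure the training and evaluation functions is a single epoch routine that only updates weights when an optimizer is passed. The sketch below assumes a plain MSE reconstruction loss and uses synthetic data with a `TensorDataset`; the name `run_epoch`, the tiny model, and the hyperparameters are all illustrative. With the notebook's `TabularDataset`, each batch is a dict, so the input would be read as `batch["sample"]` instead.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def run_epoch(model, loader, optimizer=None):
    """One pass over `loader`; updates weights only when an optimizer is given."""
    training = optimizer is not None
    model.train(training)
    total, n = 0.0, 0
    with torch.set_grad_enabled(training):
        for (x,) in loader:
            loss = nn.functional.mse_loss(model(x), x)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # Accumulate the sample-weighted loss to get an epoch average
            total += loss.item() * x.size(0)
            n += x.size(0)
    return total / n

torch.manual_seed(0)
data = torch.randn(256, 10)  # synthetic stand-in for the normal samples
loader = DataLoader(TensorDataset(data), batch_size=64, shuffle=True)
model = nn.Sequential(nn.Linear(10, 4), nn.Tanh(), nn.Linear(4, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

before = run_epoch(model, loader)        # evaluation only, no weight updates
for _ in range(20):
    run_epoch(model, loader, optimizer)  # training epochs
after = run_epoch(model, loader)
print(before, after)
```

Calling the same routine on `val_dataloader` without an optimizer after each training epoch gives a validation loss for hyperparameter tuning or early stopping.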

# DAE

In [None]:
class DenoisingAE(nn.Module):
    # implement denoising autoencoder in PyTorch
    ...
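
What distinguishes a denoising autoencoder from the vanilla one is the training objective: the input is corrupted, and the network is asked to reconstruct the clean version. The sketch below shows that step with additive Gaussian noise. The helper `add_gaussian_noise`, the value `std=0.1`, and the tiny model are assumptions for illustration; a suitable noise scale depends on how the features are scaled.

```python
import torch
import torch.nn as nn

def add_gaussian_noise(x, std=0.1):
    """Return a corrupted copy of x; the clean x stays the reconstruction target."""
    return x + std * torch.randn_like(x)

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 4), nn.Tanh(), nn.Linear(4, 10))
clean = torch.randn(32, 10)  # synthetic stand-in for a batch of normal samples
noisy = add_gaussian_noise(clean, std=0.1)

# Denoising objective: feed the noisy input, compare against the clean input
loss = nn.functional.mse_loss(model(noisy), clean)
print(noisy.shape, loss.item())
```

The noise is typically applied only during training; at evaluation time, one option is to score samples by the reconstruction error on the uncorrupted input.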

In [None]:
def evaluate(...):
    # implement an evaluation function over `val_dataloader`, to be used inside the training function
    ...

def train(...):
    # implement a training function over `train_dataloader`
    ...

def calculate_f1_score(...):
    # implement a metric function that calculates the macro F1 score over `test_all_dataloader`
    # using a predefined threshold value:
    # if a sample's overall loss > threshold, it is an anomaly; otherwise it is normal
    ...