Intelligent Web Application Firewall
====================================
This project is based on data from the ECML/PKDD 2007 Challenge and CSIC 2010 Dataset, [available on GitHub](https://github.com/msudol/Web-Application-Attack-Datasets/blob/master/CSVData/csic_ecml_normalized_final.csv).

# Configure
First, set up the experiment. To make setup faster, the dataset is downloaded automatically if it does not already exist on our system. For now, we'll download it from [msudol/Web-Application-Attack-Datasets](https://github.com/msudol/Web-Application-Attack-Datasets) on GitHub.

In [82]:
DATASET_URL = 'https://raw.githubusercontent.com/msudol/Web-Application-Attack-Datasets/master/CSVData/csic_ecml_normalized_final.csv'

## Normalized Data Caching
We're going to cache the pre-processed data to avoid processing it repeatedly. Set `NORMALIZATION_VERSION` below to use a particular version of the normalized dataset. If it is set to `None`, we'll assume that pre-processing is still a work in progress and target the next unused version number, so pre-processing runs again (you should probably pin a version eventually). Note that a value of `None` is automatically overwritten with that version number once the caching cell runs, so you'll need to re-run these blocks, starting from this one, if you want the data to be pre-processed again!

Note that even if there's a cached version available, we'll always keep a version of the original data (kinda just for the sake of it).

Also, if you *really* hate caching, just turn off saving...

In [83]:
NORMALIZATION_VERSION = None
SAVE_NORMALIZED = False

Now to get things done! This code block will automatically download the dataset if needed and set up our experiment.

In [84]:
import pathlib
import urllib.request

DATASET_DIRECTORY_PATH = pathlib.Path('dataset')
DATASET_NORMALIZED_DIRECTORY_PATH = DATASET_DIRECTORY_PATH.joinpath('normalized')
DATASET_RAW_PATH = DATASET_DIRECTORY_PATH.joinpath('dataset.csv')


if DATASET_DIRECTORY_PATH.is_dir():
    print('[✅] Dataset Directory Exists')
else:
    print('[~] Creating Dataset Directory')
    DATASET_DIRECTORY_PATH.mkdir(parents=True, exist_ok=True)

if DATASET_NORMALIZED_DIRECTORY_PATH.is_dir():
    print('[✅] Normalized Dataset Directory Exists')
else:
    print('[~] Creating Normalized Dataset Directory')
    DATASET_NORMALIZED_DIRECTORY_PATH.mkdir(parents=True, exist_ok=True)

if DATASET_RAW_PATH.is_file():
    print('[✅] Dataset Exists')
else:
    print('[~] Downloading Dataset')
    urllib.request.urlretrieve(DATASET_URL, DATASET_RAW_PATH)

    if DATASET_RAW_PATH.is_file():
        print(f'[✅] Dataset Available at {DATASET_RAW_PATH}')
    else:
        print('[⚠️] Failed to Download Dataset')
        raise SystemExit('Dataset Download Failure')

[✅] Dataset Directory Exists
[✅] Normalized Dataset Directory Exists
[✅] Dataset Exists


Now, let's figure out our caching situation.

In [85]:
effective_cache_path: pathlib.Path = None
if NORMALIZATION_VERSION is None:
    # Figure out the next normalization version...
    i = 0
    while True:
        proposed_path = DATASET_NORMALIZED_DIRECTORY_PATH.joinpath(f'{i}.csv')
        if not proposed_path.exists():
            effective_cache_path = proposed_path
            NORMALIZATION_VERSION = i
            break

        i += 1
else:
    effective_cache_path = DATASET_NORMALIZED_DIRECTORY_PATH.joinpath(f'{int(NORMALIZATION_VERSION)}.csv')
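
To make the version-selection logic above easier to try in isolation, here is a minimal sketch that wraps it in a function. The name `resolve_cache_path` and its parameters are hypothetical (they are not part of this notebook). It is meant to mirror the cell above: `None` picks the first unused `<i>.csv`, and any other value is used as given.

```python
import pathlib
import tempfile
from typing import Optional, Tuple


def resolve_cache_path(directory: pathlib.Path, version: Optional[int]) -> Tuple[pathlib.Path, int]:
    """Return (cache path, effective version) for a normalized dataset.

    Hypothetical helper for illustration only.
    """
    if version is not None:
        version = int(version)
        return directory.joinpath(f'{version}.csv'), version

    # No version pinned: walk 0, 1, 2, ... until an unused file name is found.
    i = 0
    while directory.joinpath(f'{i}.csv').exists():
        i += 1
    return directory.joinpath(f'{i}.csv'), i


# Demonstration in a throwaway directory that already holds versions 0 and 1.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = pathlib.Path(tmp)
    tmp_dir.joinpath('0.csv').touch()
    tmp_dir.joinpath('1.csv').touch()

    auto_path, auto_version = resolve_cache_path(tmp_dir, None)
    pinned_path, pinned_version = resolve_cache_path(tmp_dir, 0)
    print(auto_path.name, auto_version)      # 2.csv 2
    print(pinned_path.name, pinned_version)  # 0.csv 0
```

Returning the effective version alongside the path keeps the caller from having to mutate a global, which may make re-running cells less surprising.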

Caching path has been figured out! Time to actually normalize...

In [86]:
CLASS_ENUMERATION = {'Valid': 0, 'Anomalous': 1}

# Note that we're defining "anything else" as the maximum + 1.
METHOD_ENUMERATION = {'GET': 0, 'POST': 1, 'PUT': 2}
HTTP_VERSION_ENUMERATION = {'HTTP/1.0': 0, 'HTTP/1.1': 1}
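
As a quick illustration of the "maximum + 1" rule, the sketch below encodes a few sample methods with `Series.map` and fills anything unmapped with the fallback value. The sample values (including `DELETE`) are made up for the example and do not come from the dataset. The `.astype(int)` at the end is an optional extra, since `fillna` on its own leaves the column as floats.

```python
import pandas as pd

# Same mapping as in the notebook.
METHOD_ENUMERATION = {'GET': 0, 'POST': 1, 'PUT': 2}

# Hypothetical sample values; 'DELETE' is not in the mapping.
methods = pd.Series(['GET', 'POST', 'DELETE', 'PUT'])

# "Anything else" becomes the largest known code plus one.
fallback = max(METHOD_ENUMERATION.values()) + 1

# map() yields NaN for unknown keys; fillna() replaces those with the fallback.
encoded = methods.map(METHOD_ENUMERATION).fillna(fallback).astype(int)

print(encoded.tolist())  # [0, 1, 3, 2]
```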

In [87]:
import pandas as pd

def preprocess(path):
    df = pd.read_csv(path)
    
    # Our data contains two classes, "Anomalous" and "Valid".
    df['Class'] = df['Class'].map(CLASS_ENUMERATION)

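    # Enumerate the request methods; anything unmapped becomes maximum + 1.
    df['Method'] = df['Method'].map(METHOD_ENUMERATION).fillna(max(METHOD_ENUMERATION.values()) + 1)

    # Enumerate the HTTP version the same way.
    # NOTE: the version is read from the 'Host-Header' column, which assumes
    # that column holds values such as 'HTTP/1.1' in this dataset.
    df['HTTP-Version'] = df['Host-Header'].map(HTTP_VERSION_ENUMERATION).fillna(max(HTTP_VERSION_ENUMERATION.values()) + 1)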

    return df

In [88]:
data: pd.DataFrame = None
if effective_cache_path.is_file():
    data = pd.read_csv(effective_cache_path)
else:
    # We're going to need to do some pre-processing!
    data = preprocess(DATASET_RAW_PATH)

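    if SAVE_NORMALIZED:
        # index=False keeps the row index out of the file, so reading the
        # cache back does not add an extra 'Unnamed: 0' column.
        data.to_csv(effective_cache_path, index=False)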
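
A detail worth knowing when caching a DataFrame as CSV: `DataFrame.to_csv` writes the row index by default, and `pd.read_csv` then reads it back as an extra column. The sketch below shows the difference on a tiny made-up frame, using in-memory buffers instead of real cache files.

```python
import io

import pandas as pd

# Tiny made-up frame standing in for the normalized data.
frame = pd.DataFrame({'Class': [0, 1], 'Method': [0, 1]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the real columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))     # ['Unnamed: 0', 'Class', 'Method']
print(list(reloaded_without_index.columns))  # ['Class', 'Method']
```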