Intelligent Web Application Firewall
====================================
This project is based on data from the ECML/PKDD 2007 Challenge and CSIC 2010 Dataset, [available on GitHub](https://github.com/msudol/Web-Application-Attack-Datasets/blob/master/CSVData/csic_ecml_normalized_final.csv).

# Install Dependencies
* [user-agents](https://pypi.org/project/user-agents/) is required to parse user agents.

```bash
pip3 install pyyaml ua-parser user-agents
```

# Configure
First, set up the experiment. To make setup faster, we download the dataset if it does not already exist on our system. For now, we download it from [msudol/Web-Application-Attack-Datasets](https://github.com/msudol/Web-Application-Attack-Datasets) on GitHub.

In [234]:
DATASET_URL = 'https://raw.githubusercontent.com/msudol/Web-Application-Attack-Datasets/master/CSVData/csic_ecml_normalized_final.csv'

Enabling `DEBUG` prints a lot of messages!

In [235]:
DEBUG = True

## Normalized Data Caching
We cache the pre-processed data to avoid processing it repeatedly. Set `NORMALIZATION_VERSION` below to use a particular version of the normalized dataset. If it is `None`, we assume pre-processing is still a work in progress and redo it on every run (you should probably change this eventually). Note that a value of `None` is automatically overridden with the next unused version number, so you'll need to run these blocks again if you want the data to be pre-processed again!

Note that even if there's a cached version available, we'll always keep a version of the original data (kinda just for the sake of it).

Also, if you *really* hate caching, just turn off saving...

In [236]:
NORMALIZATION_VERSION = None
SAVE_NORMALIZED = True

Now to get things done! This code block will automatically download the dataset if needed and set up our experiment.

In [237]:
import pathlib
import urllib.request

DATASET_DIRECTORY_PATH = pathlib.Path('dataset')
DATASET_NORMALIZED_DIRECTORY_PATH = DATASET_DIRECTORY_PATH.joinpath('normalized')
DATASET_RAW_PATH = DATASET_DIRECTORY_PATH.joinpath('dataset.csv')


if DATASET_DIRECTORY_PATH.is_dir():
    print('[✅] Dataset Directory Exists')
else:
    print('[~] Creating Dataset Directory')
    DATASET_DIRECTORY_PATH.mkdir(parents=True, exist_ok=True)

if DATASET_NORMALIZED_DIRECTORY_PATH.is_dir():
    print('[✅] Normalized Dataset Directory Exists')
else:
    print('[~] Creating Normalized Dataset Directory')
    DATASET_NORMALIZED_DIRECTORY_PATH.mkdir(parents=True, exist_ok=True)

if DATASET_RAW_PATH.is_file():
    print('[✅] Dataset Exists')
else:
    print('[~] Downloading Dataset')
    urllib.request.urlretrieve(DATASET_URL, DATASET_RAW_PATH)

    if DATASET_RAW_PATH.is_file():
        print(f'[✅] Dataset Available at {DATASET_RAW_PATH}')
    else:
        print('[⚠️] Failed to Download Dataset')
        raise SystemExit('Dataset Download Failure')

[✅] Dataset Directory Exists
[✅] Normalized Dataset Directory Exists
[✅] Dataset Exists
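
One thing the download cell does not guard against is an interrupted transfer: `urlretrieve` writes straight to the target path, so a truncated `dataset.csv` may be left behind and would then pass the `is_file()` check on the next run. A minimal sketch of one way around that, assuming a hypothetical helper named `fetch_once` (it is not part of this notebook) that downloads to a temporary name and renames only on success:

```python
import pathlib
import urllib.request


def fetch_once(url: str, destination: pathlib.Path) -> bool:
    """Download url to destination unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    Hypothetical helper for illustration; not part of the notebook.
    """
    if destination.is_file():
        return False

    destination.parent.mkdir(parents=True, exist_ok=True)
    partial = destination.with_name(destination.name + '.part')
    try:
        urllib.request.urlretrieve(url, partial)
        # Rename only after the download has finished without raising.
        partial.replace(destination)
    finally:
        # Remove the temporary file if the download failed midway.
        partial.unlink(missing_ok=True)
    return True
```

With this in place, the download branch could be reduced to something like `fetch_once(DATASET_URL, DATASET_RAW_PATH)`. Note that `missing_ok` needs Python 3.8 or newer.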


A quick helper function to help us out!

In [238]:
def debug(message):
    if DEBUG:
        print(message)

Now, let's figure out our caching situation.

In [239]:
effective_cache_path: pathlib.Path = None
if NORMALIZATION_VERSION is None:
    # Figure out the next normalization version...
    i = 0
    while True:
        proposed_path = DATASET_NORMALIZED_DIRECTORY_PATH.joinpath(f'{i}.csv')
        if not proposed_path.exists():
            effective_cache_path = proposed_path
            NORMALIZATION_VERSION = i
            break

        i += 1
else:
    effective_cache_path = DATASET_NORMALIZED_DIRECTORY_PATH.joinpath(f'{int(NORMALIZATION_VERSION)}.csv')
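
To make the version search easier to reason about, here is the same idea as a small standalone function. This is only a sketch: `next_cache_path` is a hypothetical name, and it takes the directory as a parameter instead of reading the notebook's globals.

```python
import pathlib
from typing import Tuple


def next_cache_path(directory: pathlib.Path) -> Tuple[int, pathlib.Path]:
    """Return the first unused version number and its CSV path.

    Hypothetical helper mirroring the loop above; it picks the lowest
    number with no existing file, not "highest existing version + 1".
    """
    version = 0
    while (directory / f'{version}.csv').exists():
        version += 1
    return version, directory / f'{version}.csv'
```

A consequence worth keeping in mind: with `NORMALIZATION_VERSION = None`, the chosen path never exists yet, so the data is pre-processed again on every fresh run, and with `SAVE_NORMALIZED = True` each of those runs leaves a new numbered file in `dataset/normalized`.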

Caching path has been figured out! Time to actually normalize...

In [240]:
ENUMERATIONS = {
    'Class': {'Valid': 0, 'Anomalous': 1},
    'Method': {'GET': 0, 'POST': 1, 'PUT': 2},
    'Host-Header': {'HTTP/1.0': 0, 'HTTP/1.1': 1},
    'Connection': {'keep-alive': 0, 'close': 1, 'invalid': 2, None: 3},
    'Pragma': {'no-cache': 0, 'invalid': 1, None: 2},
    'Content-Type': {'application/x-www-form-urlencoded': 0, None: 1}
}

LENGTH_FIELDS = {
    'Accept': 'Accept-Length',
    'Accept-Charset': 'Accept-Charset-Length',
    'Accept-Language': 'Accept-Language-Length',
    'User-Agent': 'User-Agent-Length',
    'Content-Type': 'Content-Type-Length',
    'POST-Data': 'POST-Data-Length',
    'GET-Query': 'GET-Query-Length'
}
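
To show how these two tables are meant to be read, here is a plain-Python sketch that applies trimmed copies of them to a single record. The names `EXAMPLE_ENUMERATIONS`, `EXAMPLE_LENGTH_FIELDS`, `is_missing`, and `encode_record` are made up for this illustration; the notebook itself does this work column-wise with pandas in the next cell.

```python
import math

# Trimmed copies of the tables above, renamed so they don't clash with them.
EXAMPLE_ENUMERATIONS = {
    'Method': {'GET': 0, 'POST': 1, 'PUT': 2},
    'Connection': {'keep-alive': 0, 'close': 1, 'invalid': 2, None: 3},
}
EXAMPLE_LENGTH_FIELDS = {'User-Agent': 'User-Agent-Length'}


def is_missing(value) -> bool:
    """Treat None and float NaN (how pandas reports empty cells) as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def encode_record(record: dict) -> dict:
    """Return a copy of record with length features added and enums encoded."""
    encoded = dict(record)

    # Length features: string length, or 0 for anything that is not a string.
    for field, target in EXAMPLE_LENGTH_FIELDS.items():
        value = record.get(field)
        encoded[target] = len(value) if isinstance(value, str) else 0

    # Enumerations: the None key, when present, is the code for missing values.
    for field, codes in EXAMPLE_ENUMERATIONS.items():
        value = record.get(field)
        if is_missing(value):
            if None not in codes:
                raise ValueError(f'Field {field} has no code for missing values')
            encoded[field] = codes[None]
        elif value in codes:
            encoded[field] = codes[value]
        else:
            raise ValueError(f'Failed to enumerate value "{value}" for field {field}')

    return encoded


example = encode_record({
    'Method': 'POST',
    'Connection': float('nan'),
    'User-Agent': 'Mozilla/5.0',
})
print(example['Method'], example['Connection'], example['User-Agent-Length'])
```

A field without a `None` key (such as `Method`) rejects missing values outright, which is the same distinction the pandas version draws.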

In [241]:
import pandas as pd
import math
from urllib.parse import parse_qs

# NOTE: this name shadows Python's built-in enumerate() for the rest of the notebook.
def enumerate(df):
    for field, enumeration in ENUMERATIONS.items():
        debug(f'[~] Processing Field: {field}')
        if field not in df:
            raise RuntimeWarning(f'Field {field} Does Not Exist')
        
        unenumerated_values = set(df[field].unique()).difference(set(enumeration.keys()))
        for unenumerated_value in unenumerated_values:
            # pd.isna() accepts any type; math.isnan() raises TypeError on strings,
            # which would mask the intended error for an unmapped string value.
            if pd.isna(unenumerated_value) and None in enumeration:
                continue
            
            raise RuntimeWarning(f'Failed to Enumerate Value "{unenumerated_value}" for Field {field}')
        
        if None in enumeration:
            df[field] = df[field].map(enumeration).fillna(enumeration[None]).astype(int)
        else:
            df[field] = df[field].map(enumeration).astype(int)
        
    return df

def length_append(df):
    for field, target in LENGTH_FIELDS.items():
        debug(f'[~] Processing Field: {field}')
        if field not in df:
            raise RuntimeWarning(f'Field {field} Does Not Exist')
        
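        # Caveat: preprocess() runs enumerate() first, so Content-Type is already an
        # integer code here and Content-Type-Length comes out as 0 for every row.
        df[target] = df[field].map(lambda v: len(v) if isinstance(v, str) else 0)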

    return df

def __query_param_count(df):
    df['GET-Query-Params'] = df['GET-Query'].map(lambda q: len(parse_qs(q).keys()) if type(q) is str else 0)

def __query_characters(df):
    CHARACTERS = [chr(o) for o in range(32, 127)]
    for letter in CHARACTERS:
        df[f'Letter-Frequency-{letter}'] = df['GET-Query'].apply(lambda q: q.count(letter) if type(q) is str else 0)

def parse_query(df):
    __query_param_count(df)
    __query_characters(df)

    return df


def preprocess(path):
    df = parse_query(length_append(enumerate(pd.read_csv(path))))

    return df
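
The two query features are easier to sanity-check on a single string. The sketch below recomputes them with the standard library only; `query_features` and its `keep_blank_values` parameter are hypothetical additions for illustration. It also surfaces a detail of `parse_qs`: by default it drops parameters with empty values, so `GET-Query-Params` counts distinct parameter names that carry a non-empty value.

```python
from urllib.parse import parse_qs

# Printable ASCII, the same range(32, 127) used by __query_characters.
PRINTABLE = [chr(code) for code in range(32, 127)]


def query_features(query, keep_blank_values=False):
    """Compute the parameter count and per-character counts for one query.

    Non-string input (for example NaN from an empty CSV cell) yields zeros,
    matching the `type(q) is str` guards in the notebook.
    """
    is_text = isinstance(query, str)
    params = parse_qs(query, keep_blank_values=keep_blank_values) if is_text else {}

    features = {'GET-Query-Params': len(params)}
    for character in PRINTABLE:
        features[f'Letter-Frequency-{character}'] = query.count(character) if is_text else 0
    return features


default = query_features('id=1&name=&id=2')
with_blanks = query_features('id=1&name=&id=2', keep_blank_values=True)
print(default['GET-Query-Params'])        # 1: only "id" has a non-empty value
print(with_blanks['GET-Query-Params'])    # 2: "id" and "name"
print(default['Letter-Frequency-='])      # 3
```

Whether blank parameters should count is a modelling choice; requests that pad a query with empty parameters look the same as ordinary ones under the default behaviour.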

In [242]:
data: pd.DataFrame = None
if effective_cache_path.is_file():
    data = pd.read_csv(effective_cache_path)
else:
    # We're going to need to do some pre-processing!
    data = preprocess(DATASET_RAW_PATH)

    if SAVE_NORMALIZED:
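        # index=False keeps the row index out of the file; otherwise reading the
        # cache back would add an extra "Unnamed: 0" column.
        data.to_csv(effective_cache_path, index=False)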

[~] Processing Field: Class
[~] Processing Field: Method
[~] Processing Field: Host-Header
[~] Processing Field: Connection
[~] Processing Field: Pragma
[~] Processing Field: Content-Type
[~] Processing Field: Accept
[~] Processing Field: Accept-Charset
[~] Processing Field: Accept-Language
[~] Processing Field: User-Agent
[~] Processing Field: Content-Type
[~] Processing Field: POST-Data
[~] Processing Field: GET-Query



Preview data!

In [243]:
if DEBUG:
    debug('Sample Data:')
    debug(data.iloc[0])

Sample Data:
Class                                            1
Method                                           0
Host-Header                                      0
Connection                                       2
Accept                audio/*;q=0.7, audio/*;q=0.0
                                  ...             
Letter-Frequency-z                               3
Letter-Frequency-{                               0
Letter-Frequency-|                               0
Letter-Frequency-}                               0
Letter-Frequency-~                               0
Name: 0, Length: 116, dtype: object


Let's check if we can enumerate even more.

In [244]:
import numpy as np

ENUM_THRESHOLD = 64
enumerable = 0
for field in data:
    values = data[field].unique()
    if len(values) < ENUM_THRESHOLD and len({value for value in values if type(value) != np.int64}) > 0:
        print(f'{field} May Be Enumerated with {len(values)} Unique Values')
        values_as_str = ', '.join({str(value) for value in values})
        print(f'\t{values_as_str}')
        enumerable += 1

print()
if enumerable > 0:
    print(f'[⚠️] Enumeration May be Incomplete: {enumerable} May Be Enumerated')
else:
    print(f'[✅] Sufficiently Enumerated with Threshold {ENUM_THRESHOLD}')

Accept-Language May Be Enumerated with 14 Unique Values
	*;q=0.2, nan, *;q=0.1, *;q=0.9, non-standard, *;q=0.4, *;q=0.6, en, *;q=0.8, *;q=0.0, *;q=0.7, *, *;q=0.5, *;q=0.3

[⚠️] Enumeration May be Incomplete: 1 May Be Enumerated
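
If `Accept-Language` were to be enumerated as the check suggests, one option is to build the mapping from the 14 values listed in the output above. This is a sketch under assumptions: `ACCEPT_LANGUAGE_ENUM` is a hypothetical name, the code order is arbitrary, and `nan` is represented by the `None` key to match the convention in `ENUMERATIONS`. Parsing the `q` value into a number instead would be an equally reasonable choice.

```python
# The 13 non-missing values reported by the check: '*', 'en', 'non-standard',
# and '*;q=0.0' through '*;q=0.9'.
ACCEPT_LANGUAGE_VALUES = ['*', 'en', 'non-standard'] + [f'*;q=0.{digit}' for digit in range(10)]

# zip/range is used instead of the built-in enumerate() because the notebook
# redefines that name as its own pre-processing function.
ACCEPT_LANGUAGE_ENUM = dict(zip(ACCEPT_LANGUAGE_VALUES, range(len(ACCEPT_LANGUAGE_VALUES))))

# Missing values get the last code, as in the other enumerations.
ACCEPT_LANGUAGE_ENUM[None] = len(ACCEPT_LANGUAGE_VALUES)

print(len(ACCEPT_LANGUAGE_ENUM))
```

Adding the result under an `'Accept-Language'` key in `ENUMERATIONS` would also turn that column into integers before `length_append` runs, so `Accept-Language-Length` would then be 0 for every row unless the order of the pre-processing steps is changed as well.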
