Intelligent Web Application Firewall
====================================

# Initial Setup

## Download Data
If the dataset already exists, this cell does nothing. Otherwise, it creates the dataset directory and downloads the dataset from the [Systems and Networking Group at UC San Diego](https://www.sysnet.ucsd.edu/projects/url/).

In [1]:
import pathlib
import urllib.request

DATASET_URL = 'https://www.sysnet.ucsd.edu/projects/url/url.mat'

DATASET_DIRECTORY_PATH = pathlib.Path('dataset')
DATASET_NORMALIZED_PATH = DATASET_DIRECTORY_PATH.joinpath('url.npz')
FEATURE_TYPES_PATH = DATASET_DIRECTORY_PATH.joinpath('feature-types.npz')
DATASET_RAW_PATH = DATASET_DIRECTORY_PATH.joinpath('url.mat')

if DATASET_DIRECTORY_PATH.is_dir():
    print('[✅] Dataset Directory Exists')
else:
    print('[~] Creating Dataset Directory')
    DATASET_DIRECTORY_PATH.mkdir(parents=True, exist_ok=True)

if DATASET_NORMALIZED_PATH.is_file():
    print('[✅] Dataset Normalized Directory Exists')
elif DATASET_RAW_PATH.is_file():
    print('[✅] Dataset Raw File Exists')
else:
    print('[~] Downloading Dataset File')
    urllib.request.urlretrieve(DATASET_URL, DATASET_RAW_PATH)

[✅] Dataset Directory Exists
[✅] Dataset Normalized Directory Exists
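
The "do nothing if it already exists" behaviour described above can also be packaged as a small helper. The following is only a sketch under stated assumptions: `ensure_downloaded` and its `opener` parameter are hypothetical names, not part of the notebook, and the temporary-file step is an addition meant to avoid leaving a truncated file behind if the download is interrupted.

```python
import pathlib
import shutil
import tempfile
import urllib.request


def ensure_downloaded(url, destination, opener=urllib.request.urlopen):
    """Download url to destination unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    `opener` is a hypothetical hook so the function can be exercised offline.
    """
    destination = pathlib.Path(destination)
    if destination.is_file():
        return False

    destination.parent.mkdir(parents=True, exist_ok=True)

    # Write to a temporary file in the same directory, then rename it, so an
    # interrupted download never leaves a partial file at the final path.
    with tempfile.NamedTemporaryFile(dir=destination.parent, delete=False) as tmp:
        tmp_path = pathlib.Path(tmp.name)
        try:
            with opener(url) as response:
                shutil.copyfileobj(response, tmp)
        except BaseException:
            tmp.close()
            tmp_path.unlink(missing_ok=True)
            raise
    tmp_path.replace(destination)
    return True
```

With the constants from the cell above, a call would look like `ensure_downloaded(DATASET_URL, DATASET_RAW_PATH)`; calling it a second time returns `False` without touching the network.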


## Load Data
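Since the data is stored as a MATLAB save file, we load it with SciPy. To make later runs faster, we also cache the extracted feature matrix, labels, and feature types as NumPy archives, so the MATLAB file only has to be parsed once.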

In [33]:
from scipy.io import loadmat
from scipy.sparse import vstack
import numpy as np

if not DATASET_NORMALIZED_PATH.is_file() or not FEATURE_TYPES_PATH.is_file():
    print('[~] Loading Data from MATLAB')
    dataset = loadmat(DATASET_RAW_PATH)

if DATASET_NORMALIZED_PATH.is_file():
    print('[✅] Using Saved Normalized Day Data')
    data = np.load(DATASET_NORMALIZED_PATH, allow_pickle=True)
    X, y = data['X'], data['y']
else:
    print('[~] Extracting and Saving Day Data')
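    # Each 'Day...' entry is a MATLAB struct; [0][0] unwraps the 1x1 struct array returned by loadmat.
    day_entries = [v for k, v in dataset.items() if k.startswith('Day')]

    # Stack the per-day sparse feature matrices row-wise.
    # Note: np.matrix() around a SciPy sparse matrix yields a 1x1 object matrix
    # holding it, which is why np.load above needs allow_pickle=True.
    X = np.matrix(vstack([day['data'][0][0] for day in day_entries]))

    # Flatten the per-day label arrays into a single 1-D array.
    y = np.array([c[0] for day in day_entries for c in day['labels'][0][0]])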
    np.savez(DATASET_NORMALIZED_PATH, X=X, y=y)


if FEATURE_TYPES_PATH.is_file():
    print('[✅] Using Saved Feature Types Data')
    feature_types = np.load(FEATURE_TYPES_PATH)['data']
else:
    print('[~] Creating Feature Types Save')

    feature_types = dataset.get('FeatureTypes')
    np.savez(FEATURE_TYPES_PATH, data=feature_types)

[✅] Using Saved Normalized Day Data
[✅] Using Saved Feature Types Data
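
As a possible alternative to caching the feature matrix through `np.savez` with pickling, SciPy can store a sparse matrix directly with `save_npz` and `load_npz`. The sketch below is not what the notebook does; it uses two tiny made-up matrices in place of the real per-day data, and the file name `url-sparse.npz` is hypothetical.

```python
import pathlib
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz, vstack

# Two tiny stand-ins for per-day feature matrices (made-up data).
day0 = csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 0.0, 3.0]]))
day1 = csr_matrix(np.array([[0.0, 4.0, 0.0]]))

# Keep the stacked result sparse; CSR supports efficient row slicing.
X_sparse = vstack([day0, day1]).tocsr()

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = pathlib.Path(tmp_dir) / 'url-sparse.npz'  # hypothetical file name
    save_npz(str(cache_path), X_sparse)
    X_loaded = load_npz(str(cache_path))  # no allow_pickle needed

print(X_loaded.shape)  # (3, 3)
```

The trade-off is that the labels would then need their own file (for example a plain `np.save`), since `save_npz` stores exactly one sparse matrix.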


## Initial Data Analysis

In [43]:
print('Samples:', y.shape[0])

Samples: 2396130


(1, 1)
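
If the `(1, 1)` shown above is `X.shape`, it would be consistent with how `X` is built: wrapping a SciPy sparse matrix in `np.matrix` does not densify it, but produces a 1x1 object matrix whose single element is the sparse matrix. The following sketch illustrates this with made-up data (the variable names are illustrative only) and shows how the real shape can be recovered.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

# Made-up stand-ins for two days of sparse feature rows.
day0 = csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 0.0, 3.0]]))
day1 = csr_matrix(np.array([[0.0, 4.0, 0.0]]))

stacked = vstack([day0, day1])

# np.matrix treats the sparse matrix as one opaque Python object.
wrapped = np.matrix(stacked)
print(wrapped.shape)  # (1, 1)
print(wrapped.dtype)  # object

# The actual data sits in the single cell of the wrapper.
inner = wrapped[0, 0]
print(inner.shape)  # (3, 3)
```

In the notebook's terms, `X[0, 0].shape` would then report the number of samples and features, and the first dimension should agree with `y.shape[0]`.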