# This notebook lays out a step-by-step process for implementing the proposed [HML-IDS](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8751972) anomaly-detection method for intrusion detection systems (IDS).

## Preprocessing and Feature Engineering

### Balancing data via SMOTE

- According to Figure 1 (the overview of the proposed approach) in the [HML-IDS](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8751972) paper, the proposed model was tested on the original imbalanced dataset as well as on a balanced version (balanced using the [SMOTE](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/) technique).

- The notebook "SMOTE-testing.ipynb", located in the root project folder, will be used to test and practice applying the SMOTE technique to the data before the `Feature Engineering` process.

### Feature extraction

- For feature extraction, we will likely rely on scikit-learn's [resources](https://scikit-learn.org/stable/modules/feature_extraction.html) on the topic. A large part of this appears to be `Vectorizing` the data, that is, converting it into numerical arrays that mathematical functions can operate on. The linked resource covers several ways to do this with the sklearn library.
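
One of the vectorizers covered in that resource is `DictVectorizer`. The sketch below shows what vectorizing could look like; the record fields (`protocol`, `service`, `duration`) are hypothetical and are not taken from the paper's dataset.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical connection records; the field names are illustrative only.
records = [
    {"protocol": "tcp", "service": "http", "duration": 12.0},
    {"protocol": "udp", "service": "dns", "duration": 0.5},
    {"protocol": "tcp", "service": "ssh", "duration": 300.0},
]

# String-valued fields become one binary column per (field, value) pair;
# numeric fields are passed through unchanged.
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(records)
feature_names = vectorizer.feature_names_

print(feature_names)
print(X)
```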

### Categorical Labeling

- [This](https://machinelearningknowledge.ai/categorical-data-encoding-with-sklearn-labelencoder-and-onehotencoder/) appears to be a decent resource for understanding categorical labeling and how it can be implemented with the sklearn library. `One Hot Encoding` would likely be the most useful option in our case: each distinct categorical value (e.g. 'Name: Jane Doe' or 'Job: Plumber') is given its own column in a matrix, and the values in those columns are binary, where 1 means the row belongs to that category and 0 means it does not (e.g. 1 for 'Jane Doe' and 0 for 'not Jane Doe'). There is also `Label Encoding`, which instead maps each category to an integer, and may also prove useful for this project.

- Note that `Categorical Labeling` is often classified as a method of `Feature Extraction`. Even so, I have kept the two in separate sections of this document because the [paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8751972) separates them as well.
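
To make the difference between the two encodings concrete, here is a small sketch using sklearn's `OneHotEncoder` and `LabelEncoder`. The protocol values and the "normal"/"attack" labels are made-up examples, not values from the project's dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical feature column (values are illustrative).
protocols = np.array([["tcp"], ["udp"], ["icmp"], ["tcp"]])

# One Hot Encoding: one binary column per distinct category.
# Categories are sorted alphabetically: icmp, tcp, udp.
one_hot = OneHotEncoder(handle_unknown="ignore")
protocol_matrix = one_hot.fit_transform(protocols).toarray()

# Label Encoding: one integer per distinct class (sorted alphabetically),
# which sklearn intends mainly for target labels.
labels = ["normal", "attack", "normal", "attack"]
label_encoder = LabelEncoder()
label_ids = label_encoder.fit_transform(labels)

print(protocol_matrix)
print(label_ids)
```

One reason to prefer One Hot Encoding for input features is that Label Encoding introduces an arbitrary numeric order between categories, which distance-based models may misread as meaningful.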

### Feature scaling

- For `Feature Scaling`, the approach used in this project will be based on the [related web pages](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py) provided by scikit-learn.
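
The linked scikit-learn example uses `StandardScaler`, so a first attempt might look like the sketch below. Whether standardization is the exact scaling the paper intends is an assumption here, and the small array is made-up data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales (e.g. a count and a byte total).
X = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 3000.0],
])

# StandardScaler rescales each column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)
```

When a train/test split is involved, the scaler would be fitted on the training data only and then applied to the test data with `scaler.transform`.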

### Normalization

- As with `Feature Scaling`, a [similar approach](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-normalization) based on scikit-learn's documentation will likely be taken to implement `Normalization`.
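
In the linked scikit-learn section, normalization means rescaling each sample (row) to unit norm, which differs from the per-column scaling above. A minimal sketch with `Normalizer`, assuming the L2 norm and using made-up values:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up samples; each row is rescaled independently of the others.
X = np.array([
    [3.0, 4.0],
    [1.0, 0.0],
    [0.0, 5.0],
])

# With norm="l2", every row is divided by its Euclidean length.
X_normalized = Normalizer(norm="l2").fit_transform(X)

print(X_normalized)
```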

## Feature selection/dimensionality reduction

- The `Feature Selection` process demonstrated in [the paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8751972) appears to be relatively complex: it applies three different methods (`PCA`, `CCA`, and `ICA`) and then uses a technique the authors refer to as `AllKNN` to combine the results of those three methods.

- [This](https://scikit-learn.org/stable/modules/decomposition.html#ica) may also be a fairly useful resource for further analysis and understanding of these feature selection methods (particularly PCA and ICA).

- I would also like to see how the `CCFSRFG` [method](https://journalofbigdata.springeropen.com/articles/10.1186/s40537-020-00381-y) might be applied here, likely as an alternative to the `AllKNN`-based method described above.

### Principal Component Analysis (PCA)

- The implementation of the `PCA` dimensionality reduction technique in this project will likely be derived from [resources provided by scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
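
A minimal sketch of what that could look like with `sklearn.decomposition.PCA`. The random input matrix and the choice of three components are placeholders; the paper's actual number of components is not stated in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for a scaled feature matrix (200 samples, 10 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# Project the data onto its first three principal components.
# n_components=3 is an arbitrary choice for illustration.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
explained = pca.explained_variance_ratio_

print(X_pca.shape)
print(explained)
```

`explained_variance_ratio_` reports the share of variance captured by each component, which can help when deciding how many components to keep.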

### Canonical Correlation Analysis (CCA)

- As for implementing `CCA`, [this page](https://scikit-learn.org/stable/modules/generated/sklearn.cross_decomposition.CCA.html) (also provided by scikit-learn) will likely be a useful resource.

### Independent Component Analysis (ICA)

- For the implementation of `ICA`, the `FastICA` [algorithm](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.FastICA.html) from the sklearn library will likely be of use.
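
A minimal sketch of `FastICA` on synthetic data. The mixed Laplace-distributed sources and the choice of three components are assumptions for illustration, not settings from the paper.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Build a synthetic feature matrix by linearly mixing three
# non-Gaussian (Laplace) sources into eight observed features.
rng = np.random.default_rng(0)
sources = rng.laplace(size=(500, 3))
mixing = rng.normal(size=(3, 8))
X = sources @ mixing

# Estimate three statistically independent components.
ica = FastICA(n_components=3, random_state=0, max_iter=1000)
X_ica = ica.fit_transform(X)

print(X_ica.shape)
```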

### AllKNN

- The `AllKNN` algorithm will be used to re-sample the dataset built from the results of the `PCA`, `CCA`, and `ICA` algorithms.

- The imbalanced-learn library appears to have an implementation of `AllKNN`, so [their documentation](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.AllKNN.html) will likely prove to be a valuable resource.

## Algorithm 1: Feature Extraction and Generating Feature Matrix

### The following is a step-by-step algorithmic description of the implementation of the proposed HML-IDS, as described in the paper:

<ul>1: For i = 1 to O do</ul>
<ul>2:     df <-- read -- dataset : [initialization]</ul>
<ul>3: End for</ul>
<ul>4: For j steps do //standardization</ul>
<ul>5:     Scale values according to <em>Eq. (1)</em></ul>
<ul>6:</ul>

Where `Eq. (1)` refers to equation 1 from the paper.