# Syscall Anomaly Detection

As a fun project, I decided to write a package for anomaly detection on my Linux server logs. A few things I want out of this:

- Exploration of anomaly detection algo (written from scratch)
- Use cpp and pybind11 to create the anomaly detection algo
- Algo can be based on the one taught in the Stanford Machine Learning class
- Set up tests with Travis or something similar
- Use click for a command line interface
- Use setup.py and release on PyPI
- Use PyTorch

_Always build one component from scratch. In this case, my anomaly detection model._

## Syscall Information

- Is there a way to do an online learning algo? I would need to regularly observe the logs. How often is too often to tap into that feed?
- I'll need to try PCA to reduce the dimensionality so that I can visualize this data.
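
On the online learning question, one possible direction is to keep running per-feature statistics and update them as each new log arrives, rather than retraining on the full history. The sketch below uses Welford's algorithm for a running mean and variance. The class name `RunningStats` and its methods are hypothetical names for illustration, and it assumes each log has already been turned into a fixed-length numeric vector.

```python
class RunningStats:
    """Per-feature running mean and variance (Welford's algorithm)."""

    def __init__(self, n_features):
        self.count = 0
        self.mean = [0.0] * n_features
        self.m2 = [0.0] * n_features  # running sum of squared deviations

    def update(self, x):
        """Fold one feature vector into the running estimates."""
        self.count += 1
        for j, value in enumerate(x):
            delta = value - self.mean[j]
            self.mean[j] += delta / self.count
            self.m2[j] += delta * (value - self.mean[j])

    def variance(self):
        """Population variance per feature (divides by the sample count)."""
        if self.count == 0:
            raise ValueError("no samples seen yet")
        return [m2 / self.count for m2 in self.m2]
```

Each `update` call costs one pass over the features, so how often to tap the feed becomes a question of log volume rather than retraining cost.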

## Tokenization

- I'm imagining starting with the techniques from my recent Stanford Machine Learning class. I can also look into libraries that already do this well for ease of implementation.
- Otherwise this would include:
    - lowercase
    - strip symbols (some might be important)
    - normalize features (see ip below)
    - use a word stemmer
    - get rid of white space and unnecessary characters
- See ex6.pdf on my local drive
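
A minimal sketch of the steps listed above, using only the standard library. Everything named here is an assumption for illustration: `MY_IPS` holds a placeholder address, the placeholder tokens `ipmine` and `ipnotmine` stand in for the {mine, not_mine} categories, and `crude_stem` is a rough stand-in for a real word stemmer, not an implementation of one.

```python
import re

IP_PATTERN = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
MY_IPS = {"192.0.2.10"}  # hypothetical: replace with the server's own addresses


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(line):
    """Turn one raw log line into a list of normalized tokens."""
    line = line.lower()
    # Map each IP to one of two placeholder tokens before symbols are stripped.
    line = IP_PATTERN.sub(
        lambda m: " ipmine " if m.group(0) in MY_IPS else " ipnotmine ", line
    )
    # Replace anything that is not a letter, digit, or whitespace with a space.
    line = re.sub(r"[^a-z0-9\s]", " ", line)
    # split() with no arguments also collapses runs of whitespace.
    return [crude_stem(token) for token in line.split()]
```

For example, `tokenize("Accepted from 192.0.2.10")` gives `["accept", "from", "ipmine"]`. Stripping every symbol is the bluntest option; symbols that turn out to matter would need their own placeholder tokens, as the IPs have here.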

## Feature Engineering

- Replace dates with a boolean for time of day potentially
- Replace other highly specific data with a general indication of what it is
    - One example may be to map IPs into two categories {mine, not_mine}
    - Could consider trying Google embeddings on this. Not sure if that would do any good or not
- Once I get all of my words, potentially discard the rarest ones, potentially not
- If not, perhaps assign an integer to each frequently occurring word
- Now, each log is given a 1 or 0 for each index in a vector whose features correspond to the vocabulary list we previously made, where

$n$ = # of features, $x_j\in\{0, 1\}$, each log is a vector $x\in\mathbb{R}^n$,
$$x=\left[
\begin{array}{c}
   0 \\
   1 \\
   0 \\
   \vdots \\
   0 \\
   1 \\
   1 \\
\end{array}
\right]\in\mathbb{R}^n$$
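
The vocabulary and binary vector described above can be sketched as follows. The function names and the `min_count` parameter (for optionally discarding the rarest words) are my own illustrative choices, and the input is assumed to be logs that have already been split into tokens.

```python
from collections import Counter


def build_vocabulary(tokenized_logs, min_count=1):
    """Map each token seen in at least min_count logs to a vector index."""
    counts = Counter()
    for tokens in tokenized_logs:
        counts.update(set(tokens))  # count each token once per log
    kept = sorted(token for token, c in counts.items() if c >= min_count)
    return {token: index for index, token in enumerate(kept)}


def to_binary_vector(tokens, vocabulary):
    """Return x with x_j = 1 if vocabulary word j occurs in the log, else 0."""
    x = [0] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:  # tokens outside the vocabulary are ignored
            x[index] = 1
    return x
```

Sorting the kept tokens makes the index assignment deterministic, so vectors built in different runs line up as long as the vocabulary is the same.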

## Anomaly Detection

- Planning on mocking this up in Python first for ease. Will start with a simplified version of the approach from the Stanford ML class
- One point of exploration will be whether or not the one I have in mind will work well for dummy data
- See ex8.pdf and the associated code

Given an unlabeled dataset which we expect to be largely free of anomalies, $$\{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}$$

We need to estimate the parameters of the Gaussian distribution $N(\mu_j, \sigma_j^2)$ for each feature $j$, where,

$\mu_j = \frac{1}{m}\sum_{i=1}^{m}x_j^{(i)}$

_the mean of feature j is equal to the sum of that feature's values over the training examples, divided by m_

$\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}(x_j^{(i)}-\mu_j)^2$

$p(x_j; \mu_j, \sigma_j^2) = \frac{1}{\sqrt{2\pi\sigma_j^2}}e^{-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}}$

$p(x)=\prod_{j=1}^n {p(x_j;\mu_j,\sigma_j^2)}$
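
A from-scratch Python sketch of the estimates above, in plain Python so it can later be ported to cpp. The function names are illustrative. Two things here are my own additions rather than part of the math: `VARIANCE_FLOOR`, which guards against features that never vary in the training data, and the log-density, which avoids the product underflowing to zero when there are many features.

```python
import math

VARIANCE_FLOOR = 1e-6  # assumption: keeps constant features from dividing by zero


def fit_gaussian(X):
    """Estimate per-feature mean and variance from a list of feature vectors."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [
        max(sum((row[j] - mu[j]) ** 2 for row in X) / m, VARIANCE_FLOOR)
        for j in range(n)
    ]
    return mu, sigma2


def log_density(x, mu, sigma2):
    """log p(x): sum over features of the univariate Gaussian log-density."""
    return sum(
        -0.5 * math.log(2 * math.pi * s2) - (xj - m) ** 2 / (2 * s2)
        for xj, m, s2 in zip(x, mu, sigma2)
    )


def density(x, mu, sigma2):
    """p(x) as the product over features; may underflow for large n."""
    return math.exp(log_density(x, mu, sigma2))
```

A log would then be flagged as anomalous when its density (or log-density) falls below some threshold, which still has to be chosen.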

Normal Anomaly Detection

Assumptions:
- The data forms a single Gaussian cluster (not sure that's the right term, but we would not expect the data to have several distinct groups)
    - This may or may not be the case with this data. I can use PCA to visualize in 2D and determine whether I need to cluster first. If so, I would then simply find the examples that are furthest from the cluster centers (potentially the center of the cluster to which each belongs).
- The features are independent of each other.
    - As I think about this, I don't think this holds at all. Certain words are going to be more likely to appear together.
- Each feature is normally distributed.
    - I can plot a histogram to check, but this won't be the case for 0/1 features. If I plotted the sample mean over n randomly generated logs, I would get a Gaussian, but I'm not sure that helps.
    
Multivariate Anomaly Detection
- http://cs229.stanford.edu/section/gaussians.pdf
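
For reference, the standard multivariate Gaussian density replaces the per-feature variances with a full covariance matrix $\Sigma$, which is what would let the model capture features that tend to occur together. The notation below follows the usual convention rather than anything defined earlier in these notes:

$$p(x;\mu,\Sigma)=\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$

with $\mu=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}$ and $\Sigma=\frac{1}{m}\sum_{i=1}^{m}(x^{(i)}-\mu)(x^{(i)}-\mu)^T$. When $\Sigma$ is diagonal, this reduces to the product of independent per-feature Gaussians.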

After some reading of [non-white papers](https://stats.stackexchange.com/questions/62069/anomaly-detection-with-dummy-features-and-other-discrete-categorical-features), it might be ok to try the Gaussian and just see how it goes! The PCA should allow me to visualize, but I think I should try without the PCA in the model at first.

## How Good is the System?

- To decide this I'll actually need to create anomalous examples. This may prove difficult as I don't truly know what will constitute a security compromise
- To start, I'll simply train on all data and see what happens. The logic is that it would still be valuable to know which logs are unusual. Whether they'll reveal security compromises, I have no idea!
- To take it further, I can graph the data in 2D and make sure I'm capturing what look like anomalies
- To take it even further, I can train on 60% of the data, all of it non-anomalous. The remaining 40% will be split equally into CV and test groups, with the anomalous events divided 50/50 between the two. I'll tune against the F-score on CV and see how it carries over to the test group.
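
A sketch of that last evaluation plan, assuming labeled anomalous examples exist by then. All names are illustrative. I am also assuming that the 60% refers to the non-anomalous examples, and that the tuned quantity is a threshold `epsilon` on a per-log score such as $p(x)$, where lower means more unusual.

```python
def split_dataset(normal, anomalous):
    """60% of normal data to train; the rest and the anomalies halved into CV/test."""
    n_train = int(0.6 * len(normal))
    rest = normal[n_train:]
    half_n, half_a = len(rest) // 2, len(anomalous) // 2
    train = normal[:n_train]
    cv = [(x, False) for x in rest[:half_n]]
    cv += [(x, True) for x in anomalous[:half_a]]
    test = [(x, False) for x in rest[half_n:]]
    test += [(x, True) for x in anomalous[half_a:]]
    return train, cv, test


def f1_score(predicted, actual):
    """F1 for boolean lists where True marks an anomaly."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_threshold(cv_scores, cv_labels, steps=1000):
    """Scan thresholds between min and max score; keep the best F1 on CV."""
    low, high = min(cv_scores), max(cv_scores)
    best_epsilon, best_f1 = low, 0.0
    for k in range(1, steps):
        epsilon = low + (high - low) * k / steps
        predicted = [score < epsilon for score in cv_scores]
        score = f1_score(predicted, cv_labels)
        if score > best_f1:
            best_epsilon, best_f1 = epsilon, score
    return best_epsilon, best_f1
```

The chosen `epsilon` would then be applied unchanged to the test group, so the test F-score stays an honest estimate. Shuffling before the split is left out here and would be needed if the logs are ordered by time.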

### Up next

- Finish math in this sheet for the basic one
- Mock up a simple implementation with all prepackaged stuff (except for the anomaly detection model, which I want to do from scratch) and see how it goes!