# Anomaly Detection

### Problem Motivation

- Aircraft Manufacturer Example:
    - Dataset {$x^{(1)},x^{(2)},...,x^{(m)}$}
        - $x_1$ = heat generated
        - $x_2$ = vibration intensity
    - Anomaly detection takes new data $x_{test}$ and compares it to the training set to determine if there are any major fluctuations or anomalies in the output performance.
    
- If we have a model $p(x)$, then we run our new data through the model, and if $p(x_{test}) < \epsilon$ then we flag it as an anomaly

- Another example is fraud detection:
    - $x^{(i)}$ = features of user i's activities
    - Model $p(x)$ from data
    - Identify unusual users by checking which have $p(x) < \epsilon$

### Gaussian (Normal) Distribution

- Say $x \in \mathbb{R}$. If $x$ is Gaussian-distributed with mean $\mu$ and variance $\sigma^2$, we write:
    - $x$ **~** $N(\mu, \sigma^2)$
    - ~ means 'distributed as'
    - plotting the Gaussian distribution will result in the bell curve
- $p(x;\mu, \sigma^2) = \frac{1}{\sqrt{2 \pi}\,\sigma}\exp(-\frac{(x- \mu)^2}{2 \sigma ^2})$
    - $\sigma^2$ = variance
    - $\sigma$ = standard deviation
    
- Parameter estimation:
    - $\mu = \frac{1}{m} \sum \limits_{i=1}^m x^{(i)}$ 
    - $\sigma^2 = \frac{1}{m} \sum \limits_{i=1}^m (x^{(i)} - \mu)^2$
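
The two estimates above can be sketched in a few lines of NumPy. This is only an illustration: the function names `estimate_gaussian` and `gaussian_pdf` and the sample values are made up for this example and are not part of the notes.

```python
import numpy as np

def estimate_gaussian(x):
    """Return the mean and variance of x, using the 1/m normalization."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()
    return mu, sigma2

def gaussian_pdf(x, mu, sigma2):
    """Evaluate p(x; mu, sigma^2) for a scalar or array x."""
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# Hypothetical one-dimensional sample.
samples = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mu, sigma2 = estimate_gaussian(samples)
print(mu, sigma2)                    # mean and variance of the sample
print(gaussian_pdf(mu, mu, sigma2))  # density at the mean, the peak of the bell curve
```

Note that the $\frac{1}{m}$ normalization is used here, matching the formulas above, rather than the $\frac{1}{m-1}$ version common in statistics.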
    
### Anomaly Detection Algorithm

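- Assuming the features are independent, model each feature with its own Gaussian and multiply the densities: $p(x) = p(x_1;\mu_1,\sigma_1^2)\,p(x_2;\mu_2,\sigma_2^2)\,p(x_3;\mu_3,\sigma_3^2) \cdots p(x_n;\mu_n,\sigma_n^2)$
    - $p(x) = \prod \limits_{j=1}^n p(x_j;\mu_j,\sigma_j^2)$
    - Estimating $p(x)$ this way is called **density estimation**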
- Anomaly Detection Algorithm:
    1. Choose features $x_i$ that might be indicative of anomalous examples
    2. Fit parameters $\mu_1,...,\mu_n,\sigma_1^2,...,\sigma_n^2$
    3. Given a new example $x$, compute $p(x)$; flag $x$ as an anomaly if $p(x) < \epsilon$
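
A minimal sketch of the three steps above, assuming two features (heat and vibration) drawn from made-up Gaussians. The helper names, the synthetic training data, and the threshold `epsilon = 1e-4` are all assumptions for illustration.

```python
import numpy as np

def fit_independent_gaussians(X):
    """Step 2: fit mu_j and sigma_j^2 for every feature (column) of X."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)  # np.var uses the 1/m normalization by default
    return mu, sigma2

def density(X, mu, sigma2):
    """Step 3: p(x) as the product of the per-feature Gaussian densities."""
    per_feature = np.exp(-((X - mu) ** 2) / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
    return per_feature.prod(axis=1)

# Synthetic "normal" engines: column 0 = heat, column 1 = vibration (made-up values).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[50.0, 10.0], scale=[5.0, 1.0], size=(500, 2))
mu, sigma2 = fit_independent_gaussians(X_train)

# One ordinary-looking example and one far from the training data.
X_new = np.array([[51.0, 10.2], [90.0, 3.0]])
epsilon = 1e-4  # hypothetical threshold; in practice it is tuned on labeled data
p_new = density(X_new, mu, sigma2)
is_anomaly = p_new < epsilon
print(p_new, is_anomaly)
```

With many features the product can underflow, so a practical variant would sum log-densities instead and compare against $\log \epsilon$.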
    
### Building and Developing an Anomaly Detection System

- A model is much easier to evaluate if we can compute a single real-number 'score' that tells us how performance changes when we add or remove features
    - For anomaly detection systems, assume we have some labeled data of anomalous and non-anomalous examples (i.e. y = 0 if normal and y = 1 if anomalous)
    - Define a training set (assumed to contain only normal examples), a cross-validation set, and a test set
    - Example:
        - 10000 good aircraft engines (y=0) and 20 anomalous engines (y=1)
            - Training Set: 6000 good engines (used to fit $p(x) = p(x_1;\mu_1,\sigma_1^2)\,p(x_2;\mu_2,\sigma_2^2) \cdots p(x_n;\mu_n,\sigma_n^2)$)
            - CV: 2000 good engines and 10 anomalous
            - Test: 2000 good engines and 10 anomalous
    - Possible evaluation metrics:
        1. True positive, false positive, false negative, true negative
        2. Precision/Recall
        3. $F_1$ score
    - Try different values of the threshold parameter $\epsilon$ on the cross-validation set and choose the one that maximizes the $F_1$ score
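
The threshold search described above could look roughly like the following. This is a sketch under stated assumptions: `select_threshold` and `f1_score` are hypothetical helper names, the grid of 1000 candidate thresholds is an arbitrary choice, and the cross-validation densities `p_cv` are made-up numbers.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 score with y = 1 (anomalous) as the positive class."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_threshold(p_cv, y_cv, n_steps=1000):
    """Scan candidate thresholds and keep the one with the best F1 on the CV set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), n_steps):
        y_pred = (p_cv < eps).astype(int)  # flag as anomaly when p(x) < epsilon
        score = f1_score(y_cv, y_pred)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

# Made-up densities p(x) for seven CV examples and their labels.
p_cv = np.array([0.30, 0.25, 0.28, 0.001, 0.22, 0.002, 0.27])
y_cv = np.array([0, 0, 0, 1, 0, 1, 0])
eps, f1 = select_threshold(p_cv, y_cv)
print(eps, f1)
```

Accuracy is a poor metric here because the classes are heavily skewed: always predicting y = 0 would already be right for almost every engine.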
    
### Choosing Which Features to Use

- First step is to plot data to check and see if it has a vaguely Gaussian distribution (histogram/density graph)
    - if data does **not** look Gaussian, you can play around with some different transformations to manipulate the data shape:
        - $\log(x)$
        - $\log(x+c)$
        - $\sqrt{x}$
        - $x^d$
- Error analysis for anomaly detection:
    - Want $p(x)$ to be large for normal examples x and $p(x)$ to be small for anomalous examples x
    - A common problem would be that $p(x)$ is comparable (e.g. both are large) for normal and anomalous examples
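
One rough way to compare the candidate transformations from this section is to measure how much each one reduces the skew of a feature. The sketch below assumes a synthetic, right-skewed feature (log-normal samples) and a hand-written `skewness` helper; both are illustrative choices, and in practice a histogram is still worth plotting.

```python
import numpy as np

def skewness(x):
    """Sample skewness: close to 0 for symmetric, roughly Gaussian data."""
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed feature (an assumption for this example).
rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

candidates = {
    "x": x,
    "log(x)": np.log(x),
    "log(x + 1)": np.log(x + 1.0),
    "sqrt(x)": np.sqrt(x),
}
skews = {name: skewness(values) for name, values in candidates.items()}
for name, value in skews.items():
    print(f"{name:>10}: skewness = {value:.3f}")
```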

### Multivariate Gaussian Distribution

- $x \in \mathbb{R}^n$. Don't model $p(x_1),p(x_2),...$ separately; instead model $p(x)$ all in one go
    - Parameters: $\mu \in \mathbb{R}^n, \Sigma \in \mathbb{R}^{n \times n}$ (covariance matrix)
    - $p(x;\mu,\Sigma) = \frac{1}{(2\pi)^{\frac{n}{2}}\mid \Sigma \mid^{\frac{1}{2}}} \exp(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu))$
        - $\mid \Sigma \mid$ = the determinant of $\Sigma$
        
- Parameter fitting:
    - Given a training set {$x^{(1)},x^{(2)},...,x^{(m)}$}
        - $\mu = \frac{1}{m}\sum \limits_{i=1}^m x^{(i)}$
        - $\Sigma = \frac{1}{m}\sum \limits_{i=1}^m (x^{(i)}-\mu)(x^{(i)}-\mu)^T$
- Anomaly detection with the Multivariate Gaussian:
    1. Fit model $p(x)$ by setting
        - $\mu = \frac{1}{m}\sum \limits_{i=1}^m x^{(i)}$
        - $\Sigma = \frac{1}{m}\sum \limits_{i=1}^m (x^{(i)}-\mu)(x^{(i)}-\mu)^T$
    2. Given a new example x, compute
        - $p(x;\mu,\Sigma) = \frac{1}{(2\pi)^{\frac{n}{2}}\mid \Sigma \mid^{\frac{1}{2}}} \exp(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu))$
    3. Flag an anomaly if $p(x)< \epsilon$
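
The three steps above can be sketched with NumPy as follows. The function names, the synthetic correlated training data, and the threshold `epsilon = 1e-3` are assumptions for illustration. The example is chosen so that the second test point has ordinary values for each feature on its own but an unusual combination of the two.

```python
import numpy as np

def fit_multivariate_gaussian(X):
    """Step 1: estimate mu and Sigma with the 1/m normalization."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / m
    return mu, Sigma

def multivariate_density(X, mu, Sigma):
    """Step 2: evaluate p(x; mu, Sigma) for every row of X."""
    n = mu.shape[0]
    centered = X - mu
    # Solve Sigma z = (x - mu) rather than forming the explicit inverse.
    solved = np.linalg.solve(Sigma, centered.T).T
    quad_form = np.sum(centered * solved, axis=1)
    norm_const = (2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad_form) / norm_const

# Synthetic training data with strongly correlated features (made up).
rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.9], [0.9, 1.0]])
X_train = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=1000)
mu, Sigma = fit_multivariate_gaussian(X_train)

# First point follows the correlation, second point goes against it.
X_new = np.array([[1.0, 1.0], [1.5, -1.5]])
epsilon = 1e-3  # hypothetical threshold
p_new = multivariate_density(X_new, mu, Sigma)
print(p_new, p_new < epsilon)  # step 3: flag anomalies
```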
    
- Original Model Vs. Multivariate Gaussian:
    - Original model:
        - most common
        - manually create new features to capture anomalies where $x_1,x_2$ take unusual combinations of values
        - computationally cheaper (scales better to large n)
        - okay if training size m is small
    - Multivariate model:
        - automatically captures correlations between features
        - computationally more expensive
        - must have $m > n$, otherwise $\Sigma$ is non-invertible
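
The requirement on $m$ and $n$ can be checked numerically. The sketch below assumes random Gaussian data and a hypothetical helper `covariance_rank`. After subtracting the mean, the estimated $\Sigma$ has rank at most $m - 1$, so it cannot reach full rank $n$ when there are too few examples.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # number of features

def covariance_rank(m):
    """Rank of the estimated covariance matrix for m random examples."""
    X = rng.normal(size=(m, n))
    centered = X - X.mean(axis=0)
    Sigma = centered.T @ centered / m
    return np.linalg.matrix_rank(Sigma)

rank_small = covariance_rank(3)   # fewer examples than features
rank_large = covariance_rank(50)  # many more examples than features
print(rank_small, rank_large, n)  # Sigma is invertible only when its rank equals n
```

Redundant features (for example duplicated or linearly dependent columns) make $\Sigma$ singular in the same way, even when $m$ is large.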

# Recommender Systems

### Content-based recommender systems
- For each user $j$, learn a parameter vector $\theta^{(j)} \in \mathbb{R}^{n+1}$ (where $n$ is the number of features, e.g. $n = 2$). Predict that user $j$ rates movie $i$ with $(\theta^{(j)})^T x^{(i)}$ stars
    - each user will have a different parameter vector associated with them
- Problem Formulation:
    - $r(i,j)$ = 1 if user j has rated movie i (0 otherwise)
    - $y^{(i,j)}$ = rating by user j on movie i (if defined)
    
    - $\theta^{(j)}$ = parameter vector for user j
    - $x^{(i)}$ = feature vector for movie i
   
    - for user j, movie i, predict rating: $(\theta^{(j)})^T x^{(i)}$
    - $m^{(j)}$ = number of movies rated by user j
    - To learn $\theta^{(j)}$:
        1. $\min \limits_{\theta^{(j)}} \frac{1}{2m^{(j)}}\sum \limits_{i:r(i,j) = 1} ((\theta^{(j)})^T x^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2m^{(j)}}\sum \limits_{k=1}^n (\theta_k^{(j)})^2$
            - multiplying by the constant $m^{(j)}$ does not change the minimizer, so this can also be written as: $\min \limits_{\theta^{(j)}} \frac{1}{2}\sum \limits_{i:r(i,j) = 1} ((\theta^{(j)})^T x^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2}\sum \limits_{k=1}^n (\theta_k^{(j)})^2$
        2. However, for a recommender system, you want to learn parameters for **all** the users, not just a single user:
            - $\min \limits_{\theta^{(1)},...,\theta^{(n_u)}} \frac{1}{2}\sum \limits_{j=1}^{n_u} \sum \limits_{i:r(i,j) = 1} ((\theta^{(j)})^T x^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2}\sum \limits_{j=1}^{n_u} \sum \limits_{k=1}^n (\theta_k^{(j)})^2$
        3. Gradient Descent Update:
            - $\theta_k^{(j)} := \theta_k^{(j)} - \alpha \sum \limits_{i:r(i,j) = 1}((\theta^{(j)})^T x^{(i)}-y^{(i,j)})x_k^{(i)}$ (for $k = 0$)
            - $\theta_k^{(j)} := \theta_k^{(j)} - \alpha (\sum \limits_{i:r(i,j) = 1}((\theta^{(j)})^T x^{(i)}-y^{(i,j)})x_k^{(i)}+ \lambda \theta_k^{(j)})$ (for $k \neq 0$)
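
The gradient descent update above, for a single user, can be sketched as follows. The function names, the toy movie features (an intercept $x_0 = 1$ plus two made-up genre scores), the ratings, and the values of `alpha` and `lam` are all assumptions for illustration.

```python
import numpy as np

def user_cost(theta, X, y, rated, lam):
    """Regularized squared-error cost for one user (theta_0 is not regularized)."""
    errors = X[rated] @ theta - y[rated]
    return 0.5 * np.sum(errors ** 2) + 0.5 * lam * np.sum(theta[1:] ** 2)

def user_gradient_step(theta, X, y, rated, alpha, lam):
    """One update of theta^(j), summing only over movies with r(i, j) = 1."""
    errors = X[rated] @ theta - y[rated]
    grad = X[rated].T @ errors
    reg = lam * theta
    reg[0] = 0.0  # the k = 0 update has no regularization term
    return theta - alpha * (grad + reg)

# Toy data: columns are x_0 = 1, a romance score, an action score (made up).
X = np.array([[1.0, 0.9, 0.0],
              [1.0, 1.0, 0.1],
              [1.0, 0.1, 1.0],
              [1.0, 0.0, 0.9]])
y = np.array([5.0, 5.0, 0.0, 0.0])           # last entry is unused (not rated)
rated = np.array([True, True, True, False])  # r(i, j) for this user

theta = np.zeros(3)
cost_before = user_cost(theta, X, y, rated, lam=0.01)
for _ in range(500):
    theta = user_gradient_step(theta, X, y, rated, alpha=0.1, lam=0.01)
cost_after = user_cost(theta, X, y, rated, lam=0.01)

print(cost_before, cost_after)
print(X[3] @ theta)  # predicted rating for the movie this user has not rated
```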
            
            
### Collaborative Filtering

- Given $\theta^{(1)},...,\theta^{(n_u)}$ (where $n_u$ = the number of users and $n_m$ = the number of movies), learn the features of all movies:
    - $\min \limits_{x^{(1)},...,x^{(n_m)}} \frac{1}{2}\sum \limits_{i=1}^{n_m} \sum \limits_{j:r(i,j) = 1} ((\theta^{(j)})^T x^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2}\sum \limits_{i=1}^{n_m} \sum \limits_{k=1}^n (x_k^{(i)})^2$
    
- Collaborative Filtering Process:
    - Given $x^{(1)},...,x^{(n_m)}$ (and movie ratings), estimate $\theta^{(1)},...,\theta^{(n_u)}$
    - Given $\theta^{(1)},...,\theta^{(n_u)}$, estimate $x^{(1)},...,x^{(n_m)}$
    
    - Iterate back and forth between estimating $\theta$ parameters and $x$ features
    
 ### Collaborative Filtering Algorithm
 
 - The two cost functions above, one minimized with respect to $\theta$ and the other with respect to $x$, can be combined into a single function that is minimized over both at once, which gives a more efficient algorithm: $J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)}) = \frac{1}{2}\sum \limits_{(i,j):r(i,j) = 1}((\theta^{(j)})^T x^{(i)}-y^{(i,j)})^2 + \frac{\lambda}{2}\sum \limits_{i=1}^{n_m} \sum \limits_{k=1}^n (x_k^{(i)})^2 + \frac{\lambda}{2}\sum \limits_{j=1}^{n_u} \sum \limits_{k=1}^n (\theta_k^{(j)})^2$
 - Because we are learning the features automatically with the algorithm, we no longer need the convention $x_0 = 1$, so $x \in \mathbb{R}^n$ and $\theta \in \mathbb{R}^n$
 
 - Stepwise Process:
     1. Initialize $x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)}$ to small random values
     2. Minimize $J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)})$ using gradient descent (or an advanced optimization algorithm), e.g. for every $j=1,...,n_u, i=1,...,n_m$:
         - $x_k^{(i)} := x_k^{(i)} - \alpha (\sum \limits_{j:r(i,j) = 1}((\theta^{(j)})^T x^{(i)}-y^{(i,j)})\theta_k^{(j)}+ \lambda x_k^{(i)})$
         - $\theta_k^{(j)} := \theta_k^{(j)} - \alpha (\sum \limits_{i:r(i,j) = 1}((\theta^{(j)})^T x^{(i)}-y^{(i,j)})x_k^{(i)}+ \lambda \theta_k^{(j)})$
     3. For a user with parameters $\theta$ and a movie with learned features $x$, predict a star rating of $\theta^Tx$.
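
A vectorized sketch of the stepwise process above, using plain gradient descent. The matrix layout (rows of `Y` and `R` are movies, columns are users), the function names, the toy ratings, and the hyperparameters `alpha`, `lam`, and `n_iters` are assumptions for illustration, not values from the notes.

```python
import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    """Combined cost J over movie features X and user parameters Theta."""
    errors = (X @ Theta.T - Y) * R  # only rated (i, j) pairs contribute
    return 0.5 * np.sum(errors ** 2) + 0.5 * lam * (np.sum(X ** 2) + np.sum(Theta ** 2))

def cofi_gradients(X, Theta, Y, R, lam):
    """Gradients of J with respect to X and Theta."""
    errors = (X @ Theta.T - Y) * R
    grad_X = errors @ Theta + lam * X
    grad_Theta = errors.T @ X + lam * Theta
    return grad_X, grad_Theta

def fit_collaborative_filtering(Y, R, n_features, alpha=0.01, lam=0.1, n_iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    n_m, n_u = Y.shape
    # Step 1: small random initial values (breaks the symmetry between features).
    X = 0.1 * rng.standard_normal((n_m, n_features))
    Theta = 0.1 * rng.standard_normal((n_u, n_features))
    # Step 2: update X and Theta simultaneously with gradient descent.
    for _ in range(n_iters):
        grad_X, grad_Theta = cofi_gradients(X, Theta, Y, R, lam)
        X = X - alpha * grad_X
        Theta = Theta - alpha * grad_Theta
    return X, Theta

# Toy ratings: 5 movies x 4 users; R[i, j] = 1 where user j rated movie i.
Y = np.array([[5, 5, 0, 0],
              [5, 0, 0, 0],
              [0, 4, 0, 0],
              [0, 0, 5, 4],
              [0, 0, 5, 0]], dtype=float)
R = np.array([[1, 1, 1, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0],
              [1, 1, 1, 1],
              [1, 1, 1, 0]], dtype=float)

X, Theta = fit_collaborative_filtering(Y, R, n_features=2)
predictions = X @ Theta.T  # step 3: predicted rating for every (movie, user) pair
print(cofi_cost(X, Theta, Y, R, lam=0.1))
print(np.round(predictions, 1))
```

Entries of `Y` where `R` is 0 are placeholders and never enter the cost, which is why the masked error matrix is used in both the cost and the gradients.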