# Week 9

## Anomaly detection
### Definition

> Anomaly detection is the identification of rare observations which raise suspicions by **differing significantly from the majority of the data**.
Typical applications are fraud detection, detection of manufacturing defects and medical diagnosis.

## How to use anomaly detection?

- We build a model that defines $p(x)$ as the **probability** that $x$ is a normal data point (not anomalous).
- We define a **probability threshold $\mathcal{E}$** (usually **between 0 and 0.05**), depending on how sure we need to be.

    - if $p(x_{test}) \lt \mathcal{E} \rightarrow$ anomaly flag
    - if $p(x_{test}) \ge \mathcal{E} \rightarrow$ normal data

## The Gaussian distribution (or Normal Distribution)

For $x \in \mathbb{R}$, the **Gaussian equation** is:

> $\displaystyle p(x;\mu,\sigma^{2}) = \frac{1}{\sqrt{2\pi}\,\sigma}\:\exp\Big({-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\Big)$

where $\mu$ is the mean, $\sigma$ is the standard deviation and $\sigma^{2}$ is the variance.
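
As a quick sanity check of the formula, here is a minimal sketch in Python with NumPy. The function name `gaussian_pdf` is an illustrative choice, not something defined in the course; it takes the variance $\sigma^{2}$ rather than the standard deviation.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x; sigma2 is the variance."""
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

peak = gaussian_pdf(0.0, 0.0, 1.0)   # density at the mean of a standard normal
tail = gaussian_pdf(3.0, 0.0, 1.0)   # density three standard deviations away
```

The density is highest at the mean and falls off quickly in the tails, which is what makes a low $p(x)$ a usable anomaly signal.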

## Anomaly detection algorithm

For a training set $\{x^{(1)}, x^{(2)}, ..., x^{(m)}\}$ with each example $x \in \mathbb{R}^n$, assume every feature is **normally distributed**: $x_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$ for $j = 1, ..., n$.

1. Choose features $x_{i}$ that you think might be indicative of anomalous examples.
2. Fit parameters $\mu_{1}$, ..., $\mu_{n}$, $\sigma_{1}^{2}$, ..., $\sigma_{n}^{2}$
> $\displaystyle
\begin{align}
\mu_{j} &= \frac{1}{m} \sum_{i=1}^{m}x_j^{(i)}\\
\sigma_{j}^{2} &= \frac{1}{m} \sum_{i=1}^{m}\big(x_{j}^{(i)} - \mu_{j}\big)^{2}
\end{align}$

3. Given new example $x$, compute $p(x)$:
>$\displaystyle p(x)=
\prod_{j=1}^{n}p(x_{j};\mu_{j},\sigma_{j}^{2})=
\prod_{j=1}^{n}\underbrace{\frac{1}{\sqrt{2\pi}\,\sigma_{j}}\:\exp\Big({-\frac{(x_{j}-\mu_{j})^{2}}{2\sigma_{j}^{2}}}\Big)}_{\textbf{Gaussian equation}}$

Flag $x$ as an anomaly if $p(x) \lt \mathcal{E}$.
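
The three steps can be sketched as follows. This is a minimal illustration on synthetic data, assuming NumPy; the helper names `fit_gaussian` and `density` and the threshold value are hypothetical choices, not part of the course material.

```python
import numpy as np

def fit_gaussian(X):
    """Per-feature mean and variance (1/m normalisation) of an (m, n) array."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)   # np.var divides by m by default (ddof=0)
    return mu, sigma2

def density(X, mu, sigma2):
    """p(x) for each row of X: product over features of the univariate densities."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * sigma2)
    expo = np.exp(-((X - mu) ** 2) / (2.0 * sigma2))
    return np.prod(coef * expo, axis=1)

# Synthetic "normal" training data with two features.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, -2.0], scale=[1.0, 0.5], size=(500, 2))
mu, sigma2 = fit_gaussian(X_train)

X_new = np.array([[5.0, -2.0],    # close to the training mean
                  [12.0, 3.0]])   # far from anything seen in training
epsilon = 1e-3                    # illustrative threshold only
is_anomaly = density(X_new, mu, sigma2) < epsilon
```

With many features the product can underflow; summing log-densities and comparing against $\log \mathcal{E}$ is a common workaround.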

## Algorithm evaluation

1. Set $\mathcal{E}$ at some value.

2. Fit model $p(x)$ on training set $\{x^{(1)}, x^{(2)}, ..., x^{(m)}\}$.

3. On a cross-validation/test set, predict the label $y$ of each example $x$:
>$
y = \begin{cases}
1 & \text{if } p(x) \lt \mathcal{E} \text{ (anomaly)} \\
0 & \text{if } p(x) \ge \mathcal{E} \text{ (normal)} \end{cases}$

Possible evaluation metrics to use:
- True positive, false positive, true negative, false negative
- Precision/Recall ($F_{1}$-score):
> $\displaystyle F_{1} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$

The cross-validation set can also be used to choose $\mathcal{E}$, for example by picking the value that gives the best $F_{1}$-score.
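
One possible way to pick the threshold is to scan candidate values and keep the one with the best $F_{1}$-score on the cross-validation set. The sketch below assumes NumPy and uses made-up density values; `f1_score` and `select_threshold` are hypothetical helper names, and the linear scan is just one simple search strategy.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 from true positives, false positives and false negatives."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0   # precision or recall would be 0 (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_threshold(p_cv, y_cv, n_steps=1000):
    """Scan thresholds between min and max p(x); keep the best F1."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), n_steps):
        y_pred = (p_cv < eps).astype(int)   # 1 = anomaly
        score = f1_score(y_cv, y_pred)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

# Made-up p(x) values on a labelled cross-validation set (1 = anomaly).
p_cv = np.array([0.30, 0.25, 0.28, 0.001, 0.22, 0.0005, 0.31])
y_cv = np.array([0, 0, 0, 1, 0, 1, 0])
best_eps, best_f1 = select_threshold(p_cv, y_cv)
```

Because anomalies are rare, plain accuracy would look good even for a model that never flags anything, which is why precision and recall are used here.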

## Anomaly detection vs. Supervised learning

**Anomaly detection:**
- **Very small number of positive examples** ($y = 1$) (0-20 is common).
- Large number of negative examples ($y = 0$).
- Many different "types" of anomalies. Hard for an algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen before.

**Supervised learning:**
- **Large number of positive and negative examples**.
- Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to the ones in the training set.

## Choosing Features to use

Plot a **histogram** of each feature to check that it looks approximately Gaussian.

If a feature is **non-Gaussian**, we can try different **transformations** of the data to make it look more Gaussian:

- **Log transformation**:
    > $\displaystyle x \rightarrow \log(x + c)$
- **Exponent transformation**:
    > $\displaystyle x \rightarrow x^{\frac{1}{2}}, x^{\frac{1}{3}}, \dots$
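
A histogram is the usual visual check; as a numeric stand-in, the sketch below compares sample skewness before and after each transformation (a symmetric, Gaussian-like feature has skewness near 0). It assumes NumPy and a synthetic right-skewed feature; the `skewness` helper and the constant `1e-9` used for $c$ are illustrative choices.

```python
import numpy as np

def skewness(v):
    """Sample skewness: close to 0 for a symmetric (e.g. Gaussian) feature."""
    centered = v - v.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)   # heavily right-skewed

x_log = np.log(x + 1e-9)   # log(x + c); the small c guards against log(0)
x_sqrt = np.sqrt(x)        # x^(1/2)

skew_raw, skew_sqrt, skew_log = skewness(x), skewness(x_sqrt), skewness(x_log)
```

For this particular feature the log transform removes the skew almost entirely, while the square root only reduces it; on real data the best transformation has to be found by trial.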

## Multivariate gaussian distribution model

$x \in \mathbb{R}^n$. Instead of modeling $p(x_{1}), p(x_{2}), ...,$ etc. separately, model $p(x)$ all in one go.

Parameters:
> $\mu \in \mathbb{R}^n$, $\Sigma \in \mathbb{R}^{n \times n}$ (**covariance matrix**).

The **original model** (a product of independent univariate Gaussians) is a **special case** of the multivariate Gaussian distribution, obtained when $\Sigma$ is diagonal:

> $\displaystyle \mathbf{\Sigma} = \begin{bmatrix}
    \sigma_{1}^{2} & 0 & \cdots & 0 \\
    0 & \sigma_{2}^{2} & \cdots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \cdots & \sigma_{n}^{2}
    \end{bmatrix}$
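
A short derivation sketch of why this holds, using the multivariate density $p(x;\mu,\Sigma)$ given in the Algorithm subsection of this section. For a diagonal $\Sigma$ the determinant and the inverse are easy to write down, and the density factorizes:

> $\displaystyle
\begin{align}
\lvert \Sigma \rvert &= \prod_{j=1}^{n}\sigma_{j}^{2}, \qquad \Sigma^{-1} = \mathrm{diag}\big(\sigma_{1}^{-2}, \dots, \sigma_{n}^{-2}\big)\\
(x-\mu)^{\top}\,\Sigma^{-1}\,(x-\mu) &= \sum_{j=1}^{n}\frac{(x_{j}-\mu_{j})^{2}}{\sigma_{j}^{2}}\\
p(x;\mu,\Sigma) &= \frac{1}{(2\pi)^{\frac{n}{2}}\prod_{j=1}^{n}\sigma_{j}}\:\exp\Big(-\sum_{j=1}^{n}\frac{(x_{j}-\mu_{j})^{2}}{2\sigma_{j}^{2}}\Big)
= \prod_{j=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma_{j}}\:\exp\Big(-\frac{(x_{j}-\mu_{j})^{2}}{2\sigma_{j}^{2}}\Big)
\end{align}$

The last expression is exactly the product of univariate densities $p(x_{j};\mu_{j},\sigma_{j}^{2})$.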

### Algorithm

1. Fit model $p(x)$ by setting:
> $\displaystyle
\begin{align}
\mu &= \frac{1}{m} \sum_{i=1}^{m}x^{(i)}\\
\Sigma &= \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu) (x^{(i)} - \mu)^{\top}
\end{align}$

2. Given a new example $x$, compute:

> $\displaystyle p(x ; \mu, \Sigma) =
\frac{1}{(2\pi) ^{\frac{n}{2}}\ \lvert \Sigma \rvert ^{\frac{1}{2}}} \: \exp\left(-\frac{1}{2} (x-\mu)^{\top} \:\Sigma^{-1} \:(x-\mu)\right)$

where $\lvert \Sigma \rvert$ is the determinant of the **covariance matrix**.

Flag an anomaly if $p(x) \lt \mathcal{E}$ 
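
A minimal sketch of both steps, assuming NumPy and synthetic, strongly correlated data. The names `fit_multivariate` and `multivariate_density` are hypothetical; the code solves a linear system rather than forming $\Sigma^{-1}$ explicitly, which is a numerical design choice and not something the notes prescribe.

```python
import numpy as np

def fit_multivariate(X):
    """Mean vector and covariance matrix (1/m normalisation) of an (m, n) array."""
    m = X.shape[0]
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / m
    return mu, Sigma

def multivariate_density(X, mu, Sigma):
    """p(x; mu, Sigma) for each row of X."""
    n = mu.shape[0]
    centered = X - mu
    # Solve Sigma z = (x - mu) instead of computing the inverse explicitly.
    solved = np.linalg.solve(Sigma, centered.T).T
    quad_form = np.sum(centered * solved, axis=1)   # (x-mu)^T Sigma^-1 (x-mu)
    norm = (2.0 * np.pi) ** (n / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad_form) / norm

rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.9], [0.9, 1.0]])   # strongly correlated features
X_train = rng.multivariate_normal([0.0, 0.0], true_cov, size=2000)
mu, Sigma = fit_multivariate(X_train)

X_new = np.array([[1.5, 1.5],     # follows the correlation
                  [1.5, -1.5]])   # same marginals, unusual combination
p_new = multivariate_density(X_new, mu, Sigma)
```

Both test points are equally far from the mean in each individual feature, yet the second one gets a much lower density because it contradicts the learned correlation.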

## Original model vs. Multivariate model

**Original model:**
- $\displaystyle p(x_{1};\mu_{1},\sigma_{1}^{2}) \times \cdots \times p(x_{n};\mu_{n},\sigma_{n}^{2})$
- Features must be created manually to capture anomalies where $x_1, x_2$ take unusual combinations of values
- **Computationally cheaper** (scales better to large $n$)
- OK even if $m$ (training set size) is small

**Multivariate Gaussian:**
- $\displaystyle p(x ; \mu, \Sigma) =
\frac{1}{(2\pi) ^{\frac{n}{2}}\ \lvert \Sigma \rvert ^{\frac{1}{2}}} \: \exp\Big(-\frac{1}{2} (x-\mu)^{\top} \:\Sigma^{-1} \:(x-\mu)\Big)$
- Automatically captures correlations between features
- **Computationally more expensive**
- **Must have $m \gt n$,** or else $\Sigma$ is **non-invertible**
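
The $m \gt n$ requirement can be illustrated numerically: with fewer examples than features, the estimated covariance matrix is rank-deficient and therefore has no inverse. The sketch assumes NumPy and random synthetic data; the `covariance` helper is a hypothetical name.

```python
import numpy as np

def covariance(X):
    """Covariance estimate with 1/m normalisation."""
    centered = X - X.mean(axis=0)
    return centered.T @ centered / X.shape[0]

rng = np.random.default_rng(0)
n = 5
X_small = rng.normal(size=(3, n))     # m = 3 examples, n = 5 features
X_large = rng.normal(size=(200, n))   # m = 200 examples

rank_small = np.linalg.matrix_rank(covariance(X_small))   # less than n: singular
rank_large = np.linalg.matrix_rank(covariance(X_large))   # full rank: invertible
```

Redundant (linearly dependent) features make $\Sigma$ singular in the same way, even when $m$ is large.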

# Recommender system

Users have rated some movies and we want to recommend new ones to them.

$r(i,j) = 1$ if user $j$ has rated movie $i$ (0 otherwise)

$y^{(i,j)} = $ rating by user $j$ on movie $i$ (if defined)

$\theta ^{(j)} = $  parameter vector for user $j$

$x^{(i)} = $ feature vector for movie $i$

For user $j$, movie $i$, predicted rating: $(\theta^{(j)})^{T}(x^{(i)})$ with $\theta^{(j)} \in \mathbb{R}^{n+1}$

$m^{(j)}$ = number of movies rated by user $j$

$n_{u}$ = number of users, $n_{m}$ = number of movies, $n$ = number of features per movie

## Content-based recommendations

We know some features $x^{(i)}$ about the content we want to recommend. For movies, for example, we know the category (action, comedy, romance, ...) and we want to recommend to each user movies from their favorite categories.

### Cost function

To learn $\theta^{(j)}$ (parameter for user $j$):

>$\displaystyle \min_{\theta^{(j)}} \frac{1}{2} \sum_{i:r(i,j)=1} \big( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \big) ^{2} + \frac{\lambda}{2} \sum_{k=1}^{n}(\theta_{k}^{(j)})^2$

To learn $\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_{u})}$ (parameters for all $n_{u}$ users):

>$\displaystyle \min_{\theta^{(1)}, ..., \theta^{(n_{u})}} \underbrace{ \frac{1}{2} \sum_{j=1}^{n_{u}} \sum_{i:r(i,j)=1} \big( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \big) ^{2} + \frac{\lambda}{2} \sum_{j=1}^{n_{u}} \sum_{k=1}^{n}(\theta_{k}^{(j)})^2}_{J(\theta^{(1)},..., \theta^{(n_{u})})}$

### Gradient descent

> $\displaystyle \theta^{(j)}_{k} := \theta^{(j)}_{k} - \alpha \sum_{i:r(i,j)=1} \big( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \big) x^{(i)}_{k} \quad \text{(for } k = 0\text{)}$

> $\displaystyle \theta^{(j)}_{k} := \theta^{(j)}_{k} - \alpha \ \Bigg( \sum_{i:r(i,j)=1} \big( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \big) x^{(i)}_{k} + \lambda \theta^{(j)}_{k}\Bigg) \quad \text{(for } k \neq 0\text{)}$
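
The update for a single user can be sketched as follows, assuming NumPy. The function names, the tiny data set (columns $x_0 = 1$, "romance", "action") and the hyperparameter values are all hypothetical; a boolean mask plays the role of $r(i,j)$.

```python
import numpy as np

def content_cost(theta, X, y, rated, lam):
    """Regularised squared-error cost for one user (theta[0] is not regularised)."""
    errors = X[rated] @ theta - y[rated]
    return 0.5 * np.sum(errors ** 2) + 0.5 * lam * np.sum(theta[1:] ** 2)

def content_step(theta, X, y, rated, alpha, lam):
    """One gradient-descent update of a single user's parameter vector."""
    errors = X[rated] @ theta - y[rated]   # only movies with r(i, j) = 1
    grad = X[rated].T @ errors
    grad[1:] += lam * theta[1:]            # k != 0: add the regularisation term
    return theta - alpha * grad

# Hypothetical movie features: columns are (x_0 = 1, romance, action).
X = np.array([[1.0, 0.9, 0.0],
              [1.0, 1.0, 0.1],
              [1.0, 0.1, 1.0],
              [1.0, 0.0, 0.9]])
y = np.array([5.0, 5.0, 0.0, 0.0])            # last entry is a placeholder
rated = np.array([True, True, True, False])   # the last movie is unrated

theta = np.zeros(3)
initial_cost = content_cost(theta, X, y, rated, lam=0.1)
for _ in range(500):
    theta = content_step(theta, X, y, rated, alpha=0.05, lam=0.1)
final_cost = content_cost(theta, X, y, rated, lam=0.1)
predicted_unrated = X[3] @ theta   # predicted rating for the unrated movie
```

Running the same update for every user $j$ minimizes the combined cost, since each $\theta^{(j)}$ only appears in its own terms.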

## Collaborative filtering

The features $x^{(i)}$ are unknown and we want to estimate them.

### Cost function

Given $\theta^{(1)}, \theta^{(2)}, ..., \theta^{(n_{u})}$, to learn $x^{(i)}$:

>$\displaystyle \min_{x^{(i)}} \frac{1}{2} \sum_{j:r(i,j)=1} \big( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \big) ^{2} + \frac{\lambda}{2} \sum_{k=1}^{n}(x_{k}^{(i)})^2$

Given $\theta^{(1)}, ..., \theta^{(n_{u})}$, to learn $x^{(1)}, ..., x^{(n_{m})}$:

>$\displaystyle \min_{x^{(1)}, ..., x^{(n_{m})}} \frac{1}{2} \sum_{i=1}^{n_{m}} \sum_{j:r(i,j)=1} \big( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \big) ^{2} + \frac{\lambda}{2} \sum_{i=1}^{n_{m}} \sum_{k=1}^{n}(x_{k}^{(i)})^2$


To learn $\theta^{(1)}, ..., \theta^{(n_{u})}$ and $x^{(1)}, ..., x^{(n_{m})}$ simultaneously, minimize a single cost function over both sets of parameters (the intercept convention $x_{0} = 1$ is dropped here, so $x^{(i)}, \theta^{(j)} \in \mathbb{R}^{n}$):

>$\displaystyle J(x^{(1)}, ..., x^{(n_{m})}, \theta^{(1)}, ..., \theta^{(n_{u})}) = \frac{1}{2} \sum_{(i,j):r(i,j)=1} \big( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \big) ^{2} + \frac{\lambda}{2} \sum_{i=1}^{n_{m}} \sum_{k=1}^{n}(x_{k}^{(i)})^2 + \frac{\lambda}{2} \sum_{j=1}^{n_{u}} \sum_{k=1}^{n}(\theta_{k}^{(j)})^2$

### Gradient descent

> $\displaystyle x^{(i)}_{k} := x^{(i)}_{k} - \alpha \ \Bigg( \sum_{j:r(i,j)=1} \big( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \big) \theta ^{(j)}_{k} + \lambda x^{(i)}_{k}\Bigg)$

> $\displaystyle \theta^{(j)}_{k} := \theta^{(j)}_{k} - \alpha \ \Bigg( \sum_{i:r(i,j)=1} \big( (\theta^{(j)})^{T} x^{(i)} - y^{(i,j)} \big) x^{(i)}_{k} + \lambda \theta^{(j)}_{k}\Bigg)$
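
In matrix form, both updates can be computed at once. The sketch below assumes NumPy, stores $x^{(i)}$ as rows of `X` and $\theta^{(j)}$ as rows of `Theta`, and uses a small made-up rating matrix; `cofi_cost`, `cofi_step` and the hyperparameters are hypothetical choices.

```python
import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    """Collaborative filtering cost J.

    X: (n_m, n) movie features, Theta: (n_u, n) user parameters,
    Y: (n_m, n_u) ratings, R: (n_m, n_u) with 1 where a rating exists.
    """
    errors = (X @ Theta.T - Y) * R   # zero out entries with r(i, j) = 0
    reg = 0.5 * lam * (np.sum(X ** 2) + np.sum(Theta ** 2))
    return 0.5 * np.sum(errors ** 2) + reg

def cofi_step(X, Theta, Y, R, alpha, lam):
    """One simultaneous gradient-descent update of X and Theta."""
    errors = (X @ Theta.T - Y) * R
    X_grad = errors @ Theta + lam * X
    Theta_grad = errors.T @ X + lam * Theta
    return X - alpha * X_grad, Theta - alpha * Theta_grad

# Made-up ratings: 4 movies (rows) x 3 users (columns).
Y = np.array([[5.0, 4.0, 0.0],
              [5.0, 0.0, 0.0],
              [0.0, 0.0, 5.0],
              [0.0, 1.0, 4.0]])
R = np.array([[1, 1, 1],
              [1, 0, 1],
              [1, 0, 1],
              [0, 1, 1]])

rng = np.random.default_rng(0)
X = 0.1 * rng.standard_normal((4, 2))       # small random values, not zero
Theta = 0.1 * rng.standard_normal((3, 2))

initial_cost = cofi_cost(X, Theta, Y, R, lam=0.1)
for _ in range(2000):
    X, Theta = cofi_step(X, Theta, Y, R, alpha=0.01, lam=0.1)
final_cost = cofi_cost(X, Theta, Y, R, lam=0.1)
predictions = X @ Theta.T   # predicted rating for every (movie, user) pair
```

Both gradients are computed from the same `errors` matrix before either parameter set is changed, which keeps the update simultaneous.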

### Collaborative filtering algorithm

1. Initialize $x^{(1)}, ..., x^{(n_{m})}, \theta^{(1)}, ..., \theta^{(n_{u})}$ to small random values, **not zero**.

2. Minimize $J(x^{(1)}, ..., x^{(n_{m})}, \theta^{(1)}, ..., \theta^{(n_{u})})$ using gradient descent.

3. For a user with parameter $\theta$ and a movie with learned features $x$, predict a star rating of $\theta^{T}x$.

To find movies $j$ related to movie $i$, look for the movies with the smallest distance $\| x^{(i)} - x^{(j)} \|$ between their feature vectors.
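
A minimal sketch of this lookup, assuming NumPy and a small made-up feature matrix with one row per movie; `most_similar` is a hypothetical helper name.

```python
import numpy as np

def most_similar(X, i, k=2):
    """Indices of the k movies whose feature vectors are closest to movie i."""
    distances = np.linalg.norm(X - X[i], axis=1)
    distances[i] = np.inf   # exclude the movie itself
    return np.argsort(distances)[:k]

X = np.array([[0.9, 0.1],    # movie 0
              [1.0, 0.0],    # movie 1
              [0.1, 0.9],    # movie 2
              [0.8, 0.3]])   # movie 3
neighbours = most_similar(X, 0)   # movies 1 and 3 are the closest to movie 0
```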

### Implementation detail: mean normalization

A user who has not rated any movies ends up with $\theta \approx 0$, because only the regularization term affects that user's parameters. Every predicted rating for that user would then be 0. Mean normalization avoids this.

1. Calculate the mean rating $\mu_{i}$ of each movie $i$, using only the users who rated it.
2. Subtract the corresponding mean from each movie's ratings, so that every movie has a mean rating of 0.
3. Use this normalized matrix to learn the parameters $\theta^{(j)}$ and $x^{(i)}$.
4. For user $j$ on movie $i$, predict $(\theta^{(j)})^{T} x^{(i)} + \mu_{i}$. For a user with no ratings, $\theta^{(j)} \approx 0$, so the prediction equals the movie's mean rating $\mu_{i}$.
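
The normalization and the resulting prediction for a user without ratings can be sketched as follows, assuming NumPy. The rating matrix, the feature matrix `X` and the helper name `normalize_ratings` are made up for illustration; the zero vector stands in for the $\theta \approx 0$ that regularization produces for such a user.

```python
import numpy as np

def normalize_ratings(Y, R):
    """Subtract each movie's mean rating, using only entries where R == 1."""
    counts = R.sum(axis=1)
    mu = (Y * R).sum(axis=1) / np.maximum(counts, 1)   # avoid 0/0 for unrated movies
    Y_norm = (Y - mu[:, None]) * R                     # unrated entries stay 0
    return Y_norm, mu

# 3 movies (rows) x 3 users (columns); the last user has rated nothing.
Y = np.array([[5.0, 4.0, 0.0],
              [1.0, 0.0, 0.0],
              [4.0, 5.0, 0.0]])
R = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 0]])
Y_norm, mu = normalize_ratings(Y, R)

X = np.array([[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]])   # hypothetical learned features
theta_new_user = np.zeros(2)                           # stand-in for theta close to 0
prediction = X @ theta_new_user + mu                   # equals each movie's mean rating
```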