# Week 9 - Anomaly Detection
## Density Estimation
### Problem Motivation
- Anomaly detection is a hybrid between unsupervised and supervised learning.
- Application: features of aircraft engines such as heat generated, vibration intensity, etc.
- For any new (test) engine, we want to know if it is anomalous in any way.
- Need to define a threshold epsilon: an example is marked as anomalous when its probability p(x) falls below epsilon.

### Gaussian Distribution
- AKA normal distribution
- I'm pretty sure I know this... but I'll watch anyway (at 1.5x)
x ~ N(mu, sigma^2)
- This means that x is Gaussian-distributed with mean mu and variance sigma^2
p(x; mu, sigma^2) = 1/(sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
- As sigma increases, the distribution becomes "fatter" (wider and shorter, since the area under the curve stays 1)
- As mu increases, the center of the distribution shifts to the right along the x axis
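
A minimal Python sketch of the density formula above, using only the standard library. The function name `gaussian_pdf` and the sample values are illustrative assumptions, not something from the lecture.

```python
import math

def gaussian_pdf(x: float, mu: float, sigma2: float) -> float:
    """Density of N(mu, sigma2) at x, where sigma2 is the variance."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# The peak height is 1 / (sqrt(2*pi) * sigma): a larger sigma gives a lower, wider curve.
peak_narrow = gaussian_pdf(0.0, 0.0, 1.0)  # sigma = 1
peak_wide = gaussian_pdf(0.0, 0.0, 4.0)    # sigma = 2 (variance 4)
```

Doubling sigma halves the peak, and changing mu only moves the curve sideways without changing its shape.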

### Algorithm
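- For a training set {x^(1), x^(2), ..., x^(m)} with n features each, p(x) = the product of p(xj; muj, sigmaj^2) over all features j = 1..n, where muj and sigmaj^2 are the mean and variance of feature j across the entire set (this treats the features as independent)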

steps:
1. Choose features xi that might be indicative of anomalous examples
2. Fit parameters mu1, ..., mun, sigma1^2, ..., sigman^2
muj = (1/m) * SUM(i=1:m) xj^(i)
sigmaj^2 = (1/m) * SUM(i=1:m) (xj^(i) - muj)^2
3. Given new example x, compute p(x)
p(x) = PRODUCT(j=1:n) p(xj; muj, sigmaj^2)
4. x is anomalous if p(x) < epsilon
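
The steps above can be sketched in plain Python roughly as follows. The function names, the toy engine data (heat, vibration), and the value of epsilon are all made up for illustration; in practice epsilon would be chosen on a cross-validation set.

```python
import math

def fit_gaussian(X: list[list[float]]) -> tuple[list[float], list[float]]:
    """Step 2: per-feature mean and variance (1/m normalization, as in the notes)."""
    m, n = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2

def density(x: list[float], mu: list[float], sigma2: list[float]) -> float:
    """Step 3: p(x) as the product of one univariate Gaussian per feature."""
    prob = 1.0
    for xj, mj, s2 in zip(x, mu, sigma2):
        prob *= math.exp(-((xj - mj) ** 2) / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)
    return prob

def is_anomalous(x: list[float], mu: list[float], sigma2: list[float], epsilon: float) -> bool:
    """Step 4: flag x when p(x) falls below the threshold."""
    return density(x, mu, sigma2) < epsilon

# Hypothetical training set of normal engines: [heat, vibration]
X_train = [[10.0, 2.0], [10.5, 2.1], [9.5, 1.9], [10.2, 2.2], [9.8, 1.8]]
mu, sigma2 = fit_gaussian(X_train)
epsilon = 1e-3  # assumed threshold, not tuned

typical_flag = is_anomalous([10.0, 2.0], mu, sigma2, epsilon)
outlier_flag = is_anomalous([14.0, 5.0], mu, sigma2, epsilon)
```

An engine sitting at the training mean gets a high p(x) and is not flagged, while one far from the mean in both features gets a p(x) many orders of magnitude below epsilon.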


## Building an Anomaly Detection System
### Developing and Evaluating an Anomaly Detection System
- How do we evaluate our algorithm (how well it performs at detecting anomalies)?
- Assume we have labeled data, y=0 if normal, y=1 if anomalous
- We will likely have very few anomalous examples
- Only put good (normal) examples in the training set. Put half of your anomalous examples in CV, half in test.
- On CV and test, predict y=1 if p(x) < epsilon (anomaly), y=0 if p(x) >= epsilon (normal)

Possible evaluation metrics: 
- true positive, false positive, false neg, true neg
- Precision/Recall
- F1-score

Since the data is very skewed, classification accuracy is not a good metric. An algorithm that always predicts y=0 (normal) would score very high accuracy while never catching a single anomaly.

- Note that we can also use the cross-validation set to choose the parameter epsilon, e.g. by picking the value that gives the best F1-score on CV.
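
One way to pick epsilon on the cross-validation set is to try a few candidate values and keep the one with the best F1-score. This is only a sketch: the helper names, the CV densities, and the candidate list below are invented for the example.

```python
def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """F1 from true positives, false positives and false negatives (y=1 is an anomaly)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision/recall undefined or zero; treat F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_cv: list[float], y_cv: list[int], candidates: list[float]) -> tuple[float, float]:
    """Return the candidate epsilon with the highest F1 on the CV set."""
    best_eps, best_f1 = candidates[0], -1.0
    for eps in candidates:
        y_pred = [1 if p < eps else 0 for p in p_cv]
        score = f1_score(y_cv, y_pred)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

# Hypothetical CV set: p(x) for each example and its label (2 anomalies out of 8)
p_cv = [0.30, 0.25, 0.001, 0.20, 0.0005, 0.02, 0.28, 0.22]
y_cv = [0, 0, 1, 0, 1, 0, 0, 0]
best_eps, best_f1 = select_epsilon(p_cv, y_cv, [1e-4, 1e-3, 1e-2, 1e-1])

# Baseline that always predicts "normal": decent accuracy, useless F1
always_normal = [0] * len(y_cv)
baseline_accuracy = sum(1 for t, p in zip(y_cv, always_normal) if t == p) / len(y_cv)
baseline_f1 = f1_score(y_cv, always_normal)
```

On this toy data the always-normal baseline is right 75% of the time but has an F1 of 0, which is the reason accuracy is misleading on skewed classes.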

### Anomaly Detection vs Supervised Learning
- When to use which?
- Anomaly:
- - When there is a very small number of positive examples (0-20 is common)
- - If there are many different "types" of anomalies. It is hard for any algorithm to learn from positive examples what the anomalies look like.
- - Common cases: fraud detection, manufacturing, machine monitoring
- Supervised learning:
- - When there is a large number of both positive and negative examples
- - When there are enough positive examples (anomalies) to learn what they might look like.
- - Common cases: spam classification, weather prediction

### Choosing what features to use
- If features are non-Gaussian, the algorithm will often work just fine, but we would prefer to transform the data so it looks more Gaussian.
- - for instance, we can take a log(x) transformation.
- - you can also take x^a where a is some fraction (e.g. 1/2 or 1/3).
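
A rough way to check whether a transformation helps is to compare skewness before and after (a symmetric, Gaussian-looking feature has skewness near 0). The data below is made up, and skewness is just one convenient check I am assuming here; looking at a histogram works too.

```python
import math

def skewness(values: list[float]) -> float:
    """Population skewness: 0 for a symmetric distribution, > 0 for a long right tail."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m
    return sum((v - mean) ** 3 for v in values) / m / var ** 1.5

# Hypothetical right-skewed feature (all positive, so log is defined)
raw = [1.0, 2.0, 2.0, 3.0, 4.0, 8.0, 20.0, 55.0]
logged = [math.log(v) for v in raw]  # log(x) transformation
rooted = [v ** 0.5 for v in raw]     # x^a with a = 1/2

skew_raw = skewness(raw)
skew_root = skewness(rooted)
skew_log = skewness(logged)
```

For this sample the square root reduces the skew somewhat and the log reduces it more, so the log-transformed feature is the better fit for a Gaussian model.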

- Look at the data and see if it inspires you to create new features.
- Prefer features that take on unusually large or unusually small values in the event of an anomaly

## Quiz
4/5 (80%)


# Week 9 - Recommender Systems
## Predicting Movie Ratings
### Problem Formulation
- Recommender systems are some of the most important (i.e. commonly used) applications of machine learning
- Responsible for substantial fractions of Amazon's, Netflix's, and other companies' revenues.
- Interestingly, the problem doesn't receive a lot of attention in academic literature.

Values in a recommender systems problem:
nu = # of users
nm = # of movies
r(i,j) = 1 if user j has rated movie i (0 otherwise)
y(i,j) = rating given by user j to movie i (only defined when r(i,j) = 1). Ranges from 0 to 5

### Content Based Recommendations
- Define a set of features which represent movie "content". Each movie i has its own feature vector x(i) holding its value for each feature.
- n = # of features

To make predictions, we could do the following:
1. for each user j, learn parameter vector theta(j). (we'll get to how this can be done)
2. Predict user j as rating movie i with theta(j)' * x(i) stars

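Then we just do usual linear regression (using least squares), except we sum the objective over all users. I.e. two sums - one across all users j, one across the movies i that user j has rated (r(i,j) = 1).
 - Add in regularization on each theta(j) too (lambda/2 * the sum of its squared parameters), to keep the parameters small.
 
Then do gradient descent: repeatedly update each theta(j) by subtracting the learning rate * the derivative of our objective with respect to theta(j).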
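
A small sketch of that procedure in plain Python, under some simplifying assumptions of mine: there is no intercept feature, every parameter is regularized, and the movie features, ratings, lambda, and learning rate are all made up.

```python
def predict(theta_j: list[float], x_i: list[float]) -> float:
    """Predicted rating theta(j)' * x(i)."""
    return sum(t * x for t, x in zip(theta_j, x_i))

def cost(Theta, X, ratings, lam):
    """Regularized squared error over rated (movie i, user j) pairs only."""
    err = sum((predict(Theta[j], X[i]) - y) ** 2 for (i, j), y in ratings.items())
    reg = sum(t * t for theta_j in Theta for t in theta_j)
    return 0.5 * err + 0.5 * lam * reg

def gradient_step(Theta, X, ratings, lam, alpha):
    """One batch gradient descent update of every theta(j); X stays fixed."""
    grads = [[lam * t for t in theta_j] for theta_j in Theta]
    for (i, j), y in ratings.items():
        e = predict(Theta[j], X[i]) - y
        for k, x_ik in enumerate(X[i]):
            grads[j][k] += e * x_ik
    return [[t - alpha * g for t, g in zip(theta_j, g_j)]
            for theta_j, g_j in zip(Theta, grads)]

# Hypothetical data: features are [romance, action]; keys are (movie i, user j)
X = [[0.9, 0.1], [1.0, 0.0], [0.1, 1.0], [0.0, 0.9]]
ratings = {(0, 0): 5, (1, 0): 5, (2, 0): 0,   # user 0 likes romance
           (0, 1): 0, (2, 1): 5, (3, 1): 5}   # user 1 likes action
Theta = [[0.0, 0.0], [0.0, 0.0]]
lam, alpha = 0.01, 0.1

initial_cost = cost(Theta, X, ratings, lam)
for _ in range(500):
    Theta = gradient_step(Theta, X, ratings, lam, alpha)
final_cost = cost(Theta, X, ratings, lam)

# Prediction for a movie user 0 has not rated (movie 3, an action movie)
unseen_prediction = predict(Theta[0], X[3])
```

After training, user 0's parameters put most of their weight on the romance feature, so the predicted rating for the unrated action movie comes out low.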

## Collaborative Filtering
### Collaborative Filtering
- Algorithms can learn for themselves what features to use.
- This requires us to build the theta matrix beforehand - possibly by asking each user what types of movies they like. Then we can look at what movies they have rated and reverse engineer the features out of those movies.

In mathematical terms, what feature vector should x(1) be so that theta(1)' * x(1) ~= 5 and theta(2)' * x(1) ~= 5 and theta(3)' * x(1) ~= 0 and theta(4)' * x(1) ~= 0 ?

- Optimization algorithm:
Given theta(1), ..., theta(nu), to learn x(i):

min over x(i) of: (1/2) * SUM(j:r(i,j)=1) (theta(j)' * x(i) - y(i,j))^2 + (lambda / 2) * SUM(k=1:n) (xk(i))^2

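- Using just ratings by users, you can randomly guess the initial thetas, and then learn the features.
- - Wait, really? How. By alternating: use the guessed thetas to learn x, use that x to learn better thetas, then better x, and so on.
- Also, this only works well if users are very active and the ratings matrix is fairly dense (each user has rated several movies and each movie has several ratings).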

### Collaborative Filtering Algorithm
- An algorithm that can solve for theta and x simultaneously.
- Dear god, that's a lot of summation.
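
To make the simultaneous version concrete, here is a minimal sketch that updates every x(i) and theta(j) in the same gradient step. The combined cost is the squared error over rated pairs plus regularization on both X and Theta. The ratings, number of features, lambda, learning rate, and iteration count are all assumptions chosen for a tiny example.

```python
import random

def dot(a: list[float], b: list[float]) -> float:
    return sum(p * q for p, q in zip(a, b))

def cf_cost(X, Theta, ratings, lam):
    """Squared error over rated (movie i, user j) pairs + regularization on X and Theta."""
    err = sum((dot(Theta[j], X[i]) - y) ** 2 for (i, j), y in ratings.items())
    reg = sum(v * v for row in X + Theta for v in row)
    return 0.5 * err + 0.5 * lam * reg

def cf_step(X, Theta, ratings, lam, alpha):
    """Update every x(i) and theta(j) at once, using gradients from the old values."""
    gx = [[lam * v for v in row] for row in X]
    gt = [[lam * v for v in row] for row in Theta]
    for (i, j), y in ratings.items():
        e = dot(Theta[j], X[i]) - y
        for k in range(len(X[i])):
            gx[i][k] += e * Theta[j][k]
            gt[j][k] += e * X[i][k]
    new_X = [[v - alpha * g for v, g in zip(row, g_row)] for row, g_row in zip(X, gx)]
    new_Theta = [[v - alpha * g for v, g in zip(row, g_row)] for row, g_row in zip(Theta, gt)]
    return new_X, new_Theta

random.seed(0)
n_movies, n_users, n_features = 4, 3, 2
# Hypothetical ratings keyed by (movie i, user j); missing keys are unrated
ratings = {(0, 0): 5, (0, 1): 5, (0, 2): 0,
           (1, 0): 5, (1, 2): 0,
           (2, 0): 0, (2, 1): 0, (2, 2): 5,
           (3, 1): 0, (3, 2): 5}
# Small random initial values (all zeros would give zero gradients for the error term)
X = [[random.uniform(-0.1, 0.1) for _ in range(n_features)] for _ in range(n_movies)]
Theta = [[random.uniform(-0.1, 0.1) for _ in range(n_features)] for _ in range(n_users)]
lam, alpha = 0.01, 0.01

initial_cost = cf_cost(X, Theta, ratings, lam)
for _ in range(3000):
    X, Theta = cf_step(X, Theta, ratings, lam, alpha)
final_cost = cf_cost(X, Theta, ratings, lam)
```

The learned features have no predefined meaning such as "romance" or "action". The algorithm just finds whatever values make the known ratings fit well.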

## Low Rank Matrix Factorization
### Vectorization: Low Rank Matrix Factorization
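X = a matrix where every row is the feature vector x(i)' of a movie, and every column is one feature across all movies.
Theta = a matrix where every row is the parameter vector theta(j)' for a person, and every column is one parameter across all people.

Predicted ratings: Y = X * Theta' (element (i,j) is theta(j)' * x(i), giving an nm x nu matrix)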

- It's easy to measure how similar two movies are. We can just take the distance ||x(i) - x(k)|| between the feature vectors for two movies; a small distance means the movies are similar.
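
Both ideas fit in a few lines of plain Python. The feature and parameter values below are invented, and `predict_all` / `most_similar` are hypothetical helper names.

```python
import math

def predict_all(X: list[list[float]], Theta: list[list[float]]) -> list[list[float]]:
    """Y[i][j] = theta(j)' * x(i), i.e. the matrix product X * Theta'."""
    return [[sum(xk * tk for xk, tk in zip(x_i, theta_j)) for theta_j in Theta]
            for x_i in X]

def most_similar(X: list[list[float]], i: int) -> int:
    """Index of the movie whose feature vector is closest (Euclidean) to movie i's."""
    others = [k for k in range(len(X)) if k != i]
    return min(others, key=lambda k: math.dist(X[i], X[k]))

# Hypothetical learned values: 3 movies x 2 features, 2 users x 2 parameters
X = [[0.9, 0.0], [1.0, 0.1], [0.1, 1.0]]
Theta = [[5.0, 0.0], [0.0, 5.0]]

Y_pred = predict_all(X, Theta)  # 3 x 2 matrix of predicted ratings
closest_to_first = most_similar(X, 0)
```

Movies 0 and 1 have nearly identical feature vectors, so each is the other's nearest neighbor, and both get similar predicted ratings from every user.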

### Implementation Detail: Mean Normalization
- What if there is a user who hasn't rated any movies?
- With no ratings, only the regularization term affects this user's theta, so theta = 0 and we will predict a rating of 0 for every movie.
- - This is not necessarily helpful. What we can do instead is find the average rating of each movie across the users who rated it, and then shift Y so that each element is centered around the mean for that movie (subtract the mean from the value at Y).
- - - This means that each movie in Y has mean 0.
- - - When predicting, add the mean back: theta(j)' * x(i) + mu(i). A user with no ratings is then predicted to give each movie its average rating.
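
A short sketch of mean normalization on a toy ratings matrix. The matrix values are made up, and `R` is the r(i,j) indicator from the problem formulation stored as a matrix.

```python
def mean_normalize(Y: list[list[float]], R: list[list[int]]):
    """Subtract each movie's mean rating, using only entries with R[i][j] == 1."""
    means, Y_norm = [], []
    for y_row, r_row in zip(Y, R):
        rated = [y for y, r in zip(y_row, r_row) if r == 1]
        mu = sum(rated) / len(rated) if rated else 0.0
        means.append(mu)
        # Unrated entries are left at 0; they are ignored by the cost anyway
        Y_norm.append([y - mu if r == 1 else 0.0 for y, r in zip(y_row, r_row)])
    return Y_norm, means

# Hypothetical data: rows are movies, columns are users; user 2 has rated nothing
Y = [[5.0, 4.0, 0.0],
     [1.0, 2.0, 0.0],
     [4.0, 5.0, 0.0]]
R = [[1, 1, 0],
     [1, 1, 0],
     [1, 1, 0]]
Y_norm, means = mean_normalize(Y, R)

# A user with no ratings ends up with theta = 0, so the prediction is 0 + mean
new_user_predictions = [0.0 + mu for mu in means]
```

With the means added back, the user who has rated nothing is predicted to give each movie its average rating rather than 0.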

## Quiz
Wow this quiz was hard... First try though.
5/5 (100%)