In [1]:
import numpy as np
import matplotlib.pyplot as plt
from utils import *

%matplotlib inline

### Estimating parameters for a Gaussian distribution

To complete the `estimate_gaussian` function, calculate `mu` (the mean of each feature in `X`) and `var` (the variance of each feature in `X`).
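Written out as formulas, the two quantities look roughly as follows. The notation is an assumption introduced here for illustration: $m$ is the number of examples, $j$ indexes the features, and $x_j^{(i)}$ is feature $j$ of example $i$.

```latex
\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)},
\qquad
\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left( x_j^{(i)} - \mu_j \right)^2
```

Note that the variance is normalized by $m$ rather than $m - 1$, which matches the code in the next cell.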

In [3]:
def estimate_gaussian(X):
    m, n = X.shape

    # Per-feature mean over the m examples
    mu = 1/m*np.sum(X, axis=0)
    # Per-feature variance, normalized by m (not m - 1)
    var = 1/m*np.sum((X - mu) ** 2, axis=0)

    return mu, var
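As a quick sanity check, the function can be run on a small hand-made array and compared against NumPy's built-ins. The array `X_toy` is hypothetical data made up for this sketch. The function is repeated so that the snippet runs on its own.

```python
import numpy as np

def estimate_gaussian(X):
    m, n = X.shape
    mu = 1/m*np.sum(X, axis=0)
    var = 1/m*np.sum((X - mu) ** 2, axis=0)
    return mu, var

# Hypothetical toy data: 4 examples, 2 features
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])

mu, var = estimate_gaussian(X_toy)

# np.var uses ddof=0 by default, i.e. it also divides by m
assert np.allclose(mu, np.mean(X_toy, axis=0))
assert np.allclose(var, np.var(X_toy, axis=0))
print(mu, var)
```

Because `np.var` defaults to `ddof=0`, it agrees with the `1/m` normalization used above.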

### Selecting the threshold epsilon

Now that you have estimated the Gaussian parameters, you can investigate which examples have a very high probability under this distribution and which have a very low probability.
* the low-probability examples are more likely to be the anomalies in the dataset
* one way to decide which examples are anomalies is to select a probability threshold using a cross-validation set
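To make "probability given this distribution" concrete, here is a minimal sketch of one way to compute it. It assumes the features are independent, so that `p(x)` is the product of per-feature Gaussian densities. The helper name `gaussian_density` and the demo values are hypothetical; the notebook may well compute these probabilities differently (for example through a helper in `utils`).

```python
import numpy as np

def gaussian_density(X, mu, var):
    """Hypothetical helper: p(x) as a product of per-feature Gaussian densities."""
    # Per-feature density for every example, shape (m, n)
    coef = 1.0 / np.sqrt(2.0 * np.pi * var)
    per_feature = coef * np.exp(-((X - mu) ** 2) / (2.0 * var))
    # Multiply across features to get one density value per example
    return np.prod(per_feature, axis=1)

# Made-up parameters and points, for illustration only
mu = np.array([0.0, 0.0])
var = np.array([1.0, 1.0])
X_demo = np.array([[0.0, 0.0],    # at the mean: high density
                   [4.0, 4.0]])   # far from the mean: low density

p = gaussian_density(X_demo, mu, var)
assert p[0] > p[1]
print(p)
```

The point far from the mean gets a much smaller density, which is what makes it a candidate anomaly.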

The `select_threshold` function finds the best threshold (the one with the highest F1 score) for flagging outliers, using the probabilities computed on the validation set (`p_val`) and the ground-truth labels (`y_val`).

* if an example `x` has a low probability `p(x) < epsilon`, then it is classified as an anomaly
* tp = number of true positives
* fp = number of false positives
* fn = number of false negatives
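These counts combine into precision, recall, and the F1 score, which is the quantity the function below maximizes over candidate values of epsilon:

```latex
\text{prec} = \frac{tp}{tp + fp},
\qquad
\text{rec} = \frac{tp}{tp + fn},
\qquad
F_1 = \frac{2 \cdot \text{prec} \cdot \text{rec}}{\text{prec} + \text{rec}}
```

Precision and recall are undefined when their denominators are zero, which happens, for instance, when a threshold flags no examples at all.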

In [4]:
def select_threshold(y_val, p_val):
    best_epsilon = 0
    best_F1 = 0
    F1 = 0

    step_size = (max(p_val) - min(p_val)) / 1000

    for epsilon in np.arange(min(p_val), max(p_val), step_size):
        predictions = (p_val < epsilon)

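        # Confusion-matrix counts; np.sum is vectorized, unlike the built-in sum
        tp = np.sum((predictions == 1) & (y_val == 1))
        fp = np.sum((predictions == 1) & (y_val == 0))
        fn = np.sum((predictions == 0) & (y_val == 1))

        # With no true positives, precision, recall, or F1 is 0/0; skip this epsilon
        if tp == 0:
            continue

        prec = tp / (tp + fp)
        rec = tp / (tp + fn)

        F1 = (2 * prec * rec) / (prec + rec)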

        if F1 > best_F1:
            best_F1 = F1
            best_epsilon = epsilon

    return best_epsilon, best_F1
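The following sketch shows how the function might be called. The arrays `p_val` and `y_val` are small made-up values, not data from this notebook. The function body is repeated in compact form so that the snippet runs on its own, and it includes a guard that skips thresholds with no true positives to avoid dividing by zero.

```python
import numpy as np

def select_threshold(y_val, p_val):
    best_epsilon = 0
    best_F1 = 0
    step_size = (np.max(p_val) - np.min(p_val)) / 1000

    for epsilon in np.arange(np.min(p_val), np.max(p_val), step_size):
        predictions = (p_val < epsilon)

        tp = np.sum((predictions == 1) & (y_val == 1))
        fp = np.sum((predictions == 1) & (y_val == 0))
        fn = np.sum((predictions == 0) & (y_val == 1))

        # No true positives: F1 would be 0/0, so skip this threshold
        if tp == 0:
            continue

        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        F1 = (2 * prec * rec) / (prec + rec)

        if F1 > best_F1:
            best_F1 = F1
            best_epsilon = epsilon

    return best_epsilon, best_F1

# Hypothetical validation set: two clear anomalies with very low probability
p_val = np.array([0.90, 0.80, 0.85, 0.70, 0.95, 0.01, 0.02, 0.75])
y_val = np.array([0, 0, 0, 0, 0, 1, 1, 0])

best_epsilon, best_F1 = select_threshold(y_val, p_val)

# Examples with p(x) below the chosen threshold are flagged as anomalies
flagged = p_val < best_epsilon
print(best_epsilon, best_F1, flagged)
```

On this toy data the two low-probability examples are cleanly separated from the rest, so some threshold classifies every example correctly. Real validation sets rarely separate this cleanly, and the best F1 score is then below 1.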