In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


## Similarities and outliers

* Unsupervised learning
* Similarity measures
  * Cosine similarity
  * Recommender systems
* Outlier detection

## Unsupervised learning

* Data without labels

## Similarity measures 

*What belongs together?*

or *how close are things?*

## Distance metric

What are we measuring?

Euclidean distance: $$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

In [None]:
def euclidean_distance(x, y):
    # Square root of the summed squared differences between two arrays
    return np.sqrt(((x - y) ** 2).sum())

In [None]:
x = np.random.sample((10, ))
y = np.random.sample((10, ))

In [None]:
x

In [None]:
euclidean_distance(x, y)

In [None]:
x = np.random.sample((10000, ))
y = np.random.sample((10000, ))

In [None]:
euclidean_distance(x, y)
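
The distance jumps between the two runs above because every extra dimension adds a non-negative term under the square root. As a rough sketch, assuming both vectors are drawn uniformly from [0, 1) (which is what `np.random.sample` does), each squared difference averages 1/6, so the distance should grow roughly like $\sqrt{n/6}$. The seed and the list of sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable


def euclidean_distance(x, y):
    return np.sqrt(((x - y) ** 2).sum())


distances = {}
for n in (10, 100, 1000, 10000):
    x = rng.random(n)  # uniform samples on [0, 1)
    y = rng.random(n)
    distances[n] = euclidean_distance(x, y)
    # Compare the measured distance with the sqrt(n / 6) estimate
    print(n, round(distances[n], 2), round(np.sqrt(n / 6), 2))
```

This is one reason raw Euclidean distances are hard to compare across datasets with different numbers of features.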

## Cosine similarity

$$\text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i  B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }$$

## Cosine

![](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/TrigFunctionDiagram.svg/589px-TrigFunctionDiagram.svg.png)


In [None]:
import math
math.cos(0)

In [None]:
math.cos(1)

In [None]:
math.cos(math.pi / 2)

In [None]:
math.cos(math.pi)

## Dot product


$$\mathbf{a}\cdot\mathbf{b}=\left\|\mathbf{a}\right\|\left\|\mathbf{b}\right\|\cos\theta$$
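
Rearranging the identity above gives the angle between two vectors. A minimal sketch with two hand-picked 2D vectors (chosen here only for illustration) that sit 45 degrees apart:

```python
import numpy as np

a = np.array([1.0, 0.0])  # points along the x axis
b = np.array([1.0, 1.0])  # points along the diagonal

# cos(theta) = a . b / (|a| |b|)
cos_theta = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Clip guards against tiny floating-point overshoot outside [-1, 1]
theta_degrees = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

print(cos_theta, theta_degrees)
```

The cosine comes out at about 0.707 and the angle at about 45 degrees.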

## Cosine similarity

$$\text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i  B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }$$

In [None]:
x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity([x], [y])
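
Unlike Euclidean distance, cosine similarity depends only on direction, not on length. The sketch below uses a small hypothetical helper `cosine` (the same formula as the cells above) to show that scaling a vector leaves the similarity at 1, while flipping its sign gives -1.

```python
import numpy as np


def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))


u = np.array([1.0, 2.0, 3.0])
v = 10 * u  # same direction, ten times longer
w = -u      # opposite direction

same_direction = cosine(u, v)
opposite = cosine(u, w)
euclidean = np.linalg.norm(u - v)  # large, even though u and v are parallel

print(same_direction, opposite, euclidean)
```

For two vectors of uniform samples on [0, 1), such as the `x` and `y` above, all components are non-negative, so the similarity stays between 0 and 1 and should land near 0.75 for long vectors.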

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    'A New Hope',
    'The Empire Strikes Back',
    'Return of the Jedi',
    'The Phantom Menace',
    'Attack of the Clones',
    'Revenge of the Sith',
    'The Force Awakens',
    'A Star Wars Story',
    'The Last Jedi',
    ]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)

In [None]:
cosine_similarity(X)

In [None]:
plt.imshow(cosine_similarity(X))
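
A similarity matrix like this is the core of a simple recommender: for a given item, suggest the most similar other item. The pure-Python sketch below spells out what the two library calls compute. It only approximates `CountVectorizer`'s defaults (lowercasing, keeping tokens of two or more word characters), and `tokenize`, `cosine` and `most_similar` are hypothetical helper names, not part of scikit-learn.

```python
import math
import re
from collections import Counter

titles = [
    'A New Hope',
    'The Empire Strikes Back',
    'Return of the Jedi',
    'The Phantom Menace',
    'Attack of the Clones',
    'Revenge of the Sith',
    'The Force Awakens',
    'A Star Wars Story',
    'The Last Jedi',
]


def tokenize(title):
    # Word counts; single-character tokens such as "A" are dropped
    return Counter(re.findall(r"\b\w\w+\b", title.lower()))


def cosine(c1, c2):
    # Cosine similarity between two word-count vectors
    dot = sum(c1[t] * c2[t] for t in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)


def most_similar(query):
    # Return the other title with the highest cosine similarity
    q = tokenize(query)
    others = [t for t in titles if t != query]
    return max(others, key=lambda t: cosine(q, tokenize(t)))


print(most_similar('Return of the Jedi'))
```

With these word counts, 'Return of the Jedi' is matched to 'The Last Jedi', because the two share both "the" and "jedi". Titles with no words in common, such as 'A New Hope' and 'A Star Wars Story', get a similarity of 0.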

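## Outlier detection

Outlier detection identifies data points that deviate markedly from the rest of the dataset.

It is useful when a dataset contains problematic data, such as measurement errors or corrupted records.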

## Outlier trimming

Step 1: Figure out what is 'abnormal'

In [None]:
import pandas as pd

df = pd.read_csv('science.csv')

In [None]:
df.plot.scatter(x=1, y=2)

Step 2: Find a way to narrow in on 'normal'

In [None]:
from sklearn.neighbors import LocalOutlierFactor

detector = LocalOutlierFactor()
detector.fit_predict(df)

In [None]:
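detector = LocalOutlierFactor(n_neighbors=10)
# fit_predict labels inliers as 1 and outliers as -1
labels = detector.fit_predict(df)
# Keep only the inliers, i.e. trim the outliers
df[labels > 0]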
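
`fit_predict` returns 1 for inliers and -1 for outliers, so a mask such as `labels > 0` keeps the 'normal' rows. Because `science.csv` is not reproduced here, the sketch below uses synthetic data instead: a 2D Gaussian cluster plus one deliberately distant point. The seed, cluster size and far point are all assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# 100 'normal' points around the origin, plus one obvious outlier
cluster = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
far_point = np.array([[10.0, 10.0]])
X = np.vstack([cluster, far_point])

detector = LocalOutlierFactor(n_neighbors=10)
labels = detector.fit_predict(X)  # 1 = inlier, -1 = outlier

trimmed = X[labels > 0]  # keep only the inliers

# More negative scores mean a point is more isolated from its neighbours
scores = detector.negative_outlier_factor_
print(labels[-1], scores[-1], len(trimmed))
```

The distant point is labelled -1 and dropped from `trimmed`. A few points on the edge of the cluster may be flagged as well, which is why the choice of `n_neighbors` is worth experimenting with.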

## Exercise

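Autoencoders can also make good outlier detectors: they learn a compressed, denoised representation of typical data, so points they reconstruct poorly are likely to be outliers.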

Follow this guide to build your own autoencoder outlier detector for credit card fraud: https://medium.com/@curiousily/credit-card-fraud-detection-using-autoencoders-in-keras-tensorflow-for-hackers-part-vii-20e0c85301bd