# Hi!

Today we're going to dive into the second type of Machine Learning, which is **unsupervised learning**. 
We're going to create algorithms that learn from the data itself, without any need for us to label it. Let's go!

In [None]:
# imports
import numpy as np
import matplotlib.pyplot as plt
import solutions
import time

from ipywidgets import interact, fixed
import ipywidgets as widgets

from scipy.stats import norm, multivariate_normal
import cv2

np.random.seed(int(time.time()))
%matplotlib inline

## Use case 1: Clustering

There are many real-world cases where we need to divide a dataset into subgroups, be it people, products or website articles. Sometimes there is no obvious way to do that. Fortunately, there are algorithms that can figure it out for us!

In [None]:
points = solutions.generate_mess(point_range=100)
X = points[:,0]
Y = points[:,1]

plt.scatter(X,Y, marker='.')
plt.show()

### K-means algorithm

The only thing we have to choose is the number of clusters we want.

In [None]:
num_centroids = 3

* We initialize K cluster 'centroids' **randomly** in the space of the examples (represented by their features).
$$centroids = \mu_1, \mu_2, \ldots, \mu_K$$
* repeatedly:
    * assign every example to the cluster whose centroid is closest to it
    * move every centroid to the mean of the examples in its cluster

You can repeat the above steps either for a fixed number of iterations or until the algorithm converges (the centroids barely move between iterations).

In [None]:
centroids = np.random.rand(num_centroids, 2) * 100 
plt.scatter(X, Y)
plt.scatter(centroids[:,0], centroids[:,1], marker='.')
plt.show()

In [None]:
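def k_means_iteration(centroids, points):
    # implement me!
    # return the new positions of the centroids and, for every centroid,
    # the indices of the points assigned to it (see how it is called below)
    raise NotImplementedError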

In [None]:
k_means_iteration = solutions.k_means_iteration
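
For comparison with your own attempt, one iteration of the two steps listed earlier (assign, then re-average) could be sketched as below. The name `k_means_step_sketch` is hypothetical, and `solutions.k_means_iteration` may well be implemented differently; the only thing taken from this notebook is the calling convention `(centroids, points) -> (centroids, clusters)`. Leaving the centroid of an empty cluster where it was is an assumption of this sketch.

```python
import numpy as np

def k_means_step_sketch(centroids, points):
    """One K-means iteration: assign points, then recompute centroids."""
    # distances[i, k] = Euclidean distance from point i to centroid k
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)          # index of the nearest centroid per point

    new_centroids = centroids.astype(float)    # astype returns a copy
    clusters = []
    for k in range(len(centroids)):
        members = np.flatnonzero(labels == k)  # indices of the points in cluster k
        clusters.append(members)
        if members.size > 0:                   # assumption: empty clusters stay put
            new_centroids[k] = points[members].mean(axis=0)
    return new_centroids, clusters
```

Each entry of `clusters` is an index array, so it can be used as `X[c]` and `Y[c]` in the plotting loop of the next cell.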

In [None]:
centroids, clusters = k_means_iteration(centroids, points)
for c in clusters:
    plt.scatter(X[c], Y[c])
plt.scatter(centroids[:, 0], centroids[:, 1], color='black', marker='.')
plt.show()


In [None]:
num_centroids = 5
centroids = np.random.rand(num_centroids, 2) * 100
interact(solutions.demonstrate_k_means,
        k_means_iter=fixed(k_means_iteration),
        datapoints=fixed(points),
        centroids=fixed(centroids),
        num_iterations=widgets.IntSlider(min=0,max=20,step=1,value=0)
        )
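
The slider runs a fixed number of iterations. The other stopping rule mentioned earlier, iterating until the centroids barely move, might look like the sketch below. `run_until_converged`, `tol` and `max_iter` are illustrative names that are not part of `solutions`, and the inner `_step` is only a compact stand-in for `k_means_iteration`.

```python
import numpy as np

def _step(centroids, points):
    # Assign every point to its nearest centroid, then re-average each cluster.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    updated = centroids.astype(float)
    for k in range(len(centroids)):
        if np.any(labels == k):
            updated[k] = points[labels == k].mean(axis=0)
    return updated

def run_until_converged(centroids, points, tol=1e-4, max_iter=100):
    """Iterate until the largest centroid shift drops below tol (or max_iter is hit)."""
    for iteration in range(1, max_iter + 1):
        updated = _step(centroids, points)
        shift = np.linalg.norm(updated - centroids, axis=1).max()
        centroids = updated
        if shift < tol:
            break
    return centroids, iteration

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs, purely for demonstration.
demo_points = np.vstack([rng.normal(0, 1, size=(50, 2)),
                         rng.normal(20, 1, size=(50, 2))])
start = np.array([[5.0, 5.0], [15.0, 15.0]])
final_centroids, n_iter = run_until_converged(start, demo_points)
```

On data this clean the loop stops almost immediately; on messier data `max_iter` acts as a safety net.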

## Use Case 2: Image compression

Could we use the same method to find a number of RGB values that can be used to represent the image?
<img src="img/phoenix.jpg" alt="Name a better comic, I'll wait" style="width: 200px;"/>



In [None]:
img = cv2.imread('img/phoenix.jpg')
print(img.shape)
# cv2 loads images in BGR order, while plt.imshow expects RGB, so we swap the channels
plt.imshow(img[:,:,[2,1,0]])

In [None]:
# number of colours we want to keep
num_centroids = 8
centroids = np.random.rand(num_centroids, 3) * 255

In [None]:
interact(solutions.k_means_img_compression,
        k_means_iter=fixed(k_means_iteration),
        image=fixed(img),
        centroids=fixed(centroids),
        num_iterations=widgets.IntSlider(min=0,max=20,step=1,value=0)
        )
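
What the compression step boils down to can be sketched independently of `solutions.k_means_img_compression`, whose internals are not shown here: flatten the image into a list of colour triples, map every pixel to its nearest centroid, and reshape back. The function name `quantize_image` and the tiny synthetic image are made up for illustration.

```python
import numpy as np

def quantize_image(image, centroids):
    """Replace every pixel with the nearest centroid colour."""
    height, width, channels = image.shape
    pixels = image.reshape(-1, channels).astype(float)     # shape (H*W, 3)
    dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)                          # palette index per pixel
    compressed = centroids[nearest].reshape(height, width, channels)
    return np.clip(np.rint(compressed), 0, 255).astype(np.uint8), nearest

# A 2x2 synthetic image: two dark pixels and two bright ones.
demo_img = np.array([[[10, 10, 10], [20, 20, 20]],
                     [[240, 240, 240], [250, 250, 250]]], dtype=np.uint8)
palette = np.array([[15.0, 15.0, 15.0], [245.0, 245.0, 245.0]])
compressed, indices = quantize_image(demo_img, palette)
```

With 8 centroids, a pixel can be stored as a 3-bit palette index instead of three 8-bit channels, which is where the saving comes from. Note that the distance array grows with `H * W * K`, so a large photo may need to be processed in chunks.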

## Use Case 3: Anomaly detection

Look at the datapoints below. Most of them are centered around a certain value, but some lie farther away than the others:

In [None]:
points = np.random.normal(size=(200, 2), loc=(-2,2), scale=2)
X_1 = points[:,0] 
X_2 = points[:,1] 

In [None]:
def show_messed_up(fun=(lambda X: X), X_1=X_1, X_2=X_2):
    plt.scatter(fun(X_1), fun(X_2))
    plt.show()
    # `normed` was removed from matplotlib; `density=True` is its replacement
    plt.hist(fun(X_1), 20, density=True)
    plt.hist(fun(X_2), 20, density=True)
    plt.show()

In [None]:
show_messed_up()

In this case, as in many real-life cases, X_1 and X_2 follow a Gaussian distribution (and if at first glance they don't, there is often some function of them that does).

In [None]:
show_messed_up(lambda X: X **2)

A Gaussian distribution is defined by two parameters:

Mean:

$$
\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}
$$

Standard deviation:

$$
\sigma_j = \sqrt{\frac{1}{m}\sum_{i=1}^m(x_j^{(i)} - \mu_j)^2}
$$

Of course, we calculate those parameters separately for every feature of X.

Having calculated those parameters, we can then evaluate the probability density of every datapoint:

$$
p(x) = \prod_{j=1}^n \frac{1}{\sqrt{2 \pi} \sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2 \sigma^2_j}\right)
$$
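
As a sanity check, the per-feature parameters and the product of densities can be computed by hand and compared with `scipy.stats.norm`. This sketch uses the standard Gaussian density, whose normalising constant is $\sqrt{2\pi}\,\sigma_j$, on a small synthetic sample; all variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sample = rng.normal(loc=(-2, 2), scale=2, size=(200, 2))   # synthetic data, m=200, n=2

m = sample.shape[0]
mu = sample.sum(axis=0) / m                                 # mean of each feature
sigma = np.sqrt(((sample - mu) ** 2).sum(axis=0) / m)       # std of each feature (1/m version)

# Density of every point under each feature's Gaussian, then the product over features.
per_feature = np.exp(-(sample - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
p = per_feature.prod(axis=1)

# The same quantity via scipy, for comparison.
p_scipy = norm.pdf(sample, loc=mu, scale=sigma).prod(axis=1)
```

`sigma` here matches `sample.std(axis=0)`, because NumPy's `std` divides by `m` by default (`ddof=0`).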

In [None]:
# thankfully, numpy has implemented that for us
means = points.mean(axis=0)
stds = points.std(axis=0)
means, stds

In [None]:
probabilities = norm.pdf(points, loc=means, scale=stds)

plt.hist(X_1, 30, density=True)
plt.hist(X_2, 30, density=True)
plt.scatter(X_1, probabilities[:, 0], color='b')
plt.scatter(X_2, probabilities[:, 1], color='r')
plt.show()

In [None]:
prob_mul = probabilities.prod(axis=1)
plt.hist(prob_mul,20)

In [None]:
threshold = 0.01
valid_indices = np.argwhere(prob_mul > threshold)
invalid_indices = np.argwhere(prob_mul <= threshold)

plt.scatter(X_1[valid_indices], X_2[valid_indices])
plt.scatter(X_1[invalid_indices], X_2[invalid_indices])

plt.show()
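
The value `0.01` was picked by eye from the histogram. One alternative, which is an assumption of this sketch rather than something this notebook prescribes, is to flag a fixed fraction of the least likely points using a percentile. `flag_fraction` is a made-up parameter name, and the data is a synthetic stand-in for `points`.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=(-2, 2), scale=2, size=(200, 2))     # synthetic stand-in for `points`

# Product of the per-feature Gaussian densities, as in the cells above.
density = norm.pdf(data, loc=data.mean(axis=0), scale=data.std(axis=0)).prod(axis=1)

flag_fraction = 0.05                                        # flag the 5% least likely points
threshold = np.percentile(density, 100 * flag_fraction)
is_anomaly = density <= threshold                           # boolean mask, one entry per point
```

A boolean mask such as `is_anomaly` can index the data directly (`data[is_anomaly]`, `data[~is_anomaly]`), which keeps the arrays one-dimensional.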

### What if the features are not independent?
Let's take a look at objects described by features that are correlated.

In [None]:
X_1 = np.linspace(1, 10, 100)
# build X_2 from X_1 (not from the earlier clustering variable X)
X_2 = 2 * X_1 + np.random.normal(size=X_1.shape, scale=5)
plt.scatter(X_1, X_2) 
plt.show()
points = np.c_[X_1, X_2]

There are only two features here and you can see that there seems to be a relation between them. However, while most of the examples adhere to this trend, there are also some more anomalous ones.

### Enter multivariate Gaussian distribution!

The most basic version of the Gaussian distribution was described by a mean and a standard deviation for each feature.

Now we'll extend it a bit so that the mathematical model can learn to recognize relations between features.


Mean remains the same:

$$
\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}
$$

Instead of the standard deviations, we'll now use the **covariance** matrix in the computations:

$$
\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu) (x^{(i)} - \mu)^T
$$
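
The covariance formula maps directly onto a matrix product. The sketch below, on synthetic correlated data with illustrative names, checks it against `np.cov` with `bias=True`, which uses the same $\frac{1}{m}$ normalisation.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = np.linspace(1, 10, 100)
x2 = 2 * x1 + rng.normal(scale=5, size=x1.shape)
data = np.c_[x1, x2]                       # shape (m, n) = (100, 2)

m = data.shape[0]
mu = data.mean(axis=0)
centered = data - mu
sigma = centered.T @ centered / m          # sum of outer products, divided by m

# np.cov treats rows as variables by default, hence rowvar=False.
sigma_np = np.cov(data, rowvar=False, bias=True)
```

The diagonal of `sigma` holds the per-feature variances, and the off-diagonal entry is what captures the relation between the two features.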

The new formula for the probability density is:

$$
p(x) = \dfrac{1}{\sqrt{(2 \pi)^n |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1}(x - \mu)\right)
$$
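
To check the multivariate density numerically, it can be evaluated directly and compared with `scipy.stats.multivariate_normal`. The data and names below are illustrative, and `np.linalg.solve` stands in for the explicit inverse $\Sigma^{-1}$.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
x1 = np.linspace(1, 10, 100)
data = np.c_[x1, 2 * x1 + rng.normal(scale=5, size=x1.shape)]

m, n = data.shape
mu = data.mean(axis=0)
centered = data - mu
sigma = centered.T @ centered / m

# Quadratic form (x - mu)^T Sigma^{-1} (x - mu) for every row at once.
solved = np.linalg.solve(sigma, centered.T).T             # rows are Sigma^{-1} (x - mu)
mahalanobis_sq = np.einsum('ij,ij->i', centered, solved)

normaliser = np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))
p = np.exp(-0.5 * mahalanobis_sq) / normaliser

# The same density via scipy, for comparison.
p_scipy = multivariate_normal.pdf(data, mean=mu, cov=sigma)
```

Note that there is one density value per datapoint and no product over features: the covariance matrix already handles all features jointly.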


In [None]:
means = points.mean(axis=0)    
# divide by the number of examples m (len(points)), not by the number of features
cov = (1 / len(points)) * ((points - means).T @ (points - means))
# scipy <3
prob_mul = multivariate_normal.pdf(points, means, cov)
prob_mul.shape

In [None]:
plt.hist(prob_mul)
plt.show()

In [None]:
threshold = 0.00024
valid_indices = np.argwhere(prob_mul > threshold)
invalid_indices = np.argwhere(prob_mul <= threshold)

plt.scatter(X_1[valid_indices], X_2[valid_indices])
plt.scatter(X_1[invalid_indices], X_2[invalid_indices])

plt.show()
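
To make the difference between the two models concrete, the sketch below scores a point whose coordinates are each unremarkable on their own but which lies far off the trend line. All data and names here are synthetic and illustrative, and the noise is deliberately small so that the effect is easy to see.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
x1 = np.linspace(1, 10, 100)
x2 = 2 * x1 + rng.normal(scale=0.5, size=x1.shape)        # tightly correlated synthetic data
data = np.c_[x1, x2]

mu = data.mean(axis=0)
std = data.std(axis=0)
cov = np.cov(data, rowvar=False, bias=True)

suspect = np.array([2.0, 18.0])   # each coordinate is typical, the combination is not

# Densities relative to the density at the mean, so the two models are comparable.
independent_ratio = (norm.pdf(suspect, mu, std).prod()
                     / norm.pdf(mu, mu, std).prod())
multivariate_ratio = (multivariate_normal.pdf(suspect, mu, cov)
                      / multivariate_normal.pdf(mu, mu, cov))
```

The independent-feature model rates `suspect` as fairly ordinary, while the multivariate model assigns it a vanishingly small relative density, because it knows that a low `x1` should go together with a low `x2`.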

# TODO

## Use Case 5: Recommendations

## Use Case 6: GANs