# Cosine Similarity

In [1]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

Cosine similarity is the cosine of the angle between two vectors, calculated as follows:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}$$
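The formula can be sketched directly in NumPy. This is only an illustration: `cosine_sim` is a hypothetical helper name, it handles a single pair of 1-D vectors, and it assumes neither vector is all zeros (the denominator would be zero otherwise).

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the Euclidean norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 2 / (sqrt(3) * sqrt(2)), roughly 0.816
score = cosine_sim([1, 1, 1], [0, 1, 1])
print(score)
```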

Within the context of this project, cosine similarity is the primary means of measuring how similar one channel is to another or, more accurately, how similar one channel is to several others. It is fast to compute, gives satisfactory results, and is a viable substitute for more computationally intensive approaches such as Alternating Least Squares and Stochastic Gradient Descent. GraphLab offers optimized implementations of these methods, but the data does not take the form of user-item relationships, nor can it be converted into a structured frame quickly enough to avoid detracting from the user experience. As such, cosine similarity seemed the best and most implementable method given a two-week time frame.

Let's try it out on some dummy data.

In [2]:
n = np.array([[1,1,1],[1,0,0],[1,0,1],[0,1,1]])
m = np.array([[0,1,1]])

In [3]:
cosine_similarity(n,m)

array([[ 0.81649658],
       [ 0.        ],
       [ 0.5       ],
       [ 1.        ]])

As we can see, the vectors in n that most resemble m receive the highest similarity scores. Note that the cosine of 0 is 1, so a perfect score occurs when two vectors point in precisely the same direction. Similarly, we get a similarity of 0 if the vectors are perpendicular (90 degrees) and -1 if they point in exactly opposite directions (180 degrees). Because our dummy vectors contain no negative entries, none of the scores above can fall below 0.
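The three landmark cases (same direction, perpendicular, opposite) can be checked with a small sketch. The vectors `ref` and `others` are made-up values chosen only to hit those three angles; they are not project data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ref = np.array([[1.0, 2.0]])
others = np.array([
    [2.0, 4.0],    # same direction as ref (a positive multiple)
    [-2.0, 1.0],   # perpendicular to ref: dot product is -2 + 2 = 0
    [-1.0, -2.0],  # exactly opposite to ref
])

scores = cosine_similarity(others, ref).ravel()
# Clip before arccos to guard against tiny floating-point overshoot past +/-1.
angles = np.degrees(np.arccos(np.clip(scores, -1.0, 1.0)))

print(scores)  # approximately 1, 0, -1
print(angles)  # approximately 0, 90, 180 degrees
```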

However, to use cosine similarity properly in a recommendation engine, we need to place the origin at our data's center of mass. In other words, we must center and scale our data set so that angles are measured from the mean of the data rather than from zero. Let's do that now and see how our results change.
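To make the scaling step concrete, here is a minimal sketch of what `StandardScaler` computes for each column: subtract the column mean, then divide by the column's population standard deviation. The name `manual` is only for illustration, and the sketch assumes no column is constant (a constant column would cause a division by zero in the hand-written version).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

n = np.array([[1, 1, 1], [1, 0, 0], [1, 0, 1], [0, 1, 1]])

# Center each column on its mean, then divide by its standard deviation.
# np.std defaults to the population estimate (ddof=0).
manual = (n - n.mean(axis=0)) / n.std(axis=0)

# Compare the hand-written result with the scaler's output.
matches = np.allclose(manual, StandardScaler().fit_transform(n))
print(matches)
```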

In [4]:
scale = StandardScaler()
scaled_n = scale.fit_transform(n)
scaled_m = scale.transform(m)

cosine_similarity(scaled_n,scaled_m)



array([[ 0.12403473],
       [-0.69230769],
       [-0.62017367],
       [ 1.        ]])

As we can see, while the ranking of our results hasn't changed, the scores themselves have changed considerably. The only score that hasn't changed is that of our perfect match, because the two vectors were identical to begin with and so remain identical after the same transformation. Note that the ranking will not always stay the same after scaling; this just happens to be a special case.
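A case where scaling does change the ranking can be sketched with two made-up candidates and one query. The values in `candidates` and `query` are hypothetical and were picked only so that the top match flips. Fitting a scaler on just two rows is purely for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

candidates = np.array([[10.0, 10.0], [1.0, 2.0]])
query = np.array([[8.0, 12.0]])

# Unscaled: the query's direction is closer to the second candidate.
raw = cosine_similarity(candidates, query).ravel()

# Scaled: the query sits on the first candidate's side of the data mean.
scaler = StandardScaler()
scaled = cosine_similarity(
    scaler.fit_transform(candidates), scaler.transform(query)
).ravel()

raw_best = int(np.argmax(raw))
scaled_best = int(np.argmax(scaled))
print(raw_best, scaled_best)
```

Before scaling, the angle from the origin favors the small candidate. After centering, what matters is which side of the mean each point falls on, and the large candidate wins.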

Within the context of the project, each score is the similarity between one channel and another, where a single channel is scored against many. Each vector therefore has a name: the name of the channel being scored against. When the vectors are ranked by similarity score, we also return the channel names so the user can see which channels are most similar.
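Pairing scores with names might look like the sketch below. The channel names are placeholders and the vectors are the dummy data from earlier, not real channel features. Sorting with `np.argsort` is one reasonable way to rank, not necessarily how the project does it.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical channel names, one per row of channel_vectors.
channel_names = np.array(["channel_a", "channel_b", "channel_c", "channel_d"])
channel_vectors = np.array([[1, 1, 1], [1, 0, 0], [1, 0, 1], [0, 1, 1]])
target = np.array([[0, 1, 1]])

scores = cosine_similarity(channel_vectors, target).ravel()

# argsort is ascending, so reverse it to list the most similar channel first.
order = np.argsort(scores)[::-1]
ranked = [(str(channel_names[i]), float(scores[i])) for i in order]

for name, score in ranked:
    print(name, round(score, 3))
```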