# Exercise (Gensim / vector math)

In this exercise, you'll get to do some of your own exploration of our trained movie embeddings, using some of the Gensim tools I showed in [the tutorial](#$TUTORIAL_URL(3)$). To get started, run the setup cell below to import the libraries we'll be using, load our raw embedding data, and wrap it in a `WordEmbeddingsKeyedVectors` object.

In [None]:
import os

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import tensorflow as tf
from tensorflow import keras
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

from learntools.core import binder; binder.bind(globals())
from learntools.embeddings.ex3_gensim import *

#_RM_
input_dir = '../input/movielens_preprocessed'
#_UNCOMMENT_
#input_dir = '../input/movielens-preprocessing'
#_RM_
model_dir = '.'
#_UNCOMMENT_
#model_dir = '../input/movielens-spiffy-model'
model_path = os.path.join(model_dir, 'movie_svd_model_32.h5')
model = keras.models.load_model(model_path)

emb_layer = model.get_layer('movie_embedding')
(w,) = emb_layer.get_weights()
movie_embedding_size = w.shape[1]

movies_path = os.path.join(input_dir, 'movie.csv')
all_movies_df = pd.read_csv(movies_path, index_col=0)

threshold = 100

movies = all_movies_df[all_movies_df.n_ratings >= threshold].reset_index(drop=True)

kv = WordEmbeddingsKeyedVectors(movie_embedding_size)
kv.add(
    movies['key'].values,
    w[movies.movieId]
)

## 1. Warm-up

As a warm-up, try using the `kv.most_similar` method on a few of your favourite movies. What do you think of the results? Are there any that stick out as being a bad match? Any movies that you think *should* be on the list but which aren't? 

In [None]:
# Example: one of my favourite films by Alfred Hitchcock. Try with some of your favourite movies.
kv.most_similar('Vertigo')

In [None]:
# Note: if you get a KeyError when looking up a movie, run something like this to find
# the 'key' column for your movie. For example, there's more than one movie with the
# title 'Spellbound', so I need to call either:
#     kv.most_similar('Spellbound (1945)')
# if I want the Hitchcock thriller, or:
#     kv.most_similar('Spellbound (2002)')
# if I want the documentary on spelling bees.
movies[movies.title.str.contains('Spellbound')]

If you find any particularly interesting or funny examples, feel free to share them on [the forums](https://www.kaggle.com/learn-forum).

## 2. *Bambi* + *The Mummy* = ???

So far we've seen the `most_similar()` method called in the following ways:
- with a single (positive) example `m1`, giving us the movies most similar to `m1`
- with one positive example, `m1`, and one negative example, `m2`. The results seem to roughly correspond to the question "which movies exemplify the properties that `m1` has and `m2` doesn't?"
- with two positive examples, `m1` and `m2`, and one negative example, `m3`, which answers the analogy "`m3` is to `m2` as `m1` is to ____".

What do you think will happen (mathematically, and semantically) if we call it with two positive examples, and no negative examples? 

In the code cell below, try calling `most_similar()` with *Legally Blonde* and *Mission: Impossible* as two positive examples. If you're familiar with the movies, see if you can predict what kinds of movies will be returned.

In [None]:
# TODO: call most_similar with the movies "Legally Blonde" and "Mission: Impossible" as positive examples,
# and assign the results to the variable legally_impossible
legally_impossible = None
part2.check()
legally_impossible

Try experimenting with adding other pairs of movies. Do you see a pattern emerging?

What do you think happens if we pass in the same movie twice?

In [None]:
# Feel free to continue experimenting here.

Uncomment the line below to see an explanation of what's going on.

In [None]:
#_COMMENT_IF(PROD)_
part2.solution()

In [None]:
#%%RM_IF(PROD)%%
# Bad solution (wrong call signature)
legally_impossible = kv.most_similar(positive=['Legally Blonde'], negative=['Mission: Impossible'])
part2.check()
legally_impossible

In [None]:
#%%RM_IF(PROD)%%
# Correct (solution code)
legally_impossible = kv.most_similar(positive=['Legally Blonde', 'Mission: Impossible'])
part2.check()
legally_impossible

**Bonus**: Pick a movie you like (let's call it `m`), and see if you can find two other movies, `m1` and `m2` such that `m1 + m2 ≈ m`. Of course you're pretty likely to succeed if you choose two movies which are each similar to `m`, but can you come up with a pair of very *different* movies that have `m` right between them? Again, if you're successful here, feel free to share on [the forums](https://www.kaggle.com/learn-forum).
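One way to score a candidate pair for the bonus is to measure how closely the combined direction of `m1` and `m2` lines up with `m`. The sketch below does this on toy 2-d vectors, not real movie embeddings. The helper names `unit` and `cosine_to_sum` are hypothetical, and rescaling each input to unit length before adding is an assumption chosen so that neither movie dominates the sum.

```python
import numpy as np

def unit(v):
    """Rescale a vector to length 1."""
    return v / np.linalg.norm(v)

def cosine_to_sum(m, m1, m2):
    """Cosine similarity between m and the sum of unit-normalized m1 and m2."""
    combined = unit(m1) + unit(m2)
    return float(np.dot(unit(m), unit(combined)))

# Toy 2-d vectors (hypothetical, not real movie embeddings).
m1 = np.array([1.0, 0.0])
m2 = np.array([0.0, 1.0])
m = np.array([2.0, 2.0])       # points halfway between m1 and m2
other = np.array([-1.0, 0.5])  # points somewhere else entirely

print(cosine_to_sum(m, m1, m2))      # close to 1.0
print(cosine_to_sum(other, m1, m2))  # much lower
```

With the real embeddings, the three inputs could be vectors looked up by key, for example `kv['Vertigo']`. A score near 1 suggests that `m` sits close to the direction of `m1 + m2`.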

## 3. Cosine distance vs. Euclidean distance

If you're familiar with linear algebra, you may know that the cosine distance and squared Euclidean distance between two vectors are equivalent (up to a scaling factor) if those vectors have the same length. In particular, when our vectors both have length 1, their squared Euclidean distance is just twice their cosine distance, so the two measures rank neighbours identically.
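For unit-length vectors, the precise identity is $\|a - b\|^2 = 2\,(1 - \cos(a, b))$: the squared Euclidean distance equals twice the cosine distance. Here is a minimal numerical check on two random vectors. The dimension 32 and the random seed are arbitrary choices for illustration, not values taken from our model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random 32-dimensional vectors, rescaled to unit length.
a = rng.normal(size=32)
b = rng.normal(size=32)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

# For unit vectors, cosine similarity is just the dot product.
cosine_distance = 1.0 - np.dot(a, b)
squared_euclidean = np.sum((a - b) ** 2)

# Check the identity ||a - b||^2 = 2 * (1 - cos(a, b)).
print(np.isclose(squared_euclidean, 2.0 * cosine_distance))
```

Once the vectors differ in length, this identity no longer holds, and the two distances can rank neighbours differently.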

Given that we cared about using cosine distance rather than Euclidean distance, we must have some reason to believe that our embedding vectors vary in length. But how much? And is there any pattern to which movies' vectors are long or short?

### 3a. Distribution of lengths

Just as there are lots of definitions of distance, there are lots of definitions of length. But we'll be using the familiar notion, technically called the 'Euclidean norm' or 'L2 norm'. What's the length of the vector `(3, 4)`? Well, if we start at `(0, 0)`, walk 3 steps to the right, and then 4 steps up, we'll get a right triangle where the hypotenuse connects `(3, 4)` to `(0, 0)`. By the Pythagorean theorem, the length of that hypotenuse is $\sqrt{3^2 + 4^2} = \sqrt{25} = 5$. We can extend the calculation to any number of dimensions - for example, the L2 norm of the vector `(1, 1, 3, 5)` is $\sqrt{1^2 + 1^2 + 3^2 + 5^2} = 6$.
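The arithmetic above can be checked by hand in plain Python. The function name `l2_norm` is just an illustrative choice.

```python
import math

def l2_norm(vector):
    """Return the Euclidean (L2) norm: the square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in vector))

print(l2_norm((3, 4)))        # 5.0
print(l2_norm((1, 1, 3, 5)))  # 6.0
```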

Fortunately, we don't need to implement the calculations ourselves. Given a vector, the function [`numpy.linalg.norm`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.norm.html) returns its L2 norm. Run the cell below to calculate the L2 norm of our model's first movie embedding vector:

In [None]:
np.linalg.norm(w[0])

Fill in the missing code in the cell below to create a variable `norms`, containing the L2 norms of all the model's movie embeddings.

In [None]:
norms = None
part3.a.check()

In [None]:
#_COMMENT_IF(PROD)_
part3.a.hint()

In [None]:
#_COMMENT_IF(PROD)_
part3.a.solution()

In [None]:
#%%RM_IF(PROD)%%
# Incorrect
norms = np.linalg.norm(w)
part3.a.check()

In [None]:
#%%RM_IF(PROD)%%
# Correct (solution code)
norms = np.linalg.norm(w, axis=1)
part3.a.check()

Once you've successfully calculated `norms`, run the following cell to generate a visualization of the distribution of lengths of our movie embedding vectors.

In [None]:
norm_series = pd.Series(norms)
ax = norm_series.plot.hist()
ax.set_xlabel('Embedding norm');

### 3b. Patterns in vector lengths?

Fill in the missing code below to add a column called `norm` containing the length of each movie's embedding to our DataFrame containing all movies (`all_movies_df`).

In [None]:
# TODO: Your code goes here. Add the column "norm" to our movies dataframe.
part3.b.check()

In [None]:
#_COMMENT_IF(PROD)_
part3.b.solution()

In [None]:
#%%RM_IF(PROD)%%
# Silly incorrect soln
all_movies_df['norm'] = norms + 1
part3.b.check()

In [None]:
#%%RM_IF(PROD)%%
# Correct (solution code)
all_movies_df['norm'] = norms
part3.b.check()

Run the cells below to see the movies with the smallest and largest embedding vectors. Do you see a pattern?

In [None]:
n = 5
# Movies with the smallest embeddings (as measured by L2 norm)
all_movies_df.sort_values(by='norm').head(n)

In [None]:
# Movies with the largest embeddings
all_movies_df.sort_values(by='norm', ascending=False).head(n)

Uncomment the cell below for some speculation about what's going on.

In [None]:
#_COMMENT_IF(PROD)_
part3.c.solution()

#$END_OF_EMB_EXERCISE(82826085484264)$

In [None]:
#%%RM_BELOW%%

`</exercise>`

# Scratch space

## Look up your favourite movie and judge how legit the similar titles are! 

## multiple positive movies

What movies do you get if you combine x + y? What movies lie between x and y?

## cos distance vs. euclidean

looking at different magnitudes of movies

-----------------

## analogies? Not sure if better to do in an exercise, or in the body of lesson. Or both.

## clustering...?

## exploring more gensim methods

`doesnt_match`, 

## Exploring individual dimensions of the embedding space

In [None]:
all_movies_df[all_movies_df.n_ratings >= threshold].sort_values(by='norm').head(n)