# Recommendation - Model 🍿

---

<img src="https://visithrastnik.si/uploads/tic/public/generic_list_item/6-kulturna_prireditev_v_avli_kulturnega_centra_zagorje_ob_savi.jpg" />

---

In [5]:
!pip install lightfm

Collecting lightfm
  Downloading lightfm-1.17.tar.gz (316 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: lightfm
  Building wheel for lightfm (setup.py) ... done
  Created wheel for lightfm: filename=lightfm-1.17-cp39-cp39-macosx_10_9_x86_64.whl size=425952 sha256=d4610249757ce5fa94b746e22f60b4262ff4aeafb3fd5d63a13752f26e6a525c
  Stored in directory: /Users/alexia/Library/Caches/pip/wheels/d8/65/93/6ac8180274dc2e8f86ff326be62da1dfa55dc158fd45faba7d
Successfully built lightfm
Installing collected packages: lightfm
Successfully installed lightfm-1.17


Now for the exciting part! We will train a machine learning model on the **ratings** sparse matrix built previously, and use it as the basis of a recommendation engine.

First, reload the `movies` and `ratings` DataFrames.

In [1]:
# Load the movies and ratings datasets
import pandas as pd
movies = pd.read_csv("ml-latest-small/movies.csv")
ratings = pd.read_csv("ml-latest-small/ratings.csv")

print(movies.head())
print(ratings.head())

   movieId                               title   
0        1                    Toy Story (1995)  \
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931


**Q1**. Start by loading all the pickle files you saved during the last challenge: `ratings_matrix`, `idx_to_mid`, `mid_to_idx`, `uid_to_idx`, `idx_to_uid`

In [3]:
import pickle

def load_pickle(path):
    # Open the file in a context manager so it is closed after loading
    with open(path, "rb") as f:
        return pickle.load(f)

ratings_matrix = load_pickle("data/dataratings_matrix.pkl")
idx_to_mid = load_pickle("data/idx_to_mid.pkl")
mid_to_idx = load_pickle("data/mid_to_idx.pkl")
uid_to_idx = load_pickle("data/uid_to_idx.pkl")
idx_to_uid = load_pickle("data/idx_to_uid.pkl")

**Q2**. Because the dataset differs from what we are used to (X as features, y as target), the usual `train_test_split` function from scikit-learn does not apply.

Fortunately, `lightfm` provides a `random_train_test_split` function in its `cross_validation` module, dedicated to this use case 🙂

Split the data randomly into a `train` matrix and a `test` matrix, with 20% of the interactions in the test set.

In [7]:
import numpy as np
from lightfm.cross_validation import random_train_test_split

train, test = random_train_test_split(
    ratings_matrix,
    test_percentage=0.2,
    random_state=np.random.RandomState(0))
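
To make the idea concrete, here is a minimal NumPy-only sketch of what a random split of interactions does: the observed (user, movie) entries are shuffled, the first 80% go to one matrix and the rest to another, and both matrices keep the original shape. It is an illustration under assumptions (a dense toy matrix and a hypothetical helper named `split_interactions`), not LightFM's actual implementation, which operates on sparse matrices.

```python
import numpy as np

def split_interactions(matrix, test_percentage=0.2, seed=0):
    """Hypothetical helper: randomly split the non-zero entries into train/test."""
    rng = np.random.RandomState(seed)
    rows, cols = np.nonzero(matrix)            # coordinates of observed interactions
    order = rng.permutation(len(rows))         # shuffle the interactions
    cutoff = int((1.0 - test_percentage) * len(rows))
    train_ids, test_ids = order[:cutoff], order[cutoff:]

    train = np.zeros_like(matrix)
    test = np.zeros_like(matrix)
    train[rows[train_ids], cols[train_ids]] = matrix[rows[train_ids], cols[train_ids]]
    test[rows[test_ids], cols[test_ids]] = matrix[rows[test_ids], cols[test_ids]]
    return train, test

# Toy ratings matrix (4 users x 5 movies, 10 observed ratings)
toy = np.array([
    [5.0, 0.0, 3.0, 0.0, 1.0],
    [0.0, 4.0, 0.0, 2.0, 0.0],
    [1.0, 0.0, 0.0, 5.0, 4.0],
    [0.0, 2.0, 4.0, 0.0, 0.0],
])
toy_train, toy_test = split_interactions(toy, test_percentage=0.2)
print(np.count_nonzero(toy_train), np.count_nonzero(toy_test))
```

Each interaction lands in exactly one of the two matrices, so a given user can have ratings in both `train` and `test`.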

**Q3**. Train a LightFM model for 10 epochs. You can use the parameter `loss="warp"`.

In [10]:

from lightfm import LightFM

model = LightFM(no_components=100, loss="warp", random_state=0)

model.fit(train, epochs=10, verbose=True)

Epoch: 100%|████████████████████████████████████| 10/10 [00:01<00:00,  9.81it/s]


<lightfm.lightfm.LightFM at 0x7fec00a78be0>

**Q4**. Evaluate your model on your test set. You can use the `precision_at_k` metric implemented in the LightFM library.

In [11]:
from lightfm.evaluation import precision_at_k

k = 5
precision_k = precision_at_k(model, test, train, k=k).mean()

print("Precision at k:", k, precision_k)

Precision at k: 5 0.28965518
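
As a rough illustration of what this metric measures, the sketch below computes precision at k for a single user by hand: rank all items by predicted score, leave out the items already seen in training, and count what fraction of the top k are positives in the test set. The helper name `precision_at_k_for_user` and the toy scores are assumptions made for this example. The library function computes this per user, and the `.mean()` above averages the result.

```python
import numpy as np

def precision_at_k_for_user(scores, test_positives, train_positives, k=5):
    """Hypothetical helper: share of the top-k ranked items that are test positives."""
    scores = np.asarray(scores, dtype=float).copy()
    # Items already known from training are pushed out of the ranking
    seen = np.array(sorted(train_positives), dtype=int)
    scores[seen] = -np.inf
    top_k = np.argsort(-scores)[:k]
    hits = sum(1 for item in top_k if int(item) in test_positives)
    return hits / k

# Toy predicted scores for 8 items
toy_scores = [0.9, 0.1, 0.8, 0.4, 0.7, 0.3, 0.6, 0.2]
p = precision_at_k_for_user(toy_scores, test_positives={2, 5, 6}, train_positives={0}, k=5)
print(p)
```

Here the top 5 after removing item 0 are items 2, 4, 6, 3 and 5, of which three are test positives, so the precision at 5 is 0.6.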


**Q5**. What does the `item_embeddings` attribute of `model` contain? This will be the heart of your recommendation engine! 💟 So make sure you fully understand what it contains.

In [12]:
print(model.item_embeddings.shape)

(3650, 100)
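
Each row of `item_embeddings` is the latent vector of one movie: here 3650 movies, each described by 100 components. The sketch below shows, on random toy arrays rather than the trained model, how such embeddings are typically used: the score of a (user, movie) pair is the dot product of the two latent vectors plus bias terms. The array names and sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_items, no_components = 4, 6, 3

# Toy stand-ins for model.user_embeddings / model.item_embeddings (random, not trained)
user_embeddings = rng.normal(size=(n_users, no_components))
item_embeddings = rng.normal(size=(n_items, no_components))
user_biases = np.zeros(n_users)
item_biases = np.zeros(n_items)

# Score of every (user, item) pair: dot product of latent vectors plus bias terms
scores = user_embeddings @ item_embeddings.T + user_biases[:, None] + item_biases[None, :]

# Items ranked from best to worst for user 0
ranking_user_0 = np.argsort(-scores[0])
print(scores.shape, ranking_user_0)
```

Two movies whose rows point in similar directions receive similar scores from every user, which is why comparing rows of `item_embeddings` gives a notion of movie similarity.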


**Q6**. We just trained a model that factorized our ratings matrix into a matrix U of shape (n_users, no_components), `model.user_embeddings`, and a matrix V of shape (n_movies, no_components), `model.item_embeddings`.

Now we want to compute **similarity between each pair of movies**.

> 🔦 **Hint**: For the similarity measure we can use either cosine similarity or Pearson correlation:
> - **Cosine similarity** between two vectors or matrices X and Y is given by:
> ``` python
> from sklearn.metrics.pairwise import cosine_similarity
> cosine_similarity(X, Y)
> ```
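> - **Pearson similarity** between two vectors or matrices X and Y is given by the call below, which treats each row as a variable and returns the correlation matrix of the rows of X and Y stacked together: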
> ``` python
> import numpy as np
> np.corrcoef(X, Y)
> ```

Compute `similarity_scores`, a matrix of shape (n_movies, n_movies) whose element (i, j) is the similarity between the movie of index i and the movie of index j.
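
To see what the two measures compute, here is a small NumPy-only sketch on a toy matrix. `cosine_similarity_rows` is a hypothetical helper written for this example (it assumes no row is all zeros). In the notebook itself the scikit-learn function does the same job.

```python
import numpy as np

def cosine_similarity_rows(X):
    """Pairwise cosine similarity between the rows of X (NumPy-only sketch)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / norms                       # assumes no all-zero row
    return X_unit @ X_unit.T

X = np.array([
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 4.0],   # same direction as row 0
    [0.0, 3.0, 0.0],   # orthogonal to rows 0 and 1
])

cos = cosine_similarity_rows(X)
pearson = np.corrcoef(X)   # rows are treated as variables, so the result is (3, 3)

# Pearson correlation is the cosine similarity of the mean-centered rows
centered = X - X.mean(axis=1, keepdims=True)
print(np.allclose(pearson, cosine_similarity_rows(centered)))
```

Rows 0 and 1 have a cosine similarity of 1 because they point in the same direction, while rows 0 and 2 have a cosine similarity of 0 because they are orthogonal.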

In [13]:
from sklearn.metrics.pairwise import cosine_similarity
similarity_scores = cosine_similarity(model.item_embeddings)
print(similarity_scores.shape)
similarity_scores

(3650, 3650)


array([[ 1.0000004 ,  0.14963762,  0.2653598 , ..., -0.39462298,
        -0.04845239, -0.38155937],
       [ 0.14963762,  0.99999994,  0.14907813, ..., -0.1917545 ,
        -0.33483273,  0.03097395],
       [ 0.2653598 ,  0.14907813,  1.        , ..., -0.33357555,
        -0.32673782,  0.15659057],
       ...,
       [-0.39462298, -0.1917545 , -0.33357555, ...,  1.0000001 ,
         0.65127945,  0.05125603],
       [-0.04845239, -0.33483273, -0.32673782, ...,  0.65127945,
         0.99999994, -0.29655012],
       [-0.38155937,  0.03097395,  0.15659057, ...,  0.05125603,
        -0.29655012,  0.9999999 ]], dtype=float32)

**Q7**. For the movie at idx 20, what are the idx of the 10 most similar movies?

In [14]:
idx = 20
similarity_idx = similarity_scores[idx]
ranked_idx = np.argsort(-similarity_idx)
ranked_mid = [idx_to_mid[x] for x in ranked_idx]
for mid in ranked_mid[:10]:
    print(movies[movies.movieId == mid]["title"])

314    Forrest Gump (1994)
Name: title, dtype: object
510    Silence of the Lambs, The (1991)
Name: title, dtype: object
0    Toy Story (1995)
Name: title, dtype: object
506    Aladdin (1992)
Name: title, dtype: object
1284    Good Will Hunting (1997)
Name: title, dtype: object
1503    Saving Private Ryan (1998)
Name: title, dtype: object
277    Shawshank Redemption, The (1994)
Name: title, dtype: object
3568    Monsters, Inc. (2001)
Name: title, dtype: object
1757    Bug's Life, A (1998)
Name: title, dtype: object
1438    Rain Man (1988)
Name: title, dtype: object
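
The ranking step relies on `np.argsort`, which sorts in ascending order. Negating the scores turns that into a descending ranking. A minimal sketch on a toy vector (the values are made up for illustration):

```python
import numpy as np

# Toy similarities between the item at index 1 and five items (including itself)
sims = np.array([0.10, 1.00, 0.75, -0.20, 0.40])

ranked = np.argsort(-sims)          # indices from most to least similar
top_3 = ranked[:3]

# Equivalent here, since all toy values are distinct: sort ascending, then reverse
ranked_alt = np.argsort(sims)[::-1]
print(top_3)
```

The first index returned is the queried item itself, since an item always has the highest similarity with itself.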


**Q8**. Let's now test our engine! Suppose we have a user who likes **Toy Story** 🧸 (movie_id = 1). Which movies would you recommend to that user? In other words, which movies are the most similar to Toy Story?

> ⚠️ **Warning**: Remember that `similarity_scores` is indexed by `idx`, whereas what you know about your movie is its `movie_id`.

Retrieve the **top 5 recommendations**.

In [15]:
idx = mid_to_idx[1]
similarity_idx = similarity_scores[idx]
ranked_idx = np.argsort(-similarity_idx)
ranked_mid = [idx_to_mid[x] for x in ranked_idx]
for mid in ranked_mid[:5]:
    print(movies[movies.movieId == mid]["title"])

0    Toy Story (1995)
Name: title, dtype: object
224    Star Wars: Episode IV - A New Hope (1977)
Name: title, dtype: object
314    Forrest Gump (1994)
Name: title, dtype: object
418    Jurassic Park (1993)
Name: title, dtype: object
1757    Bug's Life, A (1998)
Name: title, dtype: object
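
Note that the first result is Toy Story itself. The sketch below shows, on hypothetical toy data, the round trip between movie ids and matrix indices, and one possible way to leave the queried movie out of its own recommendations. The names `toy_mid_to_idx`, `toy_idx_to_mid`, `toy_similarity` and `similar_movie_ids` are assumptions made for this example.

```python
import numpy as np

# Hypothetical toy data: three movie ids and their row positions in the matrix
movie_ids = [1, 47, 50]
toy_mid_to_idx = {mid: idx for idx, mid in enumerate(movie_ids)}
toy_idx_to_mid = {idx: mid for mid, idx in toy_mid_to_idx.items()}

toy_similarity = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.4],
    [0.7, 0.4, 1.0],
])

def similar_movie_ids(mid, k):
    """Return the ids of the k most similar movies, excluding the movie itself."""
    idx = toy_mid_to_idx[mid]                        # movie id -> matrix index
    ranked_idx = np.argsort(-toy_similarity[idx])    # most similar first
    return [toy_idx_to_mid[int(i)] for i in ranked_idx if int(i) != idx][:k]

recommended = similar_movie_ids(1, k=2)
print(recommended)
```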


The next step is to **deploy your model**. To prepare for it:

**Q9**. Save `similarity_scores` in pickle format, and save the `movies` DataFrame in pickle format as well. Save both in the `data/netflix` directory at the root of the repository.

In [16]:
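directory = "./data"
# Context managers make sure each file is flushed and closed after writing
with open(directory + "/similarity_scores.pkl", "wb") as f:
    pickle.dump(similarity_scores, f)
with open(directory + "/movies.pkl", "wb") as f:
    pickle.dump(movies, f)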
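
Before deploying, it can be worth checking that a saved object reloads unchanged. Here is a minimal round-trip sketch, using a temporary directory and a small stand-in array instead of the real `similarity_scores` (both are assumptions made for this example):

```python
import os
import pickle
import tempfile

import numpy as np

toy_scores = np.eye(3, dtype=np.float32)   # stand-in for similarity_scores

# Temporary folder, used only for this illustration
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "similarity_scores.pkl")

    with open(path, "wb") as f:      # the file is closed automatically on exit
        pickle.dump(toy_scores, f)

    with open(path, "rb") as f:
        reloaded = pickle.load(f)

print(np.array_equal(reloaded, toy_scores), reloaded.dtype)
```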

**Q10**. Encapsulate the previous code into functions. In particular, you will need:
- a `get_sim_scores(mid)` function that returns the vector `sims` of similarity scores between the movie `mid` and all the other movies
- a `get_ranked_recos(sims)` function that, given a vector of similarity scores `sims`, returns the list of all n_movies recommendations ranked from most to least recommended, as a list of (mid, score, name) tuples.

In [18]:
def get_movie_name(mid, movies):
    try:
        name = movies.loc[movies.movieId == mid].title.values[0]
    except IndexError:
        # No row matches this movie id
        name = "Unknown"
    return name

def get_sim_scores(mid):
    idx = mid_to_idx[mid]
    sims = similarity_scores[idx]
    return sims

def get_ranked_recos(sims, movies):
    recos = []
    for idx in np.argsort(-sims):
        mid = idx_to_mid[idx]
        name = get_movie_name(mid, movies)
        score = sims[idx]
        recos.append((mid, score, name))
    return recos

def get_recommendations(mid, movies, k):
    sim_scores = get_sim_scores(mid)
    return get_ranked_recos(sim_scores, movies)[:k]

In [19]:
get_recommendations(2, movies, 10)

[(2, 0.99999994, 'Jumanji (1995)'),
 (588, 0.6907587, 'Aladdin (1992)'),
 (158, 0.68903536, 'Casper (1995)'),
 (586, 0.65584224, 'Home Alone (1990)'),
 (500, 0.6440063, 'Mrs. Doubtfire (1993)'),
 (364, 0.6227479, 'Lion King, The (1994)'),
 (1, 0.61928195, 'Toy Story (1995)'),
 (317, 0.6173481, 'Santa Clause, The (1994)'),
 (2953, 0.6067216, 'Home Alone 2: Lost in New York (1992)'),
 (410, 0.60670555, 'Addams Family Values (1993)')]

If you have extra time, feel free to improve your recommendation engine!