# Using a Vector Database to Recommend Movies

Vector search is critical for generative AI, but it also has many other interesting applications. A very common one is building personalized recommendations. In this exercise, we'll take a small diversion and build a quick movie recommender using a vector database.

For this exercise we'll use the [MovieLens Latest Small Dataset](https://grouplens.org/datasets/movielens/latest/), which contains 100,000 ratings and 3,600 tags applied to 9,000 movies by 600 users. The strategy we'll use is to create embeddings for the movies based on the user ratings. Then, if a user rated a particular movie highly, we'll recommend "similar" movies, as determined by the embeddings.
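To make "similar" concrete, here is a minimal sketch on made-up data. It treats each movie's column of user ratings as a vector and compares movies with cosine similarity. The toy ratings and the `cosine_similarity` helper are illustrative assumptions; the rest of the exercise relies on the vector database's own distance metric instead.

```python
import numpy as np

# Hypothetical toy ratings: rows are users, columns are movies A, B, C.
# 0 means "not rated", mirroring the convention used later in this exercise.
ratings_toy = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [0.0, 1.0, 5.0],
])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Transposing gives one row (vector) per movie.
movie_a, movie_b, movie_c = ratings_toy.T

sim_ab = cosine_similarity(movie_a, movie_b)
sim_ac = cosine_similarity(movie_a, movie_c)
print(f"A vs B: {sim_ab:.3f}, A vs C: {sim_ac:.3f}")
```

Movies A and B are liked by the same users, so their vectors point in nearly the same direction, while A and C are liked by different users and score much lower.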

In [2]:
!pip install lancedb


In [3]:
import lancedb

import numpy as np
import pandas as pd

The dataset is included along with this exercise:

In [4]:
!ls ml-latest-small

links.csv  movies.csv  ratings.csv  README.txt	tags.csv


## Loading data

Let's start by reading in the `ratings.csv` file. We'll use this to compute the content embeddings.

In [5]:
ratings = pd.read_csv('./ml-latest-small/ratings.csv', header=0)
print(ratings.movieId.nunique())
display(ratings)

9724


|  | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |
| ... | ... | ... | ... | ... |
| 100831 | 610 | 166534 | 4.0 | 1493848402 |
| 100832 | 610 | 168248 | 5.0 | 1493850091 |
| 100833 | 610 | 168250 | 5.0 | 1494273047 |
| 100834 | 610 | 168252 | 5.0 | 1493846352 |


## Computing ratings

Let's use the `ratings` dataframe from above to create a new `reviewmatrix` dataframe with users as the index and movies as the columns. Each entry (i, j) is the rating that user i gave to movie j. If the user never rated the movie, we fill in the value 0.

We can do this using a pivot table.

In [6]:
reviewmatrix = ratings.pivot_table(index='userId',values='rating',columns='movieId').fillna(0)
reviewmatrix

movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,2.5,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
609,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
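The pivot step is easier to follow on a handful of rows. The sketch below uses a made-up `toy` dataframe in the same long format as `ratings.csv`; the names `toy` and `toy_matrix` are illustrative only.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 20, 30],
    "rating":  [4.0, 5.0, 3.0, 2.5],
})

# One row per user, one column per movie; unrated pairs become 0.
toy_matrix = toy.pivot_table(index="userId", columns="movieId", values="rating").fillna(0)
print(toy_matrix)

# After fillna(0), a 0 means "no rating". MovieLens ratings start at 0.5,
# so a filled-in 0 cannot be confused with a real rating here.
```

Note that filling with 0 is a simplification: it treats "not rated" as if it were a very low score, which is acceptable for a quick exercise like this one.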


## Computing embeddings

Now let's use [matrix factorization](https://www.cs.cmu.edu/~mgormley/courses/10601-s17/slides/lecture25-mf.pdf) to extract content embeddings.

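We'll compute the content embeddings from the `reviewmatrix` dataframe and name the result `embeddings`.

We'll use SVD, as it is a popular matrix factorization technique. The rows of `vh` are the right singular vectors, so transposing it gives one row (one embedding) per movie.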

In [7]:
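matrix = reviewmatrix.values  # users x movies, as a NumPy array
print(type(matrix))
# Thin SVD: matrix = u @ diag(s) @ vh; we only need vh here
_, _, vh = np.linalg.svd(matrix, full_matrices=False)
# Transpose so that each row is one movie's embedding
embeddings = vh.T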

<class 'numpy.ndarray'>


In [8]:
embeddings.shape

(9724, 610)
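The shape `(9724, 610)` follows from the thin SVD: with 610 users and 9,724 movies, `vh` has one row per singular vector (610) and one column per movie (9,724), so `vh.T` has one 610-dimensional row per movie. The sketch below reproduces this on a small random matrix. The toy matrix and the truncation to `k` dimensions are illustrative assumptions, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small review matrix: 4 users (rows) x 6 movies (columns).
toy_reviews = rng.integers(0, 6, size=(4, 6)).astype(float)

# Thin SVD: toy_reviews == u @ np.diag(s) @ vh
u, s, vh = np.linalg.svd(toy_reviews, full_matrices=False)
print(u.shape, s.shape, vh.shape)  # (4, 4) (4,) (4, 6)

# As in the cell above, each row of vh.T is one movie's embedding.
toy_embeddings = vh.T
print(toy_embeddings.shape)  # (6, 4)

# Singular values come back sorted in descending order, so keeping the first
# k columns keeps the k strongest directions (k is an illustrative choice).
k = 2
toy_embeddings_k = toy_embeddings[:, :k]
print(toy_embeddings_k.shape)  # (6, 2)
```

Truncating to a smaller `k` is a common way to get more compact vectors, although this exercise keeps all 610 dimensions.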

## Metadata

Read in the `movies.csv` and `links.csv` files and make sure they are aligned with the rows of `embeddings`.

We'll use the pandas `reindex` functionality to put the metadata in the same movie order as the columns of `reviewmatrix`.

In [9]:
movies = pd.read_csv('./ml-latest-small/movies.csv', header=0)
movies = movies.set_index("movieId").reindex(reviewmatrix.columns)
movies

| movieId | title | genres |
|---|---|---|
| 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 5 | Father of the Bride Part II (1995) | Comedy |
| ... | ... | ... |
| 193581 | Black Butler: Book of the Atlantic (2017) | Action\|Animation\|Comedy\|Fantasy |
| 193583 | No Game No Life: Zero (2017) | Animation\|Comedy\|Fantasy |
| 193585 | Flint (2017) | Drama |
| 193587 | Bungo Stray Dogs: Dead Apple (2018) | Action\|Animation |
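Alignment matters because the embeddings carry no movie ids of their own: row i of `embeddings` corresponds to column i of `reviewmatrix`. Here is a small sketch of what `set_index(...).reindex(...)` does, using made-up metadata (`toy_movies` and `matrix_columns` are hypothetical names).

```python
import pandas as pd

# Hypothetical metadata, deliberately out of order and with an extra movie (40).
toy_movies = pd.DataFrame({
    "movieId": [30, 10, 40, 20],
    "title": ["C", "A", "D", "B"],
})

# Hypothetical column order of a review matrix: the order the embeddings follow.
matrix_columns = pd.Index([10, 20, 30], name="movieId")

aligned = toy_movies.set_index("movieId").reindex(matrix_columns)
print(aligned)

# Row i of `aligned` now describes the same movie as row i of the embeddings.
# A label missing from the metadata would show up as NaN rather than raise.
```

Movies that never appear in the ratings (like movie 40 here) are dropped, and the remaining rows follow the requested order.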


In [10]:
# We do the same for links

links = pd.read_csv('./ml-latest-small/links.csv', header=0)
links = links.set_index("movieId").reindex(reviewmatrix.columns)
links

| movieId | imdbId | tmdbId |
|---|---|---|
| 1 | 114709 | 862.0 |
| 2 | 113497 | 8844.0 |
| 3 | 113228 | 15602.0 |
| 4 | 114885 | 31357.0 |
| 5 | 113041 | 11862.0 |
| ... | ... | ... |
| 193581 | 5476944 | 432131.0 |
| 193583 | 5914996 | 445030.0 |
| 193585 | 6397426 | 479308.0 |
| 193587 | 8391976 | 483455.0 |


## Create vector database table

Let's create a table with the following fields:

1. an integer movie id field
2. a vector field of embeddings
3. a string field of genres
4. a string field for the movie title
5. an integer field for the imdb_id

First, we'll create a Pydantic model named `Content` for these fields. For the vector field, use `lancedb.pydantic.vector` as a shorthand for the field type. Note that you'll need to pass in the number of dimensions.

In [11]:
from lancedb.pydantic import vector, LanceModel

class Content(LanceModel):
    movie_id: int
    vector: vector(embeddings.shape[1])
    genres: str
    title: str
    imdb_id: int

    @property
    def imdb_url(self) -> str:
        return f"https://www.imdb.com/title/tt{self.imdb_id}"

Let's prepare a list of Python dicts with all of the data, keyed by the model's field names.

In [12]:
Content.field_names()

['movie_id', 'vector', 'genres', 'title', 'imdb_id']

In [19]:

# Columns must be in the same order as Content.field_names()
values = list(zip(reviewmatrix.columns,
                  embeddings,
                  movies["genres"],
                  movies["title"],
                  links["imdbId"]))

keys = Content.field_names()
data = [dict(zip(keys, v)) for v in values]

data[1]

{'movie_id': 2,
 'vector': array([-0.03853935,  0.00206663, -0.05684471,  0.05067286,  0.01987934,
        -0.00351   , -0.03213443,  0.03115371,  0.02794841,  0.00209064,
        -0.02520406, -0.0010789 , -0.02439877,  0.01947549, -0.02473232,
         0.01633652,  0.00124212, -0.00579382, -0.01579501,  0.00037348,
         0.00490493, -0.00947525,  0.01622049, -0.03478628, -0.02463507,
         0.00461524,  0.03360573, -0.02924211, -0.0088343 ,  0.00411023,
        -0.0214643 ,  0.01527597,  0.00604399, -0.02810287, -0.02222942,
         0.00994754,  0.0044618 ,  0.01742809, -0.00483823,  0.01645457,
        -0.01607385, -0.00890443,  0.04737634,  0.02985171, -0.03773025,
         0.03559281, -0.03038119,  0.01051756, -0.0311283 , -0.00134411,
        -0.00202815, -0.02245112,  0.03232523,  0.01191392,  0.0047238 ,
        -0.05926131, -0.00465635,  0.07688715,  0.03343306, -0.05419442,
        -0.00303547, -0.01239701,  0.0098923 ,  0.00229283, -0.06022275,
        -0.06688004, -0.0
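The `zip`-based construction can be seen in isolation with plain Python lists. Everything in this sketch (field names, values) is made up for illustration; the point is that the zipped columns must follow the same order as the field names.

```python
# Hypothetical field names and parallel columns, standing in for the real data.
keys_toy = ["movie_id", "genres", "title"]
movie_ids = [1, 2]
genres = ["Comedy", "Drama"]
titles = ["First (1995)", "Second (1996)"]

# zip(columns...) yields one tuple per movie; zip(keys_toy, row) pairs each
# value with its field name.
rows = [dict(zip(keys_toy, row)) for row in zip(movie_ids, genres, titles)]
print(rows[0])

# zip stops at the shortest input, so a row with more values than keys
# silently loses the extras instead of raising an error.
extra = dict(zip(keys_toy, (3, "Action", "Third (1997)", "unused")))
print(extra)
```

Because mismatched lengths fail silently, it is worth checking one record by eye, as the `data[1]` output above does.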

Let's connect to the local database at `~/.lancedb` and create the LanceDB table named `movielens_small`.

In [20]:
import pyarrow as pa
import lancedb

table_name = "movielens_small"
data = pa.Table.from_pylist(data, schema=Content.to_arrow_schema())

db = lancedb.connect("~/.lancedb")

# Uncomment to drop an existing table before re-running this cell
# db.drop_table(table_name)
table = db.create_table(table_name, data=data)


## Generating recommendations

Finally, we're ready to generate recommendations based on content vector similarity.

In [21]:
def get_recommendations(title: str) -> list[tuple[int, str, str]]:
    # First we retrieve the vector for the input title
    query_vector = (table.to_lance()
                    .to_table(filter=f"title='{title}'")["vector"].to_numpy()[0])

    # Search for the 5 most similar titles; the closest match is the input movie itself
    results = table.search(query_vector).limit(5).to_pydantic(Content)

    # For each result, return the movie_id, title, and imdb_url
    return [(c.movie_id, c.title, c.imdb_url) for c in results]
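Conceptually, the search call finds the stored vectors closest to the query vector. The brute-force sketch below shows the same idea in NumPy. It assumes Euclidean (L2) distance, which we understand to be LanceDB's default metric, and the `nearest_neighbors` helper and toy vectors are hypothetical.

```python
import numpy as np

def nearest_neighbors(vectors: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the row indices of the k vectors closest to `query` (L2 distance)."""
    distances = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(distances)[:k]

# Hypothetical 2-D embeddings for five items.
toy_vectors = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 2.0],
    [5.0, 5.0],
    [0.5, 0.5],
])

# Query with item 0's own vector, as get_recommendations does.
neighbors = nearest_neighbors(toy_vectors, toy_vectors[0], k=3)
print(neighbors)
```

The query item is at distance 0 from itself, so it always comes back first. A vector database returns the same kind of ranking, but can use an index to avoid scanning every vector.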

If a user watched the movie titled "Moana (2016)", what should we recommend to the user?

In [22]:
get_recommendations("Moana (2016)")

[(166461, 'Moana (2016)', 'https://www.imdb.com/title/tt3521164'),
 (168418, 'The Boss Baby (2017)', 'https://www.imdb.com/title/tt3874544'),
 (115664, 'The Book of Life (2014)', 'https://www.imdb.com/title/tt2262227'),
 (162578,
  'Kubo and the Two Strings (2016)',
  'https://www.imdb.com/title/tt4302938'),
 (161580, 'Bad Moms (2016)', 'https://www.imdb.com/title/tt4651520')]

In [23]:
# Let's remove the original movie from the results
get_recommendations("Moana (2016)")[1:]

[(168418, 'The Boss Baby (2017)', 'https://www.imdb.com/title/tt3874544'),
 (115664, 'The Book of Life (2014)', 'https://www.imdb.com/title/tt2262227'),
 (162578,
  'Kubo and the Two Strings (2016)',
  'https://www.imdb.com/title/tt4302938'),
 (161580, 'Bad Moms (2016)', 'https://www.imdb.com/title/tt4651520')]

What about "Rogue One: A Star Wars Story (2016)"?

In [24]:
get_recommendations("Rogue One: A Star Wars Story (2016)")[1:]

[(143355, 'Wonder Woman (2017)', 'https://www.imdb.com/title/tt451279'),
 (166568, 'Miss Sloane (2016)', 'https://www.imdb.com/title/tt4540710'),
 (166635, 'Passengers (2016)', 'https://www.imdb.com/title/tt1355644'),
 (103042, 'Man of Steel (2013)', 'https://www.imdb.com/title/tt770828')]

In [25]:
get_recommendations("Jumanji (1995)")[1:]

[(275, 'Mixed Nuts (1994)', 'https://www.imdb.com/title/tt110538'),
 (8943, 'Being Julia (2004)', 'https://www.imdb.com/title/tt340012'),
 (6320, 'Scenes from a Mall (1991)', 'https://www.imdb.com/title/tt102849'),
 (7040,
  'To Live and Die in L.A. (1985)',
  'https://www.imdb.com/title/tt90180')]
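Several URLs in the outputs above have only five or six digits after `tt` (for example `tt451279` and `tt90180`). `links.csv` stores `imdbId` as an integer, so leading zeros are lost, while IMDb title identifiers are, to our knowledge, written with at least seven digits. A minimal sketch of a zero-padded variant follows; `padded_imdb_url` is a hypothetical helper, not part of the `Content` model above.

```python
def padded_imdb_url(imdb_id: int) -> str:
    """Build a title URL with the numeric id zero-padded to at least 7 digits."""
    return f"https://www.imdb.com/title/tt{imdb_id:07d}"

print(padded_imdb_url(451279))   # six-digit id gains one leading zero
print(padded_imdb_url(3521164))  # seven-digit id is unchanged
```

The same `:07d` format specifier could be used inside the `imdb_url` property if padded links are preferred.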