# Embeddings-based recommendations

Text embeddings can be used to build recommendation systems that suggest products such as books and movies based on other products that you like. The basic approach is to create an embedding from the textual description of each item and then use a measure such as [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to quantify how similar two embeddings are. Scikit-learn provides the [cosine_similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) function for computing similarity, and OpenAI offers embedding models such as `text-embedding-3-small` for creating high-quality embeddings. Let's build an embeddings-based recommendation system that takes a movie title as input and returns a list of similar movies.

![](Images/movies.png)
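
To make the similarity measure concrete before working with real embeddings, here is a minimal sketch of cosine similarity for a single pair of vectors. The helper name `cosine` and the toy vectors are illustrative assumptions and are not part of the notebook.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between vectors a and b (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the two vector lengths
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine([1, 0], [0, 1])            # perpendicular vectors
opposite = cosine([1, 1], [-1, -1])            # vectors pointing in opposite directions

print(same_direction, orthogonal, opposite)
```

The score depends only on the angle between the vectors and not on their lengths, which is why it suits comparing embeddings of descriptions that differ in size.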

Start by loading a dataset of roughly 4,800 movies that includes information such as the director, the cast, and the genre of each movie, as well as keywords describing the movie.

In [1]:
import pandas as pd

df = pd.read_csv('Data/movies.csv')
df.head()

Remove all of the columns except the ones that will be used to quantify similarities between movies.

In [2]:
df = df[['title', 'genres', 'keywords', 'cast', 'director']]
df = df.fillna('') # Fill missing values with empty strings
df.head()

| | title | genres | keywords | cast | director |
|---|---|---|---|---|---|
| 0 | Avatar | Action Adventure Fantasy Science Fiction | culture clash future space war space colony so... | Sam Worthington Zoe Saldana Sigourney Weaver S... | James Cameron |
| 1 | Pirates of the Caribbean: At World's End | Adventure Fantasy Action | ocean drug abuse exotic island east india trad... | Johnny Depp Orlando Bloom Keira Knightley Stel... | Gore Verbinski |
| 2 | Spectre | Action Adventure Crime | spy based on novel secret agent sequel mi6 | Daniel Craig Christoph Waltz Léa Seydoux ... | Sam Mendes |
| 3 | The Dark Knight Rises | Action Crime Drama Thriller | dc comics crime fighter terrorist secret ident... | Christian Bale Michael Caine Gary Oldman Anne ... | Christopher Nolan |
| 4 | John Carter | Action Adventure Science Fiction | based on novel mars medallion space travel pri... | Taylor Kitsch Lynn Collins Samantha Morton Wil... | Andrew Stanton |


Add a new column named "features" to the DataFrame that combines all of the words in the other columns.

In [3]:
df['features'] = df['title'] + ' ' + df['genres'] + ' ' + df['keywords'] + ' ' + df['cast'] + ' ' + df['director']
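
The `fillna('')` call in the previous cell matters for this step. The sketch below uses a made-up two-row DataFrame (the titles and names are placeholders, not rows from the dataset) to show what happens to the concatenation when a column contains a missing value.

```python
import numpy as np
import pandas as pd

# Toy data; the second row has a missing director, as read_csv would produce
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B'],
    'director': ['Jane Doe', np.nan],
})

# Without fillna, the missing value propagates through the concatenation
unfilled = toy['title'] + ' ' + toy['director']

# With fillna(''), every row produces a usable string
filled_df = toy.fillna('')
filled = filled_df['title'] + ' ' + filled_df['director']

print(unfilled.isna().tolist())
print(filled.tolist())
```

A row whose combined string is missing could not be sent to the embeddings model, so filling the gaps first keeps every movie in the index.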

Use OpenAI's `text-embedding-3-small` model to generate an embedding for each movie's `features` string, and then Scikit-learn's [cosine_similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) function to generate a similarity matrix from those embeddings.

In [4]:
import numpy as np
from openai import OpenAI

client = OpenAI(api_key='OPENAI_API_KEY') # Replace the placeholder with your own API key
embeddings = []

for index, row in df.iterrows():
    feature = row['features']

    response = client.embeddings.create(
        model='text-embedding-3-small',
        dimensions=256, # Limit each embedding vector to 256 dimensions
        input=feature
    )

    embeddings.append(response.data[0].embedding)

word_matrix = np.array(embeddings)
word_matrix.shape

(4803, 256)
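
The loop above makes one API request per movie. The embeddings endpoint also accepts a list of strings as `input`, so the requests can be grouped. The following is a sketch under stated assumptions rather than the notebook's method: `embed_in_batches`, `make_openai_embedder`, `fake_embed_batch`, and the batch size of 256 are hypothetical choices, and the client is assumed to read the key from the `OPENAI_API_KEY` environment variable.

```python
def embed_in_batches(texts, embed_batch, batch_size=256):
    """Embed texts chunk by chunk; embed_batch maps a list of strings to a list of vectors."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        vectors.extend(embed_batch(chunk))
    return vectors

def make_openai_embedder(model='text-embedding-3-small', dimensions=256):
    """Build an embed_batch callable backed by the OpenAI API (hypothetical helper)."""
    from openai import OpenAI  # imported lazily so the offline demo below runs without it
    client = OpenAI()          # assumes the OPENAI_API_KEY environment variable is set

    def embed_batch(chunk):
        response = client.embeddings.create(model=model, dimensions=dimensions, input=chunk)
        # Order by index so the vectors line up with the input strings
        return [d.embedding for d in sorted(response.data, key=lambda d: d.index)]

    return embed_batch

# Offline stand-in used only to check the batching logic without network access
def fake_embed_batch(chunk):
    return [[float(len(text))] for text in chunk]

demo = embed_in_batches(['a', 'bb', 'ccc', 'dddd', 'eeeee'], fake_embed_batch, batch_size=2)
print(demo)
```

With a valid key, something like `np.array(embed_in_batches(df['features'].tolist(), make_openai_embedder()))` should produce a matrix with the same shape as `word_matrix` while issuing far fewer requests.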

Compute cosine similarities for all the vector pairs.

In [5]:
from sklearn.metrics.pairwise import cosine_similarity

sim = cosine_similarity(word_matrix)
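
Conceptually, the all-pairs computation amounts to scaling every row to unit length and taking all pairwise dot products. The sketch below illustrates this with a small random matrix standing in for `word_matrix`; the names `X`, `X_unit`, and `sim_demo` are illustrative and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # stand-in for word_matrix: 5 items, 8 dimensions

# Scale each row to unit length, then take all pairwise dot products
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
sim_demo = X_unit @ X_unit.T

print(sim_demo.shape)                       # one row and one column per item
print(np.allclose(sim_demo, sim_demo.T))    # similarity is symmetric
print(np.allclose(np.diag(sim_demo), 1.0))  # each item is maximally similar to itself
```

For the 4,803 movies here, the full matrix has about 23 million entries, or roughly 185 MB at 8 bytes per entry. That is manageable, but the quadratic growth is worth keeping in mind for larger catalogs.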

Define a function that takes a movie title as input and returns a list of similar movies, and then use that function to make some recommendations.

In [6]:
def get_recommendations(title, df, sim, count=10):
    # Get the row index of the specified title in the DataFrame
    index = df.index[df['title'].str.lower() == title.lower()]

    # Return an empty list if there is no entry for the specified title
    if len(index) == 0:
        return []

    # Get the corresponding row in the similarity matrix
    similarities = list(enumerate(sim[index[0]]))

    # Sort the similarity scores in that row in descending order
    recommendations = sorted(similarities, key=lambda x: x[1], reverse=True)

    # Get the top n recommendations, ignoring the first entry in the list since
    # it corresponds to the title itself (and thus has a similarity of 1.0)
    top_recs = recommendations[1:count + 1]

    # Generate a list of titles from the indexes in top_recs
    return [df.iloc[i]['title'] for i, _ in top_recs]
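
`get_recommendations` skips the first sorted entry on the assumption that it is the query movie itself. If two rows had identical feature strings, a tie could place the other row first, and the query would then appear in its own recommendations. One possible variation, sketched here with a hypothetical helper named `top_k_similar` and a made-up similarity row, filters out the query index explicitly.

```python
import numpy as np

def top_k_similar(sim_row, query_index, count=10):
    """Indexes of the count highest scores in sim_row, excluding query_index."""
    order = np.argsort(sim_row)[::-1]    # highest similarity first
    order = order[order != query_index]  # drop the query item explicitly
    return order[:count].tolist()

# Toy row: item 0 is the query and item 2 is an exact duplicate of it
row = np.array([1.0, 0.2, 1.0, 0.7, 0.5])
result = top_k_similar(row, query_index=0, count=3)
print(result)
```

The returned indexes could be mapped to titles with `df.iloc[...]['title']`, as the function above does.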

Use the `get_recommendations` function to identify 10 movies that are similar to the James Bond thriller [Skyfall](https://en.wikipedia.org/wiki/Skyfall):

In [7]:
get_recommendations('Skyfall', df, sim)

['Spectre',
 'Quantum of Solace',
 'Casino Royale',
 'Layer Cake',
 'Die Another Day',
 'Tomorrow Never Dies',
 'Renaissance',
 'Dr. No',
 'Trance',
 'Entrapment']

If you like Disney movies, how about movies that are similar to [Mulan](https://www.imdb.com/title/tt4566758/)?

In [8]:
get_recommendations('Mulan', df, sim)

['Tangled',
 'Shanghai Noon',
 'Aladdin',
 'Looney Tunes: Back in Action',
 'The Tigger Movie',
 'Kung Fu Panda',
 'Kung Fu Panda 2',
 'Enchanted',
 'Shrek',
 'Flubber']

Now find action movies that are similar to [Die Hard](https://www.imdb.com/title/tt0095016/):

In [9]:
get_recommendations('Die Hard', df, sim)

['Die Hard 2',
 'Live Free or Die Hard',
 'Die Hard: With a Vengeance',
 'RED',
 'Wanted',
 'Broken Arrow',
 'Executive Decision',
 'Eraser',
 'Con Air',
 'Cradle 2 the Grave']