# 1. Movie Recommendation System

In this notebook we will design and build a Movie Recommendation System using the power of OpenAI embeddings.

# 2. Importing libraries

In [None]:
!pip install openai

Collecting openai
  Downloading openai-1.2.2-py3-none-any.whl (220 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.25.1-py3-none-any.whl (75 kB)
Collecting httpcore (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.1-py3-none-any.whl (76 kB)
Collecting h11<0.15,>=0.13 (from httpcore->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx, openai
ERROR: pip's dependency resolver does not currently

In [None]:
import os
import openai
import numpy as np
import pandas as pd

from openai import OpenAI

# 3. Sending a first request to the OpenAI API


### 3.1 Setting up API Key

In [None]:
os.environ["OPENAI_API_KEY"] = "sk-XXXXXXXXXXXXX"
client = OpenAI()
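
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key was exported in the shell before Jupyter was started; `load_api_key` is a hypothetical helper written for this notebook, not part of the OpenAI library:

```python
import os


def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return key


# Example with a stand-in mapping instead of the real environment
print(load_api_key({"OPENAI_API_KEY": "sk-example"}))
```

With the real environment, `client = OpenAI(api_key=load_api_key())` would then fail early with a clear message if the key is missing.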

### 3.2 Vectors and their similarity


### Embeddings:

Imagine you have a bunch of different fruits, and you want to describe each one on a piece of paper so that someone can understand what each fruit is like without seeing it. You’d write down things like the color, shape, size, and taste of each fruit. In the world of computers and AI, embeddings do something similar for words or movies.

An embedding is a way of turning words, sentences, or things like movies into a list of numbers (we call this list a "vector") that represents different features, just like the list you made for the fruits. For example, for movies, the numbers might represent how action-packed they are, whether they are romantic, if they are funny, and so on. These numbers aren't random; they are calculated so that movies with similar numbers have similar features.

![](https://cdn.sanity.io/images/vr8gru94/production/e016bbd4d7d57ff27e261adf1e254d2d3c609aac-2447x849.png)
Source: https://www.pinecone.io/learn/vector-embeddings/
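
To make the idea concrete, here is a small sketch with hand-made vectors. The three features and all the numbers are invented for illustration; real embeddings are learned by a model, have many more dimensions, and their dimensions do not carry readable labels like these.

```python
import numpy as np

# Hypothetical features, in order: [action, romance, comedy]
toy_movie_vectors = {
    "action_thriller": np.array([0.9, 0.1, 0.2]),
    "romantic_comedy": np.array([0.1, 0.9, 0.8]),
    "action_comedy": np.array([0.8, 0.2, 0.7]),
}

# Straight-line distance between two descriptions: smaller means more alike
dist_to_romcom = np.linalg.norm(
    toy_movie_vectors["action_thriller"] - toy_movie_vectors["romantic_comedy"]
)
dist_to_action_comedy = np.linalg.norm(
    toy_movie_vectors["action_thriller"] - toy_movie_vectors["action_comedy"]
)

print("thriller vs romantic comedy:", round(float(dist_to_romcom), 2))
print("thriller vs action comedy:  ", round(float(dist_to_action_comedy), 2))
```

The action thriller ends up closer to the action comedy than to the romantic comedy, because their numbers agree on the "action" and "romance" features.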

### Vector Similarity:

Now, let’s say you have two lists of numbers for two different movies. How can you tell if the movies are similar? This is where vector similarity comes in.

Imagine you and a friend each have a toy car, and you race them side by side to see which one is faster. If the cars finish the race at almost the same time, you’d say they’re pretty similar in speed. Vector similarity does the same thing with the lists of numbers for the movies.

Computers use a method to "race" the vectors against each other, often one called "cosine similarity." It measures the angle between the two vectors: a score close to 1 means they point in almost the same direction, which is like two cars finishing at the same time, so the movies are similar. A score close to 0 or below means the vectors point in different directions, so the movies are quite different, just like when one car finishes way ahead of the other.

So, in simple terms:

- **Embeddings** are like writing a detailed description of something (like a movie) in a special code of numbers that a computer can understand.
- **Vector similarity** is like a race to see how similar two sets of numbers (or embeddings) are, which tells us how similar the things they represent (like two movies) might be to each other.


![](https://cdn.sanity.io/images/vr8gru94/production/5a5ba7e0971f7b6dc4697732fa8adc59a46b6d8d-338x357.png)

Source: https://www.pinecone.io/learn/vector-similarity/
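
Cosine similarity can be sketched in a few lines of NumPy. It is the dot product of the two vectors divided by the product of their lengths. The `cosine_similarity` function below is a small helper written for this illustration, and the input vectors are made up:

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # parallel vectors
perpendicular = cosine_similarity([1, 0], [0, 1])         # unrelated directions
opposite = cosine_similarity([1, 0], [-1, 0])             # opposite directions
print(same_direction, perpendicular, opposite)
```

Parallel vectors score about 1, perpendicular vectors score 0, and opposite vectors score -1. Only the direction matters, not the length, which is why `[1, 2, 3]` and `[2, 4, 6]` count as a perfect match.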

In [None]:
experiment_sentence = "The Terminator is a movie about AI goign after humans"

embed = client.embeddings.create(
    model="text-embedding-ada-002",
    input=experiment_sentence,
)

In [None]:
embed.data[0].embedding[:10]

[-0.014664475806057453,
 -0.05948261916637421,
 -0.02541155181825161,
 -0.020566347986459732,
 0.01930350251495838,
 0.010134982876479626,
 -0.02765374816954136,
 0.0021954833064228296,
 -0.014226345345377922,
 -0.008195612579584122]

## Similarity

In [None]:
toy_dataset = [
    "The Terminator is a movie that has AI-based robots inside of them",
    "Harry Potter is all amobut wizards and magic",
    "In the movie Matrix, AI already has become the most powerfull 'being'"
]

In [None]:
embeddings = client.embeddings.create(
    model="text-embedding-ada-002",
    input=toy_dataset,
)

pure_embeds = []
for embedding in embeddings.data:
    pure_embeds.append(embedding.embedding)

In [None]:
user_request = input("What movie are you looking for? ")

user_vector = client.embeddings.create(
    model="text-embedding-ada-002",
    input=user_request)

What movie are you looking for? magic brat


In [None]:
user_vector = user_vector.data[0].embedding
# Normalize the user_vector
user_vector_norm = user_vector / np.linalg.norm(user_vector)

# Normalize each vector in pure_embeds
pure_embeds_norm = pure_embeds / np.linalg.norm(pure_embeds, axis=1, keepdims=True)

# Calculate the cosine similarity for each pair of the user_vector and the vectors in pure_embeds
cosine_similarity_scores = np.dot(user_vector_norm, pure_embeds_norm.T)

In [None]:
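import numpy as np
from scipy import spatial
from scipy.spatial.distance import cdist

# Cross-check with SciPy: similarity between the query and the first movie
# (SciPy returns a cosine distance, so similarity = 1 - distance)
result = 1 - spatial.distance.cosine(user_vector, pure_embeds[0])

# Standalone example: pairwise cosine similarity with cdist
# (separate variable names so the real user_vector is not overwritten)
example_user_vector = np.array([1, 2, 3]).reshape(1, -1)  # cdist expects 2D arrays
example_item_vectors = np.array([[4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Calculate pairwise cosine distances, then convert them to similarities
example_distances = cdist(example_user_vector, example_item_vectors, 'cosine')
example_similarities = 1 - example_distances

# Display the NumPy-based scores computed in the previous cell
cosine_similarity_scores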

array([0.73729393, 0.81980218, 0.7642304 ])

## Recommending the most similar vector

In [None]:
# Get the indices of the scores sorted in descending order
sorted_indices = np.argsort(1-cosine_similarity_scores)

# Now create a prioritized list of movies based on the sorted indices
prioritized_movies = [toy_dataset[index] for index in sorted_indices]

# Print the recommended movies in order
print("User's query: ", user_request)
print("Recommended movies: ")
for i in range(len(prioritized_movies)):
    movie = prioritized_movies[i]
    print(i+1, ":", movie)

User's query:  magic brat
Recommended movies: 
1 : Harry Potter is all amobut wizards and magic
2 : In the movie Matrix, AI already has become the most powerfull 'being'
3 : The Terminator is a movie that has AI-based robots inside of them
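
The ranking step can be wrapped in a small reusable function. This is a sketch only: `top_k` is a hypothetical helper name, and the scores and titles below are invented rather than real API output.

```python
import numpy as np


def top_k(scores, items, k=3):
    """Return the k (item, score) pairs with the highest scores, best first."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)[:k]  # negate so the largest score comes first
    return [(items[i], float(scores[i])) for i in order]


# Illustrative scores only
example_scores = [0.2, 0.9, 0.5]
example_titles = ["A", "B", "C"]
print(top_k(example_scores, example_titles, k=2))  # [('B', 0.9), ('C', 0.5)]
```

Called as `top_k(cosine_similarity_scores, toy_dataset)`, it would return the same ordering as the loop above, together with each score.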


# 4. Scaling to a bigger dataset

You can download the dataset here: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset?select=movies_metadata.csv

In [None]:
dataset = pd.read_csv("movies_metadata.csv")


In [None]:
dataset.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [None]:
subset = dataset[['title', 'overview']]

In [None]:
subset.head()

| | title | overview |
|---|---|---|
| 0 | Toy Story | Led by Woody, Andy's toys live happily in his ... |
| 1 | Jumanji | When siblings Judy and Peter discover an encha... |
| 2 | Grumpier Old Men | A family wedding reignites the ancient feud be... |
| 3 | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... |
| 4 | Father of the Bride Part II | Just when George Banks has recovered from his ... |


In [None]:
subset.shape

(45466, 2)

In [None]:
# Use only the first 100 movies for the recommendation system to save money on API calls :)
small_dataset = subset.iloc[:100]
small_dataset.shape

(100, 2)

In [None]:
# Drop rows with missing values; reassigning instead of using inplace=True
# avoids pandas' SettingWithCopyWarning on a slice of `subset`
small_dataset = small_dataset.dropna()
small_dataset.shape

(99, 2)

In [None]:
# Calculate the embedding vectors for every overview in small_dataset
small_embeds = client.embeddings.create(
    model="text-embedding-ada-002",
    input=small_dataset['overview'].values.tolist(),
)

small_vectors = []
for embedding in small_embeds.data:
    small_vectors.append(embedding.embedding)
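
For the full dataset, a single request with all 45,466 overviews would likely be too large, so the texts would need to be sent in chunks. A minimal sketch of that idea follows. `embed_in_batches` and `fake_embed` are hypothetical names, and the batch size of 100 is an arbitrary assumption rather than a documented limit.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Embed `texts` in chunks of `batch_size` and return one vector per text.

    `embed_fn` is a hypothetical callable that takes a list of strings and
    returns a list of vectors in the same order.
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors


# Stand-in embedder so the sketch runs without an API call:
# each "vector" is just [length of the text]
def fake_embed(batch):
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
# [[1.0], [2.0], [3.0]]
```

With the client from this notebook, `embed_fn` could be a function that calls `client.embeddings.create(model="text-embedding-ada-002", input=batch)` and returns `[item.embedding for item in response.data]`.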

In [None]:
small_vectors[1][:10]

[0.01767730712890625,
 -0.03527633845806122,
 -0.015929145738482475,
 -0.012478482909500599,
 -0.007090491708368063,
 0.022517366334795952,
 -0.022086849436163902,
 -0.022556504234671593,
 -0.013646098785102367,
 -0.011278252117335796]

In [None]:
user_request = input("What movie are you looking for? ")

user_vector = client.embeddings.create(
    model="text-embedding-ada-002",
    input=user_request)

user_vector = user_vector.data[0].embedding

# Normalize the user_vector
user_vector_norm = user_vector / np.linalg.norm(user_vector)

# Normalize each vector in small_vectors
small_vectors_norm = small_vectors / np.linalg.norm(small_vectors, axis=1, keepdims=True)

# Calculate the cosine similarity between the user_vector and each vector in small_vectors
cosine_similarity_scores = np.dot(user_vector_norm, small_vectors_norm.T)

What movie are you looking for? around artificial intelligence


In [None]:
# Get the indices of the scores sorted in descending order
sorted_indices = np.argsort(-cosine_similarity_scores)

# Now create a prioritized list of movies based on the sorted indices
prioritized_movies = [small_dataset.iloc[index] for index in sorted_indices]

# Print the recommended movies in order
print("User's query: ", user_request)
print("Recommended movies: ")
for i in range(10):
    movie = prioritized_movies[i]
    print(i+1, ":", movie['title'])

User's query:  around artificial intelligence
Recommended movies: 
1 : Screamers
2 : Lawnmower Man 2: Beyond Cyberspace
3 : The City of Lost Children
4 : Copycat
5 : Nick of Time
6 : The Big Green
7 : Kids of the Round Table
8 : Toy Story
9 : Angels and Insects
10 : Jumanji


# 5. Building a movie recommender with Pinecone


Pinecone website: https://www.pinecone.io/

In [None]:
!pip install pinecone-client

Collecting pinecone-client
  Downloading pinecone_client-2.2.4-py3-none-any.whl (179 kB)
Collecting loguru>=0.5.0 (from pinecone-client)
  Downloading loguru-0.7.2-py3-none-any.whl (62 kB)
Collecting dnspython>=2.0.0 (from pinecone-client)
  Downloading dnspython-2.4.2-py3-none-any.whl (300 kB)
Installing collected packages: loguru, dnspython, pinecone-client
Successfully instal

In [None]:
import pinecone

pinecone.init(
	api_key='e5ee498a-3412421312312',
	environment='us-west1-gcp-free'
)
index = pinecone.Index('movie-recommendation')

In [None]:
# Upsert each movie as an (id, vector, metadata) tuple
for i in range(len(small_dataset)):
    upsert_response = index.upsert(
        vectors=[
            (
                str(i),
                small_vectors[i],
                {"title": small_dataset.iloc[i]['title']},
            )
        ]
    )
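
Sending one request per movie works for 100 rows but gets slow as the dataset grows. Since `index.upsert` accepts a list of tuples, several movies can go into one call. The sketch below only builds the batches, so it runs without a Pinecone index. `build_upsert_batches` is a hypothetical helper, and the batch size of 100 is an assumption, not a documented requirement.

```python
def build_upsert_batches(titles, vectors, batch_size=100):
    """Yield lists of (id, vector, metadata) tuples, `batch_size` items at a time."""
    if len(titles) != len(vectors):
        raise ValueError("titles and vectors must have the same length")
    batch = []
    for i, (title, vector) in enumerate(zip(titles, vectors)):
        batch.append((str(i), vector, {"title": title}))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # leftover items that did not fill a whole batch
        yield batch


# Toy data so the sketch runs without a Pinecone index
batches = list(build_upsert_batches(["A", "B", "C"], [[0.1], [0.2], [0.3]], batch_size=2))
print([len(b) for b in batches])  # [2, 1]
```

In the notebook this could be used as `for batch in build_upsert_batches(small_dataset['title'].tolist(), small_vectors): index.upsert(vectors=batch)`.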

## Searching for the most similar movie

In [None]:
user_request = input("What movie are you looking for? ")

user_vector = client.embeddings.create(
    model="text-embedding-ada-002",
    input=user_request)

user_vector = user_vector.data[0].embedding
matches = index.query(
    user_vector,
    top_k=10,
    include_metadata=True)

matches

What movie are you looking for? some random movie


{'matches': [{'id': '65',
              'metadata': {'title': 'Two Bits'},
              'score': 0.791678131,
              'values': []},
             {'id': '91',
              'metadata': {'title': 'Beautiful Girls'},
              'score': 0.789736688,
              'values': []},
             {'id': '79',
              'metadata': {'title': "Things to Do in Denver When You're Dead"},
              'score': 0.787862539,
              'values': []},
             {'id': '8',
              'metadata': {'title': 'Sudden Death'},
              'score': 0.784023821,
              'values': []},
             {'id': '45',
              'metadata': {'title': 'Se7en'},
              'score': 0.780800045,
              'values': []},
             {'id': '98',
              'metadata': {'title': 'Bottle Rocket'},
              'score': 0.779253602,
              'values': []},
             {'id': '43',
              'metadata': {'title': 'To Die For'},
              'score': 0.778241873,
    