# Movie Advisor
*Chat with an AI movie advisor and get recommendations based on your preferences.*

[GitHub Repo](https://github.com/alexdjulin/movie-advisor)

## Brainstorming

The goal of this project is to get hands-on experience with LangChain and build an AI movie advisor. I want to be able to do the following:
- Ask for information about a movie ('I heard The Godfather is good, give me the plot')
- Ask for recommendations based on given criteria ('I like adventure movies, especially about time travel. Could you recommend a few?')
- Keep a list of movies I have already watched and whether I liked them, and use it to fine-tune recommendations
- Keep a must-watch list and add or remove movies from it
- Use RAG techniques to access information outside the LLM's knowledge: either search the Internet for a lesser-known movie or enter the plot myself

Interactions with the movie advisor should be via keyboard input first, then using speech.

The AI part should rely on an LLM. I will use OpenAI GPT-4o at first, then try open-source models. I could even try fine-tuning them on the IMDb dataset.

Source information should come from the LLM itself. I will try to pack the IMDb dataset into a database for the model to consult, although I suspect the model has already been trained on it.

Lists should be stored in a database. I will use [Xata](https://app.xata.io/) for this.


In [146]:
! pip install -qU python-dotenv langchain langchain-community openai langchain-openai ipykernel kaggle pandas xata requests tiktoken langchain_huggingface sentence-transformers

In [88]:
import os
import json
from datetime import datetime
import zipfile
import pandas as pd
from langchain_community.vectorstores.xata import XataVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
import dotenv
dotenv.load_dotenv()

True

## Dataset review

Here are some datasets on Kaggle that our LLM could access or extend:
- [IMDB Movies Dataset](https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows)
- [TMDB 5000 Movie Dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata)
- [Netflix Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/netflix-shows) - Netflix contents, probably a bit too restrictive

Let's have a look at the TMDB one.

## Review dataset

### Download

In [3]:
!kaggle datasets download -d tmdb/tmdb-movie-metadata

dataset_path = 'dataset'
zip_file = 'tmdb-movie-metadata.zip'
# unzip dataset
with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    zip_ref.extractall(dataset_path)

os.remove(zip_file)
    




Dataset URL: https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata
License(s): other
Downloading tmdb-movie-metadata.zip to c:\Development\_repos\movie-advisor



### Extract data we need
The dataset contains extensive information. We will only extract the following columns:
- title (string)
- genres (list)
- keywords (list)
- overview (string)
- vote_average (float)
- production_countries (list)
- release_date (string or datetime)
- runtime in minutes (float)

In [4]:
# load dataset and print columns
tmdb_df = pd.read_csv(f'{dataset_path}/tmdb_5000_movies.csv')
print(sorted(tmdb_df.columns.tolist()))

['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count']


In [68]:
# filter columns to keep and make a copy of the df
# note: dropna() runs before the column filter, so a row with a missing value
# in any original column (not only the kept ones) is dropped
columns = ["title", "genres", "keywords", "overview", "vote_average", "production_countries", "release_date", "runtime"]
df = tmdb_df.copy().dropna()
df = df[columns]

# convert the JSON-encoded columns to lists of names
# json.loads is safer than eval on strings read from a file
for column in ["genres", "keywords", "production_countries"]:
    df[column] = df[column].apply(lambda x: [i['name'] for i in json.loads(x)])
    # df[column] = df[column].apply(lambda x: '|'.join([i['name'] for i in json.loads(x)]))

# # format floats to strings
# for column in ["vote_average", "runtime"]:
#     df[column] = df[column].astype(str)

# print the type of the first value in each column
for column in columns:
    print(type(df[column].iloc[0]))

df.head()

<class 'str'>
<class 'list'>
<class 'list'>
<class 'str'>
<class 'numpy.float64'>
<class 'list'>
<class 'str'>
<class 'numpy.float64'>


| | title | genres | keywords | overview | vote_average | production_countries | release_date | runtime |
|---|---|---|---|---|---|---|---|---|
| 0 | Avatar | [Action, Adventure, Fantasy, Science Fiction] | [culture clash, future, space war, space colon... | In the 22nd century, a paraplegic Marine is di... | 7.2 | [United States of America, United Kingdom] | 2009-12-10 | 162.0 |
| 1 | Pirates of the Caribbean: At World's End | [Adventure, Fantasy, Action] | [ocean, drug abuse, exotic island, east india ... | Captain Barbossa, long believed to be dead, ha... | 6.9 | [United States of America] | 2007-05-19 | 169.0 |
| 2 | Spectre | [Action, Adventure, Crime] | [spy, based on novel, secret agent, sequel, mi... | A cryptic message from Bond’s past sends him o... | 6.3 | [United Kingdom, United States of America] | 2015-10-26 | 148.0 |
| 3 | The Dark Knight Rises | [Action, Crime, Drama, Thriller] | [dc comics, crime fighter, terrorist, secret i... | Following the death of District Attorney Harve... | 7.6 | [United States of America] | 2012-07-16 | 165.0 |
| 4 | John Carter | [Action, Adventure, Science Fiction] | [based on novel, mars, medallion, space travel... | John Carter is a war-weary, former military ca... | 6.1 | [United States of America] | 2012-03-07 | 132.0 |
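
The list-valued columns (genres, keywords, production_countries) arrive in the CSV as JSON strings, so each cell has to be parsed before the names can be extracted. Here is a minimal sketch of that step on a single cell. The sample value is illustrative and `extract_names` is a hypothetical helper, not part of the notebook.

```python
import json

# Illustrative cell value in the style of the raw `genres` column
raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'


def extract_names(cell: str) -> list:
    """Parse a JSON-encoded list of objects and keep only the 'name' fields."""
    return [item["name"] for item in json.loads(cell)]


genre_names = extract_names(raw_genres)
print(genre_names)
```

Applied column-wise with `df[column].apply(extract_names)`, this should give the list columns shown in the preview.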


In [238]:
# connect to xata
from xata.client import XataClient
dotenv.load_dotenv()

xata = XataClient()
database_name = "movies-database"

In [239]:
def create_table(table_name: str, table_schema: dict) -> None:
    """ create table if it doesn't exist, then apply the schema """
    resp = xata.table().create(table_name)
    if not resp.is_success():
        # creation also fails when the table already exists
        print(f"Error creating table '{table_name}': {resp}")
    resp = xata.table().set_schema(table_name, table_schema)
    if not resp.is_success():
        print(f"Error setting schema for table '{table_name}': {resp}")

In [115]:
# create tmdb_movies table
table_name = "tmdb_movies"

table_schema = {
    "columns": [
        {"name": "title", "type": "string"},
        {"name": "genres", "type": "multiple"},
        {"name": "keywords", "type": "multiple"},
        {"name": "overview", "type": "string"},
        {"name": "vote_average", "type": "float"},
        {"name": "production_countries", "type": "multiple"},
        {"name": "release_date", "type": "string"},
        {"name": "runtime", "type": "float"},
    ]
}

create_table(table_name, table_schema)

# insert one entry as a test
record = df.iloc[0].to_dict()
print(record)

xata.records().insert(table_name, record)

Table 'tmdb_movies' already exists


In [None]:
# if successful, fill up database
table_name = "tmdb_movies"

# records = df.to_dict(orient='records')
# for record in records:
    # xata.records().insert(table_name, record)

# or using bulk insert / faster but 1000 records limit at a time
records = df.to_dict(orient='records')
batch_size = 1000
for start in range(0, len(records), batch_size):
    # slices are contiguous, so no record is skipped between batches
    xata.records().bulk_insert(table_name, {"records": records[start:start + batch_size]})

The dataset is now on xata.
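
Since bulk insert is limited to 1000 records per call, the records have to be split into batches without dropping any at the boundaries. A small sketch of that slicing, using integers as a stand-in for the real record dicts (`chunked` is a hypothetical helper):

```python
from typing import Iterator


def chunked(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in for df.to_dict(orient='records'); the real items are dicts
records = list(range(2500))
batches = list(chunked(records, 1000))
print([len(batch) for batch in batches])  # [1000, 1000, 500]
```

Each batch could then be passed to `bulk_insert` in turn.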

### Create movie_history table to handle my own data
This is how to add records to a Xata table directly. In the next section I will use a different method instead: loading the records from documents and creating embeddings along the way.

In [None]:
table_name = "movie_history"
table_schema = {
    "columns": [
        {"name": "title", "type": "string"},
        {"name": "watched", "type": "bool"},
        {"name": "watch_list", "type": "bool"},
        {"name": "comment", "type": "string"},
        {"name": "embedding_text", "type": "string"},
    ]
}

create_table(table_name, table_schema)

rec_1 = {
    "title": "Inception",
    "watched": True,
    "watch_list": False,
    "comment": "This is one of my favourite movie, especially how it handles time paradox!",
    "embedding_text": "Inception (watched) one of my favourite movie, like: time paradox"
}
rec_2 = {
    "title": "Jurassik Park",
    "watched": True,
    "watch_list": False,
    "comment": "A true masterpiece. This is probably the best movie of all times, I could watch it forever and never get bored",
    "embedding_text": "Jurassik Park (watched) best movie of all times, masterpiece, never get bored"
}

rec_3 = {
    "title": "West Side Story",
    "watched": False,
    "watch_list": True,
    "comment": "I heard it's a very good movie and the music is amazing. I can't wait to watch it",
    "embedding_text": "West Side Story (watch_list) music amazing, can't wait to watch it"
}

rec_4 = {
    "title": "The Exorcist",
    "watched": False,
    "watch_list": False,
    "comment": "I don't like horror movies and I heard this one is pretty rough, I prefer not to watch it",
    "embedding_text": "The Exorcist (not on watch_list) I dont like horror movies, prefer not to watch it"
}

# add records to table
embedding_text_list = []
for record in [rec_1, rec_2, rec_3, rec_4]:
    xata.records().insert(table_name, record)
    embedding_text_list.append(record["embedding_text"])


In [242]:
# load the content of movie_history
records = xata.data().query(table_name)["records"]
for rec in records:
    print(rec["title"])

Inception
Jurassik Park
West Side Story
The Exorcist


In [244]:
def get_movie_history(table_name: str) -> dict:
    """ Return the titles of watched, watchlist and blacklist movies as '|'-separated strings """

    records = xata.data().query(table_name)["records"]
    watched = '|'.join([rec['title'] for rec in records if rec["watched"]])
    watchlist = '|'.join([rec['title'] for rec in records if not rec["watched"] and rec["watch_list"]])
    blacklist = '|'.join([rec['title'] for rec in records if not rec["watched"] and not rec["watch_list"]])
    
    return {"watched": watched, "watchlist": watchlist, "blacklist": blacklist}
        
        
history = get_movie_history(table_name)
print("Watched:", history["watched"])
print("Watchlist:", history["watchlist"])
print("Blacklist:", history["blacklist"])

Watched: Inception|Jurassik Park
Watchlist: West Side Story
Blacklist: The Exorcist
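
The three lists are derived from the two boolean columns: anything watched goes to "watched", and unwatched movies are split by the `watch_list` flag. A minimal sketch of that rule on plain dicts, without a Xata connection (`classify` is a hypothetical helper):

```python
def classify(record: dict) -> str:
    """Map the two boolean flags to one of the three list names."""
    if record["watched"]:
        return "watched"
    return "watchlist" if record["watch_list"] else "blacklist"


sample = [
    {"title": "Inception", "watched": True, "watch_list": False},
    {"title": "West Side Story", "watched": False, "watch_list": True},
    {"title": "The Exorcist", "watched": False, "watch_list": False},
]

labels = {rec["title"]: classify(rec) for rec in sample}
print(labels)
```

Note that a record with both flags set to True counts as watched under this rule.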


### Load data and embeddings into a Xata table
Pack all record data into Documents and load them into a Xata table, creating vector embeddings along the way.

In [166]:
from langchain_community.vectorstores.xata import XataVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain.schema import Document

# create vectorstore
embeddings = OpenAIEmbeddings()

rec_1 = {
    "title": "Inception",
    "watched": True,
    "watch_list": False,
    "comment": "This is one of my favourite movie, especially how it handles time paradox!",
    "embedding_text": "Inception (watched) one of my favourite movie, liked the time paradox"
}
rec_2 = {
    "title": "Jurassik Park",
    "watched": True,
    "watch_list": False,
    "comment": "A true masterpiece. This is probably the best movie of all times, I could watch it forever and never get bored",
    "embedding_text": "Jurassik Park (watched) best movie of all times, masterpiece, never get bored"
}

rec_3 = {
    "title": "West Side Story",
    "watched": False,
    "watch_list": True,
    "comment": "I heard it's a very good movie and the music is amazing. I can't wait to watch it",
    "embedding_text": "West Side Story (watch_list) music amazing, can't wait to watch it"
}

rec_4 = {
    "title": "The Exorcist",
    "watched": False,
    "watch_list": False,
    "comment": "I don't like horror movies and I heard this one is pretty rough, I prefer not to watch it",
    "embedding_text": "The Exorcist (not on watch_list) I dont like horror movies, prefer not to watch it"
}

documents = []

for rec in [rec_1, rec_2, rec_3, rec_4]:
    doc = Document(page_content=rec["embedding_text"], metadata={k: v for k, v in rec.items() if k != "embedding_text"})
    documents.append(doc)

print(documents)

table_name = "movie_history"
table_schema = {
    "columns": [
        {"name": "title", "type": "text"},
        {"name": "watched", "type": "bool"},
        {"name": "watch_list", "type": "bool"},
        {"name": "comment", "type": "text"},
        {"name": "content", "type": "text"},
        {"name": "embedding", "type": "vector", "vector": {"dimension": 1536}}
    ]
}

create_table(table_name, table_schema)

vector_store = XataVectorStore.from_documents(
    documents,
    embeddings,
    api_key=os.getenv("XATA_API_KEY"),
    db_url=os.getenv("XATA_DATABASE_URL"),
    table_name=table_name
)




[Document(metadata={'title': 'Inception', 'watched': True, 'watch_list': False, 'comment': 'This is one of my favourite movie, especially how it handles time paradox!'}, page_content='Inception (watched) one of my favourite movie, liked the time paradox'), Document(metadata={'title': 'Jurassik Park', 'watched': True, 'watch_list': False, 'comment': 'A true masterpiece. This is probably the best movie of all times, I could watch it forever and never get bored'}, page_content='Jurassik Park (watched) best movie of all times, masterpiece, never get bored'), Document(metadata={'title': 'West Side Story', 'watched': False, 'watch_list': True, 'comment': "I heard it's a very good movie and the music is amazing. I can't wait to watch it"}, page_content="West Side Story (watch_list) music amazing, can't wait to watch it"), Document(metadata={'title': 'The Exorcist', 'watched': False, 'watch_list': False, 'comment': "I don't like horror movies and I heard this one is pretty rough, I prefer not 

### Similarity Search test
Let's see if we can extract information from our history

In [245]:
query = "What movies am I most afraid of?"
found_docs = vector_store.similarity_search(query, k=1)
print(found_docs)

[Document(metadata={'comment': "I don't like horror movies and I heard this one is pretty rough, I prefer not to watch it", 'title': 'The Exorcist', 'watch_list': False, 'watched': False}, page_content='The Exorcist (not on watch_list) I dont like horror movies, prefer not to watch it')]
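
To make the idea behind the search concrete, here is a toy sketch of ranking entries by vector similarity. It assumes cosine similarity, which may not be the metric Xata actually uses, and the 3-dimensional vectors are invented for illustration (the real embeddings have 1536 dimensions).

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list, index: dict, k: int = 1) -> list:
    """Return the k titles whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda title: cosine_similarity(query, index[title]), reverse=True)
    return ranked[:k]


# Invented toy vectors, not real embeddings
toy_index = {
    "The Exorcist": [0.9, 0.1, 0.0],
    "Inception": [0.1, 0.9, 0.2],
    "West Side Story": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

best = top_k(query_vector, toy_index, k=1)
print(best)
```

In the real setup the query is embedded with the same model as the documents before the comparison.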


## Build our LLM structure

### Prompt

In [233]:
from langchain.prompts import ChatPromptTemplate

template = """You are a movie advisor offering me recommendations based on my watch history and preferences.
- Movies I have already seen: {watched}.
- Movies I haven't seen yet and I want to: {watchlist}.
- Movies I haven't seen yet and I don't want to: {blacklist}.
My preferences related to the question: {preferences}.
Output format: You recommend me movies and update my watch history based on my answers.
You return a JSON object with the following structure:
{{
    "title": "(str) The title you recommend",
    "comment": "(str) Your personal opinion about the movie",
    "watched": "(bool) True or False if I told you that I have already seen it", 
    "watchlist": "(bool) True or False if I told you that I have never seen it but I want to",
    "blacklist": "(bool) True or False if I told you that I have never seen it and I don't want to"
}}
Keep the "watched", "watchlist" and "blacklist" fields updated based on my answers.
Question: {question}.
"""

prompt = ChatPromptTemplate.from_template(template)
print(prompt)


input_variables=['blacklist', 'preferences', 'question', 'watched', 'watchlist'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['blacklist', 'preferences', 'question', 'watched', 'watchlist'], template='You are a movie advisor offering me recommendations based on my watch history and preferences.\nDon\'t recommend movies from the following 3 lists:\n- Moves I have already seen: {watched}.\n- Moves I haven\'t seen yet and I want to: {watchlist}.\n- Movies I haven\'t seen yet and I don\'t want to: {blacklist}.\nSome of my preferences related to the question: {preferences}.\nOutput format: You recommend me movies and update my watch history based on my answers.\nYou return a JSON object with the following structure:\n{{\n    "title": "(str) The title you recommend",\n    "comment": "(str) Your personal opinion about the movie",\n    "watched": "(bool) True or False if I told you that I have already seen it", \n    "watchlist": "(bool) True or False if I told y
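
The doubled braces around the JSON structure are there so they are treated as literal braces rather than template variables. A small sketch of the same escaping rule with plain `str.format`, assuming the prompt template uses the default f-string-style format (the short template here is illustrative):

```python
# Double braces render as literal braces, single braces are placeholders
template = 'Return JSON like {{"title": "..."}}. Question: {question}'

rendered = template.format(question="Any good sci-fi?")
print(rendered)
```

With single braces around the JSON example, the formatter would try to look up a variable named `"title"` and fail.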

### Similarity Retriever

In [231]:
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 1})
results = retriever.invoke("I'm afraid of horror movies")
print(results)

[Document(metadata={'comment': "I don't like horror movies and I heard this one is pretty rough, I prefer not to watch it", 'title': 'The Exorcist', 'watch_list': False, 'watched': False}, page_content='The Exorcist (not on watch_list) I dont like horror movies, prefer not to watch it')]


### LLM Model

In [203]:
from langchain_openai import ChatOpenAI
llm_gpt4 = ChatOpenAI(model='gpt-4o')

### JSON Output Parser
[Documentation](https://python.langchain.com/v0.1/docs/modules/model_io/output_parsers/types/json/)

In [236]:
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

model = ChatOpenAI(temperature=0)

# define desired data structure
class Recommendation(BaseModel):
    title: str = Field(description="The movie title you recommend")
    comment: str = Field(description="Your personal opinion about the movie")
    watched: bool = Field(description="True to add it to the list of watched movies, else False")
    watchlist: bool = Field(description="True to add it to the list of movies I would like to watch, else False")
    blacklist: bool = Field(description="True to add it to the list of movies I don't want to watch, else False")

output_parser = JsonOutputParser(pydantic_object=Recommendation)
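
To illustrate what I expect from the parsing step, here is a rough standard-library sketch that parses a JSON reply and checks the five fields. It is not how `JsonOutputParser` works internally; `parse_recommendation`, `EXPECTED_FIELDS` and the sample reply are made up for the example.

```python
import json

# Field names mirror the Recommendation model
EXPECTED_FIELDS = {
    "title": str,
    "comment": str,
    "watched": bool,
    "watchlist": bool,
    "blacklist": bool,
}


def parse_recommendation(reply: str) -> dict:
    """Parse a JSON reply and check that every expected field has the expected type."""
    data = json.loads(reply)
    for field, expected_type in EXPECTED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise ValueError(f"Field '{field}' is missing or not a {expected_type.__name__}")
    return data


# Made-up reply for illustration
reply = '{"title": "Alien", "comment": "Tense and atmospheric.", "watched": false, "watchlist": true, "blacklist": false}'
recommendation = parse_recommendation(reply)
print(recommendation["title"], recommendation["watchlist"])
```

A check like this would also catch a model that returns the booleans as the strings "True" or "False", which the wording of the prompt might encourage.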

### Build chain

In [237]:
# define input
history = get_movie_history(table_name)
watched = (lambda x: history["watched"])
watchlist = (lambda x: history["watchlist"])
blacklist = (lambda x: history["blacklist"])
preferences = (lambda x: x["question"]) | retriever
question = (lambda x: x["question"])
# named chain_inputs to avoid shadowing the built-in input()
chain_inputs = {"watched": watched, "watchlist": watchlist, "blacklist": blacklist, "preferences": preferences, "question": question}

# define chain
chain = (chain_inputs | prompt | llm_gpt4 | output_parser)

# invoke answer
answer = chain.invoke({"question": "Could you recommend a good horror movie?"})
print(answer)

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
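
As I understand it, the dict at the start of the chain is turned into a step that calls every value with the same input and collects the results under the same keys, which then fill the prompt variables. A plain-Python sketch of that behaviour, with a stand-in for the retriever (`build_inputs` is a hypothetical helper, and LangChain may run the entries in parallel rather than in sequence):

```python
def build_inputs(mapping: dict, payload: dict) -> dict:
    """Call every function in `mapping` with the same payload and collect the results."""
    return {key: fn(payload) for key, fn in mapping.items()}


# Same shape as the output of get_movie_history
history = {
    "watched": "Inception|Jurassik Park",
    "watchlist": "West Side Story",
    "blacklist": "The Exorcist",
}

mapping = {
    "watched": lambda x: history["watched"],
    "watchlist": lambda x: history["watchlist"],
    "blacklist": lambda x: history["blacklist"],
    # stand-in for the retriever step: it only echoes the question
    "preferences": lambda x: f"(documents retrieved for: {x['question']})",
    "question": lambda x: x["question"],
}

prompt_inputs = build_inputs(mapping, {"question": "Could you recommend a good horror movie?"})
print(sorted(prompt_inputs))
```

The resulting dict has exactly the five keys the prompt template expects.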