# Movie Advisor
*Chat with an AI movie advisor and get recommendations based on your preferences.*

[GitHub Repo](https://github.com/alexdjulin/movie-advisor)

## Brainstorming

The goal of this project is to get hands-on experience with LangChain by building an AI movie advisor. I want to be able to do the following:
- Ask for information about a movie ('I heard The Godfather is good, give me the plot')
- Ask for recommendations based on given criteria ('I like adventure movies, especially on the time-travel subject. Could you recommend a few?')
- Handle a list of movies I already watched and whether I liked them or not, and use it to fine-tune recommendations
- Handle a must-watch list of movies, add or remove movies from it
- Use RAG techniques to access information outside the LLM's knowledge: either search for a lesser-known movie on the Internet or input the plot myself

Interactions with the movie advisor should be via keyboard input first, then using speech.

The AI part should rely on an LLM. I will use OpenAI GPT-4o at first, then try implementing open-source models. I could even try fine-tuning them on the IMDb dataset.

Source info should come from the LLM itself. I will try to pack the IMDb dataset into a database for the model to consult, but I guess it has been trained on it already.

Lists should be stored in a database. I will use [Xata](https://app.xata.io/) for this.


In [114]:
! pip install -qU python-dotenv langchain langchain-community langchainhub openai langchain-openai ipykernel kaggle pandas xata requests tiktoken langchain_huggingface sentence-transformers

In [63]:
# load environment variables
import dotenv
dotenv.load_dotenv()

sep = 50 * "-"

## Dataset review

Here are some datasets on Kaggle that our LLM could access or extend:
- [IMDB Movies Dataset](https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows)
- [TMDB 5000 Movie Dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata)
- [Netflix Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/netflix-shows) - Netflix content only, probably a bit too restrictive

Let's have a look at the TMDB one.

## Review dataset

### Download

In [2]:
!kaggle datasets download -d tmdb/tmdb-movie-metadata

import os
import zipfile
dataset_path = 'dataset'
zip_file = 'tmdb-movie-metadata.zip'

# unzip dataset
with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    zip_ref.extractall(dataset_path)

os.remove(zip_file)
    

Dataset URL: https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata
License(s): other
tmdb-movie-metadata.zip: Skipping, found more recently modified local copy (use --force to force download)


### Extract data we need
The dataset has extensive information. We will only extract the following columns:
- title (string)
- genres (list)
- keywords (list)
- overview (string)
- vote_average (float)
- production_countries (list)
- release_date (string or datetime)
- runtime in minutes (float)

In [4]:
import pandas as pd

# load dataset and print columns
tmdb_df = pd.read_csv(f'{dataset_path}/tmdb_5000_movies.csv')
print(sorted(tmdb_df.columns.tolist()))

['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count']


In [5]:
import json

# filter columns to keep and make a copy of the df
# note: dropna() runs before the column selection, so rows with a missing value in any original column are dropped
columns = ["title", "genres", "keywords", "overview", "vote_average", "production_countries", "release_date", "runtime"]
df = tmdb_df.copy().dropna()
df = df[columns]

# format columns with multiple values as lists
# the raw cells are JSON strings, so json.loads parses them without evaluating file content as code
for column in ["genres", "keywords", "production_countries"]:
    df[column] = df[column].apply(lambda x: [i['name'] for i in json.loads(x)])

# # format floats to strings
# for column in ["vote_average", "runtime"]:
#     df[column] = df[column].astype(str)

# check the type of the first value in each column
for column in columns:
    print(type(df[column].iloc[0]))

df.head()

<class 'str'>
<class 'list'>
<class 'list'>
<class 'str'>
<class 'numpy.float64'>
<class 'list'>
<class 'str'>
<class 'numpy.float64'>


| | title | genres | keywords | overview | vote_average | production_countries | release_date | runtime |
|---|---|---|---|---|---|---|---|---|
| 0 | Avatar | [Action, Adventure, Fantasy, Science Fiction] | [culture clash, future, space war, space colon... | In the 22nd century, a paraplegic Marine is di... | 7.2 | [United States of America, United Kingdom] | 2009-12-10 | 162.0 |
| 1 | Pirates of the Caribbean: At World's End | [Adventure, Fantasy, Action] | [ocean, drug abuse, exotic island, east india ... | Captain Barbossa, long believed to be dead, ha... | 6.9 | [United States of America] | 2007-05-19 | 169.0 |
| 2 | Spectre | [Action, Adventure, Crime] | [spy, based on novel, secret agent, sequel, mi... | A cryptic message from Bond’s past sends him o... | 6.3 | [United Kingdom, United States of America] | 2015-10-26 | 148.0 |
| 3 | The Dark Knight Rises | [Action, Crime, Drama, Thriller] | [dc comics, crime fighter, terrorist, secret i... | Following the death of District Attorney Harve... | 7.6 | [United States of America] | 2012-07-16 | 165.0 |
| 4 | John Carter | [Action, Adventure, Science Fiction] | [based on novel, mars, medallion, space travel... | John Carter is a war-weary, former military ca... | 6.1 | [United States of America] | 2012-03-07 | 132.0 |
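
The `genres`, `keywords` and `production_countries` cells are stored as JSON strings, each holding a list of objects with a `name` key. As a minimal sketch of that conversion on its own, without pandas (the `parse_names` helper and the sample string are illustrative, not part of the notebook):

```python
import json


def parse_names(cell: str) -> list[str]:
    """Return the 'name' value of every object in a JSON-encoded list."""
    return [item["name"] for item in json.loads(cell)]


# Illustrative sample shaped like a raw 'genres' cell
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(parse_names(sample))
```

Parsing with `json.loads` only accepts data, so a malformed or malicious cell raises an error instead of being executed.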


In [8]:
# connect to xata
from xata.client import XataClient

xata = XataClient()
database_name = "movies-database"

In [9]:
def create_table(table_name: str, table_schema: dict) -> None:
    """ Create the table if it doesn't exist, then apply the schema """
    # check the responses explicitly rather than with assert, which is skipped under python -O
    resp = xata.table().create(table_name)
    if not resp.is_success():
        print(f"Error creating table '{table_name}': {resp}")

    resp = xata.table().set_schema(table_name, table_schema)
    if not resp.is_success():
        print(f"Error setting schema for table '{table_name}': {resp}")

In [10]:
# create tmdb_movies table
table_name = "tmdb_movies"

table_schema = {
    "columns": [
        {"name": "title", "type": "string"},
        {"name": "genres", "type": "multiple"},
        {"name": "keywords", "type": "multiple"},
        {"name": "overview", "type": "string"},
        {"name": "vote_average", "type": "float"},
        {"name": "production_countries", "type": "multiple"},
        {"name": "release_date", "type": "string"},
        {"name": "runtime", "type": "float"},
    ]
}

create_table(table_name, table_schema)

# insert one entry as a test
record = df.iloc[0].to_dict()
print(record)

xata.records().insert(table_name, record)

{'title': 'Avatar', 'genres': ['Action', 'Adventure', 'Fantasy', 'Science Fiction'], 'keywords': ['culture clash', 'future', 'space war', 'space colony', 'society', 'space travel', 'futuristic', 'romance', 'space', 'alien', 'tribe', 'alien planet', 'cgi', 'marine', 'soldier', 'battle', 'love affair', 'anti war', 'power relations', 'mind and soul', '3d'], 'overview': 'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.', 'vote_average': 7.2, 'production_countries': ['United States of America', 'United Kingdom'], 'release_date': '2009-12-10', 'runtime': 162.0}


{'id': 'rec_cqce4hei9lerrvc50680',
 'xata': {'createdAt': '2024-07-18T09:37:09.605725Z',
  'updatedAt': '2024-07-18T09:37:09.605725Z',
  'version': 0}}

In [None]:
# if successful, fill up database
table_name = "tmdb_movies"

# records = df.to_dict(orient='records')
# for record in records:
    # xata.records().insert(table_name, record)

# or using bulk insert / faster, but limited to 1000 records per call
# slicing in steps of 1000 makes sure no record is skipped between batches
records = df.to_dict(orient='records')
for start in range(0, len(records), 1000):
    xata.records().bulk_insert(table_name, {"records": records[start:start + 1000]})
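
Since bulk inserts are limited to 1000 records per call, the record list has to be cut into batches. A small sketch of that batching in plain Python (`chunked` is a hypothetical helper, and the integers stand in for the record dicts):

```python
from typing import Iterator


def chunked(items: list, size: int = 1000) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items, without skipping any."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


records = list(range(2500))  # stand-in for df.to_dict(orient='records')
batches = list(chunked(records))
print([len(batch) for batch in batches])
```

Each batch could then be passed to `bulk_insert` as `{"records": batch}`.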

The dataset is now on Xata.

### Create movie_history table to handle my own data
This is how to add records to a Xata table directly. I will end up using the next method instead, which loads them from documents and creates embeddings along the way.

In [12]:
table_name = "movie_history"
table_schema = {
    "columns": [
        {"name": "title", "type": "string"},
        {"name": "watched", "type": "bool"},
        {"name": "watch_list", "type": "bool"},
        {"name": "comment", "type": "string"},
        {"name": "embedding_text", "type": "string"},
    ]
}

create_table(table_name, table_schema)

rec_1 = {
    "title": "Inception",
    "watched": True,
    "watch_list": False,
    "comment": "This is one of my favourite movie, especially how it handles time paradox!",
    "embedding_text": "Inception (watched) one of my favourite movie, like: time paradox"
}
rec_2 = {
    "title": "Jurassik Park",
    "watched": True,
    "watch_list": False,
    "comment": "A true masterpiece. This is probably the best movie of all times, I could watch it forever and never get bored",
    "embedding_text": "Jurassik Park (watched) best movie of all times, masterpiece, never get bored"
}

rec_3 = {
    "title": "West Side Story",
    "watched": False,
    "watch_list": True,
    "comment": "I heard it's a very good movie and the music is amazing. I can't wait to watch it",
    "embedding_text": "West Side Story (watch_list) music amazing, can't wait to watch it"
}

rec_4 = {
    "title": "The Exorcist",
    "watched": False,
    "watch_list": False,
    "comment": "I don't like horror movies and I heard this one is pretty rough, I prefer not to watch it",
    "embedding_text": "The Exorcist (not on watch_list) I dont like horror movies, prefer not to watch it"
}

# add records to table
embedding_text_list = []
for record in [rec_1, rec_2, rec_3, rec_4]:
    xata.records().insert(table_name, record)
    embedding_text_list.append(record["embedding_text"])


In [13]:
# load the content of movie_history
table_name = "movie_history"
records = xata.data().query(table_name)["records"]
for rec in records:
    print(rec["title"])

Jurassik Park
Inception
West Side Story
The Exorcist


In [14]:
def get_movie_history(table_name: str) -> dict:
    """ Return the watched, watchlist and blacklist titles as ' | '-separated strings """

    records = xata.data().query(table_name)["records"]
    watched = ' | '.join([rec['title'] for rec in records if rec["watched"]])
    watchlist = ' | '.join([rec['title'] for rec in records if not rec["watched"] and rec["watch_list"]])
    blacklist = ' | '.join([rec['title'] for rec in records if not rec["watched"] and not rec["watch_list"]])
    
    return {"watched": watched, "watchlist": watchlist, "blacklist": blacklist}
        
        
history = get_movie_history(table_name)
print("Watched:", history["watched"])
print("Watchlist:", history["watchlist"])
print("Blacklist:", history["blacklist"])

Watched: Jurassik Park | Inception
Watchlist: West Side Story
Blacklist: The Exorcist
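
The three lists follow from the two boolean columns: a watched movie goes to `watched` regardless of `watch_list`, an unwatched movie with `watch_list` set goes to the watchlist, and everything else lands in the blacklist. A sketch of the same rule as a single pass over plain dicts, with no database call (`group_titles` and `sample_records` are illustrative names):

```python
def group_titles(records: list[dict]) -> dict[str, str]:
    """Split records into watched / watchlist / blacklist title strings."""
    groups = {"watched": [], "watchlist": [], "blacklist": []}
    for rec in records:
        if rec["watched"]:
            groups["watched"].append(rec["title"])
        elif rec["watch_list"]:
            groups["watchlist"].append(rec["title"])
        else:
            groups["blacklist"].append(rec["title"])
    # Join each list the same way the notebook does
    return {name: " | ".join(titles) for name, titles in groups.items()}


sample_records = [
    {"title": "Inception", "watched": True, "watch_list": False},
    {"title": "West Side Story", "watched": False, "watch_list": True},
    {"title": "The Exorcist", "watched": False, "watch_list": False},
]
print(group_titles(sample_records))
```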


### Load data and embedding to xata table
Pack all record data into Documents and load them into a Xata table, creating vector embeddings along the way.

In [15]:
from langchain_community.vectorstores.xata import XataVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

# create vectorstore
embeddings = OpenAIEmbeddings()

rec_1 = {
    "title": "Inception",
    "watched": True,
    "watch_list": False,
    "comment": "This is one of my favourite movie, especially how it handles time paradox!",
    "embedding_text": "Inception (watched) one of my favourite movie, liked the time paradox"
}
rec_2 = {
    "title": "Jurassik Park",
    "watched": True,
    "watch_list": False,
    "comment": "A true masterpiece. This is probably the best movie of all times, I could watch it forever and never get bored",
    "embedding_text": "Jurassik Park (watched) best movie of all times, masterpiece, never get bored"
}

rec_3 = {
    "title": "West Side Story",
    "watched": False,
    "watch_list": True,
    "comment": "I heard it's a very good movie and the music is amazing. I can't wait to watch it",
    "embedding_text": "West Side Story (watch_list) music amazing, can't wait to watch it"
}

rec_4 = {
    "title": "The Exorcist",
    "watched": False,
    "watch_list": False,
    "comment": "I don't like horror movies and I heard this one is pretty rough, I prefer not to watch it",
    "embedding_text": "The Exorcist (not on watch_list) I dont like horror movies, prefer not to watch it"
}

documents = []

for rec in [rec_1, rec_2, rec_3, rec_4]:
    doc = Document(page_content=rec["embedding_text"], metadata={k: v for k, v in rec.items() if k != "embedding_text"})
    documents.append(doc)

print(documents)

table_name = "movie_history"
table_schema = {
    "columns": [
        {"name": "title", "type": "text"},
        {"name": "watched", "type": "bool"},
        {"name": "watch_list", "type": "bool"},
        {"name": "comment", "type": "text"},
        {"name": "content", "type": "text"},
        {"name": "embedding", "type": "vector", "vector": {"dimension": 1536}}
    ]
}

create_table(table_name, table_schema)

# create vectorstore from documents
vector_store = XataVectorStore.from_documents(
    documents,
    embeddings,
    api_key=os.getenv("XATA_API_KEY"),
    db_url=os.getenv("XATA_DATABASE_URL"),
    table_name=table_name
)


[Document(metadata={'title': 'Inception', 'watched': True, 'watch_list': False, 'comment': 'This is one of my favourite movie, especially how it handles time paradox!'}, page_content='Inception (watched) one of my favourite movie, liked the time paradox'), Document(metadata={'title': 'Jurassik Park', 'watched': True, 'watch_list': False, 'comment': 'A true masterpiece. This is probably the best movie of all times, I could watch it forever and never get bored'}, page_content='Jurassik Park (watched) best movie of all times, masterpiece, never get bored'), Document(metadata={'title': 'West Side Story', 'watched': False, 'watch_list': True, 'comment': "I heard it's a very good movie and the music is amazing. I can't wait to watch it"}, page_content="West Side Story (watch_list) music amazing, can't wait to watch it"), Document(metadata={'title': 'The Exorcist', 'watched': False, 'watch_list': False, 'comment': "I don't like horror movies and I heard this one is pretty rough, I prefer not 

In [16]:
# initialise existing vectorstore
table_name = "movie_history"
embeddings = OpenAIEmbeddings()
vector_store = XataVectorStore(
    embedding=embeddings,
    api_key=os.getenv("XATA_API_KEY"),
    db_url=os.getenv("XATA_DATABASE_URL"),
    table_name=table_name
)

vector_store

<langchain_community.vectorstores.xata.XataVectorStore at 0x186dc7e4800>

### Similarity Search test
Let's see if we can extract information from our history

In [27]:
query = "What movies am I most afraid of?"
found_docs = vector_store.similarity_search(query, k=1)
print(found_docs[0].page_content)

The Exorcist (not on watch_list) I dont like horror movies, prefer not to watch it
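
Conceptually, a similarity search embeds the query and returns the stored texts whose vectors are closest to it. The sketch below illustrates that idea with cosine similarity over hand-made 3-dimensional vectors. The vectors, `cosine` and `top_k` are made up for illustration and say nothing about how Xata or OpenAI compute or index the real 1536-dimensional embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], docs: list[tuple], k: int = 1) -> list[str]:
    """Return the texts of the k documents most similar to the query vector."""
    ranked = sorted(docs, key=lambda doc: cosine(query_vec, doc[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy vectors: axes loosely meaning (sci-fi, horror, musical)
docs = [
    ("Inception (watched)", [0.9, 0.1, 0.0]),
    ("West Side Story (watch_list)", [0.0, 0.1, 0.9]),
    ("The Exorcist (not on watch_list)", [0.1, 0.9, 0.0]),
]
query_vec = [0.0, 1.0, 0.1]  # pretend embedding of a horror-related question
print(top_k(query_vec, docs, k=1))
```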


## Build our LLM structure

### Prompt

In [28]:
from langchain.prompts import ChatPromptTemplate

template = """You are a movie advisor offering me recommendations based on my preferences and watch history from these 3 lists:
- Watched: Movies I have already seen: {watched}.
- Watchlist: Movies I haven't seen yet but I want to: {watchlist}.
- Blacklist: Movies I haven't seen yet but I don't want to: {blacklist}.
Only recommend one movie at a time.
My preferences related to the question to orient your answer: {preferences}.
Question: {question}.
"""

prompt = ChatPromptTemplate.from_template(template)
print(prompt)


input_variables=['blacklist', 'preferences', 'question', 'watched', 'watchlist'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['blacklist', 'preferences', 'question', 'watched', 'watchlist'], template="You are a movie advisor offering me recommendations based on my preferences and watch history from these 3 lists:\n- Watched: Movies I have already seen: {watched}.\n- Watchlist: Movies I haven't seen yet but I want to: {watchlist}.\n- Blacklist: Movies I haven't seen yet but I don't want to: {blacklist}.\nOnly recommend one movie at a time.\nMy preferences related to the question to orient your answer: {preferences}.\nQuestion: {question}.\n"))]


### Similarity Retriever

In [30]:
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 1})
results = retriever.invoke("I'm afraid of horror movies")
print(results[0].page_content)

The Exorcist (not on watch_list) I dont like horror movies, prefer not to watch it


### LLM Model

In [31]:
from langchain_openai import ChatOpenAI
llm_gpt4 = ChatOpenAI(model='gpt-4o')

### String output parser
Create a string output parser for general chatting with the advisor about movies

In [32]:
from langchain_core.output_parsers import StrOutputParser
str_output_parser = StrOutputParser()

In [37]:
from langchain_core.prompts.prompt import PromptTemplate

# test string parser
test_prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": "String"},
)

str_chain = test_prompt | llm_gpt4 | str_output_parser
response = str_chain.invoke({"query": "What movie should I watch tonight?"})
print(response)

Choosing a movie can depend on your mood, interests, and what genre you're in the mood for. Here are some recommendations based on different genres:

1. **Action/Adventure**: 
   - "Mad Max: Fury Road"
   - "John Wick"

2. **Comedy**:
   - "Superbad"
   - "The Grand Budapest Hotel"

3. **Drama**:
   - "The Shawshank Redemption"
   - "A Beautiful Mind"

4. **Horror**:
   - "Get Out"
   - "The Conjuring"

5. **Romance**:
   - "Pride and Prejudice"
   - "La La Land"

6. **Science Fiction**:
   - "Inception"
   - "Blade Runner 2049"

7. **Family/Animated**:
   - "Coco"
   - "Toy Story 4"

8. **Documentary**:
   - "13th"
   - "Free Solo"

Feel free to choose based on what you feel like watching tonight! If you have any specific preferences or mood, let me know and I can tailor the recommendation more closely to your tastes.


### JSON Output Parser
Create a JSON output parser to update the database. See the [documentation](https://python.langchain.com/v0.1/docs/modules/model_io/output_parsers/types/json/).

In [41]:
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

model = ChatOpenAI(temperature=0)

# define desired data structure
class Recommendation(BaseModel):
    title: str = Field(description="The movie title you recommend")
    comment: str = Field(description="My personal opinion about the movie")
    watched: bool = Field(description="True to add it to the list of watched movies, else False")
    watchlist: bool = Field(description="True to add it to the list of movies I would like to watch, else False")
    blacklist: bool = Field(description="True to add it to the list of movies I don't want to watch, else False")
    embedding: str = Field(description="A short text that describes my answer as follow: {movie_title} ({watched/watchlist/blacklist}) {personal_opinion}. It will be stored as an embedding vector in the database")

json_output_parser = JsonOutputParser(pydantic_object=Recommendation)
print(json_output_parser)

pydantic_object=<class '__main__.Recommendation'>


In [42]:
# test json parser
test_prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": json_output_parser.get_format_instructions()},
)

json_chain = test_prompt | model | json_output_parser
json_chain.invoke({"query": "What movie should I watch tonight?"})

{'title': 'Inception',
 'comment': "One of the best mind-bending movies I've ever seen",
 'watched': False,
 'watchlist': True,
 'blacklist': False,
 'embedding': "Inception (watchlist) One of the best mind-bending movies I've ever seen"}
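
The parsed keys do not line up exactly with the `movie_history` columns defined earlier (`watchlist` versus `watch_list`, `embedding` versus `content`). One possible way to bridge them before writing to the table is sketched below. `to_history_record`, the column mapping and the "exactly one flag" check are assumptions, not code from the notebook.

```python
def to_history_record(rec: dict) -> dict:
    """Map a parsed recommendation onto the movie_history columns (assumed mapping)."""
    flags = [rec["watched"], rec["watchlist"], rec["blacklist"]]
    if sum(flags) != 1:
        # The three lists are mutually exclusive, so reject ambiguous answers
        raise ValueError(f"expected exactly one list flag, got {flags}")
    return {
        "title": rec["title"],
        "watched": rec["watched"],
        "watch_list": rec["watchlist"],
        "comment": rec["comment"],
        "content": rec["embedding"],  # text that gets embedded
    }


parsed = {
    "title": "Inception",
    "comment": "One of the best mind-bending movies I've ever seen",
    "watched": False,
    "watchlist": True,
    "blacklist": False,
    "embedding": "Inception (watchlist) One of the best mind-bending movies I've ever seen",
}
print(to_history_record(parsed))
```

The blacklist needs no column of its own here, since the notebook treats a movie that is neither watched nor on the watch list as blacklisted.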

### Build chain

In [43]:
# define input
table_name = "movie_history"
history = get_movie_history(table_name)

watched = (lambda x: history["watched"])
watchlist = (lambda x: history["watchlist"])
blacklist = (lambda x: history["blacklist"])

preferences = (lambda x: x["question"]) | retriever

question = (lambda x: x["question"])

input_var = {"watched": watched, "watchlist": watchlist, "blacklist": blacklist, "preferences": preferences, "question": question}

# define chain
chain = (input_var | prompt | llm_gpt4 | str_output_parser)

# invoke answer
answer = chain.invoke({"question": "Could you recommend a good horror movie?"})
print(answer)

Given that you don't like horror movies and prefer not to watch them, I wouldn't recommend a horror film. Instead, considering your interest in movies like "Inception" and "Jurassic Park," I would suggest something that fits more into the sci-fi, thriller, or adventure genres. 

How about "Interstellar"? It's a thought-provoking sci-fi film directed by Christopher Nolan, the same director as "Inception." It combines elements of adventure, drama, and science fiction, making it a great match for your preferences.
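
When a dict of callables is placed at the start of a chain, LangChain runs each entry on the same input and passes the collected results on as a dict (it coerces the dict into a `RunnableParallel`). A plain-Python analogy of that fan-out step, run sequentially and with a stub in place of the retriever (`fan_out`, `fake_retriever` and the sample values are hypothetical):

```python
from typing import Any, Callable


def fan_out(mapping: dict[str, Callable[[dict], Any]], inputs: dict) -> dict:
    """Call every function on the same inputs and collect the results by key."""
    return {key: func(inputs) for key, func in mapping.items()}


def fake_retriever(question: str) -> list[str]:
    # Stand-in for the vector store retriever
    return [f"stored preference matching: {question}"]


history = {"watched": "Inception", "watchlist": "West Side Story", "blacklist": "The Exorcist"}
mapping = {
    "watched": lambda x: history["watched"],
    "watchlist": lambda x: history["watchlist"],
    "blacklist": lambda x: history["blacklist"],
    "preferences": lambda x: fake_retriever(x["question"]),
    "question": lambda x: x["question"],
}
prompt_inputs = fan_out(mapping, {"question": "Any good sci-fi?"})
print(prompt_inputs)
```

The resulting dict has one key per placeholder in the prompt template, which is what the prompt step expects as input.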


### Build Tools
Let's build tool functions that our model can use to add or remove a movie from a list.

In [119]:
from langchain_core.tools import tool

@tool
def add_movie_to_watched_list(title: str) -> dict:
    """Add a movie title to the list of movies I have already watched"""
    global watch_dict
    watch_dict["watched"].append(title)
    return watch_dict

@tool
def add_movie_to_mustSee_list(title: str) -> dict:
    """Add a movie title to the list of movies I want to see"""
    global watch_dict
    watch_dict["must_see"].append(title)
    return watch_dict

@tool
def add_movie_to_notInterested_list(title: str) -> dict:
    """Add a movie title to the list of movies I'm not interested in"""
    global watch_dict
    watch_dict["not_interested"].append(title)
    return watch_dict

@tool
def remove_movie_from_watched(title: str) -> dict:
    """Remove a movie title from the list of movies I have already watched"""
    global watch_dict
    watch_dict["watched"].remove(title)
    return watch_dict

@tool
def remove_movie_from_mustSee(title: str) -> dict:
    """Remove a movie title from the list of movies I want to see"""
    global watch_dict
    watch_dict["must_see"].remove(title)
    return watch_dict

@tool
def remove_movie_from_notInterested(title: str) -> dict:
    """Remove a movie title from the list of movies I'm not interested in"""
    global watch_dict
    watch_dict["not_interested"].remove(title)
    return watch_dict

tools = [
    add_movie_to_watched_list,
    add_movie_to_mustSee_list,
    add_movie_to_notInterested_list,
    remove_movie_from_watched,
    remove_movie_from_mustSee,
    remove_movie_from_notInterested
]

# link our model to the tools
gpt4_with_tools = llm_gpt4.bind_tools(tools)   
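
`list.remove` raises a `ValueError` when the title is not in the list, which can happen if the model asks to remove a movie that was never added or spells the title differently. One possible guard, shown as a plain function without the `@tool` decorator (`remove_title` and the case-insensitive matching are assumptions, not part of the notebook):

```python
def remove_title(titles: list[str], title: str) -> bool:
    """Remove the first case-insensitive match in place; return True if removed."""
    wanted = title.strip().lower()
    for index, existing in enumerate(titles):
        if existing.strip().lower() == wanted:
            del titles[index]
            return True
    return False


watched = ["The Matrix", "Inception"]
print(remove_title(watched, "the matrix"), watched)
print(remove_title(watched, "Alien"), watched)
```

Returning a boolean lets a tool report back to the model whether anything was actually removed.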


### Create Agent

Let's now create an agent that will use the tools to update our lists.

In [126]:
from langchain import hub
from langchain.agents import AgentExecutor, create_tool_calling_agent

# import premade prompt
prompt = hub.pull("hwchase17/openai-tools-agent")
print(prompt)

agent = create_tool_calling_agent(llm_gpt4, tools, prompt)

agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=False)

watch_dict = {"watched": [], "must_see": [], "not_interested": []}
print(watch_dict)

result = agent_executor.invoke(
    {
        "input": "Could you please add the movie 'The Matrix' to my watched list?"
    }
)

print(result)
print(watch_dict)

input_variables=['agent_scratchpad', 'input'] optional_variables=['chat_history'] input_types={'chat_history': typing.List[typing.Union[langchain_core.messages.ai.AIMessage, langchain_core.messages.human.HumanMessage, langchain_core.messages.chat.ChatMessage, langchain_core.messages.system.SystemMessage, langchain_core.messages.function.FunctionMessage, langchain_core.messages.tool.ToolMessage]], 'agent_scratchpad': typing.List[typing.Union[langchain_core.messages.ai.AIMessage, langchain_core.messages.human.HumanMessage, langchain_core.messages.chat.ChatMessage, langchain_core.messages.system.SystemMessage, langchain_core.messages.function.FunctionMessage, langchain_core.messages.tool.ToolMessage]]} partial_variables={'chat_history': []} metadata={'lc_hub_owner': 'hwchase17', 'lc_hub_repo': 'openai-tools-agent', 'lc_hub_commit_hash': 'c18672812789a3b9697656dd539edf0120285dcae36396d0b548ae42a4ed66f5'} messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], templa

In [None]:
# test it in a while loop
watch_dict = {"watched": [], "must_see": [], "not_interested": []}

while True:

    print("WATCH STATUS:", watch_dict)
    
    input_text = input("Alex:")
    if not input_text:
        break

    result = agent_executor.invoke({"input": input_text})
    # invoke() returns a dict; the agent's reply is under the "output" key
    print("Agent:", result["output"])