# Movie Advisor
*Chat with an AI movie advisor and get recommendations based on your preferences.*

[GitHub Repo](https://github.com/alexdjulin/movie-advisor)

## Brainstorming

The goal of this project is to get my hands on langchain and build an AI movie advisor. I want to be able to do the following:
- Ask the agent for information about a movie ('I heard The Godfather is good, give me the plot')
- Ask for recommendations based on given criteria ('I like adventure movies, especially on the time-travel subject. Could you recommend a few?')
- Handle my 3 lists of movies: Already watched, must-see and not-interested. The agent should keep my lists up to date based on my answers. For each, I want to be able to add a comment (if I liked them or not for instance), and use it to fine-tune recommendations
- The agent should be able to search for information when the LLM does not know what to answer. For instance, search for a recent or less-known movie on the Internet or on a movie database

Interactions with the movie advisor should be via keyboard input first, then using speech.

The AI part should rely on an LLM. I will use OpenAI gpt-4o-mini, which has just been released.

Source info should come from the LLM first. I will try to pack the IMDb dataset into a searchable database for the model to consult, although I guess that it has been trained on it already. Still, it would be interesting to see if the model can add entries to it, for recent or less-known movies it has not heard about yet.

Lists should be stored in a database. I will use [Xata](https://app.xata.io/) for this.


## Install modules used in this Notebook

In [20]:
! pip install -qU python-dotenv langchain langchain-community langchainhub openai langchain-openai ipykernel kaggle pandas xata requests tiktoken langchain_huggingface sentence-transformers langchain-google-community[speech]

## Imports
Let's import all modules used in this notebook.  
Put your API keys in a .env file first; they will be loaded into environment variables below.  
See .env-template.  

In [1]:
# misc
import os
import json
import zipfile
import pandas as pd
# langchain
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.tools import tool
from langchain import hub
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain.prompts import ChatPromptTemplate
from langchain_community.vectorstores.xata import XataVectorStore
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.schema import Document
from langchain_core.pydantic_v1 import BaseModel, Field
# load environment variables first, so the clients below can read the API keys
import dotenv
dotenv.load_dotenv()
# openai
embeddings = OpenAIEmbeddings()
# xata client and config
from xata.client import XataClient
xata = XataClient()
# print separator
sep = 100 * "-"

## Dataset review

Here are some datasets on Kaggle that our LLM could access or extend:
- [IMDB Movies Dataset](https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows) - 1000 entries from the IMDB database
- [TMDB 5000 Movie Dataset](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata) - 5000 entries from the TMDB database
- [Netflix Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/netflix-shows) - Netflix contents, probably a bit too restrictive

Let's have a look at the TMDB one.

## Review the TMDB Movie Dataset

### Download

In [5]:
!kaggle datasets download -d tmdb/tmdb-movie-metadata

dataset_path = 'dataset'
zip_file = 'tmdb-movie-metadata.zip'

# unzip dataset
with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    zip_ref.extractall(dataset_path)

os.remove(zip_file)
    

Dataset URL: https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata
License(s): other
Downloading tmdb-movie-metadata.zip to c:\Development\_repos\movie-advisor




  0%|          | 0.00/8.89M [00:00<?, ?B/s]
 11%|█▏        | 1.00M/8.89M [00:00<00:05, 1.64MB/s]
 34%|███▍      | 3.00M/8.89M [00:00<00:01, 4.44MB/s]
 56%|█████▋    | 5.00M/8.89M [00:01<00:00, 6.49MB/s]
 79%|███████▉  | 7.00M/8.89M [00:01<00:00, 7.96MB/s]
100%|██████████| 8.89M/8.89M [00:01<00:00, 8.99MB/s]
100%|██████████| 8.89M/8.89M [00:01<00:00, 6.83MB/s]


### Extract data we need
The dataset contains extensive information. We will only extract the following columns:
- title (string)
- genres (list)
- keywords (list)
- overview (string)
- vote_average (float)
- production_countries (list)
- release_date (string or datetime)
- runtime in minutes (float)

In [6]:
# load dataset and print columns
tmdb_df = pd.read_csv(f'{dataset_path}/tmdb_5000_movies.csv')
print(sorted(tmdb_df.columns.tolist()))

['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count']


In [8]:
# filter columns to keep and make a copy of the df
# note: dropna() runs on the full dataset, so a row with a missing value in any original column is dropped
columns = ["title", "genres", "keywords", "overview", "vote_average", "production_countries", "release_date", "runtime"]
df = tmdb_df.copy().dropna()
df = df[columns]

# format columns with multiple values as lists
# (the values are JSON strings, so json.loads is a safer parser than eval)
for column in ["genres", "keywords", "production_countries"]:
    df[column] = df[column].apply(lambda x: [i['name'] for i in json.loads(x)])

# # format floats to strings
# for column in ["vote_average", "runtime"]:
#     df[column] = df[column].astype(str)

# peek at the first value of each column (no conversion happens here)
for column in columns:
    col = df[column].iloc[0]

df.head()

| | title | genres | keywords | overview | vote_average | production_countries | release_date | runtime |
|---|---|---|---|---|---|---|---|---|
| 0 | Avatar | [Action, Adventure, Fantasy, Science Fiction] | [culture clash, future, space war, space colon... | In the 22nd century, a paraplegic Marine is di... | 7.2 | [United States of America, United Kingdom] | 2009-12-10 | 162.0 |
| 1 | Pirates of the Caribbean: At World's End | [Adventure, Fantasy, Action] | [ocean, drug abuse, exotic island, east india ... | Captain Barbossa, long believed to be dead, ha... | 6.9 | [United States of America] | 2007-05-19 | 169.0 |
| 2 | Spectre | [Action, Adventure, Crime] | [spy, based on novel, secret agent, sequel, mi... | A cryptic message from Bond’s past sends him o... | 6.3 | [United Kingdom, United States of America] | 2015-10-26 | 148.0 |
| 3 | The Dark Knight Rises | [Action, Crime, Drama, Thriller] | [dc comics, crime fighter, terrorist, secret i... | Following the death of District Attorney Harve... | 7.6 | [United States of America] | 2012-07-16 | 165.0 |
| 4 | John Carter | [Action, Adventure, Science Fiction] | [based on novel, mars, medallion, space travel... | John Carter is a war-weary, former military ca... | 6.1 | [United States of America] | 2012-03-07 | 132.0 |
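
The `genres`, `keywords` and `production_countries` columns are stored in the CSV as JSON-encoded strings. As a minimal sketch of the conversion step, here is the same idea applied to a made-up sample value (not a row copied from the dataset). The helper name `extract_names` is hypothetical.

```python
import json


def extract_names(raw: str) -> list[str]:
    """Parse a JSON-encoded list of objects and keep each 'name' value."""
    return [item["name"] for item in json.loads(raw)]


# hypothetical sample in the same shape as the TMDB 'genres' column
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(extract_names(sample))
```

Parsing with `json.loads` avoids executing whatever happens to be in the cell, which is the main reason to prefer it over `eval` here.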


## Create Xata TMDB movies table

In [3]:
def create_table(table_name: str, table_schema: dict) -> None:
    """ create table if it doesn't exist """
    try:
        assert xata.table().create(table_name).is_success()
    except AssertionError as e:
        print(f"Error creating table '{table_name}': {e}")
    try:
        resp = xata.table().set_schema(table_name, table_schema)
        assert resp.is_success(), resp
    except AssertionError as e:
        print(f"Error setting schema for table '{table_name}': {e}")

In [10]:
table_name = "tmdb_movies"

table_schema = {
    "columns": [
        {"name": "title", "type": "string"},
        {"name": "genres", "type": "multiple"},
        {"name": "keywords", "type": "multiple"},
        {"name": "overview", "type": "string"},
        {"name": "vote_average", "type": "float"},
        {"name": "production_countries", "type": "multiple"},
        {"name": "release_date", "type": "string"},
        {"name": "runtime", "type": "float"},
    ]
}

create_table(table_name, table_schema)

# insert one entry as a test
record = df.iloc[0].to_dict()
print(record)

xata.records().insert(table_name, record)

{'title': 'Avatar', 'genres': ['Action', 'Adventure', 'Fantasy', 'Science Fiction'], 'keywords': ['culture clash', 'future', 'space war', 'space colony', 'society', 'space travel', 'futuristic', 'romance', 'space', 'alien', 'tribe', 'alien planet', 'cgi', 'marine', 'soldier', 'battle', 'love affair', 'anti war', 'power relations', 'mind and soul', '3d'], 'overview': 'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.', 'vote_average': 7.2, 'production_countries': ['United States of America', 'United Kingdom'], 'release_date': '2009-12-10', 'runtime': 162.0}


{'id': 'rec_cqlp1gs10k8fe1dfhupg',
 'xata': {'createdAt': '2024-08-01T13:42:27.873639Z',
  'updatedAt': '2024-08-01T13:42:27.873639Z',
  'version': 0}}

In [11]:
# check if table exists
table_name = "tmdb_movies"
assert xata.data().query(table_name).is_success()

### Fill up table

In [None]:
# if successful, fill up database
table_name = "tmdb_movies"

# METHOD 1: insert records one by one
# records = df.to_dict(orient='records')
# for record in records:
    # xata.records().insert(table_name, record)

# METHOD 2: using bulk insert / faster but 1000 records limit at a time
# (slices are [:1000] and [1000:] so that the record at index 999 is not skipped)
xata.records().bulk_insert(table_name, {"records": df.to_dict(orient='records')[:1000]})
xata.records().bulk_insert(table_name, {"records": df.to_dict(orient='records')[1000:]})

The dataset is now on Xata.
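
For a dataset of arbitrary size, the two hard-coded slices could be replaced by a small batching helper. This is only a sketch: the 1000-record limit is taken from the comment in the cell above, `chunked` is a hypothetical helper, and the records are a made-up stand-in for `df.to_dict(orient='records')`.

```python
from typing import Iterator


def chunked(records: list[dict], size: int = 1000) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


# hypothetical stand-in for df.to_dict(orient='records')
records = [{"title": f"movie {i}"} for i in range(2500)]
batches = list(chunked(records))
print([len(batch) for batch in batches])  # [1000, 1000, 500]

# each batch could then be sent with the call used above:
# xata.records().bulk_insert(table_name, {"records": batch})
```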

## Create movie history table
Let's now create a second table to store my movie history:
- Title (str): movie Title
- Status (str): watched, must_see, not_interested
- Comment (str): some personal comment about it
- Content (str): a short text summing up (title, status, comment) that will be used as embedding
- Embedding (vector): an OpenAI embedding vector of the content string
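
In this notebook the content strings are written by hand. As a rough sketch of how such a string could be assembled from the other three fields, assuming a simple `title (status) comment` layout, a hypothetical helper `build_content` might look like this:

```python
def build_content(title: str, status: str, comment: str) -> str:
    """Condense a record into one short line used as the embedding text."""
    return f"{title} ({status}) {comment.strip()}"


record = {
    "title": "The Exorcist",
    "status": "not_interested",
    "comment": "I don't like horror movies",
}
record["content"] = build_content(record["title"], record["status"], record["comment"])
print(record["content"])
```

A hand-written summary can be shorter and more keyword-focused than the raw comment, which is probably why the records below use their own wording.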


In [4]:
table_name = "movie_watchlists"
table_schema = {
    "columns": [
        {"name": "title", "type": "text"},
        {"name": "status", "type": "text"},
        {"name": "comment", "type": "text"},
        {"name": "content", "type": "text"},
        {"name": "embedding", "type": "vector", "vector": {"dimension": 1536}}
    ]
}

# create table
create_table(table_name, table_schema)

# create table vectorstore
table_name = "movie_watchlists"
vector_store = XataVectorStore(
    embedding=embeddings,
    api_key=os.getenv("XATA_API_KEY"),
    db_url=os.getenv("XATA_DATABASE_URL"),
    table_name=table_name
)

vector_store


<langchain_community.vectorstores.xata.XataVectorStore at 0x27c8e63cf80>

### Query table records

In [10]:
# query the content of movie_watchlists
table_name = "movie_watchlists"

def get_table_records(table_name: str) -> list:
    records = xata.data().query(table_name)["records"]
    return records

records = get_table_records(table_name)

if not records:
    print("No records found, table is empty.")
else:    
    for rec in records:
        print("id: ", rec["id"])
        print("Title: ", rec["title"])
        print("Status: ", rec["status"])
        print("Comment: ", rec["comment"])
        print("Content: ", rec["content"])
        print(sep)



id:  rec_cqp07mp98512cinpa7j0
Title:  Jurassic Park
Status:  watched_liked
Comment:  A true masterpiece. This is probably the best movie of all times, I could watch it forever and never get bored
Content:  Jurassik Park (watched) best movie of all times, masterpiece, never get bored
----------------------------------------------------------------------------------------------------
id:  rec_cqp07mpflnj05l8tagc0
Title:  Inception
Status:  watched_liked
Comment:  This is one of my favourite movie, especially how it handles time paradox!
Content:  Inception (watched) one of my favourite movie, liked the time paradox
----------------------------------------------------------------------------------------------------
id:  rec_cqp07mpflnj05l8tagcg
Title:  West Side Story
Status:  watched_disliked
Comment:  The music is good but I hate musicals in general, I didn't like it
Content:  West Side Story (must_see) music is good but I hate musicals in general
---------------------------------------

### Fill up table with some test records

In [6]:
def add_update_movie(table_name: str, movie_record: dict) -> None:

    table_records = get_table_records(table_name)

    # delete record if it exists
    for rec in table_records:
        if rec["title"] == movie_record["title"]:
            xata.records().delete(table_name, rec["id"])
            break

    # add updated record
    doc = Document(page_content=movie_record["content"], metadata={k: v for k, v in movie_record.items() if k != "content"})
    vector_store.add_documents([doc])

In [9]:
rec_1 = {
    "title": "Inception",
    "status": "watched_liked",
    "comment": "This is one of my favourite movie, especially how it handles time paradox!",
    "content": "Inception (watched) one of my favourite movie, liked the time paradox"
}

rec_2 = {
    "title": "Jurassic Park",
    "status": "watched_liked",
    "comment": "A true masterpiece. This is probably the best movie of all times, I could watch it forever and never get bored",
    "content": "Jurassik Park (watched) best movie of all times, masterpiece, never get bored"
}

rec_3 = {
    "title": "West Side Story",
    "status": "watched_disliked",
    "comment": "The music is good but I hate musicals in general, I didn't like it",
    "content": "West Side Story (must_see) music is good but I hate musicals in general"
}

rec_4 = {
    "title": "Once Upon a Time in America",
    "status": "must_see",
    "comment": "I have been willing to watch it for a long time, but it's a long movie and I never found the time to watch it",
    "content": "Once Upon a Time in America (must_see) Very long, never found the time to watch it"
}

rec_5 = {
    "title": "The Exorcist",
    "status": "not_interested",
    "comment": "I don't like horror movies and I heard this one is pretty rough, I prefer not to watch it",
    "content": "The Exorcist (not on watch_list) I dont like horror movies"
}

table_name = "movie_watchlists"

for rec in [rec_1, rec_2, rec_3, rec_4, rec_5]:
    add_update_movie(table_name, rec)


### Query columns with filters

In [21]:
table_name = "movie_watchlists"

records = xata.data().query(
    table_name=table_name,
    payload={
        "columns": ["title"],
        "filter": {
            "status": "watched_liked"
        }
    }
)
titles = [rec["title"] for rec in records["records"]]
print("Titles of movies I liked: ", titles)

Titles of movies I liked:  ['Jurassic Park', 'Inception']


### Add a new record or update an existing one in the table

In [32]:
rec_6 = {
    "title": "eXistenZ",
    "status": "watched",
    "comment": "I didnt like it, I found it very confusing and the story was sometimes disturbing",
    "content": "eXistenZ (watched) Didnt like it, confusing, disturbing"
}

rec_7 = {
    "title": "West Side Story",
    "status": "watched",
    "comment": "I finally watched it last night, it was amazing, thrilling music",
    "content": "West Side Story (watched) amazing, thrilling music"
}

table_name = "movie_watchlists"
for rec in [rec_6, rec_7]:
    add_update_movie(table_name, rec)


### Delete a record from the table

In [33]:
def delete_movie_from_table(table_name: str, title: str) -> None:
    records = get_table_records(table_name)
    for rec in records:
        if rec["title"] == title:
            xata.records().delete(table_name, rec["id"])
            print(f"Record '{title}' deleted successfully")
            return
    print(f"Record '{title}' not found")

table_name = "movie_watchlists"
delete_movie_from_table(table_name, title="Inception")

Record 'Inception' deleted successfully


In [36]:
def get_watch_lists(table_name: str) -> dict:
    """
    Return the lists of watched, must-see and not-interested movies in a dict
    """
    
    records = get_table_records(table_name)

    watched = [rec['title'] for rec in records if rec["status"].startswith("watched")]  # also matches watched_liked / watched_disliked
    must_see = [rec['title'] for rec in records if rec["status"] == "must_see"]
    not_interested = [rec['title'] for rec in records if rec["status"] == "not_interested"]

    return {"watched": watched, "must_see": must_see, "not_interested": not_interested}

table_name = "movie_watchlists"
watch_dict = get_watch_lists(table_name)
for k, v in watch_dict.items():
    print(f"- {k}: {v}")

- watched: ['Jurassic Park', 'eXistenZ', 'West Side Story']
- must_see: ['Once Upon a Time in America']
- not_interested: ['The Exorcist']


### Similarity Search test
Let's see if we can extract information from our history now.

In [42]:
query = "What movies am I most afraid of?"
found_docs = vector_store.similarity_search(query, k=1)
print("Query: ", query)
print("Found: ", found_docs[0].page_content)

print(sep)

query = "What is my favourite movie of all time?"
found_docs = vector_store.similarity_search(query, k=1)
print("Query: ", query)
print("Found: ", found_docs[0].page_content)

Query:  What movies am I most afraid of?
Found:  The Exorcist (not on watch_list) I dont like horror movies
----------------------------------------------------------------------------------------------------
Query:  What is my favourite movie of all time?
Found:  Jurassik Park (watched) best movie of all times, masterpiece, never get bored
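
To give an intuition for what a similarity search does, here is a toy sketch that ranks two stored texts against a query by cosine similarity. The 3-dimensional vectors are made up for illustration (the real embeddings have 1536 dimensions, as set in the table schema), and cosine similarity is an assumption: the notebook does not state which metric Xata uses.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# made-up toy "embeddings" keyed by the stored content string
docs = {
    "The Exorcist (not on watch_list) I dont like horror movies": [0.9, 0.1, 0.0],
    "Jurassik Park (watched) best movie of all times": [0.1, 0.9, 0.2],
}
# made-up vector standing in for the embedded query
query_vector = [0.8, 0.2, 0.1]

# keep the stored text whose vector points in the most similar direction
best = max(docs, key=lambda text: cosine_similarity(query_vector, docs[text]))
print(best)
```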


## Build our langchain structure

### LLM Model

In [46]:
llm_gpt4 = ChatOpenAI(model='gpt-4o-mini', api_key=os.getenv("OPENAI_API_KEY"))

### Similarity Retriever

In [44]:
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 1})
query = "Do I like horror movies?"
results = retriever.invoke(query)
print("Query: ", query)
print("Found: ", results[0].page_content)

Query:  Do I like horror movies?
Found:  The Exorcist (not on watch_list) I dont like horror movies


### String output parser
Create a string output parser for general chatting with the advisor about movies

In [47]:
str_output_parser = StrOutputParser()

### Test string output parser using a simple chain

In [48]:
# create prompt from template
test_prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": "String"},
)

str_chain = test_prompt | llm_gpt4 | str_output_parser
response = str_chain.invoke({"query": "What movie should I watch tonight?"})
print(response)

It depends on your mood and preferences! Here are a few suggestions across different genres:

1. **Action**: *Mad Max: Fury Road* - A high-octane adventure with stunning visuals.
2. **Comedy**: *The Grand Budapest Hotel* - A quirky and visually delightful film by Wes Anderson.
3. **Drama**: *The Shawshank Redemption* - A powerful story of hope and friendship.
4. **Horror**: *Get Out* - A thought-provoking and suspenseful thriller.
5. **Romance**: *La La Land* - A beautiful musical about love and dreams.
6. **Sci-Fi**: *Inception* - A mind-bending journey through dreams and reality.

Let me know if you have a specific genre in mind, and I can provide more tailored suggestions!


### JSON Output Parser
[Documentation](https://python.langchain.com/v0.1/docs/modules/model_io/output_parsers/types/json/)  
Test using a JSON output parser. I won't use it in my final solution, though.  

In [49]:
model = ChatOpenAI(temperature=0)

# define desired data structure
class Recommendation(BaseModel):
    title: str = Field(description="The movie title you recommend")
    comment: str = Field(description="My personal opinion about the movie")
    watched: bool = Field(description="True to add it to the list of watched movies, else False")
    watchlist: bool = Field(description="True to add it to the list of movies I would like to watch, else False")
    blacklist: bool = Field(description="True to add it to the list of movies I don't want to watch, else False")
    embedding: str = Field(description="A short text that describes my answer as follows: {movie_title} ({watched/watchlist/blacklist}) {personal_opinion}. It will be stored as an embedding vector in the database")

json_output_parser = JsonOutputParser(pydantic_object=Recommendation)
print(json_output_parser)

pydantic_object=<class '__main__.Recommendation'>


### Test JSON output parser with a simple chain

In [50]:
# test json parser
test_prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": json_output_parser.get_format_instructions()},
)

json_chain = test_prompt | model | json_output_parser
json_chain.invoke({"query": "What movie should I watch tonight?"})

{'title': 'Inception',
 'comment': "One of the best mind-bending movies I've ever seen",
 'watched': False,
 'watchlist': True,
 'blacklist': False,
 'embedding': "Inception (Watchlist) One of the best mind-bending movies I've ever seen"}
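
The chain above returns a plain dict. If the field types matter downstream, a hand-rolled check could be added after parsing. The sketch below uses only the standard library; `EXPECTED_TYPES` and `validate_recommendation` are hypothetical names, and the sample string simply mirrors the output shown above.

```python
import json

# expected type of each field, following the Recommendation descriptions
EXPECTED_TYPES = {
    "title": str,
    "comment": str,
    "watched": bool,
    "watchlist": bool,
    "blacklist": bool,
    "embedding": str,
}


def validate_recommendation(raw: str) -> dict:
    """Parse a JSON string and check that every expected field has the right type."""
    data = json.loads(raw)
    for key, expected in EXPECTED_TYPES.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected):
            raise ValueError(f"field {key!r} should be {expected.__name__}")
    return data


sample = json.dumps({
    "title": "Inception",
    "comment": "One of the best mind-bending movies I've ever seen",
    "watched": False,
    "watchlist": True,
    "blacklist": False,
    "embedding": "Inception (Watchlist) One of the best mind-bending movies I've ever seen",
})
print(validate_recommendation(sample)["title"])
```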

### Project prompt

In [52]:
template = """You are a movie advisor offering me recommendations based on my preferences and watch history from these 3 lists:
- Watched: Movies I have already seen: {watched}.
- Watchlist: Movies I haven't seen yet but I want to: {must_see}.
- Blacklist: Movies I haven't seen yet but I don't want to: {not_interested}.
Only recommend one movie at a time.
My preferences related to the question to orient your answer: {preferences}.
Question: {question}.
"""

prompt = ChatPromptTemplate.from_template(template)
print(prompt)


input_variables=['must_see', 'not_interested', 'preferences', 'question', 'watched'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['must_see', 'not_interested', 'preferences', 'question', 'watched'], template="You are a movie advisor offering me recommendations based on my preferences and watch history from these 3 lists:\n- Watched: Movies I have already seen: {watched}.\n- Watchlist: Movies I haven't seen yet but I want to: {must_see}.\n- Blacklist: Movies I haven't seen yet but I don't want to: {not_interested}.\nOnly recommend one movie at a time.\nMy preferences related to the question to orient your answer: {preferences}.\nQuestion: {question}.\n"))]


### Project chain
This chain passes the three lists of movies to the prompt and uses the similarity retriever to add context that helps the LLM answer the question more precisely. I won't use this technique in the final project, as agents do this under the hood using tools. See next section.

In [59]:
# define input
table_name = "movie_watchlists"
history = get_watch_lists(table_name)

watched = (lambda x: history["watched"])
must_see = (lambda x: history["must_see"])
not_interested = (lambda x: history["not_interested"])

preferences = (lambda x: x["question"]) | retriever

question = (lambda x: x["question"])

input_var = {"watched": watched, "must_see": must_see, "not_interested": not_interested, "preferences": preferences, "question": question}

# define chain
chain = (input_var | prompt | llm_gpt4 | str_output_parser)

# invoke answer
question = "Could you recommend a good horror movie?"
answer = chain.invoke({"question": question})
print("Question: ", question)
print("Answer: ", answer)

Question:  Could you recommend a good horror movie?
Answer:  Since you've indicated that you don't like horror movies and have blacklisted "The Exorcist," I won't recommend any horror films. Instead, how about exploring a different genre? Based on your watched list, you might enjoy something adventure-related or a classic. Would you like a recommendation from those genres?
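
Conceptually, the dict at the start of the chain runs every entry on the same input and hands the collected results to the prompt. The following plain-Python imitation illustrates that idea; it is not LangChain's implementation, `fan_out` is a hypothetical helper, and the retriever is replaced by a fixed, made-up preference string.

```python
from typing import Any, Callable


def fan_out(steps: dict[str, Callable[[dict], Any]], inputs: dict) -> dict:
    """Call every function with the same input and collect the results by key."""
    return {key: fn(inputs) for key, fn in steps.items()}


history = {"watched": ["Jurassic Park"], "must_see": [], "not_interested": ["The Exorcist"]}

steps = {
    "watched": lambda x: history["watched"],
    "not_interested": lambda x: history["not_interested"],
    # stand-in for the retriever: a fixed, made-up preference string
    "preferences": lambda x: "I dont like horror movies",
    "question": lambda x: x["question"],
}

template = (
    "Watched: {watched}. Blacklist: {not_interested}. "
    "Preferences: {preferences}. Question: {question}"
)
variables = fan_out(steps, {"question": "Could you recommend a horror movie?"})
print(template.format(**variables))
```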


### Build Tools
This is the solution I am using for this project. I create a langchain agent and give it access to custom tools to perform operations (check my watch lists on Xata, add or remove titles and info to them, etc.)

Let's build tools for the agent to use.

In [60]:
def remove_title_from_all_lists(title):
    """Make sure a title is in none of the lists"""
    global watch_dict
    for titles in watch_dict.values():
        if title in titles:
            titles.remove(title)

@tool
def add_title_to_list_of_movies_I_have_already_watched(title: str) -> None:
    """Add a movie title to the list of movies I have already watched in the past"""
    global watch_dict
    if title not in watch_dict["watched"]:
        remove_title_from_all_lists(title)
        watch_dict["watched"].append(title)

@tool
def add_title_to_list_of_movies_I_want_to_watch_later(title: str) -> None:
    """Add a movie title to the list of movies I want to watch later"""
    global watch_dict
    if title not in watch_dict["must_see"]:
        remove_title_from_all_lists(title)
        watch_dict["must_see"].append(title)

@tool
def add_title_to_list_of_movies_I_never_want_to_watch(title: str) -> None:
    """Add a movie title to the list of movies I never want to watch"""
    global watch_dict
    if title not in watch_dict["not_interested"]:
        remove_title_from_all_lists(title)
        watch_dict["not_interested"].append(title)

@tool
def remove_title_from_list_of_movies_I_have_already_watched(title: str) -> None:
    """Remove a movie title from the list of movies I have already watched"""
    global watch_dict
    if title in watch_dict["watched"]:
        watch_dict["watched"].remove(title)

@tool
def remove_title_from_list_of_movies_I_want_to_watch_later(title: str) -> None:
    """Remove a movie title from the list of movies I want to see"""
    global watch_dict
    if title in watch_dict["must_see"]:
        watch_dict["must_see"].remove(title)

@tool
def remove_title_from_list_of_movies_I_never_want_to_watch(title: str) -> None:
    """Remove a movie title from the list of movies I'm not interested in"""
    global watch_dict
    if title in watch_dict["not_interested"]:
        watch_dict["not_interested"].remove(title)


# List of tools
tools = [
    add_title_to_list_of_movies_I_have_already_watched,
    add_title_to_list_of_movies_I_want_to_watch_later,
    add_title_to_list_of_movies_I_never_want_to_watch,
    remove_title_from_list_of_movies_I_have_already_watched,
    remove_title_from_list_of_movies_I_want_to_watch_later,
    remove_title_from_list_of_movies_I_never_want_to_watch
]

# link our model to the tools
gpt4_with_tools = llm_gpt4.bind_tools(tools) 
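
The three "add" tools share the same logic: remove the title from every list, then append it to the target list. As a sketch of that shared step, a hypothetical helper `move_title` could hold it, and each `@tool` function could delegate to it. The example works on a local `demo` dict so it does not touch `watch_dict`.

```python
LISTS = ("watched", "must_see", "not_interested")


def move_title(lists: dict[str, list[str]], title: str, target: str) -> None:
    """Make sure `title` appears in the `target` list and in no other list."""
    if target not in LISTS:
        raise ValueError(f"unknown list: {target}")
    # remove the title wherever it currently is
    for name in LISTS:
        if title in lists[name]:
            lists[name].remove(title)
    lists[target].append(title)


demo = {"watched": [], "must_see": ["Inception"], "not_interested": []}
move_title(demo, "Inception", "watched")
print(demo)  # {'watched': ['Inception'], 'must_see': [], 'not_interested': []}
```

Keeping one descriptive tool per list, as done above, still makes sense: the tool name and docstring are what the model sees when it picks a tool.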


### Create Agent

Let's now create an agent that will use the tools to update our lists.

In [67]:
# import premade prompt
prompt = hub.pull("hwchase17/openai-tools-agent")
print(prompt)

agent = create_tool_calling_agent(llm_gpt4, tools, prompt)

agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=False)

watch_dict = {"watched": [], "must_see": [], "not_interested": []}
print(watch_dict)

result = agent_executor.invoke(
    {
        "input": "Could you please add the movie 'The Matrix' to my watched list?"
    }
)

print(sep)
print("Me: ",result["input"])
print("Agent: ", result["output"])
print("Watch history: ", watch_dict)

input_variables=['agent_scratchpad', 'input'] optional_variables=['chat_history'] input_types={'chat_history': typing.List[typing.Union[langchain_core.messages.ai.AIMessage, langchain_core.messages.human.HumanMessage, langchain_core.messages.chat.ChatMessage, langchain_core.messages.system.SystemMessage, langchain_core.messages.function.FunctionMessage, langchain_core.messages.tool.ToolMessage]], 'agent_scratchpad': typing.List[typing.Union[langchain_core.messages.ai.AIMessage, langchain_core.messages.human.HumanMessage, langchain_core.messages.chat.ChatMessage, langchain_core.messages.system.SystemMessage, langchain_core.messages.function.FunctionMessage, langchain_core.messages.tool.ToolMessage]]} partial_variables={'chat_history': []} metadata={'lc_hub_owner': 'hwchase17', 'lc_hub_repo': 'openai-tools-agent', 'lc_hub_commit_hash': 'c18672812789a3b9697656dd539edf0120285dcae36396d0b548ae42a4ed66f5'} messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], templa

### Test the agent in a while loop
Tell the LLM to add movies to one of your lists.  
Example:  
- I watched The Matrix last night
- I should really watch Inception one day
- I never want to see The Exorcist!

Leave the field blank to exit

In [69]:
watch_dict = {"watched": [], "must_see": [], "not_interested": []}

while True:

    print("Watch history:", watch_dict)
    
    input_text = input("Alex:")
    if not input_text:
        break

    result = agent_executor.invoke({"input": input_text})
    print(sep)
    print("Me: ",result["input"])
    print("Agent: ", result["output"])

Watch history: {'watched': [], 'must_see': [], 'not_interested': []}
----------------------------------------------------------------------------------------------------
Me:  I have seen The Matrix
Agent:  I've added "The Matrix" to your list of movies that you have already watched. If you need anything else, feel free to ask!
Watch history: {'watched': ['The Matrix'], 'must_see': [], 'not_interested': []}
----------------------------------------------------------------------------------------------------
Me:  I never want to see Inception
Agent:  I've added "Inception" to your list of movies you never want to watch.
Watch history: {'watched': ['The Matrix'], 'must_see': [], 'not_interested': ['Inception']}


## Query the TMDB database
The online TMDB database has more than 1 million movie entries, so it is worth querying it directly rather than relying on the dataset used above, which only has 5000 entries.

Let's use a LangChain tool to query the TMDB database for a movie that our LLM might not know about.  

We can see below that gpt-4o-mini has not heard about the 2024 Marvel movie 'Deadpool & Wolverine'. However, you can find information about it in the TMDB database [here](https://www.themoviedb.org/movie/533535-deadpool-wolverine).

In [71]:
# Let's find a movie that gpt-4o-mini does not know about
test_prompt = PromptTemplate(
    template="You are a movie advisor, you answer questions about movies. Keep your answers short, 1 or 2 sentences max.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": "String"},
)

str_chain = test_prompt | llm_gpt4 | str_output_parser
response = str_chain.invoke({"query": "Give me the plot of the movie 'Deadpool & Wolverine'?"})
print(response)

As of my last update, "Deadpool & Wolverine" doesn't exist as a standalone movie; however, Deadpool and Wolverine share a connection in the "X-Men" franchise, with Wolverine often serving as a foil to Deadpool’s irreverent humor. If you're referring to "Deadpool 3," it involves Deadpool navigating the multiverse and teaming up with Wolverine for a chaotic adventure.


Now let's create an agent with access to the tmdb-api tool, so the model can make API calls to the TMDB database when needed.

You will need a TMDB account with an API key for this.  

[TMDB API Documentation](https://developer.themoviedb.org/docs/getting-started)

### Let's do a simple query to the database

In [73]:
import requests

def search_movie(api_key, query):
    # Base URL for the search endpoint
    url = "https://api.themoviedb.org/3/search/movie"
    
    # Parameters for the API request
    params = {
        'api_key': api_key,
        'query': query
    }
    
    # Making the GET request to the API
    response = requests.get(url, params=params)

    # Checking if the request was successful
    if response.status_code == 200:
        # Parsing the JSON response
        data = response.json()
        return data['results']
    else:
        # If the request failed, print the status code
        print(f"Error: {response.status_code}")
        return None

# Example usage
api_key = os.getenv("TMDB_BEARER_TOKEN")
query = "Inception"

results = search_movie(api_key, query)

if results:
    for movie in results:
        print(f"Title: {movie['title']}, Release Date: {movie['release_date']}, Overview: {movie['overview']}")
        print(sep)
else:
    print("No results found.")


Title: Inception, Release Date: 2010-07-15, Overview: Cobb, a skilled thief who commits corporate espionage by infiltrating the subconscious of his targets is offered a chance to regain his old life as payment for a task considered to be impossible: "inception", the implantation of another person's idea into a target's subconscious.
----------------------------------------------------------------------------------------------------
Title: The Crack: Inception, Release Date: 2019-10-04, Overview: Madrid, Spain, 1975; shortly after the end of the Franco dictatorship. Six months after the mysterious death of his lover, a prestigious tailor, a married woman visits the office of the young Germán Areta, a former police officer turned private detective, to request his professional services.
----------------------------------------------------------------------------------------------------
Title: Inception: The Cobol Job, Release Date: 2010-12-07, Overview: This "Inception" prequel unfolds co

### Using Langchain tmdb-api tool
So far I have not been able to get this tool to work. It does not seem to add my API key to the request URL, and the call returns an error.  
I opened a GitHub discussion [here](https://github.com/langchain-ai/langchain/discussions/24612), but the bot could not help me solve it.  
This is not a big problem, though: we can work around it by calling the TMDB API directly from inside a custom LangChain Tool.
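As a minimal sketch of that workaround, the block below wraps a TMDB search in a plain function and, if `langchain-core` is installed, exposes it as a tool. The names `build_search_url`, `format_results`, `search_tmdb` and `tmdb_tool` are hypothetical, the standard-library `urllib` is swapped in for `requests` to keep the example dependency-free, and the key is assumed to live in the `TMDB_BEARER_TOKEN` environment variable as in the cells above.

```python
import json
import os
import urllib.parse
import urllib.request

TMDB_SEARCH_URL = "https://api.themoviedb.org/3/search/movie"


def build_search_url(query: str, api_key: str) -> str:
    """Build the search URL with the key and query as URL-encoded parameters."""
    params = urllib.parse.urlencode({"api_key": api_key, "query": query})
    return f"{TMDB_SEARCH_URL}?{params}"


def format_results(results: list, limit: int = 3) -> str:
    """Turn the first few TMDB results into a short text the LLM can read."""
    lines = []
    for movie in results[:limit]:
        title = movie.get("title", "Unknown title")
        date = movie.get("release_date") or "unknown date"
        overview = movie.get("overview") or "No overview available."
        lines.append(f"{title} ({date}): {overview}")
    return "\n".join(lines) if lines else "No results found."


def search_tmdb(query: str) -> str:
    """Search TMDB for a movie title and return a short summary of the best matches."""
    # Assumption: the environment variable holds a key accepted as 'api_key'
    url = build_search_url(query, os.getenv("TMDB_BEARER_TOKEN", ""))
    with urllib.request.urlopen(url, timeout=10) as response:
        data = json.load(response)
    return format_results(data.get("results", []))


# Wrap the function as a LangChain tool when langchain-core is available
try:
    from langchain_core.tools import StructuredTool

    tmdb_tool = StructuredTool.from_function(
        func=search_tmdb,
        name="search_tmdb",
        description="Search the TMDB database for a movie title.",
    )
except ImportError:
    tmdb_tool = None
```

With this in place, `tools = [tmdb_tool]` could be passed to `create_tool_calling_agent` and `AgentExecutor` instead of the list returned by `load_tools`.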


In [None]:
import os
from langchain.agents import AgentExecutor, create_tool_calling_agent, load_tools
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# load api-keys from environment variables or specify them below
openai_api_key = os.getenv("OPENAI_API_KEY")
tmdb_bearer_token = os.getenv("TMDB_BEARER_TOKEN")

# define the language model
llm = ChatOpenAI(model='gpt-4o', api_key=openai_api_key)

# load the tmdb-api tool
tools = load_tools(["tmdb-api"], llm=llm, tmdb_bearer_token=tmdb_bearer_token)

# define the prompt
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a movie advisor, you answer questions about movies."),
        ("system", "If you don't know an answer, invoke the TMDB-API with a question in natural language."),
        ("human", "{query}"),
        ("placeholder", "{agent_scratchpad}"),
    ]
)

# create the agent
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# test the agent
result = agent_executor.invoke(
    {
        "query": "Give me the plot of movie 'Inception'?",
    }
)

# print the result
print("Agent: ", result["output"])

The error seems to be related to the way the request URL is built. If I build the URL manually, the request works fine.

In [14]:
import json
import os

from langchain_community.utilities.requests import TextRequestsWrapper

# Create a TextRequestsWrapper instance with custom headers
headers = {"Authorization": f"Bearer {os.getenv('TMDB_BEARER_TOKEN')}"}
requests_wrapper = TextRequestsWrapper(headers=headers)

# Make a GET request and inspect the response
response = requests_wrapper.get(f"https://api.themoviedb.org/3/search/movie?query=Inception&language=en-US&api_key={os.getenv('TMDB_BEARER_TOKEN')}")

# convert string to dict
response_dict = json.loads(response)
print(response_dict["results"][0]["overview"])

Cobb, a skilled thief who commits corporate espionage by infiltrating the subconscious of his targets is offered a chance to regain his old life as payment for a task considered to be impossible: "inception", the implantation of another person's idea into a target's subconscious.
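The request above sends the same value both as an `Authorization` header and as an `api_key` query parameter. TMDB's documentation describes these as two separate authentication styles: a v3 API key passed as `api_key`, or an API Read Access Token passed as a Bearer header. The sketch below keeps them apart, which may help find out which kind of credential is actually stored in the environment variable. `build_tmdb_request` is a hypothetical helper, not part of LangChain or TMDB.

```python
from urllib.parse import urlencode

TMDB_SEARCH_URL = "https://api.themoviedb.org/3/search/movie"


def build_tmdb_request(query, api_key=None, bearer_token=None):
    """Return (url, headers) for a movie search using exactly one auth style."""
    if bool(api_key) == bool(bearer_token):
        raise ValueError("Provide exactly one of api_key or bearer_token")

    params = {"query": query, "language": "en-US"}
    headers = {"accept": "application/json"}

    if api_key:
        # v3 API key: sent as a query parameter
        params["api_key"] = api_key
    else:
        # API Read Access Token: sent as a Bearer header
        headers["Authorization"] = f"Bearer {bearer_token}"

    return f"{TMDB_SEARCH_URL}?{urlencode(params)}", headers


# Build both variants without sending anything over the network
key_url, key_headers = build_tmdb_request("Inception", api_key="MY_V3_KEY")
token_url, token_headers = build_tmdb_request("Inception", bearer_token="MY_TOKEN")
print(key_url)
print(token_headers)
```

The returned pair can be passed to `requests.get(url, headers=headers)`. Trying each style in turn shows which one the stored credential supports.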


## Audio Parsers
Additional tests exploring the speech-to-text (STT) and text-to-speech (TTS) steps needed to turn the movie advisor into a voice chatbot.

I tried Google's SpeechToTextLoader from the langchain-google-community package. It works as expected, but unfortunately it is neither open source nor completely free: each project only gets a limited free quota.

In [15]:
!gcloud auth application-default login

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login&state=0SCeMTVUdMHvnHSQ1rUAGdnKTeLeoe&access_type=offline&code_challenge=GsqGfOEgvCiFSchDXWpFnWn-_xHGoLN-IPMrFSBB1Hc&code_challenge_method=S256


Credentials saved to file: [C:\Users\Nath\AppData\Roaming\gcloud\application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Quota project "ai-chitchat" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning the resource.


In [21]:
from langchain_google_community import SpeechToTextLoader

project_id = "ai-chitchat"
file_path = "audio.wav"

loader = SpeechToTextLoader(project_id=project_id, file_path=file_path)

docs = loader.load()
print(docs)


[Document(metadata={'language_code': 'en-US', 'result_end_offset': datetime.timedelta(seconds=3)}, page_content=' What is the height of the eiffel tower?')]
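The loader returns a list of `Document` objects, and the transcript above starts with a space. A minimal sketch of how the text could be extracted and cleaned before being sent to the agent as a query is shown below. `transcript_from_docs` is a hypothetical helper, and a stand-in object replaces the real loader output so the example runs without Google credentials.

```python
from types import SimpleNamespace


def transcript_from_docs(docs) -> str:
    """Join the page_content of each document into one whitespace-trimmed string."""
    parts = (doc.page_content.strip() for doc in docs)
    return " ".join(part for part in parts if part)


# Stand-in for the loader output (only the page_content attribute is needed)
fake_docs = [SimpleNamespace(page_content=" What is the height of the eiffel tower?")]
print(transcript_from_docs(fake_docs))
```

The resulting string could then be passed as the `query` value when invoking the agent executor.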
