**RAQA LangChain Application Querying IMDB Reviews For The Napoleon Movie

***Scope***
* This notebook demonstrates a Retrieval And Question Answering (RAQA) application for the movie "Napoleon".
* Movie reviews scraped from the IMDB movie review website are the document source.

In [1]:
# Install libraries
%pip install --upgrade --quiet  langchain==0.2.0 langchain-community==0.2.0 langchainhub==0.1.15 langchain-openai==0.1.7
!pip install -q -U faiss-gpu==1.7.2 tiktoken==0.7.0
!pip install -q -U requests==2.31.0
!pip install -q -U scrapy==2.11.2 selenium==4.21.0
!apt install chromium-chromedriver

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m973.7/973.7 kB[0m [31m10.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.1/2.1 MB[0m [31m24.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m307.9/307.9 kB[0m [31m26.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m121.2/121.2 kB[0m [31m18.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m320.6/320.6 kB[0m [31m26.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m36.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.3/49.3 kB[0m [31m7.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m53.0/53.0 kB[0m [31m7.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━

In [2]:
# Obtain OpenAI key
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

··········


In [3]:
# Activate LangChain tracing and obtain key
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()

··········


In [4]:
# Import libraries
from langchain import hub
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.embeddings import CacheBackedEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler
import pandas as pd
import tensorflow as tf
import numpy as np
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

In [5]:
# Remove limit on column display width
pd.options.display.max_colwidth = None

In [6]:
# Utilize GPU if available
# Get the list of available physical devices
physical_devices = tf.config.list_physical_devices('GPU')

if len(physical_devices) > 0:
    # If a GPU is available, use it
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
    device = "/GPU:0"
    print("Using GPU")
else:
    # If no GPU is available, use CPU
    device = "/CPU:0"
    print("Using CPU")

Using GPU


In [7]:
# Select LLM
llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

**Scraping IMDB Reviews of Napoleon**

In [8]:
# Set chrome options for scraping
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)

In [9]:
# Define Napoleon movie review URL
url = "https://www.imdb.com/title/tt13287846/reviews/?ref_=tt_ql_2"
driver.get(url)

In [10]:
# Define selector
sel = Selector(text = driver.page_source)
review_counts = sel.css('.lister .header span::text').extract_first().replace(',','').split(' ')[0]
more_review_pages = int(int(review_counts)/25)

In [11]:
# Import data
for i in tqdm(range(more_review_pages)):
    try:
        css_selector = 'load-more-trigger'
        driver.find_element(By.ID, css_selector).click()
    except:
        pass

100%|██████████| 52/52 [00:29<00:00,  1.74it/s]


In [12]:
# Define DataFrame columns and append
rating_list = []
review_date_list = []
review_title_list = []
author_list = []
review_list = []
review_url_list = []
error_url_list = []
error_msg_list = []
reviews = driver.find_elements(By.CSS_SELECTOR, 'div.review-container')

for d in tqdm(reviews):
    try:
        sel2 = Selector(text = d.get_attribute('innerHTML'))
        try:
            rating = sel2.css('.rating-other-user-rating span::text').extract_first()
        except:
            rating = np.NaN
        try:
            review = sel2.css('.text.show-more__control::text').extract_first()
        except:
            review = np.NaN
        try:
            review_date = sel2.css('.review-date::text').extract_first()
        except:
            review_date = np.NaN
        try:
            author = sel2.css('.display-name-link a::text').extract_first()
        except:
            author = np.NaN
        try:
            review_title = sel2.css('a.title::text').extract_first()
        except:
            review_title = np.NaN
        try:
            review_url = sel2.css('a.title::attr(href)').extract_first()
        except:
            review_url = np.NaN
        rating_list.append(rating)
        review_date_list.append(review_date)
        review_title_list.append(review_title)
        author_list.append(author)
        review_list.append(review)
        review_url_list.append(review_url)
    except Exception as e:
        error_url_list.append(url)
        error_msg_list.append(e)
review_df = pd.DataFrame({
    'Review_Date':review_date_list,
    'Author':author_list,
    'Rating':rating_list,
    'Review_Title':review_title_list,
    'Review':review_list,
    'Review_Url':review_url
    })

100%|██████████| 50/50 [00:00<00:00, 153.28it/s]


In [13]:
# Print DataFrame
review_df

Unnamed: 0,Review_Date,Author,Rating,Review_Title,Review,Review_Url
0,3 December 2023,petra_ste,6.0,An interesting failure\n,"Ridley Scott directed one of the best movies ever made set during the Napoleonic Wars: unfortunately, that movie is not Napoleon but his cinematic debut, The Duellists, forty years ago.",/review/rw9450588/?ref_=tt_urv
1,23 November 2023,imseeg,6.0,A few words of warning for those with high expectations...\n,A word of warning for those expecting another Gladiator or non stop action spectacle. It is not. Truly not...,/review/rw9450588/?ref_=tt_urv
2,2 March 2024,Eleatic67,6.0,Images without Words\n,"The success of any film depends mostly on the script. Why Scott would initiate such an expensive project without ensuring a refined and sophisticated script is a mystery. I'm not convinced there is a single interesting scene that provides insight into the characters or captures through language the prevailing political ideas. Scott's frequent missteps as a director reflect a greater interest in the cinematic rather than in the dramatic. However, this seems inevitable when your priority is delivering a blockbuster that will have broad appeal instead of digging deeper into culture, society, or history. A colossal waste of an extraordinary opportunity to create an important film about a fascinating historical figure.",/review/rw9450588/?ref_=tt_urv
3,22 November 2023,Vic_max,6.0,Expected an experience ... almost fell asleep\n,"Many of Ridley Scott's movies are like visual masterpieces with epic storylines. I was sort of expecting something like Gladiator. Instead, it was just ""meh"" - I probably would have quit watching if it was on TV.",/review/rw9450588/?ref_=tt_urv
4,27 November 2023,granka-47093,,Stuff just happens...\n,"Ridley Scott's Napoleon is a high-budget cinematic exercise in ""Whatever, man, that'll do.""\nThe film, both in terms of what it presents and how it presents, reeks of hollowness. Characters are shadows(not defined enough to even be considered parodies or mockeries of their real-life counterparts as some people like to see them), story is a shadow of a proper story( at times feeling as if written by A. I), atmosphere, with the exception of some of the battle scenes and the Russian segment, sterile and practically non existent(disasterous for Scott who is known to be one of the greatest world builders in history of the artform). Stuff just happens in the film. No significance or weight to anything or anybody... Sure, it's not all bad. The classic Ridley Scott elements are here - battles are engaging, the costumes and set designs very well-done. Something he can't help but always be good at.",/review/rw9450588/?ref_=tt_urv
5,22 November 2023,zeki-4,7.0,Bring on the director's cut!\n,"Back in 2005 Ridley Scott's 144 minute version of 'Kingdom of Heaven' premiered in theatres to somewhat mixed reviews. A couple of years later the vastly superior 190 minute director's cut version finally arrived, with the general consensus that the final product was a masterclass in storytelling, directing, acting and cinematography. - without doubt the best motion picture ever made about the crusades.",/review/rw9450588/?ref_=tt_urv
6,3 January 2024,stefan-huybrechts,6.0,Excellent trailer - not so the movie\n,"I will not get in to the historical inaccuracies, as in a lot of historical movies history is adapted for dramatic purposes. It is Hollywood after all and especially for big budget movies the goal is to make a lot of money. Beautiful Trailer.",/review/rw9450588/?ref_=tt_urv
7,22 November 2023,dorMancyx,6.0,Tsk Tsk Tsk\n,"I feel unsatisfied walking out of that theater after three hours of melancholy and confusion. I understand every single word and every single scene, but when they connect into a whole film I don't understand anything. To start off, the costume/production design, naturalistic sceneries, the two meticulously-depicted ancient warfares --- one amid the doleful squall of Austerlitz and another atop the dampened prairie of Waterloo --- and all other technical stuff are spotless. However, there's an anxiety-inducing problem with the narrative --- the movie has no focus. No climax, no resonating themes like Oppenheimer, just one plain, linear, chronological plot; and even that we get multiple baffling time jumps throughout. The plot is so simple that everything in the film was taught in my AP World History class last year in one day, except it nibbles on some superficial details. Highlighting the relationship between Napolean and Josephine is a unique take, and I do see efforts from the two esteemed performers to capture their mutual toxicity and intricacy, but this love story has barely anything to do with the movie's main arc, Napolean's personal rise and fall. Nothing! Two basically unrelated storylines unfolding in the most bland way possible. Also adding to this insipid mess is the score, which is composed primarily of classical or really old-sounding French folk --- what happened to Radiohead from the trailer? God Ridley I don't wanna say three hours of my life is wasted but it kinda is!",/review/rw9450588/?ref_=tt_urv
8,8 March 2024,UrsusProblemus,8.0,Another reminder to not care about people's opinions ever again\n,"Yes, the film is somewhat disjointed. It would have been much better off being a mini-series. However, apart from that, I can't really fault the movie, and it's an absolute mystery to me why everyone seems to hate it so much. Granted, I can't really judge its historical accuracy, I'm merely speaking about its merits as a drama. Well, it's beautifully shot and it never gets dull. And it's still way above average. I would definitely rank it higher than Scott's ""Last Duel"", too.",/review/rw9450588/?ref_=tt_urv
9,22 November 2023,cagebox111,7.0,"Napoleon the butcher, Napoleon the megalomaniac, Napoleon the lover\n",Napoleon was the most significant man of his age and no film can explain his significance in 2.5 hours. Scott decides to focus on three specific aspects of Napoleon's life and personality to show who he thinks Napoleon was at the expense of omitting much of what makes him such a fascinating figure in history.,/review/rw9450588/?ref_=tt_urv


In [14]:
# Convert DataFrame to CSV
review_df.to_csv("./review.csv")

In [15]:
# Load data
loader = CSVLoader(
    file_path= './review.csv',
    source_column = 'Review_Url'
    )

data = loader.load()

In [16]:
# Print data and show length
print(data)
len(data)



50

In [17]:
# Define text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000, # the character length of the chunk
    chunk_overlap = 100, # the character length of the overlap between chunks
    length_function = len # the length function
)

In [18]:
# Transform data
documents = text_splitter.transform_documents(data)

In [19]:
# Print documents
print(documents)



In [20]:
# Show documents length
len(documents)

67

In [21]:
# Define embedder and vector store
store = LocalFileStore('./cache/')

core_embeddings_model = OpenAIEmbeddings()

embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings_model, store, namespace = core_embeddings_model.model
)

vector_store = FAISS.from_documents(documents, embedder)

In [22]:
# Implement a query
query = "Which actor is the star of this movie?"
embedding_vector = core_embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

for page in docs:
  print(page.page_content)

: 37
Review_Date: 9 December 2023
Author: adambarta
Rating: 6
Review_Title: Hollywood overtaking over quality
Review: Once I found out that there will be a movie about Napoleon starring Joaquin Phoenix, I got excited. What a great idea? It's one of the most interesting people in history, of course this would make a great movie.
Review_Url: /review/rw9450588/?ref_=tt_urv
: 33
Review_Date: 22 November 2023
Author: WadoodS
Rating: 5
Review_Title: Napoleon Deserved Better
Review: Making a movie on a charismatic person like Napoleon is a huge undertaking. It requires in-depth study of his life and then choosing specific events of his life to show and connecting them to perfection. The film maker is short in time therefore it is important for the movie to be precise, fluent and with a purpose. In this movie, the director, Ridley Scott, fails to determine which points of his life he wishes to portray and consequently it results in a piece-by-piece movie which lacks cinematic flow and confuses

In [23]:
# Comparison for cached embedding
%%timeit -n 1 -r 1
query = "Which actor is the star of this movie?"
embedding_vector = core_embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

175 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [24]:
# Comparison for cached embedding
%timeit
query = "Which actor is the star of this movie?"
embedding_vector = core_embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)



In [25]:
# Define retriever
retriever = vector_store.as_retriever()

In [26]:
# Define retrieval chain
handler = StdOutCallbackHandler()

qa_with_sources_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    callbacks=[handler],
    return_source_documents=True
)

In [27]:
# Demonstrate question and answering
qa_with_sources_chain.invoke({"query" : "How was Joaquin Phoenix in this movie?"})



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


{'query': 'How was Joaquin Phoenix in this movie?',
 'result': "Based on the reviews provided, Joaquin Phoenix's performance in the movie about Napoleon seems to have received mixed feedback. Some reviewers mentioned that Phoenix seemed uncomfortable and clueless in the role, while others mentioned that his interpretation felt similar to a performance you would see in a stage theater. Overall, it appears that there are differing opinions on his portrayal in this particular film.",
 'source_documents': [Document(page_content="The feeling that i got was that we saw a fast forward version of his life but without any soul or essence. Even the battle scenes seemed dull and without soul. I think Ridley set the bar too high through his previous movies for this part. Regarding Joaquin's performance, there is not much to say. I think his interpretation was similar to a performance you would see in a stage theater. I don't know why but at some points in the movie, it felt like a low budget movie

In [28]:
# Demonstrate question answering
qa_with_sources_chain.invoke({"query" : "Was it worthwhile to watch this movie?"})



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


{'query': 'Was it worthwhile to watch this movie?',
 'result': 'Based on the provided reviews, it seems that opinions about the movie "Napoleon" are mixed. Some viewers found it unsatisfying due to issues with the narrative focus and pacing, while others enjoyed it more than expected. The film is noted for its technical aspects like costume design and production, but it seems to lack depth in storytelling for some viewers. If you are interested in historical dramas or Napoleon Bonaparte, you might still find it worthwhile to watch to form your own opinion.',
 'source_documents': [Document(page_content="The feeling that i got was that we saw a fast forward version of his life but without any soul or essence. Even the battle scenes seemed dull and without soul. I think Ridley set the bar too high through his previous movies for this part. Regarding Joaquin's performance, there is not much to say. I think his interpretation was similar to a performance you would see in a stage theater. I 

**Conclusion
* The RAQA application has been completed and will be implemented as a HuggingFace space.