**RAQA LangChain Application Querying IMDB Reviews For The Napoleon Movie

***Scope***
* This notebook demonstrates a Retrieval And Question Answering (RAQA) application for the movie "Napoleon".
* Movie reviews scraped from the IMDB movie review website are the document source.

In [1]:
# Install libraries
%pip install --upgrade --quiet  langchain==0.2.0
!pip install -q -U faiss-cpu==1.7.2
!pip install -q -U requests==2.31.0
!pip install -q -U scrapy==2.11.2 selenium==4.21.0
!apt install chromium-chromedriver

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
chromium-chromedriver is already the newest version (1:85.0.4183.83-0ubuntu2.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 44 not upgraded.


In [2]:
# Obtain OpenAI key
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

··········


In [4]:
# Import libraries
from langchain import hub
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.embeddings import CacheBackedEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler
import pandas as pd
import tensorflow as tf
import numpy as np
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

In [5]:
# Remove limit on column display width
pd.options.display.max_colwidth = None

In [6]:
# Utilize GPU if available
# Get the list of available physical devices
physical_devices = tf.config.list_physical_devices('GPU')

if len(physical_devices) > 0:
    # If a GPU is available, use it
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
    device = "/GPU:0"
    print("Using GPU")
else:
    # If no GPU is available, use CPU
    device = "/CPU:0"
    print("Using CPU")

Using GPU


In [7]:
# Select LLM
llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

**Scraping IMDB Reviews of Napoleon**

In [8]:
# Set chrome options for scraping
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)

In [9]:
# Define Napoleon movie review URL
url = "https://www.imdb.com/title/tt13287846/reviews/?ref_=tt_ql_2"
driver.get(url)

In [10]:
# Define selector
sel = Selector(text = driver.page_source)
review_counts = sel.css('.lister .header span::text').extract_first().replace(',','').split(' ')[0]
more_review_pages = int(int(review_counts)/25)

In [11]:
# Import data
for i in tqdm(range(more_review_pages)):
    try:
        css_selector = 'load-more-trigger'
        driver.find_element(By.ID, css_selector).click()
    except:
        pass

100%|██████████| 52/52 [00:06<00:00,  7.83it/s]


In [12]:
# Define DataFrame columns and append
rating_list = []
review_date_list = []
review_title_list = []
author_list = []
review_list = []
review_url_list = []
error_url_list = []
error_msg_list = []
reviews = driver.find_elements(By.CSS_SELECTOR, 'div.review-container')

for d in tqdm(reviews):
    try:
        sel2 = Selector(text = d.get_attribute('innerHTML'))
        try:
            rating = sel2.css('.rating-other-user-rating span::text').extract_first()
        except:
            rating = np.NaN
        try:
            review = sel2.css('.text.show-more__control::text').extract_first()
        except:
            review = np.NaN
        try:
            review_date = sel2.css('.review-date::text').extract_first()
        except:
            review_date = np.NaN
        try:
            author = sel2.css('.display-name-link a::text').extract_first()
        except:
            author = np.NaN
        try:
            review_title = sel2.css('a.title::text').extract_first()
        except:
            review_title = np.NaN
        try:
            review_url = sel2.css('a.title::attr(href)').extract_first()
        except:
            review_url = np.NaN
        rating_list.append(rating)
        review_date_list.append(review_date)
        review_title_list.append(review_title)
        author_list.append(author)
        review_list.append(review)
        review_url_list.append(review_url)
    except Exception as e:
        error_url_list.append(url)
        error_msg_list.append(e)
review_df = pd.DataFrame({
    'Review_Date':review_date_list,
    'Author':author_list,
    'Rating':rating_list,
    'Review_Title':review_title_list,
    'Review':review_list,
    'Review_Url':review_url
    })

100%|██████████| 250/250 [00:02<00:00, 116.85it/s]


In [13]:
# Print DataFrame
review_df

Unnamed: 0,Review_Date,Author,Rating,Review_Title,Review,Review_Url
0,3 December 2023,petra_ste,6,An interesting failure\n,"Ridley Scott directed one of the best movies ever made set during the Napoleonic Wars: unfortunately, that movie is not Napoleon but his cinematic debut, The Duellists, forty years ago.",/review/rw9465377/?ref_=tt_urv
1,23 November 2023,imseeg,6,A few words of warning for those with high expectations...\n,A word of warning for those expecting another Gladiator or non stop action spectacle. It is not. Truly not...,/review/rw9465377/?ref_=tt_urv
2,2 March 2024,Eleatic67,6,Images without Words\n,"The success of any film depends mostly on the script. Why Scott would initiate such an expensive project without ensuring a refined and sophisticated script is a mystery. I'm not convinced there is a single interesting scene that provides insight into the characters or captures through language the prevailing political ideas. Scott's frequent missteps as a director reflect a greater interest in the cinematic rather than in the dramatic. However, this seems inevitable when your priority is delivering a blockbuster that will have broad appeal instead of digging deeper into culture, society, or history. A colossal waste of an extraordinary opportunity to create an important film about a fascinating historical figure.",/review/rw9465377/?ref_=tt_urv
3,22 November 2023,Vic_max,6,Expected an experience ... almost fell asleep\n,"Many of Ridley Scott's movies are like visual masterpieces with epic storylines. I was sort of expecting something like Gladiator. Instead, it was just ""meh"" - I probably would have quit watching if it was on TV.",/review/rw9465377/?ref_=tt_urv
4,27 November 2023,granka-47093,,Stuff just happens...\n,"Ridley Scott's Napoleon is a high-budget cinematic exercise in ""Whatever, man, that'll do.""\nThe film, both in terms of what it presents and how it presents, reeks of hollowness. Characters are shadows(not defined enough to even be considered parodies or mockeries of their real-life counterparts as some people like to see them), story is a shadow of a proper story( at times feeling as if written by A. I), atmosphere, with the exception of some of the battle scenes and the Russian segment, sterile and practically non existent(disasterous for Scott who is known to be one of the greatest world builders in history of the artform). Stuff just happens in the film. No significance or weight to anything or anybody... Sure, it's not all bad. The classic Ridley Scott elements are here - battles are engaging, the costumes and set designs very well-done. Something he can't help but always be good at.",/review/rw9465377/?ref_=tt_urv
...,...,...,...,...,...,...
245,24 November 2023,andreikobli,,Good\n,A word of warning for those expecting another Gladiator or non stop action spectacle. It is not. Truly not...,/review/rw9465377/?ref_=tt_urv
246,24 November 2023,qypvb,6,Disappointing\n,"Let's start with the best parts of the movie. The costumes, hairstyling, makeup, sets and set design were well done for the historical period. The battle scenes were well executed but brief. Casting was good but the acting was devoid of depth and genuine emotion. Joaquin Phoenix interpretation of Napoleon was a caricature and I was disappointed with his lack of development in his role of Napoleon. The same can be said for the role of Josephine. The musical score was deficient throughout most of the movie. I really was looking forward to seeing this movie as I expected Phoenix to be riveting in the role of Napoleon, but he and the movie was disappointing.",/review/rw9465377/?ref_=tt_urv
247,25 November 2023,mabmadridespana,8,A down-to-earth Emperor...\n,"Many are complaining about this movie on the basis of an alleged lack of historical accuracy. Well, people, were you even there?",/review/rw9465377/?ref_=tt_urv
248,8 December 2023,TheVictoriousV,7,There is a great movie in here we've yet to see\n,"It turns out Napoleon wasn't quite the masterwork everyone was anticipating. A great many critics appear to agree that the best thing about this project is Ridley Scott dabbing at all the nerds who complain about its myriad historical inaccuracies. (Taika Waititi did something similar with his Thomas Rongen biopic, arguing that the Holy Bible is full of real events ""plus magic stuff"" yet people still like that story fine, but I'll get called a ""Redditor"" unless I pretend that wasn't funny so boooo.)",/review/rw9465377/?ref_=tt_urv


In [14]:
# Convert DataFrame to CSV
review_df.to_csv("./review.csv")

In [15]:
# Load data
loader = CSVLoader(
    file_path= './review.csv',
    source_column = 'Review_Url'
    )

data = loader.load()

In [16]:
# Print data and show length
print(data)
len(data)



250

In [17]:
# Define text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000, # the character length of the chunk
    chunk_overlap = 100, # the character length of the overlap between chunks
    length_function = len # the length function
)

In [18]:
# Transform data
documents = text_splitter.transform_documents(data)

In [19]:
# Print documents
print(documents)



In [20]:
# Show documents length
len(documents)

343

In [21]:
# Define embedder and vector store
store = LocalFileStore('./cache/')

core_embeddings_model = OpenAIEmbeddings()

embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings_model, store, namespace = core_embeddings_model.model
)

vector_store = FAISS.from_documents(documents, embedder)

In [22]:
# Implement a query
query = "Which actor is the star of this movie?"
embedding_vector = core_embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

for page in docs:
  print(page.page_content)

: 35
Review_Date: 9 December 2023
Author: adambarta
Rating: 6
Review_Title: Hollywood overtaking over quality
Review: Once I found out that there will be a movie about Napoleon starring Joaquin Phoenix, I got excited. What a great idea? It's one of the most interesting people in history, of course this would make a great movie.
Review_Url: /review/rw9465377/?ref_=tt_urv
Author: jmillerjr-00983
Rating: 7
Review_Title: See it in the Theater
Review: My son and I saw it this morning and as we walked out of the theater, I asked him what he thought. "Better than Oppenheimer," he said. He was absolutely right. Despite the middling reviews, I'd say this film was a solid effort. If anything, it did feel a bit too self-aware and because of that it often poked fun at itself (and its main character). The accents were a bit distracting, but I'm not sure what could have been done there. Phoenix's Bonaparte often overplays his American accent and it seems in stark contrast to Kirby's English. By the 

In [23]:
# Comparison for cached embedding
%%timeit -n 1 -r 1
query = "Which actor is the star of this movie?"
embedding_vector = core_embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

168 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [24]:
# Comparison for cached embedding
%timeit
query = "Which actor is the star of this movie?"
embedding_vector = core_embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)



In [25]:
# Define retriever
retriever = vector_store.as_retriever()

In [26]:
# Define retrieval chain
handler = StdOutCallbackHandler()

qa_with_sources_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    callbacks=[handler],
    return_source_documents=True
)

In [27]:
# Demonstrate question and answering
qa_with_sources_chain.invoke({"query" : "How was Joaquin Phoenix in this movie?"})



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


{'query': 'How was Joaquin Phoenix in this movie?',
 'result': "Joaquin Phoenix's performance in the movie received mixed reviews. One reviewer mentioned that it was his best role so far, while another found his performance unconvincing. Another review highlighted compelling acting from Joaquin Phoenix. Overall, opinions on his performance varied among viewers.",
 'source_documents': [Document(page_content=': 104\nReview_Date: 1 December 2023\nAuthor: mm-39\nRating: 8\nReview_Title: Joaquin Phoenix best role so far!', metadata={'source': '/review/rw9465377/?ref_=tt_urv', 'row': 104}),
  Document(page_content=': 207\nReview_Date: 1 December 2023\nAuthor: ghassanomar_\nRating: 6\nReview_Title: Dramatically weak\nReview: He lacked the correct dramatic formulation and got lost in the crowd of successive events, in addition to the unconvincing performance of the hero Joaquin Phoenix.\nReview_Url: /review/rw9465377/?ref_=tt_urv', metadata={'source': '/review/rw9465377/?ref_=tt_urv', 'row': 2

In [28]:
# Demonstrate question answering
qa_with_sources_chain.invoke({"query" : "Was it worthwhile to watch this movie?"})



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


{'query': 'Was it worthwhile to watch this movie?',
 'result': "Based on the reviews provided, opinions on whether the movie is worthwhile vary. One reviewer rated it a 3, calling it a waste of money due to a bad script. Another reviewer rated it a 7, mentioning it was a solid effort with good action sequences and standout performances. However, the same reviewer mentioned some drawbacks like fast-paced events and unclear character emotions due to heavy editing. Ultimately, whether it's worthwhile to watch would depend on your personal preferences regarding historical dramas and how much you value script quality and character development.",
 'source_documents': [Document(page_content=': 127\nReview_Date: 27 November 2023\nAuthor: richard-1787\nRating: 3\nReview_Title: A waste of money\nReview: This movie must have cost a fortune to make. (There is no budget listed yet here on IMDB.) You would think they could have spent a little of it on a good script. So many of the problems here stem

**Conclusion
* The RAQA application has been completed and will be implemented as a HuggingFace space.