## Retrieval and Question Answering Exercise

In this exercise, you will use a vector database to retrieve relevant context for answering questions about Best Picture winners since 2000. Each question can be answered from the Wikipedia page of the corresponding movie.

You have been provided a list of movies and links to their Wikipedia pages in the file best_picture_2000.csv.

Build a vector database from these Wikipedia pages that, given a query, finds potentially relevant context for answering the question.

Then use a question-answering model from HuggingFace to answer the question.

A list of question and answer pairs is given in QAs.csv, but feel free to add to it yourself.

In [None]:
#Strategy for pt1:
#Read in csv as a df
#Grab the wiki articles, get the main text and put them in a new column
#Clean and process them using something like gensim

In [1]:
#Strategy for pt2:
#Build a vector db using chroma_db...
#Then, use HuggingFace to answer the questions...
#Look at the chroma_db pages.  Do you webscrape the info, or use chroma_db?
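
On the last question above: a vector store only indexes and searches the text it is handed, so the pages still have to be scraped first. What happens next can be sketched without any database at all. The snippet below is a minimal illustration, not chroma_db's API: it splits an article into overlapping word chunks and ranks them against a query with cosine similarity over plain word counts, which stands in for a real embedding model. The names `chunk_words` and `rank_chunks` and the default chunk sizes are assumptions made for this example.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=200, overlap=50):
    """Split text into chunks of `size` words; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def bag_of_words(text):
    """Lowercased word counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_chunks(query, chunks, top_k=3):
    """Return the top_k (score, chunk) pairs, most similar first."""
    query_vec = bag_of_words(query)
    scored = [(cosine(query_vec, bag_of_words(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

In the real pipeline the chunks would be added to the vector database and the query step would be delegated to it; the top-ranked chunk then becomes the context passed to the question-answering model.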

In [2]:
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

In [3]:
#Quick test, as I'm a bit rusty...
URL = 'https://en.wikipedia.org/wiki/Gladiator_(2000_film)'

response = requests.get(URL) 

In [4]:
response.text

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-appearance-disabled vector-feature-appearance-pinned-clientpref-0 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Gladiator (2000 film) - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled ve

In [5]:
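#name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.text, 'html.parser')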
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-appearance-disabled vector-feature-appearance-pinned-clientpref-0 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Gladiator (2000 film) - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinne

In [6]:
wiki_articles=pd.read_csv('../data/best_picture_2000.csv')

In [7]:
wiki_articles.head(8)

Unnamed: 0,title,link,year
0,Gladiator,https://en.wikipedia.org/wiki/Gladiator_(2000_...,2000
1,A Beautiful Mind,https://en.wikipedia.org/wiki/A_Beautiful_Mind...,2001
2,Chicago,https://en.wikipedia.org/wiki/Chicago_(2002_film),2002
3,The Lord of the Rings: The Return of the King,https://en.wikipedia.org/wiki/The_Lord_of_the_...,2003
4,Million Dollar Baby,https://en.wikipedia.org/wiki/Million_Dollar_Baby,2004
5,Crash,https://en.wikipedia.org/wiki/Crash_(2004_film),2005
6,The Departed,https://en.wikipedia.org/wiki/The_Departed,2006
7,No Country for Old Men,https://en.wikipedia.org/wiki/No_Country_for_O...,2007


In [8]:
#Create a place to put things
wiki_text=[]

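#loop through the links and add the articles.
for link in wiki_articles['link']:
    response = requests.get(link, timeout=30)
    response.raise_for_status()  #stop early if a download fails
    soup = BeautifulSoup(response.text, 'html.parser')
    #the article body lives in the 'mw-body-content' div
    next_text = soup.find('div', class_ = 'mw-body-content').get_text()
    wiki_text.append(next_text)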

#add it all to the df
wiki_articles['text']=wiki_text
    

In [9]:
#Did it work?
wiki_articles['text'].loc[7]

'2007 film by Ethan and Joel Coen\nFor the novel, see No Country for Old Men (novel).\n\n\nNo Country for Old MenTheatrical release posterDirected byJoel CoenEthan CoenScreenplay by\nJoel Coen\nEthan Coen\nBased onNo Country for Old Menby Cormac McCarthyProduced by\nScott Rudin\nEthan Coen\nJoel Coen\nStarring\nTommy Lee Jones\nJavier Bardem\nJosh Brolin\nCinematographyRoger DeakinsEdited byRoderick Jaynes[a]Music byCarter BurwellProductioncompanies\nParamount Vantage\nScott Rudin Productions\nMike Zoss Productions\nDistributed by\nMiramax Films (United States)\nParamount Pictures (International)\nRelease dates\nMay\xa019,\xa02007\xa0(2007-05-19) (Cannes)\nNovember\xa09,\xa02007\xa0(2007-11-09) (United States)\nRunning time122 minutesCountryUnited StatesLanguage\nEnglish\nBudget$25 millionBox office$171.6 million[1]\nNo Country for Old Men is a 2007 American neo-Western crime thriller film written, directed, produced and edited by Joel and Ethan Coen, based on Cormac McCarthy\'s 2005 n

In [10]:
#I have a suspicion that the 'cleaner' the text is, the better the answers will be...
from gensim.parsing import preprocessing

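def clean_the_wiki(html_text: str) -> str:
    #replace punctuation and other non-alphanumeric characters with spaces
    preprocessed_text = preprocessing.strip_punctuation(s=html_text)
    preprocessed_text = preprocessing.strip_non_alphanum(s=preprocessed_text)
    #collapse whitespace last, so the spaces introduced above are merged too
    preprocessed_text = preprocessing.strip_multiple_whitespaces(s=preprocessed_text)
    
    return preprocessed_text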

wiki_articles["processed_text"] = wiki_articles["text"].apply(clean_the_wiki)
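
One caveat with cleaning this aggressively: stripping every non-alphanumeric character also changes the facts a question might ask about. For example, `$171.6 million` in the raw text becomes `171 6 million`. Extractive question-answering models return spans copied from the context, so a gentler pass may be worth comparing. The sketch below is an assumption, not a tested improvement: `light_clean` is a hypothetical helper that only removes the citation markers (`[1]`, `[a]`) and non-breaking spaces visible in the raw output, and tidies whitespace.

```python
import re


def light_clean(text):
    """Remove citation markers and tidy whitespace, keeping punctuation and case."""
    text = text.replace("\xa0", " ")               # non-breaking spaces, e.g. in release dates
    text = re.sub(r"\[(?:\d+|[a-z])\]", "", text)  # citation markers such as [1] or [a]
    text = re.sub(r"[ \t]+", " ", text)            # collapse runs of spaces and tabs
    text = re.sub(r"\s*\n\s*", "\n", text)         # collapse blank lines into single newlines
    return text.strip()
```

It could be applied the same way as `clean_the_wiki`, for instance into a separate column, so that both versions can be tried against the questions in QAs.csv.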

In [11]:
#compare this with 'processed_text'
wiki_articles['text'].loc[3]

'2003 film by Peter Jackson\nThis article is about the 2003 film. For the book by Tolkien, see The Return of the King. For other uses, see The Return of the King (disambiguation).\n\n\nThe Lord of the Rings:The Return of the KingTheatrical release posterDirected byPeter JacksonScreenplay by\nFran Walsh\nPhilippa Boyens\nPeter Jackson\nBased onThe Return of the Kingby J. R. R. TolkienProduced by\nBarrie M. Osborne\nFran Walsh\nPeter Jackson\nStarring\nElijah Wood\nIan McKellen\nLiv Tyler\nViggo Mortensen\nSean Astin\nCate Blanchett\nJohn Rhys-Davies\nBernard Hill\nBilly Boyd\nDominic Monaghan\nOrlando Bloom\nHugo Weaving\nMiranda Otto\nDavid Wenham\nKarl Urban\nJohn Noble\nAndy Serkis\nIan Holm\nSean Bean\nCinematographyAndrew LesnieEdited byJamie SelkirkMusic byHoward ShoreProductioncompanies\nNew Line Cinema[1]\nWingNut Films[1]\nDistributed byNew Line Cinema[1]Release dates\n1\xa0December\xa02003\xa0(2003-12-01) (Embassy Theatre)\n17\xa0December\xa02003\xa0(2003-12-17) (United States

In [23]:
#There seems to be a lot of junk at the bottom (lists of other award winners).  How to remove it?
wiki_articles['processed_text'].loc[3][-20000:]

'en Ken Ralston and Kit West Return of the Jedi 1983 Dennis Muren George Gibbs Michael J McAlister and Lorne Peterson Indiana Jones and the Temple of Doom 1984 George Gibbs and Richard Conway Brazil 1985 Robert Skotak Brian Johnson Suzanne M Benson John Richardson and Stan Winston Aliens 1986 Michael Lantieri Michael Owens Edward Jones and Bruce Walters The Witches of Eastwick 1987 George Gibbs Richard Williams Ken Ralston and Edward Jones Who Framed Roger Rabbit 1988 Ken Ralston Michael Lantieri John Bell and Steve Gawley Back to the Future Part II 1989 The production team of Honey I Shrunk the Kids 1990 Stan Winston Dennis Muren Gene Warren Jr and Robert Skotak Terminator 2 Judgment Day 1991 Michael Lantieri Ken Ralston Alec Gillis Tom Woodruff Jr Doug Chiang and Douglas Smythe Death Becomes Her 1992 Dennis Muren Stan Winston Phil Tippett and Michael Lantieri Jurassic Park 1993 Ken Ralston George Murphy Stephen Rosenbaum Doug Chiang and Allen Hall Forrest Gump 1994 Robert Legato Mich
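
One possible answer, sketched under assumptions: the award lists come after the article's closing sections, so the raw text can be cut at the first of those headings. This has to run on the `text` column before `clean_the_wiki`, because the cleaned text no longer has line breaks. The helper name `trim_trailing_sections` and the list of headings are hypothetical, and the pattern assumes each heading sits on its own line, optionally followed by `[edit]`; it is worth checking a few pages by eye, since an article could use one of these words as an early heading.

```python
import re

# Assumed names of the sections that close a typical film article.
TRAILING_HEADINGS = ("See also", "Notes", "References", "Further reading", "External links")


def trim_trailing_sections(raw_text, headings=TRAILING_HEADINGS):
    """Cut raw_text at the first line that consists only of one of the given headings."""
    alternatives = "|".join(re.escape(heading) for heading in headings)
    pattern = re.compile(
        r"^[ \t]*(?:" + alternatives + r")[ \t]*(?:\[edit\])?[ \t]*$",
        flags=re.MULTILINE,
    )
    match = pattern.search(raw_text)
    if match is None:
        return raw_text  # no closing heading found, leave the text unchanged
    return raw_text[:match.start()].rstrip()
```

For example, `wiki_articles['text'].apply(trim_trailing_sections)` would produce the trimmed articles, which can then be cleaned as before.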