# Testing notebook for data ingestion
This notebook contains test scripts that ingest the required data from several web sources.

## Scrape Bechdel test ratings for movies
Scrape data from http://bechdeltest.com/ using its API. According to the site owner, the `getAllMovies` endpoint should not be called frequently because the site runs on a shared hosting plan. For this reason, I ran the GET request once and saved the response as a CSV file.
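One way to make the "fetch once" rule harder to break is to guard the download behind a cache check. The following is a minimal sketch, not the code used in this notebook: `fetch_once` is a hypothetical helper, and the cache path in the usage note is an assumed location.

```python
from pathlib import Path
import urllib.request


def fetch_once(url, cache_path, fetch=None):
    """Return the bytes cached at `cache_path`; download from `url` only if the file is missing."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: no request is sent to the remote host.
        return cache_path.read_bytes()
    if fetch is None:
        # Default downloader; any callable that takes a URL and returns bytes works.
        def fetch(target_url):
            with urllib.request.urlopen(target_url) as response:
                return response.read()
    payload = fetch(url)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_bytes(payload)
    return payload
```

With this helper, `fetch_once('http://bechdeltest.com/api/v1/getAllMovies', 'data/getAllMovies.json')` would contact the API only on the first run. The returned bytes could then be passed to `pd.read_json(io.BytesIO(raw))`.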

In [25]:
import io
import requests
import pandas as pd
from bs4 import BeautifulSoup

bechdel_movies = '/home/jdtganding/Documents/data-engineering-zoomcamp/week_7_PROJECT/data/BechdelTestMovies.csv'

# One-time download, left commented out to avoid calling the API again:
# html = requests.get('http://bechdeltest.com/api/v1/getAllMovies').content
# df = pd.read_json(io.StringIO(html.decode('utf-8')))
# df.to_csv(bechdel_movies, index=False)

bechdel_movies_df = pd.read_csv(bechdel_movies)
bechdel_movies_df.sample(10)

Unnamed: 0,rating,imdbid,id,title,year
4209,3,348226.0,6867,Tomie: Saishuu-sho - kindan no kajitsu,2002
9371,3,11334312.0,9507,"Final Level: Escaping Rancala, The",2019
7836,3,1724965.0,6668,Innocence,2014
2975,3,109913.0,1788,Go Fish,1994
2896,0,107214.0,4337,Indien,1993
2144,1,88000.0,5751,Revenge of the Nerds,1984
5359,1,871197.0,7121,&quot;Masters of Horror&quot; Right to Die,2007
7093,1,1597522.0,6993,Asterix and Obelix: God Save Britannia,2012
5081,3,785532.0,7114,&quot;Masters of Horror&quot; Pro-Life,2006
8226,3,4009278.0,7881,Intruders,2015
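Some titles in the sample above still contain HTML entities such as `&quot;`. A minimal sketch of one way to decode them with the standard library follows. `clean_title` is a hypothetical helper name, and passing non-string values through unchanged is an assumption about how missing titles should be handled.

```python
import html


def clean_title(title):
    """Decode HTML entities (for example &quot;) in a title; leave non-string values untouched."""
    if isinstance(title, str):
        return html.unescape(title)
    return title


print(clean_title('&quot;Masters of Horror&quot; Right to Die'))
# prints: "Masters of Horror" Right to Die
```

Applied to the DataFrame, this could look like `bechdel_movies_df['title'] = bechdel_movies_df['title'].map(clean_title)`.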


## Scrape Oscars movie nominees and winners
The Academy Awards have their own database at https://awardsdatabase.oscars.org/. Using `selenium`, I queried the whole database from the 1st Academy Awards up to the latest ceremony and stored the page source in a variable that `BeautifulSoup` can parse.

In [4]:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

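options = webdriver.FirefoxOptions()
options.accept_insecure_certs = True  # Firefox counterpart of Chrome's --ignore-certificate-errors
options.add_argument('-private')      # Firefox private browsing flag (--incognito is a Chrome flag)
options.add_argument('--headless')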

driver = webdriver.Firefox(options=options)
driver.get("https://awardsdatabase.oscars.org/") 

#select award categories
driver.find_element(By.XPATH,"//button[contains(@class,'awards-basicsrch-awardcategory')]").click()
driver.find_element(By.XPATH,"//b[contains(text(),'Current Categories')]").click()

#select starting award year
driver.find_element(By.XPATH,"//button[contains(@class,'awards-advsrch-yearsfrom')]").click()
driver.find_element(By.XPATH,"//div[@class='btn-group multiselect-btn-group open']//input[@value='1']").click()

#select ending award year
driver.find_element(By.XPATH,"//button[contains(@class,'awards-advsrch-yearsto')]").click()
year_latest = len(driver.find_elements(By.XPATH,"//div[@class='btn-group multiselect-btn-group open']//li"))-2
driver.find_element(By.XPATH,f"//div[@class='btn-group multiselect-btn-group open']//input[@value='{year_latest}']").click()

#search to view results
driver.find_element(By.XPATH,'//*[@id="btnbasicsearch"]').click()

#wait for all results to show
time.sleep(60)

#get html source for BeautifulSoup extraction
page_source = driver.page_source

#quit the driver to end the browser session and its driver process
driver.quit()
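The fixed `time.sleep(60)` above always waits the full minute, even if the results load sooner. A possible alternative is to poll for a condition and stop as soon as it holds. The sketch below uses a hypothetical helper, `wait_until`, that is not part of the notebook. The selector in the commented usage line is an assumption built from the `resultscontainer` id and `result-group-title` class that appear in the parsing cells.

```python
import time


def wait_until(condition, timeout=60.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical usage with the driver from the cell above (not executed here):
# wait_until(lambda: driver.find_elements(
#     By.CSS_SELECTOR, "#resultscontainer .result-group-title"))
```

Note that the first result group appearing does not guarantee that every ceremony has finished rendering, so a stricter condition may be needed. Selenium also ships its own polling utility, `WebDriverWait` in `selenium.webdriver.support.ui`, which serves the same purpose.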

### Use BeautifulSoup to extract elements and clean the page source
I saved the element with `id=resultscontainer` to a text file so that the slow Selenium cell above does not need to be rerun.

In [27]:
soup = BeautifulSoup(page_source, "lxml")
results_container = soup.find('div', {'id':'resultscontainer'})

main_dir = "/home/jdtganding/Documents/data-engineering-zoomcamp/week_7_PROJECT/data"
with open(f"{main_dir}/results.txt", "w") as file:
    file.write(str(results_container))

In [28]:
# Reload the saved container; the context manager closes the file handle after parsing
with open(f"{main_dir}/results.txt", "r") as file:
    results = BeautifulSoup(file, 'lxml')

In [29]:
award_years = results.find_all('div', class_='awards-result-chron result-group group-awardcategory-chron')

for award_year in award_years:
    ceremony_year = award_year.find('div', class_='result-group-title').find('a').text
    print(ceremony_year)

1927/28 (1st)
1928/29 (2nd)
1929/30 (3rd)
1930/31 (4th)
1931/32 (5th)
1932/33 (6th)
1934 (7th)
1935 (8th)
1936 (9th)
1937 (10th)
1938 (11th)
1939 (12th)
1940 (13th)
1941 (14th)
1942 (15th)
1943 (16th)
1944 (17th)
1945 (18th)
1946 (19th)
1947 (20th)
1948 (21st)
1949 (22nd)
1950 (23rd)
1951 (24th)
1952 (25th)
1953 (26th)
1954 (27th)
1955 (28th)
1956 (29th)
1957 (30th)
1958 (31st)
1959 (32nd)
1960 (33rd)
1961 (34th)
1962 (35th)
1963 (36th)
1964 (37th)
1965 (38th)
1966 (39th)
1967 (40th)
1968 (41st)
1969 (42nd)
1970 (43rd)
1971 (44th)
1972 (45th)
1973 (46th)
1974 (47th)
1975 (48th)
1976 (49th)
1977 (50th)
1978 (51st)
1979 (52nd)
1980 (53rd)
1981 (54th)
1982 (55th)
1983 (56th)
1984 (57th)
1985 (58th)
1986 (59th)
1987 (60th)
1988 (61st)
1989 (62nd)
1990 (63rd)
1991 (64th)
1992 (65th)
1993 (66th)
1994 (67th)
1995 (68th)
1996 (69th)
1997 (70th)
1998 (71st)
1999 (72nd)
2000 (73rd)
2001 (74th)
2002 (75th)
2003 (76th)
2004 (77th)
2005 (78th)
2006 (79th)
2007 (80th)
2008 (81st)
2009 (82nd)
2010 (8