# Testing notebook for scraping data 
This notebook contains test scripts for scraping the data we need from several web sources.

## Scrape Bechdel test of movies
Scrape data from http://bechdeltest.com/ using its API. According to the site owner, the `getAllMovies` endpoint should not be called frequently because the site runs on a shared hosting plan. For that reason, I ran the GET request once and saved the result as a CSV file.
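
The "request once, then reuse the saved copy" idea can be sketched as a small caching helper. This is only an illustration: `load_or_fetch`, `fake_fetch`, and the cache file name are hypothetical, and the stand-in fetcher replaces the real API call so that nothing is sent to the shared host.

```python
import json
import tempfile
from pathlib import Path


def load_or_fetch(cache_path, fetch):
    """Return the cached JSON if it exists; otherwise call fetch() once and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Cache hit: no request is sent
        return json.loads(cache_path.read_text(encoding="utf-8"))
    data = fetch()
    cache_path.write_text(json.dumps(data), encoding="utf-8")
    return data


# Stand-in fetcher (hypothetical) that records how often it is called
calls = []

def fake_fetch():
    calls.append("called")
    return [{"title": "Moonlight", "year": 2016, "rating": 1}]


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "bechdel_cache.json"
    first = load_or_fetch(cache_file, fake_fetch)   # fetches and writes the cache
    second = load_or_fetch(cache_file, fake_fetch)  # served from the cache

print(len(calls))
```

With the real API, `fetch` could be something like `lambda: requests.get(url).json()`, so the endpoint is only hit when the cache file is missing.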

In [1]:
import io
import re
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

main_dir = "/home/jdtganding/Documents/bechdel-movies-project/data"

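# One-time download, kept commented out to avoid repeated calls to the API:
# html = requests.get('http://bechdeltest.com/api/v1/getAllMovies').content
# df = pd.read_json(io.StringIO(html.decode('utf-8')))
# df.to_csv(f"{main_dir}/BechdelTestMovieList.csv", index=None)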

bechdel_movies_df = pd.read_csv(f"{main_dir}/BechdelTestMovieList.csv")
bechdel_movies_df.sample(10)

Unnamed: 0,rating,imdbid,id,title,year
5967,3,844708.0,5363,"Last House on the Left, The",2009
8451,1,4975722.0,7421,Moonlight,2016
3741,3,212346.0,655,Miss Congeniality,2000
5484,3,832266.0,545,"Definitely, Maybe",2008
7353,3,2829458.0,5113,Kyss meg for faen i helvete,2013
2257,3,91949.0,2,Short Circuit,1986
4386,2,324197.0,6745,Time of the Wolf,2003
7582,3,2536428.0,5408,La cr&egrave;me de la cr&egrave;me,2014
4401,3,326977.0,8198,I&#39;m Not Scared,2003
7500,0,2796584.0,7151,Lost Angel,2013
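
Some titles in the sample still carry HTML entities (`&egrave;`, `&#39;`). A minimal sketch of decoding them with the standard library's `html.unescape`, using titles copied from the sample above:

```python
import html

# Titles copied from the sample above
raw_titles = [
    "La cr&egrave;me de la cr&egrave;me",
    "I&#39;m Not Scared",
    "Moonlight",
]

# html.unescape decodes both named (&egrave;) and numeric (&#39;) entities
clean_titles = [html.unescape(title) for title in raw_titles]
print(clean_titles)
```

On the dataframe itself, the same function could be applied with `bechdel_movies_df["title"].map(html.unescape)`.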


## Scrape Oscars movie nominees and winners
The Academy Awards has its own database at https://awardsdatabase.oscars.org/. I scraped the whole database, from the 1st Academy Awards up to the latest, using `selenium` and saved the page source in a variable that can be parsed with `BeautifulSoup`.

In [95]:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

options = webdriver.FirefoxOptions()
#Firefox equivalents of Chrome's --ignore-certificate-errors and --incognito flags
options.accept_insecure_certs = True
options.add_argument('-private')
options.add_argument('--headless')

driver = webdriver.Firefox(options=options)
driver.get("https://awardsdatabase.oscars.org/") 

#select award categories
driver.find_element(By.XPATH,"//button[contains(@class,'awards-basicsrch-awardcategory')]").click()
driver.find_element(By.XPATH,"//b[contains(text(),'Current Categories')]").click()

#select starting award year
driver.find_element(By.XPATH,"//button[contains(@class,'awards-advsrch-yearsfrom')]").click()
driver.find_element(By.XPATH,"//div[@class='btn-group multiselect-btn-group open']//input[@value='1']").click()

#select ending award year
driver.find_element(By.XPATH,"//button[contains(@class,'awards-advsrch-yearsto')]").click()
year_latest = len(driver.find_elements(By.XPATH,"//div[@class='btn-group multiselect-btn-group open']//li"))-2
driver.find_element(By.XPATH,f"//div[@class='btn-group multiselect-btn-group open']//input[@value='{year_latest}']").click()

#search to view results
# driver.find_element(By.XPATH,'//*[@id="btnbasicsearch"]').click()

#wait for all results to show
# time.sleep(60)

try:
    #resultscontainer will contain all our needed Oscars data
    driver.find_element(By.XPATH, '//*[@id="resultscontainer"]')

except NoSuchElementException as error:
    print(error)
    print("Needed element still not found after 60 seconds delay")

#get html source for BeautifulSoup extraction
page_source = driver.page_source

#quit the driver to end the whole browser session
driver.quit()
print("Driver closed")

Message: Unable to locate element: //*[@id="resultscontainer"]
Stacktrace:
WebDriverError@chrome://remote/content/shared/webdriver/Errors.jsm:186:5
NoSuchElementError@chrome://remote/content/shared/webdriver/Errors.jsm:398:5
element.find/</<@chrome://remote/content/marionette/element.js:300:16

Needed element still not found after 60 seconds delay
Driver closed
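
The error above appears to come from the search click and the wait being commented out in this run, so the results container is never rendered. A fixed `time.sleep(60)` can also be replaced by polling until the element shows up. Below is a plain-Python sketch of that idea; `wait_until` and `results_loaded` are hypothetical names, and the page check is simulated rather than run against a live driver. Selenium ships `WebDriverWait` in `selenium.webdriver.support.ui` for the same purpose.

```python
import time


def wait_until(condition, timeout=60.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulated page check: becomes truthy on the third call
attempts = {"count": 0}

def results_loaded():
    attempts["count"] += 1
    return attempts["count"] >= 3


found = wait_until(results_loaded, timeout=5.0, poll_interval=0.01)

# A condition that never succeeds raises TimeoutError instead of failing silently
try:
    wait_until(lambda: False, timeout=0.05, poll_interval=0.01)
    timed_out = False
except TimeoutError:
    timed_out = True
```

With a live driver, the condition could be `lambda: driver.find_elements(By.XPATH, '//*[@id="resultscontainer"]')`, which returns an empty (falsy) list until the element exists.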


### Transform the HTML into a structured format such as CSV
First, we use BeautifulSoup to extract the relevant elements from the page source and clean them. The target structure for the dataframe is:
```python
df_structure = {
    "AwardYear":[],
    "AwardCeremonyNum":[],
    "Movie":[],
    "AwardCategory":[],
    "AwardStatus":[]
}
```

- `AwardYear`: the award year shown in the group title
- `AwardCeremonyNum`: the ordinal number of the ceremony (n for the nth Academy Awards)
- `Movie`: the title of the nominated film
- `AwardCategory`: the category in which the film was nominated
- `AwardStatus`: `nominated` if the film was only nominated, `won` if it won
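
As a rough illustration of how the first two fields can be derived, the sketch below splits a group title into the award year and the ceremony number. The title pattern `"<year> (<nth>)"` is an assumption about the page, `parse_award_year_title` is a hypothetical helper, and the movie and category in `record` are placeholders, not scraped data.

```python
import re


def parse_award_year_title(title):
    """Split a title such as '2019 (92nd)' into (award year, ceremony number)."""
    year_part, ceremony_part = title.split(" ", 1)
    # Keep only the digits of the ordinal, e.g. '(92nd)' -> '92'
    ceremony_num = re.search(r"\d+", ceremony_part).group()
    return year_part, ceremony_num


# Assumed title formats, including a split award year
example = parse_award_year_title("2019 (92nd)")
early = parse_award_year_title("1927/28 (1st)")

# One row in the target structure (movie and category are placeholders)
record = {
    "AwardYear": example[0],
    "AwardCeremonyNum": example[1],
    "Movie": "Example Film",
    "AwardCategory": "EXAMPLE CATEGORY",
    "AwardStatus": "nominated",
}
print(record)
```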

In [27]:
# soup = BeautifulSoup(page_source, "lxml")
# results_container = soup.find('div', {'id':'resultscontainer'})

# with open (f"{main_dir}/OscarsResultsContainerHTML.txt", "w") as file:
#     file.write(str(results_container))

In [2]:
#read the saved HTML; the with-block closes the file once it has been parsed
with open(f"{main_dir}/OscarsResultsContainerHTML.txt", "r") as file:
    results_container = BeautifulSoup(file, 'lxml')

award_year_all = results_container.find_all('div',class_='awards-result-chron result-group group-awardcategory-chron')

In [90]:
oscars_results = []

for award_year_group in award_year_all:

    df_structure = {
        "AwardYear":'',
        "AwardCeremonyNum":'',
        "Movie":[],
        "AwardCategory":[],
        "AwardStatus": []
    }

    #find the award year title
    award_year = award_year_group.find('div',class_='result-group-title')\
                                 .get_text(strip=True)

    #separate award year title to extract year
    key_split = award_year.split(" ")
    df_structure['AwardYear'] = key_split[0]
    df_structure['AwardCeremonyNum'] = re.findall(r'\d+',key_split[1])[0]
    
    #award category result subgroup (each contains award title and nominees)
    award_category_all = award_year_group.find_all('div',class_='result-subgroup subgroup-awardcategory-chron')

    for award_category_group in award_category_all:

        #find award title
        award_title = award_category_group.find('div',class_='result-subgroup-title')\
                                          .get_text(strip=True)
        
        try:
            #find nominated movies
            movies = [movie.get_text(strip=True) for movie in award_category_group\
                           .find_all('div', class_='awards-result-film-title')]

            #remove duplicates while keeping the original order
            movies = list(dict.fromkeys(movies))

            #find winning movie
            winner_group = award_category_group.find('span', {'title':'Winner'})\
                                               .find_next_sibling('div')

            winners = [movie.get_text(strip=True) for movie in winner_group\
                            .find_all('div', class_='awards-result-film-title')]

            #update df_structure movie and category lists
            count = len(movies)
            if count > 0:
                df_structure['Movie'].extend(movies)
                df_structure['AwardCategory'].extend([award_title] * count)

                #start with every movie as 'nominated', then mark the winner/s
                categ_list = ['nominated'] * count
                for winner in winners:
                    categ_list[movies.index(winner)] = 'won'

                df_structure['AwardStatus'].extend(categ_list)
      
        except AttributeError:
            #a find() call returned None (e.g. no 'Winner' marker), so skip this category
            pass

    #append dataframe to list
    oscars_results.append(pd.DataFrame(df_structure))

#concatenate all award year dataframe into one    
oscars_results_final = pd.concat(oscars_results).reset_index(drop=True)

#save data as a csv file
oscars_results_final.to_csv(f"{main_dir}/OscarsFullResults.csv", index=False)

In [98]:
movies = ['movie1', 'movie2', 'movie3', 'movie4']
status = ['nominated','nominated','nominated','nominated']
winners = ['movie1', 'movie3']

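for winner in winners:
    #mark the winner's position in the status list as 'won'
    status[movies.index(winner)] = 'won'

status

['won', 'nominated', 'won', 'nominated']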

## Collecting IMDb datasets

In [97]:
imdb_titles = pd.read_csv('https://datasets.imdbws.com/title.basics.tsv.gz', 
                          chunksize=500_000,
                          iterator=True,
                          sep='\t',
                          header=0)

In [18]:
next(imdb_titles).head(10)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"Short,Sport"
7,tt0000008,short,Edison Kinetoscopic Record of a Sneeze,Edison Kinetoscopic Record of a Sneeze,0,1894,\N,1,"Documentary,Short"
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,Romance
9,tt0000010,short,Leaving the Factory,La sortie de l'usine Lumière à Lyon,0,1895,\N,1,"Documentary,Short"
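
The preview shows `\N` wherever a value is missing and identifiers in the `tt0000001` form, whereas the Bechdel list stores `imdbid` as a float. The sketch below handles both with the standard library. It assumes that the Bechdel `imdbid` is the numeric part of `tconst`, zero-padded to seven digits; `read_imdb_rows` and `to_tconst` are hypothetical helpers, and the sample rows are copied from the preview.

```python
import csv
import io

# Two rows copied from the preview above, in the file's tab-separated layout
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tCarmencita\tCarmencita\t0\t1894\t\\N\t1\tDocumentary,Short\n"
    "tt0000009\tmovie\tMiss Jerry\tMiss Jerry\t0\t1894\t\\N\t45\tRomance\n"
)


def read_imdb_rows(handle):
    """Yield one dict per row, replacing the missing-value marker with None."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {key: (None if value == r"\N" else value) for key, value in row.items()}


def to_tconst(imdbid):
    """Format a numeric id such as 212346.0 as 'tt0212346' (assumed mapping)."""
    return f"tt{int(imdbid):07d}"


rows = list(read_imdb_rows(io.StringIO(sample_tsv)))
print(rows[1]["primaryTitle"], rows[1]["endYear"])
print(to_tconst(212346.0))
```

In pandas, the missing-value marker could instead be handled at load time by passing `na_values='\\N'` to `pd.read_csv`.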


## Using the TMDB API to collect data