# <center> Please go to https://ccv.jupyter.brown.edu </center>

## What we've learned so far:
- Introduction to HTML
- Making a request to a webpage and creating a BeautifulSoup object
- Simple and advanced navigation through a soup object
- Scraping weather data example

## By the end of today you will learn about:

- Scraping Wikipedia for the 2019 Oscar nominees

# We will be scraping the nominees of the 92nd Academy Awards (held in 2020, honoring films of 2019) from Wikipedia.

https://en.wikipedia.org/wiki/92nd_Academy_Awards

### To get started let's inspect the HTML underlying the page.
Using Chrome, go to View > Developer > Developer Tools

* `<table class="wikitable">` contains all tags for the nominees table.
* `<tr>` tags are rows in the nominees table, and each of their `<td>` children holds one category together with its winner and the other nominees.

We can use this information to figure out how to scrape the page.

* Nominees are in tables in the Awards section
* Winners are in bold
* Categories are table headers
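
To make these three observations concrete, here is a minimal sketch on a hand-written HTML snippet. `SAMPLE_CELL` is a simplified stand-in for one cell of the nominees table (an assumption for illustration, not a copy of the live Wikipedia markup).

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for one <td> of the nominees table.
SAMPLE_CELL = """
<table class="wikitable">
<tr><td>
<div><b><a title="Academy Award for Best Picture">Best Picture</a></b></div>
<ul><li><b><i><a>Parasite</a></i> - <a>Kwak Sin-ae</a> and <a>Bong Joon-ho</a></b>
<ul>
<li><i><a>Ford v Ferrari</a></i> - <a>Peter Chernin</a></li>
<li><i><a>The Irishman</a></i> - <a>Martin Scorsese</a></li>
</ul></li></ul>
</td></tr>
</table>
"""

cell = BeautifulSoup(SAMPLE_CELL, "html.parser").find("td")

# The category is the first bold link in the cell.
header_link = cell.find("b").find("a")
# Bold tags mark the category header and the winner.
bold_items = cell.find_all("b")
# List items hold the winner (outer list) and the other nominees (nested list).
list_items = cell.find_all("li")

print(header_link.get_text())
print(len(bold_items), len(list_items))
```

The real page has many such cells, so the same pattern repeats once per category.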

We want the CSV to look like the following:

|Nominee|Winner|Category|
|---|---|---|
|Parasite|1|Best Picture|
|Little Women|0|Best Picture|
|...|...|...|

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [None]:
url = 'https://en.wikipedia.org/wiki/92nd_Academy_Awards'
page = requests.get(url)
page.raise_for_status()  # stop early if the request failed (e.g. a 403 or 404)

In [None]:
soup = BeautifulSoup(page.content, 'html.parser')
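
If the plain request is rejected or hangs, one possible variation is to send a descriptive `User-Agent` header and a timeout. The sketch below is an assumption-laden example: `fetch_html` is a hypothetical helper name and the `User-Agent` string is made up for illustration. The `get` parameter exists only so the function can be tried without network access.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    # Hypothetical User-Agent string; replace it with one describing your own script.
    headers = {"User-Agent": "oscars-scraping-tutorial/0.1 (educational example)"}
    response = get(url, headers=headers, timeout=timeout)
    # raise_for_status() turns 4xx/5xx responses into an exception.
    response.raise_for_status()
    return response.text
```

With this helper, `BeautifulSoup(fetch_html(url), 'html.parser')` would build the same kind of soup object as the cells above.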

In [None]:
all_nominees = soup.find(class_="wikitable")
print(all_nominees)

### Extract the nominees table

In [None]:
print(all_nominees.prettify())

## Let's figure out how to scrape category, winner, and nominees from one subtable

In [None]:
subtables = all_nominees.select('tr td')
print(len(subtables))

In [None]:
print(subtables[0].prettify())

In [None]:
# find first <b> tag
print(subtables[0].find('b'))

In [None]:
# another way to find the first <b> tag
print(subtables[0].b)

In [None]:
# find first a tag within the b tag
print(subtables[0].b.a)

In [None]:
# get the title attribute from the a tag
category = subtables[0].b.a['title']
print(category)
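
As a side note on attribute access, the sketch below uses a hand-written tag (an assumption, simplified from the page) to show that square brackets and `.get()` behave differently when an attribute is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified version of the category link.
html = ('<b><a href="/wiki/Academy_Award_for_Best_Picture" '
        'title="Academy Award for Best Picture">Best Picture</a></b>')
link = BeautifulSoup(html, "html.parser").b.a

# Square brackets raise a KeyError if the attribute does not exist.
print(link["title"])
# .get() returns None for a missing attribute instead of raising.
print(link.get("class"))
# .attrs is a dictionary of every attribute on the tag.
print(link.attrs)
```

Using `.get()` can be the safer choice inside a loop, where one unusual tag would otherwise stop the whole scrape.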

## Exercise: get the category from the text of the a tag above

#### Find all `<a>` tags nested inside `<b>` tags within the first subtable

In [None]:
print(subtables[0].find_all('b'))

In [None]:
print(subtables[0].select('b a'))
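
The two calls above return different things: `find_all('b')` returns the bold tags themselves, while `select('b a')` returns every link anywhere inside a bold tag. A small sketch on a hand-written cell (an assumed simplification of the real markup) shows the difference, along with the stricter `b > a` selector that matches direct children only.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for one cell of the nominees table.
html = """<td>
<div><b><a title="Academy Award for Best Picture">Best Picture</a></b></div>
<ul><li><b><i><a>Parasite</a></i> - <a>Kwak Sin-ae</a> and <a>Bong Joon-ho</a></b>
<ul><li><i><a>Ford v Ferrari</a></i> - <a>Peter Chernin</a></li></ul></li></ul>
</td>"""
cell = BeautifulSoup(html, "html.parser").td

bold_tags = cell.find_all("b")        # the <b> tags themselves
bold_links = cell.select("b a")       # every <a> with a <b> ancestor
direct_links = cell.select("b > a")   # only <a> tags that are direct children of a <b>

print(len(bold_tags))
print([a.get_text() for a in bold_links])
print([a.get_text() for a in direct_links])
```

In this sample the film title sits inside an `<i>` tag, so it matches `b a` but not `b > a`.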

#### Get the winner from the text of the `<a>` tag

In [None]:
winner = subtables[0].select('b a')[1].get_text()
print(winner)

## Exercise: get the winning director (Bong Joon-ho)

#### Get the nominees (inspect the HTML underlying the webpage first)

In [None]:
nominees_table = subtables[0].ul.li.ul.find_all('li')
print(nominees_table)

In [None]:
nominees_lst = []
for entry in nominees_table:
    nominee = entry.a.get_text()
    nominees_lst.append(nominee)
print(nominees_lst) 
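
The chain `.ul.li.ul` matters because the winner and the other nominees live in nested lists. The sketch below, on a hand-written snippet that assumes the same nesting as the page, compares searching the outer list with searching the inner one, and writes the loop as a list comprehension.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified nesting: the winner is in the outer list,
# the remaining nominees are in a list nested inside the winner's <li>.
html = """<td>
<ul><li><b><i><a>Parasite</a></i></b>
<ul>
<li><i><a>Ford v Ferrari</a></i></li>
<li><i><a>The Irishman</a></i></li>
</ul></li></ul>
</td>"""
cell = BeautifulSoup(html, "html.parser").td

# Searching from the outer <ul> also returns the winner's <li>.
outer_items = cell.ul.find_all("li")
# Stepping into the nested <ul> first leaves only the other nominees.
inner_items = cell.ul.li.ul.find_all("li")

nominees = [item.a.get_text() for item in inner_items]
print(len(outer_items), len(inner_items))
print(nominees)
```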

## Exercise: get the first credited producer for each Best Picture nominee (e.g., Peter Chernin for Ford v Ferrari and Martin Scorsese for The Irishman)

### We have the category, winner, and nominees. Now we just need to structure them into a data frame.

In [None]:
print(winner)
movies_lst = [winner] + nominees_lst
print(movies_lst)

In [None]:
is_winner_lst = [1] + [0]*len(nominees_lst)
print(is_winner_lst)

In [None]:
print(category)
categories_lst = [category]*len(movies_lst)
print(categories_lst)

In [None]:
movies_df = pd.DataFrame({
    'movies': movies_lst,
    'winner': is_winner_lst,
    'category': categories_lst
})
print(movies_df)

## Next we will do this in a loop for all subtables

In [None]:
for subtable in subtables:
    category = subtable.b.a['title']
    winner = subtable.select('b a')[1].get_text()
    nominees_table = subtable.ul.li.ul.find_all('li')
    nominees_lst = []
    for entry in nominees_table:
        nominee = entry.a.get_text()
        nominees_lst.append(nominee)    
    movies_lst = [winner] + nominees_lst
    is_winner_lst = [1] + [0]*len(nominees_lst)
    categories_lst = [category]*len(movies_lst)
    movie_df = pd.DataFrame({
        'movies': movies_lst,
        'winner': is_winner_lst,
        'category': categories_lst
    })
print(movie_df)

### What is the problem? How do we fix it?

In [None]:
movies_list = []
for subtable in subtables:
    category = subtable.b.a['title']
    winner = subtable.select('b a')[1].get_text()
    nominees_table = subtable.ul.li.ul.find_all('li')
    nominees_lst = []
    for entry in nominees_table:
        nominee = entry.a.get_text()
        nominees_lst.append(nominee)    
    movies_lst = [winner] + nominees_lst
    is_winner_lst = [1] + [0]*len(nominees_lst)
    categories_lst = [category]*len(movies_lst)
    movie_df = pd.DataFrame({
        'movies': movies_lst,
        'winner': is_winner_lst,
        'category': categories_lst
    })
    movies_list.append(movie_df)
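# ignore_index=True renumbers the rows 0..n-1 instead of repeating each subtable's index
movies_df = pd.concat(movies_list, ignore_index=True)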
print(movies_df)
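
On a live page, some cells may not follow the expected layout, and an expression like `subtable.ul.li.ul` would then fail on a missing tag. A possible variation, sketched under assumptions, collects one dictionary per row and skips cells that lack the expected tags. `rows_from_cell` and `SAMPLE_TABLE` are hypothetical names, and the HTML is a hand-written simplification rather than the real page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample: one well-formed cell and one cell without the expected lists.
SAMPLE_TABLE = """
<table class="wikitable">
<tr>
<td>
<div><b><a title="Academy Award for Best Picture">Best Picture</a></b></div>
<ul><li><b><i><a>Parasite</a></i> - <a>Kwak Sin-ae</a> and <a>Bong Joon-ho</a></b>
<ul>
<li><i><a>Ford v Ferrari</a></i> - <a>Peter Chernin</a></li>
<li><i><a>The Irishman</a></i> - <a>Martin Scorsese</a></li>
</ul></li></ul>
</td>
<td><p>A cell without the expected lists</p></td>
</tr>
</table>
"""


def rows_from_cell(cell):
    """Return one dict per film in a cell, or [] if the expected tags are missing."""
    bold_links = cell.select("b a")
    nested = cell.select_one("ul > li > ul")
    if len(bold_links) < 2 or nested is None:
        return []
    category = bold_links[0].get("title", bold_links[0].get_text())
    rows = [{"movies": bold_links[1].get_text(), "winner": 1, "category": category}]
    for item in nested.find_all("li"):
        if item.a is not None:
            rows.append({"movies": item.a.get_text(), "winner": 0, "category": category})
    return rows


soup = BeautifulSoup(SAMPLE_TABLE, "html.parser")
rows = []
for cell in soup.find(class_="wikitable").select("tr td"):
    rows.extend(rows_from_cell(cell))

# Build the data frame once, from the full list of rows.
movies_df = pd.DataFrame(rows, columns=["movies", "winner", "category"])
print(movies_df)
```

Building the data frame once from a list of dictionaries avoids creating and concatenating many small data frames. Both approaches give the same columns.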

In [None]:
# the data/ directory must already exist, otherwise to_csv raises an error
movies_df.to_csv('data/oscars.csv', index=False)
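
If the output folder might not exist yet, it can be created before writing. The sketch below is a minimal example under stated assumptions: it uses a tiny hand-made data frame and a temporary directory so it can run anywhere, then reads the file back to check what was written.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small hand-made data frame standing in for the scraped results.
movies_df = pd.DataFrame({
    "movies": ["Parasite", "Little Women"],
    "winner": [1, 0],
    "category": ["Best Picture", "Best Picture"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "oscars.csv"
    # Create the data/ folder (and any missing parents) before writing.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    movies_df.to_csv(out_path, index=False)
    # Read the file back to confirm the columns and values survived.
    round_trip = pd.read_csv(out_path)

print(round_trip)
```

In the notebook itself, the same `mkdir` call on `Path('data')` would serve the same purpose before the `to_csv` cell.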