# Web Scraping: An Example
Mahdi Sadjadi, Data Scientist @ VideoAmp, May 2020

This is an introductory example of scraping US box office data from the web using the `requests` and `BeautifulSoup` packages.

## Read the webpage

We can read the content of a web page directly into a Python object. Using the `requests` library, we send a request to the server and receive the HTML content in response. The response's `text` attribute then gives us the HTML as a string.

In [0]:
import requests

box_office_url = "https://www.the-numbers.com/box-office-chart/daily/2020/01/23"

# Send a GET request for the page (no browser-like headers are set here)
box_request = requests.get(box_office_url)

# The .text attribute holds the response body decoded as a string
box_html = box_request.text

# Number of characters in the HTML string
len(box_html)
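The request above is sent with the default `requests` headers, so the server sees a `python-requests` user agent rather than a browser. Below is a minimal sketch of a more defensive fetch, assuming we want a browser-like `User-Agent`, a timeout, and an exception on a bad status code. The helper name `fetch_html`, the header string, and the 10-second timeout are illustrative choices and not part of the original notebook.

```python
import requests

# Hypothetical header value; any realistic browser string would do.
BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; box-office-tutorial)"}


def fetch_html(url, timeout=10):
    """Return the HTML of `url` as a string, raising on HTTP errors."""
    response = requests.get(url, headers=BROWSER_HEADERS, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently returning the body of an error page.
    response.raise_for_status()
    return response.text
```

With this helper, the cell above could be reduced to `box_html = fetch_html(box_office_url)`.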

Print first 200 characters:

In [0]:
print(box_html[:200])

In [0]:
print(box_html[20000:20500])

## Parse with BeautifulSoup
We could work directly with the string returned by `requests`, but that would be a long and painful process. `BeautifulSoup` parses the string into a tree of HTML tags, which we can then search to find the tags we are interested in.

In [0]:
from bs4 import BeautifulSoup

# Turn into soup, specify the HTML parser
box_soup = BeautifulSoup(box_html, 'html.parser')

Find all tables, identified by the `table` tag:

In [0]:
all_tables = box_soup.find_all('table')

In [0]:
len(all_tables)
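To see what `find_all` returns without depending on the live page, here is a small sketch on a made-up HTML string. The markup, the `id` values, and the `sample_` names are invented for illustration and do not reflect the structure of the real site.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a real page
sample_html = """
<html><body>
  <table id="nav"><tr><td>Home</td></tr></table>
  <table id="data">
    <tr><td>1</td><td>Movie A</td></tr>
  </table>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like collection of every matching tag, in document order
sample_tables = sample_soup.find_all("table")

print(len(sample_tables))                      # 2
print(sample_tables[1]["id"])                  # data
print(sample_tables[1].find("td").get_text())  # 1
```

Each element of the result is itself a tag object, so it can be searched again with `find` or `find_all`, which is how the following cells drill down from a table to its rows and cells.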

The first table holds the navigation links and the second table contains the data we need:

In [0]:
print(all_tables[0])

In [0]:
table_with_data = all_tables[1]

In [0]:
type(table_with_data)

In [0]:
#dir(table_with_data)

In [0]:
print(table_with_data)

Find all rows containing the data:

In [0]:
rows = table_with_data.find_all('tr')
len(rows)

How do we extract the value of each column? 
Example for a row:

In [0]:
row4 = rows[4]
row4

In [0]:
for item in row4.find_all('td'):
    print(f"Value is = {item.get_text()}")

In [0]:
def parse_row(row):
    """
    Input: a row object with required data
    Output: the value of each column
    """
    
    # parse out the text values
    items = [item.get_text() for item in row.find_all('td')]
    
    # post process the item as needed
    # See Exercise 3
    
    return items

In [0]:
parse_row(row4)

In [0]:
for row in rows[0:4]:
    print(parse_row(row))
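One detail worth keeping in mind: `parse_row` only collects `td` cells, so a row built from `th` cells yields an empty list. The sketch below shows this on a made-up two-row table. Whether the real page has such a header row among its `tr` tags is an assumption to check against the output above.

```python
from bs4 import BeautifulSoup


def parse_row(row):
    """Return the text of every <td> cell in a row."""
    return [item.get_text() for item in row.find_all("td")]


# Made-up table: one header row (<th>) and one data row (<td>)
sample_table = BeautifulSoup(
    """
    <table>
      <tr><th>Rank</th><th>Movie</th><th>Gross</th></tr>
      <tr><td>1</td><td>Movie A</td><td>$1,000</td></tr>
    </table>
    """,
    "html.parser",
)

sample_rows = sample_table.find_all("tr")

print(parse_row(sample_rows[0]))  # []
print(parse_row(sample_rows[1]))  # ['1', 'Movie A', '$1,000']
```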

## Putting everything together

In [0]:
import pandas as pd

In [0]:
columns = [item.get_text() for item in table_with_data.find_all("thead")[0].find_all('th')]
columns

In [0]:
parsed_data = [parse_row(row) for row in rows]

In [0]:
df = pd.DataFrame(data=parsed_data, columns=columns)
df.head(10)
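If some parsed rows come back empty (for example a header row with no `td` cells), they can be filtered out before building the dataframe. This is a minimal sketch on made-up values; the column names and rows are placeholders, not real box office figures.

```python
import pandas as pd

# Placeholder column names and rows, standing in for `columns` and `parsed_data`
sample_columns = ["Rank", "Movie", "Gross"]
sample_parsed = [
    [],                           # e.g. a header row with no <td> cells
    ["1", "Movie A", "$1,000"],
    ["2", "Movie B", "$500"],
]

# Keep only rows that actually contain values
sample_clean = [row for row in sample_parsed if row]

sample_df = pd.DataFrame(data=sample_clean, columns=sample_columns)
print(sample_df.shape)  # (2, 3)
```

Note that every value is still a string at this point; converting them to numbers is the subject of Exercise 3.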

## Exercises

### Exercise 1 (scraping)

Look at the `columns` definition and explain how and why `table_with_data.find_all("thead")[0].find_all('th')` returns the column names.

### Exercise 2 (functional programming)

It is best practice to define functions for distinct tasks. For example, the section above builds a table for a single day. Define a function that receives a date and returns the final table for that date. This function should itself be composed of smaller functions, each corresponding to one step of the process described above.
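As a possible starting point, one of the smaller functions could turn a date into the page URL. The sketch below assumes that every daily chart follows the same `YYYY/MM/DD` pattern as the single URL used earlier; the names `BASE_URL` and `build_daily_url` are hypothetical.

```python
from datetime import date

# URL pattern inferred from the one example URL in this notebook (an assumption)
BASE_URL = "https://www.the-numbers.com/box-office-chart/daily"


def build_daily_url(day):
    """Return the daily chart URL for a datetime.date."""
    return f"{BASE_URL}/{day.strftime('%Y/%m/%d')}"


print(build_daily_url(date(2020, 1, 23)))
# https://www.the-numbers.com/box-office-chart/daily/2020/01/23
```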

### Exercise 3 (string manipulation)
Data scraped from web pages often needs pre-processing and post-processing. Modify `parse_row` (or your own equivalent function) to remove symbols such as dollar and percent signs from the data.


### Exercise 4 (data handling)

It is useful to combine the dataframes for individual days into a larger dataset. However, there is no column identifying the date, so after combining two dataframes you cannot tell which day each row came from. Add a date column to each dataframe, then combine multiple dataframes into a final dataframe.