# Elements Of Data Processing (2021S1) - Week 6


### Web Scraping
- Web scraping is an automated way to extract structured data (e.g. a table) from a website.
- It is useful for public websites that offer no API, or only limited access, for retrieving the data.
- `BeautifulSoup` (from `bs4`) is a Python module that parses HTML so data can be extracted from web pages for processing and/or website analysis. It is typically paired with `requests` or `urllib`, which download the pages.
- The two main operations are *scraping* a page for structured data and *crawling* a website by following its links from page to page.

### Example
The example below extracts some tennis rankings and points from the Wikipedia page for the 2019 ATP Tour.

In [None]:
import requests
import unicodedata
import re
import matplotlib.pyplot as plt
import pandas as pd
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from IPython.display import display

In [None]:
# first, use requests to get the html content of the page
url = 'https://en.wikipedia.org/wiki/2019_ATP_Tour'

response = requests.get(url)
response

Response code `200` (`OK`) denotes success (no errors - this is what you want to see).

Some other useful ones to know:
- `400`: Bad request; the server could not understand what you sent (a client-side error).
- `403`: Forbidden; you don't have permission to access the resource (`401` is the code for missing authentication).
- `404`: Content doesn't exist.
- `405`: Method is not allowed (e.g. you used `GET` instead of `POST` or `PUT`).
- `500`: Something unexpected happened on the server and it can't say more precisely what.
- `503`: Server is temporarily unavailable, e.g. overloaded or down for maintenance (try again later).

More here... https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
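
As a small aside, the standard library already knows the official phrase for each status code. The sketch below is only an illustration: `describe_status` is a hypothetical helper name, not part of `requests`, and it simply groups codes by their first digit.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found (client error)'."""
    # The first digit of a status code identifies its family.
    families = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    family = families.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Code is not a registered HTTP status.
        phrase = "Unknown"
    return f"{code} {phrase} ({family})"


for code in (200, 400, 403, 404, 405, 500, 503):
    print(describe_status(code))
```

With a real response you could call `describe_status(response.status_code)`. Separately, `response.raise_for_status()` raises an exception for any `4xx` or `5xx` code.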

In [None]:
# the decoded text of the html (the raw bytes are available as response.content)
response.text

- Rather than looking at this text, we can use `BeautifulSoup` to parse it!
- Since we're looking for the ATP ranking section of the **2019 ATP Tour** page, we can use the `.find()` method with that section's `id` to locate it.

In [None]:
s = BeautifulSoup(response.text, 'html.parser')
atp_tag = s.find(id='ATP_ranking')
atp_tag

- Now, let's move to the table we want: the second (right-hand) table under this heading.

In [None]:
table = atp_tag.find_next('table').find_next('table')
table
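
To see why the lookup above is chained twice, here is a minimal sketch on a made-up HTML fragment (the markup and the `left`/`right` ids are assumptions for illustration, not the real Wikipedia page). `find_next` is the current name for the older `findNext` alias. It returns the next matching element in document order, so calling it twice skips the first table.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified page: a heading followed by two tables.
html = """
<h2><span id="ATP_ranking">ATP ranking</span></h2>
<table id="left"><tr><td>first table</td></tr></table>
<table id="right"><tr><td>second table</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
heading = soup.find(id="ATP_ranking")

# First call: the first <table> that appears after the heading.
first_table = heading.find_next("table")
# Second call: the next <table> after that one.
second_table = first_table.find_next("table")

print(first_table["id"])
print(second_table["id"])
```

Note that `find()` returns `None` when nothing matches, so on a real page it is worth checking the result before chaining further calls.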

There are a couple of ways to parse this data:
- Method 1: parse by directly reading the rows (faster).
- Method 2: use pandas (slower, but easier).

#### `unicodedata.normalize()`
- Web pages typically contain Unicode text, which covers far more characters than ASCII.
- Because Unicode covers so many scripts and symbols, the same character can often be written in more than one way.
- For example, `"â"` can be represented as a single code point (U+00E2), or as two code points: `"a"` (U+0061) followed by the combining circumflex `" ̂"` (U+0302).
- In this specific example, we will use Normalization Form KD (`NFKD`), which decomposes characters by compatibility.
- For example: `"ﬁ"` (U+FB01) becomes `"f"` (U+0066) and `"i"` (U+0069).

#### Example:

In [None]:
unistr = u'\u2460'
print(f"{unicodedata.normalize('NFKD', unistr)} is the equivalent character of {unistr}")
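
The bullet points above can be checked directly with the standard library. This is a small sketch rather than part of the scraping pipeline, and the variable names are just for illustration. The non-breaking space at the end is one plausible reason to normalise scraped text, since it often appears in web pages.

```python
import unicodedata

composed = "\u00e2"      # "â" as a single code point
decomposed = "a\u0302"   # "a" followed by a combining circumflex

# The two strings look the same but only compare equal after normalisation.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

# Canonical forms (NFC/NFD) leave the "ﬁ" ligature alone;
# compatibility forms (NFKC/NFKD) split it into "f" and "i".
ligature = "\ufb01"
print(unicodedata.normalize("NFD", ligature) == ligature)
print(unicodedata.normalize("NFKD", ligature))

# NFKD also turns a non-breaking space (U+00A0) into a plain space.
nbsp_text = "9\u00a0585"
print(unicodedata.normalize("NFKD", nbsp_text) == "9 585")
```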

In [None]:
%%time
# method 1
rows = table.find_all('tr')

records = []

# skip the first two rows, which hold the table headers
for row in rows[2:]:
    cells = row.find_all('td')
    record = []

    # ranking: integer type
    ranking = unicodedata.normalize("NFKD", cells[0].text.strip())
    record.append(int(ranking))

    player = unicodedata.normalize("NFKD", cells[1].text.strip())
    # Remove the bracketed country code from the player name, then strip surrounding whitespace
    player_name = re.sub(r'\(.*\)', '', player).strip()
    record.append(player_name)

    # Remove the thousands separator from the points value and store as an integer
    points = unicodedata.normalize("NFKD", cells[2].text.strip())
    record.append(int(points.replace(',', '')))

    # number of tours: integer type
    tours = unicodedata.normalize("NFKD", cells[3].text.strip())
    record.append(int(tours))

    # Store the country code (the text inside the brackets) separately
    country_code = re.search(r'\((.*)\)', player).group(1)
    record.append(country_code)

    # e.g. [1, 'Rafael Nadal', 9585, 12, 'ESP']
    records.append(record)

column_names = ["Ranking", "Player", "Points", "Tours", "Country"]
tennis_data = pd.DataFrame(records, columns = column_names)

tennis_data.head()
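
The string handling in Method 1 can be tried on its own, without downloading anything. The sketch below assumes each player cell looks like `"Name (XXX)"` with a three-letter country code. `split_player` and `parse_points` are hypothetical helper names introduced here for illustration.

```python
import re
from typing import Tuple

# Assumed cell format: "<player name> (<three-letter country code>)"
PLAYER_PATTERN = re.compile(r"^(?P<name>.*?)\s*\((?P<country>[A-Z]{3})\)\s*$")


def split_player(cell_text: str) -> Tuple[str, str]:
    """Split 'Rafael Nadal (ESP)' into ('Rafael Nadal', 'ESP')."""
    match = PLAYER_PATTERN.match(cell_text.strip())
    if match is None:
        # Fail loudly instead of crashing later on a None result.
        raise ValueError(f"Unexpected player cell: {cell_text!r}")
    return match.group("name"), match.group("country")


def parse_points(cell_text: str) -> int:
    """Convert '9,585' into the integer 9585."""
    return int(cell_text.strip().replace(",", ""))


print(split_player("Rafael Nadal (ESP)"))
print(parse_points("9,585"))
```

Anchoring the pattern and naming the groups is a design choice: an unexpected cell raises a clear error instead of an `AttributeError` from calling `.group()` on `None`.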

In [None]:
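# method 2
from io import StringIO

# wrap the html in StringIO: recent pandas versions deprecate passing a literal html string
df = pd.read_html(StringIO(str(table)))[0]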
display(df.head())

# drop the 0th level multi-index (Singles Race Rankings Final rankings[9])
# and also drop the nan column
# then rename the # to ranking
df = df.droplevel(level=0, axis=1).dropna(axis=1).rename({'#': 'Ranking'}, axis=1)
display(df.head())
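
To make the `droplevel` / `dropna` / `rename` chain above easier to follow, here is a sketch on a tiny hand-built frame. The two-level column layout and the empty `Notes` column are assumptions that mimic the shape of the scraped table. The single row reuses the example record from Method 1.

```python
import pandas as pd

# Hypothetical two-level header: a shared top level over the real column names.
columns = pd.MultiIndex.from_tuples([
    ("Singles Race Rankings", "#"),
    ("Singles Race Rankings", "Player"),
    ("Singles Race Rankings", "Points"),
    ("Singles Race Rankings", "Tours"),
    ("Singles Race Rankings", "Notes"),   # assumed empty column
])
raw = pd.DataFrame(
    [[1, "Rafael Nadal (ESP)", 9585, 12, float("nan")]],
    columns=columns,
)

tidy = (
    raw.droplevel(level=0, axis=1)        # keep only the inner column names
       .dropna(axis=1)                    # drop columns that contain any NaN
       .rename({"#": "Ranking"}, axis=1)  # give the rank column a readable name
)

print(list(tidy.columns))
```

Keep in mind that `dropna(axis=1)` removes every column containing at least one missing value, so on other tables it may drop more than intended.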

In [None]:
%%time
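from io import StringIO

# wrap the html in StringIO: recent pandas versions deprecate passing a literal html string
df = pd.read_html(StringIO(str(table)))[0]
df = df.droplevel(level=0, axis=1).dropna(axis=1).rename({'#': 'Ranking'}, axis=1)

# normalise unicode in the player names
df['Player'] = df['Player'].apply(lambda x: unicodedata.normalize("NFKD", x))

# get country value (the text inside the brackets)
df['Country'] = df['Player'].apply(lambda x: re.search(r'\((.*)\)', x).group(1))

# get player text (everything except the bracketed country code)
df['Player'] = df['Player'].apply(lambda x: re.sub(r'\(.*\)', '', x).strip())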
df.head()

In [None]:
plt.xticks(rotation=90)
plt.bar(df['Player'], df['Points'])
plt.ylabel('Points')
plt.title("ATP Tour - Player Points")
plt.show()

### <span style="color:blue"> Exercise 1 </span>

Produce a graph similar to the example above for the **2019 ATP Doubles Scores**.

*First locate the section we're interested in.*
    

In [None]:
# Specify the page to download
url = 'https://en.wikipedia.org/wiki/2019_ATP_Tour'
page = requests.get(url)
s = BeautifulSoup(page.text, 'html.parser')

# add code below


### Web crawling
- This is a web crawler that traverses http://books.toscrape.com/
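
The crawler in this section relies on `urljoin` to turn the relative `href` values found on a page into absolute URLs. The sketch below shows how that resolution behaves. The `../category/...` link and the `example.com` URL are made-up inputs for illustration.

```python
from urllib.parse import urldefrag, urljoin

seed_url = "http://books.toscrape.com/index.html"

# A relative link is resolved against the directory of the page it was found on.
book_url = urljoin(seed_url, "catalogue/sharp-objects_997/index.html")
print(book_url)

# "../" climbs one directory up from the current page (hypothetical link).
up_url = urljoin(book_url, "../category/books_1/index.html")
print(up_url)

# An absolute link replaces the base entirely.
external_url = urljoin(book_url, "http://example.com/other.html")
print(external_url)

# urldefrag strips "#fragment" parts, which all point to the same page.
clean_url, fragment = urldefrag(book_url + "#reviews")
print(clean_url, fragment)
```

Because the base for `urljoin` is the page the link came from, the same relative `href` can resolve to different URLs on different pages.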


In [None]:
page_limit = 20

# Specify the initial page to crawl
base_url = 'http://books.toscrape.com/'
seed_item = 'index.html'

seed_url = base_url + seed_item
page = requests.get(seed_url)
soup = BeautifulSoup(page.text, 'html.parser')

# initialise the dictionary of visited pages
visited = {seed_url: True}
pages_visited = 1
print(seed_url)

# Collect every link on the seed page except those pointing back to index.html
links = soup.find_all('a', href=True)
seed_link = soup.find_all('a', href=re.compile(r"^index\.html"))
to_visit_relative = [l for l in links if l not in seed_link]

# Resolve to absolute urls
to_visit = []
for link in to_visit_relative:
    to_visit.append(urljoin(seed_url, link['href']))

    
# Find all outbound links on successor pages and explore each one whilst under the visit limit
while (to_visit) and pages_visited < page_limit:
    # consume the list of urls
    link = to_visit.pop(0)
    print(link)

    # links were already resolved to absolute urls with urljoin, e.g. from <a href="catalogue/sharp-objects_997/index.html">
    page = requests.get(link)
    
    # scraping code goes here
    soup = BeautifulSoup(page.text, 'html.parser')
    
    # mark the page as visited (pop() above already removed it from to_visit)
    visited[link] = True
    # only consider anchors that actually have an href attribute
    new_links = soup.find_all('a', href=True)
    for new_link in new_links:
        new_item = new_link['href']
        new_url = urljoin(link, new_item)
        if new_url not in visited and new_url not in to_visit:
            to_visit.append(new_url)
        
    pages_visited += 1

print(f'\nVisited {len(visited)} pages; {len(to_visit)} urls are still queued to visit')
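
The loop above keeps its frontier in a plain list, so both `pop(0)` and the `not in to_visit` check scan the whole list. As a hedged alternative, the same breadth-first order can be kept with a `deque` and a `set`. The sketch below runs on a made-up in-memory link graph (`LINK_GRAPH` and `crawl_order` are hypothetical names), so it needs no network access.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
LINK_GRAPH = {
    "index": ["a", "b", "index"],
    "a": ["b", "c"],
    "b": ["index", "c"],
    "c": ["a"],
}


def crawl_order(seed, page_limit):
    """Return pages in the order a breadth-first crawl would visit them."""
    queue = deque([seed])   # frontier: popleft() is O(1)
    seen = {seed}           # every page ever queued: membership test is O(1)
    order = []

    while queue and len(order) < page_limit:
        page = queue.popleft()
        order.append(page)
        for target in LINK_GRAPH.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


print(crawl_order("index", page_limit=20))
print(crawl_order("index", page_limit=2))
```

In a real crawler the `LINK_GRAPH` lookup would be replaced by downloading the page and extracting its links.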

### <span style="color:blue"> Exercise 2 </span>
- The code above can easily end up stuck in a **crawler trap** (where a crawler keeps following an effectively infinite number of irrelevant URLs).
- Explain three ways this could occur and suggest possible solutions.
- Read more here: https://en.wikipedia.org/wiki/Spider_trap

### <span style="color:blue"> Exercise 3 </span>

- Modify the code above to print the titles of as many books as can be found within the `page_limit`.
- Only a few additional lines are required where commented.

In [None]:
page_limit = 20

# Specify the initial page to crawl
base_url = 'http://books.toscrape.com/'
seed_item = 'index.html'

seed_url = base_url + seed_item
page = requests.get(seed_url)
soup = BeautifulSoup(page.text, 'html.parser')

# initialise the dictionary of visited pages
visited = {seed_url: True}
pages_visited = 1

#### initialise an empty set or list for book titles here ####

#############################################################

# Collect every link on the seed page except those pointing back to index.html
links = soup.find_all('a', href=True)
seed_link = soup.find_all('a', href=re.compile(r"^index\.html"))
to_visit_relative = [l for l in links if l not in seed_link]

# Resolve to absolute urls
to_visit = []
for link in to_visit_relative:
    to_visit.append(urljoin(seed_url, link['href']))
    
# Find all outbound links on successor pages and explore each one whilst under the visit limit
while (to_visit) and pages_visited < page_limit:
    # consume the list of urls
    link = to_visit.pop(0)

    # links were already resolved to absolute urls with urljoin, e.g. from <a href="catalogue/sharp-objects_997/index.html">
    page = requests.get(link)
    
    # scraping code goes here
    soup = BeautifulSoup(page.text, 'html.parser')
    
    # mark the page as visited (pop() above already removed it from to_visit)
    visited[link] = True
    # only consider anchors that actually have an href attribute
    new_links = soup.find_all('a', href=True)
    
    for new_link in new_links:
        #### if a new link has attribute 'title', then add new_link['title'] to the set ####

        
        ####################################################################################
        
        new_item = new_link['href']
        new_url = urljoin(link, new_item)
        if new_url not in visited and new_url not in to_visit:
            to_visit.append(new_url)
        
    pages_visited += 1

#### print out every title ####


###############################