# Scraping data from static websites

In this notebook, we will scrape content from static websites.

By default, `pd.read_html` tries to use the lxml package to parse the HTML. It can be installed with `conda install -c conda-forge lxml`.

### 1. Scraping tables with pandas

In [None]:
import pandas as pd

In [None]:
url_population = "https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population"

In [None]:
tables = pd.read_html(url_population)

In [None]:
len(tables)

In [None]:
# Target table (requires some searching)
df_population = tables[4]

In [None]:
df_population.head()

In [None]:
df_population.dtypes

In [None]:
# Cleaning
# Remove footnote markers such as "[c]" (raw string avoids invalid-escape warnings)
df_population["City"] = df_population["City"].str.replace(r"\[.*\]", "", regex=True)
df_population = df_population[["2021rank", "City", "2021estimate"]]
df_population = df_population.rename(columns={"2021rank": "Rank",
                                              "2021estimate": "Population"})

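The cleaning step relies on a regular expression to strip bracketed footnote markers. As a minimal sketch of how that pattern behaves, the example below applies it with the standard `re` module to a few made-up labels (the labels are assumptions for illustration, not values taken from the scraped table) and compares the greedy pattern `\[.*\]` with a lazy variant `\[.*?\]`.

```python
import re

# Hypothetical city labels in the style of Wikipedia footnote markers
labels = ["New York[c]", "Jacksonville[h]", "St. Louis[a] (city)[b]"]

# Greedy: removes everything from the first "[" to the last "]"
greedy = [re.sub(r"\[.*\]", "", s) for s in labels]

# Lazy: removes each bracketed marker separately
lazy = [re.sub(r"\[.*?\]", "", s) for s in labels]

print(greedy)
print(lazy)
```

The two patterns only differ when a cell contains more than one marker; in that case the greedy version also drops any text between the markers.
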
In [None]:
# Plot population against rank
df_population.plot(x="Rank", y="Population", kind="scatter",
                   figsize=(10,7), color="red", s=10);

In [None]:
# On a log-log scale, the points lie almost on a straight line
df_population.plot(x="Rank", y="Population", logx=True, logy=True, kind="scatter",
                   figsize=(10,7), color="red", s=10);

What we see is an example of [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law). These [power laws](https://en.wikipedia.org/wiki/Power_law) are surprisingly general. For example, have a look at how word frequency falls with rank for the first 10 million words in 30 Wikipedias: https://en.wikipedia.org/wiki/Zipf%27s_law#/media/File:Zipf_30wiki_en_labels.png
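
A power law of the form population = C / rank^a turns into a straight line with slope -a on a log-log plot, which is why the points line up. The sketch below illustrates this with synthetic data (an assumption for illustration, not the scraped table) and a hypothetical helper `loglog_slope` that fits a least-squares line to the logged values.

```python
import math


def loglog_slope(ranks, values):
    """Least-squares slope of log(values) regressed on log(ranks)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(v) for v in values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var


# Synthetic data following an exact Zipf law with exponent 1
ranks = list(range(1, 301))
populations = [8_000_000 / r for r in ranks]

slope = loglog_slope(ranks, populations)
print(slope)  # very close to -1 for this synthetic data
```

Applied to real data such as the `Rank` and `Population` columns, the fitted slope would only be approximately constant, since empirical rank-size distributions follow the law imperfectly.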

### 2. Scraping with requests and Beautiful Soup + CSS selectors

In [None]:
from bs4 import BeautifulSoup
import requests

First, we will scrape the same table again but now select it directly with its associated CSS selector.

In [None]:
page_population = requests.get(url_population)
soup = BeautifulSoup(page_population.content, 'html.parser')

In [None]:
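from io import StringIO

# Recent pandas versions expect a file-like object rather than a literal HTML string
tables_selected = pd.read_html(StringIO(str(soup.select("table.wikitable:nth-child(21)")[0])))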

In [None]:
# Only one table was collected this time
len(tables_selected)

In [None]:
df_population_alternative = tables_selected[0]
df_population_alternative.head()
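
A selector such as `table.wikitable:nth-child(21)` picks a table by its position among its sibling elements, so it can stop matching when the page layout changes. The sketch below shows how `:nth-child` counts on a small, made-up HTML snippet (the snippet is an assumption for illustration, not the Wikipedia page).

```python
from bs4 import BeautifulSoup

# Made-up HTML: the <div> has four element children (p, table, p, table)
html = """
<div id="content">
  <p>Intro</p>
  <table class="wikitable"><tr><td>first</td></tr></table>
  <p>More text</p>
  <table class="wikitable"><tr><td>second</td></tr></table>
</div>
"""
soup_demo = BeautifulSoup(html, "html.parser")


def select_text(selector):
    """Return the stripped text of every element matched by the selector."""
    return [el.get_text(strip=True) for el in soup_demo.select(selector)]


# :nth-child(n) counts all element siblings, not only tables
second_child = select_text("table.wikitable:nth-child(2)")
third_child = select_text("table.wikitable:nth-child(3)")   # third child is a <p>
fourth_child = select_text("table.wikitable:nth-child(4)")
all_tables = select_text("table.wikitable")

print(second_child, third_child, fourth_child, all_tables)
```

Because the index refers to the position among all siblings, inserting a single paragraph above the table would shift it. Selecting all `table.wikitable` elements and filtering on their content is one possible, more robust alternative.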

<br>

As an example of how we can scrape less structured data from a static website, let us collect the first news article on the English page of the [University of Muenster](https://www.uni-muenster.de/en/).

In [None]:
url_wwu = "https://www.uni-muenster.de/en/"
page = requests.get(url_wwu)
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
print(soup.select("article.module:nth-child(2)")[0].text)

### 3. Scraping with requests and Beautiful Soup + regular expressions

Sometimes we can also extract data directly from the HTML content with regular expressions, without any need for CSS selectors. As an example, let us try to obtain all course codes from the LSE's graduate course [catalogue](https://www.lse.ac.uk/resources/calendar/courseGuides/graduate.htm).

In [None]:
import re

In [None]:
url_lse = "https://www.lse.ac.uk/resources/calendar/courseGuides/graduate.htm"
page_lse = requests.get(url_lse)
soup_lse = BeautifulSoup(page_lse.content, 'html.parser')

In [None]:
# Two capital letters, one digit, then any further word characters
course_codes = re.findall(r"[A-Z]{2}\d\w*", soup_lse.text)

In [None]:
print(course_codes[:5])
print(len(course_codes))
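
The pattern matches two capital letters, a digit, and any trailing word characters, so it can also match inside longer tokens and return duplicates. The sketch below runs it on a made-up sample string (an assumption for illustration, not text from the LSE page) and compares it with a variant that adds a word boundary `\b` and removes duplicates.

```python
import re

# Made-up sample text containing a repeated code and an unrelated token
sample = "AC411 Accounting; EC400 Introductory Course; MY421M Methods; EC400 again; see ISO9001"

# Pattern from the notebook: can match in the middle of a longer token
loose = re.findall(r"[A-Z]{2}\d\w*", sample)

# Variant with a leading word boundary, so matches must start a token
strict = re.findall(r"\b[A-Z]{2}\d\w*", sample)

# Unique codes in sorted order
unique_codes = sorted(set(strict))

print(loose)
print(strict)
print(unique_codes)
```

Here the loose pattern picks up `SO9001` from inside `ISO9001`, while the word-boundary variant does not. Whether such false positives occur on the real page depends on its content.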