[![Binder](http://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/EconomicsObservatory/courses/HEAD?labpath=5%2Fs5_Scraping.ipynb)

<a href="https://colab.research.google.com/github/EconomicsObservatory/courses/blob/main/5/s5_Scraping.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# First we need to import the libraries we will be using
import pandas as pd
import requests
from bs4 import BeautifulSoup
from collections import Counter

</br>
</br>

# Scraping the HTML source (advanced)

Sometimes we want to access data that isn't as nicely formatted. For example:

- **Prices**: you might want data on a type of product or from a shop
- **Weather**: maybe you want to automate the collection of weather data from the Met office or weather.com
- **News and Media**: Scraping headlines and summaries can tell you about current affairs

</br>
</br>




In this example, we will scrape the Economics Observatory website to collect the latest article names and taglines to tell us about what is being reported.



However, first, let's go through some basics for three tools:

- `BeautifulSoup`, which lets us search through and analyse HTML from the internet
- `Pandas`, which provides table-like objects we can use for analysis
- `Counter`, which lets us count the occurrences of items in a list
 




# Using `BeautifulSoup`

`BeautifulSoup` is a Python library for parsing `HTML` documents. We can use it to extract data from HTML we fetch from the web.
Today, we've written and modified `HTML` of our own.

</br>
</br>

Let's try investigating our own websites with `BeautifulSoup`.


In [8]:
website_url = "http://mclass-user.github.io" # Replace this with your URL
req = requests.get(website_url) # download the webpage
soup = BeautifulSoup(req.content, "lxml") # parse the webpage into an object that lets us navigate the HTML

</br>
`soup` now contains a structured and interactive version of our website. We can take a look at it with `soup.prettify()`

In [9]:
soup.prettify()

'<!-- This HTML file provides a simple example of a portfolio page. It contains a header, a few sections, and a few charts. -->\n<!-- It works with the three css files in this folder: example1.css, example2.css and example3.css. -->\n<!-- The HTML file defines the structure of the page, and the CSS files define how it looks. -->\n<!-- The CSS examples are all designed to work with this page. Try swapping them out on line 29 and see what happens. -->\n<!-- Take a look around and make your own copy to adapt for your own use. -->\n<!-- This is a comment -->\n<!-- Comments are not displayed in the browser -->\n<!-- They can be used to leave notes in the code -->\n<!DOCTYPE html>\n<html>\n <!-- HTML is heirarchial: tags like this define elements on the page. \n    Each tag does something different. For example:\n        - The <html> tag defines the whole document.\n        - The <h1> tag defines a large heading\n        - The <p> tag defines a paragraph\n        - Tags appear as an opening 

</br>

and search for elements. For example, we can search for every section title by searching for every `<h2>`

In [10]:
section_titles = soup.find_all("h2")
for title in section_titles:
    print(title.text)

Section 1: A basic chart
Section 2: Another Chart


or look for an element with the class `big` - our main title

In [22]:
# find element with class="big"
soup.find(class_="big").text

' \n\n        An Example Portfolio\n    '
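
The searches above can also be tried without a network connection. The following is a minimal sketch, assuming a small hand-written page (`sample_html` is made up for illustration and is not the real portfolio page). It also uses `get_text(strip=True)`, which trims the surrounding whitespace seen in the output above.

```python
from bs4 import BeautifulSoup

# A small hand-written page (hypothetical), so the example runs without a network request
sample_html = """
<html>
  <body>
    <h1 class="big">
        An Example Portfolio
    </h1>
    <h2>Section 1: A basic chart</h2>
    <h2>Section 2: Another Chart</h2>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching element; get_text(strip=True) trims surrounding whitespace
section_titles = [h2.get_text(strip=True) for h2 in sample_soup.find_all("h2")]

# find returns only the first match (or None if nothing matches)
main_title = sample_soup.find(class_="big").get_text(strip=True)

print(section_titles)
print(main_title)
```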

</br>
</br>
</br>

# Using `Pandas`

The second tool we'll use today is `Pandas`, a Python library used to work with datasets. It provides access to dataframes - tables we analyse with code.

Python already has a few built-in data structures, for example lists and dictionaries:

In [28]:
london = {
    "name": "London",
    "population": 8308369,
    "area": 1572
} # This is an example of a dictionary

locations = [
    {
        "name": "London",
        "population": 8_982_000,
        "area": 606
    },
    {
        "name": "Newport",
        "population": 128_060,
        "area": 32.52
    },
    {
        "name": "Darlington",
        "population": 93_015,
        "area": 7.62
    },

]

We can turn a list of dictionaries like `locations` into a Pandas `DataFrame`:

In [29]:
df = pd.DataFrame(locations)
df

| | name | population | area |
|---|---|---|---|
| 0 | London | 8982000 | 606.0 |
| 1 | Newport | 128060 | 32.52 |
| 2 | Darlington | 93015 | 7.62 |


and manipulate in different ways.

For example, we can add a density column:

In [30]:
df['density'] = df['population'] / df['area']
df

| | name | population | area | density |
|---|---|---|---|---|
| 0 | London | 8982000 | 606.0 | 14821.782178 |
| 1 | Newport | 128060 | 32.52 | 3937.884379 |
| 2 | Darlington | 93015 | 7.62 | 12206.692913 |


or sort our dataframe

In [31]:
sorted_df = df.sort_values(by="density", ascending=False)
sorted_df

| | name | population | area | density |
|---|---|---|---|---|
| 0 | London | 8982000 | 606.0 | 14821.782178 |
| 2 | Darlington | 93015 | 7.62 | 12206.692913 |
| 1 | Newport | 128060 | 32.52 | 3937.884379 |
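
Another common manipulation is filtering rows. The sketch below rebuilds the same `locations` table and filters it with `DataFrame.query`. The threshold of 10000 and the `names_to_skip` list are arbitrary choices made for illustration.

```python
import pandas as pd

locations = [
    {"name": "London", "population": 8_982_000, "area": 606},
    {"name": "Newport", "population": 128_060, "area": 32.52},
    {"name": "Darlington", "population": 93_015, "area": 7.62},
]
df = pd.DataFrame(locations)
df["density"] = df["population"] / df["area"]

# Keep rows where density is above a threshold (the value 10000 is arbitrary)
dense_df = df.query("density > 10000")

# Refer to a Python variable inside the query string by prefixing it with @
names_to_skip = ["London"]
without_london = df.query("name not in @names_to_skip")

print(dense_df["name"].tolist())
print(without_london["name"].tolist())
```

The `@` syntax is worth noticing because it lets a query string use a list defined elsewhere in the notebook.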


</br>
</br>
</br>

</br>

# Using `Counter`

`Counter` is a class from the `collections` module, which is built in to Python. It takes a list (or list-like object) and returns how many times each item occurs.

For example, if we have a list of fruit:

In [32]:
fruit = ["apple", "banana", "cherry", "apple", "cherry"]

we can use `Counter` to count how many times each occurs:

In [33]:
fruit_counts = Counter(fruit)
fruit_counts

Counter({'apple': 2, 'banana': 1, 'cherry': 2})

and we can combine this with `Pandas` by making a `DataFrame` from the counter:

In [37]:
pd.DataFrame(fruit_counts.items(), columns=["fruit", "count"])

| | fruit | count |
|---|---|---|
| 0 | apple | 2 |
| 1 | banana | 1 |
| 2 | cherry | 2 |
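
A `Counter` also has a few helpers of its own. As a short sketch using the same fruit list, `most_common` returns the most frequent items, and looking up an item that never appeared gives 0 instead of an error.

```python
from collections import Counter

fruit = ["apple", "banana", "cherry", "apple", "cherry"]
fruit_counts = Counter(fruit)

# most_common(n) returns the n most frequent items as (item, count) pairs;
# items with equal counts keep the order in which they were first seen
top_two = fruit_counts.most_common(2)

# Looking up an item that never appeared returns 0 rather than raising an error
mango_count = fruit_counts["mango"]

print(top_two, mango_count)
```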


</br>
</br>
</br>

## Investigating the webpage

Before writing any code, let's take a look at the webpage.

<img
style="max-height: 250px;
    width: auto;" src="https://raw.githubusercontent.com/jhellingsdata/RADataHub/main/misc/Masterclass/section%205/images/eco_website.png"> </img>

We want to extract a list of article titles, such as "What do we know about labour market power in the UK?". To do this, we need to know where they appear in the HTML and how they are defined. By using inspect-element (right/ctrl click), we can see the HTML code that creates the titles.

<img
style="max-height: 250px;
    width: auto;" src="https://raw.githubusercontent.com/jhellingsdata/RADataHub/main/misc/Masterclass/section%205/images/inspect_element.png"> </img>

Here we can see that article titles have the class "home__blocks-item-title". We'll use this information to extract just the article titles.

We already imported the packages we need (`requests` and `BeautifulSoup`) at the top of the notebook, so we can start scraping.

</br>
</br>
</br>
</br>


## Scraping the page

First, we'll download the HTML which defines the page, using the requests module.

In [3]:
req = requests.get("https://www.economicsobservatory.com") # Make a request to the ECO home-page
page_html = req.text # store the HTML in page_html
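
Requests can fail or hang, so it may help to wrap the download in a small function. This is a sketch rather than part of the original notebook: `fetch_html` is a hypothetical helper name, and the 10 second timeout is an arbitrary choice.

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page and return its HTML, raising an error if the request fails."""
    # Give up if the server does not respond within `timeout` seconds
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of returning an error page
    response.raise_for_status()
    return response.text

# Example usage (needs an internet connection):
# page_html = fetch_html("https://www.economicsobservatory.com")
```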

Now we have the page's source stored in `page_html`. Next we're going to use the `BeautifulSoup` library to turn this text into a representation of the page we can interact with. We'll store this in a variable called `soup`.

In [4]:
soup = BeautifulSoup(page_html, 'html.parser') # Create a BeautifulSoup object to interact with the page's HTML

Now we'll look for article titles by searching for elements with the class "home__blocks-item-title" which we identified above.


In [5]:
article_title_elements = soup.find_all(class_="home__blocks-item-title") # Find all elements with the class "home__blocks-item-title"
article_title_elements

[<h3 class="home__blocks-item-title">What is the state of the UK economy in early 2024?</h3>,
 <h3 class="home__blocks-item-title">Ukraine: what’s the global economic impact of Russia’s invasion?</h3>,
 <h3 class="home__blocks-item-title">Could a new policy institution help solve the UK’s productivity problem?</h3>,
 <h3 class="home__blocks-item-title">What can the UK learn from the latest global data on pupil performance?</h3>,
 <h3 class="home__blocks-item-title">What would be the effects of abolishing or reforming inheritance tax?</h3>,
 <h3 class="home__blocks-item-title">Read the latest edition of our magazine here</h3>,
 <h3 class="home__blocks-item-title">How much revenue might be raised by a one-off wealth tax?</h3>,
 <h3 class="home__blocks-item-title">Update: How did personality affect mental health during the pandemic?</h3>,
 <h3 class="home__blocks-item-title" style="text-align: left;">Government Investment</h3>,
 <h3 class="home__blocks-item-title">How can we reduce gender

We now have a list of HTML tags containing titles but all we care about is the text within them.

In [7]:

article_titles = [element.text for element in article_title_elements] # Extract the text from each element
article_titles

['What is the state of the UK economy in early 2024?',
 'Ukraine: what’s the global economic impact of Russia’s invasion?',
 'Could a new policy institution help solve the UK’s productivity problem?',
 'What can the UK learn from the latest global data on pupil performance?',
 'What would be the effects of abolishing or reforming inheritance tax?',
 'Read the latest edition of our magazine here',
 'How much revenue might be raised by a one-off wealth tax?',
 'Update: How did personality affect mental health during the pandemic?',
 'Government Investment',
 'How can we reduce gender gaps in mathematics education?',
 'How have minorities been treated by the UK’s judicial system?',
 'How are plastics harming marine ecosystems?',
 'How might house prices affect workers’ productivity in OECD economies?',
 '#studentviews: How can young doctors be discouraged from leaving the NHS?']

We also care about the taglines/'teasers' of each article.

These are contained in `<span>` and `<p>` tags inside `<div>` elements whose class attribute is "home__blocks-item-teaser display". The code below takes the first `<p>` tag from each of those divs.

In [8]:
# find all divs whose class attribute is exactly "home__blocks-item-teaser display"
tagline_divs = soup.find_all(class_="home__blocks-item-teaser display")
# get the first <p> tag from each div (find returns None if a div has no <p>)
taglines = [div.find("p") for div in tagline_divs]
# extract the text from each tag, skipping any div without a <p>
tagline_texts = [tagline.text for tagline in taglines if tagline is not None]
tagline_texts

['The Office for National Statistics recently released fresh data on the UK’s labour market, inflation and GDP. While trends in prices and wages have improved conditions for many households, the economy has contracted since early 2022. Prospects for renewed growth are positive but weak.',
 'New data from the OECD reveal how teenagers around the world are doing at school. The results for the UK are above average – and there hasn’t been the same big drop in scores as in some other places. But there remain significant social concerns, with over one in ten children going hungry.',
 'Inheritance taxes aim to raise revenue and reduce inequality of opportunity arising from differences in parental wealth. But current UK policy has little impact on social mobility. Scrapping the tax would benefit the wealthiest, while reforms could reduce unfairness and negative economic effects.',
 'If the government decides to raise taxes after the pandemic, an alternative to taxes on work or spending could b
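
Searching for a class string that contains a space behaves in a particular way. The sketch below uses made-up markup (`sample_html` is hypothetical and only modelled on the structure described above) to compare an exact class string with a CSS selector, and to show why a missing `<p>` needs handling.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the structure described above
sample_html = """
<div class="home__blocks-item-teaser display"><p>First teaser.</p></div>
<div class="home__blocks-item-teaser display"><span>No paragraph here.</span></div>
<div class="display home__blocks-item-teaser"><p>Classes in another order.</p></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# A class_ string containing a space must match the class attribute exactly
exact = sample_soup.find_all(class_="home__blocks-item-teaser display")

# A CSS selector matches elements carrying both classes, in any order
either_order = sample_soup.select("div.home__blocks-item-teaser.display")

# find("p") returns None when a div has no <p>, so skip those
paragraphs = [div.find("p") for div in either_order]
teasers = [p.get_text(strip=True) for p in paragraphs if p is not None]

print(len(exact), len(either_order), teasers)
```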

</br>
</br>


Where do we go from here?
We now have a list of articles. How could this be useful?

- **Automated News Roundups**: you could write code to collect news titles each day to produce a daily roundup
- **Sentiment Analysis**: if you scale up the data collection, you could perform [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) to learn about the emotional valence of news stories.
</br>
</br>
</br>
</br>


### Making a Chart: Term Frequencies

Today, we can make a chart of term frequencies from the headlines. This will tell us about the topics covered by the website.

To do this, we will:

1. Make a long list of every word in the titles and taglines
2. Count how many times each word appears
3. Turn the counts into a table
4. Filter out common words (e.g. "the", "how", "should")
5. Save our data

</br>
</br>

#### 1: Make a long list of every word

In [9]:
word_list = [] # Create an empty list to store the words

article_titles_and_taglines = article_titles+tagline_texts # Combine the article titles and taglines into one list

for text in article_titles_and_taglines: # Loop through each article title and tagline
  words_in_text = text.split() # Split the text into words so we can loop through each word
  for word in words_in_text: # Loop through each word
    lowercase_word = word.lower() # Convert the word to lowercase
    word_excl_punc = lowercase_word.replace(".", "").replace(",", "").replace("?","") # Remove full stops, commas and question marks (other punctuation, such as ":", is kept)
    word_list.append(word_excl_punc) # Add the word to the word_list
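
The loop above removes only full stops, commas and question marks. A more thorough clean-up could strip any punctuation from the ends of each word. The sketch below is one possible approach, not the notebook's method: `clean_word` is a hypothetical helper, and the extra curly-quote characters are an assumption based on the headlines shown earlier.

```python
import string

# Characters to trim from the ends of each word: ASCII punctuation plus curly quotes
# (the choice of extra characters is an assumption based on the headlines shown earlier)
chars_to_strip = string.punctuation + "‘’“”"

def clean_word(word):
    """Lowercase a word and strip punctuation from its start and end."""
    return word.lower().strip(chars_to_strip)

sample_titles = [
    "Ukraine: what’s the global economic impact of Russia’s invasion?",
    "How are plastics harming marine ecosystems?",
]

cleaned_words = [clean_word(word) for title in sample_titles for word in title.split()]
# Drop anything that was only punctuation and is now empty
cleaned_words = [word for word in cleaned_words if word]

print(cleaned_words)
```

Because `strip` only touches the ends of a word, apostrophes inside words such as "what’s" are left alone.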



</br>
</br>

#### 2: Count how many times each word appears
For this we'll use Counters, which take a list and return an object showing how many times each item appears

In [10]:
word_counter = Counter(word_list)

word_counter

Counter({'what': 3,
         'is': 3,
         'the': 32,
         'state': 1,
         'of': 13,
         'uk': 4,
         'economy': 2,
         'in': 17,
         'early': 2,
         '2024': 1,
         'ukraine:': 1,
         'what’s': 1,
         'global': 2,
         'economic': 2,
         'impact': 3,
         'russia’s': 1,
         'invasion': 1,
         'could': 5,
         'a': 8,
         'new': 3,
         'policy': 3,
         'institution': 1,
         'help': 4,
         'solve': 1,
         'uk’s': 3,
         'productivity': 4,
         'problem': 1,
         'can': 6,
         'learn': 1,
         'from': 6,
         'latest': 2,
         'data': 3,
         'on': 5,
         'pupil': 1,
         'performance': 1,
         'would': 2,
         'be': 6,
         'effects': 2,
         'abolishing': 1,
         'or': 2,
         'reforming': 1,
         'inheritance': 2,
         'tax': 5,
         'read': 1,
         'edition': 1,
         'our': 2,
         'maga

</br>
</br>


#### 3: Making a table

We now have a counter showing word frequencies. Using **Pandas**, we can turn this into a more useful table.

In [11]:
df = pd.DataFrame(word_counter.items(), columns=["word", "count"])

sorted_df = df.sort_values(by="count", ascending=False)

sorted_df

| | word | count |
|---|---|---|
| 2 | the | 32 |
| 7 | in | 17 |
| 157 | to | 14 |
| 104 | and | 13 |
| 4 | of | 13 |
| ... | ... | ... |
| 151 | ten | 1 |
| 152 | children | 1 |
| 153 | going | 1 |
| 154 | hungry | 1 |


</br>
</br>


#### 4: Filter out Common words

The most frequent words in our `word_counter` are everyday words: "the", "to", "a", etc.

We should filter these out so we can learn more about the articles.

Thankfully, lots of lists of common words exist already. We'll download a common word list and only keep the words that do not appear in it.

In [12]:
common_words = requests.get("https://raw.githubusercontent.com/6/stopwords-json/master/dist/en.json").json()

df_excl_common = sorted_df.query("word not in @common_words")

df_excl_common

| | word | count |
|---|---|---|
| 42 | tax | 5 |
| 5 | uk | 4 |
| 67 | reduce | 4 |
| 25 | productivity | 4 |
| 119 | growth | 3 |
| ... | ... | ... |
| 147 | concerns | 1 |
| 151 | ten | 1 |
| 152 | children | 1 |
| 154 | hungry | 1 |
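
The same filter can be written with a boolean mask instead of `query`. This is a minimal sketch that runs offline: the five-word `common_words` list is a hand-picked assumption (the downloaded list is much longer), and the counts are copied from the tables above.

```python
import pandas as pd

# A tiny hand-picked stop-word list (an assumption; the downloaded list is much longer)
common_words = ["the", "in", "to", "and", "of"]

word_counts = pd.DataFrame(
    {"word": ["the", "tax", "in", "uk", "of"], "count": [32, 5, 17, 4, 13]}
)

# isin builds a True/False mask; ~ negates it, keeping rows whose word is NOT in common_words
filtered = word_counts[~word_counts["word"].isin(common_words)]

print(filtered)
```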


</br>
</br>


#### 5: Saving Our Data

We now have a table of how often words appear in Economics Observatory headlines and taglines.
Let's keep the words that appear more than twice and save the table. The homepage changes over time, so the counts you see depend on when you run the notebook.

In [33]:
more_than_twice_df = df_excl_common.query("count > 2") # keep words that appear more than twice
more_than_twice_df.to_csv("eco_words.csv", index=False) # save the table without the row index

more_than_twice_df

| | word | count |
|---|---|---|
| 9 | productivity | 6 |
| 99 | growth | 4 |
| 30 | tax | 4 |
| 58 | gender | 3 |
| 13 | uk | 3 |
| 32 | work | 3 |
| 89 | past | 3 |
| 54 | labour | 3 |
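
The `index=False` argument matters when the file is read back later. The sketch below shows the difference using in-memory buffers instead of real files, so nothing is written to disk. The two-row `words` table is a small stand-in for the real data.

```python
import io
import pandas as pd

# A small stand-in for the real table
words = pd.DataFrame({"word": ["productivity", "growth"], "count": [6, 4]})

# Write to in-memory buffers instead of files so the example leaves nothing on disk
with_index = io.StringIO()
words.to_csv(with_index)                   # the row index is written as an unnamed first column
without_index = io.StringIO()
words.to_csv(without_index, index=False)   # only the "word" and "count" columns are written

# Rewind each buffer before reading it back
with_index.seek(0)
without_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```

Reading the file saved with the index produces an extra column called `Unnamed: 0`, which is usually not wanted.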
