# Week 4: Web Scraping

This week, we will learn to automatically extract information from a website. To do so, we need to understand the following:

- the HTML language, the building block of the Web;
- the CSS language, describing how HTML elements are positioned, how text is displayed, etc.;
- how HTML attributes and ‘CSS selectors’ connect CSS styles to HTML elements;
- how to inspect a website in your browser and work out how to select the information to scrape.

Once we have this basic understanding, we will use a Python library to automate information extraction from websites.

## HTML

HyperText Markup Language (HTML) is the language that web browsers interpret to display a web page. Just as Python code is interpreted by the Python interpreter, HTML code is interpreted by web browsers (e.g., Chrome, Firefox, Safari).

A simple example of HTML code looks as follows:

```
<html>
    <div>
        <h1>Hello World!</h1>
        <p>This is my first HTML code</p>
    </div>
</html>
```

<div class="alert-info alert">

**Exercise 0.1:** Go to https://htmledit.squarefree.com/ and insert the HTML code in the top input box. Notice how tags such as `<p>` or `</p>` disappear, so that only the text between these tags is rendered.

</div>


### HTML Elements

**Notice** how the tags appear in pairs as `<p>` and `</p>` or `<div>` and `</div>`. A start tag (e.g., `<p>`), text, and the end tag (e.g., `</p>`) together form an **HTML element**. A web browser displays the text contained in these elements according to the element's function. Each of these tags has a specific purpose. Here are some examples:

<hr />
<br />

| tag (name)              | Function | 
|:------------------------|:------------|
| `p` (paragraph)         | Display the text as is |                    
| `h1` (headline-1)       | Display the text as the main headline with bold and larger font |
| `h2` (headline-2)       | Display the text as a sub headline with bold and slightly smaller font than `h1`|
| `div` (division)        | Define a section in the HTML document |
| `html` (root)           | Define the start and end of the HTML code | 

<br/>
<hr/>

Finally, these **tags can be nested**. In our example, a div-tag contains an h1-tag and a p-tag.
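To make the nesting concrete, here is a minimal sketch that uses Python's built-in `html.parser` module to print each start tag indented by its depth. The class name `NestingPrinter` is our own, and the sketch assumes every start tag has a matching end tag, as in our example.

```python
from html.parser import HTMLParser

HTML_DOC = """
<html>
    <div>
        <h1>Hello World!</h1>
        <p>This is my first HTML code</p>
    </div>
</html>
"""


class NestingPrinter(HTMLParser):
    """Record each start tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1  # everything until the end tag is one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1


parser = NestingPrinter()
parser.feed(HTML_DOC)

# Print the outline: html at depth 0, div at depth 1, h1 and p at depth 2.
for depth, tag in parser.outline:
    print("    " * depth + tag)
```

The `h1` and `p` elements appear at the same depth because they are siblings inside the `div`.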

<div class="alert-warning alert">

A **full list of all tags** is available here: https://www.w3schools.com/tags/default.asp
</div>

## CSS 

Cascading Style Sheets (CSS) tell the web browser how to style the content of an element. CSS is the language that usually accompanies HTML code. The styling may include font size, color, background color, and many other properties that control how the element is displayed.

One can think of HTML tags like `<h1>` as having some default CSS. However, when we want to do more than what is available in HTML tags, we need to use CSS.

```html
<html>
    <head>
        <style>
        p {
            background-color: lightblue;
            font-size: 20px
        }

        h1 {
            text-align: center
        }
        </style>
    </head>

    <body>
        <div>
            <h1>Hello World!</h1>
            <p>This is my first HTML code</p>
        </div>
    </body>
</html>
```



Since it is beyond the capacity of a notebook to render the above CSS in these blocks, we will use an online HTML hosting platform (JSFiddle) to test the previous example.

Click on https://jsfiddle.net/pguptacs/vg8wuzL5/ to see how the above HTML code is displayed. If HTML doesn't display at the bottom right corner, then click on the `Run` button at the top-left corner of the window. 

**Note** how specifications like `text-align` or `background-color` stylize their respective HTML elements.

**Note** that we have partitioned the HTML document into two sections using the head-tag and the body-tag. The head-tag contains all the metadata, while the body tag contains everything that needs to be displayed on a web page. 

Putting the CSS in a style-tag nested within the head-tag keeps the styling cleanly separated from the HTML code. However, HTML is flexible enough to let us define styles in the individual elements as well. For example, study the code below and note how we moved the styles into the elements themselves.

```
<html>
  <div>
    <h1 style="text-align: center;">Hello World!</h1>
    <p style="background-color:lightblue; font-size:20px">This is my first HTML code</p>
  </div>
</html>
```

<div class="alert-info alert">

**Exercise 0.2:** Go to https://htmledit.squarefree.com/ and insert the HTML code in the top input box. Notice how the individual elements are styled in the same way as in the preceding HTML code with a separate style-tag.

</div>

You can also view the above code snippet in JSFiddle here: https://jsfiddle.net/pguptacs/5s2av39u/

### Advanced CSS styling using HTML attributes

We saw above that each start tag can have attributes like `style="text-align: center"` to define the style of that element. HTML code can quickly become too messy to read if we define styles in each of the elements. To organize the CSS styling, consider the following two situations:

- **(A)** There are multiple elements that all need to be styled in the same way: HTML provides the `class` attribute to group such elements, and they are styled together using the CSS selector `.`
- **(B)** There is an element that needs a unique CSS style: HTML provides the `id` attribute to give that element a unique ID, and the element can be styled separately using the CSS selector `#`.

**Note:** We have introduced **HTML attributes** and their corresponding **CSS selectors**. This distinction will become clearer with the following example, which assigns the attributes `class` and `id` to the p-tags and defines their styles separately in the style sheet using the corresponding selectors `.` and `#`.

```
<html>
    <head>
        <style>
        .done {
            background-color: lightblue;
            font-size: 20px
        }

        #todo {
            background-color: lightgreen;
            text-align: center
        }
        </style>
    </head>

    <body>
        <div>
            <h1>Hello World!</h1>
            <p class="done">I know how HTML works</p>
            <p class="done">I know how CSS works</p>
            <p id="todo">I am learning about HTML attributes and CSS selectors</p>
        </div>
    </body>
</html>
```

To display the above code, head to https://jsfiddle.net/pguptacs/o9vfryg8/ and notice the difference.

<div class="alert-info alert">

**Example 0.3:** Think of what will happen when we have a class and an id defined on the same element. For example, how will the following HTML code be rendered?

</div>

```
<html>
    <head>
        <style>
        .done {
            background-color: lightblue;
            font-size: 20px
        }

        #todo {
            background-color: lightgreen;
            text-align: center
        }
        </style>
    </head>

    <body>
        <div>
            <h1>Hello World!</h1>
            <p class="done">I know how HTML works</p>
            <p class="done">I know how CSS works</p>
            <p class="done" id="todo">I am learning about HTML attributes and CSS selectors</p>
        </div>
    </body>
</html>
```

Head over to https://jsfiddle.net/pguptacs/utbxoe48/ and note the following:

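- Where the two rules conflict (here, `background-color`), the style defined for the `id` overrides the style defined for the `class`.
- The final style is the union of the individual styles: the paragraph keeps `font-size` from `.done` and gains `text-align` from `#todo`.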
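The two observations above can be mimicked with plain Python dictionaries. This is only a simplified mental model (browsers resolve conflicts with a more general specificity calculation), and the variable names are our own.

```python
# Declarations copied from the style sheet in the example above.
class_done = {"background-color": "lightblue", "font-size": "20px"}
id_todo = {"background-color": "lightgreen", "text-align": "center"}

# When merging dictionaries, later entries win on conflict. Listing the id
# rule last mimics an id selector taking priority over a class selector.
final_style = {**class_done, **id_todo}

for prop, value in final_style.items():
    print(f"{prop}: {value}")
```

The merged dictionary has three properties: `background-color` comes from the `id` rule, while `font-size` and `text-align` are carried over unchanged.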

### Combining multiple HTML attributes

Finally, an element can carry several classes (separated by spaces) together with an id. For example, look at the code below and think about how it will be rendered.

```
<html>
    <head>
        <style>
        .done {
            background-color: lightblue;
            font-size: 20px
        }

        #todo {
            background-color: lightgreen;
            text-align: center
        }
        
        .weight {
          font-weight: bold;
        }
         </style>
    </head>

    <body>
        <div>
            <h1>Hello World!</h1>
            <p class="done weight">I know how HTML works</p>
            <p class="done weight">I know how CSS works</p>
            <p class="done " id="todo">I am learning about HTML attributes and CSS selectors</p>
        </div>
    </body>
</html>
```

Now head over to https://jsfiddle.net/pguptacs/03bu2tLs/ and check for yourself how it is rendered.

<div class="alert-warning alert">
    <b>Fun fact:</b> the colored boxes you've been seeing in this course also use a bit of HTML! Double-click to inspect the code.
</div>

## Inspecting a web page

To extract information from a website automatically, we need to find out how to tell a computer what patterns to look for. To do so, we will make use of the HTML code itself: its elements and their CSS classes and ids.

Let's take the following website as an example: https://www.oii.ox.ac.uk/people/faculty/.

- **Step 1:** Where do we find the HTML code of that page?
    - Open the above URL in a separate window.
    - Two-finger click on Mac (or right-click on Windows) anywhere on the web page, and select the last option `Inspect` from the dropdown menu.
    - This will open up a **console** in the web browser with options far beyond what we need in this course.
    - Focus your attention on HTML code under the **Inspector** (Firefox) or **Elements** (Chrome) tab, which will also be the default view.
    - For now, just know that this is how we can read the HTML code of any web page.


<img src="https://mdn.mozillademos.org/files/16371/landingPage_PageInspector.png" />

- **Step 2:** We want to **extract the name of all the faculty members** at the OII. We can visually locate them within the blue blocks on the web page, but what does it mean in terms of the HTML code?
    - How can we locate the HTML tags that correspond to the names of the faculty members?
    - Reading all the HTML code is a tedious task. Therefore, move the cursor over the faculty name (e.g., hover over where it's written “Professor Victoria Nash”). Then right-click and repeat the above process of inspecting the source code.
    - You will notice that the console now highlights the HTML element that contains this hovered text. Specifically, it will highlight the following element: `<h4>Professor Victoria Nash</h4>`.

- **Step 3:** We are also interested in extracting other information (e.g., **bio and the department position**) from these blue blocks.
    - How do we locate the HTML element corresponding to the bigger blue box with name, photo, and bio?
    - As a reminder, HTML code is written as nested elements. Once you have found the name, look at its parent element, and repeat until you find an element that contains the name, photo, and bio as its descendants.
    - In the console, hover the mouse over a few different elements, and notice how the corresponding area is highlighted on the web page. This happens because the web browser highlights whichever element is currently being hovered over. This is very helpful!
    - Now notice that we are able to locate the bigger blue box by hovering over the element: `<article class="box  people-box light-background box-has-button third-at-full  has_url people  ">`

- **Step 4:** Great! We have understood how the web page designer structured the HTML code such that the faculty information can be easily located under the article-tag with classes such as `box`, `people-box`, and so on. Now, perform a sanity check by hovering over similar tags and verifying that they correspond to the boxes of other faculty members.




## Automating Web Scraping using Python

We have just seen the manual process of locating information on a web page. We will now automate that process using Python. Clearly, we need a tool to download a page and select HTML elements that match certain attributes. In our running example on extracting information about OII faculty members, we need to select `article` elements with the classes `box`, `people-box`, etc. How can we do so using Python?


We need two libraries:
- `requests` to download the HTML code of a web page.
- `BeautifulSoup4` to parse the HTML code and search for information within it.

In [1]:
%%capture
!pip install requests
!pip install beautifulsoup4

In [78]:
import pandas as pd
import requests
from tqdm import tqdm  # fancy library to print the progress bar while iterating through a list

<div class="alert alert-info">

**Example 0.4:** To extract the HTML code of a URL, we will use the `requests.get` method.

Extract the HTML code of the website: https://www.oii.ox.ac.uk/people/faculty/ and print the HTML code. 

</div>

In [117]:
from bs4 import BeautifulSoup

site = requests.get('https://www.oii.ox.ac.uk/people/faculty/')
soup = BeautifulSoup(site.text, 'html.parser')

print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en-GB">
 <head>
  <title>
   OII | People
  </title>
  <meta content="A list of the faculty at the OII." name="description"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="https://www.oii.ox.ac.uk/wp-content/themes/OII2022/assets/logos/favicon.ico?v=2" rel="shortcut icon" type="image/icon"/>
  <link href="https://www.oii.ox.ac.uk/wp-content/themes/OII2022/assets/logos/favicon.ico?v=2" rel="icon" type="image/icon"/>
  <link href="https://fonts.googleapis.com/css?family=Roboto:400,500,700,300|B612+Mono|Lato:400,400i,700,700i,900,900i&amp;display=swap" rel="stylesheet"/>
  <!-- TWITTER -->
  <meta content="summary_large_image" name="twitter:card"/>
  <meta content="@oiioxford" name="twitter:site"/>
  <meta content="OII | People" name="twitter:title"/>
  <meta content="A list of the faculty at the OII." name="twitter:description"/>
  <meta content="https://www.oii.ox.ac.uk/wp-content/themes/OII2022/assets/l

<div class="alert-info alert">

**Example 0.5:** Now, we are going to make use of the full potential of Beautiful Soup.

We are going to find and extract all the elements corresponding to the tag `<article class="box people-box light-background box-has-button third-at-full has_url people">`.

</div>

<div class="alert-warning alert">

**Reminder:** Refer back to the definition of CSS selectors:
- classes are prefaced by a `.` (e.g., `.box`, `.people-box`)
- unique id selectors by `#` (e.g., `#about`)
- elements do not have any prefix (e.g., `div`, `h1`, `h2`)

To match an HTML element, we can combine all these selectors in a nested chain: `div.people-box h2` would match any `h2` element that is inside a `div` element with the `people-box` class.

You can do more complex patterns: `div.box.people-box.has_url` matches any `div` element that has the three classes `box`, `people-box`, and `has_url`.

Be careful that:

- The sequence of selectors matters because HTML elements are nested.
- Spaces matter too: `.box p` matches a paragraph inside a box, `p.box` matches a paragraph with the `box` class, and `p .box` matches a box inside a paragraph.

</div>
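The three patterns from the reminder can be tried on a small, hand-written HTML snippet. This is a sketch: the snippet, its `id` values, and the helper `ids` are made up for illustration and do not come from the OII page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: each element has an id so we can see what matched.
SAMPLE = """
<div class="box">
    <p id="in-box">paragraph inside a box</p>
</div>
<p class="box" id="is-box">paragraph that is itself a box</p>
<p id="outer"><span class="box" id="nested">box inside a paragraph</span></p>
"""

sample_soup = BeautifulSoup(SAMPLE, "html.parser")


def ids(selector):
    """Return the id of every element matched by a CSS selector."""
    return [el["id"] for el in sample_soup.select(selector)]


print(ids(".box p"))   # a <p> inside an element with class box
print(ids("p.box"))    # a <p> that carries the class box itself
print(ids("p .box"))   # an element with class box inside a <p>
```

Each selector matches exactly one element here (`in-box`, `is-box`, and `nested` respectively), which shows that a single space changes the meaning of the selector.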

In [97]:
faculty_members = soup.select('article.box.people-box')


In [123]:
print(f'Num of members: {len(faculty_members)}')

print(faculty_members[0].prettify())

Num of members: 28
<article class="box people-box light-background box-has-button third-at-full has_url people">
 <a href="https://www.oii.ox.ac.uk/people/profiles/victoria-nash/">
  <img alt="Victoria Nash" src="https://www.oii.ox.ac.uk/wp-content/uploads/2021/07/Victoria-Nash-170x170.jpg"/>
  <div class="box-text-container">
   <h4>
    Professor  Victoria Nash
   </h4>
   <p>
    <i>
     Director, Associate Professor, Senior Policy Fellow
    </i>
   </p>
   <p>
    Victoria Nash is the OII's Director and a Senior Policy Fellow. Her research focuses on the opportunities and risks experienced by children using digital technologies; she also leads OII engagement on Internet regulation and digital policy issues.
   </p>
  </div>
  <div class="button-container">
   <div class="pseudo-button">
    View profile
   </div>
  </div>
 </a>
</article>



In [120]:
faculty_members = soup.select('.box.people-box')
len(faculty_members)

28

<div class="alert-warning alert">

**Note** how both the selectors `article.box.people-box` and `.box.people-box` return the required elements. It is entirely possible to have several ways to identify the same elements.

</div>

<div class="alert-info alert">

**Example 0.6:** Let's investigate the `faculty_members` elements returned above, and look at their nested elements and how their content is displayed on the web page.

- Select the first faculty member in the list (`faculty_members[0]`)
- Go back to the web page and inspect its child elements under the `article` element. Try to understand which child element contains the name, position in the department, bio, and the link to their home page. 


Finally, go back to Python and use the `find` and `find_all` methods of the first faculty member to extract the name, position, bio, and link.

</div>

In [99]:
first_person = faculty_members[0]

The following is a rough sketch of the article element:

```
<article>
    <a href="URL TO THE HOMEPAGE">
        <img src="URL TO THE IMAGE">
        <div>
            <h4>NAME</h4>
            <p>POSITION IN THE DEPARTMENT</p>
            <p>BIO</p>
        </div>
    </a>
</article>
```


Therefore, we will find the name in the first h4-tag, the department position in the first p-tag, the bio in the second p-tag, and the URL to the home page in the first a-tag.

In [115]:
name = first_person.find('h4').text
pos = first_person.find_all('p')[0].text
bio = first_person.find_all('p')[1].text
homepage_link = first_person.find('a')['href']

print("Name:", name)
print("Position:", pos)
print("Bio:", bio)
print("URL:", homepage_link)

Name: Professor  Victoria Nash
Position: Director, Associate Professor, Senior Policy Fellow
Bio: Victoria Nash is the OII's Director and a Senior Policy Fellow. Her research focuses on the opportunities and risks experienced by children using digital technologies; she also leads OII engagement on Internet regulation and digital policy issues.
URL: https://www.oii.ox.ac.uk/people/profiles/victoria-nash/
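Notice the double space in the printed name. Text extracted from HTML often carries extra whitespace like this. A minimal sketch of one way to normalize it with plain string methods follows; `raw_name` is an illustrative string, not the exact value returned by the page.

```python
# Illustrative input with leading/trailing whitespace and a double space.
raw_name = "\n    Professor  Victoria Nash\n   "

# str.split() with no argument splits on any run of whitespace and drops
# empty strings, so joining with one space collapses all the gaps.
clean_name = " ".join(raw_name.split())

print(repr(clean_name))
```

The same one-liner can be applied to the position and bio before storing them.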


<div class="alert-info alert">

**Example 0.7:** Print the name, position, and URL of all the faculty members at OII.

</div>

In [152]:
def print_faculty_info(element):
    name = element.find('h4').text
    pos = element.find_all('p')[0].text
    bio = element.find_all('p')[1].text
    url = element.find('a')['href']

    print(name, " | ",  pos, " | ", url)
    return name, pos, bio, url

for e in faculty_members:
    print_faculty_info(e)

Professor  Victoria Nash  |  Director, Associate Professor, Senior Policy Fellow  |  https://www.oii.ox.ac.uk/people/profiles/victoria-nash/
Dr  Grant Blank  |  Departmental Lecturer  |  https://www.oii.ox.ac.uk/people/profiles/grant-blank/
Dr  Fabian Braesemann  |  Departmental Research Lecturer  |  https://www.oii.ox.ac.uk/people/profiles/fabian-braesemann/
Dr  Kathryn Eccles  |  Senior Research Fellow  |  https://www.oii.ox.ac.uk/people/profiles/kathryn-eccles/
Professor  Rebecca Eynon  |  Professor of Education, the Internet and Society  |  https://www.oii.ox.ac.uk/people/profiles/rebecca-eynon/
Professor  Luciano Floridi  |  Professor of Philosophy and Ethics of Information  |  https://www.oii.ox.ac.uk/people/profiles/luciano-floridi/
Professor  Mark Graham  |  Professor of Internet Geography  |  https://www.oii.ox.ac.uk/people/profiles/mark-graham/
Dr  Scott A. Hale  |  Associate Professor, Senior Research Fellow  |  https://www.oii.ox.ac.uk/people/profiles/scott-hale/
Professor 

## Exercise 1: Web page crawler

<div class="alert-info alert">

**Exercise 1.1:** Investigate the home page of Professor Victoria Nash (or any other faculty member): https://www.oii.ox.ac.uk/people/profiles/victoria-nash/ and find the HTML elements

- that contain the contents of **About**;

- that contain the contents of **Current Courses**.


</div>



In [129]:
r = requests.get('https://www.oii.ox.ac.uk/people/profiles/victoria-nash/')
soup = BeautifulSoup(r.text,'html.parser')

In [147]:
about_section = soup.find(id='about')
print(about_section.text)


About
Victoria Nash is the Director, an Associate Professor, and Senior Policy Fellow at the Oxford Internet Institute (OII). In the latter role, she is responsible for connecting OII research with policy and practice. Her research interests draw on her background as a political theorist, and concern the normative policy implications of evidence characterising children’s use of Internet technologies. Recent projects have included an analysis of age verification policies as a tool for balancing the interests of children and adults online, and a review of the risks and harms faced by children online. She is currently concluding a funded research project examining the concept of the ‘algorithmic child’ and the data risks posed to children by connected toys and the Internet of Things. She holds several digital policy advisory roles, including membership of the UK Government’s multi-stakeholder UK Council on Internet Safety (UKCIS) Evidence Group, and serves on the Advisory Board of COADEC

In [146]:
list_of_classes = [c.text for c in soup.find(id='current-courses').find_all('h4')]
list_of_classes

['Digital Era Government and Politics']

<div class="alert-info alert">

**Example 1.2:** Make a function that takes a home page URL (string) as an argument and returns the content of the **About** section as well as the list of courses being taught by that faculty member in the current term.

- Call this function on https://www.oii.ox.ac.uk/people/profiles/victoria-nash/ and verify that the output is correct
- Call this function on https://www.oii.ox.ac.uk/people/profiles/luc-rocher/ and verify that the output is correct

</div>

In [161]:
def crawl_homepage(URL):
    r = requests.get(URL)
    soup = BeautifulSoup(r.text, 'html.parser')
    
    about_section = soup.find(id='about').text

    classes = soup.find(id='current-courses')
    if classes:
        list_of_classes = [c.text for c in classes.find_all('h4')]
    else:
        list_of_classes = []
    
    return about_section, list_of_classes

In [150]:
about_section, list_of_classes = crawl_homepage('https://www.oii.ox.ac.uk/people/profiles/victoria-nash/')
print(about_section)
print(list_of_classes)


About
Victoria Nash is the Director, an Associate Professor, and Senior Policy Fellow at the Oxford Internet Institute (OII). In the latter role, she is responsible for connecting OII research with policy and practice. Her research interests draw on her background as a political theorist, and concern the normative policy implications of evidence characterising children’s use of Internet technologies. Recent projects have included an analysis of age verification policies as a tool for balancing the interests of children and adults online, and a review of the risks and harms faced by children online. She is currently concluding a funded research project examining the concept of the ‘algorithmic child’ and the data risks posed to children by connected toys and the Internet of Things. She holds several digital policy advisory roles, including membership of the UK Government’s multi-stakeholder UK Council on Internet Safety (UKCIS) Evidence Group, and serves on the Advisory Board of COADEC

In [162]:
about_section, list_of_classes = crawl_homepage('https://www.oii.ox.ac.uk/people/profiles/luc-rocher/')
print(about_section)
print(list_of_classes)


About
Luc Rocher is the Director of the DPhil Programme in Social Data Science and is a lecturer at the Oxford Internet Institute, a junior research fellow at Kellogg College, and a fellow of Imperial College London’s Data Science Institute.
Their research investigates the harms posed by large-scale collections of digital human traces—from social media traces to biometrics—and deployed artificial intelligence technologies, identifying gaps in how technology is regulated and how risks are documented, and proposing better models for academic research using sensitive human data.
Luc specialises in computational modelling approaches to study emerging concerns in algorithmic societies, such as the future of privacy and digital rights as well as the governance of algorithms in digital platforms. Their research develops statistical models to make sense of these complex systems, adversarial machine learning approaches to highlight weaknesses of deployed technologies, and interactive tools for
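Note that `crawl_homepage` handles a missing `current-courses` section, but `soup.find(id='about').text` would raise an `AttributeError` on a page without an `about` element, because `find` returns `None` when nothing matches. Below is a hedged sketch of a more defensive variant. The function name `parse_homepage` and the two HTML snippets are hypothetical; the function takes HTML text rather than a URL so that it can be tried without downloading anything.

```python
from bs4 import BeautifulSoup


def parse_homepage(html):
    """Extract the About text and course titles from profile-page HTML.

    Missing sections yield an empty string or an empty list instead of
    raising AttributeError on None.
    """
    page = BeautifulSoup(html, "html.parser")

    about = page.find(id="about")
    about_text = about.get_text(" ", strip=True) if about else ""

    courses = page.find(id="current-courses")
    if courses:
        course_titles = [h.get_text(strip=True) for h in courses.find_all("h4")]
    else:
        course_titles = []

    return about_text, course_titles


# Hypothetical, hand-written snippets standing in for downloaded pages.
FULL_PAGE = """
<section id="about"><h2>About</h2><p>Researcher in web data.</p></section>
<section id="current-courses"><h4>Wrangling Data</h4></section>
"""
EMPTY_PAGE = "<html><body><p>Nothing here</p></body></html>"

print(parse_homepage(FULL_PAGE))
print(parse_homepage(EMPTY_PAGE))
```

To use it on a real page, one could pass `requests.get(URL).text` as the argument.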

<div class="alert-info alert">

**Example 1.3:** Make a dataframe containing the following information about the OII faculty members: `name` , `position` in the department, `shortbio` as specified on  https://www.oii.ox.ac.uk/people/faculty/ , `url` of their homepage, `about` contents on their homepage, and `courses` that they are teaching currently. 

</div>

In [158]:
page_to_crawl = "https://www.oii.ox.ac.uk/people/faculty/"
r = requests.get(page_to_crawl)
soup = BeautifulSoup(r.text, 'html.parser')

people = []

for people_box in soup.select('.box.people-box'):
    
    name, pos, bio, url = print_faculty_info(people_box)
    
    about, courses = crawl_homepage(url)
    
    people.append(dict(name=name, pos=pos, shortbio=bio, url=url, about=about, courses=courses))

people_df = pd.DataFrame(people)

Professor  Victoria Nash  |  Director, Associate Professor, Senior Policy Fellow  |  https://www.oii.ox.ac.uk/people/profiles/victoria-nash/
Dr  Grant Blank  |  Departmental Lecturer  |  https://www.oii.ox.ac.uk/people/profiles/grant-blank/
Dr  Fabian Braesemann  |  Departmental Research Lecturer  |  https://www.oii.ox.ac.uk/people/profiles/fabian-braesemann/
Dr  Kathryn Eccles  |  Senior Research Fellow  |  https://www.oii.ox.ac.uk/people/profiles/kathryn-eccles/
Professor  Rebecca Eynon  |  Professor of Education, the Internet and Society  |  https://www.oii.ox.ac.uk/people/profiles/rebecca-eynon/
Professor  Luciano Floridi  |  Professor of Philosophy and Ethics of Information  |  https://www.oii.ox.ac.uk/people/profiles/luciano-floridi/
Professor  Mark Graham  |  Professor of Internet Geography  |  https://www.oii.ox.ac.uk/people/profiles/mark-graham/
Dr  Scott A. Hale  |  Associate Professor, Senior Research Fellow  |  https://www.oii.ox.ac.uk/people/profiles/scott-hale/
Professor 

In [159]:
people_df

|   | name | pos | shortbio | url | about | courses |
|:--|:-----|:----|:---------|:----|:------|:--------|
| 0 | Professor Victoria Nash | Director, Associate Professor, Senior Policy F... | Victoria Nash is the OII's Director and a Seni... | https://www.oii.ox.ac.uk/people/profiles/victo... | \nAbout\nVictoria Nash is the Director, an Ass... | [\n, \n\nCurrent Courses\n\nDigital Era Govern... |
| 1 | Dr Grant Blank | Departmental Lecturer | Grant Blank's work focuses on the social and c... | https://www.oii.ox.ac.uk/people/profiles/grant... | \nAbout\nGrant Blank is a Departmental Lecture... | [\n, \n\nCurrent Courses\n\nQualitative Data A... |
| 2 | Dr Fabian Braesemann | Departmental Research Lecturer | Dr Fabian Braesemann is a Departmental Researc... | https://www.oii.ox.ac.uk/people/profiles/fabia... | \nAbout\nDr Fabian Braesemann is a Departmenta... | [\n, \n\nCurrent Courses\n\nSocial Network Ana... |
| 3 | Dr Kathryn Eccles | Senior Research Fellow | Kathryn Eccles has research interests in the i... | https://www.oii.ox.ac.uk/people/profiles/kathr... | \nAbout\nKathryn is a Senior Research Fellow a... | [\n, \n\nCurrent Courses\n\nCultural Analytics... |
| 4 | Professor Rebecca Eynon | Professor of Education, the Internet and Society | Rebecca Eynon's research focuses on learning a... | https://www.oii.ox.ac.uk/people/profiles/rebec... | \nAbout\nRebecca Eynon holds a joint academic ... | [\n, \n\nCurrent Courses\n\nEducation, the Int... |
| 5 | Professor Luciano Floridi | Professor of Philosophy and Ethics of Information | Luciano Floridi‘s research areas are the philo... | https://www.oii.ox.ac.uk/people/profiles/lucia... | \nAbout\nHe is the OII’s Professor of Philosop... | [\n, \n\nCurrent Courses\n\nThe Philosophy and... |
| 6 | Professor Mark Graham | Professor of Internet Geography | Mark Graham is an economic geographer. His res... | https://www.oii.ox.ac.uk/people/profiles/mark-... | \nAbout\nI am the Professor of Internet Geogra... | [\n, \n\nCurrent Courses\n\nDigital Capitalism... |
| 7 | Dr Scott A. Hale | Associate Professor, Senior Research Fellow | Dr Scott A. Hale is an Associate Professor, Se... | https://www.oii.ox.ac.uk/people/profiles/scott... | \nAbout\nDr Scott A. Hale is an Associate Prof... | [\n, \n\nCurrent Courses\n\nData Analytics at ... |
| 8 | Professor Ekaterina Hertog | Associate Professor in AI and Society | Ekaterina Hertog is an Associate Professor of ... | https://www.oii.ox.ac.uk/people/profiles/ekate... | \nAbout\nEkaterina’s research interests lie at... | [\n, \n\nCurrent Courses\n\nDigital Interviewi... |
| 9 | Dr Bernie Hogan | Senior Research Fellow | Bernie Hogan examines how to capture, represen... | https://www.oii.ox.ac.uk/people/profiles/berni... | \nAbout\nBernie Hogan (PhD Toronto, 2009) is a... | [\n, \n\nCurrent Courses\n\nWrangling Data\nTh... |


## Exercise 2: Understanding the limits of data scraping


<div class="alert-info alert">

- Go to https://www.instagram.com/oxford_uni/
- Inspect the elements. What selectors would you use to extract the description, the number of followers, and the URL of each post?
- Now use `requests.get` to download the HTML code of the above link.
- Extract all the links from this HTML. How many links did it return?
- Print the HTML code returned above and check the `head` of that HTML. What does the title say? What is different this time compared to the previous exercise?

</div>

In [167]:
r = requests.get('https://www.instagram.com/oxford_uni/')
soup = BeautifulSoup(r.text, 'html.parser')

In [168]:
[a.get('href') for a in soup.find_all('a')]

[]

What happened? Let's look at the source code:

In [172]:
print(soup.prettify())

<!DOCTYPE html>
<html class="_9dls" dir="ltr" lang="en" style="background-color: rgb(var(--ig-secondary-background))">
 <head>
  <link data-default-icon="https://static.cdninstagram.com/rsrc.php/v3/yb/r/lswP1OF1o6P.png" href="https://static.cdninstagram.com/rsrc.php/v3/yb/r/lswP1OF1o6P.png" rel="icon" sizes="192x192"/>
  <meta content="noarchive, noimageindex" name="robots"/>
  <meta charset="utf-8"/>
  <meta content="default" name="apple-mobile-web-app-status-bar-style"/>
  <meta content="yes" name="mobile-web-app-capable"/>
  <meta content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=2, viewport-fit=cover" id="viewport" name="viewport"/>
  <meta content="#ffffff" name="theme-color"/>
  <link href="https://static.cdninstagram.com/rsrc.php/v3/yR/r/lam-fZmwmvn.png" rel="apple-touch-icon" sizes="76x76"/>
  <link href="https://static.cdninstagram.com/rsrc.php/v3/ys/r/aM-g435MtEX.png" rel="apple-touch-icon" sizes="120x120"/>
  <link href="https://static.cdninstagram

<div class = "alert-info alert">

The HTML we received consists almost entirely of JavaScript, and the links we are looking for were not returned at all. The content of the page is assembled on the fly in the browser, which makes it more difficult to scrape. Instagram (and many other websites) will also block you from seeing most things unless you log in. However, if you log in and try to scrape the page, you may be flagged as a bot and banned.
    
</div>
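One rough way to spot such a page before writing any selectors is to count its tags: many `script` tags and no `a` tags suggest that the content is built by JavaScript. The sketch below uses Python's built-in `html.parser`. The names `TagCounter` and `count_tags`, and the miniature page `JS_PAGE`, are hypothetical and only stand in for the real response.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each start tag occurs in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag names to their number of occurrences."""
    counter = TagCounter()
    counter.feed(html)
    return counter.counts


# Hypothetical miniature of a JavaScript-driven page: scripts, but no links.
JS_PAGE = """
<html><head>
<script src="bundle.js"></script>
<script>window.render();</script>
</head><body><div id="root"></div></body></html>
"""

counts = count_tags(JS_PAGE)
print("script tags:", counts["script"])
print("a tags:", counts["a"])
```

On a real response, `count_tags(r.text)` would give the same kind of summary. This is only a heuristic, not a reliable test.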

## Exercise 3: Let's scrape El País

Let's now see if we can use our new scraping skills on a different website, the Spanish newspaper El País. We would like to collect all news articles published on a given date, starting with January 1st, 1990.

Head to https://elpais.com/hemeroteca/1990-01-01/ and inspect the web page. Look at the elements and the way the CSS classes are defined. Scraping is doable, but the class names carry little meaning: the classes `c` and `c--m-n` appear to encode an article block. (This is likely an automatic optimization to reduce the size of the web page and save bandwidth.)

In [174]:
# Let's start by crawling the list of articles from 1990-01-01:

r = requests.get("https://elpais.com/hemeroteca/1990-01-01/")
soup = BeautifulSoup(r.text, 'html.parser')

<div class="alert alert-info">

**Exercise 3.1** Our first task is to extract the links to the pages containing articles published that date.

We have two solutions:
1. We extract all the links in the webpage into a list, then filter that list to keep only the URLs that point to a news article.
2. We find the HTML elements whose children are the links to news articles, then extract their links.

Run the commands below and compare them. Which one seems easier for you?

</div>

In [176]:
# Solution 1

import re

# We create a list of all the links on the webpage we crawled:
all_links = [a.get('href') for a in soup.find_all('a')]
print(f"We collected {len(all_links)} URLs")
print(all_links)

# We filter that using a regular expression (https://automatetheboringstuff.com/2e/chapter7/):
# (`link and ...` skips <a> tags without an href, for which `a.get('href')` returns None)
links_to_articles = [link for link in all_links if link and re.match(r"/diario/1990.*\.html", link)]
# Our new list includes all the links to ‘diario’ pages:
print(f"We filtered down to {len(links_to_articles)} article URLs")
print(links_to_articles)

We collected 170 URLs
['https://elpais.com', 'https://elpais.com/america/', 'https://elpais.com/mexico/', 'https://elpais.com/america-colombia/', 'https://elpais.com/chile/', 'https://elpais.com/argentina/', 'https://english.elpais.com', 'https://elpais.com', '/hemeroteca/1990-01-01/', 'https://elpais.com/suscripciones/#/campaign#?prod=SUSDIG&o=boton_cab&prm=suscrip_cabecera_el-pais&backURL=https%3A%2F%2Felpais.com%2Fhemeroteca%2F1990-01-01%2F', 'https://elpais.com/subscriptions/#/sign-in?prod=REG&o=CABEP&prm=login_cabecera_el-pais&backURL=https%3A%2F%2Felpais.com%2Fhemeroteca%2F1990-01-01%2F', '/hemeroteca/1990-01-01/', '/diario/1990/01/02/economia/631234806_850215.html', 'https://elpais.com/autor/felix-monteira/', '/hemeroteca/1990-01-01/', '/diario/1990/01/02/espana/631234805_850215.html', 'https://elpais.com/autor/agencia-efe/', '/hemeroteca/1990-01-01/', '/diario/1990/01/02/opinion/631234801_850215.html', 'https://elpais.com/autor/cartas-director/', '/hemeroteca/1990-01-01/', '/di
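A detail worth keeping in mind with Solution 1: `a.get('href')` returns `None` for an `<a>` tag that has no `href` attribute, and passing `None` to `re.match` raises a `TypeError`. The sketch below shows one way to filter defensively. The `sample_links` list is a small hand-made sample based on the output above, with a `None` entry added as an assumption for illustration.

```python
import re

# Hand-made sample: absolute URLs, root-relative paths, and a None
# standing in for an <a> tag without an href attribute.
sample_links = [
    "https://elpais.com",
    "/hemeroteca/1990-01-01/",
    "/diario/1990/01/02/economia/631234806_850215.html",
    None,
    "https://elpais.com/autor/felix-monteira/",
    "/diario/1990/01/02/espana/631234805_850215.html",
]

# The dot before "html" is escaped so that it only matches a literal dot.
article_pattern = re.compile(r"/diario/1990/.*\.html")

# `link and ...` skips None (and empty strings) before the regex is applied;
# fullmatch requires the whole string to fit the pattern.
article_links = [
    link for link in sample_links
    if link and article_pattern.fullmatch(link)
]

print(article_links)
```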

In [186]:
# Solution 2: after inspection, we will collect all the headers-2 (h2) elements with the class 'c_t'

article_headers = soup.select('h2.c_t')

links_to_articles = [a.find('a').get('href') for a in article_headers]

print(f"We collected {len(links_to_articles)} article URLs")
print(links_to_articles)

We collected 27 article URLs
['/diario/1990/01/02/economia/631234806_850215.html', '/diario/1990/01/02/espana/631234805_850215.html', '/diario/1990/01/02/opinion/631234801_850215.html', '/diario/1990/01/02/espana/631234807_850215.html', '/diario/1990/01/02/deportes/631234808_850215.html', '/diario/1990/01/02/deportes/631234805_850215.html', '/diario/1990/01/02/espana/631234804_850215.html', '/diario/1990/01/02/espana/631234814_850215.html', '/diario/1990/01/02/cultura/631234801_850215.html', '/diario/1990/01/02/madrid/631283056_850215.html', '/diario/1990/01/02/espana/631234806_850215.html', '/diario/1990/01/02/radiotv/631234805_850215.html', '/diario/1990/01/02/internacional/631234806_850215.html', '/diario/1990/01/02/espana/631234809_850215.html', '/diario/1990/01/02/radiotv/631234806_850215.html', '/diario/1990/01/02/internacional/631234807_850215.html', '/diario/1990/01/02/cultura/631234808_850215.html', '/diario/1990/01/02/espana/631234802_850215.html', '/diario/1990/01/02/deporte

<div class="alert-warning alert">

**Note:** All of the links have the form `/diario/year/month/date/blah/blah`. If you copy any of those links and paste it into your browser, it will not work. The full link is actually `https://elpais.com/diario/year/month/date/blah/blah`. The former is referred to as a **relative path** and the latter as an **absolute path**. Which one a page uses is up to the website's developers. In our case, we have the relative paths, so we need to convert them to absolute paths before we can crawl the articles.


</div>
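Besides gluing strings together by hand, Python's standard library offers `urllib.parse.urljoin` for this conversion. The sketch below is one possible approach, using the archive page as the base URL and one of the article paths collected above.

```python
from urllib.parse import urljoin

# The page we scraped the links from acts as the base URL.
base_url = "https://elpais.com/hemeroteca/1990-01-01/"
relative_path = "/diario/1990/01/02/economia/631234806_850215.html"

# A path starting with "/" replaces the whole path of the base URL.
absolute_url = urljoin(base_url, relative_path)
print(absolute_url)

# A link that is already absolute is returned unchanged.
already_absolute = urljoin(base_url, "https://elpais.com/autor/felix-monteira/")
print(already_absolute)
```

One advantage of `urljoin` over string concatenation is that the same call works whether a link is relative or already absolute, which is handy when a page mixes both kinds.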

<div class="alert-info alert">

**Exercise 3.2:** Let's now extract information from the first article.
- Select the first article link.
- Convert the link to an absolute link and call the crawler on that link.

</div>

In [187]:
article_one = links_to_articles[0]
print(article_one)

/diario/1990/01/02/economia/631234806_850215.html


In [189]:
r = requests.get(f'https://elpais.com{article_one}')
soup = BeautifulSoup(r.text, 'html.parser')
print(r.url)

https://elpais.com/diario/1990/01/02/economia/631234806_850215.html


<div class="alert alert-info">

**Exercise 3.3** Let's extract the title, author, and full text of the article.

Open the page of the above url in your browser and inspect it. Find which HTML element and CSS selector to use in order to match title, author, and full text.

For the full text, think about how you would match all the paragraphs.
</div>

In [190]:
title = soup.find('h1').text
print(title)

11 países de la CE aceptan levantar el secreto bancario con la oposición de Luxemburgo


In [194]:
author = soup.select('.a_md_a')[0].text
print(author)

Felix Monteira


In [203]:
soup.select('.a_c.clearfix')[0].find_all('p')

[<p class="">La presidencia francesa de la CE no ha logrado un acuerdo definitivo sobre la fiscalidad del ahorro, que condiciona la libre circulación de capitales que entrará en vigor en julio de 1990. Pero 11 países de la CE han aprobado la cooperación entre las administraciones de Hacienda, lo cual implica levantar el secreto bancario en caso de presunto fraude fiscal. Luxemburgo se opone para no perder las ventajas de su paraíso bancario.</p>,
 <p class="">Los servicios jurídicos de la CE estudian, a petición de España, la posibilidad de establecer mecanismos jurídicos que permitan establecer la obligación de cooperar entre las Administraciones de Hacienda. El objetivo, según el secretario de Estado de Hacienda, José Borrell, "es dar información a un país, aunque ese Estado no la pueda obtener para sus fines propios". Ello equivale a levantar el secreto bancario a demanda de otro Estado.La base jurídica ha pasado a ser un tema determinante para poner en vigor una norma, puesto que l

In [208]:
# The full text is divided into paragraphs. Let's first match all the paragraphs:

# How would you match all the paragraphs?
all_the_paragraphs = soup.select('.a_c.clearfix')[0].find_all('p')
print(f"We collected {len(all_the_paragraphs)} paragraphs.\n")

# Then, we extract the text of each paragraph and put them all into a list
paragraph_texts = [p.text for p in all_the_paragraphs]

# Finally, we concatenate the paragraph texts into a full_text object
full_text = '\n\n'.join(paragraph_texts)
print(full_text)

We collected 7 paragraphs.

La presidencia francesa de la CE no ha logrado un acuerdo definitivo sobre la fiscalidad del ahorro, que condiciona la libre circulación de capitales que entrará en vigor en julio de 1990. Pero 11 países de la CE han aprobado la cooperación entre las administraciones de Hacienda, lo cual implica levantar el secreto bancario en caso de presunto fraude fiscal. Luxemburgo se opone para no perder las ventajas de su paraíso bancario.

Los servicios jurídicos de la CE estudian, a petición de España, la posibilidad de establecer mecanismos jurídicos que permitan establecer la obligación de cooperar entre las Administraciones de Hacienda. El objetivo, según el secretario de Estado de Hacienda, José Borrell, "es dar información a un país, aunque ese Estado no la pueda obtener para sus fines propios". Ello equivale a levantar el secreto bancario a demanda de otro Estado.La base jurídica ha pasado a ser un tema determinante para poner en vigor una norma, puesto que la 

<div class="alert alert-info">

**Exercise 3.4** Fill in the function `process_article` below. It takes an article path (e.g., `/diario/1990/01/02/madrid/631283054_850215.html`), crawls it, and returns a dictionary with the title, author, and full text.
</div>

In [216]:
def process_article(article_path):
    
    r = requests.get(f'https://elpais.com{article_path}')
    soup = BeautifulSoup(r.text, 'html.parser')

    title = soup.find('h1').text
    author = soup.select('.a_md_a')[0].text
    
    paragraphs = [p.text for p in soup.select('.a_c.clearfix')[0].find_all('p')]
    full_text = '\n'.join(paragraphs)
    
    return dict(title=title, author=author, full_text=full_text)

In [212]:
process_article(article_one)

11 países de la CE aceptan levantar el secreto bancario con la oposición de Luxemburgo | Felix Monteira


{'title': '11 países de la CE aceptan levantar el secreto bancario con la oposición de Luxemburgo',
 'author': 'Felix Monteira',
 'full_text': 'La presidencia francesa de la CE no ha logrado un acuerdo definitivo sobre la fiscalidad del ahorro, que condiciona la libre circulación de capitales que entrará en vigor en julio de 1990. Pero 11 países de la CE han aprobado la cooperación entre las administraciones de Hacienda, lo cual implica levantar el secreto bancario en caso de presunto fraude fiscal. Luxemburgo se opone para no perder las ventajas de su paraíso bancario.\nLos servicios jurídicos de la CE estudian, a petición de España, la posibilidad de establecer mecanismos jurídicos que permitan establecer la obligación de cooperar entre las Administraciones de Hacienda. El objetivo, según el secretario de Estado de Hacienda, José Borrell, "es dar información a un país, aunque ese Estado no la pueda obtener para sus fines propios". Ello equivale a levantar el secreto bancario a demand

<div class="alert-info alert">

**Exercise 3.5:** Loop over all the article links published on January 1st, 1990 and scrape their contents. You can use a list comprehension to loop over all the articles.

- Make a DataFrame of the data scraped above.
- Add a column `date` to your DataFrame that contains the date January 1st, 1990 in pandas DateTime format.

</div>

In [217]:
content_from_1990_01_01 = [process_article(link) for link in tqdm(links_to_articles)]
df_1990_01_01 = pd.DataFrame(content_from_1990_01_01)

100%|██████████████████████████████████████████████████████████████████████████████████| 27/27 [00:31<00:00,  1.15s/it]


In [218]:
# Let's convert the date into a meaningful Python format:
df_1990_01_01['date'] = pd.to_datetime('1990-01-01').date()

In [220]:
df_1990_01_01

|   | title | author | full_text | date |
|:--|:------|:-------|:----------|:-----|
| 0 | 11 países de la CE aceptan levantar el secreto... | Felix Monteira | La presidencia francesa de la CE no ha logrado... | 1990-01-01 |
| 1 | Contusionados cuatro policías al reducir a los... | EFE | Cuatro agentes uniformados del Cuerpo Nacional... | 1990-01-01 |
| 2 | Chiringuitos 'progres' | Cartas al Director | Suelo estar bastante de acuerdo con las opinio... | 1990-01-01 |
| 3 | Israel participará en la Expo 92 de Sevilla | EFE | El Gobierno de Isaac Shamir ha aprobado la par... | 1990-01-01 |
| 4 | Barrios se exhibió ante González en la San Sil... | El País | El mexicano Arturo Barrios plusmarquista mundi... | 1990-01-01 |
| 5 | El viento variable obliga al 'Fortuna' a bajar... | EFE | La intensidad variable de los vientos en las ú... | 1990-01-01 |
| 6 | Dos presos de los GRAPO abandonan la huelga de... | EFE | Los dos miembros de los GRAPO que fueron inter... | 1990-01-01 |
| 7 | Un matrimonio y sus dos hijas, hallados muerto... | EFE | Un matrimonio y sus dos hijas fueron encontrad... | 1990-01-01 |
| 8 | Historia de Italia | Casimiro Torreiro | Italia, 1945. Tres jóvenes partisanos celebran... | 1990-01-01 |
| 9 | El Ayuntamiento de Robledo de Chavela emprende... | Luis Esteban | El Ayuntamiento de Robledo de Chavela, que enc... | 1990-01-01 |


<div class="alert-info alert">

**Exercise 3.5:** How would we collect **one year of data**? Complete the function below.

The code snippet below simplifies our manual process. Notice how simply iterating over all the potential dates (from `'1990-01-01'` to `'1990-12-31'`) would allow us to crawl the entire 1990 archive, all in no more than a few dozen lines of Python.

</div>
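To make the idea of iterating over dates concrete, here is a minimal sketch that only builds the archive URLs for 1990 and sends no requests. The helper name `iter_dates` is an illustrative choice; the URL pattern is the `hemeroteca` address used earlier in this notebook.

```python
from datetime import date, timedelta


def iter_dates(start, end):
    """Yield every date from start to end, both endpoints included."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


# date.isoformat() returns the yyyy-mm-dd string the archive URLs use.
archive_urls = [
    f"https://elpais.com/hemeroteca/{d.isoformat()}/"
    for d in iter_dates(date(1990, 1, 1), date(1990, 12, 31))
]

print(len(archive_urls))
print(archive_urls[0])
print(archive_urls[-1])
```

Building the list of URLs first, before crawling anything, makes it easy to check how many pages a crawl would touch.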

 

In [224]:
def article_urls_for_one_date(str_date, str_year):
    """
    Return links to all articles of El País for a given date.

    Arguments:
        - str_date (str): a string formatted as yyyy-mm-dd
        - str_year (str): the year as yyyy; currently unused by this function
    """

    r = requests.get(f"https://elpais.com/hemeroteca/{str_date}/")
    
    soup = BeautifulSoup(r.text, 'html.parser')

    article_headers = soup.select('h2.c_t')
    links_to_articles = [a.find('a').get('href') for a in article_headers]    
   
    return links_to_articles

In [225]:
articles_to_crawl = article_urls_for_one_date('1990-01-01', '1990')
df = pd.DataFrame([process_article(link) for link in tqdm(articles_to_crawl)])
df['date'] = pd.to_datetime('1990-01-01').date()

100%|██████████████████████████████████████████████████████████████████████████████████| 27/27 [00:41<00:00,  1.55s/it]


In [226]:
df

|   | title | author | full_text | date |
|:--|:------|:-------|:----------|:-----|
| 0 | El sentimiento panameño de la existencia | Manuel Vazquez Montalban | "Cautivo y desarmado el ejército rojo, los últ... | 1990-01-01 |
| 1 | El ministro de Exteriores israelí visitará Esp... | Ignacio Cembrero | I. C., El ministro de Asuntos Exteriores de Is... | 1990-01-01 |
| 2 | Sentirse periodista | Hermann Tertsch | El diario Adevarul (La Verdad), antiguo Seinte... | 1990-01-01 |
| 3 | Un robot para matar en un producto sin valor | Juan Arribas | Revivir a las grandes estrellas de la pantalla... | 1990-01-01 |
| 4 | Dos petroleros accidentados vierten más de 95.... | El País | La zona del Atlántico comprendida entre la isl... | 1990-01-01 |
| 5 | 1989, año de la libertad |  | Al principio era, o parecía ser, un año como o... | 1990-01-01 |
| 6 | ¡Al ladrón! | Antonio Caño | Un país ocupado, como Panamá, es un país donde... | 1990-01-01 |
| 7 | El Liverpool, líder. | Reuters | La Liga inglesa continúa encabezada por el Liv... | 1990-01-01 |
| 8 | Un capitán griego con tripulantes filipinos y ... | Antonio García-Baquero González | El buque petrolero de bandera iraní, Khark 5, ... | 1990-01-01 |
| 9 | José Bono: "No gastaré en la Expo lo que me cu... | Miguel González | Castilla-La Mancha fue la única comunidad autó... | 1990-01-01 |


<div class="alert-info alert">

**Exercise 3.6:** We now demonstrate how to use your function from Exercise 3.5 to iterate over a few days of information. You do not need to change any of the code for this one. I have added comments to make it easier to follow, in case you do something like this for your summative or thesis.


</div> 

<div class="alert-danger alert">
DO NOT scrape an entire year of data from El País. It will take you a very long time for no added benefit. If you were running a real research project that needed this data, you may choose to scrape longer time periods, but for learning purposes you should not needlessly consume additional resources.
</div>
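A back-of-the-envelope calculation shows why. The sketch below assumes that every day of 1990 has 27 articles and that each article takes about 1.55 seconds to fetch, as in the progress bars in this notebook. Both numbers are rough assumptions, and the estimate ignores the 365 archive pages themselves.

```python
# Rough assumptions taken from the progress bars in this notebook.
articles_per_day = 27
seconds_per_article = 1.55
days_in_1990 = 365

total_articles = articles_per_day * days_in_1990
total_seconds = total_articles * seconds_per_article
total_hours = total_seconds / 3600

print(f"{total_articles} requests, roughly {total_hours:.1f} hours")
```

Under these assumptions, a full year means close to ten thousand requests to the newspaper's server and several hours of crawling.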

In [227]:
from datetime import date, timedelta
import re

all_articles_df = []

# Code adapted from Stack Overflow: https://stackoverflow.com/a/1060352/3413239

start_date = date(1990, 1, 1) # using date objects lets Python check that we have real dates
end_date = date(1990, 1, 4) # the loop below includes both endpoints, so we collect four days of data

delta = timedelta(days=1) # this is a special type of object which is one day worth of time

# now we repeat until we've gone from start to end
while start_date <= end_date:
    
    # these functions just convert a date to a formatted string
    str_date = start_date.strftime("%Y-%m-%d")
    str_year = start_date.strftime("%Y")

    print(str_date)

    # this is your code from above
    articles_to_crawl = article_urls_for_one_date(str_date, str_year)
    df = pd.DataFrame([process_article(link) for link in tqdm(articles_to_crawl)])
    df['date'] = pd.to_datetime(str_date).date()

    # collect all dataframe in a list
    all_articles_df.append(df)
    
    # In many programming languages, x += y means x = x + y, increment x by the value of y
    # This adds one day to the date we just used
    start_date += delta

# concatenation of all dataframes
all_articles_df = pd.concat(all_articles_df)

# let's look
print("Number of articles collected: ", all_articles_df.shape[0])



1990-01-01


100%|██████████████████████████████████████████████████████████████████████████████████| 27/27 [00:04<00:00,  5.70it/s]


1990-01-02


100%|██████████████████████████████████████████████████████████████████████████████████| 27/27 [00:41<00:00,  1.55s/it]


1990-01-03


100%|██████████████████████████████████████████████████████████████████████████████████| 27/27 [00:39<00:00,  1.45s/it]


1990-01-04


100%|██████████████████████████████████████████████████████████████████████████████████| 27/27 [00:48<00:00,  1.81s/it]

Number of articles collected:  108





In [228]:
all_articles_df.sample(10)

|   | title | author | full_text | date |
|:--|:------|:-------|:----------|:-----|
| 1 | La Comunidad destina 300 millones para reparar... | Andres Manzano | La Consejería de Agricultura y Cooperación des... | 1990-01-04 |
| 13 | Banesto negocia con Aker la entrada de Valenci... | Carlos Schvartz | El Banco Español de Crédito (Banesto) dijo aye... | 1990-01-04 |
| 16 | Lawrence Alloway, historiador de arte | El País | Lawrence Alloway, conservador de museo, histor... | 1990-01-04 |
| 9 | José Bono: "No gastaré en la Expo lo que me cu... | Miguel González | Castilla-La Mancha fue la única comunidad autó... | 1990-01-01 |
| 11 | Yehudi Menuhin inicia en Sevilla y Córdoba una... | Margot Molina | El violinista estadounidense Yehudi Menuhin, d... | 1990-01-04 |
| 14 | El arreglo con UGT pasa por el "cambio generac... | Miguel González | José Bono, secretario general de los socialist... | 1990-01-01 |
| 18 | Gibraltar y los aduaneros. | EFE | El fiscal general de Gibraltar, Kenneth Harris... | 1990-01-03 |
| 2 | Linda Bray | Carlos Mendo | Por primera vez en la historia del Ejército no... | 1990-01-04 |
| 14 | Una nueva teoría sostiene que los continentes ... | Walter Sullivan | La formación de los continentes ha sido mucho ... | 1990-01-04 |
| 13 | Una persona | EFE | resultó muerta ayer y tres heridas en Nagorno ... | 1990-01-02 |


## Homework: Formative Assignment

<div class="alert-info alert">
   
<b>Formative</b>
    
As Luc discussed in the first class, this week you will do a short formative assignment to help you prepare for the summative. By Monday of Week 5 (2/13), you need to submit a title and a short outline of 1-3 paragraphs for your summative. We will give you feedback on the idea you submit. You may submit more than one idea if you are concerned. I will create an assignment on Canvas where you can upload your submissions. 

</div>
<div class="alert-info alert">
    
<b>Summative:</b>

<b>Write a report of at most 5,000 words that applies ‘social data’ collection skills to a social science research question:</b>
- Short context and intro to question ⇒ small amount of literature
- Describe data collection approach and method. Discuss limitations and potential extensions
- Produce descriptive statistics from the data
- Explain how you would start answering your question
- Code attached as an appendix

<b>Essays that score the highest marks will typically contain at least some of the following:</b>
- A novel source of data, or a combination of multiple types of data
- Going beyond the code snippets presented in class
- A reproducible codebase and a documented corpus of collected data
- A clearly stated RQ which connects well to the data and good operationalisation
- A strong impact statement discussing the ethics of your data collection
    
</div>

## Homework: this week's datasheet questions

This is our final time building on the practice of documenting our datasets, using the Datasheet for Datasets framework (here is an <a href="https://github.com/zykls/folktables/blob/main/datasheet.md">example of a datasheet</a>). In the notebook for this week, you designed a small dataset of page content scraped from El País. Let's assume you plan to use this dataset in your research project.

How would you structure your Datasheet for this small dataset? For this week's homework, please answer the following questions:

> **Was any preprocessing/cleaning/labeling of the data done** (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section.
>
>...

>**What tasks could the dataset be used for?**
>
>...

>**Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?** For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?
>
>...

>**Are there tasks for which the dataset should not be used?** If so, please provide a description.
>
>...

