<font style='font-size:1.5em'>**👨🏻‍🏫 Week 05 lab – Web Scraping practice** </font>

<font style='font-size:1.2em'>DS105W – Data for Data Science</font>

**AUTHORS:**  Dr. [Jon Cardoso-Silva](https://jonjoncardoso.github.io)

**DEPARTMENT:** [LSE Data Science Institute](https://lse.ac.uk/dsi)

**OBJECTIVE**: Continue from where we left off in W04 lecture + a few more things




---

# Part 0: Export your chat logs (~ 3 min)

As part of the <span style="font-weight:bold"> ![](/figures/icons/GENIAL_favicon.png){width=1em} GEN<font color='#D55816'>IA</font>L</span> project, we ask that you fill out the following form as soon as you come to the lab:

🎯 **ACTION POINTS**

1. 🔗 [**CLICK HERE**](https://forms.office.com/e/689MersZzV) to export your chat log.

    Thanks for being GENIAL! You are now one step closer to earning some prizes! 🎟️





# Part I: ⚙️ The setup

You will need the `requests` and `scrapy` packages to complete this lab. I will assume you have already configured the virtual environment for this course.



Open the terminal (directly from within VS Code will be easier) and run the following command:


```bash
pip install pandas requests scrapy
```


```python
import requests               # This is how we access the web
import pandas as pd           # This is how we work with data frames

from pprint import pprint     # Print things in a pretty way
from scrapy import Selector   # This is how we parse HTML
```

# Part I: An important recap (20 min)

**🧑‍🏫 TEACHING MOMENT:** Your class teacher will recap the essentials of HTML as well as `requests` and `scrapy` with you.


<details><summary style="display: list-item;cursor: pointer;"><h2 style="display:inline;border-bottom: 1px solid #dee2e6;padding-bottom: .5rem;opacity: .9;margin-left:1rem;font-weight: 300;font-size:1.5rem;font-family: 'News Cycle','Arial Narrow Bold',sans-serif;line-height: 1.1;vertical-align:middle;">1. HTML in brief</h2></summary>

- You learned that **HTML files are structured into [tags](https://www.w3schools.com/TAGs/) (or elements)**. Each tag carries a specific meaning, allowing browsers to display the information accurately. 

    - For example, a `<p></p>` tag tells the browser, 'this is a paragraph',
    
    - whereas a `<div></div>` tag tells the browser, 'this is a box of elements'.

- HTML tags can have **attributes**.

    - For example, whenever we add a link (`<a>`), we need to specify the location where this link is pointing to (`href`):

        ```html
        <a href="https://lse-dsi.github.io/DS105">DS105 main page</a>
        ```

</details>

<details><summary style="display: list-item;cursor: pointer;"><h2 style="display:inline;border-bottom: 1px solid #dee2e6;padding-bottom: .5rem;opacity: .9;margin-left:1rem;font-weight: 300;font-size:1.5rem;font-family: 'News Cycle','Arial Narrow Bold',sans-serif;line-height: 1.1;vertical-align:middle;">2. Styling (CSS) in brief</h2></summary>

You also learned that one can apply **styles** to a tag using a language called CSS.

- Styles can appear **inline**, as specified by the `style` attribute: 

    ```html
    <p style="margin-bottom:10px;background-color:red;color:white">Some text</p>
    ```

- But styles can also be specified separately via a `.css` file. In that file, one uses **CSS Selectors** to identify which tags should be styled and how. For example, if I want _all_ my `<p>` tags to have the same style, I'd write:

    ```css
    p {
        margin-bottom:10px;
        background-color:red;
        color:white
    }
    ```

    When I load this CSS file into my HTML, the styling above will apply to all `<p>`s.

    For the above to work, I'd have to add the following to my HTML document:

    ```html
    <html>
        <head>
            <link rel="stylesheet" type="text/css" href="your_css_file.css">
        </head>

        <body>
            ...
        </body>
    </html>
    ```

</details>



<details><summary style="display: list-item;cursor: pointer;"><h2 style="display:inline;border-bottom: 1px solid #dee2e6;padding-bottom: .5rem;opacity: .9;margin-left:1rem;font-weight: 300;font-size:1.5rem;font-family: 'News Cycle','Arial Narrow Bold',sans-serif;line-height: 1.1;vertical-align:middle;">3. The class attribute</h2></summary>

A class can be applied to style multiple elements at once. 

```html
<p class="coloured"></p>
```

The way to specify the style of a class using **CSS selectors** is with a dot (`.`).  

For example, the class above can be specified in my CSS file as:

```css
p.coloured{
    ...<some-styling>...
}
```

or simply:

```css
.coloured{
    ...<some-styling>...
}
```


</details>



<details><summary style="display: list-item;cursor: pointer;"><h2 style="display:inline;border-bottom: 1px solid #dee2e6;padding-bottom: .5rem;opacity: .9;margin-left:1rem;font-weight: 300;font-size:1.5rem;font-family: 'News Cycle','Arial Narrow Bold',sans-serif;line-height: 1.1;vertical-align:middle;">4. The id attribute</h2></summary>

An `id` is a unique identifier of an element. It should only appear once in a page. We specify ids with the hash symbol (`#`).

Therefore, if I have a 

```html
<p id="uniquely-huge">
```

I could specify the **CSS selector** as:

```css
p#uniquely-huge {
    ...<some-styling>...
}
``` 

or simply:

```css
#uniquely-huge {
    ...<some-styling>...
}
```

(we don't even need to specify the tag)

</details>



<details><summary style="display: list-item;cursor: pointer;"><h2 style="display:inline;border-bottom: 1px solid #dee2e6;padding-bottom: .5rem;opacity: .9;margin-left:1rem;font-weight: 300;font-size:1.5rem;font-family: 'News Cycle','Arial Narrow Bold',sans-serif;line-height: 1.1;vertical-align:middle;">5. A full example with classes and id</h2></summary>

Take, for example, the following HTML document:

```html
<html>
    <head>
        <link rel="stylesheet" type="text/css" href="my_styles.css">
    </head>

    <body>
        <p>Some text</p>

        <p class="coloured">Some text with coloured background</p>

        <p class="coloured">Some text with coloured background</p>

        <p id="uniquely-huge" class="coloured"></p>
    </body>
</html>
```

Suppose we also have a `my_styles.css` file as below:

```css
p {
    margin-bottom:10px;
}

p.coloured {
    background-color:red;
    color:white
}

#uniquely-huge {
    font-size: 2em;
}
```

This will render as:

![](./figures/example_html.png){style="width:30%"}

</details>



    



<details><summary style="display: list-item;cursor: pointer;"><h2 style="display:inline;border-bottom: 1px solid #dee2e6;padding-bottom: .5rem;opacity: .9;margin-left:1rem;font-weight: 300;font-size:1.5rem;font-family: 'News Cycle','Arial Narrow Bold',sans-serif;line-height: 1.1;vertical-align:middle;">6. Using CSS selectors for web scraping</h2></summary>

In brief, when collecting data from a public webpage, this is a skeleton of what you need:

```python
response = requests.get('<some-url>')
sel = Selector(text=response.text)  # the HTML string must be passed via the `text` argument
sel.css('<your-css-selector>')
```
You can also refer to your W04 lecture notebook to remember the full syntax. The key for the rest of this lab is identifying what must be written in the `<your-css-selector>`. 

- You learned that you can include the names of specific tags directly. For example, `sel.css('h3').extract()` will return a list of all `<h3>` elements in the entire page.
- You also learned that you can find the closest **container** (say, `div.card-box`) and then scrape the contents of this box later.

</details>

<details><summary style="display: list-item;cursor: pointer;"><h2 style="display:inline;border-bottom: 1px solid #dee2e6;padding-bottom: .5rem;margin-left:1rem;font-weight: 300;font-size:1.5rem;font-family: 'News Cycle','Arial Narrow Bold',sans-serif;line-height: 1.1;vertical-align:middle;color:#c89020">7. CSS selectors cheatsheet ⭐</h2></summary>

<div style="margin-top:1.5em;width:80%;font-size:0.9em;">

| Selector              | Example                  | Use Case Scenario                                                                                                                              |
|-----------------------|--------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| *                     | *                        | This selector picks every element on the page, so the result is not that different from the page itself. Not much use for it, but still good to know. |
| .class                | .card-title              | The simplest CSS selector targets the class attribute. If only your target element uses that class, this might be sufficient. |
| .class1.class2        | .card-heading.card-title | Some elements have a class attribute like `class="card-heading card-title"`. The space means the element uses several classes. To select such an element, chain the class names with dots and no spaces; a space inside a CSS selector means "descendant of" instead. |
| #id                   | #card-description        | What if the class is used in too many elements, or the element doesn't have a class? Picking the ID can be the next best thing. The only problem is that IDs are unique per element, so this won't cut it for scraping several elements at once. |
| element               | h4                       | To pick an element, all we need to add to our parser is the HTML tag name. |
| element.class         | h4.card-title            | This is the most common selector we'll be using in our projects. |
| parentElement > childElement | div > h4          | We can tell our scraper to extract an element inside another. In this example, we want it to find the h4 element whose parent element is a div. |
| parentElement.class > childElement | div.card-body > h4 | We can combine the previous logic to specify a parent element and extract a specific child element. This is super useful when the data we want doesn't have any class or ID but is inside a parent element with a unique class/ID. |
| [attribute]           | [href]                   | Another great way to target an element with no clear class to choose from. Your scraper will extract all elements containing the specific attribute. In this case, it will mostly pick up `<a>` tags, which are the most common elements to contain an `href` attribute. |
| [attribute=value]     | [target=_blank]          | We can tell our scraper to extract only the elements with a specific value inside its attribute. |
| element[attribute=value] | a[rel=next]          | This is the selector we used to add a crawling feature to our Scrapy script: `next_page = response.css('a[rel=next]').attrib['href']`. The target website was using the same class for all its pagination links, so we had to come up with a different solution. |
| [attribute~=value]    | [title~=rating]         | This selector picks all the elements whose title attribute contains the word 'rating'. |

</div>

Source: [The Only CSS Selectors Cheat Sheet You Need for Web Scraping](https://www.scraperapi.com/blog/css-selectors-cheat-sheet/#CSS-Selectors-Cheat-Sheet)

</details>

💡 PRO-TIP: Did you notice that we're using a mix of markdown + HTML in this Jupyter notebook?

# Part II: Time to put all of this into practice (60-70 min)

Now go over the action points below in pairs:

🎯 **ACTION POINTS**

1. Go to the [Data Science Seminar series](https://socialdatascience.network/index.html#schedule) website and inspect the page (mouse right-click + Inspect) and find the way to the name of the first event on the page. 

::: {style="margin-left:1.5em; background-color: #f9f9f9; border: 1px solid #ddd; border-radius: 0.5em; padding: 1em;margin-bottom:1.5em;"}


By following the procedure, you should conclude that the title is inside a `<h6>` tag, as highlighted by the red boxes in the image below:

![](./figures/civica_screenshot.png){width=80%}

<br>

💡 **TIP:** Pay close attention to how the tags are **nested**!

Our `<h6>` tag doesn't appear isolated within the body structure of the HTML (`<html> <body> ... </body> </html>`). Instead, developers and web designers intentionally constructed this website so that the content boxes and their titles are nested deeply within several `<div>` and `<section>` tags.

For example, this is what the surrounding structure of our desired title looks like:

```html
<div class="card-body">

      <!--Title-->
      <a href="spring2024/sess1.html">
        <h6 class="card-title">Misinformation...</h6>
      </a>

      <!--Text-->
      <p class="card-text">Speaker: ... <br> Date: ...</p>

      <a href="spring2024/sess1.html">
        <button>Read more</button>
      </a>

</div>
```


:::


2. Write down the **full** "directions" inside the HTML file to reach the event title. For example, maybe you will find that:

    > _The first event title is inside a \<html\> ➡️ \<div\> ➡️ \<div\> ➡️ \<h6\> tag_.

    Write it in the markdown cell below:

::: {style="margin-left:1.5em; background-color: #f9f9f9; border: 1px solid #ddd; border-radius: 0.5em; padding: 1em;margin-bottom:1.5em;"}

To obtain the full path to the title, right-click on the tag and select "Copy," then choose the appropriate option. The method of copying may differ depending on your browser. In Mozilla Firefox, the following options are available:

**Copy > Copy CSS Path**

This shows just how deep the title is nested within the HTML structure. Here's the full path to the title:

```css
html body main#main section#schedule.section-with-bg div.container.wow.fadeInUp div.card-deck.row div.col-xs-12.col-sm-6.col-md-4.portfolio-item.filter-spring2024 div.card.mb-4 div.card-body a h6.card-title
```

**Copy > Copy CSS Selector**

This option provides a more direct way to access the title using CSS selectors:

```css
.filter-spring2024 > div:nth-child(1) > div:nth-child(2) > a:nth-child(1) > h6:nth-child(1)
```

**Copy > Copy XPath**

XPath is an alternative method for specifying HTML tags, similar to specifying a path in the Terminal. Here's the full XPath to the title:

```text
/html/body/main/section[2]/div/div[3]/div[1]/div/div[2]/a[1]/h6
```

We will talk about XPath in the 👨🏻‍🏫 [Week 05 lecture](https://lse-dsi.github.io/DS105/2023/winter-term/weeks/week05/page.html).


:::

3. Write the Python code required to scrape the event title using the CSS selector you identified above. 

    - Don't use the notion of containers just yet - we will practice that later in the W05 lecture. 
    - For now, just write the full CSS selector you identified above


::: {style="margin-left:1.5em; background-color: #f9f9f9; border: 1px solid #ddd; border-radius: 0.5em; padding: 1em;margin-bottom:1.5em;"}

Here's the Python code to scrape the title using the CSS selector:

```python
url = "https://socialdatascience.network/index.html#schedule"
response = requests.get(url)
selector = Selector(text=response.text)

# Specify the full CSS Path to the title
# Note that I added the `::text` to extract the text inside the tag
title_css_path = "html body main#main section#schedule.section-with-bg div.container.wow.fadeInUp div.card-deck.row div.col-xs-12.col-sm-6.col-md-4.portfolio-item.filter-spring2024 div.card.mb-4 div.card-body a h6.card-title::text"

# Specify the CSS Selector to the title
# Note that I added the `::text` to extract the text inside the tag
title_css_selector = ".filter-spring2024 > div:nth-child(1) > div:nth-child(2) > a:nth-child(1) > h6:nth-child(1)::text"

# Use either the CSS Path or the CSS Selector to extract the title
title = selector.css(title_css_path).extract_first()

print(title)
```

💬 **CONSIDER THIS:** While both the full CSS Path and the CSS Selector provided above function adequately, they are not optimal.

For one, they are not very human-readable. The CSS Path is quite lengthy, and the CSS Selector, although more concise, still relies on manual indexing (selecting the first child of a div, then its second child, and so forth). This makes the code more prone to breaking if the website's structure changes even slightly in the future.

:::

4. **Let's simplify.** Let's capture the title of the **first event** again, but instead of writing the entire full absolute path, like above, identify a more direct way to capture it. 

    - Note: Either use scrapy's `.extract_first()` or use `.extract()` and later filter the list using regular Python

::: {style="margin-left:1.5em; background-color: #f9f9f9; border: 1px solid #ddd; border-radius: 0.5em; padding: 1em;margin-bottom:1.5em;"}

Here's a more direct way to capture the title of the first event using a CSS Selector:

```python
# I noticed that only event titles use the <h6> tag
# I can specify the tag directly in the CSS Selector

title_css_selector = "h6::text"
title = selector.css(title_css_selector).extract_first()

print(title)
```

Alternatively, you can use the `extract()` method and filter the list using regular Python:

```python
title = selector.css(title_css_selector).extract()

# Filter the list to get the first title
title = title[0]
```

(The first approach is a bit more elegant and also more efficient, as it doesn't require creating a list of all the titles on the page just to filter it afterwards.)

:::

5. **Collect all the titles**. OK, now let's practice getting all event titles from the entire page. Save the titles into a list.

    **NOTE:** Again, collect all the information from the webpage at once. Don't use the notion of containers just yet. We will practice it in the W05 lecture.

::: {style="margin-left:1.5em; background-color: #f9f9f9; border: 1px solid #ddd; border-radius: 0.5em; padding: 1em;margin-bottom:1.5em;"}

Presumably, the code above - with the CSS Selector `h6::text` - will already collect all the titles from the entire page. 

```python
titles = selector.css("h6::text").extract()
```

That's it. Or is it? The next question will explain why this approach is not always the best.

:::

6. Do the same with the dates of the events and speaker names and save them to separate lists. 

    **NOTE:** Again, collect all the information from the webpage at once. Don't use the notion of containers just yet. We will practice it in the W05 lecture.


::: {style="margin-left:1.5em; background-color: #f9f9f9; border: 1px solid #ddd; border-radius: 0.5em; padding: 1em;margin-bottom:1.5em;"}

Using the same approach as before, after inspecting the page again, you will eventually discover that the info you want is inside the `<p>` tag that contains the class `card-text`. 

Interestingly, the speaker and the date are not wrapped in separate tags; they are only separated by a line break (`<br>`). There is no way to capture each piece separately with a CSS Selector alone. ^[It is possible with XPath, though! See [Option 2](#option-02-use-the-xpath-method). We will talk about it in the 👨🏻‍🏫 [Week 05 lecture](https://lse-dsi.github.io/DS105/2023/winter-term/weeks/week05/page.html).]

[👉 Note also that the line break is an odd tag. It's a tag that doesn't have a closing tag! More confusingly, sometimes it appears as `<br>` and sometimes as `<br/>`.]{style="margin-left:2em;display:inline-block;"}

### Option 01: capture the text and post-process it in Python

Knowing what you know, you could still use the `::text` pseudo-element to capture the text inside the `<p>` tag but you will need to post-process the text later using Python.

```python
# Get everything inside the <p> tag
info = selector.css("p.card-text::text").extract()

```

However, there's a drawback: the `<br>` splits each paragraph into two separate text nodes, so `extract()` returns a flat list of strings with alternating speaker names and dates.

That is, if you look inside the `info` list, you will see:

```python
['Speaker: ... ', ' Date: ...', 
 'Speaker: ... ', ' Date: ...', ...]
```

Not the end of the world. From here, I could just filter the list to get the speaker names and dates separately. Inspired by the code that Oliver Gregory shared in the `#help-python-pandas` channel on Slack, I could do something like:

```python
dates = []
speakers = []

for i in range(len(info)):
    if i % 2 == 0:
        speakers.append(info[i])
 
for i in range(len(info)):
    if i % 2 != 0:
        dates.append(info[i])
```

This works! But there's a neat and elegant alternative way to do this using [list slicing](https://www.learnbyexample.org/python-list-slicing/):

```python
# Start at item 0 and go to the end, step by 2
speakers = info[::2]

# Start at item 1 and go to the end, step by 2
dates = info[1::2]
```

### Option 02: use the `xpath` method

As you will eventually learn, XPath lets you capture the text inside the `<p>` tag and separate the speaker names and dates in one go. All we need to do is to specify the position of the text we want to capture.

```python
speakers_xpath = "//p[@class='card-text']/text()[1]"
speakers = selector.xpath(speakers_xpath).extract()

dates_xpath = "//p[@class='card-text']/text()[2]"
dates = selector.xpath(dates_xpath).extract()
```

This is arguably the most elegant way to capture the speakers' names and dates: the XPath is concise and fairly human-readable, and the values no longer need to be separated in Python afterwards. Keep in mind that it relies on position, so it assumes the speaker always comes first and the date second inside each `<p>`.


:::

7. 🥇 **Challenge:** Combine all these lists you captured above into a single pandas data frame and save it to a CSV file. 

    Tip 1: Say you have lists called `dates`, `titles`, `speakers`, you can create a data frame (a table) like this:
    
    ```python
    df = pd.DataFrame({'date': dates,
                       'title': titles,
                       'speakers': speakers})
    ```    
    
    Tip 2: What if an event does not have a date or speaker name? Set that particular event's date or speaker to `None`

::: {style="margin-left:1.5em; background-color: #f9f9f9; border: 1px solid #ddd; border-radius: 0.5em; padding: 1em;margin-bottom:1.5em;"}

**At first glance, this seems like a super simple task. But it's not.**

As soon as you try to simply copy and paste the code above, you will notice that the lists `dates`, `titles`, and `speakers` have different lengths. At first sight, it looks as if some events don't have a date or a speaker name!

Let's print out the lengths of the lists to see what we're dealing with:

```python
print(len(titles), len(dates), len(speakers))
```

yielding:

```output
39 35 35
```

Four events seem to be missing a date and a speaker name.
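
This matters because `pandas` refuses to build a data frame from a dictionary of lists with different lengths. The sketch below reproduces the problem with short, made-up lists (not the scraped data).

```python
import pandas as pd

# Made-up lists with mismatched lengths, standing in for the scraped ones
titles = ["Talk A", "Talk B", "Talk C"]
dates = ["1 Feb 2024", "8 Feb 2024"]
speakers = ["Ada Lovelace", "Alan Turing"]

try:
    df = pd.DataFrame({"date": dates, "title": titles, "speakers": speakers})
except ValueError as error:
    # pandas raises a ValueError when the lists do not line up
    df = None
    print("Could not build the data frame:", error)
```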

What should you do in this case? Here is what I recommend:

### Option 01: Debug the issue

Try to understand why some events don't have a date or a speaker name. Is it a problem with the code? Or is it a problem with the website?

At least in Firefox, the Inspector window lets you search the page using the same CSS selectors you'd use with `scrapy`. You can type the selector in the search bar there (see the bit in yellow in the image below).

![](./figures/pcard.png){width=100%}

Keep hitting Enter to navigate the multiple matches and determine which ones are left out when using this selector.

[👉 Eventually, you'll notice that while all boxes contain paragraphs displaying speakers and dates, a few relevant `<p>` tags lack the `card-text` class. Why? Well, who knows! It seems like the maintainers of the website might have forgotten to tick a box or something.]{style="margin-left:2em;display:inline-block;"}

These inconsistencies are super common in web scraping. It's important to be aware of them and to know how to handle them.

### Option 02: Think of a more robust selector

Instead of immediately selecting the `p.card-text`, we could consider using a parent tag as our **reference point**.

The `<div class="card-body">` tag is the most natural reference point in this context: it is the overarching **container** for the event details. Given that it contains only a single `<p>` tag, we can use the `div.card-body` selector as the starting point to capture the information.

**Using CSS Selectors:**

```python
info = selector.css("div.card-body > p ::text").extract()

# Use the list-slicing method to separate the speakers and dates
speakers = info[::2]
dates = info[1::2]
```

**Using XPath:**

```python
speakers_xpath = "//div[@class='card-body']/p/text()[1]"
speakers = selector.xpath(speakers_xpath).extract()

dates_xpath = "//div[@class='card-body']/p/text()[2]"
dates = selector.xpath(dates_xpath).extract()

```

Perfect! Now, we have the same number of speakers, dates, and titles. We can proceed to create the data frame and save it to a CSV file.

```python
df = pd.DataFrame({'date': dates,
                    'title': titles,
                    'speakers': speakers})

df.to_csv("seminar_series.csv", index=False)
```


:::

In the upcoming 👨🏻‍🏫 [Week 05 lecture](https://lse-dsi.github.io/DS105/2023/winter-term/weeks/week05/page.html), we will discuss XPath and how to write custom functions to extract all you need from **containers**.