# Lecture - Static web scraping 1

Author: Jun Sun (jun.sun@gesis.org)

In [1]:
from IPython.display import HTML

## 1. Using API vs webpage scraping

Using API (Application Programming Interface) calls and webpage scraping are two different methods of accessing data from websites or online services. Each approach has its own set of pros and cons, and the choice between them depends on various factors including data availability, legality, and your specific use case. 

Compare the results of the two methods:

1. Using an API call with the URL https://en.wikipedia.org/w/api.php?action=parse&page=Mannheim&format=json

2. Visiting the webpage with the URL https://en.wikipedia.org/wiki/Mannheim

Using API call:

![api.png](attachment:api.png)

Visiting the webpage:

![webscraping.png](attachment:webscraping.png)

The choice between API and web scraping depends on your specific needs and constraints. If an API is available and provides the data you need, it is generally the preferred method because it returns structured data, is more reliable, and is explicitly offered by the provider. However, web scraping can be a valuable tool when an API is not available or does not expose the data you need, but you should always approach it with caution and respect for legal and ethical considerations.
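The two requests compared above can be sketched with Python's standard library. This is a minimal illustration, not a recommended scraper: the helper names `build_api_url`, `fetch`, and `compare`, as well as the `lecture-example/0.1` user-agent string, are made up for this example, and `compare()` needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
PAGE_URL = "https://en.wikipedia.org/wiki/Mannheim"


def build_api_url(page):
    """Build the API URL used in the comparison above."""
    params = {"action": "parse", "page": page, "format": "json"}
    return f"{API_ENDPOINT}?{urlencode(params)}"


def fetch(url):
    """Download a URL and return the body as text (requires network access)."""
    # Hypothetical user-agent string, chosen only for this example.
    request = Request(url, headers={"User-Agent": "lecture-example/0.1"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")


def compare():
    """Fetch both versions and show how differently they are structured."""
    api_data = json.loads(fetch(build_api_url("Mannheim")))  # structured JSON
    page_html = fetch(PAGE_URL)                              # raw HTML text
    print("API top-level keys:", list(api_data))
    print("Webpage starts with:", page_html[:60])


print(build_api_url("Mannheim"))
# compare()  # uncomment to run the two requests
```

The API response can be loaded directly into Python dictionaries and lists, whereas the webpage is returned as one HTML string that still has to be parsed.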

## 2. Web scraping legality & ethics

### Terms of service (ToS)

The "Terms of Service" (ToS) or "Terms of Use" of a website or online service typically outline the rules and regulations that users must adhere to when using that service. When it comes to automated data collection, ToS may contain specific provisions or guidelines regarding this activity.

The following example shows the ToS of Facebook regarding web scraping as stated in https://www.facebook.com/apps/site_scraping_tos_terms.php.

![facebook_tos.png](attachment:facebook_tos.png)

It is important to note that ToS are legally binding contracts, and violating them can have legal consequences. Therefore, when engaging in automated data collection, it is crucial to review and comply with the terms of the website or service you are interacting with. Additionally, if you have any uncertainty or specific questions about the terms, you may want to consult with legal counsel to ensure your actions are in compliance with applicable laws and regulations.

### robots.txt

A `robots.txt` file, often referred to as the "robots exclusion protocol" or "robots.txt protocol", is a standard used by websites to communicate with web crawlers or robots about which parts of the website should be crawled or indexed and which parts should be excluded. This file is typically placed in the root directory of a website, and it provides instructions to web crawlers about how they should interact with the site.

Key components of a `robots.txt` file:

* **User-agent**: This line specifies the web crawler or robot to which the rules apply. Common user-agents include Googlebot (for Google's crawler) and Bingbot (for Bing's crawler).

* **Disallow**: This directive tells the web crawler which parts of the site it should not crawl. You specify the URLs or paths that should be excluded from crawling.

* **Allow**: While not always used, this directive can specify exceptions to the disallow rules. It tells the crawler which specific URLs within a disallowed directory are allowed to be crawled.

* **Sitemap**: This line can point to the website's XML sitemap file, which lists all the URLs that you want search engines to index.

Here is an example of a `robots.txt` file: https://www.google.com/robots.txt

```
User-agent: *
Disallow: /search
Allow: /search/about
Allow: /search/static
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
Disallow: /?hl=*&*&gws_rd=ssl
Allow: /?gws_rd=ssl$
Allow: /?pt1=true$
Disallow: /imgres
Disallow: /u/
...

# AdsBot
User-agent: AdsBot-Google
Disallow: /maps/api/js/
Allow: /maps/api/js
Disallow: /maps/api/place/js/
Disallow: /maps/api/staticmap
Disallow: /maps/api/streetview

# Crawlers of certain social media sites are allowed to access page markup when google.com/imgres* links are shared. To learn more, please contact images-robots-allowlist@google.com.
User-agent: Twitterbot
Allow: /imgres
Allow: /search
Disallow: /groups
Disallow: /hosted/images/
Disallow: /m/

User-agent: facebookexternalhit
Allow: /imgres
Allow: /search
Disallow: /groups
Disallow: /hosted/images/
Disallow: /m/

Sitemap: https://www.google.com/sitemap.xml
```
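Rules like these can be checked programmatically before scraping. The sketch below uses `urllib.robotparser` from the Python standard library on a shortened excerpt of the rules shown above. The crawler name `MyScraper` is hypothetical. The excerpt deliberately uses only plain path prefixes, because wildcard patterns and overlapping `Allow`/`Disallow` rules may be interpreted differently by different parsers.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the rules above (plain path prefixes only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /groups
Disallow: /imgres

User-agent: Twitterbot
Allow: /imgres
Disallow: /groups
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse text instead of downloading it

# "MyScraper" is a made-up crawler name; it falls under the "*" rules.
for agent, path in [
    ("MyScraper", "/groups"),
    ("MyScraper", "/imgres"),
    ("MyScraper", "/maps"),
    ("Twitterbot", "/imgres"),
]:
    allowed = parser.can_fetch(agent, "https://www.google.com" + path)
    print(f"{agent:10} {path:8} allowed: {allowed}")
```

To check a live site instead, `parser.set_url(...)` followed by `parser.read()` downloads and parses its `robots.txt`.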

### User agent

As shown in the example above, websites can specify different rules for different web crawlers or "user-agents".

The User-Agent request header is a characteristic string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting client.
However, user-agent strings can be customized or modified, and some web browsers allow users to change their user-agent for various purposes, such as compatibility testing or privacy concerns.

More information can be found at https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent

Note that the server can see more about the client than just the user-agent, because the request headers carry additional information. As an illustration, visit https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending to see what the server knows about you:

* Visit in your browser
* Visit outside your browser (e.g., wget, python)
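When visiting from Python, the user-agent can be set explicitly. The following sketch builds a request with `urllib.request` and inspects its headers without sending anything. The user-agent string and contact address are placeholders invented for this example.

```python
from urllib.request import Request

URL = "https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending"

# Placeholder user-agent; replace with a string that identifies your project.
CUSTOM_USER_AGENT = "my-course-scraper/0.1 (contact: you@example.org)"

request = Request(URL, headers={"User-Agent": CUSTOM_USER_AGENT})

# urllib stores header names in capitalized form, hence "User-agent".
print(request.get_header("User-agent"))
print(request.header_items())

# To actually send the request (requires network access):
# from urllib.request import urlopen
# with urlopen(request, timeout=10) as response:
#     print(response.status)
```

An honest, descriptive user-agent makes it easier for site operators to identify your scraper and to contact you if it causes problems.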

### Best practices

* Never scrape more frequently than you need to
* Cache the content you scrape: only download once
* Pause to keep from overwhelming servers
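The caching and pausing practices can be combined in a small helper. This is one possible sketch under simple assumptions: the function names, the `cache` directory, and the one-second default delay are all arbitrary choices for illustration, and the cache never expires.

```python
import hashlib
import time
from pathlib import Path
from urllib.request import urlopen


def download(url):
    """Download a URL and return the raw bytes (requires network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def fetch_cached(url, cache_dir="cache", delay=1.0, downloader=download):
    """Return the content of url, downloading it at most once."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)

    # Use a hash of the URL as a safe file name.
    file_name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    file_path = cache_path / file_name

    if file_path.exists():
        return file_path.read_bytes()  # cached: no request, no pause

    time.sleep(delay)                  # pause before contacting the server
    content = downloader(url)
    file_path.write_bytes(content)
    return content
```

Calling `fetch_cached("https://en.wikipedia.org/wiki/Mannheim")` twice contacts the server only the first time; the second call reads the saved file.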

## 3. Webpages

A typical webpage contains

* HTML (HyperText Markup Language) describes and defines the content
* CSS (Cascading Style Sheets) are used to describe the appearance
* JS (JavaScript) is a programming language used to add interactivity

In this tutorial we illustrate the basics of HTML and CSS.

### Inspecting a webpage

To inspect a webpage in the browser, using Chrome as an example,
* Press F12 (Windows / Linux)
* Or: Ctrl + Shift + I
* Or: Menu → More tools → Developer tools

To inspect a specific part of the webpage, right click → Inspect.

### HTML basics
To create a simple webpage, open your text editor and type in the following content. Save it with the extension `.html`.

Open the HTML file in the browser, for example Chrome or Firefox.

```
<p>This is a paragraph.</p>          <!-- this is a comment -->
```

Alternatively you can try it online: https://jsbin.com/ or https://html5-editor.net/.

In Jupyter notebooks you can use `HTML()` to render HTML code.

In [2]:
HTML('<p>This is a paragraph.</p>          <!-- this is a comment -->')

More examples:

* A simple HTML element with an attribute

```
<p class="para">This is a paragraph.</p>
```

In [3]:
HTML('<p class="para">This is a paragraph.</p>')

* A hyperlink to Google

```
<a href="https://www.google.com/" target="_blank">search google</a>
```

In [4]:
HTML('<a href="https://www.google.com/" target="_blank">search google</a>')

* HTML text

```
<h1>top level heading</h1>    <!-- also h2, h3, ..., h6 -->
<p>paragraph.</p>
<i>italic</i> <b>bold</b> <u>underline</u>
<ul>
    <!-- <ol> for ordered -->
    <li>milk</li>
    <!-- you can nest ’em -->
    <li>eggs</li>
</ul>
```

In [7]:
html = '''
        <h1>top level heading</h1>    <!-- also h2, h3, ..., h6 -->
        <p>paragraph.</p>
        <i>italic</i> <b>bold</b> <u>underline</u>
        <ul>
            <!-- <ol> for ordered -->
            <li>milk</li>
            <!-- you can nest ’em -->
            <li>eggs</li>
        </ul>
       '''

HTML(html)

* Simple HTML structure

```
<!DOCTYPE html>
<html>
    <head>
        <title>ohai</title>
    </head>
    <body>
        <p>hello world!</p>
    </body>
</html>
```

In [6]:
html = '''
        <!DOCTYPE html>
        <html>
            <head>
                <title>ohai</title>
            </head>
            <body>
                <p>hello world!</p>
            </body>
        </html>
       '''
HTML(html)
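For scraping, the interesting direction is the reverse one: reading content back out of the HTML. As a minimal sketch, the standard library's `html.parser` can walk through a document like the one above and collect text and link targets. The class name `TextCollector` is made up for this example, and the simple tag tracking only works for well-formed markup where every tag is closed.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect (tag, text) pairs and link targets from an HTML document."""

    def __init__(self):
        super().__init__()
        self._open_tags = []  # stack of currently open tags
        self.texts = []       # (tag, text) pairs
        self.links = []       # href values of <a> elements

    def handle_starttag(self, tag, attrs):
        self._open_tags.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        # Assumes well-formed markup: the closing tag matches the last open tag.
        if self._open_tags and self._open_tags[-1] == tag:
            self._open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._open_tags:
            self.texts.append((self._open_tags[-1], text))


document = '''
<!DOCTYPE html>
<html>
    <head>
        <title>ohai</title>
    </head>
    <body>
        <p>hello world!</p>
        <a href="https://www.google.com/" target="_blank">search google</a>
    </body>
</html>
'''

collector = TextCollector()
collector.feed(document)
print(collector.texts)
print(collector.links)
```

Dedicated parsing libraries handle messy real-world HTML far more robustly; this sketch only shows that elements, attributes, and text can be recovered from the markup.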

### CSS selectors

CSS selectors are patterns used to select and style elements in HTML documents. In the context of web scraping, CSS selectors allow us to locate the content we are interested in on the webpage.

To get the CSS selector of an HTML element, using Chrome as an example, first inspect the element, then right click the HTML code in the "Elements" tab and click Copy → Copy selector.
Note that multiple selectors can be valid for the same element.

Some frequently used examples of CSS selectors are:

| Selector          | Example         | Description                                                                              |
|-------------------|-----------------|------------------------------------------------------------------------------------------|
| `*`               | `*`             | all elements                                                                             |
| `#id`             | `#firstname`    | the element with `id="firstname"`                                                        |
| `.class`          | `.intro`        | all elements with `class="intro"`                                                        |
| `.class1.class2`  | `.name1.name2`  | all elements with both `name1` and `name2` classes                                       |
| `.class1 .class2` | `.name1 .name2` | all elements with class `name2` that are descendants of an element with class `name1`    |
| `element`         | `p`             | all `<p>` elements                                                                       |
| `element.class`   | `p.intro`       | all `<p>` elements with `class="intro"`                                                  |
| `element,element` | `div, p`        | all `<div>` elements and all `<p>` elements                                              |
| `element element` | `div p`         | all `<p>` elements inside `<div>` elements                                               |
| `element>element` | `div > p`       | all `<p>` elements where the parent is a `<div>` element                                 |

More information can be found at: https://www.w3schools.com/cssref/css_selectors.asp
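To see some of these selectors in action from Python, the sketch below uses the third-party package Beautiful Soup (`pip install beautifulsoup4`), which is an assumption of this example and not something introduced above. Its `select()` method takes a CSS selector and returns the matching elements. The HTML snippet and the helper `texts` are made up for illustration.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = """
<div class="name1">
    <p class="intro">first</p>
    <section>
        <p class="intro name2">second</p>
    </section>
</div>
<p id="firstname">third</p>
"""

soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [element.get_text(strip=True) for element in soup.select(selector)]


print(texts("#firstname"))     # ['third']
print(texts(".intro"))         # ['first', 'second']
print(texts(".name1 .name2"))  # ['second']
print(texts("div p"))          # ['first', 'second']
print(texts("div > p"))        # ['first']
```

Note the difference between the last two lines: `div p` matches any `<p>` nested inside a `<div>`, while `div > p` only matches `<p>` elements that are direct children.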