# 6. Web Scraping

### Definition

**Web scraping** is the automated extraction (scraping) of data from webpages on the Internet. The program that performs this task is usually called a **web scraper** or a **bot**.

**Web crawling** is the process of exploring, and often indexing, webpages on the Internet by following hyperlinks from webpage to webpage. The program that performs this task is usually called a **spider** or **web crawler**.

Oftentimes, web scraping and web crawling are combined into a single program. I will continue using "web scraping" to denote both approaches.

Web scraping can be used for **focused crawls**, which concentrate on crawling and scraping a single website (e.g. amazon.com), or for **broad crawls**, which do the same across many different websites.
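The difference between the two crawl types mostly comes down to which hyperlinks the program is allowed to follow. The following is a minimal sketch of such a filter, using only the standard library. The helper name ```is_same_site``` and the example URLs are made up for illustration, and treating ```www.example.com``` and ```example.com``` as the same site is an assumption that will not hold for every website.

```python
from urllib.parse import urlsplit


def is_same_site(url: str, seed_url: str) -> bool:
    """Return True if url is on the same host as seed_url (ignoring a leading 'www.')."""

    def host(u: str) -> str:
        hostname = urlsplit(u).hostname or ""
        # Assumption: "www.example.com" and "example.com" are the same site.
        return hostname[4:] if hostname.startswith("www.") else hostname

    return host(url) == host(seed_url)


# Hypothetical links discovered while crawling.
candidate_urls = [
    "https://www.example.com/products",
    "https://example.com/about",
    "https://other.example.org/page",
]

seed = "https://www.example.com"

# A crawl restricted to one website only keeps links on the seed's host.
focused = [u for u in candidate_urls if is_same_site(u, seed)]
print(focused)
```

A crawl across many websites would simply skip this filter (or replace it with a list of allowed domains).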

Common **use cases** for web scraping are:
- search engines
- price monitoring
- content aggregators
- collecting massive amounts of text data for the training of language models
- copying online databases
- research data

### HTML

The HyperText Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser. Web browsers receive HTML documents from a web server and render the documents into multimedia web pages.

This is an example of a simple HTML document:

```
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h6>This is a Heading</h6>
<p>This is a paragraph.</p>

</body>
</html>
```

We can "execute" HTML directly in the cells of our Jupyter notebook:

<h6>This is a Heading</h6>

We can check out the HTML source code of any website in our browser, either by right-clicking anywhere on the website and selecting "show source code" (or similar) or by using the shortcut **ctrl** + **u**.

Original website:

<img src="../misc/istari.png" width="600">

HTML source code:

<img src="../misc/istari_source.png" width="600">

HTML consists of a series of elements which tell the browser how to display the content. An **HTML element** is defined by a **start tag** ```<tag>```, some **content** (e.g. text or a hyperlink), and an **end tag** ```</tag>```:

```<tagname>Content goes here...</tagname>``` 
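To make the three parts of an element visible, here is a small sketch that uses Python's built-in ```html.parser``` module to report the start tag, the content, and the end tag of a paragraph. The class name ```ElementLogger``` is made up for this illustration; it is not part of any library.

```python
from html.parser import HTMLParser


class ElementLogger(HTMLParser):
    """Record start tags, text content, and end tags in the order they appear."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between elements
            self.events.append(("content", data.strip()))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = ElementLogger()
parser.feed("<p>This is a paragraph.</p>")
parser.close()
print(parser.events)
# [('start', 'p'), ('content', 'This is a paragraph.'), ('end', 'p')]
```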

There are many different HTML elements. Some of the most frequently used are:

- **headings** are defined with the ```<h1>``` to ```<h6>``` tags
- **paragraphs** are defined with the ```<p>``` tag
- **links** are defined with the ```<a>``` tag
- **images** are defined with the ```<img>``` tag

<h6>Small Heading</h6>
<p>This is a paragraph with a <a href="https://www.google.com">link</a>.</p>

### Python requests

We will use a Python package called *requests* to request and retrieve HTML from webpages. First, install requests using pip:

```pip install requests```

After installation (and restarting the Jupyter kernel), we have to import the package.

In [1]:
import requests



Now we can request the HTML of any website by passing its **URL** (Uniform Resource Locator), colloquially termed a **web address**, to ```requests.get()```.

In [2]:
requests.get("http://www.example.com")

<Response [200]>

This returns ```<Response [200]>```, a ```Response``` object that contains everything the server sent back in reply to our request.

In [3]:
type(requests.get("http://www.example.com"))

requests.models.Response

The ```200``` is an HTTP status code that stands for "OK"; it is the standard response for successful HTTP requests. Other important status codes are:

- ```301``` Moved Permanently
- ```403``` Forbidden
- ```404``` Not Found
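In requests, the numeric code is available as ```response.status_code```. As a small offline sketch, the standard library's ```http.HTTPStatus``` can translate such a code into its official phrase. The helper names ```describe_status``` and ```is_success``` are made up for this example.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short label such as '404 Not Found' for an HTTP status code."""
    return f"{code} {HTTPStatus(code).phrase}"


def is_success(code: int) -> bool:
    """Status codes in the 2xx range signal a successful request."""
    return 200 <= code < 300


for code in (200, 301, 403, 404):
    print(describe_status(code), "| success:", is_success(code))
```

In a scraper, a check like ```is_success(response.status_code)``` before parsing avoids treating an error page as real content.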


We can use ```.text``` on the response object to retrieve the HTML code.

In [4]:
response = requests.get("http://www.example.com")
response.text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <

### Beautifulsoup

The easiest way to extract content from HTML is to use the Python package *beautifulsoup*. Beautiful Soup is a Python library for pulling data out of HTML files. So let's install and import it:

```pip install beautifulsoup4```

In [5]:
from bs4 import BeautifulSoup

First we want to use BeautifulSoup to create a BeautifulSoup object, which represents the HTML document as a nested data structure.

In [6]:
soup = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly to avoid a warning
soup

<!DOCTYPE html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples

We can now use our BeautifulSoup object to directly retrieve elements from the HTML code. For example, ```.title``` extracts the title of the HTML document.

In [7]:
soup.title

<title>Example Domain</title>

This actually returns a ```Tag``` object.

In [8]:
type(soup.title)

bs4.element.Tag

If we want to get the content of the tag as a string, we just have to append ```.string```.

In [9]:
soup.title.string

'Example Domain'

In a similar manner, we can also access other elements directly by their tag name. Note that this returns only the first matching element.

In [10]:
soup.p

<p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
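The same attribute-style access can be tried without any network request by parsing a small HTML string. The document below is made up for illustration; ```"html.parser"``` selects Python's built-in parser.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document for illustration.
html = """
<html>
<head><title>Demo Page</title></head>
<body>
<h1>Main Heading</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
</body>
</html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

print(demo_soup.title.string)  # text inside <title>
print(demo_soup.h1.string)     # text inside the first <h1>
print(demo_soup.p.string)      # only the FIRST <p> is returned
print(demo_soup.h2)            # a tag that does not exist yields None
```

Because a missing tag yields ```None```, chaining ```.string``` onto it would raise an ```AttributeError```, so it is worth checking for ```None``` when scraping pages with unknown structure.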

There are also handy functions included. ```.get_text()``` retrieves all strings from the HTML code. We can define a ```separator``` that is placed between the individual strings, and set ```strip=True``` to remove leading and trailing whitespace from each of them.

In [11]:
soup.get_text(separator=' ', strip=True)

'Example Domain Example Domain This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission. More information...'
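To see what the two parameters actually change, here is a short sketch on a made-up HTML snippet that compares the default call with the one used above.

```python
from bs4 import BeautifulSoup

# Made-up snippet with extra whitespace inside the paragraph.
snippet = "<h1>Title</h1><p>  First paragraph.  </p>"
snippet_soup = BeautifulSoup(snippet, "html.parser")

# Default: strings are concatenated as-is, whitespace included.
glued = snippet_soup.get_text()

# With separator and strip: each string is stripped, then joined by a space.
clean = snippet_soup.get_text(separator=" ", strip=True)

print(repr(glued))
print(repr(clean))
```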

Extracting all the text from a webpage boils down to a single line of Python code.

In [12]:
BeautifulSoup(requests.get("http://www.istari.ai/en").text, "html.parser").get_text(separator=' ', strip=True)

'ISTARI.AI - Die Zukunft der Unternehmensdatenbank Primary Menu HOME TEAM REFERENCES WEBAI NEWS CONTACT The future of the enterprise database – comprehensive information in real time Making hidden company data visible automatically, without rigid sources and in real time: Our webAI provides you with exactly the information you need. To get to this valuable information, our webAI searches millions of websites at high frequency to analyze them with artificial intelligence. Infinite use cases From market analysis and lead generation to investment opportunities, webAI offers the necessary data for all use cases. webAI Contact us Web indicators Extensive and constantly updated web data provide the basis for a range of webAI web indicators that make information easily accessible. \uf0ac \uf0ac Internationalization Is the company internationally active? \uf3e7 \uf3e7 Location What happens at the company location? \ue109 \ue109 Networking With which other companies is the company connected? \u

If we want to find all the hyperlinks on a webpage, we can apply ```.find_all()``` to our BeautifulSoup object and pass the ```"a"``` tag. This returns a list of ```<a>``` elements.

In [13]:
all_hyperlinks = BeautifulSoup(requests.get("http://www.istari.ai/en").text, "html.parser").find_all("a")
all_hyperlinks

[<a href="https://istari.ai/en/" rel="home">
 <span class="logo"><img alt="ISTARI.AI" class="tgp-exclude default" src="https://istari.ai/wp-content/uploads/thegem-logos/logo_d588768ef987c12dd42043d4cd5c17ba_1x.png" srcset="https://istari.ai/wp-content/uploads/thegem-logos/logo_d588768ef987c12dd42043d4cd5c17ba_1x.png 1x,https://istari.ai/wp-content/uploads/thegem-logos/logo_d588768ef987c12dd42043d4cd5c17ba_2x.png 2x,https://istari.ai/wp-content/uploads/thegem-logos/logo_d588768ef987c12dd42043d4cd5c17ba_3x.png 3x" style="width:176px;"/><img alt="ISTARI.AI" class="tgp-exclude small light" src="https://istari.ai/wp-content/uploads/thegem-logos/logo_77c5838c8035f4b4b8f3ba17b35e60fa_1x.png" srcset="https://istari.ai/wp-content/uploads/thegem-logos/logo_77c5838c8035f4b4b8f3ba17b35e60fa_1x.png 1x,https://istari.ai/wp-content/uploads/thegem-logos/logo_77c5838c8035f4b4b8f3ba17b35e60fa_2x.png 2x,https://istari.ai/wp-content/uploads/thegem-logos/logo_77c5838c8035f4b4b8f3ba17b35e60fa_3x.png 3x" sty

To retrieve the actual hyperlinks, we have to apply ```.get("href")``` to the individual ```Tag``` objects. We can do so by iterating over the complete list.

In [14]:
for link in all_hyperlinks:
    print(link.get("href"))

https://istari.ai/en/
https://istari.ai/en/#Home-en
https://istari.ai/en/#Team_en
https://istari.ai/en/#Testimonials_en
https://istari.ai/en/webai/
https://istari.ai/en/blog
https://istari.ai/en/#contact-en
https://istari.ai/en/
https://istari.ai
https://istari.ai/en/webai/
https://istari.ai/#Kontakt
https://twitter.com/_david_lenz_
https://www.linkedin.com/in/david-lenz-9b98a4167/
https://twitter.com/jan_kinne
https://www.linkedin.com/in/jan-kinne-63169312b/
https://www.linkedin.com/in/miriam-krueger/
https://www.linkedin.com/in/devadeep-sen/
https://www.linkedin.com/in/sebastian-schmidt-a4688b153/
https://www.linkedin.com/in/rakshya-2199/
https://istari.ai/blog/
https://istari.ai/en/prompt-availability-of-information/
https://istari.ai/en/prompt-availability-of-information/
https://istari.ai/en/company-networks/
https://istari.ai/en/company-networks/
https://istari.ai/en/artificial-intelligence-applications-use-cases/
https://istari.ai/en/artificial-intelligence-applications-use-case

As you can see, there are quite a few duplicate links. To get rid of them, the easiest way is to first extract the actual hyperlinks from the ```a``` elements and put them into a list.

In [15]:
hyperlink_list = [link.get("href") for link in all_hyperlinks]
print(len(hyperlink_list))

42


We can then convert this list to a ```set```. The items in a set are unordered, unchangeable, and unique, so the conversion "automatically" drops all duplicate entries from our list. We then convert the set back to a list using ```list()```. Note that the original order of the links is not preserved.

In [16]:
unique_hyperlink_list = list(set(hyperlink_list))
print(len(unique_hyperlink_list))
unique_hyperlink_list

29


['https://istari.ai/en/blog',
 'https://istari.ai/en/artificial-intelligence/',
 'https://istari.ai',
 'https://istari.ai/en/#Testimonials_en',
 'https://www.linkedin.com/in/devadeep-sen/',
 'https://istari.ai/en/company-networks/',
 'https://istari.ai/en/#contact-en',
 'https://istari.ai/en/marketanalyses-corona/',
 'https://www.linkedin.com/in/jan-kinne-63169312b/',
 'https://www.linkedin.com/in/sebastian-schmidt-a4688b153/',
 'https://istari.ai/blog/',
 'https://wirkungswerk.com',
 '/datenschutz',
 'https://istari.ai/en/webai/',
 'https://www.linkedin.com/in/david-lenz-9b98a4167/',
 'https://istari.ai/en/#Home-en',
 '/impressum',
 'https://istari.ai/en/innoprob/',
 'https://istari.ai/#Kontakt',
 'https://www.linkedin.com/in/rakshya-2199/',
 'https://istari.ai/en/prompt-availability-of-information/',
 'https://istari.ai/en/#Team_en',
 'https://istari.ai/en/artificial-intelligence-applications-use-cases/',
 'https://twitter.com/jan_kinne',
 'https://istari.ai/en/',
 'https://twitter.c
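The result above still contains relative links such as ```'/datenschutz'```, and the set has shuffled the original order. The following sketch shows one possible way to handle both, using only the standard library: ```urljoin``` resolves relative links against a base URL, and ```dict.fromkeys``` removes duplicates while keeping the first occurrence of each link in place. The base URL and the short input list are assumptions chosen for illustration, not the actual scraped data.

```python
from urllib.parse import urljoin

# Assumed base URL of the page the links were found on.
base_url = "https://istari.ai/en/"

# Illustrative input, similar in shape to the scraped href values.
raw_links = [
    "https://istari.ai/en/",
    "/datenschutz",
    None,  # .get("href") returns None for an <a> tag without an href
    "https://istari.ai/en/",
    "/impressum",
    "/datenschutz",
]

# Drop missing hrefs and turn relative links into absolute ones.
absolute_links = [urljoin(base_url, href) for href in raw_links if href]

# dict keys are unique and keep insertion order, so this de-duplicates in order.
unique_links = list(dict.fromkeys(absolute_links))
print(unique_links)
```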

If we want to extract specific elements from a webpage, we can pass a CSS class name to ```.find_all()``` in addition to the tag name.

In [24]:
BeautifulSoup(requests.get("http://www.istari.ai/en").text, "html.parser").find_all("div", "team-person-name")

[<div class="team-person-name title-h4">DR. DAVID LENZ</div>,
 <div class="team-person-name title-h4">DR. JAN KINNE</div>,
 <div class="team-person-name title-h4">MIRIAM KRÜGER</div>,
 <div class="team-person-name title-h4">DEVADEEP SEN</div>,
 <div class="team-person-name title-h4">SEBASTIAN SCHMIDT</div>,
 <div class="team-person-name title-h4">RAKSHYA KC</div>,
 <div class="team-person-name title-h4">Prof. Dr. Svenja Falk</div>,
 <div class="team-person-name title-h4">Dr. Georg Licht</div>,
 <div class="team-person-name title-h4">Prof. Dr. Irene Bertschek</div>,
 <div class="team-person-name title-h4">Prof. Dr. Knut Blind</div>,
 <div class="team-person-name title-h5">ROBERT DEHGHAN</div>,
 <div class="team-person-name title-h5">Dr. Milad Abbasiharofteh</div>,
 <div class="team-person-name title-h5">Johannes Dahlke</div>,
 <div class="team-person-name title-h5">Dr. Carlo Menon</div>,
 <div class="team-person-name title-h5">Dania Eugenidis</div>,
 <div class="team-person-name title-h
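The call above returns whole ```<div>``` elements. A small offline sketch of how the names themselves could be pulled out is shown below; the HTML string is a shortened, partly made-up stand-in for the real page (the ```other-class``` element is invented). The keyword ```class_``` is equivalent to passing the class name as the second positional argument.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page; the "other-class" div is invented.
team_html = """
<div class="team-person-name title-h4">DR. DAVID LENZ</div>
<div class="other-class">not a name</div>
<div class="team-person-name title-h5">Johannes Dahlke</div>
"""

team_soup = BeautifulSoup(team_html, "html.parser")

# class_ matches elements that carry this class, even alongside other classes.
name_tags = team_soup.find_all("div", class_="team-person-name")

# Reduce each Tag to its text content.
names = [tag.get_text(strip=True) for tag in name_tags]
print(names)
```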

In [None]:
attrs={"data-foo": "value"}
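A dictionary like the one in the cell above can be passed to ```.find_all()``` through its ```attrs``` parameter. This is useful for attributes such as ```data-foo``` whose names contain a hyphen and therefore cannot be written as Python keyword arguments. The HTML string below is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML with a custom data-* attribute.
data_html = '<div data-foo="value">match</div><div data-foo="other">no match</div>'
data_soup = BeautifulSoup(data_html, "html.parser")

# find_all(data-foo="value") would be a SyntaxError, so we pass a dict instead.
matches = data_soup.find_all(attrs={"data-foo": "value"})
print(matches)
```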

### Crawling