# 7. Web Scraping

### Definition

**Web scraping** is the automated extraction (scraping) of data from webpages on the Internet. The program that performs this task is usually called a **web scraper** or a **bot**.

**Web crawling** is the process of exploring and oftentimes indexing the webpages on the Internet by following hyperlinks from webpage to webpage. The program that performs this task is usually called a **spider** or **web crawler**.

Oftentimes, web scraping and web crawling are combined into a single program. I will continue using "web scraping" to denote both approaches.

Web scraping can be used for both **focused crawls**, which concentrate on crawling and scraping a single website (e.g. amazon.com), and **broad crawls**, which do the same across many different websites.

Common **use cases** for web scraping are:
- search engines
- price monitoring
- content aggregators
- collecting massive amounts of text data for the training of language models
- copying online databases
- research data

### HTML

The HyperText Markup Language (HTML) is the standard markup language for documents designed to be displayed in a web browser. Web browsers receive HTML documents from a web server and render the documents into multimedia web pages.

This is an example of a simple HTML document:

```
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h6>This is a Heading</h6>
<p>This is a paragraph.</p>

</body>
</html>
```

We can "execute" HTML directly in the cells of our Jupyter notebook:

<h6>This is a Heading</h6>

We can check out the HTML source code of any website in our browser. To do so, either right-click anywhere on the website and select "show source code" (or similar), or use the shortcut **ctrl** + **u**.

Original website:

<img src="../misc/istari.png" width="600">

HTML source code:

<img src="../misc/istari_source.png" width="600">

HTML consists of a series of elements which tell the browser how to display the content. An **HTML element** is defined by a **start tag** ```<tag>```, some **content** (e.g. text or a hyperlink), and an **end tag** ```</tag>```:

```<tagname>Content goes here...</tagname>``` 

There are many different HTML elements. Some of the most frequently used are:

- **headings** are defined with the ```<h1>``` to ```<h6>``` tags
- **paragraphs** are defined with the ```<p>``` tag
- **links** are defined with the ```<a>``` tag
- **images** are defined with the ```<img>``` tag

<h6>Small Heading</h6>
<p>This is a paragraph with a <a href="https://www.google.com">Link</a>.</p>
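
The start tag, content, and end tag structure described above can be made visible with Python's built-in `html.parser` module. The following is only a minimal sketch: the class name `ElementLogger` and the example URL are made up for illustration, and this is not the tool we use for scraping later on.

```python
from html.parser import HTMLParser


class ElementLogger(HTMLParser):
    """Record start tags, text content, and end tags in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("href", "...")]
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        # skip whitespace-only text between elements
        if data.strip():
            self.events.append(("content", data.strip()))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = ElementLogger()
parser.feed('<p>This is a paragraph with a <a href="https://www.example.com">Link</a>.</p>')

for event in parser.events:
    print(event)
```

Each printed tuple corresponds to one piece of an element, so the nesting of the `<a>` element inside the `<p>` element shows up in the order of the events.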

### Python requests

We will use a Python package called *requests* to request and retrieve HTML from webpages. First, install requests using pip:

```pip install requests```

After installation (and restarting the Jupyter kernel), we have to import the package.

In [None]:
import requests

Now we can request HTML from any website using its **URL** (Uniform Resource Locator), colloquially termed a **web address**, and passing it to ```.get()```.

In [None]:
requests.get("http://www.example.com")

This returns ```<Response [200]>```, a ```Response``` object containing everything the server sent back in reply to our request.

In [None]:
type(requests.get("http://www.example.com"))

The ```200``` is an HTTP status code which stands for "OK". It is the standard response for successful HTTP requests. Other important status codes are:

- ```301``` Moved Permanently
- ```403``` Forbidden
- ```404``` Not Found


We can use ```.text``` on the response object to retrieve the HTML code.

In [None]:
response = requests.get("http://www.example.com")
response.text

### Beautifulsoup

A convenient way to extract content from HTML is the Python package *Beautiful Soup* (```beautifulsoup4``` on PyPI), a library for pulling data out of HTML files. So let's install and import it:

```pip install beautifulsoup4```

In [None]:
from bs4 import BeautifulSoup

First we want to use BeautifulSoup to create a BeautifulSoup object, which represents the HTML document as a nested data structure.

In [None]:
soup = BeautifulSoup(response.text, "html.parser") # name the parser explicitly to avoid a warning
soup

We can now use our BeautifulSoup object to directly retrieve elements from the HTML code. For example, ```.title``` extracts the title of the HTML document.

In [None]:
soup.title

This actually returns a ```Tag``` object.

In [None]:
type(soup.title)

If we want to get the content of the tag as a string, we just have to add a ```.string```.

In [None]:
soup.title.string

In a similar manner, we can access other elements by their tag name. This returns the first matching element.

In [None]:
soup.p

There are also handy functions included. ```.get_text()``` retrieves all strings from the HTML code. We can define a ```separator``` (e.g. ```separator=' '```) to separate the individual strings and set ```strip=True``` to remove leading and trailing whitespace (including newline characters) from each string.

In [None]:
soup.get_text(separator=' ', strip=True)

Extracting all texts from a webpage boils down to a single line of Python code.

In [None]:
BeautifulSoup(requests.get("http://www.istari.ai/en").text, "html.parser").get_text(separator=' ', strip=True)

If we want to find all the hyperlinks on a webpage, we can apply ```.find_all()``` on our BeautifulSoup object and pass the ```"a"``` tag. This returns a list of ```<a>``` elements.

In [None]:
all_hyperlinks = BeautifulSoup(requests.get("http://www.istari.ai/en").text, "html.parser").find_all("a")
all_hyperlinks

To retrieve the actual hyperlinks, we have to apply ```.get("href")``` on the individual ```Tag``` objects. We can do so by iterating over the complete list.

In [None]:
for link in all_hyperlinks:
    print(link.get("href"))
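
Many of the extracted ```href``` values are relative (e.g. ```/products```) rather than complete URLs, and ```.get("href")``` returns ```None``` for ```<a>``` elements without an ```href``` attribute. A minimal sketch of how such values could be turned into absolute URLs with ```urljoin``` from the standard library; the base URL and the list of hrefs are made-up placeholders:

```python
from urllib.parse import urljoin

base_url = "http://www.example.com/en/"  # page on which the links were found (placeholder)
hrefs = ["/products", "team.html", "https://other.example.org/", "#top", None]

# skip missing hrefs and resolve the rest against the base URL
absolute_links = [urljoin(base_url, href) for href in hrefs if href]

for link in absolute_links:
    print(link)
```

Links that are already absolute are returned unchanged, so the same call can be applied to every extracted value.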

As you can see, there are quite a lot of duplicate links included. To get rid of them, the easiest way is to first extract the actual hyperlinks from the ```a``` elements and to put them into a list.

In [None]:
hyperlink_list = [link.get("href") for link in all_hyperlinks]
print(len(hyperlink_list))

We can then convert this list to a ```set```. A set is unordered and cannot contain duplicate values, so the conversion "automatically" drops all duplicate entries from our list. We then just convert the set back to a list using ```list()```.

In [None]:
unique_hyperlink_list = list(set(hyperlink_list))
print(len(unique_hyperlink_list))
unique_hyperlink_list
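
Because sets are unordered, the links may come back in a different order than they appeared on the page. If the original order matters, one alternative (a small sketch with a made-up list of links) is ```dict.fromkeys()```, which keeps the first occurrence of every entry:

```python
hyperlinks = ["/en", "/team", "/en", "/products", "/team"]

# dictionary keys are unique and keep their insertion order
unique_in_order = list(dict.fromkeys(hyperlinks))
print(unique_in_order)  # ['/en', '/team', '/products']
```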

If we want to extract specific elements from the HTML document, we can also use ```.find_all()``` for that, for example all ```div``` elements with the CSS class ```team-person-name```.

In [None]:
BeautifulSoup(requests.get("http://www.istari.ai/en").text, "html.parser").find_all("div", "team-person-name")
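
The same kind of lookup can be tried offline on a small HTML snippet. In this sketch the snippet and the names in it are invented, and the class is passed with the explicit keyword ```class_```, which behaves like passing the class name as the second positional argument:

```python
from bs4 import BeautifulSoup

# made-up HTML snippet, only the class name is taken from the example above
html = """
<div class="team-person-name">Ada</div>
<div class="team-person-role">Engineer</div>
<div class="team-person-name">Grace</div>
"""

soup = BeautifulSoup(html, "html.parser")
name_tags = soup.find_all("div", class_="team-person-name")
names = [tag.get_text(strip=True) for tag in name_tags]
print(names)
```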

### Building a simple scraper/crawler

Let's continue with building a simple crawler with the following functionalities:
1. query websites using a random useragent header (string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent)
2. extract texts from these websites
3. extract hyperlinks from these websites
4. follow one of these hyperlinks but make sure not to query a visited website again
5. repeat until a pre-defined number of websites were successfully scraped

The user agent identifies us (or our web scraper) as a user to the server to which we send our request. It usually contains information about our operating system, resolution, browser and preferred language. Especially if we don't want to be recognized as a bot, it is recommended not to use the same user agent over and over again. For this we will use the Python package ```fake-useragent``` which you have to install first using pip.

```pip install fake-useragent```

In [None]:
from fake_useragent import UserAgent

Getting a random useragent (based on actual user statistics) is super easy:

In [None]:
UserAgent().random

Ideally, every request should also go through a proxy that masks our actual IP. For this you need a proxy provider, which forwards our requests in its "name" (IP) to the final destination. Assuming we had a proxy provider, we could simply pass the proxy's address via the ```proxies``` parameter:

```requests.get(url, proxies = { 'http': "http://182.52.51.155:39236", 'https': "https://182.52.51.155:39236"})```.

Next we want to define a simple function that takes a url string as input and cleans it.

In [None]:
def clean_url(url):
    url = ".".join(url.split(".")[-2:]) # extract main domain
    url = "http://" + url.replace("http://", "").replace("https://", "").replace("www.", "") # make sure the format is "http://example.com"
    return url

Let's try this.

In [None]:
clean_url("www.cloud.istari.ai")
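
To get a feeling for what the function does, and where it falls short, we can run it on a few more inputs. The block repeats the function so that it runs on its own; the additional example domains are made up. Note that keeping only the last two parts of the domain is a simplification that breaks for suffixes such as ```.co.uk```.

```python
def clean_url(url):
    # same logic as the function defined above
    url = ".".join(url.split(".")[-2:])
    url = "http://" + url.replace("http://", "").replace("https://", "").replace("www.", "")
    return url


examples = ["www.cloud.istari.ai", "https://example.com", "www.example.co.uk"]
cleaned = [clean_url(example) for example in examples]

for example, result in zip(examples, cleaned):
    print(example, "->", result)
```

The last example ends up as ```http://co.uk```, which is fine for this tutorial but would need a list of public suffixes to be handled correctly.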

We also need a function that extracts hyperlinks from BeautifulSoup objects and then extracts the associated domains, e.g. *istari.ai/products* should become *istari.ai*. 

In [None]:
from urllib.parse import urlparse #urlparse extracts domains from urls

def get_unqiue_domains_from_soup(soup):
    all_hyperlinks = soup.find_all("a", href=True) # get hyperlinks from soup, skipping <a> tags without href
    unique_domains = list(set([urlparse(link.get("href")).netloc for link in all_hyperlinks])) # get unique domains from hyperlinks
    unique_domains = [clean_url(domain) for domain in unique_domains if len(domain) > 0] # clean domains and filter empty results
    
    return unique_domains
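
The domain extraction relies on ```urlparse```, which splits a URL into its components. A short sketch of what it returns (the query string and fragment in the example URL are made up):

```python
from urllib.parse import urlparse

parts = urlparse("https://www.istari.ai/products?lang=en#top")

print(parts.scheme)    # https
print(parts.netloc)    # www.istari.ai
print(parts.path)      # /products
print(parts.query)     # lang=en
print(parts.fragment)  # top

# a relative link has no domain, so netloc is an empty string
relative_netloc = urlparse("/products").netloc
print(repr(relative_netloc))  # ''
```

This is why the function filters out empty results: relative links point to the same website and do not contribute a new domain.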

Next, we also want to have a function that returns a timestamp string.

In [None]:
import datetime

def get_timestamp():
    return datetime.datetime.now().strftime('%d.%m.%Y %H:%M:%S')

And now we can build our simple web scraper.

In [None]:
import random

next_url = "http://www.mannheim.de"
visited_domains = []
unvisited_domains = []
scraped_domains = []
texts = []
timestamps = []

scraping_limit = 10


while len(scraped_domains) < scraping_limit:
    try:
        # request page (the timeout keeps a slow server from blocking the loop)
        response = requests.get(next_url, headers={'User-Agent': UserAgent().random}, timeout=10)

        # extract html and build soup
        html = response.text
        bs = BeautifulSoup(html, "html.parser")

        # extract text from html
        text = bs.get_text(separator=' ', strip=True)

        # get domain of response
        domain = clean_url(urlparse(response.url).netloc)

        # store the results together so that the three lists stay aligned
        timestamps.append(get_timestamp())
        texts.append(text)
        scraped_domains.append(domain)
        visited_domains.append(domain)

        # get all unique domains and add previously unvisited domains
        unvisited_domains = list(set(get_unqiue_domains_from_soup(bs) + unvisited_domains))

        print("Successfully scraped:", domain)

    except Exception as e:
        print(e)
        print("Failed to scrape", next_url)
        visited_domains.append(next_url)

    # filter visited domains and stop if there is nothing left to visit
    unvisited_domains = [link for link in unvisited_domains if link not in visited_domains]
    if not unvisited_domains:
        print("No unvisited domains left.")
        break

    # select next website to visit at random
    next_url = random.choice(unvisited_domains)
    print("Now scraping:", next_url)
    print("########################")

print("Finished after scraping", len(scraped_domains), "websites.")
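
The bookkeeping in the loop above (which domains were visited, which are still waiting) can also be tried without any network access. The following sketch replaces the real requests with a made-up link graph stored in a dictionary; the names ```link_graph``` and ```crawl``` are hypothetical and only serve this illustration.

```python
import random

# hypothetical link graph: domain -> domains it links to
link_graph = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["a.example", "d.example"],
    "c.example": ["d.example"],
    "d.example": [],
}


def crawl(start, graph, limit, seed=0):
    """Visit up to `limit` domains, never visiting the same one twice."""
    rng = random.Random(seed)  # seeded so that the run is reproducible
    visited = []
    frontier = {start}  # discovered but not yet visited domains

    while frontier and len(visited) < limit:
        current = rng.choice(sorted(frontier))  # pick the next domain at random
        frontier.remove(current)
        visited.append(current)
        # add newly discovered domains that have not been visited yet
        frontier.update(d for d in graph.get(current, []) if d not in visited)

    return visited


order = crawl("a.example", link_graph, limit=10)
print(order)
```

The loop ends either when the limit is reached or when no unvisited domains are left, which is the situation a real crawler has to handle as well.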

The results can be transferred to a pandas DataFrame for further analysis (next week).

In [None]:
import pandas as pd

pd.DataFrame(list(zip(scraped_domains, timestamps, texts)), columns =["url", "timestamp", "text"])