# Scraping Tutorial

This scraping tutorial was created for Tracking Injustice to aid in the creation of a living dataset tracking Canadian Police-Involved Deaths. For more information regarding Tracking Injustice, see https://trackinginjustice.ca/.

This tutorial is run and created by Emily Medema and Rohan Khan.

## Interactive Scraping Example

The best way to learn how to scrape is to do it. As we have seen in the slides (which can be found here: [Slide Link](http://tiny.cc/scraping-tutorial-slides)), websites differ greatly, so our scraping techniques have to be customized to each site. The best way to learn to do that, while following the guidelines laid out in the slides, is to do it yourself. Scraping is also a continuous process of creation and maintenance: a website may change in ways that make your script ineffective, and you must be able to adapt the script so that it keeps working.

First, we will set up our environment for scraping, and then we will work our way up from simple examples to more complex ones.

### Environmental Setup

We are using Jupyter Notebook, an open source web application that you can use to create and share documents that contain live code, equations, visualizations, and text. It is incredibly useful for Python. With Jupyter notebooks you can create a single document that explains, documents, and contains your code all in one place. This is very helpful for maintenance purposes.

Jupyter notebooks run on third-party sites, in a virtual environment, or on your localhost. You can therefore access whatever Python libraries are installed on that third-party site, virtual environment, or machine by simply importing the library. If you ever have to install a library from within the notebook itself, you can do so with this command:

```
!pip install libraryname
```

For this scraping tutorial, we will want the following libraries:

- pandas
- numpy
- urllib3
- bs4
- MechanicalSoup
- Scrapy
- selenium

You will most likely never need all of these for a single scraping project. Nevertheless, ensure that all of them are installed on whatever machine you are running Jupyter Notebook on.
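Before moving on, it can help to confirm which of these libraries the notebook's Python can actually import. The following is a minimal sketch using only the standard library. The helper name `missing_modules` and the `REQUIRED` list are illustrative, and the import names are assumed to match the imports used later in this tutorial (for example, MechanicalSoup is imported as `mechanicalsoup`).

```python
import importlib.util

# Import names (not pip package names) for the libraries listed above.
REQUIRED = ["pandas", "numpy", "urllib3", "bs4", "mechanicalsoup", "scrapy", "selenium"]


def missing_modules(names):
    """Return the names that Python's import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required libraries are importable.")
```

Anything reported as not installed can then be installed with the commands shown in this section.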

You can easily install all of these libraries if you have the [GitHub repository](https://github.com/emedema/scraping_tutorial) cloned by running the following command:

```
pip install -r requirements.txt
```

Now that we have all these libraries installed, we can import them into the notebook so that we can use their functions and classes within our code.

In [8]:
# import libraries
import pandas as pd
import numpy as np
import re
import urllib3
from bs4 import BeautifulSoup
import mechanicalsoup
import scrapy
import selenium

Now that we have our environment set up, we can move on to the first scraping exercise.

### Exercise 1

For our first exercise, we are going to scrape this site: http://olympus.realpython.org/profiles/aphrodite

This site was set up for building your first web scraper, so it is a good first step.

On the site, we can see that there are a few different pieces of information:

1. There is an image of Aphrodite
2. Her name is shown in an H2 tag as "Name: Aphrodite"
3. Other information is also shown on the page within the \<center> tag

To grab this information, we are going to use the `urllib` library to grab the source code for the site and then parse that code to get the information.

In [4]:
# we specifically want the urlopen method from urllib.request
from urllib.request import urlopen

# assign the url to a variable for ease
url = "http://olympus.realpython.org/profiles/aphrodite"

# to open a webpage pass the url to urlopen
page = urlopen(url)
page

<http.client.HTTPResponse at 0x17df5ff70>

`urlopen` returns an `HTTPResponse` object. To get the HTML, we first call the object's `.read()` method, which returns a sequence of bytes, and then use `.decode()` to decode the bytes to a string using UTF-8.

In [5]:
html_bytes = page.read()
html = html_bytes.decode("utf-8")

print(html)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>
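The fetch-and-decode steps above can be wrapped in a small helper. This is a sketch rather than part of the original notebook: the function name `fetch_html`, the 10-second timeout, and the fallback to UTF-8 when the server does not declare a character set are all assumptions.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string."""
    # The with-block closes the connection once the body has been read.
    with urlopen(url, timeout=timeout) as page:
        html_bytes = page.read()
        # Use the charset from the Content-Type header if the server sent one;
        # otherwise assume UTF-8 (an assumption, not a guarantee).
        charset = page.headers.get_content_charset() or "utf-8"
    return html_bytes.decode(charset)
```

Calling `fetch_html("http://olympus.realpython.org/profiles/aphrodite")` should return the same HTML string printed above, provided the site is reachable.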



Now that we have the text of the HTML, we could extract the information using Python's string methods.

We could use the `.find()` method like so:

In [6]:
start_index = html.find("<title>") + len("<title>")
end_index = html.find("</title>")
title = html[start_index:end_index]
title

'Profile: Aphrodite'

But this is not reliable, as HTML can differ drastically between sites and can be very messy. A simple change such as going from `<title>` to `<title >` can introduce a logic error in our scraper even though it does not affect how the site renders.
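To make that failure mode concrete, here is a small sketch run on a hand-written HTML string (the `messy_html` sample is made up for illustration). Because `.find()` returns -1 when the substring is missing, the slice silently starts at the wrong position instead of raising an error.

```python
messy_html = "<html><head><title >Profile: Dionysus</title></head></html>"

# Same approach as above: locate the tags with .find() and slice between them.
start_index = messy_html.find("<title>") + len("<title>")  # find() returns -1 here
end_index = messy_html.find("</title>")
title = messy_html[start_index:end_index]

print(messy_html.find("<title>"))  # -1, because the tag is written as "<title >"
print(title)                       # <head><title >Profile: Dionysus
```

No exception is raised, so the wrong title would only be noticed by inspecting the scraped data.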

Therefore, plain string methods are not the best tool for this job. A better way is to use regular expressions, or regex. Regular expressions are patterns that you can use to search for text within a string. Python supports regular expressions through the standard library’s `re` module, which we have already imported.

Regular expressions use special characters called metacharacters to denote different patterns. For instance, the asterisk character (*) stands for zero or more instances of whatever comes just before the asterisk.

In the following example, you use `.findall()` to find any text within a string that matches a given regular expression:

In [9]:
re.findall("ab*c", "ac")

['ac']

The first argument of `re.findall()` is the regular expression that you want to match, and the second argument is the string to test. In the above example, you search for the pattern "ab*c" in the string "ac".

The regular expression "ab*c" matches any part of the string that begins with "a", ends with "c", and has zero or more instances of "b" between the two. `re.findall()` returns a list of all matches. The string "ac" matches this pattern, so it’s returned in the list.

Here’s the same pattern applied to different strings:

In [11]:
print(re.findall("ab*c", "abcd"))
print(re.findall("ab*c", "acc"))
print(re.findall("ab*c", "abcac"))
print(re.findall("ab*c", "abdc"))

['abc']
['ac']
['abc', 'ac']
[]


Notice that if no match is found, then `.findall()` returns an empty list.

Pattern matching is case sensitive. If you want to match this pattern regardless of the case, then you can pass a third argument with the value `re.IGNORECASE`.
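As a brief sketch of that third argument, using the same pattern as above:

```python
import re

print(re.findall("ab*c", "ABC"))                 # [] because matching is case sensitive
print(re.findall("ab*c", "ABC", re.IGNORECASE))  # ['ABC']
```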

You can use a period (.) to stand for any single character in a regular expression. For instance, you could find all the strings that contain the letters "a" and "c" separated by a single character as follows:

In [12]:
print(re.findall("a.c", "abc"))
print(re.findall("a.c", "abbc"))
print(re.findall("a.c", "ac"))

['abc']
[]
[]


The pattern `.*` inside a regular expression stands for any character repeated any number of times. For instance, you can use "a.*c" to find every substring that starts with "a" and ends with "c", regardless of which letter (or letters) are in between.
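For illustration, here is that pattern applied to a few short strings chosen for this example:

```python
import re

print(re.findall("a.*c", "abc"))   # ['abc']
print(re.findall("a.*c", "abbc"))  # ['abbc']
print(re.findall("a.*c", "ac"))    # ['ac']
print(re.findall("a.*c", "acc"))   # ['acc'], since .* runs on to the last "c"
```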

Often, you use `re.search()` to search for a particular pattern inside a string. This function is somewhat more complicated than `re.findall()` because, instead of a list of strings, it returns a match object (`re.Match`, called MatchObject in older documentation) describing the first place where the pattern matches, or `None` if there is no match. The match object stores the matched text along with any captured groups.

The details of the match object are not important here. For now, just know that calling `.group()` on it returns the entire matched text, which in most cases is just what you want:

In [13]:
match_results = re.search("ab*c", "ABC", re.IGNORECASE)
match_results.group()

'ABC'
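One detail worth keeping in mind is that calling `.group()` directly on the result of `re.search()` raises an `AttributeError` whenever the pattern is not found, because the result is then `None`. A minimal sketch of a guard, where the helper name `first_match` is made up for this example:

```python
import re


def first_match(pattern, text, flags=0):
    """Return the first matching substring, or None if nothing matches."""
    match = re.search(pattern, text, flags)
    if match is None:
        return None
    return match.group()


print(first_match("ab*c", "xxabbcxx"))       # abbc
print(first_match("ab*c", "xyz"))            # None
print(re.search("ab*c", "xxabbcxx").span())  # (2, 6): start and end of the match
```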

There’s one more function in the `re` module that’s useful for parsing out text. `re.sub()`, which is short for substitute, allows you to replace the text in a string that matches a regular expression with new text. It behaves sort of like the `.replace()` string method.

The arguments passed to `re.sub()` are the regular expression, followed by the replacement text, followed by the string. Here’s an example:

In [14]:
string = "Everything is <replaced> if it's in <tags>."
string = re.sub("<.*>", "ELEPHANTS", string)
string

'Everything is ELEPHANTS.'

As this output shows, we need to be careful with regex: a pattern can easily match far more than we intended.

`re.sub()` uses the regular expression "<.*>" to find and replace everything between the first < and the last >, which spans from the beginning of \<replaced> to the end of \<tags>. This is because Python’s regular expressions are greedy, meaning they try to find the longest possible match when characters like * are used.

Alternatively, you can use the non-greedy matching pattern `*?`, which works the same way as `*` except that it matches the shortest possible string of text:

In [15]:
string = "Everything is <replaced> if it's in <tags>."
string = re.sub("<.*?>", "ELEPHANTS", string)
string

"Everything is ELEPHANTS if it's in ELEPHANTS."

This time, `re.sub()` finds two matches, \<replaced> and \<tags>, and substitutes the string "ELEPHANTS" for both matches.
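To see exactly what each pattern captures, `re.findall()` can list the matches instead of replacing them. This is just an illustrative check on the same example string:

```python
import re

string = "Everything is <replaced> if it's in <tags>."

greedy = re.findall("<.*>", string)
lazy = re.findall("<.*?>", string)

print(greedy)  # ["<replaced> if it's in <tags>"]
print(lazy)    # ['<replaced>', '<tags>']
```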
    
Now let's try this on our Aphrodite page.

In [16]:
# regex pattern, searching for the text between the title tags
pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
# Remove HTML tags
title = re.sub("<.*?>", "", title)

print(title)

Profile: Aphrodite


Let's try it on another site with messier HTML, such as this one: http://olympus.realpython.org/profiles/dionysus

In [17]:
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title) # Remove HTML tags

print(title)

Profile: Dionysus


Let's take a closer look at the first regular expression in the pattern string by breaking it down into three parts:

1. <title.*?> matches the opening \<TITLE > tag in html. The \<title part of the pattern matches with \<TITLE because re.search() is called with re.IGNORECASE, and .*?> matches any text after \<TITLE up to the first instance of >.

2. .*? non-greedily matches all text after the opening \<TITLE >, stopping at the first match for \</title.*?>.

3. \</title.*?> differs from the first pattern only in its use of the / character, so it matches the closing \</title  / > tag in html.

The second regular expression, the string "<.*?>", also uses the non-greedy .*? to match all the HTML tags in the title string. By replacing any matches with "", `re.sub()` removes all the tags and returns only the text.
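Putting the two regular expressions together, the title extraction could be packaged as a reusable function. This is a sketch, not the notebook's own code: the name `extract_title` is hypothetical, a capture group is used to grab the text between the tags, `re.DOTALL` is added on the assumption that a title may span several lines, and the function returns `None` rather than failing when no title is found. The `sample` string is a hand-written imitation of the messy Dionysus markup.

```python
import re

TITLE_PATTERN = re.compile(r"<title.*?>(.*?)</title.*?>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_PATTERN.search(html)
    if match is None:
        return None
    # group(1) is the text captured between the opening and closing tags;
    # remove any stray tags inside it, then surrounding whitespace.
    return re.sub(r"<.*?>", "", match.group(1)).strip()


sample = "<head>\n<TITLE >Profile: Dionysus</title  / >\n</head>"
print(extract_title(sample))             # Profile: Dionysus
print(extract_title("<p>no title</p>"))  # None
```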

### Exercise 2

Although regular expressions are great for pattern matching in general, sometimes it’s easier to use an HTML parser that’s explicitly designed for parsing out HTML pages. There are many Python tools written for this purpose, but the Beautiful Soup library is a good one to start with.

To use `BeautifulSoup`, we will create a program that does the following:

   1. Opens the URL http://olympus.realpython.org/profiles/dionysus by using `urlopen()` from the urllib.request module

   2. Reads the HTML from the page as a string and assigns it to the html variable

   3. Creates a BeautifulSoup object and assigns it to the soup variable

The `BeautifulSoup` object assigned to soup is created with two arguments. The first argument is the HTML to be parsed, and the second argument, the string "html.parser", tells the object which parser to use behind the scenes. "html.parser" represents Python’s built-in HTML parser.

In [18]:
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

Now that we have the HTML parsed into a soup object, we can use its methods, such as `get_text()`, which extracts all the text from the document and automatically removes any HTML tags.

In [20]:
print(soup.get_text())



Profile: Dionysus





Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard 

Favorite Color: Wine






There are a lot of blank lines in this output. These are the result of newline characters in the HTML document’s text. You can remove them with the `.replace()` string method if you need to.

Often, you need to get only specific text from an HTML document. Using Beautiful Soup first to extract the text and then using the `.find()` string method is sometimes easier than working with regular expressions.
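As a sketch of that text-first approach, the snippet below works on a hard-coded string that imitates the `get_text()` output above (so it runs without a network connection), drops the blank lines, and then splits each remaining line at its first colon. It uses `.splitlines()` and `.partition()` rather than `.replace()` or `.find()`, a substitution made here for brevity. The variable names are illustrative.

```python
# A stand-in for soup.get_text() on the Dionysus page.
text = (
    "\n\nProfile: Dionysus\n\n\n\n\nName: Dionysus\n\n"
    "Hometown: Mount Olympus\n\nFavorite animal: Leopard \n\nFavorite Color: Wine\n\n\n"
)

# Keep only lines that contain something other than whitespace.
lines = [line.strip() for line in text.splitlines() if line.strip()]

# Split "Label: value" lines into a dictionary.
profile = {}
for line in lines:
    label, sep, value = line.partition(":")
    if sep:  # skip lines without a colon
        profile[label.strip()] = value.strip()

print(profile["Hometown"])         # Mount Olympus
print(profile["Favorite animal"])  # Leopard
```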

However, other times the HTML tags themselves are the elements that point out the data you want to retrieve. For instance, perhaps you want to retrieve the URLs for all the images on the page. These links are contained in the src attribute of \<img> HTML tags.

In this case, you can use `find_all()` to return a list of all instances of that particular tag:

In [21]:
soup.find_all("img")

[<img src="/static/dionysus.jpg"/>, <img src="/static/grapes.png"/>]

This returns a list of all \<img> tags in the HTML document. The objects in the list look like they might be strings representing the tags, but they’re actually instances of the Tag object provided by Beautiful Soup. Tag objects provide a simple interface for working with the information they contain.

You can explore this a little by first unpacking the Tag objects from the list:

In [22]:
image1, image2 = soup.find_all("img")

Each Tag object has a .name property that returns a string containing the HTML tag type:

In [23]:
image1.name

'img'

You can access the HTML attributes of the Tag object by putting their names between square brackets, just as if the attributes were keys in a dictionary.

For example, the `<img src="/static/dionysus.jpg"/>` tag has a single attribute, src, with the value "/static/dionysus.jpg". Likewise, an HTML tag such as the link `<a href="https://realpython.com" target="_blank">` has two attributes, href and target.

To get the source of the images in the Dionysus profile page, you access the src attribute using the dictionary notation mentioned above:

In [25]:
image2["src"]

'/static/grapes.png'
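Note that the `src` values here are relative paths. To download the images you would need absolute URLs, which the standard library's `urljoin` can build from the page URL. A minimal sketch, with the `src` values copied from the output above and illustrative variable names:

```python
from urllib.parse import urljoin

page_url = "http://olympus.realpython.org/profiles/dionysus"
image_sources = ["/static/dionysus.jpg", "/static/grapes.png"]

# Resolve each relative path against the URL of the page it came from.
image_urls = [urljoin(page_url, src) for src in image_sources]
for image_url in image_urls:
    print(image_url)
# http://olympus.realpython.org/static/dionysus.jpg
# http://olympus.realpython.org/static/grapes.png
```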

You can also access certain tags from the HTML page such as title.

In [26]:
soup.title

<title>Profile: Dionysus</title>

In [27]:
soup.title.string

'Profile: Dionysus'

One of the features of Beautiful Soup is the ability to search for specific kinds of tags whose attributes match certain values. For example, if you want to find all the \<img> tags that have a src attribute equal to the value /static/dionysus.jpg, then you can pass the attribute as an additional keyword argument to `.find_all()`, as in `soup.find_all("img", src="/static/dionysus.jpg")`.
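A minimal sketch of such a filtered search is shown here on a small inline HTML string, so it runs without fetching the page. The `sample_html` string is a cut-down stand-in for the Dionysus profile, not the real page source. Beautiful Soup treats keyword arguments it does not otherwise recognise as attribute filters.

```python
from bs4 import BeautifulSoup

sample_html = """
<html><body>
<img src="/static/dionysus.jpg" />
<img src="/static/grapes.png" />
</body></html>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Only <img> tags whose src attribute equals the given value are returned.
matches = soup.find_all("img", src="/static/dionysus.jpg")
print(matches)  # [<img src="/static/dionysus.jpg"/>]
```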

This example is somewhat arbitrary, and the usefulness of this technique may not be apparent from the example. If you spend some time browsing various websites and viewing their page sources, then you’ll notice that many websites have extremely complicated HTML structures.

When scraping data from websites with Python, you’re often interested in particular parts of the page. By spending some time looking through the HTML document, you can identify tags with unique attributes that you can use to extract the data you need.

Then, instead of relying on complicated regular expressions or using .find() to search through the document, you can directly access the particular tag that you’re interested in and extract the data you need.

In some cases, you may find that Beautiful Soup doesn’t offer the functionality you need. The `lxml` library is somewhat trickier to get started with but offers far more flexibility than Beautiful Soup for parsing HTML documents. You may want to check it out once you’re comfortable using Beautiful Soup.

### Exercise 3

Beautiful Soup is great for scraping data from a website’s HTML, but it doesn’t provide any way to work with HTML forms. For example, if you need to search a website for some query and then scrape the results, then Beautiful Soup alone won’t get you very far. In this case, we can use `MechanicalSoup`.