<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2023 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email author@email.address.<br />
____

# Web Scraping Toolkit 2

This is lesson 2 of 3 in the educational series on `Web Scraping`. This notebook is intended to teach the core problem-solving perspectives and tools for web scraping.

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` 

**Difficulty:** Intermediate

**Completion time:** 90 minutes

**Knowledge Required:** 

* Python basics (variables, flow control, functions, lists)
* Basic file operations (open, close, read, write)

**Knowledge Recommended:**

* basic html/websites

**Learning Objectives:**
After this lesson, learners will be able to:

1. Use `pathlib` and `requests`.
2. Handle delays in a crawler.
3. Generate URLs and download files.

**Research Pipeline:**

1. You have a research question and data in mind.
2. You've found some data you want to use.
3. **The data is on a website somewhere and you want to get it off the site and into a data file.**
4. You do your analysis or other data prep!


# Required Python Libraries

* `requests` for downloading things
* `bs4` (Beautiful Soup) for parsing html
* `lxml` for parsing html and running xpath queries

## Install Required Libraries

In [2]:
### Install Libraries ###

# Using !pip installs
!pip install requests
!pip install lxml
!pip install bs4


In [12]:
### Import Libraries ###

#3rd party
import requests
from lxml import etree
from bs4 import BeautifulSoup

import re
import time
import pathlib
import csv

# Data management and file organization

Some diagrams pulled from previous work: https://www.ideals.illinois.edu/items/93987

Some themes for organizing your data with web scraping:

* keep the project in a folder, ideally under version control
	* yes things get weird when you start having thousands of files, but that shouldn't stop you
* use folders within that folder to contain things
* be consistent with names
* put meaningful information in the filenames
	* an ID or other content that will easily connect things back to the original source
* retain that meaning between files as you convert them
	* eg use the same identifier information between different versions of the entity 
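As a rough illustration of these themes, the sketch below keeps one identifier through every stage of a made-up project. The `my_project` root, the stage folder names (`raw`, `text`), and the `derived_path` helper are all invented for this example; use whatever layout suits your own work.

```python
import pathlib

# Hypothetical project layout: one folder per processing stage.
PROJECT = pathlib.Path("my_project")

def derived_path(source_id, stage, suffix):
    """Build a path that keeps the same ID across every version of an item."""
    return PROJECT / stage / f"{source_id}{suffix}"

# The same ID ties the downloaded image to its later text version.
raw_file = derived_path("1209_2448", "raw", ".jpeg")
text_file = derived_path("1209_2448", "text", ".txt")

print(raw_file)
print(text_file)
```

Because both names share the `1209_2448` stem, either file can be traced back to the original source without a lookup table.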

## Before we move on, some considerations

There are a few things to consider before we move on to programmatically downloading things from someone else's server. Not every website wants to be scraped. Some have restrictions, some have blocks, and there's a certain kind of etiquette that we want to follow.

First, we want to keep the rate at which we hit their server reasonable. This usually means a delay of at least 4 seconds between requests, but I've worked with pages that asked for 30-second delays.

Second, some ask that you only do large scale harvesting or scraping during "off peak" times. This often means overnight.

Third, some pages may just completely ban scraping tools from being used. Usually this is because they have an API they'd prefer you to use (and usually pay for) or because the data is sensitive in some way. Let's look at a few examples.

* Linkedin has a hard block on programmatic web scraping because their data is really valuable and they want to sell it to you.
* Google will quickly block you from scraping their results because they want you to use an API. Many of theirs are open and reasonable to use, but they don't want HTML scraping.
* Archive of Our Own (AO3) has a block against it because they don't want search engines to index the results. This gives them control over story and author information and the ability to fully take things down as needed. 

But how can you know for sure? This can be hard and there's no single answer. You can often check the `robots.txt` file for the website. You can read about this file here: https://en.wikipedia.org/wiki/Robots.txt. Very generally, it will contain information for humans and for bots, and give you an idea about limitations, etc. Not every site will have it, but most with data will.

* https://en.wikipedia.org/robots.txt
* https://archiveofourown.org/robots.txt
	* my favorite "cruel but efficient"
	* note the crawl delay
* https://www.fanfiction.net/robots.txt

You can ask for permission to go out of bounds for this, especially for research. Just be respectful.
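If you would rather check a `robots.txt` from code, Python's standard library has `urllib.robotparser`. The sketch below parses a made-up file offline, so no server is contacted; the rules, the `example.org` URLs, and the `my-research-bot` agent name are all assumptions for illustration. For a live site you would point the parser at the real file with `set_url()` and `read()` instead of `parse()`.

```python
from urllib import robotparser

# A made-up robots.txt, parsed offline so no server is contacted.
sample_robots = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(sample_robots.splitlines())

agent = "my-research-bot"  # hypothetical name for our scraper
allowed_public = parser.can_fetch(agent, "https://example.org/public/page.html")
allowed_private = parser.can_fetch(agent, "https://example.org/private/data.html")
delay = parser.crawl_delay(agent)  # None when no Crawl-delay is given

print(allowed_public, allowed_private, delay)
```

This only reports what the file says. It does not replace reading the site's terms or asking for permission.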

## Handling delays

Most programming languages will have some ability to "delay" actions. We will use the `time` module in Python to delay our execution.

`time.sleep(seconds)` takes a number of seconds and pauses script execution for that long. Other languages use `ms` instead, so be mindful if switching!

In [4]:
import time

for _ in range(3):
	print("hello!")
	time.sleep(5)

hello!
hello!
hello!
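One optional variation is to wrap the pause in a small helper that adds a little random extra time, so requests don't land on a perfectly regular beat. This is only a sketch: the `polite_pause` name and the jitter idea are assumptions added here, not something a server requires.

```python
import random
import time

def polite_pause(base_seconds, jitter_seconds=0.0):
    """Sleep for base_seconds plus a random extra, and return the time slept."""
    delay = base_seconds + random.uniform(0, jitter_seconds)
    time.sleep(delay)
    return delay

# Tiny values so the demo finishes quickly; a real crawl would use
# something like polite_pause(5, 2).
waited = polite_pause(0.01, 0.01)
print(f"waited {waited:.3f} seconds")
```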


## Downloading things off one page

Let's start with the simplest version: we have one page with a set of links, and we want to download whatever those links point to. What those files are doesn't really matter, because you're downloading them straight to disk.

So what I love about this page is that they just have the sql statement right at the top of the page. 

https://calphotos.berkeley.edu/cgi/img_query?where-taxon=Allium+anceps

Let's take a look at the structure here:

* clearly these are coming from a database
* there are multiple pages
* the images are displayed on the page
* there are detail links by each image
* being displayed in a table

Tip: Chrome XPath Helper tool

I like to use this to preview the structure of the elements.

There are a variety of tools you can use for this part! Our basic goal here is to get the URL for each of the pictures. Once we have those collected, we can run through them to download each. I'm going to provide these URLs for now so we can focus on the downloading.

Just a small preview of this xpath we'll be using:

`//td//img/@src`

* we can use `//img` to get all the images on the page, but most pages will have other images. Best practice is to include something more specific to disambiguate. This is why I have `td` in here.
* Using `@src` allows me to request that it return the value for the source property
* the URLs for the images appear to have a specific folder structure, which I could have also used to gather them
* the URLs gathered are relative links, meaning that I'll need to build the full URL when I'm doing my pass over them. 
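To see the idea behind that query without installing anything, here is a small stand-in that uses the standard library's `ElementTree` on a made-up, well-formed snippet. This is not the tool used in this lesson: real pages are rarely well-formed enough for `ElementTree`, and it has no `/@src` step, so the attribute is read with `.get()` instead. The snippet and its paths are invented.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed snippet: one logo outside the table, two images inside it.
page = """
<html><body>
  <img src="/logo.png"/>
  <table><tr>
    <td><a href="/detail/1"><img src="/imgs/a.jpeg"/></a></td>
    <td><img src="/imgs/b.jpeg"/></td>
  </tr></table>
</body></html>
"""

root = ET.fromstring(page)

# ".//td//img" mirrors "//td//img"; the src value is read with .get().
srcs = [img.get("src") for img in root.findall(".//td//img")]
print(srcs)
```

The logo is skipped because it does not sit inside a `td`, which is the disambiguation the first bullet describes.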

Let's open the text file with the URLs and start building those up. As mentioned, these are relative links, so we will need to do a bit of editing to get them into the full pattern. You can check out a link on the main page to inspect what the full URL should be and what the relative links are. Looking at that, we can work out what the "base" URL should be.

Here's a full link:
`https://calphotos.berkeley.edu/imgs/128x192/0000_0000/1209/2448.jpeg`

And here's the corresponding relative link: 

`imgs/128x192/0000_0000/1209/2448.jpeg`

This means we'll need to prepend `https://calphotos.berkeley.edu` before each URL to have the full one. There are several ways you can do this and this is a great time to practice your core Python skills. 

Some notes:

* using list comprehension syntax here
* using `readlines` to read it in, which returns a list of strings, each string is a line from the file plus a newline character
* `strip` is needed to take the ending newline character off
* I'm concatenating the base before the url from the line, but note that I didn't include the final / because there's already an opening one from the url. 
* This will result in a list of all the urls.

In [6]:
with open('pictures.txt', 'r', encoding = 'utf-8') as infile:
    # urls = infile.readlines()
    urls = ['https://calphotos.berkeley.edu' + u.strip() for u in infile.readlines()]
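If you aren't sure whether every relative link starts with a `/`, one alternative is `urljoin` from the standard library, which resolves either form against a base URL. This is a sketch of that option, not a replacement for the concatenation above; the two sample links are the same path written both ways.

```python
from urllib.parse import urljoin

BASE = "https://calphotos.berkeley.edu/"

# Two spellings of the same relative link, with and without a leading slash.
relative_links = [
    "/imgs/128x192/0000_0000/1209/2448.jpeg",
    "imgs/128x192/0000_0000/1209/2448.jpeg",
]

# urljoin resolves either form against the base URL.
full_urls = [urljoin(BASE, link.strip()) for link in relative_links]
for url in full_urls:
    print(url)
```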

## Working with `requests`

Let's try something basic!

In [7]:
import requests

url = "https://loripsum.net/api/1/plaintext/short"
result = requests.get(url)

print(result)

<Response [200]>


So what we're seeing here is a successful connection, but not the text. We have to ask for that explicitly from our result object.

We do this with `.text` (no parens!), which asks for a variable value within our object (versus calling a function). Some objects just work this way, and we know how to do this by looking at the documentation or a tutorial. The cell below prints `.content` instead, which holds the same response as raw bytes; that is why the output starts with `b'`.


In [8]:
print(result.content)

b'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Equidem e Cn. Nam Pyrrho, Aristo, Erillus iam diu abiecti. Iam id ipsum absurdum, maximum malum neglegi. Prioris generis est docilitas, memoria; Tria genera bonorum; Duo Reges: constructio interrete. \n\n'
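The difference between `.content` and `.text` is the difference between `bytes` and `str`. The sketch below shows that with a plain bytes value standing in for `result.content`, so it runs without a network connection. It assumes the bytes are UTF-8; `requests` works out the encoding from the response for you.

```python
# Stand-in for result.content: raw bytes, as a server would send them.
raw = b"Lorem ipsum dolor sit amet"

# Decoding turns the bytes into a str, which is roughly what .text gives you.
decoded = raw.decode("utf-8")

print(type(raw).__name__, raw)
print(type(decoded).__name__, decoded)
```

As a rule of thumb, reach for `.text` when saving pages and `.content` when saving images or other non-text files.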


`pathlib` allows us to make path objects.

We can use the `mkdir()` method to create a folder, and then use the `/` operator to join a folder and a file name into one path.

`pathlib` has two awesome path object methods to write out content:

* `write_text(text stuff)`
	* for text!
* `write_bytes(a bytes or non-text doodad)`
	* briefly, for stuff that isn't text
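Here is a small sketch of those pieces together. It works inside a temporary folder so it leaves nothing behind; the `demo` folder and the two file names are made up for the example.

```python
import pathlib
import tempfile

# Work in a throwaway folder so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    folder = pathlib.Path(tmp) / "demo"   # "/" joins path parts
    folder.mkdir(exist_ok=True)           # create the folder if needed

    text_file = folder / "notes.txt"
    text_file.write_text("hello", encoding="utf-8")   # for text

    bytes_file = folder / "blob.bin"
    bytes_file.write_bytes(b"\x00\x01\x02")           # for non-text data

    # Read both files back before the temporary folder is removed.
    text_back = text_file.read_text(encoding="utf-8")
    bytes_back = bytes_file.read_bytes()

print(text_back, bytes_back)
```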

https://calphotos.berkeley.edu/robots.txt

We have a list of URLs now, so we can loop through those and begin downloading them. There are a few tasks we'll need to accomplish.

* create the file name (from the URL)
* create a directory for the new files to go into
* create the full destination path (target folder plus file name)
* open up the requests connection
* access and write the content
* close the connection
* wait for 5 seconds

This is a lot, so we'll build it up bit by bit.

In [10]:
import pathlib
import time
import requests

# create the target folder object
target = pathlib.Path('pictures')
# make the directory if needed
# does nothing if already exists
target.mkdir(exist_ok=True)

for u in urls:
    parts = u.split('/')
    last_two = parts[-2:] # grab the last two parts
    fname = "_".join(last_two)
    # print(fname)
    p = target / pathlib.Path(fname)
    print(p) # this is the full path
    r = requests.get(u) #open connection
    p.write_bytes(r.content) # get content, write bytes
    r.close() # always close your connection!!!
    time.sleep(5) # pause to not anger the server

pictures/0903_0732.jpeg
pictures/1002_0400.jpeg
pictures/1102_0790.jpeg
pictures/1102_0792.jpeg
pictures/1207_0067.jpeg
pictures/1207_0083.jpeg
pictures/1207_0084.jpeg
pictures/1207_0086.jpeg
pictures/0408_1095.jpeg
pictures/0608_2437.jpeg
pictures/0608_2438.jpeg
pictures/0608_2439.jpeg
pictures/0608_2440.jpeg
pictures/0608_2441.jpeg
pictures/0608_2442.jpeg
pictures/0608_2443.jpeg
pictures/0608_2444.jpeg
pictures/0209_0663.jpeg
pictures/0209_0664.jpeg
pictures/0209_0665.jpeg
pictures/0209_0666.jpeg
pictures/0209_0667.jpeg
pictures/0509_0139.jpeg
pictures/1209_2447.jpeg
pictures/1209_2448.jpeg
pictures/0611_1218.jpeg
pictures/0611_1219.jpeg
pictures/0611_1220.jpeg
pictures/0611_1221.jpeg
pictures/0611_1222.jpeg
pictures/0611_1223.jpeg
pictures/0413_3699.jpeg
pictures/1113_3030.jpeg
pictures/1115_2820.jpeg
pictures/1115_2821.jpeg
pictures/1115_3063.jpeg
pictures/1115_3064.jpeg
pictures/1017_1587.jpeg
pictures/1017_1588.jpeg
pictures/1017_1589.jpeg
pictures/0918_2740.jpeg
pictures/0918_27
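The file naming step from the loop can be pulled out into a small function and tried on its own, without any network access. The `filename_from_url` name is just a label chosen for this sketch.

```python
def filename_from_url(url):
    """Join the last two path segments of a URL with an underscore."""
    parts = url.split("/")
    return "_".join(parts[-2:])

example = "https://calphotos.berkeley.edu/imgs/128x192/0000_0000/1209/2448.jpeg"
fname = filename_from_url(example)
print(fname)
```

Keeping the folder number in the name matters here because the final segment alone (`2448.jpeg`) might not be unique across folders.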

One thing I always check at this point is the file size for everything that has downloaded. When in Jupyter on a cloud service, that can be hard, but `!` comes to the rescue.

In [None]:
!ls -l pictures
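If you'd rather do the same check in Python, `pathlib` can report sizes through `stat().st_size`. The sketch below flags files under a size threshold; the `find_small_files` helper, its default threshold, and the demo file names are assumptions for illustration, and the demo runs on a temporary folder rather than the real `pictures` folder.

```python
import pathlib
import tempfile

def find_small_files(folder, min_bytes=1):
    """Return the names of files in folder smaller than min_bytes, sorted."""
    return sorted(
        p.name for p in pathlib.Path(folder).iterdir()
        if p.is_file() and p.stat().st_size < min_bytes
    )

# Demo on a throwaway folder with one good file and one empty file.
with tempfile.TemporaryDirectory() as tmp:
    demo = pathlib.Path(tmp)
    (demo / "good.jpeg").write_bytes(b"not really an image")
    (demo / "empty.jpeg").write_bytes(b"")
    suspicious = find_small_files(demo)

print(suspicious)
```

On the real data you might call `find_small_files("pictures", min_bytes=1000)` and pick a threshold that fits the files you expect.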

Now, what if there were many files, or some downloads got messed up and we had to rerun? Using pathlib is awesome here. We can use the `exists()` method to check whether the file we are about to make already exists, and skip it if so.

In [None]:
target = pathlib.Path('pictures')
target.mkdir(exist_ok=True)

for u in urls:
    parts = u.split('/')
    last_two = parts[-2:] # grab the last two parts
    fname = "_".join(last_two)
    p = target / pathlib.Path(fname)
    # use .exists to check
    if p.exists():
        print("already done!")
    else:
        print(p)
        r = requests.get(u)
        p.write_bytes(r.content)
        r.close()
        time.sleep(5)

So we've seen how to gather some contents from a web site, download a group of things from a website, pause a scraper, and even check if that file already exists.

This may have been about files, but this could also be about pages. 


# Generating URLs to download pages from

Let's look here: https://calphotos.berkeley.edu/flora/

https://calphotos.berkeley.edu/flora/sci-A.html

Say that we wanted to automatically grab all the URLs for the scientific names. Looking at https://calphotos.berkeley.edu/flora/ we can see that the pages all go from A-Z in the URLs. There are other things we could do: we know that each URL should contain `flora/sci`, so we could get all the URLs on the page, filter them, and use that as our list. But let's try generating the URLs.

Oftentimes you'll need to generate numbers or other things within a URL. You can use a for loop with `range(number)` to generate a set of numbers.
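As a quick sketch of the numbered case, here is a loop over `range` that builds page URLs. The `example.org` pattern is hypothetical; swap in the real one for your site.

```python
# Hypothetical URL pattern; swap in the real one for your site.
base = "https://example.org/results?page="

# range(1, 6) yields 1 through 5; the stop value is not included.
page_urls = [base + str(number) for number in range(1, 6)]

for url in page_urls:
    print(url)
```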

In this case we have this theme of `base + letter + .html`. No, we don't need to make all of these ourselves.  The `string` module actually has some fun stuff to keep in mind!
Note that most of the items in this module are variables that you are importing instead of functions. This just means that there won't be `()` after the names.

In [None]:
import string

print(string.ascii_uppercase)

We can see that we have the letters, so let's put them into action. Our URL looks like this: `https://calphotos.berkeley.edu/flora/sci-A.html`. So hopefully you can see where we might put the letter.

In [None]:
import string


for letter in string.ascii_uppercase:
    url = "https://calphotos.berkeley.edu/flora/sci-" + letter + ".html"
    # print(url)

Let's add some code to download and save these pages to disk and in a folder like we did before. 

In [12]:
import string
import requests
import pathlib
import time

target = pathlib.Path("sci-pages")
target.mkdir(exist_ok=True)

for letter in string.ascii_uppercase:
    url = "https://calphotos.berkeley.edu/flora/sci-" + letter + ".html"
    print(url)
    p = target / pathlib.Path("sci-" + letter + ".html")
    r = requests.get(url)
    p.write_text(r.text)
    r.close()
    time.sleep(5)

https://calphotos.berkeley.edu/flora/sci-A.html
https://calphotos.berkeley.edu/flora/sci-B.html
https://calphotos.berkeley.edu/flora/sci-C.html
https://calphotos.berkeley.edu/flora/sci-D.html
https://calphotos.berkeley.edu/flora/sci-E.html
https://calphotos.berkeley.edu/flora/sci-F.html
https://calphotos.berkeley.edu/flora/sci-G.html
https://calphotos.berkeley.edu/flora/sci-H.html
https://calphotos.berkeley.edu/flora/sci-I.html
https://calphotos.berkeley.edu/flora/sci-J.html
https://calphotos.berkeley.edu/flora/sci-K.html
https://calphotos.berkeley.edu/flora/sci-L.html
https://calphotos.berkeley.edu/flora/sci-M.html
https://calphotos.berkeley.edu/flora/sci-N.html
https://calphotos.berkeley.edu/flora/sci-O.html
https://calphotos.berkeley.edu/flora/sci-P.html
https://calphotos.berkeley.edu/flora/sci-Q.html
https://calphotos.berkeley.edu/flora/sci-R.html
https://calphotos.berkeley.edu/flora/sci-S.html
https://calphotos.berkeley.edu/flora/sci-T.ht

Yay! We have these on disk! Now we can loop over these files and start trying to get content off of them. We know we can get the content out of the html, but we first need to get the content read in. Let's focus on that first. We are going to reuse the `target` variable, which is set to the folder with all the html files.

Path objects that are folders can use the `glob` method to do queries about their file content. The patterns work much like wildcards in bash or terminal commands. So `*.html` will give us all the html files within that folder. (Note: you can use `rglob` to recursively search a directory and all descendant child directories for those files.)

This will return a `generator` object, which may look weird. But this is just a way of saving memory. You can either loop over it and print it or recast the result to a list to see all the content. The paths that are returned are already path objects!

In [None]:
for p in target.glob("*.html"):
    print(p)
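To see `glob` next to `rglob`, and what recasting the generator looks like, here is a sketch on a temporary folder. The file and folder names are made up for the demo.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    folder = pathlib.Path(tmp)
    (folder / "sci-A.html").write_text("<html></html>")
    (folder / "notes.txt").write_text("not html")
    (folder / "nested").mkdir()
    (folder / "nested" / "sci-B.html").write_text("<html></html>")

    # glob returns a generator; list() or sorted() shows everything at once.
    top_level = sorted(p.name for p in folder.glob("*.html"))
    # rglob also searches the subfolders.
    everything = sorted(p.name for p in folder.rglob("*.html"))

print(top_level)
print(everything)
```

Sorting is worth the extra call because the order in which `glob` yields files is not guaranteed.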


* run it through bs4 first to clean up the html; you could save this content to disk if you want (and have many of these files), as it can execute a bit slowly.
* from there you can try to use bs4 syntax to extract stuff, or you can send it through lxml.

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#bs4.Tag

Simple queries are pretty nice here! 

Our challenge with this structure is the lack of attribute tags. We just have a bunch of table elements that are nested. Now, this makes some things pretty clean. But it can make it hard to select certain things with precision when there are duplicates. 

We need to take a look at the structure of the website. What is the content we are after? What makes that location unique compared to the other data?

There's no one single answer here. You have to look at the contents and it'll be something specific for each project. But here are some areas to consider:

* Is the content only in a specific tag? 
	* like an image or h1 text
* Is the content in a unique part of the tree?
	* like an image but only the images within a table
* Does the content have a specific style class label or html tag attribute that you can latch on to?
	* like the p tags that are marked for formatting with `class = "name"`
	* This may be semantic, like a name, or something structural that uniquely identifies what you need
* Does the text content within the tag have something unique you can check for?
	* like you want all the `td` cells inside a `tr` where the first `td` cell starts with "Total:". Effectively, you want the contents of a row where the first bit of text starts with "total"
* Does an attribute have a specific value that you can check for?
	* like the `href` for an `a` tag has something specific, as in, you want to check all the hyperlinks but only want the ones that link to a specific subdomain

Each of these situations can be coded up. You can usually use some combination of the selection/extraction tool itself along with core python. How you divide that up will depend on your skills with the tools and how nicely the content will play with them. 

### Some caveats about Beautiful Soup
I'll be honest, I'm not the biggest fan of using this for extraction. However, the utility is there and for simpler things it's pretty straightforward. We will be looking at xpath queries later, and for highly structured html or more complex queries, xpath really is much more straightforward.

Another consideration: the searching/parsing of Beautiful Soup is generally slower than what lxml can do. Now, for a few dozen or a few hundred files this shouldn't impact you very much. Use whichever clicks first for your needs. However, should the number of queries go up into the thousands or millions, you'll want to switch over. Speed may not end up mattering because you can get in and out quickly, and don't need to rerun the results. So this isn't a hard rule. Just keep it in mind.

## Extracting things with BeautifulSoup

You can read the documentation here for bs4: https://www.crummy.com/software/BeautifulSoup/bs4/doc

Some of the lingo on this page may not make a ton of sense to you if you haven't spent some intense personal time in the land of XML or metadata. However, this is where librarian instructors can really shine! Many of us are very used to these kinds of discussions, and you can leverage that expertise to promote your workshops.

The sum of it is, each tag is like a node in a tree. That node will have some combination of parent, child, sibling, ancestors, and descendants. This is how you navigate a tree. There's lots of examples on their page that you can work through, but the best way to get used to it is just to mess around. 

## Loading an html file into beautiful soup

Let's read in a single file to explore some of these tools. In this code we just want to grab one file to play with from our target folder. This also lets us practice a bit with our core python!

In [None]:
import pathlib
from bs4 import BeautifulSoup

first_file = next(target.glob('*.html')) # just a fancy way to ask it to iterate once

soup = BeautifulSoup(first_file.read_text(), 'html.parser') 

The Beautiful Soup documentation always uses `soup` as the variable name for the parsed content, so I'm using that to match. I generally suggest you do the same with your own work so things match up with the documentation.

From here we can now operate on `soup` in a variety of ways. 

### Simplest extraction
The simplest query is just to go after all of a single element. Maybe you want all the images on a page or all the links. You can grab those and run filters etc. with their content in regular python if need be. This can be a nice place to start and allows you to avoid some of the query complexity. 

You can use `soup` with dot notation and a single tag to grab the first one that the parser sees. This will return the entire element.

In [None]:
print(soup.a)

```html
<a href="/">CalPhotos</a>
```


Yup, this is the top of the page. First one it sees. Maybe this is the one you want? Maybe not. This can be good if you need to start digging around the tree and the first ones coming up are the ones you want. You can also chain these together to go directly at something, presuming that the first one it sees is the one and only one you want.

```python
print(soup.body.table.tr.td)
```
```html
<td bgcolor="DFE5FA" width="5%">
<!-- uncomment the following line to add a logo ----->
<!-- img align = left border=0 height=130 src = "/calflora/icon.gif" alt = "CalFlora"-->
<br/>
</td>
```


You can also ask for the content of the tag with the dot notation. 

```python
print(soup.a.contents)

['CalPhotos']
```

And ask for an attribute's value using dictionary-like notation.

```python
print(soup.a['href'])

'/'
```
(this result does make sense, as it is linking back to the main page but with a relative link)

Generally our queries will be more complex than these simple ones, but this core syntax is good to keep in mind because we will use it in conjunction with more complex queries. 

## Extracting multiple things

In our case we want to examine all the link tags, not just the first ones. We can use `find_all`. They also have a specific section on this you can read more about: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree

In [None]:
print(soup.find_all('a'))

This gives a large list of all the things. As mentioned, given that this is all the links there will be extras we don't want. There are two ways we can improve these results: 1) make our extraction query more specific or 2) attempt to filter out the content we don't want. 

Both are good strategies to consider. You may not be able to uniquely pinpoint the ones you want within the structure and thus need to look at the content itself, or maybe the content all looks the same and you depend on the structure to disambiguate. Maybe there's a combination of the two!

Let's print this out in a for loop for better viewing. 

```python
for a in soup.find_all('a'):
    print(a)
```
```html
<a href="/">CalPhotos</a>
<a href="/flora/">Plants</a>
<a href="/browse_imgs/plant.html">Browse Thumbnail Photos of Plants</a>
<a href="/cgi/img_query?stat=BROWSE&amp;where-genre=Plant&amp;where-taxon=Wachendorfia+paniculata&amp;title_tag=Wachendorfia+paniculata">Wachendorfia paniculata</a>
<a href="/cgi/img_query?stat=BROWSE&amp;where-genre=Plant&amp;where-taxon=Wachendorfia+thyrsiflora&amp;title_tag=Wachendorfia+thyrsiflora">Wachendorfia thyrsiflora</a>
(snip)
```

We can see a few things. The results we want all likely have the hrefs starting with "/cgi". We can do this directly by referencing the attribute name (`href` holds the url) and compiling a regular expression. Yes, you can use regular expression fanciness in here but you can also just put in any string to have it try and match that substring. This is also something you could do in core python with string tools. 

In this case I've told regex that: the start (`^`) of the string should have `/cgi`. 


```python
for a in soup.find_all('a', href=re.compile('^/cgi')):
    print(a)
```
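For comparison, here is a sketch of the core Python route on a list of plain `href` strings. The values are copied from the output above, with `&amp;` written as `&`, which is how the attribute value reads once parsed. The `parse_qs` step is an extra, shown as one way to pull the taxon name back out of the link.

```python
from urllib.parse import parse_qs, urlsplit

# A few href values copied from the output above ("&amp;" shown as "&").
hrefs = [
    "/",
    "/flora/",
    "/browse_imgs/plant.html",
    "/cgi/img_query?stat=BROWSE&where-genre=Plant&where-taxon=Wachendorfia+paniculata&title_tag=Wachendorfia+paniculata",
]

# str.startswith does the same job as the regex "^/cgi".
cgi_links = [h for h in hrefs if h.startswith("/cgi")]

# parse_qs splits the query string and turns "+" back into spaces.
taxa = [parse_qs(urlsplit(h).query)["where-taxon"][0] for h in cgi_links]
print(taxa)
```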

We can add a search into this on the actual contents of the a tag using the `string` argument. 

```python
for a in soup.find_all('a', href=re.compile('^/cgi'), string = re.compile('var')):
    print(a)
```
This will retain our previous filter but also search the name of the plant for "var".

This function has a lot of power and the documentation provides a ton of detail: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all. Caution: if the core python doesn't make sense, then the example likely also won't make sense.

## Specifying more structure

There's lots of ways to fuss over searching the structure in beautiful soup. The nice thing about being able to provide queries with more content is that you don't have to add loops or other complexities in. You can simply say: I want all the `a` tags that are directly under `p` tags. Or something similar. 

We can do this with some nice shorthand with the `select()` soup method. 

This method allows you to specify CSS structure to select things if you want, but you can also specify more tree structure. This is what we'll be using.

You can specify multiple tags here and it will look for those tags within the tree.

We could say:

```python
soup.select("p a")
```
This would find all `a` tags that exist anywhere inside a `p` tag. (xpath equiv: `//p//a`)

We can also say:

```python
soup.select("p > a")
```
Where the `a` tags must be direct children of `p`. (xpath equiv: `//p/a`)

We can also combine these:

```python
soup.select("table p > a")
```

Saying to select all `p` elements anywhere inside of a `table`, and then an `a` tag if directly a child of `p`. (xpath equiv: `//table//p/a`)

## Handling the results

These results all give you a list of tag objects you can further mess with. 

We can ask for the contents of the tag:

```python
for a in soup.select("table p > a"):
    print(a.text)
```
This will give us all the species names. Let's note that the species name is the only thing in the hyperlink. We also want the number next to it. To get this we need all the `p` tag text, but we don't want all the `p` tags. We can accomplish this by navigating the tree more: find all the `a` tags we want and then ask for the parent tag's text. (this is weird but more common than you think). xpath equiv: `//p/a/../text()`

```python
for a in soup.select("td > p > a"):
    print(a.parent.text)
```

`a.parent` is a relative lookup and "becomes" the `p` tags we want. Then `text` is applied to that. 

We don't want `a.parent.contents` because that will also return the full `a` tag object along with the number. Using `.text` allows us to ask just for the text that is displayed from that element. 

Looking further at the results we can also notice that we have "flattened" this table to just a single column of data. This is because we are ignoring the structure of the table and just grabbing all the individual elements.
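If you want to keep the rows instead, one option is to loop over the `tr` elements first and run the same query inside each one. The sketch below does that on a made-up table, with names and counts chosen only for illustration, and writes the rows out as CSV into memory; a real script would open a file instead.

```python
import csv
import io
from bs4 import BeautifulSoup

# Made-up table; the names and counts are chosen only for illustration.
html = """
<table>
  <tr><td><p><a href="/cgi/a">Allium anceps</a> (12)</p></td>
      <td><p><a href="/cgi/b">Allium cratericola</a> (3)</p></td></tr>
  <tr><td><p><a href="/cgi/c">Allium falcifolium</a> (7)</p></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# One inner list per table row, so the row structure survives.
rows = []
for tr in soup.find_all("tr"):
    rows.append([a.parent.text.strip() for a in tr.select("td > p > a")])

# Write the rows as CSV into memory; a real script would open a file instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Whether the flat list or the row-by-row version is better depends on whether the table's rows mean anything for your question.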
