<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2023 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email author@email.address.<br />
____

# Web Scraping Toolkit 3

This is lesson 3 of 3 in the educational series on `Web Scraping`. This notebook is intended to teach the core problem-solving perspectives and tools for web scraping.

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `How-To` 

**Difficulty:** Intermediate

**Completion time:** 90 minutes

**Knowledge Required:** 

* Python basics (variables, flow control, functions, lists)
* Basic file operations (open, close, read, write)

**Knowledge Recommended:**

* basic html/websites

**Learning Objectives:**
After this lesson, learners will be able to:

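1. Extract content from html with Beautiful Soup (bs4)
2. Use regular expressions (regex) to separate extracted text into parts
3. Talk about other tools (XPath, Selenium, `webbrowser`, `pyautogui`)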

**Research Pipeline:**

1. You have a research question and data in mind.
2. You've found some data you want to use.
3. **The data is on a website somewhere and you want to get it off the site and into a data file.**
4. You do your analysis or other data prep!


# Required Python Libraries

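* `requests` for downloading things
* `lxml` for parsing xml/html and running xpath queries
* `bs4` (Beautiful Soup) for parsing and extracting from html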

## Install Required Libraries

In [2]:
### Install Libraries ###

# Using !pip installs
!pip install requests
!pip install lxml
!pip install bs4


In [12]:
### Import Libraries ###

#3rd party
import requests
from lxml import etree
from bs4 import BeautifulSoup

import re
import time
import pathlib
import csv

# Day 3

At this point we've got a bunch of foundational skills:

* downloading things to our computer 
* handling files and folders, checking if they exist, making names, etc.
* putting delays in between downloading things
* parsing html with beautiful soup

Honestly, the first parts here of just handling the files, etc. are some of the hardest parts and not often covered in the documentation for the actual parsing tools. Things like beautiful soup, lxml, etc will all presume that you've got the files under control. 

Choosing a tool for extraction isn't about finding the "right" one. True, sometimes you'll need a certain tool or something else unique to one particular package. Otherwise, the rest is personal preference. 

I happen to have learned xpath before beautiful soup, so I'm naturally inclined to go with a tool supporting that. I also regularly need to parse xml files, which lxml does well. That's just my experience; yours may point you toward a different tool.

## More realistic extraction from BeautifulSoup

In our case we want to examine all the link tags, not just the first ones. We can use `find_all`. They also have a specific section on this you can read more about: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree

```python
print(soup.find_all('a'))
```

This gives a large list of all the things. As mentioned, given that this is all the links there will be extras we don't want. There are two ways we can improve these results: 1) make our extraction query more specific or 2) attempt to filter out the content we don't want. 

Both are good strategies to consider. You may not be able to uniquely pinpoint the ones you want within the structure and thus need to look at the content itself, or maybe the content all looks the same and you depend on the structure to disambiguate. Maybe there's a combination of the two!

Let's print this out in a for loop for better viewing. 

In [16]:
import pathlib
from bs4 import BeautifulSoup
import re

# the folder holding the downloaded html pages
target = pathlib.Path('sci-pages')

first_file = next(target.glob('*.html')) # just a fancy way to ask it to iterate once

soup = BeautifulSoup(first_file.read_text(), 'html.parser')

In [14]:
for a in soup.find_all('a'):
    print(a)

<a href="/">CalPhotos</a>
<a href="/flora/">Plants</a>
<a href="/browse_imgs/plant.html">Browse Thumbnail Photos of Plants</a>
<a href="/cgi/img_query?stat=BROWSE&amp;where-genre=Plant&amp;where-taxon=Fabaceae+sp.&amp;title_tag=Fabaceae+sp.">Fabaceae sp.</a>
<a href="/cgi/img_query?stat=BROWSE&amp;where-genre=Plant&amp;where-taxon=Fabiana+imbricata&amp;title_tag=Fabiana+imbricata">Fabiana imbricata</a>
<a href="/cgi/img_query?stat=BROWSE&amp;where-genre=Plant&amp;where-taxon=Fabronia+pusilla&amp;title_tag=Fabronia+pusilla">Fabronia pusilla</a>
<a href="/cgi/img_query?stat=BROWSE&amp;where-genre=Plant&amp;where-taxon=Facelis+retusa&amp;title_tag=Facelis+retusa">Facelis retusa</a>
<a href="/cgi/img_query?stat=BROWSE&amp;where-genre=Plant&amp;where-taxon=Facheiroa+ulei&amp;title_tag=Facheiroa+ulei">Facheiroa ulei</a>
<a href="/cgi/img_query?stat=BROWSE&amp;where-genre=Plant&amp;where-taxon=Fagonia+chilensis&amp;title_tag=Fagonia+chilensis">Fagonia chilensis</a>
<a href="/cgi/img_query?sta

We can see a few things. The results we want all likely have hrefs starting with "/cgi". We can filter on this directly by referencing the attribute name (`href` holds the url) and compiling a regular expression. Yes, you can use regular expression fanciness in here, but you can also just compile a plain string and it will match anywhere that substring appears. (A bare string passed without `re.compile` has to match the whole attribute value.) This is also something you could do in core python with string tools.

In this case I've told regex that: the start (`^`) of the string should have `/cgi`. 

In [17]:
for a in soup.find_all('a', href=re.compile('^/cgi')):
    print(a)

<a href="/cgi/img_query?stat=BROWSE&amp;where-genre=Plant&amp;where-taxon=Fabaceae+sp.&amp;title_tag=Fabaceae+sp.">Fabaceae sp.</a>
<a href="/cgi/img_query?stat=BROWSE&amp;where-genre=Plant&amp;where-taxon=Fabiana+imbricata&amp;title_tag=Fabiana+imbricata">Fabiana imbricata</a>
<a href="/cgi/img_query?stat=BROWSE&amp;where-genre=Plant&amp;where-taxon=Fabronia+pusilla&amp;title_tag=Fabronia+pusilla">Fabronia pusilla</a>
<a href="/cgi/img_query?stat=BROWSE&amp;where-genre=Plant&amp;where-taxon=Facelis+retusa&amp;title_tag=Facelis+retusa">Facelis retusa</a>
<a href="/cgi/img_query?stat=BROWSE&amp;where-genre=Plant&amp;where-taxon=Facheiroa+ulei&amp;title_tag=Facheiroa+ulei">Facheiroa ulei</a>
<a href="/cgi/img_query?stat=BROWSE&amp;where-genre=Plant&amp;where-taxon=Fagonia+chilensis&amp;title_tag=Fagonia+chilensis">Fagonia chilensis</a>
<a href="/cgi/img_query?stat=BROWSE&amp;where-genre=Plant&amp;where-taxon=Fagonia+laevis&amp;title_tag=Fagonia+laevis">Fagonia laevis</a>
<a href="/cgi/im

We can add a search into this on the actual contents of the a tag using the `string` argument. 

In [None]:
for a in soup.find_all('a', href=re.compile('^/cgi'), string = re.compile('var')):
    print(a)

This will retain our previous filter but also search the name of the plant for "var".

This function has a lot of power and the documentation provides a ton of detail: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all Caution, if the core python doesn't make sense then the example likely also won't make sense. 
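As a minimal sketch of the point above that the same filter can be done with core python string tools, here are both approaches side by side. The `sample_html` snippet is a made-up miniature page modeled on the links printed earlier, not the real file.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical miniature page, modeled on the links printed earlier
sample_html = """
<a href="/">CalPhotos</a>
<a href="/flora/">Plants</a>
<a href="/cgi/img_query?where-taxon=Fabiana+imbricata">Fabiana imbricata</a>
<a href="/cgi/img_query?where-taxon=Fagonia+laevis">Fagonia laevis</a>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Option 1: let find_all do the filtering with a compiled pattern
by_regex = sample_soup.find_all('a', href=re.compile('^/cgi'))

# Option 2: grab every link that has an href, then filter with core python
by_string_tools = [a for a in sample_soup.find_all('a', href=True)
                   if a['href'].startswith('/cgi')]

print([a.text for a in by_regex])
print(by_regex == by_string_tools)
```

Both routes should land on the same tags here. Which one reads better is mostly personal preference.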

## Specifying more structure

There's lots of ways to fuss over searching the structure in beautiful soup. The nice thing about being able to provide queries with more content is that you don't have to add loops or other complexities in. You can simply say: I want all the `a` tags that are directly under `p` tags. Or something similar. 

We can do this with some nice shorthand with the `select()` soup method. 

This method lets you specify CSS selectors to pick things out, including selectors that describe tree structure. The tree structure part is what we'll be using.

You can specify multiple tags here and it will look for those tags within the tree.

We could say:

```python
soup.select("p a")
```
This would find all `a` tags that exist anywhere inside a `p` tag. (xpath equiv: `//p//a`)

We can also say:

```python
soup.select("p > a")
```
Where the `a` tags must be direct children of `p`. (xpath equiv: `//p/a`)

We can also combine these:

```python
soup.select("table p > a")
```

This says: look for `p` elements anywhere inside a `table`, then select the `a` tags that are direct children of those `p` elements. (xpath equiv: `//table//p/a`)
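To see how the three selectors differ, here is a small sketch run against a made-up snippet. The `sample_html` below is hypothetical and only built to make each selector return something different.

```python
from bs4 import BeautifulSoup

# Hypothetical html: one direct child link, one nested deeper, one in a table
sample_html = """
<p><a href="/one">direct child</a></p>
<p><b><a href="/two">nested deeper</a></b></p>
<table><tr><td><p><a href="/three">in a table</a></p></td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# a tags anywhere inside a p tag
anywhere = [a.text for a in sample_soup.select("p a")]
# a tags that are direct children of a p tag
direct = [a.text for a in sample_soup.select("p > a")]
# direct children of a p tag, where that p sits somewhere inside a table
in_table = [a.text for a in sample_soup.select("table p > a")]

print(anywhere)
print(direct)
print(in_table)
```

Each selector is stricter than the last, so each list should be a subset of the one before it.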

## Handling the results

These results all give you a list of tag objects you can further mess with. 

We can ask for the contents of the tag:

In [None]:
for a in soup.select("table p > a"):
    print(a.text)

This will give us all the species names. Let's note that the species name is the only thing in the hyperlink. We also want the number next to it. To get this we need all the `p` tag text, but we don't want all the `p` tags. We can accomplish this by navigating the tree more: find all the `a` tags we want and then ask for the parent tag's text. (this is weird but more common than you think). xpath equiv: `//p/a/../text()`

In [None]:
for a in soup.select("td > p > a"):
    print(a.parent.text)

`a.parent` is a relative lookup and "becomes" the `p` tags we want. Then `text` is applied to that. 

We don't want `a.parent.contents` because that will also return the full `a` tag object along with the number. Using `.text` allows us to ask just for the text that is displayed from that element. 

Looking further at the results we can also notice that we have "flattened" this table to just a single column of data. This is because we are ignoring the structure of the table and just grabbing all the individual elements.
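Here is a small sketch of the difference between `a.text`, `a.parent.text`, and `a.parent.contents`. The one-row table is a hypothetical stand-in for the real page, which has many more entries per `p` tag.

```python
from bs4 import BeautifulSoup

# Hypothetical one-entry table, shaped like "species link plus a count"
sample_html = ('<table><tr><td><p>'
               '<a href="/cgi/x">Yabea microcarpa</a> (27)'
               '</p></td></tr></table>')
sample_soup = BeautifulSoup(sample_html, 'html.parser')

a = sample_soup.select_one("td > p > a")

link_text = a.text                   # only what is inside the hyperlink
parent_text = a.parent.text          # everything displayed by the p tag
parent_contents = a.parent.contents  # the a tag object plus the loose text

print(link_text)
print(parent_text)
print(parent_contents)
```

The `contents` list mixes a tag object with a string, which is why `.text` is the easier thing to ask for here.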

## Getting all files from a directory

Now that we can extract what we want from one page, we can look at extracting things from all the pages. 


Yay! we have these on disk! Now we can loop over these files and start trying to get content off of them. We know we can get the content out of the html, but we first need to get the content read in. Let's focus on that first. We are going to reuse the `target` variable, which is set to the folder with all the html files.

Path objects that are folders can use the `glob` method to query the files they contain. The patterns work much like wildcards in bash or other terminal commands. So `*.html` will give us all the html files within that folder. (note: you can use `rglob` to recursively search a directory and all descendant child directories for those files).

This will return a `generator` object, which may look weird. But this is just a way of saving memory. You can either loop over it and print it or recast the result to a list to see all the content. The paths that are returned are already path objects!

In [None]:
for p in target.glob("*.html"):
    print(p)
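If the generator behavior feels odd, this sketch shows it on a throwaway folder. The folder and file names are made up for the demo and get deleted when the `with` block ends.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    folder = pathlib.Path(tmp)
    # make a few placeholder files to search through
    for name in ['A.html', 'B.html', 'notes.txt']:
        (folder / name).write_text('<p>placeholder</p>')

    matches = folder.glob('*.html')
    print(matches)  # a generator object, not a list

    # looping (or sorting, or list()) pulls the paths out of the generator
    found = sorted(path.name for path in matches)
    print(found)

    # a generator can only be walked once, so nothing is left now
    leftover = list(matches)
    print(leftover)
```

If you need the paths more than once, recast the result to a list right away or just call `glob` again.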

We've got the files, and because we've used pathlib for this, these paths are already Path objects. So let's hook in bs4.

In [None]:
for p in target.glob("*.html"):
    soup = BeautifulSoup(p.read_text(), 'html.parser')
    for a in soup.select("td > p > a"):
        print(a.parent.text)

Basically, we take what we did before and scoot it inside our for loop. So that part isn't so complicated. What can get a bit weird is collecting everything up. 

There are two big ways you could do this:

* write the contents for each page out to another file
	* extra work but ideal if there's a ton of things
	* allows you to skip writing out something you've already parsed
	* but may not be needed for smaller projects
* collect all the contents into something in memory and then write them all out
	* you'll likely want this as your end goal anyhow
	* sometimes you can just jump right to it
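The two approaches in the list above can be sketched together. Everything here is hypothetical: `parse_page` is a stand-in for the bs4 extraction, and the folders are temporary ones made just for the demo.

```python
import pathlib
import tempfile


def parse_page(raw_text):
    """Hypothetical stand-in for the bs4 extraction step."""
    return raw_text.upper()


with tempfile.TemporaryDirectory() as tmp:
    source = pathlib.Path(tmp) / 'pages'
    parsed = pathlib.Path(tmp) / 'parsed'
    source.mkdir()
    parsed.mkdir()

    for name in ['A', 'B']:
        (source / f'{name}.html').write_text(name.lower())
    # pretend page A was already parsed on an earlier run
    (parsed / 'A.txt').write_text('already done')

    all_results = {}    # approach 2: collect everything in memory
    newly_written = []
    for page in sorted(source.glob('*.html')):
        out_file = parsed / (page.stem + '.txt')
        # approach 1: one output file per page, skipping finished ones
        if not out_file.exists():
            out_file.write_text(parse_page(page.read_text()))
            newly_written.append(out_file.name)
        all_results[page.stem] = out_file.read_text()

print(newly_written)
print(all_results)
```

Only page B gets parsed on this run, which is the time saver when a big job gets interrupted halfway through.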

### Writing pages out

We can use a few more pathlib tools here. 

* make a new folder for these, like you have before.
* `p.stem` gives you just the file name from the original file, without the extension (it's already a string)
* then concat ".txt" onto it and you've made your new name!
* join that with your new target folder using `/` and that's your new path object
* write the extracted text out to it

Something to keep in mind: the `write_text()` pathlib method doesn't work in append mode. You'll either need to collect everything up or open the file and use .write() with it. 

This code takes several minutes to run! Be careful. The finished files are available to you in the repository.


In [None]:
parsed_target = pathlib.Path('parsed_sci-pages')
parsed_target.mkdir(exist_ok = True)

for p in target.glob("*.html"):
    soup = BeautifulSoup(p.read_text(), 'html.parser')
    # create the new path object
    f = parsed_target / (p.stem + '.txt')
    # write_text() overwrites the file each time, so collect the text first and write once
    texts = [a.parent.text for a in soup.select("td > p > a")]
    f.write_text('\n'.join(texts))
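The note above about `write_text()` not working in append mode is easy to check for yourself. This sketch uses a throwaway file in a temporary folder; the file name and words are made up.

```python
import pathlib
import tempfile

words = ['first', 'second', 'third']

with tempfile.TemporaryDirectory() as tmp:
    out = pathlib.Path(tmp) / 'demo.txt'

    # write_text() replaces the whole file on every call
    for word in words:
        out.write_text(word + '\n', encoding='utf-8')
    overwritten = out.read_text(encoding='utf-8')

    out.unlink()  # start fresh for the second attempt

    # opening in append mode ('a') keeps adding to the end instead
    with out.open('a', encoding='utf-8') as outfile:
        for word in words:
            outfile.write(word + '\n')
    appended = out.read_text(encoding='utf-8')

print(repr(overwritten))
print(repr(appended))
```

Only the last word survives the `write_text()` loop, while the append version keeps all three.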

Two things to keep in mind with this workflow:

* run the html through Beautiful Soup first to clean it up. If you have many files, you could save the cleaned content to disk for later, since parsing can be slow.
* from there you can use bs4 syntax to extract stuff, or you can send the cleaned html through lxml.

## Extracting content from the results

Given the size of these files, let's work with just one of these files! 

Let's run through the results for some of this and get p tag text. This has the species plus the number of observations. 

An oddity: the p tags contain each column of data. So while we can get the species names from the a tags, the structure with the text is all inside one p tag. This is a situation where we need to grab the text and then use core python to break the content apart.

In [None]:
target = pathlib.Path('sci-pages')

lines = []
for p in target.glob("*Y.html"):  # only the page(s) whose file name ends in Y
    soup = BeautifulSoup(p.read_text(), 'html.parser')
    p_tags = soup.select("td > p")
    # named `tag` so we don't reuse `p`, which is the file path
    p_text = [tag.text for tag in p_tags]
    for chunk in p_text:
        # each p tag holds a whole column, one species per line
        lines.extend(chunk.split('\n'))

for line in lines:
    print(line)

We can also take a look at the results to see better how to filter it.

In [None]:
['',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Yabea microcarpa (27)',
 'Yavia cryptocarpa (7)',
 ...]

A quick list comprehension can help filter out all the empty strings.

In [None]:
lines = [l for l in lines if len(l) > 0]
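Once the lines are filtered, one way to save them is a plain text file with one species per line. This is only a sketch: the notebook doesn't show this step, `sample_lines` is a made-up stand-in for `lines`, and the file goes in a temporary folder so nothing real gets overwritten.

```python
import pathlib
import tempfile

# Hypothetical sample of what `lines` holds after filtering
sample_lines = ['Yabea microcarpa (27)', 'Yavia cryptocarpa (7)']

with tempfile.TemporaryDirectory() as tmp:
    # in the notebook this could be pathlib.Path('all_species.txt') instead
    out_file = pathlib.Path(tmp) / 'all_species.txt'
    out_file.write_text('\n'.join(sample_lines) + '\n', encoding='utf-8')

    # read it back to confirm there is one species per line
    read_back = out_file.read_text(encoding='utf-8').splitlines()

print(read_back)
```

Joining everything first and writing once avoids the overwrite problem with `write_text()`.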


We could keep going with all the stuff we might want to get out of this, but let's go directly to another good example.

## Using regular expressions to separate content
This is a great situation for using some regular expressions to separate out the species name from the number of pictures.

You can also use tools like Open Refine to help separate information like this. 

Let's use the file that we made. We've got this basic pattern of `species name (number of pictures)`. We can get the species name alone just from the `a` tag, so that's sorted. But perhaps we want to get them both together to make a data file.

Connecting the link, the name, and the number could be a few more steps given that you can't get them all in a single query. But that's for another day.

You may see people out there suggesting that you use regex to snag things from html. NO NO NO NO NO. Use a proper parser to extract text content from html, and then regex if you need to separate content from that text. 

Yes, this can be okay in very specific/small scale situations. Using it this way should be rare and done with extreme care if you do. I've literally never needed to do this. I once had to use string tools to fix html to make it parsable, but that's very different. 

### What are regular expressions?

Regex is a really cool text pattern matching system that's been around forever and usually has tools in many programs. If you've never heard the name before then you will likely start seeing it around more.

Regex as a language is a system of metacharacters (where characters stand for something else) to describe patterns in text. There are some really advanced things you can do with this, but often most of the basics are all you'll need to get a bunch of really useful things done.

* we are used to static searching for text, where stuff needs to match verbatim even if it's a substring
* often there are systematic patterns out there that we want to search for or match
* we can describe these patterns using regex
* this allows us to just return search results matching the pattern or we can use them to actually extract out content
* many tools support regex queries, so the same sort of query can be used in python/r/other text tools. Making this sort of tool a little more "universal" if you need to use multiple platforms.

Our pattern is pretty simple, we can look at it and say: okay there are some words and text followed by some spaces and then parentheses with a number inside. Now, we haven't previewed all species names, so we should do some investigations first. 

Let's read in the file and gather some information first.


In [None]:
with open('all_species.txt', 'r', encoding = 'utf-8') as infile:
    names = infile.read().splitlines() # another way of removing newlines

In [None]:
len(names)


In [None]:
# count any lines that are missing an opening or closing parenthesis
without_p = 0

for n in names:
    if ('(' not in n) or (')' not in n):
        without_p += 1

print(without_p)

0

So no line is missing its parentheses. This is good! Given that these results are being generated by a database query and displayed from a template, this likely means that we successfully got all the entities that we need.

### Using Regexr to help
Hands down one of my most important tools for writing regex. It can be really hard to predict stuff on your own or understand why there aren't results. https://regexr.com/ offers:

* live preview of results
* helpful tooltips about the pattern
* regex references
* save and share your patterns

Paste a sample of your data into the body of the page.

Our first goal is to write a query that will cover all the text in each line.

Remember we have three parts: some text, (), and the numbers inside them. 

### Some regex basics
This will be brief, but is a good preview. There are many regex learning resources plus a later workshop on this. 

In regex we can talk about a few things:

* text in general with `[A-Za-z]` and `[0-9]` as character classes, these will match text within the ranges of those groups
	* but let's remember that there's punctuation and other accented characters
	* and that the numbers will have varying length
* directly mention a character, like `(`
	* we for sure have `(` and `)` in there, so we'll need this
* ask for things to be extracted via `()`
	* hmm this may make for a problem with the existing `()`
* indicate repetition may happen
	* so this might help with the numbers!

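So we can start building this up. For repetition we have a few options:

* `*` for 0-infinity
* `+` for 1-infinity
* `{min,max}` to specify a certain number (written without a space after the comma)

Now for the pattern itself: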

* let's start with the numbers: `[0-9]+`
	* `[0-9]` include all digits as a character class
	* `+` operates on the previous item (not single character but item) saying that it could appear 1-infinity times. 
	* This means that at least one digit must be there, but remaining flexible for what might be there.
* Specify that the numbers should be in (): `\([0-9]+\)`
	* We need to add the literal versions of `(` and `)` and can do this by "escaping" them with the `\` character. 
	* Putting `\` before a metacharacter in a regex string will have it treated as the literal version instead of the metacharacter version
* some text and stuff appears before this in the same line: `.+`
	* your first inclination may be to use a character class or two to support A-Z. But remember there's punctuation, accented characters, spaces, etc. happening in here.
	* We don't want to be overly specific, and we have the benefit of not needing to break the species names apart any more.
	* `.` stands for "anything" basically, but excludes line breaks (because these usually separate data records)
	* `+` says that something needs to appear at least once

So we're matching all the content now, and need to add `()` around the groups of data that we want to return. We only need to put it around the content that we want.

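`(.+)\(([0-9]+)\)` The literal parentheses stay outside the capture groups, so we get back only the number plus the text before it. Note that the first group will include the space that sits before the `(`; you can remove it later with `.strip()`.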

Alternatively: we could have attempted to "split" the text via `(` to break it into two parts. 
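Here is a rough sketch of the build-up described above, followed by the split alternative, using Python's `re` module (introduced in the next section). The `sample` string is one line copied from the earlier output.

```python
import re

sample = 'Yabea microcarpa (27)'

# Build the pattern up one step at a time
digits_only = re.search(r'[0-9]+', sample).group()
with_parens = re.search(r'\([0-9]+\)', sample).group()
full = re.search(r'(.+)\(([0-9]+)\)', sample)
name, count = full.group(1), full.group(2)

# The split alternative: break on the last "(" and tidy both halves
left, right = sample.rsplit('(', 1)
split_name, split_count = left.strip(), right.rstrip(')')

print(digits_only, with_parens)
print(repr(name), count)
print(repr(split_name), split_count)
```

The `repr` output makes the trailing space in the regex version visible. The split version is shorter, but it quietly assumes every line ends with `(number)`, which the regex actually checks.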

## Running queries in python

We will use Python's built-in `re` module, which actually has some pretty lovely official documentation. Two key functions:

* `compile` allows us to state a pattern and save it to a variable. This is good practice because this function supports a lot of flags you may want to add later on. It also lets your code stay a little more compact once you use it.
* `findall` takes a compiled pattern or a pattern directly plus text and returns all matches from your text. 


In [None]:
import re

# raw string (r'...') so python passes `\(` to regex untouched
match_species = re.compile(r'(.+)\(([0-9]+)\)')

for n in names:
    print(re.findall(match_species, n))


We are looping over the lines and running each through this query (vs running this query on the entire thing). Thus, we get back many results. Let's collect those up.

Note that the result is a tuple inside of a list, so we'll need to extract the content. We can check the length of each result to ensure that we found exactly one match per line, saving only those that match and printing out any that don't.
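To make the "tuple inside of a list" shape concrete, here is a small sketch on a few sample lines. The third line is made up to show what a non-match looks like.

```python
import re

match_species = re.compile(r'(.+)\(([0-9]+)\)')

# two lines from the earlier output plus one made-up line that won't match
sample_names = ['Yabea microcarpa (27)', 'Yavia cryptocarpa (7)', 'no number here']

# one findall result per line
shapes = [match_species.findall(n) for n in sample_names]

for n, found in zip(sample_names, shapes):
    print(n, '->', found, 'matches:', len(found))
```

A matching line gives a list holding one tuple of two groups, and a line that doesn't match gives an empty list. That is why checking `len(found) != 1` catches the problem lines.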

In [None]:
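match_species = re.compile(r'(.+)\(([0-9]+)\)')
results = []
for n in names:
    found = re.findall(match_species, n)
    if len(found) != 1:
        # zero matches or more than one: show it so we can investigate
        print(found)
    else:
        # exactly one match: keep the (name, count) tuple
        results.append(found[0])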

Nothing appears! So great. We can also now check that we got exactly two groups from each:

In [None]:
for r in results:
    if len(r) != 2:
        print(r)

Nothing appears again, so we should be good. 

Just another preview of a cool thing you can do, let's play with a dictionary comprehension to take these counts and make a dictionary. 

In [None]:
# strip() drops the space that the first group captures before the "("
species_pics = {name.strip(): int(count) for name, count in results}

Now let's convert that to a Counter object to check some details. This will convert it and then ask it to print the 10 most common (so 10 largest values).

In [None]:
from collections import Counter

counted = Counter(species_pics)

counted.most_common(10)


In [None]:
# Counter.total() needs Python 3.10 or newer; on older versions use sum(counted.values())
counted.total()


In [None]:
for name, count in counted.items():
    if count == 1:
        print(name)
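If you'd like to try these Counter steps without the full data file, here is a sketch on a tiny dictionary. The species counts are hypothetical and only there to show the calls.

```python
from collections import Counter

# Hypothetical counts, standing in for species_pics
sample_pics = {'Yabea microcarpa': 27, 'Yavia cryptocarpa': 7, 'Yucca baccata': 1}

counted_sample = Counter(sample_pics)

# the two largest values, as (name, count) tuples
top_two = counted_sample.most_common(2)

# sum of all counts; works on any Python 3 version
total_pics = sum(counted_sample.values())

# species with exactly one picture
singles = [name for name, count in counted_sample.items() if count == 1]

print(top_two)
print(total_pics)
print(singles)
```

Collecting the single-picture species into a list, rather than printing them one by one, makes it easy to count them or save them afterwards.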

## Other cool tools

### XPath

This system allows you to write queries to navigate and extract content from an xml tree, including html. We can use it within lxml. 

Benefits: 

* you can extract things with more precision than beautiful soup
* xpath has a bunch of functions etc. for precisely searching for things
* xpath is supported by many systems, including R and some other programs. so you don't need to relearn it if you migrate
* being a separate standard, lots of documentation online
* also parses regular xml
* queries run significantly faster than BeautifulSoup

Cons:

* the html needs to be correctly formed xml, but that's something that beautiful soup can do for you.
* not many people know it?
* being string queries, errors can be hard to understand
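For a taste of what this looks like, here is a minimal sketch using `lxml.etree` on a made-up one-row table. The `sample_html` is hypothetical, and `etree.HTML` is just one of several ways lxml can read html.

```python
from lxml import etree

# Hypothetical one-entry table, shaped like "species link plus a count"
sample_html = ('<table><tr><td><p>'
               '<a href="/cgi/x">Yabea microcarpa</a> (27)'
               '</p></td></tr></table>')
tree = etree.HTML(sample_html)

# text inside the a tags
link_text = tree.xpath('//td/p/a/text()')

# attributes are reached with @
hrefs = tree.xpath('//td/p/a/@href')

# string(.) gathers all the text under each p that contains an a tag
parent_text = [p.xpath('string(.)') for p in tree.xpath('//td/p[a]')]

print(link_text)
print(hrefs)
print(parent_text)
```

The `[a]` part is a small example of the extra precision: it keeps only the `p` tags that actually contain a link.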

### Selenium

This package focuses on interacting with websites versus just extraction. For example, pressing buttons, typing things in, clicking things, moving the page, etc.

https://www.selenium.dev/

Can be useful to automate testing or putting queries into a form. This could assist with some authentication needs, but shouldn't be used as an extraction replacement. 

### Python's built-in `webbrowser` package

https://docs.python.org/3/library/webbrowser.html

Basically, lets you give the function a url and it will open that page up in your web browser (eg launch Chrome with that page). Usually used in conjunction with other tools. 

```python
import webbrowser

url = "https://www.geeksforgeeks.org"
  
webbrowser.open(url)
```

### `pyautogui` for interface controlling

https://pyautogui.readthedocs.io/en/latest/

This lets you put in timing and actions for interacting with interfaces on your actual computer. Yes, this will actually just have your computer do stuff while you watch. It's very satisfying. 

## Sort of working with authentication

There are more formal ways of dealing with authentication, but when I came up against 2 factor authentication I gave up trying to use those things.

Instead, I did the following to grab a full backup of every page I could find in our internal wiki.

* I found a page that had a listing of every page within our system and manually downloaded it to disk
* Used my xpath stuff (could have used beautiful soup, too) to extract all the names and hyperlinks
* Went to the wiki and logged in with 2fa using my own credentials. I also manually went to save a page and navigated to the directory where I was saving stuff so it would come up by default later in the session.
* Looped through those urls/names, used `webbrowser` to launch the url in my browser where I was logged in.
* Used `pyautogui` to run the mac commands needed to save the page (cmd + s, wait for the file picker to pop up, press enter to save it to the default folder that shows up, close the tab)

Here's the code for reference, note that you won't have these variables set so you can't run it. It took me a solid hour of messing with this to find the right set of commands. Mostly because my laptop was set in dvorak and weird things were happening. 

```python
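import time
import webbrowser

import pyautogui

for u in urls:  # `urls` comes from the extraction step; the url sits at index 1
    webbrowser.open(u[1])
    time.sleep(2)
    # cmd + s to open the save dialog, then enter to accept the default folder
    pyautogui.keyDown('command')
    pyautogui.press('s')
    pyautogui.keyUp('command')
    pyautogui.press('enter')
    time.sleep(2)
    # cmd + w to close the tab
    pyautogui.keyDown('command')
    pyautogui.press('w')
    pyautogui.keyUp('command')  # release the key so it isn't held down on the next loop

    print(u)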
```

Then I worked on knitting while I watched the show.
