<a href="https://colab.research.google.com/github/rzl-ds/gu511/blob/master/007_web_scraping.ipynb" target="_parent">
    <img src="https://colab.research.google.com/assets/colab-badge.svg"/>
</a>

# web scraping and `http`

let's talk about the internet. specifically, how people communicate using the internet. super specifically, how you can get data and request that "things" be done for you using the internet.

this is by no means an exhaustive summary. there are 7.2 bajillion articles about web scraping and they're probably all better. [especially the Mozilla ones](https://developer.mozilla.org/en-US/). also, people always freak out about how bad the W3 pages are (and they are), but they're still (in my opinion) a good beginner's resource. so don't be afraid to try them out, but be aware that there are some problems.

the only thing that is special about this one is that I wrote it and you've been duped into reading it.

so enjoy!

## `http`, `request`s and `response`s

the basic problem here is that someone out there in the ol' interwebs has put up a pretty page with some data we want, and we need to go get it.

we are trying to make our computer yell at their computer and send us that neat data. so we need to figure out how to yell in a way that makes their computer happy.

note that we've already done some internet based communication using different protocols: we've yelled from one computer to another via `ssh`. we've used programs like `ssh`, `scp`, and `putty` to manage this yelling. we're going to do the same thing, but with a new **protocol** -- a new yelling language.

our computers all yell at each other in a language called `http` -- you have perhaps heard of it.

### `request`s

[there are a ton of details about how to yell](https://developer.mozilla.org/en-US/docs/Web/HTTP), what's appropriate and what isn't, etc. but the basic idea is that you make a block of text like the below and then send it over the internet to the other server

<br><div align="center"><img src="https://mdn.mozillademos.org/files/13821/HTTP_Request_Headers2.png" width="700px"></div>

these messages basically contain 3 pieces:

+ the start line (above, `POST / HTTP/1.1`)
+ the `headers`
    + metadata about the request. things like where it is coming from, the type of browser requesting it, which formats the `request`-ing agent understands (e.g. `html`, `json`)
    + the red, green, and blue boxes above are three flavors of headers
+ the `body`
    + a generic space for the `request`er to add any content (text) that might be meaningful to the server on the other end of that `request`

right now, let's zero in on the first line (called the "start line"):

```
POST / HTTP/1.1
```

this has three pieces:

1. the [`request` `method`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) (here, `POST`)
1. the target of the `request` (here, `/`; this is usually the path portion of a `url`, with the server's name sent separately in the `Host` header)
1. the `http` protocol version (here, `HTTP/1.1`)
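
to make the "block of text" idea concrete, here is a minimal sketch that assembles a `request` message by hand and then pulls the start line back apart. the host name, header values, and body are made-up placeholders (assumptions, not taken from the picture above), and real clients build all of this for you.

```python
# a minimal sketch of an http request message as plain text
# (host, headers, and body are made-up placeholder values)
CRLF = "\r\n"  # http separates lines with carriage return + line feed

start_line = "POST / HTTP/1.1"
body = '{"greeting": "hello"}'
headers = {
    "Host": "example.com",
    "Content-Type": "application/json",
    "Content-Length": str(len(body.encode("utf-8"))),
}

message = (
    start_line + CRLF
    + "".join(f"{name}: {value}{CRLF}" for name, value in headers.items())
    + CRLF  # a blank line separates the headers from the body
    + body
)
print(message)

# the start line is the first line, and it has three space-separated pieces
method, target, version = message.split(CRLF)[0].split(" ")
print(method, target, version)
```

nothing here touches the network; it only shows how the start line, headers, and body sit together in one piece of text.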

let's zero in on that first element -- the `method`

there are several different types of yells (`request` `method`s) which "you" (the client) can make to "them" (the `http` server):

+ `GET` -- you want to get a thing and they should give it to you
+ `POST` -- you want to send them a thing and they should at least write you a thank you card; if they want to give you something back that's cool but it's not expected (but it's polite, so, that's basically the same as expectation, but with all sorts of passive aggressiveness tacked on)

+ `HEAD` -- you want to get a thing but you would prefer it if they just told you about the thing first, like sent you a picture of it or told you how big it was. You may still want to get it at some point but you're just testing the water. it's a fun, no commitment thing -- why rush in?
+ `PUT` -- a really pushy `POST` request, basically they tell us they have a cool painting and we say OUR PAINTING HERE IS BETTER FORGET YOURS

+ `DELETE` -- actually that painting was dumb. you were right. our bad
+ `CONNECT`, `OPTIONS`, `TRACE`, `PATCH` -- stuff your parents yell to their friends but no one really does any more outside of really formal settings

all of these have their time and place and the web wouldn't work without them.

that being said, you generally only need `GET` and `POST` messages.

### `response`s

suppose you found some server out on the internet and you decided to compose a `GET` message like the one above and send it to that computer (using its `ip` address). that server spins its wheels determining what it wants to say and then sends you back a `response` -- something *similar* to a `request`, but with simple, important differences:

<br><div align="center"><img src="https://mdn.mozillademos.org/files/13823/HTTP_Response_Headers2.png" width="700px"></div>

these messages basically contain 3 pieces:

+ the start line (above, `HTTP/1.1 200 OK`)
+ the `headers`
    + metadata about the `response`. things like where it is coming from, the format of the material it is sending back (if anything), `cookie` values (cached, computed values for future use)
+ the `body`
    + a generic space for the server to put the content that was `request`ed

right now, let's zero in on the first line (called the "start line"):

```
HTTP/1.1 200 OK
```

this has three pieces as well:

1. the `http` protocol version (here, `HTTP/1.1`)
1. the [`http` status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) (here, `200`)
    + 100s: information responses (rare in web scraping)
    + 200s: success! (what we want)
    + 300s: redirection (go look *here* instead)
    + 400s: **client** error (*you* did something wrong)
    + 500s: **server** error (you did everything right but the server borked)
1. a short summary of the response (here, `OK`)
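
as a rough sketch of the breakdown above, the snippet below splits a `response` start line into its three pieces and labels the status code by its hundreds digit. the function name `parse_status_line` and the family labels are my own shorthand (assumptions); the only library piece used is the standard library's `http.HTTPStatus`.

```python
from http import HTTPStatus

# shorthand labels for each hundreds-digit family of status codes
FAMILIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def parse_status_line(status_line):
    """split a response start line into (version, code, reason, family)"""
    # split at most twice: the reason phrase may itself contain spaces
    version, code, reason = status_line.split(" ", 2)
    code = int(code)
    return version, code, reason, FAMILIES[code // 100]


print(parse_status_line("HTTP/1.1 200 OK"))
print(parse_status_line("HTTP/1.1 404 Not Found"))

# the standard library also knows the usual reason phrase for each code
print(HTTPStatus(503).phrase)
```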

## what's in the `body`

so, suppose you got your computer to yell at their computer and everyone's on the same page.

you say HEY CAN I `GET` OR WHAT and the server sends a `response`. the body is:

```
01100101 01100001 01110011 01110100 01100101 01110010 00100000 01100101
01100111 01100111 01110011 00100000 01100001 01110010 01100101 00100000
01100111 01110010 01100101 01100001 01110100 00100000 01001001 00100000
01101100 01101111 01110110 01100101 00100000 01100101 01100001 01110011
01110100 01100101 01110010 00100000 01100101 01100111 01100111 01110011
00101110 00100000 01110011 01100011 01110010 01100001 01101101 01100010
01101100 01100101 01100100 00101100 00100000 01101111 01110110 01100101
01110010 00100000 01100101 01100001 01110011 01111001 00101100 00100000
01100100 01101111 01101110 00100111 01110100 00100000 01100101 01110110
01100101 01101110 00100000 01100011 01100001 01110010 01100101 00101110
00100000 01110011 01100101 01110010 01101001 01101111 01110101 01110011
01101100 01111001 00100000 01110100 01100001 01110011 01110100 01111001
00101110
```

<!--a href="http://www.rapidtables.com/convert/number/binary-to-ascii.htm">funtimes<a/ -->

not cool.

okay so things are actually a *little* cool: while there's nothing that says the `body` will come in any particular format, there are some very common formats.

it looks like many of the other computers out there speak `json` and `html`

### `json` format and `javascript`

one of the most common formats for the `body` of an `http` message is [JavaScript Object Notation](http://json.org/), or `json` for short.

this is the basic way that computers communicate to each other through `api`s

if you're familiar with `python` default data types, `json` can be converted directly to lists, dictionaries, numbers, and strings. my personal favorite example is [the Magic the Gathering `json` webpage](https://mtgjson.com/) ([direct link to example](https://www.mtgjson.com/json/decks/FaerieSchemes_ELD.json)), but [there are a ton out there](https://github.com/toddmotto/public-apis).
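
here is a small sketch of that conversion using the standard library's `json` module. the `json` text below is made up for illustration (it is not the structure of any real api response).

```python
import json

# a made-up json document, as it might arrive in a response body
text = """
{
    "name": "example deck",
    "cards": [
        {"name": "island", "count": 14},
        {"name": "forest", "count": 10}
    ],
    "released": true,
    "notes": null
}
"""

deck = json.loads(text)

print(type(deck))                # json objects become python dicts
print(type(deck["cards"]))       # json arrays become python lists
print(deck["cards"][0]["count"])  # numbers become ints (or floats)
print(deck["released"], deck["notes"])  # true -> True, null -> None
```

going the other way, `json.dumps(deck)` turns the `python` object back into `json` text.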

there are a million web tutorials out there. use any of them. they're all better than this.

`json` is everywhere. it [has its detractors](https://codepunk.io/xml-vs-json-why-json-sucks/), but it is essentially the *de facto* data and information transfer format on the web.

as I wrote above, `json` is the "JavaScript Object Notation." the reason it's so ubiquitous is that [`javascript`](https://developer.mozilla.org/en-US/docs/Web/JavaScript) is ubiquitous.

but **`json` is not `javascript`**. it's a *notation*, an agreed-upon format for writing out hierarchical objects as text.

`javascript` is a programming language which has a ton of ~~bugs~~ features resulting from its use case as the primary web language for client-side computation (where heavy lifting is done by your laptop (via your browser) rather than the server sending you information).

some people treat `javascript` as if it's not a chaos land of abominations and for the most part we all let them go on thinking those things, because they make us really pretty and fast webpages. Sometimes [everyone realizes that the whole enterprise is castles built on sand](https://qz.com/646467/how-one-programmer-broke-the-internet-by-deleting-a-tiny-piece-of-code/) and the super l33t trolls of the internet have a chuckle or two.

the main reason I bring it up is this:

+ getting `json` as the body of a `response` to your `request` is trivially easy -- the best case scenario for web scraping
+ needing `javascript` to *compute* something so that it can be scraped is much harder -- the worst case scenario for web scraping (but still doable!!)

sometimes the only way to access the data you want is via `javascript`. in about 99% of all cases I've come across, I find this is something you can avoid with some digging and some advance knowledge of how web pages are built / how to use the browser inspect tools.

why avoid `javascript`? because it requires breaking out of the simple `requests` paradigm and using a full-fledged `javascript` engine, and that's a big jump in complexity.

**whenever possible, I try and just `GET` things.**

### `html` and `xml`

`json` isn't the only way that computers yell at each other, though. there are other formats they yell in, perhaps you've heard of `html` ([HyperText Markup Language](https://developer.mozilla.org/en-US/docs/Web/HTML))

`html` is a special amalgam of xml-like tag structures that is the language which is used by your browser to render webpages.

check out [this example page with simple `html`](http://www.columbia.edu/~fdc/sample.html).

specifically, open it in your browser and then right click and select "view source." you can do that on any webpage.

of course, you usually will have more luck using the "developers tools" than just viewing source

+ in google chrome, right click > "Inspect"
+ in firefox, right click > "Inspect Element"
+ in explorer or edge, press F12
+ in safari, go to www.google.com, search for, and then download something else.

so suppose you sent a `GET` request to [that simple `url` from above](http://www.columbia.edu/~fdc/sample.html) and received something (I've added indentations) that looked like:

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <!-- THIS IS A COMMENT -->
    <title>Sample Web Page</title>
    <META http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
  </head>
  <body bgcolor="#ffffff" text="#000000">
    <h2>Sample Web Page</h2>
    <!-- How to insert an image -->
    <img src="picture-of-something.jpg" alt="Brief description" width="100%"><br>
    <small><i>A random photo, maximize your browser to enlarge.</i></small>
    <p>
        <a href="http://www.columbia.edu/~fdc/">Frank da Cruz</a><br>
        ...
```

##### EXTREMELY brief `xml` refresher

<br><div align="center"><img src="https://i1.wp.com/img.c4learn.com/2012/05/What-is-an-XML-ELement1.png" width="600px"><br><p>picture from <a href="https://premaseem.wordpress.com/2015/01/25/xml-elements-vs-attributes/">this blog post</a></p></div>

we will learn two ways of parsing `html` for *elements* of data we wish to obtain: `css` selectors and the `xpath` query language

#### searching `html` with `css`

Cascading Style Sheets (*aka* [`css`](https://developer.mozilla.org/en-US/docs/Web/CSS)) is a language for describing how a browser should format and style the `html` documents it renders.

`css` files are not a pre-requisite for creating a webpage (styling can be done directly in `html` docs), but websites that *don't* utilize `css` (or some more advanced variant) are few and far between.

If you want an example of `css`, you can again refer to the "Inspect" screen / "developers tools."

+ The entire right panel is dedicated to `css` and `javascript` properties of the highlighted `html` document elements.
+ the `Sources` tab of the Inspect window has a lot of files, but all the delivered `css` files are there
+ check out the `css` file for [the Storyblocks landing page](https://www.storyblocks.com): [source is here](https://d3g7htsbjjywiv.cloudfront.net/assets/build/storyblocks~storyblocksAboutUs.e8fa4b843dab436694b1.css), but that's not that useful, is it?

why does this matter?

developers are already using `css` to find and style "things like this" in `html` documents. we can do the same! if we want to find elements that look like

```html
<ul class="zachs_list">
```

we can use `css` selector shorthand:

```css
ul.zachs_list
```

`css` is a remarkably flexible way of specifying elements within a hierarchy and [this MDN tutorial](https://developer.mozilla.org/en-US/docs/Learn/CSS/Introduction_to_CSS/Selectors) is an excellent intro. there are a few rules that are most important to know:

+ you don't *have* to provide the element tag
+ `class` values are selected with the `.` character (e.g. `.class_name`)
+ `id` values are selected with the `#` character (e.g. `#id_name`)
+ a basic element in a selector is a combo of the `html` tag elements (e.g. `div`, `p`, `li`) and/or the above (e.g. `div.class1`)

+ the basic elements can be combined in several ways
    + a comma between elements (e.g. `div, p`) will select any element of either the first or second type
    + a space (e.g. `div p`) will select any second element underneath the first
    + a `>` (e.g. `div > p`) will select any second element which is a direct child of the first
    + a `+` (e.g. `div + p`) will select any second element which is the sibling immediately following the first
    + a `~` (e.g. `div ~ p`) will select any second element which is a sibling following the first (immediately or not)

+ attributes can be selected via
    + `[attr]`: any element which has a given attribute
    + `[attr=val]`: any element which has the given attribute with the given value
    + several other options performing logical tests on the values

##### examples of each of the above

a comma between elements will select any element of either the first or second type

selector:
```css
ul, li
```

matches:
```html
<body>
    <div>
        <ul class="zachs_list">  <!-- selected -->
            <li>hello</li>       <!-- selected -->
            <li>world</li>       <!-- selected -->
        </ul>
    </div>
</body>
```

a space between elements will select any element of the *second* type which is a descendant of the *first* at *any* depth

selector
```css
div li
```

matches
```html
<body>
    <div>                        <!-- not selected, but looking under here -->
        <ul class="zachs_list">
            <li>hello</li>       <!-- selected -->
            <li>world</li>       <!-- selected -->
        </ul>
    </div>
</body>
```

a `>` between two elements will select any element of the *second* type which is a *direct child* of an element of the *first* type

selector
```css
ul > li
```

matches
```html
<body>
    <div>
        <ul class="zachs_list">
            <li>hello</li>       <!-- selected -->
            <li>world</li>       <!-- selected -->
        </ul>
    </div>
</body>
```

note that the `>` selector statement is pickier than the corresponding space (` `). `div li` worked above, but `div > li` does not work

selector
```css
div > li
```

matches
```html
<body>
    <div>                        <!-- not selected, but looking under here -->
        <ul class="zachs_list">
            <li>hello</li>       <!-- not selected, not a direct child! -->
            <li>world</li>       <!-- not selected, not a direct child! -->
        </ul>
    </div>
</body>
```

a `+` between two elements will select any element of the *second* type which is an *immediately-following sibling* of an element of the *first* type

selector
```css
li + li
```

matches
```html
<body>
    <div>
        <ul class="zachs_list">
            <li>hello</li>       <!-- not selected, but element of first type -->
            <li>world</li>       <!-- selected, immediate sibling -->
            <li>for fun</li>     <!-- selected, immediate sibling -->
        </ul>
    </div>
</body>
```

a `~` between two elements will select any element of the *second* type which is *any following sibling* of an element of the first type

selector
```css
li ~ li
```

matches
```html
<body>
    <div>
        <ul class="zachs_list">
            <li>hello</li>       <!-- not selected, not a *following* sibling -->
            <p>...</p>
            <li>world</li>       <!-- selected -->
        </ul>
        <ul class="zachs_second_list">
            <li>world</li>       <!-- not selected, no earlier li sibling in this list -->
        </ul>
    </div>
</body>
```

writing an attribute name within brackets (e.g. `[attr]`) or a tag element plus that string (e.g. `div[attr]`) will select any element which has that tag (if provided) and has that attribute (regardless of value)

selector
```css
li[myattr]
```

matches
```html
<body>
    <div>
        <ul class="zachs_list">
            <li myattr="myval">hello</li>  <!-- selected -->
            <li>world</li>
        </ul>
    </div>
</body>
```

and without specifying that it must be an `li` element:

selector
```css
/* note: no leading element! */
[myattr]
```

matches
```html
<body>
    <div>
        <ul class="zachs_list">
            <li myattr="myval">hello</li>    <!-- selected -->
            <li>world</li>
        </ul>
    </div>
    <p myattr="myval">example paragraph</p>  <!-- selected -->
</body>
```

note that this selector doesn't care if the *attribute* has anything for a *value*:

selector
```css
/* note: no leading element! */
[myattr]
```

matches
```html
<body>
    <div>
        <ul class="zachs_list">
            <li myattr>hello</li>             <!-- selected, even without a value -->
            <li>world</li>
        </ul>
    </div>
    <p myattr="myval"> example paragraph</p>  <!-- selected -->
</body>
```

finally, you can also specify that in addition to *having* an attribute that attribute has a specific value (`[attr=val]`). the same rules re: including or not including the tags apply

selector
```css
li[myattr="myval"]
```

matches
```html
<body>
    <div>
        <ul class="zachs_list">
            <li myattr="myval">hello</li>  <!-- selected -->
            <li>world</li>
        </ul>
    </div>
</body>
```

and an example in which the tag is not provided but only one element has a matching attribute value

selector
```css
[myattr="myval"]
```

matches
```html
<body>
    <div>
        <ul class="zachs_list">
            <li myattr>hello</li>             <!-- not selected, no value -->
            <li>world</li>
        </ul>
    </div>
    <p myattr="myval"> example paragraph</p>  <!-- selected -->
    <p myattr="blurb"> example paragraph</p>  <!-- not selected, wrong value -->
</body>
```

#### searching `html` with `xpath` [advanced]

[`xpath`](https://msdn.microsoft.com/en-us/library/ms256086) is a query language for `xml` documents. Given that `html` has the same tree-of-tagged-elements structure as `xml` (it is not strictly a subset, but it is close enough), it's a natural fit for parsing `html` documents.

it has been around since 1999 (time-tested in years, but perhaps not in projects or eyeballs), has a broader use case than the alternative selecting language (`css`) because it covers all of `xml`, and can do some things `css` selectors cannot. **however**, it has a steeper learning curve and is less commonly used in the web-scraping community.

by analogy, `xpath` is to `css` selectors what `C` or `C++` are to `python`

that being said, it's by far my favorite of the two, and almost always my go-to. this is probably only because I learned it first, which is a terrible way to make a decision.

technically, if you're learning both `xpath` and `css` selectors here, you learned them in the opposite order -- I have broken the cycle.

in many respects, `xpath` is similar to describing paths of files on a linux file system. let's learn via this analogy.

suppose that you know that somewhere several levels deep inside your root directory there is a folder called `important_files` and that it has a file in it where the extension is `.txt` and you want to list information about it.

if you knew the *exact path*, you could use the `ls` command:

```
ls /path/to/directory/important_files/my_file.txt
```

if you *don't* know the *exact path*, though, maybe you could use some wildcard / `glob` expressions to find it. for example, you could use the find tool:

```bash
find . -ipath "*important_files/*.txt"

# real example:
find ~/miniconda3 -ipath "*/pkgs/*.yaml"
```

this example is a little convoluted. typically you don't suddenly have an instinct that there are `yaml` files you need and they are all under the `pkgs` directory; rather, you identify the files you need and you build as generic a `glob` expression as you can to search for only those, e.g.:

> all the files I want end in `.yaml`, but that's too broad -- I want only those that had `pkgs` as one of the ancestor directories

`xpath` is similar. take the example below, and suppose we wanted to "select" the `ul` item.

you know things about the hierarchy of a document (it's an unordered list, `ul`) and the attributes (attribute `class` has value `zachs_list`)

```html
<body>
    <div>
        <ul class="zachs_list">
            <li>hello</li>
            <li>world</li>
        </ul>
    </div>
</body>
```

we can specify the *exact path*:

```
path = /body/div/ul
```

```html
<body>
    <div>
        <ul class="zachs_list">
            <li>hello</li>
            <li>world</li>
        </ul>
    </div>
</body>
```

it's also possible to discuss *relative paths*. suppose we want to describe `ul` *relative* to `div`:

```
path = ./ul
```

```html
<body>
    <div>
        <ul class="zachs_list">
            <li>hello</li>
            <li>world</li>
        </ul>
    </div>
</body>
```

we can also represent unknown elements along some longer hierarchy (similar to the `*` in our patterns in the file path examples above). we do this in `xpath` by writing two consecutive `/` characters. suppose we want to find any `ul` element under the `root` (top of the document) with any intermediate elements:

```
path = //ul
```

```html
<body>
    <div>
        <ul class="zachs_list">
            <li>hello</li>
            <li>world</li>
        </ul>
    </div>
</body>
```

finally, it's possible to select elements using their attributes (those items inside the `<>` characters such as `class`, `id`, *etc.*).

you specify these elements using the following notation:

```
/path/to/element[@attr="attr_value"]
```

for example, we could find all `ul` elements with `class` value of `zachs_list` via the path:

```
path = //ul[@class="zachs_list"]
```

```html
<body>
    <div>
        <ul class="zachs_list">
            <li>hello</li>
            <li>world</li>
        </ul>
        <ul></ul>
    </div>
</body>
```
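
to try these paths out without installing anything, here is a minimal sketch using `xml.etree.ElementTree` from the `python` standard library. this is a stand-in I'm assuming for illustration: it only supports a limited subset of `xpath`, and it only works here because the snippet happens to be well-formed `xml`. it also refuses absolute paths on an element, so every path below is written relative to the root `<body>` element.

```python
import xml.etree.ElementTree as ET

doc = """
<body>
    <div>
        <ul class="zachs_list">
            <li>hello</li>
            <li>world</li>
        </ul>
        <ul></ul>
    </div>
</body>
"""

# the root element here is <body>, so every path below is relative to it
body = ET.fromstring(doc)

exact = body.findall("./div/ul")                      # body -> div -> ul
anywhere = body.findall(".//ul")                      # any ul at any depth
by_attr = body.findall(".//ul[@class='zachs_list']")  # ul with that class

print(len(exact), len(anywhere), len(by_attr))  # prints: 2 2 1

# once we have the element, we can keep searching relative to it
print([li.text for li in by_attr[0].findall("./li")])  # ['hello', 'world']
```

the attribute filter is what narrows two `ul` matches down to the one we care about.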

**<div align="center">mini-exercise: create some `xpath` expressions</div>**

given the source `html` from this webpage: https://anaconda.org/anaconda/repo, develop some `xpath` expressions to select the values in the "Package Name" column.

*hint: in "inspect" mode, you can find an "element selector" button that will allow you to click on the visible element and it will isolate that piece in the source code*

here are some reasonable options:

1. absolute: `/html/body/div/div/div/div/div/div/div/div/div/table/tbody/tr/td/a/span`
2. at any depth (over-inclusive): `//a/span`
3. using attributes: `//span[@class="packageName"]`

one final note on "narrowing down" the number of items which will match a given `xpath` expression: there are two very common attributes in `html` documents, and because they serve different purposes in `html` and `css`, they can be pretty useful in selecting items:

1. `class`: this attribute marks that an element is "in a class with" other elements, so it often defines elements which are conceptually related. This is particularly common for groups of things that are specially formatted (e.g. our "Package Name" elements, which are all green links)
2. `id`: this is supposed to be *unique within a document*, so if we want to get one and only one element (*e.g.* `<table class="full-width" id="repo-packages-table">`) this is the most reliable way to do so

### a quick diversion: installing selector gadget

generally speaking, it never *hurts* to check out the source code when building an `xpath` expression or a `css` selector expression. but there's a pretty great tool for taking a shortcut, if you're interested...

**<div align="center">mini-exercise: install selector gadget</div>**

go find the browser extension [selector gadget](http://selectorgadget.com/) and install it.

note: this is optional. if you don't like cluttering your browser I get it and you can just follow along

up above we built an `xpath` expression to find the "Package Name" element in the table at (https://anaconda.org/anaconda/repo). let's look at how selector gadget helps us here

**<div align="center">mini-exercise: use selector gadget to build a `css` path and an `xpath` expression</div>**

1. go to the [`anaconda` package repo page](https://anaconda.org/anaconda/repo)
1. activate selector gadget by clicking on the extension button
1. select the package name element
    1. this will activate (highlight yellow) elements which match that selected item under *some* `css` selector path
1. click on any "wrong" items until only the desired elements are highlighted
1. investigate the `css` selector item at the bottom, and click the `xpath` button
1. repeat with the Summary field items

**<div align="center">PAUSE FOR ZOOM BREAK</div>**

## setup `python` to make `request`s

**<div align="center">mini-exercise: create a web scraping `conda` environment</div>**

using `conda`, let's create a new environment and install the most basic items for our first web scraping tests. use `ec2` or your laptop, wherever you prefer to access `conda` from a command line

```bash
conda create -n scrapesville python=3
conda activate scrapesville
conda install -y requests lxml cssselect pandas beautifulsoup4
```

that's it! we should be good to go

fun side story: check out [this cute little trick](https://www.theregister.co.uk/2017/09/15/pretend_python_packages_prey_on_poor_typing/), and make sure you always install `requests` and not `reqeusts`

update: this has actually been addressed by `pypi` admins, and they are supposedly preventing this in the future. how? not sure.

## `GET` requests

so far we have learned:

+ communication via the `http` protocol is done by a *client* sending a `request` to a *server*, and a *server* sending a `response` back to the *client*
+ `request`s and `response`s have a fixed format that includes a start line, headers with meta info, and a body
+ most `response` messages contain a body that is formatted as `json`, `html`, or `xml`

one thing we *haven't* done yet is talk about how to build `request` messages. let's do that now!

we will start by building a `GET` `request` message.

as a reminder, a `GET` request is a request we send to a server to ask for some information, and the returned message is whatever content we asked for.

### `json` is almost too easy

let's start with the easy case: a `GET` request that returns `json`.

my go-to `json` example, as discussed above, is [mtgjson.com](https://mtgjson.com/). however, I can understand why maybe that's not the most useful resource for everyone.

let's work with the (debatably) more useful [`github` software developers jobs board `json` api](https://jobs.github.com/api).

take, for example, the below url:

```
https://jobs.github.com/positions.json?description=data&location=washington%20dc&full_time=true
```
[(link)](https://jobs.github.com/positions.json?description=data&location=washington%20dc&full_time=true)

when working with web apis, you can think of a URL as being a call to a function, perhaps passing that function some parameters

let's break the url from this request down into pieces. I'll use whitespace to emphasize the pieces:

```
https://jobs.github.com/positions.json
    ?
    description=data
    &location=washington%20dc
    &full_time=true
```

```
https://jobs.github.com/positions.json  <-- the endpoint
    ?
    description=data
    &location=washington%20dc
    &full_time=true
```

the first part is the familiar looking `url`. this is known as the api "endpoint," a `url` to which we can send `request`s. in an api call, this serves as the name of the function being called

```
https://jobs.github.com/positions.json
    ?                                   <-- the start of parameters
    description=data
    &location=washington%20dc
    &full_time=true
```

`?` is the standard delimiter indicating that what follows is a list of key-value pairs of parameters and their values (the "query string"). think of this as the open parenthesis of a function call

```
https://jobs.github.com/positions.json
    ?
    description=data           <-- parameter=value
    &location=washington%20dc  <-- &parameter=value
    &full_time=true            <-- &parameter=value
```

each of the `key=value` strings is a parameter and value pair passed to the api function.

the allowed keys are defined by the api endpoint, and a really good api will [tell you what the possibilities are](https://jobs.github.com/api) (but some won't).

```
https://jobs.github.com/positions.json
    ?
    description=data           <-- parameter=value
    &location=washington%20dc  <-- &parameter=value
    &full_time=true            <-- &parameter=value
```

in addition to the `key=value` pairs there are `&` characters -- these symbols separate key-value pairs, like commas in an `R` or `python` function

all together:

```
https://jobs.github.com/positions.json?description=data&location=washington%20dc&full_time=true
```

+ ask the `github` `positions.json` endpoint, and
+ set the `description` to be `"data"`
+ set the `location` to be `washington%20dc`
    + here `%20` is a [`url`-encoding](https://en.wikipedia.org/wiki/Percent-encoding) of the space character. many characters are escaped and encoded this way
+ set the `full_time` parameter to be `true`
    + note: `javascript` and hence most `api`s have a lowercase `t` in the `true` boolean

in theory, you could look up the `api` endpoints, parameters, and possible values, and construct these `url` values yourself. you could build the entire `url` as a string and use `curl` or `wget`, just like we have used them to download things earlier in this course.
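
for the curious, building the `url` by hand might look something like the sketch below, which uses only `urllib.parse` from the standard library. it doesn't send anything over the network; it just assembles the string and then takes it apart again.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

endpoint = "https://jobs.github.com/positions.json"
params = {
    "description": "data",
    "location": "washington dc",
    "full_time": "true",
}

# quote_via=quote encodes the space as %20; the default (quote_plus)
# would encode it as "+" instead
query = urlencode(params, quote_via=quote)
url = f"{endpoint}?{query}"
print(url)

# and going the other way: split a url back into endpoint and parameters
parts = urlsplit(url)
print(parts.netloc, parts.path)
print(parse_qs(parts.query))
```

note that `parse_qs` undoes the `%20` encoding and returns each value wrapped in a list, because a key is allowed to appear more than once in a query string.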

in practice, you will just let `python` (and the `requests` package we installed above) handle all of that for you.

In [None]:
import requests

help(requests.get)

In [None]:
response = requests.get(
    # note: no ? or key-value pairs in the url -- requests will do that
    url='https://jobs.github.com/positions.json',
    # the key-value pairs go here, and note we don't %20 (requests does
    # that for us, too)
    params={
        'description': 'data',
        'location': 'washington dc',
        # pass the lowercase string: a python `True` would be sent as "True"
        'full_time': 'true',
    }
)

In [None]:
response

remember that the 200-series [`http` status codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) are *success* codes, so our request above was successful

this code is useful information for more automated process, but also sometimes helps us understand what went wrong. for the most part, we like 200s and we hate 400s and 500s. the others -- supposedly -- exist.

In [None]:
response.status_code

very cool. now what about the content? as the [documentation](https://jobs.github.com/api) said, this `api` endpoint responded with `json`, which is accessible via the `json` method of this `response` object:

In [None]:
j = response.json()
j

we can dig into that as we would with any `python` dict of dicts of dicts of dicts of lists of dicts:

In [None]:
j[0]['title']

In [None]:
from IPython.display import display, HTML

display(HTML(j[0]['description']))

not too bad!

**<div align="center">PAUSE FOR ZOOM BREAK</div>**

### `html` is a little harder

as a working example, let's keep focusing on that `conda` repo table: https://anaconda.org/anaconda/repo.

we'll use the `requests` library to get the `html` document, and then a few different ways of parsing / searching that document

+ `css` selectors with the `lxml` library
+ `xpath` with the `lxml` library
+ `beautifulsoup4` find functions

In [None]:
import requests

In [None]:
response = requests.get(url='https://anaconda.org/anaconda/repo')
response.status_code

In [None]:
print(response.text[:1000])

#### loading `html` into `python`

with `response.text` we have a string we *could* parse to find things. we know from our selector gadget work above that the package name `html` elements all have a class of `packageName` (`.packageName` as a `css` selector). we could find that via normal string functions, or regular expressions:

In [None]:
i_first_class = response.text.find('packageName')
response.text[i_first_class - 100: i_first_class + 100]

In [None]:
import re
m = re.search("<span class='packageName'>", response.text)
response.text[m.start(): m.end()]

you *could* do this. you *could*. but you definitely ***should not do this***. to understand why, start with [one of the most popular stack overflow answers of all time](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)

in short: parsing `html` is very hard! don't do it by yourself. rely on thousands of better programmers to do it for you, using an external `xml` parsing library `libxml`.
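
here is one small, hedged illustration of why the string-matching approach is brittle. the `html` snippet is made up, and to keep the example dependency-free it uses `python`'s built-in `html.parser` rather than `lxml`; the class name `PackageNameParser` is invented for this sketch.

```python
import re
from html.parser import HTMLParser

# three spellings of the same element; a browser treats them identically.
# this snippet is made up for illustration
snippet = '''
<span class='packageName'>one</span>
<span class="packageName">two</span>
<span id="x" class="packageName bold">three</span>
'''

# the hand-written pattern only matches the first spelling
regex_hits = re.findall(r"<span class='packageName'>", snippet)


class PackageNameParser(HTMLParser):
    """collect the text inside span elements that have class packageName"""

    def __init__(self):
        super().__init__()
        self.in_package_span = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'span' and 'packageName' in classes:
            self.in_package_span = True

    def handle_data(self, data):
        if self.in_package_span:
            self.names.append(data.strip())

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_package_span = False


parser = PackageNameParser()
parser.feed(snippet)
print(len(regex_hits), parser.names)
```

the regular expression finds one of the three spans while the parser finds all three. even this toy parser ignores nested spans, which is one more reason to lean on a real library.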

in `python`, this is available as a package `lxml`

the `lxml` package, and specifically `lxml.html` library, will read in arbitrary (even broken!) `html` and `xml` strings and build a `python` object that you can use as an interface to that `html` document.

the object we create this way knows how to "move around" in the returned `html`, and can very quickly search through `html` documents using the query languages we described above (`css` selectors, `xpath`)

In [None]:
import lxml.html

root = lxml.html.fromstring(response.text)
root

the `Element` object is nested: it has "children" which are, themselves, `lxml.html Element` objects:

In [None]:
root.getchildren()

note that this matches the general structure of all `html` documents:
```html
<!doctype html>
<html lang="en">
    <head>...</head>
    <body>...</body>
</html>
```

the `head` and `body` elements themselves have children:

In [None]:
head, body = root.getchildren()
body.getchildren()

and so on throughout the document

#### `css` selection

`css` selectors are a concise and flexible query language for specifying elements in arbitrary `html` documents. we can use this query language to search "within" any one `lxml.html` `Element` object using the `cssselect` method (*three* `s` characters there).

*note: `css` selectors are not supported out of the box by the `lxml` module, only `xpath` is. to use `css` selectors you must also install the `python` package named `cssselect`, which we did with `conda` up above*

in the selector gadget mini-exercise above, we found a simple `css` selector expression for identifying all the package name cells: `.packageName`

In [None]:
packageElems = root.cssselect('.packageName')
print('we have {} spans'.format(len(packageElems)))

print('the first one is:\n')
pkg0 = packageElems[0]
print(lxml.html.tostring(pkg0).decode())

the text contained "within" this first `span` element is `isort` (note: will change day-to-day). the `lxml.html` object that we named `pkg0` above has that text string available to you as an attribute `.text`:

In [None]:
pkg0.text

using that `text` object attribute we can get all of the package names from a simple list comprehension

In [None]:
# list comprehension to iterate over packageElems list
packages = [elem.text.strip() for elem in packageElems]
packages[:10]

that's pretty good so far, but what if we want to actually pull down *all* the contents of that table -- not just the package name?

we *could* create a separate parser for each column and zip them together

there are issues with this sort of approach, though -- any idea what would be problematic with this?

first, what would happen if one of the rows was missing an element, and we zipped lists of different lengths? or a future table had multiple sub-elements in a cell in a row (e.g. [here](https://en.wikipedia.org/wiki/Help:Table#Cells_spanning_multiple_rows_or_columns))
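
a quick sketch of that first problem, using made-up column lists (the package names and summaries here are invented):

```python
from itertools import zip_longest

# hypothetical column-by-column scrape where one summary cell was missing
names = ['isort', 'numpy', 'pandas']
summaries = ['sorts imports', 'data frames']  # numpy's summary is missing!

# zip stops at the shortest list *and* pairs pandas' summary with numpy
zipped = list(zip(names, summaries))
print(zipped)

# zip_longest at least makes the length mismatch visible
padded = list(zip_longest(names, summaries, fillvalue=None))
print(padded)
```

neither version can tell *which* row lost its cell, so every record after the gap is silently misaligned. parsing row by row avoids that.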

second, that approach actually throws away quite a bit of information. every cell element within *any one row* of the table is actually a *child* of a table row element (`tr`).

we could use a `css` selector expression to get the 50 `tr` table row elements of this table, and then iterate through the elements (`td`) that are children of those rows.

let's start by building an expression to get the table rows. this is helped a great deal by the fact that we have an `id` (globally unique in an `html` document!) identifying our table for us. we can then select any `tr` element underneath that table

```css
table#repo-packages-table tr
```

In [None]:
rows = root.cssselect('table#repo-packages-table tr')
print('we found {} rows'.format(len(rows)))

notice anything weird?

that `css` selection grabbed 51 rows, but we were only supposed to have 50. our "any descendant" selector picked up the header as well.

not cool!

let's be more specific

In [None]:
rows = root.cssselect('table#repo-packages-table > tbody > tr')
print('we found {} rows'.format(len(rows)))

row0 = rows[0]
row0

In [None]:
print(lxml.html.tostring(row0).decode())

this `tr` "table row" element has four `td` children -- one for each column in the table

In [None]:
row0.getchildren()

let's unpack this row and parse out the info we want from each element

In [None]:
packageTd, accessTd, summaryTd, updatedTd = row0.getchildren()

for `packageTd`, let's get the link to and name of the package.

In [None]:
packageTd

In [None]:
a = packageTd.cssselect('a')[0]
print(a)

a.attrib

In [None]:
packagelink = packageTd.cssselect('a')[0].attrib['href']
packagelink

as for the name, that's the `text` attribute on the `span` (we saw this up above).

In [None]:
packagename = packageTd.cssselect('a > span')[0].text
packagename

the other three are much easier -- each is simply a `td` element with a text item we'd like to pull out

In [None]:
access = accessTd.text
summary = summaryTd.text
updated = updatedTd.text

access, summary, updated

the extra whitespace on the `summary` is annoying -- we can `strip` that off. also, let's parse the `updated` date string into a datetime object

In [None]:
import datetime

access = accessTd.text
summary = summaryTd.text.strip()
updated = datetime.datetime.strptime(updatedTd.text, '%Y-%m-%d')

access, summary, updated
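
`strptime` raises a `ValueError` when a string doesn't match the format, and real tables are not always tidy. a minimal sketch of a more forgiving helper follows; the name `parse_updated` is hypothetical and the choice to return `None` on failure is an assumption about what you'd want.

```python
import datetime


def parse_updated(text, fmt='%Y-%m-%d'):
    """parse a date string, returning None instead of raising on bad input"""
    if text is None:
        return None
    try:
        return datetime.datetime.strptime(text.strip(), fmt)
    except ValueError:
        return None


print(parse_updated(' 2020-10-04 '))
print(parse_updated('yesterday'))
print(parse_updated(None))
```

returning `None` keeps one odd row from stopping the whole scrape, at the cost of having to check for missing dates afterward.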

let's unpack these four elements all in one place and put them together into a dictionary for this record

In [None]:
packageTd, accessTd, summaryTd, updatedTd = row0.getchildren()
d = {
    'packagelink': packageTd.cssselect('a')[0].attrib['href'],
    'packagename': packageTd.cssselect('a > span')[0].text,
    'access': accessTd.text,
    'summary': summaryTd.text.strip(),
    'updated': datetime.datetime.strptime(updatedTd.text, '%Y-%m-%d'),
}
d

let's turn that into a function we can apply to all of the rows in our parsed table

In [None]:
def parse_row(elem):
    packageTd, accessTd, summaryTd, updatedTd = elem.getchildren()
    d = {'packagelink': packageTd.cssselect('a')[0].attrib['href'],
         'packagename': packageTd.cssselect('a > span')[0].text,
         'access': accessTd.text,
         'summary': summaryTd.text.strip(),
         'updated': datetime.datetime.strptime(updatedTd.text, '%Y-%m-%d'), }
    return d

and finally, we can take our list of `tr` elements we called `rows` from wayyyyyyyy back and use a list comprehension to parse each row in that list to a dictionary of useful information. in fact, it'll be a list of dictionaries -- let's just toss it into a `pandas` dataframe while we're at it

In [None]:
import pandas as pd

rows = root.cssselect('table#repo-packages-table tbody tr')
packageinfo = [parse_row(row) for row in rows]
dfpackage = pd.DataFrame(packageinfo)
dfpackage.head()

from end to end, then, we have the following `python` code to parse this entire table:

In [None]:
import datetime

import lxml.html
import pandas as pd
import requests


def parse_row(elem):
    packageTd, accessTd, summaryTd, updatedTd = elem.getchildren()
    d = {'packagelink': packageTd.cssselect('a')[0].attrib['href'],
         'packagename': packageTd.cssselect('a > span')[0].text,
         'access': accessTd.text,
         'summary': summaryTd.text.strip(),
         'updated': datetime.datetime.strptime(updatedTd.text, '%Y-%m-%d'), }
    return d

def get_packages():
    response = requests.get(url='https://anaconda.org/anaconda/repo')
    root = lxml.html.fromstring(response.text)
    return pd.DataFrame([
        parse_row(row)
        for row in root.cssselect('table#repo-packages-table tbody tr')])

In [None]:
dfpackage = get_packages()
dfpackage.head()

**<div align="center">PAUSE FOR ZOOM BREAK</div>**

#### `xpath` selection with `lxml` [advanced]

anything we can do with `css` selectors we can also do with `xpath` -- plus, the `lxml` library is specifically built for `xpath` search, so we have some (very minor) shortcuts at our disposal.

just like `lxml.html` objects have a `.cssselect` method for `css` selection queries, they also have a `.xpath` method for `xpath` queries.

*note: we are re-using the `lxml.html` object from above*

by definition, all `xpath` expressions executed this way are relative to the particular object whose method we are using. this means that while we could search for `div` elements anywhere in a document with

```python
root.cssselect('div')
```

the `xpath` searching is more explicit -- `root.xpath('div')` would only find `div` elements that were **immediate children** of the `root` element.

In [None]:
# root *does* have a body element child:
root.xpath('body')

In [None]:
# root *does not* have a div element child:
root.xpath('div')
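
the same behavior on a tiny made-up document, so the difference between a bare element name and a leading `.//` is easy to see (the document string is invented for this sketch):

```python
import lxml.html

# a tiny made-up document: the only div is nested inside body
doc = lxml.html.fromstring(
    '<html><head><title>t</title></head>'
    '<body><div><p>hello</p></div></body></html>')

# 'div' alone means "div elements that are direct children of this element"
direct = doc.xpath('div')

# './/div' means "div elements anywhere below this element"
anywhere = doc.xpath('.//div')

# the same relative rule applies from any element, e.g. body
body = doc.xpath('body')[0]
from_body = body.xpath('div')

print(len(direct), len(anywhere), len(from_body))
```
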

let's take the full `xpath` to the "Package Name" elements we developed in the `xpath` mini-exercise above:

```
body/div/div/div/div/div/div/div/div/div/table/tbody/tr/td/a/span
```

and find elements at this `xpath` using the `xpath` method, one element at a time

In [None]:
# build it up one element at a time (body, then body/div, ...) until we
# reach the full path
root.xpath('body/div/div/div/div/div/div/div/div/div/table/tbody/tr/td/a/span')

note that `tbody` element in the path above. it turns out that modern browsers add the `tbody` element in there whether it's in the source code or not.

this means that we can often end up in a situation where `tbody` is what we see in the chrome devtools, for example, but there's actually no `tbody` element in the `lxml` object.

why is still mostly a mystery to me, and [I'm comforted by the fact that I'm not alone](https://stackoverflow.com/questions/27918086/why-tbody-will-be-added-automatically-by-browser).
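
a small sketch of that mismatch, using a made-up table whose source has no `tbody` tag. my understanding is that `lxml` keeps the markup as written here, so a path that insists on `tbody` comes back empty:

```python
import lxml.html

# made-up table source with *no* tbody tag, as a server might send it
table = lxml.html.fromstring(
    '<table id="t"><tr><td>a</td></tr><tr><td>b</td></tr></table>')

# lxml keeps the markup as written, so there is no tbody to find
with_tbody = table.xpath('.//tbody/tr')

# './/tr' matches the rows whether or not a tbody sits in between
any_depth = table.xpath('.//tr')

print(len(with_tbody), len(any_depth))
```

if a path copied from the browser returns nothing, dropping `tbody` (or using `//tr` below the table) is a reasonable first thing to try.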

In [None]:
root.xpath('body/div/div/div/div/div/div/div/div/div/table/tbody/tr/td/a/span')

SO USEFUL, RIGHT?

this absolute path is tremendous overkill, of course. we saw in the `css` selectors section and using selector gadget that we could identify package name spans by class -- we can do that in `xpath` as well

```
//span[@class="packageName"]
│ │    └─has 'class' attr with value "packageName"
│ └─is a span element
└─is at arbitrary depth in the tree below root
```

In [None]:
packageElems = root.xpath('//span[@class="packageName"]')
print('we have {} spans'.format(len(packageElems)))

pkg0 = packageElems[0]
print(lxml.html.tostring(pkg0).decode())

same elements with about 1/80th of the work. cool.
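
one caveat worth a sketch: `@class="packageName"` compares the *whole* attribute string, while the `css` selector `.packageName` matches any element that lists that class among several. the snippet below is made up, and the longer expression is a commonly used `xpath` 1.0 idiom rather than anything specific to this page:

```python
import lxml.html

# made-up snippet: the second span carries two classes
doc = lxml.html.fromstring(
    '<div>'
    '<span class="packageName">isort</span>'
    '<span class="packageName bold">numpy</span>'
    '</div>')

# exact string comparison: only matches class="packageName"
exact = doc.xpath('.//span[@class="packageName"]')

# token comparison: pad with spaces, then look for " packageName "
token = doc.xpath(
    './/span[contains(concat(" ", normalize-space(@class), " "), '
    '" packageName ")]')

print([e.text for e in exact], [e.text for e in token])
```

on the `anaconda` page the simple form is enough, since those spans appear to carry a single class.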

at this point we can directly reproduce the table parsing code from above, replacing all `cssselect` statements with `xpath` statements. the selection of the table rows goes from the `css` selector

```css
table#repo-packages-table tbody tr
```

to `xpath` expression

```
.//table[@id="repo-packages-table"]/tbody/tr
```

In [None]:
rows = root.xpath('.//table[@id="repo-packages-table"]/tbody/tr')
print('we found {} rows'.format(len(rows)))

one additional perk of `lxml.html` objects is a special method `.find()` which takes a simple path expression (`ElementPath`, a subset of `xpath`, and *not* a `css` selector) and returns the very first match -- no need to get a list and select the `[0]`th element. additionally, this will return `None` if no match is found, whereas grabbing the 0th element of an empty list will raise an `IndexError`

In [None]:
# find is the same as getting the first matching element
assert root.find('.//div') == root.xpath('.//div')[0]
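
a short sketch of the `None` versus `IndexError` difference, on a made-up table row where only one cell contains a link:

```python
import lxml.html

# made-up cells: one td has a link, the other does not
row = lxml.html.fromstring(
    '<table><tr><td><a href="/pkg">pkg</a></td><td>no link</td></tr></table>')
first_td, second_td = row.xpath('.//td')

# find returns the element, or None when nothing matches
link = first_td.find('a')
missing = second_td.find('a')

# the list-based version has to be guarded against an empty list
try:
    second_td.xpath('a')[0]
    raised = False
except IndexError:
    raised = True

# compare against None explicitly rather than relying on truthiness
href = link.attrib['href'] if link is not None else None
print(href, missing, raised)
```
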

this simplifies the link and name parsing in the package element:

In [None]:
# packagelink = packageTd.xpath('a')[0].attrib['href']
packagelink = packageTd.find('a').attrib['href']
packagelink

In [None]:
packagename = packageTd.find('a/span').text
packagename

altogether, we get identical behavior with some slight tweaks to the `parse_row` and `get_packages` functions:

In [None]:
def parse_row(elem):
    packageTd, accessTd, summaryTd, updatedTd = elem.getchildren()
    return {'packagelink': packageTd.find('a').attrib['href'],
            'packagename': packageTd.find('a/span').text,
            'access': accessTd.text,
            'summary': summaryTd.text.strip(),
            'updated': datetime.datetime.strptime(
                updatedTd.text, '%Y-%m-%d'), }

def get_packages():
    response = requests.get(url='https://anaconda.org/anaconda/repo')
    root = lxml.html.fromstring(response.text)
    return pd.DataFrame([
        parse_row(row)
        for row in root.xpath('//table[@id="repo-packages-table"]/tbody/tr')])

In [None]:
dfpackage = get_packages()
dfpackage.head()

**<div align="center">PAUSE FOR ZOOM BREAK</div>**

### a great alternative: beautiful soup

in the last section we covered parsing `html` documents using the base `lxml.html` library and either the `css` selector or `xpath` query language. that method is considered pretty low-level -- you will need to invest a lot of overhead in doing simple common tasks, but the basic tools are all there.

similar to how some plotting libraries (e.g. `seaborn`, `plotly.express`) exist as smarter, more convenient wrappers to lower-level plotting libraries (e.g. `matplotlib`, base `plotly`), there is a very useful wrapper library for parsing `html` called [`beautifulsoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). we installed this package (called `beautifulsoup4` on `pypi` / `conda` repos) earlier

consider this library as a more user-friendly way of doing `html` parsing

let's start by creating a `soup` object from `html` text:

In [None]:
from bs4 import BeautifulSoup

response = requests.get(url='https://anaconda.org/anaconda/repo')
# name the parser explicitly; bs4 warns when it has to guess
soup = BeautifulSoup(response.text, 'lxml')

In [None]:
print(soup.a.prettify()[:1000])

first of all, we can do all of the `css` selector statements from above exactly as before, this time via the `select` method

In [None]:
trs = soup.select('table#repo-packages-table tbody tr')
print('we found {} rows\n'.format(len(trs)))
tr0 = trs[0]
print(type(tr0))

the objects we are picking out are not `lxml.html` elements any more but now `bs4.element.Tag` objects. they have better (usually preferable) behaviors. for example, when we print them to the screen they will actually show us the tag contents:

In [None]:
tr0

compare this to prior behavior for `lxml.html` objects:

In [None]:
root

in addition, the `soup` object has a `find_all` method which accepts a flexible set of inputs for defining elements.

the first of those inputs is the `name` argument, and can be any one of what the author calls ["filters"](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-filters), and there are a few types:

a string: search for *element tags* that exactly match the provided string

In [None]:
soup.find_all('span')

a regular expression: search for *element tags* that match a regular expression

In [None]:
import re

soup.find_all(re.compile(r'spa\w'))

a list: take a list of elements and find all elements that match any of them

In [None]:
len(soup.find_all(['span', 'a']))

additional options are the value `True` (a constant filter which matches every tag) and a function which takes a `tag` element and returns `True` or `False` (check the docs for details)

once `find_all` has filtered down the *tags* using the above filters as the value of the parameter `name`, we can go one step further and filter based on the element *attributes* by passing them as keyword arguments. for example

In [None]:
# class is special -- it's reserved in `python`, so if we are looking for
# elements such that `class="packageName"` in `html`, add the underscore
soup.find_all('span', class_="packageName")

the same goes for `id` values, `href` values, etc. in particular, this allows you to **use regular expressions** when matching the values of things -- e.g. you can find all the elements which contain a certain `class_` value, or all the `href` values which are external.
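
a minimal sketch of that attribute filtering on a made-up snippet (the links and class names are invented). it uses `python`'s built-in `html.parser` backend so it doesn't depend on the page above:

```python
import re
from bs4 import BeautifulSoup

# made-up snippet with one internal and two external links
snippet = '''
<a class="nav link" href="/anaconda/numpy">numpy</a>
<a class="link" href="https://example.com/docs">docs</a>
<a href="http://example.org">other</a>
'''
soup = BeautifulSoup(snippet, 'html.parser')

# href values that start with http:// or https:// are external
external = soup.find_all('a', href=re.compile(r'^https?://'))

# class_ matches against each individual class, so "link" finds both
linked = soup.find_all('a', class_='link')

print([a['href'] for a in external], [a.text for a in linked])
```
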

**<div align="center">PAUSE FOR ZOOM BREAK</div>**

## `POST` requests

the entire long prior section was about how to handle the response message for a `GET` request when the body of the `response` is `json` or `html`. the actual `GET` request itself was very easy to build in `python`. parsing a `json` response was also pretty easy -- almost all of the complication above came from parsing the `response` when it was `html`.

`GET` isn't our only method option for sending `request`s, though -- we can also `POST`. while a `GET` request implies that the server has some data and we are simply asking to read that data, a `POST` message could mean a few things:

+ we are actively creating something (`POST`ing a message to a message board, e.g.)
+ we are asking for some data, but our parameters for that `api` won't be passed as regular `api` endpoint parameters (e.g. they are too complex, or the `api` must remain the same regardless of parameters)
    + this is how almost all `html` forms work, e.g.

sometimes a `POST` request is basically the same as a `GET` request -- point at a `url` and yell `POST` instead of `GET` and you're done.

generally speaking, though, we are *also* sending data along with our `POST` request

far and away, the most common use case for `POST` requests is submitting data via `html` forms (e.g. login forms, uploads, submission forms).

as a quick and simple example of a `POST` request, let's check out the submission api for [github gist](https://gist.github.com/RZachLamberty) (a place to host small snippets of code instead of full repos):

https://developer.github.com/v3/gists/#create-a-gist

so, we have the ability to create a new gist via an automated `POST` message. let's do that!

our first task is to build the data object for that gist. it looks like a dictionary, so let's just see if that works.

In [None]:
data = {
    'description': 'a test post of a gist',
    'public': True,
    'files': {
        'my_test_file.py': {
            'content': "print('hello world')"
        }
    }
}

let's try and post this information using the `requests.post` function. we could either convert the `data` element to `json` using the `python json.dumps` function, or we could let `requests` do it for us via the `json` parameter.

In [None]:
resp = requests.post(url="https://api.github.com/gists",
                     json=data)
resp.status_code

dang. 400s are no good. we did receive a `json` response, for what it's worth:

In [None]:
resp.json()
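
to see what the `json` parameter does with our dictionary compared to calling `json.dumps` ourselves, here is a sketch that *prepares* both versions of the request without sending anything. the variable names `prepared` and `manual` are just for illustration.

```python
import json
import requests

data = {
    'description': 'a test post of a gist',
    'public': True,
    'files': {'my_test_file.py': {'content': "print('hello world')"}},
}

# build the request without sending it, so we can look inside
prepared = requests.Request(
    method='POST', url='https://api.github.com/gists', json=data).prepare()

# requests serialized the dict and labelled the body as json for us
print(prepared.headers['Content-Type'])
print(prepared.body)

# the manual equivalent: dump to a string and set the header ourselves
manual = requests.Request(
    method='POST', url='https://api.github.com/gists',
    data=json.dumps(data),
    headers={'Content-Type': 'application/json'}).prepare()

print(json.loads(prepared.body) == json.loads(manual.body))
```

either route carries the same payload; the `json` parameter just saves a step and a header.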

### once more, with ~~feeling~~ authentication

fundamentally, a `POST` request is allowing users to add some data to some data source. it is rare for a site to allow a `POST` request without also requesting that we authenticate

a paradigm we will repeat often in the remainder of the class -- one that is ubiquitous in web communication -- is the idea of creating something called a `session`.

a `session` is a persistent connection with some other computer which serves a number of purposes, but in the web client context it mainly serves to cache authentication and preferred behavior.

we are going to add our `github` credentials to a re-usable `session` object available directly in the `requests` library (and almost any other web request library).

In [None]:
# check out `auth` below
help(requests.Request)

github gists support this as well, so let's do it! enter your user name and password to `username` and `pw` (I will use `getpass` here so I can type mine and not save it forever in a `notebook`, which would *not* be cool)

In [None]:
import getpass

username = 'rzl5'
pw = getpass.getpass('Password: ')

and now, just use that user name and password directly in the call to the github `POST` function

In [None]:
resp = requests.post(url="https://api.github.com/gists",
                     json=data,
                     auth=(username, pw))
resp.status_code

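note: you may still get a 401 if you have enabled multi-factor authentication. if so, good for you, but also, sucks for you, because I won't cover that just now. also note that the `github` `api` has since stopped accepting account passwords for requests like this one; a personal access token (with the `gist` scope) entered at the `getpass` prompt in place of your password is the usual substitute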

and our results (click on the link!):

In [None]:
j = resp.json()
print(j['html_url'])
j

it's great that that worked for us, but imagine if you had to program hundreds of similar but slightly different requests. what if you don't want to pass your credentials to each one of them?

In [None]:
session = requests.Session()
session.auth = (username, pw)

# note *session* below, not requests.post
resp = session.post(url="https://api.github.com/gists",
                    json=data)

resp.status_code

In [None]:
j = resp.json()
print(j['html_url'])
j
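
if you're curious what the `session` actually adds to each request, this sketch prepares a request through a session (without sending it) and inspects the header. the credentials are placeholders, not real ones.

```python
import base64
import requests

# placeholder credentials, purely for illustration
session = requests.Session()
session.auth = ('some_user', 'some_password')

# prepare (but do not send) a request through the session
request = requests.Request(method='GET', url='https://api.github.com/gists')
prepared = session.prepare_request(request)

# the session attached an Authorization header on our behalf
scheme, token = prepared.headers['Authorization'].split(' ')
decoded = base64.b64decode(token).decode()
print(scheme, decoded)
```

notice that basic authentication is only base64 *encoded*, not encrypted, which is a good reason to send it over `https` only and to keep the password out of your notebook.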

## using browser developer tools

when doing web scraping, the modern browser developer tools can be pretty much invaluable. I'm going to do a *very* cursory summary of the most important tricks to know when using developer tools. [the full documentation](https://developers.google.com/web/tools/chrome-devtools/) is excellent.

remember the main goal: finding elements that have the data we want, or a static endpoint that contains those elements in a more compact form

**<div align="center">walkthrough: google chrome "Inspect" mode</div>**

let's check out the use of the chrome developer tools on [the `anaconda` repo page](https://anaconda.org/anaconda/repo).

*note to self: exit presentation mode to get list of things to cover in notes slide*

the walkthrough covered the following topics, broken down by tab name in the "Inspect" mode dashboard

+ elements
    + main window
        + this is a DOM explorer, allowing you to search for elements, expand and collapse, etc
        + selecting a line item highlights the corresponding rendered space
            + this is nested, so you can use this to find which element does a thing, or...
    + find (`Ctrl + f`)
        + search is done using `css` selectors or `xpath`
    + bottom "breadcrumbs" banner
        + this can be used to find the full `xpath` expression or the individual `css` selector expressions
    + the "element selector" button (top left of the Inspect menu)
        + can be used to point and click select
    + element properties menu (right panel)
        + generally, more important for identifying `css` properties
        + let's edit a value, and click on a page to identify the source of that value
+ console
    + this is a `javascript` console
        + this is an interactive shell for executing `javascript` code
        + it has already effectively loaded *all* of the `javascript` code that was used by your browser to build and render this webpage
    + example: `console.log('hello world')`
    + it will display pretty frequent error messages
+ sources
    + many files are used to build a webpage
        + the contents returned by the single `GET` of the first `url` will include instructions on how to acquire many other files (*e.g.* `javascript` files, `css` files, other `url`s)
    + this menu lists them
    + files are grouped by
        + the domain that sent them
        + the path in the url
        + a common construct is to have `css`, `js`, and `img` directories to separate those files by type
    + most files are "minified"
        + white space is removed to make transfer faster
        + you can have chrome add that white spacing back in
            + click the `{}` character in the bottom-left corner of the display window
+ network
    + one use case: seeing what files are sent when
    + a better use case: looking at the content of the individual files we pulled in
        + many times, a *complicated* webpage will be built out of *much simpler* and *more programmatically friendly* data (e.g. simple `json` objects)
    + top panel: filter by file type
        + this is a coarse filter and not guaranteed
        + it's by extension -- some `javascript` or `json` requests will not have `js` as an extension, so it's not perfect
    + bottom-left panel: file name
    + bottom-right panel: `request` details
        + this is the real bread and butter!
        + [`headers`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers)
            + `headers` are packets of information that are sent along with the actual `request` or `response` (basically, meta-data about the request itself)
                + example: the data type of the content we're requesting or responding with (`json`)
            + this panel contains the actual structure of the actual `request` that was made
            + subsections:
                + general
                    + these are the most basic details about the `request` which was actually made and the status of the response
                + response headers
                    + these are the `header` elements of the `response` to our browser's request
                + request headers
                    + the `header` our browser sent along with our original `request`
                    + often some sub-section of these are *required* by the responding server for your content to be received
                + query string parameters
                    + if the `url` contained a query string (`?key1=val1&key2=val2`, etc.), it's parsed here
                    + can be "unparsed" as well
            + if you want *just this piece* of the full page-building process, this is the `request` you should try to replicate
        + preview
            + mostly used for rendering images
        + response
            + this is the raw content of the actual response (*i.e.* usually what you're looking for)
    + back to the file name panel
        + right click one of the items
        + "copy >"
        + check out the various options, including `curl` statements!
    + finally, *reload* the page to see the entire request stream re-built
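
once you've copied a request `url` out of the network tab, a small sketch like the following splits it back into the base `url` and the query string parameters that the headers panel shows. the `url` here is made up.

```python
from urllib.parse import parse_qsl, urlsplit

# a made-up request url of the kind shown under "general" in the headers tab
request_url = 'https://example.com/api/search?key1=val1&key2=val%202&page=3'

parts = urlsplit(request_url)
base_url = '{}://{}{}'.format(parts.scheme, parts.netloc, parts.path)
params = dict(parse_qsl(parts.query))

print(base_url)
print(params)
```

from there, `requests.get(base_url, params=params)` (plus whichever request headers turn out to be required) is usually a reasonable first attempt at replicating what the browser did.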

**<div align="center">walkthrough: a practical application</div>**

let's open [the Eversource power company's outage reporting map](https://outagemap.eversource.com/external/default.html) and use some of the developer tools to find the data in the table of current outage information

for the walkthrough, do the following

+ click on the "Customer Outages > Connecticut" button on the left nav
+ observe that outage stats are in a table (e.g. for city "Avon")
+ open the inspect panel
+ (re)load [the outage map](https://outagemap.eversource.com/external/default.html)
+ explore
    + look for the current outage information (search for "Avon") -- should not find it!
    + click the "Customer Outages > Connecticut" button
+ in the "region report" popup, select the table row elements
    + `css` selector: `table#report-panel-conn-table tr`
    + `xpath` selector: `//table[@id="report-panel-conn-table"]//tr`
    + copy one of these
+ try this in python with the code in the cell below

In [None]:
resp = requests.get('https://outagemap.eversource.com/external/default.html')
root = lxml.html.fromstring(resp.text)
root.cssselect('table#report-panel-conn-table tr')

+ it didn't work -- why?
+ reloading the page
    + maybe the data is not in the *original* request, but is loaded afterward. it is *built* by the page
    + reload the webpage and search for that same `xpath` or `css select` statement rapidly
    + observe: they are *not found* when the page starts to render, but then are found after we click the Connecticut link
        + this screams [ajax](https://developer.mozilla.org/en-US/docs/AJAX/Getting_Started) (Asynchronous Javascript And Xml) -- the data doesn't show up until we click a button, so the browser was probably sent that data in a separate request after the click
    + maybe we can see the request that obtained those elements?
+ Inspect > Network tab
    + reload the page
    + clear the previous requests
    + check the "preserve log" box
    + hit the "Customer Outages > Connecticut" button
    + click around through those files
        + try limiting to `js` first, on a whim
            + no dice
        + try just `xhr`
            + awwwwwwwwww yisssssssssss
            + note the request url: https://outagemap.eversource.com/resources/data/external/interval_generation_data/YYYY_mm_dd_HH_MM_SS/report_conn.json
            + and the request method: `GET`
        + try the code below

In [None]:
# update this in class with current value
url = ('https://outagemap.eversource.com/resources/data/external/interval_generation_data/'
       '2020_10_04_22_29_30'
       '/report_conn.json')
resp = requests.get(url)
j = resp.json()
j

it would be reasonable here to ask how we would automate this, if I need to get that date string `YYYY_mm_dd_HH_MM_SS` every time. fortunately, if you let the inspect tools run long enough, you will see that there is a second `api` endpoint constantly getting polled:

In [None]:
metadata_url = 'https://outagemap.eversource.com/resources/data/external/interval_generation_data/metadata.json'
metadata_j = requests.get(metadata_url).json()
metadata_j

so, all together, we can repeatably scrape this table's values with

In [None]:
metadata_url = 'https://outagemap.eversource.com/resources/data/external/interval_generation_data/metadata.json'
directory = requests.get(metadata_url).json()['directory']

connecticut_url = 'https://outagemap.eversource.com/resources/data/external/interval_generation_data/{}/report_conn.json'
connecticut_url = connecticut_url.format(directory)

j = requests.get(connecticut_url).json()
df = pd.DataFrame(j['file_data']['areas'][0]['areas'][0]['areas'][0]['areas'])
df.head()
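
the chain of `['areas'][0]` lookups above only follows the first branch at each level. as a hedged alternative, here is a sketch of a recursive helper that collects every innermost record. the helper name `leaf_areas` and the sample payload (including its `name` and `out` fields) are made up, and it assumes the innermost records have no `areas` key of their own.

```python
def leaf_areas(node):
    """yield every dict under nested 'areas' lists that has no 'areas' itself"""
    children = node.get('areas')
    if not children:
        yield node
        return
    for child in children:
        yield from leaf_areas(child)


# a made-up payload with the same nesting pattern as the indexing above;
# the field names other than 'areas' are invented for illustration
payload = {
    'areas': [{
        'areas': [
            {'areas': [{'name': 'town a', 'out': 3},
                       {'name': 'town b', 'out': 0}]},
            {'areas': [{'name': 'town c', 'out': 7}]},
        ],
    }],
}

rows = list(leaf_areas(payload))
print([row['name'] for row in rows])
```

if that assumption holds for the real response, `pd.DataFrame(list(leaf_areas(j['file_data'])))` would gather the rows from every branch rather than just the first one.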

## `javascript` engines and `selenium` webdrivers

in the previous example we experienced something pretty tricky -- an `ajax` (Asynchronous Javascript And Xml) request.

the response to our initial request contained instructions on how to *keep* building the webpage, and our browser knew how to take those instructions and turn them into additional `requests` (smart little cookie)

because of the way that the developers who put together the Eversource site constructed their webpage, we got lucky -- we were able to find a *single, `static` url* which had all the content we need in one go. it's not always that simple, though.

sometimes it's just not possible to get the information you need without running some `javascript` code (just like your smart little internet browser does).

when this is required, we must do something more complicated than our previous simple `python` requests -- we must use a `javascript engine` like those used in our web browsers.

we can use `webdrivers` -- programs which interface with browsers on our behalf -- to do this.

there are many packages for running `webdrivers` in multiple languages, but the most common (across languages) is [`selenium`](http://www.seleniumhq.org/)

you can think of it like this: people write code in various languages (e.g. `python`) that use the `selenium webdriver` to create and then interact with a browser's `javascript engine`

### setup

#### installing `selenium`, a web browser, and a webdriver

we need:

1. the `python selenium` library
2. a web browser
3. a web driver

as always, the first question: am I installing this *locally* or *on my `ec2` server*?

the answer: it depends. we can make both work.

+ on both local and remote, the install steps are basically the same
+ *if* we go with *remote*, we have an additional complication when we want to *use* selenium (more on that later)

you do you. I will do the easier *local* setup in class (just to show the basics of how `selenium` works). the steps involved in doing the harder *remote* setup are in the lecture notes below.

**<div align="center">walkthrough: installing `selenium` parts (either *local* or *remote*)</div>**

1. open a terminal
1. install `selenium`
    1. activate some `conda` environment (e.g. the `scrapesville` one we've been using)
    1. run `pip install selenium`
1. download and unpack [the webdriver for your chosen browser](http://selenium-python.readthedocs.io/installation.html#drivers)
    1. you will likely have to pick one, download (`wget`) a `zip` or `tar`ball, and `unzip` or `tar -xvzf` it
    1. note the path where you saved it
    1. there are some complications with `chrome` at this time, so I'd recommend using `firefox`
1. if needed, install a browser (none is installed by default in `ec2`)

in my (local) case:

```bash
conda activate scrapesville
pip install selenium

mkdir -p ~/selenium_example
cd ~/selenium_example
# mac:
wget https://github.com/mozilla/geckodriver/releases/download/v0.27.0/geckodriver-v0.27.0-macos.tar.gz
tar -xvzf geckodriver*
ls -alh ~/selenium_example/

# finally, install firefox -- chrome is actually pretty annoying for this example
# mac:
# install by hand
# linux:
# sudo apt install firefox
```

so we now should have access to the `selenium python` code library, an executable `webdriver`, and a browser that `webdriver` can drive.
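
before moving on, it can be worth confirming the first two pieces are really in place. this is a rough sketch using only the standard library; `check_selenium_setup` is a made-up helper name, and the path is the `~/selenium_example/geckodriver` location from my local example (yours may differ). it does not try to find the browser, since where that lives depends heavily on your operating system.

```python
import importlib.util
import os


def check_selenium_setup(driver_path):
    """report whether selenium is importable and the webdriver file is executable"""
    return {
        # find_spec returns None when the package is not installed in this environment
        'selenium_library': importlib.util.find_spec('selenium') is not None,
        # the driver must exist as a file *and* carry the executable bit
        'webdriver_executable': (os.path.isfile(driver_path)
                                 and os.access(driver_path, os.X_OK)),
    }


fdriver = os.path.join(os.path.expanduser('~'), 'selenium_example', 'geckodriver')
print(check_selenium_setup(fdriver))
```

if either value comes back `False`, revisit the matching step in the walkthrough above.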

what can (should) we actually *do* with them?

well, `selenium` was originally constructed not for web scraping but for UAT (user acceptance testing) processes -- we could use it to automatically replicate the experience of a user coming to a webpage, clicking buttons, typing things in fields, etc.

it can be used to create `test`s of a webpage's functionality, ensuring that the webpage behaves as expected.

this means, in particular, that it can *launch an actual browser*, and control it with commands. that's pretty awesome.

in turn, *that* means that we may need to be able to see an actual browser (or launching it will fail). if you're working *locally*, no problem! if you're working *remotely*, though, that's not something we've ever done via the command line yet.

#### `x11` forwarding graphical interfaces (*remote only*)

the general idea here is straightforward: if *you* have some `x11`-speaking service running, and the *remote server* has some `x11`-speaking service running, you can have guis sent over your `ssh` connection from the *remote* to your local laptop.

your `ubuntu` servers already have what they need -- let's install what *we* need

for a full walkthrough, check out [this description from Indiana University](https://uisapp2.iu.edu/confluence-prd/pages/viewpage.action?pageId=280461906). the basics are simple though:

1. mac: download [`xquartz`](https://www.xquartz.org/) and run it
2. windows: download [`xming`](http://sourceforge.net/projects/xming/) and run it
3. linux: you should be ok out of the box. if you see complaints about not having the `xorg` package installed, install it

now that you have an `x11` service running on your machine, you should be able to have `x11` communications with your remote server, and display guis `forward`ed from that server.

for mac and linux users, create an `ssh` connection with `x11` forwarding by entering the *exact same command* you usually would to connect via `ssh` connection, but **add a `-Y` flag** (that's a *capital* `Y`):

```bash
ssh -Y -i /path/to/privatekey username@servername
```

for windows users, enable `x11` forwarding by loading your `putty` connection, navigating to the `connection > ssh > x11` side menu panel, and checking the "enable X11 forwarding" box. after you've made that update, you can save the session if you want.

`x11` forwarding is a somewhat common process, and you will do it *any* time you want to "see" an application running on a remote server. think of it as "remote desktop", but for a single application. it's also the same process for every application, so you only need to set it up and learn it one time.
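
one quick sanity check once you're connected: when forwarding is active, `ssh` sets a `DISPLAY` environment variable on the remote side (typically something like `localhost:10.0`), and graphical programs use it to find your screen. a small sketch of checking for it from `python`, where `x11_display` is a made-up helper name:

```python
import os


def x11_display(environ=os.environ):
    """return the DISPLAY value a graphical program would use, or None if unset"""
    return environ.get('DISPLAY') or None


print(x11_display())
```

if this prints `None` in your remote session, the browser will most likely fail to launch, so reconnect with forwarding enabled before going further.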

##### where are we?

at this point:

1. you should have installed `selenium`
2. you should have downloaded and unzipped a `webdriver`
3. if you're doing all of this remotely
    1. you should have downloaded and started an `x11` service (`xquartz` or `xming`)
    2. you should have made an `ssh` connection with `x11` forwarding enabled (the `-Y` flag, or the `putty` setting)

everybody there?

### an example

let's try an example. suppose [our Eversource example page from before](https://www.eversource.com/clp/outage/outagemap.aspx) hadn't yielded that single convenient endpoint. we could still scrape that table if we wanted to because with `selenium` we can do *literally anything* we could do with a browser

let's start with a few simple `python` commands just to get things up and running.

first, we will need to know or remember *where* on our filesystem we saved that `webdriver` file -- we will need to be able to point to it in order to use it.

I saved mine in a `~/selenium_example` directory

```python
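import os
import pandas as pd
import selenium.webdriver
from selenium.webdriver.firefox.service import Service

fdriver = os.path.join(os.path.expanduser('~'),
                       'selenium_example',
                       'geckodriver')

# you may use Firefox, or Chrome, or Edge, or whatever
# selenium 4+ takes the driver path via a Service object
# (selenium 3 used `Firefox(executable_path=fdriver)` instead)
driver = selenium.webdriver.Firefox(service=Service(executable_path=fdriver))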
```

woah!

first, that `driver = ...` command actually launched our browser. cool.

did you notice, though, that the `python` session waited until the browser seemed done rendering? it does that. for everything we do, actually. which is good -- we are waiting until the web browser tells us it's done working on the current request.

let's try getting our eversource url:

```python
driver.get('https://outagemap.eversource.com/external/default.html')
driver.current_url
```

the `driver` item we have now has the ability to do all the things we might want to do -- including:

1. clicking on elements
2. typing (sending keystrokes)
3. dragging, dropping, or highlighting

the way it does all of the above is by

1. selecting an element it wishes to interact with (e.g. a text box we would type in)
2. using `send_keys`, `click`, or other member functions on that element

let's try an example -- we know we want to click on that table popup here. we need a `css` selector or `xpath` to isolate that clickable menu element

look for that information in the `inspector` mode of firefox, chrome, etc.

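once you've found an `{xpath, css}`, try and select that element with `driver.find_element(By.XPATH, ...)` or `driver.find_element(By.CSS_SELECTOR, ...)`

if that works, try `click`ing that element and see what happens!

```python
from selenium.webdriver.common.by import By

# older selenium releases spelled this `driver.find_element_by_xpath(...)`
menulink = driver.find_element(By.XPATH, './/span[@id="menu-summary"]')
menulink
menulink.click()
```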

so let's keep repeating this process until we've done all the steps we want:

1. click on the "View Outage Report" link
    1. hint: `driver.find_element` accepts `By.ID` as its lookup strategy
2. select all the elements in the table

```python
from selenium.webdriver.common.by import By

# older selenium releases spelled these `find_element_by_id(...)`, etc.
outagelink = driver.find_element(By.ID, 'view-summary-conn')
outagelink.click()

# each `tr.level2` row holds four `td` cells, one per column
outageInfo = []
for row in driver.find_elements(By.CSS_SELECTOR, 'tr.level2'):
    town, cust, aff, pct = row.find_elements(By.TAG_NAME, 'td')
    outageInfo.append({'town_name': town.text,
                       'customers': cust.text,
                       'customers_affected': aff.text,
                       'pct_customers_affected': pct.text, })

df = pd.DataFrame(outageInfo)
df.head()
```
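
one wrinkle: the table is filled in by `javascript` after the click, so if that loop ever comes back empty, the rows may simply not exist yet. a minimal sketch of a "keep asking until it shows up" helper follows. `wait_for_rows` is a made-up name and the timing defaults are arbitrary; it takes any zero-argument function that performs the row lookup from the block above.

```python
import time


def wait_for_rows(find_rows, timeout=10.0, interval=0.5):
    """call find_rows() until it returns something non-empty or timeout seconds pass"""
    deadline = time.monotonic() + timeout
    while True:
        rows = find_rows()
        if rows:
            return rows
        if time.monotonic() >= deadline:
            raise TimeoutError('no rows found after {} seconds'.format(timeout))
        # give the page a moment before asking again
        time.sleep(interval)
```

`selenium` ships its own, more complete version of this idea as `selenium.webdriver.support.ui.WebDriverWait`, which is probably the better choice for anything beyond a quick experiment.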

now, that's pretty inarguably cool, but in terms of doing something *practical*, I think it has some serious drawbacks:

1. requires external software (a running browser program and driver)
    1. **you don't have to do the `x11` forwarding**
        1. it is possible to run the browser in "headless" mode
        2. this makes the process considerably faster
2. obviously slower than a plain `requests` call (more overhead in rendering, e.g.)
3. more resource (memory) intensive
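
the "headless" mode mentioned in the list above can be sketched like this for firefox. the two helper names are made up for illustration; `-headless` is firefox's own command-line flag. any extra keyword arguments are passed straight through to `selenium.webdriver.Firefox`, so you can hand it the same driver-path argument used when we first launched the browser (with none given, `selenium` has to locate `geckodriver` on its own, e.g. via your `PATH`).

```python
def headless_firefox_options():
    """build firefox options that ask the browser not to open a window"""
    # imported here so that defining the helper doesn't itself require selenium
    import selenium.webdriver

    options = selenium.webdriver.FirefoxOptions()
    options.add_argument('-headless')
    return options


def launch_headless_firefox(**kwargs):
    """start firefox with no visible window; kwargs go to webdriver.Firefox"""
    import selenium.webdriver

    return selenium.webdriver.Firefox(options=headless_firefox_options(), **kwargs)
```

after `driver = launch_headless_firefox()`, everything else (`driver.get`, finding elements, `click`) works as before, there is just nothing to look at. remember to call `driver.quit()` when you are done, since an invisible browser is easy to forget about.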

## final summary

so, in general, the workflow for doing directed webscraping of an identified data source is:

1. search, at least once, for a `json` api (e.g. google search "api my.data.source")
    1. if you find a `json` api, use that api with the standard web requests
2. see if the content is kept in obviously structured `html` in a static request
    1. if it is, `GET` the `html` document with `requests`
    2. convert the `html` to an `lxml.html` object
    3. identify `xpath` or `css` selectors to iterate through data elements and find them on the `lxml.html` object
3. look through the developer tools for a better sub-request
    1. if you find one, treat it like 1 (if it returns `json`) or repeat the steps under 2 (if it returns `html`)
4. look into an automated `javascript` engine approach using `selenium`

<strong><div align="center"><code>REST</code> up</div></strong>
<div align="center"><img src="http://www.softwaresamurai.org/wp-content/uploads/2017/12/RESTfil-API.png" width="400px"></div>

# END OF LECTURE

next lecture: [`aws` `iam`](008_iam.ipynb)