# Web Scraping & Pagination

## EIA Forms 923/906

### Example: getting links from a webpage

### Challenge: get all the `a` tags from EIA 906

Much of the data now collected in EIA 923 was previously collected in EIA 906.

We'll have you work through the scraping steps on the 906 data to get a sense of how this all works.


The first step is to get all the `a` tags from this page:

```
https://www.eia.gov/electricity/data/eia923/eia906u.php
```

Start with this skeleton code:

```python
import bs4
import requests

eia_906_url = "https://www.eia.gov/electricity/data/eia923/eia906u.php"

# get the page contents, then use bs4 to get the `a` tags
```

### Solution
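
One possible approach, sketched against a small inline HTML snippet so that it runs without network access. The `sample_html` string is made up for illustration and is not the real EIA page; for the live page you would pass `requests.get(eia_906_url).text` to `BeautifulSoup` instead.

```python
import bs4

# Hypothetical stand-in for the page contents; not the real EIA markup.
sample_html = """
<html><body>
  <a href="xls/example_2000.xls">2000 data</a>
  <a href="/about/">About</a>
  <a name="top">anchor without href</a>
</body></html>
"""

# Parse the HTML, naming the parser explicitly.
soup = bs4.BeautifulSoup(sample_html, "html.parser")

# find_all("a") returns every <a> tag, whether or not it has an href.
a_tags = soup.find_all("a")

# .get("href") returns None for tags that have no href attribute.
hrefs = [a.get("href") for a in a_tags]
print(hrefs)
```

Note that `find_all("a")` also returns anchors without an `href`, which is why later steps pass `href=True`.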

### Example: filtering tags

### Challenge: only get the relevant `a` tags

Let's grab only the `a` tags that are links to the actual data files.

This will look a little bit different from the 923 example, since the links are all to XLS files instead of ZIP files.

Let's start with this code:

```python
import bs4
import requests

eia_906_url = "https://www.eia.gov/electricity/data/eia923/eia906u.php"
response = requests.get(eia_906_url)
soup = bs4.BeautifulSoup(response.text, "html.parser")  # name the parser explicitly

# get only the `a` tags that are links to Form 906 data files.
```

### Solution
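
One possible approach, sketched under assumptions. The check below treats a link as a data file when the path part of its `href` ends in `.xls` or `.xlsx`, which is slightly stricter than a substring test. The helper name `is_excel_link` and the sample `href` values are hypothetical and are not taken from the EIA page. With Beautiful Soup, the same check would be applied to `a["href"]` for each tag returned by `soup.find_all("a", href=True)`.

```python
from urllib.parse import urlparse


def is_excel_link(href):
    """Return True when the path part of href ends in .xls or .xlsx."""
    # urlparse separates the path from any query string or fragment.
    path = urlparse(href).path.lower()
    return path.endswith((".xls", ".xlsx"))


# Hypothetical href values, for illustration only.
sample_hrefs = [
    "xls/example_2000.XLS",
    "/about/",
    "archive/example.xls?download=1",
    "notes.xlsx.html",
]

excel_hrefs = [href for href in sample_hrefs if is_excel_link(href)]
print(excel_hrefs)
```

Lower-casing the path first means links with an upper-case extension are still matched.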

### Example: downloading data

### Challenge: get the Form 906 file contents

OK, so now we know how to scrape a bunch of URLs from a webpage. Let's read the Form 906 files into our program! Since they're XLS files, `pandas.read_excel` can read them directly from a URL, so there's no need to download them manually with `requests`. The `href` values may be relative to the page, which is why the skeleton imports `urljoin`.

Try filling in the body of the second `for` loop:

```python
from urllib.parse import urljoin

import bs4
import requests
import pandas as pd

eia_906_url = "https://www.eia.gov/electricity/data/eia923/eia906u.php"
response = requests.get(eia_906_url)
soup = bs4.BeautifulSoup(response.text, "html.parser")  # name the parser explicitly

a_tags = soup.find_all("a", href=True)
eia_906_links = []
for a in a_tags:
    if ".xls" in a["href"].lower():
        eia_906_links.append(a)

eia_906_dataframes = []
for a in eia_906_links:
    ___
```

### Solution
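
One possible approach, sketched under assumptions. The function names `resolve_links` and `read_excel_files` and the sample `href` values are hypothetical. Only the URL resolution is exercised here; `read_excel_files` needs network access, and pandas needs the optional `xlrd` package to read legacy `.xls` files.

```python
from urllib.parse import urljoin

eia_906_url = "https://www.eia.gov/electricity/data/eia923/eia906u.php"


def resolve_links(page_url, hrefs):
    """Turn relative hrefs into absolute URLs; absolute URLs pass through."""
    return [urljoin(page_url, href) for href in hrefs]


def read_excel_files(urls):
    """Read each Excel URL into a DataFrame (requires pandas and network access)."""
    import pandas as pd  # imported lazily so resolve_links works without pandas

    return [pd.read_excel(url) for url in urls]


# Hypothetical href values, for illustration only.
sample_hrefs = [
    "xls/example_2000.xls",           # relative to the page's directory
    "/electricity/data/example.xls",  # relative to the site root
    "https://example.com/file.xls",   # already absolute
]

resolved = resolve_links(eia_906_url, sample_hrefs)
for url in resolved:
    print(url)
```

In the skeleton's loop, the equivalent body would be along the lines of `eia_906_dataframes.append(pd.read_excel(urljoin(eia_906_url, a["href"])))`.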

### Discussion

Why might you choose to do all this instead of just manually collecting links?

## Pagination

### Example: getting the first few pages

### Challenge: pagination

OK, now let's put it all together! Let's fill in the blanks for this code:

```python
import math

import pandas as pd
import requests

base_url = "https://api.eia.gov/v2/electricity"
api_key = "3zjKYxV86AqtJWSRoAECir1wQFscVu6lxXnRVKG8"

page = requests.get(
  f"{base_url}/facility-fuel/data",
  params={
    "data[]": "generation",
    "facets[state][]": "CO",
    "api_key": api_key
  }
).json()["response"]

total_rows = ___
page_size = 5000
num_pages = ___

all_records = []
for page_num in range(num_pages):
    offset = ___
    page = requests.get(
      f"{base_url}/facility-fuel/data",
      params={
        "data[]": "generation",
        "facets[state][]": "CO",
        ___,
        "api_key": api_key
      }
    ).json()["response"]
    # each page's rows become one DataFrame; append it so pd.concat can combine them
    all_records.append(pd.DataFrame(page["data"]))

df = pd.concat(all_records, ignore_index=True)
```

### Solution
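
One possible approach to the pagination arithmetic, sketched with a fake page fetcher so that it runs without an API key. We're assuming the API reports the row count under `total` (possibly as a string, hence the `int(...)`) and accepts `offset` and `length` parameters; check the EIA API documentation before relying on that. `fetch_all` and `fake_fetch_page` are hypothetical names, and this sketch reuses the first page rather than requesting it twice.

```python
import math


def fetch_all(fetch_page, page_size):
    """Collect every record by requesting one page at a time."""
    first = fetch_page(offset=0, length=page_size)
    total_rows = int(first["total"])
    # Round up so that a partial final page is still requested.
    num_pages = math.ceil(total_rows / page_size)

    records = list(first["data"])
    for page_num in range(1, num_pages):
        # Page n starts where the previous n pages ended.
        offset = page_num * page_size
        page = fetch_page(offset=offset, length=page_size)
        records.extend(page["data"])
    return records


def fake_fetch_page(offset, length):
    """Stand-in for a real API request; serves slices of 12 made-up rows."""
    rows = [{"row": i} for i in range(12)]
    return {"total": str(len(rows)), "data": rows[offset:offset + length]}


records = fetch_all(fake_fetch_page, page_size=5)
print(len(records))
```

With a real API, `fetch_page` would wrap the `requests.get(...)` call and pass `offset` and `length` in `params`.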

## Key points

- Beautiful Soup lets you extract links from a webpage so that you can then download the files they point to.
- If you need more results than a single API request returns, most APIs provide some form of "pagination" so that you can make all the requests programmatically.
- Web scraping is a wide world. If you get stuck, try searching for some of the keywords above.