# Web Scraping & Pagination

## EIA Forms 923/906

### Example: getting links from a webpage

We'll be looking at the [EIA 923 webpage](https://www.eia.gov/electricity/data/eia923/) in examples and the [EIA-906 historical archive](https://www.eia.gov/electricity/data/eia923/eia906u.php) in exercises.

In [None]:
import requests

In [None]:
eia_923_url = "https://www.eia.gov/electricity/data/eia923/"

In [None]:
eia_923_response = requests.get(eia_923_url)
eia_923_response.text

In [None]:
import bs4

In [None]:
# name the parser explicitly so bs4 does not have to guess (and warn about it)
eia_923_soup = bs4.BeautifulSoup(eia_923_response.text, "html.parser")
eia_923_soup

In [None]:
eia_923_soup.find_all("title")

In [None]:
eia_923_all_a_tags = eia_923_soup.find_all("a")
eia_923_all_a_tags

In [None]:
eia_923_a_hrefs = eia_923_soup.find_all("a", href=True)
eia_923_a_hrefs

In [None]:
eia_923_zip_tags = []
for a in eia_923_a_hrefs:
    if a["href"].lower().endswith(".zip") and "906" not in a["href"]:
        eia_923_zip_tags.append(a)
eia_923_zip_tags
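For comparison, the same filter can be sketched without Beautiful Soup, using only the standard library's `html.parser.HTMLParser`. This is a swapped-in technique, not what the cells above do, and the HTML snippet and file names are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical HTML standing in for a downloaded page; the file names are made up.
SAMPLE_HTML = """
<a href="xls/f923_example.zip">923 data</a>
<a href="archive/f906_example.ZIP">906 data</a>
<a href="about.html">About</a>
<a name="top">No href here</a>
"""


class ZipLinkCollector(HTMLParser):
    """Collect hrefs from <a> tags that end in .zip and do not mention 906."""

    def __init__(self):
        super().__init__()
        self.zip_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")  # None when the tag has no href
        if href and href.lower().endswith(".zip") and "906" not in href:
            self.zip_hrefs.append(href)


collector = ZipLinkCollector()
collector.feed(SAMPLE_HTML)
zip_hrefs = collector.zip_hrefs
print(zip_hrefs)
```

Beautiful Soup is usually the more convenient choice because it returns whole tag objects, but the filtering logic is the same either way: keep a link only if it has an `href`, the `href` ends in `.zip`, and it does not mention 906.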

### Challenge: get all the relevant `a` tags from EIA 906

Much of the data now collected on Form EIA 923 was collected on Form EIA 906 in the past.

We'll have you work through the scraping steps on the 906 data to get a sense of how this all works.


Let's get the relevant `a` tags from the [EIA 906 page](https://www.eia.gov/electricity/data/eia923/eia906u.php):

Start with the skeleton code below. By the end, we expect a variable called `eia_906_xls_tags` that holds all the tags linking to the actual 1970-2000 data files.

### Solution

In [None]:
eia_906_url = "https://www.eia.gov/electricity/data/eia923/eia906u.php"
# get the page contents
# turn it into a collection of tags
# filter them down to the tags that contain the links to XLS data - for all years 1970-2000

### Example: downloading data

In [None]:
eia_923_one_link = eia_923_zip_tags[0]
eia_923_one_link

In [None]:
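# If the href is a relative path (no "https://..." prefix), requests cannot fetch it
# and raises requests.exceptions.MissingSchema.
eia_923_one_response = requests.get(eia_923_one_link["href"])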

In [None]:
from urllib.parse import urljoin

In [None]:
eia_923_one_full_url = urljoin(eia_923_url, eia_923_one_link["href"])
response = requests.get(eia_923_one_full_url)
response
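To see what `urljoin` is doing, here is a minimal sketch that resolves a few kinds of `href` against the page URL. The hrefs are hypothetical examples, not real files on the EIA site.

```python
from urllib.parse import urljoin

base_url = "https://www.eia.gov/electricity/data/eia923/"

# Hypothetical hrefs covering the common cases; the paths are made up.
relative_href = "xls/example.zip"
root_relative_href = "/electricity/example.zip"
absolute_href = "https://example.com/files/example.zip"

# A relative href is appended to the base "directory".
from_relative = urljoin(base_url, relative_href)
# An href starting with "/" replaces the whole path but keeps the host.
from_root_relative = urljoin(base_url, root_relative_href)
# An href that is already a full URL is returned unchanged.
from_absolute = urljoin(base_url, absolute_href)
# Without the trailing slash, the last path segment of the base is replaced.
from_no_slash = urljoin(base_url.rstrip("/"), relative_href)

for url in (from_relative, from_root_relative, from_absolute, from_no_slash):
    print(url)
```

Because `urljoin` leaves full URLs untouched, it is safe to apply to every link you scrape, whether the page uses relative or absolute hrefs.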

### Challenge: get the Form 906 file contents

OK, so now we know how to scrape a bunch of URLs from a webpage. Let's read the Form 906 files into our program! Since they're XLS files, we can read them directly from a URL using `pandas.read_excel` (reading the legacy `.xls` format requires the `xlrd` package to be installed).

Try making a list, `eia_906_dataframes`, that includes all of the data files from the [EIA 906 page](https://www.eia.gov/electricity/data/eia923/eia906u.php) - start with the (minimal) scaffold below!

### Solution

In [None]:
import pandas as pd

eia_906_dataframes = []

# loop through the eia_906_xls_tags and make a pd.DataFrame for each one

### Discussion

Why might you choose to do all this instead of just manually collecting links?

## Pagination

### Example: getting the first few pages

In [None]:
eia_api_base_url = "https://api.eia.gov/v2/electricity"
api_key = "3zjKYxV86AqtJWSRoAECir1wQFscVu6lxXnRVKG8"

In [None]:
first_page = requests.get(
  f"{eia_api_base_url}/facility-fuel/data",
  params={
    "data[]": "generation",
    "facets[state][]": "CO",
    "sort[0][column]": "period",
    "sort[0][direction]": "desc",
    "sort[1][column]": "plantCode",
    "sort[1][direction]": "desc",
    "api_key": api_key
  }
).json()["response"]

first_page.keys()

In [None]:
pd.DataFrame(first_page["data"])

In [None]:
next_page = requests.get(
  f"{eia_api_base_url}/facility-fuel/data",
  params={
    "data[]": "generation",
    "facets[state][]": "CO",
    "sort[0][column]": "period",
    "sort[0][direction]": "desc",
    "sort[1][column]": "plantCode",
    "sort[1][direction]": "desc",
    # same query as first_page, but start after the first 5000 rows
    "offset": 5000,
    "api_key": api_key
  }
).json()["response"]

In [None]:
pd.DataFrame(next_page["data"])

In [None]:
for page_num in range(5):
    print(f"Getting page {page_num}")
    # actually get the page here...

In [None]:
total_rows = first_page["total"]

In [None]:
import math

In [None]:
page_size = 5000
num_pages = math.ceil(int(total_rows) / page_size)
num_pages
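The reason for `math.ceil` rather than integer division is that a partly filled last page still needs its own request. A small sketch with hypothetical totals (written as strings, since the cell above converts with `int()`):

```python
import math

page_size = 5000

# Hypothetical totals, chosen to sit around the page boundaries.
example_totals = ["4999", "5000", "5001", "12345"]

pages_rounded_up = {}
pages_rounded_down = {}
for total in example_totals:
    pages_rounded_up[total] = math.ceil(int(total) / page_size)
    pages_rounded_down[total] = int(total) // page_size  # drops the partial last page
    print(total, pages_rounded_up[total], pages_rounded_down[total])
```

Rounding down would skip the final 2345 rows of a 12345-row result, and would request zero pages for anything smaller than one full page.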

### Challenge: pagination

OK, now let's put it all together!

Let's try to get *all* of the Colorado net generation data that is available from the EIA API.

Start with the following code and modify it to work:

### Solution

In [None]:
all_records = []
for page_num in range(num_pages):
    print(f"Getting page {page_num}...")
    offset = ___
    page = requests.get(
      f"{eia_api_base_url}/facility-fuel/data",
      params={
        "data[]": "generation",
        "facets[state][]": "CO",
        "sort[0][column]": "period",
        "sort[0][direction]": "desc",
        "sort[1][column]": "plantCode",
        "sort[1][direction]": "desc",
        ___,
        "api_key": api_key
      }
    ).json()["response"]
    all_records.append(pd.DataFrame(page["data"]))

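# ignore_index=True renumbers the rows instead of repeating each page's own index
df = pd.concat(all_records, ignore_index=True)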
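If you want to check the offset arithmetic without hitting the network, here is a minimal offline sketch of the same loop. `fetch_page`, `FAKE_ROWS`, and `PAGE_SIZE` are hypothetical stand-ins for the API call, the remote dataset, and the page size. Only the `total` and `data` keys used above are assumed about the response shape.

```python
import math

# Hypothetical stand-in for the remote dataset: 12 records, served 5 at a time.
FAKE_ROWS = [{"row": i} for i in range(12)]
PAGE_SIZE = 5


def fetch_page(offset):
    """Pretend API call: return one page with "total" and "data" keys."""
    return {
        "total": str(len(FAKE_ROWS)),
        "data": FAKE_ROWS[offset:offset + PAGE_SIZE],
    }


first_page = fetch_page(0)
num_pages = math.ceil(int(first_page["total"]) / PAGE_SIZE)

all_records = []
for page_num in range(num_pages):
    offset = page_num * PAGE_SIZE  # page 0 starts at row 0, page 1 at row 5, ...
    page = fetch_page(offset)
    all_records.extend(page["data"])

print(num_pages, len(all_records))
```

With the real API, the same `offset` value would be sent as the `"offset"` request parameter, as in the `next_page` cell earlier.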

## Key points

- Beautiful Soup lets you pull links out of a webpage so that you can then download the files they point to.
- If you need more results than a single API request returns, the API usually provides some "pagination" capability so that you can make all the requests programmatically.
- Web scraping is a wide world. If you get stuck, try searching for some of the keywords above.