In [1]:
import pandas as pd

## Reading remote data into `pandas`

In [2]:
help(pd.read_parquet)

Help on function read_parquet in module pandas.io.parquet:

read_parquet(path: 'FilePath | ReadBuffer[bytes]', engine: 'str' = 'auto', columns: 'list[str] | None' = None, storage_options: 'StorageOptions | None' = None, use_nullable_dtypes: 'bool | lib.NoDefault' = <no_default>, dtype_backend: 'DtypeBackend | lib.NoDefault' = <no_default>, filesystem: 'Any' = None, filters: 'list[tuple] | list[list[tuple]] | None' = None, **kwargs) -> 'DataFrame'
    Load a parquet object from the file path, returning a DataFrame.

    Parameters
    ----------
    path : str, path object or file-like object
        String, path object (implementing ``os.PathLike[str]``), or file-like
        object implementing a binary ``read()`` function.
        The string could be a URL. Valid URL schemes include http, ftp, s3,
        gs, and file. For file URLs, a host is expected. A local file could be:
        ``file://localhost/path/to/table.parquet``.
        A file URL can also be a path to a directory that

In [3]:
df = pd.read_parquet("https://s3.us-west-2.amazonaws.com/pudl.catalyst.coop/nightly/core_eia923__monthly_generation_fuel.parquet")

print(df.report_date.max())


2024-11-01


### challenge

Adapt your previous Excel-reading code to read the same Excel file directly from the Internet. GitHub provides a direct download link for every file in a repo, and the one for this Excel file is: `https://github.com/catalyst-cooperative/open-energy-data-for-all/raw/refs/heads/main/data/eia923_2022.xlsx`

Start with the code below.

### solution

```python
# modify this to read from the URL!
pd.read_excel("data/eia923_2022.xlsx", skiprows=5)
```

### key point

Most, but not all, of the `read_*` functions support URLs, so check the docs to make sure this will work!

### discussion
What are some advantages and disadvantages you can imagine for using remote data vs. saving the data to your hard drive (aka **local data**)?


## Using `requests` to download files

In [4]:
import requests

response = requests.get("https://raw.githubusercontent.com/catalyst-cooperative/open-energy-data-for-all/refs/heads/main/data/eia923_2022.json")

In [5]:
response.status_code

200

In [6]:
response.text



In [7]:
eia923_2022_json = response.json()
eia923_2022_json["response"]["data"]

[{'period': '2022-12',
  'plantCode': '6761',
  'plantName': 'Rawhide',
  'fuel2002': 'ALL',
  'fuelTypeDescription': 'Total',
  'state': 'CO',
  'stateDescription': 'Colorado',
  'primeMover': 'ALL',
  'generation': '188961',
  'gross-generation': '203283',
  'generation-units': 'megawatthours',
  'gross-generation-units': 'megawatthours'},
 {'period': '2022-12',
  'plantCode': '54142',
  'plantName': 'Hillcrest Pump Station',
  'fuel2002': 'WAT',
  'fuelTypeDescription': 'Hydroelectric Conventional',
  'state': 'CO',
  'stateDescription': 'Colorado',
  'primeMover': 'HY',
  'generation': '342.43',
  'gross-generation': '358.27',
  'generation-units': 'megawatthours',
  'gross-generation-units': 'megawatthours'},
 {'period': '2022-12',
  'plantCode': '54142',
  'plantName': 'Hillcrest Pump Station',
  'fuel2002': 'WAT',
  'fuelTypeDescription': 'Hydroelectric Conventional',
  'state': 'CO',
  'stateDescription': 'Colorado',
  'primeMover': 'ALL',
  'generation': '342.43',
  'gross-gen

### challenge

Adapt the JSON reading code from the last episode to use `requests.get`.

### solution

```python
import pandas as pd
import json

with open('data/eia923_2022.json') as file:
    eia923_json = json.load(file)

eia923_json_df = pd.DataFrame(eia923_json["response"]["data"])
```

### key points

* `requests` is useful when you need to reformat the data before shoving it into `pandas`
* `response.status_code` tells you if the request succeeded or why it failed.
* `response.text` gives you the raw response, if you need to check that the data is formatted how you expect
* `response.json()` parses the response body as JSON and returns ordinary Python dicts and lists
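
Once `response.json()` has handed back plain Python objects, getting them into `pandas` is mostly a matter of picking out the list of records. Here is a minimal sketch of that step, using two records copied (and trimmed) from the output above instead of a live request. The numeric conversion is an extra step, added on the assumption that you want numbers rather than the strings shown in the output.

```python
import pandas as pd

# Two records copied (with some keys trimmed) from the
# response.json()["response"]["data"] output shown earlier.
records = [
    {"period": "2022-12", "plantCode": "6761", "plantName": "Rawhide",
     "fuel2002": "ALL", "primeMover": "ALL",
     "generation": "188961", "gross-generation": "203283"},
    {"period": "2022-12", "plantCode": "54142", "plantName": "Hillcrest Pump Station",
     "fuel2002": "WAT", "primeMover": "HY",
     "generation": "342.43", "gross-generation": "358.27"},
]

eia923_df = pd.DataFrame(records)

# In the output above the values are strings, so convert the numeric columns.
numeric_cols = ["generation", "gross-generation"]
eia923_df[numeric_cols] = eia923_df[numeric_cols].apply(pd.to_numeric)

print(eia923_df.dtypes)
```

With a live response, `records` would be `response.json()["response"]["data"]`.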

## Web APIs: Fancy URLs

In [8]:
response = requests.get("https://api.eia.gov/v2/electricity/electric-power-operational-data/data?data[]=consumption-for-eg&facets[fueltypeid][]=NG&facets[sectorid][]=99&facets[location][]=CO&frequency=annual&start=2020&end=2023&api_key=3zjKYxV86AqtJWSRoAECir1wQFscVu6lxXnRVKG8")

response.json()

{'response': {'total': '3',
  'dateFormat': 'YYYY',
  'frequency': 'annual',
  'data': [{'period': '2021',
    'location': 'CO',
    'stateDescription': 'Colorado',
    'sectorid': '99',
    'sectorDescription': 'All Sectors',
    'fueltypeid': 'NG',
    'fuelTypeDescription': 'natural gas',
    'consumption-for-eg': '117512.901',
    'consumption-for-eg-units': 'thousand Mcf'},
   {'period': '2022',
    'location': 'CO',
    'stateDescription': 'Colorado',
    'sectorid': '99',
    'sectorDescription': 'All Sectors',
    'fueltypeid': 'NG',
    'fuelTypeDescription': 'natural gas',
    'consumption-for-eg': '127967.696',
    'consumption-for-eg-units': 'thousand Mcf'},
   {'period': '2023',
    'location': 'CO',
    'stateDescription': 'Colorado',
    'sectorid': '99',
    'sectorDescription': 'All Sectors',
    'fueltypeid': 'NG',
    'fuelTypeDescription': 'natural gas',
    'consumption-for-eg': '134798.975',
    'consumption-for-eg-units': 'thousand Mcf'}],
  'description': 'Month

In [None]:
# https://api.eia.gov/v2/electricity/electric-power-operational-data/data?
# data[]=consumption-for-eg&
# facets[fueltypeid][]=NG&
# facets[sectorid][]=99&facets[location][]=CO&frequency=annual&start=2020&end=2023&api_key=3zjKYxV86AqtJWSRoAECir1wQFscVu6lxXnRVKG8

### challenge

Make a request to `https://api.eia.gov/v2/electricity/electric-power-operational-data/data?data[]=consumption-for-eg&facets[fueltypeid][]=NG&facets[sectorid][]=99&facets[location][]=CO&frequency=annual&start=2020&end=2023&api_key=3zjKYxV86AqtJWSRoAECir1wQFscVu6lxXnRVKG8` with `requests.get`.

Try removing the `end=2023` parameter from the URL. What happens?

### solution

### key points

* web APIs can be thought of as bundles of fancy URLs
* each web API is different, but if you can read the documentation and make requests to URLs, you can figure them out
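
To see why these URLs are "fancy", it can help to pull one apart. The sketch below uses only the standard library to split the URL from this section into the server, the route, and the query parameters. The `YOUR_API_KEY` value is a placeholder, not a real key.

```python
from urllib.parse import parse_qs, urlsplit

# Same URL as in the request above, with a placeholder instead of a real API key.
url = (
    "https://api.eia.gov/v2/electricity/electric-power-operational-data/data"
    "?data[]=consumption-for-eg"
    "&facets[fueltypeid][]=NG"
    "&facets[sectorid][]=99"
    "&facets[location][]=CO"
    "&frequency=annual&start=2020&end=2023"
    "&api_key=YOUR_API_KEY"
)

parts = urlsplit(url)
print(parts.netloc)  # the server we are talking to
print(parts.path)    # the route within the API

# Everything after the "?" is a set of name=value pairs joined by "&".
query = parse_qs(parts.query)
for name, values in query.items():
    print(name, values)
```

Each parameter shows up as a name with a list of values, which is a useful way to check exactly what you are asking the API for.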


## Case study: EIA API

In [10]:
api_key = "3zjKYxV86AqtJWSRoAECir1wQFscVu6lxXnRVKG8"

response = requests.get(f"https://api.eia.gov/v2/electricity?api_key={api_key}")
response.json()

{'response': {'id': 'electricity',
  'name': 'Electricity',
  'description': 'EIA electricity survey data',
  'routes': [{'id': 'retail-sales',
    'name': 'Electricity Sales to Ultimate Customers',
    'description': 'Electricity sales to ultimate customer by state and sector (number of customers, average price, revenue, and megawatthours of sales).  \n    Sources: Forms EIA-826, EIA-861, EIA-861M'},
   {'id': 'electric-power-operational-data',
    'name': 'Electric Power Operations (Annual and Monthly)',
    'description': 'Monthly and annual electric power operations by state, sector, and energy source.\n    Source: Form EIA-923'},
   {'id': 'rto',
    'name': 'Electric Power Operations (Daily and Hourly)',
    'description': 'Hourly and daily electric power operations by balancing authority.  \n    Source: Form EIA-930'},
   {'id': 'state-electricity-profiles',
    'name': 'State Specific Data',
    'description': 'State Specific Data'},
   {'id': 'operating-generator-capacity',
  

### challenge

If we're looking for yearly data about fuel consumption at the plant level, what route should we request next?

### solution

In [13]:
base_url = "https://api.eia.gov/v2/electricity"

In [14]:
facility_fuel = requests.get(f"{base_url}/facility-fuel?api_key={api_key}").json()
facility_fuel

{'response': {'id': 'facility-fuel',
  'name': 'Electric Power Operations for Individual Power Plants (Annual and Monthly)',
  'description': 'Annual and monthly electric power operations for individual power plants, by energy source and prime mover\n    Source: Form EIA-923',
  'frequency': [{'id': 'monthly',
    'description': 'One data point for each month.',
    'query': 'M',
    'format': 'YYYY-MM'},
   {'id': 'quarterly',
    'description': 'One data point every 3 months.',
    'query': 'Q',
    'format': 'YYYY-"Q"Q'},
   {'id': 'annual',
    'description': 'One data point for each calendar year.',
    'query': 'A',
    'format': 'YYYY'}],
  'facets': [{'id': 'plantCode', 'description': 'Plant ID and Name'},
   {'id': 'fuel2002', 'description': 'Energy Source'},
   {'id': 'state', 'description': 'State'},
   {'id': 'primeMover', 'description': 'Prime Mover'}],
  'data': {'generation': {'alias': 'Net Generation', 'units': 'megawatthours'},
   'gross-generation': {'alias': 'Gross G

### challenge

Given the above example, and the output for the `facility-fuel` metadata, how do we get the net generation data?

Build off of the earlier request, reproduced below:

### solution

In [15]:
facility_fuel = requests.get(f"{base_url}/facility-fuel?api_key={api_key}")

facility_fuel.json()

{'response': {'id': 'facility-fuel',
  'name': 'Electric Power Operations for Individual Power Plants (Annual and Monthly)',
  'description': 'Annual and monthly electric power operations for individual power plants, by energy source and prime mover\n    Source: Form EIA-923',
  'frequency': [{'id': 'monthly',
    'description': 'One data point for each month.',
    'query': 'M',
    'format': 'YYYY-MM'},
   {'id': 'quarterly',
    'description': 'One data point every 3 months.',
    'query': 'Q',
    'format': 'YYYY-"Q"Q'},
   {'id': 'annual',
    'description': 'One data point for each calendar year.',
    'query': 'A',
    'format': 'YYYY'}],
  'facets': [{'id': 'plantCode', 'description': 'Plant ID and Name'},
   {'id': 'fuel2002', 'description': 'Energy Source'},
   {'id': 'state', 'description': 'State'},
   {'id': 'primeMover', 'description': 'Prime Mover'}],
  'data': {'generation': {'alias': 'Net Generation', 'units': 'megawatthours'},
   'gross-generation': {'alias': 'Gross G
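
The `frequency` list in this metadata names the ids a data request can use (`monthly`, `quarterly`, `annual`). One way to avoid guessing is to check a value against that list before sending anything. The sketch below does that with the entries copied from the output above. `check_frequency` is a hypothetical helper written for this example, not part of `requests` or the EIA API.

```python
# The `frequency` entries copied from the facility-fuel metadata shown above.
frequency_metadata = [
    {"id": "monthly", "query": "M", "format": "YYYY-MM"},
    {"id": "quarterly", "query": "Q", "format": 'YYYY-"Q"Q'},
    {"id": "annual", "query": "A", "format": "YYYY"},
]


def check_frequency(frequency, metadata):
    """Return `frequency` if the metadata lists it, otherwise raise ValueError."""
    valid_ids = [entry["id"] for entry in metadata]
    if frequency not in valid_ids:
        raise ValueError(f"{frequency!r} is not one of {valid_ids}")
    return frequency


print(check_frequency("annual", frequency_metadata))

# A plausible-sounding value that the metadata does not list.
try:
    check_frequency("yearly", frequency_metadata)
except ValueError as error:
    print(error)
```

In the notebook, the same list is available as `facility_fuel.json()["response"]["frequency"]`.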

In [18]:
yearly = requests.get(f"{base_url}/facility-fuel/data?data[]=generation&frequency=yearly&api_key={api_key}")
yearly

<Response [500]>

In [19]:
annual = requests.get(f"{base_url}/facility-fuel/data?data[]=generation&frequency=annual&api_key={api_key}")

annual.json()

    'description': 'The API can only return 5000 rows in JSON format.  Please consider constraining your request with facet, start, or end, or using offset to paginate results.'}],
  'total': '685702',
  'dateFormat': 'YYYY',
  'frequency': 'annual',
  'data': [{'period': '2021',
    'plantCode': '844',
    'plantName': 'Upper Power Plant',
    'fuel2002': 'WAT',
    'fuelTypeDescription': 'Hydroelectric Conventional',
    'state': 'ID',
    'stateDescription': 'Idaho',
    'primeMover': 'ALL',
    'generation': '42194',
    'generation-units': 'megawatthours'},
   {'period': '2021',
    'plantCode': '844',
    'plantName': 'Upper Power Plant',
    'fuel2002': 'WAT',
    'fuelTypeDescription': 'Hydroelectric Conventional',
    'state': 'ID',
    'stateDescription': 'Idaho',
    'primeMover': 'HY',
    'generation': '42194',
    'generation-units': 'megawatthours'},
   {'period': '2021',
    'plantCode': '846',
    'plantName': 'Combie South',
    'fuel2002': 'ALL',
    'fuelTypeDescrip
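
The warning in this response says the API returns at most 5000 rows and suggests using `offset` to paginate. Here is a rough sketch of how the page requests could be planned from the `total` value, without sending anything. `page_params` is a hypothetical helper, and the `length` parameter name is an assumption to verify against the EIA documentation; only `offset` is mentioned in the warning.

```python
import math

PAGE_SIZE = 5000  # row limit quoted in the API warning above


def page_params(total_rows, base_params, page_size=PAGE_SIZE):
    """Yield one params dict per page, each with its own `offset`."""
    n_pages = math.ceil(total_rows / page_size)
    for page in range(n_pages):
        # `length` is an assumed parameter name; check the API docs.
        yield {**base_params, "offset": page * page_size, "length": page_size}


base_params = {
    "data[]": "generation",
    "frequency": "annual",
    "api_key": "YOUR_API_KEY",  # placeholder, not a real key
}

# 'total' arrives as a string in the response above, so convert it first.
pages = list(page_params(int("685702"), base_params))
print(len(pages), pages[0]["offset"], pages[-1]["offset"])
```

Each dict could then be passed as `params=` to `requests.get`. That is a lot of requests for this query, which is a good reason to narrow it with facets and time limits first.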

In [22]:
annual = requests.get(
    f"{base_url}/facility-fuel/data",
    params={
        "data[]": "generation",
        "frequency": "annual",
        "api_key": api_key
    },
)

annual.json()

    'description': 'The API can only return 5000 rows in JSON format.  Please consider constraining your request with facet, start, or end, or using offset to paginate results.'}],
  'total': '685702',
  'dateFormat': 'YYYY',
  'frequency': 'annual',
  'data': [{'period': '2013',
    'plantCode': '3136',
    'plantName': 'Keystone',
    'fuel2002': 'SC',
    'fuelTypeDescription': 'Coal',
    'state': 'PA',
    'stateDescription': 'Pennsylvania',
    'primeMover': 'ST',
    'generation': '0',
    'generation-units': 'megawatthours'},
   {'period': '2013',
    'plantCode': '3138',
    'plantName': 'New Castle Plant',
    'fuel2002': 'ALL',
    'fuelTypeDescription': 'Total',
    'state': 'PA',
    'stateDescription': 'Pennsylvania',
    'primeMover': 'ALL',
    'generation': '392492',
    'generation-units': 'megawatthours'},
   {'period': '2013',
    'plantCode': '3138',
    'plantName': 'New Castle Plant',
    'fuel2002': 'BIT',
    'fuelTypeDescription': 'Coal',
    'state': 'PA',
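
Passing `params=` as a dictionary asks for the same thing as writing the query string by hand; `requests` just builds the URL for you. The sketch below prepares the request without sending it, so you can inspect the URL it would use. The API key is a placeholder.

```python
import requests
from urllib.parse import unquote

base_url = "https://api.eia.gov/v2/electricity"

# Build the request without sending it, so no network access is needed.
request = requests.Request(
    "GET",
    f"{base_url}/facility-fuel/data",
    params={
        "data[]": "generation",
        "frequency": "annual",
        "api_key": "YOUR_API_KEY",  # placeholder, not a real key
    },
)
prepared = request.prepare()

print(prepared.url)           # square brackets are percent-encoded
print(unquote(prepared.url))  # decoded form, easier to compare by eye
```

The decoded URL matches the f-string version used earlier, apart from the placeholder key.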
  

In [24]:
facility_fuel.json()

{'response': {'id': 'facility-fuel',
  'name': 'Electric Power Operations for Individual Power Plants (Annual and Monthly)',
  'description': 'Annual and monthly electric power operations for individual power plants, by energy source and prime mover\n    Source: Form EIA-923',
  'frequency': [{'id': 'monthly',
    'description': 'One data point for each month.',
    'query': 'M',
    'format': 'YYYY-MM'},
   {'id': 'quarterly',
    'description': 'One data point every 3 months.',
    'query': 'Q',
    'format': 'YYYY-"Q"Q'},
   {'id': 'annual',
    'description': 'One data point for each calendar year.',
    'query': 'A',
    'format': 'YYYY'}],
  'facets': [{'id': 'plantCode', 'description': 'Plant ID and Name'},
   {'id': 'fuel2002', 'description': 'Energy Source'},
   {'id': 'state', 'description': 'State'},
   {'id': 'primeMover', 'description': 'Prime Mover'}],
  'data': {'generation': {'alias': 'Net Generation', 'units': 'megawatthours'},
   'gross-generation': {'alias': 'Gross G

In [25]:
fueltypes = requests.get(f"{base_url}/facility-fuel/facet/fuel2002?api_key={api_key}").json()

fueltypes

{'response': {'totalFacets': 47,
  'facets': [{'id': 'NG', 'name': 'Natural Gas'},
   {'id': 'LIG', 'name': 'Coal'},
   {'id': 'OBG', 'name': 'other renewables'},
   {'id': 'MWH', 'name': 'Other'},
   {'id': 'WDL', 'name': 'Wood Waste Solids'},
   {'id': 'SC', 'name': 'Coal'},
   {'id': 'SUN', 'name': 'Solar'},
   {'id': 'MSN', 'name': 'Other'},
   {'id': 'PG', 'name': 'Waste Oil and Other Oils'},
   {'id': 'WAT', 'name': 'Hydroelectric Conventional'},
   {'id': 'SUB', 'name': 'Coal'},
   {'id': 'RFO', 'name': 'Residual Fuel Oil'},
   {'id': 'NUC', 'name': 'Nuclear'},
   {'id': 'ANT', 'name': 'Coal'},
   {'id': 'OOG', 'name': 'Other Gases'},
   {'id': 'WH', 'name': 'Natural Gas'},
   {'id': 'BFG', 'name': 'Other Gases'},
   {'id': 'MSW', 'name': 'other renewables'},
   {'id': 'BIT', 'name': 'Coal'},
   {'id': 'OBS', 'name': 'other renewables'},
   {'id': 'GEO', 'name': 'Geothermal'},
   {'id': 'WND', 'name': 'Wind'},
   {'id': 'LFG', 'name': 'Municiapl Landfill Gas'},
   {'id': 'PC', '
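
Several facet ids in this output share the same name (five different ids are all labelled `Coal`), so it can be handy to group the ids by name before deciding what to filter on. This is a small sketch using a handful of entries copied from the output above rather than the full list of 47.

```python
from collections import defaultdict

# A few entries copied from the fuel2002 facet output above.
facets = [
    {"id": "NG", "name": "Natural Gas"},
    {"id": "LIG", "name": "Coal"},
    {"id": "SC", "name": "Coal"},
    {"id": "SUN", "name": "Solar"},
    {"id": "SUB", "name": "Coal"},
    {"id": "ANT", "name": "Coal"},
    {"id": "WH", "name": "Natural Gas"},
    {"id": "BIT", "name": "Coal"},
]

# Map each human-readable name to the list of ids that carry it.
ids_by_name = defaultdict(list)
for facet in facets:
    ids_by_name[facet["name"]].append(facet["id"])

for name, ids in ids_by_name.items():
    print(name, ids)
```

With the live response, `facets` would be `fueltypes["response"]["facets"]`.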

In [27]:
annual_ng = requests.get(
    f"{base_url}/facility-fuel/data",
    params={
        "data[]": "generation",
        "frequency": "annual",
        "facets[fuel2002][]": "NG",
        "api_key": api_key
    },
)

annual_ng.json()

    'description': 'The API can only return 5000 rows in JSON format.  Please consider constraining your request with facet, start, or end, or using offset to paginate results.'}],
  'total': '123214',
  'dateFormat': 'YYYY',
  'frequency': 'annual',
  'data': [{'period': '2003',
    'plantCode': '50006',
    'plantName': 'Linden Cogen Plant',
    'fuel2002': 'NG',
    'fuelTypeDescription': 'Natural Gas',
    'state': 'NJ',
    'stateDescription': 'New Jersey',
    'primeMover': 'ALL',
    'generation': '5181791.34',
    'generation-units': 'megawatthours'},
   {'period': '2003',
    'plantCode': '50006',
    'plantName': 'Linden Cogen Plant',
    'fuel2002': 'NG',
    'fuelTypeDescription': 'Natural Gas',
    'state': 'NJ',
    'stateDescription': 'New Jersey',
    'primeMover': 'CA',
    'generation': '966748.66',
    'generation-units': 'megawatthours'},
   {'period': '2003',
    'plantCode': '50006',
    'plantName': 'Linden Cogen Plant',
    'fuel2002': 'NG',
    'fuelTypeDescrip

### challenge

Now we want to limit this to just the state of Colorado - let's update the code to do that.

As before, let's build off the old request.

### solution

In [29]:
annual_ng = requests.get(
    f"{base_url}/facility-fuel/data",
    params={
        "data[]": "generation",
        "frequency": "annual",
        "facets[fuel2002][]": "NG",
        "api_key": api_key
    },
)

annual_ng.json()

    'description': 'The API can only return 5000 rows in JSON format.  Please consider constraining your request with facet, start, or end, or using offset to paginate results.'}],
  'total': '123214',
  'dateFormat': 'YYYY',
  'frequency': 'annual',
  'data': [{'period': '2022',
    'plantCode': '61890',
    'plantName': 'NRG Chalk Point CT',
    'fuel2002': 'NG',
    'fuelTypeDescription': 'Natural Gas',
    'state': 'MD',
    'stateDescription': 'Maryland',
    'primeMover': 'ALL',
    'generation': '182.09',
    'generation-units': 'megawatthours'},
   {'period': '2022',
    'plantCode': '61890',
    'plantName': 'NRG Chalk Point CT',
    'fuel2002': 'NG',
    'fuelTypeDescription': 'Natural Gas',
    'state': 'MD',
    'stateDescription': 'Maryland',
    'primeMover': 'GT',
    'generation': '182.09',
    'generation-units': 'megawatthours'},
   {'period': '2022',
    'plantCode': '3797',
    'plantName': 'Chesterfield',
    'fuel2002': 'NG',
    'fuelTypeDescription': 'Natural Ga

### example: time limits

We saw the start/end parameters a bit earlier, but let's actually poke at the documentation to see how they're used:

![Screenshot with several examples, reproduced below](../episodes/fig/ep-3/start-end.png)

> Start date
> https://api.eia.gov/v2/electricity/retail-sales/data?api_key=xxxxxx&data[]=price&facets[sectorid][]=RES&facets[stateid][]=CO&frequency=monthly&start=2008-01-31
>
> End date
> https://api.eia.gov/v2/electricity/retail-sales/data?api_key=xxxxxx&data[]=price&facets[sectorid][]=RES&facets[stateid][]=CO&frequency=monthly&end=2008-03-01
>
> Start and end date together
> https://api.eia.gov/v2/electricity/retail-sales/data?api_key=xxxxxx&data[]=price&facets[sectorid][]=RES&facets[stateid][]=CO&frequency=monthly&start=2008-01-31&end=2008-03-01

Let's try out this pattern!

### challenge

Limit the results to 2020-2023. Start from your last query, reproduced below:

### solution

In [None]:
annual_ng_co = requests.get(
    f"{base_url}/facility-fuel/data",
    params={
        "data[]": "generation",
        "frequency": "annual",
        "facets[fuel2002][]": "NG",
        "facets[state][]": "CO",
        "api_key": api_key
    },
)

annual_ng_co.json()
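
One possible way to add the time limits is to extend the `params` dictionary with `start` and `end`, following the pattern from the documentation excerpt. The sketch below only builds the dictionary; `facility_fuel_params` is a hypothetical helper, and the year-only values assume the `YYYY` format that the metadata lists for annual data.

```python
def facility_fuel_params(api_key, fuel, state, start=None, end=None):
    """Build query parameters for an annual net generation request."""
    params = {
        "data[]": "generation",
        "frequency": "annual",
        "facets[fuel2002][]": fuel,
        "facets[state][]": state,
        "api_key": api_key,
    }
    # Only include the time limits that were actually given.
    if start is not None:
        params["start"] = start
    if end is not None:
        params["end"] = end
    return params


params = facility_fuel_params(
    "YOUR_API_KEY",  # placeholder, not a real key
    fuel="NG",
    state="CO",
    start="2020",
    end="2023",
)
print(params)
```

The result can be passed straight to `requests.get(f"{base_url}/facility-fuel/data", params=params)`.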

### discussion

Think back to the metadata you saw - what are some questions you can answer with the `facility-fuel` endpoint?

### key points

* Many functions in the `pandas.read_*` family can read tabular data from remote servers and cloud storage as if it were on your local computer
* `requests` can get data that's not in the right shape for `pandas.read_*`; you'll have to translate the response format into a `pandas.DataFrame` yourself
* web APIs are just collections of fancy URLs, which you can interact with via `requests`
* to learn an API, you need to be able to read the documentation and experiment with the API to see how it responds.