# Reading remote data files

In [None]:
import pandas as pd

help(pd.read_parquet)

In [None]:
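# assumes parquet_example_url holds the URL of a remote Parquet file
df = pd.read_parquet(parquet_example_url)

# most recent report date in the data
df.report_date.max()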

## Using `requests`

### Setup

In [None]:
import requests

json_example_url = "https://raw.githubusercontent.com/catalyst-cooperative/open-energy-data-for-all/refs/heads/main/data/eia923_2022.json"


### Example: `requests.get`

In [None]:
response = requests.get(json_example_url)

In [None]:
help(response)

In [None]:
response.status_code

In [None]:
response.text

In [None]:
response.json()

### Challenge: reading remote files

Can you take the raw XML data at the following URL and turn it into a string in Python?

In [None]:
xml_url = "https://raw.githubusercontent.com/catalyst-cooperative/open-energy-data-for-all/refs/heads/main/data/eia923_2022.xml"

### Solution

### Discussion
What are some advantages and disadvantages you can imagine for using remote data vs. saving the data to your hard drive (aka **local data**)?


### Key points

* `requests` is useful when you need to access remote data
* `response.status_code` tells you whether the request succeeded or why it failed
* `response.text` gives you the raw response body as a string, if you need to check that the data is formatted how you expect
* `response.json()` parses the response body as JSON and returns ordinary Python objects such as dictionaries and lists
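
The points above can be combined into one defensive habit: check the status code before trusting the body. Here is a minimal sketch of that idea. `parse_json_body` is a hypothetical helper name, not part of `requests`, and the status code and text are made-up stand-ins for `response.status_code` and `response.text` so the example runs without a network connection.

```python
import json


def parse_json_body(status_code, text):
    """Return the parsed JSON body, or raise if the request did not succeed."""
    # Status codes in the 200s mean success; anything else signals a problem.
    if not 200 <= status_code < 300:
        raise RuntimeError(f"request failed with status {status_code}")
    return json.loads(text)


# Made-up stand-ins for response.status_code and response.text.
parsed = parse_json_body(200, '{"plant": "example", "year": 2022}')
print(parsed["year"])
```

With a real response you would call `parse_json_body(response.status_code, response.text)`. `requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx status codes.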

## Intro to web APIs

### APIs as fancy URLs

Question: "How much natural gas was consumed for electricity generation, totalled across all sectors, in Puerto Rico, for each year between 2020 and 2023?"

In [None]:
response = requests.get("https://api.eia.gov/v2/electricity/electric-power-operational-data/data?data[]=consumption-for-eg&facets[fueltypeid][]=NG&facets[sectorid][]=99&facets[location][]=PR&frequency=annual&start=2020&end=2023&api_key=3zjKYxV86AqtJWSRoAECir1wQFscVu6lxXnRVKG8")

response.json()

### The structure of an API call

In [None]:
example_api_url = (
    "https://api.eia.gov" # "host": the high-level name of the API you're accessing
    "/v2/electricity/electric-power-operational-data/data" # "route": the specific aspect of the API you're accessing
    "?" # separator that indicates "everything after this will be a name-value pair"
    "data[]=consumption-for-eg" # name: data[], value: consumption-for-eg ("consumption for electricity generation")
    "&" # separator between each pair
    "facets[fueltypeid][]=NG" # only natural gas data
    "&"
    "facets[sectorid][]=99" # total across all sectors
    "&"
    "facets[location][]=PR" # in Puerto Rico
    "&"
    "frequency=annual" # per year
    "&"
    "start=2020" # starting in 2020
    "&"
    "end=2023" # ending in 2023
    "&"
    "api_key=3zjKYxV86AqtJWSRoAECir1wQFscVu6lxXnRVKG8" # a password to prove you have access to the API
)

requests.get(example_api_url).json()
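
To check the breakdown above, you can ask Python to split a URL into the same pieces. This is a small sketch using the standard library's `urllib.parse`; the `YOUR_API_KEY` value is a placeholder, and nothing here contacts the API.

```python
from urllib.parse import parse_qsl, urlsplit

url = (
    "https://api.eia.gov/v2/electricity/electric-power-operational-data/data"
    "?data[]=consumption-for-eg&facets[fueltypeid][]=NG&facets[sectorid][]=99"
    "&facets[location][]=PR&frequency=annual&start=2020&end=2023"
    "&api_key=YOUR_API_KEY"  # placeholder, not a real key
)

parts = urlsplit(url)
print("host: ", parts.netloc)  # the API you're accessing
print("route:", parts.path)    # the specific aspect of the API

# Everything after "?" is a list of name-value pairs separated by "&".
pairs = parse_qsl(parts.query)
for name, value in pairs:
    print(f"{name} = {value}")
```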

### Key points

* web APIs can be thought of as bundles of fancy URLs
* each web API is different, but if you can read the documentation and make requests to URLs, you can figure them out


## Case study: EIA API

[Link to main documentation page](https://www.eia.gov/opendata/documentation.php)

In [None]:
api_key = "3zjKYxV86AqtJWSRoAECir1wQFscVu6lxXnRVKG8"

### Example: trying out an API request

In [None]:
electricity_response = requests.get(f"https://api.eia.gov/v2/electricity?api_key={api_key}")
electricity_response.json()
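
A response like this can be long, so it may help to pull out only the child routes. The sketch below assumes the metadata keeps them in a list under `response` and then `routes`, each with an `id`; compare against the real output above before relying on it. `list_child_routes` is a hypothetical helper, and `sample_metadata` is a hand-written stand-in rather than real API output.

```python
def list_child_routes(payload):
    """Return the ids of the child routes named in a metadata payload."""
    # Assumed shape: {"response": {"routes": [{"id": ...}, ...]}}
    routes = payload.get("response", {}).get("routes", [])
    return [route["id"] for route in routes]


# Hand-written stand-in for electricity_response.json(); not real API output.
sample_metadata = {
    "response": {
        "id": "electricity",
        "routes": [
            {"id": "retail-sales"},
            {"id": "electric-power-operational-data"},
            {"id": "facility-fuel"},
        ],
    }
}

child_routes = list_child_routes(sample_metadata)
print(child_routes)
```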

### Challenge

[In this section](https://www.eia.gov/opendata/documentation.php#Examiningametadatarequ), the docs say:

> Discovering datasets should be much easier in APIv2 because the API now self-documents and organizes itself in a data hierarchy. Parent datasets have child datasets, which may have children of their own, and so on. To investigate what datasets are available, we request a parent node. The API will respond with the child datasets (routes) for the path we've requested.

If we're looking for yearly data about fuel consumption at the plant level, what route should we request next? Please request it using `requests.get` below.

### Solution

### Example: getting data points

[full documentation link](https://www.eia.gov/opendata/documentation.php#Facets)

> In earlier examples, when we asked about the metadata, the API responded with these available data points [under the 'data' key]:
>
> [...]
>
> Remember, in addition to specifying the column in the data[] parameter, we must also specify /data as the last node in the route:
>
> `https://api.eia.gov/v2/electricity/retail-sales/data/?api_key=XXXXXX&data[]=price`

In [None]:
base_url = "https://api.eia.gov/v2/electricity"

facility_fuels_metadata = requests.get(f"{base_url}/facility-fuel?api_key={api_key}")

facility_fuels_metadata.json()["response"]["data"]

In [None]:
net_generation = requests.get(f"{base_url}/facility-fuel/data?data[]=generation&api_key={api_key}")

net_generation.json()

### Example: filtering the data

Reading the documentation a bit further, we find [this section](https://www.eia.gov/opendata/documentation.php#Frequency), which introduces facets:

> Facets enable us to filter the data of concern to us, shrinking the size of the returns to a more manageable size.
>
> For example, our retail sales of electricity has the location and sector facets. If we query the route (without specifying /data), the API will tell us the facets that are relevant to that route.

In [None]:
facility_fuels_metadata.json()["response"]

In [None]:
gas_only = requests.get(
    f"{base_url}/facility-fuel/data?data[]=generation&facets[fuel2002][]=gas&api_key={api_key}"
)

In [None]:
gas_only = requests.get(
    f"{base_url}/facility-fuel/data",
    params={
        "data[]": "generation",
        "facets[fuel2002][]": "gas",
        "api_key": api_key
    },
)

gas_only.json()
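
The two cells above describe the same request: one writes the query string by hand, the other passes a `params` dictionary and lets `requests` build it. As a rough illustration of what that encoding looks like, this sketch uses the standard library's `urlencode`, which applies a comparable percent-encoding. `YOUR_API_KEY` is a placeholder.

```python
from urllib.parse import parse_qsl, urlencode

params = {
    "data[]": "generation",
    "facets[fuel2002][]": "gas",
    "api_key": "YOUR_API_KEY",  # placeholder, not a real key
}

# Square brackets are percent-encoded as %5B and %5D.
encoded = urlencode(params)
print(encoded)

# Decoding both versions yields the same name-value pairs.
handwritten = "data[]=generation&facets[fuel2002][]=gas&api_key=YOUR_API_KEY"
same_pairs = parse_qsl(encoded) == parse_qsl(handwritten)
print(same_pairs)
```

The dictionary form is usually easier to read and edit, since each parameter sits on its own line and you don't have to manage the `?` and `&` separators yourself.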

In [None]:
fueltypes = requests.get(f"{base_url}/facility-fuel/facet/fuel2002?api_key={api_key}").json()

fueltypes

In [None]:
gas_only = requests.get(
    f"{base_url}/facility-fuel/data",
    params={
        "data[]": "generation",
        "facets[fuel2002][]": "NG",
        "api_key": api_key
    },
)

gas_only.json()

### Challenge: putting it all together

We've handled the fuel type. Now let's split into breakout groups to handle the remaining issues with the data:

* we would like to filter this to Colorado data only
* we would like to filter this to data for 2020, 2021, 2022, and 2023
* we would like the data to be reported yearly, not monthly

For each group, pick one of those bullets and follow these steps:

1. Look at the metadata. See what parameters might help you get the right data back.
2. Figure out what values you want to pass in.
3. Try doing that and see if it fixed the problem.

Once we're all done we can come back and make the full API request together.

### Solution

### Key points

* many functions in the `pandas.read_*` family can read tabular data from remote servers and cloud storage as if it were on your local computer
* `requests` can get data that's not in the right shape for `pandas.read_*`; you'll have to translate the response format into a `pandas.DataFrame` yourself
* web APIs are just collections of fancy URLs, which you can interact with via `requests`
* to learn an API, you need to be able to read the documentation and experiment with the API to see how it responds
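
As an illustration of the translation step mentioned above, here is a minimal sketch that turns a JSON payload into a list of row dictionaries. It assumes the rows live under `response` and then `data`, and that numeric values may arrive as strings. The field names and numbers in `sample_payload` are invented for the example, and `to_records` is a hypothetical helper.

```python
def to_records(payload, numeric_fields=("generation",)):
    """Pull the rows out of a payload and convert numeric strings to floats."""
    records = []
    for row in payload["response"]["data"]:
        cleaned = dict(row)  # copy so the original payload is left unchanged
        for field in numeric_fields:
            if cleaned.get(field) is not None:
                cleaned[field] = float(cleaned[field])
        records.append(cleaned)
    return records


# Invented stand-in for a parsed response such as gas_only.json().
sample_payload = {
    "response": {
        "data": [
            {"period": "2022", "fuel2002": "NG", "generation": "100.5"},
            {"period": "2023", "fuel2002": "NG", "generation": "98.0"},
        ]
    }
}

records = to_records(sample_payload)
print(records)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)` to get a table with one column per key.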