
FERC EQR Archiver #43

Merged: 24 commits into main on Aug 14, 2023
Conversation

@e-belfer (Member) commented on Jan 26, 2023:

This script archives FERC EQR data from two sources: 2002-2013 and 2014-present. Technically, the code should work: it produces the expected URLs for each resource, and sends them to the download_zipfile() method. However, two issues will frustrate the easy use of this archiver and should be dealt with more creatively.

  1. Exceptionally slow download times from FERC. As FERC notes on their website:

     "Due to the large size of the database, only three users are allowed to download simultaneously at any time. Because downloading may take a long time, it will likely work best if scheduled during off-peak hours, such as Sundays or at night."

     Indeed, running this script generally results in timeout errors. Because of the three-download limit, downloads should probably be done serially rather than in parallel (see the sketch at the end of this comment).

  2. Files exceed the size limits of the GitHub runner, as noted by Zane.

This issue is blocked by #44
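
One possible way to force serial downloads (an illustrative sketch, not this PR's code): share a single-slot asyncio.Semaphore across all EQR download tasks so only one request is in flight at a time. The function name, chunk size, and streaming-to-disk details below are assumptions.

import asyncio

import aiohttp


async def download_eqr_serially(
    session: aiohttp.ClientSession,
    semaphore: asyncio.Semaphore,  # e.g. asyncio.Semaphore(1), shared by all EQR tasks
    url: str,
    path: str,
) -> None:
    """Download one EQR zipfile while holding the semaphore, so only one
    download runs at a time even if many tasks are scheduled."""
    async with semaphore:
        async with session.get(url) as resp:
            resp.raise_for_status()
            with open(path, "wb") as f:
                async for chunk in resp.content.iter_chunked(1024 * 1024):
                    f.write(chunk)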

@e-belfer linked an issue on Jan 26, 2023 that may be closed by this pull request
@jdangerx (Member) commented on Jan 26, 2023:

Like, a global 3-user throttle? Oy vey. I think we could specifically limit the # of concurrent connections to 1 for EQR - maybe via some sort of command-line flag. I'm thinking something like this in src/pudl_archiver/__init__.py:

diff --git a/src/pudl_archiver/__init__.py b/src/pudl_archiver/__init__.py
index 4eb4439..cf9d20c 100644
--- a/src/pudl_archiver/__init__.py
+++ b/src/pudl_archiver/__init__.py
@@ -47,13 +47,19 @@ async def archive_dataset(
         await archiver.create_archive()
 
 
+DATASETS_LIMITED_CONCURRENCY = {"ferc_eqr"}
+
+
 async def archive_datasets(
     datasets: list[str],
     sandbox: bool = True,
     initialize: bool = False,
     dry_run: bool = True,
+    conns_per_host: int = 20,  # could be parsed as a CLI option
 ):
     """A CLI for the PUDL Zenodo Storage system."""
+    if set(datasets).intersection(DATASETS_LIMITED_CONCURRENCY):
+        conns_per_host = 1
     if sandbox:
         upload_key = os.environ["ZENODO_SANDBOX_TOKEN_UPLOAD"]
         publish_key = os.environ["ZENODO_SANDBOX_TOKEN_PUBLISH"]
@@ -61,7 +67,7 @@ async def archive_datasets(
         upload_key = os.environ["ZENODO_TOKEN_UPLOAD"]
         publish_key = os.environ["ZENODO_TOKEN_PUBLISH"]
 
-    connector = aiohttp.TCPConnector(limit_per_host=20, force_close=True)
+    connector = aiohttp.TCPConnector(limit_per_host=conns_per_host, force_close=True)
     async with aiohttp.ClientSession(
         connector=connector, raise_for_status=False
     ) as session:
@@ -75,7 +81,6 @@ async def archive_datasets(
                 publish_key,
                 testing=sandbox,
             )

             tasks.append(
                 archive_dataset(
                     dataset,

Re: the file size, if we were to upload individual files to Zenodo instead of downloading the whole dataset and then uploading everything, would that work? I don't think that would be too disruptive a modification to how the archiver works, and it looks like each file is significantly smaller than 14 GB - e.g., the 2013 file is 1 GB.

@jdangerx (Member) left a comment:

Hey, that was quick! Of course there are, well, the other issues you mentioned 😓 - but overall this looks good. One non-blocking code structure comment.

I'm trying to figure out what testing for this might look like - any thoughts @zschira ?

Two resolved review threads on src/pudl_archiver/archivers/ferc/ferceqr.py
@zschira (Member) commented on Jan 27, 2023:

Testing the individual archivers is definitely an important/unanswered question. This archiver is following the general pattern of having a method that gets the url for/downloads a single file. We could always mock out download_zipfile and check that the URL can be downloaded from there, or we could go with the strategy you presented of splitting it out into two parts.

In general I'd like to have testing that would actually check that the file names and partitions are structured in the way that PUDL expects, so we could catch issues like the one I found with eia860m. I'm not sure of an obvious way to get there though.
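
A self-contained sketch of that mocking idea, using pytest-asyncio and unittest.mock.AsyncMock. The ToyEqrArchiver class and its URL pattern below are made up for illustration only and do not reflect the real interface in src/pudl_archiver/archivers/ferc/ferceqr.py.

from unittest.mock import AsyncMock

import pytest


class ToyEqrArchiver:
    """Toy stand-in for an archiver; the real class has a different interface."""

    async def download_zipfile(self, url: str, path: str) -> None:
        raise NotImplementedError  # replaced by a mock in the test

    async def get_resources(self) -> None:
        # Hypothetical URL pattern, purely for illustration.
        for year in (2002, 2003):
            await self.download_zipfile(
                f"https://eqrreportviewer.ferc.gov/downloads/{year}.zip",
                f"eqr-{year}.zip",
            )


@pytest.mark.asyncio
async def test_urls_passed_to_download_zipfile():
    archiver = ToyEqrArchiver()
    archiver.download_zipfile = AsyncMock()  # no network traffic
    await archiver.get_resources()
    urls = [call.args[0] for call in archiver.download_zipfile.call_args_list]
    # Every URL handed to download_zipfile should look like a zipfile link.
    assert all(url.startswith("https://") and url.endswith(".zip") for url in urls)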

@zschira (Member) commented on Jan 27, 2023:

As for the rate limiting and file size issues, I like the idea of being able to set the number of concurrent connections per dataset. I think this will be useful in tuning the rate limiting more generally. I also think downloading one file at a time seems like a good approach. We could probably even do something where we download a file, go through the whole deposition update process (uploads/deletes and update datapackage), but just don't publish the new version, and continue this until we're through the whole process. This might be an easy approach without having to change very much.
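
Roughly, that loop might look like the sketch below; download_all_resources, _generate_change, and _apply_changes mirror names that show up later in this PR, but the signatures and the final publish step are assumptions.

async def add_files_one_at_a_time(self):
    """Stream one resource at a time through the deposition update,
    deferring publication until every file has been handled."""
    changed = False
    async for name, resource in self.downloader.download_all_resources():
        # Decide whether this file needs to be created/updated in the deposition.
        action = self._generate_change(name, resource)
        if action:
            changed = True
            await self._apply_changes(action, name, resource)
    if changed:
        # Publish a single new version only after all per-file updates succeed;
        # the method name here is an assumption.
        await self._publish_new_version()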

@jdangerx (Member) commented:

Sounds like we should make a separate ticket for "streaming deposition update in archiver" - I can do that.

In terms of testing the actual data - how did you find that eia860m issue?

Could we basically download a small subset of the data for each dataset and validate that? Seems like we'd want to run it through the whole ETL flow, so we could try something like this:

  • create all datasets from scratch on sandbox
  • update them with as little data as reasonable
  • run ETL against them
  • validate ETL outputs

I don't actually know the details of what the nightly build does, but is this similar to that?

@jdangerx mentioned this pull request on Jan 27, 2023
@e-belfer (Member, Author) commented:

So far I've tested this very coarsely by printing the URLs to the terminal and spot-checking them as I develop the code, but in terms of long-term testing of the URL inputs I could see a few options working here. The download_hyperlinks() method by default only generates hyperlinks which exist on the page, so testing is not an issue for those methods. This leaves download_zipfile() and download_file() as the two methods which take a URL as input.

  • Adding a resp.ok HEAD check into download_zipfile()/download_file(), which would add this check to all the archivers that take URLs as input.
  • Adding a separate url_check() method (perhaps one that can be turned off and on) that would precede the download method (a sketch follows below). Since the functions are built mostly to take single URLs, it would be best to write this to take one URL as input.
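
A minimal sketch of the second option, a standalone url_check() using an aiohttp HEAD request; treating any client error as a failed check is an assumption.

import aiohttp


async def url_check(session: aiohttp.ClientSession, url: str) -> bool:
    """Return True if a HEAD request to the URL succeeds (2xx/3xx status)."""
    try:
        async with session.head(url, allow_redirects=True) as resp:
            return resp.ok
    except aiohttp.ClientError:
        return False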

@zschira (Member) commented on Jan 30, 2023:

@jdangerx this is, from my understanding, basically what the nightly builds are doing. My thoughts have been that once things are stabilized we would just create new production archives every time we run the automated archiver, and then kick off a nightly build using these new archives. We could then investigate any failures that arise during the nightly builds. We could certainly create sandbox archives first instead; I don't have particularly strong feelings, but it felt like maybe an unnecessary step if failures shouldn't arise very often.

Outside of this, though, I think having a way to test the interfaces during unit/integration tests would be really helpful. Technically we could probably do this now, since the archiver does depend on PUDL, so we could devise some tests that use the Datastore. Not sure exactly what these would look like, but it seems possible.

Eventually, when we get to pulling the Datastore out of PUDL I think it could make sense to combine the archiver and Datastore. I think it could be cool to have a single unified interface to an 'archive'. It would be nice to have one place that contains all of the code to create/modify/access an archive that can all be tested and developed together.

@zschira mentioned this pull request on Jan 30, 2023
@jdangerx linked an issue on Feb 7, 2023 that may be closed by this pull request
@zaneselvans (Member) commented:

Even just having the raw EQR archived somewhere (AWS open data...) that doesn't have a global 3-user limitation will dramatically improve data accessibility 😄

@zschira I think once an archiver is up and running and working well, we should probably be using it to create new production archives on a schedule, while also being conservative about ensuring that the archiver fails if something unexpected happens, so we don't end up creating polluted (irrevocable) archives.

My guess is that for many of our datasets, almost every new archive will cause the nightly builds to fail, so if we're going to attempt an auto-PR to switch to a new archive, doing it one dataset at a time will be necessary. Snapshotting and versioning the data has value (and is relatively easy), and should make it easy for us to switch over to a new set of DOIs once a quarter or once a month, address any new brokenness, and cut a release.

@e-belfer (Member, Author) commented:

Update: this PR is either waiting on the resolution of issue #44 or on the development of an interim workaround. In the meantime, I'll hold off on merging.

@zaneselvans (Member) commented:

I guess one potential workaround is that instead of running this archiver on a GitHub runner, we spin up a GCP instance (kicked off by a scheduled GitHub Action?) and run it there instead, with plenty of disk and a fast, reliable network connection. Maybe just monthly or quarterly if it ends up being a bear. 🐻

@zaneselvans added the ferceqr (FERC Electric Quarterly Reports, aka Form 920) label on Feb 28, 2023
@e-belfer mentioned this pull request on Aug 2, 2023
@zschira marked this pull request as ready for review on Aug 4, 2023
@zschira self-assigned this on Aug 7, 2023
@@ -34,6 +41,22 @@ class Config:
arbitrary_types_allowed = True


class FileWrapper(io.BytesIO):
Member:

When aiohttp times out while uploading a file, it will then close the file, which leads to an error when we retry the upload. I created this wrapper so aiohttp can't close the file, which doesn't seem like the best solution, but it was easy, and maybe it's good enough?

Member:

The discussion on this issue implies that there's some sort of custom Payload we could pass in, and that this is the preferred method by the aiohttp maintainer. But, it's been a whole major version since then, and the documentation doesn't mention this at all.

This issue uses a "monkeypatch the .close() function" approach which seems equally janky but a little shorter. 🤷

This existing option is good enough for me, though we should add some more documentation about this somewhat surprising behavior.

Member:

Yeah, I found an issue where the aiohttp maintainers seemed quite convinced that they should take ownership of a file, and that callers should reopen the file, which would require a lot of re-engineering on our part, so we're gonna have to live with a hack, I guess.
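
The idea under discussion amounts to a BytesIO whose close() is a no-op, so retried uploads still have an open buffer; a rough sketch follows (the actually_close() helper is an assumption, not necessarily what the PR implements).

import io


class FileWrapper(io.BytesIO):
    """BytesIO wrapper that ignores close() so aiohttp can't close the
    buffer between upload retries."""

    def close(self):
        # Deliberately do nothing; aiohttp calls close() after a timeout.
        pass

    def actually_close(self):
        # Close the underlying buffer once the upload has truly finished.
        super().close()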

# Leave immediately after generating changes if dry_run
resources = {}
changed = False
async for name, resource in self.downloader.download_all_resources():
Member:

Handles one file at a time now

@jdangerx self-requested a review on Aug 10, 2023
@jdangerx (Member) left a comment:

This looks good, but I want to make sure we think through the apply_changes interface a bit, and I wanted to flag the tmp_dir lifecycle question.

The FileWrapper is fine... I tried to look at how aiohttp wants us to do this, and couldn't figure it out with a few minutes of searching, so I gave up.

Finally I have a small suggestion for the test that (I think) means you get to skip making a whole new test class.

Resolved review thread on src/pudl_archiver/archivers/classes.py

if action:
changed = True
await self._apply_changes(action, name, resource)
Member:

A single change is a tuple of action, name, resource, right? If that's the case, it would be nice not to be able to split up the components of a change when we call _apply_change:

add_or_update = self._generate_change(name, resource) # maybe we could make a Change type but just a tuple seems fine too
if self.dry_run:
    continue
if add_or_update:
    changed = True
    await self._apply_change(add_or_update)

deletes = self._get_files_to_delete(resources)
...

Member:

Added a class to encapsulate a change that I think makes things nicer
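
For reference, such a class could be as small as the sketch below; the _DepositionChange name and its fields are assumptions, with _DepositionAction matching the enum visible in the diff context that follows.

from dataclasses import dataclass
from enum import Enum, auto
from pathlib import Path


class _DepositionAction(Enum):
    CREATE = auto()
    UPDATE = auto()
    DELETE = auto()


@dataclass
class _DepositionChange:
    """Bundle the action, remote file name, and local resource for one change."""

    action_type: _DepositionAction
    name: str
    resource: Path | None = None  # a DELETE carries no local file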

await self._upload_file(
_UploadSpec(source=resource.local_path, dest=name)
)
case _DepositionAction.UPDATE:
Member:

Hm, wonder if we could encode a delete as an update with no resource, since an update is a delete + an upload anyways.

Member:

Or, I guess, an update could be encoded as a delete then a create.

Member:

Lmk what you think of my change. I could see it being a bit confusing that the Update triggers both if statements, but it does remove redundant code.

self.call_count += 1
return Path(f"path{self.call_count-1}")

tmpdir_mock = mocker.Mock(side_effect=TmpDir())
Member:

non-blocking: would this work for the same purpose?

tmpdir_mock = mocker.Mock(side_effect=[f"path{i}" for i in range(5)])

I think if you wanted to have an infinitely incrementing path counter you could do this, too:

import itertools

tmpdir_mock = mocker.Mock(side_effect=(f"path{i}" for i in itertools.count(0)))

Member:

Thanks! Didn't realize you could just pass an iterator as a side effect


@jdangerx (Member) left a comment:

New class looks great! Tiny nit: now that _apply_changes only takes one change, maybe we should rename it 🤷

@zschira merged commit 542f0f8 into main on Aug 14, 2023 (3 checks passed)
@zschira deleted the ferceqr branch on Aug 14, 2023
Labels: ferceqr (FERC Electric Quarterly Reports, aka Form 920), inframundo
Projects: Archived in project
Development: successfully merging this pull request may close these issues:
  • Permissions errors when adding new dataset to Zenodo
  • Archive FERC EQR
4 participants