Archive quarterly CEMS data in Annual files #237

jdangerx · 2023-12-18T17:20:06Z

Partial fix for #230. Base branch is stop-discarding-drafts because I wanted those fixes for testing.

This just takes every year's four quarters and puts them into one ZIP file per year, to stay under Zenodo's 100-file limit.

You can check out the output here.

Here's the resources section of datapackage.json. I decided to format the partitions like {"year": XXXX, "quarter": [1, 2, 3, 4]}.

{
    ...
    "resources": [
        {
            "profile": "data-resource",
            "name": "epacems-1995.zip",
            "path": "https://sandbox.zenodo.org/records/13474/files/epacems-1995.zip",
            "remote_url": "https://sandbox.zenodo.org/records/13474/files/epacems-1995.zip",
            "title": "epacems-1995.zip",
            "parts": {
                "year": 1995,
                "quarter": [
                    1,
                    2,
                    3,
                    4
                ]
            },
            "encoding": "utf-8",
            "mediatype": "application/zip",
            "format": ".zip",
            "bytes": 78118649,
            "hash": "8efa7789255fbfac75ed291d3a41f25b"
        },
        {
            "profile": "data-resource",
            "name": "epacems-1996.zip",
            "path": "https://sandbox.zenodo.org/records/13474/files/epacems-1996.zip",
            "remote_url": "https://sandbox.zenodo.org/records/13474/files/epacems-1996.zip",
            "title": "epacems-1996.zip",
            "parts": {
                "year": 1996,
                "quarter": [
                    1,
                    2,
                    3,
                    4
                ]
            },
            "encoding": "utf-8",
            "mediatype": "application/zip",
            "format": ".zip",
            "bytes": 75959903,
            "hash": "7cba31cfd16e73511831c6547b104381"
        }
    ],
    "created": "2023-12-18 17:02:54.530469",
    "version": "2.0.0"
}

jdangerx

I'm still getting some wacky type complaints from the get_resources() -> ArchiveAwaitable call, and ended up spending quite some time trying to get the correct amount of await-ing in the right places for the download to work.
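For readers following the await-ing problem, here is a minimal sketch of the pattern, assuming `get_resources()` is an async generator that yields un-awaited coroutines. The `ArchiveAwaitable` alias, `ResourceInfo` fields, and helper names below are stand-ins for illustration, not the actual pudl_archiver definitions. There are two levels of asynchrony (iterating the generator, then awaiting each item), which is where a missing `await` is easy to introduce:

```python
import asyncio
from collections.abc import AsyncGenerator, Awaitable
from dataclasses import dataclass


@dataclass
class ResourceInfo:
    """Hypothetical stand-in for the archiver's resource description."""

    name: str
    partitions: dict


# Assumed shape of the alias: an async generator that yields awaitables.
ArchiveAwaitable = AsyncGenerator[Awaitable[ResourceInfo], None]


async def get_year_resource(year: int) -> ResourceInfo:
    """Pretend to download one year's data and describe the result."""
    await asyncio.sleep(0)  # placeholder for the real download
    return ResourceInfo(name=f"epacems-{year}.zip", partitions={"year": year})


async def get_resources(years: list[int]) -> ArchiveAwaitable:
    """Yield one un-awaited coroutine per year."""
    for year in years:
        yield get_year_resource(year)


async def collect(years: list[int]) -> list[ResourceInfo]:
    """Iterate with `async for`, then await every yielded coroutine."""
    pending = [coro async for coro in get_resources(years)]
    return list(await asyncio.gather(*pending))


resources = asyncio.run(collect([1995, 1996]))
```

Forgetting the second step leaves a list of coroutine objects rather than results, which is the kind of mismatch a type checker tends to flag.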

I have some hopes and dreams of trying out a refactor that would make this easier, but not sure if that's really the right place to put my time right now.

jdangerx · 2023-12-18T17:21:38Z

src/pudl_archiver/archivers/classes.py

        """Check if this year is one we are interested in.

        Args:
            year: the year to check against our self.only_years
        """
+        if not year:


Just dealing with some complaints from the typechecker. I suppose I could move the "if not year" logic to the caller, too, but that would entail either pulling that calling code into a separate function or adding a weird conditional inside a list comprehension, which sucks.
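A minimal sketch of the two options described above, written as a free function rather than the archiver method. It assumes a missing year is rejected and that an empty `only_years` accepts everything; the excerpt does not show what the real branch returns, so both behaviours are assumptions:

```python
from __future__ import annotations


def valid_year(year: int | str | None, only_years: list[int]) -> bool:
    """Check whether a year is one we are interested in.

    Assumption: a missing year is rejected. The PR excerpt only shows the
    `if not year:` line, not what it returns.
    """
    if not year:
        return False
    if not only_years:
        # Assumption: an empty filter means "accept every year".
        return True
    return int(year) in only_years


years = [1995, None, "1996", 1997]

# Guard inside the function: the comprehension stays simple.
kept = [y for y in years if valid_year(y, only_years=[1995, 1996])]

# Guard at the call site: the extra conditional inside the comprehension.
kept_at_caller = [
    y for y in years if y is not None and valid_year(y, only_years=[1995, 1996])
]
```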


jdangerx · 2023-12-18T17:22:26Z

src/pudl_archiver/archivers/epacems.py

@@ -15,6 +20,38 @@
 logger = logging.getLogger(f"catalystcoop.{__name__}")


+class BulkFile(BaseModel):


Wasn't necessary for this change, but found it slightly helpful to document the expected shape of the API responses in code - and made the type annotations more useful.

jdangerx · 2023-12-18T17:28:27Z

src/pudl_archiver/archivers/epacems.py

+
+            await self.download_file(url=url, file=file_path, timeout=60 * 14)
+
+            with zipfile.ZipFile(


I didn't want to change how download_file_to_zip works for all archivers yet, so I inlined this. I think a more useful approach would be a few utility functions like "download file (raw)," "download file (zip it)," and "download file (add to an existing archive)" that aren't coupled to the class hierarchy. Out of scope for this PR, I think.
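As a rough illustration of the "add to an existing archive" utility, here is a sketch using only the standard library. The function name `add_to_archive` and the quarterly file names are hypothetical, and the download step is replaced by writing placeholder files locally:

```python
import tempfile
import zipfile
from pathlib import Path


def add_to_archive(file_path: Path, archive_path: Path) -> None:
    """Append one local file to a ZIP archive, creating the archive if needed."""
    # Mode "a" appends to an existing archive or creates a new one.
    with zipfile.ZipFile(
        archive_path, "a", compression=zipfile.ZIP_DEFLATED
    ) as archive:
        archive.write(file_path, arcname=file_path.name)


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    archive_path = tmp_dir / "epacems-1995.zip"
    for quarter in (1, 2, 3, 4):
        # Hypothetical file name; stands in for a downloaded quarterly file.
        quarterly = tmp_dir / f"epacems-1995-q{quarter}.csv"
        quarterly.write_text("placeholder,data\n")
        add_to_archive(quarterly, archive_path)
    with zipfile.ZipFile(archive_path) as archive:
        members = archive.namelist()
```

Keeping this as a plain function means any archiver could call it without inheriting from a particular base class.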


jdangerx · 2023-12-18T20:11:09Z

src/pudl_archiver/archivers/epacems.py

+            local_path=archive_path,
+            partitions={
+                "year": year,
+                "quarter": sorted([file.metadata.quarter for file in files]),


@cmgosnell - this is just the first partition format I came up with, we can change it to suit whatever we want to use downstream. @e-belfer and I chatted about this and think maybe it should be partitions={"year_quarter": sorted([f"{year}q{file.metadata.quarter}" for file in files])} ? What do you think?


e-belfer

This works as expected. I have a few questions about specific lines where clarification would be helpful, though. The only blocking suggestion is to fix the partitions as discussed.

e-belfer · 2023-12-18T22:05:17Z

src/pudl_archiver/archivers/classes.py

        """Check if this year is one we are interested in.

        Args:
            year: the year to check against our self.only_years
        """
+        if not year:


Is this because sometimes there is no year partition? Not following the logic here of adding None


e-belfer · 2023-12-18T22:07:08Z

src/pudl_archiver/archivers/epacems.py

-                and (file["metadata"]["dataSubType"] == "Hourly")
-                and ("quarter" in file["metadata"])
-                and (self.valid_year(file["metadata"]["year"]))
+                if (file.metadata.data_type == "Emissions")


so much neater, thank you!


e-belfer · 2023-12-18T22:12:10Z

src/pudl_archiver/archivers/epacems.py

+            local_path=archive_path,
+            partitions={
+                "year": year,
+                "quarter": sorted([file.metadata.quarter for file in files]),


+1 to this suggestion.
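For reference, a small sketch of the `year_quarter` format proposed in this thread. The helper name `year_quarter_partitions` is hypothetical; only the f-string pattern comes from the suggestion itself:

```python
def year_quarter_partitions(year: int, quarters: list[int]) -> dict[str, list[str]]:
    """Build the proposed partition dict, one "<year>q<quarter>" string per file."""
    return {"year_quarter": sorted(f"{year}q{quarter}" for quarter in quarters)}


partitions = year_quarter_partitions(1995, [3, 1, 4, 2])
```

With this shape, a downstream consumer can select a file by checking whether a single string such as "1995q2" is in the list, instead of matching on two separate keys.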

e-belfer

Changes look great. The handling of the metadata types is much neater with the additional filtering method.

e-belfer · 2023-12-19T18:02:13Z

.github/workflows/run-archiver.yml

@@ -51,7 +51,7 @@ jobs:
          ZENODO_TOKEN_UPLOAD: ${{ secrets.ZENODO_TOKEN_UPLOAD }}
          ZENODO_TOKEN_PUBLISH: ${{ secrets.ZENODO_TOKEN_PUBLISH }}
        run: |
-          pudl_archiver --datasets ${{ matrix.dataset }} --summary-file ${{ matrix.dataset }}_run_summary.json
+          pudl_archiver --datasets ${{ matrix.dataset }} --summary-file ${{ matrix.dataset }}_run_summary.json --sandbox


We'll want to revert this before merging, just a flag!

e-belfer · 2023-12-19T18:06:49Z

src/pudl_archiver/archivers/epacems.py

@@ -61,48 +61,55 @@ class EpaCemsArchiver(AbstractDatasetArchiver):
    base_url = "https://api.epa.gov/easey/bulk-files/"
    parameters = {"api_key": os.environ["EPACEMS_API_KEY"]}  # Set to API key

+    def __filter_for_complete_metadata(


Thinking aloud here - the only issue here could be if the underlying format of the metadata changes (e.g., year becomes string type), and we'd silently lose all the files. But I guess this would be caught in our other validation tests (e.g. check missing files).
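One way to make that failure loud rather than silent is sketched below. This is not the PR's `__filter_for_complete_metadata`; the `Metadata` class is a plain dataclass stand-in for the typed API models, and the rule "raise if every record is dropped" is an assumption about what a useful guard might look like:

```python
from dataclasses import dataclass


@dataclass
class Metadata:
    """Hypothetical subset of the bulk-file metadata fields."""

    year: object = None
    quarter: object = None


def filter_for_complete_metadata(records: list[Metadata]) -> list[Metadata]:
    """Keep records with integer year and quarter; fail if nothing survives."""
    kept = [
        record
        for record in records
        if isinstance(record.year, int) and isinstance(record.quarter, int)
    ]
    if records and not kept:
        # Every record was dropped, which suggests the upstream format changed.
        raise ValueError("No records had complete metadata; did the API change?")
    return kept


# Expected shape: the incomplete record is dropped, the rest are kept.
complete = filter_for_complete_metadata(
    [Metadata(year=1995, quarter=1), Metadata(year=1995), Metadata(year=1995, quarter=2)]
)

# Drifted shape: years arrive as strings, so nothing would pass the filter.
drifted = [Metadata(year="1995", quarter=1), Metadata(year="1995", quarter=2)]
```

Calling `filter_for_complete_metadata(drifted)` raises `ValueError` instead of returning an empty list, so the problem shows up at filtering time rather than in a later missing-files check.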

jdangerx force-pushed the multi-zip-cems branch from 6aa2811 to ee61ab2 Compare December 18, 2023 17:26


jdangerx requested review from e-belfer and cmgosnell December 18, 2023 17:31


jdangerx assigned aesharpe Dec 18, 2023

jdangerx mentioned this pull request Dec 18, 2023

CEMS archiver fails frequently #230

Closed

3 tasks

e-belfer approved these changes Dec 18, 2023


jdangerx force-pushed the stop-discarding-drafts branch from 2ddf0c1 to b2fd5d6 Compare December 19, 2023 18:02

Base automatically changed from stop-discarding-drafts to main December 19, 2023 18:05

e-belfer approved these changes Dec 19, 2023


jdangerx added 3 commits December 19, 2023 13:34

Introduce types for the API response

b442e3e

Put four quarterly files into one annual archive.

318ddaf

PR fixes: partition format, types, and typos

c1d870f

jdangerx force-pushed the multi-zip-cems branch from 46036bb to c1d870f Compare December 19, 2023 18:34

jdangerx enabled auto-merge (squash) December 19, 2023 18:34

jdangerx merged commit 1fa4368 into main Dec 19, 2023
3 checks passed

jdangerx deleted the multi-zip-cems branch December 19, 2023 18:38

e-belfer mentioned this pull request Feb 12, 2024

Add month of data availability into Zenodo archive description #278

Open
