Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Archive quarterly CEMS data in Annual files #237

Merged
merged 3 commits into from
Dec 19, 2023
Merged

Archive quarterly CEMS data in Annual files #237

merged 3 commits into from
Dec 19, 2023

Conversation

jdangerx
Copy link
Member

@jdangerx jdangerx commented Dec 18, 2023

Partial fix for #230. Base branch is stop-discarding-drafts because I wanted those fixes for testing.

This just takes every year's four quarters and puts them into one ZIP file to avoid the 100 file limit in Zenodo.

You can check out the output here.

Here's the resources section of datapackage.json - I decided to format the partitions like {"year": XXXX, "quarters": [1, 2, 3, 4]}.

{
    ...
    "resources": [
        {
            "profile": "data-resource",
            "name": "epacems-1995.zip",
            "path": "https://sandbox.zenodo.org/records/13474/files/epacems-1995.zip",
            "remote_url": "https://sandbox.zenodo.org/records/13474/files/epacems-1995.zip",
            "title": "epacems-1995.zip",
            "parts": {
                "year": 1995,
                "quarter": [
                    1,
                    2,
                    3,
                    4
                ]
            },
            "encoding": "utf-8",
            "mediatype": "application/zip",
            "format": ".zip",
            "bytes": 78118649,
            "hash": "8efa7789255fbfac75ed291d3a41f25b"
        },
        {
            "profile": "data-resource",
            "name": "epacems-1996.zip",
            "path": "https://sandbox.zenodo.org/records/13474/files/epacems-1996.zip",
            "remote_url": "https://sandbox.zenodo.org/records/13474/files/epacems-1996.zip",
            "title": "epacems-1996.zip",
            "parts": {
                "year": 1996,
                "quarter": [
                    1,
                    2,
                    3,
                    4
                ]
            },
            "encoding": "utf-8",
            "mediatype": "application/zip",
            "format": ".zip",
            "bytes": 75959903,
            "hash": "7cba31cfd16e73511831c6547b104381"
        }
    ],
    "created": "2023-12-18 17:02:54.530469",
    "version": "2.0.0"
}

Copy link
Member Author

@jdangerx jdangerx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still getting some wacky type complaints from the get_resources() -> ArchiveAwaitable call, and ended up spending quite some time trying to get the correct amount of await-ing in the right places for the download to work.

I have some hopes and dreams of trying out a refactor that would make this easier, but not sure if that's really the right place to put my time right now.

"""Check if this year is one we are interested in.

Args:
year: the year to check against our self.only_years
"""
if not year:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just dealing with some complaints from the typechecker. I suppose I could move the "if not year" logic to the caller, too, but that would entail either pulling that calling code into a separate function or adding a weird conditional inside a list comprehension, which sucks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because sometimes there is no year partition? Not following the logic here of adding None

@@ -15,6 +20,38 @@
logger = logging.getLogger(f"catalystcoop.{__name__}")


class BulkFile(BaseModel):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't necessary for this change, but found it slightly helpful to document the expected shape of the API responses in code - and made the type annotations more useful.


await self.download_file(url=url, file=file_path, timeout=60 * 14)

with zipfile.ZipFile(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't quite want to mess with how download_file_to_zip worked for all archivers yet, so I inlined this. But I think probably a more useful thing to do is have a couple useful utility functions like "download file (raw)," "download file (zip it)," and "download file (add to an existing archive)" that aren't coupled to the class hierarchy. Out of scope for this PR, I think.

src/pudl_archiver/archivers/epacems.py Show resolved Hide resolved
local_path=archive_path,
partitions={
"year": year,
"quarter": sorted([file.metadata.quarter for file in files]),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmgosnell - this is just the first partition format I came up with, we can change it to suit whatever we want to use downstream. @e-belfer and I chatted about this and think maybe it should be partitions={"year_quarter": sorted([f"{year}q{file.metadata.quarter}" for file in files])} ? What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to this suggestion.

Copy link
Member

@e-belfer e-belfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works as expected, I have a few questions about specific lines where clarification would be helpful though. Only blocking suggestion is to fix the partitions as discussed.

"""Check if this year is one we are interested in.

Args:
year: the year to check against our self.only_years
"""
if not year:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because sometimes there is no year partition? Not following the logic here of adding None

src/pudl_archiver/archivers/classes.py Outdated Show resolved Hide resolved
and (file["metadata"]["dataSubType"] == "Hourly")
and ("quarter" in file["metadata"])
and (self.valid_year(file["metadata"]["year"]))
if (file.metadata.data_type == "Emissions")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so much neater, thank you!

src/pudl_archiver/archivers/epacems.py Outdated Show resolved Hide resolved
src/pudl_archiver/archivers/epacems.py Show resolved Hide resolved
local_path=archive_path,
partitions={
"year": year,
"quarter": sorted([file.metadata.quarter for file in files]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to this suggestion.

Base automatically changed from stop-discarding-drafts to main December 19, 2023 18:05
Copy link
Member

@e-belfer e-belfer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look great, the handling of the metadata types seems much neater with the additional filtering method.

@@ -51,7 +51,7 @@ jobs:
ZENODO_TOKEN_UPLOAD: ${{ secrets.ZENODO_TOKEN_UPLOAD }}
ZENODO_TOKEN_PUBLISH: ${{ secrets.ZENODO_TOKEN_PUBLISH }}
run: |
pudl_archiver --datasets ${{ matrix.dataset }} --summary-file ${{ matrix.dataset }}_run_summary.json
pudl_archiver --datasets ${{ matrix.dataset }} --summary-file ${{ matrix.dataset }}_run_summary.json --sandbox
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll want to revert this before merging, just a flag!

@@ -61,48 +61,55 @@ class EpaCemsArchiver(AbstractDatasetArchiver):
base_url = "https://api.epa.gov/easey/bulk-files/"
parameters = {"api_key": os.environ["EPACEMS_API_KEY"]} # Set to API key

def __filter_for_complete_metadata(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking aloud here - the only issue here could be if the underlying format of the metadata changes (e.g., year becomes string type), and we'd silently lose all the files. But I guess this would be caught in our other validation tests (e.g. check missing files).

@jdangerx jdangerx enabled auto-merge (squash) December 19, 2023 18:34
@jdangerx jdangerx merged commit 1fa4368 into main Dec 19, 2023
3 checks passed
@jdangerx jdangerx deleted the multi-zip-cems branch December 19, 2023 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

3 participants