Archive quarterly CEMS data in Annual files #237
Conversation
Force-pushed from 6aa2811 to ee61ab2
I'm still getting some wacky type complaints from the `get_resources() -> ArchiveAwaitable` call, and ended up spending quite some time trying to get the correct amount of `await`-ing in the right places for the download to work.
I have some hopes and dreams of trying out a refactor that would make this easier, but not sure if that's really the right place to put my time right now.
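The confusing part is that this is an "awaitable of awaitables" pattern: the generator yields un-awaited coroutines, and each one must be awaited separately. A minimal sketch (names like `download_year` are hypothetical stand-ins, not the archiver's actual API):

```python
import asyncio
from typing import AsyncGenerator, Awaitable


async def download_year(year: int) -> str:
    """Hypothetical stand-in for the actual download logic."""
    await asyncio.sleep(0)  # pretend to do I/O
    return f"epacems-{year}.zip"


async def get_resources() -> AsyncGenerator[Awaitable[str], None]:
    """Yield one un-awaited download coroutine per year."""
    for year in (2021, 2022):
        yield download_year(year)


async def main() -> list[str]:
    # Two layers of await-ing are needed: `async for` drives the generator,
    # and `await task` actually runs each yielded download.
    return [await task async for task in get_resources()]


print(asyncio.run(main()))
```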
"""Check if this year is one we are interested in.

Args:
    year: the year to check against our self.only_years
"""
if not year:
Just dealing with some complaints from the typechecker. I suppose I could move the "if not year" logic to the caller, too, but that would entail either pulling that calling code into a separate function or adding a weird conditional inside a list comprehension, which sucks.
Is this because sometimes there is no year partition? I'm not following the logic here of adding `None`.
@@ -15,6 +20,38 @@
logger = logging.getLogger(f"catalystcoop.{__name__}")


class BulkFile(BaseModel):
Wasn't necessary for this change, but I found it slightly helpful to document the expected shape of the API responses in code - and it made the type annotations more useful.
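A sketch of what modeling the bulk-files response with pydantic might look like, assuming pydantic v2; the field names and aliases here are guesses based on the `dataType`/`dataSubType` keys visible in the diff, not the archiver's actual model:

```python
from typing import Optional

from pydantic import BaseModel, Field


class BulkFileMetadata(BaseModel):
    """Hypothetical model of one file's metadata in the API response."""

    data_type: str = Field(alias="dataType")
    data_sub_type: Optional[str] = Field(default=None, alias="dataSubType")
    year: Optional[int] = None
    quarter: Optional[int] = None


class BulkFile(BaseModel):
    """Hypothetical model of one entry in the bulk-files listing."""

    filename: str
    metadata: BulkFileMetadata


raw = {
    "filename": "emissions-hourly-2022-q1.csv",
    "metadata": {"dataType": "Emissions", "dataSubType": "Hourly", "year": 2022, "quarter": 1},
}
file = BulkFile.model_validate(raw)
# Attribute access instead of nested dict lookups, with types checked up front.
print(file.metadata.quarter)
```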
await self.download_file(url=url, file=file_path, timeout=60 * 14)

with zipfile.ZipFile(
I didn't quite want to mess with how `download_file_to_zip` worked for all archivers yet, so I inlined this. But I think a more useful thing to do is have a couple of utility functions like "download file (raw)," "download file (zip it)," and "download file (add to an existing archive)" that aren't coupled to the class hierarchy. Out of scope for this PR, I think.
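The decoupled utilities could compose like this - a sketch only, with made-up names and a faked download in place of real HTTP:

```python
import tempfile
import zipfile
from pathlib import Path


def download_raw(url: str) -> bytes:
    """Hypothetical stand-in for an HTTP download."""
    return f"contents of {url}".encode()


def add_to_archive(archive: Path, name: str, data: bytes) -> None:
    """Append one file to an existing (or new) ZIP archive."""
    with zipfile.ZipFile(archive, "a", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, data)


def download_to_zip(url: str, archive: Path, name: str) -> None:
    """Compose the primitives instead of baking zipping into the class hierarchy."""
    add_to_archive(archive, name, download_raw(url))


archive = Path(tempfile.mkdtemp()) / "epacems-2022.zip"
for quarter in (1, 2, 3, 4):
    download_to_zip(f"https://example.com/2022-q{quarter}.csv", archive, f"2022-q{quarter}.csv")
print(zipfile.ZipFile(archive).namelist())
```

Because the functions aren't methods on the archiver base class, any archiver (or a test) can mix and match them freely.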
local_path=archive_path,
partitions={
    "year": year,
    "quarter": sorted([file.metadata.quarter for file in files]),
@cmgosnell - this is just the first partition format I came up with; we can change it to suit whatever we want to use downstream. @e-belfer and I chatted about this and think maybe it should be `partitions={"year_quarter": sorted([f"{year}q{file.metadata.quarter}" for file in files])}`? What do you think?
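The two candidate formats side by side, using made-up quarter metadata in place of the real `file.metadata.quarter` values:

```python
# Hypothetical illustration of the two partition formats under discussion.
year = 2022
quarters = [3, 1, 2]  # e.g. gathered from file.metadata.quarter, in arbitrary order

# Current format in this PR: year plus a list of quarters.
current = {"year": year, "quarter": sorted(quarters)}

# Suggested alternative: a single combined year_quarter key.
proposed = {"year_quarter": sorted(f"{year}q{q}" for q in quarters)}

print(current)
print(proposed)
```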
+1 to this suggestion.
This works as expected; I have a few questions about specific lines where clarification would be helpful, though. The only blocking suggestion is to fix the partitions as discussed.
and (file["metadata"]["dataSubType"] == "Hourly")
and ("quarter" in file["metadata"])
and (self.valid_year(file["metadata"]["year"]))
if (file.metadata.data_type == "Emissions")
so much neater, thank you!
Force-pushed from 2ddf0c1 to b2fd5d6
Changes look great, the handling of the metadata types seems much neater with the additional filtering method.
.github/workflows/run-archiver.yml
Outdated
@@ -51,7 +51,7 @@ jobs:
    ZENODO_TOKEN_UPLOAD: ${{ secrets.ZENODO_TOKEN_UPLOAD }}
    ZENODO_TOKEN_PUBLISH: ${{ secrets.ZENODO_TOKEN_PUBLISH }}
    run: |
-     pudl_archiver --datasets ${{ matrix.dataset }} --summary-file ${{ matrix.dataset }}_run_summary.json
+     pudl_archiver --datasets ${{ matrix.dataset }} --summary-file ${{ matrix.dataset }}_run_summary.json --sandbox
We'll want to revert this before merging, just a flag!
@@ -61,48 +61,55 @@ class EpaCemsArchiver(AbstractDatasetArchiver):
base_url = "https://api.epa.gov/easey/bulk-files/"
parameters = {"api_key": os.environ["EPACEMS_API_KEY"]}  # Set to API key


def __filter_for_complete_metadata(
Thinking aloud here - the only issue could be if the underlying format of the metadata changes (e.g., `year` becomes a string type), in which case we'd silently lose all the files. But I guess this would be caught by our other validation tests (e.g., the check for missing files).
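One way to make that failure mode loud instead of silent is to log every record that fails validation, so a wholesale format change shows up in the run logs rather than as a mysteriously empty archive. A sketch under assumed names (assuming pydantic v2; `Metadata` and `filter_for_complete_metadata` are illustrative, not the archiver's real code):

```python
import logging

from pydantic import BaseModel, ValidationError

logger = logging.getLogger(f"catalystcoop.{__name__}")


class Metadata(BaseModel):
    """Hypothetical minimal metadata model."""

    year: int
    quarter: int


def filter_for_complete_metadata(records: list[dict]) -> list[Metadata]:
    """Keep records with valid metadata, but log what gets dropped."""
    kept = []
    for record in records:
        try:
            kept.append(Metadata.model_validate(record))
        except ValidationError:
            logger.warning("Dropping record with unexpected metadata: %s", record)
    return kept


valid = filter_for_complete_metadata([{"year": 2022, "quarter": 1}, {"year": "n/a"}])
print(len(valid))
```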
Force-pushed from 46036bb to c1d870f
Partial fix for #230. Base branch is `stop-discarding-drafts` because I wanted those fixes for testing. This just takes every year's four quarters and puts them into one ZIP file to avoid the 100-file limit in Zenodo. You can check out the output here. Here's the `resources` section of `datapackage.json` - I decided to format the partitions like `{"year": XXXX, "quarters": [1, 2, 3, 4]}`.
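The core idea of the PR can be sketched in a few lines: bundle a year's four quarterly files into one annual ZIP, so an archive spanning many years stays well under Zenodo's 100-file limit. File names and contents here are made up:

```python
import io
import zipfile


def bundle_year(quarterly_files: dict[str, bytes]) -> bytes:
    """Pack the given quarterly files into a single in-memory annual ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in quarterly_files.items():
            zf.writestr(name, data)
    return buffer.getvalue()


# Four quarterly files collapse into one archive member per year.
quarters = {f"epacems-2022-q{q}.csv": b"col1,col2\n" for q in (1, 2, 3, 4)}
annual_zip = bundle_year(quarters)
print(zipfile.ZipFile(io.BytesIO(annual_zip)).namelist())
```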