Standardize filenames to be similar to other datasets
All of the other archivers use a file naming pattern like:

dataset-part1-part2.ext

where:
- dataset is the name of the dataset (e.g. ferc1, eia923, mshamines)
- part1 is the value of the first partition dimension contained in
  the file (e.g. ca, ny, or wy for a state; 2019 for a year)
- part2 is the value of the second partition dimension contained in
  the file, if applicable.
- ext is the file name extension indicating file type.
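
As a rough illustration of the pattern described above, the sketch below assembles a name from a dataset and its partition values. The helper `archive_filename` and the example partition values are hypothetical and are not part of pudl_archiver; they only show how the pieces are joined.

```python
def archive_filename(dataset: str, *partitions: object, ext: str) -> str:
    """Build a name of the form dataset-part1-part2.ext.

    Hypothetical helper for illustration only; not part of pudl_archiver.
    """
    # Partition values may be strings (e.g. a state) or ints (e.g. a year).
    parts = [dataset, *(str(p) for p in partitions)]
    # Accept the extension with or without a leading dot.
    return "-".join(parts) + "." + ext.lstrip(".")


# One partition dimension (a year).
print(archive_filename("eia923", 2019, ext="zip"))
# Two partition dimensions (made-up values, purely to show the shape).
print(archive_filename("exampledata", "ca", 2019, ext="csv"))
```

The second partition is optional, so the same helper covers datasets with only one partition dimension.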

I've changed the naming for phmsagas to follow the `dataset-*` part of
the naming convention, but kept the rest of the original filename
rather than reducing it to the dataset and start year. Each file
contains several years of data, so naming files by start year alone
would make it look like years are missing from the archive.

The dataset and start_year partitions are still used in datapackage.json
to refer to the individual resources for programmatic purposes.
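
To make the relationship between the archived filename and the partitions concrete, here is a minimal standalone sketch that reuses the string handling from the changed lines in this commit. The function `describe_phmsagas_file` is hypothetical, and the default archiver name `"phmsagas"` is an assumption standing in for `self.name`.

```python
def describe_phmsagas_file(raw_name: str, archiver_name: str = "phmsagas") -> dict:
    """Derive the archived filename and partitions from a PHMSA file name.

    Hypothetical helper; archiver_name stands in for self.name (assumed
    here to be "phmsagas").
    """
    # Normalize hyphens to underscores, as the archiver does.
    filename = raw_name.replace("-", "_")
    # Everything before the final two tokens names the dataset.
    dataset = "_".join(filename.lower().split("_")[0:-2])
    # The second-to-last token is the start year.
    start_year = int(filename.split("_")[-2])
    return {
        "filename": f"{archiver_name}-{filename}.zip",
        "partitions": {"dataset": dataset, "start_year": start_year},
    }


info = describe_phmsagas_file("annual_underground_natural_gas_storage_2017_present")
print(info["filename"])
print(info["partitions"])
```

The archived name keeps the full original filename, including the year range, while the partitions carry only the dataset and start year.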

This came up when working on #79
zaneselvans committed Feb 28, 2023
1 parent f75ef87 commit 019d5d7
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions src/pudl_archiver/archivers/phmsagas.py
@@ -66,18 +66,18 @@ async def get_zip_resource(
         For example: annual_underground_natural_gas_storage_2017_present.zip
         """
         url = f"https://www.phmsa.dot.gov/{link}"
-        file = str(match.group(1)).replace("-", "_")  # Get file name
+        filename = str(match.group(1)).replace("-", "_")  # Get file name

         # Set dataset partition
-        dataset = "_".join(file.lower().split("_")[0:-2])
+        dataset = "_".join(filename.lower().split("_")[0:-2])

         if dataset not in PHMSA_DATASETS:
             logger.warning(f"New dataset type found: {dataset}.")

         # Set start year
-        start_year = int(file.split("_")[-2])
+        start_year = int(filename.split("_")[-2])

-        download_path = self.download_directory / f"phmsagas_{file}.zip"
+        download_path = self.download_directory / f"{self.name}-{filename}.zip"
         await self.download_zipfile(url, download_path)

         return ResourceInfo(