Suggestion for directory structure #65

Closed
peterdesmet opened this issue Apr 15, 2022 · 13 comments

Comments

@peterdesmet
Member

peterdesmet commented Apr 15, 2022

Update: consensus for the structure suggested in #65 (comment)

Current

The vp data in the ENRAM data repository are currently organized as:

# hdf5 vp files organized in hour directories
be/jab/2020/05/07/01/bejab_vp_20200507T010000Z_0x9.h5
be/jab/2020/05/07/01/bejab_vp_20200507T010500Z_0x9.h5
be/jab/2020/05/07/01/... # 60/5 = 12 files per directory
be/jab/2020/05/07/01/bejab_vp_20200507T015500Z_0x9.h5

# zipped directories with all hdf5 vp data for that month
be/jab/2020/bejab202005.zip

Design principles

  1. We still want to store the individual hdf5 files. That is how they are produced by vol2bird.
  2. Sources of the vp data (BALTRAD, flyway, meteo offices) should be kept separate, so a) data pipelines don't conflict with each other and b) the user can choose.
  3. Within a source, the best possible data is kept, e.g. when BALTRAD data is reprocessed with improved vol2bird settings, those data should overwrite existing (lower quality) data from BALTRAD.
  4. We don't want the zipped directories of data anymore. They speed up downloads, but not analysis. Rather, we want to offer the vpts data in a tabular format (Flat table format for VPTS vpts-csv#25). This data product will be used by:
  • bioRad for analysis: data should be bulked and easy to download (e.g. zipped radar year).
  • CROW for visualization: data should be chunked and be read directly (e.g. unzipped radar date)
  5. The directory structure of the repository should be easy to navigate (especially for the data products), without too many subdirectories, so it can be understood by humans and by tools such as the bioRad functions that download data.

Proposal

1. radar/yyyy

# source/format/radar/yyyy/

# original hdf5 vp files
baltrad/h5/bejab/2020/bejab_vp_20200101T000000Z_0x9.h5
baltrad/h5/bejab/2020/bejab_vp_20200101T000500Z_0x9.h5
baltrad/h5/bejab/2020/... # 60/5*24*365 = 105.120 files per directory
baltrad/h5/bejab/2020/bejab_vp_20201231T235500Z_0x9.h5

# tabular data products (zipped and unzipped)
baltrad/csv/bejab/2020/
baltrad/csv/bejab/2020/bejab_vpts_2020.csv.gz
baltrad/csv/bejab/2020/bejab_vpts_20200101.csv
baltrad/csv/bejab/2020/bejab_vpts_20200102.csv
baltrad/csv/bejab/2020/... # 365 files per directory
baltrad/csv/bejab/2020/bejab_vpts_20201231.csv
  • RMI data repository uses the same radar/yyyy structure (and filename convention for vpts).
  • The US data (Organize files under new folder structure (similar to US) #54) are organized as yyyy/mm/dd/ (files for all radars). The BALTRAD PVOL archive uses yyyy/mm/dd/HH/MM/ (files for all radars). Although organizing by year first has some benefits, the lack of a radar directory makes it hard for tools to find data for a specific radar, which is almost always part of the query.

2. yyyy/radar

A valid alternative is to swap the radar and year levels:

# source/format/yyyy/radar/

# original hdf5 vp files
baltrad/h5/2020/bejab/bejab_vp_20200101T000000Z_0x9.h5
baltrad/h5/2020/bejab/bejab_vp_20200101T000500Z_0x9.h5
baltrad/h5/2020/bejab/... # 60/5*24*365 = 105.120 files per directory
baltrad/h5/2020/bejab/bejab_vp_20201231T235500Z_0x9.h5

# tabular data products (zipped and unzipped)
baltrad/csv/2020/bejab/
baltrad/csv/2020/bejab/bejab_vpts_2020.csv.gz
baltrad/csv/2020/bejab/bejab_vpts_20200101.csv
baltrad/csv/2020/bejab/bejab_vpts_20200102.csv
baltrad/csv/2020/bejab/... # 365 files per directory
baltrad/csv/2020/bejab/bejab_vpts_20201231.csv
@peterdesmet
Member Author

@adokter @bart1 @niconoe @CeciliaNilsson709 @baptischmi @Rafnuss Feedback welcome on the above proposal.

@CeciliaNilsson709
Collaborator

I think I slightly prefer # source/format/radar/yyyy/, but either way it will fit some users and not others. Tools will have to be adapted between the US structure and this anyway even if we put year first too.

@niconoe

niconoe commented Apr 19, 2022

No strong preference between the two proposals. I agree with all the design principles and think we should stick to one of the two proposals.

I'm not a fan of approaches that use too many levels, such as yyyy/mm/dd/HH/MM/: in my experience they add little value but require a lot of heavy path manipulation in tools.

@adokter
Contributor

adokter commented Apr 19, 2022

Hi @peterdesmet, here are a few thoughts:

  1. I've found that directory structure matters a lot in terms of how quickly you can fetch the data from AWS, and you want to avoid too many files in a single directory as it really slows downloads and searches for available files. Therefore, for h5 I would add a deeper directory structure than for csv.

  2. My preference is a directory structure with date before radar. The reason is that this better suits full-network analyses, in which you want to be able to efficiently grab all the available radar data for a given date or period. Such downloads are much faster when date is up front in the tree. In fact I have processed vp data into radar/date order in the past, and I've come to regret it because of how slow data retrieval from s3 becomes.

  3. A main downside of a date/radar ordering is that downloading multiple years for a single radar requires more cycling through the tree. But I've found that is typically fairly easy, because the set of potentially available dates to search for is well defined (while the set of potentially available radars is not). I suspect these considerations also led NEXRAD to order date before radar, see https://s3.amazonaws.com/noaa-nexrad-level2/index.html

  4. Taking these together, for h5 I would recommend source/format/yyyy/mm/dd/radar, and for csv source/format/yyyy/radar

  5. Why daily csv files and not monthly? I find monthly files a nice compromise between file size and having a substantial period in the file. Daily files are very short and you end up with many, while year files get very big.
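The trade-off in points 2 and 3 can be made concrete by counting how many directory prefixes a client has to list for a full-network query. The following is only a sketch: the function names and the `h5` format directory are illustrative assumptions, and it counts prefixes only; it says nothing about actual S3 latency.

```python
from datetime import date, timedelta


def day_range(start: date, end: date):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def date_first_prefixes(source: str, start: date, end: date) -> list[str]:
    """Prefixes to list under source/h5/yyyy/mm/dd/radar/ (hypothetical layout).

    One prefix per day covers every radar, known or unknown.
    """
    return [f"{source}/h5/{d:%Y/%m/%d}/" for d in day_range(start, end)]


def radar_first_prefixes(
    source: str, radars: list[str], start: date, end: date
) -> list[str]:
    """Prefixes to list under source/h5/radar/yyyy/mm/dd/ (hypothetical layout).

    The set of radars has to be known (or discovered) before listing.
    """
    return [
        f"{source}/h5/{radar}/{d:%Y/%m/%d}/"
        for radar in radars
        for d in day_range(start, end)
    ]
```

For a three-day window, the date-first layout needs three prefixes regardless of network size, while the radar-first layout needs three per radar. The reverse holds for a single-radar, multi-year query, which is the downside mentioned in point 3.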

@baptischmi

Thank you @peterdesmet

I support @adokter's suggestion, although # source/format/radar/yyyy/ makes it possible to quickly get the temporal coverage of a radar (which is not a given in Europe).

As mentioned by @niconoe, it's good to have only a few levels (as proposed here, in contrast to yyyy/mm/dd/HH/MM/) for monthly csv tables, but as mentioned by @adokter maybe not for single h5 files, since having too many files in one directory slows down searching and downloading.

@bart1

bart1 commented Apr 19, 2022

@peterdesmet thanks for the explanation. For clarity I will first describe what we use at UvA to deal with different projects and data. This is a somewhat similar problem to the pipeline issue. As we deal with both pvols and vps, and both can have their own differences, we have a two-tiered system for pvols and a three-tiered one for vps.
The structure is:
project/pvol_settings for pvols and project/pvol_settings/vp_settings for vp. Each vp in this structure can be easily referred back to a pvol. We quite regularly end up exploring different settings for both constructing the pvols and vp. It might be worth considering that vp can be calculated with different settings. @BerendWijers do you have anything to add to this?

On the order of year and radar I do prefer radar first as it allows for a quick overview of what time period a radar covers. Although it is hard to know how much of that is just the habit of always having it like that.

@peterdesmet
Member Author

peterdesmet commented Apr 19, 2022

Thanks for the suggestions all!

@adokter: Therefore, for h5 I would add a deeper directory structure than for csv.

Agreed, that is a worry I had too.

@adokter My preference is a directory structure with date before radar. The reason is this better suits full-network analyses, in which you want to be able to efficiently grab all the available radar data for a given date or period.
vs
@baptischmi @bart1 ... quick overview of what time period a radar covers ...

Even though Europe probably lends itself less to full-network analyses than the US, I can see how efficiently sampling a time period will probably always be part of an analysis. An overview per radar can likely be provided as summary data, cf. the coverage.csv file we currently have. So I'd support putting temporal information first.
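A per-radar overview of this kind does not depend on the directory order, since it can be derived from the filenames alone. Below is a minimal sketch, assuming keys end in filenames of the form `radar_vp_yyyymmddTHHMMSSZ_suffix.h5` as in the listings; the `coverage` function and its output shape are hypothetical and not the format of the existing coverage.csv.

```python
import re
from datetime import date, datetime

# Assumed filename pattern, e.g. bejab_vp_20200507T010000Z_0x9.h5
VP_NAME = re.compile(r"^(?P<radar>[^_]+)_vp_(?P<ts>\d{8}T\d{6}Z)_")


def coverage(keys: list[str]) -> dict[str, tuple[date, date, int]]:
    """Return {radar: (first_date, last_date, n_files)} for a list of keys."""
    summary: dict[str, tuple[date, date, int]] = {}
    for key in keys:
        filename = key.rsplit("/", 1)[-1]  # ignore whatever directories precede it
        match = VP_NAME.match(filename)
        if match is None:
            continue  # not a vp file (e.g. a zip archive)
        day = datetime.strptime(match["ts"], "%Y%m%dT%H%M%SZ").date()
        first, last, count = summary.get(match["radar"], (day, day, 0))
        summary[match["radar"]] = (min(first, day), max(last, day), count + 1)
    return summary
```

Because only the last path component is inspected, the same function works on keys from any of the proposed layouts.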

@adokter Why daily csv files and not monthly?

For CROW small files are better (since it's all done in browser); we currently use daily files. For bioRad we could produce monthly files over yearly files if that is more convenient? If we do, I would introduce month directories.

@bart1 We quite regularly end up exploring different settings for both constructing the pvols and vp.

That makes sense for UvA, but for the data repository I hope to provide a consensus view, where the best possible data is given for a certain source. We already add the complexity of choosing between different sources, I'd like to avoid adding the complexity of having to choose between different processing too.

3. yyyy/mm/radar

Given that monthly files might be more convenient and we don't want too deep a file structure, we could use month directories for all? Or would you prefer to keep the day as part of the path?

# source/format/yyyy/mm/radar/

# original hdf5 vp files
baltrad/h5/2020/01/bejab/bejab_vp_20200101T000000Z_0x9.h5
baltrad/h5/2020/01/bejab/bejab_vp_20200101T000500Z_0x9.h5
baltrad/h5/2020/01/bejab/... # 60/5*24*31 = 8.928 files per directory
baltrad/h5/2020/01/bejab/bejab_vp_20200131T235500Z_0x9.h5 # last file for that month
baltrad/h5/2020/01/bewid/
baltrad/h5/2020/02/behel/

# tabular data products (zipped and unzipped)
baltrad/csv/2020/01/bejab/
baltrad/csv/2020/01/bejab/bejab_vpts_202001.csv.gz
baltrad/csv/2020/01/bejab/bejab_vpts_20200101.csv
baltrad/csv/2020/01/bejab/bejab_vpts_20200102.csv
baltrad/csv/2020/01/bejab/... # 31 files per directory
baltrad/csv/2020/01/bejab/bejab_vpts_20200131.csv

@peterdesmet
Member Author

4. yyyy/mm/dd/radar

To more easily compare, here's the structure suggested by @adokter. It mimics the US structure (for the h5 data).

# source/format/yyyy/mm/dd/radar/

# original hdf5 vp files
baltrad/h5/2020/01/01/bejab/bejab_vp_20200101T000000Z_0x9.h5
baltrad/h5/2020/01/01/bejab/bejab_vp_20200101T000500Z_0x9.h5
baltrad/h5/2020/01/01/bejab/... # 60/5*24 = 288 files per directory
baltrad/h5/2020/01/01/bejab/bejab_vp_20200101T235500Z_0x9.h5 # last file for that day
baltrad/h5/2020/01/01/bewid/
baltrad/h5/2020/02/01/behel/

# tabular data products (zipped and unzipped)
baltrad/csv/2020/bejab/
baltrad/csv/2020/bejab/bejab_vpts_202001.csv.gz
baltrad/csv/2020/bejab/... # 12 zipped files per directory
baltrad/csv/2020/bejab/bejab_vpts_202012.csv.gz
baltrad/csv/2020/bejab/bejab_vpts_20200101.csv
baltrad/csv/2020/bejab/bejab_vpts_20200102.csv
baltrad/csv/2020/bejab/... # 365 unzipped files per directory
baltrad/csv/2020/bejab/bejab_vpts_20201231.csv

Personally I think that structure works quite well and is better than the month directories I suggested above.
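The files-per-directory figures quoted in the proposals follow directly from the 5-minute file interval and the number of days a directory spans. A small sketch of that arithmetic (the function names are illustrative, and a constant 5-minute interval without gaps is assumed):

```python
import calendar

INTERVAL_MINUTES = 5  # one vp file every 5 minutes, as in the listings


def files_per_day(interval_minutes: int = INTERVAL_MINUTES) -> int:
    """Number of files in a day directory (yyyy/mm/dd)."""
    return 24 * 60 // interval_minutes


def files_per_month(year: int, month: int) -> int:
    """Number of files in a month directory (yyyy/mm)."""
    days_in_month = calendar.monthrange(year, month)[1]
    return days_in_month * files_per_day()


def files_per_year(year: int) -> int:
    """Number of files in a year directory (yyyy)."""
    days_in_year = 366 if calendar.isleap(year) else 365
    return days_in_year * files_per_day()
```

A day directory holds 288 files, January holds 8928, and a 365-day year holds 105120. Since 2020 is a leap year, a complete 2020 year directory would hold 105408.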

@bart1

bart1 commented Apr 20, 2022

# source/format/yyyy/mm/dd/radar/

# original hdf5 vp files
baltrad/h5/2020/01/01/bejab/bejab_vp_20200101T000000Z_0x9.h5
baltrad/h5/2020/01/01/bejab/bejab_vp_20200101T000500Z_0x9.h5
baltrad/h5/2020/01/01/bejab/... # 60/5*24 = 288 files per directory
baltrad/h5/2020/01/01/bejab/bejab_vp_20200101T235500Z_0x9.h5 # last file for that day
baltrad/h5/2020/01/01/bewid/
baltrad/h5/2020/02/01/behel/

# tabular data products (zipped and unzipped)
baltrad/csv/2020/bejab/
baltrad/csv/2020/bejab/bejab_vpts_202001.csv.gz
baltrad/csv/2020/bejab/... # 12 zipped files per directory
baltrad/csv/2020/bejab/bejab_vpts_202012.csv.gz
baltrad/csv/2020/bejab/bejab_vpts_20200101.csv
baltrad/csv/2020/bejab/bejab_vpts_20200102.csv
baltrad/csv/2020/bejab/... # 365 unzipped files per directory
baltrad/csv/2020/bejab/bejab_vpts_20201231.csv

I have been thinking a bit more about this

With this structure, checking data availability for long time series in the original vp data gets slightly more difficult, as many directories need to be searched, especially since radars or data streams might go on and off somewhat frequently (leading to missing days). As most people might be accessing data through the csv, this is maybe not so much of an issue. But for data quality checking I tend to first go by radar.

This might be a wider argument for putting radar first: the first step of an analysis, at least for European data, is frequently a quality check, including a check of whether the quality changed over time. A structure with radar first might facilitate this.

Also, I imagine that people are more likely to do analyses with a limited geographic scope than with a limited temporal scope. Here I'm thinking of, for example, ecological consultants interested in a region. I guess people wanting to do analyses on a large geographic scope also want a long time series and are likely more technically savvy.

@peterdesmet
Member Author

peterdesmet commented Apr 20, 2022

Thanks for the input @bart1, I agree with your arguments. The European data are not homogenous in quality or coverage like the US data, but very radar (country) dependent. So it makes sense to be able to select on radar up front. And as @CeciliaNilsson709 mentions, we'll likely require different functions anyway to query US vs EU data, so aligning is only partly useful. In any case, I don't think there is a right decision here, we just need to make one.

So, suggestion 5. 👍, 👎, feedback welcome.

5. radar/yyyy/mm/dd

# source/format/radar/yyyy/mm/dd/

# original hdf5 vp files: same as original proposal but deeper hierarchy
baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T000000Z_0x9.h5
baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T000500Z_0x9.h5
baltrad/hdf5/bejab/2020/01/01/... # 60/5*24 = 288 files per directory
baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T235500Z_0x9.h5
baltrad/hdf5/bewid/
baltrad/hdf5/behel/
baltrad/hdf5/...

# daily csv
baltrad/csv-daily/bejab/2020/
baltrad/csv-daily/bejab/2020/bejab_vpts_202012.csv.gz
baltrad/csv-daily/bejab/2020/bejab_vpts_20200101.csv
baltrad/csv-daily/bejab/2020/bejab_vpts_20200102.csv
baltrad/csv-daily/bejab/2020/... # 365 unzipped files per directory
baltrad/csv-daily/bejab/2020/bejab_vpts_20201231.csv

# monthly csv
baltrad/csv-monthly/bejab/2020/bejab_vpts_202001.csv.gz
baltrad/csv-monthly/bejab/2020/... # 12 zipped files per directory

@niconoe

niconoe commented Apr 21, 2022

From my point of view (of someone who would write code accessing the repository - but not explore it by hand), I'm not a huge fan of all those subdirectories (radar/yyyy/mm/dd), since they just repeat data that's already in the filename (bejab_vp_20200101T235500Z_0x9.h5) and add a lot of verbose and error-prone path manipulations.

Actually, from that point of view, having almost all files in a single flat directory would be perfectly fine (since the filename already provides all the metadata). That would also circumvent the discussion about which subdirectory (radar or year) should be at the highest level.
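To illustrate the point that the filename already carries the metadata, here is a minimal parsing sketch. The pattern is inferred from the example filename only, and the `VpFile` type, the `suffix` field name and `parse_vp_filename` are hypothetical names introduced for illustration.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple


class VpFile(NamedTuple):
    radar: str           # e.g. "bejab"
    timestamp: datetime  # UTC, taken from the trailing "Z" in the filename
    suffix: str          # trailing part before the extension, e.g. "0x9"


# Pattern inferred from e.g. bejab_vp_20200101T235500Z_0x9.h5 (assumption)
PATTERN = re.compile(
    r"^(?P<radar>[^_]+)_vp_(?P<ts>\d{8}T\d{6}Z)_(?P<suffix>[^.]+)\.h5$"
)


def parse_vp_filename(filename: str) -> VpFile:
    """Extract radar, timestamp and suffix from a vp filename."""
    match = PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Not a recognised vp filename: {filename!r}")
    timestamp = datetime.strptime(match["ts"], "%Y%m%dT%H%M%SZ").replace(
        tzinfo=timezone.utc
    )
    return VpFile(match["radar"], timestamp, match["suffix"])
```

With a parser like this, a tool can filter a flat listing by radar or date without looking at any directory names.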

Now:

  • I understand that this approach is probably too radical and less suited for human exploration of the repository
  • It's still okay to deal with all those subdirectories if that's your choice, just less handy :)
  • That brings up another question: how will the repository be accessed? Web interface? Directly through S3? Both? That might impact what we do (in some cases, it might make sense to store the files flat on the system, but to have the interface show them to users in "virtual" subdirectories)

@BerendWijers

On our S3 server @ UvA I've steered away from a flat structure. I feel the same about the duplication of information in path and filename. However, I did notice quite a performance hit storing everything together in a single bucket when you want to, for example, provide an overview of unique radars, unique radar years, etc.

@peterdesmet
Member Author

Thanks all for the input! There is a consensus on the structure suggested in #65 (comment)

  • source/format/radar/yyyy/mm/dd/ for hdf5
  • source/format/radar/yyyy/ for csv data products.
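As a rough sketch of what the agreed layout means for tooling, the helpers below build keys for both cases. The format directory names (`hdf5`, `csv-daily`, `csv-monthly`) and filenames are taken from the listing in suggestion 5; the function names are hypothetical and the final repository may differ.

```python
from datetime import date


def hdf5_key(source: str, filename: str) -> str:
    """Build source/hdf5/radar/yyyy/mm/dd/filename from a vp filename.

    Assumes filenames like bejab_vp_20200101T000000Z_0x9.h5.
    """
    radar, _, stamp = filename.split("_")[:3]
    return f"{source}/hdf5/{radar}/{stamp[0:4]}/{stamp[4:6]}/{stamp[6:8]}/{filename}"


def daily_csv_key(source: str, radar: str, day: date) -> str:
    """Build source/csv-daily/radar/yyyy/radar_vpts_yyyymmdd.csv."""
    return f"{source}/csv-daily/{radar}/{day:%Y}/{radar}_vpts_{day:%Y%m%d}.csv"


def monthly_csv_key(source: str, radar: str, year: int, month: int) -> str:
    """Build source/csv-monthly/radar/yyyy/radar_vpts_yyyymm.csv.gz."""
    return (
        f"{source}/csv-monthly/{radar}/{year:04d}/"
        f"{radar}_vpts_{year:04d}{month:02d}.csv.gz"
    )
```

For example, `hdf5_key("baltrad", "bejab_vp_20200101T000000Z_0x9.h5")` gives `baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T000000Z_0x9.h5`, matching the first line of the hdf5 listing in suggestion 5.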
