Suggestion for directory structure #65
Comments
@adokter @bart1 @niconoe @CeciliaNilsson709 @baptischmi @Rafnuss Feedback welcome on the above proposal. |
I think I slightly prefer |
No strong preference between the two proposals. I agree with all the design principles and think we should stick to one of the two proposals. I'm not a fan of the approaches that use too many levels such as |
Hi @peterdesmet, here a few thoughts:
|
Thank you @peterdesmet. I'll support @adokter's suggestion, although # source/format/radar/yyyy/ allows one to quickly get the temporal coverage of a radar (which is not granted in Europe). As mentioned by @niconoe, it's good to have only a few levels (as proposed here, in contrast to yyyy/mm/dd/HH/MM/) for monthly csv tables, but as mentioned by @adokter, maybe not for single h5 files, since having too many files slows the search and download of files. |
@peterdesmet thanks for the explanation. For clarity, I will first describe what we use at UvA to deal with different projects and data. This is a somewhat similar problem to the pipeline issue. As we deal with both pvols and vps, and both can have their own differences, we have a two-tiered system for pvols and a three-tiered one for vps. On the order of year and radar, I prefer radar first, as it allows for a quick overview of the time period a radar covers, although it is hard to know how much of that is just the habit of always having it like that. |
Thanks for the suggestions all!
Agreed, that is a worry I had too.
Even though Europe probably lends itself less to full-network analyses than the US, I can see how efficiently sampling a time period will probably always be part of an analysis. An overview per radar can likely be provided as summary data, cf. the
For CROW small files are better (since it's all done in browser); we currently use daily files. For bioRad we could produce monthly files over yearly files if that is more convenient? If we do, I would introduce month directories.
That makes sense for UvA, but for the data repository I hope to provide a consensus view, where the best possible data is given for a certain source. We already add the complexity of choosing between different sources; I'd like to avoid adding the complexity of having to choose between different processing too.
3. yyyy/mm/radar
Given that monthly files might be more convenient and we don't want too deep a file structure, we could use
# source/format/yyyy/mm/radar/
# original hdf5 vp files
baltrad/h5/2020/01/bejab/bejab_vp_20200101T000000Z_0x9.h5
baltrad/h5/2020/01/bejab/bejab_vp_20200101T000500Z_0x9.h5
baltrad/h5/2020/01/bejab/... # 60/5*24*31 = 8,928 files per directory
baltrad/h5/2020/01/bejab/bejab_vp_20200131T235500Z_0x9.h5 # last file for that month
baltrad/h5/2020/01/bewid/
baltrad/h5/2020/02/behel/
# tabular data products (zipped and unzipped)
baltrad/csv/2020/01/bejab/
baltrad/csv/2020/01/bejab/bejab_vpts_202001.csv.gz
baltrad/csv/2020/01/bejab/bejab_vpts_20200101.csv
baltrad/csv/2020/01/bejab/bejab_vpts_20200102.csv
baltrad/csv/2020/01/bejab/... # 31 files per directory
baltrad/csv/2020/01/bejab/bejab_vpts_20200131.csv |
4. yyyy/mm/dd/radar
To more easily compare, here's the structure suggested by @adokter. It mimics the US structure (for the h5 data).
# source/format/yyyy/mm/dd/radar/
# original hdf5 vp files
baltrad/h5/2020/01/01/bejab/bejab_vp_20200101T000000Z_0x9.h5
baltrad/h5/2020/01/01/bejab/bejab_vp_20200101T000500Z_0x9.h5
baltrad/h5/2020/01/01/bejab/... # 60/5*24 = 288 files per directory
baltrad/h5/2020/01/01/bejab/bejab_vp_20200101T235500Z_0x9.h5 # last file for that day
baltrad/h5/2020/01/01/bewid/
baltrad/h5/2020/02/01/behel/
# tabular data products (zipped and unzipped)
baltrad/csv/2020/bejab/
baltrad/csv/2020/bejab/bejab_vpts_202001.csv.gz
baltrad/csv/2020/bejab/... # 12 zipped files per directory
baltrad/csv/2020/bejab/bejab_vpts_202012.csv.gz
baltrad/csv/2020/bejab/bejab_vpts_20200101.csv
baltrad/csv/2020/bejab/bejab_vpts_20200102.csv
baltrad/csv/2020/bejab/... # 365 unzipped files per directory
baltrad/csv/2020/bejab/bejab_vpts_20200131.csv
Personally I think that structure works quite well and is better than the month directories I suggested above. |
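A minimal sketch of the files-per-directory arithmetic behind the month and day layouts, assuming one vp file every 5 minutes (as the file names in the listings suggest) and no gaps in the data stream. The function name is made up for illustration.

```python
def files_per_directory(interval_minutes: int, days: int) -> int:
    """Number of vp files in a directory covering `days` days, assuming no gaps."""
    files_per_hour = 60 // interval_minutes
    return files_per_hour * 24 * days


# One directory per day (yyyy/mm/dd/radar) versus one per 31-day month (yyyy/mm/radar).
daily = files_per_directory(5, 1)
monthly = files_per_directory(5, 31)
print(daily, monthly)
```

The day layout keeps each directory around a few hundred files, at the cost of one extra directory level.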
I have been thinking a bit more about this. With this structure, checking data availability for long time series in the original vp data gets slightly more difficult, as many directories need to be searched, especially as radars/data streams might switch on and off somewhat frequently (leading to missing days). As most people might be accessing data through the csv files, this is maybe not so much of an issue. But for data quality checking I tend to first go by radar. This might be a wider argument for putting radar first: the first step of an analysis, at least for European data, is frequently a quality check, including checking whether the quality changed over time. A structure with radar first might facilitate this. I also imagine that people are more likely to do analyses with a limited geographic scope than with a limited temporal scope. Here I'm thinking of, for example, ecological consultants interested in a region. The people wanting to do analyses on a large geographic scope will, I guess, also want a long time series and are likely more technically savvy. |
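To illustrate the availability check described above, here is a rough sketch that assumes a radar-first layout of the form source/format/radar/yyyy/mm/dd/. The directory list is hypothetical (in practice it would come from a file system or bucket listing), and `missing_days` is an illustrative helper, not part of any existing tool.

```python
from datetime import date, timedelta


def missing_days(present_dirs, radar, start, end):
    """Return the dates in [start, end] that have no radar/yyyy/mm/dd directory."""
    present = set()
    for path in present_dirs:
        parts = path.strip("/").split("/")
        # Expected tail of each path: radar/yyyy/mm/dd
        if len(parts) >= 4 and parts[-4] == radar:
            year, month, day = parts[-3:]
            present.add(date(int(year), int(month), int(day)))
    n_days = (end - start).days + 1
    all_days = (start + timedelta(days=i) for i in range(n_days))
    return [d for d in all_days if d not in present]


# Hypothetical listing: bejab has no directory for 2020-01-03.
dirs = [
    "baltrad/hdf5/bejab/2020/01/01/",
    "baltrad/hdf5/bejab/2020/01/02/",
    "baltrad/hdf5/bejab/2020/01/04/",
    "baltrad/hdf5/bewid/2020/01/03/",
]
gaps = missing_days(dirs, "bejab", date(2020, 1, 1), date(2020, 1, 4))
print(gaps)
```

With radar as the first level, the listing for one radar can be fetched from a single prefix; with date first, the same check has to visit every day directory across all radars.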
Thanks for the input @bart1, I agree with your arguments. The European data are not homogenous in quality or coverage like the US data, but very radar (country) dependent. So it makes sense to be able to select on radar up front. And as @CeciliaNilsson709 mentions, we'll likely require different functions anyway to query US vs EU data, so aligning is only partly useful. In any case, I don't think there is a right decision here, we just need to make one. So, suggestion 5. 👍, 👎, feedback welcome.
5. radar/yyyy/mm/dd
# source/format/radar/yyyy/mm/dd/
# original hdf5 vp files: same as original proposal but deeper hierarchy
baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T000000Z_0x9.h5
baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T000500Z_0x9.h5
baltrad/hdf5/bejab/2020/01/01/... # 60/5*24 = 288 files per directory
baltrad/hdf5/bejab/2020/01/01/bejab_vp_20200101T235500Z_0x9.h5
baltrad/hdf5/bewid/
baltrad/hdf5/behel/
baltrad/hdf5/...
# daily csv
baltrad/csv-daily/bejab/2020/
baltrad/csv-daily/bejab/2020/bejab_vpts_202012.csv.gz
baltrad/csv-daily/bejab/2020/bejab_vpts_20200101.csv
baltrad/csv-daily/bejab/2020/bejab_vpts_20200102.csv
baltrad/csv-daily/bejab/2020/... # 365 unzipped files per directory
baltrad/csv-daily/bejab/2020/bejab_vpts_20200131.csv
# monthly csv
baltrad/csv-monthly/bejab/2020/bejab_vpts_202001.csv.gz
baltrad/csv-monthly/bejab/2020/... # 12 zipped files per directory |
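As a sketch of how a client could construct paths under the radar/yyyy/mm/dd layout of suggestion 5, the helpers below format a path from a source, radar code and timestamp. The function names are hypothetical, and the `0x9` suffix is simply copied from the example file names (its meaning is not stated in this thread), so it is exposed as a parameter.

```python
from datetime import datetime, timezone


def vp_path(source: str, radar: str, ts: datetime, suffix: str = "0x9") -> str:
    """Path of an original hdf5 vp file: source/hdf5/radar/yyyy/mm/dd/<file>."""
    name = f"{radar}_vp_{ts:%Y%m%dT%H%M%S}Z_{suffix}.h5"
    return f"{source}/hdf5/{radar}/{ts:%Y/%m/%d}/{name}"


def daily_csv_path(source: str, radar: str, ts: datetime) -> str:
    """Path of a daily vpts csv: source/csv-daily/radar/yyyy/<file>."""
    return f"{source}/csv-daily/{radar}/{ts:%Y}/{radar}_vpts_{ts:%Y%m%d}.csv"


def monthly_csv_path(source: str, radar: str, ts: datetime) -> str:
    """Path of a monthly zipped vpts csv: source/csv-monthly/radar/yyyy/<file>."""
    return f"{source}/csv-monthly/{radar}/{ts:%Y}/{radar}_vpts_{ts:%Y%m}.csv.gz"


ts = datetime(2020, 1, 1, 0, 5, tzinfo=timezone.utc)
print(vp_path("baltrad", "bejab", ts))
print(daily_csv_path("baltrad", "bejab", ts))
print(monthly_csv_path("baltrad", "bejab", ts))
```

Because every path component is derived from the radar code and the timestamp, code can go straight to a file without listing directories first.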
From my point of view (that of someone who would write code accessing the repository, but not explore it by hand), I'm not a huge fan of all those subdirectories. Actually, from that point of view, having almost all files in a single flat directory would be perfectly fine (since the filename already provides all the metadata). That would also circumvent the discussion about which subdirectory (radar or year) should be at the highest level. Now:
|
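The point that the file name already carries the metadata can be illustrated with a small parser. This is only a sketch: the regular expression is inferred from the example names in this thread (such as bejab_vp_20200101T000500Z_0x9.h5), not from a documented naming specification, and `parse_vp_name` is a made-up helper.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the example file names; an assumption, not a specification.
VP_NAME = re.compile(r"^(?P<radar>[a-z]+)_vp_(?P<stamp>\d{8}T\d{6})Z_(?P<tag>\w+)\.h5$")


def parse_vp_name(filename: str):
    """Return radar, UTC datetime and trailing tag from a vp file name, or None."""
    match = VP_NAME.match(filename)
    if match is None:
        return None
    stamp = datetime.strptime(match["stamp"], "%Y%m%dT%H%M%S")
    return {
        "radar": match["radar"],
        "datetime": stamp.replace(tzinfo=timezone.utc),
        "tag": match["tag"],
    }


info = parse_vp_name("bejab_vp_20200101T000500Z_0x9.h5")
print(info)
```

With such a parser, a flat list of file names can be filtered by radar or date in code, regardless of which directory level comes first.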
On our S3 server @ UvA I've steered away from a flat structure. I feel the same about the duplication of information in path and filename. However, I did notice quite a performance hit storing everything together in a single bucket when you want to, for example, provide an overview of unique radars, unique radar years, etc. |
Thanks all for the input! There is a consensus on the structure suggested in #65 (comment)
|
Update: consensus for the structure suggested in #65 (comment)
Current
The vp data in the ENRAM data repository are currently organized as:
Design principles
Proposal
1. radar/yyyy
radar/yyyy
structure (and filename convention for vpts). yyyy/mm/dd/ (files for all radars). The BALTRAD PVOL archive uses yyyy/mm/dd/HH/MM/ (files for all radars). Although organizing by year first has some benefits, the fact that there is no radar directory makes it hard for tools to find data for a specific radar, which is almost always part of the query.
2. yyyy/radar
A valid alternative is switching radar and year columns: