Collection properties that become relevant when datasets are merged #13
I don't really follow what you mean. There are many, many things that have to change when you merge datasets; that's essentially what GFID does: it takes data from many sources (some large datasets, but also individual crowdsourcing) and offers them back out again as datasets with different facets depending on how you want to consume them (e.g. 'all current field boundaries', 'all distinct boundaries through time', 'each individual representation of any boundary', or 'all the current field boundaries in Brazil'). Almost all the metadata properties attached to a boundary become arrays when you deduplicate the boundary, and collection-level items like dates either change or lose meaning. Typically you can't just naively merge collections, because you cannot guarantee uniqueness in the union set, which is why the merged collection is really a different collection from the original one. Perhaps you can take a look at how we solve this in GFID to understand the data model relationships. This page has a high-level overview of the data types:
There is also 'fields', but this is not actually a spatial collection (this is why the API is not just an OGC Features API: it is writeable and deals with concepts like relationships between boundaries and between fields & boundaries, temporal change, etc.). Both collections can be arbitrarily filtered, for example "give me only the features in this bbox, from this source, of this determination method". So you can make any arbitrary subcollection you like, and the URI of the request would technically be usable as an ID for that collection. All of those collections would be guaranteed to have unique IDs, but it doesn't make sense to embed references between them at the object level, because there are an infinite number of them: a feature is part of many collections. When we make static exports available soon, there will be distinct collections that can be given a name, though, like I mentioned in the first paragraph, and this is where we'd put dataset-level metadata if it had to be in STAC format. Otherwise, it'd only exist at the level of the FeatureCollection in the API.
@andyjenkinson To clarify what I meant with merging: Let's say we have a dataset that follows fiboa from the US and one from Germany, both in GeoParquet, and I want to have them in the same GeoParquet file. In this case I may want to have some columns that identify details such as the license, which for Germany would otherwise all have the same value and as such live in the collection. Let's say the US dataset has the columns id, geometry, area, and us_state, and a collection with provider ABC and license CC-0. The merged dataset could have the following columns: id, geometry, area, perimeter, us_state, inspire:id, and collection. But some may also want to have provider and license as columns. So this issue is just about that and how we want to tackle it. I think you were thinking about something else, where you also need to merge geometries or other things. I was more on the simpler side of things, where the data actually has no overlap and it's just about the properties. Makes sense?
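To illustrate the non-overlapping case described above, here is a minimal sketch in pandas (not actual fiboa tooling; the column names follow the example in this comment, and the `merge_with_collections` helper and the collection metadata dicts are hypothetical). Collection-level properties such as license and provider are promoted to per-row columns so they survive the merge:

```python
import pandas as pd

# Hypothetical per-feature tables plus collection-level metadata that
# applies uniformly to every row of its dataset.
us = pd.DataFrame({
    "id": ["us-1", "us-2"],
    "area": [10.5, 3.2],
    "us_state": ["IA", "NE"],
})
us_meta = {"collection": "us", "license": "CC-0", "provider": "ABC"}

de = pd.DataFrame({
    "id": ["de-1"],
    "area": [7.1],
    "inspire:id": ["DE.XYZ.123"],
})
de_meta = {"collection": "de", "license": "dl-de/by-2-0", "provider": "XYZ"}

def merge_with_collections(parts, promote=("license", "provider")):
    """Concatenate datasets, tagging each row with its collection id and
    promoting selected collection properties to columns."""
    frames = []
    for df, meta in parts:
        df = df.copy()
        df["collection"] = meta["collection"]
        for prop in promote:
            if prop in meta:
                df[prop] = meta[prop]
        frames.append(df)
    # Columns present in only one source become null for the other rows.
    return pd.concat(frames, ignore_index=True)

merged = merge_with_collections([(us, us_meta), (de, de_meta)])
```

Note that `us_state` is null for the German rows and `inspire:id` is null for the US rows; only `collection` (and any promoted properties) is populated everywhere.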
Another issue that we need to discuss: what happens when files are merged that implement different extensions with required fields? The current "required" implementation assumes non-nullable fields, which fails in the case of a merge.
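A small sketch of why this fails, assuming one dataset implements a hypothetical extension that requires `us_state` while the other does not (the `check_required` validator is illustrative, not the actual fiboa validation code):

```python
import pandas as pd

# Dataset A implements an extension whose schema marks "us_state" as
# required; dataset B does not implement that extension at all.
a = pd.DataFrame({"id": ["a-1"], "us_state": ["IA"]})
b = pd.DataFrame({"id": ["b-1"]})

# After a naive concat, "us_state" is null for the rows from B.
merged = pd.concat([a, b], ignore_index=True)

def check_required(df, required_fields):
    """Naive validator: a required field must exist and contain no nulls.
    Under this non-nullable reading of "required", the merged file is
    rejected even though each source was valid on its own."""
    return [
        field for field in required_fields
        if field not in df.columns or df[field].isna().any()
    ]

check_required(a, ["us_state"])       # valid on its own
check_required(merged, ["us_state"])  # fails after the merge
```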
To be honest, I'd been treating multiple collections as out of scope for now, because it's much more complex than the examples considered to date. After all, that's what our system actually does: we merge many separate collections (we call them "sources") and serve them out through the API in multiple ways:
There are many issues with the current fiboa data model for representing the latter "multiple collections in a single file" case, since the values of most of the fields can differ between collections. Not sure what this has to do with this topic, though, so it's probably best to talk it through separately.
Indeed, moved the discussion about required (extension) fields to #26 |
Closing here, we won't distinguish between collection properties and other properties anymore. |
If you merge multiple datasets, you may want to have columns for some properties that are defined in the collections. For example: