In endpoints.yaml/.json for cohorts and datasets the /{id}/ request should be served by a beaconCollectionsResponse (or summary responses) #123

Open
mbaudis opened this issue Mar 27, 2024 · 9 comments
Labels
bug Something isn't working Dev scout


@mbaudis
Member

mbaudis commented Mar 27, 2024

$ref: '#/components/responses/ResultsOKResponse'

This picks up where some back-and-forth discussion in #116 started and rephrases it as the minimal issue. In short: since we use a separate beaconCollectionsResponse response schema for record-level responses in collections (datasets, cohorts), the single-entity id endpoints (/datasets/{id}/ and /cohorts/{id}/) should use this response schema (invoked through the locally defined CollectionsResponse) instead of a beaconResultsetsResponse (invoked through the locally defined ResultsOKResponse):

| Endpoint | Record response schema | Looks like ... |
| --- | --- | --- |
| /datasets | beaconCollectionsResponse | list of datasets in response.results |
| /datasets/{id} | beaconCollectionsResponse | list of a single dataset in response.results |
| /datasets/{id}/biosamples | beaconResultsetsResponse | a list of one dataset in response.resultSets with a list of biosamples in response.resultSets[0].results |

... etc.
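As a rough illustration of the table above, here is a minimal Python sketch of the path-to-schema rule. The function name and the `COLLECTION_PATHS` tuple are made up for this example, and it assumes that `datasets` and `cohorts` are the only collection entry types; special sub-paths such as `filtering_terms` are deliberately not handled.

```python
# Assumption: only `datasets` and `cohorts` are collection entry types,
# as in the default model discussed in this issue.
COLLECTION_PATHS = ("datasets", "cohorts")


def proposed_response_schema(path: str) -> str:
    """Return the response schema name the table proposes for `path`."""
    segments = [s for s in path.strip("/").split("/") if s]
    if segments and segments[0] in COLLECTION_PATHS and len(segments) <= 2:
        # /datasets or /datasets/{id}: the records describe the collections themselves.
        return "beaconCollectionsResponse"
    # e.g. /datasets/{id}/biosamples or /biosamples: records of another entry type.
    return "beaconResultsetsResponse"


for example in ("/datasets", "/datasets/{id}/", "/datasets/{id}/biosamples"):
    print(example, "->", proposed_response_schema(example))
```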

Alternative 1

The special beaconCollectionsResponse could be discarded since in principle we could have e.g. a list of datasets or cohorts inside response.resultSets without the results component used for record level biosamples etc.

Alternative 2

Collections should always respond with a beaconCollectionsResponse, also for /datasets/{id}/biosamples style endpoints since it is just one list of records associated with one collection (i.e. not potentially multiple resultsets).

@jrambla
Contributor

jrambla commented Apr 3, 2024

I agree with the table description about how the responses are expected to be.

@jrambla
Contributor

jrambla commented Apr 3, 2024

I don't think the alternatives are necessary, as with the correction above, the behavior would be the expected one.

  • A resultset is the equivalent of a collection.

  • The collection concept is defined in the framework.

  • Every model could have different entry types of type collection with different schemas. In the current model: datasets and cohorts.

  • For that reason we can't have a datasetResponse or cohortResponse.

  • Inside the response, using "collection" instead of "resultset" could be counterintuitive as the results could come from a filter, hence we are not returning a whole collection but a subset of it.

  • In alternative 1 the response will return a "resultset" (collection) of collections. This seemed very confusing to me, and the reason I ended up creating the CollectionResponse as a different kind of response.

  • Alternative 2 seems more natural to me, but, then /datasets/{id}/biosamples will return a collectionResponse while /individuals/{id}/biosamples will return a resultsetResponse, which is quite confusing to me.

This is the history and rationale of these terms and responses.

@mbaudis
Member Author

mbaudis commented Apr 3, 2024

  • A resultset is the equivalent of a collection.
  • The collection concept is defined in the framework.
  • Every model could have different entry types of type collection with different schemas. In the current model: datasets and cohorts.
  • For that reason we can't have a datasetResponse or cohortResponse.
  • Inside the response, using "collection" instead of "resultset" could be counterintuitive as the results could come from a filter, hence we are not returning a whole collection but a subset of it.
  • In alternative 1 the response will return a "resultset" (collection) of collections. This seemed very confusing to me, and the reason I ended up creating the CollectionResponse as a different kind of response.

I always saw the point of having a distinction between "data entities" and "collections". But then, /datasets/{id}/cohorts/ looks a lot like /datasets/{id}/individuals/; it could result in a resultSet of one (type: dataset) containing a list of cohorts (not their individuals etc., which would be in /cohorts/{id}/individuals/). And the discussion here started with the lack of a definition of which type of response to use, so unification is something to consider ...

  • Alternative 2 seems more natural to me, but, then /datasets/{id}/biosamples will return a collectionResponse while /individuals/{id}/biosamples will return a resultsetResponse, which is quite confusing to me.

Well, it could make sense that "collections" endpoints always return a "collectionsResponse"; since you already define the unique collection you can deliver the list of records for the selected response entity. This might be the cleanest definition:

  • datasets path is for a dataset entryType of type collection => beaconCollectionsResponse
  • individuals path is for an individual entryType of type record => beaconResultsetsResponse

And both have /datasets/filtering_terms/ | /individuals/filtering_terms/ ...
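The "root path decides" rule from the two bullets above could be sketched in Python as follows. The lookup tables and the function name are hypothetical, and the `cohorts` and `biosamples` entries are added by analogy; only the `datasets` and `individuals` rows come from the bullets.

```python
# Hypothetical lookup: kind of entry type served under each root path.
ENTRY_TYPE_KIND = {
    "datasets": "collection",
    "cohorts": "collection",      # assumption, by analogy with datasets
    "individuals": "record",
    "biosamples": "record",       # assumption, by analogy with individuals
}

RESPONSE_BY_KIND = {
    "collection": "beaconCollectionsResponse",
    "record": "beaconResultsetsResponse",
}


def response_schema_by_root(path: str) -> str:
    """Pick the response schema from the first path segment only."""
    root = path.strip("/").split("/")[0]
    return RESPONSE_BY_KIND[ENTRY_TYPE_KIND[root]]


print(response_schema_by_root("/datasets/{id}/biosamples"))
print(response_schema_by_root("/individuals/{id}/biosamples"))
```

Under this rule the two `.../biosamples` endpoints return different response types, which is exactly the asymmetry raised as confusing earlier in the thread.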

@mbaudis
Member Author

mbaudis commented Apr 3, 2024

... but one thing I still do not see is where in the schema we define which type of resultSets (i.e. cohort or dataset) is being returned... I use dataset by default. This will even get more interesting with more aggregator based resultsetResponses.

@jrambla
Contributor

jrambla commented Apr 3, 2024

This is in the configuration endpoint.
E.g.

        "dataset": {
            "aCollectionOf": [
                {
                    "id": "genomicVariant",
                    "name": "Genomic Variants"
                }
            ],
            "additionalSupportedSchemas": [],
            "defaultSchema": {
                "id": "ga4gh-beacon-dataset-v2.0.0",
                "name": "Default schema for datasets",
                "referenceToSchemaDefinition": "https://exampleBeacons.org/datasets/defaultSchema.json",
                "schemaVersion": "v2.0.0"
            },
(...)
        }
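To make the excerpt concrete, here is a small Python sketch of how a client might read such a configuration to find which entry types are declared as collections. It assumes the entry type definitions arrive as a plain dict keyed by entry type name; the `biosample` entry and the function name are invented for the example.

```python
def collection_entry_types(entry_types: dict) -> dict:
    """Map each collection entry type to the ids of the entry types it collects."""
    return {
        name: [item["id"] for item in definition["aCollectionOf"]]
        for name, definition in entry_types.items()
        if definition.get("aCollectionOf")
    }


entry_types = {
    # Taken from the excerpt above (trimmed to the relevant key).
    "dataset": {
        "aCollectionOf": [{"id": "genomicVariant", "name": "Genomic Variants"}],
    },
    # Hypothetical record-level entry type without `aCollectionOf`.
    "biosample": {},
}

print(collection_entry_types(entry_types))  # {'dataset': ['genomicVariant']}
```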

@mbaudis
Member Author

mbaudis commented Apr 3, 2024

@jrambla This doesn't help. Logically a dataset isn't a collection of variants but a collection of records in the entities of the data model, with inherently consistent ids used for referencing between the records of the different entities. If I remember correctly, we designed cohorts as a different type of collection, which can (potentially) contain entries from different datasets (allowing for transversal discovery, though never really war-gamed). It isn't clear to me where a "default schema" would come in here:

  • if you have a datasets/ endpoint you deliver datasets (i.e. the information about those, including the not-yet-defined-but-wished-for aggregate data)
  • same for cohorts/
  • otherwise you have to specify what entity you deliver for a given cohort/dataset
  • AFAIK the datasets/{id} endpoint was documented to provide information about a given dataset (but this could be changed to be about a default entity, e.g. variants) - not a fan. Otherwise I might be ignorant but I do not actually see where this "aCollectionOf" has a practical intersection w/ any parts of the implementation.

BUT My point was different: There is no way to indicate what the requested collection type (i.e. dataset, cohort) of a given resultsetResponse is. Obviously, you want both the option by dataset and cohort (or any other collection type in your model). So:

  • How do I indicate that my resultsets should be of type dataset or cohort, given a biosamples/ etc. path?
  • How do I select which datasets/cohorts should be contained in the list of resultsets?

I made earlier statements about the necessity of having parameters for these. I remember our design discussions about having the responses separated into resultsets to allow multiple aggregations (and this is now in line with my suggestion to treat every Beacon response collected by an aggregator simply as a resultsets entry). But you cannot aggregate into resultsets if you cannot define what they are or how to select amongst them.

(We currently do this by providing a datasetIds parameter and using dataset as a collection type, by default).
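For illustration only, a request restricted to selected datasets might be assembled like this in Python. The comma-separated encoding of `datasetIds` and `filters`, the filter value `NCIT:C3058`, the dataset ids and the helper name are all assumptions made for this sketch, not something fixed by the specification.

```python
from urllib.parse import urlencode


def build_query_url(base_url, entry_path, filters, dataset_ids=None):
    """Build a GET URL; `datasetIds` restricts which datasets become resultsets."""
    params = {"filters": ",".join(filters)}
    if dataset_ids:
        # Assumption: multiple ids are passed as one comma-separated value.
        params["datasetIds"] = ",".join(dataset_ids)
    return f"{base_url.rstrip('/')}/{entry_path.strip('/')}?{urlencode(params)}"


url = build_query_url(
    "https://exampleBeacons.org",   # host taken from the examples in this thread
    "biosamples",
    filters=["NCIT:C3058"],         # hypothetical filter term
    dataset_ids=["ds-1", "ds-2"],   # hypothetical dataset ids
)
print(url)
```

There is no equivalent parameter here for cohorts or for the collection type itself, which is the gap described above.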

@jrambla
Contributor

jrambla commented Apr 8, 2024

(I'll be telegraphic, in order to be concise)

About what is a dataset and what is a cohort:
In the current model (v2.0) a dataset could only contain "genomicVariations" and a cohort could only contain "individuals".
This is depicted in the documentation introduction diagram.
I can agree that this is leaving out some cases or "forcing" an internal navigation, but avoids confusions like the one you are describing.

In the map endpoint you'll find entries like https://exampleBeacons.org/datasets/{id}/individuals, but this is just a convenience. The rationale behind that endpoint is that you navigate/jump from genomicVariations to individuals, not that the individuals are actually part of the dataset.

Similarly, https://exampleBeacons.org/cohorts/{id}/g_variants is a convenience for navigating from individuals to the genomicVariations obtained from their biosamples, not a statement that a cohort is composed of biosamples or genomicVariations.

In the resultSet response the setType property is declaring what type of set that response is generated from, but has no impact in the returned schema, as the returned schema is the entry type one (biosamples, individuals, etc.) not the collection one. Hence, there is no need to request a type of collection.
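A short Python sketch of what this means on the client side: the `setType` of each resultset can be read from the response after the fact, but nothing in the code corresponds to requesting it. The payload below is invented, and the `id` and `results` keys next to `setType` are assumed to be present in each resultset.

```python
from collections import defaultdict


def group_result_sets_by_type(response: dict) -> dict:
    """Group resultset ids by their declared `setType`."""
    grouped = defaultdict(list)
    for result_set in response.get("resultSets", []):
        set_type = result_set.get("setType", "unspecified")
        grouped[set_type].append(result_set.get("id"))
    return dict(grouped)


# Invented example payload: two resultsets, both generated from datasets.
response = {
    "resultSets": [
        {"id": "ds-1", "setType": "dataset", "results": [{"id": "bs-1"}]},
        {"id": "ds-2", "setType": "dataset", "results": []},
    ]
}

print(group_result_sets_by_type(response))  # {'dataset': ['ds-1', 'ds-2']}
```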

A topic, that must happen in a separate discussion, is how do you select just some datasets or cohorts in your request. I agree that today we don't have a direct solution for that other than doing a query per each dataset or cohort you would select for.

@mbaudis
Member Author

mbaudis commented Apr 15, 2024

@jrambla Datasets: Well, the unitary connection in the diagram was mostly to emphasize the difference between cohorts as a transversal collection type, and datasets as the "internal organization" one, as described:

Collections (Datasets and Cohorts): groupings of variants or individuals that share something in common: e.g., who belong to the same repository (datasets) or study populations (cohorts).

The model doesn't describe the dataset to "contain" only variants but effectively also all related entry types which can be accessed through the relationship graph. I actually only left the ---- connections out to emphasize the core data model ...

The rationale behind that endpoint is that you navigate/jump from genomicVariations to individuals, not that the individuals are actually part of the dataset.

Well, if you can "navigate" to individuals belonging to the variants, you have a finite set of individuals representing all that are part of the dataset. I don't really understand what you mean w/ "internal navigation"; here it is just about which entities in principle can be accessed as belonging to the dataset - the internal handling of record retrieval is of no interest. But this all isn't really part of the problem; it comes here:

In the resultSet response the setType property is declaring what type of set that response is generated from, but has no impact in the returned schema, as the returned schema is the entry type one (biosamples, individuals, etc.) not the collection one. Hence, there is no need to request a type of collection.

... which is obvious. But:

 There is no way to define which collection type an item in a resultsetsResponse should belong to.

The setType is not a parameter in query or path, or at least I cannot find it. So you cannot define whether the resultSets from e.g. a /biosamples?filters=... request contain biosamples per dataset in the beacon, or per cohort in the beacon. Or both (which is what it would look like without an extra definition).

A topic, that must happen in a separate discussion, is how do you select just some datasets or cohorts in your request. I agree that today we don't have a direct solution for that other than doing a query per each dataset or cohort you would select for.

Yes, exactly. We just use datasetIds.

@mbaudis
Member Author

mbaudis commented Apr 16, 2024

@jrambla

Yes, exactly. We just use datasetIds

... is actually in line w/ the framework examples (which in principle don't have to be in line w/ the default model but that would be a bit confusing):

4 participants