self_reported_ethnicity must allow multiple HANCESTRO terms #334
Comments
2 Collections use
In Lattice DB, yet to get to CxG, there are 26 donors in another study that would have multiple terms.
For multiple terms, are we OK with comma+space separated, like the current examples? And/or comma only?
@brianraymor responds: This has been addressed with the ontologist.
While I agree that curators and submitters ideally won't need to worry about it, it seems like they should be ordered for users so the Census users won't see
Thanks for catching. Corrected.
I was tempted to write some ABNF earlier.
I had an earlier draft that normalized the labels by ascending lexicographical order, which would then require a rewrite of their term id(s) to match. Based on my earlier draft, your example above:
Another option is to normalize by ascending term id(s) but then the labels are more random:
RE:ordering - both options result in consistent display of the same values so I don't have a strong preference. Maybe slight preference for lexicographical order of labels (users are more likely to see the A-->Z pattern & therefore less likely to think the order means something significant?)
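The two ordering options discussed above can be compared with a small sketch. This is only an illustration, not the schema's implementation: the function names are hypothetical, the two HANCESTRO labels are the ones used in this issue's own example, and `HYPOTHETICAL:0001` with its label is invented purely so that the two orderings diverge.

```python
# Labels for the two HANCESTRO terms come from the example in this issue.
# "HYPOTHETICAL:0001" and its label are made up so the orderings differ.
EXAMPLE_LABELS = {
    "HANCESTRO:0005": "European",
    "HANCESTRO:0014": "Hispanic or Latin American",
    "HYPOTHETICAL:0001": "Alpha (hypothetical)",
}


def order_by_term_id(term_ids):
    """Sort term ids ascending; the labels simply follow the ids."""
    ids = sorted(term_ids)
    return ",".join(ids), ",".join(EXAMPLE_LABELS[t] for t in ids)


def order_by_label(term_ids):
    """Sort by label ascending; the ids are rewritten to match that order."""
    ids = sorted(term_ids, key=lambda t: EXAMPLE_LABELS[t])
    return ",".join(ids), ",".join(EXAMPLE_LABELS[t] for t in ids)


donor_terms = ["HANCESTRO:0014", "HYPOTHETICAL:0001", "HANCESTRO:0005"]
by_id = order_by_term_id(donor_terms)
by_label = order_by_label(donor_terms)
print(by_id)
print(by_label)
```

Either function gives the same output for the same set of terms regardless of input order, which is the consistency property both options share. They differ only in which of the two columns reads as sorted.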
Re: array syntax:
Re: ordering:
Everything looks good to me. Great job on going over the HANCESTRO ontology and excluding terms. Only two points:
Can you say more about the rationale? The schema pins a HANCESTRO version. The requirements for For the future, we've asked Dani to consider adding a property that would allow HANCESTRO consumers to easily filter terms that are appropriate for self reporting rather than requiring the manual review and cherry picking of terms.
After reviewing with Dani, I opened Preferred term label for HANCESTRO:0015. This is targeted for first half of August.
And since I'm knee deep in clarifying requirements for #216 - any reordering of terms/labels could impact
Mainly a historical record of the decisions that were made. It can come in handy when explaining the decisions, or in the future when we have to revise HANCESTRO.
Design
Notes:
I recommend that this issue be addressed last when updating the cellxgene-schema CLI. Its design may be simplified prior to Schema 4 delivery IF HANCESTRO is updated to address pending requests:
Also see #single-cell-data-wrangling.
obs (Cell Metadata)

obs is a pandas.DataFrame.

Curators MUST annotate the following columns in the obs dataframe:

self_reported_ethnicity_ontology_term_id
str categories. If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, the value MUST be formatted as one or more comma-separated (with no leading or trailing spaces) HANCESTRO terms in ascending lexical order, or "unknown" if unavailable.

For example, if the terms are "HANCESTRO:0014" and "HANCESTRO:0005", then the value of self_reported_ethnicity_ontology_term_id MUST be "HANCESTRO:0005,HANCESTRO:0014".

The following terms MUST NOT be used:

- "HANCESTRO:0002" for regions and its children
- "HANCESTRO:0003" for country
- "HANCESTRO:0004" for ancestry category
- "HANCESTRO:0018" for uncategorised population
- "HANCESTRO:0290" for genetically isolated population
- "HANCESTRO:0304" for ancestry status and its children
- "HANCESTRO:0323" for Finnish founder
- "HANCESTRO:0324" for Dutch founder
- "HANCESTRO:0551" for genetically homogenous Irish
- "HANCESTRO:0554" for Silk Road founder
- "HANCESTRO:0555" for Arab Israeli founder
- "HANCESTRO:0557" for Costa Rican founder
- "HANCESTRO:0558" for French Canadian founder
- "HANCESTRO:0559" for Italian founder
- "HANCESTRO:0560" for Northern Finnish founder
- "HANCESTRO:0561" for Romanian founder
- "HANCESTRO:0564" for Vis founder
- "HANCESTRO:0565" for Split founder
- "HANCESTRO:0566" for undefined ancestry population
- "GEO:000000374" for continent and its children:
  - "HANCESTRO:0029" for Africa
  - "HANCESTRO:0030" for Asia
  - "HANCESTRO:0031" for Europe
  - "HANCESTRO:0032" for Oceania
  - "HANCESTRO:0033" for Latin America and the Caribbean
  - "HANCESTRO:0034" for Northern America

Otherwise, for all other organisms, the str value MUST be "na".

When a dataset is uploaded, CELLxGENE Discover MUST automatically add the matching human-readable name for the corresponding ontology term to the obs dataframe. Curators MUST NOT annotate the following columns.

self_reported_ethnicity
str categories. This MUST be "na" or "unknown" if set in self_reported_ethnicity_ontology_term_id; otherwise, this MUST be one or more comma-separated (with no leading or trailing spaces) human-readable names for the terms in self_reported_ethnicity_ontology_term_id, in the same order.

For example, if the value of self_reported_ethnicity_ontology_term_id is "HANCESTRO:0005,HANCESTRO:0014", then the value of self_reported_ethnicity is "European,Hispanic or Latin American".

schema v4.0.0
self_reported_ethnicity_ontology_term_id
self_reported_ethnicity
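The rules for these two fields can be sketched as a small checker plus a label lookup. This is a rough illustration under stated assumptions, not the cellxgene-schema CLI's actual code: `check_term_id_value`, `label_for`, `TERM_ID`, `FORBIDDEN` and `EXAMPLE_LABELS` are hypothetical names, the four-digit id pattern is an assumption about HANCESTRO ids, and the "and its children" clauses are not enforced because that would need the pinned ontology itself.

```python
import re

# Assumption: a HANCESTRO term id is the prefix followed by exactly four digits.
TERM_ID = re.compile(r"HANCESTRO:\d{4}")

# Explicitly listed term ids from the Design section. Children of
# HANCESTRO:0002 and HANCESTRO:0304 are NOT covered by this sketch.
FORBIDDEN = {
    "HANCESTRO:0002", "HANCESTRO:0003", "HANCESTRO:0004", "HANCESTRO:0018",
    "HANCESTRO:0290", "HANCESTRO:0304", "HANCESTRO:0323", "HANCESTRO:0324",
    "HANCESTRO:0551", "HANCESTRO:0554", "HANCESTRO:0555", "HANCESTRO:0557",
    "HANCESTRO:0558", "HANCESTRO:0559", "HANCESTRO:0560", "HANCESTRO:0561",
    "HANCESTRO:0564", "HANCESTRO:0565", "HANCESTRO:0566", "HANCESTRO:0029",
    "HANCESTRO:0030", "HANCESTRO:0031", "HANCESTRO:0032", "HANCESTRO:0033",
    "HANCESTRO:0034",
}

# Labels taken from the example in this issue; a real implementation would
# read them from the pinned HANCESTRO release.
EXAMPLE_LABELS = {
    "HANCESTRO:0005": "European",
    "HANCESTRO:0014": "Hispanic or Latin American",
}


def check_term_id_value(value, organism):
    """Return a list of problems; an empty list means the value passes."""
    if organism != "NCBITaxon:9606":
        return [] if value == "na" else ['non-human value must be "na"']
    if value == "unknown":
        return []
    errors = []
    terms = value.split(",")  # splitting on "," keeps any stray spaces
    for term in terms:
        if not TERM_ID.fullmatch(term):
            errors.append(f"not a HANCESTRO term id: {term!r}")
        elif term in FORBIDDEN:
            errors.append(f"forbidden term: {term}")
    if terms != sorted(terms):
        errors.append("terms are not in ascending lexical order")
    return errors


def label_for(term_id_value, labels=EXAMPLE_LABELS):
    """Derive self_reported_ethnicity from the term id value, same order."""
    if term_id_value in ("na", "unknown"):
        return term_id_value
    return ",".join(labels[t] for t in term_id_value.split(","))


print(check_term_id_value("HANCESTRO:0005,HANCESTRO:0014", "NCBITaxon:9606"))
print(label_for("HANCESTRO:0005,HANCESTRO:0014"))
```

Because the value is split on a bare comma, a comma+space separator surfaces as a malformed term id rather than being silently accepted, which matches the "no leading or trailing spaces" wording.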
Context
To support accurate metadata for multiethnic donors (donors who have selected more than one ethnicity), self_reported_ethnicity must allow multiple HANCESTRO terms.