self_reported_ethnicity must allow multiple HANCESTRO terms #334

brianraymor · 2022-08-25T18:02:08Z

Design

Notes:

I recommend that this issue be addressed last when updating cellxgene-schema CLI. Its design may be simplified prior to Schema 4 delivery IF HANCESTRO is updated to address pending requests:

The ontology is modeled to make it possible to identify HANCESTRO terms that are appropriate for the self-reported use case. The ontologist is planning to ensure that each term is tagged accordingly so we could infer 2 new parent classes to automatically give us a full list of geographic descriptors and of ethnicity descriptors. Currently, CELLxGENE must assess the entire ontology and cherry-pick terms to block, which increases the complexity of curation and validation.
Preferred term label for HANCESTRO:0015 was addressed.

Also see #single-cell-data-wrangling.

`obs` (Cell Metadata)

obs is a pandas.DataFrame.

Curators MUST annotate the following columns in the obs dataframe:

self_reported_ethnicity_ontology_term_id

Key	self_reported_ethnicity_ontology_term_id
Annotator	Curator
Value	categorical with `str` categories. If `organism_ontology_term_id` is `"NCBITaxon:9606"` for Homo sapiens, the value MUST be formatted as one or more comma-separated (with no leading or trailing spaces) HANCESTRO terms in ascending lexical order, or `"unknown"` if unavailable. For example, if the terms are `"HANCESTRO:0014"` and `"HANCESTRO:0005"`, then the value of `self_reported_ethnicity_ontology_term_id` MUST be `"HANCESTRO:0005,HANCESTRO:0014"`.

The following terms MUST NOT be used:

- `"HANCESTRO:0002"` for regions and its children
- `"HANCESTRO:0003"` for country
- `"HANCESTRO:0004"` for ancestry category
- `"HANCESTRO:0018"` for uncategorised population
- `"HANCESTRO:0290"` for genetically isolated population
- `"HANCESTRO:0304"` for ancestry status and its children
- `"HANCESTRO:0323"` for Finnish founder
- `"HANCESTRO:0324"` for Dutch founder
- `"HANCESTRO:0551"` for genetically homogenous Irish
- `"HANCESTRO:0554"` for Silk Road founder
- `"HANCESTRO:0555"` for Arab Israeli founder
- `"HANCESTRO:0557"` for Costa Rican founder
- `"HANCESTRO:0558"` for French Canadian founder
- `"HANCESTRO:0559"` for Italian founder
- `"HANCESTRO:0560"` for Northern Finnish founder
- `"HANCESTRO:0561"` for Romanian founder
- `"HANCESTRO:0564"` for Vis founder
- `"HANCESTRO:0565"` for Split founder
- `"HANCESTRO:0566"` for undefined ancestry population
- The imported GEO term `"GEO:000000374"` for continent and its children:
  - `"HANCESTRO:0029"` for Africa
  - `"HANCESTRO:0030"` for Asia
  - `"HANCESTRO:0031"` for Europe
  - `"HANCESTRO:0032"` for Oceania
  - `"HANCESTRO:0033"` for Latin America and the Caribbean
  - `"HANCESTRO:0034"` for Northern America

Otherwise, for all other organisms the `str` value MUST be `"na"`.
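
The requirements above could be checked roughly as in the following sketch. It is not the cellxgene-schema CLI implementation: the function name `validate_term_id_value`, the `is_human` flag, and the four-digit term pattern are assumptions made for illustration, and the forbidden set contains only the terms listed explicitly above (the unlisted children of `"HANCESTRO:0002"` and `"HANCESTRO:0304"` would have to come from the ontology itself).

```python
import re

# Terms explicitly listed as forbidden above. Children of HANCESTRO:0002 and
# HANCESTRO:0304 that are not listed would need to be read from the ontology.
FORBIDDEN_TERMS = {
    "HANCESTRO:0002", "HANCESTRO:0003", "HANCESTRO:0004", "HANCESTRO:0018",
    "HANCESTRO:0290", "HANCESTRO:0304", "HANCESTRO:0323", "HANCESTRO:0324",
    "HANCESTRO:0551", "HANCESTRO:0554", "HANCESTRO:0555", "HANCESTRO:0557",
    "HANCESTRO:0558", "HANCESTRO:0559", "HANCESTRO:0560", "HANCESTRO:0561",
    "HANCESTRO:0564", "HANCESTRO:0565", "HANCESTRO:0566", "GEO:000000374",
    "HANCESTRO:0029", "HANCESTRO:0030", "HANCESTRO:0031", "HANCESTRO:0032",
    "HANCESTRO:0033", "HANCESTRO:0034",
}

# Assumption: term ids look like the examples above ("HANCESTRO:" + 4 digits).
TERM_PATTERN = re.compile(r"HANCESTRO:\d{4}")


def validate_term_id_value(value: str, is_human: bool = True) -> list[str]:
    """Return a list of problems; an empty list means the value looks valid."""
    if not is_human:
        return [] if value == "na" else ['non-human value must be "na"']
    if value == "unknown":
        return []
    errors = []
    terms = value.split(",")  # comma only, no surrounding spaces allowed
    for term in terms:
        if not TERM_PATTERN.fullmatch(term):
            errors.append(f"malformed term: {term!r}")
        elif term in FORBIDDEN_TERMS:
            errors.append(f"forbidden term: {term!r}")
    if terms != sorted(terms):
        errors.append("terms are not in ascending lexical order")
    return errors
```

For instance, `validate_term_id_value("HANCESTRO:0005,HANCESTRO:0014")` returns an empty list, while the same terms in the reverse order are reported as out of order.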

When a dataset is uploaded, CELLxGENE Discover MUST automatically add the matching human-readable name for the corresponding ontology term to the obs dataframe. Curators MUST NOT annotate the following columns.

self_reported_ethnicity

Key	self_reported_ethnicity
Annotator	CELLxGENE Discover
Value	categorical with `str` categories. This MUST be `"na"` or `"unknown"` if set in `self_reported_ethnicity_ontology_term_id`; otherwise, this MUST be one or more comma-separated (with no leading or trailing spaces) human-readable names for the terms in `self_reported_ethnicity_ontology_term_id` in the same order. For example, if the value of `self_reported_ethnicity_ontology_term_id` is `"HANCESTRO:0005,HANCESTRO:0014"` then the value of `self_reported_ethnicity` is `"European,Hispanic or Latin American"`.
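
A minimal sketch of how the label column could be derived from the term id column, assuming a lookup table from term id to label is available. The `LABELS` dictionary and the `labels_for` function are hypothetical names; the two entries are taken from the example above.

```python
# Hypothetical lookup table; in practice the labels would come from the
# pinned HANCESTRO release. These two entries are from the example above.
LABELS = {
    "HANCESTRO:0005": "European",
    "HANCESTRO:0014": "Hispanic or Latin American",
}


def labels_for(term_id_value: str) -> str:
    """Map a comma-separated term id value to its comma-separated labels."""
    # "na" and "unknown" are passed through unchanged.
    if term_id_value in ("na", "unknown"):
        return term_id_value
    # Labels are emitted in the same order as the term ids.
    return ",".join(LABELS[term] for term in term_id_value.split(","))
```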

schema v4.0.0

obs (Cell metadata)
- Updated self_reported_ethnicity_ontology_term_id
- Updated self_reported_ethnicity

Context

To support accurate metadata for multiethnic donors (donors who have selected more than one ethnicity), self_reported_ethnicity must allow multiple HANCESTRO terms.


jahilton · 2023-04-20T21:09:04Z

Two Collections currently use multiethnic, so those will be replaced by a list of multiple HANCESTRO terms.

In Lattice DB, yet to get to CxG, there are 26 donors in another study that would have multiple terms.

jahilton · 2023-07-18T15:21:39Z

For multiple terms, are we OK with comma+space separated, like the current examples? And/or comma only?

Are we prepared for a comma within a single label? HANCESTRO:0015 is Greater Middle Eastern (Middle Eastern, North African or Persian). Is there another delimiter we could use to avoid any parsing confusion/errors?

@brianraymor responds: This has been addressed with the ontologist.

The order of multiple terms is not significant

While I agree that curators & submitters ideally won't need to worry about it, it seems like the terms should be ordered for users, so that Census users won't see both HANCESTRO:0005, HANCESTRO:0008 and HANCESTRO:0008, HANCESTRO:0005, and cxg viewers won't see both European, Asian and Asian, European, etc.

~~Couple typos in the URLs:~~

~~"HANCESTRO:0002" - current link is for HANCESTRO_0001~~
~~"HANCESTRO:0324" - current link is for HANCESTRO_0323~~

brianraymor · 2023-07-18T16:49:13Z

Couple typos in the URLs:

Thanks for catching. Corrected.

brianraymor · 2023-07-18T16:57:14Z

For multiple terms, are we OK with comma+space separated, like the current examples? And/or comma only?

I was tempted to write some ABNF earlier.

Are we prepared for a comma within a single label? HANCESTRO:0015 is Greater Middle Eastern (Middle Eastern, North African or Persian). Is there another delimiter we could use to avoid any parsing confusion/errors?

~~I meant to check the presence of commas in labels. Doh. This is primarily an issue because we're unable to use arrays for the list of self reported ethnicities. Possible options:~~

~~Since there's only case where a HANCESTRO label includes a comma, we could request that it be replaced by an "or"? And ask that no further comma(s) be specified?~~
~~We replace the comma with a separator that's unlikely to appear in an HANCESTRO label such as "|".~~

~~@pablo-gar @bkmartinjr @atarashansky - do you have preferences?~~

brianraymor · 2023-07-18T17:11:50Z

The order of multiple terms is not significant

While I agree that curators & submitters ideally won't need to worry about it, it seems like the terms should be ordered for users, so that Census users won't see both HANCESTRO:0005, HANCESTRO:0008 and HANCESTRO:0008, HANCESTRO:0005, and cxg viewers won't see both European, Asian and Asian, European, etc.

I had an earlier draft that normalized the labels by ascending lexicographical order, which would then require a rewrite of their term id(s) to match. Based on that draft, your example above:

HANCESTRO:0005, HANCESTRO:0008 would be normalized to HANCESTRO:0008, HANCESTRO:0005 to match the ordered labels Asian, European.

Another option is to normalize by ascending term id(s) but then the labels are more random:

HANCESTRO:0008, HANCESTRO:0005 becomes HANCESTRO:0005, HANCESTRO:0008
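
The two normalization options can be compared with a small sketch. The function names and the `LABELS` table are hypothetical; the two labels are the ones used in the example quoted above.

```python
# Labels taken from the example in this thread.
LABELS = {"HANCESTRO:0005": "European", "HANCESTRO:0008": "Asian"}


def normalize_by_label(terms: list[str]) -> list[str]:
    """Option 1: order term ids so that their labels ascend lexicographically."""
    return sorted(terms, key=lambda term: LABELS[term])


def normalize_by_term_id(terms: list[str]) -> list[str]:
    """Option 2: order term ids themselves in ascending lexical order."""
    return sorted(terms)


terms = ["HANCESTRO:0005", "HANCESTRO:0008"]
print(normalize_by_label(terms))    # ids follow the labels Asian, European
print(normalize_by_term_id(terms))  # ids ascend; labels are European, Asian
```

Either choice gives one canonical form per set of terms; they differ only in whether the ids or the labels appear sorted.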

CC: @pablo-gar @bkmartinjr @atarashansky

jahilton · 2023-07-18T20:01:40Z

RE:ordering - both options result in consistent display of the same values so I don't have a strong preference. Maybe slight preference for lexicographical order of labels (users are more likely to see the A-->Z pattern & therefore less likely to think the order means something significant?)

bkmartinjr · 2023-07-18T21:20:30Z

Re: array syntax:

prefer a single-character delimiter (not the space+comma combination used in the first example).
also prefer a delimiter which is illegal in an ontology term, so we do not have to worry about escaping it. That makes the typical query (via string regex) feasible. OTOMH, I don't know what is legal/illegal in RDF terms (and I hope OBO follows the same?), but I'd look to that grammar to find a character that we can safely use. Perhaps a simple space is the best choice?
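
To illustrate why a single-character delimiter that cannot occur inside a term keeps regex queries simple, here is a rough sketch. The `has_term` helper is a hypothetical name, and the comma default is only an example delimiter.

```python
import re


def has_term(value: str, term: str, delimiter: str = ",") -> bool:
    """Check whether `term` is one of the delimited entries in `value`."""
    d = re.escape(delimiter)
    # The term must be bounded by the delimiter or by the ends of the string,
    # so a prefix of a longer term id does not match.
    pattern = rf"(?:^|{d}){re.escape(term)}(?:{d}|$)"
    return re.search(pattern, value) is not None
```

Because the delimiter never appears inside a term id, no escaping of the stored value is needed. That assumption does not hold for labels that contain the delimiter themselves.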

Re: ordering:

I don't think the Census cares strongly about ordering unless there are semantics associated with the ordering (there are weak arguments for a (any) canonical ordering as that would help compression, categorical types, etc). I agree that visualization tools do care, but I think they care in different ways, and can apply their own logic (i.e., it does not need to be codified in the schema). In Jason's "cxg" (I think he means Explorer) example, it is trivial for the Explorer backend to sort these at CXG creation time.
Will there be a requirement to have the two columns ordered the same (i.e., positions match)?

brianraymor · 2023-07-18T22:24:45Z

~~Perhaps a simple space is the best choice?~~

~~I may be missing your point, but there are many cases where the HANCESTRO labels include spaces like "South East Asian" or consider Jason's outlier :~~

~~"Greater Middle Eastern (Middle Eastern, North African or Persian separator European"~~

~~prefer a delimiter which is illegal in an ontology term~~

~~I could see if OWL(s) have different constraints beyond XML.~~

~~What about using multiple characters such as "<br>". There are some crazier ideas out there like scanning the source text and finding a unique character.~~

brianraymor · 2023-07-18T23:18:27Z

~~I've asked Dani if the one label could be updated to not use a comma. Fingers crossed.~~

~~Could we double-quote terms within single-quoted strings?~~

import re

label = '"Greater Middle Eastern (Middle Eastern, North African, or Persian)", "European"'
labels = re.findall(r'"([^"]*)"', label)  # extract each double-quoted label

~~which produces a helpful list:~~

~~['Greater Middle Eastern (Middle Eastern, North African, or Persian)', 'European']~~

bkmartinjr · 2023-07-18T23:20:58Z

~~I may be missing your point, but there are many cases where the HANCESTRO labels include spaces like "South East Asian" or consider Jason's outlier :~~

~~You were not - I was myopically focused on the term id syntax, and forgot about the labels. Sigh. We may simply have to escape labels, although that will make any regex-based search interesting~~

pablo-gar · 2023-07-19T04:31:20Z

Everything looks good to me. Great job on going over the HANCESTRO ontology and excluding terms.

Only two points:

~~I propose we use ; as a separator.~~
We should include the curated spreadsheet at the very least in this ticket and ideally in the schema. For that spreadsheet I suggest naming the tab as "HANCESTRO_version..." to know exactly what version it is.

brianraymor · 2023-07-19T06:55:34Z

We should include the curated spreadsheet at the very least in this ticket and ideally in the schema. For that spreadsheet I suggest naming the tab as "HANCESTRO_version..." to know exactly what version it is.

Can you say more about the rationale? The schema pins a HANCESTRO version. The requirements for self_reported_ethnicity_ontology_term_id ensure that only terms appropriate for self reporting can be selected by curators. (There are 290 appropriate terms after the filtering.) I'm happy to link the spreadsheet to the issue if it's helpful for others to check the math.

For the future, we've asked Dani to consider adding a property that would allow HANCESTRO consumers to easily filter terms that are appropriate for self reporting rather than requiring the manual review and cherry picking of terms.

brianraymor · 2023-07-19T16:22:01Z

After reviewing with Dani, I opened Preferred term label for HANCESTRO:0015. This is targeted for first half of August.

brianraymor · 2023-07-19T22:17:43Z

And since I'm knee deep in clarifying requirements for #216 - any reordering of terms/labels could impact _colors if a submitter was coloring by this field.

pablo-gar · 2023-07-28T22:09:34Z

Can you say more about the rationale? The schema pins a HANCESTRO version. The requirements for self_reported_ethnicity_ontology_term_id ensure that only terms appropriate for self reporting can be selected by curators. (There are 290 appropriate terms after the filtering.) I'm happy to link the spreadsheet to the issue if it's helpful for others to check the math.

For the future, we've asked Dani to consider adding a property that would allow HANCESTRO consumers to easily filter terms that are appropriate for self reporting rather than requiring the manual review and cherry picking of terms.

Mainly as a historical record of the decisions that were made. It can come in handy when explaining the decisions and/or in the future when we have to revise HANCESTRO.

brianraymor added schema CELLxGENE Discover dataset schema discovery labels Aug 25, 2022

brianraymor self-assigned this Aug 25, 2022

brianraymor added the 4.0 Next major CELLxGENE schema version label Nov 9, 2022

brianraymor removed the discovery label Nov 18, 2022

brianraymor mentioned this issue Dec 7, 2022

Increase the velocity of dataset schema evolution and dataset migration chanzuckerberg/single-cell#365

Closed

brianraymor changed the title ~~self-reported_ethnicity must allow multiple HANCESTRO terms~~ self_reported_ethnicity must allow multiple HANCESTRO terms Apr 20, 2023

brianraymor added discovery blocked labels Jun 16, 2023

brianraymor removed discovery blocked labels Jul 13, 2023

brianraymor mentioned this issue Jul 28, 2023

Updated self reported ethnicities #583

Merged

brianraymor closed this as completed in #583 Jul 28, 2023
