self_reported_ethnicity must allow multiple HANCESTRO terms #334

Closed
brianraymor opened this issue Aug 25, 2022 · 15 comments · Fixed by #583
Labels
4.0 Next major CELLxGENE schema version schema CELLxGENE Discover dataset schema

Comments

@brianraymor
Contributor

brianraymor commented Aug 25, 2022

Design

Notes:

I recommend that this issue be addressed last when updating cellxgene-schema CLI. Its design may be simplified prior to Schema 4 delivery IF HANCESTRO is updated to address pending requests:

  1. The ontology is modeled to make it possible to identify HANCESTRO terms that are appropriate for the self-reported use case. The ontologist is planning to ensure that each term is tagged accordingly, so we could infer 2 new parent classes that automatically give us a full list of geographic descriptors and of ethnicity descriptors. Currently, CELLxGENE must assess the entire ontology and cherry-pick terms to block, which increases the complexity of curation and validation.
  2. Preferred term label for HANCESTRO:0015 was addressed.

Also see #single-cell-data-wrangling.


obs (Cell Metadata)

obs is a pandas.DataFrame.

Curators MUST annotate the following columns in the obs dataframe:

self_reported_ethnicity_ontology_term_id

Key self_reported_ethnicity_ontology_term_id
Annotator Curator
Value categorical with str categories. If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, the value MUST be formatted as one or more comma-separated (with no leading or trailing spaces) HANCESTRO terms in ascending lexical order or "unknown" if unavailable.

For example, if the terms are "HANCESTRO:0014" and "HANCESTRO:0005" then the value of self_reported_ethnicity_ontology_term_id MUST be "HANCESTRO:0005,HANCESTRO:0014".

The following terms MUST NOT be used:


Otherwise, for all other organisms the str value MUST be "na".
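
A minimal sketch of how the rule above could be checked. The helper name `is_valid_ethnicity_term_id` is hypothetical and is not part of the cellxgene-schema CLI. The four-digit identifier pattern is inferred from the examples in this issue, the blocked terms are passed in as a parameter because the list is not reproduced here, and rejecting duplicate terms is an assumption the draft does not state.

```python
import re

HUMAN = "NCBITaxon:9606"
# Assumption: identifiers are "HANCESTRO:" plus four digits,
# as in the examples quoted in this issue.
TERM_PATTERN = re.compile(r"HANCESTRO:\d{4}")


def is_valid_ethnicity_term_id(value, organism, blocked=frozenset()):
    """Return True if value satisfies the draft rule for the given organism."""
    if organism != HUMAN:
        return value == "na"
    if value == "unknown":
        return True
    terms = value.split(",")  # comma only, no surrounding spaces
    if not all(TERM_PATTERN.fullmatch(term) for term in terms):
        return False
    if any(term in blocked for term in terms):
        return False
    # Ascending lexical order; duplicates are rejected (an assumption).
    return terms == sorted(set(terms))
```

Under these assumptions, "HANCESTRO:0005,HANCESTRO:0014" passes for a human donor, while the same terms in reverse order or separated by comma plus space do not.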


When a dataset is uploaded, CELLxGENE Discover MUST automatically add the matching human-readable name for the corresponding ontology term to the obs dataframe. Curators MUST NOT annotate the following columns.

self_reported_ethnicity

Key self_reported_ethnicity
Annotator CELLxGENE Discover
Value categorical with str categories. This MUST be "na" or "unknown" if set in self_reported_ethnicity_ontology_term_id; otherwise, this MUST be one or more comma-separated (with no leading or trailing spaces) human-readable names for the terms in self_reported_ethnicity_ontology_term_id in the same order.

For example, if the value of self_reported_ethnicity_ontology_term_id is "HANCESTRO:0005,HANCESTRO:0014" then the value of self_reported_ethnicity is "European,Hispanic or Latin American".
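
The derivation described above might look like the following sketch. The function name `ethnicity_label` and the `LABELS` lookup are hypothetical; the lookup holds only the two labels quoted in this issue, whereas a real implementation would presumably read labels from the pinned HANCESTRO release.

```python
# Hypothetical lookup; only labels quoted in this issue are included.
LABELS = {
    "HANCESTRO:0005": "European",
    "HANCESTRO:0014": "Hispanic or Latin American",
}


def ethnicity_label(term_id_value, labels=LABELS):
    """Map a comma-separated term id value to its comma-separated labels."""
    # "na" and "unknown" are copied through unchanged.
    if term_id_value in ("na", "unknown"):
        return term_id_value
    # Labels are emitted in the same order as the term ids.
    return ",".join(labels[term] for term in term_id_value.split(","))
```

With this lookup, `ethnicity_label("HANCESTRO:0005,HANCESTRO:0014")` returns "European,Hispanic or Latin American", matching the example above.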


schema v4.0.0

  • obs (Cell metadata)
    • Updated self_reported_ethnicity_ontology_term_id
    • Updated self_reported_ethnicity

Context

To support accurate metadata for multiethnic donors (donors who have selected more than one ethnicity), self_reported_ethnicity must allow multiple HANCESTRO terms.

@brianraymor brianraymor added schema CELLxGENE Discover dataset schema discovery labels Aug 25, 2022
@brianraymor brianraymor self-assigned this Aug 25, 2022
@brianraymor brianraymor added the 4.0 Next major CELLxGENE schema version label Nov 9, 2022
@brianraymor brianraymor changed the title from "self-reported_ethnicity must allow multiple HANCESTRO terms" to "self_reported_ethnicity must allow multiple HANCESTRO terms" Apr 20, 2023
@jahilton
Collaborator

2 Collections use multiethnic currently so those will be replaced by a list of multiple HANCESTRO terms

In Lattice DB, yet to get to CxG, there are 26 donors in another study that would have multiple terms

@jahilton
Collaborator

jahilton commented Jul 18, 2023

For multiple terms, are we OK with comma+space separated, like the current examples? And/or comma only?


Are we prepared for a comma within a single label? HANCESTRO:0015 is Greater Middle Eastern (Middle Eastern, North African or Persian). Is there another delimiter we could use to avoid any parsing confusion/errors?

@brianraymor responds: This has been addressed with the ontologist.


The order of multiple terms is not significant

While I agree that curators & submitters ideally won't need to worry about it, it seems like the terms should be ordered for users, so that Census users won't see both HANCESTRO:0005, HANCESTRO:0008 & HANCESTRO:0008, HANCESTRO:0005 and cxg viewers won't see both European, Asian & Asian, European, etc.


Couple typos in the URLs:

  • "HANCESTRO:0002" - current link is for HANCESTRO_0001
  • "HANCESTRO:0324" - current link is for HANCESTRO_0323

@brianraymor
Contributor Author

Couple typos in the URLs:

Thanks for catching. Corrected.

@brianraymor
Contributor Author

brianraymor commented Jul 18, 2023

For multiple terms, are we OK with comma+space separated, like the current examples? And/or comma only?

I was tempted to write some ABNF earlier.

Are we prepared for a comma within a single label? HANCESTRO:0015 is Greater Middle Eastern (Middle Eastern, North African or Persian). Is there another delimiter we could use to avoid any parsing confusion/errors?

I meant to check the presence of commas in labels. Doh. This is primarily an issue because we're unable to use arrays for the list of self reported ethnicities. Possible options:

  1. Since there's only one case where a HANCESTRO label includes a comma, we could request that it be replaced by an "or"? And ask that no further comma(s) be specified?
  2. We replace the comma with a separator that's unlikely to appear in an HANCESTRO label such as "|".

@pablo-gar @bkmartinjr @atarashansky - do you have preferences?
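
To illustrate the parsing problem behind these options, here is a small sketch that joins the HANCESTRO:0015 label with "European" using each candidate separator and splits it back. The variable names are illustrative only.

```python
labels = [
    "Greater Middle Eastern (Middle Eastern, North African or Persian)",
    "European",
]

# Joining with a comma is ambiguous because one label contains a comma.
comma_joined = ",".join(labels)
comma_split = comma_joined.split(",")

# Joining with "|" round-trips, as long as no label contains "|".
pipe_joined = "|".join(labels)
pipe_split = pipe_joined.split("|")

print(len(comma_split))      # three fragments instead of two labels
print(pipe_split == labels)  # the original two labels are recovered
```

The comma version yields three fragments for two labels, which is the confusion raised earlier in the thread; the "|" version only works if the separator never appears in a label.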

@brianraymor
Contributor Author

brianraymor commented Jul 18, 2023

The order of multiple terms is not significant

While I agree that curators & submitters ideally won't need to worry about it, it seems like the terms should be ordered for users, so that Census users won't see both HANCESTRO:0005, HANCESTRO:0008 & HANCESTRO:0008, HANCESTRO:0005 and cxg viewers won't see both European, Asian & Asian, European, etc.

I had an earlier draft that normalized the labels by ascending lexicographical order, which would then require a rewrite of their term id(s) to match. Based on my earlier draft, your example above:

HANCESTRO:0005, HANCESTRO:0008 would be normalized to HANCESTRO:0008, HANCESTRO:0005 to match the ordered labels Asian, European.

Another option is to normalize by ascending term id(s) but then the labels are more random:

HANCESTRO:0008, HANCESTRO:0005 becomes HANCESTRO:0005, HANCESTRO:0008

CC: @pablo-gar @bkmartinjr @atarashansky
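
The two normalization options can be compared with a short sketch. The function names are hypothetical, and the `LABELS` lookup contains only the two terms used in the example above.

```python
# Hypothetical lookup limited to the two terms from the example.
LABELS = {"HANCESTRO:0005": "European", "HANCESTRO:0008": "Asian"}


def order_by_term_id(term_ids):
    """Option: ascending term ids; labels follow in whatever order results."""
    return sorted(term_ids)


def order_by_label(term_ids, labels=LABELS):
    """Option: ascending labels; term ids are rewritten to match."""
    return sorted(term_ids, key=lambda term: labels[term])


by_id = order_by_term_id(["HANCESTRO:0008", "HANCESTRO:0005"])
by_label = order_by_label(["HANCESTRO:0005", "HANCESTRO:0008"])
print(by_id, [LABELS[term] for term in by_id])
print(by_label, [LABELS[term] for term in by_label])
```

Either function gives one canonical ordering per set of terms; they differ only in which of the two columns ends up alphabetized.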

@jahilton
Collaborator

RE:ordering - both options result in consistent display of the same values so I don't have a strong preference. Maybe slight preference for lexicographical order of labels (users are more likely to see the A-->Z pattern & therefore less likely to think the order means something significant?)

@bkmartinjr
Contributor

bkmartinjr commented Jul 18, 2023

Re: array syntax:

  • prefer a single-character delimiter (not the space+comma combination used in the first example).
  • also prefer a delimiter which is illegal in an ontology term, so we do not have to worry about escaping it. That makes the typical query (via string regex) feasible. OTOMH, I don't know what is legal/illegal in RDF terms (and I hope OBO follows the same?), but I'd look to that grammar to find a character that we can safely use. Perhaps a simple space is the best choice?

Re: ordering:

  • I don't think the Census cares strongly about ordering unless there are semantics associated with the ordering (there are weak arguments for a (any) canonical ordering as that would help compression, categorical types, etc). I agree that visualization tools do care, but I think they care in different ways, and can apply their own logic (i.e., it does not need to be codified in the schema). In Jason's "cxg" (I think he means Explorer) example, it is trivial for the Explorer backend to sort these at CXG creation time.
  • Will there be a requirement to have the two columns ordered the same (i.e., positions match)?
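
As a rough illustration of the regex-based query mentioned above, the sketch below searches delimited term id values for one term. The helper `has_term` and the sample values are hypothetical; it assumes the delimiter can never appear inside a term id, which is the property being asked for.

```python
import re

# Sample values for illustration only.
values = [
    "HANCESTRO:0005",
    "HANCESTRO:0005,HANCESTRO:0008",
    "HANCESTRO:0008",
    "unknown",
]


def has_term(value, term_id, delimiter=","):
    """Return True if term_id is one of the delimited entries in value."""
    d = re.escape(delimiter)
    # The term must be bounded by the string edges or the delimiter.
    pattern = rf"(?:^|{d}){re.escape(term_id)}(?:{d}|$)"
    return re.search(pattern, value) is not None


matches = [value for value in values if has_term(value, "HANCESTRO:0005")]
print(matches)
```

Because the pattern needs no escaping logic beyond the delimiter itself, the query stays simple for term ids; the labels column does not have this property when a label can contain the delimiter.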

@brianraymor
Contributor Author

brianraymor commented Jul 18, 2023

Perhaps a simple space is the best choice?

I may be missing your point, but there are many cases where the HANCESTRO labels include spaces like "South East Asian" or consider Jason's outlier :

"Greater Middle Eastern (Middle Eastern, North African or Persian) separator European"

prefer a delimiter which is illegal in an ontology term

I could see if OWL(s) have different constraints beyond XML.

What about using multiple characters such as "<br>"? There are some crazier ideas out there, like scanning the source text and finding a unique character.

@brianraymor
Contributor Author

brianraymor commented Jul 18, 2023

I've asked Dani if the one label could be updated to not use a comma. Fingers crossed.

Could we double-quote terms within single-quoted strings?

import re
label = '"Greater Middle Eastern (Middle Eastern, North African, or Persian)", "European"'
# Capture the text between each pair of double quotes.
labels = re.findall(r'"([^"]*)"', label)

which produces a helpful list:

['Greater Middle Eastern (Middle Eastern, North African, or Persian)', 'European']

@bkmartinjr
Contributor

bkmartinjr commented Jul 18, 2023

I may be missing your point, but there are many cases where the HANCESTRO labels include spaces like "South East Asian" or consider Jason's outlier :

You were not - I was myopically focused on the term id syntax, and forgot about the labels. Sigh. We may simply have to escape labels, although that will make any regex-based search interesting.

@pablo-gar
Contributor

pablo-gar commented Jul 19, 2023

Everything looks good to me. Great job on going over the HANCESTRO ontology and excluding terms.

Only two points:

  • I propose we use ; as a separator.
  • We should include the curated spreadsheet at the very least in this ticket and ideally in the schema. For that spreadsheet I suggest naming the tab as "HANCESTRO_version..." to know exactly what version it is.

@brianraymor
Contributor Author

We should include the curated spreadsheet at the very least in this ticket and ideally in the schema. For that spreadsheet I suggest naming the tab as "HANCESTRO_version..." to know exactly what version it is.

Can you say more about the rationale? The schema pins a HANCESTRO version. The requirements for self_reported_ethnicity_term_id ensure that only terms appropriate for self reporting can be selected by curators. (There are 290 appropriate terms after the filtering) I'm happy to link the spreadsheet to the issue if it's helpful for others to check the math.

For the future, we've asked Dani to consider adding a property that would allow HANCESTRO consumers to easily filter terms that are appropriate for self reporting rather than requiring the manual review and cherry picking of terms.

@brianraymor
Contributor Author

brianraymor commented Jul 19, 2023

After reviewing with Dani, I opened Preferred term label for HANCESTRO:0015. This is targeted for first half of August.

@brianraymor
Contributor Author

And since I'm knee deep in clarifying requirements for #216 - any reordering of terms/labels could impact _colors if a submitter was coloring by this field.

@pablo-gar
Contributor

Can you say more about the rationale? The schema pins a HANCESTRO version. The requirements for self_reported_ethnicity_term_id ensure that only terms appropriate for self reporting can be selected by curators. (There are 290 appropriate terms after the filtering) I'm happy to link the spreadsheet to the issue if it's helpful for others to check the math.

For the future, we've asked Dani to consider adding a property that would allow HANCESTRO consumers to easily filter terms that are appropriate for self reporting rather than requiring the manual review and cherry picking of terms.

Mainly as a historical record of the decisions that were made. It can come in handy when explaining the decisions and/or in the future when we have to revise HANCESTRO.
