Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

do or do not (there is no try) _colors #216

Closed
brianraymor opened this issue Apr 27, 2022 · 28 comments · Fixed by #580
Closed

do or do not (there is no try) _colors #216

brianraymor opened this issue Apr 27, 2022 · 28 comments · Fixed by #580
Assignees
Labels
4.0 Next major CELLxGENE schema version data-viz Data Viz Team schema CELLxGENE Discover dataset schema

Comments

@brianraymor
Copy link
Contributor

brianraymor commented Apr 27, 2022

Design

uns (Dataset Metadata)

uns is a ordered dictionary with a str key. Curators MAY annotate the following keys and values in uns:

{category}_colors

Key {category}_colors where {category} MUST be the name of a category data type column in obs that
is annotated by the data submitter or curator. The following columns that are annotated by CELLxGENE
Discover MUST NOT be specified as {category}:

  • assay
  • cell_type
  • development_stage
  • disease
  • organism
  • self_reported_ethnicity
  • sex
  • tissue

Annotator Curator
Value numpy.ndarray. This MUST be a 1-D array of shape (, c), where c is greater than or equal to the
number of unique categories in the {category} column as calculated by:

len(anndata.obs.{category}.unique())

The color code at the Nth position in the ndarray corresponds to the Nth category of anndata.obs.{category}.unique().

For example, if cell_type_ontology_term_id includes two unique categories:

anndata.obs.cell_type_ontology_term_id.unique()

['CL:0000057', 'CL:0000115']
Categories (2, object): ['CL:0000057', 'CL:0000115']


then cell-type_ontology_term_id_colors MUST contain two or more colors such as:

['aqua' 'blueviolet']

where 'aqua' is the color assigned to 'CL:0000057' and 'blueviolet' is the color assigned to
'CL:0000115'.

All elements in the ndarray MUST use the same color model, limited to:

Color Model Element Format
Named Colors str. MUST be a case-insensitive CSS4 color name with no spaces such as
"aliceblue"
Hex Triplet str. MUST start with "#" immediately followed by six case-insensitive hexadecimal
characters as in "#08c0ff"

Context

_colors was not included in schema 2.0.0 because it was perceived as an under documented feature of cellxgene desktop and was not present in most submissions in the portal at the time. Also see category colors and RFC: User-defined colors.

There was no validation performed by cellxgene-schema CLI 1.1.0 or in the subsequent cellxgene conversion so support was brittle. The current CXG conversion code limits validation to color format:

convert_anndata_category_colors_to_cxg_category_colors

Recently, there was a CXG conversion failure during data ingestion reported in #single-cell-data-wrangling due to the following causes:

  1. at least one color table existed for a non-existent obs column
  2. at least one color table existed for a non-categorical obs column

_colors should either be blocked+fail or documented+validated in the future.


There are more testing details documented in the Category Colors section of Prepare Data;

To test that you've done this properly, check that for your given category the number of colors match the number of category values and that the second command below results in a mapping from categories to colors.

>>> len(adata.obs[category].cat.categories) == len(adata.uns[f"{category}_colors"])

True

>>> dict(zip(adata.obs[category].cat.categories, adata.uns[f"{category}_colors"]))
{'CD4 T cells': '#1f77b4', 'CD14+ Monocytes': '#ff7f0e', 'B cells': '#2ca02c', 'CD8 T cells': '#d62728', 'NK cells': '#9467bd', 'FCGR3A+ Monocytes': '#8c564b', 'Dendritic cells': '#e377c2', 'Megakaryocytes': '#bcbd22'}
@brianraymor
Copy link
Contributor Author

Boom!

@jahilton
Copy link
Collaborator

Caught another issue.
uns[sex_colors] is a list of 2 values, but there are 3 unique values in obs[sex]. The dataset in the video also has uns[ever_smoker_colors] less than the length of obs[ever_smoker].unique()
Trying to color by sex or ever_smoker results in failures.

colors_bug.mov

@seve
Copy link
Contributor

seve commented Apr 5, 2023

We found another instance of this issue with cell_type within this dataset https://cellxgene.cziscience.com/e/3bbb6cf9-72b9-41be-b568-656de6eb18b5.cxg/

[since revised] See #single-cell-data-wrangling.

@brianraymor brianraymor added the 4.0 Next major CELLxGENE schema version label Apr 12, 2023
@brianraymor
Copy link
Contributor Author

@signechambers1 to document the full set of Explorer requirements by Q2 which is required for inclusion in schema 4.0.

@signechambers1
Copy link

signechambers1 commented Jun 28, 2023

@brianraymor here are the H5AD criteria for _colors functionality for explorer:

  • adata.obs["{category}"] MUST be of type pandas.Categorical
  • adata.uns["{category}_colors"] MUST be a list of hex-code colors corresponding to each category in adata.obs["{category"].cat.categories
  • If any uns color entry corresponds to a non-existent category, validation MUST fast-fail
  • If any uns color entry corresponds to a category that is improperly formatted (i.e. not a pandas.Categorical or does not have the right number of labels), validation MUST fast-fail
  • len(adata.uns["{category}_colors"]) >= len(adata.obs["{category}"].unique())

Thank you @atarashansky for the help. @brianraymor let us know if you have suggestions or questions.

@brianraymor
Copy link
Contributor Author

CC: @jahilton @jychien - can you please review based on your experience in debugging _colors?

  1. There is no explanation for how _colors is surfaced in Explorer. You might compare with embeddings which needs an update to the screenshot due to the new Explorer layout.
  2. The new requirements do not match the earlier requirements in schema 1.1.0 - which were also not enforced. For example, the 1.1.0 version supported multiple formats for the values. 1.1.0 also specified "for any other categorical integer metadata field".

Here's the relevant text which was also linked in the top-level summary comment for this issue as a reference:

Presentation Hints

The metadata fields below are optional. They aren't needed for integration, and cellxgene can display the data fine without them, but if they are included cellxgene will do something with them. This allows submitters to fine-tune how their datasets are presented, which is a common request.

Field name Description
color_map Submitters can include a field called "{field}_colors" for any other categorical integer metadata field. The value must be an array of one of fourcolor specifications:
  • CSS4 color name, as supported by matplotlib https://matplotlib.org/3.1.0/gallery/color/named_colors.html
  • RGB tuple/list with values ranging from 0.0 to 1.0, as in [0.5, 0.75, 1.0]
  • RFB tuple/list with values ranging from 0 to 255, as in [128, 192, 255]
  • Hex triplet string, as in "#08c0ff"and each string must be a hex color code.
The color code at the nth position in the array corresponds to category n in the metadata field.

@jahilton
Copy link
Collaborator

There is definitely a length requirement where you need to have as many colors as you have categories (comment above)...
I believe len(adata.uns["{category}_colors"]) >= len(adata.obs["{category}"].unique())


If any uns color entry corresponds to a non-existent category, validation MUST fast-fail

Currently, this case is ignored & passes validation seemingly without any downstream issues. We do have a block of qa code that checks for this as 'best practice' so I don't have any concern enforcing it in the validator, but just wanted to note the change in behavior. (clean-up of existing datasets would be incredibly straightforward during a migration)

@jychien
Copy link
Collaborator

jychien commented Jun 29, 2023

  • Scanpy has documentation for palette for it's plot function, where it points to matplotlib colors as being valid. Not sure what is meant by the wording of the 1.1.0 schema, but my understanding is also that the corresponding column in obs need to be Categorical.
  • Looks like '(category)_colors' can be an array or list, but unclear if it is implied that the schema will be requiring it to be one and not the other.
    Screenshot 2023-06-28 at 5 28 12 PM

@signechambers1
Copy link

Thank you all! Added the length requirement to my original comment.

@atarashansky can you determine if multiple color formats will work, whether an array or list is expected by explorer, and if colors work for categorical and categorical integer fields?

@brianraymor
Copy link
Contributor Author

If possible, please write requirements in text and not in code fragments.

I called this out earlier. Isn't this a missing constraint?

Submitters can include a field called "{field}_colors" for any other categorical integer metadata field.

@signechambers1
Copy link

Updated requirements in plain text with outstanding questions:

  • Fields for which colors are defined MUST be categorical fields or categorical integer fields (@atarshansky to confirm categorical integer fields are supported)
  • Colors MUST NOT be defined for categorical fields that do not exist
  • The number of colors defined MUST be equal to or greater than the number of unique values in the categorical field (@atarshansky can you confirm if there are more colors than values it will not error?)
  • Colors MUST be a list of hex-code colors (@atarshansky to determine if other formats work)
  • Open question: Should the schema require colors to be a list or array? (@atarshansky could you look into this too please?)

@atarashansky
Copy link

atarashansky commented Jul 10, 2023

Fields for which colors are defined MUST be categorical fields or categorical integer fields (@atarshansky to confirm categorical integer fields are supported)

Categorical integer fields are supported so long as they are pandas Categoricals.

Colors MUST NOT be defined for categorical fields that do not exist

I think colors actually can be defined for categorical fields that do not exist - those colors will just be skipped and not added to the cxg group metadata.

The number of colors defined MUST be equal to or greater than the number of unique values in the categorical field (@atarshansky can you confirm if there are more colors than values it will not error?)

If there are more colors, it will not error. The extra colors will be ignored.

Colors MUST be a list of hex-code colors (@atarshansky to determine if other formats work)

Ah, thank you for correcting me @brianraymor. Here are the actual color requirements. As you indicated, this is validated and will raise an exception so we're good on this front:

The function accepts for the following formats:
- A CSS4 color name, as supported by matplotlib https://matplotlib.org/3.1.0/gallery/color/named_colors.html
- RGB tuple/list with values ranging from 0.0 to 1.0, as in [0.5, 0.75, 1.0]
- RFB tuple/list with values ranging from 0 to 255, as in [128, 192, 255]
- Hex triplet string, as in "#08c0ff"

Open question: Should the schema require colors to be a list or array? (@atarshansky could you look into this too please?)

Either numpy array or python lists work.

@brianraymor
Copy link
Contributor Author

@atarashansky @signechambers1 - should this reference be updated to - A CSS4 color name, as supported by matplotlib https://matplotlib.org/stable/gallery/color/named_colors.html or did Explorer pin 3.1.0?

@brianraymor
Copy link
Contributor Author

@atarashansky - should the shape of the _colors ndarray be derived from:

  1. adata.obs['category'].cat.categories style is referenced in CXG validation.
  2. adata.obs['category'].unique() style is also referenced in this issue based on cellxgene guidance.

@bkmartinjr kindly educated me on the differences, because I was too lazy to RTFM this evening:

cat.categories - returns all possible values, even if they are not present
unique() - returns the used values

I don't have a horse in this race, but want to document only one approach where the schema matches the implementation. Recommendation?

CC:@jahilton @signechambers1

@atarashansky
Copy link

@atarashansky @signechambers1 - should this reference be updated to - A CSS4 color name, as supported by matplotlib https://matplotlib.org/stable/gallery/color/named_colors.html or did Explorer pin 3.1.0?

I don't even see matplotlib as a dependency in explorer.

@brianraymor
Copy link
Contributor Author

I don't even see matplotlib as a dependency in explorer.

By Explorer, I meant the CXG conversion code for Explorer. There must be some method for validating CSS4 color names?

  • RFB tuple/list with values ranging from 0 to 255, as in [128, 192, 255]

I'm unfamiliar with a color model named RFB. From my perspective, it's simply a different way of specifying RGB as illustrated by CSS 4 examples:

The color type provides multiple ways to syntactically specify a given color. For example, the following declarations all specify the sRGB color “lime”:

  • { color: lime; } /* color keyword */
  • { color: rgb(0 255 0); } /* RGB range 0-255 */
  • { color: rgb(0% 100% 0%); } /* RGB range 0%-100% */
  • { color: color(sRGB 0 1 0); } /* sRGB range 0.0-1.0 */

Could we eliminate /* RGB range 0-255 */ as a supported case. I was reviewing matplotlib which does not seem to support the RGB range 0-255 style, preferring:

RGB or RGBA (red, green, blue, alpha) tuple of float values in a closed interval [0, 1].

@brianraymor
Copy link
Contributor Author

@signechambers1 @atarashansky

RE

The number of colors defined MUST be equal to or greater than the number of unique values in the categorical field (@atarshansky can you confirm if there are more colors than values it will not error?)

If there are more colors, it will not error. The extra colors will be ignored.

I'm reviewing a live example that defines 48 colors for 53 categoricals.

Is the actual requirement - "There must be at least one color?"

@atarashansky
Copy link

Can you link me to this example? If there are 48 colors for 53 categoricals, then that should cause a bug in explorer.

@atarashansky
Copy link

I don't even see matplotlib as a dependency in explorer.

By Explorer, I meant the CXG conversion code for Explorer. There must be some method for validating CSS4 color names?

  • RFB tuple/list with values ranging from 0 to 255, as in [128, 192, 255]

I'm unfamiliar with a color model named RFB. From my perspective, it's simply a different way of specifying RGB as illustrated by CSS 4 examples:

The color type provides multiple ways to syntactically specify a given color. For example, the following declarations all specify the sRGB color “lime”:

  • { color: lime; } /* color keyword */
  • { color: rgb(0 255 0); } /* RGB range 0-255 */
  • { color: rgb(0% 100% 0%); } /* RGB range 0%-100% */
  • { color: color(sRGB 0 1 0); } /* sRGB range 0.0-1.0 */

Could we eliminate /* RGB range 0-255 */ as a supported case. I was reviewing matplotlib which does not seem to support the RGB range 0-255 style, preferring:

RGB or RGBA (red, green, blue, alpha) tuple of float values in a closed interval [0, 1].

Looks like we pin Matplotlib to 3.6.3:
https://github.com/chanzuckerberg/single-cell-data-portal/blob/ec494df7a463e2191607d26874dd604333fa4e36/backend/api_server/requirements.txt#L7

Here, it looks like we always cast to 0-255:
https://github.com/chanzuckerberg/single-cell-data-portal/blob/ec494df7a463e2191607d26874dd604333fa4e36/backend/common/utils/color_conversion_utils.py#L180

I agree we should change it to match matplotlib specifications/preferences.

@atarashansky
Copy link

  1. Is the len of the _colors array based on unique(), cat.categories, or is is just "MUST be at least 1 color"? This is the 48 colors for 53 categoricals case.

The only requirement is that the len of _colors must be greater than or equal to the number of categories (unique() == cat.categories).

  1. Both python lists and numpy arrays are currently allowed which increases the text matrix. I recommend stricter requirements and have only allowed ndarrays.

Why ndarrays as opposed to pure lists? I would prefer lists. ndarray does not offer any additional functionality, here.

  1. Supporting four color formats increases the test matrix. @atarashansky has agreed with the suggestion to eliminate RGB range 0-255 because it does not seem to be a supported format in matplotlib which prefers RGB or RGBA (red, green, blue, alpha) tuple of float values in a closed interval [0, 1].

Correct.

@brianraymor
Copy link
Contributor Author

@atarashansky

The only requirement is that the len of _colors must be greater than or equal to the number of categories (unique() == cat.categories).

I'm not sure how to interpret this statement. Are you indicating that you believe unique to be equivalent to cat.categories? If so, please see this earlier comment.

@brianraymor
Copy link
Contributor Author

Why ndarrays as opposed to pure lists? I would prefer lists. ndarray does not offer any additional functionality, here.

It will be converted to an ndarray when written. See Saving then reading turns list into numpy.ndarray?.


adata_valid.uns['cell_type_colors'] = ['aqua', 'blueviolet', 'chartreuse']
print(type(adata_valid.uns['cell_type_colors']))

<class 'list'>

adata_valid.write('conversion.h5ad')
adata_valid = sc.read_h5ad('conversion.h5ad')
print(type(adata_valid.uns['cell_type_colors']))

<class 'numpy.ndarray'>

@brianraymor
Copy link
Contributor Author

Of the three choices:

Confirming the consensus from the #single-cell-data-wrangling:

  1. Detect the case and automate renaming tissue_ontology_term_id_colors to tissue_colors
  2. Detect the case and automate copying tissue_ontology_term_id_colors to tissue_colors (maintains the submitters original choice)
    3. Change Explorer to just use the term id(s) which are present in the original submission (labels are applied later by CELLxGENE during ingestion)

Option 3 was the preference. My assumption is that Explorer continues to display the ontology labels but uses the related term ids for colors?

This needs to be documented in the schema. Are we going to warn or fail if a submitter specifies tissue_colors?

@atarashansky
Copy link

It will be converted to an ndarray when written. See Saving then reading turns list into numpy.ndarray?.

Gotcha, wasn't aware of this.

I'm not sure how to interpret this statement. Are you indicating that you believe unique to be equivalent to cat.categories? If so, please see this earlier comment.

You interpreted correctly, and TIL. Then the requirements should be unique() <= len(colors).

Option 3 was the preference. My assumption is that Explorer continues to display the ontology labels but uses the related term ids for colors?

At conversion time, Explorer will be updated to assign the custom colors to the corresponding label column's categories.

@brianraymor
Copy link
Contributor Author

RE

This needs to be documented in the schema. Are we going to warn or fail if a submitter specifies tissue_colors?

This must fail then because tissue_colors will not be present during validation. It's similar to specifying any other non-existent obs category.

@atarashansky
Copy link

RIght, yes, agreed!

@brianraymor
Copy link
Contributor Author

Inquiring minds. What is the scenario for allowing the length of _colors to be greater than unique()? Is there an expectation that the submitter may be reusing color arrays across multiple categories some of which may be larger than others or ... ?

CC: @jahilton

@atarashansky
Copy link

@brianraymor That would make sense if they just wanted to apply a custom discrete colormap to all categories.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4.0 Next major CELLxGENE schema version data-viz Data Viz Team schema CELLxGENE Discover dataset schema
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants