
chore: modify test dataset to contain edge cases for testing purposes #873

Open · seve opened this issue Apr 2, 2024 · 6 comments
Labels: P2 (Priority 2 - Improvement with narrower impact, fix within a month), testing



seve commented Apr 2, 2024

  • Differing number of cells for spatial vs non-spatial embedding (sketched below)
  • Singleton categorical column
  • Singleton continuous column
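
For concreteness, a minimal sketch of what these edge cases could look like when injected into an AnnData object; the column names, embedding keys, and file path are illustrative, not part of this issue:

```python
# Illustrative sketch only — assumes anndata, numpy, and pandas are installed.
import anndata as ad
import numpy as np
import pandas as pd

adata = ad.read_h5ad("example.h5ad")  # hypothetical input dataset
rng = np.random.default_rng(0)

# Singleton categorical column: a single category shared by every cell.
adata.obs["singleton_cat"] = pd.Categorical(["only_value"] * adata.n_obs)

# Singleton continuous column: a single repeated float value.
adata.obs["singleton_cont"] = np.full(adata.n_obs, 1.0, dtype=np.float32)

# Differing cell counts between spatial and non-spatial embeddings:
# obsm arrays must always have n_obs rows, so one way to simulate the
# mismatch is to NaN out a subset of rows in the spatial embedding only.
adata.obsm["X_umap"] = rng.normal(size=(adata.n_obs, 2))
spatial = rng.normal(size=(adata.n_obs, 2))
spatial[: adata.n_obs // 10] = np.nan  # first 10% of cells lack spatial coords
adata.obsm["X_spatial"] = spatial

adata.write_h5ad("example.h5ad")
```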

Related Tickets: #872 #870

seve added the testing label Apr 2, 2024
seve self-assigned this Apr 2, 2024

seve commented Apr 2, 2024

Related: chanzuckerberg/cellxgene#1578


seve commented Apr 15, 2024

Aligned with timmy on this ticket: we have a great opportunity right now to consolidate all three test datasets (pbmc3k, super-cool-spatial, and truncation-test) into a single one!


kaloster commented May 8, 2024

Following a good discussion with @seve here - summarizing my take on the subject:

On a single consolidated dataset:

  • Having one consolidated dataset has the benefit of being lean: it eliminates the need to commit multiple datasets to the repo, with a single dataset covering all edge cases.
  • The downsides would be that:
    • We would have a manually modified dataset, diverged from its original form and detached from schema updates and migrations, and the modifications would need to be documented
    • Which means it would realistically get stale quickly with future schema changes and migrations
    • Risk of losing context of what was changed and how (even with a documented process via a readme, which would be another thing to maintain)
    • Manually modifying a dataset to cover edge cases outside the validation/migration workflow is generally not good practice and can introduce other issues
    • Generally, it would be one more thing we would have to maintain and document

The option of using actual prod datasets:

  • Find a couple of datasets that encapsulate all the edge cases we need to dev/test for, and keep those unmodified in their original state, with the original dataset id, as CXGs under example-dataset
  • Whenever there's a schema update, download the refreshed h5ad from the assets bucket and regenerate the test CXGs (or even just download the regenerated CXG from the hosted bucket and commit it to the repo); a rough sketch follows this list
  • No need to maintain and document a modified dataset; just refresh existing assets
  • Dev and test would be done against up-to-date prod data
  • Tapping into an existing, robust workflow vs creating a new one is generally preferred
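
A rough sketch of that refresh step, assuming boto3; the bucket and key names are placeholders, since the real asset locations aren't named in this thread:

```python
# Illustrative sketch only — bucket and key are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
s3.download_file(
    "example-assets-bucket",             # stand-in for the prod assets bucket
    "datasets/<dataset_id>/local.h5ad",  # refreshed h5ad after a schema update
    "local.h5ad",
)
# The downloaded h5ad would then be regenerated into a CXG and committed
# under example-dataset/.
```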

EDIT: I'm fully aligned with the approach Seve suggested below, leaving this here for posterity

tihuan added the P2 (Priority 2 - Improvement with narrower impact, fix within a month) label May 8, 2024

seve commented May 10, 2024

What if we do the following:

  1. we select a spatial dataset from prod
  2. download the local.h5ad
  3. create a notebook that adds any necessary testing data to the dataset (a sketch follows this list)
  4. store that notebook somewhere in the explorer repo
  5. run the notebook against the .h5ad
  6. run the conversion script against the .h5ad
  7. commit the generated .cxg to the repo
  8. repeat as needed (not every schema migration)
  9. nice to have: create a GHA cron job that pulls the dataset from prod weekly, runs the notebook, and commits the cxg to /example-dataset
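
A sketch of how steps 5-6 might be scripted; the notebook filename is hypothetical, and the actual h5ad-to-CXG conversion script isn't named in this thread:

```python
# Illustrative sketch only — executes the edge-case notebook against the
# .h5ad (step 5); "add_edge_cases.ipynb" is a hypothetical filename.
import subprocess

subprocess.run(
    ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace",
     "add_edge_cases.ipynb"],
    check=True,
)
# Step 6 would invoke the repo's conversion script on the resulting .h5ad
# here, and the generated .cxg gets committed (step 7). The GHA cron from
# step 9 could run this same script on a weekly schedule.
```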


seve commented May 10, 2024

I believe it's worth noting that our current main test dataset (pbmc3k.cxg) is rarely updated, and I'm not sure it's ever gone through a schema migration.

https://github.com/chanzuckerberg/single-cell-explorer/commits/main/example-dataset/pbmc3k.cxg

seve removed their assignment May 10, 2024
kaloster commented:

I like it and I'm fully aligned with this approach.

Thank you for coming up with the workflow, @seve. The only thing I have to add is that, as a second phase, it would be cool to implement this as a CLI command as well (a rough sketch follows the quote below).

(quoting seve's proposed workflow above)
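
For the CLI idea, a minimal argparse-based sketch of what such a command might look like; the option names and the H5AD_PATH environment-variable convention are invented for illustration:

```python
# Illustrative sketch only — a hypothetical CLI wrapper around the workflow.
import argparse
import os
import subprocess

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Refresh the example test dataset with edge cases."
    )
    parser.add_argument("--h5ad", default="local.h5ad",
                        help="path to the downloaded .h5ad (step 2)")
    parser.add_argument("--notebook", default="add_edge_cases.ipynb",
                        help="notebook that injects the edge cases (step 3)")
    args = parser.parse_args()

    # Step 5: execute the notebook; assume it reads the target .h5ad path
    # from the H5AD_PATH environment variable (an invented convention).
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute",
         "--inplace", args.notebook],
        env={**os.environ, "H5AD_PATH": args.h5ad},
        check=True,
    )
    # Step 6 (conversion to .cxg) would follow here via the repo's script.

if __name__ == "__main__":
    main()
```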
