Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify column_names argument in get_anndata() method #1035

Closed
hthomas-czi opened this issue Feb 28, 2024 · 4 comments · Fixed by #1149
Closed

Simplify column_names argument in get_anndata() method #1035

hthomas-czi opened this issue Feb 28, 2024 · 4 comments · Fixed by #1149
Assignees
Labels
effort-xxs P0 Priority 0 - Critical, fix ASAP! UXR value-high Assessed as high value work

Comments

@hthomas-czi
Copy link

hthomas-czi commented Feb 28, 2024

Separate the get_anndata() method column_names argument into obs_column_names and var_column_names which accept an array instead of an object.

Current

adata = cellxgene_census.get_anndata(
    census = census,
    organism = "Homo sapiens",
    var_value_filter = "feature_id in ['ENSG00000161798', 'ENSG00000188229']",
    obs_value_filter = "sex == 'female' and cell_type in ['microglial cell', 'neuron']",
    column_names = {"obs": ["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"]}
)

Proposed

adata = cellxgene_census.get_anndata(
    census = census,
    organism = "Homo sapiens",
    var_value_filter = "feature_id in ['ENSG00000161798', 'ENSG00000188229']",
    obs_value_filter = "sex == 'female' and cell_type in ['microglial cell', 'neuron']",
    obs_column_names = ["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"]
    var_column_names =  ["feature_id", "feature_name"]

)

edited by Pablo

@hthomas-czi hthomas-czi transferred this issue from chanzuckerberg/single-cell Feb 29, 2024
@pablo-gar pablo-gar added UXR value-high Assessed as high value work labels Mar 1, 2024
@pablo-gar pablo-gar added the P0 Priority 0 - Critical, fix ASAP! label May 1, 2024
@ivirshup ivirshup self-assigned this May 17, 2024
@ivirshup
Copy link
Contributor

Currently this argument is passed right through to soma.ExperimentAxisQuery.to_anndata, which has the argument column_names.

Should this change be made over there?

I think it would be odd if the argument column_names was deprecated here, but still used in somacore. Especially if the new form of the argument isn't made available in somacore.

@ivirshup
Copy link
Contributor

Emanuele suggested @aaronwolen may have thoughts on whether a change in somacore is on the table here.

Personally, I'm not sure this change is worth creating inconsistency between the APIs. But do like this change overall.

@prathapsridharan
Copy link
Contributor

prathapsridharan commented May 21, 2024

@ivirshup @ebezzi @pablo-gar - I think this is a tricky design issue and the following are my thoughts. I would like to hear your views on this:

It seems rather confusing to me that we would allow column_names, obs_column_names, and var_column_names and then write some logic to ensure mutual exclusion. The interface has not gained any clarity with this approach.

I would suggest we keep the function signature of get_anndata() as it is right now or replace column_names with obs_column_names and var_column_names in census. If soma also wants to adopt obs_column_names and var_column_names because the maintainers think it is an improvement then that would be fine but not a requirement.

Here is why:

I think we should distinguish between the public interface of census and soma. They don't necessarily have to be consistent because one is an application (census - with application specific nouns and verbs to describe its functionality) built on top of another lower level API (soma - which is a general data model supporting out-of-core processing with a query API).

For example, a census API method could wrap several soma API methods to offer some functionality and in doing so can reasonably have function and argument names that makes sense for the census application and doesn't exactly match soma function signatures.

You could imagine other applications built on top of soma that might define function names and arguments that communicate the nouns and verbs in the application better than soma could.

The fact that census does intentionally expose soma functions does make API design tricky but you could view it as census allowing you to drop down to lower level to get more fine grained control. For example, census exposes get_obs() -> pd.Dataframe, and if the user needs finer control can use obs.read() to get an iterator over arrow tables and can then materialize it in memory if they wish. Similarly, get_anndata() is a user friendly function for census users whereas soma.ExperimentAxisQuery.to_anndata is lower level and more general.

I think we should avoid doing things like the above where we support both soma type arguments and census type arguments that effectively have the same meaning and exist alongside each other in the same census API method.

In general, I think we should put more weight on on what makes usage of the census API easier rather than consistency with form of the underlying soma API.

@ivirshup
Copy link
Contributor

It seems rather confusing to me that we would allow column_names, obs_column_names, and var_column_names and then write some logic to ensure mutual exclusion. The interface has not gained any clarity with this approach.

I have update that code slightly so that now it either errors or throws a warning if column_names is ever passed (I already had the test for it, but not the actual code). I think this should be more clear.


I think that argument is fair, but I guess I assumed there was some desire for users to access tiledbsoma for more complex use cases. Then it would be nice for them if the APIs are the same. It could also ease maintenance if the APIs don't drift much and many arguments can be passed straight through.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
effort-xxs P0 Priority 0 - Critical, fix ASAP! UXR value-high Assessed as high value work
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants