# Experimental Highly Variable Genes API

This tutorial describes use of the `cellxgene_census.experimental.pp` API for finding highly variable genes (HVGs) in the Census. The HVG algorithm implements the ranked normalized variance method `seurat_v3` described in [`scanpy.pp.highly_variable_genes`](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html#scanpy.pp.highly_variable_genes).
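To give a feel for what "ranked normalized variance" means, here is a minimal NumPy sketch. It is not the Census implementation: it assumes a quadratic polynomial fit in place of the LOESS fit, and the function `toy_ranked_normalized_variance` and the simulated counts are hypothetical, introduced only for illustration.

```python
import numpy as np


def toy_ranked_normalized_variance(counts, n_top_genes):
    """Flag the n_top_genes columns (genes) with the largest normalized variance.

    counts: 2-D array of raw counts, cells in rows and genes in columns.
    """
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]
    means = counts.mean(axis=0)
    variances = counts.var(axis=0, ddof=1)

    # Fit the mean-variance trend in log10 space, skipping zero-variance genes.
    # ASSUMPTION: a quadratic polynomial stands in for the LOESS fit.
    fit = variances > 0
    log_mean = np.log10(means[fit])
    coeffs = np.polyfit(log_mean, np.log10(variances[fit]), deg=2)
    expected_std = np.sqrt(10 ** np.polyval(coeffs, log_mean))

    # Clip extreme counts, standardize against the trend, and take the variance
    # of the standardized values as the normalized variance.
    upper = means[fit] + expected_std * np.sqrt(n_cells)
    standardized = (np.minimum(counts[:, fit], upper) - means[fit]) / expected_std
    variances_norm = np.zeros_like(means)
    variances_norm[fit] = (standardized**2).sum(axis=0) / (n_cells - 1)

    # Rank genes by normalized variance (largest first) and flag the top ones.
    order = np.argsort(-variances_norm, kind="stable")
    highly_variable = np.zeros(means.shape, dtype=bool)
    highly_variable[order[:n_top_genes]] = True
    return variances_norm, highly_variable


# Simulated data (hypothetical): 200 cells x 50 genes, gene 1 never expressed.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=rng.uniform(0.5, 5.0, size=50), size=(200, 50))
toy_counts[:, 1] = 0
toy_norm, toy_flags = toy_ranked_normalized_variance(toy_counts, n_top_genes=5)
print(int(toy_flags.sum()))
```

Genes with zero variance are left with a normalized variance of zero and are never flagged, which mirrors the all-zero rows in the Census results shown later in this tutorial.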

Two functions are available:
* `get_highly_variable_genes()` - high-level function which accepts arguments similar to `cellxgene_census.get_anndata()` and returns annotations for each `var` feature in a Pandas DataFrame.
* `highly_variable_genes()` - lower-level function which accepts a `tiledbsoma.ExperimentAxisQuery` and returns the same result.

Both functions accept common arguments to control ranking, with argument semantics matching the Scanpy API:
* `n_top_genes` - number of top-ranked genes to flag as highly variable.
* `batch_key` - if specified, genes are ranked separately within each batch, where batches are defined by the values of the named `obs` column, and the per-batch rankings are then merged into the final result.
* `span` - the fraction of the data (cells) used when estimating the variance in the [loess model fit](https://has2k1.github.io/scikit-misc/stable/generated/skmisc.loess.loess_model.html#skmisc.loess.loess_model).

In addition:
* `max_loess_jitter` - maximum jitter (noise) added to the data if the LOESS fit fails. Disable by setting to zero.

For more information, see the docstrings for both functions (e.g. `help(get_highly_variable_genes)`).
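As a sketch of how the ranking arguments might be combined, the hypothetical helper below ranks genes within each dataset and merges the results. The helper name `heart_hvgs_per_dataset`, the choice of `dataset_id` as the batch column, and the `span` value are assumptions made for illustration. Calling the helper requires network access to the Census.

```python
def heart_hvgs_per_dataset(census_version="2023-05-15"):
    """Hypothetical helper: HVGs for mouse heart cells, batched by dataset."""
    # Imported lazily so the helper can be defined without the package installed.
    import cellxgene_census
    from cellxgene_census.experimental.pp import get_highly_variable_genes

    with cellxgene_census.open_soma(census_version=census_version) as census:
        return get_highly_variable_genes(
            census,
            organism="mus_musculus",
            obs_value_filter="is_primary_data == True and tissue_general == 'heart'",
            n_top_genes=500,
            batch_key="dataset_id",  # assumption: treat each dataset as one batch
            span=0.3,  # illustrative value for the LOESS span
        )
```

Batching by dataset is one plausible choice when the selected cells come from several studies. Any `obs` column that captures the relevant technical grouping could be used instead.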

In [10]:
# Import packages
import cellxgene_census
from cellxgene_census.experimental.pp import get_highly_variable_genes, highly_variable_genes
import pandas as pd
import tiledbsoma as soma

## get_highly_variable_genes

This convenience function is a wrapper around `highly_variable_genes` and will meet most use cases. This demonstration requests the top 500 genes from the mouse Census, restricted to primary data where `tissue_general` is `heart`.

The HVGs returned by `get_highly_variable_genes` are indexed by their `soma_joinid`. Join with the `var` dataframe for a merged view of the `var` metadata.

In [4]:
with cellxgene_census.open_soma(census_version="stable") as census:
    hvgs_df = get_highly_variable_genes(
        census,
        organism="mus_musculus",
        n_top_genes=500,
        obs_value_filter="""is_primary_data == True and tissue_general == 'heart'""",
    )

    # while the Census is open, also grab the var dataframe for the mouse
    var_df = census["census_data"]["mus_musculus"].ms["RNA"].var.read().concat().to_pandas()

hvgs_df

The "stable" release is currently 2023-05-15. Specify 'census_version="2023-05-15"' in future calls to open_soma() to ensure data consistency.


| soma_joinid | means | variances | highly_variable_rank | variances_norm | highly_variable |
|---|---|---|---|---|---|
| 0 | 0.030084 | 0.937446 | NaN | 0.487800 | False |
| 1 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 2 | 14.902965 | 10452.915369 | NaN | 0.585244 | False |
| 3 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 4 | 2.450334 | 717.070806 | NaN | 0.338753 | False |
| ... | ... | ... | ... | ... | ... |
| 52387 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 52388 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 52389 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 52390 | 0.000000 | 0.000000 | NaN | 0.000000 | False |


Concatenate the two dataframes for convenience:

In [7]:
combined_df = pd.concat([var_df.set_index("soma_joinid"), hvgs_df], axis=1)
combined_df

| soma_joinid | feature_id | feature_name | feature_length | means | variances | highly_variable_rank | variances_norm | highly_variable |
|---|---|---|---|---|---|---|---|---|
| 0 | ENSMUSG00000109644 | 0610005C13Rik | 3583 | 0.030084 | 0.937446 | NaN | 0.487800 | False |
| 1 | ENSMUSG00000108652 | 0610006L08Rik | 2128 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 2 | ENSMUSG00000007777 | 0610009B22Rik | 998 | 14.902965 | 10452.915369 | NaN | 0.585244 | False |
| 3 | ENSMUSG00000086714 | 0610009E02Rik | 1803 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 4 | ENSMUSG00000043644 | 0610009L18Rik | 619 | 2.450334 | 717.070806 | NaN | 0.338753 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 52387 | ENSMUSG00000081591 | Btf3-ps9 | 496 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 52388 | ENSMUSG00000118710 | mmu-mir-467a-3_ENSMUSG00000118710 | 83 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 52389 | ENSMUSG00000119584 | Rn18s | 1849 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 52390 | ENSMUSG00000118538 | Gm18218 | 970 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
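The concatenation works because `pd.concat(..., axis=1)` matches rows by index label rather than by position, which is why the `var` dataframe is first re-indexed by `soma_joinid`. A small sketch with made-up data (the names `var_toy`, `hvgs_toy`, and the gene labels are hypothetical) illustrates the alignment:

```python
import pandas as pd

# Toy stand-in for the var dataframe; rows are deliberately out of order.
var_toy = pd.DataFrame(
    {"soma_joinid": [2, 0, 1], "feature_name": ["GeneC", "GeneA", "GeneB"]}
)
# Toy stand-in for the HVG result, indexed by soma_joinid.
hvgs_toy = pd.DataFrame(
    {"highly_variable": [False, True, False]},
    index=pd.Index([0, 1, 2], name="soma_joinid"),
)

# Rows are matched by index label, not by position.
combined_toy = pd.concat([var_toy.set_index("soma_joinid"), hvgs_toy], axis=1)
print(combined_toy.loc[1])
```

Here the row labelled `1` pairs `GeneB` with its own `highly_variable` flag even though the two inputs list their rows in different orders.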


Select only the highly variable genes by filtering on the `highly_variable` column:

In [8]:
combined_df[combined_df.highly_variable]

| soma_joinid | feature_id | feature_name | feature_length | means | variances | highly_variable_rank | variances_norm | highly_variable |
|---|---|---|---|---|---|---|---|---|
| 552 | ENSMUSG00000097536 | 2610037D02Rik | 5020 | 0.855924 | 2701.181363 | 144.0 | 3.928039 | True |
| 953 | ENSMUSG00000097276 | 4930525G20Rik | 2987 | 0.099004 | 37.125256 | 499.0 | 2.516880 | True |
| 1494 | ENSMUSG00000073386 | 9830107B12Rik | 3691 | 0.200361 | 199.832453 | 491.0 | 2.530400 | True |
| 1738 | ENSMUSG00000045165 | AI467606 | 1932 | 23.730773 | 130376.031665 | 195.0 | 3.478058 | True |
| 1767 | ENSMUSG00000051669 | AU021092 | 1368 | 21.549119 | 78776.053301 | 327.0 | 2.829371 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 26760 | ENSMUSG00000063660 | Olfr98 | 1039 | 0.097637 | 60.434917 | 385.0 | 2.703275 | True |
| 27147 | ENSMUSG00000074003 | Gucy2d | 4102 | 0.208949 | 219.691473 | 353.0 | 2.770308 | True |
| 27711 | ENSMUSG00000079853 | Klra1 | 1764 | 0.128651 | 80.756272 | 326.0 | 2.829798 | True |
| 29230 | ENSMUSG00000096323 | Gm20767 | 1427 | 0.609561 | 1085.532796 | 372.0 | 2.729782 | True |
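In the output above, a lower `highly_variable_rank` goes with a higher `variances_norm`, so sorting by rank orders the selected genes from most to least variable. The sketch below rebuilds a few of the rows shown above as a toy dataframe (`toy_df` is a hypothetical name) and lists the gene names in rank order:

```python
import pandas as pd

# A handful of rows copied from the outputs above; rank is NaN for non-HVGs.
toy_df = pd.DataFrame(
    {
        "feature_name": [
            "0610005C13Rik",
            "2610037D02Rik",
            "4930525G20Rik",
            "9830107B12Rik",
            "AI467606",
            "AU021092",
        ],
        "highly_variable_rank": [float("nan"), 144.0, 499.0, 491.0, 195.0, 327.0],
        "highly_variable": [False, True, True, True, True, True],
    },
    index=pd.Index([0, 552, 953, 1494, 1738, 1767], name="soma_joinid"),
)

# Keep only the flagged genes, then order them by rank (lowest rank first).
top_names = (
    toy_df[toy_df.highly_variable]
    .sort_values("highly_variable_rank")["feature_name"]
    .tolist()
)
print(top_names)
```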


## highly_variable_genes

This API provides the same functionality as `get_highly_variable_genes`, but accepts any `tiledbsoma.ExperimentAxisQuery`. It is intended for more advanced users who wish to create and manage their own queries.

In [12]:
with cellxgene_census.open_soma(census_version="stable") as census:
    experiment = census["census_data"]["mus_musculus"]
    with experiment.axis_query(
        measurement_name="RNA",
        obs_query=soma.AxisQuery(value_filter="""is_primary_data == True and tissue_general == 'heart'"""),
    ) as query:
        hvgs_df = highly_variable_genes(query, n_top_genes=500)

hvgs_df[hvgs_df.highly_variable]

The "stable" release is currently 2023-05-15. Specify 'census_version="2023-05-15"' in future calls to open_soma() to ensure data consistency.


| soma_joinid | means | variances | highly_variable_rank | variances_norm | highly_variable |
|---|---|---|---|---|---|
| 552 | 0.855924 | 2701.181363 | 144.0 | 3.928039 | True |
| 953 | 0.099004 | 37.125256 | 499.0 | 2.516880 | True |
| 1494 | 0.200361 | 199.832453 | 491.0 | 2.530400 | True |
| 1738 | 23.730773 | 130376.031665 | 195.0 | 3.478058 | True |
| 1767 | 21.549119 | 78776.053301 | 327.0 | 2.829371 | True |
| ... | ... | ... | ... | ... | ... |
| 26760 | 0.097637 | 60.434917 | 385.0 | 2.703275 | True |
| 27147 | 0.208949 | 219.691473 | 353.0 | 2.770308 | True |
| 27711 | 0.128651 | 80.756272 | 326.0 | 2.829798 | True |
| 29230 | 0.609561 | 1085.532796 | 372.0 | 2.729782 | True |
