# Tutorial for module discovery and characterization with CellSP

 - This notebook gives a basic understanding of CellSP's functionalities. There are 4 main steps involved:
    - Data Preprocessing
    - Subcellular Pattern Identification
    - Module Discovery
    - Module Characterization and Visualization

 - Please use GitHub Issues to report errors or request features.

---

### Installation

 - CellSP is available on PyPI and can be installed using `pip install cellSP`.

In [2]:
import cellSP

## Data Preprocessing

CellSP analyzes single-molecule (subcellular) spatial transcriptomics (ST) data. To characterize module cells, we require scRNA-seq data from the same tissue to impute gene expression. Both datasets should be provided in the `AnnData` format.

CellSP requires minimal preprocessing of input data. No preprocessing is performed on the input ST data; however, we recommend users apply quality control filtering of transcripts by appropriately thresholding Phred-scaled Q-Scores.
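The quality-control filter recommended above can be sketched with pandas. This is a minimal example, not part of CellSP: it assumes the transcript table carries a Phred-scaled quality column named `qv` (the name used in Xenium transcript output) and uses a cutoff of 20, both of which are assumptions to adapt to your platform.

```python
import pandas as pd

# Toy transcript table; the `qv` column name and all values are illustrative assumptions.
transcripts = pd.DataFrame({
    "gene": ["AKAP11", "SIPA1L3", "THBS1", "THBS1"],
    "uID": [2, 3, 925, 925],
    "absX": [-1401.666, -1411.692, -764.6989, -765.1],
    "absY": [-2956.618, -2936.609, -1604.828, -1603.9],
    "qv": [40.0, 12.5, 25.0, 19.9],
})

MIN_QV = 20  # assumed cutoff; Q20 corresponds to a 1% error probability

# Keep only transcripts whose Phred-scaled Q-score meets the cutoff.
filtered = transcripts[transcripts["qv"] >= MIN_QV].reset_index(drop=True)
print(f"kept {len(filtered)} of {len(transcripts)} transcripts")
```

The filtered table would then take the place of the raw transcripts before the data is passed to CellSP.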

For scRNA-seq data, we apply data denoising using `MAGIC` and perform imputation of ST data using `Tangram`.

#### Loading `AnnData`

CellSP provides the function `cellSP.ds.load_data()` to load `AnnData` objects. It takes the following arguments:

- `sc_adata`: The `AnnData` object containing scRNA-seq data.
- `st_adata`: The `AnnData` object containing ST data.

These datasets are used for downstream analysis within CellSP.

Example datasets: the Xenium whole mouse brain dataset can be used for [ST](https://www.10xgenomics.com/datasets/fresh-frozen-mouse-brain-replicates-1-standard) and the Stereo-seq dataset for [SC](https://db.cngb.org/search/project/CNP0001543/).

For this analysis, we use a subset of 5000 randomly sampled cells from replicate 1 of the Xenium whole mouse brain dataset, which is [publicly available](https://www.10xgenomics.com/datasets/fresh-frozen-mouse-brain-replicates-1-standard).


### File format

The `sc_adata` file should follow the standard `AnnData` format for single-cell data, where:

- `sc_adata.X` contains the gene expression matrix (cells × genes).
- `sc_adata.var` contains gene metadata, indexed by gene names.
- `sc_adata.obs` contains cell metadata.

The `st_adata` object should follow the same format as `sc_adata`. In addition, the `transcripts` key of its `uns` attribute is expected to hold a DataFrame with the columns `['gene', 'uID', 'absX', 'absY']`, and its `obsm` attribute is expected to hold a DataFrame with the spatial locations of the cells.

Below is an example table for 2D data (for 3D data, an `absZ` column is also expected):

| gene | uID | absX | absY |
| :---         |     :---:      |          ---: |           ---: |
| AKAP11   |  2    | -1401.666    | -2956.618     |
| SIPA1L3  |  3       | -1411.692      |   -2936.609     |
| THBS1  |  925       | -764.6989      |   -1604.828    |
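A quick check of this layout before loading can save a failed run later. The sketch below rebuilds the example table with pandas and reports missing columns. `check_transcripts` is a hypothetical helper written for this tutorial, not a CellSP function.

```python
import pandas as pd

REQUIRED_2D = ["gene", "uID", "absX", "absY"]

def check_transcripts(df, is_3d=False):
    """Return the required columns that are missing from a transcripts table."""
    required = REQUIRED_2D + (["absZ"] if is_3d else [])
    return [col for col in required if col not in df.columns]

# The three example rows from the table above.
transcripts = pd.DataFrame({
    "gene": ["AKAP11", "SIPA1L3", "THBS1"],
    "uID": [2, 3, 925],
    "absX": [-1401.666, -1411.692, -764.6989],
    "absY": [-2956.618, -2936.609, -1604.828],
})

print(check_transcripts(transcripts))              # 2D layout
print(check_transcripts(transcripts, is_3d=True))  # 3D layout also needs absZ
```

An empty list means the table has every expected column and can be stored under the `transcripts` key of `uns`.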

Optionally, the user can also provide the boundary of each cell. The `cell_boundary` key in the `uns` attribute is expected to hold a DataFrame in the format below:

|uID|vertex_x|vertex_y|
| :---         |          ---: |           ---: |
|1|1554.65|2519.1875|
|1|1550.825|2521.525|
|2|1549.55|2523.4375|

If cell boundaries are not provided, the convex hull of each cell's transcripts is used as an approximation of its boundary.
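To illustrate the fallback, the sketch below computes a convex hull from toy transcript coordinates using Andrew's monotone chain algorithm and lays the result out like the `cell_boundary` table. It is a stand-alone illustration; the routine CellSP uses internally is not shown in this tutorial and may differ.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b makes a counter-clockwise turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]

# Toy transcript coordinates (absX, absY) for one cell; values are made up.
cell_transcripts = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2)]
hull = convex_hull(cell_transcripts)

# Rows in the same layout as the `cell_boundary` table: (uID, vertex_x, vertex_y).
boundary_rows = [(1, x, y) for x, y in hull]
print(boundary_rows)
```

Interior transcripts such as `(2, 1)` are dropped, so the boundary only keeps the outermost points of the cell.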

In [14]:
adata_sc, adata_st = cellSP.ds.load_data(sc_adata="adata_sc.h5ad",
                                         st_adata="adata_st.h5ad")

For denoising of scRNA-seq data, we use the function `cellSP.pp.impute()` which has the following arguments,
 - `adata_sc`:  The `AnnData` object containing scRNA-seq data.
 - `t`: `MAGIC` parameter that sets the power to which the diffusion operator is raised.
 - `**kwargs`: Additional arguments for `MAGIC`.
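The role of `t` can be seen in a small NumPy sketch: a row-normalized affinity matrix acts as a diffusion operator, and raising it to the power `t` controls how strongly expression is smoothed across similar cells. The matrices are toy values and the code only illustrates the idea; it does not reproduce how `MAGIC` builds its graph or selects `t` automatically.

```python
import numpy as np

# Toy cell-cell affinity matrix (symmetric, non-negative); values are illustrative.
affinity = np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.2],
    [0.0, 0.2, 1.0],
])

# Row-normalize to obtain a Markov (diffusion) operator whose rows sum to 1.
M = affinity / affinity.sum(axis=1, keepdims=True)

# Toy expression matrix (cells x genes).
X = np.array([
    [5.0, 0.0],
    [0.0, 0.0],
    [0.0, 3.0],
])

def diffuse(M, X, t):
    """Smooth X by applying the diffusion operator t times (M^t @ X)."""
    return np.linalg.matrix_power(M, t) @ X

for t in (1, 4):
    print(f"t={t}\n{diffuse(M, X, t).round(3)}")
```

Larger values of `t` pull the cells' profiles closer together, which is why too large a value can over-smooth the data.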


In [4]:
adata_sc = cellSP.pp.impute(adata_sc, t="auto")

Calculating MAGIC...
  Running MAGIC on 50140 cells and 16168 genes.
  Calculating graph and diffusion operator...
    Calculating PCA...
    Calculated PCA in 71.02 seconds.
    Calculating KNN search...
    Calculated KNN search in 64.78 seconds.
    Calculating affinities...
    Calculated affinities in 68.57 seconds.
  Calculated graph and diffusion operator in 204.76 seconds.
  Running MAGIC with `solver='exact'` on 16168-dimensional data may take a long time. Consider denoising specific genes with `genes=<list-like>` or using `solver='approximate'`.
  Calculating imputation...
    Automatically selected t = 4
  Calculated imputation in 40.70 seconds.
Calculated MAGIC in 246.68 seconds.
Time to impute 0:04:06.696124


For imputation of genes not in the ST data, we use the function `cellSP.pp.run_tangram()` which has the following arguments,
 - `adata_sc`:  The `AnnData` object containing scRNA-seq data.
 - `adata_st`:  The `AnnData` object containing ST data.
 - `device`: Device to run Tangram on. Either 'cpu' or 'cuda'.

In [5]:
adata_st = cellSP.pp.run_tangram(adata_sc, adata_st, device='cuda')

Running Tangram...


INFO:root:16168 training genes are saved in `uns``training_genes` of both single cell and spatial Anndatas.
INFO:root:16168 overlapped genes are saved in `uns``overlap_genes` of both single cell and spatial Anndatas.
INFO:root:uniform based density prior is calculated and saved in `obs``uniform_density` of the spatial Anndata.
INFO:root:rna count based density prior is calculated and saved in `obs``rna_count_based_density` of the spatial Anndata.
INFO:root:Allocate tensors for mapping.
INFO:root:Begin training with 16168 genes and rna_count_based density_prior in cells mode...
INFO:root:Printing scores every 100 epochs.


Score: 0.780, KL reg: 0.219
Score: 0.987, KL reg: 0.002
Score: 0.994, KL reg: 0.001
Score: 0.995, KL reg: 0.001
Score: 0.996, KL reg: 0.001
Score: 0.997, KL reg: 0.000
Score: 0.997, KL reg: 0.000
Score: 0.997, KL reg: 0.000
Score: 0.997, KL reg: 0.000
Score: 0.998, KL reg: 0.000


INFO:root:Saving results..


Time to run Tangram 0:26:53.856380


---

### Saving the `AnnData` object.

CellSP provides the function `cellSP.io.write_h5ad()` to save `AnnData` objects. It takes the following arguments:
- `data`: The `AnnData` object to save.
- `filename`: Name of the file to save.

In [6]:
cellSP.io.write_h5ad(adata_st, "adata_st.h5ad")

---

## Subcellular Pattern Identification

CellSP utilizes existing tools [InSTAnT](https://github.com/bhavaygg/InSTAnT/) and [SPRAWL](https://github.com/salzman-lab/SPRAWL/) for the discovery of subcellular spatial patterns of genes within each cell. Types of patterns detected include gene-gene colocalization reported by InSTAnT and four types of subcellular localization preferences (peripheral, punctate, central, radial) reported by SPRAWL.

#### Running `InSTAnT`.

We use the function `cellSP.ch.run_instant()`, which has the following parameters:
- **`adata_st`** (`AnnData`): The `AnnData` object containing spatial transcriptomics data.  
- **`distance_threshold`** (`float`): The distance threshold for the PP-Test.  
- **`threads`** (`int`): The number of threads to use.  
- **`n_vertices`** (`int`, optional): The number of vertices to use for the FSM (default: `None`).  
- **`alpha_cpb`** (`float`, optional): The significance level for the CPB (default: `0.001`).  
- **`filename`** (`str`, optional): The name of the file to save the preprocessed data (default: `.cellSP_st.h5ad`, deleted after completion).  
- **`remove_file`** (`bool`, optional): If `True`, the file is deleted after completion (default: `True`).  
- **`is_sliced`** (`bool`, optional): Specifies whether the data is sliced (default: `True`).  

We create a temporary `AnnData` file named `filename` to run `InSTAnT`. If the argument is not specified, we use a random name and delete the file unless specified otherwise in `remove_file`. The argument `is_sliced` denotes whether the ST data has discrete z-planes (as in some MERFISH experiments) or continuous z-coordinates (as in Xenium).
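To give some intuition for `distance_threshold`, the sketch below counts transcript pairs of two genes that lie within the threshold of each other inside one cell. It covers only this counting idea with made-up coordinates. The statistical tests that InSTAnT builds on top of such counts are not reproduced here, and `count_proximal_pairs` is a hypothetical helper.

```python
import math
from itertools import product

def count_proximal_pairs(coords_a, coords_b, distance_threshold):
    """Count (a, b) transcript pairs whose Euclidean distance is <= threshold."""
    return sum(
        1
        for a, b in product(coords_a, coords_b)
        if math.dist(a, b) <= distance_threshold
    )

# Toy coordinates for the transcripts of two genes in one cell; values are made up.
gene_a = [(0.0, 0.0), (5.0, 5.0)]
gene_b = [(1.0, 1.0), (5.5, 5.0), (9.0, 9.0)]

n_close = count_proximal_pairs(gene_a, gene_b, distance_threshold=2)
print(n_close)
```

A smaller threshold makes the notion of "proximal" stricter, so fewer pairs are counted.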


In [11]:
adata_st = cellSP.ch.run_instant(adata_st = adata_st, distance_threshold=2, threads=128)

Running InSTAnT...
Loaded Data. Number of Transcripts:  1909167 , Number of Genes:  248 , Number of Cells:  5000
Initialised PP-3D slice now on 128 threads
Number of cells:  5000 , Number of Genes:  248
min genecount less than 20 for cell id 622, Skipping ...
min genecount less than 20 for cell id 623, Skipping ...
min genecount less than 20 for cell id 716, Skipping ...
min genecount less than 20 for cell id 1253, Skipping ...
min genecount less than 20 for cell id 3878, Skipping ...
min genecount less than 20 for cell id 4858, Skipping ...
Running PP-3D slice now on 128 threads for, 4994 cells, 1909167 transcripts
Cell-wise Proximal Pairs Time : 3618.63 seconds
Running Global Colocalization now on 128 threads
Number of cells: 5000, Number of genes: 248
Global Colocalization initialized ..
High Precision Global Colocalization Time: 59.71 seconds
Cell-wise Global Colocalization Time : 122.07 seconds


#### Running `SPRAWL`.

We use the function `cellSP.ch.run_sprawl()`, which has the following parameters:
- **`adata_st`** (`AnnData`): The `AnnData` object containing spatial transcriptomics data.  
- **`methods`** (`list`, optional): List of methods/patterns to run SPRAWL on. Options: `Peripheral`, `Radial`, `Punctate`, `Central` (default: `['Peripheral', 'Radial', 'Punctate', 'Central']`).  
- **`threads`** (`int`, optional): Number of threads to use (default: `1`).  

In [14]:
adata_st = cellSP.ch.run_sprawl(adata_st, threads = 128)

Running SPRAWL...
SPRAWL completed in: 0:52:01.080050


---

## Module Discovery

Patterns identified in the previous step (represented as matrices of cells × genes or gene pairs) undergo biclustering analysis to identify subsets of rows and columns with high average values. These subsets represent **gene-cell modules**, where genes or gene pairs exhibit the same subcellular pattern within the same cells. Gene-cell modules that significantly overlap in genes (or gene pairs) are merged to reduce redundancy. This step outputs a set of gene-cell modules with varying numbers of cells and genes.

CellSP uses the biclustering algorithm `LAS` to analyze each pattern and identify gene-cell modules.

#### Module Discovery Functions

CellSP provides two functions for module discovery, one for `SPRAWL` (`cellSP.ch.bicluster_sprawl()`) and one for `InSTAnT` (`cellSP.ch.bicluster_instant()`). 

Both functions share the same parameters, but the `InSTAnT` function includes two additional parameters:

- **`alpha`** (`float`, optional): p-value significance threshold below which a gene pair is considered for biclustering (default: `1e-3`).  
- **`topk`** (`int`, optional): Selects only the `K` most significant gene pairs with p-values < `alpha` (default: `None`).  

These parameters help restrict the number of gene pairs for biclustering, reducing computational complexity.
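The combined effect of `alpha` and `topk` can be sketched in plain Python. `select_gene_pairs` is a hypothetical helper and the p-values are made up; the sketch only mirrors the filtering described above, not CellSP's internal code.

```python
def select_gene_pairs(pvalues, alpha=1e-3, topk=None):
    """Keep pairs with p-value < alpha, optionally only the topk most significant."""
    significant = sorted(
        (pair for pair, p in pvalues.items() if p < alpha),
        key=lambda pair: pvalues[pair],
    )
    return significant if topk is None else significant[:topk]

# Toy p-values for gene pairs; the numbers are made up for illustration.
pvalues = {
    ("fgd5", "pecam1"): 1e-8,
    ("cd68", "spi1"): 5e-6,
    ("aqp4", "ntsr2"): 2e-4,
    ("gad1", "gad2"): 0.03,
}

print(select_gene_pairs(pvalues))                      # default alpha, no topk
print(select_gene_pairs(pvalues, alpha=1e-5, topk=1))  # stricter alpha, single best pair
```

Lowering `alpha` or setting `topk` shrinks the number of columns passed to biclustering, which is where the savings in runtime come from.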

#### Shared Parameters

- **`adata_st`** (`AnnData`): The `AnnData` object containing spatial transcriptomics data.  
- **`num_biclusters`** (`int` or `auto`, optional): Number of modules to find (default: `auto`).  
- **`randomized_searches`** (`int`, optional): Number of randomized searches to perform in LAS (default: `50000`).  
- **`gene_threshold`** (`int`, optional): Minimum number of genes required for a valid bicluster (default: `3`).  
- **`threads`** (`int`, optional): Number of threads to use for parallel computation (default: `1`).  
- **`expand`** (`bool`, optional): Whether to expand biclusters by including additional nearby or correlated entries (default: `True`).  
- **`oc`** (`float`, optional): Overlap coefficient threshold for merging overlapping biclusters (default: `0.667`).
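The `oc` threshold refers to the overlap coefficient, which for two sets is the size of their intersection divided by the size of the smaller set. The sketch below applies it to gene sets. Exactly which sets CellSP compares when merging is not spelled out here, so treat the gene sets and the merge decision as an illustration.

```python
def overlap_coefficient(a, b):
    """Szymkiewicz-Simpson overlap: |A & B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

# Illustrative gene sets.
module_a = {"ano1", "carmn", "cspg4"}
module_b = {"cspg4", "gpr17", "ano1", "carmn", "sema3d", "pdgfra"}
module_c = {"gpr17", "cspg4", "sema3d", "pdgfra"}

OC_THRESHOLD = 0.667  # the documented default of `oc`

for name, other in (("b", module_b), ("c", module_c)):
    score = overlap_coefficient(module_a, other)
    print(f"a vs {name}: {score:.3f} merge={score >= OC_THRESHOLD}")
```

Because the denominator is the smaller set, a small module fully contained in a larger one scores 1.0 and is merged.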


**Note** - Using `auto` mode for biclustering generates a null distribution for each pattern and finds a suitable LAS score threshold based on the data.


In [13]:
adata_st = cellSP.ch.bicluster_instant(adata_st, distance_threshold=2, threads=128, alpha=1e-5, num_biclusters = 10, randomized_searches = 50000)

Bi-clustering InSTAnT CPB results...
InSTAnT CPB Bi-clustering time: 1254 days, 0:10:50.194914


In [15]:
adata_st = cellSP.ch.bicluster_sprawl(adata_st, threads=128, num_biclusters = 10, randomized_searches = 50000)


Bi-clustering SPRAWL results...
SPRAWL bi-clustering completed time: 1:12:12.287354


---

## Module Characterization and Visualization

### Module Characterization

To aid biological interpretation, CellSP reports shared properties of the genes and cells of each discovered module. Genes are characterized using Gene Ontology (GO) enrichment tests, while cells are characterized by their cell type composition if such information is available. To provide a more precise characterization of a module’s cells, CellSP trains a machine learning classifier to discriminate those cells from all other cells, using the expression levels of all genes other than the module genes. Genes that are highly predictive in this task are then subjected to GO enrichment tests, furnishing hypotheses about biological processes and pathways that are active specifically in the module cells.

To characterize the module genes, we perform GO enrichment using `cellSP.geo.geo_analysis()` with the parameter `setting` set to `module`. The arguments are described below.

- **`adata_st`** (`AnnData`): The `AnnData` object containing spatial transcriptomics data.  
- **`mode`** (`list`, optional): List of analyses to perform. Must be one or more of `['instant_fsm', 'instant_biclustering', 'sprawl_biclustering']` (default: `['instant_biclustering', 'sprawl_biclustering']`).  
- **`organism`** (`int`, optional): Taxon ID of the organism (default: `10090`).  
- **`do_revigo`** (`bool`, optional): If `True`, performs REVIGO analysis (default: `True`).  
- **`setting`** (`str`, optional): Specifies whether to perform the analysis at the **module** or **cell** level. Must be `'module'` or `'cell'` (default: `'module'`).  
- **`corr_threshold`** (`float`, optional): Correlation threshold for finding genes correlated to the marker genes of the module cells (default: `0.98`).  


In [19]:
adata_st = cellSP.geo.geo_analysis(adata_st, setting="module")

Performing GO Enrichment Analysis...
GO Enrichment Analysis Completed in : 0:10:23.404245


To characterize the module cells, we train a Random Forest classifier of module cells using the function `cellSP.md.model_modules()`. Its arguments are - 

- **`adata_st`** (`AnnData`): The `AnnData` object containing spatial transcriptomics data.  
- **`mode`** (`list`, optional): List of characterizations to model. Options: `['instant_biclustering', 'sprawl_biclustering']` (default: `['instant_biclustering', 'sprawl_biclustering']`).  
- **`n_repeats`** (`int`, optional): Number of sampling repetitions (default: `25`).  
- **`subsample`** (`bool`, optional): If `True`, subsamples cells based on cell type (if available) or randomly (default: `True`).  
- **`corr_threshold`** (`float`, optional): Correlation threshold for filtering gene sets (default: `0.98`).
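The idea behind `corr_threshold` can be sketched with NumPy: genes whose expression across cells correlates with a marker gene at or above the threshold are collected. The use of Pearson correlation, the helper `correlated_genes`, and the toy matrix are all assumptions for illustration; CellSP's own procedure may differ.

```python
import numpy as np

genes = ["marker", "g1", "g2", "g3"]

# Toy expression matrix (cells x genes); values are made up.
X = np.array([
    [1.0, 2.1, 9.0, 0.5],
    [2.0, 4.0, 7.5, 0.4],
    [3.0, 6.2, 8.0, 0.6],
    [4.0, 7.9, 1.0, 0.5],
    [5.0, 10.0, 3.0, 0.4],
])

def correlated_genes(X, genes, marker, corr_threshold=0.98):
    """Return genes whose Pearson correlation with `marker` is >= corr_threshold."""
    corr = np.corrcoef(X, rowvar=False)  # gene-by-gene correlation matrix
    idx = genes.index(marker)
    return [
        g for j, g in enumerate(genes)
        if j != idx and corr[idx, j] >= corr_threshold
    ]

print(correlated_genes(X, genes, "marker"))
```

Only `g1`, which tracks the marker almost linearly, passes the default threshold of 0.98 in this toy example.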

The marker genes for each module are found using `SHAP` and are used to provide biological interpretation of the module cells beyond their cell type.

In [17]:
adata_st = cellSP.md.model_modules(adata_st, do_shap=True, subsample = True, corr_threshold = 0.90, n_repeats = 10)

Modeling subcellular patterns...
Modelling for Module 0 with genes: ['cldn5', 'ly6a', 'paqr5', 'fgd5', 'acvrl1', 'pecam1', 'kdr', 'cobll1', 'pglyrp1', 'car4', 'slfn5', 'sox17', 'nostrin', 'mecom', 'zfp366', 'ccn2', 'adgrl4', 'emcn', 'fn1', 'cd93'] in 423 cells
Modelling for Module 1 with genes: ['lyz2', 'spi1', 'cd300c2', 'laptm5', 'trem2', 'plekha2', 'arhgap25', 'cd53', 'siglech', 'kctd12', 'ikzf1', 'cd68'] in 211 cells
Modelling for Module 2 with genes: ['col1a1', 'igf2', 'lyz2', 'cyp1b1', 'spp1', 'pdgfra', 'ccn2', 'gjb2', 'dcn', 'col6a1', 'fn1', 'fmod', 'slc13a4', 'igfbp4', 'aldh1a2'] in 127 cells
Modelling for Module 3 with genes: ['opalin', 'sox10', 'tmem163', 'zfp536', 'clmn', 'dpy19l1', 'sema6a', 'prox1', 'adamtsl1', 'gng12', 'gjc3'] in 303 cells
Modelling for Module 4 with genes: ['sema3d', 'sox10', 'cspg4', 'pdgfra', 'gpr17', 'pou3f1', 'tmem255a', 'sema5b', 'gng12', 'gjc3'] in 120 cells
Modelling for Module 5 with genes: ['cbln4', 'thsd7a', 'gad1', 'kcnmb2', 'sst', 'sema3e', '

In [20]:
adata_st = cellSP.geo.geo_analysis(adata_st, setting="cell")

Performing GO Enrichment Analysis...
GO Enrichment Analysis Completed in : 0:31:44.734452


### Module Visualization

CellSP provides functionality to visualize the detected patterns for each module and also to visualize each pattern on each module.

The first function, `cellSP.vs.visualize_modules()`, visualizes all modules for their detected patterns. Its arguments are - 

- **adata_st** (*AnnData*): Spatial transcriptomics data.
- **mode** (*list*): List of characterizations. (`'instant_biclustering'`, `'sprawl_biclustering'`).
- **num_sectors** (*int*): Number of sectors in the circular cell (default: `10`).
- **num_ccircles** (*int*): Number of concentric circles (default: `5`).
- **distance_threshold** (*float*): PP-Test distance. Used to create proximity heatmap (default: `2`).
- **positions** (*bool*): Plot spatial positions of module cells in a subplot (default: `True`).
- **is_sliced** (*bool*): Whether the z-slices are discrete (default: `True`).
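To make `num_sectors` and `num_ccircles` concrete, the sketch below assigns a transcript to an angular sector and a concentric ring of a circular cell. The helper `bin_transcript`, the cell centre, and `max_radius` are assumptions for illustration. How CellSP maps real cell shapes onto a circle is not described in this tutorial.

```python
import math

def bin_transcript(x, y, cx, cy, max_radius, num_sectors=10, num_ccircles=5):
    """Return (sector, ring) indices for a transcript relative to the cell centre."""
    dx, dy = x - cx, y - cy
    angle = math.atan2(dy, dx) % (2 * math.pi)  # 0 <= angle < 2*pi
    sector = min(int(angle / (2 * math.pi) * num_sectors), num_sectors - 1)
    radius_fraction = min(math.hypot(dx, dy) / max_radius, 1.0)
    ring = min(int(radius_fraction * num_ccircles), num_ccircles - 1)
    return sector, ring

# Toy transcripts in a cell centred at the origin with radius 10; values are made up.
for x, y in [(1.0, 0.0), (0.0, 9.9), (-5.0, -5.0)]:
    print((x, y), bin_transcript(x, y, cx=0.0, cy=0.0, max_radius=10.0))
```

Counting transcripts per `(sector, ring)` bin yields a grid that can be drawn as a polar heatmap.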

The next function, `cellSP.vs.visualize_pattern()` allows you to visualize any pattern and save it for a given module. Its arguments are - 

- **adata_st** (*AnnData*): Spatial transcriptomics data.  
- **module_number** (*int*): Index of the module to visualize.  
- **pattern** (*str*): Spatial pattern (`'Proximal'`, `'Radial'`, `'Concentric'`).  
- **mode** (*list*): List of characterizations. (`'instant_biclustering'`, `'sprawl_biclustering'`).
- **filename** (*str*): Filename to save the plot (default: `None`).
- **num_sectors** (*int*): Number of sectors in the circular cell (default: `10`).
- **num_ccircles** (*int*): Number of concentric circles (default: `5`).
- **distance_threshold** (*float*): PP-Test distance. Used to create proximity heatmap. (default: `2`).

CellSP creates a comprehensive webpage summary for all detected modules. The report showcases key details such as the identified subcellular pattern, spatial distribution of module cells in the tissue, UMAP visualization of the module cells, and the functional characterization of the module. We use the function `cellSP.vs.create_report()`, which requires `adata_st` as input and runs the pattern visualization and other plotting functions to generate an HTML report. The report generated for this example is available as a PDF [here](https://github.com/bhavaygg/CellSP/blob/main/figures/example_report.pdf).

In [None]:
cellSP.vs.create_report(adata_st)

--- 

## `AnnData` format

The `AnnData` object is updated with the outputs of each function. Below we explain what each output holds and means.

In [None]:
adata_st

AnnData object with n_obs × n_vars = 5000 × 16171
    obs: 'region', 'uniform_density', 'rna_count_based_density'
    uns: 'cell_boundary', 'cpb_results', 'geneList', 'instant_biclustering', 'instant_biclustering_geo_cell', 'instant_biclustering_geo_module', 'instant_fsm', 'instant_fsm_geo_cell', 'instant_fsm_geo_module', 'nV4_cliques', 'overlap_genes', 'pp_test_d2_pvalues', 'sprawl_biclustering', 'sprawl_biclustering_geo_cell', 'sprawl_biclustering_geo_module', 'sprawl_scores', 'training_genes', 'transcripts'
    obsm: 'genecount', 'spatial'

- `obs`
    - `region`: ID of the region where the cell lies. (Not used)
    - `cell_type`: Cell type of the cell. Used for sampling during cell characterization.
    - `uniform_density` and `rna_count_based_density`: Tangram outputs.
- `uns`
    - `transcripts`: `DataFrame` with columns `uID` (cell ID), `absX`, `absY`, `absZ` (optional), `gene`.
    - `cpb_results`: InSTAnT outputs. CPB Test results (DataFrame), `geneList`: genes used (list), `nV4_cliques`: FSM output (cliques), `pp_test_d2_pvalues`: PP Test p-values (NumPy matrix).
    - `sprawl_scores`: SPRAWL output. Dictionary keyed by the patterns for which SPRAWL was run. Values are NumPy matrices of shape n_cells × n_genes.
    - `cell_boundary`: `DataFrame` with columns `vertex_x` and `vertex_y`. Used in `SPRAWL` and plotting. If not present, a convex hull of the transcripts is used as cell boundary. 
    - `training_genes` and `overlap_genes`: Tangram outputs.
    - `instant_biclustering` and `sprawl_biclustering`: Outputs of module discovery step. Described below.
    - `instant_biclustering_geo_{}` and `sprawl_biclustering_geo_{}`: Outputs of module characterization step. Described below.
- `obsm`:
    - `spatial`: `DataFrame` with columns `absX`, `absY`, `absZ`(Optional) with the spatial location of each cell in the tissue.
    - `X-UMAP`: `DataFrame` with columns `UMAP-1`, `UMAP-2` corresponding to the first and second UMAP dimension of each cell. Used for plotting in report.

---

#### Module Discovery Outputs

Outputs from biclustering on `InSTAnT` and `SPRAWL` are presented as `DataFrames`, as shown below. Each row represents a detected module. The columns are:
- `gene-pairs`: (for InSTAnT) Pairs of genes in the module.
- `genes`: Genes in the module
- `uIDs`: Cells in the module
- `#cells`: Number of cells in the module.
- `combined`: Number of times the module was merged.
- `tangram`: Average cross-validation accuracy for module classification using genes other than module genes.
- `tangram_corr`: Average cross-validation accuracy for module classification using genes other than module genes and genes correlated with them.
- `baseline`: Average cross-validation accuracy for module classification using module genes.
- `shap genes`: Top 20 most predictive genes calculated using `SHAP`, also called the marker genes of the module cells.
- `GO Module`: Most significant GO term found using enrichment test of module genes.
- `GO Cell`: Most significant GO term found using enrichment test of marker genes of module cells.
- `#pc_genes`: Number of genes found to be highly correlated with the marker genes of the module cells and are used for characterization of the module cells.


In [22]:
adata_st.uns['instant_biclustering']

Unnamed: 0,gene-pairs,genes,uIDs,#cells,combined,pre cell expansion,post cell expansion,instant average,instant score,tangram,tangram_corr,baseline,shap genes,GO Module,#pc_genes,GO Cell
0,"(fgd5,pecam1),(fgd5,nostrin),(nostrin,zfp366),...","cldn5,ly6a,paqr5,fgd5,acvrl1,pecam1,kdr,cobll1...","140436,129048,5309,39156,141970,45390,140846,9...",423,1,34133.725391,34133.725391,"0.8341564079291461,0.7305973067255646:0.637820...","34133.725390515654,34133.725390515654:31150.94...",0.879667,0.876585,0.873298,"col1a2,srgn,pik3cg,myoc,fmod,acta2,cp,mpzl2,fo...",angiogenesis,59.0,collagen fibril organization
1,"(plekha2,trem2),(laptm5,trem2),(cd68,spi1),(kc...","lyz2,spi1,cd300c2,laptm5,trem2,plekha2,arhgap2...","79722,95083,13368,149578,13804,156896,100660,1...",211,1,24778.336737,24778.336737,"1.3074074013208965,0.9338975380662246:0.884153...","24778.33673746181,24778.33673746181:21730.8592...",0.839557,0.84778,0.834535,"gm5136,gm19426,bcl2a1b,dgkq,itgb2,gm2582,17001...",cellular response to lipopolysaccharide,32.0,extracellular matrix organization
2,"(col1a1,igf2), (aldh1a2,igf2), (aldh1a2,fmod),...","col1a1,igf2,lyz2,cyp1b1,spp1,pdgfra,ccn2,gjb2,...","40682,8215,3804,2392,155150,30568,10615,14188,...",127,0,16971.652816,16971.652816,0.7376447113799066,16971.652815512436,0.944877,0.932662,0.942615,"nfatc4,slc22a6,olfr1033,tfpi,adam12,stra6,aebp...",response to hormone,61.0,skeletal system development
3,"(opalin,sox10), (gjc3,opalin), (clmn,opalin), ...","opalin,sox10,tmem163,zfp536,clmn,dpy19l1,sema6...","115752,30,103084,112906,48454,73187,128197,113...",303,0,15972.587709,15972.587709,0.9100920497163525,15972.587709481668,0.918033,0.909601,0.913104,"gm42756,d7ertd443e,smco3,gm44866,sec14l5,gm459...",ensheathment of neurons,76.0,myelination
4,"(cspg4,pdgfra), (cspg4,gpr17), (pdgfra,sema3d)...","sema3d,sox10,cspg4,pdgfra,gpr17,pou3f1,tmem255...","46594,128197,88969,120518,20507,94389,1377,906...",120,0,13568.407786,13568.407786,1.218629438706812,13568.407785650978,0.8275,0.8125,0.821667,"ermn,creb5,c030029h02rik,tsc22d4,cpm,d7ertd443...",ensheathment of neurons,92.0,ensheathment of neurons
5,"(gad1,rab3b), (gad1,gad2), (gad2,rab3b), (dner...","cbln4,thsd7a,gad1,kcnmb2,sst,sema3e,necab1,gad...","19699,159779,61459,62491,143327,15746,96168,40...",134,0,9092.400214,9092.400214,0.9149824103740001,9092.400214200303,0.776439,0.786111,0.786952,"prokr2,npy,6530411m01rik,npas1,catip,slc32a1,g...",glutamate catabolic process,31.0,cerebral cortex GABAergic interneuron differen...
6,"(aqp4,ntsr2), (aqp4,slc39a12), (ntsr2,slc39a12...","slc39a12,ntsr2,gli3,rmst,aqp4,clmn,acsbg1,hapl...","45322,112631,69762,79737,46629,107264,120534,5...",286,0,5291.380998,5291.380998,0.7901200793394628,5291.380998292888,0.825188,0.823436,0.82131,"ac034116.3,phka1,ppara,nat8f3,ccdc141,gm35552,...",regulation of smoothened signaling pathway,27.0,regulation of generation of precursor metaboli...
7,"(ano1,carmn), (car4,carmn), (cspg4,inpp4b), (c...","acta2,cspg4,ccn2,carmn,inpp4b,ano1,car4","78280,26700,127795,107900,10612,80885,61426,14...",34,0,4333.368398,4333.368398,1.110971154375574,4333.368397747355,0.848333,0.853571,0.955714,"crip1,cp,cavin1,myh11,cfh,pf4,gm47719,mpzl2,s1...",positive regulation of MAPK cascade,79.0,tissue development


In [5]:
adata_st.uns['sprawl_biclustering']

Unnamed: 0,method,genes,uIDs,#cells,combined,pre cell expansion,post cell expansion,sprawl average,sprawl score,tangram,tangram_corr,baseline,shap genes,GO Module,#pc_genes,GO Cell
0,Peripheral,"slc13a4,aldh1a2,spp1,fmod,dcn,igf2,acta2","8215,140712,126630,136597,155150,96880,57662,1...",61,0,1627.962243,1627.962243,0.2127366407756252,1627.9622432729848,0.848526,0.843269,0.870192,"ogn,bmp6,h2-ab1,cfh,myoc,slc13a3,cp,zic2,fam18...",regulation of vascular endothelial cell prolif...,65.0,skeletal system development
1,Radial,"cldn5,adgrl4,pecam1,acvrl1,ly6a,fn1,pglyrp1,so...","76365,141457,80575,61374,115711,108543,94090,4...",181,0,4501.068575,4501.068575,0.3430488737781555,4501.068575461013,0.904369,0.901622,0.858919,"slc6a20a,4930523c07rik,gjb2,asgr1,itih2,gm4267...",angiogenesis,56.0,renal system development
2,Radial,"cd53,trem2,cd300c2,siglech,cd68,ikzf1,spi1,laptm5","19695,94774,42871,136552,40598,112643,42585,41...",89,0,4192.935669,4192.935669,0.4662921348314606,4192.935668805927,0.885425,0.887124,0.823595,"crybb1,ltc4s,c1qc,ajuba,tyrobp,lyl1,il21r,fcgr...",regulation of immune system process,32.0,"complement activation, classical pathway"
3,Radial,"aldh1a2,col1a1,fmod,spp1,gjb2,cyp1b1,slc13a4,d...","155150,137037,14188,140712,120166,143128,8932,...",79,0,2197.957627,2197.957627,0.3308917018284107,2197.95762713589,0.915458,0.90775,0.919667,"cp,tbx15,prelp,asgr1,fgl2,lbp,foxc2,slc26a7,wn...",response to nutrient,60.0,skeletal system development
4,Radial,"ano1,carmn,cspg4","10612,57455,64726,46017,107900,149206,61426,40...",57,0,2145.904307,2145.904307,0.5977076023391813,2145.904306614343,0.825833,0.816136,0.927045,"gata2,a2m,pcolce,osr1,b230323a14rik,hspg2,slc2...",gliogenesis,56.0,renal system development
5,Radial,"gpr17,cspg4,sema3d,pdgfra","94389,11221,104206,1377,73950,85223,130273,103...",104,0,1938.946652,1957.42921,0.5515961538461538,1957.4292098625488,0.781024,0.782381,0.751786,"litaf,tmem98,arrdc2,pla2g16,gm33594,gatm,prr5l...",platelet-derived growth factor receptor signal...,91.0,myelination
6,Punctate,"cldn5,adgrl4,pecam1,acvrl1,ly6a,pglyrp1,fn1,kd...","91165,10170,57455,53774,108270,54266,17139,420...",204,0,6981.428102,6981.428102,0.4034558823529411,6981.428102117615,0.930476,0.927756,0.89028,"col1a2,fmod,lum,itih2,igf2,pf4,gjb2,fam180a,co...",angiogenesis,54.0,renal system development
7,Punctate,"cd300c2,cd53,trem2,ikzf1,siglech,spi1,cd68,laptm5","19695,63439,84448,94774,88506,112643,10693,128...",99,0,6559.742852,6559.742852,0.5703484848484849,6559.742851723009,0.870868,0.872395,0.810711,"tyrobp,ang,cdca7,vsir,p3h4,pllp,crybb1,gm44532...",regulation of immune system process,58.0,ensheathment of neurons
8,Punctate,"cspg4,gpr17,ano1,carmn,sema3d,pdgfra","46017,42508,61426,57455,136504,64726,103198,94...",193,0,2828.799095,2828.799095,0.3883246977547495,2828.79909467086,0.819845,0.820081,0.838354,"cldn14,gm10687,2700046a07rik,serpinb1a,gm44866...",platelet-derived growth factor receptor signal...,84.0,ensheathment of neurons
9,Punctate,"fmod,col1a1,spp1,aldh1a2,dcn,gjb2,cyp1b1,igf2,...","155150,137037,143128,26240,93733,40698,140712,...",88,0,2340.250294,2340.250294,0.3084454545454546,2340.250294497428,0.904673,0.886405,0.911993,"itih2,slc6a12,aebp1,serpind1,slc6a13,emp3,lbp,...",response to nutrient,51.0,nephron development
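The tables above display `genes` as comma-separated strings. Assuming they are stored that way, the sketch below turns them into lists for downstream work. The two rows are copied from the `sprawl_biclustering` output above, with only a subset of columns.

```python
import pandas as pd

# Two rows copied from the `sprawl_biclustering` output above (subset of columns).
modules = pd.DataFrame({
    "method": ["Radial", "Radial"],
    "genes": ["ano1,carmn,cspg4", "gpr17,cspg4,sema3d,pdgfra"],
    "#cells": [57, 104],
})

# Split the comma-separated strings into Python lists and count genes per module.
modules["gene_list"] = modules["genes"].str.split(",")
modules["#genes"] = modules["gene_list"].str.len()

# Genes shared by the two modules.
shared = set(modules.loc[0, "gene_list"]) & set(modules.loc[1, "gene_list"])
print(modules[["method", "#genes", "#cells"]])
print(shared)
```

The same pattern applies to the `uIDs` and `shap genes` columns if they are stored in the same string format.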


---

#### Module Characterization Outputs

The module characterization step stores the results of the GO enrichment tests as `DataFrames` contained within a dictionary. `instant_biclustering_geo_{x}` and `sprawl_biclustering_geo_{x}` are dictionaries with the module number (str) as keys and DataFrames as values. Each DataFrame is the standard `csv` output of the PantherDB overrepresentation test. `x` is either `cell` or `module`, for the characterization of the module cells or the module genes respectively.


In [18]:
adata_st.uns['instant_biclustering_geo_module']['0']

Unnamed: 0,number_in_list,fold_enrichment,fdr,expected,number_in_reference,pValue,term,id,dataset
0,8,6.068111,0.018886,1.318367,17,0.000004,angiogenesis,GO:0001525,BP
1,8,4.912281,0.066385,1.628571,21,0.000031,blood vessel morphogenesis,GO:0048514,BP
2,8,3.967611,0.272274,2.016327,26,0.000189,blood vessel development,GO:0001568,BP
3,9,3.413313,0.251671,2.636735,34,0.000232,tube morphogenesis,GO:0035239,BP
4,8,3.820663,0.221991,2.093878,27,0.000256,vasculature development,GO:0001944,BP
...,...,...,...,...,...,...,...,...,...
79,2,5.157895,1.000000,0.387755,5,0.049595,transforming growth factor beta receptor super...,GO:0141091,BP
81,2,5.157895,1.000000,0.387755,5,0.049595,transforming growth factor beta receptor signa...,GO:0007179,BP
82,2,5.157895,1.000000,0.387755,5,0.049595,cell surface receptor protein serine/threonine...,GO:0007178,BP
80,2,5.157895,1.000000,0.387755,5,0.049595,positive regulation of vasculature development,GO:1904018,BP
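A typical next step is to keep only terms that pass a false discovery rate cutoff. The sketch below uses the first five rows of the output above, with a subset of columns. The cutoff of 0.05 is an assumption, not a CellSP default. It also checks that `fold_enrichment` equals `number_in_list / expected`, which holds for these rows.

```python
import pandas as pd

# First five rows of the example output above (subset of columns).
go = pd.DataFrame({
    "term": [
        "angiogenesis",
        "blood vessel morphogenesis",
        "blood vessel development",
        "tube morphogenesis",
        "vasculature development",
    ],
    "number_in_list": [8, 8, 8, 9, 8],
    "expected": [1.318367, 1.628571, 2.016327, 2.636735, 2.093878],
    "fold_enrichment": [6.068111, 4.912281, 3.967611, 3.413313, 3.820663],
    "fdr": [0.018886, 0.066385, 0.272274, 0.251671, 0.221991],
})

FDR_CUTOFF = 0.05  # assumed cutoff

# Terms that remain significant after multiple-testing correction.
significant = go[go["fdr"] < FDR_CUTOFF]
print(significant[["term", "fold_enrichment", "fdr"]])

# Fold enrichment recomputed from the observed and expected counts.
recomputed = go["number_in_list"] / go["expected"]
print(recomputed.round(3).tolist())
```

With this cutoff only `angiogenesis` remains, matching the `GO Module` entry reported for module 0.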
