## Single-Cell RNA-Seq Processing: Using CZ Data as the Example

#### The only package you need to install to process single-cell RNA-seq data from an h5ad file is biobox-analytics. Installing it should also make the other modules used here available: scanpy and pandas are third-party packages, while json, os, gzip, math, datetime and itertools are part of the Python standard library.
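Before running the notebook, it can help to confirm that the modules named above are importable. The check below is a minimal sketch, not part of the package: `missing_modules` is a hypothetical helper, and `biobox_analytics` is assumed to be the import name because it is the one used in the first code cell.

```python
from importlib.util import find_spec

def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]

# Standard-library modules named above; these ship with Python itself.
stdlib_listed = ["json", "os", "gzip", "math", "datetime", "itertools"]
# Third-party modules named above, plus the package's import name.
third_party_listed = ["scanpy", "pandas", "biobox_analytics"]

print("missing stdlib modules:", missing_modules(stdlib_listed))
print("missing third-party modules:", missing_modules(third_party_listed))
```

An empty list on both lines means the environment is ready; any name printed on the second line points to an incomplete install.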

In [1]:
from biobox_analytics.data.adapters.scrna import ScRNA

#### Pass in the path to your h5ad file. The ScRNA instance loads the file with the scanpy.read_h5ad function and holds the resulting AnnData (scanpy) object in its .rna attribute.

In [2]:
scrna = ScRNA(h5adFile="/Users/hamza/Documents/BX/TestData/SingleCell/cz/scRna-0206ea52-4932-4c71-87ed-58d00ffffd49.h5ad")

#### You can inspect the object freely. The two properties the adapter expects are obs, which holds metadata about each cell, and var, which holds information about the observed features (i.e. the genes). Inspect the columns of the obs dataframe to decide which metadata you'd like to capture for the SingleCellExperiment and Donor nodes respectively. The obs dataframe is also expected to contain a column that maps each cell to a cell type, encoded as Cell Ontology IDs.

In [3]:
scrna.rna.obs

Unnamed: 0,donor_id,self_reported_ethnicity_ontology_term_id,organism_ontology_term_id,sample_uuid,sample_preservation_method,tissue_ontology_term_id,development_stage_ontology_term_id,suspension_uuid,suspension_type,library_uuid,...,tissue_type,cell_type,assay,disease,organism,sex,tissue,self_reported_ethnicity,development_stage,observation_joinid
AAACCTGAGTGTTAGA-1,control_1,HANCESTRO:0005,NCBITaxon:9606,d8b737fb-8a38-4a1d-a421-cbc81edddd05,flash-freezing,UBERON:0001225,HsapDv:0000148,a0d05ece-11da-451c-8178-bc66d038afa6,nucleus,cccd2996-0f80-4b49-b210-86a55b800dd1,...,tissue,epithelial cell of proximal tubule,10x 5' v1,normal,Homo sapiens,male,cortex of kidney,European,54-year-old human stage,7R8xa1|~-!
AAACCTGCAAGCGCTC-1,control_1,HANCESTRO:0005,NCBITaxon:9606,d8b737fb-8a38-4a1d-a421-cbc81edddd05,flash-freezing,UBERON:0001225,HsapDv:0000148,a0d05ece-11da-451c-8178-bc66d038afa6,nucleus,cccd2996-0f80-4b49-b210-86a55b800dd1,...,tissue,kidney loop of Henle thick ascending limb epit...,10x 5' v1,normal,Homo sapiens,male,cortex of kidney,European,54-year-old human stage,3Pc4Dv65rX
AAACCTGCACCAGATT-1,control_1,HANCESTRO:0005,NCBITaxon:9606,d8b737fb-8a38-4a1d-a421-cbc81edddd05,flash-freezing,UBERON:0001225,HsapDv:0000148,a0d05ece-11da-451c-8178-bc66d038afa6,nucleus,cccd2996-0f80-4b49-b210-86a55b800dd1,...,tissue,kidney loop of Henle thick ascending limb epit...,10x 5' v1,normal,Homo sapiens,male,cortex of kidney,European,54-year-old human stage,F>nB>Hpgo?
AAACCTGCAGTCAGAG-1,control_1,HANCESTRO:0005,NCBITaxon:9606,d8b737fb-8a38-4a1d-a421-cbc81edddd05,flash-freezing,UBERON:0001225,HsapDv:0000148,a0d05ece-11da-451c-8178-bc66d038afa6,nucleus,cccd2996-0f80-4b49-b210-86a55b800dd1,...,tissue,epithelial cell of proximal tubule,10x 5' v1,normal,Homo sapiens,male,cortex of kidney,European,54-year-old human stage,+UUZ45nZm-
AAACCTGCATGGAATA-1,control_1,HANCESTRO:0005,NCBITaxon:9606,d8b737fb-8a38-4a1d-a421-cbc81edddd05,flash-freezing,UBERON:0001225,HsapDv:0000148,a0d05ece-11da-451c-8178-bc66d038afa6,nucleus,cccd2996-0f80-4b49-b210-86a55b800dd1,...,tissue,kidney distal convoluted tubule epithelial cell,10x 5' v1,normal,Homo sapiens,male,cortex of kidney,European,54-year-old human stage,4U;((Btbm%
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TTTGTTGAGACATGCG-11,healthy_6,unknown,NCBITaxon:9606,8949b21f-6f57-40c0-8f7b-0d6e9bff05cd,flash-freezing,UBERON:0001225,HsapDv:0000153,02dbb079-1fdd-4811-81de-9ba8b6962916,nucleus,84cce7e6-e630-498a-b4a9-3664b6cde8ad,...,tissue,epithelial cell of proximal tubule,10x 3' v3,normal,Homo sapiens,female,cortex of kidney,unknown,59-year-old human stage,q%afF8ua<f
TTTGTTGAGGCTTAGG-11,healthy_6,unknown,NCBITaxon:9606,8949b21f-6f57-40c0-8f7b-0d6e9bff05cd,flash-freezing,UBERON:0001225,HsapDv:0000153,02dbb079-1fdd-4811-81de-9ba8b6962916,nucleus,84cce7e6-e630-498a-b4a9-3664b6cde8ad,...,tissue,epithelial cell of proximal tubule,10x 3' v3,normal,Homo sapiens,female,cortex of kidney,unknown,59-year-old human stage,vqGfC+&&-2
TTTGTTGCAATCTGCA-11,healthy_6,unknown,NCBITaxon:9606,8949b21f-6f57-40c0-8f7b-0d6e9bff05cd,flash-freezing,UBERON:0001225,HsapDv:0000153,02dbb079-1fdd-4811-81de-9ba8b6962916,nucleus,84cce7e6-e630-498a-b4a9-3664b6cde8ad,...,tissue,kidney loop of Henle thick ascending limb epit...,10x 3' v3,normal,Homo sapiens,female,cortex of kidney,unknown,59-year-old human stage,NCd`koy0En
TTTGTTGGTACGTGTT-11,healthy_6,unknown,NCBITaxon:9606,8949b21f-6f57-40c0-8f7b-0d6e9bff05cd,flash-freezing,UBERON:0001225,HsapDv:0000153,02dbb079-1fdd-4811-81de-9ba8b6962916,nucleus,84cce7e6-e630-498a-b4a9-3664b6cde8ad,...,tissue,epithelial cell of proximal tubule,10x 3' v3,normal,Homo sapiens,female,cortex of kidney,unknown,59-year-old human stage,^8Z$aU^A%q
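Since the adapter expects the cell type column to hold ontology IDs rather than free-text labels, a quick format check can catch a wrong column choice early. The sketch below assumes Cell Ontology IDs follow the usual `CL:` plus seven digits pattern; `non_cl_values` is a hypothetical helper and the example values are illustrative only.

```python
import re

# Cell Ontology CURIEs look like "CL:" followed by seven digits.
CL_ID = re.compile(r"^CL:\d{7}$")

def non_cl_values(values):
    """Return the distinct values that are not well-formed CL IDs."""
    return sorted({str(v) for v in values if not CL_ID.match(str(v))})

# Illustrative values only; in the notebook you would pass
# scrna.rna.obs["cell_type_ontology_term_id"] instead.
example = [
    "CL:0000000",                          # well-formed ID
    "CL_0000000",                          # wrong separator
    "epithelial cell of proximal tubule",  # label, not an ID
    "CL:0000000",                          # duplicates are ignored
]
print(non_cl_values(example))
```

An empty result suggests the column is a reasonable candidate for the cell type mapping; anything listed is a value worth looking at before export.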


In [4]:
scrna.rna

AnnData object with n_obs × n_vars = 39176 × 36398
    obs: 'donor_id', 'self_reported_ethnicity_ontology_term_id', 'organism_ontology_term_id', 'sample_uuid', 'sample_preservation_method', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'suspension_uuid', 'suspension_type', 'library_uuid', 'assay_ontology_term_id', 'mapped_reference_annotation', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'disease_ontology_term_id', 'reported_diseases', 'sex_ontology_term_id', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'percent.rpl', 'percent.rps', 'doublet_id', 'nCount_SCT', 'nFeature_SCT', 'seurat_clusters', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length'
    uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_har

#### Now that you've seen the available columns and have a better understanding of your data, it's time to choose the columns to use. Set the following variables to the appropriate column names:
- celltype_col: Column containing the cell type ontology ID for each cell
- library_col: Column containing the UUID of the single-cell RNA experiment (library)
- sample_col: Column containing the UUID of the sample
- library_metadata_cols: List of columns that hold the metadata associated with the single-cell experiment. Only unique rows per library_col are kept before each row is transformed into a single-cell experiment object
- sample_metadata_cols: List of columns that hold the metadata associated with the samples. Only unique rows per sample_col are kept before each row is transformed into a sample object

In [5]:
celltype_col = 'cell_type_ontology_term_id'
library_col = 'library_uuid'
sample_col = 'sample_uuid'
sample_metadata_cols = ['donor_id', 'sample_uuid', 'self_reported_ethnicity_ontology_term_id', 'sample_preservation_method', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage']
library_metadata_cols = ['library_uuid', 'assay', 'tissue_type']
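The "only unique rows per library_col" step can be pictured with plain pandas. This is a sketch of the idea on a toy dataframe, not the adapter's actual implementation; the toy `obs` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for scrna.rna.obs: three cells from two libraries.
obs = pd.DataFrame(
    {
        "library_uuid": ["lib-a", "lib-a", "lib-b"],
        "assay": ["10x 5' v1", "10x 5' v1", "10x 3' v3"],
        "tissue_type": ["tissue", "tissue", "tissue"],
    },
    index=["cell-1", "cell-2", "cell-3"],
)

library_col = "library_uuid"
library_metadata_cols = ["library_uuid", "assay", "tissue_type"]

# Fail early if a requested column is missing from obs.
missing = [c for c in library_metadata_cols if c not in obs.columns]
assert not missing, f"columns not found in obs: {missing}"

# Keep one row per library, then turn each row into a plain dict.
libraries = (
    obs[library_metadata_cols]
    .drop_duplicates(subset=library_col)
    .to_dict(orient="records")
)
print(libraries)
```

The same pattern applies to sample_col and sample_metadata_cols. Checking for missing column names first is cheap and avoids a KeyError partway through a long export.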

#### With the variables set, you can run the iterate_nodes and iterate_edges functions to write a node.jsonl.gz and an edge.jsonl.gz file respectively. Both functions append to their output file, so remove the files from any previous run from the directory before calling a function again. The files follow the schema that the BioBox platform ingests. To hold the nodes and edges in memory instead, set write_to_disk=False in the function call. Note that, because of the very large number of edges between cell barcodes and genes, those edges will not be held in memory.
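Because the writes append, a leftover file from an earlier run would end up containing both runs. The sketch below shows that effect with a gzip file in a temporary directory and a small cleanup helper. `clear_outputs` is hypothetical, and it assumes the output files are written under the names shown in the text to a directory you pass in.

```python
import gzip
import tempfile
from pathlib import Path

def clear_outputs(directory, names=("node.jsonl.gz", "edge.jsonl.gz")):
    """Delete output files left by an earlier run; return the names removed."""
    removed = []
    for name in names:
        path = Path(directory) / name
        if path.exists():
            path.unlink()
            removed.append(name)
    return removed

# Demonstration in a throwaway directory, so no real output is touched.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "node.jsonl.gz"
    # Two appends, as two consecutive runs would produce.
    for _ in range(2):
        with gzip.open(target, "at", encoding="utf-8") as fh:
            fh.write('{"run": "example"}\n')
    with gzip.open(target, "rt", encoding="utf-8") as fh:
        lines_before = fh.read().splitlines()
    removed = clear_outputs(tmp)
    exists_after = target.exists()

print(len(lines_before), removed, exists_after)
```

In the notebook, calling something like `clear_outputs(".")` before each export would play the same role, provided the files are written to the current working directory.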

In [6]:
scrna.iterate_nodes(
    sample_id_col=sample_col,
    sc_library_experiment_id=library_col,
    sample_metadata_cols_to_subset=sample_metadata_cols,
    sc_experiment_cols_to_subset=library_metadata_cols
)

Running function in write mode. Writing to file node.jsonl.gz. To return nodes, set write_to_disk=False in function call
Create cell nodes
Writing 39176 cell nodes to file
Create experiment nodes
Writing 11 experiment nodes to file
Create sample nodes
Writing 11 sample nodes to file
All nodes written to file: node.jsonl.gz
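To spot-check what was written, the file can be read back with the standard library alone. This sketch assumes the usual JSON Lines layout of one JSON object per line, as the .jsonl extension suggests. `read_jsonl_gz` is a hypothetical helper and the demo records are placeholders, not the adapter's real node schema.

```python
import gzip
import json
import tempfile
from pathlib import Path

def read_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

# Self-contained demo with a tiny stand-in file. The record fields are
# hypothetical; the real node schema is defined by the adapter.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "node.jsonl.gz"
    with gzip.open(demo, "wt", encoding="utf-8") as fh:
        fh.write(json.dumps({"id": "cell-1"}) + "\n")
        fh.write(json.dumps({"id": "cell-2"}) + "\n")
    records = list(read_jsonl_gz(demo))

print(len(records))
```

Because the helper is a generator, it can also be used on a large file to look at just the first few records without loading everything into memory.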


#### To prevent excessive memory usage, the cell-gene edges are processed in batches of 1000 cells. Each batch is transformed into the correct payload, written to disk, and then cleared from memory before the next batch begins. The printed messages tell you which batch is currently being written.

In [6]:
scrna.iterate_edges(
    sample_id_col=sample_col,
    sc_library_experiment_id=library_col,
    celltype_id_col=celltype_col
)

Running function in write mode. Writing to file edge.jsonl.gz. To return edges, set write_to_disk=False in function call
Calculating experiment-cell edges
Writing 39176 experiment-cell edges to file
Calculating sample-experiment edges
Writing 11 sample-experiment edges to file
Calculating cell-celltype edges
Writing 39176 cell-celltype edges to file
Calculating cell-gene edges
Number of cells to process: 39176
Starting Cell x Gene edge processing now: 2024-06-13 15:25:39.268704
Processing batch index 0:1000
Processing batch index 1000:2000
Processing batch index 2000:3000
Processing batch index 3000:4000
Processing batch index 4000:5000
Processing batch index 5000:6000
Processing batch index 6000:7000
Processing batch index 7000:8000
Processing batch index 8000:9000
Processing batch index 9000:10000
Processing batch index 10000:11000
Processing batch index 11000:12000
Processing batch index 12000:13000
Processing batch index 13000:14000
Processing batch index 14000:15000
Processing bat
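The batch boundaries in the log follow a simple fixed-size slicing pattern. The sketch below reproduces that pattern for the 39176 cells in this dataset; `batch_bounds` is a hypothetical helper, and whether the adapter clamps the final batch label in the same way is an assumption.

```python
def batch_bounds(n_cells, batch_size=1000):
    """Yield (start, stop) index pairs covering range(n_cells) in order."""
    for start in range(0, n_cells, batch_size):
        # Clamp the last batch so it never runs past the final cell.
        yield start, min(start + batch_size, n_cells)

bounds = list(batch_bounds(39176))
print(len(bounds), bounds[0], bounds[-1])
```

Each (start, stop) pair can be used to slice the cells, build the edge payload for that slice, write it out, and release it before moving on, which keeps peak memory proportional to the batch size rather than to the whole matrix.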

#### Additionally, to understand the schema of the concepts and relationships of the adapter, you can call the list_schema() function, which will return the metadata associated with this datapack. You can update the name, key, and description of the datapack through the function set_metadata() prior to calling the list_schema() method.

In [3]:
scrna.list_schema()

{"_meta": {"version": "0.0.1", "date_updated": "2024-06-13 18:26:29.138131"}, "name": "SingleCellRNASeq Datapack - 2024-06-13 18:26:29.138131", "key": "scrna:2024-06-13 18:26:29.138131", "description": "SingleCellRNASeq Datapack created through Python SDK", "dependencies": ["Ensembl"], "concepts": {"SingleCellRNAseqExperiment": {"label": "SingleCellRNAseqExperiment", "dbLabel": "SingleCellRNAseqExperiment", "definition": "Single Cell RNAseq Experiment of the sample tissue"}, "Sample": {"label": "Sample", "dbLabel": "Sample", "definition": "Sample organism from which tissue was taken to be analyzed"}, "CellBarcode": {"label": "CellBarcode", "dbLabel": "CellBarcode", "definition": "Individual cell from scRNA experiment, identified by barcode"}}, "relationships": {"contains cell": {"from": "SingleCellRNAseqExperiment", "to": "CellBarcode"}, "expresses": {"from": "CellBarcode", "to": "Gene"}, "has experiment": {"from": "Sample", "to": "SingleCellRNAseqExperiment"}, "has cell type": {"from"

{'_meta': {'version': '0.0.1', 'date_updated': '2024-06-13 18:26:29.138131'},
 'name': 'SingleCellRNASeq Datapack - 2024-06-13 18:26:29.138131',
 'key': 'scrna:2024-06-13 18:26:29.138131',
 'description': 'SingleCellRNASeq Datapack created through Python SDK',
 'dependencies': ['Ensembl'],
 'concepts': {'SingleCellRNAseqExperiment': {'label': 'SingleCellRNAseqExperiment',
   'dbLabel': 'SingleCellRNAseqExperiment',
   'definition': 'Single Cell RNAseq Experiment of the sample tissue'},
  'Sample': {'label': 'Sample',
   'dbLabel': 'Sample',
   'definition': 'Sample organism from which tissue was taken to be analyzed'},
  'CellBarcode': {'label': 'CellBarcode',
   'dbLabel': 'CellBarcode',
   'definition': 'Individual cell from scRNA experiment, identified by barcode'}},
 'relationships': {'contains cell': {'from': 'SingleCellRNAseqExperiment',
   'to': 'CellBarcode'},
  'expresses': {'from': 'CellBarcode', 'to': 'Gene'},
  'has experiment': {'from': 'Sample', 'to': 'SingleCellRNAseqExp
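The returned schema is a plain dictionary, so it can be explored with ordinary Python. The sketch below copies only the parts of the output that are fully visible above and lists each relationship with its endpoints. It also reports endpoints that the datapack does not define itself, which are presumably supplied by the listed Ensembl dependency.

```python
# Subset of the schema printed above; the truncated "has cell type"
# relationship is left out rather than guessed.
schema = {
    "dependencies": ["Ensembl"],
    "concepts": {
        "SingleCellRNAseqExperiment": {},
        "Sample": {},
        "CellBarcode": {},
    },
    "relationships": {
        "contains cell": {"from": "SingleCellRNAseqExperiment", "to": "CellBarcode"},
        "expresses": {"from": "CellBarcode", "to": "Gene"},
        "has experiment": {"from": "Sample", "to": "SingleCellRNAseqExperiment"},
    },
}

# List each relationship as "source -[name]-> target".
for name, rel in schema["relationships"].items():
    print(f'{rel["from"]} -[{name}]-> {rel["to"]}')

# Endpoints that this datapack does not define as its own concepts.
endpoints = {rel[k] for rel in schema["relationships"].values() for k in ("from", "to")}
external = sorted(endpoints - set(schema["concepts"]))
print(external)
```

In the notebook, the same loop can be run directly on the value returned by `scrna.list_schema()` instead of the hand-copied subset.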