## Single Cell ATAC Seq processing: Using CZ Data as the example

#### The only package you need to install to process single-cell ATAC-seq data from an h5ad file is biobox-analytics. Installing it should also pull in its third-party dependencies, scanpy and pandas. It additionally uses the modules json, os, gzip, math, datetime, and itertools, which ship with the Python standard library.

In [1]:
from biobox_analytics.data.adapters.scatac import ScATAC

#### Pass in the path to your h5ad file. The ScATAC instance loads the file with the scanpy.read_h5ad function and holds the resulting AnnData object in its .atac attribute

In [2]:
scatac = ScATAC(h5adFile="/Users/hamza/Downloads/OvaryATAC-aa3e7259-0864-4c04-9e3d-e05c3c05d879.h5ad")

#### Optionally, set the display name, description, and key of the datapack that you are generating from your file. This step is not required at this time and can be skipped

In [3]:
scatac.set_metadata(displayName="Ovary ATAC - A molecular atlas of the human postmenopausal fallopian tube and ovary from single-cell RNA and ATAC sequencing", description="As part of the Human Cell Atlas initiative, we generated transcriptomic (scRNA-seq; 86,708 cells) and regulatory (scATAC-seq; 59,118 cells) profiles of the normal postmenopausal ovary and fallopian tube (FT) at single-cell resolution. In the FT, 22 cell clusters integrated into 11 cell types, including ciliated and secretory epithelial cells, while the ovary had 17 distinct cell clusters defining 6 major cell types.", key="EGAS00001006780_OvaryATAC")

#### You can inspect your object freely. The main properties the adapter expects are obs, which holds metadata about each cell, and var, which holds information about the features being observed (i.e. the genes). Inspect the columns of the obs dataframe to decide what metadata you'd like to capture for the SingleCellExperiment and Donor nodes respectively. The obs dataframe is expected to contain a column that maps each cell to a cell type, encoded using cell type ontology IDs

In [31]:
scatac.atac

AnnData object with n_obs × n_vars = 18315 × 19281
    obs: 'mapped_reference_assembly', 'alignment_software', 'donor_id', 'self_reported_ethnicity_ontology_term_id', 'donor_living_at_sample_collection', 'donor_menopausal_status', 'organism_ontology_term_id', 'sample_uuid', 'sample_preservation_method', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'sample_derivation_process', 'sample_source', 'suspension_derivation_process', 'suspension_uuid', 'suspension_type', 'library_uuid', 'assay_ontology_term_id', 'library_starting_quantity', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'disease_ontology_term_id', 'sex_ontology_term_id', 'nCount_ATAC', 'nFeature_ATAC', 'mitochondrial', 'sub_celltype', 'sample', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_len

#### Now that you've seen the available columns and have a better understanding of your data, it's time to choose which columns to use. Set the following variables to the appropriate column names:
- celltype_col: Column containing the cell type ontology ID for each cell
- library_col: Column containing the UUID of the single cell ATAC experiment
- sample_col: Column containing the UUID of the sample
- library_metadata_cols: Array of columns that contain the metadata/information associated with the single cell experiment. Only unique rows per library_col will be kept before transforming each row into an object of type single cell experiment
- sample_metadata_cols: Array of columns that contain the metadata/information associated with the samples. Only unique rows per sample_col will be kept before transforming each row into an object of type sample

In [4]:
celltype_col = 'cell_type_ontology_term_id'
library_col = 'library_uuid'
sample_col = 'sample_uuid'
sample_metadata_cols = ['donor_id', 'sample_uuid', 'self_reported_ethnicity_ontology_term_id', 'sample_preservation_method', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage']
library_metadata_cols = ['library_uuid', 'assay', 'tissue_type']
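The "only unique rows per ID column" behaviour described above can be pictured with plain pandas. This is a minimal sketch, not the package's actual implementation: `unique_metadata_rows` and `toy_obs` are hypothetical names, the toy values are made up, and keeping the first row per ID is an assumption.

```python
import pandas as pd

def unique_metadata_rows(obs, id_col, metadata_cols):
    """Keep one metadata row per distinct value of id_col (hypothetical helper)."""
    # Put the ID column first and drop repeated column names.
    cols = list(dict.fromkeys([id_col, *metadata_cols]))
    return obs[cols].drop_duplicates(subset=id_col).reset_index(drop=True)

# Toy stand-in for the obs dataframe: four cells from two libraries.
toy_obs = pd.DataFrame({
    "library_uuid": ["lib-1", "lib-1", "lib-2", "lib-2"],
    "assay": ["assay-a", "assay-a", "assay-a", "assay-a"],
    "tissue_type": ["tissue", "tissue", "tissue", "tissue"],
})

libraries = unique_metadata_rows(
    toy_obs, "library_uuid", ["library_uuid", "assay", "tissue_type"]
)
# One dict per library, ready to be turned into a node payload.
records = libraries.to_dict(orient="records")
```

Running the same helper on `scatac.atac.obs` with `library_col` and `library_metadata_cols` is a quick way to preview how many experiment rows your column choices would produce.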

#### With the variables now set, you can run the iterate_nodes and iterate_edges functions to output a node.jsonl.gz and an edge.jsonl.gz file respectively. Both functions append to their output file, so make sure the files from any previous run are removed from the directory before calling the functions again. These files will have the correct schema that is ingestible by the BioBox platform. If you would like to hold the nodes and edges in memory, you can set the parameter write_to_disk=False in the function call. Note that because of the extensive number of edge connections between each cell barcode and gene, the edges will not be held in memory
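Because the writes append, a small cleanup step before each run avoids duplicated records. The sketch below assumes the outputs land in the current working directory under the file names shown in the log messages; `clear_previous_outputs` is a hypothetical helper, not part of biobox-analytics.

```python
from pathlib import Path

# File names taken from the log messages in this notebook; the output
# directory (current working directory) is an assumption.
OUTPUT_FILES = ("node.jsonl.gz", "edge.jsonl.gz")

def clear_previous_outputs(directory=".", filenames=OUTPUT_FILES):
    """Delete leftover output files and return the paths that were removed."""
    removed = []
    for name in filenames:
        path = Path(directory) / name
        if path.exists():
            path.unlink()
            removed.append(path)
    return removed
```

Calling `clear_previous_outputs()` right before `iterate_nodes` and `iterate_edges` keeps each run's files free of records from earlier runs.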

In [5]:
nodes = scatac.iterate_nodes(
    sample_id_col=sample_col,
    sc_library_experiment_id=library_col,
    sample_metadata_cols_to_subset=sample_metadata_cols,
    sc_experiment_cols_to_subset=library_metadata_cols
)

Running function in write mode. Writing to file node.jsonl.gz. To return nodes, set write_to_disk=False in function call
Create cell nodes
Writing 18315 cell nodes to file
Create experiment nodes
Writing 3 experiment nodes to file
Create sample nodes
Writing 3 sample nodes to file
All nodes written to file: node.jsonl.gz
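To spot-check what was written, you can read the first few records back. This is a minimal sketch that assumes the file is gzip-compressed JSON Lines (one JSON object per line), as the .jsonl.gz extension suggests; `peek_jsonl_gz` is a hypothetical helper, and the fields inside each record are not documented here.

```python
import gzip
import json
from itertools import islice

def peek_jsonl_gz(path, n=3):
    """Return the first n records of a gzip-compressed JSON Lines file."""
    # gzip.open reads files built from several appended writes as one stream.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in islice(handle, n) if line.strip()]
```

For example, `peek_jsonl_gz("node.jsonl.gz")` returns a list of up to three dictionaries that you can inspect before uploading the file.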


#### To prevent excessive memory usage, the cell-gene edges are processed using batches of 1000 cells at a time. These batches are transformed into the correct payload, written to disk, and then cleared from memory before the next batch begins. The print messages will give you updates on which batch is currently being written.
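The batching pattern described above can be sketched with the standard library alone. This illustrates the general technique, not the package's internal code: `write_in_batches` is a hypothetical helper and the records passed to it stand in for whatever edge payload the adapter builds.

```python
import gzip
import json

def write_in_batches(records, path, batch_size=1000):
    """Append records to a gzip JSON Lines file one batch at a time."""
    total = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # Mode "at" appends a new gzip member; readers see one continuous stream.
        with gzip.open(path, "at", encoding="utf-8") as handle:
            for record in batch:
                handle.write(json.dumps(record) + "\n")
        total += len(batch)
        del batch  # drop the reference before building the next batch
    return total
```

Only one batch of serialized records is alive at any time, which is what keeps memory usage flat regardless of how many cells the dataset contains.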

In [6]:
scatac.iterate_edges(
    sample_id_col=sample_col,
    sc_library_experiment_id=library_col,
    celltype_id_col=celltype_col
)

Running function in write mode. Writing to file edge.jsonl.gz. To return edges, set write_to_disk=False in function call
Calculating experiment-cell edges
Writing 18315 experiment-cell edges to file
Calculating sample-experiment edges
Writing 3 sample-experiment edges to file
Calculating cell-celltype edges
Writing 18315 cell-celltype edges to file
Calculating cell-gene edges
Number of cells to process: 18315
Starting Cell x Gene edge processing now: 2024-06-14 13:38:11.847266
Processing batch index 0:1000 at time 2024-06-14 13:38:11.847288
Processing batch index 1000:2000 at time 2024-06-14 13:40:03.616322
Processing batch index 2000:3000 at time 2024-06-14 13:41:52.079281
Processing batch index 3000:4000 at time 2024-06-14 13:43:43.204352
Processing batch index 4000:5000 at time 2024-06-14 13:45:34.297835
Processing batch index 5000:6000 at time 2024-06-14 13:47:29.143572
Processing batch index 6000:7000 at time 2024-06-14 13:49:22.771702
Processing batch index 7000:8000 at time 2024
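The timestamps in the log can be used to estimate how long the whole cell-gene step will take. The sketch below uses two batch start times copied from the output above; the result is only a rough upper bound for this particular run, since the final batch holds fewer than 1000 cells and timings will differ between machines and datasets.

```python
import math
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Start times of the first and seventh batch, copied from the log above.
first_batch_start = datetime.strptime("2024-06-14 13:38:11.847288", FMT)
seventh_batch_start = datetime.strptime("2024-06-14 13:49:22.771702", FMT)

# Batches 0:1000 through 5000:6000 completed between those two timestamps.
completed_batches = 6
elapsed = (seventh_batch_start - first_batch_start).total_seconds()
seconds_per_batch = elapsed / completed_batches

# 18315 cells in batches of 1000 gives 19 batches; the last holds 315 cells.
total_batches = math.ceil(18315 / 1000)
estimated_minutes = seconds_per_batch * total_batches / 60

print(f"{seconds_per_batch:.0f} s per batch, about {estimated_minutes:.0f} min in total")
```

For this run that works out to a little under two minutes per batch and roughly 35 minutes for all 18315 cells.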

#### Additionally, to understand the schema of the concepts and relationships of the adapter, you can call the list_schema() function, which will return the metadata associated with this datapack. You can update the name, key, and description of the datapack through the function set_metadata() prior to calling the list_schema() method.

In [29]:
scatac.list_schema()

{'_meta': {'version': '0.0.1', 'date_updated': '2024-06-14 11:16:08.113905'},
 'name': 'Ovary ATAC - A molecular atlas of the human postmenopausal fallopian tube and ovary from single-cell RNA and ATAC sequencing',
 'key': 'EGAS00001006780_OvaryATAC',
 'description': 'As part of the Human Cell Atlas initiative, we generated transcriptomic (scRNA-seq; 86,708 cells) and regulatory (scATAC-seq; 59,118 cells) profiles of the normal postmenopausal ovary and fallopian tube (FT) at single-cell resolution. In the FT, 22 cell clusters integrated into 11 cell types, including ciliated and secretory epithelial cells, while the ovary had 17 distinct cell clusters defining 6 major cell types.',
 'dependencies': ['Ensembl'],
 'concepts': {'Experiment': {'label': 'Experiment',
   'dbLabel': 'Experiment',
   'definition': 'Experiment of the sample tissue'},
  'SingleCellExperiment': {'label': 'SingleCellExperiment',
   'dbLabel': 'SingleCellExperiment',
   'definition': 'Single Cell Experiment of the 