# Label cell types using CellTypist Models

To build our reference, we would like to start with labels that originate from published cell type references. 

One such approach is CellTypist, a model-based method for cell type labeling.

CellTypist is described [on their website](https://www.celltypist.org/), and in this publication:  
Domínguez Conde, C. et al. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science 376, eabl5197 (2022)

Here, we'll load in our cells in batches, and assign cell types based on 3 available CellTypist models (descriptions are from celltypist.org):  

- Immune_All_High:
    - 32 types
    - immune populations combined from 20 tissues of 18 studies  
- Immune_All_Low:  
    - 98 types
    - immune sub-populations combined from 20 tissues of 18 studies  
- Healthy_COVID19_PBMC:
    - 51 types
    - peripheral blood mononuclear cell types from healthy and COVID-19 individuals

## Load Packages

anndata: Data structures for scRNA-seq  
celltypist: Model-based cell type annotation  
concurrent.futures: parallelization methods  
h5py: HDF5 file I/O  
hisepy: The HISE SDK for Python  
numpy: Mathematical data structures and computation  
os: operating system calls  
pandas: DataFrame data structures  
re: Regular expressions  
scanpy: scRNA-seq analysis  
scipy.sparse: Sparse matrix data structures  
shutil: Shell utilities

In [1]:
import anndata
import celltypist
from celltypist import models
import concurrent.futures
import h5py
import hisepy
import numpy as np
import os
import pandas as pd 
import re
import scanpy as sc
import scipy.sparse as scs
import shutil

## Obtain CellTypist Models

In [2]:
models.download_models(
    force_update = True,
    model = ['Immune_All_High.pkl',
             'Immune_All_Low.pkl',
             'Healthy_COVID19_PBMC.pkl']
)

📜 Retrieving model list from server https://celltypist.cog.sanger.ac.uk/models/models.json
📚 Total models in list: 44
📂 Storing models in /root/.celltypist/data/models
💾 Total models to download: 3
💾 Downloading model [1/3]: Immune_All_Low.pkl
💾 Downloading model [2/3]: Immune_All_High.pkl
💾 Downloading model [3/3]: Healthy_COVID19_PBMC.pkl


## Read sample metadata from HISE

In [3]:
sample_meta_file_uuid = '223b4aa9-19fc-41e1-8bea-43682e5ac278'
sample_meta_file_name = 'ref_h5_meta_data_2024-02-08.csv'
file_query = hisepy.reader.download_files(
    {sample_meta_file_uuid: sample_meta_file_name}
)

In [4]:
sample_meta_file = 'cache/downloadable/' + sample_meta_file_name
meta_data = pd.read_csv(sample_meta_file)

## Helper functions

These functions retrieve data for a batch of samples, assemble a joint AnnData object, normalize and log-transform the counts, and then generate predictions with each of the 3 models retrieved above.

In [5]:
# define a function to read count data
def read_mat(h5_con):
    mat = scs.csc_matrix(
        (h5_con['matrix']['data'][:], # Count values
         h5_con['matrix']['indices'][:], # Row indices
         h5_con['matrix']['indptr'][:]), # Pointers for column positions
        shape = tuple(h5_con['matrix']['shape'][:]) # Matrix dimensions
    )
    return mat

# define a function to read observation metadata (i.e. cell metadata)
def read_obs(h5con):
    bc = h5con['matrix']['barcodes'][:]
    bc = [x.decode('UTF-8') for x in bc]

    # Initialize the DataFrame with cell barcodes
    obs_df = pd.DataFrame({ 'barcodes' : bc })

    # Get the list of available metadata columns
    obs_columns = h5con['matrix']['observations'].keys()

    # For each column
    for col in obs_columns:
        # Read the values
        values = h5con['matrix']['observations'][col][:]
        # Check for byte storage
        if len(values) > 0 and isinstance(values[0], (bytes, bytearray)):
            # Decode byte strings
            values = [x.decode('UTF-8') for x in values]
        # Add column to the DataFrame
        obs_df[col] = values

    obs_df = obs_df.set_index('barcodes', drop = False)
    
    return obs_df

# define a function to construct anndata object from a h5 file
def read_h5_anndata(h5_con):
    # h5_con is an open HDF5 file handle, e.g. from h5py.File(h5_file, mode = 'r')
    # extract the expression matrix
    mat = read_mat(h5_con)
    # extract gene names
    genes = h5_con['matrix']['features']['name'][:]
    genes = [x.decode('UTF-8') for x in genes]
    # extract metadata
    obs_df = read_obs(h5_con)
    # construct anndata
    adata = anndata.AnnData(mat.T,
                             obs = obs_df)
    # assign gene names to the variables (columns)
    adata.var_names = genes
    # de-duplicate repeated gene names
    adata.var_names_make_unique()
    return adata
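
The three arrays read in `read_mat` follow the compressed sparse column (CSC) layout. As a rough illustration only, the pure-Python sketch below expands such a triple into a dense genes-by-cells table and then transposes it, mirroring the `mat.T` step in `read_h5_anndata`. The function `csc_to_dense` and the toy values are hypothetical and not part of the notebook.

```python
def csc_to_dense(data, indices, indptr, shape):
    """Expand a CSC triple into a dense list-of-lists (rows x columns)."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        # Entries for this column live in data[indptr[col]:indptr[col + 1]]
        for k in range(indptr[col], indptr[col + 1]):
            dense[indices[k]][col] = data[k]
    return dense

# Toy example (assumed values): 3 genes (rows) x 2 cells (columns)
data = [5, 1, 2]      # non-zero counts
indices = [0, 2, 1]   # row (gene) index of each count
indptr = [0, 2, 3]    # column (cell) boundaries within data/indices

genes_by_cells = csc_to_dense(data, indices, indptr, (3, 2))

# Transpose to cells x genes, the orientation AnnData expects
cells_by_genes = [list(row) for row in zip(*genes_by_cells)]
```

In the notebook itself, `scipy.sparse.csc_matrix` performs this interpretation without ever building the dense table.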

In [6]:
def get_adata(uuid):
    # Load the file using HISE
    res = hisepy.reader.read_files([uuid])
    
    # Read the file to adata
    h5_con = res['values'][0]
    adata = read_h5_anndata(h5_con)
    
    # Clean up the file now that we're done with it
    h5_file = h5_con.filename
    h5_con.close()
    os.remove(h5_file)

    return adata

In [7]:
def run_prediction(adata, model, model_name, out_dir = "output"):
    # Perform prediction
    predictions = celltypist.annotate(
        adata, 
        model = model, 
        majority_voting = True)

    # Make output directory (exist_ok avoids a race between parallel workers)
    model_dir = "{d}/{m}".format(d = out_dir, m = model_name)
    os.makedirs(model_dir, exist_ok = True)

    samples = adata.obs['pbmc_sample_id'].unique()
    
    # Write output per sample
    for sample_id in samples:
        barcodes = adata.obs[adata.obs['pbmc_sample_id'] == sample_id].index.tolist()
        sample_results = predictions.predicted_labels.loc[barcodes,:]
        out_file = "{d}/{s}_{m}.csv".format(d = model_dir, s = sample_id, m = model_name)
        sample_results.to_csv(out_file)

def process_data(meta_data_sub):
    out_dir = "output"
    
    # Load cells from HISE .h5 files
    results = []
    for file_uuid in meta_data_sub:
        result = get_adata(file_uuid)
        results.append(result)
    adata = anndata.concat(results)
    del results
    
    # Normalize data
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    adata.obs.index = adata.obs['barcodes']
    
    # Predict cell types
    run_prediction(adata, "Immune_All_Low.pkl", "Low", out_dir)
    run_prediction(adata, "Immune_All_High.pkl", "High", out_dir)
    run_prediction(adata, "Healthy_COVID19_PBMC.pkl", "Covid_Healthy", out_dir)
    
    del adata
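
To make the normalization step in `process_data` concrete, here is a minimal pure-Python sketch of what scaling to a target sum of 10,000 followed by `log1p` does to the counts of a single cell. The helper `normalize_log1p` and the toy counts are hypothetical; the notebook relies on `sc.pp.normalize_total` and `sc.pp.log1p`, which apply the same arithmetic to every cell of the sparse matrix.

```python
import math

def normalize_log1p(counts, target_sum=1e4):
    """Scale one cell's counts to target_sum, then apply ln(1 + x)."""
    total = sum(counts)
    if total == 0:
        # Assumption for this sketch: a cell with no counts stays all zeros
        return [0.0 for _ in counts]
    return [math.log1p(c * target_sum / total) for c in counts]

# Toy cell with 10 total counts across 4 genes (assumed values)
cell = [2, 0, 6, 2]
normalized = normalize_log1p(cell)

# Undoing the log shows the scaled counts sum to the target
recovered_total = sum(math.expm1(v) for v in normalized)
```

Zero counts stay at zero after the transform, which is why the matrix can remain sparse.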

## Apply across batches

Here, we'll generate the batches, then use `concurrent.futures` to apply the function above to our batches in parallel.

In [8]:
out_dir = 'output'
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

In [11]:
meta_data_subsets = []
for i in range(0, len(meta_data), 10):
    subset_uuids = meta_data["file.id"][i:i + 10]
    meta_data_subsets.append(subset_uuids)
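
The loop above slices the `file.id` column into groups of 10. A minimal sketch of the same slicing on a plain list, using the hypothetical names `make_batches` and `toy_ids`, shows that the final batch simply holds whatever remains:

```python
def make_batches(items, batch_size=10):
    """Split a sequence into consecutive batches of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# 25 placeholder IDs (assumed values, not real HISE file UUIDs)
toy_ids = [f"file-{n}" for n in range(25)]
toy_batches = make_batches(toy_ids)
batch_sizes = [len(batch) for batch in toy_batches]
```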

In [12]:
# Process each subset in parallel
pool_executor = concurrent.futures.ProcessPoolExecutor(max_workers = 2)
with pool_executor as executor:
    
    futures = []
    for meta_data_sub in meta_data_subsets:
        futures.append(executor.submit(process_data, meta_data_sub))
    
    for future in concurrent.futures.as_completed(futures):
        try:
            future.result()
        except Exception as e:
            print(f'Error: {e}')

🔬 Input data has 207788 cells and 33538 genes
🔬 Input data has 187700 cells and 33538 genes
🔗 Matching reference genes in the model
🔗 Matching reference genes in the model
🧬 5967 features used for prediction
⚖️ Scaling input data
🧬 5967 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
🖋️ Predicting labels
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
✅ Prediction done!
👀 Can not detect a neighborhood graph, will construct one before the over-clustering
⛓️ Over-clustering input data with resolution set to 25
⛓️ Over-clustering input data with resolution set to 30
🗳️ Majority voting the predictions
✅ Majority voting done!
🔬 Input data has 187700 cells and 33538 genes
🔗 Matching reference genes in the model
🧬 5967 features used for prediction
⚖️ Scaling input data
🖋️ Predicting labels
✅ Prediction done!
👀 Detected a neighborho

Error: A process in the process pool was terminated abruptly while the future was running or pending.
Error: A process in the process pool was terminated abruptly while the future was running or pending.
Error: A process in the process pool was terminated abruptly while the future was running or pending.
Error: A process in the process pool was terminated abruptly while the future was running or pending.
Error: A process in the process pool was terminated abruptly while the future was running or pending.
Error: A process in the process pool was terminated abruptly while the future was running or pending.
Error: A process in the process pool was terminated abruptly while the future was running or pending.
Error: A process in the process pool was terminated abruptly while the future was running or pending.
Error: A process in the process pool was terminated abruptly while the future was running or pending.
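
The messages above come from the `except` branch, which does not record which batch each failed future belonged to. One possible way to keep track is to map each future back to its batch index, sketched here on the assumption that batches can be identified by their position in the list. The helper `run_batches`, the task `demo_task`, and the thread pool used to keep the demo lightweight are all illustrative; the notebook itself uses a `ProcessPoolExecutor`.

```python
import concurrent.futures

def run_batches(executor, func, batches):
    """Submit each batch and report which batch indices succeeded or failed."""
    future_to_index = {
        executor.submit(func, batch): i for i, batch in enumerate(batches)
    }
    completed, failed = [], []
    for future in concurrent.futures.as_completed(future_to_index):
        i = future_to_index[future]
        try:
            future.result()
            completed.append(i)
        except Exception as e:
            # Keep the batch index with the error so it can be retried later
            failed.append((i, repr(e)))
    return sorted(completed), sorted(failed)

def demo_task(batch):
    # Stand-in for process_data: fails on a batch containing "bad"
    if "bad" in batch:
        raise ValueError("simulated failure")
    return len(batch)

demo_batches = [["a", "b"], ["bad"], ["c"]]
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    completed, failed = run_batches(pool, demo_task, demo_batches)
```

With a record of failed indices, only those batches would need to be resubmitted, for example with fewer workers or smaller batches if memory pressure turns out to be the cause, which the log itself does not confirm.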


## Assemble results

For each model, we'll assemble the per-sample results into a single .csv file that we can use later for subclustering and analysis of major cell classes.

In [None]:
# Named model_names to avoid shadowing the celltypist `models` module imported above
model_names = ['High', 'Low', 'Covid_Healthy']
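
The assembly step is not shown in this notebook. As one possible approach, the sketch below concatenates the per-sample files written by `run_prediction` (`output/<model>/<sample>_<model>.csv`) into one file per model. It uses only the standard library so that it runs without extra dependencies; reading the files with pandas and calling `pd.concat` would be a natural alternative. The function `assemble_model_results`, the combined file name, and the toy files and column names in the demo are assumptions for illustration.

```python
import csv
import glob
import os
import tempfile

def assemble_model_results(out_dir, model_name):
    """Concatenate every per-sample CSV for one model into a single CSV."""
    model_dir = os.path.join(out_dir, model_name)
    pattern = os.path.join(model_dir, "*_" + model_name + ".csv")
    sample_files = sorted(glob.glob(pattern))
    # Hypothetical output name, written outside the per-sample directory
    combined_file = os.path.join(out_dir, model_name + "_all_samples.csv")
    header = None
    n_rows = 0
    with open(combined_file, "w", newline="") as out_handle:
        writer = csv.writer(out_handle)
        for sample_file in sample_files:
            with open(sample_file, newline="") as in_handle:
                reader = csv.reader(in_handle)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"Unexpected columns in {sample_file}")
                for row in reader:
                    writer.writerow(row)
                    n_rows += 1
    return combined_file, n_rows

# Demo on two toy per-sample files in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    os.makedirs(os.path.join(tmp_dir, "High"))
    for sample_id, barcode in [("S1", "bc1"), ("S2", "bc2")]:
        path = os.path.join(tmp_dir, "High", f"{sample_id}_High.csv")
        with open(path, "w", newline="") as handle:
            csv.writer(handle).writerows(
                [["barcodes", "predicted_labels"], [barcode, "T cells"]]
            )
    combined_file, n_rows = assemble_model_results(tmp_dir, "High")
    with open(combined_file, newline="") as handle:
        combined_rows = list(csv.reader(handle))
```

Checking that every file shares the same header guards against silently mixing outputs from different models.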

In [10]:
import session_info
session_info.show()