# Label cell types using CellTypist Models

To build our reference, we would like to start with labels that originate from published cell type references. 

One approach is CellTypist, a model-based tool for cell type labeling.  

CellTypist is described [on their website](https://www.celltypist.org/), and in this publication:  

Domínguez Conde, C. et al. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science 376, eabl5197 (2022)

Here, we'll load the cells for each sample individually and assign labels using 2 of the 3 levels of our annotated PBMC reference:  

- AIFI_L1:
    - 9 types
- AIFI_L2:  
    - 29 types

## Load Packages

`anndata`: Data structures for scRNA-seq  
`celltypist`: Model-based cell type annotation  
`concurrent.futures`: parallelization methods  
`datetime`: date and time functions  
`h5py`: HDF5 file I/O  
`hisepy`: The HISE SDK for Python  
`numpy`: Mathematical data structures and computation  
`os`: operating system calls  
`pandas`: DataFrame data structures  
`re`: Regular expressions  
`scanpy`: scRNA-seq analysis  
`scipy.sparse`: Sparse matrix data structures  
`shutil`: Shell utilities

In [1]:
import anndata
import celltypist
from celltypist import models
import concurrent.futures
from datetime import date
import h5py
import hisepy
import numpy as np
import os
import pandas as pd 
import re
import scanpy as sc
import scipy.sparse as scs
import shutil

In [1]:
cohort = 'BR2'
subject_sex = 'Male'

Download a single model now so that CellTypist does not try to fetch its full model set in every parallel worker.

In [2]:
models.download_models(
    force_update = True,
    model = ['Immune_All_High.pkl']
)

📜 Retrieving model list from server https://celltypist.cog.sanger.ac.uk/models/models.json
📚 Total models in list: 44
📂 Storing models in /root/.celltypist/data/models
💾 Total models to download: 1
💾 Downloading model [1/1]: Immune_All_High.pkl


## Helper functions

This function allows easy reading of .csv files stored in HISE

In [3]:
def read_csv_uuid(csv_uuid):
    csv_path = '/home/jupyter/cache/{u}'.format(u = csv_uuid)
    if not os.path.isdir(csv_path):
        hise_res = hisepy.reader.cache_files([csv_uuid])
    csv_filename = os.listdir(csv_path)[0]
    csv_file = '{p}/{f}'.format(p = csv_path, f = csv_filename)
    df = pd.read_csv(csv_file, index_col = 0)
    return df

This function allows easy identification of the cached file path for files retrieved from HISE

In [4]:
def read_path_uuid(file_uuid):
    file_path = '/home/jupyter/cache/{u}'.format(u = file_uuid)
    if not os.path.isdir(file_path):
        hise_res = hisepy.reader.cache_files([file_uuid])
    filename = os.listdir(file_path)[0]
    full_path = '{p}/{f}'.format(p = file_path, f = filename)
    return full_path
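Both helpers above follow the same cache-then-resolve pattern: look for a per-UUID directory, fetch on a miss, then return the single file inside. A minimal sketch of that pattern follows. The names `cached_file_path` and `fake_fetch` are hypothetical, and `fake_fetch` only stands in for `hisepy.reader.cache_files()` so the example can run without HISE access. The check for exactly one file is an added safeguard, not something the helpers above do.

```python
import os
import tempfile

def cached_file_path(file_uuid, cache_root, fetch):
    """Return the path of the single file cached under cache_root/file_uuid.

    `fetch` is a hypothetical callable standing in for
    hisepy.reader.cache_files(); it is only called on a cache miss.
    """
    uuid_dir = os.path.join(cache_root, file_uuid)
    if not os.path.isdir(uuid_dir):
        fetch(file_uuid)
    files = sorted(os.listdir(uuid_dir))
    # Added safeguard: fail loudly instead of silently picking one file
    if len(files) != 1:
        raise ValueError(f"Expected 1 file in {uuid_dir}, found {len(files)}")
    return os.path.join(uuid_dir, files[0])

# Toy usage: a temporary directory stands in for the cache
cache_root = tempfile.mkdtemp()
fetch_calls = []

def fake_fetch(file_uuid):
    # Record the call and create a placeholder file, as a download would
    fetch_calls.append(file_uuid)
    os.makedirs(os.path.join(cache_root, file_uuid))
    with open(os.path.join(cache_root, file_uuid, "model.pkl"), "w") as f:
        f.write("placeholder")

first = cached_file_path("abc-123", cache_root, fake_fetch)   # cache miss
second = cached_file_path("abc-123", cache_root, fake_fetch)  # cache hit
```

The second call returns the same path without calling the fetch function again, which is the behavior the helpers rely on when a notebook is re-run.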

These functions will retrieve data for a sample, assemble an AnnData object, perform normalization and log transformation, then generate predictions for each of the models retrieved.

In [5]:
# define a function to read count data
def read_mat(h5_con):
    mat = scs.csc_matrix(
        (h5_con['matrix']['data'][:], # Count values
         h5_con['matrix']['indices'][:], # Row indices
         h5_con['matrix']['indptr'][:]), # Pointers for column positions
        shape = tuple(h5_con['matrix']['shape'][:]) # Matrix dimensions
    )
    return mat

# define a function to read observation metadata (i.e. cell metadata)
def read_obs(h5con):
    bc = h5con['matrix']['barcodes'][:]
    bc = [x.decode('UTF-8') for x in bc]

    # Initialize the DataFrame with cell barcodes
    obs_df = pd.DataFrame({ 'barcodes' : bc })

    # Get the list of available metadata columns
    obs_columns = h5con['matrix']['observations'].keys()

    # For each column
    for col in obs_columns:
        # Read the values
        values = h5con['matrix']['observations'][col][:]
        # Check for byte storage
        if(isinstance(values[0], (bytes, bytearray))):
            # Decode byte strings
            values = [x.decode('UTF-8') for x in values]
        # Add column to the DataFrame
        obs_df[col] = values

    obs_df = obs_df.set_index('barcodes', drop = False)
    
    return obs_df

# define a function to construct an AnnData object from an open h5 file connection
def read_h5_anndata(h5_con):
    #h5_con = h5py.File(h5_file, mode = 'r')
    # extract the expression matrix
    mat = read_mat(h5_con)
    # extract gene names
    genes = h5_con['matrix']['features']['name'][:]
    genes = [x.decode('UTF-8') for x in genes]
    # extract metadata
    obs_df = read_obs(h5_con)
    # construct anndata
    adata = anndata.AnnData(mat.T,
                             obs = obs_df)
    # make sure the gene names are aligned
    adata.var_names = genes

    adata.var_names_make_unique()
    return adata
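`read_mat()` rebuilds the count matrix from the three arrays of a compressed sparse column (CSC) layout, and `read_h5_anndata()` then transposes it so that cells become rows. The toy example below illustrates how the three arrays map onto a matrix. The values are made up for illustration and do not come from a HISE file.

```python
import numpy as np
import scipy.sparse as scs

# Toy 3-gene x 2-cell count matrix stored column by column (CSC layout)
data = np.array([5, 1, 2])     # non-zero count values
indices = np.array([0, 2, 1])  # row (gene) index of each value
indptr = np.array([0, 2, 3])   # column (cell) boundaries into data/indices

# Column 0 uses data[0:2], column 1 uses data[2:3]
mat = scs.csc_matrix((data, indices, indptr), shape=(3, 2))

# AnnData expects observations (cells) as rows, so transpose
cells_by_genes = mat.T
dense = cells_by_genes.toarray()
```

Here `dense` has one row per cell and one column per gene, which matches the orientation passed to `anndata.AnnData()` above.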

In [6]:
def get_adata(uuid):
    # Load the file using HISE
    res = hisepy.reader.read_files([uuid])

    # If there's an error, read_files returns a list instead of a dictionary.
    # We should raise an exception with the message when this happens.
    if(isinstance(res, list)):
        error_message = res[0]['message']
        raise Exception('{u}: {e}'.format(u = uuid, e = error_message))
    
    # Read the file to adata
    h5_con = res['values'][0]
    adata = read_h5_anndata(h5_con)
    
    # Close the file now that we're done with it
    h5_con.close()

    return(adata)
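Before prediction, the `process_data()` function in this notebook scales each cell to 10,000 total counts with `sc.pp.normalize_total()` and then applies `sc.pp.log1p()`. The NumPy sketch below shows the arithmetic only, on a small dense toy matrix. The function name `normalize_log1p` is hypothetical, and this is not how scanpy implements the steps internally.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Scale each row (cell) to target_sum total counts, then take log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Avoid division by zero for empty cells; their values stay at 0
    totals[totals == 0] = 1.0
    scaled = counts / totals * target_sum
    return np.log1p(scaled)

# Toy counts: 2 cells x 2 genes
toy = np.array([[1, 3],
                [0, 10]])
norm = normalize_log1p(toy)
```

Undoing the log with `np.expm1(norm)` recovers values that sum to 10,000 per cell, which is a quick way to check that data have been normalized as expected.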

In [7]:
def run_prediction(adata, model, model_name, out_dir = "output"):
    # Make output directories
    model_dir = "{d}/{m}".format(d = out_dir, m = model_name)
    if not os.path.isdir(model_dir):
        os.makedirs(model_dir)
    
    sample_id = adata.obs['pbmc_sample_id'].unique()[0]
    label_file = "{d}/{s}_{m}_labels.csv".format(d = model_dir, s = sample_id, m = model_name)

    if os.path.exists(label_file):
        print("{s}: {m} Previously computed; Skipping.".format(s = sample_id, m = model_name))
    else:
        # Perform prediction
        predictions = celltypist.annotate(
            adata, 
            model = model, 
            majority_voting = True)
    
        # Write output
        
        prob_file = "{d}/{s}_{m}_probability_mat.parquet".format(d = model_dir, s = sample_id, m = model_name)
        prob = predictions.probability_matrix
        prob.to_parquet(prob_file)
    
        dec_file = "{d}/{s}_{m}_decision_mat.parquet".format(d = model_dir, s = sample_id, m = model_name)
        predictions.decision_matrix.to_parquet(dec_file)
        
        labels = predictions.predicted_labels
        labels = labels.rename({'predicted_labels': model_name}, axis = 1)
        
        # Look up the probability of each cell's predicted label
        prob_scores = [
            prob.loc[cell, label]
            for cell, label in zip(labels.index, labels[model_name])
        ]
        labels['{m}_score'.format(m = model_name)] = prob_scores
        labels.to_csv(label_file)
    
def process_data(file_uuid, sample_id):
    out_dir = "output"
    # Skip samples that already have labels from the last model in model_paths
    check_file = '{d}/{m}/{s}_{m}_labels.csv'.format(d = out_dir, m = list(model_paths)[-1], s = sample_id)

    if os.path.exists(check_file):
        print('{s} Previously labeled; Skipping.'.format(s = sample_id))
    else:
        # Load cells from HISE .h5 files
        adata = get_adata(file_uuid)
        
        # Normalize data
        sc.pp.normalize_total(adata, target_sum=1e4)
        sc.pp.log1p(adata)
        adata.obs.index = adata.obs['barcodes']
        
        # Predict cell types
        for model_name,model_path in model_paths.items():
            run_prediction(
                adata,
                model_path,
                model_name,
                out_dir
            )
        
        del adata
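`run_prediction()` stores, for each cell, the probability that the model assigned to that cell's predicted label. One possible vectorized form of that lookup is sketched below on toy data. It assumes the labels are in the same row order as the probability matrix and that every label is a column of that matrix. The cell names and label names are placeholders.

```python
import numpy as np
import pandas as pd

# Toy probability matrix: rows are cells, columns are candidate labels
prob = pd.DataFrame(
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]],
    index=["cell_a", "cell_b", "cell_c"],
    columns=["T cell", "B cell"],
)
# Toy predicted labels, in the same row order as prob
labels = pd.Series(["T cell", "B cell", "T cell"], index=prob.index)

# Translate each label to its column position, then pick one value per row
row_pos = np.arange(len(prob))
col_pos = prob.columns.get_indexer(labels)
scores = pd.Series(prob.to_numpy()[row_pos, col_pos], index=prob.index)
```

This avoids one `.loc` lookup per cell, which may matter for samples with many thousands of cells.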

In [15]:
def element_id(n = 3):
    import periodictable
    from random import randrange
    rand_el = []
    for i in range(n):
        el = randrange(0,118)
        rand_el.append(periodictable.elements[el].name)
    rand_str = '-'.join(rand_el)
    return rand_str

## Obtain CellTypist Models

In [8]:
model_uuids = {
    'AIFI_L1': '54d9399c-3bfd-4619-a32d-8c7f3c0911f6',
    'AIFI_L2': 'e2898db9-e121-4263-b0c6-a19eadd4217a'
}

In [9]:
model_paths = {}
for name,uuid in model_uuids.items():
    model_paths[name] = read_path_uuid(uuid)

In [10]:
model_paths

{'AIFI_L1': '/home/jupyter/cache/54d9399c-3bfd-4619-a32d-8c7f3c0911f6/ref_pbmc_clean_celltypist_model_AIFI_L1_2024-03-09.pkl',
 'AIFI_L2': '/home/jupyter/cache/e2898db9-e121-4263-b0c6-a19eadd4217a/ref_pbmc_clean_celltypist_model_AIFI_L2_2024-03-10.pkl',
 'AIFI_L3': '/home/jupyter/cache/d18fe8b3-b8e7-4b25-9966-0c05a1ba9d7a/ref_pbmc_clean_celltypist_model_AIFI_L3_2024-03-11.pkl'}

## Read sample metadata from HISE

In [13]:
sample_meta_file_uuid = 'd82c5c42-ae5f-4e67-956e-cd3b7bf88105'
file_query = hisepy.reader.read_files(
    [sample_meta_file_uuid]
)

In [14]:
meta_data = file_query['values']

In [15]:
meta_data.shape

(868, 33)

### Filter metadata for selected cohort and sex

In [None]:
meta_data = meta_data[meta_data['cohort.cohortGuid'] == cohort]
meta_data = meta_data[meta_data['subject.biologicalSex'] == subject_sex]
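The two filters above can also be written as a single boolean mask, with a check that at least one sample remains. The sketch below uses a toy table. Only the column names and the `cohort` and `subject_sex` values come from this notebook, and the rows are invented for illustration.

```python
import pandas as pd

# Toy metadata with the same column names used in this notebook
meta = pd.DataFrame({
    "cohort.cohortGuid": ["BR1", "BR2", "BR2", "BR2"],
    "subject.biologicalSex": ["Male", "Male", "Female", "Male"],
    "pbmc_sample_id": ["S1", "S2", "S3", "S4"],
})

cohort = "BR2"
subject_sex = "Male"

# Combine both conditions into one mask
keep = (
    (meta["cohort.cohortGuid"] == cohort)
    & (meta["subject.biologicalSex"] == subject_sex)
)
selected = meta[keep]

# Fail early if the filter removes every sample
if selected.empty:
    raise ValueError(f"No samples found for {cohort} / {subject_sex}")
```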

## Apply across files

Here, we'll use `concurrent.futures` to apply `process_data()` to our files in parallel.

In [16]:
out_dir = 'output'
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

In [17]:
file_uuids = meta_data['file.id'].to_list()
sample_ids = meta_data['pbmc_sample_id'].to_list()

In [19]:
# Process each subset in parallel
pool_executor = concurrent.futures.ProcessPoolExecutor(max_workers = 60)
with pool_executor as executor:
    
    futures = []
    for i in range(len(file_uuids)):
        file_uuid = file_uuids[i]
        sample_id = sample_ids[i]
        futures.append(executor.submit(process_data, file_uuid, sample_id))

    # Check for errors when parallel processes return results
    for future in concurrent.futures.as_completed(futures):
        try:
            print(future.result())
        except Exception as e:
            print(f'Error: {e}')

PB00001-01 Previously labeled; Skipping.
PB00002-01 Previously labeled; Skipping.
PB00004-01 Previously labeled; Skipping.
PB00003-01 Previously labeled; Skipping.
PB00006-01 Previously labeled; Skipping.
PB00008-01 Previously labeled; Skipping.
PB00007-01 Previously labeled; Skipping.
PB00009-01 Previously labeled; Skipping.
PB00010-02 Previously labeled; Skipping.
PB00012-01 Previously labeled; Skipping.
PB00011-01 Previously labeled; Skipping.
PB00014-01 Previously labeled; Skipping.
PB00013-01 Previously labeled; Skipping.
PB00016-01 Previously labeled; Skipping.
PB00015-01 Previously labeled; Skipping.
PB00017-01 Previously labeled; Skipping.
PB00018-01 Previously labeled; Skipping.
PB00019-01 Previously labeled; Skipping.
PB00020-01 Previously labeled; Skipping.
PB00021-01 Previously labeled; Skipping.
PB00023-05 Previously labeled; Skipping.
PB00022-01 Previously labeled; Skipping.
PB00024-01 Previously labeled; Skipping.
PB00025-04 Previously labeled; Skipping.
PB00027-05 Previously lab
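When many samples run in parallel, it can help to know which sample an error came from. One option is to keep a mapping from each future back to its sample ID, as sketched below. The function `label_sample` is a hypothetical stand-in for `process_data()`, and the sketch swaps in `ThreadPoolExecutor` so that it runs in any environment. The same mapping works with `ProcessPoolExecutor`.

```python
import concurrent.futures

def label_sample(sample_id):
    # Hypothetical stand-in for process_data(); fails for one toy sample
    if sample_id == "S2":
        raise ValueError("missing file")
    return f"{sample_id} done"

sample_ids = ["S1", "S2", "S3"]
results = {}
errors = {}

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    # Map each future back to the sample it was submitted for
    future_to_sample = {
        executor.submit(label_sample, s): s for s in sample_ids
    }
    for future in concurrent.futures.as_completed(future_to_sample):
        sample_id = future_to_sample[future]
        try:
            results[sample_id] = future.result()
        except Exception as e:
            # Record the failure against its sample instead of only printing
            errors[sample_id] = str(e)
```

After the pool finishes, `errors` lists the samples that need to be re-run, and `results` holds the return values of the rest.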

## Assemble results

For each model, we'll assemble the results as a .csv file that we can utilize later for subclustering and analysis of major cell classes.

In [23]:
# Use a new name so the imported celltypist `models` module is not overwritten
model_names = list(model_paths.keys())

In [25]:
out_files = []
for model in model_names:
    model_path = 'output/{m}'.format(m = model)
    model_path_files = os.listdir(model_path)
    model_files = []
    for model_path_file in model_path_files:
        if 'labels' in model_path_file:
            model_files.append(model_path_file)
    
    model_list = []
    for model_file in model_files:
        df = pd.read_csv('output/{m}/{f}'.format(m = model, f = model_file))
        model_list.append(df)
    model_df = pd.concat(model_list)

    out_csv = 'output/diha_celltypist_{c}_{s}_{m}_{d}.csv'.format(
        c = cohort, s = subject_sex, m = model, d = date.today())
    out_files.append(out_csv)
    
    model_df.to_csv(out_csv)

    out_parquet = 'output/diha_celltypist_{c}_{s}_{m}_{d}.parquet'.format(
        c = cohort, s = subject_sex, m = model, d = date.today())
    out_files.append(out_parquet)
    
    model_df.to_parquet(out_parquet)
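The assembly loop selects files whose names contain `labels` and concatenates them. A compact variant using `glob` is sketched below on temporary toy files. The file names mirror the `{sample}_{model}_labels.csv` pattern written by `run_prediction()`, but the contents are invented, and matching on the `_labels.csv` suffix is an assumption that is slightly stricter than the substring test above.

```python
import glob
import os
import tempfile

import pandas as pd

# Toy per-sample label files in a temporary model directory
model_dir = tempfile.mkdtemp()
for sample_id, label in [("S1", "T cell"), ("S2", "B cell")]:
    toy = pd.DataFrame({
        "barcodes": [f"{sample_id}_bc1", f"{sample_id}_bc2"],
        "AIFI_L1": [label, label],
    })
    toy.to_csv(
        os.path.join(model_dir, f"{sample_id}_AIFI_L1_labels.csv"),
        index=False,
    )

# A non-label file that the pattern should ignore
open(os.path.join(model_dir, "S1_AIFI_L1_probability_mat.parquet"), "w").close()

# Collect only the label files, in a reproducible order
label_files = sorted(glob.glob(os.path.join(model_dir, "*_labels.csv")))
combined = pd.concat(
    (pd.read_csv(f) for f in label_files),
    ignore_index=True,
)
```

Sorting the file list keeps the row order of the combined table stable between runs, since `os.listdir()` and `glob` do not guarantee an order.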

## Upload assembled data to HISE

Finally, we'll use `hisepy.upload.upload_files()` to send a copy of our output to HISE to use for downstream analysis steps.

In [26]:
study_space_uuid = 'de025812-5e73-4b3c-9c3b-6d0eac412f2a'
title = 'DIHA PBMC CellTypist {c} {s} {d}'.format(
    c = cohort, s = subject_sex, d = date.today())

In [None]:
search_id = element_id()
search_id

In [27]:
in_files = list(model_uuids.values()) + [sample_meta_file_uuid] + meta_data['file.id'].to_list() 

In [28]:
in_files[0:10]

['54d9399c-3bfd-4619-a32d-8c7f3c0911f6',
 'e2898db9-e121-4263-b0c6-a19eadd4217a',
 'd18fe8b3-b8e7-4b25-9966-0c05a1ba9d7a',
 'd82c5c42-ae5f-4e67-956e-cd3b7bf88105',
 'fec489f9-9a74-4635-aa91-d2bf09d1faec',
 '7c0c7979-eebd-4aba-b5b2-6e76b4643623',
 '40efd03a-cb2f-4677-af42-a056cbfe5a17',
 '68fbcd34-1d63-461d-8195-df5b8dc61b31',
 'ea8d98e9-e99e-4dc6-9e78-9866e0deac68',
 '1faf2b5f-66e4-4787-8a8b-487621fc4c08']

In [29]:
out_files

['output/diha_celltypist_labels_AIFI_L1_2024-03-12.csv',
 'output/diha_celltypist_labels_AIFI_L1_2024-03-12.parquet',
 'output/diha_celltypist_labels_AIFI_L2_2024-03-12.csv',
 'output/diha_celltypist_labels_AIFI_L2_2024-03-12.parquet',
 'output/diha_celltypist_labels_AIFI_L3_2024-03-12.csv',
 'output/diha_celltypist_labels_AIFI_L3_2024-03-12.parquet']

In [30]:
hisepy.upload.upload_files(
    files = out_files,
    study_space_id = study_space_uuid,
    title = title,
    input_file_ids = in_files,
    destination = search_id
)

output/diha_celltypist_labels_AIFI_L1_2024-03-12.csv
output/diha_celltypist_labels_AIFI_L1_2024-03-12.parquet
output/diha_celltypist_labels_AIFI_L2_2024-03-12.csv
output/diha_celltypist_labels_AIFI_L2_2024-03-12.parquet
output/diha_celltypist_labels_AIFI_L3_2024-03-12.csv
output/diha_celltypist_labels_AIFI_L3_2024-03-12.parquet
Cannot determine the current notebook.
1) /home/jupyter/IH-A-Aging-Analysis-Notebooks/scrna-seq_analysis/02-reference_labeling/02-Python_label_predictions_celltypist.ipynb
2) /home/jupyter/IH-A-Aging-Analysis-Notebooks/scrna-seq_analysis/01-sample_selection/01-Python_retrieve_cmv_bmi.ipynb
3) /home/jupyter/IH-A-Aging-Analysis-Notebooks/scrna-seq_analysis/02-reference_labeling/03-Python_Doublet_detection.ipynb
Please select (1-3) 


 1


you are trying to upload file_ids... ['output/diha_celltypist_labels_AIFI_L1_2024-03-12.csv', 'output/diha_celltypist_labels_AIFI_L1_2024-03-12.parquet', 'output/diha_celltypist_labels_AIFI_L2_2024-03-12.csv', 'output/diha_celltypist_labels_AIFI_L2_2024-03-12.parquet', 'output/diha_celltypist_labels_AIFI_L3_2024-03-12.csv', 'output/diha_celltypist_labels_AIFI_L3_2024-03-12.parquet']. Do you truly want to proceed?


(y/n) y


{'trace_id': '8e47813b-7756-4a77-b304-3e4f32e13b5a',
 'files': ['output/diha_celltypist_labels_AIFI_L1_2024-03-12.csv',
  'output/diha_celltypist_labels_AIFI_L1_2024-03-12.parquet',
  'output/diha_celltypist_labels_AIFI_L2_2024-03-12.csv',
  'output/diha_celltypist_labels_AIFI_L2_2024-03-12.parquet',
  'output/diha_celltypist_labels_AIFI_L3_2024-03-12.csv',
  'output/diha_celltypist_labels_AIFI_L3_2024-03-12.parquet']}

In [31]:
import session_info
session_info.show()