# Partition Cell Types for QC Review

To ensure that our cell type labels are as accurate as possible, we'll subset our dataset based on our `AIFI_L2` labels for review.

Previously, we split our dataset into 8 subsets based on cohort, biological sex, and CMV status. Here, we'll combine these subsets per L2 cell type for less abundant cell types to simplify review. For very abundant cell types (e.g. Naive CD4 T cells), we'll keep the 8 groups separate and review each one.

This should reduce the burden of trying to cluster >2M cells at once (which can be very slow) without significantly reducing our power to identify doublets or mislabeled cells, as most per-group subsets of these large classes still contain >100k cells.
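As a rough illustration of this partitioning rule, the sketch below plans one subset per sample group for abundant types and a single combined subset for everything else. The `plan_subsets` helper and the shortened group list are hypothetical and used only for illustration; the threshold mirrors the 1M-cell cutoff applied later in this notebook.

```python
def plan_subsets(type_counts, group_names, threshold=1_000_000):
    """Return (cell_type, subset_label) pairs describing what to cluster.

    Types with more than `threshold` cells get one subset per group;
    all other types get a single combined subset labeled 'All'.
    """
    plan = []
    for cell_type, n_cells in type_counts.items():
        if n_cells > threshold:
            plan.extend((cell_type, group) for group in group_names)
        else:
            plan.append((cell_type, 'All'))
    return plan

# Two cell types and two of the eight groups, for brevity
toy_counts = {'Naive CD4 T cell': 2_986_313, 'ASDC': 4_033}
toy_groups = ['BR1_Female_Negative', 'BR1_Female_Positive']

plan = plan_subsets(toy_counts, toy_groups)
```

Here the abundant type yields one entry per group, while the rare type yields a single `'All'` entry.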

In a later step, we'll generate rules to filter these clustered subsets of cells to help us identify doublets, contaminated clusters (e.g. with erythrocyte content), and mislabeled cells.

## Load libraries

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)

import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
from datetime import date
import hisepy
import os
import pandas as pd
import scanpy as sc
import scanpy.external as sce

In [2]:
out_dir = 'output'
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

## Helper functions

These make it a bit simpler to cache and read in files from HISE.

In [3]:
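def cache_uuid_path(uuid):
    # Cached files are looked up in a directory named by UUID
    cache_path = '/home/jupyter/cache/{u}'.format(u = uuid)
    # Only download if the cache directory doesn't exist yet
    if not os.path.isdir(cache_path):
        hisepy.reader.cache_files([uuid])
    # Use the first file found in the cache directory
    filename = os.listdir(cache_path)[0]
    cache_file = '{p}/{f}'.format(p = cache_path, f = filename)
    return cache_file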

In [4]:
def read_parquet_uuid(uuid):
    cache_file = cache_uuid_path(uuid)
    res = pd.read_parquet(cache_file)
    return res

In [5]:
def sort_adata_uuid(uuid, sort_cols = ['AIFI_L2', 'sample.sampleKitGuid']):
    cache_file = cache_uuid_path(uuid)
    adata = sc.read_h5ad(cache_file)
    # Reorder cells so that rows with the same labels are adjacent
    obs = adata.obs
    obs = obs.sort_values(sort_cols)
    adata = adata[obs.index]
    # Overwrite the cached file with the sorted version
    adata.write_h5ad(cache_file)

This function will enable us to connect to our .h5ad files without loading each file entirely into memory. We'll then load only the cells that we want for each cell class and assemble them for writing. This should save us some overhead as we do our subsetting.

In [6]:
def read_adata_backed_uuid(uuid):
    cache_file = cache_uuid_path(uuid)
    res = sc.read_h5ad(cache_file, backed = 'r')
    return res

This function will apply a standard normalization, nearest neighbors, clustering, and UMAP process to our cell subsets:

In [7]:
def process_adata(adata):
    # Keep a copy of the raw data
    adata.raw = adata

    print('Normalizing')
    # Normalize and log transform
    sc.pp.normalize_total(adata)
    sc.pp.log1p(adata)

    print('Finding HVGs')
    # Restrict downstream steps to variable genes
    sc.pp.highly_variable_genes(adata)
    adata = adata[:, adata.var_names[adata.var['highly_variable']]].copy()
    print(adata.shape)

    print('Scaling')
    # Scale variable genes
    sc.pp.scale(adata)

    print('PCA')
    # Run PCA
    sc.tl.pca(adata, svd_solver = 'arpack')

    print('Neighbors')
    # Find nearest neighbors
    sc.pp.neighbors(
        adata, 
        n_neighbors = 50,
        n_pcs = 30
    )

    print('Louvain')
    # Find clusters
    sc.tl.louvain(
        adata, 
        resolution = 2, 
        key_added = 'louvain_2'
    )

    print('UMAP')
    # Run UMAP
    sc.tl.umap(adata, min_dist = 0.05)

    return adata
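
To make the first step concrete, here is a minimal pure-Python sketch of what total-count normalization followed by a log transform computes. It assumes the default behavior of `sc.pp.normalize_total` (scaling every cell to the median total count) and that every cell has nonzero counts; `normalize_log1p` is a hypothetical helper for illustration, not part of scanpy, and it ignores scanpy's other options.

```python
import math
from statistics import median

def normalize_log1p(counts):
    """Scale each cell (row) to the median total count, then apply log(1 + x)."""
    totals = [sum(row) for row in counts]
    target = median(totals)
    result = []
    for row, total in zip(counts, totals):
        # Guard against empty cells; real data would need a policy for these
        scale = target / total if total > 0 else 0.0
        result.append([math.log1p(value * scale) for value in row])
    return result

# Three toy cells with the same composition but different sequencing depth
toy_counts = [
    [1, 1, 2],   # total 4
    [2, 2, 4],   # total 8
    [3, 3, 6],   # total 12
]
normalized = normalize_log1p(toy_counts)
```

Because the three toy cells differ only in depth, they end up with identical normalized profiles, which is the point of this step before selecting variable genes.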

## Read cell metadata to identify large subsets

In [8]:
pq_uuids = {
    'BR1_Female_Negative': '2a639ff1-9544-4f55-a4be-f6589360dfb7',
    'BR1_Female_Positive': 'e637c51d-7e7f-4594-9249-dc768bcd3762',
    'BR1_Male_Negative':   '7c04ef25-9063-4281-8906-378ea6ac77f2',
    'BR1_Male_Positive':   '2a59562e-de72-4199-b219-f30f571406d8',
    'BR2_Female_Negative': '56f2fa45-71b5-4ec5-833a-a8aac470f4b3',
    'BR2_Female_Positive': '717d3165-7b36-4ac8-88a5-ef1b53c9b6cd',
    'BR2_Male_Negative':   '6603876a-9e51-496d-8666-d5441ac4375c',
    'BR2_Male_Positive':   'c1b0e096-9c48-4495-8096-e89828145700'
}

In [9]:
meta_dict = {}
for group_name, uuid in pq_uuids.items():
    meta_dict[group_name] = read_parquet_uuid(uuid)

In [10]:
l2_counts = {}
for group_name, meta in meta_dict.items():
    counts = meta['AIFI_L2'].value_counts()
    l2_counts[group_name] = counts

In [11]:
total_l2_counts = sum(list(l2_counts.values()))

In [12]:
total_l2_counts

AIFI_L2
ASDC                        4033
CD14 monocyte            2267815
CD16 monocyte             405940
CD56bright NK cell         96149
CD56dim NK cell          1123845
CD8aa                      17893
DN T cell                  17670
Effector B cell            86715
Erythrocyte                27862
ILC                         5734
Intermediate monocyte      76444
MAIT                      382216
Memory B cell             389681
Memory CD4 T cell        2810416
Memory CD8 T cell        1540017
Naive B cell              725074
Naive CD4 T cell         2986313
Naive CD8 T cell          833824
Plasma cell                16405
Platelet                   72347
Progenitor cell            12429
Proliferating NK cell      20575
Proliferating T cell       18588
Transitional B cell        79970
Treg                      309377
cDC1                        8138
cDC2                      114043
gdT                       318347
pDC                        62162
Name: count, dtype: int64
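
Summing the per-group `value_counts()` results with `sum()` relies on pandas aligning each Series by label. As a hedged aside: if a label were absent from one group (which can happen when the label column holds plain strings rather than a categorical), the aligned sum would give `NaN` for that label. The sketch below uses made-up counts and group names to show that behavior and one possible guard.

```python
import pandas as pd

# Made-up counts; 'ILC' is missing from group_b
group_counts = {
    'group_a': pd.Series({'ASDC': 3, 'ILC': 5}),
    'group_b': pd.Series({'ASDC': 4}),
}

# Label-aligned addition: labels missing from any group become NaN
naive_total = sum(group_counts.values())

# Align into one table, treat missing labels as zero, then sum across groups
safe_total = (
    pd.concat(group_counts, axis=1)
    .fillna(0)
    .sum(axis=1)
    .astype(int)
)
```

In the totals above every label has a count, so this issue does not appear to affect the result shown here.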

The L2 labels identified below have > 1M cells each. We'll keep them separated per sample group.

In [13]:
large_types = total_l2_counts.index[total_l2_counts > 1e6]
large_types.tolist()

['CD14 monocyte',
 'CD56dim NK cell',
 'Memory CD4 T cell',
 'Memory CD8 T cell',
 'Naive CD4 T cell']

## Download and sort .h5ad files

Sorting the .h5ad files by `AIFI_L2` will make reading each cell type much faster by placing cells of the same type next to each other in the sparse matrix.
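
A small sketch of why sorting helps: once labels are sorted, each cell type occupies a single contiguous block of rows, so a reader can fetch one row range instead of gathering scattered rows. The `contiguous_blocks` helper and the toy labels are hypothetical and only illustrate the layout; they are not how AnnData performs the read.

```python
from itertools import groupby

def contiguous_blocks(labels):
    """Return {label: (start, stop)} row ranges for a sorted label sequence."""
    blocks = {}
    position = 0
    for label, run in groupby(labels):
        run_length = sum(1 for _ in run)
        blocks[label] = (position, position + run_length)
        position += run_length
    return blocks

unsorted_labels = ['Treg', 'ASDC', 'Treg', 'ILC', 'ASDC', 'Treg']
sorted_labels = sorted(unsorted_labels)

# After sorting, every label maps to exactly one half-open row range
blocks = contiguous_blocks(sorted_labels)
```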

In [14]:
h5ad_uuids = {
    'BR1_Female_Negative': '7aa6743f-c9f1-4cc1-b879-b5528f6d4270',
    'BR1_Female_Positive': 'd4e6a54f-9703-4cce-a324-519a0ad0624c',
    'BR1_Male_Negative':   '2487c0f6-d9bb-4d63-98df-f185da3e2d88',
    'BR1_Male_Positive':   'c0d0ac72-13da-41a9-806c-2027141ea44a',
    'BR2_Female_Negative': '824fcb67-d2c0-4db2-818e-966058da069b',
    'BR2_Female_Positive': 'ff7e1dcc-b4f4-4e44-90d3-33eae781940a',
    'BR2_Male_Negative':   '588b4802-e724-4a61-af21-d441f4dd0ad5',
    'BR2_Male_Positive':   '8eb02ae8-46fb-4d1d-83e3-088c55ddc3d5'
}

In [None]:
for uuid in h5ad_uuids.values():
    sort_adata_uuid(uuid, sort_cols = ['AIFI_L2', 'sample.sampleKitGuid'])

## Open connections to .h5ad files

Now that they're sorted, we can open these files with on-disk backing so we don't have to read the entire file at once.

In [15]:
h5ad_conn = {}
for group_name, uuid in h5ad_uuids.items():
    h5ad_conn[group_name] = read_adata_backed_uuid(uuid)

## Process each cell type

In [16]:
l2_types = total_l2_counts.index.tolist()

In [17]:
def read_l2_type(adata, cell_type):
    type_adata = adata[adata.obs['AIFI_L2'] == cell_type].to_memory()
    return type_adata
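
The loop that follows submits one read per sample group to a thread pool and collects results keyed by group name. The pattern can be sketched in isolation with only the standard library; `load_subset` and the toy sources below are hypothetical stand-ins for `read_l2_type` and the backed .h5ad connections.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_subset(records, wanted):
    """Stand-in for a per-group read: keep only the matching records."""
    return [r for r in records if r == wanted]

toy_sources = {
    'group_a': ['ASDC', 'Treg', 'ASDC'],
    'group_b': ['Treg', 'Treg'],
}

results = {}
with ThreadPoolExecutor(max_workers=2) as executor:
    # Map each future back to the group it was submitted for
    futures = {
        executor.submit(load_subset, records, 'ASDC'): name
        for name, records in toy_sources.items()
    }
    # Results arrive in completion order, not submission order
    for future in as_completed(futures):
        results[futures[future]] = future.result()
```

Because results are stored in completion order, groups may later be processed in a different order than they appear in the input dictionary.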

In [18]:
out_files = []

In [None]:
for cell_type in l2_types:
    print(cell_type)
    
    # Read data from each group for this type in parallel
    print('Loading data')
    type_adata_dict = {}

    with ThreadPoolExecutor(max_workers = 8) as executor:
        futures = {
            executor.submit(
                read_l2_type, 
                h5ad_conn[group_name], 
                cell_type): group_name 
            for group_name in h5ad_conn.keys()
        }
        for future in concurrent.futures.as_completed(futures):
            future_group = futures[future]
            type_adata_dict[future_group] = future.result()
    
    if cell_type in large_types:
        # If large, process and save separately
        for group_name, type_adata in type_adata_dict.items():
            print('Processing {g}'.format(g = group_name))
            print(type_adata.shape)
            type_adata = process_adata(type_adata)
            print('Saving processed data')
            out_file = 'output/diha_qc_{g}_{c}_{d}.h5ad'.format(
                g = group_name,
                c = cell_type,
                d = date.today()
            )
            type_adata.write_h5ad(out_file)
            out_files.append(out_file)
            
    else:
        # If small, combine and process
        print('Combining and processing')
        type_adata = sc.concat(type_adata_dict)
        print(type_adata.shape)
        type_adata = process_adata(type_adata)
        
        print('Saving processed data')
        out_file = 'output/diha_qc_All_{c}_{d}.h5ad'.format(
            c = cell_type,
            d = date.today()
        )
        type_adata.write_h5ad(out_file)
        out_files.append(out_file)

ASDC
Loading data
Combining and processing
(4033, 33538)
Normalizing
Finding HVGs
(4033, 1811)
Scaling
PCA
Neighbors
Louvain
UMAP
Saving processed data
CD14 monocyte
Loading data
Processing BR1_Female_Positive
(193982, 33538)
Normalizing
Finding HVGs
(193982, 870)
Scaling
PCA
Neighbors
Louvain
UMAP
Saving processed data
Processing BR2_Male_Positive
(214534, 33538)
Normalizing
Finding HVGs
(214534, 956)
Scaling
PCA
Neighbors
Louvain
UMAP
Saving processed data
Processing BR2_Female_Negative
(219030, 33538)
Normalizing
Finding HVGs
(219030, 878)
Scaling
PCA
Neighbors
Louvain
UMAP
Saving processed data
Processing BR1_Male_Positive
(180735, 33538)
Normalizing
Finding HVGs
(180735, 906)
Scaling
PCA
Neighbors
Louvain
UMAP
Saving processed data
Processing BR1_Male_Negative
(305415, 33538)
Normalizing
Finding HVGs
(305415, 806)
Scaling
PCA
Neighbors
Louvain
UMAP
Saving processed data
Processing BR2_Male_Negative
(349370, 33538)
Normalizing
Finding HVGs
(349370, 860)
Scaling
PCA
Neighbors
Louvain

IOStream.flush timed out
IOStream.flush timed out

Saving processed data
Processing BR1_Male_Negative
(334981, 33538)
Normalizing
Finding HVGs
(334981, 1132)
Scaling
PCA
Neighbors
Louvain
UMAP
Saving processed data
Processing BR2_Male_Negative
(369311, 33538)
Normalizing
Finding HVGs
(369311, 1353)
Scaling
PCA
Neighbors
Louvain
UMAP
Saving processed data
Processing BR1_Female_Negative
(446391, 33538)
Normalizing
Finding HVGs
(446391, 1192)
Scaling
PCA
Neighbors
Louvain
UMAP
Saving processed data
Memory CD8 T cell
Loading data
Processing BR2_Female_Negative
(81648, 33538)
Normalizing
Finding HVGs
(81648, 1243)
Scaling
PCA
Neighbors
Louvain
UMAP
Saving processed data
Processing BR1_Male_Negative
(136184, 33538)
Normalizing
Finding HVGs
(136184, 1057)
Scaling
PCA
Neighbors
Louvain
UMAP
Saving processed data
Processing BR1_Female_Negative
(163963, 33538)
Normalizing
Finding HVGs
(163963, 1017)
Scaling
PCA
Neighbors
Louvain
UMAP
Saving processed data
Processing BR1_Female_Positive
(165278, 33538)
Normalizing
Finding HVGs
(165278, 1120)

## Upload Cell Type data to HISE

Finally, we'll use `hisepy.upload.upload_files()` to send a copy of our output to HISE to use for downstream analysis steps.
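
Before uploading, it may be worth confirming that every expected output file exists and is non-empty, since a long-running loop can be interrupted partway through. The check below is an optional sketch using only the standard library; `find_missing_files` is a hypothetical helper, not part of `hisepy`, and the temporary files stand in for the real outputs.

```python
import os
import tempfile

def find_missing_files(paths):
    """Return the paths that do not exist or are empty."""
    return [
        p for p in paths
        if not os.path.isfile(p) or os.path.getsize(p) == 0
    ]

# Demonstrate with one real placeholder file and one absent file
with tempfile.TemporaryDirectory() as tmp_dir:
    present = os.path.join(tmp_dir, 'present.h5ad')
    with open(present, 'wb') as handle:
        handle.write(b'placeholder')
    absent = os.path.join(tmp_dir, 'absent.h5ad')

    missing = find_missing_files([present, absent])
    missing_names = [os.path.basename(p) for p in missing]
```

In this notebook, the same helper could be called on `out_files`, with the upload proceeding only if the returned list is empty.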

In [34]:
study_space_uuid = 'de025812-5e73-4b3c-9c3b-6d0eac412f2a'
title = 'DIHA PBMC Pre-cleanup AIFI_L2 .h5ad {d}'.format(d = date.today())

In [35]:
in_files = list(h5ad_uuids.values()) + list(pq_uuids.values())
in_files

['7aa6743f-c9f1-4cc1-b879-b5528f6d4270',
 'd4e6a54f-9703-4cce-a324-519a0ad0624c',
 '2487c0f6-d9bb-4d63-98df-f185da3e2d88',
 'c0d0ac72-13da-41a9-806c-2027141ea44a',
 '824fcb67-d2c0-4db2-818e-966058da069b',
 'ff7e1dcc-b4f4-4e44-90d3-33eae781940a',
 '588b4802-e724-4a61-af21-d441f4dd0ad5',
 '8eb02ae8-46fb-4d1d-83e3-088c55ddc3d5',
 '2a639ff1-9544-4f55-a4be-f6589360dfb7',
 'e637c51d-7e7f-4594-9249-dc768bcd3762',
 '7c04ef25-9063-4281-8906-378ea6ac77f2',
 '2a59562e-de72-4199-b219-f30f571406d8',
 '56f2fa45-71b5-4ec5-833a-a8aac470f4b3',
 '717d3165-7b36-4ac8-88a5-ef1b53c9b6cd',
 '6603876a-9e51-496d-8666-d5441ac4375c',
 'c1b0e096-9c48-4495-8096-e89828145700']

In [36]:
out_files

['output/diha_qc_BR1_Male_Negative_Memory CD4 T cell_2024-03-14.h5ad',
 'output/diha_qc_BR2_Female_Positive_Memory CD4 T cell_2024-03-14.h5ad',
 'output/diha_qc_BR1_Female_Positive_CD56dim NK cell_2024-03-14.h5ad',
 'output/diha_qc_BR1_Male_Positive_Memory CD8 T cell_2024-03-14.h5ad',
 'output/diha_qc_BR2_Male_Positive_Naive CD4 T cell_2024-03-14.h5ad',
 'output/diha_qc_BR1_Female_Negative_CD56dim NK cell_2024-03-14.h5ad',
 'output/diha_qc_BR2_Male_Negative_Naive CD4 T cell_2024-03-14.h5ad',
 'output/diha_qc_All_Naive B cell_2024-03-14.h5ad',
 'output/diha_qc_BR2_Female_Negative_Memory CD4 T cell_2024-03-14.h5ad',
 'output/diha_qc_BR2_Male_Negative_CD14 monocyte_2024-03-13.h5ad',
 'output/diha_qc_BR1_Male_Positive_CD14 monocyte_2024-03-13.h5ad',
 'output/diha_qc_BR1_Female_Negative_Memory CD4 T cell_2024-03-14.h5ad',
 'output/diha_qc_All_ASDC_2024-03-13.h5ad',
 'output/diha_qc_BR1_Female_Positive_Naive CD4 T cell_2024-03-14.h5ad',
 'output/diha_qc_All_Progenitor cell_2024-03-14.h5ad',


In [None]:
hisepy.upload.upload_files(
    files = out_files,
    study_space_id = study_space_uuid,
    title = title,
    input_file_ids = in_files
)

output/diha_qc_BR1_Male_Negative_Memory CD4 T cell_2024-03-14.h5ad
output/diha_qc_BR2_Female_Positive_Memory CD4 T cell_2024-03-14.h5ad
output/diha_qc_BR1_Female_Positive_CD56dim NK cell_2024-03-14.h5ad
output/diha_qc_BR1_Male_Positive_Memory CD8 T cell_2024-03-14.h5ad
output/diha_qc_BR2_Male_Positive_Naive CD4 T cell_2024-03-14.h5ad
output/diha_qc_BR1_Female_Negative_CD56dim NK cell_2024-03-14.h5ad
output/diha_qc_BR2_Male_Negative_Naive CD4 T cell_2024-03-14.h5ad
output/diha_qc_All_Naive B cell_2024-03-14.h5ad
output/diha_qc_BR2_Female_Negative_Memory CD4 T cell_2024-03-14.h5ad
output/diha_qc_BR2_Male_Negative_CD14 monocyte_2024-03-13.h5ad
output/diha_qc_BR1_Male_Positive_CD14 monocyte_2024-03-13.h5ad
output/diha_qc_BR1_Female_Negative_Memory CD4 T cell_2024-03-14.h5ad
output/diha_qc_All_ASDC_2024-03-13.h5ad
output/diha_qc_BR1_Female_Positive_Naive CD4 T cell_2024-03-14.h5ad
output/diha_qc_All_Progenitor cell_2024-03-14.h5ad
output/diha_qc_BR1_Male_Positive_Naive CD4 T cell_2024-03-14

(y/n) y


In [None]:
import session_info
session_info.show()