# Partition Cell Types for QC Review

To ensure that our cell type labels are as accurate as possible, we'll subset our dataset based on our `AIFI_L2` labels for review.

Previously, we split our dataset into 8 subsets based on cohort, biological sex, and CMV status. Here, we'll combine these subsets per L2 cell type for less abundant cell types to simplify review. For very abundant cell types (e.g. Naive CD4 T cells), we'll keep them separated into the 8 groups and review each group on its own.

This should reduce the burden of trying to cluster >2M cells (which can be very slow) without significantly reducing our power to identify doublets or mislabeled cells, as we'll still have >100k cells for these large classes.

In a later step, we'll generate rules to filter these clustered subsets of cells to help us identify doublets, contaminated clusters (e.g. with erythrocyte content), and mislabeled cells.

## Load libraries

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)

import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
from datetime import date
import hisepy
import os
import pandas as pd
import re
import scanpy as sc
import scanpy.external as sce

In [2]:
out_dir = 'output'
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

## Helper functions

These make it a bit simpler to cache and read in files from HISE.

In [3]:
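def cache_uuid_path(uuid):
    # Download the file from HISE only if it isn't already in the local cache
    cache_path = '/home/jupyter/cache/{u}'.format(u = uuid)
    if not os.path.isdir(cache_path):
        hise_res = hisepy.reader.cache_files([uuid])
    # Use the first file listed in the cache directory for this UUID
    filename = os.listdir(cache_path)[0]
    cache_file = '{p}/{f}'.format(p = cache_path, f = filename)
    return cache_file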
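
As a minimal sketch of the cache-then-read pattern used by `cache_uuid_path()`, the example below relies only on the standard library. `fetch_to_cache()` is a hypothetical stand-in for `hisepy.reader.cache_files()`, and a temporary directory stands in for `/home/jupyter/cache`; neither reflects the real HISE API.

```python
import os
import tempfile

def fetch_to_cache(cache_root, uuid):
    # Hypothetical stand-in for a HISE download: writes one placeholder file
    cache_path = os.path.join(cache_root, uuid)
    os.makedirs(cache_path)
    with open(os.path.join(cache_path, 'data.parquet'), 'w') as f:
        f.write('placeholder')

def cached_file_path(cache_root, uuid, fetch_log):
    cache_path = os.path.join(cache_root, uuid)
    # Only fetch when the per-UUID directory doesn't exist yet
    if not os.path.isdir(cache_path):
        fetch_to_cache(cache_root, uuid)
        fetch_log.append(uuid)
    # sorted() makes the choice deterministic if several files are present
    filename = sorted(os.listdir(cache_path))[0]
    return os.path.join(cache_path, filename)

with tempfile.TemporaryDirectory() as cache_root:
    fetch_log = []
    first = cached_file_path(cache_root, 'example-uuid', fetch_log)
    second = cached_file_path(cache_root, 'example-uuid', fetch_log)
    n_fetches = len(fetch_log)
```

The second call finds the directory already present and skips the fetch, which is why repeated reads of the same UUID in this notebook do not trigger repeated downloads.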

In [4]:
def read_parquet_uuid(uuid):
    cache_file = cache_uuid_path(uuid)
    res = pd.read_parquet(cache_file)
    return res

In [5]:
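def sort_adata_uuid(uuid, sort_cols = ['AIFI_L2', 'sample.sampleKitGuid']):
    cache_file = cache_uuid_path(uuid)
    adata = sc.read_h5ad(cache_file)
    # Reorder cells by the sort columns so each cell type is contiguous
    obs = adata.obs
    obs = obs.sort_values(sort_cols)
    adata = adata[obs.index]
    # Overwrite the cached file in place with the sorted data
    adata.write_h5ad(cache_file)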

This function lets us connect to our .h5ad files without loading each entire file into memory. We'll then load only the cells that we want for each cell class and assemble them for writing. This should save us some overhead as we do our subsetting.

In [6]:
def read_adata_backed_uuid(uuid):
    cache_file = cache_uuid_path(uuid)
    res = sc.read_h5ad(cache_file, backed = 'r')
    return res

This function will apply a standard normalization, nearest neighbors, clustering, and UMAP process to our cell subsets:

In [7]:
def process_adata(adata):
    # Keep a copy of the raw data
    adata.raw = adata

    print('Normalizing')
    # Normalize and log transform
    sc.pp.normalize_total(adata)
    sc.pp.log1p(adata)

    print('Finding HVGs')
    # Restrict downstream steps to variable genes
    sc.pp.highly_variable_genes(adata)
    adata = adata[:, adata.var_names[adata.var['highly_variable']]].copy()
    print(adata.shape)

    print('Scaling')
    # Scale variable genes
    sc.pp.scale(adata)

    print('PCA')
    # Run PCA
    sc.tl.pca(adata, svd_solver = 'arpack')

    print('Neighbors')
    # Find nearest neighbors
    sc.pp.neighbors(
        adata, 
        n_neighbors = 50,
        n_pcs = 30
    )

    print('Leiden')
    # Find clusters
    sc.tl.leiden(
        adata, 
        resolution = 2, 
        key_added = 'leiden_2',
        n_iterations = 2
    )

    print('UMAP')
    # Run UMAP
    sc.tl.umap(adata, min_dist = 0.05)

    return adata

In [8]:
def format_cell_type(cell_type):
    cell_type = re.sub('\\+', 'pos', cell_type)
    cell_type = re.sub('-', 'neg', cell_type)
    cell_type = re.sub(' ', '_', cell_type)
    return cell_type
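
To illustrate what `format_cell_type()` produces, here is a small self-contained example. The first two labels come from the `AIFI_L2` values used in this notebook; the last two are hypothetical labels included only to exercise the `+` and `-` substitutions.

```python
import re

def format_cell_type(cell_type):
    # Same substitutions as the helper above: '+' -> 'pos', '-' -> 'neg', ' ' -> '_'
    cell_type = re.sub('\\+', 'pos', cell_type)
    cell_type = re.sub('-', 'neg', cell_type)
    cell_type = re.sub(' ', '_', cell_type)
    return cell_type

examples = [
    'CD56bright NK cell',  # label from this dataset
    'Naive CD4 T cell',    # label from this dataset
    'CD8+ T cell',         # hypothetical label
    'HLA-DR- cell',        # hypothetical label
]
formatted = [format_cell_type(x) for x in examples]
```

Note that every hyphen is replaced, so a hyphen inside a gene or marker name (as in the last example) is also converted to `neg`.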

In [9]:
def element_id(n = 3):
    import periodictable
    from random import randrange
    rand_el = []
    for i in range(n):
        # Atomic numbers 1-118; index 0 in periodictable is the neutron
        el = randrange(1, 119)
        rand_el.append(periodictable.elements[el].name)
    rand_str = '-'.join(rand_el)
    return rand_str

## Identify files for use in HISE

In [10]:
search_id = 'chromium-meitnerium-europium'

Retrieve files stored in our HISE project store.

In [11]:
ps_df = hisepy.list_files_in_project_store('cohorts')
ps_df = ps_df[['id', 'name']]

Filter for files from the previous notebook using our `search_id`.

In [12]:
search_df = ps_df[ps_df['name'].str.contains(search_id)]
search_df = search_df.sort_values('name')

## Read cell metadata to identify large subsets

In [13]:
pq_df = search_df[search_df['name'].str.contains('.parquet', regex = False)]

In [14]:
pq_uuids = {}
for fn, file_id in zip(pq_df['name'], pq_df['id']):
    # Strip everything up to '_PBMC_' and from '_qc' onward to get the group name
    group_name = re.sub('.+_PBMC_', '', fn)
    group_name = re.sub('_qc.+', '', group_name)
    pq_uuids[group_name] = file_id
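
As a quick illustration of how the two `re.sub()` calls extract a group name, the sketch below applies them to a made-up file name. The actual file names in the project store are not shown in this notebook, so `example_name` is purely hypothetical; only the `_PBMC_` prefix and `_qc` suffix patterns are taken from the code above.

```python
import re

# Hypothetical file name; real names in the project store may differ
example_name = 'diha_PBMC_BR1_Female_Negative_qc_example.parquet'

# Drop everything up to and including '_PBMC_'
group_name = re.sub('.+_PBMC_', '', example_name)
# Drop '_qc' and everything after it
group_name = re.sub('_qc.+', '', group_name)

# The group name encodes cohort, sex, and CMV status
cohort, sex, cmv_status = group_name.split('_')
```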

In [15]:
pq_uuids

{'BR1_Female_Negative': 'a0ce94a0-de3f-46a1-99bc-3acf24dae0ea',
 'BR1_Female_Positive': 'd0b5c470-2481-455f-9b8e-b6ca28ea4bbe',
 'BR1_Male_Negative': 'd9cffce7-4018-43b8-afea-9176b1b54532',
 'BR1_Male_Positive': '28a22e8e-0aca-45db-9e28-19907d297a38',
 'BR2_Female_Negative': 'ee89cb24-7439-486c-8f50-983895e6d4b7',
 'BR2_Female_Positive': 'bb47f341-d036-4054-8783-0a66b44e5c08',
 'BR2_Male_Negative': '6331f471-6bc3-49a8-9b4b-d83c7dd95320',
 'BR2_Male_Positive': 'bdaef83d-76da-4bc7-acc6-5db2006908fa'}

In [16]:
meta_dict = {}
for group_name, uuid in pq_uuids.items():
    meta_dict[group_name] = read_parquet_uuid(uuid)

In [17]:
l2_counts = {}
for group_name, meta in meta_dict.items():
    counts = meta['AIFI_L2'].value_counts()
    l2_counts[group_name] = counts

In [18]:
total_l2_counts = sum(list(l2_counts.values()))

In [19]:
total_l2_counts

AIFI_L2
ASDC                        4008
CD14 monocyte            2268111
CD16 monocyte             405852
CD56bright NK cell         96063
CD56dim NK cell          1123560
CD8aa                      17893
DN T cell                  17648
Effector B cell            86541
Erythrocyte                28362
ILC                         5734
Intermediate monocyte      76160
MAIT                      381960
Memory B cell             389804
Memory CD4 T cell        2812203
Memory CD8 T cell        1539618
Naive B cell              725113
Naive CD4 T cell         2986280
Naive CD8 T cell          833785
Plasma cell                16569
Platelet                   72385
Progenitor cell            12457
Proliferating NK cell      20542
Proliferating T cell       18632
Transitional B cell        79851
Treg                      308864
cDC1                        8142
cDC2                      113465
gdT                       318356
pDC                        62064
Name: count, dtype: int64
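
Summing the per-group Series with `sum()` relies on pandas index alignment: a label that is missing from any one group would come out as `NaN` in the total. The `int64` totals above indicate that this did not happen here. As a sketch with toy counts (not our data), one way to guard against a missing label is to align the Series first and fill the gaps with zero:

```python
import pandas as pd

# Toy per-group counts; 'ASDC' is absent from the second group
group_a = pd.Series({'ASDC': 3, 'Treg': 5})
group_b = pd.Series({'Treg': 2})

# Plain sum(): the unmatched label becomes NaN
naive_total = sum([group_a, group_b])

# Align on the union of labels, fill gaps with 0, then sum across groups
safe_total = (
    pd.concat([group_a, group_b], axis = 1)
    .fillna(0)
    .sum(axis = 1)
    .astype(int)
)
```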

The following L2 labels have more than 1M cells each, so we'll keep them separated per sample group.

In [20]:
large_types = total_l2_counts.index[total_l2_counts > 1e6]
large_types.tolist()

['CD14 monocyte',
 'CD56dim NK cell',
 'Memory CD4 T cell',
 'Memory CD8 T cell',
 'Naive CD4 T cell']
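
The partitioning rule can be sketched in plain Python. The counts below are a subset of the totals shown above, and the group names match the keys of `pq_uuids`; `plan_subsets()` and the `'combined'` label are illustrative names, not part of this notebook's pipeline.

```python
# Subset of the total counts reported above
total_counts = {
    'CD14 monocyte': 2268111,
    'CD16 monocyte': 405852,
    'CD56dim NK cell': 1123560,
    'Memory CD4 T cell': 2812203,
    'Naive CD4 T cell': 2986280,
    'Treg': 308864,
    'ILC': 5734,
}

# The 8 sample groups: cohort x sex x CMV status
groups = [
    '{c}_{s}_{v}'.format(c = c, s = s, v = v)
    for c in ('BR1', 'BR2')
    for s in ('Female', 'Male')
    for v in ('Negative', 'Positive')
]

def plan_subsets(counts, groups, threshold = 1e6):
    # Large types stay split per group; small types are pooled into one subset
    plan = {}
    for cell_type, n_cells in counts.items():
        if n_cells > threshold:
            plan[cell_type] = list(groups)
        else:
            plan[cell_type] = ['combined']
    return plan

plan = plan_subsets(total_counts, groups)
n_subsets = sum(len(v) for v in plan.values())
```

With these seven types, the four large types contribute 8 subsets each and the three small types contribute one each, for 35 subsets in total.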

## Download and sort .h5ad files

Sorting the .h5ad files by AIFI_L2 will make reading each cell type much faster by placing cells of the same type next to each other in the sparse matrix.
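
One way to see the effect of sorting: once cells are ordered by label, each cell type occupies a single contiguous run of rows, whereas in an unsorted file the same cells are scattered across many short runs. The sketch below counts runs per label for a toy list of labels (not our data) using only the standard library.

```python
from itertools import groupby
import random

# Toy labels, shuffled to mimic an unsorted file
labels = ['Treg'] * 4 + ['ILC'] * 3 + ['MAIT'] * 5
random.Random(0).shuffle(labels)

def count_runs(labels):
    # groupby() yields one group per run of consecutive equal labels
    runs = {}
    for label, _ in groupby(labels):
        runs[label] = runs.get(label, 0) + 1
    return runs

unsorted_runs = count_runs(labels)
sorted_runs = count_runs(sorted(labels))
```

After sorting, every label has exactly one run, so selecting a cell type corresponds to a single row range rather than many scattered rows.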

In [21]:
h5ad_df = search_df[search_df['name'].str.contains('.h5ad', regex = False)]

In [22]:
h5ad_uuids = {}
for fn, file_id in zip(h5ad_df['name'], h5ad_df['id']):
    # Strip everything up to '_PBMC_' and from '_qc' onward to get the group name
    group_name = re.sub('.+_PBMC_', '', fn)
    group_name = re.sub('_qc.+', '', group_name)
    h5ad_uuids[group_name] = file_id

In [23]:
h5ad_uuids

{'BR1_Female_Negative': '73894515-92dc-4865-9247-e38f7911a529',
 'BR1_Female_Positive': '361c5e80-2a86-4c26-a494-fb46401fe1e2',
 'BR1_Male_Negative': '1ff69ea5-e189-4875-9178-5af2a61186e6',
 'BR1_Male_Positive': '72ca8c5f-3e1c-4360-856d-b53657697793',
 'BR2_Female_Negative': '74755a15-f0a1-411e-b071-7fd802b60e80',
 'BR2_Female_Positive': '277489fd-12fe-48ed-b3a0-7494858fa512',
 'BR2_Male_Negative': '425b59cc-098d-4833-ae5f-57c4582421c9',
 'BR2_Male_Positive': '5ba4e36b-3928-4023-90a0-403e0c43a809'}

In [24]:
for uuid in h5ad_uuids.values():
    sort_adata_uuid(uuid, sort_cols = ['AIFI_L2', 'sample.sampleKitGuid'])

downloading fileID: 73894515-92dc-4865-9247-e38f7911a529
Files have been successfully downloaded!
downloading fileID: 361c5e80-2a86-4c26-a494-fb46401fe1e2
Files have been successfully downloaded!
downloading fileID: 74755a15-f0a1-411e-b071-7fd802b60e80
Files have been successfully downloaded!
downloading fileID: 277489fd-12fe-48ed-b3a0-7494858fa512
Files have been successfully downloaded!
downloading fileID: 425b59cc-098d-4833-ae5f-57c4582421c9
Files have been successfully downloaded!
downloading fileID: 5ba4e36b-3928-4023-90a0-403e0c43a809
Files have been successfully downloaded!


## Open connections to .h5ad files

Now that they're sorted, we can open these files with on-disk backing so we don't have to read the entire file at once.

In [25]:
h5ad_conn = {}
for group_name, uuid in h5ad_uuids.items():
    h5ad_conn[group_name] = read_adata_backed_uuid(uuid)

## Process each cell type

In [26]:
l2_types = total_l2_counts.index.tolist()
l2_types = set(l2_types) - set(large_types)
l2_types = list(l2_types)

In [27]:
def read_l2_type(adata, cell_type):
    type_adata = adata[adata.obs['AIFI_L2'] == cell_type].to_memory()
    return type_adata

In [28]:
out_files = []

In [29]:
for cell_type in l2_types:
    print(cell_type)
    
    # Read data from each group for this type in parallel
    print('Loading data')
    type_adata_dict = {}

    with ThreadPoolExecutor(max_workers = 8) as executor:
        futures = {
            executor.submit(
                read_l2_type, 
                h5ad_conn[group_name], 
                cell_type): group_name 
            for group_name in h5ad_conn.keys()
        }
        for future in concurrent.futures.as_completed(futures):
            future_group = futures[future]
            type_adata_dict[future_group] = future.result()
    
    # Combine the per-group subsets for this cell type and process them together
    print('Combining and processing')
    type_adata = sc.concat(type_adata_dict)
    print(type_adata.shape)
    type_adata = process_adata(type_adata)
    
    print('Saving processed data')
    out_type = format_cell_type(cell_type)
    out_file = 'output/diha_qc_{c}_{d}.h5ad'.format(
        c = out_type,
        d = date.today()
    )
    type_adata.write_h5ad(out_file)
    out_files.append(out_file)

CD16 monocyte
Loading data


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sub[k] = df_sub[k].cat.remove_unused_categories()


Combining and processing
(405852, 33538)
Normalizing
Finding HVGs
(405852, 1177)
Scaling
PCA
Neighbors
Leiden
UMAP
Saving processed data
CD56bright NK cell
Loading data
Combining and processing
(96063, 33538)
Normalizing
Finding HVGs
(96063, 1343)
Scaling
PCA
Neighbors
Leiden
UMAP
Saving processed data
Naive CD8 T cell
Loading data
Combining and processing
(833785, 33538)
Normalizing
Finding HVGs
(833785, 910)
Scaling
PCA
Neighbors
Leiden
UMAP
Saving processed data
Treg
Loading data
Combining and processing
(308864, 33538)
Normalizing
Finding HVGs
(308864, 1321)
Scaling
PCA
Neighbors
Leiden
UMAP
Saving processed data
Plasma cell
Loading data
Combining and processing
(16569, 33538)
Normalizing
Finding HVGs
(16569, 2569)
Scaling
PCA
Neighbors
Leiden
UMAP
Saving processed data
ILC
Loading data
Combining and processing
(5734, 33538)
Normalizing
Finding HVGs
(5734, 1856)
Scaling
PCA
Neighbors
Leiden
UMAP
Saving processed data
Platelet
Loading data
Combining and processing
(72385, 33538)
Nor
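
The parallel loading step in the loop above follows a common `ThreadPoolExecutor` pattern: submit one read per group, keep a mapping from each future back to its group name, and collect results as they complete. Below is a minimal sketch of that pattern with toy data. `toy_conn` and `read_toy_type()` are stand-ins for `h5ad_conn` and `read_l2_type()`, using plain lists instead of backed AnnData objects.

```python
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for backed AnnData objects: group name -> list of cell labels
toy_conn = {
    'BR1_Female_Negative': ['Treg', 'ILC', 'Treg'],
    'BR1_Male_Positive': ['ILC', 'MAIT'],
    'BR2_Female_Positive': ['Treg', 'MAIT', 'MAIT'],
}

def read_toy_type(labels, cell_type):
    # Stand-in for read_l2_type(): keep only rows with a matching label
    return [x for x in labels if x == cell_type]

def load_type(conn, cell_type, max_workers = 8):
    results = {}
    with ThreadPoolExecutor(max_workers = max_workers) as executor:
        # Map each future back to the group it was submitted for
        futures = {
            executor.submit(read_toy_type, labels, cell_type): group_name
            for group_name, labels in conn.items()
        }
        # as_completed() yields futures in completion order, not submission order
        for future in concurrent.futures.as_completed(futures):
            results[futures[future]] = future.result()
    return results

treg = load_type(toy_conn, 'Treg')
n_treg = sum(len(v) for v in treg.values())
```

Because the results dictionary is filled in completion order, the order of groups can differ from run to run. If a reproducible cell order matters for the concatenated object, iterating over sorted group names before combining would be one way to fix it.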

## Upload Cell Type data to HISE

Finally, we'll use `hisepy.upload.upload_files()` to send a copy of our output to HISE to use for downstream analysis steps.

In [30]:
study_space_uuid = 'de025812-5e73-4b3c-9c3b-6d0eac412f2a'
title = 'DIHA PBMC L2 Pre-cleanup Small Types .h5ad {d}'.format(d = date.today())

In [31]:
search_id = element_id()
search_id

'uranium-sodium-cesium'

In [32]:
in_files = list(h5ad_uuids.values()) + list(pq_uuids.values())
in_files

['73894515-92dc-4865-9247-e38f7911a529',
 '361c5e80-2a86-4c26-a494-fb46401fe1e2',
 '1ff69ea5-e189-4875-9178-5af2a61186e6',
 '72ca8c5f-3e1c-4360-856d-b53657697793',
 '74755a15-f0a1-411e-b071-7fd802b60e80',
 '277489fd-12fe-48ed-b3a0-7494858fa512',
 '425b59cc-098d-4833-ae5f-57c4582421c9',
 '5ba4e36b-3928-4023-90a0-403e0c43a809',
 'a0ce94a0-de3f-46a1-99bc-3acf24dae0ea',
 'd0b5c470-2481-455f-9b8e-b6ca28ea4bbe',
 'd9cffce7-4018-43b8-afea-9176b1b54532',
 '28a22e8e-0aca-45db-9e28-19907d297a38',
 'ee89cb24-7439-486c-8f50-983895e6d4b7',
 'bb47f341-d036-4054-8783-0a66b44e5c08',
 '6331f471-6bc3-49a8-9b4b-d83c7dd95320',
 'bdaef83d-76da-4bc7-acc6-5db2006908fa']

In [33]:
out_files

['output/diha_qc_CD16_monocyte_2024-04-20.h5ad',
 'output/diha_qc_CD56bright_NK_cell_2024-04-20.h5ad',
 'output/diha_qc_Naive_CD8_T_cell_2024-04-20.h5ad',
 'output/diha_qc_Treg_2024-04-20.h5ad',
 'output/diha_qc_Plasma_cell_2024-04-20.h5ad',
 'output/diha_qc_ILC_2024-04-20.h5ad',
 'output/diha_qc_Platelet_2024-04-20.h5ad',
 'output/diha_qc_DN_T_cell_2024-04-20.h5ad',
 'output/diha_qc_cDC1_2024-04-20.h5ad',
 'output/diha_qc_MAIT_2024-04-20.h5ad',
 'output/diha_qc_Proliferating_T_cell_2024-04-20.h5ad',
 'output/diha_qc_Intermediate_monocyte_2024-04-20.h5ad',
 'output/diha_qc_Effector_B_cell_2024-04-20.h5ad',
 'output/diha_qc_Memory_B_cell_2024-04-20.h5ad',
 'output/diha_qc_Progenitor_cell_2024-04-20.h5ad',
 'output/diha_qc_Transitional_B_cell_2024-04-20.h5ad',
 'output/diha_qc_Naive_B_cell_2024-04-21.h5ad',
 'output/diha_qc_gdT_2024-04-21.h5ad',
 'output/diha_qc_cDC2_2024-04-21.h5ad',
 'output/diha_qc_Proliferating_NK_cell_2024-04-21.h5ad',
 'output/diha_qc_ASDC_2024-04-21.h5ad',
 'outpu

In [35]:
hisepy.upload.upload_files(
    files = out_files,
    study_space_id = study_space_uuid,
    title = title,
    input_file_ids = in_files,
    destination = search_id
)

you are trying to upload file_ids... ['output/diha_qc_CD16_monocyte_2024-04-20.h5ad', 'output/diha_qc_CD56bright_NK_cell_2024-04-20.h5ad', 'output/diha_qc_Naive_CD8_T_cell_2024-04-20.h5ad', 'output/diha_qc_Treg_2024-04-20.h5ad', 'output/diha_qc_Plasma_cell_2024-04-20.h5ad', 'output/diha_qc_ILC_2024-04-20.h5ad', 'output/diha_qc_Platelet_2024-04-20.h5ad', 'output/diha_qc_DN_T_cell_2024-04-20.h5ad', 'output/diha_qc_cDC1_2024-04-20.h5ad', 'output/diha_qc_MAIT_2024-04-20.h5ad', 'output/diha_qc_Proliferating_T_cell_2024-04-20.h5ad', 'output/diha_qc_Intermediate_monocyte_2024-04-20.h5ad', 'output/diha_qc_Effector_B_cell_2024-04-20.h5ad', 'output/diha_qc_Memory_B_cell_2024-04-20.h5ad', 'output/diha_qc_Progenitor_cell_2024-04-20.h5ad', 'output/diha_qc_Transitional_B_cell_2024-04-20.h5ad', 'output/diha_qc_Naive_B_cell_2024-04-21.h5ad', 'output/diha_qc_gdT_2024-04-21.h5ad', 'output/diha_qc_cDC2_2024-04-21.h5ad', 'output/diha_qc_Proliferating_NK_cell_2024-04-21.h5ad', 'output/diha_qc_ASDC_2024-04-

(y/n) y


{'trace_id': '34981c98-42d7-42e3-afcc-0af57862ee3f',
 'files': ['output/diha_qc_CD16_monocyte_2024-04-20.h5ad',
  'output/diha_qc_CD56bright_NK_cell_2024-04-20.h5ad',
  'output/diha_qc_Naive_CD8_T_cell_2024-04-20.h5ad',
  'output/diha_qc_Treg_2024-04-20.h5ad',
  'output/diha_qc_Plasma_cell_2024-04-20.h5ad',
  'output/diha_qc_ILC_2024-04-20.h5ad',
  'output/diha_qc_Platelet_2024-04-20.h5ad',
  'output/diha_qc_DN_T_cell_2024-04-20.h5ad',
  'output/diha_qc_cDC1_2024-04-20.h5ad',
  'output/diha_qc_MAIT_2024-04-20.h5ad',
  'output/diha_qc_Proliferating_T_cell_2024-04-20.h5ad',
  'output/diha_qc_Intermediate_monocyte_2024-04-20.h5ad',
  'output/diha_qc_Effector_B_cell_2024-04-20.h5ad',
  'output/diha_qc_Memory_B_cell_2024-04-20.h5ad',
  'output/diha_qc_Progenitor_cell_2024-04-20.h5ad',
  'output/diha_qc_Transitional_B_cell_2024-04-20.h5ad',
  'output/diha_qc_Naive_B_cell_2024-04-21.h5ad',
  'output/diha_qc_gdT_2024-04-21.h5ad',
  'output/diha_qc_cDC2_2024-04-21.h5ad',
  'output/diha_qc_Proli

In [None]:
import session_info
session_info.show()