# Partition Cell Types for QC Review

**Note:** This notebook serves as a patch for two files that have download problems in HISE. To enable downstream analyses, this notebook applies cell type partitioning for just these two subsets:  
BR1_Female_Negative_Naive-CD4-T-cell  
BR2_Male_Positive_CD56dim-NK-cell

Changes specific to this patch notebook are indicated with **Patch Divergence** notes.

To ensure that our cell type labels are as accurate as possible, we'll subset our dataset based on our `AIFI_L2` labels for review.

Previously, we split our dataset into 8 subsets based on cohort, biological sex, and CMV status. Here, we'll combine these subsets per L2 cell type for less abundant cell types to simplify review. For very abundant cell types (e.g. Naive CD4 T cells), we'll keep the 8 groups separate and review each.

This should reduce the burden of trying to cluster >2M cells (which can be very slow) without significantly reducing our power to identify doublets or mislabeled cells, as we'll still have >100k cells for these large classes.
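As a rough illustration of where the 8 sample groups come from, the sketch below enumerates every combination of cohort, sex, and CMV status. The level names are an assumption inferred from the two group names used in this patch (`BR1_Female_Negative` and `BR2_Male_Positive`); the upstream notebooks may define them differently.

```python
from itertools import product

# Assumed levels, inferred from the group names used in this patch
cohorts = ['BR1', 'BR2']
sexes = ['Female', 'Male']
cmv_statuses = ['Negative', 'Positive']

# One group per combination of cohort, sex, and CMV status: 2 x 2 x 2 = 8
group_names = [
    '_'.join(parts)
    for parts in product(cohorts, sexes, cmv_statuses)
]

for group_name in group_names:
    print(group_name)
```

Large cell types are reviewed once per group, while smaller types are pooled across all groups before clustering.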

In a later step, we'll generate rules to filter these clustered subsets of cells to help us identify doublets, contaminated clusters (e.g. with erythrocyte content), and mislabeled cells.

## Load libraries

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)

import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
from datetime import date
import hisepy
import os
import pandas as pd
import re
import scanpy as sc
import scanpy.external as sce

In [2]:
out_dir = 'output'
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

## Helper functions

These make it a bit simpler to cache and read in files from HISE.

In [3]:
def cache_uuid_path(uuid):
    # Expected cache location for this UUID
    cache_path = '/home/jupyter/cache/{u}'.format(u = uuid)
    # Download only if this UUID hasn't been cached yet
    if not os.path.isdir(cache_path):
        hisepy.reader.cache_files([uuid])
    # Assumes one file per UUID directory; take the first entry
    filename = os.listdir(cache_path)[0]
    cache_file = os.path.join(cache_path, filename)
    return cache_file
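The check-then-download pattern used by `cache_uuid_path()` can be tried without HISE access. The sketch below is a minimal stand-alone version: `cached_file_path()` and `fake_fetch()` are hypothetical names introduced only for illustration, and `fake_fetch()` stands in for the real download step by writing a placeholder file.

```python
import os
import tempfile

def cached_file_path(cache_root, uuid, fetch):
    """Return the first file cached for uuid, fetching it only if needed."""
    cache_path = os.path.join(cache_root, uuid)
    # Skip the (potentially slow) fetch when the directory already exists
    if not os.path.isdir(cache_path):
        fetch(uuid, cache_path)
    filenames = sorted(os.listdir(cache_path))
    if not filenames:
        raise FileNotFoundError('No cached file for {u}'.format(u = uuid))
    return os.path.join(cache_path, filenames[0])

# Hypothetical stand-in for the download step; records each call
calls = []
def fake_fetch(uuid, cache_path):
    calls.append(uuid)
    os.makedirs(cache_path)
    with open(os.path.join(cache_path, 'example.h5ad'), 'w') as f:
        f.write('placeholder')

with tempfile.TemporaryDirectory() as cache_root:
    first = cached_file_path(cache_root, 'example-uuid', fake_fetch)
    second = cached_file_path(cache_root, 'example-uuid', fake_fetch)

print(len(calls))
```

The second call finds the existing directory and returns the same path without fetching again. Unlike the helper above, this sketch raises an explicit error when the cache directory is empty rather than failing on the list index.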

In [4]:
def read_parquet_uuid(uuid):
    cache_file = cache_uuid_path(uuid)
    res = pd.read_parquet(cache_file)
    return res

In [5]:
def sort_adata_uuid(uuid, sort_cols = ['AIFI_L2', 'sample.sampleKitGuid']):
    cache_file = cache_uuid_path(uuid)
    adata = sc.read_h5ad(cache_file)
    # Reorder cells so that rows with the same labels are adjacent
    obs = adata.obs
    obs = obs.sort_values(sort_cols)
    adata = adata[obs.index]
    # Overwrite the cached file with the sorted version
    adata.write_h5ad(cache_file)

This function will enable us to connect to our .h5ad files without loading the entire thing into memory. We'll then load only the cells that we want for each cell class to assemble them for writing. This should save us some overhead as we do our subsetting.

In [6]:
def read_adata_backed_uuid(uuid):
    cache_file = cache_uuid_path(uuid)
    res = sc.read_h5ad(cache_file, backed = 'r')
    return res

This function will apply a standard normalization, nearest neighbors, clustering, and UMAP process to our cell subsets:

In [7]:
def process_adata(adata):
    # Keep a copy of the raw data
    adata.raw = adata

    print('Normalizing')
    # Normalize and log transform
    sc.pp.normalize_total(adata)
    sc.pp.log1p(adata)

    print('Finding HVGs')
    # Restrict downstream steps to variable genes
    sc.pp.highly_variable_genes(adata)
    adata = adata[:, adata.var_names[adata.var['highly_variable']]].copy()
    print(adata.shape)

    print('Scaling')
    # Scale variable genes
    sc.pp.scale(adata)

    print('PCA')
    # Run PCA
    sc.tl.pca(adata, svd_solver = 'arpack')

    print('Neighbors')
    # Find nearest neighbors
    sc.pp.neighbors(
        adata, 
        n_neighbors = 50,
        n_pcs = 30
    )

    print('Louvain')
    # Find clusters
    sc.tl.louvain(
        adata, 
        resolution = 2, 
        key_added = 'louvain_2'
    )

    print('UMAP')
    # Run UMAP
    sc.tl.umap(adata, min_dist = 0.05)

    return adata

## Read cell metadata to identify large subsets

**Patch Divergence:** We'll skip this step used to identify large subsets and specify the two L2 types we need for our patch.

In [8]:
large_types = ['CD56dim NK cell', 'Naive CD4 T cell']

These L2 labels have > 1M cells each. We'll keep them separated per sample group.
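For reference, the skipped step amounts to counting cells per label and keeping the labels above a size threshold. The sketch below shows that idea on made-up toy labels with a tiny threshold; `find_large_types()` is a hypothetical helper, and the counts are not taken from the real dataset.

```python
from collections import Counter

def find_large_types(labels, min_cells):
    """Return the sorted labels that occur more than min_cells times."""
    counts = Counter(labels)
    return sorted(
        label for label, n in counts.items()
        if n > min_cells
    )

# Toy labels for illustration only; real counts are in the millions
toy_labels = (
    ['Naive CD4 T cell'] * 5
    + ['CD56dim NK cell'] * 4
    + ['Rare type'] * 1
)

large = find_large_types(toy_labels, min_cells = 3)
print(large)
```

With the real metadata, the same logic would use a threshold of 1M cells.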

## Download and sort .h5ad files

Sorting the .h5ad files by AIFI_L2 will make reading each cell type much faster by placing cells of the same type next to each other in the sparse matrix.
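A small stand-alone sketch of why sorting helps: before sorting, cells of one type are scattered across row positions, while after sorting they occupy one contiguous block. The labels here are toy values, and `positions()` and `is_contiguous()` are hypothetical helpers for illustration.

```python
# Toy per-cell labels in their original (unsorted) row order
labels = [
    'Naive CD4 T cell', 'CD56dim NK cell', 'Naive CD4 T cell',
    'B cell', 'CD56dim NK cell', 'Naive CD4 T cell'
]

def positions(labels, cell_type):
    """Row positions of every cell with the given label."""
    return [i for i, label in enumerate(labels) if label == cell_type]

def is_contiguous(idx):
    """True if the positions form one unbroken run of rows."""
    return idx == list(range(idx[0], idx[-1] + 1))

unsorted_idx = positions(labels, 'Naive CD4 T cell')        # [0, 2, 5]
sorted_idx = positions(sorted(labels), 'Naive CD4 T cell')  # [3, 4, 5]

print(is_contiguous(unsorted_idx), is_contiguous(sorted_idx))
```

A contiguous block can be read from disk as a single range of rows, which is generally much cheaper than gathering rows scattered across the matrix.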

**Patch Divergence:** Here, we only load the two sample groups that we need for our missing files.

In [9]:
h5ad_uuids = {
    'BR1_Female_Negative': '7aa6743f-c9f1-4cc1-b879-b5528f6d4270',
    'BR2_Male_Positive':   '8eb02ae8-46fb-4d1d-83e3-088c55ddc3d5'
}

In [10]:
for uuid in h5ad_uuids.values():
    sort_adata_uuid(uuid, sort_cols = ['AIFI_L2', 'sample.sampleKitGuid'])

downloading fileID: 7aa6743f-c9f1-4cc1-b879-b5528f6d4270
Files have been successfully downloaded!
downloading fileID: 8eb02ae8-46fb-4d1d-83e3-088c55ddc3d5
Files have been successfully downloaded!


## Open connections to .h5ad files

Now that they're sorted, we can open these files with on-disk backing so we don't have to read the entire file at once.

In [11]:
h5ad_conn = {}
for group_name, uuid in h5ad_uuids.items():
    h5ad_conn[group_name] = read_adata_backed_uuid(uuid)

## Process each cell type

**Patch Divergence:** For this patch, we'll only run the two cell types we need.

In [12]:
l2_types = ['CD56dim NK cell', 'Naive CD4 T cell']

In [13]:
def read_l2_type(adata, cell_type):
    type_adata = adata[adata.obs['AIFI_L2'] == cell_type].to_memory()
    return type_adata

In [14]:
out_files = []

In [15]:
for cell_type in l2_types:
    print(cell_type)
    
    # Read data from each group for this type in parallel
    print('Loading data')
    type_adata_dict = {}

    with ThreadPoolExecutor(max_workers = 8) as executor:
        futures = {
            executor.submit(
                read_l2_type, 
                h5ad_conn[group_name], 
                cell_type): group_name 
            for group_name in h5ad_conn.keys()
        }
        for future in concurrent.futures.as_completed(futures):
            future_group = futures[future]
            type_adata_dict[future_group] = future.result()
    
    if cell_type in large_types:
        # If large, process and save separately
        for group_name, type_adata in type_adata_dict.items():
            print('Processing {g}'.format(g = group_name))
            print(type_adata.shape)
            type_adata = process_adata(type_adata)
            print('Saving processed data')
            out_file = 'output/diha_qc_{g}_{c}_{d}.h5ad'.format(
                g = group_name,
                c = cell_type,
                d = date.today()
            )
            type_adata.write_h5ad(out_file)
            out_files.append(out_file)
            
    else:
        # If small, combine and process
        print('Combining and processing')
        type_adata = sc.concat(type_adata_dict)
        print(type_adata.shape)
        type_adata = process_adata(type_adata)
        
        print('Saving processed data')
        out_file = 'output/diha_qc_{c}_{d}.h5ad'.format(
            c = cell_type,
            d = date.today()
        )
        type_adata.write_h5ad(out_file)
        out_files.append(out_file)

CD56dim NK cell
Loading data


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sub[k] = df_sub[k].cat.remove_unused_categories()


Processing BR2_Male_Positive
(74411, 33538)
Normalizing
Finding HVGs
(74411, 1127)
Scaling
PCA
Neighbors
Louvain
UMAP
Saving processed data
Processing BR1_Female_Negative
(199167, 33538)
Normalizing
Finding HVGs
(199167, 899)
Scaling
PCA
Neighbors
Louvain
UMAP
Saving processed data
Naive CD4 T cell
Loading data
Processing BR2_Male_Positive
(248771, 33538)
Normalizing
Finding HVGs
(248771, 565)
Scaling
PCA
Neighbors
Louvain
UMAP
Saving processed data
Processing BR1_Female_Negative
(659799, 33538)
Normalizing
Finding HVGs
(659799, 502)
Scaling
PCA
Neighbors
Louvain
UMAP
Saving processed data
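The log above lists `BR2_Male_Positive` before `BR1_Female_Negative`, even though `h5ad_uuids` defines them in the opposite order. This is consistent with how `concurrent.futures.as_completed()` behaves: it yields futures as they finish, so a dictionary filled inside that loop is ordered by completion rather than submission. The sketch below reproduces the pattern with a hypothetical `slow_read()` function that sleeps instead of reading data.

```python
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
import time

def slow_read(group_name, delay):
    # Stand-in for reading one group's cells; sleeps to simulate I/O
    time.sleep(delay)
    return group_name.upper()

# The first group is submitted first but takes longer to finish
delays = {'group_a': 0.2, 'group_b': 0.0}

results = {}
with ThreadPoolExecutor(max_workers = 2) as executor:
    # Map each future back to the group it was submitted for
    futures = {
        executor.submit(slow_read, group_name, delay): group_name
        for group_name, delay in delays.items()
    }
    for future in concurrent.futures.as_completed(futures):
        results[futures[future]] = future.result()

# Typically prints group_b first, because it completes first
print(list(results))
```

The results are keyed by group name, so the processing order does not affect which output file each group is written to.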


## Fix output filenames

We'll update the output file names to replace spaces with dashes.

In [16]:
dash_out_files = []
for out_file in out_files:
    dash_out_file = re.sub(' ', '-', out_file)
    os.rename(out_file, dash_out_file)
    dash_out_files.append(dash_out_file)
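As a quick check of what this renaming produces, the sketch below builds one file name with the same template used in the processing loop and applies the same substitution. The date is fixed to the one that appears in this notebook's output files so the result is reproducible.

```python
import re

# Same template as the processing loop, with a fixed date for reproducibility
out_file = 'output/diha_qc_{g}_{c}_{d}.h5ad'.format(
    g = 'BR1_Female_Negative',
    c = 'Naive CD4 T cell',
    d = '2024-03-15'
)

# Replace every space with a dash, as in the renaming cell
dash_out_file = re.sub(' ', '-', out_file)
print(dash_out_file)
```

Only the cell type label contains spaces, so the group name and date pass through unchanged.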

## Upload Cell Type data to HISE

Finally, we'll use `hisepy.upload.upload_files()` to send a copy of our output to HISE to use for downstream analysis steps.

In [17]:
study_space_uuid = 'de025812-5e73-4b3c-9c3b-6d0eac412f2a'
title = 'DIHA PBMC AIFI_L2 Pre-cleanup .h5ad Patch Files {d}'.format(d = date.today())

In [18]:
in_files = list(h5ad_uuids.values())
in_files

['7aa6743f-c9f1-4cc1-b879-b5528f6d4270',
 '8eb02ae8-46fb-4d1d-83e3-088c55ddc3d5']

**Patch Divergence:** We'll specify exactly which output files should be uploaded so we use only the two we require.

In [19]:
out_files = [
    'output/diha_qc_BR1_Female_Negative_Naive-CD4-T-cell_2024-03-15.h5ad',
    'output/diha_qc_BR2_Male_Positive_CD56dim-NK-cell_2024-03-15.h5ad'
]

In [21]:
hisepy.upload.upload_files(
    files = out_files,
    study_space_id = study_space_uuid,
    title = title,
    input_file_ids = in_files
)

output/diha_qc_BR1_Female_Negative_Naive-CD4-T-cell_2024-03-15.h5ad
output/diha_qc_BR2_Male_Positive_CD56dim-NK-cell_2024-03-15.h5ad
you are trying to upload file_ids... ['output/diha_qc_BR1_Female_Negative_Naive-CD4-T-cell_2024-03-15.h5ad', 'output/diha_qc_BR2_Male_Positive_CD56dim-NK-cell_2024-03-15.h5ad']. Do you truly want to proceed?


(y/n) y


{'trace_id': '4d63d9c0-2462-4779-91e2-60407de85ffe',
 'files': ['output/diha_qc_BR1_Female_Negative_Naive-CD4-T-cell_2024-03-15.h5ad',
  'output/diha_qc_BR2_Male_Positive_CD56dim-NK-cell_2024-03-15.h5ad']}

In [22]:
import session_info
session_info.show()