# Refine NK filtering for T cell doublets

In this notebook, we'll use the output of our initial doublet and gene detection filtering for NK cells to iterate and better identify T cell doublets. This distinction is difficult to make while other doublets are present, because adaptive NK cells and T cell doublets share some features. To improve removal of T cell doublets, we'll recluster the NK cells after removing other doublets and identify clusters that express T cell markers.

## Load Packages

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)

from datetime import date
import h5py
import hisepy
import os
import pandas as pd
import scanpy as sc
import tarfile

In [2]:
out_dir = 'output'
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

In [3]:
review_dir = 'output/review'
if not os.path.isdir(review_dir):
    os.makedirs(review_dir)

## Helper functions

These functions make it easy to read our files from HISE by UUID.

In [4]:
def cache_uuid_path(uuid):
    cache_path = '/home/jupyter/cache/{u}'.format(u = uuid)
    # Only download from HISE if this UUID has not been cached yet
    if not os.path.isdir(cache_path):
        hise_res = hisepy.reader.cache_files([uuid])
    # Use the first file found in the cache directory for this UUID
    filename = os.listdir(cache_path)[0]
    cache_file = '{p}/{f}'.format(p = cache_path, f = filename)
    return cache_file

In [5]:
def read_adata_uuid(uuid):
    cache_file = cache_uuid_path(uuid)
    res = sc.read_h5ad(cache_file)
    return res

These functions use scanpy's `dotplot()` to identify clusters to filter.

To draw its dots, `dotplot()` computes, for each cluster, the fraction of cells expressing each gene (or feature) and the average expression of that gene. Both tables are useful for applying thresholds, so we extract them from the returned plot object rather than recomputing them.

In [6]:
def marker_frac_df(adata, markers, clusters = 'louvain_2'):
    gene_cl_frac = sc.pl.dotplot(
        adata, 
        groupby = clusters,
        var_names = markers,
        return_fig = True
    ).dot_size_df
    return gene_cl_frac

def marker_mean_df(adata, markers, log = False, clusters = 'louvain_2'):
    gene_cl_mean = sc.pl.dotplot(
        adata, 
        groupby = clusters,
        var_names = markers,
        return_fig = True,
        log = log
    ).dot_color_df
    
    return gene_cl_mean
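
As a rough illustration of the two tables these helpers return, the sketch below recomputes a per-cluster detection fraction and mean with plain pandas on a toy table. The gene values and cluster labels are made up, and the `> 0` detection rule reflects our understanding of scanpy's default `expression_cutoff`; treat it as an approximation of what `dot_size_df` and `dot_color_df` hold, not as scanpy's implementation.

```python
import pandas as pd

# Toy expression table: one row per cell, one column per gene (hypothetical values).
expr = pd.DataFrame(
    {
        'CD3D': [0.0, 1.2, 0.0, 2.0, 0.0, 0.0],
        'IL7R': [0.5, 0.0, 0.0, 1.5, 1.0, 0.0],
    }
)

# Hypothetical cluster labels, standing in for adata.obs['louvain_2'].
clusters = pd.Series(['0', '0', '0', '1', '1', '1'], name = 'louvain_2')

# Fraction of cells per cluster with detectable expression (value > 0).
toy_frac = (expr > 0).groupby(clusters).mean()

# Mean expression per cluster, averaged over all cells in the cluster.
toy_mean = expr.groupby(clusters).mean()

print(toy_frac)
print(toy_mean)
```

Both results are indexed by cluster with one column per gene, which is the shape the selection functions below rely on.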

In [7]:
def select_clusters_above_gene_frac(adata, gene, cutoff, clusters = 'louvain_2'):
    gene_cl_frac = marker_frac_df(adata, gene, clusters)
    select_cl = gene_cl_frac.index[gene_cl_frac[gene] > cutoff].tolist()

    return select_cl

def select_clusters_above_gene_mean(adata, gene, cutoff, clusters = 'louvain_2'):
    gene_cl_mean = marker_mean_df(adata, gene, log = True, clusters = clusters)
    select_cl = gene_cl_mean.index[gene_cl_mean[gene] > cutoff].tolist()

    return select_cl

def select_clusters_by_low_gene_frac(adata, n_cutoff, frac_cutoff, clusters = 'louvain_2'):

    obs = adata.obs
    n_cells = obs.groupby(clusters)['barcodes'].count()

    low_obs = obs['n_genes'] < n_cutoff
    n_low = obs[low_obs].groupby(clusters)['barcodes'].count()

    frac_low = n_low / n_cells
    low_cl = frac_low[frac_low > frac_cutoff]
    low_cl = low_cl.index.tolist()

    return low_cl
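
The low gene detection selection can be traced on a small, made-up `obs` table. This sketch mirrors the steps of `select_clusters_by_low_gene_frac()` with hypothetical barcodes and cutoffs. Note that a cluster with no low-detection cells can be missing from the numerator, which yields `NaN` after division; `NaN > cutoff` is `False`, so such clusters are never selected.

```python
import pandas as pd

# Hypothetical cell metadata; the real values come from adata.obs.
obs = pd.DataFrame({
    'barcodes': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    'louvain_2': ['0', '0', '0', '0', '1', '1', '2'],
    'n_genes': [300, 1500, 1800, 2000, 400, 450, 1200],
})

n_cutoff = 500      # cells below this many detected genes count as "low"
frac_cutoff = 0.5   # clusters with more than this fraction of low cells are selected

# Total cells per cluster
n_cells = obs.groupby('louvain_2')['barcodes'].count()

# Low-detection cells per cluster; cluster '2' has none and is absent here
n_low = obs[obs['n_genes'] < n_cutoff].groupby('louvain_2')['barcodes'].count()

# Division aligns on cluster label; missing clusters become NaN
frac_low = n_low / n_cells

low_cl = frac_low[frac_low > frac_cutoff].index.tolist()
print(low_cl)
```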

In [8]:
def tidy_marker_df(adata, markers, clusters = 'louvain_2'):
    gene_cl_frac = marker_frac_df(adata, markers, clusters)
    gene_cl_frac = gene_cl_frac.reset_index(drop = False)
    gene_cl_frac = pd.melt(gene_cl_frac, id_vars = clusters, var_name = 'gene', value_name = 'gene_frac')
    
    # Pass clusters by keyword; positionally it would be bound to the `log` parameter
    gene_cl_mean = marker_mean_df(adata, markers, clusters = clusters)
    gene_cl_mean = gene_cl_mean.reset_index(drop = False)
    gene_cl_mean = pd.melt(gene_cl_mean, id_vars = clusters, var_name = 'gene', value_name = 'gene_mean')

    marker_df = gene_cl_frac.merge(gene_cl_mean, on = [clusters, 'gene'], how = 'left')
    return marker_df
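
To show the reshaping that `tidy_marker_df()` performs, here is a minimal sketch with two small, hypothetical cluster-by-gene tables standing in for the fraction and mean results. Each wide table is melted to one row per cluster and gene, then the two are joined on those keys.

```python
import pandas as pd

# Hypothetical wide tables: rows are clusters, columns are genes.
cl_index = pd.Index(['0', '1'], name = 'louvain_2')
frac = pd.DataFrame({'CD3D': [0.1, 0.6], 'IL7R': [0.2, 0.5]}, index = cl_index)
mean = pd.DataFrame({'CD3D': [0.05, 0.9], 'IL7R': [0.1, 0.7]}, index = cl_index)

# Wide -> long: one row per (cluster, gene) pair
frac_long = pd.melt(
    frac.reset_index(), id_vars = 'louvain_2',
    var_name = 'gene', value_name = 'gene_frac'
)
mean_long = pd.melt(
    mean.reset_index(), id_vars = 'louvain_2',
    var_name = 'gene', value_name = 'gene_mean'
)

# Join the two measurements on cluster and gene
tidy = frac_long.merge(mean_long, on = ['louvain_2', 'gene'], how = 'left')
print(tidy)
```

The long format keeps one measurement pair per row, which is convenient for plotting review figures outside of scanpy.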

This function applies data analysis methods to our scRNA-seq data, including normalization, HVG selection, PCA, nearest neighbors, UMAP, and Leiden clustering.

## Set expression cutoffs

After generating the AIFI_L2 partitioned data, we interactively examined the expression of cell class-specific marker genes to identify suitable cutoffs on detection frequency or mean expression. Here, we'll encode these cutoffs in a dictionary so that we can apply them to our datasets.

### Fraction filters

Most filters work well using the frequency of gene detection. The dictionary below defines these cutoffs for each cell type, with this nested structure:  
`dict[cell_type][reason][marker] = cutoff`

A cluster is flagged for removal when the fraction of its cells expressing a marker is greater than the specified cutoff. The `reason` key is recorded as the reason for removing that cluster.

In [9]:
frac_filter_dict = {
    'CD56bright NK cell': {
        'T cell doublet':      {'CD3D':  0.4}
    },
    'CD56dim NK cell': {
        'T cell doublet':      {'IL7R':  0.4}
    }
}
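
As a sketch of how this nested structure is consumed, the example below walks a small copy of the dictionary and applies its cutoff to a made-up per-cluster fraction table. The names `example_filters`, `toy_frac`, and `flagged` are hypothetical; the actual filtering loop appears later in this notebook.

```python
import pandas as pd

# A one-entry copy of the nested structure: dict[cell_type][reason][marker] = cutoff
example_filters = {
    'CD56dim NK cell': {
        'T cell doublet': {'IL7R': 0.4}
    }
}

# Hypothetical fraction of cells expressing IL7R in each cluster
toy_frac = pd.DataFrame(
    {'IL7R': [0.10, 0.55, 0.40]},
    index = pd.Index(['0', '1', '2'], name = 'louvain_2')
)

flagged = {}
for reason, marker_cutoffs in example_filters['CD56dim NK cell'].items():
    for marker, cutoff in marker_cutoffs.items():
        # Strictly greater than: a cluster exactly at the cutoff is kept
        for cl in toy_frac.index[toy_frac[marker] > cutoff]:
            # Record only the first reason that flags a cluster
            flagged.setdefault(cl, reason)

print(flagged)
```

Because the comparison is strict, cluster `2`, which sits exactly at the cutoff, is not flagged.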

## Assemble markers for review plots

In [10]:
all_filter_markers = []
for cell_type, filters in frac_filter_dict.items():
    for reason, marker_cutoffs in filters.items():
        for gene in marker_cutoffs.keys():
            if gene not in all_filter_markers:
                all_filter_markers.append(gene)

In [11]:
all_filter_markers.sort()
all_filter_markers

['CD3D', 'IL7R']

Additional markers for review

In [12]:
additional_markers = ['CD3E','NCAM1','ISG15', 'MKI67','CD4','CD8A']
all_filter_markers = all_filter_markers + additional_markers

## Identify files in HISE

In [13]:
h5ad_uuids = {
    'BR1_Female_Negative_CD56dim-NK-cell': '426366d0-fdf7-4ab2-8339-0baefe80d096',
    'BR1_Female_Positive_CD56dim-NK-cell': '8dd11b33-9065-460f-bcea-88a3092bf662',
    'BR1_Male_Negative_CD56dim-NK-cell': '6edf4d9d-f29a-4c7c-bc86-7e87b53ca9f5',
    'BR1_Male_Positive_CD56dim-NK-cell': '3788e5c0-5fee-4f6d-b108-77d9da289a7f',
    'BR2_Female_Negative_CD56dim-NK-cell': '82b143e9-0dd6-4ad9-b59e-6feb135f5c0c',
    'BR2_Female_Positive_CD56dim-NK-cell': 'eda7be7b-7ba6-4832-83b9-8b210319c078',
    'BR2_Male_Negative_CD56dim-NK-cell': '6dcc8d60-7b43-40f1-90b3-2390d09e4bbc',
    'BR2_Male_Positive_CD56dim-NK-cell': 'f84657c6-d7df-42f9-9a42-5d3aa2e5c4c2',
    'CD56bright-NK-cell': 'd4960075-6eba-4d79-9157-5f8259bbeedf'
}

## Apply filters to datasets

In [14]:
out_files = []
review_files = []
for group_name, uuid in h5ad_uuids.items():
    print(group_name)
    out_file = 'output/diha_{g}_refined_{d}.h5ad'.format(g = group_name, d = date.today())
    if os.path.isfile(out_file):
        print('Previously filtered {g}; Skipping.'.format(g = group_name))
        out_files.append(out_file)
    else:        
        adata = read_adata_uuid(uuid)
        
        cell_type = adata.obs['AIFI_L2'].iloc[0]

        # Track filter results
        filter_list = []
        filter_cl_list = []
        
        # Filter by Fractional gene expression
        marker_filters = frac_filter_dict[cell_type]
    
        for reason, marker_cutoffs in marker_filters.items():
            for marker, cutoff in marker_cutoffs.items():
                filter_cl = select_clusters_above_gene_frac(
                    adata,
                    gene = marker,
                    cutoff = cutoff,
                    clusters = 'louvain_2'
                )

                check_cl = []
                for cl in filter_cl:
                    if not cl in filter_cl_list:
                        check_cl.append(cl)
                        filter_cl_list.append(cl)
                
                filter_df = pd.DataFrame({'louvain_2': check_cl, 'remove_reason': [reason]*len(check_cl)})
                filter_list.append(filter_df)
    
        # Assemble all filtering results
        filter_df = pd.concat(filter_list)

        # Save filtered clusters for review
        filter_file = '{r}/diha_qc_{g}_filter_df_{d}.csv'.format(
            r = review_dir,
            g = group_name,
            d = date.today()
        )
        filter_df.to_csv(filter_file)
        review_files.append(filter_file)
    
        # Add filters to cells
        obs = adata.obs.copy()
        obs = obs.merge(filter_df, on = 'louvain_2', how = 'left')
        obs['remove_reason'] = obs['remove_reason'].fillna('Not removed')
        
        # Save observations and UMAP coordinates for review
        review_obs = obs
        umap_mat = adata.obsm['X_umap']
        umap_df = pd.DataFrame(umap_mat, columns = ['umap_1', 'umap_2'])
        review_obs['umap_1'] = umap_df['umap_1']
        review_obs['umap_2'] = umap_df['umap_2']
        
        obs_file = '{r}/diha_qc_{g}_refined_obs_df_{d}.csv'.format(
            r = review_dir,
            g = group_name,
            d = date.today()
        )
        review_obs.to_csv(obs_file)
        review_files.append(obs_file)

        # Save expression of marker features for review
        marker_df = tidy_marker_df(adata, all_filter_markers, clusters = 'louvain_2')

        marker_file = '{r}/diha_qc_{g}_refined_marker_df_{d}.csv'.format(
            r = review_dir,
            g = group_name,
            d = date.today()
        )
        marker_df.to_csv(marker_file)
        review_files.append(marker_file)
    
        # Apply filters to data
        print(adata.shape)
        keep_cells = obs['remove_reason'] == 'Not removed'
        adata = adata[keep_cells]
        print(adata.shape)
        
        # Save filtered data
        adata.write_h5ad(out_file)
    
        out_files.append(out_file)

        # Clean up cache so we don't run out of disk space
        h5ad_path = cache_uuid_path(uuid)
        os.remove(h5ad_path)
        cache_path = '/home/jupyter/cache/{u}'.format(u = uuid)
        os.rmdir(cache_path)

BR1_Female_Negative_CD56dim-NK-cell
(187379, 899)
(181154, 899)
BR1_Female_Positive_CD56dim-NK-cell
(79611, 888)
(76987, 888)
BR1_Male_Negative_CD56dim-NK-cell
(156404, 863)
(152746, 863)
BR1_Male_Positive_CD56dim-NK-cell
(127326, 1048)
(122831, 1048)
BR2_Female_Negative_CD56dim-NK-cell
(122044, 983)
(118648, 983)
BR2_Female_Positive_CD56dim-NK-cell
(155287, 950)
(149915, 950)
BR2_Male_Negative_CD56dim-NK-cell
(171079, 1005)
(167283, 1005)
BR2_Male_Positive_CD56dim-NK-cell
(69976, 1127)
(67829, 1127)
CD56bright-NK-cell
(93676, 1350)
(88824, 1350)
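
The step that turns cluster-level flags into a per-cell mask can be illustrated on a toy table. This is a minimal sketch with hypothetical barcodes. One detail worth knowing is that `DataFrame.merge()` returns a fresh integer index, so the merged table no longer carries the cell barcodes; the mask is therefore applied by position here.

```python
import pandas as pd

# Hypothetical cell metadata indexed by barcode
obs = pd.DataFrame(
    {'louvain_2': ['0', '1', '1', '2']},
    index = ['AAAC', 'AAAG', 'AACC', 'AAGT']
)

# Hypothetical cluster-level filter result
filter_df = pd.DataFrame({
    'louvain_2': ['1'],
    'remove_reason': ['T cell doublet']
})

# A left merge keeps every cell, in the original row order
merged = obs.merge(filter_df, on = 'louvain_2', how = 'left')
merged['remove_reason'] = merged['remove_reason'].fillna('Not removed')

# Positional boolean mask of cells to keep
keep = (merged['remove_reason'] == 'Not removed').to_numpy()
kept_barcodes = obs.index[keep].tolist()

print(merged.index.tolist())
print(kept_barcodes)
```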


## Bundle Review data

We saved review data, including cell metadata and UMAP coordinates, filtered clusters, and marker gene expression, to enable us to assemble figures to double-check our filtering process.

To help with file transfer, we'll use `tarfile` to bundle our review files.

In [15]:
review_tar = 'output/diha_qc_AIFI_L2_NK_refined_review_{d}.tar.gz'.format(d = date.today())
tar = tarfile.open(review_tar, 'w:gz')
for review_file in review_files:
    tar.add(review_file)
tar.close()
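
For reference, the same bundling pattern can be tried on throwaway files. This sketch writes two hypothetical CSVs to a temporary directory, bundles them, and lists the archive contents. It opens the archive in a `with` block so it is closed even if an error occurs, and it passes `arcname` to store bare file names; both are optional choices rather than requirements of the workflow above.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two small placeholder review files (hypothetical names and content)
    toy_files = []
    for name in ['filter_df.csv', 'obs_df.csv']:
        path = os.path.join(tmp_dir, name)
        with open(path, 'w') as f:
            f.write('louvain_2,remove_reason\n')
        toy_files.append(path)

    # Bundle the files into a gzip-compressed tar archive
    toy_tar = os.path.join(tmp_dir, 'review.tar.gz')
    with tarfile.open(toy_tar, 'w:gz') as tar:
        for path in toy_files:
            tar.add(path, arcname = os.path.basename(path))

    # Reopen the archive to confirm what was stored
    with tarfile.open(toy_tar, 'r:gz') as tar:
        bundled = tar.getnames()

print(bundled)
```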

## Upload Cell Type data to HISE

Finally, we'll use `hisepy.upload.upload_files()` to send a copy of our output to HISE to use for downstream analysis steps.

In [16]:
study_space_uuid = 'de025812-5e73-4b3c-9c3b-6d0eac412f2a'
title = 'DIHA PBMC Filter Refinement NK .h5ad {d}'.format(d = date.today())

In [17]:
in_files = list(h5ad_uuids.values())
in_files

['426366d0-fdf7-4ab2-8339-0baefe80d096',
 '8dd11b33-9065-460f-bcea-88a3092bf662',
 '6edf4d9d-f29a-4c7c-bc86-7e87b53ca9f5',
 '3788e5c0-5fee-4f6d-b108-77d9da289a7f',
 '82b143e9-0dd6-4ad9-b59e-6feb135f5c0c',
 'eda7be7b-7ba6-4832-83b9-8b210319c078',
 '6dcc8d60-7b43-40f1-90b3-2390d09e4bbc',
 'f84657c6-d7df-42f9-9a42-5d3aa2e5c4c2',
 'd4960075-6eba-4d79-9157-5f8259bbeedf']

In [18]:
out_files = out_files + [review_tar]

In [19]:
out_files

['output/diha_BR1_Female_Negative_CD56dim-NK-cell_refined_2024-03-22.h5ad',
 'output/diha_BR1_Female_Positive_CD56dim-NK-cell_refined_2024-03-22.h5ad',
 'output/diha_BR1_Male_Negative_CD56dim-NK-cell_refined_2024-03-22.h5ad',
 'output/diha_BR1_Male_Positive_CD56dim-NK-cell_refined_2024-03-22.h5ad',
 'output/diha_BR2_Female_Negative_CD56dim-NK-cell_refined_2024-03-22.h5ad',
 'output/diha_BR2_Female_Positive_CD56dim-NK-cell_refined_2024-03-22.h5ad',
 'output/diha_BR2_Male_Negative_CD56dim-NK-cell_refined_2024-03-22.h5ad',
 'output/diha_BR2_Male_Positive_CD56dim-NK-cell_refined_2024-03-22.h5ad',
 'output/diha_CD56bright-NK-cell_refined_2024-03-22.h5ad',
 'output/diha_qc_AIFI_L2_NK_refined_review_2024-03-22.tar.gz']

In [20]:
hisepy.upload.upload_files(
    files = out_files,
    study_space_id = study_space_uuid,
    title = title,
    input_file_ids = in_files,
    destination = 'diha_nk_filter_refinement'
)

output/diha_BR1_Female_Negative_CD56dim-NK-cell_refined_2024-03-22.h5ad
output/diha_BR1_Female_Positive_CD56dim-NK-cell_refined_2024-03-22.h5ad
output/diha_BR1_Male_Negative_CD56dim-NK-cell_refined_2024-03-22.h5ad
output/diha_BR1_Male_Positive_CD56dim-NK-cell_refined_2024-03-22.h5ad
output/diha_BR2_Female_Negative_CD56dim-NK-cell_refined_2024-03-22.h5ad
output/diha_BR2_Female_Positive_CD56dim-NK-cell_refined_2024-03-22.h5ad
output/diha_BR2_Male_Negative_CD56dim-NK-cell_refined_2024-03-22.h5ad
output/diha_BR2_Male_Positive_CD56dim-NK-cell_refined_2024-03-22.h5ad
output/diha_CD56bright-NK-cell_refined_2024-03-22.h5ad
output/diha_qc_AIFI_L2_NK_refined_review_2024-03-22.tar.gz
Cannot determine the current notebook.
1) /home/jupyter/IH-A-Aging-Analysis-Notebooks/scrna-seq_analysis/02-reference_labeling/07a-Python_refine_NK_filtering.ipynb
2) /home/jupyter/IH-A-Aging-Analysis-Notebooks/scrna-seq_analysis/02-reference_labeling/09f-Python_naive_cd4_t_cell_L3_refinement.ipynb
3) /home/jupyter/I

 1


you are trying to upload file_ids... ['output/diha_BR1_Female_Negative_CD56dim-NK-cell_refined_2024-03-22.h5ad', 'output/diha_BR1_Female_Positive_CD56dim-NK-cell_refined_2024-03-22.h5ad', 'output/diha_BR1_Male_Negative_CD56dim-NK-cell_refined_2024-03-22.h5ad', 'output/diha_BR1_Male_Positive_CD56dim-NK-cell_refined_2024-03-22.h5ad', 'output/diha_BR2_Female_Negative_CD56dim-NK-cell_refined_2024-03-22.h5ad', 'output/diha_BR2_Female_Positive_CD56dim-NK-cell_refined_2024-03-22.h5ad', 'output/diha_BR2_Male_Negative_CD56dim-NK-cell_refined_2024-03-22.h5ad', 'output/diha_BR2_Male_Positive_CD56dim-NK-cell_refined_2024-03-22.h5ad', 'output/diha_CD56bright-NK-cell_refined_2024-03-22.h5ad', 'output/diha_qc_AIFI_L2_NK_refined_review_2024-03-22.tar.gz']. Do you truly want to proceed?


(y/n) y


{'trace_id': '56ade688-1a4f-4b00-b946-2243403c5550',
 'files': ['output/diha_BR1_Female_Negative_CD56dim-NK-cell_refined_2024-03-22.h5ad',
  'output/diha_BR1_Female_Positive_CD56dim-NK-cell_refined_2024-03-22.h5ad',
  'output/diha_BR1_Male_Negative_CD56dim-NK-cell_refined_2024-03-22.h5ad',
  'output/diha_BR1_Male_Positive_CD56dim-NK-cell_refined_2024-03-22.h5ad',
  'output/diha_BR2_Female_Negative_CD56dim-NK-cell_refined_2024-03-22.h5ad',
  'output/diha_BR2_Female_Positive_CD56dim-NK-cell_refined_2024-03-22.h5ad',
  'output/diha_BR2_Male_Negative_CD56dim-NK-cell_refined_2024-03-22.h5ad',
  'output/diha_BR2_Male_Positive_CD56dim-NK-cell_refined_2024-03-22.h5ad',
  'output/diha_CD56bright-NK-cell_refined_2024-03-22.h5ad',
  'output/diha_qc_AIFI_L2_NK_refined_review_2024-03-22.tar.gz']}

In [21]:
import session_info
session_info.show()