# Filter cell classes for doublets and low gene abundance

In this notebook, we'll read the AIFI_L2 cell class partition subsets and remove two kinds of Louvain clusters: clusters that express marker genes from off-target cell types, which look like doublets missed by scrublet, and clusters with abnormally low average gene detection.

We'll also save review tables (filtered clusters, cell metadata with UMAP coordinates, and marker gene expression) so that we can plot these results and manually double-check which cells were removed.

## Load Packages

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)

from datetime import date
import h5py
import hisepy
import os
import pandas as pd
import scanpy as sc
import tarfile

In [2]:
out_dir = 'output'
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

In [3]:
review_dir = 'output/review'
if not os.path.isdir(review_dir):
    os.makedirs(review_dir)

## Helper functions

These functions make it easy to read our files from HISE by UUID.

In [4]:
def cache_uuid_path(uuid):
    cache_path = '/home/jupyter/cache/{u}'.format(u = uuid)
    if not os.path.isdir(cache_path):
        hise_res = hisepy.reader.cache_files([uuid])
    filename = os.listdir(cache_path)[0]
    cache_file = '{p}/{f}'.format(p = cache_path, f = filename)
    return cache_file

In [5]:
def read_adata_uuid(uuid):
    cache_file = cache_uuid_path(uuid)
    res = sc.read_h5ad(cache_file)
    return res

These functions use scanpy's `dotplot` function to identify clusters to filter.

To draw a dot plot, scanpy assembles the fraction of cells expressing each gene (or feature) and the average expression per cluster. Both tables are useful for applying thresholds to filter clusters.

In [6]:
def marker_frac_df(adata, markers, clusters = 'louvain_2'):
    gene_cl_frac = sc.pl.dotplot(
        adata, 
        groupby = clusters,
        var_names = markers,
        return_fig = True
    ).dot_size_df
    return gene_cl_frac

def marker_mean_df(adata, markers, log = False, clusters = 'louvain_2'):
    gene_cl_mean = sc.pl.dotplot(
        adata, 
        groupby = clusters,
        var_names = markers,
        return_fig = True,
        log = log
    ).dot_color_df
    
    return gene_cl_mean
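
To make these two tables concrete, the sketch below computes comparable per-cluster summaries directly with pandas on a toy expression table. It assumes a gene counts as detected when its value is greater than 0, which is our understanding of scanpy's default `expression_cutoff`, and that the mean is taken over all cells in the cluster. The values and cluster labels are hypothetical.

```python
import pandas as pd

# Toy data: 6 cells, 2 genes, 2 clusters (hypothetical values)
expr = pd.DataFrame(
    {
        'CD3E': [0.0, 1.2, 0.8, 2.0, 0.0, 0.0],
        'HBB':  [0.0, 0.0, 3.0, 0.0, 0.0, 1.0],
    }
)
clusters = pd.Series(['0', '0', '0', '1', '1', '1'], name = 'louvain_2')

# Fraction of cells per cluster with detectable expression (> 0)
frac_df = (expr > 0).groupby(clusters).mean()

# Mean expression per cluster, averaged over all cells in the cluster
mean_df = expr.groupby(clusters).mean()

print(frac_df)
print(mean_df)
```

Both results have one row per cluster and one column per gene, which is the layout the selection helpers index into.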

In [7]:
def select_clusters_above_gene_frac(adata, gene, cutoff, clusters = 'louvain_2'):
    gene_cl_frac = marker_frac_df(adata, gene, clusters)
    select_cl = gene_cl_frac.index[gene_cl_frac[gene] > cutoff].tolist()

    return select_cl

def select_clusters_above_gene_mean(adata, gene, cutoff, clusters = 'louvain_2'):
    gene_cl_mean = marker_mean_df(adata, gene, log = True, clusters = clusters)
    select_cl = gene_cl_mean.index[gene_cl_mean[gene] > cutoff].tolist()

    return select_cl

def select_clusters_by_low_gene_frac(adata, n_cutoff, frac_cutoff, clusters = 'louvain_2'):

    obs = adata.obs
    n_cells = obs.groupby(clusters)['barcodes'].count()

    low_obs = obs['n_genes'] < n_cutoff
    n_low = obs[low_obs].groupby(clusters)['barcodes'].count()

    frac_low = n_low / n_cells
    low_cl = frac_low[frac_low > frac_cutoff]
    low_cl = low_cl.index.tolist()

    return low_cl

In [8]:
def tidy_marker_df(adata, markers, clusters = 'louvain_2'):
    gene_cl_frac = marker_frac_df(adata, markers, clusters)
    gene_cl_frac = gene_cl_frac.reset_index(drop = False)
    gene_cl_frac = pd.melt(gene_cl_frac, id_vars = clusters, var_name = 'gene', value_name = 'gene_frac')
    
    # Pass clusters by keyword; positionally it would be taken as the `log` argument
    gene_cl_mean = marker_mean_df(adata, markers, clusters = clusters)
    gene_cl_mean = gene_cl_mean.reset_index(drop = False)
    gene_cl_mean = pd.melt(gene_cl_mean, id_vars = clusters, var_name = 'gene', value_name = 'gene_mean')

    marker_df = gene_cl_frac.merge(gene_cl_mean, on = [clusters, 'gene'], how = 'left')
    return marker_df
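
As a rough illustration of the reshaping in `tidy_marker_df`, the sketch below melts two small wide tables (one row per cluster, one column per gene) into long format and joins them on cluster and gene. The tables and the `to_long` helper are hypothetical stand-ins for the dotplot outputs.

```python
import pandas as pd

# Hypothetical wide tables: one row per cluster, one column per gene
frac_wide = pd.DataFrame(
    {'CD3E': [0.1, 0.6], 'HBB': [0.0, 0.3]},
    index = pd.Index(['0', '1'], name = 'louvain_2')
)
mean_wide = pd.DataFrame(
    {'CD3E': [0.05, 0.9], 'HBB': [0.0, 0.4]},
    index = pd.Index(['0', '1'], name = 'louvain_2')
)

def to_long(wide, value_name):
    # Move the cluster index into a column, then stack genes into rows
    return pd.melt(
        wide.reset_index(drop = False),
        id_vars = 'louvain_2',
        var_name = 'gene',
        value_name = value_name
    )

tidy = to_long(frac_wide, 'gene_frac').merge(
    to_long(mean_wide, 'gene_mean'),
    on = ['louvain_2', 'gene'],
    how = 'left'
)
print(tidy)
```

The result has one row per cluster and gene pair, with `gene_frac` and `gene_mean` side by side, which is convenient for plotting.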

## Set expression cutoffs

After generating the AIFI_L2 partitioned data, we interactively examined the expression of cell class-specific marker genes to identify suitable cutoffs on either the frequency of gene detection or the mean gene expression. Here, we'll encode these cutoffs in dictionaries so that we can apply them to our datasets.

### Fraction filters

Most filters work well using the frequency of gene detection. The dictionary below defines these cutoffs for each cell type, with this nested structure:  
`dict[cell_type][reason][marker] = cutoff`

When the fraction of cells in a cluster that express a marker is greater than the specified cutoff, the cluster is flagged for removal with the corresponding reason.
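
The rule can be sketched in plain Python as follows. The per-cluster fractions are hypothetical, and `flag_clusters` is an illustrative helper rather than part of this notebook. A cluster keeps the first reason that flags it.

```python
# Hypothetical per-cluster detection fractions (not taken from the dataset)
cluster_frac = {
    '0': {'CD3E': 0.02, 'MS4A1': 0.01},
    '1': {'CD3E': 0.55, 'MS4A1': 0.03},
    '2': {'CD3E': 0.05, 'MS4A1': 0.35},
}

# Filters for one cell type, structured as filters[reason][marker] = cutoff
cell_type_filters = {
    'T cell doublet': {'CD3E': 0.2},
    'B cell doublet': {'MS4A1': 0.2},
}

def flag_clusters(cluster_frac, filters):
    """Map each flagged cluster to the first reason whose cutoff it exceeds."""
    flagged = {}
    for reason, marker_cutoffs in filters.items():
        for marker, cutoff in marker_cutoffs.items():
            for cluster, fracs in cluster_frac.items():
                if fracs.get(marker, 0.0) > cutoff and cluster not in flagged:
                    flagged[cluster] = reason
    return flagged

flagged = flag_clusters(cluster_frac, cell_type_filters)
print(flagged)  # {'1': 'T cell doublet', '2': 'B cell doublet'}
```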

In [9]:
frac_filter_dict = {
    'ASDC' : {
        'T cell doublet':      {'CD3E': 0.4}
    },
    'CD14 monocyte': {
        # Both T cell markers share one key; a repeated key would keep only the last entry
        'T cell doublet':      {'CD3E':  0.1, 'IL7R': 0.2},
        'B cell doublet':      {'MS4A1': 0.2},
        'Platelet doublet':    {'PPBP':  0.4}
    },
    'CD16 monocyte': {
        'T cell doublet':      {'CD3E':  0.2},
        'Erythrocyte doublet': {'HBB':   0.2},
        'B cell doublet':      {'MS4A1': 0.4},
        'Platelet doublet':    {'PPBP':  0.4}
    },
    'CD56bright NK cell': {
        #'T cell doublet':      {'CD3D':  0.4},
        'Myeloid doublet':     {'FCN1':  0.4},
        'Erythrocyte doublet': {'HBB':   0.2},
        'B cell doublet':      {'MS4A1': 0.4},
        'Platelet doublet':    {'PPBP':  0.4}
    },
    'CD56dim NK cell': {
        #'T cell doublet':      {'CD3D':  0.4},
        'Myeloid doublet':     {'FCN1':  0.4},
        'B cell doublet':      {'MS4A1': 0.4},
        'Platelet doublet':    {'PPBP':  0.4}
    },
    'CD8aa': {
        'Myeloid doublet':     {'FCN1':  0.2},
        'Platelet doublet':    {'PPBP':  0.4}
    },
    'cDC1': {
        'T cell doublet':      {'CD3D':  0.2, 'IL7R': 0.2},
        'Erythrocyte doublet': {'HBB':   0.4},
        'B cell doublet':      {'MS4A1': 0.4},
        'Platelet doublet':    {'PPBP':  0.2}
    },
    'cDC2': {
        'T cell doublet':      {'CD3D':  0.2},
        'Erythrocyte doublet': {'HBB':   0.4},
        'Platelet doublet':    {'PPBP':  0.2}
    },
    'DN T cell': {
        'Myeloid doublet':     {'FCN1':  0.2},
        'Erythrocyte doublet': {'HBB':   0.4},
        'Platelet doublet':    {'PPBP':  0.4}
    },
    'Effector B cell': {
        'T cell doublet':      {'CD3D':  0.2, 'IL7R': 0.2},
        'Myeloid doublet':     {'FCN1':  0.2},
        'Erythrocyte doublet': {'HBB':   0.4},
        'Platelet doublet':    {'PPBP':  0.4}
    },
    'Erythrocyte': {
        'Myeloid doublet':     {'FCN1':  0.2},
        'B cell doublet':      {'MS4A1': 0.4},
        'Platelet doublet':    {'PPBP':  0.4}
    },
    'gdT': {
        'Myeloid doublet':     {'FCN1':  0.2},
        'Erythrocyte doublet': {'HBB':   0.2},
        'B cell doublet':      {'MS4A1': 0.4},
        'Platelet doublet':    {'PPBP':  0.4}
    },
    'ILC': {
        'T cell doublet':      {'CD3D':  0.4},
        'Erythrocyte doublet': {'HBB':   0.4},
        'Platelet doublet':    {'PPBP':  0.2},
        'Myeloid doublet':     {'FCN1':  0.2}
    },
    'Intermediate monocyte': {
        'T cell doublet':      {'CD3D':  0.4},
        'Erythrocyte doublet': {'HBB':   0.4},
        'B cell doublet':      {'MS4A1': 0.4},
        'Platelet doublet':    {'PPBP':  0.2}
    },
    'MAIT': {
        'Myeloid doublet':     {'FCN1':  0.2},
        'Erythrocyte doublet': {'HBB':   0.4},
        'B cell doublet':      {'MS4A1': 0.4},
        'Platelet doublet':    {'PPBP':  0.2}
    },
    'Memory B cell': {
        'T cell doublet':      {'CD3D':  0.4},
        'Myeloid doublet':     {'FCN1':  0.2},
        'Erythrocyte doublet': {'HBB':   0.4},
        'Platelet doublet':    {'PPBP':  0.2}
    },
    'Memory CD4 T cell': {
        'Myeloid doublet':     {'FCN1':  0.2},
        'Erythrocyte doublet': {'HBB':   0.4},
        'B cell doublet':      {'MS4A1': 0.2}
    },
    'Memory CD8 T cell': {
        'Myeloid doublet':     {'FCN1':  0.2},
        'B cell doublet':      {'MS4A1': 0.4},
        'Platelet doublet':    {'PPBP':  0.2}
    },
    'Naive B cell': {
        'T cell doublet':      {'CD3D':  0.4},
        'Erythrocyte doublet': {'HBB':   0.4},
        'Platelet doublet':    {'PPBP':  0.2}        
    },
    'Naive CD4 T cell': {
        'Myeloid doublet':     {'FCN1':  0.2},
        'Erythrocyte doublet': {'HBB':   0.4},
        'Platelet doublet':    {'PPBP':  0.2},
        'B cell doublet':      {'MS4A1': 0.2},
    },
    'Naive CD8 T cell': {
        'Myeloid doublet':     {'FCN1':  0.2},
        'Erythrocyte doublet': {'HBB':   0.4},
        'B cell doublet':      {'MS4A1': 0.2},
        'Platelet doublet':    {'PPBP':  0.2}
    },
    'pDC' : {
        'T cell doublet':      {'CD3D':  0.4},
        'Myeloid doublet':     {'FCN1':  0.2},
        'Erythrocyte doublet': {'HBB':   0.4},
        'B cell doublet':      {'MS4A1': 0.2},
        'Platelet doublet':    {'PPBP':  0.2}
    },
    'Plasma cell': {
        'T cell doublet':      {'CD3D':  0.4},
        'Myeloid doublet':     {'FCN1':  0.2},
        'Erythrocyte doublet': {'HBB':   0.4},
        'B cell doublet':      {'MS4A1': 0.4},
        'Platelet doublet':    {'PPBP':  0.4}     
    },
    'Platelet' : {
        'T cell doublet':      {'CD3D':  0.4},
        'Myeloid doublet':     {'FCN1':  0.2},
        'Erythrocyte doublet': {'HBB':   0.4},
        'B cell doublet':      {'MS4A1': 0.4}        
    },
    'Progenitor cell': {
        'T cell doublet':      {'CD3E':  0.2},
        'Myeloid doublet':     {'FCN1':  0.2},
        'Erythrocyte doublet': {'HBB':   0.2},
        'Platelet doublet':    {'PPBP':  0.2}       
    },
    'Proliferating NK cell': {
        'T cell doublet':      {'CD3D':  0.4},
        'Myeloid doublet':     {'FCN1':  0.2},
        'Erythrocyte doublet': {'HBB':   0.4},
        'Platelet doublet':    {'PPBP':  0.2}       
    },
    'Proliferating T cell': {
        'Myeloid doublet':     {'FCN1':  0.2},
        'B cell doublet':      {'MS4A1': 0.4},
        'Platelet doublet':    {'PPBP':  0.2}       
    },
    'Transitional B cell': {
        'T cell doublet':      {'CD3D':  0.4},
        'Myeloid doublet':     {'FCN1':  0.2},
        'Erythrocyte doublet': {'HBB':   0.4},
        'Platelet doublet':    {'PPBP':  0.2}               
    },
    'Treg': {
        'Myeloid doublet':     {'FCN1':  0.2},
        'Erythrocyte doublet': {'HBB':   0.4},
        'B cell doublet':      {'MS4A1': 0.4}
    }
}
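
One property of Python dict literals matters for this structure: when a key is repeated within the same literal, only the last value is kept and no error is raised. A reason that needs two markers therefore has to list both in a single inner dict. The sketch below shows the behavior, along with a small check for repeated keys built on the standard-library `ast` module. `duplicate_keys` is a hypothetical helper, not part of the original workflow.

```python
import ast

# A repeated key silently keeps only the last value
filters = {
    'T cell doublet': {'CD3E': 0.1},
    'T cell doublet': {'IL7R': 0.2},
}
print(filters)  # {'T cell doublet': {'IL7R': 0.2}}

# Listing both markers under one key keeps both cutoffs
merged = {'T cell doublet': {'CD3E': 0.1, 'IL7R': 0.2}}

def duplicate_keys(source):
    """Return (key, line number) for keys repeated within one dict literal."""
    dups = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Dict):
            seen = set()
            for key in node.keys:
                # Only constant keys (strings, numbers) are compared
                if isinstance(key, ast.Constant):
                    if key.value in seen:
                        dups.append((key.value, key.lineno))
                    seen.add(key.value)
    return dups

source = (
    "{\n"
    "    'T cell doublet': {'CD3E': 0.1},\n"
    "    'T cell doublet': {'IL7R': 0.2},\n"
    "}\n"
)
dups = duplicate_keys(source)
print(dups)  # [('T cell doublet', 3)]
```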

### Mean cutoffs

In some cases, the mean of gene expression rather than the fraction of cells works as a better cutoff for cleanup. We'll specify these cutoffs here in normalized, log-transformed units:

In [10]:
mean_filter_dict = {
    'CD14 monocyte': {
        'Erythrocyte doublet':    {'HBB':  1}
    },
    'CD56dim NK cell': {
        'Erythrocyte doublet':    {'HBB':  0.7}
    },
    'Memory CD8 T cell': {
        'Erythrocyte doublet':    {'HBB':  0.7}
    },
    'Proliferating T cell': {
        'Erythrocyte doublet':    {'HBB':  0.7}
    }
}

## Assemble markers for review plots

In [11]:
all_filter_markers = []
# Collect each marker once, in order of first appearance across both dictionaries
for filter_dict in (frac_filter_dict, mean_filter_dict):
    for cell_type, filters in filter_dict.items():
        for reason, marker_cutoffs in filters.items():
            for gene in marker_cutoffs.keys():
                if gene not in all_filter_markers:
                    all_filter_markers.append(gene)

In [12]:
all_filter_markers.sort()
all_filter_markers

['CD3D', 'CD3E', 'FCN1', 'HBB', 'IL7R', 'MS4A1', 'PPBP']
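
For comparison, the same collect-and-sort step can be written as a single set comprehension. This is a minimal sketch on two small example dictionaries that follow the nested structure above; `collect_markers` is a hypothetical helper name.

```python
# Small example dictionaries using the [cell_type][reason][marker] structure
frac_filters = {
    'ASDC': {'T cell doublet': {'CD3E': 0.4}},
    'cDC2': {'T cell doublet': {'CD3D': 0.2}, 'Platelet doublet': {'PPBP': 0.2}},
}
mean_filters = {
    'CD14 monocyte': {'Erythrocyte doublet': {'HBB': 1}},
}

def collect_markers(*filter_dicts):
    """Return the sorted, de-duplicated marker genes used in any filter."""
    return sorted({
        gene
        for filter_dict in filter_dicts
        for reasons in filter_dict.values()
        for marker_cutoffs in reasons.values()
        for gene in marker_cutoffs
    })

markers = collect_markers(frac_filters, mean_filters)
print(markers)  # ['CD3D', 'CD3E', 'HBB', 'PPBP']
```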

Additional markers for review

In [13]:
additional_markers = ['NCAM1','ISG15', 'MKI67','CD4','CD8A']
all_filter_markers = all_filter_markers + additional_markers

## Set Gene Detection cutoffs

We'll use these cutoffs to filter clusters with low gene detection. If more than the `min_gene_frac` proportion of cells in a cluster have fewer than `min_n_genes` genes detected, we'll flag that cluster for removal.

Because Erythrocytes and Platelets normally have low gene detection, we'll add exceptions for these two classes for this filter.

In [14]:
min_n_genes = 750
min_gene_frac = 0.3
min_gene_exceptions = ['Erythrocyte', 'Platelet']
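
As a minimal sketch of this rule, the example below applies the same thresholds to a toy metadata table. The `n_genes` values are hypothetical, and `low_detection_clusters` is an illustrative helper that checks the exception list against a cell type label. Taking the mean of a boolean column per cluster is equivalent to dividing the count of low-detection cells by the cluster size.

```python
import pandas as pd

min_n_genes = 750
min_gene_frac = 0.3
min_gene_exceptions = ['Erythrocyte', 'Platelet']

# Toy cell metadata: 2 clusters of 4 cells each (hypothetical values)
obs = pd.DataFrame({
    'louvain_2': ['0', '0', '0', '0', '1', '1', '1', '1'],
    'n_genes':   [1200, 900, 400, 1500, 300, 500, 980, 600],
})

def low_detection_clusters(obs, cell_type):
    """Return clusters where too many cells fall below min_n_genes."""
    if cell_type in min_gene_exceptions:
        return []
    is_low = obs['n_genes'] < min_n_genes
    # Mean of a boolean column = fraction of low-detection cells per cluster
    frac_low = is_low.groupby(obs['louvain_2']).mean()
    return frac_low.index[frac_low > min_gene_frac].tolist()

low_b = low_detection_clusters(obs, 'Naive B cell')
low_platelet = low_detection_clusters(obs, 'Platelet')
print(low_b)         # ['1']
print(low_platelet)  # []
```

Cluster 0 has 1 of 4 cells below the cutoff (0.25) and is kept, while cluster 1 has 3 of 4 (0.75) and is flagged.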

## Identify files in HISE

In [15]:
h5ad_uuids = {
    'ASDC': '8fd1b5aa-2589-45d7-8954-c5623c5b75b4',
    'BR1_Female_Negative_CD14-monocyte': '4ea6f421-c4a7-457b-a296-b0b35f136742',
    'BR1_Female_Negative_CD56dim-NK-cell': '54a90a65-dc08-453c-916e-298872131055',
    'BR1_Female_Negative_Memory-CD4-T-cell': 'afdf70a7-66d8-4339-89e1-9978188bcdda',
    'BR1_Female_Negative_Memory-CD8-T-cell': '566f4ca5-bd44-4647-bb21-77713c169069',
    #'BR1_Female_Negative_Naive-CD4-T-cell': '95056ef6-a9ef-4b70-834a-a39ed13a5589', # Original
    'BR1_Female_Negative_Naive-CD4-T-cell': 'eff06635-b63c-4ebc-9928-99dda5fb5082',  # Patch File
    'BR1_Female_Positive_CD14-monocyte': 'fa343b54-4b45-49d2-a86f-6ee1d3fc3c9a',
    'BR1_Female_Positive_CD56dim-NK-cell': '330d1831-d684-4bc7-8394-6b97150dea75',
    'BR1_Female_Positive_Memory-CD4-T-cell': '52749427-354f-4fdf-b866-66044dbaa626',
    'BR1_Female_Positive_Memory-CD8-T-cell': 'b7d09799-68bf-4730-951b-bba8ee36b05d',
    'BR1_Female_Positive_Naive-CD4-T-cell': '3251d7f1-731c-4a89-bb18-7ab173d5f44a',
    'BR1_Male_Negative_CD14-monocyte': 'f355429f-cc54-4616-9c4d-e32b8009d422',
    'BR1_Male_Negative_CD56dim-NK-cell': '376f94eb-27f0-4826-8097-6f1e3c834999',
    'BR1_Male_Negative_Memory-CD4-T-cell': 'bd6c716c-e67c-45f8-815b-0b397c56c92c',
    'BR1_Male_Negative_Memory-CD8-T-cell': '018352bb-a48c-46bd-a586-5f3b6322382e',
    'BR1_Male_Negative_Naive-CD4-T-cell': '0aa350a4-27c2-4e0d-b74a-04d1c1418d4a',
    'BR1_Male_Positive_CD14-monocyte': '75f54b8a-0a9f-4c05-a40a-5dd822cc8aa0',
    'BR1_Male_Positive_CD56dim-NK-cell': 'c07364b5-7d15-4afa-b59c-d9e3f080c535',
    'BR1_Male_Positive_Memory-CD4-T-cell': '5e147121-ab97-4c5a-b121-7b86e90faea8',
    'BR1_Male_Positive_Memory-CD8-T-cell': '927177ec-3aaa-45fb-b27a-3b4cdbd17a70',
    'BR1_Male_Positive_Naive-CD4-T-cell': 'be404ee1-3631-4bdf-b7a0-8d63da190578',
    'BR2_Female_Negative_CD14-monocyte': '5d396391-c4eb-45e7-9822-5cc3cedd248b',
    'BR2_Female_Negative_CD56dim-NK-cell': '3b335b5d-090e-4f1e-bc4a-882b64e0e00b',
    'BR2_Female_Negative_Memory-CD4-T-cell': '67f5b530-2fc3-4c83-ad8b-9aefffa29195',
    'BR2_Female_Negative_Memory-CD8-T-cell': 'f5593e37-60bd-4455-9e1d-4f1d29229c90',
    'BR2_Female_Negative_Naive-CD4-T-cell': 'bcea6a2d-ba04-4d87-a4d9-580a9b7e4995',
    'BR2_Female_Positive_CD14-monocyte': 'da32cd9a-c137-4381-93cc-6c8ba03822fc',
    'BR2_Female_Positive_CD56dim-NK-cell': 'd9ae7270-464e-446c-a7a1-f88daa77b1a2',
    'BR2_Female_Positive_Memory-CD4-T-cell': '58438fb2-917d-4ae3-9590-a096e4f15033',
    'BR2_Female_Positive_Memory-CD8-T-cell': '80859337-9223-4684-947e-7f3c9eb64604',
    'BR2_Female_Positive_Naive-CD4-T-cell': '45c15444-60fc-49c2-99b1-092b48973bb9',
    'BR2_Male_Negative_CD14-monocyte': '8e0a61d6-76ff-4f90-bd0a-bc2be32aa0d6',
    'BR2_Male_Negative_CD56dim-NK-cell': '8258fe57-f80e-4bd7-8ec9-adc1ea82abc9',
    'BR2_Male_Negative_Memory-CD4-T-cell': 'a43d103b-a696-4164-8553-ae1e29255553',
    'BR2_Male_Negative_Memory-CD8-T-cell': '86ce5862-b825-465b-8586-b9f5a7b278bc',
    'BR2_Male_Negative_Naive-CD4-T-cell': 'aad9f3e1-5628-4c89-97ec-2191b9db00f6',
    'BR2_Male_Positive_CD14-monocyte': 'e8cf92b8-4815-46fe-a24a-2245587fb0e4',
    #'BR2_Male_Positive_CD56dim-NK-cell': '4e1be519-59cb-4190-94ba-08712a4f9bf3', # Original
    'BR2_Male_Positive_CD56dim-NK-cell': '51439761-ba8d-4a37-a6c7-667f1e78bca0',  # Patch File
    'BR2_Male_Positive_Memory-CD4-T-cell': '5b120e94-20ca-4649-8c1e-93c1aef6bb7f',
    'BR2_Male_Positive_Memory-CD8-T-cell': 'e29692f6-44cc-446d-a1a8-c928d5bff028',
    'BR2_Male_Positive_Naive-CD4-T-cell': 'b905fb05-ced2-489e-b61e-0448cfd21085',
    'CD16-monocyte': '0cbd533a-31f7-4d99-80f6-242f7d98af85',
    'CD56bright-NK-cell': '6da0bd94-d6bb-40bc-8224-b76a0fb318a8',
    'CD8aa': '9c2ab478-5f7a-44c2-bab4-f9dfb34d42ce',
    'cDC1': '68a84b62-e410-4f65-a5d6-3915baddbd73',
    'cDC2': '74e6b709-446b-4201-a576-53250dcfd716',
    'DN-T-cell': 'ffd978d0-9c4d-4172-96c2-93038b64ed8c',
    'Effector-B-cell': '26d2c731-5c8a-4a4e-8aae-f3b76633c86d',
    'Erythrocyte': '23770252-a469-4931-bcee-53f13f428ecc',
    'gdT': 'f377ce2a-aaaf-46dd-8a94-eaa5d8bc3208',
    'ILC': '07ca6282-edcf-40db-85b7-8cd9ec9cccae',
    'Intermediate-monocyte': '4e5429c5-0178-4028-b201-5ce9b57ddd32',
    'MAIT': 'b6d07a85-abd5-4d06-93ff-f8c7be76c5b5',
    'Memory-B-cell': '6c4b97d3-359c-4974-ba6b-17c0c20cf7a6',
    'Naive-B-cell': 'e5192938-535e-4462-a1e4-c3dbe1a1523d',
    'Naive-CD8-T-cell': '95955d10-a21a-4317-9552-c21558838834',
    'pDC': '95e6d042-6a53-4188-adc2-f5869d560a94',
    'Plasma-cell': '70adf533-b865-4dc8-a9a5-6a7a8426e20f',
    'Platelet': '32ca7b61-8ed3-43d2-9994-8ce56125db9e',
    'Progenitor-cell': '59b46361-281a-45e2-83c7-ab1cb25ef9b9',
    'Proliferating-NK-cell': '58fa6e53-ef38-42b8-98ad-5de9a31d1e5f',
    'Proliferating-T-cell': '9d86a60b-ce6f-4514-885d-c5436d79d72f',
    'Transitional-B-cell': 'ee3092b7-96af-47db-9888-a926fd6ec85b',
    'Treg': '8170a4e6-62a1-4264-9b16-419a2c96a1ee'
}

## Apply filters to datasets

In [16]:
out_files = []
for group_name, uuid in h5ad_uuids.items():
    print(group_name)
    out_file = 'output/diha_{g}_filtered_{d}.h5ad'.format(g = group_name, d = date.today())
    if os.path.isfile(out_file):
        print('Previously filtered {g}; Skipping.'.format(g = group_name))
        out_files.append(out_file)
    else:        
        adata = read_adata_uuid(uuid)
    
        cell_type = adata.obs['AIFI_L2'].iloc[0]

        # Track filter results
        filter_list = []
        filter_cl_list = []
        
        # Filter for low gene detection
        if not group_name in min_gene_exceptions:
            filter_cl = select_clusters_by_low_gene_frac(
                adata,
                n_cutoff = min_n_genes, 
                frac_cutoff = min_gene_frac, 
                clusters = 'louvain_2'
            )
            reason = 'Low gene detection'
            check_cl = []
            for cl in filter_cl:
                if not cl in filter_cl_list:
                    check_cl.append(cl)
                    filter_cl_list.append(cl)
            
            filter_df = pd.DataFrame({'louvain_2': check_cl, 'remove_reason': [reason]*len(check_cl)})
            filter_list.append(filter_df)
        
        # Filter by Fractional gene expression
        marker_filters = frac_filter_dict[cell_type]
    
        for reason, filter in marker_filters.items():
            for marker, cutoff in filter.items():
                filter_cl = select_clusters_above_gene_frac(
                    adata,
                    gene = marker,
                    cutoff = cutoff,
                    clusters = 'louvain_2'
                )

                check_cl = []
                for cl in filter_cl:
                    if not cl in filter_cl_list:
                        check_cl.append(cl)
                        filter_cl_list.append(cl)
                
                filter_df = pd.DataFrame({'louvain_2': check_cl, 'remove_reason': [reason]*len(check_cl)})
                filter_list.append(filter_df)
    
        # Filter by Mean gene expression
        if cell_type in mean_filter_dict.keys():
            marker_filters = mean_filter_dict[cell_type]
        
            for reason, filter in marker_filters.items():
                for marker, cutoff in filter.items():
                    filter_cl = select_clusters_above_gene_mean(
                        adata,
                        gene = marker,
                        cutoff = cutoff,
                        clusters = 'louvain_2'
                    )
    
                    check_cl = []
                    for cl in filter_cl:
                        if not cl in filter_cl_list:
                            check_cl.append(cl)
                            filter_cl_list.append(cl)
                    
                    filter_df = pd.DataFrame({'louvain_2': check_cl, 'remove_reason': [reason]*len(check_cl)})
                    filter_list.append(filter_df)

        # Assemble all filtering results
        filter_df = pd.concat(filter_list)

        # Save filtered clusters for review
        rev_file = '{r}/diha_qc_{g}_filter_df_{d}.csv'.format(
            r = review_dir,
            g = group_name,
            d = date.today()
        )
        filter_df.to_csv(rev_file)
    
        # Add filters to cells
        obs = adata.obs.copy()
        obs = obs.merge(filter_df, on = 'louvain_2', how = 'left')
        obs['remove_reason'] = obs['remove_reason'].fillna('Not removed')
        
        # Save observations and UMAP coordinates for review
        review_obs = obs
        umap_mat = adata.obsm['X_umap']
        umap_df = pd.DataFrame(umap_mat, columns = ['umap_1', 'umap_2'])
        review_obs['umap_1'] = umap_df['umap_1']
        review_obs['umap_2'] = umap_df['umap_2']
        
        rev_file = '{r}/diha_qc_{g}_obs_df_{d}.csv'.format(
            r = review_dir,
            g = group_name,
            d = date.today()
        )
        review_obs.to_csv(rev_file)
        
        # Save expression of marker features for review
        marker_df = tidy_marker_df(adata, all_filter_markers, clusters = 'louvain_2')

        rev_file = '{r}/diha_qc_{g}_marker_df_{d}.csv'.format(
            r = review_dir,
            g = group_name,
            d = date.today()
        )
        marker_df.to_csv(rev_file)
    
        # Apply filters to data
        print(adata.shape)
        keep_cells = obs['remove_reason'] == 'Not removed'
        adata = adata[keep_cells]
        print(adata.shape)
        
        # Save filtered data
        adata.write_h5ad(out_file)
    
        out_files.append(out_file)

        # Clean up cache so we don't run out of disk space
        h5ad_path = cache_uuid_path(uuid)
        os.remove(h5ad_path)
        cache_path = '/home/jupyter/cache/{u}'.format(u = uuid)
        os.rmdir(cache_path)

ASDC
downloading fileID: 8fd1b5aa-2589-45d7-8954-c5623c5b75b4
Files have been successfully downloaded!
(4033, 1811)
(3699, 1811)
BR1_Female_Negative_CD14-monocyte
downloading fileID: 4ea6f421-c4a7-457b-a296-b0b35f136742
Files have been successfully downloaded!
(405572, 774)
(369773, 774)
BR1_Female_Negative_CD56dim-NK-cell
downloading fileID: 54a90a65-dc08-453c-916e-298872131055
Files have been successfully downloaded!
(199167, 899)
(187379, 899)
BR1_Female_Negative_Memory-CD4-T-cell
downloading fileID: afdf70a7-66d8-4339-89e1-9978188bcdda
Files have been successfully downloaded!
(446391, 1192)
(442373, 1192)
BR1_Female_Negative_Memory-CD8-T-cell
downloading fileID: 566f4ca5-bd44-4647-bb21-77713c169069
Files have been successfully downloaded!
(163963, 1017)
(158252, 1017)
BR1_Female_Negative_Naive-CD4-T-cell
downloading fileID: eff06635-b63c-4ebc-9928-99dda5fb5082
Files have been successfully downloaded!
(659799, 502)
(627635, 502)
BR1_Female_Positive_CD14-monocyte
downloading fileID: 
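
The cell-labeling step in the loop above relies on two pandas behaviors that are easy to overlook: a left merge on a column keeps the row order of the left table, and it returns a fresh integer index rather than the original barcodes. The sketch below illustrates both on toy data with hypothetical barcodes and then builds a positional boolean mask, which is how the filtered cells are selected.

```python
import pandas as pd

# Toy cell metadata indexed by barcode (hypothetical values)
obs = pd.DataFrame(
    {'louvain_2': ['0', '1', '1', '2']},
    index = ['cell_a', 'cell_b', 'cell_c', 'cell_d']
)

# One cluster flagged for removal
filter_df = pd.DataFrame({
    'louvain_2': ['1'],
    'remove_reason': ['T cell doublet'],
})

labeled = obs.merge(filter_df, on = 'louvain_2', how = 'left')
labeled['remove_reason'] = labeled['remove_reason'].fillna('Not removed')

# The merged table has a 0..n-1 index; the barcode index is not carried over
print(labeled.index.tolist())  # [0, 1, 2, 3]

# Row order matches the input, so the mask can be applied by position
keep_cells = (labeled['remove_reason'] == 'Not removed').to_numpy()
kept = obs[keep_cells]
print(kept.index.tolist())  # ['cell_a', 'cell_d']
```

The positional mask is only valid while each cluster appears at most once in `filter_df`; duplicated clusters would duplicate rows in the merge.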

## Bundle Review data

We saved review data, including cell metadata and UMAP coordinates, filtered clusters, and marker gene expression, to enable us to assemble figures to double-check our filtering process.

To help with file transfer, we'll use `tarfile` to bundle our review files.

In [17]:
review_files = os.listdir(review_dir)
review_files = ['{p}/{f}'.format(p = review_dir, f = fn) for fn in review_files]

review_tar = 'output/diha_qc_AIFI_L2_filter_review_{d}.tar.gz'.format(d = date.today())
# The context manager closes the archive even if adding a file fails
with tarfile.open(review_tar, 'w:gz') as tar:
    for review_file in review_files:
        tar.add(review_file)
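
Before transferring a bundle, it can be worth confirming what it contains. The sketch below is a self-contained example that writes two placeholder CSV files to a temporary directory, bundles them, and lists the archive members. The file names are hypothetical, and `with` blocks are used so each archive handle is closed automatically.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small review directory with placeholder files
    review_dir = os.path.join(tmp_dir, 'review')
    os.makedirs(review_dir)
    for name in ['a_filter_df.csv', 'a_obs_df.csv']:
        with open(os.path.join(review_dir, name), 'w') as f:
            f.write('louvain_2,remove_reason\n')

    # Bundle the files into a gzip-compressed tar archive
    tar_path = os.path.join(tmp_dir, 'review.tar.gz')
    with tarfile.open(tar_path, 'w:gz') as tar:
        for name in sorted(os.listdir(review_dir)):
            tar.add(os.path.join(review_dir, name), arcname = 'review/' + name)

    # Re-open the archive and list its members
    with tarfile.open(tar_path, 'r:gz') as tar:
        members = tar.getnames()

print(members)  # ['review/a_filter_df.csv', 'review/a_obs_df.csv']
```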

## Upload Cell Type data to HISE

Finally, we'll use `hisepy.upload.upload_files()` to send a copy of our output to HISE to use for downstream analysis steps.

In [18]:
study_space_uuid = 'de025812-5e73-4b3c-9c3b-6d0eac412f2a'
title = 'DIHA PBMC AIFI_L2 Filter Cleanup .h5ad {d}'.format(d = date.today())

In [19]:
in_files = list(h5ad_uuids.values())
in_files

['8fd1b5aa-2589-45d7-8954-c5623c5b75b4',
 '4ea6f421-c4a7-457b-a296-b0b35f136742',
 '54a90a65-dc08-453c-916e-298872131055',
 'afdf70a7-66d8-4339-89e1-9978188bcdda',
 '566f4ca5-bd44-4647-bb21-77713c169069',
 'eff06635-b63c-4ebc-9928-99dda5fb5082',
 'fa343b54-4b45-49d2-a86f-6ee1d3fc3c9a',
 '330d1831-d684-4bc7-8394-6b97150dea75',
 '52749427-354f-4fdf-b866-66044dbaa626',
 'b7d09799-68bf-4730-951b-bba8ee36b05d',
 '3251d7f1-731c-4a89-bb18-7ab173d5f44a',
 'f355429f-cc54-4616-9c4d-e32b8009d422',
 '376f94eb-27f0-4826-8097-6f1e3c834999',
 'bd6c716c-e67c-45f8-815b-0b397c56c92c',
 '018352bb-a48c-46bd-a586-5f3b6322382e',
 '0aa350a4-27c2-4e0d-b74a-04d1c1418d4a',
 '75f54b8a-0a9f-4c05-a40a-5dd822cc8aa0',
 'c07364b5-7d15-4afa-b59c-d9e3f080c535',
 '5e147121-ab97-4c5a-b121-7b86e90faea8',
 '927177ec-3aaa-45fb-b27a-3b4cdbd17a70',
 'be404ee1-3631-4bdf-b7a0-8d63da190578',
 '5d396391-c4eb-45e7-9822-5cc3cedd248b',
 '3b335b5d-090e-4f1e-bc4a-882b64e0e00b',
 '67f5b530-2fc3-4c83-ad8b-9aefffa29195',
 'f5593e37-60bd-

In [20]:
out_files = out_files + [review_tar]

In [21]:
out_files

['output/diha_ASDC_filtered_2024-03-17.h5ad',
 'output/diha_BR1_Female_Negative_CD14-monocyte_filtered_2024-03-17.h5ad',
 'output/diha_BR1_Female_Negative_CD56dim-NK-cell_filtered_2024-03-17.h5ad',
 'output/diha_BR1_Female_Negative_Memory-CD4-T-cell_filtered_2024-03-17.h5ad',
 'output/diha_BR1_Female_Negative_Memory-CD8-T-cell_filtered_2024-03-17.h5ad',
 'output/diha_BR1_Female_Negative_Naive-CD4-T-cell_filtered_2024-03-17.h5ad',
 'output/diha_BR1_Female_Positive_CD14-monocyte_filtered_2024-03-17.h5ad',
 'output/diha_BR1_Female_Positive_CD56dim-NK-cell_filtered_2024-03-17.h5ad',
 'output/diha_BR1_Female_Positive_Memory-CD4-T-cell_filtered_2024-03-17.h5ad',
 'output/diha_BR1_Female_Positive_Memory-CD8-T-cell_filtered_2024-03-17.h5ad',
 'output/diha_BR1_Female_Positive_Naive-CD4-T-cell_filtered_2024-03-17.h5ad',
 'output/diha_BR1_Male_Negative_CD14-monocyte_filtered_2024-03-17.h5ad',
 'output/diha_BR1_Male_Negative_CD56dim-NK-cell_filtered_2024-03-17.h5ad',
 'output/diha_BR1_Male_Negativ

In [23]:
hisepy.upload.upload_files(
    files = out_files,
    study_space_id = study_space_uuid,
    title = title,
    input_file_ids = in_files
)

output/diha_ASDC_filtered_2024-03-17.h5ad
output/diha_BR1_Female_Negative_CD14-monocyte_filtered_2024-03-17.h5ad
output/diha_BR1_Female_Negative_CD56dim-NK-cell_filtered_2024-03-17.h5ad
output/diha_BR1_Female_Negative_Memory-CD4-T-cell_filtered_2024-03-17.h5ad
output/diha_BR1_Female_Negative_Memory-CD8-T-cell_filtered_2024-03-17.h5ad
output/diha_BR1_Female_Negative_Naive-CD4-T-cell_filtered_2024-03-17.h5ad
output/diha_BR1_Female_Positive_CD14-monocyte_filtered_2024-03-17.h5ad
output/diha_BR1_Female_Positive_CD56dim-NK-cell_filtered_2024-03-17.h5ad
output/diha_BR1_Female_Positive_Memory-CD4-T-cell_filtered_2024-03-17.h5ad
output/diha_BR1_Female_Positive_Memory-CD8-T-cell_filtered_2024-03-17.h5ad
output/diha_BR1_Female_Positive_Naive-CD4-T-cell_filtered_2024-03-17.h5ad
output/diha_BR1_Male_Negative_CD14-monocyte_filtered_2024-03-17.h5ad
output/diha_BR1_Male_Negative_CD56dim-NK-cell_filtered_2024-03-17.h5ad
output/diha_BR1_Male_Negative_Memory-CD4-T-cell_filtered_2024-03-17.h5ad
output/di

(y/n) y


{'trace_id': '7c91500a-644c-4592-9d96-e33957858f63',
 'files': ['output/diha_ASDC_filtered_2024-03-17.h5ad',
  'output/diha_BR1_Female_Negative_CD14-monocyte_filtered_2024-03-17.h5ad',
  'output/diha_BR1_Female_Negative_CD56dim-NK-cell_filtered_2024-03-17.h5ad',
  'output/diha_BR1_Female_Negative_Memory-CD4-T-cell_filtered_2024-03-17.h5ad',
  'output/diha_BR1_Female_Negative_Memory-CD8-T-cell_filtered_2024-03-17.h5ad',
  'output/diha_BR1_Female_Negative_Naive-CD4-T-cell_filtered_2024-03-17.h5ad',
  'output/diha_BR1_Female_Positive_CD14-monocyte_filtered_2024-03-17.h5ad',
  'output/diha_BR1_Female_Positive_CD56dim-NK-cell_filtered_2024-03-17.h5ad',
  'output/diha_BR1_Female_Positive_Memory-CD4-T-cell_filtered_2024-03-17.h5ad',
  'output/diha_BR1_Female_Positive_Memory-CD8-T-cell_filtered_2024-03-17.h5ad',
  'output/diha_BR1_Female_Positive_Naive-CD4-T-cell_filtered_2024-03-17.h5ad',
  'output/diha_BR1_Male_Negative_CD14-monocyte_filtered_2024-03-17.h5ad',
  'output/diha_BR1_Male_Negativ

In [24]:
import session_info
session_info.show()