# Detect doublets using Scrublet

Our cell hashing process catches and removes many doublets formed by cells from different samples, but a fraction of doublets (~7-8%) arise from collisions between cells from the same sample and go undetected by hashing.

To detect and remove these, we'll utilize the Scrublet package. Scrublet's process for doublet identification is described in this publication:

Wolock, S. L., Lopez, R. & Klein, A. M. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Syst 8, 281–291.e9 (2019)

We'll run Scrublet through its integration in the scanpy package's [scanpy.external tools](https://scanpy.readthedocs.io/en/stable/generated/scanpy.external.pp.scrublet.html#scanpy.external.pp.scrublet).

Here, we apply scrublet to each of our .h5 files, and store the results in HISE for downstream analysis steps.

In [2]:
import anndata
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
from datetime import date
import h5py
import hisepy
import numpy as np
import os
import pandas as pd
import re
import scanpy as sc
import scanpy.external as sce
import scipy.sparse as scs

## Read sample metadata from HISE

In [39]:
sample_meta_file_uuid = '223b4aa9-19fc-41e1-8bea-43682e5ac278'
sample_meta_file_name = 'ref_h5_meta_data_2024-02-08.csv'
file_query = hisepy.reader.download_files(
    {sample_meta_file_uuid: sample_meta_file_name}
)

In [45]:
file_query[0].__dict__

{'id': UUID('223b4aa9-19fc-41e1-8bea-43682e5ac278'),
 'status': False,
 'message': 'File was not found in ledger',
 'descriptors': None,
 'path': None,
 'filetype': None,
 'data_values': None}
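
The response above reports `status: False`, so it can help to stop as soon as a download fails rather than discovering the problem at upload time. Below is a minimal sketch of such a check. `FileResult` is a stand-in that mirrors only the fields printed above, not the actual hisepy class, and `check_downloads` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for the objects returned by the download call; it mirrors only
# the fields printed above and is not the real hisepy class.
@dataclass
class FileResult:
    id: str
    status: bool
    message: Optional[str] = None
    path: Optional[str] = None

def check_downloads(results):
    """Raise if any result reports a failed download (hypothetical helper)."""
    failed = [r for r in results if not r.status]
    if failed:
        details = "; ".join(f"{r.id}: {r.message}" for r in failed)
        raise RuntimeError(f"Download failed for {len(failed)} file(s): {details}")
    return results

# Example mirroring the failed response shown above
example = [FileResult(
    id='223b4aa9-19fc-41e1-8bea-43682e5ac278',
    status=False,
    message='File was not found in ledger',
)]
try:
    check_downloads(example)
except RuntimeError as err:
    error_text = str(err)
    print(error_text)
```

In the notebook, the same check could be applied to `file_query`, assuming its elements expose `status` and `message` as attributes, as the `__dict__` output suggests.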

In [37]:
sample_meta_file = 'cache/downloadable/' + sample_meta_file_name
meta_data = pd.read_csv(sample_meta_file)

## Helper functions

These functions retrieve data from HISE, read it into an AnnData object, and apply scrublet to each file.

In [5]:
# define a function to read count data
def read_mat(h5_con):
    mat = scs.csc_matrix(
        (h5_con['matrix']['data'][:], # Count values
         h5_con['matrix']['indices'][:], # Row indices
         h5_con['matrix']['indptr'][:]), # Pointers for column positions
        shape = tuple(h5_con['matrix']['shape'][:]) # Matrix dimensions
    )
    return mat

# define a function to read observation metadata (i.e. cell metadata)
def read_obs(h5con):
    bc = h5con['matrix']['barcodes'][:]
    bc = [x.decode('UTF-8') for x in bc]

    # Initialize the DataFrame with cell barcodes
    obs_df = pd.DataFrame({ 'barcodes' : bc })

    # Get the list of available metadata columns
    obs_columns = h5con['matrix']['observations'].keys()

    # For each column
    for col in obs_columns:
        # Read the values
        values = h5con['matrix']['observations'][col][:]
        # Check for byte storage
        if(isinstance(values[0], (bytes, bytearray))):
            # Decode byte strings
            values = [x.decode('UTF-8') for x in values]
        # Add column to the DataFrame
        obs_df[col] = values

    obs_df = obs_df.set_index('barcodes', drop = False)
    
    return obs_df

# define a function to construct anndata object from a h5 file
def read_h5_anndata(h5_con):
    #h5_con = h5py.File(h5_file, mode = 'r')
    # extract the expression matrix
    mat = read_mat(h5_con)
    # extract gene names
    genes = h5_con['matrix']['features']['name'][:]
    genes = [x.decode('UTF-8') for x in genes]
    # extract metadata
    obs_df = read_obs(h5_con)
    # construct anndata
    adata = anndata.AnnData(mat.T,
                             obs = obs_df)
    # assign gene names to the variables (columns of adata)
    adata.var_names = genes

    adata.var_names_make_unique()
    return adata
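
To illustrate how `read_mat()` assembles a matrix from the three arrays stored in the .h5 file, here is a small sketch on toy data. The values are made up for illustration; only the `scipy.sparse.csc_matrix` constructor form matches the function above.

```python
import numpy as np
import scipy.sparse as scs

# Toy genes x cells matrix (3 genes, 2 cells) in CSC form.
# Column 0 (cell 0) has counts 5 and 2 in rows 0 and 2;
# column 1 (cell 1) has count 7 in row 1.
data = np.array([5, 2, 7])
indices = np.array([0, 2, 1])   # row (gene) index of each stored value
indptr = np.array([0, 2, 3])    # column j spans data[indptr[j]:indptr[j + 1]]

toy = scs.csc_matrix((data, indices, indptr), shape=(3, 2))
dense = toy.toarray()

# Transposing gives the cells x genes layout passed to AnnData
cells_by_genes = toy.T.toarray()
print(dense)
print(cells_by_genes)
```

The transpose mirrors the `mat.T` call in `read_h5_anndata()`, which puts cells in rows so that they line up with the rows of `obs_df`.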

In [6]:
def get_adata(uuid):
    # Load the file using HISE
    res = hisepy.reader.read_files([uuid])

    # If there's an error, read_files returns a list instead of a dictionary.
    # We should raise an exception with the message when this happens.
    if(isinstance(res, list)):
        error_message = res[0]['message']
        raise Exception(error_message)
    
    # Read the file to adata
    h5_con = res['values'][0]
    adata = read_h5_anndata(h5_con)
    
    # Clean up the file now that we're done with it
    h5_file = h5_con.filename
    h5_con.close()
    os.remove(h5_file)

    return adata

In [6]:
def process_file(file_uuid):
    adata = get_adata(file_uuid)
    sc.external.pp.scrublet(adata, verbose = False)
    result = adata.obs[['barcodes','predicted_doublet','doublet_score']]
    return result

## Apply to each sample in parallel

Here, we'll use `concurrent.futures` to apply the function above to our samples in parallel.
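
`executor.map()` yields results in input order and re-raises a worker's exception when that result is reached, which ends the loop. As an alternative sketch, `submit()` with `as_completed()` can record failures per file and keep going. `fake_process` and the IDs below are hypothetical stand-ins for `process_file()` and the real file UUIDs.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_process(file_id):
    # Stand-in for process_file(); fails for one hypothetical ID
    if file_id == 'bad-id':
        raise ValueError(f'could not read {file_id}')
    return f'result for {file_id}'

file_ids = ['id-1', 'bad-id', 'id-2']   # hypothetical IDs
succeeded = {}
failed = {}

with ThreadPoolExecutor(max_workers=2) as executor:
    # Map each future back to the ID it was submitted with
    futures = {executor.submit(fake_process, fid): fid for fid in file_ids}
    for future in as_completed(futures):
        fid = futures[future]
        try:
            succeeded[fid] = future.result()
        except Exception as err:
            failed[fid] = str(err)

# Restore the input order, since as_completed yields in completion order
ordered = [succeeded[fid] for fid in file_ids if fid in succeeded]
print(ordered)
print(failed)
```

Whether to continue past a failed file or stop immediately is a design choice; the cell below takes the simpler `map()` route.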

In [8]:
results = []
file_uuids = meta_data['file.id'].tolist()
with ThreadPoolExecutor(max_workers=20) as executor:  
    for result in executor.map(process_file, file_uuids):
        results.append(result)

Automatically set threshold at doublet score = 0.35
Detected doublet rate = 1.7%
Estimated detectable doublet fraction = 29.1%
Overall doublet rate:
	Expected   = 5.0%
	Estimated  = 6.0%
Automatically set threshold at doublet score = 0.68
Detected doublet rate = 0.1%
Estimated detectable doublet fraction = 3.7%
Overall doublet rate:
	Expected   = 5.0%
	Estimated  = 3.6%
Automatically set threshold at doublet score = 0.36
Detected doublet rate = 1.2%
Estimated detectable doublet fraction = 36.3%
Overall doublet rate:
	Expected   = 5.0%
	Estimated  = 3.3%
Automatically set threshold at doublet score = 0.72
Detected doublet rate = 0.0%
Estimated detectable doublet fraction = 0.3%
Overall doublet rate:
	Expected   = 5.0%
	Estimated  = 3.2%
Automatically set threshold at doublet score = 0.38
Detected doublet rate = 1.1%
Estimated detectable doublet fraction = 27.6%
Overall doublet rate:
	Expected   = 5.0%
	Estimated  = 3.9%
Automatically set threshold at doublet score = 0.70
Detected double

In [9]:
final_result = pd.concat(results, ignore_index=True)
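
Concatenating with `ignore_index=True` leaves no record of which file each row came from. If that link is needed downstream, one possible sketch is to tag each per-file table before concatenating. The data and file IDs below are hypothetical, and the `file.id` column name is borrowed from the metadata table as an assumption.

```python
import pandas as pd

# Hypothetical per-file results with the same columns as above
file_ids = ['file-a', 'file-b']
per_file = [
    pd.DataFrame({'barcodes': ['bc1', 'bc2'],
                  'predicted_doublet': [False, True],
                  'doublet_score': [0.05, 0.61]}),
    pd.DataFrame({'barcodes': ['bc3'],
                  'predicted_doublet': [False],
                  'doublet_score': [0.10]}),
]

# Add the source file ID as a column, then stack the tables
combined = pd.concat(
    [df.assign(**{'file.id': fid}) for fid, df in zip(file_ids, per_file)],
    ignore_index=True,
)

# The mean of a boolean column is the fraction of True values
per_file_rate = combined.groupby('file.id')['predicted_doublet'].mean()
print(combined)
print(per_file_rate)
```

With the file ID attached, per-file doublet rates can be compared against the per-file thresholds that scrublet reports.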

In [10]:
prediction_counts = final_result['predicted_doublet'].value_counts()
prediction_counts

predicted_doublet
False    2066388
True       27399
Name: count, dtype: int64

In [None]:
prediction_counts.loc[True] / prediction_counts.sum()

In [12]:
27399/(2066388+27399)

0.013085858303638336

In [8]:
out_dir = 'output'
os.makedirs(out_dir, exist_ok=True)

In [10]:
out_file = 'output/ref_scrublet_results_{d}.csv'.format(d = date.today())
final_result.to_csv(out_file)

## Upload assembled data to HISE

Finally, we'll use `hisepy.upload.upload_files()` to send a copy of our output to HISE to use for downstream analysis steps.

In [29]:
study_space_uuid = '64097865-486d-43b3-8f94-74994e0a72e0'
title = 'Reference Scrublet results'

In [34]:
in_files = [sample_meta_file_uuid] + meta_data['file.id'].to_list()

In [None]:
out_files = [out_file]

In [38]:
hisepy.upload.upload_files(
    files = out_files,
    study_space_id = study_space_uuid,
    title = title,
    input_file_ids = in_files
)

AssertionError: The following file Ids were not downloaded in this IDE. You cannot reference a file in a result without downloading it first. ['223b4aa9-19fc-41e1-8bea-43682e5ac278']

In [None]:
import session_info
session_info.show()