### Introduction:

In this notebook, I analyze the output of Cell Ranger run on the TSP1-15 FASTQ files. Each Cell Ranger output contains a raw matrix and a filtered matrix. According to Cell Ranger, with cellranger 6.0.1, subtracting the filtered matrix from the raw matrix leaves the droplets that cellranger identifies as "empty".

To identify putative contaminating species, I first classify droplets as empty or cell-containing based on whether the droplet contains at least n genes (n = 100 in the code below). Then I compare the relative frequency of each species across empty and cell-containing droplets. The logic is that contamination should be equally prevalent across those two classes of droplets. If a species is strongly enriched in the cell-containing droplets (i.e. >=100x more prevalent), I consider it real signal associated with those cells; otherwise I treat it as a putative contaminant.

I run this process on a per-sample basis, so a species flagged as a contaminant in one sample may not be flagged in another. Making the call per sample rather than globally avoids discarding real signal from samples where the species is genuinely cell-associated.

- **c_dts**: number of cell-containing droplets, c, per donor, d, and tissue, t, that contain a given species, s
- **e_dts**: number of empty droplets, e, per donor and tissue that contain a given species
- **C_dt**: total number of cell-containing droplets per donor and tissue
- **E_dt**: total number of empty droplets per donor and tissue
- **R_dts** = (c_dts/C_dt) / (e_dts/E_dt)
- I select species that are at least 100x more prevalent in cell-containing droplets, R_dts ≥ 100
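
As a worked illustration of the ratio defined above, here is a minimal sketch on made-up counts. The function name `enrichment_ratio`, its arguments, and the toy numbers are assumptions for illustration only; the cap of 1000 for species that never appear in empty droplets mirrors the replacement of infinite values used later in this notebook.

```python
import math


def enrichment_ratio(c_dts, e_dts, C_dt, E_dt, inf_value=1000.0):
    """Return R_dts = (c_dts / C_dt) / (e_dts / E_dt) for one species in one sample."""
    if C_dt <= 0 or E_dt <= 0:
        raise ValueError("droplet totals must be positive")
    if e_dts == 0:
        # Species never seen in empty droplets: the ratio is infinite (capped),
        # or undefined when it was not seen in cells either.
        return inf_value if c_dts > 0 else math.nan
    return (c_dts / C_dt) / (e_dts / E_dt)


# Hypothetical counts: 50 of 5,000 cell-containing droplets and
# 10 of 500,000 empty droplets contain the species.
r = enrichment_ratio(c_dts=50, e_dts=10, C_dt=5_000, E_dt=500_000)
print(round(r, 6))
```

With these toy numbers the species is about 500x more prevalent in cell-containing droplets, so it would pass the R_dts ≥ 100 cutoff.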

In addition to this decontamination step, I will implement several others. I will also do an analysis with pre-filtered results to see what fraction of sequences have remained at the end of all QC and decontamination steps. 


### Loading libraries

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from matplotlib_venn import venn3
from matplotlib_venn import venn2

import os
import glob
import re
import itertools
from collections import Counter

import numpy as np
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import seaborn as sns
cmap = sns.cm.rocket_r
sns.set_style("white")

import anndata
print('anndata version:', anndata.__version__)
from anndata import read_h5ad
from anndata import AnnData

import scanpy as sc
sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80)  # low dpi (dots per inch) yields small inline figures
sc.logging.print_version_and_date()


### Loading directories

In [3]:
#tsp1and2
mainDir = '/oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/'
mainDir2 ='/oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/'
mainDir3 = '/oak/stanford/groups/quake/gita/raw/tab2_20200508/'
mainDir4= '/oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/'
mainDir8 = '/oak/stanford/groups/quake/gita/raw/tab1_20200407/controlAnalysis/'
mainDir9 = '/oak/stanford/groups/quake/gita/raw/organ_20191025/controlAnalysis/'

#tsp3-16 data
cranger_dir='/oak/stanford/groups/quake/gita/raw/tab3-14_20210420/all/cranger3/'  
mainDir10 = '/oak/stanford/groups/quake/gita/raw/tab3-14_20210420/all/'


### Single-cell decontamination for donors TSP3-15

Through the SIMBA pipeline, I ran UMI-tools and then BLAST on the raw FASTQ files. Separately, I ran the `cellranger count` command on the same raw FASTQ files to get the raw feature-barcode matrix.


In [4]:
#blast_df
all_hits_meta=pd.read_csv(mainDir10 + 'alldons_blast_with_cell_annotations_qc_filtered.csv')

#cleaning up the blast df: some TSP15 samples were named incorrectly (because of the '-'), so I'm correcting their file names
all_hits_meta['sample'] = all_hits_meta['sample'].replace('TSP15_Eye_Neuroretina', 'TSP15_Eye_Neuroretina-etc_10X_3_1_S17')
all_hits_meta['sample'] = all_hits_meta['sample'].replace('TSP15_Eye_Sclera', 'TSP15_Eye_Sclera-etc_10X_2_1_S16')

#this column will match the file names processed by cell ranger (based off of raw folder names, rather than TS object)
all_hits_meta['sample_match_cranger'] = all_hits_meta['sample'].apply(lambda x: x.split('_L00')[0])
all_hits_meta.drop(columns=['cell_bc_x', 'cell_bc_y'], inplace=True)

#removing uninformative columns
all_hits_meta = all_hits_meta.drop(columns=['Annotation', 'seqrun','Manually Annotated'])



I create a column called `cell_bc_all` that holds a cell barcode for every hit, whether it appears in an annotated or an un-annotated cell.
`10X_barcode` contains the TS object barcodes and `cell_bc` contains the 10x barcodes from the BLAST dataframe; the new column takes `cell_bc` where available and falls back to `10X_barcode` otherwise.
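
The fallback behaviour described above can be checked on a toy table. This is a small sketch with made-up barcodes; only the column names come from this notebook.

```python
import numpy as np
import pandas as pd

# Made-up barcodes: some hits carry only one of the two barcode columns.
hits = pd.DataFrame({
    "cell_bc": ["AAAC", np.nan, "GGGT", np.nan],
    "10X_barcode": [np.nan, "CCCA", "GGGT", np.nan],
})

# Take cell_bc where present, otherwise fall back to 10X_barcode.
hits["cell_bc_all"] = hits["cell_bc"].fillna(hits["10X_barcode"])
print(hits)
```

Rows where both columns are missing stay missing in `cell_bc_all`, so they drop out of any later inner merge on that column.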

In [269]:
all_hits_meta['cell_bc_all'] = all_hits_meta['cell_bc'].fillna(all_hits_meta['10X_barcode']) 


In [178]:
n = 100 #minimum number of genes

df = all_hits_meta
all_ratios=pd.DataFrame({})

#running this on a per sample basis
for sample in df['sample_match_cranger'].unique():
    if not sample.startswith('TSP16'): #don't need donor 16 data because it doesn't have annotations

        blast_cells=df[df['sample_match_cranger']==sample] #getting the blast dataframe for each sample
        print(sample)
        #reading the raw feature bc matrix 
        raw = sc.read_10x_h5(cranger_dir + sample + '/outs/raw_feature_bc_matrix.h5')
        raw.var_names_make_unique()


        #getting n_counts and n_genes for the raw matrix
        raw.obs['n_counts'] = raw.X.sum(axis=1).A1
        sc.pp.filter_cells(raw, min_genes=0)
        raw.obs = raw.obs.reset_index().rename(columns={'index':'cell_bc_all'})
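        #removing only the trailing -1 suffix to match the object and dataframe formats of cell barcodes
        #(str.strip('-1') would strip any leading/trailing '-' or '1' characters, not just the suffix)
        raw.obs['cell_bc_all'] = raw.obs['cell_bc_all'].str.replace(r'-1$', '', regex=True)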

        #making classifications of empty and non-empty cells based on how many genes they contain
        num_all_empty = raw[raw.obs['n_genes']<n].shape[0]
        num_all_cells =  raw[raw.obs['n_genes']>=n].shape[0]

        #for each species in a given sample the following procedure is run
        ratio=[]
        for sp in blast_cells['species'].unique():
            #this is the blast dataframe for each species in each sample
            blast_cells_sp = blast_cells[blast_cells['species']==sp] 

            #what fraction of blast_cells for a given species are showing up in empty droplets versus cells
            total=raw.obs.merge(blast_cells_sp, on='cell_bc_all', how='inner')

            #identifying whether they occurred in cells versus in empty droplets
            put_cell=total[total['n_genes_x']>=n]
            put_empty=total[total['n_genes_x']<n]

            ratio_dict = {'species':sp, 'num_cells': put_cell['cell_bc_all'].nunique(), 
                  'num_empty': put_empty['cell_bc_all'].nunique()}
            ratio.append(ratio_dict)


        #creating a ratio dataframe
        ratio_df=pd.DataFrame(ratio)
        ratio_df['sample']=[sample]*ratio_df.shape[0]
        ratio_df['ratio']=ratio_df['num_cells']/ratio_df['num_empty'] #number of hits found in cells versus empty droplets
        ratio_df['adj_num_cells'] = ratio_df['num_cells']/num_all_cells #number of cells divided by total number of cells 
        ratio_df['adj_num_empty'] = ratio_df['num_empty']/num_all_empty #number of empty droplets divided by total number of empty droplets
        ratio_df['adj_ratio'] = ratio_df['adj_num_cells']/ratio_df['adj_num_empty']

        #replacing infinity values (division by 0, because those species had 0 occurrences in empty droplets) by 1000
        #the threshold for enrichment is 100, so 1000 lets them pass since they don't appear in empty droplets
        #assigning the result back instead of using inplace=True on a column selection, which may not update ratio_df
        ratio_df['ratio'] = ratio_df['ratio'].replace([np.inf, -np.inf], 1000)
        ratio_df['adj_ratio'] = ratio_df['adj_ratio'].replace([np.inf, -np.inf], 1000)

        all_ratios = pd.concat([all_ratios, ratio_df])


all_ratios.to_csv(mainDir10 + 'don3-15_n100_m100_cell_to_empty_species_ratios.csv', index=False)

TSP3_Eye4_062620_S4
reading /oak/stanford/groups/quake/gita/raw/tab3-14_20210420/all/cranger3/TSP3_Eye4_062620_S4/outs/raw_feature_bc_matrix.h5


Variable names are not unique. To make them unique, call `.var_names_make_unique`.


 (0:00:02)


Variable names are not unique. To make them unique, call `.var_names_make_unique`.


TSP14_Blood_NA_10X_1_1_5Prime_S1 (0:00:02)
TSP14_Spleen_NA_10X_1_1_5Prime_S5 (0:00:01)
TSP14_Blood_NA_10X_1_2_S5 (0:00:02)
TSP3_Eye_062620_S1 (0:00:01)
TSP6_Liver_NA_10X_1_1_S8 (0:00:04)
TSP6_Liver_NA_10X_1_2_S9 (0:00:04)
TSP6_Trachea_NA_10X_1_1_S10 (0:00:03)
TSP6_Trachea_NA_10X_2_1_S12 (0:00:04)
TSP7_Blood_NA_10X_1_1_S6 (0:00:06)
TSP7_LymphNodes_Inguinal_10X_1_1_S2 (0:00:04)
TSP7_LymphNodes_Supradiaphagmatic_10X_1_1_S3 (0:00:03)
TSP7_SalivaryGland_Parotid_10X_1_1_S14 (0:00:06)
TSP7_SalivaryGland_Parotid_10X_1_2_S15 (0:00:06)
TSP7_Spleen_NA_10X_1_1_S4 (0:00:04)
TSP7_Spleen_NA_10X_2_1_S5 (0:00:02)
TSP7_Tongue_Anterior_10X_1_2_S1 (0:00:03)
TSP4_Mammary1_062920_S7 (0:00:04)
TSP4_Mammary2_062920_S8 (0:00:05)
TSP4_Myometrium_062920_S10 (0:00:04)
TSP5_Eye2_062920_S6 (0:00:02)
TSP8_Blood_NA_10X_1_1_S1 (0:00:02)
TSP3_Eye3_062620_S3 (0:00:05)
TSP5_Eye1_062920_S5 (0:00:03)
TSP10_Blood_NA_10X_1_1_Enriched_S4 (0:00:02)
TSP10_Blood_NA_10X_1_1_Whole_S3 (0:00:01)
TSP10_FAT_MAT_10X_1_1_S8 (0:00:06)
TSP10_FAT_SCAT_10X_1_1_S9 (0:00:04)
TSP10_Skin_NA_10X_1_1_S5 (0:00:02)
TSP10_Skin_NA_10X_1_2_S6 (0:00:02)
TSP11_BoneMarrow_NA_10X_1_1_LinDepleted_S10 (0:00:01)
TSP11_BoneMarrow_NA_10X_1_1_LinEnriched_S11 (0:00:02)
TSP12_Heart_Atria_10X_1_1_S13 (0:00:06)
TSP12_Heart_Ventricle_10X_1_1_S12 (0:00:04)
TSP9_Pancreas_exocrine_10X_1_1_CellCountLive_S2 (0:00:05)
TSP9_Pancreas_exocrine_10X_1_1_CellCountTotal_S1 (0:00:02)
TSP14_Blood_NA_10X_2_1_5Prime_S2 (0:00:01)
TSP14_BoneMarrow_NA_10X_1_1_5Prime_S9 (0:00:01)
TSP14_BoneMarrow_NA_10X_2_1_5Prime_S10 (0:00:01)
TSP14_Liver_NA_10X_1_1_S11 (0:00:04)
TSP14_Liver_NA_10X_2_1_S12 (0:00:04)
TSP14_LymphNode_NA_10X_1_1_5Prime_S3 (0:00:04)
TSP14_LymphNode_NA_10X_2_1_5Prime_S4 (0:00:03)
TSP14_SalivaryGland_Parotid_10X_1_1_S14 (0:00:07)
TSP14_SalivaryGland_Submandibular_10X_1_1_S13 (0:00:07)
TSP14_Spleen_NA_10X_2_1_5Prime_S6 (0:00:02)
TSP14_Thymus_NA_10X_1_1_5Prime_S7 (0:00:01)
TSP14_Thymus_NA_10X_2_1_5Prime_S8 (0:00:02)
TSP14_LI_Distal_10X_1_1_S10 (0:00:03)
TSP14_Skin_Abdomen_10X_1_1_S17 (0:00:01)
TSP14_Skin_Chest_10X_1_1_S18 (0:00:01)
TSP14_Muscle_Abdomen_10X_1_1_S4 (0:00:03)
TSP14_Muscle_Diaphragm_10X_1_1_S2 (0:00:02)
TSP14_Spleen_NA_10X_1_1_S10 (0:00:03)
TSP6_Trachea_NA_10X_1_2_S11 (0:00:03)
TSP3_Eye2_062620_S2 (0:00:01)
TSP7_Blood_NA_10X_2_1_S7 (0:00:05)
TSP14_Muscle_Diaphragm_10X_1_2_S3 (0:00:01)
TSP7_Tongue_Posterior_10X_1_1_S13 (0:00:08)
TSP14_Bladder_NA_10X_1_2_S12 (0:00:02)
TSP14_Tongue_Posterior_10X_1_1_S13 (0:00:05)
TSP15_Eye_Sclera-etc_10X_2_1_S16 (0:00:02)
TSP15_Eye_Neuroretina-etc_10X_3_1_S17 (0:00:02)
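
The per-species counting done inside the loop above can also be sketched in a vectorized form on toy data. This is an illustrative sketch, not the code that produced the saved CSV: the helper name `species_ratios` and the toy `obs` and `hits` tables are assumptions, and unlike the loop it omits species whose barcodes never match a droplet in the raw matrix.

```python
import numpy as np
import pandas as pd


def species_ratios(obs, hits, n=100, inf_value=1000):
    """obs: one row per droplet ('cell_bc_all', 'n_genes').
    hits: one row per BLAST hit ('cell_bc_all', 'species')."""
    is_cell = obs["n_genes"] >= n
    num_all_cells = int(is_cell.sum())
    num_all_empty = int((~is_cell).sum())

    merged = hits[["cell_bc_all", "species"]].merge(
        obs[["cell_bc_all", "n_genes"]], on="cell_bc_all", how="inner"
    )
    merged["droplet_class"] = np.where(merged["n_genes"] >= n, "cell", "empty")

    # Count unique barcodes per species and droplet class.
    counts = (
        merged.drop_duplicates(["species", "cell_bc_all"])
        .groupby(["species", "droplet_class"])
        .size()
        .unstack(fill_value=0)
        .reindex(columns=["cell", "empty"], fill_value=0)
    )
    out = pd.DataFrame({
        "species": counts.index,
        "num_cells": counts["cell"].to_numpy(),
        "num_empty": counts["empty"].to_numpy(),
    })
    out["adj_num_cells"] = out["num_cells"] / num_all_cells
    out["adj_num_empty"] = out["num_empty"] / num_all_empty
    # Species absent from empty droplets give inf; cap so they pass the cutoff.
    out["adj_ratio"] = (out["adj_num_cells"] / out["adj_num_empty"]).replace(np.inf, inf_value)
    return out


# Toy droplets: b1, b2, b6 are cells (>= 100 genes); b3, b4, b5 are empty.
obs = pd.DataFrame({
    "cell_bc_all": ["b1", "b2", "b3", "b4", "b5", "b6"],
    "n_genes": [500, 300, 5, 2, 0, 150],
})
hits = pd.DataFrame({
    "cell_bc_all": ["b1", "b1", "b2", "b3", "b6", "b4", "b5"],
    "species": ["spA", "spA", "spA", "spA", "spB", "spC", "spC"],
})
print(species_ratios(obs, hits, n=100))
```

In this toy example `spB` appears only in a cell-containing droplet and is capped at 1000, while `spC` appears only in empty droplets and gets a ratio of 0.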


Next, I merge the ratios dataframe with the BLAST dataframe, which already has cell annotations.

In [270]:
#changing column name to match all_ratios df to blast df. 
all_ratios = all_ratios.rename(columns={'sample':'sample_match_cranger'})

In [271]:
all_hits_meta_ratios = all_hits_meta.merge(all_ratios, on=['species', 'sample_match_cranger'], how='outer')
print('surviving fraction of hits:', all_hits_meta_ratios[all_hits_meta_ratios['adj_ratio']>=100].shape[0]/all_hits_meta_ratios.shape[0])

#adding an extra column so I can separate the donor 1 and 2 dataset from the rest
all_hits_meta_ratios['dataset'] = ['tsp3_on'] *all_hits_meta_ratios.shape[0]


surviving fraction of hits: 0.17041960498230255


Saving this dataframe for donors 3-16, which now includes the R_dts values that can be used to decontaminate the dataset.
Note that it contains the ratios (R_dts) but has not yet been filtered on them.
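
For reference, a filter on R_dts could look like the following sketch. The helper name `keep_enriched` and the toy table are assumptions for illustration; the column name `adj_ratio` and the cutoff of 100 are the ones used in this notebook.

```python
import numpy as np
import pandas as pd


def keep_enriched(df, ratio_col="adj_ratio", threshold=100):
    """Keep hits whose species passes the enrichment cutoff in its sample.
    Rows with a missing ratio compare as False and are dropped."""
    return df[df[ratio_col] >= threshold].copy()


# Toy hits table with made-up species and ratios.
toy = pd.DataFrame({
    "species": ["spA", "spB", "spC", "spD"],
    "sample_match_cranger": ["s1", "s1", "s2", "s2"],
    "adj_ratio": [1000.0, 3.5, 100.0, np.nan],
})
kept = keep_enriched(toy)
print(kept["species"].tolist())
print("surviving fraction of hits:", len(kept) / len(toy))
```

Because missing ratios are dropped by the comparison, hits from samples that were skipped when computing ratios would not survive this filter.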

In [272]:
all_hits_meta_ratios.to_csv(mainDir + 'don3_to_don16_10x_blastn_nt_withRatios_10_14_2021.csv', index=False)

### Single-cell decontamination for donors TSP1-2


In [None]:
#reading the blast dataframe
fin_fil = pd.read_csv(mainDir + 'don1_don2_10x_blastn_nt_naPhylaNotFiltered_03_12_2021.csv')

fin_fil['cell_bc'] = fin_fil['cell'].apply(lambda x: x+'-1')
fin_fil['sample_new'] = fin_fil['sample'].apply(lambda x: x.split('_L0')[0])
fin_fil['donor_new'] = fin_fil['sample'].apply(lambda x: x.split('_')[0])
fin_fil['tissue_new'] = fin_fil['sample'].apply(lambda x: x.split('_')[1]).apply(lambda x: x.lower())
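
The name parsing above can be checked on a couple of sample names. This sketch uses the vectorized `.str` accessor instead of `apply`; the two sample names are hypothetical and only follow the general `donor_tissue_..._L00x` pattern assumed by the code.

```python
import pandas as pd

# Hypothetical sample names, for illustration only.
samples = pd.Series(["TSP2_Lung_10X_1_1_S5_L001", "TSP1_bladder_1_S13_L002"])

parts = samples.str.split("_")
parsed = pd.DataFrame({
    "sample": samples,
    "sample_new": samples.str.split("_L0").str[0],  # drop the lane suffix
    "donor_new": parts.str[0],                      # first field is the donor
    "tissue_new": parts.str[1].str.lower(),         # second field is the tissue
})
print(parsed)
```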

### donor 2
Performing the same operation as above (for donors 3-15), now for donors 1 and 2.

In [49]:
# getting the number of non-host hits found
nonhost_counts=fin_fil.groupby(['cell_bc']).count().cell.to_frame('non-host_counts').reset_index()
fin_fil_withcounts = fin_fil.merge(nonhost_counts, on='cell_bc', how='outer')

fin_fil_don2=fin_fil_withcounts[fin_fil_withcounts['donor_new']=='TSP2']
all_ratios=pd.DataFrame({})

#running this on a per sample basis
for sample in fin_fil_don2['sample_new'].unique():
    blast_cells=fin_fil_don2[fin_fil_don2['sample_new']==sample]
    
    #reading the raw feature bc matrix 
    raw = sc.read_10x_h5(mainDir2 + sample + '/outs/raw_feature_bc_matrix.h5')
    raw.var_names_make_unique()

    #getting n_counts and n_genes for the raw matrix
    raw.obs['n_counts'] = raw.X.sum(axis=1).A1
    sc.pp.filter_cells(raw, min_genes=0)
    raw.obs = raw.obs.reset_index().rename(columns={'index':'cell_bc'})

    #making classifications of empty and non-empty cells based on how many genes they contain
    n = 100
    num_all_empty = raw[raw.obs['n_genes']<n].shape[0]
    num_all_cells =  raw[raw.obs['n_genes']>=n].shape[0]
    
    #for each species in a given sample the following procedure is run
    ratio=[]
    for sp in blast_cells['species'].unique():
        blast_cells_sp = blast_cells[blast_cells['species']==sp]

        #what fraction of blast_cells for a given species are showing up in empty droplets versus cells
        total=raw.obs.merge(blast_cells_sp, on='cell_bc', how='inner')

        put_cell=total[total['n_genes_x']>=n]
        put_empty=total[total['n_genes_x']<n]

        ratio_dict = {'species':sp, 'num_cells': put_cell['cell_bc'].nunique(), 
              'num_empty': put_empty['cell_bc'].nunique()}
        ratio.append(ratio_dict)


    ratio_df=pd.DataFrame(ratio)
    ratio_df['sample']=[sample]*ratio_df.shape[0]
    ratio_df['ratio']=ratio_df['num_cells']/ratio_df['num_empty']
    ratio_df['adj_num_cells'] = ratio_df['num_cells']/num_all_cells
    ratio_df['adj_num_empty'] = ratio_df['num_empty']/num_all_empty
    ratio_df['adj_ratio'] = ratio_df['adj_num_cells']/ratio_df['adj_num_empty']
    #replacing infinity values (division by 0, because those species had 0 occurrences in empty droplets) with 1000
    ratio_df['ratio'] = ratio_df['ratio'].replace([np.inf, -np.inf], 1000)
    ratio_df['adj_ratio'] = ratio_df['adj_ratio'].replace([np.inf, -np.inf], 1000)

    
    all_ratios = pd.concat([all_ratios, ratio_df])
    
#saving the results
all_ratios.to_csv(mainDir + 'don2_n100_cell_to_empty_species_ratios.csv')

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Heart_ventricle_10X_1_1_S14/outs/raw_feature_bc_matrix.h5


Variable names are not unique. To make them unique, call `.var_names_make_unique`.


 (0:00:02)


reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Blood_NA_10X_1_3_S15/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Bladder_NA_10X_1_1_S5/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_SI_distal_10X_1_1_Jejunum_S22/outs/raw_feature_bc_matrix.h5
 (0:00:02)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Vasculature_Aorta_10X_1_1_S26/outs/raw_feature_bc_matrix.h5
 (0:00:02)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Spleen_NA_10X_2_1_S18/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Thymus_NA_10X_1_2_S4/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Trachea_NA_10X_1_2_S23/outs/raw_feature_bc_matrix.h5
 (0:00:02)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Kidney_NA_10X_1_1_S1/outs/raw_feature_bc_matrix.h5
 (0:00:02)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_BM_vertebralbody_10X_2_1_S28/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Spleen_NA_10X_1_1_S17/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_BM_vertebralbody_10X_1_1_S29/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Muscle_rectusabdominus_10X_1_1_S10/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Blood_NA_10X_2_1_S16/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Bladder_NA_10X_1_2_S6/outs/raw_feature_bc_matrix.h5
 (0:00:02)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_LymphNode_NA_10X_2_1_S20/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Lung_proxmedialdistal_10X_1_1_S24/outs/raw_feature_bc_matrix.h5
 (0:00:02)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Muscle_diaphragm_10X_1_2_S13/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Vasculature_Aorta_10X_1_2_S27/outs/raw_feature_bc_matrix.h5
 (0:00:02)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Thymus_NA_10X_1_1_S3/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_LymphNode_NA_10X_1_1_S19/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Muscle_diaphragm_10X_1_1_S12/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Muscle_rectusabdominus_10X_1_2_S11/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_LI_proximal_10X_1_1_Ascending_S8/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_SI_proximal_10X_1_1_Duodenum_S21/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Lung_proxmedialdistal_10X_1_2_S25/outs/raw_feature_bc_matrix.h5
 (0:00:02)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Kidney_NA_10X_1_2_S2/outs/raw_feature_bc_matrix.h5
 (0:00:02)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Thymus_NA_10X_1_4_5prime_S33/outs/raw_feature_bc_matrix.h5
 (0:00:00)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_BM_vertebralbody_10X_1_2_5prime_S35/outs/raw_feature_bc_matrix.h5
 (0:00:00)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Trachea_NA_10X_1_1_S7/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_LI_distal_10X_1_1_Sigmoid_S9/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/TSP2_Blood_NA_10X_1_5_5prime_S31/outs/raw_feature_bc_matrix.h5
 (0:00:01)


### donor 1

In [52]:
nonhost_counts=fin_fil.groupby(['cell_bc']).count().cell.to_frame('non-host_counts').reset_index()
fin_fil_withcounts = fin_fil.merge(nonhost_counts, on='cell_bc', how='outer')

fin_fil_don1=fin_fil_withcounts[fin_fil_withcounts['donor_new']=='TSP1']
all_ratios_don1=pd.DataFrame({})

for sample in fin_fil_don1['sample_new'].unique():
    blast_cells=fin_fil_don1[fin_fil_don1['sample_new']==sample]
    raw = sc.read_10x_h5(mainDir4 + sample + '/outs/raw_feature_bc_matrix.h5')
    raw.var_names_make_unique()

    #getting n_counts and n_genes for the raw matrix
    raw.obs['n_counts'] = raw.X.sum(axis=1).A1
    sc.pp.filter_cells(raw, min_genes=0)
    raw.obs = raw.obs.reset_index().rename(columns={'index':'cell_bc'})

    #total number of empty droplets and cells 
    n = 100
    num_all_empty = raw[raw.obs['n_genes']<n].shape[0]
    num_all_cells =  raw[raw.obs['n_genes']>=n].shape[0]

    ratio=[]
    for sp in blast_cells['species'].unique():
        blast_cells_sp = blast_cells[blast_cells['species']==sp]

        #what fraction of blast_cells for a given species are showing up in empty droplets versus cells

        total=raw.obs.merge(blast_cells_sp, on='cell_bc', how='inner')

        put_cell=total[total['n_genes_x']>=n]
        put_empty=total[total['n_genes_x']<n]

        ratio_dict = {'species':sp, 'num_cells': put_cell['cell_bc'].nunique(), 
              'num_empty': put_empty['cell_bc'].nunique()}
        ratio.append(ratio_dict)


    ratio_df=pd.DataFrame(ratio)
    ratio_df['sample']=[sample]*ratio_df.shape[0]
    ratio_df['ratio']=ratio_df['num_cells']/ratio_df['num_empty']
    ratio_df['adj_num_cells'] = ratio_df['num_cells']/num_all_cells
    ratio_df['adj_num_empty'] = ratio_df['num_empty']/num_all_empty
    ratio_df['adj_ratio'] = ratio_df['adj_num_cells']/ratio_df['adj_num_empty']
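    #replacing infinity values (species with 0 occurrences in empty droplets) with 1000
    ratio_df['ratio'] = ratio_df['ratio'].replace([np.inf, -np.inf], 1000)
    ratio_df['adj_ratio'] = ratio_df['adj_ratio'].replace([np.inf, -np.inf], 1000)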

    
    all_ratios_don1 = pd.concat([all_ratios_don1, ratio_df])

#saving the results
all_ratios_don1.to_csv(mainDir + 'don1_n100_cell_to_empty_species_ratios.csv')

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_bladder_1_S7/outs/raw_feature_bc_matrix.h5


Variable names are not unique. To make them unique, call `.var_names_make_unique`.


 (0:00:02)


reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_exopancreas2_1_S4/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_endopancreas_1_S1/outs/raw_feature_bc_matrix.h5
 (0:00:00)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_bladder_2_S8/outs/raw_feature_bc_matrix.h5
 (0:00:02)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_exopancreas1_3_S21/outs/raw_feature_bc_matrix.h5
 (0:00:00)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_bladder_3_S9/outs/raw_feature_bc_matrix.h5
 (0:00:02)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_blood_3_S18/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_exopancreas1_1_S19/outs/raw_feature_bc_matrix.h5
 (0:00:00)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_endopancreas_3_S3/outs/raw_feature_bc_matrix.h5
 (0:00:02)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_exopancreas2_2_S5/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_lung_3_S12/outs/raw_feature_bc_matrix.h5
 (0:00:03)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_exopancreas2_3_S6/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_muscle_3_S15/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_muscle_1_S13/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_lung_1_S10/outs/raw_feature_bc_matrix.h5
 (0:00:02)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_endopancreas_2_S2/outs/raw_feature_bc_matrix.h5
 (0:00:00)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_blood_2_S17/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_lung_2_S11/outs/raw_feature_bc_matrix.h5
 (0:00:02)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_blood_1_S16/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_muscle_2_S14/outs/raw_feature_bc_matrix.h5
 (0:00:01)

reading /oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/TSP1_exopancreas1_2_S20/outs/raw_feature_bc_matrix.h5
 (0:00:00)


In [273]:
#n=100 (min number of genes)
all_ratios_don1_n100 = pd.read_csv(mainDir + 'don1_n100_cell_to_empty_species_ratios.csv')
all_ratios_don2_n100 = pd.read_csv(mainDir + 'don2_n100_cell_to_empty_species_ratios.csv')
all_ratios_don1_n100['donor_info']=['TSP1']*all_ratios_don1_n100.shape[0]
all_ratios_don2_n100['donor_info']=['TSP2']*all_ratios_don2_n100.shape[0]
all_ratios_don12_n100=pd.concat([all_ratios_don2_n100, all_ratios_don1_n100])
all_ratios_don12_n100.rename(columns={'sample':'sample_new'}, inplace=True)

fin_fil_ratios_n100 = all_ratios_don12_n100.merge(fin_fil, on=['species', 'sample_new'], how='outer')
#species with an adjusted ratio below m (100) will need to be filtered out
m=100

print('fraction of surviving hits: ',fin_fil_ratios_n100[fin_fil_ratios_n100['adj_ratio']>=m].shape[0]/fin_fil_ratios_n100.shape[0])


fraction of surviving hits:  0.08208417499371044


~92% of the original hits will be filtered out by this decontamination pipeline. Hits whose species has an unknown ratio (NaN) do not pass the `>= m` comparison, so they are excluded as well.
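A small sketch of how missing ratios behave in this comparison, on hypothetical values (the toy species and ratios are not from the dataset). It separates the hits that fail the threshold from those that simply have no ratio.

```python
import numpy as np
import pandas as pd

# Hypothetical ratios: one capped value, one low value, one unknown, one exactly at the threshold.
toy = pd.DataFrame({'species': ['sp_a', 'sp_b', 'sp_c', 'sp_d'],
                    'adj_ratio': [1000.0, 2.5, np.nan, 100.0]})
m = 100

passes = toy['adj_ratio'] >= m      # NaN >= m evaluates to False
unknown = toy['adj_ratio'].isna()   # hits with no ratio at all
below = ~passes & ~unknown          # known ratio, but under the threshold

print('surviving:', passes.mean())
print('unknown ratio:', unknown.mean())
print('known but below threshold:', below.mean())
```

Reporting the three fractions separately makes it clear how much of the loss comes from low ratios and how much from barcodes that could not be matched.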

In [275]:
fin_fil_ratios_n100.drop(columns=['Unnamed: 0_x', 'Unnamed: 0_y', 'Unnamed: 0.1',
                         'donor', 'don', 'donor_new', 'tissue', 'blast', 'db', 'source'], inplace=True)


fin_fil_ratios_n100.rename(columns={'donor_info': 'donor', 'tissue_new': 'tissue', 
                                    'sample_new':'sample_match_cranger', 'cell':'cell_bc_all'}, inplace=True)

fin_fil_ratios_n100['cell_bc_umi'] = fin_fil_ratios_n100['cell_bc_all'] + '_' + fin_fil_ratios_n100['umi'] 
fin_fil_ratios_n100['cell_type_tissue'] = fin_fil_ratios_n100['celltype2'] + '_' + fin_fil_ratios_n100['tissue'] 
fin_fil_ratios_n100['tissue_cell_type'] = fin_fil_ratios_n100['tissue'] + '_' + fin_fil_ratios_n100['celltype2'] 

#making sure I can separate out the donor 1 and 2 dataset from the rest by adding an extra column
fin_fil_ratios_n100['dataset'] = 'tsp1_2'


#saving the dataframe for donors 1 and 2
#note this has ratios (R_dts) but hasn't been filtered based on R_dts
fin_fil_ratios_n100.to_csv(mainDir + 'don1_don2_10x_blastn_nt_withRatios_10_14_2021.csv', index=False)

all_dons_processed = all_hits_meta_ratios.merge(fin_fil_ratios_n100, how='outer')
all_dons_processed.to_csv(mainDir + 'all_dons_10x_with_cell_annotations_qc_filtered_withRatios_10_14_2021.csv',
                          index=False)


### Additional decontamination filters, QC, and reformatting before saving results

First, reading in the saved dataframe for all donors (and re-applying the pident and length filters just in case).
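A minimal sketch of this kind of filter on toy data. The rows and the `passed_qc` column are hypothetical; the thresholds of 90 for both `pident` and `length` are the ones used in this notebook. Taking a `.copy()` of the filtered slice is a common way to avoid pandas' `SettingWithCopyWarning` when columns are modified afterwards.

```python
import pandas as pd

# Toy BLAST hits (made-up values).
toy = pd.DataFrame({'pident': [99.1, 85.0, 95.0],
                    'length': [120, 150, 60],
                    'donor': ['TSP3', 'TSP16', 'TSP16']})

min_pident, min_length = 90, 90
mask = (toy['pident'] >= min_pident) & (toy['length'] >= min_length)

# .copy() makes the filtered dataframe independent of the original.
filtered = toy[mask].copy()
filtered['passed_qc'] = True   # hypothetical column, safe to add on the copy

print(filtered)
```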

In [4]:
all_dons_processed = pd.read_csv(mainDir + 'all_dons_10x_with_cell_annotations_qc_filtered_withRatios_10_14_2021.csv')
#.copy() so that later assignments on the filtered dataframe don't trigger a SettingWithCopyWarning
all_dons_processed_fil = all_dons_processed[(all_dons_processed['pident']>=90) & (all_dons_processed['length']>=90)].copy()


In [24]:
#setting the adj ratio of donor 16 to 100 so that these samples pass the adj_ratio filter, because they weren't annotated by cell type.
#They will still go through all other filters
all_dons_processed_fil.loc[all_dons_processed_fil['donor']=='TSP16','adj_ratio']=100

### In addition to the previous filter, I also want to exclude certain genera
This uses a list of contaminating genera commonly found in reagents, based on https://link.springer.com/article/10.1186/s12915-014-0087-z.
This is in addition to my own water-control contamination lists (tissue + SS2 from 100 wells) and other suspected contaminating species.
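The two exclusion steps can be sketched on toy data as follows. The hits, the shortened pattern list, and the one-entry genus list are hypothetical placeholders; the real lists are the ones used in the cell below. The sketch assumes case-insensitive substring matching on `species` and exact matching on `genus`.

```python
import pandas as pd

# Toy hits with missing values in both columns (made-up rows).
hits = pd.DataFrame({
    'species': ['Synthetic construct', 'Helicobacter pylori', 'Thermus aquaticus', None],
    'genus':   [None, 'Helicobacter', 'Thermus', 'Ralstonia'],
})

name_patterns = ['synthetic', 'vector', 'thermus']   # shortened, illustrative pattern list
contaminant_genera = ['Ralstonia']                   # hypothetical stand-in for the genus list

# Substring match on the species name; na=False treats missing names as "no match".
name_flag = hits['species'].str.contains('|'.join(name_patterns), case=False, na=False)
# Exact match on the genus.
genus_flag = hits['genus'].isin(contaminant_genera)

kept = hits[~name_flag & ~genus_flag]
print(kept)
```

Keeping the two flags as separate boolean columns makes it easy to count how many hits each rule removes.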


In [26]:
#from the list of 92 contaminating genera, I left out 'Bacillus','Streptococcus','Acidobacteria','Enterobacter','Pseudomonas','Corynebacterium' because they are part of the human microbiome, and the filter from previous steps already eliminates those genera if they are not cell-associated.

cont_genus=pd.read_csv('/oak/stanford/groups/quake/gita/raw/nb/microbe/paper/contamination_genus_species.csv', delimiter='\t')

#many of these species are likely contaminants or other technical artifacts (such as being part of ingested food) based on their low counts
excluded_species_patterns = ['synthetic', 'vector', 'geobacillus', 'thermo', 'thermus', 'gonorrhoeae', 'brucella', 'vibrio',
                             'Cutibacterium', 'Escherichia', 'Pestivirus', 'Garlic virus', 'Tomato', 'Bosavirus',
                             'Giant panda anellovirus', 'Aroa virus', 'Acyrthosiphon pisum', 'Woodchuck hepatitis virus',
                             'Prunus necrotic ringspot', 'Brome mosaic', 'respirovirus']
all_dons_processed_fil2 = all_dons_processed_fil[~all_dons_processed_fil['species'].str.contains('|'.join(excluded_species_patterns), case=False, na=False)]

#genera 
all_dons_processed_fil3 = all_dons_processed_fil2[~all_dons_processed_fil2['genus'].isin(cont_genus['genus'].tolist())]

print('fraction of surviving hits after these filters:', np.round(all_dons_processed_fil3.shape[0]/all_dons_processed.shape[0], 2))

fraction of surviving hits after these filters: 0.38


### Removing putative contaminants based on adj_ratio
Removing putative contamination based on the adj_ratio column (R_dts). <br>
**21% of hits and 34% of species survived this single-cell decontamination filter**

In [27]:
all_dons_processed_filt=all_dons_processed_fil3[all_dons_processed_fil3['adj_ratio']>=100]

print('fraction of surviving hits after this filter:', 
      np.round(all_dons_processed_filt.shape[0]/all_dons_processed_fil3.shape[0],2))


print('fraction of surviving species after this filter:', 
      np.round(all_dons_processed_filt.species.nunique() / all_dons_processed_fil3.species.nunique(),2))


nulls=all_dons_processed_fil3[all_dons_processed_fil3['adj_ratio'].isnull()]
print('fraction of hits occurring in unknown cell barcodes which are removed:', 
      np.round(nulls.shape[0]/all_dons_processed_fil3.shape[0],2))



fraction of surviving hits after this filter: 0.21
fraction of surviving species after this filter: 0.34
fraction of hits occurring in unknown cell barcodes which are removed: 0.03


### Some additional name clean up 

Going to exclude archaea since we weren't specifically looking for them; only 9 hits mapped to archaea anyway. <br>
Also cleaning up tissue names and converting nulls to "unknown" in the celltype2 column.
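The name clean-up can be sketched with a single mapping, shown here on toy rows. The cell type values are made up; the tissue mapping lists the same renamings that this notebook applies one by one. `Series.replace` with a dictionary swaps whole values only, so a tissue such as `lung` is left untouched.

```python
import numpy as np
import pandas as pd

# Renamings applied to the lower-cased tissue names.
tissue_map = {
    'exopancreas1': 'pancreas',
    'exopancreas2': 'pancreas',
    'endopancreas': 'pancreas',
    'bm': 'bone_marrow',
    'li': 'large_intestine',
    'si': 'small_intestine',
    'lymphnode': 'lymph_node',
}

# Toy rows (made-up values).
toy = pd.DataFrame({'tissue': ['Exopancreas1', 'BM', 'lung', 'LymphNode'],
                    'celltype2': ['acinar cell', np.nan, 'macrophage', np.nan]})

toy['tissue'] = toy['tissue'].str.lower().replace(tissue_map)
toy['celltype2'] = toy['celltype2'].fillna('unknown')
print(toy)
```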

In [29]:
#excluding archaea; .copy() avoids a SettingWithCopyWarning on the assignments below
all_hits = all_dons_processed_filt[all_dons_processed_filt['superkingdom']!='Archaea'].copy()
all_hits['method'] = '10X'

#harmonizing the tissue names
all_hits['tissue'] = all_hits['tissue'].str.lower()
all_hits['tissue'] = all_hits['tissue'].replace({'exopancreas2': 'pancreas', 'exopancreas1': 'pancreas',
                                                 'endopancreas': 'pancreas', 'bm': 'bone_marrow',
                                                 'li': 'large_intestine', 'si': 'small_intestine',
                                                 'lymphnode': 'lymph_node'})

#converting nulls to 'unknown'
all_hits['celltype2'] = all_hits['celltype2'].fillna('unknown')

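Adding a column "cell_ann" to track which cells have annotations based on the TS object.


In [32]:
#note: as written, 'yes' marks cells whose celltype2 is 'unknown' (no TS annotation) and 'no' marks annotated cells
all_hits['cell_ann'] = np.where(all_hits['celltype2']=='unknown', 'yes', 'no')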


Creating a hit-count table for each donor's tissue sample, used to exclude samples with few hits (fewer than 10, the cutoff applied in the filter).
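One way to sketch this count-then-filter step on toy data is with `groupby(...).transform('size')`, which writes the group size onto every row without a separate merge. The rows and the `min_hits` value below are placeholders chosen for illustration, not the values used in the analysis.

```python
import pandas as pd

# Toy hits: 3 in TSP1 bladder, 2 in TSP1 blood, 2 in TSP2 lung.
hits = pd.DataFrame({'donor': ['TSP1'] * 5 + ['TSP2'] * 2,
                     'tissue': ['bladder'] * 3 + ['blood'] * 2 + ['lung'] * 2})

min_hits = 3  # placeholder threshold for this toy example

# Number of hits in each donor-tissue combination, repeated on every row of the group.
hits['tot_hit_ct_per_don_tissue'] = (
    hits.groupby(['donor', 'tissue'])['tissue'].transform('size'))

kept = hits[hits['tot_hit_ct_per_don_tissue'] >= min_hits]
print(kept)
```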

In [33]:
#counting hits per donor and tissue (size() does not depend on which column happens to come first)
per_don_tis_counts = all_hits.groupby(['donor','tissue']).size().reset_index(name='tot_hit_ct_per_don_tissue')
per_don_tis_counts.head(3)

| | donor | tissue | tot_hit_ct_per_don_tissue |
|---|---|---|---|
| 0 | TSP1 | bladder | 107 |
| 1 | TSP1 | blood | 4 |
| 2 | TSP1 | lung | 94 |


In [37]:
#merging it with the all_hits df to then be able to exclude some donor-tissue combinations based on the number of hits
all_hits2 = all_hits.merge(per_don_tis_counts, on=['donor', 'tissue'])

#getting rid of samples with fewer than 10 hits (.copy() so that columns can be added later without a SettingWithCopyWarning)
all_hits3 = all_hits2[all_hits2['tot_hit_ct_per_don_tissue']>=10].copy()
all_hits3.head(3)

| | method | 10X_barcode | n_counts | n_genes | compartment | tissue_cell_type | cell_type_tissue | celltype | free_annotation | manually_annotated | ... | adj_ratio | dataset | seq | duplicates | operation | log_n_counts | log_n_genes | sp_in_celltype | cell_ann | tot_hit_ct_per_don_tissue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10X | | | | unknown | | | unknown | | | ... | 100.0 | tsp3_on | | | | | | | yes | 1946 |
| 1 | 10X | | | | unknown | | | unknown | | | ... | 100.0 | tsp3_on | | | | | | | yes | 1946 |
| 2 | 10X | | | | unknown | | | unknown | | | ... | 100.0 | tsp3_on | | | | | | unknown_only | yes | 1946 |


I want to add several columns containing combined categories of donor, cell type, and tissue information for later plots.

In [39]:
all_hits3['donor_tissue']=all_hits3['donor'] + '_' + all_hits3['tissue']
all_hits3['donor_tissue_celltype']=all_hits3['donor'] + '_' + all_hits3['tissue'] + '_' + all_hits3['celltype2']
all_hits3['donor_celltype']=all_hits3['donor'] + '_' + all_hits3['celltype2']


all_hits3['phylum'] = all_hits3['phylum'].fillna('Unknown')
all_hits3['phylum_species']=all_hits3['phylum'] + '_' + all_hits3['species']

Saving this version of the BLAST dataframe.

In [40]:
all_hits3.to_csv(mainDir + 'all_dons_10x_with_cell_annotations_qc_filtered_withRatios_decont_10_18_2021.csv',
                          index=False)

### Reading the saved blast dataframe for all donors to make some additions
- adding log2_n_genes columns (log2 of number of genes)
- adding a column that shows the count of each species across the whole dataset
- modifying some cell type names to correct for errors
- adding a celltype2_short column to group some similar cell types for some of the plots (i.e. grouping some of the T cells)
- adding a column that combines tissue and celltype information for both celltype2, and celltype2_short
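One detail worth illustrating for the cell type grouping: `Series.str.contains` interprets its pattern as a regular expression by default, so `+` and `.` are not matched literally. The sketch below uses made-up cell type names to compare the default behaviour with `regex=False`.

```python
import pandas as pd

# Hypothetical cell type names, for illustration only.
names = pd.Series(['cd4-positive t c.', 'cd4+ t c.', 'b c.', 'club cell'])

# Default (regex): 'cd4+' means 'cd' followed by one or more '4', so any name containing 'cd4' matches.
regex_cd4 = names.str.contains('cd4+')
# Literal: only names containing the exact text 'cd4+' match.
literal_cd4 = names.str.contains('cd4+', regex=False)

# Default (regex): '.' matches any character, so 'b c.' also matches the 'b ce' in 'club cell'.
regex_b = names.str.contains('b c.')
literal_b = names.str.contains('b c.', regex=False)

print(pd.DataFrame({'name': names, 'regex_cd4': regex_cd4, 'literal_cd4': literal_cd4,
                    'regex_b': regex_b, 'literal_b': literal_b}))
```

Which behaviour is wanted depends on the names in the dataset: the regex form also catches `cd4-positive` names, while the literal form avoids accidental matches on unrelated cell types.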

In [3]:
all_hits = pd.read_csv(mainDir + 'all_dons_10x_with_cell_annotations_qc_filtered_withRatios_decont_10_18_2021.csv')


In [14]:
#counting the occurance of each species and adding that info to a column
all_hits['counts_per_species'] = all_hits.groupby('species')['species'].transform('size')

#counting the occurrences of each species in a donor and adding that to a column
per_don_species_counts=all_hits.groupby(['donor','species'], as_index=False).count().iloc[:,0:3].rename(columns={'method':'count_species_per_donor'})
all_hits = all_hits.merge(per_don_species_counts, on=['donor', 'species'])


all_hits['log2_n_genes'] = np.log2(all_hits['n_genes']) #taking log2 of the gene count (works on annotated cells)
all_hits['celltype2'] = all_hits['celltype2'].str.lower() #lower case cell type names

#fixing a few cell type names
all_hits['celltype2'] = all_hits['celltype2'].apply(lambda x: x.replace(', human', ''))
all_hits['celltype2'] = all_hits['celltype2'].apply(lambda x: x.replace('mem.', 'memory'))
all_hits['celltype2'] = all_hits['celltype2'].apply(lambda x: x.replace('cd8-positive', 'cd8+'))

#creating a new column called celltype2_short that groups similar cell types
all_hits['celltype2_short'] = all_hits['celltype2'] 
all_hits.loc[all_hits['celltype2'].str.contains('cd4+'), 'celltype2_short'] = 'cd4+ t c.'
all_hits.loc[all_hits['celltype2'].str.contains('cd8+'), 'celltype2_short'] = 'cd8+ t c.'
all_hits.loc[all_hits['celltype2'].str.contains('b c.'), 'celltype2_short'] = 'b c.'
all_hits.loc[all_hits['celltype2'].str.contains('endo'), 'celltype2_short'] = 'endo. c.'
all_hits.loc[all_hits['celltype2'].str.contains('natural killer c.'), 'celltype2_short'] = 'nk c.'
all_hits.loc[all_hits['celltype2'].str.contains('mature nk t c.'), 'celltype2_short'] = 't c.'
all_hits.loc[all_hits['celltype2'].str.contains('t follicu'), 'celltype2_short'] = 't c.'
all_hits.loc[all_hits['celltype2'].str.contains('monocyte'), 'celltype2_short'] = 'monocyte'
all_hits.loc[all_hits['celltype2'].str.contains('mesenchymal stem'), 'celltype2_short'] = 'mesenchymal stem c.'
all_hits.loc[all_hits['celltype2'].str.contains('mesenchymal c.'), 'celltype2_short'] = 'mesenchymal c.'

all_hits['tissue_celltype'] = all_hits['tissue'] + '_' + all_hits['celltype2'] #adding a column for later plots
all_hits['tissue_celltype_short'] = all_hits['tissue'] + '_' + all_hits['celltype2_short'] #adding a column for later plots
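
The per-donor species count above goes through `groupby(...).count()` plus a merge, which only works while `method` is the first non-key column. Below is a minimal sketch of an alternative on made-up rows: it reuses `transform('size')`, the same call that builds `counts_per_species`, for the per-donor count as well. The toy dataframe is purely illustrative, and note that `size` counts rows while `count` counts non-null values.

```python
import pandas as pd

# toy stand-in for all_hits (values are made up)
toy = pd.DataFrame({
    'donor':   ['TSP1', 'TSP1', 'TSP1', 'TSP2'],
    'species': ['sp_a', 'sp_a', 'sp_b', 'sp_a'],
})

# number of rows per species across the whole table
toy['counts_per_species'] = toy.groupby('species')['species'].transform('size')
# number of rows per (donor, species) pair, written straight back onto each row
toy['count_species_per_donor'] = toy.groupby(['donor', 'species'])['species'].transform('size')
print(toy)
```

Because `transform` returns a result aligned to the original rows, no merge is needed and the row count cannot change.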


### In addition to those changes, I also want to add the count of cells in the TS dataset for normalization purposes down the line
- will first read the obs layers of the TSP1-2 and TSP3-16 objects
- then concatenate those two dataframes
- perform similar changes to their columns as the BLAST dataframe (see code block above)
- then get the counts of cells grouped by tissue_celltype_short
- merge that information with the blast dataframe (all_hits)
- then save both all_hits and the modified tsp1-16 obj obs layer in case I need it in the future
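
Since the same cell type cleanup is applied to both the BLAST dataframe and the obs layer, it could be wrapped in one function so the two stay in sync. This is only a sketch: `harmonize_celltypes`, `SHORT_LABELS`, `lower_tissue` and the toy rows are hypothetical names and values, while the patterns and labels are the ones used in this notebook's cells.

```python
import pandas as pd

# (regex pattern, grouped label) pairs taken from the notebook's cells;
# they are applied in order, so a later match overwrites an earlier one
SHORT_LABELS = [
    ('cd4+', 'cd4+ t c.'),
    ('cd8+', 'cd8+ t c.'),
    ('b c.', 'b c.'),
    ('endo', 'endo. c.'),
    ('natural killer c.', 'nk c.'),
    ('mature nk t c.', 't c.'),
    ('t follicu', 't c.'),
    ('monocyte', 'monocyte'),
    ('mesenchymal stem', 'mesenchymal stem c.'),
    ('mesenchymal c.', 'mesenchymal c.'),
]

def harmonize_celltypes(df, lower_tissue=False):
    """Hypothetical helper: clean celltype2 and derive the grouped columns."""
    out = df.copy()
    if lower_tissue:
        out['tissue'] = out['tissue'].str.lower()
    out['celltype2'] = (out['celltype2'].str.lower()
                        .str.replace(', human', '', regex=False)
                        .str.replace('mem.', 'memory', regex=False)
                        .str.replace('cd8-positive', 'cd8+', regex=False))
    out['celltype2_short'] = out['celltype2']
    for pattern, label in SHORT_LABELS:
        # str.contains treats the pattern as a regex by default, as in the original cells
        mask = out['celltype2'].str.contains(pattern)
        out.loc[mask, 'celltype2_short'] = label
    out['tissue_celltype'] = out['tissue'] + '_' + out['celltype2']
    out['tissue_celltype_short'] = out['tissue'] + '_' + out['celltype2_short']
    return out

# made-up rows, only to show the call
toy_obs = pd.DataFrame({
    'tissue': ['Blood', 'Blood', 'Lung'],
    'celltype2': ['CD8-positive, alpha-beta T cell', 'Classical Monocyte', 'Fibroblast'],
})
clean = harmonize_celltypes(toy_obs, lower_tissue=True)
print(clean[['celltype2', 'celltype2_short', 'tissue_celltype_short']])
```

Keeping the pattern list in one place matters here because the later merge is keyed on `tissue_celltype_short`; any drift between the two copies of the renaming would silently produce unmatched groups.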

In [10]:
#reading in the obs layers for tsp3-16 and tsp1-2
alldons_10x_don3to16_obs_df = pd.read_csv(mainDir10 + 'alldons_10x_don3to16_obs_df.csv') #reading the TSP3-16 object obs
#reading the TSP1-2 object obs
tsp12_10x_obs = pd.read_csv(mainDir3 +'objects/totalVelo_withCellCycleScores_mitoUnfiltered_retrieved_Dec2020_microbialprocessed_10x_March2021_v2_OBS.csv') 
#concatenating the two to get all donors obs 
alldons_10x_don1to16_obs = pd.concat([alldons_10x_don3to16_obs_df,tsp12_10x_obs])

#making same modification as blast dataframe (see above code block)
alldons_10x_don1to16_obs['celltype2'] = alldons_10x_don1to16_obs['celltype2'].str.lower() #lower case cell type names
alldons_10x_don1to16_obs['tissue'] = alldons_10x_don1to16_obs['tissue'].str.lower() #lower case tissue names

#fixing a few cell type names
alldons_10x_don1to16_obs['celltype2'] = alldons_10x_don1to16_obs['celltype2'].apply(lambda x: x.replace(', human', ''))
alldons_10x_don1to16_obs['celltype2'] = alldons_10x_don1to16_obs['celltype2'].apply(lambda x: x.replace('mem.', 'memory'))
alldons_10x_don1to16_obs['celltype2'] = alldons_10x_don1to16_obs['celltype2'].apply(lambda x: x.replace('cd8-positive', 'cd8+'))

#creating a new column called celltype2_short that groups similar cell types
alldons_10x_don1to16_obs['celltype2_short'] = alldons_10x_don1to16_obs['celltype2'] 
alldons_10x_don1to16_obs.loc[alldons_10x_don1to16_obs['celltype2'].str.contains('cd4+'), 'celltype2_short'] = 'cd4+ t c.'
alldons_10x_don1to16_obs.loc[alldons_10x_don1to16_obs['celltype2'].str.contains('cd8+'), 'celltype2_short'] = 'cd8+ t c.'
alldons_10x_don1to16_obs.loc[alldons_10x_don1to16_obs['celltype2'].str.contains('b c.'), 'celltype2_short'] = 'b c.'
alldons_10x_don1to16_obs.loc[alldons_10x_don1to16_obs['celltype2'].str.contains('endo'), 'celltype2_short'] = 'endo. c.'
alldons_10x_don1to16_obs.loc[alldons_10x_don1to16_obs['celltype2'].str.contains('natural killer c.'), 'celltype2_short'] = 'nk c.'
alldons_10x_don1to16_obs.loc[alldons_10x_don1to16_obs['celltype2'].str.contains('mature nk t c.'), 'celltype2_short'] = 't c.'
alldons_10x_don1to16_obs.loc[alldons_10x_don1to16_obs['celltype2'].str.contains('t follicu'), 'celltype2_short'] = 't c.'
alldons_10x_don1to16_obs.loc[alldons_10x_don1to16_obs['celltype2'].str.contains('monocyte'), 'celltype2_short'] = 'monocyte'
alldons_10x_don1to16_obs.loc[alldons_10x_don1to16_obs['celltype2'].str.contains('mesenchymal stem'), 'celltype2_short'] = 'mesenchymal stem c.'
alldons_10x_don1to16_obs.loc[alldons_10x_don1to16_obs['celltype2'].str.contains('mesenchymal c.'), 'celltype2_short'] = 'mesenchymal c.'

alldons_10x_don1to16_obs['tissue_celltype'] = alldons_10x_don1to16_obs['tissue'] + '_' + alldons_10x_don1to16_obs['celltype2'] #adding a column for later plots
alldons_10x_don1to16_obs['tissue_celltype_short'] = alldons_10x_don1to16_obs['tissue'] + '_' + alldons_10x_don1to16_obs['celltype2_short'] #adding a column for later plots

#getting the counts of cells in the TS object grouped by tissue and celltype_short
all_dons_cell_count_by_tis_celltype = alldons_10x_don1to16_obs.groupby(['tissue_celltype_short'], as_index=False).count().iloc[:,:2].rename(
    columns={'tissue':'obj_cell_ct_by_tis_celltype_short'})

#saving the final obs layer for all donors
alldons_10x_don1to16_obs.to_csv(mainDir10 + 'alldons_10x_don1to16_obs_df.csv', index=False)

In [15]:
#finally adding the cell counts to the blast dataframe for later normalization
all_hits = all_hits.merge(all_dons_cell_count_by_tis_celltype, on ='tissue_celltype_short', how='left')
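
A left merge like the one above can silently leave hits without a cell count (for example when a `tissue_celltype_short` label differs between the two tables). The following sketch, on made-up data, shows one way to check for that using the `validate` and `indicator` options of `DataFrame.merge`; only the column names mirror the notebook.

```python
import pandas as pd

# made-up hits and cell counts; only the column names mirror the notebook
hits = pd.DataFrame({
    'tissue_celltype_short': ['blood_monocyte', 'blood_monocyte', 'lung_fibroblast', 'skin_unknown'],
    'species': ['sp_a', 'sp_b', 'sp_a', 'sp_c'],
})
cell_counts = pd.DataFrame({
    'tissue_celltype_short': ['blood_monocyte', 'lung_fibroblast'],
    'obj_cell_ct_by_tis_celltype_short': [500, 200],
})

# validate raises if the right-hand table has duplicate keys; indicator records the match status
merged = hits.merge(cell_counts, on='tissue_celltype_short', how='left',
                    validate='many_to_one', indicator=True)
assert len(merged) == len(hits)  # a many-to-one left merge must not add rows

unmatched = merged.loc[merged['_merge'] == 'left_only', 'tissue_celltype_short'].unique().tolist()
print('groups without a cell count:', unmatched)
```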


### Saving the final dataframe
all filters (taxonomic, alignment qc, decontamination) have been applied at this point. This dataframe is ready for analysis. 

In [None]:
all_hits.to_csv(mainDir + 'all_dons_10x_with_cell_annotations_qc_filtered_withRatios_decont_v2_10_28_2021.csv',
                          index=False)

### Now just to explore what fraction of hits were filtered out as a result of all various qc and decontamination filters
need the prefiltered datasets for all donors

In [109]:
mainDir = '/oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/'
mainDir2 = mainDir + 'analyze/'
imgDir = mainDir2 + 'images/'

mainDir3 = '/oak/stanford/groups/quake/gita/raw/tab2_20200508/'
mainDir4 = '/oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/'
mainDir5 = '/oak/stanford/groups/quake/gita/raw/organ_20200204/secondAnalysis/'
mainDir6 = '/oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/analyze/'
mainDir7 = '/oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/'
mainDir8 = '/oak/stanford/groups/quake/gita/raw/tab1_20200407/controlAnalysis/'
mainDir9 = '/oak/stanford/groups/quake/gita/raw/organ_20191025/controlAnalysis/'
mainDir10 = '/oak/stanford/groups/quake/gita/raw/tab3-14_20210420/all/'
blastxDir = mainDir + 'inputToBlastx/'
dbDir = '/oak/stanford/groups/quake/gita/raw/database/taxonomyNCBI20200125/'
taxDir = dbDir + 'taxonkit/'

#path to muscle binaries
muscle ='/home/groups/quake/gita/miniconda3/envs/mainEnv2/bin/muscle'

tax = pd.read_csv(dbDir + 'ncbi_lineages_2021-01-26.csv')
#want to take only the following columns from the lineage dataframe tax 
tax_short=tax[['tax_id','superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']]


In [65]:
#donor 1 and 2
prefil_10x_don1and2 = pd.read_csv(mainDir + 'don1_don2_10x_blastn_nt_pre_filteration_withLineage.csv')


donors 3-16 have batched and unbatched files that need to be read, and concatenated (after adding lineage information)

bacteria

In [67]:
#column headers
cols =['seqName', 'refName', 'pathogen', 'bitscore', 'pident', 'evalue', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'length', 'mismatch', 'tax_id']
bact_batch =pd.DataFrame({})

for file in glob.glob(mainDir10 + 'batchedFiles/bacteria/batched/micoNT_blastn/*.tab'):
    #looking at non empty files 
    if os.path.getsize(file)>0:
        #getting the batch number 
        batch = file.split('/')[-1].split('.tab')[0]
        
        df=pd.read_csv(file, delimiter='\t')
        df.columns=cols
        df['sample'] = df.seqName.apply(lambda x: x.split('-')[1])
        df['seqName'] = df.seqName.apply(lambda x: x.split('-')[0])
        df['batch']=df.shape[0]*[batch]
        df['filepath']=df.shape[0]*[file]
        
        df['tax_id']=df['tax_id'].apply(lambda x: str(x).split(';')[0])
        df.tax_id = df.tax_id.astype('int64')
        #adding lineage information
        df=df.merge(tax_short, on='tax_id', how='left')
       
        #appending inside the if-block so an empty file does not re-append the previous df
        bact_batch = pd.concat([bact_batch, df])

viruses

In [68]:
#pulling in data from the batched data files first (viruses) - will redo this step once all BLAST is done

#column headers
cols =['seqName', 'refName', 'pathogen', 'bitscore', 'pident', 'evalue', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'length', 'mismatch', 'tax_id']
virus_batch =pd.DataFrame({})

for file in glob.glob(mainDir10 + 'batchedFiles/viruses/batched/virusNT_blastn/*.tab'):
    #looking at non empty files 
    if os.path.getsize(file)>0:
        #getting the batch number 
        batch = file.split('/')[-1].split('.tab')[0]
        
        df=pd.read_csv(file, delimiter='\t')
        df.columns=cols
        df['sample'] = df.seqName.apply(lambda x: x.split('-')[1])
        df['seqName'] = df.seqName.apply(lambda x: x.split('-')[0])
        df['batch']=df.shape[0]*[batch]
        df['filepath']=df.shape[0]*[file]
        
        df['tax_id']=df['tax_id'].apply(lambda x: str(x).split(';')[0])
        df.tax_id = df.tax_id.astype('int64')
        #adding lineage information
        df=df.merge(tax_short, on='tax_id', how='left')
        #appending inside the if-block so an empty file does not re-append the previous df
        virus_batch = pd.concat([virus_batch, df])

fungi

In [69]:
#pulling in data from the batched data files first (fungi) - will redo this step once all BLAST is done

#column headers
cols =['seqName', 'refName', 'pathogen', 'bitscore', 'pident', 'evalue', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'length', 'mismatch', 'tax_id']
fungi_batch =pd.DataFrame({})

for file in glob.glob(mainDir10 + 'batchedFiles/fungi/batched/fungiNT_blastn/*.tab'):
    #looking at non empty files 
    if os.path.getsize(file)>0:
        #getting the batch number 
        batch = file.split('/')[-1].split('.tab')[0]
        
        df=pd.read_csv(file, delimiter='\t')
        df.columns=cols
        df['sample'] = df.seqName.apply(lambda x: x.split('-')[1])
        df['seqName'] = df.seqName.apply(lambda x: x.split('-')[0])
        df['batch']=df.shape[0]*[batch]
        df['filepath']=df.shape[0]*[file]
        
        df['tax_id']=df['tax_id'].apply(lambda x: str(x).split(';')[0])
        df.tax_id = df.tax_id.astype('int64')
        #adding lineage information
        df=df.merge(tax_short, on='tax_id', how='left')

        #appending inside the if-block so an empty file does not re-append the previous df
        fungi_batch = pd.concat([fungi_batch, df])

all batched

In [70]:
all_batched=pd.concat([bact_batch, virus_batch, fungi_batch])

all_batched.to_csv(mainDir10 + 'all_batched_prefiltered_withLineage_09_09_2021.csv', index=False)
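
The bacteria, virus and fungi cells above differ only in the glob pattern, so the loading could be factored into one function. This is a sketch under assumptions, not the code that produced the saved files: `load_batched_hits` is a hypothetical name, the demo rows are made up, and it assumes the `.tab` files carry no header row (hence `header=None`), which may not match how the files above were written.

```python
import glob
import os
import tempfile

import pandas as pd

COLS = ['seqName', 'refName', 'pathogen', 'bitscore', 'pident', 'evalue', 'gapopen',
        'qstart', 'qend', 'sstart', 'send', 'length', 'mismatch', 'tax_id']

def load_batched_hits(pattern, tax_short):
    """Hypothetical helper: read every non-empty .tab file matching `pattern`."""
    frames = []
    for path in sorted(glob.glob(pattern)):
        if os.path.getsize(path) == 0:
            continue  # skip empty files
        # assumption: no header row in the .tab files
        df = pd.read_csv(path, sep='\t', header=None, names=COLS)
        parts = df['seqName'].str.split('-')
        df['sample'] = parts.str[1]
        df['seqName'] = parts.str[0]
        df['batch'] = os.path.basename(path).split('.tab')[0]
        df['filepath'] = path
        # keep only the first taxon id when several are listed
        df['tax_id'] = df['tax_id'].astype(str).str.split(';').str[0].astype('int64')
        frames.append(df.merge(tax_short, on='tax_id', how='left'))
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=COLS)

# tiny demo with one made-up hit and one empty file in a temporary directory
tax_demo = pd.DataFrame({'tax_id': [562], 'species': ['Escherichia coli']})
with tempfile.TemporaryDirectory() as tmp:
    row = ['read1-TSP3_Blood', 'ref1', 'x', 100.0, 99.0, 1e-20, 0, 1, 90, 1, 90, 90, 1, '562;83333']
    with open(os.path.join(tmp, 'batch_1.tab'), 'w') as fh:
        fh.write('\t'.join(map(str, row)) + '\n')
    open(os.path.join(tmp, 'batch_2.tab'), 'w').close()  # empty file
    demo = load_batched_hits(os.path.join(tmp, '*.tab'), tax_demo)
print(demo[['seqName', 'sample', 'batch', 'tax_id', 'species']])
```

Collecting the frames in a list and concatenating once also avoids growing a dataframe inside the loop.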


all unbatched

In [73]:
def status(x):
    if x ==0:
        status='empty'
    elif x<.80:
        status='partial'
    elif x>=.80:
        status='done'
    else:
        status='not started'
    
    return(status)
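
A quick check of how these thresholds behave, restated here so the snippet runs on its own. The case worth noting is NaN: the outer merges with the sample list in the next cell leave `frac_complete` missing for samples with no job row, every comparison with NaN is False, and so those samples fall through to 'not started'. The test values are made up.

```python
import numpy as np

def status(x):
    # same thresholds as the function above
    if x == 0:
        return 'empty'
    elif x < 0.80:
        return 'partial'
    elif x >= 0.80:
        return 'done'
    else:
        return 'not started'  # only reached for NaN, which fails every comparison

for frac in [0, 0.5, 0.8, 1.0, np.nan]:
    print(frac, '->', status(frac))
```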

In [74]:
# reading the status of jobs from a previous file
maindf_all=pd.read_csv(mainDir10 + 'status_of_nt_jobs.csv')
# here are the samples
samples=maindf_all[maindf_all['operation']=='humanFiltered'].file.unique().tolist()
samples_df=pd.DataFrame(samples, columns=['file'])


# actually getting the status of each job based on fraction of blast that's complete
maindf_vir = maindf_all[maindf_all['operation']=='virNTblastn']
maindf_vir = maindf_vir.merge(samples_df, on='file', how='outer') 
maindf_vir['status_vir'] = maindf_vir['frac_complete'].apply(lambda x: status(x))

maindf_bac = maindf_all[maindf_all['operation']=='micoNT_blastn']
maindf_bac = maindf_bac.merge(samples_df, on='file', how='outer') 
maindf_bac['status_bact'] = maindf_bac['frac_complete_bact'].apply(lambda x: status(x))

maindf_fun = maindf_all[maindf_all['operation']=='fungi_NT_blastn']
maindf_fun = maindf_fun.merge(samples_df, on='file', how='outer') 
maindf_fun['status_fun'] = maindf_fun['frac_complete_fun'].apply(lambda x: status(x))

#now just collecting all the filepaths to the files that were more than 80% done. 
maindf_vir_done = maindf_vir[maindf_vir['status_vir']=='done']
maindf_bac_done = maindf_bac[maindf_bac['status_bact']=='done']
maindf_fun_done = maindf_fun[maindf_fun['status_fun']=='done']
fps1 = maindf_vir_done.filepath.tolist()
fps2 = maindf_bac_done.filepath.tolist()
fps3 = maindf_fun_done.filepath.tolist()
#list of file paths to nt blast jobs from the three kingdoms that are complete
fps=fps1 + fps2 + fps3
print('number of complete files:', len(fps))


number of complete files: 193


In [75]:
#column headers
cols =['seqName', 'refName', 'pathogen', 'bitscore', 'pident', 'evalue', 'gapopen',
       'qstart', 'qend', 'sstart', 'send', 'length', 'mismatch', 'tax_id']
all_unbatched =pd.DataFrame({})

for file in fps:
    #looking at non empty files 
    if os.path.getsize(file)>0:
        #getting the file name (used as the sample name)
        filename = file.split('/')[-1].split('.tab')[0]
        
        df=pd.read_csv(file, delimiter='\t')
        df.columns=cols
        df['sample'] = df.shape[0]*[filename]
        df['batch']=df.shape[0]*['unbatched']
        df['filepath']=df.shape[0]*[file]
        
        df['tax_id']=df['tax_id'].apply(lambda x: str(x).split(';')[0])
        df.tax_id = df.tax_id.astype('int64')
        #adding lineage information
        df=df.merge(tax_short, on='tax_id', how='left')
        #no taxonomic filter is applied here; taxonomic filters are applied in a later cell
        #appending inside the if-block so an empty file does not re-append the previous df
        all_unbatched = pd.concat([all_unbatched, df])
    
  

In [76]:
all_unbatched.to_csv(mainDir10 + 'all_unbatched_prefiltered_withLineage_09_09_2021.csv', index=False)

combining the batched and unbatched for donors 3-16

In [77]:
donor3_16_prefil_withLineage = pd.concat([all_unbatched, all_batched])
donor3_16_prefil_withLineage.to_csv(mainDir10 + 'all_batched_unbatched_don3_16_prefiltered_withLineage_09_09_2021.csv', index=False)

### Reading the 10X blast dataframes for all donors with lineage information (without any filters in place)

In [42]:
donor3_16_prefil_withLineage = pd.read_csv(mainDir10 + 'all_batched_unbatched_don3_16_prefiltered_withLineage_09_09_2021.csv')

In [43]:
don1and2_prefil_withLineage = pd.read_csv(mainDir + 'don1_don2_10x_blastn_nt_pre_filteration_withLineage.csv')


In [44]:
don1and2_prefil_withLineage['dataset']=['TSP1-2']*don1and2_prefil_withLineage.shape[0]
donor3_16_prefil_withLineage['dataset']=['TSP3-16']*donor3_16_prefil_withLineage.shape[0]

#concatenating the two dataframes
all_dons_prefilt=pd.concat([don1and2_prefil_withLineage,donor3_16_prefil_withLineage ])
all_dons_prefilt.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'], inplace=True)

number of hits prior to any filters for all donors

In [None]:
print('number of hits prior to any filters for all donors:', all_dons_prefilt.shape[0])

In [None]:
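#saving the concatenated dataframe for all donors (not only donors 1 and 2), since this file is read back later as all_dons_prefilt
all_dons_prefilt.to_csv(mainDir + 'alldons_10x_blastn_nt_pre_filteration_withLineage.csv', index=False)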


#### what fraction of hits remain after removing sequences originating from humans

In [None]:
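#na=False makes the handling of missing genus explicit: hits without a genus-level assignment are kept rather than treated as human
all_dons_fil_1=all_dons_prefilt[~all_dons_prefilt['genus'].str.contains('homo', case=False, na=False)]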

print('fraction of hits surviving after human reads are removed:', 
      np.round(all_dons_fil_1.shape[0]/all_dons_prefilt.shape[0],2))

#### what fraction remain after selecting only for bacteria, viruses and fungi

In [None]:
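#na=False treats missing lineage levels as non-matching, so each condition is a clean boolean mask
all_dons_fil_2 = all_dons_fil_1[(all_dons_fil_1['superkingdom'].str.contains('Bacteria|Viruses', case=False, na=False) | 
                  all_dons_fil_1['phylum'].str.contains('mycota', case=False, na=False) |
                  all_dons_fil_1['species'].str.contains('Blastocyst', case=False, na=False))] 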

print('fraction of hits surviving after all taxonomic filters:', 
      np.round(all_dons_fil_2.shape[0]/all_dons_prefilt.shape[0],2))
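
Many BLAST hits lack some lineage levels, so how missing values are handled in `str.contains` changes which rows survive the two taxonomic filters. Here is a toy illustration with made-up rows (the virus name is a placeholder), passing `na=False` so that a missing value simply counts as "no match":

```python
import numpy as np
import pandas as pd

# made-up lineage rows; NaN marks a missing lineage level
toy = pd.DataFrame({
    'superkingdom': ['Eukaryota', 'Bacteria', 'Eukaryota', np.nan, 'Viruses'],
    'phylum': ['Chordata', 'Proteobacteria', 'Ascomycota', np.nan, np.nan],
    'genus': ['Homo', 'Escherichia', 'Candida', np.nan, np.nan],
    'species': ['Homo sapiens', 'Escherichia coli', 'Candida albicans', np.nan, 'Some virus'],
})

# step 1: drop human hits; rows with a missing genus are not treated as human
not_human = ~toy['genus'].str.contains('homo', case=False, na=False)
# step 2: keep bacteria, viruses, fungi (phyla ending in -mycota) and Blastocystis
microbial = (toy['superkingdom'].str.contains('Bacteria|Viruses', case=False, na=False)
             | toy['phylum'].str.contains('mycota', case=False, na=False)
             | toy['species'].str.contains('Blastocyst', case=False, na=False))

kept = toy[not_human & microbial]
print(kept['species'].tolist())
```

The virus row has no genus, so whether it survives the human filter depends entirely on how missing values are treated in step 1.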

#### what fraction of reads remain after all filters (taxonomic, alignment qc, decontamination through different steps)
going to compare the shape of the final dataframe to the prefiltered one.

**we went from 18.8M hits to about 61K hits, so only 0.3% of the original BLAST hits (all with E-values < 10^-5)**

In [46]:
all_hits_postfilters = pd.read_csv(mainDir + 'all_dons_10x_with_cell_annotations_qc_filtered_withRatios_decont_10_18_2021.csv')

all_dons_prefilt = pd.read_csv(mainDir + 'alldons_10x_blastn_nt_pre_filteration_withLineage.csv')

print('fraction of hits surviving after all filters:', 
      np.round(all_hits_postfilters.shape[0]/all_dons_prefilt.shape[0],4))

fraction of hits surviving after all filters: 0.0055


### Lastly, how many raw reads did we begin with, and how many ended up mapping to the human genome? 
see read_counter.sh in dir1, dir2 and dir3 (defined below) for the bash command used to get the raw counts from the gunzipped fastq files

In [236]:
#note these are counts of sequences, not lines (so no need to divide by 4 - since they are fastqs)

dir1 = '/oak/stanford/groups/quake/gita/raw/tab3-14_20210420/'
#for counts of reads for tabula donors 3-16 (10x)
raw_read_count_donors3to16=pd.read_csv(dir1 + 'counter.txt')
raw_read_count_donors3to16_sum = raw_read_count_donors3to16.sum()
print('total number of raw reads donors 3-16 10x:', raw_read_count_donors3to16_sum)

#for counts of reads for tabula donor 2 (10x)
dir2 = '/oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/raw_reads/'
raw_read_count_donor2=pd.read_csv(dir2 + 'counter.txt')
raw_read_count_donor2_sum = raw_read_count_donor2.sum()
print('total number of raw reads donor 2 10x:',raw_read_count_donor2_sum)

#for counts of reads for tabula donor 1 (10x)
dir3 = '/oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/'
raw_read_count_donor1=pd.read_csv(dir3 + 'counter.txt')
raw_read_count_donor1_sum = raw_read_count_donor1.sum()
print('total number of raw reads donor 1 10x:',raw_read_count_donor1_sum)

total = raw_read_count_donors3to16_sum + raw_read_count_donor2_sum + raw_read_count_donor1_sum
print('total number of raw reads for all donors:', total)

total number of raw reads donors 3-16 10x: counts    52783456701
dtype: int64
total number of raw reads donor 2 10x: counts    11214581299
dtype: int64
total number of raw reads donor 1 10x: counts    5665162120
dtype: int64
total number of raw reads for all donors: counts    69663200120
dtype: int64


#### looking to see how many reads remained after mapping to human genome and ERCCs using STAR
the counts come from wc -l *.fasta, which reports line counts rather than sequence counts, 
so the counts were divided by two
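
As a small illustration of that conversion, the sketch below parses text in the usual `wc -l` layout (count, then file name, with a final `total` line) and halves each count. It assumes two-line FASTA records (one header line and one sequence line per read); the file names and numbers are made up, and `sequences_per_file` is a hypothetical helper.

```python
# made-up wc -l output for two fasta files
wc_output = """\
   2000 TSP1_sampleA.fasta
   4000 TSP3_sampleB.fasta
   6000 total
"""

def sequences_per_file(text):
    """Hypothetical helper: map file name -> number of sequences."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        n_lines, name = line.split(maxsplit=1)
        if name == 'total':
            continue  # skip the summary line that wc adds for multiple files
        counts[name] = int(n_lines) // 2  # two lines per sequence (assumption)
    return counts

per_file = sequences_per_file(wc_output)
print(per_file)
# excluding TSP1_/TSP2_ samples, as done below for the donors 3-16 file
print(sum(n for name, n in per_file.items() if not name.startswith(('TSP1_', 'TSP2_'))))
```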

In [225]:
#donors 3-16 (though it contains all donors actually- had to filter TSP1 and 2 out)
human_fil_count_donor3to16=pd.read_csv(dir1 + 'all/count_of_humanFiltered_reads.txt')
line_count=[]
sample=[]
for index,row in human_fil_count_donor3to16.iterrows():
    line_count.append(int(human_fil_count_donor3to16.iloc[index].str.split(' ')[0][0]))
    sample.append(human_fil_count_donor3to16.iloc[index].str.split(' ')[0][1])
human_fil_count_donor3to16['line_count'] = line_count
human_fil_count_donor3to16['count'] = human_fil_count_donor3to16['line_count']/2 #dividing by two because they are line counts, not sequence counts
human_fil_count_donor3to16['sample'] = sample

#excluding donors 1 and 2 (since they are included in the tsp3-16 human filtered file)
hum_fil_don3to16 = human_fil_count_donor3to16[(human_fil_count_donor3to16['sample'].str.startswith('TSP1_')==False) &
                            (human_fil_count_donor3to16['sample'].str.startswith('TSP2_')==False)]['count'].sum()
print('number of reads after removing those that map to human genome for donor 3-16 10x:', hum_fil_don3to16)

#donor 2 
human_fil_count_donor2=pd.read_csv('/oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/count_of_humanFiltered_reads.txt')
human_fil_count_donor2.tail(1)
hum_fil_don2 = 681750380/2
print('number of reads after removing those that map to human genome for donor 2 10x:', hum_fil_don2)

#donor 1
human_fil_count_donor1=pd.read_csv(dir3 + 'count_of_humanFiltered_reads.txt')
human_fil_count_donor1.tail(1)
hum_fil_don1 = 357472376/2
print('number of reads after removing those that map to human genome for donor 1 10x:', hum_fil_don1)

total_hum_fil = hum_fil_don3to16 + hum_fil_don2 + hum_fil_don1
print('total count of reads after alignment to human genome and ERCC all donors 10x', total_hum_fil)



number of reads after removing those that map to human genome for donor 3-16 10x: 5250281677.0
number of reads after removing those that map to human genome for donor 2 10x: 340875190.0
number of reads after removing those that map to human genome for donor 1 10x: 178736188.0
total count of reads after alignment to human genome and ERCC all donors 10x 5769893055.0


**fraction of reads that did not align to human genome or ERCCs** 
8% for all donors

In [241]:
print('fraction non-human donors 3-16:', np.round(hum_fil_don3to16/ raw_read_count_donors3to16_sum,2))

print('fraction non-human donor 2:', np.round(hum_fil_don2/raw_read_count_donor2_sum,2))

print('fraction non-human donor 1:', np.round(hum_fil_don1/raw_read_count_donor1_sum,2))

print('fraction non-human all donors:', np.round(total_hum_fil/total,2))


fraction non-human donors 3-16: counts    0.1
dtype: float64
fraction non-human donor 2: counts    0.03
dtype: float64
fraction non-human donor 1: counts    0.03
dtype: float64
fraction non-human all donors: counts    0.08
dtype: float64


## Summary
so we went from 69,663,200,120 raw reads (all donors 10X) to 5,769,893,055 reads after removing human reads, and finally to a blast dataframe representing 18,862,647 sequences (prior to any filters), which then got filtered further down to ~30k sequences. 

**~70 Billion reads --> ~5.8 Billion reads (human filtered) [8% of original raw reads] --> ~19 Million reads (after BLAST) [0.03% of original raw reads] --> ~30 thousand hits (after all filters and qc of blast dataframe)**
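
The percentages in this funnel can be re-derived from the three exact counts reported above. A short arithmetic check (the variable names are mine):

```python
raw_reads = 69_663_200_120        # all donors, 10X
human_filtered = 5_769_893_055    # reads left after alignment to human genome and ERCCs
blast_hits = 18_862_647           # sequences in the BLAST dataframe before any filters

pct_nonhuman = 100 * human_filtered / raw_reads
pct_blast = 100 * blast_hits / raw_reads

print(f'non-human reads: {pct_nonhuman:.1f}% of raw reads')
print(f'BLAST hits: {pct_blast:.3f}% of raw reads')
```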