In this notebook, I analyze the output of Cell Ranger run on the TSP1-14 FASTQ files. Each Cell Ranger output folder contains a raw matrix and a filtered matrix. According to Cell Ranger, when using cellranger 6.0.1, subtracting the filtered matrix from the raw matrix leaves the droplets that Cell Ranger identifies as "empty".
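
As a rough illustration of the raw-minus-filtered idea, the sketch below treats each matrix as a set of barcodes. The barcodes are made up for this example; in practice they would come from the raw and filtered feature-barcode matrices of a given sample.

```python
# Hypothetical barcodes, for illustration only.
raw_barcodes = {"AAAC-1", "AAAG-1", "AAAT-1", "AACC-1"}
filtered_barcodes = {"AAAC-1", "AAAT-1"}  # droplets Cell Ranger called as cells

# Barcodes present in the raw matrix but absent from the filtered matrix
# are the droplets Cell Ranger treated as empty.
empty_barcodes = raw_barcodes - filtered_barcodes
print(sorted(empty_barcodes))
```

Note that the rest of this notebook does not rely on this subtraction; it classifies droplets with a gene-count cutoff instead.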

To identify putative contaminating species, I first define empty versus cell-containing droplets based on whether a droplet contains at least n genes. Note that donor 16 doesn't have cell annotations, so this exercise doesn't make sense for that donor.

Through SIMBA, I have raw FASTQ files on which I ran UMI-tools and subsequently BLAST (see viral7_10x.snakefile for an example). Separately, I ran the cellranger count command on the same raw FASTQ files to get the raw feature-barcode matrix.

One issue I noticed is that sample names in the Tabula Sapiens (TS) object can be a bit shorter than those in the BLAST dataframe. To fix this, I create a dictionary that maps TS names to BLAST names, since the BLAST names are longer and match the Cell Ranger file names better, and then rewrite the "sample" column. Another issue is that Cell Ranger file names don't always match the BLAST or TS object sample names. To correct for this, I add a new column called "cr_sample".

For each row with "cr" == 'yes' (a sample whose name has been matched to a Cell Ranger folder), we want to open the Cell Ranger files to get "n_genes_cr" and "n_counts_cr" for each cell, based on the "cell_bc" column shared between the ts-blast dataframe and the Cell Ranger output for that sample. Based on the "n_genes_cr" values, we also build a separate dataframe that records, for each cr_sample, the number of empty droplets versus cell-containing droplets. We set the cutoff to 200 genes per cell, which is a commonly used threshold in this field (e.g. the scanpy tutorial).

Because the Cell Ranger files for donors 3-16 are in a different directory, I run those donors separately from donors 1 and 2, and later concatenate the resulting dataframes into one. The final dataframe has the same columns as the input dataframe from the previous notebook (df=pd.read_csv(mainDir10 +'all_dons_except15_blast_tsobject_postreview.csv')), plus 6 additional columns (those related to Cell Ranger). The main dataframe is alldons.

- alldons.to_csv(mainDir10 + 'all_dons_cr_processed_postreview2022.csv', index=False)




### Column name explanation 
for joined dataframe of ALL donors 1-16 (except 15) - see part1 notebook. 
- Blast dataframe columns (TS object is missing these columns):
    - **seqName**: e.g. A00111:327:HL57HDSXX:4:1202:22941:23719_TAGGTCAGAGACATCA_AGAAATAACGCC
    - **seq**: microbial sequence 
    - **refName**: the BLAST subject reference gi|1862738216|gb|CP055292.1| 
    - **pathogen**: common name of the hit "Shigella sonnei strain SE6-1 chromosome, complete genome"
    - **bitscore**: see BLAST tutorial
    - **pident**: percent identity between subject and query
    - **evalue**: see BLAST tutorial
    - **qstart**: query start position
    - **qend**: query end position
    -  **sstart**: subject start position
    -  **send**: subject end position
    -  **length**: length of the alignment between subject and query
    - Taxonomy columns 
        -  **tax_id**: taxonomic id with which we can get the following taxonomic categories
        -  **superkingdom**
        -  **phylum**
        -  **class**
        -  **order**
        -  **family**
        -  **genus**
        -  **species**
    -  **umi**: AGAAATAACGCC 
    -  **cell_bc_umi**: cell barcode and umi e.g. TGTCCCATGTACTCTG_CGTTGATACCAC
    -  **batch**: this is related to how blast was done for these donors (divided into batches with fixed number of input seqs)
    - **filepath**: where the output of the blast resides on my local drive


- TS object columns (blast dataframe does not produce these on its own)
    - **n_counts**: number of reads per cell
    - **n_genes**: number of genes per cell
    - **log2_n_counts**: log2 of n_counts
    - **log2_n_genes**: log2 of n_genes
    - **compartment**: e.g. immune, epithelial
    - **decision**: cell cycle decision
    - **celltype2**: cell type
    - **tissue_cell_type**: tissue and cell type
    - **cell_type_tissue**: cell type and tissue


- Shared columns:
    - **cell**: cell barcode+sample  TGGGCGTGTTGCGCAC_TSP14_Blood_NA_10X_1_1_1_5Prime
    - **cell_bc**: cell barcode; for donors 1 and 2 it has "-1" appended (e.g. CATATTCCAAAGCGGT-1), for donors 3-16 it does not
    - **tissue**: tissue type
    - **donor**: donor (e.g. TSP1)
    - **sample**: sample name e.g. TSP14_Bladder_NA_10X_1_2
    - **hit**: the column used to separate the dataframe into BLAST rows (hit=='yes') and TS object rows (hit=='no')
    - **hit_type**: tells us whether the hit comes from an annotated cell (hit_type=='intra'), unannotated cell ('extra') or no hit ('none')
    - **donor_batch**: donor 1 and 2 dataframe ('1_2'), all others ('3_16')


- Cell ranger columns (rows with cr column=='yes' should have all this info):
    - **cr**: "yes" if the sample was run through cell ranger pipeline, and "no" otherwise
    - **cr_sample**: the name of the cell ranger file
    - **n_counts_cr**: number of reads per droplet as determined by my cell ranger pipeline
    - **n_genes_cr**: same as above, for number of genes per droplet
    - **num_empty_drop_per_sample**: number of empty droplets in a sample (based on cut off of 200 genes)
    - **num_cell_drop_per_sample**: number of cell-containing droplets in a sample (based on cut off of 200 genes)
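
To make the Cell Ranger columns concrete, here is a minimal sketch of how per-droplet counts, per-droplet gene numbers, and the per-sample droplet tallies could be derived from a droplets-by-genes count matrix. The matrix values and the scaled-down cutoff (`min_genes = 4` instead of 200) are assumptions chosen so the toy example is readable; the notebook itself obtains these values through scanpy, and its loop additionally counts BLAST barcodes that Cell Ranger does not report as empty.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy droplets-by-genes count matrix (4 droplets, 5 genes); values are made up.
X = csr_matrix(np.array([
    [0, 0, 1, 0, 0],
    [3, 1, 0, 2, 4],
    [0, 0, 0, 0, 0],
    [5, 0, 2, 1, 1],
]))

n_counts_cr = np.asarray(X.sum(axis=1)).ravel()  # total counts per droplet
n_genes_cr = X.getnnz(axis=1)                    # genes with nonzero counts per droplet

min_genes = 4  # stand-in for the 200-gene cutoff, scaled down for the toy matrix
num_cell_drop = int((n_genes_cr >= min_genes).sum())
num_empty_drop = int((n_genes_cr < min_genes).sum())
print(n_counts_cr, n_genes_cr, num_cell_drop, num_empty_drop)
```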

### Loading libraries

In [1]:
%matplotlib inline
import matplotlib
print('matplotlib version:', matplotlib.__version__)
import matplotlib.pyplot as plt
from matplotlib_venn import venn3
from matplotlib_venn import venn2
import os
import glob
import re
import itertools
from collections import Counter

import numpy as np
print('numpy version', np.__version__)
import pandas as pd
print('pandas version:', pd.__version__)
from pandas import ExcelWriter
from pandas import ExcelFile
import seaborn as sns
print('seaborn version:', sns.__version__)

cmap = sns.cm.rocket_r
from Bio import SeqIO
import scipy  # not aliased as "sc", since that alias is used for scanpy below
import scipy.sparse
print('scipy version:', scipy.__version__)
from scipy import spatial
from scipy.stats import( kstest, poisson)
sns.set_style("white")

import anndata
print('anndata version:', anndata.__version__)
from anndata import read_h5ad
from anndata import AnnData
import scanpy as sc

sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80)  # low dpi (dots per inch) yields small inline figures
sc.logging.print_version_and_date()

import scvelo as scv
scv.logging.print_version()
scv.settings.verbosity = 3  # show errors(0), warnings(1), info(2), hints(3)
scv.settings.presenter_view = True  # set max width size for presenter view
scv.set_figure_params('scvelo')  # for beautified visualization



matplotlib version: 3.3.0
numpy version 1.19.2
pandas version: 1.0.3
seaborn version: 0.11.1
scipy version: 1.5.2
anndata version: 0.7.3
Running Scanpy 1.5.1, on 2023-02-17 10:59.
Running scvelo 0.2.1 (python 3.6.10) on 2023-02-17 10:59.


### Directories

In [2]:
#tsp1and2
mainDir = '/oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/'
mainDir2 ='/oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/cranger/round2/'
mainDir4= '/oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/cranger/'

#tsp3-16 data
cranger_dir='/oak/stanford/groups/quake/gita/raw/tab3-14_20210420/all/cranger3/'  
mainDir10 = '/oak/stanford/groups/quake/gita/raw/tab3-14_20210420/all/'


### TSP3-15

In [None]:
df=pd.read_csv(mainDir10 +'all_dons_except15_blast_tsobject_postreview.csv')

One issue I noticed is that sample names in the TS object can be a bit shorter than those in the BLAST dataframe. To fix this, I create a dictionary that maps TS names to BLAST names, since the BLAST names are longer and match the Cell Ranger file names better, and then rewrite the sample column.

In [31]:
ls0=df['sample'].unique().tolist()
ls1=[]
ls2=[]
di={}
for x in ls0:
    for y in ls0:
        if (x in y) & (x!=y):
            di[x]=y
            
df['sample'] = df['sample'].map(di).fillna(df['sample'])
df['sample'] = df['sample'].str.replace('TSP14_Spleen_NA_10X_1_1_1_5Prime', 'TSP14_Spleen_NA_10X_1_1_5Prime')
df['sample'] = df['sample'].str.replace('TSP14_Blood_NA_10X_1_1_1_5Prime', 'TSP14_Blood_NA_10X_1_1_5Prime')
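
The renaming logic above can be sketched on a toy list. The sample names here are hypothetical, and plain Python is used in place of pandas: `dict.get(name, name)` plays the role of `map(di).fillna(...)`, leaving unmapped names unchanged.

```python
# Hypothetical sample names: one short (TS object style) and one long (BLAST style).
names = [
    "TSPX_Blood_NA_10X_1_1",
    "TSPX_Blood_NA_10X_1_1_Enriched",
    "TSPX_Spleen_NA_10X_1_1",
]

# Map every name that is a proper substring of another name to the longer name.
short_to_long = {x: y for x in names for y in names if x in y and x != y}

# Names without a longer counterpart are kept as they are.
renamed = [short_to_long.get(name, name) for name in names]
print(short_to_long)
print(renamed)
```

One caveat of this approach: if a short name is a substring of several longer names, only the last match is kept in the dictionary, so it is worth inspecting the mapping before applying it.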


Double-checking that we corrected most mismatches (62 unique mismatches were fixed).

In [32]:
n = df[ df['hit_type']=='none']
e = df[ df['hit_type']=='extra']

remaining=list(set(e['sample']) - set(n['sample']))


print('number of remaining file names that are in blast but not in ts object after correction:', len(remaining))


number of remaining file names that are in blast but not in ts object after correction: 6


In [33]:
print(remaining)

['TSP3_Eye_LacrimalGland_10X_1_1', 'TSP16_Eye_lacrimalgland_10X_1_1', 'TSP16_Heart_ventricle_10X_1_2_left', 'TSP2_Heart_ventricle_10X_1_1_S14', 'TSP16_Heart_atrium_10X_1_1_left', 'TSP16_Eye_external_10X_1_1']


What about samples that appear in the TS object but not in BLAST? There are 30 such samples, because they were not BLASTed (or the job remained incomplete).

In [34]:
remaining2=list(set(n['sample']) - set(e['sample']))
len(remaining2)

30

This part is just for matching the Cell Ranger file names to the sample names in our dataframe.

Some sample names are synonymous: the raw reads have a different Cell Ranger file name from the name in the TS object. So I create a column specifically for this notebook called cr_sample (cr stands for Cell Ranger).


In [35]:
df['cr_sample'] = df['sample']
df['cr_sample']=df['cr_sample'].str.replace('TSP5_Eye_NA_10X_1_1', 'TSP5_Eye1_062920')
df['cr_sample']=df['cr_sample'].str.replace('TSP5_Eye_NA_10X_1_2', 'TSP5_Eye2_062920')

df['cr_sample']=df['cr_sample'].str.replace('TSP4_Mammary_NA_10X_1_1', 'TSP4_Mammary1_062920')
df['cr_sample']=df['cr_sample'].str.replace('TSP4_Mammary_NA_10X_1_2', 'TSP4_Mammary2_062920')
df['cr_sample']=df['cr_sample'].str.replace('TSP4_Uterus_Myometrium_10X_1_1', 'TSP4_Myometrium_062920')
df['cr_sample']=df['cr_sample'].str.replace('TSP4_Uterus_Endometrium_10X_1_1', 'TSP4_Endometrium_062920_S9')

df['cr_sample']=df['cr_sample'].str.replace('TSP3_Eye_LacrimalGland_10X_1_1', 'TSP3_Eye_062620_S1')
df['cr_sample']=df['cr_sample'].str.replace('TSP3_Eye_NA_10X_1_1_NoCornea', 'TSP3_Eye3_062620_S3')
df['cr_sample']=df['cr_sample'].str.replace('TSP3_Eye_Conjunctiva_10X_1_1', 'TSP3_Eye2_062620_S2') 
df['cr_sample']=df['cr_sample'].str.replace('TSP3_Eye_Orbital_10X_1_1', 'TSP3_Eye4_062620_S4')



Additionally, some sample names end in "5Prime" in the ts-blast dataframe, whereas the Cell Ranger folder names carry an extra _S[number] suffix, so I create a dict to map the cr_sample column.

In [36]:
di2=dict({'TSP14_Blood_NA_10X_2_1_5Prime' : 'TSP14_Blood_NA_10X_2_1_5Prime_S2',
'TSP14_Blood_NA_10X_1_1_5Prime' : 'TSP14_Blood_NA_10X_1_1_5Prime_S1',
'TSP14_BoneMarrow_NA_10X_1_1_5Prime': 'TSP14_BoneMarrow_NA_10X_1_1_5Prime_S9',
'TSP14_BoneMarrow_NA_10X_2_1_5Prime' : 'TSP14_BoneMarrow_NA_10X_2_1_5Prime_S10',
'TSP14_LymphNode_NA_10X_1_1_5Prime' : 'TSP14_LymphNode_NA_10X_1_1_5Prime_S3',
'TSP14_LymphNode_NA_10X_2_1_5Prime' : 'TSP14_LymphNode_NA_10X_2_1_5Prime_S4', 
'TSP14_Spleen_NA_10X_1_1_5Prime':'TSP14_Spleen_NA_10X_1_1_5Prime_S5',
'TSP14_Spleen_NA_10X_2_1_5Prime': 'TSP14_Spleen_NA_10X_2_1_5Prime_S6',
'TSP14_Thymus_NA_10X_1_1_5Prime': 'TSP14_Thymus_NA_10X_1_1_5Prime_S7',
'TSP14_Thymus_NA_10X_2_1_5Prime':'TSP14_Thymus_NA_10X_2_1_5Prime_S8'})
    
df['cr_sample'] = df['cr_sample'].map(di2).fillna(df['cr_sample'])

Let's see which samples we are missing.

In [37]:
#the filenames that appear in our blast+tsobject dataset 
samplenames=df['cr_sample'].unique().tolist()
#cr is the name of cell ranger files gathered from multiple directories 
cr = ['TSP10_Blood_NA_10X_1_1_Enriched_S4','TSP10_Blood_NA_10X_1_1_Whole_S3','TSP10_FAT_MAT_10X_1_1_S8','TSP10_FAT_SCAT_10X_1_1_S9','TSP10_Skin_NA_10X_1_1_S5','TSP10_Skin_NA_10X_1_2_S6',
      'TSP11_BoneMarrow_NA_10X_1_1_LinDepleted_S10','TSP11_BoneMarrow_NA_10X_1_1_LinEnriched_S11',
      'TSP12_Heart_Atria_10X_1_1_S13','TSP12_Heart_Ventricle_10X_1_1_S12',
      'TSP14_Bladder_NA_10X_1_1_S11','TSP14_Bladder_NA_10X_1_2_S12','TSP14_Blood_NA_10X_1_1_5Prime_S1','TSP14_Blood_NA_10X_1_1_S1','TSP14_Blood_NA_10X_1_2_S5','TSP14_Blood_NA_10X_2_1_5Prime_S2','TSP14_Blood_NA_10X_2_1_S16','TSP14_Blood_NA_10X_2_1_S2',
      'TSP14_BoneMarrow_NA_10X_1_1_5Prime_S9','TSP14_BoneMarrow_NA_10X_1_1_S14','TSP14_BoneMarrow_NA_10X_2_1_5Prime_S10','TSP14_BoneMarrow_NA_10X_2_1_S15','TSP14_BoneMarrow_NA_10X_3_1_S17','TSP14_BoneMarrow_NA_10X_4_1_S18',
      'TSP14_Fat_MAT_10X_1_1_S6','TSP14_Fat_SCAT_10X_1_1_S7','TSP14_LI_Distal_10X_1_1_S10','TSP14_LI_Proximal_10X_1_1_S9','TSP14_Liver_NA_10X_1_1_S11','TSP14_Liver_NA_10X_2_1_S12',
      'TSP14_Lung_Distal_10X_1_1_S5','TSP14_Lung_NA_10X_2_1_S1','TSP14_Lung_Proximal_10X_1_1_S6',
      'TSP14_LymphNode_NA_10X_1_1_5Prime_S3','TSP14_LymphNode_NA_10X_1_1_S8','TSP14_LymphNode_NA_10X_2_1_5Prime_S4','TSP14_LymphNode_NA_10X_2_1_S9',
      'TSP14_Muscle_Abdomen_10X_1_1_S4','TSP14_Muscle_Diaphragm_10X_1_1_S2','TSP14_Muscle_Diaphragm_10X_1_2_S3','TSP14_Prostate_NA_10X_1_1_S15','TSP14_Prostate_NA_10X_1_2_S16',
      'TSP14_SI_Distal_10X_1_1_S8','TSP14_SI_Proximal_10X_1_1_S7','TSP14_SalivaryGland_Parotid_10X_1_1_S14','TSP14_SalivaryGland_Submandibular_10X_1_1_S13','TSP14_Skin_Abdomen_10X_1_1_S17','TSP14_Skin_Chest_10X_1_1_S18',
      'TSP14_Spleen_NA_10X_1_1_5Prime_S5','TSP14_Spleen_NA_10X_1_1_S10','TSP14_Spleen_NA_10X_2_1_5Prime_S6','TSP14_Spleen_NA_10X_2_1_S11',
      'TSP14_Thymus_NA_10X_1_1_5Prime_S7','TSP14_Thymus_NA_10X_1_1_S3','TSP14_Thymus_NA_10X_2_1_5Prime_S8','TSP14_Thymus_NA_10X_2_1_S4',
      'TSP14_Tongue_Anterior_10X_1_1_S14','TSP14_Tongue_Posterior_10X_1_1_S13','TSP14_Vasculature_AortaVeneCava_10X_1_1_S12','TSP14_Vasculature_CoronaryArteries_10X_1_1_S13',
      'TSP15_Eye_Cornea-etc_10X_1_1_S15','TSP15_Eye_Neuroretina-etc_10X_3_1_S17','TSP15_Eye_Sclera-etc_10X_2_1_S16',
      'TSP3_Eye2_062620_S2','TSP3_Eye3_062620_S3','TSP3_Eye4_062620_S4','TSP3_Eye_062620_S1',
      'TSP4_Endometrium_062920_S9','TSP4_Mammary1_062920_S7','TSP4_Mammary2_062920_S8','TSP4_Myometrium_062920_S10',
      'TSP5_Eye1_062920_S5','TSP5_Eye2_062920_S6',
      'TSP6_Liver_NA_10X_1_1_S8','TSP6_Liver_NA_10X_1_2_S9','TSP6_Trachea_NA_10X_1_1_S10','TSP6_Trachea_NA_10X_1_2_S11','TSP6_Trachea_NA_10X_2_1_S12',
      'TSP7_Blood_NA_10X_1_1_S6','TSP7_Blood_NA_10X_2_1_S7','TSP7_LymphNodes_Inguinal_10X_1_1_S2','TSP7_LymphNodes_Supradiaphagmatic_10X_1_1_S3','TSP7_SalivaryGland_Parotid_10X_1_1_S14','TSP7_SalivaryGland_Parotid_10X_1_2_S15',
      'TSP7_Spleen_NA_10X_1_1_S4','TSP7_Spleen_NA_10X_2_1_S5','TSP7_Tongue_Anterior_10X_1_2_S1','TSP7_Tongue_Posterior_10X_1_1_S13',
      'TSP8_Blood_NA_10X_1_1_S1',
      'TSP9_Pancreas_exocrine_10X_1_1_CellCountLive_S2','TSP9_Pancreas_exocrine_10X_1_1_CellCountTotal_S1',
      'TSP1_endopancreas_3_S3','TSP1_bladder_1_S7','TSP1_bladder_2_S8','TSP1_bladder_3_S9','TSP1_blood_1_S16','TSP1_blood_2_S17','TSP1_blood_3_S18','TSP1_endopancreas_1_S1','TSP1_endopancreas_2_S2','TSP1_lung_3_S12',
      'TSP1_exopancreas1_1_S19','TSP1_exopancreas1_2_S20','TSP1_exopancreas1_3_S21','TSP1_exopancreas2_1_S4','TSP1_exopancreas2_2_S5','TSP1_exopancreas2_3_S6',
      'TSP1_lung_1_S10','TSP1_lung_2_S11','TSP1_muscle_1_S13','TSP1_muscle_2_S14','TSP1_muscle_3_S15',
      'TSP2_BM_vertebralbody_10X_1_1_S29','TSP2_BM_vertebralbody_10X_1_2_5prime_S35','TSP2_BM_vertebralbody_10X_2_1_S28','TSP2_BM_vertebralbody_10X_2_2_5prime_S34',
      'TSP2_Bladder_NA_10X_1_1_S5','TSP2_Bladder_NA_10X_1_2_S6','TSP2_Blood_NA_10X_1_3_S15','TSP2_Blood_NA_10X_1_4_5prime_S30','TSP2_Blood_NA_10X_1_5_5prime_S31','TSP2_Blood_NA_10X_2_1_S16',
      'TSP2_Heart_ventricle_10X_1_1_S14','TSP2_Kidney_NA_10X_1_1_S1','TSP2_Kidney_NA_10X_1_2_S2','TSP2_LI_distal_10X_1_1_Sigmoid_S9','TSP2_LI_proximal_10X_1_1_Ascending_S8',
      'TSP2_Lung_proxmedialdistal_10X_1_1_S24','TSP2_Lung_proxmedialdistal_10X_1_2_S25','TSP2_LymphNode_NA_10X_1_1_S19','TSP2_LymphNode_NA_10X_2_1_S20',
      'TSP2_Muscle_diaphragm_10X_1_1_S12','TSP2_Muscle_diaphragm_10X_1_2_S13','TSP2_Muscle_rectusabdominus_10X_1_1_S10','TSP2_Muscle_rectusabdominus_10X_1_2_S11',
      'TSP2_SI_distal_10X_1_1_Jejunum_S22','TSP2_SI_proximal_10X_1_1_Duodenum_S21','TSP2_Spleen_NA_10X_1_1_S17','TSP2_Spleen_NA_10X_2_1_S18',
      'TSP2_Thymus_NA_10X_1_1_S3','TSP2_Thymus_NA_10X_1_2_S4','TSP2_Thymus_NA_10X_1_3_5prime_S32','TSP2_Thymus_NA_10X_1_4_5prime_S33',
      'TSP2_Trachea_NA_10X_1_1_S7','TSP2_Trachea_NA_10X_1_2_S23','TSP2_Vasculature_Aorta_10X_1_1_S26','TSP2_Vasculature_Aorta_10X_1_2_S27']
#trying to see which sample names match the cell ranger file names 
flist=[]
clist=[]
for f in samplenames:
    for c in cr:
        if f in c:
            flist.append(f)
            clist.append(c)
    
            
sn = pd.DataFrame({})
sn['f']= flist
sn['c']=clist
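
The matching step above, and the two "missing" lists computed in the following cells, can be sketched in plain Python. All names below are hypothetical and serve only to show the logic: a sample matches a Cell Ranger folder when the sample name is a substring of the folder name, and whatever is left unmatched on either side is reported.

```python
# Hypothetical names, for illustration only.
sample_names = ["TSPX_Liver_NA_10X_1_1", "TSPX_Lung_NA_10X_1_1"]
cr_folders = ["TSPX_Liver_NA_10X_1_1_S8", "TSPX_Skin_NA_10X_1_1_S5"]

# Keep every (sample, folder) pair where the sample name is a substring of the folder name.
pairs = [(f, c) for f in sample_names for c in cr_folders if f in c]

matched_samples = {f for f, _ in pairs}
matched_folders = {c for _, c in pairs}

# Samples with no Cell Ranger folder, and folders with no sample in the dataframe.
samples_without_folder = set(sample_names) - matched_samples
folders_without_sample = set(cr_folders) - matched_folders
print(pairs, samples_without_folder, folders_without_sample)
```

Because the match is a substring test, one sample name can pair with more than one folder when folder names share a prefix, so it can be useful to check `pairs` for duplicated samples.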

Missing Cell Ranger files: samples in the dataframe with no matching Cell Ranger folder. Cell Ranger wasn't run for donor 16 because it didn't have annotated cell types.

In [38]:
miss0=list(set(df['cr_sample']) - set(sn['f']) )
print('number of samples:', len(miss0))
miss0

number of samples: 12


['TSP10_Blood_NA_10X_2_1_Whole',
 'TSP16_Eye_lacrimalgland_10X_1_1',
 'TSP2_Blood_NA_10X_1_1_SheelaPrep',
 'TSP16_Heart_ventricle_10X_1_2_left',
 'TSP8_Prostate_NA_10X_1_1',
 'TSP2_Vasculature_Aorta_10X_2_1',
 'TSP8_Prostate_NA_10X_1_2',
 'TSP2_Blood_NA_10X_1_2_SheelaPrep',
 'TSP16_Heart_atrium_10X_1_1_left',
 'TSP2_Vasculature_Aorta_10X_2_2',
 'TSP16_Eye_external_10X_1_1',
 'TSP14_Blood_NA_10X_3_1']

Missing from ts-blast: Cell Ranger file names that I was unable to match to a sample because the dataframe doesn't contain them (e.g. TSP15 samples, which were excluded previously).

In [39]:
miss1 =list(set(cr) - set(sn['c']))
print('number of samples:', len(miss1))
miss1

number of samples: 14


['TSP14_LymphNode_NA_10X_1_1_S8',
 'TSP14_Blood_NA_10X_2_1_S16',
 'TSP15_Eye_Neuroretina-etc_10X_3_1_S17',
 'TSP14_LymphNode_NA_10X_2_1_S9',
 'TSP14_Blood_NA_10X_1_1_S1',
 'TSP15_Eye_Sclera-etc_10X_2_1_S16',
 'TSP14_Thymus_NA_10X_2_1_S4',
 'TSP14_Blood_NA_10X_2_1_S2',
 'TSP14_Thymus_NA_10X_1_1_S3',
 'TSP14_Spleen_NA_10X_1_1_S10',
 'TSP14_Spleen_NA_10X_2_1_S11',
 'TSP15_Eye_Cornea-etc_10X_1_1_S15',
 'TSP14_BoneMarrow_NA_10X_1_1_S14',
 'TSP14_BoneMarrow_NA_10X_2_1_S15']

In [40]:
n = df[ df['hit_type']=='none']
e = df[ df['hit_type']=='extra']

Samples that appear in the TS object but not in BLAST. I will exclude these from the dataframe, since they appear to be files that were not BLASTed.

In [41]:
miss2 = list(set(n['sample']) - set(e['sample'] ) )
print('number of samples:', len(miss2))
miss2

number of samples: 30


['TSP14_Lung_Distal_10X_1_1',
 'TSP10_Blood_NA_10X_2_1_Whole',
 'TSP2_Blood_NA_10X_1_4_5prime',
 'TSP2_Thymus_NA_10X_1_3_5prime',
 'TSP14_Bladder_NA_10X_1_2',
 'TSP14_SI_Proximal_10X_1_1',
 'TSP14_Fat_MAT_10X_1_1',
 'TSP8_Prostate_NA_10X_1_1',
 'TSP14_BoneMarrow_NA_10X_3_1',
 'TSP14_LI_Proximal_10X_1_1',
 'TSP14_BoneMarrow_NA_10X_4_1',
 'TSP14_Tongue_Anterior_10X_1_1',
 'TSP14_Vasculature_CoronaryArteries_10X_1_1',
 'TSP2_Vasculature_Aorta_10X_2_1',
 'TSP2_BM_vertebralbody_10X_2_2_5prime',
 'TSP2_Vasculature_Aorta_10X_2_2',
 'TSP14_SI_Distal_10X_1_1',
 'TSP14_Fat_SCAT_10X_1_1',
 'TSP4_Uterus_Endometrium_10X_1_1',
 'TSP2_Blood_NA_10X_1_1_SheelaPrep',
 'TSP8_Prostate_NA_10X_1_2',
 'TSP14_Prostate_NA_10X_1_1',
 'TSP14_Lung_NA_10X_2_1',
 'TSP14_Prostate_NA_10X_1_2',
 'TSP14_Vasculature_AortaVeneCava_10X_1_1',
 'TSP14_Lung_Proximal_10X_1_1',
 'TSP14_Blood_NA_10X_1_2',
 'TSP2_Blood_NA_10X_1_2_SheelaPrep',
 'TSP14_Muscle_Diaphragm_10X_1_2',
 'TSP14_Blood_NA_10X_3_1']

To sum up the previous blocks: the samples listed below don't appear to have any BLAST results (most likely they were never entered into the pipeline).
Additionally, samples from donor 16 were not included in this analysis because they did not have cell annotations.
Samples from donor 15 should also be ignored due to some inconsistent annotation issues.

There are additionally 30 samples that show up in the TS object but don't have BLAST results.

#missing cell ranger files (miss0)
-    'TSP2_Blood_NA_10X_1_1_SheelaPrep'
-    'TSP2_Blood_NA_10X_1_2_SheelaPrep'
-    'TSP2_Vasculature_Aorta_10X_2_1'
-    'TSP2_Vasculature_Aorta_10X_2_2'
-    'TSP8_Prostate_NA_10X_1_1'
-    'TSP8_Prostate_NA_10X_1_2'
-    'TSP10_Blood_NA_10X_2_1_Whole'
-    'TSP14_Blood_NA_10X_3_1'
-    'TSP16_Eye_external_10X_1_1' #Excluded TSP16 on purpose (didn't have annotations)
-    'TSP16_Eye_lacrimalgland_10X_1_1'
-    'TSP16_Heart_atrium_10X_1_1_left'
-    'TSP16_Heart_ventricle_10X_1_2_left'

#missing from ts-blast (miss 1)
-    'TSP14_Blood_NA_10X_1_1_S1',
-    'TSP14_Blood_NA_10X_2_1_S16',
-    'TSP14_Blood_NA_10X_2_1_S2',
-    'TSP14_BoneMarrow_NA_10X_1_1_S14',
-    'TSP14_BoneMarrow_NA_10X_2_1_S15',
-    'TSP14_LymphNode_NA_10X_1_1_S8',
-    'TSP14_LymphNode_NA_10X_2_1_S9',
-    'TSP14_Spleen_NA_10X_1_1_S10',
-    'TSP14_Spleen_NA_10X_2_1_S11',
-    'TSP14_Thymus_NA_10X_1_1_S3',
-    'TSP14_Thymus_NA_10X_2_1_S4'
-    'TSP15_Eye_Cornea-etc_10X_1_1_S15' #Excluded TSP15 on purpose (bad annotations)
-    'TSP15_Eye_Neuroretina-etc_10X_3_1_S17'
-    'TSP15_Eye_Sclera-etc_10X_2_1_S16'

#appears in ts object but not in blast (miss 2) 
-     'TSP14_Prostate_NA_10X_1_2',
-     'TSP4_Uterus_Endometrium_10X_1_1',
-     'TSP2_Vasculature_Aorta_10X_2_2',
-     'TSP14_Fat_MAT_10X_1_1',
-     'TSP8_Prostate_NA_10X_1_1',
-     'TSP14_BoneMarrow_NA_10X_3_1',
-     'TSP14_SI_Proximal_10X_1_1',
-     'TSP14_Blood_NA_10X_3_1',
-     'TSP14_Prostate_NA_10X_1_1',
-     'TSP14_Lung_Proximal_10X_1_1',
-     'TSP2_Blood_NA_10X_1_4_5prime',
-     'TSP2_Blood_NA_10X_1_2_SheelaPrep',
-     'TSP14_SI_Distal_10X_1_1',
-     'TSP10_Blood_NA_10X_2_1_Whole',
-     'TSP14_LI_Proximal_10X_1_1',
-     'TSP14_Fat_SCAT_10X_1_1',
-     'TSP14_BoneMarrow_NA_10X_4_1',
-     'TSP2_Vasculature_Aorta_10X_2_1',
-     'TSP14_Blood_NA_10X_1_2',
-     'TSP14_Vasculature_AortaVeneCava_10X_1_1',
-     'TSP14_Tongue_Anterior_10X_1_1',
-     'TSP14_Vasculature_CoronaryArteries_10X_1_1',
-     'TSP14_Bladder_NA_10X_1_2',
-     'TSP2_Blood_NA_10X_1_1_SheelaPrep',
-     'TSP2_Thymus_NA_10X_1_3_5prime',
-     'TSP14_Lung_NA_10X_2_1',
-     'TSP14_Muscle_Diaphragm_10X_1_2',
-     'TSP8_Prostate_NA_10X_1_2',
-     'TSP14_Lung_Distal_10X_1_1',
-     'TSP2_BM_vertebralbody_10X_2_2_5prime'



We have to add these samples to those we found by comparing the dataframe and the Cell Ranger file names:


A total of 44 sample names (miss1 plus miss2) will be excluded. miss0 is already a list of Cell Ranger files that we're missing, so there is no need to exclude those.

In [42]:
miss3 = miss1 + miss2
len(miss3)

44

In [43]:
df2= df[~df['sample'].isin(miss3)]

### Donor 3-14

For each "cr_sample" (a sample whose name has been matched to a Cell Ranger folder), we want to open the Cell Ranger files to get "n_genes_cr" and "n_counts_cr" for each cell, based on the "cell_bc" column shared between the ts-blast dataframe and the Cell Ranger output for that sample. If a barcode found in ts-blast is not found by Cell Ranger, we can assume the sequence comes from outside a droplet, which is no different from an empty droplet. Based on the "n_genes_cr" values, we also build a separate dataframe that records, for each cr_sample, the number of empty droplets versus cell-containing droplets. We set the cutoff to 200 genes per cell, which is a commonly used threshold in the single-cell field (e.g. the scanpy tutorial). Later, we can merge this dataframe with ts-blast and then calculate downstream parameters such as cell enrichment.
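
A minimal sketch of this barcode merge on toy data follows. The dataframes, barcodes, and the `droplet_class` column are assumptions made for the example; only the `cell_bc`, `n_counts_cr`, and `n_genes_cr` column names and the 200-gene cutoff come from this notebook.

```python
import pandas as pd

# Hypothetical BLAST hits and Cell Ranger per-barcode summaries.
hits = pd.DataFrame({"cell_bc": ["AAAC", "AAAG", "TTTT"],
                     "pathogen": ["p1", "p2", "p3"]})
counts = pd.DataFrame({"cell_bc": ["AAAC", "AAAG", "CCCC"],
                       "n_counts_cr": [950, 3, 12],
                       "n_genes_cr": [410, 2, 9]})

# A left merge keeps every hit; barcodes unknown to Cell Ranger get NaN counts.
mer = hits.merge(counts, on="cell_bc", how="left", indicator=True)

min_genes = 200
in_cell = mer["n_genes_cr"] >= min_genes  # NaN compares as False, i.e. "empty"
mer["droplet_class"] = in_cell.map({True: "cell", False: "empty"})
print(mer[["cell_bc", "n_genes_cr", "_merge", "droplet_class"]])
```

The `indicator=True` option adds a `_merge` column, which makes it easy to tell hits whose barcode is absent from Cell Ranger (`left_only`) from hits in low-gene droplets.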

Because the Cell Ranger files for donors 3-16 are in a different folder, I run those donors separately from donors 1 and 2, and concatenate the resulting dataframes into one later.

In [None]:
#looking at donors 3-16 for now
df3  = df2[(df2['cr_sample'].str.startswith('TSP1_')==False) & (df2['cr_sample'].str.startswith('TSP2_')==False)]

# minimum number of genes per droplet to be thought of as a cell-containing droplet
n=200
#ts_blast_with_counts
ts_blast_samples_with_counts =pd.DataFrame({})

# this will be the dataframe that holds information about how many empty and cell droplets there are per sample
num_drops = pd.DataFrame({})
samples_found =[]
num_cells_all=[]
num_empty_all=[]

for sample in df3['cr_sample'].unique():
    for filepath in glob.glob(cranger_dir + '/TSP*'):
        filename = filepath.split('/')[-1] #this is cell ranger folder name
        if filename.startswith('TSP15')==False: #excluding donor 15 and donor 16 is already not in cell ranger folder
            
            if sample in filename:
                print(filename)
                tsblast_sample=df3[(df3['cr_sample']==sample)] #getting the rows of the ts-blast dataframe for this sample
                #reading the raw feature bc matrix 
                raw = sc.read_10x_h5(cranger_dir + filename + '/outs/raw_feature_bc_matrix.h5')
                raw.var_names_make_unique()

                #getting n_counts and n_genes for the raw matrix
                raw.obs['n_counts_cr'] = raw.X.sum(axis=1).A1
                sc.pp.filter_cells(raw, min_genes=0)
                counts = raw.obs.reset_index().rename(columns={'index':'cell_bc'})
                counts=counts.rename(columns={'n_genes':'n_genes_cr'})
                counts['cell_bc'] = counts['cell_bc'].str.replace('-1$', '', regex=True) #removing only the trailing "-1" (str.strip('-1') would strip any '-' or '1' characters from both ends) to match the barcode format of the dataframe

                #now merging output of cell ranger counts with tsblast_sample dataframe based on shared barcodes 
                mer = tsblast_sample.merge(counts, on = 'cell_bc', how='left')
                #counting number of hits with barcodes that can't be identified with cell ranger 
                unidentified_bc=mer[mer['n_counts_cr'].isnull()] 

                #keeping information for each sample to later merge with the main ts_blast dataframe    
                ts_blast_samples_with_counts = pd.concat([ts_blast_samples_with_counts, mer])

                #making classifications of empty and non-empty cells based on how many genes they contain
                num_empty = counts[counts['n_genes_cr']<n].shape[0] + unidentified_bc.shape[0]
                num_cells =  counts[counts['n_genes_cr']>=n].shape[0]
                samples_found.append(sample)
                num_empty_all.append(num_empty)
                num_cells_all.append(num_cells)


num_drops['cr_sample'] = samples_found
num_drops['num_empty'] = num_empty_all
num_drops['num_cells'] = num_cells_all
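
A note on removing the "-1" suffix from Cell Ranger barcodes before merging: `str.strip` (in plain Python and in pandas `.str.strip`) treats its argument as a set of characters to trim from both ends, not as a literal suffix. The sketch below contrasts it with an anchored regular expression; the second string is a made-up example chosen to show where the two differ.

```python
import re

barcode = "CATATTCCAAAGCGGT-1"

# str.strip treats "-1" as the character set {"-", "1"} and trims both ends.
stripped = barcode.strip("-1")

# An anchored regex removes only a literal "-1" at the end of the string.
suffix_removed = re.sub(r"-1$", "", barcode)

# The two agree for A/C/G/T barcodes, but not for arbitrary strings.
other = "1-sample-1"
other_stripped = other.strip("-1")
other_suffix_removed = re.sub(r"-1$", "", other)
print(stripped, suffix_removed, other_stripped, other_suffix_removed)
```

For barcodes made only of A, C, G, and T the two approaches give the same result, so this is a robustness point rather than a change in the numbers.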


saving the resulting dataframes for donors 3-14

In [45]:
ts_blast_samples_with_counts.to_csv(mainDir10 + 'ts_blast_n_genes_cr_donor3-14.csv', index=False)

In [46]:
num_drops.to_csv(mainDir10 + 'num_cells_empty_drops_per_sample_donor3-14.csv', index=False)

In [47]:
ts_blast_samples_with_counts.head(2)

Unnamed: 0,tissue,donor,n_counts,n_genes,compartment,tissue_cell_type,cell_type_tissue,decision,cell,celltype2,...,hit,hit_type,log2_n_counts,log2_n_genes,sample,donor_batch,seq,cr_sample,n_counts_cr,n_genes_cr
0,trachea,TSP6,,,,,,,GTTAGTGGTCTGGTTA_TSP6_Trachea_NA_10X_1_1,,...,yes,extra,,,TSP6_Trachea_NA_10X_1_1,3_16,,TSP6_Trachea_NA_10X_1_1,2.0,2.0
1,trachea,TSP6,,,,,,,GTTAGTGGTCTGGTTA_TSP6_Trachea_NA_10X_1_1,,...,yes,extra,,,TSP6_Trachea_NA_10X_1_1,3_16,,TSP6_Trachea_NA_10X_1_1,2.0,2.0


In [219]:
num_drops.head(2)

Unnamed: 0,cr_sample,num_empty,num_cells
0,TSP6_Trachea_NA_10X_1_1,2288744,8813
1,TSP6_Liver_NA_10X_1_2,2993998,6055


### Donor 1 

In [None]:
df3  = df2[df2['cr_sample'].str.startswith('TSP1_')]

# minimum number of genes per droplet to be thought of as a cell-containing droplet
n=200
#ts_blast_with_counts
ts_blast_samples_with_counts =pd.DataFrame({})

# this will be the dataframe that holds information about how many empty and cell droplets there are per sample
num_drops = pd.DataFrame({})
samples_found =[]
num_cells_all=[]
num_empty_all=[]

for sample in df3['cr_sample'].unique():
    for filepath in glob.glob(mainDir4 + '/TSP*'):
        filename = filepath.split('/')[-1] #this is cell ranger folder name
        if filename.startswith('TSP15')==False: #excluding donor 15 and donor 16 is already not in cell ranger folder
            
            if sample in filename:
                print(filename)
                tsblast_sample=df3[(df3['cr_sample']==sample)] #getting the tsblast dataframe for each sample- just two columns
                #reading the raw feature bc matrix 
                raw = sc.read_10x_h5(mainDir4 + filename + '/outs/raw_feature_bc_matrix.h5')
                raw.var_names_make_unique()

                #getting n_counts and n_genes for the raw matrix
                raw.obs['n_counts_cr'] = raw.X.sum(axis=1).A1
                sc.pp.filter_cells(raw, min_genes=0)
                counts = raw.obs.reset_index().rename(columns={'index':'cell_bc'})
                counts=counts.rename(columns={'n_genes':'n_genes_cr'})
#                 counts['cell_bc'] = counts['cell_bc'].str.strip('-1') # note, this line is unnecessary for donor 1 and 2 whose barcodes have a "-1" and match cell ranger barcode format

                #now merging output of cell ranger counts with tsblast_sample dataframe based on shared barcodes 
                mer = tsblast_sample.merge(counts, on = 'cell_bc', how='left')
                #counting number of hits with barcodes that can't be identified with cell ranger 
                unidentified_bc=mer[mer['n_counts_cr'].isnull()] 

                #keeping information for each sample to later merge with the main ts_blast dataframe    
                ts_blast_samples_with_counts = pd.concat([ts_blast_samples_with_counts, mer])

                #making classifications of empty and non-empty cells based on how many genes they contain
                num_empty = counts[counts['n_genes_cr']<n].shape[0] + unidentified_bc.shape[0]
                num_cells =  counts[counts['n_genes_cr']>=n].shape[0]
                samples_found.append(sample)
                num_empty_all.append(num_empty)
                num_cells_all.append(num_cells)


num_drops['cr_sample'] = samples_found
num_drops['num_empty'] = num_empty_all
num_drops['num_cells'] = num_cells_all


#saving key files
ts_blast_samples_with_counts.to_csv(mainDir10 + 'ts_blast_n_genes_cr_donor1.csv', index=False)
num_drops.to_csv(mainDir10 + 'num_cells_empty_drops_per_sample_donor1.csv', index=False)

### Donor 2

In [None]:
#looking at donor 2
df3  = df2[df2['cr_sample'].str.startswith('TSP2_')]

# minimum number of genes per droplet to be thought of as a cell-containing droplet
n=200
#ts_blast_with_counts
ts_blast_samples_with_counts =pd.DataFrame({})

# this will be the dataframe that holds information about how many empty and cell droplets there are per sample
num_drops = pd.DataFrame({})
samples_found =[]
num_cells_all=[]
num_empty_all=[]

for sample in df3['cr_sample'].unique():
    for filepath in glob.glob(mainDir2 + '/TSP*'):
        filename = filepath.split('/')[-1] #this is cell ranger folder name
        if filename.startswith('TSP15')==False: #excluding donor 15 and donor 16 is already not in cell ranger folder
            
            if sample in filename:
                print(filename)
                tsblast_sample=df3[(df3['cr_sample']==sample)] #getting the tsblast dataframe for each sample- just two columns
                #reading the raw feature bc matrix 
                raw = sc.read_10x_h5(mainDir2 + filename + '/outs/raw_feature_bc_matrix.h5')
                raw.var_names_make_unique()

                #getting n_counts and n_genes for the raw matrix
                raw.obs['n_counts_cr'] = raw.X.sum(axis=1).A1
                sc.pp.filter_cells(raw, min_genes=0)
                counts = raw.obs.reset_index().rename(columns={'index':'cell_bc'})
                counts=counts.rename(columns={'n_genes':'n_genes_cr'})
#                 counts['cell_bc'] = counts['cell_bc'].str.strip('-1') # note, this line is unnecessary for donor 1 and 2 whose barcodes have a "-1" and match cell ranger barcode format

                #now merging output of cell ranger counts with tsblast_sample dataframe based on shared barcodes 
                mer = tsblast_sample.merge(counts, on = 'cell_bc', how='left')
                #counting number of hits with barcodes that can't be identified with cell ranger 
                unidentified_bc=mer[mer['n_counts_cr'].isnull()] 

                #keeping information for each sample to later merge with the main ts_blast dataframe    
                ts_blast_samples_with_counts = pd.concat([ts_blast_samples_with_counts, mer])

                #making classifications of empty and non-empty cells based on how many genes they contain
                num_empty = counts[counts['n_genes_cr']<n].shape[0] + unidentified_bc.shape[0]
                num_cells =  counts[counts['n_genes_cr']>=n].shape[0]
                samples_found.append(sample)
                num_empty_all.append(num_empty)
                num_cells_all.append(num_cells)


num_drops['cr_sample'] = samples_found
num_drops['num_empty'] = num_empty_all
num_drops['num_cells'] = num_cells_all


#saving key files
ts_blast_samples_with_counts.to_csv(mainDir10 + 'ts_blast_n_genes_cr_donor2.csv', index=False)
num_drops.to_csv(mainDir10 + 'num_cells_empty_drops_per_sample_donor2.csv', index=False)
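
The per-sample classification inside the loop can be condensed into a small helper. This is a minimal sketch on toy data, not the code that produced the saved files: `classify_droplets` is a hypothetical name and the barcodes are made up. It assumes, as in the loop, that BLAST rows whose barcode is missing from the raw matrix are added to the empty count.

```python
import pandas as pd

def classify_droplets(counts, hits, n=200):
    """Return (num_empty, num_cells) for one sample.

    counts: one row per cell ranger barcode, columns 'cell_bc' and 'n_genes_cr'
    hits:   BLAST rows for the same sample, column 'cell_bc'
    """
    mer = hits.merge(counts, on='cell_bc', how='left')
    # rows whose barcode is absent from the raw matrix
    # (counted per row, mirroring unidentified_bc.shape[0] in the loop)
    unidentified = mer['n_genes_cr'].isnull().sum()
    num_empty = int((counts['n_genes_cr'] < n).sum() + unidentified)
    num_cells = int((counts['n_genes_cr'] >= n).sum())
    return num_empty, num_cells

# toy data, made up for illustration
counts = pd.DataFrame({'cell_bc': ['AAAC-1', 'AAAG-1', 'AAAT-1'],
                       'n_genes_cr': [5, 200, 1500]})
hits = pd.DataFrame({'cell_bc': ['AAAC-1', 'CCCC-1']})

num_empty, num_cells = classify_droplets(counts, hits, n=200)
print(num_empty, num_cells)
```

Because the cutoff is inclusive (`>= n`), a droplet with exactly 200 genes is counted as cell-containing.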

combining the per-sample droplet counts for donor 1, donor 2 and donors 3-14, then saving them

In [146]:
don3on = pd.read_csv(mainDir10 + 'num_cells_empty_drops_per_sample_donor3-14.csv')
don1 = pd.read_csv(mainDir10 + 'num_cells_empty_drops_per_sample_donor1.csv')
don2 = pd.read_csv(mainDir10 + 'num_cells_empty_drops_per_sample_donor2.csv')


drop_counts=pd.concat([don3on, don1, don2])

In [57]:
drop_counts.to_csv(mainDir10 + 'drop_counts_dons1-14.csv', index=False)

combining the dataframes for donors 1-14, which carry three additional columns: 'n_counts_cr', 'n_genes_cr' and 'cr_sample'

In [58]:
d1 = pd.read_csv(mainDir10 + 'ts_blast_n_genes_cr_donor1.csv')
d2 = pd.read_csv(mainDir10 + 'ts_blast_n_genes_cr_donor2.csv')
d3_14 = pd.read_csv(mainDir10 + 'ts_blast_n_genes_cr_donor3-14.csv')

dons=pd.concat([d1,d2,d3_14])

Before saving: the dons dataframe, which contains the cell ranger info, isn't the entire dataset. There are some samples, such as those from donor 16, for which we have BLAST info but no cell ranger data, and I'm not going to exclude them. Instead, I will mark the cell ranger containing part of the dataframe (dons) with a column "cr" == 'yes', and any samples for which there are no cell ranger count data I will mark with 'no'. Within the cr == 'yes' part of the final dataframe, I will also fill NaN values with 0; these are barcodes of a sample that weren't detected by cell ranger and might as well be empty droplets. In the cr == 'no' part, NaN will mean that cell ranger data isn't available.
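
The labelling scheme described here can be illustrated on toy data. This is only a sketch: `dons_toy`, `sub_toy` and `all_toy` are hypothetical stand-ins for dons, sub and alldons, and the values are made up.

```python
import pandas as pd

# toy stand-ins; column names follow the notebook, values are made up
dons_toy = pd.DataFrame({'sample': ['TSP1_a', 'TSP1_a'],
                         'cell_bc': ['AAAC-1', 'CCCC-1'],
                         'n_genes_cr': [350.0, None]})   # None: barcode not found by cell ranger
sub_toy = pd.DataFrame({'sample': ['TSP16_b'], 'cell_bc': ['GGGG-1']})

# fill only the cell ranger processed part, then label both parts
dons_toy['n_genes_cr'] = dons_toy['n_genes_cr'].fillna(0)
dons_toy['cr'] = 'yes'
sub_toy['cr'] = 'no'

all_toy = pd.concat([dons_toy, sub_toy], ignore_index=True)
print(all_toy)
```

After the concatenation, a 0 in `n_genes_cr` means "processed by cell ranger but barcode not detected", while NaN means "no cell ranger data for this sample".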

number of rows whose barcodes are not found in the cell ranger raw matrix


In [107]:
dons[dons['n_counts_cr'].isnull()].shape[0]

472654
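
To see which samples contribute these unidentified barcodes, a per-sample breakdown could be computed before filling the NaN values. The following is a sketch on a made-up toy dataframe; `toy` and `missing_per_sample` are hypothetical names.

```python
import pandas as pd

# toy stand-in for dons before fillna; values are made up
toy = pd.DataFrame({'cr_sample': ['s1', 's1', 's2', 's2', 's2'],
                    'n_counts_cr': [12.0, None, None, None, 40.0]})

# number of rows per sample whose barcode has no cell ranger counts
missing_per_sample = toy['n_counts_cr'].isnull().groupby(toy['cr_sample']).sum()
print(missing_per_sample)
```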

In [108]:
dons['n_genes_cr'] = dons['n_genes_cr'].fillna(0)
dons['n_counts_cr'] = dons['n_counts_cr'].fillna(0)


adding a column so that the dataframe can later be separated by whether it was cell ranger processed or not

In [122]:
dons['cr'] = 'yes'  # a scalar is broadcast to every row

now getting the samples that appear in the original dataframe (df) but not in dons (the cell ranger processed part)


In [111]:
s1 = set(df[~df['sample'].isin(dons['sample'].unique().tolist())]['sample'].value_counts(dropna=False).index.tolist())

Note: apart from the samples we had already identified as missing (miss3), the only samples absent from dons are those from donor 16, for which we already know we don't have cell ranger counts.

In [112]:
s1 - set(miss3)

{'TSP16_Eye_external_10X_1_1',
 'TSP16_Eye_lacrimalgland_10X_1_1',
 'TSP16_Heart_atrium_10X_1_1_left',
 'TSP16_Heart_ventricle_10X_1_2_left'}
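
The same check can be written as an assertion, so that it fails loudly if a sample from another donor ever goes missing. This is a sketch with toy stand-ins: `s1_toy` and `miss_toy` replace s1 and miss3, and the TSP5 sample name is made up.

```python
# toy stand-ins for s1 and miss3; the TSP5 name is made up for illustration
s1_toy = {'TSP16_Eye_external_10X_1_1', 'TSP5_Eye_NA_10X_1_1'}
miss_toy = ['TSP5_Eye_NA_10X_1_1']

# samples missing from dons that were not already known to be missing
unexplained = s1_toy - set(miss_toy)

# the donor is the part of the sample name before the first underscore
donors = {name.split('_')[0] for name in unexplained}
assert donors <= {'TSP16'}, f"unexpected donors without cell ranger data: {donors}"
print(sorted(unexplained))
```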

Next, get the part of the original df that contains only the samples in s1 (those missing from dons).

In [123]:
sub=df[df['sample'].isin(list(s1))].copy()  # copy so the new column is not set on a slice of df
sub['cr']='no'

In [125]:
sub.shape[0] + dons.shape[0]

2406150

In [144]:
alldons=pd.concat([dons, sub])

In [145]:
alldons.shape[0]

2406150

now merging the drop_counts with alldons dataframe so that we can have the number of empty and cell containing droplets for each sample. 

In [148]:
drop_counts = drop_counts.rename(columns={'num_empty':'num_empty_drop_per_sample', 'num_cells':'num_cell_drop_per_sample'})

In [149]:
drop_counts

| | cr_sample | num_empty_drop_per_sample | num_cell_drop_per_sample |
|---|---|---|---|
| 0 | TSP6_Trachea_NA_10X_1_1 | 2288744 | 8813 |
| 1 | TSP6_Liver_NA_10X_1_2 | 2993998 | 6055 |
| 2 | TSP6_Trachea_NA_10X_2_1 | 2174587 | 11347 |
| 3 | TSP6_Trachea_NA_10X_1_2 | 2234097 | 31565 |
| 4 | TSP6_Liver_NA_10X_1_1 | 2972298 | 17400 |
| ... | ... | ... | ... |
| 27 | TSP2_Vasculature_Aorta_10X_1_2_S27 | 1309550 | 8062 |
| 28 | TSP2_BM_vertebralbody_10X_1_2_5prime_S35 | 484376 | 1645 |
| 29 | TSP2_Muscle_diaphragm_10X_1_2_S13 | 839038 | 3693 |
| 30 | TSP2_Thymus_NA_10X_1_4_5prime_S33 | 603041 | 1352 |


In [150]:
alldons = alldons.merge(drop_counts, on='cr_sample', how='outer')
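
With `how='outer'`, any sample present in drop_counts but absent from alldons adds a row whose BLAST columns are empty, and any sample without droplet counts keeps NaN in the two new columns. One way to inspect this is pandas' `indicator` and `validate` options for `merge`. The sketch below uses made-up toy frames (`hits_toy` and `drops_toy` are hypothetical names), not the real data.

```python
import pandas as pd

# toy stand-ins for alldons and drop_counts; values are made up
hits_toy = pd.DataFrame({'cr_sample': ['s1', 's1', 's2'],
                         'cell_bc': ['A', 'B', 'C']})
drops_toy = pd.DataFrame({'cr_sample': ['s1', 's3'],
                          'num_empty_drop_per_sample': [100, 50],
                          'num_cell_drop_per_sample': [10, 5]})

# validate raises if a sample appears more than once in drops_toy;
# indicator adds a '_merge' column saying where each row came from
merged = hits_toy.merge(drops_toy, on='cr_sample', how='outer',
                        validate='many_to_one', indicator=True)
print(merged['_merge'].value_counts())
```

Rows tagged `right_only` correspond to samples with droplet counts but no BLAST rows, and rows tagged `left_only` to samples without droplet counts.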

### Final dataframe

In [152]:
alldons.to_csv(mainDir10 + 'all_dons_cr_processed_postreview2022.csv', index=False)
