In this notebook, I perform additional processing of the SIMBA outputs for various 10X datasets as well as the EHTM (bulk tissue) dataset. The key steps include:
- adding taxonomic/lineage information
- performing QC based on the alignment length and percent identity of each hit
- merging cell barcodes with those found in the Tabula Sapiens (TS) objects to get annotation data
- formatting the TS objects so that cell type names are shortened and presented more neatly in subsequent figures
- removing species found in negative controls from the bulk tissue dataset
- contaminants in the 10X dataset are removed in a subsequent notebook


### Loading libraries

In [1]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 200
from matplotlib import cm

import os
import glob
import re
import itertools
from collections import Counter
import math
import random
from random import randrange
import string
import subprocess

import numpy as np
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import seaborn as sns

cmap = sns.cm.rocket_r
sns.set_style("white")

import anndata
from anndata import read_h5ad
from anndata import AnnData

import scanpy as sc
sc.logging.print_version_and_date()
sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)


Running Scanpy 1.5.1, on 2022-12-14 08:51.


### Directories

In [2]:
mainDir = '/oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/'
mainDir2 = mainDir + 'analyze/'
mainDir3 = '/oak/stanford/groups/quake/gita/raw/tab2_20200508/'
mainDir4 = '/oak/stanford/groups/quake/gita/raw/tab2_20200508/tab2microbial/thirdAnalysis/'
mainDir5 = '/oak/stanford/groups/quake/gita/raw/organ_20200204/secondAnalysis/'
mainDir6 = '/oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/analyze/'
mainDir7 = '/oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/'
mainDir8 = '/oak/stanford/groups/quake/gita/raw/tab1_20200407/controlAnalysis/'
mainDir9 = '/oak/stanford/groups/quake/gita/raw/organ_20191025/controlAnalysis/'
mainDir10 = '/oak/stanford/groups/quake/gita/raw/tab3-14_20210420/all/'
dbDir = '/oak/stanford/groups/quake/gita/raw/database/taxonomyNCBI20200125/'


### Taxonomy

I used the ncbitax2lin tool (https://github.com/zyxue/ncbitax2lin), which is available after activating the "taxonomy" conda environment (Python 3.7).
This tool converts taxon ids to lineages and saves the output as a dataframe.


In [145]:
tax = pd.read_csv(dbDir + 'ncbi_lineages_2021-01-26.csv')
#want to take only the following columns from the lineage dataframe tax 
tax_short=tax[['tax_id','superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']]




### Functions

In [5]:
def status(x):
    """
    Bin a sample by its fraction_complete value.
    A sample that is at least 80% complete is called done.
    """
    if x == 0:
        status = 'empty'
    elif x < .80:
        status = 'partial'
    elif x >= .80:
        status = 'done'
    else:
        #only reached when x is NaN, since every comparison with NaN is False
        status = 'not started'

    return status
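
As a quick illustration of how `status` bins samples, the sketch below applies it to a small, made-up `fraction_complete` column. The sample names and values are hypothetical; the function body mirrors the one above.

```python
import numpy as np
import pandas as pd

def status(x):
    """Bin a sample by its fraction of completed work (>= 0.80 counts as done)."""
    if x == 0:
        return 'empty'
    elif x < 0.80:
        return 'partial'
    elif x >= 0.80:
        return 'done'
    return 'not started'  # reached only for NaN

# hypothetical progress table
progress = pd.DataFrame({
    'sample': ['s1', 's2', 's3', 's4'],
    'fraction_complete': [0.0, 0.5, 0.8, np.nan],
})
progress['status'] = progress['fraction_complete'].apply(status)
print(progress)
```

A missing value falls through to 'not started' because none of the three comparisons is true for NaN.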

### Creating a contamination blacklist for the EHTM dataset
The blacklist consists of the negative controls from the two bulk DNA-seq runs I performed on tissues. I merged their tax_ids into a single taxid blacklist.


In [6]:
#getting the hits found in the negative controls of the bulk DNA seq study of 10 donors (5 negative controls)
negControl_tissue = pd.read_csv(mainDir9 + "control_concatenated.tab", delimiter='\t')
negControl_tissue.columns = ['seqName','refName','pathogen','bitscore','pident','evalue','gapopen','qstart','qend','sstart',
'send','length','mismatch','staxids']
#staxids can hold several ';'-separated taxids; keep the first one
negControl_tissue['tax_id'] = negControl_tissue['staxids'].apply(lambda x: int(str(x).split(';')[0]))

#getting lineage information based on taxids
negControl_tissue_lin=negControl_tissue.merge(tax_short, on='tax_id', how='left')
negControl_tissue_lin['study']='tissue'


#saving all control reads and their lineages
negControl_tissue_lin.to_csv(mainDir8 + 'tissue_controls_03192021.csv',index=False)

#saving one row per contaminating species
negControl_tissue_lin_unique = negControl_tissue_lin.drop_duplicates(subset='species', keep='first')
negControl_tissue_lin_unique.to_csv(mainDir8 + 'tissue_controls_uniqueSpecies_03192021.csv',index=False)
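
One thing worth double-checking in the cell above: `pd.read_csv` treats the first line of a file as the header unless told otherwise, so if `control_concatenated.tab` has no header row, its first hit would be consumed as column names before the columns are renamed. Whether the real file has a header is an assumption to verify. The sketch below uses a made-up two-line table and passes `header=None` with `names=` so that every row is kept.

```python
import io
import pandas as pd

cols = ['seqName', 'refName', 'pathogen', 'bitscore', 'pident', 'evalue', 'gapopen',
        'qstart', 'qend', 'sstart', 'send', 'length', 'mismatch', 'staxids']

# two made-up, headerless BLAST hits (tab-separated)
raw = (
    "read1\tref1\tOrganism A\t180\t98.5\t1e-40\t0\t1\t100\t5\t104\t100\t1\t562;573\n"
    "read2\tref2\tOrganism B\t150\t92.0\t1e-30\t0\t1\t95\t10\t104\t95\t7\t1280\n"
)

hits = pd.read_csv(io.StringIO(raw), sep='\t', header=None, names=cols)
# keep the first taxid when several are reported, separated by ';'
hits['tax_id'] = hits['staxids'].astype(str).str.split(';').str[0].astype('int64')
print(hits[['seqName', 'staxids', 'tax_id']])
```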




In [7]:
#reading unique contaminating species
tissue_controls=pd.read_csv(mainDir8 + 'tissue_controls_uniqueSpecies_03192021.csv')
#keeping only control hits with >=90% identity and an alignment length >=90
tissue_controls = tissue_controls[(tissue_controls['pident']>=90) & (tissue_controls['length']>=90)]
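
Because the unique-species table keeps only the first row per species, applying the `pident`/`length` cutoffs afterwards judges each species by that single row. An alternative, sketched below on made-up data, filters first and de-duplicates second. This is not what the notebook does, and which order is appropriate depends on how strict the blacklist should be; the helper name `qc` is hypothetical.

```python
import pandas as pd

controls = pd.DataFrame({
    'species': ['sp. A', 'sp. A', 'sp. B'],
    'pident':  [85.0,    99.0,    95.0],
    'length':  [100,     120,     60],
})

def qc(df, min_pident=90, min_length=90):
    """Keep hits meeting both the identity and alignment-length cutoffs."""
    return df[(df['pident'] >= min_pident) & (df['length'] >= min_length)]

# order used in the notebook: one row per species, then QC
dedupe_then_filter = qc(controls.drop_duplicates(subset='species', keep='first'))
# alternative order: QC on every hit, then one row per species
filter_then_dedupe = qc(controls).drop_duplicates(subset='species', keep='first')

print(dedupe_then_filter['species'].tolist())
print(filter_then_dedupe['species'].tolist())
```

In this toy table the first order yields an empty blacklist, while the second keeps "sp. A" because one of its hits passes both cutoffs.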


# C. Processing 10X TSP1 & TSP2, and EHTM SIMBA outputs (steps 1-8)
The data are divided across the viral, bacterial, and fungal BLAST branches.

#### 1. reading the concatenated SIMBA output dataframes 

In [190]:
#10X donor 1
vir=pd.read_csv(mainDir + 'virNTblastn_concatenated.csv')
#10x donor 2
vir2 = pd.read_csv(mainDir4 + 'virNTblastn_concatenated.csv') 
#tissues (bulk dna sequencing)
vir3 = pd.read_csv(mainDir5 + 'virNTblastn_concatenated.csv') 

In [191]:
bac=pd.read_csv(mainDir + 'micoNT_blastn_concatenated.csv')
bac2 = pd.read_csv(mainDir4 + 'micoNT_blastn_concatenated.csv')
#tissues (bulk dna sequencing)
bac3 = pd.read_csv(mainDir5 + 'micoNT_blastn_concatenated.csv') 

In [192]:
fung=pd.read_csv(mainDir + 'fungi_NT_blastn_concatenated.csv')
fung2=pd.read_csv(mainDir4 + 'fungi_NT_blastn_concatenated.csv')
#tissues (bulk dna sequencing)
fung3=pd.read_csv(mainDir5 + 'fungi_NT_blastn_concatenated.csv')


In [193]:
#the pre-filtered 10x dataset
don1_10x=pd.concat([vir, bac, fung])
don2_10x=pd.concat([vir2, bac2, fung2])
prefil_10x = pd.concat([don1_10x,don2_10x])

#### 2. Getting lineage information based on taxids for 10x (donor 1 & 2) and EHTM validation (bulk DNA seq) dataset

In [194]:
#want to take only the following columns from the lineage dataframe tax 
tax_short=tax[['tax_id','superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']]

vir = vir.rename(columns={'staxids':'tax_id'})
vir2 = vir2.rename(columns={'staxids':'tax_id'})
vir3=vir3.rename(columns={'staxids':'tax_id'})

bac = bac.rename(columns={'staxids':'tax_id'})
bac2 = bac2.rename(columns={'staxids':'tax_id'})
bac2.tax_id=bac2.tax_id.apply(lambda x: str(x).split(';')[0])
bac2.tax_id = bac2.tax_id.astype('int64')
bac3 = bac3.rename(columns={'staxids':'tax_id'})
bac3.tax_id = bac3.tax_id.astype('int64')

fung =fung.rename(columns={'staxids':'tax_id'})
fung2 = fung2.rename(columns={'staxids':'tax_id'})
fung3 = fung3.rename(columns={'staxids':'tax_id'})

fung.tax_id = fung.tax_id.apply(lambda x: str(x).split(';')[0])
fung2.tax_id=fung2.tax_id.apply(lambda x: str(x).split(';')[0])
fung3.tax_id=fung3.tax_id.apply(lambda x: str(x).split(';')[0])


fung.tax_id = fung.tax_id.astype('int64')
fung2.tax_id = fung2.tax_id.astype('int64')
fung3.tax_id = fung3.tax_id.astype('int64')

#merging each dataframe (for which we have taxids) with tax_short to get lineage information
vir_lin=vir.merge(tax_short, on='tax_id', how='left')
vir_lin2=vir2.merge(tax_short, on='tax_id', how='left')
vir_lin3=vir3.merge(tax_short, on='tax_id', how='left')

bac_lin=bac.merge(tax_short, on='tax_id', how='left')
bac_lin2=bac2.merge(tax_short, on='tax_id', how='left')
bac_lin3=bac3.merge(tax_short, on='tax_id', how='left')

fung_lin=fung.merge(tax_short, on='tax_id', how='left')
fung_lin2=fung2.merge(tax_short, on='tax_id', how='left')
fung_lin3=fung3.merge(tax_short, on='tax_id', how='left')


don1_10x_prefil=pd.concat([vir_lin, bac_lin, fung_lin])
don2_10x_prefil=pd.concat([vir_lin2, bac_lin2, fung_lin2])
prefil_10x_withLineage = pd.concat([don1_10x_prefil,don2_10x_prefil])

prefil_10x_withLineage.to_csv(mainDir + 'don1_don2_10x_blastn_nt_pre_filteration_withLineage.csv')
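
The rename, split, and cast steps above repeat for each table. A small helper could apply them uniformly; the sketch below is one way such a refactor might look, with a hypothetical function name `add_lineage` and toy data rather than the real SIMBA tables.

```python
import pandas as pd

def add_lineage(hits, lineage, taxid_col='staxids'):
    """Rename the taxid column, keep the first id of ';'-joined lists, and left-merge lineage."""
    out = hits.rename(columns={taxid_col: 'tax_id'})
    out['tax_id'] = out['tax_id'].astype(str).str.split(';').str[0].astype('int64')
    return out.merge(lineage, on='tax_id', how='left')

# toy lineage table and hits (values are made up)
lineage = pd.DataFrame({'tax_id': [1, 2], 'species': ['sp. A', 'sp. B']})
hits = pd.DataFrame({'seqName': ['r1', 'r2', 'r3'], 'staxids': ['1;2', 2, 99]})

merged = add_lineage(hits, lineage)
# taxids absent from the lineage table end up with NaN lineage columns
print(merged)
print('hits without lineage:', merged['species'].isna().sum())
```

Counting the rows without lineage after the left merge is a cheap way to see how many taxids are missing from the lineage table.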


#### 3. concatenating viral, bacterial, and fungal dataframes for don1, don2 (both 10x), and EHTM validation dataset

In [195]:
don1_10x=pd.concat([vir_lin, bac_lin, fung_lin])
don2_10x=pd.concat([vir_lin2, bac_lin2, fung_lin2])
                 
                 

#### 4. eliminating the contaminant species found in the EHTM bulk tissue controls


In [196]:
tis_19=pd.concat([vir_lin3, bac_lin3, fung_lin3])

tis_19_cont_free = tis_19[~tis_19['species'].isin(tissue_controls.species.to_list())]



NameError: name 'tissue_controls' is not defined
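
The `NameError` above means this cell ran in a session where `tissue_controls` (defined in the earlier blacklist cell) had not been created yet, so that cell needs to run first. Separately, `isin` can also match missing values when the blacklist itself contains NaN, which would drop hits that lack a species label. The sketch below, with made-up species names, removes NaN from the blacklist before filtering; it is a suggestion, not the notebook's code.

```python
import numpy as np
import pandas as pd

hits = pd.DataFrame({'species': ['sp. A', 'sp. B', np.nan]})
controls = pd.DataFrame({'species': ['sp. A', np.nan]})

# build the blacklist from labelled species only
blacklist = controls['species'].dropna().unique().tolist()
cont_free = hits[~hits['species'].isin(blacklist)]
print(cont_free)
```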

#### 5. taking only reads from viruses, bacteria, and fungi along with more reformatting steps


In [None]:
#na=False treats missing lineage as a non-match; .copy() lets us add columns later without warnings
don1_10x_fil2 = don1_10x_prefil[(don1_10x_prefil['superkingdom'].str.contains('Bacteria|Viruses', case=False, na=False)) |
                 (don1_10x_prefil['phylum'].str.contains('mycota', case=False, na=False))].copy()

don2_10x_fil2 = don2_10x_prefil[(don2_10x_prefil['superkingdom'].str.contains('Bacteria|Viruses', case=False, na=False)) |
                 (don2_10x_prefil['phylum'].str.contains('mycota', case=False, na=False))].copy()


In [53]:
tis_19_cont_free_fil2 = tis_19_cont_free[(tis_19_cont_free['superkingdom'].str.contains('Bacteria|Viruses', case=False, na=False)) |
                 (tis_19_cont_free['phylum'].str.contains('mycota', case=False, na=False))].copy()
#sample names are split on '_' into donor, tissue, and fraction
tis_19_cont_free_fil2['donor'] = tis_19_cont_free_fil2['sample'].apply(lambda x: x.split('_')[0])
tis_19_cont_free_fil2['tissue'] = tis_19_cont_free_fil2['sample'].apply(lambda x: x.split('_')[1])
tis_19_cont_free_fil2['fraction'] = tis_19_cont_free_fil2['sample'].apply(lambda x: x.split('_')[2])
tis_19_cont_free_fil2['sample2'] = tis_19_cont_free_fil2['donor']+ '_' + tis_19_cont_free_fil2['tissue']
#dropping the control samples themselves
tis_19_cont_free_fil2 = tis_19_cont_free_fil2[~tis_19_cont_free_fil2['sample'].str.contains('control|hood_water')]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
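
The warning above is raised because new columns are assigned to a DataFrame produced by boolean filtering. Taking an explicit `.copy()` of the filtered frame makes the intent unambiguous. A minimal sketch follows, using made-up sample names in the donor_tissue_fraction pattern that the splitting code implies:

```python
import pandas as pd

df = pd.DataFrame({
    'sample': ['d1_lung_a', 'd2_liver_b', 'control_water_x'],
    'superkingdom': ['Bacteria', 'Viruses', 'Eukaryota'],
})

# .copy() gives an independent frame, so the assignments below act on it directly
fil = df[df['superkingdom'].str.contains('Bacteria|Viruses', case=False, na=False)].copy()
parts = fil['sample'].str.split('_', expand=True)
fil['donor'] = parts[0]
fil['tissue'] = parts[1]
fil['fraction'] = parts[2]
fil['sample2'] = fil['donor'] + '_' + fil['tissue']
print(fil)
```

Splitting once with `expand=True` also avoids re-splitting the sample name for every new column.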
  """Entry point for launching an IPython kernel.

In [None]:
don1_10x_fil2['cell'] = don1_10x_fil2.seqName.apply(lambda x: x.split('_')[1])
don1_10x_fil2['umi'] = don1_10x_fil2.seqName.apply(lambda x: x.split('_')[2])

don2_10x_fil2['cell'] = don2_10x_fil2.seqName.apply(lambda x: x.split('_')[1])
don2_10x_fil2['umi'] = don2_10x_fil2.seqName.apply(lambda x: x.split('_')[2])


don1_10x_fil2['don'] = ['don1'] * don1_10x_fil2.shape[0]
don2_10x_fil2['don'] = ['don2'] * don2_10x_fil2.shape[0]

blastdb_don1and2=pd.concat([don1_10x_fil2, don2_10x_fil2])
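
The `cell` and `umi` columns come from the second and third underscore-separated fields of `seqName`, i.e. a read name laid out as read id, cell barcode, UMI. A minimal sketch of that parsing on a single example read name:

```python
import pandas as pd

reads = pd.DataFrame({'seqName': [
    'A00111:327:HL57HDSXX:4:1202:22941:23719_TAGGTCAGAGACATCA_AGAAATAACGCC',
]})

# split once, then pick the barcode and UMI fields
fields = reads['seqName'].str.split('_', expand=True)
reads['cell'] = fields[1]
reads['umi'] = fields[2]
reads['don'] = 'don1'  # a scalar broadcasts to every row
print(reads[['cell', 'umi', 'don']])
```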


First, let's get rid of any possible duplicates (multiple hits for the same query sequence).


In [None]:
print(blastdb_don1and2.shape[0])
blastdb_don1and2 = blastdb_don1and2.drop_duplicates('seqName')
print(blastdb_don1and2.shape[0])
#dataframe size is reduced from 839049 to 838047
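
`drop_duplicates('seqName')` keeps whichever hit appears first for each read. If the concatenated table is not already ordered by hit quality, sorting by `bitscore` beforehand keeps the best-scoring hit instead. The sketch below illustrates that alternative on made-up rows; it is not the notebook's procedure.

```python
import pandas as pd

hits = pd.DataFrame({
    'seqName':  ['r1', 'r1', 'r2'],
    'species':  ['sp. A', 'sp. B', 'sp. C'],
    'bitscore': [120.0, 180.0, 150.0],
})

# highest bitscore first, keep one row per read, then restore the original row order
best = (hits.sort_values('bitscore', ascending=False)
            .drop_duplicates('seqName', keep='first')
            .sort_index())
print(best)
```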

### Let's load and clean up the 10X TSP1 and TSP2 objects


In [197]:
#Loading and modifying 10X TSP1 and TSP2 objects
tsp1and2_10x = sc.read_h5ad(mainDir3 +'objects/totalVelo_withCellCycleScores_mitoUnfiltered_Dec2020.h5ad')

tsp1and2_10x.obs = tsp1and2_10x.obs.reset_index()
tsp1and2_10x.obs.rename(columns={'X10X_barcode':'cell', 'compartment_pred':'compartment',
                                 'Propagated.Annotation':'celltype', 'X10X_run':'sample'}, inplace=True)

tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype']
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('double-positive, alpha-beta thymocyte','alpha-beta thymocyte')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('CD4-positive, alpha-beta T cell','CD4+ T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('endothelial cell of artery','artery endo. c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('endothelial cell of vascular tree','vascular endo. c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('CD8-positive, alpha-beta T cell','CD8+, T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('endothelial cell of lymphatic vessel','lymph endo.c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('vein endothelial cell','vein endo. c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('naive thymus-derived CD4-positive, alpha-beta T cell','naive CD4+ T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('skeletal muscle satellite stem cell','skeletal muscle stem c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('naive thymus-derived CD4+ T cell','naive CD4+ T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('capillary endothelial cell','endo. capillary c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('endothelial cell','endo. c.')
tsp1and2_10x.obs['celltype2'] = tsp1and2_10x.obs['celltype2'].apply(lambda x: x.replace('cell', 'c.'))
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('CD8-alpha-alpha-positive, alpha-beta intraepithelial T c.','CD8+ intraepithelial T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('CD4-positive, CD25-positive, alpha-beta regulatory T c.','CD4+ CD25+ reg. T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('CD4-positive, alpha-beta cytotoxic T c','CD4+ CD25+ cytotoxic T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('CD34-positive, CD38-negative multipotent progenitor c.','CD34+ CD38- multipotent T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('intestinal secretory progenitor','intestinal secr. prog.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('intestinal transient amplifying c.','intestinal trans. ampl. c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('intestinal crypt stem c. distal','intestinal crypt stem c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('intestinal crypt stem c. proximal','intestinal crypt stem c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('medullary thymic epithelial c.','medullary thymic epi. c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('vascular associated smooth muscle c.','vasc. smooth muscle c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('skeletal muscle stem c. 3','skeletal muscle stem c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('skeletal muscle stem c. 2','skeletal muscle stem c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('skeletal muscle stem c. 1','skeletal muscle stem c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('skeletal muscle stem c. 4','skeletal muscle stem c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('mesenchymal stem c. 1','mesenchymal stem c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('mesenchymal stem c. 2','mesenchymal stem c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('epithelial c. of alveolus of lung','lung alveolus epi. c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('myeloid dendritic c., human','myeloid dendritic c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('naive thymus-derived CD4+ T c.','naive CD4+ T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('CD4-positive, alpha-beta memory T c.','CD4+ memory T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('plasmacytoid dendritic c., human','plasmacytoid dendritic c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('smooth muscle c. 1','smooth muscle c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('smooth muscle c. 2','smooth muscle c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('smooth muscle c. 3','smooth muscle c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('pericyte c. 3','pericyte c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('pericyte c. 2','pericyte c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('pericyte c. 1','pericyte c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('activated CD8+, T c., human','activated CD8+, T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('serous c. of epithelium of bronchus','bronchus epi. serous c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('CD8-positive, alpha-beta memory T c.','CD8+ memory T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('naive thymus-derived CD4+ T c.','naive CD4+ T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('pericyte c. 4','pericyte c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('mesenchymal c. 3','mesenchymal c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('mesenchymal c. 1','mesenchymal c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('mesenchymal c. 2','mesenchymal c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('mesenchymal stem c. 3','mesenchymal stem c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('macrophage 2','macrophage')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('macrophage 1','macrophage')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('group 3 innate lymphoid c.','lymphoid c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('CD4-positive helper T c.','CD4+ helper T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('naive thymus-derived CD8+, T c.','naive CD8+ T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('naive thymus-derived CD4+ T c.','naive CD4+ T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('dendritic c., human','dendritic c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('activated CD8+, T c., human', 'activated CD8+ T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('CD8+, T c.','CD8+ T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].str.replace('CD4+ CD25+ cytotoxic T c..','CD4+ CD25+ cytotoxic T c.')
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].apply(lambda x: x.replace(',',''))
tsp1and2_10x.obs['celltype2'] =tsp1and2_10x.obs['celltype2'].apply(lambda x: x.replace('thymus-derived',' '))
tsp1and2_10x.obs['tissue'] =tsp1and2_10x.obs['tissue'].apply(lambda x: x.replace('PancreasExocrine', 'Pancreas'))
tsp1and2_10x.obs['tissue'] =tsp1and2_10x.obs['tissue'].apply(lambda x: x.replace('PancreasEndocrine', 'Pancreas'))


tsp1and2_10x.obs['log2_n_genes']= np.log2(tsp1and2_10x.obs['n_genes'])
tsp1and2_10x.obs['log2_n_counts']= np.log2(tsp1and2_10x.obs['n_counts'])
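#selecting 'compartment' returns a DataFrame here because obs holds two columns with that name
#(one of them created by the rename above); keep the second one
comp=tsp1and2_10x.obs['compartment'].iloc[:,1].tolist()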
tsp1and2_10x.obs['compartment2']= comp
tsp1and2_10x.obs.drop(columns=['compartment'],inplace=True)
tsp1and2_10x = tsp1and2_10x[~tsp1and2_10x.obs['compartment2'].str.contains('PNS')]


tsobj_don1and2=tsp1and2_10x.obs[['cell','tissue','log2_n_counts','n_counts', 'log2_n_genes','n_genes', 'celltype2',
                                 'compartment2','donor','decision','sample']]
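
The long chain of `str.replace` calls above can be hard to audit, and in pandas versions where `str.replace` defaults to regular-expression matching, characters such as `+` and `.` in a pattern act as regex operators rather than literals. One alternative, sketched below with a handful of the substitutions, is an ordered list of literal replacements applied with `regex=False`. The helper name `shorten` is hypothetical, and the output of this sketch is not guaranteed to match the labels produced by the cell above.

```python
import pandas as pd

# ordered (long form, short form) pairs; specific patterns come before general ones
REPLACEMENTS = [
    ('CD4-positive, alpha-beta T cell', 'CD4+ T c.'),
    ('CD8-positive, alpha-beta T cell', 'CD8+ T c.'),
    ('capillary endothelial cell', 'endo. capillary c.'),
    ('endothelial cell', 'endo. c.'),
    ('cell', 'c.'),
]

def shorten(labels, replacements=REPLACEMENTS):
    """Apply literal (non-regex) replacements in order to a Series of labels."""
    out = labels.astype(str)
    for old, new in replacements:
        out = out.str.replace(old, new, regex=False)
    return out

celltypes = pd.Series(['CD4-positive, alpha-beta T cell', 'capillary endothelial cell',
                       'vein endothelial cell', 'B cell'])
short = shorten(celltypes)
print(short.tolist())
```

Keeping the pairs in one list makes the order of application explicit, which matters because the general `cell` to `c.` substitution changes what later patterns can match.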



# Putting together BLAST dataframe (blastdb_don1and2) and TS object (tsobj_don1and2) donors 1 and 2
There are three categories for the merged BLAST outputs (blastdb_don1and2) and TS object (tsobj_don1and2). \
**1) intra** --> annotated cells that have significant **intra**cellular hits \
**2) nothing** --> annotated cells that don't have any significant intracellular hits (based on e-value); labelled `none` in the `hit_type` column \
**3) extra** --> significant hits found either **extra**cellularly or in unannotated cells, i.e. hits whose cell barcode does not match an annotated cell.

How many of each category are in the dataset (prior to removal of contamination)?
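
The three categories can also be derived in one step with an outer merge and pandas' `indicator` option. The sketch below uses made-up barcodes and is an alternative illustration, not the procedure used in the following cells (which build each category separately with `isin`).

```python
import pandas as pd

# toy BLAST hits and toy annotated cells
hits = pd.DataFrame({'cell': ['AAA', 'AAA', 'CCC'], 'seqName': ['r1', 'r2', 'r3']})
cells = pd.DataFrame({'cell': ['AAA', 'BBB'], 'celltype2': ['B c.', 'T c.']})

# _merge records whether each row came from hits, cells, or both
merged = hits.merge(cells, on='cell', how='outer', indicator=True)
merged['hit_type'] = merged['_merge'].astype(str).map(
    {'both': 'intra', 'left_only': 'extra', 'right_only': 'none'})
merged['hit'] = merged['hit_type'].map({'intra': 'yes', 'extra': 'yes', 'none': 'no'})
print(merged[['cell', 'seqName', 'hit', 'hit_type']])
```

Unlike the notebook's approach, intracellular hits here also carry the annotation columns of their cell.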


Number of annotated cells that have significant microbial hits (e-value < 10^-5, no other filters):

In [210]:
ann_cells_with_hits = tsobj_don1and2[tsobj_don1and2['cell'].isin(blastdb_don1and2['cell'].tolist())]
ann_cells_with_hits.shape[0]

24351

Total number of annotated cells and the fraction of them with a hit. **14% of annotated cells had a significant hit.**

In [211]:
print ('total number of annotated cells (donor 1 and 2): ', tsobj_don1and2['cell'].nunique())

print ('fraction of total annotated cells with a hit  (donor 1 and 2): ', np.round(ann_cells_with_hits.shape[0]/tsobj_don1and2['cell'].nunique(),2))


total number of annotated cells (donor 1 and 2):  175489
fraction of total annotated cells with a hit  (donor 1 and 2):  0.14


These are the hits **(~57K)** found in annotated cells, thus **intra**cellular hits, coming from **24K cells** (see above).

In [212]:
#.copy() so that the new columns are added to an independent dataframe
intra_don1and2=blastdb_don1and2[blastdb_don1and2['cell'].isin(tsobj_don1and2['cell'].tolist())].copy()
print(intra_don1and2.shape[0])

intra_don1and2['hit']='yes'
intra_don1and2['hit_type']='intra'


56763




These are hits that are NOT found in annotated cells, thus **extra**cellular hits. About 7% of total hits are associated with annotated cells; the rest are extracellular or come from unannotated cells.

In [213]:
#.copy() so that the new columns are added to an independent dataframe
extra_don1and2 = blastdb_don1and2[~blastdb_don1and2['cell'].isin(tsobj_don1and2['cell'].tolist())].copy()
print(extra_don1and2.shape[0])

extra_don1and2['hit']='yes'
extra_don1and2['hit_type']='extra'


781284




In [214]:
#fraction of all hits (intra + extra) that are found in annotated cells
print(np.round(intra_don1and2.shape[0]/(intra_don1and2.shape[0]+extra_don1and2.shape[0]),2))

0.07
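
Two slightly different quantities can be read off these counts: the ratio of intracellular to extracellular hits, and the share of all hits that are intracellular. With the counts printed in this notebook, both round to 0.07, as the short check below shows.

```python
intra, extra = 56763, 781284  # hit counts printed in the cells above

ratio_intra_to_extra = intra / extra            # intracellular hits per extracellular hit
fraction_of_all_hits = intra / (intra + extra)  # share of all hits that are intracellular

print(round(ratio_intra_to_extra, 4), round(fraction_of_all_hits, 4))
```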


These are the annotated cells with **nothing** detected inside. There are about 150K.

In [215]:
#.copy() so that the new columns are added to an independent dataframe
nothing_don1and2 = tsobj_don1and2[~tsobj_don1and2['cell'].isin(blastdb_don1and2['cell'].tolist())].copy()
print(nothing_don1and2.shape[0])

nothing_don1and2['hit']='no'
nothing_don1and2['hit_type']='none'


152799




Let's put these pieces together, now that they have labels with which we can separate them out again.


In [232]:
blhits=pd.concat([extra_don1and2, intra_don1and2])
fin=blhits.merge(nothing_don1and2, on=['cell','hit_type', 'hit', 'sample'], how='outer')

Sanity check on the final joint dataframe. `hit` is 'no' if the row is a cell that doesn't have a hit, and that cell gets hit_type=='none'. If hit=='yes', the hit_type is 'intra' when the hit comes from an annotated cell, or 'extra' when it comes from an unannotated cell or an empty droplet.

In [234]:
fin['hit'].value_counts(dropna=False) 

yes    838047
no     152799
Name: hit, dtype: int64

In [235]:
fin['hit_type'].value_counts(dropna=False)

extra    781284
none     152799
intra     56763
Name: hit_type, dtype: int64

In [236]:
print('these are the number of unique hits that are found in annotated cells (category intra): ', intra_don1and2.seqName.nunique())
print('these are the number of unique umis that are found in annotated cells (category intra): ', intra_don1and2.umi.nunique())
print('number of annotated cells (unique cell barcodes) that have significant hits: ', intra_don1and2.cell.nunique())

these are the number of unique hits that are found in annotated cells (category intra):  56763
these are the number of unique umis that are found in annotated cells (category intra):  27768
number of annotated cells (unique cell barcodes) that have significant hits:  23967


There is already donor information under the heading 'don' but not under 'donor'; these two columns need to be consolidated. I need to make sure that `apply` can handle NaN values.

In [238]:
fin['cell_bc'] = fin['cell'].apply(lambda x: x+'-1')
#copies, so that the column assignments in the next cell do not act on slices of fin
e = fin[fin['hit']=='yes'].copy()
no = fin[fin['hit']=='no'].copy()


Getting tissue and donor information for the BLAST dataframe based on the sample name. The TS object already has this information.

In [417]:
# fin['cell_bc'] = fin['cell'].apply(lambda x: x+'-1')
e['sample'] = e['sample'].apply(lambda x: x.split('_L0')[0])
e['donor'] = e['sample'].apply(lambda x: x.split('_')[0])
e['tissue'] = e['sample'].apply(lambda x: x.split('_')[1]).apply(lambda x: x.lower())

#matching tissue names (can't find heart in the TS object, but I see it in the BLAST dataframe from the sample name; it may be a misannotation)
no['tissue'] = no['tissue'].apply(lambda x: x.lower())
no['tissue'] = no['tissue'].apply(lambda x: x.replace('small_intestine', 'si'))
no['tissue'] = no['tissue'].apply(lambda x: x.replace('large_intestine', 'li'))
no['tissue'] = no['tissue'].apply(lambda x: x.replace('bone_marrow', 'bm'))
e['tissue'] = e['tissue'].apply(lambda x: x.replace('exopancreas2', 'pancreas'))
e['tissue'] = e['tissue'].apply(lambda x: x.replace('exopancreas1', 'pancreas'))
e['tissue'] = e['tissue'].apply(lambda x: x.replace('endopancreas', 'pancreas'))
e['tissue'] = e['tissue'].apply(lambda x: x.replace('lymphnode', 'lymph_node'))


fin2=pd.concat([e, no])

#interestingly, Tabula Sapiens doesn't have UMI information
fin2['compartment']= fin2['compartment2']

fin2['cell']= fin2['cell'] + '_' + fin2['sample']

fin2 = fin2.drop(columns=['Unnamed: 0', 'duplicates', 'operation','blast', 'source', 'db', 'gapopen', 'don', 'mismatch','compartment2'])
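
The tissue-name fixes above could also be collected in a single lookup table. The sketch below gathers the replacements from the cell above into one dictionary (in the notebook they are applied to two different dataframes) and applies it with `Series.replace`, which matches whole values rather than substrings. The name `TISSUE_ALIASES` and the example values are hypothetical.

```python
import pandas as pd

# mapping assembled from the replacements in the cell above
TISSUE_ALIASES = {
    'small_intestine': 'si',
    'large_intestine': 'li',
    'bone_marrow': 'bm',
    'exopancreas1': 'pancreas',
    'exopancreas2': 'pancreas',
    'endopancreas': 'pancreas',
    'lymphnode': 'lymph_node',
}

tissue = pd.Series(['Small_Intestine', 'exopancreas2', 'lymphnode', 'lung'])
# lower-case first, then swap any value that has an alias; other values pass through
harmonized = tissue.str.lower().replace(TISSUE_ALIASES)
print(harmonized.tolist())
```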



# Saving don1 and 2

In [420]:
fin2.to_csv(mainDir + 'don1_don2_10x_12_08_2022-postreview2022.csv', index=False)


### Column name explanation final dataframe (donor 1 and 2) 
- Blast dataframe columns:
    - **seqName**: e.g. A00111:327:HL57HDSXX:4:1202:22941:23719_TAGGTCAGAGACATCA_AGAAATAACGCC
    - **seq**: microbial sequence 
    - **refName**: the BLAST subject reference, e.g. gi|1862738216|gb|CP055292.1|
    - **pathogen**: common name of the hit, e.g. Shigella sonnei strain SE6-1 chromosome, complete genome
    - **bitscore**: see BLAST tutorial
    - **pident**: percent identity between subject and query
    - **evalue**: see BLAST tutorial
    - **qstart**: query start position
    - **qend**: query end position
    - **sstart**: subject start position
    - **send**: subject end position
    - **length**: length of the alignment between subject and query
    - taxonomy columns
        - **tax_id**: taxonomic id with which we can get the following taxonomic categories
        - **superkingdom**
        - **phylum**
        - **class**
        - **order**
        - **family**
        - **genus**
        - **species**

    - **sample**: e.g. TSP1_endopancreas_3_S3
    - **umi**: e.g. AGAAATAACGCC


- TS object columns (blast dataframe does not produce these on its own)
    - **n_counts**: number of reads per cell
    - **n_genes**: number of genes per cell
    - **log2_n_counts**: log2 of n_counts
    - **log2_n_genes**: log2 of n_genes
    - **compartment**: e.g. immune, epithelial
    - **celltype2**: cell type


- Shared columns:
    - **cell**: cell barcode e.g. CATATTCCAAAGCGGT_TSP2_Thymus_NA_10X_1_4_5prime_S33
    - **cell_bc**: cell barcode with "-1" appended to it, e.g. CATATTCCAAAGCGGT-1
    - **hit**: the column with which to separate the dataframe into blast hits (hit=='yes') and TS object cells without hits (hit=='no')
    - **hit_type**: tells us whether the hit comes from an annotated cell (hit_type=='intra') or an unannotated cell ('extra'), or whether there is no hit ('none')
    - **tissue**: tissue type

# Saving the EHTM dataset (bulk DNA seq)

In [None]:
tis = pd.read_csv(mainDir + 'bulk_tissues_blastn_nt_naPhylaNotFiltered_03_12_2021.csv')
#additionally getting rid of some of the other common contaminants in our lab
tis = tis[~tis['species'].str.contains('Ralstonia|Sphingomonas|Variovorax|Hathewaya histolytica', case=False, na=False)]
#QC: keep only hits with alignment length >= 90 and percent identity >= 90
tis_filt = tis[(tis['length']>=90) & (tis['pident']>=90)]
tis_filt.to_csv(mainDir + 'bulk_tissues_blastn_nt_naPhylaNotFiltered_11_30_2021.csv', index=False) #no change here in post-review


# C. Processing 10x TSP3-16 SIMBA outputs
You can skip to the next section (section 10, "start here"), since the results are saved.


Some of the reads can be read from the completed files in the NT folders; others need to be grabbed from the batched results.

Note that batch files are different: they carry the name of the file they came from as part of their sequence ID, whereas regular unbatched outputs do not. The two need to be matched before they are combined. I will do this by modifying the batch files.
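
To illustrate the matching step, here is a minimal sketch on made-up data. It assumes, as the batch-loading cells below do, that a batched sequence ID has the form `<read id>-<sample file name>`; the read IDs and sample names are hypothetical.

```python
import pandas as pd

# Hypothetical batched IDs of the assumed form "<read id>-<sample file name>"
batched = pd.DataFrame({
    'seqName': ['read1_AAAC_GGGT-TSP7_Blood_NA_10X_1_1_S1_L001',
                'read2_CCCA_TTTG-TSP7_Spleen_NA_10X_1_1_S2_L001'],
})

# Split once on the first '-' so the read ID and the sample name land in separate columns
parts = batched['seqName'].str.split('-', n=1, expand=True)
batched['seqName'] = parts[0]   # now matches the unbatched ID format
batched['sample'] = parts[1]    # file the read came from
print(batched)
```

After this step the `seqName` column has the same format in batched and unbatched outputs, so the two can be concatenated.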

In [58]:
#want to take only the following columns from the lineage dataframe tax 
tax_short=tax[['tax_id','superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']]


#### bacteria

In [9]:
#pulling in data from the batched data files first (bacteria) 
#column headers
cols =['seqName', 'refName', 'pathogen', 'bitscore', 'pident', 'evalue', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'length', 'mismatch', 'tax_id']
bact_batch =pd.DataFrame({})

for file in glob.glob(mainDir10 + 'batchedFiles/bacteria/batched/micoNT_blastn/*.tab'):
    #looking at non empty files 
    if os.path.getsize(file)>0:
        #getting the batch number 
        batch = file.split('/')[-1].split('.tab')[0]
        
        df=pd.read_csv(file, delimiter='\t')
        df.columns=cols
        df['sample'] = df.seqName.apply(lambda x: x.split('-')[1])
        df['seqName'] = df.seqName.apply(lambda x: x.split('-')[0])
        df['batch']=df.shape[0]*[batch]
        df['filepath']=df.shape[0]*[file]
        
        df['tax_id']=df['tax_id'].apply(lambda x: str(x).split(';')[0])
        df.tax_id = df.tax_id.astype('int64')
        #adding lineage information
        df=df.merge(tax_short, on='tax_id', how='left')
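        #filtering df to contain only bacteria, viruses, archaea, fungi & blastocyst
        #(na=False: missing lineage values count as no match)
        df = df[(df['superkingdom'].str.contains('Bacteria|Viruses|Archaea', case=False, na=False) | 
                  df['phylum'].str.contains('mycota', case=False, na=False) |
                  df['species'].str.contains('Blastocyst', case=False, na=False))] 
        
        #concatenate inside the if-block so that an empty file does not re-append the previous df
        bact_batch = pd.concat([bact_batch, df])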

#### viruses

In [10]:
#pulling in data from the batched data files first (viruses)
#column headers
cols =['seqName', 'refName', 'pathogen', 'bitscore', 'pident', 'evalue', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'length', 'mismatch', 'tax_id']
virus_batch =pd.DataFrame({})

for file in glob.glob(mainDir10 + 'batchedFiles/viruses/batched/virusNT_blastn/*.tab'):
    #looking at non empty files 
    if os.path.getsize(file)>0:
        #getting the batch number 
        batch = file.split('/')[-1].split('.tab')[0]
        
        df=pd.read_csv(file, delimiter='\t')
        df.columns=cols
        df['sample'] = df.seqName.apply(lambda x: x.split('-')[1])
        df['seqName'] = df.seqName.apply(lambda x: x.split('-')[0])
        df['batch']=df.shape[0]*[batch]
        df['filepath']=df.shape[0]*[file]
        
        df['tax_id']=df['tax_id'].apply(lambda x: str(x).split(';')[0])
        df.tax_id = df.tax_id.astype('int64')
        #adding lineage information
        df=df.merge(tax_short, on='tax_id', how='left')
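        #filtering df to contain only bacteria, viruses, archaea, fungi & blastocyst
        #(na=False: missing lineage values count as no match)
        df = df[(df['superkingdom'].str.contains('Bacteria|Viruses|Archaea', case=False, na=False) | 
                  df['phylum'].str.contains('mycota', case=False, na=False) |
                  df['species'].str.contains('Blastocyst', case=False, na=False))] 
        #concatenate inside the if-block so that an empty file does not re-append the previous df
        virus_batch = pd.concat([virus_batch, df])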

#### fungi

In [11]:
#pulling in data from the batched data files first (fungi) 
#column headers
cols =['seqName', 'refName', 'pathogen', 'bitscore', 'pident', 'evalue', 'gapopen', 'qstart', 'qend', 'sstart', 'send', 'length', 'mismatch', 'tax_id']
fungi_batch =pd.DataFrame({})

for file in glob.glob(mainDir10 + 'batchedFiles/fungi/batched/fungiNT_blastn/*.tab'):
    #looking at non empty files 
    if os.path.getsize(file)>0:
        #getting the batch number 
        batch = file.split('/')[-1].split('.tab')[0]
        
        df=pd.read_csv(file, delimiter='\t')
        df.columns=cols
        df['sample'] = df.seqName.apply(lambda x: x.split('-')[1])
        df['seqName'] = df.seqName.apply(lambda x: x.split('-')[0])
        df['batch']=df.shape[0]*[batch]
        df['filepath']=df.shape[0]*[file]
        
        df['tax_id']=df['tax_id'].apply(lambda x: str(x).split(';')[0])
        df.tax_id = df.tax_id.astype('int64')
        #adding lineage information
        df=df.merge(tax_short, on='tax_id', how='left')
        #filtering df to contain only bacteria, viruses, archaea, fungi & blastocyst
        df = df[(df['superkingdom'].str.contains('Bacteria|Viruses|Archaea', case=False) | 
                  df['phylum'].str.contains('mycota', case=False) |
                  df['species'].str.contains('Blastocyst',case=False)==True)] 
    fungi_batch = pd.concat([fungi_batch, df])

#### concatenating all batched (currently partial) blast outputs from fungi, viruses, and bacteria

In [12]:
all_batched=pd.concat([bact_batch, virus_batch, fungi_batch])

all_batched.to_csv(mainDir10 + 'all_batched_09_09_2021.csv', index=False)
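
The three loading cells above differ only in their glob pattern, so the shared part could be factored into one helper. The following is a sketch, not the code that produced the saved CSV: `load_blast_tables` is a hypothetical name, and the sample-name split, lineage merge and kingdom filter are left out for brevity.

```python
import glob
import os

import pandas as pd


def load_blast_tables(pattern, cols):
    """Read every non-empty tab file matching `pattern` into one dataframe."""
    frames = []
    for path in sorted(glob.glob(pattern)):
        if os.path.getsize(path) == 0:
            continue  # skip empty BLAST outputs
        # Same call as in the cells above: the first line is consumed as a header row
        df = pd.read_csv(path, delimiter='\t')
        df.columns = cols
        df['batch'] = os.path.basename(path).split('.tab')[0]
        df['filepath'] = path
        frames.append(df)
    # Concatenate once at the end instead of growing a dataframe inside the loop
    if not frames:
        return pd.DataFrame(columns=cols)
    return pd.concat(frames, ignore_index=True)
```

A call would then look like `load_blast_tables(mainDir10 + 'batchedFiles/fungi/batched/fungiNT_blastn/*.tab', cols)`. If the tab files have no header line, `header=None` would be needed in `pd.read_csv` to keep the first hit of each file.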


#### I also want to read the previously unbatched SIMBA outputs
I need to know which jobs weren't batched because they were already complete.

In [15]:
# reading the status of jobs from a previous file
maindf_all=pd.read_csv(mainDir10 + 'status_of_nt_jobs.csv')

In [16]:
# here are the samples
samples=maindf_all[maindf_all['operation']=='humanFiltered'].file.unique().tolist()
samples_df=pd.DataFrame(samples, columns=['file'])


In [17]:
# actually getting the status of each job based on fraction of blast that's complete
maindf_vir = maindf_all[maindf_all['operation']=='virNTblastn']
maindf_vir = maindf_vir.merge(samples_df, on='file', how='outer') 
maindf_vir['status_vir'] = maindf_vir['frac_complete'].apply(lambda x: status(x))

maindf_bac = maindf_all[maindf_all['operation']=='micoNT_blastn']
maindf_bac = maindf_bac.merge(samples_df, on='file', how='outer') 
maindf_bac['status_bact'] = maindf_bac['frac_complete_bact'].apply(lambda x: status(x))

maindf_fun = maindf_all[maindf_all['operation']=='fungi_NT_blastn']
maindf_fun = maindf_fun.merge(samples_df, on='file', how='outer') 
maindf_fun['status_fun'] = maindf_fun['frac_complete_fun'].apply(lambda x: status(x))

In [18]:
#now just collecting all the filepaths to the files that were more than 80% done. 
maindf_vir_done = maindf_vir[maindf_vir['status_vir']=='done']
maindf_bac_done = maindf_bac[maindf_bac['status_bact']=='done']
maindf_fun_done = maindf_fun[maindf_fun['status_fun']=='done']
fps1 = maindf_vir_done.filepath.tolist()
fps2 = maindf_bac_done.filepath.tolist()
fps3 = maindf_fun_done.filepath.tolist()
#list of file paths to nt blast jobs from the three kingdoms that are complete
fps=fps1 + fps2 + fps3
print('number of complete files:', len(fps))

number of complete files: 193
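
The `status` helper used in cell 17 is not shown in this part of the notebook. A minimal sketch of what it might look like follows, assuming the 80% cut-off mentioned in the comment of cell 18. Only the label 'done' comes from the cells above; the labels 'incomplete' and 'missing' and the `threshold` parameter are hypothetical.

```python
import pandas as pd


def status(frac_complete, threshold=0.8):
    """Label a BLAST job by the fraction of its input that finished."""
    # Samples without a status row come through the outer merge as NaN
    if pd.isna(frac_complete):
        return 'missing'
    # 'done' means more than `threshold` of the job completed (assumed cut-off)
    return 'done' if frac_complete > threshold else 'incomplete'


print(status(0.95), status(0.5), status(float('nan')))
```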


In [19]:
#column headers
cols =['seqName', 'refName', 'pathogen', 'bitscore', 'pident', 'evalue', 'gapopen',
       'qstart', 'qend', 'sstart', 'send', 'length', 'mismatch', 'tax_id']
all_unbatched =pd.DataFrame({})

for file in fps:
    #looking at non empty files 
    if os.path.getsize(file)>0:
        #getting the batch number 
        filename = file.split('/')[-1].split('.tab')[0]
        
        df=pd.read_csv(file, delimiter='\t')
        df.columns=cols
        df['sample'] = df.shape[0]*[filename]
        df['batch']=df.shape[0]*['unbatched']
        df['filepath']=df.shape[0]*[file]
        
        df['tax_id']=df['tax_id'].apply(lambda x: str(x).split(';')[0])
        df.tax_id = df.tax_id.astype('int64')
        #adding lineage information
        df=df.merge(tax_short, on='tax_id', how='left')
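        #filtering df to contain only bacteria, viruses, archaea, fungi and blastocyst
        #(na=False: missing lineage values count as no match)
        df = df[(df['superkingdom'].str.contains('Bacteria|Viruses|Archaea', case=False, na=False) | 
                          df['phylum'].str.contains('mycota', case=False, na=False) |
                          df['species'].str.contains('Blastocyst', case=False, na=False))]
        
        #concatenate inside the if-block so that an empty file does not re-append the previous df
        all_unbatched = pd.concat([all_unbatched, df])  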

In [20]:
all_unbatched.to_csv(mainDir10 + 'all_unbatched_09_09_2021.csv', index=False)

#### concatenating all batched and unbatched SIMBA outputs

In [59]:
#this is the processed SIMBA outputs of all the batched files for blast against NT through the fungal, viral, and bacterial branches
all_batched=pd.read_csv(mainDir10 + 'all_batched_09_09_2021.csv')

In [60]:
#this is the processed SIMBA outputs of all the unbatched files for blast against NT through the fungal, viral, and bacterial branches
all_unbatched=pd.read_csv(mainDir10 + 'all_unbatched_09_09_2021.csv')

In [61]:
final=pd.concat([all_unbatched, all_batched])

In [64]:
print('a total of ' + str(final.shape[0]) + ' top hits were found after BLASTing files from TSP3-16 with E-values < 10^-5')

a total of 1533950 top hits were found after BLASTing files from TSP3-16 with E-values < 10^-5


In [65]:
#this step is done to match the TS 10x object
final['sample_short']= final['sample'].apply(lambda x: '_'.join(x.split('_')[:-2]))

#also reading out the cell barcode and umi from the seq name
final['cell_bc'] = final['seqName'].apply(lambda x: x.split('_')[1])
final['umi'] = final['seqName'].apply(lambda x: x.split('_')[2])
final['cell_bc_umi'] = final['seqName'].apply(lambda x: '_'.join(x.split('_')[1:]))


Donors 3, 4 and 5 had sample names that differed between the fastq files and the TS object.
The other donors are fine, except for donor 14, where some sample names seem to carry an extra '_1'; I will correct this in the SIMBA output so that they can be merged properly.

In [66]:
#donor 5
final['sample_short']=final['sample_short'].str.replace('TSP5_Eye1_062920', 'TSP5_Eye_NA_10X_1_1')
final['sample_short']=final['sample_short'].str.replace('TSP5_Eye2_062920', 'TSP5_Eye_NA_10X_1_2')
#donor 4
final['sample_short']=final['sample_short'].str.replace('TSP4_Mammary1_062920', 'TSP4_Mammary_NA_10X_1_1')
final['sample_short']=final['sample_short'].str.replace('TSP4_Mammary2_062920', 'TSP4_Mammary_NA_10X_1_2')
final['sample_short']=final['sample_short'].str.replace('TSP4_Myometrium_062920', 'TSP4_Uterus_Myometrium_10X_1_1')
#donor 3
final['sample_short']=final['sample_short'].str.replace('TSP3_Eye_062620', 'TSP3_Eye_LacrimalGland_10X_1_1')
final['sample_short']=final['sample_short'].str.replace('TSP3_Eye2_062620', 'TSP3_Eye_NA_10X_1_1_NoCornea')
final['sample_short']=final['sample_short'].str.replace('TSP3_Eye3_062620', 'TSP3_Eye_Conjunctiva_10X_1_1')
final['sample_short']=final['sample_short'].str.replace('TSP3_Eye4_062620', 'TSP3_Eye_Orbital_10X_1_1')
#donor 14
final['sample_short']=final['sample_short'].str.replace('TSP14_Bladder_NA_10X_1', 'TSP14_Bladder_NA_10X_1_1')
final['sample_short']=final['sample_short'].str.replace('TSP14_Blood_NA_10X_1', 'TSP14_Blood_NA_10X_1_1')
final['sample_short']=final['sample_short'].str.replace('TSP14_LI_Distal_10X_1', 'TSP14_LI_Distal_10X_1_1')
final['sample_short']=final['sample_short'].str.replace('TSP14_Muscle_Abdomen_10X_1', 'TSP14_Muscle_Abdomen_10X_1_1')
final['sample_short']=final['sample_short'].str.replace('TSP14_Muscle_Diaphragm_10X_1', 'TSP14_Muscle_Diaphragm_10X_1_1')
final['sample_short']=final['sample_short'].str.replace('TSP14_Skin_Abdomen_10X_1', 'TSP14_Skin_Abdomen_10X_1_1')
final['sample_short']=final['sample_short'].str.replace('TSP14_Skin_Chest_10X_1', 'TSP14_Skin_Chest_10X_1_1')
final['sample_short']=final['sample_short'].str.replace('TSP14_Spleen_NA_10X_1', 'TSP14_Spleen_NA_10X_1_1')
final['sample_short']=final['sample_short'].str.replace('TSP14_Tongue_Posterior_10X_1', 'TSP14_Tongue_Posterior_10X_1_1')
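
The renames above can also be kept in one dictionary and applied in a loop, which makes the old/new pairs easier to audit. This is a sketch on a made-up series with three of the pairs from the cell above; `renames` is a hypothetical name.

```python
import pandas as pd

# Three of the old -> new pairs from the cell above
renames = {
    'TSP5_Eye1_062920': 'TSP5_Eye_NA_10X_1_1',
    'TSP4_Myometrium_062920': 'TSP4_Uterus_Myometrium_10X_1_1',
    'TSP3_Eye2_062620': 'TSP3_Eye_NA_10X_1_1_NoCornea',
}

# Made-up sample names; the last one needs no correction
sample_short = pd.Series(['TSP5_Eye1_062920', 'TSP3_Eye2_062620', 'TSP7_Blood_NA_10X_1_1'])

for old, new in renames.items():
    # regex=False treats `old` as a literal substring rather than a pattern
    sample_short = sample_short.str.replace(old, new, regex=False)

print(sample_short.tolist())
```

Like the original cell, this replaces substrings, so a short pattern also changes any longer name that contains it.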

In [67]:
#this step is done to match the TS 10x object

final['cell'] = final['cell_bc'] + '_' +final['sample_short']
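
As a worked example of the key construction in cells 65 and 67, the sketch below follows a single made-up hit. The read ID is the one from the column explanation; the fastq sample name, including its `_S5_L001` suffix, is an assumption.

```python
import pandas as pd

# One made-up hit; the sample name is hypothetical
hits = pd.DataFrame({
    'seqName': ['A00111:327:HL57HDSXX:4:1202:22941:23719_TAGGTCAGAGACATCA_AGAAATAACGCC'],
    'sample': ['TSP14_Bladder_NA_10X_1_2_S5_L001'],
})

# Drop the last two underscore-separated fields of the sample name
hits['sample_short'] = hits['sample'].apply(lambda x: '_'.join(x.split('_')[:-2]))

# seqName has the form "<read id>_<cell barcode>_<umi>"
parts = hits['seqName'].str.split('_', expand=True)
hits['cell_bc'] = parts[1]
hits['umi'] = parts[2]

# Merge key shared with the TS object: cell barcode + short sample name
hits['cell'] = hits['cell_bc'] + '_' + hits['sample_short']
print(hits['cell'].iloc[0])
```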

In [68]:
final.to_csv(mainDir10 + 'all_batched_and_unbatched_09_09_2021.csv', index=False)

#### NNTR: just open the obs layer as a dataframe once these pre-processing steps have been done

In [3]:
#10x tabula sapiens object
alldons = sc.read_h5ad(mainDir3 + 'objects/tSP1_TSP15_scvi_donor-method_normalized-log1p-scaled_annotated_withCellcycle.h5ad')

Just renaming columns and doing some filtering of the 10x object, to include only 10x sequenced cells, and donors 3-15 (16 was not yet available so we have BLAST results but no cell identity info for that donor)

In [None]:
# get rid of info-poor columns in the object
# get the cell barcode in a format that would be the same between the 10x object and SIMBA output dataframe, called cell_bc2

alldons_10x = alldons[alldons.obs['method']=='10X']

alldons_10x.obs['sample'] = alldons_10x.obs['cell_identifier'].apply(lambda x: '_'.join(x.split('_')[1:]))
#dropping some columns that are not important
alldons_10x.obs.drop(columns={'cell_identifier','10X_run', 'pilot', 'subtissue', '10X_sample', '10X_replicate', 'cDNAplate', 
                               'libraryplate', 'well', 'score_epithelial', 'score_endothelial', 'score_stromal','score_immune',
                                '_scvi_batch', '_scvi_labels', '_scvi_local_l_mean','_scvi_local_l_var', '_dataset',
                                'knn_on_bbknn_pred','knn_on_scanorama_pred','decontX_split', 'consensus_percentage',
                                'consensus_prediction', '_labels_annotation', 'scanvi_offline_pred', 'svm_pred', 
                               '_labels_annotation', '_batch_annotation','onclass_pred', 'rf_pred', '_ref_subsample',
                               'knn_on_scvi_offline_pred'}, inplace=True)


Trying to set attribute `.obs` of view, copying.


In [None]:
#renaming some columns
alldons_10x.obs = alldons_10x.obs.rename(columns={'cell_ontology_class':'celltype', 
                          'computational_compartment_assignment':'compartment', 'sample': 'sample2'})
alldons_10x.obs.drop(columns={'notes', 'donor_method','final_annotation_cell_ontology_id'}, inplace=True)

#to match the SIMBA output dataframe with the 10X object, I use the 10x object index, which I'm calling cell. I have an equivalent column in the SIMBA output dataframe
# cell contains the cell barcode along with sample name (short version, doesn't have library lane)
alldons_10x.obs['cell'] = alldons_10x.obs.index 

#going to exclude donors 1 and 2 since they were previously accounted for
alldons_10x_don3to16 = alldons_10x[~alldons_10x.obs['donor'].isin(['TSP1', 'TSP2'])]


In [None]:
#changing the names of cell types to shorten them for plotting purposes
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype']
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('double-positive, alpha-beta thymocyte','alpha-beta thymocyte')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('endothelial cell of artery','artery endo. c.')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('endothelial cell of vascular tree','vascular endo. c.')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('endothelial cell of lymphatic vessel','lymph endo.c.')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('vein endothelial cell','vein endo. c.')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('skeletal muscle satellite stem cell','skeletal muscle stem c.')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('capillary endothelial cell','endo. capillary c.')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('endothelial cell','endo. c.')

alldons_10x_don3to16.obs['celltype2']= alldons_10x_don3to16.obs['celltype2'].apply(lambda x: x.replace('cell', 'c.'))
alldons_10x_don3to16.obs['celltype2']= alldons_10x_don3to16.obs['celltype2'].str.replace('fibroblast of lung', 'fibroblast')
alldons_10x_don3to16.obs['celltype2']= alldons_10x_don3to16.obs['celltype2'].str.replace('pancreatic A c.', 'pancreatic acinar c.')
alldons_10x_don3to16.obs['celltype2']= alldons_10x_don3to16.obs['celltype2'].str.replace('pancreatic D c.', 'pancreatic ductal c.')
alldons_10x_don3to16.obs['celltype2']= alldons_10x_don3to16.obs['celltype2'].str.replace('double negative thymocyte', 'double neg. thymocyte')
alldons_10x_don3to16.obs['celltype2']= alldons_10x_don3to16.obs['celltype2'].str.replace('multi-potent skeletal muscle stem c.', 'skeletal muscle stem c.')
alldons_10x_don3to16.obs['celltype2']= alldons_10x_don3to16.obs['celltype2'].str.replace('  ', ' ')

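#drop uninformative cell types by subsetting the AnnData object itself (.obs cannot be replaced by a dataframe with fewer rows)
alldons_10x_don3to16 = alldons_10x_don3to16[~alldons_10x_don3to16.obs['celltype2'].isin(['c.', 'animal c.'])].copy()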
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('transit amplifying c. of large intestine', 'transient amplifying c.')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('transit amplifying c. of small intestine', 'transient amplifying c.')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('paneth c. of epithelium of small intestine', 'paneth c.')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('vascular associated smooth muscle c.', 'vasc. smooth muscle c.')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('cd4-positive','cd4+')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('cd4-negative', 'cd4-')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('cd8-positive','cd8+')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('cd8-negative', 'cd8-')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('cd1c-positive', 'cd1c+')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('cd141-positive', 'cd141+')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('cd45ro-positive', 'cd45ro+')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('cytokine secreting ', '')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('memory', 'mem.')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('thymus-derived ', '')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('medullary thymic epithelial c.', 'medullary thymic epi. c.')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('paneth c. of epithelium of large intestine', 'paneth c.')
alldons_10x_don3to16.obs['celltype2']=alldons_10x_don3to16.obs['celltype2'].str.replace('serous c. of epithelium of bronchus', 'bronchus serous c.')
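
The order of the replacements above matters: specific names such as 'capillary endothelial cell' have to be handled before the general 'endothelial cell' and 'cell' rules. A compact way to express this is an ordered list of pairs, sketched here with a handful of the rules on made-up input; `abbreviations` and `shorten` are hypothetical names.

```python
import pandas as pd

# Ordered (old, new) pairs: specific names come before general ones
abbreviations = [
    ('capillary endothelial cell', 'endo. capillary c.'),
    ('endothelial cell', 'endo. c.'),
    ('cd4-positive', 'cd4+'),
    ('memory', 'mem.'),
    ('cell', 'c.'),
]


def shorten(name):
    """Apply every abbreviation in order using literal string replacement."""
    for old, new in abbreviations:
        name = name.replace(old, new)
    return name


celltypes = pd.Series(['capillary endothelial cell',
                       'endothelial cell',
                       'cd4-positive memory t cell'])
short = celltypes.map(shorten)
print(short.tolist())
```

Because plain Python `str.replace` is used, characters such as '.' and '+' are matched literally and never interpreted as regular expression syntax.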


Writing the results of preprocessing of the TSP3-16 object

In [None]:
alldons_10x_don3to16.write_h5ad(mainDir3 + 'objects/tSP3-16_processed_withCellCycleInfo_09_29_2021-postreview2022.h5ad')

Also saving just the obs layer as a dataframe for easier access

In [None]:
alldons_10x_don3to16_obs_df = alldons_10x_don3to16.obs
alldons_10x_don3to16_obs_df.to_csv(mainDir10 + 'alldons_10x_don3to16_obs_df.csv', index=False)

# Merging blast dataframe with TS object (donors 3-16, excl. 15)

#### TSP3-16: reading the 10x object obs dataframe, which contains all donors and has cell cycle information, so that I can merge cellular annotation with the SIMBA output
I will only analyze TSP3-16 (excluding donor 15 because of weird cell type annotations; it only contained eye tissue anyway), since processing has already been done for TSP1-2.

In [327]:
#TS object obs
alldons_10x_don3to16_obs_df = pd.read_csv(mainDir10 + 'alldons_10x_don3to16_obs_df.csv')





In [328]:
#SIMBA BLAST processed output
final=pd.read_csv(mainDir10 + 'all_batched_and_unbatched_09_09_2021.csv')

In [329]:
hiv_hits = final[final['species'].str.contains('human immuno|Aids', case=False)]
print('number of hiv hits:', hiv_hits.shape[0])

final2=final[final['species'].str.contains('human immuno|Aids', case=False)==False]      

number of hiv hits: 40354


I also want to exclude donor 15 due to misannotation

In [330]:
alldons_10x_don3to16_not15 = alldons_10x_don3to16_obs_df[~alldons_10x_don3to16_obs_df['donor'].isin(['TSP15'])]
final_not_tsp15 = final2[~final2['sample'].str.startswith('TSP15')]



Additionally, I want to make sure there is no more than one significant hit per query sequence in the blast dataset. We go from ~1.5M hits down to ~1.1M. I am changing the name of the dataframe from final to blastdb, and will also rename alldons_10x_don3to16_not15 (the TS object 10X obs layer) to ts_obj.

In [331]:
final_not_tsp15.shape[0]

1468318

In [332]:
blastdb = final_not_tsp15.drop_duplicates('seqName')
blastdb.shape[0]

1101511
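
`drop_duplicates('seqName')` keeps the first row it encounters for each read. Whether that row is also the best-scoring hit depends on the row order of the SIMBA outputs, which is not stated here. As a hedged alternative, the sketch below sorts by bitscore first so that the retained row is the highest-scoring one; the toy data are made up.

```python
import pandas as pd

# Toy hits: read r1 has two candidate hits with different bit scores
hits = pd.DataFrame({
    'seqName': ['r1', 'r1', 'r2'],
    'species': ['A', 'B', 'C'],
    'bitscore': [150.0, 180.0, 90.0],
})

# As in the notebook: keeps the first row seen for each read
first_seen = hits.drop_duplicates('seqName')

# Alternative: sort by bitscore so the retained row is the best-scoring hit
best = (hits.sort_values('bitscore', ascending=False)
            .drop_duplicates('seqName')
            .sort_index())

print(first_seen['species'].tolist(), best['species'].tolist())
```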

In [333]:
#just renaming to something shorter
ts_obj = alldons_10x_don3to16_not15

These two dataframes share only one column, called cell. There are no NA values in this column in either dataframe, so we can merge the two dataframes on it.


In [334]:
print('size of the TS object donor3-16 excluding 15: ', ts_obj.shape[0])
print('size of the BLAST dataframe donor3-16 excluding 15: ', blastdb.shape[0])
inter = ts_obj.merge(blastdb, on='cell', how='inner')
print('size of the intersection: ', inter.shape[0])


size of the TS object donor3-16 excluding 15:  326666
size of the BLAST dataframe donor3-16 excluding 15:  1101511
size of the intersection:  107744


#### There are three categories for the merged blast outputs (blastdb) and TS object (alldons_10x_don3to16_not15).
**1) intra** --> annotated cells that have significant **intra**cellular hits \
**2) nothing** --> annotated cells that don't have any significant intracellular hits (based on evalue) \
**3) extra** --> significant hits that are found either **extra**cellularly or in unannotated cells, i.e. hits whose cell barcodes do not match an annotated cell. 

How many of each category are in the dataset (prior to removal of contamination)?



In [335]:
print('these are the number of unique hits that are found in annotated cells (category intra): ', inter.seqName.nunique())
print('these are the number of unique umis that are found in annotated cells (category intra): ', inter.umi.nunique())
print('number of annotated cells (unique cell barcodes) that have significant hits: ', inter.cell.nunique())

these are the number of unique hits that are found in annotated cells (category intra):  107744
these are the number of unique umis that are found in annotated cells (category intra):  75376
number of annotated cells (unique cell barcodes) that have significant hits:  12873


#### Labeling the blast dataframe by a column called hit: 'yes' and hit_type: 'intra' or 'extra'
'intra' hits are those found in annotated cells, and 'extra' are all other hits.
The cells with no hits get hit=='no' and hit_type=='none'.


In [320]:
tsblast = ts_obj.merge(blastdb, on='cell', how='outer')

#these are the hits that are in the "extra" category. They don't associate with annotated cells
#(.copy() so that the new columns are set on an independent dataframe, not on a slice of tsblast)
extra = tsblast[tsblast['tissue'].isna()].copy()
extra['hit'] = 'yes'
extra['hit_type'] = 'extra'

#these are the rows in the nothing category (i.e. annotated cells without hits)
nothing = tsblast[tsblast['seqName'].isna()].copy()
nothing['hit'] = 'no'
nothing['hit_type'] = 'none'

#these are the hits in annotated cells
intra = tsblast[tsblast['seqName'].notna() & tsblast['tissue'].notna()].copy()
intra['hit'] = 'yes'
intra['hit_type'] = 'intra'

#lets put together these different pieces now that they have labels with which we can seperate them out again

tsblast2=pd.concat([extra, nothing, intra])
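
The same three labels can also be derived in a single pass from the `indicator` option of `DataFrame.merge`, which records whether each row came from the left table, the right table, or both. This is a sketch on toy data, not the code used for the saved results; the two small dataframes are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: c1 is annotated with a hit, c2 is annotated without a hit,
# c3 has a hit but no annotation
ts_obj = pd.DataFrame({'cell': ['c1', 'c2'], 'tissue': ['blood', 'lung']})
blastdb = pd.DataFrame({'cell': ['c1', 'c3'], 'seqName': ['r1', 'r2']})

merged = ts_obj.merge(blastdb, on='cell', how='outer', indicator=True)

# Translate the merge indicator into the notebook's hit_type labels
labels = {'both': 'intra', 'left_only': 'none', 'right_only': 'extra'}
merged['hit_type'] = merged['_merge'].astype(str).map(labels)
merged['hit'] = np.where(merged['hit_type'] == 'none', 'no', 'yes')
merged = merged.drop(columns='_merge')
print(merged)
```

This avoids splitting the merged dataframe into three pieces and concatenating them again.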


more formatting

In [339]:
tsblast2['cell_bc'] = tsblast2['10X_barcode'].fillna(tsblast2['cell_bc'])

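#.copy() so that the new columns are set on an independent dataframe, not on a slice of tsblast2
ex = tsblast2[tsblast2['hit_type']=='extra'].copy()
other = tsblast2[tsblast2['hit_type']!='extra']

#'extra' hits have no TS annotation, so donor and tissue are read from the sample name
ex['donor'] = ex['sample'].apply(lambda x: x.split('_')[0])
ex['tissue'] = ex['sample'].apply(lambda x: x.split('_')[1])
tsblast3 = pd.concat([ex, other])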


tsblast3['tissue'] = tsblast3['tissue'].str.lower()
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('bonemarrow', 'bm'))
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('bone_marrow', 'bm'))
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('lymphnode', 'lymph_node'))
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('lymphnodes', 'lymph_node'))
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('lymph_nodes', 'lymph_node'))
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('eye1', 'eye'))
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('eye2', 'eye'))
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('eye3', 'eye'))
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('eye4', 'eye'))
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('mammary1', 'mammary'))
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('mammary2', 'mammary'))
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('exopancreas2', 'pancreas'))
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('exopancreas1', 'pancreas'))
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('endopancreas', 'pancreas'))
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('salivarygland', 'salivary_gland'))
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('large_intestine', 'li'))
tsblast3['tissue'] = tsblast3['tissue'].apply(lambda x: x.replace('small_intestine', 'si'))

tsblast3['log2_n_counts']=np.log2(tsblast3['n_counts'])
tsblast3['log2_n_genes']=np.log2(tsblast3['n_genes'])

tsblast3 = tsblast3.drop(columns={'free_annotation', 'manually_annotated','10X_barcode','Annotation','celltype',
                      'Manually Annotated','cycling', 'non-cycling','10X_barcode','anatomical_position',
                                 'sample', 'gapopen', 'mismatch', 'method','seqrun'})
tsblast3['sample']=tsblast3['sample2'].fillna(tsblast3['sample_short'])

tsblast3= tsblast3.drop(columns={'sample_short','sample2'})



In [370]:
tsblast3.to_csv(mainDir10 + 'dons3-16_excluding15_blast_with_cell_annotations-postreview2022.csv', index=False)

### Column name explanation (donor 3-16 excluding 15) final dataframe
- Blast dataframe columns (TS object is missing these columns):
    - **seqName**: e.g. A00111:327:HL57HDSXX:4:1202:22941:23719_TAGGTCAGAGACATCA_AGAAATAACGCC
    - **refName**: the BLAST subject reference, e.g. gi|1862738216|gb|CP055292.1|
    - **pathogen**: common name of the hit, e.g. "Shigella sonnei strain SE6-1 chromosome, complete genome"
    - **bitscore**: BLAST bit score of the alignment (higher is better)
    - **pident**: percent identity between subject and query
    - **evalue**: BLAST E-value, the expected number of chance alignments with an equal or better score (lower is better)
    - **qstart**: query start position
    - **qend**: query end position
    - **sstart**: subject start position
    - **send**: subject end position
    - **length**: length of the alignment between subject and query
    - Taxonomy columns
        - **tax_id**: taxonomic id with which we can get the following taxonomic categories
        - **superkingdom**
        - **phylum**
        - **class**
        - **order**
        - **family**
        - **genus**
        - **species**
    - **umi**: e.g. AGAAATAACGCC
    - **cell_bc_umi**: cell barcode and umi, e.g. TGTCCCATGTACTCTG_CGTTGATACCAC
    - **batch**: this is related to how blast was done for these donors (divided into batches with a fixed number of input seqs)
    - **filepath**: where the output of the blast resides on my local drive


- TS object columns (blast dataframe does not produce these on its own)
    - **n_counts**: number of reads per cell
    - **n_genes**: number of genes per cell
    - **log2_n_counts**: log2 of n_counts
    - **log2_n_genes**: log2 of n_genes
    - **compartment**: e.g. immune, epithelial
    - **decision**: cell cycle decision
    - **celltype2**: cell type
    - **tissue_cell_type**: tissue and cell type
    - **cell_type_tissue**: cell type and tissue


- Shared columns:
    - **cell**: cell barcode + sample, e.g. TGGGCGTGTTGCGCAC_TSP14_Blood_NA_10X_1_1_1_5Prime
    - **cell_bc**: cell barcode, e.g. CATATTCCAAAGCGGT (unlike for donors 1 and 2, no "-1" is appended)
    - **tissue**: tissue type
    - **donor**: donor (e.g. TSP1)
    - **sample**: sample name e.g. TSP14_Bladder_NA_10X_1_2
    - **hit**: the column with which to separate the dataframe into blast hits (hit=='yes') and TS object cells without hits (hit=='no')
    - **hit_type**: tells us whether the hit comes from an annotated cell (hit_type=='intra') or an unannotated cell ('extra'), or whether there is no hit ('none')


# Now let's put together the dataframes from ALL donors

The two dataframes can be separated using the 'donor_batch' column: '1_2' for donors 1 and 2, and '3_16' for all others.
Note: this df has not yet been filtered for contamination or by blast parameters.
- the columns 'cell_type_tissue', 'cell_bc_umi', 'tissue_cell_type', 'filepath', 'batch' will be null for donors 1 and 2
- the column 'seq' will be null for donors 3-16
- the cell column is most useful for counting cells (the cell_bc column contains only the cell barcode, and the same barcodes are reused across different samples).
- the column cell_bc has "-1" appended to it for donors 1-2
- as before, both dataframes can be decomposed by the "hit" and "hit_type" columns.
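
Because cell_bc carries the "-1" suffix only for donors 1-2, the barcodes are not directly comparable across the two dataframes. Below is a minimal sketch of one way to harmonize them, assuming the suffix is exactly "-1" at the end of the string. The toy dataframe and the `cell_bc_clean` column name are hypothetical and not part of the actual pipeline.

```python
import pandas as pd

# Toy rows (hypothetical), one per donor batch
toy = pd.DataFrame({
    "cell_bc": ["CATATTCCAAAGCGGT-1", "TGTCCCATGTACTCTG"],
    "donor_batch": ["1_2", "3_16"],
})

# Strip a trailing "-1" only; barcodes without the suffix are left unchanged.
# ASSUMPTION: the suffix is always exactly "-1".
toy["cell_bc_clean"] = toy["cell_bc"].str.replace(r"-1$", "", regex=True)

print(toy[["donor_batch", "cell_bc", "cell_bc_clean"]])
```

Keeping the original cell_bc column untouched and writing to a new column makes it easy to check that nothing other than the suffix was removed.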

In [421]:
# Load the donor 1-2 dataframe and the donor 3-16 (excluding 15) dataframe
fin2 = pd.read_csv(mainDir + 'don1_don2_10x_12_08_2022-postreview2022.csv')
tsblast3 = pd.read_csv(mainDir10 + 'dons3-16_excluding15_blast_with_cell_annotations-postreview2022.csv')

# Tag every row with its source dataframe so the two can be told apart after joining
fin2['donor_batch'] = '1_2'
tsblast3['donor_batch'] = '3_16'




In [422]:
# Let's see which columns the two dataframes do NOT share (most columns should be shared)
print('columns that donors 3-16 dataframe has that the other does not:', set(tsblast3.columns) - set(fin2.columns))
print('columns that donors 1-2 dataframe has that the other does not:', set(fin2.columns) - set(tsblast3.columns))


columns that donors 3-16 dataframe has that the other does not: {'cell_type_tissue', 'cell_bc_umi', 'tissue_cell_type', 'filepath', 'batch'}
columns that donors 1-2 dataframe has that the other does not: {'seq'}


Always use the cell column (cell barcode + sample name) as opposed to the cell_bc column, which contains just the barcode, because the same barcode can be used across different samples.
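
To illustrate why this matters, here is a small sketch on made-up data showing how counting on the barcode alone undercounts cells. The barcodes below are hypothetical, and the `barcode_sample` layout of the cell column is assumed from the examples in the column explanations.

```python
import pandas as pd

# Hypothetical toy data: the same barcode appears in two different samples
toy = pd.DataFrame({
    "cell_bc": ["AAACCTGAGAAACCAT", "AAACCTGAGAAACCAT", "GGGTTTCACGTACATG"],
    "sample": ["TSP14_Blood_NA_10X_1_1", "TSP14_Bladder_NA_10X_1_2", "TSP14_Blood_NA_10X_1_1"],
})

# ASSUMPTION: cell is the barcode and sample name joined by an underscore
toy["cell"] = toy["cell_bc"] + "_" + toy["sample"]

n_by_barcode = toy["cell_bc"].nunique()  # collapses cells that share a barcode
n_by_cell = toy["cell"].nunique()        # keeps cells from different samples distinct

print("unique cell_bc:", n_by_barcode)
print("unique cell:", n_by_cell)
```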

In [423]:
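# Stack the two dataframes; columns present in only one of them are filled with NaN in the other's rows
alldf = pd.concat([tsblast3, fin2])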
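
A quick sanity check on the join, sketched on toy data: after `pd.concat`, columns that exist in only one input should be entirely null for the other donor batch. The two small dataframes and the `null_frac` name are hypothetical stand-ins; the same groupby could be run on the real joined dataframe.

```python
import pandas as pd

# Hypothetical stand-ins for the donor 3-16 and donor 1-2 dataframes
a = pd.DataFrame({"cell": ["c1"], "batch": [1], "donor_batch": ["3_16"]})
b = pd.DataFrame({"cell": ["c2"], "seq": ["ACGT"], "donor_batch": ["1_2"]})

both = pd.concat([a, b], ignore_index=True, sort=False)

# Fraction of null values per column within each donor batch
null_frac = (
    both.drop(columns="donor_batch")
        .isna()
        .groupby(both["donor_batch"])
        .mean()
)
print(null_frac)
```

A value of 1.0 marks a column that is missing entirely for that batch, which is the pattern expected for 'seq' in donors 3-16 and for 'batch' and the other donor 3-16 columns in donors 1-2.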

### Column name explanation for joined dataframe of ALL donors 1-16 (except 15)
- Blast dataframe columns (TS object is missing these columns):
    - **seqName**: e.g. A00111:327:HL57HDSXX:4:1202:22941:23719_TAGGTCAGAGACATCA_AGAAATAACGCC
    - **seq**: microbial sequence 
    - **refName**: the BLAST subject reference, e.g. gi|1862738216|gb|CP055292.1|
    - **pathogen**: common name of the hit, e.g. "Shigella sonnei strain SE6-1 chromosome, complete genome"
    - **bitscore**: see BLAST tutorial
    - **pident**: percent identity between subject and query
    - **evalue**: see BLAST tutorial
    - **qstart**: query start position
    - **qend**: query end position
    -  **sstart**: subject start position
    -  **send**: subject end position
    -  **length**: length of the alignment between subject and query
    - Taxonomy columns 
        -  **tax_id**: taxonomic id with which we can get the following taxonomic categories
        -  **superkingdom**
        -  **phylum**
        -  **class**
        -  **order**
        -  **family**
        -  **genus**
        -  **species**
    -  **umi**: unique molecular identifier, e.g. AGAAATAACGCC
    -  **cell_bc_umi**: cell barcode and umi e.g. TGTCCCATGTACTCTG_CGTTGATACCAC
    -  **batch**: this is related to how blast was done for these donors (divided into batches with fixed number of input seqs)
    - **filepath**: where the output of the blast resides on my local drive


- TS object columns (blast dataframe does not produce these on its own)
    - **n_counts**: number of reads per cell
    - **n_genes**: number of genes per cell
    - **log2_n_counts**: log2 of n_counts
    - **log2_n_genes**: log2 of n_genes
    - **compartment**: e.g. immune, epithelial
    - **decision**: cell cycle decision
    - **celltype2**: cell type
    - **tissue_cell_type**: tissue and cell type
    - **cell_type_tissue**: cell type and tissue


- Shared columns:
    - **cell**: cell barcode+sample  TGGGCGTGTTGCGCAC_TSP14_Blood_NA_10X_1_1_1_5Prime
    - **cell_bc**: cell barcode; "-1" is appended for donors 1 and 2 but not for donors 3-16, e.g. CATATTCCAAAGCGGT-1
    - **tissue**: tissue type
    - **donor**: donor (e.g. TSP1)
    - **sample**: sample name e.g. TSP14_Bladder_NA_10X_1_2
    - **hit**: the column with which to separate the dataframe into blast (hit=='yes') and ts object (hit=='no')
    - **hit_type**: tells us whether the hit comes from an annotated cell (hit_type=='intra'), unannotated cell ('extra') or no hit ('none')
    - **donor_batch**: donor 1 and 2 dataframe ('1_2'), all others ('3_16')
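
The shared columns above are what I use to take the joined dataframe apart again. Here is a minimal sketch on toy data, assuming the 'yes'/'no' and 'intra'/'extra'/'none' values listed above; the rows and the names `blast_rows`, `ts_rows` and `summary` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows using the value conventions described above
toy = pd.DataFrame({
    "cell": ["c1_sampleA", "c2_sampleB", "c3_sampleB"],
    "hit": ["yes", "yes", "no"],
    "hit_type": ["intra", "extra", "none"],
    "donor_batch": ["1_2", "3_16", "3_16"],
})

blast_rows = toy[toy["hit"] == "yes"]  # rows that come from the blast dataframe
ts_rows = toy[toy["hit"] == "no"]      # rows that come from the TS object only

# Rows per hit_type within each donor batch
summary = pd.crosstab(toy["donor_batch"], toy["hit_type"])
print(summary)
```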

### Saving the combined dataframe

In [426]:
alldf.to_csv(mainDir10 + 'all_dons_except15_blast_tsobject_postreview.csv', index=False)