This notebook explores statistics on the number of cells and empty droplets that are positive for hits after applying different filters. 

### Loading libraries

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib_venn import venn2, venn3
import venn
from venn import venn4
from venn import venn5
plt.rcParams['figure.figsize'] = [5, 5]
plt.rcParams['figure.dpi'] = 120
font_label_size=12
plt.rcParams['font.size'] = font_label_size
plt.rcParams['legend.fontsize'] = font_label_size
plt.rcParams['figure.titlesize'] = font_label_size
plt.rcParams['axes.labelsize']= font_label_size
plt.rcParams['axes.titlesize']= font_label_size
plt.rcParams['xtick.labelsize']= font_label_size
plt.rcParams['ytick.labelsize']= font_label_size
plt.rcParams["font.family"] = "arial"
import collections
from collections import Counter, ChainMap
import os
import glob
import re
import itertools
import math
import random
from random import randrange
import string
import numpy as np
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
from statistics import mean
from matplotlib import pyplot
import scipy as sc
from scipy import stats
import upsetplot
from upsetplot import UpSet, generate_counts, from_contents, plot
from pyvis.network import Network
import seaborn as sns
cmap = sns.cm.rocket_r
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, Plot, Grid, Range1d
from bokeh.models.glyphs import Text, Rect
from bokeh.layouts import gridplot
from bokeh.sampledata.les_mis import data
import panel as pn
import panel.widgets as pnw
pn.extension()
import holoviews as hv
from holoviews import opts, dim
from holoviews.plotting.util import process_cmap
import plotly.graph_objects as go
import plotly.express as pex
hv.extension('bokeh')
hv.output(size=150)
pd.options.display.max_colwidth = 1000 #allows for viewing larger amount of text in pandas

### Directories 

In [None]:
mainDir = '/oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/'
mainDir10 = '/oak/stanford/groups/quake/gita/raw/tab3-14_20210420/all/'
paper = '/oak/stanford/groups/quake/gita/raw/nb/microbe/paper/forGitHub/human_tissue_microbiome_atlas/post_review/'
tables = paper + 'tables/'
images = paper + 'images/'
dbDir = '/oak/stanford/groups/quake/gita/raw/database/taxonomyNCBI20200125/'
taxDir = dbDir + 'taxonkit/'
tax = pd.read_csv(dbDir + 'ncbi_lineages_2021-01-26.csv')
tax_short=tax[['tax_id','superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']] #want to take only the following columns from the lineage dataframe tax 


### Explanation of the dataset

- Blast dataframe columns (TS object is missing these columns):
    - **seqName**: e.g. A00111:327:HL57HDSXX:4:1202:22941:23719_TAGGTCAGAGACATCA_AGAAATAACGCC
    - **seq**: microbial sequence 
    - **refName**: the BLAST subject reference gi|1862738216|gb|CP055292.1| 
    - **pathogen**: common name of the hit "Shigella sonnei strain SE6-1 chromosome, complete genome"
    - **bitscore**: see BLAST tutorial
    - **pident**: percent identity between subject and query
    - **evalue**: see BLAST tutorial
    - **qstart**: query start position
    - **qend**: query end position
    -  **sstart**: subject start position
    -  **send**: subject end position
    -  **length**: length of the alignment between subject and query
    - Taxonomy columns 
        -  **tax_id**: taxonomic id with which we can get the following taxonomic categories
        -  **superkingdom**
        -  **phylum**
        -  **class**
        -  **order**
        -  **family**
        -  **genus**
        -  **species**
    -  **umi**: AGAAATAACGCC 
    -  **cell_bc_umi**: cell barcode and umi e.g. TGTCCCATGTACTCTG_CGTTGATACCAC
    -  **cell_umi**: cell barcode, sample, umi e.g. TCCCATGTACTCTGCG_TSP14_Liver_NA_10X_1_1_TTGATACCACTG
    -  **batch**: this is related to how blast was done for these donors (divided into batches with fixed number of input seqs)
    - **filepath**: where the output of the blast resides on my local drive


- TS object columns (blast dataframe does not produce these on its own)
    - **n_counts**: number of reads per cell
    - **n_genes**: number of genes per cell
    - **log2_n_counts**: log2 of n_counts
    - **log2_n_genes**: log2 of n_genes
    - **compartment**: e.g. immune, epithelial
    - **decision**: cell cycle decision
    - **celltype2**: cell type
    - **tissue_cell_type**: tissue and cell type
    - **cell_type_tissue**: cell type and tissue


- Shared columns:
    - **cell**: cell barcode+sample  TGGGCGTGTTGCGCAC_TSP14_Blood_NA_10X_1_1_1_5Prime
    - **cell_bc**: cell barcode with "-1" appended to it. (donor1and2, not for donors 3-16) e.g. CATATTCCAAAGCGGT-1
    - **tissue**: tissue type
    - **donor**: donor (e.g. TSP1)
    - **sample**: sample name e.g. TSP14_Bladder_NA_10X_1_2
    - **hit**: the column with which to separate out the dataframe into blast (hit=='yes') and ts object (hit=='no')
    - **hit_type**: tells us whether the hit comes from an annotated cell (hit_type=='intra'), unannotated cell ('extra') or no hit ('none')
    - **donor_batch**: donor 1 and 2 dataframe ('1_2'), all others ('3_16')
    - **empty_drop**: set to 'yes' if empty, and 'no' if a cell. Based on a cut off of 200 genes per droplet
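
A minimal sketch of how a label like **empty_drop** could be derived from a 200-gene cutoff. The toy values are invented, and two details are assumptions rather than facts from this notebook: that the cell ranger gene count (`n_genes_cr`) is the column being thresholded, and that a droplet with exactly 200 genes counts as a cell.

```python
import numpy as np
import pandas as pd

# Invented droplets with their gene counts.
toy = pd.DataFrame({'cell': ['c1', 'c2', 'c3'],
                    'n_genes_cr': [12, 199, 850]})

GENE_CUTOFF = 200  # cutoff quoted in the column description

# Assumption: strictly fewer genes than the cutoff means "empty".
toy['empty_drop'] = np.where(toy['n_genes_cr'] < GENE_CUTOFF, 'yes', 'no')
print(toy)
```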


- Cell ranger columns (rows with cr column=='yes' should have all this info):
    - **cr**: "yes" if the sample was run through cell ranger pipeline, and "no" otherwise
    - **cr_sample**: the name of the cell ranger file
    - **n_counts_cr**: number of reads per droplet as determined by my cell ranger pipeline
    - **n_genes_cr**: same as above, for number of genes per droplet
    - **num_empty_drop_per_sample**: number of empty droplets in a sample (based on cut off of 200 genes)
    - **num_cell_drop_per_sample**: number of cell-containing droplets in a sample (based on cut off of 200 genes)
    


- Shared columns for removal of contaminants, "orphan" alignments, and environmentally sourced bacteria
    - **filter1**: Removal of species and taxids found in the 10X experimental contamination dataset (1= contaminant, 0=non-contaminant)
    - **filter2**: Removal of genera found in the Salter et al. list of common contaminant genera (1=contaminant)
    - **filter3**: Removal of certain fungal species found in the Haziza et al dataset 
    - **filter4**: Removal of putative contaminants based on ML model prediction
    Filter4 is based on ML model prediction which takes into account the following columns. 
    - **num_tissues**: Number of tissues that each species appears in
    - **num_donors**: Number of donors that each species appears in
    - **abundance**: The total abundance of each species (number of hits) in the TS dataset
    - **med_pident**:  Median percent identity of alignments corresponding to a given species
    - **med_length**: Median alignment length corresponding to a given species
    - **med_ngene**: Median number of genes per droplet that contains a given species
    - **pred**: The prediction of the model. The input label of the model are species, 1 if any of the previous filters are 1 (contaminant), and 0 otherwise. 
    - **R_g**: R_g = c_g/ts_g, where c_g is the number of species from genus g detected in the experimental contamination dataset and ts_g is the number of species from that genus detected in the total TS dataset. The ratio is a measure of confidence in a genus. If a genus has more species detected in the contamination dataset than in the total TS dataset, then even its species that were not detected in the contamination dataset are less trustworthy, because we show that species-level taxonomic assignment from short reads (~100 bp) is not nearly as precise as genus-level assignment.
    - **filter5**: We combine the pident and length alignment parameters into one that measures the number of exact base matches in the alignment. This new parameter is "pidlen" = pident*length/100. Alignments with pidlen > pthresh are set to 0, and all others to 1. pthresh corresponds to 3 sigma away from the mean of the second pidlen distribution. Essentially, this filter selects for high-quality alignments.
    - **pidlen**: pident*length/100 corresponds to number of exact base matches in the alignment
    - **emp_habitat1**: coarse-grained habitat information (list of habitats) for bacterial genera at the intersection of ts and GEM
    - **emp_habitat2**: even more coarse-grained habitat information (list of habitats) for bacterial genera at the intersection of ts and GEM
    - **emp_habitat3**: even more coarse-grained habitat information than emp_habitat2 for bacterial genera at the intersection of ts and GEM
    - **assignment**: binary assignment of GEM bacterial species found in the tsm dataset into "free_living_or_other_hosts" and "human_associated"
    - **emp**:	1 is "free_living_or_other_hosts" and 0 is "human_associated" based on whether the species was isolated from the  human microbiome. If found in both natural environments as well as humans, it will be labeled as human associated. It will be NaN if the species is not found in EMP, which in addition to bacterial species will include all viruses and fungi.
    - **emp2**: 1 is "free_living_or_other_hosts" and 0 is "human_associated" based on whether the genus was predominantly isolated from the human microbiome (more species in the human microbiome than outside of it based on GEM study). It will be NaN if the genus is not found in GEM, which in addition to bacterial species will include all viruses and fungi.
    - **assignment2**: binary assignment of gem bacterial genera found in the tsm dataset into "free_living_or_other_hosts" and "human_associated" 
    - **emp1_2**: combination of emp and emp2 columns. Basically, it is the emp2 (genus) filter, but it can be overridden by the emp (species) filter. So unless a given species has been explicitly found in the human microbiome, if it belongs to a genus that is mostly found outside of the human microbiome, then it is deemed not part of the human microbiome
    - **hmp_species**:	set to 0 if a species is found in the HMP dataset, NaN otherwise (note, hmp is predominantly bacterial, with small number of eukaryotic species)
    - **hmp_genus**:	set to 0 if a genus is found in the HMP dataset, NaN otherwise 
    - **hmp_phylum**: set to 0 if a phylum is found in the HMP dataset, NaN otherwise 
    - **uhgg_species**:	set to 0 if a species is found in the UHGG dataset, NaN otherwise (note, uhgg doesn't have fungi or viruses, it does have some archaea)
    - **uhgg_genus**:	set to 0 if a genus is found in the UHGG dataset, NaN otherwise 
    - **uhgg_phylum**: set to 0 if a phylum is found in the UHGG dataset, NaN otherwise 
    - **hue (filter6)**: combination of **h**mp, **u**hgg and **e**mp filters. Set to 0 if any of the species have been found in association with the human microbiome, set to 1 if only found outside of the human microbiome, and NaN if no information can be gathered from these three databases. Note, only "emp1_2" column offers some values of 1, others (uhgg_species, hmp_species) are either 0 or NaN because they only explore sites in the human body.
    - **chatgpt (filter7)**: this is the result of chatgpt search for habitat of origin for a species that was not found in the three studies explored. Set to 0 if found in humans, and 1 otherwise. 
    - **davinci_combo**: the combined outputs of chatgpt's davinci model to the same prompt posed 3 times (independently, for each species). 
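
The two formulas in the list above (pidlen for filter5, and R_g) can be sketched on toy data. All values below are invented. In particular `pthresh = 75` is a hypothetical placeholder, since the real threshold is derived from the pidlen distribution, and the species names are made up.

```python
import pandas as pd

# Toy alignments; values are invented for illustration only.
aln = pd.DataFrame({
    'species': ['sp_a', 'sp_b', 'sp_c'],
    'pident': [100.0, 90.0, 80.0],
    'length': [100, 90, 50],
})

# pidlen = pident * length / 100, i.e. the number of exactly matching bases.
aln['pidlen'] = aln['pident'] * aln['length'] / 100

# Hypothetical threshold, not the value used in the analysis.
pthresh = 75
# filter5 is 0 when pidlen > pthresh (alignment kept) and 1 otherwise.
aln['filter5'] = (aln['pidlen'] <= pthresh).astype(int)

# R_g = c_g / ts_g for a single made-up genus "g1".
contam_species = {'g1 alpha', 'g1 beta', 'g1 gamma'}  # species seen in the contamination data
ts_species = {'g1 alpha', 'g1 delta'}                 # species seen in the TS data
R_g = len(contam_species) / len(ts_species)

print(aln)
print('R_g:', R_g)
```

Here R_g is above 1, which is the situation the description flags as lowering confidence in every species of that genus.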

In [None]:
hits = pd.read_csv(mainDir10 + 'all_dons_cr_processed_7filters_postreview2022_v2.csv')
print(hits.shape[0])


which species do I not have habitat information on? 

In [None]:
hits_yes = hits[hits['hit']=='yes']
print(hits_yes.shape[0])

In [None]:
hits_no = hits[hits['hit']=='no']

f7 and f5 filtered data


In [None]:

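# keep only hits that are not flagged (value of 1) by any of the seven filters; NaN counts as not flagged
cols=['filter1', 'filter2', 'filter3', 'filter4', 'filter5', 'hue', 'chatgpt']
f7=hits_yes[~hits_yes[cols].any(axis=1)]

# same, using only the first five filters
cols=['filter1', 'filter2', 'filter3', 'filter4', 'filter5']
f5=hits_yes[~hits_yes[cols].any(axis=1)]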
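
Since several filter columns (e.g. hue) can be NaN, it may help to see how the `any(axis=1)` selection treats missing values. This is a small sketch on invented data, not the real hits table: with the pandas default `skipna=True`, a NaN is ignored, so a row is kept unless at least one filter is explicitly 1.

```python
import numpy as np
import pandas as pd

# Invented rows: sp_c has no habitat information (NaN in 'hue').
toy = pd.DataFrame({
    'species': ['sp_a', 'sp_b', 'sp_c', 'sp_d'],
    'filter1': [0, 1, 0, 0],
    'hue': [0.0, 0.0, np.nan, 1.0],
})

cols = ['filter1', 'hue']
flagged = toy[cols].any(axis=1)  # NaN is skipped, so it does not count as a flag
kept = toy[~flagged]
print(kept['species'].tolist())
```

In this toy case sp_b and sp_d are removed, while sp_c passes even though nothing is known about its habitat.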


- How many microbial reads does each cell have on average? 
- How many cells have at least one significant hit? 


### how many droplets (empty or cell containing) had a hit prior to filters (f0 dataset)

this is the number of annotated cells that did not have hits but were analyzed. 

In [None]:
num_cells_with_no_hits=hits_no['cell'].nunique()
print(num_cells_with_no_hits)

total number of droplets (empty and cell) with hits (without any filters)

In [None]:
num_droplet_with_hits=hits_yes['cell'].nunique()
print(num_droplet_with_hits)

number of cells that don't have an "empty_drop" status because they were not run through cell ranger; they all belong to donor 16 

In [None]:
don16_cells=hits_yes[(hits_yes.empty_drop.isna()) & (hits_yes.donor=='TSP16')].cell.nunique()
print('number of cells from donor 16:', don16_cells)

number of empty droplets with hits, before filters

In [None]:
num_empty_drops_with_hits=hits_yes[hits_yes['empty_drop']=='yes'].cell.nunique()
print(num_empty_drops_with_hits)

number of cells with hits, before filters

In [None]:
num_cells_with_hits=hits_yes[hits_yes['empty_drop']=='no'].cell.nunique()
print(num_cells_with_hits)

as a sanity check, these three numbers should add up to the total number of droplets with hits before filters (f0), which they do

In [None]:
tot_num_drops_with_hits_incl_don16=don16_cells + num_cells_with_hits + num_empty_drops_with_hits
tot_num_drops_with_hits_incl_don16
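
The same sanity check can be written as an explicit assertion. A sketch on invented data, assuming each droplet carries a single `empty_drop` status ('yes', 'no' or NaN); if a droplet appeared under two statuses the counts would no longer add up.

```python
import numpy as np
import pandas as pd

# Invented hits: c1 has two hits, c3 has no empty_drop status.
toy = pd.DataFrame({
    'cell': ['c1', 'c1', 'c2', 'c3', 'c4'],
    'empty_drop': ['no', 'no', 'yes', np.nan, 'yes'],
})

# Unique droplets per status; dropna=False keeps the NaN group.
per_status = toy.groupby('empty_drop', dropna=False)['cell'].nunique()
total = toy['cell'].nunique()

# The three groups should partition the droplets.
assert per_status.sum() == total
print(per_status)
```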

#### number of droplets that had a hit after 7 filters (f7 dataset)

total number of droplets with f7 hits

In [None]:
f7_t=f7.cell.nunique()
print(f7_t)

cell-containing droplets

In [None]:
f7_c=f7[f7['empty_drop']=='no'].cell.nunique()
print(f7_c)

how many of those cell-containing droplets have celltype annotation?

In [None]:
f7[(f7['empty_drop']=='no') & (f7['hit_type']=='intra')]['cell'].nunique()

empty drops

In [None]:
f7_e=f7[f7['empty_drop']=='yes'].cell.nunique()
print(f7_e)

donor 16 contributes 40 cells for which we don't know whether they are empty or cell-containing

In [None]:
f7_16=f7[(f7['empty_drop'].isna()) & (f7['donor']=='TSP16')].cell.nunique()
print(f7_16)

sanity check, these numbers should add to the total number of droplets with f7 hits, which they do 

In [None]:
f7_e + f7_c + f7_16

#### How many total number of cells and droplets (with or without hits) were in each sample?

This table provides information about the total number of empty and cell containing droplets per sample. Based on columns previously included when I ran cell ranger on these samples. 

In [None]:
sample_stat = hits_yes[['donor','tissue','sample', 'num_cell_drop_per_sample', 'num_empty_drop_per_sample']].drop_duplicates()
sample_stat.head(3)

#### How many total cell-containing and empty droplets did we search through for microbes?
156 Million empty drops, and 1.3 Million cell-containing drops. 

In [None]:
sample_stat['total_num_drops']= sample_stat['num_cell_drop_per_sample'] + sample_stat['num_empty_drop_per_sample']

tot_cell = sample_stat['num_cell_drop_per_sample'].sum()
tot_empty =  sample_stat['num_empty_drop_per_sample'].sum()
print('total number of cell-containing drops',tot_cell)
print('total number of empty drops', tot_empty)
sample_stat.head(2)

#### saving

In [None]:
sample_stat.to_csv(tables + 'num_empty_and_cell_droplets_per_sample_without_filters.csv', index=False)

#### what fraction of total cell-containing and empty droplets had a hit?
let's read the table we got from last notebook 

In [None]:
drop_stat = pd.read_csv(tables + 'impact_of_filters.csv')


drop_stat

In [None]:
f0_empty = drop_stat['num_empty_drops_with_hits'][0] #no filters
f0_cell = drop_stat['num_cell_drops_with_hits'][0]
f7_empty = drop_stat['num_empty_drops_with_hits'][7] #all 7 filters
f7_cell = drop_stat['num_cell_drops_with_hits'][7]


x=pd.DataFrame([f0_empty,f7_empty, f0_cell, f7_cell, tot_empty, tot_cell]).T
x.columns= ['total_empty_droplets_f0_hits', 'total_empty_droplets_f7_hits',
           'total_cell_droplets_f0_hits', 'total_cell_droplets_f7_hits', 
            'total_empty_droplets', 'total_cell_droplets']

#fraction of all empty droplets that have f7 hits
frac_empty_drops_with_f7_hits = x['total_empty_droplets_f7_hits']/x['total_empty_droplets'] 
print('percentage of total empty drops with F7 hits', np.round(frac_empty_drops_with_f7_hits[0]*100,3))

#fraction of all cell-containing droplets that have f7 hits
frac_cell_drops_with_f7_hits = x['total_cell_droplets_f7_hits']/x['total_cell_droplets']
print('percentage of total cell drops with F7 hits', np.round(frac_cell_drops_with_f7_hits[0]*100,2))

#ratios of ratios: comparing cell-containing to empty droplets at the same level of filtering
cell_to_empty_ratio_f7 = frac_cell_drops_with_f7_hits/ frac_empty_drops_with_f7_hits
print('how many times are cell containing drops more likely to harbor F7 hits?', 
      np.round(cell_to_empty_ratio_f7[0],0)) 


In [None]:
x
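
To make the ratio of fractions concrete, here is the arithmetic with made-up round numbers. None of these counts come from the dataset.

```python
# Hypothetical totals and hit counts, chosen only to make the arithmetic easy.
tot_empty_toy, tot_cell_toy = 1_000_000, 10_000
empty_with_hits, cell_with_hits = 500, 100

frac_empty = empty_with_hits / tot_empty_toy  # 0.05% of empty droplets
frac_cell = cell_with_hits / tot_cell_toy     # 1% of cell-containing droplets

# How many times more likely a cell-containing droplet is to carry a hit.
ratio = frac_cell / frac_empty
print(round(ratio, 1))
```

In this toy case far more empty droplets carry hits in absolute terms (500 vs 100), yet a cell-containing droplet is about 20 times more likely to carry one, which is why the comparison is made on fractions.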

### Same analysis as above, on a per-sample basis

let's get the number of droplets (both cells and empty) comprising the f7 and f5 filtered data

In [None]:
lst=[]
elist=[]
clist=[]
for samp in list(f7['sample'].unique()):
    dff = f7[f7['sample']==samp]
    num_cells_with_hits=dff.cell.nunique()
    empty = dff[dff['empty_drop']=='yes'].cell.nunique()
    ce = dff[dff['empty_drop']=='no'].cell.nunique()

    lst.append(num_cells_with_hits)
    elist.append(empty)
    clist.append(ce)

df_sev=pd.DataFrame({})
df_sev['sample'] = list(f7['sample'].unique())
df_sev['num_total_drops_with_f7_hits'] = lst
df_sev['num_empty_drops_with_f7_hits'] = elist
df_sev['num_cell_drops_with_f7_hits'] = clist
df_sev.head(3)

doing the same for f5 dataset

In [None]:
lst=[]
elist=[]
clist=[]
for samp in list(f5['sample'].unique()):
    dff = f5[f5['sample']==samp]
    num_cells_with_hits=dff.cell.nunique()
    empty = dff[dff['empty_drop']=='yes'].cell.nunique()
    ce = dff[dff['empty_drop']=='no'].cell.nunique()

    lst.append(num_cells_with_hits)
    elist.append(empty)
    clist.append(ce)


df_five=pd.DataFrame({})
df_five['sample'] = list(f5['sample'].unique())
df_five['num_total_drops_with_f5_hits'] = lst
df_five['num_empty_drops_with_f5_hits'] = elist
df_five['num_cell_drops_with_f5_hits'] = clist
df_five.head(3)
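
The per-sample loops above could also be written with a groupby. This is an alternative sketch on invented data, not the code used to produce the saved tables; the output column names are generic placeholders.

```python
import pandas as pd

# Invented toy hits table: one row per hit.
toy = pd.DataFrame({
    'sample': ['s1', 's1', 's1', 's2', 's2'],
    'cell': ['a', 'a', 'b', 'c', 'd'],
    'empty_drop': ['no', 'no', 'yes', 'yes', 'yes'],
})

# Unique droplets with hits per sample, regardless of status.
total = toy.groupby('sample')['cell'].nunique().rename('num_total_drops_with_hits')

# Unique droplets per sample and status; missing combinations become 0.
by_status = (toy.groupby(['sample', 'empty_drop'])['cell'].nunique()
                .unstack(fill_value=0)
                .reindex(columns=['yes', 'no'], fill_value=0)
                .rename(columns={'yes': 'num_empty_drops_with_hits',
                                 'no': 'num_cell_drops_with_hits'}))

counts = total.to_frame().join(by_status).reset_index()
print(counts)
```

As in the loops, droplets without an `empty_drop` status count toward the total but toward neither the empty nor the cell column.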

merging data from f5 and f7 datasets

In [None]:
num_cells=df_five.merge(df_sev, on='sample', how='outer').replace(np.nan, 0)
num_cells.head(3)

merging this dataset with sample_stats dataframe to get the total number of cells and empty drops per sample (regardless of whether they have a hit or not). 

In [None]:
data=num_cells.merge(sample_stat, on='sample', how='right').replace(np.nan, 0)
data.head(3)

fractions of empty and cell droplets with f5 and f7 filtered hits per sample


In [None]:
data['frac_empty_drops_with_f5_hits']=data['num_empty_drops_with_f5_hits']/data['num_empty_drop_per_sample']
data['frac_cell_drops_with_f5_hits']=data['num_cell_drops_with_f5_hits']/data['num_cell_drop_per_sample']

data['frac_empty_drops_with_f7_hits']=data['num_empty_drops_with_f7_hits']/data['num_empty_drop_per_sample']
data['frac_cell_drops_with_f7_hits']=data['num_cell_drops_with_f7_hits']/data['num_cell_drop_per_sample']

data.head(3)

#### let's save the fractional data!
fraction of empty and cell containing droplets that contain f5 or f7 hits. Stratified based on sample

In [None]:
data = data.sort_values(by=['frac_cell_drops_with_f5_hits'])
data.head(3)

In [None]:
data.to_csv(tables + 'frac_empty_cell_droplet_with_f5_f7_hits_per_sample.csv', index=False)

## Cell type analysis

### What is the heterogeneity within the same cell type
Note that not all cells have a cell type annotation, so this analysis only uses the ones that do. For the total number of cells in each cell type, I use the TS object counts (the hits_no dataset, i.e. annotated cells without hits) rather than my own cell ranger counts of cells per sample. 

let's do the same thing but stratify based on cell type, rather than sample. 

In [None]:
cols=['filter1']
f1=hits_yes[(hits_yes[cols].any(axis=1)==False)]


In [None]:
#c='tissue_cell_type'
c='celltype2'
f7_c = pd.DataFrame(f7.groupby(c)['cell'].nunique()
                   ).reset_index().rename(columns={'cell':'num_cells_with_f7_hits'})
f7_c

In [None]:
f5_c = pd.DataFrame(f5.groupby(c)['cell'].nunique()
                   ).reset_index().rename(columns={'cell':'num_cells_with_f5_hits'})


In [None]:
f1_c = pd.DataFrame(f1.groupby(c)['cell'].nunique()
                   ).reset_index().rename(columns={'cell':'num_cells_with_f1_hits'})


now let's get the total number of cells from each cell type that DID NOT have a hit. This will include cells in the "hits_no" (component 1) dataset as well as those that are in "hits_yes" (component 2) but did not make it through the f7 filters. let's get component 2 numbers first and then add to component 1. 

In [None]:
hy=hits_yes.groupby([c, 'cell']).size().reset_index().iloc[:,:2]
hy2=hy[c].value_counts(dropna=False).rename_axis(c).reset_index(name='num_cells_with_f0_hits') #column naming that does not depend on the pandas version
hy2.head(3)

f7h = f7_c.merge(hy2, on=c, how='outer').merge(f5_c, on=c, how='outer').merge(f1_c, on=c, how='outer').replace(np.nan, 0)

f7h.head(2)

now let's get component 1 (cells in hits_no) and add it to component 2 

In [None]:
hn=hits_no.groupby([c, 'cell']).size().reset_index().iloc[:,:2]
hn2=hn[c].value_counts(dropna=False).rename_axis(c).reset_index(name='num_cells_without_hits') #column naming that does not depend on the pandas version
f7_data = hn2.merge(f7h, on=c, how='outer').replace(np.nan, 0)
f7_data['total_num_cells'] = f7_data['num_cells_with_f0_hits'] + f7_data['num_cells_without_hits'] #f0 hits already incorporates f7 hits
f7_data['frac_cells_with_f7_hits']= f7_data['num_cells_with_f7_hits']/f7_data['total_num_cells']
f7_data['frac_cells_with_f1_hits']= f7_data['num_cells_with_f1_hits']/f7_data['total_num_cells']

f7_data.head(3)

not a lot of cell types have fewer than 10 cells, but we will exclude those to be able to run some stats

In [None]:
sub=f7_data[f7_data['total_num_cells']>=10]
# sub.to_csv(mainDir10 + 'fraction_of_cells_per_cell_type_positive_for_f7_hits.csv', index=False)

In [None]:
sub2 = sub[['celltype2', 'num_cells_with_f7_hits', 'total_num_cells', 'frac_cells_with_f7_hits']].copy() # copy so the rounding below does not write into a slice of sub
sub2['frac_cells_with_f7_hits'] = np.round(sub2['frac_cells_with_f7_hits'], 5)
sub3 = sub2[sub2['frac_cells_with_f7_hits']>0] # NOTE sub3 excludes cell types with 0 cells that have f7 hits; sub2 keeps them

In [None]:
plt.figure(figsize=[5,15])
sub4 = sub3.sort_values('frac_cells_with_f7_hits', ascending=False)
data = sub4[['celltype2', 'frac_cells_with_f7_hits']]
ax = sns.barplot(y=data['celltype2'], x=data['frac_cells_with_f7_hits'])
ax.tick_params(axis='x', rotation=90) # rotate the x tick labels without overwriting them with the data values

this particular dataframe contains only cell types with greater than 0 fraction of F7 hit-positive cells 

In [None]:
sub4.to_csv(tables + 'celltypes_with_non-zero_fractions_of_hit_positive_cells.csv', index=False)

In [None]:
sub4.head(4)

these stats are for all cell types, including those without F7 hit-positive cells

In [None]:
plt.figure(figsize=[4,2])
sns.boxplot(data=sub2, x='frac_cells_with_f7_hits')

a=pd.DataFrame(sub2['frac_cells_with_f7_hits'].describe()).reset_index()

q3=a.iloc[6, 1] #75th percentile row of describe()
q1=a.iloc[4, 1] #25th percentile row of describe()
iqr=q3-q1
thresh = q3 + 1.5*iqr
print('threshold:', thresh)

out_ct = sub2[sub2['frac_cells_with_f7_hits']>thresh]

print('Number of cell types included in the analysis:', sub2.shape[0])
print('Number of outlier cell types:', out_ct.shape[0])
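
The upper boxplot fence used above (Q3 + 1.5*IQR) can be checked on a tiny invented series. This sketch reads the quartiles with `Series.quantile` instead of from the rows of `describe()`; both give the same 25% and 75% values.

```python
import pandas as pd

# Invented values with one obvious outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

outliers = values[values > upper_fence]
print('threshold:', upper_fence)
print('outliers:', outliers.tolist())
```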


#### fraction of cells in each cell type with f7 hits

In [None]:
a=pd.DataFrame(sub2[['frac_cells_with_f7_hits']].describe()).reset_index()
a

14 out of the 36 outlier cell types are nonimmune cell types (~60% are immune)

saving the outlier cell types (relevant columns only)

In [None]:
out_ct = out_ct.sort_values(by=['frac_cells_with_f7_hits'], ascending=False)
out_ct2 = out_ct.reset_index().drop(columns=['index'])
out_ct2 = out_ct2.rename(columns={'celltype2':'celltype'})
out_ct2

In [None]:
out_ct2.to_csv(tables + 'outlier_cell_types_f7.csv', index=False)

## Single cell stats for F7 dataset
how many microbial hits does each cell have on average? Excluding droplets and cells that had no hits. 

In [None]:
f7=f7.copy() #work on a copy so the new columns are not assigned to a slice of hits_yes
f7['celltype3']=f7['celltype2'].fillna('na')
f7['num_genes']=f7[['n_genes','n_genes_cr']].max(axis=1) #taking the max of ts cr gene counts and mine


In [None]:
cdrops = f7[(f7['empty_drop']=='no')]
edrops = f7[(f7['empty_drop']=='yes')]

print('number of annotated cells with hits after 7 filters:', cdrops[cdrops['hit_type']=='intra']['cell'].nunique())
print('total number of  cells with hits after 7 filters:', cdrops['cell'].nunique())


#### Getting the number of unique hits per cell-containing droplets & empty droplets (F7)

In [None]:
num_uhits_cell=pd.DataFrame(cdrops.groupby(['cell'])['cell_umi'].nunique()).reset_index()
num_uhits_cell = num_uhits_cell.merge(cdrops[['cell','tissue', 'num_genes', 'celltype3']].drop_duplicates(), on='cell')
#species per cell
sp = pd.DataFrame(cdrops.groupby(['cell'])['species'].unique()).reset_index()
num_sp = pd.DataFrame(cdrops.groupby(['cell'])['species'].nunique()).reset_index().rename(columns={'species':'num_unique_species_per_drop'})

#domains per cell
dom = pd.DataFrame(cdrops.groupby(['cell'])['superkingdom'].unique()).reset_index()
num_dom = pd.DataFrame(cdrops.groupby(['cell'])['superkingdom'].nunique()).reset_index().rename(columns={'superkingdom':'num_unique_domains_per_drop'})

num_uhits_cell = num_uhits_cell.merge(sp, on='cell').merge(num_sp, on='cell').merge(dom, on='cell').merge(num_dom, on='cell')
num_uhits_cell = num_uhits_cell.sort_values(by='num_unique_species_per_drop', ascending=False)

# print('number of cells with F7 hits:', num_uhits_cell.shape[0])
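
The per-droplet counts built above (unique hits via `cell_umi`, unique species) can be illustrated on a toy table. This sketch uses named aggregation in a single groupby instead of several merges, and `num_unique_hits` is a placeholder column name that does not appear in the notebook.

```python
import pandas as pd

# Invented hits: c1 has three reads but only two distinct UMIs.
toy = pd.DataFrame({
    'cell': ['c1', 'c1', 'c1', 'c2'],
    'cell_umi': ['c1_u1', 'c1_u1', 'c1_u2', 'c2_u1'],
    'species': ['sp_a', 'sp_a', 'sp_b', 'sp_a'],
})

per_drop = toy.groupby('cell').agg(
    num_unique_hits=('cell_umi', 'nunique'),             # distinct UMIs per droplet
    num_unique_species_per_drop=('species', 'nunique'),  # distinct species per droplet
).reset_index()
print(per_drop)
```

Counting distinct `cell_umi` values rather than rows means duplicate reads of the same molecule are counted once.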



same thing as above but for empty droplets

In [None]:
num_uhits_e=pd.DataFrame(edrops.groupby(['cell'])['cell_umi'].nunique()).reset_index()
num_uhits_e = num_uhits_e.merge(edrops[['cell','tissue', 'num_genes', 'celltype3']].drop_duplicates(), on='cell')
#species per droplet
sp = pd.DataFrame(edrops.groupby(['cell'])['species'].unique()).reset_index()
num_sp = pd.DataFrame(edrops.groupby(['cell'])['species'].nunique()).reset_index().rename(columns={'species':'num_unique_species_per_drop'})
#domains per droplet
dom = pd.DataFrame(edrops.groupby(['cell'])['superkingdom'].unique()).reset_index()
num_dom = pd.DataFrame(edrops.groupby(['cell'])['superkingdom'].nunique()).reset_index().rename(columns={'superkingdom':'num_unique_domains_per_drop'})

num_uhits_e = num_uhits_e.merge(sp, on='cell').merge(num_sp, on='cell').merge(dom, on='cell').merge(num_dom, on='cell')
num_uhits_e = num_uhits_e.sort_values(by='num_unique_species_per_drop', ascending=False)
print('number of empty droplets with F7 hits:', num_uhits_e.shape[0])
num_uhits_e.head(10)

stats related to the number of unique hits per cell-containing and empty droplets (f7 data)

In [None]:
#cell
cell_stat=pd.DataFrame(num_uhits_cell['cell_umi'].describe()).reset_index()
#empty
e_stat=pd.DataFrame(num_uhits_e['cell_umi'].describe()).reset_index().rename(columns={'cell_umi':'empty_umi'})

ce_stats=pd.concat([ cell_stat, e_stat ], axis=1).iloc[:,[0,1,3]]
ce_stats = np.round(ce_stats,2)
ce_stats

saving the above table

In [None]:
ce_stats.to_csv(tables + 'single_cell_and_empty_droplet_stats_F7.csv', index=False)

### Outliers
Exploring the outlier cells and empty droplets

let's get outlier cells (those with a higher number of unique hits, cell_umi, than the rest of the population)
we can use the formula for boxplot outliers: Q1 - 1.5*IQR (left) and Q3 + 1.5*IQR (right)

In [None]:
#getting outlier cells (column 1 of ce_stats holds the cell stats)
q1 = ce_stats.iloc[4,1]
q3 = ce_stats.iloc[6,1]
iqr=q3-q1
thresh = q3 + 1.5*iqr
print('number of unique hits per cell threshold for outliers:', thresh)

#getting outlier empty (column 2 of ce_stats holds the empty droplet stats)
q1_e = ce_stats.iloc[4,2]
q3_e = ce_stats.iloc[6,2]
iqr_e=q3_e-q1_e
thresh_empty = q3_e + 1.5*iqr_e
print('number of unique hits per empty drop threshold for outliers:', thresh_empty)

these are the outlier cells

In [None]:
cell_outs = num_uhits_cell[num_uhits_cell['cell_umi']>thresh]
cell_outs2 =cell_outs.drop(columns=['num_genes'])
cell_outs3 = cell_outs2.rename(columns={'celltype3':'celltype'})



saving outlier cells with F7 hits

In [None]:
cell_outs4 = cell_outs3[['cell', 'tissue', 'celltype', 'species', 'superkingdom']]


In [None]:
cell_outs4.to_csv(tables + 'outlier_cells_with_f7_hits.csv', index=False)

outlier empty droplets

In [None]:
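q1_e, q3_e = num_uhits_e['cell_umi'].quantile([0.25, 0.75]) #quartiles of the empty droplets themselves
empty_outs = num_uhits_e[num_uhits_e['cell_umi'] > q3_e + 1.5*(q3_e - q1_e)]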


In [None]:
empty_outs.to_csv(mainDir10 + 'outlier_empty_droplets_with_f7_hits.csv', index=False)

#### Exploring outliers by domain

Note that the domain that dominates outlier cells is viruses (because of EBV), whereas bacteria and fungi are prevalent in empty droplets

In [None]:
#cell
a = cell_outs['superkingdom'].apply(lambda x: pd.Series(x))
a = np.round(a[0].value_counts(normalize=True),2)
a = a.rename_axis('domain').reset_index(name='fraction of outlier cells with F7 hits') #column naming that does not depend on the pandas version
#empty
b = empty_outs['superkingdom'].apply(lambda x: pd.Series(x))
b = np.round(b[0].value_counts(normalize=True),2)
b = b.rename_axis('domain').reset_index(name='fraction of outlier empty droplets with F7 hits') #column naming that does not depend on the pandas version


In [None]:
outlier_hits_by_domain=a.merge(b)
outlier_hits_by_domain.to_csv(tables + 'fraction_of_f7_outlier_cells_and_emptydrops.csv', index=False)


#### by cell type
Note 6 of the 8 cell types associated with outlier cells are immune cells. 

In [None]:
#fraction of annotated outlier cells per cell type
outcellct = np.round(cell_outs[cell_outs['celltype3']!='na']['celltype3'].value_counts(normalize=True),3).rename_axis('cell_type').reset_index(name='fraction')

saving the above table, though I just talk about the results rather than include a table

In [None]:
outcellct.to_csv(mainDir10 + 'outlier_cells_celltype.csv', index=False)
