The purpose of this notebook is to compare the Tabula Sapiens Microbiome (TSM) dataset to the Human Microbiome Project (HMP), the tumor microbiome dataset by Nejman et al., and the EHTM (or validation, bulk DNAseq of extracted VLPs and BLPs) dataset. Additionally, I obtain data from the PATRIC database on pathogenicity.

Because the notebook was too large to upload to GitHub, I cleared the cell outputs before uploading.

### Loading Libraries

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib_venn import venn2, venn3
import venn
from venn import venn5
plt.rcParams['figure.figsize'] = [5, 5]
plt.rcParams['figure.dpi'] = 150
font_label_size=12
plt.rcParams['font.size'] = font_label_size
plt.rcParams['legend.fontsize'] = font_label_size
plt.rcParams['figure.titlesize'] = font_label_size
plt.rcParams['axes.labelsize']= font_label_size
plt.rcParams['axes.titlesize']= font_label_size
plt.rcParams['xtick.labelsize']= font_label_size
plt.rcParams['ytick.labelsize']= font_label_size
plt.rcParams["font.family"] = "arial"
plt.rcParams['svg.fonttype'] = 'none'
import collections
from collections import Counter, ChainMap
import os
import glob
import re
import itertools
import math
import random
from random import randrange
import string
import numpy as np
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
from statistics import mean
from matplotlib import pyplot
import scipy.stats as st
from sklearn.utils import shuffle
import upsetplot
from upsetplot import UpSet, generate_counts, from_contents, plot
from pyvis.network import Network
import seaborn as sns
cmap = sns.cm.rocket_r
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, Plot, Grid, Range1d
from bokeh.models.glyphs import Text, Rect
from bokeh.layouts import gridplot
from bokeh.sampledata.les_mis import data
import panel as pn
import panel.widgets as pnw
pn.extension()
import holoviews as hv
from holoviews import opts, dim
from holoviews.plotting.util import process_cmap
import plotly.graph_objects as go
import plotly.express as pex
hv.extension('bokeh')
hv.output(size=150)
import pickle5 as pickle

### Directories

In [None]:
mainDir = '/oak/stanford/groups/quake/gita/raw/tab1_20200407/thirdAnalysis/10x/'
mainDir10 = '/oak/stanford/groups/quake/gita/raw/tab3-14_20210420/all/'

paper = '/oak/stanford/groups/quake/gita/raw/nb/microbe/paper/'
ext = paper + 'forGitHub/human_tissue_microbiome_atlas/post_review/external_datasets/'
figs = paper + 'forGitHub/human_tissue_microbiome_atlas/post_review/figures/'

tables = paper + 'forGitHub/human_tissue_microbiome_atlas/post_review/tables/'
dbDir = '/oak/stanford/groups/quake/gita/raw/database/taxonomyNCBI20200125/'
taxDir = dbDir + 'taxonkit/'
tax = pd.read_csv(dbDir + 'ncbi_lineages_2021-01-26.csv')
tax_short=tax[['tax_id','superkingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']] #want to take only the following columns from the lineage dataframe tax 

uhgg = '/oak/stanford/groups/quake/gita/raw/nb/microbe/paper/forGitHub/human_tissue_microbiome_atlas/post_review/external_datasets/UHGG/'



### Explanation of the dataset

- Blast dataframe columns (TS object is missing these columns):
    - **seqName**: e.g. A00111:327:HL57HDSXX:4:1202:22941:23719_TAGGTCAGAGACATCA_AGAAATAACGCC
    - **seq**: microbial sequence 
    - **refName**: the BLAST subject reference gi|1862738216|gb|CP055292.1| 
    - **pathogen**: common name of the hit "Shigella sonnei strain SE6-1 chromosome, complete genome"
    - **bitscore**: see BLAST tutorial
    - **pident**: percent identity between subject and query
    - **evalue**: see BLAST tutorial
    - **qstart**: query start position
    - **qend**: query end position
    -  **sstart**: subject start position
    -  **send**: subject end position
    -  **length**: length of the alignment between subject and query
    - Taxonomy columns 
        -  **tax_id**: taxonomic id with which we can get the following taxonomic categories
        -  **superkingdom**
        -  **phylum**
        -  **class**
        -  **order**
        -  **family**
        -  **genus**
        -  **species**
    -  **umi**: AGAAATAACGCC 
    -  **cell_bc_umi**: cell barcode and umi e.g. TGTCCCATGTACTCTG_CGTTGATACCAC
    -  **cell_umi**: cell barcode, sample, umi e.g. TCCCATGTACTCTGCG_TSP14_Liver_NA_10X_1_1_TTGATACCACTG
    -  **batch**: this is related to how blast was done for these donors (divided into batches with fixed number of input seqs)
    - **filepath**: where the output of the blast resides on my local drive


- TS object columns (blast dataframe does not produce these on its own)
    - **n_counts**: number of reads per cell
    - **n_genes**: number of genes per cell
    - **log2_n_counts**: log2 of n_counts
    - **log2_n_genes**: log2 of n_genes
    - **compartment**: e.g. immune, epithelial
    - **decision**: cell cycle decision
    - **celltype2**: cell type
    - **tissue_cell_type**: tissue and cell type
    - **cell_type_tissue**: cell type and tissue


- Shared columns:
    - **cell**: cell barcode+sample  TGGGCGTGTTGCGCAC_TSP14_Blood_NA_10X_1_1_1_5Prime
    - **cell_bc**: cell barcode with "-1" appended to it. (donor1and2, not for donors 3-16) e.g. CATATTCCAAAGCGGT-1
    - **tissue**: tissue type
    - **donor**: donor (e.g. TSP1)
    - **sample**: sample name e.g. TSP14_Bladder_NA_10X_1_2
    - **hit**: the column used to separate the dataframe into blast (hit=='yes') and ts object (hit=='no')
    - **hit_type**: tells us whether the hit comes from an annotated cell (hit_type=='intra'), unannotated cell ('extra') or no hit ('none')
    - **donor_batch**: donor 1 and 2 dataframe ('1_2'), all others ('3_16')
    - **empty_drop**: set to 'yes' if empty, and 'no' if a cell. Based on a cut off of 200 genes per droplet


- Cell ranger columns (rows with cr column=='yes' should have all this info):
    - **cr**: "yes" if the sample was run through cell ranger pipeline, and "no" otherwise
    - **cr_sample**: the name of the cell ranger file
    - **n_counts_cr**: number of reads per droplet as determined by my cell ranger pipeline
    - **n_genes_cr**: same as above, for number of genes per droplet
    - **num_empty_drop_per_sample**: number of empty droplets in a sample (based on cut off of 200 genes)
    - **num_cell_drop_per_sample**: number of cell-containing droplets in a sample (based on cut off of 200 genes)
    


- Shared columns for removal of contaminants, "orphan" alignments, and environmentally sourced bacteria
    - **filter1**: Removal of species and taxids found in the 10X experimental contamination dataset (1= contaminant, 0=non-contaminant)
    - **filter2**: Removal of genera found in the Salter et al. list of common contaminant genera (1=contaminant)
    - **filter3**: Removal of certain fungal species found in the Haziza et al dataset 
    - **filter4**: Removal of putative contaminants based on ML model prediction
    Filter4 is based on ML model prediction which takes into account the following columns. 
    - **num_tissues**: Number of tissues that each species appears in
    - **num_donors**: Number of donors that each species appears in
    - **abundance**: The total abundance of each species (number of hits) in the TS dataset
    - **med_pident**:  Median percent identity of alignments corresponding to a given species
    - **med_length**: Median alignment length corresponding to a given species
    - **med_ngene**: Median number of genes per droplet that contains a given species
    - **pred**: The prediction of the model. The model is trained at the species level, with an input label of 1 if any of the previous filters is 1 (contaminant), and 0 otherwise. 
    - **R_g**: R_g is c_g/ts_g. c_g is the number of species from genus g that have been detected in the experimental contamination dataset, and ts_g is the number of species from that genus detected in the total ts dataset. The ratio is a measure of confidence in a genus. If a genus has more species detected in the contamination dataset than in the total TS dataset, then even species that were not detected in the contamination dataset are less trustworthy, because we show that taxonomic assignment at the species level using short-read sequencing (~100bp) is not nearly as precise as genus-level assignment.
    - **filter5**: We combine the pident and length alignment parameters into one that measures the number of exact base matches in the alignment. This new parameter is "pidlen" = pident*length/100. filter5 is 0 if pidlen > pthresh, and 1 otherwise. pthresh corresponds to 3 sigma away from the mean of the second pidlen distribution. Essentially, this filter selects for high-quality alignments. 
    - **pidlen**: pident*length/100 corresponds to number of exact base matches in the alignment
    - **emp_habitat1**: coarse-grained habitat information (list of habitats) for bacterial genera at the intersection of ts and GEM
    - **emp_habitat2**: even more coarse-grained habitat information (list of habitats) for bacterial genera at the intersection of ts and GEM
    - **emp_habitat3**: even more coarse-grained habitat information than emp_habitat2 for bacterial genera at the intersection of ts and GEM
    - **assignment**: binary assignment of GEM bacterial species found in the tsm dataset into "free_living_or_other_hosts" and "human_associated"
    - **emp**:	1 is "free_living_or_other_hosts" and 0 is "human_associated" based on whether the species was isolated from the  human microbiome. If found in both natural environments as well as humans, it will be labeled as human associated. It will be NaN if the species is not found in EMP, which in addition to bacterial species will include all viruses and fungi.
    - **emp2**: 1 is "free_living_or_other_hosts" and 0 is "human_associated" based on whether the genus was predominantly isolated from the human microbiome (more species in the human microbiome than outside of it based on GEM study). It will be NaN if the genus is not found in GEM, which in addition to bacterial species will include all viruses and fungi.
    - **assignment2**: binary assignment of gem bacterial genera found in the tsm dataset into "free_living_or_other_hosts" and "human_associated" 
    - **emp1_2**: combination of the emp and emp2 columns. Basically, it is the emp2 (genus) filter, but it can be overridden by the emp (species) filter. So unless a given species has been explicitly found in the human microbiome, if it belongs to a genus that is mostly found outside of the human microbiome, it is deemed not part of the human microbiome
    - **hmp_species**: set to 0 if a species is found in the HMP dataset, NaN otherwise (note, hmp is predominantly bacterial, with a small number of eukaryotic species)
    - **hmp_genus**: set to 0 if a genus is found in the HMP dataset, NaN otherwise 
    - **hmp_phylum**: set to 0 if a phylum is found in the HMP dataset, NaN otherwise 
    - **uhgg_species**: set to 0 if a species is found in the UHGG dataset, NaN otherwise (note, uhgg doesn't have fungi or viruses, but it does have some archaea)
    - **uhgg_genus**: set to 0 if a genus is found in the UHGG dataset, NaN otherwise 
    - **uhgg_phylum**: set to 0 if a phylum is found in the UHGG dataset, NaN otherwise 
    - **hue (filter6)**: combination of **h**mp, **u**hgg and **e**mp filters. Set to 0 if any of the species have been found in association with the human microbiome, set to 1 if only found outside of the human microbiome, and NaN if no information can be gathered from these three databases. Note, only "emp1_2" column offers some values of 1, others (uhgg_species, hmp_species) are either 0 or NaN because they only explore sites in the human body.
    - **chatgpt (filter7)**: this is the result of chatgpt search for habitat of origin for a species that was not found in the three studies explored. Set to 0 if found in humans, and 1 otherwise. 
    - **davinci_combo**: the combined outputs of chatgpt's davinci model to the same prompt posed 3 times (independently, for each species). 
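
As a rough illustration of how `pidlen` and `filter5` relate, the sketch below computes both on a toy table. The helper name `add_pidlen_filter` and the threshold `pthresh=90.0` are hypothetical placeholders; the actual threshold comes from the pidlen distribution, which is not reproduced here.

```python
import pandas as pd

def add_pidlen_filter(df: pd.DataFrame, pthresh: float) -> pd.DataFrame:
    """Add pidlen (exact base matches) and a filter5-style flag (hypothetical helper)."""
    out = df.copy()
    # pident is a percentage, so dividing by 100 gives the number of matching bases
    out["pidlen"] = out["pident"] * out["length"] / 100
    # 0 = high-quality alignment (pidlen above threshold), 1 = flagged
    out["filter5"] = (out["pidlen"] <= pthresh).astype(int)
    return out

# toy alignments; pthresh is an assumed value for illustration only
toy = pd.DataFrame({"pident": [100.0, 95.0, 80.0], "length": [100, 100, 60]})
result = add_pidlen_filter(toy, pthresh=90.0)
print(result)
```

In this toy table the third alignment (80% identity over 60 bases, so 48 matching bases) is the only one flagged.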


### color list

In [None]:
color_lst =['salmon','lavender','mistyrose','lightblue', 'mediumaquamarine','teal','cornflowerblue','plum',
            'steelblue', 'pink', 'thistle','chocolate',  'darkkhaki', 'darksalmon', 'lemonchiffon','lightslategrey',
            'darkgoldenrod', 'darkgray','olive','slategray', 'deepskyblue', 'firebrick', 'gainsboro', 'honeydew',
            'crimson','hotpink', 'indianred','darkorchid','aquamarine','darkgreen','indigo','lawngreen', 'maroon',
            'lightcoral','wheat','tomato','lightseagreen', 'lightskyblue', 'lightyellow','mediumseagreen',
            'linen', 'magenta', 'darkviolet', 'mediumorchid', 'tan','mediumpurple', 'mediumspringgreen', 
            'mintcream', 'cadetblue', 'moccasin','aqua' ,'navajowhite', 'orchid', 'palegoldenrod', 'palegreen',
            'paleturquoise', 'palevioletred', 'coral','papayawhip', 'peachpuff', 'peru',  'cornsilk','powderblue', 
            'purple', 'chartreuse', 'red', 'darkolivegreen','darkslateblue','rosybrown', 'darkslategrey','royalblue', 
            'sandybrown', 'seagreen', 'sienna', 'silver', 'skyblue', 'slateblue', 'darkmagenta','orangered', 'gold',
            'bisque','yellow','orange', 'lightpink','dodgerblue','greenyellow','deeppink']


### Functions

In [None]:
"""
creates a dictionary of colors mapping to categories defined in grouplist. This function pairs up with sankey function. 
for groups that don't need to be colored, simply leave them out of grouplist, and they will be automatically grey. 
df is the input_df to sankey plot
"""

def get_colormap(df, grouplist):
    color_map={}
    counter = -2 #setting this up so that the same colors aren't picked from the color_lst for each category
    for group in grouplist:
        counter = counter +2
        subgroups = df[group].value_counts().index.tolist() #getting the categories for each group (e.g. viruses, bacteria, eukaryotes for superkingdom), sorted by abundance
        subgroups=[x + '_' + group[0:3] for x in subgroups] #adding "_don" or any other three first letters of a group (how sankey is done)
        
        sub_colors=color_lst[counter:counter + len(subgroups)] #getting list of colors from color_lst
        color_sub_dict = dict(zip(subgroups, sub_colors)) #getting dictionary of colors and values
        
        color_map.update(color_sub_dict) #adding the dictionary to the main color dictionary
    return(color_map)
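
One thing to note about `get_colormap`: the start index advances by 2 per group, while each group takes as many colors as it has categories, so a group with more than two categories shares colors with the next group. If distinct colors across groups are wanted, a running offset is one option. The sketch below is a hypothetical variant (`get_colormap_disjoint` and the shortened `palette` are not part of the notebook).

```python
import pandas as pd

# shortened stand-in for color_lst (assumption for this example)
palette = ["salmon", "lavender", "mistyrose", "lightblue", "teal", "plum"]

def get_colormap_disjoint(df, grouplist, colors):
    """Map '<category>_<first 3 letters of group>' to a color, never reusing a color."""
    color_map = {}
    offset = 0  # index of the next unused color
    for group in grouplist:
        subgroups = df[group].value_counts().index.tolist()
        labels = [f"{name}_{group[:3]}" for name in subgroups]
        chosen = colors[offset:offset + len(labels)]
        if len(chosen) < len(labels):
            raise ValueError("not enough colors for all categories")
        color_map.update(zip(labels, chosen))
        offset += len(labels)  # advance by the number of colors actually used
    return color_map

toy = pd.DataFrame({
    "donor": ["TSP1", "TSP1", "TSP2", "TSP14"],
    "tissue": ["Blood", "Liver", "Liver", "Skin"],
})
cmap_toy = get_colormap_disjoint(toy, ["donor", "tissue"], palette)
print(cmap_toy)
```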


"""
creates legends for the sankey plot, using the same first two input parameters as get_colormap function
fontsize --> font size
tsize --> title font size
msize --> marker size
fs --> figure size (width and height)
ncol --> number of columns
"""

def get_legend(df, grouplist, fontsize=12, tsize=12, msize=1.6, fs=3, ncol=2, figname='sankeyplots_'):
    color_map={}
    counter = -2 #setting this up so that the same colors aren't picked from the color_lst for each category

    for group in grouplist:
        counter = counter +2
        
        subgroups = df[group].value_counts().index.tolist() #getting the categories for each group (e.g. viruses, bacteria, eukaryotes for superkingdom), sorted by abundance      
        sub_colors=color_lst[counter:counter + len(subgroups)] #getting list of colors from color_lst

        #getting legend handles
        h1 = [matplotlib.lines.Line2D([],[], marker="s", color=c, linestyle="none") for c in sub_colors]

        #plotting the legend figures
        fig, ax= plt.subplots(figsize=[fs,fs])
        plt.axis('off')
        plt.legend(handles=h1, 
           labels=subgroups,
           loc='lower left', prop={'size':fontsize}, ncol=ncol, numpoints=1, 
           frameon=False, markerscale=msize, title=group, title_fontsize=tsize, labelcolor='dimgrey')
        
        plt.savefig(figs + figname + group + '.svg', bbox_inches='tight')
    return()




# Analysis

### Reading the main dataframe

In [None]:
hits= pd.read_csv(mainDir10 + 'all_dons_cr_processed_7filters_postreview2022_v2.csv')

hits_yes = hits[hits['hit']=='yes'].copy() #copying so that adding a column below does not raise SettingWithCopyWarning
hits_no = hits[hits['hit']=='no']

#creating a tissue_donor column, which is useful for the plots below
hits_yes['tissue_donor']= hits_yes['tissue'] + '_' + hits_yes['donor']

cols=['filter1', 'filter2', 'filter3', 'filter4', 'filter5', 'hue', 'chatgpt']
f7=hits_yes[(hits_yes[cols].any(axis=1)==False)]

cols=['filter1', 'filter2', 'filter3', 'filter4', 'filter5']
f5=hits_yes[(hits_yes[cols].any(axis=1)==False)]

cols=['filter1']
f1=hits_yes[(hits_yes[cols].any(axis=1)==False)]

cont = hits_yes[(hits_yes['filter1']==1)]# these are known contaminants
print('total number of hits without filters:', hits_yes.shape[0])
print('total number of annotated cells without hits:', hits_no.shape[0])
print('f1 num hits:',f1.shape[0])
print('f5 num hits:',f5.shape[0])
print('f7 num hits:',f7.shape[0])
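
The filtered subsets above keep a hit only when none of the listed filter columns flags it. The toy example below sketches that logic; the helper `passes_filters` and the toy values are hypothetical. It also shows that `DataFrame.any` skips missing values by default, so a NaN (for example in `hue`) counts as "not flagged".

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "species": ["a", "b", "c", "d"],
    "filter1": [0, 1, 0, 0],
    "filter5": [0, 0, 0, 1],
    "hue": [np.nan, 0.0, 0.0, np.nan],  # NaN = no habitat information
})

def passes_filters(df, cols):
    """True for rows where no filter column is set (NaN is ignored by any())."""
    return ~df[cols].any(axis=1)

kept = toy[passes_filters(toy, ["filter1", "filter5", "hue"])]
print(kept)
```

Species `a` is kept even though its `hue` value is NaN, which matches how f7 retains hits with no habitat information.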


## Tumor microbiome
### Is there an overlap between species and genera found in the TS dataset and the Nejman et al. tumor microbiome study?
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7757858/#SD3 

Species are obtained from Table S4 (Hits sheet); only those that appear in tumors and pass all 6 contamination filters are selected. Tumor-enriched species are obtained from Figure 4D.


In [None]:
#reading table S4 (Hits sheet modified to contain data on tumors and healthy tissues separately)
df = pd.read_excel(ext + 'nejman_tumor_NIHMS1645237-supplement-Table_S4.xlsx', sheet_name=None)
tumor = df['Hits_tumor'].drop(columns='Unnamed: 0')
tumor = tumor[~((tumor['species']=='Enterococcus faecium') & (tumor['genus']=='Streptococcus'))] #taking out a duplicate entry where Enterococcus faecium is given two different taxonomic assignments
tumor = tumor[tumor['species'].notna() | tumor['genus'].notna()] #taking out entries with neither species nor genus information
tumor = tumor[tumor.sum(axis=1, numeric_only=True)>0] #keeping species that appear in at least one tumor sample (summing the numeric columns only)
tumor['order'] = tumor['order'].str.replace('Enterobacteriales', 'Enterobacterales', regex=False) #in 2016, "Enterobacteriales" was proposed to be reclassified as Enterobacterales


I'm also going to drop those genera that are "unknown" in the tumor dataset

In [None]:
tumor = tumor[tumor.genus.str.contains('Unknown genus', case=False)==False]

#### Creating a dataframe (tumor_sp_ct) that transforms the count table from original tumor dataframe. More useful format for my plots

In [None]:
tum_list = ['Breast_T','Lung_T','Melanoma_T','Pancreas_T','Ovary_T','Bone_T','GBM_T','Colon_T']
tumor_ct=tumor.iloc[:,-9:].dropna().set_index('species') #just getting the count table (and tumor list as columns, also dropping na values - species info is na)
sp_lst=[]
tumtype_lst=[]

for species, counts in tumor_ct.iterrows():
    for i,tumtype in enumerate(tum_list):
        if counts.iloc[i]>0: #positional lookup; assumes the columns of tumor_ct are ordered as in tum_list
            sp_lst.append(species)
            tumtype_lst.append(tumtype)

tumor_sp_ct=pd.DataFrame({}, columns=['species', 'tumor_type'])
tumor_sp_ct['species']=sp_lst
tumor_sp_ct['tumor_type']=tumtype_lst
tumor_sp_ct = tumor_sp_ct.drop_duplicates()
tumor_sp_ct.head(3)
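
The same wide-to-long reshaping can also be sketched with `melt`, which avoids the explicit loop. This is an alternative under the assumption that the count table has species as its index and one column per tumor type; the toy table below is made up.

```python
import pandas as pd

# toy count table: species as index, one column per tumor type
toy_ct = pd.DataFrame(
    {"Breast_T": [1, 0, 2], "Lung_T": [0, 0, 1]},
    index=pd.Index(["sp A", "sp B", "sp C"], name="species"),
)

# one row per (species, tumor type) pair
melted = toy_ct.reset_index().melt(
    id_vars="species", var_name="tumor_type", value_name="n_samples"
)
# keep only the pairs where the species was actually observed
long_form = (
    melted.loc[melted["n_samples"] > 0, ["species", "tumor_type"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(long_form)
```

The row order differs from the loop version (grouped by tumor type rather than by species), which does not matter for the merge that follows.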


#### Getting taxonomic info for each species so I can add this info to the tumor_sp_ct dataframe


In [None]:
tumor_tax = tumor.iloc[:,:7].dropna().drop_duplicates(subset=['species'])
tumor_sp_ct = tumor_sp_ct.merge(tumor_tax, on='species', how='left')
tumor_sp_ct.head(3)


#### Plotting the genera that are found in tumor samples that also appear in Tabula Sapiens samples
To be able to see all the genera in one plot, we're using F7 dataset

In [None]:
gens = tumor['genus'].unique().tolist()
tumor_gen = '|'.join(gens)
hits_fil=f7[f7['genus'].isna()==False] #getting rid of hits with nan genus information

#finding the genera that are in both tumors and normal tissue
tum_hits = hits_fil[hits_fil['genus'].str.contains(tumor_gen, case=False)]

#creating a heatmap of genera found in each tissue_donor sample
tum_gen_heat=tum_hits.groupby(['tissue_donor', 'genus'], as_index=False).count().iloc[:,:3].rename(
    columns={'R_g':'count'}).pivot(index='tissue_donor', columns='genus', values='count').replace(np.nan, 0)

tum_gen_heat_log = np.log2(tum_gen_heat + 1)
g = sns.clustermap(tum_gen_heat_log.T, linewidths=.05, figsize=[15,20], dendrogram_ratio=(.1, .3), col_cluster=False)
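
A caveat on the matching step above: `str.contains` with a joined pattern matches substrings, so a short genus name can also pull in longer names that contain it. The toy comparison below (made-up lists, not the notebook's data) contrasts substring matching with exact, case-insensitive matching via `isin`.

```python
import pandas as pd

toy_hits = pd.Series(["Bacillus", "Paenibacillus", "Lactobacillus", "Escherichia"])
tumor_genera = ["Bacillus", "Escherichia"]

# substring matching, as in the cell above
substring_mask = toy_hits.str.contains("|".join(tumor_genera), case=False)

# exact matching on lower-cased names
exact_mask = toy_hits.str.lower().isin([g.lower() for g in tumor_genera])

print(toy_hits[substring_mask].tolist())
print(toy_hits[exact_mask].tolist())
```

Whether the substring behavior matters depends on the genera involved; the exact variant is worth considering if the heatmap shows genera that are absent from the tumor list.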




Saving the above heatmap as an SI table

In [None]:
tum_gen_heat_log.T.to_excel(tables + 'genera_heatmap_shared_between_tumors_and_tsm_f7.xlsx')

#### The bacterial genera that are found in Nejman et al. tumor dataset that weren't detected in TSM F7
many are unknown genera

#### Intersection of TSM F5 and F7 datasets and Nejman et al. tumor dataset

##### f5

In [None]:
taxcol = 'genus'
ts_ = set(f5[(f5['superkingdom']=='Bacteria') & (f5[taxcol].isna()==False)][taxcol])
tumor_ = set(tumor[tumor[taxcol].isna()==False][taxcol])

v = venn2([ts_, tumor_], set_labels = ('TSM F5','Tumors')) 

v.get_patch_by_id('100').set_color('cornflowerblue')
v.get_patch_by_id('010').set_color('teal')
v.get_patch_by_id('110').set_color('grey')
plt.title('genera')
frac_tum=np.round(len(ts_.intersection(tumor_))/len(tumor_), 2)
plt.show()

print('what fraction of TSM F5 genera is shared with nejman?', np.round(len(ts_.intersection(tumor_))/len(ts_), 2))
print('what fraction of nejman is shared with TSM F5 tissue microbiome?', np.round(len(tumor_.intersection(ts_))/len(tumor_), 2))


##### f7

In [None]:
taxcol = 'genus'
ts_ = set(f7[(f7['superkingdom']=='Bacteria') & (f7[taxcol].isna()==False)][taxcol])
tumor_ = set(tumor[tumor[taxcol].isna()==False][taxcol])

v = venn2([ts_, tumor_], set_labels = ('TSM F7','Tumors')) 

v.get_patch_by_id('100').set_color('cornflowerblue')
v.get_patch_by_id('010').set_color('teal')
v.get_patch_by_id('110').set_color('grey')
plt.title('genera')

plt.show()
frac_tum=np.round(len(ts_.intersection(tumor_))/len(tumor_), 2)

print('what fraction of TSM F7 genera is shared with nejman?', np.round(len(ts_.intersection(tumor_))/len(ts_), 2))
print('what fraction of nejman is shared with TSM F7 tissue microbiome?', np.round(len(tumor_.intersection(ts_))/len(tumor_), 2))
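
The two fractions printed above are computed twice, once for F5 and once for F7. A small helper along these lines could avoid the repetition. `overlap_summary` is a hypothetical name, the toy sets are made up, and the Jaccard index is an extra that the notebook does not report.

```python
def overlap_summary(a: set, b: set) -> dict:
    """Summarize the overlap between two sets of taxa (hypothetical helper)."""
    shared = a & b
    union = a | b
    return {
        "n_shared": len(shared),
        "frac_of_a": len(shared) / len(a) if a else float("nan"),
        "frac_of_b": len(shared) / len(b) if b else float("nan"),
        "jaccard": len(shared) / len(union) if union else float("nan"),
    }

# toy genus sets for illustration
toy_ts = {"Bacteroides", "Escherichia", "Prevotella", "Blautia"}
toy_tumor = {"Escherichia", "Prevotella", "Fusobacterium"}
summary = overlap_summary(toy_ts, toy_tumor)
print(summary)
```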


#### Tumor-enriched species
I'm taking a list of species enriched in various tumors found in panel D of Figure 4. 

Note that we do find these species; however, they are primarily in our contamination dataset. Only 2 of them appear after the first filter is applied (f1 dataset), and none after f5 or f7. 

In [None]:
tumor_species = ['fusobacterium nucleatum', 'corynebacterium US_1715', 
        'staphylococcus aureus', 'paracoccus marcusii', 'klebsiella pneumoniae', 
        'roseomonas mucosa', 'staphylococcus cohnii', 'actinomyces massiliensis', 
        'neisseria macacae', 'enterobacter cloacae']
tumor_sp = '|'.join(tumor_species)

hits_fil=hits_yes[hits_yes.species.isna()==False] #getting rid of hits with nan species information

#finding the species that are in both tumors and normal tissue
tum_hits = hits_fil[hits_fil.species.str.contains(tumor_sp, case=False)].copy() #copying so the column assignment below does not raise SettingWithCopyWarning

#creating a tissue_donor column for the plot
tum_hits['tissue_donor']= tum_hits['tissue'] + '_' + tum_hits['donor']

#creating a heatmap of species found in each tissue_donor sample
tum_sp_heat=tum_hits.groupby(['tissue_donor', 'species'], as_index=False).count().iloc[:,:3].rename(
    columns={'genus':'count'}).pivot(index='tissue_donor', columns='species', values='count').replace(np.nan, 0)

tum_sp_heat_log = np.log2(tum_sp_heat + 1)
g = sns.clustermap(tum_sp_heat_log, linewidths=.5, figsize=[6,12], dendrogram_ratio=(.2, .1), row_cluster=False)



these are the two that appear after filter 1 only

In [None]:
tumor_species = ['fusobacterium nucleatum', 'corynebacterium US_1715', 
        'staphylococcus aureus', 'paracoccus marcusii', 'klebsiella pneumoniae', 
        'roseomonas mucosa', 'staphylococcus cohnii', 'actinomyces massiliensis', 
        'neisseria macacae', 'enterobacter cloacae']
tumor_sp = '|'.join(tumor_species)

hits_fil=f1[f1.species.isna()==False] #getting rid of hits with nan species information

#finding the species that are in both tumors and normal tissue
tum_hits = hits_fil[hits_fil.species.str.contains(tumor_sp, case=False)].copy() #copying so the column assignment below does not raise SettingWithCopyWarning

#creating a tissue_donor column for the plot
tum_hits['tissue_donor']= tum_hits['tissue'] + '_' + tum_hits['donor']

#creating a heatmap of species found in each tissue_donor sample
tum_sp_heat=tum_hits.groupby(['tissue_donor', 'species'], as_index=False).count().iloc[:,:3].rename(
    columns={'genus':'count'}).pivot(index='tissue_donor', columns='species', values='count').replace(np.nan, 0)

tum_sp_heat_log = np.log2(tum_sp_heat + 1)
g = sns.clustermap(tum_sp_heat_log, linewidths=.5, figsize=[4,5], dendrogram_ratio=(.2, .1), row_cluster=False)



#### bacterial genera shared between TS and Nejman et al. datasets (TSM F5)

In [None]:
tumor_sub = tumor[['genus', 'Breast_T','Lung_T','Melanoma_T','Pancreas_T','Ovary_T','Bone_T','GBM_T','Colon_T']]

tumor_sub2 = tumor_sub[tumor_sub['genus'].isnull()==False] #dropping those entries without genus information


#looping through the tumor samples to get counts of each genus in each sample
t =tumor_sub2.groupby(['Breast_T', 'genus'], as_index=False).count()
tum_ord=t[t['Breast_T']==1].iloc[:,1:3]
tum_ord.columns=['genus', 'Breast_T']
tumor_lst=['Lung_T','Melanoma_T','Pancreas_T','Ovary_T','Bone_T','GBM_T','Colon_T']
for tum_name in tumor_lst:
    t =tumor_sub2.groupby([tum_name, 'genus'], as_index=False).count()
    t=t[t[tum_name]==1].iloc[:,1:3]
    t.columns=['genus', tum_name]
    tum_ord = tum_ord.merge(t, on='genus', how='outer')
tum_ord = tum_ord.replace(np.nan, 0)

#getting hits dataframe in the same format as tum_ord
hits_ord_count=f5.groupby(['genus', 'tissue'],as_index=False).count().iloc[:,:3].rename(columns={'R_g':'count'}).pivot(
    index='genus', columns='tissue', values='count').replace(np.nan, 0)

#now merging the two dataframes (hits and tumor genus counts) based on their shared genera
hits_tum_merge=hits_ord_count.merge(tum_ord, on='genus', how='inner').set_index('genus')
#excluding those tissues with 0 hits
hits_tum_merge = hits_tum_merge.loc[:,hits_tum_merge.sum()!=0]

sns.clustermap(np.log2(hits_tum_merge+1), col_cluster=False, figsize=[10,35],
                    dendrogram_ratio=(.1, .22), linewidth=.1)

# plt.savefig(figs + 'tumor_tsm_f5_genera_heatmap.pdf')

##### TSM F7 and Nejman

In [None]:
tumor_sub = tumor[['genus', 'Breast_T','Lung_T','Melanoma_T','Pancreas_T','Ovary_T','Bone_T','GBM_T','Colon_T']]

tumor_sub2 = tumor_sub[tumor_sub['genus'].isnull()==False] #dropping those entries without genus information


#looping through the tumor samples to get counts of each genus in each sample
t =tumor_sub2.groupby(['Breast_T', 'genus'], as_index=False).count()
tum_ord=t[t['Breast_T']==1].iloc[:,1:3]
tum_ord.columns=['genus', 'Breast_T']
tumor_lst=['Lung_T','Melanoma_T','Pancreas_T','Ovary_T','Bone_T','GBM_T','Colon_T']
for tum_name in tumor_lst:
    t =tumor_sub2.groupby([tum_name, 'genus'], as_index=False).count()
    t=t[t[tum_name]==1].iloc[:,1:3]
    t.columns=['genus', tum_name]
    tum_ord = tum_ord.merge(t, on='genus', how='outer')
tum_ord = tum_ord.replace(np.nan, 0)

#getting hits dataframe in the same format as tum_ord
hits_ord_count=f7.groupby(['genus', 'tissue'],as_index=False).count().iloc[:,:3].rename(columns={'R_g':'count'}).pivot(
    index='genus', columns='tissue', values='count').replace(np.nan, 0)

#now merging the two dataframes (hits and tumor genus counts) based on their shared genera
hits_tum_merge=hits_ord_count.merge(tum_ord, on='genus', how='inner').set_index('genus')
#excluding those tissues with 0 hits
hits_tum_merge = hits_tum_merge.loc[:,hits_tum_merge.sum()!=0]

sns.clustermap(np.log2(hits_tum_merge+1), col_cluster=False, figsize=[10,15],
                    dendrogram_ratio=(.1, .24), linewidth=.1)

# plt.savefig(figs + 'tumor_tsm_f7_genera_heatmap.pdf')

#### Number of bacterial genera shared between each tumor type and healthy tissue type 
(TSM F7, variables coming from the heatmap above)

In [None]:
df = hits_tum_merge.copy() #creating a binary presence matrix (if a genus is present, it gets a value of one); copying so hits_tum_merge is not modified
df[df>1]=1
df_ts = df.iloc[:,:-8] #TS tissues (assumes the last 8 columns are the tumor types)

tumor_lst =['Breast_T','Lung_T','Melanoma_T','Pancreas_T','Ovary_T','Bone_T','GBM_T','Colon_T']
new_df = pd.DataFrame(columns=df_ts.columns, index=tumor_lst) #creating a new dataframe to store the number of shared genera

#looping through the tumor types to count the genera shared with each TS tissue
for tum_name in tumor_lst:
    for tiss in df_ts.columns:
        df2 = df.loc[:,tiss] + df.loc[:, tum_name] #if the values sum up to 2, the genus appears in both the TS tissue and the tumor 
        num_orders=df2[df2>1].shape[0] #number of genera shared between ts tissue and tumor
        new_df.loc[tum_name, tiss] = num_orders

#new_df describes the number of shared genera between tumor and ts tissue

g = sns.relplot(data=new_df.T, height=6, markers=['o']*new_df.T.shape[1], alpha=.8)

# Tweak the figure to finalize
g.set(xlabel="\n", ylabel="Number of bacterial genera shared between each \ntumor type and TSM tissue type\n")
g.despine(left=True, bottom=True)
for label in g.ax.get_xticklabels():
    label.set_rotation(90)

plt.grid() 

        

the table that is being visualized above

In [None]:
new_df.T.head(3)
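
Because the presence matrix is binary, the pairwise counts of shared genera can also be obtained with a single matrix product instead of the nested loop. The sketch below uses a made-up presence table; the column split into `tissue_cols` and `tumor_cols` is an assumption for illustration.

```python
import pandas as pd

# toy binary presence matrix: rows are genera, columns are tissues and tumor types
presence = pd.DataFrame(
    {
        "Blood": [1, 0, 1],
        "Liver": [1, 1, 0],
        "Breast_T": [1, 1, 0],
        "Colon_T": [0, 0, 1],
    },
    index=["g1", "g2", "g3"],
)
tissue_cols = ["Blood", "Liver"]
tumor_cols = ["Breast_T", "Colon_T"]

# entry (tumor, tissue) = number of genera present in both
shared = presence[tumor_cols].T @ presence[tissue_cols]
print(shared)
```

Each entry is a dot product of two 0/1 vectors, which counts the genera where both are 1. The result also has a numeric dtype, unlike a dataframe filled cell by cell.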

#### which genera appear in tumors but not in TSM F7 or F5

In [None]:
df = pd.DataFrame(columns=['f5', 'f7'])


taxcol = 'genus'
ts_ = set(f7[(f7['superkingdom']=='Bacteria') & (f7[taxcol].isna()==False)][taxcol])
tumor_ = set(tumor[tumor[taxcol].isna()==False][taxcol])

df['genus']=list(tumor_ - ts_)

df['f7'] = ['missing']*df.shape[0]

ts_f5 = set(f5[(f5['superkingdom']=='Bacteria') & (f5[taxcol].isna()==False)][taxcol])
tumor_ = set(tumor[tumor[taxcol].isna()==False][taxcol])
tum_missing_from_f5=list(tumor_ - ts_f5)


df.loc[df['genus'].isin(tum_missing_from_f5), 'f5']='missing'

df = df.replace(np.nan, 'found')
df

In [None]:
df.to_csv(tables + 'genera_found_in_tumors_but_not_in_tsm.csv', index=False)

In [None]:
f5.tissue.nunique()

## HMP

### Is there an overlap between species and genera found in TSM F7 and the Human Microbiome Project (HMP)?
Downloaded on Dec 22 2021: https://www.hmpdacc.org/hmp/catalog/grid.php?dataset=genomic

Plotting the presence/absence of shared species across TS tissues and HMP body sites.

In [None]:
cat=pd.read_csv(ext + '/hmp_project_catalog_downloaded_on_12_09_2021.csv')
cat.drop(columns=['Unnamed: 17', 'Unnamed: 18'], inplace=True)
cat['species'] = cat['Organism Name'].apply(lambda x: ' '.join(x.split(' ')[0:2])) #getting species information
cat.rename(columns={'HMP Isolation Body Site':'hmp_site'},inplace=True)
cat2 =cat[cat['hmp_site'].isin(['unknown','other','bone','eye','blood','lymph_nodes','liver', 'heart'])==False] #throwing out species from unknown sites or sites that are already in ts
cat3 = cat2.groupby(['hmp_site', 'species'], as_index=False).count().iloc[:,:3].rename(columns={'HMP ID': 'count'})
cat4 = cat3.pivot(index='species', columns='hmp_site', values='count').replace(np.nan, 0)
 
lst = cat3['species']

overlap = f7[f7.species.isin(lst)]
overlap_lst = overlap.species.unique()
ts=overlap.groupby(['species', 'tissue'], as_index=False).count().iloc[:,:3].rename(columns={'method':'count'})
ts2 = ts.pivot(index='species', columns='tissue', values='genus').replace(np.nan, 0) #the 'genus' column holds the per-group counts here
ts2 =ts2.loc[ts2.sum(axis=1)!=0,:] #dropping taxa that don't appear in ts.

ts_hmp = ts2.merge(cat4, on='species')
ts_hmp = ts_hmp.rename(columns={'skin_y': 'skin_hmp', 'skin_x': 'skin'})
cols = ts_hmp.columns.values #taken after the rename so that 'combined' lists the renamed skin columns
mask = ts_hmp.gt(0.0).values
out = [cols[x].tolist() for x in mask]
ts_hmp['combined']= out

#dropping columns with all zero values
ts_hmp = ts_hmp.loc[:, (ts_hmp!=0).any(axis=0)]


plt.figure(figsize=[10,20])

g = sns.heatmap(ts_hmp.iloc[:,:-1], vmin=0, vmax=0.01, linewidth=.5) #want to get presence and absence only
# g.set_xticklabels(g.get_xmajorticklabels(), fontsize = 25)
# g.set_yticklabels(g.get_ymajorticklabels(), fontsize = 25)
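
The species column above is built from the first two words of each organism name. The vectorized sketch below does the same on made-up names in the style of a strain catalog. Note that entries such as "Genus sp. strain" yield "Genus sp.", which is not a resolved species, so some of the species-level overlaps may be worth a second look.

```python
import pandas as pd

# illustrative organism names (not taken from the catalog file)
names = pd.Series([
    "Bacteroides fragilis 3_1_12",
    "Escherichia coli MS 200-1",
    "Lactobacillus sp. 7_1_47FAA",
])

# split on whitespace, keep the first two tokens, and join them back
species = names.str.split().str[:2].str.join(" ")
print(species.tolist())
```

Splitting on any whitespace, rather than on a single space as in the cell above, also tolerates doubled spaces in the names.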

#### Getting the taxonomy information for the HMP dataset
The HMP catalog provides only strain/species information, not higher-level taxonomic ranks.

In [None]:
tax_for_hmp=tax_short[tax_short['species'].isin(cat3['species'])].iloc[:,1:].drop_duplicates()
cat3_with_tax=cat3.merge(tax_for_hmp, on='species', how='outer')

In [None]:
#note: despite the 'order' in the variable names, these two sets hold genera
ts_order_set = set(f7[(f7['superkingdom']=='Bacteria') & (f7['genus'].isna()==False)]['genus'])
hmp_order_set=set(cat3_with_tax[(cat3_with_tax['genus'].isna()==False)]['genus'])

v = venn2([ts_order_set, hmp_order_set], set_labels = ('TSM', 'HMP'))


v.get_patch_by_id('100').set_color('cornflowerblue')
v.get_patch_by_id('010').set_color('teal')
v.get_patch_by_id('110').set_color('salmon')
# plt.title('F7 vs HMP genera')


In [None]:
ts_genus_set = set(f7[(f7['superkingdom']=='Bacteria') & (f7['genus'].isna()==False) & (f7['tissue'].isin(['li','si']))]['genus']) #excluding NaN genera, as in the other site comparisons
hmp_genus_set=set(cat3_with_tax[(cat3_with_tax['superkingdom']=='Bacteria') & 
                                (cat3_with_tax['hmp_site']=='gastrointestinal_tract') &
                               (cat3_with_tax['genus'].isna()==False)]['genus'])
                                                                            

v = venn2([ts_genus_set, hmp_genus_set], set_labels = ('TSM Intestines', 'HMP Gut'))


v.get_patch_by_id('100').set_color('cornflowerblue')
v.get_patch_by_id('010').set_color('teal')
v.get_patch_by_id('110').set_color('salmon')
# plt.title('F7 vs HMP genera')
plt.show()

In [None]:
ts_genus_set = set(f7[(f7['superkingdom']=='Bacteria') & (f7['genus'].isna()==False) & (f7['tissue']=='skin')]['genus'])
hmp_genus_set=set(cat3_with_tax[(cat3_with_tax['superkingdom']=='Bacteria') & (cat3_with_tax['hmp_site']=='skin') &
                                (cat3_with_tax['genus'].isna()==False)]['genus'])
                                                                            

v = venn2([ts_genus_set, hmp_genus_set], set_labels = ('TSM skin', 'HMP skin'))


v.get_patch_by_id('100').set_color('cornflowerblue')
v.get_patch_by_id('010').set_color('teal')
v.get_patch_by_id('110').set_color('salmon')
# plt.title('F7 vs HMP genera')
plt.show()

In [None]:
ts_genus_set = set(f7[(f7['superkingdom']=='Bacteria') & (f7['genus'].isna()==False) & (f7['tissue']=='trachea')]['genus'])
hmp_genus_set=set(cat3_with_tax[(cat3_with_tax['superkingdom']=='Bacteria') & (cat3_with_tax['hmp_site']=='airways') &
                                (cat3_with_tax['genus'].isna()==False)]['genus'])
                                                                            

v = venn2([ts_genus_set, hmp_genus_set], set_labels = ('TSM trachea', 'HMP airways'))


v.get_patch_by_id('100').set_color('cornflowerblue')
v.get_patch_by_id('010').set_color('teal')
v.get_patch_by_id('110').set_color('salmon')
# plt.title('F7 vs HMP genera')
plt.show()

In [None]:
ts_genus_set = set(f7[(f7['superkingdom']=='Bacteria') & (f7['genus'].isna()==False) & (f7['tissue']=='salivary_gland')]['genus'])
hmp_genus_set=set(cat3_with_tax[(cat3_with_tax['superkingdom']=='Bacteria') & (cat3_with_tax['hmp_site']=='oral') &
                                (cat3_with_tax['genus'].isna()==False)]['genus'])
                                                                            

v = venn2([ts_genus_set, hmp_genus_set], set_labels = ('TSM salivary gland', 'HMP oral'))


v.get_patch_by_id('100').set_color('cornflowerblue')
v.get_patch_by_id('010').set_color('teal')
v.get_patch_by_id('110').set_color('salmon')
# plt.title('F7 vs HMP genera')
plt.show()

A table of the genera shared between the TSM F7 and HMP datasets

In [None]:
ts_genus_set = set(f7[(f7['superkingdom']=='Bacteria') & (f7['genus'].isna()==False) & (f7['tissue']=='salivary_gland')]['genus'])
hmp_genus_set=set(cat3_with_tax[(cat3_with_tax['superkingdom']=='Bacteria') & (cat3_with_tax['hmp_site']=='oral') &
                                (cat3_with_tax['genus'].isna()==False)]['genus'])
x1 = pd.DataFrame(ts_genus_set.intersection(hmp_genus_set), columns=['genus'])
x1['hmp_site'] = ['oral']*x1.shape[0]
x1['TSM_F7_site'] = ['salivary_gland']*x1.shape[0]

ts_genus_set = set(f7[(f7['superkingdom']=='Bacteria') & (f7['genus'].isna()==False) & (f7['tissue']=='trachea')]['genus'])
hmp_genus_set=set(cat3_with_tax[(cat3_with_tax['superkingdom']=='Bacteria') & (cat3_with_tax['hmp_site']=='airways') &
                                (cat3_with_tax['genus'].isna()==False)]['genus'])
x2 = pd.DataFrame(ts_genus_set.intersection(hmp_genus_set), columns=['genus'])
x2['hmp_site'] = ['airways']*x2.shape[0]
x2['TSM_F7_site'] = ['trachea']*x2.shape[0]

ts_genus_set = set(f7[(f7['superkingdom']=='Bacteria') & (f7['genus'].isna()==False) & (f7['tissue']=='skin')]['genus'])
hmp_genus_set=set(cat3_with_tax[(cat3_with_tax['superkingdom']=='Bacteria') & (cat3_with_tax['hmp_site']=='skin') &
                                (cat3_with_tax['genus'].isna()==False)]['genus'])
x3 = pd.DataFrame(ts_genus_set.intersection(hmp_genus_set), columns=['genus'])
x3['hmp_site'] = ['skin']*x3.shape[0]
x3['TSM_F7_site'] = ['skin']*x3.shape[0]

ts_genus_set = set(f7[(f7['superkingdom']=='Bacteria') & (f7['genus'].isna()==False) & (f7['tissue'].isin(['si', 'li']))]['genus'])
hmp_genus_set=set(cat3_with_tax[(cat3_with_tax['superkingdom']=='Bacteria') & (cat3_with_tax['hmp_site']=='gastrointestinal_tract') &
                                (cat3_with_tax['genus'].isna()==False)]['genus'])
x4 = pd.DataFrame(ts_genus_set.intersection(hmp_genus_set), columns=['genus'])
x4['hmp_site'] = ['gastrointestinal_tract']*x4.shape[0]
x4['TSM_F7_site'] = ['intestines']*x4.shape[0]

xall=pd.concat([x1,x2,x3,x4]).reset_index(drop=True)
#saving
xall.to_csv(tables + 'genera_shared_between_hmp_tsm_f7_four_sites.csv')
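The four blocks in the cell above repeat the same intersection for different site pairs. As a minimal sketch of that logic, assuming the genus sets have already been extracted per site, the table rows could be built in a loop. The toy sets, `site_pairs` and `shared_genera_rows` below are hypothetical names for illustration and are not part of the notebook.

```python
# Toy genus sets per site; all values are illustrative only.
tsm_genera = {
    'salivary_gland': {'Streptococcus', 'Rothia', 'Bacillus'},
    'skin': {'Staphylococcus', 'Cutibacterium'},
}
hmp_genera = {
    'oral': {'Streptococcus', 'Rothia', 'Veillonella'},
    'skin': {'Staphylococcus', 'Corynebacterium'},
}
# (TSM tissue, HMP site) pairs to compare
site_pairs = [('salivary_gland', 'oral'), ('skin', 'skin')]


def shared_genera_rows(tsm, hmp, pairs):
    """Return one row per genus found in both members of a site pair."""
    rows = []
    for tsm_site, hmp_site in pairs:
        # sorted() makes the row order deterministic
        for genus in sorted(tsm[tsm_site] & hmp[hmp_site]):
            rows.append({'genus': genus,
                         'hmp_site': hmp_site,
                         'TSM_F7_site': tsm_site})
    return rows


rows = shared_genera_rows(tsm_genera, hmp_genera, site_pairs)
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame` to obtain the same three columns as `xall`.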


#### Intersection of HMP, TS, Nejman (number of species)

We've done these operations before to get the tumor dataframe, but here is a copy just in case.

In [None]:
#reading table S4 (Hits sheet modified to contain data on tumors and healthy tissues separately)
df = pd.read_excel(ext + 'nejman_tumor_NIHMS1645237-supplement-Table_S4.xlsx', sheet_name=None)
tumor = df['Hits_tumor'].drop(columns='Unnamed: 0')
tumor = tumor[~((tumor['species']=='Enterococcus faecium') & (tumor['genus']=='Streptococcus'))] #taking out a duplicate entry where enterococcus faecium is given two different taxonomic info
tumor = tumor[(tumor['species'].isnull()==False) | (tumor['genus'].isnull()==False)] #taking out unknown species or genera 
tumor = tumor[tumor.sum(axis=1, numeric_only=True)>0] # these are species that appear in at least one tumor sample (summing the numeric sample columns only)
tumor['order'] = tumor['order'].apply(lambda x: x.replace('Enterobacteriales', 'Enterobacterales')) #in 2016, "Enterobacteriales" was proposed to be reclassified as Enterobacterales
tumor = tumor[tumor.genus.str.contains('Unknown genus', case=False)==False]
tumor.head(3)

In [None]:

tiss = cat3.pivot(index='hmp_site', columns='species', values='count').replace(np.nan, 0)
df = tiss
cols = df.columns
bt = df.apply(lambda x: x > 0)
contents = pd.DataFrame(bt.apply(lambda x: list(cols[x.values]), axis=1), columns=['species']).to_dict()

# Remove Top level from Dictionary
res = dict(ChainMap(*contents.values()))
example = from_contents(res)  

plot(example,sort_by='cardinality')
plt.title('HMP', fontsize=16)
plt.show()

#converting the tiss dataframe to a binary one
tiss_bin = (tiss != 0).astype(int) #0 stays 0, any nonzero count becomes 1



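The UpSet plot above shows how HMP species are distributed across body sites: each bar is the number of species found in exactly that combination of sites, with bars sorted by cardinality.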
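To make the bar heights concrete, here is a small stdlib-only sketch of the counting that, as far as I understand it, an UpSet plot built with `from_contents` displays by default: each item is assigned to the exact combination of categories it belongs to. The function name `exclusive_intersection_sizes` and the toy data are hypothetical.

```python
from collections import Counter


def exclusive_intersection_sizes(contents):
    """Count items by the exact set of categories they appear in."""
    membership = {}
    for category, items in contents.items():
        for item in items:
            membership.setdefault(item, set()).add(category)
    # frozenset so that a combination of categories can be used as a key
    return Counter(frozenset(cats) for cats in membership.values())


# toy site -> species mapping (illustrative labels only)
toy = {
    'oral': ['A', 'B', 'C'],
    'skin': ['B', 'C', 'D'],
    'airways': ['C'],
}
sizes = exclusive_intersection_sizes(toy)
for combo, n in sizes.items():
    print(sorted(combo), n)
```

Species `B` is counted only under the oral+skin combination, not under oral alone or skin alone, which is why the bars of an UpSet plot add up to the number of distinct species.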

## UHGG
### Exploring the intersection with Unified Human Gastrointestinal Genome (UHGG) collection
comprising 204,938 nonredundant genomes from 4,644 gut prokaryotes

https://www.nature.com/articles/s41587-020-0603-3 (paper)\
http://ftp.ebi.ac.uk/pub/databases/metagenomics/mgnify_genomes/human-gut/v2.0.1/ (where data resides) \
Downloaded the genomes-all_metadata.tsv

In [None]:
ug = pd.read_csv(uhgg + 'genomes-all_metadata.tsv', delimiter='\t')

In [None]:
ug['domain'] = ug['Lineage'].apply(lambda x: x.split('d__')[1].split(';p__')[0])
ug['phylum'] =ug['Lineage'].apply(lambda x: x.split('d__')[1].split(';p__')[1].split(';c__')[0].split('_')[0])
ug['class'] =ug['Lineage'].apply(lambda x: x.split('d__')[1].split(';p__')[1].split(';c__')[1].split(';o__')[0])
ug['order'] =ug['Lineage'].apply(lambda x: x.split('d__')[1].split(';p__')[1].split(';c__')[1].split(';o__')[1].split(';f__')[0])
ug['family'] =ug['Lineage'].apply(lambda x: x.split('d__')[1].split(';p__')[1].split(';c__')[1].split(';o__')[1].split(';f__')[1].split(';g__')[0])
ug['genus'] =ug['Lineage'].apply(lambda x: x.split('d__')[1].split(';p__')[1].split(';c__')[1].split(';o__')[1].split(';f__')[1].split(';g__')[1].split(';s__')[0].split('_')[0])
ug['species'] =ug['Lineage'].apply(lambda x: x.split('d__')[1].split(';p__')[1].split(';c__')[1].split(';o__')[1].split(';f__')[1].split(';g__')[1].split(';s__')[1].replace('_A',''))
ug.head(2)
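The chained `split` calls above can be hard to audit. A minimal alternative sketch, assuming each lineage is a `;`-separated string of `<rank letter>__<name>` fields, parses all ranks in one pass. `parse_lineage`, `RANK_NAMES` and the example string are hypothetical and only illustrate the format.

```python
# rank letter -> column name; mirrors the seven columns created above
RANK_NAMES = {'d': 'domain', 'p': 'phylum', 'c': 'class', 'o': 'order',
              'f': 'family', 'g': 'genus', 's': 'species'}


def parse_lineage(lineage):
    """Split 'd__X;p__Y;...' into a {rank: name} dictionary."""
    ranks = {}
    for field in lineage.split(';'):
        prefix, sep, name = field.partition('__')
        if sep and prefix in RANK_NAMES:
            ranks[RANK_NAMES[prefix]] = name
    return ranks


# illustrative lineage string, not taken from the metadata file
example = ('d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;'
           'f__Lachnospiraceae;g__Blautia_A;s__Blautia_A wexlerae')
parsed = parse_lineage(example)
print(parsed['genus'])                # name with its suffix kept
print(parsed['genus'].split('_')[0])  # suffix dropped, as done above for phylum and genus
```

This sketch keeps suffixes such as `_A` in the parsed names; dropping them, as the cell above does for phylum and genus, is left as an explicit second step.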

f7 and uhgg (all tissues)

In [None]:
cat = 'genus'
dataset1=f7[(f7['superkingdom']=='Bacteria') & (f7[cat].isna()==False)]


dataset2= ug[(ug[cat].isna()==False) & (ug[cat]!='') & (ug['domain']=='Bacteria')]



set1 = set(dataset1[cat])
set2 = set(dataset2[cat])
v = venn2([set1, set2], 
     set_labels = ('TSM F7',  'UHGG'))
v.get_patch_by_id('100').set_color('cornflowerblue')
v.get_patch_by_id('010').set_color('crimson')
v.get_patch_by_id('110').set_color('grey')


plt.title('TSM tissue microbiomes vs UHGG gut microbiome')
plt.show()

print('tsm num hits:', dataset1.shape[0])
print('uhgg dataframe size:', dataset2.shape[0])


print('what fraction of TSM F7 genera is shared with UHGG?', np.round(len(set1.intersection(set2))/len(set1), 2))
set1.intersection(set2)
print('what fraction of UHGG is shared with TSM F7 tissue microbiome?', np.round(len(set1.intersection(set2))/len(set2), 2))


f7 and uhgg (intestinal tissues)

In [None]:
cat = 'genus'
dataset1= f7[(f7[cat].isna()==False) & (f7['tissue'].isin(['si', 'li'])) & (f7['superkingdom']=='Bacteria')]

dataset2= ug[(ug[cat].isna()==False) & (ug[cat]!='') & (ug['domain']=='Bacteria')]



set1 = set(dataset1[cat])
set2 = set(dataset2[cat])
v = venn2([set1, set2], 
     set_labels = ('TSM F7',  'UHGG'))
v.get_patch_by_id('100').set_color('cornflowerblue')
v.get_patch_by_id('010').set_color('crimson')
v.get_patch_by_id('110').set_color('grey')


plt.title('TSM Intestinal tissue microbiome vs UHGG gut microbiome')
plt.show()

print('tsm num hits:', dataset1.shape[0])
print('uhgg dataframe size:', dataset2.shape[0])


print('what fraction of TSM F7 intestinal tissue microbiome genera is shared with UHGG?', np.round(len(set1.intersection(set2))/len(set1), 2))
set1.intersection(set2)
print('what fraction of UHGG is shared with TSM F7 intestinal tissue microbiome?', np.round(len(set1.intersection(set2))/len(set2), 2))


f5 and uhgg (all tissues)

In [None]:
cat = 'genus'
dataset1=f5[(f5['superkingdom']=='Bacteria') & (f5[cat].isna()==False)]


dataset2= ug[(ug[cat].isna()==False) & (ug[cat]!='') & (ug['domain']=='Bacteria')]


set1 = set(dataset1[cat])
set2 = set(dataset2[cat])
v = venn2([set1, set2], 
     set_labels = ('TSM F5',  'UHGG'))
v.get_patch_by_id('100').set_color('cornflowerblue')
v.get_patch_by_id('010').set_color('crimson')
v.get_patch_by_id('110').set_color('grey')


plt.title('TSM tissue microbiomes vs UHGG gut microbiome')
plt.show()

print('tsm num hits:', dataset1.shape[0])
print('uhgg dataframe size:', dataset2.shape[0])


print('what fraction of TSM F5 is shared with UHGG?', np.round(len(set1.intersection(set2))/len(set1), 2))
set1.intersection(set2)
print('what fraction of UHGG is shared with TSM F5?', np.round(len(set1.intersection(set2))/len(set2), 2))


f5 and uhgg (intestinal tissues)

In [None]:
cat = 'genus'
dataset1= f5[(f5[cat].isna()==False) & (f5['tissue'].isin(['si', 'li'])) & (f5['superkingdom']=='Bacteria')]

dataset2= ug[(ug[cat].isna()==False) & (ug[cat]!='') & (ug['domain']=='Bacteria')]


set1 = set(dataset1[cat])
set2 = set(dataset2[cat])
v = venn2([set1, set2], 
     set_labels = ('TSM F5',  'UHGG'))
v.get_patch_by_id('100').set_color('cornflowerblue')
v.get_patch_by_id('010').set_color('crimson')
v.get_patch_by_id('110').set_color('grey')


plt.title('TSM Intestinal tissue microbiome vs UHGG gut microbiome')
plt.show()

print('tsm num hits:', dataset1.shape[0])
print('uhgg dataframe size:', dataset2.shape[0])


print('what fraction of TSM F5 intestinal tissue microbiome genera is shared with UHGG?', np.round(len(set1.intersection(set2))/len(set1), 2))
set1.intersection(set2)
print('what fraction of UHGG is shared with TSM F5 intestinal tissue microbiome?', np.round(len(set1.intersection(set2))/len(set2), 2))
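The four comparisons above all compute the same two ratios. A short sketch of that calculation, with a hypothetical helper `overlap_fractions` and toy genus sets, is:

```python
def overlap_fractions(set_a, set_b):
    """Return the shared items and the fraction of each set that is shared."""
    shared = set_a & set_b
    frac_a = round(len(shared) / len(set_a), 2) if set_a else float('nan')
    frac_b = round(len(shared) / len(set_b), 2) if set_b else float('nan')
    return shared, frac_a, frac_b


# toy genus sets, for illustration only
tsm = {'Bacteroides', 'Escherichia', 'Staphylococcus', 'Cutibacterium'}
uhgg = {'Bacteroides', 'Escherichia', 'Blautia', 'Faecalibacterium', 'Roseburia'}

shared, frac_tsm, frac_uhgg = overlap_fractions(tsm, uhgg)
print(sorted(shared), frac_tsm, frac_uhgg)
```

The two fractions differ because they use different denominators: 2 of the 4 toy TSM genera are shared, but only 2 of the 5 toy UHGG genera.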


#### A table summary
containing the bacterial genera shared between F5 or F7 and the UHGG dataset (both for all tissues and for just the intestinal tissues from TSM).

In [None]:
ut = pd.DataFrame(columns=['genus','F5_intestines', 'F7_intestines', 'F5_all_tissues', 'F7_all_tissues'])

cat = 'genus'
dataset1= f5[f5[cat].isna()==False]
dataset2= ug[(ug[cat].isna()==False) & (ug[cat]!='')]
ut['genus'] = list(set(dataset1[cat]).intersection(set(dataset2[cat])))


ut.loc[ut['genus'].isin(list(set(f5[f5['tissue'].isin(['li', 'si'])][cat]))), 'F5_intestines']=1
ut.loc[ut['genus'].isin(list(set(f7[f7['tissue'].isin(['li', 'si'])][cat]))), 'F7_intestines']=1
ut.loc[ut['genus'].isin(list(set(f5[cat]))), 'F5_all_tissues']=1
ut.loc[ut['genus'].isin(list(set(f7[cat]))), 'F7_all_tissues']=1


In [None]:
ut.to_csv(tables + 'genera_overlap_between_f5_f7_datasets_and_uhgg.csv')

## EMP

Earth microbiome project: https://www.nature.com/articles/s41587-020-0718-6#MOESM3 \
Metadata was obtained from: https://portal.nersc.gov/GEM/genomes/ (genome_metadata.tsv)

In [None]:
emp= pd.read_csv(ext + 'earth_microbiome.tsv',delimiter='\t')
emp.ecosystem = emp.ecosystem.astype(str)
emp.head(2)

Getting lineage information from the "ecosystem" column, filling in nonexistent values with NaN, and getting rid of 10 rows that don't have any lineage info.

In [None]:
emp['lin_list'] = emp['ecosystem'].apply(lambda x: list(str(x).split(';')))
emp =emp[emp.ecosystem!='nan'] #there are 10 entries that don't have lineage information. they will be excluded

d=[]
p=[]
c=[]
o=[]
f=[]
g=[]
s=[]
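#removing the rank prefix explicitly: str.strip('s__') removes any leading/trailing 's' or '_' characters,
#which would truncate names that end in those characters (e.g. 'aureus' -> 'aureu')
def drop_prefix(name, prefix):
    return name[len(prefix):] if name.startswith(prefix) else name

for i in list(range(emp['lin_list'].shape[0])):
    lin = emp['lin_list'].iloc[i]
    d.append(drop_prefix(lin[0], 'd__'))
    p.append(drop_prefix(lin[1], 'p__'))
    c.append(drop_prefix(lin[2], 'c__'))
    o.append(drop_prefix(lin[3], 'o__'))
    f.append(drop_prefix(lin[4], 'f__'))
    g.append(drop_prefix(lin[5], 'g__'))
    s.append(drop_prefix(lin[6], 's__'))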
emp['domain'] = d
emp['phylum'] =p
emp['class'] =c
emp['order'] =o
emp['family'] =f
emp['genus'] =g
emp['species'] =s

#filling any non-existing value in the lineage info with nan
for x in ['species', 'genus', 'family', 'order', 'class', 'phylum', 'domain']:
    emp.loc[emp[x]=='', x]=np.nan
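When removing rank prefixes such as `s__`, it is worth keeping in mind that `str.strip` treats its argument as a set of characters to remove from both ends, not as a prefix. The short demonstration below uses a hypothetical helper, `remove_prefix`, to show the difference on an example species label.

```python
def remove_prefix(name, prefix):
    """Drop `prefix` only if `name` actually starts with it."""
    return name[len(prefix):] if name.startswith(prefix) else name


raw = 's__Staphylococcus aureus'

stripped = raw.strip('s__')           # strips 's' and '_' characters from both ends
prefixed = remove_prefix(raw, 's__')  # removes only the leading 's__'

print(repr(stripped))
print(repr(prefixed))
```

Here the character-based strip also removes the final `s` of the species name, while the prefix-based version leaves the name intact. On Python 3.9+, `str.removeprefix` does the same job as the helper.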


There are a lot of categories that mean the same thing but have been labeled slightly differently, or one column provides more insight than another. So the following lines wrangle the labels to consolidate and group them.

In [None]:
emp_fil2 = emp[(emp['genus'].isna()==False)].copy() #copy so the label edits below do not write into a slice of emp

emp_fil2.loc[(emp_fil2.ecosystem_type.str.contains('Skin', case=False)), 'habitat']='human skin'
emp_fil2.loc[ (emp_fil2.ecosystem_type=='Respiratory system'), 'habitat']='human respiratory system'
emp_fil2.loc[(emp_fil2.ecosystem_type=='Reproductive system'), 'habitat']='human reproductive system'
emp_fil2.loc[(emp_fil2.ecosystem_type=='Digestive system') & (emp_fil2.habitat!='Human oral') &
        (emp_fil2.habitat.str.contains('Huma',case=False) | emp_fil2.habitat.str.contains('host-associated',case=False)),
        'habitat']='human digestive system'
emp_fil2.loc[emp_fil2.habitat.str.contains('Oral', case=False), 'habitat']='human oral'

emp_fil2[emp_fil2.habitat.str.contains('human', case=False)].habitat.value_counts()

Using some search terms I found in the EMP dataset, I will group as many of the subcategories as possible into larger bins. This will be the "habitat2" column. Anything else that couldn't fit those labels is "other"; these are small one-off subcategories.


In [None]:
emp_fil2['habitat2']=emp_fil2['habitat']
emp_fil2.loc[emp_fil2['ecosystem_category'].str.contains(('|').join(['Mammals','Animal','Birds','Fish']), case=False), 'habitat2']='non-human vertebrates'
emp_fil2.loc[emp_fil2['ecosystem_category'].str.contains(('|').join(['Insecta','Arthropoda','Invertebrates','Annelida','Cnidaria','Porifera']), case=False), 'habitat2']='invertebrates'
emp_fil2.loc[emp_fil2['ecosystem_category'].str.contains(('|').join(['Plants']), case=False), 'habitat2']='plants'
emp_fil2.loc[emp_fil2['ecosystem_category'].str.contains(('|').join(['Fungi']), case=False), 'habitat2']='fungi'
emp_fil2.loc[emp_fil2['ecosystem_category'].str.contains(('|').join(['Terrestrial']), case=False), 'habitat2']='terrestrial'
emp_fil2.loc[emp_fil2['ecosystem_category'].str.contains(('|').join(['Aquatic']), case=False), 'habitat2']='aquatic'
emp_fil2.loc[emp_fil2['ecosystem_category'].str.contains(('|').join(['Bioreactor','Biotransformation','Bioremediation']), case=False), 'habitat2']='bioreactor'
emp_fil2.loc[emp_fil2['ecosystem_category'].str.contains(('|').join(['Built environment']), case=False), 'habitat2']='human_built_environment'
emp_fil2.loc[emp_fil2['ecosystem_category'].str.contains(('|').join(['Wastewater','Solid waste']), case=False), 'habitat2']='wastewater'
#anything else that couldn't fit those labels is "other", these are small one-off subcategories
#the list must use 'human_built_environment' (with underscores) exactly as assigned above, otherwise those rows are relabeled 'other'
emp_fil2.loc[emp_fil2.habitat2.isin(['non-human vertebrates','invertebrates', 'fungi', 'plants', 'terrestrial', 'aquatic' ,
                                               'bioreactor', 'human_built_environment', 'wastewater', 'human skin','human oral',
                                              'human respiratory system','human reproductive system','human digestive system'])==False, 'habitat2']='other'

I will also create more coarse grained labels under "habitat3" column. 

In [None]:
human = ['human digestive system','human oral','human skin', 'human reproductive system',
        'human respiratory system'] 
other_hosts = ['non-human vertebrates', 'invertebrates', 'plants', 'fungi']
natural_environments =['aquatic', 'terrestrial']
human_built_environments =['bioreactor', 'wastewater', 'human_built_environment']

emp_fil2.loc[emp_fil2['habitat2'].str.contains(('|').join([x for x in human]), case=False), 'habitat3']='human'
emp_fil2.loc[emp_fil2['habitat2'].str.contains(('|').join([x for x in other_hosts]), case=False), 'habitat3']='other_hosts'
emp_fil2.loc[emp_fil2['habitat2'].str.contains(('|').join([x for x in natural_environments]), case=False), 'habitat3']='natural_environments'
emp_fil2.loc[emp_fil2['habitat2'].str.contains(('|').join([x for x in human_built_environments]), case=False), 'habitat3']='human_built_environments'
emp_fil2['habitat3'] = emp_fil2.habitat3.fillna('other')

emp_fil2['habitat3'].value_counts(dropna=False)

Creating a yet coarser category called habitat4 that will contain just 2 labels: 'free_living_or_other_hosts' and 'human'.

In [None]:
emp_fil2.loc[emp_fil2['habitat3'].isin(['natural_environments', 'other', 'other_hosts','human_built_environments']), 'habitat4']='free_living_or_other_hosts'
emp_fil2.loc[emp_fil2['habitat3'].isin(['human']), 'habitat4']='human'
emp_fil2['habitat4'].value_counts(dropna=False)
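The habitat2 → habitat3 → habitat4 coarsening above can also be written as explicit lookups, which makes a mismatched label easy to spot because it falls through to 'other'. This is only a sketch of an alternative to the `str.contains` approach used in the notebook; `HABITAT3_OF` and `coarse_labels` are hypothetical names.

```python
# habitat2 label -> habitat3 label, using the same groups as the lists above
HABITAT3_OF = {
    'human digestive system': 'human',
    'human oral': 'human',
    'human skin': 'human',
    'human reproductive system': 'human',
    'human respiratory system': 'human',
    'non-human vertebrates': 'other_hosts',
    'invertebrates': 'other_hosts',
    'plants': 'other_hosts',
    'fungi': 'other_hosts',
    'aquatic': 'natural_environments',
    'terrestrial': 'natural_environments',
    'bioreactor': 'human_built_environments',
    'wastewater': 'human_built_environments',
    'human_built_environment': 'human_built_environments',
}


def coarse_labels(habitat2):
    """Return (habitat3, habitat4) for a habitat2 label."""
    habitat3 = HABITAT3_OF.get(habitat2, 'other')  # unmatched labels become 'other'
    habitat4 = 'human' if habitat3 == 'human' else 'free_living_or_other_hosts'
    return habitat3, habitat4


for label in ['human oral', 'aquatic', 'human_built_environment', 'other']:
    print(label, coarse_labels(label))
```

With a pandas column, the same dictionary could be applied through `Series.map` followed by `fillna('other')`.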

In [None]:
#getting just some columns of earth microbiome dataset 
emp_fil3 = emp_fil2[['ecosystem_category','habitat', 'habitat2', 'habitat3','habitat4','domain','phylum', 'order', 'class','family','genus','species']] #'phylum' listed once to avoid a duplicated column
print(emp_fil3.shape[0])
emp_fil4 =emp_fil3.drop_duplicates()
print(emp_fil4.shape[0])
emp_fil4.head(2)

#### reading the validation dataset

In [None]:
tis = pd.read_csv(mainDir + 'bulk_tissues_blastn_nt_naPhylaNotFiltered_11_30_2021.csv')



## All 7 datasets
Intersections of EMP, UHGG, TSM F5, TSM F7, the extracted human tissue microbiome (EHTM, previously called validation), Nejman and HMP.
- Note: for visual clarity, I am cutting off subsets that contain fewer than 7 genera.

In [None]:
ts5_sp_set = set(f5[(f5['superkingdom']=='Bacteria') & (f5['genus'].isna()==False)]['genus'])
ts7_sp_set = set(f7[(f7['superkingdom']=='Bacteria') & (f7['genus'].isna()==False)]['genus'])
nej_sp_set = set(tumor[(tumor['domain']=='Bacteria') & (tumor['genus'].isna()==False)]['genus'])
hmp_sp_set = set(cat3_with_tax[(cat3_with_tax['superkingdom']=='Bacteria') &
                               (cat3_with_tax['genus'].isna()==False)]['genus'])
ug_set = set(ug[(ug['genus'].isna()==False) & (ug['genus']!='') & (ug['domain']=='Bacteria')]['genus'])
emp_set = set(emp_fil4[(emp_fil4['genus'].isna()==False)  & (emp_fil4['domain']=='Bacteria')]['genus'])
ehtm = set(tis[tis['genus'].isna()==False]['genus'])

res=dict(zip(['TSM F5','TSM F7', 'Nejman', 'HMP', 'UHGG','EMP','EHTM'],
             [ts5_sp_set,ts7_sp_set,nej_sp_set,hmp_sp_set,ug_set, emp_set,ehtm]))
example = from_contents(res)  


plot(example,sort_by='cardinality', show_counts=True, 
     totals_plot_elements=10, element_size=30,min_subset_size=7)

plt.title('Intersection of genera between various datasets', fontsize=16)
plt.show()

#these are other ways the plot could be modified to better fit
# plot(example, show_counts=True, min_degree=2)
# fig = plt.figure(figsize=(10, 3))
# plot(example, fig=fig, element_size=None)
# plot(example, intersection_plot_elements=3)
# plot(example, totals_plot_elements=5)
# plot(example, show_counts=True, min_subset_size=100)

#### Bacterial genera shared by all 7 datasets


In [None]:
four_sp = list(ts5_sp_set & ts7_sp_set & nej_sp_set & hmp_sp_set & ug_set & emp_set & ehtm)
four_sp

#### Plotting presence/absence of shared genera across different datasets and samples

In [None]:
#Tumors
tumor_sp_four_datasets=tumor[tumor['genus'].isin(four_sp)]
tumor_sp_four_datasets_sub = tumor_sp_four_datasets[['genus', 'Breast_T','Lung_T','Melanoma_T','Pancreas_T','Ovary_T','Bone_T','GBM_T','Colon_T']]

#looping through the tumor samples to get counts of each species in each sample
t =tumor_sp_four_datasets_sub.groupby(['Breast_T', 'genus'], as_index=False).count()
tum_sp=t[t['Breast_T']==1].iloc[:,1:3]
tum_sp.columns=['genus', 'Breast_T']
tumor_lst=['Lung_T','Melanoma_T','Pancreas_T','Ovary_T','Bone_T','GBM_T','Colon_T']
for tum_name in tumor_lst:
    t =tumor_sp_four_datasets_sub.groupby([tum_name, 'genus'], as_index=False).count()
    t=t[t[tum_name]==1].iloc[:,1:3]
    t.columns=['genus', tum_name]
    tum_sp = tum_sp.merge(t, on='genus', how='outer')
tumor_sp_four_datasets = tum_sp.replace(np.nan, 0)
tumor_sp_four_datasets.head(3)
col_lst = ['genus', 'Breast_Nejman','Lung_Nejman','Melanoma_Nejman','Pancreas_Nejman',
           'Ovary_Nejman','Bone_Nejman','GBM_Nejman','Colon_Nejman']
tumor_sp_four_datasets.columns=col_lst
tumor_sp_four_datasets.head(3)


#HMP
hmp_sp_four_datasets=cat3_with_tax[cat3_with_tax['genus'].isin(four_sp)].groupby(['genus', 'hmp_site'], 
                                            as_index=False).count().iloc[:,:3].pivot(index='genus', columns='hmp_site').fillna(0)

col_lst = ['airways_HMP', 'ear_HMP','gastrointestinal_HMP', 'nose_HMP', 'oral_HMP', 'skin_HMP','urogenital_HMP']
hmp_sp_four_datasets.columns=col_lst

#TS 
hits_sp_four_datasets = f7[f7['genus'].isin(four_sp)].copy() #copy so the assignment below does not write into a slice of f7
hits_sp_four_datasets['tissue'] = hits_sp_four_datasets['tissue'].apply(lambda x: x+ '_TS')

hits_sp_four_datasets = hits_sp_four_datasets.groupby(['genus', 'tissue'],as_index=False
                                              ).count().iloc[:,:3].pivot(index='genus', columns='tissue'
                                                             ).fillna(0).droplevel(level=0, axis=1)

#Validation
validation_sp = tis[tis['genus'].isin(four_sp)].copy() #copy so the assignment below does not write into a slice of tis

validation_sp['tissue']=validation_sp['tissue'].apply(lambda x: x + '_TSM_DNAseq')
validation_sp_four_datasets = validation_sp.groupby(['genus', 'tissue'], as_index=False).count().iloc[:,:3].pivot(index='genus', columns='tissue'
                                                                         ).fillna(0).droplevel(level=0, axis=1)


#EMP
emp_gen = emp_fil4[emp_fil4['genus'].isin(four_sp)].copy() #copy so the assignment below does not write into a slice of emp_fil4

emp_gen['habitat2']=emp_gen['habitat2'].apply(lambda x: x + '_EMP').apply(lambda x: "_".join(x.split(' ')))
emp_gen2 = emp_gen.groupby(['genus', 'habitat2'], as_index=False).count().iloc[:,:3].pivot(index='genus', columns='habitat2'
                                                                         ).fillna(0).droplevel(level=0, axis=1)

# UHGG
ug_gen = ug[(ug['genus'].isin(four_sp)) & (ug['Country'].isin([np.nan, 'not provided'])==False)].copy() #copy so the assignment below does not write into a slice of ug
ug_gen['Country']=ug_gen['Country'].apply(lambda x: str(x) + '_gut_UHGG').apply(lambda x: "_".join(x.split(' ')))
ug_gen2 = ug_gen.groupby(['genus', 'Country'], as_index=False).count().iloc[:,:3].pivot(index='genus', columns='Country'
                                                                         ).fillna(0).droplevel(level=0, axis=1)



four_sets = hits_sp_four_datasets.merge(validation_sp_four_datasets, on='genus').merge(
    hmp_sp_four_datasets, on='genus').merge(tumor_sp_four_datasets, on='genus').merge(emp_gen2, on='genus').merge(
ug_gen2, on='genus')
four_sets = four_sets.set_index('genus')
four_sets[four_sets>1]=1
four_sets = four_sets.loc[:, (four_sets != 0).any(axis=0)]

plt.figure(figsize=[5,20])
sns.heatmap(four_sets.T)

# SANKEY

First clearing out any NaNs in the genus column, which is the main one we will be using for this plot.

In [None]:
dset=f7
col='genus'
dset_ = dset[dset[col].isna()==False]
print(dset.shape)
print(dset_.shape)



### HMP
Getting the mapping between taxa and the known (or unknown) HMP sites for all taxa in the TS dataset, including all three domains.

In [None]:
cat=pd.read_csv(ext + 'hmp_project_catalog_downloaded_on_12_09_2021.csv')
cat.drop(columns=['Unnamed: 17', 'Unnamed: 18'], inplace=True)
cat['species'] = cat['Organism Name'].apply(lambda x: ' '.join(x.split(' ')[0:2])) #getting species information
cat.rename(columns={'HMP Isolation Body Site':'hmp_site'},inplace=True)
cat2 =cat[cat['hmp_site'].isin(['unknown','other','bone','eye','blood','lymph_nodes', 'liver', 'heart'])==False] #throwing out species from unknown sites or sites that are already in ts
cat3 = cat2.groupby(['hmp_site', 'species'], as_index=False).count().iloc[:,:3].rename(columns={'HMP ID': 'count'})

#getting the taxonomic lineage of the hmp species, not included in the original cat df
cat3 = cat3.merge(tax_short, on='species', how='left')[['hmp_site',col, 'count']].drop_duplicates()

#I want to get only taxa and sites that are in TS dataset
ts_sp_set = dset_[col].unique() # all the unique species in ts dataset including all domains
hmp_data = cat3[['hmp_site', col]] 
hmp_ts = hmp_data[hmp_data[col].isin(ts_sp_set)].drop_duplicates()


#now for other taxa in ts for which there is no hmp site info, I want to add them to the hmp_ts dataframe and add "unknown" as their hmp_site 
species_not_in_hmp=list(set(ts_sp_set) - set(hmp_data[col]))
species_not_in_hmp_df = pd.DataFrame({}, columns=['hmp_site', col])
species_not_in_hmp_df[col]= species_not_in_hmp
species_not_in_hmp_df['hmp_site']=['unknown']*len(species_not_in_hmp)

hmp_ts_sankeymap=pd.concat([hmp_ts, species_not_in_hmp_df])

hmp_ts_sankeymap.head(3)


### Tumor
getting the mapping between species and the known (or unknown) tumors from Nejman et al, for all species in TS dataset including all three domains

In [None]:
#I want to get only taxa and sites that are in TS dataset
ts_sp_set = dset_[col].unique() # all the unique taxa in ts dataset including all domains
nej_data = tumor_sp_ct[['tumor_type', col]]
nej_ts = nej_data[nej_data[col].isin(ts_sp_set)].drop_duplicates()

#now for other taxa in ts for which there is no tumor info, I want to add them to the nej_ts dataframe and add "unknown" as their tumor 
species_not_in_nej=list(set(ts_sp_set) - set(nej_data[col]))
species_not_in_nej_df = pd.DataFrame({}, columns=['tumor_type', col])
species_not_in_nej_df[col]= species_not_in_nej
species_not_in_nej_df['tumor_type']=['Unknown']*len(species_not_in_nej)

nej_ts_sankeymap=pd.concat([nej_ts, species_not_in_nej_df])

nej_ts_sankeymap.head(3)




### UHGG

In [None]:
#I want to get only taxa and sites that are in TS dataset
ts_sp_set = dset_[col].unique() # all the unique species in ts dataset including all domains
ug2=ug[(ug['Country'].isna()==False) & (ug['Country']!='Unknown') & (ug['Country']!='not provided')]
ug_data = ug2[['Country', col]]
ug_ts = ug_data[ug_data[col].isin(ts_sp_set)].drop_duplicates()

#now for other taxa in ts for which there is no uhgg info, I want to add them to the uhgg dataframe and add "unknown" as their country 
species_not_in_ug=list(set(ts_sp_set) - set(ug_data[col]))
species_not_in_ug_df = pd.DataFrame({}, columns=['Country', col])
species_not_in_ug_df[col]= species_not_in_ug
species_not_in_ug_df['Country']=['Unknown']*len(species_not_in_ug)

ug_ts_sankeymap=pd.concat([ug_ts, species_not_in_ug_df])

ug_ts_sankeymap.head(3)

### EMP

In [None]:
#I want to get only taxa and sites that are in TS dataset
ts_sp_set = dset_[col].unique() # all the unique taxa in ts dataset including all domains
emp_data = emp_fil4[['habitat2', col]]
emp_ts = emp_data[emp_data[col].isin(ts_sp_set)].drop_duplicates()

#now for other species in ts for which there is no habitat info, I want to add them to the emp_ts dataframe and add "unknown" as their habitat2 
species_not_in_emp=list(set(ts_sp_set) - set(emp_data[col]))
species_not_in_emp_df = pd.DataFrame({}, columns=['habitat2', col])
species_not_in_emp_df[col]= species_not_in_emp
species_not_in_emp_df['habitat2']=['Unknown']*len(species_not_in_emp)

emp_ts_sankeymap=pd.concat([emp_ts, species_not_in_emp_df])

emp_ts_sankeymap.head(3)

#### The denominator dataframe
This is how I will get the right weights (values). The denominator is the number of times each taxon gets multiplied because it appears in multiple categories (for example, several tumors or hmp_sites). Dividing by it keeps a taxon from growing out of proportion in the final Sankey plot. For each pair of connected columns A and B, the denominator is the weight of A times the weight of B, where the weight is the number of times the taxon is repeated in A or B. As a toy example, if EBV appears across 3 EMP habitats and 4 tumor types, the denominator for the connection between habitats and tumor types for EBV is 3*4 = 12.

In [None]:
denominator_hmp = hmp_ts_sankeymap.groupby([col]).count().reset_index().rename(columns={'hmp_site':
                                                                                             'hmp_den'})
denominator_nej = nej_ts_sankeymap.groupby([col]).count().reset_index().rename(columns={'tumor_type':
                                                                                             'nej_den'})
denominator_ug = ug_ts_sankeymap.groupby([col]).count().reset_index().rename(columns={'Country':
                                                                                             'ug_den'})
denominator_emp = emp_ts_sankeymap.groupby([col]).count().reset_index().rename(columns={'habitat2':
                                                                                             'emp_den'})

denominator= denominator_hmp.merge(denominator_nej).merge(denominator_ug).merge(denominator_emp)
denominator.head(3)
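As a worked example of the denominator described above, the sketch below splits a taxon's total hit count evenly over all links between two columns, so that the links still sum to the original total. The function `link_values`, the hit count of 24 and the labels are hypothetical values chosen for illustration.

```python
from itertools import product


def link_values(total_hits, left_labels, right_labels):
    """Give every (left, right) link an equal share of total_hits."""
    denominator = len(left_labels) * len(right_labels)
    return {(left, right): total_hits / denominator
            for left, right in product(left_labels, right_labels)}


# a toy taxon found in 3 habitats and 4 tumor types, with 24 hits in total
habitats = ['aquatic', 'terrestrial', 'plants']
tumors = ['Breast', 'Lung', 'Colon', 'Bone']

links = link_values(24, habitats, tumors)
print(len(links), sum(links.values()))
```

Without the division, each of the 12 links would carry all 24 hits and the taxon would appear 12 times larger than it is.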

### Connections
#### For connections I just need the hit counts of species within the main dataframe (they will serve as my "value")

In [None]:
sp_tis_hit_count = dset_.groupby(['tissue', col],as_index=False).count().iloc[:,0:3].rename(columns={'R_g':'hits_per_species_per_tissue'})
sp_tis_hit_count.head(3)

In [None]:
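#rename_axis + reset_index(name=...) yields the same two columns regardless of how the pandas version names the value_counts output
tot_hit_count_per_species = dset_[col].value_counts().rename_axis(col).reset_index(
    name='total_hit_count_per_species')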

tot_hit_count_per_species.head(3)


In [None]:
second_half = nej_ts_sankeymap.merge(hmp_ts_sankeymap, how='outer').merge(
    sp_tis_hit_count).merge(tot_hit_count_per_species).merge(ug_ts_sankeymap).merge(emp_ts_sankeymap)
                                                                            

#need to add three characters of the col names to each entry to match previous sankey function
second_half['tumor_type'] = second_half['tumor_type'] + '_' + 'tumor_type'[:3]
second_half['hmp_site'] = second_half['hmp_site'] + '_' + 'hmp_site'[:3]
second_half['tissue'] = second_half['tissue'] + '_' + 'tissue'[:3]
second_half['habitat2'] = second_half['habitat2'] + '_' + 'habitat2'[:3]
second_half['Country'] = second_half['Country'] + '_' + 'Country'[:3]

second_half.head(3)
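
The suffixes matter because a Sankey diagram identifies nodes by name, so the same label in two columns (for example a tissue and an HMP site both called "skin") would collapse into one node. A small sketch of the idea, with a hypothetical helper `add_column_suffix` and made-up rows:

```python
import pandas as pd

def add_column_suffix(df, columns, n=3):
    """Append '_' plus the first n characters of the column name to every value."""
    out = df.copy()
    for c in columns:
        out[c] = out[c] + "_" + c[:n]
    return out

# Hypothetical rows where two columns share the label "skin".
paths = pd.DataFrame({"tissue": ["skin", "lung"],
                      "hmp_site": ["skin", "oral"]})
tagged = add_column_suffix(paths, ["tissue", "hmp_site"])
print(tagged)
```

After tagging, the two "skin" entries become `skin_tis` and `skin_hmp` and are drawn as separate nodes.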


#### tissue to genus mapping

In [None]:
source='tissue'
target=col
value='hits_per_species_per_tissue'

extra=col
g1=[source, target, extra, value]
g2= [source, target, 'value']

tis_to_sp = second_half[[source, target,value]].drop_duplicates()
tis_to_sp['value'] = tis_to_sp[value]
tis_to_sp2 = tis_to_sp[g2].rename(columns={source:'source', target:'dest'})
tis_to_sp2.head(3)


#### genus to hmp_site mapping & legend

In [None]:
source=col
target='hmp_site'
value='total_hit_count_per_species'
extra=col
g1=[source, target, extra, value]
g2= [source, target, 'value']

sp_to_hmp = second_half[[source, target,value]].drop_duplicates()
sp_to_hmp2 = sp_to_hmp.merge(denominator)
sp_to_hmp2['value'] = sp_to_hmp2['total_hit_count_per_species']/(sp_to_hmp2['hmp_den']) #denominator
sp_to_hmp2 = sp_to_hmp2[g2].rename(columns={source:'source', target:'dest'})


#not going to create a color map for species because there are too many species, but will get colors for hmp_site
color_map={}
counter = 0 #setting this up so that the same colors aren't picked from the color_lst for each category
df = sp_to_hmp2
subgroups = list(df['dest'].unique()) #getting the categories for each group (e.g. viruses, bacteria, eukaryotes for superkingdom), sorted by abundance
sub_colors=color_lst[counter: counter + len(subgroups)] #getting list of colors from color_lst
color_sub_dict = dict(zip(subgroups, sub_colors)) #getting dictionary of colors and values
color_map.update(color_sub_dict)


#getting legend
fontsize=12
msize=1.6
fs=1
ncol=1
h1 = [matplotlib.lines.Line2D([],[], marker="s", color=c, linestyle="none") for c in sub_colors] #legend handles
subgroups_modified=[x[:-4] for x in subgroups] #labels without the appended suffix (e.g. _hmp); slicing is used because str.strip removes a set of characters from both ends, not a suffix
subgroups_modified=[x.split('_tract')[0] for x in subgroups_modified] #dropping '_tract' and anything after it from the labels

#plotting the legend figures
fig, ax= plt.subplots(figsize=[fs,fs])
plt.axis('off')
plt.legend(handles=h1, 
   labels=subgroups_modified,
   loc='lower left', prop={'size':fontsize}, ncol=ncol, numpoints=1, 
   frameon=False, markerscale=msize, labelcolor='dimgrey')

plt.savefig(figs + 'hmp_captions.svg', bbox_inches='tight')
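
A note on cleaning the legend labels: `str.strip(chars)` treats its argument as a set of characters to remove from both ends, so it can eat into the label itself. The sketch below contrasts it with plain slicing; the labels are hypothetical.

```python
# Hypothetical node labels carrying the 4-character suffix "_hmp".
labels = ["mouth_hmp", "gut_hmp"]

# str.strip removes any of the characters '_', 'h', 'm', 'p' from both ends.
stripped_by_chars = [x.strip(x[-4:]) for x in labels]

# Slicing removes exactly the last four characters.
stripped_by_slice = [x[:-4] for x in labels]

print(stripped_by_chars)  # ['out', 'gut']
print(stripped_by_slice)  # ['mouth', 'gut']
```

On Python 3.9 or later, `str.removesuffix('_hmp')` is another option when the suffix is known.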



#### hmp_site to tumor_type mapping & legend

In [None]:
source='hmp_site'
target='tumor_type'
value='total_hit_count_per_species'
extra=col
g1=[source, target, extra, value]
g2= [source, target, 'value']

hmp_to_tum = second_half[g1].drop_duplicates()
hmp_to_tum2 = hmp_to_tum.merge(denominator)
hmp_to_tum2['value'] = hmp_to_tum2['total_hit_count_per_species']/(hmp_to_tum2['hmp_den']*hmp_to_tum2['nej_den'])
hmp_to_tum2 = hmp_to_tum2[g2].rename(columns={source:'source', target:'dest'})


#getting colormap
counter=1
df = hmp_to_tum2
subgroups = list(df['dest'].unique()) #getting the categories for each group (e.g. viruses, bacteria, eukaryotes for superkingdom), sorted by abundance
sub_colors=color_lst[counter: counter + len(subgroups)] #getting list of colors from color_lst
color_sub_dict = dict(zip(subgroups, sub_colors)) #getting dictionary of colors and values
color_map.update(color_sub_dict)

#getting legend
fontsize=12
msize=1.6
fs=3
ncol=1
h1 = [matplotlib.lines.Line2D([],[], marker="s", color=c, linestyle="none") for c in sub_colors] #legend handles
subgroups_modified=[x[:-4] for x in subgroups] #labels without the appended suffix (e.g. _tum)
subgroups_modified=[x[:-2] if x.endswith('_T') else x for x in subgroups_modified] #dropping a trailing '_T' if present

#plotting the legend figures
fig, ax= plt.subplots(figsize=[fs,fs])
plt.axis('off')
plt.legend(handles=h1, 
   labels=subgroups_modified,
   loc='lower left', prop={'size':fontsize}, ncol=ncol, numpoints=1, 
   frameon=False, markerscale=msize, labelcolor='dimgrey')
plt.savefig(figs + 'tumor_captions.svg', bbox_inches='tight')


#### Tumor_type to UHGG mapping & legend

In [None]:
source='tumor_type'
target='Country'
value='total_hit_count_per_species'
extra=col
g1=[source, target, extra, value]
g2= [source, target, 'value']

sp_to_ug = second_half[g1].drop_duplicates()
sp_to_ug2 = sp_to_ug.merge(denominator)
sp_to_ug2['value'] = sp_to_ug2['total_hit_count_per_species']/(sp_to_ug2['nej_den']*sp_to_ug2['ug_den']) #denominator: weights of the two connected columns

sp_to_ug2 = sp_to_ug2[g2].rename(columns={source:'source', target:'dest'})

counter = 0 #setting this up so that the same colors aren't picked from the color_lst for each category
df = sp_to_ug2
subgroups = list(df['dest'].unique()) #getting the categories for each group (e.g. viruses, bacteria, eukaryotes for superkingdom), sorted by abundance
sub_colors=color_lst[counter: counter + len(subgroups)] #getting list of colors from color_lst
color_sub_dict = dict(zip(subgroups, sub_colors)) #getting dictionary of colors and values
color_map.update(color_sub_dict)


#getting legend
fontsize=12
msize=1.6
fs=1
ncol=1
h1 = [matplotlib.lines.Line2D([],[], marker="s", color=c, linestyle="none") for c in sub_colors] #legend handles
subgroups_modified=[x.split('_')[0] for x in subgroups] #labels without the appended suffix (e.g. _Cou)

#plotting the legend figures
fig, ax= plt.subplots(figsize=[fs,fs])
plt.axis('off')
plt.legend(handles=h1, 
   labels=subgroups_modified,
   loc='lower left', prop={'size':fontsize}, ncol=ncol, numpoints=1, 
   frameon=False, markerscale=msize, labelcolor='dimgrey')
plt.savefig(figs + 'uhgg_captions.svg', bbox_inches='tight')


#### UHGG to EMP mapping

In [None]:
source='Country'
target='habitat2'
value='total_hit_count_per_species'
extra=col
g1=[source, target, extra, value]
g2= [source, target, 'value']

sp_to_emp = second_half[g1].drop_duplicates()
sp_to_emp2 = sp_to_emp.merge(denominator)
sp_to_emp2['value'] = sp_to_emp2['total_hit_count_per_species']/(sp_to_emp2['ug_den']*sp_to_emp2['emp_den']) #denominator: weights of the two connected columns

sp_to_emp2 = sp_to_emp2[g2].rename(columns={source:'source', target:'dest'})

counter = 0 #setting this up so that the same colors aren't picked from the color_lst for each category
df = sp_to_emp2
subgroups = list(df['dest'].unique()) #getting the categories for each group (e.g. viruses, bacteria, eukaryotes for superkingdom), sorted by abundance
sub_colors=color_lst[counter: counter + len(subgroups)] #getting list of colors from color_lst
color_sub_dict = dict(zip(subgroups, sub_colors)) #getting dictionary of colors and values
color_map.update(color_sub_dict)


#getting legend
fontsize=12
msize=1.6
fs=1
ncol=1
h1 = [matplotlib.lines.Line2D([],[], marker="s", color=c, linestyle="none") for c in sub_colors] #legend handles
subgroups_modified=[x.split('_')[0] for x in subgroups] #labels without the appended suffix (e.g. _hab)

#plotting the legend figures
fig, ax= plt.subplots(figsize=[fs,fs])
plt.axis('off')
plt.legend(handles=h1, 
   labels=subgroups_modified,
   loc='lower left', prop={'size':fontsize}, ncol=ncol, numpoints=1, 
   frameon=False, markerscale=msize, labelcolor='dimgrey')
plt.savefig(figs + 'emp_captions.svg', bbox_inches='tight')

## Final plot

In [None]:
second_half_sankey = pd.concat([tis_to_sp2, sp_to_hmp2,hmp_to_tum2,sp_to_ug2, sp_to_emp2])
second_half_sankey.head(2)


In [None]:
input_df= dset_
group1='phylum' 
group2='order' 
group3='tissue' 

width=800
height=600
colname='cell'
label_text_font_size='0pt'


df=input_df.groupby([group1,group2]).count()[colname].reset_index()
df[group2]=df[group2].apply(lambda x: x + '_' + group2[:3])
df[group1]=df[group1].apply(lambda x: x + '_' + group1[:3])
df.rename(columns={colname:'value', group1:'source', group2:'dest'}, inplace=True)

df2=input_df.groupby([group2,group3]).count()[colname].reset_index()
df2[group2]=df2[group2].apply(lambda x: x+ '_' + group2[:3])
df2[group3]=df2[group3].apply(lambda x: x+  '_' + group3[:3])
df2.rename(columns={colname:'value',group2:'source', group3:'dest'}, inplace=True)


#getting the colormap
color_map_first_half=get_colormap(input_df, [group1,group2,group3]) #note the get_colormap and get_legend functions sort based on abundance
color_map.update(color_map_first_half)

first_half_sankey = pd.concat([df, df2]) #first half of sankey mapping

final_sankey=pd.concat([first_half_sankey, second_half_sankey]) #second half is from code blocks above


sankey1 = hv.Sankey(final_sankey, kdims=["source", "dest"], vdims=["value"])
cmap_list = process_cmap("glasbey_hv")

sankey1.opts(cmap=color_map, label_position='outer',
                                 edge_color='dest', edge_line_width=0, node_line_width=.1,
                                 node_alpha=.7, node_width=30, node_sort=True,
                    width=width, height=height, bgcolor="white",padding=0,label_text_font_size='0pt')





Getting the rest of the legends (first part of the Sankey). The get_legend function that I've written already saves the three plots as SVGs in the figs folder.

In [None]:
input_df= dset_
group1='phylum' 
group2='order' 
group3='tissue' 
get_legend(input_df, [group1,group2,group3], fontsize=12, tsize=12, msize=1.6, fs=3, ncol=2) #this will already save the plots in the figs directory


#### saving the plot as an HTML file

In [None]:
renderer = hv.renderer('bokeh')
renderer.save(sankey1, figs + 'f7_sankey')



#### sankey table
Creating a table that summarizes the flow paths, which may be hard to follow in the Sankey plot above.
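One way to sanity-check such an edge table is to compare the total flow entering and leaving each interior node. The sketch below assumes an edge list with `source`, `dest`, and `value` columns like `final_sankey`; the node names and values are made up.

```python
import pandas as pd

# Hypothetical edge list in the same shape as the Sankey table.
toy_edges = pd.DataFrame({
    "source": ["lung_tis", "skin_tis", "A", "A"],
    "dest":   ["A", "A", "oral_hmp", "gut_hmp"],
    "value":  [4.0, 2.0, 3.0, 3.0],
})

# Total flow entering and leaving each node.
inflow = toy_edges.groupby("dest")["value"].sum().rename("inflow")
outflow = toy_edges.groupby("source")["value"].sum().rename("outflow")

# Interior nodes have both an inflow and an outflow; the first and last
# columns of the diagram have only one and are dropped here.
balance = pd.concat([inflow, outflow], axis=1).dropna()
balance["difference"] = balance["inflow"] - balance["outflow"]
print(balance)
```

A difference of zero for an interior node means the weighting preserved its hit count between adjacent columns.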


In [None]:
final_sankey.head(3)



In [None]:
final_sankey.to_csv(tables + 'f7_sankey_underlying_table.csv', index=False)



## Pathogens
### Exploring possible pathogens from the PATRIC database
The pathogenic species were downloaded from SI Table 2 of "Structure, function and diversity of the healthy human microbiome" by the HMP Consortium (https://www.nature.com/articles/nature11234#MOESM103). The table is derived from the PATRIC database.

In [None]:
pathos = pd.read_csv(ext + '/PATRIC_pathogens_2011-09-12404C-tables2_pathogens.csv', delimiter='\t')
pathos= pathos.iloc[:,:-1]
pathos.columns=['species', 'site', 'mean abd', 'stdev', 'median', 'min', 'max']


#### adding pathogen information to hits dataframe

In [None]:
col='genus'
dset= f7[f7[col].notna()].copy() #copying so the new columns are not written to a view of f7

#flagging species that appear in the PATRIC pathogen table
dset['is_pathogen'] = np.where(dset['species'].isin(pathos['species'].unique()), 'yes', 'no')


These TSM F7 species were found to be potentially pathogenic:

In [None]:
dset[dset['is_pathogen']=='yes']['species'].value_counts()

Some of these species were found in the HMP, Nejman, and UHGG databases

In [None]:
pats = list(set(dset[dset['is_pathogen']=='yes']['species']))

cat3_with_tax[cat3_with_tax['species'].isin(pats)].species.unique()



In [None]:
ug[ug['species'].isin(pats)].species.unique()



In [None]:
tumor[tumor['species'].isin(pats)].species.unique()


## Antibiotic resistance
### Antibiotic resistance status from PATRIC

Downloaded from https://www.patricbrc.org/view/Taxonomy/2#view_tab=amr on Dec 22, 2021.

In [None]:
ars = pd.read_csv(ext + 'PATRIC_genome_amr.csv')
ars.rename(columns={'Antibiotic':'antibiotic', 'Resistant Phenotype':'resistance'},inplace=True)
ars['species'] = ars['Genome Name'].apply(lambda x: ' '.join(x.split(' ')[0:2]))#getting species information, not strain
sub = ars[['Genome Name','species', 'antibiotic', 'resistance']] #getting columns of interest
sub = sub[sub['resistance'].isin(['Intermediate', 'Not defined', 'r', np.nan])==False] #excluding ambiguous, undefined, or missing resistance categories

df=sub.drop_duplicates()
df['antibiotic']=df['antibiotic'].str.replace('-', '_' )
df['antibiotic']=df['antibiotic'].str.replace('/', '_' )

df_sub=df[(df['species'].str.contains('tuber|boydii|flexneri|baumannii', case=False)==False)] #more sparse plots are excluded
df2 = df_sub.groupby(['species','antibiotic','resistance'], as_index=False).count()
df2['log10_num_strains'] = np.log10(df2['Genome Name']+1)

#time to plot
g = sns.catplot(x="antibiotic", y="log10_num_strains", hue="resistance",col="species",
            kind="bar", col_wrap=1, data=df2, palette=['teal','salmon'], height=2, aspect=6) #`height` replaced the older `size` argument in seaborn 0.9
g.set_xticklabels(rotation=90) 
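
The table preparation above can be illustrated on a few made-up rows. This is a sketch only: the genome names and phenotypes are hypothetical, the category names "Resistant" and "Susceptible" are assumed, and strains are counted with `nunique` rather than the notebook's `count()` after dropping duplicates.

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like the AMR export; all values are made up.
amr = pd.DataFrame({
    "Genome Name": ["Escherichia coli K-12", "Escherichia coli O157:H7",
                    "Escherichia coli K-12", "Klebsiella pneumoniae KP1"],
    "antibiotic": ["ampicillin", "ampicillin",
                   "amoxicillin/clavulanic acid", "ampicillin"],
    "resistance": ["Resistant", "Resistant", "Susceptible", "Intermediate"],
})

# Species = first two words of the genome name (drops the strain designation).
amr["species"] = amr["Genome Name"].str.split(" ").str[:2].str.join(" ")

# Keep only clear-cut phenotypes (category names are an assumption).
clear = amr[amr["resistance"].isin(["Resistant", "Susceptible"])].drop_duplicates()

# Number of distinct strains per species, antibiotic, and phenotype.
strain_counts = (clear.groupby(["species", "antibiotic", "resistance"])["Genome Name"]
                      .nunique()
                      .reset_index(name="num_strains"))
strain_counts["log10_num_strains"] = np.log10(strain_counts["num_strains"] + 1)
print(strain_counts)
```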



SI table containing the strain information and antibiotic resistance profile

In [None]:
sub.drop_duplicates().to_csv(tables + 'antibiotic_resistance_profiles_with_strain_information.csv')



In [None]:
#how many bacterial strains are included in this dataset?
sub.drop_duplicates()['Genome Name'].nunique()



Adding antimicrobial resistance information. It turns out that the PATRIC database has no antibiotic resistance information for any of the F7 species.

In [None]:
dset['amr_known'] = np.where(dset['species'].isin(sub['species'].unique()), 'yes', 'no') #flagging species with AMR information in PATRIC
dset[dset['amr_known']=='yes']['species'].value_counts()



# Sankey (shared taxa)
### Plotting a Sankey for the taxa shared between several datasets
Note that this is constructed in a similar way to the full Sankey plot. The only difference is that we select only the taxa at the intersection of the datasets, then subset the original dset_ dataframe, the denominator dataframe, and all others to include only those taxa. In other words, the second_half dataframe is not created again, just subset. Some variables in this section are therefore carried over from the main Sankey plot, and some are renamed to include "_multi" to make the distinction clearer.
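A minimal sketch of the subsetting step, assuming one set of taxa per dataset. The dataset keys, taxon names, and the `paths` table are hypothetical.

```python
import pandas as pd

# Hypothetical taxa detected in each dataset.
taxa_by_dataset = {
    "tabula_sapiens": {"A", "B", "C", "D"},
    "nejman": {"A", "B", "D"},
    "hmp": {"A", "B", "C"},
    "uhgg": {"A", "B"},
    "emp": {"A", "B", "E"},
}

# Taxa present in every dataset.
shared = set.intersection(*(set(v) for v in taxa_by_dataset.values()))

# An existing table is filtered rather than rebuilt.
paths = pd.DataFrame({"genus": ["A", "B", "C", "E"],
                      "tissue": ["lung", "skin", "lung", "skin"]})
paths_shared = paths[paths["genus"].isin(shared)]
print(sorted(shared), len(paths_shared))
```

Because the denominators are computed per taxon, filtering the denominator table by the same set keeps the weights unchanged.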


In [None]:
tri_sp = list(set(ts_sp_set) & set(nej_sp_set) & set(hmp_sp_set) & set(ug_set) & set(emp_set)) #taxa shared by all five datasets

second_half_multi=second_half[second_half[col].isin(tri_sp)]

denominator_multi = denominator[denominator[col].isin(tri_sp)]



### Connections
#### For connections I just need the hit counts of species within the main dataframe (they will serve as my "value")

In [None]:
sp_tis_hit_count = dset_.groupby(['tissue', col],as_index=False).count().iloc[:,0:3].rename(columns={'R_g':'hits_per_species_per_tissue'})
sp_tis_hit_count_multi = sp_tis_hit_count[sp_tis_hit_count[col].isin(tri_sp)]



In [None]:
tot_hit_count_per_species = dset_[col].value_counts().rename_axis(col).reset_index(
    name='total_hit_count_per_species') #naming both columns explicitly so the result is the same across pandas versions

tot_hit_count_per_species_multi = tot_hit_count_per_species[tot_hit_count_per_species[col].isin(tri_sp)]



In [None]:
source='tissue'
target=col
value='hits_per_species_per_tissue'

extra=col
g1=[source, target, extra, value]
g2= [source, target, 'value']

tis_to_sp = second_half_multi[[source, target,value]].drop_duplicates()
tis_to_sp['value'] = tis_to_sp[value]
tis_to_sp2 = tis_to_sp[g2].rename(columns={source:'source', target:'dest'})



#### genus to hmp_site mapping & legend

In [None]:
source=col
target='hmp_site'
value='total_hit_count_per_species'
extra=col
g1=[source, target, extra, value]
g2= [source, target, 'value']

sp_to_hmp = second_half_multi[[source, target,value]].drop_duplicates()
sp_to_hmp2 = sp_to_hmp.merge(denominator_multi)
sp_to_hmp2['value'] = sp_to_hmp2['total_hit_count_per_species']/(sp_to_hmp2['hmp_den']) #denominator
sp_to_hmp2 = sp_to_hmp2[g2].rename(columns={source:'source', target:'dest'})


#not going to create a color map for species because there are too many species, but will get colors for hmp_site
color_map={}
counter = 0 #setting this up so that the same colors aren't picked from the color_lst for each category
df = sp_to_hmp2
subgroups = list(df['dest'].unique()) #getting the categories for each group (e.g. viruses, bacteria, eukaryotes for superkingdom), sorted by abundance
sub_colors=color_lst[counter: counter + len(subgroups)] #getting list of colors from color_lst
color_sub_dict = dict(zip(subgroups, sub_colors)) #getting dictionary of colors and values
color_map.update(color_sub_dict)


#getting legend
fontsize=12
msize=1.6
fs=1
ncol=1
h1 = [matplotlib.lines.Line2D([],[], marker="s", color=c, linestyle="none") for c in sub_colors] #legend handles
subgroups_modified=[x[:-4] for x in subgroups] #labels without the appended suffix (e.g. _hmp); slicing is used because str.strip removes a set of characters from both ends, not a suffix
subgroups_modified=[x.split('_tract')[0] for x in subgroups_modified] #dropping '_tract' and anything after it from the labels

#plotting the legend figures
fig, ax= plt.subplots(figsize=[fs,fs])
plt.axis('off')
plt.legend(handles=h1, 
   labels=subgroups_modified,
   loc='lower left', prop={'size':fontsize}, ncol=ncol, numpoints=1, 
   frameon=False, markerscale=msize, labelcolor='dimgrey')

# plt.savefig(figs + 'hmp_captions.svg', bbox_inches='tight')



#### hmp_site to tumor_type mapping & legend

In [None]:
source='hmp_site'
target='tumor_type'
value='total_hit_count_per_species'
extra=col
g1=[source, target, extra, value]
g2= [source, target, 'value']

hmp_to_tum = second_half_multi[g1].drop_duplicates()
hmp_to_tum2 = hmp_to_tum.merge(denominator_multi)
hmp_to_tum2['value'] = hmp_to_tum2['total_hit_count_per_species']/(hmp_to_tum2['hmp_den']*hmp_to_tum2['nej_den'])
hmp_to_tum2 = hmp_to_tum2[g2].rename(columns={source:'source', target:'dest'})


#getting colormap
counter=1
df = hmp_to_tum2
subgroups = list(df['dest'].unique()) #getting the categories for each group (e.g. viruses, bacteria, eukaryotes for superkingdom), sorted by abundance
sub_colors=color_lst[counter: counter + len(subgroups)] #getting list of colors from color_lst
color_sub_dict = dict(zip(subgroups, sub_colors)) #getting dictionary of colors and values
color_map.update(color_sub_dict)

#getting legend
fontsize=12
msize=1.6
fs=3
ncol=1
h1 = [matplotlib.lines.Line2D([],[], marker="s", color=c, linestyle="none") for c in sub_colors] #legend handles
subgroups_modified=[x[:-4] for x in subgroups] #labels without the appended suffix (e.g. _tum)
subgroups_modified=[x[:-2] if x.endswith('_T') else x for x in subgroups_modified] #dropping a trailing '_T' if present

#plotting the legend figures
fig, ax= plt.subplots(figsize=[fs,fs])
plt.axis('off')
plt.legend(handles=h1, 
   labels=subgroups_modified,
   loc='lower left', prop={'size':fontsize}, ncol=ncol, numpoints=1, 
   frameon=False, markerscale=msize, labelcolor='dimgrey')
# plt.savefig(figs + 'tumor_captions.svg', bbox_inches='tight')


#### Tumor_type to UHGG mapping & legend

In [None]:
source='tumor_type'
target='Country'
value='total_hit_count_per_species'
extra=col
g1=[source, target, extra, value]
g2= [source, target, 'value']

sp_to_ug = second_half_multi[g1].drop_duplicates()
sp_to_ug2 = sp_to_ug.merge(denominator_multi)
sp_to_ug2['value'] = sp_to_ug2['total_hit_count_per_species']/(sp_to_ug2['nej_den']*sp_to_ug2['ug_den']) #denominator: weights of the two connected columns

sp_to_ug2 = sp_to_ug2[g2].rename(columns={source:'source', target:'dest'})

counter = 0 #setting this up so that the same colors aren't picked from the color_lst for each category
df = sp_to_ug2
subgroups = list(df['dest'].unique()) #getting the categories for each group (e.g. viruses, bacteria, eukaryotes for superkingdom), sorted by abundance
sub_colors=color_lst[counter: counter + len(subgroups)] #getting list of colors from color_lst
color_sub_dict = dict(zip(subgroups, sub_colors)) #getting dictionary of colors and values
color_map.update(color_sub_dict)


#getting legend
fontsize=12
msize=1.6
fs=1
ncol=1
h1 = [matplotlib.lines.Line2D([],[], marker="s", color=c, linestyle="none") for c in sub_colors] #legend handles
subgroups_modified=[x.split('_')[0] for x in subgroups] #labels without the appended suffix (e.g. _Cou)

#plotting the legend figures
fig, ax= plt.subplots(figsize=[fs,fs])
plt.axis('off')
plt.legend(handles=h1, 
   labels=subgroups_modified,
   loc='lower left', prop={'size':fontsize}, ncol=ncol, numpoints=1, 
   frameon=False, markerscale=msize, labelcolor='dimgrey')
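# plt.savefig(figs + 'uhgg_captions.svg', bbox_inches='tight') #not saved here, like the other legends in this section, so the full-plot legend is not overwritten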


#### UHGG to EMP mapping

In [None]:
source='Country'
target='habitat2'
value='total_hit_count_per_species'
extra=col
g1=[source, target, extra, value]
g2= [source, target, 'value']

sp_to_emp = second_half_multi[g1].drop_duplicates()
sp_to_emp2 = sp_to_emp.merge(denominator_multi)
sp_to_emp2['value'] = sp_to_emp2['total_hit_count_per_species']/(sp_to_emp2['ug_den']*sp_to_emp2['emp_den']) #denominator: weights of the two connected columns

sp_to_emp2 = sp_to_emp2[g2].rename(columns={source:'source', target:'dest'})

counter = 0 #setting this up so that the same colors aren't picked from the color_lst for each category
df = sp_to_emp2
subgroups = list(df['dest'].unique()) #getting the categories for each group (e.g. viruses, bacteria, eukaryotes for superkingdom), sorted by abundance
sub_colors=color_lst[counter: counter + len(subgroups)] #getting list of colors from color_lst
color_sub_dict = dict(zip(subgroups, sub_colors)) #getting dictionary of colors and values
color_map.update(color_sub_dict)


#getting legend
fontsize=12
msize=1.6
fs=1
ncol=1
h1 = [matplotlib.lines.Line2D([],[], marker="s", color=c, linestyle="none") for c in sub_colors] #legend handles
subgroups_modified=[x.split('_')[0] for x in subgroups] #labels without the appended suffix (e.g. _hab)

#plotting the legend figures
fig, ax= plt.subplots(figsize=[fs,fs])
plt.axis('off')
plt.legend(handles=h1, 
   labels=subgroups_modified,
   loc='lower left', prop={'size':fontsize}, ncol=ncol, numpoints=1, 
   frameon=False, markerscale=msize, labelcolor='dimgrey')
# plt.savefig(figs + 'emp_captions.svg', bbox_inches='tight')

## Final plot

In [None]:
second_half_multi_sankey = pd.concat([tis_to_sp2, sp_to_hmp2,hmp_to_tum2,sp_to_ug2, sp_to_emp2])
second_half_multi_sankey.head(2)


In [None]:
input_df= dset_[dset_[col].isin(tri_sp)]
group1='phylum' 
group2='order' 
group3='tissue' 

width=600
height=400
colname='cell'
label_text_font_size='0pt'


df=input_df.groupby([group1,group2]).count()[colname].reset_index()
df[group2]=df[group2].apply(lambda x: x + '_' + group2[:3])
df[group1]=df[group1].apply(lambda x: x + '_' + group1[:3])
df.rename(columns={colname:'value', group1:'source', group2:'dest'}, inplace=True)

df2=input_df.groupby([group2,group3]).count()[colname].reset_index()
df2[group2]=df2[group2].apply(lambda x: x+ '_' + group2[:3])
df2[group3]=df2[group3].apply(lambda x: x+  '_' + group3[:3])
df2.rename(columns={colname:'value',group2:'source', group3:'dest'}, inplace=True)


#getting the colormap
color_map_first_half=get_colormap(input_df, [group1,group2,group3]) #note the get_colormap and get_legend functions sort based on abundance
color_map.update(color_map_first_half)

first_half_multi_sankey = pd.concat([df, df2]) #first half of sankey mapping

final_sankey_multi=pd.concat([first_half_multi_sankey, second_half_multi_sankey]) #second half is from code blocks above


sankey1 = hv.Sankey(final_sankey_multi, kdims=["source", "dest"], vdims=["value"])
cmap_list = process_cmap("glasbey_hv")

sankey1.opts(cmap=color_map, label_position='outer',
                                 edge_color='dest', edge_line_width=0, node_line_width=.1,
                                 node_alpha=.7, node_width=30, node_sort=True,
                    width=width, height=height, bgcolor="white",padding=0,label_text_font_size='0pt')





In [None]:
renderer = hv.renderer('bokeh')
renderer.save(sankey1, figs + 'f7_sankey_subset_shared_taxa')


In [None]:
final_sankey_multi.head(3)

Saving the Sankey table in this format because it's more readable than nodes and edges.

In [None]:
second_half_multi[['tumor_type','genus','hmp_site','tissue','Country','habitat2']].head(3)

In [None]:
second_half_multi[['tumor_type','genus','hmp_site','tissue','Country','habitat2']].to_csv(tables + 'f7_sankey_subset_shared_taxa_underlying_table.csv', index=False)
