# This notebook counts the number of different TnSeq screen types in the database. 

This is used in the manuscript to describe the database in big picture terms.

It relies on the most current metadata (column descriptor) file:

data/meta_data/column_descriptors_standardized_021023.xlsx


In [1]:
import pandas as pd
import os

# Load data

In [2]:
fn_meta = '../../data/meta_data/column_descriptors_standardized_021023.xlsx'
df_meta = pd.read_excel(fn_meta)
df_meta.head(1)

Unnamed: 0,column_ID,wig_files,control,experimental,column_ID_2,column_ID_SI,num_replicates_control,num_replicates_experimental,meaning,year,...,carbon_source,stress_description,GI_RvID,GI_name,MicArr_or_TnSeq,stat_analysis,mouse_strain,cell_type,Mtb_strain,plot_SI_graph
0,2003A_Sassetti,,,,,2003A_Sassetti,,,,2003.0,...,glycerol,-,,,microarray,,,,H37Rv,No


In [3]:
df_meta.columns.tolist()

['column_ID',
 'wig_files',
 'control',
 'experimental',
 'column_ID_2',
 'column_ID_SI',
 'num_replicates_control',
 'num_replicates_experimental',
 'meaning',
 'year',
 'paper_title',
 'paper_URL',
 'journal',
 'first_author',
 'last_author',
 'in_vitro_cell_vivo',
 'in_vitro_media',
 'carbon_source',
 'stress_description',
 'GI_RvID',
 'GI_name',
 'MicArr_or_TnSeq',
 'stat_analysis',
 'mouse_strain',
 'cell_type',
 'Mtb_strain',
 'plot_SI_graph']

# Number of unique papers: 

In [4]:
papers_uniq = list(df_meta.paper_title.unique())
for num, p in enumerate(papers_uniq):
    print(num+1, p)

1 Genes required for mycobacterial growth defined by high density mutagenesis
2 Genetic requirements for mycobacterial survival during infection
3 Genome-wide requirements for Mycobacterium tuberculosis adaptation and survival in macrophages
4 Characterization of mycobacterial virulence genes through genetic interaction mapping
5 High-Resolution Phenotypic Profiling Defines Genes Essential for Mycobacterial Growth and Cholesterol Catabolism
6 Global Assessment of Genomic Regions Required for Growth in Mycobacterium tuberculosis
7 A Hidden Markov Model for identifying essential and growth-defect regions in bacterial genomes from transposon insertion sequencing data.
8 Tryptophan biosynthesis protects mycobacteria from CD4 T cell-mediated killing
9 Peptidoglycan synthesis in Mycobacterium tuberculosis is organized into networks with varying drug susceptibility
10 Lipid metabolism and Type VII secretion systems dominate the genome scale virulence profile of Mycobacterium tuberculosis in h
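
Some screens in this notebook have no `paper_title`, so `unique()` can return a missing-value entry alongside the real titles. The sketch below uses a small hypothetical table (the `toy` frame and its titles are made up for illustration, not taken from the metadata file) to show the difference between `unique()` and `nunique()`, which skips missing values by default.

```python
import pandas as pd

# Hypothetical stand-in for df_meta; rows and titles are invented.
toy = pd.DataFrame({
    "column_ID": ["s1", "s2", "s3", "s4"],
    "paper_title": ["Paper A", "Paper A", "Paper B", None],
})

# unique() keeps the missing value as its own entry.
n_with_missing = len(toy.paper_title.unique())

# nunique() ignores missing values by default (dropna=True).
n_papers = toy.paper_title.nunique()

print(n_with_missing, n_papers)
```

If the goal is a count of publications only, `nunique()` (or `dropna()` before `unique()`) is probably the safer choice.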

#### Number of screens that come from publications: 

In [5]:
df_meta[~df_meta.paper_title.isnull()].shape

(143, 27)

#### Number of screens that come from FLUTE:

In [6]:
df_meta[df_meta.paper_title.isnull()].shape

(15, 27)
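
The two cells above split the screens on whether `paper_title` is missing. A minimal sketch of getting both counts in one pass is shown here on a hypothetical table (`toy` and its values are invented), assuming, as the headings in this notebook do, that a missing title means the screen comes from FLUTE.

```python
import pandas as pd

# Hypothetical stand-in for df_meta; values are invented.
toy = pd.DataFrame({"paper_title": ["Paper A", None, "Paper B", None, "Paper A"]})

# Label each screen by source: a missing title is assumed to mean FLUTE.
source = toy.paper_title.isnull().map({True: "FLUTE", False: "publication"})
counts = source.value_counts()

# The two groups should account for every row.
assert counts.sum() == len(toy)
print(counts)
```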

# Microarray vs. TnSeq: 

In [7]:
df_meta.MicArr_or_TnSeq.value_counts()

MicArr_or_TnSeq
TnSeq         153
microarray      5
Name: count, dtype: int64

In [8]:
df_meta_tn = df_meta[df_meta.MicArr_or_TnSeq == 'TnSeq'].copy()
df_meta_microarr = df_meta[df_meta.MicArr_or_TnSeq == 'microarray'].copy()

- There are 158 screens in total.
- 146 are what we call standardized in the manuscript.
- Of the 12 that are not standardized, 5 are microarray-based.
- The remaining 7 non-standardized screens are TnSeq-based.
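
The counts in this list can be cross-checked with a few lines of arithmetic. The numbers below are copied from the list and from the `value_counts()` output above; the variable names are chosen here for illustration only.

```python
# Counts reported in this notebook.
total_screens = 158
tnseq, microarray = 153, 5
standardized = 146
non_std_tnseq = 7

# TnSeq and microarray screens together make up the full set.
assert tnseq + microarray == total_screens

# Screens that are not standardized.
non_standardized = total_screens - standardized

# They split into the microarray screens and the non-standardized TnSeq screens.
assert microarray + non_std_tnseq == non_standardized

# Every standardized screen is therefore a TnSeq screen.
assert tnseq - non_std_tnseq == standardized

print(non_standardized)
```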

In [9]:
df_meta_microarr.column_ID

0     2003A_Sassetti
1     2003B_Sassetti
2    2005_Rengarajan
3    2006_Joshi_GI_1
4    2006_Joshi_GI_2
Name: column_ID, dtype: object

In [10]:
df_meta_tn[df_meta_tn.column_ID_2.isnull()].column_ID

6              2012_Zhang
19          2015_Kieser_2
20          2015_Kieser_3
21            2015_Mendum
28    2017B_DeJesus_GI_1A
32    2017B_DeJesus_GI_1B
36    2017B_DeJesus_GI_1C
Name: column_ID, dtype: object

# Mouse (in vivo) vs. in vitro vs. macrophage:

In [11]:
df_meta.in_vitro_cell_vivo.value_counts()

in_vitro_cell_vivo
in_vivo     109
in_vitro     47
in_cell       2
Name: count, dtype: int64
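
For the manuscript it may be handier to quote these counts as percentages. The sketch below recomputes the shares from the counts printed above; on the real table, `df_meta.in_vitro_cell_vivo.value_counts(normalize=True)` should give the same fractions.

```python
# Counts copied from the value_counts() output above.
counts = {"in_vivo": 109, "in_vitro": 47, "in_cell": 2}
total = sum(counts.values())

# Share of each screen type, as a percentage rounded to one decimal place.
shares = {kind: round(100 * n / total, 1) for kind, n in counts.items()}

for kind, share in shares.items():
    print(f"{kind}: {share}%")
```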

#### Most of the in vivo screens (82 of 109) come from 2 publications:

In [12]:
df_meta_in_vivo = df_meta[df_meta.in_vitro_cell_vivo == 'in_vivo'].copy()
df_meta_in_vivo.paper_title.value_counts()

paper_title
Host-pathogen genetic interactions underlie tuberculosis susceptibility in genetically diverse mice                       61
Genome-wide host loci regulate M. tuberculosis fitness in immunodivergent mice                                            21
Statistical analysis of genetic interactions in Tn-Seq data                                                               13
Tryptophan biosynthesis protects mycobacteria from CD4 T cell-mediated killing                                             4
Nitric oxide prevents a pathogen-permissive granulocytic inflammation during tuberculosis                                  3
Common Variants in the Glycerol Kinase Gene Reduce Tuberculosis Drug Efficacy                                              3
Characterization of mycobacterial virulence genes through genetic interaction mapping                                      2
Genetic requirements for mycobacterial survival during infection                                                 