In [1]:
pip install matplotlib


Note: you may need to restart the kernel to use updated packages.


In [2]:
from matplotlib import pyplot as plt

2018-08-20

## Description

Structure and function annotation of compounds used in screen.

## Data structure

One row per compound structure.

## Columns

**compound_stem**: Compound structure identifier.  Matches middle field of BRD-XXXXXXXXX-XXX-XX-X compound identifiers.

**SMILES**: SMILES string for compound structure.

**compound_name**: Name of compound.

**pubchem_cid**: PubChem Compound ID.

**kegg_id**: KEGG Drug ID. The CSV header spells this column `kegg_kid`.

**classifier_target**: Target of compound, if known.

**similar_to**: Compound target family to which this compound is trivially similar in structure and chemical genetic interaction profile.
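
The link between `compound_stem` and the full BRD identifier can be sketched as below. This is a minimal illustration, assuming the stem is always the second dash-separated field; the helper name `compound_stem_of` and the `"DMSO"` label are hypothetical and only used for the example.

```python
def compound_stem_of(brd_id: str) -> str:
    """Return the second dash-separated field of a BRD identifier.

    Strings without a dash (assumed here to be non-BRD labels) are
    returned unchanged.
    """
    parts = brd_id.split("-")
    return parts[1] if len(parts) > 1 else brd_id


stem = compound_stem_of("BRD-K27893489-003-11-5")  # identifier seen in the raw counts
unchanged = compound_stem_of("DMSO")               # hypothetical non-BRD label
print(stem, unchanged)
```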

## Questions?

Email hung {at} broadinstitute.org.

In [3]:
import pandas as pd
compounds_annotation_2018=pd.read_csv('2018-08-20_compound-annotation.csv')
compounds_annotation_2018.head()


Unnamed: 0,compound_stem,SMILES,compound_name,pubchem_cid,kegg_kid,classifier_target,similar_to
0,K88043978,NCC[C@H](O)C(=O)N[C@@H]1C[C@H](N)[C@@H](O[C@H]...,amikacin,37768.0,D00865,30S ribosome 16S rRNA decoding,
1,U78772829,C1[C@@H]([C@H]([C@@H]([C@H]([C@@H]1NC(=O)[C@H]...,amikacin hydrate,16218899.0,,30S ribosome 16S rRNA decoding,
2,A56169713,CNC(C)[C@@H]1CC[C@@H](N)[C@@H](OC2[C@@H](N)C[C...,gentamycin sulfate (gentacycol),53245641.0,,30S ribosome 16S rRNA decoding,
3,A24252652,CN[C@H]1C[C@@H](N)[C@H](O)[C@@H](O[C@@H]2O[C@H...,hygromycin b,56928061.0,,30S ribosome 16S rRNA decoding,
4,M53946149,CCN[C@@H]1C[C@H](N)[C@@H](O[C@H]2OC(CN)=CC[C@H...,netilmicin sulfate,62115.0,,30S ribosome 16S rRNA decoding,


In [4]:
compounds_annotation_2018.shape

(3145, 7)

We assembled a library of 3,226 bioactive small molecules, enriched for compounds with activity against wild-type Mtb based on literature reports (Extended Data Fig. 2c; see Methods), confirmed their activity by testing for inhibition of green fluorescent protein (GFP)-expressing wild-type Mtb, and found that 1,312 (45%) had an MIC90 value (the minimum inhibitory concentration required to inhibit growth by 90% of its maximal rate) of less than 64 µM (Extended Data Fig. 2d). We then screened this chemical library at 1.1, 3.3, 10 and 30 µM against a pool of the first 100 successfully created Mtb hypomorphs in duplicate (Pearson’s r = 0.93), generating 1,290,400 potential chemical–genetic interactions.
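
The count of potential interactions quoted above is simply the product of the screen dimensions, which a short check makes explicit. The variable names are illustrative.

```python
# Screen dimensions as stated in the text.
n_compounds = 3226
n_concentrations = 4      # 1.1, 3.3, 10 and 30 µM
n_strains = 100           # pooled hypomorphs

total_interactions = n_compounds * n_concentrations * n_strains
print(f"{total_interactions:,}")
```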

Most interactions (927,025, 71%) were inhibitory (fold change < 1) (Extended Data Fig. 2e); of these, 55,508 interactions (6%) representing 940 compounds (29%) were strong (P < 10⁻¹⁰). In a minority, protein depletion conferred resistance to inhibitors of wild-type Mtb; for example, the mycothiol cysteine ligase (MshC) hypomorph was resistant to tuberculosis drugs isoniazid (INH) and ethionamide (ETH), known inhibitors of enoyl-[acyl-carrier-protein] reductase (InhA) [25].
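
One way to reproduce this kind of tally from an LFC table is sketched below. It assumes that "inhibitory" corresponds to a negative `log_fold_change` (fold change < 1) and "strong" to `p_value` below 1e-10; the rows are toy values, not taken from the dataset, and the compound and strain labels are hypothetical.

```python
import pandas as pd

# Toy rows using the column names of the LFC table; values are made up.
lfc = pd.DataFrame({
    "compound": ["cmpd_a", "cmpd_a", "cmpd_b", "cmpd_c"],
    "strain": ["s1", "s2", "s1", "s1"],
    "log_fold_change": [-1.5, 0.2, -0.1, -2.0],
    "p_value": [1e-54, 0.3, 1e-3, 1e-12],
})

# Fold change < 1 is the same as log fold change < 0.
inhibitory = lfc["log_fold_change"] < 0
# Strong interactions: inhibitory and below the significance threshold.
strong = inhibitory & (lfc["p_value"] < 1e-10)

n_inhibitory = int(inhibitory.sum())
n_strong = int(strong.sum())
n_strong_compounds = lfc.loc[strong, "compound"].nunique()
print(n_inhibitory, n_strong, n_strong_compounds)
```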

Using an orthogonal simplex growth assay, we retested 112 identified hits that displayed some specificity (activity against less than 10 strains) for a subset of mutants (P < 10⁻¹⁰) against their corresponding hypomorph interactor, wild-type Mtb, and several other hypomorphs as negative controls. Using a receiver operating characteristic (ROC) curve to assess the ability of the multiplexed assay to predict activity in the orthogonal assay, the area under the ROC curve (AUROC; 0.74) indicated a high true-positive rate in the primary assay with a well-controlled false-positive rate (Fig. 1b). Given the complexity of the primary screen, reassuringly 1,375 (52%) of the 2,664 strong interactions were confirmed in the secondary assay.
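
For readers unfamiliar with the AUROC, the sketch below computes it as the probability that a confirmed interaction receives a higher primary-screen score than an unconfirmed one (ties count half). This is a generic pairwise formulation, not the authors' analysis code; the scores and labels are toy values.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # A positive scoring above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: primary-screen score and whether the hit was confirmed.
primary_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
confirmed = [1, 1, 0, 1, 0, 0]
area = auroc(primary_scores, confirmed)
print(round(area, 3))
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation.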

# Chemical Genomics of Tuberculosis - 3226 compounds x 4 concentrations x 100 strains

2018-08-20

## Description

Raw count data from 3226 compounds x 4 concentrations x 100 strains.

Also included are 10-point dose response curves for rifampin and trimethoprim.

Spike-in plasmids were added as internal controls with lysis solution ("intcon1") and PCR master mix ("intcon2").

## Data structure

One row per well-strain combination. 

## Columns

**id**: Sequencing lane that each sample was run on.

**plate_name**: Identifier of the microwell plate that each sample was on.

**quadrant**: Plate quadrant that each sample was on.

**row**: The row of each sample's well.

**column**: The column of each sample's well.

**plate_source**: The compound source plate for each sample's assay plate.

**compound**: Compound identifier for each sample's well. 

**concentration**: Compound concentration in micromolar.

**strain**: Mnemonic name of gene product knocked down, or spike-in plasmid name.

**count**: Barcode count (exact matches) for each strain in each well.

**primer_plate**: The primer array used for library construction of each sample.

**thermocycler**: Thermocycler used for Illumina library construction.

**overnight_cycle**: Whether each sample was PCR amplified overnight.

**library**: Library into which each sample was pooled.

**untreated**: Indicator of whether each well was an "untreated" or "vehicle" DMSO control.

## Questions?

Email hung {at} broadinstitute.org.

In [5]:
raw_counts_2018=pd.read_csv('2018-08-20_raw-counts/2018-08-20_raw-counts.csv')
raw_counts_2018.head()


Unnamed: 0,id,plate_name,quadrant,row,column,plate_source,compound,concentration,strain,count,primer_plate,thermocycler,overnight_cycle,library,untreated
0,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,BR00073899,1,E,8,CM 34,BRD-K27893489-003-11-5,9.0,H37Rv,76,3,G,no,2,False
1,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,BR00073899,1,M,10,CM 34,BRD-K13086613-001-01-6,10.0,H37Rv,113,3,G,no,2,False
2,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,BR00073899,1,K,8,CM 34,BRD-K88492762-001-01-5,10.0,H37Rv,51,3,G,no,2,False
3,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,BR00073899,1,K,18,CM 34,BRD-K06934249-001-01-8,10.0,H37Rv,116,3,G,no,2,False
4,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,BR00073899,1,G,6,CM 34,BRD-A88358860-001-01-1,10.0,H37Rv,27,3,G,no,2,False


In [6]:
raw_counts_2018.shape

(13140864, 15)
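
Because the table holds one row per well-strain combination, it can be reshaped into a well-by-strain count matrix. The sketch below uses toy rows; which columns are needed to identify a well uniquely in the real data (for example `id` and `quadrant` in addition to those shown) is an assumption to check before applying it.

```python
import pandas as pd

# Toy long-format rows with a subset of the raw-count columns.
long_counts = pd.DataFrame({
    "plate_name": ["P1", "P1", "P1", "P1"],
    "row": ["A", "A", "B", "B"],
    "column": [1, 1, 1, 1],
    "strain": ["H37Rv", "aceE", "H37Rv", "aceE"],
    "count": [76, 40, 113, 55],
})

# One row per well, one column per strain.
wide_counts = long_counts.pivot_table(
    index=["plate_name", "row", "column"],
    columns="strain",
    values="count",
    aggfunc="sum",
)
print(wide_counts)
```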

In [7]:
new_raw_counts_2018=raw_counts_2018.rename(columns={'compound': 'full_name_compound'})
new_raw_counts_2018.head()

Unnamed: 0,id,plate_name,quadrant,row,column,plate_source,full_name_compound,concentration,strain,count,primer_plate,thermocycler,overnight_cycle,library,untreated
0,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,BR00073899,1,E,8,CM 34,BRD-K27893489-003-11-5,9.0,H37Rv,76,3,G,no,2,False
1,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,BR00073899,1,M,10,CM 34,BRD-K13086613-001-01-6,10.0,H37Rv,113,3,G,no,2,False
2,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,BR00073899,1,K,8,CM 34,BRD-K88492762-001-01-5,10.0,H37Rv,51,3,G,no,2,False
3,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,BR00073899,1,K,18,CM 34,BRD-K06934249-001-01-8,10.0,H37Rv,116,3,G,no,2,False
4,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,BR00073899,1,G,6,CM 34,BRD-A88358860-001-01-1,10.0,H37Rv,27,3,G,no,2,False


In [8]:
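# Take the second dash-separated field of each BRD identifier as the stem;
# names without a dash are kept unchanged.
names = new_raw_counts_2018['full_name_compound']
has_dash = names.str.contains('-', regex=False)
compound = names.str.split('-').str[1].where(has_dash, names)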


In [9]:
new_raw_counts_2018.insert(1, "compound_stem", compound)  # place the stem right after the id column

In [10]:
new_raw_counts_2018.head()

Unnamed: 0,id,compound_stem,plate_name,quadrant,row,column,plate_source,full_name_compound,concentration,strain,count,primer_plate,thermocycler,overnight_cycle,library,untreated
0,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,K27893489,BR00073899,1,E,8,CM 34,BRD-K27893489-003-11-5,9.0,H37Rv,76,3,G,no,2,False
1,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,K13086613,BR00073899,1,M,10,CM 34,BRD-K13086613-001-01-6,10.0,H37Rv,113,3,G,no,2,False
2,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,K88492762,BR00073899,1,K,8,CM 34,BRD-K88492762-001-01-5,10.0,H37Rv,51,3,G,no,2,False
3,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,K06934249,BR00073899,1,K,18,CM 34,BRD-K06934249-001-01-8,10.0,H37Rv,116,3,G,no,2,False
4,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,A88358860,BR00073899,1,G,6,CM 34,BRD-A88358860-001-01-1,10.0,H37Rv,27,3,G,no,2,False


In [11]:
new_raw_counts_2018.shape

(13140864, 16)

In [12]:
new_raw_counts_2018[~pd.isnull(new_raw_counts_2018['compound_stem'])].shape

(13140864, 16)

In [15]:
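import pandas as pd

# Index both tables by compound_stem so that DataFrame.join aligns on it.
left = compounds_annotation_2018.set_index('compound_stem')
right = new_raw_counts_2018.set_index('compound_stem')

# join() defaults to how='left': every annotated compound is kept, and
# count rows whose stem has no annotation entry are dropped.
combined_annotation_and_counts = left.join(right)
combined_annotation_and_counts.head()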

compound_stem,SMILES,compound_name,pubchem_cid,kegg_kid,classifier_target,similar_to,id,plate_name,quadrant,row,...,plate_source,full_name_compound,concentration,strain,count,primer_plate,thermocycler,overnight_cycle,library,untreated
A00113255,CC1(C)C[C@H](C2CC[C@]3(C)C(=CC[C@@H]4[C@@]5(C)...,"Oleanolic Acid, Caryophyllin, Astrantiagenin C...",49867939.0,,,,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,AB00017481.4,4,J,...,AB00017481,BRD-A00113255-001-01-8,3.0,H37Rv,160,15,J,yes,4,False
A00113255,CC1(C)C[C@H](C2CC[C@]3(C)C(=CC[C@@H]4[C@@]5(C)...,"Oleanolic Acid, Caryophyllin, Astrantiagenin C...",49867939.0,,,,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,AB00017481.4,3,J,...,AB00017481,BRD-A00113255-001-01-8,1.0,H37Rv,107,15,J,yes,4,False
A00113255,CC1(C)C[C@H](C2CC[C@]3(C)C(=CC[C@@H]4[C@@]5(C)...,"Oleanolic Acid, Caryophyllin, Astrantiagenin C...",49867939.0,,,,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,AB00017481.4,1,I,...,AB00017481,BRD-A00113255-001-01-8,10.0,H37Rv,94,15,J,yes,4,False
A00113255,CC1(C)C[C@H](C2CC[C@]3(C)C(=CC[C@@H]4[C@@]5(C)...,"Oleanolic Acid, Caryophyllin, Astrantiagenin C...",49867939.0,,,,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,AB00017481.4,2,I,...,AB00017481,BRD-A00113255-001-01-8,30.0,H37Rv,97,15,J,yes,4,False
A00113255,CC1(C)C[C@H](C2CC[C@]3(C)C(=CC[C@@H]4[C@@]5(C)...,"Oleanolic Acid, Caryophyllin, Astrantiagenin C...",49867939.0,,,,/home/unix/ejohnson/ejohnson/seq_data/2015-10-...,AB00017481.3,4,J,...,AB00017481,BRD-A00113255-001-01-8,3.0,H37Rv,243,15,I,yes,3,False


In [16]:
combined_annotation_and_counts.to_csv('combined_annotation_and_counts.csv')

In [17]:
combined_annotation_and_counts.shape

(10540272, 21)
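
The joined table has fewer rows than the raw counts (10,540,272 versus 13,140,864), which is what a left join from the annotation table produces when some count rows have no matching `compound_stem`. A hedged way to see which side lacks a match is an outer merge with `indicator=True`; the frames below are toy data with hypothetical stems.

```python
import pandas as pd

# Toy stand-ins for the annotation and raw-count tables.
annotation = pd.DataFrame({
    "compound_stem": ["K1", "K2"],
    "compound_name": ["name_1", "name_2"],
})
counts = pd.DataFrame({
    "compound_stem": ["K1", "K1", "K3"],
    "count": [10, 20, 30],
})

# The _merge column reports whether each row matched on both sides.
merged = counts.merge(annotation, on="compound_stem", how="outer", indicator=True)
status = merged["_merge"].value_counts()
print(status)

# Count rows that a left join from the annotation table would drop.
unannotated = merged.loc[merged["_merge"] == "left_only", "compound_stem"].tolist()
```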

# Chemical Genomics of Tuberculosis - 3226 compounds x 4 concentrations x 100 strains

2018-08-20

## Description

Calculated log fold change and p-values from 3226 compounds x 4 concentrations x 100 strains, taking into account batch effects and negative binomial distribution of counts.

Also included are 10-point dose response curves for rifampin and trimethoprim.

Spike-in plasmids were added as internal controls with lysis solution ("intcon1") and PCR master mix ("intcon2").

## Data structure

One row per compound-concentration-strain combination. 

## Columns

**compound**: Compound identifier for each sample's well. 

**concentration**: Compound concentration in micromolar.

**strain**: Mnemonic name of gene product knocked down, or spike-in plasmid name.

**n_replicates**: Combined number of replicates across sequencing lanes and completely independent replicates.

**log_fold_change**: Strain-matched maximum likelihood estimate of natural log fold change (LFC) of counts in each condition-strain combination compared to untreated DMSO control.

**std_error_lfc**: Standard error of the LFC estimate.

**log2_fold_change**: Base-2 LFC estimate [LFC / log(2)].

**std_error_l2fc**: Standard error of the base-2 LFC estimate.

**z_score**: Wald test Z-score [LFC / std_error].

**p_value**: P-value assuming Z ~ N(0, 1).
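
The derived columns follow from `log_fold_change` and `std_error_lfc` by the bracketed formulas above. The sketch below applies them to one row of the table (LFC = -0.744124, standard error = 0.096776). The two-sided p-value is one common reading of "assuming Z ~ N(0, 1)"; whether the dataset's `p_value` column uses exactly this convention is an assumption, and the function name is illustrative.

```python
import math


def wald_summary(lfc: float, std_error: float):
    """Derive base-2 LFC, its standard error, the Wald Z-score and a p-value."""
    ln2 = math.log(2)
    l2fc = lfc / ln2                 # log2_fold_change = LFC / log(2)
    se_l2fc = std_error / ln2        # std_error_l2fc scales the same way
    z = lfc / std_error              # z_score = LFC / std_error
    # Two-sided tail probability of a standard normal (assumed convention).
    p = math.erfc(abs(z) / math.sqrt(2))
    return l2fc, se_l2fc, z, p


l2fc, se_l2fc, z, p = wald_summary(-0.744124, 0.096776)
print(l2fc, se_l2fc, z, p)
```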

## Questions?

Email hung {at} broadinstitute.org.

In [18]:
import pandas as pd
value_data_2018=pd.read_csv('2018-08-20_lfc-p-values/2018-08-20_lfc-p-values.csv')
new_value_data_2018=value_data_2018.rename(columns={'compound': 'full_name_compound'})
new_value_data_2018.head()

Unnamed: 0,full_name_compound,concentration,strain,n_replicates,log_fold_change,std_error_lfc,log2_fold_change,std_error_l2fc,z_score,p_value
0,BRD-K27893489-003-11-5,1.0,aceE,8,-0.744124,0.096776,-1.073544,0.139618,-7.689167,1.63859e-14
1,BRD-K27893489-003-11-5,3.0,aceE,8,-0.533935,0.096395,-0.770305,0.139068,-5.539061,3.127667e-08
2,BRD-K27893489-003-11-5,9.0,aceE,8,-0.899398,0.097281,-1.297557,0.140347,-9.245345,2.889549e-20
3,BRD-K27893489-003-11-5,30.0,aceE,8,-1.560373,0.099562,-2.251143,0.143638,-15.672326,1.2540030000000001e-54
4,BRD-K13086613-001-01-6,1.0,aceE,8,-0.45824,0.096279,-0.661101,0.138901,-4.759517,1.971194e-06


In [19]:
new_value_data_2018.shape

(1330590, 10)

In [20]:
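# Take the second dash-separated field of each BRD identifier as the stem;
# names without a dash are kept unchanged.
names = new_value_data_2018['full_name_compound']
has_dash = names.str.contains('-', regex=False)
compound = names.str.split('-').str[1].where(has_dash, names)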


In [21]:
new_value_data_2018.insert(1, "compound_stem", compound)  # place the stem right after full_name_compound

In [22]:
new_value_data_2018.head()

Unnamed: 0,full_name_compound,compound_stem,concentration,strain,n_replicates,log_fold_change,std_error_lfc,log2_fold_change,std_error_l2fc,z_score,p_value
0,BRD-K27893489-003-11-5,K27893489,1.0,aceE,8,-0.744124,0.096776,-1.073544,0.139618,-7.689167,1.63859e-14
1,BRD-K27893489-003-11-5,K27893489,3.0,aceE,8,-0.533935,0.096395,-0.770305,0.139068,-5.539061,3.127667e-08
2,BRD-K27893489-003-11-5,K27893489,9.0,aceE,8,-0.899398,0.097281,-1.297557,0.140347,-9.245345,2.889549e-20
3,BRD-K27893489-003-11-5,K27893489,30.0,aceE,8,-1.560373,0.099562,-2.251143,0.143638,-15.672326,1.2540030000000001e-54
4,BRD-K13086613-001-01-6,K13086613,1.0,aceE,8,-0.45824,0.096279,-0.661101,0.138901,-4.759517,1.971194e-06


In [23]:
new_value_data_2018.shape

(1330590, 11)

In [24]:
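import pandas as pd

# Index both tables by compound_stem so that DataFrame.join aligns on it.
left = compounds_annotation_2018.set_index('compound_stem')
right = new_value_data_2018.set_index('compound_stem')

# join() defaults to how='left': every annotated compound is kept, and
# LFC rows whose stem has no annotation entry are dropped.
combined_annotation_and_value = left.join(right)
combined_annotation_and_value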

compound_stem,SMILES,compound_name,pubchem_cid,kegg_kid,classifier_target,similar_to,full_name_compound,concentration,strain,n_replicates,log_fold_change,std_error_lfc,log2_fold_change,std_error_l2fc,z_score,p_value
A00113255,CC1(C)C[C@H](C2CC[C@]3(C)C(=CC[C@@H]4[C@@]5(C)...,"Oleanolic Acid, Caryophyllin, Astrantiagenin C...",49867939.0,,,,BRD-A00113255-001-01-8,1.0,aceE,8,0.213821,0.095516,0.308478,0.137801,2.238578,2.520801e-02
A00113255,CC1(C)C[C@H](C2CC[C@]3(C)C(=CC[C@@H]4[C@@]5(C)...,"Oleanolic Acid, Caryophyllin, Astrantiagenin C...",49867939.0,,,,BRD-A00113255-001-01-8,30.0,aceE,8,0.406739,0.095449,0.586800,0.137704,4.261321,2.053183e-05
A00113255,CC1(C)C[C@H](C2CC[C@]3(C)C(=CC[C@@H]4[C@@]5(C)...,"Oleanolic Acid, Caryophyllin, Astrantiagenin C...",49867939.0,,,,BRD-A00113255-001-01-8,3.0,aceE,8,0.515958,0.095313,0.744371,0.137507,5.413314,6.348054e-08
A00113255,CC1(C)C[C@H](C2CC[C@]3(C)C(=CC[C@@H]4[C@@]5(C)...,"Oleanolic Acid, Caryophyllin, Astrantiagenin C...",49867939.0,,,,BRD-A00113255-001-01-8,10.0,aceE,8,0.332635,0.095504,0.479890,0.137783,3.482948,4.983102e-04
A00113255,CC1(C)C[C@H](C2CC[C@]3(C)C(=CC[C@@H]4[C@@]5(C)...,"Oleanolic Acid, Caryophyllin, Astrantiagenin C...",49867939.0,,,,BRD-A00113255-001-01-8,1.0,adoK,8,0.143765,0.077977,0.207409,0.112497,1.843695,6.526093e-02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
U78772829,C1[C@@H]([C@H]([C@@H]([C@H]([C@@H]1NC(=O)[C@H]...,amikacin hydrate,16218899.0,,30S ribosome 16S rRNA decoding,,BRD-U78772829-000-01-3,10.0,trpG,8,0.296634,0.103410,0.427952,0.149189,2.868526,4.133597e-03
U78772829,C1[C@@H]([C@H]([C@@H]([C@H]([C@@H]1NC(=O)[C@H]...,amikacin hydrate,16218899.0,,30S ribosome 16S rRNA decoding,,BRD-U78772829-000-01-3,1.0,trpS,8,0.501194,0.150708,0.723070,0.217426,3.325596,8.858710e-04
U78772829,C1[C@@H]([C@H]([C@@H]([C@H]([C@@H]1NC(=O)[C@H]...,amikacin hydrate,16218899.0,,30S ribosome 16S rRNA decoding,,BRD-U78772829-000-01-3,30.0,trpS,8,0.450464,0.151038,0.649883,0.217901,2.982462,2.867158e-03
U78772829,C1[C@@H]([C@H]([C@@H]([C@H]([C@@H]1NC(=O)[C@H]...,amikacin hydrate,16218899.0,,30S ribosome 16S rRNA decoding,,BRD-U78772829-000-01-3,3.0,trpS,8,0.504390,0.150699,0.727681,0.217412,3.347014,8.202603e-04


In [25]:
combined_annotation_and_value.to_csv('combined_annotation_and_value.csv')

In [26]:
combined_annotation_and_value.shape

(1310904, 16)

In [None]:
import pandas as pd
# Reload the two joined tables written by the earlier cells.
combined_annotation_and_counts = pd.read_csv('combined_annotation_and_counts.csv')
combined_annotation_and_value = pd.read_csv('combined_annotation_and_value.csv')

# set_index takes all key columns in one list; a second positional
# argument is not treated as a second key.
right = combined_annotation_and_counts.set_index(['full_name_compound', 'compound_stem'])
left = combined_annotation_and_value.set_index(['full_name_compound', 'compound_stem'])


In [None]:
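# Both tables share the annotation, concentration and strain columns, so suffixes are needed.
combined_all_info = left.join(right, lsuffix='_value', rsuffix='_counts')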

In [None]:
combined_all_info.head()
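
Joining on the compound alone pairs every count row of a compound with every LFC row of that compound, which can multiply the row count considerably. The sketch below contrasts that with a merge on compound, concentration and strain, which attaches one LFC row to each well. It uses toy frames with a hypothetical compound label, and whether the composite key is the intended join for this analysis is an assumption.

```python
import pandas as pd

# Toy raw counts: two wells at 1.0 µM and one at 3.0 µM for one strain.
counts = pd.DataFrame({
    "full_name_compound": ["c1", "c1", "c1"],
    "concentration": [1.0, 1.0, 3.0],
    "strain": ["aceE", "aceE", "aceE"],
    "count": [100, 120, 80],
})
# Toy LFC table: one row per compound-concentration-strain combination.
values = pd.DataFrame({
    "full_name_compound": ["c1", "c1"],
    "concentration": [1.0, 3.0],
    "strain": ["aceE", "aceE"],
    "log_fold_change": [-0.7, -0.5],
})

keys = ["full_name_compound", "concentration", "strain"]
# validate raises if the LFC table has duplicate key combinations.
per_well = counts.merge(values, on=keys, how="left", validate="many_to_one")

# Joining on the compound only gives every count-by-LFC pairing.
by_compound = counts.merge(values, on="full_name_compound",
                           suffixes=("_counts", "_value"))
print(len(per_well), len(by_compound))
```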