# Tox21 Assays

Meta-data for the [Tox21](https://www.epa.gov/chemical-research/toxicology-testing-21st-century-tox21) assays obtained from PubChem _via_ the [PUG API](https://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html).

Note that this notebook is based on work by Patricia Bento.

In [1]:
%run setup.py

### Config

Note that the information returned by the 'summary' and 'description' URLs overlaps, so some data could be obtained from either. The overlap does not seem to be complete, however.

In particular, only the 'summary' seems to contain the 'Method', _i.e._ whether the assay is 'confirmatory' or 'summary' _etc._, and only the 'description' provides information on the various result types for an assay (_a.k.a._ the 'endpoints').

In [2]:
# PubChem PUG URL for all AIDs for a particular datasource...

aids_url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/sourceall/{source}/aids/JSON'

source = 'Tox21' # Source name for Tox21 data

# URL for assay summary (i.e. assay metadata) for an AID...

tox21_summary_url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/{aid}/summary/JSON"

# URL for assay description (i.e. assay metadata) for an AID...

tox21_description_url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/{aid}/description/JSON'
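
As a quick offline check of the templates above, the following minimal sketch fills them in for one AID without contacting PubChem. The helper name `build_urls` is hypothetical (it is not part of this notebook), and URL-quoting the source name is an added precaution rather than something the original code does.

```python
from urllib.parse import quote

# Same templates as in the cell above.
AIDS_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/sourceall/{source}/aids/JSON"
SUMMARY_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/{aid}/summary/JSON"
DESCRIPTION_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/{aid}/description/JSON"


def build_urls(aid, source="Tox21"):
    """Hypothetical helper: return the three PUG URLs used in this notebook."""
    return {
        # quote() is an assumption: it guards against source names containing spaces.
        "aids": AIDS_URL.format(source=quote(source)),
        "summary": SUMMARY_URL.format(aid=aid),
        "description": DESCRIPTION_URL.format(aid=aid),
    }


urls = build_urls(743219)
for kind, url in urls.items():
    print(kind, url)
```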

In [3]:
# Directory for reading and writing data files...

data_dir = 'data'

### Initialisation

In [4]:
if 'logger' not in locals(): logger = make_logger.run(__name__)

## Get Tox21 assays

Query PubChem PUG API for meta-data on Tox21 assays.

In [5]:
# Get list of AIDs associated with Tox21...

aids = requests.get(aids_url.format(source=source)).json()['IdentifierList']['AID']

len(aids)

150

In [6]:
# Get meta-data for the assays...

def f(aid):
        
    assay = requests.get(tox21_summary_url.format(aid=aid)).json()['AssaySummaries']['AssaySummary'][0]

    assay_name, method = [assay[x] for x in ('Name', 'Method')]

    target, gene_id = [assay['Target'][0][x] for x in ('Name', 'GI')] if 'Target' in assay else ('', '')

    protocol = assay['Protocol'][0] if assay_name.endswith(': Summary') else ''

    return aid, assay_name, method, target, gene_id, protocol

tox21_assays_df = pd.DataFrame([f(x) for x in aids], columns=['AID', 'assay_name', 'method', 'target','gene_id', 'protocol'])

tox21_assays_df.shape

(150, 6)
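
The field extraction in the cell above can be exercised without a network call by separating parsing from fetching. The sketch below is an assumption-laden illustration: `parse_summary` is a hypothetical helper, and the two records are hand-trimmed stand-ins containing only the keys the notebook's code accesses ('Name', 'Method', 'Target', 'Protocol'), not real PubChem responses.

```python
def parse_summary(aid, assay):
    """Hypothetical helper: flatten one 'AssaySummary' record into a tuple."""
    assay_name = assay["Name"]
    method = assay["Method"]
    # 'Target' is absent for some assays; fall back to empty strings.
    first_target = (assay.get("Target") or [{}])[0]
    target = first_target.get("Name", "")
    gene_id = first_target.get("GI", "")
    # Only summary assays carry the cross-reference to other AIDs in 'Protocol'.
    protocol = (assay.get("Protocol") or [""])[0] if assay_name.endswith(": Summary") else ""
    return aid, assay_name, method, target, gene_id, protocol


# Hand-made records (assumed shape), using values shown elsewhere in this notebook.
summary_record = {
    "Name": "qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway: Summary",
    "Method": "summary",
    "Target": [{"Name": "nuclear factor erythroid 2-related factor 2 isoform 1 [Homo sapiens]", "GI": 20149576}],
    "Protocol": ["Please refer to other AIDs 743202, 743203, 720687, 720685, 720678 and 720681, for detailed assay protocols."],
}
no_target_record = {
    "Name": "qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HepG2 cells",
    "Method": "confirmatory",
    "Protocol": ["(protocol text not used for non-summary assays)"],
}

row_summary = parse_summary(743219, summary_record)
row_no_target = parse_summary(720687, no_target_record)
print(row_summary)
print(row_no_target)
```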

In [7]:
tox21_assays_df.head(1)

Unnamed: 0,AID,assay_name,method,target,gene_id,protocol
0,720687,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HepG2 cells,confirmatory,,,


### Assay name standardisation

As an example, consider the assays associated with summary AID 743219 (based on the contents of the 'protocol' column): 743202, 743203, 720687, 720685, 720678 and 720681...

In [8]:
tox21_assays_df.query("AID == 743219")['protocol']

82    Please refer to other AIDs 743202, 743203, 720687, 720685, 720678 and 720681, for detailed assay protocols.
Name: protocol, dtype: object

In [9]:
example_AIDs = [743219, 743202, 743203, 720687, 720685, 720678, 720681]

tox21_assays_df[tox21_assays_df['AID'].isin(example_AIDs)].sort_values('AID')

Unnamed: 0,AID,assay_name,method,target,gene_id,protocol
140,720678,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HEK293 cells,confirmatory,,,
137,720681,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HEK293 cell free culture,confirmatory,,,
133,720685,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HepG2 cell free culture,confirmatory,,,
0,720687,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HepG2 cells,confirmatory,,,
92,743202,qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway,confirmatory,nuclear factor erythroid 2-related factor 2 isoform 1 [Homo sapiens],20149576.0,
91,743203,qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway - cell viability counter screen,confirmatory,,,
82,743219,qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway: Summary,summary,nuclear factor erythroid 2-related factor 2 isoform 1 [Homo sapiens],20149576.0,"Please refer to other AIDs 743202, 743203, 720687, 720685, 720678 and 720681, for detailed assay protocols."


The AIDs related to summary AID 743219 fall into three groups:

* qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway - AIDs 743202 (activity assay), 743203 (cell viability counter screen) and 743219 itself (summary)

* qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HepG2 cells and in HepG2 cell free culture - AIDs 720687 and 720685, respectively

* qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HEK293 cells and in HEK293 cell free culture - AIDs 720678 and 720681, respectively

For the EU-ToxRisk project, the first group is of interest: it consists of the main activity assay, its cell-viability counter screen and a 'pseudo-assay' summarising the results of the others.
Inspection of the assay names suggests these can be grouped using the 'stem' of the assay name. This allows an association to be made between a summary assay and the assays it summarises.

In [10]:
# Standardise assay_name...

tox21_assays_df['assay_name_original'] = tox21_assays_df['assay_name']

tox21_assays_df['assay_name'] = tox21_assays_df['assay_name_original'].str.replace(r'\s*(- cell viability(?: counter screen)?|: Summary)$', '', regex=True).str.strip()
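
The suffix-stripping pattern can be checked in isolation on a few names taken from the example table. This is a minimal sketch using the standard `re` module rather than pandas; the function name `stem` is hypothetical.

```python
import re

# Same pattern as in the cell above: strip a trailing viability or summary suffix.
SUFFIX = re.compile(r"\s*(- cell viability(?: counter screen)?|: Summary)$")


def stem(name):
    """Hypothetical helper: return the assay name with its suffix removed."""
    return SUFFIX.sub("", name).strip()


ARE = "qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway"
names = [
    ARE,
    ARE + " - cell viability counter screen",
    ARE + ": Summary",
    "qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HepG2 cells",
]

stems = [stem(n) for n in names]
for original, standardised in zip(names, stems):
    print(repr(original), "->", repr(standardised))
```

The three ARE names collapse to one stem, while the autofluorescence name is left unchanged because it carries neither suffix.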

In [11]:
tox21_assays_df[tox21_assays_df['AID'].isin(example_AIDs)].sort_values('AID')

Unnamed: 0,AID,assay_name,method,target,gene_id,protocol,assay_name_original
140,720678,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HEK293 cells,confirmatory,,,,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HEK293 cells
137,720681,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HEK293 cell free culture,confirmatory,,,,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HEK293 cell free culture
133,720685,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HepG2 cell free culture,confirmatory,,,,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HepG2 cell free culture
0,720687,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HepG2 cells,confirmatory,,,,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HepG2 cells
92,743202,qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway,confirmatory,nuclear factor erythroid 2-related factor 2 isoform 1 [Homo sapiens],20149576.0,,qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway
91,743203,qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway,confirmatory,,,,qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway - cell viability counter screen
82,743219,qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway,summary,nuclear factor erythroid 2-related factor 2 isoform 1 [Homo sapiens],20149576.0,"Please refer to other AIDs 743202, 743203, 720687, 720685, 720678 and 720681, for detailed assay protocols.",qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway: Summary


In [12]:
# Get table of standardised assay_names and the AID of the corresponding summary assay...

tox21_summary_aid_df = tox21_assays_df.query("method == 'summary'")[['assay_name', 'AID']].rename(columns={'AID': 'summary_AID'})

tox21_summary_aid_df['summary_AID'] = tox21_summary_aid_df['summary_AID'].astype(str) # astype() returns a new Series, so assign it back; strings avoid conversion to float when the merge introduces 'missing' values

tox21_summary_aid_df.shape

(35, 2)

In [13]:
tox21_summary_aid_df.head()

Unnamed: 0,assay_name,summary_AID
3,qHTS assay to identify small molecule agonists of H2AX,1224896
4,qHTS assay to identify small molecule agonists of the thyroid stimulating hormone receptor (TSHR) signaling pathway,1224895
5,qHTS assay to identify small molecule agonists of the hypoxia (HIF-1) signaling pathway,1224894
6,qHTS assay to identify small molecule antagonists of the constitutive androstane receptor (CAR) signaling pathway,1224893
7,qHTS assay to identify small molecule agonists of the constitutive androstane receptor (CAR) signaling pathway,1224892


In [14]:
# Add column showing corresponding summary assay for each assay (where appropriate) to master table of Tox21 assay meta-data...

tox21_assays_df_0 = tox21_assays_df.copy()

tox21_assays_df = (
    tox21_assays_df.merge( tox21_summary_aid_df, on='assay_name', how='left')
    .sort_values(['assay_name', 'AID']) 
    .reset_index(drop=True)
)

tox21_assays_df['summary_AID'] = tox21_assays_df['summary_AID'].fillna('')

tox21_assays_df.shape

(150, 8)
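
For readers who want to see the logic of the left merge without pandas, the association between each assay and its summary AID can be sketched with a plain dictionary keyed on the standardised name. The records below are a hand-picked subset of the example AIDs, with the ARE name abbreviated to a placeholder; variable names are hypothetical.

```python
# (AID, standardised assay_name, method); "ARE pathway" abbreviates the full name.
records = [
    (743202, "ARE pathway", "confirmatory"),
    (743203, "ARE pathway", "confirmatory"),
    (743219, "ARE pathway", "summary"),
    (720687, "auto fluorescence at 460 nm (blue) in HepG2 cells", "confirmatory"),
]

# Right-hand side of the join: standardised name -> summary AID (kept as a string).
summary_by_name = {name: str(aid) for aid, name, method in records if method == "summary"}

# Left join: every assay is kept; assays without a summary get an empty string.
linked = [(aid, name, method, summary_by_name.get(name, "")) for aid, name, method in records]

for row in sorted(linked, key=lambda r: (r[1], r[0])):
    print(row)
```

Keeping the summary AID as a string means unmatched assays can hold `''` without forcing the whole column to a floating-point type.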

In [15]:
tox21_assays_df.head(1)

Unnamed: 0,AID,assay_name,method,target,gene_id,protocol,assay_name_original,summary_AID
0,1224869,A CellTox Green Cytotoxicity Assay to monitor cytotoxicity in HEK293 cells - 0 hour,confirmatory,,,,A CellTox Green Cytotoxicity Assay to monitor cytotoxicity in HEK293 cells - 0 hour,


In [16]:
tox21_assays_df[tox21_assays_df['AID'].isin(example_AIDs)].sort_values('AID')

Unnamed: 0,AID,assay_name,method,target,gene_id,protocol,assay_name_original,summary_AID
139,720678,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HEK293 cells,confirmatory,,,,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HEK293 cells,
138,720681,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HEK293 cell free culture,confirmatory,,,,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HEK293 cell free culture,
140,720685,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HepG2 cell free culture,confirmatory,,,,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HepG2 cell free culture,
141,720687,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HepG2 cells,confirmatory,,,,qHTS assay to test for compound auto fluorescence at 460 nm (blue) in HepG2 cells,
30,743202,qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway,confirmatory,nuclear factor erythroid 2-related factor 2 isoform 1 [Homo sapiens],20149576.0,,qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway,743219.0
31,743203,qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway,confirmatory,,,,qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway - cell viability counter screen,743219.0
32,743219,qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway,summary,nuclear factor erythroid 2-related factor 2 isoform 1 [Homo sapiens],20149576.0,"Please refer to other AIDs 743202, 743203, 720687, 720685, 720678 and 720681, for detailed assay protocols.",qHTS assay for small molecule agonists of the antioxidant response element (ARE) signaling pathway: Summary,743219.0


In [17]:
# tox21_assays_df.to_pickle(os.path.join(data_dir, 'tox21_assays.pkl'))

In [18]:
# tox21_assays_df = pd.read_pickle(os.path.join(data_dir, 'tox21_assays.pkl'))

# tox21_assays_df.shape

### Summary assays

Subset the assay data, as the summary assays are of primary interest going forward...

In [19]:
tox21_summary_assays_df = (
    tox21_assays_df
        .query("method == 'summary'")
        .sort_values('AID')
        .reset_index(drop=True)
        [['AID', 'assay_name', 'target', 'gene_id']]
    )

tox21_summary_assays_df.shape

(35, 4)

In [20]:
tox21_summary_assays_df.to_pickle(os.path.join(data_dir, 'tox21_summary_assays.pkl'))

In [21]:
# tox21_summary_assays_df = pd.read_pickle(os.path.join(data_dir, 'tox21_summary_assays.pkl'))

# tox21_summary_assays_df.shape

<a name="assay_endpoints"></a>

### Assay endpoints

Investigate the names of the various assay result types, hereafter called 'endpoints', for the summary assays.

Note the use of the 'description' URL here.

In [22]:
# Function to extract result names from description record for AID...

def f(aid):
    
    results = requests.get(tox21_description_url.format(aid=aid)).json()['PC_AssayContainer'][0]['assay']['descr']['results']

    return [(aid, x['name']) for x in results]

tox21_result_type_df = pd.DataFrame([z for y in [f(x) for x in tox21_summary_assays_df['AID']] for z in y], columns=['AID', 'name'])

tox21_result_type_df.shape

(429, 2)
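
The path into the 'description' record can be illustrated offline with a hand-made container. The nesting below mirrors only the keys the cell above indexes (`PC_AssayContainer`, `assay`, `descr`, `results`, `name`); everything else about the real response is omitted, and the helper name `result_names` is hypothetical.

```python
def result_names(aid, container):
    """Hypothetical helper: list (AID, result name) pairs from a description record."""
    results = container["PC_AssayContainer"][0]["assay"]["descr"]["results"]
    return [(aid, r["name"]) for r in results]


# Hand-trimmed stand-in for a description response (assumed shape),
# using the first few endpoint names shown for AID 720516 in this notebook.
mock_description = {
    "PC_AssayContainer": [
        {
            "assay": {
                "descr": {
                    "results": [
                        {"name": "Activity Summary"},
                        {"name": "ATAD5 Activity"},
                        {"name": "ATAD5 Potency (uM)"},
                    ]
                }
            }
        }
    ]
}

pairs = result_names(720516, mock_description)
print(pairs)
```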

In [23]:
# Rename result 'name' column to something more informative...

tox21_result_type_df.rename(columns={'name': 'endpoint'}, inplace=True)

In [24]:
tox21_result_type_df.head()

Unnamed: 0,AID,endpoint
0,720516,Activity Summary
1,720516,ATAD5 Activity
2,720516,ATAD5 Potency (uM)
3,720516,ATAD5 Efficacy (%)
4,720516,Viability Activity


In [25]:
# Count endpoint occurrences...

tox21_result_type_df['endpoint'].value_counts().to_frame('count').reset_index().rename(columns={'index': 'endpoint'})

Unnamed: 0,endpoint,count
0,Activity Summary,35
1,Viability Efficacy (%),30
2,Sample Source,30
3,Viability Potency (uM),30
4,Viability Activity,30
5,Ratio Activity,25
6,Ratio Efficacy (%),25
7,Ratio Potency (uM),25
8,530 nm Activity,22
9,460 nm Potency (uM),22


In [26]:
# Show distribution of endpoints across assays...

df = tox21_result_type_df.copy()

df['tick'] = '\u2713'

pd.pivot_table(df, index=['AID'], columns=['endpoint'], values=['tick'], aggfunc='first').fillna('')

AID,460 nm Activity,460 nm Efficacy (%),460 nm Potency (uM),530 nm Activity,530 nm Efficacy (%),530 nm Potency (uM),535 nm Activity,535 nm Efficacy (%),535 nm Potency (uM),590 nm Activity,590 nm Efficacy (%),590 nm Potency (uM),615 nm Activity,615 nm Efficacy (%),615 nm Potency (uM),620 nm Activity,620 nm Efficacy (%),620 nm Potency (uM),665 nm Activity,665 nm Efficacy (%),665 nm Potency (uM),AR Activity,AR Efficacy (%),AR Potency (uM),ATAD5 Activity,ATAD5 Efficacy (%),ATAD5 Potency (uM),Activity Summary,Agonist Activity,Agonist Efficacy (%),Agonist Potency (uM),AhR Activity,AhR Efficacy (%),AhR Potency (uM),Antagonist Activity,Antagonist Efficacy (%),Antagonist Potency (uM),Blue (46 nm) auto fluorescence outcome,Blue (460 nm) auto fluorescence outcome,ER Activity,ER Efficacy (%),ER Potency (uM),Ratio Activity,Ratio Efficacy (%),Ratio Potency (uM),Sample Source,Supplier,TR Activity,TR Efficacy (%),TR Potency (uM),Viability Activity,Viability Efficacy (%),Viability Potency (uM)
720516,,,,,,,,,,,,,,,,,,,,,,,,,✓,✓,✓,✓,,,,,,,,,,,,,,,,,,✓,,,,,✓,✓,✓
720552,✓,✓,✓,✓,✓,✓,,,,,,,,,,,,,,,,,,,,,,✓,,,,,,,,,,,,,,,✓,✓,✓,✓,,,,,✓,✓,✓
720637,,,,,,,✓,✓,✓,✓,✓,✓,,,,,,,,,,,,,,,,✓,,,,,,,,,,,,,,,✓,✓,✓,✓,,,,,✓,✓,✓
720719,✓,✓,✓,✓,✓,✓,,,,,,,,,,,,,,,,,,,,,,✓,,,,,,,,,,,✓,,,,✓,✓,✓,✓,,,,,,,
720725,✓,✓,✓,✓,✓,✓,,,,,,,,,,,,,,,,,,,,,,✓,,,,,,,,,,,,,,,✓,✓,✓,✓,,,,,✓,✓,✓
743053,✓,✓,✓,✓,✓,✓,,,,,,,,,,,,,,,,,,,,,,✓,,,,,,,,,,,✓,,,,✓,✓,✓,✓,,,,,,,
743054,,,,,,,,,,,,,,,,,,,,,,✓,✓,✓,,,,✓,,,,,,,,,,,,,,,,,,✓,,,,,✓,✓,✓
743063,✓,✓,✓,✓,✓,✓,,,,,,,,,,,,,,,,,,,,,,✓,,,,,,,,,,,,,,,✓,✓,✓,✓,,,,,✓,✓,✓
743067,,,,,,,,,,,,,,,,,,,,,,,,,,,,✓,,,,,,,,,,,,,,,,,,✓,,✓,✓,✓,✓,✓,✓
743077,✓,✓,✓,✓,✓,✓,,,,,,,,,,,,,,,,,,,,,,✓,,,,,,,,,,,✓,,,,✓,✓,✓,✓,,,,,,,


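There are too many distinct result types (_a.k.a._ 'endpoints') to compare across assays, largely because many names are prefixed with an assay-specific word such as the target or readout (_e.g._ 'ATAD5 Activity', 'Ratio Activity').

Therefore, we standardise the endpoint names by removing the first word, except where the endpoint name contains any of 'nm', 'Viability', 'Source' or 'Supplier'...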

In [27]:
tox21_result_type_df['endpoint_original'] = tox21_result_type_df['endpoint']

tox21_result_type_df['endpoint'] = tox21_result_type_df['endpoint_original'].apply(lambda x: re.sub(r'^\w+', '', x).strip() if not re.search(r"nm|Viability|Source|Supplier", x) else x).reset_index(drop=True)

tox21_result_type_df.shape

(429, 3)
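
The standardisation rule above can be tried on a handful of endpoint names from the earlier output. This is a minimal sketch using only the standard library; `standardise` is a hypothetical name, and the sample list is a small hand-picked subset rather than the full set of 429 rows.

```python
import re
from collections import Counter

# Names containing any of these are kept as they are.
KEEP = re.compile(r"nm|Viability|Source|Supplier")


def standardise(endpoint):
    """Hypothetical helper: drop the leading word unless the name is protected."""
    if KEEP.search(endpoint):
        return endpoint
    return re.sub(r"^\w+", "", endpoint).strip()


samples = [
    "Activity Summary",
    "ATAD5 Activity",
    "Ratio Activity",
    "Ratio Potency (uM)",
    "Viability Activity",
    "460 nm Potency (uM)",
    "Sample Source",
]

standardised = [standardise(s) for s in samples]
counts = Counter(standardised)
for name, n in counts.most_common():
    print(n, name)
```

Note that the rule also shortens 'Activity Summary' to 'Summary', which matches the output shown in the next cell.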

In [28]:
tox21_result_type_df.head(10)

Unnamed: 0,AID,endpoint,endpoint_original
0,720516,Summary,Activity Summary
1,720516,Activity,ATAD5 Activity
2,720516,Potency (uM),ATAD5 Potency (uM)
3,720516,Efficacy (%),ATAD5 Efficacy (%)
4,720516,Viability Activity,Viability Activity
5,720516,Viability Potency (uM),Viability Potency (uM)
6,720516,Viability Efficacy (%),Viability Efficacy (%)
7,720516,Sample Source,Sample Source
8,720552,Summary,Activity Summary
9,720552,Activity,Ratio Activity


In [29]:
tox21_result_type_df['endpoint'].value_counts().to_frame('count').reset_index().rename(columns={'index': 'endpoint'})

Unnamed: 0,endpoint,count
0,Activity,35
1,Potency (uM),35
2,Summary,35
3,Efficacy (%),35
4,Viability Efficacy (%),30
5,Sample Source,30
6,Viability Activity,30
7,Viability Potency (uM),30
8,460 nm Potency (uM),22
9,460 nm Efficacy (%),22


In [30]:
# Inspect 'lost' result types...

lost_df = (
    tox21_result_type_df[tox21_result_type_df['endpoint'] != tox21_result_type_df['endpoint_original']][['endpoint', 'endpoint_original']]
        .drop_duplicates()
        .sort_values(['endpoint', 'endpoint_original'])
        .reset_index(drop=True)
)

lost_df.shape

(25, 2)

In [31]:
lost_df

Unnamed: 0,endpoint,endpoint_original
0,Activity,AR Activity
1,Activity,ATAD5 Activity
2,Activity,Agonist Activity
3,Activity,AhR Activity
4,Activity,Antagonist Activity
5,Activity,ER Activity
6,Activity,Ratio Activity
7,Activity,TR Activity
8,Efficacy (%),AR Efficacy (%)
9,Efficacy (%),ATAD5 Efficacy (%)
