# Dataset report: Zhu rat

## _Oral acute toxicity_


> Zhu, Hao, et al. “Quantitative structure-activity relationship modeling of rat acute toxicity by oral exposure.” Chemical Research in Toxicology 22.12 (2009): 1913-1921.

In [34]:
import altair as alt
alt.data_transformers.disable_max_rows()
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

import cytoxnet.dataprep.io

In [35]:
# Load the dataset and summarize its numeric columns
dataset = cytoxnet.dataprep.io.load_zhu_rat()
dataset.describe()

| | LD50 |
|---|---|
| count | 7342.0 |
| mean | 2.542693 |
| std | 0.958225 |
| min | -0.343 |
| 25% | 1.85425 |
| 50% | 2.367 |
| 75% | 3.03275 |
| max | 10.207 |


In [36]:
print('Number of unique molecules: ', len(dataset))

Number of unique molecules:  7342
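
Note that `len(dataset)` counts rows, so the figure above equals the number of unique molecules only if no structure appears twice. A minimal, dependency-free sketch of that check follows. The helper `duplicate_report` and the toy SMILES list are hypothetical and are not part of `cytoxnet`; on the real data the input would be the `smiles` column.

```python
from collections import Counter


def duplicate_report(smiles_list):
    """Return (n_rows, n_unique, duplicates) for a list of SMILES strings.

    Hypothetical helper for illustration only. It compares the strings
    literally, so two different SMILES spellings of the same molecule
    are NOT detected as duplicates.
    """
    counts = Counter(smiles_list)
    duplicates = {smi: n for smi, n in counts.items() if n > 1}
    return len(smiles_list), len(counts), duplicates


# Toy, made-up input (not taken from the Zhu rat dataset)
toy_smiles = ["CCO", "c1ccccc1", "CCO", "CC(=O)O"]
n_rows, n_unique, duplicates = duplicate_report(toy_smiles)
print(n_rows, n_unique, duplicates)
```

With pandas, the same literal comparison is available as `dataset['smiles'].nunique()`. Catching different spellings of one molecule would additionally require canonicalizing each SMILES string first, for example with RDKit.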


### Targets present

#### <span style='color:blue'>__The range of targets is quite large (units of mg/L)__</span>

In [37]:
dataset.describe().loc[['min', 'max']]

| | LD50 |
|---|---|
| min | -0.343 |
| max | 10.207 |
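
To put this range in context, the extremes can be expressed relative to the rest of the distribution. The sketch below uses only the summary statistics printed by `describe()` earlier in this report. The variable names are illustrative, and the 1.5 × IQR fences are a common rule of thumb rather than anything prescribed by the dataset.

```python
# Summary statistics copied from the describe() output in this report
ld50_mean, ld50_std = 2.542693, 0.958225
ld50_q1, ld50_q3 = 1.85425, 3.03275
ld50_min, ld50_max = -0.343, 10.207


def z_score(value, mean, std):
    """Distance of a value from the mean, in standard deviations."""
    return (value - mean) / std


z_min = z_score(ld50_min, ld50_mean, ld50_std)
z_max = z_score(ld50_max, ld50_mean, ld50_std)

# Tukey-style fences: points beyond them are conventionally flagged as outliers
iqr = ld50_q3 - ld50_q1
lower_fence = ld50_q1 - 1.5 * iqr
upper_fence = ld50_q3 + 1.5 * iqr

print(f"min is {z_min:.2f} std from the mean, max is {z_max:.2f}")
print(f"min below lower fence: {ld50_min < lower_fence}")
print(f"max above upper fence: {ld50_max > upper_fence}")
```

The maximum sits roughly 8 standard deviations above the mean, while the minimum is only about 3 below it, so the upper tail is far longer than the lower one.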


In [38]:
alt.Chart(dataset).mark_area(
    opacity=0.7,
    interpolate='step'
).encode(
    alt.X('LD50:Q', bin=alt.Bin(maxbins=100)),
    alt.Y('count()', stack=None)
)

#### <span style='color:blue'>__The dataset is heavily imbalanced towards the toxic side, but it doesn't have any incredibly toxic examples__</span>

### Molecule space

In [20]:
!pip install --quiet umap-learn hdbscan



In [39]:
import rdkit.Chem.AllChem
import umap

Set the descriptors to use for mapping

In [40]:
# Morgan fingerprint (radius 2, 2048 bits) for each SMILES string
dataset['descriptor'] = dataset['smiles'].apply(
    lambda smiles: rdkit.Chem.AllChem.GetMorganFingerprintAsBitVect(
        rdkit.Chem.MolFromSmiles(smiles), radius=2, nBits=2048
    )
)

Embed the fingerprints in two dimensions with UMAP

In [41]:
%%time
umap_model = umap.UMAP(metric = "jaccard",
                      n_neighbors = 25,
                      n_components = 2,
                      low_memory = False,
                      min_dist = 0.001)
X_umap = umap_model.fit_transform(np.vstack(dataset['descriptor'].values))
dataset["UMAP_0"], dataset["UMAP_1"] = X_umap[:,0], X_umap[:,1]

CPU times: user 1min 6s, sys: 391 ms, total: 1min 6s
Wall time: 39.7 s
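
The `metric = "jaccard"` setting means that two molecules are compared by the fingerprint bits they share. As a rough illustration of what that distance measures, here is a pure-Python sketch on two made-up 8-bit vectors. The function `jaccard_distance` is a hypothetical helper and not UMAP's internal implementation, and returning 0.0 for two all-zero vectors is a convention chosen for this example.

```python
def jaccard_distance(bits_a, bits_b):
    """Jaccard distance between two equal-length binary vectors.

    distance = 1 - |A and B| / |A or B|, counted over the set bits.
    Hypothetical helper for illustration only.
    """
    if len(bits_a) != len(bits_b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for a, b in zip(bits_a, bits_b) if a and b)
    either = sum(1 for a, b in zip(bits_a, bits_b) if a or b)
    if either == 0:
        # Assumed convention: two empty fingerprints are treated as identical
        return 0.0
    return 1.0 - both / either


# Toy fingerprints (made up; real ones here have 2048 bits)
fp_a = [1, 1, 0, 0, 1, 0, 0, 0]
fp_b = [1, 0, 0, 0, 1, 1, 0, 0]
print(jaccard_distance(fp_a, fp_b))  # 2 shared bits out of 4 set in either -> 0.5
```

For binary fingerprints, one minus this distance is the Tanimoto similarity commonly used in cheminformatics, which is one reason the metric is a natural fit for Morgan bit vectors.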


Are there any clusters?

In [44]:
alt.Chart(dataset[['UMAP_0', 'UMAP_1']]).mark_circle(size=60).encode(
    x='UMAP_0',
    y='UMAP_1',
)

#### <span style='color:blue'>__The molecules seem to cover the available space__</span>

### Do any clusters in UMAP space seem to exhibit high toxicity?

In [46]:
alt.Chart(dataset[['UMAP_0', 'UMAP_1', 'LD50']]).mark_circle(size=60).encode(
    x='UMAP_0',
    y='UMAP_1',
    color='LD50:Q',
)

#### <span style='color:blue'>__There is a small cluster of very non-toxic molecules__</span>