# Dataset report: Lunghini ecotox

## _Consensus QSAR models estimating acute toxicity to aquatic organisms from different trophic levels: algae, Daphnia and fish_


> F. Lunghini, G. Marcou, P. Azam, M. H. Enrici, E. Van Miert, and A. Varnek, “Consensus QSAR models estimating acute toxicity to aquatic organisms from different trophic levels: algae, Daphnia and fish,” SAR QSAR Environ. Res., vol. 31, no. 9, pp. 655–675, Sep. 2020, doi: 10.1080/1062936X.2020.1797872.

In [1]:
import altair as alt
alt.data_transformers.disable_max_rows()
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

import cytoxnet.dataprep.io

In [2]:
## get the data
dataset = cytoxnet.dataprep.io.load_data('lunghini')
dataset.describe()

| | molecular_weight | daphnia_EC50 | fish_LC50 | algea_EC50 |
|---|---|---|---|---|
| count | 3679.0 | 2120.0 | 2199.0 | 1440.0 |
| mean | 215.019743 | 62.426641 | 202.975335 | 118.942882 |
| std | 109.202813 | 632.706477 | 1308.709203 | 596.983371 |
| min | 24.0214 | 0.0 | 0.00013 | 0.000395 |
| 25% | 143.248215 | 1.0 | 1.8 | 3.2 |
| 50% | 192.254242 | 6.455 | 9.4 | 15.0 |
| 75% | 263.374435 | 31.075 | 52.0 | 56.325 |
| max | 1338.086792 | 25000.0 | 37700.0 | 9120.0 |


In [3]:
print('Number of unique molecules: ', len(dataset))

Number of unique molecules:  3680


In [4]:
cytoxnet.dataprep.io.create_compound_codex('./data_lunghini/')

In [5]:
data = cytoxnet.dataprep.io.add_datasets(
                 dataframes=dataset,
                 names=['lunghini_ecotox'],
                 id_col='smiles',
                 db_path='./data_lunghini',
                 new_featurizers=None)

In [6]:
compounds = pd.read_csv('./data_lunghini/compounds.csv')
print('Number of unique molecules after SMILES canonicalization: ', len(compounds))

Number of unique molecules after SMILES canonicalization:  3680


### Targets present
The dataset includes toxicity targets for algae, fish, and Daphnia. From the `describe()` output above we can see that not all 3680 molecules have data for every target: each species has measurements for only about 1,400 to 2,200 of the molecules (1440 for algae, 2199 for fish, 2120 for Daphnia).
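The per-target coverage, and how many molecules carry one, two or all three measurements, can be tallied directly from the missing-value pattern. The snippet below is a minimal sketch on a made-up toy frame; only the column names follow the notebook, and `toy`, `coverage` and `overlap` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loaded dataset; the values are made up, only the
# column names follow the notebook.
toy = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O", "CCN"],
    "algea_EC50": [12.0, np.nan, 3.5, np.nan],
    "fish_LC50": [np.nan, 5.3, 40.0, np.nan],
    "daphnia_EC50": [8.1, 0.9, 7.0, np.nan],
})
targets = ["algea_EC50", "fish_LC50", "daphnia_EC50"]

# Number of molecules with a measurement for each target.
coverage = toy[targets].notna().sum()

# Number of targets measured per molecule (0 to 3), then a tally of how
# many molecules fall into each bucket.
targets_per_molecule = toy[targets].notna().sum(axis=1)
overlap = targets_per_molecule.value_counts().sort_index()

print(coverage)
print(overlap)
```

Run against the real `dataset`, the same two lines would show how many molecules are shared across species, which matters if the targets are later modelled jointly.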

#### <span style='color:blue'>__The targets span several orders of magnitude for each species (units of mg/L)__</span>

In [7]:
dataset.describe().loc[['min', 'max']]

| | molecular_weight | daphnia_EC50 | fish_LC50 | algea_EC50 |
|---|---|---|---|---|
| min | 24.0214 | 0.0 | 0.00013 | 0.000395 |
| max | 1338.086792 | 25000.0 | 37700.0 | 9120.0 |


Scale the targets to plot the distribution on the same axis.

In [8]:
for target in ['algea_EC50', 'fish_LC50', 'daphnia_EC50']:
    scaled = MinMaxScaler().fit_transform(dataset[target].values.reshape(-1,1))
    dataset[target+' (scaled)'] = scaled

In [9]:

alt.Chart(dataset).transform_fold(
    ['algea_EC50 (scaled)', 'fish_LC50 (scaled)', 'daphnia_EC50 (scaled)'],
    as_=['Target', 'Measurement (scaled)']
).mark_area(
    opacity=0.7,
    interpolate='step'
).encode(
    alt.X('Measurement (scaled):Q', bin=alt.Bin(maxbins=100)),
    alt.Y('count()', stack=None),
    alt.Color('Target:N')
)


#### <span style='color:blue'>__The distributions are heavily skewed towards low concentrations, i.e. the toxic side. Let's try the log-transformed data__</span>
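To see why a log transform helps where min-max scaling does not, here is a small sketch on synthetic log-normal "concentrations" (not the Lunghini data). Min-max scaling is a linear map, so it leaves the skewness untouched, while the logarithm compresses the long right tail. The natural log is used to match `np.log` elsewhere in this notebook; the `skewness` helper is an illustrative name.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
# Synthetic, log-normally distributed values standing in for mg/L data.
conc = rng.lognormal(mean=2.0, sigma=2.5, size=2000)

# Min-max scaling only shifts and rescales, so the shape is unchanged.
scaled = (conc - conc.min()) / (conc.max() - conc.min())

# The log transform pulls in the long right tail.
logged = np.log(conc)

print(f"skewness raw:    {skewness(conc):.2f}")
print(f"skewness scaled: {skewness(scaled):.2f}")
print(f"skewness logged: {skewness(logged):.2f}")
```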

In [10]:
# load log-transformed data
algae_lt = cytoxnet.dataprep.io.load_data('lunghini_algea_EC50')
fish_lt = cytoxnet.dataprep.io.load_data('lunghini_fish_LC50')
daphnia_lt = cytoxnet.dataprep.io.load_data('lunghini_daphnia_EC50')

In [11]:
# visualize the log-transformed targets with one shared helper
def log_histogram(frame, column):
    """Step-area histogram of a single log-transformed target column."""
    return alt.Chart(frame).transform_fold(
        [column],
        as_=['Target', 'Measurement (log)']
    ).mark_area(
        opacity=0.7,
        interpolate='step'
    ).encode(
        alt.X(f'{column}:Q', bin=alt.Bin(maxbins=100)),
        alt.Y('count()', stack=None),
        alt.Color('Target:N')
    )

algae = log_histogram(algae_lt, 'algea_EC50')
fish = log_histogram(fish_lt, 'fish_LC50')
daphnia = log_histogram(daphnia_lt, 'daphnia_EC50')

# layer the three histograms on one set of axes
daphnia + algae + fish


In [12]:
"""
alt.Chart(dataset).transform_fold(
    ['algea_EC50 (scaled)', 'fish_LC50 (scaled)', 'daphnia_EC50 (scaled)'],
    as_=['Target', 'Measurement']
).transform_filter('datum.Measurement < .003').mark_area(
    opacity=0.4,
    interpolate='step'
).encode(
    alt.X('Measurement:Q', bin=alt.Bin(maxbins=100)),
    alt.Y('count()', stack=None),
    alt.Color('Target:N')
)
"""

In [13]:
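# For ease of plotting on the UMAP, add log-transformed columns to the all-species dataframe.
def log_trans(column):
    """Natural log of every non-NaN value; DataFrame.apply passes one column at a time."""
    inp = column.values.reshape(-1)
    out = []
    for val in inp:
        if not np.isnan(val):
            # Note: a value of exactly 0 (present in daphnia_EC50) gives -inf here.
            out.append(np.log(val))
        else:
            out.append(val)
    return out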

In [14]:
dataset[['algea_EC50 (logged)', 'fish_LC50 (logged)', 'daphnia_EC50 (logged)']] = dataset[['algea_EC50', 'fish_LC50', 'daphnia_EC50']].apply(log_trans)
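Because `daphnia_EC50` has a minimum of 0.0, taking the log directly produces `-inf` for those entries. One possible way to guard against this, sketched below on toy values, is to treat non-positive measurements as missing before taking the log. Whether that is the right choice for this dataset is an assumption; clipping to a detection limit would be another option. `safe_log` is a hypothetical helper, not part of cytoxnet.

```python
import numpy as np
import pandas as pd

def safe_log(frame):
    """Natural log of each column; NaN stays NaN and values <= 0 become NaN."""
    positive = frame.where(frame > 0)  # non-positive and NaN entries -> NaN
    return np.log(positive)

# Toy values only; the column names follow the notebook.
toy = pd.DataFrame({
    "daphnia_EC50": [0.0, 1.0, np.nan, 25000.0],
    "fish_LC50": [0.00013, np.nan, 9.4, 37700.0],
})
logged = safe_log(toy)
print(logged)
```

This vectorized form also avoids the Python-level loop, and it never emits the divide-by-zero warning because zeros are masked before the log is taken.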



### Molecule space

In [15]:
!pip install --quiet umap-learn hdbscan



In [16]:
import rdkit.Chem.AllChem
import umap.umap_ as umap

Compute Morgan fingerprints (radius 2, 2048 bits) to use as descriptors for the mapping

In [17]:
dataset['descriptor'] = dataset['smiles'].apply(
    lambda smiles: rdkit.Chem.AllChem.GetMorganFingerprintAsBitVect(rdkit.Chem.MolFromSmiles(smiles), radius=2, nBits=2048)
    )

Flag, for each molecule, which species have a measured target

In [18]:
dataset['fish'] = dataset['fish_LC50'].apply(lambda value: not np.isnan(value))
dataset['algea'] = dataset['algea_EC50'].apply(lambda value: not np.isnan(value))
dataset['daphnia'] = dataset['daphnia_EC50'].apply(lambda value: not np.isnan(value))

Project the fingerprints into two dimensions with UMAP

In [19]:
%%time
umap_model = umap.UMAP(metric = "jaccard",
                      n_neighbors = 25,
                      n_components = 2,
                      low_memory = False,
                      min_dist = 0.001)
X_umap = umap_model.fit_transform(np.vstack(dataset['descriptor'].values))
dataset["UMAP_0"], dataset["UMAP_1"] = X_umap[:,0], X_umap[:,1]


CPU times: user 1min 9s, sys: 973 ms, total: 1min 10s
Wall time: 56.8 s
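The `metric = "jaccard"` setting compares fingerprints by the fraction of set bits they do not share. As a rough illustration of what that distance measures, here is a plain NumPy sketch on three made-up 8-bit vectors (real Morgan fingerprints here have 2048 bits). The `jaccard_distance` helper is illustrative and is not the implementation UMAP uses internally.

```python
import numpy as np

def jaccard_distance(a, b):
    """1 - |a AND b| / |a OR b| for two binary vectors."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention chosen here for two all-zero vectors
    intersection = np.logical_and(a, b).sum()
    return 1.0 - intersection / union

# Made-up bit vectors standing in for fingerprints.
fp_a = np.array([1, 1, 0, 0, 1, 0, 0, 0])
fp_b = np.array([1, 0, 0, 0, 1, 1, 0, 0])
fp_c = np.array([0, 0, 1, 1, 0, 0, 1, 0])

print(jaccard_distance(fp_a, fp_b))  # 2 shared bits out of 4 set in either
print(jaccard_distance(fp_a, fp_c))  # no shared bits
```

Molecules that share most of their substructure bits end up close together in the embedding, while molecules with no bits in common sit at the maximum distance of 1.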


Does data for the three species cover a similar molecular space?

In [20]:
# .assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
fish = dataset[dataset['fish']].assign(species='fish')
daphnia = dataset[dataset['daphnia']].assign(species='daphnia')
algea = dataset[dataset['algea']].assign(species='algea')
dataset_ = pd.concat([fish, daphnia, algea], ignore_index=True)[['UMAP_0', 'UMAP_1', 'species']]



In [21]:
selection = alt.selection_multi(fields=['species'], bind='legend')
alt.Chart(dataset_).mark_circle(size=60).encode(
    x='UMAP_0',
    y='UMAP_1',
    color='species',
    opacity=alt.condition(selection, alt.value(1), alt.value(0.01))
).add_selection(selection)

#### <span style='color:blue'>__All three species seem to cover a similar space__</span>

### Do any clusters in UMAP space seem to exhibit high toxicity?

In [22]:
alg = alt.Chart(dataset[['UMAP_0', 'UMAP_1', 'algea_EC50 (logged)']][dataset['algea'] == 1]).mark_circle(size=60).encode(
    x='UMAP_0',
    y='UMAP_1',
    color='algea_EC50 (logged):Q',
)
daph = alt.Chart(dataset[['UMAP_0', 'UMAP_1', 'daphnia_EC50 (logged)']][dataset['daphnia'] == 1]).mark_circle(size=60).encode(
    x='UMAP_0',
    y='UMAP_1',
    color='daphnia_EC50 (logged):Q',
)
fish = alt.Chart(dataset[['UMAP_0', 'UMAP_1', 'fish_LC50 (logged)']][dataset['fish'] == 1]).mark_circle(size=60).encode(
    x='UMAP_0',
    y='UMAP_1',
    color='fish_LC50 (logged):Q',
)

In [23]:
alg

In [24]:
daph

In [25]:
fish

#### <span style='color:blue'>__No clear trend: toxic and nontoxic molecules do not separate into distinct clusters__</span>
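The visual impression could be checked numerically by assigning cluster labels in the embedding and comparing the log-transformed target per cluster. The sketch below uses synthetic coordinates and values, and swaps in scikit-learn's `KMeans` as a simple stand-in: the notebook installs `hdbscan` but does not show a clustering step, so the choice of algorithm and of three clusters is an assumption made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 2D "embedding": three well-separated blobs of 50 points each.
centers = [(0.0, 0.0), (10.0, 10.0), (-10.0, 10.0)]
coords = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])

# Synthetic log-transformed target: the first blob is made more toxic (lower).
mean_log = [-2.0, 2.0, 2.0]
log_ec50 = np.concatenate([rng.normal(loc=m, scale=0.3, size=50) for m in mean_log])

toy = pd.DataFrame({
    "UMAP_0": coords[:, 0],
    "UMAP_1": coords[:, 1],
    "log_EC50": log_ec50,
})

# Assign cluster labels in the embedding (KMeans is a stand-in, see above).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
toy["cluster"] = kmeans.fit_predict(toy[["UMAP_0", "UMAP_1"]])

# Per-cluster median and size; a cluster with a clearly lower median would
# point to a region of chemical space enriched in toxic molecules.
summary = toy.groupby("cluster")["log_EC50"].agg(["median", "count"])
print(summary)
```

On the real data, similar medians across clusters would support the conclusion above, while one or two clusters with markedly lower medians would contradict it.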