# Dataset report: Lunghini ecotox

## _Consensus QSAR models estimating acute toxicity to aquatic organisms from different trophic levels: algae, Daphnia and fish_


> F. Lunghini, G. Marcou, P. Azam, M. H. Enrici, E. Van Miert, and A. Varnek, “Consensus QSAR models estimating acute toxicity to aquatic organisms from different trophic levels: algae, Daphnia and fish,” SAR QSAR Environ. Res., vol. 31, no. 9, pp. 655–675, Sep. 2020, doi: 10.1080/1062936X.2020.1797872.

In [1]:
import altair as alt
alt.data_transformers.disable_max_rows()
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

import cytoxnet.dataprep.io

In [2]:
## get the data
dataset = cytoxnet.dataprep.io.load_lunghini(nans='keep')
dataset.describe()

| | algea_EC50 | fish_LC50 | daphnia_EC50 |
|---|---|---|---|
| count | 1440.0 | 2199.0 | 2120.0 |
| mean | 118.942882 | 202.975335 | 62.426641 |
| std | 596.983371 | 1308.709203 | 632.706477 |
| min | 0.000395 | 0.00013 | 0.0 |
| 25% | 3.2 | 1.8 | 1.0 |
| 50% | 15.0 | 9.4 | 6.455 |
| 75% | 56.325 | 52.0 | 31.075 |
| max | 9120.0 | 37700.0 | 25000.0 |


In [3]:
print('Number of unique molecules: ', len(dataset))

Number of unique molecules:  3680


### Targets present
The dataset includes toxicity targets for algae, fish, and Daphnia. The `describe()` output above shows that not all 3680 molecules have data for every target: each species has measurements for only about 1,400 to 2,200 of the molecules.
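
The coverage per target, and the number of molecules that carry one, two, or all three measurements, can be read off the missing-value pattern. The sketch below uses a small made-up frame with the same target column names as the dataset; the values are placeholders, not real measurements.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset: same target column names, made-up values.
toy = pd.DataFrame({
    'algea_EC50':   [1.2, np.nan, 15.0, np.nan],
    'fish_LC50':    [0.5, 9.4, np.nan, np.nan],
    'daphnia_EC50': [2.0, 6.5, 31.0, 0.1],
})
targets = ['algea_EC50', 'fish_LC50', 'daphnia_EC50']

# Number of molecules with a measurement for each target.
per_target = toy[targets].notna().sum()

# Number of targets measured per molecule, then a tally of those counts.
n_measured = toy[targets].notna().sum(axis=1)
coverage = n_measured.value_counts().sort_index()

print(per_target)
print(coverage)
```

On the real data, the same two lines applied to `dataset` would show how many molecules are shared across all three species.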

#### <span style='color:blue'>__The range of targets seems to be quite large for each species (units of mg/L)__</span>

In [4]:
dataset.describe().loc[['min', 'max']]

| | algea_EC50 | fish_LC50 | daphnia_EC50 |
|---|---|---|---|
| min | 0.000395 | 0.00013 | 0.0 |
| max | 9120.0 | 37700.0 | 25000.0 |


Scale the targets to plot the distribution on the same axis.

In [5]:
for target in ['algea_EC50', 'fish_LC50', 'daphnia_EC50']:
    scaled = MinMaxScaler().fit_transform(dataset[target].values.reshape(-1,1))
    dataset[target+' (scaled)'] = scaled
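
Min-max scaling maps each target onto [0, 1] via (x - min) / (max - min), with missing values staying missing. A minimal NumPy sketch of the same arithmetic on made-up numbers (the helper name `min_max_scale` is illustrative) shows why a few very large measurements squeeze everything else towards zero:

```python
import numpy as np

def min_max_scale(values):
    """Scale to [0, 1], ignoring NaNs when finding the min and max."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.nanmin(values), np.nanmax(values)
    return (values - lo) / (hi - lo)

# Made-up concentrations (mg/L) with one extreme value and one missing entry.
example = np.array([0.0, 1.0, 10.0, np.nan, 10000.0])
scaled = min_max_scale(example)
print(scaled)
```

Here the value 10.0 ends up at 0.001 on the scaled axis, so most of the distribution collapses into the first histogram bin.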

In [6]:
alt.Chart(dataset).transform_fold(
    ['algea_EC50 (scaled)', 'fish_LC50 (scaled)', 'daphnia_EC50 (scaled)'],
    as_=['Target', 'Measurement (scaled)']
).mark_area(
    opacity=0.7,
    interpolate='step'
).encode(
    alt.X('Measurement (scaled):Q', bin=alt.Bin(maxbins=100)),
    alt.Y('count()', stack=None),
    alt.Color('Target:N')
)

#### <span style='color:blue'>__The dataset is heavily imbalanced towards the toxic side. Zoom in on the low end of the range__</span>

In [7]:
alt.Chart(dataset).transform_fold(
    ['algea_EC50 (scaled)', 'fish_LC50 (scaled)', 'daphnia_EC50 (scaled)'],
    as_=['Target', 'Measurement']
).transform_filter('datum.Measurement < .003').mark_area(
    opacity=0.4,
    interpolate='step'
).encode(
    alt.X('Measurement:Q', bin=alt.Bin(maxbins=100)),
    alt.Y('count()', stack=None),
    alt.Color('Target:N')
)

#### <span style='color:blue'>__The count drops sharply towards the toxic end. Let's try a log transform__</span>

In [8]:
from sklearn.preprocessing import FunctionTransformer

In [9]:
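def log_trans(column):
    """Natural log of each value in a column, passing NaNs through unchanged."""
    # DataFrame.apply passes one column at a time by default (axis=0)
    inp = column.values.reshape(-1)
    out = []
    for val in inp:
        if not np.isnan(val):
            # note: a value of exactly 0 maps to -inf
            out.append(np.log(val))
        else:
            out.append(val)
    return out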

In [10]:
dataset[['algea_EC50 (logged)', 'fish_LC50 (logged)', 'daphnia_EC50 (logged)']] = dataset[['algea_EC50', 'fish_LC50', 'daphnia_EC50']].apply(log_trans)
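
One caveat with a plain logarithm: the `describe()` output lists a minimum of 0.0 for `daphnia_EC50`, and the log of zero evaluates to -inf. The sketch below shows one possible way to handle this, by turning non-positive values into NaN before the transform. The helper name `safe_log` is hypothetical, and treating zeros as missing is an assumption about the data rather than something the source prescribes.

```python
import numpy as np
import pandas as pd

def safe_log(series):
    """Natural log of positive entries; zeros, negatives and NaNs become NaN."""
    positive = series.where(series > 0)  # non-positive and NaN entries -> NaN
    return np.log(positive)

# Made-up values including a zero and a missing measurement.
toy = pd.Series([0.0, 1.0, np.e, np.nan], name='daphnia_EC50')
logged = safe_log(toy)
print(logged)
```

Keeping the logged column free of infinite values also avoids problems when binning it for a histogram.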

  


In [11]:
alt.Chart(dataset).transform_fold(
    ['algea_EC50 (logged)', 'fish_LC50 (logged)', 'daphnia_EC50 (logged)'],
    as_=['Target', 'Measurement (logged)']
).mark_area(
    opacity=0.7,
    interpolate='step'
).encode(
    alt.X('Measurement (logged):Q', bin=alt.Bin(maxbins=100)),
    alt.Y('count()', stack=None),
    alt.Color('Target:N')
)

### Molecule space

In [20]:
!pip install --quiet umap-learn hdbscan



In [43]:
import rdkit.Chem.AllChem
import umap

Compute the descriptors used for the mapping (Morgan fingerprints, radius 2, 2048 bits)

In [23]:
dataset['descriptor'] = dataset['smiles'].apply(
    lambda smiles: rdkit.Chem.AllChem.GetMorganFingerprintAsBitVect(rdkit.Chem.MolFromSmiles(smiles), radius=2, nBits=2048)
    )

Flag which species have a measurement for each molecule

In [66]:
dataset['fish'] = dataset['fish_LC50'].notna()
dataset['algea'] = dataset['algea_EC50'].notna()
dataset['daphnia'] = dataset['daphnia_EC50'].notna()

Project the fingerprints into two dimensions with UMAP

In [47]:
%%time
umap_model = umap.UMAP(metric = "jaccard",
                      n_neighbors = 25,
                      n_components = 2,
                      low_memory = False,
                      min_dist = 0.001)
X_umap = umap_model.fit_transform(np.vstack(dataset['descriptor'].values))
dataset["UMAP_0"], dataset["UMAP_1"] = X_umap[:,0], X_umap[:,1]


CPU times: user 1min 8s, sys: 261 ms, total: 1min 8s
Wall time: 38.9 s
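
With `metric = "jaccard"`, the distance between two fingerprints is one minus the ratio of shared on-bits to the on-bits present in either fingerprint, that is, one minus the Tanimoto similarity. A small NumPy illustration on made-up 8-bit vectors (not real 2048-bit Morgan fingerprints; the function name is illustrative):

```python
import numpy as np

def jaccard_distance(a, b):
    """1 - |a AND b| / |a OR b| for two binary vectors."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention used here: two empty vectors are identical
    return 1.0 - np.logical_and(a, b).sum() / union

fp_a = [1, 1, 0, 0, 1, 0, 0, 0]
fp_b = [1, 0, 0, 0, 1, 1, 0, 0]
dist = jaccard_distance(fp_a, fp_b)
print(dist)
```

The two toy fingerprints share 2 of 4 on-bits, giving a distance of 0.5; identical fingerprints give 0.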


Does data for the three species cover the molecular space?

In [88]:
# Take explicit copies so that adding the 'species' column does not
# trigger pandas' SettingWithCopyWarning.
fish = dataset[dataset['fish']].copy()
fish['species'] = 'fish'
daphnia = dataset[dataset['daphnia']].copy()
daphnia['species'] = 'daphnia'
algea = dataset[dataset['algea']].copy()
algea['species'] = 'algea'
dataset_ = pd.concat([fish, daphnia, algea], ignore_index=True)[['UMAP_0', 'UMAP_1', 'species']]
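
A long-format table with one row per (molecule, species) measurement can also be built in a single chain with `DataFrame.melt`. The sketch below is an alternative under stated assumptions: the frame is made up, and the species label is derived from the part of the column name before the underscore.

```python
import numpy as np
import pandas as pd

# Made-up stand-in with embedding coordinates and the three target columns.
toy = pd.DataFrame({
    'UMAP_0': [0.1, 0.4, 0.9],
    'UMAP_1': [1.0, 0.2, 0.5],
    'fish_LC50': [9.4, np.nan, 52.0],
    'daphnia_EC50': [np.nan, 6.5, 31.0],
    'algea_EC50': [15.0, np.nan, np.nan],
})

long_form = (
    toy.melt(id_vars=['UMAP_0', 'UMAP_1'], var_name='target', value_name='value')
       .dropna(subset=['value'])  # keep only measured (molecule, target) pairs
       # 'fish_LC50' -> 'fish', 'daphnia_EC50' -> 'daphnia', ...
       .assign(species=lambda d: d['target'].str.split('_').str[0])
       .loc[:, ['UMAP_0', 'UMAP_1', 'species']]
       .reset_index(drop=True)
)
print(long_form)
```

Because every step returns a new frame, no column is ever assigned on a slice of the original data.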
  


In [94]:
selection = alt.selection_multi(fields=['species'], bind='legend')
alt.Chart(dataset_).mark_circle(size=60).encode(
    x='UMAP_0',
    y='UMAP_1',
    color='species',
    opacity=alt.condition(selection, alt.value(1), alt.value(0.01))
).add_selection(selection)

#### <span style='color:blue'>__All three species seem to cover the available space__</span>

### Do any clusters in UMAP space seem to exhibit high toxicity?

In [74]:
alg = alt.Chart(dataset[['UMAP_0', 'UMAP_1', 'algea_EC50']][dataset['algea'] == 1]).mark_circle(size=60).encode(
    x='UMAP_0',
    y='UMAP_1',
    color='algea_EC50:Q',
)
daph = alt.Chart(dataset[['UMAP_0', 'UMAP_1', 'daphnia_EC50']][dataset['daphnia'] == 1]).mark_circle(size=60).encode(
    x='UMAP_0',
    y='UMAP_1',
    color='daphnia_EC50:Q',
)
fish = alt.Chart(dataset[['UMAP_0', 'UMAP_1', 'fish_LC50']][dataset['fish'] == 1]).mark_circle(size=60).encode(
    x='UMAP_0',
    y='UMAP_1',
    color='fish_LC50:Q',
)

In [75]:
alg

In [76]:
daph

In [77]:
fish

#### <span style='color:blue'>__It's difficult to see given the bias towards toxic values, but it seems that the few extremely non-toxic compounds are the same across species__</span>
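
One way to check this impression numerically is to take the highest-concentration (least toxic) compounds for each species and intersect the resulting sets. The sketch below does so on a made-up frame; the column names match the dataset, but the values, the molecule ids, the cutoff `k`, and the helper name `least_toxic_ids` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def least_toxic_ids(frame, target, k):
    """Index labels of the k largest measured values of `target`."""
    return set(frame[target].nlargest(k).index)

# Made-up measurements (mg/L), indexed by a placeholder molecule id.
toy = pd.DataFrame(
    {
        'algea_EC50':   [9000.0, 5.0, 3000.0, np.nan, 1.0],
        'fish_LC50':    [30000.0, 2.0, 8000.0, 50.0, np.nan],
        'daphnia_EC50': [20000.0, 1.0, 4.0, 7000.0, 0.5],
    },
    index=['mol_a', 'mol_b', 'mol_c', 'mol_d', 'mol_e'],
)

k = 2
top = {t: least_toxic_ids(toy, t, k) for t in toy.columns}
shared = set.intersection(*top.values())
print(shared)
```

A large intersection relative to `k` on the real data would support the observation; a small one would suggest the least toxic compounds differ between species.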