# Preparing data for machine learning
This notebook showcases the minimal steps needed to turn a raw dataset into one that is ready for learning.

__Key Steps__
- Featurizing molecule strings in the dataset
- Selecting and converting data to `deepchem` dataset for learning

__Additional Tools__
- Handling sparse target datasets
- Binarizing continuous variables
- Transforming data before learning
- Splitting data into different sets

***
***

In [1]:
import cytoxnet.dataprep.dataprep
import cytoxnet.dataprep.database
import cytoxnet.dataprep.featurize

## Loading data
This step assumes you have already converted your (or the package's) raw data into an accessible database. See the `initialize_database` notebook for details. Otherwise, __we can start from any dataframe we are interested in__.

Query our database: we will ask for two targets, algae (named `algea` in the database) and daphnia toxicity. We can also ask for previously computed features.

In [27]:
dataframe_in = cytoxnet.dataprep.database.query_to_dataframe(['algea', 'daphnia'], features_list=['circularfingerprint'])



In [28]:
dataframe_in.describe()

| | algea_ids | algea_molecular_weight | algea_algea_ec50 | algea_foreign_key | daphnia_ids | daphnia_molecular_weight | daphnia_daphnia_ec50 | daphnia_foreign_key |
|---|---|---|---|---|---|---|---|---|
| count | 1440.0 | 1440.0 | 1440.0 | 1440.0 | 864.0 | 864.0 | 864.0 | 864.0 |
| mean | 719.5 | 189.630547 | 2.460036 | 719.5 | 1035.429398 | 185.846149 | 1.919594 | 700.663194 |
| std | 415.836506 | 83.114704 | 2.348855 | 415.836506 | 623.908342 | 75.766279 | 2.12435 | 419.677776 |
| min | 0.0 | 24.0214 | -7.836625 | 0.0 | 2.0 | 44.052559 | -8.568486 | 0.0 |
| 25% | 359.75 | 136.181038 | 1.163151 | 359.75 | 472.5 | 136.234039 | 0.630736 | 323.5 |
| 50% | 719.5 | 172.588982 | 2.70805 | 719.5 | 1019.5 | 172.192474 | 2.162748 | 706.5 |
| 75% | 1079.25 | 224.447578 | 4.031138 | 1079.25 | 1595.0 | 218.200092 | 3.47038 | 1046.25 |
| max | 1439.0 | 801.375671 | 9.118225 | 1439.0 | 2106.0 | 801.375671 | 7.333023 | 1438.0 |


## <span style='color:blue'>Key step:</span> adding features

We can add features to a dataframe by pointing to the column containing the chemical identifier and giving a feature name. By default, SMILES strings are canonicalized before features are retrieved, but this can be turned off with the boolean `canonicalize` keyword.

In [8]:
dataframe_feats = cytoxnet.dataprep.featurize.add_features(dataframe_in, id_col='smiles', method='RDKitDescriptors')

In [9]:
dataframe_feats['RDKitDescriptors']

0       [5.637361111111112, 0.1308531746031747, 5.6373...
1       [3.2055144557823128, 1.217013888888889, 3.2055...
2       [12.499231859410429, -3.573032407407408, 12.49...
3       [3.3200393282312923, 0.918545918367347, 3.3200...
4       [3.201388888888889, 1.0486111111111112, 3.2013...
                              ...                        
1435    [2.0416666666666665, 1.7129629629629628, 2.041...
1436    [4.063148148148148, 1.025462962962963, 4.06314...
1437    [4.137222222222222, 1.0995370370370372, 4.1372...
1438    [3.5555555555555554, 1.4444444444444444, 3.555...
1439    [3.5625, 1.4375, 3.5625, 1.4375, 0.43639752184...
Name: RDKitDescriptors, Length: 1440, dtype: object

We can also specify a compounds codex - _this saves time by retrieving the already computed and saved features_ for molecules in our database instead of computing them again. In preparing our database in the previous example notebook, we computed the circular fingerprint. If features are not found in the database, they will simply be computed as normal.

In [10]:
dataframe_feats = cytoxnet.dataprep.featurize.add_features(
    dataframe_in,
    id_col='smiles',
    method='CircularFingerprint',
    codex='database/compounds.csv'
)

In [11]:
dataframe_feats['CircularFingerprint']

0       [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2       [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
3       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
4       [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
                              ...                        
1435    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1436    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1437    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1438    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1439    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
Name: CircularFingerprint, Length: 1440, dtype: object

_Note: this was unnecessary, because we retrieved the circular fingerprint when initially querying the database, but we can retrospectively add features from the database as shown here._
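
The lookup-then-compute behaviour described above can be pictured with a small sketch. It is plain Python and does not call cytoxnet: the `featurize_with_cache` helper, the `toy_featurizer`, and the dictionary standing in for the codex are all hypothetical names chosen for illustration, not the package's implementation.

```python
def toy_featurizer(smiles):
    """Hypothetical stand-in for a real featurizer: counts a few atom symbols."""
    return [smiles.count("C"), smiles.count("O"), smiles.count("N")]


def featurize_with_cache(smiles_list, cache, featurizer):
    """Return one feature vector per molecule, computing only the cache misses."""
    features = []
    computed = 0
    for smi in smiles_list:
        if smi not in cache:
            # Not stored yet: compute the features and remember them.
            cache[smi] = featurizer(smi)
            computed += 1
        features.append(cache[smi])
    return features, computed


# The dictionary plays the role of the codex: features saved from an earlier run.
cache = {"CCO": [2, 1, 0]}
features, computed = featurize_with_cache(["CCO", "CCN", "CCO"], cache, toy_featurizer)
print(features)
print("newly computed:", computed)
```

Only the molecule missing from the cache is featurized; the other two lookups reuse the stored vector.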

## Handling sparse datasets
If we have a sparse target matrix, we can deal with the NAs for neural network models by replacing them with any value (in this case 0) and setting their weights to 0.0 so that they do not impact training or evaluation. We simply have to specify our targets.

In [15]:
dataframe_weighted = cytoxnet.dataprep.dataprep.handle_sparsity(
    dataframe_feats,
    y_col=['algea_algea_ec50', 'daphnia_daphnia_ec50']
)

For each target specified, the function added a column of weights:

In [17]:
for c in dataframe_weighted.columns:
    if 'w_' in c:
        print('Column: ', c)

Column:  w_algea_algea_ec50
Column:  w_daphnia_daphnia_ec50


The `w_` prefix is the default, but a different one can be chosen.

_Note: this function can be used directly on deepchem `datasets` without having to specify columns - simply pass your dataset with y labels (see __Key step: converting to dataset__)._
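
To see why zero weights keep the filled-in values from affecting training or evaluation, here is a minimal library-free sketch. The helpers `mask_missing` and `weighted_mse` are hypothetical and only mirror the idea; they are not how `handle_sparsity` is implemented.

```python
import math


def mask_missing(y_rows, fill_value=0.0):
    """Replace missing targets with fill_value and build a matching 0/1 weight matrix."""
    y_filled, weights = [], []
    for row in y_rows:
        missing = [v is None or math.isnan(v) for v in row]
        y_filled.append([fill_value if m else v for v, m in zip(row, missing)])
        weights.append([0.0 if m else 1.0 for m in missing])
    return y_filled, weights


def weighted_mse(y_true, y_pred, weights):
    """Mean squared error in which entries with zero weight contribute nothing."""
    num = sum(
        w * (t - p) ** 2
        for row_t, row_p, row_w in zip(y_true, y_pred, weights)
        for t, p, w in zip(row_t, row_p, row_w)
    )
    den = sum(w for row_w in weights for w in row_w)
    return num / den


# Two targets per sample; NaN marks a measurement that does not exist.
y = [[2.5, float("nan")], [float("nan"), 1.2], [3.1, 0.4]]
y_filled, w = mask_missing(y)

# A constant prediction of 1.0 for every entry, including the masked ones.
y_pred = [[1.0, 1.0] for _ in y]
loss = weighted_mse(y_filled, y_pred, w)
print(y_filled, w, loss)
```

The two filled-in zeros sit in the target matrix but carry weight 0.0, so only the four measured values enter the loss.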

## Binarizing targets
If we have a continuous variable and wish to convert it to binary nontoxic/toxic values, we specify either a fixed threshold value or a percentile to compute from the data. We can also specify whether values larger than the threshold should be positive or negative. This function will handle sparsity (see above) if it finds any. In this case we set the threshold to the 20th percentile of each target.

In [21]:
dataframe_binarized = cytoxnet.dataprep.dataprep.binarize_targets(
    dataframe_weighted,
    target_cols=['algea_algea_ec50', 'daphnia_daphnia_ec50'],
    percentile=0.2,
    high_positive=False
)

In [22]:
dataframe_binarized[['algea_algea_ec50', 'daphnia_daphnia_ec50']]

| | algea_algea_ec50 | daphnia_daphnia_ec50 |
|---|---|---|
| 0 | True | False |
| 1 | False | False |
| 2 | False | False |
| 3 | False | True |
| 4 | False | False |
| ... | ... | ... |
| 1435 | False | False |
| 1436 | False | False |
| 1437 | False | False |
| 1438 | False | False |


The default choice is to consider low values as positive (`high_positive` parameter), which is sensible for toxicity metrics where a lower value means a more toxic compound.
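
The percentile thresholding can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code behind `binarize_targets`: the `quantile` and `binarize` helpers are hypothetical, the quantile uses linear interpolation, values exactly equal to the threshold are treated as negative, and missing entries receive a `False` placeholder on the assumption that a zero weight masks them.

```python
import math


def quantile(values, q):
    """Linearly interpolated quantile of values for q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


def binarize(values, q=0.2, high_positive=False):
    """Label each value True (positive) or False relative to the q-quantile threshold."""
    observed = [v for v in values if not math.isnan(v)]
    threshold = quantile(observed, q)
    labels = []
    for v in values:
        if math.isnan(v):
            # Placeholder only: a zero weight is assumed to mask this entry.
            labels.append(False)
        elif high_positive:
            labels.append(v > threshold)
        else:
            labels.append(v < threshold)
    return labels, threshold


ec50 = [0.5, float("nan"), 2.0, 3.0, 4.0, 5.0, 1.0]
low_labels, threshold = binarize(ec50, q=0.2, high_positive=False)
high_labels, _ = binarize(ec50, q=0.2, high_positive=True)
print(threshold, low_labels, high_labels)
```

With `high_positive=False`, only the values below the 20th percentile are flagged as positive, which matches the convention that a low EC50 indicates higher toxicity.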

## <span style='color:blue'>Key step:</span> converting to dataset
We can convert our data into a deepchem `dataset`, removing all of the fluff in our dataframe and making it machine ready. We simply pass the names of columns containing featurized input data, columns containing our targets, and any weights for the samples.

In [23]:
dataset = cytoxnet.dataprep.dataprep.convert_to_dataset(
    dataframe_weighted,
    X_col='CircularFingerprint',
    y_col=['algea_algea_ec50', 'daphnia_daphnia_ec50'],
    w_label='w'
)

In this case, we had handled the sparsity in the data which added columns with a 'w' weight label. We could also pass the columns containing weights directly using the `w_col` parameter.

In [24]:
dataset

<NumpyDataset X.shape: (1440, 2048), y.shape: (1440, 2), w.shape: (1440, 2), task_names: [0 1]>

This is ready to be put into a ToxModel.
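
As a rough picture of what the conversion gathers, the sketch below pulls features, targets, and weights out of row records into three parallel lists. It is plain Python with a hypothetical `to_arrays` helper and made-up column names (`fp`, `tox_a`, `tox_b`); the only thing borrowed from the notebook is the assumption that weight columns are named `<w_label>_<target column>`, as in the weight columns printed earlier.

```python
def to_arrays(rows, x_col, y_cols, w_label="w"):
    """Collect features, targets and weights from row dicts into three parallel lists."""
    X = [row[x_col] for row in rows]
    y = [[row[c] for c in y_cols] for row in rows]
    # Weight columns are assumed to be named '<w_label>_<target column>'.
    w = [[row[f"{w_label}_{c}"] for c in y_cols] for row in rows]
    return X, y, w


# Two toy records; every other column (here 'smiles') is simply left behind.
rows = [
    {"smiles": "CCO", "fp": [0.0, 1.0], "tox_a": 2.5, "tox_b": 0.0,
     "w_tox_a": 1.0, "w_tox_b": 0.0},
    {"smiles": "CCN", "fp": [1.0, 0.0], "tox_a": 3.1, "tox_b": 0.4,
     "w_tox_a": 1.0, "w_tox_b": 1.0},
]
X, y, w = to_arrays(rows, x_col="fp", y_cols=["tox_a", "tox_b"])
shapes = [(len(a), len(a[0])) for a in (X, y, w)]
print(shapes)
```

The target and weight matrices end up with the same shape, one column per task, just as `y.shape` and `w.shape` agree in the dataset above.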

## Applying transformations
If we want to transform our data, we can request deepchem transformers by name and target either the X or y data. The transformers are also returned so that data can be untransformed after learning. In this case, let's apply normalization to the y data.

In [29]:
dataset_normed, transformations = cytoxnet.dataprep.dataprep.data_transformation(
    dataset,
    transformations=['NormalizationTransformer'],
    to_transform='y'
)
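
The reason the transformers are handed back is that undoing a normalization requires the statistics it was fitted with. The sketch below shows this round trip with a hypothetical `ZScore` class in plain Python; it stands in for the idea of a normalization transformer and is not deepchem's implementation.

```python
import statistics


class ZScore:
    """Hypothetical normalizer: remembers the mean and standard deviation it was fitted on."""

    def __init__(self, values):
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values)

    def transform(self, values):
        """Shift to zero mean and scale to unit standard deviation."""
        return [(v - self.mean) / self.std for v in values]

    def untransform(self, values):
        """Map normalized values (e.g. model predictions) back to the original scale."""
        return [v * self.std + self.mean for v in values]


y = [1.0, 2.0, 3.0, 4.0, 5.0]
scaler = ZScore(y)
y_normed = scaler.transform(y)
y_restored = scaler.untransform(y_normed)
print(y_normed)
print(y_restored)
```

A model trained on `y_normed` predicts in normalized units, so the same fitted object is needed later to report predictions on the original scale.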

## Splitting Data

In almost all cases, we want to split our dataset into some form of training, validation, and testing sets. As an example, we create 5 cross-validation folds from an initial training set. See the documentation for the full function options. A random splitter is used here, though deepchem has more informed splitting options such as scaffold splitting.

In [30]:
# first do a train test split
dev, test = cytoxnet.dataprep.dataprep.data_splitting(
    dataset_normed,
    splitter='RandomSplitter',
    split_type='tt'
)

In [31]:
# and now a k fold split of the dev set
folds = cytoxnet.dataprep.dataprep.data_splitting(
    dev,
    splitter='RandomSplitter',
    split_type='k_fold_split',
    k=5
)
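
For intuition about what a random k-fold split produces, here is a minimal sketch over sample indices. The `k_fold_indices` helper is hypothetical and written in plain Python; it is not the splitter deepchem uses, and the order in which the slightly larger folds appear may differ from a real run.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and return k (train, validation) index pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, valid))
        start += size
    return folds


# 1152 samples, the size of a dev set whose folds hold 922 + 230 samples.
folds = k_fold_indices(1152, k=5)
for train, valid in folds:
    print(len(train), len(valid))
```

Every sample appears in exactly one validation fold and in the training part of the other four.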

In [32]:
folds

[(<DiskDataset X.shape: (922, 2048), y.shape: (922, 2), w.shape: (922, 2), ids: [798 585 1179 ... 605 1013 510], task_names: [0 1]>,
  <DiskDataset X.shape: (230, 2048), y.shape: (230, 2), w.shape: (230, 2), ids: [233 295 906 ... 752 451 947], task_names: [0 1]>),
 (<DiskDataset X.shape: (922, 2048), y.shape: (922, 2), w.shape: (922, 2), ids: [233 295 906 ... 935 548 497], task_names: [0 1]>,
  <DiskDataset X.shape: (230, 2048), y.shape: (230, 2), w.shape: (230, 2), ids: [815 1336 956 ... 997 1071 654], task_names: [0 1]>),
 (<DiskDataset X.shape: (922, 2048), y.shape: (922, 2), w.shape: (922, 2), ids: [233 295 906 ... 592 637 632], task_names: [0 1]>,
  <DiskDataset X.shape: (230, 2048), y.shape: (230, 2), w.shape: (230, 2), ids: [146 1099 951 ... 718 1437 835], task_names: [0 1]>),
 (<DiskDataset X.shape: (921, 2048), y.shape: (921, 2), w.shape: (921, 2), ids: [233 295 906 ... 485 219 50], task_names: [0 1]>,
  <DiskDataset X.shape: (231, 2048), y.shape: (231, 2), w.shape: (231, 2), 