# Primer on CYTOXNET ToxModels
Initializing, training, evaluating, and visualizing the results of package machine learning models.
***
The package already includes a number of model types, but natively supports any model class from [deepchem](https://deepchem.io/) or [sklearn](https://scikit-learn.org/stable/). These are all accessed through the `ToxModel` class. This class is effectively a wrapper that puts both deepchem and sklearn models in one place, with some additional functionality, including a quicker API for requesting metrics and visualization methods.

In [1]:
from cytoxnet.models.models import ToxModel

In [2]:
help(ToxModel)

Help on class ToxModel in module cytoxnet.models.models:

class ToxModel(builtins.object)
 |  Highlevel access to available model classes.
 |  
 |  Class instantialization will retrieve the requested model type. The help
 |  method provides quick access to model docs.
 |  
 |  Parameters
 |  ----------
 |      model_name : str
 |          The name of the model type to instatialize.
 |      transformers : list of :obj:deepchem.transformers.RawTransformer
 |          Data transformations to apply to output predictions. If the
 |          training data was transformed/preprocessed, this will allow
 |          predictions and evaluation to be done in the raw data space.
 |      tasks : list of str
 |          Names for the different targets. Default only one unnamed task.
 |      use_weights : bool, default False
 |          Only relevant for sklearn models, some of which can accept weights
 |          for fitting while others cannot.
 |      kwargs
 |          Keyword arguments to pass to

We pass to the model the name of the wrapped model to use, a list of `deepchem` transformers that were used to prepare the data (if any), a list of tasks corresponding to the model's targets, and finally __any keyword arguments to pass to the wrapped model initialization__.
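
The pattern described above (a model name selects the wrapped class, and leftover keywords are forwarded to its initializer) can be sketched in a few lines. This is a minimal illustration, not the `ToxModel` implementation: `MiniWrapper`, `DummyRegressor`, `DummyClassifier`, the registry keys, and the placeholder task name `"target"` are all hypothetical.

```python
# Hypothetical stand-ins for wrapped model classes; not part of cytoxnet.
class DummyRegressor:
    def __init__(self, alpha=1.0):
        self.alpha = alpha


class DummyClassifier:
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors


class MiniWrapper:
    """Toy name-based wrapper that forwards extra keywords to the wrapped class."""

    # Registry mapping a short name to the class it stands for.
    _registry = {"REG": DummyRegressor, "CLF": DummyClassifier}

    def __init__(self, model_name, transformers=None, tasks=None, **kwargs):
        if model_name not in self._registry:
            raise ValueError(f"Unknown model name: {model_name!r}")
        self.transformers = list(transformers or [])
        # Assumed placeholder for the single-task default described in the docs.
        self.tasks = list(tasks) if tasks else ["target"]
        # Everything not consumed by the wrapper goes to the wrapped model.
        self.model = self._registry[model_name](**kwargs)


wrapped = MiniWrapper("REG", tasks=["fish_LC50"], alpha=0.1)
print(type(wrapped.model).__name__, wrapped.model.alpha, wrapped.tasks)
```

Keeping the wrapper's own arguments explicit and collecting the rest in `**kwargs` means the wrapper does not need to know the signature of every model it wraps.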

***
### Minimally prepare a dataset to use for demonstration
See the dataprep example notebook for the functionality and options available when preparing data.

### <span style='color:red'>NEED TO UPDATE WITH DATABASE CALL</span>

In [13]:
import cytoxnet.dataprep.io
import cytoxnet.dataprep.dataprep
import cytoxnet.dataprep.featurize
import pandas as pd

In [19]:
fish = cytoxnet.dataprep.io.load_data('lunghini_fish_LC50')
algea = cytoxnet.dataprep.io.load_data('lunghini_algea_EC50')

In [20]:
df = pd.concat([fish, algea]).reset_index()
df = cytoxnet.dataprep.featurize.add_features(df, method='ConvMolFeaturizer')

In [25]:
data = cytoxnet.dataprep.dataprep.convert_to_dataset(
    df,
    X_col='ConvMolFeaturizer',
    y_col=['algea_EC50', 'fish_LC50']
)

In [26]:
data = cytoxnet.dataprep.dataprep.handle_sparsity(data)

In [27]:
data, transformers = cytoxnet.dataprep.dataprep.data_transformation(
    data, ['MinMaxTransformer'], to_transform='y'
)

***
### Initializing a model
The models currently wrapped in the class can be listed with the `help` method.

In [3]:
ToxModel.help()

AVAILABLE MODELS:
GPR:  (sklearn) Gaussian Process Regressor. Accepts vector features.
GPC:  (sklearn) Gaussian Process Classifier. Accepts vector features.
KNNC:  (sklearn) K nearest neighbor Classifier. Accepts vector features.
GraphCNN:  (deepchem) Graph Convolutional Neural Network. Accepts graph features.
LASSO:  (sklearn) Least Absolute Shrinkage and Selection Operator. Accepts vector features
RFR:  (sklearn) Random Forest Regressor. Accepts vector features.
RFC:  (sklearn) Random Forest Classifier. Accepts vector features.


We can additionally ask for the wrapped model's documentation to help with initialization; in this example, let's use a graph CNN.

In [30]:
ToxModel.help('GraphCNN')

Tox model:  GraphCNN
Help on class GraphConvModel in module deepchem.models.graph_models:

class GraphConvModel(deepchem.models.keras_model.KerasModel)
 |  Graph Convolutional Models.
 |  
 |  This class implements the graph convolutional model from the
 |  following paper [1]_. These graph convolutions start with a per-atom set of
 |  descriptors for each atom in a molecule, then combine and recombine these
 |  descriptors over convolutional layers.
 |  following [1]_.
 |  
 |  
 |  References
 |  ----------
 |  .. [1] Duvenaud, David K., et al. "Convolutional networks on graphs for
 |  learning molecular fingerprints." Advances in neural information processing
 |  systems. 2015.
 |  
 |  Method resolution order:
 |      GraphConvModel
 |      deepchem.models.keras_model.KerasModel
 |      deepchem.models.models.Model
 |      sklearn.base.BaseEstimator
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, n_tasks:int, graph_conv_layers:List[int]=[64, 64], dens

If we do not pass a list of tasks, the model assumes a single target by default. If we do not pass any transformers, some options within the class, such as untransforming output data for evaluation metrics, will not be available. We can also pass keywords for the wrapped model class here. Let's specify a dense regressor layer with 12 neurons. We also ask for a regression task, since classification is otherwise chosen by default.

In [32]:
my_model = ToxModel('GraphCNN',
                    dense_layer_size=12,
                    tasks=['algea_EC50', 'fish_LC50'],
                    transformers=transformers,
                    mode='regression'
                   )

***
### Fitting the model
Fitting whichever model is being used is simply a matter of passing the `deepchem` dataset to train on, along with any keyword arguments for the wrapped model's fit method. Here we specify 25 epochs.

In [33]:
my_model.fit(data, nb_epoch=25)

  "shape. This may consume a large amount of memory." % value)

0.008236944079399108

***
### Predicting with the model
Call the model's predict method with a dataset object (containing at least X data of the type and shape expected by the model) to return an array of predictions for our targets, in this case two columns for the two targets. We also pass `untransform=True`, which causes the predictions to be untransformed by the transformers stored in the model; otherwise the predictions would not be interpretable in the desired output space.

In [34]:
my_model.predict(data, untransform=True)

array([[ 8.12795115e-01,  1.11178725e+00],
       [-3.42956709e+00,  3.50053402e+00],
       [-3.50459010e+00,  3.76978984e-01],
       [-2.82335286e+00, -4.51004186e-01],
       [-1.70185451e-01,  2.63372110e+00],
       [ 4.97997616e+00,  2.21077612e+00],
       [ 1.96695511e+00,  7.25605861e-01],
       [ 1.47992657e+00,  2.40403058e+00],
       [ 3.66928723e+00,  6.97394238e+00],
       [ 8.12664239e+00,  4.72598990e+00],
       [ 2.50354388e+00,  1.66687741e+00],
       [ 2.86077252e+00,  1.01873778e+00],
       [ 3.00950982e+00,  2.00394700e+00],
       [ 1.19325722e-01, -5.11323106e-01],
       [ 2.25325771e-01,  3.41288514e+00],
       [-2.05461534e-01,  1.06962074e+00],
       [-2.85962589e+00,  1.47938592e+00],
       [ 1.29716176e+00, -3.42326648e-01],
       [ 4.96287605e-01,  2.27151606e+00],
       [-1.28921792e+00,  2.42732634e+00],
       [ 3.39874479e+00,  4.75205564e+00],
       [-6.21121351e-01,  1.19270801e+00],
       [ 1.13181848e+00,  2.35479102e+00],
       [ 3.
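
To make the role of untransforming concrete, here is a minimal sketch of a min-max scaling and its inverse in plain Python. It only illustrates the idea behind a min-max transformer; the function names and the numbers are made up, and it is not the `deepchem` implementation.

```python
def fit_min_max(values):
    """Return the (min, max) pair needed to scale values into [0, 1]."""
    return min(values), max(values)


def transform(values, lo, hi):
    """Scale raw values into the [0, 1] range."""
    return [(v - lo) / (hi - lo) for v in values]


def untransform(scaled, lo, hi):
    """Map scaled values back into the raw data space."""
    return [s * (hi - lo) + lo for s in scaled]


raw_targets = [2.0, 4.0, 10.0]           # made-up raw target values
lo, hi = fit_min_max(raw_targets)
scaled = transform(raw_targets, lo, hi)   # the space a model would be trained in
restored = untransform(scaled, lo, hi)    # back in the raw data space

print(scaled)
print(restored)
```

A model trained on `scaled` predicts in the scaled space, so its outputs only become comparable to the raw measurements after the inverse mapping is applied.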

***
### Evaluating the model
We can request evaluation metrics for a test set that has both X and y data. Again, we can pass `untransform=True` because the model was initialized with the transformers originally applied to the data. Because this task has multiple targets and a sparse target matrix with weight masks, we can specify `use_sample_weights=True` so that predictions for targets we have no data for are not compared against the masked 0.0 value.

In [42]:
my_model.evaluate(data,
                  metrics=['mean_absolute_error'],
                  untransform=True,
                  use_sample_weights=True)

{'metric-1': 1.6871931946022893}
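
The effect of the weight mask on a metric can be sketched with a small weighted mean absolute error. This is an illustration under assumed, made-up values, and `weighted_mae` is a hypothetical helper, not the function the package calls internally.

```python
def weighted_mae(y_true, y_pred, weights):
    """Mean absolute error where each entry counts in proportion to its weight."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("All weights are zero; nothing to evaluate.")
    weighted_errors = (w * abs(t - p) for t, p, w in zip(y_true, y_pred, weights))
    return sum(weighted_errors) / total_weight


# Made-up values: the third target is missing, stored as 0.0 with weight 0.
y_true = [1.0, 3.0, 0.0]
y_pred = [1.5, 2.0, 4.0]
weights = [1.0, 1.0, 0.0]

masked = weighted_mae(y_true, y_pred, weights)
unmasked = weighted_mae(y_true, y_pred, [1.0, 1.0, 1.0])
print(masked, unmasked)
```

With the mask applied, the prediction for the missing entry contributes nothing; without it, the placeholder 0.0 is treated as a real measurement and inflates the error.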

We can also plot predictions for one of the tasks directly from the class.

In [38]:
my_model.visualize('pair_predict', data, untransform=True, task='algea_EC50')

(3639, 2)


Note that the sharp line of points with a true value of 0 consists of the sparse entries in the data, which were replaced by zeros and deweighted for training and evaluation.
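
For readers who want to see where such zero-valued points come from, the sketch below builds a two-task target matrix and a matching weight mask from records that each lack some targets, then keeps only the measured points for one task. The records and values are invented, and this is an assumption about the general idea rather than a description of what `handle_sparsity` does internally.

```python
tasks = ["algea_EC50", "fish_LC50"]

# Made-up records: a compound may have been measured for only one task.
records = [
    {"algea_EC50": 1.2},
    {"fish_LC50": 0.4},
    {"algea_EC50": 3.1, "fish_LC50": 2.2},
]

y, w = [], []
for record in records:
    # Missing targets become 0.0 in y and get a weight of 0.0.
    y.append([record.get(task, 0.0) for task in tasks])
    w.append([1.0 if task in record else 0.0 for task in tasks])

# Keep only the points that were actually measured for one task.
task_index = tasks.index("algea_EC50")
measured = [row[task_index] for row, mask in zip(y, w) if mask[task_index] > 0]

print(y)
print(w)
print(measured)
```

Filtering on the mask in this way before plotting would drop the placeholder zeros, at the cost of hiding how many entries were missing.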