In [None]:
import pathlib
import time

import deepchem
import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
from sklearn import compose, linear_model, metrics, model_selection
from sklearn import pipeline, preprocessing, utils

# MoleculeNet

One of the most powerful features of DeepChem is that it comes "batteries included" with datasets to use. The DeepChem developer community maintains the MoleculeNet [1] suite, a large collection of scientific datasets for use in machine learning applications. The original MoleculeNet suite had 17 datasets mostly focused on molecular properties. Over the last several years, MoleculeNet has evolved into a broader collection of scientific datasets to facilitate the broad use and development of scientific machine learning tools.

These datasets are integrated with the rest of the DeepChem suite, so you can conveniently access them through functions in the `deepchem.molnet` submodule (often written `dc.molnet`). You have already seen a few examples of these loaders as you worked through the tutorial series. The full documentation for the MoleculeNet suite is available in our docs [2].

[1] Wu, Zhenqin, et al. "MoleculeNet: a benchmark for molecular machine learning." Chemical science 9.2 (2018): 513-530.

[2] https://deepchem.readthedocs.io/en/latest/moleculenet.html


## Delaney (ESOL) Dataset

The [Delaney (ESOL) dataset](https://pubs.acs.org/doi/pdf/10.1021/ci034243x) is a regression dataset containing structures and water solubility data for 1128 compounds. The dataset is widely used to validate machine learning models that estimate solubility directly from molecular structures (as encoded in [SMILES strings](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system)).

The raw CSV file contains the following columns:

* Compound ID - Name of the compound
* ESOL predicted log solubility in mols per litre
* Minimum Degree
* Molecular Weight
* Number of H-Bond Donors
* Number of Rings
* Number of Rotatable Bonds
* Polar Surface Area
* measured log solubility in mols per litre - Log-scale water solubility of the compound, used as label
* smiles - SMILES representation of the molecular structure



### Download and extract the data

In [None]:
DATA_DIR = pathlib.Path("../data/moleculenet/delaney")
DATA_DIR.mkdir(parents=True, exist_ok=True)

_, (dataset,), _ = (deepchem.molnet
                            .load_delaney(
                                data_dir=DATA_DIR,
                                reload=False,
                                save_dir=DATA_DIR,
                                splitter=None,
                                transformers=[],
                            )
                    )

### Load the data

We can use the following code to quickly look at the first few lines of the CSV file.

In [None]:
%%bash

head ../data/moleculenet/delaney/delaney-processed.csv

We will load the data using the [Pandas](https://pandas.pydata.org/) library. I highly recommend the most recent edition of [*Python for Data Analysis*](https://learning.oreilly.com/library/view/python-for-data/9781491957653/) by Pandas creator Wes McKinney for anyone interested in learning how to use Pandas.

In [None]:
_usecols = [
    "ESOL predicted log solubility in mols per litre",
    "Minimum Degree",
    "Molecular Weight",
    "Number of H-Bond Donors",
    "Number of Rings",
    "Number of Rotatable Bonds",
    "Polar Surface Area", 
    "measured log solubility in mols per litre",
    "smiles",
]

data = pd.read_csv(
    DATA_DIR / "delaney-processed.csv",
    usecols=_usecols,
)

### Explore the data

In [None]:
data.head()

In [None]:
_ = data.hist(bins=50, figsize=(12, 8))

In [None]:
(data.corr(numeric_only=True)  # skip the non-numeric "smiles" column
     .loc[:, "measured log solubility in mols per litre"]
     .sort_values(ascending=False))

In [None]:
_ = (pd.plotting
       .scatter_matrix(data, figsize=(12, 8)))
plt.show()

In [None]:
_esol_predictions_label = "ESOL predicted log solubility in mols per litre"
_target_label = "measured log solubility in mols per litre"
features = data.drop([_esol_predictions_label, _target_label], axis=1)
esol_predictions = data.loc[:, _esol_predictions_label]
target = data.loc[:, _target_label]

In [None]:
features.info()

In [None]:
features.head()

In [None]:
features.describe()

In [None]:
target.info()

In [None]:
target.head()

In [None]:
target.describe()

# Look at the Big Picture

Our goal over this three-day hands-on workshop is to build a machine learning modeling pipeline that is capable of accurately predicting the water solubility of chemical compounds. Today and tomorrow we will mostly focus on classical machine learning algorithms implemented in Scikit-Learn; on the final day we will revisit the same problem using deep learning algorithms implemented in PyTorch. By the time you have finished this workshop you should understand how to build a machine learning application and be ready to apply what you have learned to a new dataset.

Prior to the break we will mostly focus on getting the data and exploring the data to gain new insights. Believe it or not these initial steps are what data scientists and machine learning engineers spend the majority of their time doing! Following the break we will prepare our data for machine learning, see how to fit a variety of machine learning models to our dataset and shortlist a few candidate models for further analysis. We will then use hyper-parameter tuning to improve the performance of our shortlisted models to arrive at an overall best model.

## Framing the problem

### What is the business/research objective?

Typically building the model is not the overall objective but rather the model itself is one part of a larger process used to answer a business/research question. Knowing the overall objective is important because it will determine your choice of machine learning algorithms to train, your measure(s) of model performance, and how much time you will spend tweaking the hyper-parameters of your model.

In our example today, the overall business/research objective is to build a model capable of estimating the solubility of chemical compounds. Our solubility model is just one of potentially many models whose predictions are taken as inputs into other machine learning models that will be used to aid in drug discovery.

### What is the current solution?

It is always a good idea to know the current solution to the problem you are trying to solve. The current solution gives you a benchmark for performance. Note that the current "best" solution could be very simple or very sophisticated. Understanding the current solution helps you think of a good place to start.

With all this information, you are now ready to start designing your system. First, you need to frame the problem by answering the following questions.

* Is our problem supervised, unsupervised, or reinforcement learning?
* Is our problem a classification task, a regression task, or something else? If our problem is a classification task are we trying to classify samples into 2 categories (binary classification) or more than 2 (multi-class classification) categories? If our problem is a regression task, are we trying to predict a single value (univariate regression) or multiple values (multivariate regression) for each sample?
* Should you use batch learning or online learning techniques?


### Exercise: Selecting a metric

Scikit-Learn has a number of different [possible metrics](https://scikit-learn.org/stable/modules/model_evaluation.html) that you can choose from (or you can create your own custom metric if required). Can you find a few metrics that seem appropriate for our regression model?
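As a hedged illustration of the "custom metric" route mentioned above, the sketch below wraps a hand-written function with `metrics.make_scorer`. The function name `rmse` and the toy data are made up for this example; they are not part of the workshop dataset.

```python
import numpy as np
from sklearn import linear_model, metrics


def rmse(y_true, y_pred):
    """Hypothetical custom metric: root mean squared error computed with NumPy."""
    errors = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(errors ** 2)))


# greater_is_better=False tells Scikit-Learn that lower values are better,
# so the resulting scorer returns the negated metric.
rmse_scorer = metrics.make_scorer(rmse, greater_is_better=False)

# toy data, invented for illustration only
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 0.9, 2.1, 2.9])

model = linear_model.LinearRegression().fit(X, y)
score = rmse_scorer(model, X, y)
print(f"Negated RMSE from the scorer: {score:.4f}")
```

Because scorers follow a "higher is better" convention, error metrics come back negated; this is why the built-in scoring strings for errors start with `neg_`.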

# Creating a Test Dataset

Before we look at the data any further, we need to create a test set, put it aside, and never look at it (until we are ready to test our trained machine learning model!). Why? We don't want our machine learning model to memorize our dataset (this is called overfitting). Instead we want a model that will generalize well (i.e., make good predictions) for inputs that it didn't see during training. To do this we split our dataset into training and testing datasets. The training dataset will be used to train our machine learning model(s) and the testing dataset will be used to make a final evaluation of our machine learning model(s).

## If you might refresh data in the future...

...then you want to use a fixed hashing function to compute the hash of a unique identifier for each observation and include the observation in the test set if the resulting hash value is less than some fixed percentage of the maximum possible hash value for your algorithm. This way, even if you fetch more data, your test set will never include data that was previously included in the training data.

In [None]:
import zlib


def in_testing_data(identifier, test_size):
    # hash the text form of the identifier so that strings and integers both work
    _hash = zlib.crc32(str(identifier).encode("utf-8"))
    return _hash & 0xffffffff < test_size * 2**32


def split_train_test_by_id(data, test_size, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda identifier: in_testing_data(identifier, test_size))
    return data.loc[~in_test_set], data.loc[in_test_set]
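
To see why a hash-based split is stable when the data is refreshed, here is a minimal sketch. The helper name `is_test_id` and the `compound_id` values are hypothetical, invented for this example; the identifier is hashed via its UTF-8 text form.

```python
import zlib

import pandas as pd


def is_test_id(identifier, test_size):
    # CRC32 of the identifier's text form; the result is in [0, 2**32)
    checksum = zlib.crc32(str(identifier).encode("utf-8"))
    return checksum < test_size * 2**32


# hypothetical identifiers: the "new" data is the old data plus 100 extra rows
old = pd.DataFrame({"compound_id": [f"compound-{i}" for i in range(200)]})
new = pd.DataFrame({"compound_id": [f"compound-{i}" for i in range(300)]})

old_mask = old["compound_id"].map(lambda i: is_test_id(i, 0.2))
new_mask = new["compound_id"].map(lambda i: is_test_id(i, 0.2))

old_test = set(old.loc[old_mask, "compound_id"])
new_test = set(new.loc[new_mask, "compound_id"])

# every compound that was in the old test set is still in the new test set
print(old_test <= new_test)
```

The assignment depends only on the identifier, so adding rows never moves an existing observation between the training and testing sets.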


## If this is all the data you will ever have...

...then you can just set a seed for the random number generator and then randomly split the data. Scikit-Learn's `model_selection` module contains tools for splitting datasets into training and testing sets, including [`model_selection.train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

In [None]:
model_selection.train_test_split?

In [None]:
SEED = 42
SEED_GENERATOR = np.random.RandomState(SEED)


def generate_seed():
    return SEED_GENERATOR.randint(np.iinfo("uint16").max)

In [None]:
# split the dataset into training and testing data
_seed = generate_seed()
_random_state = np.random.RandomState(_seed)
train_features, test_features, train_target, test_target, train_esol_predictions, test_esol_predictions = model_selection.train_test_split(
    features,
    target,
    esol_predictions,
    test_size=1e-1,
    random_state=_random_state
)

In [None]:
train_features.info()

In [None]:
train_target.info()

In [None]:
train_esol_predictions.info()

If you want to, you can write the train and test sets to disk to avoid having to recreate them later.

In [None]:
_ = (train_features.join(train_target)
                   .join(train_esol_predictions)
                   .to_csv(DATA_DIR / "train.csv", index=False))

_ = (test_features.join(test_target)
                  .join(test_esol_predictions)
                  .to_csv(DATA_DIR / "test.csv", index=False))

# Prepare the data for machine learning algorithms

Best practice is to write functions to automate the process of preparing your data for machine learning. Why?

* Allows you to reproduce these transformations easily on any dataset.
* You will gradually build a library of transformation functions that you can reuse in future projects.
* You can use these functions in a live system to transform the new data before feeding it to your algorithms.
* This will make it possible for you to easily experiment with various transformations and see which combination of transformations works best.

We are working with a benchmark dataset that has already been prepared for analysis (mostly!). You should be aware that academic benchmark datasets are not very representative of the type of datasets that you will encounter in most practical applications.

## Feature Scaling

Machine learning algorithms typically don’t perform well when the input numerical attributes have very different scales. One of the most common approaches is to rescale features so that they all have zero mean and unit standard deviation. This approach, which is also called standardization, is less sensitive to outliers than min-max scaling and is particularly useful when downstream machine learning algorithms assume that attributes/features have a Gaussian (Normal) distribution. This approach is implemented in Scikit-Learn by the [`preprocessing.StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) class.
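As a small worked example of standardization, the sketch below rescales a made-up array by hand and compares the result with `preprocessing.StandardScaler`. The numbers are invented for illustration.

```python
import numpy as np
from sklearn import preprocessing

# made-up features on very different scales
X = np.array([[100.0, 0.1], [200.0, 0.2], [300.0, 0.6]])

# standardize by hand: subtract the column mean, divide by the column
# standard deviation (population std, ddof=0, as StandardScaler uses)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = preprocessing.StandardScaler().fit_transform(X)
print(np.allclose(manual, scaled))
```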

In [None]:
preprocessing.StandardScaler?

In [None]:
# hyper-parameters
_hyperparameters = {
    "copy": True,
    "with_mean": True,
    "with_std": True,
}
preprocessor = preprocessing.StandardScaler(**_hyperparameters)

In [None]:
_cardinal_labels = [
    "Molecular Weight",
    "Polar Surface Area",
]
_train_cardinal_features = train_features.loc[:, _cardinal_labels]
preprocessed_train_cardinal_features = preprocessor.fit_transform(_train_cardinal_features)

In [None]:
preprocessed_train_cardinal_features.shape

In [None]:
preprocessed_train_cardinal_features[:5, :]  # first five rows

In [None]:
preprocessed_train_cardinal_features.mean(axis=0)

In [None]:
preprocessed_train_cardinal_features.std(axis=0)

The `preprocessing.MinMaxScaler` and the `preprocessing.StandardScaler` classes are the first Scikit-Learn `Transformer` classes that we have encountered. As such, now is a good time to discuss the Scikit-Learn application programming interface (API). The [Scikit-Learn API](https://scikit-learn.org/stable/modules/classes.html) is one of the best designed APIs around and has heavily influenced API design choices of other libraries in the Python Data Science and Machine Learning ecosystem, in particular [Dask](https://dask.org/) and [NVIDIA RAPIDS](https://rapids.ai/). Familiarity with the Scikit-Learn API will make it easier for you to get started with these libraries.

The Scikit-Learn API is built around the following key concepts.

* Estimators: Any object that can estimate some parameters based on a dataset is called an estimator (e.g., a `preprocessing.MinMaxScaler` is an estimator). The estimation itself is performed by the `fit` method, and it takes only a dataset as a parameter (or two for supervised learning algorithms; the second dataset contains the labels). Any other parameter needed to guide the estimation process is considered a hyperparameter (such as the `feature_range` parameter in `preprocessing.MinMaxScaler`), and it must be set as an instance variable (generally via a constructor parameter).

* Transformers: Some estimators (such as a `preprocessing.MinMaxScaler`) can also transform a dataset; these are called transformers. Once again, the API is simple: the transformation is performed by the `transform` method with the dataset to transform as a parameter. It returns the transformed dataset. This transformation generally relies on the learned parameters. All transformers also have a convenience method called `fit_transform` that is equivalent to calling `fit` and then `transform` (but sometimes `fit_transform` is optimized and runs much faster).

* Predictors: Finally, some estimators, given a dataset, are capable of making predictions; they are called predictors. A predictor has a `predict` method that takes a dataset of new instances and returns a dataset of corresponding predictions. It also has a score method that measures the quality of the predictions, given a test set (and the corresponding labels, in the case of supervised learning algorithms).

All of an estimator’s hyperparameters are accessible directly via public instance variables (e.g., `preprocessor.with_mean`), and all the estimator’s learned parameters are accessible via public instance variables with an underscore suffix (e.g., `preprocessor.scale_`). Finally, Scikit-Learn provides reasonable default values for most parameters, which makes it easy to quickly create a baseline working system.
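The concepts above can be sketched on a tiny, made-up dataset. This is only an illustration of the `fit` / `transform` / `predict` / `score` conventions; the data and variable names are invented.

```python
import numpy as np
from sklearn import linear_model, preprocessing

# toy data, invented for illustration: y is exactly 2 * x
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# estimator + transformer: fit learns parameters, transform applies them
scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
scaler.fit(X)
X_scaled = scaler.transform(X)

print(scaler.feature_range)                # hyperparameter (set in the constructor)
print(scaler.data_min_, scaler.data_max_)  # learned parameters (underscore suffix)

# predictor: fit takes the features and the labels
regressor = linear_model.LinearRegression()
regressor.fit(X_scaled, y)
predictions = regressor.predict(X_scaled)
r2 = regressor.score(X_scaled, y)          # R^2 for regressors
```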

### Exercise: MinMaxScaler vs StandardScaler

An alternative to standard scaling is to rescale features so that they all reside within the same range (typically between 0 and 1).

Create an instance of the [`preprocessing.MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) class and use it to rescale the training dataset. Compare the two different rescaled versions of the dataset. Which of the two methods do you prefer?

In [None]:
# insert your code here!

As with all the transformations, it is important to fit the scalers to the training data only, not to the full dataset (including the test set). Only then can you use them to transform the training set and the test set (and new data).
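A minimal sketch of this rule, using invented numbers: the scaler learns its statistics from the training array only, and the test array is transformed with those same statistics.

```python
import numpy as np
from sklearn import preprocessing

# toy arrays, invented for illustration
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = preprocessing.StandardScaler()
train_scaled = scaler.fit_transform(train)  # statistics come from the training set only
test_scaled = scaler.transform(test)        # reuse them; do not call fit again

print(scaler.mean_, scaler.scale_)
print(test_scaled)
```

Note that the transformed test value falls outside the range of the transformed training values; that is expected, because the test set played no part in estimating the mean and standard deviation.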

## Feature Engineering

Feature engineering is one of the most important parts of any machine learning project. There are three main tasks in feature engineering.

* Feature selection: selecting the best subset of features for training. 
* Feature extraction: combining existing features to produce new features for training.
* Feature creation: finding additional data sources to use as features.

Feature engineering is often the most labor intensive part of building a machine learning pipeline and often requires extensive expertise/domain knowledge relevant to the problem at hand. Recently packages such as [featuretools](https://www.featuretools.com/) have been developed to (partially) automate the process of feature engineering.

The success of deep learning in various domains is in significant part due to the fact that deep learning models are able to automatically engineer features that are most useful for solving certain machine learning tasks. In effect deep learning replaces the expensive to acquire expertise/domain knowledge required to hand-engineer predictive features. 

A recent example that demonstrates the power of automated feature engineering is [Space2vec](https://medium.com/dessa-news/space-2-vec-fd900f5566), a deep learning based supernovae classifier developed by machine learning engineers with no expertise in astronomy that was able to outperform the machine learning solution developed by NERSC scientists. The machine learning pipeline developed by NERSC scientists, called [AUTOSCAN](https://portal.nersc.gov/project/dessn/autoscan/), was a significant improvement over the previous solution, which relied on manual classification of supernovae by astronomers. However, in order to achieve such high accuracy, the NERSC solution relied on a dataset of hand-engineered features developed by astronomers with over a century of combined training and expertise in the domain. The deep learning algorithm used by Space2vec could be applied directly to the raw image data and did not rely on any hand-engineered features.

In [None]:
deepchem.molnet.load_delaney?

In [None]:
dataset.to_dataframe()

## Transformation pipelines

As you can see creating preprocessing pipelines involves quite a lot of steps and each of the steps needs to be executed in the correct order. Fortunately Scikit-Learn allows you to combine estimators together to create [pipelines](https://scikit-learn.org/stable/modules/compose.html#combining-estimators). We can encapsulate all of the preprocessing logic into instances of the [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) class.

The `Pipeline` constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e., they must have a `fit_transform` method). The names can be anything you like (as long as they are unique). Later we will see how to access the parameters of pipelines using these names when we discuss hyperparameter tuning.
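To make the role of the step names concrete, here is a small sketch on toy data. The step names `scaler` and `regressor` and the data are chosen for this example only.

```python
import numpy as np
from sklearn import linear_model, pipeline, preprocessing

toy_pipeline = pipeline.Pipeline([
    ("scaler", preprocessing.StandardScaler()),       # a transformer
    ("regressor", linear_model.LinearRegression()),   # last step: any estimator
])

# toy data, invented for illustration
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
toy_pipeline.fit(X, y)

# fitted steps can be looked up by name
print(toy_pipeline.named_steps["scaler"].mean_)

# parameters are addressed as <step name>__<parameter name>
toy_pipeline.set_params(scaler__with_mean=False)
print(toy_pipeline.get_params()["scaler__with_mean"])
```

The same `<step name>__<parameter name>` convention is what a hyperparameter search uses to reach inside a pipeline.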

In [None]:
_seed = generate_seed()

_hyperparameters = {
    "copy": True,
    "with_mean": True,
    "with_std": True,
}

# default Pipeline constructor
cardinal_pipeline = pipeline.Pipeline(
    [
        ("standardscaler", preprocessing.StandardScaler(**_hyperparameters)),
    ],
    verbose=True,
)

_hyperparameters = {
    "feature_range": (0, 1),
    "copy": True,
    "clip": False,
}
ordinal_pipeline = pipeline.Pipeline(
    [
        ("minmaxscaler", preprocessing.MinMaxScaler(**_hyperparameters)),
    ],
    verbose=True,
)

### Custom transformers

Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom transformations, cleanup operations, or combining specific attributes.

For transformations that don’t require any training, you can just write a function that takes a NumPy array as input, and outputs the transformed array.
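For instance, a stateless log transform can be wrapped with `preprocessing.FunctionTransformer`. This is a generic sketch on made-up numbers and is separate from the SMILES featurizer used in this notebook.

```python
import numpy as np
from sklearn import preprocessing

# log1p needs no training, so a plain function is enough;
# expm1 is supplied as its inverse
log_transformer = preprocessing.FunctionTransformer(
    func=np.log1p,
    inverse_func=np.expm1,
)

X = np.array([[0.0, 9.0], [99.0, 999.0]])  # toy values
X_log = log_transformer.fit_transform(X)
X_back = log_transformer.inverse_transform(X_log)
print(X_log)
```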

In [None]:
_featurized_smiles = (dataset.to_dataframe()
                             .drop(['y', 'w', "ids"], axis=1))

def featurizer(object_features):
    ixs = object_features.index
    return _featurized_smiles.loc[ixs, :]

_hyperparameters = {
    "func": featurizer,
}

non_numeric_pipeline = pipeline.Pipeline(
    [
        ("functiontransformer", preprocessing.FunctionTransformer(**_hyperparameters)),
    ],
    verbose=True,
)

The code in the cell below creates the same pipelines as above but uses the `pipeline.make_pipeline` function, which automatically generates names for the different stages of the pipeline using the class names.

In [None]:
_seed = generate_seed()

_hyperparameters = {
    "copy": True,
    "with_mean": True,
    "with_std": True,
}

# alternative constructor that is equivalent to the above!
cardinal_pipeline = pipeline.make_pipeline(
    preprocessing.StandardScaler(**_hyperparameters),
    verbose=True,
)

_hyperparameters = {
    "feature_range": (0, 1),
    "copy": True,
    "clip": False,
}

ordinal_pipeline = pipeline.make_pipeline(
    preprocessing.MinMaxScaler(**_hyperparameters),
    verbose=True,
)

_featurized_smiles = (dataset.to_dataframe()
                             .drop(['y', 'w', "ids"], axis=1))

def featurizer(object_features):
    ixs = object_features.index
    return _featurized_smiles.loc[ixs, :]

_hyperparameters = {
    "func": featurizer,
}
non_numeric_pipeline = pipeline.make_pipeline(
    preprocessing.FunctionTransformer(**_hyperparameters),
    verbose=True,
)

So far, we have handled the non-numeric columns and the numeric columns separately. It would be more convenient to have a single transformer capable of handling all columns, applying the appropriate transformations to each column. For this, you can use a [`compose.ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html).
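Before applying this to the real data, a toy sketch may help show how columns are routed by dtype. The column names `weight` and `rings` and their values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import compose, preprocessing

toy = pd.DataFrame({
    "weight": np.array([100.0, 200.0, 300.0], dtype="float64"),
    "rings": np.array([0, 1, 2], dtype="int64"),
})

toy_transformer = compose.make_column_transformer(
    # float columns are standardized
    (preprocessing.StandardScaler(), compose.make_column_selector(dtype_include=np.float64)),
    # integer columns are rescaled to [0, 1]
    (preprocessing.MinMaxScaler(), compose.make_column_selector(dtype_include=np.int64)),
)

# outputs are concatenated column-wise in the order the transformers are listed
result = toy_transformer.fit_transform(toy)
print(result)
```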

In [None]:
preprocessing_pipeline = compose.make_column_transformer(
    (cardinal_pipeline, compose.make_column_selector(dtype_include=np.float64)),
    (ordinal_pipeline, compose.make_column_selector(dtype_include=np.int64)),
    (non_numeric_pipeline, compose.make_column_selector(dtype_include=object)),
)

In [None]:
preprocessed_train_features = preprocessing_pipeline.fit_transform(train_features)

In [None]:
preprocessed_train_features.shape

In [None]:
preprocessed_train_features[:5, :]

# Select and train a model

At last! You framed the problem, you got the data and explored it, you sampled a training set and a test set, and you wrote transformation pipelines to clean up and prepare your data for machine learning algorithms automatically. You are now ready to select and train a machine learning model. You might have been wondering if we were ever going to make it to this point! The fact is, most of your time developing machine learning solutions to real-world problems will not be spent training machine learning models: most of your time will be spent preparing the data for machine learning algorithms and most of the computer time will be spent training the machine learning models.

## Training and evaluating on the training dataset

In [None]:
_seed = generate_seed()
_hyperparameters = {
    "fit_intercept": True,
    "loss": "squared_error",
    "penalty": None,
    "random_state": np.random.RandomState(_seed),
}
estimator = linear_model.SGDRegressor(**_hyperparameters)
_ = estimator.fit(preprocessed_train_features, train_target)

In [None]:
predictions = estimator.predict(preprocessed_train_features)

In [None]:
predictions

Congrats! You have fit your first machine learning model using Scikit-Learn and made some predictions. Now let's see how good those predictions really are.

### Evaluation metrics

#### Mean squared error

In [None]:
mse = metrics.mean_squared_error(
    train_target,
    predictions,
)
print(f"Root mean squared error: {np.sqrt(mse)}")

In [None]:
mse = metrics.mean_squared_error(
    train_target,
    train_esol_predictions,
)
print(f"Root mean squared error: {np.sqrt(mse)}")

#### Mean absolute error

In [None]:
mae = metrics.mean_absolute_error(
    train_target,
    predictions,
)
print(f"Mean absolute error: {mae}")

In [None]:
mae = metrics.mean_absolute_error(
    train_target,
    train_esol_predictions,
)
print(f"Mean absolute error: {mae}")

#### Median absolute error

In [None]:
mae = metrics.median_absolute_error(
    train_target,
    predictions,
)
print(f"Median absolute error: {mae}")

In [None]:
mae = metrics.median_absolute_error(
    train_target,
    train_esol_predictions,
)
print(f"Median absolute error: {mae}")

### Exercise: experiment with different loss functions

In [None]:
linear_model.SGDRegressor?

In [None]:
_seed = generate_seed()
_hyperparameters = {
    "fit_intercept": True,
    "loss": "squared_error", # change this!
    "penalty": None,
    "random_state": np.random.RandomState(_seed),
}
_estimator = linear_model.SGDRegressor(**_hyperparameters)
_ = _estimator.fit(preprocessed_train_features, train_target)

# make predictions
_predictions = _estimator.predict(preprocessed_train_features)

mse = metrics.mean_squared_error(
    train_target,
    _predictions,
)
print(f"Root mean squared error: {np.sqrt(mse)}")

### Exercise: experiment with different penalties

In [None]:
linear_model.SGDRegressor?

In [None]:
_seed = generate_seed()
_hyperparameters = {
    "alpha": 1e-4, # try changing this!
    "fit_intercept": True,
    "l1_ratio": 0.15, # only used for penalty="elasticnet"
    "loss": "squared_error",
    "penalty": None, # try changing this!
    "random_state": np.random.RandomState(_seed),
}
_estimator = linear_model.SGDRegressor(**_hyperparameters)
_ = _estimator.fit(preprocessed_train_features, train_target)

# make predictions
_predictions = _estimator.predict(preprocessed_train_features)

mse = metrics.mean_squared_error(
    train_target,
    _predictions,
)
print(f"Root mean squared error: {np.sqrt(mse)}")

### Mini-batch gradient descent

Since we talked about the difference between stochastic, batch, and mini-batch gradient descent in the lectures, I wanted you to see how to implement mini-batch gradient descent in Scikit-Learn. You will see much more of this idea in the deep learning hands-on session, so we will not spend too much time on it now.
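The mini-batches in the next cell come from `utils.gen_batches`, which yields `slice` objects rather than index arrays. A tiny sketch with made-up sizes shows what that looks like:

```python
import numpy as np
from sklearn import utils

# 10 samples in batches of 4: the last batch is smaller
batches = list(utils.gen_batches(10, 4))
print(batches)  # [slice(0, 4, None), slice(4, 8, None), slice(8, 10, None)]

# each slice selects a contiguous block of rows
toy = np.arange(10)
batch_sizes = [len(toy[batch]) for batch in batches]
print(batch_sizes)  # [4, 4, 2]
```

Because the slices are contiguous, the data has to be shuffled beforehand if each epoch should see different batches.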

In [None]:
_seed = generate_seed()
_random_state = np.random.RandomState(_seed)

n_epochs = 100
batch_size = 128
X = preprocessed_train_features
y = train_target
m, _ = X.shape

# define your estimator
_hyperparameters = {
    "alpha": 1e-4,
    "fit_intercept": True,
    "l1_ratio": 0.15,
    "learning_rate": "invscaling",
    "loss": "squared_error",
    "penalty": None,
    "random_state": _random_state,
    "warm_start": True,
}
estimator = linear_model.SGDRegressor(**_hyperparameters)

# nested for loops implement the training
for _ in range(n_epochs):

    # shuffle the dataset before every training epoch
    shuffled_indices = _random_state.permutation(m)
    _X, _y = X[shuffled_indices], y.iloc[shuffled_indices]

    for batch_ixs in utils.gen_batches(m, batch_size):
        # use positional indexing for the shuffled target Series
        _ = estimator.partial_fit(_X[batch_ixs], _y.iloc[batch_ixs])


In [None]:
# make predictions
_predictions = estimator.predict(preprocessed_train_features)

# report the error on the training set
mse = metrics.mean_squared_error(
    train_target,
    _predictions,
)
print(f"Root mean squared error: {np.sqrt(mse)}")

## Estimating generalization error using cross-validation

So far we have only evaluated our model's performance on the training set. However, in order to assess whether we are overfitting or underfitting we need to estimate the generalization error.

The following code uses Scikit-Learn's [`model_selection.cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) to split the training set into distinct subsets called folds. It then trains and evaluates our model `CV_FOLDS` times, picking a different fold for evaluation every time and training on the other `CV_FOLDS - 1` folds. The result is an array containing the `CV_FOLDS` evaluation scores.
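Passing an integer as `cv` gives consecutive, unshuffled folds for a regressor. If you want the fold assignment to be random, one option is to pass an explicit `model_selection.KFold` with `shuffle=True`. The sketch below does this on synthetic data generated for illustration; it is not the solubility dataset.

```python
import numpy as np
from sklearn import linear_model, model_selection

# synthetic regression problem, invented for illustration
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# shuffle=True randomizes which samples land in which fold
cv = model_selection.KFold(n_splits=5, shuffle=True, random_state=0)

scores = model_selection.cross_val_score(
    linear_model.LinearRegression(),
    X,
    y,
    cv=cv,
    scoring="neg_root_mean_squared_error",
)

# one negated RMSE per fold; flip the sign to report an error
print(-scores.mean())
```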

In [None]:
CV_FOLDS = 5

estimator_scores = model_selection.cross_val_score(
    estimator,
    X=preprocessed_train_features,
    y=train_target,
    cv=CV_FOLDS,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
    verbose=1
)

In [None]:
valid_error = -estimator_scores.mean()
print(f"Estimated generalization error: {valid_error}")

In [None]:
# report the error on the training set
mse = metrics.mean_squared_error(
    train_target,
    train_esol_predictions,
)
print(f"Root mean squared error: {np.sqrt(mse)}")

What is going on here? Are we underfitting? Are we overfitting? If you think that we are underfitting, then what could we do to try and get the model to overfit? If we are overfitting, what could we do to get the model to underfit?

### Learning Curves

In [None]:
model_selection.learning_curve?

In [None]:
metrics.get_scorer_names()

In [None]:
_seed = generate_seed()
_random_state = np.random.RandomState(_seed)
_hyperparameters = {
    "fit_intercept": True,
    "loss": "squared_error", # change this!
    "penalty": None,
    "random_state": np.random.RandomState(_seed),
}
_estimator = linear_model.SGDRegressor(**_hyperparameters)
_ = _estimator.fit(preprocessed_train_features, train_target)

train_sizes, train_scores, val_scores = model_selection.learning_curve(
    _estimator,
    preprocessed_train_features,
    train_target,
    cv=CV_FOLDS,
    n_jobs=-1,
    random_state=_random_state,
    scoring="neg_root_mean_squared_error",
    train_sizes=np.linspace(0.1, 1.0, 15),
    verbose=1,
)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(15, 10))
ax.plot(train_sizes, -train_scores.mean(axis=1), "r-+", linewidth=2, label="train")
ax.plot(train_sizes, -val_scores.mean(axis=1), "b-", linewidth=3, label="valid")
ax.set_xlabel("Training set size", fontsize=15)
ax.set_ylabel("RMSE", fontsize=15)
ax.grid()
ax.legend()

plt.show()

# Fine-tune your models

The most common approach to tuning a model is to manually fiddle with the hyperparameters until you find a great combination of hyperparameter values. Needless to say, this approach to model tuning is very tedious and not at all scientific. We can do much better!

## Grid Search

The simplest approach is to use Scikit-Learn’s [`model_selection.GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). All you need to do is tell it which hyperparameters you want it to experiment with and what values to try out. The `model_selection.GridSearchCV` class will then use cross-validation to evaluate all the possible combinations of hyperparameter values and return the best scoring set of hyperparameters according to your specified metric.
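To check how many combinations a grid expands to before launching a search, you can enumerate it with `model_selection.ParameterGrid`. The sketch below uses a grid with the same shape as the one in this section, with the `sgdregressor__` prefixes dropped for brevity.

```python
from sklearn import model_selection

toy_grid = [
    {"penalty": [None], "alpha": [0]},
    {"penalty": ["l1"], "alpha": [1e-1, 1e0, 1e1]},
    {"penalty": ["l2"], "alpha": [1e-1, 1e0, 1e1]},
    {"penalty": ["elasticnet"], "alpha": [1e-1, 1e0, 1e1], "l1_ratio": [0.1, 0.5, 0.9]},
]

# each dictionary is expanded separately, then the results are concatenated
combinations = list(model_selection.ParameterGrid(toy_grid))
print(len(combinations))  # 1 + 3 + 3 + 9 = 16
```

With 5-fold cross-validation, each of these combinations is fitted five times, plus one final refit of the best combination on the full training set when `refit=True` (the default).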

In [None]:
_seed = generate_seed()

_hyperparameters = {
    "copy": True,
    "with_mean": True,
    "with_std": True,
}

# same preprocessing pipelines as before, rebuilt here for the grid search
cardinal_pipeline = pipeline.make_pipeline(
    preprocessing.StandardScaler(**_hyperparameters),
    verbose=True,
)

_hyperparameters = {
    "feature_range": (0, 1),
    "copy": True,
    "clip": False,
}

ordinal_pipeline = pipeline.make_pipeline(
    preprocessing.MinMaxScaler(**_hyperparameters),
    verbose=True,
)

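# precomputed molecular features, indexed like the original dataset
_featurized_smiles = (dataset.to_dataframe()
                             .drop(["y", "w", "ids"], axis=1))


def featurizer(object_features):
    """Look up the precomputed features for the rows being transformed."""
    ixs = object_features.index
    return _featurized_smiles.loc[ixs, :]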

_hyperparameters = {
    "func": featurizer,
}
non_numeric_pipeline = pipeline.make_pipeline(
    preprocessing.FunctionTransformer(**_hyperparameters),
    verbose=True,
)

_preprocessing_pipeline = compose.make_column_transformer(
    (cardinal_pipeline, compose.make_column_selector(dtype_include=np.float64)),
    (ordinal_pipeline, compose.make_column_selector(dtype_include=np.int64)),
    (non_numeric_pipeline, compose.make_column_selector(dtype_include=object)),
)

_seed = generate_seed()
_random_state = np.random.RandomState(_seed)
_hyperparameters = {
    "fit_intercept": True,
    "loss": "squared_error",
    "max_iter": 1000,
    "random_state": np.random.RandomState(_seed),
}
_pipeline = pipeline.make_pipeline(
    _preprocessing_pipeline,
    linear_model.SGDRegressor(**_hyperparameters),
    verbose=True,
)

_parameter_grid = [
    {
        "sgdregressor__penalty": [None],
        "sgdregressor__alpha": [0],
    }, # 1 * 1 = 1 parameter combination to try
    {
        "sgdregressor__penalty": ["l1"],
        "sgdregressor__alpha": [1e-1, 1e0, 1e1],
    }, # 1 * 3 = 3 parameter combinations to try
    {
        "sgdregressor__penalty": ["l2"],
        "sgdregressor__alpha": [1e-1, 1e0, 1e1],
    }, # 1 * 3 = 3 parameter combinations to try
    {
        "sgdregressor__penalty": ["elasticnet"],
        "sgdregressor__alpha": [1e-1, 1e0, 1e1],
        "sgdregressor__l1_ratio": [0.1, 0.5, 0.9]
    }, # 1 * 3 * 3 = 9 parameter combinations to try
] # 1 + 3 + 3 + 9 = 16 total parameter combinations to try

estimator = model_selection.GridSearchCV(
    _pipeline,
    _parameter_grid,
    cv=CV_FOLDS, # 5 * 16 = 80 total fits!
    scoring="neg_root_mean_squared_error",
    return_train_score=True,
    n_jobs=-1,
    pre_dispatch=2, # important to set this properly to avoid OOM errors
    verbose=1,
)

In [None]:
_ = estimator.fit(train_features, train_target)

In [None]:
estimator.best_score_

In [None]:
estimator.best_params_

In [None]:
estimator.best_estimator_

You should save every model you experiment with so that you can come back easily to any model you want. Make sure you save both the hyperparameters and the trained parameters as well as the cross-validation scores and perhaps the actual predictions as well. This will allow you to more easily compare scores across model types and compare the types of errors they make.

In [None]:
RESULTS_DIR = pathlib.Path("../results")
RESULTS_DIR.mkdir(parents=True, exist_ok=True)

timestamp = time.strftime("%Y%m%d-%H%M%S")
_ = joblib.dump(estimator, RESULTS_DIR / f"grid-search-cv-regressor-{timestamp}.pkl")

For reference, here is how you would reload the trained model from the file.

In [None]:
reloaded_estimator = joblib.load(RESULTS_DIR / f"grid-search-cv-regressor-{timestamp}.pkl")

In [None]:
reloaded_estimator.best_params_
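
Besides pickling the whole search object, it can be handy to keep the cross-validation scores in a plain table that is easy to compare across experiments. The following is a minimal sketch on synthetic data; the `Ridge` model, the grid, the file name `cv-results.csv`, and the temporary output directory are all illustrative assumptions rather than part of this tutorial's pipeline.

```python
import pathlib
import tempfile

import numpy as np
import pandas as pd
from sklearn import linear_model, model_selection

# Synthetic regression data standing in for the real features and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

search = model_selection.GridSearchCV(
    linear_model.Ridge(),
    {"alpha": [0.01, 0.1, 1.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,
)
search.fit(X, y)

# cv_results_ is a dict of arrays; a DataFrame makes it easy to sort and save.
cv_results = (
    pd.DataFrame(search.cv_results_)
    .sort_values("rank_test_score")
    .loc[:, ["params", "mean_train_score", "mean_test_score", "std_test_score"]]
)

# Written to a temporary directory here; a results directory works the same way.
output_dir = pathlib.Path(tempfile.mkdtemp())
cv_results.to_csv(output_dir / "cv-results.csv", index=False)
print(cv_results)
```

Because the scoring is a negated RMSE, larger (closer to zero) values of `mean_test_score` are better, and the first row after sorting corresponds to `best_params_`.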

### Exercise:

Fine-tune one of your models using Grid Search.

In [None]:
# insert your code here!

# Evaluate your models on the test dataset

After tweaking your models for a while, you eventually have a system that performs sufficiently well. Now is the time to evaluate the final model on the test set.

In [None]:
# make predictions
predictions = estimator.predict(test_features)

# report the error on the test set
# (assumes test_target holds the held-out targets matching test_features)
mse = metrics.mean_squared_error(
    test_target,
    predictions,
)
print(f"Root mean squared error: {np.sqrt(mse)}")

In [None]:
# report the error on the training set
mse = metrics.mean_squared_error(
    train_target,
    train_esol_predictions,
)
print(f"Root mean squared error: {np.sqrt(mse)}")

If you did a lot of hyperparameter tuning, the performance will usually be slightly worse than what you measured using cross-validation (because your system ends up fine-tuned to perform well on the validation data and will likely not perform as well on unknown datasets). It is not the case in this example, but when this happens you must resist the temptation to tweak the hyperparameters to make the numbers look good on the test set; the improvements would be unlikely to generalize to new data.
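
A minimal sketch of that comparison, on synthetic data and with an assumed `Ridge` model and grid (none of which come from this tutorial's pipeline), looks like this: tune on the training split only, then compute the test error exactly once.

```python
import numpy as np
from sklearn import linear_model, metrics, model_selection

# Synthetic regression data standing in for the real dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters are chosen using cross-validation on the training split only.
search = model_selection.GridSearchCV(
    linear_model.Ridge(),
    {"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

# best_score_ is a negated RMSE, so flip the sign before comparing.
cv_rmse = -search.best_score_
test_rmse = np.sqrt(metrics.mean_squared_error(y_test, search.predict(X_test)))

print(f"cross-validation RMSE: {cv_rmse:.3f}")
print(f"test RMSE: {test_rmse:.3f}")
```

The test split never enters `search.fit`, so the second number is an estimate of generalization error that the tuning process could not have adapted to.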