# ML4DD Summer School Hackathon

The final days of the Machine Learning for Drug Discovery (ML4DD) summer school end with a hackathon. We will use Polaris to access the associated benchmarks and datasets. First things first, we will install Polaris from PyPI.

We next need to authenticate ourselves to Polaris. If you haven't done so yet, you can create an account at https://polarishub.io. Afterwards, you can simply run the command below.

In [None]:
# @title Set an owner

owner = 'cwognum' # @param {type:"string"}

print(f"You have set \"{owner}\" as the owner")

# Solubility Benchmark

The first benchmark we will use is `polaris/adme-fang-solu-1`. The associated page for this benchmark on the Polaris Hub can be found at https://polarishub.io/benchmarks/polaris/adme-fang-solu-1.

In [None]:
import polaris as po
import datamol as dm
import numpy as np

In [None]:
benchmark = po.load_benchmark("polaris/adme-fang-solu-1")

We will use Datamol's `dm.to_fp` to directly featurize the inputs.

In [None]:
train, test = benchmark.get_train_test_split(featurization_fn=dm.to_fp)
train[0]

We will train a simple Random Forest model from scikit-learn.

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(max_depth=5)
model.fit(train.X, train.y)

Using that model, we can then generate our predictions for the test set.

In [None]:
y_pred = model.predict(test.X)
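
Before evaluating, it can help to check that the predictions have the expected form. The sketch below assumes that a single-task regression benchmark expects a 1-D array with one finite value per test molecule. `check_predictions` is a hypothetical helper written for this illustration, not part of Polaris, and the toy values stand in for `model.predict(test.X)`.

```python
import numpy as np


def check_predictions(y_pred, n_expected):
    """Hypothetical helper: validate predictions before evaluating them."""
    y_pred = np.asarray(y_pred, dtype=float)
    if y_pred.ndim != 1:
        raise ValueError(f"expected a 1-D array, got shape {y_pred.shape}")
    if y_pred.shape[0] != n_expected:
        raise ValueError(f"expected {n_expected} predictions, got {y_pred.shape[0]}")
    if not np.all(np.isfinite(y_pred)):
        raise ValueError("predictions contain NaN or infinite values")
    return y_pred


# Toy stand-in for `model.predict(test.X)` on a three-molecule test set.
toy_pred = check_predictions([0.4, 1.2, -0.3], n_expected=3)
print(toy_pred.shape)
```

In the notebook itself, the equivalent call would presumably be `check_predictions(y_pred, n_expected=len(test))`, assuming `len(test)` returns the number of test molecules.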

And finally, we evaluate our predictions.

In [None]:
results = benchmark.evaluate(y_pred)
results

There are multiple metadata fields we can fill in to provide additional information about these results.

In [None]:
results.name = "my-first-result"
results.description = "ECFP fingerprints with a Random Forest"

And finally, we can upload our results to the Hub! The results will be private.

In [None]:
results.upload_to_hub(owner=owner);

# Kinase Selectivity

The second benchmark we will use is `polaris/pkis1-kit-wt-mut-c-1`. Using this benchmark is very similar to before, with one difference: this is a multi-task benchmark.

In [None]:
benchmark = po.load_benchmark("polaris/pkis1-kit-wt-mut-c-1")
train, test = benchmark.get_train_test_split(featurization_fn=dm.to_fp)
train[0]

As we can see, the targets are now returned to us as a dictionary. Let's train a multi-task model on this data! We first preprocess the data to be in a format we can use with scikit-learn.

In [None]:
ys = train.y
ys = np.stack([ys[target] for target in benchmark.target_cols], axis=1)
ys.shape

Now that we're working with a multi-task dataset, it's also possible for these arrays to be sparse, meaning that some molecules lack a readout (NaN) for some targets. Let's filter out any data points that don't have readouts for _all_ targets.

In [None]:
mask = ~np.any(np.isnan(ys), axis=1)
mask.sum()
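
To make the masking step concrete, here is a minimal sketch on a toy array. The values are made up for illustration. Rows are molecules, columns are targets, and NaN marks a missing readout.

```python
import numpy as np

# Toy readouts: 4 molecules (rows) x 3 targets (columns); NaN = not measured.
toy_ys = np.array([
    [1.0, 0.0, 1.0],
    [np.nan, 1.0, 0.0],
    [0.0, 0.0, np.nan],
    [1.0, 1.0, 0.0],
])

# True for rows where every target has a readout.
toy_mask = ~np.any(np.isnan(toy_ys), axis=1)

print(toy_mask)        # [ True False False  True]
print(toy_mask.sum())  # 2
```

Only the first and last molecules are kept, because each of the other two is missing at least one readout.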

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(max_depth=5)
model.fit(train.X[mask], ys[mask])

In [None]:
y_pred = model.predict(test.X)
y_pred.shape

In addition to `y_pred`, we also need to specify `y_prob` as this benchmark uses the AUROC measure.

In [None]:
y_prob = model.predict_proba(test.X)
y_prob = np.stack(y_prob, axis=1)
y_prob.shape
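
AUROC is a ranking metric, which is why probabilities are needed on top of hard labels. The sketch below uses a small hand-written pairwise AUROC on made-up data. `pairwise_auroc` is a hypothetical helper for illustration only and is not how Polaris computes the metric.

```python
import numpy as np


def pairwise_auroc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


toy_true = np.array([0, 0, 1, 1])
toy_prob = np.array([0.1, 0.6, 0.7, 0.9])   # predicted probability of class 1
toy_label = (toy_prob >= 0.5).astype(int)   # hard labels at a 0.5 threshold

auroc_prob = pairwise_auroc(toy_true, toy_prob)
auroc_label = pairwise_auroc(toy_true, toy_label)
print(auroc_prob, auroc_label)  # 1.0 0.75
```

The probabilities rank every positive above every negative, but thresholding turns part of that ranking into ties, so the same model scores lower when only hard labels are available.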

Polaris expects a dictionary, so let's convert our results again.

In [None]:
y_pred = {k: y_pred[:, idx] for idx, k in enumerate(benchmark.target_cols)}
y_prob = {k: y_prob[:, idx, 1] for idx, k in enumerate(benchmark.target_cols)}
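
The indexing `y_prob[:, idx, 1]` is easier to follow on toy data. The sketch below imitates what a multi-output scikit-learn classifier returns from `predict_proba`: a list with one `(n_samples, n_classes)` array per target. The target names and numbers are hypothetical. It also assumes that both classes occur for every target, so that column 1 holds the probability of the positive class.

```python
import numpy as np

# Hypothetical target names, standing in for `benchmark.target_cols`.
toy_targets = ["target_a", "target_b"]

# Stand-in for `model.predict_proba(...)` of a multi-output classifier:
# one (n_samples, n_classes) array per target. Here: 3 samples, 2 classes.
toy_proba_list = [
    np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]),
    np.array([[0.3, 0.7], [0.5, 0.5], [0.8, 0.2]]),
]

# Stacking along axis 1 gives samples x targets x classes.
toy_prob = np.stack(toy_proba_list, axis=1)
print(toy_prob.shape)  # (3, 2, 2)

# For each target, keep the probability of class 1 for every sample.
toy_prob_dict = {k: toy_prob[:, idx, 1] for idx, k in enumerate(toy_targets)}
print(toy_prob_dict["target_a"])  # [0.1 0.8 0.4]
```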

And let's evaluate our predictions!

In [None]:
benchmark.evaluate(y_pred=y_pred, y_prob=y_prob)

Although this works, we're not required to train a multi-task model. Polaris doesn't impose any restrictions on the methodology. You could, for example, also train multiple single-task models.

In [None]:
from sklearn.ensemble import RandomForestClassifier

models = {target: RandomForestClassifier(max_depth=5) for target in benchmark.target_cols}
X = train.X

for target, model in models.items():
    # Fit each model only on the molecules that have a readout for its target
    y = train.y[target]
    mask = ~np.isnan(y)
    model.fit(X[mask], y[mask])

y_prob = {target: model.predict_proba(test.X)[:, 1] for target, model in models.items()}
y_pred = {target: model.predict(test.X) for target, model in models.items()}

results = benchmark.evaluate(y_pred=y_pred, y_prob=y_prob)
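
One practical consequence of the single-task setup is that each model can use every molecule with a readout for its own target, not only the molecules measured on all targets. A minimal sketch with made-up values and hypothetical target names:

```python
import numpy as np

# Toy per-target readouts for 4 molecules; NaN = not measured.
toy_y = {
    "target_a": np.array([1.0, np.nan, 0.0, 1.0]),
    "target_b": np.array([0.0, 1.0, np.nan, 1.0]),
}

# Molecules usable by each single-task model.
per_target = {name: int((~np.isnan(y)).sum()) for name, y in toy_y.items()}

# Molecules usable by a multi-task model that needs all readouts.
stacked = np.stack(list(toy_y.values()), axis=1)
complete = int((~np.isnan(stacked).any(axis=1)).sum())

print(per_target)  # {'target_a': 3, 'target_b': 3}
print(complete)    # 2
```

Whether the extra training data outweighs the benefit of sharing information across tasks depends on the dataset, so it is worth trying both approaches.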

Finally, let's upload our results to the Hub again!

In [None]:
results.name = "my-second-result"
results.description = "ECFP fingerprints with a Random Forest"

In [None]:
results.upload_to_hub(owner=owner);

The End.