This notebook downloads a docking dataset from the open database, selects a few molecular descriptors as features, and benchmarks a random forest regressor on predicting docking scores.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt

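# Load the dataset; this URL is a placeholder, so replace it with the actual download link
data_url = 'https://lsd.docking.org/data.csv'
data = pd.read_csv(data_url)

# Preprocessing: keep the descriptor and target columns, and drop rows with missing values
feature_cols = ['molecular_weight', 'logP', 'num_rotatable_bonds']
data = data.dropna(subset=feature_cols + ['docking_score'])
features = data[feature_cols]
target = data['docking_score']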

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Train a RandomForest model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
predictions = model.predict(X_test)
rmse = np.sqrt(np.mean((predictions - y_test) ** 2))
print('RMSE:', rmse)

# Plot the predictions vs true values
plt.figure(figsize=(8,6))
plt.scatter(y_test, predictions, alpha=0.5, color='#6A0C76')
plt.xlabel('True Docking Scores')
plt.ylabel('Predicted Docking Scores')
plt.title('ML Model Performance on Docking Scores')
plt.show()

The cell above trains the model on an 80/20 train/test split, reports the RMSE on the held-out set, and plots predicted against true docking scores. Predicting docking scores accurately matters because it lets a model prioritize compounds in large virtual screens.
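Since the stated goal is prioritizing compounds, a ranking-oriented check can complement RMSE. The sketch below is one possible way to measure how many of the truly best-scoring compounds also land in the model's predicted top set. The function name `top_fraction_recall`, the 10% cutoff, and the synthetic scores are illustrative assumptions, not part of the original notebook. It also assumes that lower (more negative) docking scores are better, which should be checked against the scoring function used in the dataset.

```python
import numpy as np

def top_fraction_recall(y_true, y_pred, fraction=0.1):
    """Share of the truly best-scoring compounds that also rank in the predicted top set.

    Assumes lower (more negative) docking scores are better.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    k = max(1, int(round(fraction * len(y_true))))
    true_top = set(np.argsort(y_true)[:k].tolist())  # indices of the k best true scores
    pred_top = set(np.argsort(y_pred)[:k].tolist())  # indices of the k best predicted scores
    return len(true_top & pred_top) / k

# Synthetic scores for illustration only (not real docking data)
rng = np.random.default_rng(0)
true_scores = rng.normal(-30.0, 8.0, size=1000)
noisy_pred = true_scores + rng.normal(0.0, 4.0, size=1000)
print('Top-10% recall:', top_fraction_recall(true_scores, noisy_pred, fraction=0.1))
```

In the notebook itself, the same function could be called as `top_fraction_recall(y_test, predictions)` after the evaluation cell.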

In [None]:
from sklearn.metrics import r2_score

r2 = r2_score(y_test, predictions)
print('R2 Score:', r2)

# Additional analysis: feature importance
importances = model.feature_importances_
feature_names = features.columns

# Create a simple plot for feature importance
plt.figure(figsize=(8,6))
plt.bar(feature_names, importances, color='#6A0C76')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Feature Importance in Docking Score Prediction')
plt.show()

The feature importance plot shows which molecular descriptors the random forest relies on most when predicting docking scores, which can guide the choice of descriptors in future models.
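The `feature_importances_` attribute of a random forest is impurity-based and computed on the training data, so it can be worth cross-checking against permutation importance on held-out data. The following is a minimal sketch using scikit-learn's `permutation_importance`. The synthetic features `x0`, `x1`, `x2` and their target are made up for illustration and stand in for the notebook's descriptors.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: the target depends strongly on x0,
# weakly on x1, and not at all on x2
rng = np.random.default_rng(0)
n = 500
X_demo = rng.normal(size=(n, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature in the held-out set and record the drop in the model's score (R^2)
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
for name, mean, std in zip(['x0', 'x1', 'x2'], result.importances_mean, result.importances_std):
    print(f'{name}: {mean:.3f} +/- {std:.3f}')
```

With the notebook's own variables, the equivalent call would be `permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)`.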






### [Created with BioloGPT](https://biologpt.com/?q=Paper%20Review%3A%20A%20database%20for%20large-scale%20docking%20and%20experimental%20results)
***