# Homework Assignment: Exploring the eSOLV Dataset with Machine Learning

## Introduction

In this homework project, you will explore the eSOLV dataset in detail using machine learning (ML) techniques. The project can be done directly in this Jupyter notebook.

## Dataset Overview

The eSOLV dataset is part of the MoleculeNet benchmark [1], where it is listed under the name ESOL (Delaney). It is a compilation of molecules extracted from an initial publication [2]. The dataset contains 1128 molecules, each associated with its water solubility.

References

[1] Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., & Pande, V. (2017). MoleculeNet: A benchmark for molecular machine learning. arXiv preprint arXiv:1703.00564.

[2] Delaney, J. S. (2004). ESOL: Estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences, 44(3), 1000-1005.

## Assignment Objective

The primary goal of this project is to implement a model that predicts the water solubility of the molecules in the eSOLV dataset.

## Approach and Methodology

The approach to this project should involve these key stages:

* **Data Exploration and Preprocessing**: Present the dataset, identify its characteristics, and perform any necessary preprocessing steps (cleaning, normalizing, and feature engineering).

* **Model Selection and Implementation**: Select appropriate ML models based on the dataset's features and the problem statement. The models to explore include the ones presented during the 4 lessons. Use these models and tune their hyperparameters for optimal performance.

* **Evaluation and Analysis**: Evaluate each model using the metric of your choice. Analyze the results to understand each model's strengths and limitations. For your information, the website [https://moleculenet.org/] provides some reference scores.



<div>
<img src="result_r2.png" height="500"/>
<img src="result_rmse.png" height="500"/>

</div>


* **Conclusion and Insights**: Finally, summarize your findings, discuss any insights gained, and reflect on how your approach addresses the problem statement.
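
The model selection, tuning, and evaluation stages listed above could look like the following minimal sketch. It runs on synthetic stand-in data, not on eSOLV, and the two models (`Ridge`, `RandomForestRegressor`), the grid values, and the 5-fold cross-validation are illustrative assumptions rather than the models from the lessons. Replace `X` and `y` with your own features and targets.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real features and targets (assumption, not eSOLV data).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 32)).astype(float)
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=200)

# Hold out a test set that is never used during hyperparameter tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate models with small, illustrative hyperparameter grids.
candidates = {
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "random_forest": (
        RandomForestRegressor(n_estimators=50, random_state=0),
        {"max_depth": [3, None]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # Cross-validated grid search on the training set only.
    search = GridSearchCV(model, grid, cv=5, scoring="neg_root_mean_squared_error")
    search.fit(X_train, y_train)
    # The refitted best model is then scored once on the held-out test set.
    y_pred = search.predict(X_test)
    rmse = float(np.sqrt(np.mean((y_test - y_pred) ** 2)))
    results[name] = {"best_params": search.best_params_, "test_rmse": rmse}
    print(name, results[name])
```

Keeping the test set out of the grid search avoids reporting a score that was indirectly optimized during tuning.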

## Assignment Instructions

As you work through this assignment, ensure that you:

* Thoroughly document each step in the Jupyter notebook.
* Include visualizations to aid in data exploration and results interpretation.
* Provide explanations for your choices in preprocessing, model selection, and evaluation methods.
* Reflect on the results and discuss potential improvements or alternative approaches.

You may compute additional descriptors or bring in more data. If you do, explain what you use in detail and justify why it is useful. Any copy-pasted data without explanation will be penalized.

When your project is finished, upload your .ipynb file, together with all necessary files, to universitice: []
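
As an illustration of what additional descriptors could look like, here is a minimal sketch that computes a few RDKit descriptors from SMILES strings. The helper name `compute_descriptors`, the choice of descriptors, and the three example molecules are assumptions made for this example. Molecular weight, logP, and rotatable bonds are plausible candidates because similar quantities are used in the original ESOL model [2], but you still need to justify your own choices.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

# Names of the descriptors, in the order returned by compute_descriptors.
DESCRIPTOR_NAMES = [
    "MolWt", "MolLogP", "NumHDonors", "NumHAcceptors", "TPSA", "NumRotatableBonds",
]


def compute_descriptors(smiles):
    """Return a list of simple RDKit descriptors for one SMILES string.

    Hypothetical helper written for this example; it is not part of the notebook.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return [
        Descriptors.MolWt(mol),              # molecular weight
        Descriptors.MolLogP(mol),            # Crippen estimate of logP
        Descriptors.NumHDonors(mol),         # hydrogen-bond donors
        Descriptors.NumHAcceptors(mol),      # hydrogen-bond acceptors
        Descriptors.TPSA(mol),               # topological polar surface area
        Descriptors.NumRotatableBonds(mol),  # rotatable bonds
    ]


# Example molecules (ethanol, benzene, acetic acid); with the real data,
# iterate over data["smiles"] instead.
example_smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
features = np.array([compute_descriptors(s) for s in example_smiles], dtype=float)
print(DESCRIPTOR_NAMES)
print(features)
```

The resulting array can be used on its own or concatenated with other features, for example with `np.hstack`.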




# Dataset loading

To work with the eSOLV dataset, the function `load_esolv` loads it in several formats:
* molecular fingerprints (ECFP): `data['X']`
* target values: `data['y']`
* data for PyTorch and GNNs: `data['torch_data']`
* SMILES representation: `data['smiles']`
* NetworkX graphs: `data['graphs']`

In [2]:
import deepchem as dc
from rdkit import Chem
import networkx as nx
from torch_geometric.data import Data
import torch


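def smiles_to_nx_with_features(smiles):
    """Convert a SMILES string into a NetworkX graph with atom and bond attributes."""
    mol = Chem.MolFromSmiles(smiles)
    G = nx.Graph()

    # One node per atom, carrying basic atomic features.
    for atom in mol.GetAtoms():
        G.add_node(atom.GetIdx(),
                   atomic_num=atom.GetAtomicNum(),
                   formal_charge=atom.GetFormalCharge(),
                   num_explicit_hs=atom.GetNumExplicitHs(),
                   num_implicit_hs=atom.GetNumImplicitHs(),
                   degree=atom.GetDegree(),
                   total_degree=atom.GetTotalDegree(),
                   is_aromatic=atom.GetIsAromatic())

    # One edge per bond, labelled with the RDKit bond type.
    for bond in mol.GetBonds():
        G.add_edge(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond_type=bond.GetBondType())

    return G

def graph_to_pyg_data(graph):
    """Convert a NetworkX molecular graph into a torch_geometric Data object."""
    # Shape (2, num_edges); each bond appears once (one direction only).
    # The reshape keeps the tensor valid for molecules without any bond.
    edge_index = torch.tensor(list(graph.edges), dtype=torch.long).reshape(-1, 2).t().contiguous()
    # Node feature matrix: one row per atom, in the attribute order defined above.
    x = torch.tensor([list(attrs.values()) for _, attrs in graph.nodes(data=True)], dtype=torch.float)
    return Data(x=x, edge_index=edge_index)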

Skipped loading some Tensorflow models, missing a dependency. No module named 'tensorflow'
Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'pytorch_lightning'
Skipped loading some Jax models, missing a dependency. No module named 'jax'


In [3]:



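def load_esolv():
    """Load the dataset as fingerprints, targets, PyG data, SMILES, and NetworkX graphs."""
    # With splitter='random', DeepChem returns (train, valid, test) splits;
    # datasets[0] is the training split only, not the full set of molecules.
    _, datasets, _ = dc.molnet.load_delaney(featurizer='ecfp', splitter='random')
    X = datasets[0].X  # ECFP fingerprints
    y = datasets[0].y  # targets, normalized by DeepChem's default transformer
    X_smiles = datasets[0].ids  # SMILES strings
    pyg_datasets = []
    nx_graphs = []
    for smiles in X_smiles:
        graph = smiles_to_nx_with_features(smiles)  # networkx
        nx_graphs.append(graph)
        pyg_data = graph_to_pyg_data(graph)  # torch_geometric
        pyg_datasets.append(pyg_data)

    return {"X": X, "y": y, "torch_data": pyg_datasets, "smiles": X_smiles, "graphs": nx_graphs}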
    

In [4]:
data = load_esolv()

## Example
Here is a simple example to check that everything works.


In [7]:
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X = data["X"]
y = data["y"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

# Baseline that always predicts the mean of the training targets.
dummy = DummyRegressor()
dummy.fit(X_train, y_train)
y_pred = dummy.predict(X_test)
# Results are poor, as expected from a dummy model.
# RMSE is computed with np.sqrt because the `squared` argument is not
# available in recent scikit-learn releases.
print("Dummy RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("Dummy R2: ", r2_score(y_test, y_pred))



Dummy RMSE:  0.9964987719542534
Dummy R2:  -0.0016202535650102767
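
The scores above are roughly what a mean predictor should give. The following sketch reproduces the two metrics by hand on synthetic targets, assuming the targets are standardized (zero mean, unit variance), which the RMSE close to 1 suggests. The random arrays are stand-ins, not eSOLV values.

```python
import numpy as np

# Synthetic standardized targets (assumption: stand-ins for y_train / y_test).
rng = np.random.default_rng(0)
y_train = rng.normal(size=900)
y_test = rng.normal(size=100)

# A mean baseline predicts the training mean for every test sample.
y_pred = np.full_like(y_test, y_train.mean())

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))

# R2 = 1 - SS_res / SS_tot, where SS_tot uses the mean of the test targets.
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("RMSE:", rmse)
print("R2:", r2)
```

For a constant prediction, the RMSE is close to the standard deviation of the targets and R2 cannot exceed 0, so any useful model should score a clearly lower RMSE and a clearly positive R2.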
