# **4. Evaluation of the generative model from eMolecules and MetaNetX datasets**

This notebook was used to populate Table 1 of the manuscript. It evaluates the generative model on the eMolecules and MetaNetX test datasets using the following metrics:
- **SMILES Accuracy**: The percentage of generated SMILES that are identical to the SMILES of the query molecule.
- **Tanimoto Accuracy**: The percentage of generated SMILES that share a Tanimoto similarity of 1 with the query molecule.
- **SMILES Validity**: The percentage of valid SMILES generated by the model.
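
To make the three definitions concrete, here is a minimal sketch that computes them on a handful of made-up records. The fingerprints are toy sets of "on" bits rather than real ECFP vectors, and the record layout (`query`, `pred`, `query_bits`, `pred_bits`) is a hypothetical one chosen for illustration, not the format produced by `evaluate.run`.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


# Hypothetical top-1 predictions, one per query. `pred_bits` is None when the
# predicted SMILES could not be parsed into a molecule (invalid SMILES).
records = [
    {"query": "CCO", "pred": "CCO", "query_bits": {1, 5, 9}, "pred_bits": {1, 5, 9}},
    {"query": "OCC", "pred": "CCO", "query_bits": {1, 5, 9}, "pred_bits": {1, 5, 9}},
    {"query": "CCN", "pred": "CCC", "query_bits": {1, 4, 7}, "pred_bits": {1, 4, 8}},
    {"query": "c1ccccc1", "pred": "c1ccc", "query_bits": {2, 3}, "pred_bits": None},
]

n = len(records)

# String identity between prediction and query
smiles_accuracy = sum(r["pred"] == r["query"] for r in records) / n

# Fingerprint identity (Tanimoto similarity of exactly 1)
tanimoto_accuracy = sum(
    r["pred_bits"] is not None and tanimoto(r["query_bits"], r["pred_bits"]) == 1.0
    for r in records
) / n

# Prediction could be parsed at all
smiles_validity = sum(r["pred_bits"] is not None for r in records) / n

print(smiles_accuracy, tanimoto_accuracy, smiles_validity)  # 0.25 0.5 0.75
```

The second record shows why Tanimoto accuracy can exceed SMILES accuracy: `OCC` and `CCO` are different strings for the same molecule, so the string comparison fails while the fingerprints match.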

In [1]:
%load_ext autoreload
%autoreload 2

from pathlib import Path

import pandas

from paper.learning import config, evaluate

## 4.1. Utilities

In [2]:
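def get_stats(results: pandas.DataFrame, topk=1) -> pandas.DataFrame:
    """Get statistics from the results DataFrame.

    Parameters
    ----------
    results : pandas.DataFrame
        The results DataFrame.
    topk : int, optional
        The top-k to consider (only 1 and 10 are handled), by default 1.

    Returns
    -------
    pandas.DataFrame
        Descriptive statistics (count, unique, top, freq) of the per-sequence results.
    """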

    # First we need to get the unique sequence IDs
    sequence_ids = results["Seq ID"].unique()

    stats = pandas.DataFrame(
        columns=[
            "Seq ID",
            "Raw SMILES Equality",
            "Tanimoto Accuracy",
            "SMILES Validity",
        ],
        index=sequence_ids,
    )

    # Now we can collect results for each sequence ID
    for seq_id in sequence_ids:
        seq_results = results[results["Seq ID"] == seq_id]
        stats.loc[seq_id, "Seq ID"] = seq_id

        if topk == 1:
            top1 = seq_results.loc[seq_results["Prediction Log Prob"].idxmax()]
            stats.loc[seq_id, "Raw SMILES Equality"] = top1["Target match"]
            stats.loc[seq_id, "Tanimoto Accuracy"] = top1["Tanimoto"] == 1.0
            stats.loc[seq_id, "SMILES Validity"] = pandas.notna(top1["Prediction Mol"])

        elif topk == 10:
            top10 = seq_results.nlargest(10, "Prediction Log Prob")
            stats.loc[seq_id, "Raw SMILES Equality"] = any(top10["Target match"])
            stats.loc[seq_id, "Tanimoto Accuracy"] = any(top10["Tanimoto"] == 1.0)
            stats.loc[seq_id, "SMILES Validity"] = not top10["Prediction Mol"].isnull().all()

    return stats.describe()
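
The top-k logic in `get_stats` counts a query as a hit when any of its k most probable candidates is a hit. The sketch below illustrates that rule on a toy table using a `groupby` instead of an explicit loop. The column names are taken from the function above, the values are made up, and the helper `topk_hits` is a hypothetical name introduced only for this example.

```python
import pandas as pd

# Toy results table: two queries, three candidates each (values are made up)
results = pd.DataFrame(
    {
        "Seq ID": [0, 0, 0, 1, 1, 1],
        "Prediction Log Prob": [-0.1, -1.2, -2.5, -0.3, -0.9, -3.0],
        "Target match": [False, True, False, False, False, False],
        "Tanimoto": [0.8, 1.0, 0.5, 0.7, 1.0, 0.2],
    }
)


def topk_hits(results: pd.DataFrame, k: int) -> pd.DataFrame:
    """Per query, flag whether any of the k most probable candidates is a hit."""
    # Sort once so that head(k) within each group returns the k best candidates
    ranked = results.sort_values("Prediction Log Prob", ascending=False)
    best = ranked.groupby("Seq ID").head(k)
    return best.groupby("Seq ID").agg(
        smiles_hit=("Target match", "any"),
        tanimoto_hit=("Tanimoto", lambda s: bool((s == 1.0).any())),
    )


top1 = topk_hits(results, k=1)
top3 = topk_hits(results, k=3)
print(top1)
print(top3)
```

With `k=1` neither query is a hit, because the most probable candidate misses in both cases. With `k=3` query 0 is recovered by both criteria and query 1 only by the Tanimoto criterion.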

## 4.2. Pre-trained model evaluation

In [3]:
BASE_DIR = Path.cwd().parent
MODEL_PATH = BASE_DIR / "data" / "models" / "pretrained.ckpt"

CONFIG = config.Config(db="emolecules", source="ECFP", target="SMILES", base_dir=BASE_DIR)
CONFIG.model_path = MODEL_PATH

### 4.2.1. Performance on the eMolecules test dataset

In [4]:
TEST_FILE = BASE_DIR / "data" / "emolecules" / "splitting" / "test.tsv"
CONFIG.pred_test_file = TEST_FILE

#### eMolecules / top-1

In [5]:
# Settings
CONFIG.pred_mode = "greedy"
CONFIG.pred_batch_size = 100
CONFIG.pred_max_rows = 1000
CONFIG.device = "cpu"
CONFIG.output_file = "pretrain_emolecules_greedy.tsv"

# Run
pretrain_emolecules_greedy = evaluate.run(CONFIG)

# Stats
get_stats(pretrain_emolecules_greedy, topk=1)

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/tduigou/miniforge3/envs/retrosig/lib/python3.11/site-packages/lightning/pytorch/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
/Users/tduigou/miniforge3/envs/retrosig/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=9` in the `DataLoader` to improve performance.


Predicting DataLoader 0: 100%|██████████| 1/1 [00:10<00:00,  0.09it/s]


| | Seq ID | Raw SMILES Equality | Tanimoto Accuracy | SMILES Validity |
|---|---|---|---|---|
| count | 100 | 100 | 100 | 100 |
| unique | 100 | 2 | 2 | 1 |
| top | 0 | True | True | True |
| freq | 1 | 98 | 99 | 100 |


#### eMolecules / top-10 (beam search)

In [6]:
# Settings
CONFIG.pred_mode = "beam"
CONFIG.pred_beam_size = 10
CONFIG.pred_batch_size = 100
CONFIG.pred_max_rows = 1000
CONFIG.device = "gpu"

# Run
pretrain_emolecules_top10 = evaluate.run(CONFIG)

# Stats
get_stats(pretrain_emolecules_top10, topk=10)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
SLURM auto-requeueing enabled. Setting signal handlers.
/linkhome/rech/gengje01/ulh74sf/.conda/envs/retrosig/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=31` in the `DataLoader` to improve performance.


Predicting DataLoader 0: 100%|██████████| 10/10 [07:11<00:00,  0.02it/s]


| | Seq ID | Raw SMILES Equality | Tanimoto Accuracy | SMILES Validity |
|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 1000 |
| unique | 1000 | 2 | 2 | 1 |
| top | 999 | True | True | True |
| freq | 1 | 996 | 997 | 1000 |


### 4.2.2. Performance on the MetaNetX test dataset

In [7]:
TEST_FILE = BASE_DIR / "data" / "metanetx" / "splitting" / "test.tsv"
CONFIG.pred_test_file = TEST_FILE

#### MetaNetX / top-1

In [8]:
# Settings
CONFIG.pred_mode = "greedy"
CONFIG.pred_batch_size = 100
CONFIG.pred_max_rows = 1000
CONFIG.device = "cpu"

# Run
pretrain_metanetx_greedy = evaluate.run(CONFIG)

# Stats
get_stats(pretrain_metanetx_greedy, topk=1)

GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/linkhome/rech/gengje01/ulh74sf/.conda/envs/retrosig/lib/python3.11/site-packages/lightning/pytorch/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
SLURM auto-requeueing enabled. Setting signal handlers.
/linkhome/rech/gengje01/ulh74sf/.conda/envs/retrosig/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=31` in the `DataLoader` to improve performance.


Predicting DataLoader 0: 100%|██████████| 10/10 [02:08<00:00,  0.08it/s]


| | Seq ID | Raw SMILES Equality | Tanimoto Accuracy | SMILES Validity |
|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 1000 |
| unique | 1000 | 2 | 2 | 2 |
| top | 999 | False | False | True |
| freq | 1 | 708 | 693 | 842 |
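
Because `describe()` reports the most frequent value (`top`) and its frequency (`freq`), the rate of `True` values has to be read differently depending on whether `top` is `True` or `False`. The helper below is a small sketch of that conversion. `rate_true` is a hypothetical name, and it assumes a boolean column without missing values.

```python
def rate_true(count: int, top: bool, freq: int) -> float:
    """Fraction of True values given the count/top/freq rows of describe().

    Assumes a boolean column without missing values, so every entry that is
    not the most frequent value is its negation.
    """
    n_true = freq if top else count - freq
    return n_true / count


# Values copied from the MetaNetX / top-1 table above
print(rate_true(1000, False, 708))  # Raw SMILES Equality -> 0.292
print(rate_true(1000, False, 693))  # Tanimoto Accuracy   -> 0.307
print(rate_true(1000, True, 842))   # SMILES Validity     -> 0.842
```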


#### MetaNetX / top-10 (beam search)

In [9]:
# Settings
CONFIG.pred_mode = "beam"
CONFIG.pred_beam_size = 10
CONFIG.pred_batch_size = 100
CONFIG.pred_max_rows = 1000
CONFIG.device = "gpu"

# Run
pretrain_metanetx_top10 = evaluate.run(CONFIG)

# Stats
get_stats(pretrain_metanetx_top10, topk=10)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
SLURM auto-requeueing enabled. Setting signal handlers.
/linkhome/rech/gengje01/ulh74sf/.conda/envs/retrosig/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=31` in the `DataLoader` to improve performance.


Predicting DataLoader 0: 100%|██████████| 10/10 [08:51<00:00,  0.02it/s]


| | Seq ID | Raw SMILES Equality | Tanimoto Accuracy | SMILES Validity |
|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 1000 |
| unique | 1000 | 2 | 2 | 2 |
| top | 999 | False | True | True |
| freq | 1 | 521 | 503 | 974 |


## 4.3. Fine-tuned model evaluation

In [10]:
BASE_DIR = Path.cwd().parent
# MODEL_PATH = BASE_DIR / "data" / "metanetx" / "models" / "model.ckpt"
MODEL_PATH = BASE_DIR / "data" / "metanetx" / "models" / "fold=4-epoch=30.ckpt"

CONFIG = config.Config(db="emolecules", source="ECFP", target="SMILES", base_dir=BASE_DIR)
CONFIG.model_path = MODEL_PATH

### 4.3.1. Performance on the eMolecules dataset

In [11]:
TEST_FILE = BASE_DIR / "data" / "emolecules" / "splitting" / "test.tsv"
CONFIG.pred_test_file = TEST_FILE

#### eMolecules / top-1

In [12]:
# Settings
CONFIG.pred_mode = "greedy"
CONFIG.pred_batch_size = 100
CONFIG.pred_max_rows = 1000
CONFIG.device = "cpu"

# Run
finetuned_emolecules_greedy = evaluate.run(CONFIG)

# Stats
get_stats(finetuned_emolecules_greedy, topk=1)

GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/linkhome/rech/gengje01/ulh74sf/.conda/envs/retrosig/lib/python3.11/site-packages/lightning/pytorch/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
SLURM auto-requeueing enabled. Setting signal handlers.
/linkhome/rech/gengje01/ulh74sf/.conda/envs/retrosig/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=31` in the `DataLoader` to improve performance.


Predicting DataLoader 0: 100%|██████████| 10/10 [01:15<00:00,  0.13it/s]


| | Seq ID | Raw SMILES Equality | Tanimoto Accuracy | SMILES Validity |
|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 1000 |
| unique | 1000 | 2 | 2 | 2 |
| top | 999 | True | True | True |
| freq | 1 | 942 | 956 | 995 |


#### eMolecules / top-10 (beam search)

In [None]:
# Settings
CONFIG.pred_mode = "beam"
CONFIG.pred_beam_size = 10
CONFIG.pred_batch_size = 100
CONFIG.pred_max_rows = 1000
CONFIG.device = "gpu"

# Run
finetuned_emolecules_top10 = evaluate.run(CONFIG)

# Stats
get_stats(finetuned_emolecules_top10, topk=10)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
SLURM auto-requeueing enabled. Setting signal handlers.
/linkhome/rech/gengje01/ulh74sf/.conda/envs/retrosig/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=31` in the `DataLoader` to improve performance.


Predicting DataLoader 0: 100%|██████████| 10/10 [08:14<00:00,  0.02it/s]


### 4.3.2. Performance on the MetaNetX dataset

In [None]:
TEST_FILE = BASE_DIR / "data" / "metanetx" / "splitting" / "test.tsv"
CONFIG.pred_test_file = TEST_FILE

#### MetaNetX / top-1

In [None]:
# Settings
CONFIG.pred_mode = "greedy"
CONFIG.pred_batch_size = 100
CONFIG.pred_max_rows = 1000
CONFIG.device = "cpu"

# Run
finetuned_metanetx_greedy = evaluate.run(CONFIG)

# Stats
get_stats(finetuned_metanetx_greedy, topk=1)

#### MetaNetX / top-10 (beam search)

In [None]:
# Settings
CONFIG.pred_mode = "beam"
CONFIG.pred_beam_size = 10
CONFIG.pred_batch_size = 100
CONFIG.pred_max_rows = 1000
CONFIG.device = "gpu"

# Run
finetuned_metanetx_top10 = evaluate.run(CONFIG)

# Stats
get_stats(finetuned_metanetx_top10, topk=10)

___

## 4.4. Summary statistics

In [17]:
from pathlib import Path

import pandas as pd


# Utils -----------------------------------------------------------------------

def get_summary(results: pd.DataFrame, topk=1) -> pd.DataFrame:
    """Get summary from the results DataFrame.

    Parameters
    ----------
    results : pandas.DataFrame
        The results DataFrame.
    topk : int, optional
        The top-k to consider, by default 1.

    Returns
    -------
    pandas.DataFrame
        The summary DataFrame.
    """

    # First we need to get the unique query IDs
    query_ids = results["Query ID"].unique()

    summary = pd.DataFrame(
        columns=[
            "Query ID",
            "Query SMILES",
            "Query ECFP",
            "SMILES Exact Match",
            "Tanimoto Exact Match",
            "SMILES Syntaxically Valid",
            "Tanimoto Exact Match Unique Count",
            "Tanimoto Exact Match Unique List",
        ],
        index=query_ids,
    )

    # Now we can collect results for each query ID
    for query_id in query_ids:

        # Get mask corresponding to the query ID
        query_mask = results["Query ID"] == query_id

        # Get subset corresponding to the top-k
        top_query_subset = results[query_mask].nlargest(topk, "Prediction Log Prob")

        # Get the subset from the top-k corresponding to Tanimoto exact match
        top_query_exact_match = top_query_subset[top_query_subset["Tanimoto Exact Match"]]

        # Fill in the stats
        summary.loc[query_id, "Query ID"] = query_id
        summary.loc[query_id, "Query SMILES"] = top_query_subset.iloc[0]["Query SMILES"]
        summary.loc[query_id, "Query ECFP"] = top_query_subset.iloc[0]["Query ECFP"]
        summary.loc[query_id, "SMILES Exact Match"] = any(top_query_subset["SMILES Exact Match"])
        summary.loc[query_id, "Tanimoto Exact Match"] = any(top_query_subset["Tanimoto Exact Match"])
        summary.loc[query_id, "SMILES Syntaxically Valid"] = any(top_query_subset["SMILES Syntaxically Valid"])
        summary.loc[query_id, "Tanimoto Exact Match Unique Count"] = top_query_exact_match["Prediction Canonic SMILES"].nunique()
        summary.loc[query_id, "Tanimoto Exact Match Unique List"] = str(list(top_query_exact_match["Prediction Canonic SMILES"].unique()))

    return summary


def get_statistics(df: pd.DataFrame, topk=1) -> pd.DataFrame:
    """Get statistics from the results DataFrame.

    Parameters
    ----------
    df : pandas.DataFrame
        The results DataFrame.
    topk : int, optional
        The top-k to consider, by default 1.

    Returns
    -------
    pandas.DataFrame
        The statistics DataFrame.
    """

    # First we get summary information
    summary = get_summary(df, topk=topk)

    # Now we can compute basic statistics
    stats = summary.aggregate(
        {
            "SMILES Exact Match": ["mean"],
            "Tanimoto Exact Match": ["mean"],
            "SMILES Syntaxically Valid": ["mean"],
        }
    )

    # Rename columns
    stats.columns = [
        "SMILES Accuracy",
        "Tanimoto Accuracy",
        "SMILES Syntax Validity",
    ]

    # Transpose and set index as a "Stat" column
    stats = stats.T
    stats["Stat"] = stats.index
    stats.reset_index(drop=True, inplace=True)
    
    # Rename and reorder columns
    stats.columns = ["Value", "Stat"]
    stats = stats[["Stat", "Value"]]

    return stats


def get_uniqueness(df: pd.DataFrame, topk=1) -> pd.DataFrame:
    """Get the number of unique molecules per query.

    Parameters
    ----------
    df : pandas.DataFrame
        The results DataFrame.
    topk : int, optional
        The top-k to consider, by default 1.

    Returns
    -------
    pandas.DataFrame
        The unique count per query DataFrame.
    """
    # First we get summary information
    summary = get_summary(df, topk=topk)

    # Get count on the number of unique SMILES per query
    uniqueness = pd.DataFrame(summary["Tanimoto Exact Match Unique Count"].value_counts().sort_index())
    uniqueness.rename(columns={"count": "Count"}, inplace=True)
    uniqueness["Distinct Molecules per Query"] = uniqueness.index
    uniqueness.reset_index(drop=True, inplace=True)
    uniqueness = uniqueness.iloc[:, [1, 0]]  # reverse the order of the columns
    
    return uniqueness


# Load data -------------------------------------------------------------------
BASE_DIR = Path.cwd()
DATA_DIR = BASE_DIR / "table-1-accuracy"

for prefix in ["finetuned", "pretrain"]:
    for db in ["emolecules", "metanetx"]:
        for topk in [1, 10]:

            FILENAME = f"{prefix}-{db}-top{topk}-refined.tsv"
            df = pd.read_csv(DATA_DIR / FILENAME, sep="\t")

            # Summary ---------------------------------------------------------
            summary = get_summary(df, topk=topk)
            OUTFILE = FILENAME.replace("-refined.tsv", "-summary.tsv")
            summary.to_csv(DATA_DIR / OUTFILE, sep="\t", index=False)

            # Statistics ------------------------------------------------------
            stats = get_statistics(df, topk=topk)
            OUTFILE = FILENAME.replace("-refined.tsv", "-statistics.tsv")
            stats.to_csv(DATA_DIR / OUTFILE, sep="\t", index=False)

            # Uniqueness ------------------------------------------------------
            uniqueness = get_uniqueness(df, topk=topk)
            OUTFILE = FILENAME.replace("-refined.tsv", "-uniqueness.tsv")
            uniqueness.to_csv(DATA_DIR / OUTFILE, sep="\t", index=False)

            print(f"{FILENAME} stats:")
            print(stats)
            print("\n")
            print(f"{FILENAME} uniqueness:")
            print(uniqueness)
            print("\n")

finetuned-emolecules-top1-refined.tsv stats:
                     Stat  Value
0         SMILES Accuracy  0.946
1       Tanimoto Accuracy  0.955
2  SMILES Syntax Validity  0.995
3     Molecule Uniqueness  0.955


finetuned-emolecules-top1-refined.tsv uniqueness:
  Distinct Molecules per Query  Count
0                            0     45
1                            1    955


finetuned-emolecules-top10-refined.tsv stats:
                     Stat  Value
0         SMILES Accuracy  0.996
1       Tanimoto Accuracy  0.996
2  SMILES Syntax Validity  1.000
3     Molecule Uniqueness  1.049


finetuned-emolecules-top10-refined.tsv uniqueness:
  Distinct Molecules per Query  Count
0                            0      4
1                            1    952
2                            2     36
3                            3      7
4                            4      1
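
The `Molecule Uniqueness` row in the printed statistics is not produced by `get_statistics` as shown in this notebook. Its value is numerically consistent with the mean number of distinct molecules per query, which can be recomputed from the uniqueness table. The short check below uses the distribution copied from the output above. It is an observation about the numbers, not a confirmed description of how the value was originally computed.

```python
# Distribution copied from the finetuned-emolecules-top10 uniqueness output:
# {distinct molecules per query: number of queries}
distribution = {0: 4, 1: 952, 2: 36, 3: 7, 4: 1}

n_queries = sum(distribution.values())

# Weighted mean of the number of distinct molecules per query
mean_distinct = sum(k * v for k, v in distribution.items()) / n_queries

print(n_queries, mean_distinct)  # 1000 1.049
```

The same calculation on the top-1 distribution (`{0: 45, 1: 955}`) gives 0.955, which also matches the printed value.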


finetuned-metanetx-top1-refined.tsv stats:
                     Stat  Value
0         SMILES Accuracy  0.774
1       Tanimoto Ac