### This tutorial is currently non-operational. When non-legacy code is operational, these tutorials are to be adapted.

I've split the notebook into two sections: Thompson Sampling and Results Analysis.

The provided input files are located at:
```
../examples/docking_scores/{file_name}
../examples/input_files/{file_name}
```

Output files should always be placed in the following directory, so that no generated file is accidentally uploaded to the GitHub repo:
```
./tmp
```

In [None]:
import polars as pl
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
# TACTICS Imports
from TACTICS.thompson_sampling import ThompsonSampler
from TACTICS.thompson_sampling.config import ThompsonSamplingConfig
from TACTICS.thompson_sampling.strategies.config import RouletteWheelConfig
from TACTICS.thompson_sampling.warmup.config import StratifiedWarmupConfig
from TACTICS.thompson_sampling.core.evaluator_config import LookupEvaluatorConfig
from TACTICS.thompson_sampling.presets import get_preset
from TACTICS.thompson_sampling.main import run_ts

In [None]:
import os
os.makedirs("./tmp", exist_ok=True)

# TACTICS Fundamentals:
The module consists of a unified Thompson sampler that takes different configurations as input. To set up a run, change the parameters of `ThompsonSamplingConfig`. Every run consists of one configuration passed to the `ThompsonSampler` class.

### Configurations 
Configurations are also used for the different search and warm-up strategies. They are built with `pydantic`, which provides a clear, concise way to troubleshoot problems with input parameters. `pydantic` checks the user's input for each parameter and provides code hints (parameter ranges, types, etc.). If a parameter is invalid, it raises an error message that identifies where the problem is. Configurations also make it easy to keep track of different runs.
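
As a rough illustration of the validation behaviour described above, here is a minimal sketch. `DemoStrategyConfig` and its fields are hypothetical and are not part of TACTICS; only the `pydantic` calls (`BaseModel`, `Field`, `ValidationError`) are real.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class DemoStrategyConfig(BaseModel):
    """Hypothetical config used only for illustration (not a TACTICS class)."""

    mode: Literal["minimize", "maximize"] = "minimize"
    alpha: float = Field(0.1, gt=0.0, le=1.0)   # must lie in (0, 1]
    num_iterations: int = Field(1000, ge=1)     # must be at least 1


def invalid_fields(**kwargs):
    """Return the sorted names of the fields that fail validation."""
    try:
        DemoStrategyConfig(**kwargs)
    except ValidationError as err:
        # Each error entry records the location of the offending field.
        return sorted({str(e["loc"][0]) for e in err.errors()})
    return []


print(invalid_fields(alpha=0.15))                  # valid input
print(invalid_fields(mode="smallest", alpha=5.0))  # two invalid fields
```

The second call reports both offending fields at once, which is what makes configuration errors quick to locate.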

### Presets 
These are pre-configured TS configurations for specific use cases. The user does not need to define the parameters for a given search strategy or warm-up, since these are all set to default values. The user only needs to supply the inputs (reagent files, reaction SMARTS, and evaluator type). For example, the fast exploration preset uses the Epsilon Greedy configuration with a high $\epsilon$ value for fast exploration.
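
To make the role of $\epsilon$ concrete, here is a generic epsilon-greedy selection sketch. It is not the TACTICS implementation; the function name, the made-up mean scores, and the choice of minimize mode are assumptions for illustration.

```python
import random


def epsilon_greedy_pick(mean_scores, epsilon, rng):
    """Return an index into mean_scores.

    With probability epsilon a random index is explored; otherwise the
    index with the lowest mean score is exploited (minimize mode).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(mean_scores))
    return min(range(len(mean_scores)), key=lambda i: mean_scores[i])


rng = random.Random(0)
mean_scores = [-7.2, -9.5, -8.1]  # made-up mean docking scores per reagent
picks = [epsilon_greedy_pick(mean_scores, epsilon=0.5, rng=rng) for _ in range(1000)]
print({i: picks.count(i) for i in range(len(mean_scores))})
```

A higher `epsilon` spends more picks on random reagents (exploration); `epsilon=0` always picks the current best reagent.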

#### Wrapper Functions
The package comes with a `run_ts` wrapper function that can be used with the existing presets. The user only supplies the reaction SMARTS and the input files; the wrapper runs both the warm-up and search cycles and returns a `polars` data frame of the compiled results. It also returns statistics on the search efficiency.
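
The warm-up followed by search flow can be sketched on a toy two-component library. This is not the code behind `run_ts`: `toy_run`, the score table, and the Gaussian belief with standard deviation `1 / sqrt(count)` are all assumptions chosen to keep the example short.

```python
import random


def toy_run(scores, num_warmup_trials, num_iterations, seed=0):
    """Toy warm-up + Thompson sampling loop (minimize mode).

    scores[i][j] is the made-up score of the product of reagent i
    (component 0) and reagent j (component 1).
    Returns (results sorted by score, stats).
    """
    if num_warmup_trials < 1:
        raise ValueError("num_warmup_trials must be at least 1")
    rng = random.Random(seed)
    sizes = (len(scores), len(scores[0]))
    # Per-reagent running statistics: [count, mean]
    beliefs = [[[0, 0.0] for _ in range(size)] for size in sizes]
    results = {}  # (i, j) -> score, so each product is stored once

    def evaluate(i, j):
        score = scores[i][j]
        results[(i, j)] = score
        for comp, idx in ((0, i), (1, j)):
            belief = beliefs[comp][idx]
            belief[0] += 1
            belief[1] += (score - belief[1]) / belief[0]  # running mean

    # Warm-up: pair every reagent with random partners a few times.
    for _ in range(num_warmup_trials):
        for i in range(sizes[0]):
            evaluate(i, rng.randrange(sizes[1]))
        for j in range(sizes[1]):
            evaluate(rng.randrange(sizes[0]), j)

    # Search: draw one sample per reagent and combine the lowest draws.
    for _ in range(num_iterations):
        pick = []
        for comp in range(2):
            draws = [rng.gauss(mean, 1.0 / count ** 0.5) for count, mean in beliefs[comp]]
            pick.append(draws.index(min(draws)))
        evaluate(pick[0], pick[1])

    stats = {
        "unique_evaluated": len(results),
        "fraction_of_library": len(results) / (sizes[0] * sizes[1]),
    }
    return sorted(results.items(), key=lambda item: item[1]), stats


toy_scores = [[a + b for b in (0.0, 2.0, 4.0, 6.0)] for a in (0.0, 1.0, 3.0)]
results, stats = toy_run(toy_scores, num_warmup_trials=2, num_iterations=50)
print(results[0], stats)
```

The returned `stats` mirror the idea of search-efficiency statistics: how much of the library had to be evaluated to reach the reported results.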

## Out-of-the-box Demonstration with Presets and Wrapper Function
Here I am using the `run_ts` wrapper function, which lets the user run a preset configuration directly.

In [None]:
preset_config = get_preset(
    "fast_exploration",  # Good for LookupEvaluator (processes=1, batch_size=1)
    reaction_smarts="[#6:1](=[O:2])[OH].[#7X3;H1,H2;!$(N[!#6]);!$(N[#6]=[O]);!$(N[#6]~[!#6;!#16]):3]>>[#6:1](=[O:2])[#7:3]",
    reagent_file_list=[
        "../examples/input_files/acids.smi",
        "../examples/input_files/coupled_aa_sub.smi"
    ],
    evaluator_config=LookupEvaluatorConfig(
        ref_filename="../examples/docking_scores/product_scores.csv",
        ref_colname="Scores"
    ),
    mode="minimize",      # For docking scores lower scores are better
    num_iterations=5000
)

In [None]:
# This gives the search results without the warm-up phase
results_df = run_ts(preset_config)
# If the user wants to see the warm-up results as well, they can use the `return_warmup` argument
#results_df = run_ts(preset_config, return_warmup=True)

## Demonstrate Thompson Sampling with Nested Configuration
This is a "do it yourself" setup, where the user has much more control over the parameters used to run the Thompson sampler.

In [None]:
# YOU choose every single component and parameter
config = ThompsonSamplingConfig(
    reaction_smarts="[#6:1](=[O:2])[OH].[#7X3;H1,H2;!$(N[!#6]);!$(N[#6]=[O]);!$(N[#6]~[!#6;!#16]):3]>>[#6:1](=[O:2])[#7:3]",
    reagent_file_list=[
        "../examples/input_files/acids.smi",
        "../examples/input_files/coupled_aa_sub.smi"
    ],
    num_ts_iterations=5000,
    num_warmup_trials=10,
    
    # Manually configure strategy
    strategy_config=RouletteWheelConfig(
        mode="minimize",
        alpha=0.15,          # You control this
        beta=0.12,           # You control this
        scaling=2.0,         # You control this
        alpha_increment=0.02,
        beta_increment=0.003,
        efficiency_threshold=0.2
    ),
    
    # Manually configure warmup
    warmup_config=StratifiedWarmupConfig(),
    
    # Manually configure evaluator
    evaluator_config=LookupEvaluatorConfig(
        ref_filename="../examples/docking_scores/product_scores.csv",
        ref_colname="Scores"
    ),
    
    # Manually configure performance
    batch_size=1,
    processes=1,
    min_cpds_per_core=10,

    # Manually set output
    results_filename="./tmp/results.csv"
)

In [None]:
sampler = ThompsonSampler.from_config(config)

# IMPORTANT: Set the reaction (required step)
sampler.set_reaction(config.reaction_smarts)
print(f"Sampler initialized with {sampler.get_num_prods():.2e} possible products")

In [None]:
print("Starting warmup phase...")
warmup_results = sampler.warm_up(num_warmup_trials=config.num_warmup_trials)
print(f"Warmup complete: {len(warmup_results)} compounds evaluated")

In [None]:
print(f"Starting Thompson Sampling search ({config.num_ts_iterations} iterations)...")
search_results = sampler.search(num_cycles=config.num_ts_iterations)
print(f"Search complete: {len(search_results)} total compounds evaluated")

In [None]:
pl.concat([warmup_results,search_results])

In [None]:
# Combine warmup and search results
all_results = pl.concat([warmup_results,search_results])

# Display top 100 compounds (lowest scores = best for docking)
top_100 = all_results.sort("score").head(100)
print("\nTop 100 compounds found:")
display(top_100)


In [None]:
# Save Results
top_100.write_csv("./tmp/ts_results.csv")
print("Results saved to ./tmp/ts_results.csv")

In [None]:
# Close the Sampler
# Mainly to be used if you want to deploy multi-processing
sampler.close()
print("Sampler closed successfully")

## Demonstration with a Preset
This is an easier way to run TS quickly with a set of default parameters. Here we do not use the wrapper function.

In [None]:
sampler_preset = ThompsonSampler.from_config(preset_config)

# IMPORTANT: Set the reaction (required step)
sampler_preset.set_reaction(preset_config.reaction_smarts)
print(f"Sampler initialized with {sampler_preset.get_num_prods():.2e} possible products")

In [None]:
print("Starting warmup phase...")
warmup_results_preset = sampler_preset.warm_up(num_warmup_trials=preset_config.num_warmup_trials)
print(f"Warmup complete: {len(warmup_results_preset)} compounds evaluated")

In [None]:
print(f"Starting Thompson Sampling search ({preset_config.num_ts_iterations} iterations)...")
search_results_preset = sampler_preset.search(num_cycles=preset_config.num_ts_iterations)
print(f"Search complete: {len(search_results_preset)} total compounds evaluated")

In [None]:
# Combine warmup and search results
all_results_preset = pl.concat([warmup_results_preset,search_results_preset])

# Display top 100 compounds (lowest scores = best for docking)
top_100_preset = all_results_preset.sort("score").head(100)
print("\nTop 100 compounds found:")
display(top_100_preset)


In [None]:
# Save Results
top_100_preset.write_csv("./tmp/ts_results_preset.csv")
print("Results saved to ./tmp/ts_results_preset.csv")

# Obsolete from this point onwards?

# Testing Thompson Sampling

*TACTICS.thompson_sampling.ts_main does not exist. Tried TACTICS.thompson_sampling.main, but this does not contain parse_input_dict(), so the script breaks after the third code block.*

*Using the legacy TACTICS.thompson_sampling.legacy.ts_main also doesn't work as sampler_type is not defined.*

In [None]:
from TACTICS.thompson_sampling.main import *
import json

In [None]:
input_json_file = """{
"reagent_file_list": [
        "../examples/input_files/acids.smi",
        "../examples/input_files/coupled_aa_sub.smi"
    ],
    "reaction_smarts": "[#6:1](=[O:2])[OH].[#7X3;H1,H2;!$(N[!#6]);!$(N[#6]=[O]);!$(N[#6]~[!#6;!#16]):3]>>[#6:1](=[O:2])[#7:3]",
    "num_warmup_trials": 10,
    "num_ts_iterations": 5000,
    "search_strategy": "greedy_minimize_dt",
    "processes": 1,
    "percent_of_library": 0.1,
    "scaling": -1,
    "temperature": 1,
    "evaluator_class_name": "LookupEvaluator",
    "evaluator_arg": {"ref_filename" : "../examples/docking_scores/product_scores.csv"},
    "log_filename": "./tmp/ts_logs.txt",
    "results_filename": "./tmp/ts_results.csv"
}"""
input_dict = json.loads(input_json_file)

In [None]:
parse_input_dict(input_dict)

*_______ Unable to continue further _______*

In [None]:
ts_std_df = run_ts(input_dict)

In [None]:
ts_std_df.sort_values(by="score", ascending=True).head(100)

In [None]:
prod_scores_df.sort("Scores", descending=False).head(100)

In [None]:
# Modify the TS dataframe so that it is compatible with the get_top_building_blocks function
ts_df_mod = ts_std_df.copy()
ts_df_mod = ts_df_mod.drop("SMILES",axis=1)
ts_df_mod.rename(columns={"score":"Scores", "Name":"Product_Code"},inplace=True)
top_5000_building_blocks_ts_df = get_top_building_blocks(ts_df_mod, 5000)

In [None]:
# Check the overlap between the enriched building blocks from the TS and the top 5000 building blocks from brute force docking
overlap_ts = check_overlap(top_5000_building_blocks_ts_df, top_5000_building_blocks)
visualize_overlapping_blocks(overlap_ts, combined_smiles_dict)

### Check Consistency of the TS results

In [None]:
from tqdm import tqdm

# Repeat the standard TS run 10 times to check run-to-run consistency
ts_df_list = []
for i in tqdm(range(0,10)):
    ts_df_list.append(run_ts(input_dict, hide_progress=True))

In [None]:
from collections import Counter

# Extract the product codes as a list
product_codes = ts_std_df["Name"].to_list()

# Initialize counters for each position
position_counters = []

# Iterate through the product codes
for product_code in product_codes:
    building_blocks = product_code.split("_")  # Split the product code by "_"
    # Ensure the position_counters list is large enough to handle all positions
    while len(position_counters) < len(building_blocks):
        position_counters.append(Counter())
    # Update the counters for each position
    for i, block in enumerate(building_blocks):
        position_counters[i][block] += 1

# Find the top 20 building blocks for each position
for i, counter in enumerate(position_counters):
    print(f"Top 20 building blocks for position {i + 1}:")
    print(f"{'Building Block':<20}{'Frequency':<10}")
    print("-" * 30)
    for block, count in counter.most_common(20):
        print(f"{block:<20}{count:<10}")
    print("\n")

In [None]:
# Collect the top 20 building blocks for each position
top_20_building_blocks_per_position = []
for counter in position_counters:
    top_20_blocks = [block for block, _ in counter.most_common(20)]
    top_20_building_blocks_per_position.append(top_20_blocks)

# Convert the list to a tuple
top_20_building_blocks_tuple = tuple(top_20_building_blocks_per_position)

In [None]:
from rdkit import Chem
from rdkit.Chem import Draw

total_molecules = 5000  # The top 1% of products
# Combine the two dictionaries
combined_smiles_dict = {**amino_acid_bb_dict, **acids_bb_dict}

# Iterate through each position's top 20 building blocks
for i, counter in enumerate(position_counters):
    # Get the top 20 building blocks for the current position
    top_20_blocks = counter.most_common(20)
    
    # Create a list of RDKit molecules and their labels
    mols = []
    legends = []
    for block, freq in top_20_blocks:
        if block in combined_smiles_dict:  # Use the combined dictionary
            smiles = combined_smiles_dict[block]
            mol = Chem.MolFromSmiles(smiles)
            if mol:
                mols.append(mol)
                # Add building block name and frequency on the first line, fraction on the second line
                legends.append(f"{block} (Freq: {freq})\nFraction: {round((freq / total_molecules) * 100, 2)}%")
    
    # Visualize the molecules in a grid
    img = Draw.MolsToGridImage(
        mols, legends=legends, molsPerRow=5, subImgSize=(300, 300)
    )
    
    # Display the title and the image
    print(f"Top 20 Building Blocks for Position {i + 1}")
    display(img)  # Display the image in the Jupyter Notebook

## Boltzmann Sampling 
This section uses Boltzmann sampling instead of standard greedy sampling to select new compounds to test.
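
For orientation, a generic Boltzmann selection step can be sketched as follows. This is not the TACTICS `boltzmann_minimize` implementation; the function names and the made-up scores are assumptions, and the weights follow the textbook form $p_i \propto \exp(-s_i / T)$ for minimization.

```python
import math
import random


def boltzmann_probabilities(scores, temperature):
    """Selection probabilities proportional to exp(-score / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    best = min(scores)
    # Shift by the best score for numerical stability; the shift cancels
    # out when the weights are normalized.
    weights = [math.exp(-(s - best) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_pick(scores, temperature, rng):
    """Draw one index according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


sampled_scores = [-7.2, -9.5, -8.1]  # made-up sampled scores per reagent
print(boltzmann_probabilities(sampled_scores, temperature=1.0))
print(boltzmann_pick(sampled_scores, temperature=1.0, rng=random.Random(0)))
```

Unlike a greedy pick, every reagent keeps a non-zero chance of selection; a lower temperature concentrates the probability on the best-scoring reagent, and a higher temperature flattens it.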

In [None]:
import copy
input_dict_boltzmann = copy.copy(input_dict)
input_dict_boltzmann["search_strategy"] = "boltzmann_minimize"

In [None]:
ts_boltzmann_df = run_ts(input_dict_boltzmann)

In [None]:
# Modify the TS dataframe so that it is compatible with the get_top_building_blocks function
ts_Boltzmann_df_mod = ts_boltzmann_df.copy()
ts_Boltzmann_df_mod = ts_Boltzmann_df_mod.drop("SMILES",axis=1)
ts_Boltzmann_df_mod.rename(columns={"score":"Scores", "Name":"Product_Code"},inplace=True)
top_5000_building_blocks_ts_Boltzmann_df = get_top_building_blocks(ts_Boltzmann_df_mod, 5000)

In [None]:
# Check the overlap between the enriched building blocks from the TS and the top 5000 building blocks from brute force docking
overlap_ts_Boltzmann = check_overlap(top_5000_building_blocks_ts_Boltzmann_df, top_5000_building_blocks)
visualize_overlapping_blocks(overlap_ts_Boltzmann, combined_smiles_dict)

In [None]:
# Generate 10 runs of Boltzmann Sampling
ts_boltzmann_df_list = []
for i in tqdm(range(0,10)):
    ts_boltzmann_df_list.append(run_ts(input_dict_boltzmann, hide_progress=True))

In [None]:
# Extract the product codes as a list
product_codes = ts_boltzmann_df["Name"].to_list()

# Initialize counters for each position
position_counters = []

# Iterate through the product codes
for product_code in product_codes:
    building_blocks = product_code.split("_")  # Split the product code by "_"
    # Ensure the position_counters list is large enough to handle all positions
    while len(position_counters) < len(building_blocks):
        position_counters.append(Counter())
    # Update the counters for each position
    for i, block in enumerate(building_blocks):
        position_counters[i][block] += 1

# Find the top 20 building blocks for each position
for i, counter in enumerate(position_counters):
    print(f"Top 20 building blocks for position {i + 1}:")
    print(f"{'Building Block':<20}{'Frequency':<10}")
    print("-" * 30)
    for block, count in counter.most_common(20):
        print(f"{block:<20}{count:<10}")
    print("\n")

#### Run these cells to generate the plots

In [None]:
# Combine the dataframes 
docking_df = prod_scores_df.to_pandas() # Convert the polars dataframe to a pandas dataframe
docking_df.rename(columns={"Product_Code":"Name", "Scores":"score"},inplace=True)
docking_df["method"] = "ref"
docking_df["cycle"] = "ref"
ref_df = docking_df.sort_values(by="score", ascending=True).head(100)

In [None]:
# Process the TS dataframes
# We can substitute the regular TS with enhanced TS here
ts_graph_df_list = []
ts_enhanced_df_list_graph = []
ts_boltzmann_df_list_graph = []
for i in range(0,10):
    ts_df_temp = ts_df_list[i].copy()
    ts_df_temp["cycle"] = i
    ts_df_temp["method"] = "TS"
    ts_df_temp.drop(columns=["SMILES"],inplace=True)
    ts_graph_df_list.append(ts_df_temp)
    ts_enhanced_temp_df = ts_enhanced_df_list[i].copy()
    ts_enhanced_temp_df["cycle"] = i
    ts_enhanced_temp_df["method"] = "TS_enhanced"
    ts_enhanced_temp_df.drop(columns=["SMILES"],inplace=True)
    ts_enhanced_df_list_graph.append(ts_enhanced_temp_df)
    ts_boltzmann_temp_df = ts_boltzmann_df_list[i].copy()
    ts_boltzmann_temp_df["cycle"] = i
    ts_boltzmann_temp_df["method"] = "TS_Boltzmann"
    ts_boltzmann_temp_df.drop(columns=["SMILES"],inplace=True)
    ts_boltzmann_df_list_graph.append(ts_boltzmann_temp_df)

In [None]:
# Concatenate the dataframes
ts_combo_df = pd.concat([x.sort_values(by="score", ascending=True).head(100) for x in ts_graph_df_list])
ts_enhanced_combo_df = pd.concat([x.sort_values(by="score", ascending=True).head(100) for x in ts_enhanced_df_list_graph])
ts_boltzmann_combo_df = pd.concat([x.sort_values(by="score", ascending=True).head(100) for x in ts_boltzmann_df_list_graph])

In [None]:
# Create concatenated data points only for TS, TS_enhanced and TS_Boltzmann
concat_data = pd.DataFrame({
    'cycle': ['concat'] * (len(ts_combo_df) + len(ts_enhanced_combo_df) + len(ts_boltzmann_combo_df)),
    'score': pd.concat([ts_combo_df['score'], ts_enhanced_combo_df['score'], ts_boltzmann_combo_df['score']]),
    'method': pd.concat([ts_combo_df['method'], ts_enhanced_combo_df['method'], ts_boltzmann_combo_df['method']])
})

In [None]:
combined_df = pd.concat([ts_combo_df, ts_enhanced_combo_df, ts_boltzmann_combo_df, concat_data, ref_df])
combined_df.reset_index(drop=True,inplace=True)
combined_df.method = pd.Categorical(combined_df.method, categories=["ref","TS", "TS_enhanced", "TS_Boltzmann"], ordered=True)

In [None]:
# Define a consistent color palette
palette_colors = sns.color_palette("Set1")[:4]

# Top subplot (stripplot) with concatenated results
ax1 = sns.stripplot(data=combined_df, x="cycle", y="score", hue="method", dodge=True, palette=palette_colors)
ax1.set_ylabel("Score", fontsize=16)
ax1.set_xlabel("")
ax1.tick_params(axis='both', which='major', labelsize=14)

In [None]:
# Estimate the numbers of hits found by each method in each cycle
ref_products = ref_df["Name"].to_list()
plot_list = []
for cycle in range(0,10):
    num_in_cycle = len(combined_df.query("cycle == @cycle and method == 'TS' and Name in @ref_products"))
    plot_list.append([cycle+1,num_in_cycle,'TS'])
for cycle in range(0,10):
    num_in_cycle = len(combined_df.query("cycle == @cycle and method == 'TS_enhanced' and Name in @ref_products"))
    plot_list.append([cycle+1,num_in_cycle,'TS_enhanced'])
for cycle in range(0,10):
    num_in_cycle = len(combined_df.query("cycle == @cycle and method == 'TS_Boltzmann' and Name in @ref_products"))
    plot_list.append([cycle+1,num_in_cycle,'TS_Boltzmann'])
# Count the unique hits found by each method across all cycles (concatenated)
plot_list.append(["concat",len(ts_enhanced_combo_df.query("Name in @ref_products").drop_duplicates(subset=["Name"])),"TS_enhanced"])
plot_list.append(["concat",len(ts_boltzmann_combo_df.query("Name in @ref_products").drop_duplicates(subset=["Name"])),"TS_Boltzmann"])
plot_list.append(["concat",len(ts_combo_df.query("Name in @ref_products").drop_duplicates(subset=["Name"])),"TS"])
plot_list.append(["ref",100,"ref"])
plot_df = pd.DataFrame(plot_list, columns=["cycle","found","method"])
plot_df.method = pd.Categorical(plot_df.method, categories=["ref","TS", "TS_enhanced", "TS_Boltzmann"], ordered=True)

In [None]:
# Let's check whether there is an actual difference between the standard, enhanced and Boltzmann TS
from scipy import stats
f_stat, p_value = stats.f_oneway(plot_df.loc[plot_df["method"] == "TS","found"], 
                                 plot_df.loc[plot_df["method"] == "TS_enhanced","found"], 
                                 plot_df.loc[plot_df["method"] == "TS_Boltzmann","found"])
print(f"F-statistic: {f_stat}, P-value: {p_value}")

In [None]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
tukey = pairwise_tukeyhsd(endog=plot_df["found"], groups=plot_df["method"], alpha=0.05)
print(tukey)

In [None]:
ax = sns.barplot(x="cycle",y="found",hue="method",data=plot_df, dodge=True)
ax.legend(loc='upper left', bbox_to_anchor=(1.00, 0.75), ncol=1)

In [None]:
# Create a larger figure
plt.figure(figsize=(15, 8))  # Increase these numbers to make plot bigger (width, height)

# Create the barplot with wider bars
ax = sns.barplot(x="cycle", y="found", hue="method", data=plot_df, dodge=True, width=0.8, palette="Set1")  # width controls bar width
ax.legend(loc='upper left', bbox_to_anchor=(1.00, 0.75), ncol=1)

# Add value labels manually with larger font size
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}', 
                xy=(p.get_x() + p.get_width()/2, p.get_height()),
                ha='center', va='bottom',
                fontsize=12)  # Increase font size of the numbers

# Adjust figure margins
plt.subplots_adjust(right=0.85, bottom=0.15)

# Optional: Increase font size of axis labels and ticks
ax.tick_params(axis='both', which='major', labelsize=12)
ax.set_xlabel(ax.get_xlabel(), fontsize=14)
ax.set_ylabel(ax.get_ylabel(), fontsize=14)

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(20, 18), height_ratios=[2, 1])

# Define a consistent color palette and category order
palette_colors = sns.color_palette("Set1")[:4]
method_order = ["TS", "TS_enhanced", "TS_Boltzmann", "ref"]

# Make sure both dataframes have the same category order
combined_df.method = pd.Categorical(combined_df.method, categories=method_order, ordered=True)
plot_df.method = pd.Categorical(plot_df.method, categories=method_order, ordered=True)

# Top subplot (stripplot) with concatenated results
sns.stripplot(data=combined_df, x="cycle", y="score", hue="method", dodge=True, palette=palette_colors, ax=ax1)
ax1.set_ylabel("Docking Score (Negative is Better)", fontsize=16)
ax1.set_xlabel("Cycle", fontsize=16)
ax1.tick_params(axis='both', which='major', labelsize=14)

# Bottom subplot (barplot)
sns.barplot(x="cycle", y="found", hue="method", data=plot_df, dodge=True, width=0.8, palette=palette_colors, ax=ax2)
# Remove the bottom legend
ax2.get_legend().remove()

# Add value labels to bars, only for non-zero values
for p in ax2.patches:
    height = p.get_height()
    if height > 0:  # Only add label if the value is greater than 0
        ax2.annotate(f'{int(height)}', 
                    xy=(p.get_x() + p.get_width()/2, height),
                    ha='center', va='bottom',
                    fontsize=14)

# Adjust font sizes for bottom plot
ax2.tick_params(axis='both', which='major', labelsize=14)
ax2.set_xlabel("Cycle", fontsize=16)
ax2.set_ylabel("Number of top 100 hits found", fontsize=16)

# Move the legend to the top of the figure and increase its font size
legend = ax1.legend(loc='upper left', bbox_to_anchor=(1.00, 0.75), ncol=1, fontsize=14)

# Adjust layout with reduced spacing between plots
plt.subplots_adjust(right=0.85, hspace=0.125)

# Show the plot
plt.show()

This plot can look misleading at first, but it is important to remember that the more negative the `ChemGauss` score, the better. Hence, points at the bottom of the plot are better than those at the top. The reference represents the best set of molecules found by brute-force docking. Boltzmann sampling does not appear to find most of these hits, so it is not a useful method for finding hits in this library.
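
The hit counts in the bar plot boil down to set intersections. The sketch below shows the idea with made-up product codes; the helper names are hypothetical and the notebook itself computes these counts with `pandas` queries.

```python
def hits_per_run(reference_top, runs):
    """Count, for each run, how many of its products are in the reference top set."""
    reference = set(reference_top)
    return [len(reference & set(run)) for run in runs]


def unique_hits_across_runs(reference_top, runs):
    """Count reference hits found by at least one run (the 'concat' bar)."""
    reference = set(reference_top)
    found = set().union(*runs) if runs else set()
    return len(reference & found)


# Made-up product codes in the "<reagent1>_<reagent2>" style
reference_top = ["A1_B1", "A1_B2", "A2_B1", "A3_B3"]
runs = [
    ["A1_B1", "A9_B9"],
    ["A1_B2", "A1_B1"],
    [],
]
print(hits_per_run(reference_top, runs))
print(unique_hits_across_runs(reference_top, runs))
```

Deduplicating across runs matters for the concatenated bar: a hit found in several runs should be counted once.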

## Enhanced Thompson Sampling

In [None]:
import copy
input_dict_enhanced_TS = copy.copy(input_dict)
input_dict_enhanced_TS["search_strategy"] = "thermal_cycling"
input_dict_enhanced_TS["processes"] = 1
input_dict_enhanced_TS["percent_of_library"] = 0.1
input_dict_enhanced_TS["scaling"] = -1
input_dict_enhanced_TS["temperature"] = 1

In [None]:
ts_enhanced_df = run_ts(input_dict_enhanced_TS)

In [None]:
ts_enhanced_df

In [None]:
ts_enhanced_df.sort_values(by="score", ascending=True).head(100)

In [None]:
truth_df = prod_scores_df.sort("Scores", descending=False).head(100).to_pandas()

In [None]:
truth_df

In [None]:
top_100_ts_enhanced_df = ts_enhanced_df.sort_values(by="score", ascending=True).head(100)

In [None]:
len(list(set(truth_df["Product_Code"]) & set(top_100_ts_enhanced_df["Name"])))

In [None]:
ts_enhanced_df = ts_enhanced_df.rename(columns={"score":"Scores", "Name":"Product_Code"})

In [None]:
ts_enhanced_df_slice = ts_enhanced_df.sort_values(by="Scores", ascending=True).head(5000)

In [None]:
# Generate 10 runs of enhanced Thompson Sampling
ts_enhanced_df_list = []
for i in tqdm(range(0,10)):
    ts_enhanced_df_temp = run_ts(input_dict_enhanced_TS, hide_progress=True)
    ts_enhanced_df_temp = ts_enhanced_df_temp.sort_values(by="score", ascending=True).head(5000)
    ts_enhanced_df_list.append(ts_enhanced_df_temp)

In [None]:
len(ts_enhanced_df_list[0])

In [None]:
ts_enhanced_df_list[0]

In [None]:
# Best (lowest) score among the top 100 reference products
top_100_ref = prod_scores_df.sort("Scores", descending=False).head(100)
top_100_ref.select(pl.col("Scores").min())

In [None]:
# Worst (highest) score among the top 100 reference products
prod_scores_df.sort("Scores", descending=False).head(100).select(pl.col("Scores").max())

In [None]:
ts_df_list[0].sort_values(by="score", ascending=True).head(100)

In [None]:
ts_boltzmann_df_list[0].sort_values(by="score", ascending=True).head(100)

### How much does the Enhanced TS recover using Docking as the Scoring Function?

In [None]:
# Ground Truth
prod_scores_df_pd = prod_scores_df.to_pandas()
top_5k_truth = prod_scores_df_pd.sort_values(by="Scores", ascending=True).head(5000)

In [None]:
# Let's look at one instance of each of the TS methods
# Re-run this cell if the dataframes need to be reset
ts_std_df = ts_df_list[0].copy()
ts_enhanced_df = ts_enhanced_df_list[0].copy()
ts_boltzmann_df = ts_boltzmann_df_list[0].copy()

In [None]:
ts_dfs = [ts_std_df, ts_enhanced_df, ts_boltzmann_df]
ts_types = ["TS", "TS_enhanced", "TS_Boltzmann"]
top_5k = {}
for n, df in enumerate(ts_dfs):
    df_temp = df.sort_values(by="score", ascending=True).head(5000)
    df_temp.drop(columns=["SMILES"],inplace=True)
    df_temp.rename(columns={"score":"Scores", "Name":"Product_Code"},inplace=True)
    top_5k[ts_types[n]] = df_temp

In [None]:
top_5k["TS"] = top_5k["TS"][["Product_Code"]].assign(standard=True)
top_5k["TS_enhanced"] = top_5k["TS_enhanced"][["Product_Code"]].assign(enhanced=True)
top_5k["TS_Boltzmann"] = top_5k["TS_Boltzmann"][["Product_Code"]].assign(boltzmann=True)

In [None]:
# Merge the dataframes
df_top = pd.merge(pd.merge(pd.merge(top_5k_truth, top_5k["TS"], how="left", on="Product_Code"),
                  top_5k["TS_enhanced"], how="left", on="Product_Code"),
                  top_5k["TS_Boltzmann"], how="left", on="Product_Code")
df_top.head()

In [None]:
fig, ax = plt.subplots()
top_ns = [10, 25, 50, 100, 200, 300, 400, 500]
for col in ["standard", "enhanced", "boltzmann"]:
    ax.plot(top_ns, [df_top.head(n)[col].sum() / n for n in top_ns], label=col, marker="o")
ax.set_xlabel("top N")
ax.set_ylabel("fraction_found")
ax.axhline(1, color="k", linestyle="--", zorder=0)
ax.set_title("Frac of top N found")
ax.legend()
pass

It appears that the standard TS slightly underperforms the enhanced TS. The Boltzmann sampling's performance remains poor despite increasing the cutoff for the top n compounds.
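
The "fraction of top N found" metric, and its mean and standard deviation across repeated runs, can be sketched with the standard library alone. The helper names and product codes are hypothetical; the notebook computes the same quantities from merged `pandas` dataframes.

```python
from statistics import mean, stdev


def fraction_of_top_n_found(reference_ranked, found, n):
    """Fraction of the n best reference products that a run recovered."""
    if not 0 < n <= len(reference_ranked):
        raise ValueError("n must be between 1 and the size of the reference list")
    found = set(found)
    return sum(code in found for code in reference_ranked[:n]) / n


def summarize_runs(reference_ranked, runs, top_ns):
    """Map each top-N cutoff to (mean fraction, standard deviation) over runs."""
    summary = {}
    for n in top_ns:
        fractions = [fraction_of_top_n_found(reference_ranked, run, n) for run in runs]
        spread = stdev(fractions) if len(fractions) > 1 else 0.0
        summary[n] = (mean(fractions), spread)
    return summary


# Made-up reference ranking (best first) and two made-up runs
reference_ranked = ["P1", "P2", "P3", "P4"]
runs = [["P1", "P3"], ["P1", "P2", "P4"]]
for n, (avg, spread) in summarize_runs(reference_ranked, runs, top_ns=[2, 4]).items():
    print(f"top {n}: mean={avg:.3f} std={spread:.3f}")
```

Reporting the spread across runs helps judge whether a small gap between two methods is larger than the run-to-run variation.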

In [None]:
# We are interested in looking at the plot with error bars
# Let's generate these error bars using the cycle data
ts_comp = []
ts_types = ["TS", "TS_enhanced", "TS_Boltzmann"] # Types of TS to compare
for n in range(0,10):
    # Get the top 5000 compounds for each method
    # Rename columns
    ts_dfs_temp = [ts_df_list[n].copy(), ts_enhanced_df_list[n].copy(), ts_boltzmann_df_list[n].copy()]
    ts_dfs_temp = [x.sort_values(by="score", ascending=True).head(5000) for x in ts_dfs_temp]
    ts_dfs_temp = [x.drop(columns=["SMILES"]) for x in ts_dfs_temp]
    ts_dfs_temp = [x.rename(columns={"score":"Scores", "Name":"Product_Code"}) for x in ts_dfs_temp]
    top_5k_temp = {}
    # Use a separate index name so the outer loop variable `n` is not shadowed
    for idx, ts_type in enumerate(ts_types):
        top_5k_temp[ts_type] = ts_dfs_temp[idx] # Assign the appropriate dataframe to the dictionary
        if idx == 0: # Standard TS
            top_5k_temp[ts_type] = top_5k_temp[ts_type][["Product_Code"]].assign(standard=True)
        elif idx == 1: # Enhanced TS
            top_5k_temp[ts_type] = top_5k_temp[ts_type][["Product_Code"]].assign(enhanced=True)
        else: # Boltzmann TS
            top_5k_temp[ts_type] = top_5k_temp[ts_type][["Product_Code"]].assign(boltzmann=True)
    # Merge the dataframes
    df_top = pd.merge(pd.merge(pd.merge(top_5k_truth, top_5k_temp["TS"], how="left", on="Product_Code"),
                  top_5k_temp["TS_enhanced"], how="left", on="Product_Code"),
                  top_5k_temp["TS_Boltzmann"], how="left", on="Product_Code")
    # Calculate the fraction of hits found for each method
    ts_comp.append(df_top)


In [None]:
top_ns = [10, 25, 50, 100, 200, 300, 400, 500]
rows = []
for cycle_id, ts_comp_df in enumerate(ts_comp):
    for col in ["standard", "enhanced", "boltzmann"]:
        for n in top_ns:
            frac_top_ns = ts_comp_df.head(n)[col].sum() / n
            rows.append({"cycle":cycle_id, "top_n":n, "method":col, "frac_top_n":frac_top_ns})
# Build the dataframe once instead of concatenating inside the loop
top_ns_frac = pd.DataFrame(rows, columns=["cycle","top_n", "method", "frac_top_n"])
top_ns_frac.head()

In [None]:
# Generate dataframe with mean, std and count of frac_top_n for each top_n and method
grouped_stats = top_ns_frac.groupby(['top_n', 'method'])['frac_top_n'].agg(
    mean='mean',
    std='std',
    count='count'
).reset_index()

print(grouped_stats)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6)) # set the size of the plot
sns.set_theme(style="darkgrid", palette="tab10", font_scale=1.2)

# Cast top_n to string so it is plotted as evenly spaced categories
grouped_stats['top_n'] = grouped_stats['top_n'].astype(str)

# Line plot per method; error bars are added manually below
sns.lineplot(
    data=grouped_stats,
    x='top_n',
    y='mean',
    hue='method',   # one colored line per method
    marker='o',
    errorbar=None,  # disable the built-in confidence interval
    linewidth=2.5
)

# Add error bars manually using plt.errorbar
for _, row in grouped_stats.iterrows():
    if pd.notnull(row['std']):
        plt.errorbar(
            x=row['top_n'],
            y=row['mean'],
            yerr=row['std'],
            fmt='none',
            capsize=4,
            ecolor='gray'
        )

plt.title("Mean top N fraction found for each method across 10 cycles")
plt.xlabel("top n compounds")
plt.ylabel("Mean fraction found")
plt.show()

In [None]:
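fig, ax = plt.subplots()
top_ns = [10, 25, 50, 100, 200, 300, 400, 500]
# One color per method, and one legend entry per method instead of one per run
colors = {"standard": "C0", "enhanced": "C1", "boltzmann": "C2"}
for col in ["standard", "enhanced", "boltzmann"]:
    for run_id, ts_comp_df in enumerate(ts_comp):
        ax.plot(top_ns, [ts_comp_df.head(n)[col].sum() / n for n in top_ns],
                color=colors[col], label=col if run_id == 0 else None, marker="o")
ax.set_xlabel("top N")
ax.set_ylabel("fraction_found")
ax.axhline(1, color="k", linestyle="--", zorder=0)
ax.set_title("Frac of top N found")
ax.legend()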

In [None]:
# Extract the product codes as a list
product_codes = ts_enhanced_df["Name"].to_list()

# Initialize counters for each position
position_counters = []

# Iterate through the product codes
for product_code in product_codes:
    building_blocks = product_code.split("_")  # Split the product code by "_"
    # Ensure the position_counters list is large enough to handle all positions
    while len(position_counters) < len(building_blocks):
        position_counters.append(Counter())
    # Update the counters for each position
    for i, block in enumerate(building_blocks):
        position_counters[i][block] += 1

# Find the top 20 building blocks for each position
for i, counter in enumerate(position_counters):
    print(f"Top 20 building blocks for position {i + 1}:")
    print(f"{'Building Block':<20}{'Frequency':<10}")
    print("-" * 30)
    for block, count in counter.most_common(20):
        print(f"{block:<20}{count:<10}")
    print("\n")

In [None]:
# Collect the top 20 building blocks for each position
top_20_building_blocks_per_position = []
for counter in position_counters:
    top_20_blocks = [block for block, _ in counter.most_common(20)]
    top_20_building_blocks_per_position.append(top_20_blocks)

# Convert the list to a tuple
top_20_building_blocks_tuple = tuple(top_20_building_blocks_per_position)

In [None]:
total_molecules = 5000  # The top 1% of products
# Combine the two dictionaries
combined_smiles_dict = {**amino_acid_bb_dict, **acids_bb_dict}

# Iterate through each position's top 20 building blocks
for i, counter in enumerate(position_counters):
    # Get the top 20 building blocks for the current position
    top_20_blocks = counter.most_common(20)
    
    # Create a list of RDKit molecules and their labels
    mols = []
    legends = []
    for block, freq in top_20_blocks:
        if block in combined_smiles_dict:  # Use the combined dictionary
            smiles = combined_smiles_dict[block]
            mol = Chem.MolFromSmiles(smiles)
            if mol:
                mols.append(mol)
                # Add building block name and frequency on the first line, fraction on the second line
                legends.append(f"{block} (Freq: {freq})\nFraction: {round((freq / total_molecules) * 100, 2)}%")
    
    # Visualize the molecules in a grid
    img = Draw.MolsToGridImage(
        mols, legends=legends, molsPerRow=5, subImgSize=(300, 300)
    )
    
    # Display the title and the image
    print(f"Top 20 Building Blocks for Position {i + 1}")
    display(img)  # Display the image in the Jupyter Notebook