# __🌿 RetroGSF: Green Solvent Finder for Sustainable Chemistry!__


__🧪 Why Green Solvents?__<br>
Molecule synthesis is a fundamental aspect of chemistry, playing a particularly central role in organic chemistry. However, alongside the valuable compounds produced, chemical processes often generate significant amounts of waste. Among the various contributors to this waste, solvents play a particularly important role. They are widely used to dissolve reagents, regulate reaction temperatures, and assist in purification steps. Unfortunately, many of these solvents are volatile, toxic, and difficult to recycle. In fact, solvents can represent up to 90% of the total mass used in a typical chemical process, making their careful selection a critical concern in sustainable chemistry.

The concept of green chemistry was introduced in the 1990s. Green chemistry focuses on designing products and processes that minimize the use and generation of hazardous substances. Beyond reducing environmental impact, it also aims to improve process efficiency, lower operational costs, and enhance safety for both people and the environment.

__⚡ Why use RetroGSF?__ <br>
This project has three motivations: to reduce the environmental footprint of chemical synthesis by identifying greener solvent alternatives, to give chemists a user-friendly tool for integrating sustainability into their workflows, and to promote green chemistry principles in both academic and industrial settings.

To support these goals, our package RetroGSF (Retrosynthesis Green Solvents Finder) was developed. This tool identifies possible synthetic pathways for a given target molecule using the retrosynthetic algorithm AiZynthFinder. It then determines the most likely solvent traditionally used for each reaction step and proposes alternative greener solvents based on their impact on human health, environmental safety, and overall sustainability. The aim is to encourage more environmentally responsible decision-making in organic synthesis by offering practical and data-driven alternatives.


## __⚙️Functionalities of the package__


In [8]:
import pprint
import numpy as np
import pandas as pd
from pathlib import Path
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole
from rxn_insight.reaction import Reaction
from aizynthfinder.aizynthfinder import AiZynthExpander
from dotenv import load_dotenv

### 🔬 1. Retrosynthetic Pathway Identification
Retrosynthetic pathways are generated with the __AiZynthFinder library__, which proposes possible synthetic routes for a target molecule by analyzing reaction rules and databases. The tool identifies potential pathways and ranks them by feasibility and other criteria. More information on the library is available in the [AiZynthFinder](https://github.com/MolecularAI/aizynthfinder) GitHub repository.

The function **`retrosynthesis_reaction_smiles`** is designed to perform retrosynthesis for a given target molecule (in SMILES format) and return a table of one-step reactions in forward order. This table includes details such as reactants, products, reaction SMILES and how likely they are to occur.

__How the function works__ <br>
- Input: Provide the target molecule in SMILES format and the path to the AiZynthFinder configuration file (config.yml).
- Output: The function returns a pandas DataFrame containing the retrosynthetic steps, including reactants, products, and reaction SMILES.
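
The phrase "forward order" can be illustrated with a short sketch. It assumes that a retrosynthetic step is written as `target>>precursors` and that the forward reaction is obtained by swapping the two sides. The helper name `to_forward_order` is hypothetical and is not part of RetroGSF or AiZynthFinder.

```python
def to_forward_order(retro_rxn_smiles: str) -> str:
    """Turn 'target>agents>precursors' into 'precursors>agents>target'."""
    # A reaction SMILES has exactly two '>' separators; the middle part
    # (agents) is usually empty, which gives the familiar '>>'.
    parts = retro_rxn_smiles.split(">")
    if len(parts) != 3:
        raise ValueError(f"not a reaction SMILES: {retro_rxn_smiles!r}")
    target, agents, precursors = parts
    return f"{precursors}>{agents}>{target}"


# Aspirin disconnected into acetic acid and salicylic acid
# (unmapped SMILES, written by hand for readability).
retro = "CC(=O)Oc1ccccc1C(=O)O>>CC(=O)O.Oc1ccccc1C(=O)O"
forward = to_forward_order(retro)
print(forward)
```

Swapping the sides is a pure string operation here; it does not check that the SMILES on either side are chemically valid.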

In [3]:
from retrogsf import retrosynthesis_reaction_smiles

product = "CC(=O)OC1=CC=CC=C1C(=O)O" # Test with Aspirin
config_path = "/Users/diego/Desktop/EPFL/Prog. in Chem/data_download/config.yml" 

result = retrosynthesis_reaction_smiles(product, config_path)

result

Unnamed: 0,template_hash,classification,library_occurence,policy_probability,policy_probability_rank,policy_name,template_code,template,feasibility,expansion_rank,mapped_reaction_smiles,smarts
0,f1de1ec6a5a54eb1b0f6cf98f6f48dc9e84bdf43b1b8bd...,0.0 Unrecognized,1196,0.7262,0,uspto,40152,[C;D1;H3:2]-[C;H0;D3;+0:1](=[O;D1;H0:3])-[O;H0...,0.999817,1,[CH3:1][C:2](=[O:3])[O:4][c:5]1[cH:6][cH:7][cH...,[C;D1;H3:2]-[C;H0;D3;+0:1](=[O;D1;H0:3])-[O;H0...
1,4cb17f48310d9c4a91b644c3e86f83cfb7ada406795575...,0.0 Unrecognized,17,0.0006,39,uspto,12855,[C;D1;H3:3]-[C:2](=[O;D1;H0:4])-[O;H0;D2;+0:1]...,0.999817,6,[CH3:1][C:2](=[O:3])[O:4][c:5]1[cH:6][cH:7][cH...,[C;D1;H3:3]-[C:2](=[O;D1;H0:4])-[O;H0;D2;+0:1]...
2,01643639d6a55c16f7f30c6505aeea5e206f45f41edb94...,0.0 Unrecognized,1107,0.0922,1,uspto,248,[C:2]-[C;H0;D3;+0:1](=[O;D1;H0:3])-[O;H0;D2;+0...,0.992561,2,[CH3:1][C:2](=[O:3])[O:4][c:5]1[cH:6][cH:7][cH...,[C:2]-[C;H0;D3;+0:1](=[O;D1;H0:3])-[O;H0;D2;+0...
3,cee4377ed1ef82bed1c1edf57d4eb93df1fc89daf8095c...,0.0 Unrecognized,13,0.0344,2,uspto,34418,[O;D1;H0:2]=[C;H0;D3;+0:1](-[OH;D1;+0:4])-[c:3...,0.996691,3,[CH3:1][C:2](=[O:3])[O:4][c:5]1[cH:6][cH:7][cH...,[O;D1;H0:2]=[C;H0;D3;+0:1](-[OH;D1;+0:4])-[c:3...
4,c0302ca933697a2750f59bf7c42ab18a4c477739ae114f...,0.0 Unrecognized,17049,0.0189,3,uspto,31992,[O;D1;H0:3]=[C:2](-[OH;D1;+0:1])-[c:4]1:[c:5]:...,0.021282,4,[CH3:1][C:2](=[O:3])[O:4][c:5]1[cH:6][cH:7][cH...,[O;D1;H0:3]=[C:2](-[OH;D1;+0:1])-[c:4]1:[c:5]:...
5,322bd81f163f002c0550f9bec3699b76ea0320685cb5fb...,0.0 Unrecognized,11198,0.0019,14,uspto,8417,[O;D1;H0:3]=[C:2](-[OH;D1;+0:1])-[c:4]>>C-[O;H...,0.021282,6,[CH3:1][C:2](=[O:3])[O:4][c:5]1[cH:6][cH:7][cH...,[O;D1;H0:3]=[C:2](-[OH;D1;+0:1])-[c:4]>>C-[O;H...
6,b35a47f32347132b8f9c0faa6d32559e86e9733c6f11a0...,0.0 Unrecognized,481,0.0114,4,uspto,29952,[O;D1;H0:1]=[C:2](-[OH;D1;+0:3])-[c:4]1:[c:5]:...,0.844527,5,[CH3:1][C:2](=[O:3])[O:4][c:5]1[cH:6][cH:7][cH...,[O;D1;H0:1]=[C:2](-[OH;D1;+0:3])-[c:4]1:[c:5]:...
7,e2e3e9afc65b69c0dc7956a9cb3b8e87ee9178a11073fe...,0.0 Unrecognized,346,0.0007,34,uspto,37656,[O;D1;H0:1]=[C:2](-[OH;D1;+0:4])-[c:3]>>[O;D1;...,0.844527,6,[CH3:1][C:2](=[O:3])[O:4][c:5]1[cH:6][cH:7][cH...,[O;D1;H0:1]=[C:2](-[OH;D1;+0:4])-[c:3]>>[O;D1;...


### 💡 2. Reaction Information
The **`rxn_info`** function returns the reaction name or class of a reaction from a DataFrame containing reaction SMILES. This information is obtained with [Rxn-INSIGHT](https://github.com/mrodobbe/Rxn-INSIGHT), an algorithm that suggests reaction conditions based on similarity; more information is available on its GitHub page.

__How the function works__ <br>
- Input: A pandas DataFrame (df) that contains a column called 'mapped_reaction_smiles'. This DataFrame is usually the output of the retrosynthesis_reaction_smiles function.
- Process:
    - Takes the first reaction SMILES from the DataFrame, since this is the most likely pathway AiZynthFinder has found.
    - Creates a Reaction object using the rxn_insight.reaction library.
    - Calls get_reaction_info() on this object, which returns a dictionary with reaction metadata.

    If the "NAME" field is not "OtherReaction", it returns the value of "NAME".  
    If the "NAME" is "OtherReaction", it returns the value of the "CLASS" field instead.

- Output: A string representing either the specific reaction name or, if unavailable, the broader reaction class.
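
The name-or-class fallback described above can be sketched with a plain dictionary. The `"NAME"` and `"CLASS"` keys and the `"OtherReaction"` sentinel come from the description; the helper name `pick_reaction_label`, the `"Unknown"` default, and the example dictionaries are assumptions made for illustration.

```python
def pick_reaction_label(info: dict) -> str:
    """Return the specific reaction name, or the broader class as a fallback."""
    name = info.get("NAME", "OtherReaction")
    if name != "OtherReaction":
        return name
    # "Unknown" is an assumed default for a dictionary without a CLASS entry.
    return info.get("CLASS", "Unknown")


# Hand-written dictionaries standing in for the reaction metadata.
named = {"NAME": "Esterification", "CLASS": "Acylation"}
unnamed = {"NAME": "OtherReaction", "CLASS": "Acylation"}

print(pick_reaction_label(named))    # specific name is available
print(pick_reaction_label(unnamed))  # falls back to the class
```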

In [4]:
from retrogsf import rxn_info

reaction_information = rxn_info(result)

print(reaction_information)

Hydrolysis or Hydrogenolysis of Carboxylic Esters or Thioesters



__📌 Limitations__ <br>
- The **`rxn_info`** function may not always return detailed information: Rxn-INSIGHT may not recognize every reaction, particularly uncommon or less-documented ones.



### 🔍 3. Solvent Identification

The **`get_solvents_for_reaction`** function takes as input a reaction class or reaction name (such as "Esterification" or "Hydrolysis of Carboxylic Esters") and a Gemini AI API key. It uses Google Gemini AI to predict the most likely solvent(s) for the given reaction type.

__How the function works__ <br>
- Input: The reaction class or name (string) and the Gemini AI API key (string, or read from an environment variable).

- Process:
    - The function prepares a prompt describing the reaction type and a list of allowed solvents (to ensure the answer is relevant and safe).
    - It sends this prompt to the Gemini AI model using the provided API key.
    - Gemini AI suggests the most likely solvents (in SMILES format) for the specified reaction type. These solvents must come from a list of 272 known solvents.
    - The function returns these solvent SMILES as a string.

- Output: A string of one to three SMILES codes for solvents, separated by commas, representing the most likely solvents for the reaction.
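
The prompt-and-parse pattern described above can be sketched without calling any AI service. Everything in this block is an assumption made for illustration: the helper names `build_prompt` and `parse_reply`, the prompt wording, the three-entry `ALLOWED` dictionary standing in for the 272-solvent list, and the idea of checking the reply against that list (the source does not say whether the package validates the reply).

```python
# Tiny stand-in for the list of allowed solvents (SMILES -> name).
ALLOWED = {"CCO": "Ethanol", "O": "Water", "CC(C)O": "Propan-2-ol"}


def build_prompt(reaction_name: str, allowed: dict) -> str:
    """Describe the reaction and restrict the answer to the allowed solvents."""
    listing = "\n".join(f"- {name}: {smi}" for smi, name in allowed.items())
    return (
        f"Reaction type: {reaction_name}\n"
        "Pick the one to three most likely solvents from this list only "
        "and answer with their SMILES separated by commas:\n"
        f"{listing}"
    )


def parse_reply(reply: str, allowed: dict, max_solvents: int = 3) -> str:
    """Keep only SMILES that are in the allowed list, without duplicates."""
    picked = []
    for token in reply.split(","):
        smi = token.strip()
        if smi in allowed and smi not in picked:
            picked.append(smi)
    return ",".join(picked[:max_solvents])


prompt = build_prompt("Esterification", ALLOWED)
# A hand-written reply; "ClCCl" is not in ALLOWED and is dropped.
print(parse_reply("CCO, ClCCl, O", ALLOWED))
```

Filtering the reply against the allowed list is one way to guard against a model answering with a solvent outside the dataset.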

__🌐 Why use AI?__ <br>
Gemini AI relies on a large language model trained on data from multiple domains, allowing it to make informed predictions even for less common or complex reactions.

In [5]:
from retrogsf import get_solvents_for_reaction


solvents = get_solvents_for_reaction(reaction_information)

print(solvents)

CCO


__📌 Limitations__ <br>
- The performance of the **`get_solvents_for_reaction`** function depends on the quality and diversity of the data the AI model was trained on. If the model has not been sufficiently exposed to certain reaction names or classes, the suggested solvents may lack consistency and could vary between iterations for the same reaction.

- Additionally, it can be difficult to verify whether the proposed solvent is appropriate or commonly used for the synthesis of the target molecule, especially in the absence of experimental validation or supporting literature references. As a result, the recommendations should be considered as indicative suggestions rather than definitive choices.



### 🌱 4. Green Solvent Ranking

The **`rank_similar_solvents`** function identifies and ranks alternative solvents based on their similarity to a target solvent. The ranking is performed using physical properties (e.g., density, dielectric constant, dipole moment and refractive index) as well as green chemistry criteria (e.g., safety, health and environmental impact).

The dataset used by the function, Solvent_properties_with_smiles.csv, is a simplified version of a database made by the [ACS Green Chemistry Institute](https://acsgcipr.org/tools/solvent-tool/) that gathers information on the properties and structure of 272 solvents.

__How the function works__ <br>
- Input Validation:
    - The function checks if the target solvent (SMILES) exists in the dataset.
    - If the solvent is classified as hazardous, the user is warned, and only safer alternatives are recommended.

- Filtering:
    - Solvents classified as "Hazardous" or "Highly Hazardous" are excluded.
    - Solvents with environmental, health, and safety rankings greater than 5 are filtered out.
    - Solvents with incompatible melting and boiling points are excluded.

- Similarity Scoring:
    - A weighted relative distance between each candidate's physical properties and those of the given solvent is calculated.
    - This similarity score is used to rank solvents; a lower score indicates higher similarity.

- Output: The function returns a dictionary containing:
    - Target solvent properties. 
    - Ranked solvents by similarity, environmental impact, health impact, safety, and overall ranking.
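
The filtering and scoring steps can be sketched as follows. This is a minimal illustration and not the package's code: the weights in `WEIGHTS`, the helper names `is_candidate` and `similarity_score`, the reading that each of the three rankings must be at most 5, and the choice to skip missing values and renormalise by the weights actually used are all assumptions. The property values are taken from the example output of `rank_similar_solvents` in this notebook.

```python
import math

# Hypothetical weights; the package's actual weights are not shown in this document.
WEIGHTS = {"Density": 0.2, "Dielectric": 0.4, "Dipole": 0.3, "Refractive Index": 0.1}


def is_candidate(row: dict) -> bool:
    """Drop hazardous solvents and any solvent with a ranking above 5."""
    if row["Adjusted ranking"] in ("Hazardous", "Highly Hazardous"):
        return False
    keys = ("Environment Ranking", "Health Ranking", "Safety Ranking")
    return all(row[k] <= 5 for k in keys)


def similarity_score(target: dict, other: dict, weights: dict = WEIGHTS) -> float:
    """Weighted mean of |other - target| / |target|; lower means more similar."""
    total, used = 0.0, 0.0
    for prop, w in weights.items():
        t, o = target[prop], other[prop]
        if math.isnan(t) or math.isnan(o) or t == 0:
            continue  # skip properties that cannot be compared
        total += w * abs(o - t) / abs(t)
        used += w
    return total / used if used else math.inf


ethanol = {"Adjusted ranking": "Recommended", "Environment Ranking": 3,
           "Health Ranking": 3, "Safety Ranking": 4, "Density": 0.785,
           "Dielectric": 24.6, "Dipole": 1.74, "Refractive Index": 1.361}
candidates = {
    "Propan-1-ol": {"Density": 0.804, "Dielectric": 20.5,
                    "Dipole": 1.65, "Refractive Index": 1.386},
    "iso-Butanol": {"Density": 0.794, "Dielectric": 17.9,
                    "Dipole": 1.64, "Refractive Index": 1.396},
}

ranked = sorted(candidates, key=lambda n: similarity_score(ethanol, candidates[n]))
print(is_candidate(ethanol), ranked)
```

With these assumed weights the order matches the first two rows of the example output, but different weights could change it.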


In [21]:
from retrogsf import rank_similar_solvents

target_solvent_smiles = solvents.split(",")[0].strip()
ranked_solvents = rank_similar_solvents(target_solvent_smiles)

print("Target solvent properties:")
pprint.pprint(ranked_solvents["target_solvent_properties"])

print("\nRanked by similarity:")
print(ranked_solvents["by_similarity"])

print("\nRanked by environmental impact:")
print(ranked_solvents["by_environment"])

print("\nRanked by health impact:")
print(ranked_solvents["by_health"])

print("\nRanked by safety:")
print(ranked_solvents["by_safety"])

print("\nRanked by overall score:")
print(ranked_solvents["by_overall_ranking"])

Target solvent properties:
{'Adjusted ranking': 'Recommended',
 'Boiling point': 78.3,
 'Density': 0.785,
 'Dielectric': 24.6,
 'Dipole': 1.74,
 'Environment Ranking': 3,
 'Health Ranking': 3,
 'Melting point': -114.5,
 'Name': 'Ethanol',
 'Refractive Index': 1.361,
 'SMILES': 'CCO',
 'Safety Ranking': 4}

Ranked by similarity:
                                  Name      SMILES  Density  Dielectric  \
109                        Propan-1-ol        CCCO    0.804        20.5   
114  iso-Butanol [2-Methylpropan-1-ol]     CC(C)CO    0.794        17.9   
15                          Butan-2-ol     CCC(C)O    0.808        16.6   
18                  3-Methylbutan-1-ol    CC(C)CCO    0.809        15.2   
94                    Ethylcyclohexane  CCC1CCCCC1    0.780         NaN   

     Dipole  Refractive Index  Melting point  Boiling point  
109    1.65             1.386         -126.2           97.2  
114    1.64             1.396         -108.0          107.8  
15     1.65             1.397    


__📌 Limitations__ <br>
- The ranking of alternative solvents is based on physical properties such as density, dielectric constant, dipole moment, and refractive index, using weights that were subjectively assigned. These weights may not accurately represent the true importance of each property for every reaction, and the resulting ranking might not always align with practical, industrial, or environmental preferences.

- The function returns alternative solvents with similar physical properties; however, these alternatives may have higher environmental, health, and safety scores, making them less green than the target solvent. This is particularly true for water, which holds the best possible green score (1 for each criterion), making it one of the greenest solvents available. If the function is applied to water, all suggested alternatives will be less green. In such cases, the target solvent is already the optimal choice from a sustainability perspective.


### 💻 5. Streamlit Interface

A Streamlit interface (**`app.py`**) was developed to present the results of the different functions in a clean and user-friendly way. In this web application, users enter the SMILES of the molecule they want to synthesize, and the application displays the reaction diagram together with the alternative solvents.



In [None]:
!streamlit run app.py

### 🛠️ 6. Other functions

- **`unmap_reaction_smiles`** is a function that takes a mapped reaction SMILES string and returns the corresponding reaction SMILES without atom mapping information.

- **`draw_reaction_with_solvent`** is a function that takes the SMILES of the reactants, product, and solvent, and generates a clean image of the reaction scheme including the solvent.
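
As an illustration of what removing atom mapping means, the sketch below strips the `:n` map numbers with a regular expression. This is an assumed approach and not necessarily how `unmap_reaction_smiles` is implemented; the helper name `strip_atom_maps` is hypothetical, and the mapped reaction is a hand-written esterification example.

```python
import re


def strip_atom_maps(mapped_rxn_smiles: str) -> str:
    """Remove ':n' atom-map numbers that appear just before a closing bracket."""
    return re.sub(r":\d+(?=\])", "", mapped_rxn_smiles)


# Acetic acid + methanol -> methyl acetate, with atom-map numbers.
mapped = (
    "[CH3:1][C:2](=[O:3])[OH:4].[OH:5][CH3:6]"
    ">>[CH3:1][C:2](=[O:3])[O:5][CH3:6]"
)
print(strip_atom_maps(mapped))
```

The square brackets are left in place, so the result is still a valid SMILES but not a canonical one; parsing and rewriting it with a cheminformatics toolkit such as RDKit would give a cleaner string.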



## 📋General Limitations


- __Lack of Reaction-Solvent Database__   
   We were unable to find a complete database that directly links reaction SMILES to the solvents typically used for those reactions. As a result, we relied on AI (Gemini) to predict suitable solvents. While this approach is flexible, it is also time- and energy-consuming, and the quality of the suggestions depends on the AI model's training and context.

- __Limited Scope of Reaction Types__   
   The retrosynthesis and solvent prediction are limited to the reaction types and templates covered by AiZynthFinder and Rxn-INSIGHT. Uncommon reactions may not be well supported.


- __User Expertise Required__  
   Effective use of the tool still requires users to have some background in chemistry, especially to interpret the results and make final decisions.

## 🎯 Improvements

- __Develop or Integrate a Reaction-Solvent Database__   
   Building or integrating a complete database that links reaction SMILES or classes to commonly used solvents would improve accuracy and reduce reliance on AI.


- __Refine Ranking Methodology__  
   The ranking system could be improved by incorporating feedback from chemists, using machine learning to optimize weights, or including additional criteria such as cost, availability, or regulatory status.


- __User Customization__  
   Allowing users to adjust the weights for ranking criteria or to input their own constraints would make the tool more flexible and relevant to specific needs.

## 📖 Conclusion
RetroGSF represents a significant step towards integrating green chemistry principles into organic synthesis. By providing data-driven insights and practical alternatives, it empowers chemists to make more sustainable choices in their workflows. While challenges remain, the tool's potential to reduce the environmental impact of chemical processes is substantial. Future developments will focus on expanding its capabilities and accessibility, further promoting the adoption of green chemistry in both academia and industry.