# Retrieve SMILES of Common Solvents

### What This Creates

* A CSV file of compound identifiers (`nlm_num`) and SMILES strings for all the common solvents in the solvents table of the `hsp.sqlite` database

### Rationale

*  Why This?  The program `substructure_search.py` generates group counts for the group-contribution-based estimate of Hansen Solubility Parameters.  Before running the entire set of 120,000 compounds, a validation set of compounds with well-known structures is needed to make sure that groups are automatically counted correctly.

*  Why Me?  Because I hand-translated the Fedors groups to SMARTS patterns, and no one else has validated that these patterns work as expected.

*  Why Now?  Validating the SMARTS-based group counts is the final step needed before generating Hansen Solubility Parameters for all the compounds in the National Library of Medicine database, which is the key achievement to be demonstrated at the ACS meeting on Aug 26.

### Requirements

* Pandas 0.25.0

### Input / Output

*  The notebook should be in `solubility_parameters/notebooks`.  The solvents table (`solvents.csv`) is expected in `solubility_parameters/db_files`, and the SMILES tables are expected in `solubility_parameters/smiles`, split across 11 files of up to 10,000 entries each with filenames of the form `ids_smiles_{*}.csv`.

* The output file will be placed in `solubility_parameters/aprl-ssp`, which is the folder where the program `substructure_search.py` resides and expects to see the SMILES tables.  The output filename will be `solvent_smiles.csv`.
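
The folder layout above can be checked before running the cells. The following is a minimal sketch, assuming the three sub-folders share a common parent directory; the helper name `missing_inputs` is hypothetical and not part of the project.

```python
from pathlib import Path


def missing_inputs(base_dir):
    """Return the expected input/output locations that are absent under base_dir.

    Hypothetical helper: the folder and file names come from the layout
    described above; nothing here is part of substructure_search.py.
    """
    base = Path(base_dir)
    missing = []

    # The solvents table exported from the database.
    solvents_csv = base / "db_files" / "solvents.csv"
    if not solvents_csv.is_file():
        missing.append(str(solvents_csv))

    # At least one SMILES table matching ids_smiles_*.csv must exist.
    smiles_dir = base / "smiles"
    smiles_files = sorted(smiles_dir.glob("ids_smiles_*.csv")) if smiles_dir.is_dir() else []
    if not smiles_files:
        missing.append(str(smiles_dir / "ids_smiles_*.csv"))

    # The output folder must already exist for to_csv() to succeed.
    out_dir = base / "aprl-ssp"
    if not out_dir.is_dir():
        missing.append(str(out_dir))

    return missing
```

From the `notebooks` folder, `missing_inputs("..")` would return an empty list when everything is in place.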

### Import Set-Up

In [1]:
import pandas as pd
import glob

In [5]:
# Read every SMILES file (no header row) and concatenate them in one pass.
smiles_files = glob.glob('../smiles/ids_smiles_*.csv')
smiles_df = pd.concat([pd.read_csv(smiles_file, header=None) for smiles_file in smiles_files])
smiles_df.head()

Unnamed: 0,0,1
0,34742,CCCCOC(=O)C1=CC=CC=C1C(=O)O
1,35676,CN1CCC2=CC(=C3C=C2[C@@H]1CC4=CC=C(C=C4)OC5=C6[...
2,38266,C[C@@H]1[C@H]([C@@H]([C@H]([C@@H](O1)O[C@@H]2[...
3,50000,C=O
4,50011,C(=N)(N)N.Cl


In [6]:
len(smiles_df)

108604

In [7]:
smiles_df.columns = ['nlm_num','smiles']
smiles_df.head()

Unnamed: 0,nlm_num,smiles
0,34742,CCCCOC(=O)C1=CC=CC=C1C(=O)O
1,35676,CN1CCC2=CC(=C3C=C2[C@@H]1CC4=CC=C(C=C4)OC5=C6[...
2,38266,C[C@@H]1[C@H]([C@@H]([C@H]([C@@H](O1)O[C@@H]2[...
3,50000,C=O
4,50011,C(=N)(N)N.Cl


In [8]:
solvents_df = pd.read_csv('../db_files/solvents.csv')
solvents_df.head()

Unnamed: 0,solvent_id,nlm_num,subst_display_name,subst_category,delta_d,delta_p,delta_h,mol_vol,src_id,src_ref,boil_pt,flash_pt,chem21_safety,chem21_health,chem21_env,chem21_rank,prop_src_id
0,0,64197,Acetic acid (Ethanoic acid),solvent,14.5,8.0,13.5,57.1,1,A1.5,118.0,39.0,3.0,7.0,3.0,2.0,2.0
1,1,108247,Acetic anhydride (Acetyl acetate),solvent rx,16.0,11.7,10.2,94.5,1,A1.6,139.0,49.0,3.0,7.0,3.0,2.0,2.0
2,2,67641,Acetone (Dimethyl ketone),solvent,15.5,10.4,7.0,74.0,1,A1.7,56.0,-18.0,5.0,3.0,5.0,1.0,2.0
3,3,75058,Acetonitrile (Methyl cyanide),solvent,15.3,18.0,6.1,52.6,1,A1.10,82.0,2.0,4.0,3.0,3.0,2.0,2.0
4,4,98862,Acetophenone (Acetylbenzene),solvent,19.6,8.6,3.7,117.4,1,A1.11,,,,,,,


### Process files

In [13]:
# A simple merge and clean-up to make the table have headings of 'compound','SMILES' for compatibility
# with substructure_search.py is all that is needed...
solvent_smiles_df = solvents_df.merge(smiles_df, on='nlm_num', how='inner')[['nlm_num','smiles']]
solvent_smiles_df = solvent_smiles_df.rename(columns = {'nlm_num':'compound','smiles':'SMILES'}).set_index('compound')
solvent_smiles_df.head()

compound,SMILES
64197,CC(=O)O
108247,CC(=O)OC(=O)C
67641,CC(=O)C
75058,CC#N
98862,CC(=O)C1=CC=CC=C1
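
An inner merge silently drops any solvent whose `nlm_num` has no entry in the SMILES tables, so it may be worth listing those solvents before trusting the validation set. The following is a minimal sketch, assuming the column names shown above; the helper `unmatched_solvents` and the identifier `999999` are hypothetical and used only for illustration.

```python
import pandas as pd


def unmatched_solvents(solvents_df, smiles_df, key="nlm_num"):
    """Return the rows of solvents_df whose key has no match in smiles_df."""
    # A left merge with indicator=True adds a '_merge' column that records
    # whether each solvent row found a partner in the SMILES table.
    merged = solvents_df.merge(
        smiles_df[[key]].drop_duplicates(), on=key, how="left", indicator=True
    )
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


# Toy data: the third identifier is made up and has no SMILES entry.
solvents = pd.DataFrame({
    "nlm_num": [64197, 67641, 999999],
    "subst_display_name": ["Acetic acid (Ethanoic acid)",
                           "Acetone (Dimethyl ketone)",
                           "Hypothetical solvent"],
})
smiles = pd.DataFrame({
    "nlm_num": [64197, 67641],
    "smiles": ["CC(=O)O", "CC(=O)C"],
})

missing = unmatched_solvents(solvents, smiles)
print(missing)
```

In the notebook, the same call would be `unmatched_solvents(solvents_df, smiles_df)`; an empty result means every solvent received a SMILES string.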


In [15]:
solvent_smiles_df.to_csv('../aprl-ssp/solvent_smiles.csv')
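
Reading the file back is a cheap way to confirm that it has the `compound` and `SMILES` headings mentioned in the comment above. The following is a minimal sketch; the helper `check_smiles_table` is hypothetical, and the checks for empty SMILES values and duplicate identifiers are assumptions about what a clean table looks like, not documented requirements of `substructure_search.py`.

```python
import io

import pandas as pd


def check_smiles_table(csv_source):
    """Return a list of problems found in a compound/SMILES CSV (empty if none)."""
    df = pd.read_csv(csv_source)
    problems = []

    # The headings must match exactly, in this order.
    if list(df.columns) != ["compound", "SMILES"]:
        problems.append("unexpected headings: {}".format(list(df.columns)))
        return problems

    # Assumed sanity checks: every row has a SMILES string, and each
    # compound identifier appears only once.
    if df["SMILES"].isna().any():
        problems.append("missing SMILES values")
    if df["compound"].duplicated().any():
        problems.append("duplicate compound identifiers")
    return problems


# In-memory sample using the first rows of the output shown above.
sample = io.StringIO(
    "compound,SMILES\n"
    "64197,CC(=O)O\n"
    "108247,CC(=O)OC(=O)C\n"
    "67641,CC(=O)C\n"
)
problems = check_smiles_table(sample)
print(problems)
```

In the notebook, the call would be `check_smiles_table('../aprl-ssp/solvent_smiles.csv')`.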