# Reaxys Data Extraction 

## 0. General Procedure

Each step I took is described in detail in the sections that follow. This section only summarizes the overall procedure.

1) Setup conda environment for RDKit, Pandas, and OpenBabel

2) Combine Excel files into one CSV

3) Convert CSV to dataframe and remove unnecessary features

4) Identify major product by cross referencing yield

5) Identify the two major reactants by atom economy (combined reactant mass must match the major product mass)

6) Distinguish between the major reactants, diene and dienophile (functional groups, bonding patterns, etc.)

7) Export final dataset


## 1. Set up Conda and Jupyter Environment

1) Download conda according to your OS (https://docs.conda.io/projects/conda/en/latest/user-guide/install/windows.html)

2) Set up RDKit environment (https://www.rdkit.org/docs/Install.html)

3) conda activate my-rdkit-env

4) conda install openbabel (https://anaconda.org/conda-forge/openbabel). OpenBabel is not needed for this script, but it is used in other scripts, so it is convenient to keep one environment for all of them.

5) conda install ipykernel pandas numpy (if the latter two don't work, try "conda install pip" and "pip install pandas numpy")

6) python -m ipykernel install --user --name my-rdkit-env --display-name "Python (my-rdkit-env)"

7) Go to Jupyter Notebook, click "kernel", "change kernel", and select your conda environment (you may need to restart Jupyter Notebook)

If the import cell in Section 2 raises errors, run "conda list" to check that the conda environment installed the packages correctly. If the packages are present, check that the kernel is set to the right environment. This was all done on macOS, so some steps may differ on Windows or Linux.
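
The two checks above (packages installed, kernel pointing at the right environment) can also be run from inside the notebook. The following is a minimal sketch using only the standard library; `check_packages` is a hypothetical helper name, and the package list is an assumption based on the install steps in this section.

```python
import importlib.util
import sys

def check_packages(names):
    """Return {package name: True/False} depending on whether Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

if __name__ == "__main__":
    # The interpreter path should point inside the conda environment (e.g. .../envs/my-rdkit-env/...)
    print(sys.executable)
    # Any package reported as False is not visible to the current kernel
    print(check_packages(["rdkit", "pandas", "numpy", "openbabel"]))
```

If the interpreter path does not contain the environment name, the kernel is most likely still set to a different environment.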

## 2. Import Packages

In [None]:
from rdkit import Chem
from rdkit.Chem import Descriptors, rdchem, Draw, rdFMCS
import pandas as pd
import csv
import numpy as np
from itertools import combinations

## 3. CSV to DataFrame

After assembling the Excel files into one document, convert it into a CSV. The CSV can then be loaded into a dataframe without needing additional modules such as openpyxl.
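
If each Excel export has already been saved as its own CSV, the files could also be combined in Python rather than by hand. This is only a sketch under that assumption: `combine_csv_texts` is a hypothetical helper, it works on CSV text held in memory, and it assumes every file has the same header row.

```python
import csv
import io

def combine_csv_texts(csv_texts):
    """Merge CSV documents that share one header row into (header, rows)."""
    header, rows = None, []
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text, newline=""))
        file_header = next(reader, None)
        if file_header is None:
            continue  # skip empty documents
        if header is None:
            header = file_header
        elif file_header != header:
            raise ValueError("CSV files do not share the same header row")
        # Keep every non-empty data row
        rows.extend(row for row in reader if row)
    return header, rows

# Two small made-up exports with identical columns
part_a = "Reaction ID,Reaction\n1,C=CC=C.C=C>>C1=CCCCC1\n"
part_b = "Reaction ID,Reaction\n2,C=CC=C.C#C>>C1=CCC=CC1\n"
header, rows = combine_csv_texts([part_a, part_b])
print(header, len(rows))
```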

In [None]:
def extract_df(file_path):
    ##CP1252 encoding required for macOS, remove if using windows
    with open(file_path, newline='',encoding='CP1252') as csvfile:
        data = []
        for n,row in enumerate(csv.reader(csvfile)):
            if n == 0: 
                ### Set first row of CSV to be the columns of the dataframe
                cols = row
            else:
                data.append(row)
        df = pd.DataFrame(data, columns=cols)
    return df
df = extract_df('mydataset.csv')
df 

## 4. Sort Dataset by Features

The features (columns) that you want to keep from your original dataset can be defined here. I create two datasets, termed simple and complex, each of which keeps only the listed columns.

In [None]:
def comp_simp_df(dataframe,complex_cols = ['Reaction ID','References','Reaction',
                                       'Other Conditions', 'Yield (numerical)',
                                       'Catalyst','Solvent (Reaction Details)'],
                  simple_cols = ['Reaction ID','References','Reaction','Yield (numerical)']):
    ### The column values to keep for the complex and simple datasets are outlined above
    ### A copy of the dataframe with only selected columns is made
    complex_df = dataframe[complex_cols].copy()
    simple_df = dataframe[simple_cols].copy()
    return complex_df,simple_df

complex_df,simple_df = comp_simp_df(df)
simple_df

## 5. Identify Major Product and Reactants

This function extracts the reactants and products from the reaction SMILES and then assigns the major product based on reported yield (if multiple products are shown). The two major reactants (diene and dienophile) are identified based on atom economy. The Diels-Alder reaction is 100% atom efficient, so mass(reactant_1) + mass(reactant_2) == mass(major_product). In the code, heavy-atom masses are compared and must agree to within 0.01. Differentiation between diene and dienophile comes later.

Three datasets are created. The first (df_react) is for reactions where the major product and two reactants were identified. The second (df_prod_filt) is for reactions where the major product could not be identified. The third (df_mass_filt) is for reactions where the two major reactants could not be identified. This could also include invalid reactions that are reported as Diels-Alder but actually are not.

RDKit errors will be generated but these can be ignored. These arise from invalid SMILES strings.
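
The first two steps described above (splitting the reaction SMILES and choosing the major product by yield) can be illustrated without pandas or RDKit. This is a simplified sketch, not the notebook function itself: `split_reaction` and `major_product` are hypothetical names, and the SMILES strings and yields are made up for illustration.

```python
def split_reaction(reaction_smiles):
    """Split 'r1.r2>>p1.p2' into a list of reactants and a list of products."""
    reactants, products = reaction_smiles.split(">>")
    return reactants.split("."), products.split(".")

def major_product(products, yield_field):
    """Return the product with the highest yield, or None if it cannot be decided.

    yield_field is a string of ';'-separated yields, e.g. '70;30'.
    """
    yields = yield_field.split(";")
    if len(yields) != len(products):
        # Number of yields does not match number of products
        return None
    if len(products) == 1:
        # One product with one or no reported yield
        return products[0]
    best = max(range(len(yields)), key=lambda i: float(yields[i]))
    return products[best]

reactants, products = split_reaction(
    "C=CC(C)=C.C=CC(=O)OC>>COC(=O)C1CCC(C)=CC1.COC(=O)C1CC(C)=CCC1"
)
print(major_product(products, "70;30"))
```

A return value of None corresponds to the reactions that the notebook sends to df_prod_filt.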

In [None]:
def SMILES_extract(dataframe):
    ### The reaction SMILES string (reactant1.reactant2.reactant3>>product1.product2) is in the 'Reaction' column
    ### This SMILES is split into reactants and products by '>>' and those are further split by '.'
    ### Columns of the dataframe are then dynamically extended to account for how many products and reactants there are
    dataframe[['Reactants','Products']] = dataframe['Reaction'].str.split('>>', expand=True)
    dataframe = pd.concat([dataframe,dataframe['Reactants'].str.split('.', expand=True).rename(columns=lambda x: f"Reactant_{x+1}")],axis=1)
    dataframe = pd.concat([dataframe,dataframe['Products'].str.split('.', expand=True).rename(columns=lambda x: f"Product_{x+1}")],axis=1)
    ### Drop rows with no products. These are cases where the product is listed in Reaxys but is not contained in the excel file export
    ### The solution to this would be to export in the XML format but this is harder to work with / requires learning XML formatting
    dataframe = dataframe.dropna(axis=0, how='any',subset=['Products'])
    
    ### This section is to determine the major product
    ### The column uses the object dtype so that SMILES strings can be stored alongside missing values
    dataframe['Major Product'] = pd.Series(np.nan, index=dataframe.index, dtype=object)
    prod_col = dataframe.columns.get_loc('Products')
    for n,row in enumerate(dataframe['Yield (numerical)']):
        yields = row.split(';')
        products = dataframe.iloc[n,prod_col].split('.')
        if len(yields) != len(products):
            ### Find situations where number of products != number of yields and so major product cannot be determined
            ### Note that if no yield or one yield is reported, len(yields) == 1, so single-product reactions are not accidentally ignored here
            pass
        else:
            if len(yields) > 1:
                ### If the number of products is equal to the number of yields and is greater than 1, select major product by max yield
                dataframe.iloc[n,-1] = products[np.argmax([float(i) for i in yields])]
            else:
                ### For cases of 1 product and 1 or 0 yields, the major product is the only product shown
                dataframe.iloc[n,-1] = dataframe.iloc[n,prod_col]
    ### All reactions where the major product could not be identified are sent to the dataframe 'df_prod_filt'
    df_prod_filt = dataframe[dataframe['Major Product'].isnull()]
    dataframe = dataframe.dropna(axis=0, how='all', subset=['Major Product'])

    ### This is a quick test to see if the simple or complex dataframe is used. This is to help with formatting later on
    ### The complex dataframe is recognized by its last column, which the simple dataframe does not contain
    complex_cols = ['Reaction ID','References','Reaction',
                    'Other Conditions', 'Yield (numerical)',
                    'Catalyst','Solvent (Reaction Details)']
    complex_active = complex_cols[-1] in dataframe.columns
    
    ### A new temporary dataframe is created to sort the reactants
    ### The first columns are the possible reactants, the next columns are the major products and the reaction IDS,
    ### and finally the last 3 columns are blank and will store the two reactants and information about intramolecular reactions
    df_react = dataframe.loc[:, dataframe.columns.str.startswith('Reactant_')]
    df_react = pd.concat([df_react,dataframe['Major Product']], axis = 1)
    df_react = pd.concat([df_react, dataframe['Reaction ID']], axis=1)
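    ### The new columns use the object dtype because they will hold SMILES strings and flags rather than numbers
    df_react['Component 1'] = pd.Series(np.nan, index=df_react.index, dtype=object)
    df_react['Component 2'] = pd.Series(np.nan, index=df_react.index, dtype=object)
    df_react['Intramolecular'] = pd.Series(np.nan, index=df_react.index, dtype=object)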

    problems = 0
    not_issue = 0
    ### NOTE! The usage of .iloc in this section is suited to this dataset only! Use this instead: dataframe.columns.get_loc('Reactants')
    for n,row in enumerate(df_react.iterrows()):
        try:
            ### The SMILES for the major product is converted to an RDKit object and the heavy atom weight stored 
            ### Heavy atom is used to avoid protonation errors and the try except block is used because of RDKit errors
            prod_m = Chem.MolFromSmiles(df_react.iloc[n, -5])
            prod_mass = Descriptors.HeavyAtomMolWt(prod_m)
        except:
            ### If an error occurs, the mass is set to a high number so it can be filtered out later
            prod_mass = 100000
        reac_dic = {}
        for x in range(0,len(df_react.columns)-5):
            try:
                ### The heavy atom mass of reactants is made and added to the reaction dictionary
                m = Chem.MolFromSmiles(df_react.iloc[n,x])
                reac_dic[df_react.iloc[n,x]]=Descriptors.HeavyAtomMolWt(m)
            except:
                ### In case of RDKit error, nothing is added to the reaction dictionary
                pass
        
        ### Various combinations of reactant masses are taken here. The reactant combination and mass is stored in a dictionary
        output_masses = {}
        for z in range(1, 3):
            ### The masses of additive combinations of two reactants as well as two of the same reactants are stored in react_masses
            ### z covers 1,2 so that intramolecular combinations are accounted for
            ### See itertools documentation for more details on itertools.combinations()
            react_masses = [x for x in combinations(reac_dic.values(), z)]
            react_masses.extend([2*x for x in reac_dic.values()])
            ### The corresponding SMILES combinations to react_masses are stored in react_mols
            react_mols = [x for x in combinations(reac_dic.keys(),z)]
            react_mols.extend((x,x) for x in reac_dic.keys())
            ### The SMILES combination is set to the key and the mass set to the value
            for nz,mass in enumerate(react_masses):
                if type(mass) == float:
                    output_masses[react_mols[nz]] = (abs(prod_mass - mass))
                else:
                    ### Covers cases where z is 2 and a list of reactant masses is made during combinations()
                    output_masses[react_mols[nz]] = (abs(prod_mass - sum(list(mass))))
        
        ### Now we compare the major product mass with the reactant combination masses
        ### This is the atom economy approach
        if len(output_masses) == 0:
            ### If there are no reactant combination masses, the reaction is skipped
            continue
        ### The best match is the reactant combination whose mass is closest to the major product mass
        ### Several combinations could fall within the 0.01 tolerance, so only the closest one is used
        ### Also, the presence of appropriate reaction structures in later functions helps remove any invalid reactions
        best_combo = min(output_masses, key=output_masses.get)
        if output_masses[best_combo] <= 0.01:
            if len(best_combo) == 2:
                ### Two reactants: this is an intermolecular reaction and both components are assigned
                df_react.iloc[n, -3], df_react.iloc[n, -2] = best_combo
                if best_combo[0] == best_combo[1]:
                    ### If the reactants are identical, the reaction is between two molecules of the same type
                    df_react.iloc[n, -1] = '2x'
            else:
                ### If there is one reactant, the reaction is intramolecular
                df_react.iloc[n, -3] = best_combo[0]
                df_react.iloc[n, -1] = True
            not_issue += 1
        else:
            ### If there is no reactant combination that yields a mass within 0.01 of the major product, the reaction is deemed invalid
            ### "not_issue" and "problems" are stored in case of troubleshooting
            problems += 1
    
    ### The df_react dataset is then reformatted depending on whether a simple or complex dataset was used
    ### This can be adjusted to suit the preference of the scripter
    ### pd.concat with axis=1 aligns rows on the index, which df_tmp and df_react share at this point
    if complex_active:
        df_tmp = dataframe[complex_cols].copy()
        df_react = pd.concat([df_tmp,df_react.iloc[:, 10:-1], df_react.iloc[:, 9], df_react.iloc[:, -1]], axis=1)
    else:
        df_react = pd.concat([df_react.iloc[:,10:-1], df_react.iloc[:,9],df_react.iloc[:,-1]], axis=1)
    ### The index is reset and not included as a column (important for merging dataframes)
    ### reset_index returns a new dataframe, so the result must be assigned back
    df_react = df_react.reset_index(drop=True)
    ### The reactions that failed this mass (atom economy) test are stored in df_mass_filt and are dropped from df_react
    df_mass_filt = df_react[df_react['Component 1'].isnull()]
    df_react = df_react.dropna(axis=0, how='all', subset=['Component 1'])
    return df_react, df_prod_filt, df_mass_filt

extracted_simple_df,df_prod_filt,df_mass_filt = SMILES_extract(simple_df)
extracted_simple_df
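
The atom economy test used in this section can be shown in isolation. The sketch below is a simplified stand-in rather than the notebook code: `match_reactants` is a hypothetical name, the masses are passed in as plain numbers instead of being computed with RDKit, and the example values are simply carbon count × 12.011.

```python
from itertools import combinations

def match_reactants(reactant_masses, product_mass, tol=0.01):
    """Return the tuple of reactant SMILES whose summed mass best matches the product.

    A 1-tuple means intramolecular, (a, a) means two copies of the same reactant,
    and (a, b) means two different reactants. None means no match within tol.
    """
    candidates = {}
    for smiles, mass in reactant_masses.items():
        candidates[(smiles,)] = mass             # single reactant
        candidates[(smiles, smiles)] = 2 * mass  # two copies of one reactant
    for (a, mass_a), (b, mass_b) in combinations(reactant_masses.items(), 2):
        candidates[(a, b)] = mass_a + mass_b     # two different reactants
    best = min(candidates, key=lambda k: abs(product_mass - candidates[k]), default=None)
    if best is None or abs(product_mass - candidates[best]) > tol:
        return None
    return best

# Butadiene + ethylene in toluene giving cyclohexene (heavy-atom masses only)
masses = {"C=CC=C": 48.044, "C=C": 24.022, "Cc1ccccc1": 84.077}
print(match_reactants(masses, 72.066))
```

Spectator species such as solvents drop out because no combination that includes them matches the product mass.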

## 6. Removal of Repeat Reaction IDs

Repeat reaction IDs represent the same reaction under different conditions. Only the first occurrence of each reaction ID is kept.

In [None]:
def remove_repeat_ID(dataframe):
    no_repeat_df = dataframe.drop_duplicates(subset = ['Reaction ID'])
    return no_repeat_df

unique_e_simp_df = remove_repeat_ID(extracted_simple_df)
unique_e_simp_df

## 7. Dienophile Identification

The next function identifies the dienophile as the reactant with exactly one suitable bond (carbon-carbon double, aromatic, or triple). The other reactant is accepted as the diene only if it has at least two suitable bonds. Reactions in which either reactant has no suitable bond are deemed invalid and are excluded from both output datasets. Intramolecular reactions are automatically exempted from this search and are classified as "known". Intermolecular reactions between identical molecules are also exempted.

The output is one dataset containing reactions with known dienophiles and dienes and another dataset containing the unknown reactions.
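
The decision rule can be reduced to a small function of the two bond counts. This is an illustrative sketch only: `assign_roles` is a hypothetical name, and the counts are assumed to have been obtained already (the notebook counts them with RDKit).

```python
def assign_roles(n_bonds_1, n_bonds_2):
    """Classify a reactant pair from the number of suitable C-C bonds in each.

    Returns 'invalid' if either reactant has none, 'unknown' if the roles
    cannot be told apart, otherwise a dict naming the diene and dienophile.
    """
    counts = (n_bonds_1, n_bonds_2)
    if 0 in counts:
        return "invalid"
    if counts[0] == counts[1] or 1 not in counts:
        return "unknown"
    dienophile = counts.index(1) + 1  # the component with exactly one suitable bond
    diene = 3 - dienophile            # the other component
    return {"Diene": f"Component {diene}", "Dienophile": f"Component {dienophile}"}

print(assign_roles(2, 1))
print(assign_roles(2, 2))
```

Pairs classified as 'unknown' are the ones passed on to the butadiene search in the next section.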

In [None]:
def diene_dieneophile_extract(dataframe):
    ### Columns for the diene and dienophile are added
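    ### The object dtype is used because these columns will hold SMILES strings
    dataframe['Diene'] = pd.Series(np.nan, index=dataframe.index, dtype=object)
    dataframe['Dienophile'] = pd.Series(np.nan, index=dataframe.index, dtype=object)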
    invalid_count = []
    for n,row in enumerate(dataframe.iterrows()):
        if dataframe.iat[n,-3] == True:
            ### If the reaction is intramolecular, set the diene to the reactant and move on
            dataframe.iloc[n,dataframe.columns.get_loc('Diene')] = dataframe.iat[n,dataframe.columns.get_loc('Component 1')]
            continue
        if dataframe.iat[n,-3] == '2x':
            ### If the reaction is intermolecular between two identical molecules, set the diene and dienophile to this repeat molecule
            dataframe.iloc[n, dataframe.columns.get_loc('Diene')] = dataframe.iat[n, dataframe.columns.get_loc('Component 1')]
            dataframe.iloc[n, dataframe.columns.get_loc('Dienophile')] = dataframe.iat[n, dataframe.columns.get_loc('Component 1')]
            continue
        else:
            ### For all other reactions, the reactants are converted into RDKit molecule objects
            reactants = [Chem.MolFromSmiles(dataframe.iat[n,dataframe.columns.get_loc('Component 1')]),Chem.MolFromSmiles(dataframe.iat[n,dataframe.columns.get_loc('Component 2')])]
            cdb_ls = []
            for arg in reactants:
                ### For each reactant, the number of suitable bonds (carbon-carbon double, aromatic, and triple) is recorded
                cdb_count = 0
                for b in arg.GetBonds():
                    if str(b.GetBondType()) == 'DOUBLE' or str(b.GetBondType()) == 'AROMATIC' or str(b.GetBondType()) == 'TRIPLE':
                        if b.GetBeginAtom().GetSymbol() == 'C' and b.GetEndAtom().GetSymbol() == 'C':
                            cdb_count += 1
                cdb_ls.append(cdb_count)
            if 0 in cdb_ls:
                ### If one of the molecules does not have any suitable bond, the reaction is invalid and reaction index is stored for later elimination
                invalid_count.append(n)
                continue
            elif cdb_ls[0] == cdb_ls[1]:
                ### If both reactants have the same number of suitable bonds, the diene and dienophile cannot be distinguished here
                ### The reaction is left unassigned and ends up in the unknown dataset for further validation
                continue
            elif 1 in cdb_ls:
                ### By this stage, one reactant has exactly 1 suitable bond and the other has 2 or more, therefore the diene and dienophile are identified
                dataframe.iloc[n, dataframe.columns.get_loc('Dienophile')] = dataframe.iat[n, dataframe.columns.get_loc(f'Component {cdb_ls.index(1) + 1}')]
                dataframe.iloc[n, dataframe.columns.get_loc('Diene')] = dataframe.iat[n, dataframe.columns.get_loc(f'Component {cdb_ls.index(max(cdb_ls)) + 1}')]
            else:
                ### Any other outlier cases can be examined later
                continue
    ### Uncomment to report the number of invalid reactions when troubleshooting
    # print(f'Detected {len(invalid_count)} invalid reactions')
    ### The index is reset and not included as a column (important for merging dataframes)
    dataframe.reset_index(drop=True, inplace=True)
    ### All validated reactions must have at least the diene set so the "unknown" dataset is created from the invalid reactions
    df_unknown_de_dep = dataframe[dataframe['Diene'].isnull()].drop(axis=0,index=invalid_count)
    df_known_de_dep =dataframe.dropna(axis=0, how='all', subset=['Diene'])
    print(df_unknown_de_dep.shape)
    print(df_known_de_dep.shape)
    return df_known_de_dep,df_unknown_de_dep

dienophile_known_simp_df, dienophile_unknown_simp_df = diene_dieneophile_extract(unique_e_simp_df)
dienophile_known_simp_df

## 8. Butadiene Extraction

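For the reactions where the dienophile could not be unequivocally identified, the diene is identified instead by searching each reactant for a butadiene (C=C-C=C) bonding pattern. If exactly one reactant contains the pattern, it is assigned as the diene and the other reactant as the dienophile. The output is a dataset with known diene and dienophile as well as an unknown dataset as before.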
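
The butadiene search looks for a single bond that connects two carbon-carbon double bonds. The sketch below shows that idea on a plain bond list, without RDKit. `count_butadiene_links` is a hypothetical name, and the molecule representation (a list of atom symbols plus `(i, j, order)` tuples) is an assumption made for this example.

```python
from itertools import combinations

def count_butadiene_links(symbols, bonds):
    """Count single bonds that join atoms belonging to C=C double bonds.

    symbols: list of atom symbols, indexed by atom number
    bonds:   list of (atom_i, atom_j, order) with order 'SINGLE', 'DOUBLE', ...
    """
    bond_type = {frozenset((i, j)): order for i, j, order in bonds}
    cc_double_atoms = []
    for i, j, order in bonds:
        if order == "DOUBLE" and symbols[i] == "C" and symbols[j] == "C":
            cc_double_atoms.extend([i, j])
    links = 0
    for a, b in combinations(cc_double_atoms, 2):
        # Atoms that are not bonded to each other are simply absent from bond_type
        if bond_type.get(frozenset((a, b))) == "SINGLE":
            links += 1
    return links

# 1,3-butadiene: C0=C1-C2=C3
butadiene = (["C"] * 4, [(0, 1, "DOUBLE"), (1, 2, "SINGLE"), (2, 3, "DOUBLE")])
# Acrolein: C0=C1-C2=O3 (the C=O bond is not a carbon-carbon double bond)
acrolein = (["C", "C", "C", "O"], [(0, 1, "DOUBLE"), (1, 2, "SINGLE"), (2, 3, "DOUBLE")])
print(count_butadiene_links(*butadiene), count_butadiene_links(*acrolein))
```

A reactant with a count of zero next to a reactant with a nonzero count is the case in which the notebook assigns dienophile and diene respectively.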

In [None]:
def butadiene_detect(dataframe):
    invalid_count = []
    for n,row in enumerate(dataframe.iterrows()):
        ### Same section as "diene_dieneophile_extract()"
        if dataframe.iat[n,dataframe.columns.get_loc('Intramolecular')] == True:
            dataframe.iloc[n,dataframe.columns.get_loc('Diene')] = dataframe.iat[n,dataframe.columns.get_loc('Component 1')]
            continue
        elif dataframe.iat[n,dataframe.columns.get_loc('Intramolecular')] == '2x':
            dataframe.iloc[n, dataframe.columns.get_loc('Diene')] = dataframe.iat[n, dataframe.columns.get_loc('Component 1')]
            dataframe.iloc[n, dataframe.columns.get_loc('Dienophile')] = dataframe.iat[n, dataframe.columns.get_loc('Component 1')]
            continue
        else:
            ### Instead of checking for 1 viable bond, the bonding pattern of butadiene is searched for
            reactants = [Chem.MolFromSmiles(dataframe.iat[n,dataframe.columns.get_loc('Component 1')]),Chem.MolFromSmiles(dataframe.iat[n,dataframe.columns.get_loc('Component 2')])]
            bond_dic = {}
            cdb_ls = []
            for arg in reactants:
                cdb_count = 0
                for b in arg.GetBonds():
                    if str(b.GetBondType()) == 'DOUBLE' or str(b.GetBondType()) == 'AROMATIC' or str(b.GetBondType()) == 'TRIPLE':
                        if b.GetBeginAtom().GetSymbol() == 'C' and b.GetEndAtom().GetSymbol() == 'C':
                            ### The number of valid bonds is recorded for each reactant
                            cdb_count += 1
                cdb_ls.append(cdb_count)
                buta_diene = 0
                bond_ind = []
                for b in arg.GetBonds():
                    ### Aromatic bonds are not used for detecting the butadiene pattern since they are far less reactive than double bonds
                    ### There may be very rare instances where an aromatic bond reacts instead of a butadiene structure
                    ### However, aromatics tend to undergo Diels-Alder reactions only under forcing conditions, so this assumption is taken to be valid
                    if str(b.GetBondType()) == 'DOUBLE':
                        if b.GetBeginAtom().GetSymbol() == 'C' and b.GetEndAtom().GetSymbol() == 'C':
                            ### The atom indexes for the double bonds are recorded
                            bond_ind.extend([b.GetBeginAtomIdx(), b.GetEndAtomIdx()])
                ### The atom indexes are used to get every possible pairing of double-bonded atoms
                pos_bond_combos = combinations(bond_ind, 2)
                for i in pos_bond_combos:
                    ### Each possible bond is assessed with a try except block to verify if it is real and if it is single
                    try:
                        if str(rdchem.Mol.GetBondBetweenAtoms(arg, i[0], i[1]).GetBondType()) == 'SINGLE':
                            ### If there is a single bond between any double bonds (as defined by the double bond atom indexes), a butadiene moiety is recorded
                            buta_diene += 1
                    except Exception:
                        ### No bond exists between this pair of atoms
                        pass
                ### The number of butadiene moieties present in each reactant is stored
                bond_dic[arg] = buta_diene
            if 0 in cdb_ls:
                ### If one of the reactants does not have any valid bonds, the reaction is skipped
                continue

            if 0 in bond_dic.values() and any(x!=0 for x in bond_dic.values()):
                ### If one reactant has a butadiene moiety and the other has none, assign diene and dienophile based on this
                if list(bond_dic.values()).index(0) == 0:
                    dataframe.iloc[n, dataframe.columns.get_loc('Dienophile')] = dataframe.iat[n, dataframe.columns.get_loc(f'Component 1')]
                    dataframe.iloc[n, dataframe.columns.get_loc('Diene')] = dataframe.iat[n, dataframe.columns.get_loc(f'Component 2')]
                else:
                    dataframe.iloc[n, dataframe.columns.get_loc('Dienophile')] = dataframe.iat[n, dataframe.columns.get_loc(f'Component 2')]
                    dataframe.iloc[n, dataframe.columns.get_loc('Diene')] = dataframe.iat[n, dataframe.columns.get_loc(f'Component 1')]
            else:
                invalid_count.append(n)
    ### The index is reset and not included as a column (important for merging dataframes)
    dataframe.reset_index(drop=True, inplace=True)
    df_unknown_de_dep_2 = dataframe[dataframe['Diene'].isnull()]
    df_known_de_dep_2 = dataframe.dropna(axis=0, how='all', subset=['Diene'])
    return df_known_de_dep_2, df_unknown_de_dep_2


butadiene_known_simp_df, butadiene_unknown_simp_df = butadiene_detect(dienophile_unknown_simp_df)
butadiene_known_simp_df

## 9. Finalization of Dataset

The two known datasets are then combined into one final dataset. Note that for the butadiene extraction, the input was the unknown dataset from the dienophile extraction. The final dataset will not have repeat reaction IDs.

A function is not strictly necessary, but it keeps this step simple.

In [None]:
def stack_df(df1,df2):
    ### Stack datasets with identical columns vertically
    df3 = pd.concat([df1,df2])
    return df3
final_df = stack_df(dienophile_known_simp_df,butadiene_known_simp_df)
final_df

## 10. Export Dataset to CSV

Now the dataset can be exported back into a CSV for later use. This can also be done for the unknown datasets.

In [None]:
final_df.to_csv(f'da_{final_df.shape[0]}.csv',index=False) ### Index is set to false so that it does not appear in the CSV as a column