# Updating Microstate Lists Based on Replicate Microstate Detection with OpenEye

This Jupyter notebook applies corrections for the replicate microstates that were detected with OpenEye tools in the notebook `detecting_replicate_microstates.ipynb`.

First, I create cumulative correction files in the `./corrections_for_v1_4_1_cumulative/` directory.

Updated microstate list files are then created for each of the 24 molecules (v1_4_1), where `SMX` stands for the molecule ID:  
* `SMX_microstates.csv` 
* `SMX_microstates_deprecated.csv`  
* `SMX_microstate_IDs_with_2D_depiction.xlsx`
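
As a small illustration of the naming pattern listed above, the sketch below enumerates the file names for SM01 through SM24. The helper `expected_output_files` is hypothetical and not part of the notebook. It only builds strings and does not touch the file system.

```python
# Hypothetical helper: the name and signature are illustrative, not from the notebook.
def expected_output_files(n_molecules=24):
    """Return the output file names for SM01..SM<n>, keyed by molecule ID."""
    suffixes = [
        "_microstates.csv",
        "_microstates_deprecated.csv",  # only written when a molecule has deprecated microstates
        "_microstate_IDs_with_2D_depiction.xlsx",
    ]
    files = {}
    for j in range(n_molecules):
        mol_name = "SM" + str(j + 1).zfill(2)  # same zero-padded naming the notebook uses
        files[mol_name] = [mol_name + suffix for suffix in suffixes]
    return files


files = expected_output_files()
print(files["SM01"])
```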

In [1]:
import pandas as pd
from openeye.oechem import *
import os
import glob

### Creating cumulative correction files for v1.4.1

In [3]:
path_to_new_corrections = "./corrections_for_v1_4_1_new"
path_to_cumulative_corrections = "./corrections_for_v1_4_1_cumulative/"

# Figure out which molecules have new corrections
corrected_mol_list = []

new_correction_files = path_to_new_corrections+"/SM*_correction.csv"

for filename in sorted(glob.glob(new_correction_files)):
    # File names look like "SM10_correction.csv"; the first four characters are the molecule ID
    mol_ID = os.path.basename(filename)[:4]
    corrected_mol_list.append(mol_ID)

# Iterate over new correction files
for mol_name in corrected_mol_list:

    print("Adding correction remarks for ", mol_name)

    # Find new corrections (only "deprecated")
    new_corr_file = path_to_new_corrections + "/" + mol_name + "_correction.csv"
    df_new_corr = pd.read_csv(new_corr_file)
    df_new_corr = df_new_corr.loc[df_new_corr["correction"] == "deprecated"]

    # Add it to the cumulative corrections file
    cumulative_corr_file = os.path.join(path_to_cumulative_corrections, mol_name + "_correction.csv")
    df_cumulative_corr = pd.read_csv(cumulative_corr_file)

    # Iterate over new corrections
    for _, new_row in df_new_corr.iterrows():
        # Column 0 holds the microstate ID, column 2 holds the correction remark
        microstate_ID_with_corr = new_row.iloc[0]
        correction_remark = new_row.iloc[2]

        print(microstate_ID_with_corr, correction_remark)

        # Add correction remark ("deprecated") to the matching microstate ID in cumulative dataframe
        matches = df_cumulative_corr.iloc[:, 0] == microstate_ID_with_corr
        df_cumulative_corr.loc[matches, "correction"] = correction_remark

    # Update the cumulative correction file 
    df_cumulative_corr.to_csv(cumulative_corr_file, index=False)
    print()

Adding correction remarks for  SM10
SM10_micro029 deprecated

Adding correction remarks for  SM18
SM18_micro008 deprecated
SM18_micro015 deprecated
SM18_micro019 deprecated
SM18_micro027 deprecated
SM18_micro035 deprecated
SM18_micro039 deprecated
SM18_micro040 deprecated
SM18_micro041 deprecated
SM18_micro043 deprecated
SM18_micro044 deprecated
SM18_micro046 deprecated
SM18_micro066 deprecated
SM18_micro067 deprecated
SM18_micro073 deprecated

Adding correction remarks for  SM21
SM21_micro016 deprecated

Adding correction remarks for  SM23
SM23_micro001 deprecated
SM23_micro004 deprecated
SM23_micro006 deprecated
SM23_micro008 deprecated
SM23_micro014 deprecated
SM23_micro015 deprecated
SM23_micro016 deprecated
SM23_micro027 deprecated
SM23_micro029 deprecated

Adding correction remarks for  SM24
SM24_micro008 deprecated
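
The cell above matches microstate IDs row by row. The same update could probably be written as a single boolean mask with `Series.isin`. The sketch below assumes the first column of both files is named `microstate ID` (the name used later in this notebook) and runs on toy data frames rather than the real correction files. The SMILES strings and the non-deprecated rows are placeholders.

```python
import pandas as pd

# Toy stand-ins for one molecule's correction tables (values are illustrative only)
df_cumulative_corr = pd.DataFrame({
    "microstate ID": ["SM10_micro001", "SM10_micro029", "SM10_micro030"],
    "canonical isomeric SMILES": ["C", "CC", "CCC"],
    "correction": [None, None, "added"],
})
df_new_corr = pd.DataFrame({
    "microstate ID": ["SM10_micro029"],
    "canonical isomeric SMILES": ["CC"],
    "correction": ["deprecated"],
})

# IDs that the new corrections mark as deprecated
deprecated_ids = df_new_corr.loc[
    df_new_corr["correction"] == "deprecated", "microstate ID"
]

# Flag every matching row of the cumulative table in one assignment
mask = df_cumulative_corr["microstate ID"].isin(deprecated_ids)
df_cumulative_corr.loc[mask, "correction"] = "deprecated"

print(df_cumulative_corr[["microstate ID", "correction"]])
```

Rows that are not in the new corrections keep whatever remark they already had, which is the behavior of the loop above as well.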



### Creating updated microstate list files for SAMPL6 repo
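
The cell in this section sorts each molecule's microstates into deprecated, added, remaining, and updated groups through a chain of `if`/`elif` branches. As a rough sketch of the same bookkeeping, the four groups can be derived from two boolean masks. The table below is toy data with placeholder IDs and SMILES, and it assumes the `correction` column holds `"deprecated"`, `"added"`, or nothing.

```python
import pandas as pd

# Toy correction table (IDs and SMILES are illustrative, not real SAMPL6 data)
df_microstates = pd.DataFrame({
    "microstate ID": ["SMXX_micro001", "SMXX_micro002", "SMXX_micro003", "SMXX_micro004"],
    "canonical isomeric SMILES": ["C", "CC", "CCC", "CCCC"],
    "correction": [None, "deprecated", "added", None],
})

is_deprecated = df_microstates["correction"] == "deprecated"
is_added = df_microstates["correction"] == "added"

df_deprecated = df_microstates[is_deprecated]              # goes to *_microstates_deprecated.csv
df_added = df_microstates[is_added]                        # newly added microstates
df_remaining = df_microstates[~is_deprecated & ~is_added]  # untouched by any correction
df_updated = df_microstates[~is_deprecated]                # remaining + added

# Flags equivalent to deprecated_label and added_label
has_deprecated = bool(is_deprecated.any())
has_added = bool(is_added.any())

print(len(df_deprecated), len(df_added), len(df_remaining), len(df_updated))
```

With the masks in hand, the updated list is always "everything that is not deprecated", so the four branches reduce to deciding which counts to print.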

In [5]:
path_to_correction_files = path_to_cumulative_corrections
path_to_corrected_files = "microstate_lists_after_correction/"

# Iterate over 24 molecules
for j in range(24):
    mol_name = "SM"+str(j+1).zfill(2)
    print(mol_name, "...")

    # Read correction file
    correction_file = path_to_correction_files + mol_name + "_correction.csv"
    df_microstates = pd.read_csv(correction_file)

    # Convert all SMILES to canonical isomeric SMILES
    
    for i, row in enumerate(df_microstates.iterrows()):
        smiles = df_microstates.loc[i,"canonical isomeric SMILES"]

        mol = OEGraphMol()
        OESmilesToMol(mol, smiles)
        canonical_smiles = OEMolToSmiles(mol)

        df_microstates.loc[i, "canonical isomeric SMILES"] = canonical_smiles
    
    # Check if there is any deprecated microstate

    correction = df_microstates["correction"]
    deprecated_boolean = correction.isin(["deprecated"])

    deprecated_label = False
    for b in deprecated_boolean:
        if b:
            print("Deprecated microstate found.")
            deprecated_label = True


    # Check if there is any added microstate

    added_boolean = correction.isin(["added"])

    added_label = False
    for b in added_boolean:
        if b:
            print("Added microstate found.")
            added_label = True


    # Write deprecated microstates to a separate file

    if(deprecated_label):
        df_deprecated = df_microstates.loc[df_microstates["correction"] == "deprecated"]
        print("Number of deprecated microstates of {}: {}".format(mol_name, df_deprecated.shape[0]))

        df_deprecated = df_deprecated.rename(columns = {"correction":"remarks"})

        deprecated_microstates_file_name = path_to_corrected_files + mol_name + "_microstates_deprecated.csv"
        df_deprecated.to_csv(deprecated_microstates_file_name, index=False)
        print("Created:" , deprecated_microstates_file_name)
        print("\n")


    # Write new microstates list with deprecated microstates removed and new microstates added.

    if(deprecated_label and added_label):
        df_remaining = df_microstates.loc[df_microstates["correction"] != "deprecated"]
        df_remaining = df_remaining.loc[df_remaining["correction"] != "added"]
        print("Number of remaining microstates of {}: {}".format(mol_name, df_remaining.shape[0]))

        df_added = df_microstates.loc[df_microstates["correction"] == "added"]
        print("Number of new microstates of {}: {}".format(mol_name, df_added.shape[0]))
        
        df_updated = df_microstates.loc[df_microstates["correction"] != "deprecated"]
        print("Total number of microstates in updated list of {}: {}".format(mol_name, df_updated.shape[0]))

    elif(added_label): # no deprecated
        df_remaining = df_microstates.loc[df_microstates["correction"] != "added"]
        print("Number of remaining microstates of {}: {}".format(mol_name, df_remaining.shape[0]))

        df_added = df_microstates.loc[df_microstates["correction"] == "added"]
        print("Number of new microstates of {}: {}".format(mol_name, df_added.shape[0]))

        df_updated = df_microstates
        print("Total number of microstates in updated list of {}: {}".format(mol_name, df_updated.shape[0]))

    elif(deprecated_label): # no added

        df_updated = df_microstates.loc[df_microstates["correction"] != "deprecated"]
        print("Total number of microstates in updated list of {}: {}".format(mol_name, df_updated.shape[0]))

    else:
        df_updated = df_microstates
        print("No correction to microstate list.")
        print("Total number of microstates in updated list of {}: {}".format(mol_name, df_updated.shape[0]))


    df_updated = df_updated[["microstate ID", "canonical isomeric SMILES"]]

    updated_microstates_file_name = path_to_corrected_files + mol_name + "_microstates.csv"
    df_updated.to_csv(updated_microstates_file_name, index=False)
    print("Created:" , updated_microstates_file_name)
    print("\n")


    # Create Excel file with 2D depiction for updated microstates list

    # Organize columns to create the csv input file for the csv2xlsx.py script
    df_2D_input = pd.DataFrame()
    df_2D_input["Molecule"] = df_updated["canonical isomeric SMILES"]
    df_2D_input["Microstate ID"] = df_updated["microstate ID"]
    df_2D_input["microstate ID"] = df_updated["microstate ID"]
    df_2D_input["canonical isomeric SMILES"] = df_updated["canonical isomeric SMILES"]

    csv_file_name = path_to_corrected_files + "{}_microstate_IDs_with_2D_depiction.csv".format(mol_name)
    xlsx_file_name = path_to_corrected_files + "{}_microstate_IDs_with_2D_depiction.xlsx".format(mol_name)

    df_2D_input.to_csv(csv_file_name, index=False)

    !python csv2xlsx.py $csv_file_name $xlsx_file_name
    !trash $csv_file_name
    print("Created: ",xlsx_file_name)
    print(mol_name, ": Done!")
    print("\n")

SM01 ...
Deprecated microstate found.
Deprecated microstate found.
Number of deprecated microstates of SM01: 2
Created: microstate_lists_after_correction/SM01_microstates_deprecated.csv


Total number of microstates in updated list of SM01: 8
Created: microstate_lists_after_correction/SM01_microstates.csv


Created:  microstate_lists_after_correction/SM01_microstate_IDs_with_2D_depiction.xlsx
SM01 : Done!


SM02 ...
Deprecated microstate found.
Deprecated microstate found.
Deprecated microstate found.
Added microstate found.
Added microstate found.
Added microstate found.
Number of deprecated microstates of SM02: 3
Created: microstate_lists_after_correction/SM02_microstates_deprecated.csv


Number of remaining microstates of SM02: 8
Number of new microstates of SM02: 3
Total number of microstates in updated list of SM02: 11
Created: microstate_lists_after_correction/SM02_microstates.csv


Created:  microstate_lists_after_correction/SM02_microstate_IDs_with_2D_depiction.xlsx
SM02 : Done

SM15 : Done!


SM16 ...
No correction to microstate list.
Total number of microstates in updated list of SM16: 8
Created: microstate_lists_after_correction/SM16_microstates.csv


Created:  microstate_lists_after_correction/SM16_microstate_IDs_with_2D_depiction.xlsx
SM16 : Done!


SM17 ...
Deprecated microstate found.
Deprecated microstate found.
Deprecated microstate found.
Added microstate found.
Added microstate found.
Added microstate found.
Added microstate found.
Added microstate found.
Added microstate found.
Number of deprecated microstates of SM17: 3
Created: microstate_lists_after_correction/SM17_microstates_deprecated.csv


Number of remaining microstates of SM17: 2
Number of new microstates of SM17: 6
Total number of microstates in updated list of SM17: 8
Created: microstate_lists_after_correction/SM17_microstates.csv


Created:  microstate_lists_after_correction/SM17_microstate_IDs_with_2D_depiction.xlsx
SM17 : Done!


SM18 ...
Deprecated microstate found.
Deprecated micros