# Which active compounds are available in Kemikum and ACES according to KLARA? **STAY TUNED AND FIND OUT!!!**

### In this episode of **'We have no choice, experimental acquisition is needed'**, the sequel to **'APCI-HRMS data should be abundant, right?'**, we present the potential main characters of the test set!
***PREVIOUSLY ON 'We have no choice, experimental acquisition is needed':*** 

All endpoints except AHR and MMP were voted out, since they had a larger number of polar compounds in their active/inactive split. Here we will find out which active compounds are available in KLARA Kemikum and ACES. After that, an exciting hunt for the chemicals, and the politics needed between the groups, will determine which ones can be added to the final experimental dataset.

### **TASK 1: IS IT GC-AMENABLE?**
Here we will make and look at histogram visualizations for both the KLARA Kemikum and KLARA ACES datasets to determine whether these compounds are predicted to be GC-amenable according to the SUSDAT dataset.

In [None]:
# Starting off strong by loading the data
import pickle
import pandas as pd

# Made in '2025-01-30_Comparisons.ipynb'
with open('/Users/elli/Library/CloudStorage/OneDrive-Kruvelab/Master_thesis/Code/Tox21 comparison/2025-03-06_tox21_ahr_mmp_available_compounds_all_sources_UPDATED.pkl', 'rb') as f:
    tox21_ahr_mmp = pickle.load(f)


sirius = pd.read_csv('/Users/elli/Library/CloudStorage/OneDrive-Kruvelab/Master_thesis/Data/SIRIUS training set/sirius_without_dup.tsv', sep='\t')

In [None]:
#Unique compounds with APCI spectra w. sirius data removed. 

# tox21 = tox21_ahr_mmp.drop_duplicates(subset=tox21_ahr_mmp.columns)
# tox21 = tox21[tox21.sirius_data.isna()]

# tox21 = tox21.dropna(subset=['ms_library', 'iris_data', 'isabel_data'], how='all')
# tox21 = tox21.drop_duplicates(subset='InChIKey14')

In [None]:
#Filtering tox21 to only show actives
tox21_actives = tox21_ahr_mmp[(tox21_ahr_mmp['nr.ahr'] == 1) | (tox21_ahr_mmp['sr.mmp'] == 1)].reset_index(drop=True)

#For each compound and location, keep only the row with the highest GC-probability among the available predictions, while still keeping information on locations
tox21_actives_sorted = tox21_actives.sort_values(by=['InChIKey14', 'section_aces', 'section_kemikum', 'gc_probability'], ascending=[True, False, False, False]).reset_index(drop=True)
tox21_actives_no_dup = tox21_actives_sorted.drop_duplicates(subset=['InChIKey14', 'section_aces', 'section_kemikum'], keep='first').reset_index(drop=True)

#Remove any compounds already present in a ms library, iris data or isabel data
tox21_experimental = tox21_actives_no_dup[(tox21_actives_no_dup['ms_library'].isna())&
                                          (tox21_actives_no_dup['iris_data'].isna())&
                                          (tox21_actives_no_dup['isabel_data'].isna())].reset_index(drop=True)

#Filter so that all kemikum compounds are available in one dataset
tox21_kemikum_actives = tox21_experimental.dropna(subset='section_kemikum').reset_index(drop=True)
tox21_kemikum_actives_no_dupl = tox21_kemikum_actives.drop_duplicates(subset='InChIKey14', keep='first').reset_index(drop=True) 

#Filter so that all aces compounds are available in one dataset and remove any compounds which overlap with kemikum
tox21_aces_actives = tox21_experimental.dropna(subset='section_aces').reset_index(drop=True)
tox21_aces_actives = tox21_aces_actives[tox21_aces_actives.section_kemikum.isna()].reset_index(drop=True)
tox21_aces_actives_no_dupl = tox21_aces_actives.drop_duplicates(subset='InChIKey14', keep='first').reset_index(drop=True) 

In [None]:
#Compare the number of compounds in the different datasets
tox21_kemikum_actives.shape, tox21_kemikum_actives_no_dupl.shape, tox21_aces_actives.shape, tox21_aces_actives_no_dupl.shape

In [None]:
#Remove any compounds that are already present in the SIRIUS training set
tox21_kemikum_actives_no_dupl_no_sirius = tox21_kemikum_actives_no_dupl[tox21_kemikum_actives_no_dupl.sirius_data.isna()].reset_index(drop=True)
tox21_aces_actives_no_dupl_no_sirius = tox21_aces_actives_no_dupl[tox21_aces_actives_no_dupl.sirius_data.isna()].reset_index(drop=True)

In [None]:
# Compare the number of compounds in the different datasets after removing SIRIUS data
tox21_kemikum_actives_no_dupl.shape, tox21_kemikum_actives_no_dupl_no_sirius.shape, tox21_aces_actives_no_dupl.shape, tox21_aces_actives_no_dupl_no_sirius.shape

In [None]:
tox21_kemikum_actives_no_dupl_no_sirius

In [None]:
# Remove any compounds that have a GC-probability of <0.5
tox21_kemikum_actives_no_dupl_no_sirius_gc50 = tox21_kemikum_actives_no_dupl_no_sirius[tox21_kemikum_actives_no_dupl_no_sirius.gc_probability >= 0.5].reset_index(drop=True)
tox21_aces_actives_no_dupl_no_sirius_gc50 = tox21_aces_actives_no_dupl_no_sirius[tox21_aces_actives_no_dupl_no_sirius.gc_probability >= 0.5].reset_index(drop=True)

In [None]:
tox21_kemikum_actives_no_dupl_no_sirius_gc50.shape, tox21_aces_actives_no_dupl_no_sirius_gc50.shape

In [None]:
tox21_kemikum_actives_no_dupl_no_sirius_gc50[tox21_kemikum_actives_no_dupl_no_sirius_gc50['sr.mmp']==1].shape

In [None]:
tox21_aces_actives_no_dupl_no_sirius_gc50[tox21_aces_actives_no_dupl_no_sirius_gc50['sr.mmp']==1].shape

#### **SUBTASK 1: DATA VISUALIZATIONS**

We will look at the GC-amenability prediction for all available compounds in each section. From this we can then determine which probability (expressed as a decimal rather than a percentage) will be an appropriate cutoff for selecting compounds for the experimental analysis!
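
Besides looking at the histograms, one way to compare candidate cutoffs is to count how many compounds survive each of them. The cell below is only a sketch on invented numbers: `count_above_thresholds` is a hypothetical helper that is not used elsewhere in this notebook, and it uses `>=` on the assumption that a compound sitting exactly on the cutoff should be kept.

```python
import pandas as pd

def count_above_thresholds(df, thresholds, column="gc_probability"):
    """Return the number of rows whose `column` value is >= each threshold."""
    counts = {t: int((df[column] >= t).sum()) for t in thresholds}
    return pd.Series(counts, name="n_compounds")

# Toy data, not real predictions
toy = pd.DataFrame({"gc_probability": [0.05, 0.10, 0.45, 0.50, 0.80, 0.95]})

cutoff_counts = count_above_thresholds(toy, [0.3, 0.5, 0.7, 0.9])
print(cutoff_counts)
```

Passing one of the real dataframes instead of `toy` would show how quickly the candidate list shrinks as the cutoff is raised.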

In [None]:
# Compounds without SIRIUS training data that are available at the university (ACES or Kemikum)
no_sirius_tox21_expr = tox21_experimental[tox21_experimental.sirius_data.isna()].reset_index(drop=True)
at_uni_tox21_expr = no_sirius_tox21_expr[(no_sirius_tox21_expr.section_aces.notna())|(no_sirius_tox21_expr.section_kemikum.notna())]

at_uni_tox21_expr = at_uni_tox21_expr.drop_duplicates(subset='InChIKey14', keep='first').reset_index(drop=True)

In [None]:
at_uni_tox21_expr

In [None]:
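# Remove any duplicate chemicals from the datasets to make sure visualization is correct
# Order: Kemikum (SIRIUS included), Kemikum (SIRIUS removed), ACES (SIRIUS included), ACES (SIRIUS removed)
list_of_dfs = [tox21_kemikum_actives, tox21_kemikum_actives_no_dupl_no_sirius, tox21_aces_actives, tox21_aces_actives_no_dupl_no_sirius]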
list_of_dfs_for_visualization = []

for df in list_of_dfs:
    df = pd.DataFrame(df)
    filtered_df = df.drop_duplicates(subset='InChIKey14', keep='first')
    list_of_dfs_for_visualization.append(filtered_df)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the theme for the plots
sns.set_theme(style='white', rc={'figure.figsize':(5,5), 
                                 'font.family':['Sans-serif'],
                                 'font.sans-serif':['Times New Roman'], 
                                 'font.size':12, 
                                 'xtick.labelsize': 12,
                                 'ytick.labelsize': 12,
                                 'axes.labelsize': 12,
                                 'axes.titlesize': 14,
                                 'legend.fontsize': 12,
                                 'legend.title_fontsize': 12,
                                 'font.style': 'normal',
                                 'font.weight': 400})
                                
                                

plt.rcParams['savefig.transparent'] = True

In [None]:
import matplotlib.pyplot as plt
import numpy as np

from matplotlib.font_manager import FontProperties

# GC-probabilities per department (kept for reference, not used in the combined plot below)
aces = tox21_aces_actives_no_dupl_no_sirius['gc_probability'] #aces data
kemikum = tox21_kemikum_actives_no_dupl_no_sirius['gc_probability'] #kemikum data

# Define bin edges
bins = np.linspace(0, 1, 15)

# Create histogram for all compounds available at the university (Kemikum and ACES combined)
plt.figure(figsize=(7, 3))
plt.hist(at_uni_tox21_expr.gc_probability, bins=bins, color='#58A7D2', edgecolor=None)

# Add labels and legend
#bold_font = FontProperties(weight='bold')

plt.xlabel('GC-probability')
plt.ylabel('Compounds')
#plt.title('Distribution of GC-probability for compounds\navailable across departments', fontsize=14, fontweight='bold')
#plt.legend(title='Departments', title_fontproperties=bold_font, loc='upper left')

# Show the plot
#plt.show()

# Save the plot
plt.savefig('/Users/elli/Library/CloudStorage/OneDrive-Kruvelab/Master_thesis/Visualizations/2025-05-23_GC_probability_experimental.pdf', dpi=300, bbox_inches='tight', transparent=True)
# plt.savefig('/Users/elli/Library/CloudStorage/OneDrive-Kruvelab/Ellinor - Master thesis/Visualizations/2025-03-06_Comparison_GC_probability.png', dpi=300, bbox_inches='tight')

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='white', rc={'figure.figsize':(5,5), 
                                 'font.family':['Sans-serif'],
                                 'font.sans-serif':['Times New Roman'], 
                                 'font.size':14, 
                                 'xtick.labelsize': 14,
                                 'ytick.labelsize': 14,
                                 'axes.labelsize': 14,
                                 'axes.titlesize': 14,
                                 'legend.fontsize': 14,
                                 'legend.title_fontsize': 14,
                                 'font.style': 'normal',
                                 'font.weight': 400})
                                
                                

plt.rcParams['savefig.transparent'] = True

#First, let's look at the distribution of the number of compounds in the different data sources

fig, axs = plt.subplots(1, 2, figsize=(10, 6), sharey=True, sharex=True)

#General information for figure
fig.supxlabel('Probability of GC-amenability', fontweight='normal', color='black')
fig.supylabel('Number of compounds', color='black')
fig.suptitle('Distribution of the probability of \nGC-amenability for active compounds in KLARA', color='black', fontweight='bold')

#Kemikum histograms
axs[0].hist(list_of_dfs_for_visualization[0]['gc_probability'], bins=15, color='#8ECAE6', edgecolor='#023047', alpha= 0.5)
axs[0].set_title('Kemikum\nSIRIUS training data included')

axs[0].hist(list_of_dfs_for_visualization[1]['gc_probability'], bins=15, color='#219EBC', edgecolor='#023047', alpha= 1)
axs[0].set_title('Kemikum', color='#022E60')

#ACES histograms
axs[1].hist(list_of_dfs_for_visualization[2]['gc_probability'], bins=15, color='#FFB703', edgecolor='#023047', alpha = 0.5)
axs[1].set_title('ACES\nSIRIUS training data included')

axs[1].hist(list_of_dfs_for_visualization[3]['gc_probability'], bins=15, color='#FB8500', edgecolor='#023047', alpha = 1)
axs[1].set_title('ACES', color='#022E60')

plt.tight_layout()
plt.show()

#plt.savefig('2025-03-17_distr_gc_prob_KLARA.svg', format='svg')


In [None]:
from matplotlib import rcParams
rcParams.keys()

And it is once again a beautiful visualization! We can clearly see from this that most active compounds found in Kemikum and ACES are predicted to have either a very high or a very low probability of GC-amenability. We also see that the largest number of compounds is found in Kemikum, and quite a bit fewer are found in ACES. Fortunately, we don't seem to be losing too many GC-amenable compounds when removing the SIRIUS training data.

LET'S FIND OUT HOW MANY WE LOST!

#### **SUBTASK 2: SPOT THE DIFFERENCE (SIRIUS training data edition)**

With a probability of GC-amenability > 0.5, how many compounds did we lose for ACES and Kemikum respectively when removing SIRIUS training data?
**FIND OUT AFTER THE BREAK!**

In [None]:
#ANOTHER FUNCTION??!! This girl is on a roll today!
def filter_gc_amenability(df, threshold):
    """
    This function filters the data based on the probability of GC-amenability
    """

    filtered_df = df[df['gc_probability'] > threshold].reset_index(drop=True)
    return filtered_df

#To make the application of the function as easy as possible, all df's are stored in a list and then looped over
list_of_dfs_GC50 = []

for df in list_of_dfs_for_visualization:
    df_filtered = filter_gc_amenability(df, 0.5)
    list_of_dfs_GC50.append(df_filtered)

In [None]:
#NOW FOR THE RESULTS!!!

#Kemikum compounds that are lost:
#(index 0 = SIRIUS training data included, index 1 = SIRIUS training data removed)
kemikum_lost = len(list_of_dfs_GC50[0]) - len(list_of_dfs_GC50[1])
print(f'Kemikum compounds lost: {kemikum_lost}\n' +
      f'Kemikum compounds left: {len(list_of_dfs_GC50[1])}')

print('-----')
#ACES compounds that are lost:
#(index 2 = SIRIUS training data included, index 3 = SIRIUS training data removed)
aces_lost = len(list_of_dfs_GC50[2]) - len(list_of_dfs_GC50[3])
print(f'ACES compounds lost: {aces_lost}\n' +
      f'ACES compounds left: {len(list_of_dfs_GC50[3])}')

**THE RESULTS ARE IN EVERYBODY!!!** 

In total we are losing 93 and 20 compounds from Kemikum and ACES respectively when filtering out the SIRIUS training data. If all goes well that means that we have 187+46=***233 compounds to analyse!!!***

Let us pray that the group politics go well so we manage to analyse all of them.
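
The bookkeeping behind numbers like these can be sketched with plain sets of identifiers. This is an illustration on made-up identifiers only: `lost_and_left` is a hypothetical helper, and it assumes that "lost" means present before the SIRIUS filter but absent after it.

```python
def lost_and_left(before, after):
    """Compare two collections of compound identifiers.

    Returns (number lost, number left), where "lost" means present in
    `before` but absent from `after`.
    """
    before, after = set(before), set(after)
    return len(before - after), len(before & after)

# Made-up identifiers for illustration only
with_sirius = ["AAA", "BBB", "CCC", "DDD", "EEE"]
without_sirius = ["AAA", "CCC", "EEE"]

lost, left = lost_and_left(with_sirius, without_sirius)
print(f"lost: {lost}, left: {left}")
```

With the real data, the `InChIKey14` columns of the corresponding dataframes could be passed in directly. Comparing identifiers rather than row counts also guards against duplicates inflating either number.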

But it's not over yet! Join us again after the break to find out what the other tasks will be!

\* break \*

***Welcome back!!***

We're starting the second task of the evening:
### TASK 2: 'Please Sir... Can I have 5mg, sir?'

For this task we will compile lists to determine which compounds are found where, so that the supervisor, **the master herself** (!!!) can work her ***political magic*** to help us get the compounds needed for the experimental acquisition.

We'll start off by cleaning the data.

#### SUBTASK 1: Clean the data again

In [None]:
tox21_experimental.drop_duplicates(subset='InChIKey14', keep='first').reset_index(drop=True)

In [None]:
tox21_experimental_GC50 = filter_gc_amenability(tox21_experimental, 0.5).reset_index(drop=True)

tox21_experimental_GC50_no_sirius = tox21_experimental_GC50[tox21_experimental_GC50['sirius_data'].isna()].reset_index(drop=True)

In [None]:
#tox21_experimental_GC50.drop_duplicates(subset='InChIKey14', keep='first').reset_index(drop=True)

tox21_experimental_GC50_no_sirius.drop_duplicates(subset='InChIKey14', keep='first').reset_index(drop=True)

In [None]:
#Compounds found in AC groups
group_names_ac = ['Group Ulrika Nilsson', 'Kurslab_AK', 'Group Ioannis Sadiktsis', 'Group Jan Holmbäck','Masslab', 'Group Anneli Kruve', 'Group Nicole Pamme', 'Group Leopold Ilag']


tox21_experimental_ac = tox21_experimental_GC50_no_sirius[tox21_experimental_GC50_no_sirius.section_kemikum.isin(group_names_ac)].reset_index(drop=True)

In [None]:
#kemikum (non-analytical) department unique compounds
tox21_experimental_no_ac_compounds = tox21_experimental_GC50_no_sirius[~tox21_experimental_GC50_no_sirius.InChIKey14.isin(tox21_experimental_ac.InChIKey14)]
tox21_experimental_mmk_org = tox21_experimental_no_ac_compounds.dropna(subset=['section_kemikum'], how='all').reset_index(drop=True)

#ACES unique compounds (AC compounds were already excluded in tox21_experimental_no_ac_compounds)
tox21_experimental_aces = tox21_experimental_no_ac_compounds[~tox21_experimental_no_ac_compounds.InChIKey14.isin(tox21_experimental_mmk_org.InChIKey14)].reset_index(drop=True)
tox21_experimental_aces = tox21_experimental_aces.dropna(subset=['section_aces'], how='all').reset_index(drop=True)

#### SUBTASK 2: Determine active/inactive count of endpoints for compounds

Remember that all compounds are active in at least one of the two endpoints, so a higher ratio of actives to inactives is normal. 
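
A quick sanity check of that statement, together with the overlap between the two endpoints, could look like the sketch below. The data is invented and only mirrors the column names used in this notebook. Note that `pd.crosstab` drops rows with a missing label in either column by default, which matters if some compounds were never tested in one of the assays.

```python
import pandas as pd

# Toy data: 1 = active, 0 = inactive (values are invented)
toy = pd.DataFrame({
    "InChIKey14": ["AAA", "BBB", "CCC", "DDD"],
    "nr.ahr": [1, 0, 1, 1],
    "sr.mmp": [0, 1, 1, 0],
})

# Every compound should be active in at least one endpoint
active_somewhere = (toy["nr.ahr"] == 1) | (toy["sr.mmp"] == 1)
assert active_somewhere.all()

# Rows: nr.ahr label, columns: sr.mmp label
overlap = pd.crosstab(toy["nr.ahr"], toy["sr.mmp"])
print(overlap)
```

The cell where both labels are 0 should always be empty for these datasets, and the cell where both are 1 shows how many compounds are active in both endpoints.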

In [None]:
def active_inactive_count(df, endpoints_list):
    df = df.drop_duplicates(subset='InChIKey14')
    for endpoint in endpoints_list:
        print(f'Active/inactive count for {endpoint}')
        print(df.value_counts(endpoint))

endpoints_list = ['nr.ahr', 'sr.mmp']

print('Analytical department')
active_inactive_count(tox21_experimental_ac, endpoints_list)
print('------')

print('Other departments')
active_inactive_count(tox21_experimental_mmk_org, endpoints_list)
print('------')

print('ACES')
active_inactive_count(tox21_experimental_aces, endpoints_list)
print('------')

#### SUBTASK 3: Combine cleaned data with KLARA information

By first opening the two KLARA datasets (ACES and Kemikum), we can rename their 'section' column to match the column names used in the tox21 data. The tox21 columns (GC-probability and endpoint labels) are then merged onto the KLARA data.

Main tasks:
1. Open the KLARA datasets
2. Rename the 'section' column to 'section_{name of KLARA dataset}'
3. Drop any rows that have no value in the 'InChIKey14' column
4. Drop any rows that are exact copies across all columns, keeping only the first
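
To make steps 2 to 4 concrete, here is a minimal sketch on an invented table. `clean_klara` is a hypothetical name used only for this illustration; the actual notebook cells split the same steps over a rename call and a separate filter function.

```python
import pandas as pd

def clean_klara(df, dataset_name):
    """Apply steps 2-4 to an already loaded KLARA table."""
    out = df.rename(columns={"section": f"section_{dataset_name}"})  # step 2
    out = out.dropna(subset=["InChIKey14"])                          # step 3
    out = out.drop_duplicates(keep="first")                          # step 4
    return out.reset_index(drop=True)

# Toy table standing in for a KLARA export (all values invented)
toy = pd.DataFrame({
    "name": ["a", "a", "b", "c"],
    "InChIKey14": ["AAA", "AAA", None, "CCC"],
    "section": ["Group X", "Group X", "Group Y", "Group Z"],
})

cleaned = clean_klara(toy, "kemikum")
print(cleaned)
```

Compound "b" disappears because it has no InChIKey14, and the second copy of "a" disappears because it is identical to the first in every column.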

In [None]:
#Open klara data for kemikum and aces

with open('/Users/elli/Library/CloudStorage/OneDrive-Kruvelab/Ellinor - Master thesis/Code/Data cleaning/2025-02-13_klara_aces_cleaned.pkl', 'rb') as f:
    klara_aces = pickle.load(f)

klara_aces.rename(columns={'section': 'section_aces'}, inplace=True)

with open('/Users/elli/Library/CloudStorage/OneDrive-Kruvelab/Ellinor - Master thesis/Code/Data cleaning/2025-03-06_klara_kemikum_UPDATED_cleaned.pkl', 'rb') as f:
    klara_kemikum = pickle.load(f)

klara_kemikum.rename(columns={'section': 'section_kemikum'}, inplace=True)

In [None]:
def filter_klara_data(df):
    """
    This function keeps only the rows that have an InChIKey14 (together with their location information) and removes any rows that are exact duplicates
    """
    new_df = df.dropna(subset='InChIKey14', how='all').reset_index(drop=True)
    new_df = new_df.drop_duplicates(subset=new_df.columns, keep='first').reset_index(drop=True)

    return new_df

klara_aces_unique = filter_klara_data(klara_aces)
klara_kemikum_unique = filter_klara_data(klara_kemikum)

In [None]:
def add_klara_data(df, klara_data, section_name):
    """
    This function adds the KLARA data to the df and then filters out any duplicate rows
    """
    new_df = klara_data.merge(df[['gc_probability', 'nr.ahr', 'sr.mmp', 'InChIKey14', section_name]], on=['InChIKey14', section_name], how='inner')
    new_df = new_df.drop_duplicates(subset=new_df.columns, keep='first').reset_index(drop=True)

    new_df = new_df.rename(columns={section_name: 'section'})

    return new_df

klara_ac_chemicals = add_klara_data(tox21_experimental_ac, klara_kemikum_unique, 'section_kemikum')
klara_mmk_org_chemicals = add_klara_data(tox21_experimental_mmk_org, klara_kemikum_unique, 'section_kemikum')
klara_aces_chemicals = add_klara_data(tox21_experimental_aces, klara_aces_unique, 'section_aces')


#### SUBTASK 4: Compiling lists of actives to ask and analyse

Which compounds are already available in the group to analyse? Which are available in the corridor to ask about? 

Let's continue to find out!!

##### Analytical Chemistry section

In [None]:
#Compounds to be removed from AC chemicals list

filter_ac_compounds_to_remove = ['2,4-toluendiisocyanat (isomerblandning) ' #Not appropriate for MS analysis to work with isomer mix 
                                 ]

klara_ac_chemicals = klara_ac_chemicals[~klara_ac_chemicals['name'].isin(filter_ac_compounds_to_remove)].reset_index(drop=True)

In [None]:
klara_ac_chemicals.section.unique()

In [None]:
def separate_groups(df, group_names):

    '''
    Separate the chemicals into different groups based on the group names provided, 
    returns a dictionary with group name as key, and the chemical-df as value
    '''

    new_df = df
    groups_dict = {}

    for group_name in group_names:
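        # regex=False: match the group name literally (names such as '... (MMK)' contain regex characters)
        group = new_df[(new_df['section'].str.contains(group_name, regex=False, na=False))].reset_index(drop=True)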
        new_df = new_df[~new_df['InChIKey14'].isin(group['InChIKey14'])]

        group_sorted = group.sort_values(by=['InChIKey14', 'cas', 'amount'], ascending=[True,True,False])
        group_filtered = group_sorted.drop_duplicates(subset=['InChIKey14', 'cas'], keep='first').reset_index(drop=True)

        groups_dict[str(group_name)] = group_filtered


    return new_df, groups_dict


ac_group_names = ['Group Anneli Kruve', 'Group Ulrika Nilsson', 'Group Ioannis Sadiktsis', 'Kurslab_AK', 'Masslab', 'Group Leopold Ilag', 'Group Nicole Pamme','Group Jan Holmbäck'] #To ensure the hierarchy of groups to ask is preserved
klara_ac_chemicals_updated, klara_ac_separate_groups_dict = separate_groups(klara_ac_chemicals, ac_group_names)


In [None]:
klara_ac_separate_groups_dict.keys() #Check that all groups are included

In [None]:
klara_ac_chemicals_final = pd.concat(klara_ac_separate_groups_dict.values(), ignore_index=True)

klara_ac_chemicals_final

Using the grouping above we see a natural hierarchy forming: the groups which have the highest priority, or the best chance of lending us compounds, are higher up in the list. Once a compound has been assigned to a group it is removed from the pool, so no compound ends up in more than one group.
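
The priority logic can be illustrated on a tiny invented table. This is a simplified sketch, not a replacement for the function above: `assign_by_priority` is a hypothetical name, and it matches section names exactly instead of by substring.

```python
import pandas as pd

def assign_by_priority(df, ordered_groups):
    """Give each compound to the first group in `ordered_groups` that stocks it."""
    remaining = df
    assigned = {}
    for group in ordered_groups:
        hits = remaining[remaining["section"] == group]
        assigned[group] = hits.drop_duplicates(subset="InChIKey14").reset_index(drop=True)
        # Remove compounds that were just claimed, so later groups cannot get them
        remaining = remaining[~remaining["InChIKey14"].isin(hits["InChIKey14"])]
    return assigned, remaining

# Invented data: compound AAA is stocked by both Group 1 and Group 2
toy = pd.DataFrame({
    "InChIKey14": ["AAA", "AAA", "BBB", "CCC"],
    "section": ["Group 2", "Group 1", "Group 2", "Group 3"],
})

assigned, leftover = assign_by_priority(toy, ["Group 1", "Group 2", "Group 3"])
for group, table in assigned.items():
    print(group, table["InChIKey14"].tolist())
```

Compound AAA goes to Group 1 because that group comes first in the list, even though Group 2 also has it. An empty `leftover` means every compound found a home.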

In [None]:
klara_ac_chemicals_updated.shape #Should show no entries

##### MMK/Org sections

In [None]:
klara_mmk_org_chemicals.section.unique()

In [None]:
mmk_org_group_names = ['Group Joseph Samec', 'Group Kálmán J Szabó', 'Group Miguel Rivero Crespo', 'Group Pher Andersson', 'Group Erica Zeglio', 'Group Belén Martín-Matute', 
                       'Group Berit Olofsson', 'Grupp Jiayin Yuan', 'Group Xiaodong Zou', 'Group Biswanath Das' , 'Group Jan E Bäckvall', 'Kemiska övningslaboratoriet, KÖL (MMK)']

klara_mmk_org_chemicals_updated, klara_mmk_org_separate_groups_dict = separate_groups(klara_mmk_org_chemicals, mmk_org_group_names)

##### ACES

In [None]:
klara_aces_chemicals_active = klara_aces_chemicals[(klara_aces_chemicals['nr.ahr'] == 1)|(klara_aces_chemicals['sr.mmp'] == 1) ].reset_index(drop=True)
klara_aces_chemicals_active = klara_aces_chemicals_active[klara_aces_chemicals_active['section']=='ACESo, Contaminant Chemistry Unit'].reset_index(drop=True)

klara_aces_chemicals_active = klara_aces_chemicals_active.drop_duplicates(subset=['name', 'cas', 'amount', 'unit', 'building', 'floor', 'room', 'storage', 'section'], keep='first').reset_index(drop=True)
klara_aces_chemicals_active = klara_aces_chemicals_active.sort_values(by='name', ascending=True)
klara_aces_chemicals_active = klara_aces_chemicals_active.drop(index=[60, 33, 48, 39, 40, 53, 27, 42, 45, 36])

klara_aces_chemicals_active

In [None]:
cedrol_index = klara_aces_chemicals_active[klara_aces_chemicals_active.name=='[3R-(3alpha,3aBeta,6alpha,7beta,8aAlpha)]-'].index

klara_aces_chemicals_active.loc[cedrol_index, 'name'] = 'Cedrol'

In [None]:
compounds_to_remove = ['Chlordane (mixture of isomers)', #Not appropriate for MS analysis to work with isomer mix
                       '4-Dodecylphenol, mixture of isomers' #Not appropriate for MS analysis to work with isomer mix
                       ] 

klara_aces_chemicals_active = klara_aces_chemicals_active[~klara_aces_chemicals_active['name'].isin(compounds_to_remove)]

In [None]:
klara_aces_chemicals_final = klara_aces_chemicals_active.drop_duplicates(subset=['name', 'cas', 'InChIKey'], keep='first').reset_index(drop=True)

In [None]:
klara_aces_chemicals_active_location = klara_aces_chemicals_active[['name', 'cas', 'barcode', 'amount', 'unit', 'building', 'floor', 'room', 'storage', 'section', 'comment']].reset_index(drop=True)

klara_aces_chemicals_active_location

klara_aces_chemicals_active_location.to_excel('2025-04-08_KLARA_ACES_chemicals_to_borrow_location.xlsx', index=False)

### TASK 3: List compilation of all compounds which are available
##### Groups that have graciously allowed us to use their chemicals

**Cheers to the groups that have allowed us to use their chemicals**

Groups that have allowed us to use their chemicals (will be updated):

From AC:
- Group Kruve
- Group Nilsson
- Group Sadiktsis
- Course lab 
- Group Ilag

From MMK/Org:
- Group Szabó (Group KS)
- Group Samec (Group JoS)
- Group Crespo
- Group Andersson
- Group Zeglio
- Group Martín-Matute
- Group Olofsson
- Group Yuan
- Group Zou
- Group Das
- Group Bäckvall

From ACES:
- ACESo


In [None]:
klara_mmk_org_chemicals_updated

In [None]:
# klara_mmk_org_chemicals_to_send_to_anneli = klara_mmk_org_chemicals_updated[['name', 'cas', 'section', 'nr.ahr', 'sr.mmp']]

# klara_mmk_org_chemicals_to_send_to_anneli.to_excel('2025-03-07_updated_kemikum_chemicals.xlsx', index=False)

In [None]:
# For each group, save the chemicals to a separate csv file
for key in klara_mmk_org_separate_groups_dict.keys():
    df = klara_mmk_org_separate_groups_dict[key]
    df = df[['name', 'cas','amount', 'unit', 'room', 'storage', 'comment']]
    df.to_csv(f'2025-03-26_chemicals_to_borrow_{key}.csv', index=False)

In [None]:
klara_mmk_org_chemicals_final = pd.concat(klara_mmk_org_separate_groups_dict.values(), ignore_index=True)

In [None]:
klara_mmk_org_chemicals_final

In [None]:
# add all the available actives to one dataframe
all_available_actives = pd.concat([klara_ac_chemicals_final, klara_mmk_org_chemicals_final, klara_aces_chemicals_final]).reset_index(drop=True)

In [None]:
#Reorganize the columns to make it easier to read
all_available_actives = all_available_actives[['name', 'cas', 'section', 'nr.ahr', 'sr.mmp', 
                                               'amount', 'unit', 'building', 'floor', 'room', 'storage',
                                               'comment', 'barcode', 'SMILES', 'ROMol', 'split_SMILES', 'InChIKey', 'InChIKey14',
                                               'duplicate_InChIKey', 'gc_probability']]

all_available_actives.head()

In [None]:
with open('/Users/elli/Library/CloudStorage/OneDrive-Kruvelab/Ellinor - Master thesis/Code/Experimental_work/2025-04-15_klara_available_actives.pkl', 'wb') as f:
    pickle.dump(all_available_actives, f)

In [None]:
import pickle 
import pandas as pd


with open('/Users/elli/Library/CloudStorage/OneDrive-Kruvelab/Ellinor - Master thesis/Experimental/Experimental_work/2025-04-03_klara_available_actives.pkl', 'rb') as f:
    all_available_actives = pickle.load(f)

In [None]:
print(all_available_actives[all_available_actives['nr.ahr']==1]['nr.ahr'].sum())
print(all_available_actives[all_available_actives['sr.mmp']==1]['sr.mmp'].sum())
print(all_available_actives[(all_available_actives['nr.ahr']==1)&(all_available_actives['sr.mmp']==1)].shape[0])

During the making of the standards, some things may have happened that prevented certain compounds from making it all the way to analysis.

These are removed using a filter listing every compound that ran into issues during the standard-making process, with a comment next to each name explaining why.
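
As an optional variation, the reasons could be stored as data instead of code comments, so that the list of dropped compounds can be exported together with its explanations. The sketch below uses hypothetical names and reasons. Matching with `isin` is exact, so the keys would have to equal the KLARA `name` entries character for character.

```python
import pandas as pd

# Hypothetical names and reasons, for illustration only
removal_reasons = {
    "Compound B": "not found",
    "Compound D": "could not be dissolved",
}

toy = pd.DataFrame({"name": ["Compound A", "Compound B", "Compound C", "Compound D"]})

is_removed = toy["name"].isin(list(removal_reasons))

# Compounds that go on to analysis
kept = toy[~is_removed].reset_index(drop=True)

# Compounds that were dropped, with the reason attached as a column
removed = (
    toy[is_removed]
    .assign(reason=lambda d: d["name"].map(removal_reasons))
    .reset_index(drop=True)
)

print(kept["name"].tolist())
print(removed)
```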

In [None]:
compounds_to_remove = ['Toluylene diisocyanate (mixutre of isomeres) <br>(mass)', # not appropriate for MS analysis, was also prone to polymerization
                       "N,N'-Dicyklohexylkarbodiimid", # Reacts with water, determined to not be appropriate to work with
                       'Folpet',#Not found
                       '1,2,5,6,9,10-Hexabromocyclododecane', #Not found
                       '4-Phenoxyphenol', #Not found
                       'Lindane', #Not found
                       '4-(Methylamino)phenol hemisulfate salt', #Not found
                       'p-Toluidin', #Too crystallized in packaging, could not be transferred
                       'Aminoguanidine bicarbonate',# Could not be dissolved in anything other than water
                       '1,2,4-Triazole', #Not found
                       '1,2,4-Triazole sodium derivative', #Not found
                       'beta-Phenylcinnamaldehyde', #Not found
                       'Triton X-100 (Sigma-Aldrich Sweden AB)' #Determined to not be suitable for GC-analysis
                       ] 

all_available_actives_updated = all_available_actives[~all_available_actives['name'].isin(compounds_to_remove)].reset_index(drop=True)

In [None]:
from rdkit import Chem
from rdkit.Chem import PandasTools, Descriptors, rdMolDescriptors, Crippen, Fragments

In [None]:
def calc_molecular_formula_and_mol_weight(df):
    '''
    This function calculates the following chemical characteristics:
         molecular formula
         monoisotopic molecular weight
         LogP
         number of amines
         number of hydroxyls
         number of hydrogen bond acceptors
         number of hydrogen bond donors
    '''

    PandasTools.AddMoleculeColumnToFrame(df, smilesCol='SMILES')
    df['monoisotopic_molecular_weight'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcExactMolWt)
    df['molecular_formula'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcMolFormula)
    df['logP'] = df['ROMol'].apply(Chem.Crippen.MolLogP)

    prim_amines = df['ROMol'].apply(Chem.Fragments.fr_NH2)
    sec_amines = df['ROMol'].apply(Chem.Fragments.fr_NH1)
    tert_amines = df['ROMol'].apply(Chem.Fragments.fr_NH0)
    arom_amines = df['ROMol'].apply(Chem.Fragments.fr_Ar_NH)
    df['amines'] = prim_amines + sec_amines + tert_amines + arom_amines

    aliph_hydroxyls = df['ROMol'].apply(Chem.Fragments.fr_Al_OH)
    aromatic_hydroxyls = df['ROMol'].apply(Chem.Fragments.fr_Ar_OH)
    df['hydroxyls'] = aliph_hydroxyls + aromatic_hydroxyls

    df['HBA'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA)
    df['HBD'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD)
    return df

all_available_actives_updated = calc_molecular_formula_and_mol_weight(all_available_actives_updated)

In [None]:
#get PubChem CID to be able to get the PubChem data
import pubchempy as pcp

def get_pubchem_cid(df):
    '''
    This function gets the PubChem cid for the compounds in the dataframe
    '''
    pubchem_data = []
    for index, row in df.iterrows():
        try:
            compound = pcp.get_compounds(row['InChIKey'], 'inchikey')[0].to_dict(properties=['cid'])['cid']
            pubchem_data.append(compound)
        except Exception as e:  # e.g. no PubChem match (IndexError) or a network error
            print('Failed to get data for compound:', row['InChIKey'], '-', e)
            pubchem_data.append(None)

    df['pubchem_cid'] = pubchem_data
    return df

all_available_actives_updated = get_pubchem_cid(all_available_actives_updated)

In [None]:
import numpy as np
import requests

#get spectral and experimental data from PubChem
def get_pubchem_data(cid):
    '''
    Get information on spectral data from PubChem
    '''

    def get_spectral_data(section):
        '''
        Check whether GC-MS and LC-MS spectra are listed in the sections of the PubChem record
        '''
        gcms = False
        lcms = False
        
        for subsection in section:
            if subsection.get('TOCHeading') == 'Spectral Information':
                spectral_info = subsection.get('Section')
                for subsection in spectral_info:
                    if subsection.get('TOCHeading') == 'Mass Spectrometry':
                        mass_spec = subsection.get('Section')
                        for subsection in mass_spec:
                            if subsection.get('TOCHeading') == 'GC-MS':
                                gcms = True
                            elif subsection.get('TOCHeading') == 'LC-MS':
                                lcms = True 

        return gcms, lcms
    
    def get_experimental_data(section):
        '''
        Get experimental boiling points and vapor pressures from the sections of the PubChem record
        '''
        bp = []
        vp = []
        for subsection in section:
            if subsection.get('TOCHeading') == 'Chemical and Physical Properties':
                chemical_props = subsection.get('Section')

                for subsection in chemical_props:
                    if subsection.get('TOCHeading') == 'Experimental Properties':
                        experimental_props = subsection.get('Section')
                        
                        for subsection in experimental_props:
                            if subsection.get('TOCHeading') == 'Boiling Point':
                                # Extract boiling point
                                bp_info = subsection.get('Information')
                                for ref in bp_info:
                                    if any('ExtendedReference' in k for k in ref):
                                        if any('Matched' in k for k in ref.get('ExtendedReference')[0]): # Requires Matching to library to be true
                                            bp.append(str(ref.get('Value').get('StringWithMarkup')[0].get('String')))
                                        
                            elif subsection.get('TOCHeading') == 'Vapor Pressure':
                                # Extract vapor pressure
                                vp_info = subsection.get('Information')
                                for ref in vp_info:
                                    if any('ExtendedReference' in k for k in ref):
                                        if any('Matched' in k for k in ref.get('ExtendedReference')[0]):
                                            vp.append(str(ref.get('Value').get('StringWithMarkup')[0].get('String')))
                            
        return bp, vp
        

    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
    
    try:
        response = requests.get(url).json()

        section = response.get('Record').get('Section')

        # Check if the JSON contains spectral information
        gcms, lcms = get_spectral_data(section)
        # Check if the JSON contains experimental data
        bp, vp = get_experimental_data(section)

        return gcms, lcms, bp, vp
    
    except Exception as e:
        print(f"Error fetching data for CID {cid}: {e}")
        return np.nan, np.nan, np.nan, np.nan

# Get the PubChem data for all available actives
all_available_actives_updated[['gcms_spectra_available', 'lcms_spectra_available', 'boiling_point', 'vapor_pressure']] = all_available_actives_updated['pubchem_cid'].apply(get_pubchem_data).apply(pd.Series)

In [None]:
all_available_actives_updated

In [None]:
all_available_actives_updated.to_csv('2025-04-07_Ellinors_compounds.csv', index=False)

In [None]:
available_actives_w_same_mol_formula = all_available_actives[all_available_actives.duplicated(subset='molecular_formula', keep=False)].reset_index(drop=True).sort_values(by='molecular_formula')

available_actives_w_same_mol_formula

In [None]:
test = all_available_actives_updated.copy()  # copy, so setting the CAS index below does not alter the original df

In [None]:
test.index = test.cas

In [None]:
test.to_dict(orient='index')

In [None]:
all_available_actives_updated.monoisotopic_molecular_weight.describe()

In [None]:
all_available_actives_updated.to_csv('2025-03-26_endocrine_tox_active_chemicals.tsv', sep='\t', index=False)

In [None]:
all_available_actives_updated.ROMol[37]

For each mix, a list of the compound names available on KLARA is made. These lists are then used to filter the already used compounds out of the 'all_available_actives_updated' df, while a new df is made for each mix for easy access to the information.
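
Since the filtering relies on exact name matches, a name with a typo or a stray space silently drops out of its mix. Below is a minimal sketch of a sanity check for that, using a toy df and made-up mix lists. The helper `find_unmatched` and the variables `compounds` and `mixes` are hypothetical and not part of the notebook, they are only here to show the idea.

```python
import pandas as pd

# Toy stand-in for the compound df; only the 'name' column matters here.
compounds = pd.DataFrame({
    "name": ["Anthrone", "Thiourea", "1-Naphthol", "Indene (mass)"],
})

# Hypothetical mix lists; note the trailing space in the last entry.
mixes = {
    "mix1": ["Indene (mass)", "Anthrone"],
    "mix2": ["Thiourea", "1-Naphthol "],
}

def find_unmatched(df, mixes):
    """Return, per mix, the names that have no exact match in df['name']."""
    known = set(df["name"])
    return {mix: [n for n in names if n not in known]
            for mix, names in mixes.items()}

unmatched = find_unmatched(compounds, mixes)
print(unmatched)  # {'mix1': [], 'mix2': ['1-Naphthol ']}
```

An empty list for every mix means all names were found. Anything else is worth a closer look before the mixes are separated.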

In [None]:
mix1_list = ['4-Chlorophenyl isocyanate', '2-Chloroacetophenone', 'alpha-Tetralone (volume)', 'cis-Stilbene (mass)', 'Triphenylborane', 'Indene (mass)', 'Ftaldialdehyd ', 'p-Chloranil']

mix2_list = ['N-Phenyl-o-phenylenediamine', '3-(Dimethylamino)-phenol', 'N,N-Dimethyl-p-phenylenediamine', '2-Nitrophenylacetonitrile', '1,3-Phenylenediamine', 'Benzhydrazide', '2,4,6-Trichlorophenol', 'N,N-Dimethyl-p-toluidine (mass)',
             '2,3-Diaminotoluene', '1-Naphthol', 'Thiourea', 'Myristyltrimethylammonium bromide', 'Hexadecyltrimetylammoniumbromid', 'N,N-Diethyl-1,4-phenylenediammonium sulfate']

mix3_list = ['Tetramethylthiuram disulfide', 'Parathion-methyl', '5-Nitroacenaphthene', '2-Nitrofluorene', '6-Nitroquinoline', '1-Nitronaphthalene', 'Quinoline Yellow', 
             'N-Cyclohexylbenzothiazole-2-sulphenamide', 'N-tert-Butyl-2-benzothiazolesulfenamide', '4-Chloro-m-phenylenediamine']

mix4_list = ['9,10-Dihydrobenzo[a]pyrene-7(8H)-one', '8-Nitroquinoline', '1,2:3,4-Dibenzanthracene','3-Aminofluoranthene',  '1-Methylpyrene',
             '9-Anthracenemethanol', 'Anthrone', '2-Amino-4-methylphenol']

mix5_list = ['2-Methylanthraquinone', 'p-Anisidine (Sigma-Aldrich 800458)', 'N,N-Dimethyl-4-nitrosoaniline', '1-(2-Chlorophenyl)-1-(4-chlorophenyl)-2,2-<br>dichloroethane',
             "4,4'-Dihydroxybiphenyl" ]

mixes_list = [mix1_list, mix2_list, mix3_list, mix4_list, mix5_list]

In [None]:
type(mixes_list[0])

In [None]:
mixes_list[0]

In [None]:
mix1_df = all_available_actives_updated[all_available_actives_updated.name.isin(mixes_list[0])]
mix1_df

In [None]:
def separate_mixes(df, mix_list):

    '''
    Separate the chemicals into mixes based on the lists of compound names provided.
    Returns the df of compounds not assigned to any mix, and a dictionary with
    the mix name ('mix1', 'mix2', ...) as key and the chemical-df as value
    '''

    remaining_df = df
    mixes_dict = {}

    for nr, mix_names in enumerate(mix_list, start=1):
        in_mix = remaining_df['name'].isin(mix_names)
        mix_df = remaining_df[in_mix].reset_index(drop=True)
        remaining_df = remaining_df[~in_mix].reset_index(drop=True)

        # Sort each mix by mass, lightest first
        mix_df = mix_df.sort_values(by='monoisotopic_molecular_weight', ascending=True)

        mixes_dict['mix' + str(nr)] = mix_df

    return remaining_df, mixes_dict

all_available_actives_updated2, active_mixes_dict = separate_mixes(all_available_actives_updated, mixes_list)
active_mixes_dict.keys() #Check that all mixes are included

In [None]:
active_mixes_dict['mix1']