# Cleaning Henry's Law Constants Dataset

Here we clean up the CSV generated by Tabula and make sure everything is in the right format. At the end there are 4632 unique species.

Looking at the dataframe created using Tabula there are some clear things to note:
- The best values are the first for each species and are consequently next to the substance name
- In column 0, there is no data
- In column 1, long substance names sometimes overflow onto the next row, so the pieces need to be joined
- Column 2 holds the Henry's law constants, which are not all formatted consistently and need to be cleaned up before they can be converted to floats
- Some values are infinity or are not definite values as there is an inequality present
- Column 3 is just references so can be dropped


In [221]:
import pandas as pd
import numpy as np

In [222]:
df = pd.read_csv('henrys_law_dataset.csv')

In [223]:
df

Unnamed: 0,0,1,2,3
0,,,Inor,ganic species
1,,,O,xygen (O)
2,,oxygen,1.2×10−5,1700 Warneck and Williams (2012) L
3,,O2,1.3 ×10−5,1500 Sander et al. (2011) L
4,,[7782-44-7],1.3×10−5,1500 Sander et al. (2006) L
...,...,...,...,...
23823,,(methyltriethyl lead),,
23824,,[1762-28-3],,
23825,,tetraethyllead,1.3×10−5,6400 Feldhake and Stevens (1963) M
23826,,C8H20Pb,1.3×10−5,Abraham (1979) ?


In [224]:
# as we can see above, the first column contains no data so it can be removed
# the last column only contains reference data, which isn't needed, so it can also be removed
df = df.drop(df.columns[[0, 3]], axis=1)

In [225]:
# renaming column names
df.rename(columns={'1': 'Species', '2': 'Hcp'}, inplace=True)

In [226]:
# some names are very long and take up more than one row, so we need to join the pieces
# a name that ends in '-', ',' or ')' is treated as incomplete

# make sure all values in Species are strings so they can be worked with
# (note that astype(str) turns missing names into the string 'nan')
df['Species'] = df['Species'].astype(str)

# first a column containing the species name shifted up by one row is created - this puts the rest of an incomplete name next to its start
df['name endings'] = df['Species'].shift(-1)

df['Species'] = df.apply(lambda r: (r['Species'] + r['name endings']) if r['Species'].endswith(('-', ',', ')')) else r['Species'], axis=1)

# the above works well if the name occupies two rows but what if it takes up three rows?
# we repeat the process with the already combined names shifted up by two rows so that any leftover piece can be appended
df['name endings 2'] = df['Species'].shift(-2)

df['Species'] = df.apply(lambda r: (r['Species'] + r['name endings 2']) if r['Species'].endswith(('-', ',')) else r['Species'], axis=1)
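
The row-joining idea can be tried on a small made-up frame. This is only a sketch, not the code used in this notebook: the `toy` frame and the `Joined` column are hypothetical, and a regular expression stands in for the chained `endswith` checks.

```python
import pandas as pd

# Toy data (made up for illustration): one name is split over two rows.
toy = pd.DataFrame({'Species': ['oxygen', '2,4-dichloro-', 'phenol', 'ozone']})

# A name ending in '-', ',' or ')' is treated as incomplete.
incomplete = toy['Species'].str.contains(r'[-,)]$')

# Text of the following row; the last row has nothing after it.
next_row = toy['Species'].shift(-1).fillna('')

# Keep complete names as they are and append the next row to incomplete ones.
toy['Joined'] = toy['Species'].where(~incomplete, toy['Species'] + next_row)

print(toy)
```

The leftover piece ('phenol' here) still occupies its own row afterwards, just as in the real dataframe, so it has to be filtered out in a later step.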

In [227]:
df

Unnamed: 0,Species,Hcp,name endings,name endings 2
0,,Inor,,oxygen
1,,O,oxygen,O2
2,oxygen,1.2×10−5,O2,[7782-44-7]
3,O2,1.3 ×10−5,[7782-44-7],
4,[7782-44-7],1.3×10−5,,
...,...,...,...,...
23823,(methyltriethyl lead)[1762-28-3],,[1762-28-3],tetraethyllead
23824,[1762-28-3],,tetraethyllead,C8H20Pb
23825,tetraethyllead,1.3×10−5,C8H20Pb,[78-00-2]
23826,C8H20Pb,1.3×10−5,[78-00-2],


In [228]:
# we can now drop the name endings and name endings 2 columns as we are done with them
df = df.drop(df.columns[[2, 3]], axis=1)

In [229]:
# rows with NaN in the Hcp column do not correspond to any Henry's law constant or IUPAC name so can be removed
# (Species was cast to str above, so Hcp is the only column that can still contain NaN)
df.dropna(inplace=True)

In [230]:
df

Unnamed: 0,Species,Hcp
0,,Inor
1,,O
2,oxygen,1.2×10−5
3,O2,1.3 ×10−5
4,[7782-44-7],1.3×10−5
...,...,...
23814,ethyltrimethylplumbane,2.8×10−5
23817,diethyldimethylplumbane,2.1 ×10−5
23821,triethylmethylplumbane,1.6×10−5
23825,tetraethyllead,1.3×10−5


In [231]:
# we now need to get rid of rows whose Species value is a chemical formula or a CAS ID as these represent duplicates
# we know that the IUPAC names all start with a lowercase letter or a digit so we can filter by that (formulas start with an uppercase letter and CAS IDs with a square bracket)
df_filtered = df[df.Species.str.contains(r'^[0-9a-z]')]

# stereodescriptors such as E, Z, S, R, - are written in brackets, e.g. (E), so these need to be accounted for as well
df_stereo = df[df.Species.str.contains(r'^\([^a-z]+\)')]

# some names start with a bracket followed by a digit or lowercase letter - how do we tell these apart from the 'Other Names'?
# the IUPAC names contain a hyphen straight after the closing bracket
df_brackets = df[df.Species.str.contains(r'^\(\S+\)\-')]

# combine the three filtered dataframes together
df_clean = pd.concat([df_filtered, df_stereo, df_brackets])
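
To see what the three patterns keep and drop, they can be run on a handful of names. The sketch below is illustrative only: the `names` series is a hypothetical sample (some entries are taken from the tables above, others are made up), and the masks are combined with `|` rather than with `pd.concat`, which means a row matching two patterns appears only once.

```python
import pandas as pd

# Hypothetical sample of Species values.
names = pd.Series(['oxygen', 'O2', '[7782-44-7]', '(E)-2-butene',
                   '(bromomethyl)-benzene', '(methyltriethyl lead)', '2-propanol'])

starts_lower_or_digit = names.str.contains(r'^[0-9a-z]')   # plain IUPAC names
stereo_prefix = names.str.contains(r'^\([^a-z]+\)')        # e.g. (E), (Z), (R)
bracket_then_hyphen = names.str.contains(r'^\(\S+\)-')     # e.g. (bromomethyl)-benzene

# Combine the boolean masks so that each row is selected at most once.
kept = names[starts_lower_or_digit | stereo_prefix | bracket_then_hyphen]

print(kept.tolist())
```

Formulas, CAS numbers and bracketed alternative names without a trailing hyphen match none of the patterns, so they are left out.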

In [232]:
df_filtered

Unnamed: 0,Species,Hcp
0,,Inor
1,,O
2,oxygen,1.2×10−5
5,,1.3×10−5
6,,1.3 10−5×
...,...,...
23811,tetramethyl lead,1.6×10−5
23814,ethyltrimethylplumbane,2.8×10−5
23817,diethyldimethylplumbane,2.1 ×10−5
23821,triethylmethylplumbane,1.6×10−5


In [233]:
df_clean.Species.duplicated().sum()

7184

In [234]:
# removing duplicates
df_clean = df_clean.drop_duplicates(subset=['Species'])
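
`drop_duplicates` keeps the first occurrence of each name by default, which fits the observation that the best value is listed first. A minimal sketch on a made-up frame (the `toy` data are hypothetical and only reuse values shown above):

```python
import pandas as pd

# Toy data: the same name appears several times with different values.
toy = pd.DataFrame({
    'Species': ['oxygen', 'oxygen', 'ozone', 'oxygen'],
    'Hcp': ['1.2×10−5', '1.3×10−5', '1.0×10−4', '1.3×10−5'],
})

# keep='first' is the default; it is spelled out here to make the choice visible.
deduped = toy.drop_duplicates(subset=['Species'], keep='first').reset_index(drop=True)

print(deduped)
```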

In [235]:
# reset index
df_clean.reset_index(drop=True, inplace=True)
df_clean

Unnamed: 0,Species,Hcp
0,,Inor
1,oxygen,1.2×10−5
2,ozone,1.0×10−4
3,hydrogen atom,2.6 ×10−6
4,hydrogen,7.8×10−6
...,...,...
4723,"(2,4-dichlorophenoxy)-acetic acid 2-ethylhexyl...",5.5×10−1
4724,"(2,4-dichlorophenoxy)-acetic acid,isooctyl ester",1.7×10−1
4725,(bromomethyl)-benzene,1.4 10−3×
4726,(2-bromoethyl)-benzene,6.5×10−3


In [236]:
# one thing to note before dealing with the Hcp values is that some have a '>' in front of them - we will need to remove this in order to work with them
df_clean['Hcp'] = df_clean.apply(lambda r: (r['Hcp'].replace('>', '')) if r['Hcp'].startswith('>') else r['Hcp'], axis=1)

In [237]:
# at the moment the Hcp values are a bit of a mess - we need to standardise them into a single format and then convert them into floats

# there are a few different situations to deal with - the first is the × being in the wrong position
def correct_x_position(Hcp_value):
    # values of three characters or fewer, e.g. 1.2, aren't in standard form and are left unchanged
    if len(Hcp_value) > 3 and Hcp_value[3] != '×':
        Hcp_value = Hcp_value.replace('×', '').replace(' ', '')    # remove any misplaced × characters and then any whitespace
        Hcp_value = Hcp_value[0:3] + '×' + Hcp_value[3:]    # place the × in the correct position before 10

    return Hcp_value

df_clean['Hcp'] = df_clean['Hcp'].apply(correct_x_position)

In [238]:
# next we rewrite the strings in e-notation so that they can be converted to floats later on
def standard_form(Hcp_value):
    if len(Hcp_value) <= 6:     # accounts for any values which aren't in standard form, e.g. 1.2
        Hcp_value = Hcp_value.replace('×10', 'e')
    
    elif Hcp_value[5] == '0':   # deals with values where everything is in the correct position
        Hcp_value = Hcp_value.replace('−', '-')   # replace any − signs with the correct - sign
        Hcp_value = Hcp_value.replace('×10', 'e')    # converting values into scientific format that is understood by python
    
    elif Hcp_value[-2] == '0':   # deals with positive powers
        Hcp_value = Hcp_value[0:6] + '+' + Hcp_value[6:]   # insert + into any values without - sign to distinguish between 104 and 10^4
        Hcp_value = Hcp_value.replace('×10', 'e')    # converting values into scientific format that is understood by python
    
    elif Hcp_value[-3] != '0':   # deals with values where something is out of position
        Hcp_value = Hcp_value[0:4] + '10' + Hcp_value[-3]
        Hcp_value = Hcp_value.replace('×10', 'e')
    
    return Hcp_value
    
df_clean['Hcp'] = df_clean['Hcp'].apply(standard_form)
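
As a point of comparison, the same clean-up can be approached with a regular expression instead of fixed character positions. The sketch below is an alternative under stated assumptions, not the method used in this notebook: `parse_hcp` is a hypothetical helper, and it assumes every value in standard form contains a × somewhere and a power of ten written as `10` followed by an optionally signed exponent.

```python
import math
import re

# Hypothetical helper for illustration; the name parse_hcp is not from the notebook.
# Pattern: mantissa, then the literal '10', then an optionally signed exponent.
_SCI = re.compile(r'^(\d+(?:\.\d+?)?)10([+-]?\d+)$')

def parse_hcp(text):
    """Return a float for strings such as '1.3 ×10−5', or NaN if unparsable."""
    # Drop spaces and swap the typographic minus for an ASCII hyphen.
    cleaned = text.replace(' ', '').replace('−', '-')
    if '×' in cleaned:
        # The × may sit in the wrong place, so remove it and rely on the pattern.
        match = _SCI.match(cleaned.replace('×', ''))
        if match is None:
            return math.nan
        mantissa, exponent = match.groups()
        return float(f'{mantissa}e{exponent}')
    try:
        return float(cleaned)          # plain numbers such as '1.2'
    except ValueError:
        return math.nan                # headers, inequalities, anything else

examples = ['1.2×10−5', '1.3 ×10−5', '1.4 10−3×', '1.0×104', '1.2', 'Inor']
parsed = [parse_hcp(x) for x in examples]
print(parsed)
```

In this sketch a value with a leading `>` does not match the pattern and comes back as NaN, so inequalities are treated as missing rather than silently converted.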

In [239]:
df_clean

Unnamed: 0,Species,Hcp
0,,Ino×r
1,oxygen,1.2e-5
2,ozone,1.0e-4
3,hydrogen atom,2.6e-6
4,hydrogen,7.8e-6
...,...,...
4723,"(2,4-dichlorophenoxy)-acetic acid 2-ethylhexyl...",5.5e-1
4724,"(2,4-dichlorophenoxy)-acetic acid,isooctyl ester",1.7e-1
4725,(bromomethyl)-benzene,1.4e-3
4726,(2-bromoethyl)-benzene,6.5e-3


In [240]:
# there are some values which appear as just e rather than e-1 so we fix that here
df_clean['Hcp'] = [x + '-1' if x[-1] == 'e' else x for x in df_clean['Hcp']]

In [241]:
# converting string values to float
df_clean['Hcp'] = pd.to_numeric(df_clean['Hcp'], errors='coerce')

In [242]:
# format the numbers compactly - note that this turns the floats back into strings, so NaN becomes the string 'nan'
df_clean.Hcp = df_clean.Hcp.map('{:g}'.format)

In [243]:
# reset index
df_clean.reset_index(drop=True, inplace=True)
df_clean

Unnamed: 0,Species,Hcp
0,,
1,oxygen,1.2e-05
2,ozone,0.0001
3,hydrogen atom,2.6e-06
4,hydrogen,7.8e-06
...,...,...
4723,"(2,4-dichlorophenoxy)-acetic acid 2-ethylhexyl...",0.55
4724,"(2,4-dichlorophenoxy)-acetic acid,isooctyl ester",0.17
4725,(bromomethyl)-benzene,0.0014
4726,(2-bromoethyl)-benzene,0.0065


In [244]:
df_clean.loc[df_clean['Hcp'] == 'nan']

Unnamed: 0,Species,Hcp
0,,
38,chlorine nitrate,
46,bromine nitrate,
55,sulfur trioxide,
874,peroxyacetyl radical,
4169,demeton-S-methyl sulfone,


Looking back at the original dataset in the PDF, we can see that these nan values correspond to uncertain values, such as >4.9×10−4 for nitrosyl chloride, or to infinity, as for chlorine nitrate.

In [245]:
# dropping these values from the dataframe
df_clean = df_clean[df_clean.Hcp != 'nan']
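
If the column is kept numeric instead of being formatted back into strings, missing values can be inspected and dropped with `isna` and `dropna` rather than by comparing against the text 'nan'. A minimal sketch on made-up data (the `toy` frame and the `'n/a'` placeholder are hypothetical):

```python
import pandas as pd

# Toy data: one value cannot be read as a number.
toy = pd.DataFrame({
    'Species': ['oxygen', 'chlorine nitrate', 'ozone'],
    'Hcp': ['1.2e-5', 'n/a', '1.0e-4'],
})

# Unparsable strings become NaN instead of raising an error.
toy['Hcp'] = pd.to_numeric(toy['Hcp'], errors='coerce')

# Rows that failed to parse, for checking against the source document.
unparsed = toy[toy['Hcp'].isna()]

# Numeric rows only, with a fresh index.
toy_numeric = toy.dropna(subset=['Hcp']).reset_index(drop=True)

print(unparsed)
print(toy_numeric)
```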

In [246]:
# reset index
df_clean.reset_index(drop=True, inplace=True)
df_clean

Unnamed: 0,Species,Hcp
0,oxygen,1.2e-05
1,ozone,0.0001
2,hydrogen atom,2.6e-06
3,hydrogen,7.8e-06
4,deuterium,7.9e-06
...,...,...
4717,"(2,4-dichlorophenoxy)-acetic acid 2-ethylhexyl...",0.55
4718,"(2,4-dichlorophenoxy)-acetic acid,isooctyl ester",0.17
4719,(bromomethyl)-benzene,0.0014
4720,(2-bromoethyl)-benzene,0.0065


In [247]:
# some CAS numbers have slipped through the earlier filters, so we remove them here
df_clean = df_clean[~df_clean['Species'].str.endswith(']')]
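
A stricter check is to match the shape of a CAS number (two to seven digits, two digits, one check digit) rather than only the closing bracket. The sketch below is illustrative: the `names` series is a hypothetical sample built from entries shown earlier, and whether the stricter pattern is preferable for the full dataset has not been verified here.

```python
import pandas as pd

# Hypothetical sample of Species values.
names = pd.Series(['oxygen', '[7782-44-7]',
                   '(methyltriethyl lead)[1762-28-3]', 'tetraethyllead'])

cas_pattern = r'\[\d{2,7}-\d{2}-\d\]'

# Entries that consist of nothing but a bracketed CAS number.
is_cas_only = names.str.fullmatch(cas_pattern)

# Entries that end in a bracketed CAS number, including joined names.
ends_with_cas = names.str.contains(cas_pattern + '$')

print(names[~ends_with_cas].tolist())
```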

In [248]:
df_clean.to_csv('cleaned_henry_dataset.csv', index=False)

### Scraping dataset from f90 file

In [249]:
# opening file containing Hcp values in SI units
file = open("HcpSI.f90", "r")

In [250]:
# create list where each element contains a species - in the f90 file each species starts with '! species:'
species_element = file.read().split(sep='! species:')
file.close()   # the file is no longer needed once its contents have been read

In [251]:
# then we will split each species element into their respective lists depending on whether it is the species name, inchikey, or Hcp value
# start by creating empty lists
Species = []
InChIKey = []
Hcp = []

In [252]:
# iterate through each element in the list created above, appending the relevant value to each list
for i in species_element[1: ]:    # first element contains the file header so skip this
    species = i.split(sep='\n')[0]    # first line in element
    Species.append(species)   # add to species list
    
    inchikey = i.split(sep='\n')[4]   # fifth line in element (index 4)
    InChIKey.append(inchikey)
    
    hcp = i.split(sep='!')[4]    # take first Hcp value in list as this is deemed to be most reliable
    Hcp.append(hcp)

In [253]:
# remove leading and trailing whitespace from species names
Species = [x.strip() for x in Species]

In [254]:
# remove '! inchikey:' from the beginning of all keys
InChIKey = [x.replace('! inchikey: ', '') for x in InChIKey]

In [255]:
hcp_values = []

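# each string still starts with the inchikey text, so keep only what follows the '=' sign
# entries without an HcpSI assignment are recorded as 'infinity'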
for x in Hcp:
    if 'HcpSI' in x:
        start = x.index('=')
        value = x[start+1:]
        hcp_values.append(value)

    else:
        hcp_values.append('infinity')

In [256]:
# remove leading and trailing whitespace
hcp_values = [x.strip() for x in hcp_values]

In [257]:
# keep the first seven characters of each string, which hold the value, e.g. 1.2E-05
cleaned_values = [x[0:7] for x in hcp_values]

In [258]:
# create dataframe with these lists
df2 = pd.DataFrame({'Species': Species, 'InChIKey': InChIKey, 'Hcp': cleaned_values})
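
The extraction steps above depend on each field sitting at a fixed position within a species block. A label-based variant is sketched below. The layout of `HcpSI.f90` is not shown in this notebook, so the `sample` text is a hypothetical excerpt and the field labels other than `! species:` and `! inchikey:` are assumptions.

```python
import re

# Hypothetical excerpt: the real file layout is assumed, not reproduced.
sample = """\
! header text
! species: oxygen
! formula: O2
! inchikey: MYMOFIZGZYHOMD-UHFFFAOYSA-N
HcpSI = 1.2E-05
! species: ozone
! formula: O3
! inchikey: CBENFWSGALASAD-UHFFFAOYSA-N
HcpSI = 1.0E-04
"""

records = []
for block in sample.split('! species:')[1:]:       # skip the header element
    name = block.split('\n', 1)[0].strip()         # rest of the species line
    key = re.search(r'! inchikey:\s*(\S+)', block)
    value = re.search(r'=\s*([0-9.]+E[+-]?\d+)', block)   # first assigned value
    records.append({
        'Species': name,
        'InChIKey': key.group(1) if key else None,
        'Hcp': float(value.group(1)) if value else None,
    })

print(records)
```

Searching by label means a block with an extra or missing line still yields the right fields, at the cost of assuming the labels are spelled consistently.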

In [259]:
df2

Unnamed: 0,Species,InChIKey,Hcp
0,oxygen,MYMOFIZGZYHOMD-UHFFFAOYSA-N,1.2E-05
1,ozone,CBENFWSGALASAD-UHFFFAOYSA-N,1.0E-04
2,hydrogen atom,YZCKVEUIGOORGS-UHFFFAOYSA-N,2.6E-06
3,hydrogen,UFHFLCQGNIYNRP-UHFFFAOYSA-N,7.8E-06
4,deuterium,UFHFLCQGNIYNRP-VVKOMZTBSA-N,7.9E-06
...,...,...,...
4627,tetramethyl lead,XOOGZRUBTYCLHG-UHFFFAOYSA-N,1.6E-05
4628,ethyltrimethylplumbane,KHQJREYATBQBHY-UHFFFAOYSA-N,2.8E-05
4629,diethyldimethylplumbane,OLOAJSHVLXNSQV-UHFFFAOYSA-N,2.1E-05
4630,triethylmethylplumbane,KGFRUGHBHNUHOS-UHFFFAOYSA-N,1.6E-05


In [260]:
# converting string values to float
df2['Hcp'] = pd.to_numeric(df2['Hcp'], errors='coerce')

In [261]:
# save dataframe as csv
df2.to_csv('hcp_values.csv', index=False)

### Comparing methods by which dataframes were created

In [262]:
# check how many rows match up
merged = pd.merge(df_clean, df2, on=['Species'], how='inner')

In [263]:
merged

Unnamed: 0,Species,Hcp_x,InChIKey,Hcp_y
0,oxygen,1.2e-05,MYMOFIZGZYHOMD-UHFFFAOYSA-N,1.200000e-05
1,ozone,0.0001,CBENFWSGALASAD-UHFFFAOYSA-N,1.000000e-04
2,hydrogen atom,2.6e-06,YZCKVEUIGOORGS-UHFFFAOYSA-N,2.600000e-06
3,hydrogen,7.8e-06,UFHFLCQGNIYNRP-UHFFFAOYSA-N,7.800000e-06
4,deuterium,7.9e-06,UFHFLCQGNIYNRP-VVKOMZTBSA-N,7.900000e-06
...,...,...,...,...
3733,"(2,4-dichlorophenoxy)-acetic acid, 2-butoxyeth...",0.62,ZMWGIGHRZQTQRE-UHFFFAOYSA-N,6.200000e+01
3734,"(2,4-dichlorophenoxy)-acetic acid 2-ethylhexyl...",0.55,QZSFJRIWRPJUOH-UHFFFAOYSA-N,5.500000e-01
3735,(bromomethyl)-benzene,0.0014,AGEZXYOZHKGVCM-UHFFFAOYSA-N,1.400000e-03
3736,(2-bromoethyl)-benzene,0.0065,WMPPDTMATNBGJN-UHFFFAOYSA-N,6.500000e-03


894 entries were not matched (4631 rows in the f90 dataset minus 3737 matched rows), meaning they are missing from the final dataset scraped from the PDF.
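
To list which entries are unmatched, rather than only counting them, a left merge with `indicator=True` can be used. This is a minimal sketch on made-up frames: `pdf_df` and `f90_df` are hypothetical stand-ins for `df_clean` and `df2`.

```python
import pandas as pd

# Toy stand-ins for the two datasets.
pdf_df = pd.DataFrame({'Species': ['oxygen', 'ozone'],
                       'Hcp': [1.2e-5, 1.0e-4]})
f90_df = pd.DataFrame({'Species': ['oxygen', 'ozone', 'deuterium'],
                       'Hcp': [1.2e-5, 1.0e-4, 7.9e-6]})

# Keep every f90 row; the _merge column records where each row was found.
check = f90_df.merge(pdf_df, on='Species', how='left',
                     suffixes=('_f90', '_pdf'), indicator=True)

# Rows present in the f90 data but absent from the PDF-derived data.
unmatched = check.loc[check['_merge'] == 'left_only', 'Species']

print(unmatched.tolist())
```

The same frame also holds both value columns side by side, so matched rows can be compared for disagreements between the two sources.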