# Curating Henry's Law Dataset

Henry's law constants were scraped from a PDF (https://acp.copernicus.org/articles/15/4399/2015/acp-15-4399-2015.pdf) using Tabula (https://tabula.technology/). Tabula made a few extraction errors that need correcting, and only the most reliable constant for each species should be kept. The end result should contain 4632 rows, one for each of 4632 unique species.

In [71]:
import pandas as pd
import numpy as np

In [72]:
# loading Henry's Law constants dataset as a dataframe
df = pd.read_csv('henrys_law_constants.csv')

In [73]:
df.head()

Unnamed: 0.1,Unnamed: 0,cp,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5
0,Substance,H,,,,
1,Formula,(at T ),,,,
2,(Other name(s)),[ mol],,,,
3,[CAS registry number],m 3 Pa,,,,
4,,Inorganic,,,,


In [74]:
# delete empty columns on the right
df = df.drop(df.columns[[2, 3, 4, 5]], axis=1)

In [75]:
df

Unnamed: 0.1,Unnamed: 0,cp
0,Substance,H
1,Formula,(at T )
2,(Other name(s)),[ mol]
3,[CAS registry number],m 3 Pa
4,,Inorganic
...,...,...
55352,U.: Assessment of chemical screening outcomes ...,
55353,"ent partitioning property estimation methods, ...",
55354,"514–520, 2010.",
55355,"Zhang, Z. and Pawliszyn, J.: Headspace solid-p...",


In [76]:
# deleting rows which don't contain Henry's law data
df = df.drop(labels=range(52027, 55357), axis=0)   # these are the references at the end of the paper

In [77]:
# these are the heading names which we don't need
header_labels = ['Substance', 'Formula', '(Other name(s))', '[CAS registry number]']
df = df.loc[~df['Unnamed: 0'].isin(header_labels)]

In [78]:
# renaming column names
df.rename(columns={'Unnamed: 0': 'Substance', 'cp': 'Hcp'}, inplace=True)

In [79]:
df

Unnamed: 0,Substance,Hcp
4,,Inorganic
5,,
6,oxygen,1.2×10−5
7,O2,1.3×10−5
8,[7782-44-7],1.3×10−5
...,...,...
52022,(methyltriethyl lead),
52023,[1762-28-3],
52024,tetraethyllead,1.3 ×10 −5
52025,C8H20Pb,1.3 ×10−5


In [80]:
df.isna().sum()

Substance    13320
Hcp          23303
dtype: int64

The null values in the Substance column do not belong to a new substance: they are just extra constants for the substance above. However, looking at the CSV file, some of the constants share a cell with the substance name, so those will need to be separated.

In [81]:
# can delete null values in the substance column
df.dropna(subset=['Substance'], inplace=True)

In [82]:
# as an example here we can see that we need to split the substance name and Hcp value
df.iloc[395]

Substance    sulfur hexafluoride 2.4 10−6×
Hcp                                   3100
Name: 801, dtype: object

In [83]:
# here we are splitting at the first run of whitespace that is followed by a digit and a decimal point (e.g. " 2.")
df[['Substance Name', 'sep', 'missing Hcp']] = df['Substance'].str.split(r'(\s+\d\.)', n=1, expand=True)

# the sep column contains the digit that we split at so we want to add that back to its value
df["missing Hcp"] = df["sep"] + df["missing Hcp"]
df.drop("sep", inplace=True, axis=1)
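
To see what the capture group in the pattern does, the same split can be reproduced on a single string with the standard-library `re` module. This is a minimal sketch rather than the notebook's own code: `split_name_and_value` is a hypothetical helper, and stripping the leading whitespace from the recovered value is an added assumption.

```python
import re

# Same pattern as in the cell above: whitespace, one digit, a decimal point.
PATTERN = re.compile(r'(\s+\d\.)')

def split_name_and_value(cell):
    """Split a cell into (name, value); value is None if nothing was fused in."""
    parts = PATTERN.split(cell, maxsplit=1)
    if len(parts) == 1:
        # No match: the cell holds only a name, formula or CAS number.
        return cell, None
    # Because the separator is captured, it comes back as the middle piece.
    name, sep, rest = parts
    return name, (sep + rest).strip()

print(split_name_and_value('sulfur hexafluoride 2.4 10−6×'))
print(split_name_and_value('oxygen'))
```

Because the separator is returned as its own piece, it has to be glued back onto the remainder, which is what the `sep` column is used for above.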

In [84]:
# here we can see the value has successfully been split up
df.iloc[395]

Substance         sulfur hexafluoride 2.4 10−6×
Hcp                                        3100
Substance Name              sulfur hexafluoride
missing Hcp                           2.4 10−6×
Name: 801, dtype: object

In [85]:
df

Unnamed: 0,Substance,Hcp,Substance Name,missing Hcp
6,oxygen,1.2×10−5,oxygen,
7,O2,1.3×10−5,O2,
8,[7782-44-7],1.3×10−5,[7782-44-7],
20,ozone,1.0×10−4,ozone,
21,O 3,1.0 ×10 −4,O 3,
...,...,...,...,...
52022,(methyltriethyl lead),,(methyltriethyl lead),
52023,[1762-28-3],,[1762-28-3],
52024,tetraethyllead,1.3 ×10 −5,tetraethyllead,
52025,C8H20Pb,1.3 ×10−5,C8H20Pb,


In [86]:
# replace Substance values with corrected Substance Name values
df['Substance'] = df['Substance Name']

In [87]:
# example below
df.iloc[395]

Substance         sulfur hexafluoride
Hcp                              3100
Substance Name    sulfur hexafluoride
missing Hcp                 2.4 10−6×
Name: 801, dtype: object

In [88]:
# fill the gaps in the missing Hcp column with the existing Hcp values - the missing Hcp column now contains all the correct Hcp values
df["missing Hcp"] = df["missing Hcp"].fillna(df["Hcp"])

In [89]:
df.iloc[395]

Substance         sulfur hexafluoride
Hcp                              3100
Substance Name    sulfur hexafluoride
missing Hcp                 2.4 10−6×
Name: 801, dtype: object

In [90]:
df

Unnamed: 0,Substance,Hcp,Substance Name,missing Hcp
6,oxygen,1.2×10−5,oxygen,1.2×10−5
7,O2,1.3×10−5,O2,1.3×10−5
8,[7782-44-7],1.3×10−5,[7782-44-7],1.3×10−5
20,ozone,1.0×10−4,ozone,1.0×10−4
21,O 3,1.0 ×10 −4,O 3,1.0 ×10 −4
...,...,...,...,...
52022,(methyltriethyl lead),,(methyltriethyl lead),
52023,[1762-28-3],,[1762-28-3],
52024,tetraethyllead,1.3 ×10 −5,tetraethyllead,1.3 ×10 −5
52025,C8H20Pb,1.3 ×10−5,C8H20Pb,1.3 ×10−5


In [91]:
# now we can get rid of the Hcp and Substance Name columns
df = df.drop(df.columns[[1, 2]], axis=1)

# and also rename missing Hcp to Hcp
df.rename(columns={'missing Hcp': 'Hcp'}, inplace=True)

In [92]:
df.iloc[3946]

Substance    naphthacene
Hcp             3.6 102×
Name: 7260, dtype: object

In [93]:
# now we can remove any null values in Hcp column
df.dropna(subset=['Hcp'], inplace=True)

In [94]:
# we now need to get rid of rows whose Substance value is a chemical formula or a CAS number, as these are duplicates of the named rows
# the IUPAC names all start with a lowercase letter or a digit, so we can filter on that (formulas start with an uppercase letter and CAS numbers with a square bracket)
df_filtered = df[df.Substance.str.contains(r'^[0-9a-z]')]

# stereo descriptors such as E, Z, S and R are written in brackets, e.g. (E), so these need to be accounted for as well
df_stereo = df[df.Substance.str.contains(r'^\([^a-z]+\)')]

# some names start with a bracket followed by a digit or lowercase letter - how do we tell these apart from the 'Other names'?
# the IUPAC names have a hyphen directly after the closing bracket
df_brackets = df[df.Substance.str.contains(r'^\(\S+\)\-')]

# combine the three filtered dataframes
df_clean = pd.concat([df_filtered, df_stereo, df_brackets])
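
The three filters can be checked on individual strings with the standard-library `re` module. The sketch below is illustrative only: `PATTERNS` and `matching_filters` are hypothetical names, and the sample strings are taken from the outputs shown in this notebook.

```python
import re

# The three patterns from the cell above, keyed by the dataframe they build.
PATTERNS = {
    'df_filtered': re.compile(r'^[0-9a-z]'),
    'df_stereo': re.compile(r'^\([^a-z]+\)'),
    'df_brackets': re.compile(r'^\(\S+\)\-'),
}

def matching_filters(substance):
    """Return the names of the filters that would keep this Substance value."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(substance)]

for sample in ['oxygen', 'O2', '[7782-44-7]', '(E)-1,3-pentadiene',
               '(2-bromoethyl)-benzene', '(methyltriethyl lead)']:
    print(sample, matching_filters(sample))
```

A name such as `(E)-1,3-pentadiene` satisfies two of the filters, so it lands in the concatenated dataframe twice; the later `drop_duplicates` call removes the repeat.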

In [95]:
df_brackets

Unnamed: 0,Substance,Hcp
3270,(Z)-bicyclo[4.4.0]decane,4.3×10−4
3275,(E)-bicyclo[4.4.0]decane,2.7 10−4×
3316,(Z)-bicyclo[4.4.0]decane,4.3×10−4
3321,(E)-bicyclo[4.4.0]decane,2.7 10−4×
3847,"(E)-1,3-pentadiene",8.2×10−5
...,...,...
44608,(2-bromoethyl)-benzene,6.5 10−3×
45669,"(2E)-N,N’-bis(2,4,6-tribromophenyl)-",9.0 109×
45717,"(2E)-N,N’-bis(2,4,6-tribromophenyl)-",9.0 109×
50409,(2-chloroethyl)-phosphonic acid,6.9 107×


In [96]:
# some rows like this have gone through which must be cleaned up
df_clean.iloc[154]

Substance    boric acid
Hcp             3.8×106
Name: 879, dtype: object

In [97]:
# keep only Hcp values that start with a decimal number (optional digits followed by a decimal point); anything else is not a valid constant
df_clean = df_clean[df_clean.Hcp.str.contains(r'^\d*\.')]
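
As a quick check of what this pattern accepts, the same test can be run on single strings. This is a small sketch with a hypothetical helper name, `starts_like_decimal`, using sample values that appear elsewhere in this notebook.

```python
import re

def starts_like_decimal(value):
    """True if the string begins with optional digits followed by a decimal point."""
    return re.search(r'^\d*\.', value) is not None

for sample in ['1.2×10−5', '2.7 10−4×', '3100', 'Inorganic']:
    print(sample, starts_like_decimal(sample))
```

Note that the pattern only looks at the start of the string, so values with a misplaced × still pass and have to be repaired later.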

In [98]:
df_clean

Unnamed: 0,Substance,Hcp
6,oxygen,1.2×10−5
20,ozone,1.0×10−4
34,hydrogen atom,2.6×10−6
37,hydrogen,7.8×10−6
44,deuterium,7.9 10−6×
...,...,...
44608,(2-bromoethyl)-benzene,6.5 10−3×
45669,"(2E)-N,N’-bis(2,4,6-tribromophenyl)-",9.0 109×
45717,"(2E)-N,N’-bis(2,4,6-tribromophenyl)-",9.0 109×
50409,(2-chloroethyl)-phosphonic acid,6.9 107×


In [99]:
df_clean.Substance.duplicated().sum()

3587

In [100]:
# removing duplicates
df_clean = df_clean.drop_duplicates(subset=['Substance'])
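
By default `drop_duplicates` keeps the first row it meets for each Substance (`keep='first'`). The plain-Python sketch below mirrors that behaviour on a list of `(substance, Hcp)` pairs; `keep_first_per_substance` is a hypothetical helper and the rows are a toy sample. Whether the first value listed is also the most reliable one depends on how the source table orders its entries, which is not checked here.

```python
def keep_first_per_substance(rows):
    """Keep only the first (substance, hcp) pair seen for each substance."""
    first_seen = {}
    for substance, hcp in rows:
        # setdefault stores the value only if the substance is new.
        first_seen.setdefault(substance, hcp)
    return list(first_seen.items())

rows = [
    ('oxygen', '1.2×10−5'),
    ('oxygen', '1.3×10−5'),
    ('ozone', '1.0×10−4'),
]
print(keep_first_per_substance(rows))
```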

In [101]:
# reset index
df_clean.reset_index(drop=True, inplace=True)
df_clean

Unnamed: 0,Substance,Hcp
0,oxygen,1.2×10−5
1,ozone,1.0×10−4
2,hydrogen atom,2.6×10−6
3,hydrogen,7.8×10−6
4,deuterium,7.9 10−6×
...,...,...
3481,(bis-(2-chloroethyl)-ether),3.4 10−2×
3482,"(2,4-dichlorophenoxy)-ethanoic acid",1.4 ×10−1
3483,"((2,4-dichlorophenoxy)-acetic acid;",5.0 ×104
3484,(2-bromoethyl)-benzene,6.5 10−3×


We have 3485 species, so we are short of the 4632 target by 1147 :(

In [102]:
# at the moment the Hcp values are a bit of a mess - we need to standardise them into a single format and then convert them into floats

# there are a few different situations to deal with - the first is the × being in the wrong position
def correct_x_position(Hcp_value):
    # values of three characters or fewer are not in standard form (e.g. 4.4), and values
    # with × already at index 3 are correct, so both are returned unchanged
    if len(Hcp_value) <= 3 or Hcp_value[3] == '×':
        return Hcp_value

    # remove the misplaced × and any whitespace, then put × back after the three-character mantissa
    stripped = Hcp_value.replace('×', '').replace(' ', '')
    return stripped[0:3] + '×' + stripped[3:]

# take an explicit copy so that pandas does not raise SettingWithCopyWarning on assignment
df_clean = df_clean.copy()
df_clean['Hcp'] = df_clean['Hcp'].apply(correct_x_position)


In [103]:
df_clean.isna().sum()

Substance    0
Hcp          0
dtype: int64

In [104]:
df_clean = df_clean.dropna(subset=['Hcp'])

In [105]:
df_clean

Unnamed: 0,Substance,Hcp
0,oxygen,1.2×10−5
1,ozone,1.0×10−4
2,hydrogen atom,2.6×10−6
3,hydrogen,7.8×10−6
4,deuterium,7.9×10−6
...,...,...
3481,(bis-(2-chloroethyl)-ether),3.4×10−2
3482,"(2,4-dichlorophenoxy)-ethanoic acid",1.4×10−1
3483,"((2,4-dichlorophenoxy)-acetic acid;",5.0×104
3484,(2-bromoethyl)-benzene,6.5×10−3


In [106]:
# next we rewrite the strings in a scientific notation that python can parse (e.g. 1.2e-5), ready for conversion to floats
def standard_form(Hcp_value):
    # six characters or fewer means there is no exponent after ×10, so leave the value alone
    if len(Hcp_value) <= 6:
        return Hcp_value

    if Hcp_value[6] == '−':
        # replace the typographic minus sign (−) with an ASCII hyphen-minus (-)
        Hcp_value = Hcp_value.replace('−', '-')
    else:
        # insert an explicit + so the exponent is marked, i.e. 10^4 rather than 104
        Hcp_value = Hcp_value[0:6] + '+' + Hcp_value[6:]

    # ×10 becomes e, giving the scientific format that python understands
    return Hcp_value.replace('×10', 'e')

df_clean['Hcp'] = df_clean['Hcp'].apply(standard_form)
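
The two position-based helpers rely on the mantissa being exactly three characters long. As a possible alternative, the sketch below parses a raw Hcp string in one step with regular expressions. It is not the method used in this notebook: `parse_hcp` and the three patterns are hypothetical, and when the × is misplaced the code assumes the shortest possible mantissa, since the original string is ambiguous in that case.

```python
import re

# Hypothetical patterns, applied after removing spaces and normalising the minus sign.
_WELL_PLACED = re.compile(r'^(\d+\.\d+)×10(-?\d+)$')    # e.g. 1.2×10-5
_TRAILING_X = re.compile(r'^(\d+\.\d+?)10(-?\d+)×$')    # e.g. 2.710-4× (× at the end)
_PLAIN = re.compile(r'^\d+(?:\.\d+)?$')                 # e.g. 4.4

def parse_hcp(text):
    """Return the value as a float, or None if the string is not recognised."""
    cleaned = text.replace(' ', '').replace('−', '-')
    for pattern in (_WELL_PLACED, _TRAILING_X):
        match = pattern.match(cleaned)
        if match:
            mantissa, exponent = match.groups()
            return float(f'{mantissa}e{exponent}')
    if _PLAIN.match(cleaned):
        return float(cleaned)
    return None

for sample in ['1.2×10−5', '1.3 ×10 −5', '2.7 10−4×', '3.8×106', '4.4', 'Inorganic']:
    print(repr(sample), parse_hcp(sample))
```

Returning `None` for unrecognised strings keeps the failures visible, so they can be counted and inspected rather than silently dropped.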

In [107]:
df_clean

Unnamed: 0,Substance,Hcp
0,oxygen,1.2e-5
1,ozone,1.0e-4
2,hydrogen atom,2.6e-6
3,hydrogen,7.8e-6
4,deuterium,7.9e-6
...,...,...
3481,(bis-(2-chloroethyl)-ether),3.4e-2
3482,"(2,4-dichlorophenoxy)-ethanoic acid",1.4e-1
3483,"((2,4-dichlorophenoxy)-acetic acid;",5.0e+4
3484,(2-bromoethyl)-benzene,6.5e-3


In [108]:
df_clean.loc[df_clean.Substance == 'dacthal']

Unnamed: 0,Substance,Hcp
2134,dacthal,4.4


In [109]:
# converting string values to float; anything that cannot be parsed becomes NaN
df_clean['Hcp'] = pd.to_numeric(df_clean['Hcp'], errors='coerce')

In [110]:
df_clean

Unnamed: 0,Substance,Hcp
0,oxygen,1.200000e-05
1,ozone,1.000000e-04
2,hydrogen atom,2.600000e-06
3,hydrogen,7.800000e-06
4,deuterium,7.900000e-06
...,...,...
3481,(bis-(2-chloroethyl)-ether),3.400000e-02
3482,"(2,4-dichlorophenoxy)-ethanoic acid",1.400000e-01
3483,"((2,4-dichlorophenoxy)-acetic acid;",5.000000e+04
3484,(2-bromoethyl)-benzene,6.500000e-03


In [111]:
df_clean.isna().sum()

Substance      0
Hcp          121
dtype: int64
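
The Hcp column had no missing values before the numeric conversion, so these 121 NaNs are strings that could not be parsed and were coerced. One way to see which strings are responsible is to test them before converting. The sketch below uses Python's built-in `float`, which does not accept exactly the same inputs as `pd.to_numeric`, so treat it as an approximation; `find_unparsed` is a hypothetical helper and the sample values are made up for illustration.

```python
def find_unparsed(values):
    """Return the strings that float() cannot convert."""
    unparsed = []
    for value in values:
        try:
            float(value)
        except ValueError:
            unparsed.append(value)
    return unparsed

# Hypothetical sample: the third value still contains × and would be coerced to NaN.
print(find_unparsed(['1.2e-5', '4.4', '9.0×109', '5.0e+4']))
```

Running a check like this on the string column, before the conversion, would show which formats the cleaning functions still miss.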

In [112]:
# reset index
df_clean.reset_index(drop=True, inplace=True)
df_clean

Unnamed: 0,Substance,Hcp
0,oxygen,1.200000e-05
1,ozone,1.000000e-04
2,hydrogen atom,2.600000e-06
3,hydrogen,7.800000e-06
4,deuterium,7.900000e-06
...,...,...
3481,(bis-(2-chloroethyl)-ether),3.400000e-02
3482,"(2,4-dichlorophenoxy)-ethanoic acid",1.400000e-01
3483,"((2,4-dichlorophenoxy)-acetic acid;",5.000000e+04
3484,(2-bromoethyl)-benzene,6.500000e-03


In [113]:
df_clean.loc[df_clean.Substance == 'aniline,4,4\'-(imidocarbonyl)bis-(N,N-']

Unnamed: 0,Substance,Hcp


In [114]:
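# format the floats compactly with the 'g' format; note that this turns the Hcp column back into strings
df_clean.Hcp = df_clean.Hcp.map('{:g}'.format)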

In [115]:
df_clean

Unnamed: 0,Substance,Hcp
0,oxygen,1.2e-05
1,ozone,0.0001
2,hydrogen atom,2.6e-06
3,hydrogen,7.8e-06
4,deuterium,7.9e-06
...,...,...
3481,(bis-(2-chloroethyl)-ether),0.034
3482,"(2,4-dichlorophenoxy)-ethanoic acid",0.14
3483,"((2,4-dichlorophenoxy)-acetic acid;",50000
3484,(2-bromoethyl)-benzene,0.0065
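
The `'{:g}'` format explains the mixed look of the column above: it switches between fixed and scientific notation depending on the exponent, keeps at most six significant digits by default, and renders missing values as the string `'nan'`. A short sketch with a hypothetical wrapper, `format_hcp`:

```python
def format_hcp(value):
    """Format a float the same way as the cell above."""
    return '{:g}'.format(value)

for sample in [1.2e-05, 0.0001, 50000.0, 1.234567e-05, float('nan')]:
    print(repr(format_hcp(sample)))
```

If the values are needed for further calculation, it may be safer to keep a float column and apply this formatting only for display.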
