# Curating Henry's Law Dataset

The Henry's Law constants were extracted from a PDF (https://acp.copernicus.org/articles/15/4399/2015/acp-15-4399-2015.pdf) using Tabula (https://tabula.technology/). Tabula made a few extraction errors that need correcting, and only the most reliable constant for each species should be kept. The end result should contain 4632 rows, one for each of the 4632 unique species.
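As a rough way to track progress towards that target, a small helper can report the row count, the number of unique species, and the shortfall. This is only a sketch: `summarise_species` and `EXPECTED_SPECIES` are hypothetical names introduced here, and the example frame is illustrative rather than real data.

```python
import pandas as pd

EXPECTED_SPECIES = 4632  # target number of unique species stated above


def summarise_species(frame: pd.DataFrame, name_col: str = "Substance") -> dict:
    """Return the row count, unique-species count and shortfall against the target."""
    n_unique = frame[name_col].nunique()
    return {
        "rows": len(frame),
        "unique": n_unique,
        "missing": EXPECTED_SPECIES - n_unique,
    }


# illustrative frame with one duplicated name
example = pd.DataFrame({"Substance": ["oxygen", "ozone", "ozone"]})
summary = summarise_species(example)
```

Calling the helper after each cleaning step shows at which step species are lost.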

In [1866]:
import pandas as pd
import numpy as np

In [1867]:
# loading Henry's Law constants dataset as a dataframe
df = pd.read_csv('henrys_law_constants.csv')

In [1868]:
df.head()

Unnamed: 0.1,Unnamed: 0,cp,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5
0,Substance,H,,,,
1,Formula,(at T ),,,,
2,(Other name(s)),[ mol],,,,
3,[CAS registry number],m 3 Pa,,,,
4,,Inorganic,,,,


In [1869]:
# delete empty columns on the right
df = df.drop(df.columns[[2, 3, 4, 5]], axis=1)

In [1870]:
df

Unnamed: 0.1,Unnamed: 0,cp
0,Substance,H
1,Formula,(at T )
2,(Other name(s)),[ mol]
3,[CAS registry number],m 3 Pa
4,,Inorganic
...,...,...
55352,U.: Assessment of chemical screening outcomes ...,
55353,"ent partitioning property estimation methods, ...",
55354,"514–520, 2010.",
55355,"Zhang, Z. and Pawliszyn, J.: Headspace solid-p...",


In [1871]:
# deleting rows which don't contain Henry's law data
df = df.drop(labels=range(52027, 55357), axis=0)   # these are the references at the end of the paper

In [1872]:
# these are the repeated table-heading rows, which we don't need
heading_labels = ['Substance', 'Formula', '(Other name(s))', '[CAS registry number]']
df = df.loc[~df['Unnamed: 0'].isin(heading_labels)]

In [1873]:
# renaming column names
df.rename(columns={'Unnamed: 0': 'Substance', 'cp': 'Hcp'}, inplace=True)

In [1874]:
df

Unnamed: 0,Substance,Hcp
4,,Inorganic
5,,
6,oxygen,1.2×10−5
7,O2,1.3×10−5
8,[7782-44-7],1.3×10−5
...,...,...
52022,(methyltriethyl lead),
52023,[1762-28-3],
52024,tetraethyllead,1.3 ×10 −5
52025,C8H20Pb,1.3 ×10−5


In [1875]:
df.isna().sum()

Substance    13320
Hcp          23303
dtype: int64

Rows with a null Substance carry no new species: they only hold additional constants for the substance listed above them. However, the csv file shows that in some rows the constant ended up in the same cell as the substance name, so those cells need to be split.

In [1876]:
# drop the rows where Substance is null
df.dropna(subset=['Substance'], inplace=True)

In [1877]:
# as an example here we can see that we need to split the substance name and Hcp value
df.iloc[395]

Substance    sulfur hexafluoride 2.4 10−6×
Hcp                                   3100
Name: 801, dtype: object

In [1878]:
# split at the first whitespace that is followed by a digit
# (a raw string avoids invalid-escape warnings for \s and \d)
# note: this also splits spaced formulas such as 'O 3'; formula rows are filtered out later
df[['Substance Name', 'sep', 'missing Hcp']] = df['Substance'].str.split(r'(\s\d)', n=1, expand=True)

# the sep column holds the whitespace and digit we split at, so we add it back to the front of the value
df["missing Hcp"] = df["sep"] + df["missing Hcp"]
df.drop("sep", inplace=True, axis=1)

In [1879]:
# here we can see the value has successfully been split up
df.iloc[395]

Substance         sulfur hexafluoride 2.4 10−6×
Hcp                                        3100
Substance Name              sulfur hexafluoride
missing Hcp                           2.4 10−6×
Name: 801, dtype: object

In [1880]:
df

Unnamed: 0,Substance,Hcp,Substance Name,missing Hcp
6,oxygen,1.2×10−5,oxygen,
7,O2,1.3×10−5,O2,
8,[7782-44-7],1.3×10−5,[7782-44-7],
20,ozone,1.0×10−4,ozone,
21,O 3,1.0 ×10 −4,O,3
...,...,...,...,...
52022,(methyltriethyl lead),,(methyltriethyl lead),
52023,[1762-28-3],,[1762-28-3],
52024,tetraethyllead,1.3 ×10 −5,tetraethyllead,
52025,C8H20Pb,1.3 ×10−5,C8H20Pb,
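The `O 3` row above shows a side effect of splitting at any whitespace followed by a digit: a formula containing a space is split as well. A stricter pattern is sketched below. It assumes that a fused constant always starts with digits, a decimal point and another digit (as in `2.4 10−6×`), which has not been checked against the whole dataset. The series `names` and the frame `parts` are illustrative and not part of the notebook.

```python
import pandas as pd

names = pd.Series(["sulfur hexafluoride 2.4 10−6×", "O 3", "oxygen"])

# Split only where the whitespace is followed by something shaped like "2.4".
# The lookahead (?=...) checks for the number without consuming it,
# so the value stays intact on the right-hand side of the split.
parts = names.str.split(r"\s(?=\d+\.\d)", n=1, expand=True)
parts.columns = ["Substance Name", "missing Hcp"]
```

With this pattern `O 3` stays in one piece and only the first row gets a value in `missing Hcp`.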


In [1881]:
# replace Substance values with corrected Substance Name values
df['Substance'] = df['Substance Name']

In [1882]:
# example below
df.iloc[395]

Substance         sulfur hexafluoride
Hcp                              3100
Substance Name    sulfur hexafluoride
missing Hcp                 2.4 10−6×
Name: 801, dtype: object

In [1883]:
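# fill the gaps in "missing Hcp" with the existing Hcp values, so that "missing Hcp" holds the correct value for every row
# (assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about)
df["missing Hcp"] = df["missing Hcp"].fillna(df["Hcp"])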

In [1884]:
df.iloc[395]

Substance         sulfur hexafluoride
Hcp                              3100
Substance Name    sulfur hexafluoride
missing Hcp                 2.4 10−6×
Name: 801, dtype: object

In [1885]:
df

Unnamed: 0,Substance,Hcp,Substance Name,missing Hcp
6,oxygen,1.2×10−5,oxygen,1.2×10−5
7,O2,1.3×10−5,O2,1.3×10−5
8,[7782-44-7],1.3×10−5,[7782-44-7],1.3×10−5
20,ozone,1.0×10−4,ozone,1.0×10−4
21,O,1.0 ×10 −4,O,3
...,...,...,...,...
52022,(methyltriethyl lead),,(methyltriethyl lead),
52023,[1762-28-3],,[1762-28-3],
52024,tetraethyllead,1.3 ×10 −5,tetraethyllead,1.3 ×10 −5
52025,C8H20Pb,1.3 ×10−5,C8H20Pb,1.3 ×10−5


In [1886]:
# now we can get rid of the Hcp and Substance Name columns
df = df.drop(df.columns[[1, 2]], axis=1)

# and also rename missing Hcp to Hcp
df.rename(columns={'missing Hcp': 'Hcp'}, inplace=True)

In [1887]:
df.iloc[3946]

Substance    naphthacene
Hcp             3.6 102×
Name: 7260, dtype: object

In [1888]:
# now we can remove any null values in Hcp column
df.dropna(subset=['Hcp'], inplace=True)

In [1889]:
# drop the rows whose Substance value is a chemical formula or a CAS number, as these duplicate the name row
# the substance names start with a lowercase letter or a digit, whereas formulas start with an uppercase letter
# and CAS numbers with a square bracket, so we keep only values starting with a lowercase letter or a digit
df = df[df.Substance.str.contains('^[0-9a-z]')]

In [1890]:
df

Unnamed: 0,Substance,Hcp
6,oxygen,1.2×10−5
20,ozone,1.0×10−4
34,hydrogen atom,2.6×10−6
37,hydrogen,7.8×10−6
44,deuterium,7.9 10−6×
...,...,...
52010,tetramethyl lead,1.6×10−5
52013,ethyltrimethylplumbane,2.8×10−5
52016,diethyldimethylplumbane,2.1 10−5×
52020,triethylmethylplumbane,1.6 10−5×


In [1891]:
# some malformed rows like this one have slipped through the filter and must be cleaned up
df.iloc[154]

Substance    1.4×10−5
Hcp              1500
Name: 840, dtype: object

In [1892]:
# keep only Hcp values that start like a number in standard form (optional digits followed by a decimal point);
# anything else, such as the 1500 above, is not a valid constant
df = df[df.Hcp.str.contains(r'^\d*\.')]

In [1893]:
df

Unnamed: 0,Substance,Hcp
6,oxygen,1.2×10−5
20,ozone,1.0×10−4
34,hydrogen atom,2.6×10−6
37,hydrogen,7.8×10−6
44,deuterium,7.9 10−6×
...,...,...
52010,tetramethyl lead,1.6×10−5
52013,ethyltrimethylplumbane,2.8×10−5
52016,diethyldimethylplumbane,2.1 10−5×
52020,triethylmethylplumbane,1.6 10−5×


In [1894]:
# removing duplicates
df = df.drop_duplicates(subset=['Substance'])

After removing duplicates we have 3160 unique species, which is 1472 short of the expected 4632 :(
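One place to look for the missing species is the leading-character filter, since it rejects every name that does not start with a lowercase letter or a digit. The sketch below counts the rejected values by their first character. The `sample` series is illustrative, and `N-methylformamide` is a hypothetical example of a name starting with an uppercase locant; whether such names account for the shortfall is an assumption that would need checking against the real data.

```python
import pandas as pd

# illustrative values, including one hypothetical name with an uppercase locant
sample = pd.Series(
    ["oxygen", "O2", "[7782-44-7]", "(methyltriethyl lead)", "N-methylformamide"],
    name="Substance",
)

# same pattern as the name filter used earlier in the notebook
keep = sample.str.contains(r"^[0-9a-z]")

# how many rejected values start with each character
rejected_by_first_char = sample[~keep].str[0].value_counts()
```

Running the same count on the frame just before the name filter would show whether a large group of rejected rows starts with a character other than `[`, `(` or a formula's element symbol.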

In [1895]:
# reset index
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,Substance,Hcp
0,oxygen,1.2×10−5
1,ozone,1.0×10−4
2,hydrogen atom,2.6×10−6
3,hydrogen,7.8×10−6
4,deuterium,7.9 10−6×
...,...,...
3155,tetramethyl lead,1.6×10−5
3156,ethyltrimethylplumbane,2.8×10−5
3157,diethyldimethylplumbane,2.1 10−5×
3158,triethylmethylplumbane,1.6 10−5×


In [1896]:
# at the moment the Hcp values are a bit of a mess - we need to standardise them into a single format and then convert them into floats