# Scraping Henry's Law Constants from the PDF dataset

Henry's law constants were scraped from a PDF (https://acp.copernicus.org/articles/15/4399/2015/acp-15-4399-2015.pdf) using Tabula (https://tabula.technology/). The extracted dataset needs cleaning: Tabula made a few extraction errors that must be corrected, and only the most reliable value for each species should be kept. The end result should contain 4632 rows, one for each of 4632 unique species.

In [1]:
import pandas as pd
import tabula as tb
import re

In [2]:
file = 'henrys_law_constants.pdf'

In [3]:
# Loop over the specified pages; Tabula scrapes the table on each page
# and the resulting dataframes are collected in a list.

dfs = []

for i in range(10, 553):
    df = tb.read_pdf(
        file, pages=str(i), multiple_tables=True,
        area=(145.45, 60, 728.64, 543.05), columns=[60, 216, 270],
        pandas_options={'header': None}, stream=True, silent=True,
    )[0]
    dfs.append(df)

# Concatenate the per-page dataframes into a single pandas dataframe
henry_df = pd.concat(dfs)

In [4]:
henry_df

Unnamed: 0,0,1,2,3
0,,,Inor,ganic species
1,,,O,xygen (O)
2,,oxygen,1.2×10−5,1700 Warneck and Williams (2012) L
3,,O2,1.3 ×10−5,1500 Sander et al. (2011) L
4,,[7782-44-7],1.3×10−5,1500 Sander et al. (2006) L
...,...,...,...,...
16,,(methyltriethyl lead),,
17,,[1762-28-3],,
18,,tetraethyllead,1.3×10−5,6400 Feldhake and Stevens (1963) M
19,,C8H20Pb,1.3×10−5,Abraham (1979) ?
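
The output above shows that the constants arrive as strings such as `1.3×10−5` (sometimes with a stray space, and with a Unicode times sign and minus sign), and that the last column mixes an optional leading number, a reference, and a trailing one-character code. The following is only a minimal sketch of how such cells might be parsed during cleanup. The helper names `parse_constant` and `split_reference` are hypothetical, and the assumed layout of the last column is inferred from the few rows shown here, not from the full dataset.

```python
import re

# Matches "1.3×10−5" or "1.3 ×10−5"; the exponent may use the Unicode
# minus sign (U+2212) or an ASCII hyphen.
_SCI = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*×\s*10\s*([−-]?\d+)\s*$")

# Assumed layout: optional leading integer, reference text, and a
# single trailing character (e.g. "L", "M", "?").
_REF = re.compile(r"^\s*(?:(\d+)\s+)?(.+?)\s+(\S)\s*$")


def parse_constant(text):
    """Return the float value of a string like '1.3 ×10−5', else None."""
    if not isinstance(text, str):
        return None  # e.g. NaN cells left by Tabula
    match = _SCI.match(text)
    if match is None:
        return None
    mantissa, exponent = match.groups()
    return float(f"{mantissa}e{exponent.replace('−', '-')}")


def split_reference(text):
    """Split '1500 Sander et al. (2011) L' into (1500, reference, 'L')."""
    if not isinstance(text, str):
        return None
    match = _REF.match(text)
    if match is None:
        return None
    number, reference, code = match.groups()
    return (int(number) if number else None, reference, code)
```

Cells that do not fit either pattern, such as the split heading fragments `Inor` and `ganic species`, return `None`, which makes it easy to separate heading rows from data rows before further cleaning.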


In [12]:
henry_df.to_csv('henrys_law_dataset.csv', index=False)
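
As a final sanity check against the target stated in the introduction (4632 rows with 4632 unique species), something like the sketch below could be run on the cleaned data. The helper `check_species` is hypothetical, and it assumes the species identifiers are available as an iterable of strings, for example a single dataframe column. The notebook does not show which column that is.

```python
def check_species(species, expected=4632):
    """Return True if there are `expected` rows and all of them are unique."""
    species = list(species)           # works for a list or a pandas Series
    n_rows = len(species)
    n_unique = len(set(species))
    return n_rows == expected and n_unique == expected
```

For example, `check_species(cleaned_df['species'])` would return `False` if duplicates or leftover heading rows remained, where `cleaned_df` and `'species'` are placeholder names.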