## Table and Graph Extraction

This notebook demonstrates how to extract tables and graphs from scientific literature so that their contents can be parsed.

### Tables

This notebook focuses on the extraction of tables and graphs. To prove the idea, a set of test articles is used as examples.

In ChemDataExtractor, test_table.py only shows how to parse manually constructed tables. Our collected literature varies in format and includes HTML, XML, and PDF files.

The Paragraph function splits a table into unreadable fragments, so we need to extract the tables first and then parse their content.

### Graphs 

Graphs in HTML and XML files are hard to extract because of how they are stored. Graphs in PDF files can be mined with the Python library pdfminer.six.

In [76]:
import logging
import re
import pandas as pd
import urllib
import time
import pdfminer
import chemdataextractor as cde
from chemdataextractor import Document
import chemdataextractor.model as model
from chemdataextractor.model import Compound, UvvisSpectrum, UvvisPeak, BaseModel, StringType, ListType, ModelType
from chemdataextractor.parse.common import hyphen
from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first
from chemdataextractor.parse.actions import strip_stop
from chemdataextractor.parse.elements import W, I, T, R, Optional, ZeroOrMore, OneOrMore
from chemdataextractor.parse.cem import chemical_name
from chemdataextractor.doc import Paragraph, Sentence, Caption, Figure,Table
from chemdataextractor.doc.table import Table, Cell
from chemdataextractor.reader import PdfReader, HtmlReader, XmlReader, PlainTextReader

In [2]:
# open and read files
f = open('test_articles/paper0.pdf', 'rb')
doc = Document.from_file(f)
abstract = [11]

f1 = open('test_articles/paper1.pdf', 'rb')
doc1 = Document.from_file(f1)
abstract1 = [7,8]

f2 = open('test_articles/paper2.pdf', 'rb')
doc2 = Document.from_file(f2)
abstract2 = [7,8]

f3 = open('test_articles/paper3.pdf', 'rb')
doc3 = Document.from_file(f3)
abstract3 = [10]

f4 = open('test_articles/paper4.pdf', 'rb')
doc4 = Document.from_file(f4)
abstract4 = [12]

f5 = open('test_articles/paper5.pdf', 'rb')
doc5 = Document.from_file(f5)
abstract5 = [3,4]

f6 = open('test_articles/paper6.pdf', 'rb')
doc6 = Document.from_file(f6)
abstract6 = [5,6,7,8]

f7 = open('test_articles/paper7.pdf', 'rb')
doc7 = Document.from_file(f7)
abstract7 = [11]

In [4]:
# split the paragraph into elements
paras = doc.elements
cems = doc.cems
doc.records.serialize()

[{'names': ['Isoindigo-']},
 {'names': ['bislactam']},
 {'names': ['hydrogens']},
 {'names': ['phenyl']},
 {'names': ['oxygens']},
 {'names': ['oxindoles']},
 {'names': ['triphenylamine']},
 {'names': ['phenyl- carbazole']},
 {'names': ['D − A']},
 {'names': ['4,8-bis(5-(2-ethylhexyl)- thiophen-2-yl)benzo[1,2-b:4,5-b′]dithiophene ( 2D-BDT )']},
 {'names': ['2D-BDT-containing D − π − A']},
 {'names': ['Mn [ kg mol−1 ]']},
 {'names': ['411']},
 {'names': ['34']},
 {'names': ['76']},
 {'names': ['long, branched 2- octyldodecyl alkyl']},
 {'names': ['tris(dibenzylideneacetone)dipalladium']},
 {'names': ['Mn']},
 {'names': ['benzene']},
 {'names': ['ferrocene']},
 {'names': ['−[Eonset ferrocene + 4.8 ] V']},
 {'names': ['methyl substituted alkyl chains']},
 {'names': ['methyl']},
 {'names': ['I ds']},
 {'names': ['WC o L 2']},
 {'names': ['V t']},
 {'names': ['PBDT-TIIG-']},
 {'names': ['P3HT']},
 {'names': ['PCBM']},
 {'names': ['μ e']},
 {'names': ['f [V]']},
 {'names': ['alkyl']},
 {'nam

In [5]:
paras

[Paragraph(id=None, references=[], text='Article'),
 Paragraph(id=None, references=[], text='pubs.acs.org/cm'),
 Paragraph(id=None, references=[], text='Interplay of Molecular Orientation, Film Formation, and\nOptoelectronic Properties on Isoindigo- and Thienoisoindigo-Based\nCopolymers for Organic Field Eﬀect Transistor and Organic\nPhotovoltaic Applications\nChien Lu,\nand Pi-Tai Chou*,#\n†\nDepartment of Chemical Engineering, National Taiwan University, Taipei 106, Taiwan\n‡\nResearch Center for New Generation Photovoltaics, Graduate Institute of Energy Engineering, National Central University, Taoyuan\n320, Taiwan\n#Department of Chemistry, National Taiwan University, Taipei 106, Taiwan'),
 Paragraph(id=None, references=[], text='Hsieh-Chih Chen,*,‡,§'),
 Paragraph(id=None, references=[], text='Wen-Chang Chen,*,†'),
 Paragraph(id=None, references=[], text='Wei-Ti Chuang,'),
 Paragraph(id=None, references=[], text='Yen-Hao Hsu,'),
 Paragraph(id=None, references=[], text='†,§'),
 Par

## Built-in Test Examples in CDE

In [6]:
t = Table(
            caption=Caption('Selected photophysical properties of biarylsubstituted pyrazoles 5–8 and 1-methyl-3,5-diphenylpyrazole (9) at room temperature'),
            headings=[
                [
                    Cell('Compound'),
                    Cell('Absorption maxima λmax,abs (ε) [nm] (L cm−1 mol−1)'),
                    Cell('Emission maxima λmax,em (Φf) [nm] (a.u.)'),
                    Cell('Stokes-shift Δṽ [cm−1]')
                ]
            ],
            rows=[
                [Cell(' 5a '), Cell('273.5 (40 100)'), Cell('357.0 (0.77)'), Cell('9400')],
                [Cell(' 5b '), Cell('268.5 (36 700)'), Cell('359.0 (0.77)'), Cell('8600')],
                [Cell('Coumarin 343'), Cell('263.0 (38 400)'), Cell('344.5 (0.67)'), Cell('9000')],
                [Cell(' 5d '), Cell('281.0 (34 200)'), Cell('351.5 (0.97)'), Cell('7100')],
                [Cell(' 5e '), Cell('285.0 (44 000)'), Cell('382.0 (0.35)'), Cell('8900')],
                [Cell(' 5f '), Cell('289.0 (43 300)'), Cell('363.0 (0.80)'), Cell('7100')],
                [Cell(' 5g '), Cell('285.0 (42 000)'), Cell('343.5 (0.86)'), Cell('6000')],
                [Cell(' 6a '), Cell('283.5 (35 600)'), Cell('344.5 (0.49)'), Cell('6300')],
                [Cell(' 6b '), Cell('267.5 (35 800)'), Cell('338.5 (0.83)'), Cell('7800')],
                [Cell(' 6c '), Cell('286.0 (33 000)'), Cell('347.0 (0.27)'), Cell('6200')],
                [Cell(' 6d '), Cell('306.5 (36 600)'), Cell('384.0 (0.10)'), Cell('6600')],
                [Cell(' 7 '), Cell('288.5 (62 500)'), Cell('367.0 (0.07)'), Cell('7400')],
                [Cell('Compound 8a '), Cell('257.0 (36 300), 293.0 sh (25 000)'), Cell('385.0 (0.41)'), Cell('8200')],
                [Cell(' 8b '), Cell('257.0 (32 000), 296.0 sh (23000)'), Cell('388.0 (0.33)'), Cell('8000')],
                [Cell(' 8c '), Cell('257.0 (27 400), 307.5 (18900)'), Cell('387.0 (0.12)'), Cell('6700')],
                [Cell(' 8d '), Cell('268.5 (29 500)'), Cell('385.0 (0.29)'), Cell('11 300')],
                [Cell('Dye 8e '), Cell('261.5 (39 900), 288.0 sh (29 600), 311.0 sh (20 500)'), Cell('386.5 (0.37)'), Cell('6300')],
                [Cell(' 8f '), Cell('256.5 (27 260), 296.0 (28404)'), Cell('388.5 (0.35)'), Cell('8000')],
                [Cell(' 8g '), Cell('272.5 (39 600)'), Cell('394.0 (0.30)'), Cell('11 300')],
                [Cell(' 8h '), Cell('286.0 (22 900)'), Cell('382.5 (0.33)'), Cell('8800')],
                [Cell(' 9 '), Cell('254.0 (28 800)'), Cell('338.5 (0.40)'), Cell('9800')]]
)

In [7]:
t

Compound,"Absorption maxima λmax,abs (ε) [nm] (L cm−1 mol−1)","Emission maxima λmax,em (Φf) [nm] (a.u.)",Stokes-shift Δṽ [cm−1]
5a,273.5 (40 100),357.0 (0.77),9400
5b,268.5 (36 700),359.0 (0.77),8600
Coumarin 343,263.0 (38 400),344.5 (0.67),9000
5d,281.0 (34 200),351.5 (0.97),7100
5e,285.0 (44 000),382.0 (0.35),8900
5f,289.0 (43 300),363.0 (0.80),7100
5g,285.0 (42 000),343.5 (0.86),6000
6a,283.5 (35 600),344.5 (0.49),6300
6b,267.5 (35 800),338.5 (0.83),7800
6c,286.0 (33 000),347.0 (0.27),6200


In [8]:
[record.serialize() for record in t.records]

[{'labels': ['5a'],
  'uvvis_spectra': [{'peaks': [{'value': '273.5',
      'units': 'nm',
      'extinction': '40100',
      'extinction_units': 'L cm − 1 mol − 1'}]}],
  'quantum_yields': [{'value': '0.77'}]},
 {'labels': ['5b'],
  'uvvis_spectra': [{'peaks': [{'value': '268.5',
      'units': 'nm',
      'extinction': '36700',
      'extinction_units': 'L cm − 1 mol − 1'}]}],
  'quantum_yields': [{'value': '0.77'}]},
 {'names': ['Coumarin 343'],
  'uvvis_spectra': [{'peaks': [{'value': '263.0',
      'units': 'nm',
      'extinction': '38400',
      'extinction_units': 'L cm − 1 mol − 1'}]}],
  'quantum_yields': [{'value': '0.67'}]},
 {'labels': ['5d'],
  'uvvis_spectra': [{'peaks': [{'value': '281.0',
      'units': 'nm',
      'extinction': '34200',
      'extinction_units': 'L cm − 1 mol − 1'}]}],
  'quantum_yields': [{'value': '0.97'}]},
 {'labels': ['5e'],
  'uvvis_spectra': [{'peaks': [{'value': '285.0',
      'units': 'nm',
      'extinction': '44000',
      'extinction_units

In [9]:
# in paper0, the table is shown below
t2 = Table(
    caption = Caption("Table 1. Physicochemical Properties of the Study Polymers"),
    headings=[
                [
                    Cell('Polymer'),
                    Cell('Mn (kg/mol)'),
                    Cell('PDI'),
                    Cell('Tg (C)'),
                    Cell('Td (C)'),
                    Cell('soluton/λmax [nm]'),
                    Cell('film/λmax [nm]'),
                    Cell('HOMO [eV]'),
                    Cell('LUMO [eV]'),
                    Cell('Egec[eV]'),
                    Cell('Egopt[eV]'),
                ]
    ],
    
    rows=[
                [Cell(' PBDT-IIG '), Cell('21'), Cell('2.6'), Cell('56'),Cell('380'),Cell('359, 446, 625'),Cell('367, 456, 631, 678'),Cell('-5.38'),Cell('-5.35'),Cell('2.03'),Cell('1.59')],
                [Cell(' PBDT-TIIG '), Cell('34'), Cell('2.8'), Cell('76'),Cell('411'),Cell('463, 826,854'),Cell('466, 833,856'),Cell('-4.96'),Cell('-3.29'),Cell('1.67'),Cell('1.05')]
    ]

)

In [10]:
t2 # illustration from paper0.pdf

Polymer,Mn (kg/mol),PDI,Tg (C),Td (C),soluton/λmax [nm],film/λmax [nm],HOMO [eV],LUMO [eV],Egec[eV],Egopt[eV]
PBDT-IIG,21,2.6,56,380,"359, 446, 625","367, 456, 631, 678",-5.38,-5.35,2.03,1.59
PBDT-TIIG,34,2.8,76,411,"463, 826,854","466, 833,856",-4.96,-3.29,1.67,1.05


In [11]:
# output result to json format
[record.serialize() for record in t2.records]

[{'names': ['PBDT-IIG'],
  'uvvis_spectra': [{'peaks': [{'value': '359', 'units': 'nm'},
     {'value': '446', 'units': 'nm'},
     {'value': '625', 'units': 'nm'}]},
   {'peaks': [{'value': '367', 'units': 'nm'},
     {'value': '456', 'units': 'nm'},
     {'value': '631', 'units': 'nm'},
     {'value': '678', 'units': 'nm'}]}]},
 {'names': ['PBDT-TIIG'],
  'uvvis_spectra': [{'peaks': [{'value': '463', 'units': 'nm'}]},
   {'peaks': [{'value': '466', 'units': 'nm'}]}]}]
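
The nested records above can be flattened into one row per peak, which is easier to inspect or export. The following is a minimal sketch that assumes the structure shown in the output (`names` or `labels`, `uvvis_spectra`, `peaks`); the function name `flatten_uvvis` is hypothetical and not part of ChemDataExtractor.

```python
def flatten_uvvis(records):
    """Turn serialized records into one flat dict per UV-vis peak."""
    rows = []
    for record in records:
        # A record is identified by 'names' or, failing that, by 'labels'.
        ids = record.get('names') or record.get('labels') or [None]
        for spectrum_index, spectrum in enumerate(record.get('uvvis_spectra', [])):
            for peak in spectrum.get('peaks', []):
                rows.append({
                    'compound': ids[0],
                    'spectrum': spectrum_index,  # position of the spectrum in the record
                    'value': peak.get('value'),
                    'units': peak.get('units'),
                })
    return rows


# Sample copied from the serialized output of t2 (second record).
sample = [{'names': ['PBDT-TIIG'],
           'uvvis_spectra': [{'peaks': [{'value': '463', 'units': 'nm'}]},
                             {'peaks': [{'value': '466', 'units': 'nm'}]}]}]

for row in flatten_uvvis(sample):
    print(row)
```

The resulting list of dicts can be passed directly to `pd.DataFrame(rows)` if a tabular view is preferred.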

In [12]:
t3 = Table(caption = Caption("Table 3. Photovoltaic Parameters of Optimized Solar Cells"),
           headings = [
               [
                   Cell('polymer'),
                   Cell('polymer:PC71BM'),
                   Cell('Voc [V]'),
                   Cell('Jsc [mA cm-2]'),
                   Cell('FF [%]'),
                   Cell('PCE [%]'),
               ]
           ],
            
           rows = [
               [Cell(' PBDT-IIG '), Cell('21'), Cell('2.6'), Cell('56'),Cell('380'),Cell('359, 446, 625'),Cell('367, 456, 631, 678'),Cell('-5.38'),Cell('-5.35'),Cell('2.03'),Cell('1.59')],
               [Cell(' PBDT-TIIG '), Cell('21'), Cell('2.6'), Cell('56'),Cell('380'),Cell('359, 446, 625'),Cell('367, 456, 631, 678'),Cell('-5.38'),Cell('-5.35'),Cell('2.03'),Cell('1.59')],
           ]
    )

In [13]:
[record.serialize() for record in t3.records]

[{'names': ['PBDT-IIG']}, {'names': ['PBDT-TIIG']}]

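The tests show that the values that are extracted match the table cells, but coverage is incomplete: for t2 only UV-vis peaks are returned (and not all of them for PBDT-TIIG), and for t3 only the polymer names. Tabulated data is nevertheless much easier to extract than graphs and text.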
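
In the serialized output of `t2`, PBDT-TIIG keeps only the first peak of each cell ('463' and '466'), while the PBDT-IIG cells, which have a space after every comma, are parsed in full. One plausible cause, stated here as an assumption that has not been verified against the parser, is the missing space in cells such as '463, 826,854'. A minimal sketch of normalizing such cells before building the `Table`; the helper `normalize_list_cell` is hypothetical.

```python
import re


def normalize_list_cell(text):
    """Put exactly one space after each comma that separates two numbers."""
    # Only commas between digits are touched, so names and captions stay intact.
    # Caveat: a thousands separator such as '11,300' would also be split.
    return re.sub(r'(?<=\d)\s*,\s*(?=\d)', ', ', text.strip())


print(normalize_list_cell('463, 826,854'))
print(normalize_list_cell(' PBDT-TIIG '))
```

The helper would be applied to each cell string before wrapping it in `Cell(...)`. Whether this restores the missing peaks has to be checked by re-running the table parser.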


## Extracting Tables

### Tools 
1. PyPDF2
2. PyMuPDF
3. PDFCandy (no Python support) -- good image extraction
4. tabula -- **poor accuracy**
5. pdftables

In [24]:
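def recog_tables(paras):
    """
    Return the paragraphs whose text starts with "Table" (likely table captions).
    :param paras: list of document elements, e.g. doc.elements
    """
    tables = []
    for i in range(len(paras)):
        # str() of a paragraph gives its text
        if str(paras[i]).startswith("Table"):
            tables.append(paras[i])
    return tables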

In [56]:
len(paras)  # 196 elements
paras[0]  # valid indices run from 0 to 195

In [62]:
def find_table(paras):
    """
    Find tables from paragraphs
    """
    ls = []
    length = len(paras)
    for i in range(length):
        if str(paras[i]).startswith("Table"):
            ls.append(i)
    return ls

In [63]:
find_table(paras)

[35, 88, 120]
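
A plain `startswith("Table")` check also matches body paragraphs that merely begin with the word "Table" or "Tables". A stricter variant requires a table number after the word. The sketch below works on plain strings, so in this notebook it would be called with `[str(p) for p in paras]`; the names `CAPTION_RE` and `find_table_captions` are hypothetical.

```python
import re

# Matches "Table 1.", "Table 2:" or "TABLE S3" at the start of a paragraph.
CAPTION_RE = re.compile(r'^\s*Table\s+(S?\d+)\b[.:]?\s*(.*)',
                        re.IGNORECASE | re.DOTALL)


def find_table_captions(texts):
    """Return (index, table number, caption text) for caption-like paragraphs."""
    hits = []
    for i, text in enumerate(texts):
        match = CAPTION_RE.match(text)
        if match:
            hits.append((i, match.group(1), match.group(2).strip()))
    return hits


texts = [
    'Table 1. Physicochemical Properties of the Study Polymers',
    'Tables are easier to parse than graphs.',
    'Table 3. Photovoltaic Parameters of Optimized Solar Cells',
]
print(find_table_captions(texts))
```

A body sentence such as "Table 1 shows ..." would still match, so the returned indices are candidates to review, not confirmed captions.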

In [66]:
paras[35]
paras[88]
paras[120]

In [77]:
# how to read tables from pdf
p = open('test_articles/paper0.pdf', 'rb')
doc_0 = Document.from_file(p,readers=[PdfReader()])
doc_0

## Extracting Graphs

Extract chemical graphs and convert them to SMILES using OSRA or other software (usually without an API).

### Tools 
1. PyPDF2
2. PyMuPDF

We decided to use PyMuPDF to extract images, but found that any graph with embedded chemical graphs cannot be detected and extracted. The remaining graphs are extracted accurately.

Chemical graphs need different treatment; refer to the images from the example papers.

In [102]:
# ! pip install tabula
# ! pip install git+https://github.com/pdftables/python-pdftables-api.git
# ! pip install pypdf
# ! pip install tabula-py
# ! pip install PyMuPDF
# ! pip install pysimplegui

### Test and Selection

Among all the packages tested for image extraction, we found that PyMuPDF works best and is also the most customizable. The code below uses PySimpleGUI to select a file and then extracts its images.

In [2]:
from __future__ import print_function
import os, sys, time
import fitz
import PySimpleGUI as sg

"""
PyMuPDF utility
----------------
For a given entry in a page's getImageList() list, function "recoverpix"
returns either the raw image data, or a modified pixmap if an /SMask entry
exists.
The item's first two entries are PDF xref numbers. The first one is the image in
question, the second one may be 0 or the object id of a soft-image mask. In this
case, we assume it being a sequence of alpha bytes belonging to our image.
We then create a new Pixmap giving it these alpha values, and return it.
If the result pixmap is CMYK, it will be converted to RGB first.
"""
print(fitz.__doc__)

if not tuple(map(int, fitz.version[0].split("."))) >= (1, 13, 17):
    raise SystemExit("require PyMuPDF v1.13.17+")

dimlimit = 100  # each image side must be greater than this
relsize = 0.05  # image : pixmap size ratio must be larger than this (5%)
abssize = 2048  # absolute image size limit 2 KB: ignore if smaller
imgdir = "images"  # found images are stored in this subfolder

if not os.path.exists(imgdir):
    os.mkdir(imgdir)

def recoverpix(doc, item):
    x = item[0]  # xref of PDF image
    s = item[1]  # xref of its /SMask
    if s == 0:  # no smask: use direct image output
        return doc.extractImage(x)

    def getimage(pix):
        if pix.colorspace.n != 4:
            return pix
        tpix = fitz.Pixmap(fitz.csRGB, pix)
        return tpix

    # we need to reconstruct the alpha channel with the smask
    pix1 = fitz.Pixmap(doc, x)
    pix2 = fitz.Pixmap(doc, s)  # create pixmap of the /SMask entry

    # sanity check
    if not (pix1.irect == pix2.irect and pix1.alpha == pix2.alpha == 0 and pix2.n == 1):
        pix2 = None
        return getimage(pix1)

    pix = fitz.Pixmap(pix1)  # copy of pix1, alpha channel added
    pix.setAlpha(pix2.samples)  # treat pix2.samples as alpha value
    pix1 = pix2 = None  # free temp pixmaps

    # we may need to adjust something for CMYK pixmaps here:
    return getimage(pix)


fname = sys.argv[1] if len(sys.argv) == 2 else None
if not fname:
    fname = sg.PopupGetFile("Select file:", title="PyMuPDF PDF Image Extraction")
if not fname:
    raise SystemExit()

t0 = time.time()
doc = fitz.open(fname)

page_count = len(doc)  # number of pages

xreflist = []
imglist = []
for pno in range(page_count):
    sg.QuickMeter(
        "Extract Images",  # show our progress
        pno + 1,
        page_count,
        "*** Scanning Pages ***",
    )

    il = doc.getPageImageList(pno)
    imglist.extend([x[0] for x in il])
    for img in il:
        xref = img[0]
        if xref in xreflist:
            continue
        width = img[2]
        height = img[3]
        if min(width, height) <= dimlimit:
            continue
        pix = recoverpix(doc, img)
        if type(pix) is dict:  # we got a raw image
            ext = pix["ext"]
            imgdata = pix["image"]
            n = pix["colorspace"]
            imgfile = os.path.join(imgdir, "img-%i.%s" % (xref, ext))
        else:  # we got a pixmap
            imgfile = os.path.join(imgdir, "img-%i.png" % xref)
            n = pix.n
            imgdata = pix.getPNGData()

        if len(imgdata) <= abssize:
            continue

        if len(imgdata) / (width * height * n) <= relsize:
            continue

        fout = open(imgfile, "wb")
        fout.write(imgdata)
        fout.close()
        xreflist.append(xref)

t1 = time.time()
imglist = list(set(imglist))
print(len(set(imglist)), "images in total")
print(len(xreflist), "images extracted")
print("total time %g sec" % (t1 - t0))


PyMuPDF 1.16.3: Python bindings for the MuPDF 1.16.0 library.
Version date: 2019-10-01 09:01:06.
Built for Python 3.6 on win32 (64-bit).



SystemExit: 
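
The three size thresholds in the script (`dimlimit`, `abssize`, `relsize`) decide which images are written to disk. They can be isolated into a pure function and checked without opening a PDF. The sketch below reuses the same constants and the same order of checks; the function name `keep_image` is hypothetical.

```python
DIMLIMIT = 100   # each image side must be greater than this (pixels)
RELSIZE = 0.05   # stored bytes / (width * height * components) must exceed this
ABSSIZE = 2048   # stored image must be larger than 2 KB


def keep_image(width, height, n_components, n_bytes):
    """Apply the same three size filters as the extraction loop."""
    if min(width, height) <= DIMLIMIT:
        return False  # too small in at least one dimension
    if n_bytes <= ABSSIZE:
        return False  # too few bytes in the stored image
    if n_bytes / (width * height * n_components) <= RELSIZE:
        return False  # stored size is tiny relative to the raw pixel data
    return True


print(keep_image(800, 600, 3, 200_000))  # passes all three filters
print(keep_image(80, 600, 3, 200_000))   # rejected by the dimension limit
```

Keeping the thresholds in one place makes it easier to tune them per journal or per publisher.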

To make this easier, the scripts above should be automated so that we only need to predefine the folders for the PDFs and the images. We would also like to extract images from XML and HTML files.
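
One way to approach the automation is a small driver that walks a folder of PDFs and gives each paper its own image subfolder. This is only a sketch: `batch_extract` and `extract_one` are hypothetical names, and `extract_one` stands in for the per-file logic (for example, the page loop above wrapped in a function that returns the number of images written). The demo uses a dummy extractor and a temporary folder so that it runs without PyMuPDF.

```python
import tempfile
from pathlib import Path


def batch_extract(pdf_dir, image_dir, extract_one):
    """Call extract_one(pdf_path, out_dir) for every PDF found in pdf_dir."""
    pdf_dir, image_dir = Path(pdf_dir), Path(image_dir)
    summary = {}
    for pdf_path in sorted(pdf_dir.glob('*.pdf')):
        out_dir = image_dir / pdf_path.stem  # one subfolder per paper
        out_dir.mkdir(parents=True, exist_ok=True)
        summary[pdf_path.name] = extract_one(pdf_path, out_dir)
    return summary


def count_only(pdf_path, out_dir):
    # Dummy extractor used only to exercise the driver; it writes nothing.
    return 0


with tempfile.TemporaryDirectory() as tmp:
    pdfs = Path(tmp) / 'pdfs'
    pdfs.mkdir()
    (pdfs / 'paper0.pdf').write_bytes(b'')  # empty placeholder file
    print(batch_extract(pdfs, Path(tmp) / 'images', count_only))
```

In the notebook, the call would look like `batch_extract('test_articles', 'images', my_extractor)`, where `my_extractor` is whatever function ends up wrapping the PyMuPDF loop.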


