## ChemDataExtractor (CDE) and Paper-parser (PP)

Environment setup:
* Use Python 2.7 to 3.6.
* All requirements are listed in requirements.txt.
* Use pipenv or conda to build the virtual environment.

The scientific abstracts used in this study are available via Elsevier's Scopus and ScienceDirect APIs (https://dev.elsevier.com/), the Springer Nature API (https://dev.springernature.com/), and the Royal Society of Chemistry API. The study works from a list of DOIs; the pretrained model is the one built into the chemdataextractor package, while the extended parsing rules and search criteria were written by the author.

**Author:** Jingtian Zhang



### Extracting a Custom Property

In [46]:
import chemdataextractor
import chemdataextractor.model as model
from chemdataextractor.model import Compound
from chemdataextractor.doc import Document, Heading, Paragraph, Sentence

### Example Document

Let's create a simple example document with a single heading followed by a single paragraph:

In [28]:
d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)')
)

# display a heading and a paragraph of the given content
d

### Default Parsers

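By default, chemdataextractor won't extract the boiling point property. In this notebook session, the call below raised an `AttributeError` rather than returning the default records: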

In [29]:
d.records.serialize()

AttributeError: 'list' object has no attribute 'xpath'

### Defining a New Property Model

The first task is to define the schema of a new property, and add it to the `Compound` model:

In [30]:
from chemdataextractor.model import BaseModel, StringType, ListType, ModelType

class BoilingPoint(BaseModel):
    value = StringType()
    units = StringType()
    
Compound.boiling_points = ListType(ModelType(BoilingPoint))

### Writing a New Parser

Next, define parsing rules that specify how to interpret text and convert it into the model:

In [31]:
import re
from chemdataextractor.parse import R, I, W, Optional, merge

prefix = (R(u'^b\.?p\.?$', re.I) | I(u'boiling') + I(u'point')).hide()
units = (W(u'°') + Optional(R(u'^[CFK]\.?$')))(u'units').add_action(merge)
value = R(u'^\d+(\.\d+)?$')(u'value')
bp = (prefix + value + units)(u'bp')
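
To make it easier to see what the rules above are meant to capture, here is a minimal sketch that approximates `prefix + value + units` with a plain regular expression on the raw string. It does not use ChemDataExtractor's token-based grammar; `BP_PATTERN` and `find_boiling_points` are hypothetical names introduced only for this illustration.

```python
import re

# Plain-regex approximation of the token rules (an assumption, not the
# ChemDataExtractor API): a prefix, a number, then a degree sign with an
# optional scale letter.
BP_PATTERN = re.compile(
    r'(?:[Bb]\.?[Pp]\.?|[Bb]oiling\s+[Pp]oint)\s*'  # prefix, not captured
    r'(?P<value>\d+(?:\.\d+)?)\s*'                   # integer or decimal value
    r'(?P<units>°\s?(?:[CFK]\.?)?)'                  # degree sign, optional C/F/K
)


def find_boiling_points(text):
    """Return a list of {'value': ..., 'units': ...} dicts found in text."""
    results = []
    for match in BP_PATTERN.finditer(text):
        # Drop any space between the degree sign and the scale letter,
        # similar in spirit to the merge action on the units rule.
        units = match.group('units').replace(' ', '')
        results.append({'value': match.group('value'), 'units': units})
    return results


text = u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)'
print(find_boiling_points(text))
```

The token-based rules remain the ones used by the parser; the regex is only a quick way to check the intended value and units on a sample string.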

In [32]:
from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

class BpParser(BaseParser):
    root = bp

    def interpret(self, result, start, end):
        compound = Compound(
            boiling_points=[
                BoilingPoint(
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()'))
                )
            ]
        )
        yield compound

In [33]:
Paragraph.parsers = [BpParser()]

### Running the New Parser

In [34]:
d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)')
)

d.records.serialize()

[{'boiling_points': [{'units': '°C', 'value': '240'}],
  'labels': ['3a'],
  'names': ['2,4,6-trinitrotoluene'],
  'roles': ['product']}]

## Example from Documentation

In [2]:
doc = Document('UV-vis spectrum of 5,10,15,20-Tetra(4-carboxyphenyl)porphyrin in Tetrahydrofuran (THF).')
doc

In [3]:
# each individual chemical entity mention (CEM)
doc.cems

[Span('Tetrahydrofuran', 65, 80),
 Span('THF', 82, 85),
 Span('5,10,15,20-Tetra(4-carboxyphenyl)porphyrin', 19, 61)]
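
The numbers in each `Span` above are character offsets into the source text, so slicing the text with them should give back the mention. A minimal sketch of that check, using a hypothetical `Mention` namedtuple as a stand-in for the library's `Span` class:

```python
from collections import namedtuple

# Stand-in for a Span (hypothetical, for illustration only).
Mention = namedtuple('Mention', ['text', 'start', 'end'])


def check_offsets(source, mention):
    """Return True if source[start:end] reproduces the mention text."""
    return source[mention.start:mention.end] == mention.text


source = ('UV-vis spectrum of 5,10,15,20-Tetra(4-carboxyphenyl)porphyrin '
          'in Tetrahydrofuran (THF).')

# Offsets copied from the output above.
mentions = [
    Mention('Tetrahydrofuran', 65, 80),
    Mention('THF', 82, 85),
    Mention('5,10,15,20-Tetra(4-carboxyphenyl)porphyrin', 19, 61),
]

all_ok = all(check_offsets(source, m) for m in mentions)
print(all_ok)
```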

In [4]:
doc.abbreviation_definitions

[(['THF'], ['Tetrahydrofuran'], 'CM')]

In [5]:
doc.records
doc.records[0].serialize()
doc.records[1].serialize()

{'names': ['Tetrahydrofuran', 'THF']}

## Trying on OPV Documents

We are going to extract some basic OPV optoelectronic properties from the documents. Here is a list of the fundamental properties:

* Voc (V)
* Jsc (mA cm-2)
* FF (%)
* PCE (%)
* Bandgap, energy loss (Eloss), offsets (eV)
* active area (cm2)
* exposure time (s, min, hr, d)
* Molecular weight (Mw) (kg/mol)
* hole and electron mobilities (cm2 V-1 s-1)
* EQE, IQE (unitless)
* absorption (nm)
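
One way to keep these targets explicit while writing parsers is a small lookup table from property name to accepted units. The sketch below simply transcribes the list above; the dictionary keys and the helper `is_valid_unit` are hypothetical names, not part of ChemDataExtractor.

```python
# Property name -> units listed above. An empty tuple marks a unitless quantity.
OPV_PROPERTY_UNITS = {
    'Voc': ('V',),
    'Jsc': ('mA cm-2',),
    'FF': ('%',),
    'PCE': ('%',),
    'bandgap': ('eV',),
    'energy_loss': ('eV',),
    'offsets': ('eV',),
    'active_area': ('cm2',),
    'exposure_time': ('s', 'min', 'hr', 'd'),
    'molecular_weight': ('kg/mol',),
    'mobility': ('cm2 V-1 s-1',),
    'EQE': (),
    'IQE': (),
    'absorption': ('nm',),
}


def is_valid_unit(name, unit):
    """Check an extracted unit string against the table for one property."""
    allowed = OPV_PROPERTY_UNITS[name]
    if not allowed:
        # Unitless properties should come with an empty unit string.
        return unit == ''
    return unit in allowed


print(is_valid_unit('exposure_time', 'min'))
```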

Since spectroscopy and other properties can be extracted automatically, we will not focus on them here.

We also want to extract AFM and TEM images from the documents, together with their corresponding roughness values, and store them in a chemical database.

**Reading a document breaks it into elements such as paragraphs; each paragraph can then be split into sentences or tokens.**

While ChemDataExtractor supports documents in a wide variety of formats, some are better suited for extraction than others. If there is an HTML or XML version available, that is normally the best choice.

Wherever possible, avoid using the PDF version of a paper or patent. At best, the text will be interpretable, but it is extremely difficult to reliably distinguish between headings, captions and main body text. At worst, the document will just consist of a scanned image of each page, and it won't be possible to extract any of the text at all. You can get some idea of what ChemDataExtractor can see in a PDF by looking at the result of copying-and-pasting from the document.

For scientific articles, most publishers offer an HTML version alongside the PDF version. Normally, this will open as a page in your web browser. Just choose "Save As..." and ensure the selected format is "HTML" or "Page Source" to save a copy of the HTML file to your computer.

Most patent offices provide XML versions of their patent documents, but these can be hard to find. Two useful resources are the USPTO Bulk Data Download Service and the EPO Open Patent Services API.

**Reference**: http://www.chemdataextractor.org

In [35]:
# Open the PDF in binary mode; the context manager closes the file afterwards.
with open('example_doc.pdf', 'rb') as f:
    doc = Document.from_file(f)

In [40]:
doc.elements
doc.cems 
# Caveat: an author's name may be mistaken for a chemical name.
# Each CEM is returned as a `Span`, which contains the mention text as well as
# the start and end character offsets within the containing document element.

[Span('Li', 19, 21),
 Span('Furan', 3747, 3752),
 Span('PCBM', 104, 108),
 Span('diketopyrrolopyrrole', 814, 834),
 Span('alkyl', 36, 41),
 Span('diketopyrrolopyrrole', 81, 101),
 Span('H', 3658, 3659),
 Span('thiophene', 0, 9),
 Span('thiophene', 192, 201),
 Span('thiophene', 256, 265),
 Span('thiophene', 64, 73),
 Span('pyrrole,24', 356, 366),
 Span('S', 200, 201),
 Span('Oh', 2457, 2459),
 Span('Chalcogenophene', 1912, 1927),
 Span('thiophene', 351, 360),
 Span('Diketopyrrolopyrrole', 3312, 3332),
 Span('selenophene-\nsubstituted', 920, 944),
 Span('DPP', 730, 733),
 Span('DPP', 618, 621),
 Span('DPP', 1050, 1053),
 Span('DPP', 10, 13),
 Span('DPP', 83, 86),
 Span('bithiophene,18 benzodithiophene', 423, 454),
 Span('DPP', 234, 237),
 Span('furan', 500, 505),
 Span('4-methylthiophen-2-yl', 622, 643),
 Span('phenyl', 977, 983),
 Span('DPP', 836, 839),
 Span('DPP', 324, 327),
 Span('DPP', 0, 3),
 Span('DPP', 920, 923),
 Span('DPP', 340, 343),
 Span('PBDTT', 63, 68),
 Span('DPP', 110, 1

In [41]:
# Element types include Title, Heading, Paragraph, Citation, Table, Figure, Caption and Footnote.
# You can retrieve a specific element by its index within the document.
para = doc.elements[3]
para

In [57]:
# you can get the individual sentences of a paragraph
para.sentences

[Sentence('CONSPECTUS: Conjugated polymers have been extensively studied for\napplication in organic solar cells.', 0, 101),
 Sentence('In designing new polymers, particular\nattention has been given to tuning the absorption spectrum, molecular\nenergy levels, crystallinity, and charge carrier mobility to enhance\nperformance.', 102, 291),
 Sentence('As a result, the power conversion eﬃciencies (PCEs) of\nsolar cells based on conjugated polymers as electron donor and fullerene\nderivatives as electron acceptor have exceeded 10% in single-junction and\n11% in multijunction devices.', 292, 523),
 Sentence('Despite these eﬀorts, it is notoriously diﬃcult\nto establish thorough structure−property relationships that will be required\nto further optimize existing high-performance polymers to their intrinsic\nlimits.', 524, 730),
 Sentence('In this Account, we highlight progress on the development and our\nunderstanding of diketopyrrolopyrrole (DPP) based conjugated polymers\nfor polymer sola
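
The sentences above still contain PDF artifacts: typographic ligatures (for example in "eﬃciencies") and line breaks inside sentences. A minimal clean-up sketch using only the standard library is shown here; `clean_pdf_text` is a hypothetical helper, and whether NFKC normalization is appropriate for a given corpus is an assumption to check, since it also rewrites other compatibility characters.

```python
import re
import unicodedata


def clean_pdf_text(text):
    """Normalize compatibility characters and collapse whitespace."""
    # NFKC maps compatibility ligatures such as U+FB03 to plain letters ("ffi").
    normalized = unicodedata.normalize('NFKC', text)
    # Replace line breaks and runs of whitespace with a single space.
    return re.sub(r'\s+', ' ', normalized).strip()


raw = u'the power conversion e\ufb03ciencies (PCEs) of\nsolar cells'
print(clean_pdf_text(raw))
```

Words hyphenated across a line break (such as `selenophene-\nsubstituted` in the CEM list) are not rejoined by this sketch and would need separate handling.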

In [43]:
para.tokens

[[Token('CONSPECTUS', 0, 10),
  Token(':', 10, 11),
  Token('Conjugated', 12, 22),
  Token('polymers', 23, 31),
  Token('have', 32, 36),
  Token('been', 37, 41),
  Token('extensively', 42, 53),
  Token('studied', 54, 61),
  Token('for', 62, 65),
  Token('application', 66, 77),
  Token('in', 78, 80),
  Token('organic', 81, 88),
  Token('solar', 89, 94),
  Token('cells', 95, 100),
  Token('.', 100, 101)],
 [Token('In', 102, 104),
  Token('designing', 105, 114),
  Token('new', 115, 118),
  Token('polymers', 119, 127),
  Token(',', 127, 128),
  Token('particular', 129, 139),
  Token('attention', 140, 149),
  Token('has', 150, 153),
  Token('been', 154, 158),
  Token('given', 159, 164),
  Token('to', 165, 167),
  Token('tuning', 168, 174),
  Token('the', 175, 178),
  Token('absorption', 179, 189),
  Token('spectrum', 190, 198),
  Token(',', 198, 199),
  Token('molecular', 200, 209),
  Token('energy', 210, 216),
  Token('levels', 217, 223),
  Token(',', 223, 224),
  Token('crystallinity', 22

### Designing New Parsing Rules for Properties

1. Define a new property model
2. Write a new parser
3. Run the new parser

We can use the functions described on the ChemDataExtractor website.

ChemDataExtractor contains a chemistry-aware part-of-speech (POS) tagger. Use the `pos_tagged_tokens` property on a document element to get the tagged tokens:

In [48]:
# use pos_tagged_tokens property on a document element to get the tagged tokens
s = Sentence('1H NMR spectra were recorded on a 300 MHz BRUKER DPX300 spectrometer.')
s.pos_tagged_tokens

[('1H', 'NN'),
 ('NMR', 'NN'),
 ('spectra', 'NNS'),
 ('were', 'VBD'),
 ('recorded', 'VBN'),
 ('on', 'IN'),
 ('a', 'DT'),
 ('300', 'CD'),
 ('MHz', 'NNP'),
 ('BRUKER', 'NNP'),
 ('DPX300', 'NNP'),
 ('spectrometer', 'NN'),
 ('.', '.')]

In [49]:
# using taggers directly
from chemdataextractor.nlp.pos import ChemCrfPosTagger
cpt = ChemCrfPosTagger()
cpt.tag(['1H', 'NMR', 'spectra', 'were', 'recorded', 'on', 'a', '300', 'MHz', 'BRUKER', 'DPX300', 'spectrometer', '.'])

[('1H', 'NN'),
 ('NMR', 'NN'),
 ('spectra', 'NNS'),
 ('were', 'VBD'),
 ('recorded', 'VBN'),
 ('on', 'IN'),
 ('a', 'DT'),
 ('300', 'CD'),
 ('MHz', 'NNP'),
 ('BRUKER', 'NNP'),
 ('DPX300', 'NNP'),
 ('spectrometer', 'NN'),
 ('.', '.')]

In [60]:
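from chemdataextractor.model import BaseModel, StringType, ListType, ModelType

# Each property is stored as a value/units pair of strings.
class Mobility(BaseModel):
    value = StringType()
    units = StringType()

class Jsc(BaseModel):
    value = StringType()
    units = StringType()

class Voc(BaseModel):
    value = StringType()
    units = StringType()

class FF(BaseModel):
    value = StringType()
    units = StringType()

class PCE(BaseModel):
    value = StringType()
    units = StringType()

class Area(BaseModel):
    value = StringType()
    units = StringType()

# Attach each property to the Compound model as a list field.
# `mobility` is the field name that the parser passes to Compound.
Compound.mobility = ListType(ModelType(Mobility))
Compound.Jsc = ListType(ModelType(Jsc))
Compound.Voc = ListType(ModelType(Voc))
Compound.FF = ListType(ModelType(FF))
Compound.PCE = ListType(ModelType(PCE))
Compound.Area = ListType(ModelType(Area))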

In [67]:
import re
from chemdataextractor.parse import R, I, W, Optional, merge

prefix = (I(u'mobilities')).hide()
units = (W(u'cm2 V−1 s−1') + Optional(R(u'^[CFK]\.?$')))(u'units').add_action(merge)
value = R(u'^\d+(\.\d+)?$')(u'value')
mu = (prefix + units + value)(u'mu')

In [68]:
from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first

class muParser(BaseParser):
    root = mu

    def interpret(self, result, start, end):
        compound = Compound(
            mobility=[
                Mobility(
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()'))
                )
            ]
        )
        yield compound

Paragraph.parsers = [muParser()]
d = Document(
    Paragraph(u'Thiophene-substituted DPP polymers show high hole and electron mobilities above 1 cm2 V−1 s−1 in FETs.16 By copolymerization of the thiophene-substituted DPP with π- conjugated aromatic monomers of different donor strength, such as biphenyl,23 phenyl,8 thiophene,8 and dithienopyrrole,24 the absorption onset can be tuned from 750 nm to above 1000 nm. The PCEs can reach 9.4%.')
)
d.records.serialize()

[]
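
Two things in the rule above likely contribute to the empty result. The rule expects `prefix + units + value`, while in the sentence the value ("1") comes before the units. In addition, `W()` matches a single token, so a string containing spaces such as `cm2 V−1 s−1` cannot match one token. The sketch below is not a ChemDataExtractor parser; it is a plain-regex illustration of the intended order (prefix, optional qualifier, value, units). The names `MOBILITY_PATTERN` and `find_mobilities`, and the set of qualifier words, are assumptions made for this example.

```python
import re

# Sentence from the cell above, with U+2212 (minus sign) written as an escape.
sample = (u'Thiophene-substituted DPP polymers show high hole and electron '
          u'mobilities above 1 cm2 V\u22121 s\u22121 in FETs.')

MOBILITY_PATTERN = re.compile(
    r'[Mm]obilit(?:y|ies)\s+'                      # prefix
    r'(?:(?:above|of|up\s+to)\s+)?'                # optional qualifier (assumed)
    r'(?P<value>\d+(?:\.\d+)?)\s+'                 # value comes before the units
    r'(?P<units>cm2\s+V[\u2212-]1\s+s[\u2212-]1)'  # minus sign or ASCII hyphen
)


def find_mobilities(text):
    """Return a list of {'value': ..., 'units': ...} dicts found in text."""
    return [
        {'value': m.group('value'), 'units': m.group('units')}
        for m in MOBILITY_PATTERN.finditer(text)
    ]


print(find_mobilities(sample))
```

A token-based rule would need the same order, with the units expressed as a sequence of per-token rules rather than one `W()` call.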

In [84]:
d = Document(
    Paragraph(u'Thiophene-substituted DPP polymers show high hole and electron mobilities above 1 cm2 V−1 s−1 in FETs.16 By copolymerization of the thiophene-substituted DPP with π- conjugated aromatic monomers of different donor strength, such as biphenyl,23 phenyl,8 thiophene,8 and dithienopyrrole,24 the absorption onset can be tuned from 750 nm to above 1000 nm. The PCEs can reach 9.4%.')
)
# we can see cems working fine
d.cems
# but d.records.serialize() doesn't return anything, only []


<bound method Sequence.count of <Document: 1 elements>>

### As shown above and in the extracting_a_custom_property notebook, there is no