## Test on Properties

This notebook focuses on customizing parsers that can be used to extract property information.

Here we check data patching and the parsers at the same time. If the extracted properties do not line up with the extracted compounds, the project would be meaningless.


10/15/2019 meeting 

Challenges:

1. In the parser, a full name followed by an abbreviation in parentheses is not recognized.
2. Paragraphs with messy data cannot be recognized efficiently.
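For the second challenge, one option is to clean the text before it reaches the parser. The sketch below is a heuristic under stated assumptions, not part of ChemDataExtractor: `clean_pdf_text` is a hypothetical helper that normalizes PDF ligatures (such as "ﬁ") and rejoins words that were hyphenated across line breaks.

```python
import re
import unicodedata


def clean_pdf_text(text):
    """Heuristic cleanup for text extracted from PDFs (hypothetical helper)."""
    # NFKC normalization replaces ligature characters such as U+FB01 with "fi"
    text = unicodedata.normalize("NFKC", text)
    # Rejoin lowercase words split at a line break: "photo- voltaic" -> "photovoltaic"
    text = re.sub(r"([a-z])- ([a-z])", r"\1\2", text)
    # Keep the hyphen after a locant but drop the stray space: "2,5- bis" -> "2,5-bis"
    text = re.sub(r"(\d)- (?=\w)", r"\1-", text)
    return text


sample = "the introduc- tion of an electron-accepting moiety"
print(clean_pdf_text(sample))
# the introduction of an electron-accepting moiety
```

The lowercase-only rule is deliberately conservative. It will still merge genuinely hyphenated words that happened to break at a line end, so the output should be spot-checked.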


In [9]:
import logging
import re
import pandas as pd
import urllib
import time

import chemdataextractor as cde
from chemdataextractor import Document
import chemdataextractor.model as model
from chemdataextractor.model import Compound, UvvisSpectrum, UvvisPeak, BaseModel, StringType, ListType, ModelType
from chemdataextractor.parse.common import hyphen
from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first
from chemdataextractor.parse.actions import strip_stop
from chemdataextractor.parse.elements import W, I, T, R, Optional, ZeroOrMore, OneOrMore, Or, And
from chemdataextractor.parse.cem import chemical_name
from chemdataextractor.doc import Paragraph, Sentence, Caption, Figure, Table, Heading
from chemdataextractor.doc.table import Table, Cell

In [10]:
from chemdataextractor.text.chem import SOLVENT_RE, INCHI_RE, SMILES_RE

In [None]:
# open and read files
f = open('../test_articles/paper0.pdf', 'rb')
doc = Document.from_file(f)
abstract = [11]

f1 = open('../test_articles/paper1.pdf', 'rb')
doc1 = Document.from_file(f1)
abstract1 = [7,8]

f2 = open('../test_articles/paper2.pdf', 'rb')
doc2 = Document.from_file(f2)
abstract2 = [7,8]

f3 = open('../test_articles/paper3.pdf', 'rb')
doc3 = Document.from_file(f3)
abstract3 = [10]

f4 = open('../test_articles/paper4.pdf', 'rb')
doc4 = Document.from_file(f4)
abstract4 = [12]

f5 = open('../test_articles/paper5.pdf', 'rb')
doc5 = Document.from_file(f5)
abstract5 = [3,4]

f6 = open('../test_articles/paper6.pdf', 'rb')
doc6 = Document.from_file(f6)
abstract6 = [5,6,7,8]

f7 = open('../test_articles/paper7.pdf', 'rb')
doc7 = Document.from_file(f7)
abstract7 = [11]

In [None]:
# split the paragraph into elements
paras = doc.elements
cems = doc.cems
a = doc.records.serialize()

In [None]:
a

In [None]:
doc

PCE and FF work fine, as do other quantities that end in %. Other units require further customization.

Most properties in the literature are reported in the same layout, so if one example works, the rest should work too.

Any unit with a simple expression (one component) is easy to extract. Otherwise, a combination of components is needed.
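To illustrate the difference between a one-component unit and a combined one, here is a minimal sketch using the standard `re` module rather than the ChemDataExtractor grammar. The pattern names `PERCENT` and `CURRENT_DENSITY` are made up for this example, and the accepted spellings of the unit are an assumption based on the test sentences in this notebook.

```python
import re

VALUE = r"(?P<value>\d+(?:\.\d+)?)"

# One-component unit: a single symbol after the number
PERCENT = re.compile(VALUE + r"\s*(?P<units>%)")

# Combined unit: several pieces that may be written as mA/cm2, mAcm-2 or mA cm−2
CURRENT_DENSITY = re.compile(
    VALUE + r"\s*(?P<units>mA\s*(?:/\s*cm\s*2|cm\s*[-−]\s*2))"
)

pce = PERCENT.search("reached 3.04 % with a short-circuit current density")
jsc = CURRENT_DENSITY.search("(Jsc) value of 6.60 mA/cm2, a Voc value of 0.80 V")

print(pce.group("value"), pce.group("units"))  # 3.04 %
print(jsc.group("value"), jsc.group("units"))  # 6.60 mA/cm2
```

The percent pattern needs one alternative, while the current density needs a separate branch for each way the unit is written.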

In [None]:
class Jsc(BaseModel):
    value = StringType()
    units = StringType()

Compound.jsc_pattern = ListType(ModelType(Jsc))

# Prefixes that introduce the quantity: the abbreviation or the spelled-out name
abbrv_prefix = (I(u'jsc') | I(u'Jsc')).hide()
words_pref = (I(u'short') + I(u'circuit') + I(u'current') + I(u'density')).hide()
hyphenated_pref = (I(u'short') + I(u'-') + I(u'circuit') + I(u'current') + I(u'density')).hide()

prefix = abbrv_prefix | words_pref | hyphenated_pref

# Raw strings keep the regex escapes (\w, \D, \d) from being read as string escapes
common_text = R(r'(\w+)?\D(\D+)+(\w+)?').hide()  # non-numeric text between prefix and value
common_paren = R(r'\((.+)\)|\[(.*?)\]')  # text in parentheses or square brackets
units = (W(u'm') + W(u'A') + W(u'/') + W(u'cm') + W(u'2'))(u'units')  # the unit, mA/cm2
value = R(r'\d+(\.\d+)?')(u'value')  # the numeric value

jsc_first = (prefix + ZeroOrMore(common_text) + value + units)(u'jsc')
jsc_second = (value + units + prefix)(u'jsc')
jsc_third = (prefix + ZeroOrMore(common_paren) + value + units)(u'jsc')

jsc_pattern = jsc_first | jsc_second | jsc_third

class JscParser(BaseParser):
    root = jsc_pattern

    def interpret(self, result, start, end):
        compound = Compound(
            jsc_pattern=[
                Jsc(
                    value=first(result.xpath('./value/text()')),
                    units=first(result.xpath('./units/text()'))
                )
            ]
        )
        yield compound

def parse_jsc(list_of_sentences):
    """
    Takes a list of sentences and parses them for quantified Jsc
    information and relationships to chemicals/chemical labels.
    """

    # Register the parser only once so repeated calls do not add duplicates
    if not any(isinstance(p, JscParser) for p in Sentence.parsers):
        Sentence.parsers.append(JscParser())

    cde_sentences = [Sentence(sent).records.serialize() for sent in list_of_sentences]
    return cde_sentences

The name, value, and unit can be combined in the following ways:

1. **name** (abbrev) + value + unit
2. **name** [abbrev] + value + unit
3. **name** + value + unit
4. Any of the above with no space between value and unit (unlikely)

If anything is inserted between the value and the preceding part, the parser does not work.
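The combinations listed above can be sketched with a plain regular expression. This is not the ChemDataExtractor grammar used in this notebook; `JSC_RE` and `find_jsc` are hypothetical names, and allowing up to three filler words (for example "value of") between the name and the value is an assumption chosen for illustration.

```python
import re

NAME = r"(?:short[- ]circuit\s+current\s+density|jsc)"
ABBREV = r"(?:\s*[(\[][^)\]]*[)\]])?"   # optional "(Jsc)" or "[Jsc]"
FILLER = r"(?:\s+[^\W\d]+){0,3}"        # up to three non-numeric filler words
VALUE = r"\s*(?P<value>\d+(?:\.\d+)?)"
UNITS = r"\s*(?P<units>mA\s*/\s*cm\s*2)"

JSC_RE = re.compile(NAME + ABBREV + FILLER + VALUE + UNITS, re.IGNORECASE)


def find_jsc(sentence):
    """Return (value, units) for the first Jsc mention, or None."""
    match = JSC_RE.search(sentence)
    return (match.group("value"), match.group("units")) if match else None


examples = [
    "short-circuit current density (Jsc) 12 mA/cm2",   # case 1
    "short circuit current density [Jsc] 12 mA/cm2",   # case 2
    "Jsc 12mA/cm2",                                    # cases 3 and 4
    "a short-circuit current density (Jsc) value of 6.60 mA/cm2",  # with insertion
]
for sentence in examples:
    print(find_jsc(sentence))
```

A longer insertion, such as "Jsc was found to be as high as 12 mA/cm2", still returns `None`, so the filler limit has to be tuned against real sentences.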

In [None]:
Sentence.parsers.append(JscParser())
Paragraph.parsers.append(JscParser())

Sentence.parsers.append(VocParser())
Paragraph.parsers.append(VocParser())

# Sentence.parsers.append(MwParser())
# Paragraph.parsers.append(MwParser())

Sentence.parsers.append(PceParser())
Paragraph.parsers.append(PceParser())

# Sentence.parsers.append(EqeParser())
# Paragraph.parsers.append(EqeParser())

# Sentence.parsers.append(FfParser())
# Paragraph.parsers.append(FfParser())

In [None]:
doc = Document(
    Heading('5,10,15,20-Tetra(4-carboxyphenyl)porphyrin (3).'),
    Paragraph('m.p. 90°C.'),
    Paragraph('open-circuit voltage of 5 V'),
    Paragraph('power-conversion efficiency of 12 %'),
    Paragraph('with the short-circuit current density (Jsc) of 12 mAcm-2')
)

rec = doc.records.serialize()

In [None]:
rec

Here we can test data patching on doc7.

In [None]:
doc7[11]

In [None]:
doc = Document(
    Heading('Abstract:'),
    Paragraph('We report the synthesis, properties, and photo- voltaic applications of new π-conjugated polymers having thiophene, 3,4-dihexylthiophene, and 1,3,4-oxadiazole (OXD) or 1,3,4-thiadiazole (TD) units in the main chain, denoted as P1 and P2. They were synthesized by the Stille coupling reaction of 2,5- bis(trimethylstannyl)thiophene and the corresponding monomers of 2,5-bis(5′-bromo-3′,4′-dihexylthien-2′-yl)-1,3,4-oxadiazole or 2,5-bis(5′-bromo-3′,4′-dihexylthien-2′-yl)-1,3,4-thiadiazole, re- spectively. '),
    Paragraph('The experimental results indicated that the introduc- tion of an electron-accepting moiety of OXD or TD lowered the highest occupied molecular orbital (HOMO) energy levels, resulting in the higher the open-circuit voltage (Voc) values of polymer solar cells (PSCs). Indeed, the PSCs of P1 and P2 showed high Voc values in the range 0.80−0.90 V. The highest ﬁeld-eﬀect transistor (FET) mobilities of P1 and P2 with the OXD and TD moieties, respectively, were 1.41 × 10−3 and 8.81 × 10−2 cm2 V−1 s−1. '),
    Paragraph('The higher mobility of P2 was related to its orderly nanoﬁbrillar structure, as evidenced from the TEM images. Moreover, the higher absorption coeﬃcient and smaller band gap of P2 provided a more eﬃcient light-harvesting ability. '),
    Paragraph('The power conversion eﬃciency (PCE) of the PSC based on P2:PCBM = 1:1 (w/w) reached 3.04 % with a short-circuit current density (Jsc) value of 6.60 mA/cm2, a Voc value of 0.80 V, and a fill factor (as) value of 57.6% during the illumination of AM 1.5, 100 mW/cm2. '),
    Paragraph('In comparison, the electron-accepting moiety exhibited an inferior device performance (FET mobility = 2.10 × 10−4 cm2 V−1 s−1 and PCE = 1.91%). The experimental results demonstrated that incorporating the electron-acceptor moiety into the polythiophene backbone could enhance the device performance due to the low-lying HOMO levels, compact packing structure, and high charge carrier mobility. This is the ﬁrst report for the achievement of PCE > 3% using PSCs based on polythiophenes having TD units in the main chain.')
)

rec = doc.records.serialize()

In [None]:
rec

In [None]:
doc7[11].records.serialize()

The abstract of doc7 gives the following values:

1. Voc
2. PCE
3. Mobility
4. Blend ratio

and the following compounds:

1. P1
2. P2

In the output above, P1 and P2 are not extracted; only the two units in the main chain, OXD and TD, are found.
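A check like this can be automated by comparing the expected compound labels against the serialized records. The sketch below assumes each record is a dict that may carry `names` and `labels` lists. The helper `missing_compounds` and the sample records are hypothetical and only mimic the situation described above; they are not actual parser output.

```python
def missing_compounds(records, expected):
    """Return the expected compound labels that appear in no record."""
    found = set()
    for record in records:
        # Assumed record layout: optional 'names' and 'labels' lists
        found.update(record.get("names", []))
        found.update(record.get("labels", []))
    return [name for name in expected if name not in found]


# Hypothetical records standing in for doc.records.serialize()
sample_records = [{"names": ["OXD"]}, {"names": ["TD"]}]

print(missing_compounds(sample_records, ["P1", "P2"]))
# ['P1', 'P2']
```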

## Regex Tester

1. \[    : [ is a meta char and needs to be escaped if you want to match it literally.
2. (.*?) : match everything in a non-greedy way and capture it.
3. \]    : ] is a meta char and needs to be escaped if you want to match it literally.
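The difference between the greedy `(.+)` and the non-greedy `(.*?)` matters as soon as a sentence contains more than one abbreviation. A short illustration with the standard `re` module follows; the test sentence is made up for this example.

```python
import re

text = "short-circuit current density (Jsc) and open-circuit voltage (Voc)"

# Greedy: runs from the first "(" to the last ")"
greedy = re.findall(r"\((.+)\)", text)
# Non-greedy: stops at the first closing ")"
lazy = re.findall(r"\((.*?)\)", text)

print(greedy)  # ['Jsc) and open-circuit voltage (Voc']
print(lazy)    # ['Jsc', 'Voc']
```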

In [None]:
# Regex finder and tester: use this to try out a regex before giving it to the parser
import re

regex = r"\((.+)\)"
# regex = r"\[(.*?)\]"

test_str = "molecular weight (Mw) of 12 kg"

for match_num, match in enumerate(re.finditer(regex, test_str), start=1):
    print("Match {} was found at {}-{}: {}".format(
        match_num, match.start(), match.end(), match.group()))

    # Groups are numbered from 1; group 0 is the whole match
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {} found at {}-{}: {}".format(
            group_num, match.start(group_num), match.end(group_num), match.group(group_num)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.