# Goal

* **Topic to present:** 
    * Share with the community a pipeline for constructing a materials research database that can be used for data science studies.


* **Major contents:**
    * How to search for the right articles.
    * How to extract the data and form a dataset from a collection of articles or online publications.
    * Data processing techniques that have been used by Everett.


* **Examples we will present in this article:**
    * Saeki's group's paper collection. 
    * Everett's search for phase-changing materials, especially liquid-state properties, including liquid-state thermal conductivity, density, and heat capacity.
    * Potentially look at Oya Sensei's or Okabe Sensei's research topics to see whether we can generate an example from their research focus. It doesn't need to be a big example.
    
    
* All these examples serve as evidence that the pipeline is effective at data mining and data processing for generating input datasets for data science studies on materials.


* **Tasks for Sam at the moment:**
    * Quickly summarize the Berkeley technique for generating the database so I can share it with Saeki Sensei.
    * Take a quick look at the technique Saeki's group used and see how to incorporate it into our paper. (We can maybe discuss by phone after you take a look, so I can propose to Saeki Sensei that one of his students summarize the method in a short paragraph that we can include in our paper.)


## Workflow

The workflow of this project is shown in the following mind map:
<img src="project_xmind.png">

**Scraper**
* The first step of the data mining process is to collect data.
* Scrape literature, including articles, conference papers, etc., from major scientific publishers and societies, such as arXiv e-prints, Elsevier, Springer, the Royal Society of Chemistry, the American Chemical Society, etc.
* Use cases and testing
    * Perovskite solar cell literature (from pp)
    * OPV solar cell literature (from Professor Saeki’s resources)
    * Thermosetting resins literature (currently researching)
    * Liquid properties for materials (currently researching)
    
**NLP Model**
* ChemDataExtractor
* paper-parser
* Both will be tested. ChemDataExtractor can be applied more widely, across chemistry, biology, and materials science and engineering.

**Database**
* Designing an efficient, useful database is a matter of following the proper process, including these phases:
    * Requirements analysis, or identifying the purpose of your database
    * Organizing data into tables
    * Specifying primary keys and analyzing relationships
    * Normalizing to standardize the tables
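
As a rough illustration of the phases above, the sketch below lays out a small normalized schema with Python's built-in `sqlite3` module. The table and column names (`paper`, `material`, `measurement`, and so on) and the placeholder DOI are assumptions made for this example; they are not the project's actual schema.

```python
import sqlite3

# Hypothetical schema: one row per paper, one row per material, and one row
# per measured property value linking the two. All names are illustrative.
SCHEMA = """
CREATE TABLE paper (
    paper_id INTEGER PRIMARY KEY,
    doi      TEXT UNIQUE NOT NULL,
    title    TEXT
);
CREATE TABLE material (
    material_id INTEGER PRIMARY KEY,
    name        TEXT UNIQUE NOT NULL
);
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    paper_id       INTEGER NOT NULL REFERENCES paper(paper_id),
    material_id    INTEGER NOT NULL REFERENCES material(material_id),
    property       TEXT NOT NULL,
    value          REAL NOT NULL,
    unit           TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Placeholder rows, only to show how the tables relate to each other.
conn.execute("INSERT INTO paper (doi, title) VALUES (?, ?)",
             ("10.0000/example", "Example paper"))
conn.execute("INSERT INTO material (name) VALUES (?)", ("PBDT-IIG",))
conn.execute(
    "INSERT INTO measurement (paper_id, material_id, property, value, unit) "
    "VALUES (?, ?, ?, ?, ?)",
    (1, 1, "PCE", 5.86, "%"),
)

# Join the three tables back together to recover one flat record per value.
rows = conn.execute(
    """
    SELECT m.name, x.property, x.value, x.unit, p.doi
    FROM measurement AS x
    JOIN material AS m ON m.material_id = x.material_id
    JOIN paper    AS p ON p.paper_id    = x.paper_id
    """
).fetchall()
print(rows)
```

Keeping materials and papers in their own tables means each name or DOI is stored once, and every extracted value points back to its source.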

**Front-End UI**
* The goal is to build a Django-based interactive user interface that allows users to interact with data that is stored in the database.
* Interactions such as downloading data, uploading users’ own data, searching, and filtering should be supported.
* Major databases such as the Polymer Database, the Genome Database, and the Protein Data Bank are good examples.


# Tests

This notebook is used to test multiple metrics in the following list:
1. reader
2. extract_sentences
3. extracted info
4. PCE
5. Jsc
6. Voc
7. EQE / IQE
8. mobilities (hole / electron)
9. Graphs
10. Tables
11. order
12. sentence classifier
13. spincoat

**The most important aspect of using an NLP tool is its classification accuracy.**
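
One way to quantify this, sketched here under the assumption that each sentence carries a binary label (1 = relevant, 0 = not relevant), is to compare predicted labels against hand-assigned ones. The function name `classification_scores` and the label lists are made up for illustration; they are not results from this notebook.

```python
def classification_scores(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels (1 = relevant)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels: hand annotation vs. classifier output for six sentences.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = classification_scores(y_true, y_pred)
print(scores)
```

Precision and recall are reported alongside accuracy because relevant sentences are usually a small fraction of a paper, so accuracy alone can look high even when most of them are missed.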

In [1]:
import logging
import re
import pandas as pd
import urllib
import time
# import feedparser
import chemdataextractor as cde
from chemdataextractor import Document
import chemdataextractor.model as model
from chemdataextractor.model import Compound, UvvisSpectrum, UvvisPeak, BaseModel, StringType, ListType, ModelType
from chemdataextractor.parse.common import hyphen
from chemdataextractor.parse.base import BaseParser
from chemdataextractor.utils import first
from chemdataextractor.parse.actions import strip_stop
from chemdataextractor.parse.elements import W, I, T, R, Optional, ZeroOrMore, OneOrMore
from chemdataextractor.parse.cem import chemical_name
from chemdataextractor.doc import Paragraph, Sentence

CDE supports many formats. It provides readers for ACS, CSSP, ChemSpider, NLM journals, PDF, plain text, RSC, and USPTO patents, as well as a base reader.

## PCE Test

In this test we build a parser to extract PCE (power conversion efficiency) values from the literature.

The test finds that PCE under different names (pce, PCE, power conversion efficiency, etc.) can be detected as long as the text follows the order the parser expects.

The advantage of CDE is that grammar parsers can be customized for different properties. It is also compatible with documents in different formats, especially ACS and RSC, among others.

In [2]:
# open and read files (each file is closed as soon as it has been parsed)
def read_doc(path):
    with open(path, 'rb') as f:
        return Document.from_file(f)

doc = read_doc('test_articles/paper0.pdf')
abstract = [11]

doc1 = read_doc('test_articles/paper1.pdf')
abstract1 = [7,8]

doc2 = read_doc('test_articles/paper2.pdf')
abstract2 = [7,8]

doc3 = read_doc('test_articles/paper3.pdf')
abstract3 = [10]

doc4 = read_doc('test_articles/paper4.pdf')
abstract4 = [12]

doc5 = read_doc('test_articles/paper5.pdf')
abstract5 = [3,4]

doc6 = read_doc('test_articles/paper6.pdf')
abstract6 = [5,6,7,8]

doc7 = read_doc('test_articles/paper7.pdf')
abstract7 = [11]

In [3]:
# split the paragraph into elements
paras = doc.elements
paras

[Paragraph(id=None, references=[], text='Article'),
 Paragraph(id=None, references=[], text='pubs.acs.org/cm'),
 Paragraph(id=None, references=[], text='Interplay of Molecular Orientation, Film Formation, and\nOptoelectronic Properties on Isoindigo- and Thienoisoindigo-Based\nCopolymers for Organic Field Eﬀect Transistor and Organic\nPhotovoltaic Applications\nChien Lu,\nand Pi-Tai Chou*,#\n†\nDepartment of Chemical Engineering, National Taiwan University, Taipei 106, Taiwan\n‡\nResearch Center for New Generation Photovoltaics, Graduate Institute of Energy Engineering, National Central University, Taoyuan\n320, Taiwan\n#Department of Chemistry, National Taiwan University, Taipei 106, Taiwan'),
 Paragraph(id=None, references=[], text='Hsieh-Chih Chen,*,‡,§'),
 Paragraph(id=None, references=[], text='Wen-Chang Chen,*,†'),
 Paragraph(id=None, references=[], text='Wei-Ti Chuang,'),
 Paragraph(id=None, references=[], text='Yen-Hao Hsu,'),
 Paragraph(id=None, references=[], text='†,§'),
 Par

In [4]:
# Chemical entity mentions in the doc
doc.cems

[Span('PBDT', 962, 966),
 Span('methanol', 859, 867),
 Span('Ag', 550, 552),
 Span('21\n34', 0, 5),
 Span('Ag', 294, 296),
 Span('PBDT', 895, 899),
 Span('polymer\nPBDT', 0, 12),
 Span('Huber', 2144, 2149),
 Span('F', 1776, 1777),
 Span('Noh', 66, 69),
 Span('Li', 1066, 1068),
 Span('PBDT', 591, 595),
 Span('IIG', 285, 288),
 Span('BDT', 1669, 1672),
 Span('PBDT', 1012, 1016),
 Span('PBDT', 61, 65),
 Span('PBDT', 500, 504),
 Span('PBDT', 244, 248),
 Span('PBDT', 367, 371),
 Span('TBAP', 1332, 1336),
 Span('IIG', 44, 47),
 Span('MoO3', 243, 247),
 Span('IIG', 69, 72),
 Span('IIG', 341, 344),
 Span('PBDT', 33, 37),
 Span('PBDT', 97, 101),
 Span('PBDT', 337, 341),
 Span('PBDT', 1121, 1125),
 Span('IIG', 46, 49),
 Span('PBDT', 1627, 1631),
 Span('PBDT', 603, 607),
 Span('PBDT', 3323, 3327),
 Span('PBDT', 123, 127),
 Span('(E)-6,6′-dibromo-1,1′-bis(2-oc-\ntyldodecyl)-[3,3′-biindolinylidene]-2,2′\n-dione', 348, 427),
 Span('PCBM', 2330, 2334),
 Span('DIO', 1511, 1514),
 Span('bis(dialkylthie

In [5]:
paras[11]

In [6]:
doc = Document(cde.doc.text.Paragraph(u'A systematic study on the effects of heteroarenes on the solid state structure and optoelectronic properties of isoindigo analogues, namely, PBDT-IIG and PBDT-TIIG, used in solution-processed organic filed effect transistors (OFETs) and organic photovoltaics (OPVs) is reported. We discover that the optical absorption, frontier orbitals, backbone coplanarity, molecular orientation, solubility, ﬁlm morphology, charge carrier mobility, and solar cell performance are critically inﬂuenced by the heteroarenes in the acceptor subunits. PBDT-IIG exhibits good p-type OFET performance with mobility up to 1.03 × 10−1 e = 2.81 × 10−4 cm2 V−1 cm2 V−1 s−1, whereas PBDT-TIIG displays ambipolar mobilities of μ s−1. PBDT-IIG and PBDT-TIIG blended with [6,6]-phenyl-C71-butyric acid methyl ester (PC71BM) yield promising power conversion eﬃciencies (PCEs) of 5.86% and 2.55%, respectively. The excellent mobility of PBDT-IIG can be attributable to the growing fraction of edge-on packing by the interfacial surface treatment. Although PBDT-TIIG could construct a long-range face- on packing alignment to meliorate its photocurrent in OPV applications, the low open-circuit voltage caused by its high-lying HOMO energy level and greater recombination demonstrates the trade-oﬀ between light absorption and solar cell performance. Nevertheless, PBDT-TIIG with a PCE of 2.55% is the highest reported PCE to date for the TIIG-based systems.'))
doc.records.serialize()

[{'names': ['isoindigo']},
 {'names': ['heteroarenes']},
 {'names': ['PBDT-IIG']},
 {'names': ['[6,6]-phenyl-C71-butyric acid methyl ester', 'PC71BM']},
 {'names': ['PBDT']}]

In [7]:
doc_w_cap = Document(cde.doc.text.Paragraph(u'Solar cells containing 1 display PCEs up to 4.73 %. Though devices containing 2 have exceeded PCEs of 15 %,2a–2c their moisture sensitivity remains a concern for large‐scale device fabrication or their long‐term use. The layered structure of 1 aids the formation of high‐quality films that show greater moisture resistance compared to 2. The larger bandgap of 1 also affords a higher VOC value of 1.18 V compared to devices with 2. Further improvements in material structure and device engineering, including making appropriate electronic contact with the anisotropic inorganic sheets, should increase the PCEs of these devices. In particular, higher values of n as single‐phase materials or as mixtures may allow for lower bandgaps and higher carrier mobility in the inorganic layers while the organic layers provide additional tunability. For example, hydrophobic fluorocarbons could increase moisture stability, conjugated organic layers could facilitate charge transport, and organic photosensitizers could improve the absorption properties of the material. We are focused on manipulating this extraordinarily versatile platform through synthetic design.'),
                      cde.doc.text.Caption(u'PXRD patterns of films of (PEA)2(MA)2[Pb3I10] (1), (MA)[PbI3] formed from PbI2 (2 a), and (MA)[PbI3] formed from PbCl2 (2 b), which were exposed to 52 % relative humidity. Annealing of films of 2 a (15 minutes) and 2 b (80 minutes) was conducted at 100 °C prior to humidity exposure. Asterisks denote the major reflections from PbI2.'))

doc_w_cap.records.serialize()

[{'names': ['fluorocarbons']},
 {'names': ['(MA)[PbI3]']},
 {'names': ['PbCl2']},
 {'names': ['(PEA)2(MA)2[Pb3I10]'], 'labels': ['1']},
 {'names': ['PbI2']}]

In [8]:
doc2 = Document(cde.doc.text.Paragraph(u'PCBM and PCTT lead to pce of 5% and pce of 6%'))
doc3 = Document(cde.doc.text.Paragraph(u'I have a pce of 4% and a fill factor of 3%'))

doc3.records.serialize()

[]

From the result above, we see that only one PCE value from the abstract is shown. The combination that can be recognized is one PCE mention + one value + one unit.
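
The "mention + value + unit" ordering can be illustrated without ChemDataExtractor. The sketch below is a plain regular expression, not a CDE parser; the pattern and the function name `find_pce` are assumptions for illustration only.

```python
import re

# A PCE mention (several spellings), a short connector, a number, then "%".
PCE_PATTERN = re.compile(
    r"\b(?:PCEs?|power[\s-]conversion\s+efficienc(?:y|ies))\b"  # mention
    r"(?:\s*\(PCEs?\))?"                                         # optional "(PCE)"
    r"\s*(?:of|up\s+to|=|is|was)?\s*"                            # connector
    r"(\d+(?:\.\d+)?)\s*(%)",                                    # value + unit
    re.IGNORECASE,
)


def find_pce(text):
    """Return (value, unit) pairs for every 'mention + value + unit' match."""
    return [(float(value), unit) for value, unit in PCE_PATTERN.findall(text)]


print(find_pce("PCBM and PCTT lead to pce of 5% and pce of 6%"))
# [(5.0, '%'), (6.0, '%')]
print(find_pce("I have a pce of 4% and a fill factor of 3%"))
# [(4.0, '%')]
print(find_pce("a 15.0% power-conversion efficiency (PCE) was achieved"))
# []
```

The last call returns nothing because the value comes before the mention, which shows how strongly this kind of pattern depends on word order.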

## Extracted Info Test

This test examines whether sentences containing specific metrics (PCE, Jsc, Voc) can be detected and extracted.

The test result shows that the search can locate sentences with the targeted metrics, but only for metrics that have been explicitly defined.

In [10]:
import pandas as pd
test_df = pd.read_csv("test_articles/test.csv",sep="\t").set_index("ID No.")

In [11]:
test_df

PCE = test_df['PCE_max']
Jsc = test_df['Jsc (mA/cm2)']
Voc = test_df['Voc (V)']
FF = test_df['FF']
Mw = test_df['Mw (kg/mol)']
Mn = test_df['Mn (kg/mol)']
PDI = test_df['PDI']

In [12]:
# imported from paper-parser
def quantified_performance_sentence_search(sentence_list, metric='PCE'):
    """Find sentences that contain quantitative information about a metric.

    A sentence is kept when it mentions the metric (PCE, VOC, JSC or FF),
    contains at least one number, and contains one of the metric's units.
    """
    return_sents = []

    if metric == 'PCE':
        metric_patterns = ['PCE']
        units_patterns = ['%', 'percent']
    elif metric == "VOC":
        metric_patterns = ['VOC']
        units_patterns = [r'V\w', 'volts']
    elif metric == "JSC":
        metric_patterns = ['JSC']
        units_patterns = [r'A\w', 'amps']
    elif metric == "FF":
        metric_patterns = ['FF']
        units_patterns = ['%', 'percent']
    else:
        raise ValueError('{} is not a valid performance metric'.format(metric))
    for sent in sentence_list:
        for metric_pattern in metric_patterns:
            # Check for the metric name (case-insensitive substring match,
            # so 'PCE' also matches inside 'IPCE')
            metric_found = re.search(metric_pattern, sent, re.IGNORECASE)
            if metric_found:
                # Check for numbers
                numbers_found = re.search(r'\d+', sent)
                if not numbers_found:
                    # Stop looking at sentence
                    break
                # Check for units; stop at the first unit pattern that matches
                units_found = None
                for units_pattern in units_patterns:
                    units_found = re.search(units_pattern, sent)
                    if units_found:
                        break
                # If no unit was found, the sentence is thrown away
                if units_found:
                    return_sents.append(sent)
                    break
    return return_sents

In [13]:
test_sentences = ['For example, when MAPbI3 was loaded on a mesoporous (mp)-TiO2 electrode by the sequential deposition of PbI2 and methylammonium iodide (MAI), a 15.0% power-conversion efficiency (PCE) was achieved under 1 sun illumination11.',
 'The Jsc, Voc and FF values obtained from the I–V curve of the reverse scan were 19.2 mA cm−2, 1.09 V and 0.69, respectively, yielding a PCE of 14.4% under standard AM 1.5 conditions.',
 'The average values from the J–V curves from the reverse and forward scans (Fig.\xa05a) exhibited a Jsc of 19.58 mA cm−2, Voc of 1.105 V, and FF of 76.2%, corresponding to a PCE of 16.5% under standard AM 1.5 G conditions.',
 'The best device also showed a very broad IPCE plateau of over 80% between 420 and 700 nm, as shown in Fig.\xa05b.',
 'One of these devices was certified by the standardized method in a photovoltaics calibration laboratory, confirming a PCE of 16.2% under AM 1.5 G full sun (Supplementary Fig.\xa06).',
 'In summary, we developed a solvent-engineering technology for the deposition of extremely uniform perovskite layers, and demonstrated a solution-processed perovskite solar cell with 16.5% PCE under standard conditions (AM 1.5 G radiation, 100 mW cm−2).',
 'a solar cell with a fill factor(FF) of 75% can be acheived after improving its morphology.',
                  'I have a big baby',
 'my dream is to make a lot of money']

In [14]:
# only sentences containing metrics (PCE, Jsc, Voc, etc.) can be seen
quantified_performance_sentence_search(test_sentences)

['For example, when MAPbI3 was loaded on a mesoporous (mp)-TiO2 electrode by the sequential deposition of PbI2 and methylammonium iodide (MAI), a 15.0% power-conversion efficiency (PCE) was achieved under 1 sun illumination11.',
 'The Jsc, Voc and FF values obtained from the I–V curve of the reverse scan were 19.2 mA cm−2, 1.09 V and 0.69, respectively, yielding a PCE of 14.4% under standard AM 1.5 conditions.',
 'The average values from the J–V curves from the reverse and forward scans (Fig.\xa05a) exhibited a Jsc of 19.58 mA cm−2, Voc of 1.105 V, and FF of 76.2%, corresponding to a PCE of 16.5% under standard AM 1.5 G conditions.',
 'The best device also showed a very broad IPCE plateau of over 80% between 420 and 700 nm, as shown in Fig.\xa05b.',
 'One of these devices was certified by the standardized method in a photovoltaics calibration laboratory, confirming a PCE of 16.2% under AM 1.5 G full sun (Supplementary Fig.\xa06).',
 'In summary, we developed a solvent-engineering techn
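
Note that the returned list includes the IPCE sentence: the metric pattern `PCE` is matched as a substring, so it also matches inside `IPCE`. One possible refinement, sketched below under the assumption that whole-word matching is wanted, is to wrap the metric name in `\b` word boundaries. The helper name `mentions_metric` is hypothetical and the two sentences are shortened from the test list.

```python
import re


def mentions_metric(sentence, metric="PCE"):
    """Return True if `metric` appears as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(metric) + r"\b"
    return re.search(pattern, sentence, re.IGNORECASE) is not None


ipce_sentence = "The best device also showed a very broad IPCE plateau of over 80% between 420 and 700 nm."
pce_sentence = "One of these devices was certified, confirming a PCE of 16.2% under AM 1.5 G full sun."

print(mentions_metric(ipce_sentence))  # False
print(mentions_metric(pce_sentence))   # True
```

The trade-off is that a strict word boundary also rejects the plural `PCEs`, so the pattern would need an optional trailing `s` if plurals should count.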

## Scraping Test

In this case we test whether our arXiv scraper works. We also try to run it on other platforms such as Elsevier, Springer Nature, and the Royal Society of Chemistry.

In [1]:
import time
import urllib.request

import feedparser
import pandas as pd


def arx_scrape(search_term, start_idx, scope='ti'):
    '''Collect date, article ID and abstract for arXiv entries matching search_term.

    Uses urllib, time, feedparser and pandas. `scope` is the arXiv field
    prefix, for example 'ti' (title) or 'all' (all fields).
    '''
    # first escape search terms
    search_term = search_term.replace('"', "%22").replace(" ", "+")
    # set wait time (seconds) and iteration step (results per request)
    iterstep = 200
    wait_time = 2
    base_url = 'http://export.arxiv.org/api/query?'
    # els_base_url = 'http://api.elsevier.com/content/article/doi/'
    start = start_idx
    date_dict = {
        "date": [],
        "article_id": [],
        "summary": [],
        "source": "arXiv"
    }

    while True:
        response = urllib.request.urlopen(base_url+f"search_query={scope}:{search_term}&sortBy=submittedDate&sortOrder=ascending&start={start}&max_results={iterstep}")
        feed = feedparser.parse(response)
        # an empty page means there is nothing more to fetch
        if not feed.entries:
            print('query complete')
            print(f"There should be {feed.feed.opensearch_totalresults} results?")
            break
        date_dict['date'].extend([entry.published for entry in feed.entries])
        date_dict['article_id'].extend([entry.id.split('/abs/')[-1] for entry in feed.entries])
        date_dict['summary'].extend([entry.summary.replace("\n", " ") for entry in feed.entries])
        print(f"gathering results {start} to {start + iterstep-1} ")
        start = start + iterstep
        # pause between requests
        time.sleep(wait_time)

    return pd.DataFrame(date_dict)

In [2]:
# collect all entries matching all:organic photovoltaics from arXiv
# (the search term is the first argument; scope is the field prefix)
query = 'organic photovoltaics OR organic solar cell'
a1 = arx_scrape(query, 0, scope='all')
a2 = arx_scrape(query, 10000, scope='all')
a3 = arx_scrape(query, 12200, scope='all')
a4 = arx_scrape(query, 14400, scope='all')

In [None]:
all_a = pd.concat([a1, a2, a3, a4], ignore_index=True)
all_a.head()
all_a['date'] = pd.DatetimeIndex(all_a['date']).normalize()
all_a.to_csv('./els.csv', index=False)

In [None]:
a = pd.read_csv('./arxiv_physics.chem_ph.csv')
corpus = a['summary']
corpus = list(corpus)

In [None]:
corpus

From arXiv we have scraped 14,800 abstracts, together with the date and article ID of each. Abstracts usually contain the most concise information about the entire paper. Thus, the main trend is to focus property extraction on abstracts.

The following are the Elsevier and Springer Nature API keys. To design a useful scraper, we need to design multiple layers of filtering so that only the right articles are scraped. For example, searching only for "organic photovoltaics" or "perovskite solar cell" can still be too broad; more detailed terms can be added to narrow the query.
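
A minimal sketch of such layered filtering, assuming each layer is a list of keywords and an abstract must match at least one keyword from every layer. The function name `layered_filter`, the keyword lists, and the sample abstracts are all invented for illustration.

```python
def layered_filter(abstracts, layers):
    """Keep abstracts that contain at least one keyword from every layer."""
    kept = []
    for text in abstracts:
        lowered = text.lower()
        # every layer must be satisfied by at least one of its keywords
        if all(any(kw.lower() in lowered for kw in layer) for layer in layers):
            kept.append(text)
    return kept


# Layer 1 narrows the topic; layer 2 requires a mention of the target property.
layers = [
    ["organic photovoltaic", "organic solar cell", "perovskite solar cell"],
    ["power conversion efficiency", "pce"],
]

# Made-up abstracts, not taken from the scraped corpus.
abstracts = [
    "We report an organic solar cell with a power conversion efficiency of 10%.",
    "Organic photovoltaics are reviewed from a historical perspective.",
    "A perovskite solar cell reaches a PCE of 16.5%.",
    "Thermal conductivity of liquid metals is measured.",
]

filtered = layered_filter(abstracts, layers)
print(filtered)
```

More layers (for example a material class or a processing method) can be appended to `layers` to narrow the selection further.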

In [18]:
els_key = "4bc84cbdadca6050062348015ac963aa"
sn_key = "eca22bc7a0b1ee3153ab02c024a6a06e"
folder = "testing_download_articles/"

In [19]:
# For Scopus, only documents whose DOI is known can be collected, so the
# scraper needs multiple layers of filtering.
els_url = ('https://api.elsevier.com/content/article/doi/10.1016/j.mattod.2014.07.007'
           '?APIKey=' + els_key)
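
Because retrieval is driven by DOI, the same URL form can be generated for a whole list of DOIs once an earlier filtering step has produced them. The sketch below only builds the URLs and sends no requests; the helper name `build_els_urls` and the `YOUR_API_KEY` placeholder are assumptions for illustration.

```python
from urllib.parse import quote

# Base URL copied from the cell above.
ELS_ARTICLE_BASE = "https://api.elsevier.com/content/article/doi/"


def build_els_urls(dois, api_key):
    """Return one article-retrieval URL per DOI, in the form used above."""
    return [
        f"{ELS_ARTICLE_BASE}{quote(doi, safe='/')}?APIKey={quote(api_key)}"
        for doi in dois
    ]


urls = build_els_urls(["10.1016/j.mattod.2014.07.007"], "YOUR_API_KEY")
print(urls[0])
```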

In [20]:
import requests
r = requests.get(els_url)
with open(folder + 'write_test_els_paper1.html', 'wb') as file:
    file.write(r.content)

In [21]:
f = open('testing_download_articles/write_test_els_paper1.html', 'rb')
doc = Document.from_file(f)

In [22]:
doc