# Scraping Documents From Websites

### Methods:

The scientific articles used in this work are journal publications from Springer, Wiley, Elsevier, the Royal Society of Chemistry, and the Electrochemical Society, each of which granted us permission to download large numbers of articles. For each publisher, we manually identified all materials-science-related journals available for download. A web scraping engine was built using Scrapy. Only full-text articles published after 2000 were downloaded, together with metadata such as journal name, article title, abstract, and authors.

All data were stored in a document-oriented database implemented as a MongoDB instance. Because the downloaded articles are in HTML/XML format, which contains irrelevant markup and stylesheets, we developed a custom library that parses article markup strings into text paragraphs while preserving the structure of the paper and its section headings. The current snapshot of the database contains XXX papers, from which we used XXX paragraphs in the experimental sections to conduct this research. The experimental sections were identified by case-insensitive keyword matching on section headings, using keywords such as "experiment" and "synthesis" and their morphological derivations.
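
The heading-based selection of experimental sections can be sketched as follows. This is a minimal illustration, not the library used in this work: the names `EXPERIMENTAL_HEADING` and `is_experimental_heading` are hypothetical, and the pattern assumes that only the two keywords named above are used.

```python
import re

# Stems chosen so that morphological variants also match,
# e.g. "Experimental", "Experiments", "Synthesis", "Synthetic", "Synthesized".
# Assumption: only the two keywords named in the text are used.
EXPERIMENTAL_HEADING = re.compile(r"\b(experiment|synthe)\w*", re.IGNORECASE)


def is_experimental_heading(heading: str) -> bool:
    """Return True if a section heading looks like an experimental section."""
    return EXPERIMENTAL_HEADING.search(heading) is not None
```

Matching on word stems rather than whole words is what lets one pattern cover the derived forms; a real heading list would probably need more keywords.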

In [3]:
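import pandas as pd  # used by arx_scrape to build the result table
import re
import urllib.request  # arx_scrape calls urllib.request.urlopen
import time
import feedparser  # used by arx_scrape to parse the arXiv Atom feed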

# import chemdataextractor
# from chemdataextractor import Document
# from chemdataextractor.reader import NlmXmlReader
# from chemdataextractor.reader import XmlReader
# from chemdataextractor.reader import PlainTextReader
import numpy as np
import requests

In [4]:
els_key = "4bc84cbdadca6050062348015ac963aa"
sn_key = "eca22bc7a0b1ee3153ab02c024a6a06e"
folder = "testing_download_articles/"

## Search API

On every platform, we first search for articles by topic of interest and then retrieve the full text through a separate API. Refer to https://dev.elsevier.com/documentation/FullTextRetrievalAPI.wadl
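
The two-step workflow can be sketched by building the two request URLs. This is an illustration under assumptions: the endpoints and key parameter names are the ones used elsewhere in this notebook, while the function names and the default `count` and `start` values are hypothetical. No request is sent.

```python
from urllib.parse import quote, urlencode

# Endpoints as used elsewhere in this notebook.
SEARCH_ENDPOINT = "https://api.elsevier.com/content/search/scopus"
ARTICLE_ENDPOINT = "https://api.elsevier.com/content/article/doi/"


def build_search_url(query: str, api_key: str, count: int = 10, start: int = 0) -> str:
    """Step 1: URL for a topic search; urlencode escapes spaces and '&'."""
    params = {"query": query, "count": count, "start": start, "apiKey": api_key}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def build_fulltext_url(doi: str, api_key: str) -> str:
    """Step 2: URL for retrieving the full text of one DOI found in step 1."""
    return ARTICLE_ENDPOINT + quote(doi, safe="/") + "?" + urlencode({"APIKey": api_key})
```

Letting `urlencode` handle escaping avoids hand-written sequences such as `%26` in the query string.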

In [3]:
def arx_scrape(search_term, start_idx, scope='ti'):
    '''Query the arXiv API and return dates, article IDs, and abstracts as a DataFrame.

    Requires urllib.request, time, feedparser, and pandas (imported as pd).
    '''
    # first escape the search term
    search_term = search_term.replace('"', "%22").replace(" ", "+")
    # set wait time (seconds) and number of results per request
    iterstep = 200
    wait_time = 2
    base_url = 'http://export.arxiv.org/api/query?'
    start = start_idx
    date_dict = {
        "date": [],
        "article_id": [],
        "summary": [],
        "source": "arXiv"
    }

    while True:
        # request one page of results, oldest submissions first
        response = urllib.request.urlopen(base_url + f"search_query={scope}:{search_term}&sortBy=submittedDate&sortOrder=ascending&start={start}&max_results={iterstep}")
        feed = feedparser.parse(response)
        # an empty page means all results have been collected
        if not feed.entries:
            print('query complete')
            print(f"There should be {feed.feed.opensearch_totalresults} results?")
            break
        date_dict['date'].extend([entry.published for entry in feed.entries])
        date_dict['article_id'].extend([entry.id.split('/abs/')[-1] for entry in feed.entries])
        date_dict['summary'].extend([entry.summary.replace("\n", " ") for entry in feed.entries])
        print(f"gathering results {start} to {start + iterstep-1} ")
        start = start + iterstep
        # pause between requests to avoid overloading the API
        time.sleep(wait_time)

    return pd.DataFrame(date_dict)

In [4]:
# # collect all results for "organic photovoltaics OR organic solar cell" in all fields from arXiv
# a1 = arx_scrape('organic photovoltaics OR organic solar cell', 0, scope='all')
# a2 = arx_scrape('organic photovoltaics OR organic solar cell', 10000, scope='all')
# a3 = arx_scrape('organic photovoltaics OR organic solar cell', 12200, scope='all')
# a4 = arx_scrape('organic photovoltaics OR organic solar cell', 14400, scope='all')

In [5]:
# all_a = pd.concat([a1, a2, a3, a4], ignore_index=True)
# all_a.head()
# all_a['date']= pd.DatetimeIndex(all_a['date']).normalize()
# all_a.to_csv('./els.csv', index=False)

In [6]:
# a = pd.read_csv('./arxiv_physics.chem_ph.csv')
# corpus = a['summary']
# corpus = list(corpus)

In [7]:
# corpus

From arXiv we scraped 14800 abstracts, together with the date and article ID of each. Abstracts usually contain the most concise summary of a paper, so property extraction commonly focuses on abstracts.

The Elsevier and Springer Nature API keys are defined above (`els_key` and `sn_key`). To design a useful scraper, we need multiple layers of filtering so that only the right articles are scraped. For example, a query for only "organic photovoltaics" or "perovskite solar cell" can still be too broad, so more specific terms can be added to it.
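
One way to picture the layered filtering is a keyword filter applied to each title or abstract after the search. The sketch below is an assumption about how such layers could be arranged, not the filter used in this work; `passes_filters` and its parameter names are hypothetical.

```python
def passes_filters(text, required_any, required_all=(), excluded=()):
    """Layered, case-insensitive keyword filter for a title or abstract.

    Layer 1: at least one broad topic phrase must appear.
    Layer 2: every narrowing term must appear.
    Layer 3: no excluded term may appear.
    """
    lowered = text.lower()
    if not any(term.lower() in lowered for term in required_any):
        return False
    if not all(term.lower() in lowered for term in required_all):
        return False
    return not any(term.lower() in lowered for term in excluded)
```

For example, calling it with `required_any=["organic photovoltaics", "perovskite solar cell"]` and `required_all=["efficiency"]` keeps only records that mention one of the topics and also mention efficiency.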

## Full Text Retrieval API

This section uses the full-text retrieval API to retrieve full-article information and save it either to a database (production) or to local folders (development).

In [14]:
import requests

folder = "testing_download_articles"
# count sets how many results are requested per call (100 here)
els_url = "https://api.elsevier.com/content/search/scopus?query=organic%26photovoltaics&count=100&start=1&apiKey=4bc84cbdadca6050062348015ac963aa"
r = requests.get(els_url)

with open(folder + '/write_test_els_paper2.json', 'wb') as file:
    file.write(r.content)

In [15]:
import json
from pprint import pprint
file = "testing_download_articles/write_test_els_paper2.json"

# load the saved search response as a Python dictionary
with open(file) as f:
    data = json.load(f)

pprint(data)

{'service-error': {'status': {'statusCode': 'INVALID_INPUT',
                              'statusText': 'Exceeds the maximum number '
                                            'allowed for the service level'}}}
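
The response printed above is an error payload rather than a list of results, so indexing `data['search-results']` on it would raise a `KeyError`. The message suggests that the requested `count` is larger than the key's service level allows. A minimal sketch of checking for this case first is shown here; `get_entries` is a hypothetical helper, and it only handles the error layout printed above.

```python
def get_entries(data):
    """Return the search entries, or raise if the API reported an error.

    Assumes errors have the 'service-error' layout printed above;
    other error shapes are not handled.
    """
    if "service-error" in data:
        status = data["service-error"]["status"]
        raise RuntimeError(f"{status['statusCode']}: {status['statusText']}")
    return data["search-results"]["entry"]
```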


In [10]:
ls = []
ls2 = []
for i in data['search-results']['entry']:
    # print(i)
    if 'prism:doi' in i:
        # print(i['prism:doi'])
        ls.append(i['prism:doi'])
        ls2.append(i['dc:title'])
print(ls)
print(ls2)

['10.1038/s42004-020-0256-7', '10.1038/s41467-020-15215-x', '10.1038/s41467-020-15078-2', '10.1038/s41598-020-61768-8', '10.1038/s41467-019-13909-5', '10.1007/s12034-019-2020-0', '10.1038/s41598-020-58310-1', '10.1038/s41598-020-61282-x', '10.1038/s41467-019-13437-2', '10.1038/s41467-020-14401-1', '10.1038/s42005-020-0313-7', '10.1007/s12034-019-2002-2', '10.1038/s41598-020-61602-1', '10.1038/s41467-020-14986-7', '10.1038/s41467-020-14661-x', '10.1038/s41427-020-0202-2', '10.1038/s41427-020-0198-7', '10.1038/s41467-019-14237-4', '10.1038/s41467-019-13908-6', '10.1038/s41377-020-0264-5']
['Mapping the optoelectronic property space of small aromatic molecules', 'Molecular vibrations reduce the maximum achievable photovoltage in organic solar cells', 'Ultra-high open-circuit voltage of tin perovskite solar cells via an electron transporting layer design', 'Origin of Rashba Spin-Orbit Coupling in 2D and 3D Lead Iodide Perovskites', 'Highly efficient all-inorganic perovskite solar cells wit

In [11]:
def search_articles(file):
    """
    Read a saved search API response and return the DOIs and titles it lists.
    input : JSON file location
    output : (dois, names), two lists in matching order
    
    """
    names = []
    dois = []
    
    # read json as dictionary in python
    with open(file) as f:
        data = json.load(f)
    
    for i in data['search-results']['entry']:
        if 'prism:doi' in i:
            dois.append(i['prism:doi'])
            names.append(i['dc:title'])
    return dois, names

In [13]:
file = "testing_download_articles/write_test_els_paper2.json"

search_articles(file)

(['10.1016/j.cej.2019.122813',
  '10.1016/j.jhazmat.2019.121260',
  '10.1016/j.jhazmat.2019.121275',
  '10.1016/j.cej.2019.122464',
  '10.1016/j.amc.2019.124780',
  '10.1016/j.renene.2019.07.038',
  '10.1016/j.renene.2019.07.018',
  '10.1016/j.dyepig.2019.107927',
  '10.1016/j.dyepig.2019.107925',
  '10.1016/j.renene.2019.08.070',
  '10.1016/j.dyepig.2019.107890',
  '10.1016/j.dyepig.2019.107887',
  '10.1016/j.renene.2019.07.028',
  '10.1016/j.dyepig.2019.107891',
  '10.1016/j.jechem.2019.04.019',
  '10.1016/j.dyepig.2019.107840',
  '10.1016/j.dyepig.2019.107880',
  '10.1016/j.renene.2019.08.094',
  '10.1016/j.dyepig.2019.107921',
  '10.1016/j.dyepig.2019.107881'],
 ['The flexible-transparent p-n junction film device of N-doped Cu<inf>2</inf>O/SnO<inf>2</inf> orderly nanowire arrays towards highly photovoltaic conversion and stability',
  'Improving the flame retardancy of poly(lactic acid) using an efficient ternary hybrid flame retardant by dual modification of graphene oxide with ph

## Merge Them Together

In this case we can combine the search and retrieval steps: each DOI returned by the search is passed to the retrieval API.

In [12]:
import requests

In [13]:
# pull the abstract for each DOI and save each response to its own file
for num, doi in enumerate(ls):
    els_url = 'https://api.elsevier.com/content/abstract/doi/' + doi + '?APIKey=' + els_key
    r = requests.get(els_url)
    with open(folder + f'/abstract{num}.xml', 'wb') as file:
        file.write(r.content)

In [16]:
class search_and_pull:
    def __init__(self):
        self.els_key = "4bc84cbdadca6050062348015ac963aa"
        self.file = "testing_download_articles/write_test_els_paper2.json"
        self.folder = "testing_download_articles/"

    def search_articles(self, file=None):
        """
        Read a saved search API response and return the DOIs it lists.
        input : JSON file location (defaults to self.file)
        """
        dois = []

        # read json as dictionary in python
        file = file or self.file
        with open(file) as f:
            data = json.load(f)

        for i in data['search-results']['entry']:
            if 'prism:doi' in i:
                dois.append(i['prism:doi'])
        return dois

    def pull_articles(self, ls=None):
        """
        This function writes one XML file per scraped document.
        input : list of DOIs (defaults to the DOIs found by search_articles)
        """
        dois = ls if ls is not None else self.search_articles()

        # request each article once and save it under its own index
        for num, doi in enumerate(dois):
            els_url = 'https://api.elsevier.com/content/article/doi/' + doi + '?APIKey=' + self.els_key
            r = requests.get(els_url)
            with open(self.folder + f'write_test_els_paper{num}.xml', 'wb') as file:
                file.write(r.content)

On the university network this combination of APIs works because UW is a subscriber. Next, we will customize the parser to make it automated and more powerful.
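
As a starting point for the parser, the sketch below turns an XML string into (heading, paragraphs) pairs using only the standard library. It is a simplified illustration, not the parser used in this work: `extract_sections` and `local_name` are hypothetical names, the tag names `section-title` and `para` are assumptions that would need to be checked against each publisher's schema, and nested paragraph elements are not handled.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace, e.g. '{uri}para' -> 'para'."""
    return tag.rsplit("}", 1)[-1]


def extract_sections(xml_text, heading_tag="section-title", para_tag="para"):
    """Return a list of (heading, [paragraph, ...]) pairs in document order."""
    sections = []
    current = ("", [])  # holds paragraphs that appear before the first heading
    for elem in ET.fromstring(xml_text).iter():
        name = local_name(elem.tag)
        if name == heading_tag:
            # a new heading closes the section collected so far
            if current[0] or current[1]:
                sections.append(current)
            current = ("".join(elem.itertext()).strip(), [])
        elif name == para_tag:
            # drop inline markup and collapse whitespace
            current[1].append(" ".join("".join(elem.itertext()).split()))
    if current[0] or current[1]:
        sections.append(current)
    return sections
```

Keeping the headings next to their paragraphs is what later allows sections to be selected by heading keywords.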

The following are some of the platforms we have access to:
1. Elsevier
2. Springer
3. ACS
4. RSC

They are the major publishers of the main journals in OPV and related fields, so these four are sufficient here.

MongoDB is a document-oriented NoSQL database, which makes it well suited to storing search results. The tabulated data can either be output directly or stored in a SQL database such as MySQL.
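
A minimal sketch of how one article could be shaped and stored as a single document is given here. The field names and the two helper functions are assumptions made for illustration, not the schema used in this work. `store_article` only requires an object with a `replace_one` method, which is what a pymongo `Collection` provides.

```python
def make_article_document(doi, title, publisher, sections):
    """Build one document per article; the field names are assumptions."""
    return {
        "_id": doi,  # using the DOI as the key avoids duplicate articles
        "title": title,
        "publisher": publisher,
        "sections": [{"heading": h, "paragraphs": list(p)} for h, p in sections],
    }


def store_article(collection, document):
    """Insert the article, or overwrite it if the DOI is already stored.

    `collection` is expected to behave like a pymongo Collection.
    """
    collection.replace_one({"_id": document["_id"]}, document, upsert=True)
```

With pymongo installed and a MongoDB server running, `collection` could be obtained with something like `MongoClient()["papers"]["articles"]`, where the database and collection names are placeholders.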