# Scraping Documents From Websites

### Methods:

Scientific articles used in this work are journal publications from Springer, Wiley, Elsevier, the Royal Society of Chemistry, and the Electrochemical Society, from which we received permission to download large numbers of articles. For each publisher, we manually identified all materials-science-related journals available for download. A web-scraping engine was built using Scrapy. Only full-text articles published after 2000 were downloaded, together with metadata such as journal name, article title, abstract, and authors.

All data were stored in a document-oriented database implemented as a MongoDB instance. Because the downloaded articles are in HTML/XML format, which contains irrelevant markup and stylesheets, we developed a custom library that parses article markup strings into text paragraphs while preserving the structure of the paper and its section headings. The current snapshot of the database contains XXX papers, from which we used XXX paragraphs in the experimental sections of each paper to conduct this research. The experimental sections were identified by case-insensitive keyword matching on section headings, using keywords such as "experiment" and "synthesis" and their morphological derivations.
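
The heading match described above can be sketched as follows. This is a minimal illustration, not the customized library itself: the regular expression, the stem list, and the helper name `is_experimental_heading` are assumptions, and the sample headings are made up.

```python
import re

# Assumed keyword stems; the text names "experiment" and "synthesis".
# The trailing \w* accepts morphological derivations such as
# "Experimental" or "Syntheses".
EXPERIMENTAL_HEADING = re.compile(r"\b(experiment|synthes)\w*", re.IGNORECASE)


def is_experimental_heading(heading):
    """Return True if a section heading looks like an experimental section."""
    return bool(EXPERIMENTAL_HEADING.search(heading))


# Made-up section headings for illustration only.
headings = [
    "1. Introduction",
    "2. Experimental Section",
    "3. Results and Discussion",
    "Synthesis of Precursors",
]
experimental = [h for h in headings if is_experimental_heading(h)]
print(experimental)
```

A stem-based pattern is only one possible reading of "morphological derivations"; a stemmer or an explicit keyword list would serve equally well.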

In [65]:
import pandas as pd
import re
import urllib.request
import time
import feedparser

import chemdataextractor
from chemdataextractor import Document
from chemdataextractor.reader import NlmXmlReader
from chemdataextractor.reader import XmlReader
from chemdataextractor.reader import PlainTextReader
import numpy as np
import requests
%matplotlib inline

In [61]:
def arx_scrape(search_term, start_idx, scope='ti'):
    '''Query the arXiv API page by page (uses urllib, time, and feedparser).'''
    # first escape the search term for use in a URL
    search_term = search_term.replace('"', "%22").replace(" ", "+")
    # set wait time (seconds) and iteration step (results per request)
    iterstep = 200
    wait_time = 2
    base_url = 'http://export.arxiv.org/api/query?'
    start = start_idx
    date_dict = {
        "date": [],
        "article_id": [],
        "summary": [],
        "source": "arXiv"
    }

    while True:
        response = urllib.request.urlopen(base_url + f"search_query={scope}:{search_term}&sortBy=submittedDate&sortOrder=ascending&start={start}&max_results={iterstep}")
        feed = feedparser.parse(response)
        # an empty page means there are no more results to fetch
        if not feed.entries:
            print('query complete')
            print(f"There should be {feed.feed.opensearch_totalresults} results?")
            break
        date_dict['date'].extend([entry.published for entry in feed.entries])
        date_dict['article_id'].extend([entry.id.split('/abs/')[-1] for entry in feed.entries])
        date_dict['summary'].extend([entry.summary.replace("\n", " ") for entry in feed.entries])
        print(f"gathering results {start} to {start + iterstep-1} ")
        start = start + iterstep
        # pause between requests to avoid overloading the API
        time.sleep(wait_time)

    return pd.DataFrame(date_dict)
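
The manual escaping of quotes and spaces in the query can also be delegated to the standard library. The sketch below is one possible alternative, not what the notebook uses: `build_arxiv_url` is a hypothetical helper name, and the parameter names are the ones already present in the query string above.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query?"


def build_arxiv_url(search_query, start=0, max_results=200):
    """Build the URL for one page of results; urlencode percent-escapes
    quotes and colons and turns spaces into '+'."""
    params = {
        "search_query": search_query,
        "sortBy": "submittedDate",
        "sortOrder": "ascending",
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_API + urlencode(params)


url = build_arxiv_url('all:"organic photovoltaics"', start=0, max_results=200)
print(url)
```

Building the URL in one place keeps the paging loop free of string concatenation and makes the query easy to inspect before any request is sent.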

In [59]:
# collect all organic photovoltaics / organic solar cell entries from arXiv
# search_term carries the query text; scope is the arXiv field prefix applied to the first phrase.
# Passing them the other way round yields a malformed query (HTTPError: HTTP Error 400: BAD_REQUEST).
query = '"organic photovoltaics" OR all:"organic solar cell"'
a1 = arx_scrape(query, 0, scope='all')
a2 = arx_scrape(query, 10000, scope='all')
a3 = arx_scrape(query, 12200, scope='all')
a4 = arx_scrape(query, 14400, scope='all')

In [20]:
all_a = pd.concat([a1, a2, a3, a4], ignore_index=True)
all_a.head()
all_a['date'] = pd.DatetimeIndex(all_a['date']).normalize()
all_a.to_csv('./els.csv', index=False)

In [63]:
a = pd.read_csv('./arxiv_physics.chem_ph.csv')
corpus = a['summary']
corpus = list(corpus)

In [64]:
corpus

['Dynamical systems of a new kind are described, which are motivated by the problem of constructing diffeomorphism invariant quantum theories. These are based on the extremization of a non-local and non-additive quantity that we call the variety of a system. In these systems all dynaqmical variables refer to relative coordinates or, more generally, describe relations between particles, so that they are invariant under discrete analogues of diffeomorphisms in which the labels of all particles are permutted arbitrarily. The variety is a measures of how uniquely each of the elements of the system can be distinguished from the others in terms of the values of these relative coordinates. Thus a system with extremal variety is one in which the parts are related to the whole in as distinct a way as possible.   We study numerically several dynamical systems which are defined by setting the action of the system equal to its variety. We find evidence that suggests that such systems may serve as 

From arXiv we scraped 14800 abstracts, together with the date and article ID of each. Abstracts usually contain the most concise summary of a paper, so the property extraction here focuses mainly on abstracts.

Below are the API keys for Elsevier and Springer Nature. To build a useful scraper, we need to design multiple layers of filtering so that only the right articles are scraped. For example, searching only for "organic photovoltaics" or "perovskite solar cell" can still be too broad; more specific terms can be added to narrow the query.
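
One way to picture the layered filtering is as a chain of keyword checks, where an abstract is kept only if it passes every layer. The sketch below is an assumption about how such filtering could look, not the scraper's actual design: the function names, the second-layer phrases, and the sample abstracts are all made up for illustration.

```python
import re


def matches_any(text, phrases):
    """Case-insensitive check that at least one phrase occurs in the text."""
    return any(re.search(re.escape(p), text, re.IGNORECASE) for p in phrases)


def layered_filter(abstracts, layers):
    """Keep abstracts that pass every layer; each layer is a list of phrases."""
    return [a for a in abstracts if all(matches_any(a, layer) for layer in layers)]


# Hypothetical layers: a broad topic layer, then a narrower property layer.
layers = [
    ["organic photovoltaic", "perovskite solar cell"],
    ["power conversion efficiency", "open-circuit voltage"],
]

# Made-up abstracts, for illustration only.
abstracts = [
    "We report an organic photovoltaic device and its power conversion efficiency.",
    "Perovskite solar cells are reviewed from a historical perspective.",
    "Open-circuit voltage losses in silicon diodes are analysed.",
]

selected = layered_filter(abstracts, layers)
print(selected)
```

Only the first abstract passes both layers: the second matches the topic layer but names no property, and the third names a property outside the topic.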

In [44]:
els_key = "4bc84cbdadca6050062348015ac963aa"
sn_key = "eca22bc7a0b1ee3153ab02c024a6a06e"
folder = "testing_download_articles/"

In [52]:
# For Scopus, only documents whose DOI is known can be collected, so the scraper
# needs multiple layers of filtering.
els_url = 'https://api.elsevier.com/content/article/doi/10.1016/j.mattod.2014.07.007\
?APIKey=' + els_key
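
Since retrieval works one DOI at a time, the URL construction can be wrapped in a small helper and applied to a list of DOIs. This is a sketch under assumptions: `build_els_article_url` and the `ELS_API_KEY` environment variable are hypothetical names, while the URL pattern is the one used in the cell above.

```python
import os
from urllib.parse import quote

ELS_ARTICLE_API = "https://api.elsevier.com/content/article/doi/"


def build_els_article_url(doi, api_key):
    """Return the article retrieval URL for one DOI."""
    # Keep "/" unescaped because the DOI is part of the URL path.
    return f"{ELS_ARTICLE_API}{quote(doi, safe='/')}?APIKey={api_key}"


# ELS_API_KEY is a hypothetical environment variable name; reading the key
# from the environment keeps it out of the notebook itself.
api_key = os.environ.get("ELS_API_KEY", "YOUR_KEY")
dois = ["10.1016/j.mattod.2014.07.007"]
urls = [build_els_article_url(d, api_key) for d in dois]
print(urls[0])
```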

In [53]:
import requests
r = requests.get(els_url)
with open(folder + 'write_test_els_paper1.html', 'wb') as file:
    file.write(r.content)

In [54]:
# open the downloaded article in a context manager so the file handle is closed afterwards
with open('testing_download_articles/write_test_els_paper1.html', 'rb') as f:
    doc = Document.from_file(f)

In [55]:
doc

In [None]:
def scrape(platform, apikey, search_term, iterstep=200, start=0):
    """
    This function is used to scrape article metadata from platforms, page by page
    :platform "arXiv", "Elsevier" (Scopus search), "Springer", "ACS", or "RSC"
    :apikey normal API key (not used for arXiv)
    :search_term query string in the platform's own search syntax
    :iterstep number of results requested per page
    :start index of the first result to fetch
    """
    # escape the search term for use in a URL
    search_term = search_term.replace('"', "%22").replace(" ", "+")
    wait_time = 2  # seconds to wait between requests
    date_dict = {
        "date": [],
        "article_id": [],
        "summary": [],
        "source": platform
    }

    if platform == "Elsevier":
        url = "https://api.elsevier.com/content/search/scopus?query={term}&count={count}&start={start}&apiKey={key}"
    elif platform == "arXiv":
        url = "http://export.arxiv.org/api/query?search_query={term}&sortBy=submittedDate&sortOrder=ascending&start={start}&max_results={count}"
    else:
        # Springer, ACS, and RSC URL templates have not been filled in yet
        raise NotImplementedError(f"no URL template for platform {platform!r}")

    while True:
        # fill in the template for the current page
        page_url = url.format(term=search_term, count=iterstep, start=start, key=apikey)
        response = urllib.request.urlopen(page_url)
        # NOTE: the parsing below assumes an Atom feed, as returned by arXiv;
        # the Elsevier branch has not been tested against it.
        feed = feedparser.parse(response)
        if not feed.entries:
            print('query complete')
            print(f"There should be {feed.feed.get('opensearch_totalresults', 'an unknown number of')} results?")
            break
        date_dict['date'].extend([entry.published for entry in feed.entries])
        date_dict['article_id'].extend([entry.id.split('/abs/')[-1] for entry in feed.entries])
        date_dict['summary'].extend([entry.summary.replace("\n", " ") for entry in feed.entries])
        print(f"gathering results {start} to {start + iterstep-1} ")
        start = start + iterstep
        time.sleep(wait_time)

    return pd.DataFrame(date_dict)
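
Because each platform returns a different response format, it may help to separate the paging logic from the platform-specific request and parsing. The sketch below shows that idea with a stand-in fetch function instead of a real API call; `paginate`, `fetch_page`, and `fake_fetch` are hypothetical names, not part of the notebook.

```python
def paginate(fetch_page, start=0, step=200, max_pages=None):
    """Yield entries page by page until fetch_page returns an empty list.

    fetch_page(start, step) is any callable that returns a list of entries;
    a real version would wrap the request and parsing for one platform.
    """
    pages = 0
    while max_pages is None or pages < max_pages:
        entries = fetch_page(start, step)
        if not entries:
            break
        yield from entries
        start += step
        pages += 1


# Stand-in for a real API call: serves slices of an in-memory list.
fake_results = [f"entry-{i}" for i in range(5)]


def fake_fetch(start, step):
    return fake_results[start:start + step]


collected = list(paginate(fake_fetch, start=0, step=2))
print(collected)
```

With this split, a rate-limiting pause such as `time.sleep` would live inside the real fetch function, and each platform would only need its own `fetch_page`.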