In [1]:
import os
import time
import random

import requests
from scrapy import Selector
from bs4 import BeautifulSoup
import pandas as pd

# Web scraping Physical Review B with Scrapy

*Ben McKeever 2023*

Notebook for extracting citation data from papers published in PRB in 2019. 

The results of the exploratory web scraping in this notebook have been released as a [python package]() for convenience and reusability. Two further Jupyter notebooks in this repo may also be of interest:

* [Sample notebook for use of the python package](Sample-notebook-for-scraping-publication-data-from-PRB-in-2019.ipynb)
* [Notebook for data analysis and visualisations of the dataset](Data-Analysis-and-Visualisations-PRB-dataset.ipynb)


## Introduction

Volumes and Issues of PRB are located at the URL https://journals.aps.org/prb/issues/   
Some initial notes:

* There are 2 Volumes each year in the journal
    * Volume 99 is for the first half of 2019 and contains 24 issues (from issue 99/1 to issue 99/24). 
    * Volume 100 is for the second half of 2019 with 24 more issues from 100/1 to 100/24.
* Every second issue also contains a *Rapid Communications* section, for research papers expedited for publication. I won't bother identifying which papers fall under this label, but instead will just focus on whether or not they are highlighted by the editors.

Strategy for the web scraping:

1. From the web page of each issue (48 such web pages), scrape all of the interesting information to a new csv file.
1. Import all data into one Pandas DataFrame with `pd.read_csv()` and `pd.concat()`.
1. Separately scrape the number of citations (and maybe abstract) from each article's webpage (approx 90 x 48 such pages).
1. Update the dataframe with number of citations (and maybe abstract)
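
To get a feel for the cost of step 3, here is a back-of-the-envelope estimate. It assumes the approximate figure of 90 articles per issue quoted above and a random pause of 1 to 5 seconds between requests (the delay used later in this notebook); network time is ignored, so treat it as a lower bound rather than a measurement.

```python
# Rough cost estimate for step 3 (one request per article page).
issues = 2 * 24                 # two volumes of 24 issues each
articles_per_issue = 90         # approximate figure from the strategy list
mean_delay_s = (1 + 5) / 2      # mean of a uniform integer delay in [1, 5] seconds

total_pages = issues * articles_per_issue
sleep_hours = total_pages * mean_delay_s / 3600

print(f"{total_pages} article pages, about {sleep_hours:.1f} h of waiting")
```

Under these assumptions the waiting time alone is a few hours for the whole year, which is why the citation scraping is kept separate and cached per issue.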

Structure of this notebook:

* Section 1: Obtaining article info from issue web pages: (`doi`, `section`, `authors`, `name`, `date_published`, `issue`, `is_highlighted`)
    * Section 1a: Extension to every issue in 2019
* Section 2: Obtaining number of citations `prb_citations` for each article.

## 1: Web scraping a particular issue (99/5)

Fields to obtain:

* ✅ `doi` : (string) Digital Object Identifier -- the unique key for each research paper. Our DataFrame index.
* ✅ `section`: (string) The subfield of condensed matter physics the paper is placed in by the editors.
* ✅ `date_published`: (string) This has the format *D Month-name YYYY*, e.g. 5 February 2019.
* ✅ `name` : (string) The name of the research article.
* ✅ `authors`: (string) The author list as printed on the issue page.
* ✅ `issue`: (string) Identifies to which publication the article belongs. Has the format "Volume/Issue", so 99/5 in this case.
* ✅ `is_highlighted`: (bool) Signifies if the article is highlighted or not by the editors.

When we read in the data cached locally with `pd.read_csv` we can use `parse_dates` on the `date_published` column to convert it to a datetime object. 
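
As a minimal sketch of what that conversion amounts to for a single value, the standard library can parse the same format. The helper name `parse_pub_date` is hypothetical and not part of the package; the `strip()` call is there because the scraped strings can carry a stray leading space (visible in the raw CSV preview further down). Note that `%B` matches English month names only under an English or C locale.

```python
from datetime import datetime


def parse_pub_date(raw: str) -> datetime:
    """Parse a scraped date such as ' 5 February 2019' (hypothetical helper)."""
    # strip() removes whitespace left over from splitting the pub-info text
    return datetime.strptime(raw.strip(), "%d %B %Y")


print(parse_pub_date(" 5 February 2019").date())
```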

### Retrieving DOIs and corresponding section names

Get html data into a Scrapy Selector object:

In [2]:
url = 'https://journals.aps.org/prb/issues/99/5'
html = requests.get(url).content
sel = Selector(text = html)

Highlighted articles appear twice on the web page -- once at the top of the page, and once in their own section.  
It therefore makes sense to collect the {doi: section} pairs first, before locating any other information, so that no duplicates end up in the data.

In [3]:
def to_kebab_case(value):
    return "-".join(value.lower().split()).replace(",","")

# List of the 5 sections of condensed matter physics for issue 99/5
sections = [
    'Structure, structural phase transitions, mechanical properties, defects',
    'Inhomogeneous, disordered, and partially ordered systems',
    'Dynamics, dynamical systems, lattice effects',
    'Magnetism',
    'Superfluidity and superconductivity'    
]

sections_kebab_case = ["sect-articles-" + to_kebab_case(s) for s in sections]
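
One caveat: `to_kebab_case` only strips commas, so a title such as "Semiconductors I: bulk" would become `semiconductors-i:-bulk`, whereas the anchor names scraped later in this notebook contain no colon. A possible more general variant is sketched below; the name `to_anchor_slug` is hypothetical, and the rule "drop all punctuation" is an assumption inferred from the anchor names seen in this notebook, not a documented APS convention.

```python
import re


def to_anchor_slug(title: str) -> str:
    """Lower-case a section title and join its alphanumeric runs with hyphens."""
    # Any run of characters other than a-z or 0-9 collapses to a single hyphen
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


print("sect-articles-" + to_anchor_slug("Semiconductors I: bulk"))
```

For the five section titles of issue 99/5 this gives the same result as `to_kebab_case`.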

In [4]:
sections_dict = dict()
doi_xpath = '..//div[@class="article panel article-result"]/@data-id'

for sec in sections_kebab_case:
    section_selector = sel.xpath('//a[@name="{}"]'.format(sec))
    dois = section_selector.xpath(doi_xpath).extract()
    sections_dict.update({doi:sec for doi in dois})

unique_dois = list(set(sections_dict.keys())) # ensure no duplicates

### Initialize DataFrame from `sections_dict`

In [5]:
section_series = pd.Series(sections_dict)
df = pd.DataFrame(section_series, columns=['section'])

In [6]:
df.head()

Unnamed: 0,section
10.1103/PhysRevB.99.054101,sect-articles-structure-structural-phase-trans...
10.1103/PhysRevB.99.054102,sect-articles-structure-structural-phase-trans...
10.1103/PhysRevB.99.054103,sect-articles-structure-structural-phase-trans...
10.1103/PhysRevB.99.054104,sect-articles-structure-structural-phase-trans...
10.1103/PhysRevB.99.054105,sect-articles-structure-structural-phase-trans...


### Add issue column to DataFrame

In [7]:
issue = '/'.join(url.split('/')[-2:])
df['issue'] = pd.Series({doi: issue for doi in unique_dois})

In [8]:
df.head()

Unnamed: 0,section,issue
10.1103/PhysRevB.99.054101,sect-articles-structure-structural-phase-trans...,99/5
10.1103/PhysRevB.99.054102,sect-articles-structure-structural-phase-trans...,99/5
10.1103/PhysRevB.99.054103,sect-articles-structure-structural-phase-trans...,99/5
10.1103/PhysRevB.99.054104,sect-articles-structure-structural-phase-trans...,99/5
10.1103/PhysRevB.99.054105,sect-articles-structure-structural-phase-trans...,99/5


### Example of the data available once we have the {doi: section} pairs

Using the Beautiful Soup library we can pretty-print HTML elements retrieved with a Scrapy XPath query.  
On the webpage for issue 99/5, each research article is in its own `<div>` element and contains a bunch of metadata from which we can scrape relevant information:

In [9]:
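# doi_xpath[:-9] drops the trailing '/@data-id' so that the <div> itself is selected.
# section_selector still refers to the last section visited in the loop above.
html_snippet = section_selector.xpath(doi_xpath[:-9]).extract_first()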
soup = BeautifulSoup(html_snippet, 'html.parser')
print(soup.prettify())

<div class="article panel article-result" data-id="10.1103/PhysRevB.99.054501">
 <div class="row">
  <div class="columns large-9">
   <h6 class="tag">
   </h6>
   <h5 class="title">
    <a href="/prb/abstract/10.1103/PhysRevB.99.054501">
     Pressure-induced irreversible evolution of superconductivity in
     <math xmlns="http://www.w3.org/1998/Math/MathML">
      <mrow>
       <mi>
        PdB
       </mi>
       <msub>
        <mi mathvariant="normal">
         i
        </mi>
        <mn>
         2
        </mn>
       </msub>
      </mrow>
     </math>
    </a>
   </h5>
   <h6 class="authors">
    Ying Zhou, Xuliang Chen, Chao An, Yonghui Zhou, Langsheng Ling, Jiyong Yang, Chunhua Chen, Lili Zhang, Mingliang Tian, Zhitao Zhang, and Zhaorong Yang
   </h6>
   <h6 class="pub-info">
    Phys. Rev. B
    <b>
     99
    </b>
    , 054501 (2019) – Published  4 February 2019
   </h6>
   <h6 class="reveal-abstract">
    <a href="">
     Show Abstract
     <i class="fi-plus">
     </i>
  

As we can see from the output above, the html source retrieved by the XPath *doesn't* include the number of citations or the abstract information.  
For that, we will need to go to the web page for each article directly (done later in Section 2).

### Add `date_published` , `authors`, and `name` to DataFrame

In [10]:
pub_dates_dict = dict()
authors_dict = dict()
names_dict = dict()

# using the {doi: section} dictionary to avoid duplicates
# from the highlighted articles part of the web page:
for k, v in sections_dict.items():
    
    sec_selector = sel.xpath(f'//a[@name="{v}"]')
    
    pub_info_xpath = f'..//div[@data-id="{k}"]//h6[@class="pub-info"]//text()'
    pub_info = ''.join(sec_selector.xpath(pub_info_xpath).extract())   
    pub_dates_dict[k] = pub_info.split('Published ')[-1]
    
    authors_xpath = f'..//div[@data-id="{k}"]//h6[@class="authors"]//text()'
    authors_info = ''.join(sec_selector.xpath(authors_xpath).extract())
    authors_dict[k] = authors_info
    
    name = ''.join(
        sec_selector.xpath(f'..//a[@href="/prb/abstract/{k}"]//text()').extract()
    )
    names_dict[k] = name

df['date_published'] = pd.Series(pub_dates_dict)
df['authors'] = pd.Series(authors_dict)
df['name'] = pd.Series(names_dict)

In [11]:
df.head()

Unnamed: 0,section,issue,date_published,authors,name
10.1103/PhysRevB.99.054101,sect-articles-structure-structural-phase-trans...,99/5,5 February 2019,"Michal Stekiel, Adrien Girard, Tra Nguyen-Than...","Phonon-driven phase transitions in calcite, do..."
10.1103/PhysRevB.99.054102,sect-articles-structure-structural-phase-trans...,99/5,6 February 2019,Chris J. Pickard,Hyperspatial optimization of structures
10.1103/PhysRevB.99.054103,sect-articles-structure-structural-phase-trans...,99/5,7 February 2019,"Wen Wang, Jian Shen, and Q.-C. He",Microscale superlubricity of graphite under va...
10.1103/PhysRevB.99.054104,sect-articles-structure-structural-phase-trans...,99/5,11 February 2019,"V. B. Eltsov, A. Gordeev, and M. Krusius",Kelvin-Helmholtz instability of AB interface i...
10.1103/PhysRevB.99.054105,sect-articles-structure-structural-phase-trans...,99/5,21 February 2019,"Xiaoying Zhuang, Bo He, Brahmanandam Javvaji, ...",Intrinsic bending flexoelectric constants in t...


### Add `is_highlighted` column to DataFrame

In [12]:
selector = sel.xpath('//a[@name="sect-highlighted-articles"]')
highlighted_xpath ='../div[@class="article panel article-result"]/@data-id'
highlighted_articles = selector.xpath(highlighted_xpath).extract()

# True for every DOI that also appears in the highlighted block at the top of the page
df["is_highlighted"] = df.index.isin(highlighted_articles)

In [13]:
df.head()

Unnamed: 0,section,issue,date_published,authors,name,is_highlighted
10.1103/PhysRevB.99.054101,sect-articles-structure-structural-phase-trans...,99/5,5 February 2019,"Michal Stekiel, Adrien Girard, Tra Nguyen-Than...","Phonon-driven phase transitions in calcite, do...",False
10.1103/PhysRevB.99.054102,sect-articles-structure-structural-phase-trans...,99/5,6 February 2019,Chris J. Pickard,Hyperspatial optimization of structures,True
10.1103/PhysRevB.99.054103,sect-articles-structure-structural-phase-trans...,99/5,7 February 2019,"Wen Wang, Jian Shen, and Q.-C. He",Microscale superlubricity of graphite under va...,False
10.1103/PhysRevB.99.054104,sect-articles-structure-structural-phase-trans...,99/5,11 February 2019,"V. B. Eltsov, A. Gordeev, and M. Krusius",Kelvin-Helmholtz instability of AB interface i...,False
10.1103/PhysRevB.99.054105,sect-articles-structure-structural-phase-trans...,99/5,21 February 2019,"Xiaoying Zhuang, Bo He, Brahmanandam Javvaji, ...",Intrinsic bending flexoelectric constants in t...,False


Printing just the highlighted articles:

In [14]:
df[df['is_highlighted']]

Unnamed: 0,section,issue,date_published,authors,name,is_highlighted
10.1103/PhysRevB.99.054102,sect-articles-structure-structural-phase-trans...,99/5,6 February 2019,Chris J. Pickard,Hyperspatial optimization of structures,True
10.1103/PhysRevB.99.054404,sect-articles-magnetism,99/5,6 February 2019,Guo-Qiang Zhang and J. Q. You,Higher-order exceptional point in a cavity mag...,True
10.1103/PhysRevB.99.054430,sect-articles-magnetism,99/5,26 February 2019,"B. F. McKeever, D. R. Rodrigues, D. Pinna, Ar....",Characterizing breathing dynamics of magnetic ...,True
10.1103/PhysRevB.99.054505,sect-articles-superfluidity-and-superconductivity,99/5,13 February 2019,"Bitan Roy, Sayed Ali Akbar Ghorashi, Matthew S...",Topological superconductivity of spin-3/2 carr...,True
10.1103/PhysRevB.99.054516,sect-articles-superfluidity-and-superconductivity,99/5,25 February 2019,"Subir Sachdev, Harley D. Scammell, Mathias S. ...",Gauge theory for the cuprates near optimal doping,True


### Saving the data to a csv file

In [15]:
df.to_csv('issue_99_5.csv')

In [16]:
!head -2 issue_99_5.csv

,section,issue,date_published,authors,name,is_highlighted
10.1103/PhysRevB.99.054101,sect-articles-structure-structural-phase-transitions-mechanical-properties-defects,99/5, 5 February 2019,"Michal Stekiel, Adrien Girard, Tra Nguyen-Thanh, Alexei Bosak, Victor Milman, and Bjoern Winkler","Phonon-driven phase transitions in calcite, dolomite, and magnesite",False


Showing how to import the data, and converting `date_published` to datetimes:

In [17]:
data = pd.read_csv('issue_99_5.csv', index_col=[0],parse_dates=[3])
data.head()

Unnamed: 0,section,issue,date_published,authors,name,is_highlighted
10.1103/PhysRevB.99.054101,sect-articles-structure-structural-phase-trans...,99/5,2019-02-05,"Michal Stekiel, Adrien Girard, Tra Nguyen-Than...","Phonon-driven phase transitions in calcite, do...",False
10.1103/PhysRevB.99.054102,sect-articles-structure-structural-phase-trans...,99/5,2019-02-06,Chris J. Pickard,Hyperspatial optimization of structures,True
10.1103/PhysRevB.99.054103,sect-articles-structure-structural-phase-trans...,99/5,2019-02-07,"Wen Wang, Jian Shen, and Q.-C. He",Microscale superlubricity of graphite under va...,False
10.1103/PhysRevB.99.054104,sect-articles-structure-structural-phase-trans...,99/5,2019-02-11,"V. B. Eltsov, A. Gordeev, and M. Krusius",Kelvin-Helmholtz instability of AB interface i...,False
10.1103/PhysRevB.99.054105,sect-articles-structure-structural-phase-trans...,99/5,2019-02-21,"Xiaoying Zhuang, Bo He, Brahmanandam Javvaji, ...",Intrinsic bending flexoelectric constants in t...,False


## 1a: Extension to all issues from 2019

The code in the previous section doesn't yet fully extend to every single issue in 2019, because different issues sort the research papers into different sections of condensed matter physics. For example, issue 99/3 uses the following section titles:

    Electronic structure and strongly correlated systems
    Semiconductors I: bulk
    Semiconductors II: surfaces, interfaces, microstructures, and related topics
    Surface physics, nanoscale physics, low-dimensional systems

none of which are used in issue 99/5.

### Retrieving section names

Rather than manually noting down all of the different sections used, we can simply scrape the `section` headings from the issue's web page itself. For example:

In [18]:
url = 'https://journals.aps.org/prb/issues/99/3'
html = requests.get(url).content
sel = Selector(text = html)

section_selectors = sel.xpath('//a[contains(@name,"sect-articles-")]')
section_names = section_selectors.xpath('./@name').extract()
section_names

['sect-articles-electronic-structure-and-strongly-correlated-systems',
 'sect-articles-semiconductors-i-bulk',
 'sect-articles-semiconductors-ii-surfaces-interfaces-microstructures-and-related-topics',
 'sect-articles-surface-physics-nanoscale-physics-low-dimensional-systems']

Now the steps taken previously in this notebook may be applied to any weekly issue in 2019.

### Python package information

The routine described above has been placed into the function `get_issue_data` inside a python package `scrape` in the GitHub repo where this notebook lives. All issue metadata from 2019 can now be gathered in a new notebook for data analysis with just a few lines of code:


```python
import random
import time

import pandas as pd
from scrape.data import get_issue_data

# Issues 1-24 of volumes 99 and 100 cover all of 2019
url_list = [f'https://journals.aps.org/prb/issues/{volume}/{issue}'
            for volume in (99, 100) for issue in range(1, 25)]
df_list = []
for u in url_list:
    df_list.append(get_issue_data(url=u, force_download=True))
    time.sleep(random.randint(1, 5))  # wait a bit before sending another request

df = pd.concat(df_list)
df.to_csv('all_issues_2019_no_citation_data.csv')
```

### Final notes

Some final notes, for anyone interested in extending the code developed above: 

* After December 2020, Physical Review B replaced its *"Rapid Communications"* feature with *"Letters"*, which is the hallmark publication type of the sibling journal Physical Review Letters.

* Therefore, one extension to the code in my python package would be to add a `publication_type` column, by including a similar line such as `section_selectors = sel.xpath('//a[contains(@name,"sect-letters-")]')` in the python function `get_issue_data`. 
* The code outlined above can then easily be extended to research papers published after 2020, and also to publications in Physical Review Letters or any other APS journals that separate research papers into "Letters" and "Articles".
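
A minimal sketch of how such a `publication_type` value could be derived from an anchor name is given below. The function `split_anchor_name` is hypothetical and not part of the package, and the `sect-letters-` prefix is the assumption made in the note above; I have not checked it against a post-2020 issue page.

```python
from typing import Tuple


def split_anchor_name(anchor_name: str) -> Tuple[str, str]:
    """Return (publication_type, section_slug) for an issue-page anchor name."""
    # 'sect-articles-' is seen in this notebook; 'sect-letters-' is assumed
    prefixes = {"sect-articles-": "article", "sect-letters-": "letter"}
    for prefix, publication_type in prefixes.items():
        if anchor_name.startswith(prefix):
            return publication_type, anchor_name[len(prefix):]
    raise ValueError(f"unrecognised anchor name: {anchor_name!r}")


print(split_anchor_name("sect-articles-magnetism"))
```

Raising on unknown prefixes (such as `sect-highlighted-articles`) keeps the highlighted block from being mistaken for a section.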

## 2: Get number of citations for each DOI from PRB

The number of citations for a particular article has to be obtained from that article's page, which also contains the abstract information. Since citation numbers will vary, while the rest of the metadata is fixed at the time of publication, it makes sense to do this step separately.

Fields to obtain:

* ✅ `prb_citations`
* `abstract` (if we want to do analysis on keywords)

With the data fields described in Section 1, `all_issues_2019_no_citation_data.csv` has a file size of 1.3 MB. This will become notably larger if we also scrape the abstract text, so I will skip this for now.

In [19]:
article_base_url = 'https://journals.aps.org/prb/abstract/'

### Retrieve citation number for a single DOI

In [20]:
article_url = article_base_url + unique_dois[0]
html = requests.get(article_url).content

sel = Selector(text = html)

s = sel.xpath('//a[contains(@href,"cited-by")]/text()').extract_first()
citations = int(s.replace("Citing Articles ","")[1:-1])

print('doi = ', unique_dois[0], '\t citations = ', citations)

doi =  10.1103/PhysRevB.99.054303 	 citations =  42
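
The slicing above relies on the link text looking like `Citing Articles (42)`, and it fails if the link is missing. A slightly more defensive parser might look like the sketch below; `parse_citation_count` is a hypothetical helper, and both the text format and the handling of a thousands separator are assumptions rather than anything documented by the site.

```python
import re
from typing import Optional


def parse_citation_count(link_text: Optional[str]) -> int:
    """Extract the integer from text like 'Citing Articles (42)'; 0 if absent."""
    if link_text is None:
        return 0  # no "cited-by" link found on the page
    match = re.search(r"\((\d[\d,]*)\)", link_text)
    if match is None:
        return 0  # link present, but no number in parentheses
    # Assumption: large counts might be written with commas, e.g. (1,234)
    return int(match.group(1).replace(",", ""))


print(parse_citation_count("Citing Articles (42)"))
```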


### Retrieve citation numbers for all DOIs

Since retrieving citation numbers involves sending one request per research paper to the website's servers, it is a good idea to include a `time.sleep()` call between requests, so that the traffic is not mistaken for a DoS attack and the IP address blocked. 

There are 82 research papers in issue 99/5. By including a random time delay between each request, retrieving all of the citation numbers from an issue can take a few minutes. The code below is therefore written such that requests are only sent if we do not already have the data saved locally.

In [21]:
if not os.path.exists('citations_issue_99_5.csv'):
    
    citations_dict = dict()
    for doi in unique_dois:

        article_url = article_base_url + doi
        html = requests.get(article_url).content

        time.sleep(random.randint(1,5)) # avoid flagging DDoS protective measures

        sel = Selector(text = html)
        s = sel.xpath('//a[contains(@href,"cited-by")]/text()').extract_first()
        if s is None:
            num_of_citations = 0 
        else:
            num_of_citations = int(s.replace("Citing Articles ","")[1:-1])
        citations_dict[doi] = num_of_citations
else:
    df = pd.read_csv('citations_issue_99_5.csv',index_col=[0],parse_dates=[3])
    citations_dict = df['prb_citations'].to_dict()

prb_citations = pd.Series(citations_dict)
df = pd.DataFrame(prb_citations, columns=['prb_citations'])
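
Note that the cell above reads `citations_issue_99_5.csv` when it exists but never writes it; presumably the file is produced by the packaged function described further down. For completeness, here is a standard-library sketch of the whole "scrape once, then reuse the cache" pattern. The function `load_or_scrape`, its parameters, and the two-column file layout are all hypothetical; `fetch_count` stands in for whatever performs the actual request.

```python
import csv
import os
import random
import time
from typing import Callable, Dict, Iterable


def load_or_scrape(cache_path: str,
                   dois: Iterable[str],
                   fetch_count: Callable[[str], int],
                   sleep: Callable[[float], None] = time.sleep) -> Dict[str, int]:
    """Return {doi: citations}, from cache_path if present, else scrape and save."""
    if os.path.exists(cache_path):
        with open(cache_path, newline="") as f:
            return {row["doi"]: int(row["prb_citations"])
                    for row in csv.DictReader(f)}

    counts = {}
    for doi in dois:
        counts[doi] = fetch_count(doi)   # one request per article
        sleep(random.randint(1, 5))      # pause 1 to 5 seconds between requests

    with open(cache_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["doi", "prb_citations"])
        writer.writerows(counts.items())
    return counts
```

Passing `sleep` as a parameter is only a convenience so the function can be exercised without real delays, e.g. `load_or_scrape(path, dois, fake_fetch, sleep=lambda s: None)`.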

In [22]:
# construct a DataFrame with just citation numbers
df.head()

Unnamed: 0,prb_citations
10.1103/PhysRevB.99.054101,7
10.1103/PhysRevB.99.054102,13
10.1103/PhysRevB.99.054103,19
10.1103/PhysRevB.99.054104,3
10.1103/PhysRevB.99.054105,58


In [23]:
# Construct a DataFrame which also includes the section info for each article

df = pd.DataFrame({'section': sections_dict,'prb_citations': citations_dict})
df.head()

Unnamed: 0,section,prb_citations
10.1103/PhysRevB.99.054101,sect-articles-structure-structural-phase-trans...,7
10.1103/PhysRevB.99.054102,sect-articles-structure-structural-phase-trans...,13
10.1103/PhysRevB.99.054103,sect-articles-structure-structural-phase-trans...,19
10.1103/PhysRevB.99.054104,sect-articles-structure-structural-phase-trans...,3
10.1103/PhysRevB.99.054105,sect-articles-structure-structural-phase-trans...,58


### Python Package information

The code outlined above is appreciably simpler than the web scraping code in Section 1 which obtained the rest of the publication metadata. Nonetheless, for convenience and reusability, I have packaged this into a new function `get_citation_data`, so that the DataFrame is updated with citation numbers and cached locally in a new csv file. Example usage:

```python
import pandas as pd
from scrape.data import get_citation_data

infile = 'issue_99_5.csv'
outfile = 'citations_issue_99_5.csv'
df = get_citation_data(infile=infile,outfile=outfile,journal='prb')
```