# Scraping all issue data and citations from PRB in 2019

*Ben McKeever 2023*

Example notebook showing how to use the `scrape` Python package to download and cache all citation data from PRB in 2019.

See also:

* [Notebook for data analysis and visualisations of the dataset](Data-Analysis-and-Visualisations-PRB-dataset.ipynb)
* [Notebook outlining details of the data collection via web scraping of PRB's website](Web-scraping-Physical-Review-B-workbook.ipynb)

In [None]:
import time
import random

import pandas as pd
from scrape.data import get_citation_data, get_issue_data

## Download and cache data for issue 99/5

In [None]:
url = 'https://journals.aps.org/prb/issues/99/5'
df = get_issue_data(url=url, force_download=True)

In [None]:
!head -2 issue_99_5.csv
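
`!head` relies on a Unix shell, so the cell above may not run on Windows. A portable alternative is sketched below. `peek_csv` is a hypothetical helper, not part of the `scrape` package, and the UTF-8 encoding is an assumption about how the cache file was written.

```python
from itertools import islice
from pathlib import Path


def peek_csv(path, n=2, encoding="utf-8"):
    """Return the first n lines of a text file (hypothetical helper)."""
    with Path(path).open(encoding=encoding) as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Print the header and first row of the cached issue file, if it exists.
if Path("issue_99_5.csv").exists():
    print("\n".join(peek_csv("issue_99_5.csv")))
```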

In [None]:
df = get_citation_data(infile='issue_99_5.csv',
                       outfile='citations_issue_99_5.csv',
                       force_download=True)

## Download and cache all issue data from 2019

Note: this makes requests to 48 URLs (24 issues in each of volumes 99 and 100), with a random 1-5 second delay between requests.

In [None]:
url_list = [f'https://journals.aps.org/prb/issues/{j}/{i+1}'
            for j in [99, 100] for i in range(24)]
df_list = []
for u in url_list:
    df_list.append(get_issue_data(url=u))
    time.sleep(random.randint(1, 5))  # wait 1-5 s before sending another request
    
df = pd.concat(df_list)
df.to_csv('all_issues_2019_no_citation_data.csv')
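
As a quick sanity check before sending any requests, the URL list can be built and inspected on its own. The sketch below assumes, as the loop above does, that volumes 99 and 100 each contain issues 1 to 24.

```python
# Volumes and issue numbers as used in the loop above.
volumes = [99, 100]
issues = range(1, 25)  # issues 1..24

url_list = [f"https://journals.aps.org/prb/issues/{v}/{n}"
            for v in volumes for n in issues]

print(len(url_list))   # 48
print(url_list[0])     # https://journals.aps.org/prb/issues/99/1
print(url_list[-1])    # https://journals.aps.org/prb/issues/100/24
```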

## Download and cache all citation data from 2019

Note: this makes requests to ~4500 different URLs, with a random 1-5 second delay between requests.

Go watch a movie or two if running this code!
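
For a rough idea of the running time: if the delay is drawn with `random.randint(1, 5)` as in the issue loop (an assumption about what `get_citation_data` does internally), the mean delay is 3 seconds, so the waits alone add up to just under four hours. Network and parsing time come on top.

```python
n_requests = 4500                     # approximate figure quoted above
mean_delay_s = (1 + 5) / 2            # mean of random.randint(1, 5), endpoints inclusive
total_hours = n_requests * mean_delay_s / 3600

print(f"{total_hours:.2f} hours")     # 3.75 hours
```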

In [None]:
from datetime import date
today = date.today()

In [None]:
input_csv_list = [f'issue_{j}_{i+1}.csv' for j in [99,100] for i in range(24)]
output_csv_list = ['citations_' + csv for csv in input_csv_list]

df_list = []

for infile, outfile in zip(input_csv_list, output_csv_list):
    df_list.append(get_citation_data(infile=infile, outfile=outfile))

df = pd.concat(df_list)

df.to_csv(f"data_{today}.csv")
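
Because the run is long, it can help to confirm that every cached issue file exists before starting the citation loop. A minimal sketch, assuming the `issue_{volume}_{number}.csv` files sit in the current working directory. `missing_files` is a hypothetical helper and not part of the `scrape` package.

```python
from pathlib import Path


def missing_files(paths, base_dir="."):
    """Return the entries of `paths` that do not exist under base_dir (hypothetical helper)."""
    base = Path(base_dir)
    return [p for p in paths if not (base / p).is_file()]


# Same file names as the citation loop expects as input.
input_csv_list = [f"issue_{v}_{n}.csv" for v in [99, 100] for n in range(1, 25)]
missing = missing_files(input_csv_list)
if missing:
    print(f"{len(missing)} issue files are missing, e.g. {missing[0]}")
```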