# Scraping 28 Springer Philosophy journals

#### *By: Amirhossein Kiani*
#### *Email: ahosseinkiani@gmail.com*

### Introduction
This notebook contains the code to scrape 28 top philosophy journals from the well-known publisher Springer. The scraped papers amount to **781,237** pages from some of the best journals in the world.

The notebook has 3 parts.

- In Part 1, the metadata of the 28 journals is scraped. For each journal, the data of every issue in every volume is scraped and saved into a dataframe. The columns of each dataframe are as follows (see a real example below): 
                Index(['issue_link', 'vol', 'issue', 'issue_year', 'issue_month',
               'paper_titles', 'paper_links', 'pdfs', 'num_papers', 'paper_authors',
               'paper_page_counts', 'page_counts', 'page_counts_per_issue',
               'pdf_texts'],
              dtype='object')
- In Part 1, all the columns are filled with their respective data, except for `pdf_texts`, which is meant to record the full text of every paper in each issue of a given journal. Extracting these texts can be very resource-intensive, both computationally and financially.
- Part 2 introduces the function used in Part 3 to extract the plain-text and mathematical content of the papers.
- Part 3 contains the code that scrapes the texts of the papers to fill the `pdf_texts` column.

These dataframes can be used for text-analytics purposes on professional philosophy texts. I'm currently pursuing this project for some of the journals. Updates will become available on my GitHub account: https://github.com/amirkiaml


**NOTE 1:** The number 28 is arbitrary: all the remaining journals in Springer -- in philosophy *or* other areas -- can be scraped in the same way as in Part 1.

**NOTE 2:** All 28 journals, with all their respective metadata, are publicly available on Springer Nature's website. The **exception** is the column `pdf_texts`, which is designed to record the text content of the papers from the given journals. Aside from the computational and financial resources that scraping this data would take, most of the publications are proprietary, which forbids me from making them accessible to the public. Make sure to read the following legal notice and its embedded link if you would like to pursue Part 3 of this notebook.

### Legal Notice:
From Springer Nature's Text and Data Mining policy (https://www.springernature.com/gp/researchers/text-and-data-mining):

<span style="color:gray"> 
Springer Nature recognizes the importance of new research techniques and aims to support innovation in this regard. As the volume of scientific publications increases and TDM software tools improve, Springer Nature appreciates the need for a more formalized process to enable TDM, and strives to make this as simple as possible for researchers.

A growing part of Springer Nature’s journal articles is published open access. TDM is usually allowed without restrictions for these publications since the majority of Springer Nature open access content is licensed under CC-by.
 

**TDM for researchers at subscribing academic institutions**
    
For subscribed journals and books, Springer Nature grants researchers text and data mining rights via their institutions, provided the purpose is non-commercial research.

Individual researchers can download subscription (and open access) journal articles and books for TDM purposes directly from Springer Nature’s content platforms. They are requested to limit this to 1 request per second. The selection of desired articles can be conducted by using existing search methods and tools, such as PubMed, Web of Science, or Springer Nature’s Metadata API, among others. An API key can be requested for researchers  who want to use Springer Nature’s TDM APIs. Use of the API provides additional querying parameters and a higher bandwidth for content requests (150 requests per minute).

Researchers are required to use reasonable measures to protect the security of downloaded content, store content on a secure internal server without access for third parties and only for the duration of the TDM project.

Researchers are requested to be considerate and limit downloads to a reasonable rate which does not impose an undue burden on Springer Nature’s systems and servers.
 

**Implementation by academic and government institutions**
    
Subscribing academic and government institutions may include text and data mining rights in all new and renewed Journal and ebook subscription agreements under Springer Nature’s standard TDM terms (Springer Nature's specialist Database products excluded). For such customers the rights to perform TDM is at no additional cost for content that their subcription license provides access to. Existing subscribers may also add TDM rights under these terms before their agreement is up for renewal.

The use of Springer Nature’s TDM API incurs additional costs.
 

**TDM for commercial research (Industry)**
    
For TDM in the context of commercial research, Springer Nature offers standard TDM terms as well as the TDM API for a fee. In that case, the restriction to non-commercial research does not apply.

In addition, Copyright Clearance Center offers a text-mining solution that covers publications from 25 STM publishers, including Springer Nature.
</span> 

<span style="color:red">**If you choose to perform web scraping using the code in this notebook (particularly in Part 3), make sure to respect Springer Nature's policies.**</span> 
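
The policy quoted above asks individual researchers to limit downloads to 1 request per second. One way to honour that in a scraping loop is a small throttling helper. The following is only a minimal sketch: the class name `RateLimiter` and its parameters are hypothetical and are not part of this notebook or of any Springer tooling.

```python
import time


class RateLimiter:
    """Ensure consecutive calls to wait() are at least `min_interval` seconds apart.

    Hypothetical helper for illustration; `clock` and `sleep` are injectable
    so the behaviour can be checked without actually waiting.
    """

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                # Too soon since the previous request: sleep for the difference
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


# Assumed usage inside a scraping loop (not executed here):
# limiter = RateLimiter(min_interval=1.0)
# for url in urls:
#     limiter.wait()
#     driver.get(url)
```

Calling `limiter.wait()` right before every `driver.get(...)` keeps the request rate at or below one per second, regardless of how long each page takes to load.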

### Table of Contents
- [PART 0: The Journals](#part0)
- [PART 1: Scraping the Journals' Metadata and Creating Dataframes](#part1)
- [PART 2: A Function to Extract English and Mathematical Contents](#part2)
- [PART 3: Scraping the contents of the papers](#part3)

# PART 0: The Journals<a class ='author' id='part0'></a>

The dictionary below maps the unique code of each of the 28 Springer philosophy journals to the journal's name. These codes will be used to scrape the journals.

In [1]:
journal_dictionary = {10992: 'JPL', 11229: 'Synthese', 11225: 'Studia Logica', 12136: 'Acta Analytica', 11245: 'Topoi',
                      12133: 'Metaphysica', 10849: 'Journal of Logic, Language and Information', 13347: 'Philosophy & Technology',
                      10838: 'Journal for General Philosophy of Science', 11569: 'Nanoethics', 12152: 'Neuroethics', 
                      10892: 'The Journal of Ethics', 10551: 'The Journal of Business Ethics',
                      10790: 'The Journal of Value Inquiry', 13752: 'Biological Theory', 10677: 'Ethical Theory and Moral Practice',
                      12142: 'Human Rights Review', 11153: 'International Journal for Philosophy of Religion', 
                      10539: 'Biology & Philosophy', 41055: 'Food Ethics', 11158: 'Res Publica', 11017: 'Theoretical Medicine and Bioethics',
                      10503: 'Argumentation', 11406: 'Philosophia', 10698: 'Foundations of Chemistry', 11841: 'Sophia', 
                      10670: 'Erkenntnis', 11098: 'Philosophical Studies'}

In [3]:
for code, name in journal_dictionary.items():
    print(f'{code}: {name}')

10992: JPL
11229: Synthese
11225: Studia Logica
12136: Acta Analytica
11245: Topoi
12133: Metaphysica
10849: Journal of Logic, Language and Information
13347: Philosophy & Technology
10838: Journal for General Philosophy of Science
11569: Nanoethics
12152: Neuroethics
10892: The Journal of Ethics
10551: The Journal of Business Ethics
10790: The Journal of Value Inquiry
13752: Biological Theory
10677: Ethical Theory and Moral Practice
12142: Human Rights Review
11153: International Journal for Philosophy of Religion
10539: Biology & Philosophy
41055: Food Ethics
11158: Res Publica
11017: Theoretical Medicine and Bioethics
10503: Argumentation
11406: Philosophia
10698: Foundations of Chemistry
11841: Sophia
10670: Erkenntnis
11098: Philosophical Studies


In [2]:
len(journal_dictionary)

28

# PART 1: Scraping the Journals' Metadata and Creating Dataframes<a class ='author' id='part1'></a>

This code creates a database of the 28 Springer journals: their volumes, issues, article titles, article links and PDF links.
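
Before the full scraper, here is a small, hedged sketch of how a volume and issue number can be read off an issue link. It only mirrors the idea used in the cell below; the function name `parse_issue_link` is hypothetical, and the shape of the supplement URL in the example is an assumption inferred from how the scraper treats links containing `supplement`.

```python
import re

# Pattern for issue links such as .../journal/10670/volumes-and-issues/88-5
ISSUE_LINK = re.compile(
    r"https://link\.springer\.com/journal/(?P<code>\d+)/volumes-and-issues/"
    r"(?P<vol>\d+)-(?P<issue>\d+)(?P<rest>.*)$"
)


def parse_issue_link(url):
    """Return journal code, volume and issue of an issue link, or None if it does not match."""
    match = ISSUE_LINK.match(url)
    if match is None:
        return None
    return {
        "journal_code": match.group("code"),
        "vol": match.group("vol"),
        "issue": match.group("issue"),
        # Assumption: supplement issues carry the word 'supplement' after the vol-issue part
        "is_supplement": "supplement" in match.group("rest"),
    }


print(parse_issue_link("https://link.springer.com/journal/10670/volumes-and-issues/88-5"))
print(parse_issue_link("https://link.springer.com/journal/10670/volumes-and-issues/88-1/supplement"))
print(parse_issue_link("https://link.springer.com/journal/10670"))
```

Links that are not issue links, such as the journal's landing page in the last call, yield `None` and can simply be skipped.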

In [4]:
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
import pandas as pd

journals_page_numbers = []
journal_dataframe_list = []


def add_space_between_letters(input_list):
    """
    This function makes sure that if a paper is coauthored, the names of the authors don't collide (they get separated with
    a '|')
    """
    pattern = r'([a-z])([A-Z])'
    output_list = []
    for element in input_list:
        modified_element = re.sub(pattern, r'\1 | \2', element)
        output_list.append(modified_element)
    return output_list

# Set up Selenium webdriver with Chrome
driver = webdriver.Chrome()


journal_dictionary = {10992: 'JPL', 11229: 'Synthese', 11225: 'Studia Logica', 12136: 'Acta Analytica', 11245: 'Topoi',
                      12133: 'Metaphysica', 10849: 'Journal of Logic, Language and Information', 13347: 'Philosophy & Technology',
                      10838: 'Journal for General Philosophy of Science', 11569: 'Nanoethics', 12152: 'Neuroethics', 
                      10892: 'The Journal of Ethics', 10551: 'The Journal of Business Ethics',
                      10790: 'The Journal of Value Inquiry', 13752: 'Biological Theory', 10677: 'Ethical Theory and Moral Practice',
                      12142: 'Human Rights Review', 11153: 'International Journal for Philosophy of Religion', 
                      10539: 'Biology & Philosophy', 41055: 'Food Ethics', 11158: 'Res Publica', 11017: 'Theoretical Medicine and Bioethics',
                      10503: 'Argumentation', 11406: 'Philosophia', 10698: 'Foundations of Chemistry', 11841: 'Sophia', 
                      10670: 'Erkenntnis', 11098: 'Philosophical Studies'}

journal_codes = ['10992', '11229', '11225', '12136', '11245', '12133', '10670', '11098', '10849', '13347', '10838', '11569', '12152', '10892', 
                 '10551', '10790','13752', '10677', '12142', '11153', '10539', '41055', '11158', '11017', '10503', '11406', 
                 '10698', '11841']

for journal_code in journal_codes:
    # Open the webpage
    driver.get(f"https://link.springer.com/journal/{journal_code}/volumes-and-issues")

    # Grab the journal's name
    journal_name = driver.find_element(By.CLASS_NAME, 'c-product-header__title').text

    # Find all links on the page
    links = driver.find_elements(By.TAG_NAME, 'a')

    # Regular expression pattern of all volume-issue links
    pattern = fr"https:\/\/link\.springer\.com\/journal\/{journal_code}\/volumes-and-issues\/\d+-\d+.*"

    # Match and store the links that match the pattern
    matching_links = []
    for link in links:
        href = link.get_attribute("href")
        if href and re.match(pattern, href):
            matching_links.append(href)

    df = pd.DataFrame(matching_links,columns=['issue_link'])
    df['vol'] = df['issue_link'].apply(lambda x: x.split('/')[-1].split('-')[0] if 'supplement' not in x.split('/')[-1] else x.split('/')[-2].split('-')[0]) 
    df['issue'] = df['issue_link'].apply(lambda x: x.split('/')[-1].split('-')[1] if 'supplement' not in x.split('/')[-1] else x.split('/')[-2].split('-')[1])
    df['issue_year'] = ''
    df['issue_month'] = ''
    df['paper_titles'] = ''
    df['paper_links'] = ''
    df['pdfs'] = ''
    df['num_papers'] = ''
    df['paper_authors'] = ''
    df['paper_page_counts'] = ''
    df['page_counts'] = ''
    df['page_counts_per_issue'] = ''


    for i, url in enumerate(df['issue_link']):
        time.sleep(1)  # Springer Nature requests at most 1 request per second
        driver.get(url)
        paper_titles = driver.find_elements(By.CLASS_NAME, 'c-card__title')
        # Descendant and chained-class selectors are CSS selectors, not single class names
        paper_links = driver.find_elements(By.CSS_SELECTOR, '.c-card__title a')
        paper_authors = driver.find_elements(By.CSS_SELECTOR, '.c-author-list.c-author-list--compact.u-text-sm')
        paper_page_counts = driver.find_elements(By.CSS_SELECTOR, '.c-meta__item.c-meta__item--block-sm-max')
        issue_date_info = driver.find_element(By.CLASS_NAME, 'u-mb-8')

        
        titles = [title.text for title in paper_titles]
        links = [link.get_attribute('href') for link in paper_links]
        authors = [author.text for author in paper_authors]
        authors = add_space_between_letters(authors)
        pages = [page_range.text for page_range in paper_page_counts]
        page_numbers = [page for page in pages if 'Pages' in page]
        issue_date = issue_date_info.text.split(',')[-1].split(" ")
        issue_year = issue_date[-1]
        issue_month = issue_date[-2]


        df.at[i, 'paper_titles'] = titles
        df.at[i, 'paper_links'] = links
        df.at[i, 'paper_authors'] = authors
        df.at[i, 'paper_page_counts'] = page_numbers
        df.at[i, 'issue_year'] = issue_year
        df.at[i, 'issue_month'] = issue_month


    df['pdfs'] = df['paper_links'].apply(lambda x: [url.replace("article", "content/pdf")+".pdf?pdf=button" for url in x])
    df['num_papers'] = df['paper_titles'].apply(lambda x: len(x))
    df['pdf_texts'] = df['pdfs'].apply(lambda x: [])
    df['page_counts'] = df['paper_page_counts'].apply(lambda x: [y.split(':')[-1] for y in x] if x != [] else [])
    df['page_counts'] = df['page_counts'].apply(lambda x: [int(y.replace(" ", "").split('-')[-1]) - int(y.replace(" ", "").split('-')[0]) for y in x if y.replace(" ", "").split('-')[0].isdigit() and y.replace(" ", "").split('-')[-1].isdigit()])
    # Series.iteritems() was removed in pandas 2.0; Series.items() is the equivalent
    for idx, row in df['page_counts'].items():
        if row == []:
            # Assign an approximate value of 15 pages to each paper whose page range is missing on its URL
            df.at[idx, 'page_counts'] = [15 for _ in range(df.at[idx, 'num_papers'])]
    df['page_counts_per_issue'] = df['page_counts'].apply(lambda x: sum(x))


    # Save the dataframe to the hard drive in pickle format.
    df.to_pickle(f'{journal_name}.pkl')

    # Record the total page count of this journal under its own code.
    journals_page_numbers.append({journal_code: sum(df['page_counts_per_issue'])})

    # Keep the dataframe in memory as well, keyed by its journal code.
    journal_dataframe_list.append({journal_code: df})

# Close the browser
driver.quit()
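
The `page_counts` column in the cell above is derived from labels such as `Pages: 1367 - 1389`. The following is a minimal sketch of that computation on a single label, assuming the same convention as the cell (last page minus first page). The function name `page_span` is hypothetical, and the Roman-numeral label is a made-up input to show the non-numeric case.

```python
def page_span(label):
    """Return last page minus first page for a label like 'Pages: 1367 - 1389'.

    Returns None when either end of the range is not a plain number.
    """
    parts = label.split(":")[-1].replace(" ", "").split("-")
    first, last = parts[0], parts[-1]
    if first.isdigit() and last.isdigit():
        return int(last) - int(first)
    return None


labels = ["Pages: 1367 - 1389", "Pages: 1391 - 1408", "Pages: i - iv"]
print([page_span(label) for label in labels])  # [22, 17, None]
```

Under this convention a paper on pages 1 to 19 is recorded as 18 pages; adding 1 to each span would give the inclusive page count instead.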

#### <span style="color:red">The total number of pages for Springer Philosophy papers (from the list above) is</span> <span style="color:red">**781,237**</span>:

In [83]:
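# Return the total number of pages for all the journals scraped.
# Totals are collected in a set first, so a total that was recorded more than once is counted only once.
unique_totals = set()
for entry in journals_page_numbers:
    unique_totals.update(entry.values())
print(sum(unique_totals))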

781237


### As an example, we import the dataframe `Erkenntnis.pkl`

In [3]:
import pandas as pd
Erkenntnis = pd.read_pickle('Erkenntnis.pkl')

In [32]:
Erkenntnis.head()

| | issue_link | vol | issue | issue_year | issue_month | paper_titles | paper_links | pdfs | num_papers | paper_authors | paper_page_counts | page_counts | page_counts_per_issue | pdf_texts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://link.springer.com/journal/10670/volume... | 88 | 5 | 2023 | June | [Propositionalism and Questions that do not ha... | [https://link.springer.com/article/10.1007/s10... | [https://link.springer.com/content/pdf/10.1007... | 1 | [Giulia Felappi] | [Pages: 1 - 19] | [18] | 18 | [] |
| 1 | https://link.springer.com/journal/10670/volume... | 88 | 4 | 2023 | April | [Rules, Equilibria and Virtual Control: How to... | [https://link.springer.com/article/10.1007/s10... | [https://link.springer.com/content/pdf/10.1007... | 22 | [Frank Hindriks, Joe Morrison, Torin Alter \| D... | [Pages: 1367 - 1389, Pages: 1391 - 1408, Pages... | [22, 17, 19, 17, 19, 13, 26, 22, 21, 20, 15, 2... | 409 | [] |
| 2 | https://link.springer.com/journal/10670/volume... | 88 | 3 | 2023 | March | [Epistemic Collaborativeness as an Intellectua... | [https://link.springer.com/article/10.1007/s10... | [https://link.springer.com/content/pdf/10.1007... | 22 | [Alkis Kotsonis, Rafael Ventura, Borut Trpin, ... | [Pages: 869 - 884, Pages: 885 - 905, Pages: 90... | [15, 20, 20, 21, 13, 25, 20, 20, 22, 21, 20, 3... | 464 | [] |
| 3 | https://link.springer.com/journal/10670/volume... | 88 | 2 | 2023 | February | [How to Interpret Belief Hierarchies in Bayesi... | [https://link.springer.com/article/10.1007/s10... | [https://link.springer.com/content/pdf/10.1007... | 22 | [Cyril Hédoin, Søren Overgaard, Joey Pollock, ... | [Pages: 419 - 440, Pages: 441 - 455, Pages: 45... | [21, 14, 20, 30, 18, 18, 15, 11, 17, 22, 19, 1... | 418 | [] |
| 4 | https://link.springer.com/journal/10670/volume... | 88 | 1 | 2023 | January | [Semantic Relativism and Logical Implication, ... | [https://link.springer.com/article/10.1007/s10... | [https://link.springer.com/content/pdf/10.1007... | 22 | [Leonid Tarasov, Sim-Hui Tee, Daniel Coren, Th... | [Pages: 1 - 21, Pages: 23 - 41, Pages: 43 - 65... | [20, 18, 22, 27, 20, 28, 25, 15, 2, 20, 19, 17... | 384 | [] |
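
The `|` separators visible in `paper_authors` come from `add_space_between_letters` in Part 1, which splits wherever a lowercase letter is immediately followed by an uppercase one. The sketch below replays that same regular expression on two made-up author strings (the names and the helper name `separate_fused_authors` are illustrative and not taken from the data) to show both the intended effect and a side effect worth knowing about.

```python
import re


def separate_fused_authors(text):
    """Insert ' | ' wherever a lowercase letter is directly followed by an uppercase one."""
    return re.sub(r"([a-z])([A-Z])", r"\1 | \2", text)


# Two fused names are separated as intended
print(separate_fused_authors("Ada LovelaceAlan Turing"))  # Ada Lovelace | Alan Turing

# A surname with an internal capital is split as well
print(separate_fused_authors("Ian McEwan"))  # Ian Mc | Ewan
```

If author-level analysis matters, surnames with internal capitals may therefore need a manual check after scraping.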


The number of Erkenntnis issues: 276

In [34]:
len(Erkenntnis)

276

The total number of pages for Erkenntnis: 49,413

In [84]:
sum(Erkenntnis['page_counts_per_issue'])

49413

# PART 2: A Function to Extract English and Mathematical Contents<a class ='author' id='part2'></a>

The following function 
1. turns a given PDF on the hard drive into a set of images, 
2. scans all the images using the Mathpix API and turns them into text, and 
3. deletes the images from the hard drive.

Mathpix API is a service provided by Mathpix, a company that specializes in optical character recognition (OCR) technology for mathematics. The API allows developers to integrate the Mathpix OCR functionality into their own applications and systems.

The Mathpix API enables you to extract text and mathematical equations from images, providing accurate and reliable OCR capabilities specifically designed for math-related content. By sending an image containing mathematical expressions, equations, or other mathematical notations to the Mathpix API, you can receive the corresponding text representation or even the LaTeX code for the equation.

The API supports various input formats, including images in common formats like PNG and JPEG, as well as PDF files. It utilizes advanced machine learning algorithms to accurately recognize and parse mathematical content, making it useful for applications such as educational tools, equation editors, math-related software, and more.

With the Mathpix API, you can automate the process of extracting mathematical information from images, saving time and effort in manual transcription. It simplifies the integration of math OCR capabilities into your own applications, allowing you to leverage the power of Mathpix's OCR technology.

I have incorporated the Mathpix API in the following function to accommodate journals and papers with a lot of advanced mathematical notation and symbols. An example of a journal with this type of technical philosophy is the Journal of Philosophical Logic. In the case of journals with less mathematical content, there is no need to use the Mathpix API; you can simply use a free Python library such as `PyPDF2`.
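
For the less mathematical journals, the `PyPDF2` route mentioned above could look roughly like the following. This is a minimal sketch and not the method used in this notebook: the function names `join_page_texts` and `pdf_to_text` are hypothetical, and it assumes a PyPDF2 version that provides `PdfReader` (2.0 or later). The import sits inside the function so that the helper can be tried even where PyPDF2 is not installed.

```python
def join_page_texts(page_texts):
    """Join per-page strings into one text, skipping empty or missing pages."""
    return "\n".join(text.strip() for text in page_texts if text and text.strip())


def pdf_to_text(file_path):
    """Extract plain text from a PDF with PyPDF2 (no OCR, no formula recognition)."""
    from PyPDF2 import PdfReader  # third-party: pip install PyPDF2

    reader = PdfReader(file_path)
    return join_page_texts(page.extract_text() for page in reader.pages)


# Assumed usage (not executed here), with a PDF already on disk:
# text = pdf_to_text("s11229-023-04114-5.pdf")
```

Plain extraction like this is free and fast, but formulas usually come out garbled, which is the reason for the Mathpix-based function below.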

In [2]:
import requests
from pdf2image import convert_from_bytes
from pdf2image import convert_from_path
from PIL import Image
import os
import json
import pandas as pd

def pdf_to_images_to_text(file_path):
    
    """
    This function extracts English text and mathematical formulas from a PDF file:
    - It turns the PDF into a series of images.
    - Then turns the images into text using the Mathpix API, in order to extract the mathematical content of the PDF.
    - Finally, the function deletes the temporary images from the hard drive. This helps with space efficiency.
    
    Later we will use this function to extract the text from technical philosophy journals with lots of mathematical content.
    
    For journals with less technical content we can use a free Python library such as PyPDF2.
    """
    
    # Convert the PDF file into a list of images
    pages = convert_from_path(file_path, dpi=300)
    num_pages = len(pages)  # Number of pages in the PDF

    # Each page is already a separate PIL image; copy them into a working list
    new_images = list(pages)

    # Downscale and save the new images
    for i, image in enumerate(new_images):
        # Resize the image to reduce file size
        size = (image.width // 2, image.height // 2)
        image = image.resize(size, resample=Image.Resampling.BICUBIC)

        # Save the image in JPEG format to reduce file size
        # The quality can be raised if need be. This seems good enough for Mathpix: it produces images of about 100kb or less
        # which can still be well captured by Mathpix.
        image.save(f'page_{i}.jpg', format='JPEG', quality=24)

    # Turn images into text using Mathpix API
    text = ''
    for i in range(num_pages):
        # Open each image in a context manager so the file handle is closed
        # before the image is deleted further below
        with open(f"page_{i}.jpg", "rb") as image_file:
            r = requests.post("https://api.mathpix.com/v3/text",
                              files={"file": image_file},
                              data={
                                  "options_json": json.dumps({
                                      "math_inline_delimiters": ["$", "$"],
                                      "math_block_delimiters": ["$$", "$$"],
                                      "rm_spaces": True
                                  })
                              },
                              headers={
                                  "app_id": "<YOUR APP ID>",
                                  "app_key": "<YOUR APP KEY>"
                              }
                              )
        response = r.json()
        text += response['text']

    # Delete the temporary images
    for i, image in enumerate(new_images):
        os.remove(f'page_{i}.jpg')

    return text
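
Rather than pasting the Mathpix credentials into the notebook, they can be read from environment variables. The following is a small sketch under assumptions: the variable names `MATHPIX_APP_ID` and `MATHPIX_APP_KEY` and the helper `mathpix_headers` are hypothetical choices for illustration, not names defined by Mathpix or by this notebook.

```python
import os


def mathpix_headers(env=os.environ):
    """Build the request headers from environment variables (hypothetical variable names)."""
    try:
        return {
            "app_id": env["MATHPIX_APP_ID"],
            "app_key": env["MATHPIX_APP_KEY"],
        }
    except KeyError as missing:
        # Fail early with a readable message instead of sending an unauthenticated request
        raise RuntimeError(f"Environment variable {missing.args[0]} is not set") from None


# Assumed usage inside pdf_to_images_to_text (not executed here):
# r = requests.post("https://api.mathpix.com/v3/text", ..., headers=mathpix_headers())
```

This keeps the keys out of the notebook file, which matters if the notebook is shared or pushed to a public repository.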

An example of how this function works:

In [7]:
pdf_to_images_to_text(r'.\s11229-023-04114-5.pdf')

'Synthese (2023) 201:141\nhttps://doi.org/10.1007/s11229-023-04114-5\nORIGINAL RESEARCH\nStructured propositions and a semantics for unrestricted impure logics of ground\nAmirhossein Kiani ${ }^{1}$\nReceived: 19 August 2022 / Accepted: 24 February 2023\n๑ The Author(s), under exclusive licence to Springer Nature B.V. 2023\nAbstract\nI show that the assumption of highly structured propositions can be leveraged to provide a unified semantics for various propositional logics of impure ground in a very expressive and flexible way. It is shown, in particular, that the induced models are capable of capturing an infinitude of grounding facts that follow from unrestricted logics of ground, but, due to certain artificial restrictions, are left unaccounted for by the existing semantics in the literature. It is also shown that our models, unlike the ones in the literature, are easily extendable to capture certain distinct views about iterated as well as identity grounding.\nKeywords Ground · Imp

# PART 3: Scraping the contents of the papers<a class ='author' id='part3'></a>
This code 
1. logs into Springer, 
2. downloads pdfs from all issues of the desired journal's data frame from PART 1,
3. applies the function from PART 2 to each pdf, 
4. saves the text into the journal's dataframe, and 
5. deletes the PDFs from the hard drive.

In [None]:
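from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import os
import time
import pandas as pd

# Pick your favorite journal from the list of 28 journals scraped earlier and load the
# dataframe pickled in PART 1 (for example 'Erkenntnis.pkl'). A pickle preserves list
# columns such as `pdfs`, which a CSV round trip would turn into plain strings.
journal_name = pd.read_pickle("<YOUR JOURNAL NAME>.pkl")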

# Download directory for PDFs
directory = r"<YOUR LOCAL DIRECTORY>"

# Set up the Selenium WebDriver and navigate to the login page
options = Options()
options.add_experimental_option('prefs', {
    'plugins.always_open_pdf_externally': True,
    'download.default_directory': directory,
    # Set the desired download directory above
    # Ensure that the specified directory exists
})
driver = webdriver.Chrome(options=options)
driver.get("<YOUR LOG-IN LINK TO SPRINGER>")

# Manually log in to the website and wait for the user to press Enter
input("Press Enter after you have manually logged in to the Springer website so the scraper can begin working...")

# Navigate to the desired page

# Extract the URLs from all rows and iterate over them
for row in range(len(journal_name)):
    for url in journal_name['pdfs'].iloc[row]:
        driver.get(url)
        # The PDF will be automatically downloaded to the specified directory
        time.sleep(5)

        # List all files in the directory
        files = os.listdir(directory)

        # Convert each downloaded PDF to text, store the text, then delete the PDF
        for file in files:
            if file.endswith(".pdf"):
                file_path = os.path.join(directory, file)
                text = pdf_to_images_to_text(file_path)
                journal_name.at[row, 'pdf_texts'].append(text)
                os.remove(file_path)


# Close the browser
driver.quit()
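
The loop above waits a fixed 5 seconds for each download, which may be too short for large files or slow connections. A possible alternative is to poll the download directory until a finished PDF appears. The sketch below is an illustration only: the function name `wait_for_pdf` and its parameters are hypothetical, and it relies on Chrome marking unfinished downloads with a `.crdownload` suffix.

```python
import os
import time


def wait_for_pdf(directory, timeout=60.0, poll_interval=0.5):
    """Return the path of a finished PDF in `directory`, or raise TimeoutError.

    A download counts as finished when a .pdf file exists and no
    .crdownload (in-progress Chrome download) file is left in the directory.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory)
        in_progress = any(name.endswith(".crdownload") for name in names)
        pdfs = sorted(name for name in names if name.lower().endswith(".pdf"))
        if pdfs and not in_progress:
            return os.path.join(directory, pdfs[0])
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No finished PDF appeared in {directory!r}")
        time.sleep(poll_interval)


# Assumed usage in place of time.sleep(5) (not executed here):
# driver.get(url)
# file_path = wait_for_pdf(directory)
```

Because the loop deletes each PDF after processing it, the directory holds at most one PDF at a time, so returning the first match is sufficient in this setting.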