# Generating the data set

In this notebook, we generate the data set that contains abstracts from six influential Springer journals in philosophy.

In [2]:
import requests
import unidecode
import collections

from crossref.restful import Journals, Works
from bs4 import BeautifulSoup
import pandas as pd

## Scraping DOIs using CrossRef API

DOIs are unique identifiers of journal papers. We first use the CrossRef API to retrieve the DOI, title, first author's name, publication type, and publication year of each article in the relevant journals.

In [3]:
# Synthese e-ISSN
synthese_issn = "1573-0964"
# Philosophical studies e-ISSN
ps_issn = "1573-0883"
# Philosophy and Technology e-ISSN
pt_issn = "2210-5441"
# Erkenntnis e-ISSN
erk_issn = "1572-8420"
# JPL e-ISSN
jpl_issn = "1573-0433"
# Minds and Machines e-ISSN
mm_issn = "1572-8641"

# Put into list
issns = [synthese_issn, ps_issn, pt_issn, erk_issn, jpl_issn, mm_issn]

In [3]:
# Initialize Journals and Works object for crossref API calls
journals = Journals()
works = Works()

# Initialize empty list for entries
entries = []

# Get all DOIs and further metadata from each journal
for issn in issns:
    # Look up the journal title once per journal rather than once per article
    journal = journals.journal(issn).get("title")
    # Iterate through journal specific publications
    for i, article in enumerate(journals.works(issn)):
        try:
            # Extract metadata (first author only)
            title = unidecode.unidecode(article["title"][0])
            given_name = unidecode.unidecode(article["author"][0].get("given"))
            family_name = unidecode.unidecode(article["author"][0].get("family"))
            doi = article["DOI"]
            type_ = article["type"]
            year = article["published-print"].get("date-parts")[0][0]

            # Create list with all the entry elements
            entry = [title, given_name, family_name, doi, type_, year, journal]
            entries.append(entry)
            print(f"Entry {i}: {title}")
        except Exception:
            # Skip records that lack a title, author, or print publication date
            print(f"Could not extract entry {i}")
    
# Turn into dataframe
df = pd.DataFrame(entries, columns=["Title","First name","Last name","DOI","Type","Year","Journal"])

# Save to CSV
#df.to_csv("data/crossref_dois.csv")

Entry 0: A hundred years later: The rise and fall of Frege's influence in language theory
Entry 1: Defending virtue epistemology: epistemic dependence in testimony and extended cognition
Entry 2: A notorious affair called exportation
Entry 3: A Theory of Belief for Scientific Refutations
Entry 4: Marcus, Kripke, and the origin of the new theory of reference
Entry 5: Het Wonder
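
The loop above relies on a `try`/`except` to skip incomplete records. As a minimal sketch of the same extraction without exceptions, the helper below checks each field explicitly. The function name `extract_entry` and the sample record are hypothetical, and the record shape (`title`, `author`, `published-print`) is assumed to match what the loop above reads; transliteration with `unidecode` is left out to keep the sketch dependency-free.

```python
def extract_entry(article, journal_title):
    """Return [title, given, family, doi, type, year, journal],
    or None if a required field is missing from the record."""
    titles = article.get("title") or []
    authors = article.get("author") or []
    date_parts = (article.get("published-print") or {}).get("date-parts") or []
    if not titles or not authors or not date_parts or not date_parts[0]:
        return None
    first_author = authors[0]
    return [
        titles[0],
        first_author.get("given", ""),
        first_author.get("family", ""),
        article.get("DOI"),
        article.get("type"),
        date_parts[0][0],  # the year is the first element of the first date part
        journal_title,
    ]


# Hypothetical record for illustration only (not real CrossRef data)
sample_article = {
    "title": ["An example title"],
    "author": [{"given": "Ada", "family": "Example"}],
    "DOI": "10.0000/example",
    "type": "journal-article",
    "published-print": {"date-parts": [[1999, 1]]},
}

print(extract_entry(sample_article, "Synthese"))
print(extract_entry({"title": ["No author or date"]}, "Synthese"))
```

Returning `None` for incomplete records makes the skipped cases explicit, whereas a broad `except` can also hide unrelated errors such as network failures.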


In [4]:
df = pd.read_csv("data/crossref_dois.csv")

# Check distribution over different journals
journal_names = df["Journal"].unique()
for journal in journal_names:
    print(f"{journal}: {len(df[df['Journal']==journal])}")

Synthese: 7210
Philosophical Studies: 5045
Philosophy & Technology: 480
Erkenntnis: 2315
Journal of Philosophical Logic: 1413
Minds and Machines: 705


In [5]:
# Count how often each title appears and sort by frequency
df_title_counts = df.groupby("Title").count().sort_values("Year",ascending=False)
# Store those titles that appear more than three times
duplicate_titles = df_title_counts[df_title_counts["DOI"]>3].index
# Remove any entry with such a title from the data frame
# (boolean masks are negated with ~, not with unary minus)
df = df[~df["Title"].isin(duplicate_titles)]
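
The filtering rule in the cell above (drop every entry whose title occurs more than three times) can be sketched without pandas. This is an illustration under assumptions, not the notebook's method: the function `drop_frequent_titles` and the sample rows are hypothetical, and rows are plain dictionaries rather than data frame rows.

```python
from collections import Counter


def drop_frequent_titles(rows, max_count=3):
    """Keep only rows whose title occurs at most max_count times."""
    counts = Counter(row["Title"] for row in rows)
    return [row for row in rows if counts[row["Title"]] <= max_count]


# Hypothetical rows: one title repeated four times, one unique title
rows = [{"Title": "Repeated title", "DOI": f"10.0000/rep{i}"} for i in range(4)]
rows.append({"Title": "A unique paper", "DOI": "10.0000/unique"})

kept = drop_frequent_titles(rows)
print([row["Title"] for row in kept])
```

Note that the rule removes all copies of a frequent title, not just the surplus ones, which matches the behaviour of the pandas cell.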

## Retrieving Abstracts using the Springer API

In the next step, we will try to retrieve the abstract for each DOI using the Springer API.

In [6]:
# Initialize authorization for Springer API
springer_api_key = "2353c0417a34ed77a423e7c13c0af0d1"
base_url = "http://api.springernature.com/metadata/json/doi/"

# Initialize abstracts list
abstracts = []

# Loop through all DOIs in the data frame
for i, doi in enumerate(df["DOI"]):

    try:
        # Call the API and parse the response as JSON
        url = base_url+doi+"?api_key="+springer_api_key
        print(url)
        r = requests.get(url, timeout=30)
        content = r.json()
        record = content.get("records")[0]

        # Check that we're only considering English language abstracts
        language = record.get("language")
        if language!="en":
            continue

        # Retrieve abstract (and title for verification)
        abstract = record.get("abstract")
        title = record.get("title")
        print(f"#{i+1}: {title}")
    except Exception:
        # Record a placeholder row; its DOI matches nothing in the later merge
        print("API call failed")
        entry = ["Error","Error","Error"]
    else:
        entry = [title, abstract, doi]

    abstracts.append(entry)

# Turn into data frame and save as CSV
abstracts_df = pd.DataFrame(abstracts,columns=["Title","Abstract","DOI"])
#abstracts_df.to_csv("data/abstracts.csv")

http://api.springernature.com/metadata/json/doi/10.1007/bf00873280?api_key=2353c0417a34ed77a423e7c13c0af0d1
#1: A hundred years later: The rise and fall of Frege's influence in language theory
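
The request and parsing steps can also be separated into two small functions, which makes them testable without network access. The following is a sketch under assumptions: the function names `build_url` and `parse_abstract` and the environment variable `SPRINGER_API_KEY` are hypothetical, and the response shape (a `records` list with `language`, `title`, and `abstract` fields) is taken from the cell above rather than from the API documentation.

```python
import os
from urllib.parse import quote, urlencode

BASE_URL = "http://api.springernature.com/metadata/json/doi/"


def build_url(doi, api_key):
    """Build the request URL, escaping the DOI and the query string."""
    return BASE_URL + quote(doi, safe="/") + "?" + urlencode({"api_key": api_key})


def parse_abstract(content, doi):
    """Return [title, abstract, doi] for an English record, else None."""
    records = content.get("records") or []
    if not records:
        return None
    record = records[0]
    if record.get("language") != "en":
        return None
    return [record.get("title"), record.get("abstract"), doi]


# Reading the key from the environment keeps it out of the notebook and its output
api_key = os.environ.get("SPRINGER_API_KEY", "")

# Hypothetical response for illustration only (not real API output)
sample_content = {
    "records": [
        {"language": "en", "title": "An example title", "abstract": "An example abstract."}
    ]
}
print(parse_abstract(sample_content, "10.0000/example"))
```

With this split, the loop body reduces to one `requests.get(build_url(doi, api_key))` call followed by `parse_abstract(r.json(), doi)`.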


## Building the final data set

In this final step we merge the data set containing the CrossRef metadata about the publications with the abstracts retrieved using the Springer API.

In [None]:
# Merge with initial DF
df_merged = pd.merge(df,
                   abstracts_df,
                   on="DOI")
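
`pd.merge` performs an inner join by default and appends the suffixes `_x` and `_y` to column names that occur in both frames, which is where `Title_x` and `Title_y` come from. The plain-Python sketch below illustrates these semantics; the function `inner_join_on_doi` and the sample rows are hypothetical and only mirror the columns used in this notebook.

```python
def inner_join_on_doi(metadata_rows, abstract_rows):
    """Pair every metadata row with every abstract row sharing its DOI."""
    abstracts_by_doi = {}
    for row in abstract_rows:
        abstracts_by_doi.setdefault(row["DOI"], []).append(row)

    merged = []
    for meta in metadata_rows:
        # Rows without a matching DOI on the other side are dropped
        for abstract in abstracts_by_doi.get(meta["DOI"], []):
            merged.append({
                "Title_x": meta["Title"],      # title from the CrossRef metadata
                "Title_y": abstract["Title"],  # title from the Springer API
                "Abstract": abstract["Abstract"],
                "DOI": meta["DOI"],
            })
    return merged


# Hypothetical rows for illustration only
metadata_rows = [
    {"Title": "Paper A", "DOI": "10.0000/a"},
    {"Title": "Paper B", "DOI": "10.0000/b"},
]
abstract_rows = [
    {"Title": "Paper A", "Abstract": "Abstract of A.", "DOI": "10.0000/a"},
    {"Title": "Error", "Abstract": "Error", "DOI": "Error"},
]

merged_rows = inner_join_on_doi(metadata_rows, abstract_rows)
print(merged_rows)
```

One consequence is that the `"Error"` placeholder rows written for failed API calls never reach the merged data set, because no CrossRef entry has the DOI `"Error"`.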

In [None]:
# Create new column which contains the title and abstract
df_merged["Text"] = df_merged["Title_x"]+" "+df_merged["Abstract"]
# Drop the index columns left over from reading the CSV files, if present
df_merged = df_merged.drop(columns=["Unnamed: 0_x", "Unnamed: 0_y"], errors="ignore")
# Delete entries without abstract
df_merged = df_merged[df_merged["Abstract"].str.len()!=0]
df_merged = df_merged.dropna(subset=["Abstract"])
# Drop duplicates
df_merged = df_merged.drop_duplicates()
# Drop corrections (negate the mask with ~ and treat missing titles as non-matches)
df_merged = df_merged[~df_merged["Title_y"].str.startswith("Correction to:", na=False)]

In [12]:
# Check distribution over different journals
journal_names = df_merged["Journal"].unique()

for journal in journal_names:
    print(f"{journal}: {len(df_merged[df_merged['Journal']==journal])}")

Synthese: 3534
Philosophical Studies: 2565
Philosophy & Technology: 398
Erkenntnis: 1370
Journal of Philosophical Logic: 915
Minds and Machines: 525


In [11]:
# Save final data frame to csv
#df_merged.to_csv("data/complete_abstract_data.csv")
print(f"Saved csv file with {len(df_merged)} abstracts.")

Saved csv file with 9307 abstracts.
