crossref docs: https://github.com/CrossRef/rest-api-doc#readme
crossref py: https://github.com/fabiobatalha/crossrefapi
crossref R: https://github.com/ropensci/rcrossref


APIs used list: https://www.crossref.org/  

Possible next step: importing the data into Zotero?

In [1]:
pip install arxiv


Collecting arxiv
  Downloading arxiv-2.0.0-py3-none-any.whl (11 kB)
Collecting feedparser==6.0.10 (from arxiv)
  Downloading feedparser-6.0.10-py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.1/81.1 kB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
Collecting sgmllib3k (from feedparser==6.0.10->arxiv)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... done
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6049 sha256=6b470b4dff683741dda9c90cb63488bde78eead4e2c2592125e6b9801ba45386
  Stored in directory: /root/.cache/pip/wheels/f0/69/93/a47e9d621be168e9e33c7ce60524393c0b92ae83cf6c6e89c5
Successfully built sgmllib3k
Installing collected packages: sgmllib3k, feedparser, arxiv
Successfully installed arxiv-2.0.0 feedparser-6.0.10 sgmllib3k-1.0.0


In [2]:
pip install crossrefapi

Collecting crossrefapi
  Downloading crossrefapi-1.6.0-py3-none-any.whl (14 kB)
Collecting urllib3==1.26.16 (from crossrefapi)
  Downloading urllib3-1.26.16-py2.py3-none-any.whl (143 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.1/143.1 kB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: urllib3, crossrefapi
  Attempting uninstall: urllib3
    Found existing installation: urllib3 2.0.7
    Uninstalling urllib3-2.0.7:
      Successfully uninstalled urllib3-2.0.7
Successfully installed crossrefapi-1.6.0 urllib3-1.26.16


In [3]:
from urllib.request import urlopen

In [4]:
from crossref.restful import Works, Etiquette

In [5]:
import arxiv

In [6]:
import json

In [7]:
import pandas as pd

In [8]:
from tqdm import tqdm

In [11]:
# Creates an Etiquette object that identifies this project by name, version, URL and contact email.
# Passing it to Works() later on should place our requests in Crossref's "polite" pool of users,
# where the API connection should be more reliable. If our script is found to cause problems
# for the server, Crossref can email us so that we can address them.
my_etiquette = Etiquette("Data Access Mandate", "0.1", "https://www.adalovelaceinstitute.org/our-work/programmes/future-regulation/", "bmaj3035@gmail.com")
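
For context, Crossref's documentation describes the polite pool as being selected by supplying a contact email with each request, for example as a `mailto` query parameter. The sketch below builds such a request URL by hand with the standard library only. It is an illustration of the idea, not how crossrefapi works internally (the library is expected to send the etiquette details for us), and `build_polite_url` and the example email address are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_polite_url(query, mailto, rows=5):
    """Return a Crossref /works URL that carries a contact email (hypothetical helper)."""
    params = {"query": query, "rows": rows, "mailto": mailto}
    # urlencode escapes spaces and the "@" sign so the URL is safe to send.
    return CROSSREF_WORKS_URL + "?" + urlencode(params)


url = build_polite_url("access to data", "name@example.org")
print(url)

# Round-trip check: the parameters survive URL encoding unchanged.
parsed = parse_qs(urlsplit(url).query)
print(parsed["mailto"])
```

Nothing is sent over the network here; the URL is only built and parsed back.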

In [12]:
class return_articles():
  def __init__(self, query_term, year):

    # The query term and year are stored on the instance so that every method of this class
    # can use them without having them passed in again.
    # The year is kept as a string because it is joined onto dates and file names later on.
    self.query_term = query_term
    self.year = str(year)

  def generate_output_list(self):
    print("--- Generating Unfiltered Output ------------------")

    # Creates an empty list.
    output_list = []

    # Creates a Works() instance, and passes the etiquette argument we created above.
    works = Works(etiquette=my_etiquette)

    # Creates a start and end date for the range we want to cover, always beginning on the 1st of January and ending on the 31st of December.
    start_date = self.year + "-01-01"
    end_date = self.year + "-12-31"

    # Runs a query using the query_term we entered, and filters for journal articles whose online publication date falls within the year we entered.
    # This gives us an iterable of records, each representing one article and holding information such as the article's title, DOI, or publisher.
    query = works.query(self.query_term).filter(type="journal-article", from_online_pub_date=start_date, until_online_pub_date=end_date)

    # Prints the total number of returned items.
    print("Total number of query results: " + str(query.count()))

    # Appends each item in the query variable to the empty list we created earlier.
    for item in tqdm(query):
      output_list.append(item)

    # The function then returns the output_list when the function's code is complete.
    return output_list

  # This function requires us to pass the list of articles the previous function outputted.
  def generate_output_titles(self, input_list):
    print("--- Generating Filtered Output --------------------")

    # Creates an empty list.
    output_json = []

    # The comparison is case-insensitive, so the query term is lower-cased once here.
    term = self.query_term.lower()

    # Looks through all of the entries in the input list.
    for i in input_list:

      # "title" is a list, which may be empty, so we only read the first title if there is one.
      titles = i.get("title") or []
      title = titles[0].lower() if titles else ""

      # "abstract" is a single string; records without one are treated as having an empty abstract.
      abstract = (i.get("abstract") or "").lower()

      # The item is added once if the query term appears in either the title or the abstract.
      if term in title or term in abstract:
        output_json.append(i)

    # Function then returns the list.
    return output_json


  # This function requires us to pass the list of elements outputted by the previous function and a list of the things we want our eventual CSV to hold about each article.
  # The elements have to be written in the same way they appear in the data returned to us by the API (case sensitive).
  def generate_output_df(self, filtered_json, required_elements):
    print("--- Generating Dataframe --------------------------")

    # Creates empty list.
    dict_list = []

    # Loops through all the articles in the input list.
    for article in filtered_json:

      # Creates a dictionary where each item in required_elements becomes a key. Therefore, if required_elements looked like this: ["title", "DOI", "publisher"]
      # the edited_json would start out like this: {"title": None, "DOI": None, "publisher": None}.
      edited_json = dict.fromkeys(required_elements)

      # We loop through the required elements list.
      for j in required_elements:

        # If the element is found in the article data we copy it into its position in edited_json.
        # Otherwise we store the placeholder "no " + element name (for example "no abstract").
        edited_json[j] = article.get(j, "no " + j)

      # For each article we append the edited_json to the empty list created at the top of this function.
      dict_list.append(edited_json)

    # Convert the list into a Pandas Data Frame to make converting into a CSV easier.
    df = pd.DataFrame.from_records(dict_list)

    # Convert the Data Frame into a CSV. This is done for each individual year. Therefore, if our connection to the server crashes we will still have CSVs
    # for all the years we queried before the crash.
    df.to_csv(self.query_term + " " + self.year + " df.csv", index = False)

    # Returns dataframe.
    return df

  # Requires us to input a list of Data Frames
  def output_final_df(self, df_list):
    print("--- Generating CSV -------------------------------")

    # Stacks all of the input Data Frames on top of each other.
    combined_df = pd.concat(df_list, ignore_index = True)

    # Adds the "https://doi.org/" prefix to each DOI value, turning it into a usable link straight away.
    combined_df["DOI"] = combined_df["DOI"].apply(lambda x: "https://doi.org/" + x)

    # Exports the combined Data Frame as a CSV.
    combined_df.to_csv(self.query_term + " combined.csv", index = False)

    # Returns combined Data Frame.
    return combined_df
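
To see what the title/abstract filter does without calling the API, the following sketch applies the same idea to three hand-written records. The records, their DOIs and the helper name `mentions_phrase` are made up for illustration; real Crossref records carry many more fields.

```python
def mentions_phrase(record, phrase):
    """Return True if phrase occurs in the record's first title or its abstract."""
    # "title" is a list in the records used above; an absent or empty list counts as no title.
    titles = record.get("title") or []
    title = titles[0].lower() if titles else ""
    abstract = (record.get("abstract") or "").lower()
    phrase = phrase.lower()
    return phrase in title or phrase in abstract


# Hypothetical sample data, not real articles.
sample_records = [
    {"DOI": "10.0000/example.1", "title": ["Access to Data for Researchers"]},
    {"DOI": "10.0000/example.2", "title": ["Unrelated topic"],
     "abstract": "We study access to data held by platforms."},
    {"DOI": "10.0000/example.3", "title": []},
]

kept = [r["DOI"] for r in sample_records if mentions_phrase(r, "access to data")]
print(kept)  # ['10.0000/example.1', '10.0000/example.2']
```

The first record matches on its title, the second on its abstract, and the third is skipped because it has neither.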

In [13]:
# A quick function that creates a list of years as strings. It requires us to input a starting and an ending year (both inclusive).
def generate_years(min_years, max_years):

  # range() stops one short of its second argument, so we add 1 to include max_years.
  # Each year is converted to a string because return_articles builds dates and file names from it.
  # If max_years is smaller than min_years the result is simply an empty list.
  return [str(year) for year in range(min_years, max_years + 1)]

# Creates years list.
years = generate_years(2010,2023)

print(years)

# Creates empty list.
df_list = []

['2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023']


In [None]:
# Loops through all the items in the year list.
for i in years:
  print("Generating for: " + i)

  # Creates an instance of the return_articles class, and passes the query and year arguments.
  articles_gen = return_articles("access to data", i)

  # Generates articles based on the entered query and the current year.
  unfiltered_json = articles_gen.generate_output_list()

  # Filters articles ensuring only ones with the key phrase in the title or the abstract are returned.
  filtered_json = articles_gen.generate_output_titles(unfiltered_json)

  # Prints how many filtered results are outputted.
  print(str(len(filtered_json)) + " filtered results in the year: " + i)

  # Creates a Data Frame of the filtered articles and the required data.
  output_df = articles_gen.generate_output_df(filtered_json, ["title", "DOI", "publisher","abstract"])

  # Appends the output_df to the empty list.
  df_list.append(output_df)

Generating for: 2010
--- Generating Unfiltered Output ------------------
Total number of query results: 17798


17798it [08:35, 34.55it/s]


--- Generating Filtered Output --------------------
8 filtered results in the year: 2010
--- Generating Dataframe --------------------------
Generating for: 2011
--- Generating Unfiltered Output ------------------
Total number of query results: 17296


17296it [07:53, 36.53it/s]


--- Generating Filtered Output --------------------
16 filtered results in the year: 2011
--- Generating Dataframe --------------------------
Generating for: 2012
--- Generating Unfiltered Output ------------------
Total number of query results: 20643


20643it [10:27, 32.88it/s]


--- Generating Filtered Output --------------------
20 filtered results in the year: 2012
--- Generating Dataframe --------------------------
Generating for: 2013
--- Generating Unfiltered Output ------------------
Total number of query results: 22259


22259it [09:29, 39.10it/s]


--- Generating Filtered Output --------------------
10 filtered results in the year: 2013
--- Generating Dataframe --------------------------
Generating for: 2014
--- Generating Unfiltered Output ------------------
Total number of query results: 28250


16201it [06:44, 46.52it/s]

In [None]:
# Combines all of the Data Frames into one and exports it as a CSV.
final_df = articles_gen.output_final_df(df_list)
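
The log above ends partway through 2014, and df_list only lives in memory. Because generate_output_df writes one CSV per year, a combined table could also be rebuilt from those files if the session is lost. The following is a minimal sketch of that idea using only the standard library. The helper name `combine_year_csvs` is hypothetical, and it assumes the files follow the "<query term> <year> df.csv" naming used above and sit in a single directory. The demo writes two tiny made-up files into a temporary folder rather than touching real output.

```python
import csv
import tempfile
from pathlib import Path


def combine_year_csvs(directory, query_term):
    """Read every '<query_term> <year> df.csv' in directory into one list of row dicts."""
    rows = []
    pattern = query_term + " * df.csv"
    # Sorting the paths keeps the years in ascending order.
    for path in sorted(Path(directory).glob(pattern)):
        with open(path, newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


# Demo with made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for year, doi in [("2010", "10.0000/example.1"), ("2011", "10.0000/example.2")]:
        path = Path(tmp) / ("access to data " + year + " df.csv")
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=["title", "DOI"])
            writer.writeheader()
            writer.writerow({"title": "Example " + year, "DOI": doi})

    combined = combine_year_csvs(tmp, "access to data")
    print(len(combined))
```

In the notebook itself the same thing could be done with pandas by reading each per-year file and concatenating the results.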