***To start the remote computer, you can either click 'Connect' on the top right, or just run a single cell by clicking the play button on the top left of the cell***

Run this cell to install the package needed to work with Word documents

In [1]:
!python3 -m pip install python-docx

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-docx
  Downloading python-docx-0.8.11.tar.gz (5.6 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: python-docx
  Building wheel for python-docx (setup.py) ... done
  Created wheel for python-docx: filename=python_docx-0.8.11-py3-none-any.whl size=184505 sha256=fbd8623ed08f54d22c1cff4a07beedc45de6ecfbf98f2619d8260a6db619ee9c
  Stored in directory: /root/.cache/pip/wheels/32/b8/b2/c4c2b95765e615fe139b0b17b5ea7c0e1b6519b0a9ec8fb34d
Successfully built python-docx
Installing collected packages: python-docx
Successfully installed python-docx-0.8.11


# Get DOIs from a list of citations

This takes a list of citations in a docx file and finds their DOIs using the CrossRef API. The first three cells import packages and define functions. Run them first; no output is expected.

In [2]:
import requests
from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_COLOR_INDEX

In [3]:
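def search(title, author, first_res=True):
  """Look up an article on CrossRef by title and first author.

  Returns the first result whose title matches exactly (ignoring case).
  Without an exact match, returns the top result if first_res is True,
  otherwise None.
  """
  # Pass the query as params so requests URL-encodes spaces and punctuation.
  params = {
      'query.title': title,
      'query.author': author,
      'select': 'DOI,author,title,volume,page',
  }
  response = requests.get('https://api.crossref.org/works', params=params)
  # Only consider the first 15 results.
  items = response.json()['message']['items'][:15]
  for article in items:
    # Some records have no title, so use .get() instead of indexing directly.
    if article.get('title') and article['title'][0].lower() == title.lower():
      return article

  if first_res and items:
    return items[0]

  return None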
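
The selection rule can be tried without calling the API. The following is a minimal offline sketch: `pick_article` is a hypothetical helper (it is not part of this notebook), and `items` is a hand-made stand-in shaped like the `message.items` list in a CrossRef response, with made-up DOIs.

```python
def pick_article(items, title, first_res=True, limit=15):
  """Return the exact title match among the first `limit` items.

  Falls back to the top item when first_res is True, otherwise None.
  """
  candidates = items[:limit]
  for article in candidates:
    titles = article.get('title') or []
    if titles and titles[0].lower() == title.lower():
      return article
  if first_res and candidates:
    return candidates[0]
  return None


# Hand-made stand-in for response.json()['message']['items'] (made-up DOIs).
items = [
    {'DOI': '10.0000/example.1', 'title': ['A Different Title']},
    {'DOI': '10.0000/example.2', 'title': ['An Example Title']},
]

exact = pick_article(items, 'an example title')                  # exact match wins
fallback = pick_article(items, 'no such title', first_res=True)  # top result
strict = pick_article(items, 'no such title', first_res=False)   # None

print(exact['DOI'], fallback['DOI'], strict)
```

With `first_res=False`, a citation without an exact title match gets no DOI at all, which is why the fallback is usually the more useful setting.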

In [4]:
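def get_dois(file, first_res):
  """Look up every paragraph (one citation each) of a .docx file on CrossRef."""
  articles = []
  doc = Document(file)
  for p in doc.paragraphs:
    r = p.text
    # Set defaults first so a failed lookup never reuses values
    # left over from the previous citation.
    author = r.split(',')[0]
    title = 'N/A'
    found_first_author = found_title = doi = 'N/A'
    try:
      # Expects APA style: "Surname, A. (Year). Title. Journal, ..."
      title = r.split('). ')[1].split('.')[0]
      article = search(title, author, first_res)
      found = (article['author'][0]['family'], article['title'][0], article['DOI'])
      found_first_author, found_title, doi = found
    except Exception:
      # Unparseable citation, no search result, or an incomplete record.
      pass
    # Always add one entry per paragraph so rows stay aligned with the document.
    articles.append(
        {
            'orig_first_author': author,
            'found_first_author': found_first_author,
            'orig_title': title,
            'found_title': found_title,
            'doi': doi
        }
    )

  return articles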
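
To see what the string splitting in `get_dois` extracts, here is a small sketch that applies the same two splits to made-up citations. `parse_citation` is a hypothetical helper used only for illustration, and both citations are invented.

```python
def parse_citation(text):
  """Split an APA-style citation into (first author surname, title)."""
  author = text.split(',')[0]
  title = text.split('). ')[1].split('.')[0]
  return author, title


# Made-up citations, for illustration only.
ok = 'Doe, J., & Roe, R. (2020). An example title. Journal of Examples, 1(2), 3-4.'
tricky = 'Doe, J. (2020). U.S. trends in examples. Journal of Examples, 1(2), 5-6.'

print(parse_citation(ok))      # ('Doe', 'An example title')
print(parse_citation(tricky))  # ('Doe', 'U')
```

The second case shows a limitation: a title that contains a period is cut off at that period, so such citations are more likely to come back with a mismatched title.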

This is where the DOIs are retrieved.

1. Drag your Word document into the file directory (click the file icon on the left sidebar, then drag the file over the sidebar).

2. Set the file name. Either rename your document to 'ref_list.docx' or change the file name in the code cell below (line 2) to your own file name.

3. Sometimes the citation and the search result refer to the same article, but the title is formatted slightly differently or the first author differs. If you only want results whose title matches exactly, set exact_match to 'True' (with an uppercase T) (line 3 of the code cell). I recommend keeping exact_match as 'False' (with an uppercase F) and looking through the table (or the exported CSV) yourself to check whether the DOI is correct for titles and authors that don't match. DOIs with mismatched authors/titles will also be highlighted in the new Word document.

It's completely normal for the search to take a few minutes! You know it's done when there's a green checkmark next to the cell.

In [5]:
import pandas as pd
file = 'ref_list.docx'
exact_match = False
articles = get_dois(file, not exact_match)

View the output by running the cell below.

The table has 5 columns. 'orig_first_author' and 'orig_title' are the author and title taken from the list of citations you imported. 'found_first_author' and 'found_title' are those of the article found in the CrossRef database. 'N/A' values mean the article was not found. If a DOI was found, it is in the last column.

It's important to check that the original authors/titles match the found ones! A mismatch does not always mean the wrong article was found (it could be an error in the database).

In [6]:
df = pd.DataFrame(articles)
df

Unnamed: 0,orig_first_author,found_first_author,orig_title,found_title,doi
0,Alberts,Hecht,The communicative process of drug resistance a...,Resistance to Drug Offers among College Students,10.3109/10826089209065589
1,Anderson,,"Teens, Social Media & Technology 2018",,
2,Ansell,Ansell,Effects of marijuana use on impulsivity and ho...,Effects of marijuana use on impulsivity and ho...,10.1016/j.drugalcdep.2014.12.029
3,Arnett,Arnett,Reckless driving in adolescence: ‘State’ and ‘...,Reckless driving in adolescence: ‘State’ and ‘...,10.1016/s0001-4575(97)87007-8
4,Bandura,Locke,Social Foundations of Thought and Action: A so...,Social Foundations of Thought and Action: A So...,10.2307/258004
...,...,...,...,...,...
122,Wong,Wong,Digital health technology to enhance adolescen...,Digital Health Technology to Enhance Adolescen...,10.1016/j.jadohealth.2019.10.018
123,Young,Young,Alcohol-related sexual assault victimization a...,Alcohol-Related Sexual Assault Victimization A...,10.15288/jsad.2008.69.39
124,Yuen,Yuen,Adolescent alcohol use trajectories: risk fact...,Adolescent Alcohol Use Trajectories: Risk Fact...,10.1542/peds.2020-0440
125,Zador,Zador,Alcohol-related relative risk of driver fatali...,Alcohol-related relative risk of driver fatali...,10.15288/jsa.2000.61.387
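
For a long reference list, it can help to list only the rows that need a manual check. The sketch below works on a list of dictionaries with the same keys as the table; the rows and DOIs are made up, and `needs_review` is a hypothetical helper. It compares authors and titles ignoring case and one trailing period, and prints a `https://doi.org/` link for each row to check.

```python
def drop_period(s):
  """Remove a single trailing period, if there is one."""
  return s[:-1] if s.endswith('.') else s


def needs_review(row):
  """True when a DOI was found but the author or title differs."""
  if row['doi'] == 'N/A':
    return False
  same_author = row['orig_first_author'].lower() == row['found_first_author'].lower()
  same_title = (drop_period(row['orig_title'].lower())
                == drop_period(row['found_title'].lower()))
  return not (same_author and same_title)


# Made-up rows with the same keys as the table above.
rows = [
    {'orig_first_author': 'Doe', 'found_first_author': 'Doe',
     'orig_title': 'An example title', 'found_title': 'An Example Title.',
     'doi': '10.0000/example.1'},
    {'orig_first_author': 'Doe', 'found_first_author': 'Roe',
     'orig_title': 'Another title', 'found_title': 'Another title',
     'doi': '10.0000/example.2'},
    {'orig_first_author': 'Poe', 'found_first_author': 'N/A',
     'orig_title': 'Missing title', 'found_title': 'N/A', 'doi': 'N/A'},
]

to_check = [f"https://doi.org/{row['doi']}" for row in rows if needs_review(row)]
for link in to_check:
  print(link)
```

Only the second row is listed: the first differs just in capitalization and a trailing period, and the third has no DOI to check.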


Now we'll write the DOIs into the Word document and save it as 'refs_updated.docx'. It will show up in the same place you dragged your document into (sometimes it takes a little while for files to show up). insert_space (line 2 of the code cell) determines whether to insert a space between the citation and the DOI. If the new docx file has two spaces between the citation and the inserted DOI, set it to 'False' (with an uppercase F) and run the cell again.

**IMPORTANT**: The DOIs that are highlighted in yellow MUST be double-checked. These citations returned a different first author/title from the CrossRef database and may be incorrect.
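
A doubled space presumably appears when a citation paragraph already ends with a space. As an alternative to setting the flag by hand, the separator could be chosen per citation. This is only a sketch of that idea; `doi_suffix` is a hypothetical helper and is not used by the cell below.

```python
def doi_suffix(citation_text, doi):
  """Return the text to append, adding a space only when one is needed."""
  # citation_text[-1:] is '' for an empty string, so this never raises.
  ends_with_space = citation_text[-1:].isspace()
  sep = '' if ends_with_space else ' '
  return f'{sep}doi:{doi}'


# Made-up citation text and DOI.
print(repr(doi_suffix('An example title.', '10.0000/example.1')))
print(repr(doi_suffix('An example title. ', '10.0000/example.1')))
```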

In [14]:
doc = Document(file)
insert_space = True

# Separator placed between the citation text and the appended DOI.
sep = ' ' if insert_space else ''

def drop_period(s):
  """Remove a single trailing period, if there is one."""
  return s[:-1] if s.endswith('.') else s

i = 0
miss = 0
for p in doc.paragraphs:
  if i >= len(df):
    break  # no lookup results left for the remaining paragraphs
  row = df.loc[i]
  # Only add a DOI if the citation lacks one and the lookup found one.
  if 'doi' not in p.text and row['doi'] != 'N/A':
    p.style.font.name = 'Times New Roman'
    p.style.font.size = Pt(12)
    # Compare titles ignoring case and a trailing period.
    title = drop_period(row['orig_title'].lower())
    found = drop_period(row['found_title'].lower())
    same_author = row['orig_first_author'].lower() == row['found_first_author'].lower()
    run = p.add_run(f"{sep}doi:{row['doi']}")
    if not (same_author and title == found):
      # Highlight DOIs whose author or title differs so they can be checked by hand.
      miss += 1
      run.font.highlight_color = WD_COLOR_INDEX.YELLOW
  i += 1

print(f"Unsure about the DOIs of {miss} out of {i-1} citations.")
print("Saving...")
doc.save('refs_updated.docx')
print("Saved!")

Unsure about the DOIs of 23 out of 126 citations.
Saving...
Saved!


You can also export the table of citations to a CSV file and an Excel file. They will show up in the file tab with the rest of the files.

In [None]:
df.to_csv('dois.csv')
df.to_excel('dois.xlsx')