# Substituting Zotero scannable citations with Pandoc Markdown citations

This notebook shows a way to take a Markdown document containing Zotero scannable citations and convert them into Pandoc Markdown citation syntax, using citation keys that point to the same items in a Zotero database.

In a nutshell, from:

    { e.g. | Barnes, 1985 | p. 42 | |zu:1589851:BVF6UVWN}

To:

    [e.g. @barnesScience1985, p. 42]

The code that does most of the work is in the module [`citesub.py`](citesub.py).

In [1]:
import json
import re
from citesub import SCAN_RX, subst, parse_scannable

## Input documents

Besides the Markdown document being processed here, we also need a bibliography dataset that contains both Zotero's internal key and the cite key for each item. This dataset can be produced in CSL JSON format by exporting a library or collection from Zotero, but I needed some custom processing with the [Better BibTeX plugin](http://retorque.re/zotero-better-bibtex/).

To get `test-bibliography.json` I added the following JavaScript snippet to [BBT's advanced export configuration](http://retorque.re/zotero-better-bibtex/exporting/scripting/):

    if (Translator.BetterCSL) {
      reference["zoteroKey"] = item.key;
    }

That code adds Zotero's internal key (e.g. `BVF6UVWN`) to each JSON object in the export.
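To illustrate why that extra field matters, here is a minimal sketch of the lookup it enables, using a hand-written stand-in for the export rather than the real file. It assumes that each CSL JSON object carries its cite key in `id` and the added field in `zoteroKey`; the helper name `citekey_index` is hypothetical and is not part of `citesub.py`.

```python
import json

# Hand-written stand-in for two entries of the exported bibliography.
# Assumption: "id" holds the cite key; "zoteroKey" is the field added
# by the export snippet above.
SAMPLE_EXPORT = """
[
  {"id": "barnesScience1985", "type": "book", "zoteroKey": "BVF6UVWN"},
  {"id": "kuhnStructureScientificRevolutions1962", "type": "book", "zoteroKey": "UUBEC8PU"}
]
"""

def citekey_index(bib_items):
    """Map Zotero item keys to cite keys, skipping entries that lack a zoteroKey."""
    return {item["zoteroKey"]: item["id"] for item in bib_items if "zoteroKey" in item}

index = citekey_index(json.loads(SAMPLE_EXPORT))
print(index["BVF6UVWN"])  # barnesScience1985
```

With an index like this, the Zotero key at the end of a scannable cite is enough to find the matching cite key.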

In [2]:
with open("test/scrivener-zotero-scannable.md", "r") as mdfile:
    md_lines = mdfile.readlines()

with open("test/test-bibliography.json", "r") as bibfile:
    bib_data = json.load(bibfile)

The rest of this notebook is thinking aloud, so feel free to [skip to the end](#Process-the-file-and-generate-output).

## Finding scannable cites in the text

Find citation groups (or individual citations, if there's just one at a time). Print the number of groups found in each line of text.
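As a rough idea of what "group" means here, the sketch below treats a run of directly adjacent scannable cites as one group. The pattern `GROUP_RX_SKETCH` is an assumption made for illustration; the actual `SCAN_RX` in `citesub.py` may well be written differently.

```python
import re

# Hypothetical patterns, not the SCAN_RX from citesub.py.
# One cite: braces containing a "|zu:..." URI as the last field.
ONE_CITE = r"\{[^{}]*\|zu:[^{}|]*\}"
# One group: one or more cites with nothing in between.
GROUP_RX_SKETCH = re.compile(f"(?:{ONE_CITE})+")

text = (
    "Claim one { | Kuhn, 1962 | | |zu:1589851:UUBEC8PU}. "
    "Claim two { e.g. | Barnes, 1985 | | |zu:1589851:BVF6UVWN}"
    "{ | Barnes, 1982 | | |zu:1589851:VMPBFGAN}."
)

groups = GROUP_RX_SKETCH.findall(text)
cites_per_group = [len(re.findall(ONE_CITE, group)) for group in groups]
print(len(groups), cites_per_group)  # 2 [1, 2]
```

Requiring the `|zu:` field keeps the sketch from matching other uses of braces in Markdown, such as header attributes.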

In [3]:
for line in md_lines:
    found = re.findall(SCAN_RX, line)
    if found:
        print(len(found), found)

2 ['{ | Wikipedia contributors, 2019 | | |zu:1589851:DEX3JSWI}', '{ e.g. | Barnes, 1985 | | |zu:1589851:BVF6UVWN}{ | Barnes, 1982 | | |zu:1589851:VMPBFGAN}{ | Kuhn, 1962 | | |zu:1589851:UUBEC8PU}']
2 ['{ see | Bouchout Declaration, 2014 | p. 1 | |zu:1589851:AP5ZSMVI}', '{ | Bowker, 2000 | | |zu:1589851:MHZGZRMK}']
1 ['{ | National Center for Biotechnology Information (NCBI), no date | | |zu:1589851:RR759BZU}']
1 ['{ again... | -Wikipedia contributors, 2019 | | |zu:1589851:DEX3JSWI}']
1 ['{ | Loy, 2006 | | |zu:1589851:G27C99S2}']


## Getting relevant parameters of each citation

Each citation will need to be processed on its own, but they must end up in the same groups that they started in. I'll use these parameters:

- Prefix, e.g. "*see* Barnes, 1985"
- Author name suppression, as in "According to Barnes (1985)..."
- Locator, e.g. "p. 42"
- The Zotero unique key

I don't use suffixes in my writing, so that field is ignored.
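A minimal sketch of how a single scannable cite could be split into these parameters is shown below. The function `parse_one` is hypothetical and only approximates what `parse_scannable` does; the dictionary keys (`pre`, `noauth`, `loc`, `key`) mirror the ones this notebook prints. It assumes exactly five `|`-separated fields and that a leading `-` on the author field signals author suppression.

```python
def parse_one(cite):
    """Split one scannable cite into its parameters (illustrative only)."""
    # Drop the surrounding braces, then split the five fields:
    # prefix | author-date label | locator | suffix | URI
    fields = [field.strip() for field in cite.strip("{}").split("|")]
    prefix, label, locator, _suffix, uri = fields  # ValueError if not 5 fields
    return {
        "pre": prefix,
        "noauth": "-" if label.startswith("-") else "",
        "loc": locator,
        # The Zotero item key is the last colon-separated part of the URI.
        "key": uri.rsplit(":", 1)[-1],
    }

print(parse_one("{ see | Bouchout Declaration, 2014 | p. 1 | |zu:1589851:AP5ZSMVI}"))
# {'pre': 'see', 'noauth': '', 'loc': 'p. 1', 'key': 'AP5ZSMVI'}
```

A prefix or locator that itself contains a `|` would break this simple split, which is one reason a real implementation may prefer a regex.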

In [4]:
for line in md_lines:
    found = re.findall(SCAN_RX, line)
    if found:
        print([parse_scannable(cite) for cite in found])

[{'pre': '', 'noauth': '', 'loc': '', 'key': 'DEX3JSWI'}, [{'pre': 'e.g.', 'noauth': '', 'loc': '', 'key': 'BVF6UVWN'}, {'pre': '', 'noauth': '', 'loc': '', 'key': 'VMPBFGAN'}, {'pre': '', 'noauth': '', 'loc': '', 'key': 'UUBEC8PU'}]]
[{'pre': 'see', 'noauth': '', 'loc': 'p. 1', 'key': 'AP5ZSMVI'}, {'pre': '', 'noauth': '', 'loc': '', 'key': 'MHZGZRMK'}]
[{'pre': '', 'noauth': '', 'loc': '', 'key': 'RR759BZU'}]
[{'pre': 'again...', 'noauth': '-', 'loc': '', 'key': 'DEX3JSWI'}]
[{'pre': '', 'noauth': '', 'loc': '', 'key': 'G27C99S2'}]


## Transforming scannable cites to Pandoc MD syntax

Besides reformatting the citation into the new [syntax](https://rmarkdown.rstudio.com/authoring_pandoc_markdown.html%23raw-tex#citations), there are two extra operations here:

- Using that bibliography JSON export, look up the citekey for each item
- Collect all the Zotero item keys into a set, so that I know which items were cited in this document
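The two operations can be sketched as follows. This is not the `subst` function from `citesub.py`: `format_cite`, `format_group`, and the small `CITEKEYS` mapping are hypothetical stand-ins, and the input dictionaries are assumed to have the `pre`, `noauth`, `loc`, and `key` fields shown earlier.

```python
# Hypothetical stand-in for a Zotero-key-to-cite-key lookup table.
CITEKEYS = {
    "BVF6UVWN": "barnesScience1985",
    "DEX3JSWI": "wikipediacontributorsWikipedia2019",
}

def format_cite(parsed, citekeys):
    """Render one parsed cite as 'prefix -@citekey, locator' (parts optional)."""
    text = f"{parsed['noauth']}@{citekeys[parsed['key']]}"
    if parsed["pre"]:
        text = f"{parsed['pre']} {text}"
    if parsed["loc"]:
        text = f"{text}, {parsed['loc']}"
    return text

def format_group(parsed_cites, citekeys, seen):
    """Join a group of cites into one Pandoc citation and record the keys used."""
    seen.update(cite["key"] for cite in parsed_cites)
    return "[" + "; ".join(format_cite(c, citekeys) for c in parsed_cites) + "]"

seen_keys = set()
group = [
    {"pre": "e.g.", "noauth": "", "loc": "p. 42", "key": "BVF6UVWN"},
    {"pre": "again...", "noauth": "-", "loc": "", "key": "DEX3JSWI"},
]
rendered = format_group(group, CITEKEYS, seen_keys)
print(rendered)
# [e.g. @barnesScience1985, p. 42; again... -@wikipediacontributorsWikipedia2019]
```

Passing the set in as an argument, rather than returning it, lets one set accumulate keys across every line of the document.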

In [5]:
collected_citations = set()

for line in md_lines:
    found = re.findall(SCAN_RX, line)
    if found:
        print([subst(parse_scannable(cite), collected_citations, bib_data) for cite in found])
        
print(collected_citations)

['[@wikipediacontributorsWikipedia2019]', '[e.g. @barnesScience1985; @barnesKuhnSocialScience1982; @kuhnStructureScientificRevolutions1962]']
['[see @bouchoutdeclarationBouchoutDeclarationOpen2014, p. 1]', '[@bowkerBiodiversityDatadiversity2000]']
['[@nationalcenterforbiotechnologyinformationncbiPubChem]']
['[again... -@wikipediacontributorsWikipedia2019]']
['[@loyMusimathicsMathematicalFoundations2006]']
{'UUBEC8PU', 'VMPBFGAN', 'DEX3JSWI', 'RR759BZU', 'AP5ZSMVI', 'MHZGZRMK', 'G27C99S2', 'BVF6UVWN'}


## Substituting citations line by line

The regex that I use to detect scannable cite groups also matches one character after the last cite ends, so the substitution function has to put that character back after the new citation.
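The effect is easy to see with a toy pattern. In the sketch below, `TRAILING_RX` is a hypothetical regex that, like the one described, consumes one extra character after a cite group; the callback splits that character off and appends it again. The `{a}`-style cites are simplified placeholders, not real scannable cites.

```python
import re

# Hypothetical pattern: a run of {...} items plus ONE following character.
TRAILING_RX = re.compile(r"(?:\{[^{}]*\})+[^{]")

def replace_keep_last(match):
    """Replace the cite group but retain the extra character that was matched."""
    text = match.group(0)
    group, last = text[:-1], text[-1]
    return f"[{group.count('{')} cite(s)]" + last

line = "Open data {a}{b}---but not uniform {c}.\n"
result = TRAILING_RX.sub(replace_keep_last, line)
print(result, end="")
# Open data [2 cite(s)]---but not uniform [1 cite(s)].
```

Without the `+ last`, the first dash and the final period would silently disappear from the output.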

In [6]:
def sub_citations(matched):
    text = matched.group(0)
    return subst(parse_scannable(text[:-1]), collected_citations, bib_data) + text[-1]

First, check that the substitutions are correct. There should be the same number of matches as above.

In [7]:
collected_citations = set()

for line in md_lines:
    found = re.findall(SCAN_RX, line)
    if found:
        new = re.subn(SCAN_RX, sub_citations, line)
        print(new[1], new[0])

2 All statements should have citations [@wikipediacontributorsWikipedia2019]. Some statements need more than one citation [e.g. @barnesScience1985; @barnesKuhnSocialScience1982; @kuhnStructureScientificRevolutions1962].

2 All biodiversity data should be open [see @bouchoutdeclarationBouchoutDeclarationOpen2014, p. 1]---but not necessarily in one uniform organizing system [@bowkerBiodiversityDatadiversity2000].

1 Chemical data: same same, but different [@nationalcenterforbiotechnologyinformationncbiPubChem].

1 For more knowledge, see Wikipedia [again... -@wikipediacontributorsWikipedia2019].

1 [^fn1]: Waves are circles [@loyMusimathicsMathematicalFoundations2006].


In [8]:
collected_citations

{'AP5ZSMVI',
 'BVF6UVWN',
 'DEX3JSWI',
 'G27C99S2',
 'MHZGZRMK',
 'RR759BZU',
 'UUBEC8PU',
 'VMPBFGAN'}

## Process the file and generate output

In [9]:
def subst_line(line):
    # re.sub returns the line unchanged when nothing matches,
    # so no separate findall check is needed.
    return re.sub(SCAN_RX, sub_citations, line)

In [10]:
collected_citations = set()

substituted = (subst_line(line) for line in md_lines)

with open("test/citation-substitution.md", "w") as outfile:
    for line in substituted:
        outfile.write(line)
        print(line, end="")

---
Title: Test Project  
Author: Akos Kokai  
reference-section-title: References  
csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa.csl
---

# Big subject #

All statements should have citations [@wikipediacontributorsWikipedia2019]. Some statements need more than one citation [e.g. @barnesScience1985; @barnesKuhnSocialScience1982; @kuhnStructureScientificRevolutions1962].

## Little subject ##

All biodiversity data should be open [see @bouchoutdeclarationBouchoutDeclarationOpen2014, p. 1]---but not necessarily in one uniform organizing system [@bowkerBiodiversityDatadiversity2000].

Chemical data: same same, but different [@nationalcenterforbiotechnologyinformationncbiPubChem].

### Insignificant subject ###

For more knowledge, see Wikipedia [again... -@wikipediacontributorsWikipedia2019].

## Another subject ##

This section has a footnote[^fn1] and a list.

* First list item
* Second list item


[^fn1]: Waves are circles [@loyMusimathicsMathematic

In [11]:
collected_citations

{'AP5ZSMVI',
 'BVF6UVWN',
 'DEX3JSWI',
 'G27C99S2',
 'MHZGZRMK',
 'RR759BZU',
 'UUBEC8PU',
 'VMPBFGAN'}