# Table of Contents

1. Running the record linkage
2. Outputting matching numbers for use in paper
3. Results of Manually Labelling Coreference Pairs

This is a bash notebook that runs the simple coreference (record linkage) code. It assumes that you have already run the notebook that [downloads and preprocesses the data](0.%20Download%20and%20Preprocess%20Data.ipynb).

In [3]:
mkdir -p generated/matching

# Running the record linkage

The code contains several record linkage methods, but in the end it made sense to go with something simple and conservative. We say that a paper on DBLP appears on arXiv if (a) the title matches exactly, and (b) at least one of the author names matches.
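To make the rule concrete, here is a minimal Python sketch of it. This is not the code in `matching/match_cnf_arxiv.py`: the function names are made up for illustration, the records are toy data, and I am assuming that arXiv records carry `title` and `authors` fields like the DBLP records do. Any normalization the real script applies before comparing strings is not reproduced here.

```python
def is_match(dblp_paper, arxiv_paper):
    """Conservative rule: identical title AND at least one shared author name."""
    if dblp_paper["title"] != arxiv_paper["title"]:
        return False
    shared = set(dblp_paper["authors"]) & set(arxiv_paper["authors"])
    return len(shared) > 0


def find_matches(dblp_papers, arxiv_papers):
    """Return (dblp_paper, arxiv_paper) pairs that satisfy is_match."""
    # Index arXiv records by title so each DBLP paper needs one dictionary lookup.
    by_title = {}
    for art in arxiv_papers:
        by_title.setdefault(art["title"], []).append(art)
    pairs = []
    for paper in dblp_papers:
        for art in by_title.get(paper["title"], []):
            if is_match(paper, art):
                pairs.append((paper, art))
                break  # one arXiv match per DBLP paper is enough
    return pairs


# Toy records (hypothetical, not taken from the data set).
dblp = [{"title": "A Paper", "authors": ["Ada Lovelace", "Alan Turing"]},
        {"title": "Another Paper", "authors": ["Grace Hopper"]}]
arxiv = [{"title": "A Paper", "authors": ["Alan Turing"]},
         {"title": "Another Paper", "authors": ["Someone Else"]}]
matches = find_matches(dblp, arxiv)
print(len(matches))  # prints 1
```

The second toy paper has an identical title but no shared author, so it is not counted. That is the sense in which the rule is conservative.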

In [18]:
time python2 matching/match_cnf_arxiv.py \
         --arxiv-file generated/arxiv/json/arxiv_articles.json \
         --dblp-file generated/dblp/all-papers.json \
         --output-file generated/matching/all-papers-matched-titleauthor.json

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Total articles matched =  7313
Total articles checked =  82427

real	0m9.733s
user	0m8.552s
sys	0m0.890s


The script below draws two random samples:

 * 25 matched articles
 * 25 unmatched articles for which some arXiv article has Jaccard similarity > 0.5 on both title and authors

These samples are then reviewed manually to estimate the precision and recall of this poor man's matching algorithm.

This script takes a little over two hours to run (139 minutes in the run below).
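For reference, the "close" criterion can be sketched in Python as follows. This is only an illustration under assumptions of mine: titles are compared as sets of lowercased whitespace-separated words, authors as sets of full name strings, and the function names are hypothetical. The tokenization used by `matching/random_subsets.py` may differ.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections, as a float."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return float(len(a & b)) / len(a | b)


def is_close(dblp_paper, arxiv_paper, threshold=0.5):
    """True if BOTH the title and the author similarity exceed the threshold."""
    title_sim = jaccard(dblp_paper["title"].lower().split(),
                        arxiv_paper["title"].lower().split())
    author_sim = jaccard(dblp_paper["authors"], arxiv_paper["authors"])
    return title_sim > threshold and author_sim > threshold


# Toy pair (hypothetical): the titles differ in one word, one author is missing.
d = {"title": "Fast alignment of crowded subtomograms",
     "authors": ["Ada Lovelace", "Alan Turing"]}
a = {"title": "Fast alignment for crowded subtomograms",
     "authors": ["Ada Lovelace", "Alan Turing", "Grace Hopper"]}
print(is_close(d, a))  # prints True
```

Because the comparison is strict (`>`), a pair whose similarity is exactly 0.5 does not count as close.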

In [22]:
time python2 matching/random_subsets.py \
         --dblp-file generated/matching/all-papers-matched-titleauthor.json \
         --arxiv-file generated/arxiv/json/arxiv_articles.json \
         --threshold 0.5 --N 25 --seed 2045230 \
         --prefix generated/matching/subset_

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Total DBLP articles = 82427
Total matched articles = 7313
Total close articles = 470
Total non-close articles = 74644

real	139m3.476s
user	134m37.930s
sys	0m46.769s


In the above:

* "DBLP articles" is the number of articles that we extracted from DBLP
* "matched articles" is the number of DBLP articles for which we found a match on arXiv

(These two numbers should agree with the output of the previous script.)

* "close articles" are those that do not have a match on arXiv, but for which there *is* an arXiv article whose Jaccard similarity on both titles and authors is greater than the threshold
* "non-close articles" are those with no arXiv match and no arXiv article whose Jaccard similarity on both titles and authors is greater than the threshold

The three categories partition the DBLP articles, so we should have close + non-close + matched == DBLP articles.
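As a quick sanity check, the counts printed by the run above can be plugged into that identity. The numbers below are copied from the output; the variable names are mine.

```python
# Counts copied from the output of random_subsets.py above.
counts = {"matched": 7313, "close": 470, "non_close": 74644}
total_dblp = 82427

# The three categories should partition the DBLP articles.
assert sum(counts.values()) == total_dblp

# Share of each category among all DBLP articles.
for name, n in sorted(counts.items()):
    print("%-10s %6d  %5.2f%%" % (name, n, 100.0 * n / total_dblp))
```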


Before doing the manual analysis, I found it helpful to write one more script that turns the samples into the exact URLs that I needed to visit online.

This script also randomizes the order of the articles, so that while going through the list I don't know which were matched and which were merely close (although it's kind of obvious given the naming heuristic).
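The blinding step amounts to pooling the two samples and shuffling them before writing the file. Below is a rough Python sketch of that idea. It is not `matching/subsets2csv.py`: the function names and the CSV columns (`dblp_key`, `title`, `same_paper`) are hypothetical, the records are toy data, and URL construction is left out because the exact URL scheme is not shown in this notebook.

```python
import csv
import io
import random


def blind_rows(close, matched, seed=0):
    """Pool both samples and shuffle so the row order hides each row's origin."""
    rows = [(p["dblp"], p["title"]) for p in close + matched]
    random.Random(seed).shuffle(rows)  # seeded, so the order is reproducible
    return rows


def write_csv(rows, out):
    """Write one row per paper with an empty column for the human judgement."""
    writer = csv.writer(out)
    writer.writerow(["dblp_key", "title", "same_paper"])
    for key, title in rows:
        writer.writerow([key, title, ""])


# Toy records (hypothetical keys and titles).
close = [{"dblp": "conf/x/A1", "title": "Close one"}]
matched = [{"dblp": "conf/x/B2", "title": "Matched one"},
           {"dblp": "conf/x/C3", "title": "Matched two"}]
rows = blind_rows(close, matched, seed=1)
buf = io.StringIO()
write_csv(rows, buf)
print(buf.getvalue())
```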

In [28]:
python2 matching/subsets2csv.py \
    generated/matching/subset_close.json generated/matching/subset_matched.json  \
    > data/manually_labeled_coref_initial.csv

After the human judgements were added to `data/manually_labeled_coref_initial.csv`, the file was saved as `data/manually_labeled_coref.csv`.

# Outputting matching numbers for use in paper

In [30]:
head generated/matching/all-papers-matched-titleauthor.json

[
  {
    "dblp": "journals/bioinformatics/XuA13", 
    "title": "Automated target segmentation and real space fast alignment methods for high-throughput classification and averaging of crowded cryo-electron subtomograms", 
    "url": "db/journals/bioinformatics/bioinformatics29.html#XuA13", 
    "year": 2013, 
    "area": "bio", 
    "venue": "ISMB", 
    "authors": [
      "Min Xu", 


In [34]:
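# Count lines containing "title" (all records) and lines containing "arxiv" (matched records).
# grep -c counts lines, so this assumes neither word also occurs inside a title or other field.
TOTAL_DBLP=$(grep -c "title" generated/matching/all-papers-matched-titleauthor.json)
TOTAL_MATCHED=$(grep -c "arxiv" generated/matching/all-papers-matched-titleauthor.json)
echo "$TOTAL_DBLP" "$TOTAL_MATCHED"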

82427 7313
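The `grep` counts agree with the totals reported by the matching script, but they depend on the words "title" and "arxiv" not appearing anywhere else in the file. A less fragile alternative would be to parse the JSON and count records. The sketch below assumes that matched records carry a field named `arxiv`. I have not verified that key name against the real file, so treat it as a placeholder. The sample data and the identifier in it are made up.

```python
import json


def count_matched(papers, key="arxiv"):
    """Return (total records, records with a non-empty `key` field)."""
    total = len(papers)
    matched = sum(1 for p in papers if p.get(key))
    return total, matched


# Tiny in-memory stand-in for all-papers-matched-titleauthor.json.
# The "arxiv" key and its value are assumptions for illustration only.
sample = json.loads("""
[
  {"title": "A paper about arxiv usage", "authors": ["Ada Lovelace"]},
  {"title": "Another paper", "authors": ["Alan Turing"], "arxiv": "0000.00000"}
]
""")
total, matched = count_matched(sample)
print(total, matched)  # prints 2 1
```

Note that a line-based `grep -c "arxiv"` would count both records in this toy sample, because the first title happens to contain the word.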


In [48]:
echo \\newcommand{\\ntotaldblp}{$TOTAL_DBLP\\xspace} \
     \\newcommand{\\nmatched}{$TOTAL_MATCHED\\xspace} \
     \\newcommand{\\nunmatched}{$(( $TOTAL_DBLP - $TOTAL_MATCHED ))\\xspace} > figures/number_of_matched_papers.tex
     

# Results of Manually Labelling Coreference Pairs

After manually labelling the file `data/manually_labeled_coref.csv`, 48 of the 50 pairs in the file turned out to be the same paper. For the two pairs that were not, it was clear that the titles were not string-identical. Therefore:

 * Of the "exact match" papers in `generated/matching/subset_matched.json`, 25/25 of the (arXiv eprint, DBLP publication) pairs that were matched by the heuristic were in fact the same paper.
 * Of the "close match" papers in `generated/matching/subset_close.json`, 23/25 of the (arXiv eprint, DBLP publication) pairs that exceeded the similarity threshold without being matched by the heuristic were in fact the same paper.
 
 
The 74644 non-close DBLP articles are those for which no arXiv eprint has Jaccard similarity greater than 0.5. Let's assume that none of them have arXiv eprints. Of the 470 close articles, approximately 92% (23/25) have arXiv eprints, or about 432 in total (0.92 × 470 ≈ 432).

This yields

*  Estimated precision (P) of heuristic = 25/25 = 100%
*  Estimated recall (R) of heuristic = 7313 / (7313 + 432) = 94%
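
The arithmetic behind these two estimates can be reproduced with a few lines of Python. All inputs are the counts reported earlier in this notebook; the variable names are mine, and the estimate rests on the assumption stated above that none of the non-close articles have arXiv eprints.

```python
matched = 7313            # DBLP articles matched by the heuristic
close = 470               # unmatched articles with a close arXiv article
matched_sample = (25, 25) # (judged same paper, sample size) for matched pairs
close_sample = (23, 25)   # (judged same paper, sample size) for close pairs

# Precision: fraction of sampled heuristic matches that were correct.
precision = matched_sample[0] / float(matched_sample[1])

# Estimated number of close articles that do have an arXiv eprint,
# i.e. true matches that the heuristic missed.
missed = int(round(close * close_sample[0] / float(close_sample[1])))

# Recall: true matches found / (true matches found + true matches missed).
recall = matched / float(matched + missed)

print("precision = %.0f%%" % (100 * precision))
print("missed    = %d" % missed)
print("recall    = %.1f%%" % (100 * recall))
```

With samples of only 25 pairs each, both figures are rough estimates rather than tight measurements.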
 