# DBLP-Scholar
The task is to build an entity resolution algorithm to link papers found on DBLP and Google Scholar.  This paper summarises the algorithm that was built and its effectiveness.  The [companion paper](./DBLP-Scholar%20Algorithm%20Development%20Details.ipynb) gives much more detail and describes how the algorithm was developed.

### Background
The [benchmark_datasets_for_entity_resolution](https://dbs.uni-leipzig.de/en/research/projects/object_matching/fever/benchmark_datasets_for_entity_resolution) collection comes from the Database Group at the University of Leipzig. It contains two paper collections (`DBLP1` and `Scholar`), each of which records the following attributes for every paper:

- title
- authors
- venue
- year

The goal is to create an algorithm to link the papers from one collection to the other. Also included in the dataset is a file "DBLP-Scholar_perfectMapping.csv" which represents the gold standard mapping for this problem.

### Algorithm
The papers are normalised as follows: 

| Attribute | Normalisation
|:----------|:------------
| Title | Replace every character that is not a letter or digit with whitespace; convert uppercase to lowercase
| Authors | Standardise each name to have a single uppercase letter initial followed by a last name
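
The two normalisation rules could be sketched roughly as follows. This is only an illustration, not the code used for the results: the function names `normalise_title` and `normalise_author` are hypothetical, and the sketch assumes that runs of whitespace are collapsed and that names are written given name first.

```python
import re

def normalise_title(title: str) -> str:
    """Lowercase the title and replace non-alphanumeric characters with spaces."""
    lowered = title.lower()
    # [\W_] matches every character that is neither a letter nor a digit
    spaced = re.sub(r"[\W_]", " ", lowered)
    # Assumption: runs of whitespace are collapsed and the ends are trimmed
    return " ".join(spaced.split())

def normalise_author(name: str) -> str:
    """Reduce a name to an uppercase first initial followed by the last name."""
    # Assumption: the name is written given name first, e.g. "Michael Stonebraker"
    parts = re.findall(r"[^\W\d_]+", name)  # alphabetic tokens only
    if not parts:
        return ""
    if len(parts) == 1:
        return parts[0].capitalize()
    return f"{parts[0][0].upper()} {parts[-1].capitalize()}"
```

Under these assumptions, "Editor's Notes" becomes "editor s notes", which is the form the generic titles take in the matching code.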

Every paper in the `DBLP1` collection is compared with every paper in the `Scholar` collection, and a pair is deemed equivalent if it satisfies either of the following criteria:

- (1) (a) the titles match exactly, and (b) the title is not one of the generic titles such as "Keynote address", or
- (2) (a) the years match exactly if they are present in both papers, (b) the authors match, (c) the number of title words the two papers share exceeds half the average number of words in their titles, and (d) at least one of the two titles is not generic.

In the case of 2(b), I only consider the first three authors; both papers must list the same number of authors among these, and a small amount of spelling variation is accepted (an edit distance of less than 3 per name).  Note that I'm completely ignoring the venue attribute of papers, as I found that the data was very messy and added little to the quality of the matching.
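
To make the spelling tolerance concrete, here is a minimal sketch of a Levenshtein edit distance. The `edit_distance` function actually used is described in the companion paper; the `levenshtein` and `names_agree` functions below are hypothetical stand-ins and may differ from it in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def names_agree(x: str, y: str, limit: int = 3) -> bool:
    """Accept two normalised names if their distance is below the limit."""
    return levenshtein(x, y) < limit
```

With this sketch, "M Stonebraker" and "M Stonebreaker" are one edit apart and are accepted. A fixed limit is generous for very short names: "J Li" and "J Wu" are two edits apart and would also be accepted.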

All this is detailed more precisely in Python code as follows:

In [5]:
from top import *

bad_titles = {
    "editorial", "keynote address", "reminiscences on influential papers",
    "editor s notes", "guest editorial", "book review column",
    "guest editor s introduction", "title"
    }

def match_authors(paper1, paper2):
    # Compare at most the first three authors of each paper
    p1 = paper1.authors[:3]
    q1 = paper2.authors[:3]
    if len(p1) != len(q1):
        return False
    # Each pair of names must have an edit distance of less than 3
    return all(edit_distance(x, y) < 3 for x, y in zip(p1, q1))

@matcher
def match(paper1, paper2):
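    return (
        (
            # (1) identical, non-generic titles
            paper1.title == paper2.title and
            paper1.title not in bad_titles
        ) or (
            # (2a) years agree, or at least one year is missing
            (paper1.year == paper2.year or paper1.year == "" or paper2.year == "") and
            # (2b) first three authors agree up to small spelling differences
            match_authors(paper1, paper2) and
            # (2c) shared title words exceed half the average title word count
            (len(paper1.words & paper2.words) > 0.5 * (len(paper1.words)+len(paper2.words))*0.5) and
            # (2d) at least one of the two titles is not generic
            (paper1.title not in bad_titles or paper2.title not in bad_titles)
        )
    )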
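
The word-overlap test in the second criterion is the least obvious part of the code above, so here is a small standalone sketch of it. It assumes that `words` is the set of whitespace-separated tokens of the normalised title; the helpers `title_words` and `enough_overlap` are hypothetical names used only for illustration.

```python
def title_words(title: str) -> set[str]:
    """Assumption: a paper's words are the distinct tokens of its normalised title."""
    return set(title.split())

def enough_overlap(title1: str, title2: str) -> bool:
    """True if the shared words exceed half the average word count of the two titles."""
    w1, w2 = title_words(title1), title_words(title2)
    average_size = (len(w1) + len(w2)) / 2
    return len(w1 & w2) > 0.5 * average_size

# 3 shared words; the average size is (5 + 3) / 2 = 4, so the threshold is 2
print(enough_overlap("efficient query processing in databases",
                     "efficient query processing"))
# 1 shared word; the average size is 2, so the threshold is 1 and the test is strict
print(enough_overlap("data mining", "data warehousing"))
```

The first call prints `True` and the second prints `False`, because the comparison is strict: sharing exactly half of the average word count is not enough.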

### Results
The dataset was split into two parts as follows.  `DBLP1` (2616 papers) was not touched, but `Scholar` (64263 papers) was divided into training (the first 32000 papers) and test (the remaining 32263 papers).  The gold standard contains pairs of paper ids where the first id comes from `DBLP1` and the second from `Scholar`.  It was split accordingly: a pair whose `Scholar` paper is in the training set goes into the training gold standard; every other pair goes into the test gold standard.

In [6]:
train = dataset[:32000]
test = dataset[32000:]
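
The slicing above is handled by the framework. The gold standard split described in the text could look roughly like the following sketch, in which the function `split_gold` and the toy ids are hypothetical.

```python
def split_gold(gold_pairs, train_scholar_ids):
    """Route each (dblp_id, scholar_id) pair by where its Scholar paper lives."""
    train_gold, test_gold = [], []
    for dblp_id, scholar_id in gold_pairs:
        if scholar_id in train_scholar_ids:
            train_gold.append((dblp_id, scholar_id))
        else:
            test_gold.append((dblp_id, scholar_id))
    return train_gold, test_gold

# Toy example with made-up ids: s1 and s2 are training papers, s9 is a test paper
gold = [("d1", "s1"), ("d2", "s9"), ("d3", "s2")]
train_gold, test_gold = split_gold(gold, {"s1", "s2"})
```

Because `DBLP1` is shared by both splits, the same DBLP paper can appear in both gold standards if it has matches on each side.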

The papers were matched from the `DBLP1` collection to the `Scholar` collection as described, and the resulting pairs were compared with the gold standard pairings.  The $F_1$ score of the matching algorithm was then computed on each split.
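
For reference, $F_1$ is the harmonic mean of precision and recall over the predicted pairs. The sketch below shows the standard calculation on sets of id pairs. It is an assumption about what the framework's `calc` reports, not its actual implementation, and the function name `f1_score` is hypothetical.

```python
def f1_score(predicted_pairs, gold_pairs) -> float:
    """Harmonic mean of precision and recall for predicted vs. gold pairs."""
    predicted, gold = set(predicted_pairs), set(gold_pairs)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0  # also covers empty predictions or an empty gold standard
    precision = true_positives / len(predicted)  # share of predictions that are right
    recall = true_positives / len(gold)          # share of gold pairs that were found
    return 2 * precision * recall / (precision + recall)

# Toy example: 2 of 3 predictions are correct, and 2 of 4 gold pairs are found
predicted = {("d1", "s1"), ("d2", "s2"), ("d3", "s3")}
gold = {("d1", "s1"), ("d2", "s2"), ("d4", "s4"), ("d5", "s5")}
score = f1_score(predicted, gold)  # precision 2/3, recall 1/2, F1 = 4/7
```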

In [7]:
train.calc(match)



In [8]:
test.calc(match)



Whilst the aim of this task was to code an entity resolution algorithm rather than to provide a comparison with previous work, it is nonetheless worth considering other results. ["Evaluation of entity resolution approaches on real-world match problems" (Köpcke et al. 2010)](https://dbs.uni-leipzig.de/file/EvaluationOfEntityResolutionApproaches_vldb2010_CameraReady.pdf) reports results for this dataset.  The authors consider several algorithms, a mixture of machine learning and non-learning approaches, and report results for methods that use one attribute and methods that use two.  The $F_1$ scores (in %) that they report are shown below:

Algorithm | 1-Attr | 2-Attr
:------|-----|------
COSY | 84.5 | 82.9
FEBRL FellegiSunter | 57.2 | 81.9
PPJoin+ | 77.8 | - 
FEBRL SVM comb. | 81.9 | 87.6
MARLIN ADTree comb. | 82.6 | 82.9
MARLIN SVM comb. | 82.6 | 89.4

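Against this backdrop, an $F_1$ score of around 85% is reasonable: it is on a par with the best 1-attribute result in the table and sits between the weaker and the stronger 2-attribute results.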

### Conclusion

An algorithm for entity resolution was developed.  It is relatively simple to describe, and is somewhat parsimonious.  The main contributions of this work probably lie in the simple framework which was developed to perform this analysis.  The code and development strategy are given to aid transparency and reproducibility; this Jupyter notebook presentation allows the reader to experiment further.

### Future Work
Whilst a reasonably good algorithm has been developed, there are many more things that could be tried.  For example:

- It was noticed that there were a number of cases where one or more of the last few authors' names got mapped onto the title.  This caused problems with both the title matching and with matching the authors' names.
- Sometimes the title appears to have extraneous text appended to it.  Comparisons could be made by trimming away the final part of the title where the lengths don't match.
- The titles often contained many insertions, deletions and substitutions, making them ripe for assessment using the `edit_distance()` metric described in the companion paper.  However, experiments showed that this added nothing beyond the criterion that half the words of the title must match, perhaps because of the two factors above.  If those were controlled for, then perhaps it would work better.
- There are now a couple of parameters present (e.g. author spelling `edit_distance`, fraction of words in the title to be matched).  Whilst a parsimonious model has been sought, such parameters could be tuned.  Indeed, a more complicated model could score various attributes and weight them to determine a match, rather than taking a more binary approach.
- More effort could be put into trying to normalise the description of the venues.  If this were done to a high level, then perhaps more could be gained from mapping them.  Indeed, one might expect that in many cases, an accurate year and venue might be enough to uniquely match papers.
- Finally, I believe that it would be worth re-examining the gold standard, as some of its matches appear unlikely (e.g. titles or authors that seem completely different).
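
As an illustration of the weighted alternative mentioned in the list above, the following sketch scores each attribute between 0 and 1 and combines the scores with weights. Everything in it is hypothetical: the `Paper` class, the weights, the neutral score for a missing year and the use of `difflib.SequenceMatcher` are placeholders that would need tuning on the training set, and none of it was used for the results reported here.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Paper:
    title: str
    authors: tuple
    year: str = ""

# Hypothetical weights; in practice these would be tuned on the training split
WEIGHTS = {"title": 0.6, "authors": 0.3, "year": 0.1}

def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()

def match_score(p: Paper, q: Paper) -> float:
    """Weighted combination of per-attribute similarities."""
    title_score = similarity(p.title, q.title)
    author_score = similarity(" ".join(p.authors[:3]), " ".join(q.authors[:3]))
    if p.year == "" or q.year == "":
        year_score = 0.5  # assumption: a missing year is treated as neutral
    else:
        year_score = 1.0 if p.year == q.year else 0.0
    return (WEIGHTS["title"] * title_score
            + WEIGHTS["authors"] * author_score
            + WEIGHTS["year"] * year_score)
```

A pair would then be declared a match when its score exceeds a threshold, which becomes one more parameter to tune.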