An R interface for the dblink Spark application

cleanzr/dblinkR

dblinkR: An R interface for dblink

Overview

dblinkR is an R interface for dblink, an Apache Spark package for unsupervised entity resolution. dblink implements a generative Bayesian model for entity resolution called blink (Steorts 2015), with extensions proposed by Marchant et al. (2021). Unlike many entity resolution methods, dblink approximates the full posterior distribution over the linkage structure. This allows uncertainty to be propagated to analyses performed after entity resolution, and provides a framework for answering probabilistic queries about entity membership.

Installation

dblinkR is not currently available on CRAN. The latest development version can be installed from source using devtools as follows:

library(devtools)
install_github("ngmarchant/dblinkR")

Dependencies

dblinkR depends heavily on the sparklyr R interface for Apache Spark. Please refer to the sparklyr website for information about connecting to a Spark deployment.

dblinkR currently supports Spark releases in the 2.3.x series and 2.4.x series. Spark releases prior to 2.3.x are not supported.
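
As a rough sketch of what a local connection might look like, the snippet below uses sparklyr to install and connect to a Spark release in the supported 2.4.x series. The version string `2.4.3` and the variable name `spark_ver` are illustrative assumptions, not requirements of dblinkR. Spark 2.x generally expects a Java 8 runtime, so that is assumed to be available as well.

```r
library(sparklyr)

# Illustrative choice from the supported 2.4.x series (an assumption,
# not something dblinkR mandates)
spark_ver <- "2.4.3"

# Guard against versions outside the 2.3.x and 2.4.x series
stopifnot(grepl("^2\\.[34]\\.", spark_ver))

# Download a local copy of Spark (skipped if already installed)
spark_install(version = spark_ver)

# Start a local Spark session; no cluster is needed for small data sets
sc <- spark_connect(master = "local", version = spark_ver)

# Confirm which Spark version the session is actually running
spark_version(sc)

# Close the session when finished
spark_disconnect(sc)
```

For a real cluster, the `master` argument would point at the cluster manager instead of `"local"`; the sparklyr documentation covers those deployment modes.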

Example

The RLdata500 vignette demonstrates how to use dblinkR to perform entity resolution for a small synthetic data set. This example is small enough to run on a laptop (Spark cluster not required).
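
Vignettes are not built by default when installing from GitHub with devtools, so the following sketch shows one possible way to install the package with its vignettes and then browse them. The repository path is copied from the installation instructions. Whether the vignette builds cleanly on a given machine is an assumption here, since building it may require a working Spark installation and can take some time.

```r
library(devtools)

# Reinstall with vignettes built (build_vignettes is FALSE by default).
# Assumption: the vignette can be built locally; this may need Spark.
install_github("ngmarchant/dblinkR", build_vignettes = TRUE)

# Open an index of the vignettes installed with the package
browseVignettes("dblinkR")
```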

Licence

GPL-3

References

Marchant, Neil G., Andee Kaplan, Daniel N. Elazar, Benjamin I. P. Rubinstein, and Rebecca C. Steorts. 2021. “d-blink: Distributed End-to-End Bayesian Entity Resolution.” Journal of Computational and Graphical Statistics 30 (2): 406–21. https://doi.org/10.1080/10618600.2020.1825451.

Steorts, Rebecca C. 2015. “Entity Resolution with Empirically Motivated Priors.” Bayesian Analysis 10 (4): 849–75. https://doi.org/10.1214/15-BA965SI.