`ipprl_tools` is a Python package developed by the CU Record Linkage team as part of our research into Incremental Privacy Preserving Record Linkage (iPPRL).
This package contains utility methods for generating synthetic data for record linkage, as well as functions to calculate common 'linkability' metrics, which allow a user to assess the utility of their data fields for record linkage tasks.
The package can be installed using `pip`, or used without installation by cloning this repository and importing `ipprl_tools` locally.
To use `ipprl_tools` locally:
- Clone this repository and create a virtual environment in the repository root directory (the directory containing this `README.md` file) using `python3 -m venv venv`.
- Activate your virtual environment using `source venv/bin/activate` on Unix, or `venv/Scripts/activate.ps1` on Windows.
- With the virtual environment active, install dependencies using `pip install -r requirements.txt`.
- Create a script or notebook in the repository root directory which imports `ipprl_tools`.
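As a quick sanity check after the steps above, a script placed in the repository root can confirm that Python is able to locate the package before you write any real code. This is only a minimal sketch: the helper name `is_importable` is hypothetical and not part of `ipprl_tools`.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate a top-level module on the current import path."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_importable("ipprl_tools"):
        print("ipprl_tools found; imports should work from this directory.")
    else:
        print("ipprl_tools not found; run this script from the repository root.")
```

If the check fails, the most likely causes are running the script from a different directory or forgetting to activate the virtual environment.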
You can test functionality using the `demo.ipynb` Jupyter Notebook (Jupyter should already be installed as a dependency from `requirements.txt`) or the `demo.py` script, which will download some data and calculate metrics.
To install the package with `pip`, run:
pip install git+https://github.com/cu-recordlinkage/ipprl_tools
The synthetic data tools can be found at:
ipprl_tools.synthetic.py
and imported into your Python code using:
from ipprl_tools import synthetic
We recommend viewing or running the tutorial Jupyter Notebook, which can be found at:
ipprl_tools/docs/ipprl_tools Tutorial Notebook.ipynb
If you would just like to run the linkability metrics on your own data, you can use the `run_metrics.py` script:
python3 run_metrics.py --input_file=<your CSV data> --output_file=<output file path>
to calculate all available metrics and write the result to a file.
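If you would rather launch the script from Python (for example, to process several files in a loop), the same command can be assembled and run with `subprocess`. The sketch below assumes it is executed from the directory containing `run_metrics.py`; the helper names `build_metrics_command` and `run_metrics` are hypothetical, and only the `--input_file` and `--output_file` flags come from the command shown above.

```python
import subprocess
import sys
from typing import List


def build_metrics_command(input_csv: str, output_path: str) -> List[str]:
    """Assemble the run_metrics.py command line as an argument list."""
    return [
        sys.executable,  # use the interpreter from the active virtual environment
        "run_metrics.py",
        f"--input_file={input_csv}",
        f"--output_file={output_path}",
    ]


def run_metrics(input_csv: str, output_path: str) -> None:
    """Run run_metrics.py and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_metrics_command(input_csv, output_path), check=True)
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.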
This script uses the `metrics.get_metrics()` function to compute all metrics. If you would like finer control, the documentation at `ipprl_tools/docs/IPPRL_Tools_Documentation__Synthetic_Data_Corruption_Tools.pdf` lists the names of the metric functions, which may be called individually.
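As a complement to the PDF, the function names in a module can also be discovered by introspection. The sketch below uses a hypothetical helper, `public_functions`, and demonstrates it on the standard-library `json` module so that it runs without `ipprl_tools` installed; it makes no assumptions about the signatures of the metric functions themselves.

```python
import inspect
import json  # stand-in module so the sketch runs without ipprl_tools installed


def public_functions(module):
    """Return the sorted names of public Python functions found on a module."""
    return sorted(
        name
        for name, obj in inspect.getmembers(module, inspect.isfunction)
        if not name.startswith("_")
    )


print(public_functions(json))
```

With the package importable, running `from ipprl_tools import metrics` and then `public_functions(metrics)` should give a similar listing. Note that helper functions the module imports from elsewhere may appear in the list as well, so the documentation remains the reference for which names are actual metrics.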
Documentation PDF files are available at: `ipprl_tools/docs/IPPRL_tools_Documentation_*.pdf`.
If you encounter issues with the module or have a suggestion for improvement, please open an issue here or email me at: