Ligo an open source application that provides an easy-to-use web interface that lets analysts select among a number of data linking methods and use these in a documented, repeatable, tested, step-by-step process to deduplicate and match administrative records.
Ligo's framework for managing data linking projects is extensible to allow for the addition of new methodologies and algorithms in the completion of administrative data linking projects.
Built in Python and implemented as a desktop-capable and cloud-deployable containerized application, Ligo includes many of the latest data-linking comparison algorithms with a plugin architecture that supports the simple addition of new formulae and methods.
Currently, deterministic approaches to linking have been implemented and probabilistic methods are in alpha testing. The high-level application road-map anticipates the inclusion of machine learning extensions in future releases.
LIGO will help you:
- Identity common entities in a CSV formatted dataset [de-duplication]
- Identify common entities between two CSV formatted datasets [linking]
- Support for deterministic linking with probabilistic linking underway
- Support for multi-step projects
- Each step allows you to define a specific blocking (filtering the search space) and linking criteria
- For de-duplication, you can choose the step(s) where entities are identified (i.e., you can join multiple de-duplication steps)
- You can have multiple records representing the same entity; the software links entities (not records). At each step, only not-linked entities are evaluated.
- Comparison rules such as SoundEx, Levenshtein and Jaro-Winkler allow you to tailor matching criteria for individual fields
- Designed for handling large datasets
- Results in CSV and PDF for easy processing / viewing
- Can run in both offline workstations and cloud environments
- Supports running concurrent sessions on cloud infrastructure.
A very basic user manual has been started. Feel free to contribute to it here //docs/
If something is not behaving intuitively, it is likely a bug, and should be reported
Warning: Despite the use of Django accounts to support multi-user linking project management, Ligo is not designed to provide any security to prevent access to data or controls. The application should only be used within an access controlled desktop environment or behind a web firewall that limits access to trusted parties. Do NOT rely on user accounts to protect data.
The application has been containerized to support repeatablity in development processes and deployment - a refactoring to support leveraging scaling within a Kubernetes cluster is in the roadmap.
The following installation instructions/steps are for cases when users are pulling docker images from dockerhub instead of building any images from source code.
- Install docker-compose - https://docs.docker.com/compose/install/
- Create a subdirectory named ligo_app.
- In this subdirectory create a subdirectory named files and within files directory create a subdirectory with the name 'media'.
- Ligo expects data files and resources in certain pre-defined locations. Under the ligo_app/files/media/ subdirectory, create a subdirectory named 'linking' and another subdirectory named 'datasets'. Basically run the following command in the application root (ligo_app folder):
mkdir -p files/media/datasets files/media/linking
- Put the files that you want to de-duplicate or link in the ligo_app/files/media/datasets subdirectory.
- Add read and write permission for everyone to the ligo_app/files/ directory and corresponding subdirectories with an instruction like
chmod -R 777 ligo_app/files
- Use a browser to download this docker-compose.yml and save the file into the directory ligo_app.
- To run the application, change to the ligo_app directory where you put the docker-compose.yml file and type the following command (and press enter):
docker-compose up
- You will see some text scroll by in the window as the application starts up. Once it has finished scrolling open a web browser like Edge/Firefox/Chrome and in the address bar type http://localhost:8002 and hit enter.
- Select "sign in" from top right-hand-side and login with username:baseuser password:Pass12345678
If you are able to log in then you successfully installed the software on your system. Enjoy de-duplication and linking!
You will find notes on how to use the Ligo tool here: Ligo User Manual .
If you are not contributing to the source code of Ligo then you can easily ignore all the following sections.
The following environment variables are required for project settings:
web-app Variables | Description |
---|---|
IN_DOCKER | 1 if Dockerized, 0 otherwise |
C_FORCE_ROOT | Boolean - forces workspace from root directory |
CELERY_BROKER_URL | Example: redis://localhost:6379/0 |
You can use the env.example files as a template for creating your environment variable files.
To recreate the docker containers use the following command (due to the content of the docker-compose file the --build option in the following command has no effect):
docker-compose up --build --force-recreate
Should everything be properly configured, you can visit http://localhost:8002