
Feature external blocks #551

Merged
merged 24 commits on Apr 30, 2020

Commits:
- 9f33397 Modify upload endpoint to take information on external encodings. (hardbyte, Apr 23, 2020)
- 4775a61 Minor: Fix api url in e2e test helper, ignore created log files when … (hardbyte, Apr 23, 2020)
- 8a94356 Simplify abort_if_inconsistent_upload and add support for external en… (hardbyte, Apr 23, 2020)
- aa37edc Add utility functions to convert to common clknblocks format. (hardbyte, Apr 23, 2020)
- 0f73c6d Catch most common error in status check during startup (hardbyte, Apr 23, 2020)
- 40be046 Modify upload endpoint to take external data, add background task to … (hardbyte, Apr 23, 2020)
- e2ced25 Use correct endpoint for testing object store uploads (hardbyte, Apr 24, 2020)
- 2b82b75 Explicitly throw a notimplemented error for blocks + external encodings (hardbyte, Apr 24, 2020)
- aff8741 Add test for pulling external block data from object store (hardbyte, Apr 27, 2020)
- fdcd7b4 Update OpenAPI to support external blocking info (hardbyte, Apr 27, 2020)
- d06dd0d Extract object store path template (hardbyte, Apr 27, 2020)
- da334dc Task and backend support added for external blocks (hardbyte, Apr 27, 2020)
- 2959d25 Move object store credential parsing into object store module (hardbyte, Apr 27, 2020)
- 48eac71 Include release in k8s selector for workers (hardbyte, Apr 29, 2020)
- de13582 Check for executable runs after handling data upload (hardbyte, Apr 28, 2020)
- 32def1b Upgrade to newer minio chart. Update open api doc (hardbyte, Apr 29, 2020)
- 8ef63f8 Add to e2e upload test, update readme. (hardbyte, Apr 29, 2020)
- 9f3a76d Extend test using external blocking data to include a run (hardbyte, Apr 29, 2020)
- 5faa528 Edit and add log statements throughout upload path (hardbyte, Apr 29, 2020)
- ee36475 Skip cleaning up object store files, when there are no files (hardbyte, Apr 29, 2020)
- c5cb901 Fix tracing on project cleanup (hardbyte, Apr 29, 2020)
- eafd850 Set the encoding size on upload (hardbyte, Apr 29, 2020)
- 5e18505 Log expected and unexpected problems during upload with different sev… (hardbyte, Apr 29, 2020)
- f99f206 Remove uploaded data from object store during cleanup (hardbyte, Apr 29, 2020)
57 changes: 39 additions & 18 deletions README.md
@@ -1,17 +1,33 @@
# Anonlink Entity Service

[![Documentation Status](https://readthedocs.org/projects/anonlink-entity-service/badge/?version=stable)](https://anonlink-entity-service.readthedocs.io/en/stable/?badge=stable)
[![Build Status](https://dev.azure.com/data61/Anonlink/_apis/build/status/data61.anonlink-entity-service?branchName=develop)](https://dev.azure.com/data61/Anonlink/_build/latest?definitionId=1&branchName=develop)

A service for performing privacy preserving record linkage. Allows organizations to carry out record linkage without disclosing personally identifiable information.
A REST service for performing privacy preserving record linkage. Allows organizations to carry out record linkage
without disclosing personally identifiable information.

Clients should use [anonlink-client](https://github.com/data61/anonlink-client/) or the [encoding-service](https://github.com/data61/anonlink-encoding-service/).
The *Anonlink Entity Service* is based on the concept of comparing *Anonymous Linking Codes* (ALC) - bit-arrays
representing an entity.

## Documentation
## Features

Project documentation including tutorials are hosted at https://anonlink-entity-service.readthedocs.io/en/stable/
- Highly scalable architecture, able to distribute work to many machines at once.
- Optimized low level comparison code (provided by [anonlink](https://github.com/data61/anonlink))
- Support for client side blocking (provided by [blocklib](https://github.com/data61/blocklib))

The [docs](./docs) folder contains the documentation source.

Data providers wanting to link their records may want to consider using
[anonlink-client](https://github.com/data61/anonlink-client/) or the
[encoding-service](https://github.com/data61/anonlink-encoding-service/).

## Documentation

Project documentation, including tutorials, is hosted at
[anonlink-entity-service.readthedocs.io](https://anonlink-entity-service.readthedocs.io/en/stable/).

## Demo

A demo deployment is available at [anonlink.easd.data61.xyz](https://anonlink.easd.data61.xyz/)

## Build

@@ -31,7 +47,8 @@ Note docker images are pushed to Docker Hub, which can be used instead of buildi
| Component | Docker Hub |
|------------------|---------|
| Backend/Worker | [data61/anonlink-app](https://hub.docker.com/r/data61/anonlink-app) |
| Nginx | [data61/anonlink-nginx](https://hub.docker.com/r/data61/anonlink-nginx) |
| E2E Tests | [data61/anonlink-test](https://hub.docker.com/r/data61/anonlink-test) |
| Nginx Proxy | [data61/anonlink-nginx](https://hub.docker.com/r/data61/anonlink-nginx) |
| Benchmark | [data61/anonlink-benchmark](https://hub.docker.com/r/data61/anonlink-benchmark) |
| Docs | [data61/anonlink-docs-builder](https://hub.docker.com/r/data61/anonlink-docs-builder) |

@@ -43,27 +60,31 @@ See the docs for more complete deployment documentation:
- [Local Deployment](./docs/local-deployment.rst)
- [Production Deployment](./docs/production-deployment.rst)

To run locally with `docker-compose`:
To test locally with `docker-compose`:

    docker-compose -f tools/docker-compose.yml up

## Testing

A simple query with curl should tell you the status of the service:

    curl localhost:8851/api/v1/status
    {
        "number_mappings": 2,
        "rate": 44051409,
        "status": "ok"
    }


### Testing with docker-compose

An additional docker-compose config file can be found in `./tools/ci.yml`;
this can be added to run tests along with the rest of the service:

    docker-compose -f tools/docker-compose.yml -f tools/ci.yml -p entityservicetest up -d
    docker-compose -f tools/docker-compose.yml -f tools/ci.yml up -d


## Citing

The Anonlink Entity Service and the wider Anonlink project are designed, developed and supported by
[CSIRO's Data61](https://www.data61.csiro.au/). If you use any part of this library in your research, please
cite it using the following BibTeX entry:

    @misc{Anonlink,
      author = {CSIRO's Data61},
      title = {Anonlink Private Record Linkage System},
      year = {2017},
      publisher = {GitHub},
      journal = {GitHub Repository},
      howpublished = {\url{https://github.com/data61/anonlink-entity-service}},
    }
1 change: 1 addition & 0 deletions backend/.dockerignore
@@ -2,3 +2,4 @@
.git/
data
.env
*.log
117 changes: 102 additions & 15 deletions backend/entityservice/api_def/openapi.yaml
@@ -321,16 +321,21 @@ paths:
  '/projects/{project_id}/clks':
    post:
      operationId: entityservice.views.project.project_clks_post
      summary: Upload encoded PII data to a linkage project.
      summary: Upload encoded data to a linkage project.
      tags:
        - Project
      description: |
        Called by each of the data providers with their calculated `CLK` vectors.
        The project must have been created, and the caller must have both the
        `project_id` and a valid `upload_token` in order to contribute data.
        Called by each data provider with their encodings and optional blocking
        information.

        The data uploaded must be of one of the following formats.
        - CLKs only upload: An array of base64 encoded [CLKs](./concepts.html#cryptographic-longterm-keys), one per
        The caller must have both the `project_id` and a valid `upload_token` in order to contribute data;
        both of these are generated when a project is created.
        This endpoint can directly accept uploads up to several hundred MiB, and can pull encoding data from
        an object store for larger uploads.

        The data uploaded must be of one of the following formats:

        - Encodings only: An array of base64 encoded [CLKs](./concepts.html#cryptographic-longterm-keys), one per
          entity.
        - CLKs with blocking information upload: An array of base64 encoded CLKs with corresponding blocking
          information. One element in this array is an array with the first element being a base64 encoded CLK followed
@@ -342,7 +347,7 @@ paths:
        The uploaded encodings must all have the same length in bytes. If the project's linkage schema
        specifies an encoding size it will be checked and enforced before any runs are computed. Note a
        minimum and maximum encoding size can be set at the server level at deployment time.
        Currently anonlink requires this _encoding size_ to be a multiple of 8. An example value is 128 Bytes.
        Currently anonlink requires this _encoding size_ to be a multiple of 8. A common value is `128 Bytes`.

        Note in the default deployment the maximum request size is set to `~10 GB`, which __should__
        translate to just over 20 million entities.
@@ -352,6 +357,12 @@ paths:
        This endpoint can be used with the Content-Type: application/json and uses the `CLKUpload`
        structure of a JSON array of base64 encoded strings.
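
        For example, a minimal sketch of such a JSON upload (placeholder values, not real encodings):

        ```json
        {"clks": ["<base64 encoded CLK>", "<base64 encoded CLK>"]}
        ```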

        ### Object Store Upload

        `encodings` and `blocks` can be pulled from an object store. `encodings` must be in the binary format
        documented under the `/projects/{project_id}/binaryclks` endpoint. `blocks` must be a JSON file, comprising
        a mapping of encoding identifiers to a list of block identifiers; both identifiers must be strings.
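
        A minimal sketch of such a `blocks` file, matching the `BlockMap` schema below (identifiers are
        illustrative only):

        ```json
        {
          "1": ["block1", "block2"],
          "2": [],
          "3": ["block1"]
        }
        ```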

        ### Binary Upload

        An additional api endpoint (/projects/{project_id}/binaryclks) has been added for uploading CLKs as a binary
@@ -361,12 +372,13 @@ paths:
        - $ref: '#/components/parameters/project_id'
        - $ref: '#/components/parameters/token'
      requestBody:
        description: the encoded PII
        description: Data to upload
        required: true
        content:
          application/json:
            schema:
              oneOf:
                - $ref: '#/components/schemas/EncodingUpload'
                - $ref: '#/components/schemas/CLKUpload'
                - $ref: '#/components/schemas/CLKnBlockUpload'
      responses:
@@ -394,8 +406,8 @@ paths:
      tags:
        - Project
      description: |
        An experimental api for uploading CLKs as a binary file. This is to allow for
        faster and more efficient data transfer.
        An experimental api for directly uploading CLKs as a binary file. You may instead want to
        upload via an object store.
        Called by each of the data providers with their calculated `CLK` vectors.
        The project must have been created, and the caller must have both the
        `project_id` and a valid `upload_token` in order to contribute data.
@@ -1081,17 +1093,92 @@ components:
      required:
        - number

    EncodingUpload:
      description: Object that contains one data provider's encodings
      type: object
      required: [encodings]
      properties:
        encodings:
          oneOf:
            - $ref: '#/components/schemas/EncodingArray'
            - $ref: '#/components/schemas/ExternalData'
        blocks:
          oneOf:
            - $ref: '#/components/schemas/BlockMap'
            - $ref: '#/components/schemas/ExternalData'
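    # A hypothetical EncodingUpload request body in which both the encodings and
    # the blocks are pulled from an object store (bucket and object names are
    # illustrative only):
    #
    #   {
    #     "encodings": {"file": {"bucket": "anonlink-uploads", "path": "project-foo/encodings.bin"}},
    #     "blocks": {"file": {"bucket": "anonlink-uploads", "path": "project-foo/blocks.json"}}
    #   }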
    EncodingArray:
      description: Array of encodings, base64 encoded.
      type: array
      items:
        type: string
        format: byte
        description: Base64 encoded CLK data


    BlockMap:
      description: Blocking information for encodings. A mapping from encoding id (a string) to a list of block ids.
      type: object
      additionalProperties:
        type: array
        items:
          type: string
          description: Block ID
      example:
        "1": ["block1", "block2"]
        "2": []
        "3": ["block1"]

    ExternalData:
      description: A file in an object store.
      type: object
      required: [file]
      properties:
        credentials:
          type: object
          required: [AccessKeyId, SecretAccessKey]
          description: |
            Optional credentials to pull the file from the object store.

            Not required if using the Anonlink Entity Service's own object store.
          properties:
            AccessKeyId:
              type: string
            SecretAccessKey:
              type: string
            SessionToken:
              type: string
        file:
          type: object
          required: [bucket, path]
          properties:
            bucket:
              type: string
              example: anonlink-uploads
            path:
              type: string
              description: The object name in the bucket.
              example: project-foo/encodings.bin
            endpoint:
              type: string
              description: |
                Object store endpoint - usually a public endpoint for a MinIO instance deployed as part of an
                Anonlink deployment, e.g. `minio.anonlink.easd.data61.xyz`, or a public (region specific)
                endpoint for AWS S3: `s3.ap-southeast-2.amazonaws.com`.

                If not given the Anonlink Entity Service's own object store will be assumed.
              example: s3.ap-southeast-2.amazonaws.com
            secure:
              type: boolean
              default: true
              description: If this object store should be connected to only over a secure connection.
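
    # A sketch of an ExternalData value using temporary credentials for a
    # non-default object store endpoint (all values are illustrative only):
    #
    #   {
    #     "credentials": {
    #       "AccessKeyId": "EXAMPLEKEYID",
    #       "SecretAccessKey": "examplesecretkey",
    #       "SessionToken": "exampletoken"
    #     },
    #     "file": {
    #       "bucket": "anonlink-uploads",
    #       "path": "project-foo/encodings.bin",
    #       "endpoint": "s3.ap-southeast-2.amazonaws.com"
    #     }
    #   }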

    CLKUpload:
      description: Object that contains this party's Bloom Filters
      type: object
      required: [clks]
      properties:
        clks:
          type: array
          items:
            type: string
            format: byte
            description: Base64 encoded CLK data
          $ref: '#/components/schemas/EncodingArray'

    CLKnBlockUpload:
      description: Object that contains this party's Bloom Filters including blocking information
43 changes: 37 additions & 6 deletions backend/entityservice/database/insertions.py
@@ -47,7 +47,7 @@ def insert_dataprovider(cur, auth_token, project_id):

def insert_blocking_metadata(db, dp_id, blocks):
    """
    Insert a new entry into the blocks table.
    Insert new entries into the blocks table.

    :param blocks: A dict mapping block id to the number of encodings per block.
    """
@@ -200,6 +200,24 @@ def update_encoding_metadata(db, clks_filename, dp_id, state):
        ])


def update_blocks_state(db, dp_id, blocks, state):
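    """
    Set the state of the given blocks for a data provider.

    :param blocks: An iterable of block names to update.
    :param state: One of 'pending', 'ready' or 'error'.
    """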
    assert state in {'pending', 'ready', 'error'}
    sql_query = """
        UPDATE blocks
        SET
            state = %s
        WHERE
            dp = %s AND
            block_name in %s
        """

    with db.cursor() as cur:
        cur.execute(sql_query, [
            state,
            dp_id,
            tuple(blocks)
        ])

def update_encoding_metadata_set_encoding_size(db, dp_id, encoding_size):
sql_query = """
UPDATE uploads
Expand All @@ -209,7 +227,7 @@ def update_encoding_metadata_set_encoding_size(db, dp_id, encoding_size):
dp = %s
"""

logger.info("Updating database with info about encodings")
logger.info(f"Updating uploads table for dp {dp_id} with encoding size ({encoding_size})")
with db.cursor() as cur:
cur.execute(sql_query, [
encoding_size,
@@ -278,6 +296,18 @@ def update_project_mark_all_runs_failed(conn, project_id):
        cur.execute(sql_query, [project_id])


def update_dataprovider_uploaded_state(conn, project_id, dp_id, state):
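    """
    Set the `uploaded` state column for a data provider within a project.
    """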
    with conn.cursor() as cur:
        sql_query = """
            UPDATE dataproviders SET
                uploaded = %s
            WHERE
                id = %s AND
                project = %s
            """
        cur.execute(sql_query, [state, dp_id, project_id])


def mark_project_deleted(db, project_id):
    with db.cursor() as cur:
        sql_query = """
@@ -328,10 +358,11 @@ def get_created_runs_and_queue(db, project_id):

def is_dataprovider_allowed_to_upload_and_lock(db, dp_id):
"""
This method returns true if the dataprovider is allowed to upload her clks.
A dataprovider is not allowed to upload clks if she has already uploaded them, or if the upload is in progress.
This method returns true if the data provider is allowed to upload their encodings.

A dataprovider is not allowed to upload clks if they has already uploaded them, or if the upload is in progress.
This method will lock the resource by setting the upload state to `in_progress` and returning `true`.
Note that the upload state can be `error`, in which case we are allowing the dataprovider to re-try uploading
Note that the upload state can be `error`, in which case we allow the dataprovider to re-try uploading
her clks not to block a project if a failure occurred.
"""
logger.debug("Setting dataprovider {} upload state to `in_progress``".format(dp_id))
@@ -348,5 +379,5 @@ def is_dataprovider_allowed_to_upload_and_lock(db, dp_id):
    elif length > 1:
        logger.error("{} rows in the table `dataproviders` are associated to the same dataprovider id {}, while each"
                     " dataprovider id should be unique.".format(length, dp_id))
        raise ValueError("Houston, we have a problem!!! This dataprovider can upload multiple times her clks.")
        raise ValueError("Houston, we have a problem!!! This dataprovider has uploaded multiple times")
    return True