Releases: data61/anonlink-entity-service

Version 1.15.1

01 Sep 23:57
2fc0169

Spring Cleaning Release

Dependency updates

Implemented in #687

Delete upload files on object store after ingestion

If a data provider uploads its data via the object store, we now clean up afterwards.

Implemented in #686

Fixed Record Linkage API tutorial

Adjusted to changes in the clkhash library.

Implemented in #684

Delete encodings from database at project deletion

Encodings will be deleted at project deletion, but only for projects created with this version or higher.

Implemented in #683

Version 1.15.0

23 Aug 07:31
e02c39f

Highlights

Similarity scores are deduplicated

Previously, candidate pairs that appeared in more than one block would produce more than one similarity score.
The iterator that processes similarity scores now de-duplicates them before storing them.

Implemented in: #660
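
The idea can be illustrated with a minimal sketch. This is not the service's actual iterator; the tuple layout and the function name ``deduplicate_candidate_pairs`` are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record reference: (dataset index, record index).
Record = Tuple[int, int]
# Hypothetical candidate pair: (similarity score, record A, record B).
CandidatePair = Tuple[float, Record, Record]


def deduplicate_candidate_pairs(
    pairs: Iterable[CandidatePair],
) -> Iterator[CandidatePair]:
    """Yield each (record A, record B) pair once, keeping its first score."""
    seen = set()
    for score, rec_a, rec_b in pairs:
        key = (rec_a, rec_b)
        if key in seen:
            continue  # the same pair was already produced by another block
        seen.add(key)
        yield score, rec_a, rec_b


pairs_from_blocks = [
    (0.92, (0, 1), (1, 7)),  # found in one block
    (0.81, (0, 2), (1, 3)),
    (0.92, (0, 1), (1, 7)),  # the same pair, found again in another block
]
unique_pairs = list(deduplicate_candidate_pairs(pairs_from_blocks))
```

Keeping a set of seen pairs trades memory for a single pass over the stream; other strategies (for example sorting first) would work as well.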

Provided Block Identifiers are now hashed

The user-provided block identifier is now hashed before it is stored in the database.

Implemented in: #633
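
As a rough sketch of what hashing an identifier before storage can look like: the choice of SHA-256 and the helper name ``hash_block_id`` are assumptions for illustration, since the release note does not say which hash function the service uses.

```python
import hashlib


def hash_block_id(block_id: str) -> str:
    """Return a fixed-length hex digest for a user-provided block identifier."""
    # SHA-256 is an assumption for this sketch, not the service's documented choice.
    return hashlib.sha256(block_id.encode("utf-8")).hexdigest()


hashed = hash_block_id("surname-smith")
```

Equal identifiers map to equal digests, so records can still be grouped by block without the original identifier being stored.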

Failed runs return message indicating the failure reason

The run status for a failed run now includes a message attribute with information on what went wrong.

Implemented in: #624
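
A client might surface the new attribute along these lines. Only the ``message`` attribute comes from this release note; the ``state`` key, the ``"error"`` value, and the sample payload are assumptions for this sketch.

```python
def describe_run_status(status: dict) -> str:
    """Summarise a run status payload, including the failure reason if any."""
    state = status.get("state", "unknown")  # "state" key is an assumption
    if state == "error":
        reason = status.get("message", "no message provided")
        return f"run failed: {reason}"
    return f"run is {state}"


# Hypothetical payload, for illustration only.
failed = {"state": "error", "message": "too many candidate pairs"}
summary = describe_run_status(failed)
```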

Other changes

The run status endpoint now includes total_number_of_comparisons for completed runs.
Implemented in: #651

As usual, there are lots of version upgrades; the service now uses the latest stable redis and postgresql.

Version 1.14.0

24 Feb 04:04
f94cdd1
Compare
Choose a tag to compare

Highlights

API now supports directly downloading similarity scores from the internal object store

If the request includes the header RETURN-OBJECT-STORE-ADDRESS, the response will be a small JSON payload with
temporary download credentials to pull the binary similarity scores directly from the object store. The JSON object
contains the credentials and the object location::

    {
      "credentials": {
        "AccessKeyId": "",
        "SecretAccessKey": "",
        "SessionToken": "",
        "Expiration": "<ISO 8601 datetime string>"
      },
      "object": {
        "endpoint": "<config.DOWNLOAD_OBJECT_STORE_SERVER>",
        "secure": "<config.DOWNLOAD_OBJECT_STORE_SECURE>",
        "bucket": "bucket_name",
        "path": "path"
      }
    }

The binary file is serialized using anonlink.serialization. You can convert the stream into Python types with::

    import anonlink.serialization
    from minio import Minio

    mc = Minio(file_info['endpoint'], ...)
    candidate_pair_stream = mc.get_object(file_info['bucket'], file_info['path'])
    sims, (dset_is0, dset_is1), (rec_is0, rec_is1) = anonlink.serialization.load_candidate_pairs(candidate_pair_stream)
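
The elided client arguments can be derived from the JSON payload. The following is a sketch under assumptions: the helper name ``minio_client_kwargs`` is hypothetical, the sample values are placeholders, and the handling of ``secure`` as either a boolean or a string is a guess about how the value is serialized.

```python
def minio_client_kwargs(payload: dict) -> dict:
    """Map the download payload onto keyword arguments for minio.Minio."""
    creds = payload["credentials"]
    obj = payload["object"]
    secure = obj["secure"]
    if isinstance(secure, str):
        # Assumption: a string value such as "true" means TLS is enabled.
        secure = secure.strip().lower() in ("true", "1", "yes")
    return {
        "endpoint": obj["endpoint"],
        "access_key": creds["AccessKeyId"],
        "secret_key": creds["SecretAccessKey"],
        "session_token": creds["SessionToken"],
        "secure": bool(secure),
    }


# Placeholder payload, for illustration only.
payload = {
    "credentials": {
        "AccessKeyId": "example-key",
        "SecretAccessKey": "example-secret",
        "SessionToken": "example-token",
        "Expiration": "2021-02-24T16:04:00Z",
    },
    "object": {
        "endpoint": "minio.example.com:9000",
        "secure": "false",
        "bucket": "bucket_name",
        "path": "path",
    },
}
kwargs = minio_client_kwargs(payload)
# With the minio package installed: mc = Minio(**kwargs)
```

Because the credentials are temporary, a client should request a fresh payload once the ``Expiration`` time has passed.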

The following settings control the optional feature of using an external object store:

======================================= ==========================================
Environment Variable                    Helm Config
======================================= ==========================================
DOWNLOAD_OBJECT_STORE_SERVER            anonlink.objectstore.downloadServer
DOWNLOAD_OBJECT_STORE_SECURE            anonlink.objectstore.downloadSecure
DOWNLOAD_OBJECT_STORE_ACCESS_KEY        anonlink.objectstore.downloadAccessKey
DOWNLOAD_OBJECT_STORE_SECRET_KEY        anonlink.objectstore.downloadSecretKey
DOWNLOAD_OBJECT_STORE_STS_DURATION      \- (default 43200 seconds)
======================================= ==========================================

Implemented in: #594, #612, #613, #614

Service now uses sqlalchemy for database migrations

Sqlalchemy models have been added for all database tables, and initial database setup
now uses alembic for migrations. The database and object store init scripts can now
be run multiple times without causing issues.

Implemented in #603, #611

New configurable limits on maximum number of candidate pairs

Protects the service from running out of memory due to excessive numbers of
candidate pairs being processed. An added side effect is the service now keeps
track of the number of candidate pairs in a run (as well as the number of comparisons).

The limits are controlled by the following two environment variables, shown here with their
default values::

    SOLVER_MAX_CANDIDATE_PAIRS="100_000_000"
    SIMILARITY_SCORES_MAX_CANDIDATE_PAIRS="500_000_000"

If a run exceeds these limits, the run is put into an error state and further processing is
abandoned to protect the service from running out of memory.

Implemented in #595, #605
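
The following sketch shows how such limits could be read and enforced. It is not the service's code: the helper names and the exception type are hypothetical. Python's ``int()`` accepts underscore-separated digits, which is why the default values parse directly.

```python
import os
from typing import Mapping

# Default values as listed in this release note.
DEFAULT_LIMITS = {
    "SOLVER_MAX_CANDIDATE_PAIRS": "100_000_000",
    "SIMILARITY_SCORES_MAX_CANDIDATE_PAIRS": "500_000_000",
}


def read_limit(name: str, environ: Mapping[str, str] = os.environ) -> int:
    """Read a limit from the environment, falling back to its default."""
    return int(environ.get(name, DEFAULT_LIMITS[name]))


def check_candidate_pairs(count: int, limit: int) -> None:
    """Raise if a run has produced more candidate pairs than allowed."""
    if count > limit:
        # Hypothetical error type; the service puts the run into an error state.
        raise RuntimeError(f"{count} candidate pairs exceeds the limit of {limit}")


# An empty mapping is passed here so the example always uses the default.
solver_limit = read_limit("SOLVER_MAX_CANDIDATE_PAIRS", environ={})
check_candidate_pairs(1_000, solver_limit)  # within the limit, no error
```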

Other changes

  • Ingress now supports a user supplied path. We no longer assume an nginx ingress controller. #587
  • Migrate off deprecated k8s chart repos #596, #588
  • Helm chart now uses standard recommended Kubernetes labels. #616
  • Fix an issue with case sensitivity in object store metadata #590
  • If the object store bucket doesn't exist it is now automatically created. #577
  • Ignore but log failures to delete from object store #576
  • Many dependency updates #578, #579, #580, #582, #581, #583, #596, #604, #609, #615
  • Update the base image, all base dependencies and migrated from minio-py v5 to v7 #601, #608, #610
  • CI e2e tests on Kubernetes will now correctly fail if the tests don't run. #618
  • Add optional pod annotations to init jobs. #619

Version 1.13.0

15 Jun 13:57
e6e6d60

Highlights

  • The entity service now supports user-provided blocking information. This can significantly reduce the number of required comparisons and thus allows for linkages between larger datasets.
  • The server can be configured to use an object store for dataset uploads. This allows the use of libraries such as boto3 or minio to improve reliability, especially for large uploads.
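
To see why blocking reduces the work, the number of comparisons with and without blocks can be estimated as follows. This is an illustrative sketch with made-up data and hypothetical function names; it counts a pair once per shared block, so it is an upper bound when records belong to several blocks.

```python
from collections import Counter
from typing import Hashable, Iterable, Sequence


def block_sizes(record_blocks: Iterable[Sequence[Hashable]]) -> Counter:
    """Count how many records of one dataset fall into each block."""
    sizes = Counter()
    for blocks in record_blocks:
        sizes.update(set(blocks))
    return sizes


def comparisons_with_blocking(blocks_a, blocks_b) -> int:
    """Sum, over the shared blocks, the product of the two block sizes."""
    sizes_a = block_sizes(blocks_a)
    sizes_b = block_sizes(blocks_b)
    shared = sizes_a.keys() & sizes_b.keys()
    return sum(sizes_a[b] * sizes_b[b] for b in shared)


# Each inner list holds the block identifiers of one record.
dataset_a = [["x"], ["x"], ["y"], ["y"]]
dataset_b = [["x"], ["y"], ["y"], ["z"]]

full = len(dataset_a) * len(dataset_b)                    # 4 * 4 = 16
blocked = comparisons_with_blocking(dataset_a, dataset_b)  # 2*1 + 2*2 = 6
```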

Docker Images

  • data61/anonlink-app:v1.13.0
  • data61/anonlink-nginx:v1.4.6
  • data61/anonlink-benchmark:v0.3.3

Breaking Changes

  • The similarity_score output type has been modified. It now returns a JSON array whose elements look like [[party_id_0, row_index_0], [party_id_1, row_index_1], score]. #464
  • Integration test configuration is now consistent with benchmark config. Instead of setting ENTITY_SERVICE_URL including /api/v1, now just set the host address in SERVER. #495
  • The matching output type was removed. Use the equivalent groups output type instead. #458
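
A client consuming the new similarity_score format could unpack it as sketched below. The sample data and the names ``SimilarityScore`` and ``parse_similarity_scores`` are hypothetical, and the sketch assumes the array of entries has already been extracted from the response body.

```python
import json
from typing import List, NamedTuple


class SimilarityScore(NamedTuple):
    party_a: int
    row_a: int
    party_b: int
    row_b: int
    score: float


def parse_similarity_scores(entries) -> List[SimilarityScore]:
    """Unpack entries of the form [[party, row], [party, row], score]."""
    return [
        SimilarityScore(party_a, row_a, party_b, row_b, score)
        for (party_a, row_a), (party_b, row_b), score in entries
    ]


# Made-up example data in the documented shape.
entries = json.loads("[[[0, 5], [1, 12], 0.93], [[0, 8], [1, 2], 0.87]]")
scores = parse_similarity_scores(entries)
best = max(scores, key=lambda s: s.score)
```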

Other Changes

  • use latest stable minio release #572
  • add section to API tutorial about uploads to object store #573
  • plus all the changes introduced in the alpha and beta versions below.

Version 1.13.0-beta3

11 Jun 00:16
29fbb5b
Pre-release
  • Improved performance for blocks of small size #563
  • Fixed a problem with the upload to the external object store #564
  • Updated documentation #567, #569

Version 1.13.0-beta2

30 Apr 03:52
Pre-release

Adds support for users to supply blocking information along with encodings. Data can now be uploaded to
an object store and pulled by the Anonlink Entity Service instead of uploaded via the REST API.
This release includes substantial internal changes as encodings are now stored in Postgres instead of
the object store.

  • Feature to pull data from an object store and create temporary upload credentials. #537, #544, #551
  • Blocking implementation #510, #527
  • Benchmark container now includes support for blocking #478, #541
  • Encodings are now stored in Postgres database instead of files in an object store. #516, #522
  • Start to add integration tests to complement our end to end tests. #520, #528
  • Use anonlink-client instead of clkhash #536
  • Use Python 3.8 in base image. #518
  • A base image is now used for all our Docker images. #506, #511, #517, #519
  • Binary encodings now stored internally with their encoding id. #505
  • REST API implementation for accepting clknblocks #503
  • Update Open API spec to version 3. Add Blocking API #479
  • CI Updates #476
  • Chart updates #496, #497, #539
  • Documentation updates (production deployment, debugging with PyCharm) #473, #504
  • Fix Jaeger #500, #523

Misc changes/fixes:

  • Detect invalid encoding size as early as possible #507
  • Use local benchmark cache #531
  • Cleanup docker-compose #533, #534, #547
  • Calculate number of comparisons accounting for user supplied blocks. #543

Try it out

You can pull this repository and try with Docker Compose. The Docker images are all hosted on Docker Hub:

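================ ============================
Component        Docker Hub
================ ============================
Base Image       data61/anonlink-base
Backend/Worker   data61/anonlink-app
E2E Tests        data61/anonlink-test
Nginx Proxy      data61/anonlink-nginx
Benchmark        data61/anonlink-benchmark
Docs             data61/anonlink-docs-builder
================ ============================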

Using Kubernetes (follow the detailed deployment docs):

helm repo add data61 https://data61.github.io/charts
helm repo update
helm install data61/entity-service --version 1.13.1 [--values...]

All the documentation, including tutorials, can be found at https://anonlink-entity-service.readthedocs.io/en/latest/index.html

v1.13.0-beta

10 Feb 22:23
Pre-release
  • Fixed a bug where a data provider could upload their clks multiple times in a project using the same upload token. (#463)
  • Fixed a bug where workers accepted work after failing to initialize their database connection pool. (#477)
  • Modified similarity_score output to follow the group format in preparation for extending this output type to more
    parties. (#464)
  • Tutorials have been improved following an internal review. (#467)
  • Database schema and CLK upload API have been modified to support blocking. (#470)
  • Benchmarking results can now be saved to an object store without authentication, allowing an AWS user to save to S3
    using node permissions. (#490)
  • Removed duplicate/redundant tests. (#466)
  • Updated dependencies:
    • We have enabled `dependabot <https://dependabot.com/>`_ on GitHub to keep our Python dependencies up to date.
    • anonlinkclient now used for benchmarking. (#490)
    • Chart dependencies redis-ha, postgres and minio all updated. (#496, #497)

Breaking Changes

  • The similarity_score output type has been modified. It now returns a JSON array whose elements
    look like [[party_id_0, row_index_0], [party_id_1, row_index_1], score]. (#464)
  • Integration test configuration is now consistent with benchmark config. Instead of setting ENTITY_SERVICE_URL including
    /api/v1, now just set the host address in SERVER. (#495)

Database Changes (Internal)

  • the dataproviders table uploaded field has been modified from a BOOL to an ENUM type (#463)
  • The projects table has a new uses_blocking field. (#470)

Docker Images

  • data61/anonlink-app:v1.13.0-beta
  • data61/anonlink-nginx:v1.4.6-beta
  • data61/anonlink-benchmark:v0.3.1

Install to Kubernetes using the helm chart:

helm repo add data61 https://data61.github.io/charts
helm repo update
helm install data61/entity-service [--values...]

v1.13.0-alpha

05 Nov 05:03
f72c466
Pre-release
  • fixed bug where invalid state changes could occur when starting a run (#459)

  • matching output type has been removed as redundant with the groups output with 2 parties. (#458)

  • Update dependencies:

    • requests from 2.21.0 to 2.22.0 (#459)

Breaking Change

  • matching output type is not available anymore. (#458)

v1.12.0

18 Oct 07:09
f0ef5d8

Created docker images:

  • data61/anonlink-app:v1.12.0
  • data61/anonlink-nginx:v1.4.5
  • data61/anonlink-benchmark:v0.3.0

Changelog:

  • Logging is now configurable in the deployed entity service via the loggingCfg key. (#448)
  • Several old settings have been removed from the default values.yaml and docker
    files. They have been replaced by CHUNK_SIZE_AIM (#414):
    • SMALL_COMPARISON_CHUNK_SIZE
    • LARGE_COMPARISON_CHUNK_SIZE
    • SMALL_JOB_SIZE
    • LARGE_JOB_SIZE
  • Remove ENTITY_MATCH_THRESHOLD environment variable (#444)
  • Celery configuration updates to solve threads and memory leaks in deployment. (#427)
  • Update docker-compose files to use these new preferred configurations.
  • Update helm charts with preferred configuration. The default deployment is a minimal working deployment.
  • New environment variables: CELERY_DB_MIN_CONNECTIONS, FLASK_DB_MIN_CONNECTIONS, CELERY_DB_MAX_CONNECTIONS
    and FLASK_DB_MAX_CONNECTIONS to configure the database connections pool. (#405)
  • Simplify database access from services by relying on a single way to get a connection, via a connection pool. (#405)
  • Deleting a run is now implemented. (#413)
  • Added some missing documentation about the output type groups (#449)
  • Sentinel name is configurable. (#436)
  • Improvement on the Kubernetes deployment test stage on Azure DevOps:
    • Re-order cleaning steps to first purge the deployment and then delete the remainder. (#426)
    • Run integration tests in parallel, reducing pipeline stage Kubernetes deployment tests from 30 minutes to 15 minutes. (#438)
    • Tests running on a deployed entity-service on k8s create an artifact containing all the logs of all the containers, useful for debugging. (#445)
    • Test container not restarted on test failure. (#434)
  • Benchmark improvements:
    • Benchmark output has been modified to handle multi-party linkage.
    • Benchmark can handle more than 2 parties, repeat experiments,
      and push the results to a minio object store. (#406, #424 and #425)
    • Azure DevOps benchmark stage runs a 3 parties linkage. (#433)
  • Improvements on Redis cache:
    • Refactor the cache. (#430)
    • Run state kept in cache (instead of fully relying on database) (#431 and #432)
  • Update dependencies:
    • anonlink to v0.12.5. (#423)
    • redis from 3.2.0 to 3.2.1 (#415)
    • alpine from 3.9 to 3.10.1 (#404)
  • Add some release documentation. (#455)

v1.12 pre release

16 Oct 22:48
c12f73b
Pre-release

We are creating this tag so that we can deploy an entity-service with all the configuration introduced in develop that our testing service on Kubernetes requires.