BadgerDoc is an ML delivery platform that makes the delivery process of machine learning solutions visible to customers, managers, and the ML team. The primary goal of the platform is to visualize the ML model delivery cycle: data annotation, model training, and result visualization.
The platform has rich functionality in access and data management, annotation setup, and pipeline composition. Access management is based on Keycloak, which is integrated with Active Directory. Data can be uploaded in batches and organized into datasets, or uploaded as single files. An ML pipeline can be applied to a dataset, which triggers batch processing, or to a single document. BadgerDoc is capable of annotating large datasets with many annotators: it has algorithms for task distribution, validation roles, and several validation setups, and in the near future it will support multi-coverage of files by annotators.
BadgerDoc also has a steadily growing number of pre-trained models available to users, which can be assembled into pipelines through a visual editor.
With such rich functionality, BadgerDoc can be used to implement the full ML development cycle, for rapid prototyping, for demonstrating EPAM expertise in ML, and even for large annotation projects when preliminary annotation is available.
For now, BadgerDoc works with vectorized and scanned documents, but it is also capable of image annotation.
We have tested BadgerDoc under `colima`, so this is the recommended way to run it locally.
Run the following command to build the base image:

```
make build_base
```
After the base image is built, it is recommended to clean up any temporary files generated during the build process. To do this, run the following command:

```
make clean
```
The easiest way to build all microservices is to run the `make build_all` command; right after that, you can run `docker-compose` to serve BadgerDoc in local mode.

If you need to build a separate microservice, run the `make build_{microservice}` command, for instance:

```
make build_users
```

to build or rebuild the `users` service.
After all services are built, you need to create a `.env` file in the root folder. You may just copy the example:

```
cp .env.example .env
```
Time to run:

```
docker-compose -f docker-compose-dev.yaml up -d
```
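To verify that the containers came up, you can list their status:

```bash
# Show the state of all BadgerDoc containers
docker-compose -f docker-compose-dev.yaml ps
```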
The services are now running, but to start using BadgerDoc, some additional configuration steps are required.

It's a good idea to automate this section; a minimal sketch of such automation is shown after the steps below.

**Important!** This is not a secure configuration; follow Keycloak best practices when setting up a production environment.
- Log into Keycloak at http://127.0.0.1:8082/auth with `admin:admin` as credentials.
- Go to Realm Settings -> Keys and disable the `RSA-OAEP` algorithm. This helps to avoid the issue explained in jpadilla/pyjwt#722.
- Add a tenant attribute to the `admin` user: go to Users -> select `admin` -> go to Attributes -> create the attribute `tenants:local`, and save.
- Go to Clients -> admin-cli -> Mappers -> Create and fill the form with the following values:
Param | Value |
---|---|
Protocol | openid-connect |
Name | tenants |
Mapper Type | User Attribute |
User Attribute | tenants |
Token Claim Name | tenants |
Claim JSON Type | string |
Add to ID token | On |
Add to access token | On |
Add to userinfo | On |
Multivalued | On |
Aggregate attribute values | On |
- Go to Client Scopes -> find `roles` -> Scope and select `admin` in the list to add it to Assigned Roles, then go to Mappers and ensure that only 2 mappers exist: `realm roles` and `client roles`. Delete all other mappers.
- Go to Clients -> Create -> fill the form and save:
Param | Value |
---|---|
Client ID | badgerdoc-internal |
Client Protocol | openid-connect |
- Go to Clients -> find `badgerdoc-internal` -> change the setting `Access Type: Confidential`, set `Service Accounts Enabled` to `On`, set `Valid Redirect URIs` and `Web Origins` to `*`, then save. The Credentials tab now appears; open it and copy the Secret. Then the Client ID and Secret must be set in `.env`: `KEYCLOAK_SYSTEM_USER_CLIENT=badgerdoc-internal`, and `KEYCLOAK_SYSTEM_USER_SECRET` set to the copied key.
- Go to Clients -> find `badgerdoc-internal` -> Service Account Roles -> Client Roles -> master-realm -> find `view-users` and `view-identity-providers` in Available Roles and add them to Assigned Roles.
- Go to Roles -> add the roles: presenter, manager, role-annotator, annotator, engineer. Open the admin role, go to Composite Roles -> Realm Roles, and add all of these roles.
- Go to Realm Settings -> Tokens -> find `Access Token Lifespan` and set it to 1 day.
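As noted above, these Keycloak steps are good candidates for automation. Below is a minimal sketch using the Keycloak Admin REST API with `curl` and `jq`; it only covers obtaining an admin token and creating the realm roles, and it assumes the local `admin:admin` setup described above (not suitable for production):

```bash
#!/usr/bin/env bash
# Sketch: automate part of the Keycloak setup via the Admin REST API.
# Assumes Keycloak at http://127.0.0.1:8082/auth with admin:admin (local dev only).
set -euo pipefail

KEYCLOAK_URL="http://127.0.0.1:8082/auth"

# Obtain an admin access token from the master realm
TOKEN=$(curl -s -X POST "$KEYCLOAK_URL/realms/master/protocol/openid-connect/token" \
  -d "client_id=admin-cli" \
  -d "username=admin" \
  -d "password=admin" \
  -d "grant_type=password" | jq -r '.access_token')

# Create the realm roles used by BadgerDoc
for role in presenter manager role-annotator annotator engineer; do
  curl -s -X POST "$KEYCLOAK_URL/admin/realms/master/roles" \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"name\": \"$role\"}"
done
```

The remaining steps (mappers, client scopes, composite roles) can be scripted the same way against the corresponding Admin REST API endpoints.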
Time to reload `docker-compose`, because `.env` was changed:

```
docker-compose -f docker-compose-dev.yaml up -d
```
In the case of an installation with Minio, Minio must be accessible from the browser using the same host and port as used for internal communication. The reason is that BadgerDoc displays PDFs using presigned URLs: if the presigned URL generated by the `assets` microservice uses `S3_ENDPOINT=badgerdoc-minio:9000`, then the document will be accessible only from http://badgerdoc-minio:9000.
For a local installation, it's possible to add `127.0.0.1 badgerdoc-minio` to the `/etc/hosts` file. This will solve the issue with presigned URLs.
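For example (assuming a Unix-like system where you can edit `/etc/hosts` with sudo):

```bash
# Map the internal Minio hostname to localhost so presigned URLs resolve in the browser
echo "127.0.0.1 badgerdoc-minio" | sudo tee -a /etc/hosts
```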
For any other installation, we highly recommend using AWS S3 or Azure Blob Storage instead of Minio. Change `STORAGE_PROVIDER=azure` in the `.env` file and set `AZURE_STORAGE_CONNECTION_STRING` to the connection string of your Azure Blob Storage account. Additionally, the Blob Storage CORS settings must be configured to allow access from the domain you are running BadgerDoc on; see "Cross-Origin Resource Sharing (CORS) support for Azure Storage" in the Azure documentation.
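If you manage the storage account with the Azure CLI, a CORS rule along the following lines could work; the origin, methods, and max age here are assumptions to adapt to your deployment:

```bash
# Allow the BadgerDoc frontend origin to read and upload blobs
# (replace --origins with the domain you actually serve BadgerDoc from)
az storage cors add \
  --services b \
  --methods GET HEAD PUT \
  --origins "https://badgerdoc.example.com" \
  --allowed-headers "*" \
  --exposed-headers "*" \
  --max-age 3600 \
  --connection-string "$AZURE_STORAGE_CONNECTION_STRING"
```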
Be sure that you have added all possible categories via the BadgerDoc UI (`/categories`); otherwise you will get undefined categories on the annotations view page.
Airflow runs using its own resources (PostgreSQL, Redis, Flower) without sharing them with BadgerDoc.
- Copy `airflow/.env.example` to `airflow/.env` by running:

```
cp airflow/.env.example airflow/.env
```
To set up the service account, you need to configure Keycloak for BadgerDoc first.
- Set up the service account. Log into Keycloak at http://127.0.0.1:8082/auth with `admin:admin` as credentials. Select Clients -> badgerdoc-internal -> Service Account Roles -> find the Service Account User and click "service-account-badgerdoc-internal". Then select the Attributes tab and add the `tenants:local` attribute like you did for `admin`.
- Go to Role Mappings and assign `admin` and `default-roles-master`.
- Go to Clients -> badgerdoc-internal -> Mappers -> Create and fill the form:
Param | Value |
---|---|
Protocol | openid-connect |
Name | tenants |
Mapper Type | User Attribute |
User Attribute | tenants |
Token Claim Name | tenants |
Claim JSON Type | string |
Add to ID token | On |
Add to access token | On |
Add to userinfo | On |
Multivalued | On |
Aggregate attribute values | On |
- Copy `KEYCLOAK_SYSTEM_USER_SECRET` from the BadgerDoc `.env` file into the Airflow `.env` file, then run:

```
docker-compose -f airflow/docker-compose-dev.yaml up -d
```
- Log in to Airflow.
This docker-compose file was downloaded from the Apache Airflow website (https://airflow.apache.org/docs/apache-airflow/2.7.0/docker-compose.yaml) with only a few modifications.
- Install all required dependencies for a microservice using a packaging tool like Pipenv or Poetry, depending on the microservice you are about to set up (we will use Pipenv and the "assets" service for this example):

```
cd assets && pipenv install --dev
```
- Install the dependencies from the "lib" folder:

```
pipenv shell && pip install -e ../lib/filter_lib ../lib/tenants
```
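Once the dependencies are installed, the service can be started with Uvicorn from inside the Pipenv shell. The module path below is a guess for illustration only; check the service's Dockerfile for the actual entrypoint:

```bash
# Hypothetical entrypoint -- verify the real module path in the service's Dockerfile
pipenv run uvicorn assets.main:app --host 0.0.0.0 --port 8080 --reload
```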
Use this URL to open the Swagger UI of a service: `http://127.0.0.1:8080/{service_name}/docs`. For example: http://127.0.0.1:8080/users/docs
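For a quick headless check that a service is up, you can request its docs page and look for an HTTP 200:

```bash
# Prints the HTTP status code of the users service's Swagger page
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8080/users/docs
```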