# Start the Pre-Labeling Tool



This notebook shows you how to start the Pre-Labeling Tool and assumes that you have already created the required inputs as explained in [```generate_premanifest_file.ipynb```](mlsl-poc/notebooks/labeling_jobs/generate_premanifest_file.ipynb).

There are two versions of the **Pre-Labeling Tool**:
- **Fuzzy Matching Version**: pre-annotates documents based on text similarity with the expected entities files provided by the user
- **Comprehend Version**: pre-annotates documents based on the predictions of a Comprehend model

Each version is implemented as its own Step Function.



In [30]:
import json
import boto3
import os

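# Resolve the region and account ID used to build the resource names below.
session = boto3.session.Session()
REGION = session.region_name  # None if no default region is configured
sts = session.client('sts')
AWS_ID = sts.get_caller_identity().get('Account')
# Create the Step Functions client from the same session as the STS client.
stepfunctions_client = session.client('stepfunctions')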

## Prerequisites

### a. Create a private labeling workforce
To create a GT Labeling Job with the Pre-Labeling Tool, you first need to manually create a private labeling workforce. This can be done in the console by following these steps:
- Go to 'SageMaker' -> 'Ground Truth' -> 'Labeling workforces' -> 'Private' -> 'Create private team'
- Select: 'Create a new Amazon Cognito user group', 'Invite new workers by email'
- Choose a 'Team name' and enter the email addresses of the workers

### b. Check notebook IAM role permissions

This notebook needs permission to start a Step Functions execution. Replace ```{REGION}``` and ```{AWS_ID}``` in the following policy with your own values, then attach the policy to your notebook role:

```
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "VisualEditor0",
			"Effect": "Allow",
			"Action": "states:StartExecution",
			"Resource": [
				"arn:aws:states:{REGION}:{AWS_ID}:stateMachine:PrelabelingComprehendStepFunctions",
				"arn:aws:states:{REGION}:{AWS_ID}:stateMachine:PrelabelingFuzzyMatchingStepFunctions"
			]
		}
	]
}
```
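
If you prefer to generate the policy document rather than edit the placeholders by hand, the following is a minimal sketch. The helper name `build_start_execution_policy` and the example region and account ID are hypothetical; only the policy content comes from the JSON above.

```python
import json


def build_start_execution_policy(region: str, account_id: str) -> dict:
    """Return the policy shown above with the ARN placeholders filled in."""
    state_machines = [
        "PrelabelingComprehendStepFunctions",
        "PrelabelingFuzzyMatchingStepFunctions",
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": "states:StartExecution",
                "Resource": [
                    f"arn:aws:states:{region}:{account_id}:stateMachine:{name}"
                    for name in state_machines
                ],
            }
        ],
    }


# Placeholder values for illustration; in this notebook you would pass REGION and AWS_ID.
policy = build_start_execution_policy("us-east-1", "123456789012")
print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the IAM console when attaching the policy to the notebook role.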

## Option 1: Fuzzy Matching version

**What is the Fuzzy Matching version?**

The Fuzzy Matching version uses [Fuzzy Matching](https://en.wikipedia.org/wiki/Approximate_string_matching) to detect the expected entities provided by the user and generate the pre-annotations.
It can be used to directly train a Comprehend Custom Entity Recognizer and/or to create a Ground Truth Labeling Job in which human annotators review the generated pre-annotations before a model is trained.

In [31]:
# FuzzyMatching Version
fuzzymatching_prelabeling_step_functions_arn = 'arn:aws:states:{}:{}:stateMachine:PrelabelingFuzzyMatchingStepFunctions'.format(REGION,AWS_ID)

To start the Fuzzy Matching version of the Pre-Labeling Tool, we need to start the corresponding Step Function with the following parameters given as input:
- ```premanifest```: the premanifest file maps PDFs to their expected_entities files.
    - ```bucket```: bucket of the premanifest file
    - ```key```: s3 path of the premanifest file
- ```prefix```: used to create the execution_id, which names both the s3 folder where outputs are stored and the GT labeling job
- ```work_team_name``` [optional]: private workforce used to create the GT labeling job. If this argument is not given, the GT Labeling Job is not created.
- ```comprehend_parameters``` [optional]: parameters used to train the Comprehend custom entity recognizer. If this argument is not given, the Comprehend custom entity recognizer is not trained.
- ```entity_types```: entity types displayed on the UI and available to annotators
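
Before starting an execution, it can help to check the event against the parameter list above. The sketch below is a hypothetical helper (`validate_fuzzymatching_event` is not part of the tool); the final check is a convenience of this sketch, not a rule enforced by the Step Function.

```python
def validate_fuzzymatching_event(event: dict) -> list:
    """Return a list of problems found in a Fuzzy Matching event (empty if none)."""
    problems = []
    premanifest = event.get("premanifest")
    if not isinstance(premanifest, dict):
        problems.append("missing 'premanifest'")
    else:
        for field in ("bucket", "key"):
            if not premanifest.get(field):
                problems.append(f"missing 'premanifest.{field}'")
    if not event.get("prefix"):
        problems.append("missing 'prefix'")
    if not event.get("entity_types"):
        problems.append("missing or empty 'entity_types'")
    # Both parameters are optional, but without either one the execution
    # neither creates a GT labeling job nor trains a Comprehend model.
    if "work_team_name" not in event and "comprehend_parameters" not in event:
        problems.append("neither 'work_team_name' nor 'comprehend_parameters' is set")
    return problems


# Illustrative event with placeholder values.
example_event = {
    "premanifest": {"bucket": "my-bucket", "key": "path/to/premanifest.json"},
    "prefix": "demo",
    "entity_types": ["bank_name", "date"],
}
print(validate_fuzzymatching_event(example_event))
```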


**IMPORTANT NOTE**:
- For the Fuzzy Matching version, the expected entities files must be provided in the premanifest file. In other words, each element of the premanifest file must have the ```expected_entities``` key.
- All PDF documents referenced by the ```premanifest``` file must be placed in the bucket ```comprehend-semi-structured-docs-{region-name}-{account-id}``` that was created by the SAM template, because the [Semi-Structured Documents Annotation Tool](https://github.com/aws-samples/amazon-comprehend-semi-structured-documents-annotation-tools) used here does not support input data outside of this bucket.
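
The first requirement can be checked with a few lines of Python. This sketch assumes the premanifest has already been loaded as a list of dictionaries, one per document; that structure, the ```pdf``` field name and the file names are assumptions for illustration, so compare them with the output of ```generate_premanifest_file.ipynb``` before relying on it.

```python
def entries_missing_expected_entities(premanifest_entries: list) -> list:
    """Return the indices of premanifest entries without an 'expected_entities' key."""
    return [
        index
        for index, entry in enumerate(premanifest_entries)
        if "expected_entities" not in entry
    ]


# Hypothetical entries: the 'pdf' field name and the paths are illustrative only.
sample_entries = [
    {"pdf": "doc_1.pdf", "expected_entities": "doc_1_expected_entities.json"},
    {"pdf": "doc_2.pdf"},
]
print(entries_missing_expected_entities(sample_entries))  # prints [1]
```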


With this Fuzzy Matching version we can create a Ground Truth Labeling Job with pre-annotations displayed and/or directly train a custom Comprehend entity recognizer. We will run the Fuzzy Matching version of the Pre-Labeling tool twice: 
- first to create a Ground Truth Labeling Job (option 1.1)
- and then to show how to launch the training of Comprehend (option 1.2)

Note: these two steps can also be run in one single execution.  

### Option 1.1. Create a GT Labeling Job

Here we use the 5 PDF documents that were uploaded to the ```example-demo/``` subfolder (see ```generate_premanifest_file.ipynb```).

In [32]:
team_name = 'REPLACE_WITH_YOUR_TEAM_NAME'

In [33]:
inputs_subfolder = 'example-demo' # subfolders where inputs have been placed

In [35]:
event_fuzzymatching_create_gt = {
    "premanifest": {
        "bucket": f"comprehend-semi-structured-docs-{REGION}-{AWS_ID}",
        "key": f"prelabeling-inputs/{inputs_subfolder}/premanifest/premanifest.json"
    },
    "prefix": "test-fuzz-gt",
    "work_team_name": team_name,
    "entity_types": [
        "bank_name",
        "customer_name",
        "checking_number",
        "checking_amount",
        "savings_number",
        "savings_amount",
        "date"
    ],
    # "comprehend_parameters": {"model_name": "test-from-prelabeling"}
}

In [36]:
response = stepfunctions_client.start_execution(
    stateMachineArn=fuzzymatching_prelabeling_step_functions_arn,
    input=json.dumps(event_fuzzymatching_create_gt)
)
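
To follow the execution from the notebook instead of the console, one option is to poll its status. This is a minimal sketch: the helper name `wait_for_execution` and its default intervals are hypothetical, and the notebook role would additionally need the `states:DescribeExecution` permission, which the policy in the prerequisites does not grant.

```python
import time


def wait_for_execution(client, execution_arn, poll_seconds=30, timeout_seconds=3600):
    """Poll describe_execution until the execution leaves the RUNNING state.

    Returns the final status string, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.describe_execution(executionArn=execution_arn)["status"]
        if status != "RUNNING":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{execution_arn} still RUNNING after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Usage in this notebook:
# final_status = wait_for_execution(stepfunctions_client, response["executionArn"])
# print(final_status)
```

The client is passed in as an argument so the same helper works for both the Fuzzy Matching and the Comprehend executions.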

### Option 1.2. Train Comprehend

Here we use the 250 PDF documents that were uploaded to the ```example-demo-training/``` subfolder (see ```generate_premanifest_file.ipynb```).

In [37]:
inputs_subfolder = 'example-demo-training' # subfolders where inputs have been placed

In [38]:
event_fuzzymatching_training = {
    "premanifest": {
        "bucket": f"comprehend-semi-structured-docs-{REGION}-{AWS_ID}",
        "key": f"prelabeling-inputs/{inputs_subfolder}/premanifest/premanifest.json"
    },
    "prefix": "test-fuzz-wct",
    # "work_team_name": "PrivateTeam",  # here we won't create a GT labeling Job
    "entity_types": [
        "bank_name",
        "customer_name",
        "checking_number",
        "checking_amount",
        "savings_number",
        "savings_amount",
        "date"
    ],
    "comprehend_parameters": {"model_name": "test-from-prelabeling"}
}

In [39]:
response = stepfunctions_client.start_execution(
    stateMachineArn=fuzzymatching_prelabeling_step_functions_arn,
    input=json.dumps(event_fuzzymatching_training)
)

### Results / Outputs

You can visually follow the execution of the Step Function in the console.

Once the execution is finished, you can: 
- inspect the following outputs saved in s3:
  ```
  └── comprehend-semi-structured-docs-{region-name}-{account-id}
      └── prelabeling
          └── {prefix}+{datetime} # unique folder per job
              ├── consolidated_manifest/
              │   ├── consolidated_manifest_comprehend.manifest # manifest file that can be used to train Comprehend
              │   └── consolidated_manifest.manifest # manifest file used by GT Labeling Job
              ├── input-premanifest/
              │   └── premanifest.json # copy of the pre-manifest file used as input
              ├── temp_individual_manifests/ # individual manifest per doc per page
              │   ├── file_1_page_0_manifest.manifest 
              │   ├── file_1_page_1_manifest.manifest
              │   └── ...
              └── textract-annotations/ # textract outputs
  ```

- open the GT Labeling Job that was created to review the annotations (for option 1.1). To do so, go to 'Amazon SageMaker' -> 'Ground Truth' -> 'Labeling jobs'
- inspect and experiment with the custom model that was trained by Comprehend (for option 1.2)
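
The s3 outputs can also be listed from the notebook. The sketch below assumes that output folders start with ```prelabeling/{prefix}``` as in the tree above and that the notebook role is allowed to list the bucket; the helper name `list_output_keys` is hypothetical.

```python
def list_output_keys(s3_client, bucket, prefix):
    """Return all object keys whose path starts with prelabeling/{prefix}."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=f"prelabeling/{prefix}"):
        # 'Contents' is absent from a page when nothing matches.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Usage in this notebook:
# keys = list_output_keys(
#     boto3.client("s3"),
#     f"comprehend-semi-structured-docs-{REGION}-{AWS_ID}",
#     "test-fuzz-gt",
# )
# print("\n".join(keys))
```

Because the match is on the key prefix, the result includes the folders of every execution that was started with the same ```prefix``` value.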

## Option 2: Comprehend version

**What is the Comprehend version?**

The Comprehend version requires an already trained Comprehend Custom Entity Recognizer model, which it uses to pre-annotate the PDF documents.
Hence, this version takes the custom Comprehend model that we trained as an additional input.

Note: this version can only be used to create a Ground Truth labeling Job with pre-annotations displayed. The step to train a new custom entity recognizer has not been implemented.
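
Since this version depends on a trained model, it may be worth confirming the recognizer status before starting the Step Function. This is a sketch under the assumption that the notebook role has the `comprehend:DescribeEntityRecognizer` permission; the helper name `recognizer_is_trained` is hypothetical.

```python
def recognizer_is_trained(comprehend_client, recognizer_arn):
    """Return True if the custom entity recognizer has finished training."""
    response = comprehend_client.describe_entity_recognizer(
        EntityRecognizerArn=recognizer_arn
    )
    status = response["EntityRecognizerProperties"]["Status"]
    return status in ("TRAINED", "TRAINED_WITH_WARNING")


# Usage in this notebook, once the model ARN is known:
# comprehend_client = boto3.client("comprehend")
# print(recognizer_is_trained(comprehend_client, "REPLACE_WITH_COMP_ARN"))
```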

**Important Note**:
- Make sure that the ```premanifest``` file does not reference more than 1000 documents. There is currently a Textract quota: ```StartDocumentTextDetection throttle limit in transactions per second``` is limited to 2 per second. This quota is exceeded when Comprehend creates more than 2 batches of 500 documents each.
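
The arithmetic behind this limit can be written down as a quick check. The constants come from the note above; the function names are hypothetical.

```python
import math

BATCH_SIZE = 500  # documents per batch, as stated in the note above
MAX_BATCHES = 2   # more than 2 batches exceeds the Textract quota described above


def count_batches(num_documents: int) -> int:
    """Number of batches needed for the given number of documents."""
    return math.ceil(num_documents / BATCH_SIZE)


def fits_textract_quota(num_documents: int) -> bool:
    """True if the documents fit in at most MAX_BATCHES batches."""
    return count_batches(num_documents) <= MAX_BATCHES


print(count_batches(1000), fits_textract_quota(1000))  # prints 2 True
print(count_batches(1001), fits_textract_quota(1001))  # prints 3 False
```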

In [32]:
# Comprehend Version
comprehend_prelabeling_step_functions_arn = f'arn:aws:states:{REGION}:{AWS_ID}:stateMachine:PrelabelingComprehendStepFunctions'

team_name = 'REPLACE_WITH_YOUR_TEAM_NAME'
model_ARN_comprehend = 'REPLACE_WITH_COMP_ARN' # ARN of your custom Comprehend entity recognizer model

In [10]:
event_comprehend = {
   "premanifest":{
      "bucket":f"comprehend-semi-structured-docs-{REGION}-{AWS_ID}",
      "key":f"prelabeling-inputs/{inputs_subfolder}/premanifest/premanifest.json"
   },
   "prefix":"comprehend-version",
   "work_team_name":team_name,
   "model_ARN_comprehend":model_ARN_comprehend,
   "entity_types":[
      "Product",
      "Service",
      "Substrate",
      "Product-Substrate",
      "Position_Number",
      "Quantity",
      "Diameter",
      "Length",
      "General_measurements",
      "Ticked_Box",
      "PurchaseOrderNumber",
      "Customer_name"
   ]
}

The event input of the Comprehend version takes the following parameters:
- ```premanifest```: the premanifest file maps PDFs to their expected_entities files. The expected_entities are optional in this premanifest file; if they are omitted, annotators won't see them in the annotator-metadata section.
    - ```bucket```: bucket of the premanifest file
    - ```key```: s3 path of the premanifest file
- ```prefix```: used to create the execution_id, which names both the s3 folder where outputs are stored and the GT labeling job
- ```work_team_name```: private workforce used to create the GT labeling job
- ```model_ARN_comprehend```: ARN of the Comprehend model that is used to pre-label the documents
- ```entity_types```: entity types displayed on the UI and available to annotators


**IMPORTANT NOTE**: all PDF documents referenced by the ```premanifest``` file must be placed in the bucket ```comprehend-semi-structured-docs-{region-name}-{account-id}``` that was created by the SAM template, because the [Semi-Structured Documents Annotation Tool](https://github.com/aws-samples/amazon-comprehend-semi-structured-documents-annotation-tools) used here does not support input data outside of this bucket.

In [11]:
response = stepfunctions_client.start_execution(
    stateMachineArn=comprehend_prelabeling_step_functions_arn,
    input=json.dumps(event_comprehend)
)

### Results and outputs

Once the execution is finished, you can: 
- inspect the following outputs saved in s3:
  ```
    └── comprehend-semi-structured-docs-{region-name}-{account-id}
        └── prelabeling
            └── {prefix}+{datetime} # unique folder per job
                ├── consolidated_manifest/
                │   └── final-manifest.manifest # manifest file used by GT Labeling Job
                ├── input-premanifest/
                │   └── premanifest.json # copy of the pre-manifest file used as input
                ├── comprehend-annotations/
                │   ├── zipped_files/ # comprehend async output
                │   ├── unzipped_files/ # comprehend async output unzipped
                │   │   ├── success/
                │   │   └── failure/
                │   ├── labeling-annotation-folder/  
                │   │   ├── file_1_page_0.json # Comprehend pre-annotations
                │   │   ├── ...
                │   │   └── blocks/ # Textract outputs
                │   └── manifest/
                │       └── manifest.csv # intermediate manifest file
                └── input-comprehend-pdf-batches/ # copy of PDF files grouped in batches
                    ├── batch-0/
                    │   ├── file_1.pdf
                    │   ├── ...
                    │   └── file_500.pdf
                    └── ...
  ```

- open the GT Labeling Job that was created to review the annotations. To do so, go to 'Amazon SageMaker' -> 'Ground Truth' -> 'Labeling jobs'
