# Annotation by Bedrock

This notebook uses AWS Bedrock to help create NER training data.

Input data must be formatted before use, with one JSON object per line in this format:
```json
{"meta": {"identity": 73955, "sectionId": "12", "sectionName": "Repealed", "ActId": "Civil Forfeiture Act"}, "text": "repealed 12 [ repealed 2023 - 13 - 11. ]", "label": []}
```
Output data has the same shape, but `label` is populated with `[start, end, tag]` entries like this:
```js
[[61, 67, 'REF_IN'],
 [123, 124, 'REF_IN']]
```
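Each label is a `[start, end, tag]` triple. Judging by the examples used in the prompt later in this notebook, `start` and `end` are character offsets into `text` with an exclusive end, so a Python slice recovers the labelled substring. A minimal sketch, where `labelled_spans` is a hypothetical helper and not part of the notebook:

```python
def labelled_spans(text, labels):
    """Return (substring, tag) pairs for each [start, end, tag] label."""
    return [(text[start:end], tag) for start, end, tag in labels]

sample = "According to section 3,"
print(labelled_spans(sample, [[21, 22, "REF_IN"]]))  # prints [('3', 'REF_IN')]
```

Printing the substrings this way is a quick way to spot labels that point at the wrong characters.
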
Set the AWS environment variables (`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`) before starting Jupyter Notebook.
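
Because the session is built from `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, a quick check before creating the client can surface a missing variable early. The sketch below is one possible way to do that; `REQUIRED_VARS` and `missing_env_vars` are hypothetical names introduced here for illustration.

```python
import os

# Assumption: these are the only two variables this notebook needs
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")

def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]

missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```
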


In [None]:
%pip install boto3
%pip install botocore

In [4]:
import boto3
from botocore.config import Config
import json
import os

In [5]:
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")

session = boto3.Session(
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
)

# Set the retry mode explicitly; without it, botocore defaults to the legacy mode
config = Config(
  retries = {
    'max_attempts': 3,
    'mode': 'standard'
  }
)
bedrock_runtime = session.client("bedrock-runtime", region_name="us-east-1", config=config)

In [6]:
# Build the request arguments for the Bedrock model
# In this case, Claude 3.5 Sonnet.
def get_claude_kwargs(prompt):
    kwargs = {
        "modelId": "anthropic.claude-3-5-sonnet-20240620-v1:0",
        "contentType": "application/json",
        "accept": "application/json",
        "body": json.dumps(
            {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 5000,
                "messages": [
                    {"role": "user", "content": [{"type": "text", "text": prompt}]}
                ],
            }
        ),
    }
    return kwargs

In [7]:
# Wrapper to get response from AWS and return only the text content
def get_agent_response(prompt):
    kwargs = get_claude_kwargs(prompt)
    response = bedrock_runtime.invoke_model(**kwargs)
    response_body = json.loads(response.get("body").read())
    return response_body["content"][0]["text"]
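
The wrapper assumes the response body is JSON with a `content` list whose first item carries the generated text. The sketch below exercises that parsing step without calling AWS. `FakeBedrockRuntime` is a hypothetical stub whose `invoke_model` returns a body shaped the way the wrapper expects, and `parse_text` is an illustrative helper; neither is part of the notebook or of boto3.

```python
import io
import json

def parse_text(response):
    """Extract the first text block from an invoke_model-style response."""
    response_body = json.loads(response["body"].read())
    return response_body["content"][0]["text"]

class FakeBedrockRuntime:
    """Hypothetical stand-in for the real client, for offline testing only."""

    def invoke_model(self, **kwargs):
        payload = {"content": [{"type": "text", "text": '[[21, 22, "REF_IN"]]'}]}
        # The wrapper calls .read() on the body, so use a file-like object
        return {"body": io.BytesIO(json.dumps(payload).encode("utf-8"))}

fake_response = FakeBedrockRuntime().invoke_model(modelId="unused", body="{}")
print(parse_text(fake_response))
```

A stub like this makes it possible to test the downstream label extraction without spending tokens.
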

In [8]:
import re

# Helper function that returns only the part of the response matching our labeling syntax.
# Returns None if no labels are found.
def extract_labels(result):
  # Avoid naming the match "list", which would shadow the built-in
  match = re.search(r"\[\[\d+, \d+, [\"\'][A-Z_]+[\"\']](?:, \[\d+, \d+, [\"\'][A-Z_]+[\"\']])*]", result)
  if match is not None:
    # Convert from string to a JSON-usable object. JSON requires double quotes.
    labels = json.loads(match.group().replace("'", '"'))
    return labels
  return None
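
The pattern above expects a nested list with exactly one space after each comma, so a response such as `[[20,26,"REF_IN"]]` would not match. The sketch below is a swapped-in, more tolerant variant rather than the notebook's method: it collects every `[start, end, tag]` triple regardless of spacing. `TRIPLE_PATTERN` and `extract_labels_tolerant` are hypothetical names. One trade-off is that it picks up any triple in the response, including ones the model might quote back from the prompt.

```python
import re

# One [start, end, "TAG"] triple, allowing optional whitespace around each part
TRIPLE_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*[\"']([A-Z_]+)[\"']\s*\]")

def extract_labels_tolerant(result):
    """Return all label triples found in the response, or None if there are none."""
    labels = [[int(start), int(end), tag] for start, end, tag in TRIPLE_PATTERN.findall(result)]
    return labels if labels else None

print(extract_labels_tolerant('Result: [[20,26,"REF_IN"], [30, 36, "REF_IN"]]'))
```
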
  

In [9]:
# Build examples to provide to Bedrock from previously annotated texts
examples = []
max_examples = 10
with open("./NER Training/doccano_export.jsonl", "r") as input_file:
  for line in input_file:
    # Stop once max_examples labelled texts have been collected
    if len(examples) >= max_examples:
      break
    obj = json.loads(line)
    line_text = obj["text"]
    label = obj["label"]
    # Only texts that already have labels are useful as examples
    if len(label) > 0:
      examples.append((line_text, label))

# Contains the prompt used to query Bedrock
# Gets answer, extracts labels, and returns result
def ask_bedrock_for_ner(text):
  prompt = (
  """
  Help me complete this NER task.

  I have this tag: REF_IN

  Every time there is a reference to a section within the same act, I need to receive the starting and ending position of the numerical act id. 

  I would like the results to use the following format:
  [[starting index, ending index, label]]

  For example, in the phrase "Please refer to section 12.5", the act id to label is "12.5" and the result would be [[24, 28, "REF_IN"]]
  In the phrase "According to section 3," the act id to label is "3" and the result would be [[21, 22, "REF_IN"]]

  When there are multiple instances in a text, they should be in a list based on their order of occurrence.
  For example, in the sentence "Subject to sections 14. 04 to 14. 10" the result should be [[20, 26, "REF_IN"],[30, 36, "REF_IN"]]

  Basically, every time you see the word section or sections, label the numbers that immediately follow.

  Only return the label array.

  Here is a list of tuples with the original text and the correct label array:
  """
  f"{examples}\n"

  "Now try this task with the following text:\n"
  f"{text}"
  )
  answer = get_agent_response(prompt)
  labels = extract_labels(answer)
  return labels


FileNotFoundError: [Errno 2] No such file or directory: './NER Training/doccano_export.jsonl'
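
Character offsets returned by a language model can be off, so it may be worth checking the labels from `ask_bedrock_for_ner` against the text before saving them. The sketch below is one possible filter; `validate_labels` is a hypothetical helper, and the rule that a labelled span starts and ends with a digit is an assumption based on the examples in the prompt.

```python
def validate_labels(text, labels):
    """Keep only labels whose span is in bounds and looks like a section number."""
    valid = []
    for start, end, tag in labels:
        if not (0 <= start < end <= len(text)):
            continue  # span falls outside the text
        span = text[start:end]
        # Assumption: a section number starts and ends with a digit, e.g. "3" or "14. 04"
        if span[0].isdigit() and span[-1].isdigit():
            valid.append([start, end, tag])
    return valid

sample = "According to section 3,"
print(validate_labels(sample, [[21, 22, "REF_IN"], [13, 20, "REF_IN"]]))
# prints [[21, 22, 'REF_IN']]
```
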

In [8]:
# Example of how this is used
sample_text = """
3 despite subsection ( 2 ) of this section, if under section 14. 08 ( a ) the director commences proceedings under section 3 in relation to the subject property, the public body entitled to maintain possession of the subject property under section 14. 05 continues to be entitled to maintain possession of that property until expiry of the 30 day period described in section 14. 05 ( a ). 4 this part does not apply in relation to property if the property is the subject of an order of a court establishing a right of possession in that property with a person other than the public body or authorizing a person other than the public body to have or take possession of that property
"""

result = ask_bedrock_for_ner(sample_text)
print(result)

NameError: name 'ask_bedrock_for_ner' is not defined

In [9]:
import threading

# Replace this with your known concurrency limit
CONCURRENCY_LIMIT = 10
# Note: this loop is sequential, so the semaphore only matters if requests are moved onto threads
semaphore = threading.BoundedSemaphore(CONCURRENCY_LIMIT)

# Open input file (should be in the format of a Doccano export)
with open("./NER Training/doccano_import_small.jsonl", "r", encoding="utf-8") as input_file:
  # Create/overwrite output file
  with open("./NER Training/bedrock_annotation_output.jsonl", "w", encoding="utf-8") as output:
    # Number lines from 1 so line_number is always defined when an error is reported
    for line_number, line in enumerate(input_file, start=1):
      # Use semaphore to keep requests under concurrency limit
      with semaphore:
        try:
          line_obj = json.loads(line)
          text = line_obj["text"]
          labels = ask_bedrock_for_ner(text)
          # Only save if some text was actually labelled
          if labels is not None:
            print(line_number, text, labels)
            line_obj["label"] = labels
            json.dump(line_obj, output, ensure_ascii=False)
            output.write("\n")
        except Exception as e:
          print(f"Failed to process line {line_number}. Reason: {e}")
      


FileNotFoundError: [Errno 2] No such file or directory: './NER Training/doccano_import_small.jsonl'
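
Because the loop above handles one line at a time, at most one request is in flight and the semaphore never blocks. If concurrent requests are wanted, one option is a bounded thread pool from the standard library. The sketch below is an assumption about how that could look, not the notebook's method: `annotate_all` and `fake_annotate` are hypothetical, and in the notebook the annotator would be `ask_bedrock_for_ner`.

```python
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY_LIMIT = 10

def annotate_all(texts, annotate, max_workers=CONCURRENCY_LIMIT):
    """Run annotate(text) for every text on a bounded thread pool, keeping input order."""
    def safe_annotate(text):
        try:
            return annotate(text)
        except Exception as e:
            print(f"Failed to annotate text. Reason: {e}")
            return None

    # max_workers caps how many requests run at the same time
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(safe_annotate, texts))

# Stub annotator for illustration only
def fake_annotate(text):
    return [[0, 1, "REF_IN"]] if text[:1].isdigit() else None

print(annotate_all(["3 despite", "no number"], fake_annotate))
```

Since `executor.map` yields results in input order, the results can be zipped with the original lines and written to the output file from the main thread, which avoids concurrent writes.
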