# Use generative AI to enhance your technical metadata stored in AWS Glue Data Catalog

<div class="alert alert-block alert-info">
This notebook was tested in an Amazon SageMaker Studio JupyterLab space using a SageMaker Distribution image 1.9.1 and Python 3 kernel.
</div>

You can run this notebook from SageMaker Studio or your local environment. Before running the notebook:

1. [Add access](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access-modify.html) to the following <b>Amazon Bedrock models</b> in your AWS account:
- Anthropic Claude 3 - Sonnet 
- Amazon Titan Text Embeddings V2

2. Note the ARN of your <b>AWS Glue crawler IAM role</b> or, if you don't have one set up already, <b>[create a new Glue crawler IAM role](https://docs.aws.amazon.com/glue/latest/dg/crawler-prereqs.html)</b> and note its ARN. The role needs the [AWSGlueServiceRole](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSGlueServiceRole.html) managed policy (or an equivalent) attached, plus an inline policy that grants access to the S3 bucket holding the data. If you are running this code in SageMaker Studio, also add a trust policy to the AWS Glue crawler IAM role that allows the SageMaker principal to assume it; see [Passing Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) in the AWS documentation.

3. Ensure the IAM role you use to run this Jupyter notebook has access to:
- AWS Glue
- Amazon Bedrock
- Amazon S3
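
The trust policy mentioned in step 2 could look roughly like the sketch below. It assumes the crawler role should be assumable by both the AWS Glue service and SageMaker, using the service principals `glue.amazonaws.com` and `sagemaker.amazonaws.com`; treat it as a starting point and tighten it to your own security requirements.

```python
import json

# Hypothetical trust policy for the AWS Glue crawler IAM role (an assumption,
# not taken from this notebook). It lets the Glue and SageMaker services
# assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": ["glue.amazonaws.com", "sagemaker.amazonaws.com"]
            },
            "Action": "sts:AssumeRole",
        }
    ],
}

# Serialize the policy so it can be pasted into the IAM console or passed to an API call
trust_policy_json = json.dumps(trust_policy, indent=2)
print(trust_policy_json)
```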

## 1. Setup

First, we install the required dependencies.

In [None]:
!pip install -U boto3==1.34.117 
!pip install -U langchain==0.2.1 
!pip install --upgrade --quiet  langchain-aws
!pip install jsonschema

<div class="alert alert-block alert-warning">
<b>Important:</b> 
Once the previous step has completed, restart the Kernel. Then continue with the sections below.
</div>

We also define a helper function called `get_bedrock_client` that creates and returns a boto3 client for Amazon Bedrock using different configuration options.

In [None]:
import os, sys, json
import boto3 
from botocore.config import Config
from typing import Optional
import langchain
import logging
from datetime import date, datetime
import pprint 


def get_bedrock_client(
    assumed_role: Optional[str] = None,
    region: Optional[str] = None,
    runtime: Optional[bool] = True,
):
    #  create a boto3 session with the specified region and, optionally, an AWS profile name from the `AWS_PROFILE` environment variable.
    if region is None: 
        target_region = os.environ.get("AWS_REGION", os.environ.get("AWS_DEFAULT_REGION"))
    else:
        target_region = region

    print(f"Create new client\n  Using region: {target_region}")
    session_kwargs = {"region_name": target_region}
    client_kwargs = {**session_kwargs}

    profile_name = os.environ.get("AWS_PROFILE")
    if profile_name:
        print(f"  Using profile: {profile_name}")
        session_kwargs["profile_name"] = profile_name

    retry_config = Config(
        region_name=target_region,
        retries={
            "max_attempts": 10,
            "mode": "standard",
        },
    )
    session = boto3.Session(**session_kwargs)

    # if an `assumed_role` is provided assume that role using STS and retrieve the temporary credential for the client.
    if assumed_role: 
        print(f"  Using role: {assumed_role}", end='')
        sts = session.client("sts")
        response = sts.assume_role(
            RoleArn=str(assumed_role),
            RoleSessionName="langchain-llm-1"
        )
        print(" ... successful!")
        client_kwargs["aws_access_key_id"] = response["Credentials"]["AccessKeyId"]
        client_kwargs["aws_secret_access_key"] = response["Credentials"]["SecretAccessKey"]
        client_kwargs["aws_session_token"] = response["Credentials"]["SessionToken"]

    if runtime:
        service_name='bedrock-runtime'
    else:
        service_name='bedrock'

    # create the boto3 client for the `bedrock-runtime` or `bedrock` service, based on the `runtime` flag, with the specified region, credentials (if assumed role is used), and a retry configuration.
    bedrock_client = session.client(
        service_name=service_name,
        config=retry_config,
        **client_kwargs
    )

    print("boto3 Bedrock client successfully created!")
    print(bedrock_client._endpoint)
    return bedrock_client
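
The region fallback used by `get_bedrock_client` can be illustrated in isolation. The sketch below is a simplified stand-in: `resolve_region` is a hypothetical helper name, and the environment is passed in as a plain mapping so the behaviour can be checked without touching real AWS settings.

```python
import os
from typing import Mapping, Optional


def resolve_region(region: Optional[str] = None,
                   env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the explicit region, else AWS_REGION, else AWS_DEFAULT_REGION."""
    if region is not None:
        return region
    return env.get("AWS_REGION", env.get("AWS_DEFAULT_REGION"))


# An explicit argument wins over the environment
explicit = resolve_region("eu-west-1", {"AWS_REGION": "us-east-1"})
# AWS_REGION takes precedence over AWS_DEFAULT_REGION
from_env = resolve_region(None, {"AWS_REGION": "us-east-1", "AWS_DEFAULT_REGION": "us-west-2"})
# With nothing configured the result is None, and boto3 falls back to its own configuration
unset = resolve_region(None, {})
print(explicit, from_env, unset)
```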

Next, provide the ARN of your AWS Glue crawler IAM role and set up the Amazon Bedrock and AWS Glue clients.

In [None]:
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
embeddings_model_id= "amazon.titan-embed-text-v2:0"

# ---- ⚠️ Un-comment and edit the below lines as needed for your AWS setup ⚠️ ----

os.environ["AWS_DEFAULT_REGION"] = "us-east-1"  # E.g. "us-east-1"
# os.environ["AWS_PROFILE"] = "<YOUR_PROFILE>"
# os.environ["BEDROCK_ASSUME_ROLE"] = "<YOUR_ROLE_ARN>"  # E.g. "arn:aws:..."
GLUE_CRAWLER_ARN = '<YOUR_AWS_GLUE_CRAWLER_IAM_ROLE>'

bedrock_client= get_bedrock_client(
    assumed_role=os.environ.get("BEDROCK_ASSUME_ROLE", None),
    region=os.environ.get("AWS_DEFAULT_REGION", None)
)

glue_client = boto3.client("glue", region_name=os.environ["AWS_DEFAULT_REGION"])

### Create S3 bucket

In this section, we create the S3 resources: an S3 bucket and the files containing our test data.

In [None]:
import random
import string
s3_client = boto3.client('s3')

# Define the prefix for the bucket name
bucket_prefix = 'aws-gen-ai-glue-metadata'

# Generate a random string to append to the bucket name
random_suffix = ''.join(random.choices(string.ascii_lowercase + string.digits, k=10))

# Construct the bucket name
bucket_name = f"{bucket_prefix}-{random_suffix}"

# Create the S3 bucket
try:
    s3_client.create_bucket(Bucket=bucket_name)
    print(f"Bucket '{bucket_name}' created successfully.")
except Exception as e:
    print(f"Error creating bucket: {e}")
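
The bucket creation call above works as written in `us-east-1`. In other Regions, `create_bucket` also needs a `CreateBucketConfiguration` with a `LocationConstraint`. The sketch below shows one way to build the arguments; the helper names are hypothetical and the name check is a deliberately simplified version of the S3 naming rules.

```python
import random
import re
import string


def make_bucket_name(prefix: str, suffix_length: int = 10) -> str:
    """Append a random lowercase/digit suffix, as the cell above does."""
    alphabet = string.ascii_lowercase + string.digits
    suffix = "".join(random.choices(alphabet, k=suffix_length))
    return f"{prefix}-{suffix}"


def looks_like_valid_bucket_name(name: str) -> bool:
    """Simplified check: 3-63 characters, lowercase letters, digits and hyphens only."""
    return re.fullmatch(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]", name) is not None


def create_bucket_kwargs(bucket_name: str, region: str) -> dict:
    """Build create_bucket arguments; us-east-1 must omit the location constraint."""
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


example_name = make_bucket_name("aws-gen-ai-glue-metadata")
print(example_name, looks_like_valid_bucket_name(example_name))
print(create_bucket_kwargs(example_name, "eu-west-1"))
```

In the notebook, the result could be passed on as `s3_client.create_bucket(**create_bucket_kwargs(bucket_name, region))`.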

### Copy the files from the public source bucket to your new bucket

In [None]:
# Set the bucket name and prefix for the source files
source_bucket_name = 'awsglue-datasets'
source_prefix = 'examples/us-legislators/all/'

# Set the destination bucket name
destination_bucket_name = bucket_name

# List of folder names
folder_names = ['areas', 'countries', 'events', 'memberships', 'organizations', 'persons']

# List of file names
file_names = ['areas.json', 'countries.json', 'events.json', 'memberships.json', 'organizations.json', 'persons.json']

# Create the folders in the destination bucket
for folder_name in folder_names:
    s3_client.put_object(Bucket=destination_bucket_name, Key=(folder_name + '/'))

# Copy the files to the corresponding folders
for folder_name, file_name in zip(folder_names, file_names):
    source_key = f"{source_prefix}{file_name}"
    destination_key = f"{folder_name}/{file_name}"

    copy_source = {
        'Bucket': source_bucket_name,
        'Key': source_key
    }

    s3_client.copy_object(
        CopySource=copy_source,
        Bucket=destination_bucket_name,
        Key=destination_key
    )

print("Files copied successfully!")
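
The cell above keeps two parallel lists (folder names and file names) in sync by hand. As an alternative sketch, the source and destination keys can be derived from a single list of dataset names. `build_copy_plan` is a hypothetical helper, and it assumes every file is named `<name>.json`, which matches the lists used above.

```python
from typing import List, Tuple


def build_copy_plan(source_prefix: str, names: List[str]) -> List[Tuple[str, str]]:
    """Return (source_key, destination_key) pairs, one folder per dataset."""
    return [
        (f"{source_prefix}{name}.json", f"{name}/{name}.json")
        for name in names
    ]


dataset_names = ["areas", "countries", "events", "memberships", "organizations", "persons"]
copy_plan = build_copy_plan("examples/us-legislators/all/", dataset_names)
for source_key, destination_key in copy_plan:
    print(f"{source_key} -> {destination_key}")
```

Each pair could then be handed to `s3_client.copy_object` exactly as in the loop above. Because S3 keys are flat, writing an object under `persons/persons.json` is enough for the `persons/` prefix to appear; the explicit folder placeholders are a convenience rather than a requirement.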

### Create AWS Glue Resources 

In this section, we create the AWS Glue resources: the Glue database and the Glue crawler. The crawler automatically populates the table metadata (this step might take a couple of minutes).

In [None]:
database = 'legislators'

In [None]:
# Create the database
glue_client.create_database(DatabaseInput={'Name': database})
print(f"AWS Glue database '{database}' created successfully.")

# Define the crawler configuration
crawler_name = 'my-s3-crawler'
role_arn = GLUE_CRAWLER_ARN
database_name = database
s3_target_path = 's3://' + bucket_name + '/'

In [None]:
import time 

# Create the crawler
response = glue_client.create_crawler(
    Name=crawler_name,
    Role=role_arn,
    DatabaseName=database_name,
    Description='Crawler for S3 data',
    Targets={
        'S3Targets': [
            {'Path': s3_target_path}
        ]
    }
)
time.sleep(5)
print(f"AWS Glue Crawler '{crawler_name}' created successfully.")

In [None]:
# Run the Glue crawler. This step can take a couple of minutes to complete.
response = glue_client.start_crawler(
    Name=crawler_name
)
print(f"AWS Glue Crawler '{crawler_name}' started.")
time.sleep(5)
state_previous = None
while True:
    response_get = glue_client.get_crawler(Name=crawler_name)
    state = response_get["Crawler"]["State"]
    if state != state_previous:
        print(f"Crawler state: {state}")  # Report each state change once
        state_previous = state
    if state == "READY":  # Other known states: RUNNING, STOPPING
        break
    time.sleep(10)

print(f"AWS Glue Crawler '{crawler_name}' finished.")
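
The polling loop above waits indefinitely. A variant with a timeout is sketched below. `wait_for_state` is a hypothetical helper; the state lookup, sleep function and clock are injected so the logic can be exercised without calling AWS.

```python
import time
from typing import Callable


def wait_for_state(get_state: Callable[[], str],
                   target: str = "READY",
                   timeout_s: float = 600.0,
                   interval_s: float = 10.0,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> str:
    """Poll get_state() until it returns target, or raise TimeoutError."""
    deadline = clock() + timeout_s
    previous = None
    while True:
        state = get_state()
        if state != previous:
            print(f"State: {state}")  # Report each state change once
            previous = state
        if state == target:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"{target!r} not reached within {timeout_s} s (last state: {state!r})")
        sleep(interval_s)


# Simulated crawler that passes through the known states
simulated_states = iter(["RUNNING", "RUNNING", "STOPPING", "READY"])
final_state = wait_for_state(lambda: next(simulated_states), sleep=lambda seconds: None)
```

With the notebook's clients, the lookup could be `lambda: glue_client.get_crawler(Name=crawler_name)["Crawler"]["State"]`.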


## 2. Inspect the AWS Glue data catalog

Once the crawler has completed, we should be able to see our tables.

In [None]:
from botocore.exceptions import ClientError

def get_alltables(database):
    tables = []
    get_tables_paginator = glue_client.get_paginator('get_tables')
    for page in get_tables_paginator.paginate(DatabaseName=database):
        tables.extend(page['TableList'])
    return tables

In [None]:
def json_serial(obj):

    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    raise TypeError ("Type %s not serializable" % type(obj))
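
`json_serial` is needed because the AWS Glue API returns timestamps such as `CreateTime` as `datetime` objects, which `json.dumps` cannot serialize on its own. The example below repeats the helper so that it runs standalone; the sample table dictionary is made up for illustration.

```python
import json
from datetime import date, datetime


def json_serial(obj):
    """Convert date and datetime values to ISO 8601 strings for json.dumps."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    raise TypeError("Type %s not serializable" % type(obj))


# Hypothetical, heavily trimmed table entry
sample_table = {"Name": "persons", "CreateTime": datetime(2024, 6, 1, 12, 30, 0)}
serialized = json.dumps(sample_table, default=json_serial)
print(serialized)
```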

In [None]:
database_tables =  get_alltables(database)
for table in database_tables:
    print(f"Table: {table['Name']}")
    print(f"Columns: {[col['Name'] for col in table['StorageDescriptor']['Columns']]}")

## 3. Generate table metadata descriptions with Anthropic Claude 3 using Amazon Bedrock and LangChain

In [None]:
glue_data_catalog = json.dumps(get_alltables(database),default=json_serial)

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from botocore.config import Config
from langchain_aws import ChatBedrock

model_kwargs ={ 
    "temperature": 0.5, # You can increase or decrease this value depending on the amount of randomness you want injected into the response. A value closer to 1 increases the amount of randomness.
    "top_p": 0.999
}

model = ChatBedrock(
    client = bedrock_client,
    model_id=model_id,
    model_kwargs=model_kwargs
)

### Generate metadata for a specific database table

In [None]:
table = "persons"
response_get_table = glue_client.get_table( DatabaseName = database, Name = table ) 
pprint.pp(response_get_table)

In [None]:
user_msg_template_table="""
I'd like you to create metadata descriptions for the table called {table} in your AWS Glue data catalog. Please follow these steps:
1. Review the data catalog carefully
2. Use all the data catalog information to generate the table description
3. If a column is a primary key or foreign key to another table mention it in the description.
4. In your response, reply with the entire JSON object for the table {table}
5. Remove the DatabaseName, CreatedBy, IsRegisteredWithLakeFormation, CatalogId,VersionId,IsMultiDialectView,CreateTime, UpdateTime. 
6. Write the table description in the Description attribute
7. List all the table columns under the attribute "StorageDescriptor" and then the attribute Columns. Add Location, InputFormat, and SerdeInfo
8. For each column in the StorageDescriptor, add the attribute "Comment".  If a table uses a composite primary key, then the order of a given column in a table’s primary key is listed in parentheses following the column name.
9. Your response must be a valid JSON object.
10. Ensure that the data is accurately represented and properly formatted within the JSON structure. The resulting JSON table should provide a clear, structured overview of the information presented in the original text.
11. If  you cannot think of an accurate description of a column, say 'not available'
Here is the data catalog json in <glue_data_catalog></glue_data_catalog> tags.

<glue_data_catalog>
{data_catalog}
</glue_data_catalog>

Here is some additional information about the database in <notes></notes> tags.
<notes>
Typically foreign key columns consist of the name of the table plus the id suffix
</notes>
"""

In [None]:
messages = [
    ("system", "You are a helpful assistant"),
    ("user", user_msg_template_table),
]

prompt = ChatPromptTemplate.from_messages(messages)

chain = prompt | model | StrOutputParser()

# Invoke the chain. Pass the catalog JSON string itself, not a set containing it.
TableInputFromLLM = chain.invoke({"data_catalog": glue_data_catalog, "table": table})
print(TableInputFromLLM)
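
Although the prompt asks for a valid JSON object, a model may still wrap its answer in introductory text or a Markdown code fence. A defensive parsing step is sketched below. `extract_json_object` is a hypothetical helper that simply takes everything between the first `{` and the last `}`, which is a heuristic rather than a full parser.

```python
import json


def extract_json_object(text: str) -> dict:
    """Parse the outermost {...} span found in a model response."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("No JSON object found in model output")
    return json.loads(text[start:end + 1])


# Made-up response that wraps the JSON in prose and a code fence
fence = "`" * 3
sample_response = (
    "Here is the table definition:\n"
    f"{fence}json\n"
    '{"Name": "persons", "Description": "People who hold or held office"}\n'
    f"{fence}"
)
parsed_table = extract_json_object(sample_response)
print(parsed_table["Name"])
```

In the cells that follow, `extract_json_object(TableInputFromLLM)` could stand in for `json.loads(TableInputFromLLM)`.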


### Update the AWS Glue Data Catalog 

In [None]:
# We validate the LLM response to ensure that it matches the TableInput JSON schema expected by the AWS Glue API.
# This validation code can only be used as a starting point and should be extended with additional validations for your use case.
# Documentation: https://python-jsonschema.readthedocs.io/en/stable/ 

from jsonschema import validate

schema_table_input = {
    "type": "object",
    "properties": {
        "Name": {"type": "string"},
        "Description": {"type": "string"},
        "StorageDescriptor": {
            # Nested fields must sit under "properties"; otherwise jsonschema ignores them
            "type": "object",
            "properties": {
                "Columns": {"type": "array"},
                "Location": {"type": "string"},
                "InputFormat": {"type": "string"},
                "SerdeInfo": {"type": "object"}
            }
        }
    }
}
validate(instance=json.loads(TableInputFromLLM), schema=schema_table_input)


In [None]:
response = glue_client.update_table(DatabaseName=database, TableInput= json.loads(TableInputFromLLM) )
print(f"Table {table} metadata updated!")
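
Step 5 of the prompt relies on the model to drop fields that `update_table` does not accept. The same filtering can be done in code before the update. The sketch below keeps only keys that, to our understanding, belong to the `TableInput` structure at the time of writing; verify the list against the current AWS Glue API reference. `to_table_input` is a hypothetical helper.

```python
# Keys assumed to be accepted by TableInput (check the Glue API reference)
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def to_table_input(table: dict) -> dict:
    """Drop read-only fields such as DatabaseName or CreateTime."""
    return {key: value for key, value in table.items() if key in TABLE_INPUT_KEYS}


# Made-up, trimmed get_table-style entry
sample = {
    "Name": "persons",
    "DatabaseName": "legislators",
    "CreateTime": "2024-06-01T12:30:00",
    "CatalogId": "123456789012",
    "TableType": "EXTERNAL_TABLE",
    "StorageDescriptor": {"Columns": [{"Name": "id", "Type": "string"}]},
}
cleaned = to_table_input(sample)
print(sorted(cleaned))
```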

## 4. Improve metadata descriptions by adding external documentation

### Ingestion workflow

In [None]:
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader
from langchain_aws import BedrockEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

### Download the documentation from EveryPolitician and the Popolo project

In [None]:
from langchain_community.document_loaders import AsyncHtmlLoader
# We use a community HTML loader to load the external documentation, which is published as HTML
urls = ["http://www.popoloproject.com/specs/person.html", "http://docs.everypolitician.org/data_structure.html",'http://www.popoloproject.com/specs/organization.html','http://www.popoloproject.com/specs/membership.html','http://www.popoloproject.com/specs/area.html']
loader = AsyncHtmlLoader(urls)
docs = loader.load()

Let's take a look at some of the content that was loaded

In [None]:
docs[0].page_content[0:1000]

Next, we split the downloaded content into chunks

In [None]:
text_splitter = CharacterTextSplitter(
    separator='\n',
    chunk_size=1000,
    chunk_overlap=200,

)
split_docs = text_splitter.split_documents(docs)
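
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below implements a plain sliding window over characters. It is a simplification: `CharacterTextSplitter` first splits on the separator and then merges the pieces, so its chunk boundaries will differ. `chunk_text` is a hypothetical helper.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# Small numbers make the overlap easy to see
chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

The overlap means that a sentence straddling a chunk boundary still appears intact in at least one chunk, at the cost of some duplicated text in the index.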

In [None]:
embedding_model = BedrockEmbeddings(
    client=bedrock_client,
    model_id=embeddings_model_id
)
vs = FAISS.from_documents(split_docs, embedding_model)

In [None]:
search_results = vs.similarity_search(
    'What standards are used in the dataset?', k=2
)
print(search_results[0].page_content)
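
Conceptually, `similarity_search` embeds the query and returns the stored chunks whose vectors lie closest to it. The toy example below shows that idea with hand-written two-dimensional vectors and Euclidean distance. The vectors and labels are made up, and the metric actually used depends on how the vector index is configured.

```python
import math
from typing import Dict, List, Sequence


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the k labels whose vectors are nearest to the query vector."""
    return sorted(index, key=lambda label: math.dist(query, index[label]))[:k]


# Made-up embeddings; real embeddings have hundreds of dimensions or more
toy_index = {
    "person spec": (0.9, 0.1),
    "area spec": (0.1, 0.9),
    "membership spec": (0.7, 0.3),
}
nearest = top_k((1.0, 0.0), toy_index, k=2)
print(nearest)
```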

### Query workflow

In [None]:
from operator import itemgetter
from langchain_core.callbacks import BaseCallbackHandler
from typing import Dict, List, Any

class PromptHandler(BaseCallbackHandler):
    def on_llm_start( self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any) -> Any:
        output = "\n".join(prompts)
        print(output)



system = "You are a helpful assistant. You do not generate any harmful content."
# specify a user message
user_msg_rag = """

Here is the guidance document you should reference when answering the user: 
<documentation>{context}</documentation>

I'd like you to create metadata descriptions for the table called {table} in your AWS Glue data catalog. Please follow these steps:
1. Review the data catalog carefully
2. Use all the data catalog information and the <documentation> to generate the table description 
3. If a column is a primary key or foreign key to another table mention it in the description.
4. In your response, reply with the entire JSON object for the table {table}
5. Remove the DatabaseName, CreatedBy, IsRegisteredWithLakeFormation, CatalogId,VersionId,IsMultiDialectView,CreateTime, UpdateTime. 
6. Write the table description in the Description attribute. Ensure you use information from the <documentation> if possible.
7. List all the table columns under the attribute "StorageDescriptor" and then the attribute Columns. Add Location, InputFormat, and SerdeInfo
8. For each column in the StorageDescriptor, add the attribute "Comment". Ensure Name and Type are included.  If a table uses a composite primary key, then the order of a given column in a table’s primary key is listed in parentheses following the column name.
9. Your response must be a valid JSON object.
10. Ensure that the data is accurately represented and properly formatted within the JSON structure. The resulting JSON table should provide a clear, structured overview of the information presented in the original text.
11. If  you cannot think of an accurate description of a column, say 'not available'

<glue_data_catalog>
{data_catalog}
</glue_data_catalog>

Here is some additional information about the database in <notes></notes> tags.
<notes>
Typically foreign key columns consist of the name of the table plus the id suffix
</notes>
"""


messages = [
    ("system", system),
    ("user", user_msg_rag),
]

prompt = ChatPromptTemplate.from_messages(messages)

# Retrieve and Generate
retriever = vs.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3},
)

In [None]:
chain = (
    {"context": itemgetter("table")| retriever, "data_catalog": itemgetter("data_catalog"), "table": itemgetter("table")}
    | prompt
    | model
    | StrOutputParser()
)

TableInputFromLLM = chain.invoke({"data_catalog":glue_data_catalog, "table":table})
print(TableInputFromLLM)
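
In the chain above, the retriever returns a list of document objects, and the prompt template then interpolates that list into `{context}` as-is, which as far as we can tell includes metadata and list punctuation along with the text. A small formatting step that keeps only the text is sketched below. `Doc` is a stand-in class for illustration; in the notebook the objects would be LangChain documents, which also expose `page_content`.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document."""
    page_content: str


def format_docs(docs: Iterable[Doc]) -> str:
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [Doc("A person is a real person, alive or dead."), Doc("A membership links a person to an organization.")]
context = format_docs(retrieved)
print(context)
```

Assuming LangChain's usual coercion of plain functions into runnables, the helper could be appended to the retrieval branch as `itemgetter("table") | retriever | format_docs`.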

### Update the AWS Glue Data Catalog

In [None]:
# We will validate the LLM response to ensure that it matches the Table input JSON schema as expected by the AWS Glue API. 
# This validation code can only be used as a starting point and should be extended with additional validations for your use case.
# Documentation: https://python-jsonschema.readthedocs.io/en/stable/ 

from jsonschema import validate

schema_table_input = {
    "type": "object",
    "properties": {
        "Name": {"type": "string"},
        "Description": {"type": "string"},
        "StorageDescriptor": {
            # Nested fields must sit under "properties"; otherwise jsonschema ignores them
            "type": "object",
            "properties": {
                "Columns": {"type": "array"},
                "Location": {"type": "string"},
                "InputFormat": {"type": "string"},
                "SerdeInfo": {"type": "object"}
            }
        }
    }
}
validate(instance=json.loads(TableInputFromLLM), schema=schema_table_input)
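
Schema validation confirms the shape of the response, but not that the model preserved the table's columns. One additional check, sketched below, compares column names and types between the original catalog entry and the generated one. `column_differences` is a hypothetical helper and the sample tables are made up.

```python
def column_differences(original: dict, generated: dict) -> dict:
    """Report columns that were dropped, invented, or had their type changed."""
    def columns(table: dict) -> dict:
        cols = table.get("StorageDescriptor", {}).get("Columns", [])
        return {col["Name"]: col.get("Type") for col in cols}

    before, after = columns(original), columns(generated)
    shared = set(before) & set(after)
    return {
        "missing": sorted(set(before) - set(after)),
        "unexpected": sorted(set(after) - set(before)),
        "type_changed": sorted(name for name in shared if before[name] != after[name]),
    }


original_table = {"StorageDescriptor": {"Columns": [
    {"Name": "id", "Type": "string"}, {"Name": "birth_date", "Type": "string"}]}}
generated_table = {"StorageDescriptor": {"Columns": [
    {"Name": "id", "Type": "string", "Comment": "Primary key"}, {"Name": "age", "Type": "int"}]}}
differences = column_differences(original_table, generated_table)
print(differences)
```

If all three lists are empty, the generated definition could be applied with `glue_client.update_table` in the same way as in section 3.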

## 5. Clean up

### 1. Delete S3 bucket

In [None]:
s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
bucket = s3.Bucket(bucket_name)

# Delete all objects in the bucket
bucket.objects.all().delete()

# Delete the bucket
response = s3_client.delete_bucket(Bucket=bucket_name)

### 2. Delete Glue Crawler

In [None]:
response = glue_client.delete_crawler(Name=crawler_name)
print(f"Deleting crawler: {crawler_name}")

### 3. Delete AWS Glue Data Catalog tables and database 

In [None]:
def delete_tables_in_database(database_name):
    # Get a list of all tables in the database
    tables = glue_client.get_tables(DatabaseName=database_name)['TableList']

    # Iterate over the tables and delete each one
    for table in tables:
        table_name = table['Name']
        glue_client.delete_table(DatabaseName=database_name, Name=table_name)
        print(f"Deleting table: {table_name}")


delete_tables_in_database(database)
response = glue_client.delete_database(Name=database)
print(f"Deleting database: {database}")
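
`delete_tables_in_database` issues one request per table, which is fine for the six tables in this example. For larger databases, table names could be grouped and removed with `batch_delete_table`. The sketch below shows only the grouping step and assumes a limit of 100 names per call, which should be checked against the AWS Glue API reference. `chunked` is a hypothetical helper.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive groups of at most size items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Made-up table names to show the grouping
table_names = [f"table_{i}" for i in range(250)]
batch_sizes = [len(batch) for batch in chunked(table_names)]
print(batch_sizes)
```

Each batch could then be passed as `TablesToDelete` to `glue_client.batch_delete_table(DatabaseName=database, TablesToDelete=batch)`.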


### 4. (Optional) Delete the AWS Glue crawler IAM role

In [None]:
# You can optionally delete the AWS Glue crawler IAM role by uncommenting the code below. Make sure that the IAM role you use to run this notebook has the appropriate permissions.

# iam = boto3.client('iam')
# response = iam.delete_role(RoleName="<YOUR_GLUE_CRAWLER_ROLE_NAME>")
# print(f'Role {role_arn} has been deleted successfully.')
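
`delete_role` expects a role name, whereas the notebook stores the role's ARN. The name is the last path segment of the ARN's resource part, as the sketch below shows. `role_name_from_arn` is a hypothetical helper and the ARN is a made-up example.

```python
def role_name_from_arn(role_arn: str) -> str:
    """Extract the role name from an IAM role ARN, ignoring any path."""
    parts = role_arn.split(":", 5)
    if len(parts) != 6 or not parts[5].startswith("role/"):
        raise ValueError(f"Not an IAM role ARN: {role_arn!r}")
    return parts[5].rsplit("/", 1)[-1]


example_arn = "arn:aws:iam::123456789012:role/service-role/MyGlueCrawlerRole"
role_name = role_name_from_arn(example_arn)
print(role_name)
```

Note that IAM generally refuses to delete a role that still has policies attached, so those would need to be detached or deleted first.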
