# How to use Amazon Bedrock Titan FM embedding vectors to build an LLM content moderation engine with RandomForestClassifier

In this demo notebook, we demonstrate how to use Bedrock Titan FM embedding vectors to build an LLM content moderation engine with RandomForestClassifier.

We will use the Bedrock Python SDK for Embeddings Generation.


## Content moderation Pipeline

Email template content -> Bedrock Titan FM -> Embedding vectors -> RandomForestClassifier -> Compliance result

##### 1. Set up environments
##### 2. Embedding generation by Amazon Titan FM
##### 3. Train a RandomForest classifier model
##### 4. Inference

Note: This notebook was tested in Amazon SageMaker Studio with Python 3 (Data Science 3.0) kernel.

# 1. Set up environments

---
Before running the notebook for the first time, run the cell below to upgrade boto3 and botocore to versions that include Bedrock support.

---

In [None]:
%pip install --upgrade pip
%pip install boto3 --upgrade
%pip install botocore --upgrade

In [2]:
import boto3
import botocore

# Get the Boto3 version
boto3_version = boto3.__version__

# Get the Botocore version
botocore_version = botocore.__version__


# Print the Boto3 version
print("Current Boto3 Version:", boto3_version)

# Print the Botocore version
print("Current Botocore Version:", botocore_version)

Current Boto3 Version: 1.28.63
Current Botocore Version: 1.31.63


Let's initialize the boto3 client to use Bedrock.

In [3]:
import boto3
import pandas as pd
import json
bedrock = boto3.client(
 service_name='bedrock',
 region_name='us-east-1',
 endpoint_url='https://bedrock.us-east-1.amazonaws.com'
)


Let's test the endpoint to see which models are available.

In [4]:
bedrock.list_foundation_models()

{'ResponseMetadata': {'RequestId': '1dcb1a84-4ae4-40f3-8fe2-efca1263b528',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Mon, 16 Oct 2023 01:38:53 GMT',
   'content-type': 'application/json',
   'content-length': '5729',
   'connection': 'keep-alive',
   'x-amzn-requestid': '1dcb1a84-4ae4-40f3-8fe2-efca1263b528'},
  'RetryAttempts': 0},
 'modelSummaries': [{'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-tg1-large',
   'modelId': 'amazon.titan-tg1-large',
   'modelName': 'Titan Text Large',
   'providerName': 'Amazon',
   'inputModalities': ['TEXT'],
   'outputModalities': ['TEXT'],
   'responseStreamingSupported': True,
   'customizationsSupported': ['FINE_TUNING'],
   'inferenceTypesSupported': ['ON_DEMAND']},
  {'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-e1t-medium',
   'modelId': 'amazon.titan-e1t-medium',
   'modelName': 'Titan Text Embeddings',
   'providerName': 'Amazon',
   'inputModalities': ['TEXT'],
   'outputModalities'

## Load training dataset CSV file into dataframe

In [5]:
# Specify the file path
csv_file = "compliance_dataset.csv"

# Load the CSV file into a DataFrame
df = pd.read_csv(csv_file)

# Display the first few rows of the DataFrame to verify the data loading
print(df.head())

                                                text           label
0  He is a piece of shit, greedy capitalist who e...  non-compliance
1  The senile credit card shrill from Delaware ne...  non-compliance
2  He does that a lot -- makes everyone look good...  non-compliance
3                                         F*ck Lizzo  non-compliance
4  Epstein and trump were best buds!!! Pedophiles...  non-compliance


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    1000 non-null   object
 1   label   1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB


In [7]:
# Duplicate the original DataFrame and assign it to 'vectors_df'
# (df.copy() makes an independent copy; pd.DataFrame(df) would not)
vectors_df = df.copy()

In [8]:
# Count the number of 'compliance' and 'non-compliance' labels
label_counts = vectors_df['label'].value_counts()

# Print the counts
print(label_counts)


non-compliance    501
compliance        499
Name: label, dtype: int64


In [9]:
# Export the DataFrame 'vectors_df' to a JSON file named 'compliance_dataset.json'
vectors_df.to_json('compliance_dataset.json', orient='records', lines=True)


In [10]:
import json

# Open the JSON file for reading
with open('compliance_dataset.json', 'r', encoding='utf-8') as file:
    data = []
    for line in file:
        # Parse each line as a separate JSON object
        try:
            record = json.loads(line.strip())
            data.append(record)
        except json.JSONDecodeError as e:
            print(f"Skipping invalid JSON: {e}")

# 'data' now contains a list of dictionaries, each representing a JSON object


In [11]:
import json

# Open the JSON file for reading
with open('compliance_dataset.json', 'r', encoding='utf-8') as file:
    data = []
    for line in file:
        # Parse each line as a separate JSON object
        try:
            record = json.loads(line.strip())
            data.append(record)
        except json.JSONDecodeError as e:
            print(f"Skipping invalid JSON: {e}")

# 'data' now contains a list of dictionaries, each representing a JSON object

# Specify the filename for the new JSON file
output_filename = 'parsed_compliance_data.json'

# Write the 'data' list to the new JSON file
with open(output_filename, 'w', encoding='utf-8') as output_file:
    json.dump(data, output_file, indent=4)

print(f'Data has been saved to {output_filename}')


Data has been saved to parsed_compliance_data.json


In [12]:
import json

# Specify the filename of the JSON file you want to load
input_filename = 'parsed_compliance_data.json'

# Load the JSON data from the file into a Python variable
with open(input_filename, 'r', encoding='utf-8') as input_file:
    loaded_data = json.load(input_file)

# Now, 'loaded_data' contains the JSON data as a Python data structure (likely a list of dictionaries)

# You can now work with 'loaded_data' in your Jupyter Notebook


In [13]:
# Assuming you've already loaded the JSON data into the 'loaded_data' variable

# Display the top 3 records
top_3_records = loaded_data[:3]
for record in top_3_records:
    print(record)



{'text': 'He is a piece of shit, greedy capitalist who exploits workers, and offers nothing of real benefit to the world.\n All he’s done is make a name for himself on the backs of other people, using dirty money from his family’s emerald mine they acquired during apartheid.\n I don’t care that he’s autistic. He thinks we should be cured with his company’s AI chip. \n He is not a representation of our community. Don’t celebrate him on this page.', 'label': 'non-compliance'}
{'text': 'The senile credit card shrill from Delaware needs to resign!!', 'label': 'non-compliance'}
{'text': "He does that a lot -- makes everyone look good but him...I guess it's also probably the Dems and the Media that force him to compulsively tweet abject bullshit like a lying bitch. They're tricky, them libs.", 'label': 'non-compliance'}


# 2. Embeddings Generation

Embeddings are a key concept in generative AI and in machine learning in general. An embedding is a representation of an object (such as a word, image, or video) in a vector space. Typically, semantically similar objects have embeddings that are close together in that space. This makes embeddings very useful for use cases such as semantic search, recommendations, and classification.
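To make "close together in the vector space" concrete, here is a minimal sketch that compares vectors with cosine similarity, one common closeness measure. The three-dimensional vectors and their names are made-up stand-ins for illustration; real Titan embeddings are much longer lists of floats.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (hypothetical values).
refund_request = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
weather_report = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(refund_request, money_back)
sim_far = cosine_similarity(refund_request, weather_report)

# The two "related" vectors score higher than the unrelated pair.
print(sim_close > sim_far)
```

A classifier such as the Random Forest used later does not compute this similarity directly, but it benefits from the same property: texts with similar meaning land in similar regions of the feature space.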

# We will be using the Titan Embeddings Model to generate our Embeddings.

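# invoke_model is exposed by the 'bedrock-runtime' client, not the 'bedrock' control-plane client
bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')

def get_embedding(body, modelId, accept, contentType):
    # Send the JSON request body to the model and read the streamed JSON response
    response = bedrock_runtime.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())
    embedding = response_body.get('embedding')
    return embedding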

body = json.dumps({"inputText": "explain black holes to 8th graders"})
modelId = 'amazon.titan-e1t-medium'
accept = 'application/json'
contentType = 'application/json'

embedding = get_embedding(body, modelId, accept, contentType)
print(embedding)

## Define get embedding function

In [14]:
bedrock_client = boto3.client('bedrock-runtime')
import json

def get_embedding(body, modelId, accept, contentType):
    response = bedrock_client.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())
    embedding = response_body.get('embedding')
    return embedding
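The request and response shapes that `get_embedding` relies on can be exercised without calling AWS. The sketch below is an illustration only: `build_titan_request`, `parse_titan_response`, and `fake_response` are hypothetical names, and the fake response merely mimics the structure the notebook code reads (a `body` stream containing JSON with an `embedding` key).

```python
import io
import json

def build_titan_request(text):
    """Serialize text into the JSON body the notebook passes to invoke_model."""
    return json.dumps({"inputText": text})

def parse_titan_response(response):
    """Extract the 'embedding' list from an invoke_model-style response dict."""
    payload = json.loads(response["body"].read())
    return payload.get("embedding")

# Stand-in for a real response: 'body' is a readable stream of JSON bytes.
fake_response = {
    "body": io.BytesIO(json.dumps({"embedding": [0.1, -0.2, 0.3]}).encode("utf-8"))
}

request_body = build_titan_request("explain black holes to 8th graders")
vector = parse_titan_response(fake_response)

print(request_body)
print(vector)
```

Separating the serialization and parsing steps like this makes it possible to unit test them offline; only the `invoke_model` call itself needs credentials and network access.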

## Generate embedding vectors with the Titan FM model 'amazon.titan-embed-text-v1'
This step takes around 2 minutes.


In [16]:
# Load the parsed JSON data from 'parsed_compliance_data.json'
with open('parsed_compliance_data.json', 'r', encoding='utf-8') as input_file:
    data = json.load(input_file)

# Initialize a list to store the results
results = []

# Loop through each record in the data
for record in data:
    text = record['text']
    label = record['label']

    # Calculate the embedding for the text
    body = json.dumps({"inputText": text})
    modelId = 'amazon.titan-embed-text-v1'
    accept = 'application/json'
    contentType = 'application/json'
    embedding = get_embedding(body, modelId, accept, contentType)

    # Create a result dictionary with text, label, and embedding
    result = {
        #'text': text,
        'label': label,
        'embedding': embedding
    }

    # Append the result to the list of results
    results.append(result)

# Save the results to 'moderation_vectors.json'
with open('moderation_vectors.json', 'w', encoding='utf-8') as output_file:
    json.dump(results, output_file, indent=4)

print('Embedding vectors have been saved to moderation_vectors.json')


Embedding vectors have been saved to moderation_vectors.json
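The loop above makes one API call per record, so a single transient failure (for example, throttling) would abort the whole run. One possible mitigation is a retry wrapper with exponential backoff. The sketch below is an assumption, not part of the original notebook: `call_with_retries` and `flaky_embedding` are hypothetical names, and catching every `Exception` is a simplification; in practice you would likely narrow it to `botocore.exceptions.ClientError` and inspect the error code.

```python
import time

def call_with_retries(func, *args, max_attempts=4, base_delay=0.5,
                      sleep=time.sleep, **kwargs):
    """Call func, retrying on any exception with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * 2 ** (attempt - 1))

# Demo with a fake embedding function that fails twice before succeeding.
calls = {"count": 0}

def flaky_embedding(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated throttling")
    return [0.0, 1.0]

# Record the requested delays instead of actually sleeping.
delays = []
result = call_with_retries(flaky_embedding, "hello", sleep=delays.append)
print(result, delays)
```

In the notebook, the same wrapper could be applied as `call_with_retries(get_embedding, body, modelId, accept, contentType)` inside the loop.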


# 3. Train a classifier model
## Prepare a training dataset of 800 records and a test dataset of 200 records
Load the 'moderation_vectors.json' file, extract the first 100 and last 100 records and save them to 'test.json', then save the remaining 800 records to 'train.json'. The following sample code does this:

In [17]:
import json

# Load the 'moderation_vectors.json' file
with open('moderation_vectors.json', 'r') as json_file:
    data = json.load(json_file)

# Extract the first 100 and last 100 records
first_100_records = data[:100]
last_100_records = data[-100:]

# Create 'test.json' with the combined 200 records
test_data = first_100_records + last_100_records
with open('test.json', 'w') as test_file:
    json.dump(test_data, test_file)

# Create 'train.json' with the remaining 800 records
train_data = data[100:-100]
with open('train.json', 'w') as train_file:
    json.dump(train_data, train_file)
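Taking the first and last 100 records gives a balanced test set only if the file happens to be ordered by label. A shuffled, stratified split does not depend on record order. The sketch below uses only the standard library; `stratified_split` and the toy records are hypothetical, and scikit-learn's `train_test_split` with its `stratify` argument is a common alternative.

```python
import random
from collections import defaultdict

def stratified_split(records, test_fraction=0.2, seed=42):
    """Shuffle within each label, then hold out test_fraction of each label."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    rng = random.Random(seed)  # Fixed seed for a reproducible split.
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Toy records shaped like the entries in moderation_vectors.json (made-up embeddings).
toy = [{"label": "non-compliance", "embedding": [i]} for i in range(10)]
toy += [{"label": "compliance", "embedding": [i]} for i in range(10)]

train_toy, test_toy = stratified_split(toy)
print(len(train_toy), len(test_toy))
```

Applied to the 1,000 records in this notebook with `test_fraction=0.2`, this would hold out roughly 100 records per label, similar in size to the slice-based split.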


In [20]:
import json
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report

### Training a model - RandomForestClassifier by embedding vectors

In [21]:


# Load the training dataset from 'train.json'
with open('train.json', 'r') as f:
    training_data = json.load(f)

# Load the test dataset from 'test.json'
with open('test.json', 'r') as f:
    test_data = json.load(f)

# Extract features (embedding vectors) and labels from the datasets
X_train = [data_point["embedding"] for data_point in training_data]
y_train = [data_point["label"] for data_point in training_data]

X_test = [data_point["embedding"] for data_point in test_data]
y_test = [data_point["label"] for data_point in test_data]

# Convert lists to numpy arrays for scikit-learn
X_train = np.array(X_train)
y_train = np.array(y_train)

X_test = np.array(X_test)
y_test = np.array(y_test)

# Build the classification model (Random Forest in this example)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Evaluate the model
y_pred = clf.predict(X_test)

# Calculate and print accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate and print precision
precision = precision_score(y_test, y_pred, average='weighted')
print("Precision:", precision)

# Calculate and print recall
recall = recall_score(y_test, y_pred, average='weighted')
print("Recall:", recall)

# Calculate and print F1-score
f1 = f1_score(y_test, y_pred, average='weighted')
print("F1-score:", f1)

# Calculate and print ROC-AUC score (Note: ROC-AUC is typically used for binary classification)
if len(np.unique(y_test)) == 2:  # Check if it's a binary classification problem
    roc_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print("ROC-AUC:", roc_auc)

# Print the detailed classification report
classification_report_str = classification_report(y_test, y_pred)
print("Classification Report:\n", classification_report_str)


Accuracy: 0.985
Precision: 0.9850485048504851
Recall: 0.985
F1-score: 0.9849996249906248
ROC-AUC: 0.9967499999999999
Classification Report:
                 precision    recall  f1-score   support

    compliance       0.98      0.99      0.99       100
non-compliance       0.99      0.98      0.98       100

      accuracy                           0.98       200
     macro avg       0.99      0.98      0.98       200
  weighted avg       0.99      0.98      0.98       200
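To see where the per-class numbers in the report come from, the sketch below recomputes precision, recall, and F1 from raw counts. The label lists are an assumption: they are constructed to be consistent with the reported recall values (1 of 100 'compliance' and 2 of 100 'non-compliance' test records misclassified) and are not taken from the notebook's actual predictions. `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Assumed predictions: 99/100 'compliance' and 98/100 'non-compliance' correct.
y_true_demo = ["compliance"] * 100 + ["non-compliance"] * 100
y_pred_demo = (["compliance"] * 99 + ["non-compliance"] * 1
               + ["non-compliance"] * 98 + ["compliance"] * 2)

p, r, f = precision_recall_f1(y_true_demo, y_pred_demo, positive="non-compliance")
print(round(p, 2), round(r, 2), round(f, 2))
```

For a moderation engine, recall on 'non-compliance' is often the figure to watch, since a missed violation is usually more costly than a false alarm.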



### Save and Load the Trained Model
Save the trained Random Forest classifier after training so that it can be loaded later for inference. You can use the joblib library to save and load scikit-learn models.

In [22]:
# saving the model after training:

from joblib import dump

# 'clf' is the classifier already trained in the previous cell, so no refit is needed

# Save the trained model to a file
dump(clf, 'trained_model.joblib')


['trained_model.joblib']

In [23]:
# loading the saved model for inference:

from joblib import load

# Load the trained model from a file
clf = load('trained_model.joblib')


## Prepare New Data

New data must be preprocessed in the same way as the training and test data. Here, that means obtaining the embedding vector for the new text with the same 'get_embedding' function and the same Titan model.

### Load the email content from 'email_content_english.json' into the new_text variable

In [25]:
import json

# Specify the filename of the JSON file
json_filename = 'email_content_english.json'

# Load the JSON data from the file into 'data'
with open(json_filename, 'r', encoding='utf-8') as json_file:
    data = json.load(json_file)

# Extract the email content from the 'email' key (empty string if the key is missing)
new_text = data.get('email', '')


# Now, 'new_text' contains the email content (title and body) from the JSON file


In [26]:
print (new_text)

```json
{
  "title": "Enjoy 50% off flights to Hong Kong with TigerPounce Express!",
  "body": "Dear valued member,

We are excited to offer you an exclusive 50% discount on flights from Kuala Lumpur to Hong Kong in October 2023! Use promo code 56298 on https://demobooking.demo.co to receive half off our already low fares.

Experience the wonders of Hong Kong with this customizable 10-day itinerary:

Day 1: Arrive in Hong Kong. Check into your hotel and explore the bustling streets of Mong Kok. Wander through the sprawling Temple Street Night Market. 

Day 2: Take the tram up to Victoria Peak for stunning views of the city and harbor. Ride a double-decker bus around Hong Kong Island. Visit Man Mo Temple.

Day 3: Escape the city with a day trip to Lantau Island. Ride the cable car up to the Tian Tan Buddha statue and explore the traditional fishing villages. Enjoy an al fresco seafood lunch.

Day 4: Shop your way through the winding lanes of the Ladies' Market. Browse the stalls at the 

### Compute the embedding for new_text with the get_embedding function

In [27]:
bedrock_client = boto3.client('bedrock-runtime')

In [28]:
new_text_embedding = get_embedding(json.dumps({"inputText": new_text}), modelId, accept, contentType)


In [29]:
# Assuming you have calculated 'new_text_embedding' using your get_embedding function
print("new_text_embedding:", new_text_embedding)


new_text_embedding: [0.59765625, 0.28710938, 0.18945312, 0.14746094, -0.24511719, 0.017578125, -0.26367188, 0.0003376007, 0.140625, -0.040039062, 0.04736328, -0.21972656, 0.27734375, -0.033203125, 0.052978516, -0.19824219, -0.31445312, -0.20703125, 0.045654297, 0.20996094, -0.11816406, -0.14257812, -0.2421875, 0.15332031, -0.09716797, 0.42578125, -0.13867188, 0.013793945, -0.091308594, -0.13769531, -0.41015625, -0.022216797, -0.12597656, -0.27734375, -0.51953125, -0.0025024414, -0.24804688, 0.017578125, 0.36914062, 0.2734375, 0.328125, -0.33203125, 0.095703125, 0.06640625, 0.1875, -0.5078125, -0.17382812, 0.11376953, 0.20703125, 0.18554688, 0.036865234, -0.096191406, 0.28710938, -0.07080078, 0.08984375, -0.16015625, 0.20605469, -0.29492188, 0.4609375, -0.24511719, -0.17480469, -0.010559082, -0.44335938, -0.49609375, 0.041503906, -0.02331543, 0.020507812, 0.06640625, -0.1796875, 0.027709961, -0.31640625, -0.09667969, 0.18652344, 0.06591797, 0.66796875, -0.076171875, 0.08496094, 0.115234

# 4. Perform Inference

Use the loaded model to make predictions on the new data. You can use the predict method of your classifier.

In [30]:
# Predict the label for the new data
predicted_label = clf.predict([new_text_embedding])

# Print the predicted label
print("Predicted Label:", predicted_label[0])


Predicted Label: compliance


In [31]:
# Predict the label and obtain probability estimates
probability_estimates = clf.predict_proba([new_text_embedding])
predicted_label = clf.predict([new_text_embedding])

# Print the predicted label and probability estimates
print("Predicted Label:", predicted_label[0])
print("Probability Estimates:", probability_estimates[0])


Predicted Label: compliance
Probability Estimates: [0.83 0.17]
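The probability array is ordered like `clf.classes_`, which scikit-learn sorts, so here the first value belongs to 'compliance' and the second to 'non-compliance'. The sketch below pairs the names with the values and applies a decision threshold. The `decide` helper and the threshold of 0.3 are hypothetical choices for illustration; a suitable threshold depends on how costly missed violations are for your use case.

```python
def decide(classes, probabilities, flag_label="non-compliance", threshold=0.3):
    """Pair class names with probabilities and flag content above a threshold."""
    scores = dict(zip(classes, probabilities))
    flagged = scores[flag_label] >= threshold
    return scores, flagged

# Probabilities copied from the output above; the class order assumes the
# sorted clf.classes_, i.e. ['compliance', 'non-compliance'].
scores, flagged = decide(["compliance", "non-compliance"], [0.83, 0.17])
print(scores, flagged)
```

In the notebook, the first two arguments would be `clf.classes_` and `probability_estimates[0]`. Lowering the threshold below 0.5 flags more content for review than `clf.predict` alone would.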


## Summary

In this notebook, you learned how to use Amazon Bedrock Titan FM embedding vectors to build an LLM content moderation engine with a RandomForestClassifier.