## Step 1a: Create Cell-by-Cell CSV Embeddings (using Amazon Bedrock's Titan Embedding Model)

In this notebook, we generate cell-by-cell embeddings for the CSV file at `./data/csv_xlsx/chan-RenalGenie_Clinical_Note_csv.csv` using the `amazon.titan-embed-text-v2:0` model. This approach is slow (12 mins 51 secs for this file), so row-by-row embedding generation (step 01b) should be considered instead.

We begin by initializing a session from an AWS SSO profile that has permission to invoke the `amazon.titan-embed-text-v2:0` model on Amazon Bedrock. A different embedding model may be specified if desired.

In [None]:
import boto3
import pandas as pd
import os

# Initialize the Boto3 client
# Important: Ensure that your AWS SSO credentials are configured (`aws configure sso`)
# and that you are logged in (`aws sso login --profile renalworks-bedrock`).
# Replace the profile name with your own. Profiles are listed in ~/.aws/config
# (for Windows users, C:\Users\{YOUR_USERNAME}\.aws\config).
session = boto3.Session(profile_name='renalworks-bedrock')
boto3_bedrock = session.client('bedrock-runtime')

# File name
file_name = 'chan-RenalGenie_Clinical_Note_csv.csv'

# Load the CSV file (data frame)
df = pd.read_csv(os.path.join('data', 'csv_xlsx', file_name))
df_cleaned = df.dropna(how='all')

# Print the cleaned data frame
print(df_cleaned)

We then test the Titan Embedding model with a short sample string and check that it returns a list of floating-point numbers.

In [None]:
import json

# Define the embedding-getter function
def get_embeddings(text):
    """Return the Titan embedding (a list of floats) for the given text."""
    body = json.dumps({"inputText": text})
    modelId = "amazon.titan-embed-text-v2:0"  # (Change this to try different embedding models)
    accept = "application/json"
    contentType = "application/json"

    response = boto3_bedrock.invoke_model(
        body=body, modelId=modelId, accept=accept, contentType=contentType
    )
    response_body = json.loads(response.get("body").read())

    embedding = response_body.get("embedding")
    return embedding

sample_embedding = get_embeddings('i love renalworks')
print(f"The sample embedding vector has {len(sample_embedding)} values\n{sample_embedding[0:3]+['...']+sample_embedding[-3:]}")
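
Because the cell-by-cell approach concatenates one vector per column, the size of each per-cell vector directly affects the width of the final matrix. As far as the Bedrock documentation describes it, Titan Text Embeddings V2 accepts optional `dimensions` (256, 512 or 1024) and `normalize` fields alongside `inputText`. The sketch below only builds the request body and does not call AWS; `build_titan_v2_body` and `ALLOWED_DIMENSIONS` are hypothetical names that are not part of this notebook.

```python
import json

# Output sizes that Titan Text Embeddings V2 is documented to accept for "dimensions"
ALLOWED_DIMENSIONS = (256, 512, 1024)

def build_titan_v2_body(text, dimensions=1024, normalize=True):
    """Hypothetical helper: build the JSON request body passed to invoke_model."""
    if dimensions not in ALLOWED_DIMENSIONS:
        raise ValueError(f"dimensions must be one of {ALLOWED_DIMENSIONS}")
    return json.dumps(
        {"inputText": text, "dimensions": dimensions, "normalize": normalize}
    )

body = build_titan_v2_body("i love renalworks", dimensions=256)
print(body)
# {"inputText": "i love renalworks", "dimensions": 256, "normalize": true}
```

The resulting string could be passed as `body` to `invoke_model` in place of the one built inside `get_embeddings`, assuming the chosen model supports these fields.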

The following three code blocks generate an embedding for the text in each cell individually, then concatenate the embeddings across each row to form the combined embedding for that row.

While this may preserve the column separation of the tabular data, it requires one model call per cell, so embedding generation takes a very long time. It is therefore not recommended: waiting this long for a single CSV file to be processed would be impractical for users.
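
To make the cost difference concrete, here is a minimal sketch that counts model calls for both strategies. It uses a stand-in embedder (`fake_embed`, hypothetical) instead of Bedrock, two made-up sample rows, and assumes that the row-by-row variant simply joins the cell text with `" | "`, which may differ from what step 01b actually does.

```python
def fake_embed(text, dim=4):
    """Hypothetical stand-in for get_embeddings: returns a fixed-size vector."""
    fake_embed.calls += 1
    return [float(len(text))] * dim

fake_embed.calls = 0

# Made-up sample rows, for illustration only
rows = [
    {"note": "stable", "plan": "review in 2 weeks"},
    {"note": "fluid overload", "plan": "adjust dry weight"},
]
columns = ["note", "plan"]

# Cell-by-cell: one call per cell, vectors concatenated across the row
fake_embed.calls = 0
cell_vectors = [
    [value for col in columns for value in fake_embed(str(row[col]))]
    for row in rows
]
cell_calls = fake_embed.calls

# Row-by-row: one call per row on the joined cell text (assumed joining scheme)
fake_embed.calls = 0
row_vectors = [
    fake_embed(" | ".join(str(row[col]) for col in columns)) for row in rows
]
row_calls = fake_embed.calls

print(cell_calls, len(cell_vectors[0]))  # 4 8
print(row_calls, len(row_vectors[0]))    # 2 4
```

With `R` rows and `C` embedded columns, the cell-by-cell approach makes `R * C` calls and produces vectors `C` times as wide, whereas the row-by-row approach makes `R` calls.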

In [None]:
import itertools

# Defines how embeddings across a row should be combined
# Currently, they are simply concatenated. Averaging is an alternative, but it
# does not seem like a good idea because it would blur the column separation.
def combine_embeddings(embeddings):
    # Concatenate the per-cell embeddings into one long list for the row
    return list(itertools.chain.from_iterable(embeddings))
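
As a small illustration of why concatenation is used here rather than averaging, the sketch below compares the two on toy vectors. The `concatenate` and `average` functions and the toy values are hypothetical and only meant to show the shape and ordering behaviour.

```python
import itertools

def concatenate(embeddings):
    """Join per-cell vectors end to end; the result grows with the number of cells."""
    return list(itertools.chain.from_iterable(embeddings))

def average(embeddings):
    """Element-wise mean; assumes every vector has the same length."""
    n = len(embeddings)
    return [sum(values) / n for values in zip(*embeddings)]

cell_a = [1.0, 0.0, 2.0]
cell_b = [3.0, 4.0, 0.0]

print(concatenate([cell_a, cell_b]))  # [1.0, 0.0, 2.0, 3.0, 4.0, 0.0]
print(average([cell_a, cell_b]))      # [2.0, 2.0, 1.0]

# Averaging gives the same result whichever column a vector came from,
# whereas concatenation keeps each column in its own slice of the output.
print(average([cell_b, cell_a]) == average([cell_a, cell_b]))          # True
print(concatenate([cell_b, cell_a]) == concatenate([cell_a, cell_b]))  # False
```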

In [None]:
import sys

# List to store combined embeddings
combined_embeddings = []

# Specify the columns to embed (skip the "Unnamed: N" columns that pandas creates for headerless columns)
columns_to_embed = [key for key in df_cleaned.columns if not str(key).startswith("Unnamed:")]

# Total number of rows for which to obtain embeddings
num_rows = df_cleaned.index.size

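# Iterate through each row. enumerate() provides a running count, because the
# index labels may have gaps after dropna() and so cannot be used for progress.
for rows_done, (_, row) in enumerate(df_cleaned.iterrows(), start=1):
    embeddings = []
    for col in columns_to_embed:
        text = row[col]
        # Note: empty cells (NaN) are embedded as the string 'nan'
        embedding = get_embeddings(str(text))
        embeddings.append(embedding)

    # Combine the cell embeddings for this row (currently by concatenation)
    combined_embedding = combine_embeddings(embeddings)
    combined_embeddings.append(combined_embedding)

    # Print progress
    sys.stdout.write(f'Progress (rows done): {rows_done}/{num_rows}\n')
    sys.stdout.flush()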

# Add combined embeddings to the DataFrame
# df['combined_embeddings'] = combined_embeddings

print(f"The result is a {len(combined_embeddings)} by {len(combined_embeddings[0])} matrix")

# Time taken to generate embeddings for chan-RenalGenie_Clinical_Note_csv.csv: 12 mins 51 secs
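
If cell-by-cell embeddings are still wanted, one way to shorten the wait might be to issue the model calls concurrently rather than one at a time. The sketch below is an assumption-laden illustration, not something this notebook does: `embed_cells_concurrently` and `stub_embed` are hypothetical names, and in practice `get_embeddings` would be passed as `embed_fn`, assuming the shared boto3 client can be used from several threads and the account's Bedrock request quota tolerates the extra load.

```python
from concurrent.futures import ThreadPoolExecutor

def embed_cells_concurrently(texts, embed_fn, max_workers=8):
    """Hypothetical helper: embed many texts in parallel threads.

    pool.map returns results in the same order as the inputs, so the
    vectors still line up with their columns.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(embed_fn, texts))

def stub_embed(text):
    """Stand-in for get_embeddings so the sketch runs without AWS access."""
    return [float(len(text))]

texts = ["a", "bb", "ccc"]
vectors = embed_cells_concurrently(texts, stub_embed)
print(vectors)  # [[1.0], [2.0], [3.0]]
```

Threads suit this workload because each call mostly waits on the network. Throttling errors from the service would still need to be handled, for example by lowering `max_workers`.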

In [None]:
# Convert the large matrix of embeddings into a DataFrame
embeddings_df = pd.DataFrame(combined_embeddings)

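# Save embeddings DataFrame to a CSV file (create the output directory if it does not exist)
output_dir = os.path.join('embeddings', 'csv_xlsx')
os.makedirs(output_dir, exist_ok=True)
embedding_filename = os.path.join(output_dir, 'embedding_' + os.path.splitext(file_name)[0])
embeddings_df.to_csv(embedding_filename+'.csv', index=False)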