# Translation Memory Database Initialization

This notebook initializes the translation memory database for the machine translation pipeline. The database stores previously translated text pairs together with their embeddings, which enables similarity search and improves translation consistency.

## What this notebook does:

- Loads sample translation data (French to German)
- Generates embeddings for source and target text using Amazon Bedrock
- Populates the Aurora PostgreSQL translation memory table
- Tests vector similarity search functionality

The translation memory enables the pipeline to find similar previously translated content and suggest consistent translations for recurring text patterns.

## Prerequisites

1. All CDK stacks must have been successfully deployed
2. The Translation Memory Aurora PostgreSQL cluster must be running
3. The translation_memory table must have been created with the pgvector (`vector`) extension enabled
4. Amazon Bedrock access must be configured for embedding generation

## Setup

Install the required Python libraries for database initialization and embedding generation.

In [None]:
# Install the required prerequisite libraries (takes about 3 minutes)
%pip install -r requirements.txt
%pip install -r bedrock_requirements.txt

## Generate Embeddings for Sample Data

This section demonstrates how to generate vector embeddings for translation pairs using Amazon Bedrock's Titan embedding model. These embeddings enable semantic similarity search in the translation memory.

### Load Sample Translation Data

Load the sample French-German translation pairs used to populate the translation memory. The sample data is taken from the open-source [WMT19](https://huggingface.co/datasets/wmt/wmt19) dataset available on Hugging Face.

In [None]:
import boto3
import json

bedrock = boto3.client(service_name="bedrock")
bedrock_runtime = boto3.client(service_name="bedrock-runtime")

In [None]:
import pandas as pd

# Load the sample translation pairs from the CSV file
df = pd.read_csv('../../../sample_data/wmt19_fr-de.csv')
print("Total number of records : {}".format(len(df)))

display(df.head(2))

### Generate Text Embeddings

Use Amazon Bedrock's Titan Text Embeddings v2 model to convert text into high-dimensional vectors that capture semantic meaning. These embeddings enable similarity search for finding relevant translation memories.

In [None]:
def generate_embeddings(query):
    """Return the Titan Text Embeddings v2 vector for the given text."""
    payload = json.dumps({'inputText': query})

    response = bedrock_runtime.invoke_model(
        body=payload,
        modelId='amazon.titan-embed-text-v2:0',
        accept="application/json",
        contentType="application/json")
    # The response body is a stream; read and decode it before extracting the vector
    response_body = json.loads(response.get("body").read())
    return response_body.get("embedding")

# Embed one sample sentence to confirm that the model is reachable
source_embeddings = generate_embeddings(df.iloc[1].get('source'))

print("Number of dimensions : {}".format(len(source_embeddings)))

In [None]:
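# Generate embeddings for translation pairs (takes about 3 minutes)
# Only the first 20 records are processed for demonstration. If any call fails, rerun the cell.

from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, nb_workers=8)

# Copy the slice so the new columns are added to an independent DataFrame
# rather than to a view of df (avoids pandas' SettingWithCopyWarning).
# Note: .apply runs serially; pandarallel only parallelizes .parallel_apply calls.
df_20 = df.head(20).copy()
df_20['target_embeddings'] = df_20['target'].apply(generate_embeddings)
df_20['source_embeddings'] = df_20['source'].apply(generate_embeddings)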
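
The cell above asks you to rerun it if any call fails. As an alternative, transient failures such as throttling can be absorbed by retrying each call. The following is a minimal sketch, not part of the original pipeline: `with_retries` is a hypothetical helper, and it catches every `Exception` for brevity, where real code would likely narrow this to the specific client errors worth retrying.

```python
import random
import time
from functools import wraps


def with_retries(max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry the wrapped function with exponential backoff and jitter (hypothetical helper)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    # Give up after the final attempt and surface the error.
                    if attempt == max_attempts:
                        raise
                    # The delay doubles each attempt: base, 2*base, 4*base, ...
                    delay = base_delay * 2 ** (attempt - 1)
                    # Random jitter keeps parallel callers from retrying in lockstep.
                    sleep(delay + random.uniform(0, delay / 2))
        return wrapper
    return decorator
```

Under these assumptions, the embedding function could be wrapped with `generate_embeddings_with_retry = with_retries()(generate_embeddings)` and passed to `apply` in place of the original.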

### Load Data into Translation Memory Table

Insert the translation pairs and their embeddings into the Aurora PostgreSQL database using the RDS Data API. The embeddings are stored as vector data types for efficient similarity search.
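
Because the RDS Data API has no native vector parameter type, the embeddings travel as strings and are cast to `VECTOR` inside the SQL statement. The sketch below illustrates that conversion in isolation. The helper names `to_vector_literal` and `build_parameters` are hypothetical, and the sample texts and three-element vector are illustrative values only.

```python
def to_vector_literal(embedding):
    """Format a sequence of numbers as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_parameters(**values):
    """Build RDS Data API string parameters from keyword arguments."""
    return [
        {"name": name, "value": {"stringValue": value}}
        for name, value in values.items()
    ]


# Illustrative values; real embeddings have far more dimensions.
params = build_parameters(
    source_text="Bonjour",
    target_text="Hallo",
    source_text_embedding=to_vector_literal([0.25, -0.5, 1.0]),
)
print(params[2]["value"]["stringValue"])  # prints [0.25,-0.5,1.0]
```

The notebook itself gets the same bracketed form by calling `str()` on the Python list of floats.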

**Note:** Replace the placeholder values in the next cell with the actual CloudFormation output values from your DatabaseStack deployment.

In [None]:
import boto3 
import json 

# Replace these placeholders with actual values from your CDK deployment outputs:
# - DatabaseSecretArn: The ARN of the Aurora credentials secret
# - DatabaseClusterArn: The ARN of the Aurora PostgreSQL cluster
# - DatabaseName: The name of the translation memory database

secret_arn = "<REPLACE_WITH_DatabaseSecretArn_OUTPUT>"
cluster_arn = "<REPLACE_WITH_DatabaseClusterArn_OUTPUT>"
database_name = "<REPLACE_WITH_DatabaseName_OUTPUT>"

# Create the RDS Data API client once and reuse it for every insert
rds_data = boto3.client('rds-data')

def insert_data(source_lang, target_lang, source_text, target_text, source_text_embedding, target_text_embedding):
    """Insert one translation pair and its embeddings into translation_memory."""
    sql = """
          INSERT INTO translation_memory(source_text, target_text, source_lang, target_lang, source_text_embedding, target_text_embedding)
          VALUES( :source_text, :target_text, :source_lang, :target_lang, CAST(:source_text_embedding AS VECTOR), CAST(:target_text_embedding AS VECTOR))
          """

    # All values are passed as strings; the embeddings are cast to VECTOR in the SQL
    values = {
        'source_text': source_text,
        'target_text': target_text,
        'source_lang': source_lang,
        'target_lang': target_lang,
        'source_text_embedding': source_text_embedding,
        'target_text_embedding': target_text_embedding,
    }
    param_set = [{'name': name, 'value': {'stringValue': value}} for name, value in values.items()]

    rds_data.execute_statement(
        resourceArn=cluster_arn,
        secretArn=secret_arn,
        database=database_name,
        sql=sql,
        parameters=param_set)

# Insert each French-to-German pair along with its stringified embeddings
for _, record in df_20.iterrows():
    insert_data("fr", "de", record['source'], record['target'], str(record['source_embeddings']), str(record['target_embeddings']))

## Test Translation Memory Vector Search

Verify that the translation memory is working correctly by performing a similarity search. This demonstrates how the system finds the most similar source texts and their corresponding translations based on vector embeddings.
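
The query in the next cell orders rows with pgvector's `<=>` operator, which computes cosine distance, and keeps the three closest. To show what that ordering means, here is a small in-memory sketch of the same ranking in NumPy. The function names and the two-dimensional toy vectors are assumptions for illustration; the database performs this work itself.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_k(query, rows, k=3):
    """Return the k rows whose embedding is closest to the query."""
    return sorted(rows, key=lambda row: cosine_distance(query, row["embedding"]))[:k]


# Toy 2-dimensional "embeddings"; real Titan vectors have many more dimensions.
rows = [
    {"source_text": "a", "embedding": [1.0, 0.0]},
    {"source_text": "b", "embedding": [0.0, 1.0]},
    {"source_text": "c", "embedding": [1.0, 1.0]},
]
print([row["source_text"] for row in top_k([1.0, 0.1], rows, k=2)])  # prints ['a', 'c']
```

A smaller distance means the vectors point in more similar directions, so the first result is the closest match.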

In [None]:
import numpy
from IPython.display import display, Markdown, Latex, HTML

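def similarity_search(search_text):
    """Print the three stored pairs whose source text is closest to search_text."""
    embedding = numpy.array(generate_embeddings(search_text))
    rds_data = boto3.client('rds-data')
    embedding_str = str(embedding.tolist())
    # <=> is pgvector's cosine distance operator; smaller values mean closer matches.
    # The embedding is bound as a parameter instead of being interpolated into the SQL.
    sql_text = "SELECT unique_id, source_text, target_text FROM translation_memory ORDER BY source_text_embedding <=> CAST(:embedding AS VECTOR) LIMIT 3;"

    response = rds_data.execute_statement(
        resourceArn = cluster_arn,
        secretArn = secret_arn,
        database = database_name,
        sql = sql_text,
        parameters = [{'name': 'embedding', 'value': {'stringValue': embedding_str}}]
    )

    print(response)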

similarity_search("Reprise de la session")
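
The raw response printed above is deeply nested: the RDS Data API returns each row under `records` as a list of single-entry dictionaries such as `{"stringValue": "..."}`. The sketch below flattens that shape into plain dictionaries. `rows_from_response` is a hypothetical helper, and the sample response is hand-written in the API's format rather than real query output; in particular, the integer `unique_id` is an assumption about the table schema.

```python
def rows_from_response(response, columns):
    """Convert RDS Data API 'records' into a list of dicts keyed by column name."""
    rows = []
    for record in response.get("records", []):
        values = []
        for field in record:
            if field.get("isNull"):
                # SQL NULL is reported as {"isNull": True}.
                values.append(None)
            else:
                # Each field is a one-entry dict such as {"stringValue": "..."}.
                values.append(next(iter(field.values())))
        rows.append(dict(zip(columns, values)))
    return rows


# Hand-written sample in the Data API's shape, not real query output.
sample_response = {
    "records": [
        [{"longValue": 1}, {"stringValue": "Bonjour"}, {"stringValue": "Hallo"}],
    ]
}
for row in rows_from_response(sample_response, ["unique_id", "source_text", "target_text"]):
    print(row["source_text"], "->", row["target_text"])
```

To use it in the notebook, `similarity_search` would need to return `response` instead of only printing it.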