# Text Embedding POC

### 1. Set Up

#### Permissions and environment variables

---
To host a model on Amazon SageMaker, we first need to set up and authenticate access to AWS services. Here, we use the execution role associated with the current notebook as the AWS account role with SageMaker access.

---

In [None]:
import sagemaker, boto3, json
from sagemaker.session import Session

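# Create one SageMaker session and reuse it.
sagemaker_session = Session()
sess = sagemaker_session  # alias kept for code that expects `sess`

# ARN of the caller identity; in a SageMaker notebook this is the execution role.
aws_role = sagemaker_session.get_caller_identity_arn()
aws_region = boto3.Session().region_name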

print('step completed.')

### 2. Select a model

***
Here, we read the JumpStart model manifest from the JumpStart S3 bucket, filter for the Text Embedding models, and select one for inference.
***
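
JumpStart model IDs, such as the one selected in the next cell, appear to follow a `framework-task-model` naming pattern. As a rough, offline illustration of what a task filter does, the sketch below filters a small list of IDs by their task token. Apart from `mxnet-tcembedding-robertafin-base-uncased`, the IDs are hypothetical placeholders, and `filter_by_task` is an illustrative helper, not part of the SageMaker SDK.

```python
def filter_by_task(model_ids, task):
    """Return the IDs whose second dash-separated token equals `task`."""
    selected = []
    for model_id in model_ids:
        tokens = model_id.split("-")
        if len(tokens) > 1 and tokens[1] == task:
            selected.append(model_id)
    return selected


candidate_ids = [
    "mxnet-tcembedding-robertafin-base-uncased",  # used in this notebook
    "exampleframework-ic-placeholder-model",      # hypothetical placeholder
    "exampleframework-tcembedding-placeholder",   # hypothetical placeholder
]

embedding_ids = filter_by_task(candidate_ids, "tcembedding")
print(embedding_ids)
```

In the notebook itself, the filtering is done by `list_jumpstart_models`, so this helper is only meant to show the idea.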

In [None]:
from ipywidgets import Dropdown
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models

# Retrieve all Text Embedding models available in SageMaker built-in algorithms.
filter_value = "task == tcembedding"
text_embedding_models = list_jumpstart_models(filter=filter_value)

# Choose a model for inference.
model_id, model_version = 'mxnet-tcembedding-robertafin-base-uncased', "*"
print(text_embedding_models)

print('step completed.')

### 3. Retrieve JumpStart Artifacts & Deploy an Endpoint

***

Hosting the pre-trained model requires its `deploy_image_uri`, `deploy_source_uri`, and `model_uri`. Here, we create a `JumpStartModel`, a subclass of [`sagemaker.model.Model`](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html) that retrieves these artifacts for the selected model ID, and deploy it to an endpoint.
***

***This step will take several minutes.***

In [None]:
from sagemaker.jumpstart.model import JumpStartModel

# Pass the version selected above so it is not silently ignored.
my_model = JumpStartModel(model_id=model_id, model_version=model_version)
model_predictor = my_model.deploy()

print(model_predictor.endpoint_name)
print('step completed.')

### 4. Query endpoint and parse response

---
The input to the endpoint is any text string, encoded as UTF-8 and sent with content type `application/x-text`. The output is a JSON object that contains the text embedding under the `embedding` key.

---
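
To make the request and response formats concrete without calling the endpoint, the sketch below encodes a sentence as UTF-8 bytes and parses a made-up JSON response. The `sample_response` payload is a hypothetical stand-in for what the endpoint returns; only the `embedding` key is taken from this notebook.

```python
import json

text = "A sample sentence to embed."
payload = text.encode("utf-8")  # bytes to send as the request body

# Hypothetical response body; a real embedding has many more dimensions.
sample_response = b'{"embedding": [0.12, -0.34, 0.56]}'

embedding = json.loads(sample_response)["embedding"]
print(type(payload).__name__, len(embedding))
```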

In [None]:
def query(model_predictor, text):
    """Query the model predictor."""

    encoded_text = text.encode("utf-8")

    query_response = model_predictor.predict(
        encoded_text,
        {
            "ContentType": "application/x-text",
            "Accept": "application/json",
        },
    )
    return query_response


def parse_response(query_response):
    """Parse the endpoint response and return the embedding."""

    model_predictions = json.loads(query_response)
    embedding = model_predictions["embedding"]
    return embedding

print('step completed.')

### 5. Semantic Textual Similarity

One use case of sentence embeddings is clustering sentences with similar semantic meaning. In the example below, we compute the embeddings of sentences in three categories: pets, cities in the U.S., and colors. Sentences from the same category should have much closer embedding vectors than sentences from different categories.

Specifically, the code will do the following:
* The endpoint that you created above outputs an embedding vector for each sentence;
* The distance between any pair of sentences is measured by the cosine similarity of the corresponding embedding vectors;
* A heatmap visualizes the distance between any pair of sentences in the embedding space. The darker the color, the larger the cosine similarity (and the smaller the distance).

Note: The cosine similarity of two vectors is the inner product of the normalized vectors (each scaled down to have length 1).
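
The definition of cosine similarity can be checked on toy vectors. The sketch below is dependency-free and uses made-up 2-D vectors rather than real embeddings; `normalize_vector` and `cosine_similarity` are illustrative helpers, not functions from this notebook.

```python
import math


def normalize_vector(v):
    """Scale v so that it has length 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def cosine_similarity(a, b):
    """Inner product of the two normalized vectors."""
    a_unit, b_unit = normalize_vector(a), normalize_vector(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


# Made-up vectors: the first two point in similar directions, the third does not.
vectors = [[1.0, 0.1], [2.0, 0.3], [0.1, 1.0]]

# Pairwise similarity matrix, the same kind of quantity a heatmap would display.
similarity = [[cosine_similarity(a, b) for b in vectors] for a in vectors]
for row in similarity:
    print([round(value, 3) for value in row])
```

The diagonal entries are 1 because every vector is perfectly similar to itself, and the first two vectors score higher with each other than either does with the third.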

In [None]:
from sklearn.preprocessing import normalize
import numpy as np
import seaborn as sns

def plot_similarity_heatmap(text_labels, embeddings, rotation):
    """Plot a heatmap of pairwise similarity between sentence embeddings.

    Args:
      text_labels: a list of sentences whose semantic textual similarity is compared.
      embeddings: a list of normalized embedding vectors, one per sentence.
      rotation: rotation (in degrees) of the x-axis tick labels.
    """
    inner_product = np.inner(embeddings, embeddings)
    sns.set(font_scale=1.1)
    graph = sns.heatmap(
        inner_product,
        xticklabels=text_labels,
        yticklabels=text_labels,
        vmin=np.min(inner_product),
        vmax=1,
        cmap="OrRd",
    )
    graph.set_xticklabels(text_labels, rotation=rotation)
    graph.set_title("Semantic Textual Similarity Between Sentences")

print('step completed.')

### 6. Compute Embeddings and Plot Similarity

In [None]:
sentences = [
    # TODO: add the sentences to embed (pets, cities in the U.S., and colors).
]

embeddings = []

for sentence in sentences:
    query_response = model_predictor.predict(sentence)
    embedding = query_response["embedding"]
    embeddings.append(embedding)
    print(f"First element of embedding of sentence '{sentence}' is >> {embedding[0]}")
    
embeddings = normalize(np.array(embeddings), axis=1)  # normalization before inner product
plot_similarity_heatmap(sentences, embeddings, 90)

print('Step Completed')

### 7. Clean up the endpoint

In [None]:
model_predictor.delete_model()
model_predictor.delete_endpoint()