# Similarity Search and Aurora Machine Learning using pgvector and Amazon Aurora PostgreSQL

## Learning objectives

1. Use HuggingFace's sentence transformer model [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) and PostgreSQL extension [pgvector](https://github.com/pgvector/pgvector) to perform similarity search on a fictitious hotel reviews dataset. 
2. Perform Sentiment Analysis using [Amazon Aurora Machine Learning](https://aws.amazon.com/rds/aurora/machine-learning/).


## Contents


1. [Background](#Background)
1. [Setup](#Setup)
1. [pgvector](#Open-source-extension-pgvector-for-PostgreSQL)
1. [Load test data](#Load-test-data)
1. [Split text into chunks](#Split-text-into-chunks)
1. [Create collection](#Create-collection)
1. [Similarity search with score](#Similarity-search-with-score)
1. [Calculate cosine similarity](#Calculate-cosine-similarity)


## Background

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. No prior machine learning experience is required. This example walks you through integrating Aurora with the Comprehend Sentiment Analysis API and running sentiment analysis inferences through SQL commands. We use a sample dataset of fictitious hotel reviews. We generate vector embeddings with a pretrained SentenceTransformer model, `all-mpnet-base-v2` from [HuggingFace](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), and store them in our Aurora PostgreSQL DB cluster with pgvector. The sentiment analysis part of this demo is done via psql, a popular PostgreSQL client, in a hosted AWS Cloud9 terminal environment.

## Setup
Install the required Python libraries for the setup.

In [1]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com


## Open-source extension pgvector for PostgreSQL

[pgvector](https://github.com/pgvector/pgvector) is an open-source extension for PostgreSQL that allows you to store and search vector embeddings for exact and approximate nearest neighbors. It is designed to work seamlessly with other PostgreSQL features, including indexing and querying.
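
To make "exact nearest neighbors" concrete, here is a minimal pure-Python sketch of a brute-force search that compares the query against every stored vector. The function names and the toy 3-dimensional vectors are assumptions made for illustration; pgvector itself performs this work inside PostgreSQL, and an approximate index (such as IVFFlat or HNSW) skips part of the scan in exchange for speed.

```python
import math


def euclidean_distance(a, b):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_nearest_neighbors(query, vectors, k=2):
    """Brute-force (exact) search: score every stored vector, keep the k closest."""
    scored = [(euclidean_distance(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort()  # smallest distance first
    return [idx for _, idx in scored[:k]]


# Toy 3-dimensional "embeddings" (hypothetical values, far smaller than real ones).
stored = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
print(exact_nearest_neighbors([1.0, 0.0, 0.0], stored, k=2))
```

The result is a list of row positions ordered from closest to farthest, which is the same ordering an exact search in the database would produce for this distance.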

PGVector integration with LangChain needs the connection string to the database. In this step, we create the embeddings object and set up the connection string. Note that the connection parameters and the HuggingFace API token come from our `.env` file.


In [2]:
from dotenv import load_dotenv
from langchain.document_loaders import CSVLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores.pgvector import PGVector, DistanceStrategy
from langchain.docstore.document import Document
import os

load_dotenv()

embeddings = HuggingFaceInstructEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

connection_string = PGVector.connection_string_from_db_params(
    driver=os.environ.get("PGVECTOR_DRIVER"),
    user=os.environ.get("PGVECTOR_USER"),
    password=os.environ.get("PGVECTOR_PASSWORD"),
    host=os.environ.get("PGVECTOR_HOST"),
    port=os.environ.get("PGVECTOR_PORT"),
    database=os.environ.get("PGVECTOR_DATABASE")
)



load INSTRUCTOR_Transformer
max_seq_length  512
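
For readers who want to see what the connection string looks like, the sketch below assembles a SQLAlchemy-style URL from the same `PGVECTOR_*` settings. The helper name `build_connection_string` and all example values are hypothetical, and the percent-encoding of the password is an addition of this sketch rather than something the notebook does.

```python
from urllib.parse import quote_plus


def build_connection_string(settings):
    """Assemble a SQLAlchemy-style PostgreSQL URL from a mapping of settings."""
    return "postgresql+{driver}://{user}:{password}@{host}:{port}/{database}".format(
        driver=settings["PGVECTOR_DRIVER"],
        user=settings["PGVECTOR_USER"],
        # Percent-encode the password so characters such as "@" do not break the URL.
        password=quote_plus(settings["PGVECTOR_PASSWORD"]),
        host=settings["PGVECTOR_HOST"],
        port=settings["PGVECTOR_PORT"],
        database=settings["PGVECTOR_DATABASE"],
    )


# Placeholder values for illustration only; real values come from the .env file.
example_settings = {
    "PGVECTOR_DRIVER": "psycopg2",
    "PGVECTOR_USER": "postgres",
    "PGVECTOR_PASSWORD": "p@ssword",
    "PGVECTOR_HOST": "localhost",
    "PGVECTOR_PORT": "5432",
    "PGVECTOR_DATABASE": "postgres",
}
print(build_connection_string(example_settings))
```

Because the helper accepts any mapping, `os.environ` can be passed in directly once `load_dotenv()` has populated it.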


## Load test data

Load our sample fictitious hotel dataset (CSV) with LangChain's [CSVLoader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/csv).

In [3]:
loader = CSVLoader('./data/fictitious_hotel_reviews_trimmed_500.csv')
documents = loader.load()
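
The following sketch illustrates the idea behind a CSV loader using only the standard library: each row becomes one document whose content is a `column: value` listing, with the row number kept as metadata. It mirrors the shape of the output shown later in this notebook, but `rows_to_documents`, the plain dictionaries, and the `sample.csv` source name are assumptions for illustration, not LangChain's implementation.

```python
import csv
import io


def rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict with 'page_content' and 'metadata' keys."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        # One "column: value" line per column in the row.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents


# Two made-up rows under a single "comments" column.
sample_csv = "comments\ngreat hotel great location\nhorrible customer service\n"
sample_documents = rows_to_documents(sample_csv, source="sample.csv")
for sample_document in sample_documents:
    print(sample_document)
```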

## Split text into chunks

Split the text into chunks using LangChain's [CharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter) class.

In [4]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
print(len(documents))
print(len(docs))

# Print the content and metadata of each document
for document in documents:
    print(document.page_content)
    print(document.metadata)

10
10
comments: great hotel night quick business trip, loved little touches like goldfish leopard print robe, complaint wifi complimentary not internet access business center, great location library service fabulous,
{'source': 'great hotel night quick business trip, loved little touches like goldfish leopard print robe, complaint wifi complimentary not internet access business center, great location library service fabulous,  ', 'row': 0}
comments: horrible customer service hotel stay february 3rd 4th 2007my friend picked hotel monaco appealing website online package included champagne late checkout 3 free valet gift spa weekend, friend checked room hours earlier came later, pulled valet young man just stood, asked valet open said, pull bags didn't offer help, got garment bag suitcase came car key room number says not valet, car park car street pull, left key working asked valet park car gets, went room fine bottle champagne oil lotion gift spa, dressed went came got bed noticed b
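
The chunking step can be pictured with a simplified sketch: split the text on a separator, then greedily merge neighbouring pieces while the merged chunk stays within `chunk_size`. This is an assumption-laden approximation of character-based splitting, not LangChain's code; `split_text` and its defaults are hypothetical. It also suggests why the document count and chunk count above can be identical: a text that contains no separator is never divided.

```python
def split_text(text, separator="\n\n", chunk_size=1000):
    """Split on `separator`, then merge pieces into chunks of at most `chunk_size` characters."""
    pieces = [piece for piece in text.split(separator) if piece]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            # Still room in the current chunk (or the chunk is empty): keep merging.
            current = candidate
        else:
            # The next piece would overflow: close the current chunk, start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(split_text("aaa\n\nbbb\n\nccc", chunk_size=8))
```

Note that a single piece longer than `chunk_size` is kept whole in this sketch, since there is no separator inside it to split on.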

## Create collection

The PGVector module will try to create a table with the name of the collection, so make sure that the collection name is unique and that the database user has permission to create tables.

In [5]:
from typing import List, Tuple

collection_name = "fictitious_hotel_reviews"

db = PGVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=collection_name,
    connection_string=connection_string
)

## Similarity search with score

Run a similarity search using the [similarity_search_with_score](https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/pgvector) method of LangChain's PGVector vector store. It returns each matching document together with its score.

In [6]:
query = "What do some of the positive reviews say?"
docs_with_score: List[Tuple[Document, float]] = db.similarity_search_with_score(query)

In [7]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print(doc.metadata)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.4267523012356972
comments: nice hotel expensive parking got good deal stay hotel anniversary, arrived late evening took advice previous reviews did valet parking, check quick easy, little disappointed non-existent view room room clean nice size, bed comfortable woke stiff neck high pillows, not soundproof like heard music room night morning loud bangs doors opening closing hear people talking hallway, maybe just noisy neighbors, aveda bath products nice, did not goldfish stay nice touch taken advantage staying longer, location great walking distance shopping, overall nice experience having pay 40 parking night,
{'source': 'nice hotel expensive parking got good deal stay hotel anniversary, arrived late evening took advice previous reviews did valet parking, check quick easy, little disappointed non-existent view room room clean nice size, bed comfortable woke stiff neck high pillows, not soundproo
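
The search above can be pictured end to end with a toy stand-in: embed the query, measure its distance to every stored text, and return `(text, score)` pairs with the closest first. Everything here is hypothetical, including `toy_embed` (a word-count vector, not a neural embedding), the tiny vocabulary, and the sample reviews. The sketch assumes the score is a distance, so smaller values mean closer matches, which is how pgvector's distance operators behave.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Stand-in embedding: count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def toy_search_with_score(query, texts, vocabulary, k=4):
    """Return up to k (text, distance) pairs, closest first."""
    query_vector = toy_embed(query, vocabulary)
    scored = []
    for text in texts:
        vector = toy_embed(text, vocabulary)
        distance = math.sqrt(sum((q - v) ** 2 for q, v in zip(query_vector, vector)))
        scored.append((text, distance))
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


vocabulary = ["great", "location", "staff", "horrible", "service", "hotel"]
reviews = [
    "great location friendly staff",
    "horrible service rude staff",
    "great hotel great location",
]
results = toy_search_with_score("great location", reviews, vocabulary, k=2)
for text, score in results:
    print(score, text)
```

A real embedding model places texts with similar meaning close together even when they share no words, which this word-count stand-in cannot do.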

## Calculate cosine similarity

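Create a `PGVector` store that uses cosine distance (`DistanceStrategy.COSINE`) and wrap it in a retriever with `k` set to 1, so that only the single closest match is returned.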

In [8]:
store = PGVector(
    connection_string=connection_string, 
    embedding_function=embeddings, 
    collection_name="fictitious_hotel_reviews",
    distance_strategy=DistanceStrategy.COSINE
)

retriever = store.as_retriever(search_kwargs={"k": 1})

In [9]:
retriever.get_relevant_documents(query='What do some of the positive reviews say?')

[Document(page_content='comments: nice hotel expensive parking got good deal stay hotel anniversary, arrived late evening took advice previous reviews did valet parking, check quick easy, little disappointed non-existent view room room clean nice size, bed comfortable woke stiff neck high pillows, not soundproof like heard music room night morning loud bangs doors opening closing hear people talking hallway, maybe just noisy neighbors, aveda bath products nice, did not goldfish stay nice touch taken advantage staying longer, location great walking distance shopping, overall nice experience having pay 40 parking night,', metadata={'source': 'nice hotel expensive parking got good deal stay hotel anniversary, arrived late evening took advice previous reviews did valet parking, check quick easy, little disappointed non-existent view room room clean nice size, bed comfortable woke stiff neck high pillows, not soundproof like heard music room night morning loud bangs doors opening closing he
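
As a reference for what cosine distance measures, the sketch below computes cosine similarity as the dot product divided by the product of the vector lengths, and cosine distance as one minus that value. The function names and sample vectors are illustrative assumptions; in the notebook the computation happens inside the database. `best_match` mimics a retriever with `k` set to 1 by returning the position of the vector with the smallest distance.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths (non-zero vectors assumed)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """0 for vectors pointing the same way, 1 for orthogonal, 2 for opposite directions."""
    return 1.0 - cosine_similarity(a, b)


def best_match(query, vectors):
    """Index of the stored vector with the smallest cosine distance to the query."""
    return min(range(len(vectors)), key=lambda i: cosine_distance(query, vectors[i]))


candidates = [[0.0, 5.0], [2.0, 2.1], [-1.0, -1.0]]
print(best_match([1.0, 1.0], candidates))
```

Cosine distance ignores vector length and compares direction only, so `[1.0, 0.0]` and `[2.0, 0.0]` are treated as identical.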