# Similarity Search and Aurora Machine Learning using pgvector and Amazon Aurora PostgreSQL

## Learning objectives

1. Use HuggingFace's sentence transformer model [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) and PostgreSQL extension [pgvector](https://github.com/pgvector/pgvector) to perform similarity search on a fictitious hotel reviews dataset. 
2. Perform Sentiment Analysis using [Amazon Aurora Machine Learning](https://aws.amazon.com/rds/aurora/machine-learning/).


## Contents


1. [Background](#Background)
1. [Install dependencies](#Install-dependencies)
1. [pgvector](#Open-source-extension-pgvector-for-PostgreSQL)
1. [Load test data](#Load-test-data)
1. [Split text into chunks](#Split-text-into-chunks)
1. [Create collection](#Create-collection)
1. [Similarity search with score](#Similarity-search-with-score)
1. [Calculate cosine similarity](#Calculate-cosine-similarity)


## Background

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. No prior machine learning experience is required. This example walks through integrating Aurora with the Comprehend Sentiment Analysis API and running sentiment analysis inferences via SQL commands. The example uses a sample dataset of fictitious hotel reviews. We generate vector embeddings with the pretrained SentenceTransformer model `all-mpnet-base-v2` from [HuggingFace Transformers](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) and store them in our Aurora PostgreSQL DB cluster with pgvector. The sentiment analysis part of this demo is done via psql, a popular PostgreSQL client, in a hosted AWS Cloud9 terminal environment.

## Install dependencies
Install the Python libraries required for the setup.

In [None]:
!pip install -r requirements1.txt
!pip install -r requirements2.txt

## Open-source extension pgvector for PostgreSQL

[pgvector](https://github.com/pgvector/pgvector) is an open-source extension for PostgreSQL that allows you to store and search vector embeddings for exact and approximate nearest neighbors. It is designed to work seamlessly with other PostgreSQL features, including indexing and querying.

The LangChain PGVector integration needs a connection string for the database. In this step, we set up the embedding model and build the connection string. Note that the connection settings and the HuggingFace API token are read from our `.env` file.


In [None]:
from dotenv import load_dotenv
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_postgres import PGVector
from langchain.docstore.document import Document
import os

load_dotenv()
os.environ["TOKENIZERS_PARALLELISM"] = 'false'

embeddings = HuggingFaceInstructEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Create the connection string for pgvector. Ref: https://github.com/langchain-ai/langchain-postgres/blob/main/examples/vectorstore.ipynb
db_user = os.getenv('PGVECTOR_USER')
db_password = os.getenv('PGVECTOR_PASSWORD')
db_host = os.getenv('PGVECTOR_HOST')
db_port = os.getenv('PGVECTOR_PORT')
db_name = os.getenv('PGVECTOR_DATABASE')
connection = f"postgresql+psycopg://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}"

vectorstore = PGVector(
    embeddings=embeddings,
    connection=connection,
    use_jsonb=True                   
)
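A password that contains reserved URL characters such as `@`, `/`, or `:` can break a connection string assembled with a plain f-string. The following is a minimal sketch of one way to guard against that by percent-encoding the credentials first. The helper name `build_connection_string` and the sample values are hypothetical and are not part of `langchain_postgres`.

```python
from urllib.parse import quote


def build_connection_string(user, password, host, port, database):
    """Build a SQLAlchemy-style URL for the psycopg driver.

    Hypothetical helper for illustration; not part of langchain_postgres.
    """
    # Percent-encode the credentials so reserved characters stay inside their fields.
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql+psycopg://{safe_user}:{safe_password}@{host}:{port}/{database}"


# Made-up values for illustration only.
example = build_connection_string("app_user", "p@ss/word", "localhost", 5432, "postgres")
print(example)
```

In the notebook, the same helper could be fed from the `PGVECTOR_*` environment variables instead of literal values.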

## Load test data

Load our sample fictitious hotel dataset (CSV) with LangChain's [CSVLoader](https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/csv).

In [None]:
loader = CSVLoader('./data/fictitious_hotel_reviews_trimmed_500.csv')
documents = loader.load()
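To make the shape of the loaded data concrete, here is a rough standard-library sketch of what a row-per-document CSV loader does: each row becomes one document whose text is a list of `column: value` lines, with the source and row index kept as metadata. The column names and rows below are hypothetical and may not match the real dataset, and the code is an approximation, not CSVLoader itself.

```python
import csv
import io

# Hypothetical two-row sample; the real dataset's column names may differ.
sample_csv = io.StringIO(
    "hotel,review\n"
    "Hotel A,Great location and friendly staff.\n"
    "Hotel B,The room was noisy at night.\n"
)

rows_as_documents = []
for row_index, row in enumerate(csv.DictReader(sample_csv)):
    # One document per row: each column becomes a "name: value" line.
    page_content = "\n".join(f"{name}: {value}" for name, value in row.items())
    metadata = {"source": "sample.csv", "row": row_index}
    rows_as_documents.append({"page_content": page_content, "metadata": metadata})

print(rows_as_documents[0]["page_content"])
print(rows_as_documents[0]["metadata"])
```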

## Split text into chunks

Split the text into chunks using LangChain’s [CharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/character_text_splitter) class.

In [None]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
print(len(docs))

# Print the content and metadata of each loaded document
for document in documents:
    print(document.page_content)
    print(document.metadata)
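The idea behind separator-based chunking can be sketched in a few lines: split the text on a separator, then greedily pack the pieces into chunks that stay within `chunk_size` characters. This is a simplified illustration that ignores chunk overlap, and it is not LangChain's implementation. The function name `split_into_chunks` and the sample text are hypothetical.

```python
def split_into_chunks(text, chunk_size, separator="\n\n"):
    """Greedy sketch: split on a separator, then pack pieces into chunks.

    Simplified illustration only; it ignores chunk overlap.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Keep growing the chunk (an oversized single piece is kept whole).
            current = candidate
        else:
            # The next piece does not fit, so close this chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "Clean rooms.\n\nFriendly staff.\n\nBreakfast was cold."
print(split_into_chunks(sample, chunk_size=30))
```

With a `chunk_size` of 1000, a short review that fits within the limit would be expected to stay in a single chunk.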

## Create collection

The PGVector module will try to create a table with the name of the collection, so make sure that the collection name is unique and that the database user has permission to create tables.

In [None]:
from typing import List, Tuple

collection_name = "fictitious_hotel_reviews"

db = PGVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=collection_name,
    connection=connection,
)

## Similarity search with score

Run a similarity search using the [similarity_search_with_score](https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/pgvector) method of the LangChain PGVector vector store.

In [None]:
query = "What do some of the positive reviews say?"
docs_with_score: List[Tuple[Document, float]] = db.similarity_search_with_score(query)

In [None]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print(doc.metadata)
    print("-" * 80)
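When reading the printed scores, it helps to know whether they are distances or similarities. The sketch below assumes the scores are cosine distances, where a smaller value means a closer match; verify this against the distance strategy configured for your store. The review texts, scores, and cutoff are made up for illustration and stand in for `docs_with_score`.

```python
# Hypothetical (text, score) pairs standing in for docs_with_score.
results = [
    ("The pool area was spotless.", 0.42),
    ("Loved the breakfast buffet.", 0.31),
    ("Check-in took over an hour.", 0.67),
]

max_distance = 0.5  # illustrative cutoff, not a recommended value

# Smaller distance means a closer match, so filter by the cutoff and sort ascending.
close_matches = sorted(
    (pair for pair in results if pair[1] <= max_distance),
    key=lambda pair: pair[1],
)

for review_text, distance in close_matches:
    # Under this assumption, cosine similarity is 1 minus cosine distance.
    print(f"distance={distance:.2f} similarity={1 - distance:.2f} {review_text}")
```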

## Calculate cosine similarity

Set the distance strategy to cosine and use a retriever with `k=1` to return only the closest match.

In [None]:
from langchain_postgres.vectorstores import DistanceStrategy

store = PGVector(
    connection=connection,
    embeddings=embeddings, 
    collection_name="fictitious_hotel_reviews",
    distance_strategy=DistanceStrategy.COSINE
)

retriever = store.as_retriever(search_kwargs={"k": 1})

In [None]:
retriever.invoke(input='What do some of the positive reviews say?')
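For intuition, cosine distance is one minus the cosine of the angle between two vectors, which is what pgvector's `<=>` operator computes. The following sketch reproduces a `k=1` lookup by brute force in plain Python. The toy 3-dimensional vectors and the identifiers `review_a`, `review_b`, and `review_c` are hypothetical; real `all-mpnet-base-v2` embeddings have 768 dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy stored embeddings keyed by a made-up identifier.
stored = {
    "review_a": [1.0, 0.0, 0.0],
    "review_b": [0.6, 0.8, 0.0],
    "review_c": [0.0, 0.0, 1.0],
}
query_vector = [0.8, 0.6, 0.0]

# k=1: keep only the stored vector with the smallest cosine distance to the query.
best_id = min(stored, key=lambda key: cosine_distance(query_vector, stored[key]))
print(best_id)
```

Because cosine distance depends only on direction, scaling a vector does not change its ranking.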