# Semantic Search with LangChain, Amazon Titan Embeddings, and FAISS
Jupyter notebook for loading documents from PDFs, extracting and splitting the text into semantically meaningful chunks using [LangChain](https://python.langchain.com/docs/get_started/introduction), generating embeddings from those chunks with the [Amazon Titan Embeddings G1 - Text model](https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html), and storing the embeddings in a [FAISS](https://python.langchain.com/docs/integrations/vectorstores/faiss/) vector database for retrieval.


![Diagram](build_pdf_vector_db.jpg)


## Requirements
- Install boto3, the [AWS SDK for Python](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingTheBotoAPI.html), which lets you interact with AWS services. Install with `pip install boto3`.
- [Configure AWS credentials](https://docs.aws.amazon.com/braket/latest/developerguide/braket-using-boto3.html). Boto3 needs credentials to make API calls to AWS.
- Install [LangChain](https://python.langchain.com/docs/get_started/introduction), a framework for developing applications powered by large language models (LLMs). Install with `pip install langchain`.
- The imports below also rely on the `langchain-community`, `langchain-experimental`, `pypdf`, and `faiss-cpu` packages.

In [None]:
#!pip install boto3
#!pip install langchain langchain-community langchain-experimental
#!pip install pypdf faiss-cpu

In [None]:
import boto3 # to interact with AWS services.
from langchain_community.document_loaders import PyPDFLoader # to load documents from PDF files.
from langchain.text_splitter import RecursiveCharacterTextSplitter # to split documents into smaller chunks.
from langchain_community.vectorstores import FAISS # to store the documents in a vector database.
from langchain_community.embeddings import BedrockEmbeddings # to create embeddings for the documents.
from langchain_experimental.text_splitter import SemanticChunker # to split documents at semantic breakpoints.
from langchain_core.documents import Document # container for page content and metadata.

In [None]:
bedrock_client = boto3.client("bedrock-runtime")
bedrock_embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1", client=bedrock_client)

### Preparation: PDF file to VectorDB

The `load_and_split_pdf` function loads a PDF file, extracts the text, and splits it into overlapping chunks based on character offsets using [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter/).
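
To make "overlapping" concrete, here is a minimal sketch of fixed-size windows that share `chunk_overlap` characters. The helper `sliding_chunks` is hypothetical and is not part of LangChain; the real splitter applies the overlap to separator-delimited pieces, so its chunk boundaries will generally differ.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters.

    Hypothetical helper for illustration only.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


overlap_demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(overlap_demo)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the end of one chunk is repeated at the start of the next.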

RecursiveCharacterTextSplitter splits text into chunks of at most `chunk_size` characters. It works as follows:

1. First, it tries to split the text using the first separator in an ordered list, which by default is `["\n\n", "\n", " ", ""]` (paragraph break, line break, space, and finally individual characters).

2. Any piece that still exceeds the maximum character limit is split again with the next separator in the list, while adjacent small pieces are merged back together up to the limit.

3. This recursive process continues until all chunks comply with the maximum character limit.

The advantage of using RecursiveCharacterTextSplitter is that it divides the text in a more natural way than cutting at fixed offsets, keeping paragraphs, then lines, then words intact whenever possible. This helps preserve the context and meaning of the original text, which matters for retrieval as well as for NLP tasks such as summarization, text generation, and sentiment analysis.
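
The recursive idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, overlap is ignored, and separators are dropped at chunk boundaries.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive character splitting (illustration only, no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Last resort: no separator left, so cut at fixed character offsets.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample_text = (
    "First paragraph is short.\n\n"
    "Second paragraph is a bit longer than the first one.\n\n"
    "Third."
)
split_demo = recursive_split(sample_text, chunk_size=40)
for chunk in split_demo:
    print(len(chunk), repr(chunk))
```

In this sample, the short paragraphs survive whole and only the long one is broken at word boundaries, which is the behavior the steps above describe.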

In [None]:
file_name = "Amazon_Bedrock_User_Guide.pdf"
path_file = "demo-files"
file_path = f"{path_file}/{file_name}"

In [None]:
def load_and_split_pdf(file_path, chunk_size, chunk_overlap):
    """Load a PDF and split it into overlapping character-based chunks."""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    loader = PyPDFLoader(file_path)
    docs = loader.load_and_split(text_splitter)
    return docs

In [None]:
chunk_size = 1000
chunk_overlap = 100

docs = load_and_split_pdf(file_path, chunk_size, chunk_overlap)

In [None]:
print("documents:", len(docs))
docs[6:9]

The `load_and_split_pdf_semantic` function loads a PDF, splits the text into semantically meaningful chunks using [SemanticChunker](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker/), and returns the split documents.

Unlike [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter/), which divides the text based on a character limit, [SemanticChunker](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker/) uses an embedding model (here, the Titan embeddings client created above) to compare the meaning of neighboring sentences, and then divides the text where the meaning changes.

The process works as follows:

1. The full text and an embedding model are provided to the [SemanticChunker](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker/).

2. The text is split into sentences, and the embedding model produces a vector for each sentence, taken together with its immediate neighbors for context.

3. The distance between consecutive embeddings is computed, and a breakpoint is placed wherever that distance exceeds a threshold, which signals a change of topic.

4. The sentences between breakpoints are joined into "chunks" that represent sections of the text with a coherent meaning and context.

Additionally, you can use [Breakpoints](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker/#breakpoints) for more granular control over how the text is divided into chunks. This notebook passes `breakpoint_threshold_amount=80`, which, with the default percentile threshold type, places a breakpoint wherever the distance is above the 80th percentile of all distances.

The advantage of using SemanticChunker is that it produces text fragments that are easier to process and understand for subsequent language models, since each fragment has a coherent semantic meaning. This is especially useful for tasks such as summarization, information extraction, and answer generation.
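
The breakpoint logic can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the function names are hypothetical, the two-dimensional vectors are hand-made stand-ins for Titan embeddings, and the sentences are already split.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Percentile with linear interpolation, pct in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(sentences, vectors, threshold_pct=80):
    """Start a new chunk wherever the distance to the previous sentence is unusually large."""
    if len(sentences) < 2:
        return list(sentences)
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    cutoff = percentile(distances, threshold_pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > cutoff:  # topic change: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Made-up sentences and vectors: the first two point one way, the last two another.
toy_sentences = [
    "Prompts are inputs to a model.",
    "A prompt can include examples.",
    "Embeddings are numeric vectors.",
    "Similar texts have nearby vectors.",
]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
semantic_demo = semantic_chunks(toy_sentences, toy_vectors, threshold_pct=80)
print(semantic_demo)
```

Only the jump between the second and third sentence exceeds the 80th-percentile cutoff, so the toy text is split into two chunks at that point.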

In [None]:
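def load_and_split_pdf_semantic(file_path, embeddings):
    """Load a PDF and split it into chunks at semantic breakpoints."""
    # With the default percentile threshold type, 80 means the 80th percentile.
    text_splitter = SemanticChunker(embeddings, breakpoint_threshold_amount=80)
    loader = PyPDFLoader(file_path)
    docs = loader.load_and_split(text_splitter)
    print(f"docs: {len(docs)}")
    return docs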

In [None]:
semantic_docs = load_and_split_pdf_semantic(file_path, bedrock_embeddings)

In [None]:
print("documents:", len(semantic_docs))
semantic_docs[:3]

In [None]:
# Remove empty or whitespace-only chunks.
# This uses the character-based `docs`; swap in `semantic_docs` to index the semantic chunks instead.
clean_docs = [doc for doc in docs if doc.page_content.strip()]

In [None]:
# Sanity check: no empty chunk should remain, so this loop should print nothing.
for doc in clean_docs:
    if len(doc.page_content) == 0:
        print(doc)

### [Build Vector database](https://python.langchain.com/docs/integrations/vectorstores/faiss/#ingestion)
Now, using the Amazon Titan embeddings served by Amazon Bedrock, create a vector database of document embeddings with [FAISS](https://python.langchain.com/docs/integrations/vectorstores/faiss/) to allow fast similarity search and retrieval of related documents.
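
Conceptually, a vector store keeps one vector per chunk and answers a query by ranking the stored vectors by their distance to the query vector. The sketch below shows that idea with an exhaustive search in plain Python. The names, vectors, and texts are made up for illustration, Euclidean distance is an assumption, and FAISS itself relies on optimized index structures rather than this loop.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def toy_similarity_search(store, query_vector, k=2):
    """Return the texts of the k stored vectors closest to query_vector."""
    ranked = sorted(store, key=lambda item: l2_distance(item[0], query_vector))
    return [text for _, text in ranked[:k]]


# Each entry pairs a hand-made vector with the chunk it stands for.
toy_store = [
    ([0.9, 0.1, 0.0], "chunk about prompts"),
    ([0.1, 0.9, 0.0], "chunk about embeddings"),
    ([0.0, 0.2, 0.9], "chunk about pricing"),
]
toy_query_vector = [0.8, 0.2, 0.1]  # stands in for the embedded question
search_demo = toy_similarity_search(toy_store, toy_query_vector, k=2)
print(search_demo)
```

In the notebook, the query text is embedded with the same model as the chunks before the comparison, which is why the embeddings object is passed to the vector store.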


In [None]:
db = FAISS.from_documents(clean_docs, bedrock_embeddings)
print(f"Vector Database:{db.index.ntotal} docs")

### [Query](https://python.langchain.com/docs/integrations/vectorstores/faiss/#querying)


In [None]:
query = "What is a prompt?"
results = db.similarity_search(query)  # separate name so the `docs` chunks are not overwritten
print(results[0].page_content)

### [Retriever](https://python.langchain.com/docs/integrations/vectorstores/faiss/#as-a-retriever)


In [None]:
retriever = db.as_retriever()
results = retriever.invoke(query)
print(results[0].page_content)

### [Save to a local vector database](https://python.langchain.com/docs/integrations/vectorstores/faiss/#as-a-retriever)

In [None]:
db_file_name = file_name.split(".")[0]
db_file = f"{db_file_name}.vdb"
db.save_local(db_file)
print(f"vectordb was saved in {db_file}")

### Load and Query local Vector database

In [None]:
db_file_name = "Amazon_Bedrock_User_Guide.vdb"
# load_local unpickles the saved docstore, so only load files you created or trust.
new_db = FAISS.load_local(db_file_name, bedrock_embeddings, allow_dangerous_deserialization=True)


In [None]:
query = "What is a prompt?"
results = new_db.similarity_search(query)
print(results[0].page_content)

### [Delete Vectordb](https://python.langchain.com/docs/integrations/vectorstores/faiss/#delete)

You can also delete records from the vector store.

In [None]:
print("count before:", new_db.index.ntotal)
new_db.delete([new_db.index_to_docstore_id[0]])
print("count after:", new_db.index.ntotal)

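Delete every remaining record from the in-memory database by passing all of its docstore IDs to `delete`. The files saved on disk are not affected.

In [None]:
new_db.delete(list(new_db.index_to_docstore_id.values()))
print("count after:", new_db.index.ntotal)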