# Natural Language Processing

# Retrieval
Many LLM applications require user-specific data that is not part of the model's training set. 
The primary way of accomplishing this is through `Retrieval Augmented Generation (RAG)`.
In this process, external data is retrieved and then passed to the LLM during the generation step.

LangChain provides all the building blocks for RAG applications - from simple to complex.

This section of the documentation covers everything related to the retrieval step.

<img src="./figures/retrieval.jpeg" >

1. `Document loaders` : Load documents from many different sources (HTML, PDF, code). 
2. `Document transformers` : Break large documents down into smaller, relevant chunks, an essential step for effective retrieval.
3. `Text embedding models` : Embeddings capture the semantic meaning of text, allowing you to quickly and efficiently find other pieces of text that are similar.
4. `Vector stores` : Databases that support efficient storage and searching of these embeddings.
5. `Retrievers` : Once the data is in the database, you still need to retrieve it.

In [None]:
import os
# Set GPU device
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

os.environ['http_proxy']  = 'http://192.41.170.23:3128'
os.environ['https_proxy'] = 'http://192.41.170.23:3128'

## 1. Document loaders
- LangChain ships built-in document loader integrations with third-party tools.
- Use document loaders to load data from a source as `Document` objects.
- A `Document` is a piece of text with associated metadata. Typical sources include `.txt`, `.html`, `.md`, and `.json` files.
- Every document loader provides a `load` method that loads data as documents from a configured source.
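
To make the "text plus metadata" idea concrete, here is a minimal sketch of what a loader does. `SimpleDocument` and `SimpleTextLoader` are hypothetical names invented for this illustration and are not LangChain classes; the sketch only assumes that a loader reads a source and returns one record per unit of text, each tagged with where it came from.

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SimpleTextLoader:
    """Hypothetical loader: one document per non-empty line of a text file."""

    def __init__(self, file_path: str):
        self.file_path = file_path

    def load(self) -> list:
        documents = []
        with open(self.file_path, encoding="utf-8") as f:
            for row, line in enumerate(f):
                text = line.strip()
                if text:  # skip blank lines
                    documents.append(
                        SimpleDocument(text, {"source": self.file_path, "row": row})
                    )
        return documents


# Usage: write a small temporary file, load it, then clean up.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\n\nsecond line\n")
    tmp_path = tmp.name

loaded = SimpleTextLoader(tmp_path).load()
os.remove(tmp_path)
print(len(loaded), loaded[1].metadata["row"], loaded[1].page_content)
```

The metadata is what later lets a retrieved chunk be traced back to its file and position.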

### CSV

In [10]:
import pandas as pd
csv_path = './docs/csv/OpenThaiGPT_SelfInstruct_Generated.csv'
df = pd.read_csv(csv_path)
df.shape

(5013, 5)

In [19]:
#Load CSV data with a single row per document.
from langchain.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path=csv_path)
data = loader.load()
len(data)

5013

### PDF

In [20]:
# !pip install pypdf
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("./docs/pdf/MachineLearning-Lecture01.pdf")
pages = loader.load_and_split()
len(pages)

22

In [9]:
# !pip install pymupdf
from langchain.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("./docs/pdf/MachineLearning-Lecture01.pdf")
pages = loader.load()
len(pages)

22

In [2]:
# !pip3 install pdfminer
# !pip3 install pdfminer-six
from langchain.document_loaders import OnlinePDFLoader
loader = OnlinePDFLoader("https://arxiv.org/pdf/1706.03762.pdf")
pages = loader.load()
len(pages)

1

It is worth exploring the other loading techniques available for different data sources; see the
[PDF loader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf).

## 2. Document transformers

In [15]:
# !pip3 install pdfminer
# !pip3 install pdfminer-six
from langchain.document_loaders import OnlinePDFLoader
loader = OnlinePDFLoader("https://arxiv.org/pdf/1706.03762.pdf")
pages = loader.load()
len(pages)

In [24]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 100
)

docs = text_splitter.split_documents(pages)
len(docs)

79

In [23]:
assert len(docs) >= len(pages)

It is worth exploring the other splitting techniques as well; see the
[document transformers documentation](https://python.langchain.com/docs/modules/data_connection/document_transformers/).
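
The roles of `chunk_size` and `chunk_overlap` can be sketched with a plain sliding window over characters. This is a deliberate simplification and not how `RecursiveCharacterTextSplitter` is implemented (the real splitter also tries to break on separators such as paragraphs and sentences); `split_with_overlap` is a hypothetical helper written only for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk repeats the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.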

## 3. Text embedding models

The `Embeddings` class is designed for interfacing with text embedding models.

There are many embedding model providers (OpenAI, Cohere, Hugging Face, etc.); this class provides a standard interface for all of them.

Embeddings create a vector representation of a piece of text. This is useful because it means we can think about text in the vector space, and do things like semantic search where we look for pieces of text that are most similar in the vector space.
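
As a rough illustration of "most similar in the vector space", the sketch below ranks three texts against a query using cosine similarity. The 3-dimensional vectors are made up by hand for this example; a real embedding model produces much longer vectors, and nothing here calls an actual model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings" (assumption: not produced by any model)
toy_vectors = {
    "I enjoy sushi": [0.9, 0.1, 0.0],
    "Sushi is my favourite food": [0.8, 0.2, 0.1],
    "The stock market fell today": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]  # pretend embedding of "Do you like sushi?"

# Rank the texts from most to least similar to the query
ranked = sorted(
    toy_vectors,
    key=lambda text: cosine_similarity(query_vector, toy_vectors[text]),
    reverse=True,
)
print(ranked)
```

The two sushi sentences point in nearly the same direction as the query, so they rank ahead of the unrelated sentence.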

In [2]:
from langchain.embeddings import HuggingFaceInstructEmbeddings
import torch

embedding_model = HuggingFaceInstructEmbeddings(
        model_name = 'hkunlp/instructor-base',              
        model_kwargs = {
            'device': torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        },
    )

load INSTRUCTOR_Transformer
max_seq_length  512


In [11]:
query_A = embedding_model.embed_query('Chacky love to eat sushi.')
query_B = embedding_model.embed_query('Chacky don\'t love to eat Durian.')

In [16]:
import torch.nn.functional as F

# Cosine similarity between the two query embeddings (both are 1-D vectors)
vec_A, vec_B = torch.tensor(query_A), torch.tensor(query_B)
F.cosine_similarity(vec_A, vec_B, dim=0)

tensor(0.9132)

In [33]:
embeddings = embedding_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)
len(embeddings)

5

## 4. Vector stores

One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query. A vector store takes care of storing embedded data and performing vector search for you.
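
The store-then-search cycle described above can be sketched in a few lines. `ToyVectorStore` and `toy_embed` are hypothetical names for this illustration: the "embedding" is just a word count over a tiny fixed vocabulary, and the search is a brute-force scan, whereas a real vector store such as FAISS uses a learned embedding model and optimized index structures.

```python
import math


def toy_embed(text: str, vocabulary: list) -> list:
    """Hypothetical embedding: count how often each vocabulary word occurs."""
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


class ToyVectorStore:
    """Keeps texts next to their vectors and searches by cosine similarity."""

    def __init__(self, embed):
        self.embed = embed
        self.texts = []
        self.vectors = []

    def add_texts(self, texts):
        for text in texts:
            self.texts.append(text)
            self.vectors.append(self.embed(text))

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms if norms else 0.0  # all-zero vectors match nothing

    def similarity_search(self, query, k=2):
        query_vector = self.embed(query)  # embed the query at search time
        scored = sorted(
            zip(self.texts, self.vectors),
            key=lambda pair: self._cosine(query_vector, pair[1]),
            reverse=True,
        )
        return [text for text, _ in scored[:k]]


vocabulary = ["regression", "linear", "vector", "machine", "support"]
store = ToyVectorStore(lambda text: toy_embed(text, vocabulary))
store.add_texts([
    "linear regression fits a line",
    "support vector machine finds a margin",
    "logistic regression is a classifier",
])
hits = store.similarity_search("what is linear regression", k=2)
print(hits)
```

The same embedding function is used for documents and for the query; mixing two different models would make the similarity scores meaningless.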

<img src="./figures/vectorstores.jpeg" >

In [1]:
# !pip install faiss-cpu
# !pip install faiss-gpu

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyMuPDFLoader
from langchain.vectorstores import FAISS
# Steps 1-2: load the PDF and split it into chunks
loader = PyMuPDFLoader("./docs/pdf/MachineLearning-Lecture01.pdf")
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=700, 
    chunk_overlap=100
)

documents = text_splitter.split_documents(docs)
# Steps 3-4: embed the chunks and index them in FAISS
vectordb = FAISS.from_documents(
    documents, 
    embedding_model)

In [28]:
import os

vectordb_path = 'vectordb_path'
db_file_name = 'ml-andrew-ng'
db_path = os.path.join(vectordb_path, db_file_name)

# Persist the index, then reload it from the same folder
vectordb.save_local(db_path)

vectordb = FAISS.load_local(
        folder_path = db_path,
        embeddings  = embedding_model
    ) 

### Similarity search

In [12]:
query = "What is Linear Regression"
docs = vectordb.similarity_search(query)
docs

[Document(page_content="And one of the most interesting things we'll talk about later this quarter is what if your \ndata doesn't lie in a two-dimensional or three-dimensional or sort of even a finite \ndimensional space, but is it possible — what if your data actually lies in an infinite \ndimensional space? Our plots here are two-dimensional space. I can't plot you an infinite \ndimensional space, right? And so it turns out that one of the most successful classes of \nmachine learning algorithms — some may call support vector machines — actually takes \ndata and maps data to an infinite dimensional space and then does classification using not \ntwo features like I've done here, but an infinite number of features.", metadata={'source': './docs/pdf/MachineLearning-Lecture01.pdf', 'file_path': './docs/pdf/MachineLearning-Lecture01.pdf', 'page': 13, 'total_pages': 22, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'PScript5.dll Version 5.2.2', '

### Similarity search by vector
It is also possible to search for documents similar to a given embedding vector using `similarity_search_by_vector`, which accepts an embedding vector as a parameter instead of a string.

In [14]:
query = "What is Linear Regression"
embedding_vector = embedding_model.embed_query(query)
docs = vectordb.similarity_search_by_vector(embedding_vector)
docs

[Document(page_content="And one of the most interesting things we'll talk about later this quarter is what if your \ndata doesn't lie in a two-dimensional or three-dimensional or sort of even a finite \ndimensional space, but is it possible — what if your data actually lies in an infinite \ndimensional space? Our plots here are two-dimensional space. I can't plot you an infinite \ndimensional space, right? And so it turns out that one of the most successful classes of \nmachine learning algorithms — some may call support vector machines — actually takes \ndata and maps data to an infinite dimensional space and then does classification using not \ntwo features like I've done here, but an infinite number of features.", metadata={'source': './docs/pdf/MachineLearning-Lecture01.pdf', 'file_path': './docs/pdf/MachineLearning-Lecture01.pdf', 'page': 13, 'total_pages': 22, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'PScript5.dll Version 5.2.2', '

In [43]:
# !pip install pymupdf
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyMuPDFLoader("./docs/pdf/MachineLearning-Lecture01.pdf")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=700, 
    chunk_overlap=100
)

documents = text_splitter.split_documents(documents)

## 5. Retrievers

A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.
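
To illustrate why a retriever is more general than a vector store, the sketch below defines a retriever that has no embeddings at all and simply ranks texts by shared words. `Retriever`, `KeywordRetriever`, and `answer_context` are hypothetical names for this example; the only thing borrowed from the notebook is the `get_relevant_documents(query)` method name.

```python
from typing import List, Protocol


class Retriever(Protocol):
    """Anything that can turn a query string into a list of documents."""

    def get_relevant_documents(self, query: str) -> List[str]: ...


class KeywordRetriever:
    """Hypothetical retriever with no vector store: ranks texts by shared words."""

    def __init__(self, texts: List[str], k: int = 2):
        self.texts = texts
        self.k = k

    def get_relevant_documents(self, query: str) -> List[str]:
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(t.lower().split())), t) for t in self.texts]
        matches = [(score, t) for score, t in scored if score > 0]
        matches.sort(key=lambda pair: pair[0], reverse=True)  # most overlap first
        return [t for _, t in matches[: self.k]]


def answer_context(retriever: Retriever, query: str) -> str:
    """Downstream code depends only on the interface, not on how retrieval works."""
    return "\n".join(retriever.get_relevant_documents(query))


corpus = [
    "linear regression predicts a continuous value",
    "logistic regression predicts a class label",
    "k-means groups points into clusters",
]
keyword_retriever = KeywordRetriever(corpus, k=2)
context = answer_context(keyword_retriever, "what does linear regression predict")
print(context)
```

Because `answer_context` only calls `get_relevant_documents`, a vector-store-backed retriever could be swapped in without changing it.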

In [22]:
retriever = vectordb.as_retriever(search_type="mmr")
docs = retriever.get_relevant_documents("What is Linear Regression")
docs

[Document(page_content="And one of the most interesting things we'll talk about later this quarter is what if your \ndata doesn't lie in a two-dimensional or three-dimensional or sort of even a finite \ndimensional space, but is it possible — what if your data actually lies in an infinite \ndimensional space? Our plots here are two-dimensional space. I can't plot you an infinite \ndimensional space, right? And so it turns out that one of the most successful classes of \nmachine learning algorithms — some may call support vector machines — actually takes \ndata and maps data to an infinite dimensional space and then does classification using not \ntwo features like I've done here, but an infinite number of features.", metadata={'source': './docs/pdf/MachineLearning-Lecture01.pdf', 'file_path': './docs/pdf/MachineLearning-Lecture01.pdf', 'page': 13, 'total_pages': 22, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'PScript5.dll Version 5.2.2', '

In [23]:
retriever = vectordb.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": .5})
docs = retriever.get_relevant_documents("What is Linear Regression")
docs

[Document(page_content="And one of the most interesting things we'll talk about later this quarter is what if your \ndata doesn't lie in a two-dimensional or three-dimensional or sort of even a finite \ndimensional space, but is it possible — what if your data actually lies in an infinite \ndimensional space? Our plots here are two-dimensional space. I can't plot you an infinite \ndimensional space, right? And so it turns out that one of the most successful classes of \nmachine learning algorithms — some may call support vector machines — actually takes \ndata and maps data to an infinite dimensional space and then does classification using not \ntwo features like I've done here, but an infinite number of features.", metadata={'source': './docs/pdf/MachineLearning-Lecture01.pdf', 'file_path': './docs/pdf/MachineLearning-Lecture01.pdf', 'page': 13, 'total_pages': 22, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'PScript5.dll Version 5.2.2', '

In [26]:
retriever = vectordb.as_retriever(search_kwargs={"k": 3})
docs = retriever.get_relevant_documents("What is Linear Regression")
docs

[Document(page_content="And one of the most interesting things we'll talk about later this quarter is what if your \ndata doesn't lie in a two-dimensional or three-dimensional or sort of even a finite \ndimensional space, but is it possible — what if your data actually lies in an infinite \ndimensional space? Our plots here are two-dimensional space. I can't plot you an infinite \ndimensional space, right? And so it turns out that one of the most successful classes of \nmachine learning algorithms — some may call support vector machines — actually takes \ndata and maps data to an infinite dimensional space and then does classification using not \ntwo features like I've done here, but an infinite number of features.", metadata={'source': './docs/pdf/MachineLearning-Lecture01.pdf', 'file_path': './docs/pdf/MachineLearning-Lecture01.pdf', 'page': 13, 'total_pages': 22, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'PScript5.dll Version 5.2.2', '

## Summary Step

In [None]:
import os
import torch
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceInstructEmbeddings

##STEP1 Document loaders
folder_path = './docs/pdf/'
pdf_list = []
for filename in os.listdir(folder_path):
    if filename.endswith('.pdf'):
        pdf_list.append(os.path.join(folder_path, filename))

# Load every PDF in the folder into one list of documents
documents = []
pdf_loaders = [PyPDFLoader(pdf) for pdf in pdf_list]
for loader in pdf_loaders:
    documents.extend(loader.load())

##STEP2 Document transformers
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 100
)

docs = text_splitter.split_documents(documents) 

##STEP3 Text embedding models
embedding_model = HuggingFaceInstructEmbeddings(
        model_name = 'hkunlp/instructor-base',              
        model_kwargs = {
            'device': torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        },
    )

##STEP4 Vector stores
vector_path = 'vectordb_path'
db_file_name = 'ml_andrew_full_course'

vectordb = FAISS.from_documents(
        documents = docs, 
        embedding = embedding_model)

vectordb.save_local(
    os.path.join(vector_path, db_file_name)
)

##STEP5 Retrievers
vectordb = FAISS.load_local(
        folder_path = os.path.join(vector_path, db_file_name),
        embeddings  = embedding_model
    ) 
retriever = vectordb.as_retriever()

## Appendix

- [Multi-Vector Retriever for RAG on tables, text, and images](https://blog.langchain.dev/semi-structured-multi-modal-rag/)

### Caching
Embeddings can be stored or temporarily cached to avoid needing to recompute them.

Caching embeddings can be done using a CacheBackedEmbeddings. 

The cache backed embedder is a wrapper around an embedder that caches embeddings in a key-value store. 

The text is hashed and the hash is used as the key in the cache.

The main supported way to initialize a `CacheBackedEmbeddings` is `from_bytes_store`. It takes the following parameters:
- `underlying_embedder`: The embedder to use for embedding.
- `document_embedding_cache`: The cache to use for storing document embeddings.
- `namespace`: The namespace to use for the document cache. This namespace is used to avoid collisions with other caches.

Attention: Be sure to set the `namespace` parameter to avoid collisions when the same text is embedded using different embedding models.
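
The caching idea can be sketched without LangChain. `ToyCachedEmbedder` and `slow_embed` are hypothetical names for this illustration, the store is a plain `dict`, and the choice of SHA-256 for hashing is an assumption made here, not a statement about what `CacheBackedEmbeddings` uses internally. The sketch shows the two points from the text: repeated texts are not re-embedded, and the namespace keeps different models from sharing cache entries.

```python
import hashlib


class ToyCachedEmbedder:
    """Hypothetical wrapper: caches embeddings in a dict keyed by namespace + text hash."""

    def __init__(self, embed_fn, store: dict, namespace: str):
        self.embed_fn = embed_fn
        self.store = store
        self.namespace = namespace

    def _key(self, text: str) -> str:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self.namespace}:{digest}"

    def embed_documents(self, texts):
        vectors = []
        for text in texts:
            key = self._key(text)
            if key not in self.store:  # cache miss: compute once and remember
                self.store[key] = self.embed_fn(text)
            vectors.append(self.store[key])
        return vectors


calls = []  # records every time the underlying embedder actually runs


def slow_embed(text):
    """Stand-in for an expensive embedding model (assumption for this sketch)."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


cache = {}
embedder_a = ToyCachedEmbedder(slow_embed, cache, namespace="model-a")
first = embedder_a.embed_documents(["hello world", "hi"])
second = embedder_a.embed_documents(["hello world", "hi"])  # served from the cache
calls_after_a = len(calls)

# A different namespace does not reuse model-a's entries for the same text.
embedder_b = ToyCachedEmbedder(slow_embed, cache, namespace="model-b")
embedder_b.embed_documents(["hello world"])
print(calls_after_a, len(calls), len(cache))
```

Without the namespace, the second embedder would have been handed vectors computed by a different model for the same text.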

#### With Caching vs Without caching

In [1]:
from langchain.storage import (
    InMemoryStore,
    LocalFileStore,
    RedisStore,
    UpstashRedisStore,
)
from langchain.embeddings import CacheBackedEmbeddings
from langchain.vectorstores import FAISS

fs = LocalFileStore("./cache/")

from langchain.embeddings import HuggingFaceInstructEmbeddings
import torch

underlying_embeddings = HuggingFaceInstructEmbeddings(
        model_name = 'hkunlp/instructor-base',              
        model_kwargs = {
            'device': torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        },
    )

cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, fs, namespace=underlying_embeddings.model_name
)

list(fs.yield_keys())

load INSTRUCTOR_Transformer
max_seq_length  512


[]

In [2]:
# !pip install pymupdf
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyMuPDFLoader("./docs/pdf/MachineLearning-Lecture01.pdf")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=700, 
    chunk_overlap=100
)

documents = text_splitter.split_documents(documents)

In [3]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [4]:
import time
start_time = time.time()
vectordb = FAISS.from_documents(documents, cached_embedder)
vectordb2 = FAISS.from_documents(documents, cached_embedder)
end_time = time.time()
epoch_mins, epoch_secs = epoch_time(start_time, end_time)
print(f'Time : {epoch_mins} m {epoch_secs} s')

Time : 0 m 7 s


In [5]:
from langchain.embeddings import HuggingFaceInstructEmbeddings
embedding_model = HuggingFaceInstructEmbeddings(
        model_name = 'hkunlp/instructor-base',              
        model_kwargs = {
            'device': torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        },
    )

load INSTRUCTOR_Transformer
max_seq_length  512


In [6]:
import time
start_time = time.time()
vectordb = FAISS.from_documents(documents, embedding_model)
vectordb2 = FAISS.from_documents(documents, embedding_model)
end_time = time.time()
epoch_mins, epoch_secs = epoch_time(start_time, end_time)
print(f'Time : {epoch_mins} m {epoch_secs} s')

Time : 0 m 15 s


### MultiQueryRetriever

Distance-based vector database retrieval relies on high-dimensional space representations to find similar documents, but query wording changes and inadequate embeddings can lead to varying results. The MultiQueryRetriever automates prompt tuning by generating diverse queries from a user input, collecting relevant documents for each query, and combining the results to potentially overcome the limitations of distance-based retrieval and provide a more comprehensive set of results.
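
The "retrieve for each query, then combine" step can be sketched as a de-duplicated union. In this illustration the query variants are written by hand where `MultiQueryRetriever` would ask an LLM to generate them, and `toy_retrieve` is a hypothetical word-overlap retriever standing in for a vector store.

```python
def retrieve_for_queries(queries, retrieve):
    """Run every query variant and keep each document once, in first-seen order."""
    unique_docs = []
    seen = set()
    for query in queries:
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs


corpus = [
    "linear regression predicts continuous values",
    "logistic regression outputs class probabilities",
    "a sigmoid squashes scores into probabilities",
    "gradient descent minimises a loss",
]


def toy_retrieve(query):
    """Hypothetical retriever: return the texts that share a word with the query."""
    words = set(query.lower().split())
    return [text for text in corpus if words & set(text.lower().split())]


# Hand-written rephrasings (assumption: an LLM would normally produce these)
variants = [
    "linear versus logistic regression",
    "which model outputs probabilities",
    "predicts continuous values",
]
unique_docs = retrieve_for_queries(variants, toy_retrieve)
print(len(toy_retrieve(variants[0])), len(unique_docs))
```

The first phrasing alone misses the sigmoid passage; the second phrasing picks it up, which is the kind of gap the combined result is meant to cover.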

In [29]:
# #STEP4
db_file_name = './vectordb_path/ml-andrew-ng/'
vectordb = FAISS.load_local(
        folder_path = db_file_name,
        embeddings  = embedding_model
    ) 
# #STEP5
retriever = vectordb.as_retriever()

In [3]:
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm
)

In [10]:
# Set logging for the queries
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [11]:
question = "What is the difference between Linear Regression and Logistic Regression?"
unique_docs = retriever_from_llm.get_relevant_documents(query=question)
len(unique_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. How do Linear Regression and Logistic Regression differ from each other?', '2. In what ways do Linear Regression and Logistic Regression vary?', '3. Can you explain the distinctions between Linear Regression and Logistic Regression?']


7

### MultiVector Retriever
It can often be beneficial to store multiple vectors per document. LangChain has a base `MultiVectorRetriever` that makes querying this type of setup easy. Much of the complexity lies in how to create the multiple vectors per document. This notebook covers some of the common ways to create those vectors and use the `MultiVectorRetriever`.

The methods to create multiple vectors per document include:

- Smaller chunks: split a document into smaller chunks and embed those (this is `ParentDocumentRetriever`).
- Summary: create a summary for each document and embed that along with (or instead of) the document.
- Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, and embed those along with (or instead of) the document.
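
The "smaller chunks" approach boils down to indexing the small pieces while returning the larger parent they came from. The sketch below shows that bookkeeping with plain Python: `docstore`, `child_index`, and `retrieve_parents` are hypothetical names, sentences stand in for child chunks, and word overlap stands in for vector similarity.

```python
import uuid

parents = [
    "Linear regression fits a line to data. It minimises squared error.",
    "Support vector machines separate classes. They maximise the margin.",
]

docstore = {}     # doc_id -> full parent document
child_index = []  # (child_text, doc_id) pairs; stands in for the vector store

for parent in parents:
    doc_id = str(uuid.uuid4())
    docstore[doc_id] = parent
    for sentence in parent.split(". "):
        # every child chunk remembers which parent it belongs to
        child_index.append((sentence.rstrip("."), doc_id))


def retrieve_parents(query):
    """Match the query against child chunks, then return the distinct parents."""
    query_words = set(query.lower().split())
    found_ids = []
    for child_text, doc_id in child_index:
        if query_words & set(child_text.lower().split()) and doc_id not in found_ids:
            found_ids.append(doc_id)
    return [docstore[doc_id] for doc_id in found_ids]


results = retrieve_parents("margin")
print(results)
```

Matching happens on the short, focused child text, while the caller receives the full parent and therefore more context for the generation step.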


In [None]:
import os
import torch
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.document_loaders import PyPDFLoader

folder_path = './docs/pdf'
documents = [] 
pdf_list = []
for filename in os.listdir(folder_path):
    if filename.endswith('.pdf'):
        # os.path.join adds the separator that plain concatenation would miss
        pdf_list.append(os.path.join(folder_path, filename))

pdf_loaders = [PyPDFLoader(pdf) for pdf in pdf_list]
for loader in pdf_loaders:
    documents.extend(loader.load())

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 100
)
docs = text_splitter.split_documents(documents) 

embedding_model = HuggingFaceInstructEmbeddings(
        model_name = 'hkunlp/instructor-base',              
        model_kwargs = {
            'device': torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        },
    )

# Load the index saved earlier under vectordb_path/
db_file_name = './vectordb_path/ml-andrew-ng/'

vectordb = FAISS.load_local(
        folder_path = db_file_name,
        embeddings  = embedding_model
    ) 

retriever = vectordb.as_retriever()

In [None]:
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
# from langchain.document_loaders import TextLoader

# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectordb,
    docstore=store,
    id_key=id_key,
)
import uuid

doc_ids = [str(uuid.uuid4()) for _ in docs]

In [None]:
# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)

retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [None]:
# Search the child chunks directly through the retriever's underlying vector store
sub_docs = retriever.vectorstore.similarity_search("What is Linear Regression")