# Natural Language Processing

# Retrieval-augmented generation (RAG)

<img src="../figures/RAG-process.png" >

In [None]:
import os
# Set GPU device
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

os.environ['http_proxy']  = 'http://192.41.170.23:3128'
os.environ['https_proxy'] = 'http://192.41.170.23:3128'

## 1. Prompt

In [28]:
from langchain.prompts import PromptTemplate

prompt_template = """
    I'm your friendly NLP chatbot, here to assist Chaky and Gun with any questions they have about Natural Language Processing (NLP). 
    If you're curious about how probability works in the context of NLP, feel free to ask any questions you may have. 
    Whether it's about probabilistic models, language models, or any other related topic, 
    I'm here to help break down complex concepts into easy-to-understand explanations.
    Just let me know what you're wondering about, and I'll do my best to guide you through it!
    {context}
    Question: {question}
    Answer:
    """

PROMPT = PromptTemplate(
    template = prompt_template, 
    input_variables=["context", "question"]
)

In [29]:
PROMPT.format_prompt(
    context = 'Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can effectively generalize and thus perform tasks without explicit instructions.',
    question = 'What is Machine Learning')

StringPromptValue(text="\n    I'm your friendly NLP chatbot, here to assist Chaky and Gun with any questions they have about Natural Language Processing (NLP). \n    If you're curious about how probability works in the context of NLP, feel free to ask any questions you may have. \n    Whether it's about probabilistic models, language models, or any other related topic, \n    I'm here to help break down complex concepts into easy-to-understand explanations.\n    Just let me know what you're wondering about, and I'll do my best to guide you through it!\n    Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can effectively generalize and thus perform tasks without explicit instructions.\n    Question: What is Machine Learning\n    Answer:\n    ")

## 2. Retrieval

1. `Document loaders` : Load documents from many different sources (HTML, PDF, code). 
2. `Document transformers` : One of the essential steps in document retrieval is breaking down a large document into smaller, relevant chunks to enhance the retrieval process.
3. `Text embedding models` : Embeddings capture the semantic meaning of the text, allowing you to quickly and efficiently find other pieces of text that are similar.
4. `Vector stores` : Databases designed to store embeddings and search them efficiently by similarity.
5. `Retrievers` : Once the data is in the database, you still need to retrieve it.

### 2.1 Document Loaders 

[Download Document](https://web.stanford.edu/~jurafsky/slp3/)

In [8]:
from langchain.document_loaders import PyPDFLoader

nlp_document = '../docs/pdf/SpeechandLanguageProcessing_3rd_07jan2023.pdf'
loader = PyPDFLoader(nlp_document)
documents = loader.load()

In [9]:
len(documents)

636

### 2.2 Document Transformers

In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 100
)

docs = text_splitter.split_documents(documents) 

In [11]:
len(docs)

3434
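
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal plain-Python sketch of fixed-size chunking with overlap. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses (which also tries to split on separators such as paragraph and line breaks); the helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters,
    where consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a chunk boundary still appears in full in at least one neighbouring chunk as long as it is shorter than the overlap.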

### 2.3 Text Embedding Models

In [12]:
import torch
from langchain.embeddings import HuggingFaceInstructEmbeddings

embedding_model = HuggingFaceInstructEmbeddings(
        model_name = 'hkunlp/instructor-base',              
        model_kwargs = {
            'device': torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        },
    )

load INSTRUCTOR_Transformer
max_seq_length  512
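
One common way to compare two embeddings is cosine similarity. The sketch below uses hand-written 3-dimensional vectors as stand-ins for real embeddings (the numbers are made up for illustration and are not output of `hkunlp/instructor-base`); the vector store used later may rely on a different distance measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy "embeddings" (made-up numbers, for illustration only).
vec_language_model = [0.9, 0.1, 0.0]
vec_ngram = [0.8, 0.2, 0.1]
vec_cooking = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(vec_language_model, vec_ngram)
sim_unrelated = cosine_similarity(vec_language_model, vec_cooking)
print(sim_related > sim_unrelated)
```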


### 2.4 Vector Stores

In [20]:
vector_path = '../vectordb_path/'
os.path.exists(vector_path)

True

In [14]:
from langchain.vectorstores import FAISS

vector_path = '../vectordb_path'
db_file_name = 'nlp_standford'

vectordb = FAISS.from_documents(
        documents = docs, 
        embedding = embedding_model)

#save vector locally
vectordb.save_local(
    os.path.join(vector_path, db_file_name)
) 

### 2.5 Retrievers

In [None]:
vector_path = '../vectordb_path'
db_file_name = 'nlp_standford'
# load the saved vector store from disk
vectordb = FAISS.load_local(
    folder_path = os.path.join(vector_path, db_file_name),
    embeddings  = embedding_model
)

# expose the vector store as a retriever
retriever = vectordb.as_retriever()
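
Conceptually, a vector-store retriever embeds the query and returns the stored chunks closest to it. The following is a minimal sketch of that nearest-neighbour step only, with hand-written 2-D vectors standing in for real embeddings and a hypothetical `top_k` helper that is not part of LangChain or FAISS.

```python
import math


def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are closest to query_vec
    (smallest Euclidean distance first)."""
    ranked = sorted(store, key=lambda item: math.dist(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy store of (text, vector) pairs; the vectors are made up.
toy_store = [
    ("N-gram language models", [0.9, 0.1]),
    ("Naive Bayes classification", [0.6, 0.4]),
    ("Speech synthesis", [0.1, 0.9]),
]
toy_query = [1.0, 0.0]

results = top_k(toy_query, toy_store, k=2)
print(results)
```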

## 3. Memory

One of the core utility classes underpinning most (if not all) memory modules is the `ChatMessageHistory` class. It is a lightweight wrapper that provides convenience methods for saving `HumanMessage` and `AIMessage` objects and for fetching them all.

You may want to use this class directly if you are managing memory outside of a chain.
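
The idea can be sketched in plain Python. `SimpleChatHistory` below is a hypothetical stand-in written for illustration, not LangChain's `ChatMessageHistory`; it only mirrors the save-then-fetch pattern described above, storing each message as a `(role, text)` tuple.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleChatHistory:
    """Illustrative stand-in: an ordered list of (role, text) messages."""
    messages: list = field(default_factory=list)

    def add_user_message(self, text):
        self.messages.append(("human", text))

    def add_ai_message(self, text):
        self.messages.append(("ai", text))


history = SimpleChatHistory()
history.add_user_message("hi!")
history.add_ai_message("Hello, how can I help?")
print(history.messages)
```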


## 4. Chain

Using an LLM in isolation is fine for simple applications, but more complex applications require chaining LLMs - either with each other or with other components.

An `LLMChain` is a simple chain that adds some functionality around language models.
- It consists of a `PromptTemplate` and a language model (either an LLM or a chat model).
- It formats the prompt template using the input key values provided (and also memory key values, if available).
- It passes the formatted string to the LLM and returns the LLM output.
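
The three steps above can be sketched without any library. This is an illustrative reduction, not LangChain's `LLMChain`: `run_llm_chain` and `fake_llm` are hypothetical names, and the "model" simply echoes the prompt it receives.

```python
def run_llm_chain(template, llm, **inputs):
    """Format the template with the inputs, call the model, return its output."""
    prompt = template.format(**inputs)
    return llm(prompt)


def fake_llm(prompt):
    # Stand-in for a real model: it only reports what it received.
    return "FAKE ANSWER for prompt:\n" + prompt


template = "Context: {context}\nQuestion: {question}\nAnswer:"
output = run_llm_chain(
    template,
    fake_llm,
    context="NLP studies how computers process human language.",
    question="What is NLP?",
)
print(output)
```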

### Class ConversationalRetrievalChain

- `retriever` : Retriever to use to fetch documents.

- `combine_docs_chain` : The chain used to combine any retrieved documents.

- `question_generator` : The chain used to generate a new question for the sake of retrieval. It takes the current question (variable `question`) and any chat history (variable `chat_history`) and produces a new standalone question that is used for retrieval.

- `return_source_documents` : Return the retrieved source documents as part of the final result.

- `get_chat_history` : An optional function that converts the chat history into a string. If `None` is provided, a default is used.

- `return_generated_question` : Return the generated question as part of the final result.

- `response_if_no_docs_found` : If specified, the chain will return a fixed response if no docs are found for the question.
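
How these pieces fit together can be sketched as a plain function. This is a simplified reading of the parameters listed above, not the library's implementation; every function and the toy corpus below are hypothetical stubs.

```python
def conversational_retrieval(question, chat_history, question_generator,
                             retriever, combine_docs,
                             response_if_no_docs_found=None):
    # 1. With prior turns, rewrite the follow-up into a standalone question.
    if chat_history:
        standalone = question_generator(question, chat_history)
    else:
        standalone = question
    # 2. Fetch documents for the standalone question.
    docs = retriever(standalone)
    # 3. Combine the documents into an answer, or use the fixed fallback.
    if not docs and response_if_no_docs_found is not None:
        answer = response_if_no_docs_found
    else:
        answer = combine_docs(standalone, docs)
    return {"answer": answer,
            "generated_question": standalone,
            "source_documents": docs}


# Toy corpus keyed by a keyword that must appear in the query.
TOY_CORPUS = {
    "bigram": "A bigram model conditions each word on the previous word.",
    "smooth": "Smoothing reserves probability mass for unseen n-grams.",
}


def stub_question_generator(question, chat_history):
    previous_question, _previous_answer = chat_history[-1]
    return f"{question} (in the context of: {previous_question})"


def stub_retriever(query):
    return [text for keyword, text in TOY_CORPUS.items()
            if keyword in query.lower()]


def stub_combine_docs(question, docs):
    return f"Answer based on {len(docs)} document(s)."


history = [("What is a bigram model?",
            "It conditions each word on the previous one.")]
with_history = conversational_retrieval(
    "How is it smoothed?", history,
    stub_question_generator, stub_retriever, stub_combine_docs)
without_history = conversational_retrieval(
    "How is it smoothed?", [],
    stub_question_generator, stub_retriever, stub_combine_docs)
print(with_history["answer"])
print(without_history["answer"])
```

In this toy run the rewritten question carries the word "bigram" from the earlier turn, so retrieval with history returns both documents while the bare follow-up returns only one.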


## 5. Chatbot

In [None]:
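# NOTE: `chain` is not defined in this section. It is assumed to be a
# ConversationalRetrievalChain built from the components above
# (retriever, PROMPT, memory, and an LLM).
prompt_question = 'Who are you by the way?'

answer = chain({"question": prompt_question})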