## Generating Embedding Vectors for Chunks

Installing Major Dependencies

In [None]:
!pip install langchain pypdf markdown numpy
!pip install -U langchain-community

### Reading and Loading Dataset

In this cell, we read and load the dataset from PDF and Markdown files. We use the `glob` module to find all files with `.pdf` and `.md` extensions in the `rawdata` directory.

#### Libraries Used:
- `langchain.document_loaders.PyPDFLoader`: This class is used to load PDF documents. It reads the content of PDF files and converts them into a format that can be processed further.
- `langchain.document_loaders.TextLoader`: This class is used to load text documents, including Markdown files. It reads the content of text files and converts them into a format that can be processed further.
- `glob`: This module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell.

#### Process:
1. **Reading PDF Files**:
   - We use `glob.glob("rawdata/*.pdf")` to get a list of all PDF files in the `rawdata` directory.
   - For each PDF file, we create an instance of `PyPDFLoader` and load the documents using the `load()` method.
   - The loaded documents are then added to the `raw_data` list.

2. **Reading Markdown Files**:
   - Similarly, we use `glob.glob("rawdata/*.md")` to get a list of all Markdown files in the `rawdata` directory.
   - For each Markdown file, we create an instance of `TextLoader` with UTF-8 encoding and load the documents using the `load()` method.
   - The loaded documents are then added to the `raw_data` list.

By the end of this cell, the `raw_data` list contains all the documents from the specified PDF and Markdown files, ready for further processing.
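
To make the file-discovery step concrete, the sketch below runs the same `*.pdf` and `*.md` patterns against a throwaway directory. The file names are hypothetical stand-ins for the contents of `rawdata`. It also illustrates two details worth knowing: `glob.glob` does not guarantee any ordering, and a plain `*` pattern does not descend into subdirectories.

```python
import glob
import os
import tempfile

# Build a temporary directory that mimics `rawdata` (file names are made up).
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["paper.pdf", "notes.md", "readme.txt"]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write("placeholder")

    # A PDF inside a subdirectory, to show that "*.pdf" is not recursive.
    os.makedirs(os.path.join(tmp_dir, "nested"))
    with open(os.path.join(tmp_dir, "nested", "deep.pdf"), "w", encoding="utf-8") as f:
        f.write("placeholder")

    # glob returns matches in arbitrary order, so sort for reproducibility.
    pdf_matches = sorted(glob.glob(os.path.join(tmp_dir, "*.pdf")))
    md_matches = sorted(glob.glob(os.path.join(tmp_dir, "*.md")))

    pdf_names = [os.path.basename(path) for path in pdf_matches]
    md_names = [os.path.basename(path) for path in md_matches]

print("PDF files:", pdf_names)
print("Markdown files:", md_names)
```

If the dataset ever contains nested folders, a pattern such as `rawdata/**/*.pdf` together with `recursive=True` would be needed to pick those files up.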

In [None]:
from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.schema import Document
import glob

#### Reading Dataset

pdf_files = glob.glob("rawdata/*.pdf")
md_files = glob.glob("rawdata/*.md")
raw_data = []
for file in pdf_files:  # Load each PDF file
    pdf_loader = PyPDFLoader(file)
    pdf_documents = pdf_loader.load()
    raw_data.extend(pdf_documents)
for file in md_files:  # Load each Markdown file
    md_loader = TextLoader(file, encoding="utf-8")
    md_documents = md_loader.load()
    raw_data.extend(md_documents)

In [None]:
for data in raw_data[:5]:  # Check the content of the first 5 documents
    print(data.page_content)

### Splitting Documents into Chunks

Here, we are splitting the loaded documents into smaller chunks to facilitate better processing and embedding generation. We use the `RecursiveCharacterTextSplitter` from the `langchain.text_splitter` module to achieve this.

#### Libraries Used:
- `langchain.text_splitter.RecursiveCharacterTextSplitter`: This class is used to split text documents into smaller chunks. It ensures that the chunks are of a manageable size and maintains context by allowing some overlap between chunks.

#### Process:
1. **Initialize Text Splitter**:
   - We create an instance of `RecursiveCharacterTextSplitter` with the following parameters:
     - `chunk_size=400`: This sets the maximum size of each chunk to 400 characters.
     - `chunk_overlap=60`: This sets the overlap between consecutive chunks to 60 characters to maintain context.

2. **Split Documents**:
   - We use the `split_documents()` method of the `text_splitter` instance to split the `raw_data` documents into smaller chunks.
   - The resulting chunks are stored in the `document_chunks` list.

3. **Print Sample Chunks**:
   - We print the content and metadata of the first 5 chunks to verify the splitting process.

By the end of this cell, the `document_chunks` list contains the smaller chunks of the original documents, ready for further processing and embedding generation.
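
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a simplified sketch that cuts text into fixed character windows. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph breaks, newlines, and spaces, so its chunks are often shorter than the maximum. The helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=400, chunk_overlap=60):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# 1000 characters of dummy text.
sample_text = "".join(str(i % 10) for i in range(1000))
chunks = sliding_window_chunks(sample_text)

print("Chunk lengths:", [len(chunk) for chunk in chunks])
# The last 60 characters of one chunk reappear at the start of the next.
print("Overlap preserved:", chunks[0][-60:] == chunks[1][:60])
```

With these settings each new window starts 340 characters after the previous one, which is why consecutive chunks share 60 characters.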

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,  # Maximum size of each chunk
    chunk_overlap=60  # Overlap between chunks to maintain context
)

# Split the documents into chunks
document_chunks = text_splitter.split_documents(raw_data)


for chunk in document_chunks[:5]:  # Print the first 5 chunks
    print("Chunk Content:", chunk.page_content)
    print("Metadata:", chunk.metadata)



### Creating Vector Embeddings

In this cell, we are generating vector embeddings for the document chunks using a pre-trained model from the `sentence_transformers` library. These embeddings will be used for various downstream tasks such as similarity search, clustering, and more.

#### Libraries Used:
- `sentence_transformers.SentenceTransformer`: This class is used to load pre-trained models and generate embeddings for text data.



In [None]:
!pip install sentence-transformers

#### Process:
1. **Load Embedding Model**:
   - We load the `multi-qa-mpnet-base-dot-v1` model using the `SentenceTransformer` class. This model is specifically designed for generating high-quality embeddings for question-answering tasks and general-purpose semantic search.
   - **Reason for Choosing `multi-qa-mpnet-base-dot-v1`**:
     - **Performance**: This model produces 768-dimensional embeddings, compared with 384 dimensions for smaller models such as `multi-qa-MiniLM-L6-cos-v1` and `all-MiniLM-L6-v2`. The larger model generally captures semantic meaning more effectively, at the cost of slower encoding.
     - **Versatility**: It is optimized for question-answering, which aligns with the goal of creating a chatbot that answers AI-related questions.


2. **Generate Embeddings**:
   - We extract the text content from each chunk in `document_chunks` and store them in the `chunk_texts` list.
   - We use the `encode()` method of the `embedding_model` to generate embeddings for the `chunk_texts`.
   - The resulting embeddings are stored in `embeddings`, which `encode()` returns by default as a NumPy array with one row per chunk.

3. **Print Sample Embedding**:
   - We print the embedding for the first chunk to verify the embedding generation process.

By the end of this cell, `embeddings` contains one embedding vector per document chunk, ready for use in further analysis and tasks.

In [None]:
######### Create vector Embeddings
from sentence_transformers import SentenceTransformer

# Load an embedding model
embedding_model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')

# Generate embeddings for each chunk
chunk_texts = [chunk.page_content for chunk in document_chunks]
embeddings = embedding_model.encode(chunk_texts)

# Example: Print the embedding for the first chunk
print("First Chunk Embedding:", embeddings[0])





### Normalizing and Converting Embeddings to NumPy Array

In this cell, we are converting the generated embeddings into a NumPy array and normalizing them. This step ensures that the embeddings are in a consistent format and have unit length, which is important for many machine learning algorithms and similarity calculations.

#### Libraries Used:
- `numpy`: This library is used for numerical operations in Python. It provides support for arrays and matrices, along with a collection of mathematical functions to operate on these data structures.

#### Process:
1. **Convert to NumPy Array**:
   - We pass the embeddings through `np.array(embeddings)` so that they are guaranteed to be a NumPy array, which lets us use NumPy's efficient numerical operations.
   - We cast the array to `float32` with `astype('float32')` so that the embeddings are stored in a format that is compatible with most machine learning frameworks.

2. **Normalize Embeddings**:
   - We normalize the embeddings to have unit length using `np.linalg.norm()`. This is done by dividing each embedding vector by its L2 norm.
   - Normalization ensures that the embeddings have a consistent scale, which is important for similarity calculations and other downstream tasks.


`embeddings_array` now contains the normalized embeddings as a NumPy array, which is saved to disk later in this notebook.
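
The normalization step described above can be checked on a toy example. The sketch below is a minimal illustration with made-up 2-dimensional vectors: it shows that once vectors have unit length, their dot product equals their cosine similarity. The helper `l2_normalize` and its `eps` guard are assumptions added here; the guard avoids a division by zero if a vector happens to be all zeros.

```python
import numpy as np


def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit L2 norm; eps guards against all-zero rows."""
    vectors = np.asarray(vectors, dtype="float32")
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)


# Toy vectors: the third row is all zeros on purpose.
toy = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 0.0]])
unit = l2_normalize(toy)

# Cosine similarity computed the long way on the raw vectors.
cosine = float(toy[0] @ toy[1] / (np.linalg.norm(toy[0]) * np.linalg.norm(toy[1])))
# Plain dot product on the normalized vectors.
dot_of_units = float(unit[0] @ unit[1])

print("Normalized vectors:\n", unit)
print("Cosine similarity:", cosine)
print("Dot product of unit vectors:", dot_of_units)
```

Because the two values agree, a similarity search over normalized embeddings can use a single matrix multiplication instead of computing norms for every comparison.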

In [None]:
import numpy as np
# Convert embeddings to a NumPy array
embeddings_array = np.array(embeddings).astype('float32')
embeddings_array /= np.linalg.norm(embeddings_array, axis=1, keepdims=True)

print(embeddings_array.shape)  # Check the shape: (number of chunks, embedding dimension)


The following sample function returns the normalized embedding for a user query.

In [None]:
def generate_vector(query_text):
    '''Generate a normalized embedding vector for a given chunk or query.

    Make sure you have imported sentence_transformers and numpy (as np),
    and defined embedding_model.
    For this task we use the multi-qa-mpnet-base-dot-v1 model.'''
    # Encode a single-item batch; the result has shape (1, embedding_dim)
    query_embedding = embedding_model.encode([query_text]).astype('float32')
    query_embedding = query_embedding / np.linalg.norm(query_embedding)  # Normalize the query embedding
    return query_embedding
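
As a sketch of how a query vector could be used for similarity search, the example below ranks rows of a vector store by dot product. It uses random vectors as stand-ins for real embeddings so that it runs without downloading a model. The helper `top_k_chunks` is hypothetical, and it assumes both the store and the query are already L2-normalized. The query is flattened first, because `generate_vector` returns an array of shape `(1, embedding_dim)`.

```python
import numpy as np


def top_k_chunks(query_vector, vector_store, k=3):
    """Return (indices, scores) of the k rows most similar to the query.

    Assumes both inputs are L2-normalized, so the dot product
    equals cosine similarity.
    """
    flat_query = np.asarray(query_vector).reshape(-1)  # accept (d,) or (1, d)
    scores = vector_store @ flat_query
    k = min(k, len(scores))
    best = np.argsort(scores)[::-1][:k]  # highest scores first
    return best, scores[best]


# Random stand-in for `embeddings_array`: 10 "chunks" with 8 dimensions each.
rng = np.random.default_rng(0)
store = rng.normal(size=(10, 8)).astype("float32")
store /= np.linalg.norm(store, axis=1, keepdims=True)

# Build a query that is a slightly perturbed copy of row 4,
# so the expected best match is known in advance.
query = store[4] + 0.01 * rng.normal(size=8).astype("float32")
query /= np.linalg.norm(query)

indices, scores = top_k_chunks(query.reshape(1, -1), store, k=3)
print("Best matching rows:", indices)
print("Similarity scores:", scores)
```

In the notebook itself, the returned indices would be used to look up the matching entries in `document_chunks`.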


    

### Saving Chunks and Embeddings

We save the document chunks and their corresponding embeddings to files for later access.

#### Libraries Used:
- `pickle`: This module is used for serializing and deserializing Python objects.
- `numpy`: This library is used for numerical operations and saving arrays.

#### Process:
   - We use `pickle.dump()` to save the `document_chunks` list to a binary file named `chunks.pkl`.
   - We use `np.save()` to save the `embeddings_array` to a file named `vector_store.npy`.

The document chunks and embeddings are now saved to disk, allowing for easy retrieval and reuse in future tasks.

In [None]:
######### Saving chunks for later access
import pickle

# Save chunks to a binary file
with open("chunks.pkl", "wb") as f:
    pickle.dump(document_chunks, f)

np.save("vector_store.npy", embeddings_array)
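
For completeness, here is a minimal sketch of the reverse step: reading both files back and checking that they still line up. It uses small stand-in data and a temporary directory so that it does not touch the real `chunks.pkl` and `vector_store.npy`. The variable names are assumptions made for illustration.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-ins for `document_chunks` and `embeddings_array`.
chunks_to_save = [{"page_content": "first chunk"}, {"page_content": "second chunk"}]
vectors_to_save = np.array([[0.6, 0.8], [1.0, 0.0]], dtype="float32")

with tempfile.TemporaryDirectory() as tmp_dir:
    chunks_path = os.path.join(tmp_dir, "chunks.pkl")
    vectors_path = os.path.join(tmp_dir, "vector_store.npy")

    # Save, mirroring the cell above.
    with open(chunks_path, "wb") as f:
        pickle.dump(chunks_to_save, f)
    np.save(vectors_path, vectors_to_save)

    # Load both files back.
    with open(chunks_path, "rb") as f:
        loaded_chunks = pickle.load(f)
    loaded_vectors = np.load(vectors_path)

# Row i of the vector store must correspond to chunk i.
assert len(loaded_chunks) == loaded_vectors.shape[0]
print("Loaded", len(loaded_chunks), "chunks and vectors of shape", loaded_vectors.shape)
```

Two caveats apply when loading the real files: unpickling the saved chunks requires `langchain` to be importable, because the objects are LangChain `Document` instances, and pickle files should only be loaded from sources you trust.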

