<a href="https://colab.research.google.com/github/akajammythakkar/ragpdf/blob/main/pdf_rag_learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Installing libraries (Incase required)

In [53]:
! pip install pymupdf pymilvus sentence-transformers langchain

Collecting langchain
  Downloading langchain-0.2.16-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.3.0,>=0.2.38 (from langchain)
  Downloading langchain_core-0.2.38-py3-none-any.whl.metadata (6.2 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 (from langchain)
  Downloading langchain_text_splitters-0.2.4-py3-none-any.whl.metadata (2.3 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.116-py3-none-any.whl.metadata (13 kB)
Collecting tenacity!=8.4.0,<9.0.0,>=8.1.0 (from langchain)
  Downloading tenacity-8.5.0-py3-none-any.whl.metadata (1.2 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain-core<0.3.0,>=0.2.38->langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting httpx<1,>=0.23.0 (from langsmith<0.2.0,>=0.1.17->langchain)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting orjson<4.0.0,>=3.9.14 (from langsmith<0.2.0,>=0.1.17->langchain)
  Downloading orjson-3.10.7-cp310-cp31

#### Importing Libraries

In [4]:
import numpy
import pandas
import json

# for tokenizers and reading pdf
from transformers import AutoTokenizer, AutoModel
import torch
import fitz  # PyMuPDF

# Display the variable in Markdown format
from IPython.display import Markdown, display

# for api
import requests
import json

# to calculate similarities
from sklearn.metrics.pairwise import cosine_similarity

# ENV variables
from google.colab import userdata

# Vector db
from pymilvus import MilvusClient

# Import Regex
import re

# Ignore warnings
from warnings import filterwarnings
filterwarnings('ignore')

#### Constants

In [20]:
URI = userdata.get("URI")
COLLECTION_NAME = userdata.get("COLLECTION_NAME")
TOKEN = userdata.get("TOKEN")
MODEL_NAME = "Snowflake/snowflake-arctic-embed-m"

#### Milvus DB connections & functions

In [26]:
client = MilvusClient(
        uri=URI,  # Cluster endpoint obtained from the console
        token=TOKEN # API key or a colon-separated cluster username and password
    )

def insert_chunks_to_vec_db(chunks, text):
    """
    Insert the vector embeddings associated with a document into Milvus.

    This function prepares the data by associating document IDs with their respective embeddings
    and then inserts this data into a Milvus collection. It flushes the collection to ensure data
    persistence and prints the number of entities added along with the total entities in the collection.

    Args:
        doc_id (str): The unique identifier for the document.
        embeded_text (list): A list of vector embeddings corresponding to segments of the document.

    Returns:
        list: The response from the Milvus insert operation, typically containing IDs of the inserted vectors.
    """
    # Prepare data for insertion into Milvus by associating document IDs with embeddings
    document_data = [
        {
            "embeddings": chunks,
            "text": text
        }
    ]

    res = client.insert(
        collection_name=COLLECTION_NAME,
        data=document_data
    )

    # Count number of entities in vec_db
    num_entities = client.query(COLLECTION_NAME, filter="", output_fields=["count(*)"])

    # Print the outcome of the insert operation
    print(f"Number of entities added to the db: {len(res['ids'])}, Total Entities in DB: {num_entities[0]['count(*)']}")

    return res['ids']


DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: c8cc309ad55e46478bf548a7453d75e0


#### Helping Functions

In [7]:
def extract_text(pdf_path):
    text = ""
    with fitz.open(pdf_path) as pdf:
        for page in pdf:
            text += page.get_text()
    return text

def search(query, embeddings, chunks):
    query_embedding = generate_embeddings([query])[0]
    similarities = cosine_similarity([query_embedding], embeddings)
    best_match_index = similarities.argsort()[0,::-1]
    return "\n".join([chunks[best_match_index[x]] for x in range(5)])

def chunk_text(text, chunk_size=1000):
    chunk = []
    for i in range(0, len(text), chunk_size):
        new_chunk = text[i : i + chunk_size].lower()
        chunk.append(new_chunk)
    return chunk


def generate_embeddings(chunks):
    embeddings = []
    for chunk in chunks:
        inputs = tokenizer(chunk, return_tensors = 'pt', truncation = True, padding = True)
        #print("inputs : ", inputs)
        """
        The tokenizer processes the text chunk and converts it into a format suitable for the model.
        # args...

        return_tensors='pt': This argument specifies that the output should be in PyTorch tensor format, which is required for the model.
        truncation=True: This ensures that any input longer than the model's maximum length is truncated, preventing errors during processing.
        padding=True: This ensures that shorter inputs are padded to the same length, allowing for batch processing.

        # keys that are returned and which will be used as arg to model:
        input_ids : list of token ids of all tokenised words
        attention_mask : binary mask indicating which tokes are to be attended by the model
        token_type_ids :  It indicates which tokens belong to which segment, if all tokens belong to a single segment then [0,0,0,0]
        overflowing_tokens : This key contains any tokens that were truncated when the input exceeded the maximum length allowed by the model.
        num_truncated_tokens : number of truncated tokesm
        """
        with torch.no_grad():
            outputs = model(**inputs)
            """
            No Gradient Calculation: The with torch.no_grad(): context manager is used to disable gradient calculations. This is important during inference to save memory and speed up computations since we don't need gradients for backpropagation.
            Model Output: The model processes the tokenized inputs and returns the outputs, which include various hidden states. The **inputs syntax unpacks the dictionary of input tensors into keyword arguments for the model.
            """
            k = outputs.last_hidden_state
            #print("meaned last hidden layer : ", k.shape) # prints mean of all multidimensional layers
            embeddings.append(k.mean(dim=1).squeeze().numpy())
            # last hidden state is output of last layer
            """
            Extracting Last Hidden State:
            outputs.last_hidden_state contains the hidden states for all tokens in the input sequence. This is a tensor of shape (batch_size, sequence_length, hidden_size).
            Mean Calculation:
            mean(dim=1) computes the mean of the hidden states across all tokens in the sequence, effectively creating a single embedding for the entire input chunk. This is done to obtain a fixed-size vector representation for each chunk.
            Squeeze and Convert to NumPy:
            """
    return embeddings

# Cleaning text so that only usefull information needs to be saved to the database and vector database.


class TextCleaner:
    def __init__(self):
        self.patterns_to_remove = [
            r'\n',  # Remove newlines
            r'\s+',  # Remove extra whitespaces
            # r'\d+',  # Remove numbers
            # r'\W',  # Remove non-alphanumeric characters
            # r'\b\w{1,2}\b'  # Remove words with less than 3 characters
        ]

    def clean_text(self, text):
        cleaned_text = text
        for pattern in self.patterns_to_remove:

            cleaned_text = re.sub(pattern, ' ', cleaned_text)

            # Remove control characters
            cleaned_text = re.sub(r'[\x00-\x1F\x7F-\x9F]', '', cleaned_text)
        return cleaned_text.strip()

In [8]:
extracted_text = extract_text(r"/content/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")

In [9]:
text = TextCleaner().clean_text(extracted_text)

In [15]:
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
import numpy as np


class Embeddings:
    def __init__(self):
        # Default = all-MiniLM-L6-v2, dimensions 384
        self.embedding_function = SentenceTransformer(MODEL_NAME)
        self.token_splitter = SentenceTransformersTokenTextSplitter(
            chunk_overlap=0, tokens_per_chunk=512, model_name=MODEL_NAME)

    def _text_chunking(self, text) -> list:
        """
        Split the text into chunks

        Args:
            text (str): text to be split
        return:
            list: list of chunks
        """
        # Initialize text splitters
        character_splitter = RecursiveCharacterTextSplitter(
            separators=["\n\n", "\n", ". ", " ", ""],
            chunk_size=1000,
            chunk_overlap=0
        )
        # Split text (choose the splitter based on your needs)
        chunks = character_splitter.split_text(''.join(text))

        return chunks

    def _tokenization(self, chunks: list) -> list:
        """
        Tokenize the text

        Args:
            text (str): text to be tokenized

        Returns:
            list: list of tokenized text
        """

        token_split_texts = []
        for text in chunks:
            token_split_texts += self.token_splitter.split_text(text)

        return token_split_texts

    def embed(self, text) -> list:
        """
        Embed the text using the Sentence Transformers model

        Args:
            text (str): text to be embedded

        Returns:
            list: list of vector embeddings
        """

        if text is None:
            raise ValueError("Text is not provided")

        chunks = self._text_chunking(text)
        token_split_texts = self._tokenization(chunks)

        # embedding chunks and normalizing them in vector space
        embeded_text = self.embedding_function.encode(token_split_texts)

        return token_split_texts, embeded_text

    def encode_for_search(self, query: str) -> np.ndarray:
        """
        Encode the given text for search in the vector DB, with tokenization to improve accuracy.

        Args:
            search_text (str): The text input by the user to encode.

        Returns:
            np.ndarray: The normalized vector embedding of the tokenized input text.
        """
        # Tokenize the search text
        tokenized_texts = self.token_splitter.split_text(query)

        # If the tokenizer splits the text into multiple tokens, encode them separately
        encoded_texts = self.embedding_function.encode(tokenized_texts)

        return encoded_texts

In [22]:
token_split_texts, embeded_text = Embeddings().embed(text)

In [23]:
embeded_text[0]

array([-1.45961880e-03,  1.25343222e-02,  5.66486409e-03,  9.93519928e-03,
        8.39136913e-02,  6.70936182e-02, -2.73250006e-02, -4.93440125e-03,
       -3.47165093e-02, -5.05421422e-02, -8.25414732e-02,  4.66613360e-02,
        2.52301543e-04, -4.16405872e-03,  5.24203517e-02,  4.21726890e-03,
       -3.89243215e-02, -1.26039796e-02,  7.24371197e-03,  2.35477407e-02,
       -4.00150800e-03,  5.05013997e-03,  3.53877880e-02, -1.07403976e-04,
       -2.18280517e-02,  3.37613150e-02, -2.65241470e-02,  4.18966897e-02,
       -6.92734048e-02,  5.36484532e-02, -3.55743393e-02,  5.08293994e-02,
       -2.66839936e-02, -5.81354685e-02, -8.18451867e-02, -5.49253225e-02,
       -2.44591087e-02, -6.58704573e-03,  2.63592619e-02, -7.30686542e-03,
        1.12100290e-02, -2.14534793e-02,  2.90034469e-02, -3.77896428e-02,
       -2.10195091e-02, -4.96057235e-02, -2.52000093e-02,  1.69224720e-02,
        9.24899708e-03, -1.39130792e-02,  2.63155848e-02,  7.03710392e-02,
        6.85640872e-02,  

In [19]:
client.describe_collection("Rag_Chunks")

{'collection_name': 'Rag_Chunks',
 'auto_id': True,
 'num_shards': 1,
 'description': '',
 'fields': [{'field_id': 100,
   'name': 'Auto_id',
   'description': 'The Primary Key',
   'type': <DataType.INT64: 5>,
   'params': {},
   'auto_id': True,
   'is_primary': True},
  {'field_id': 101,
   'name': 'chunks',
   'description': '',
   'type': <DataType.FLOAT_VECTOR: 101>,
   'params': {'dim': 784}},
  {'field_id': 102,
   'name': 'text',
   'description': '',
   'type': <DataType.VARCHAR: 21>,
   'params': {'max_length': 65535}}],
 'aliases': [],
 'collection_id': 452198321887966607,
 'consistency_level': 2,
 'properties': {},
 'num_partitions': 1,
 'enable_dynamic_field': True}

In [27]:
for text, chunks in zip(token_split_texts, embeded_text):
    print(f"Embedding Shape: {chunks.shape}")  # Check vector dimensions
    insert_chunks_to_vec_db(chunks, text)
    print("Insertion successful")


Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 0
Insertion successful
Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 0
Insertion successful
Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 0
Insertion successful
Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 0
Insertion successful
Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 0
Insertion successful
Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 0
Insertion successful
Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 2
Insertion successful
Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 2
Insertion successful
Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 2
Insertion successful
Embedding Shape: (768,)
Numb

In [46]:
def retrieval(query_to_be_searched, limit=3):
    output_fields = ["chunks", "score", "text", "distance"]  # Include 'distance' in output fields
    search_params = {
        "metric_type": "COSINE",
        "params": {"nprobe": 10},
    }

    # Generate embeddings from the query
    data = Embeddings().encode_for_search(query_to_be_searched)

    # Perform the search
    retrieved_chunks = client.search(
        collection_name=COLLECTION_NAME,
        data=data,
        limit=limit,
        output_fields=output_fields,
        search_params=search_params
    )

    # Extract the text values from the retrieved chunks
    text_list = [chunk['entity']['text'] for chunk_list in retrieved_chunks for chunk in chunk_list]
    # chunk_ids = [str(chunk['entity']['chunk_id']) for chunk_list in results for chunk in chunk_list]
    # Print and return the extracted text
    # print(text_list)
    text = " ".join(text_list)

    return text


#### "Embedding are values generated for each word and the values also depend on the data that was fed, it trains a NN to create weights *or* for this specific purpose called embeddings."
#### So embeddings should be made context specific for specific use cases.
#### The embeddings generated by the model are designed to capture semantic relationships:
#### Similar texts will have embeddings that are close together in the vector space.
#### Dissimilar texts will have embeddings that are farther apart.

In [None]:
api = "

In [29]:
rag_query = "what does this book talks about??"

In [47]:
rag_response = retrieval(rag_query)

In [48]:
rag_response

'. i ’ ve shared chapters that were edited out of the final book, as well as other popular resources t background i grew up in a single - parent household with my mom working, going to school, and raising my brother and me as latchkey kids . the means of learning are abundant — it ’ s the desire to learn that is scarce. [ 3 ] reading was my first love. [ 4 ] i remember my grandparents ’ house in india. i ’ d be a little kid on the floor going through all of my grandfather ’ s read - er ’ s digests, which is all he had to read. now, of course, there ’ s a smorgasbord of information out there — anybody can read anything all the time. back then, it was much more limited. i would read comic books, storybooks, whatever i could get my hands on. i think i always loved to read because i ’ m actually an antiso - cial introvert. i was lost in the world of words and ideas from an early age. i think some of it comes from the happy cir - cumstance that when i was young, nobody forced me to read cer

In [None]:
url = f'https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-latest:generateContent?key={api}'
headers = {'Content-Type': 'application/json'}
data = {
        "contents": [
            {
                "parts": [
                    {
                        "text": f"""Query: {query}
Reference Information:
{rag_response}
Please generate a response based on the query and the provided reference information.
Please do not add information from yourside. Keep it pointed on query"""
                    }
                ]
            }
        ],
        "generationConfig": {
            "temperature": 0.7,
            "topK": 40,
            "topP": 0.95,
            "maxOutputTokens": 1024,
        }
    }



response = requests.post(url, headers=headers, json=data)
r = response.json()
r

{'candidates': [{'content': {'parts': [{'text': 'The provided text does not directly state how Naval Ravikant defines good investment opportunities. However, it does mention several key principles that likely inform his investment approach:\n\n* **"Buy-and-hold" + valuation + margin of safety:** This suggests that Naval seeks investments with a long-term perspective, focusing on intrinsic value and a safety buffer to protect against potential losses.\n* **Compound interest:** He emphasizes the importance of compounding returns over time, indicating a preference for investments that can generate consistent, long-term growth.\n* **Leverage:** Naval believes in leveraging one\'s skills and resources to maximize returns, which could translate to investing in businesses with high growth potential or opportunities for scaling.\n* **Avoiding ruin:** He stresses the importance of protecting one\'s capital and avoiding risky investments that could lead to significant losses.\n\nBased on these p

In [None]:
display(Markdown(r['candidates'][0]['content']['parts'][0]['text']))

The provided text does not directly state how Naval Ravikant defines good investment opportunities. However, it does mention several key principles that likely inform his investment approach:

* **"Buy-and-hold" + valuation + margin of safety:** This suggests that Naval seeks investments with a long-term perspective, focusing on intrinsic value and a safety buffer to protect against potential losses.
* **Compound interest:** He emphasizes the importance of compounding returns over time, indicating a preference for investments that can generate consistent, long-term growth.
* **Leverage:** Naval believes in leveraging one's skills and resources to maximize returns, which could translate to investing in businesses with high growth potential or opportunities for scaling.
* **Avoiding ruin:** He stresses the importance of protecting one's capital and avoiding risky investments that could lead to significant losses.

Based on these principles, it can be inferred that Naval likely defines good investment opportunities as those that:

* Offer a solid foundation of intrinsic value and a margin of safety.
* Have the potential for long-term compounding returns.
* Allow for leveraging skills and resources to maximize growth.
* Avoid undue risk and the potential for catastrophic losses. 
