<a href="https://colab.research.google.com/github/akajammythakkar/ragpdf/blob/main/pdf_rag_learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Installing libraries (Incase required)

In [49]:
! pip install pymupdf pymilvus sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m227.1/227.1 kB[0m [31m5.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: sentence-transformers
Successfully installed sentence-transformers-3.0.1


#### Importing Libraries

In [1]:
import numpy
import pandas

# for tokenizers and reading pdf
from transformers import AutoTokenizer, AutoModel
import torch
import fitz  # PyMuPDF

# Display the variable in Markdown format
from IPython.display import Markdown, display

# for api
import requests
import json

# to calculate similarities
from sklearn.metrics.pairwise import cosine_similarity

# ENV variables
from google.colab import userdata

# Vector db
from pymilvus import MilvusClient

# Import Regex
import re

# Ignore warnings
from warnings import filterwarnings
filterwarnings('ignore')

#### Constants

In [2]:
URI = userdata.get("URI")
COLLECTION_NAME = userdata.get("COLLECTION_NAME")
TOKEN = userdata.get("TOKEN")

#### Milvus DB connections & functions

In [29]:
client = MilvusClient(
        uri=URI,  # Cluster endpoint obtained from the console
        token=TOKEN # API key or a colon-separated cluster username and password
    )

def insert_chunks_to_vec_db(chunks, text):
    """
    Insert the vector embeddings associated with a document into Milvus.

    This function prepares the data by associating document IDs with their respective embeddings
    and then inserts this data into a Milvus collection. It flushes the collection to ensure data
    persistence and prints the number of entities added along with the total entities in the collection.

    Args:
        doc_id (str): The unique identifier for the document.
        embeded_text (list): A list of vector embeddings corresponding to segments of the document.

    Returns:
        list: The response from the Milvus insert operation, typically containing IDs of the inserted vectors.
    """
    # Prepare data for insertion into Milvus by associating document IDs with embeddings
    document_data = [
        {
            "chunks": chunks,
            "text": text
        }
    ]

    res = client.insert(
        collection_name=COLLECTION_NAME,
        data=document_data
    )

    # Count number of entities in vec_db
    num_entities = client.query(COLLECTION_NAME, filter="", output_fields=["count(*)"])

    # Print the outcome of the insert operation
    print(f"Number of entities added to the db: {len(res['ids'])}, Total Entities in DB: {num_entities[0]['count(*)']}")

    return res['ids']


DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 396a37424ef34d5daec836fc56902c1d


#### Helping Functions

In [4]:
def extract_text(pdf_path):
    text = ""
    with fitz.open(pdf_path) as pdf:
        for page in pdf:
            text += page.get_text()
    return text

def search(query, embeddings, chunks):
    query_embedding = generate_embeddings([query])[0]
    similarities = cosine_similarity([query_embedding], embeddings)
    best_match_index = similarities.argsort()[0,::-1]
    return "\n".join([chunks[best_match_index[x]] for x in range(5)])

def chunk_text(text, chunk_size=1000):
    chunk = []
    for i in range(0, len(text), chunk_size):
        new_chunk = text[i : i + chunk_size].lower()
        chunk.append(new_chunk)
    return chunk


def generate_embeddings(chunks):
    embeddings = []
    for chunk in chunks:
        inputs = tokenizer(chunk, return_tensors = 'pt', truncation = True, padding = True)
        #print("inputs : ", inputs)
        """
        The tokenizer processes the text chunk and converts it into a format suitable for the model.
        # args...

        return_tensors='pt': This argument specifies that the output should be in PyTorch tensor format, which is required for the model.
        truncation=True: This ensures that any input longer than the model's maximum length is truncated, preventing errors during processing.
        padding=True: This ensures that shorter inputs are padded to the same length, allowing for batch processing.

        # keys that are returned and which will be used as arg to model:
        input_ids : list of token ids of all tokenised words
        attention_mask : binary mask indicating which tokes are to be attended by the model
        token_type_ids :  It indicates which tokens belong to which segment, if all tokens belong to a single segment then [0,0,0,0]
        overflowing_tokens : This key contains any tokens that were truncated when the input exceeded the maximum length allowed by the model.
        num_truncated_tokens : number of truncated tokesm
        """
        with torch.no_grad():
            outputs = model(**inputs)
            """
            No Gradient Calculation: The with torch.no_grad(): context manager is used to disable gradient calculations. This is important during inference to save memory and speed up computations since we don't need gradients for backpropagation.
            Model Output: The model processes the tokenized inputs and returns the outputs, which include various hidden states. The **inputs syntax unpacks the dictionary of input tensors into keyword arguments for the model.
            """
            k = outputs.last_hidden_state
            #print("meaned last hidden layer : ", k.shape) # prints mean of all multidimensional layers
            embeddings.append(k.mean(dim=1).squeeze().numpy())
            # last hidden state is output of last layer
            """
            Extracting Last Hidden State:
            outputs.last_hidden_state contains the hidden states for all tokens in the input sequence. This is a tensor of shape (batch_size, sequence_length, hidden_size).
            Mean Calculation:
            mean(dim=1) computes the mean of the hidden states across all tokens in the sequence, effectively creating a single embedding for the entire input chunk. This is done to obtain a fixed-size vector representation for each chunk.
            Squeeze and Convert to NumPy:
            """
    return embeddings

# Cleaning text so that only usefull information needs to be saved to the database and vector database.


class TextCleaner:
    def __init__(self):
        self.patterns_to_remove = [
            r'\n',  # Remove newlines
            r'\s+',  # Remove extra whitespaces
            # r'\d+',  # Remove numbers
            # r'\W',  # Remove non-alphanumeric characters
            # r'\b\w{1,2}\b'  # Remove words with less than 3 characters
        ]

    def clean_text(self, text):
        cleaned_text = text
        for pattern in self.patterns_to_remove:

            cleaned_text = re.sub(pattern, ' ', cleaned_text)

            # Remove control characters
            cleaned_text = re.sub(r'[\x00-\x1F\x7F-\x9F]', '', cleaned_text)
        return cleaned_text.strip()

In [5]:
extracted_text = extract_text(r"/content/Eric-Jorgenson_The-Almanack-of-Naval-Ravikant_Final.pdf")

In [6]:
text = TextCleaner().clean_text(extracted_text)

In [7]:
chunked = chunk_text(text)

In [8]:
chunked[0]

't h e a l m a n a c k o f n a v a l r a v i k a n t e r i c j o rg e n s o n t h e a l m a n ac k o f n ava l r av i k a n t copyright © 2020 eric jorgenson all rights reserved. the almanack of naval ravikant a guide to wealth and happiness isbn 978-1-5445-1422-2 hardcover 978-1-5445-1421-5 paperback 978-1-5445-1420-8 ebook this book has been created as a public service. it is available for free download in pdf and e-reader versions on navalmanack.com. naval is not earning any money on this book. naval has essays, podcasts and more at nav.al and is on twitter @naval. f o r m y p a r e n t s , w h o g a v e m e e v e r y t h i n g a n d a l w ay s s e e m t o f i n d a w ay t o g i v e m o r e . contents important notes on this book (disclaimer) 9 foreword 13 eric’s note (about this book) 17 timeline of naval ravikant 21 now, here is naval in his own words… 23 part i: wealth building wealth 29 understand how wealth is created 30 find and build specific knowledge 40 play long-term games

In [9]:
# Load model and tokenizer
model_name = "Snowflake/snowflake-arctic-embed-m"

embedding_function = SentenceTransformer(model_name)
embeded_text = embedding_function.encode(token_split_texts)

Some weights of BertModel were not initialized from the model checkpoint at Snowflake/snowflake-arctic-embed-m and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
embeddings = generate_embeddings(chunked)

In [12]:
embeddings[0]

array([-6.36374520e-04, -1.96864139e-02,  1.80769831e-01, -7.93479457e-02,
        8.64471257e-01,  3.88169885e-01, -7.54374564e-02, -2.28855386e-02,
       -2.78551012e-01, -2.14533299e-01, -8.47056866e-01,  5.37218273e-01,
       -2.62251377e-01, -1.68836966e-01,  5.11363447e-01,  1.49921581e-01,
       -4.15332735e-01, -1.53741777e-01, -9.86334085e-02,  2.74774820e-01,
       -8.21607336e-02,  1.71034187e-01,  1.94986567e-01,  1.02548473e-01,
        7.93456957e-02,  5.27807951e-01, -2.07142413e-01,  2.40035698e-01,
       -6.09044492e-01,  3.36869150e-01, -6.74604028e-02,  3.15107703e-01,
       -5.39049685e-01, -5.89048922e-01, -4.12817538e-01, -4.75309223e-01,
       -1.85214937e-01, -1.35040041e-02,  2.40040511e-01, -4.48806286e-02,
        1.22380540e-01,  4.04032320e-02,  2.40709707e-01, -2.69110829e-01,
       -2.53426164e-01, -2.02074796e-01,  1.09906318e-02, -5.05920202e-02,
       -1.81719378e-01, -1.36548638e-01,  2.59683251e-01,  5.93897760e-01,
        7.46643841e-01, -

In [30]:
client.describe_collection("RAG_Chunks")

{'collection_name': 'RAG_Chunks',
 'auto_id': True,
 'num_shards': 1,
 'description': '',
 'fields': [{'field_id': 100,
   'name': 'Auto_id',
   'description': 'The Primary Key',
   'type': <DataType.INT64: 5>,
   'params': {},
   'auto_id': True,
   'is_primary': True},
  {'field_id': 101,
   'name': 'chunks',
   'description': '',
   'type': <DataType.FLOAT_VECTOR: 101>,
   'params': {'dim': 768}},
  {'field_id': 102,
   'name': 'text',
   'description': '',
   'type': <DataType.VARCHAR: 21>,
   'params': {'max_length': 65535}}],
 'aliases': [],
 'collection_id': 452198321887445723,
 'consistency_level': 2,
 'properties': {},
 'num_partitions': 1,
 'enable_dynamic_field': True}

In [31]:
for text, chunks in zip(chunked, embeddings):
    print(f"Embedding Shape: {chunks.shape}")  # Check vector dimensions
    insert_chunks_to_vec_db(chunks, text)
    print("Insertion successful")


Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 0
Insertion successful
Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 0
Insertion successful
Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 0
Insertion successful
Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 0
Insertion successful
Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 0
Insertion successful
Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 0
Insertion successful
Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 2
Insertion successful
Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 2
Insertion successful
Embedding Shape: (768,)
Number of entities added to the db: 1, Total Entities in DB: 2
Insertion successful
Embedding Shape: (768,)
Numb

In [48]:
def retrieval(query_to_be_searched, limit = 3):
    output_fields = ["chunks", "score", "text", "distance"]  # Include 'distance' in output fields
    search_params = {
        "metric_type": "COSINE",
        "params": {"nprobe": 10},
    }

    data = generate_embeddings(list(query_to_be_searched))
    print(len(data))
    # Perform the search
    retrieved_chunks = client.search(
        collection_name=COLLECTION_NAME,
        data=data,
        limit=limit,
        output_fields=output_fields,
        search_params=search_params
    )

    print(retrieved_chunks)

    # Return retrieved chunks along with distances
    return retrieved_chunks

#### "Embedding are values generated for each word and the values also depend on the data that was fed, it trains a NN to create weights *or* for this specific purpose called embeddings."
#### So embeddings should be made context specific for specific use cases.
#### The embeddings generated by the model are designed to capture semantic relationships:
#### Similar texts will have embeddings that are close together in the vector space.
#### Dissimilar texts will have embeddings that are farther apart.

In [None]:
api = "

In [34]:
rag_query = "what does this book talks about??"

In [49]:
rag_response = retrieval(rag_query)

ERROR:pymilvus.decorators:RPC error: [search], <MilvusException: (code=65535, message=nq [33] is invalid, nq (number of search vector per search request) should be in range [1, 10], but got 33)>, <Time:{'RPC start': '2024-09-09 07:13:12.998785', 'RPC error': '2024-09-09 07:13:13.044092'}>
ERROR:pymilvus.milvus_client.milvus_client:Failed to search collection: RAG_Chunks


33


MilvusException: <MilvusException: (code=65535, message=nq [33] is invalid, nq (number of search vector per search request) should be in range [1, 10], but got 33)>

In [36]:
rag_response

'was like, “well, do i really care if i embarrass 176 · t h e a l m a n a c k o f n a v a l r a v i k a n t myself? who cares? i’m going to die anyway. this is all going to go to zero, and i won’t remember anything, so this is pointless.” then, i shut down, and i went back to brushing my teeth. i was noticing how good the toothbrush was and how good it felt. then the next moment, i’m off to thinking something else. i have to look at my brain again and say, “do i really need to solve this problem right now?” ninety-five percent of what my brain runs off and tries to do, i don’t need to tackle in that exact moment. if the brain is like a muscle, i’ll be better off resting it, being at peace. when a particular problem arises, i’ll immerse myself in it. right now as we’re talking, i’d rather dedicate myself to being completely lost in the conversation and to being 100 percent focused on this as opposed to thinking about “oh, when i brushed my teeth, did i do it the right way?” the ability 

In [None]:
url = f'https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-latest:generateContent?key={api}'
headers = {'Content-Type': 'application/json'}
data = {
        "contents": [
            {
                "parts": [
                    {
                        "text": f"""Query: {query}
Reference Information:
{rag_response}
Please generate a response based on the query and the provided reference information.
Please do not add information from yourside. Keep it pointed on query"""
                    }
                ]
            }
        ],
        "generationConfig": {
            "temperature": 0.7,
            "topK": 40,
            "topP": 0.95,
            "maxOutputTokens": 1024,
        }
    }



response = requests.post(url, headers=headers, json=data)
r = response.json()
r

{'candidates': [{'content': {'parts': [{'text': 'The provided text does not directly state how Naval Ravikant defines good investment opportunities. However, it does mention several key principles that likely inform his investment approach:\n\n* **"Buy-and-hold" + valuation + margin of safety:** This suggests that Naval seeks investments with a long-term perspective, focusing on intrinsic value and a safety buffer to protect against potential losses.\n* **Compound interest:** He emphasizes the importance of compounding returns over time, indicating a preference for investments that can generate consistent, long-term growth.\n* **Leverage:** Naval believes in leveraging one\'s skills and resources to maximize returns, which could translate to investing in businesses with high growth potential or opportunities for scaling.\n* **Avoiding ruin:** He stresses the importance of protecting one\'s capital and avoiding risky investments that could lead to significant losses.\n\nBased on these p

In [None]:
display(Markdown(r['candidates'][0]['content']['parts'][0]['text']))

The provided text does not directly state how Naval Ravikant defines good investment opportunities. However, it does mention several key principles that likely inform his investment approach:

* **"Buy-and-hold" + valuation + margin of safety:** This suggests that Naval seeks investments with a long-term perspective, focusing on intrinsic value and a safety buffer to protect against potential losses.
* **Compound interest:** He emphasizes the importance of compounding returns over time, indicating a preference for investments that can generate consistent, long-term growth.
* **Leverage:** Naval believes in leveraging one's skills and resources to maximize returns, which could translate to investing in businesses with high growth potential or opportunities for scaling.
* **Avoiding ruin:** He stresses the importance of protecting one's capital and avoiding risky investments that could lead to significant losses.

Based on these principles, it can be inferred that Naval likely defines good investment opportunities as those that:

* Offer a solid foundation of intrinsic value and a margin of safety.
* Have the potential for long-term compounding returns.
* Allow for leveraging skills and resources to maximize growth.
* Avoid undue risk and the potential for catastrophic losses. 
