env setup

In [31]:
from langchain_groq import ChatGroq
import os
from dotenv import load_dotenv

load_dotenv()

True

#### groq based llm

In [32]:
groq_llm = ChatGroq(model="qwen/qwen3-32b")

In [33]:
groq_llm.invoke("Hey yo whats going on , dont think")

AIMessage(content='<think>\nOkay, the user started with "Hey yo whats going on , dont think." They\'re probably looking for a casual chat. Let me make sure I respond in a friendly and laid-back way. They mentioned not to overthink, so keeping the reply simple and conversational is key.\n\nFirst, I should acknowledge their greeting. Maybe say something like "Hey there!" to mirror their tone. Then, add an emoji to keep it light, like a smiley or a wave. Since they said "dont think," maybe include a playful line about keeping it chill. \n\nI should ask how they\'re doing to encourage them to share more. Something like, "How\'s everything going on your end?" That opens the door for a longer conversation. Also, adding a question about what\'s new keeps the interaction engaging without being pushy.\n\nNeed to check for any typos or errors. The original message has some lowercase letters and missing spaces, but the response should be clean. Let me make sure the emojis are appropriate and not 

#### gemini based embeddings

In [34]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

In [36]:
embedding_model = GoogleGenerativeAIEmbeddings(model = "models/embedding-001")

In [37]:
query = "YO man whats going on"

embedding = embedding_model.embed_query(query)
embedding

[0.03098084218800068,
 -0.02732415683567524,
 -0.014498245902359486,
 -0.010752961970865726,
 0.0006902334280312061,
 -0.007594329304993153,
 0.012295924127101898,
 0.026624061167240143,
 0.010147640481591225,
 0.03552689403295517,
 0.07064302265644073,
 0.020531732589006424,
 0.054905228316783905,
 -0.02752363495528698,
 0.014623328112065792,
 -0.010943572036921978,
 -0.004928531125187874,
 -0.010988188907504082,
 0.028578057885169983,
 -0.05172860994935036,
 0.017803112044930458,
 0.04897339269518852,
 0.011857939884066582,
 0.027729205787181854,
 -0.0019923055078834295,
 -0.015241177752614021,
 0.02210492640733719,
 -0.029605979099869728,
 -0.04325753450393677,
 0.022575289011001587,
 -0.071275494992733,
 0.03945137560367584,
 -0.03933204337954521,
 -0.016003886237740517,
 0.05145569518208504,
 -0.0871836394071579,
 0.02644386701285839,
 0.01986776292324066,
 -0.017224349081516266,
 0.009314636699855328,
 -0.00940199475735426,
 -0.023584526032209396,
 -0.013119216077029705,
 -0.0077

In [39]:
len(embedding)

768

#### RAG experiments


In [2]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [3]:
import os 

os.getcwd()

'c:\\Users\\djadh\\Downloads\\document_portal\\notebook for experiments'

In [6]:
pdf_file_path = os.path.join(os.getcwd() , "data" , "sample.pdf")

In [None]:
# research paper of llama 2 model

data = PyPDFLoader(pdf_file_path)

In [9]:
# load the pdf
documents = data.load()

In [11]:
documents[:3]

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-07-20T00:30:36+00:00', 'author': '', 'keywords': '', 'moddate': '2023-07-20T00:30:36+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'c:\\Users\\djadh\\Downloads\\document_portal\\notebook for experiments\\data\\sample.pdf', 'total_pages': 77, 'page': 0, 'page_label': '1'}, page_content='Llama 2: Open Foundation and Fine-Tuned Chat Models\nHugo Touvron∗ Louis Martin† Kevin Stone†\nPeter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra\nPrajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen\nGuillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller\nCynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou\nHakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa 

In [None]:
## the loader has turned the pdf into a list of Document objects
## 1 document = 1 page of the pdf

len(documents)  ## 77 pages

77

we have 77 pages. we can embed each page directly, or, if a single page holds too much text to preserve context (or the context length of the llm is too small), we can split the pages into smaller pieces first. this is **chunking**

note: this step applies only to text data, not image or audio data, and even for text it is not mandatory
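
to see what `chunk_size` and `chunk_overlap` mean, here is a rough sketch of a fixed-size sliding window in plain python. this is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that one tries to split on separators such as paragraphs and lines first); the function name `sliding_window_chunks` and the sample text are made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 150) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; it ignores separators,
    unlike RecursiveCharacterTextSplitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample_text = "abcdefghij" * 100  # 1000 characters standing in for one page of text
chunks = sliding_window_chunks(sample_text, chunk_size=500, chunk_overlap=150)

print(len(chunks))
print([len(c) for c in chunks])
# the tail of one chunk is repeated at the head of the next one
print(chunks[0][-150:] == chunks[1][:150])
```

the overlap means neighbouring chunks share some text, so a sentence that falls on a chunk boundary is less likely to lose its surrounding context.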

In [13]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 150 , 
    length_function = len
)

In [16]:
chuncked_docs = text_splitter.split_documents(documents)

In [17]:
len(chuncked_docs)

765

to fetch metadata

In [19]:
chuncked_docs[0].metadata

{'producer': 'pdfTeX-1.40.25',
 'creator': 'LaTeX with hyperref',
 'creationdate': '2023-07-20T00:30:36+00:00',
 'author': '',
 'keywords': '',
 'moddate': '2023-07-20T00:30:36+00:00',
 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5',
 'subject': '',
 'title': '',
 'trapped': '/False',
 'source': 'c:\\Users\\djadh\\Downloads\\document_portal\\notebook for experiments\\data\\sample.pdf',
 'total_pages': 77,
 'page': 0,
 'page_label': '1'}

to fetch the main text

In [20]:
chuncked_docs[0].page_content

'Llama 2: Open Foundation and Fine-Tuned Chat Models\nHugo Touvron∗ Louis Martin† Kevin Stone†\nPeter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra\nPrajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen\nGuillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller\nCynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou\nHakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev'

Storing data in vector db

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding the data

In [23]:
embedding_model = GoogleGenerativeAIEmbeddings(model = "models/embedding-001")

In [27]:
vector_db = FAISS.from_documents(chuncked_docs ,
            embedding_model
        )

below is the retrieval process

we will fetch / rank / retrieve the most relevant documents from the vector db
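
conceptually, retrieval means: embed the query, measure how far it is from every stored vector, and return the closest ones. the sketch below shows that idea in plain python with Euclidean (L2) distance. it is only an illustration under assumptions: the 3-dimensional vectors, the chunk labels, and the helper names `l2_distance` and `top_k` are all made up, and a real FAISS index does this search far more efficiently.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec: list[float], stored: list[tuple[str, list[float]]], k: int = 4) -> list[str]:
    """Return the texts of the k stored vectors closest to the query (hypothetical helper)."""
    ranked = sorted(stored, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# toy "vector db": (chunk text, embedding) pairs with made-up 3-d embeddings
toy_store = [
    ("chunk about pretraining", [0.9, 0.1, 0.0]),
    ("chunk about safety", [0.0, 1.0, 0.0]),
    ("chunk about evaluation", [0.5, 0.5, 0.5]),
]
toy_query = [1.0, 0.0, 0.0]  # pretend this is the embedded query

nearest = top_k(toy_query, toy_store, k=2)
print(nearest)
```

a smaller distance means a closer match, so the first item in the result is the most relevant chunk.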

In [28]:
query = "What is pretraining of data"

In [None]:
relevant_docs = vector_db.similarity_search(query)

fetch the closest match (the first result is the chunk most similar to the query)

In [None]:
relevant_docs[0].page_content

'knowledge and dampen hallucinations.\nWe performed a variety of pretraining data investigations so that users can better understand the potential\ncapabilities and limitations of our models; results can be found in Section 4.1.\n2.2 Training Details\nWe adopt most of the pretraining setting and model architecture fromLlama 1. We use the standard\ntransformer architecture (Vaswani et al., 2017), apply pre-normalization using RMSNorm (Zhang and'

we can also pass `k` to set how many top results are returned (the default is 4)

In [39]:
tok_k_relevant_docs = vector_db.similarity_search(query , k = 2)

In [41]:
tok_k_relevant_docs

[Document(id='0d935c9f-9171-498a-9249-d9fa861c1d98', metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2023-07-20T00:30:36+00:00', 'author': '', 'keywords': '', 'moddate': '2023-07-20T00:30:36+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': 'c:\\Users\\djadh\\Downloads\\document_portal\\notebook for experiments\\data\\sample.pdf', 'total_pages': 77, 'page': 4, 'page_label': '5'}, page_content='knowledge and dampen hallucinations.\nWe performed a variety of pretraining data investigations so that users can better understand the potential\ncapabilities and limitations of our models; results can be found in Section 4.1.\n2.2 Training Details\nWe adopt most of the pretraining setting and model architecture fromLlama 1. We use the standard\ntransformer architecture (Vaswani et al., 2017), apply pre-normalization using RMSNor

what we did here was in-memory storage: as of now the index lives only in our RAM, so it is lost when the kernel restarts

=================================================

next, let's do it with on-disk / persistent storage
