### Document Preparation and Ingestion [MANDATORY STEPS]

#### Step 1: Loading and Pre-processing the document

In [None]:
from PyPDF2 import PdfReader

PDF_PATH = "../data/AI Training Document.pdf"
text = ""
reader = PdfReader(PDF_PATH)

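for page in reader.pages:
    # extract_text() may yield nothing for image-only pages, so fall back to "".
    page_text = page.extract_text() or ""
    # Collapse whitespace runs; the trailing space keeps pages from fusing together.
    text += " ".join(page_text.split()) + " "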

text
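
The whitespace clean-up in this step relies on the `" ".join(s.split())` idiom. A minimal sketch of what it does, using a made-up sample string; `normalize_whitespace` is a hypothetical helper name, not part of the notebook:

```python
def normalize_whitespace(raw: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace run and drops empties.
    return " ".join(raw.split())


sample = "User  Agreement\n\n1. Introduction\t  This User Agreement"
print(normalize_whitespace(sample))
```

Line breaks inside the PDF text are lost this way, which is acceptable here because the next step splits on sentences rather than on lines.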

#### Step 2: Creating chunks and storing them in a separate directory

In [None]:
from nltk.tokenize import sent_tokenize
import nltk

nltk.download("punkt_tab")

In [41]:
CHUNK_SIZE = 300

sentences = sent_tokenize(text)
chunks = []
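current_chunk = ""

for sentence in sentences:
    # Join with a space so adjacent sentences do not run together.
    candidate = f"{current_chunk} {sentence}".strip()
    if len(candidate) <= CHUNK_SIZE:
        current_chunk = candidate
    else:
        # Flush the full chunk and start the next one with this sentence,
        # so the sentence that did not fit is not dropped.
        if current_chunk:
            chunks.append(current_chunk)
        current_chunk = sentence

# Flush the final, partially filled chunk.
if current_chunk:
    chunks.append(current_chunk)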

cleaned_chunks = [chunk for chunk in chunks if len(chunk) > 5]
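
The greedy sentence packing in this step can also be written as a reusable function and checked on toy input. The following is a minimal sketch, not the notebook's exact code: `pack_sentences` and the demo sentences are hypothetical. This version joins sentences with a space, carries a sentence that does not fit into the next chunk, and flushes the last chunk after the loop.

```python
def pack_sentences(sentences, chunk_size=300):
    """Greedily pack sentences into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A sentence longer than chunk_size becomes its own oversized chunk.
            current = sentence
    if current:
        chunks.append(current)
    return chunks


demo = ["Alpha beta.", "Gamma delta.", "Epsilon zeta eta."]
print(pack_sentences(demo, chunk_size=25))
```

With a limit of 25 characters, the first two sentences share a chunk and the third starts a new one, so no sentence is lost.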

In [44]:
import os

output_dir = '../chunks'
os.makedirs(output_dir, exist_ok=True)  # create the directory if it does not exist yet

for i, chunk in enumerate(cleaned_chunks):
    chunk_filename = os.path.join(output_dir, f"chunk_{i+1:03d}.txt")
    try:
        with open(chunk_filename, 'w', encoding='utf-8') as f:
            f.write(chunk)
        print(f"Saved chunk {i+1} to: {chunk_filename}")
    except Exception as e:
        print(f"Error saving chunk {i+1} to {chunk_filename}: {e}")

Saved chunk 1 to: ../chunks\chunk_001.txt
Saved chunk 2 to: ../chunks\chunk_002.txt
Saved chunk 3 to: ../chunks\chunk_003.txt
Saved chunk 4 to: ../chunks\chunk_004.txt
Saved chunk 5 to: ../chunks\chunk_005.txt
Saved chunk 6 to: ../chunks\chunk_006.txt
Saved chunk 7 to: ../chunks\chunk_007.txt
Saved chunk 8 to: ../chunks\chunk_008.txt
Saved chunk 9 to: ../chunks\chunk_009.txt
Saved chunk 10 to: ../chunks\chunk_010.txt
Saved chunk 11 to: ../chunks\chunk_011.txt
Saved chunk 12 to: ../chunks\chunk_012.txt
Saved chunk 13 to: ../chunks\chunk_013.txt
Saved chunk 14 to: ../chunks\chunk_014.txt
Saved chunk 15 to: ../chunks\chunk_015.txt
Saved chunk 16 to: ../chunks\chunk_016.txt
Saved chunk 17 to: ../chunks\chunk_017.txt
Saved chunk 18 to: ../chunks\chunk_018.txt
Saved chunk 19 to: ../chunks\chunk_019.txt
Saved chunk 20 to: ../chunks\chunk_020.txt
Saved chunk 21 to: ../chunks\chunk_021.txt
Saved chunk 22 to: ../chunks\chunk_022.txt
Saved chunk 23 to: ../chunks\chunk_023.txt
Saved chunk 24 to: .

#### Step 3: Creating embeddings for our chunks and storing them in a vector database

In [45]:
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document  # current import path; langchain.docstore is the legacy location
from langchain_huggingface import HuggingFaceEmbeddings

DB_PATH = '../vectordb'

documents = [Document(page_content=chunk) for chunk in cleaned_chunks]

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vector_db = FAISS.from_documents(documents, embeddings)
vector_db.save_local(DB_PATH)

In [46]:
vector_db.index_to_docstore_id

{0: '1477c90d-6876-47fa-9a87-b97a755e6b10',
 1: '4fd4edd7-57da-4e5d-a955-70cdb6b8c17f',
 2: 'c1f6fb4a-6a32-4800-96a4-ed717c23fb47',
 3: 'ba585523-e65d-4c09-90bd-924d981ce9cd',
 4: '10112fdc-20fc-4add-85c8-d8943b26b4ff',
 5: 'ffec3d49-f45a-4fad-9e4c-4683b392cb7c',
 6: 'cf9bd4bf-e699-45c7-a186-3cdcbad8944a',
 7: '0b1322ae-0955-4dca-bd70-a435fae56f47',
 8: 'a2a6382f-0a0c-4926-94bd-2740177e913d',
 9: 'e9033aec-93fa-4c01-970f-92acab53d2dc',
 10: '1f48254e-7bfb-475d-9dea-c0fb18912d20',
 11: 'cea5bf18-c224-491c-b20e-7cb9bb02a9bc',
 12: '3f60ffb2-d3f7-4b65-8378-c99549df48e1',
 13: '5de63523-803e-4f07-b5a9-c4e3852dd636',
 14: '8e187830-6bb8-4f84-9e5a-3478ccefb3bb',
 15: 'c79f7c64-008d-4444-a47d-57c9d9123ba1',
 16: 'da20e317-037f-4d10-a914-dbcef2003fd1',
 17: '7d571427-7002-4c4e-aa42-20a9df690cca',
 18: '166671f9-df13-4cc1-bbb7-899c49915129',
 19: '8eff7352-5461-4d77-84d4-596b6ceea39c',
 20: 'fae9ac78-c57c-4b98-990f-387f9fbb47c1',
 21: '0056fb17-43c8-4228-93b3-a58df7a484c4',
 22: 'a75dc03c-85ab-

In [48]:
vector_db.get_by_ids(['5de63523-803e-4f07-b5a9-c4e3852dd636'])

[Document(id='5de63523-803e-4f07-b5a9-c4e3852dd636', metadata={}, page_content='Neither the accuracy of vehicle information provided on eBay.com, nor the availability, quality, or safety of vehicles is guaranteed by eBay.Furthermore, neither the financing of or insurance relevant to vehicles is controlled or guaranteed b y eBay.')]

### Fine-Tuning and Evaluation [OPTIONAL]

#### Checking whether the retriever works

In [49]:
retriever = vector_db.as_retriever(search_type='similarity', search_kwargs={"k": 4})
retriever

VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000001A4DEA5C560>, search_kwargs={'k': 4})

In [50]:
retriever.invoke("What about fees?")

[Document(id='c79f7c64-008d-4444-a47d-57c9d9123ba1', metadata={}, page_content='6.Fees and Taxes We charge sellers for the use of our Services.In some cases, where buyers receive supplemental Services such as authentication or storage Services for items in certain categories, we may also charge those buyers for such supplemental Services.'),
 Document(id='079f9205-9653-49cb-8288-20d573c22787', metadata={}, page_content='The NAM Rules are cur rently available at https://www.namadr.com/resources/rules -fees-forms/ .'),
 Document(id='da20e317-037f-4d10-a914-dbcef2003fd1', metadata={}, page_content='You as a seller must have a payment method on file when using our selling Services and pay all fees and applicable taxes associated with your use of our Services by the payment due date.'),
 Document(id='a2a6382f-0a0c-4926-94bd-2740177e913d', metadata={}, page_content='Also, as provided below in the Fees and Taxes section, if we believe you are violating our policy on buying or selling outside 
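
To make the ranking concrete: a similarity retriever with `k=4` returns the stored vectors closest to the query embedding. Below is a minimal pure-Python sketch, assuming Euclidean (L2) distance, which to our understanding is the default for a store built with `FAISS.from_documents`. The toy 2-D vectors and the `top_k` helper are hypothetical stand-ins for real embeddings and for the FAISS index.

```python
import math


def top_k(query, vectors, k=4):
    """Return the indices of the k vectors nearest to query by Euclidean distance."""
    distances = [(math.dist(query, vector), index) for index, vector in enumerate(vectors)]
    # Sorting the (distance, index) pairs puts the nearest vectors first.
    return [index for _, index in sorted(distances)[:k]]


toy_vectors = [(0.0, 0.0), (1.0, 1.0), (0.9, 1.1), (5.0, 5.0), (1.2, 0.8)]
print(top_k((1.0, 1.0), toy_vectors, k=3))
```

A real index searches 384-dimensional MiniLM vectors instead of 2-D points, but the ranking principle is the same.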

#### Model Selection

Here we use Ollama to run the Mistral 7B Instruct model locally. This requires two setup steps: download and install Ollama, then pull the model with `ollama pull mistral:7b-instruct`. Both steps must complete successfully before the cells below will run.
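
A quick pre-flight check can save a confusing connection error later. The sketch below is an illustration under assumptions: `ollama_setup_hint` is a hypothetical helper, and it only checks whether the `ollama` executable is on `PATH`, not whether the server is running or the model is already pulled.

```python
import shutil

MODEL_TAG = "mistral:7b-instruct"  # same tag that is passed to ChatOllama


def ollama_setup_hint(model_tag: str = MODEL_TAG) -> str:
    """Return the next setup step, based on whether the ollama CLI is on PATH."""
    if shutil.which("ollama") is None:
        return "Install Ollama first (the `ollama` executable was not found on PATH)."
    # The CLI is installed, so the remaining step is pulling the model.
    return f"ollama pull {model_tag}"


print(ollama_setup_hint())
```

Run the printed command in a terminal, not in the notebook, unless the cell is prefixed with `!`.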

In [29]:
from langchain_ollama import ChatOllama

model = ChatOllama(
    model="mistral:7b-instruct",
    temperature=0.3
)

In [30]:
response = model.invoke("hii there")

In [31]:
response.content

" Hello! How can I help you today? Is there something specific you would like to know or discuss? I'm here to assist with any questions you might have."

#### Designing a basic template and defining the model

In [35]:
template = """
You are a helpful and honest assistant. Use the following context to answer the user's question. 
Only answer based on the context provided. If the answer is not found in the context, say "I don't know based on the given information."

{context}

Question: {question}

Answer:
"""

In [81]:
from langchain_core.prompts import PromptTemplate
from langchain_ollama.llms import OllamaLLM

prompt = PromptTemplate(
    template=template,
    input_variables=['context', 'question']
)

llm = OllamaLLM(model="mistral:7b-instruct", temperature=0.3)
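
The `{context}` and `{question}` placeholders in the template are filled in at query time. The following minimal sketch uses plain `str.format`, which behaves like `PromptTemplate`'s default f-string format for simple placeholders such as these. The shortened template and the two chunks are hypothetical stand-ins.

```python
# Abbreviated stand-in for the notebook's template (same two placeholders).
toy_template = "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

# Hypothetical retrieved chunks, joined with blank lines between them.
toy_chunks = ["Sellers pay fees.", "Taxes may apply."]

filled = toy_template.format(
    context="\n\n".join(toy_chunks),
    question="What about fees?",
)
print(filled)
```

Printing the filled prompt like this is a cheap way to check what the model actually sees before blaming it for a poor answer.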

##### **Type 1:** Manually running each step (i.e., without chaining)

In [74]:
question = "What kind of privacy policies are written here?"
retrieved_docs = retriever.invoke(question)

In [75]:
context_text = "\n\n".join(doc.page_content for doc in retrieved_docs)

In [76]:
final_prompt = prompt.invoke({"context": context_text, "question": question})

In [78]:
llm.invoke(final_prompt)

" Based on the provided context, there is no explicit mention of privacy policies in this User Agreement. The text primarily discusses terms related to content ownership, warranties, and liability for such content, as well as policy enforcement and disclaimers regarding the Services' operation. However, without more information about the broader document or platform from which these excerpts are taken, it is impossible to definitively state whether privacy policies are included elsewhere in the agreement."

##### **Type 2:** Chaining all the steps

In [59]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

In [None]:
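def format_docs(docs):
    # Join the retrieved chunks into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)

# The dict fans the input question out to both branches: the retriever
# fetches matching chunks for "context", while "question" passes through as-is.
qa_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    } | prompt | llm | StrOutputParser()
)

# Stream the answer piece by piece as the model generates it.
res = qa_chain.stream("what about the policies?")
for r in res:
    print(r, end="")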

 The provided context suggests that the platform enforces its policies flexibly, taking into account both the user's performance history and specific circumstances. However, it does not specify explicit details about how policies are enforced or provide a comprehensive list of policies. For more detailed information about the policies, you should refer to the platform's terms of service or contact their customer support.