### *I have attached a PDF file that is used in this notebook. Please go through it. I also recommend trying your own questions in "query" at the end of this notebook.*

Happy learning!

## 0. Import modules required

In [79]:
import re

In [80]:
!pip install langchain transformers sentence-transformers faiss-cpu PyMuPDF

import fitz  # PyMuPDF
from langchain.docstore.document import Document as LangchainDocument
from sentence_transformers import SentenceTransformer
import faiss  # vector store
import numpy as np
from transformers import pipeline



In [81]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
file_path = (
    "/content/drive/My Drive/Colab Notebooks/sample_transcript.pdf")

In [None]:
'''def preprocess_text(text, chunk_size=500):
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    return [LangchainDocument(page_content=chunk) for chunk in chunks]
    '''

# **1. Load Document**

In this step we load the document and then clean it:

1.   Loading the document
2.   Cleaning the document





## 1.1 Loading the document

In [82]:
def load_pdf(file_path):
    doc = fitz.open(file_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

In [83]:
text = load_pdf(file_path)

## 1.2 Cleaning data




The purpose of preprocessing is to preserve the information the text contains while stripping format-specific characters that carry none, such as extra whitespace, blank lines, specified substrings, regex matches, page headers, and footers. This is a very important step for getting quality answers from RAG.

In [84]:
print(text)  # text before preprocessing

MachineLearning-Lecture01  
Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine 
learning class. So what I wanna do today is just spend a little time going over the logistics 
of the class, and then we'll start to talk a bit about machine learning.  
By way of introduction, my name's Andrew Ng and I'll be instructor for this class. And so 
I personally work in machine learning, and I've worked on it for about 15 years now, and 
I actually think that machine learning is the most exciting field of all the computer 
sciences. So I'm actually always excited about teaching this class. Sometimes I actually 
think that machine learning is not only the most exciting thing in computer science, but 
the most exciting thing in all of human endeavor, so maybe a little bias there.  
I also want to introduce the TAs, who are all graduate students doing research in or 
related to the machine learning and all aspects of machine learning. Paul Baumstarck 
works in machine learning

In [85]:
text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)  # merge words hyphenated across line breaks
text = re.sub(r"(?<!\n\s)\n(?!\s\n)", " ", text.strip())  # join lines broken mid-sentence
text = re.sub(r"\n\s*\n", "\n\n", text)  # collapse runs of blank lines
text = re.sub(r"\s+", " ", text)  # collapse all remaining whitespace (including newlines) into single spaces

In [86]:
print(text) #Text after pre-processing

MachineLearning-Lecture01 Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine learning class. So what I wanna do today is just spend a little time going over the logistics of the class, and then we'll start to talk a bit about machine learning. By way of introduction, my name's Andrew Ng and I'll be instructor for this class. And so I personally work in machine learning, and I've worked on it for about 15 years now, and I actually think that machine learning is the most exciting field of all the computer sciences. So I'm actually always excited about teaching this class. Sometimes I actually think that machine learning is not only the most exciting thing in computer science, but the most exciting thing in all of human endeavor, so maybe a little bias there. I also want to introduce the TAs, who are all graduate students doing research in or related to the machine learning and all aspects of machine learning. Paul Baumstarck works in machine learning and computer v
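
To see what the four substitutions do, here is a minimal sketch that wraps them in a function and runs them on a toy string. The function name `clean_text` and the sample text are made up for illustration; the regexes are the ones used in this notebook.

```python
import re

def clean_text(text):
    """Apply the notebook's four cleaning substitutions in order."""
    text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)            # merge hyphenated words
    text = re.sub(r"(?<!\n\s)\n(?!\s\n)", " ", text.strip())  # join lines broken mid-sentence
    text = re.sub(r"\n\s*\n", "\n\n", text)                   # collapse runs of blank lines
    text = re.sub(r"\s+", " ", text)                          # collapse all whitespace
    return text

# Toy input (hypothetical): a hyphenated word, a wrapped sentence, extra blank lines.
raw = "Machine learn-\ning is fun.  \nIt is also\nuseful.\n\n\nNext paragraph."

print(clean_text(raw))
# Machine learning is fun. It is also useful. Next paragraph.
```

Note that the last substitution also flattens paragraph breaks, so the cleaned text becomes one long line. That is fine here because the chunks are cut by character count, not by paragraph.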

# **2. Split text into chunks**

In [87]:
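def preprocess_text_overlap(text, chunk_size=500, overlap=100):
    # Guard: with overlap >= chunk_size the window would never move forward.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        start += chunk_size - overlap  # step forward, keeping `overlap` characters of shared context
    return [LangchainDocument(page_content=chunk) for chunk in chunks]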

In [88]:
documents = preprocess_text_overlap(text)
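
The overlap arithmetic is easy to check by hand: each chunk starts `chunk_size - overlap` characters after the previous one, so consecutive chunks share `overlap` characters. The sketch below computes only the (start, end) offsets, without LangChain; the helper name `chunk_spans` is hypothetical and the text length of 1200 is an arbitrary example.

```python
def chunk_spans(text_len, chunk_size=500, overlap=100):
    """Return the (start, end) character offsets of each sliding-window chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # how far the window moves each step
    return [(start, min(start + chunk_size, text_len))
            for start in range(0, text_len, stride)]

spans = chunk_spans(1200)
print(spans)  # [(0, 500), (400, 900), (800, 1200)]
```

With the defaults the stride is 400, so a text of 1200 characters yields three chunks, and characters 400 to 500 appear in both the first and the second chunk.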

In [89]:
documents

[Document(metadata={}, page_content="MachineLearning-Lecture01 Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine learning class. So what I wanna do today is just spend a little time going over the logistics of the class, and then we'll start to talk a bit about machine learning. By way of introduction, my name's Andrew Ng and I'll be instructor for this class. And so I personally work in machine learning, and I've worked on it for about 15 years now, and I actually think that machine learning is the most exc"),
 Document(metadata={}, page_content="I've worked on it for about 15 years now, and I actually think that machine learning is the most exciting field of all the computer sciences. So I'm actually always excited about teaching this class. Sometimes I actually think that machine learning is not only the most exciting thing in computer science, but the most exciting thing in all of human endeavor, so maybe a little bias there. I also want to introduce the TAs

# **3. Use Embedding model**

This step converts the text chunks into embeddings. Think of embeddings as vectors: lists of numbers in which texts with similar meaning end up close together.

1.   Model selection
2.   Convert text to embeddings
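
As a rough intuition for "embeddings as vectors", the sketch below compares three made-up 3-dimensional vectors with cosine similarity. The vectors and their labels are hypothetical, not model output. Cosine similarity is used here only as a familiar closeness measure; the index built later in this notebook compares vectors by L2 distance instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real models produce vectors with hundreds of dimensions.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

# Related texts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
```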




## 3.1 Model Selection
Here we use the "all-MiniLM-L6-v2" model from Hugging Face. More info: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

In [90]:
model = SentenceTransformer('all-MiniLM-L6-v2') #  model



## 3.2 Convert text to embeddings

In [91]:
embeddings = model.encode([doc.page_content for doc in documents])

# **4. Store embeddings into Vector store**

Here we use FAISS (Facebook AI Similarity Search) as the vector store. FAISS is a library for efficient similarity search and clustering of dense vectors.

In [92]:
index = faiss.IndexFlatL2(embeddings.shape[1]) #creating index from our document embeddings
index.add(np.array(embeddings))
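
Conceptually, a flat L2 index compares the query against every stored vector and keeps the closest ones. The pure-Python sketch below mimics that behaviour on toy data so the idea can be seen without FAISS; the function names and vectors are hypothetical, and this is an illustration of the idea rather than how FAISS is implemented internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(vectors, query, top_k):
    """Brute-force nearest-neighbour search: score every vector, keep the top_k closest."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:top_k]
    return [d for d, _ in top], [i for _, i in top]

# Toy 2-dimensional vectors standing in for document embeddings.
vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = flat_l2_search(vectors, [1.0, 1.0], top_k=2)

print(indices)    # [1, 0]
print(distances)  # [1.0, 2.0]
```

Smaller distances mean closer matches, which is why the retrieval step later takes the first `top_k` indices returned by the index.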

# **5. Creating a pipeline for retrieval**

Here we use "distilbert-base-uncased-distilled-squad". More info: https://huggingface.co/distilbert/distilbert-base-uncased-distilled-squad

This is an extractive question-answering model rather than a generative LLM: given the query and the retrieved context, it selects the span of the context that most likely contains the answer.

In [93]:
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")  # extractive QA model

# **6. Retrieval**

In [94]:
def search(query, top_k=5):
    query_embedding = model.encode([query])  # encode the query with the same embedding model
    # Search the FAISS index; it returns the distances and indices of the top_k nearest vectors.
    distances, indices = index.search(np.array(query_embedding), top_k)
    # FAISS fills missing results with -1 when the index holds fewer than top_k vectors, so skip those.
    return [documents[i] for i in indices[0] if i != -1]

In [95]:
def answer_query(query):
    relevant_docs = search(query)
    context = " ".join([doc.page_content for doc in relevant_docs])
    result = qa_pipeline(question=query, context=context)
    return result['answer']

# **7. User query**


1.   query: user question
2.   response: generated by model using RAG


I highly recommend trying different questions in `query`.

In [97]:
query = "For how many years Andrew worked in ML"
response = answer_query(query) #answer_query generated response
response


'15 years'

In [103]:
query = "For how many years Andrew worked in ML"
response = search(query)  # returns the chunks in which the answer can be found
response


[Document(metadata={}, page_content="MachineLearning-Lecture01 Instructor (Andrew Ng): Okay. Good morning. Welcome to CS229, the machine learning class. So what I wanna do today is just spend a little time going over the logistics of the class, and then we'll start to talk a bit about machine learning. By way of introduction, my name's Andrew Ng and I'll be instructor for this class. And so I personally work in machine learning, and I've worked on it for about 15 years now, and I actually think that machine learning is the most exc"),
 Document(metadata={}, page_content="bout maybe 15 years, applying learning algorithms to them can turn raw medical records into what I might loosely call medical knowledge in which we start to detect trends in medical practice and even start to alter medical practice as a result of medical knowledge that's derived by applying learning algorithms to the sorts of medical records that hospitals have just been building over the last 15, 20 years in an electr

In [99]:
query = "Who are the TA's for the course"
response = answer_query(query)
response


'Zico Kolter'

To ask several queries at once:

In [100]:
def answer_queries(queries):
    # Reuse answer_query so the retrieval and QA logic lives in one place.
    return [answer_query(query) for query in queries]

In [101]:
queries = ["For how many years Andrew worked in ML", "Who are the TA's for the course", "How many students registered for course?"]
responses = answer_queries(queries)

for i, response in enumerate(responses):
    print(f"Response to query {i+1}: {response}")

Response to query 1: 15 years
Response to query 2: Zico Kolter
Response to query 3: up to three people


In [102]:
query = "What are the comments about Netflix? Answer only if you are confident about the answer"
response = answer_query(query)
response

'recommend books for you to buy or movies for you to rent or whatever'
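
The query above asks the model to answer only if it is confident, yet the question-answering pipeline still returns a span: it treats the instruction as part of the question. One possible way to abstain, sketched here under assumptions, is to threshold the `score` value that the Hugging Face question-answering pipeline returns alongside `answer`. The helper name `answer_or_abstain`, the threshold of 0.3, the fallback text, and the example dicts are all hypothetical; a useful threshold would have to be tuned on real queries.

```python
def answer_or_abstain(result, min_score=0.3, fallback="I don't know"):
    """Return the extracted answer only if its score clears the threshold.

    `result` is assumed to be a dict with "answer" and "score" keys, the shape
    returned by a question-answering pipeline call.
    """
    if result["score"] >= min_score:
        return result["answer"]
    return fallback

# Made-up results for illustration, not real model output.
confident = {"answer": "15 years", "score": 0.92}
unsure = {"answer": "whatever", "score": 0.04}

print(answer_or_abstain(confident))  # 15 years
print(answer_or_abstain(unsure))     # I don't know
```

In this notebook the helper would be applied to the dict returned by `qa_pipeline(question=query, context=context)` before taking `result['answer']`.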