# RAG with LangChain, Ollama, and FAISS Vector Store

## PDF Dataset
https://github.com/aydiegithub/rag-system-ollama/tree/dac3b4563a66b8b11962aaa08349ba5138be396a/rag-dataset-main

![Document Ingestion](flowcharts/Flowchart.png)

In [1]:
# pip install -U langchain-community faiss-cpu langchain-huggingface pymupdf tiktoken langchain-ollama python-dotenv

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
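import os
from dotenv import load_dotenv

# Workaround for the "duplicate OpenMP runtime" error that can occur when
# several libraries (for example FAISS and PyTorch) each bundle their own copy.
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
# Load environment variables from a local .env file; returns True if one was found.
load_dotenv()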

True

## Document loader

In [4]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("rag-dataset-main/machine-learning/MACHINE LEARNING(R17A0534).pdf")

docs = loader.load()

In [5]:
doc = docs[10]

In [6]:
print(doc.page_content)

6 
 
covers less distance physically than by train because a plane is unrestricted. Similarly, in chess, the 
concept of distance depends on the piece used – for example, a Bishop can move diagonally.   Thus, 
depending on the entity and the mode of travel, the concept of distance can be experienced differently. 
The distance metrics commonly used are Euclidean, Minkowski, Manhattan, and Mahalanobis. 
 
 
Distance is applied through the concept of neighbours and exemplars. Neighbours are points in 
proximity with respect to the distance measure expressed through exemplars. Exemplars are 
either centroids that ﬁnd a centre of mass according to a chosen distance metric or medoids that ﬁnd 
the most centrally located data point. The most commonly used centroid is the arithmetic mean, which 
minimises squared Euclidean distance to all other points. 
 
Notes: 
 
The centroid represents the geometric centre of a plane figure, i.e., the arithmetic mean 
position of all the points in the figu

## Load all the PDFs

In [7]:
import os

pdfs = []

for root, dirs, files in os.walk('rag-dataset-main'):
    # print(root, dirs, files)
    for file in files:
        if file.endswith('.pdf'):
            pdfs.append(os.path.join(root, file))

pdfs

['rag-dataset-main/machine-learning/CSIT_(R22)_3-2_MACHINE LEARNING_DIGITAL NOTES.pdf',
 'rag-dataset-main/machine-learning/NotesOnMachineLearningForBTech-1.pdf',
 'rag-dataset-main/machine-learning/ML_notes_22.pdf',
 'rag-dataset-main/machine-learning/COS324_Course_Notes.pdf',
 'rag-dataset-main/machine-learning/2505.03861v1.pdf',
 'rag-dataset-main/machine-learning/MACHINE LEARNING(R17A0534).pdf']
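The same discovery step can also be sketched with `pathlib`. This is an alternative to the `os.walk` loop above, not the notebook's own method; the helper name `find_pdfs` is hypothetical. Unlike the original check, it matches the extension case-insensitively and returns the paths in sorted order.

```python
from pathlib import Path


def find_pdfs(root):
    """Return the sorted paths of all PDF files below `root`.

    Hypothetical helper: walks the tree recursively and matches the
    ".pdf" suffix case-insensitively (so "X.PDF" is included as well).
    """
    return sorted(
        str(path)
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Calling `find_pdfs('rag-dataset-main')` should yield the same files as the loop above, though possibly in a different order.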

In [8]:
docs = []

for pdf in pdfs:
    loader = PyMuPDFLoader(pdf)
    pages = loader.load()
    
    docs.extend(pages)

In [9]:
len(docs)  # total number of pages loaded across all PDFs

1060
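To see how those pages are distributed over the individual files, the pages can be grouped by their source path. The sketch below assumes each loaded document carries a `metadata` dict with a `"source"` key, as `PyMuPDFLoader` normally provides; the helper name `pages_per_source` is hypothetical.

```python
from collections import Counter


def pages_per_source(documents):
    """Count pages per source file.

    Assumes every document has a `metadata` dict; pages without a
    "source" entry are grouped under "<unknown>".
    """
    return Counter(doc.metadata.get("source", "<unknown>") for doc in documents)
```

In the notebook, `pages_per_source(docs).most_common()` would list the PDFs from the largest to the smallest page count.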

## Document Chunking

In [10]:
# pip install -qU langchain-text-splitters

In [23]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default, not tokens.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

chunks = text_splitter.split_documents(docs)

In [24]:
len(docs), len(chunks)

(1060, 2440)
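To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified fixed-width sliding window. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which first splits on separators such as paragraph and line breaks and then merges the pieces up to `chunk_size`; the function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters. Simplified
    illustration only: it ignores paragraph, line, and word boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last character of one chunk is repeated as the first character of the next in this example.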

In [30]:
print(chunks[100].page_content)

CSIT DEPT-R22-MACHINE LEARNING 
48 
 
The multi-layer perceptron model is also known as the Backpropagation algorithm, which 
executes in two stages as follows: 
 
Forward Stage: Activation functions start from the input layer in the forward stage 
and terminate on the output layer. 
 
Backward Stage: In the backward stage, weight and bias values are modified as per 
the model's requirement. In this stage, the error between actual output and demanded 
originated backward on the output layer and ended on the input layer. 
Hence, a multi-layered perceptron model has considered as multiple artificial neural networks 
having various layers in which activation function does not remain linear, similar to a single 
layer perceptron model. Instead of linear, activation function can be executed as sigmoid, 
TanH, ReLU, etc., for deployment. 
A multi-layer perceptron model has greater processing power and can process linear and non-


In [33]:
len(chunks[100].page_content)

938

In [34]:
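import tiktoken

# Tokenizer matching OpenAI's gpt-4o-mini. Counts from it are only an
# approximation for models that use a different tokenizer.
encoding = tiktoken.encoding_for_model("gpt-4o-mini")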

In [39]:
len(encoding.encode(chunks[2].page_content))

164

In [40]:
len(encoding.encode(docs[2].page_content))

314
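The two spot checks above look at a single chunk and a single page. To summarise token counts over a whole collection, a small helper along these lines could be used. It is a sketch: `token_stats` is a hypothetical name, and the whitespace-based counter in the example is only a stand-in so the block runs without `tiktoken`.

```python
from statistics import mean


def token_stats(texts, count_tokens):
    """Summarise token counts over a non-empty collection of strings.

    `count_tokens` is any callable mapping a string to an integer count.
    """
    counts = [count_tokens(text) for text in texts]
    if not counts:
        raise ValueError("texts must not be empty")
    return {
        "min": min(counts),
        "max": max(counts),
        "mean": mean(counts),
        "total": sum(counts),
    }


# Stand-in counter: whitespace-separated words instead of real tokens.
print(token_stats(["a b c", "d e"], lambda text: len(text.split())))
# {'min': 2, 'max': 3, 'mean': 2.5, 'total': 5}
```

With the encoder defined above, the call would look like `token_stats([c.page_content for c in chunks], lambda t: len(encoding.encode(t)))`, which makes it easy to check whether the largest chunk stays within an embedding model's input limit.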

## Document Vector Embedding