In [1]:
!pip install PyMuPDF
!pip install sentence-transformers
!pip install faiss-cpu

Collecting PyMuPDF
  Downloading pymupdf-1.25.1-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.1-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.25.1
Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.9.0.post1


In [5]:
import fitz  # PyMuPDF
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Load PDF and extract text
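def extract_text_from_pdf(pdf_path):
    # Open the document in a context manager so the file handle is released
    with fitz.open(pdf_path) as doc:
        # Join once instead of growing a string page by page
        return "".join(page.get_text() for page in doc)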

# Chunk the text
def chunk_text(text, chunk_size=512):
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Convert chunks to embeddings
def get_embeddings(chunks):
    return model.encode(chunks)

# Store embeddings in FAISS
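def store_embeddings(embeddings):
    # FAISS expects a contiguous float32 array of shape (n_vectors, dimension)
    embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)  # exact (brute-force) L2 search
    index.add(embeddings)
    return index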

# Example usage
pdf_path = '/content/drive/MyDrive/Colab Notebooks/material1.pdf'
text = extract_text_from_pdf(pdf_path)
chunks = chunk_text(text)
embeddings = get_embeddings(chunks)
index = store_embeddings(np.array(embeddings))
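
Note on chunking: `chunk_text` above slices every 512 characters, so a chunk can start or end in the middle of a word. A minimal sketch of one alternative is shown below: split on whitespace and let neighbouring chunks share a few words. The name `chunk_words` and its default sizes are assumptions made for illustration, not part of the original notebook.

```python
def chunk_words(text, chunk_size=100, overlap=20):
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words, so a sentence cut at a
    chunk boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

It could be swapped in with `chunks = chunk_words(text)`. Note that `text.split()` discards the original line breaks, which may or may not be what you want for PDF text.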

In [12]:
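def query_embeddings(query, model, index, k=5):
    # FAISS expects a float32 array of shape (n_queries, dimension)
    query_embedding = np.asarray(model.encode([query]), dtype=np.float32)
    # distances are squared L2 values; indices are padded with -1
    # when the index holds fewer than k vectors
    distances, indices = index.search(query_embedding, k)
    return indices  # row 0 holds the positions of the top-k chunks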

# Example usage
query = "summarize types of visualization of data"
retrieved_indices = query_embeddings(query, model, index)
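# Skip -1 entries: FAISS uses them as padding, and chunks[-1] would silently return the last chunk
retrieved_chunks = [chunks[i] for i in retrieved_indices[0] if i >= 0]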
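
Note on distances: the search also returns a distance for every hit, which the cell above throws away. The sketch below pairs each retrieved chunk with its distance and drops the `-1` padding. `pair_results` is a hypothetical helper, and it assumes you have the distance row and the index row for one query at hand (for example `D[0]` and `I[0]` from `index.search`).

```python
def pair_results(distances, indices, chunks):
    """Return (distance, chunk) pairs for one query, in the order given.

    `distances` and `indices` are the rows for a single query.
    Entries with a negative index are padding and are skipped.
    """
    pairs = []
    for dist, idx in zip(distances, indices):
        idx = int(idx)
        if idx < 0:  # FAISS pads with -1 when fewer than k vectors exist
            continue
        pairs.append((float(dist), chunks[idx]))
    return pairs
```

With `IndexFlatL2` a smaller distance means a closer match, so the first pair is the best hit.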

In [13]:
def compare_chunks(chunks, terms):
    # Placeholder comparison: group the chunks by the terms they contain.
    # Matching is an exact, case-sensitive substring test.
    return {term: [chunk for chunk in chunks if term in chunk] for term in terms}

# Example usage
comparison_terms = ["Bachelor's Degree", "Master's Degree"]
comparison_results = compare_chunks(retrieved_chunks, comparison_terms)
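
Note on matching: `compare_chunks` only matches a term when the capitalisation is identical, so "Pie Chart" would miss "pie chart". A minimal sketch of a case-insensitive variant follows. It returns chunk positions rather than the chunk text, and the name `find_term_hits` is an assumption made for illustration.

```python
def find_term_hits(chunks, terms):
    """Map each term to the positions of the chunks that contain it.

    Both sides are case-folded, so the match ignores capitalisation.
    """
    lowered = [chunk.casefold() for chunk in chunks]
    return {
        term: [i for i, chunk in enumerate(lowered) if term.casefold() in chunk]
        for term in terms
    }
```

For example, `find_term_hits(retrieved_chunks, comparison_terms)` gives, per term, the positions within `retrieved_chunks` that mention it. A term with no hits maps to an empty list.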

In [14]:
def generate_response(retrieved_chunks, query):
    # Placeholder for LLM response generation
    response = f"Based on your query '{query}', here are the relevant details:\n"
    for chunk in retrieved_chunks:
        response += f"- {chunk}\n"
    return response

# Example usage
response = generate_response(retrieved_chunks, query)
print(response)

Based on your query 'summarize types of visualization of data', here are the relevant details:
- Tables, Charts, and 
Graphs 
with Examples from History, Economics, 
Education, Psychology, Urban Affairs and 
Everyday Life
REVISED: MICHAEL LOLKUS 2018
Tables, Charts, and 
Graphs Basics
We use charts and graphs to visualize data.  
This data can either be generated data, data gathered from 
an experiment, or data collected from some source.
A picture tells a thousand words so it is not a surprise that 
many people use charts and graphs when explaining data.
Types of Visual 
Representations of Data
Tab
- ng different groups of 
variables.  We used it to compare different components 
of US GDP.  We did the same with the pie chart; 
depending on your purposes you may choose to use a 
pie chart or a bar graph.
x
y
0
0
1
3
2
6
3
9
4
12
5
15
6
18
7
21
8
24
•
If given a table of data, we should be able to plot it.  Below is 
some sample data; plot the data with x on the x-axis and y on the 
