In [1]:
from pathlib import Path
import pandas as pd
import pypdf 
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity



In [2]:
model = SentenceTransformer('all-MiniLM-L6-v2')

In [3]:
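def extract_pdf_pages(pdf_path: str) -> list[str]:
    """Return the extracted text of each page, in page order."""
    pdf_reader = pypdf.PdfReader(pdf_path)
    # Default extraction does no column detection, so multi-column pages
    # may not come out in logical reading order.
    pages = [page.extract_text() for page in pdf_reader.pages]
    return pages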

In [4]:
pdf_structured = "1colExample.pdf"
pdf_unstructured = "2colExample.pdf"

In [5]:
pagest_structured = extract_pdf_pages(pdf_structured)
pagest_unstructured = extract_pdf_pages(pdf_unstructured)
print(len(pagest_structured), len(pagest_unstructured))

4 3


In [6]:
df_structured = pd.DataFrame({
    "page":range(1, len(pagest_structured)+1),
    "text": pagest_structured
})

df_unstructured = pd.DataFrame({
    "page":range(1, len(pagest_unstructured)+1),
    "text": pagest_unstructured
})

In [7]:
df_structured.head()

|   | page | text |
|---|------|------|
| 0 | 1 | 2 col example\nIntroduction\nBiomarker plays a... |
| 1 | 2 | 4. Biopsy if suspicious characteristics are ob... |
| 2 | 3 | feature selection classification [12]. Feature... |
| 3 | 4 | The second and third steps are repeated until ... |


In [8]:
df_unstructured.head()

|   | page | text |
|---|------|------|
| 0 | 1 | (1 col example)\nIntroduction\nBiomarker plays... |
| 1 | 2 | the redundancy of data and increase the ac-\nc... |
| 2 | 3 | chine, it uses several support vector machines... |


In [9]:
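def embed_chunks(text_list, model):
    """Embed each chunk and return (chunk embeddings, mean document embedding)."""
    # normalize_embeddings=True scales every chunk vector to unit length
    chunk_embeddings = model.encode(text_list, normalize_embeddings=True)
    # document-level vector: mean of the chunk vectors (not re-normalized)
    doc_embeddings = chunk_embeddings.mean(axis=0)
    return chunk_embeddings, doc_embeddings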

In [10]:
emb_struct_pages, emb_struct_doc = embed_chunks(pagest_structured, model)
emb_unstruct_pages, emb_unstruct_doc = embed_chunks(pagest_unstructured, model)


In [11]:
df_structured["embedding"] = [list(e) for e in emb_struct_pages]
df_unstructured["embedding"] = [list(e) for e in emb_unstruct_pages]

In [12]:
df_structured.head()

|   | page | text | embedding |
|---|------|------|-----------|
| 0 | 1 | 2 col example\nIntroduction\nBiomarker plays a... | [-0.016911913, 0.028695906, 0.0052786754, -0.0... |
| 1 | 2 | 4. Biopsy if suspicious characteristics are ob... | [0.012842059, 0.024613895, 0.010548779, -0.013... |
| 2 | 3 | feature selection classification [12]. Feature... | [-0.012905651, -0.00821479, -0.018207353, -0.0... |
| 3 | 4 | The second and third steps are repeated until ... | [0.034310393, -0.02366298, 0.02857434, 0.01840... |


In [13]:
df_unstructured.head()

|   | page | text | embedding |
|---|------|------|-----------|
| 0 | 1 | (1 col example)\nIntroduction\nBiomarker plays... | [-0.017610889, 0.0147622125, -0.007204978, -0.... |
| 1 | 2 | the redundancy of data and increase the ac-\nc... | [-0.041768815, 0.0014158036, -0.024878249, -0.... |
| 2 | 3 | chine, it uses several support vector machines... | [-0.12998605, -0.018148558, 0.0036243915, 0.00... |


In [14]:
# compare page-level embeddings between the two documents
sim_matrix = cosine_similarity(emb_struct_pages, emb_unstruct_pages)

sim_differences = pd.DataFrame(
    sim_matrix, 
    index=[f"Structured_Page_{i+1}" for i in range(len(pagest_structured))],
    columns=[f"Unstructured_Page_{i+1}" for i in range(len(pagest_unstructured))]   
)

sim_differences

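|                   | Unstructured_Page_1 | Unstructured_Page_2 | Unstructured_Page_3 |
|-------------------|---------------------|---------------------|---------------------|
| Structured_Page_1 | 0.920607 | 0.680938 | 0.397384 |
| Structured_Page_2 | 0.680696 | 0.744178 | 0.423352 |
| Structured_Page_3 | 0.665628 | 0.754512 | 0.572773 |
| Structured_Page_4 | 0.596418 | 0.611709 | 0.631951 |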


# PDF Context Extraction and Embedding Analysis

## Overview
This notebook demonstrates a complete pipeline for extracting text from PDF documents, generating semantic embeddings, and comparing document similarity. It processes both structured (single-column) and unstructured (multi-column) PDF layouts.

## Components and Workflow

### 1. Libraries and Dependencies
- **pypdf**: PDF text extraction
- **sentence_transformers**: Generate semantic embeddings using pre-trained models
- **pandas**: Data organization and analysis
- **sklearn**: Cosine similarity calculations

### 2. Embedding Model
Uses the `all-MiniLM-L6-v2` model:
- Lightweight transformer model (about 22M parameters)
- Optimized for semantic similarity tasks
- Generates 384-dimensional embeddings
- Small enough for fast inference, including on CPU

### 3. Key Functions

#### `extract_pdf_pages(pdf_path)`
- Extracts raw text from each PDF page
- Returns list of page texts maintaining original order
- Accepts both single- and multi-column PDFs, but performs no column detection, so reading order in multi-column layouts is not guaranteed

#### `embed_chunks(text_list, model)`
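- Generates one embedding per text chunk (here, one per page)
- Normalizes each embedding to unit length (`normalize_embeddings=True`), so cosine similarity reduces to a dot product
- Computes a document-level embedding by averaging the page embeddings; the average is not re-normalized, so its length is generally below 1
- Returns both the page-level and document-level representations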
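
The averaging step can be illustrated with a minimal sketch. The helper name `mean_pool_normalized` and the toy 2-dimensional vectors are assumptions for illustration only; the notebook's `embed_chunks` returns the plain mean without rescaling.

```python
import numpy as np

def mean_pool_normalized(page_embeddings: np.ndarray) -> np.ndarray:
    """Average page embeddings and rescale the result to unit length."""
    doc_embedding = page_embeddings.mean(axis=0)
    norm = np.linalg.norm(doc_embedding)
    # Guard against a zero vector (page vectors that cancel out exactly).
    return doc_embedding / norm if norm > 0 else doc_embedding

# Two toy unit vectors standing in for page embeddings.
pages = np.array([[1.0, 0.0], [0.0, 1.0]])
raw_mean = pages.mean(axis=0)
doc = mean_pool_normalized(pages)

print(np.linalg.norm(raw_mean))  # shorter than 1: the plain mean is not unit length
print(np.linalg.norm(doc))       # rescaled back to unit length
```

Rescaling only matters when the document-level vector is compared with a raw dot product; `cosine_similarity` from scikit-learn normalizes its inputs itself.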

### 4. Processing Pipeline

#### Step 1: PDF Extraction
Extracts text from two example PDFs:
- `1colExample.pdf`: Structured single-column layout
- `2colExample.pdf`: Unstructured multi-column layout

#### Step 2: Data Organization
Creates DataFrames with:
- Page numbers for reference
- Extracted text content
- Generated embeddings for each page (added once Step 3 has run)

#### Step 3: Embedding Generation
- Each page gets a 384-dimensional vector representation
- Embeddings capture semantic meaning of text
- Normalized for consistent similarity scores

#### Step 4: Similarity Analysis
Computes cosine similarity matrix between:
- All pages from structured PDF (rows)
- All pages from unstructured PDF (columns)
- Values range from -1 (opposite direction) to 1 (same direction); all scores in this run fall between 0.40 and 0.92
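
As a rough sketch of what the similarity step computes, the matrix can be written directly in NumPy. The function name `cosine_similarity_matrix` and the toy vectors are illustrative assumptions; the notebook itself calls `sklearn.metrics.pairwise.cosine_similarity`.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    # Entry (i, j) is the cosine of the angle between a[i] and b[j].
    return a_unit @ b_unit.T

# Toy vectors: 2 "pages" on one side, 3 on the other.
a = np.array([[1.0, 0.0], [1.0, 1.0]])
b = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
sim = cosine_similarity_matrix(a, b)

print(sim.shape)  # one row per page in a, one column per page in b
print(sim[0])     # same direction, orthogonal, opposite direction
```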

## Results Interpretation

### Similarity Matrix
The bands below are rough rules of thumb for this model, not calibrated thresholds:

- **High values (>0.7)**: Strong semantic similarity between pages
- **Medium values (0.3-0.7)**: Moderate topical overlap
- **Low values (<0.3)**: Different topics or content
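
The bands above can be applied mechanically with a small helper. This is a sketch only: `similarity_band` is a hypothetical name, and the cut-offs are the ones listed above, exposed as parameters so they can be tuned.

```python
def similarity_band(score: float, high: float = 0.7, low: float = 0.3) -> str:
    """Map a cosine similarity score to a coarse band label."""
    if score > high:
        return "high"
    if score >= low:
        return "medium"
    return "low"

# Three scores taken from the similarity matrix in this notebook.
for score in (0.920607, 0.631951, 0.397384):
    print(f"{score:.3f} -> {similarity_band(score)}")
```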

### Use Cases
1. **Content Deduplication**: Identify similar pages across documents
2. **Document Clustering**: Group related pages/documents
3. **Semantic Search**: Find relevant pages based on query
4. **Information Retrieval**: Build knowledge bases with semantic indexing
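
For the semantic search use case, a minimal sketch of top-k retrieval over page embeddings might look like this. The function `top_k_pages` and the toy vectors are assumptions; in the notebook the query vector would come from `model.encode(query, normalize_embeddings=True)`.

```python
import numpy as np

def top_k_pages(query_embedding: np.ndarray, page_embeddings: np.ndarray, k: int = 2):
    """Return (page_number, score) pairs for the k most similar pages."""
    # With unit-length vectors the dot product equals cosine similarity.
    scores = page_embeddings @ query_embedding
    order = np.argsort(scores)[::-1][:k]
    return [(int(i) + 1, float(scores[i])) for i in order]

# Toy unit vectors standing in for three page embeddings and one query.
pages = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query = np.array([0.0, 1.0])

results = top_k_pages(query, pages, k=2)
print(results)  # page 3 ranks first, then page 2
```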

## Applications in RAG Systems
This pipeline forms the foundation for:
- **Document Chunking**: Splitting documents into semantic units
- **Vector Databases**: Storing embeddings for fast retrieval
- **Context Selection**: Finding relevant chunks for LLM prompts
- **Quality Assessment**: Comparing extraction quality between formats

## Performance Considerations
- Page-level processing allows granular retrieval
- Document-level embeddings enable fast initial filtering
- Normalized embeddings ensure consistent similarity scores
- Lightweight model balances speed and accuracy
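
The idea of filtering by document first and by page second can be sketched as a two-stage search. Everything here is illustrative: `two_stage_search`, `unit`, the file names, and the toy vectors are assumptions, not part of the notebook.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def two_stage_search(query, docs, top_docs=1, top_pages=2):
    """docs maps a document name to a (doc_embedding, page_embeddings) pair."""
    # Stage 1: rank whole documents by their averaged embedding.
    ranked = sorted(docs, key=lambda name: float(docs[name][0] @ query), reverse=True)
    hits = []
    # Stage 2: score individual pages only inside the shortlisted documents.
    for name in ranked[:top_docs]:
        page_scores = docs[name][1] @ query
        for idx in np.argsort(page_scores)[::-1][:top_pages]:
            hits.append((name, int(idx) + 1, float(page_scores[idx])))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)

pages_a = np.array([unit([1, 0]), unit([1, 1])])
pages_b = np.array([unit([0, 1]), unit([-1, 1])])
docs = {
    "a.pdf": (unit(pages_a.mean(axis=0)), pages_a),
    "b.pdf": (unit(pages_b.mean(axis=0)), pages_b),
}

hits = two_stage_search(unit([1, 0.2]), docs)
print(hits)  # only pages from the best-matching document are scored
```

The trade-off is recall: a relevant page inside a document whose average embedding scores poorly is never examined, so `top_docs` should be chosen generously.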

## Impact of Document Structure on Knowledge Extraction

### Observed Similarity Patterns
The similarity matrix shows how page content lines up between the two extractions. Selected pairs:

| Page Comparison | Similarity Score | Interpretation |
|-----------------|------------------|----------------|
| Struct_1 vs Unstruct_1 | 0.921 | Very high - likely same content |
| Struct_1 vs Unstruct_2 | 0.681 | Moderate - partial content overlap |
| Struct_1 vs Unstruct_3 | 0.397 | Low - different content sections |
| Struct_2 vs Unstruct_2 | 0.744 | High - good content alignment |
| Struct_3 vs Unstruct_2 | 0.755 | High - content properly matched |
| Struct_4 vs Unstruct_3 | 0.632 | Moderate - partial alignment |
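
The best match for each structured page can be read off the matrix programmatically. The sketch below hard-codes the values reported in the notebook output; in the notebook itself the same result is available from `sim_matrix.argmax(axis=1)`.

```python
import numpy as np

# Values copied from the similarity matrix in the notebook output.
sim = np.array([
    [0.920607, 0.680938, 0.397384],
    [0.680696, 0.744178, 0.423352],
    [0.665628, 0.754512, 0.572773],
    [0.596418, 0.611709, 0.631951],
])

# For each structured page (row), find the most similar unstructured page (column).
best_cols = sim.argmax(axis=1)
for row, col in enumerate(best_cols):
    print(f"Structured_Page_{row + 1} -> Unstructured_Page_{col + 1} ({sim[row, col]:.3f})")
```

Structured pages 2 and 3 both map to unstructured page 2, which is what a 4-page document aligned against a 3-page one would be expected to show.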

### Key Issues with Multi-Column Extraction

#### 1. **Content Fragmentation**
- Multi-column PDFs often split related content across columns
- `pypdf` may emit text in the order it is stored in the PDF rather than in logical reading order
- Results in broken sentences and disconnected paragraphs
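
One way to restore reading order, sketched under strong assumptions: a layout-aware extractor has already produced text blocks with coordinates, the columns have equal width, and `top` grows downward. The `(x0, top, text)` tuple format and the function `reading_order` are hypothetical and do not correspond to any specific library's output.

```python
def reading_order(blocks, page_width, n_columns=2):
    """Sort text blocks column by column, then top to bottom within a column."""
    column_width = page_width / n_columns

    def sort_key(block):
        x0, top, _text = block
        # Assign the block to a column based on its left edge.
        column = min(int(x0 // column_width), n_columns - 1)
        return (column, top)

    return [text for _x0, _top, text in sorted(blocks, key=sort_key)]

# Blocks listed row by row, as a naive extraction might return them.
blocks = [
    (50, 100, "Left column, first paragraph."),
    (320, 100, "Right column, first paragraph."),
    (50, 200, "Left column, second paragraph."),
    (320, 200, "Right column, second paragraph."),
]

ordered = reading_order(blocks, page_width=600)
print("\n".join(ordered))
```

Real pages need more care (full-width titles, figures, uneven columns), so this is a starting point rather than a general solution.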

#### 2. **Misaligned Page Boundaries**
- Structured Page 1 → Unstructured Page 1: high similarity (0.921)
- Structured Pages 2 and 3 both match Unstructured Page 2 best (0.744 and 0.755)
- This points to text reflowing across page boundaries between the two layouts; the page counts also differ (4 vs 3), so a one-to-one page mapping is not possible

#### 3. **Semantic Coherence Loss**
- Best-match similarity drops from 0.921 on page 1 to 0.632-0.755 on later pages, which suggests:
  - Context is being split unnaturally
  - Related information ends up in different chunks
  - Semantic meaning is diluted or lost

### Downstream Impact on Knowledge Generation

#### RAG System Implications:
1. **Retrieval Accuracy**: Malformed chunks may not match relevant queries
2. **Context Quality**: LLMs receive fragmented or incomplete context
3. **Hallucination Risk**: Incomplete information increases generation errors
4. **Answer Coherence**: Generated responses may miss critical connections

#### Example Scenario:
If a medical document discusses "symptoms" in one column and "treatments" in another:
- **Proper extraction**: Single chunk contains complete symptom-treatment relationship
- **Malformed extraction**: Symptoms and treatments in separate chunks
- **Result**: LLM may generate incomplete or incorrect medical advice

### Mitigation Strategies
1. **Layout-Aware Extraction**: Use tools like `pdfplumber` or `pymupdf` with layout detection
2. **Post-Processing**: Reconstruct logical reading order after extraction
3. **Validation**: Compare embeddings to detect extraction anomalies
4. **Chunk Overlap**: Use overlapping windows to preserve context
5. **Manual Review**: Verify extraction quality for critical documents
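
The chunk-overlap strategy can be sketched as a sliding window over words. The function `chunk_with_overlap` and its default sizes are illustrative assumptions; production systems often count tokens rather than whitespace-separated words.

```python
def chunk_with_overlap(words, chunk_size=200, overlap=50):
    """Split a list of words into chunks that share `overlap` words with their neighbor."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks

text = "one two three four five six seven eight nine ten"
chunks = chunk_with_overlap(text.split(), chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last two words of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.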