# Basic PDF extraction workflow
This is a vastly oversimplified workflow that demonstrates the following:
1. Splitting a PDF into one image per page
2. Computing a vector embedding per page (using a pre-trained ViT)
3. Manually picking a page with a data table and retrieving the 10 most similar pages to evaluate whether the embeddings are any good

In [3]:
import locale
from pathlib import Path
import shutil

import duckdb
import ghostscript
import lance
import numpy as np
import pandas as pd
import pyarrow as pa
from PyPDF2 import PdfFileReader, PdfFileWriter

## Process the PDF
- PDFs can contain both images and text
- For simplicity, we assume all pages are scanned images
- However, when doing this for real we should also check which pages already contain text (this saves OCR)
- This only needs to be done once. Afterwards you can skip directly to the section on embeddings and use the saved Lance dataset
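
One way to act on the note about pages that already contain text is to flag a page for OCR only when the PDF library returns little or no extractable text. The sketch below is a hypothetical heuristic: `needs_ocr` and the `min_chars` threshold are illustrative names and values, not part of this notebook or of PyPDF2, and the sample strings are toy stand-ins for per-page extraction results.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Return True when a page looks like a scanned image.

    `extracted_text` is whatever the PDF library returned for the page
    (possibly None or an empty string). `min_chars` is an assumed,
    untuned threshold.
    """
    if not extracted_text:
        return True
    # Ignore whitespace so pages holding only layout blanks count as empty.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


# Toy inputs standing in for per-page text extraction results.
samples = {
    "scanned": "",
    "whitespace_only": " \n\t ",
    "text_page": "Table 3: Housing units by income category, 2021-2029",
}
ocr_flags = {name: needs_ocr(text) for name, text in samples.items()}
print(ocr_flags)
```

Pages flagged `False` could skip OCR entirely; the threshold would need tuning on real documents.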

### Download

In [None]:
!curl https://www.hcd.ca.gov/housing-elements/docs/torrance-6th-adopted081522.pdf --output test.pdf

In [2]:
url = "https://www.hcd.ca.gov/housing-elements/docs/torrance-6th-adopted081522.pdf"

### Split into pages

In [None]:
pages = Path(".") / "pages"
shutil.rmtree(pages, ignore_errors=True)  # start clean; don't fail if the directory doesn't exist yet
pages.mkdir(exist_ok=True)


def write_page(i, page):
    writer = PdfFileWriter()
    writer.addPage(page)
    page_i = pages / f"page_{i}.pdf"
    with page_i.open("wb") as outfile:
        writer.write(outfile)
    return str(page_i.absolute())

page_paths = []
with open("test.pdf", 'rb') as fh:
    reader = PdfFileReader(fh)
    for i, page in enumerate(reader.pages):
        page_paths.append(write_page(i, page))

TODO: consider adding a Lance type for PDF and convenience functions like split by page, conversion to image, etc.

### Convert to image

In [None]:
!./install_deps.sh # ghostscript and tkinter

In [None]:
images = Path("images").absolute()
shutil.rmtree(images, ignore_errors=True)  # start clean; don't fail if the directory doesn't exist yet
images.mkdir(exist_ok=True)

def pdf2jpeg(pdf_input_path, jpeg_output_path):
    args = ["pdf2jpeg",  # placeholder program name; Ghostscript ignores its value
            "-dNOPAUSE",
            "-sDEVICE=jpeg",
            "-r144",
            f"-sOutputFile={jpeg_output_path}",
            pdf_input_path]
    encoding = locale.getpreferredencoding()
    args = [a.encode(encoding) for a in args]
    ghostscript.Ghostscript(*args)

for i, p in enumerate(page_paths):
    pdf2jpeg(p, str(images / f"page_{i}.jpeg"))

In [None]:
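# iterdir() yields files in arbitrary order, so sort by page number to keep
# each image aligned with its page_id
image_paths = sorted(images.iterdir(), key=lambda p: int(p.stem.split("_")[1]))
df = pd.DataFrame({
    "page_id": range(len(page_paths)),
    "combined": url,
    "page": page_paths,
    "image_link": pd.array(image_paths, dtype='image[uri]'),
})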
df["image"] = pd.array([img.bytes for img in df.image_link.values], dtype='image[binary]')
df

In [None]:
uri = "leoslittleshopofhorrors.lance"

In [None]:
shutil.rmtree(uri, ignore_errors=True)  # remove any previous copy; don't fail on the first run
lance.write_dataset(pa.Table.from_pandas(df), uri)

Next time, we can skip the steps above and use the Lance dataset directly from pandas, PyTorch, or DuckDB.

In [4]:
%load_ext sql
%sql duckdb:///:memory: -a {"config":{"allow_unsigned_extensions":"True"},"preload_extensions":["lance"]}

{'config': {'allow_unsigned_extensions': 'True'}, 'preload_extensions': ['lance']}


In [None]:
pdf_dataset = lance.dataset(uri) # pyarrow Dataset

Note: this visualization widget doesn't work in JupyterLab yet, only in the classic Jupyter Notebook.

In [None]:
%%sql --lance

SELECT page_id, image 
FROM pdf_dataset
LIMIT 20;

## Let's get the embeddings now

In [6]:
uri = "leoslittleshopofhorrors.lance"
pdf_dataset = lance.dataset(uri) # pyarrow Dataset
tbl = pdf_dataset.to_table()  # Arrow table
df = tbl.to_pandas()  # pandas dataframe

We'll use a ViT from Hugging Face (pretrained on ImageNet-21k and fine-tuned on ImageNet-1k labels)

In [None]:
from transformers import ViTImageProcessor, ViTForImageClassification

# Load the image processor and the model from the same checkpoint
checkpoint = "google/vit-base-patch16-224"
feature_extractor = ViTImageProcessor.from_pretrained(checkpoint)
model = ViTForImageClassification.from_pretrained(checkpoint)

Get the final embedding for the CLS token
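
For ViT-Base with 16x16 patches on a 224x224 input, each image becomes 196 patch tokens plus one prepended CLS token, so a hidden state has shape `(batch, 197, 768)` and index 0 along the token axis is the CLS token. A minimal sketch of that indexing, using a random array as a stand-in for the model output (no model is loaded here):

```python
import numpy as np

batch, hidden = 4, 768            # illustrative batch size; 768 is ViT-Base's hidden size
num_patches = (224 // 16) ** 2    # 14 x 14 patches per image
seq_len = num_patches + 1         # one extra slot for the CLS token

# Random stand-in for a hidden state of shape (batch, seq_len, hidden).
rng = np.random.default_rng(0)
last_hidden = rng.normal(size=(batch, seq_len, hidden))

# Token index 0 is the CLS token: one vector per image.
cls_embeddings = last_hidden[:, 0, :]
print(cls_embeddings.shape)
```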

In [None]:
from lance.pytorch.data import Dataset
import torch

# Turn this down for lower mem usage
batch_size = 64
dataset = Dataset(uri, columns=["page_id", "image"], batch_size=batch_size, mode="batch")
inputs = feature_extractor(df.image[0].to_pil(), return_tensors="pt")

ndims = 768  # hidden size of ViT-Base
embeddings = np.zeros((len(df), ndims))
page_ids = np.zeros(len(df), dtype=np.int64)  # integer ids, to match the int64 schema used later

with torch.no_grad():
    for batch_id, (batch_page_ids, images) in enumerate(dataset):
        inputs = feature_extractor(images, return_tensors="pt")
        results = model(output_hidden_states=True, **inputs)
        batch_start = batch_id * batch_size
        batch_end = (batch_id + 1) * batch_size
        embeddings[batch_start:batch_end, :] = results.hidden_states[-1][:,0,:].numpy()
        page_ids[batch_start:batch_end] = batch_page_ids.numpy()
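
The slice arithmetic in the loop works for the final, shorter batch because slices are clipped at the end of the array. A small sketch of the same bookkeeping with the clipping made explicit; `batch_slices` is a hypothetical helper and the sizes are illustrative, not the real page count:

```python
def batch_slices(n_rows, batch_size):
    """Yield (start, end) row bounds for each batch, clipping the last one."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


n_rows, batch_size = 150, 64   # illustrative sizes
bounds = list(batch_slices(n_rows, batch_size))
print(bounds)

# Slicing past the end is clipped, so the unclipped bounds select the same rows.
rows = list(range(n_rows))
last_batch = rows[128:192]
print(len(last_batch))
```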

In [None]:
# TODO Add a pandas ExtensionDtype that knows how to convert to fixed_size_list
# so we don't need this boilerplate

emb_type = pa.list_(pa.float32(), list_size=ndims)
schema = pa.schema([pa.field("page_id", pa.int64(), False), 
                    pa.field("embedding", emb_type, False)])

def make_vec_array(embeddings):
    return pa.FixedSizeListArray.from_arrays(
        pa.array(embeddings.ravel(), type=pa.float32()),
        list_size=ndims)

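# L2-normalize each row so every embedding has unit length
embeddings = embeddings / np.sqrt((embeddings**2).sum(axis=1))[:, np.newaxis]

# page_ids must be int64 to match the schema
vectable = pa.Table.from_arrays(
    [pa.array(page_ids.astype(np.int64)), make_vec_array(embeddings)], schema=schema)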
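
A quick sanity check for the normalization step is that every row ends up with unit length. The sketch below runs on random toy data and uses `np.linalg.norm`, which computes the same quantity as the square-root-of-sum-of-squares expression above. The `l2_normalize` helper and its `eps` guard against all-zero rows are additions for illustration, not part of the notebook.

```python
import numpy as np


def l2_normalize(matrix, eps=1e-12):
    """Scale each row to unit L2 norm; eps avoids division by zero."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, eps)


# Random stand-in for a small batch of 768-dimensional embeddings.
rng = np.random.default_rng(0)
toy = rng.normal(size=(5, 768))
unit = l2_normalize(toy)
row_norms = np.linalg.norm(unit, axis=1)
print(row_norms)
```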

We can merge this back into the original dataset

In [None]:
pdf_dataset = pdf_dataset.checkout(1).merge(
    vectable, left_on="page_id", right_on="page_id", 
    metadata={"notes": "Pretrained ViT CLS embedding"})
pdf_dataset.versions()

Huh? What just happened?

1. `merge`-ing a pyarrow Table into a Lance dataset joins the columns and writes the result to disk
2. Lance also automatically creates a new version
3. You can `checkout` different versions to roll back and reproduce past work

Experimentation

Notice how we did a `checkout(1)` before we merged. This lets us experiment with the vectors without losing history or polluting the column space.

## How do the embeddings do?

We saw before that pages 6, 10, and 14 are "obvious" tables. How does that translate into the vector space?

In [7]:
uri = "leoslittleshopofhorrors.lance"
pdf_dataset = lance.dataset(uri)

In [19]:
%%sql --lance

WITH src AS (SELECT page_id, embedding as query_vector FROM pdf_dataset WHERE page_id = 14)

SELECT pdf_dataset.page_id, pdf_dataset.image, l2_distance(embedding, src.query_vector) as l2_dist
FROM pdf_dataset, src
ORDER BY 3
LIMIT 10

Took 0.25568413734436035


Actually, that's not bad. Assuming this is a pre-processing step, it's worth investigating further.
I guess the assumption here is that, within a given PDF, tables tend to share the same style and therefore tend to look alike.
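
The query above is a brute-force nearest-neighbour search: compute the distance from one page's embedding to every other page and keep the 10 smallest. Here is a NumPy sketch of the same idea on random toy data; `top_k_l2` is a hypothetical helper, not the extension's implementation. Because the embeddings are L2-normalized, the squared L2 distance equals 2 - 2 * (cosine similarity), so ranking by L2 distance gives the same order as ranking by cosine similarity. Whether the extension's `l2_distance` returns the distance or its square is not checked here, and it does not change the ranking.

```python
import numpy as np


def top_k_l2(embeddings, query_index, k=10):
    """Return (indices, distances) of the k rows closest to the query row."""
    diffs = embeddings - embeddings[query_index]
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    order = np.argsort(dists, kind="stable")[:k]
    return order, dists[order]


# Random unit-length rows standing in for the page embeddings.
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 768))
toy /= np.linalg.norm(toy, axis=1, keepdims=True)

idx, dist = top_k_l2(toy, query_index=14, k=10)
cos = toy[idx] @ toy[14]   # cosine similarity, since all rows have unit length
print(idx[0], dist[0])
```

As in the SQL result, the query page itself comes back first with distance zero.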

## OK, let's see it all together

- It's nice to get a feel for things with a few observations
- But let's see it from the whole dataset

<lance.lib.FileSystemDataset at 0x13d1398b0>