# Query PDF Tutorial

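In this tutorial, we demonstrate how to load a PDF into EvaDB, query its text with SQL, and run Hugging Face text-classification and summarization models over selected paragraphs.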

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/Yongqi099/evadb/blob/staging/apps/youtube_summary/summary.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run on Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/blob/master/tutorials/12-query-pdf.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> View source on GitHub</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/raw/master/tutorials/12-query-pdf.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
  </td>
</table><br><br>

### Connect to EvaDB


In [1]:
%pip install --quiet "evadb[document,notebook]"

import warnings

import evadb

# Connect to EvaDB and create a cursor for running queries.
cursor = evadb.connect().cursor()

# Suppress library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")


### Download PDFs

In [2]:
!wget -nc "https://www.dropbox.com/s/fv6pqdneth3l6fz/pdf_sample1.pdf"

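
Outside a notebook, the `wget -nc` step can be approximated in plain Python. The sketch below is a minimal stand-in, assuming a helper named `download_if_missing` (hypothetical, not part of EvaDB). Like `-nc`, it skips the download when the target file already exists.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: str) -> bool:
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if the file was already there.
    Hypothetical helper that mirrors `wget -nc`; not part of EvaDB.
    """
    path = Path(dest)
    if path.exists():
        return False
    # Write to a temporary ".part" file first so that an interrupted
    # download is not mistaken for a complete file on the next call.
    tmp = path.with_suffix(path.suffix + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(path)
    return True
```

Calling `download_if_missing(url, "pdf_sample1.pdf")` with the Dropbox URL from the cell above should leave the same file in the working directory, assuming the link redirects for `urllib` as it does for `wget`.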

### Load PDFs

In [3]:
cursor.query("DROP TABLE IF EXISTS MyPDFs").df()
cursor.query("LOAD PDF 'pdf_sample1.pdf' INTO MyPDFs").df()

| | 0 |
|---|---|
| 0 | Number of loaded PDF: 1 |
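
To load more than one file, the same `LOAD PDF ... INTO ...` statement can be issued once per file. A minimal sketch, assuming a hypothetical helper `load_pdf_statements` and a local directory of PDFs. It only builds the statement strings, so they can be inspected before anything is sent to EvaDB.

```python
from pathlib import Path
from typing import List


def load_pdf_statements(directory: str, table: str = "MyPDFs") -> List[str]:
    """Build one LOAD PDF statement per *.pdf file in directory (sorted).

    Hypothetical helper; the statement shape is copied from the cell above.
    """
    statements = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        path_text = pdf.as_posix()
        # A single quote would terminate the quoted path inside the statement.
        if "'" in path_text:
            raise ValueError(f"unsupported quote in path: {path_text}")
        statements.append(f"LOAD PDF '{path_text}' INTO {table}")
    return statements
```

Each resulting statement can then be run with `cursor.query(statement).df()`, as in the cell above.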


### Retrieve Text from Loaded PDFs

In [4]:
cursor.query("SELECT * FROM MyPDFs").df()

| | mypdfs._row_id | mypdfs.name | mypdfs.page | mypdfs.paragraph | mypdfs.data |
|---|---|---|---|---|---|
| 0 | 1 | pdf_sample1.pdf | 1 | 1 | HAEMETOLOGY  STUDY OF BLOOD |
| 1 | 1 | pdf_sample1.pdf | 1 | 2 | DEFINATION  Specialized connective tissue wit... |
| 2 | 1 | pdf_sample1.pdf | 1 | 3 | PHYSICAL CHARACTERISTICS ( 1 ) COLOUR -- R... |
| 3 | 1 | pdf_sample1.pdf | 2 | 3 | PLASMA SERUM |
| 4 | 1 | pdf_sample1.pdf | 2 | 4 | [1] has fibrinogen [1] No fibrinogen |
| 5 | 1 | pdf_sample1.pdf | 2 | 5 | [2] has prothrombin [2] No prothrombin |
| 6 | 1 | pdf_sample1.pdf | 2 | 6 | [3] has clotting factors V and ... |
| 7 | 1 | pdf_sample1.pdf | 2 | 7 | [3] no factors V & VIII |
| 8 | 1 | pdf_sample1.pdf | 2 | 8 | [4] No platelet derived growth ... |
| 9 | 1 | pdf_sample1.pdf | 2 | 9 | [4] Has additional platelet growth factors th... |


In [5]:
cursor.query("""
    SELECT *
    FROM MyPDFs
    WHERE page = 1 AND paragraph = 3
""").df()

| | mypdfs._row_id | mypdfs.name | mypdfs.page | mypdfs.paragraph | mypdfs.data |
|---|---|---|---|---|---|
| 0 | 1 | pdf_sample1.pdf | 1 | 3 | PHYSICAL CHARACTERISTICS ( 1 ) COLOUR -- R... |
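
The filter above is ordinary SQL over the columns `name`, `page`, `paragraph`, and `data`. As an illustration only, the sketch below mirrors a few of the rows shown earlier in an in-memory SQLite table and runs the same `WHERE` clause. SQLite stands in for EvaDB here; this is not how EvaDB stores PDFs, and the `data` values are the truncated strings from the output above.

```python
import sqlite3

# A few rows copied from the SELECT * output: (name, page, paragraph, data).
rows = [
    ("pdf_sample1.pdf", 1, 1, "HAEMETOLOGY  STUDY OF BLOOD"),
    ("pdf_sample1.pdf", 1, 2, "DEFINATION  Specialized connective tissue wit..."),
    ("pdf_sample1.pdf", 1, 3, "PHYSICAL CHARACTERISTICS ( 1 ) COLOUR -- R..."),
    ("pdf_sample1.pdf", 2, 3, "PLASMA SERUM"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyPDFs (name TEXT, page INTEGER, paragraph INTEGER, data TEXT)"
)
conn.executemany("INSERT INTO MyPDFs VALUES (?, ?, ?, ?)", rows)

# Same predicate as the EvaDB query: both conditions must hold for a row to match,
# so the page 2, paragraph 3 row is excluded.
matches = conn.execute(
    "SELECT name, page, paragraph, data FROM MyPDFs WHERE page = 1 AND paragraph = 3"
).fetchall()
print(matches)
conn.close()
```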


### Create UDFs for Text Classification and Text Summarization

In [10]:
cursor.query("""
    CREATE FUNCTION IF NOT EXISTS TextClassifier
    TYPE HuggingFace
    TASK 'text-classification'
    MODEL 'distilbert-base-uncased-finetuned-sst-2-english'
""").df()

| | 0 |
|---|---|
| 0 | Function TextClassifier already exists, nothin... |


In [11]:
cursor.query("""
    CREATE FUNCTION IF NOT EXISTS TextSummarizer
    TYPE HuggingFace
    TASK 'summarization'
    MODEL 'facebook/bart-large-cnn'
""").df()

| | 0 |
|---|---|
| 0 | Function TextSummarizer already exists, nothin... |
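
Both statements follow the same template and differ only in the function name, task, and model. A small sketch, assuming a hypothetical helper `hf_function_sql`, that generates the statement text from those three values:

```python
def hf_function_sql(name: str, task: str, model: str) -> str:
    """Return a CREATE FUNCTION statement for a Hugging Face model.

    Hypothetical helper; the template is copied from the two cells above.
    """
    if not name.isidentifier():
        raise ValueError(f"invalid function name: {name!r}")
    for value in (task, model):
        # A single quote would terminate the quoted literal in the statement.
        if "'" in value:
            raise ValueError(f"unsupported quote in: {value!r}")
    return (
        f"CREATE FUNCTION IF NOT EXISTS {name}\n"
        f"TYPE HuggingFace\n"
        f"TASK '{task}'\n"
        f"MODEL '{model}'"
    )


# The two functions defined in this tutorial, expressed as (name, task, model).
specs = [
    ("TextClassifier", "text-classification",
     "distilbert-base-uncased-finetuned-sst-2-english"),
    ("TextSummarizer", "summarization", "facebook/bart-large-cnn"),
]
statements = [hf_function_sql(*spec) for spec in specs]
print(statements[1])
```

Each string in `statements` can be passed to `cursor.query(...).df()` in the same way as the hand-written statements.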


In [9]:
from google.colab import output
output.enable_custom_widget_manager()


### Get Summaries of a Subset of Paragraphs with Negative Sentiment

In [12]:
cursor.query("""
    SELECT data, TextSummarizer(data)
    FROM MyPDFs
    WHERE page = 1
      AND paragraph >= 1 AND paragraph <= 3
      AND TextClassifier(data).label = 'NEGATIVE'
""").df()

Your max_length is set to 142, but your input_length is only 97. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=48)


| | mypdfs.data | textsummarizer.summary_text |
|---|---|---|
| 0 | DEFINATION  Specialized connective tissue wit... | Specialized connective tissue with fluid matri... |
| 1 | PHYSICAL CHARACTERISTICS ( 1 ) COLOUR -- R... | The temperature is 38° C / 100.4° F. The body ... |
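
Conceptually, the query keeps the paragraphs that the classifier labels `NEGATIVE` and returns each one next to its summary. The sketch below spells out that filter-then-map logic in plain Python. The `classify` and `summarize` functions are trivial stand-ins (assumptions for illustration), not the Hugging Face models, and the order in which EvaDB actually evaluates the predicates is not shown in this tutorial.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


def summarize_negative(
    rows: List[Row],
    classify: Callable[[str], str],
    summarize: Callable[[str], str],
    page: int = 1,
    first_paragraph: int = 1,
    last_paragraph: int = 3,
) -> List[Tuple[str, str]]:
    """Return (data, summary) pairs for NEGATIVE paragraphs in the given range."""
    results = []
    for row in rows:
        in_range = (
            row["page"] == page
            and first_paragraph <= row["paragraph"] <= last_paragraph
        )
        # Short-circuit: the classifier runs only on rows that pass the
        # column filters, and the summarizer only on NEGATIVE rows.
        if in_range and classify(row["data"]) == "NEGATIVE":
            results.append((row["data"], summarize(row["data"])))
    return results


# Stand-in models, for illustration only.
def classify(text: str) -> str:
    return "NEGATIVE" if "bad" in text else "POSITIVE"


def summarize(text: str) -> str:
    return text[:10]


rows = [
    {"page": 1, "paragraph": 1, "data": "a good paragraph"},
    {"page": 1, "paragraph": 2, "data": "a bad paragraph about clotting"},
    {"page": 2, "paragraph": 3, "data": "a bad paragraph on another page"},
]
print(summarize_negative(rows, classify, summarize))
```

Only the second row is returned: the first is classified `POSITIVE` by the stand-in, and the third is on a different page.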
