# Query PDF Tutorial

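This tutorial shows how to load a PDF into EvaDB, retrieve its text by page and paragraph, and run Hugging Face text-classification and summarization models over selected paragraphs.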

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/georgia-tech-db/eva/blob/master/tutorials/12-query-pdf.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run on Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/blob/master/tutorials/12-query-pdf.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> View source on GitHub</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/raw/master/tutorials/12-query-pdf.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
  </td>
</table><br><br>

### Connect to EvaDB


In [1]:
%pip install --quiet "evadb[document,notebook]"
import evadb
cursor = evadb.connect().cursor()

Note: you may need to restart the kernel to use updated packages.


### Download PDFs

In [2]:
!wget -nc "https://www.dropbox.com/s/fv6pqdneth3l6fz/pdf_sample1.pdf"

File ‘pdf_sample1.pdf’ already there; not retrieving.
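
Where `wget` is not available, the same skip-if-present behaviour can be sketched with Python's standard library. The helper name `download_if_missing` is an assumption made for illustration and is not part of EvaDB.

```python
import os
import urllib.request


def download_if_missing(url: str, path: str) -> bool:
    """Download url to path unless path already exists (similar to `wget -nc`).

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(path):
        return False
    # Download to a temporary name first so that a failed transfer does not
    # leave a partial file that later looks like a finished download.
    tmp_path = path + ".part"
    urllib.request.urlretrieve(url, tmp_path)
    os.replace(tmp_path, path)
    return True
```

A call such as `download_if_missing("https://www.dropbox.com/s/fv6pqdneth3l6fz/pdf_sample1.pdf", "pdf_sample1.pdf")` would then stand in for the `wget` cell. The sketch does not check that the server returns a PDF rather than an HTML page.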



### Load PDFs

In [3]:
cursor.query("DROP TABLE IF EXISTS MyPDFs").df()
cursor.query("LOAD PDF 'pdf_sample1.pdf' INTO MyPDFs").df()

                         0
0  Number of loaded PDF: 1


### Retrieve Text from Loaded PDFs

In [4]:
cursor.query("SELECT * FROM MyPDFs").df()

In [5]:
cursor.query("""
    SELECT *
    FROM MyPDFs
    WHERE page = 1 AND paragraph = 3
""").df()

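To look up other paragraphs without retyping the query, the statement can be generated by a small helper. This is a minimal sketch: `paragraph_query` is a hypothetical name, and it assumes only the `page` and `paragraph` columns used in the query above.

```python
def paragraph_query(table: str, page: int, paragraph: int) -> str:
    """Build the SELECT statement for one paragraph of a loaded PDF."""
    # The query is built by string formatting rather than parameter binding,
    # so only validated values are interpolated.
    if not table.isidentifier():
        raise ValueError(f"invalid table name: {table!r}")
    page, paragraph = int(page), int(paragraph)
    return (
        f"SELECT * FROM {table} "
        f"WHERE page = {page} AND paragraph = {paragraph}"
    )
```

With the cursor from the first cell, `cursor.query(paragraph_query("MyPDFs", 1, 3)).df()` would run the same lookup as the cell above.
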
### Create UDFs for Text Classification and Text Summarization

In [6]:
cursor.query("""
    CREATE FUNCTION IF NOT EXISTS TextClassifier
    TYPE HuggingFace
    TASK 'text-classification'
    MODEL 'distilbert-base-uncased-finetuned-sst-2-english'
""").df()

In [7]:
cursor.query("""
    CREATE FUNCTION IF NOT EXISTS TextSummarizer
    TYPE HuggingFace
    TASK 'summarization'
    MODEL 'facebook/bart-large-cnn'
""").df()

### Get Summaries of a Subset of Paragraphs with Negative Sentiment

In [8]:
cursor.query("""
    SELECT data, TextSummarizer(data)
    FROM MyPDFs
    WHERE page = 1 AND paragraph >= 1 AND paragraph <= 3 AND TextClassifier(data).label = 'NEGATIVE'
""").df()