# Query PDF Tutorial

In this tutorial, we demonstrate how to load a PDF into EvaDB, query its paragraphs, and run HuggingFace text-classification and summarization models over them.

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/georgia-tech-db/eva/blob/master/tutorials/12-query-pdf.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run on Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/blob/master/tutorials/12-query-pdf.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> View source on GitHub</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/raw/master/tutorials/12-query-pdf.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
  </td>
</table>

### Connect to EvaDB


In [1]:
%pip install evadb
import evadb
cursor = evadb.connect().cursor()

Note: you may need to restart the kernel to use updated packages.


### Download PDFs

In [2]:
!wget -nc "https://www.dropbox.com/s/fv6pqdneth3l6fz/pdf_sample1.pdf"

File 'pdf_sample1.pdf' already there; not retrieving.



### Load PDFs

In [3]:
cursor.query("DROP TABLE IF EXISTS MyPDFs").df()
cursor.load('pdf_sample1.pdf', "MyPDFs", format="pdf").df()



| | 0 |
|---|---|
| 0 | Number of loaded PDF: 1 |


### Retrieve Text from Loaded PDFs

In [4]:
query = cursor.table("MyPDFs")
query = query.select("*")

query.df()

| | mypdfs._row_id | mypdfs.name | mypdfs.page | mypdfs.paragraph | mypdfs.data |
|---|---|---|---|---|---|
| 0 | 1 | pdf_sample1.pdf | 1 | 1 | HAEMETOLOGY  STUDY OF BLOOD |
| 1 | 1 | pdf_sample1.pdf | 1 | 2 | DEFINATION  Specialized connective tissue wit... |
| 2 | 1 | pdf_sample1.pdf | 1 | 3 | PHYSICAL CHARACTERISTICS ( 1 ) COLOUR -- R... |
| 3 | 1 | pdf_sample1.pdf | 2 | 3 | PLASMA SERUM |
| 4 | 1 | pdf_sample1.pdf | 2 | 4 | [1] has fibrinogen [1] No fibrinogen |
| 5 | 1 | pdf_sample1.pdf | 2 | 5 | [2] has prothrombin [2] No prothrombin |
| 6 | 1 | pdf_sample1.pdf | 2 | 6 | [3] has clotting factors V and ... |
| 7 | 1 | pdf_sample1.pdf | 2 | 7 | [3] no factors V & VIII |
| 8 | 1 | pdf_sample1.pdf | 2 | 8 | [4] No platelet derived growth ... |
| 9 | 1 | pdf_sample1.pdf | 2 | 9 | [4] Has additional platelet growth factors th... |


In [5]:
query = cursor.table("MyPDFs")
query = query.select("*")
query = query.filter("page = 1 AND paragraph = 3")

query.df()

| | mypdfs._row_id | mypdfs.name | mypdfs.page | mypdfs.paragraph | mypdfs.data |
|---|---|---|---|---|---|
| 0 | 1 | pdf_sample1.pdf | 1 | 3 | PHYSICAL CHARACTERISTICS ( 1 ) COLOUR -- R... |
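
The same page-and-paragraph filter can also be expressed as a single SQL string and passed to `cursor.query`, the call this notebook already uses for `DROP TABLE` and `CREATE UDF`. The following is a minimal sketch, assuming EvaDB's SQL dialect accepts the predicate in a standard `WHERE` clause; `build_paragraph_query` is a hypothetical helper introduced here for illustration and is not part of EvaDB.

```python
def build_paragraph_query(table: str, page: int, paragraph: int) -> str:
    """Build a SELECT statement for one paragraph of one page (hypothetical helper)."""
    if not table.isidentifier():
        raise ValueError(f"invalid table name: {table!r}")
    # int() raises on non-numeric input, so only integers reach the SQL string.
    return (
        f"SELECT * FROM {table} "
        f"WHERE page = {int(page)} AND paragraph = {int(paragraph)};"
    )


sql = build_paragraph_query("MyPDFs", 1, 3)
print(sql)
# Assumption: with the cursor created in the first cell, this would be run as
# cursor.query(sql).df()
```

Building the string separately keeps the predicate easy to inspect before it is sent to the database.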


### Create UDFs for Text Classification and Text Summarization

In [6]:
cursor.query("""CREATE UDF IF NOT EXISTS TextClassifier
                  TYPE HuggingFace
                  'task' 'text-classification'
                  'model' 'distilbert-base-uncased-finetuned-sst-2-english'""").df()

| | 0 |
|---|---|
| 0 | UDF TextClassifier successfully added to the d... |


In [7]:
cursor.query("""CREATE UDF IF NOT EXISTS TextSummarizer
                  TYPE HuggingFace
                  'task' 'summarization'
                  'model' 'facebook/bart-large-cnn';""").df()

Your max_length is set to 142, but your input_length is only 11. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)


| | 0 |
|---|---|
| 0 | UDF TextSummarizer successfully added to the d... |
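
The `max_length` warning above is emitted when the input is shorter than the summarizer's default output budget. The values it suggests in this notebook (5 for an input of 11 tokens, 10 for 20, 48 for 97) are consistent with half the input length. The sketch below reproduces that rule of thumb; `suggested_max_length` and its `floor` parameter are hypothetical names, and how the value would be passed to the summarizer through EvaDB is not covered by this tutorial.

```python
def suggested_max_length(input_length: int, floor: int = 1) -> int:
    """Return half the input token count, never below `floor` (hypothetical helper)."""
    if input_length < 0:
        raise ValueError("input_length must be non-negative")
    return max(floor, input_length // 2)


# Input lengths reported by the warnings in this notebook.
for n in (11, 20, 97):
    print(n, suggested_max_length(n))
```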


### Get Summaries of a Subset of Paragraphs with Negative Sentiment

In [8]:
query = cursor.table("MyPDFs")
query = query.select("data, TextSummarizer(data)")
query = query.filter("page = 1 AND paragraph >= 1 AND paragraph <= 3")
query = query.filter("TextClassifier(data).label = 'NEGATIVE'")

query.df()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


2023-06-06 23:10:57,837	INFO worker.py:1625 -- Started a local Ray instance.


[2m[36m(ray_parallel pid=3264851)[0m Your max_length is set to 142, but your input_length is only 20. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=10)
[2m[36m(ray_parallel pid=3264851)[0m Your max_length is set to 142, but your input_length is only 97. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=48)


| | mypdfs.data | mypdfs.summary_text |
|---|---|---|
| 0 | DEFINATION  Specialized connective tissue wit... | Specialized connective tissue with fluid matri... |
| 1 | PHYSICAL CHARACTERISTICS ( 1 ) COLOUR -- R... | The temperature is 38° C / 100.4° F. The body ... |
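
The `huggingface/tokenizers` fork warning in the output above names its own remedy: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it is placed at the top of the notebook so that it runs before any tokenizer is used:

```python
import os

# Set the variable before the first tokenizer is used so the library
# does not have to decide about parallelism after a fork.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Choosing `false` trades tokenizer-level parallelism for quieter logs; `true` is the other value the warning accepts.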
