# Query PDF Tutorial

This tutorial shows how to load a PDF into EVA, query its paragraphs with SQL, and run Hugging Face text classification and summarization models on them.

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/georgia-tech-db/eva/blob/master/tutorials/12-query-pdf.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run on Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/blob/master/tutorials/12-query-pdf.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> View source on GitHub</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/raw/master/tutorials/12-query-pdf.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
  </td>
</table>

### Start EVA Server

We reuse the start-server notebook to launch the EVA server.

In [1]:
!wget -nc "https://raw.githubusercontent.com/georgia-tech-db/eva/master/tutorials/00-start-eva-server.ipynb"
%run 00-start-eva-server.ipynb
cursor = connect_to_server()

File ‘00-start-eva-server.ipynb’ already there; not retrieving.

Note: you may need to restart the kernel to use updated packages.
Stopping EVA Server ...
Starting EVA Server ...
nohup eva_server > eva.log 2>&1 &

### Download PDFs

In [2]:
!wget -nc "https://www.dropbox.com/s/fv6pqdneth3l6fz/pdf_sample1.pdf"

File ‘pdf_sample1.pdf’ already there; not retrieving.
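
`wget -nc` skips the download whenever a file with the same name already exists, so a leftover or incomplete file would be passed to `LOAD PDF` unchanged. A minimal sketch of a sanity check to run before loading, assuming only that a well-formed PDF starts with the `%PDF-` signature; the helper name `looks_like_pdf` is hypothetical and not part of EVA:

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF signature.

    Hypothetical helper for this tutorial; it is not an EVA function.
    """
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        # A well-formed PDF begins with the bytes "%PDF-" followed by a version.
        return handle.read(5) == b"%PDF-"
```

In the notebook, `looks_like_pdf("pdf_sample1.pdf")` should return `True` before the file is loaded; if it does not, delete the file and download it again.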



### Load PDFs

In [3]:
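# Drop any table left over from a previous run so the load starts clean.
cursor.execute("DROP TABLE IF EXISTS MyPDFs").fetch_all().as_df()
# Load the PDF; each paragraph becomes one row in MyPDFs.
cursor.execute("LOAD PDF 'pdf_sample1.pdf' INTO MyPDFs").fetch_all().as_df()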

Unnamed: 0,0
0,Number of loaded PDF: 1


### Retrieve Text from Loaded PDFs

In [4]:
cursor.execute('SELECT * FROM MyPDFs;').fetch_all().as_df()

|   | mypdfs._row_id | mypdfs.name | mypdfs.page | mypdfs.paragraph | mypdfs.data |
|---|---|---|---|---|---|
| 0 | 1 | pdf_sample1.pdf | 1 | 1 | HAEMETOLOGY  STUDY OF BLOOD |
| 1 | 1 | pdf_sample1.pdf | 1 | 2 | DEFINATION  Specialized connective tissue wit... |
| 2 | 1 | pdf_sample1.pdf | 1 | 3 | PHYSICAL CHARACTERISTICS ( 1 ) COLOUR -- R... |
| 3 | 1 | pdf_sample1.pdf | 2 | 3 | PLASMA SERUM |
| 4 | 1 | pdf_sample1.pdf | 2 | 4 | [1] has fibrinogen [1] No fibrinogen |
| 5 | 1 | pdf_sample1.pdf | 2 | 5 | [2] has prothrombin [2] No prothrombin |
| 6 | 1 | pdf_sample1.pdf | 2 | 6 | [3] has clotting factors V and ... |
| 7 | 1 | pdf_sample1.pdf | 2 | 7 | [3] no factors V & VIII |
| 8 | 1 | pdf_sample1.pdf | 2 | 8 | [4] No platelet derived growth ... |
| 9 | 1 | pdf_sample1.pdf | 2 | 9 | [4] Has additional platelet growth factors th... |


In [5]:
cursor.execute('SELECT data FROM MyPDFs WHERE page = 1 AND paragraph = 3;').fetch_all().as_df()

Unnamed: 0,mypdfs.data
0,PHYSICAL CHARACTERISTICS ( 1 ) COLOUR -- R...
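
Looking up other paragraphs means repeating the same statement with different numbers. A minimal sketch of a query builder, assuming the `page` and `paragraph` columns shown in the output above; `paragraph_query` is a hypothetical helper, not an EVA function:

```python
def paragraph_query(table, page, paragraph):
    """Build a SELECT statement for one paragraph of a loaded PDF.

    Hypothetical helper for this tutorial; it only formats a string.
    """
    # Accept plain identifiers only, so nothing but a table name is spliced in.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    # int() raises on non-numeric input, so only integers reach the statement.
    return (
        f"SELECT data FROM {table} "
        f"WHERE page = {int(page)} AND paragraph = {int(paragraph)};"
    )


query = paragraph_query("MyPDFs", 1, 3)
# In the notebook, with the cursor created earlier:
# cursor.execute(query).fetch_all().as_df()
```

For page 1, paragraph 3, the helper produces the same statement as the cell above.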


### Create UDFs for Text Classification and Text Summarization

In [6]:
cursor.execute("""CREATE UDF IF NOT EXISTS TextClassifier
                  TYPE HuggingFace
                  'task' 'text-classification'
                  'model' 'distilbert-base-uncased-finetuned-sst-2-english';""").fetch_all().as_df()

Unnamed: 0,0
0,UDF TextClassifier successfully added to the d...


In [7]:
cursor.execute("""CREATE UDF IF NOT EXISTS TextSummarizer
                  TYPE HuggingFace
                  'task' 'summarization'
                  'model' 'facebook/bart-large-cnn';""").fetch_all().as_df()

Unnamed: 0,0
0,UDF TextSummarizer successfully added to the d...
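
Both statements follow the same pattern: a UDF name, a Hugging Face task, and a model identifier. A minimal sketch that generates them from a small table of specifications; `UDF_SPECS` and `huggingface_udf_sql` are hypothetical names, and writing the statement on a single line instead of four is assumed to make no difference to EVA's parser:

```python
# UDF name -> (Hugging Face task, model identifier), copied from the cells above.
UDF_SPECS = {
    "TextClassifier": (
        "text-classification",
        "distilbert-base-uncased-finetuned-sst-2-english",
    ),
    "TextSummarizer": ("summarization", "facebook/bart-large-cnn"),
}


def huggingface_udf_sql(name, task, model):
    """Return a CREATE UDF statement in the form used in this tutorial."""
    if not name.isidentifier():
        raise ValueError(f"unexpected UDF name: {name!r}")
    for value in (task, model):
        # A single quote would end the quoted literal early, so reject it.
        if "'" in value:
            raise ValueError(f"unexpected quote in: {value!r}")
    return (
        f"CREATE UDF IF NOT EXISTS {name} "
        "TYPE HuggingFace "
        f"'task' '{task}' "
        f"'model' '{model}';"
    )


statements = [
    huggingface_udf_sql(name, task, model)
    for name, (task, model) in UDF_SPECS.items()
]
# In the notebook, with the cursor created earlier:
# for statement in statements:
#     cursor.execute(statement).fetch_all().as_df()
```

Adding another Hugging Face model then only requires one more entry in `UDF_SPECS`.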


### Get Summaries of a Subset of Paragraphs with Negative Sentiment

In [8]:
cursor.execute("""SELECT data, TextSummarizer(data)
                  FROM MyPDFs
                  WHERE page = 1 
                  AND paragraph >= 1 AND paragraph <= 3
                  AND TextClassifier(data).label = 'NEGATIVE';""").fetch_all().as_df()

|   | mypdfs.data | textsummarizer.summary_text |
|---|---|---|
| 0 | DEFINATION  Specialized connective tissue wit... | Specialized connective tissue with fluid matri... |
| 1 | PHYSICAL CHARACTERISTICS ( 1 ) COLOUR -- R... | The temperature is 38° C / 100.4° F. The body ... |
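
The query combines ordinary column predicates with a model-based predicate, then summarizes only the rows that pass both. The plain-Python sketch below mirrors that logic on made-up rows. `classify` and `summarize` are toy stand-ins that only reproduce the output keys used in the query (`label`, `summary_text`), not the real models, and the order in which the sketch checks the predicates is a choice made here; the tutorial does not say how EVA orders them.

```python
# Made-up rows with the same columns as MyPDFs (not taken from the PDF).
rows = [
    {"page": 1, "paragraph": 1, "data": "A short title"},
    {"page": 1, "paragraph": 2,
     "data": "This paragraph does not read well at all today"},
    {"page": 2, "paragraph": 3, "data": "This one is not on the first page"},
]


def classify(text):
    """Toy stand-in for TextClassifier: NEGATIVE if the text contains 'not'."""
    label = "NEGATIVE" if " not " in f" {text.lower()} " else "POSITIVE"
    return {"label": label}


def summarize(text, max_words=4):
    """Toy stand-in for TextSummarizer: keep the first few words."""
    return {"summary_text": " ".join(text.split()[:max_words])}


def summarize_negative(rows, page, first, last):
    """Mirror the WHERE clause, then summarize the rows that remain."""
    results = []
    for row in rows:
        # Column predicates: page = ... AND paragraph BETWEEN first AND last.
        if row["page"] != page or not first <= row["paragraph"] <= last:
            continue
        # Model-based predicate: TextClassifier(data).label = 'NEGATIVE'.
        if classify(row["data"])["label"] != "NEGATIVE":
            continue
        summary = summarize(row["data"])["summary_text"]
        results.append({"data": row["data"], "summary_text": summary})
    return results


result = summarize_negative(rows, page=1, first=1, last=3)
```

Only the second made-up row survives both filters, so `result` holds a single entry with its text and shortened summary.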
