# Similarity PDF Query Tutorial

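In this tutorial, we load two PDFs into EVA, extract a feature vector for each paragraph with a sentence transformer, build a vector index over those features, and run a similarity query against the paragraphs.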

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/georgia-tech-db/eva/blob/master/tutorials/13-privategpt.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run on Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/blob/master/tutorials/13-privategpt.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> View source on GitHub</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/raw/master/tutorials/13-privategpt.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
  </td>
</table>

### Start EVA server



In [2]:
from eva.interfaces.relational.db import connect
conn = connect()
cursor = conn.cursor()

### Download PDFs

In [3]:
!wget -nc "https://www.dropbox.com/s/6jdcn33xizdtl9k/layout-parser-paper.pdf"
!wget -nc "https://www.dropbox.com/s/4q3bvne3m2vsu5g/state_of_the_union.pdf"

File 'layout-parser-paper.pdf' already there; not retrieving.

File 'state_of_the_union.pdf' already there; not retrieving.



### Load PDFs

In [4]:
# Start from a clean table, then load both PDFs into it.
drop_pdf = cursor.query("DROP TABLE IF EXISTS MyPDFs").execute()
load_pdf1 = cursor.load(file_regex="layout-parser-paper.pdf", format="PDF", table_name="MyPDFs").execute()
load_pdf2 = cursor.load(file_regex="state_of_the_union.pdf", format="PDF", table_name="MyPDFs").execute()

### Retrieve Text from Loaded PDFs

In [5]:
cursor.table("MyPDFs").df()

| | mypdfs._row_id | mypdfs.name | mypdfs.page | mypdfs.paragraph | mypdfs.data |
|---|---|---|---|---|---|
| 0 | 1 | layout-parser-paper.pdf | 1 | 1 | LayoutParser: A Uniﬁed Toolkit for DeepLearnin... |
| 1 | 1 | layout-parser-paper.pdf | 1 | 2 | Zejiang Shen1(�), Ruochen Zhang2, Melissa Dell... |
| 2 | 1 | layout-parser-paper.pdf | 1 | 3 | 1 Allen Institute for AIshannons@allenai.org2 ... |
| 3 | 1 | layout-parser-paper.pdf | 1 | 4 | Abstract. Recent advances in document image an... |
| 4 | 1 | layout-parser-paper.pdf | 1 | 5 | Keywords: Document Image Analysis · Deep Learn... |
| ... | ... | ... | ... | ... | ... |
| 512 | 2 | state_of_the_union.pdf | 17 | 19 | Now is our moment to meet and overcome the cha... |
| 513 | 2 | state_of_the_union.pdf | 17 | 20 | And we will, as one people. |
| 514 | 2 | state_of_the_union.pdf | 17 | 21 | One America. |
| 515 | 2 | state_of_the_union.pdf | 17 | 22 | The United States of America. |
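
Each row of the loaded table is one paragraph, identified by the PDF name, the page number, and the paragraph number on that page. The following is a minimal sketch of working with rows of that shape in plain Python. The `rows` list holds a few values copied from the output above, and the helper names `paragraphs_per_pdf` and `paragraphs_on_page` are hypothetical, introduced only for illustration.

```python
from collections import Counter

# A few rows copied from the output above: (name, page, paragraph, data).
rows = [
    ("layout-parser-paper.pdf", 1, 4, "Abstract. Recent advances in document image an..."),
    ("layout-parser-paper.pdf", 1, 5, "Keywords: Document Image Analysis · Deep Learn..."),
    ("state_of_the_union.pdf", 17, 22, "The United States of America."),
    ("state_of_the_union.pdf", 17, 21, "One America."),
]


def paragraphs_per_pdf(rows):
    """Count how many paragraph rows belong to each PDF."""
    return Counter(name for name, _page, _paragraph, _data in rows)


def paragraphs_on_page(rows, name, page):
    """Return the text of every paragraph on one page of one PDF, in paragraph order."""
    matches = [(paragraph, data) for n, p, paragraph, data in rows if n == name and p == page]
    return [data for _paragraph, data in sorted(matches)]


counts = paragraphs_per_pdf(rows)
last_page = paragraphs_on_page(rows, "state_of_the_union.pdf", 17)
print(counts)
print(last_page)
```

The same grouping and filtering can be done directly on the DataFrame returned by `.df()`; the plain-Python version is used here only to keep the sketch free of dependencies.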


### Create Sentence Transformer Feature Extractor UDF

In [6]:
# Drop any previously registered version of the UDF, then register it from its implementation file.
udf_check = cursor.query(
    "DROP UDF IF EXISTS SentenceTransformerFeatureExtractor"
)
udf_check.execute()
udf = cursor.create_udf(
    "SentenceTransformerFeatureExtractor",
    True,  # if_not_exists
    "../eva/udfs/sentence_transformer_feature_extractor.py",
)
udf.execute()

<eva.models.storage.batch.Batch at 0x7fee718cec20>
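
A feature extractor maps each piece of text to a fixed-length numeric vector so that texts can be compared by distance. The sketch below illustrates only that interface with a toy hashed bag-of-words embedding. It is not how `SentenceTransformerFeatureExtractor` works internally: a real sentence transformer uses a neural network to capture meaning, whereas this stand-in only counts words. The names `toy_embed` and `DIM` are hypothetical.

```python
import hashlib
import math

DIM = 8  # hypothetical, deliberately tiny vector size for illustration


def toy_embed(text, dim=DIM):
    """Map text to a fixed-length unit vector using hashed word counts.

    This is a stand-in for a real embedding model, not a sentence transformer.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        # Hash each word to a stable bucket and count it there.
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    # Normalize to unit length; an empty text stays the zero vector.
    return [x / norm for x in vec] if norm else vec


a = toy_embed("One America.")
b = toy_embed("The United States of America.")
print(len(a), len(b))
```

Whatever the input length, the output always has `DIM` entries, which is the property a vector index relies on.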

### Create Vector Index for PDF Paragraphs

In [7]:
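# "faiss_indexs" is only the index's name; the `using` argument selects the vector store (Qdrant here).
cursor.create_vector_index(
    "faiss_indexs",
    table_name="MyPDFs",
    expr="SentenceTransformerFeatureExtractor(data)",
    using="QDRANT",
).df()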

| | 0 |
|---|---|
| 0 | Index faiss_indexs successfully added to the d... |


### Run a Similarity Query over the PDF Paragraphs

In [9]:
# Rank paragraphs by similarity to the question and keep the best match.
query = (
    cursor.table("MyPDFs")
    .order(
        """Similarity(
            SentenceTransformerFeatureExtractor('When was the NATO created?'),
            SentenceTransformerFeatureExtractor(data)
        )"""
    )
    .limit(1)
    .select("data")
).df()
query

| | mypdfs.data |
|---|---|
| 0 | [38] Zhong, X., Tang, J., Yepes, A.J.: Publayn... |
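
The query above orders paragraphs by `Similarity(...)` and keeps the first row. A minimal sketch of that order-then-limit pattern follows, assuming (as the ascending order suggests) that a smaller value means a closer match and using Euclidean distance as the measure. The three-dimensional vectors, the paragraph labels, and the helper names `l2_distance` and `most_similar` are hypothetical; real feature vectors come from the feature extractor and have many more dimensions.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def most_similar(query_vec, rows, limit=1):
    """Order (text, vector) rows by distance to the query and keep the first `limit` texts."""
    ranked = sorted(rows, key=lambda row: l2_distance(query_vec, row[1]))
    return [text for text, _vec in ranked[:limit]]


# Hypothetical feature vectors for three paragraphs.
paragraphs = [
    ("paragraph A", [1.0, 0.0, 0.0]),
    ("paragraph B", [0.0, 1.0, 0.0]),
    ("paragraph C", [0.7, 0.7, 0.0]),
]
query_vec = [0.9, 0.1, 0.0]

top = most_similar(query_vec, paragraphs, limit=1)
print(top)  # prints ['paragraph A']
```

This sketch scans every row, which is fine for three paragraphs. A vector index exists to avoid that full scan on larger tables.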
