# Similarity PDF Query Tutorial

This tutorial shows how to load PDF documents into EvaDB, build a vector index over their paragraphs, and run a similarity query against them.

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/georgia-tech-db/eva/blob/master/tutorials/13-privategpt.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run on Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/blob/master/tutorials/13-privategpt.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> View source on GitHub</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/raw/master/tutorials/13-privategpt.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
  </td>
</table><br><br>

### Start EvaDB server



In [1]:
%pip install --quiet "evadb[document,notebook]"
%pip install --quiet qdrant_client
from evadb.interfaces.relational.db import connect
conn = connect()
cursor = conn.cursor()

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
grpcio-tools 1.56.0 requires protobuf<5.0dev,>=4.21.6, but you have protobuf 3.20.1 which is incompatible.

Note: you may need to restart the kernel to use updated packages.




### Download PDFs

In [2]:
!wget -nc "https://www.dropbox.com/s/6jdcn33xizdtl9k/layout-parser-paper.pdf"
!wget -nc "https://www.dropbox.com/s/4q3bvne3m2vsu5g/state_of_the_union.pdf"

File ‘layout-parser-paper.pdf’ already there; not retrieving.



File ‘state_of_the_union.pdf’ already there; not retrieving.



### Load PDFs

In [3]:
drop_pdf = cursor.query("DROP TABLE IF EXISTS MyPDFs").execute()
load_pdf1 = cursor.load(file_regex="layout-parser-paper.pdf", format="PDF", table_name="MyPDFs").execute()
load_pdf2 = cursor.load(file_regex="state_of_the_union.pdf", format="PDF", table_name="MyPDFs").execute()

### Retrieve Text from Loaded PDFs

In [4]:
cursor.table("MyPDFs").df()

| | mypdfs._row_id | mypdfs.name | mypdfs.page | mypdfs.paragraph | mypdfs.data |
|---|---|---|---|---|---|
| 0 | 1 | layout-parser-paper.pdf | 1 | 1 | LayoutParser: A Uniﬁed Toolkit for DeepLearnin... |
| 1 | 1 | layout-parser-paper.pdf | 1 | 2 | Zejiang Shen1(�), Ruochen Zhang2, Melissa Dell... |
| 2 | 1 | layout-parser-paper.pdf | 1 | 3 | 1 Allen Institute for AIshannons@allenai.org2 ... |
| 3 | 1 | layout-parser-paper.pdf | 1 | 4 | Abstract. Recent advances in document image an... |
| 4 | 1 | layout-parser-paper.pdf | 1 | 5 | Keywords: Document Image Analysis · Deep Learn... |
| ... | ... | ... | ... | ... | ... |
| 512 | 2 | state_of_the_union.pdf | 17 | 19 | Now is our moment to meet and overcome the cha... |
| 513 | 2 | state_of_the_union.pdf | 17 | 20 | And we will, as one people. |
| 514 | 2 | state_of_the_union.pdf | 17 | 21 | One America. |
| 515 | 2 | state_of_the_union.pdf | 17 | 22 | The United States of America. |


### Create Sentence Transformer Feature Extractor UDF

In [5]:
udf_check = cursor.query("DROP UDF IF EXISTS SentenceFeatureExtractor")
udf_check.execute()

# Download the SentenceFeatureExtractor UDF implementation
!wget -nc https://raw.githubusercontent.com/georgia-tech-db/eva/master/evadb/udfs/sentence_feature_extractor.py

udf = cursor.create_udf(
    "SentenceFeatureExtractor",
    True,
    "sentence_feature_extractor.py",
)
udf.execute()

File ‘sentence_feature_extractor.py’ already there; not retrieving.



<evadb.models.storage.batch.Batch at 0x7f0933699870>
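
The feature extractor registered above turns each paragraph into a fixed-length numeric vector, so that paragraphs can be compared numerically. As a rough illustration of that idea only, the sketch below uses a hashed bag-of-words stand-in. `toy_embed` and its `dim` parameter are hypothetical names, and this is not how `SentenceFeatureExtractor` or a sentence-transformer model computes its embeddings.

```python
import hashlib
import math
from typing import List


def toy_embed(text: str, dim: int = 16) -> List[float]:
    """Map text to a vector of length `dim` (hypothetical stand-in, not EvaDB code)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Use a stable hash so the same token always lands in the same bucket.
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # Scale to unit length; an all-zero vector is returned unchanged.
    return [v / norm for v in vec] if norm else vec


embedding = toy_embed("One America.")
print(len(embedding))  # 16
```

Whatever the input length, the output always has `dim` entries, which is the property a vector index relies on.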

#### Configure Dataframe Display

In [6]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

### Create Vector Index for PDF Paragraphs

In [7]:
# The index name is only a label; the index itself is built with Qdrant.
cursor.create_vector_index(
    "faiss_indexs",
    table_name="MyPDFs",
    expr="SentenceFeatureExtractor(data)",
    using="QDRANT",
).df()

| | 0 |
|---|---|
| 0 | Index faiss_indexs successfully added to the database. |


### Query PDFs by Similarity

In [8]:
query = (
            cursor.table("MyPDFs")
            .order(
                """Similarity(
                    SentenceFeatureExtractor('When was the NATO created?'), SentenceFeatureExtractor(data)
                )"""
            )
            .select("data")
        ).df()
query

| | mypdfs.data |
|---|---|
| 0 | That’s why the NATO Alliance was created to secure peace and stability in Europe after World War 2. |
| 1 | For that purpose we’ve mobilized American ground forces, air squadrons, and ship deployments to protect NATO countries including Poland, Romania, Latvia, Lithuania,and Estonia. |
| 2 | We spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin. |
| 3 | As I have made crystal clear the United States and our Allies will defend every inch of territory of NATO countries with the full force of our collective power. |
| 4 | He thought the West and NATO wouldn’t respond. And he thought he could divide us athome. Putin was wrong. We were ready. Here is what we did. |
| ... | ... |
| 512 | 3.4Storage and visualization |
| 513 | Get rid of outdated rules that stop doctors from prescribing treatments. And stop the flow of illicit drugs by working with state and local law enforcement to go after traffickers. |
| 514 | Heath’s widow Danielle is here with us tonight. They loved going to Ohio State football games. He loved building Legos with their daughter. |
| 515 | But cancer from prolonged exposure to burn pits ravaged Heath’s lungs and body. |
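
The result above lists the paragraphs closest to the question first and unrelated ones last. Below is a minimal sketch of that ranking step, assuming Euclidean distance as the measure and using made-up three-dimensional vectors in place of real embeddings. `rank_by_distance` is a hypothetical helper, not part of EvaDB.

```python
import math
from typing import Dict, List


def rank_by_distance(query: List[float], rows: Dict[str, List[float]]) -> List[str]:
    """Return row keys ordered from nearest to farthest from `query`."""
    return sorted(rows, key=lambda key: math.dist(query, rows[key]))


# Made-up vectors for illustration only; real embeddings are much longer.
paragraph_vectors = {
    "storage paragraph": [0.0, 0.1, 0.9],
    "NATO paragraph": [0.9, 0.1, 0.0],
    "coalition paragraph": [0.6, 0.4, 0.1],
}
question_vector = [1.0, 0.0, 0.0]

ranking = rank_by_distance(question_vector, paragraph_vectors)
print(ranking)
# ['NATO paragraph', 'coalition paragraph', 'storage paragraph']
```

Like the tutorial query, this orders every row. Slicing the result, for example `ranking[:2]`, keeps only the nearest matches.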
