# Similarity PDF Query Tutorial

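In this tutorial, we load a PDF into EvaDB, retrieve its text paragraph by paragraph, and ask a GPT4All model a question about it, first on arbitrary paragraphs and then on the paragraphs most similar to the question.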


### Start EVA server



In [1]:
from evadb.interfaces.relational.db import connect
conn = connect()
cursor = conn.cursor()

### Download PDFs

In [2]:
!wget -nc "https://www.dropbox.com/s/4q3bvne3m2vsu5g/state_of_the_union.pdf"

File ‘state_of_the_union.pdf’ already there; not retrieving.



### Load PDFs

In [2]:
drop_pdf = cursor.query("DROP TABLE IF EXISTS MyPDFs").execute()
load_pdf1 = cursor.load(file_regex="state_of_the_union.pdf", format="PDF", table_name="MyPDFs").execute()



### Retrieve Text from Loaded PDFs

In [3]:
cursor.table("MyPDFs").df()

| | mypdfs._row_id | mypdfs.name | mypdfs.page | mypdfs.paragraph | mypdfs.data |
|---|---|---|---|---|---|
| 0 | 1 | state_of_the_union.pdf | 1 | 1 | Madam Speaker, Madam Vice President, our First... |
| 1 | 1 | state_of_the_union.pdf | 1 | 2 | Last year COVID-19 kept us apart. This year we... |
| 2 | 1 | state_of_the_union.pdf | 1 | 3 | Tonight, we meet as Democrats Republicans and ... |
| 3 | 1 | state_of_the_union.pdf | 1 | 4 | With a duty to one another to the American peo... |
| 4 | 1 | state_of_the_union.pdf | 1 | 5 | And with an unwavering resolve that freedom wi... |
| ... | ... | ... | ... | ... | ... |
| 359 | 1 | state_of_the_union.pdf | 17 | 19 | Now is our moment to meet and overcome the cha... |
| 360 | 1 | state_of_the_union.pdf | 17 | 20 | And we will, as one people. |
| 361 | 1 | state_of_the_union.pdf | 17 | 21 | One America. |
| 362 | 1 | state_of_the_union.pdf | 17 | 22 | The United States of America. |
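
The table above has one row per paragraph, indexed by page and paragraph number. As a rough illustration of that layout only, the sketch below splits page strings into `(page, paragraph, text)` rows. The helper `split_into_paragraphs` and the blank-line splitting rule are assumptions made for this example; they are not how EvaDB's PDF loader is implemented.

```python
import re


def split_into_paragraphs(pages):
    """Return (page, paragraph, text) rows from a list of page strings.

    Hypothetical helper: paragraphs are assumed to be separated by blank lines.
    """
    rows = []
    for page_no, page_text in enumerate(pages, start=1):
        # Split on blank lines, then drop empty chunks left by trailing newlines.
        chunks = [c.strip() for c in re.split(r"\n\s*\n", page_text)]
        chunks = [c for c in chunks if c]
        for para_no, chunk in enumerate(chunks, start=1):
            rows.append((page_no, para_no, chunk))
    return rows


# Two made-up pages standing in for extracted PDF text.
pages = [
    "First paragraph.\n\nSecond paragraph.",
    "Only paragraph on page two.\n\n",
]
rows = split_into_paragraphs(pages)
for row in rows:
    print(row)
```

Numbering both pages and paragraphs from 1 matches the `mypdfs.page` and `mypdfs.paragraph` columns shown in the output.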


### Create Sentence Transformer Feature Extractor UDF

In [4]:
cursor.drop_udf("SentenceTransformerFeatureExtractor", if_exists=True).execute()
udf = cursor.create_udf(
    "SentenceTransformerFeatureExtractor",
    True,
    "../evadb/udfs/sentence_transformer_feature_extractor.py",
)
udf.execute()




<evadb.models.storage.batch.Batch at 0x162c621a0>

### Create GPT4AllQaUDF UDF

In [5]:
cursor.drop_udf("GPT4AllQaUDF", if_exists=True).execute()
udf = cursor.create_udf(
    "GPT4AllQaUDF",
    True,
    "../evadb/udfs/GPT4ALL.py",
)
udf.execute()



Found model file at  /Users/afaanansari/Desktop/gtech/eva/data/models/ggml-gpt4all-j-v1.3-groovy.bin
gptj_model_load: loading model from '/Users/afaanansari/Desktop/gtech/eva/data/models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: kv self size  =  896.00 MB
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285


<evadb.models.storage.batch.Batch at 0x2add707c0>

### Get Answers from the GPT4All UDF Without Filtering (First 3 Rows)

In [6]:
pdf_table_gpt4all = (
    cursor.table("MyPDFs")
    .cross_apply(
        "GPT4AllQaUDF(data,'When was the NATO created?')", "objs(answers)"
    )
    .limit(3)
).df()
pdf_table_gpt4all

 The North Atlantic Treaty Organization (NATO) is an alliance formed on April 4th 1949 to promote peace in Europe by providing military assistance against aggression during Cold War period, and today it has 36 member states that includes Canada , United States of America.
 The North Atlantic Treaty Organization (NATO) is an alliance form

| | mypdfs._row_id | mypdfs.name | mypdfs.page | mypdfs.paragraph | mypdfs.data | objs.answers |
|---|---|---|---|---|---|---|
| 0 | 1 | state_of_the_union.pdf | 1 | 1 | Madam Speaker, Madam Vice President, our First... | The North Atlantic Treaty Organization (NATO)... |
| 1 | 1 | state_of_the_union.pdf | 1 | 2 | Last year COVID-19 kept us apart. This year we... | The question is unclear and requires more con... |
| 2 | 1 | state_of_the_union.pdf | 1 | 3 | Tonight, we meet as Democrats Republicans and ... | The North Atlantic Treaty Organization (NATO)... |
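
Without filtering, the UDF is applied to the first three paragraphs, none of which mention NATO, so the answers cannot be grounded in the paragraph text. To make the role of the paragraph concrete, here is a minimal sketch of how a question-answering UDF might combine a paragraph and a question into one prompt. The function `build_prompt` and its template are hypothetical and are not taken from `GPT4ALL.py`.

```python
def build_prompt(context, question):
    """Combine one paragraph and a question into a single prompt string.

    Hypothetical template, for illustration only.
    """
    return (
        "Answer the question using only the context below.\n"
        f"Context: {context.strip()}\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


# A paragraph unrelated to the question gives the model nothing to quote from.
prompt = build_prompt(
    "Last year COVID-19 kept us apart.",
    "When was the NATO created?",
)
print(prompt)
```

With a template like this, the quality of the answer depends heavily on which paragraph is passed as context, which motivates the similarity filtering in the next section.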


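### Get Answers from the GPT4All UDF After Similarity Filtering (Top 3 Results)

Here the paragraphs are first ordered by `Similarity` between the question's features and each paragraph's features, both computed by the SentenceTransformerFeatureExtractor UDF. Only the top 3 paragraphs are then passed to GPT4AllQaUDF.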

In [7]:
pdf_table_similarity = (
    cursor.table("MyPDFs")
    .order(
        """Similarity(
                SentenceTransformerFeatureExtractor('When was the NATO created?'), SentenceTransformerFeatureExtractor(data)
            )"""
    )
    .limit(3)
)
pdf_table_gpt = (
    pdf_table_similarity.cross_apply(
        "GPT4AllQaUDF(data,'When was the NATO created?')", "objs(answers)"
    )
).df()
pdf_table_gpt

[Intermediate output, truncated: a two-column DataFrame of embeddings with rows 0 to 363. The first column repeats the same vector in every row; the second column, sentencetransformerfeatureextractor.features, holds a different vector per row.]

 The founding of NATO (North Atlantic Treaty Organization) on April 4th 1949
 The founding of NATO (North Atlantic Treaty Organization) on April 4th 1949 The North Atlantic Treaty Organization (NATO) is an international military alliance formed in 1949 that includes 29 member states from Europe, Canada, Turkey, United States and Norway.


| | mypdfs._row_id | mypdfs.name | mypdfs.page | mypdfs.paragraph | mypdfs.data | objs.answers |
|---|---|---|---|---|---|---|
| 0 | 1 | state_of_the_union.pdf | 1 | 17 | That’s why the NATO Alliance was created to se... | The founding of NATO (North Atlantic Treaty O... |
| 1 | 1 | state_of_the_union.pdf | 3 | 1 | For that purpose we’ve mobilized American grou... | The North Atlantic Treaty Organization (NATO)... |
| 2 | 1 | state_of_the_union.pdf | 2 | 1 | We spent months building a coalition of other ... | It is not specified when exactly Russia's act... |
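
The ordering step above keeps the three paragraphs whose features are closest to the question's features. The sketch below shows the same idea in plain Python on toy 2-D vectors. It assumes the score is a Euclidean distance where smaller means more similar; the helper `top_k_by_distance` and the vectors are made up for illustration, and EvaDB's `Similarity` function may compute its score differently.

```python
import math


def top_k_by_distance(query_vec, rows, k=3):
    """Return the texts of the k rows whose vectors are closest to query_vec.

    Assumption: closeness is Euclidean distance, smaller is more similar.
    """
    scored = [(math.dist(query_vec, vec), text) for text, vec in rows]
    scored.sort(key=lambda pair: pair[0])
    return [text for _, text in scored[:k]]


# Toy 2-D "embeddings"; real sentence embeddings have hundreds of dimensions.
rows = [
    ("paragraph about NATO", [0.9, 0.1]),
    ("paragraph about COVID-19", [0.1, 0.9]),
    ("paragraph about the coalition", [0.7, 0.3]),
    ("paragraph about the economy", [0.0, 1.0]),
]
query = [1.0, 0.0]
closest = top_k_by_distance(query, rows, k=2)
print(closest)
```

Ranking before calling the language model also limits the number of model invocations to `k`, instead of one per paragraph in the table.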
