# Query PDF Tutorial

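In this tutorial, we fetch the transcript of a YouTube video, save it as a PDF, load that PDF into EvaDB, and query it with text classification and summarization models.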

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/Yongqi099/evadb/blob/staging/apps/youtube_summary/summary.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run on Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/blob/master/tutorials/12-query-pdf.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> View source on GitHub</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/georgia-tech-db/eva/raw/master/tutorials/12-query-pdf.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
  </td>
</table><br><br>

### Connect to EvaDB


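In [1]:
# Install EvaDB with document support, plus the transcript and PDF helpers
%pip install --quiet "evadb[document,notebook]"
%pip install youtube_transcript_api
%pip install reportlab

import warnings
warnings.filterwarnings("ignore")

# Connect to EvaDB and get a cursor for running queries
import evadb
cursor = evadb.connect().cursor()

# Colab only: enable custom widgets (drop these two lines outside Colab)
from google.colab import output
output.enable_custom_widget_manager()

# Transcript download
from youtube_transcript_api import YouTubeTranscriptApi

# PDF generation
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet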



### Video Link

In [13]:
# replace with your video URL
video_url = 'https://www.youtube.com/watch?v=dQw4w9WgXcQ&pp=ygUNbmV2ZXIgZ2l2ZSB1cA%3D%3D'

### Get YouTube Transcript

In [14]:
# Check if the video URL starts with 'https://www.youtube.com/watch'
if not video_url.startswith('https://www.youtube.com/watch'):
    raise ValueError("Invalid video URL. It should start with 'https://www.youtube.com/watch'")

# Extract the video ID: the value of the 'v' parameter, without any parameters that follow it
video_id = video_url.split('v=')[1].split('&')[0]

# Get the transcript of the video
transcript = YouTubeTranscriptApi.get_transcript(video_id)
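A more defensive way to pull the ID out of a URL is to parse it with the standard library rather than splitting on characters. The sketch below is not part of the original notebook: `extract_video_id` is a hypothetical helper, and the set of accepted hosts is an assumption. It handles regular watch URLs as well as `youtu.be` short links.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url: str) -> str:
    """Return the video ID from a YouTube watch URL or a youtu.be short link.

    Hypothetical helper for illustration; the accepted hosts are an assumption.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    if host in ("www.youtube.com", "youtube.com", "m.youtube.com") and parsed.path == "/watch":
        # Watch URLs carry the ID in the 'v' query parameter.
        values = parse_qs(parsed.query).get("v", [])
        if values:
            return values[0]
    elif host == "youtu.be":
        # Short links carry the ID as the first path segment.
        short_id = parsed.path.lstrip("/").split("/")[0]
        if short_id:
            return short_id

    raise ValueError(f"Could not find a video ID in {url!r}")


example_url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ&pp=ygUNbmV2ZXIgZ2l2ZSB1cA%3D%3D"
print(extract_video_id(example_url))
```

The returned string can be passed to `YouTubeTranscriptApi.get_transcript` in place of `video_id`.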

In [20]:
# Combine all the parts into a single string
full_transcript = " ".join([part['text'] for part in transcript])
# Write the transcript to a text file
with open('transcript.txt', 'w') as f:
    f.write(full_transcript)

# Write the transcript to a PDF file
doc = SimpleDocTemplate("pdf_sample1.pdf")
styles = getSampleStyleSheet()
story = [Paragraph(full_transcript, styles["BodyText"])]
doc.build(story)
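The cell above writes the whole transcript as a single paragraph, so later filters on `paragraph` have only one row to match. A possible variation, sketched below, splits the transcript into fixed-size chunks first. `split_into_paragraphs` and the chunk size are hypothetical choices, not part of the original notebook. The chunks are also XML-escaped, because reportlab's `Paragraph` interprets markup characters such as `&` and `<` in its input.

```python
from xml.sax.saxutils import escape


def split_into_paragraphs(text: str, words_per_paragraph: int = 80) -> list[str]:
    """Split text into chunks of at most `words_per_paragraph` words.

    Hypothetical helper; the default chunk size is an arbitrary assumption.
    """
    if words_per_paragraph < 1:
        raise ValueError("words_per_paragraph must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + words_per_paragraph])
        for i in range(0, len(words), words_per_paragraph)
    ]


sample = "we're no strangers to love you know the rules & so do I"
# Escape each chunk so that '&', '<' and '>' are treated as plain text.
chunks = [escape(chunk) for chunk in split_into_paragraphs(sample, words_per_paragraph=5)]
for chunk in chunks:
    print(chunk)
```

Each chunk can then be wrapped as `Paragraph(chunk, styles["BodyText"])` and appended to `story`. How EvaDB numbers paragraphs on load depends on its PDF reader, so it is worth checking the `paragraph` column after loading.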

### Download PDFs

In [19]:
# Optional: uncomment to download a sample PDF instead of generating one from the transcript
# !wget -nc "https://www.dropbox.com/s/fv6pqdneth3l6fz/pdf_sample1.pdf"

### Load PDFs

In [21]:
cursor.query("DROP TABLE IF EXISTS MyPDFs").df()
cursor.query("LOAD PDF 'pdf_sample1.pdf' INTO MyPDFs").df()

Unnamed: 0,0
0,Number of loaded PDF: 1


### Retrieve Text from Loaded PDFs

In [22]:
cursor.query("SELECT * FROM MyPDFs").df()

Unnamed: 0,mypdfs._row_id,mypdfs.name,mypdfs.page,mypdfs.paragraph,mypdfs.data
0,1,pdf_sample1.pdf,1,1,[Music] we're no strangers to love you know th...


In [23]:
cursor.query("""
    SELECT *
    FROM MyPDFs
    WHERE page = 1 AND paragraph = 3
""").df()

### Create UDFs for Text Classification and Text Summarization

In [24]:
cursor.query("""
    CREATE FUNCTION IF NOT EXISTS TextClassifier
    TYPE HuggingFace
    TASK 'text-classification'
    MODEL 'distilbert-base-uncased-finetuned-sst-2-english'
""").df()

Unnamed: 0,0
0,"Function TextClassifier already exists, nothin..."


In [7]:
cursor.query("""
    CREATE FUNCTION IF NOT EXISTS TextSummarizer
    TYPE HuggingFace
    TASK 'summarization'
    MODEL 'facebook/bart-large-cnn'
""").df()


Your max_length is set to 142, but your input_length is only 11. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=5)


Unnamed: 0,0
0,Function TextSummarizer added to the database.


### Get Summaries of a Subset of Paragraphs with Negative Sentiment

In [10]:
cursor.query("""
    SELECT data, TextSummarizer(data)
    FROM MyPDFs
    WHERE page = 1 AND paragraph >= 1 AND paragraph <= 3 AND TextClassifier(data).label = 'NEGATIVE'
""").df()

Your max_length is set to 142, but your input_length is only 97. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=48)


Unnamed: 0,mypdfs.data,textsummarizer.summary_text
0,DEFINATION  Specialized connective tissue wit...,Specialized connective tissue with fluid matri...
1,PHYSICAL CHARACTERISTICS ( 1 ) COLOUR -- R...,The temperature is 38° C / 100.4° F. The body ...
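To rerun the query above for a different page, paragraph range, or sentiment label, the query text can be assembled by a small function. This is only a sketch: `build_summary_query` is a hypothetical helper, it assumes the `TextSummarizer` and `TextClassifier` functions have already been created, and it restricts labels to `POSITIVE` and `NEGATIVE`, the two labels the SST-2 classifier produces.

```python
def build_summary_query(
    table: str,
    page: int,
    first_paragraph: int,
    last_paragraph: int,
    label: str = "NEGATIVE",
) -> str:
    """Build the summarize-by-sentiment query text (hypothetical helper)."""
    if not table.isidentifier():
        raise ValueError(f"Unexpected table name: {table!r}")
    if label not in ("POSITIVE", "NEGATIVE"):
        raise ValueError(f"Unexpected label: {label!r}")
    if first_paragraph > last_paragraph:
        raise ValueError("first_paragraph must not exceed last_paragraph")

    # int() guards against non-numeric values ending up in the query text.
    return (
        f"SELECT data, TextSummarizer(data) "
        f"FROM {table} "
        f"WHERE page = {int(page)} "
        f"AND paragraph >= {int(first_paragraph)} AND paragraph <= {int(last_paragraph)} "
        f"AND TextClassifier(data).label = '{label}'"
    )


query = build_summary_query("MyPDFs", page=1, first_paragraph=1, last_paragraph=3)
print(query)
```

The resulting string can be run with `cursor.query(query).df()`, as in the cells above.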
