<a href="https://colab.research.google.com/github/bhavika67/NLP/blob/main/pdf_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extracting Text From PDF

Extracting text from PDFs in Python can be done with several libraries, each with its own strengths and weaknesses. Here are some of the most popular methods.

**Method 1: PyPDF2**


**Pros:** Pure Python library, easy to install, handles basic text extraction well.

**Cons:** Doesn't preserve the structure of the PDF and struggles with complex layouts and tables.

In [None]:
# Install and import the package
!pip install PyPDF2
import PyPDF2
from PyPDF2 import PdfReader  # PdfFileReader is deprecated in PyPDF2 3.x; PdfReader replaces it

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
   232.6/232.6 kB 3.7 MB/s eta 0:00:00
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [None]:
pdf = r'/content/drive/MyDrive/A_Tempest_of_Tea_-_Hafsah_Faizal.pdf'
pdf_reader = PyPDF2.PdfReader(pdf)
print(len(pdf_reader.pages))  # number of pages in the PDF
page = pdf_reader.pages[70]  # pages are zero-indexed, so this is the 71st page
print(page.extract_text())  # extract and print that page's text

313
“Got it!” the boy yelled, and took of f down the street.
“You—” Flick started. Lied. She was certain he had lied, but he spoke
with such cando r that even Flick had a hard time not believing him. “He
believed you!”
“It’s a skill, love,” Jin said, straightening.
“But you lied to him. That was positively churlish. You’ve ruined his
day!” she reprim anded, and Jin tossed something at her. It was the boy’s
toffee. “Jin! You stole that from him!”
“Sticky fingers, sorry ,” he said, not looking the least bit sorry .
“You—you’re—”
He lifted his brows. “Superbly striking or savagely clever? That boy
was a crook. He filches for a smaller gang.”
“That was a child,” she proteste d, swallowing her frustration when he
sighed. “And I—I don’ t forge anymore.”
The words burst out of her , surprising them both.
Jin’s forehead scrunched. “But you’re good at it.”
The raw honesty in his voice struck her. She was good at it. Flick had
been searching all her life—for what, she didn’ t know . She thought 
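
The output above shows the kind of spacing artifacts PyPDF2 can leave behind, such as `sorry ,` and `don’ t`. A minimal post-processing sketch follows, assuming only these two patterns need repair. The function name `tidy_extracted_text` and the suffix list are illustrative choices, not part of PyPDF2, and the heuristic cannot rejoin words split mid-token such as `of f` or `cando r`.

```python
import re

# Suffixes that commonly follow an apostrophe in English contractions and possessives.
# This list is an assumption made for illustration; extend it as needed.
_CONTRACTION_SUFFIXES = r"(?:t|s|d|m|ll|re|ve)"


def tidy_extracted_text(text: str) -> str:
    """Heuristically repair two spacing artifacts seen in extracted text."""
    # "sorry ," -> "sorry,": drop spaces or tabs that sit directly before punctuation.
    text = re.sub(r"[ \t]+(?=[,.;:!?])", "", text)
    # "don’ t" -> "don’t": rejoin a contraction that was split after its apostrophe.
    text = re.sub(rf"(\w[’'])[ \t]+({_CONTRACTION_SUFFIXES})\b", r"\1\2", text)
    return text


sample = "“Sticky fingers, sorry ,” he said. She didn’ t know ."
print(tidy_extracted_text(sample))
```

Because the rules are regex heuristics, they can misfire on unusual text (for example, a decimal written as ` .5`), so it is worth spot-checking the result on a few pages before applying it to a whole book.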

In [None]:
pdf = open("/content/drive/MyDrive/A_Tempest_of_Tea_-_Hafsah_Faizal.pdf", "rb")
pdf_reader = PyPDF2.PdfReader(pdf)
print(pdf_reader.pages)  # prints the lazy page-list object, not the page count; use len(pdf_reader.pages) for the count
page = pdf_reader.pages[0]  # first page (index 0)
print(page.extract_text())  # prints nothing if the page has no extractable text (for example, an image-only cover)
pdf.close()

<PyPDF2._page._VirtualList object at 0x7d5a5d2d8ee0>



In [None]:
from PyPDF2 import PdfReader

# Open the file in a context manager so it is closed automatically
with open("/content/drive/MyDrive/A_Tempest_of_Tea_-_Hafsah_Faizal.pdf", 'rb') as f:
    pdf = PdfReader(f)  # PdfReader accepts an open binary file object
    for page in pdf.pages:  # iterate over every page in order
        text = page.extract_text()
        print(text)


Streaming output truncated to the last 5000 lines.
She couldn’ t expect to be convincing by twiddling her thumb s at the
door. It was the first requirement of any good swindle: Make yourself
belong. She picked up the little brass bell from the edge of his desk and
rang it.
Lambard looked distressed.
“I thought we were settled,” the balding official said. “After last time,
there—”
“Did you stop finding extra duvin in your pockets?” Arthie asked. His
mustache twitched with her disrespect, but his silence was what she’d
expected. “As I thought. I need your Athereum marker .”
“You cannot have it!” he blustered with unfounded indignation. “I
refuse to do any such thing.”
Arthie let his words echo in the silence. The four wood-paneled walls
spewed his words back at him until he shrank into his curved chair .
“I was more expecting you to deny being an Athereum member or even
a vampire, but you’re as pathetic as ever. I do appreciate you saving me the
time, however , so there’ s 
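
Printing a whole book runs into Colab's output limit, which is why only the last 5000 lines appear above. One way around this is to write the text to a file instead. The sketch below assumes each page object exposes `extract_text()`, as PyPDF2 pages do. The names `save_pages_to_text` and `save_pdf_text` and the `--- page N ---` separator are illustrative choices, not part of any library.

```python
from typing import Iterable


def save_pages_to_text(pages: Iterable, out_path: str) -> int:
    """Write each page's text to out_path in order and return the number of pages written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for number, page in enumerate(pages, start=1):
            # Tolerate pages that yield an empty or missing result.
            text = page.extract_text() or ""
            out.write(f"--- page {number} ---\n{text}\n")
            count = number
    return count


def save_pdf_text(pdf_path: str, out_path: str) -> int:
    """Hypothetical convenience wrapper: extract every page of pdf_path with PyPDF2."""
    from PyPDF2 import PdfReader  # imported here so the helper above works without PyPDF2

    with open(pdf_path, "rb") as f:
        return save_pages_to_text(PdfReader(f).pages, out_path)
```

Calling `save_pdf_text` with the PDF path used above and an output path of your choice would produce a single text file that can be opened or downloaded in full, with no truncation.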

**Method 2: PyMuPDF (fitz)**

**Pros:** Fast, supports images and more complex layouts, preserves text structure better than PyPDF2.

**Cons:** Not pure Python: it wraps the MuPDF C library. The `pip` wheels normally bundle MuPDF, so a separate install is usually unnecessary.

In [None]:
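!pip install pymupdf
import fitz  # PyMuPDF is imported under the name fitz

def extract_text_from_pdf(pdf_path):
    # Open the document at the given path and concatenate the text of every page
    with fitz.open(pdf_path) as doc:
        text = ""
        for page in doc:
            text += page.get_text()
        return text

# Pass the full path, since the function now uses its argument
text = extract_text_from_pdf('/content/drive/MyDrive/A_Tempest_of_Tea_-_Hafsah_Faizal.pdf')
print(text)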

Streaming output truncated to the last 5000 lines.
mustache twitched with her disrespect, but his silence was what she’d
expected. “As I thought. I need your Athereum marker.”
“You cannot have it!” he blustered with unfounded indignation. “I
refuse to do any such thing.”
Arthie let his words echo in the silence. The four wood-paneled walls
spewed his words back at him until he shrank into his curved chair.
“I was more expecting you to deny being an Athereum member or even
a vampire, but you’re as pathetic as ever. I do appreciate you saving me the
time, however, so there’s that.”
He shook his head, too embarrassed to even try saving face. “I—”
“You forget, Clayton Lambard, whom you speak to. You won’t even
need to use what little brain you have left to give it to me.”
“You’re a monster,” Lambard whispered.
“A monster would shove a barrel down your throat until you handed it
over. I like to think I’m quite civilized.” Arthie regarded her fingernails.
“Oh, and I also don’t 

**Method 3: PDFMiner.six**

**Pros:** Excellent for extracting text from PDFs with complex layouts.

**Cons:** Can be slower than other libraries.
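
Relative speed depends on the document, so the claim is worth measuring on your own files. The helper below is a minimal timing sketch, assuming each library is wrapped in a function that takes a path and returns the extracted text. The name `time_extractors` and the dictionary keys are hypothetical, and no timings are claimed here.

```python
import time
from typing import Callable, Dict


def time_extractors(
    extractors: Dict[str, Callable[[str], str]],
    pdf_path: str,
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the best wall-clock time in seconds for each extractor over `repeats` runs."""
    results: Dict[str, float] = {}
    for name, extract in extractors.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            extract(pdf_path)  # the returned text is discarded; only the duration matters
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results
```

For example, passing `{"pymupdf": extract_with_pymupdf, "pdfminer": extract_with_pdfminer}` (two wrapper functions you would define yourself) together with the PDF path returns a dictionary of timings that can be compared directly. Taking the best of several runs reduces noise from disk caching and other notebook activity.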

In [None]:
!pip install pdfminer.six
from pdfminer.high_level import extract_text

def extract_text_from_pdf(pdf_path):
    # extract_text reads the whole document and returns its text as one string
    text = extract_text(pdf_path)
    return text

# Provide the full path to the file
text = extract_text_from_pdf('/content/drive/MyDrive/A_Tempest_of_Tea_-_Hafsah_Faizal.pdf')
print(text)

Streaming output truncated to the last 5000 lines.

The night gusted between them. She drew her lip between her teeth.
Was it the shift of the moon, or did his eyes darken? Was it the clock
that thrummed down her body, or was it something else? Arthie let herself
linger  in  the  heat  flooding  through  her  limbs,  feeling  powerful  when  his
gaze dropped to her mouth.

She shoved him off her, immediately curling her fists against the feel of
his  skin  beneath  his  clothes,  the  solidity  of  his  chest.  It  wasn’t  hard  to
conjure  the  image  of  him  at  his  apartment,  his  torso  bare,  skin  glistening
with  stray  drops  of  water.  He  had  discarded  his  robes  today  in  favor  of  a
linen shirt in royal blue that fit snug across his shoulders with billowy pants
slung  low.  He  looked  even  more  bare  without  his  kitten  curled  on  his
shoulder.

Arthie  rounded  to  the  other  side  of  the  balcony,  biting  her  tongue  to

force herself bac
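
The PDFMiner.six output above keeps the book's justified layout, so some lines contain doubled spaces and each paragraph is split across hard line breaks. A small normalization sketch follows, assuming blank lines mark paragraph boundaries as they appear to in this output. The function name `normalize_whitespace` is an illustrative choice.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse whitespace runs and rejoin hard-wrapped lines into one line per paragraph."""
    # Treat one or more blank lines as a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for paragraph in paragraphs:
        # Any run of whitespace, including a single line break, becomes one space.
        cleaned.append(re.sub(r"\s+", " ", paragraph).strip())
    # Drop paragraphs that were empty and separate the rest with one blank line.
    return "\n\n".join(p for p in cleaned if p)


sample = "linger  in  the  heat\ngaze dropped to her mouth.\n\nShe shoved him off her"
print(normalize_whitespace(sample))
```

Where a blank line is only a page break in the middle of a paragraph, this sketch will still split the paragraph there, so a stricter rule may be needed for clean prose.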