
In [None]:
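from PyPDF2 import PdfReader  # text extraction from digitally created PDFs
from PIL import Image         # opens rendered page images for OCR
import pytesseract            # Python wrapper around the Tesseract OCR engine
import fitz                   # PyMuPDF; renders PDF pages to images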

# **Text Extraction from PDFs**

**Text can be extracted from PDF files in two ways: by parsing a digitally created PDF directly, or by running Optical Character Recognition (OCR) on a scanned PDF. A digitally native PDF is created on a computer with a program such as Adobe Acrobat, so its text is stored in the file. A scanned PDF is produced by scanning a paper document, so each page is stored as an image. Digitally native PDFs can be parsed with a library such as PyPDF2, while scanned documents need an OCR engine such as Tesseract, used here through the pytesseract wrapper.**

<sup>Source: [Digitally-born vs Scanned PDF files](https://pypdf2.readthedocs.io/en/3.0.0/user/extract-text.html#digitally-born-vs-scanned-pdf-files) from PyPDF2's documentation</sup>
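
**When it is not known in advance which kind of PDF a file is, one rough check is to try normal text extraction first and fall back to OCR when almost nothing comes back. The sketch below is a heuristic under stated assumptions, not part of PyPDF2: `needs_ocr`, `extract_page_texts` and the `min_chars` threshold are hypothetical names and values chosen for illustration.**

```python
def needs_ocr(page_texts, min_chars=20):
    """Return True when the extracted text looks too sparse to be a real text layer.

    page_texts: list of strings, one per page (for example from extract_text()).
    min_chars:  assumed average number of characters per page; tune it for your files.
    """
    total = sum(len(text.strip()) for text in page_texts)
    return total < min_chars * max(len(page_texts), 1)


def extract_page_texts(pdf_path):
    """Collect the text of every page with PyPDF2."""
    # Imported here so needs_ocr() can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

**A call such as `needs_ocr(extract_page_texts(path))` would then decide which route to take. A scanned PDF that has already been run through OCR carries a text layer, so this check would treat it as a digitally native file.**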

## **Extracting Text from a Natively Created PDF**

In [None]:
native_pdf = "Data Science with Python! Extracting Metadata from Images.pdf"

In [None]:
reader = PdfReader(native_pdf)

pdf_string_1 = ''

# Append the text of each page in order
for page in reader.pages:
    pdf_string_1 += page.extract_text()

In [None]:
print(pdf_string_1)

## **Extracting Text from a Scanned PDF**

### **Optical Character Recognition (OCR)**

**Optical Character Recognition (OCR) converts images of typed, handwritten or printed text into machine-encoded text.**

<sup>Source: [What Is OCR (Optical Character Recognition)?](https://aws.amazon.com/what-is/ocr/) from AWS</sup>

In [None]:
scanned_pdf = 'kmeans_algorithm.pdf'

In [None]:
reader_2 = PdfReader(scanned_pdf)

pdf_string_2 = ''

# A scanned PDF stores each page as an image, so extract_text() is expected to return little or no text
for page in reader_2.pages:
    pdf_string_2 += page.extract_text()

In [None]:
print(pdf_string_2)

In [None]:
doc = fitz.open(scanned_pdf)

In [None]:
# Render each page to a PNG image (page-0.png, page-1.png, ...) for OCR
for page in doc:
    pix = page.get_pixmap()
    pix.save(f"page-{page.number}.png")
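
**Called without arguments, `get_pixmap()` renders at the PDF's base resolution of 72 dpi, which can be too coarse for small print. OCR results often improve with a higher resolution, so the sketch below scales the page before rendering. The helper names `zoom_for_dpi` and `render_pages` and the 300 dpi default are assumptions for illustration; only `fitz.open`, `fitz.Matrix`, `get_pixmap` and `save` come from PyMuPDF.**

```python
PDF_POINTS_PER_INCH = 72  # PDF page sizes are measured in points, 72 per inch


def zoom_for_dpi(dpi):
    """Scale factor that turns the 72 dpi base resolution into the requested dpi."""
    return dpi / PDF_POINTS_PER_INCH


def render_pages(pdf_path, dpi=300):
    """Save every page as page-N.png at the given resolution and return the file names."""
    # PyMuPDF; imported here so zoom_for_dpi() can be used without it installed.
    import fitz

    zoom = zoom_for_dpi(dpi)
    saved = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # The matrix scales both axes by the same factor.
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            name = f"page-{page.number}.png"
            pix.save(name)
            saved.append(name)
    return saved
```

**Higher resolutions produce larger image files and slower OCR, so the value is a trade-off worth testing on a sample page.**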

In [None]:
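# Only needed when the Tesseract executable is not on PATH; this is a typical Windows install location
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'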
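
**The hard-coded path above only exists on a Windows machine with that install location. On Linux, macOS or Colab the executable is normally found on `PATH`. A minimal sketch that prefers whatever is on `PATH` and only falls back to a fixed location is shown below; `resolve_tesseract_cmd` and `WINDOWS_DEFAULT` are hypothetical names, and the fallback path is an assumption about where Tesseract was installed.**

```python
import shutil

# Assumed install location; adjust it to match your machine.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(fallback=WINDOWS_DEFAULT):
    """Prefer a tesseract executable found on PATH, otherwise return the fallback path."""
    return shutil.which("tesseract") or fallback
```

**The result can then be assigned with `pytesseract.pytesseract.tesseract_cmd = resolve_tesseract_cmd()`.**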

In [None]:
# Run OCR on the first rendered page only
ocrd_text = pytesseract.image_to_string(Image.open('page-0.png'))

In [None]:
print(ocrd_text)
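
**The cell above reads only `page-0.png`. For a multi-page scan, the same call can be repeated over every rendered image. The sketch below assumes the `page-N.png` naming used earlier; `page_number`, `sorted_page_images` and `ocr_images` are hypothetical helper names. The files are sorted by page number because a plain alphabetical sort would place `page-10.png` before `page-2.png`.**

```python
from pathlib import Path


def page_number(path):
    """Extract N from a file name like 'page-N.png'."""
    return int(Path(path).stem.split("-")[-1])


def sorted_page_images(paths):
    """Order image paths by page number rather than alphabetically."""
    return sorted(paths, key=page_number)


def ocr_images(paths):
    """Run Tesseract on each image in page order and join the results."""
    # Imported here so the sorting helpers can be used without these packages installed.
    from PIL import Image
    import pytesseract

    texts = []
    for path in sorted_page_images(paths):
        with Image.open(path) as img:
            texts.append(pytesseract.image_to_string(img))
    return "\n".join(texts)
```

**For example, `ocr_images(Path(".").glob("page-*.png"))` would collect the text of every rendered page in the working directory.**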

# **References and Additional Learning**

## **Modules**

- **[Digitally-born vs Scanned PDF files](https://pypdf2.readthedocs.io/en/3.0.0/user/extract-text.html#digitally-born-vs-scanned-pdf-files) from PyPDF2's documentation**

- **[Install Tesseract OCR on Linux](https://linuxhint.com/install-tesseract-ocr-linux/) by David Adams on linuxhint.com**

- **[Installing Tesseract on Mac](https://www.oreilly.com/library/view/building-computer-vision/9781838644673/95de5b35-436b-4668-8ca2-44970a6e2924.xhtml) on O'Reilly**

- **[PyPDF2 Documentation](https://github.com/py-pdf/PyPDF2) on GitHub.com**

- **[pytesseract Documentation](https://github.com/madmaze/pytesseract) on GitHub.com**

- **[Tesseract Download at UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki) on GitHub.com**

## **Videos**

- **[Data Science with Python! Extracting Metadata from a PDF!](https://www.youtube.com/watch?v=DfY0-VdSumo) by Adrian Dolinay**

- **[Encrypting and Decrypting PDFs with Python!](https://www.youtube.com/watch?v=_b96G79IahQ&ab_channel=AdrianDolinay) by Adrian Dolinay**

- **[Merging PDFs with Python!](https://www.youtube.com/watch?v=CMEHOjQksUs) by Adrian Dolinay**

## **Website**

- **[What Is OCR (Optical Character Recognition)?](https://aws.amazon.com/what-is/ocr/) from AWS**

# **Connect**
- **Feel free to connect with Adrian on [YouTube](https://www.youtube.com/channel/UCPuDxI3xb_ryUUMfkm0jsRA), [LinkedIn](https://www.linkedin.com/in/adrian-dolinay-frm-96a289106/), [Twitter](https://twitter.com/DolinayG), [GitHub](https://github.com/ad17171717), [Medium](https://adriandolinay.medium.com/) and [Odysee](https://odysee.com/@adriandolinay:0). Happy coding!**