In this file, we will go through how to extract text from PDF files.

Extracting text from a PDF is not an easy task, and there is a lot to do. To help, we will use the Python package [pdf2image](https://pypi.org/project/pdf2image/), which can be installed with `pip`: `pip install pdf2image`.

The biggest challenge in extracting text from PDF files is that they come in different formats. So we first need a function that converts every PDF into one format we can work with (here, images).

Let’s start by importing all the packages. 
* We need `pdf2image` to convert **PDF files to ppm image files**.
* We also need to join paths and rename files, so we import the `os` and `sys` packages.

We also need `Image` from the `PIL` library to open each image and `pytesseract` to extract the text from it. First, install the Python packages if they are missing:

In [11]:
# !pip install pdf2image
# !pip install pytesseract

* Download the **Poppler binary** from [here](http://blog.alivate.com.au/poppler-windows/), extract the archive, and add its `bin` folder to the `PATH` environment variable.
* Download and install Tesseract from [here](https://github.com/UB-Mannheim/tesseract/wiki).
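
Before going further, it can help to confirm that both external programs are reachable from Python. The following is a minimal sketch, assuming that `pdftoppm` (the Poppler tool that `pdf2image` calls) and `tesseract` (the program that `pytesseract` calls by default) are the executable names on your system. The helper name `find_missing_tools` is hypothetical and not part of either package.

```python
import shutil

def find_missing_tools(tools=("pdftoppm", "tesseract")):
    """Return the names of the given executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = find_missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Poppler and Tesseract are both reachable.")
```

If a tool is reported as missing, check the environment path set in the steps above and restart the notebook kernel so the change is picked up.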

In [12]:
#Import all packages

import pdf2image
import os, sys
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

# import docxpy  # use this package when extracting text from Word files

Now we set the path to our documents and initialize the counter that the `pdf_extract` function uses later to number the documents in the folder:

In [13]:
#initialize the path to your documents

# PATH = 'C:\Waqas Documents\Waqas\W.A\Programming\Data Sciecnce\With Python\Aman Kharwal\Data Science Projects -Advanced'

PATH = './pdf/'

#initialize the counter that we will use later in our pdf extraction function
i = 1

Next, we need to delete some leftover files from our PDF folder. For this, we create a new function:

In [14]:
# This function deletes all .ppm and .DS_Store files from the folder
def delete_ppms():
  for file in os.listdir(PATH):
    # match on the extension so names that merely contain '.ppm' are left alone
    if file.endswith('.ppm') or file == '.DS_Store':
      try:
        os.remove(os.path.join(PATH, file))
      except FileNotFoundError:
        pass
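
To see the cleanup step in isolation, here is a small sketch that runs the same idea against a throwaway folder, so no real documents are touched. It is a variation rather than the notebook's own function: it uses `pathlib` and takes the folder as an argument, and the name `delete_temp_files` is hypothetical.

```python
import tempfile
from pathlib import Path

def delete_temp_files(folder):
    """Delete .ppm page images and .DS_Store files in `folder`; return the deleted names."""
    deleted = []
    for path in Path(folder).iterdir():
        if path.is_file() and (path.suffix == ".ppm" or path.name == ".DS_Store"):
            path.unlink()
            deleted.append(path.name)
    return sorted(deleted)

# Try it on a temporary folder that is removed again afterwards
with tempfile.TemporaryDirectory() as tmp:
    for name in ("page-1.ppm", ".DS_Store", "sample.pdf"):
        (Path(tmp) / name).touch()
    removed = delete_temp_files(tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(removed)    # ['.DS_Store', 'page-1.ppm']
print(remaining)  # ['sample.pdf']
```

Passing the folder as a parameter instead of reading the global `PATH` makes the function easy to try on a test folder first.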

Now we need to sort the documents in the folder by type. We start by creating two lists, one for PDF files and one for Docx files, because these are the two most common document types:

In [15]:
# initialize lists for each document type
pdf_files = []
docx_files = []

# append document names into the lists by their extension type
for f in os.listdir(PATH):
  full_name = os.path.join(PATH, f) 
  if os.path.isfile(full_name):
    name = os.path.basename(f)
    filename, ext = os.path.splitext(name)
    if ext == '.pdf':
      pdf_files.append(name)
    elif ext == '.docx':
      docx_files.append(name)

In [16]:
print(pdf_files, docx_files)

['sample.pdf'] []
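
The same grouping can be written more compactly with `pathlib` and a dictionary keyed by extension. The sketch below is one possible variation, not the notebook's own code: the function name `group_by_extension` is hypothetical, and, unlike the cell above, it compares extensions case-insensitively so that `REPORT.PDF` is also picked up.

```python
import tempfile
from collections import defaultdict
from pathlib import Path

def group_by_extension(folder, extensions=(".pdf", ".docx")):
    """Map each wanted extension to the sorted file names in `folder` that have it."""
    groups = defaultdict(list)
    for path in Path(folder).iterdir():
        ext = path.suffix.lower()  # treat 'REPORT.PDF' like 'report.pdf'
        if path.is_file() and ext in extensions:
            groups[ext].append(path.name)
    return {ext: sorted(groups[ext]) for ext in extensions}

# Demonstrate on a temporary folder with a few empty files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample.pdf", "REPORT.PDF", "notes.docx", "readme.txt"):
        (Path(tmp) / name).touch()
    groups = group_by_extension(tmp)

print(groups)  # {'.pdf': ['REPORT.PDF', 'sample.pdf'], '.docx': ['notes.docx']}
```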


Now we can finally extract text from PDF files. Here is the `pdf_extract` function. 
* First, it prints the name of each file from which the text is extracted. 
* Depending on the size of the document, extracting text may take some time.

In [17]:
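# This function converts a PDF to images and then extracts text from the images
def pdf_extract(file, i):
  print("extracting from file:", file)
  delete_ppms()
  # pdf_files holds bare file names, so build the full path before converting
  images = pdf2image.convert_from_path(os.path.join(PATH, file), output_folder=PATH)  # returns a list
  for im in images:
    im.close()  # close the files so they can be renamed
  # rename the generated pages to image<i>-<j>.ppm, keeping the page order
  j = 0
  for page in sorted(os.listdir(PATH)):
    if page.endswith('.ppm') and not page.startswith('image'):
      os.rename(os.path.join(PATH, page), os.path.join(PATH, 'image{}-{}.ppm'.format(i, j)))
      j += 1
  # On Windows, point pytesseract at the Tesseract executable; adjust this path for your machine
  pytesseract.pytesseract.tesseract_cmd = r'C:\Users\Waqas.Ali\AppData\Local\Programs\Tesseract-OCR\tesseract.exe'
  pages = [p for p in os.listdir(PATH) if p.endswith('.ppm')]
  with open(os.path.join(PATH, 'result{}.txt'.format(i)), 'w', encoding='utf-8') as out:
    # sort by the page index after the '-' so that page 10 comes after page 9
    for page in sorted(pages, key=lambda x: int(x[x.index('-') + 1: x.index('.')])):
      with Image.open(os.path.join(PATH, page)) as im:
        out.write(pytesseract.image_to_string(im))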
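
The `key` passed to `sorted` in `pdf_extract` is easier to follow when pulled out on its own. The sketch below shows why it is needed: a plain string sort puts page 10 before page 2, while sorting on the integer between the `-` and the `.` keeps the pages in reading order. It assumes file names of the form `image<i>-<j>.ppm`, as produced by the rename step, and the helper name `page_number` is hypothetical.

```python
def page_number(filename):
    """Return the integer between the first '-' and the first '.' in the name."""
    return int(filename[filename.index('-') + 1: filename.index('.')])

names = ['image1-10.ppm', 'image1-2.ppm', 'image1-0.ppm']

by_string = sorted(names)                  # plain string sort
by_page = sorted(names, key=page_number)   # numeric sort on the page index

print(by_string)  # ['image1-0.ppm', 'image1-10.ppm', 'image1-2.ppm']
print(by_page)    # ['image1-0.ppm', 'image1-2.ppm', 'image1-10.ppm']
```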

Now, we can use our function to extract text from all the PDF files using Python:

In [18]:
# Run pdf_extract on each document in the pdf_files list, numbering the results from 1
for i, pdf_file in enumerate(pdf_files, start=1):
  pdf_extract(pdf_file, i)

extracting from file: sample.pdf


After running the loop, the `./pdf/` directory contains a text file named `result1.txt` with all the text extracted from the PDF file.
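
As a variation on the approach above, the intermediate `.ppm` files can be skipped entirely. When `pdf2image.convert_from_path` is called without `output_folder`, it returns the pages as in-memory PIL images, and `pytesseract.image_to_string` accepts such an image directly. The sketch below is a minimal version under those assumptions, not the notebook's own method. The function name `extract_text` and its `convert` and `ocr` parameters are hypothetical; the parameters exist only so the function can be tried with stand-ins when Poppler or Tesseract is not installed.

```python
def extract_text(pdf_path, convert=None, ocr=None):
    """Return the OCR text of every page of `pdf_path`, joined in page order."""
    if convert is None:
        from pdf2image import convert_from_path as convert  # needs Poppler installed
    if ocr is None:
        from pytesseract import image_to_string as ocr  # needs Tesseract installed
    pages = convert(pdf_path)  # list of images, one per page
    return "\n".join(ocr(page) for page in pages)

# Example call (assumes ./pdf/sample.pdf exists and both tools are installed):
# text = extract_text('./pdf/sample.pdf')
```

Because nothing is written to disk, this version needs no renaming or cleanup step. For very large PDFs, holding every page in memory at once may be a concern, in which case the file-based approach above remains an option.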