## PDF and Image Text Extraction Notebook
This notebook extracts text from PDFs, including text embedded in images inside the PDF. The steps are:

1. Extract text and images from a given PDF.
2. Use Optical Character Recognition (OCR) via the Google Cloud Vision API to extract text from the images.
3. Write the extracted text to a JSON Lines (.jsonl) file.

#### 1. Import Necessary Libraries

In [1]:
from pypdf import PdfReader as pdf_reader
from collections import defaultdict
from google.cloud import vision
from tqdm import tqdm
import logging
import json
import os
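
Image extraction through `page.images` in pypdf depends on Pillow, which is why the run output at the end of this notebook logs "pillow is required to do image extraction". A minimal sketch of a pre-flight check follows; the helper name `pillow_available` is hypothetical and not part of the original notebook.

```python
import importlib.util


def pillow_available() -> bool:
    """Return True if the Pillow package (import name `PIL`) can be found."""
    return importlib.util.find_spec("PIL") is not None


# Warn before processing starts instead of once per page during extraction.
if not pillow_available():
    print("Pillow not found; install it with: pip install pypdf[image]")
```

Running this check once up front makes a missing dependency visible before any PDF is read.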

#### 2. Logging Setup

In [2]:
# Setting up logging to monitor the progress and capture any issues
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

#### 3. Configure Google Cloud Credentials for Authentication

In [3]:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = './../credentials/vai-key.json'
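
Because the key location is a hard-coded relative path, a mistyped or missing file is easier to diagnose if it is checked right after the variable is set. The following is a sketch under that assumption; `check_credentials` is a hypothetical helper, not part of the original notebook or of any Google library.

```python
import os
from pathlib import Path


def check_credentials(env_var: str = "GOOGLE_APPLICATION_CREDENTIALS") -> Path:
    """Return the credentials path, raising early if it is unset or missing."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"{env_var} is not set")
    path = Path(value)
    if not path.is_file():
        raise FileNotFoundError(f"Credentials file not found: {path}")
    return path
```

Calling `check_credentials()` immediately after the assignment would stop the notebook with a clear message before any client is created.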

#### 4. Set Local Directory Paths

In [4]:
# Define paths for input PDF files and output results
LOCAL_INPUT_DIR = './DATA/INPUT'
LOCAL_OUTPUT_DIR = './DATA/OUTPUT'

#### 5. Create Google Vision API Client

In [5]:
image_annotator_client = vision.ImageAnnotatorClient()

#### 6. Define Helper Functions
6.1. Extract Text and Images from PDF

In [6]:
def read_pdf(file_name: str) -> list:
    """
    Extracts text and images from each page of the PDF and returns two dictionaries keyed by page number: the page text and the paths of the images saved from that page.
    """
    text_by_page = {}
    images_by_page = defaultdict(list)
    logger.info(f'Extracting text and images from document: {file_name}')
    
    with open(f'{LOCAL_INPUT_DIR}/{file_name}.pdf', 'rb') as pdf_file:
        reader = pdf_reader(pdf_file)
        for i, page in tqdm(enumerate(reader.pages, start=1)):
            logger.info(f'Processing page num: {i}')
            
            # Extract text from the page
            text = page.extract_text()
            text_by_page[i] = text
            
            # Extract images from the page and save them locally
            try:
                for image in page.images:
                    IMAGE_WRITE_PATH = f'{LOCAL_OUTPUT_DIR}/{file_name}/IMAGES'
                    os.makedirs(IMAGE_WRITE_PATH, exist_ok=True)
                    images_by_page[i].append(f'{IMAGE_WRITE_PATH}/{image.name}')
                    with open(f'{IMAGE_WRITE_PATH}/{image.name}', 'wb') as image_file:
                        image_file.write(image.data)
            except Exception as e:
                logger.error(e)
    
    return [text_by_page, images_by_page]

6.2. Extract Text from Images using OCR

In [7]:
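def process_images_ocr(text_by_pages: dict, images_by_pages: dict) -> dict:
    """
    Runs OCR on each saved image and appends the detected text to the text of the page it came from.
    """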
    for page, images in tqdm(images_by_pages.items()):  
        logger.info(f'Extracting text from images for page: {page}')
        
        # Extract text from each image using Google Vision API and append to the page's text
        for image_path in images:      
            with open(image_path, 'rb') as image_file:
                try:
                    content = image_file.read()
                    image = vision.Image(content=content)
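                    response = image_annotator_client.text_detection(image=image)
                    # The API reports per-request failures in response.error rather than raising
                    if response.error.message:
                        raise RuntimeError(response.error.message)
                    detected_text = response.full_text_annotation.text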
                    page_text = text_by_pages[page]
                    logger.info(f'Adding extracted text from image back into the page: {page}')
                    text_by_pages[page] = '\n'.join([page_text, "Text extracted from the image =>", detected_text])
                except Exception as e:
                    logger.error(e)
    
    return text_by_pages

6.3. Write Extracted Text to JSON Lines File

In [8]:
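def write_pages_to_local(file_name: str, text_by_pages: dict) -> None:
    """
    Writes one JSON object per page to a .jsonl file under the document's output directory.
    """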
    logger.info(f'Writing processed pages for {file_name} into local disk')
    JSON_WRITE_PATH = f'{LOCAL_OUTPUT_DIR}/{file_name}/TEXT'
    
    try:
        os.makedirs(JSON_WRITE_PATH, exist_ok=True)
        with open(f'{JSON_WRITE_PATH}/{file_name}.jsonl', 'w') as output_json:
            for page, text in tqdm(text_by_pages.items()):
                json_line = json.dumps({"doc_name": file_name, "page_num": page, "page_content": str(text)})
                output_json.write(json_line + '\n')
    except Exception as e:
        logger.error(e)
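
To verify the output, the file can be read back line by line, since each line is an independent JSON object with the keys `doc_name`, `page_num`, and `page_content`. A minimal sketch follows; `parse_jsonl_pages` is a hypothetical helper that accepts any iterable of lines, such as an open file.

```python
import json
from typing import Iterable


def parse_jsonl_pages(lines: Iterable[str]) -> dict:
    """Map page_num to page_content for each non-empty JSON line."""
    pages = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        pages[record["page_num"]] = record["page_content"]
    return pages
```

Passing an open file handle works directly, for example `with open(path) as f: pages = parse_jsonl_pages(f)`, where `path` points at one of the generated `.jsonl` files.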

#### 7. Main Processing Function

In [9]:
def process_file(file_name: str):
    # Extract text and images from the PDF
    text_by_pages, images_by_pages = read_pdf(file_name)
    
    # Process the images using OCR to extract text
    text_by_pages = process_images_ocr(text_by_pages, images_by_pages)
    
    # Write the processed pages to a local directory in JSON lines format
    write_pages_to_local(file_name, text_by_pages)
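
Instead of listing document names by hand, the names could be discovered from the input directory. The sketch below assumes the same convention as `read_pdf`, which appends `.pdf` to the name it receives; `list_pdf_stems` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def list_pdf_stems(input_dir: str) -> list:
    """Return sorted base names (without the .pdf suffix) of PDFs directly inside input_dir."""
    directory = Path(input_dir)
    if not directory.is_dir():
        return []  # a missing directory simply yields no documents
    return sorted(p.stem for p in directory.glob("*.pdf"))
```

With this helper, a loop such as `for name in list_pdf_stems(LOCAL_INPUT_DIR): process_file(name)` would process every PDF in the input folder.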

#### 8. Execute the Process for Input Files

In [10]:
process_file('file-1')
process_file('file-2')

Extracting text and images from document: file-1
0it [00:00, ?it/s]Processing page num: 1
Processing page num: 2
2it [00:00, 19.12it/s]Processing page num: 3
Processing page num: 4
Processing page num: 5
pillow is required to do image extraction. It can be installed via 'pip install pypdf[image]'
Processing page num: 6
6it [00:00, 27.34it/s]Processing page num: 7
Processing page num: 8
Processing page num: 9
9it [00:00, 23.13it/s]Processing page num: 10
Processing page num: 11
pillow is required to do image extraction. It can be installed via 'pip install pypdf[image]'
Processing page num: 12
pillow is required to do image extraction. It can be installed via 'pip install pypdf[image]'
Processing page num: 13
pillow is required to do image extraction. It can be installed via 'pip install pypdf[image]'
Processing page num: 14
pillow is required to do image extraction. It can be installed via 'pip install pypdf[image]'
14it [00:00, 29.94it/s]Processing page num: 15
pillow is required to d