# 🖼️ Video 5 – Image Processing & PDF Automation

This Jupyter Notebook mirrors the **26-slide outline** for Topic 5 and adds **practical code examples**.

- Image examples use the file `students_park.png`.
- PDF examples use the file `1776_fancy_with_seal.pdf`.

Run cells step by step as you follow the corresponding "slides" in your presentation.

## ✨ 1. Title Slide – Image Processing & PDF Automation

- Today’s topics overview
- Libraries used: Pillow (PIL), PyPDF2, pdfplumber, PyMuPDF
- Real-world automation examples


In [None]:
# Quick sanity check for environment
import sys, os, platform, datetime, pathlib
now = datetime.datetime.now()
cwd = pathlib.Path.cwd()
venv = os.environ.get('VIRTUAL_ENV')
conda_env = os.environ.get('CONDA_DEFAULT_ENV') or os.environ.get('CONDA_PREFIX')
if venv:
    env_name = pathlib.Path(venv).name
    env_type = 'venv'
elif conda_env:
    env_name = pathlib.Path(conda_env).name if os.path.sep in str(conda_env) else str(conda_env)
    env_type = 'conda'
else:
    env_name = 'system'
    env_type = 'system'
info = {
    'run_at': now.isoformat(),
    'python_version': sys.version.split()[0],
    'python_impl': platform.python_implementation(),
    'executable': sys.executable,
    'platform': platform.platform(),
    'cwd': str(cwd),
    'env_type': env_type,
    'env_name': env_name,
}
for k, v in info.items():
    print(f"{k}: {v}")

## 🧭 2. Why Automate Image Processing?

- Standardizing images for reports and archives
- Reducing file sizes for web or email
- Preparing thumbnails for websites
- Improving digital collection workflows


In [None]:
print('Automation benefits: standardize, reduce size, thumbnails')

## 🖼️ PART I – IMAGE PROCESSING WITH PILLOW

- From setup to bulk processing


## 🧩 3. Installing Pillow

- `pip install Pillow`
- Supports major image formats
- Simple API for opening and editing images


In [None]:
# Pillow install check
try:
    import PIL
    from PIL import Image
    print('Pillow installed:', PIL.__version__)
except ImportError:
    print('Pillow not installed. Run: pip install Pillow')

## 🖼️ 4. Image Basics – Loading & Saving

- `Image.open("students_park.png")`
- Save with `.save()`
- Image modes: RGB, RGBA, L (grayscale)


In [None]:
from PIL import Image
from pathlib import Path
p = Path('students_park.png')
img = Image.open(p)
print('Mode:', img.mode, 'Size:', img.size)

## 📏 5. Resizing Images

- `.resize((width, height))`
- `.thumbnail()` preserves aspect ratio
- Useful for ML preprocessing or website thumbnails
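
The difference between the two calls can be sketched as follows. This is a minimal illustration that uses a synthetic 1600×900 image as a stand-in for a real photo (an assumption; any image behaves the same way).

```python
from PIL import Image

# Synthetic 1600x900 image standing in for a real photo (assumption for the demo).
src = Image.new("RGB", (1600, 900), "steelblue")

# resize() returns a new image with exactly the requested size,
# so the picture is stretched if the target ratio differs from the source.
stretched = src.resize((400, 400))

# thumbnail() works in place and only shrinks the image to fit
# inside the bounding box, keeping the aspect ratio.
thumb = src.copy()
thumb.thumbnail((400, 400))

print("resize:   ", stretched.size)  # (400, 400)
print("thumbnail:", thumb.size)      # (400, 225)
```

Because `.thumbnail()` modifies the image it is called on, working on a `.copy()` keeps the original available for later steps.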


In [None]:
from PIL import Image
img2 = Image.open('students_park.png')
resized = img2.resize((640, 360))
resized.size

## ✂️ 6. Cropping Images

- Crop boxes: `(left, top, right, bottom)`
- Extract region of interest
- Automating repetitive cropping tasks
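
For repetitive cropping it can help to compute the box from the image size instead of hard-coding pixels. The helper below is a hypothetical sketch (`center_crop_box` is not part of Pillow); it only does the arithmetic and returns a tuple in the `(left, top, right, bottom)` order described above.

```python
def center_crop_box(width, height, fraction):
    """Return a centered (left, top, right, bottom) box keeping `fraction` of each side."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    crop_w = int(width * fraction)
    crop_h = int(height * fraction)
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return (left, top, left + crop_w, top + crop_h)

# Keep the central half of a 1000x600 image.
box = center_crop_box(1000, 600, 0.5)
print(box)  # (250, 150, 750, 450)
```

The resulting tuple can be passed straight to `img.crop(box)`, so the same rule applies to every image in a folder regardless of its dimensions.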


In [None]:
# Cropping example placeholder
from PIL import Image
_img_crop = Image.open('students_park.png')
w,h=_img_crop.size
box=(int(w*0.1),int(h*0.1),int(w*0.9),int(h*0.9))
preview_crop=_img_crop.crop(box)
preview_crop.size

## 🔄 7. Rotating & Flipping Images

- `.rotate()` for rotation
- `.transpose()` for flips
- Fixing incorrectly scanned or rotated photos


In [None]:
# Rotation & flip quick demo
from PIL import Image, ImageOps
_img_rot = Image.open('students_park.png')
_rot90=_img_rot.rotate(90, expand=True)
_mirror=ImageOps.mirror(_img_rot)
_rot90.size,_mirror.size

## 🔁 8. Converting Image Formats

- Convert JPG ↔ PNG ↔ TIFF
- Importance of compression
- TIFF often used in archival work
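
A small sketch of a conversion round, written to in-memory buffers so it leaves no files behind. The synthetic RGBA image is an assumption standing in for a PNG with transparency, and `quality=85` is just an illustrative JPEG setting.

```python
import io
from PIL import Image

# Synthetic semi-transparent image (assumption: stands in for a real PNG).
rgba = Image.new("RGBA", (64, 64), (200, 30, 30, 128))

png_buf = io.BytesIO()
rgba.save(png_buf, format="PNG")  # lossless, keeps the alpha channel

jpeg_buf = io.BytesIO()
# JPEG cannot store alpha, so convert to RGB first; compression is lossy.
rgba.convert("RGB").save(jpeg_buf, format="JPEG", quality=85)

tiff_buf = io.BytesIO()
rgba.save(tiff_buf, format="TIFF")

for name, buf in [("PNG", png_buf), ("JPEG", jpeg_buf), ("TIFF", tiff_buf)]:
    print(name, len(buf.getvalue()), "bytes")
```

Saving an RGBA image directly as JPEG raises an error in Pillow, which is why the `convert("RGB")` step matters when PNG sources are involved.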


In [None]:
# List some of the file extensions Pillow has registered
from PIL import Image
print('Registered extensions:', sorted(Image.registered_extensions().keys())[:8], '...')

## 📂 9. Bulk Image Processing

- Loop through folders
- Pattern: load → process → save
- Create consistent naming conventions
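
One way to keep output names consistent is to derive them from the source name and the target size. The helper below is a hypothetical sketch (`thumb_name` and the `archive/...` paths are made up for illustration). It uses `PurePosixPath` so the printed result is the same on every platform; in the notebook, `Path` works the same way.

```python
from pathlib import PurePosixPath

def thumb_name(src, size, suffix=".jpg"):
    """Build a name like 'scan_001_thumb_256x256.jpg' next to the source file."""
    src = PurePosixPath(src)
    width, height = size
    return src.with_name(f"{src.stem}_thumb_{width}x{height}{suffix}")

# Hypothetical source files.
sources = ["archive/scan_001.png", "archive/scan_002.tiff"]
for s in sources:
    print(s, "->", thumb_name(s, (256, 256)))
```

Encoding the size in the file name makes it possible to generate several thumbnail sizes into the same folder without overwriting anything.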


In [None]:
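# Bulk processing skeleton: load -> process -> save
from pathlib import Path
from PIL import Image
def bulk_resize(src='.', pattern='*.png', size=(256, 256)):
    for p in Path(src).glob(pattern):
        with Image.open(p) as img:
            # JPEG cannot store alpha, so convert (PNG sources may be RGBA)
            thumb = img.convert('RGB')
        thumb.thumbnail(size)  # in place, keeps the aspect ratio
        out = p.with_name(p.stem + '_thumb.jpg')
        thumb.save(out)
        print('Saved', out)
print('bulk_resize(src, pattern, size) is defined and ready to use')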

## 🏷️ 10. Image Metadata Basics

- EXIF stores camera and timestamp info
- Metadata may be lost when editing
- PNG files rarely carry EXIF, so `.getexif()` often returns nothing for them


In [None]:
# Image metadata quick check
from PIL import Image
_meta_img = Image.open('students_park.png')
print('Format:', _meta_img.format, 'Mode:', _meta_img.mode)

## 📝 11. Reading & Writing EXIF

- `.getexif()` to inspect metadata
- Common fields: orientation, datetime
- Correct rotated smartphone photos
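
The orientation fix can be sketched with Pillow's `ImageOps.exif_transpose()`. The example builds a tiny JPEG in memory with the orientation tag set to 6 (meaning the viewer should rotate it 90° clockwise); that synthetic file is an assumption standing in for a smartphone photo.

```python
import io
from PIL import Image, ImageOps

ORIENTATION_TAG = 0x0112  # standard EXIF tag ID for orientation

# Build a small JPEG in memory whose EXIF orientation is 6.
exif = Image.Exif()
exif[ORIENTATION_TAG] = 6
buf = io.BytesIO()
Image.new("RGB", (40, 20), "gray").save(buf, format="JPEG", exif=exif)

buf.seek(0)
photo = Image.open(buf)
print("Stored size:", photo.size, "orientation:", photo.getexif().get(ORIENTATION_TAG))

# exif_transpose() returns a copy with the pixels rotated according to the tag.
upright = ImageOps.exif_transpose(photo)
print("Upright size:", upright.size)
```

The stored pixels are 40×20, while the corrected copy is 20×40, which is what a viewer that honors EXIF would have displayed.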


In [None]:
# EXIF sample read (may be empty)
from PIL import Image
_exif_img = Image.open('students_park.png')
exif = _exif_img.getexif()
print('EXIF tags found:', len(exif))

## 🖨️ 12. Converting PDF Pages → Images

- Use PyMuPDF (recommended)
- Useful for generating page thumbnails
- DPI settings affect quality
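
The effect of DPI is simple arithmetic: PDF page sizes are measured in points (72 per inch), so the pixel size grows linearly with the DPI. The sketch below assumes a US Letter page (612×792 points) purely for illustration; the sample PDF may use a different size, and a real renderer may round differently by a pixel.

```python
POINTS_PER_INCH = 72  # PDF page sizes are measured in points

def rendered_size(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)

# US Letter is 612 x 792 points (8.5 x 11 inches).
for dpi in (72, 150, 300):
    print(dpi, "dpi ->", rendered_size(612, 792, dpi))
```

Doubling the DPI quadruples the pixel count, so 72 DPI is usually enough for thumbnails while 150 to 300 DPI suits printing or OCR.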


In [None]:
# Convert PDF page to image placeholder
try:
    import fitz
    doc = fitz.open('1776_fancy_with_seal.pdf')
    pix = doc[0].get_pixmap(dpi=72)
    print('Page 1 image size:', pix.width, 'x', pix.height)
except Exception as e:
    print('PyMuPDF not available or file missing:', e)

## 📄 13. Converting Images → PDF

- Combine scanned documents into one file
- Pillow: `save_all=True`
- Choose compression/DPI depending on use case


In [None]:
# Images to PDF placeholder
from PIL import Image
def images_to_pdf(img_files, out_pdf='combined.pdf'):
    imgs=[Image.open(f).convert('RGB') for f in img_files]
    if imgs:
        first, rest = imgs[0], imgs[1:]
        first.save(out_pdf, save_all=True, append_images=rest)
        print('Saved', out_pdf)
images_to_pdf(['students_park.png'])

## 📚 PART II – PDF AUTOMATION

- From structure to splitting, merging, and extraction


In [None]:
# Section: PART II start
print('Starting PART II: PDF Automation')

## 🧩 14. How PDFs Work Internally

- Structured but not plain text
- Pages contain multiple object types
- Extraction is often imperfect


In [None]:
# Peek at the internal structure: page count and page 1 dictionary keys
import PyPDF2, pathlib
pdf_file=pathlib.Path('1776_fancy_with_seal.pdf')
if pdf_file.exists():
    reader=PyPDF2.PdfReader(str(pdf_file))
    first=reader.pages[0]
    print('Num pages:', len(reader.pages), '| Page 1 dictionary keys:', list(first.keys())[:5])
else:
    print('PDF file missing.')

## 📦 15. Installing PDF Libraries

- `pip install PyPDF2 pdfplumber pymupdf`
- PyPDF2 for page operations
- pdfplumber for text/table extraction
- PyMuPDF for page images and advanced tasks


In [None]:
# PDF libraries import test
try:
    import PyPDF2, pdfplumber
    print('PyPDF2/pdfplumber imported')
except ImportError as e:
    print('Missing libs:', e)

## 📘 16. Reading a PDF

- Load file and count pages
- Extract simple text
- Understand extraction limitations


In [None]:
# Read PDF basic info placeholder
import pathlib, PyPDF2
pdf_p=pathlib.Path('1776_fancy_with_seal.pdf')
if pdf_p.exists():
    r=PyPDF2.PdfReader(str(pdf_p))
    print('Pages:', len(r.pages))
else:
    print('PDF not found')

## 🧵 17. Extracting Text from Pages

- Page-by-page processing
- Irregular line breaks
- Difference between “logical” and “visual” layout
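
Extracted text usually follows the visual layout, so one logical paragraph arrives as several short lines. A rough cleanup pass might look like the sketch below. `unwrap_text` is a hypothetical helper and the sample string is made up; the hyphen rule is a heuristic that also merges genuine hyphenated compounds that happen to break at a line end.

```python
import re

def unwrap_text(raw):
    """Rejoin words hyphenated across line breaks and unwrap lines within paragraphs."""
    # "neces-\nsary" -> "necessary"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # A single newline is a visual line break; a blank line separates paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())

sample = (
    "When in the Course of human\n"
    "events, it becomes neces-\n"
    "sary for one people\n"
    "\n"
    "We hold these truths"
)
print(unwrap_text(sample))
```

This turns the visual lines back into two logical paragraphs separated by a blank line.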


In [None]:
# Extract first page text snippet
import pdfplumber, pathlib
pdf_p=pathlib.Path('1776_fancy_with_seal.pdf')
if pdf_p.exists():
    with pdfplumber.open(str(pdf_p)) as pdf:
        t=pdf.pages[0].extract_text() or ''
        print(t[:120].replace('\n',' '),'...')
else:
    print('PDF not found for text extraction')

## 📊 18. Table Extraction with pdfplumber

- Detecting tables
- Export to CSV or DataFrame
- Use cases: financial reports, statistics
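
Exporting to CSV can be sketched with the standard library alone. The `table` below is hypothetical sample data, shaped like one table from pdfplumber's `extract_tables()`: a list of rows, each a list of cell strings, where empty cells may be `None`.

```python
import csv
import io

# Hypothetical table in the list-of-rows shape that extract_tables() yields.
table = [
    ["Item", "Qty", "Note"],
    ["Paper", "10", None],
    ["Ink ", "2", "black"],
]

def table_to_csv(rows):
    """Serialize table rows to CSV text, replacing None cells with empty strings."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell.strip() for cell in row])
    return out.getvalue()

print(table_to_csv(table))
```

To write a file instead, the same `csv.writer` can be pointed at `open('table.csv', 'w', newline='')`.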


In [None]:
# Table extraction stub
import pdfplumber, pathlib
pdf_p=pathlib.Path('1776_fancy_with_seal.pdf')
if pdf_p.exists():
    with pdfplumber.open(str(pdf_p)) as pdf:
        page=pdf.pages[0]
        tables=page.extract_tables() or []
        print('Tables found on page 1:', len(tables))
else:
    print('PDF not found for table extraction')

## ✂️ 19. Splitting PDFs

- Extract specific pages or ranges
- Reduce document size
- Useful in administration workflows
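
Page selections are often given in a human-friendly form such as `1-2,4`. The sketch below converts such a spec into zero-based indexes; `parse_page_ranges` is a hypothetical helper and the spec format is an assumption, not something defined by PyPDF2.

```python
def parse_page_ranges(spec, num_pages):
    """Turn a spec like '1-2,4' into a sorted list of zero-based page indexes."""
    indexes = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        first = int(start)
        last = int(end) if end else first
        if not 1 <= first <= last <= num_pages:
            raise ValueError(f"invalid range {part!r} for {num_pages} pages")
        indexes.update(range(first - 1, last))
    return sorted(indexes)

print(parse_page_ranges("1-2,4", num_pages=4))  # [0, 1, 3]
```

Each index can then be used as `reader.pages[i]` and passed to `writer.add_page()` to build the extracted document.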


In [None]:
# Split PDF stub
import PyPDF2, pathlib
pdf_p=pathlib.Path('1776_fancy_with_seal.pdf')
if pdf_p.exists():
    r=PyPDF2.PdfReader(str(pdf_p))
    print('Ready to split', len(r.pages), 'pages')
else:
    print('PDF not found for splitting')

## ➕ 20. Merging PDFs

- Combine multiple PDFs
- Order matters
- Creating a single consolidated report
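
Because order matters, it is worth checking how the input files are sorted before merging. A plain string sort puts `part10.pdf` before `part2.pdf`; the sketch below shows a numeric-aware sort key. `natural_key` and the file names are hypothetical.

```python
import re

def natural_key(name):
    """Sort key that compares digit runs as numbers, so 'part2' precedes 'part10'."""
    return [int(tok) if tok.isdigit() else tok.lower() for tok in re.split(r"(\d+)", name)]

files = ["part10.pdf", "part2.pdf", "part1.pdf"]
print(sorted(files))                   # ['part1.pdf', 'part10.pdf', 'part2.pdf']
print(sorted(files, key=natural_key))  # ['part1.pdf', 'part2.pdf', 'part10.pdf']
```

The sorted list can then be looped over, adding each file's pages to a single writer in that order.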


In [None]:
# Merge PDFs stub
print('Use PyPDF2.PdfWriter() and add_page() for merging')

## 🔃 21. Rotating PDF Pages

- Fix upside-down scans
- Rotate at page level
- Save as a new PDF


In [None]:
# Rotate PDF pages stub
print('Use page.rotate(angle) before adding to writer')

## 💧 22. Adding Watermarks / Stamps

- Overlay text or images
- Common for internal classification
- Branding and security use cases


In [None]:
# Watermark stub
print('Create watermark PDF then merge_page(wm_page) for each page')

## 🔐 23. Protecting PDFs (Passwords)

- Basic encryption with PyPDF2
- Limitations of password protection
- Useful for confidential distribution


In [None]:
# Encryption stub
print('Use writer.encrypt(password) before writer.write()')

## 🔄 24. End-to-End Automation Workflow

- Images → cleaned + resized → PDF
- PDF → structured data (tables/text) → CSV
- Generate thumbnails for navigation
- Typical digitization pipeline in libraries


In [None]:
# End-to-end workflow pseudo steps
steps=['Load images','Resize/crop','Assemble PDF','Extract text/tables','Generate thumbnails']
for s in steps:
    print('-', s)

## 🎓 25. Skills Learned Today

- Basic image manipulation (resize, crop, rotate)
- Converting images and PDFs
- Reading/writing EXIF metadata
- Extracting text & tables from PDFs
- Splitting, merging, rotating, watermarking PDFs


In [None]:
# Skills recap print
skills=['Resize','Crop','Rotate','Format convert','EXIF','PDF read','Split/Merge','Rotate pages','Watermark','Encrypt']
print('Skills covered:', ', '.join(skills))

## 🧪 26. Exercises

- Resize a folder of images into 256×256 thumbnails
- Crop a specific region from 10 images
- Convert a folder of JPG files into a single PDF
- Extract all text from a PDF report
- Split a PDF into individual page files
- (Optional) Extract a table into CSV using pdfplumber


In [None]:
# Exercises encouragement
print('Try each exercise; automate & iterate!')

### 🔧 Setup for Image Examples

Install and import Pillow. In many environments it is already installed; if not, uncomment and run the `pip` line.

In [None]:
#!pip install Pillow

from PIL import Image, ImageOps
from pathlib import Path

img_path = Path('students_park.png')  # ensure this file is in the same folder as the notebook
img = Image.open(img_path)
img

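### 📏 Example: Resizing and Thumbnail Creation

We create a fixed-size 640×360 copy with `.resize()` (which can distort the image if the source is not 16:9) and a thumbnail with `.thumbnail()`, which preserves the aspect ratio.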

In [None]:
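# resize() forces the exact size, whatever the source aspect ratio
small = img.resize((640, 360))
# JPEG has no alpha channel; convert in case the PNG is RGBA
small.convert('RGB').save('students_park_640x360.jpg')

# thumbnail() shrinks in place and preserves the aspect ratio
thumb = img.copy()
thumb.thumbnail((256, 256))
thumb.convert('RGB').save('students_park_thumb_256.jpg')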

small, thumb

### ✂️ Example: Cropping a Region of Interest

Here we crop the central part of the image to focus on the students.

In [None]:
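w, h = img.size
# Crop box is (left, upper, right, lower) in whole pixels
left = int(w * 0.25)
upper = int(h * 0.2)
right = int(w * 0.75)
lower = int(h * 0.9)

crop = img.crop((left, upper, right, lower))
# JPEG has no alpha channel; convert in case the PNG is RGBA
crop.convert('RGB').save('students_park_cropped.jpg')
crop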

### 🔄 Example: Rotating & Flipping

Rotate by 90 degrees counter-clockwise (Pillow's direction for positive angles) and create a mirror image.

In [None]:
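rotated = img.rotate(90, expand=True)  # expand=True grows the canvas to fit
flipped = ImageOps.mirror(img)         # left-right flip
# JPEG has no alpha channel; convert in case the PNG is RGBA
rotated.convert('RGB').save('students_park_rotated_90.jpg')
flipped.convert('RGB').save('students_park_flipped.jpg')
rotated, flipped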

### 🔁 Example: Converting to Different Formats

Save the same image as PNG and TIFF for different workflows.

In [None]:
img.save('students_park_converted.png')
img.save('students_park_converted.tiff')
print('Saved PNG and TIFF versions.')

### 🏷️ Example: Inspecting EXIF Metadata

Not all images contain EXIF, but this is how you would inspect it.

In [None]:
exif = img.getexif()
print(f'EXIF entries: {len(exif)}')
for tag_id, value in list(exif.items())[:10]:
    print(tag_id, value)

---
## 📚 PDF Examples with `1776_fancy_with_seal.pdf`

We now switch to PDF automation examples using the sample document `1776_fancy_with_seal.pdf` (4-page decorative Declaration of Independence excerpt).

In [None]:
# PDF examples section check
import pathlib
pdf_path=pathlib.Path('1776_fancy_with_seal.pdf')
print('PDF exists?', pdf_path.exists())

### 🔧 Setup for PDF Examples

Install and import the libraries needed for PDF manipulation.

In [None]:
#!pip install PyPDF2 pdfplumber pymupdf

from pathlib import Path
import PyPDF2
import pdfplumber

pdf_path = Path('1776_fancy_with_seal.pdf')
pdf_path

### 📘 Reading Basic PDF Info

Read the PDF, count pages, and inspect basic metadata.

In [None]:
with pdf_path.open('rb') as f:
    reader = PyPDF2.PdfReader(f)
    print('Pages:', len(reader.pages))
    print('Metadata:', reader.metadata)

### 🧵 Extracting Text Page by Page

Use `pdfplumber` to extract text for further analysis.

In [None]:
with pdfplumber.open(pdf_path) as pdf:
    for i, page in enumerate(pdf.pages):
        text = page.extract_text() or ''
        print(f'--- Page {i+1} ---')
        print(text[:400], '...')
        print()

### ✂️ Splitting the PDF into Separate Files

Create one new PDF per page (useful for workflows where you only need certain pages).

In [None]:
with pdf_path.open('rb') as f:
    reader = PyPDF2.PdfReader(f)
    for i, page in enumerate(reader.pages):
        writer = PyPDF2.PdfWriter()
        writer.add_page(page)
        out_name = f'1776_split_page_{i+1}.pdf'
        with open(out_name, 'wb') as out_f:
            writer.write(out_f)
        print('Wrote', out_name)

### ➕ Merging Selected Pages into a New PDF

Merge the first and last page into a new 2-page document.

In [None]:
with pdf_path.open('rb') as f:
    reader = PyPDF2.PdfReader(f)
    writer = PyPDF2.PdfWriter()
    writer.add_page(reader.pages[0])
    writer.add_page(reader.pages[-1])
    with open('1776_merged_first_last.pdf', 'wb') as out_f:
        writer.write(out_f)
    print('Created 1776_merged_first_last.pdf')

### 🔃 Rotating a Page in the PDF

Rotate the second page by 90 degrees clockwise and save the result as a new PDF.

In [None]:
with pdf_path.open('rb') as f:
    reader = PyPDF2.PdfReader(f)
    writer = PyPDF2.PdfWriter()
    for i, page in enumerate(reader.pages):
        if i == 1:
            page = page.rotate(90)
        writer.add_page(page)

    with open('1776_rotated_page2.pdf', 'wb') as out_f:
        writer.write(out_f)

print('Created 1776_rotated_page2.pdf')

### 💧 Adding a Simple Text Watermark

For production you might use more advanced PDF libraries. Here we use `reportlab` (an extra dependency: `pip install reportlab`) to create a 1-page PDF with the word `DRAFT`, then overlay it on every page using PyPDF2.

In [None]:
# Requires reportlab: pip install reportlab
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

wm_file = 'watermark_draft.pdf'
c = canvas.Canvas(wm_file, pagesize=letter)
c.setFont('Helvetica-Bold', 60)
c.setFillGray(0.8, 0.5)
c.saveState()
c.translate(300, 400)
c.rotate(45)
c.drawCentredString(0, 0, 'DRAFT')
c.restoreState()
c.save()

with open(wm_file, 'rb') as wf, pdf_path.open('rb') as f:
    wm_reader = PyPDF2.PdfReader(wf)
    reader = PyPDF2.PdfReader(f)
    writer = PyPDF2.PdfWriter()

    wm_page = wm_reader.pages[0]
    for page in reader.pages:
        page.merge_page(wm_page)
        writer.add_page(page)

    with open('1776_watermarked.pdf', 'wb') as out_f:
        writer.write(out_f)

print('Created 1776_watermarked.pdf')

### 🔐 Encrypting the PDF with a Password

Use PyPDF2 to add a simple password (note: this is not strong security, but useful for basic protection).

In [None]:
with pdf_path.open('rb') as f:
    reader = PyPDF2.PdfReader(f)
    writer = PyPDF2.PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    writer.encrypt('1776password')
    with open('1776_encrypted.pdf', 'wb') as out_f:
        writer.write(out_f)

print('Created 1776_encrypted.pdf (password: 1776password)')

### 🖨️ (Optional) Converting PDF Pages to Images with PyMuPDF

```python
import fitz  # pymupdf

doc = fitz.open('1776_fancy_with_seal.pdf')
page = doc[0]
pix = page.get_pixmap(dpi=150)
pix.save('1776_page1.png')
```

Run this only if `pymupdf` is installed in your environment.