# OCR Google Photos Prep

A simple Jupyter notebook that runs OCR on the images in `src/` and saves the recognized text to each image's IPTC caption, so that Google Photos uses it as the description when the image is uploaded.

Images are converted to `.jpg` (for metadata support) and saved to the `output/` folder.

## Import

In [11]:
import pytesseract
import os
import glob
import re

from iptcinfo3 import IPTCInfo
from PIL import Image

import time
from datetime import datetime

from tqdm import tqdm

### Setup

Set the path to the Tesseract executable on your machine.

[Tesseract guide](https://tesseract-ocr.github.io/tessdoc/Home.html)

In [2]:
pytesseract.pytesseract.tesseract_cmd = r'/usr/local/Cellar/tesseract/4.1.1/bin/tesseract'
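
The hard-coded Homebrew path above only works for one specific install. As a minimal sketch, assuming Tesseract is usually reachable through `PATH`, the standard library's `shutil.which` can look the executable up first and fall back to a fixed path. The helper name `find_tesseract` is hypothetical and not part of the original notebook.

```python
import shutil


def find_tesseract(fallback, name="tesseract"):
    """Return the full path of `name` if it is on PATH, otherwise `fallback`."""
    found = shutil.which(name)
    return found if found is not None else fallback


# Hypothetical usage in this notebook:
# pytesseract.pytesseract.tesseract_cmd = find_tesseract(
#     r'/usr/local/Cellar/tesseract/4.1.1/bin/tesseract')
tesseract_path = find_tesseract(r'/usr/local/Cellar/tesseract/4.1.1/bin/tesseract')
print(tesseract_path)
```

If the lookup fails, the fallback is returned unchanged, so the notebook behaves as before on the original machine.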

Optional prefix (used when renaming macOS screenshots)

In [3]:
output_prefix = 'data'

In [4]:
def prep_filename(filename, prefix='', separator='-'):
    """
    Filename helper.
    Converts the standard macOS screenshot name to a "URL-friendly" timestamp plus prefix;
    any other filename is kept as-is (without the src/ folder and the extension).
    """
    # check if the name matches the macOS screenshot format
    regex = re.search(r"Screen Shot ([0-9]{4}-[0-9]{2}-[0-9]{2}) at ([0-9]{1,2}\.[0-9]{2}\.[0-9]{2}) (AM|PM)", filename)
    if regex is not None: 
        datetime_string = regex.group(1) +" "+ regex.group(2) +" "+ regex.group(3)
        datetime_obj = datetime.strptime(datetime_string, "%Y-%m-%d %I.%M.%S %p")
        return prefix+separator+datetime_obj.strftime("%Y-%m-%d-%H-%M-%S")
    else:
        # keep filename
        r = re.search(r"^(src/)(.*)(\.(png|jpg|jpeg|gif))$", filename)
        return r.group(2)
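
To make the renaming step concrete, here is a small standalone sketch of the timestamp conversion that `prep_filename` performs on macOS screenshot names. The names `SCREENSHOT_RE` and `screenshot_timestamp` are hypothetical and used only for illustration. It assumes an English locale, so that `%p` recognizes `AM`/`PM`.

```python
import re
from datetime import datetime

# Same pattern as in prep_filename, written as a raw string.
SCREENSHOT_RE = re.compile(
    r"Screen Shot (\d{4}-\d{2}-\d{2}) at (\d{1,2}\.\d{2}\.\d{2}) (AM|PM)"
)


def screenshot_timestamp(filename):
    """Return a 24-hour, dash-separated timestamp, or None if the name does not match."""
    match = SCREENSHOT_RE.search(filename)
    if match is None:
        return None
    # Join date, 12-hour time, and AM/PM, then parse them in one go.
    parsed = datetime.strptime(" ".join(match.groups()), "%Y-%m-%d %I.%M.%S %p")
    return parsed.strftime("%Y-%m-%d-%H-%M-%S")


print(screenshot_timestamp("src/Screen Shot 2020-05-17 at 3.04.05 PM.png"))
# prints 2020-05-17-15-04-05
print(screenshot_timestamp("src/sample-01.png"))
# prints None
```

The 12-hour time in the screenshot name becomes a 24-hour time in the output, so the resulting filenames sort chronologically.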

## Sample
![title](sample.png)

In [5]:
sample_string = pytesseract.image_to_string(r'sample.png')
print(sample_string)

Observational studies can never
be used to prove causation, only
to demonstrate correlation.


## Gather Files

In [6]:
# find images in ./src/
screenshots = glob.glob("src/*.png") + glob.glob("src/*.gif") + glob.glob("src/*.jpg") + glob.glob("src/*.jpeg")
print("Files found: "+str(len(screenshots)))

Files found: 3
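
`glob.glob` returns matches in arbitrary order, which is why the report at the end can list files out of sequence. As a sketch of one alternative, assuming a stable order is preferred, the files can be gathered in a single pass and sorted. The names `IMAGE_SUFFIXES` and `find_images` are hypothetical.

```python
from pathlib import Path

# Same extensions as the glob patterns above (lowercase only).
IMAGE_SUFFIXES = {".png", ".gif", ".jpg", ".jpeg"}


def find_images(folder="src"):
    """Return image paths in `folder`, sorted by name, as forward-slash strings."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        path.as_posix()
        for path in root.iterdir()
        if path.is_file() and path.suffix in IMAGE_SUFFIXES
    )


print("Files found: " + str(len(find_images("src"))))
```

The paths keep the `src/` prefix with forward slashes, which is the form `prep_filename` expects.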


## Run Operations

In [7]:
# list to store output filenames
output_screenshots = []
# list to store OCR results
captions = []

### Convert to JPEG and run OCR

Images must be converted to JPG to support EXIF/IPTC metadata. Run Tesseract to read text in images.

In [12]:
# make sure the output folder exists before saving into it
os.makedirs("output", exist_ok=True)
# loop through screenshots in src/ to pass through OCR and convert to JPG
for path in tqdm(screenshots):
    # load image (the with block closes the source file automatically)
    with Image.open(path) as sshot:
        # convert to rgb (saving to jpg later)
        rgb = sshot.convert('RGB')
        # tesseract (OCR)
        caption = pytesseract.image_to_string(rgb)
        # store result in list
        captions.append(caption)
        # prepare filename
        filebase = prep_filename(path, prefix=output_prefix)
        filename = "output/"+filebase+".jpg"
        output_screenshots.append(filename)
        # save jpg
        rgb.save(filename)

100%|██████████| 3/3 [00:01<00:00,  2.02it/s]


### Insert Caption

Open each saved JPG with IPTCInfo and write the OCR text into its caption field.

In [13]:
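for filename, caption in tqdm(list(zip(output_screenshots, captions))):
    try:
        # wrapped in a try for now: IPTCInfo logs
        # > "WARNING:iptcinfo:Marker scan hit start of image data"
        # but the IPTC metadata is still updated
        info = IPTCInfo(filename, force=True)
        info['caption/abstract'] = caption.encode('utf-8')
        info.save()
    except Exception as exc:
        # report the failure instead of silently skipping the file
        print("Could not write caption for "+filename+": "+str(exc))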

  0%|          | 0/3 [00:00<?, ?it/s]Marker scan hit start of image data
Marker scan hit start of image data
Marker scan hit start of image data
100%|██████████| 3/3 [00:00<00:00, 207.00it/s]
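
Raw OCR output often contains stray line breaks and can be long. The following is a minimal sketch of a cleanup step that could run before the caption is written. It assumes the 2000-byte limit that the IPTC IIM specification gives for the Caption/Abstract field, and that collapsing whitespace is acceptable for a photo description. The helper `clean_caption` is hypothetical and not part of the original notebook.

```python
def clean_caption(text, max_bytes=2000):
    """Collapse whitespace and return UTF-8 bytes no longer than `max_bytes`."""
    # Replace every run of spaces and newlines with a single space.
    collapsed = " ".join(text.split())
    # Cut at the byte limit (assumed IPTC Caption/Abstract maximum).
    encoded = collapsed.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a multi-byte character cut in half by the slice.
    return encoded.decode("utf-8", errors="ignore").encode("utf-8")


sample = "Observational studies can never\nbe used to prove causation, only\nto demonstrate correlation.\n\n"
print(clean_caption(sample))
```

The returned bytes could be assigned to `info['caption/abstract']` in place of the raw encoded caption.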


### Cleanup ~ Files

(to investigate) The IPTCInfo library may leave a duplicate of each file with `~` appended to its name. This step removes those files if they exist.

In [14]:
# remove ~ files
for k in tqdm(range(len(output_screenshots))):
    filename = output_screenshots[k]+"~"
    # check if file exists and, as a simple safeguard, only remove jpg backups
    if os.path.exists(filename) and filename.endswith(".jpg~"):
        os.remove(filename)

100%|██████████| 3/3 [00:00<00:00, 2001.42it/s]
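
The loop above only removes backups for files processed in the current run. As a sketch of an alternative, assuming every `*.jpg~` file in the output folder is a leftover backup that is safe to delete, the folder can be swept with `pathlib`. The helper `remove_backups` is hypothetical.

```python
from pathlib import Path


def remove_backups(folder="output", pattern="*.jpg~"):
    """Delete files matching `pattern` in `folder` and return their names."""
    root = Path(folder)
    if not root.is_dir():
        return []
    removed = []
    for backup in sorted(root.glob(pattern)):
        if backup.is_file():
            backup.unlink()
            removed.append(backup.name)
    return removed
```

Calling `remove_backups("output")` returns the list of deleted names, which makes it easy to see whether the library actually left any backups behind.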


## Report

Processing is complete at this point. This section prints the first few results for review.

In [15]:
total_reports = min(len(captions), 10)
for n in range(total_reports):
    print(str(n)+'| '+output_screenshots[n])
    print(' | '+captions[n][:100].replace("\n", " "))

0| output/sample-01.jpg
 | Because he had to open the door in this way, it was already wide open before he could be seen. He ha
1| output/sample-03.jpg
 | As i waiched, the planet seemed to grow larger and smaller and to advance and recede, but that was s
2| output/sample-02.jpg
 | Next morning the not-yet- subsided sea rolled in long slow billows of mighty bulk, and striving in t
