## OCR Model
Extracts characters from a Univus screenshot using the Tesseract OCR engine.

### Setup

The Tesseract engine must be downloaded and installed on your PC before use.

**Windows**

On Windows, download the Tesseract installer from https://github.com/UB-Mannheim/tesseract/wiki and install it into a separate folder. After installation, add that folder to the PATH environment variable.
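To confirm that the PATH change took effect, a short check like the one below may help. It is a minimal sketch: the helper name `find_tesseract` is hypothetical, and the fallback location is only an assumed default for the UB-Mannheim installer, so adjust it to the folder you actually installed into.

```python
import os
import shutil

# Assumption: a common default install location; change it to your own folder.
ASSUMED_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

def find_tesseract(fallback=ASSUMED_WINDOWS_PATH):
    """Return the path to the tesseract executable, or None if it is not found."""
    on_path = shutil.which("tesseract")  # Searches the PATH environment variable
    if on_path:
        return on_path
    if os.path.isfile(fallback):
        return fallback
    return None

tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found: check the install folder and PATH.")
else:
    print("Using Tesseract at", tesseract_path)
    # If Tesseract is installed but not on PATH, pytesseract can be pointed at it:
    # import pytesseract
    # pytesseract.pytesseract.tesseract_cmd = tesseract_path
```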

### Other Installs

- pip install opencv-python
- pip install pytesseract
- pip install pillow
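The import names differ from the pip package names for two of these, which can be confusing when an import fails. The sketch below (the helper `missing_packages` is hypothetical) checks whether each module can be found without actually importing it.

```python
import importlib.util

# pip package name -> module name used in `import` statements
REQUIRED_MODULES = {
    "opencv-python": "cv2",
    "pytesseract": "pytesseract",
    "pillow": "PIL",
}

def missing_packages(required=REQUIRED_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]

missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```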

### Uses OCR to extract text from images

Page segmentation modes:

| Number | Description |
| ----------- | ----------- |
|  0 |   Orientation and script detection (OSD) only. |
|  1  |  Automatic page segmentation with OSD.|
|  2  |  Automatic page segmentation, but no OSD, or OCR.|
|  3  |  Fully automatic page segmentation, but no OSD. (Default)|
|  4  |  Assume a single column of text of variable sizes.|
|  5  |  Assume a single uniform block of vertically aligned text.|
|  6  |  Assume a single uniform block of text.|
|  7  |  Treat the image as a single text line.|
|  8  |  Treat the image as a single word.|
|  9  |  Treat the image as a single word in a circle.|
| 10  |  Treat the image as a single character.|
| 11  |  Sparse text. Find as much text as possible in no particular order.|
| 12  |  Sparse text with OSD.|
| 13  |  Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.|

**OCR Engine Modes:**

0. Legacy engine only.
1. Neural nets LSTM engine only.
2. Legacy + LSTM engines.
3. Default, based on what is available.
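Both modes are passed to Tesseract as command-line flags in a single config string. As a small sketch, the hypothetical helper `build_tesseract_config` below validates the two numbers against the ranges listed above and builds that string.

```python
def build_tesseract_config(psm=3, oem=3):
    """Build a Tesseract config string from a page segmentation mode and an engine mode."""
    if psm not in range(0, 14):
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if oem not in range(0, 4):
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    return f"--psm {psm} --oem {oem}"

config = build_tesseract_config(psm=11, oem=3)
print(config)  # --psm 11 --oem 3
```

The resulting string can be given as the `config` argument of `pytesseract.image_to_string`, which is how this notebook selects sparse-text mode (11) with the default engine (3).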

In [9]:
import os
import re
from datetime import datetime, timezone, timedelta
import pytesseract
import cv2

# Reads every file in the folder that OpenCV can decode as an image (cv2.imread returns None otherwise)
def load_images_from_folder(folder):
    images = []
    filenames = []
    for filename in os.listdir(folder):
        img = cv2.imread(os.path.join(folder,filename))
        if img is not None:
            images.append(img)
            filenames.append(filename)
    return (images, filenames)

# Extracts text from images using Tesseract OCR
def extract_text_from_images(green_pass_folder_path):
    texts = []
    # Read all images
    images, filenames = load_images_from_folder(green_pass_folder_path)

    for image in images:
        # Tesseract is run on a grayscale copy of each image
        grayscale_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        # psm 11 = sparse text, oem 3 = default engine
        text = pytesseract.image_to_string(grayscale_image, config='--psm 11 --oem 3')
        texts.append(text)
    return (texts, filenames)
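`os.listdir` returns names in arbitrary order, and a plain alphabetical sort places `10.jpg` before `2.jpg`. If the files should be processed in numeric order, a natural sort key is one option. The helper `natural_sort_key` below is a hypothetical addition, not part of the notebook.

```python
import re

def natural_sort_key(filename):
    """Split a filename into text and integer chunks so '2.jpg' sorts before '10.jpg'."""
    return [int(chunk) if chunk.isdecimal() else chunk.lower()
            for chunk in re.split(r"(\d+)", filename)]

filenames = ["1.jpg", "10.jpg", "11.jpg", "2.jpg", "3.jpg"]
print(sorted(filenames, key=natural_sort_key))
# ['1.jpg', '2.jpg', '3.jpg', '10.jpg', '11.jpg']
```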

### Date Time Extraction, Parsing and Comparison

- datetime.strptime documentation: https://www.programiz.com/python-programming/datetime/strptime
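As a quick illustration of the two steps, the sketch below runs the same regex patterns and `strptime` format used in this notebook on a made-up OCR string. The sample text is hypothetical. Note that `%b` matches abbreviated month names for the current locale, so `Mar` is assumed to be parsed under an English locale.

```python
import re
from datetime import datetime

# Hypothetical OCR output, made up for illustration
sample_text = "Green Pass\n2022 Mar 15\n14:59:42\n"

dates = re.findall(r"\d{4}\s\w{3,4}\s\d{1,2}", sample_text)  # YYYY MMM DD
times = re.findall(r"\d{2}:\d{2}:\d{2}", sample_text)        # HH:MM:SS
print(dates, times)  # ['2022 Mar 15'] ['14:59:42']

parsed = datetime.strptime(dates[0] + " " + times[0], "%Y %b %d %H:%M:%S")
print(parsed.isoformat())  # 2022-03-15T14:59:42
```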

In [10]:
# Uses regex patterns to match time and date
def extract_time_date_data(text):
    date = re.findall(r'\d{4}\s\w{3,4}\s\d{1,2}', text) # Detect format of YYYY MMM DD
    time = re.findall(r'\d{2}:\d{2}:\d{2}', text) # Detect format of HH:MM:SS
    return (date, time)

# Parse strings into date time objects
def parse_date_time_data(date, time):
    default_date = '2020 Jan 01'
    default_time = '00:00:00'
    # Fall back to the defaults when OCR found no match, without modifying the input lists
    date_string = date[0] if date else default_date
    time_string = time[0] if time else default_time
    date_time_string = date_string + ' ' + time_string
    date_time_object = datetime.strptime(date_time_string, "%Y %b %d %H:%M:%S")
    # print("Date time =", date_time_object)
    return date_time_object

# Determines if time is within threshold. Outputs 1 if within, 0 if not
def check_datetime_in_window(date_time_object):
    # 86400 seconds in a day
    threshold = 600 # Time in seconds

    timezone_offset = +8.0  # SG GMT +8
    timezone_info = timezone(timedelta(hours=timezone_offset))
    # Current SG wall-clock time, made naive so it can be compared with the
    # naive timestamp parsed from the image (assumed to be SG time as well)
    now = datetime.now(timezone_info).replace(tzinfo=None)
    difference = now - date_time_object # Calculate difference in current time vs image

    # TODO: adjust the score when the date or the time was missing from the image text
    if difference.total_seconds() < threshold:
        return 1
    return 0
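An alternative is to keep both sides timezone-aware and to pass the current time in as a parameter, which makes the check easy to test with fixed values. This is a sketch under stated assumptions: `is_within_window` is a hypothetical function, naive timestamps are assumed to be Singapore local time, and rejecting future timestamps is a design choice rather than something the notebook specifies.

```python
from datetime import datetime, timedelta, timezone

SG_TZ = timezone(timedelta(hours=8))  # Singapore, GMT+8

def is_within_window(timestamp, now=None, threshold_seconds=600):
    """Return True if `timestamp` is less than `threshold_seconds` older than `now`."""
    if timestamp.tzinfo is None:
        # Assumption: naive timestamps read from images are Singapore local time
        timestamp = timestamp.replace(tzinfo=SG_TZ)
    if now is None:
        now = datetime.now(SG_TZ)
    age = (now - timestamp).total_seconds()
    # Timestamps in the future (negative age) are rejected in this sketch
    return 0 <= age < threshold_seconds

fixed_now = datetime(2022, 3, 18, 10, 5, 0, tzinfo=SG_TZ)
print(is_within_window(datetime(2022, 3, 18, 10, 1, 14), now=fixed_now))   # True
print(is_within_window(datetime(2022, 3, 15, 14, 59, 42), now=fixed_now))  # False
```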

In [11]:
# Main function to determine if image within time period
def determine_time_validity():
    # Path to folder with green pass images
    texts, filenames = extract_text_from_images("./ProcessedImages/GreenPass")
    for text, filename in zip(texts, filenames):
        dates, times = extract_time_date_data(text)
        print(dates)
        print(times)
        date_time_object = parse_date_time_data(dates, times)
        outcome = check_datetime_in_window(date_time_object)
        print('Filename: ' + filename)
        print(outcome)
        print('===================')

determine_time_validity()

['2022 Mar 15']
['14:59:42']
Filename: 1.jpg
0
['2022 Mar 15']
['19:08:20']
Filename: 10.jpg
0
['2022 Mar 18']
['10:01:14']
Filename: 11.jpg
0
['2022 Mar 18']
['10:01:18']
Filename: 12.jpg
0
['2022 Mar 18']
['14:46:38']
Filename: 13.jpg
0
['2022 Mar 18']
['14:46:45']
Filename: 14.jpg
0
['2022 Mar 18']
['14:47:05']
Filename: 15.jpg
0
['2022 Mar 18']
['09:59:40']
Filename: 2.jpg
0
['2022 Mar 18']
['14:55:16']
Filename: 3.jpg
0
['2022 Mar 16']
['16:37:22']
Filename: 4.jpg
0
['2022 Mar 18']
['14:43:28']
Filename: 5.jpg
0
['2022 Mar 18']
['15:07:04']
Filename: 6.jpg
0
['2022 Mar 15']
['17:38:38']
Filename: 7.jpg
0
[]
['19:08:03']
Filename: 8.jpg
0
[]
['19:08:01']
Filename: 9.jpg
0
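In the output above, `8.jpg` and `9.jpg` have no date match, so `parse_date_time_data` falls back to `2020 Jan 01` and the result is 0 regardless of the time that was read. One possible way to tell a missing field apart from a stale timestamp is to label the regex results before parsing. The helper `classify_matches` and its labels are hypothetical, offered only as a starting point for the TODO noted in `check_datetime_in_window`.

```python
def classify_matches(date_matches, time_matches):
    """Label which parts of the timestamp the OCR step managed to find."""
    if not date_matches and not time_matches:
        return "missing_both"
    if not date_matches:
        return "missing_date"
    if not time_matches:
        return "missing_time"
    return "complete"

print(classify_matches(["2022 Mar 15"], ["14:59:42"]))  # complete
print(classify_matches([], ["19:08:03"]))               # missing_date
```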


In [12]:
# Test code
now = datetime.now()
now = now - timedelta(seconds=120)
print(check_datetime_in_window(now))

1
