# OCR Model Evaluation 
In this notebook, we evaluate three models for the OCR component of Gredient. We use data from the public [Open Food Facts](https://world.openfoodfacts.org/data) (OFF) dataset, which contains images and annotations of product ingredients.

We evaluate Pytesseract, Amazon Rekognition, and Amazon Textract on their ability to correctly detect ingredients in an image. We first run each model on the whole sample, which includes both high- and low-quality images. On that sample, performance is fairly low for all models. To proxy performance on high-quality images, we take the 210 images with the highest F1 scores for each model and run all three models on each of the three resulting sets. We then average the 630 F1 scores per model to get the final metric used for evaluation. We also record the time in seconds each model takes per image and include speed in the evaluation. Since detection performance depends heavily on image quality, the Gredient interface encourages users to take better photos by providing a cropping mechanism and messages showing what good- and bad-quality images look like.

### Import dependencies

In [1]:
# import general libraries
import pandas as pd
import numpy as np
import string
import time
import re

In [2]:
# import libraries for connecting to S3
import boto3 
import botocore 
from sagemaker import get_execution_role 

In [3]:
# import libraries for preprocessing
import urllib.request # urlopen lives in the request submodule
import cv2
from io import BytesIO
import base64

In [4]:
# import libraries for OCR 
import pytesseract
from PIL import Image
from pytesseract import Output

### Connect to S3 to read OFF data
The OFF data is stored in an Amazon S3 bucket for ease of access.

In [5]:
# connect to S3 bucket to access OFF data
role = get_execution_role() 
bucket = 'sagemaker-060720' 
data_key = 'evalOFFdata.csv' 
data_location = 's3://{}/{}'.format(bucket, data_key) 

In [6]:
# load OFF data
eval_data = pd.read_csv(data_location)
print(eval_data.shape)

(6576, 5)


### Preprocessing for OCR
- Sample 1000 images for the evaluation of each model.
- Each image is stored as a URL, so ```url_to_image``` downloads it and decodes it into a NumPy ndarray that the OCR functions can work with.


In [7]:
# sample N data for evaluation
n=1000
data = eval_data.sample(n, random_state=210).reset_index()
print(data.shape)
data.head()

(1000, 6)


Unnamed: 0,level_0,index,product_name,countries_en,image_ingredients_url,ingredients_text
0,5849,392866,Simply Lemonade,United States,https://static.openfoodfacts.org/images/produc...,"PURE FILTERED WATER, SUGAR, LEMON JUICE, NATUR..."
1,5778,386955,Free-range turkey snack sticks,United States,https://static.openfoodfacts.org/images/produc...,"Turkey, water, redmond seasoned salt (sea salt..."
2,4939,309534,nutricost b2,United States,https://static.openfoodfacts.org/images/produc...,supplement fac serving size: 1 capsule serving...
3,159,8126,Coconut Oil,United States,https://static.openfoodfacts.org/images/produc...,"coconut oil,"
4,2038,105093,"Lemon & lemon zest flavored mineral water, lem...",United States,https://static.openfoodfacts.org/images/produc...,"Carbonated mineral water, natural flavors."


In [8]:
# function to convert url to images
def url_to_image(url):
    resp = urllib.request.urlopen(url)
    image = np.asarray(bytearray(resp.read()), dtype='uint8')
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    return image

In [9]:
# convert image urls to np arrays (cv2.imdecode returns channels in BGR order)
imgs = [url_to_image(image_url) for image_url in data.image_ingredients_url]
len(imgs)

1000

### Where is the image preprocessing?
In previous iterations of our finalized model, we implemented and evaluated five image preprocessing methods built from combinations of grayscaling, thresholding, dilating, eroding, opening, deskewing, and Canny edge detection. These methods did not improve the accuracy metrics, likely because the OCR algorithms already apply their own preprocessing to each image.
In our tests, adding these operations mostly introduced noise and lowered accuracy, so no extra preprocessing is applied here.

### Auxiliary Functions for Cleaning Text of Ingredients in OFF dataset
- Per given list of strings
- Per given word

In [10]:
# clean word tokens
def clean_word(word):
    
    c_word = word.lower().strip() # lowercase and remove white space
    c_word = re.sub('[^a-zA-Z]+', '', c_word) # remove anything that's not a letter
    if len(c_word) < 2: # remove words that are less than 2 characters
        c_word = "" 
    
    return c_word

# clean list of strings 
def clean_text(text, split=True):
    
    if split == False: # for ocr output
        c_text = [clean_word(w) for w in text] # already split and clean words
        
    else: 
        c_text = re.sub('[0-9]', ' ', text) # replace numbers with space 
        c_text = re.sub('['+string.punctuation+']', ' ', c_text) # replace punctuation with space
        c_text = [clean_word(w) for w in c_text.split()] # split on spaces and clean words
      
    c_text = sorted(list(filter(None, set(c_text)))) # remove empty words and get unique values and sort
    
    return c_text
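
As a quick illustration of the cleaning steps, the sketch below restates the two helpers in compact form and applies them to a made-up ingredient string. The sample strings are hypothetical and not taken from the OFF data, and the use of `re.escape` on the punctuation set is our substitution, not part of the original cell.

```python
import re
import string

def clean_word(word):
    # lowercase, keep letters only, drop tokens shorter than 2 characters
    c_word = re.sub('[^a-zA-Z]+', '', word.lower().strip())
    return c_word if len(c_word) >= 2 else ""

def clean_text(text, split=True):
    if split:
        # ground-truth string: blank out digits and punctuation, then split on spaces
        text = re.sub('[0-9]', ' ', text)
        text = re.sub('[' + re.escape(string.punctuation) + ']', ' ', text)
        text = text.split()
    # OCR output arrives already tokenized, so it goes straight to word cleaning
    return sorted(set(filter(None, (clean_word(w) for w in text))))

# hypothetical label text, for illustration only
sample = "Water, Sugar (2%), lemon juice, E330."
print(clean_text(sample))  # ['juice', 'lemon', 'sugar', 'water']

# hypothetical OCR tokens, already split
tokens = ["WATER", "Sugar,", "2%", ""]
print(clean_text(tokens, split=False))  # ['sugar', 'water']
```

Note that the result is a sorted list of unique words, so word order and repeated words in the label play no role in the metrics that follow.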

## Why do we care about OCR performance?
Since Gredient relies heavily on detecting ingredients in images uploaded by users, we wanted to choose the OCR model with the best metrics.

### Auxiliary Functions for Metrics of OCR only
The following functions calculate precision, recall, and F1 scores for each of the OCR models we evaluated. Ideally, the best model detects every word in the ingredient label and nothing else. We therefore measure precision, recall, and F1 at the word level to find how many of the actual ingredient words were successfully detected by the OCR.

- ```precision``` function:
    - divides the number of detected words that appear in the ingredients by the number of all detected words.
- ```recall``` function:
    - divides the number of detected words that appear in the ingredients by the number of words in the ingredients.
- ```F1score``` function:
    - calculates F1 the usual way: 2\*precision\*recall / (precision+recall)
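
To make the three definitions concrete, here is a minimal worked example on a single made-up image. The helper name `word_metrics` and the word lists are hypothetical; the arithmetic follows the definitions in the list above.

```python
def word_metrics(detected, actual):
    # word-level precision, recall and F1 for one image (hypothetical helper)
    if not detected or not actual:
        return 0.0, 0.0, 0.0
    tp = sum(word in actual for word in detected)  # detected words found in the label
    precision = tp / len(detected)
    recall = tp / len(actual)
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# made-up example: one misread word ('watr') and one extra word ('contains')
detected = ['contains', 'lemon', 'sugar', 'watr', 'water']
actual = ['juice', 'lemon', 'sugar', 'water']

p, r, f1 = word_metrics(detected, actual)
print(p, r, round(f1, 4))  # 0.6 0.75 0.6667
```

Three of the five detected words are in the label (precision 3/5), and three of the four label words were detected (recall 3/4).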

In [11]:
# precision: #(detected words that are in ingredients) / #(all detected words)
def precision(ing_lst, i, ingredients):
    detected_words = ing_lst[i]
    actual_ingredients = ingredients[i]
    if len(detected_words) == 0:
        return 0 # 0 if no detected words 
    else:
        tp = sum([dw in actual_ingredients for dw in detected_words])
        p = len(detected_words) # tp+fp (positives)
        return tp/p
    
    
# recall: #(detected words that are in ingredients) / #(all words in ingredients)
def recall(ing_lst, i, ingredients):
    detected_words = ing_lst[i]
    actual_ingredients = ingredients[i]
    if len(detected_words) == 0 or len(actual_ingredients) == 0:
        return 0 # 0 if no detected words or no actual ingredient words (avoids dividing by zero)
    else:
        tp = sum([dw in actual_ingredients for dw in detected_words])
        a = len(actual_ingredients) # tp+fn (actual)
        return tp/a
    

# f1 score: 2*precision*recall / (precision+recall)
def F1score(ing_lst, i, ingredients):
    p = precision(ing_lst, i, ingredients)
    r = recall(ing_lst, i, ingredients)
    if p==0 and r==0:
        return 0
    return (2*p*r)/(p+r)

As mentioned before, most of the images from the OFF dataset were of low quality and had ingredient labels that were not clearly visible. This resulted in our OCR models performing poorly.
Instead of dealing with the entire dataset, we created subsets of images that were easily readable, which had:
- A close up of an ingredient label
- Relatively good quality (not pixelated)

As a proxy for these criteria, we selected, for each model, the 210 images with the highest F1 scores.

In [12]:
# get top images, ingredients, and F1 score 
def top210(scores):
    
    s = np.array([scores])
    inds = (-s).argsort()[0][:210] # top 210 detections

    imgs210 = [imgs[i] for i in list(inds)] # images for top 210 detections 
    print(len(imgs210))

    a_ings = [ingredients[i] for i in list(inds)] # actual ingredients for top 210 detections 
    print(len(a_ings))

    print("Average top F1-score:", sum([scores[i] for i in list(inds)])/210) # average F1 score for top 210 detections 
    
    return [imgs210, a_ings]
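
The index selection inside `top210` can be sketched without NumPy as follows. The helper name `top_k_indices` and the scores are made up for illustration; the result should match `(-s).argsort()[:k]` except possibly in how tied scores are ordered.

```python
def top_k_indices(scores, k):
    # indices of the k highest scores, best first; ties keep their original order
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# made-up F1 scores for six images
scores = [0.31, 0.92, 0.05, 0.88, 0.92, 0.40]
top = top_k_indices(scores, 3)
print(top)                       # [1, 4, 3]
print([scores[i] for i in top])  # [0.92, 0.92, 0.88]
```

The same indices are then used to pick both the images and their cleaned ground-truth ingredients, which keeps the two lists aligned.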

### Cleaned List of Actual Ingredients
Here we clean the ingredients (strings) provided by the OFF dataset.

Note that these are the ingredients that we will be utilizing for the evaluation of our OCR models.

In [13]:
ingredients = [clean_text(ing) for ing in data.ingredients_text] # words in ingredients 
len(ingredients)

1000

## Pytesseract
We call the Pytesseract OCR engine with page segmentation mode 6 (```--psm 6```), which assumes a single uniform block of text.

In [14]:
# run pytesseract on an image
def pytess(img) :
    custom_oem_psm_config = r'--dpi 300 --psm 6'
    box = pytesseract.image_to_data(img, output_type=Output.DICT, lang='eng', config=custom_oem_psm_config)
    return box['text']

**--skip if already run**

In [None]:
# apply pytesseract to images and return time
start_time = time.time()
pyt_texts = [pytess(img) for img in imgs]
print("--- %s seconds ---" % (int(time.time() - start_time)/n)) # 1.014
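
Note that the timing line above truncates the elapsed time to whole seconds with `int(...)` before dividing by `n`. A minimal sketch of a timing helper that keeps the fractional part is shown below; `time_per_item` and `fake_ocr` are hypothetical names, and `fake_ocr` merely stands in for an OCR call.

```python
import time

def time_per_item(fn, items):
    # run fn on every item and return (results, average seconds per item)
    start = time.perf_counter()
    results = [fn(item) for item in items]
    elapsed = time.perf_counter() - start
    return results, elapsed / max(len(items), 1)

# stand-in for an OCR call, for illustration only
def fake_ocr(img):
    return img.upper()

texts, secs = time_per_item(fake_ocr, ["ab", "cd", "ef"])
print(texts)  # ['AB', 'CD', 'EF']
print("%.6f seconds per image" % secs)
```

With 1000 or 210 images the truncation changes the per-image figure by less than 0.005 seconds, so the reported timings are not materially affected.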

In [None]:
# detected words
pyt_ingredients = [clean_text(text, split=False) for text in pyt_texts]
len(pyt_ingredients)

In [None]:
# save output of tesseract
pyt_output = pd.DataFrame({'detected':pyt_ingredients})
pyt_output.to_csv('pyt_output.csv', index=False)

**--**

Get ingredients that were detected, clean them, calculate F1 scores.

In [15]:
# load saved data in csv
pyt_data = pd.read_csv('pyt_output.csv')

# get detected ingredients
pyt_ingredients = [[re.sub('[^a-zA-Z]+', '', e) for e in l.split(",")] for l in pyt_data.detected]

# peek at detected ingredients
print(pyt_ingredients[:5])
print(len(pyt_ingredients))

# get F1 scores for pytesseract detections
pyt_scores = [F1score(pyt_ingredients, i, ingredients) for i in range(n)]

print("Average F1-score:", sum(pyt_scores)/n)

[['contains', 'filtered', 'flavors', 'juice', 'lemon', 'natural', 'pure', 'sugar', 'water'], ['coe', 'igen', 'ingredients', 'musa', 'polis', 'rlefobeaess', 'tas'], ['are', 'es', 'fa', 'so', 'supplement', 'tren'], ['above', 'acon', 'aeohe', 'ane', 'approximatey', 'asmoke', 'ay', 'becomes', 'bho', 'co', 'coconut', 'cool', 'cooonuror', 'dry', 'en', 'extreme', 'facts', 'fam', 'hae', 'heat', 'ina', 'ingredients', 'lguid', 'ma', 'of', 'oistrcuten', 'ol', 'ona', 'oretvedu', 'plageana', 'point', 'refined', 'sou', 'srst', 'store', 'the', 'wea', 'xnocen'], ['']]
1000
Average F1-score: 0.31855246325287856
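
Because the `detected` column holds Python lists, the CSV stores each row as the list's string form, and the cell above recovers the words by splitting on commas and stripping non-letters. The sketch below shows that round trip on two hand-written rows, next to an alternative based on `ast.literal_eval`. The alternative is our assumption for comparison, not the method used in this notebook.

```python
import ast
import re

# how a saved row looks once the list column has been written to CSV and read back
saved_rows = ["['contains', 'filtered', 'lemon']", "[]"]

# approach used in this notebook: split on commas, keep letters only
by_split = [[re.sub('[^a-zA-Z]+', '', e) for e in row.split(",")] for row in saved_rows]
print(by_split)    # [['contains', 'filtered', 'lemon'], ['']]

# alternative (an assumption, not the notebook's method): parse the list literal
by_literal = [ast.literal_eval(row) for row in saved_rows]
print(by_literal)  # [['contains', 'filtered', 'lemon'], []]
```

An empty detection comes back as `['']` with the split approach, which is what the fifth row of the output above shows. The F1 score is 0 in either representation, since the empty string never matches a cleaned ingredient word.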


In [16]:
# top 210 images and actual ingredients
pyt_imgs, a_pyt_ings = top210(pyt_scores)

210
210
Average top F1-score: 0.8488069532533237


## Rekognition
Call to AWS Rekognition for the OCR of ingredients

In [17]:
# instantiate rekognition object
rek_client=boto3.client('rekognition')

In [18]:
# run rekognition on an image
def rekogn(img):
    pil_img = Image.fromarray(img)
    buff = BytesIO()
    pil_img.save(buff, format="JPEG")
    img_bytes = buff.getvalue()
    rek_text = rek_client.detect_text(Image={"Bytes":img_bytes})
    return rek_text

**--skip if already run**

In [None]:
# apply rekognition to images and return time
start_time = time.time()
rek_texts = [rekogn(img) for img in imgs]
print("--- %s seconds ---" % (int(time.time() - start_time)/n)) # 4.67

In [None]:
# detected words
rek_words = [[text['DetectedText'] if text['Type']=='WORD' else "" for text in texts['TextDetections']] for texts in rek_texts]

rek_ingredients = [clean_text(text, split=False) for text in rek_words] # detected words

len(rek_ingredients)
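
The word extraction above relies on the shape of the `detect_text` response, where each entry in `TextDetections` is either a `LINE` or a `WORD`. The sketch below uses a trimmed, hand-written response with made-up values to show the filtering. The helper `words_from_response` and its optional confidence threshold are hypothetical and are not used in this notebook.

```python
# trimmed, hand-written stand-in for a detect_text response (values are made up)
response = {
    "TextDetections": [
        {"DetectedText": "WATER, SUGAR", "Type": "LINE", "Confidence": 98.1},
        {"DetectedText": "WATER,", "Type": "WORD", "Confidence": 99.0},
        {"DetectedText": "SUGAR", "Type": "WORD", "Confidence": 97.2},
        {"DetectedText": "5UGAR", "Type": "WORD", "Confidence": 41.5},
    ]
}

def words_from_response(resp, min_confidence=0.0):
    # keep WORD entries only, since LINE entries repeat the same text
    return [d["DetectedText"] for d in resp["TextDetections"]
            if d["Type"] == "WORD" and d["Confidence"] >= min_confidence]

all_words = words_from_response(response)
confident_words = words_from_response(response, 90.0)
print(all_words)        # ['WATER,', 'SUGAR', '5UGAR']
print(confident_words)  # ['WATER,', 'SUGAR']
```

The notebook's comprehension instead emits an empty string for every non-`WORD` entry and lets `clean_text` drop it later, which gives the same set of words.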

In [None]:
# save output of rekognition
rek_output = pd.DataFrame({'detected':rek_ingredients})
rek_output.to_csv('rek_output.csv', index=False)

**--**

Again, we read the saved Rekognition detections, clean them, and calculate F1 scores.

In [19]:
# load saved data in csv
rek_data = pd.read_csv('rek_output.csv')

# get detected ingredients
rek_ingredients = [[re.sub('[^a-zA-Z]+', '', e) for e in l.split(",")] for l in rek_data.detected]

# peek at detected ingredients
print(rek_ingredients[:5])
print(len(rek_ingredients))

# get F1 scores for rekognition detections
rek_scores = [F1score(rek_ingredients, i, ingredients) for i in range(n)]

print("Average F1-score:", sum(rek_scores)/n)

[['contains', 'filtered', 'flavors', 'juice', 'lemon', 'natural', 'pure', 'sugar', 'water'], ['acid', 'agen', 'beef', 'black', 'casing', 'celery', 'co', 'coriander', 'cutured', 'encapsutated', 'feeange', 'ic', 'in', 'induding', 'ingredients', 'mustard', 'onion', 'owder', 'paprikatumeric', 'pasley', 'pepper', 'powder', 'redmond', 'sal', 'salt', 'sat', 'sea', 'seasoned', 'spices', 'tc', 'turkey', 'water'], ['acid', 'anin', 'aonbunt', 'caicium', 'capsule', 'container', 'daly', 'due', 'established', 'fac', 'magnesium', 'mg', 'ngridenesgelasin', 'noe', 'ohee', 'per', 'ribofavin', 'riceflout', 'sarios', 'serving', 'sevig', 'sico', 'size', 'stearave', 'supplement', 'vale', 'veortable'], ['ainer', 'and', 'approximately', 'away', 'best', 'by', 'cincinnati', 'co', 'coconut', 'cool', 'distributed', 'dry', 'extreme', 'facts', 'for', 'forbaking', 'free', 'from', 'gluten', 'has', 'heat', 'in', 'ingredients', 'is', 'kroger', 'mediumhigh', 'of', 'ohio', 'oil', 'ol', 'or', 'over', 'place', 'point', 're

In [20]:
# top 210 images and actual ingredients
rek_imgs, a_rek_ings = top210(rek_scores)

210
210
Average top F1-score: 0.9244284208177813


## Textract
Call AWS Textract to make predictions on ingredient list.

In [21]:
# instantiate textract object
tex_client = boto3.client('textract')

In [22]:
# run textract on an image
def textract(img):
    pil_img = Image.fromarray(img)
    buff = BytesIO()
    pil_img.save(buff, format="JPEG")
    img_bytes = buff.getvalue()
    tex_text = tex_client.detect_document_text(Document={"Bytes":img_bytes})
    return tex_text

**--skip if already run**

In [None]:
# apply textract to images and return time
start_time = time.time()
tex_texts = [textract(img) for img in imgs]
print("--- %s seconds ---" % (int(time.time() - start_time)/n)) # 1.354

In [None]:
# detected words
tex_words = [[text['Text'] if text['BlockType']=='WORD' else "" for text in texts['Blocks']] for texts in tex_texts]

tex_ingredients = [clean_text(text, split=False) for text in tex_words] # detected words

len(tex_ingredients)

In [None]:
# save output of textract
tex_output = pd.DataFrame({'detected':tex_ingredients})
tex_output.to_csv('tex_output.csv', index=False)

**--**

Clean detections, calculate F1 scores.

In [23]:
# load saved data in csv
tex_data = pd.read_csv('tex_output.csv')

# get detected ingredients
tex_ingredients = [[re.sub('[^a-zA-Z]+', '', e) for e in l.split(",")] for l in tex_data.detected]

# peek at detected ingredients
print(tex_ingredients[:5])
print(len(tex_ingredients))

# get F1 scores for textract detections
tex_scores = [F1score(tex_ingredients, i, ingredients) for i in range(n)]

print("Average F1-score:", sum(tex_scores)/n)

[['contains', 'filtered', 'juice', 'lemon', 'natural', 'pure', 'rlavors', 'sugar', 'water'], ['bas', 'bes', 'bououunteyuded', 'bpmo', 'daepoepow', 'epwode', 'espauoseaspuoupey', 'fuseyuebe', 'il', 'jjemfaxnebueseay', 'jpequ', 'laddeo', 'lapueuoobupnpujsaodses', 'lase', 'lepopangno', 'per', 'perensdeoue', 'pjegsnw', 'poeon', 'sinbiobuon', 'uouo'], ['amount', 'and', 'blue', 'cakcion', 'capsule', 'container', 'established', 'fac', 'flour', 'gelatin', 'ingredonts', 'magnesiu', 'ng', 'no', 'oe', 'pe', 'per', 'ribotain', 'rice', 'route', 'seeing', 'serving', 'shearate', 'sine', 'son', 'supplement', 'truly', 'veoetable', 'veri'], ['above', 'and', 'aner', 'approximately', 'away', 'baking', 'becomes', 'below', 'best', 'by', 'cincinnati', 'co', 'coconut', 'cool', 'dally', 'distributed', 'dry', 'extreme', 'facts', 'fand', 'for', 'free', 'from', 'gluten', 'has', 'heat', 'in', 'ingredients', 'is', 'kroger', 'liquid', 'mediumhigh', 'of', 'ohi', 'oil', 'ol', 'or', 'over', 'place', 'point', 'refined',

In [24]:
# top 210 images and actual ingredients
tex_imgs, a_tex_ings = top210(tex_scores)

210
210
Average top F1-score: 0.9197823584411816


## High Quality Images

We proceed to evaluate each model on all three sets of "high-quality" images.

### Helpful Functions

In [25]:
# run each model on top images
def run_all(imgs, n):
    
    # apply pytesseract to images and return time
    start_time = time.time()
    p_texts = [pytess(img) for img in imgs]
    p_time = int(time.time() - start_time)/n
    print("--- pyt %s seconds ---" % (p_time))
    p_ingredients = [clean_text(text, split=False) for text in p_texts]
    print(len(p_ingredients))
    
    # apply rekognition to images and return time
    start_time = time.time()
    r_texts = [rekogn(img) for img in imgs]
    r_time = int(time.time() - start_time)/n
    print("--- rek %s seconds ---" % (r_time)) 
    r_words = [[text['DetectedText'] if text['Type']=='WORD' else "" for text in texts['TextDetections']] for texts in r_texts]
    r_ingredients = [clean_text(text, split=False) for text in r_words] # detected words
    print(len(r_ingredients))
    
    # apply textract to images and return time
    start_time = time.time()
    t_texts = [textract(img) for img in imgs]
    t_time = int(time.time() - start_time)/n
    print("--- tex %s seconds ---" % (t_time))
    t_words = [[text['Text'] if text['BlockType']=='WORD' else "" for text in texts['Blocks']] for texts in t_texts]
    t_ingredients = [clean_text(text, split=False) for text in t_words] # detected words
    print(len(t_ingredients))
    
    return [[p_ingredients, r_ingredients, t_ingredients],[p_time, r_time, t_time]]

In [26]:
# get top scores for each model on a set of top detections
def top_scores(top_detections, top_ingredients):
    
    top_s = [[F1score(td, i, ti) for i in range(210)] for td,ti in zip(top_detections,[top_ingredients]*3)]

    print("Average F1-scores (pyt):", sum(top_s[0])/210)
    print("Average F1-scores (rek):", sum(top_s[1])/210)
    print("Average F1-scores (tex):", sum(top_s[2])/210)
    
    return top_s

### Scores for Top Pytesseract Images

In [27]:
top_pyt = run_all(pyt_imgs, 210)
top_pyt_ingredients = top_pyt[0]
top_pyt_time = top_pyt[1]

--- pyt 0.4095238095238095 seconds ---
210
--- rek 4.223809523809524 seconds ---
210
--- tex 1.180952380952381 seconds ---
210


In [28]:
top_pyt_scores = top_scores(top_pyt_ingredients,a_pyt_ings)

Average F1-scores (pyt): 0.8483084278219671
Average F1-scores (rek): 0.8435603771332821
Average F1-scores (tex): 0.8612136523989301


### Scores for Top Rekognition Images

In [29]:
top_rek = run_all(rek_imgs, 210)
top_rek_ingredients = top_rek[0]
top_rek_time = top_rek[1]

--- pyt 0.37142857142857144 seconds ---
210
--- rek 3.7 seconds ---
210
--- tex 1.2285714285714286 seconds ---
210


In [30]:
top_rek_scores = top_scores(top_rek_ingredients,a_rek_ings)

Average F1-scores (pyt): 0.6465259060446534
Average F1-scores (rek): 0.9234937130446207
Average F1-scores (tex): 0.8222205873419498


### Scores for Top Textract Images

In [31]:
top_tex = run_all(tex_imgs, 210)
top_tex_ingredients = top_tex[0]
top_tex_time = top_tex[1]

--- pyt 0.40476190476190477 seconds ---
210
--- rek 3.604761904761905 seconds ---
210
--- tex 1.0904761904761904 seconds ---
210


In [32]:
top_tex_scores = top_scores(top_tex_ingredients,a_tex_ings)

Average F1-scores (pyt): 0.6907304846374634
Average F1-scores (rek): 0.8651720762650041
Average F1-scores (tex): 0.9096800807703677


### Average Scores for All Top Images

In [33]:
model = ['Pytesseract', 'Rekognition', 'Textract']

for i in range(3):
    print(model[i], (sum(top_pyt_scores[i]) + sum(top_rek_scores[i]) + sum(top_tex_scores[i])) / (3*210))

Pytesseract 0.728521606168028
Rekognition 0.8774087221476357
Textract 0.8643714401704159
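
The averages above can be cross-checked from the per-set averages printed earlier. Since every set holds 210 images, the pooled mean over all 630 scores equals the mean of the three set means. The sketch below copies the printed values; the dictionary names are ours.

```python
# per-set average F1 scores copied from the three outputs above
# (order: top Pytesseract set, top Rekognition set, top Textract set)
set_means = {
    "Pytesseract": [0.8483084278219671, 0.6465259060446534, 0.6907304846374634],
    "Rekognition": [0.8435603771332821, 0.9234937130446207, 0.8651720762650041],
    "Textract":    [0.8612136523989301, 0.8222205873419498, 0.9096800807703677],
}

# equal set sizes, so the pooled mean is the plain mean of the set means
pooled = {name: sum(means) / len(means) for name, means in set_means.items()}
for name, value in pooled.items():
    print(name, round(value, 4))
# Pytesseract 0.7285
# Rekognition 0.8774
# Textract 0.8644
```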


In [48]:
for i in range(3):
    print(model[i], (top_pyt_time[i] + top_rek_time[i] + top_tex_time[i])/3)

Pytesseract 0.3952380952380952
Rekognition 3.842857142857143
Textract 1.1666666666666667


## Results
- Pytesseract has an average F1 score of 0.729 across the three top-210 sets and a speed of 0.4 seconds/image
    - 0.84881 for its own top 210
    - 0.31855 for full sample
- Rekognition has an average F1 score of 0.877 across the three top-210 sets and a speed of 3.8 seconds/image
    - 0.92443 for its own top 210
    - 0.50114 for full sample
- Textract has an average F1 score of 0.864 across the three top-210 sets and a speed of 1.2 seconds/image
    - 0.91978 for its own top 210
    - 0.49600 for full sample