**Part 1**: OCR

# Import Libraries

In [1]:
import cv2 as cv
import math
import argparse
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import pytesseract
from pytesseract import Output
import re, string, copy, os, glob

import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import words
from nltk.corpus import wordnet 


In [9]:
#Argument dictionary for default args
args = {'input_folder':'../data/all_images/*.jpg', 
        'model':'../saved_models/east_text_detection.pb', 
        'thr':0.8, 
        'nms': 0.4, 
        'width_std':640, 
        'height_std':800,
        'width_1':640,
        'height_1': 640,
        'padding':0.05,
        'tesseract_conf': 70,
        'tesseract_conf_lax': 40,
        'padding_lax': 0.1}

# Problem Statement

There are two main goals for the project:
* Part 1: Create a proof-of-concept for an optical character recognition (OCR) tool to capture text data from single-origin coffee bean packaging labels
* Part 2: Deploy NLP techniques to accurately recommend coffees from an online store (sweetmarias.com) by leveraging the text data from Part 1

To narrow down the scope of the project, I will be focussing only on single origin coffee beans.

## Part 1: Introduction

This notebook will cover the OCR tool and the capturing of the data from this tool. While there are several ready-made OCR tools (ranging from open-source to paid APIs such as one offered by Amazon), the aim of this part is to create a tool which is free (i.e. using existing open-source models) and can suit the intended purpose (coffee labels). There are two main parts to creating this tool: **1) text detection** and **2) text recognition**. 

For text detection, I have chosen to use the EAST (Efficient and Accurate Scene Text) detector, which is considered one of the state-of-the-art deep learning architectures for text detection. I have chosen EAST because it is trained on natural images and is fast (without compromising on accuracy) compared to other text detectors (e.g. YOLO).

For text recognition, I have chosen to use Tesseract OCR. EAST is only a text detection architecture and only creates bounding boxes of text it identifies. Therefore, Tesseract will be deployed to 'recognise' the text in these bounding boxes and generate a best guess of what the text is. There are a few parameters that I have tuned to help improve the accuracy of the results from Tesseract - these will be explained further later.

While the end product generates text using one image at a time, the code below is able to process multiple images. The rationale for doing this is to allow us to evaluate the results of over 300 different images and to optimise the text detection and recognition accuracy. This will then allow us to fine-tune our models to capture the needed information as accurately as possible. 

No OCR tool is perfect, especially since there are infinite permutations of how text can appear on images. The aim is to capture these 5 types of information (if available) as accurately as possible: **country, region, variety, processing method and tasting notes**.

**Credit for images in dataset**:
<br>
IMG_0001 to IMG_0100 are from @gurucaleb on Instagram / IMG_0100 to IMG0300 are from Google search
<br>
All rights reserved to the copyright owners.

# Text Detection - EAST

## Resize Images

- EAST accepts images whose width and height are both multiples of 32
- Most of the images are in portrait mode, so we resize them to 640 x 800 (W x H)
- For images that have W=H, we resize them to 640 x 640 to avoid any compression
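
The multiple-of-32 constraint can be checked before an image is passed to the network. The helpers below are a minimal sketch and not part of the original pipeline; `snap_to_multiple` and `is_valid_east_size` are hypothetical names introduced only for illustration.

```python
def snap_to_multiple(value, base=32):
    """Round value to the nearest positive multiple of base (hypothetical helper)."""
    return max(base, int(round(value / base)) * base)


def is_valid_east_size(width, height, base=32):
    """Return True when both dimensions are multiples of base."""
    return width % base == 0 and height % base == 0


# The two target sizes used in this notebook both satisfy the constraint
print(is_valid_east_size(640, 800))
print(is_valid_east_size(640, 640))

# Arbitrary sizes can be snapped to the nearest valid dimension
print(snap_to_multiple(650), snap_to_multiple(790))
```

Fixed target sizes, as used in this notebook, are simpler than snapping each image individually, at the cost of some aspect-ratio distortion.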

In [3]:
def resize(image):
    '''Function to resize images'''
    
    og_height = image.shape[0]
    og_width = image.shape[1]
    
    if og_height == og_width:
        resized_width = args['width_1']
        resized_height = args['height_1']
    else:
        resized_width =  args['width_std']
        resized_height = args['height_std']


    resized_image= cv.resize(image, (resized_width, resized_height))
        
    return resized_image, (resized_width, resized_height)

## Decode EAST output

- Returns bounding boxes whose probability score is above the confidence threshold
- Code adapted from OpenCV Github -> Samples -> dnn [[Source](https://github.com/opencv/opencv/tree/4.x/samples/dnn)]
- FYI: the code in the EAST Github is mostly a C++ implementation [[Source](https://github.com/argman/EAST)], so it was re-written in Python using guidance from the OpenCV code.

In [4]:
def decode(scores, geometry, score_threshold):
    '''Returns bounding boxes and probability scores if above confidence threshold.
    Code adapted from OpenCV Github -> Samples -> dnn.
    Code from EAST Github is mostly in C++ implementation.'''
    
    detections = []
    confidences = []

    #Checks for incorrect dimensions
    assert len(scores.shape) == 4, 'Incorrect dimensions of scores'
    assert len(geometry.shape) == 4, 'Incorrect dimensions of geometry'
    assert scores.shape[0] == 1, 'Invalid dimensions of scores'
    assert geometry.shape[0] == 1, 'Invalid dimensions of geometry'
    assert scores.shape[1] == 1, 'Invalid dimensions of scores'
    assert geometry.shape[1] == 5, 'Invalid dimensions of geometry'
    assert scores.shape[2] == geometry.shape[2], 'Invalid dimensions of scores and geometry'
    assert scores.shape[3] == geometry.shape[3], 'Invalid dimensions of scores and geometry'
    
    height = scores.shape[2]
    width = scores.shape[3]
    
    #Loop over rows
    for y in range(0, height):

        # Extract data from scores
        scores_data = scores[0][0][y]
        x0_data = geometry[0][0][y]
        x1_data = geometry[0][1][y]
        x2_data = geometry[0][2][y]
        x3_data = geometry[0][3][y]
        angles_data = geometry[0][4][y]
        
        #Loop over columns
        for x in range(0, width):
            score = scores_data[x]

            # If score is lower than threshold score, ignore
            if(score < score_threshold):
                continue

            #Multiply back to original dimensions (EAST shrinks input by 4x)
            offsetX = x * 4.0
            offsetY = y * 4.0
            angle = angles_data[x]

            #Calculate cos and sin of angle
            cosA = np.cos(angle)
            sinA = np.sin(angle)
            h = x0_data[x] + x2_data[x]
            w = x1_data[x] + x3_data[x]

            #Calculate offset
            offset = ([offsetX + cosA * x1_data[x] + sinA * x2_data[x], offsetY - sinA * x1_data[x] + cosA * x2_data[x]])

            #Find points for bounding box 
            #This also rotates bounding boxes to capture angled text
            p1 = (-sinA * h + offset[0], -cosA * h + offset[1])
            p3 = (-cosA * w + offset[0],  sinA * w + offset[1])
            center = (0.5*(p1[0]+p3[0]), 0.5*(p1[1]+p3[1]))
            detections.append((center, (w,h), -1*angle * 180.0 / np.pi))
            confidences.append(float(score))

    # Return detections and confidences
    return [detections, confidences]
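
To make the geometry arithmetic in `decode` easier to follow, the sketch below repeats the same calculation for a single feature-map cell using only the standard library. The argument names `d_top`, `d_right`, `d_bottom` and `d_left` are my own labels for geometry channels 0 to 3 (an assumption about their meaning, not names taken from the model), and `decode_cell` is a hypothetical helper.

```python
import math


def decode_cell(x, y, d_top, d_right, d_bottom, d_left, angle, stride=4.0):
    """Decode one EAST feature-map cell into (center, (w, h), angle_degrees).

    Mirrors the arithmetic in decode(); argument names are illustrative.
    """
    # Map the cell back to pixel coordinates on the resized image
    offset_x, offset_y = x * stride, y * stride
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    h = d_top + d_bottom
    w = d_right + d_left

    # Corner reached by moving right and down from the cell, allowing for rotation
    corner = (offset_x + cos_a * d_right + sin_a * d_bottom,
              offset_y - sin_a * d_right + cos_a * d_bottom)
    p1 = (-sin_a * h + corner[0], -cos_a * h + corner[1])
    p3 = (-cos_a * w + corner[0], sin_a * w + corner[1])
    center = (0.5 * (p1[0] + p3[0]), 0.5 * (p1[1] + p3[1]))
    return center, (w, h), -angle * 180.0 / math.pi


# Unrotated example: cell (10, 5) maps to pixel (40, 20) on the resized image
center, size, angle_deg = decode_cell(10, 5, d_top=6, d_right=10,
                                      d_bottom=4, d_left=20, angle=0.0)
print(center, size, angle_deg)
```

With no rotation the box spans x from 20 to 50 and y from 14 to 24, so the centre comes out at (35, 19) with a size of 30 x 10.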

## Run EAST Model

Parameters for EAST that I have adjusted:
- Confidence threshold @ 80%
- Non-maximum suppression threshold @ 40%

The two outputs from EAST that we require are:
- Probability scores of whether an area contains text or not
- Coordinates of where the bounding box is detected
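
Non-maximum suppression keeps the highest-scoring box and discards others that overlap it too heavily. The sketch below illustrates the idea for axis-aligned boxes only; it is a simplification, since the notebook relies on OpenCV's rotated-box variant, and `iou` and `greedy_nms` are hypothetical helper names.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, score_threshold, nms_threshold):
    """Return indices of the boxes that are kept, highest score first."""
    # Drop low-confidence boxes, then visit the rest in descending score order
    order = sorted((i for i, s in enumerate(scores) if s >= score_threshold),
                   key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap any kept box too much
        if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in kept):
            kept.append(i)
    return kept


# Boxes 0 and 1 overlap heavily; box 2 sits on a separate line of text
boxes = [(0, 0, 100, 20), (5, 0, 105, 20), (0, 50, 100, 70)]
scores = [0.95, 0.90, 0.85]
print(greedy_nms(boxes, scores, score_threshold=0.8, nms_threshold=0.4))
```

A lower NMS threshold suppresses more aggressively, which reduces duplicate detections of the same word but risks merging words that sit close together.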

In [5]:
def gen_detect():
    
    conf_threshold = args['thr']
    nms_threshold = args['nms']
    
    # Load the pre-trained EAST model
    model = args['model']
    net = cv.dnn.readNet(model)

    #Two outputs needed from the EAST model:
    #1. Probability scores of whether an area contains text or not
    #2. Coordinates of the bounding box when text is detected
    outputNames = ['feature_fusion/Conv_7/Sigmoid', 'feature_fusion/concat_3']
    
    images = [file for file in glob.glob(args['input_folder'])]
    images.sort() 
    images = [cv.imread(img) for img in images]
    
    indices_output = []
    boxes_output = []
    resized_output = []
    for image in images:
        resized, (resized_width, resized_height) = resize(image)
        blob = cv.dnn.blobFromImage(resized, 1.0, (resized_width, resized_height), (123.68, 116.78, 103.94), True, False)
        
        net.setInput(blob)
        output_detect = net.forward(outputNames)
    
        #Scores and geometry from model output (i.e. decode EAST output)
        scores = output_detect[0]
        geometry = output_detect[1]

        [boxes, confidences] = decode(scores, geometry, conf_threshold)
        
        #Apply non-maximum suppression and return indices of bounding boxes
        indices = cv.dnn.NMSBoxesRotated(boxes, confidences, conf_threshold, nms_threshold)
        
        indices_output.append(indices)
        boxes_output.append(boxes)
        resized_output.append(resized)
        
    yield indices_output
    yield boxes_output
    yield resized_output

# Text Recognition - Tesseract

Parameters for Tesseract that I have adjusted:
- Confidence threshold @ 70%
- Padding @ 5%
- Lang = English and Spanish (`eng+spa`); the relaxed re-run described below uses English only (`eng`)
- oem = 1 and psm = 6 (treat image as block of text)

However, when 3 or fewer words are captured for an image (the code keeps the strict results only when more than 3 words are found), these parameters are re-adjusted:
- Confidence threshold @ 40%
- Padding @ 10%

For text bounding boxes that are very small (i.e. <32 pixels in height), I have re-scaled them to 32 pixels because, according to this [author](https://groups.google.com/g/tesseract-ocr/c/Wdh_JJwnw94/m/24JHDYQbBQAJ), this is the 'ideal' font height for Tesseract to recognise text.
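
The two rules described here, picking the most confident word above a threshold and upscaling short crops, can be sketched without Tesseract itself. The snippet assumes the recognition output has already been reduced to `(conf, text)` pairs; `best_word` and `upscale_factor` are hypothetical helpers, and the sample words are made up.

```python
def upscale_factor(box_height, target_height=32):
    """Scale factor that brings a short box up to target_height, else 1.0."""
    return target_height / box_height if box_height < target_height else 1.0


def best_word(words, conf_threshold):
    """Return the (conf, text) pair with the highest confidence above the threshold.

    words is a list of (conf, text) pairs; returns None if nothing qualifies.
    """
    candidates = [(conf, text) for conf, text in words if conf > conf_threshold]
    return max(candidates, key=lambda pair: pair[0]) if candidates else None


# A 16-pixel-high box is doubled in size; a 40-pixel-high box is left alone
print(upscale_factor(16), upscale_factor(40))

# Strict threshold (70) rejects a word that the relaxed threshold (40) accepts
words = [(55.0, 'gesha')]
print(best_word(words, 70))
print(best_word(words, 40))
```

The relaxed threshold trades precision for recall, which is why it is only applied to images where the strict pass finds very little text.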

## Run Tesseract Model

In [6]:
def gen_recognition():

    conf_tesseract = args['tesseract_conf']
    conf_tesseract_lax = args['tesseract_conf_lax']
    padding = args['padding']
    padding_lax = args['padding_lax']
    
    #Take bounding box coordinates/details from text detection part above
    output_EAST = list(gen_detect())
    indices_output = output_EAST[0]
    boxes_output = output_EAST[1]
    resized_output = output_EAST[2]

    results = []
    for img_id, (indices, boxes, image) in enumerate(zip(indices_output, boxes_output, resized_output)):
        
        sub_results = []
        
        for i in indices:
            center = boxes[i[0]][0]
            w, h = boxes[i[0]][1]
            angle = boxes[i[0]][2]
            
            #Center of bounding boxes    
            center_x, center_y = center
            center_x = int(center_x)
            center_y = int(center_y)

            if w < h:
                w, h = h, w
                angle += 90.0

            rows, cols, _ = image.shape

            #Rotate bounding boxes that have angled text
            matrix  = cv.getRotationMatrix2D(center, angle, 1)
            rotated = cv.warpAffine(image, matrix, (cols, rows))

            #padding
            dX = int(w * padding)
            dY = int(h * padding)

            #Crop the rotated bounding box
            start_y = int((center_y - (h / 2)) - dY)
            end_y   = int((start_y + h) + (2 * dY))
            start_x = int((center_x - (w / 2)) - dX)
            end_x   = int((start_x + w) + (2 * dX))
            start_x = start_x if 0 <= start_x < cols else (0 if start_x < 0 else cols-1)
            end_x   = end_x if 0 <= end_x < cols else (0 if end_x < 0 else cols-1)
            start_y = start_y if 0 <= start_y < rows else (0 if start_y < 0 else rows-1)
            end_y   = end_y if 0 <= end_y < rows else (0 if end_y < 0 else rows-1)
            crop    = rotated[start_y:end_y, start_x:end_x]

            #Rescale very small bounding boxes to 32 height
            if h < 32:
                crop = cv.resize(crop, None, fx=32/h, fy=32/h, interpolation=cv.INTER_CUBIC)

            #Configuration setting to convert image to string
            #Chosen english and spanish
            configuration = ('-l eng+spa --oem 1 --psm 6')

            #Recognize the text from the bounding box image 
            text = pytesseract.image_to_data(crop, config=configuration, output_type='data.frame')
            selected_text = text.loc[(text['conf'] > conf_tesseract), ['conf','text']]
            final_text = selected_text.values.tolist()

            if not final_text:
                continue
                
            if len(final_text) >= 2:
                final_text = selected_text.loc[selected_text['conf'].idxmax(),:].values.tolist()
            else:
                final_text = final_text[0]
            
            max_conf = final_text[0]
            
            
            max_text = str(final_text[1]).lower()
            if re.search(r'[^\w\s]', max_text):
                max_text = re.sub(r'[^\w\s]', '',max_text)
                
            if max_text == '':
                continue
            
            sub_results.append((img_id+1, (start_x, start_y, end_x, end_y), max_conf, max_text))
        
        sub_results = sorted(sub_results, key=lambda x: x[1][1])
        
        #Re-run with less strict parameters if 3 or fewer words identified
        if len(sub_results) > 3:
            
            results.append(sub_results)
            
        else:
            
            
            sub_results_lax = []
            for i in indices:
                center = boxes[i[0]][0]
                w, h = boxes[i[0]][1]
                angle = boxes[i[0]][2]

                #Center of bounding boxes    
                center_x, center_y = center
                center_x = int(center_x)
                center_y = int(center_y)

                if w < h:
                    w, h = h, w
                    angle += 90.0

                rows, cols, _ = image.shape

                #Rotate bounding boxes that have angled text
                matrix  = cv.getRotationMatrix2D(center, angle, 1)
                rotated = cv.warpAffine(image, matrix, (cols, rows))

                #padding
                dX = int(w * padding_lax)
                dY = int(h * padding_lax)

                #Crop the rotated bounding box
                start_y = int((center_y - (h / 2)) - dY)
                end_y   = int((start_y + h) + (2 * dY))
                start_x = int((center_x - (w / 2)) - dX)
                end_x   = int((start_x + w) + (2 * dX))
                start_x = start_x if 0 <= start_x < cols else (0 if start_x < 0 else cols-1)
                end_x   = end_x if 0 <= end_x < cols else (0 if end_x < 0 else cols-1)
                start_y = start_y if 0 <= start_y < rows else (0 if start_y < 0 else rows-1)
                end_y   = end_y if 0 <= end_y < rows else (0 if end_y < 0 else rows-1)
                crop    = rotated[start_y:end_y, start_x:end_x]

                #Rescale very small bounding boxes to 32 height
                if h < 32:
                    crop = cv.resize(crop, None, fx=32/h, fy=32/h, interpolation=cv.INTER_CUBIC)

                #Configuration setting to convert image to string
                #English only on the relaxed pass
                configuration = ('-l eng --oem 1 --psm 6')

                #Recognize the text from the bounding box image 
                text = pytesseract.image_to_data(crop, config=configuration, output_type='data.frame')
                selected_text = text.loc[(text['conf'] > conf_tesseract_lax), ['conf','text']]
                final_text = selected_text.values.tolist()

                if not final_text:
                    continue

                if len(final_text) >= 2:
                    final_text = selected_text.loc[selected_text['conf'].idxmax(),:].values.tolist()
                else:
                    final_text = final_text[0]

                max_conf = final_text[0]


                max_text = str(final_text[1]).lower()
                if re.search(r'[^a-zA-Z]', max_text):
                    max_text = re.sub(r'[^a-zA-Z]', '',max_text)
                
                if max_text == '' or max_text == ' ':
                    continue
                
                sub_results_lax.append((img_id+1, (start_x, start_y, end_x, end_y), max_conf, max_text))
                
            sub_results_lax = sorted(sub_results_lax, key=lambda x: x[1][0])
            results.append(sub_results_lax)
    yield results
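
The strict and relaxed passes in `gen_recognition` repeat the same padding and clamping arithmetic with different padding values. One way to express that step once is sketched below; `padded_crop_bounds` is a hypothetical helper that is not part of the notebook, and it is intended to reproduce the same bounds as the inline code.

```python
def padded_crop_bounds(center, size, padding, image_shape):
    """Return (start_x, start_y, end_x, end_y) for a padded crop clamped to the image."""
    center_x, center_y = int(center[0]), int(center[1])
    w, h = size
    rows, cols = image_shape[:2]

    # Padding is a fraction of the box width and height
    d_x, d_y = int(w * padding), int(h * padding)

    start_y = int((center_y - h / 2) - d_y)
    end_y = int(start_y + h + 2 * d_y)
    start_x = int((center_x - w / 2) - d_x)
    end_x = int(start_x + w + 2 * d_x)

    def clamp(value, upper):
        # Keep the coordinate inside [0, upper - 1]
        return min(max(value, 0), upper - 1)

    return (clamp(start_x, cols), clamp(start_y, rows),
            clamp(end_x, cols), clamp(end_y, rows))


# Same box cropped with the strict (5%) and relaxed (10%) padding on a 640 x 800 image
shape = (800, 640, 3)
print(padded_crop_bounds((100.0, 50.0), (60, 20), 0.05, shape))
print(padded_crop_bounds((100.0, 50.0), (60, 20), 0.10, shape))
```

Calling a helper like this from both passes would leave only the padding, confidence threshold and language configuration as the differences between them.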


# Results

The results of all 303 images are captured below. For Part 2, I have compiled them into 303 rows with all text captured recorded under the 'text' column.

In [10]:
#Run generator
results = list(gen_recognition())

#Update results into dataframe
image_data = [x for x in results[0]]
df_results = []
for data in image_data:
    data_results = pd.DataFrame(data, columns=['img_id','bbox coord', 'conf', 'text'])
    df_results.append(data_results)


In [11]:
#Summary of results    
summary = pd.concat(df_results)
summary

| | img_id | bbox coord | conf | text |
|---|---|---|---|---|
| 0 | 1 | (274, 166, 350, 191) | 92.0 | elegant |
| 1 | 1 | (454, 166, 493, 188) | 86.0 | yet |
| 2 | 1 | (191, 181, 238, 208) | 86.0 | med |
| 3 | 1 | (411, 199, 457, 221) | 95.0 | very |
| 4 | 1 | (196, 206, 244, 232) | 94.0 | med |
| ... | ... | ... | ... | ... |
| 8 | 303 | (207, 629, 305, 643) | 85.0 | producer |
| 9 | 303 | (301, 635, 389, 650) | 89.0 | fernando |
| 10 | 303 | (339, 653, 414, 668) | 83.0 | process |
| 11 | 303 | (411, 656, 474, 671) | 96.0 | washed |


In [13]:
#Export summary to CSV

summary.to_csv(r'../data/summary_all_images_70pct_0811_vfinal.csv', index=False)

## Data Cleaning

### Drop NaNs and Blank Values

In [14]:
#Replace any remaining blanks with NaN so that the rows can be dropped.
#Assigning the result back avoids relying on in-place edits of a column.

summary['text'] = summary['text'].replace(['', ' '], np.nan)

In [15]:
summary.dropna(subset=['text'], inplace=True)

In [16]:
summary

| | img_id | bbox coord | conf | text |
|---|---|---|---|---|
| 0 | 1 | (274, 166, 350, 191) | 92.0 | elegant |
| 1 | 1 | (454, 166, 493, 188) | 86.0 | yet |
| 2 | 1 | (191, 181, 238, 208) | 86.0 | med |
| 3 | 1 | (411, 199, 457, 221) | 95.0 | very |
| 4 | 1 | (196, 206, 244, 232) | 94.0 | med |
| ... | ... | ... | ... | ... |
| 8 | 303 | (207, 629, 305, 643) | 85.0 | producer |
| 9 | 303 | (301, 635, 389, 650) | 89.0 | fernando |
| 10 | 303 | (339, 653, 414, 668) | 83.0 | process |
| 11 | 303 | (411, 656, 474, 671) | 96.0 | washed |


### Remove non-words

In order for the recommender system to work effectively in Part 2, we need to pass in words that make sense. In the steps below, I have cleaned the input from the OCR tool by performing basic lemmatisation and removing words that are not found in the NLTK WordNet and words corpora. However, to account for some unique words such as bean variety (e.g. 'caturra', 'catuai'), I have manually added them into the dictionary to be used.
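
The filtering idea can be illustrated with a toy vocabulary. The sketch below is a simplification under stated assumptions: it uses a small hand-written word set instead of the NLTK corpora, skips lemmatisation, and strips non-letters rather than replacing them with spaces. `keep_known_words` is a hypothetical helper.

```python
import re

# Hypothetical miniature vocabulary; the notebook uses the NLTK corpora instead
base_vocabulary = {'washed', 'natural', 'chocolate', 'process'}
manual_additions = {'caturra', 'catuai', 'gesha'}
vocabulary = base_vocabulary | manual_additions


def keep_known_words(tokens, vocabulary, min_length=3):
    """Return cleaned tokens that are long enough and present in the vocabulary."""
    kept = []
    for token in tokens:
        # Strip anything that is not a letter, then normalise the case
        cleaned = re.sub(r'[^a-zA-Z]', '', token).lower()
        if len(cleaned) >= min_length and cleaned in vocabulary:
            kept.append(cleaned)
    return kept


print(keep_known_words(['Washed', 'caturra', 'fernando', 'xq', 'gesha!'], vocabulary))
```

Names such as 'fernando' are dropped because they are not in the vocabulary, which is the intended behaviour here but also means producer names are lost.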

In [17]:
#Manually adding words related to bean variety/regions

manual_additions_words = ['caturra', 'catuai', 'gesha', 'bourbon', 'arabica', 'sidama', 'sidamo', 'yirgacheffe', 'harrar', 'limu', 'guji']
nltk_words = set.union(set(nltk.corpus.wordnet.words()),set(nltk.corpus.words.words()))

total_words = nltk_words.union(manual_additions_words)

In [18]:
def clean_text(row):
    '''Function to apply basic cleaning and lemmatization'''
    
    #remove all non-alphabet characters
    row['text'] = re.sub(r'[^a-zA-Z]',' ', row['text']) 
    
    #lemmatize words to check if they exist
    lemmatizer = WordNetLemmatizer()
    lem_word = lemmatizer.lemmatize(row['text'])
    row['cleaned_text'] = lem_word
    
    return row

In [19]:
def check(row):
    '''Utility function to flag whether the cleaned text is in the word list'''
    
    if row['cleaned_text'] in total_words:
        row['check'] = 'Yes'
    else:
        row['check'] = 'No'
        
    return row

In [20]:
summary = summary.apply(clean_text, axis=1).apply(check, axis=1)

In [21]:
summary

| | img_id | bbox coord | conf | text | cleaned_text | check |
|---|---|---|---|---|---|---|
| 0 | 1 | (274, 166, 350, 191) | 92.0 | elegant | elegant | Yes |
| 1 | 1 | (454, 166, 493, 188) | 86.0 | yet | yet | Yes |
| 2 | 1 | (191, 181, 238, 208) | 86.0 | med | med | Yes |
| 3 | 1 | (411, 199, 457, 221) | 95.0 | very | very | Yes |
| 4 | 1 | (196, 206, 244, 232) | 94.0 | med | med | Yes |
| ... | ... | ... | ... | ... | ... | ... |
| 8 | 303 | (207, 629, 305, 643) | 85.0 | producer | producer | Yes |
| 9 | 303 | (301, 635, 389, 650) | 89.0 | fernando | fernando | No |
| 10 | 303 | (339, 653, 414, 668) | 83.0 | process | process | Yes |
| 11 | 303 | (411, 656, 474, 671) | 96.0 | washed | washed | Yes |


In [22]:
#Remove non-words (i.e. those marked as 'No')

summary = summary[summary['check'] == 'Yes']

In [23]:
#Remove words <3 characters

summary = summary.loc[summary['cleaned_text'].str.len() >=3 ,:]

In [24]:
summary

| | img_id | bbox coord | conf | text | cleaned_text | check |
|---|---|---|---|---|---|---|
| 0 | 1 | (274, 166, 350, 191) | 92.0 | elegant | elegant | Yes |
| 1 | 1 | (454, 166, 493, 188) | 86.0 | yet | yet | Yes |
| 2 | 1 | (191, 181, 238, 208) | 86.0 | med | med | Yes |
| 3 | 1 | (411, 199, 457, 221) | 95.0 | very | very | Yes |
| 4 | 1 | (196, 206, 244, 232) | 94.0 | med | med | Yes |
| ... | ... | ... | ... | ... | ... | ... |
| 6 | 303 | (314, 600, 364, 613) | 95.0 | green | green | Yes |
| 7 | 303 | (423, 602, 493, 616) | 95.0 | almond | almond | Yes |
| 8 | 303 | (207, 629, 305, 643) | 85.0 | producer | producer | Yes |
| 10 | 303 | (339, 653, 414, 668) | 83.0 | process | process | Yes |


### New DataFrame - Consolidate text by image

In [25]:
summary_consolidated = summary.groupby('img_id')['cleaned_text'].apply(lambda x: "%s" % ', '.join(x))

In [26]:
summary_consolidated = pd.DataFrame(summary_consolidated, columns=['cleaned_text'])

In [27]:
summary_consolidated.reset_index(inplace=True)

In [28]:
def unique_list(items):
    '''Return items with duplicates removed, keeping first-seen order'''
    return list(dict.fromkeys(items))

def dupe(row):
    split = row['cleaned_text'].split(', ')
    row['final_text'] = ', '.join(unique_list(split))
    return row
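
The group-then-deduplicate step can also be shown without pandas. The sketch below is a plain-Python illustration rather than the notebook's method; `consolidate` is a hypothetical helper that takes `(img_id, word)` pairs and relies on `dict.fromkeys` to drop repeats while keeping first-seen order.

```python
from collections import defaultdict


def consolidate(rows):
    """Group words by image id and join the unique words in first-seen order."""
    grouped = defaultdict(list)
    for img_id, word in rows:
        grouped[img_id].append(word)
    # dict.fromkeys removes duplicates while preserving insertion order
    return {img_id: ', '.join(dict.fromkeys(words))
            for img_id, words in grouped.items()}


rows = [(1, 'elegant'), (1, 'med'), (1, 'very'), (1, 'med'), (2, 'ethiopia')]
print(consolidate(rows))
```

Keeping the first-seen order matters because the words were sorted by position on the label, so the consolidated string roughly follows reading order.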

In [29]:
summary_consolidated = summary_consolidated.apply(dupe, axis=1)

In [30]:
summary_consolidated.drop(columns='cleaned_text', inplace=True)

In [31]:
summary_consolidated

| | img_id | final_text |
|---|---|---|
| 0 | 1 | elegant, yet, med, very, body, clean, brightne... |
| 1 | 2 | ethiopia, balance, indigenous, heirloom, dried... |
| 2 | 3 | kenya, bellingham, roasted |
| 3 | 4 | ethiopia |
| 4 | 5 | coffee, whole |
| ... | ... | ... |
| 295 | 299 | gram |
| 296 | 300 | coffee, world, old, natural, costa, don, juicy... |
| 297 | 301 | method, natural, processing, mixed, lao, altit... |
| 298 | 302 | black, harvest, elevation, shan, state, myanma... |


## Basic EDA

Not surprisingly, words like 'coffee', 'process' and 'roasters' appear most frequently. 'Useful' words such as 'Ethiopia', 'natural' and 'chocolate' are also captured quite a few times.

In [32]:
summary.groupby('text').count().sort_values(by='conf',ascending=False).head(15)

| text | img_id | bbox coord | conf | cleaned_text | check |
|---|---|---|---|---|---|
| coffee | 125 | 125 | 125 | 125 | 125 |
| process | 58 | 58 | 58 | 58 | 58 |
| roasters | 55 | 55 | 55 | 55 | 55 |
| washed | 53 | 53 | 53 | 53 | 53 |
| ethiopia | 40 | 40 | 40 | 40 | 40 |
| roasted | 39 | 39 | 39 | 39 | 39 |
| the | 35 | 35 | 35 | 35 | 35 |
| natural | 35 | 35 | 35 | 35 | 35 |
| and | 32 | 32 | 32 | 32 | 32 |
| notes | 31 | 31 | 31 | 31 | 31 |
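
The same kind of frequency count can be reproduced on a plain list of words with the standard library. This is only an illustrative sketch with made-up sample data; `top_words` is a hypothetical helper.

```python
from collections import Counter


def top_words(words, n=3):
    """Return the n most common words with their counts."""
    return Counter(words).most_common(n)


sample = ['coffee', 'washed', 'coffee', 'ethiopia', 'washed', 'coffee']
print(top_words(sample, n=2))
```

Filtering out stop words such as 'the' and 'and' before counting would make the informative terms stand out more clearly.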


**Average Number of Words Per Capture**

In [33]:
doc_length = []
for document in range(len(summary_consolidated['final_text'])):
    doc_length.append(len(summary_consolidated['final_text'][document].split(', ')))

avg_doc_length = sum(doc_length)/len(doc_length)
print(f'{round(avg_doc_length)}' + ' words')

8 words


## Consolidate results and export to CSV

In [34]:
summary_consolidated.to_csv(r'../data/consolidated_all_images_70pct_0811_vfinal.csv', index=False)

# Acknowledgments

These are the resources which I have found helpful:

[EAST Github](https://github.com/argman/EAST)
<br>
[OpenCV Github (for implementation of EAST in Python)](https://github.com/opencv/opencv/tree/4.x/samples/dnn)
<br>

Other Resources:
<br>
https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/?_ga=2.123146509.781830900.1634551481-1925401864.1633847684
<br>
https://nanonets.com/blog/ocr-with-tesseract/
<br>
https://www.pyimagesearch.com/2018/09/17/opencv-ocr-and-text-recognition-with-tesseract/
<br>
https://nanonets.com/blog/deep-learning-ocr/#text-detection