<img src="https://gcgrossi.github.io/GiulioGrossi.png" width="20%">

# **_Giulio Cornelio Grossi, Ph.D._** 
_giulio.cornelio.grossi@gmail.com_

[![Linkedin](https://img.shields.io/badge/Linkedin-blue?style=for-the-badge&logo=linkedin)](https://www.linkedin.com/in/giulio-cornelio-grossi/)
[![Github](https://img.shields.io/badge/Github-black?style=for-the-badge&logo=github)](https://github.com/gcgrossi)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/118ZV814gCWS1PmJwXEQG0IoLoy0-Dcso?usp=sharing)


#**_Downloads and Installs_**

In [None]:
print('\n Installing Tesseract ... \n')

# install pytesseract
!pip install pytesseract
!apt install tesseract-ocr
!apt install libtesseract-dev

print('\n Cloning Github Repository ... \n')

# clone repository with dataset
# (-f avoids an error when the folder does not exist yet)
!rm -rf AMLD2021/
!git clone https://github.com/SamurAi-sarl/AMLD2021.git
!ls -ltr AMLD2021/*

print('\n Downloading example images ... \n')

#download relevant images
!curl -L "https://docs.google.com/uc?export=download&id=1Gu46DRx_idNbvjqHcuSKilX64v0DYNUJ" > transaction_ticket.jpg
!curl -L "https://docs.google.com/uc?export=download&id=1fKDIVs2JcGxe3i01RtPpDKIbKx1Oan7Z" > registration_ticket.jpg
!curl -L "https://docs.google.com/uc?export=download&id=19rjKuqF5s9AfAQiUt80RbLkQVP3oQZy9" > invoice_ticket.jpg
!curl -L "https://docs.google.com/uc?export=download&id=1UV4wnGKG3S5YU9PGesXcDJWrMLZIsK7z" > invoice_scanned.jpg
!curl -L "https://docs.google.com/uc?export=download&id=1i-HOc4VE0U2J-p1so5mhHAnc1e9rdaFR" > invoice_test_flash.jpg

#**_Imports_**

In [None]:
from google.colab import drive
from google.colab import files

from pathlib import Path

import pytesseract
pytesseract.pytesseract.tesseract_cmd = (r'/usr/bin/tesseract')
from pytesseract import Output

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import subprocess
import shutil
import json
import cv2
import sys
import os


In [None]:
# mount drive folder and import custom modules
#drive.mount('/content/drive', force_remount=False)
#sys.path.insert(0,'/content/drive/MyDrive/Samurai_Workshop')

#from architectures.smallvggnet import SmallVGGNet

#**_Utilities Functions_** 

###**_Decode Bytestream Images_**

In [None]:
def decode_image(vals):
  # decode uploaded bytestring images
  # (np.frombuffer replaces the deprecated np.fromstring)
  nparr = np.frombuffer(vals, np.uint8)
  return cv2.imdecode(nparr, cv2.IMREAD_COLOR)

###**_Draw image in matplotlib_**

In [None]:
def draw(image,size=(7,25)):
  plt.figure(figsize=size)
  plt.imshow(image)
  return

<img src="https://drive.google.com/uc?id=1q7PJ36Cx8-YHnxDrG9w-coxn-GL2QWwv" width="20%">

#**_Difficulty: Padawan_**
#_Document with fixed position items_

This kind of document is the easiest to process. The methodology is simple: since the position of the information is fixed, it is sufficient to select the corresponding boxes (as x1,y1,x2,y2 coordinates) in the image and then extract the text with Tesseract.


##**_Example 1_**
##_Image: a Transaction Ticket_

In this example, a real-life transaction ticket from a famous bank has been delivered to the back office, which is now responsible for filling a Transactions Database with the trade information. The Asset Manager (or whoever is responsible for the task) should also account for the transaction in their portfolio. We are going to read a couple of fields in the document as an example.

In [None]:
#read the image from disk
filename = os.path.join(os.getcwd(),'transaction_ticket.jpg')
template = cv2.imread(filename)

# store a copy of the original image
orig = template.copy()

#show the input image
draw(cv2.cvtColor(template, cv2.COLOR_BGR2RGB))

### _**Select Relevant RoIs (Regions of Interest)**_



In [None]:
# define the coordinates of different rois
x_start, x_end = 1165, 1585
y = {'type':(1183,1208),'direction':(1208,1234),'exchange':(1234,1259)}

rois=[]

# loop over the rois coordinates
# select the image region  
# and append a list with the selected rois
for key,val in y.items():
  y_start, y_end = val[0], val[1]
  roi = template[y_start:y_end,x_start:x_end]
  rois.append(roi)

### _**Extract the text from each RoI**_



In [None]:
#loop over the selected rois
# draw the roi and extract the text with tesseract
for roi in rois:
  rgb = cv2.cvtColor(roi, cv2.COLOR_BGR2RGB)
  draw(rgb,(25,7))
  text = pytesseract.image_to_string(rgb)

  for line in text.split("\n"):
      print(line)

##**_Example 2_**
##_Image: a Registration Form and an Invoice_
In this second example, a patient registration form from a famous dental office in Geneva has been submitted. Our system now takes charge of creating a folder for the patient, where all their relevant documents will be stored. Later on, the system receives an invoice that should be stored in the folder of the client to whom the invoice was addressed.

We are going to read the first name and last name fields in the registration form (with the same methodology as in our first example) and create a corresponding folder. Afterwards, we are going to extract the same information from the invoice and try to move the invoice to the corresponding folder.

In [None]:
#read the image from disk
filename = os.path.join(os.getcwd(),'registration_ticket.jpg')
template = cv2.imread(filename)

# store a copy of the original image
orig = template.copy()

#show the input image
draw(cv2.cvtColor(template, cv2.COLOR_BGR2RGB))

### _**Select Relevant RoIs (Regions of Interest)**_



In [None]:
# define the coordinates of different rois
x_start, x_end = 200, 1400
y = {'first_name':(545,617),'last_name':(761,822)}

rois=[]

# loop over the rois coordinates
# select the image region  
# and append a list with the selected rois
for key,val in y.items():
  y_start, y_end = val[0], val[1]
  roi = template[y_start:y_end,x_start:x_end]
  rois.append(roi)


### _**Extract the text from each RoI**_



In [None]:
foldername=''
# define a list of forbidden tokens
forbidden_tokens = ['\x0c']

#loop over the selected rois
# draw the roi and extract the text with tesseract
print("\n[INFO] Characters after splitting for endline token")
for roi in rois:
  rgb = cv2.cvtColor(roi, cv2.COLOR_BGR2RGB)
  draw(rgb,(25,7))
  text = pytesseract.image_to_string(rgb)

  print('[INFO] {}'.format(text.split("\n")))

  #remove the end line token and remove forbidden characters
  for line in text.split("\n"):
      if line and line not in forbidden_tokens: foldername+=line

#creating the directory with os module
# (exist_ok avoids an error when the cell is run a second time)
print('\n[INFO] creating directory: {}\n'.format(foldername))
os.makedirs(foldername, exist_ok=True)

Here we see another important feature of Tesseract. As you can see from the output of the string processing, an unexpected character is detected: `'\x0c'`, the form feed, which according to [this definition](https://www.computerhope.com/jargon/f/formfeed.htm) is:

*a special character that, when encountered in code, causes printers to automatically advance one full page or the start of the next page.*

This is a limitation of Tesseract that should be taken into consideration when designing a Data Science product for a specific use case. One possible solution is to explore another Tesseract method, `image_to_data`, instead of `image_to_string`.
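
Another lightweight option is to strip control characters from the string before using it. The sketch below is only an illustration and is not part of the original workflow: the helper name `clean_ocr_text` and its `sep` parameter are hypothetical, and it relies on the Python standard library alone.

```python
import unicodedata

def clean_ocr_text(text, sep=''):
    """Join the non-empty lines of an OCR string, dropping control characters.

    Hypothetical helper: control characters (Unicode category "Cc"), such as
    the form feed that Tesseract appends to its output, are removed.
    """
    cleaned_lines = []
    for line in text.splitlines():
        # keep only the characters that are not control characters
        visible = ''.join(ch for ch in line if unicodedata.category(ch) != 'Cc')
        visible = visible.strip()
        if visible:
            cleaned_lines.append(visible)
    return sep.join(cleaned_lines)

# a string shaped like the Tesseract output seen above (made-up content)
print(clean_ocr_text('John\n\x0c'))
```

With the default `sep=''` the lines are concatenated exactly as in the loop above; passing `sep=' '` would keep a space between them.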

In [None]:
df_text=[]

# get data regarding the text as a dataframe 
# append each dataframe to a list
for roi in rois:
  rgb = cv2.cvtColor(roi, cv2.COLOR_BGR2RGB)
  df_text.append(pytesseract.image_to_data(rgb, output_type='data.frame')) #Output.DICT

# show one output as example
df_text[0]

The `image_to_data` method returns a handy data frame with more information (such as the position of the box surrounding each word, or the detection confidence). Most importantly, there are fewer strange characters to deal with. From there it is easy to extract the information we need.
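
The confidence values can also be used to discard unreliable words. Here is a minimal sketch, assuming the output was requested as a dictionary (`output_type=Output.DICT`) whose `text` and `conf` entries are parallel lists. The function name `confident_words`, the threshold of 60 and the sample values are made up for illustration.

```python
def confident_words(ocr, min_conf=60.0):
    """Return the recognised words whose confidence is at least min_conf."""
    words = []
    for text, conf in zip(ocr['text'], ocr['conf']):
        # boxes that contain no word carry an empty string and a confidence of -1
        if text.strip() and float(conf) >= min_conf:
            words.append(text.strip())
    return words

# made-up sample in the shape of a dictionary output
sample = {'text': ['', 'John', 'Doe', '~'],
          'conf': ['-1', '96', '91', '12']}
print(confident_words(sample))
```

The right threshold depends on the documents at hand, so it is worth checking a few outputs by eye before fixing a value.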

In [None]:
foldername=''
for df in df_text:
    for line in df['text'].dropna().to_list():
      if line: foldername+=line

print('\n[INFO] Extracted Folder Name: {}\n'.format(foldername))

### _**Read an Input Invoice**_
We will now repeat the steps for an input invoice.

In [None]:
#read the image from disk
filename = os.path.join(os.getcwd(),'invoice_ticket.jpg')
template = cv2.imread(filename)

# store a copy of the original image
orig = template.copy()

#show the input image
draw(cv2.cvtColor(template, cv2.COLOR_BGR2RGB))

### _**Extract Text from RoI**_

In [None]:
# define the coordinates of different rois
x_start, x_end = 130, 775
y_start, y_end = 440, 600

# select the image region  
roi = template[y_start:y_end,x_start:x_end]

# extract text as a dataframe
rgb = cv2.cvtColor(roi, cv2.COLOR_BGR2RGB)
draw(rgb,(25,7))
df_text = pytesseract.image_to_data(rgb, output_type='data.frame') #pytesseract.image_to_string(rgb)

df_text

In [None]:
# get a list of strings detected
# find elements between 'Name:' and 'Address:'
# construct the foldername string by joining the resulting list
text_list=df_text['text'].dropna().to_list()
idx_name = text_list.index('Name:')+1
idx_address = text_list.index('Address:')
foldername=''.join(text_list[idx_name:idx_address])

print('\n[INFO] Extracted Folder Name: {}\n'.format(foldername))

### _**Move the invoice to another folder**_

We can now move the invoice to the corresponding folder, adding one more feature: if we cannot find a match between the receiver of the invoice and our folders, we move it to an 'Unmatched' folder. We can also add more actions like sending a notification email or a message when there is no match. 

In this case, a human enters the loop by checking the unmatched invoices and whether the algorithm has failed to detect the correct string. They report the error to IT, which will investigate further and improve the algorithm (if possible).

In this way, a positive data loop closes itself, possibly leading to a better model as time passes.
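
The routing rule described above can be sketched as a small function. This is an illustrative sketch under stated assumptions, not the notebook's own implementation: the name `route_document` and the file and folder names in the usage example are hypothetical, and the demo runs in a throw-away temporary directory.

```python
import shutil
import tempfile
from pathlib import Path

def route_document(src, foldername, root):
    """Copy src into root/foldername if that folder exists, else into root/Unmatched."""
    root = Path(root)
    target = root / foldername
    # an empty name or a missing folder both count as "no match"
    if not foldername or not target.is_dir():
        target = root / 'Unmatched'
        target.mkdir(exist_ok=True)
    return Path(shutil.copy(src, target))

# usage on a temporary directory with one known client folder
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'JohnDoe').mkdir()
    invoice = root / 'invoice.jpg'
    invoice.write_bytes(b'fake image bytes')
    print(route_document(invoice, 'JohnDoe', root).parent.name)  # known client
    print(route_document(invoice, 'JaneRoe', root).parent.name)  # unknown client
```

Returning the destination path makes it easy to plug in a notification step whenever the parent folder turns out to be `Unmatched`.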

In [None]:
# create an Unmatched directory if it doesn't exist
if os.path.isdir(os.path.join(os.getcwd(),'Unmatched')):
  print('\n[INFO] Unmatched directory already exists\n')
else:
  os.mkdir('Unmatched')

# find all the directories in the current folder and fill a list
a_dir = os.getcwd()
dirlist= [name for name in os.listdir(a_dir) if os.path.isdir(os.path.join(a_dir, name))]

# if the foldername is in the list we move the invoice there
# otherwise we move it to the Unmatched folder

destination = foldername if foldername in dirlist else 'Unmatched'

try:
  shutil.copy(filename, os.path.join(os.getcwd(),destination))
  print("File copied successfully.")

# If source and destination are same
except shutil.SameFileError:
  print("Source and destination represent the same file.")

# If there is any permission issue
except PermissionError:
  print("Permission denied.")

# For other errors
except Exception:
  print("Error occurred while copying file.")

If you now browse to the directory (using the File icon in the right menu), you will see that the invoice has been copied to the correct folder. Well done!

##**_Exercise_**

<img src="https://drive.google.com/uc?id=10tpXK7FwcLv4NEc9zAyKNDynb-wCVYlb" width="50%">

##_Extract Other Fields in the Invoice or the Transaction Ticket_
Use the invoice or the transaction ticket already downloaded and the methods described above to extract other relevant fields.

In [None]:
# Use this cell to write your code

<img src="https://drive.google.com/uc?id=1xFbyKxKkx-ljh8n8Y93NLrvG_Y-c_yz5" width="25%">

#**_Difficulty: Novice_**
#_Scanned Document with fixed position items_

#**_Exercise (Optional)_**
Please read and run the cells in this section to see what changes when we apply the same methodology of RoI selection to scanned documents.

We are now going to read a document that was scanned using CamScanner, the famous Android app for document scanning. We repeat the steps applied during **Case 1**. Since the position of the items is still fixed, the methodology of defining RoIs is exactly the same. There will be some differences though, and we will analyse them.

In [None]:
#read the image from disk
filename = os.path.join(os.getcwd(),'invoice_scanned.jpg')
template = cv2.imread(filename)

#show the input image
draw(cv2.cvtColor(template, cv2.COLOR_BGR2RGB))

**Very important**: the size of the input image may vary depending on the software/hardware used for scanning the document. In this case, the x,y coordinates of the original RoIs must be scaled (down or up) to match the image size, otherwise we will end up selecting a different region, or even get an error from OpenCV!

Thus, we can calculate scaling factors for the width and the height of the input RoIs:

$f_{w}= \frac{w}{w_{original}} $

$f_{h}= \frac{h}{h_{original}} $

where $w,h$ and $w_{original},h_{original}$ are the width and the height of the input image and of the original image, respectively.
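
As a quick worked example of these formulas, the sketch below scales a RoI in plain Python. The helper name `scale_roi` and the image sizes are made up for illustration; the RoI coordinates are the ones used for the invoice above.

```python
def scale_roi(roi, original_size, new_size):
    """Scale (x_start, y_start, x_end, y_end) from the original image size to a new one.

    Both sizes are given as (width, height).
    """
    w_orig, h_orig = original_size
    w, h = new_size
    fw, fh = w / w_orig, h / h_orig
    x_start, y_start, x_end, y_end = roi
    return (int(x_start * fw), int(y_start * fh), int(x_end * fw), int(y_end * fh))

# a 1600 x 2400 template scanned at exactly half the resolution (made-up sizes)
print(scale_roi((130, 440, 775, 600), (1600, 2400), (800, 1200)))
# prints (65, 220, 387, 300)
```

Note that `int()` truncates, so a scaled RoI can be off by one pixel. That is usually harmless for text extraction.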

In [None]:

# calculate the resizing factor for the rois
h_orig, w_orig, c_orig = orig.shape 
h, w, c = template.shape 

fh = h/h_orig
fw = w/w_orig  

# define the coordinates of different rois
x_start, x_end = int(130*fw), int(775*fw)
y_start, y_end = int(440*fh), int(600*fh)

# select the image region  
roi = template[y_start:y_end,x_start:x_end]

# extract text as a dataframe
rgb = cv2.cvtColor(roi, cv2.COLOR_BGR2RGB)
draw(rgb,(25,7))
df_text = pytesseract.image_to_data(rgb, output_type='data.frame') #pytesseract.image_to_string(rgb)

# get a list of strings detected
# find elements between 'Name:' and 'Address:'
# construct the foldername string by joining the resulting list
text_list=df_text['text'].dropna().to_list()
idx_name = text_list.index('Name:')+1
idx_address = text_list.index('Address:')
foldername=''.join(text_list[idx_name:idx_address])

print('\n[INFO] Extracted Folder Name: {}\n'.format(foldername))

As you can see, there are two things to notice:


1.   The RoI is no longer exactly centered
2.   The characters are way more blurred than in the original image

The blur is not a problem in this particular case: Tesseract can still find the information we were looking for. Unfortunately, not all cases are equal, and there can be situations (documents of poorer quality) where the algorithm fails, as we will see soon.

Also notice that, if we were looking for the email field (i.e. to send an email to the receiver), we would have failed. We would need to find a way to center the content of the RoI.



<img src="https://drive.google.com/uc?id=1sjHNgjLOC6d5aErNWPnO4BNPQn6SXSxQ" width="25%">

#**Case 3 - _Difficulty: Master_**
#_Documents taken from camera: create a homemade scanner_

Sometimes the documents are not scanned at all, but are pictures taken with a camera. In this case, it is impossible to apply the RoI methodology out of the box, because the sheet of paper will never be in the right position to extract a RoI. We need to process the image in order to align it to the original template. We therefore need to:
 

1.   Detect the contour of the paper in the image
2.   Project the contour onto a new image with the correct alignment

The procedure takes some work, but the logical steps will become clear in the next sections.

##**_Utilities Functions_**

###**_Four Point Transform_**

Given an input image, a target image and a set of four starting points, orders the input points clockwise starting from the top-left one (top-left, top-right, bottom-right, bottom-left). It then calculates the perspective transform start $\rightarrow$ target and applies it to the input image.
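
To see why sums and differences are enough to order the corners, here is a worked example in plain Python. It is a sketch for illustration only: the name `order_corners` and the coordinates are made up, and it assumes a roughly upright quadrilateral.

```python
def order_corners(points):
    """Order four (x, y) corners as top-left, top-right, bottom-right, bottom-left."""
    top_left = min(points, key=lambda p: p[0] + p[1])      # smallest x + y
    bottom_right = max(points, key=lambda p: p[0] + p[1])  # largest x + y
    top_right = min(points, key=lambda p: p[1] - p[0])     # smallest y - x
    bottom_left = max(points, key=lambda p: p[1] - p[0])   # largest y - x
    return [top_left, top_right, bottom_right, bottom_left]

# four unordered corners of a slightly tilted sheet (made-up coordinates)
corners = [(105, 200), (10, 10), (8, 190), (100, 12)]
print(order_corners(corners))
# prints [(10, 10), (100, 12), (105, 200), (8, 190)]
```

The rule can break down for strongly rotated sheets, where two corners may have nearly the same sum or difference.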

In [None]:
def order_points(pts):
  # initialize a list of coordinates that will be ordered
  rect = np.zeros((4, 2), dtype = "float32")
  
  # the top-left point will have the smallest sum, whereas
  # the bottom-right point will have the largest sum
  s = pts.sum(axis = 1)
  rect[0] = pts[np.argmin(s)]
  rect[2] = pts[np.argmax(s)]
  
  # now, compute the difference between the points, the
  # top-right point will have the smallest difference,
  # whereas the bottom-left will have the largest difference
  diff = np.diff(pts, axis = 1)
  rect[1] = pts[np.argmin(diff)]
  rect[3] = pts[np.argmax(diff)]
  
  # return the ordered coordinates
  return rect

def four_point_transform(image, target, pts):
  # obtain a consistent order of the points and unpack them
  # individually
  rect = order_points(pts)
  (tl, tr, br, bl) = rect
  
  # use the target image shape as destination point of the transformation
  h, w, c = target.shape 
  dst = np.array([[0, 0],[w, 0],[w, h],[0, h]], dtype = "float32")
  
  # compute the perspective transform matrix and then apply it
  M = cv2.getPerspectiveTransform(rect, dst)
  warped = cv2.warpPerspective(image, M, (w, h))
  # return the warped image
  return warped

###**_Edge Detection_**

**`auto_canny`** function: a thin wrapper around Canny that takes just one parameter ($\sigma$) as input. It calculates the median value of the pixel intensities ($\mu$) and calls the Canny algorithm with threshold boundaries $(1\pm\sigma)\,\mu$, clipped to the range $[0, 255]$.

**`edge_detection`** function: takes an input image, converts it to grayscale, blurs it and calls the `auto_canny` function. It grabs the five contours with the largest area found on the Canny mask and approximates each of them with a polygon. It returns a dictionary with the edge mask and the contour of the first polygon that has 4 vertices, or `False` if nothing has been found.
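
The threshold rule can be checked by hand on a few numbers. The sketch below reproduces only the arithmetic, on a plain list of intensities instead of an image. The name `canny_thresholds` and the pixel values are made up for illustration.

```python
from statistics import median

def canny_thresholds(pixels, sigma=0.33):
    """Return the (lower, upper) Canny thresholds derived from the median intensity."""
    v = median(pixels)
    lower = int(max(0, (1.0 - sigma) * v))
    upper = int(min(255, (1.0 + sigma) * v))
    return lower, upper

print(canny_thresholds([90, 100, 110], sigma=0.5))   # (50, 150)
print(canny_thresholds([190, 200, 210], sigma=0.5))  # (100, 255): upper bound is clipped
```

A larger $\sigma$ widens the band between the two thresholds, which is why scanning over $\sigma$ changes how many edges survive.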

In [None]:
def auto_canny(image, sigma=0.33):
    # compute the median of the single channel pixel intensities
    v = np.median(image)

    # apply automatic Canny edge detection using the computed median
    lower = int(max(0, (1.0 - sigma) * v))
    upper = int(min(255, (1.0 + sigma) * v))
    edged = cv2.Canny(image, lower, upper)

    # return the edged image
    return edged


def edge_detection(image,sigma=0.33):
  # convert the image to grayscale, blur it, and find edges
  # in the image
  gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
  gray = cv2.GaussianBlur(gray, (5, 5), 0)
  edged = auto_canny(gray,sigma)
  #edged = cv2.Canny(gray, thresh_low, thresh_high)

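  # find the contours with max area
  # (findContours returns 2 values in OpenCV 4.x and 3 values in OpenCV 3.x)
  cnts = cv2.findContours(edged.copy(), cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
  cnts = cnts[0] if len(cnts) == 2 else cnts[1]
  cnts = sorted(cnts, key = cv2.contourArea, reverse = True)[:5]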

  # init output contour
  screenCnt = None

  # loop over the contours
  for c in cnts:
    # approximate the contour
    peri = cv2.arcLength(c, True)
    approx = cv2.approxPolyDP(c, 0.02 * peri, True)
    # if our approximated contour has four points, then we
    # can assume that we have found our screen
    if len(approx) == 4:
      screenCnt = approx 
      return {'edge_mask':edged,'contour':screenCnt}
    
  return False

###**_Autotune edge detection parameters_**

Loops over a set of hyperparameter values for the Canny algorithm, determined by the inputs `sigma_init`, `sigma_min` and `step`. Calls `edge_detection` for each value of the set. Stops as soon as a candidate contour has been found and returns the result.
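
The idea of scanning values that move progressively away from the starting point can be sketched in plain Python. This is a hypothetical variant rather than the function used in this notebook: `sigma_schedule` and its `n_steps` parameter are assumptions, and it never repeats a value.

```python
def sigma_schedule(sigma_init=0.5, step=0.05, n_steps=3):
    """List sigma_init first, then values alternating above and below it."""
    sigmas = [sigma_init]
    for k in range(1, n_steps + 1):
        offset = k * step
        sigmas.append(round(sigma_init + offset, 10))
        # skip negative values, which would put the lower threshold above the upper one
        if sigma_init - offset >= 0:
            sigmas.append(round(sigma_init - offset, 10))
    return sigmas

print(sigma_schedule(0.5, 0.1, 2))
```

Trying the values closest to `sigma_init` first means the search stops at the smallest change that yields a usable contour.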

In [None]:
def autotune_edge_detection(image,sigma_init=0.5,sigma_min=0,step=0.05):
   
  # create arrays of parameter to scan
  sigmas = [sigma_init]
  for s in np.arange(sigma_min,sigma_init,step):
    sigmas.append(sigma_init+s)
    sigmas.append(sigma_init-s)

  # loop over parameter values
  # if edge detection find a good contour
  # return the result
  for s in sigmas:
    result = edge_detection(image,s)
    if result: return result

  print('I did not find the contours, I am sorry')
  return

  

##**_Read Image_**

In [None]:
#read the original image from disk
filename = os.path.join(os.getcwd(),'invoice_ticket.jpg')
template = cv2.imread(filename)

# Bonus
# upload an image to transform
#uploaded = files.upload()
#uploaded_images = [decode_image(vals) for keys,vals in uploaded.items()]

filename = os.path.join(os.getcwd(),'invoice_test_flash.jpg')
uploaded_image = cv2.imread(filename)

# store a copy of the original image
orig = uploaded_image.copy()

#show the input image
draw(cv2.cvtColor(uploaded_image, cv2.COLOR_BGR2RGB))

As you can see, the image is far from being usable. We need to isolate only the part containing the invoice and try to eliminate the table and the part of my trousers 😅

##**_Perform Edge Detection_**

The logical steps to accomplish this mission are the following:


1.   find all the possible closed shapes in the image
2.   find the shape that has the biggest area and is a rectangle with 4 edges

If we find it, we can safely assume that the shape corresponds to the one of the paper invoice.

The goal is accomplished by the following technique:

*  We apply the [Canny algorithm](https://docs.opencv.org/3.4/da/d5c/tutorial_canny_detector.html) to the image and obtain a mask (a black/white image) of all the edges.
*   We use the [OpenCV find Contours](https://docs.opencv.org/master/d3/dc0/group__imgproc__shape.html#gadf1ad6a0b82947fa1fe3c3d497f260e0) function on the mask to obtain all the closed shapes in the image.
* If we don't find anything we repeat the procedure changing the Canny algorithm parameters, until we find a good candidate

For lack of time, we will not go into the details of the inner mechanisms of the algorithms involved. The functions we will use are defined earlier in the notebook, in the section **_Utilities Functions_**. Feel free to investigate them further if you are interested.






In [None]:
result = autotune_edge_detection(uploaded_image,sigma_init=0.5)
if result:

  # draw edge mask
  # (the mask has a single channel, so it is converted from grayscale)
  edged = result['edge_mask']
  draw(cv2.cvtColor(edged, cv2.COLOR_GRAY2RGB))
  #cv2.imwrite(os.path.join(os.getcwd(),'edged.jpg'),edged)

  # draw contour on the image
  cv2.drawContours(uploaded_image, [result['contour']] , -1, (0, 255, 0), 3)
  draw(cv2.cvtColor(uploaded_image, cv2.COLOR_BGR2RGB))
  #cv2.imwrite(os.path.join(os.getcwd(),'countour.jpg'),uploaded_image)

else:
  print('No contour was found')

##**_Warp the Image_**

We then use the corner points of the rectangle we have just found to compute a perspective transformation matrix that maps the corners of the rectangle to the corners of the template invoice. 

These steps are carried out by the function ```four_point_transform``` in the section **_Utilities Functions_** and by the OpenCV functions: 

```
cv2.getPerspectiveTransform()
cv2.warpPerspective()
  
 ```
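
For reference, a perspective transform is described by a $3\times3$ matrix $M$ acting on homogeneous coordinates. This is the standard formulation, written with symbols ($m_{ij}$, $x'$, $y'$) that do not appear elsewhere in this notebook. A point $(x,y)$ of the camera picture is mapped to the point $(x',y')$ of the aligned image by:

$x' = \frac{m_{11}x + m_{12}y + m_{13}}{m_{31}x + m_{32}y + m_{33}}$

$y' = \frac{m_{21}x + m_{22}y + m_{23}}{m_{31}x + m_{32}y + m_{33}}$

Since $M$ is defined only up to a scale factor, it has 8 free parameters. Each pair of corresponding points provides 2 equations, which is why the 4 corners of the paper and the 4 corners of the template are exactly enough to determine it.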



In [None]:
# apply the four point transform to obtain a top-down
# view of the original image using the copy and the edged mask
warped = four_point_transform(orig,template,result['contour'].reshape(4,2))

# draw warped image
draw(cv2.cvtColor(warped, cv2.COLOR_BGR2RGB))
cv2.imwrite(os.path.join(os.getcwd(),'warped.jpg'),warped)


and bam! 🤯 

The picture taken with the camera is now aligned. From here on we can again apply our well-known technique of text extraction from RoIs, and see what the result is.

##**_Use Tesseract on Selected RoIs_**

In [None]:
def extract_text_from_roi(warped,key='first_name'):
  # define a dictionary with roi coordinates
  # in the following order
  # 1. (top_left_x,top_left_y) 
  # 2. (bottom_right_x,bottom_right_y)

  # determine a scaling factor between the original template
  # and the warped image
  h_orig, w_orig = 2399,1653
  if len(warped.shape) > 2:
    h_warped, w_warped, c_warped = warped.shape 
  else:
    h_warped, w_warped = warped.shape 

  fh = h_warped/h_orig
  fw = w_warped/w_orig  

  # define a dictionary with the position of the rois
  # in the template image
  rois = {'first_name':[(1,485),(1652,655)],
          'email':     [(1,1240),(1652,1419)],
          'to':        [(130,440),(775,600)]} #(1,328),(793,589)]

  # scale the roi to match
  # the size of the warped image
  tlx = rois[key][0][0]*fw
  tly = rois[key][0][1]*fh
  brx = rois[key][1][0]*fw
  bry = rois[key][1][1]*fh

  # select the corresponding rois in the image
  #aligned = cv2.resize(warped, (w_orig,h_orig), interpolation = cv2.INTER_AREA)
  aligned = warped.copy()
  roi = aligned[int(tly):int(bry),int(tlx):int(brx)]

  # draw the roi and extract the text with tesseract
  rgb = cv2.cvtColor(roi, cv2.COLOR_BGR2RGB)
  draw(rgb,(25,7))
  text = pytesseract.image_to_string(rgb)
  for line in text.split("\n"):
      print(line)
  
  return text

text=extract_text_from_roi(warped,key='to')

We are still able to extract the information from the RoI but, as you can see, the quality is getting worse.

<img src="https://drive.google.com/uc?id=1j615H3PrNo15Jz2Zu0H4jn-N6iu5_EIg" width="25%">

#**Case 4 - _Difficulty: Sith Lord_**
#_Document with items in variable position: Detect RoIs using Bounding Box Regression with Deep Learning_

In many situations, the position of the regions we want to inspect in a document can vary. An example is in the invoices we have been analyzing: the height of the table with the purchased products varies each time, since the number of products is not always the same.

In this case we cannot rely on a fixed RoI at all, and we need a more sophisticated technique. Fortunately, we can teach our machine to detect the position of the RoI, even if it changes every time.

How? With Deep Learning, of course!

##**_Import Keras and TensorFlow_**

We will be using the Keras and TensorFlow libraries for the task, so let's import all the relevant stuff.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dropout

from tensorflow.keras.models import Model

from tensorflow.keras.optimizers import SGD
from tensorflow.keras.optimizers import Adam

from tensorflow.keras.callbacks import EarlyStopping

from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array
from tensorflow.keras.preprocessing.image import load_img

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

##**_Read the Annotations_**

To train a model we need to show it the images of the invoices and tell it the position of the RoI in each image. We are going to use a json file that contains this information and that I prepared beforehand using the [VGG Image Annotator](https://www.robots.ox.ac.uk/~vgg/software/via/), a fantastic open source tool for dataset annotation.

The file has all the information we need: the file name and the position of the RoI (expressed as x, y, width and height).
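
Since the rectangle is stored as a corner plus a size, it has to be converted into two corners before it can serve as a training target. A minimal sketch of that conversion, with coordinates scaled by the image size so that they fall in [0, 1]. The name `region_to_box`, the rectangle and the image size are made up for illustration.

```python
def region_to_box(shape, image_width, image_height):
    """Convert an x, y, width, height rectangle into corners scaled to [0, 1]."""
    start_x = shape['x'] / image_width
    start_y = shape['y'] / image_height
    end_x = (shape['x'] + shape['width']) / image_width
    end_y = (shape['y'] + shape['height']) / image_height
    return start_x, start_y, end_x, end_y

# made-up rectangle on a made-up 1000 x 2000 pixel image
shape = {'x': 500, 'y': 1500, 'width': 250, 'height': 100}
print(region_to_box(shape, 1000, 2000))
# prints (0.5, 0.75, 0.75, 0.8)
```

Working with relative coordinates keeps the targets valid after the images are resized for the network.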

In [None]:
# download the annotation file
!curl -L "https://docs.google.com/uc?export=download&id=1491L9DVNPMvZdaQwVej81BO_RRGw4MIy" > annotations.json

# load the contents of the json annotations file
print("\n[INFO] reading json annotations...\n")
annotations_json = os.path.join(os.getcwd(),'annotations.json')
with open(annotations_json) as f:
  annotations_dict = json.load(f)

#print some elements of the json
for i in list(annotations_dict.items())[:3]:
  print(i)

If you look at the 'region_label' attribute you can see that the annotations correspond to the bottom right part of the invoice, where the total amount due is registered. The position of this region in fact varies in each invoice.

##**_Construct the training data_**

We construct the training data by looping over all the filenames and creating 2 arrays:

1.   with all the images
2.   with the coordinates of the Bounding Box



In [None]:
# mount drive folder and import custom modules
# drive.mount('/content/drive', force_remount=False)
# sys.path.insert(0,'/content/drive/MyDrive/Samurai_Workshop')

#dataset_path = os.path.join(sys.path[0],'dataset_jpg','invoice')
#output_path= os.path.join(sys.path[0],"output")

In [None]:
# build path to dataset
dataset_path = os.path.join(os.getcwd(),'AMLD2021','dataset','invoice')

# initialize the lists
data,targets,filenames = [],[],[]

print("\n[INFO] reading images... this may take a while ...\n")

# loop over the keys and values of the json dictionary
for key,val in annotations_dict.items():

  # do not read if there is no region registered
  if len(val['regions']) == 0: continue

  # read the relevant info from the annotations dictionary
  # filename and x,y coordinates of the bounding box
  filename = val['filename']
  startX = val['regions'][0]['shape_attributes']['x']
  startY = val['regions'][0]['shape_attributes']['y']
  endX = startX + val['regions'][0]['shape_attributes']['width'] #startx + width
  endY = startY + val['regions'][0]['shape_attributes']['height'] #starty + height

  # build the path to the image and read it
  image_path = os.path.join(dataset_path,filename)
  image = cv2.imread(image_path)

  # skip if there was a problem loading the image
  if image is None: continue

  # scale the bounding box coordinates relative to the spatial
  # dimensions of the input image
  (h, w) = image.shape[:2]
  startX = float(startX) / w
  startY = float(startY) / h
  endX = float(endX) / w
  endY = float(endY) / h
  
  # load the image and preprocess it
  # scale to 224 x 224 the input size for VGG16
  image = load_img(image_path, target_size=(224, 224))
  image = img_to_array(image)
	
  # update our list of data, targets, and filenames
  data.append(image)
  targets.append((startX, startY, endX, endY))
  filenames.append(filename)

print("\n[INFO] Read {} total number of images ...\n".format(len(data)))

### **_Train/Test Split_**

We normalize the images (scaling all pixels to the range [0,1]) and use the Scikit-Learn ```train_test_split()``` function to set aside 10% of the dataset for testing.



In [None]:
# convert the data and targets to NumPy arrays, scaling the input
# pixel intensities from the range [0, 255] to [0, 1]
data = np.array(data, dtype="float32") / 255.0
targets = np.array(targets, dtype="float32")

# partition the data into training and testing splits using 90% of
# the data for training and the remaining 10% for testing
split = train_test_split(data, targets, filenames, test_size=0.10,random_state=42)

# unpack the data split
(trainImages, testImages) = split[:2]
(trainTargets, testTargets) = split[2:4]
(trainFilenames, testFilenames) = split[4:]

##**_Define Model and Compile_**

We are going to use a transfer learning technique to train our model. This means that we will download from Keras a neural network (the [VGG16](https://neurohive.io/en/popular-networks/vgg16/)) that has already been trained on a huge image dataset ([Imagenet](https://en.wikipedia.org/wiki/ImageNet) in this case).

In order to keep the feature extraction power of the inner layers, we will keep the convolutional layers and replace only the output Fully Connected layers. The network will have a 4-neuron output layer, since it has to predict 4 numbers (the coordinates of the bounding box).

We use the handy Keras functional API to do the magic. 🧙‍♂️

In [None]:
# initialize the learning rate, the number of epochs and the batch size
INIT_LR = 1e-5
EPOCHS = 30
BS = 32

# load the VGG16 network, ensuring the head FC layer sets are left
# off
baseModel = VGG16(weights="imagenet", include_top=False, input_tensor=Input(shape=(224, 224, 3)))

# freeze all VGG layers
baseModel.trainable = False

# flatten the max-pooling output of VGG
flatten = baseModel.output
flatten = Flatten()(flatten)

# construct a fully-connected layer header to output the predicted
# bounding box coordinates
bboxHead = Dense(128, activation="relu")(flatten)
bboxHead = Dense(64, activation="relu")(bboxHead)
bboxHead = Dense(32, activation="relu")(bboxHead)
bboxHead = Dense(4, activation="sigmoid")(bboxHead)

# place the head FC model on top of the base model (this will become
# the actual model we will train)
model = Model(inputs=baseModel.input, outputs=bboxHead)

# compiling the model
# we use the mean squared error 
# a very common loss for regression problems
print("[INFO] compiling model...")
opt = Adam(learning_rate=INIT_LR)
model.compile(loss="mse", optimizer=opt)
print(model.summary())


##**_Train Model_**

In [None]:
# train the network for bounding box regression
print("\n[INFO] training bounding box regressor...\n")
H = model.fit(
    trainImages, trainTargets,
    validation_data=(testImages, testTargets),
    batch_size=BS,
    epochs=EPOCHS,
    verbose=1)

# plot the model training history
print("\n[INFO] Drawing the Training Curves...\n\n")
N = EPOCHS
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, N), H.history["loss"], label="train_loss")
plt.plot(np.arange(0, N), H.history["val_loss"], label="val_loss")
plt.title("Bounding Box Regression Loss (Training and Validation)")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.legend(loc="lower left")


A nice and smooth training curve after very few epochs: the validation loss follows the training loss, so there is no sign of overfitting.
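
If you prefer a number to an eyeball check, the two curves can be compared directly. The snippet below is a minimal sketch, not a rigorous test: the helper names and the choice of looking at the last 5 epochs are assumptions made for this example, and the loss values are toy numbers.

```python
import numpy as np

def generalization_gap(train_loss, val_loss, last_n=5):
    """Mean (val - train) loss over the last `last_n` epochs."""
    train = np.asarray(train_loss, dtype=float)[-last_n:]
    val = np.asarray(val_loss, dtype=float)[-last_n:]
    return float(np.mean(val - train))

def val_loss_is_rising(val_loss, last_n=5):
    """True when a straight-line fit to the last validation losses slopes upward."""
    tail = np.asarray(val_loss, dtype=float)[-last_n:]
    slope = np.polyfit(np.arange(len(tail)), tail, 1)[0]
    return bool(slope > 0)

# hypothetical loss values, only to show the calls
toy_train = [0.050, 0.030, 0.020, 0.015, 0.012, 0.010]
toy_val = [0.055, 0.034, 0.024, 0.019, 0.016, 0.014]

print("gap:", generalization_gap(toy_train, toy_val))
print("validation loss rising:", val_loss_is_rising(toy_val))
```

With the real training run you would pass `H.history["loss"]` and `H.history["val_loss"]`. A small gap together with a validation loss that is still falling is what "no sign of overfitting" looks like in numbers.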

We are now ready to use this model and make some predictions to see if we can extract the RoI we want from some invoices.

##**_Make predictions_**

Always remember: the input image for the prediction must have exactly the same size and format as the ones used for training. This is why it is extremely important to repeat the pre-processing steps before predicting our RoI position.
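
A cheap way to protect yourself against this kind of mistake is to check the array before calling `model.predict`. The function below is a sketch under the assumption that the training images were 224x224 RGB arrays scaled to [0, 1], as in the prediction cell; `check_model_input` is a hypothetical helper, not part of Keras.

```python
import numpy as np

def check_model_input(batch, size=(224, 224)):
    """Raise ValueError if `batch` does not look like the training data."""
    batch = np.asarray(batch)
    expected = (size[0], size[1], 3)
    if batch.ndim != 4 or batch.shape[1:] != expected:
        raise ValueError(
            f"expected shape (N, {size[0]}, {size[1]}, 3), got {batch.shape}")
    if not np.issubdtype(batch.dtype, np.floating):
        raise ValueError(f"expected a float array, got dtype {batch.dtype}")
    if batch.min() < 0.0 or batch.max() > 1.0:
        raise ValueError("pixel values must be scaled to [0, 1]")
    return batch

# toy example: a fake 224x224 RGB image with raw 0-255 pixel values
raw = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

# same steps as in the notebook: scale to [0, 1], add the batch dimension
ready = np.expand_dims(raw.astype("float32") / 255.0, axis=0)
check_model_input(ready)  # passes silently
```

Forgetting the division by 255 or the batch dimension now fails loudly instead of silently producing a meaningless bounding box.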

In [None]:
# download the test image with curl and subprocess
#
# list of available images:
# 
# https://docs.google.com/uc?export=download&id=1YufQe63RzA04Up9mHhGNf2VpRhztcZoG
# https://docs.google.com/uc?export=download&id=1oUy07v1t-R5h0tAJKGpxzhQ7ALrCHlmx
# https://docs.google.com/uc?export=download&id=1xtNq_HB0u8-oD9NeOjLrs6zlRVOG9ZFp
# https://docs.google.com/uc?export=download&id=1YufQe63RzA04Up9mHhGNf2VpRhztcZoG

urls = ['https://docs.google.com/uc?export=download&id=1YufQe63RzA04Up9mHhGNf2VpRhztcZoG',
        'https://docs.google.com/uc?export=download&id=1oUy07v1t-R5h0tAJKGpxzhQ7ALrCHlmx',
        'https://docs.google.com/uc?export=download&id=1xtNq_HB0u8-oD9NeOjLrs6zlRVOG9ZFp',
        'https://docs.google.com/uc?export=download&id=1YufQe63RzA04Up9mHhGNf2VpRhztcZoG']

url = urls[1]
outname = os.path.join(os.getcwd(),'test.jpg')
subprocess.run(['curl','-L',url,'-o',outname])

# load the input image (in Keras format) from disk and preprocess
# it, scaling the pixel intensities to the range [0, 1]
# adding the batch dimension
image = load_img(outname, target_size=(224, 224))
image = img_to_array(image) / 255.0
image = np.expand_dims(image, axis=0)

# make bounding box predictions on the input image
preds = model.predict(image)[0]
(startX, startY, endX, endY) = preds
	
# load the input image and grab its dimensions
test = cv2.imread(outname)
(h, w) = test.shape[:2]
	
# scale the predicted bounding box coordinates based on the image
# dimensions
startX = int(startX * w)
startY = int(startY * h)
endX = int(endX * w)
endY = int(endY * h)
	
#  draw the predicted bounding box on the image
cv2.rectangle(test, (startX, startY), (endX, endY),(0, 255, 0), 2)
plt.figure(figsize=(7,25))
plt.imshow(cv2.cvtColor(test, cv2.COLOR_BGR2RGB))

As you can see, the model has correctly spotted the region of the invoice we were looking for. From here we can again apply the method for extracting text from RoIs that we discussed before, but for the moment ... one last battle before leaving!
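
Before the battle, a small sketch of how the predicted box could be turned into a crop, ready for the text extraction step. The helper names `to_pixel_box` and `crop_box` are made up for this example, and the prediction values are toy numbers. Nothing in the model forces `endX` to be larger than `startX`, so the sketch reorders the corners; the clipping is only a safety net.

```python
import numpy as np

def to_pixel_box(preds, width, height):
    """Turn normalised (startX, startY, endX, endY) into integer pixel coordinates."""
    x1, y1, x2, y2 = np.clip(np.asarray(preds, dtype=float), 0.0, 1.0)
    # make sure the "start" corner really is the top-left one
    x1, x2 = min(x1, x2), max(x1, x2)
    y1, y2 = min(y1, y2), max(y1, y2)
    return (int(x1 * width), int(y1 * height),
            int(x2 * width), int(y2 * height))

def crop_box(image, box):
    """Crop an H x W x C array to the box (rows are y, columns are x)."""
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]

# toy example: hypothetical prediction on a blank 1000 x 800 (h x w) image
toy_image = np.zeros((1000, 800, 3), dtype=np.uint8)
box = to_pixel_box([0.125, 0.25, 0.875, 0.5], width=800, height=1000)
roi = crop_box(toy_image, box)

print(box)        # (100, 250, 700, 500)
print(roi.shape)  # (250, 600, 3)
```

In the notebook you would call `to_pixel_box(preds, w, h)` and crop `test`; the resulting `roi` array is what you would hand to Tesseract.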

#**_Exercise_**
##_**The Last battle: train your own model**_ 

---

<br>


<img src="https://drive.google.com/uc?id=1SbIBJfPHw5mIDL7T0QwtN6iQchm_Lm6G" width="50%">

---

<br>

Please follow the instructions:

#### **1.** _Download the VGG Image Annotator (VIA) zip file from the [tools folder](https://drive.google.com/file/d/1Wm1pOwJJgfWY78gvwsiDFHwf-r14FSi3/view?usp=sharing) or from the [official site](https://www.robots.ox.ac.uk/~vgg/software/via/downloads/via-2.0.11.zip), unzip it and open the via.html file with your browser_.

#### **2.** _Download the dataset from [here](https://drive.google.com/drive/folders/10xe9yFCkRSPcRfJe3l0M57ltYpSZZg4z?usp=sharing)_.

#### **3.** _Use VIA to_: 
  * Upload the images from _300 onwards.
  * Draw the bounding box around the table of products in the images from _300 onwards (you can annotate as many as you want, up to the end of the dataset at _435; I suggest a maximum of 50 images for timing reasons).
  * Export the annotations in .json format.

  * (TIP: read the 'help' section in the VIA software)

#### **4.** _Run the cell below and upload the .json annotations you just made: the cell will merge yours with the first _300 that I annotated last night instead of watching a movie._ 😝


In [None]:
# download annotation json file
url = "https://docs.google.com/uc?export=download&id=1tfGeWrTiFgPsq21ULWcvavEwlBue-6E9"
outname = os.path.join(os.getcwd(),'annotations_invoice_table_0_299.json')
subprocess.run(['curl','-L',url,'-o',outname])

# read first downloaded annotations json from disk 
# as dictionary 
with open(outname) as f:
  annotations_dict = json.load(f)

# upload your own json annotations from your local machine
# and read them as a dictionary
uploaded = files.upload()
data = next(iter(uploaded.values()))
annotations_dict1 = json.loads(data.decode())

# copy every annotated entry (at least one region) of the
# uploaded dictionary into the first dictionary
for key,val in annotations_dict1.items():
  if len(val['regions']) > 0:
    annotations_dict[key] = val

# clean the merged dictionary
# eliminate the elements with no region
keys = [key for key,val in annotations_dict.items() if len(val['regions'])==0]
for k in keys: del annotations_dict[k]
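
The merge logic of the cell above can also be written as a small function and tried on toy data before touching the real files. This is only a sketch: `merge_annotations` is a hypothetical helper, and the two miniature dictionaries are invented for the example (real VIA keys and entries carry more fields).

```python
def merge_annotations(base, extra):
    """Return a new dict: `base` updated with the annotated entries of
    `extra`, with every entry that has no region removed."""
    merged = dict(base)
    for key, val in extra.items():
        if len(val["regions"]) > 0:
            merged[key] = val
    return {k: v for k, v in merged.items() if len(v["regions"]) > 0}

# hypothetical miniature annotation files
rect = {"shape_attributes": {"name": "rect"}}
base = {
    "a.jpg": {"filename": "a.jpg", "regions": [rect]},
    "b.jpg": {"filename": "b.jpg", "regions": []},
}
extra = {
    "b.jpg": {"filename": "b.jpg", "regions": [rect]},
    "c.jpg": {"filename": "c.jpg", "regions": []},
}

merged = merge_annotations(base, extra)
print(sorted(merged))  # ['a.jpg', 'b.jpg']
```

Unlike the cell above, this version leaves both input dictionaries untouched, which makes it easier to re-run if the upload went wrong.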

#### **5.** _Try to train a bounding box regression model to find the table position using transfer learning._
#### **6.** _Make some predictions on the test images used before._
#### **7.** _Try to extract some information using Tesseract._
<br>

### **N.B.** Copying and pasting the cells from **Construct the training data** to **Make predictions** will surely do the job!
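
As a reminder of what the training targets look like, here is a hedged sketch that turns one VIA entry into the four normalised numbers the network predicts. It assumes the regions were drawn as rectangles, so that VIA stores `x`, `y`, `width` and `height` under `shape_attributes`, and that the targets are the corner coordinates divided by the image size (consistent with the rescaling done at prediction time). The function name and the sample entry are hypothetical.

```python
def via_rect_to_target(entry, img_width, img_height):
    """Return (startX, startY, endX, endY) scaled to [0, 1] for the first region."""
    shape = entry["regions"][0]["shape_attributes"]
    if shape.get("name") != "rect":
        raise ValueError("only rectangular regions are handled in this sketch")
    start_x = shape["x"] / img_width
    start_y = shape["y"] / img_height
    end_x = (shape["x"] + shape["width"]) / img_width
    end_y = (shape["y"] + shape["height"]) / img_height
    return (start_x, start_y, end_x, end_y)

# hypothetical entry for an 800 x 1000 (w x h) image
entry = {
    "filename": "invoice_300.jpg",
    "regions": [{
        "shape_attributes": {"name": "rect", "x": 100, "y": 400,
                             "width": 600, "height": 300},
        "region_attributes": {},
    }],
}

target = via_rect_to_target(entry, img_width=800, img_height=1000)
print(target)
```

The image width and height have to come from the image file itself (for example from `cv2.imread(...).shape`), since they are not part of the region.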

In [None]:
### Start your exercise from here!

#**_Congratulations!_**

<img src="https://drive.google.com/uc?id=1n8MRZfFFlS8NNjSEV1aysp2__qabAzbw" width="50%">

#**_You have reached the end of the Module_**