# Tesseract

Tesseract is an open-source optical character recognition (OCR) engine for various operating systems. It is free software, released under the Apache License. In 2006, Tesseract was considered one of the most accurate open-source OCR engines available.

In [1]:
import os
import re
import cv2
import glob
import pytesseract
import numpy as np
import pandas as pd
from datetime import date
from pytesseract import Output
from difflib import get_close_matches

pytesseract.pytesseract.tesseract_cmd=r"<local_path>/Tesseract-OCR/tesseract.exe"

imagelink = "<local_path>/Google-Tesseract/Images/"

Before we jump into Tesseract, let us look at some common image manipulations that come in handy when extracting text from an image.

## OPERATIONS ON IMAGES

In [2]:
# DISPLAY IMAGE

link = imagelink + "example_1.jpg"

image = cv2.imread(link, 0)

cv2.imshow("Image Displayed", image)
cv2.waitKey(0)

-1

In [3]:
# RESIZE IMAGE

link = imagelink + "example_1.jpg"

image = cv2.imread(link, 0)
image = cv2.resize(image, (500, 700))

cv2.imshow("Image Resized", image)
cv2.waitKey(0)

-1

In [4]:
# CROPPED IMAGE

link = imagelink + "example_1.jpg"

image = cv2.imread(link, 0)
image = image[50:, :200]

cv2.imshow("Image Cropped", image)
cv2.waitKey(0)

-1

In [5]:
# ROTATE IMAGE

link = imagelink + "example_1.jpg"

image = cv2.imread(link, 0)
# The rotation flag lives directly on the cv2 module; cv2.cv2 is not available in recent OpenCV builds
image = cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE)

cv2.imshow("Image Rotated", image)
cv2.waitKey(0)

-1

In [6]:
# TRANSLATED IMAGE

link = imagelink + "example_1.jpg"

image = cv2.imread(link, 0)

height, width = image.shape[:2]

tx, ty = width / 4, height / 4

translation_matrix = np.array([[1, 0, tx],[0, 1, ty]], dtype=np.float32)

image = cv2.warpAffine(src=image, M=translation_matrix, dsize=(width, height))

cv2.imshow("Image Translated", image)
cv2.waitKey(0)

-1
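
To make the translation matrix less abstract, the sketch below applies the same 2x3 matrix to individual pixel coordinates in plain Python. The helper `apply_affine` and the 400x200 image size are assumptions made for illustration; they are not part of the notebook or of OpenCV.

```python
def apply_affine(matrix, x, y):
    """Map a source pixel (x, y) to its destination under a 2x3 affine matrix."""
    (a, b, tx), (c, d, ty) = matrix
    return a * x + b * y + tx, c * x + d * y + ty


# Hypothetical image size, used only for this illustration.
width, height = 400, 200
tx, ty = width / 4, height / 4

# Same layout as translation_matrix in the cell above.
M = [[1, 0, tx], [0, 1, ty]]

print(apply_affine(M, 0, 0))    # (100.0, 50.0)
print(apply_affine(M, 10, 20))  # (110.0, 70.0)
```

Every pixel moves right by `tx` and down by `ty`. Because `dsize` keeps the original width and height, content pushed past the right and bottom edges is cut off.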

## TEXT EXTRACTION

## - Simple Extraction

In [7]:
link = imagelink + "example_1.jpg"

image = cv2.imread(link, 0)

data = pytesseract.image_to_string(image)

print(data)

A simple image with text to demonstrate
extraction of text using python and tesseract

“Two things are infinite: the universe and human stupidity; and I'm not sure
about the universe.” - Albert Einstein



In [8]:
link = imagelink + "example_2.jpg"

image = cv2.imread(link, 0)

data = pytesseract.image_to_string(image)

print(data)

“You've gotta dance like there's nobody watching,
Love like you'll never be hurt,
Sing like there's nobody listening,
And live like it’s heaven on earth.”
— William w. Pu rkey



## - Text Extraction With Manipulations

In [9]:
link = imagelink + "example_3.jpg"

image = cv2.imread(link,0)
image = cv2.resize(image, (500, 700))
image = image[25:300, :]

thresh = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

Data = pytesseract.image_to_string(thresh, lang='eng',config='--psm 6')

print("\n{}".format(Data))

print("-"*20)
print("\nWe notice that views on the screenshot are visible after a special character '©'.\nTherefore we use regex to extract the number of views.")

Views = re.findall(r'© .*',Data)[0]
Views = [int(i) for i in Views.split() if i.isdigit()][0]

print("-"*20)
print("\nExample 3 has {} views.".format(Views))


it © 13 ~*~ wu
C) Kimmy Long
C) Le Fevre Taylor

--------------------

We notice that views on the screenshot are visible after a special character '©'.
Therefore we use regex to extract the number of views.
--------------------

Example 3 has 13 views.
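
The two-step lookup above raises an `IndexError` when the '©' marker is missing from the OCR text. A minimal sketch of a more defensive variant follows. The function name `extract_views` is hypothetical, and unlike the cell above it only accepts a number that comes directly after the marker.

```python
import re


def extract_views(text, marker="©"):
    """Return the first integer that directly follows `marker`, or None."""
    # Allow optional whitespace after the marker and commas inside the number.
    match = re.search(re.escape(marker) + r"\s*(\d[\d,]*)", text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# OCR output copied from the cell above.
ocr_text = "it © 13 ~*~ wu\nC) Kimmy Long\nC) Le Fevre Taylor"

print(extract_views(ocr_text))          # 13
print(extract_views("no marker here"))  # None
```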


In [10]:
link = imagelink + "example_4.jpg"

image = cv2.imread(link,0)
image = cv2.resize(image, (500, 700))
image = image[25:300, :]

# The image was loaded as grayscale (flag 0), so convert from GRAY rather than BGR
rgb = cv2.cvtColor(image, cv2.COLOR_GRAY2RGB)
results = pytesseract.image_to_data(rgb, output_type=Output.DICT)

Views = [int(i) for i in results["text"] if i.isdigit()][0]

print("{}\n".format(results["text"]))

print("-"*20)
print("\nWe can't automate a process if there is a dependency on visibility for a special character\nThus, we use another method to extract the number of views.")

print("-"*20)
print("\nExample 4 has {} views.".format(Views))

['', '', '', '', '©', '5616', '', '', '', 'rm', '', '', '', '']

--------------------

We can't automate a process if there is a dependency on visibility for a special character
Thus, we use another method to extract the number of views.
--------------------

Example 4 has 5616 views.
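
Taking element `[0]` of the filtered list fails with an `IndexError` when the OCR result contains no numeric token. The sketch below shows one way to handle that case. `first_number` is a hypothetical helper, and the token list is the one printed above.

```python
def first_number(tokens):
    """Return the first token that is a whole number, or None if there is none."""
    for token in tokens:
        digits = token.replace(",", "")  # tolerate thousands separators
        # isdecimal() rejects characters such as superscripts that int() cannot parse.
        if digits.isdecimal():
            return int(digits)
    return None


tokens = ['', '', '', '', '©', '5616', '', '', '', 'rm', '', '', '', '']

print(first_number(tokens))       # 5616
print(first_number(['©', 'rm']))  # None
```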


In [11]:
#Using Review

link = imagelink + "example_5.jpg"

image = cv2.imread(link)

rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
results = pytesseract.image_to_data(rgb, output_type=Output.DICT)

top = results['text'].index("Review")
bottom = results['text'].index("helpful?")

top_cod = results["top"][top]
top_cod = top_cod - round(top_cod/1.5)

bottom_cod = results["top"][bottom] 

image = image[top_cod:bottom_cod, :]

rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
results = pytesseract.image_to_data(rgb, output_type=Output.DICT)

# Locate the rating token (the closest match to "/10"); the review details follow it
rating = get_close_matches("/10", results['text'], cutoff=0.6)[0]
reviewtext = results['text'][results['text'].index(rating)+1:]
reviewindx = reviewtext.index([i for i in reviewtext if (i.isdigit()) and (int(i) >=1900 and int(i) <= date.today().year)][0])

reviewtime = " ".join(reviewtext[:reviewindx+1][-3:])
reviewheading = " ".join([i for i in reviewtext[:reviewindx-2][:-2] if i != ""])
reviewer = reviewtext[:reviewindx-2][-1]
review = " ".join([i for i in reviewtext[reviewindx+1:] if i != ""])

completereview = [reviewtime,reviewer,reviewheading,review]

print(*completereview,sep="\n--------------\n")

cv2.imshow("Image", image)
cv2.waitKey(0)

20 July 2017
--------------
TheLittleSongbird
--------------
Spider-Man with a fresh twist
--------------
Really enjoyed the first two films, both contained great scenes/action, acting and the two best villains of the films. Was mixed on the third film, which wasn't that bad but suffered mainly from bloat, and was not totally sold on the ‘Amazing Spider-Man’ films. Whether ‘Spider-Man: Homecoming’ is the best 'Spider-Man' film ever is debatable, some may prefer the first two films, others may prefer this. To me, it is the best 'Spider-Man' film since the second and on par with the first two. It may not have taken as many risks or had sequences/action as memorable as the first two films, and for more of an origin story it's best to stick with the first two films. For a fresh twist on 'Spider-Man' and the superhero genre, ‘Spider-Man: Homecoming’ (one of Marvel's best to date) more than fits the bill.


-1
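
The exact lookups `index("Review")` and `index("helpful?")` raise a `ValueError` as soon as Tesseract misreads a single character. One possible workaround is to reuse `get_close_matches` for the anchor words too, as sketched below. The helper `find_token`, the 0.8 cutoff, and the sample token list are assumptions for illustration, not real OCR output.

```python
from difflib import get_close_matches


def find_token(tokens, target, cutoff=0.8):
    """Return the index of the token closest to `target`, or None if nothing is close."""
    matches = get_close_matches(target, tokens, n=1, cutoff=cutoff)
    return tokens.index(matches[0]) if matches else None


# Made-up token list with a typical OCR slip: "l" read as "1".
tokens = ["", "Review", "", "8/10", "Great", "film", "",
          "Was", "this", "review", "helpfu1?"]

print(find_token(tokens, "helpful?"))   # 10
print(find_token(tokens, "Permalink"))  # None
```

A lower cutoff tolerates more OCR noise but increases the chance of matching an unrelated word, so the value is worth tuning on real screenshots.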

In [12]:
#Using rating

link = imagelink + "example_6.jpg"

image = cv2.imread(link)

rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
results = pytesseract.image_to_data(rgb, output_type=Output.DICT)

# Index of the rating token (the closest match to "/10")
top = results['text'].index(get_close_matches("/10", results['text'],cutoff=0.6)[0])

bottom = results['text'].index("helpful?")

top_cod = results["top"][top]
top_cod = top_cod - round(top_cod/6)

bottom_cod = results["top"][bottom] 

image = image[top_cod:bottom_cod, :]

rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
results = pytesseract.image_to_data(rgb, output_type=Output.DICT)

# Locate the rating token (the closest match to "/10"); the review details follow it
rating = get_close_matches("/10", results['text'], cutoff=0.6)[0]
reviewtext = results['text'][results['text'].index(rating)+1:]
reviewindx = reviewtext.index([i for i in reviewtext if (i.isdigit()) and (int(i) >=1900 and int(i) <= date.today().year)][0])

reviewtime = " ".join(reviewtext[:reviewindx+1][-3:])
reviewheading = " ".join([i for i in reviewtext[:reviewindx-2][:-2] if i != ""])
reviewer = reviewtext[:reviewindx-2][-1]
review = " ".join([i for i in reviewtext[reviewindx+1:] if i != ""])

completereview = [reviewtime,reviewer,reviewheading,review]

print(*completereview,sep="\n--------------\n")

cv2.imshow("Image", image)
cv2.waitKey(0)

29 September 2017
--------------
SnoopyStyle
--------------
fun comic book fare
--------------
Salvager Adrian Toomes (Michael Keaton) holds a grudge against Tony Stark (Robert Downey Jr.) after his takeover of the Battle of New York cleanup. Toomes kept some of the Chitauri tech to create new weapons. Eight years later after the events of Civil War, Peter Parker (Tom Holland) returns to his school, Midtown School of Science and Technology. He lives with his sought-after aunt May (Marisa Tomei). He has a crush on classmate Liz. His best friend Ned discovers his secret identity Spider-Man. There is also the sarcastic academic teammate Michelle (Zendaya). This is fun. It's got the comic book action. It weaves into the MCU with ease. RDJ has a supporting role which is more than a simple cameo. This definitely has the John Hughes vibe. It's nice light fun in this overarching comics universe. Holland is a great teen Spider-man as he showed in Civil War. The young cast is terrific and Keaton

-1
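
The parsing steps repeated in the cells above can be gathered into one function that works on a plain list of tokens. This is a sketch under stated assumptions, not the notebook's own method: `parse_review` is a hypothetical name, empty tokens are filtered out first (so the slice offsets differ from the cells above), and the layout is assumed to be rating, heading, reviewer, date, body.

```python
from datetime import date
from difflib import get_close_matches


def parse_review(tokens):
    """Split OCR tokens into (date, reviewer, heading, body), or return None."""
    # Drop the empty strings Tesseract emits at block and line boundaries.
    words = [t for t in tokens if t.strip()]

    # The token closest to "/10" is taken to be the rating.
    rating = get_close_matches("/10", words, cutoff=0.6)
    if not rating:
        return None
    words = words[words.index(rating[0]) + 1:]

    # The first plausible year marks the end of the "<day> <month> <year>" date.
    this_year = date.today().year
    year_idx = next(
        (i for i, w in enumerate(words)
         if w.isdigit() and 1900 <= int(w) <= this_year),
        None,
    )
    if year_idx is None or year_idx < 3:
        return None

    review_date = " ".join(words[year_idx - 2:year_idx + 1])
    reviewer = words[year_idx - 3]          # single token just before the date
    heading = " ".join(words[:year_idx - 3])
    body = " ".join(words[year_idx + 1:])
    return review_date, reviewer, heading, body


# Hand-made tokens shaped like the image_to_data output for example 5.
sample = ["", "8/10", "", "Spider-Man", "with", "a", "fresh", "twist", "",
          "TheLittleSongbird", "20", "July", "2017", "",
          "Really", "enjoyed", "the", "first", "two", "films"]

print(*parse_review(sample), sep="\n--------------\n")
```

Like the original cells, this breaks if the heading itself contains a year-like number, since the first such token is treated as the end of the date.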

In [13]:
#Bulk using rating

Image = os.path.join(imagelink, "*") 
Image = glob.glob(Image)[-2:]

for link in Image:

    # os.path.basename works with both Windows and POSIX path separators
    print("\nDetails for Image: {}\n".format(os.path.basename(link)))
    image = cv2.imread(link)

    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = pytesseract.image_to_data(rgb, output_type=Output.DICT)

    # Index of the rating token (the closest match to "/10")
    top = results['text'].index(get_close_matches("/10", results['text'],cutoff=0.6)[0])

    bottom = results['text'].index("helpful?")

    top_cod = results["top"][top]
    top_cod = top_cod - round(top_cod/6)

    bottom_cod = results["top"][bottom] 

    image = image[top_cod:bottom_cod, :]

    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = pytesseract.image_to_data(rgb, output_type=Output.DICT)

    # Locate the rating token (the closest match to "/10"); the review details follow it
    rating = get_close_matches("/10", results['text'], cutoff=0.6)[0]
    reviewtext = results['text'][results['text'].index(rating)+1:]
    reviewindx = reviewtext.index([i for i in reviewtext if (i.isdigit()) and (int(i) >=1900 and int(i) <= date.today().year)][0])

    reviewtime = " ".join(reviewtext[:reviewindx+1][-3:])
    reviewheading = " ".join([i for i in reviewtext[:reviewindx-2][:-2] if i != ""])
    reviewer = reviewtext[:reviewindx-2][-1]
    review = " ".join([i for i in reviewtext[reviewindx+1:] if i != ""])

    completereview = [reviewtime,reviewer,reviewheading,review]
    
    print("-"*20)
    print(*completereview,sep="\n--------------\n")
    print("-"*20)


Details for Image: example_5.JPG

--------------------
20 July 2017
--------------
TheLittleSongbird
--------------
Spider-Man with a fresh twist
--------------
Really enjoyed the first two films, both contained great scenes/action, acting and the two best villains of the films. Was mixed on the third film, which wasn't that bad but suffered mainly from bloat, and was not totally sold on the ‘Amazing Spider-Man’ films. Whether ‘Spider-Man: Homecoming’ is the best 'Spider-Man' film ever is debatable, some may prefer the first two films, others may prefer this. To me, it is the best 'Spider-Man' film since the second and on par with the first two. It may not have taken as many risks or had sequences/action as memorable as the first two films, and for more of an origin story it's best to stick with the first two films. For a fresh twist on 'Spider-Man' and the superhero genre, ‘Spider-Man: Homecoming’ (one of Marvel's best to date) more than fits the bill.
--------------------

Details f