# OCR with Python

Since you are reading this article, you probably need to convert a large number of screenshots
containing text into an actual Word document. Extracting text from images is what we call Optical Character Recognition, a.k.a. OCR.

Of course, there are many free websites where you upload your file and get a PDF or Word document back.
This is fine if you need to convert, say, 5 images. But what about a hundred screenshots? In my experience, that is usually a paid service, and sometimes you just don't want to pay a fortune for small things and experiments.

I was in the same situation, so here is what I did.

What I needed, and what is presented here:
- I had more than 100 screenshots that I wanted to convert to text.
- At the same time, I wanted a visual reference as well. In other words, I wanted the text and the screenshot next to each other.
- Since I wasn't keen on doing this manually, I was looking for a way to automate it.

# What do we do & What do we need?

The approach is very simple:
- perform optical character recognition on a screenshot
- insert the recognized text into a Word document
- insert the screenshot itself
- repeat for the next screenshot

For OCR we will use the Tesseract engine, which works very well. Before you proceed further, please go ahead and
download the latest version of Tesseract in case you don't have it yet. The official GitHub repository is here: https://github.com/tesseract-ocr/tesseract
There are many installation guides, so I will skip the installation part.

We will also need the cv2 library (OpenCV) and pytesseract, the Python wrapper around Tesseract. You will see that I have also imported Beautiful Soup, but you may not need that one. Please make sure you have installed the necessary libraries.
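If you are not sure whether everything is in place, a quick check like the one below may help. It is only a sketch: the helper names `missing_packages` and `tesseract_on_path` are made up for this example, and the pip package names in the mapping are the ones I would normally install for these imports.

```python
import importlib.util
import shutil

# Import name -> pip package name (the two often differ).
REQUIRED = {
    "cv2": "opencv-python",
    "pytesseract": "pytesseract",
    "numpy": "numpy",
    "bs4": "beautifulsoup4",
    "docx": "python-docx",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of the packages that cannot be imported."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]


def tesseract_on_path():
    """Return the location of the tesseract executable on PATH, or None."""
    return shutil.which("tesseract")


missing = missing_packages()
if missing:
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    print("All Python packages found.")

if tesseract_on_path() is None:
    # Not necessarily a problem: the path can also be set by hand in the script.
    print("tesseract is not on PATH; you will have to point pytesseract to it.")
```

Keep in mind that Tesseract itself is a separate program, so a successful `import pytesseract` does not mean the engine is installed.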

In [97]:
# IMPORT NECESSARY LIBRARIES

import cv2
import pytesseract
import numpy as np
import bs4

Since I want to make this OCR task as easy as possible for me, I want a function
that does the heavy lifting.
Let's define a function called ocr_check. It takes only the path to the file you wish to convert to text.

In [103]:
def ocr_check(file_path):
    """
    Takes the path to the file for OCR, runs the recognition and returns the text as a list of strings.
    !!! Be sure to specify the exact location of the tesseract.exe file !!!
    """    
    
    img = cv2.imread(file_path) # read image
    
    """
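    # Optional preprocessing; you can leave it out if your screenshots are already black/white.
    # This section is disabled by the surrounding triple quotes; remove them to enable it.
    
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # cv2.imread returns BGR, not RGB
    # with THRESH_OTSU the threshold is chosen automatically and the value 128 is ignored
    _, img_bin = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    gray = cv2.bitwise_not(img_bin)
    
    # note: a 1x1 kernel leaves the image unchanged; try a larger kernel with few iterations
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.erode(gray, kernel, iterations=1000)  # you may adjust parameters
    img = cv2.dilate(img, kernel, iterations=1000)  # you may adjust parameters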
    """
    
    # raw string, so single backslashes are enough; adjust to where Tesseract is installed
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

    out_below = pytesseract.image_to_string(img)
    
    #out_below = bs4.BeautifulSoup(out_below,"lxml")
    
    return out_below.split('\n')[0:-1]
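The raw OCR output often contains empty lines and stray whitespace, and depending on the Tesseract version it may end with a form feed character. If that bothers you, a small clean-up step can be applied to the text before it goes into the document. The sketch below is one possible way to do it; the function name `clean_lines` and the sample string are made up for illustration.

```python
def clean_lines(raw_text):
    """Split raw OCR output into lines, dropping blank lines and stray whitespace."""
    # splitlines() also treats the form feed character as a line boundary
    stripped = (line.strip() for line in raw_text.splitlines())
    return [line for line in stripped if line]


# made-up sample that imitates OCR output with an empty line and a trailing form feed
sample = "  First line \n\nSecond line\n\x0c"
print(clean_lines(sample))
```

To try it inside ocr_check, the last line of the function could return `clean_lines(out_below)` instead of the plain split.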

The next step is to put everything in a loop. We'll use the os and docx (python-docx) libraries to help us out.
I recommend that you keep your script and the folder with screenshots in one place. If you want to get the
path to the current working directory, simply type pwd as in the example below. You will need to enter this path for the loop.

In [110]:
pwd

In [107]:
import os
from docx import Document
from docx.shared import Inches

# folder with the pictures to convert to docx; I suggest you run pwd and copy/paste the result
path = 'C:\\Users\\....here you put your path....\\pics2docx'

# creates list of pictures in the folder
pic_list = os.listdir(path) 

# create Document object
document = Document()
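Note that os.listdir returns every entry in the folder, in arbitrary order, so a stray text file or a subfolder would end up in pic_list too. If your folder is not perfectly clean, something like the sketch below may be useful. The names `list_images`, `natural_key` and `IMAGE_EXTENSIONS` are hypothetical, and the set of extensions is my assumption about what your screenshots look like.

```python
import re
from pathlib import Path

# Extensions treated as screenshots; an assumption, adjust to your files.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def natural_key(name):
    """Sort key that puts 'shot2.png' before 'shot10.png'."""
    parts = re.split(r"(\d+)", name.lower())
    # re.split with a capturing group puts the digit runs at the odd positions
    return [int(part) if index % 2 else part for index, part in enumerate(parts)]


def list_images(folder, extensions=IMAGE_EXTENSIONS):
    """Return the names of the image files in folder, in natural order."""
    names = [entry.name for entry in Path(folder).iterdir()
             if entry.is_file() and entry.suffix.lower() in extensions]
    return sorted(names, key=natural_key)
```

With this in place, `pic_list = list_images(path)` could be used instead of the plain os.listdir call, and the pages of the document would follow the numbering of the screenshots.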

In the following loop, the path to each picture is constructed, and then, by calling methods of the Document object,
we add first the text, then the screenshot itself, and finally a page break.
At the end, we save the document under the chosen name.

In [108]:
for picture in pic_list:
    
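    file_to_ocr = os.path.join(path, picture)  # full path to the current picture
    
    # ocr_check returns a list of lines; join them into one string for the paragraph
    document.add_paragraph('\n'.join(ocr_check(file_to_ocr)))
    document.add_picture(file_to_ocr, width=Inches(6))
    document.add_page_break()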

document.save('test_of_ocr.docx')
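With more than a hundred screenshots, it would be a pity if one unreadable file stopped the whole run. One possible way around this, sketched below, is to run the OCR step through a small helper that records failures instead of stopping. The helper name `run_batch` and its `recognize` argument are hypothetical; in this article `recognize` would be ocr_check, and the fake recognizer in the demo only stands in for it.

```python
def run_batch(files, recognize):
    """Apply recognize() to every file and return (results, failures)."""
    results, failures = {}, {}
    for name in files:
        try:
            results[name] = recognize(name)
        except Exception as exc:
            # keep going, but remember which file failed and why
            failures[name] = str(exc)
    return results, failures


def fake_recognize(name):
    """Stand-in for ocr_check, used only for this demo."""
    if name.startswith("broken"):
        raise ValueError("cannot read " + name)
    return ["text of " + name]


results, failures = run_batch(["shot1.png", "broken.png"], fake_recognize)
print(len(results), "recognized,", len(failures), "failed")
```

The recognized lines in results can then be written to the document in a second pass, and the names in failures tell you which screenshots to look at by hand.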

# Conclusion

In this article you can find an easy solution for automated OCR. I focused on black/white images, but
towards the end, while working on this solution, I also experimented with colored text. I think the results are
pretty good - you can check them in the attached Word document.