# The Project #
1. This is a project with minimal scaffolding. Expect to use the discussion forums to gain insights! It’s not cheating to ask others for opinions or perspectives!
2. Be inquisitive, try out new things.
3. Use the previous modules for insights into how to complete the functions! You'll have to combine Pillow, OpenCV, and Pytesseract.
4. There are hints provided in Coursera; feel free to explore the hints if needed. Each hint provides progressively more detail on how to solve the issue. This project is intended to be comprehensive and difficult if you do it without the hints.

### The Assignment ###
Take a [ZIP file](https://en.wikipedia.org/wiki/Zip_(file_format)) of images and process them using a [library built into Python](https://docs.python.org/3/library/zipfile.html) that you need to learn how to use. A ZIP file compresses several different files into one single file, saving space. The files in the ZIP file we provide are newspaper images (like you saw in week 3). Your task is to write Python code that searches through the images for occurrences of keywords and faces. For example, if you search for "pizza", it will return a contact sheet of all of the faces located on each newspaper page that mentions "pizza". This will test your ability to learn a new [library](https://docs.python.org/3/library/zipfile.html), your ability to use OpenCV to detect faces, your ability to use Tesseract to do optical character recognition, and your ability to use PIL to composite images together into contact sheets.
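
One possible way to structure the search step is sketched below. It assumes each page has already been reduced to a dictionary holding its file name, its OCR text, and a list of face bounding boxes. The function name `search_pages`, the key names, and the sample records are illustrative assumptions, not something the assignment prescribes.

```python
def search_pages(pages, keyword):
    """Return the page records whose OCR text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [page for page in pages if needle in page["text"].lower()]


# Hypothetical records standing in for real OCR and face-detection results.
pages = [
    {"name": "a-0.png", "text": "christopher won the vote", "boxes": [(10, 10, 50, 50)]},
    {"name": "a-1.png", "text": "students vote", "boxes": []},
    {"name": "a-2.png", "text": "a letter from christopher", "boxes": []},
]

for page in search_pages(pages, "Christopher"):
    if page["boxes"]:
        print(f"{page['name']}: keyword found, {len(page['boxes'])} face(s)")
    else:
        print(f"{page['name']}: keyword found, but no faces on this page")
```

Keeping the match separate from the face list makes it easy to report pages where the word appears but no face was detected.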

Each page of the newspapers is saved as a single PNG image in a file called [images.zip](./readonly/images.zip). These newspapers are in English and contain a variety of stories, advertisements, and images. Note: This file is fairly large (~200 MB) and may take some time to work with, so I would encourage you to use [small_img.zip](./readonly/small_img.zip) for testing.
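
As a minimal sketch of how the `zipfile` module exposes archive members, the snippet below builds a tiny archive in memory and reads each member back. The member names and byte contents are made up for illustration and merely stand in for the PNG pages. With the real archive you would pass a path such as `readonly/small_img.zip` instead of the `BytesIO` buffer.

```python
import io
from zipfile import ZipFile

# Build a small archive in memory; names and contents are hypothetical stand-ins.
buffer = io.BytesIO()
with ZipFile(buffer, mode="w") as zf:
    zf.writestr("a-0.png", b"fake image bytes for page 0")
    zf.writestr("a-1.png", b"fake image bytes for page 1")

# Read it back: infolist() yields one ZipInfo per member,
# and open() returns a file-like object for that member.
members = {}
buffer.seek(0)
with ZipFile(buffer, mode="r") as zf:
    for info in zf.infolist():
        with zf.open(info) as member:
            members[info.filename] = member.read()

for name, data in members.items():
    print(name, len(data), "bytes")
```

Because `open()` returns a file-like object, it can be handed straight to `PIL.Image.open` without extracting the archive to disk.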

Here's an example of the output expected. Using the [small_img.zip](./readonly/small_img.zip) file, if I search for the string "Christopher" I should see the following image:
![Christopher Search](./readonly/small_project.png)
If I were to use the [images.zip](./readonly/images.zip) file and search for "Mark" I should see the following image (note that there are times when there are no faces on a page, but a word is found!):
![Mark Search](./readonly/large_project.png)
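
For the contact-sheet step, a rough sketch of the layout arithmetic follows: given a number of face thumbnails, it computes the sheet size and the top-left position of each thumbnail. The function name `contact_sheet_layout`, the five-per-row grid, and the 100-pixel thumbnail size are assumptions chosen for illustration; the sheets shown above may use different dimensions.

```python
import math


def contact_sheet_layout(n_faces, per_row=5, thumb=100):
    """Return (sheet_size, positions) for n_faces square thumbnails.

    sheet_size is (width, height) in pixels; positions holds the (x, y)
    top-left corner of each thumbnail, filled row by row.
    """
    rows = math.ceil(n_faces / per_row)
    sheet_size = (per_row * thumb, rows * thumb)
    positions = [
        ((i % per_row) * thumb, (i // per_row) * thumb)
        for i in range(n_faces)
    ]
    return sheet_size, positions


size, positions = contact_sheet_layout(7)
print(size)          # sheet dimensions for 7 faces
print(positions[5])  # the sixth face starts the second row
```

Each position could then be used as the `box` argument to Pillow's `Image.paste` when compositing the thumbnails onto a blank sheet. A page with no faces yields a sheet of zero height, so that case is better handled separately.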

Note: That big file can take some time to process - for me it took nearly ten minutes! Use the small one for testing.

In [18]:
from zipfile import ZipFile

from PIL import Image
from PIL import ImageDraw
from kraken import pageseg
import pytesseract 
import cv2 as cv
import numpy as np

#debug
image = Image.open("readonly/text.png")
text = pytesseract.image_to_string(image)
print(text)

def show_boxes(img):
    '''Modifies the passed image to show a series of bounding boxes on an image as run by kraken
    
    :param img: A PIL.Image object
    :return img: The modified PIL.Image object
    '''
    # Grab a drawing object to annotate the image (ImageDraw is imported at the top)
    drawing_object = ImageDraw.Draw(img)
    # We can create a set of boxes using pageseg.segment
    bounding_boxes = pageseg.segment(img.convert('1'))['boxes']
    # Now let's go through the list of bounding boxes
    for box in bounding_boxes:
        # And just draw a nice rectangle
        drawing_object.rectangle(box, fill=None, outline='red')
    # And to make it easy, let's return the image object
    return img


# loading the face detection classifier
face_cascade = cv.CascadeClassifier('readonly/haarcascade_frontalface_default.xml')
# A global list of dictionaries, one per page: file name, PIL image, face bounding boxes, OCR text
gl = []

# the rest is up to you!
zf = ZipFile('readonly/small_img.zip', mode='r')

for el in zf.infolist():
    with zf.open(el) as file:
        img = Image.open(file)
        
        # debug
        #print(img.size, img.mode, len(img.getdata()))
        #display(img)
        #print(file.name)
        
        text = pytesseract.image_to_string(img)
       
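        # Record the page: file name, a copy of the image (so it outlives the
        # open file handle), face boxes (left empty for now), and lower-cased OCR text
        boxes = []
        new_entry = {"name": file.name, "image": img.copy(), "boxes": boxes, "text": text.lower()}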
        
        gl.append(new_entry)
        

for el in gl:
    print(el['name'])
        
    print("----------- OCR text -----------------------")
    #text = pytesseract.image_to_string(el["image"])
    print(el['text'][:128])
    
    # To test this, let's use display
    # display(show_boxes(el['image'].convert('1')))
    # break

zf.close()


Behold, the magic of OCR! Using
pytesseract, we’ll be able to read the
contents of this image and convert it to
text


a-0.png
----------- OCR text -----------------------
(the michigan baily

ann arbor, michigan

wednesday, november 5, 2014

michigandailycom

big day for republicans

snyder earns s
a-1.png
----------- OCR text -----------------------
2a — wednesday, november 5, 2014

students vote, watch midterm election 2014

the michigan daily — michigandaily.com

 

(the mi
a-2.png
----------- OCR text -----------------------
the michigan daily — michigandaily.com

page 3a — wednesday, november 5, 2014

 

(the and;

aﬁﬁ

   

igan bailg

edited and ma
a-3.png
----------- OCR text -----------------------
4a, 5a — wednesday, november 5, 2014

the michigan daily — michigandaily.com

 

 

luna anna archey/daily

ann arbor mayor elec
