# The Project #
1. This is a project with minimal scaffolding. Expect to use the discussion forums to gain insights! It’s not cheating to ask others for opinions or perspectives!
2. Be inquisitive, try out new things.
3. Use the previous modules for insights into how to complete the functions! You'll have to combine Pillow, OpenCV, and Pytesseract.
4. There are hints provided in Coursera; feel free to explore them if needed. Each hint provides progressively more detail on how to solve the issue. This project is intended to be comprehensive and difficult if you do it without the hints.

### The Assignment ###
Take a [ZIP file](https://en.wikipedia.org/wiki/Zip_(file_format)) of images and process them using a [library built into Python](https://docs.python.org/3/library/zipfile.html) that you need to learn how to use. A ZIP file bundles several files into one compressed file, which saves space. The files in the ZIP file we provide are newspaper images (like you saw in week 3). Your task is to write Python code that searches through the images for occurrences of keywords and faces. For example, if you search for "pizza", it will return a contact sheet of all of the faces located on each newspaper page that mentions "pizza". This will test your ability to learn a new [library](https://docs.python.org/3/library/zipfile.html), your ability to use OpenCV to detect faces, your ability to use Tesseract to do optical character recognition, and your ability to use PIL to composite images together into contact sheets.
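
To get a feel for the `zipfile` library before touching the real archives, here is a minimal sketch that builds a tiny ZIP in memory and reads it back. The member names and contents (`page-0.txt`, `page-1.txt`) are made up for illustration; the real files are PNG images, but the open-and-read pattern should be the same.

```python
import io
import zipfile

# Build a tiny ZIP archive in memory so the example runs without files on disk.
# The member names and contents below are made up for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("page-0.txt", "first page")
    archive.writestr("page-1.txt", "second page")

# Reopen the archive for reading and pull each member out as raw bytes.
buffer.seek(0)
contents = {}
with zipfile.ZipFile(buffer, "r") as archive:
    for info in archive.infolist():          # one ZipInfo per stored file
        with archive.open(info) as member:   # file-like object for that member
            contents[info.filename] = member.read()

print(contents)
```

For image members, the file-like object returned by `archive.open(...)` can be handed to `Image.open(...)` instead of calling `.read()`.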

Each page of the newspapers is saved as a single PNG image in a file called [images.zip](./readonly/images.zip). These newspapers are in English and contain a variety of stories, advertisements, and images. Note: This file is fairly large (~200 MB) and may take some time to work with, so I would encourage you to use [small_img.zip](./readonly/small_img.zip) for testing.

Here's an example of the output expected. Using the [small_img.zip](./readonly/small_img.zip) file, if I search for the string "Christopher" I should see the following image:
![Christopher Search](./readonly/small_project.png)
If I were to use the [images.zip](./readonly/images.zip) file and search for "Mark" I should see the following image (note that there are times when there are no faces on a page, but a word is found!):
![Mark Search](./readonly/large_project.png)
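
Contact sheets like the ones pictured above could be assembled roughly as follows. This is only a sketch: the helper name `make_contact_sheet`, the five-column layout, and the 100x100 cell size are my own assumptions, not part of the assignment, and the "faces" are solid-colour squares standing in for real crops.

```python
from PIL import Image


def make_contact_sheet(thumbnails, columns=5, cell_size=(100, 100)):
    """Paste images left to right, top to bottom, onto a black canvas.

    Hypothetical helper: the column count and cell size are assumptions.
    """
    if not thumbnails:
        return None  # nothing to composite, e.g. a page with no detected faces
    rows = -(-len(thumbnails) // columns)  # ceiling division
    sheet = Image.new("RGB", (columns * cell_size[0], rows * cell_size[1]))
    for index, thumb in enumerate(thumbnails):
        tile = thumb.copy()
        tile.thumbnail(cell_size)  # shrinks in place, keeping the aspect ratio
        x = (index % columns) * cell_size[0]
        y = (index // columns) * cell_size[1]
        sheet.paste(tile, (x, y))
    return sheet


# Stand-in "faces": solid-colour squares instead of real face crops.
colours = ("red", "green", "blue", "white", "yellow", "gray")
faces = [Image.new("RGB", (150, 150), colour) for colour in colours]
sheet = make_contact_sheet(faces)
print(sheet.size)  # (500, 200): five columns, two rows
```

Returning `None` for an empty list is one way to handle pages where the keyword is found but no faces are detected; the caller can then print a message instead of displaying an image.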

Note: That big file can take some time to process - for me it took nearly ten minutes! Use the small one for testing.

In [1]:
import zipfile

from PIL import Image
import pytesseract
import cv2 as cv
import numpy as np

# loading the face detection classifier
face_cascade = cv.CascadeClassifier('haarcascade_frontalface_default.xml')

In [2]:
# Looks like there are 4 images in the small zip (I think looking things up in a dictionary would be easier than in a list)

# Hint: why not just store the PIL.Image objects in a global data structure, maybe a list or a dictionary indexed by name?
zip_images = {}

* Note:
    * You'll need two context managers: one to open the ZIP archive, and a nested `with ... open(...)` statement to open each file inside it (a `ZipExtFile`)

In [3]:
with zipfile.ZipFile('small_img.zip', 'r') as zipped:
    newspapers = zipped.infolist() # list of ZipInfo objects, one per file in the archive
    for newspaper in newspapers: # iterate through the 'zipfile.ZipInfo' entries
        #print(type(newspaper))
        with zipped.open(newspaper) as zipextfile: # open each member through the outer archive handle
            #display(Image.open(zipextfile)) # michigan daily newspaper images - images successfully display
            #print(type(newspaper), dir(newspaper), dir(zipextfile)) # the filename comes from the ZipInfo entry, not from the opened file object
            img = Image.open(zipextfile).convert('RGB') # converting to RGB should help with the face detection from Module 3
            zip_images[newspaper.filename] = {'PIL_img': img}

In [4]:
zip_images # Alright sweet, looks like we have each file name from the zip and its corresponding PIL image

{'a-0.png': {'PIL_img': <PIL.Image.Image image mode=RGB size=3600x6300>},
 'a-1.png': {'PIL_img': <PIL.Image.Image image mode=RGB size=3600x6300>},
 'a-2.png': {'PIL_img': <PIL.Image.Image image mode=RGB size=3600x6300>},
 'a-3.png': {'PIL_img': <PIL.Image.Image image mode=RGB size=7200x6300>}}

#### Module 2 started with Tesseract and the ability to use a PIL image with pytesseract to extract the text. Using our dictionary, we can add a new key to each image with the text detected by pytesseract

In [7]:
import pytesseract 

In [10]:
# Module 2 note: run tesseract on an image with text = pytesseract.image_to_string(image)

for img in zip_images.keys():
    # pass the PIL image stored under this filename to pytesseract for OCR
    img_text = pytesseract.image_to_string(zip_images[img]['PIL_img'])
    # print(type(img_text), len(img_text))
    # Well ... 6197 characters is a lot of text, but based on the project definition 'Christopher' should be in there - let's count & check
    # if 'Christopher' in img_text:
    #    print('hey') # let's see how many are in there ... lol there's just one in the string of 6197 characters (at least for the first image)
    # break
    # Let's add a text key for each image
    zip_images[img]['img_text'] = img_text
    
# How long are our text strings?
for img_str in zip_images.keys():
    print(len(zip_images[img_str]['img_text']))

6197
12517
18623
28776


#### ... Wow, that's a lot of text!
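
With the text stored next to each image, a keyword search can be a simple substring check over the dictionary. Here is a minimal sketch, assuming the same `{'filename': {'img_text': ...}}` layout as `zip_images`; the helper name `pages_mentioning` and the sample strings are made up so the cell runs without Tesseract.

```python
def pages_mentioning(keyword, pages):
    """Return the filenames whose OCR text contains the keyword (case-sensitive).

    Hypothetical helper; `pages` is assumed to look like zip_images.
    """
    return [
        name
        for name, entry in pages.items()
        if keyword in entry.get("img_text", "")
    ]


# Made-up stand-in for zip_images so the sketch runs without OCR.
sample_pages = {
    "a-0.png": {"img_text": "An interview with Christopher about the election"},
    "a-1.png": {"img_text": "Pizza prices rise again"},
}

print(pages_mentioning("Christopher", sample_pages))  # ['a-0.png']
print(pages_mentioning("pizza", sample_pages))        # [] because the match is case-sensitive
```

Whether the search should ignore case is a design choice; lowercasing both the keyword and the text before comparing would make "pizza" match "Pizza".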