# The Project #
1. This is a project with minimal scaffolding. Expect to use the discussion forums to gain insights! It’s not cheating to ask others for opinions or perspectives!
2. Be inquisitive, try out new things.
3. Use the previous modules for insights into how to complete the functions! You'll have to combine Pillow, OpenCV, and Pytesseract.
4. There are hints provided in Coursera; feel free to explore them if needed. Each hint provides progressively more detail on how to solve the issue. This project is intended to be comprehensive and difficult if you do it without the hints.

### The Assignment ###
Take a [ZIP file](https://en.wikipedia.org/wiki/Zip_(file_format)) of images and process them using a [library built into Python](https://docs.python.org/3/library/zipfile.html) that you need to learn how to use. A ZIP file bundles several files into a single compressed file, which saves space. The files in the ZIP file we provide are newspaper images (like you saw in week 3). Your task is to write Python code that searches through the images for occurrences of keywords and faces. For example, if you search for "pizza", it will return a contact sheet of all of the faces located on each newspaper page that mentions "pizza". This will test your ability to learn a new [library](https://docs.python.org/3/library/zipfile.html), your ability to use OpenCV to detect faces, your ability to use Tesseract to do optical character recognition, and your ability to use PIL to composite images together into contact sheets.
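
As a starting point for the ZIP handling described above, here is a minimal sketch that walks an archive with the standard `zipfile` module and yields each member's raw bytes. The helper name `iter_zip_members` and the in-memory demo archive are illustrative assumptions, not part of the course material.

```python
import io
import zipfile


def iter_zip_members(zip_source):
    """Yield (name, raw_bytes) for every file stored in the archive.

    zip_source may be a path or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(zip_source, "r") as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            yield info.filename, archive.read(info.filename)


# Build a tiny in-memory archive so the sketch runs without the course files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo_archive:
    demo_archive.writestr("page-0.png", b"fake image bytes")
    demo_archive.writestr("page-1.png", b"more fake bytes")

for name, data in iter_zip_members(buffer):
    print(name, len(data))
```

With real image data, each `data` value could be wrapped as `io.BytesIO(data)` and handed to `PIL.Image.open`, which accepts file-like objects, so nothing needs to be extracted to disk first.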

Each page of the newspapers is saved as a single PNG image in a file called [images.zip](./readonly/images.zip). These newspapers are in English and contain a variety of stories, advertisements, and images. Note: This file is fairly large (~200 MB) and may take some time to work with, so I would encourage you to use [small_img.zip](./readonly/small_img.zip) for testing.

Here's an example of the output expected. Using the [small_img.zip](./readonly/small_img.zip) file, if I search for the string "Christopher" I should see the following image:
![Christopher Search](./readonly/small_project.png)
If I were to use the [images.zip](./readonly/images.zip) file and search for "Mark" I should see the following image (note that there are times when there are no faces on a page, but a word is found!):
![Mark Search](./readonly/large_project.png)
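
The keyword check behind searches like "Christopher" or "Mark" can be sketched as follows. This version assumes whole-word, case-insensitive matching on the OCR text and strips punctuation first; the assignment does not say whether substring matches should also count, so treat that choice, and the helper name `contains_keyword`, as assumptions.

```python
import re


def contains_keyword(ocr_text, keyword):
    """Return True if keyword appears as a whole word in ocr_text, ignoring case.

    Hypothetical helper: whole-word matching is an assumption, not a course requirement.
    """
    # Split on anything that is not a letter, digit, or apostrophe,
    # so "Christopher," still matches "christopher".
    words = re.findall(r"[a-z0-9']+", ocr_text.lower())
    return keyword.lower() in words


sample = "Mr. Christopher, the owner, declined to comment."
print(contains_keyword(sample, "christopher"))  # True
print(contains_keyword(sample, "Chris"))        # False: only whole words match
```

In the real pipeline, `ocr_text` would be the string returned by `pytesseract.image_to_string` for a page.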

Note: That big file can take some time to process - for me it took nearly ten minutes! Use the small one for testing.
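
For the compositing step, the grid arithmetic of a contact sheet can be worked out separately from any image handling. The sketch below computes the sheet size and the top-left paste position of each thumbnail; the function name and the fixed cell size are illustrative assumptions.

```python
import math


def contact_sheet_layout(count, columns, cell_width, cell_height):
    """Return (sheet_size, positions) for count thumbnails laid out row by row.

    Hypothetical helper: every thumbnail is assumed to occupy an equal-sized cell.
    """
    rows = math.ceil(count / columns)
    positions = [
        ((i % columns) * cell_width, (i // columns) * cell_height)
        for i in range(count)
    ]
    return (columns * cell_width, rows * cell_height), positions


size, positions = contact_sheet_layout(count=7, columns=5, cell_width=100, cell_height=100)
print(size)          # (500, 200): five columns, two rows
print(positions[5])  # (0, 100): the sixth face starts the second row
```

Each position could then be passed to `Image.paste` on a blank sheet created with `Image.new` at the computed size.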

In [1]:
from zipfile import ZipFile
from PIL import Image
from PIL import ImageDraw
import pytesseract
import cv2 as cv
import numpy as np
import inspect

# Given
face_cascade = cv.CascadeClassifier('readonly/haarcascade_frontalface_default.xml')

# Maybe use this at some point
eng_dict=[]
with open ("readonly/words_alpha.txt", "r") as f:
    data=f.read()
    eng_dict=data.split("\n")

# Defined function
def binarizepil(image_to_transform, threshold):
    # Convert to greyscale, then map each pixel to black (below threshold) or white
    output_image = image_to_transform.convert("L")
    return output_image.point(lambda p: 0 if p < threshold else 255)

# Specify the correct zip file to read through
with ZipFile('readonly/small_img.zip','r') as archive:
    files = archive.infolist()

#     # Use to temporarily work with the first image in the zip file
#     for i in range(len(files)-1):
#         files.pop()

    # Look at each file
    for file in files:
        with archive.open(file.filename) as img_file:

            # Perform search for text
            string = 'Christopher'
            search = string.lower()
            print("Searching for '{}' in file".format(string),file.filename)
            # Pass the open file object: the member name is not a path on disk
            img_pil = Image.open(img_file)
#             # REMOVED - Resize
#             #img_pil = img_pil.resize((int(img_pil.size[0]/5),int(img_pil.size[1]/5)))
#             print("Creating greyscale image") # Progress Check
            img_pil_g = img_pil.convert('L')
#             print("Creating b-w image") # Progress Check
            img_pil_b = img_pil.convert('1')
#             print("Searching for '{}' in greyscale image".format(string)) # Progress Check
            g_text = pytesseract.image_to_string(img_pil_g) 
            g_words = g_text.lower().strip().split()
#             print("Completed search for '{}' in greyscale image".format(string)) # Progress Check
            # Fall back to binarized copies only if the greyscale pass misses the keyword
            b_words = []
            if search not in g_words:
                for thresh in [64, 128, 192]:
#                     print("Searching for '{}' in b-w image with threshold {}".format(string,thresh)) # Progress Check
                    # Threshold the greyscale page; the mode '1' copy is already black and white
                    b_text = pytesseract.image_to_string(binarizepil(img_pil_g,thresh))
                    for word in b_text.lower().split():
                        if word not in b_words:
                            b_words.append(word)
#                     print("Completed search for '{}' in b-w image with threshold {}".format(string,thresh)) # Progress Check
            if (search in g_words) or (search in b_words):
                print("Results were matched in this file and faces displayed below")
            else:
                print("No matches found for that text")
                # Skip face detection and the contact sheet for pages without the keyword
                continue
                
            # Find faces on that sheet
#             print("Compiling face images from",file.filename) # Progress Check
            images = []
            # Reuse the greyscale page loaded with Pillow; a ZIP member name is not a path cv.imread can open
            img_array_g = np.array(img_pil_g)
            faces = face_cascade.detectMultiScale(img_array_g,2)
#             print(f"There are {len(faces)} to draw") # Progress Check
            drawing = ImageDraw.Draw(img_pil)
            index = 1
            for x,y,w,h in faces:
#                 print("Drawing face {}".format(index)," of {}".format(len(faces))) # Progress Check
                drawing.rectangle((x,y,x+w,y+h), outline="white")
                images.append(img_pil.crop((x,y,x+w,y+h)))
                index += 1
#             print("Passing images to contact sheet") # Progress Check

#             # Used for quick testing of contact sheet creation with
#             images=[]
#             for i in range(0,9):
#                 test_i = Image.open(img_file.name)
#                 wpix = int(50*i+100)
#                 ratio = float(wpix/test_i.size[0])
#                 test_i = test_i.resize((wpix,int(test_i.size[1]*ratio)))
#                 display(test_i)
#                 images.append(test_i)

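        # Create contact sheet
        if not images:
            # A keyword can appear on a page that has no detectable faces
            print("No faces found in", file.filename)
            continue
        span = 5
        image_base = images[0]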
        base_w = image_base.width
        base_h = image_base.height
        resize = (base_w,base_h)
        cs_width = base_w * span
        cs_rows = len(images) // span
        if len(images) % span > 0:
            cs_rows += 1
        cs_height = cs_rows * base_h
        w_inc = base_w
        h_inc = base_h
        cs = Image.new(image_base.mode, (cs_width,cs_height))
        x = 0
        y = 0
        for img in images:
            # Scale every face to the size of the first one so the grid cells line up
            img = img.resize(resize)
            cs.paste(img,(x, y))
            if x + w_inc == cs_width:
                x = 0
                y = y + h_inc
            else:
                x = x + w_inc
        ratio = 1000 / cs_width
        cs = cs.resize((int(cs_width*ratio),int(cs_height*ratio)))
        display(cs)
