# Gather images and prepare them for labs
----

This notebook downloads and prepares "Labeled Faces in the Wild" images using sklearn's fetch_lfw_people function. It converts the images from jpg to png, creates an S3 bucket which will be used in the subsequent modules, and uploads a sample of the images to that bucket. 

1. Create an S3 bucket.
2. Download and un-tar the "Labeled Faces in the Wild" images.
3. Convert images from jpg to png (or jpeg); Rekognition only supports png and jpeg images.
4. Upload converted images to the S3 bucket.
5. Upload contents of the media directory to the S3 bucket.


###  Information on "Labeled Faces in the Wild":
-----
*Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller.
Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments.
University of Massachusetts, Amherst, Technical Report 07-49, October, 2007* [pdf](https://vis-www.cs.umass.edu/lfw/lfw.pdf)


### Information on sklearn.datasets.fetch_lfw_people

Sklearn provides [fetch_lfw_people](https://scikit-learn.org/stable/datasets/real_world.html#labeled-faces-in-the-wild-dataset). This function performs a one-time download of the images and tasks into sklearn's data directory. Note: the download takes several minutes the first time it is run. The LFW dataset is a collection of JPEG pictures of famous people collected over the internet; all details are available on the official website:

https://vis-www.cs.umass.edu/lfw/

Each picture is centered on a single face. The typical task is called Face Verification: given a pair of pictures, a binary classifier must predict whether the two images show the same person.


#### Install Python dependencies

In [None]:
!pip install boto3 matplotlib opencv-python openpyxl pandas scikit-learn scikit-image

#### Import Libraries & Specify an S3 bucket name

Set `bucket_name` below to the name of the bucket you want created on your behalf. This bucket will be used in subsequent modules and will be loaded with a sample of the "Labeled Faces in the Wild" images.




In [None]:
import os
import glob 
import boto3
from PIL import Image 
from pathlib import Path

mySession = boto3.session.Session()
aws_region = mySession.region_name
print("AWS Region: {}".format(aws_region))

# --- sklearn libraries for Labeled Faces in the Wild --- 
from sklearn.datasets import fetch_lfw_people, get_data_home

# --- provide a bucket name ---
bucket_name = ""
# -------------------------------------------

print("AWS Bucket: {}".format(bucket_name))
%store bucket_name
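
Because `bucket_name` is left empty above, bucket creation fails until a name is filled in. The following is a minimal sketch of a pre-flight check, assuming only the core S3 naming rules: 3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit. The helper name `is_plausible_bucket_name` is hypothetical. The check does not cover every rule S3 enforces (reserved prefixes, for example), nor whether the name is already taken.

```python
import re

# Core S3 naming rules only; S3 itself enforces a few more.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")

def is_plausible_bucket_name(name):
    """Return True if `name` passes the basic S3 bucket naming rules."""
    # length 3-63, allowed characters, letter/digit at both ends
    if not _BUCKET_RE.fullmatch(name):
        return False
    # adjacent dots are not allowed
    if ".." in name:
        return False
    # names must not be formatted like an IP address
    if re.fullmatch(r"\d+\.\d+\.\d+\.\d+", name):
        return False
    return True

print(is_plausible_bucket_name("my-idv-lab-bucket"))
print(is_plausible_bucket_name(""))
```

Running a check like this before Step 1 turns an empty or malformed name into an immediate, readable failure instead of an AWS error.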

## Step 1. Create an S3 Bucket 
----
The following will create a new S3 bucket which will store images and other objects that are used in subsequent modules. 


In [None]:
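from botocore.exceptions import ClientError

s3 = boto3.client('s3')
try:
    if aws_region in (None, "us-east-1"):
        # us-east-1 is the default location; S3 rejects it as an explicit LocationConstraint
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={'LocationConstraint': aws_region}
        )
except ClientError as err:
    # e.g. BucketAlreadyOwnedByYou or BucketAlreadyExists
    print("could not create bucket: {}".format(err.response['Error']['Code']))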

## Step 2. Download Labeled Faces in the Wild
------

Labeled Faces in the Wild is a database of face photographs designed for studying the problem of unconstrained face recognition. The data set contains more than 13,000 images of faces collected from the web. Each face has been labeled with the name of the person pictured. 1680 of the people have two or more distinct photos. More details can be found here: http://vis-www.cs.umass.edu/lfw/


**NOTE: using the sklearn.datasets.fetch_lfw_people function will take several minutes to download!**  

fetch_lfw_people is a one-time download and writes the faces as **jpg** files to the sklearn data home directory.

In [None]:
# -- download the LFW images (one-time) --
fetch_lfw_people(min_faces_per_person=2, download_if_missing=True)
# -- get the path to lfw image data --
lfw_path = os.path.join(get_data_home(), "lfw_home", "lfw_funneled")
print(lfw_path)
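
The extracted dataset is laid out as one sub-directory per person, each holding that person's jpg files. The following is a small sketch, using only the standard library, that counts images per person under a directory such as `lfw_path`. The helper name `count_images_per_person` is hypothetical and not part of sklearn.

```python
from collections import Counter
from pathlib import Path

def count_images_per_person(root):
    """Map each person's directory name to the number of .jpg files in it."""
    counts = Counter()
    for jpg in Path(root).glob("*/*.jpg"):
        counts[jpg.parent.name] += 1
    return counts

# Example (assumes the download above has completed):
# counts = count_images_per_person(lfw_path)
# print(len(counts), "people,", sum(counts.values()), "images")
# print(counts.most_common(5))
```

A quick count like this is a cheap way to confirm the download and extraction finished before moving on to the conversion step.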

## Step 3. Convert images from jpg to png (or jpeg)
-----
Rekognition supports png and jpeg images. Here we read the "jpg" images, convert them to png, and save them in a directory called idv-media. In this step we only convert images whose file names start with the letter "A".


**NOTE: this may take a minute or two to run**

In [None]:
# delete the existing idv-media directory (if it exists), then recreate it
!rm -rf idv-media
!mkdir idv-media

pattern = "{}/*/*.jpg".format(lfw_path)
for img in glob.glob(pattern):
    # swap the .jpg extension for .png and keep only the file name
    file_name = Path(img).with_suffix('.png').name
    if file_name[0] == "A":
        print(file_name)
        full_path = os.path.join(os.getcwd(), "idv-media", file_name)
        # Pillow picks the output format from the .png extension
        with Image.open(img) as im1:
            im1.save(full_path)

print("-- conversion complete --")
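
The selection logic in the loop above can be separated from the image I/O, which makes it easy to check which files would be converted before running the conversion. The following is a hedged sketch using only the standard library; `plan_conversions` and its `prefix` parameter are hypothetical names, not part of the notebook, and the two input paths are illustrative.

```python
from pathlib import Path

def plan_conversions(jpg_paths, out_dir, prefix="A"):
    """Return (source, target) pairs for files whose name starts with `prefix`.

    Each target keeps the source file name but swaps the extension for .png.
    """
    pairs = []
    for src in map(Path, jpg_paths):
        if src.name.startswith(prefix):
            pairs.append((src, Path(out_dir) / src.with_suffix(".png").name))
    return pairs

plan = plan_conversions(
    ["lfw_funneled/Aaron_Eckhart/Aaron_Eckhart_0001.jpg",
     "lfw_funneled/Zoe_Ball/Zoe_Ball_0001.jpg"],
    "idv-media",
)
for src, dst in plan:
    print(src, "->", dst)
```

Each pair could then be opened with `Image.open` and saved to its target, as in the conversion cell.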

## Step 4. Upload Images to S3 bucket
----
This step reads all of the "png" files in the idv-media folder and uploads them to the S3 bucket.

**NOTE: this may take a few minutes to upload to S3** (the loop uploads one file at a time; it could be done in parallel)

In [None]:
# S3 upload converted images
destination_path = "idv-media/*.png"
for img in glob.glob(destination_path):
    file_name = os.path.basename(img)
    s3.upload_file(img, bucket_name, file_name)
print("-- upload complete --")
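
As the note for this step suggests, the uploads could run in parallel. The following is a minimal sketch with a thread pool, assuming the upload callable is safe to share across threads (boto3 documents its low-level clients as thread safe). The helper name `upload_in_parallel` and the `max_workers` value are illustrative choices, not part of the notebook.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

def upload_in_parallel(paths, upload_one, max_workers=8):
    """Call upload_one(path, key) for every path using a thread pool.

    The key is the file's base name, as in the sequential loop.
    Returns (uploaded_keys, failures), where failures maps path -> exception.
    """
    uploaded, failures = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(upload_one, path, os.path.basename(path)): path
            for path in paths
        }
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()
                uploaded.append(os.path.basename(path))
            except Exception as err:  # collect errors instead of stopping
                failures[path] = err
    return uploaded, failures

# Usage with the notebook's client and bucket (not run here):
# uploaded, failures = upload_in_parallel(
#     glob.glob("idv-media/*.png"),
#     lambda path, key: s3.upload_file(path, bucket_name, key),
# )
```

Passing the upload function in as an argument keeps the helper independent of AWS, so it can be tried with a stand-in function before pointing it at a real bucket.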

## Step 5. Upload media images and files to S3 bucket
----
This step uploads the contents of the media directory to the S3 bucket.

In [None]:
# S3 upload media folder
destination_path = "media/*"
for img in glob.glob(destination_path):
    # upload_file needs a file path, so skip any sub-directories
    if not os.path.isfile(img):
        continue
    file_name = os.path.basename(img)
    s3.upload_file(img, bucket_name, file_name)
print("-- upload complete --")