# Face recognition Image Cropping and Filtering notebook

This note books is program that preprocesing data for face recognition
It reads image data from GCS, the images are 

* filtered : it read data and filter out not appropriate image for example, if it has more than 2 face, if face has sunglass or if face angle is so big. 
* croped : after filtering it crops face only
* resize : croped image will be resized for traing
* stored in destination GCS bucket : all processed images are stored in destination bucket in GCS
* create labled image list file :for training it also create csv file which combines image URL and label for the image. It generate two csv file. One for training and the other for validation

If you just put the images in SOURCE_BUCKET in GCS, it will automatically preprocess and generate result.
The process is using Apache Beam (aka. google dataflow) to provide scale.

Here is reference doc 
* Jupyter notebook which uses Apache beam for preprocessing. It is really good!! : https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/blogs/babyweight/babyweight.ipynb
* GCS python client :  http://gcloud-python.readthedocs.io/en/latest/storage-client.html
* Beam python example : https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples 

## Install Python modules
You need to run this in just first time. After u install the modules. you don't need to install again.

In [1]:
#!pip install Pillow

# Set up environment variables

In [1]:
PROJECT = 'terrycho-ml' #Google Project ID
SOURCE_BUCKET = 'terrycho-face-rawdata' #GCS SOURCE BUCKET
IMAGE_FILE_PREFIX = '' #directory which stores raw images in GCS
DESTINATION_BUCKET = 'terrycho-face-trainingdata' #GCS DESTINATION BUCKET, all filtered and cropped image and file list will be stored
INPUT_FILE_LOCAL='filelist.csv'
INPUT_FILE='gs://'+SOURCE_BUCKET+'/'+INPUT_FILE_LOCAL 

IMAGE_SIZE = 96,96

MAX_ROLL = 20
MAX_TILT = 20
MAX_PAN = 20

DEBUG_MODE=False
NUM_OF_DEBUG_DATA=15
TRAINING_FILE = "training.csv"
VALIDATION_FILE = "validation.csv"


#### Source image file directory structure

SOURCE_BUCKET/IMAGE_PREFIX/Jessica/imagefileX.jpg <BR>
SOURCE_BUCKET/IMAGE_PREFIX/Jessica/imagefileX.jpg <BR>
SOURCE_BUCKET/IMAGE_PREFIX/Jessica/imagefileX.jpg <BR>
SOURCE_BUCKET/IMAGE_PREFIX/Brad/imagefileX.jpg <BR>
SOURCE_BUCKET/IMAGE_PREFIX/Brad/imagefileX.jpg <BR>

#### Options

* MAX_ROLL,MAX_TILT,MAX_PAN : face angle, if face of angle in the image exceeds this value, the image will be filtered out
* DEBUG_MODE : if it is true, dataflow pipeline will be ran in local and only NUM_OF_DEBUG_DATA will be used for testing purpose only
* TRAINING_FILE : final file name which contains cropped image file name & lables for training purpose
* VALIDATION_FILE :  final file name which contains cropped image file name & lables for validation purpose


## define temp directory depends on DEBUG_MODE.
If it is development mode with local dataflow runner. it will uses local directory as a temp.
In production mode, dataflow runner will be ran in cloud , it uses GCS as a temp. But even it still needs a LOCAL_TMP_DIR. 
Because, in the code, it pulls files from GCS to local disk to resize/crop etc. the LOCAL_TMP_DIR is used for the purpose


In [2]:
if DEBUG_MODE:
    TMP_DIR='/tmp/face'
    LOCAL_TMP_DIR = '/tmp/'
else:
    TMP_DIR='gs://'+DESTINATION_BUCKET+'/tmp/'
    LOCAL_TMP_DIR = '/tmp/'

### Clear garbage data

clear local directories before running

In [3]:
!rm training.csv-*


rm: training.csv-*: No such file or directory


# Make raw file list with CSV format

Scan files in SOURCE directory and make list of the files with CSV format <BR>
The SOURCE directory will have subdirectory and the sub directoy just have image file of one people. (the people name is the subdirectory name)<BR>
file format will be "filename,labelname,label"<BR>
The label is integer, label name is string
<p>
<B>before run the process, you have to create service account file (json). you can get the file from google cloud console API menu.
After you download it, you have to replace "/Users/terrycho/keys/terrycho-ml.json" this key file to your downloaded key file. </B>

In [5]:
#gcs example https://github.com/salrashid123/gcpsamples#cloud-python
from google.cloud import storage
import google.auth
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/Users/terrycho/keys/terrycho-ml.json"

def create_csv():
    # get bucket
    credentials, project = google.auth.default()
    client = storage.Client()
    bucket = client.get_bucket(SOURCE_BUCKET)
    labels = {}
    index =  0
    cnt = 0

    # open file list file
    tfile = open(INPUT_FILE_LOCAL,'w')

    # read files in directory
    blobs = bucket.list_blobs()
    for blob in blobs:
        uri = blob.name
        # if file is jpeg, extract file name and directory name
        if(blob.content_type == 'image/jpeg'):
            #print uri
            try:
                e = uri.split('/')
                imagefile = e[len(e)-1]
                if ',' in imagefile:
                    continue
                name = str(e[len(e)-2])
                try:
                    label = labels[name]
                except Exception , e:
                    label = index
                    labels[name] = label
                    index = index + 1

                tfile.write('%s,%s,%s\n'%(imagefile,name,label))
                cnt = cnt + 1
                # if it is debug mode, it extract small number of data for testing purpose
                if DEBUG_MODE and cnt > NUM_OF_DEBUG_DATA:
                    break
            except Exception , ex:
                s = str(ex)
                print("[Error] %s file processing error by %s and skiped" %(imagefile,str(ex)) )

    tfile.close()
    print ('\nFound %s files and create %s file'%(cnt,INPUT_FILE_LOCAL))

create_csv()
                    

[Error] 2016 FW KOLPING 설현 콜핑 아웃도어 광고01.jpg file processing error by 'ascii' codec can't encode characters in position 16-21: ordinal not in range(128) and skiped

Found 456 files and create filelist.csv file


## Validate raw csv file and upload it into GCS

validate generated file and upload the file into SOURCE_BUCKET

In [6]:
# chceck generated file

!wc -l $INPUT_FILE_LOCAL
!head $INPUT_FILE_LOCAL

#upload INPUT_FILE to GCS
!gsutil cp $INPUT_FILE_LOCAL gs://$SOURCE_BUCKET

     456 filelist.csv
jesi-1zdcfbt.jpg,Alba,0
jesi-2dgou36.jpg,Alba,0
jesi-2dqn8li.jpg,Alba,0
jesi-2drg9ok.jpg,Alba,0
jesi-2dv8uth.jpg,Alba,0
jesi-2j5i15c.jpg,Alba,0
jesi-2ndimages (10).jpeg,Alba,0
jesi-2ndimages (17).jpeg,Alba,0
jesi-2ndimages (22).jpeg,Alba,0
jesi-2ndimages (23).jpeg,Alba,0
Copying file://filelist.csv [Content-Type=text/csv]...
\ [1 files][ 14.4 KiB/ 14.4 KiB]                                                
Operation completed over 1 objects/14.4 KiB.                                     


## Clean up DESTINATION bucket which will store filtered images

In [7]:
# clear destination bucket
!gsutil -m rm -r "gs://"$DESTINATION_BUCKET

# create destination bucket
!gsutil mb "gs://"$DESTINATION_BUCKET
!echo 'Done'

Removing gs://terrycho-face-trainingdata/staging/preparefacedata170609012211.1496938933.100437/Babel-2.3.4.tar.gz#1496939000485883...
Removing gs://terrycho-face-trainingdata/staging/preparefacedata170609012211.1496938933.100437/Bottleneck-1.1.0.tar.gz#1496939012905200...
Removing gs://terrycho-face-trainingdata/staging/preparefacedata170609012211.1496938933.100437/Pillow-3.3.1.zip#1496939102212878...
Removing gs://terrycho-face-trainingdata/staging/preparefacedata170609012211.1496938933.100437/Pygments-2.2.0.tar.gz#1496939121678474...
Removing gs://terrycho-face-trainingdata/staging/preparefacedata170609012211.1496938933.100437/alabaster-0.7.9.tar.gz#1496938969183040...
Removing gs://terrycho-face-trainingdata/staging/preparefacedata170609012211.1496938933.100437/appnope-0.1.0.tar.gz#1496938969832049...
Removing gs://terrycho-face-trainingdata/staging/preparefacedata170609012211.1496938933.100437/appscript-1.0.1.tar.gz#1496938971667971...
Removing gs://terrycho-face-trainingdata/stagi

Removing gs://terrycho-face-trainingdata/staging/preparefacedata170609012211.1496938933.100437/requirements.txt#1496938938303716...
Removing gs://terrycho-face-trainingdata/staging/preparefacedata170609012211.1496938933.100437/rsa-3.4.2.tar.gz#1496939126984157...
Removing gs://terrycho-face-trainingdata/staging/preparefacedata170609012211.1496938933.100437/scandir-1.5.tar.gz#1496939127652869...
Removing gs://terrycho-face-trainingdata/staging/preparefacedata170609012211.1496938933.100437/setuptools-36.0.1.zip#1496939129621795...
Removing gs://terrycho-face-trainingdata/staging/preparefacedata170609012211.1496938933.100437/simplegeneric-0.8.1.zip#1496939130171481...
Removing gs://terrycho-face-trainingdata/staging/preparefacedata170609012211.1496938933.100437/six-1.10.0.tar.gz#1496939130863834...
Removing gs://terrycho-face-trainingdata/staging/preparefacedata170609012211.1496938933.100437/traitlets-4.3.2.tar.gz#1496939132005125...
Removing gs://terrycho-face-trainingdata/staging/prepar

## Make requirements.txt file for python module dependency

To run Apache beam data flow in google cloud, it need to upload dependency files into apache beam cloud run time (aka dataflow). 
This file will be used to install dependency module to google data flow run time
Reference https://cloud.google.com/dataflow/pipelines/dependencies-python

In [8]:
%%writefile requirements.txt
Pillow==3.3.1
google-api-python-client==1.6.2
google-cloud-vision==0.24.0
google-cloud-storage==1.1.1
ipython==5.3.0
ipython-genutils==0.2.0

Overwriting requirements.txt


In [None]:
!cat requirements.txt

Pillow==3.3.1
google-api-python-client==1.6.2
google-cloud-vision==0.24.0
google-cloud-storage==1.1.1
ipython==5.3.0
ipython-genutils==0.2.0

# Run filtering & resize image file and make a list of filtered file

This is Apache beam based data preprocessing workflow
I read $INPUT_FILE the file is csv format file and it contains imagefilename, string label and int label.<BR>
based on the csv file, it read imagefile and filter the file based on
number of faces in the photo, face angle, sunglass etc.<BR>
After the filtering it crop & resize the image into IMAGE_SIZE*IMAGE_SIZE <BR>
All filtered and cropped image will be uploaded destination bucket. <BR>
The file list also stored in destination bucket. <BR>


In [None]:
import os
import datetime
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions

import google.auth
import io
from oauth2client.client import GoogleCredentials

# set service account file into OS environment value
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/Users/terrycho/keys/terrycho-ml.json"

# make variables in option

job_name = 'preparefacedata'+ datetime.datetime.now().strftime('%y%m%d%H%M%S')

options = {
    'staging_location': 'gs://'+DESTINATION_BUCKET+'/staging',
    'temp_location': 'gs://'+DESTINATION_BUCKET+'/tmp',
    'job_name': job_name,
    'project': PROJECT,
    'teardown_policy': 'TEARDOWN_ALWAYS',
    'no_save_main_session': True ,  
    'requirements_file' : 'requirements.txt',
    'save_main_session': True
}
opts = beam.pipeline.PipelineOptions(flags=[], **options)

# define transform

def parseCSV(element):
    line = str(element)
    print line
    e = line.split(',')
    imagefile = str(e[0])
    name = str(e[1])
    label = int(e[2])
    return label,name,imagefile

# filter image with image information with vision API
def get_image_info(element):
    # in dataflow, globally imported module cannot be used. so import the module in each function
    # for more information https://cloud.google.com/dataflow/faq
    from google.cloud import vision
    if len(element) < 3 or element == None:
         print('[Error] %s: It doesnt have file name' % element)
         return None
    label = int(element[0])
    name = str(element[1])
    imagefile = str(element[2])
    
    visionClient = vision.Client()
    imagefile_uri = 'gs://'+SOURCE_BUCKET+'/'+IMAGE_FILE_PREFIX+name+'/'+imagefile
    print ('[INFO] processing %s'%(imagefile_uri))
    image = visionClient.image(source_uri=imagefile_uri)
    faces = image.detect_faces(limit=2)

    if len(faces) > 1:
        print('[Error] %s: It has more than 2 faces in a file' % imagefile)
        return None
    if len(faces) == 0:
        print('[Error] %s: It has no faces in a file' % imagefile)
        return None
    face = faces[0]

    # extract face angle
    roll_angle = face.angles.roll
    pan_angle = face.angles.pan
    tilt_angle = face.angles.tilt
    angle = [roll_angle,pan_angle,tilt_angle]
    
    # filter out based on angle
    if abs(roll_angle) > MAX_ROLL or abs(pan_angle) > MAX_PAN or abs(tilt_angle) > MAX_TILT:
        print('[Error] %s: face skew angle is big' % imagefile)
        return None
        
    # extract face boundary
    left = face.fd_bounds.vertices[0].x_coordinate
    top = face.fd_bounds.vertices[0].y_coordinate
    right = face.fd_bounds.vertices[2].x_coordinate
    bottom = face.fd_bounds.vertices[2].y_coordinate
    rect = [left,top,right,bottom]
    
    # check sunglass
    try:
        objs = image.detect_labels(limit=50)
        if objs != None:
            for obj in objs:
                if 'sunglasses' in obj.description:
                    print('[Error] %s: sunglass is detected' % imagefile)  
                    return None
    except Exception as e:
        print('[Error] %s: Get Label info error: %s' %(imagefile,str(e)) )
        return None
    
    return label,name,imagefile,rect

def process_image(element):
    from google.cloud import storage
    from google.cloud.storage import Blob
    from PIL import Image
    from PIL import ImageDraw
    
    print "process image",element

    if element == None  or len(element) < 4:
        print('[Error] $s doesnt have 4 elements '%(str(element)) )
        return None
    
    label = int(element[0])
    name = str(element[1])
    imagefile = str(element[2])
    rect = element[3]
    print ('[INFO] Cropping %s'%(imagefile))

    # crop filesfile
    storageClient = storage.Client()
    source_bucket = storageClient.get_bucket(SOURCE_BUCKET)
    blob = source_bucket.get_blob(IMAGE_FILE_PREFIX+name+'/'+imagefile)
    
    # 1) download file
    tmp_file = LOCAL_TMP_DIR+'tmp-'+imagefile
    cropped_file = LOCAL_TMP_DIR+imagefile
    with open(tmp_file,'wb') as file_obj:
        blob.download_to_file(file_obj)
        
    # 2) crop face
    try:
        fd = io.open(tmp_file,'rb')
        image = Image.open(fd)  

        crop = image.crop(rect)
        im = crop.resize(IMAGE_SIZE,Image.ANTIALIAS)
            
        im.save(cropped_file,"JPEG")
        fd.close()
        print('[Info]  %s: Crop face %s and write it to file : %s' %( imagefile,rect,cropped_file) )

    except Exception as e:
        print('[Error] %s: Crop image writing error : %s' %(imagefile,str(e)) )
        return None
    
    # 3) upload file
    destination_bucket = storageClient.get_bucket(DESTINATION_BUCKET)
    
    blob = Blob('images/'+imagefile,destination_bucket)
    with open(cropped_file,'rb') as file_obj:
        blob.upload_from_file(file_obj)
        
    # 4) delete file
    os.remove(cropped_file)
    os.remove(tmp_file)
    return imagefile,name,label

def FormatString(element):
    if element == None:
        return

    s=''
    try:
        s = '%s,%s,%s'%(element[0],element[1],element[2])
    except Exception as e:
        print('[Error] %s: file formating error : %s' %(str(element),str(e))) 
    print s
    return s

    
# create pipeline



def run():    
    if(DEBUG_MODE):
        RUNNER = 'DirectRunner'
    else:
        RUNNER = 'DataflowRunner'
    p = beam.Pipeline(RUNNER, options=opts)

    # Extract image data --> Filter --> Resize & upload --> create file list 
    l=(p 
     | 'read csv' >> ReadFromText(INPUT_FILE)
     | 'parse CSV file' >> beam.Map(parseCSV)
     | 'Get image meta info' >> beam.Map(get_image_info)
     | 'Filter unsuitable image' >> beam.Filter(lambda x: x!=None)
     | 'Process Image' >> beam.Map(process_image) 
     | 'Format String' >> beam.Map(FormatString)
     | 'Write to training file' >> WriteToText('gs://'+DESTINATION_BUCKET+'/'+TRAINING_FILE)
    )
    job = p.run()
    job.wait_until_finish()

run()
print 'file filtering and cropping is done!!'

No handlers could be found for logger "oauth2client.contrib.multistore_file"
  super(GcsIO, cls).__new__(cls, storage_client))


Please wait until the dataflow job has been finished.
If you run the data flow in google cloud, you can check the progress in google cloud console.
In my case, when i process 456 files, it takes around 17 min. 
In the console you can trace applicatoin log and data flow status like below

!daatf

# Read cropped file list and seperate it to Training & Validation file

the filtered and cropped file need to be seperated to two type of data. One is for training and the other is for validation. <P>
To do that, it donwload all filtered and croped file from destination bucket and merge it into one file.<BR>
After that, for each label, 70% of imagefile are stored in training file list and 30% of image files are stored in validation file

### Download cropped files from CSV and merge the lists into filtered_filelist.csv 

In [4]:
!gsutil cp gs://$DESTINATION_BUCKET/*.csv* .
    

Copying gs://terrycho-face-trainingdata/training.csv-00000-of-00013...
Copying gs://terrycho-face-trainingdata/training.csv-00001-of-00013...          
Copying gs://terrycho-face-trainingdata/training.csv-00002-of-00013...          
Copying gs://terrycho-face-trainingdata/training.csv-00003-of-00013...          
/ [4 files][  6.3 KiB/  6.3 KiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m -o ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://terrycho-face-trainingdata/training.csv-00004-of-00013...
Copying gs://terrycho-face-trainingdata/training.csv-00005-of-00013...          
Copying gs://terrycho-face-trainingdata/training.csv-00006-of-00013...          
Copying gs://terrycho-face-trainingdata/training.csv-00007-of-00013...          
Copying gs://terrycho-face-trainingdata

In [5]:
!rm filtered_filelist.csv

In [6]:
# 
#!ls *.csv*

!cat training.csv-* > filtered_filelist.csv
!wc -l filtered_filelist.csv

#!cat filtered_filelist.csv
!echo 'Alba'
!egrep Alba filtered_filelist.csv | wc -l
!echo 'Sulhyun'
!egrep Sulhyun filtered_filelist.csv | wc -l
!echo 'Jolie'
!egrep Jolie filtered_filelist.csv | wc -l
!echo 'Victoria'
!egrep Victoria filtered_filelist.csv | wc -l
!echo 'Nicole'
!egrep Nicole filtered_filelist.csv | wc -l

     318 filtered_filelist.csv
Alba
      67
Sulhyun
      73
Jolie
      69
Victoria
      54
Nicole
      55


### Seperate 70% files to  training file and 30% files to validation file 

In [15]:
MAX_DATA=50
labels = {}

flist = open('filtered_filelist.csv','r')
tfile = open(TRAINING_FILE,'w')
vfile = open(VALIDATION_FILE,'w')

for line in flist:
    e = line.split(',')
    imagefile = e[0]
    name = e[1]
    label = e[2]
    
    if name in labels:
        labels[name] = labels[name] + 1
    else:
        labels[name] = 0
    
    if labels[name] < int(0.8*MAX_DATA):
        tfile.write(line)
    else:
        vfile.write(line)

print "Data split is done"
for l in labels:
    print l,'totla:',labels[l],'training data :',int(labels[l]*0.8),'validation data :',int(labels[l]*0.3)


flist.close()
tfile.close()
vfile.close()

Data split is done
Alba totla: 66 training data : 52 validation data : 19
Sulhyun totla: 72 training data : 57 validation data : 21
Jolie totla: 68 training data : 54 validation data : 20
Nicole totla: 54 training data : 43 validation data : 16
Victoria totla: 53 training data : 42 validation data : 15
{'Alba': 66, 'Sulhyun': 72, 'Jolie': 68, 'Nicole': 54, 'Victoria': 53}


In [8]:
!wc -l $TRAINING_FILE
!wc -l $VALIDATION_FILE

     175 training.csv
     143 validation.csv


### Upload the training file and validation file into destination file

In [9]:
!gsutil cp $TRAINING_FILE gs://$DESTINATION_BUCKET/
!gsutil cp $VALIDATION_FILE gs://$DESTINATION_BUCKET/
    

Copying file://training.csv [Content-Type=text/csv]...
- [1 files][  6.0 KiB/  6.0 KiB]                                                
Operation completed over 1 objects/6.0 KiB.                                      
Copying file://validation.csv [Content-Type=text/csv]...
- [1 files][  4.5 KiB/  4.5 KiB]                                                
Operation completed over 1 objects/4.5 KiB.                                      


In [10]:
!echo 'Training file is gs://'$DESTINATION_BUCKET'/'$TRAINING_FILE
!echo 'Validation file is gs://'$DESTINATION_BUCKET'/'$TRAINING_FILE


Training file is gs://terrycho-face-trainingdata/training.csv
Validation file is gs://terrycho-face-trainingdata/training.csv


Image preprocessing is done. Next time i will introduce face recognition model that uses this filtered data.