# Processing Data Download

This Jupyter Notebook transforms the downloaded data set (a collection of `.jpg` pictures) into input that the Machine Learning algorithms can interpret. As this has to be done only once, we keep it out of the main Jupyter Notebook, which implements the different methods and models on the training and testing data sets.

Three major steps are performed in this Jupyter Notebook:
1. Properly loading all images from the subdirectories
2. Defining the functions that construct the pipeline
    + Translating each `.jpg` picture into pixels with an intensity per pixel
    + Creating arrays of corresponding labels
    + Combining all arrays into a proper dictionary
    + Storing the translated pixel data so that the main Jupyter Notebook can access it
3. Running pipeline with different pixel resolutions
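
For orientation, the main notebook could later read a stored file roughly as sketched below. The keys `rgb_data`, `labels` and `pic_ids` are the ones written by this notebook; the helper name `load_pic_data` and the tiny stand-in dictionary written to a temporary folder are assumptions made only so that the snippet runs on its own.

```python
import os
import pickle
import tempfile

import numpy as np


def load_pic_data(path):
    """Load a pickled picture dictionary (hypothetical helper, not part of this notebook)."""
    with open(path, "rb") as file:
        return pickle.load(file)


# Stand-in for a real output file: 2 pictures at 2x2 resolution (2*2*3 = 12 values each)
demo_dict = {
    "rgb_data": np.zeros((2, 12)),
    "labels": np.array([1, 0], dtype=np.uint8),
    "pic_ids": np.array(["NVIDIA/a.jpg", "NVIDIA/b.jpg"]),
}
demo_path = os.path.join(tempfile.mkdtemp(), "pic_data_dummy_toy_2.pkl")
with open(demo_path, "wb") as file:
    pickle.dump(demo_dict, file)

# Read the file back and split it into features and labels
pic_data = load_pic_data(demo_path)
X, y = pic_data["rgb_data"], pic_data["labels"]
print(X.shape, y.shape)
```

Only load pickle files that you created yourself, since unpickling can execute arbitrary code.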

<br>
<br>
<br>

## 1. Definition of where to find files in folders

<br>

In [1]:
# Import necessary libraries and set-up Jupyter Notebook.

# Common imports
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import timeit # for measuring time/code performance
import pickle # for storing dictionary in the end
import PIL # Python Imaging Library (Pillow)

# to make this notebook's output stable across runs (safety measure)
np.random.seed(42)

In [2]:
# Switch between toy and full data
full_data_switch_on = False #Set True for full data set and False for Dummy Data set (see comment below)

<div class="alert alert-block alert-danger">
<b>Action required</b>
<p>
    
You have to set the switch to choose between the full data set (True) and the dummy toy data set (False). We set aside 100 correct and 100 incorrect pictures in a dummy toy data set in order to test our code faster. Everything needed to run the algorithm with the dummy toy data is included in the GitHub repository (in the folder "01_data/99_dummy_toy_data").
    
However, if you want to run the algorithm with the full data set, you have to download the corresponding files from the Dropbox link below. The raw data is located in the Dropbox folder "00_raw" (hint: it is approx. 40.5 GB) and has to be downloaded into the repository folder "01_data/00_raw/".

The reason why we cannot use the links directly here is that we have not yet figured out how to loop through subfolders and files on Dropbox online. GitHub does not allow us to upload such an amount of data.
<br>
Dropbox-Link: https://www.dropbox.com/sh/45vbkq1ihfnhqem/AAADdq6mJKaLsG1w7SDK-QV8a?dl=0    

<br>
<b>
!!! Be aware: Running this Jupyter Notebook with the full data set probably requires more than 8 hours of runtime for a 32x32 pixel resolution, depending on your hardware !!!

</b>    
</div>
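
Before starting a long run, it can help to verify that the selected data folder actually exists and contains files. The following is a minimal sketch of such a check; the function name `count_files` is an assumption, and the demo call uses a temporary folder rather than the repository folders named above.

```python
import os
import tempfile


def count_files(root):
    """Count all files below root; raise a clear error if the folder is missing."""
    if not os.path.isdir(root):
        raise FileNotFoundError(f"Data folder not found: {root}")
    return sum(len(files) for _, _, files in os.walk(root))


# Demo on a temporary folder with one (empty) file in a subfolder
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "correct"))
with open(os.path.join(demo_root, "correct", "pic_0.jpg"), "wb") as file:
    file.write(b"")

n_files = count_files(demo_root)
print("Files found:", n_files)
```

In this notebook, the same check could be pointed at `ROOT_DATA` once the path variables have been set.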

In [3]:
# Setting path variables depending on switch
if full_data_switch_on == True: 
    # Set path to full data set of correct and incorrect files
    ROOT_DATA = "01_data/00_raw/"
    PATH_DATA_CORRECT_NVIDIA = os.path.join(ROOT_DATA, "Masked-Face-Net-Dataset/CMFD")
    PATH_DATA_INCORRECT_NVIDIA = os.path.join(ROOT_DATA, "Masked-Face-Net-Dataset/IMFD")
    # Set path to second data set (without trailing slash, so that the pic id extraction further below yields "KAGGLE/<filename>")
    PATH_DATA_KAGGLE = os.path.join(ROOT_DATA, "Face-Mask-Dataset_Kaggle/train")
    PATH_LABELS_KAGGLE = os.path.join(ROOT_DATA, "Face-Mask-Dataset_Kaggle/kaggle_train_labels.csv")
elif full_data_switch_on == False:
    # Set path to dummy toy data set of correct and incorrect files
    ROOT_DATA = "01_data/99_dummy_toy_data/"
    PATH_DATA_CORRECT_NVIDIA = os.path.join(ROOT_DATA, "correct")
    PATH_DATA_INCORRECT_NVIDIA = os.path.join(ROOT_DATA, "incorrect")
    # Set path to dummy second data set
    PATH_DATA_KAGGLE = os.path.join(ROOT_DATA, "Kaggle_dummy")
    PATH_LABELS_KAGGLE = os.path.join(ROOT_DATA, "kaggle_train_labels.csv")
else:
    raise ValueError("Full data switch not correctly defined: Binary value of True or False necessary")

In [4]:
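# Defining function to list all pictures to include
def list_files(directory):
    # Walk through the directory and all its subdirectories and collect every file path
    file_paths = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            file_paths.append(os.path.join(root, name))
    return file_paths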

In [5]:
# Preparing and Extracting necessary information of NVIDIA data set

# List all file paths in corresponding folder
filepaths_correct_NVIDIA = list_files(PATH_DATA_CORRECT_NVIDIA)
filepaths_incorrect_NVIDIA = list_files(PATH_DATA_INCORRECT_NVIDIA)

# Extract pic ids from files in consideration
# Correct pic ids
pic_ids_correct_NVIDIA = np.array([path.replace(PATH_DATA_CORRECT_NVIDIA + "/", "NVIDIA/")
                                   for path in filepaths_correct_NVIDIA])
# Incorrect pic ids
pic_ids_incorrect_NVIDIA = np.array([path.replace(PATH_DATA_INCORRECT_NVIDIA + "/", "NVIDIA/")
                                     for path in filepaths_incorrect_NVIDIA])

In [8]:
# Preparing and Extracting necessary information of Kaggle datafile

# Load labels csv.file of Kaggle data set
labels_kaggle = pd.read_csv(PATH_LABELS_KAGGLE)

# List all file paths in corresponding folder
filepaths_kaggle_all = list_files(PATH_DATA_KAGGLE)

# Extract pic ids from files in consideration
pic_ids_kaggle = [path.replace(PATH_DATA_KAGGLE, "KAGGLE") for path in filepaths_kaggle_all]

# Extract filenames out of pic ids
filenames_kaggle = [pic_id.replace("KAGGLE/", "") for pic_id in pic_ids_kaggle]
    
# Add all information into pandas data frame
filenames_kaggle = pd.DataFrame(filenames_kaggle)
filenames_kaggle = filenames_kaggle.rename(columns = {0: "filename"})
filenames_kaggle["pic_ids"] = pic_ids_kaggle
filenames_kaggle["full_path"] = filepaths_kaggle_all

# Merge with labels set using a left join
Left_join_kaggle = pd.merge(filenames_kaggle, 
                     labels_kaggle, 
                     on ='filename', 
                     how ='left')

# Translating descriptive labels into True/False
# (map avoids chained assignment, which pandas may apply to a copy instead of the data frame)
label_map = {"with_mask": True, "without_mask": False}
Left_join_kaggle["label_tf"] = Left_join_kaggle["label"].map(label_map)
# Labels that are missing or not in the mapping become NaN
if Left_join_kaggle["label_tf"].isna().any():
    raise ValueError("Check labels: Neither 'with_mask' nor 'without_mask'")


# Separating pandas dataframe into one for correct and one for incorrect
pd_data_kaggle_correct = Left_join_kaggle.loc[Left_join_kaggle["label_tf"] == True]
pd_data_kaggle_incorrect = Left_join_kaggle.loc[Left_join_kaggle["label_tf"] == False]

# Extracting pic ids to numpy arrays for later purposes
pic_ids_correct_KAGGLE_np = np.array(pd_data_kaggle_correct["pic_ids"].tolist())
pic_ids_incorrect_KAGGLE_np = np.array(pd_data_kaggle_incorrect["pic_ids"].tolist())

In [10]:
# Combining filepaths of two data sets
filepaths_correct = filepaths_correct_NVIDIA + pd_data_kaggle_correct["full_path"].tolist()
filepaths_incorrect = filepaths_incorrect_NVIDIA + pd_data_kaggle_incorrect["full_path"].tolist()

<br>
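
The left join used for the Kaggle labels keeps pictures that have no entry in the label file, and such rows end up with a missing label. One way to spot them early is the `indicator` option of `pd.merge`, sketched here on toy data (the file names and both data frames are made up for illustration).

```python
import pandas as pd

# Toy stand-ins for the file list and the label file
files = pd.DataFrame({"filename": ["a.jpg", "b.jpg", "c.jpg"]})
labels = pd.DataFrame({"filename": ["a.jpg", "b.jpg"],
                       "label": ["with_mask", "without_mask"]})

# indicator=True adds a "_merge" column that tells where each row came from
merged = pd.merge(files, labels, on="filename", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "filename"].tolist()
print("Pictures without label:", unmatched)

# Unmatched pictures get a missing value when the labels are translated
label_tf = merged["label"].map({"with_mask": True, "without_mask": False})
print("Missing labels:", label_tf.isna().sum())
```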

## 2. Defining necessary functions that construct the pipeline

<br>
<br>
<br>


### Function for transforming .jpg files into numerical arrays

<br>

In [11]:
# Importing necessary libraries/functions
from PIL import Image as PIL_Image


# Defining function that performs translation of jpg files into numerical representation
def pixel_transformation(target_pixel, correct_switch):
    # Correct/Incorrect switch: select file list and description for the messages
    if correct_switch==True:
        filepaths, kind = filepaths_correct, "correct"
    elif correct_switch==False:
        filepaths, kind = filepaths_incorrect, "incorrect"
    else:
        raise ValueError("Error: Value for correct_switch must be True or False!")

    # Run-time information
    start_time = timeit.default_timer()
    print(">... Starting pixel transformation for", kind, "pictures for resolution: ", target_pixel, "x", target_pixel)

    # Pre-allocate array of fitting shape (#obs, #pixels*3) instead of stacking row by row
    array_length = target_pixel*target_pixel*3 # Times 3 as we have 3 values (RGB) per pixel
    loaded_pics = np.empty([len(filepaths), array_length])

    # Running through all image files of the selected class
    for row, filename in enumerate(filepaths):
        # Open picture (the file is closed again when the with-block ends)
        with PIL_Image.open(filename) as pic:
            # Ensure 3 channels (RGB), then reduce size from original format to target format
            pic_resized = pic.convert("RGB").resize((target_pixel, target_pixel))
        # Extract RGB data and reshape 3D array (e.g.: 24, 24, 3) into 1D array
        loaded_pics[row] = np.array(pic_resized).reshape(-1)

    # End run-time information
    elapsed = timeit.default_timer() - start_time
    print("Finished", target_pixel, "x", target_pixel, "pixel transformation for", kind, "pictures. Run-time in seconds: ", round(elapsed,2))

    # Returning
    return loaded_pics
        
   
    

<br>
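
To illustrate what the transformation does to a single picture, here is a small sketch that works on synthetic in-memory images instead of files from the data set. The images and the 4x4 target size are assumptions chosen for illustration.

```python
import numpy as np
from PIL import Image

target_pixel = 4

# Synthetic 8x8 RGB picture filled with a single colour (stand-in for a .jpg file)
pic = Image.new("RGB", (8, 8), color=(255, 0, 0))

# Resize to the target resolution and read the pixel values
pic_resized = pic.resize((target_pixel, target_pixel))
pic_data = np.array(pic_resized)   # 3D array: height x width x RGB
flat = pic_data.reshape(-1)        # 1D array with target_pixel*target_pixel*3 entries
print(pic_data.shape, flat.shape)

# A grayscale picture has only one channel; converting to RGB restores three
gray = Image.new("L", (8, 8), color=128)
gray_shape = np.array(gray).shape
gray_rgb_shape = np.array(gray.convert("RGB")).shape
print(gray_shape, gray_rgb_shape)
```

Each picture therefore contributes one row of `target_pixel*target_pixel*3` values to the feature array, provided it has exactly three channels.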
<br>
<br>
<br>


### Function for creating label vectors

<br>

In [12]:
# Defining function to create label vector

def add_labels(pic_array, correct_switch):
    # Run-time information
    start_time = timeit.default_timer()
    print(">... Starting to create labels vector for array.")
    
    # Define label depending on switch
    if correct_switch==True:
        label_item = 1
    elif correct_switch==False:
        label_item = 0
    else:
        raise ValueError("Error: Value for correct_switch must be True or False!")
        
    # Create label vector of unsigned integers with one entry per picture
    labels = np.full(len(pic_array), label_item, dtype=np.uint8)
    
    # End run-time information
    elapsed = timeit.default_timer() - start_time
    print("Finished to create labels vector for array. Label: ",label_item,". Run-time in seconds: ", round(elapsed,2))
    
    #Returning values
    return labels

<br>
<br>
<br>
<br>


### Function for combining arrays into dictionary

<br>

In [13]:
# Defining function to combine arrays to dictionary

def comb_to_dict(feat_cor, feat_incor, labels_cor, labels_incor):
    # Run-time information
    start_time = timeit.default_timer()
    print(">... Starting to combine arrays to dictionary.")
    
    # Combine label arrays
    labels = np.append(labels_cor, labels_incor)
    
    # Combine feature arrays
    cleaned_data = np.vstack((feat_cor, feat_incor))
    
    # Combine pic id arrays in the same order as the features:
    # correct NVIDIA, correct Kaggle, incorrect NVIDIA, incorrect Kaggle
    pic_ids_all = np.concatenate((pic_ids_correct_NVIDIA,
                                  pic_ids_correct_KAGGLE_np,
                                  pic_ids_incorrect_NVIDIA,
                                  pic_ids_incorrect_KAGGLE_np))
    
    # Combine data and labels into one dictionary
    pic_data = {}
    pic_data["rgb_data"] = cleaned_data
    pic_data["labels"] = labels
    pic_data["pic_ids"] = pic_ids_all
    
    # End run-time information
    elapsed = timeit.default_timer() - start_time
    print("Finished creating dictionary. Run-time in seconds: ", round(elapsed,2))
    
    #Returning values
    return pic_data
    

<br>
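
Because features, labels and pic ids are assembled separately, a quick consistency check on the resulting dictionary can catch ordering or length mistakes. The sketch below is one possible check; the function name `check_pic_dict` and the toy dictionary are assumptions, while the keys match those used in `comb_to_dict`.

```python
import numpy as np


def check_pic_dict(pic_data, target_pixel):
    """Return the number of pictures if all entries of the dictionary fit together."""
    n_obs, n_feat = pic_data["rgb_data"].shape
    if n_feat != target_pixel * target_pixel * 3:
        raise ValueError("Unexpected number of features per picture")
    if len(pic_data["labels"]) != n_obs or len(pic_data["pic_ids"]) != n_obs:
        raise ValueError("Features, labels and pic ids differ in length")
    return n_obs


# Toy dictionary: 3 pictures at 2x2 resolution (2*2*3 = 12 values each)
demo = {
    "rgb_data": np.zeros((3, 12)),
    "labels": np.array([1, 1, 0], dtype=np.uint8),
    "pic_ids": np.array(["NVIDIA/a.jpg", "KAGGLE/b.jpg", "NVIDIA/c.jpg"]),
}
n_checked = check_pic_dict(demo, target_pixel=2)
print("Pictures checked:", n_checked)
```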
<br>
<br>
<br>


### Function for storing dictionary with appropriate name

<br>

In [14]:
# Defining function for storing dictionary with appropriate name in appropriate folder
def store_dict(pic_dict, target_pixel):
    # Choose file name depending on the data switch
    if full_data_switch_on == True: 
        file_path = "01_data/01_cleaned/pic_data_full_"+str(target_pixel)+".pkl"
    elif full_data_switch_on == False: 
        file_path = "01_data/01_cleaned/pic_data_dummy_toy_"+str(target_pixel)+".pkl"
    else:
        raise ValueError("Full data switch not correctly defined: Binary value of True or False necessary")
    # The with-block closes the file even if pickling fails
    with open(file_path, "wb") as file:
        pickle.dump(pic_dict, file)
    print("Successfully stored in: " + file_path)

<br>
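
Opening a file for writing fails if the target folder does not exist yet, which could happen on a fresh clone if the output folder is not part of the repository (an assumption about the repository layout). A possible safeguard is to create the folder first, as in this sketch; the helper name `store_with_makedirs` is hypothetical and the demo writes to a temporary folder.

```python
import os
import pickle
import tempfile


def store_with_makedirs(obj, folder, filename):
    """Create the folder if necessary, pickle obj into it and return the file path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, "wb") as file:
        pickle.dump(obj, file)
    return path


# Demo: the nested folder does not exist before the call
demo_folder = os.path.join(tempfile.mkdtemp(), "01_data", "01_cleaned")
stored_path = store_with_makedirs({"labels": [1, 0]}, demo_folder, "pic_data_dummy_toy_24.pkl")

with open(stored_path, "rb") as file:
    restored = pickle.load(file)
print(restored)
```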
<br>
<br>
<br>


### Function for running the whole pipeline

<br>

In [15]:
# Defining function to run whole pipeline

def pic_pipeline(pixel_res):
    # Run-time information
    start_time_all = timeit.default_timer()
    print("\n\n>... Starting picture pipeline for resolution: ", pixel_res, "x", pixel_res, "\n\n")
    
    # Transformation
    feat_correct = pixel_transformation(target_pixel=pixel_res, correct_switch=True)
    feat_incorrect = pixel_transformation(target_pixel=pixel_res, correct_switch=False)
    # Adding labels
    labels_correct = add_labels(pic_array=feat_correct, correct_switch=True)
    labels_incorrect = add_labels(pic_array=feat_incorrect, correct_switch=False)
    # Combining
    pic_data = comb_to_dict(feat_cor=feat_correct, 
                            feat_incor=feat_incorrect, 
                            labels_cor=labels_correct, 
                            labels_incor=labels_incorrect)
    # Storing
    if store_switch == True:
        store_dict(pic_dict=pic_data, target_pixel=pixel_res)
    elif store_switch == False:
        print("Be aware. New arrays are not stored.")
    else:
        raise ValueError("Error. Store Switch must be either True or False!")
    
    # End run-time information
    elapsed = timeit.default_timer() - start_time_all
    print("\n\nFinished picture pipeline for resolution: ", pixel_res, "x", pixel_res, "Run-time in seconds:", round(elapsed,2))
    
    # Store run-time info
    run_time_info = {}
    run_time_info[str(pixel_res)] = round(elapsed, 2)
    if full_data_switch_on == True: 
        run_time_path = "03_output/03_data_prep_run_time/run_time_data_prep_full_"+str(pixel_res)+".pkl"
    elif full_data_switch_on == False: 
        run_time_path = "03_output/03_data_prep_run_time/run_time_data_prep_dummy_toy_"+str(pixel_res)+".pkl"
    else:
        raise ValueError("Full data switch not correctly defined: Binary value of True or False necessary")
    # The with-block closes the file even if pickling fails
    with open(run_time_path, "wb") as file:
        pickle.dump(run_time_info, file)
    print("Successfully stored in: " + run_time_path)
    

<br>
<br>
<br>
<br>

## 3. Running full transformation pipeline

<br>

In [16]:
# Defining pixel resolutions to use: All pics will be resized to a square, so 16 will become a 16x16 resolution
desired_pixel_res = [24]

# Define if new run shall be stored (For development of code)
store_switch = True


for pix in desired_pixel_res:
    # Running transformation pipeline incl. adding labels and storing
    pic_pipeline(pixel_res=pix)
# Report completion once, after all resolutions have been processed
print("\n\n Finished overall pipeline!")
    



>... Starting picture pipeline for resolution:  24 x 24 


>... Starting pixel transformation for correct pictures for resolution:  24 x 24
Finished 24 x 24 pixel transformation for correct pictures. Run-time in seconds:  1.64
>... Starting pixel transformation for incorrect pictures for resolution:  24 x 24
Finished 24 x 24 pixel transformation for incorrect pictures. Run-time in seconds:  1.66
>... Starting to create labels vector for array.
Finished to create labels vector for array. Label:  1 . Run-time in seconds:  0.0
>... Starting to create labels vector for array.
Finished to create labels vector for array. Label:  0 . Run-time in seconds:  0.0
>... Starting to combine arrays to dictionary.
Finished creating dictionary. Run-time in seconds:  0.0
Successfully stored in: 01_data/01_cleaned/pic_data_dummy_toy_24.pkl


Finished picture pipeline for resultion:  24 x 24 Run-time in seconds: 3.3
Successfully stored in: 03_output/03_run_time_scores/run_time_data_prep_dummy_toy_24.pkl