# Processing Data Download

This Jupyter Notebook transforms the downloaded data set (a collection of `.jpg` pictures) into input that the machine learning algorithm can interpret. As this has to be done only once, we keep it out of the main Jupyter Notebook, which implements the different methods and models on the training and testing data sets.

Four steps are performed in this Jupyter Notebook:
+ Load all images from the subdirectories
+ Translate each `.jpg` picture into pixels with an intensity value per pixel
+ Create vectors of the corresponding labels
+ Store the translated pixel data so that the main Jupyter Notebook can access it


### Loading data

<br>

In [11]:
# Import necessary libraries and set up Jupyter Notebook.

# Common imports
import numpy as np
import os
import matplotlib.pyplot as plt
import timeit # for measuring time/code performance
import pickle # for storing dictionary in the end
import PIL #Python Image Library

# to make this notebook's output stable across runs (safety measure)
np.random.seed(42)

In [12]:
# Switch between toy and full data
full_data_switch_on = False #Set True for full data set and False for Dummy Data set (see comment below)

<div class="alert alert-block alert-danger">
<b>Action required</b>
<p>
    
Set the switch above to choose between the full data set (True) and the dummy toy data set (False). We set aside 100 correct and 100 incorrect pictures as a dummy toy data set in order to test our code faster. Everything needed to run the algorithm with the dummy toy data is included in the GitHub repository (in the folder "01_data/99_dummy_toy_data").

However, if you want to run the algorithm with the full data set, you have to download the corresponding files from the Dropbox link below. The raw data is in the Dropbox folder "00_raw" (note: it is approx. 40.5 GB) and has to be downloaded into the repository folder "01_data/00_raw/".

We cannot use the link directly here because we have not yet figured out how to loop through subfolders and files on Dropbox online, and GitHub does not allow us to upload this amount of data.
<br>
Dropbox-Link: https://www.dropbox.com/sh/45vbkq1ihfnhqem/AAADdq6mJKaLsG1w7SDK-QV8a?dl=0    

<br>
<b>
!!!  Be aware: Running this Jupyter Notebook with the full data set probably requires more than 8 hours of runtime, depending on your hardware !!!

</b>    
</div>
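Because the full data set has to be downloaded manually, it can help to confirm that the chosen data folder exists and contains files before starting a long run. The following is a minimal sketch; the helper name `count_files` is hypothetical and not part of the original notebook.

```python
import os


def count_files(directory):
    """Return the number of files below `directory`, including subfolders.

    Hypothetical helper for illustration; raises FileNotFoundError if the
    folder is missing (e.g. the raw data has not been downloaded yet).
    """
    if not os.path.isdir(directory):
        raise FileNotFoundError(
            f"Data folder not found: {directory!r}. "
            "Download the raw data or switch to the dummy toy data set."
        )
    # os.walk yields (root, dirs, files) for every folder below `directory`
    return sum(len(files) for _, _, files in os.walk(directory))
```

Calling `count_files("01_data/99_dummy_toy_data")` from the repository root should then report the number of files found in the dummy toy data folder.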

In [13]:
# Setting path variables depending on switch
if full_data_switch_on:
    # Set path to full data set of correct and incorrect files
    ROOT_DATA = "01_data/00_raw/Masked-Face-Net-Dataset"
    PATH_DATA_CORRECT = os.path.join(ROOT_DATA, "CMFD")
    PATH_DATA_INCORRECT = os.path.join(ROOT_DATA, "IMFD")
else:
    # Set path to dummy toy data set of correct and incorrect files
    ROOT_DATA = "01_data/99_dummy_toy_data"
    PATH_DATA_CORRECT = os.path.join(ROOT_DATA, "correct")
    PATH_DATA_INCORRECT = os.path.join(ROOT_DATA, "incorrect")

In [14]:
# Defining list of all pictures to include
def list_files(directory):
    """Return the paths of all files below `directory`, including subfolders."""
    file_paths = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            file_paths.append(os.path.join(root, name))
    return file_paths

filenames_correct = list_files(PATH_DATA_CORRECT)
filenames_incorrect = list_files(PATH_DATA_INCORRECT)
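`list_files` returns every file it finds, so hidden or unrelated files in the data folders (for example `.DS_Store`) would also be passed to the image loader. A possible variant that keeps only image files is sketched below; the function name `list_image_files` and the default extensions are assumptions for illustration.

```python
import os


def list_image_files(directory, extensions=(".jpg", ".jpeg")):
    """Recursively list image files below `directory` in sorted order.

    Hypothetical variant of list_files; `extensions` must be a tuple of
    lower-case file endings.
    """
    matches = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            # Compare case-insensitively so "IMG.JPG" is also picked up
            if name.lower().endswith(extensions):
                matches.append(os.path.join(root, name))
    # Sorting makes the order of observations reproducible across runs
    return sorted(matches)
```

Sorting is a design choice here: `os.walk` does not guarantee a particular file order, so a sorted list keeps the rows of the resulting data array in the same order on every machine.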

<br>
<br>
<br>
<br>


### Transform .jpg files into numerical arrays

<br>

In [15]:
#Start timer
start_time = timeit.default_timer()

# Importing necessary libraries/functions
from PIL import Image as PIL_Image

# Set pixel size (a larger number of pixels increases the data size massively; original size: 1024x1024)
target_pixel = 32
array_length = target_pixel*target_pixel*3 #Times three as we have three values (RGB) per pixel

# Initialize and load correct pics
loaded_pics_correct = np.empty([0,array_length])

for filename in filenames_correct:
    # open picture
    pic = PIL_Image.open(filename)
    # Reduce size from original format to target format
    pic_resized = pic.resize((target_pixel, target_pixel))
    # Extract RGB data
    pic_data = np.array(pic_resized)
    # Flatten the 3D array of shape (target_pixel, target_pixel, 3) into a 1D help array
    help_array = np.reshape(pic_data,(pic_data.size,))
    # Stack the arrays on top of each other to get one larger array of shape (#obs,#pixels*3)
    loaded_pics_correct = np.vstack((loaded_pics_correct, help_array))

# Initialize and load incorrect pics
loaded_pics_incorrect = np.empty([0,array_length])

for filename in filenames_incorrect:
    # open picture
    pic = PIL_Image.open(filename)
    # Reduce size from original format to target format
    pic_resized = pic.resize((target_pixel, target_pixel))
    # Extract RGB data
    pic_data = np.array(pic_resized)
    # Flatten the 3D array of shape (target_pixel, target_pixel, 3) into a 1D help array
    help_array = np.reshape(pic_data,(pic_data.size,))
    # Stack the arrays on top of each other to get one larger array of shape (#obs,#pixels*3)
    loaded_pics_incorrect = np.vstack((loaded_pics_incorrect, help_array))

#End timer
elapsed = timeit.default_timer() - start_time
print("Run-time of this cell in seconds: ", round(elapsed,2))

Run-time of this cell in seconds:  3.16
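Growing an array with `np.vstack` inside a loop copies all previously loaded rows on every iteration, which may contribute to the long runtime on the full data set. The sketch below collects the rows in a Python list and stacks them once at the end. The function name `load_flat_rgb` is hypothetical, and two choices differ from the cell above and are assumptions: `convert("RGB")` forces three channels, and the result is stored as `uint8` instead of the `float64` array that the `np.vstack` approach produces.

```python
import numpy as np
from PIL import Image


def load_flat_rgb(filenames, target_pixel=32):
    """Load images and return an array of shape (n_images, target_pixel*target_pixel*3).

    Hypothetical helper for illustration; not part of the original notebook.
    """
    array_length = target_pixel * target_pixel * 3
    rows = []
    for filename in filenames:
        with Image.open(filename) as pic:
            # convert("RGB") guarantees three channels, even for grayscale files
            pic_resized = pic.convert("RGB").resize((target_pixel, target_pixel))
        # Flatten (target_pixel, target_pixel, 3) into a 1D row
        rows.append(np.asarray(pic_resized, dtype=np.uint8).reshape(-1))
    if not rows:
        # Keep the two-dimensional shape even when no files were found
        return np.empty((0, array_length), dtype=np.uint8)
    # Stack once at the end instead of once per picture
    return np.stack(rows)
```

With this helper, `load_flat_rgb(filenames_correct, target_pixel)` would play the role of `loaded_pics_correct`. If the main notebook expects floating-point input, the result can be cast with `.astype(np.float64)`.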


<div class="alert alert-block alert-warning">
<b>Place to work on</b>
<p>
    
+ Write a function that takes different target pixel sizes and creates a separate pickle output for each, in order to run different tests.
    
</div>

<br>
<br>
<br>
<br>

### Adding labels and combining into one data frame

<br>

In [16]:
# Correct pictures: one label 1 per picture, stored as integers
labels_correct = np.ones(len(loaded_pics_correct), dtype=np.uint8)

# Incorrect pictures: one label 0 per picture, stored as integers
labels_incorrect = np.zeros(len(loaded_pics_incorrect), dtype=np.uint8)

In [17]:
# Combining labels into one array (correct first, then incorrect)
labels = np.concatenate((labels_correct, labels_incorrect))

In [18]:
# Combine the two data sets of correctly and incorrectly worn masks
cleaned_data = np.vstack((loaded_pics_correct, loaded_pics_incorrect))

In [19]:
# Combine data and labels into one dictionary
pic_data = {}
pic_data["rgb_data"] = cleaned_data
pic_data["labels"] = labels
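Since the rows of `rgb_data` and the entries of `labels` are matched only by position, a quick consistency check before saving can catch a mismatch early. The following is a minimal sketch; the helper name `check_pic_data` is hypothetical.

```python
import numpy as np


def check_pic_data(pic_data):
    """Return the number of observations, or raise ValueError on a mismatch.

    Hypothetical helper for illustration; expects the keys "rgb_data"
    and "labels" as used in this notebook.
    """
    data = np.asarray(pic_data["rgb_data"])
    labels = np.asarray(pic_data["labels"])
    if data.ndim != 2:
        raise ValueError(f"rgb_data must be 2D, got shape {data.shape}")
    if labels.shape != (data.shape[0],):
        raise ValueError(
            f"expected {data.shape[0]} labels, got shape {labels.shape}"
        )
    # Labels follow the convention above: 1 = correct, 0 = incorrect
    if not np.isin(labels, (0, 1)).all():
        raise ValueError("labels must be 0 (incorrect) or 1 (correct)")
    return data.shape[0]
```

Running `check_pic_data(pic_data)` on the dictionary built above should return the total number of pictures.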

<br>
<br>
<br>
<br>

### Storing picture data and labels

<br>

In [20]:
# Save dictionary to make it accessible for the other Jupyter Notebook
if full_data_switch_on:
    output_file = "01_data/01_cleaned/pic_data_full.pkl"
else:
    output_file = "01_data/01_cleaned/pic_data_dummy_toy.pkl"

# The with-statement closes the file even if pickling fails
with open(output_file, "wb") as file:
    pickle.dump(pic_data, file)
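For reference, reading the stored dictionary back in the main Jupyter Notebook could look roughly like the sketch below. The function name `load_pic_data` is hypothetical; the keys `"rgb_data"` and `"labels"` are the ones used above.

```python
import pickle


def load_pic_data(path):
    """Load the pickled dictionary and return (rgb_data, labels).

    Hypothetical helper for illustration; not part of the original notebook.
    """
    with open(path, "rb") as file:
        pic_data = pickle.load(file)
    return pic_data["rgb_data"], pic_data["labels"]
```

For example, `X, y = load_pic_data("01_data/01_cleaned/pic_data_dummy_toy.pkl")` would return the pixel array and the label vector. Pickle files should only be loaded from trusted sources, as unpickling can execute arbitrary code.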

<div class="alert alert-block alert-warning">
<b>Place to work on</b>
<p>
In the end, much of this process will be rewritten as functions that run different pixel resolutions and save the results in different files automatically.
</div>
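One way the planned rewrite could be organized is sketched below. It is only a sketch under stated assumptions: the names `output_path` and `save_all_resolutions`, the callback `build_pic_data`, and the file name pattern with a `_<pixels>px` suffix are all hypothetical and not part of the original notebook.

```python
import os
import pickle


def output_path(out_dir, target_pixel, full_data):
    """Build a file name that encodes data set and resolution (assumed pattern)."""
    suffix = "full" if full_data else "dummy_toy"
    return os.path.join(out_dir, f"pic_data_{suffix}_{target_pixel}px.pkl")


def save_all_resolutions(build_pic_data, resolutions, out_dir, full_data=False):
    """Pickle one dictionary per resolution and return the written paths.

    `build_pic_data` is a callable that takes a target pixel size and
    returns a dictionary with the keys "rgb_data" and "labels".
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for target_pixel in resolutions:
        pic_data = build_pic_data(target_pixel)
        path = output_path(out_dir, target_pixel, full_data)
        with open(path, "wb") as file:
            pickle.dump(pic_data, file)
        written.append(path)
    return written
```

Passing the loading and labeling steps in as a single callable keeps the saving logic independent of how the images are read, so the same loop works for both the dummy toy data and the full data set.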