# Processing Data Download

This Jupyter Notebook transforms the downloaded data set (a collection of `.jpg` pictures) into input that the Machine Learning algorithm can interpret. Because this only has to be done once, it is kept out of the main Jupyter Notebook, which implements the different methods and models on the training and testing data sets.

Four steps are performed in this Jupyter Notebook:
+ Load all images from the subdirectories
+ Turn colored pictures into grayscale pictures
+ Translate each `.jpg` picture into pixels with an intensity per pixel
+ Store the translated pixel data so that the main Jupyter Notebook can access it
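
Steps two and three can be sketched with Pillow and NumPy. The snippet below is a minimal illustration under stated assumptions, not this notebook's loading code: it builds a synthetic RGB image in memory so that it runs without the data set, converts it to grayscale with `Image.convert("L")`, resizes it, and reads the pixel intensities into an array. The helper name `to_gray_pixels` is hypothetical.

```python
import numpy as np
from PIL import Image


def to_gray_pixels(pic, target_pixel=64):
    """Return a flat uint8 array with one intensity (0-255) per pixel."""
    gray = pic.convert("L")  # "L" is Pillow's 8-bit grayscale mode
    small = gray.resize((target_pixel, target_pixel))
    return np.asarray(small, dtype=np.uint8).ravel()


# Synthetic stand-in for one of the .jpg pictures (assumption for illustration)
demo_pic = Image.new("RGB", (1024, 1024), color=(200, 30, 30))
pixels = to_gray_pixels(demo_pic)
print(pixels.shape, pixels.dtype)
```

A grayscale picture needs one value per pixel instead of three, so each row of features would be a third as long as with RGB data.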


### Loading data

The data is located in the folder `/01_data/00_raw/Masked-Face-Net-Dataset`. The folder CMFD (Correctly Masked Face Dataset) contains all pictures of correctly worn face masks, whereas the folder IMFD (Incorrectly Masked Face Dataset) contains all pictures in which people wear the face masks incorrectly. Each of the two folders contains several subdirectories labeled with ascending numbers. The cells below currently read from the toy data folder `01_data/99_dummy_toy_data` instead.
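
Because the pictures sit in numbered subdirectories, a flat `listdir` call on CMFD or IMFD would only return the subdirectory names. One possible way to collect all pictures is to walk the tree with `os.walk`. The sketch below is an illustration only: the helper name `list_jpg_files`, the subdirectory names and the file names are hypothetical, and the demo runs against a temporary toy directory rather than the real data set.

```python
import os
import tempfile


def list_jpg_files(root_dir):
    """Return the sorted paths of all .jpg files below root_dir."""
    jpg_paths = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for filename in filenames:
            if filename.lower().endswith(".jpg"):
                jpg_paths.append(os.path.join(dirpath, filename))
    return sorted(jpg_paths)


# Toy directory tree with made-up subdirectory and file names (assumption)
with tempfile.TemporaryDirectory() as tmp:
    for sub in ("00000", "01000"):
        os.makedirs(os.path.join(tmp, "CMFD", sub))
        open(os.path.join(tmp, "CMFD", sub, "face_" + sub + ".jpg"), "wb").close()
    # A non-image file that should be skipped
    open(os.path.join(tmp, "CMFD", "readme.txt"), "w").close()

    found = list_jpg_files(os.path.join(tmp, "CMFD"))
    found_names = [os.path.basename(path) for path in found]
    print(found_names)
```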

In [209]:
# Import necessary libraries and set up the Jupyter Notebook.

# Common imports
import numpy as np
import os
import matplotlib.pyplot as plt

# Imports for dealing with images:
import PIL #Pillow (install with "pip install Pillow")

# to make this notebook's output stable across runs (safety measure)
np.random.seed(42)

# Set paths to the correct and incorrect data sets to keep later references short
ROOT_DATA = "01_data/99_dummy_toy_data"
PATH_DATA_CORRECT = os.path.join(ROOT_DATA, "correct")
PATH_DATA_INCORRECT = os.path.join(ROOT_DATA, "incorrect")

# Where to save possible figures
PROJECT_ROOT_DIR = "02_figures"
CHAPTER_ID = "01_data_preparation"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [210]:
# Importing necessary libraries/functions
from os import listdir
from PIL import Image as PIL_Image

# Set target edge length in pixels (a large number of pixels increases the data size massively; the original pictures are 1024x1024)
target_pixel = 64

# Initialize and load correct pics
loaded_pics_correct = np.array([])

i = 0  # running index: the first picture is added with hstack, all later ones with vstack
for filename in listdir(PATH_DATA_CORRECT):
    # Open picture
    pic = PIL_Image.open(os.path.join(PATH_DATA_CORRECT, filename))
    # Reduce size from original format to target format
    pic_resized = pic.resize((target_pixel, target_pixel))
    # Extract RGB data
    pic_data = np.array(pic_resized)
    # Use a helper array to flatten the 3D array (e.g. 64, 64, 3) into one long 1D array
    help_array = np.array([[]])
    help_array = np.append(help_array, pic_data)
    # Stack the helper array onto the loaded pics so that each row holds the features (RGB values per pixel) of one picture
    if i == 0:
        loaded_pics_correct = np.hstack((loaded_pics_correct, help_array))
    else:
        loaded_pics_correct = np.vstack((loaded_pics_correct, help_array))
    i += 1

<div class="alert alert-block alert-danger">
<b>Place to work on</b>
<p>
Write a function that takes different target pixel sizes and creates a separate pickle output for each, so that different tests can be performed.
</div>
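
One possible shape for such a function is sketched below. It is not the notebook's implementation: the names `load_flat_pics` and `export_pickles` and the output file pattern `pic_data_<size>.pkl` are hypothetical, and unlike the cells in this notebook it keeps the pixel values as `uint8` and collects rows in a list. The demo writes synthetic pictures into a temporary folder so that it runs without the data set. An empty folder would make `np.stack` raise an error, which the sketch does not handle.

```python
import os
import pickle
import tempfile

import numpy as np
from PIL import Image


def load_flat_pics(folder, target_pixel):
    """Load every .jpg in folder, resize it, and return one flat row per picture."""
    rows = []
    for filename in sorted(os.listdir(folder)):
        if not filename.lower().endswith(".jpg"):
            continue
        with Image.open(os.path.join(folder, filename)) as pic:
            resized = pic.convert("RGB").resize((target_pixel, target_pixel))
            rows.append(np.asarray(resized, dtype=np.uint8).ravel())
    return np.stack(rows)


def export_pickles(path_correct, path_incorrect, out_dir, target_pixels=(16, 32, 64)):
    """Write one pickle per target size and return the paths that were written."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for target_pixel in target_pixels:
        correct = load_flat_pics(path_correct, target_pixel)
        incorrect = load_flat_pics(path_incorrect, target_pixel)
        data = {
            "rgb_data": np.vstack((correct, incorrect)),
            "labels": np.concatenate((
                np.ones(len(correct), dtype=np.uint8),     # 1 = correctly worn
                np.zeros(len(incorrect), dtype=np.uint8),  # 0 = incorrectly worn
            )),
        }
        out_path = os.path.join(out_dir, f"pic_data_{target_pixel}.pkl")
        with open(out_path, "wb") as file:
            pickle.dump(data, file)
        written.append(out_path)
    return written


# Demo on synthetic pictures in a temporary folder (assumption for illustration)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("correct", "incorrect"):
        os.makedirs(os.path.join(tmp, name))
        for k in range(2):
            Image.new("RGB", (128, 128), color=(40 * k, 90, 160)).save(
                os.path.join(tmp, name, f"{k}.jpg"))
    paths = export_pickles(os.path.join(tmp, "correct"),
                           os.path.join(tmp, "incorrect"),
                           os.path.join(tmp, "cleaned"))
    with open(paths[-1], "rb") as file:
        demo = pickle.load(file)
    demo_shape = demo["rgb_data"].shape
    demo_labels = demo["labels"].tolist()
    print(len(paths), demo_shape, demo_labels)
```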

In [211]:
# Initialize and load incorrect pics
loaded_pics_incorrect = np.array([])

i = 0  # running index: the first picture is added with hstack, all later ones with vstack
for filename in listdir(PATH_DATA_INCORRECT):
    # Open picture
    pic = PIL_Image.open(os.path.join(PATH_DATA_INCORRECT, filename))
    # Reduce size from original format to target format
    pic_resized = pic.resize((target_pixel, target_pixel))
    # Extract RGB data
    pic_data = np.array(pic_resized)
    # Use a helper array to flatten the 3D array (e.g. 64, 64, 3) into one long 1D array
    help_array = np.array([[]])
    help_array = np.append(help_array, pic_data)
    # Stack the helper array onto the loaded pics so that each row holds the features (RGB values per pixel) of one picture
    if i == 0:
        loaded_pics_incorrect = np.hstack((loaded_pics_incorrect, help_array))
    else:
        loaded_pics_incorrect = np.vstack((loaded_pics_incorrect, help_array))
    i += 1

<div class="alert alert-block alert-danger">
<b>Place to work on</b>
<p>
Have a look at the `vstack` and `hstack` functions. It is not ideal to have an if clause nested inside a for loop. Check whether `append` or `concatenate` can solve the problem more easily. But it is working for now.
</div>
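
One way to avoid the `hstack`/`vstack` distinction is to collect the flattened pictures in a plain Python list and build the 2D array once at the end with `np.stack`. The sketch below uses random arrays as stand-ins for the resized pictures (an assumption, so that it runs without image files); the variable names `fake_pics`, `rows` and `loaded_pics` are illustrative.

```python
import numpy as np

# Three fake 64x64 RGB pictures standing in for the resized images (assumption)
rng = np.random.default_rng(42)
fake_pics = [rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8) for _ in range(3)]

# Collect one flattened row per picture, then build the 2D array in a single call
rows = [pic_data.reshape(-1) for pic_data in fake_pics]
loaded_pics = np.stack(rows)

print(loaded_pics.shape, loaded_pics.dtype)
```

Besides removing the if clause, this copies the data only once instead of on every iteration, and it keeps the `uint8` values as they are. Appending to an empty array created with `np.array([[]])` converts the values to `float64`.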

######## Clean below

In [212]:
loaded_pics_correct.shape

(3, 12288)

In [213]:
# pic_resized.show()

In [214]:
help_array.shape

(12288,)

In [215]:
pic_data.shape

(64, 64, 3)

In [216]:
loaded_pics_correct[:1].shape

(1, 12288)

In [217]:
len(loaded_pics_correct)

3

In [218]:
len(loaded_pics_incorrect)

3

######## Clean above

### Adding labels and combining into one data set

In [219]:
# Correct pictures get label 1 (one label per loaded picture)
labels_correct = np.ones(len(loaded_pics_correct), dtype=np.uint8)

# Incorrect pictures get label 0
labels_incorrect = np.zeros(len(loaded_pics_incorrect), dtype=np.uint8)

In [220]:
type(labels_correct)

numpy.ndarray

In [221]:
# Combine labels into one array (correct pictures first, then incorrect ones)
labels = np.append(labels_correct, labels_incorrect)

In [222]:
labels

array([1, 1, 1, 0, 0, 0], dtype=uint8)

In [223]:
# Stack correct and incorrect pictures into one array (one row per picture)
cleaned_data = np.vstack((loaded_pics_correct, loaded_pics_incorrect))

In [224]:
cleaned_data.shape

(6, 12288)

In [225]:
pic_data = {}
pic_data["rgb_data"] = cleaned_data
pic_data["labels"] = labels

### Storing Pic data and labels

In [226]:
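import pickle

# Make sure the target folder exists, then write the dictionary to disk
os.makedirs(os.path.join(ROOT_DATA, "cleaned"), exist_ok=True)
with open(os.path.join(ROOT_DATA, "cleaned", "pic_data.pkl"), "wb") as file:
    pickle.dump(pic_data, file)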

<div class="alert alert-block alert-danger">
<b>Place to work on</b>
<p>
Find a more efficient way of storing the data. Currently, six pictures with a combined size of 1.8 MB are transformed into csv files of 151 MB when 1024x1024 pixels are used. That is a factor of ~84 in disk space (--> 40.9 GB would grow to 3,431 GB of data).

Proposed pixel sizes to run at first: 16x16, 32x32 and 64x64. These reduce the 1.8 MB of the six pictures to between 37 kB and 590 kB (a reduction by a factor of 48 to 3 --> the overall data is reduced to 0.85 GB or 13.6 GB).
</div>
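
The sizes quoted above are consistent with 8 bytes per stored value (for example 6 x 64 x 64 x 3 x 8 bytes = 589,824 bytes), which is what `float64` arrays need. As a hedged sketch of one possible improvement, the snippet below compares the memory footprint of `float64` and `uint8` versions of the same data and writes the `uint8` version with `np.savez_compressed`. The random array stands in for six flattened 64x64 RGB pictures (an assumption), and the `.npz` container is only one option, not a format this notebook uses.

```python
import io

import numpy as np

# Six fake 64x64 RGB pictures, flattened to one row each (assumption for illustration)
rng = np.random.default_rng(42)
as_uint8 = rng.integers(0, 256, size=(6, 64 * 64 * 3), dtype=np.uint8)
as_float64 = as_uint8.astype(np.float64)  # dtype produced by appending to an empty array

print(as_float64.nbytes)  # 589824 bytes, 8 bytes per value
print(as_uint8.nbytes)    # 73728 bytes, 1 byte per value

# Write a compressed .npz container to an in-memory buffer and read it back
labels = np.array([1, 1, 1, 0, 0, 0], dtype=np.uint8)
buffer = io.BytesIO()
np.savez_compressed(buffer, rgb_data=as_uint8, labels=labels)
buffer.seek(0)
restored = np.load(buffer)["rgb_data"]
```

Keeping the values as `uint8` loses no information, because RGB intensities are integers between 0 and 255. How much the compression saves on top of that depends on the pictures.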