# Data Pre-Processing

Training a machine learning model first requires a large dataset to learn from. The data can be anything that helps the model learn the task. In this case, the data consists of images of people facing the camera head-on, either wearing or not wearing a face mask.

# Data preparation

Raw data is collected first: images of people wearing or not wearing face masks. This is not enough by itself, as the data must then be divided into two groups, 'with_mask' and 'without_mask', each stored in its own folder inside the dataset directory.

# Categorisation and labeling

Next, each category is assigned a numeric label. The category names are taken from the folder names inside the dataset directory.

In [1]:
import cv2,os

data_path='dataset'
categories=os.listdir(data_path)
labels=[i for i in range(len(categories))]

label_dict=dict(zip(categories,labels)) #maps each category name to its numeric label

print(label_dict)
print(categories)
print(labels)

{'without_mask': 0, 'with_mask': 1}
['without_mask', 'with_mask']
[0, 1]
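
One caveat with the cell above: `os.listdir` returns entries in arbitrary order, so the label assigned to each category may differ between machines or runs. A minimal sketch of one way to make the mapping reproducible is shown below. The helper name `build_label_dict` is hypothetical and not part of the original notebook, and sorting changes the numbering compared with the output shown above (here 'with_mask' becomes 0).

```python
import os
import tempfile

def build_label_dict(data_path):
    """Map each category folder under data_path to an integer label.

    Folder names are sorted so the numbering is the same on every run.
    """
    categories = sorted(
        name for name in os.listdir(data_path)
        if os.path.isdir(os.path.join(data_path, name))
    )
    return {category: label for label, category in enumerate(categories)}

# Demonstration on a throwaway directory (not the real dataset)
with tempfile.TemporaryDirectory() as demo_path:
    for name in ('without_mask', 'with_mask'):
        os.mkdir(os.path.join(demo_path, name))
    # A stray file that should not become a category
    open(os.path.join(demo_path, 'notes.txt'), 'w').close()
    demo_labels = build_label_dict(demo_path)

print(demo_labels)  # {'with_mask': 0, 'without_mask': 1}
```

Whichever numbering is used, the same mapping has to be applied when the trained model's predictions are interpreted.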


# Resizing and reshaping the data
When feeding any sort of data into an algorithm, it is important to normalise it first. Because the dataset contains many images of different sizes, they must all be resized to one fixed, common shape. For this project, each image is converted to greyscale and resized to 100 pixels by 100 pixels (the value of `img_size` below). Each processed image is added to the data array, and its corresponding label is added to the target array at the same position.

In [2]:
img_size=100
data=[]
target=[]


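for category in categories:
    folder_path=os.path.join(data_path,category)
    img_names=os.listdir(folder_path)

    for img_name in img_names:
        img_path=os.path.join(folder_path,img_name)
        img=cv2.imread(img_path) #returns None if the file cannot be read as an image

        try:
            #convert to greyscale, then resize to img_size x img_size
            gray=cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
            resized=cv2.resize(gray,(img_size,img_size))
            #image and label are appended together so both lists stay aligned
            data.append(resized)
            target.append(label_dict[category])

        except Exception as e:
            #unreadable files (img is None) raise here and are skipped
            print('Exception:',e)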
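
Because unreadable files are skipped, it can be worth checking how many images of each category actually ended up in the target list. The sketch below uses placeholder values (`example_label_dict` and `example_target`) rather than the real dataset, and the helper `count_per_category` is a hypothetical name introduced only for illustration.

```python
from collections import Counter

# Placeholder values standing in for label_dict and target from the cells above
example_label_dict = {'without_mask': 0, 'with_mask': 1}
example_target = [0, 0, 1, 1, 1]

def count_per_category(target, label_dict):
    """Return how many samples carry each category label."""
    names = {label: name for name, label in label_dict.items()}
    counts = Counter(target)
    return {names[label]: counts.get(label, 0) for label in sorted(names)}

example_counts = count_per_category(example_target, example_label_dict)
print(example_counts)  # {'without_mask': 2, 'with_mask': 3}
```

A large imbalance between the two counts, or a total well below the number of files on disk, would suggest that the dataset needs attention before training.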

# Serialising the resulting pre processed data
Now that the dataset has been preprocessed and sorted into arrays, it must be serialised so it can be used in the training process. Before saving, the pixel values are scaled to the range 0 to 1, the data array is reshaped to add a single channel dimension, and the labels are one-hot encoded. NumPy is used for the serialisation itself, as it can save arrays to disk and load them back later.

In [3]:
import numpy as np

data=np.array(data)/255.0
data=np.reshape(data,(data.shape[0],img_size,img_size,1))
target=np.array(target)

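#np_utils is missing from newer Keras releases, so to_categorical is imported directly
from keras.utils import to_categorical

new_target=to_categorical(target) #one-hot encode the labels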

np.save('data',data)
np.save('target',new_target)
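
To illustrate the round trip, here is a minimal sketch that saves and reloads arrays shaped like the ones above. It runs on randomly generated placeholder images, not the real dataset, writes to a temporary directory, and uses a hypothetical NumPy-only helper `one_hot` as a stand-in for Keras' `to_categorical` so that it runs without Keras installed. Note that `np.save` appends the `.npy` extension when the file name does not already have it.

```python
import os
import tempfile
import numpy as np

def one_hot(labels, num_classes):
    """NumPy-only stand-in for one-hot encoding integer labels 0..num_classes-1."""
    return np.eye(num_classes, dtype=np.float32)[np.asarray(labels)]

# Placeholder data: 4 fake 100x100 greyscale images with made-up labels
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 100, 100), dtype=np.uint8)
labels = [0, 1, 1, 0]

demo_data = images / 255.0                                    # scale pixels to [0, 1]
demo_data = demo_data.reshape(demo_data.shape[0], 100, 100, 1)  # add channel dimension
demo_target = one_hot(labels, 2)

with tempfile.TemporaryDirectory() as tmp:
    np.save(os.path.join(tmp, 'data'), demo_data)      # written as data.npy
    np.save(os.path.join(tmp, 'target'), demo_target)  # written as target.npy
    loaded_data = np.load(os.path.join(tmp, 'data.npy'))
    loaded_target = np.load(os.path.join(tmp, 'target.npy'))

print(loaded_data.shape)    # (4, 100, 100, 1)
print(loaded_target.shape)  # (4, 2)
```

In the training step, the saved files would presumably be read back the same way, with `np.load('data.npy')` and `np.load('target.npy')`.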

# Finishing up
Now that the data has been preprocessed and serialised, it is ready to be used in the training process. These steps must be repeated whenever images are added to or removed from the dataset, so it is common to run them several times over the course of the project.