# Face-Mask-Classification Project


Authors:
+ Tobias Palmowski
+ Fabian Metz
+ Thilo Sander

Date of Midterm-Report: 29.03.2021 <br>
Date of final submission: 26.04.2021


### Introduction

This Jupyter Notebook is the core of the Face-Mask-Classification Project, carried out in the class "Machine Learning" at the Hertie School in Berlin. A second Jupyter Notebook combines the different datasets into one large dataset; that task is performed only once and was therefore moved to a separate file.


### Data Processing: Pipeline-Building

<br>
<br>
<br>
<br>
[Short Description]

In [29]:
# Import necessary libraries and set-up Jupyter Notebook.

# Common imports
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt

# Imports for dealing with images:
import PIL #Pillow (install with "pip install Pillow")

# to make this notebook's output stable across runs (safety measure)
np.random.seed(42)

# Set path to correct and incorrect data sets for keeping references short later
ROOT_DATA = "01_data/99_dummy_toy_data"
PATH_DATA_CORRECT = os.path.join(ROOT_DATA, "correct")
PATH_DATA_INCORRECT = os.path.join(ROOT_DATA, "incorrect")

# Where to save possible figures
PROJECT_ROOT_DIR = "02_figures"
CHAPTER_ID = "01_data_preparation"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [30]:
# Open pickle file that contains the dictionary
import pickle
with open(os.path.join(ROOT_DATA, "cleaned", "pic_data.pkl"), "rb") as f:
    pic_data = pickle.load(f)

In [31]:
# Exploring dictionary structure
pic_data

{'rgb_data': array([[ 95.,  50.,  59., ...,  32.,  31.,  35.],
        [ 79.,  48.,  25., ..., 213.,  54.,  62.],
        [212., 212., 212., ..., 214., 215., 217.],
        ...,
        [212., 204., 189., ..., 188., 160., 121.],
        [238., 242., 253., ..., 247., 251., 255.],
        [201., 200., 169., ...,  31.,  35.,  65.]]),
 'labels': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [32]:
# Copy dictionary data into separate NumPy arrays
rgb_data, labels = pic_data["rgb_data"], pic_data["labels"]

In [33]:
# Explore dimensions of the arrays
print("Dimensions of rgb_data:", rgb_data.shape)
print("Dimensions of labels:", labels.shape)

Dimensions of rgb_data: (200, 3072)
Dimensions of labels: (200,)
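
The second dimension, 3072, would be consistent with 32 × 32 pixel images with three colour channels (32 · 32 · 3 = 3072), although this notebook does not state the image size or the channel ordering. A minimal sketch of how a flattened row could be turned back into an image-shaped array, assuming that layout and using a random stand-in (`fake_rgb_data`, a hypothetical name) instead of the real `rgb_data`:

```python
import numpy as np

# Stand-in for rgb_data: 200 flattened images with values in 0..255.
# ASSUMPTION: each row is a 32x32 RGB image stored as (height, width, channel).
rng = np.random.default_rng(42)
fake_rgb_data = rng.integers(0, 256, size=(200, 3072)).astype(float)

IMG_SIDE, N_CHANNELS = 32, 3  # hypothetical values; 32 * 32 * 3 == 3072

# Reshape every flat row back into an image-shaped array.
images = fake_rgb_data.reshape(-1, IMG_SIDE, IMG_SIDE, N_CHANNELS)

# Pillow and matplotlib expect uint8 for pixel data in the 0..255 range.
first_image = images[0].astype(np.uint8)

print(images.shape)       # (200, 32, 32, 3)
print(first_image.dtype)  # uint8
```

If the assumption holds for the real data, `plt.imshow(first_image)` would be one way to check visually that the pickle was built as intended.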


In [34]:
# Split into test and training data set
from sklearn.model_selection import train_test_split

rgb_data_train, rgb_data_test, labels_train, labels_test = train_test_split(rgb_data, labels, test_size=0.10, random_state=42)

<div class="alert alert-block alert-danger">
<b>ATTENTION</b>
<p>
Adapt the test size to the size of the whole dataset.
</div>
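
One possible way to handle this is to keep the test share in a single constant and to stratify the split by label so that both classes stay equally represented in the test set. The sketch below is an illustration on synthetic data, not the split used in this notebook; `TEST_SIZE` is a hypothetical name, and the 100/100 class balance is an assumption about the toy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the toy data.
# ASSUMPTION: 100 samples per class.
rng = np.random.default_rng(42)
X = rng.random((200, 3072))
y = np.array([1] * 100 + [0] * 100)

TEST_SIZE = 0.10  # placeholder; adapt to the size of the full dataset

# stratify=y keeps the class proportions identical in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```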

### Baseline: SGD Classifier

<br>

This section defines a Stochastic Gradient Descent (SGD) classifier as a baseline for the project.
[Short Description]

In [35]:
# Redefine labels as booleans (True where the label is 1, False otherwise)
labels_train_tf = (labels_train == 1)
labels_test_tf = (labels_test == 1)


# Code for Baseline
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42) 
sgd_clf.fit(rgb_data_train, labels_train_tf)

SGDClassifier(random_state=42)

The parameters still need to be checked: what makes sense for a toy dataset?
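
One way to approach this question is a small grid search over a few `SGDClassifier` settings. The sketch below is only an illustration under stated assumptions: it runs on synthetic data from `make_classification`, the grid values are arbitrary starting points, and the `StandardScaler` step is an addition that the baseline above does not use (SGD is generally sensitive to feature scale).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (not the project's images).
X, y = make_classification(n_samples=200, n_features=50, random_state=42)

# Scaling is an added step, not part of the baseline in this notebook.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)),
])

# Hypothetical grid: regularisation strength and loss function.
param_grid = {
    "sgd__alpha": [1e-4, 1e-3, 1e-2],
    "sgd__loss": ["hinge", "modified_huber"],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```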

### Evaluation of Baseline

<br>
The normal evaluation does not work with the current toy dataset because it is too small.

<div class="alert alert-block alert-danger">
<b>ATTENTION</b>
<p>
The evaluation part is coded well but does not work with the small toy test dataset.
</div>

In [36]:
# Cross-validation score
from sklearn.model_selection import cross_val_score
cvs = cross_val_score(sgd_clf, rgb_data_train, labels_train_tf, cv=3, scoring="accuracy")

<div class="alert alert-block alert-danger">
<b>Place to work on</b>
<p>
CV needs to be specified for whole dataset.
</div>
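
For the whole dataset, the fold strategy could be stated explicitly instead of passing a bare integer. A minimal sketch on synthetic data, assuming stratified folds are wanted; `N_SPLITS` is a hypothetical placeholder and 5 is not a value chosen in this project:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training data (not the project's images).
X, y = make_classification(n_samples=180, n_features=50, random_state=42)

N_SPLITS = 5  # placeholder; to be chosen for the full dataset

# Shuffled, stratified folds with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=42)

clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

print(scores)
print(scores.mean(), scores.std())
```

The same `cv` object can be passed to `cross_val_predict` so that scores and predictions come from identical folds.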

In [37]:
# Create cross-validated predictions for the training set
from sklearn.model_selection import cross_val_predict
labels_train_pred = cross_val_predict(sgd_clf, rgb_data_train, labels_train_tf, cv = 3)


# Calculate confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(labels_train_tf, labels_train_pred)


<div class="alert alert-block alert-danger">
<b>Place to work on</b>
<p>
CV needs to be specified for whole dataset.
</div>

In [38]:
# precision and recall
from sklearn.metrics import precision_score, recall_score
ps = precision_score(labels_train_tf, labels_train_pred)
rs = recall_score(labels_train_tf, labels_train_pred)

### Preliminary Output for Working Process
<br>
This section shows the confusion matrix, the cross-validation scores, and the precision and recall scores for the current code and tuning.

In [39]:
# Output for us while working on it
print("Confusion Matrix")
pd.DataFrame(cm)

Confusion Matrix


|   | 0 | 1 |
|---|---|---|
| 0 | 79 | 11 |
| 1 | 6 | 84 |


In [40]:
print('Cross Validation Scores')
print(cvs)

Cross Validation Scores
[0.91666667 0.91666667 0.88333333]


In [41]:
print("Precision Score")
print(ps)

Precision Score
0.8842105263157894


In [42]:
print("Recall Score")
print(rs)

Recall Score
0.9333333333333333
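
As a consistency check, the precision and recall values can be recomputed by hand from the confusion matrix printed earlier. The sketch below relies on scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
# Entries of the confusion matrix shown above
# (rows: true class, columns: predicted class).
tn, fp = 79, 11  # true label False
fn, tp = 6, 84   # true label True

precision = tp / (tp + fp)                  # 84 / 95
recall = tp / (tp + fn)                     # 84 / 90
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 163 / 180

print(precision)
print(recall)
print(accuracy)
```

The first two values agree with the outputs of `precision_score` and `recall_score` above, and the accuracy equals the mean of the three cross-validation scores.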


<div class="alert alert-block alert-danger">
<b>Place to work on</b>
<p>
What output do we want to generate?
</div>