## Binary Image Classification: Pizza vs. Not Pizza

Authors: Chloe Veth and Sharon Dunbar

Our goal is to build a binary image classifier. By applying several machine learning classification techniques, we hope to distinguish images of pizza from images of other foods.
#### Our dataset
We found our data on Kaggle: https://www.kaggle.com/datasets/carlosrunner/pizza-not-pizza. The dataset contains 1966 images, split evenly between images of pizza and images of other food. Each image is 512 pixels on its longest side.

To download the dataset for yourself, please visit our GitHub repository: https://github.com/chloeveth/ML-Project.


#### Libraries that we are using

In [None]:
# for image manipulation
import cv2
from scipy import ndimage

# for displaying graphs and images
import matplotlib.pyplot as plt

# for machine learning
import sklearn

# general purpose 
import os
import numpy as np

#### Transforming data
These steps take the raw images, transform them, and write the cleaned, ready-to-use images to a new dataset folder. In the dataset, the images are already resized so that the longest side is 512 pixels, but we want every image to be exactly the same size. The transformation steps are: reorienting portrait images to landscape, forcing every image to a 3:4 (height to width) aspect ratio, and reducing the resolution to 192 by 256 pixels, about half the original.

In [None]:
paths = ['/not_pizza/', '/pizza/']

for path in paths:
    os.makedirs('clean_data' + path, exist_ok=True)
    
    for filename in os.listdir("pizza_not_pizza" + path):
        # load image
        img = cv2.imread("pizza_not_pizza" + path + filename)
    
        # if the image is portrait (taller than it is wide), transpose it so it's landscape
        if img.shape[0] > img.shape[1]:
            rot_img = np.transpose(img, (1, 0, 2))  
        else:
            rot_img = img
        
        # Semi random decision, change if necessary
        # reshape all images to be 3:4 and about half of original dimensions, that is 192 by 256 
        # (since all original images have one dim that is 512)
        new_img = cv2.resize(rot_img,(256, 192)) # width by height so axes are swapped when passed in

        # write image to new directory, preserving dir structure & filenames
        new_path = 'clean_data' + path + filename
        cv2.imwrite(new_path, new_img)
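
Note that `np.transpose(img, (1, 0, 2))` swaps the height and width axes, which mirrors the image across its main diagonal rather than rotating it. For classification this is probably harmless, but the difference is easy to see on a tiny array. The sketch below uses a made-up 3 by 2 "image"; `np.rot90` is shown only for comparison and is not what the notebook uses.

```python
import numpy as np

# Hypothetical 3x2 "portrait" image with 3 color channels (values are arbitrary)
img = np.arange(3 * 2 * 3).reshape(3, 2, 3)

transposed = np.transpose(img, (1, 0, 2))  # what the cell above does
rotated = np.rot90(img)                    # a true 90-degree counterclockwise rotation

# Both results are landscape (2 rows, 3 columns)
print(img.shape, transposed.shape, rotated.shape)

# The pixels land in different places: the transpose equals the rotation flipped vertically
print(np.array_equal(transposed, rotated))
print(np.array_equal(transposed, rotated[::-1]))
```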
    

#### Loading the data
Once the data is transformed and stored to a new folder, we load it into the X array and create targets y.

In [None]:
# load in data
X = []
paths = ['/not_pizza/', '/pizza/']
for path in paths:
    for filename in os.listdir("clean_data" + path):
        # load image
        img_array = cv2.imread("clean_data" + path + filename)
        X.append(img_array.flatten()) # flatten to 1D array
    print(f"files from {path} loaded")
    
X = np.array(X)
print(X.shape)

In [None]:
# create array of labels, with not pizza as class 0 and pizza as class 1
num_not_pizza = len(os.listdir("clean_data/not_pizza"))
num_pizza = len(os.listdir("clean_data/pizza"))

y = np.concatenate((np.zeros(num_not_pizza), np.ones(num_pizza)))

# make sure all data is loaded
assert len(y) == len(X)
assert len(X) == 1966

#### A Sample of the dataset
Here is a selection of images from both classes in the dataset.

In [None]:
import matplotlib.pyplot as plt

not_pizza_imgs = X[0:15]
pizza_imgs = X[983:998]
together = np.concatenate((not_pizza_imgs, pizza_imgs))
targets_together = np.concatenate((y[0:15], y[983:998]))

fig, axes = plt.subplots(6, 5, figsize=(18, 24), subplot_kw={'xticks': (), 'yticks': ()})

plt.subplots_adjust(hspace=0)
for target, image, ax in zip(targets_together, together, axes.ravel()):
    img = np.reshape(image, (192, 256, 3))
    # cv2.imread returns 3-channel BGR images; matplotlib expects RGB
    ax.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    ax.set_title("pizza" if target == 1 else "not pizza")

#### Implementing K Nearest Neighbors
For starters, we tried the K Nearest Neighbors approach because it was one of the simplest and allowed us to make sure we had loaded the dataset correctly. In order to get an accurate representation of how well it does, we ran it 10 times on different splits of training and test data.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

print("Run # \t Training Score \t Test Score")
test_total = 0.0
training_total = 0.0

for x in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    knn_model = KNeighborsClassifier(n_neighbors=7)
    knn_model.fit(X_train, y_train)
    training_score = knn_model.score(X_train, y_train)
    test_score = knn_model.score(X_test, y_test)
    training_total += training_score
    test_total += test_score
    
    print(f"{x} \t {training_score} \t {test_score}")
    
print("\nAverage Scores")
print(f"Training: {training_total / 10}")
print(f"Test: {test_total / 10}")

This model does okay, scoring an average of .72 on the training set and .63 on the test set. One likely reason it doesn't do better is that each datapoint has 147,456 dimensions (192 x 256 x 3), so the curse of dimensionality applies: in very high-dimensional spaces, distances between points become nearly uniform, and the nearest neighbors are not much closer than any other point. There's probably a way we can do this better.
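
The distance effect can be illustrated on synthetic data. The sketch below draws uniformly random points (not the pizza images) and compares the nearest and farthest distance from one query point as the number of dimensions grows. A ratio close to 1 means the "nearest" neighbor is barely closer than any other point. The point count, dimensions, and helper name are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_dims, n_points=200):
    """Ratio of the smallest to the largest Euclidean distance from a random query."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 100, 10_000)}
for d, ratio in ratios.items():
    print(f"{d:>6} dims: nearest/farthest = {ratio:.3f}")
```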

#### Trying Principal Component Analysis
Hopefully, by using PCA, we can find the most informative components and reduce the dimensionality of the data while still keeping enough information to create a good classifier.

In [None]:
from sklearn.decomposition import PCA

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

pca = PCA(n_components=200, whiten=True).fit(X_train) # keeps 200 most informative components

# data mapped onto pca space
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

In [None]:
# Trying KNN on this transformed data
knn_model_pca = KNeighborsClassifier(n_neighbors=5)
knn_model_pca.fit(X_train_pca, y_train)
knn_model_pca.score(X_test_pca, y_test)

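The transformed data does worse, scoring only slightly above 50%, which is close to chance for two balanced classes. Reducing the dimensionality with PCA loses some accuracy here, which may mean that the information KNN relies on is spread across many dimensions rather than concentrated in the first 200 components. Whitening may also play a role, since it rescales every component to unit variance and so gives low-variance components as much weight in the distance as the leading ones.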
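
One way to check how much of the variance a given number of components keeps is to look at the cumulative explained variance. The sketch below computes it with a plain SVD on random stand-in data (an assumption; the real `X_train` would be substituted), which gives the same ratios that scikit-learn exposes as `pca.explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: 300 samples, 50 features with decreasing scale per feature
X_demo = rng.normal(size=(300, 50)) * np.linspace(5.0, 0.1, 50)

# PCA via SVD of the mean-centered data
X_centered = X_demo - X_demo.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components that retains at least 95% of the variance
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"{n_needed} of {X_demo.shape[1]} components keep 95% of the variance")
```

If the curve for the real images is still rising steeply at 200 components, that would support the idea that useful information is being discarded.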

#### Implementing Support Vector Machines

Maybe the problem is that KNN is too simple and a different model might perform better. To explore this possibility, we tried support vector machines.

In [None]:
from sklearn.svm import SVC

svc = SVC()
svc.fit(X_train, y_train)

print("Accuracy on training set: {:.2f}".format(svc.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test, y_test)))

While the accuracy of this model is higher on both the training and the test set compared to KNN (0.91 vs 0.71 on training and 0.74 vs 0.63 on test), it is also much slower to train. Maybe principal component analysis can help with speed without reducing accuracy.

#### Trying Support Vector Machines with Principal Component Analysis

In [None]:
from sklearn.svm import SVC

# the whitened PCA components all have unit variance, so every feature has the same scale
# SVM is also much faster on the PCA features than on the raw pixels
svc = SVC()
svc.fit(X_train_pca, y_train)

print("Accuracy on training set: {:.2f}".format(svc.score(X_train_pca, y_train)))
print("Accuracy on test set: {:.2f}".format(svc.score(X_test_pca, y_test)))

We can see here that running SVM with PCA gives a pretty similar result to above without PCA, but it is much faster. There is overfitting here; we can see that the accuracy on the training set is much better than the accuracy on the test set. 

#### SVM with PCA over 3 trials
Next, to see if our results were consistent or just a result of the split between training and test sets, we performed SVM with PCA over 3 trials. 

In [None]:
max_iter = 3
total_train = 0
total_test = 0

for _ in range(max_iter):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    
    # keeps 200 most informative components (whiten is left at its default, False, here)
    pca = PCA(n_components=200).fit(X_train)
    # data mapped onto pca space
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)

    svc = SVC()
    svc.fit(X_train_pca, y_train)

    total_train += svc.score(X_train_pca, y_train)
    total_test += svc.score(X_test_pca, y_test)
    
print("Accuracy on training set: {:.2f}".format(total_train/max_iter))
print("Accuracy on test set: {:.2f}".format(total_test/max_iter))

Running the test over three trials gives about the same results, consistent with the results we got previously.
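
A more compact way to repeat this kind of experiment is to wrap PCA and the SVM in a scikit-learn pipeline, so that PCA is refit on the training portion of every split, and to report the spread as well as the mean. The sketch below uses synthetic data from `make_classification` as a stand-in for the image arrays; the number of components and the seeds are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for (X, y); replace with the real arrays in the notebook
X_demo, y_demo = make_classification(n_samples=300, n_features=100,
                                     n_informative=10, random_state=0)

# PCA is fit inside each training split, then the SVM is fit on the projected data
pipeline = make_pipeline(PCA(n_components=20), SVC())

# Three random 80/20 splits, matching the three trials above
splitter = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=splitter)

print(f"Test accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```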

#### Implementing Logistic Regression

Next, we'll see if Logistic Regression does any better than SVM and KNN. 

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(solver="liblinear").fit(X_train, y_train)

print("Training set score: {:.3f}".format(logreg.score(X_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(X_test, y_test)))

We have a similar issue to SVM where there is overfitting, and here the overfitting is even more extreme. Logistic regression does not work very well. 

#### Trying Logistic Regression with Principal Component Analysis

Nevertheless, we also implement logistic regression with PCA.

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(solver="liblinear").fit(X_train_pca, y_train)

print("Training set score: {:.3f}".format(logreg.score(X_train_pca, y_train)))
print("Test set score: {:.3f}".format(logreg.score(X_test_pca, y_test)))

The results are a bit better for the test set, but worse for the training set. There may be less overfitting, but the results of the classification algorithm are still not very accurate. 
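
Since overfitting is the main problem here, one option is to strengthen regularization. In scikit-learn's `LogisticRegression`, smaller values of `C` mean stronger regularization. The sketch below sweeps a few `C` values on synthetic stand-in data; the values tried and the dataset are assumptions for illustration, not results from the pizza images.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with far more features than informative ones
X_demo, y_demo = make_classification(n_samples=200, n_features=500,
                                     n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0)

results = []
for C in (0.001, 0.01, 0.1, 1.0):
    model = LogisticRegression(solver="liblinear", C=C).fit(X_tr, y_tr)
    results.append((C, model.score(X_tr, y_tr), model.score(X_te, y_te)))

# A shrinking gap between train and test accuracy suggests less overfitting
for C, train_acc, test_acc in results:
    print(f"C={C:<6} train={train_acc:.2f} test={test_acc:.2f}")
```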

#### Implementing Neural Nets

Now, we'll see if neural nets do any better. 

In [None]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(max_iter=200, alpha=.001, random_state=42, hidden_layer_sizes=([100, ]), solver = "lbfgs")
mlp.fit(X_train, y_train)

print("Accuracy on training set: {:.2f}".format(mlp.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(mlp.score(X_test, y_test)))

Overfitting appears to be a common theme: it is still present here, but the accuracy on the test set is a bit better than that of logistic regression without PCA.

#### Trying Neural Nets with Principal Component Analysis

The following code tries neural nets with the default setting and PCA. 

In [None]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(random_state=42)
mlp.fit(X_train_pca, y_train)

print("Accuracy on training set: {:.2f}".format(mlp.score(X_train_pca, y_train)))
print("Accuracy on test set: {:.2f}".format(mlp.score(X_test_pca, y_test)))

We get fairly similar results to neural nets without PCA. There is still overfitting.

In the following code, we adjust some of the settings, mainly changing the alpha to see if that will decrease overfitting. 

In [None]:
mlp = MLPClassifier(max_iter=10000, alpha=1, random_state=0, hidden_layer_sizes=([10, ]))
mlp.fit(X_train_pca, y_train)

print("Accuracy on training set: {:.3f}".format(
    mlp.score(X_train_pca, y_train)))
print("Accuracy on test set: {:.3f}".format(mlp.score(X_test_pca, y_test)))

With the adjusted settings, the results are a bit better, but they still are not great. 

#### Using a CNN

Finally, we tried a convolutional neural network (CNN). To do this, we followed the steps for preparing data and building a model that we went through as a class in the lab on deep learning. This code is copied from that lab and then modified as necessary for our situation.

In [None]:
import tensorflow as tf
from tensorflow import keras

In [None]:
# load in data - without flattening
X_cnn = []
paths = ['/not_pizza/', '/pizza/']
for path in paths:
    for filename in os.listdir("clean_data" + path):
        # load image
        img_array = cv2.imread("clean_data" + path + filename)
        X_cnn.append(img_array)
    print(f"files from {path} loaded")
    
X_cnn = np.array(X_cnn)

# Normalize pixel values to be between 0 and 1
X_cnn = X_cnn / 255.0
print(X_cnn.shape)
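
Dividing a `uint8` array by `255.0` produces `float64` values, so the normalized array needs eight times the memory of the raw pixels. A rough estimate, assuming the 1966 images of shape 192 x 256 x 3 described above, looks like this. Casting to `float32` is one possible way to halve the footprint; it is a suggestion rather than something the notebook does.

```python
import numpy as np

n_images, height, width, channels = 1966, 192, 256, 3
n_values = n_images * height * width * channels

# Approximate array size for each dtype the pixels could be stored in
for dtype in (np.uint8, np.float32, np.float64):
    size_gib = n_values * np.dtype(dtype).itemsize / 2**30
    print(f"{np.dtype(dtype).name}: {size_gib:.2f} GiB")

# Division by a Python float promotes uint8 to float64
sample = np.zeros((2, 2), dtype=np.uint8) / 255.0
print(sample.dtype)
```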

In [None]:
from functools import partial

DefaultConv2D = partial(keras.layers.Conv2D,
                        kernel_size=3, activation='relu', padding="SAME")

model = keras.models.Sequential([
    DefaultConv2D(filters=64, kernel_size=7, input_shape=[192, 256, 3]),
    keras.layers.MaxPooling2D(pool_size=2),
    DefaultConv2D(filters=128),
    keras.layers.Flatten(),
    # only classes 0 and 1 occur in y, so 2 output units would suffice here
    keras.layers.Dense(units=10, activation='softmax'),
])

In [None]:
from sklearn.model_selection import train_test_split
X_train_cnn, X_test_cnn, y_train_cnn, y_test_cnn = train_test_split(X_cnn, y, test_size=0.2)

In [None]:
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
history = model.fit(X_train_cnn, y_train_cnn, epochs=10)

In [None]:
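# evaluate returns [loss, accuracy], following the metrics passed to compile
test_loss, test_accuracy = model.evaluate(X_test_cnn, y_test_cnn)
print(f"Test accuracy: {test_accuracy:.2f}")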


This simple CNN does relatively well. However, the performance varied greatly based on the various training and test set splits. Because it took around 20 minutes to train, it was difficult to determine what the accuracy averaged to. With 10 epochs, it achieved a score of .99 on the training set and .92 on the test set at one point, but more commonly it would score around .90 on the training set and around .60 on the test set, which is not much of an improvement from some of the simpler models. 
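
Part of the training cost comes from the size of the model. A back-of-the-envelope count, assuming the layer settings in the model cell above (`SAME` padding, stride 1, and a 10-unit dense layer), suggests that almost all of the parameters sit in the final dense layer, because the flattened feature map is very large. The helper name below is made up for this sketch.

```python
def conv_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter for a 2D convolution."""
    return kernel_size * kernel_size * in_channels * filters + filters

height, width = 192, 256

conv1 = conv_params(kernel_size=7, in_channels=3, filters=64)
# SAME padding keeps 192 x 256; 2 x 2 max pooling halves both dimensions
height, width = height // 2, width // 2
conv2 = conv_params(kernel_size=3, in_channels=64, filters=128)

flattened = height * width * 128   # inputs to the dense layer
dense = flattened * 10 + 10        # weights plus biases for 10 units

total = conv1 + conv2 + dense
print(f"conv1={conv1:,} conv2={conv2:,} dense={dense:,} total={total:,}")
print(f"dense share of parameters: {dense / total:.1%}")
```

Adding more pooling layers before `Flatten` would shrink the dense layer considerably, which might reduce both training time and overfitting.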

#### Conclusions


In this lab, we were able to try the majority of the classification models that we learned about in class, both with and without the transformation of the data by principal component analysis. Overall, we found that support vector machines without principal component analysis performed the best out of the simpler types of models. Convolutional neural nets also performed well, but maybe not well enough to justify the higher cost of computation.