## Importing libraries
We start off by importing the libraries we need for this project.

In [6]:
from sklearn import preprocessing
import PIL
from PIL import Image
from tqdm import tqdm
import glob
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers, losses
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.models import Model

2022-11-30 01:05:46.925721: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Case 1: image anomaly detection
When we look at the pictures, we see a lot of dishes and interior/exterior photos, but there are also a lot of unhelpful pictures. These include logos, people and other random images. There is no benefit to having these pictures on a Tripadvisor page, because they give the customer no information. We therefore call them anomalies. It would be useful to detect these images; such a model could then be used to clean up Tripadvisor pages.

The detection of these images is called anomaly detection. There are 2 kinds of anomaly detection:
- **Outlier detection:** For this model we need a dataset with both standard pictures and anomaly pictures.
- **Novelty detection:** For this model the training set consists only of standard pictures. The training set must be labeled, so these models are trained in a supervised fashion.

The Tripadvisor dataset is not labeled, so we could use outlier detection. Another option is to create a dataset of interiors, exteriors and dishes. This dataset would allow us to use **novelty detection**. For this case we will start off by looking at novelty detection.



sources: 1

### Constructing the dataset

The dataset will be an array of 3D matrices (height x width x channels), one matrix per image.
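
As a rough illustration of that layout, the sketch below stacks a few placeholder images into one array. The image count and the zero-filled pixels are assumptions made only for this example; the real images are loaded from disk in the cells that follow.

```python
import numpy as np

# Hypothetical stand-ins for loaded photos: 5 RGB images of 64x64 pixels.
images = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(5)]

# Stacking the list gives one 4D array: (images, height, width, channels).
dataset = np.array(images)

print(dataset.shape)     # (5, 64, 64, 3)
print(dataset[0].shape)  # (64, 64, 3), a single 3D image matrix
```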

In [7]:
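# TODO: add labels
def imagesInFolderToDataset(path, width, height, channels):
    """Load every readable image in `path`, resize it, and stack the results."""
    fileNameList = glob.glob(f"{path}*")
    images = []
    for fileName in tqdm(fileNameList, total=len(fileNameList)):
        try:
            img = Image.open(fileName)
            # resize() takes (width, height); bicubic resampling is set explicitly
            img_np = np.array(img.resize((width, height), Image.BICUBIC))
            # NumPy image arrays are (height, width, channels); skip images with
            # a different channel count (e.g. greyscale or RGBA files)
            if img_np.shape == (height, width, channels):
                images.append(img_np)
        except PIL.UnidentifiedImageError:
            # not a readable image file, skip it
            pass
    return np.array(images)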

In [13]:
restaurants_train   = imagesInFolderToDataset("Images/restaurant/", 64, 64, 3)
buffet_train        = imagesInFolderToDataset("Images/buffet/", 64, 64, 3)

# create test and train set
x_train     = np.concatenate((restaurants_train, buffet_train), axis=0)
x_test      = imagesInFolderToDataset("tripadvisor_dataset/tripadvisor_images/",64,64,3)

# normalize the data
x_train = x_train.astype(float) / 255.
x_test  = x_test.astype(float) / 255.

100%|██████████| 513/513 [00:04<00:00, 118.46it/s]
100%|██████████| 111/111 [00:00<00:00, 148.65it/s]
100%|██████████| 15183/15183 [01:02<00:00, 242.71it/s]


To begin with, we use greyscale images, because this reduces the dimensionality of the input and so lowers the risk of overfitting. RGB photos can make it possible to train a better model, but they are more prone to overfitting; the remedies for that are dimensionality reduction or more data.
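
To make the size reduction concrete, here is a small sketch that applies the same luminance weights as the conversion cell to a random placeholder image. The random image and the helper name `to_greyscale` are assumptions for illustration only.

```python
import numpy as np

# Luminance weights for the R, G and B channels.
WEIGHTS = np.array([0.2989, 0.587, 0.114])

def to_greyscale(rgb):
    """Collapse the last (channel) axis of an RGB array into one grey value."""
    return np.dot(rgb[..., :3], WEIGHTS)

rng = np.random.default_rng(0)
rgb_image = rng.random((64, 64, 3))   # placeholder image with values in [0, 1)
grey_image = to_greyscale(rgb_image)

print(rgb_image.size)    # 12288 values per RGB image
print(grey_image.size)   # 4096 values per greyscale image
```

Each greyscale image therefore has a third of the input values of its RGB version.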

In [14]:
# weighted sum of the R, G and B channels (luminance), keeping a channel axis of size 1
rgb_weights = [0.2989, 0.587, 0.114]
x_train_greyscale   = np.dot(x_train[..., :3], rgb_weights)[..., np.newaxis]
x_test_greyscale    = np.dot(x_test[..., :3], rgb_weights)[..., np.newaxis]

In [15]:
# convolutional autoencoder
class AnomalyDetector(Model):
  def __init__(self):
    super(AnomalyDetector, self).__init__()
    self.encoder = tf.keras.Sequential([
      layers.Input(shape=(64, 64, 1)),
      layers.Conv2D(16, (3, 3), activation='relu', padding='same', strides=2),
      layers.Conv2D(8, (3, 3), activation='relu', padding='same', strides=2)])

    self.decoder = tf.keras.Sequential([
      layers.Conv2DTranspose(8, kernel_size=3, strides=2, activation='relu', padding='same'),
      layers.Conv2DTranspose(16, kernel_size=3, strides=2, activation='relu', padding='same'),
      layers.Conv2D(1, kernel_size=(3, 3), activation='sigmoid', padding='same')])


  def call(self, x):
    encoded = self.encoder(x)
    decoded = self.decoder(encoded)
    return decoded

autoencoder = AnomalyDetector()

autoencoder.compile(optimizer='adam', loss=losses.MeanSquaredError())
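
To see how much the encoder defined above compresses an image, the sketch below traces the feature-map sizes through the layers. It relies on the rule that a `padding='same'` convolution with stride `s` produces an output of size `ceil(n / s)`, and a `padding='same'` transposed convolution produces `n * s`. The helper names are hypothetical and the code does not touch the actual Keras model.

```python
import math

def conv_same(size, stride):
    """Spatial output size of a padding='same' convolution."""
    return math.ceil(size / stride)

def conv_transpose_same(size, stride):
    """Spatial output size of a padding='same' transposed convolution."""
    return size * stride

side = 64                                   # input: 64 x 64 x 1
side = conv_same(side, 2)                   # Conv2D(16, strides=2)
side = conv_same(side, 2)                   # Conv2D(8, strides=2)
latent_shape = (side, side, 8)

input_values = 64 * 64 * 1
latent_values = latent_shape[0] * latent_shape[1] * latent_shape[2]

side = conv_transpose_same(side, 2)         # Conv2DTranspose(8, strides=2)
side = conv_transpose_same(side, 2)         # Conv2DTranspose(16, strides=2)
output_shape = (side, side, 1)              # final Conv2D(1) keeps the size

print(latent_shape)                  # (16, 16, 8)
print(input_values, latent_values)   # 4096 2048
print(output_shape)                  # (64, 64, 1)
```

Under these rules the bottleneck holds half as many values as the input image, and the decoder restores the original 64 x 64 x 1 shape.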

In [16]:
history = autoencoder.fit(x_train_greyscale, x_train_greyscale, 
          epochs=20, 
          batch_size=128,
          shuffle=True)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [20]:
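# add the channel axis if it is missing and match the model's float32 output
data = x_test_greyscale.reshape(-1, 64, 64, 1).astype("float32")

reconstructions = autoencoder(data)
# mean absolute error per image, averaged over all pixels
loss = tf.reduce_mean(tf.keras.losses.mae(data, reconstructions), axis=(1, 2))
reconstructions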

<tf.Tensor: shape=(15169, 64, 64, 1), dtype=float32, numpy=
array([[[[0.38968173],
         [0.48573735],
         [0.6419693 ],
         ...,
         [0.33125523],
         [0.39581877],
         [0.43386334]],

        [[0.46167764],
         [0.64270973],
         [0.7464678 ],
         ...,
         [0.32229954],
         [0.35104614],
         [0.39644054]],

        [[0.5767327 ],
         [0.82396454],
         [0.8896123 ],
         ...,
         [0.3935624 ],
         [0.41239172],
         [0.4380607 ]],

        ...,

        [[0.6436168 ],
         [0.8265079 ],
         [0.86373717],
         ...,
         [0.917894  ],
         [0.8735098 ],
         [0.7604954 ]],

        [[0.5752002 ],
         [0.7860461 ],
         [0.8839731 ],
         ...,
         [0.89145803],
         [0.81598556],
         [0.71655196]],

        [[0.54113513],
         [0.6321665 ],
         [0.6691126 ],
         ...,
         [0.7688103 ],
         [0.6731208 ],
         [0.5859966 ]]],
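
Once each image has a reconstruction error, one common way to turn the errors into anomaly flags is to derive a threshold from the errors on the training set, for example the mean plus one standard deviation, as the TensorFlow autoencoder tutorial (source 2) does. The sketch below uses made-up error values and a hypothetical helper `flag_anomalies`; it is not part of the trained pipeline in this notebook.

```python
import numpy as np

def flag_anomalies(train_errors, test_errors, n_std=1.0):
    """Flag test samples whose error exceeds mean + n_std * std of the training errors."""
    threshold = np.mean(train_errors) + n_std * np.std(train_errors)
    return test_errors > threshold, threshold

# Made-up per-image reconstruction errors, for illustration only.
train_errors = np.array([0.02, 0.03, 0.025, 0.035, 0.03])
test_errors = np.array([0.03, 0.12, 0.02, 0.09])

flags, threshold = flag_anomalies(train_errors, test_errors)
print(flags)  # [False  True False  True]
```

A larger `n_std` flags fewer images; the right value depends on how many false alarms are acceptable.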




We can divide anomaly detection into 2 different types:
- **Outlier detection:** Our input dataset contains examples of both standard events and anomaly events. These algorithms seek to fit regions of the training data where the standard events are most concentrated, disregarding, and therefore isolating, the anomaly events. Such algorithms are often trained in an unsupervised fashion (i.e., without labels). We sometimes use these methods to help clean and pre-process datasets before applying additional machine learning techniques.

- **Novelty detection:** Unlike outlier detection, which includes examples of both standard and anomaly events, novelty detection algorithms have only the standard event data points (i.e., no anomaly events) during training time. During training, we provide these algorithms with labeled examples of standard events (supervised learning). At testing/prediction time novelty detection algorithms must detect when an input data point is an outlier.
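
The difference between the two settings can be sketched on one-dimensional toy data. In this illustration, which is not the method used in this notebook, novelty detection fits a mean and standard deviation on clean data only, while outlier detection has to cope with contaminated data and therefore uses the median and the median absolute deviation, which are less affected by the anomalies. The function names, the toy values and the cut-off of 3 are all hypothetical choices.

```python
import numpy as np

def novelty_scores(clean_train, new_points):
    """Distance of new points from the clean training mean, in standard deviations."""
    mu, sigma = clean_train.mean(), clean_train.std()
    return np.abs(new_points - mu) / sigma

def outlier_scores(mixed_data):
    """Distance of each point from the median, in median absolute deviations."""
    median = np.median(mixed_data)
    mad = np.median(np.abs(mixed_data - median))
    return np.abs(mixed_data - median) / mad

# Novelty detection: the training data contains standard events only.
clean_train = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0])
new_points = np.array([10.1, 14.0])
is_novel = novelty_scores(clean_train, new_points) > 3

# Outlier detection: the anomaly (25.0) is mixed into the data itself.
mixed_data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0])
is_outlier = outlier_scores(mixed_data) > 3

print(is_novel)    # [False  True]
print(is_outlier)  # only the last point is flagged
```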

sources: 1



## Sources
1. https://pyimagesearch.com/2020/01/20/intro-to-anomaly-detection-with-opencv-computer-vision-and-scikit-learn/
2. https://www.tensorflow.org/tutorials/generative/autoencoder

3. https://www.guru99.com/autoencoder-deep-learning.html