# Tutorial 2: Automatic Feature Extraction/Engineering

---

### Introduction

In this notebook, we will extract/engineer features using a deep learning method called an autoencoder.
An autoencoder is an artificial neural network, typically with a symmetric structure, that is trained to reconstruct its input at the final output layer. The output of the first half of the network (the encoder) represents an encoding of the input data ([source](https://arxiv.org/abs/2206.06165)).

First, we import some libraries:

In [None]:
import matplotlib.pyplot as plt  # for plotting data/graphs
import numpy as np  # for handling N-dimensional arrays

import tensorflow as tf  # an end-to-end machine learning platform, focused on training deep learning models
from tensorflow.keras import layers, losses  # Keras, the high-level API of TensorFlow
from tensorflow.keras.models import Model  # base class used to define our own Keras models



---

### Reading in data
The following code is the same as in Tutorial 1.

In [None]:
from galaxy_mnist import GalaxyMNISTHighrez

dataset_train = GalaxyMNISTHighrez(
    root='data_import/data',
    download=True,
    train=True  # True (the default) selects the training set
)
# for the testing data
dataset_test = GalaxyMNISTHighrez(
    root='data_import/data',
    download=True,
    train=False  # False selects the canonical test set
)

In [None]:
# defining the training and testing labels and image samples
images_train = dataset_train.data
images_test = dataset_test.data
labels_train = dataset_train.targets
labels_test = dataset_test.targets
classes = GalaxyMNISTHighrez.classes

---

### Pre-processing 
The following code is the same as in Tutorial 1.

In [None]:
from source.pre import pre_processing #  A predefined function to pre-process the data as we did in tutorial 1

In [None]:
# pre_processing(data, size) takes two arguments:
# 1. data: the data to be processed
# 2. size: the size to which the data is reduced
images_trainPre = pre_processing(images_train, 56)
images_testPre = pre_processing(images_test, 56)

In [None]:
images_trainPre.shape # the shape of the training data

Displaying images after pre-processing

In [None]:
rows = 1
columns = 5
for j in range(len(GalaxyMNISTHighrez.classes)):
    fig = plt.figure(figsize=(8, 8))# Figure is 8 inches by 8 inches
    for i in range (columns):    # Create images in each column
        train_image = images_trainPre[(labels_train == j)][i]
        fig.add_subplot(rows, columns, i+1)
        plt.imshow(train_image*255,cmap='gray', vmin=0, vmax=255) 
                            # we have to multiply the image by 255 to restore the original values
    print("label: "+str(GalaxyMNISTHighrez.classes[j]))
    plt.tight_layout()
    plt.show() 
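
The expression `images_trainPre[(labels_train == j)][i]` in the cell above relies on boolean mask indexing: the comparison produces one True/False entry per image, and indexing with that mask keeps only the images of class `j`. The following is a minimal sketch of the same idea on synthetic NumPy arrays; the array names and values are illustrative stand-ins, not the galaxy data.

```python
import numpy as np

# Synthetic stand-ins: six 2x2 "images" and their class labels
images = np.arange(6 * 2 * 2).reshape(6, 2, 2)
labels = np.array([0, 1, 0, 2, 1, 0])

mask = labels == 0            # boolean array, True where the label is class 0
class0_images = images[mask]  # keeps only the images whose mask entry is True
first_class0 = class0_images[0]  # first image of class 0, as in [...][i] with i = 0

print(mask)
print(class0_images.shape)
```

Here three of the six labels equal 0, so the selected array holds three images.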


---

### Autoencoders
We will try different options for the autoencoder and compare performance. First we'll try a shallow autoencoder with as few layers as possible. Then we'll compare with a deeper autoencoder. 

#### Shallow Autoencoder
Autoencoders are typically built with a symmetric structure, in which the decoder mirrors the encoder, so the network has an even number of layers. We will use a very simple neural network for the autoencoder, with just two dense layers (plus a flattening input layer and a reshaping output layer). The autoencoder neural network is trained on the data that we pre-processed. The original code can be found [here](https://www.tensorflow.org/tutorials/generative/autoencoder).

In [None]:
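latent_dim = 64  # the number of encoded features; this value can be changed
num, length, width = images_trainPre.shape  # number of images, image height, image width
# The encoder compresses each image to latent_dim features;
# the decoder reconstructs the image from those features.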

class Autoencoder(Model):
  def __init__(self, latent_dim):
    super(Autoencoder, self).__init__()
    self.latent_dim = latent_dim
    # The NN is defined in two parts:encoder and decoder
    # Encoder part:
    self.encoder = tf.keras.Sequential([
      layers.Flatten(), # Input layer-- flattens image into vector
      layers.Dense(latent_dim, activation='relu'), # Dense hidden layer
    ])
    # Decoder part of the NN
    self.decoder = tf.keras.Sequential([
      layers.Dense(length*width, activation='sigmoid'), # Dense layer with one output per pixel, values in [0, 1]
      layers.Reshape((length, width)) # Output layer (reshapes vector back to image size)
    ])

  def call(self, x):
    encoded = self.encoder(x)
    decoded = self.decoder(encoded)
    return decoded
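
To get a feel for the size of this model, the arithmetic below counts the pixels per image, the compression factor and the trainable parameters of the two dense layers. It is a back-of-the-envelope sketch that assumes the values used in this notebook (56x56 images and `latent_dim = 64`) and the usual weights-plus-biases count for a dense layer.

```python
# Assumed values taken from this notebook: 56x56 grayscale images, 64 latent features
length, width, latent_dim = 56, 56, 64

n_pixels = length * width                             # size of the flattened input vector
encoder_params = n_pixels * latent_dim + latent_dim   # weights + biases of the encoder Dense layer
decoder_params = latent_dim * n_pixels + n_pixels     # weights + biases of the decoder Dense layer

print("pixels per image:", n_pixels)
print("compression factor:", n_pixels / latent_dim)
print("trainable parameters:", encoder_params + decoder_params)
```

Each image of 3136 pixels is therefore squeezed into 64 numbers, a 49-fold reduction.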

##### 1) Define model

In [None]:
shallow_model = Autoencoder(latent_dim)

##### 2) Compile model with Adam optimization

In [None]:
shallow_model.compile(optimizer='adam', loss=losses.MeanSquaredError())
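
The mean squared error used as the loss measures how far the reconstruction is from the input: it averages the squared pixel-wise differences. The sketch below computes this quantity with NumPy on toy values; the helper name `mean_squared_error` and the arrays are illustrative assumptions, not part of the notebook or of Keras.

```python
import numpy as np

def mean_squared_error(original, reconstructed):
    """Average of the squared pixel-wise differences between two image batches."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return float(np.mean((original - reconstructed) ** 2))

# Toy batch with one 2x2 "image", pixel values in [0, 1] (illustrative values only)
original = np.array([[[0.0, 1.0], [0.5, 0.5]]])
reconstructed = np.array([[[0.0, 0.8], [0.5, 0.7]]])

mse = mean_squared_error(original, reconstructed)
print(mse)
```

A perfect reconstruction gives an error of exactly 0, and larger values mean a poorer reconstruction.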

##### 3) Build the model

In [None]:
shallow_model.build((None, 56,56,1))

##### 4) Train the model


In the training process we use "early stopping", which automatically terminates training when there is little or no improvement from epoch to epoch.

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(patience=2)

`EarlyStopping()` has several options, including:
- `monitor` (default value `'val_loss'`): the performance measure to monitor. By default, the validation loss is used to decide when to terminate the training.
- `patience` (default value `0`): the number of epochs with no improvement after which training is terminated. The value 0 means that training is terminated as soon as the performance measure fails to improve from one epoch to the next.
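
The effect of `patience` can be illustrated with a few lines of plain Python. This is a simplified sketch of the stopping rule, not the Keras implementation: the function name `stopping_epoch` and the loss values are made up for illustration, and options such as `min_delta` are ignored.

```python
def stopping_epoch(val_losses, patience=0):
    """Return the 0-based epoch at which training would stop, or None if it never stops.

    Simplified rule: count consecutive epochs in which the monitored loss does not
    improve on its best value so far, and stop once that count reaches `patience`.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss   # new best value: reset the counter
            wait = 0
        else:
            wait += 1     # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None

losses = [0.50, 0.40, 0.41, 0.42, 0.39]   # made-up validation losses
print(stopping_epoch(losses, patience=2))
print(stopping_epoch(losses, patience=0))
```

With `patience=2`, training in this sketch continues through two non-improving epochs and therefore never reaches the better value in the last epoch; a larger `patience` would.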

We're ready to train the model!

In [None]:
shallow_model.fit(np.array(images_trainPre), np.array(images_trainPre),
                epochs=50,
                shuffle=True,
                # validate on held-out images so that early stopping monitors generalization
                validation_data=(np.array(images_testPre), np.array(images_testPre)),
                callbacks=[early_stopping])

This code runs quickly because the model is very shallow.

##### 5) Display the model

Now let's compare the inputs and outputs, and see whether they closely resemble each other.

In [None]:
encoded_imgs = shallow_model.encoder(images_testPre).numpy()
decoded_imgs = shallow_model.decoder(encoded_imgs).numpy()

In [None]:
# Check that the output shape is correct
print(decoded_imgs.shape)

In [None]:
# Display inputs and outputs
rows = 1
columns = 5
for j in range(len(GalaxyMNISTHighrez.classes)):
    fig = plt.figure(figsize=(8, 8))# Figure is 8 inches by 8 inches
    for i in range (columns):    # Create images in each column
        test_image = images_testPre[(labels_test == j)][i]
        fig.add_subplot(rows, columns, i+1)
        plt.imshow(test_image*255,cmap='gray', vmin=0, vmax=255) 
                            # we have to multiply the image by 255 to restore the original values
    print("Original: "+str(GalaxyMNISTHighrez.classes[j]))
    plt.tight_layout()
    plt.show() 
    
    fig = plt.figure(figsize=(8, 8))# Figure is 8 inches by 8 inches
    for i in range (columns):    # Create images in each column
        test_image = decoded_imgs[(labels_test == j)][i]
        fig.add_subplot(rows, columns, i+1)
        plt.imshow(test_image*255,cmap='gray', vmin=0, vmax=255) 
                            # we have to multiply the image by 255 to restore the original values
    print("Reconstructed: "+str(GalaxyMNISTHighrez.classes[j]))
    plt.tight_layout()
    plt.show() 
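
A visual comparison can be complemented by a number: the mean squared reconstruction error per class. The sketch below shows one way to compute it with NumPy. The function name `per_class_reconstruction_error` is a hypothetical helper, and the arrays are synthetic stand-ins rather than the galaxy images.

```python
import numpy as np

def per_class_reconstruction_error(originals, reconstructions, labels):
    """Mean squared reconstruction error for each class label."""
    originals = np.asarray(originals, dtype=float)
    reconstructions = np.asarray(reconstructions, dtype=float)
    labels = np.asarray(labels)
    # One error value per image: average over all pixel axes
    pixel_axes = tuple(range(1, originals.ndim))
    per_image = np.mean((originals - reconstructions) ** 2, axis=pixel_axes)
    # Average the per-image errors within each class
    return {int(c): float(per_image[labels == c].mean()) for c in np.unique(labels)}

# Synthetic stand-in data: four 2x2 images, two classes
originals = np.zeros((4, 2, 2))
reconstructions = np.zeros((4, 2, 2))
reconstructions[2:] = 0.5          # class-1 images are reconstructed worse
labels = np.array([0, 0, 1, 1])

errors = per_class_reconstruction_error(originals, reconstructions, labels)
print(errors)
```

In the notebook, a call such as `per_class_reconstruction_error(images_testPre, decoded_imgs, labels_test)` should work, assuming these variables convert to NumPy arrays of matching shape.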
    
    

**Exercise 1:** Which classes do you think will be confused with the others?

In [None]:
### -- Answer here --


##### 6) Save the model


In [None]:
shallow_model.save("./shallowModel_save") # saving the model (shallow)

---

#### Deep convolutional autoencoder
A visual comparison of inputs and outputs shows that the shallow fully-connected autoencoder does not preserve images very well. So we'll try a more complicated model and compare execution time and image quality. Convolutional NNs are typically used in image classification, so let's try a convolutional NN for our autoencoder. The following autoencoder model is a modified version of the one found [here](https://github.com/ezrafielding/galaxy-cluster/blob/main/autoencoder/galaxyencode.py).

In [None]:
class GalaxyEncoder(Model):
    def __init__(self):
        super(GalaxyEncoder, self).__init__()
        self.encoder = tf.keras.Sequential ([
            layers.InputLayer(input_shape=(56,56,1)),
            layers.Conv2D(16, (3,3), 1, padding="same", activation="relu"),
            layers.MaxPool2D((2,2), padding="same", strides=2),
            layers.Conv2D(8, (3,3), 1, padding="same", activation="relu"),
            layers.MaxPool2D((2,2), padding="same", strides=2),
            layers.Flatten()
        ])
        self.decoder = tf.keras.Sequential ([
            layers.InputLayer(input_shape=(1568,)),  # 14 * 14 * 8 flattened encoder outputs; the trailing comma makes this a tuple
            layers.Reshape((14, 14, 8)),
            layers.UpSampling2D((2,2)),
            layers.Conv2DTranspose(8, (3,3), 1, padding="same", activation="relu"),
            layers.UpSampling2D((2,2)),
            layers.Conv2DTranspose(16, (3,3), 1, padding="same", activation="relu"),
            layers.Conv2D(1, (3,3), 1, padding="same", activation="sigmoid")
        ])

    def call(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
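
The number 1568 in the decoder's input layer follows from the encoder's layer shapes. The sketch below traces the spatial size through the encoder, assuming the standard rule that a layer with `padding="same"` produces an output of size ceil(input size / stride); the helper `same_padding_output` is an illustrative name, not a Keras function.

```python
import math

def same_padding_output(size, stride):
    """Spatial output size of a layer with padding="same": ceil(size / stride)."""
    return math.ceil(size / stride)

size, channels = 56, 1   # input: 56x56 grayscale image
# (layer name, output channels or None to keep them unchanged, stride)
encoder_layers = [
    ("Conv2D(16)", 16, 1),
    ("MaxPool2D", None, 2),
    ("Conv2D(8)", 8, 1),
    ("MaxPool2D", None, 2),
]
for name, out_channels, stride in encoder_layers:
    size = same_padding_output(size, stride)
    channels = out_channels if out_channels is not None else channels
    print(f"{name:<12} -> ({size}, {size}, {channels})")

flattened = size * size * channels   # length of the encoded feature vector
print("Flatten      ->", flattened)
```

Each pooling layer halves the image side (56 to 28 to 14), so the deep encoder produces 14 * 14 * 8 = 1568 features per image.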

In [None]:
deep_model = GalaxyEncoder()

### Exercises ###
Following the procedure that was used above for the shallow model, apply the same steps to the deep model that we have just defined. These steps are: 1) defining, 2) compiling, 3) building, 4) training, 5) displaying and 6) saving.

##### 1) Define

In [None]:
### -- Code here --


##### 2) compiling

In [None]:
### -- Code here --


##### 3) building

In [None]:
### -- Code here --


##### 4) training

In [None]:
### -- Code here --


##### 5) displaying

In [None]:
### -- Code here --


##### 6) saving

In [None]:
### -- Code here --


---

### Extracting the engineered features from the autoencoder model

For now, we will continue with just the shallow model. To proceed, we need to extract the 64 encoded features from the shallow encoder model.

In [None]:
import tensorflow as tf
import pandas as pd

In [None]:
model = tf.keras.models.load_model('shallowModel_save/') # reloading the saved shallow model
model.summary() # summary() prints the table itself, so no print() is needed

We can apply the encoder to the training and testing data to obtain the encoded features for each data item.

In [None]:
auto_features_train = model.encoder.predict(images_trainPre) # extracting the features for the training data
auto_features_test = model.encoder.predict(images_testPre)   # extracting the features for the testing data

In [None]:
auto_df_train = pd.DataFrame(auto_features_train) #turning the data into a dataframe
auto_df_test = pd.DataFrame(auto_features_test) #turning the data into a dataframe

In [None]:
print(auto_df_train.shape) 

In [None]:
auto_df_train.head(3)

---

#### **_Saving data for later use_**

We can save the data so that we can load it again in subsequent notebooks.

In [None]:
%store auto_df_train
%store auto_df_test
%store labels_train
%store labels_test
%store classes