# Galactic Classification CNN

## Purpose:

The purpose of this project was for me to gain more exposure to CNNs in a context I find rather interesting. I don't have much experience with NNs in general, so I thought building one (admittedly a rather basic one) would be a good start! This model will also serve as a baseline for when I get access to more computing power and memory, and as a general conceptual guide when I need to come back for a quick reference.

## Section Breakdown:

I have a description of each cell above it so please look there if you want a summary/explanation, but the general index is something like this (and note that each nth cell refers to the coded/non-markdown sections):

1. First Cell: general set-up for my local environment and importing the first needed packages
2. Second Cell: Matching the images to file names as it wasn't set up automatically
3. Third Cell: Image configuration and adding images/labels to their respective lists
4. Fourth Cell: Converting lists to numpy arrays, splitting into test/train sets
5. Fifth Cell: Model Architecture
6. Sixth Cell: Hyper-parameter tuning
7. Seventh Cell: Model training and saving

## Limitations:

The biggest hurdle for me by far was the fact that I am running on a local machine with 8GB of RAM. I had to choose between sacrificing some number of layers, reducing the training/validation set sizes, using a model that isn't so hard on memory, and so on and so forth to be able to run this locally. 
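
To put the memory pressure in numbers, here is a rough back-of-the-envelope sketch. It assumes the figures used later in this notebook (5000 images, resized to 128x128 RGB, stored as float32 after normalization, with 80% of them used for training). The helper name `array_bytes` is mine and purely illustrative.

```python
# Rough memory estimate for the image array (a sketch, not a profiler).
# Assumed figures: 128x128 pixels, 3 channels, float32 (4 bytes per value).

def array_bytes(n_images, height, width, channels, bytes_per_value):
    """Return the size in bytes of a dense image array."""
    return n_images * height * width * channels * bytes_per_value

full_set = array_bytes(5000, 128, 128, 3, 4)   # all matched images
train_set = array_bytes(4000, 128, 128, 3, 4)  # 80% of the images

print(f"full set:  {full_set / 1e9:.2f} GB")
print(f"train set: {train_set / 1e9:.2f} GB")
```

The training-set figure (786,432,000 bytes) is the same size as the allocation that TensorFlow warns about in the training output at the end of this notebook, and that is before counting the model's own weights and activations.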

## Discussion:

Overall I'm happy with the model. The accuracy and loss values were better than I was expecting when I first started building this 'toy model', and I'm sure that with more elbow grease I could improve both metrics (see 'Future Work' for a few ideas).

## Future Work:

* Run it with the full dataset 
* Change parameters (optimizer, train/test set percentages, etc)
* Add more layers 
* Incorporate rotations and flips

## Final notes: 

The explanations here are more conceptual than mathematical. For more mathematically based explanations, I can direct you to Google's ML explanations, Andrew Ng's ML courses, Codecademy's ML courses, or most other pages out there centered around ML/CV/NNs. I originally built and ran this in a .py file that will also be on my GitHub. For the reasons described in the 'Purpose' section, I figured it would be better to put it in a notebook, split the different sections into different cells, and use Markdown cells for my notes/thoughts. For those who prefer VSCode/PyCharm/XYZ IDE, you are welcome to use the .py file instead. The two should be the same, as this was just copied and pasted from there (with the addition of the Markdown cells, of course). Thanks for reading :)

## Data Source: 

Thanks to the wonderful people at Galaxy Zoo for this data! The images and CSV files were pulled from here: https://data.galaxyzoo.org/

_____________________________________________________________________________________________________________________________________________________

### Cell One: 

This is the general set-up for my local environment; some of the os.environ lines would not be needed on other machines. It also imports the first, commonly used packages.

In [1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
os.environ['QT_QPA_PLATFORM'] = 'offscreen'
os.environ['QT_LOGGING_RULES'] = 'qt.qpa.*=false'
# Point Qt at the current user's runtime directory
os.environ['XDG_RUNTIME_DIR'] = f'/run/user/{os.getuid()}'
import cv2
import pandas as pd
import numpy as np

from PyQt5.QtCore import QLibraryInfo



os.environ['QT_PLUGIN_PATH'] = os.path.join(QLibraryInfo.location(QLibraryInfo.PluginsPath))


### Cell Two: 

This matches each image file name with its row in the CSV file. When initially downloaded, the way the images were unpacked left the folder of 250,000 images in a completely randomized order, so this goes through the table, checks that the file for each row exists, and adds the matching file names and labels to two lists. It stops at 5000 images, which is enough to train the model on but not so large that it causes memory issues when running (refer to the 'Limitations' section at the very beginning).

In [2]:
# Load the table that maps each image's 'asset_id' to its 'gz2class' label
galaxy_df = pd.read_csv("/home/cyrus/Documents/galaxy_df_final.csv")

# Set the path to the directory containing the images
image_directory = "/home/cyrus/Downloads/images_gz2/images"

# Initialize empty lists to store the matched image file names and their corresponding 'gz2class' values
matched_images = []
matched_labels = []

# Iterate through the rows in the table
for index, row in galaxy_df.iterrows():

    if len(matched_images) >= 5000:
        break
    
    # Extract the 'asset_id' value
    image_id = row["asset_id"]

    # Create the corresponding image file name
    image_file = f"{image_id}.jpg"

    # Check if the image file exists in the directory
    if os.path.isfile(os.path.join(image_directory, image_file)):
        # Add the image file name to the matched_images list
        matched_images.append(image_file)
        # Add the corresponding 'gz2class' value to the matched_labels list
        matched_labels.append(row["gz2class"])
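
Calling `os.path.isfile` once per row works, but every call touches the disk. A possible alternative, sketched below, lists the directory once and then checks membership in a set. This is only a sketch: the function name `match_images` and the plain `(asset_id, label)` tuples standing in for the dataframe rows are assumptions made for illustration, and the tiny temporary folder is fake data.

```python
import os
import tempfile

def match_images(rows, image_directory, limit=5000):
    """Match (asset_id, label) pairs to the .jpg files present in a directory.

    `rows` is any iterable of (asset_id, label) pairs; in the notebook these
    would come from the 'asset_id' and 'gz2class' columns.
    """
    # List the directory once; set membership is then a cheap in-memory check
    available = set(os.listdir(image_directory))

    matched_images, matched_labels = [], []
    for asset_id, label in rows:
        if len(matched_images) >= limit:
            break
        image_file = f"{asset_id}.jpg"
        if image_file in available:
            matched_images.append(image_file)
            matched_labels.append(label)
    return matched_images, matched_labels

# Demo on a throwaway folder holding two empty files (made-up ids and labels)
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("1.jpg", "3.jpg"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_rows = [(1, "Ei"), (2, "Sc"), (3, "Er")]
    demo_images, demo_labels = match_images(demo_rows, demo_dir)
    limited_images, _ = match_images(demo_rows, demo_dir, limit=1)

print(demo_images, demo_labels)
```

With the dataframe from this cell, the rows could be passed as `zip(galaxy_df["asset_id"], galaxy_df["gz2class"])`.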



### Cell Three: 

This sets up the resizing and reformatting of the images. Resizing from a 4XXp image to a 128p image helps save RAM. OpenCV uses a BGR color format by default, so I convert it to the standard RGB that we all know and love! It also loops through the files in batches to help save on RAM (again, one of the tedious limitations of my local machine).

In [3]:
# Set the desired dimensions for resizing
resize_dim = (128, 128)

# Set the batch size
batch_size = 10

# Collect the resized images and the labels of the images that actually loaded
images = []
label_batches = []

# Loop through the image files and labels in batches
for i in range(0, len(matched_images), batch_size):
    batch_images = matched_images[i:i + batch_size]
    batch_labels = matched_labels[i:i + batch_size]

    for image_file, label in zip(batch_images, batch_labels):
        image_path = os.path.join(image_directory, image_file)

        # Read the image from file (cv2.imread returns None if it cannot be read)
        image = cv2.imread(image_path)

        if image is not None:
            # Convert the image from BGR to RGB format
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

            # Resize the image
            resized_image = cv2.resize(image, resize_dim)

            # Keep the image and its label together so the two lists stay aligned
            images.append(resized_image)
            label_batches.append(label)




### Cell Four:

This converts the previous lists to numpy arrays. It then normalizes the pixel values, encodes the labels, and splits the data into training and test (validation) sets.

Encoding is important because it converts the labels from a type the model can't read to a type that it can: it changes them from categorical (string) values to integers.
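
As a rough illustration of what the encoding step does, here is a plain-Python sketch. It mirrors the behaviour of scikit-learn's `LabelEncoder`, which numbers the classes in sorted order, but it is not that library's code, and the class names are made-up examples rather than values read from the CSV.

```python
# A plain-Python sketch of label encoding.
# The class names below are made-up examples, not values read from the CSV.
labels = ["Sc", "Ei", "Er", "Ei", "Sc"]

# Sort the unique class names and number them 0, 1, 2, ...
classes = sorted(set(labels))
class_to_index = {name: index for index, name in enumerate(classes)}

# Replace every label with its integer code
encoded = [class_to_index[name] for name in labels]

# The mapping can be inverted to turn predictions back into class names
decoded = [classes[index] for index in encoded]

print(classes)  # ['Ei', 'Er', 'Sc']
print(encoded)  # [2, 0, 1, 0, 2]
```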

Commonly recommended splits are about 70% train and 30% test; 70% train, 15% test, and 15% validation; or 70% train, 20% test, and 10% validation. I chose 80% for training and 20% for testing, and the code uses that held-out 20% as the validation set (X_val, y_val). There are many different ways to set it up, with varying philosophies and degrees of success, but for this 'toy model' I thought that the layers would be more important. An exploration into different split values could yield higher accuracies in the future.
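
The split in this notebook is stratified, meaning each class keeps the same share in both sets. A minimal sketch of that idea in plain Python follows. It is not scikit-learn's implementation; the function name `stratified_split` and the toy label counts are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, val_indices) with per-class proportions preserved."""
    rng = random.Random(seed)

    # Group sample indices by class
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_indices, val_indices = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Send the same fraction of every class to the validation set
        n_val = round(len(indices) * test_size)
        val_indices.extend(indices[:n_val])
        train_indices.extend(indices[n_val:])
    return train_indices, val_indices

# Toy labels: 50 of class 0, 30 of class 1, 20 of class 2
toy_labels = [0] * 50 + [1] * 30 + [2] * 20
train_idx, val_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx))
```

Each class contributes 20% of its samples to the validation set, so a rare class cannot end up missing from it by chance.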

In [4]:
X = np.array(images)
y = np.array(label_batches)


# Normalize pixel values
X = X.astype('float32')
X /= 255.0

from sklearn.preprocessing import LabelEncoder

# Encode labels
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)


import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

# X holds the images as a float32 array and y_encoded holds the integer labels.
# Hold out 20% of the data for validation; stratify keeps the class proportions
# the same in both sets
X_train, X_val, y_train, y_val = train_test_split(X, y_encoded, test_size = 0.2, random_state = 42, stratify = y_encoded)





2023-05-09 16:56:56.340590: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


### Cell Five:

This is the model architecture: a Sequential model with 3x Convolution layers, 2x Pooling layers, a Flattening layer, and 2x Dense layers. Here is a breakdown of the specifics of each term:

* Sequential model: a model built as a linear stack of layers, where each layer takes one tensor as an input and has one tensor as an output. It is the simplest way to define a model in Keras, which suits a small network like this one
* Convolution layers: take an image-like input (height x width x channels), apply a specified number of filters, and output one feature map per filter. The filters move the same way English is read (left to right, top to bottom), computing one output value at each position. A few parameters:
   * filters: the number of filters to use (the values inside each filter are learned during training)
   * kernel size: the 2D size of the filter (height x width)
   * strides: the length of the "step" in pixels, so a stride of 1 means that each time a computation is finished, the filter moves one pixel over
   * padding: this deals with the border or edge of the input. 'valid' means that the filter doesn't go past the boundary, so the output is slightly smaller than the input; 'same' pads the border so that, with a stride of 1, the output has the same height and width as the input
   * activation: this specifies what function is applied to the output; I used ReLU ('relu') in this particular model
   
* Pooling Layers: in the specific case of MaxPooling, it keeps the max value of each window of a feature map, which preserves the strongest responses (such as edges). In general, a pooling layer is applied after a convolution layer. It is a downsampling technique that decreases computational cost by reducing the 2D size of the feature map (height x width), and it also provides some translation invariance, which helps reduce overfitting. A few parameters:

   * pool_size: this is the 2D size of the pooling window (in height x width dimensions)
   * strides: the length of the "step" in pixels. In Keras it defaults to the pool_size, so the 2x2 pools used here halve the height and width
   * padding: this deals with the border or edge of the input. 'valid' means that the window doesn't go past the boundary, and 'same' pads the border so that no edge pixels are dropped

* Flattening Layers: usually after the Convolution and Pooling layers comes the Flattening layer. This layer takes a multidimensional input (in this case a 3D feature map of height x width x channels) and turns it into a 1D tensor. If the multidimensional input has a 3D shape of 5x5x2, the output tensor has a length of 50.
* Dense Layers: a Dense layer is a fully connected layer, meaning that each "neuron" is connected to every neuron in the previous layer. It takes the features extracted by the other layers and combines them to make predictions. Each neuron multiplies its inputs by weights, sums them, adds a bias, and applies the specified activation function
* 'ReLU': the Rectified Linear Unit is an activation function that returns 0 if the input is a negative value, and returns x for any other value. It introduces non-linearity to the model, which allows it to handle "real world data" a little better. ReLU is popular because it helps avoid the vanishing gradient problem (where gradients become so small that the early layers barely learn): its gradient is always 1 for positive inputs.
* 'Dense units': Dense units are the individual 'neurons' or 'nodes' in a dense layer. Dense(64) would refer to a dense layer of 64 neurons
* 'Softmax': An activation function that converts the outputs of the final layer into probabilities between zero and one that sum to one; the predicted class is the one with the highest probability. Often used in multi-class classification problems where each image belongs to one of several categories, and also generally the last layer in a CNN for image classification.
* 'seed': A seed is a number that we pass to the random number generator to initialize it. If we use the same seed every time, then the random number generator will produce the same sequence of random numbers every time, which allows for reproducibility when running the model. Without it, even when using the same training data and settings (e.g. the number of epochs), the random parts of the run (such as the initial weights) would be different every time it was run.
* 'Adam optimizer': Adaptive Moment Estimation. Adam keeps an exponential moving average of the gradient (the first moment) and of the squared gradient (the second moment), and the parameters beta1 and beta2 control the decay rates of these moving averages. Because both averages start at zero, Adam applies a bias correction to them at each update. Dividing the step by the square root of the second moment gives each parameter its own effective learning rate. Adam usually requires less tuning than other optimizers and handles noisy gradients well, but it has the drawback of storing two extra values per parameter, so it uses more memory than plain SGD.
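
To make the kernel size, stride, and padding descriptions above concrete, here is a small sketch that traces the height/width of a 128x128 image through the layer stack used in this notebook (three 3x3 convolutions with the default 'valid' padding and two 2x2 max-pooling layers). The helper name `conv_output_size` and the choice of 64 filters in the last convolution are assumptions for illustration.

```python
def conv_output_size(size, kernel, stride=1):
    """Output height/width of a convolution or pooling window with 'valid' padding."""
    return (size - kernel) // stride + 1

size = 128                                        # input images are 128x128
size = conv_output_size(size, kernel=3)           # Conv2D 3x3    -> 126
size = conv_output_size(size, kernel=2, stride=2) # MaxPool 2x2   -> 63
size = conv_output_size(size, kernel=3)           # Conv2D 3x3    -> 61
size = conv_output_size(size, kernel=2, stride=2) # MaxPool 2x2   -> 30
size = conv_output_size(size, kernel=3)           # Conv2D 3x3    -> 28

filters_3 = 64  # assumed value; the tuner picks 64, 128, 192, or 256
flattened = size * size * filters_3  # length of the 1D tensor after Flatten

print(size, flattened)  # 28 50176
```

The same arithmetic explains the Flatten example in the list: a 5x5x2 input becomes 5 * 5 * 2 = 50 values.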

I wasn't sure what to name the filter hyperparameters, so I followed the "standard notation" of dataframes (i.e. the first dataframe would be df or df1, the second would be df2, etc.), which gives filters_1, filters_2, and filters_3.
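
The ReLU and softmax activations described in the list above can be written in a few lines of plain Python. This is a minimal sketch for intuition only (Keras applies them to whole tensors at once), and the scores are made-up numbers.

```python
import math

def relu(x):
    """Return 0 for negative inputs and x otherwise."""
    return max(0.0, x)

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # made-up outputs of the last Dense layer for 3 classes
probabilities = softmax(scores)
predicted_class = probabilities.index(max(probabilities))

print([relu(x) for x in (-2.0, 0.0, 3.0)])  # [0.0, 0.0, 3.0]
print(predicted_class)                       # 0
```

The largest score always gets the largest probability, so softmax does not change which class wins; it only rescales the scores into probabilities.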


In [5]:
from tensorflow.keras import layers, models
from keras_tuner import HyperModel

class GalacticHyperModel(HyperModel):
    def __init__(self, input_shape, num_classes):
        self.input_shape = input_shape
        self.num_classes = num_classes

    def build(self, hp):
        model = models.Sequential()
        
        # Convolutional and pooling layers with tunable hyperparameters
        model.add(layers.Conv2D(filters = hp.Int('filters_1', 16, 64, step = 16),
                                kernel_size = (3, 3),
                                activation = 'relu',
                                input_shape = self.input_shape))
        model.add(layers.MaxPooling2D((2, 2)))
        
        model.add(layers.Conv2D(filters = hp.Int('filters_2', 32, 128, step = 32),
                                kernel_size = (3, 3),
                                activation = 'relu'))
        model.add(layers.MaxPooling2D((2, 2)))
        
        model.add(layers.Conv2D(filters = hp.Int('filters_3', 64, 256, step = 64),
                                kernel_size = (3, 3),
                                activation = 'relu'))

        # Fully connected layers
        model.add(layers.Flatten())
        model.add(layers.Dense(units = hp.Int('dense_units', 32, 256, step = 32),
                               activation = 'relu'))
        model.add(layers.Dense(self.num_classes, activation = 'softmax'))

        # Compile the model with tunable learning rate
        optimizer = tf.keras.optimizers.Adam(learning_rate = hp.Float('learning_rate', 1e-4, 1e-2, sampling = 'log'))
        # optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
        model.compile(optimizer = optimizer, loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

        return model
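
For intuition about the Adam optimizer used in the compile step, here is a sketch of a single-parameter Adam update following the standard published update rule. It is not Keras's implementation; the function name `adam_step`, the default values, and the toy problem are illustrative assumptions.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a single scalar parameter.

    m and v are the running averages of the gradient and the squared gradient;
    t is the 1-based step count.
    """
    m = beta1 * m + (1 - beta1) * grad          # first moment (average gradient)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (average squared gradient)
    m_hat = m / (1 - beta1 ** t)                # bias correction: m and v start at zero
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimise f(x) = x^2, whose gradient is 2x
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.1)
    if t == 1:
        first_step_x = x

print(first_step_x, x)
```

On the very first step the bias-corrected ratio is close to 1, so the parameter moves by roughly the learning rate (from 5.0 to about 4.9) regardless of how large the gradient is.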



### Cell Six:

Here is the actual hyperparameter search. The hypermodel is instantiated, the tuner is set up, and the search then tries up to 30 different combinations of hyperparameters (trials), training each one for up to 50 epochs, and keeps the combination with the best validation accuracy. Early stopping is also included: if the validation loss of a trial hasn't improved for 5 epochs in a row, training for that trial stops and the tuner moves on to the next one. It basically decides that the time that could be spent on the current trial would be better spent on others, as there are limited and diminishing returns.

In [6]:
from keras_tuner import RandomSearch

input_shape = (128, 128, 3)
num_classes = 3

# Instantiate the HyperModel
galactic_hypermodel = GalacticHyperModel(input_shape, num_classes)

# Set up the RandomSearch tuner
tuner = RandomSearch(galactic_hypermodel,
                     objective = 'val_accuracy',
                     max_trials = 30,  # Number of different hyperparameter combinations to try
                     seed = 42,
                     project_name = 'galaxy_cnn_tuning')

# Search for the best hyperparameters
tuner.search(X_train, y_train,
             epochs = 50,
             validation_data = (X_val, y_val),
             callbacks = [tf.keras.callbacks.EarlyStopping(patience=5)])

# Get the best model found by the tuner
best_hp = tuner.get_best_hyperparameters()[0]
galaxy_cnn = tuner.hypermodel.build(best_hp)



INFO:tensorflow:Reloading Tuner from ./galaxy_cnn_tuning/tuner0.json
INFO:tensorflow:Oracle triggered exit
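
The early stopping used during the search can be pictured with the simplified sketch below. It reproduces only the patience logic (stop once the validation loss has failed to improve for `patience` epochs in a row) and is not the Keras callback itself; the function name `epochs_run` and the loss values are made up.

```python
def epochs_run(val_losses, patience=5):
    """Return how many epochs run before early stopping ends training.

    val_losses holds one validation loss per epoch, in order.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best value: remember it and reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# Made-up validation losses: the best value arrives at epoch 4
losses = [0.90, 0.80, 0.75, 0.70, 0.72, 0.71, 0.73, 0.74, 0.75, 0.76, 0.77]
print(epochs_run(losses))  # 9
```

With a patience of 5, training stops five epochs after the best epoch, so the last two values in the list are never reached.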




### Cell Seven:

This is where the model is actually run. The training history is then saved into a pickle file, and the results are analyzed/plotted in a separate environment. I've seen other people do everything in one notebook; however, I was having issues plotting in this notebook for some reason, so I decided to split them up.



In [7]:
# Set hyperparameters
import tensorflow as tf
batch_size = 32
epochs = 50
learning_rate = 0.0001

# Recompile the model with a fixed learning rate
# (note: this replaces the learning rate chosen by the tuner)
optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate)
galaxy_cnn.compile(optimizer = optimizer, loss = 'sparse_categorical_crossentropy', metrics=['accuracy'])

# Set up early stopping to avoid overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train the model
history = galaxy_cnn.fit(X_train, y_train,
                         batch_size = batch_size,
                         epochs = epochs,
                         validation_data = (X_val, y_val),
                         callbacks = [early_stopping])



import pickle

with open("history.pickle", "wb") as f:
    pickle.dump(history.history, f)


print('Finished with the model!')

Epoch 1/50


2023-05-09 16:57:00.351570: W tensorflow/tsl/framework/cpu_allocator_impl.cc:82] Allocation of 786432000 exceeds 10% of free system memory.


Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Finished with the model!


#### Finished with the model!
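
For the separate analysis environment, the saved history can be read back with `pickle.load`. The sketch below uses a made-up history dictionary (written to a temporary folder) in the same shape as `history.history`, with one list per metric and one value per epoch, so the numbers are not results from this model.

```python
import os
import pickle
import tempfile

# Made-up history in the same shape as Keras's history.history
example_history = {
    "loss": [1.00, 0.80, 0.70, 0.65],
    "val_loss": [0.95, 0.85, 0.80, 0.82],
    "accuracy": [0.50, 0.62, 0.68, 0.71],
    "val_accuracy": [0.52, 0.60, 0.64, 0.63],
}

# Write and read the file the same way the notebook does
path = os.path.join(tempfile.mkdtemp(), "history.pickle")
with open(path, "wb") as f:
    pickle.dump(example_history, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)

# Find the epoch (1-based) with the lowest validation loss
best_epoch = loaded["val_loss"].index(min(loaded["val_loss"])) + 1
print(best_epoch, loaded["val_accuracy"][best_epoch - 1])  # 3 0.64
```

From here the lists can be handed to any plotting library to draw the loss and accuracy curves per epoch.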