using:

https://github.com/henrysky/astroNN

data: https://astronn.readthedocs.io/en/latest/galaxy10.html



## Cleaning data
### Loading the dataset:

In [None]:
import h5py

file_path="Galaxy10.h5"    

### Let us study the data:

In [None]:
with h5py.File(file_path,'r') as f:
    print("Keys in the file:", list(f.keys()))

    images=f['images'][:]
    print(f"Images shape: {images.shape}")

    labels=f['ans'][:]
    print(f"Labels shape:{labels.shape}")
    print(f"Unique labels: {set(labels)}")

#### So, there are:
1. 21785 images with 21785 labels
2. each image is 69x69 pixels
3. images have 3 channels (likely RGB channels)
4. and there are 10 classes of labels (0-9), which means there are 10 types of galaxies in the data.
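
Beyond the number of classes, it can help to check how many images each class has, since an unbalanced dataset affects how accuracy should be read. A minimal sketch, using a small made-up `labels` array as a stand-in for the real `f['ans'][:]` (the counts below are invented for illustration, not the Galaxy10 counts):

```python
import numpy as np

# Stand-in labels: 10 classes with made-up, unequal counts (NOT the real Galaxy10 counts)
toy_counts = [30, 60, 60, 5, 15, 1, 6, 10, 9, 4]
labels = np.repeat(np.arange(10), toy_counts)

# np.unique with return_counts=True gives each class and how often it occurs
classes, counts = np.unique(labels, return_counts=True)

for c, n in zip(classes, counts):
    print(f"class {c}: {n} images ({n / labels.size:.1%})")
```

With the real data, `labels` would be the array read from the file in the cell above.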

### Normalising Pixel Values

Normalised Pixel Value = Pixel Value / 255.0

Assuming 8-bit images (pixel values from 0 to 255), this rescales every pixel into the range 0 to 1.
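
A small sketch of the formula on synthetic data. It assumes the images are stored as unsigned 8-bit integers, and it casts to `float32` first, because dividing a `uint8` array by `255.0` directly produces `float64` and doubles the memory used. The random `images` array is only a stand-in for the real one:

```python
import numpy as np

# Stand-in for the real image array: 4 random 69x69 RGB images (assumed uint8)
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(4, 69, 69, 3), dtype=np.uint8)

# Cast to float32, then scale 0..255 down to 0..1
normalised_images = images.astype(np.float32) / 255.0

print(f"dtype: {normalised_images.dtype}")
print(f"Pixel value range: {normalised_images.min()} to {normalised_images.max()}")
```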

In [None]:
# normalised_images = images/255.0

# print(f"Normalized image shape: {normalised_images.shape}")
# print(f"Pixel value range: {normalised_images.min()} to {normalised_images.max()}")

We can optionally save the normalised image data:

In [None]:
# normalised_image_file="Galaxy10_normalized.h5"

# with h5py.File(normalised_image_file, 'w') as f:
#     f.create_dataset('images',data=normalised_images)
#     print(f"Saved to {normalised_image_file}")

### Now, we split the data into training and validation sets

In [None]:
from sklearn.model_selection import train_test_split

with h5py.File(file_path, 'r') as f:
    labels=f['ans'][:]

X_train, X_val, y_train, y_val = train_test_split(images, labels, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}\nValidation set shape:{X_val.shape}")
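
If the classes are unbalanced, a purely random split can leave the validation set with different class proportions from the training set. `train_test_split` accepts a `stratify` argument that keeps the proportions the same in both sets. A hedged sketch on toy data (three made-up classes and blank 8x8 images, not the Galaxy10 arrays):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 1000 blank images, three classes with counts 500 / 300 / 200
labels = np.repeat([0, 1, 2], [500, 300, 200])
images = np.zeros((labels.size, 8, 8, 3), dtype=np.uint8)

# stratify=labels keeps the class proportions equal in both splits
X_train, X_val, y_train, y_val = train_test_split(
    images, labels, test_size=0.2, random_state=42, stratify=labels
)

print("Training class fractions:  ", np.bincount(y_train) / y_train.size)
print("Validation class fractions:", np.bincount(y_val) / y_val.size)
```

Whether stratifying matters for this dataset depends on how uneven the real class counts are.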

### Building a simple CNN


In [None]:
import tensorflow as tf

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense,Input

Defining the CNN model:

In [None]:
model = Sequential([
    Input(shape=(64, 64, 3)),  
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),  # halves the height and width of the feature maps
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')  
])

How the Model Works

    1. The input image (64, 64, 3) passes through the convolutional layers to extract features like edges and patterns
    2. MaxPooling layers reduce the size of the feature maps, keeping only the most important information.
    3. The flatten layer prepares the data for the fully connected dense layers.
    4. The dense layers process the features and output probabilities for each of the 10 classes.
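
The steps above can be traced by hand. The sketch below follows the feature-map size through the layers and counts the trainable parameters, assuming the Keras defaults the model relies on (`'valid'` padding and stride 1 for `Conv2D`, and a stride equal to the window for `MaxPooling2D`). The helper names `conv_out` and `pool_out` are made up for this illustration:

```python
def conv_out(size, kernel=3):
    # 'valid' padding, stride 1: the kernel cannot hang over the edge
    return size - kernel + 1

def pool_out(size, pool=2):
    # non-overlapping pooling window; any leftover row/column is dropped
    return size // pool

size, channels = 64, 3
total_params = 0

for filters in (32, 64):
    # each filter has 3*3*channels weights plus one bias
    total_params += (3 * 3 * channels + 1) * filters
    size, channels = conv_out(size), filters
    print(f"after Conv2D({filters}):  {size}x{size}x{channels}")
    size = pool_out(size)
    print(f"after MaxPooling2D: {size}x{size}x{channels}")

flat = size * size * channels
print(f"after Flatten:      {flat}")

total_params += (flat + 1) * 128   # Dense(128): weights + biases
total_params += (128 + 1) * 10     # Dense(10): weights + biases
print(f"total trainable parameters: {total_params}")
```

The total should match the figure reported by `model.summary()`; almost all of the parameters sit in the first dense layer.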

Needed to add another dense layer due to error.

This model is designed for a classification task with 10 classes.
It works well for small image datasets and can be trained quickly.

In [None]:
model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])
model.summary()

### Training the model:

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.image import resize
import numpy as np

In [None]:
# resize the 69x69 images to 64x64 so they match the model's Input shape
X_train = np.array([resize(img, (64, 64)).numpy() for img in X_train])
X_val = np.array([resize(img, (64, 64)).numpy() for img in X_val])

In [None]:
datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True
)

In [None]:
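train_gen=datagen.flow(X_train,y_train,batch_size=32)
# validation images should not be augmented, so they go through a plain generator
val_gen=ImageDataGenerator().flow(X_val,y_val,batch_size=32,shuffle=False)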

In [None]:
history=model.fit(train_gen,validation_data=val_gen,epochs=10)

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epochs') 
plt.ylabel('Accuracy')
plt.title("Training and Validation Accuracy of Custom CNN")
plt.suptitle("Training accuracy shows the model getting better at identifying galaxies over the epochs;\nvalidation accuracy shows how well it does on unseen data", fontsize=9)
plt.legend()
plt.show()

In the beginning we separated the data into training and validation sets. The line plot above shows how well the model performed on the labelled training data and how well it classified the validation data, which it was not trained on. Since we know the true labels for the validation data, we can compare the model's predictions against them to get the validation accuracy.
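
The comparison described here comes down to a few lines. A minimal sketch with made-up softmax outputs for four images and three classes (in the notebook, the probabilities would come from `model.predict` on the validation images, and there would be 10 columns):

```python
import numpy as np

# Made-up softmax outputs: one row per image, one column per class
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
y_true = np.array([0, 1, 1, 0])  # known labels for the same four images

# The predicted class is the one with the highest probability
y_pred = probs.argmax(axis=1)

# Accuracy is the fraction of predictions that match the known labels
accuracy = (y_pred == y_true).mean()
print(f"Predicted: {y_pred}, true: {y_true}, accuracy: {accuracy:.2f}")
```

Three of the four toy predictions match their labels, so the accuracy here is 0.75.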