
# STAT 441: Statistical Learning - Classification
## Classifying Astrophysical Images
Prepared by:\
Darren Alexander Lam Kin Teng\
Ojus Udagani\
Raghuv

In [19]:
# ! pip install tensorflow

In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import zipfile
import os

In [16]:
# Import Keras through tensorflow.keras so all components come from the same installation
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.activations import relu, sigmoid

## Importing the dataset

In [17]:
# Initial paths; these still include the archive name and are replaced after extraction
training_path = 'data.zip/astro_dataset_maxia/astro_dataset_maxia/training'
test_path = 'data.zip/astro_dataset_maxia/astro_dataset_maxia/test'
validation_path = 'data.zip/astro_dataset_maxia/astro_dataset_maxia/validation'

# Task
Extract the contents of `data.zip`, update the `training_path`, `test_path`, and `validation_path` variables to the extracted directories, and initialize `ImageDataGenerator` instances for the three datasets: rescaling for all of them, plus augmentation (shear range, zoom range, and horizontal flip) for the training data. Then use `flow_from_directory` to serve the images in batches for model training.

## Extract Dataset

### Subtask:
Extract the contents of 'data.zip' to a specified directory. This will make the image files accessible for processing.


**Reasoning**:
To extract the contents of 'data.zip', I open the archive with the `zipfile` module and call `extractall()`, which with no arguments writes the files into the current working directory.



In [29]:
with zipfile.ZipFile('data.zip', 'r') as zip_ref:
    # Extract all the contents into the current directory
    zip_ref.extractall()

print("Extracted data.zip successfully.")

Extracted data.zip successfully.
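
The extraction step can also be written so that it reports what the archive contains, which helps when choosing the paths in the next step. The following is a minimal sketch, not the notebook's own cell: the helper name `extract_archive` and its `target_dir` parameter are hypothetical, and it assumes the archive uses forward slashes in member names, as zip files normally do.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir="."):
    """Extract zip_path into target_dir and return its top-level entry names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # First path component of every member, e.g. the dataset's root folder
        top_level = sorted({name.split("/")[0] for name in zip_ref.namelist() if name})
        zip_ref.extractall(target)
    return top_level


# Usage, assuming data.zip sits next to the notebook as in the cell above
if Path("data.zip").exists():
    print(extract_archive("data.zip"))
```

Returning the top-level names makes it easy to see whether the archive unpacks into a single root folder or scatters files into the working directory.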


## Update Data Paths

### Subtask:
Adjust the 'training_path', 'test_path', and 'validation_path' variables to point to the correct directories after extraction.


**Reasoning**:
The previous step extracted the contents of 'data.zip' into the current directory. The original paths still begin with `data.zip/`, so they do not refer to real directories. I need to update `training_path`, `test_path`, and `validation_path` to reflect the directory structure after extraction.
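
Before hard-coding the new paths, it can help to confirm where the split folders actually landed and how many class subfolders each contains. The sketch below uses only the standard library. The helper names `find_split_dirs` and `count_classes` are hypothetical, and the split folder names are assumed to be `training`, `validation`, and `test`, as in the paths used in this notebook.

```python
import os


def find_split_dirs(root, split_names=("training", "validation", "test")):
    """Return {split_name: path} for the first directory found with each name under root."""
    found = {}
    for dirpath, dirnames, _ in os.walk(root):
        for name in dirnames:
            if name in split_names and name not in found:
                found[name] = os.path.join(dirpath, name)
    return found


def count_classes(split_dir):
    """Count immediate subdirectories, which flow_from_directory treats as classes."""
    return sum(
        os.path.isdir(os.path.join(split_dir, entry))
        for entry in os.listdir(split_dir)
    )


# Usage, assuming the archive was extracted into the current directory
if os.path.isdir("astro_dataset_maxia"):
    for split, path in find_split_dirs("astro_dataset_maxia").items():
        print(split, path, count_classes(path))
```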



In [30]:
training_path = 'astro_dataset_maxia/astro_dataset_maxia/training'
test_path = 'astro_dataset_maxia/astro_dataset_maxia/test'
validation_path = 'astro_dataset_maxia/astro_dataset_maxia/validation'

print(f"Updated training_path: {training_path}")
print(f"Updated test_path: {test_path}")
print(f"Updated validation_path: {validation_path}")

Updated training_path: astro_dataset_maxia/astro_dataset_maxia/training
Updated test_path: astro_dataset_maxia/astro_dataset_maxia/test
Updated validation_path: astro_dataset_maxia/astro_dataset_maxia/validation


## Initialize ImageDataGenerators

### Subtask:
Set up `ImageDataGenerator` instances for the training, validation, and test datasets. All three rescale pixel values; only the training generator applies augmentation (shear range, zoom range, and horizontal flip).


**Reasoning**:
I need to initialize three `ImageDataGenerator` instances: one for training with rescaling plus augmentation (shear range, zoom range, horizontal flip), and one each for the validation and test datasets with rescaling only, so that evaluation images are not randomly distorted.



In [31]:
# Training: rescale pixels to [0, 1] and apply random augmentation
train_datagen = ImageDataGenerator(rescale=1./255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True)

# Validation and test: rescale only, no augmentation
validation_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

print("ImageDataGenerators initialized successfully.")

ImageDataGenerators initialized successfully.


**Reasoning**:
Now that the ImageDataGenerator instances are initialized, I need to create data generators using `flow_from_directory` for the training, validation, and test datasets. This will prepare the image data in batches for model training, specifying target size, batch size, and class mode.



In [32]:
image_width, image_height = 150, 150
batch_size = 32

# Keras expects target_size as (height, width)
target_size = (image_height, image_width)

training_set = train_datagen.flow_from_directory(
    training_path,
    target_size=target_size,
    batch_size=batch_size,
    class_mode='categorical'
)

validation_set = validation_datagen.flow_from_directory(
    validation_path,
    target_size=target_size,
    batch_size=batch_size,
    class_mode='categorical'
)

test_set = test_datagen.flow_from_directory(
    test_path,
    target_size=target_size,
    batch_size=batch_size,
    class_mode='categorical'
)

print("Data generators created successfully.")

Found 2416 images belonging to 12 classes.
Found 658 images belonging to 12 classes.
Found 345 images belonging to 12 classes.
Data generators created successfully.
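
The image counts reported above determine how many batches each generator yields per pass over its data. As a small worked example, the arithmetic can be checked with the standard library alone. The helper name `batches_per_epoch` is hypothetical; the counts and batch size come from the cell output above.

```python
import math


def batches_per_epoch(n_images, batch_size):
    """Return (number of batches, size of the final batch) for one pass over the data."""
    n_batches = math.ceil(n_images / batch_size)
    # The final batch holds whatever is left after the full batches
    last_batch = n_images - (n_batches - 1) * batch_size
    return n_batches, last_batch


batch_size = 32
# Image counts reported by flow_from_directory in the output above
image_counts = {"training": 2416, "validation": 658, "test": 345}

for split, n_images in image_counts.items():
    n_batches, last_batch = batches_per_epoch(n_images, batch_size)
    print(f"{split}: {n_batches} batches, last batch has {last_batch} images")
```

With a batch size of 32 this gives 76 training batches, 21 validation batches, and 11 test batches, with the last batch in each split smaller than 32.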


## Final Task

### Subtask:
Confirm that the ImageDataGenerators are successfully set up and ready to provide image data in batches for model training.


## Summary:

### Data Analysis Key Findings
*   The `data.zip` file was successfully extracted, making the image dataset accessible.
*   The `training_path`, `test_path`, and `validation_path` variables were correctly updated to point to the respective directories: `astro_dataset_maxia/astro_dataset_maxia/training`, `astro_dataset_maxia/astro_dataset_maxia/test`, and `astro_dataset_maxia/astro_dataset_maxia/validation`.
*   `ImageDataGenerator` instances were successfully initialized:
    *   `train_datagen` was configured with `rescale=1./255`, `shear_range=0.2`, `zoom_range=0.2`, and `horizontal_flip=True` for data augmentation.
    *   `validation_datagen` and `test_datagen` were configured only with `rescale=1./255`.
*   Data generators were created using `flow_from_directory` for each dataset, with images resized to (150, 150) and a `batch_size` of 32, using `'categorical'` class mode.
    *   The training set contains 2416 images belonging to 12 classes.
    *   The validation set contains 658 images belonging to 12 classes.
    *   The test set contains 345 images belonging to 12 classes.
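
To make the generator settings listed above more concrete, the following toy example reproduces two of them by hand in NumPy. It is an illustration under stated assumptions, not what Keras runs internally: the 2x3 array is made up, the flip is applied unconditionally rather than at random, and shear and zoom are omitted.

```python
import numpy as np

# A toy 2x3 single-channel "image" with 8-bit pixel values (made up for illustration)
image = np.array([[0, 128, 255],
                  [64, 32, 16]], dtype=np.uint8)

# rescale=1./255 multiplies every pixel by 1/255, mapping [0, 255] to [0, 1]
rescaled = image.astype(np.float32) * (1.0 / 255)

# horizontal_flip=True mirrors some images left-to-right at random;
# here the flip is applied unconditionally to show its effect
flipped = rescaled[:, ::-1]

# class_mode='categorical' yields one-hot label vectors; with 12 classes,
# class index 3 becomes a length-12 vector with a single 1 at position 3
one_hot = np.eye(12, dtype=np.float32)[3]

print(rescaled.min(), rescaled.max())
print(flipped[0])
print(one_hot)
```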

### Insights or Next Steps
*   The data pipeline is in place: training batches are augmented on the fly, while validation and test batches are only rescaled to [0, 1], so evaluation runs on unaugmented images.
*   The next logical step is to define, compile, and train a convolutional neural network (CNN) model using these prepared data generators.
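
As a possible starting point for that next step, a small CNN compatible with the generators could look like the sketch below. It is not a model defined in this notebook: the function name `build_baseline_cnn`, the filter counts, the layer depth, and the optimizer are all illustrative assumptions. Only the input size (150 x 150, assumed to be RGB) and the 12 output classes follow from the data pipeline above.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense


def build_baseline_cnn(input_shape=(150, 150, 3), num_classes=12):
    """Small CNN; filter counts and layer depth are arbitrary starting points."""
    model = Sequential([
        Input(shape=input_shape),
        Conv2D(32, (3, 3), activation="relu"),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation="relu"),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(64, activation="relu"),
        # One output per class; softmax pairs with class_mode='categorical'
        Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_baseline_cnn()
model.summary()
```

Training would then be along the lines of `model.fit(training_set, validation_data=validation_set, epochs=10)`, where the epoch count is a placeholder to be tuned.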
