# Course Project: Galaxy Classifier

### Lexy Andershock, Gian Fernandez-Aleman, David Long

#### Introduction
This project for our Intro to Machine Learning course (COSC325) applies a logistic regression model to a dataset of color galaxy images (Galaxy10 SDSS). The dataset was originally used with a CNN, a more appropriate model for image data, but we wanted to see how far we could get with one of the techniques we learned in class.

#### Who is this for?
Our target audience includes astronomers, physicists, space researchers and enthusiasts, educators, and anyone with an interest in galaxy-shape identification.

#### Purpose of the Project
The problem we are trying to solve is identifying the shapes of newly discovered galaxies. The dataset contains 10 categories of galaxy shape, and we want to train a model that assigns one of these categories to galaxy images it has not seen before. With new space telescopes such as JWST, we are discovering galaxies at an accelerating rate, and classifying them quickly for study would be far more practical than doing it by hand. A system that handles large volumes of galaxy images and produces accurate results would save users considerable time compared with determining the shapes manually.

### 1. Setting Up The Environment
We first import the libraries we will use for data manipulation and modeling.

In [27]:
# Import required libraries
import h5py
import numpy as np
import matplotlib.pyplot as plt

# Scikit-learn libraries and routines
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

### 2. Organizing Data
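We then load the data from the dataset file and organize it into two arrays: images and labels. Each image is flattened into a single row of pixel values, and the pixel values are rescaled to the range 0 to 1.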

In [28]:
# To get the images and labels from file
with h5py.File('Galaxy10.h5', 'r') as data:
    images = np.array(data['images'])
    labels = np.array(data['ans'])

# Flatten images
n_samples, height, width, channels = images.shape
X = images.reshape(n_samples, height * width * channels)
y = labels

# Normalize pixel values between 0 and 1
X = X.astype(np.float32) / 255.0
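
To make the flattening step concrete, here is a minimal sketch on a synthetic batch of images rather than the real file. The helper name `flatten_and_normalize`, the `fake_images` array, and the 69x69x3 image shape are assumptions for illustration only; the notebook itself reads the shape from `images.shape`.

```python
import numpy as np


def flatten_and_normalize(images):
    """Flatten (n, h, w, c) uint8 images into (n, h*w*c) float32 rows in [0, 1]."""
    n_samples = images.shape[0]
    # -1 lets NumPy infer height * width * channels
    flat = images.reshape(n_samples, -1)
    return flat.astype(np.float32) / 255.0


# Synthetic stand-in for the dataset: 4 random RGB images (shape is an assumption)
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 69, 69, 3), dtype=np.uint8)

X_demo = flatten_and_normalize(fake_images)
print(X_demo.shape)                 # (4, 14283), since 69 * 69 * 3 = 14283
print(X_demo.min(), X_demo.max())   # both values lie within [0, 1]
```

Each image becomes one long feature vector, so the model sees every pixel channel as an independent feature and loses the spatial layout that a CNN would exploit.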

### 3. Splitting Data
Once the data is separated into features and labels, we split it into training and test sets, keeping 80% for training.

In [29]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 42)
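
The split above shuffles the samples but does not stratify by class. As an optional variation (not what the notebook does), `train_test_split` accepts a `stratify` argument that keeps class proportions roughly equal in both sets, which can matter when some categories are rare. The sketch below uses synthetic, deliberately imbalanced labels; `X_demo`, `y_demo`, and `class_counts` are made-up names and values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 10-class labels (counts are made up for illustration)
class_counts = np.array([300, 200, 150, 100, 80, 60, 50, 30, 20, 10])
y_demo = np.repeat(np.arange(10), class_counts)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(len(y_demo), 5))

# stratify=y_demo asks for the same class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=42, stratify=y_demo
)

# Per-class counts in the training split, expected to be close to 80% of each class
train_counts = np.bincount(y_tr, minlength=10)
print(train_counts)
```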

### 4. Scaling Features
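We'll standardize the features so that logistic regression converges better. The scaler is fit on the training set only and then reused on the test set, so no information from the test set leaks into training.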

In [30]:
# Create and fit the scaler on training data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

# Use the same scaler to transform the test set
X_test = scaler.transform(X_test)
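
As a small check of what "use the same scaler" means, the sketch below shows on synthetic data that `transform` applies the mean and standard deviation learned from the training set, not statistics recomputed from the test set. The arrays `train_demo` and `test_demo` are made-up stand-ins.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test features
rng = np.random.default_rng(0)
train_demo = rng.uniform(0, 1, size=(200, 3))
test_demo = rng.uniform(0, 1, size=(50, 3))

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean/std from train
test_scaled = scaler_demo.transform(test_demo)        # reuses the train mean/std

# Reproduce the test transform by hand with the *training* statistics
manual = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)
matches = np.allclose(test_scaled, manual)
print(matches)
```

After fitting, the training columns have mean close to 0 and standard deviation close to 1, while the test columns are only approximately standardized because they are scaled with the training statistics.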

### 5. Training The Model!
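We'll now train the logistic regression model. It is fit with stochastic gradient descent (`SGDClassifier` with a log loss) and wrapped in `OneVsRestClassifier`, which trains one binary classifier per galaxy category.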

In [31]:
# Use a smaller subset (training on the full set took at least 9 minutes)
X_train_small = X_train[:7000]
y_train_small = y_train[:7000]

model = OneVsRestClassifier(SGDClassifier(loss = 'log_loss', max_iter = 500))
model.fit(X_train_small, y_train_small)



### 6. Predicting using the model
We'll predict 