# Course Project: Galaxy Classifier

### Lexy Andershock, Gian Fernandez-Aleman, David Long

#### Introduction
This project for our Intro to Machine Learning course (COSC325) applies a logistic regression model to a dataset of color galaxy images (Galaxy10 SDSS). Although the dataset was originally used with a CNN (a more appropriate model for image data), we wanted to see how far we could get with one of the techniques we learned in class.

#### Who is this for?
Our target audience includes astronomers, physicists, space researchers, educators, space enthusiasts, and anyone with an interest in galaxy-shape identification.

#### Purpose of the Project
The problem we are trying to solve is identifying the shapes of newly discovered galaxies. The dataset contains 10 categories of galaxy shapes, and we want to train a model that can assign one of these categories to galaxy images it has not encountered before. With new space telescopes such as JWST, galaxies are being discovered at an accelerating rate, so classifying them quickly and automatically would be far more practical than doing it by hand. A system that can process large collections of galaxy images and produce accurate labels would save its users considerable time compared with determining each galaxy's shape manually.

### 1. Setting Up The Environment
We must ensure that we have imported all the appropriate libraries that we will utilize for our data manipulation and modeling.

In [None]:
# Import required libraries
import h5py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from tensorflow.keras import utils
from matplotlib.colors import LinearSegmentedColormap

# Scikit-learn libraries and routines
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from scipy.optimize import minimize
from sklearn.multiclass import OneVsRestClassifier

### 2. Organizing Data
We then extract the data from the dataset file and organize it into two arrays: images and labels.

In [1]:
# To get the images and labels from file
with h5py.File('Galaxy10.h5', 'r') as data:
    images = np.array(data['images'])
    labels = np.array(data['ans'])

# To convert the labels to categorical 10 classes
labels = utils.to_categorical(labels, 10)

# To convert to desirable type
labels = labels.astype(np.float32)
images = images.astype(np.float32)

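For readers unfamiliar with one-hot encoding, the label conversion in this step can be illustrated with NumPy alone. This is a minimal sketch on made-up labels, not the Galaxy10 data; `one_hot` is a hypothetical helper that mirrors what `utils.to_categorical` does for integer class labels.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Hypothetical helper: turn integer labels into one-hot rows."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype=np.float32)[labels]


toy_labels = np.array([0, 3, 9, 3])      # made-up class indices
encoded = one_hot(toy_labels, 10)        # shape (4, 10), one 1.0 per row
decoded = np.argmax(encoded, axis=1)     # back to integer labels

print(encoded.shape)
print(decoded)
```

scikit-learn classifiers expect integer class labels, which is why the notebook later undoes this encoding with `np.argmax(..., axis=1)`.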

### 3. Splitting Data
Once the data is separated into images and labels, we split it into a training set (90%) and a test set (10%).

In [None]:
# Split the data into training and test sets
train_idx, test_idx = train_test_split(np.arange(labels.shape[0]), test_size=0.1)
train_images, train_labels = images[train_idx], labels[train_idx]
test_images, test_labels = images[test_idx], labels[test_idx]
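One possible refinement, not part of the original notebook, is to fix the random seed and stratify the split so that each class keeps roughly the same proportion in both sets. The sketch below uses made-up labels; the class sizes and the `random_state` value are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up integer labels: 3 imbalanced classes, 1000 samples in total
toy_labels = np.repeat([0, 1, 2], [700, 200, 100])

train_idx, test_idx = train_test_split(
    np.arange(toy_labels.shape[0]),
    test_size=0.1,
    stratify=toy_labels,   # keep class proportions in both splits
    random_state=0,        # arbitrary seed, for reproducibility
)

# Count how many samples of each class landed in each split
train_counts = np.bincount(toy_labels[train_idx])
test_counts = np.bincount(toy_labels[test_idx])
print(train_counts, test_counts)
```

In this notebook the labels are one-hot encoded at this point, so the `stratify` argument would need integer labels such as `np.argmax(labels, axis=1)`.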

### 4. Predicting
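We flatten and standardize the images, train a one-vs-rest logistic regression model on the training set, and then evaluate its predictions on the test set with a confusion matrix and a classification report.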

In [None]:
# Because we are dealing with image data and we are working with logistic
# regression, a linear model, the images will have to be flattened.
n_samples, height, width, channels = train_images.shape
train_images_flat = train_images.reshape(n_samples, -1)
test_images_flat = test_images.reshape(test_images.shape[0], -1)

# Standardizing the data will allow logistic regression to perform better
scaler = StandardScaler()
train_images_scaled = scaler.fit_transform(train_images_flat)
test_images_scaled = scaler.transform(test_images_flat)

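# OneVsRestClassifier already fits one binary logistic regression per class,
# so the multi_class argument (deprecated in recent scikit-learn versions)
# is not needed on the inner estimator.
log_reg = OneVsRestClassifier(LogisticRegression(max_iter=5000, solver='saga'))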

log_reg.fit(train_images_scaled, np.argmax(train_labels, axis=1))

predictions = log_reg.predict(test_images_scaled)

cm = confusion_matrix(np.argmax(test_labels, axis=1), predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.show()

print(classification_report(np.argmax(test_labels, axis=1), predictions))
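The same flatten, scale, fit, and evaluate steps can be tried quickly on a much smaller dataset. The sketch below substitutes scikit-learn's bundled 8x8 digit images for Galaxy10 (an assumption made purely so the example runs in seconds) and wraps the scaler and classifier in a pipeline; the seed, the default solver, and the `max_iter` value are arbitrary choices, not the notebook's settings.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in image data: 8x8 grayscale digits with 10 classes
digits = load_digits()
images = digits.images                     # shape (n_samples, 8, 8)
X = images.reshape(images.shape[0], -1)    # flatten to (n_samples, 64)
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)

# The pipeline fits the scaler on the training data only, then the classifier
model = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")
```

Bundling the scaler and the classifier this way makes it harder to accidentally fit the scaler on test data, which the notebook avoids manually by calling `fit_transform` on the training set and `transform` on the test set.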

