# SDSS Performance Comparison | Hollis McLeod, Elijah Johnson

This code is the joint work of Hollis McLeod and Elijah Johnson for the term project in AST4930 - Special Topics: Machine Learning (Fall 2025, Section #22904) at the University of Florida. It may not be viewed, used, analyzed, or otherwise handled in any way that does not comply with the [University of Florida Honor Code](https://policy.ufl.edu/regulation/4-040/). Violations will be reported accordingly.

## Overview

This Jupyter notebook reads and processes [SDSS data](https://skyserver.sdss.org/dr19/SearchTools/sql), trains machine learning models on it, and visualizes the results.

500,000 rows were downloaded as *SDSS.csv*, of which 100,000 are currently planned for use. If computational power permits, more rows will be used to make the analysis more rigorous. The fields and the query used to retrieve them are listed below.

**Fields**:
- Identifiers and sky coordinates of astronomical objects (not used for training)
  - `objid`
  - `ra`
  - `dec`
  - `specobjid`
- Magnitudes
  - `u`
  - `g`
  - `r`
  - `i`
  - `z`
- Color indices (calculated)
  - `u-g`
  - `g-r`
  - `r-i`
  - `i-z`
- `redshift`
- `class`
  - "GALAXY"
  - "QSO" (shorthand for quasar)
  - "STAR"

**SQL Code**:
```sql
SELECT TOP 500000
p.objid,p.ra,p.dec,p.u,p.g,p.r,p.i,p.z,
p.u-p.g as 'u-g', p.g-p.r as 'g-r', p.r-p.i as 'r-i', p.i-p.z as 'i-z',
s.specobjid, s.class, s.z as redshift
FROM PhotoObj AS p
JOIN SpecObj AS s ON s.bestobjid = p.objid
WHERE s.class IN ('GALAXY', 'QSO', 'STAR')
```
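
The color indices in the query are differences between adjacent magnitude bands. As a minimal sketch, assuming the magnitudes are already in a pandas DataFrame, the same columns could be derived in Python. The magnitude values below are hypothetical and only illustrate the arithmetic.

```python
import pandas as pd

# Hypothetical magnitudes for two objects (illustrative values, not SDSS data).
mags = pd.DataFrame({
    'u': [18.0, 20.5],
    'g': [16.5, 19.0],
    'r': [15.5, 18.25],
    'i': [15.0, 17.75],
    'z': [14.75, 17.5],
})

# Each color index is the difference between two adjacent bands,
# matching the 'u-g', 'g-r', 'r-i', and 'i-z' aliases in the SQL query.
BANDS = ['u', 'g', 'r', 'i', 'z']
colors = pd.DataFrame({
    f'{first}-{second}': mags[first] - mags[second]
    for first, second in zip(BANDS[:-1], BANDS[1:])
})

print(colors)
```

Computing the indices in the query, as done here, keeps the notebook free of this step; the sketch is only useful if the raw magnitudes are loaded without them.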

## Setup

### Install and Import Packages

In [None]:
# Install if not already installed.
%pip install pandas
%pip install numpy
%pip install scipy
%pip install matplotlib
%pip install scikit-learn

In [14]:
# Keep all imports in one place for easy reference.
import pandas as pd

from sklearn import model_selection
from sklearn.model_selection import train_test_split

### Define Paths and Constants

In [9]:
# Paths are kept in a single dictionary for easy access and brevity
paths = {
    # location of SDSS data
    'data': './data/SDSS.csv',

    # base directories for saved graphs, charts, models
    'supervised_products': './products/supervised/',
    'unsupervised_products': './products/unsupervised/',

    # subdirectories for specific products, appended to a base directory
    'graphs': 'graphs/',
    'charts': 'charts/',
    'models': 'models/'
}
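
One possible way to combine a base directory with a product subdirectory is sketched below. The helper `product_path` and the file name `knn_confusion.png` are hypothetical and are not part of the notebook; the dictionary is a trimmed copy of the one above.

```python
import os

# Trimmed copy of the paths dictionary, kept short for illustration.
paths = {
    'supervised_products': './products/supervised/',
    'graphs': 'graphs/',
}

def product_path(base_key, kind_key, filename):
    """Join a base directory, a product subdirectory, and a file name.

    Hypothetical helper: the notebook does not define it.
    """
    return os.path.join(paths[base_key], paths[kind_key], filename)

# Hypothetical file name for a saved graph.
example = product_path('supervised_products', 'graphs', 'knn_confusion.png')
print(example)
```

The sketch only builds the path string. The directory would still need to exist (for example via `os.makedirs(..., exist_ok=True)`) before anything is saved there.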

In [15]:
# Number of rows to read from the dataset
NROWS = 100000

# The column names of the features and the class from the dataset
FEATURES = ['u', 'g', 'r', 'i', 'z', 'u-g', 'g-r', 'r-i', 'i-z', 'redshift']
CLASS = 'class'

### Import Data and Perform Preliminary Processing

In [16]:
data = pd.read_csv(paths['data'], nrows=NROWS)
data

Unnamed: 0,objid,ra,dec,u,g,r,i,z,u-g,g-r,r-i,i-z,specobjid,class,redshift
0,1237668705156530825,263.370025,7.223793,18.08503,16.13823,15.28424,14.93009,14.71180,1.946796,0.853986,0.354156,0.218284,3149144865661544448,STAR,-0.000976
1,1237668705693467044,263.664289,7.488496,20.73099,19.08960,18.26207,17.86547,17.68044,1.641390,0.827530,0.396591,0.185038,3172787664193611776,GALAXY,0.735104
2,1237668705693468328,263.711809,7.529318,19.98693,18.40730,17.72147,17.44062,17.31446,1.579628,0.685833,0.280849,0.126162,3149149538585962496,STAR,-0.000280
3,1237668571475018376,262.378986,6.996502,25.06067,19.90350,18.44531,17.69727,17.27005,5.157169,1.458197,0.748039,0.427219,3172843739286628352,STAR,-0.000059
4,1237671696061236496,263.684828,8.193936,20.79169,19.28190,18.61582,18.31616,18.15825,1.509792,0.666075,0.299660,0.157909,3172804431745935360,STAR,-0.000074
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,1237666209239269489,255.838908,27.333566,18.55913,16.88940,16.21973,15.95857,15.83906,1.669731,0.669672,0.261165,0.119510,3161585838186326016,STAR,-0.000256
99996,1237662638523220956,240.927307,9.760700,23.87011,21.08361,20.28882,20.15691,20.16212,2.786507,0.794788,0.131912,-0.005213,5493271238775429120,QSO,3.426004
99997,1237651737930499236,239.576144,2.691617,23.77327,23.11989,21.43978,20.39449,19.64420,0.653383,1.680111,1.045284,0.750292,5411089616244856832,GALAXY,0.633541
99998,1237662303525143296,254.194683,26.466377,24.63793,21.74906,19.97466,18.97339,18.56893,2.888876,1.774393,1.001272,0.404459,4706452196270823424,GALAXY,0.496634


In [17]:
# Separate the feature matrix X from the class labels Y
X = data[FEATURES].values
Y = data[CLASS]

print(f'SHAPES:\n\t{X.shape}\n\t{Y.shape}')

SHAPES:
	(100000, 10)
	(100000,)
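
The imports include `train_test_split`, so a natural next step is to hold out a test set. The following is a minimal sketch of a stratified split, not the notebook's actual procedure. The synthetic `X_demo` and `Y_demo` arrays, the 80/20 ratio, and the random seed are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and Y: 100 objects, 10 features, 3 classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 10))
Y_demo = np.array(['GALAXY'] * 50 + ['STAR'] * 30 + ['QSO'] * 20)

# stratify keeps the class proportions the same in both subsets,
# which matters when the classes are imbalanced.
X_train, X_test, Y_train, Y_test = train_test_split(
    X_demo, Y_demo, test_size=0.2, stratify=Y_demo, random_state=0
)

print(X_train.shape, X_test.shape)
```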


## Supervised Machine Learning

This section trains, optimizes, and tests the following supervised machine learning algorithms, and saves the resulting products (graphs, charts, and models):

- k-Nearest Neighbors (kNN)
- Decision Tree (DT)
- Support Vector Machine Classifier (SVM-C)
- Random Forest (RF)
- Adaptive Boosting (AdaBoost) with Base DT
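
As a rough sketch of how the classifiers listed above could be compared, the following block instantiates each one with scikit-learn and scores it with cross-validation. The synthetic dataset, the hyperparameters, and the 3-fold setting are assumptions made for illustration; they are not the notebook's tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem with 10 features, standing in for the SDSS data.
X_demo, Y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# One estimator per algorithm in the list; hyperparameters are placeholders.
models = {
    'kNN': KNeighborsClassifier(n_neighbors=5),
    'DT': DecisionTreeClassifier(random_state=0),
    'SVM-C': SVC(),
    'RF': RandomForestClassifier(n_estimators=50, random_state=0),
    # The base estimator is passed positionally: a depth-1 decision tree.
    'AdaBoost': AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), random_state=0),
}

# Mean accuracy over 3 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X_demo, Y_demo, cv=3).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f'{name}: {score:.3f}')
```

Distance-based models such as kNN and SVM-C are sensitive to feature scale, so on the real data a scaling step (for example `StandardScaler` in a pipeline) would likely be worth testing.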

## Unsupervised Machine Learning

This section trains, optimizes, and tests the following unsupervised machine learning algorithms, and saves the resulting products (graphs, charts, and models):

- Support Vector Machine Regressor (SVM-R)
- Gaussian Mixture Model (GMM) Clustering
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
- Neural Network (NN) with [XXXX] layers of [XXXX] nodes
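
The two clustering algorithms in the list can be sketched with scikit-learn as follows. The synthetic blobs, the number of mixture components, and the DBSCAN parameters `eps` and `min_samples` are assumptions for illustration only and would need tuning on the SDSS features.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic data with 3 groups and 10 features, standing in for the SDSS features.
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=10, random_state=0)

# Scale features so that distance-based clustering treats them equally.
X_scaled = StandardScaler().fit_transform(X_demo)

# GMM: one Gaussian component per expected class (GALAXY, QSO, STAR).
gmm = GaussianMixture(n_components=3, random_state=0)
gmm_labels = gmm.fit_predict(X_scaled)

# DBSCAN: clusters are dense regions; points labeled -1 are treated as noise.
dbscan = DBSCAN(eps=1.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)

print(sorted(set(gmm_labels)))
print(sorted(set(dbscan_labels)))
```

Neither algorithm sees the `class` column during fitting, so any comparison against the true labels has to be done afterwards.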
