# Title

## Introduction 

Stars are large spheres of hot gas that emit heat and light into space. They are composed mostly of hydrogen, with some helium and other elements. The Sun is the closest star to Earth (NASA).

Galaxies are collections of stars, planets, gas, and dust held together by gravity. They are very large and emit light from the stars and other matter they contain. The Milky Way is the galaxy that Earth is part of (NASA).

Quasars are the cores of active galaxies and are powered by supermassive black holes. They emit immense amounts of heat and light due to the friction of material being drawn in. The closest quasar to Earth is 3C 273, which can be seen with an 8-inch telescope (Space.com).

The classification of celestial objects into stars, galaxies, and quasars has been pivotal for understanding Earth's position in space. It has led to key insights, such as the discovery that the Andromeda galaxy is separate from our own, and it remains essential for astronomical research (Clarke 2019).

In this report, we will use data on celestial objects to answer the following question: "Based on its redshift and brightness in different wavelengths of light, what type of celestial object is this?" 

Our data set is from Sloan Digital Sky Survey Data Release 16. It was collected by the Sloan Digital Sky Survey telescope, a powerful telescope designed to measure spectral characteristics. It contains data on light emitted from galaxies, quasars, and stars, including redshift, which reflects how quickly an object is moving relative to us (SDSS Voyages, 2024b), and brightness in five wavelength bands. Below is a list of the variables collected and what they represent:

* obj_ID = Object Identifier, the unique value that identifies the object in the image catalog used by the CAS
* alpha = Right Ascension angle
* delta = Declination angle
* u = Ultraviolet filter in the photometric system
* g = Green filter in the photometric system
* r = Red filter in the photometric system
* i = Near Infrared filter in the photometric system
* z = Infrared filter in the photometric system
* run_ID = Run Number used to identify the specific scan
* rerun_ID = Rerun Number to specify how the image was processed
* cam_col = Camera column to identify the scanline within the run
* field_ID = Field number to identify each field
* spec_obj_ID = Unique ID used for optical spectroscopic objects (this means that 2 different observations with the same spec_obj_ID must share the output class)
* class = object class (galaxy, star, or quasar object)
* redshift = redshift value based on the increase in wavelength
* plate = plate ID, identifies each plate in SDSS
* MJD = Modified Julian Date, used to indicate when a given piece of SDSS data was taken
* fiber_ID = fiber ID that identifies the fiber that pointed the light at the focal plane in each observation

We will use the u, g, r, i, z, and redshift variables as predictors of the class variable.
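To make the redshift variable concrete: redshift is conventionally defined as the fractional increase in wavelength, z = (observed - emitted) / emitted. The sketch below is illustrative only; the function name `redshift` and the example wavelengths are assumptions chosen for this example and are not taken from the SDSS data.

```python
def redshift(observed_nm: float, emitted_nm: float) -> float:
    """Fractional wavelength shift: positive means shifted toward the red."""
    if emitted_nm <= 0:
        raise ValueError("emitted wavelength must be positive")
    return (observed_nm - emitted_nm) / emitted_nm


# Hypothetical example: a spectral line emitted at 500 nm and observed at 550 nm
z_example = redshift(550.0, 500.0)
print(round(z_example, 3))
```

A negative value, as seen for some of the stars in the data, corresponds to light shifted toward shorter wavelengths.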


## Import Libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import altair as alt

alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Methods and Results

In [2]:
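# Load the data file from the web
# Google Drive share link; the file ID is the second-to-last path segment
url = "https://drive.google.com/file/d/1LM-kB1xP90O9RBY5yjRP1mET_BKOOhxC/view?usp=sharing"
# Rewrite the share link into a direct-download URL that pd.read_csv can fetch
url = "https://drive.google.com/uc?id=" + url.split("/")[-2]
star_data = pd.read_csv(url)  # Citation: (Pandas, 2019)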

star_data

Unnamed: 0,objid,ra,dec,u,g,r,i,z,run,rerun,camcol,field,specobjid,class,redshift,plate,mjd,fiberid
0,1237666301628060000,47.372545,0.820621,18.69254,17.13867,16.55555,16.34662,16.17639,4849,301,5,771,8168632633242440000,STAR,0.000115,7255,56597,832
1,1237673706652430000,116.303083,42.455980,18.47633,17.30546,17.24116,17.32780,17.37114,6573,301,6,220,9333948945297330000,STAR,-0.000093,8290,57364,868
2,1237671126974140000,172.756623,-8.785698,16.47714,15.31072,15.55971,15.72207,15.82471,5973,301,1,13,3221211255238850000,STAR,0.000165,2861,54583,42
3,1237665441518260000,201.224207,28.771290,18.63561,16.88346,16.09825,15.70987,15.43491,4649,301,3,121,2254061292459420000,GALAXY,0.058155,2002,53471,35
4,1237665441522840000,212.817222,26.625225,18.88325,17.87948,17.47037,17.17441,17.05235,4649,301,3,191,2390305906828010000,GALAXY,0.072210,2123,53793,74
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,1237667968033620000,228.674917,19.179460,19.32631,18.82748,18.65659,18.60481,18.60917,5237,301,5,134,4448615345201370000,QSO,0.438182,3951,55681,672
99996,1237664818210470000,173.757382,36.441603,18.33687,17.30365,17.16037,17.14895,17.14419,4504,301,2,111,2265404129658560000,STAR,-0.000497,2012,53493,340
99997,1237664295297290000,205.426531,38.499053,17.50690,15.63152,15.22328,15.04469,15.28668,4382,301,4,97,2257446413900210000,GALAXY,0.004587,2005,53472,62
99998,1237656537513130000,337.135144,-9.635967,19.33946,17.21436,16.29697,15.86745,15.51556,2576,301,2,105,811847537492257000,GALAXY,0.084538,721,52228,268
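The URL handling in the loading cell can be read in isolation. The sketch below wraps the same string manipulation in a helper; the name `drive_direct_url` is hypothetical, and it assumes the share link has the `/file/d/<id>/view` shape used in this report.

```python
def drive_direct_url(share_url: str) -> str:
    """Turn a Drive '/file/d/<id>/view' share link into a 'uc?id=<id>' link."""
    # The file ID is the second-to-last segment when splitting on "/"
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?id=" + file_id


share_url = "https://drive.google.com/file/d/1LM-kB1xP90O9RBY5yjRP1mET_BKOOhxC/view?usp=sharing"
direct_url = drive_direct_url(share_url)
print(direct_url)
```

Links with a different layout would need different parsing, so this is a convenience for this one link format rather than a general solution.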


In [3]:
# Filter relevant columns 
star_filtered = (
    star_data.loc[:, ["u", "g", "r", "i", "z", "redshift", "class"]]
    .rename(columns={
        "u":"Ultraviolet", 
        "g":"Green", 
        "r":"Red", 
        "i":"Near Infrared", 
        "z":"Infrared",
        "redshift":"Redshift",
        "class":"Class"
    })
)
star_filtered

Unnamed: 0,Ultraviolet,Green,Red,Near Infrared,Infrared,Redshift,Class
0,18.69254,17.13867,16.55555,16.34662,16.17639,0.000115,STAR
1,18.47633,17.30546,17.24116,17.32780,17.37114,-0.000093,STAR
2,16.47714,15.31072,15.55971,15.72207,15.82471,0.000165,STAR
3,18.63561,16.88346,16.09825,15.70987,15.43491,0.058155,GALAXY
4,18.88325,17.87948,17.47037,17.17441,17.05235,0.072210,GALAXY
...,...,...,...,...,...,...,...
99995,19.32631,18.82748,18.65659,18.60481,18.60917,0.438182,QSO
99996,18.33687,17.30365,17.16037,17.14895,17.14419,-0.000497,STAR
99997,17.50690,15.63152,15.22328,15.04469,15.28668,0.004587,GALAXY
99998,19.33946,17.21436,16.29697,15.86745,15.51556,0.084538,GALAXY


### Classification analysis

In [4]:
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
set_config(transform_output="pandas")

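# Seed NumPy's global RNG so the train/test split below is reproducible
np.random.seed(1234)

# Hold out 25% of the observations for the final evaluation
star_train, star_test = train_test_split(
    star_filtered, train_size=0.75
)

star_knn_1 = KNeighborsClassifier()

# Standardize every predictor so no single variable dominates the distance calculation
star_preprocessor = make_column_transformer(
    (StandardScaler(), ["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"])
)

star_pipeline = make_pipeline(star_preprocessor, star_knn_1)

# Candidate values of k: 2 through 14
parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(2, 15, 1),
}

# Tune k with 5-fold cross-validation on the training set
star_tune = GridSearchCV(
    star_pipeline,
    parameter_grid,
    cv=5,
    return_train_score=True,
    n_jobs=-1
)

star_model = star_tune.fit(star_train[["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"]], star_train["Class"])

# Collect the cross-validation results for each value of k
star_accuracy = pd.DataFrame(star_model.cv_results_)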

star_accuracy


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsclassifier__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.104719,0.001732,0.664377,0.008694,2,{'kneighborsclassifier__n_neighbors': 2},0.965867,0.965733,0.966267,0.966,...,0.9662,0.000499,7,0.9872,0.987483,0.986983,0.987183,0.98775,0.98732,0.000268
1,0.10285,0.000721,0.684058,0.009815,3,{'kneighborsclassifier__n_neighbors': 3},0.971467,0.971067,0.973067,0.971067,...,0.97156,0.000768,1,0.98155,0.981417,0.9813,0.981617,0.981567,0.98149,0.000116
2,0.102858,0.000963,0.692428,0.008662,4,{'kneighborsclassifier__n_neighbors': 4},0.970733,0.971,0.9712,0.972067,...,0.971267,0.000448,2,0.9793,0.979633,0.979467,0.979283,0.980017,0.97954,0.00027
3,0.103647,0.000962,0.711568,0.005125,5,{'kneighborsclassifier__n_neighbors': 5},0.969,0.968667,0.970067,0.970467,...,0.96952,0.000665,3,0.976133,0.976517,0.976317,0.975833,0.9767,0.9763,0.000301
4,0.102907,0.000532,0.716901,0.021471,6,{'kneighborsclassifier__n_neighbors': 6},0.9686,0.968867,0.969267,0.9702,...,0.969093,0.00061,4,0.975233,0.975517,0.975483,0.975017,0.975417,0.975333,0.000186
5,0.103338,0.000565,0.722366,0.010537,7,{'kneighborsclassifier__n_neighbors': 7},0.967467,0.967,0.9686,0.969467,...,0.967907,0.000975,5,0.972767,0.972883,0.973183,0.972767,0.972867,0.972893,0.000153
6,0.10242,0.000383,0.739468,0.020123,8,{'kneighborsclassifier__n_neighbors': 8},0.966933,0.966267,0.968267,0.968733,...,0.967267,0.001053,6,0.972633,0.972367,0.972317,0.9721,0.97205,0.972293,0.000209
7,0.102455,0.000791,0.729915,0.00517,9,{'kneighborsclassifier__n_neighbors': 9},0.965467,0.965333,0.966867,0.966067,...,0.9658,0.000604,8,0.9706,0.970217,0.970133,0.96995,0.970283,0.970237,0.000213
8,0.103189,0.001451,0.745807,0.00761,10,{'kneighborsclassifier__n_neighbors': 10},0.965267,0.9654,0.966467,0.966133,...,0.965693,0.000511,9,0.969983,0.969733,0.96995,0.96975,0.96965,0.969813,0.00013
9,0.102876,0.000831,0.765104,0.003093,11,{'kneighborsclassifier__n_neighbors': 11},0.9636,0.964467,0.965467,0.965267,...,0.9646,0.000689,10,0.96845,0.968283,0.967933,0.968167,0.968,0.968167,0.000188


In [5]:
accuracy_plot = alt.Chart(star_accuracy).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Neighbors").scale(zero=False),
    y=alt.Y("mean_test_score").title("Accuracy estimate").scale(zero=False)
)
accuracy_plot
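The plot can be cross-checked numerically. The sketch below copies the `mean_test_score` values from the rows displayed in the results table (k = 2 to 11; rows for larger k are not shown in the output) and picks the k with the highest estimate. The dictionary is typed in by hand for illustration; with the fitted object, `star_tune.best_params_` and `star_tune.best_score_` report the same information directly.

```python
# Mean cross-validation accuracy per k, copied from the displayed results rows
mean_test_score = {
    2: 0.9662, 3: 0.97156, 4: 0.971267, 5: 0.96952, 6: 0.969093,
    7: 0.967907, 8: 0.967267, 9: 0.9658, 10: 0.965693, 11: 0.9646,
}

# Select the number of neighbors with the highest mean accuracy estimate
best_k = max(mean_test_score, key=mean_test_score.get)
print(best_k, mean_test_score[best_k])
```

This agrees with the `rank_test_score` column, which ranks k = 3 first.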

In [6]:
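# Predict with the best model found by the grid search;
# assign() returns a new data frame instead of modifying a possible slice in place
star_test = star_test.assign(
    Prediction=star_tune.predict(
        star_test[["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"]]
    )
)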
star_test

Unnamed: 0,Ultraviolet,Green,Red,Near Infrared,Infrared,Redshift,Class,Prediction
13504,15.12368,14.08564,13.68744,13.52967,13.36827,0.008385,GALAXY,GALAXY
72599,17.06570,15.39549,14.68195,14.37796,14.19508,-0.000692,STAR,STAR
42217,19.57275,18.39078,17.79880,17.27901,17.11310,0.085377,GALAXY,GALAXY
97650,18.41767,16.17268,15.52201,15.15301,14.90635,0.030430,GALAXY,STAR
99120,19.06618,17.72509,16.97738,16.53360,16.23584,0.117050,GALAXY,GALAXY
...,...,...,...,...,...,...,...,...
28063,18.87199,18.52925,18.61781,18.51241,18.32005,0.434929,QSO,QSO
98158,19.18158,18.82666,18.58927,18.62002,18.42401,0.254364,GALAXY,QSO
15227,17.77467,16.66624,16.21588,15.98671,15.81029,0.023767,GALAXY,GALAXY
92137,16.95530,15.43749,14.84422,14.64529,14.55261,-0.000269,STAR,STAR


In [7]:
star_tune.score(
    star_test[["Ultraviolet", "Green", "Red", "Near Infrared", "Infrared", "Redshift"]],
    star_test["Class"]
)

0.97248

In [8]:
pd.crosstab(star_test["Class"], star_test["Prediction"])

Class (rows) / Prediction (columns),GALAXY,QSO,STAR
GALAXY,12596,36,401
QSO,120,2457,4
STAR,123,4,9259
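The confusion matrix can be summarized per class. The sketch below recomputes overall accuracy and per-class recall and precision from the counts shown above; the variable names are chosen for this example, and rows are taken to be the true class and columns the predicted class, as in the `pd.crosstab` call.

```python
labels = ["GALAXY", "QSO", "STAR"]

# Rows: true class, columns: predicted class (counts from the crosstab output)
confusion = [
    [12596, 36, 401],
    [120, 2457, 4],
    [123, 4, 9259],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

recall = {}
precision = {}
for i, label in enumerate(labels):
    # Recall: share of objects of this true class that were predicted correctly
    recall[label] = confusion[i][i] / sum(confusion[i])
    # Precision: share of predictions of this class that were correct
    precision[label] = confusion[i][i] / sum(row[i] for row in confusion)

print(f"accuracy = {accuracy:.5f}")
for label in labels:
    print(f"{label}: recall = {recall[label]:.4f}, precision = {precision[label]:.4f}")
```

The recomputed accuracy matches the test score of 0.97248 reported by `star_tune.score`.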


## Discussion

## References

* https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17/data
* https://www.sdss4.org/dr17/
* https://www.sdss4.org/instruments/camera/#Filters
* https://science.nasa.gov/universe/stars/
* https://www.aanda.org/articles/aa/full_html/2020/07/aa36770-19/aa36770-19.html#R16
* https://www.space.com/17262-quasar-definition.html
* https://science.nasa.gov/universe/galaxies/