# Stellar Classification: Multiclass Classification

## Context
In astronomy, stellar classification is the classification of stars based on their spectral characteristics. The classification of objects into galaxies, quasars, and stars is one of the most fundamental schemes in astronomy. Early cataloguing of stars and their distribution across the sky led to the understanding that they make up our own galaxy. After Andromeda was recognized as a galaxy separate from our own, numerous galaxies were surveyed as more powerful telescopes were built. This dataset is intended for classifying stars, galaxies, and quasars based on their spectral characteristics.

## Content
The data consists of 100,000 observations of space taken by the SDSS (Sloan Digital Sky Survey). Every observation is described by 17 feature columns and 1 class column, which identifies it as a star, galaxy, or quasar.

obj_ID = Object Identifier, the unique value that identifies the object in the image catalog used by the CAS

alpha = Right Ascension angle (at J2000 epoch)

delta = Declination angle (at J2000 epoch)

u = Ultraviolet filter in the photometric system

g = Green filter in the photometric system

r = Red filter in the photometric system

i = Near Infrared filter in the photometric system

z = Infrared filter in the photometric system

run_ID = Run Number used to identify the specific scan

rerun_ID = Rerun Number to specify how the image was processed

cam_col = Camera column to identify the scanline within the run

field_ID = Field number to identify each field

spec_obj_ID = Unique ID used for optical spectroscopic objects (this means that 2 different observations with the same spec_obj_ID must share the output class)

class = object class (galaxy, star or quasar object)

redshift = redshift value based on the increase in wavelength

plate = plate ID, identifies each plate in SDSS

MJD = Modified Julian Date, used to indicate when a given piece of SDSS data was taken

fiber_ID = fiber ID that identifies the fiber that pointed the light at the focal plane in each observation
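
The `redshift` column is described above as being based on the increase in wavelength. A minimal sketch of the standard definition, z = (observed wavelength - rest wavelength) / rest wavelength, might look like this. The function name and the wavelengths are illustrative assumptions, not values taken from the dataset.

```python
def redshift(observed_wavelength, rest_wavelength):
    """Fractional increase in wavelength: z = (observed - rest) / rest."""
    if rest_wavelength <= 0:
        raise ValueError("rest_wavelength must be positive")
    return (observed_wavelength - rest_wavelength) / rest_wavelength


# Illustrative numbers only: a line emitted at 500 nm and observed at 750 nm
z = redshift(750.0, 500.0)
print(z)  # 0.5
```

A negative value means the observed wavelength is shorter than the rest wavelength (a blueshift), which is consistent with the small negative minimum of `redshift` in the summary statistics further down.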

In [65]:
#star_classification.csv

#import libraries needed
from sklearn.impute import SimpleImputer
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
#plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.naive_bayes import GaussianNB
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [66]:
df = pd.read_csv("Datasets/star_classification.csv")

In [67]:
df.head()

,obj_ID,alpha,delta,u,g,r,i,z,run_ID,rerun_ID,cam_col,field_ID,spec_obj_ID,class,redshift,plate,MJD,fiber_ID
0,1.237661e+18,135.689107,32.494632,23.87882,22.2753,20.39501,19.16573,18.79371,3606,301,2,79,6.543777e+18,GALAXY,0.634794,5812,56354,171
1,1.237665e+18,144.826101,31.274185,24.77759,22.83188,22.58444,21.16812,21.61427,4518,301,5,119,1.176014e+19,GALAXY,0.779136,10445,58158,427
2,1.237661e+18,142.18879,35.582444,25.26307,22.66389,20.60976,19.34857,18.94827,3606,301,2,120,5.1522e+18,GALAXY,0.644195,4576,55592,299
3,1.237663e+18,338.741038,-0.402828,22.13682,23.77656,21.61162,20.50454,19.2501,4192,301,3,214,1.030107e+19,GALAXY,0.932346,9149,58039,775
4,1.23768e+18,345.282593,21.183866,19.43718,17.58028,16.49747,15.97711,15.54461,8102,301,3,137,6.891865e+18,GALAXY,0.116123,6121,56187,842


In [68]:
df.shape
#100,000 rows and 18 columns: 17 features plus 1 target (dependent) variable

(100000, 18)

In [69]:
df['class'].value_counts()

#our dependent (target) variable is categorical with 3 classes

GALAXY    59445
STAR      21594
QSO       18961
Name: class, dtype: int64
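
The counts above show that the classes are imbalanced, which matters when reading accuracy later on. As a rough sketch, the class shares and the accuracy of a trivial "always predict the most common class" model can be computed directly from those counts. The variable names here are illustrative.

```python
from collections import Counter

# Class counts copied from the value_counts() output above
counts = Counter({"GALAXY": 59445, "STAR": 21594, "QSO": 18961})
total = sum(counts.values())

# Share of each class in the dataset
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.3f}")

# Accuracy of a trivial model that always predicts the most common class
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total
print(majority_label, baseline_accuracy)
```

Any classifier trained on this data should be compared against that baseline of roughly 59% rather than against 0%.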

In [70]:
df.info()
#no missing entries, all predictors are numerical
#output variable is categorical

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 18 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   obj_ID       100000 non-null  float64
 1   alpha        100000 non-null  float64
 2   delta        100000 non-null  float64
 3   u            100000 non-null  float64
 4   g            100000 non-null  float64
 5   r            100000 non-null  float64
 6   i            100000 non-null  float64
 7   z            100000 non-null  float64
 8   run_ID       100000 non-null  int64  
 9   rerun_ID     100000 non-null  int64  
 10  cam_col      100000 non-null  int64  
 11  field_ID     100000 non-null  int64  
 12  spec_obj_ID  100000 non-null  float64
 13  class        100000 non-null  object 
 14  redshift     100000 non-null  float64
 15  plate        100000 non-null  int64  
 16  MJD          100000 non-null  int64  
 17  fiber_ID     100000 non-null  int64  
dtypes: float64(10), int64(7), object(1)

In [71]:
#drop the obj_ID column from the dataframe since it plays no role in prediction
df.drop(columns=["obj_ID"], inplace=True)

In [72]:
#display summary statistics for the numeric columns
df.describe()

,alpha,delta,u,g,r,i,z,run_ID,rerun_ID,cam_col,field_ID,spec_obj_ID,redshift,plate,MJD,fiber_ID
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,177.629117,24.135305,21.980468,20.531387,19.645762,19.084854,18.66881,4481.36606,301.0,3.51161,186.13052,5.783882e+18,0.576661,5137.00966,55588.6475,449.31274
std,96.502241,19.644665,31.769291,31.750292,1.85476,1.757895,31.728152,1964.764593,0.0,1.586912,149.011073,3.324016e+18,0.730707,2952.303351,1808.484233,272.498404
min,0.005528,-18.785328,-9999.0,-9999.0,9.82207,9.469903,-9999.0,109.0,301.0,1.0,11.0,2.995191e+17,-0.009971,266.0,51608.0,1.0
25%,127.518222,5.146771,20.352353,18.96523,18.135828,17.732285,17.460677,3187.0,301.0,2.0,82.0,2.844138e+18,0.054517,2526.0,54234.0,221.0
50%,180.9007,23.645922,22.179135,21.099835,20.12529,19.405145,19.004595,4188.0,301.0,4.0,146.0,5.614883e+18,0.424173,4987.0,55868.5,433.0
75%,233.895005,39.90155,23.68744,22.123767,21.044785,20.396495,19.92112,5326.0,301.0,5.0,241.0,8.332144e+18,0.704154,7400.25,56777.0,645.0
max,359.99981,83.000519,32.78139,31.60224,29.57186,32.14147,29.38374,8162.0,301.0,6.0,989.0,1.412694e+19,7.011245,12547.0,58932.0,1000.0
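
The minimum of -9999.0 for `u`, `g` and `z` in the table above looks like a placeholder rather than a real measurement, and it also inflates the standard deviation of those three columns. A minimal sketch of removing such rows follows, assuming -9999 marks a missing value (the source does not confirm this). The helper name and the toy data are hypothetical.

```python
import pandas as pd


def drop_sentinel_rows(frame, columns, sentinel=-9999.0):
    """Return a copy of frame without rows where any listed column equals sentinel."""
    has_sentinel = frame[columns].eq(sentinel).any(axis=1)
    return frame.loc[~has_sentinel].copy()


# Toy data for illustration only (not rows from the real dataset)
toy = pd.DataFrame({
    "u": [23.9, -9999.0, 19.4],
    "g": [22.3, -9999.0, 17.6],
    "z": [18.8, -9999.0, 15.5],
})

clean = drop_sentinel_rows(toy, ["u", "g", "z"])
print(len(toy), "->", len(clean))
print(clean["u"].min())
```

If this is applied to the real dataframe, it should happen before `X` and `y` are separated so that features and labels stay aligned.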


In [73]:
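#Pearson's r correlation matrix for all numeric features
#numeric columns are selected explicitly because the string column "class"
#makes a plain df.corr() raise an error in pandas 2.0 and later
df.select_dtypes(include="number").corr()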

,alpha,delta,u,g,r,i,z,run_ID,rerun_ID,cam_col,field_ID,spec_obj_ID,redshift,plate,MJD,fiber_ID
alpha,1.0,0.138691,-0.001532,-0.002423,-0.022083,-0.02358,-0.002918,-0.013737,,0.019582,-0.165577,-0.002553,0.001667,-0.002554,0.019943,0.030464
delta,0.138691,1.0,0.002074,0.003523,-0.006835,-0.00448,0.00363,-0.301238,,0.032565,-0.173416,0.112329,0.031638,0.112329,0.107333,0.02825
u,-0.001532,0.002074,1.0,0.999311,0.054149,0.04573,0.998093,0.015309,,0.003548,-0.008374,0.029997,0.014309,0.029997,0.031997,0.016305
g,-0.002423,0.003523,0.999311,1.0,0.062387,0.056271,0.999161,0.01571,,0.003508,-0.008852,0.039443,0.022954,0.039443,0.040274,0.01747
r,-0.022083,-0.006835,0.054149,0.062387,1.0,0.962868,0.053677,0.153889,,0.00848,-0.026423,0.655245,0.433241,0.655243,0.67118,0.223106
i,-0.02358,-0.00448,0.04573,0.056271,0.962868,1.0,0.055994,0.147668,,0.007615,-0.026679,0.661641,0.492383,0.66164,0.672523,0.214787
z,-0.002918,0.00363,0.998093,0.999161,0.053677,0.055994,1.0,0.013811,,0.003365,-0.008903,0.037813,0.03038,0.037813,0.037469,0.014668
run_ID,-0.013737,-0.301238,0.015309,0.01571,0.153889,0.147668,0.013811,1.0,,-0.047098,0.031498,0.23946,0.0654,0.239459,0.262687,0.067165
rerun_ID,,,,,,,,,,,,,,,,
cam_col,0.019582,0.032565,0.003548,0.003508,0.00848,0.007615,0.003365,-0.047098,,1.0,-0.015684,-0.001946,9.7e-05,-0.001949,-0.006745,0.121597
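
The empty `rerun_ID` row and column in the matrix above follow from the summary statistics: the column has a standard deviation of 0, and Pearson's r divides by the standard deviation. A small sketch of detecting such constant columns is shown here. The helper name and the toy data are assumptions for illustration.

```python
import pandas as pd


def constant_columns(frame):
    """Names of columns that hold a single distinct value."""
    return [name for name in frame.columns if frame[name].nunique(dropna=False) <= 1]


# Toy data for illustration only
toy = pd.DataFrame({
    "rerun_ID": [301, 301, 301, 301],
    "cam_col": [2, 5, 2, 3],
    "redshift": [0.63, 0.78, 0.64, 0.93],
})

print(constant_columns(toy))
# Correlations with a constant column are undefined, so pandas reports NaN
print(toy.corr()["rerun_ID"].isna().all())
```

A constant column carries no information for a classifier, so it is a natural candidate to drop alongside the identifier columns.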


In [74]:
#Y is our target variable, we are trying to predict stellar object class
y = df["class"]

#X holds the attributes used for prediction
X = df.drop(columns=["class"])

In [75]:
'''
Map the categorical class labels (target) to integers
0: Quasar
1: Star
2: Galaxy
'''

y = df['class'].map({"QSO": 0, "STAR": 1, "GALAXY": 2})

y.value_counts()

2    59445
1    21594
0    18961
Name: class, dtype: int64

In [76]:
#split the dataset into random training and testing sets: 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)
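
Because the classes are imbalanced, one optional variation is a stratified split, which keeps the class shares the same in both parts. The sketch below uses synthetic labels with a similar imbalance. The `_demo` names and the data are assumptions, and `stratify` is not part of the original call above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same imbalance as the dataset (illustration only)
rng = np.random.default_rng(1)
y_demo = np.array([2] * 600 + [1] * 220 + [0] * 180)
X_demo = rng.normal(size=(len(y_demo), 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# With stratify, class shares in the test set match those of the full data
print(np.bincount(y_demo) / len(y_demo))
print(np.bincount(y_te) / len(y_te))
```

With 100,000 rows a plain random split already lands close to the overall proportions, so stratification matters more for smaller datasets or rarer classes.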

In [77]:
X_train.shape

(80000, 16)

In [78]:
X_test.shape

(20000, 16)

In [79]:
y_train.shape

(80000,)

In [80]:
y_test.shape

(20000,)

In [81]:
#Since our features have varying units of measurement and ranges,
#standardize them before fitting Gaussian NB.
#The scaler is fit on the training set only, so no test-set statistics leak into training.


from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
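
What `fit_transform` does here can be sketched in plain NumPy: the mean and standard deviation are computed per column on the training data, and the same values are then reused for any other data. The arrays below are toy values for illustration, not dataset rows.

```python
import numpy as np

# Toy feature matrices for illustration only
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

# Statistics come from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training mean and std

print(train_scaled.mean(axis=0))  # approximately 0 per column
print(train_scaled.std(axis=0))   # approximately 1 per column
print(test_scaled)
```

Test values can fall outside the range seen in training, as the second toy column shows. That is expected and is the reason the test set must not influence the statistics.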


In [82]:
def model_Evaluate(model):
    #accuracy of model on training data
    acc_train=model.score(X_train, y_train)
    #accuracy of model on test data
    acc_test=model.score(X_test, y_test)
    
    print('Accuracy of model on training data : {}'.format(acc_train*100))
    print('Accuracy of model on testing data : {} \n'.format(acc_test*100))

    # Predict values for Test dataset
    y_pred = model.predict(X_test)

    # Print the evaluation metrics for the dataset.
    print(classification_report(y_test, y_pred))
    
    
#Use a Gaussian Naive Bayes classifier for multiclass classification of the
#class (target/dependent) variable


#apply the scaling fitted on the training set to the test set
X_test = scaler.transform(X_test)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
model_Evaluate(gnb)

Accuracy of model on training data : 71.05125
Accuracy of model on testing data : 71.055 

              precision    recall  f1-score   support

           0       0.59      0.88      0.71      3821
           1       0.99      0.15      0.27      4308
           2       0.75      0.86      0.80     11871

    accuracy                           0.71     20000
   macro avg       0.78      0.63      0.59     20000
weighted avg       0.77      0.71      0.67     20000
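
The `macro avg` and `weighted avg` rows can be reproduced by hand from the per-class rows, which makes the difference between them concrete. This sketch uses the rounded recall and support values printed in the report above, so the results are approximate.

```python
# Per-class recall and support copied from the report above
recall = {0: 0.88, 1: 0.15, 2: 0.86}
support = {0: 3821, 1: 4308, 2: 11871}
total = sum(support.values())

# Macro average: plain mean over classes, so each class counts equally
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: mean over classes weighted by the number of samples
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(round(macro_recall, 2), round(weighted_recall, 2))
```

The macro value is pulled down by the low recall of class 1 (STAR), while the weighted value is dominated by the large class 2 (GALAXY) and equals the overall accuracy up to rounding.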



In [None]:
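#Conclusion: our Gaussian Naive Bayes classifier reached about 71% accuracy on the test set.
#Recall for class 1 (STAR) is only 0.15, so most stars are misclassified.
#Next step: tune hyperparameters or try another model for better results.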
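
One possible way to follow up on the tuning idea is a cross-validated grid search. `GaussianNB` exposes few parameters, and `var_smoothing` is the main one to vary. The sketch below runs on synthetic data. The `_demo` names, the candidate grid and the choice of macro F1 as the score are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the real features (illustration only)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=1
)

# Scaling lives inside the pipeline so each CV fold fits its own scaler
pipeline = make_pipeline(StandardScaler(), GaussianNB())

# Candidate values are an assumption; var_smoothing defaults to 1e-9
param_grid = {"gaussiannb__var_smoothing": np.logspace(-12, 0, 7)}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1_macro")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Macro F1 is used instead of accuracy so that a class with poor recall, like STAR in the report above, is not hidden by the majority class.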