In [162]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).



# Star Classifier
The Sloan Digital Sky Survey (SDSS) is an international collaboration of scientists that gathers data from two telescopes, one in North America and one in South America, to build the most detailed three-dimensional imagery of the Universe ever made. The SDSS has produced deep multi-color images of one third of the sky and spectra of more than three million astronomical objects.

The SDSS has been searching for a data scientist, and they have asked you to join their team.

For this job you need to gather data from the SDSS survey and create a star classifier system.

In this project I will follow the **CRISP-DM model**.

The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model that serves as the base for a data science process. It has six sequential phases:

- Business understanding – What does the business need?
- Data understanding – What data do we have / need? Is it clean?
- Data preparation – How do we organize the data for modeling?
- Modeling – What modeling techniques should we apply?
- Evaluation – Which model best meets the business objectives?
- Deployment – How do stakeholders access the results?

## Business understanding – What does the business need?
Since I don't know much about the SDSS's work, I have to do my own research and ask them what they expect from my work.

They need a classifier system that distinguishes between stars, quasars and galaxies, so they gave me access to the CasJobs database, from which I will retrieve the data set, along with a data dictionary describing the data I will be working with.

### Content
The data consists of 10,000 observations of space taken by the SDSS. Every observation is described by 17 feature columns and 1 class column which identifies it to be either a star, galaxy or quasar.

### Feature Description
The table results from a query which joins two tables (actually views): "PhotoObj" which contains photometric data and "SpecObj" which contains spectral data.

To ease your start with the data you can read the feature descriptions below:

#### View "PhotoObj"
- objid = Object Identifier
- ra = J2000 Right Ascension (r-band)
- dec = J2000 Declination (r-band)

Right ascension (abbreviated RA) is the angular distance measured eastward along the celestial equator from the Sun at the March equinox to the hour circle of the point above the earth in question. When paired with declination (abbreviated dec), these astronomical coordinates specify the direction of a point on the celestial sphere (traditionally called in English the skies or the sky) in the equatorial coordinate system.

Source: https://voyages.sdss.org/expeditions/expedition-to-the-solar-system/solarsystem/radec/
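
To make the coordinate pair more concrete, here is a minimal sketch that converts a right ascension and declination (both in degrees, as stored in the `ra` and `dec` columns) into a Cartesian unit vector on the celestial sphere. The function name `radec_to_unit_vector` is a hypothetical helper for illustration, not part of any SDSS tool.

```python
import math

def radec_to_unit_vector(ra_deg, dec_deg):
    """Convert equatorial coordinates in degrees to a Cartesian unit vector.

    Hypothetical helper for illustration: x points at (ra=0, dec=0),
    z points at the north celestial pole (dec=90).
    """
    ra = math.radians(ra_deg)
    dec = math.radians(dec_deg)
    x = math.cos(dec) * math.cos(ra)
    y = math.cos(dec) * math.sin(ra)
    z = math.sin(dec)
    return (x, y, z)

# Coordinates of the first object in the data set used in this notebook
x, y, z = radec_to_unit_vector(121.820752, 0.931990)

# Every direction on the celestial sphere maps to a vector of length 1
print(round(x**2 + y**2 + z**2, 6))  # 1.0
```

Two objects that are close on the sky have unit vectors with a dot product close to 1, which avoids the wrap-around problem of comparing right ascensions near 0 and 360 degrees.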

- u = better of DeV/Exp magnitude fit
- g = better of DeV/Exp magnitude fit
- r = better of DeV/Exp magnitude fit
- i = better of DeV/Exp magnitude fit
- z = better of DeV/Exp magnitude fit

The Thuan-Gunn astronomic magnitude system. u, g, r, i, z represent the response of the 5 bands of the telescope.
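
Because magnitudes are logarithmic, the difference between two bands (a color index) describes how an object's brightness changes across the spectrum. The sketch below computes the differences between adjacent bands for the first three rows of the data set. This is only one possible derived feature, shown as an assumption for illustration; the notebook itself does not use it, and the column names `u_g`, `g_r`, `r_i`, `i_z` are made up here.

```python
import pandas as pd

# Magnitudes of the first three rows of the data set (two galaxies, one star)
sample = pd.DataFrame({
    "u": [19.37035, 19.05249, 18.20631],
    "g": [17.34262, 17.03777, 16.89692],
    "r": [16.35286, 16.07633, 16.46658],
    "i": [15.92400, 15.63148, 16.31574],
    "z": [15.58903, 15.31245, 16.28902],
})

# Color index = magnitude in the bluer band minus magnitude in the redder band
bands = ["u", "g", "r", "i", "z"]
for blue, red in zip(bands[:-1], bands[1:]):
    sample[f"{blue}_{red}"] = sample[blue] - sample[red]

print(sample[["u_g", "g_r", "r_i", "i_z"]].round(3))
```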

- run = Run Number
- rerun = Rerun Number
- camcol = Camera column
- field = Field number

Run, rerun, camcol and field are features which describe a field within an image taken by the SDSS. A field is basically a part of the entire image corresponding to 2048 by 1489 pixels. A field can be identified by:

- run number, which identifies the specific scan,
- the camera column, or "camcol," a number from 1 to 6, identifying the scanline within the run, and
- the field number. The field number typically starts at 11 (after an initial rampup time), and can be as large as 800 for particularly long runs.
- An additional number, rerun, specifies how the image was processed.
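
The identification scheme above can be sketched with pandas: the combination of `run`, `camcol` and `field` acts as a composite key for a field, so grouping by those three columns counts how many objects fall in each field. The rows below are the identifiers of the first five objects in the data set; the variable names are illustrative.

```python
import pandas as pd

# run / rerun / camcol / field of the first five objects in the data set
rows = pd.DataFrame({
    "run":    [308, 308, 308, 308, 308],
    "rerun":  [301, 301, 301, 301, 301],
    "camcol": [5, 5, 5, 5, 5],
    "field":  [104, 105, 110, 111, 111],
})

# (run, camcol, field) identifies a field; rerun only records the processing
objects_per_field = rows.groupby(["run", "camcol", "field"]).size()
print(objects_per_field)
```

In this small sample, field 111 of run 308, camera column 5 contains two objects, while the other fields contain one each.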

#### View "SpecObj"
- specobjid = Object Identifier
- class = object class (galaxy, star or quasar object)

The class identifies an object as either a galaxy, a star or a quasar. This is the response variable we will try to predict.

- redshift = Final Redshift
- plate = plate number
- mjd = MJD of observation
- fiberid = fiber ID

In physics, redshift happens when light or other electromagnetic radiation from an object is increased in wavelength, or shifted to the red end of the spectrum.
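
As a worked example, redshift is commonly defined as the relative change in wavelength, z = (observed - emitted) / emitted. The sketch below applies that definition to made-up wavelengths; the function name and the numbers are illustrative assumptions and do not come from the SDSS pipeline.

```python
def redshift(observed_wavelength, emitted_wavelength):
    """Relative wavelength shift: z = (observed - emitted) / emitted."""
    return (observed_wavelength - emitted_wavelength) / emitted_wavelength

# Illustrative numbers: a line emitted at 500 nm and observed at 550 nm
z = redshift(550.0, 500.0)
print(round(z, 3))  # 0.1

# An observed wavelength shorter than the emitted one gives a negative value
# (a blueshift), like the slightly negative redshifts of some stars in the data
print(redshift(499.0, 500.0) < 0)  # True
```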

Each spectroscopic exposure employs a large, thin, circular metal plate that positions optical fibers via holes drilled at the locations of the images in the telescope focal plane. These fibers then feed into the spectrographs. Each plate has a unique serial number, which is called plate in views such as SpecObj in the CAS.

Modified Julian Date, used to indicate the date that a given piece of SDSS data (image or spectrum) was taken.
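
A Modified Julian Date counts days since midnight on 17 November 1858, so the `mjd` column can be turned into a calendar date with the standard library alone. This is a minimal sketch; `mjd_to_datetime` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# MJD 0 corresponds to 1858-11-17 00:00
MJD_EPOCH = datetime(1858, 11, 17)

def mjd_to_datetime(mjd):
    """Convert a Modified Julian Date (in days) to a datetime."""
    return MJD_EPOCH + timedelta(days=mjd)

print(mjd_to_datetime(51544).date())  # 2000-01-01

# mjd value of the first row in the data set
first_observation = mjd_to_datetime(55892)
print(first_observation.date())
```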

The SDSS spectrograph uses optical fibers to direct the light at the focal plane from individual objects to the slithead. Each object is assigned a corresponding fiberID.

Further information on SDSS images and their attributes:

http://www.sdss3.org/dr9/imaging/imaging_basics.php

http://www.sdss3.org/dr8/glossary.php

### Acknowledgements
The data released by the SDSS is in the public domain. It is taken from the current data release, DR14.

More information about the license:

http://www.sdss.org/science/image-gallery/

It was acquired by querying the CasJobs database which contains all data published by the SDSS.

The exact query can be found at:
https://skyserver.sdss.org/dr18/SearchTools/sql

## Data understanding – What data do we have / need? Is it clean?

I couldn't connect the notebook to the SDSS REST API to retrieve the data because it is no longer available. Instead, I queried the CasJobs online database and downloaded the result as a CSV file.

![query](star_clasifier\images\query.png)
![result](star_clasifier\images\result.png)

In [163]:
import pandas as pd
import numpy as np
import altair as alt


In [164]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [165]:
dr14 = pd.read_csv("/content/drive/MyDrive/ColabNotebooks/Skyserver_adradev_DR14.csv")
dr14

Unnamed: 0,objid,ra,dec,u,g,r,i,z,run,rerun,camcol,field,specobjid,class,redshift,plate,mjd,fiberid
0,1237646798137852371,121.820752,0.931990,19.37035,17.34262,16.35286,15.92400,15.58903,308,301,5,104,5342663162779901952,GALAXY,0.101993,4745,55892,975
1,1237646798137918215,122.087900,0.843147,19.05249,17.03777,16.07633,15.63148,15.31245,308,301,5,105,5342400104622956544,GALAXY,0.101533,4745,55892,18
2,1237646798138245746,122.863995,0.896151,18.20631,16.89692,16.46658,16.31574,16.28902,308,301,5,110,2316073479176218624,STAR,0.000488,2057,53816,354
3,1237646798138310950,122.981945,0.963857,17.63113,16.55926,16.24861,16.14775,16.13221,308,301,5,111,2338584330990807040,STAR,0.000132,2077,53846,328
4,1237646798138310972,122.988638,0.973743,15.99172,14.98865,14.70003,14.64919,14.50626,308,301,5,111,2316072929420404736,STAR,0.000093,2057,53816,352
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,1237651755091689608,197.462885,3.309304,18.56033,17.22085,16.41203,16.02500,15.73324,1462,301,6,481,591201118567032832,GALAXY,0.110601,525,52295,377
9996,1237651755091755116,197.643419,3.361471,19.48133,18.36759,17.76637,17.34260,17.09888,1462,301,6,482,591198369787963392,GALAXY,0.110570,525,52295,367
9997,1237651755091755128,197.665024,3.297791,19.27929,18.38898,18.03337,17.91430,17.85355,1462,301,6,482,4511676729161392128,STAR,-0.000107,4007,55327,712
9998,1237651755091820576,197.767442,3.418576,17.72724,15.89321,15.09945,14.69967,14.38937,1462,301,6,483,591209364904241152,GALAXY,0.025301,525,52295,407


In [166]:
# Now that I have the data, let's see what we can figure out from this data set
dr14.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 18 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   objid      10000 non-null  int64  
 1   ra         10000 non-null  float64
 2   dec        10000 non-null  float64
 3   u          10000 non-null  float64
 4   g          10000 non-null  float64
 5   r          10000 non-null  float64
 6   i          10000 non-null  float64
 7   z          10000 non-null  float64
 8   run        10000 non-null  int64  
 9   rerun      10000 non-null  int64  
 10  camcol     10000 non-null  int64  
 11  field      10000 non-null  int64  
 12  specobjid  10000 non-null  uint64 
 13  class      10000 non-null  object 
 14  redshift   10000 non-null  float64
 15  plate      10000 non-null  int64  
 16  mjd        10000 non-null  int64  
 17  fiberid    10000 non-null  int64  
dtypes: float64(8), int64(8), object(1), uint64(1)
memory usage: 1.4+ MB


In [167]:
# Basic Statistical Information
dr14.describe()

Unnamed: 0,objid,ra,dec,u,g,r,i,z,run,rerun,camcol,field,specobjid,redshift,plate,mjd,fiberid
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,1.237651e+18,185.524241,37.517897,18.648291,17.432209,16.911216,16.65346,16.494506,1331.9628,301.0,3.9836,218.8603,1.682467e+18,0.171432,1494.2434,52901.4735,359.3319
std,706870300000.0,50.334146,26.782991,0.824173,0.961547,1.108636,1.1901,1.26793,164.620267,0.0,1.717098,180.853785,2.111862e+18,0.430373,1875.698877,1524.776962,206.65437
min,1.237647e+18,27.56775,-8.479532,12.42139,12.66632,11.9385,11.53573,11.31391,308.0,301.0,1.0,11.0,2.99582e+17,-0.004136,266.0,51578.0,1.0
25%,1.237651e+18,142.449612,1.86098,18.224453,16.8766,16.218625,15.875947,15.637037,1334.0,301.0,3.0,65.0,4.864045e+17,4.3e-05,432.0,51908.0,186.0
50%,1.237651e+18,180.486821,51.09688,18.88376,17.532435,16.91049,16.607845,16.432065,1345.0,301.0,4.0,172.0,5.777553e+17,0.051642,513.0,52051.0,370.0
75%,1.237652e+18,238.81133,60.139752,19.275627,18.06368,17.57825,17.330423,17.21373,1412.0,301.0,5.0,331.0,2.607728e+18,0.101066,2316.0,53757.0,518.0
max,1.237652e+18,262.966558,68.10677,19.59994,19.97727,24.80205,24.36182,27.87514,1462.0,301.0,6.0,775.0,9.319317e+18,6.519422,8277.0,57448.0,1000.0


In [168]:
# The data set has a class variable, which will be useful as our target variable.
# Let's see what it can tell us before building our model
dr14['class'].value_counts()

GALAXY    5368
STAR      3551
QSO       1081
Name: class, dtype: int64
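
The counts above show that the classes are imbalanced. As a quick sketch, the shares can be computed directly from those counts; the largest share is also the accuracy a trivial model would reach by always predicting the majority class, which gives a baseline to compare any classifier against. The variable names are illustrative.

```python
import pandas as pd

# Class counts as reported by value_counts() above
class_counts = pd.Series({"GALAXY": 5368, "STAR": 3551, "QSO": 1081}, name="class")

# Share of each class in the data set
class_share = class_counts / class_counts.sum()
print(class_share.round(3))

# Accuracy of always predicting the most frequent class
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

With quasars making up only about 11% of the observations, per-class metrics are likely to be more informative than overall accuracy.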

In [169]:
# Work on a copy so the original dr14 frame stays untouched
df2 = dr14.copy()
df2

Unnamed: 0,objid,ra,dec,u,g,r,i,z,run,rerun,camcol,field,specobjid,class,redshift,plate,mjd,fiberid
0,1237646798137852371,121.820752,0.931990,19.37035,17.34262,16.35286,15.92400,15.58903,308,301,5,104,5342663162779901952,GALAXY,0.101993,4745,55892,975
1,1237646798137918215,122.087900,0.843147,19.05249,17.03777,16.07633,15.63148,15.31245,308,301,5,105,5342400104622956544,GALAXY,0.101533,4745,55892,18
2,1237646798138245746,122.863995,0.896151,18.20631,16.89692,16.46658,16.31574,16.28902,308,301,5,110,2316073479176218624,STAR,0.000488,2057,53816,354
3,1237646798138310950,122.981945,0.963857,17.63113,16.55926,16.24861,16.14775,16.13221,308,301,5,111,2338584330990807040,STAR,0.000132,2077,53846,328
4,1237646798138310972,122.988638,0.973743,15.99172,14.98865,14.70003,14.64919,14.50626,308,301,5,111,2316072929420404736,STAR,0.000093,2057,53816,352
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,1237651755091689608,197.462885,3.309304,18.56033,17.22085,16.41203,16.02500,15.73324,1462,301,6,481,591201118567032832,GALAXY,0.110601,525,52295,377
9996,1237651755091755116,197.643419,3.361471,19.48133,18.36759,17.76637,17.34260,17.09888,1462,301,6,482,591198369787963392,GALAXY,0.110570,525,52295,367
9997,1237651755091755128,197.665024,3.297791,19.27929,18.38898,18.03337,17.91430,17.85355,1462,301,6,482,4511676729161392128,STAR,-0.000107,4007,55327,712
9998,1237651755091820576,197.767442,3.418576,17.72724,15.89321,15.09945,14.69967,14.38937,1462,301,6,483,591209364904241152,GALAXY,0.025301,525,52295,407


In [170]:
# rerun has a standard deviation of 0 (a single value), and objid / specobjid are identifiers --> Drop
df2.drop(['rerun', 'objid', 'specobjid'], axis=1, inplace=True)
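
Instead of reading the standard deviations by eye, constant columns can also be found programmatically. The sketch below does this on a three-row sample built from rows of the data set; it is an illustration of the idea, not the approach the notebook takes.

```python
import pandas as pd

# Three rows taken from the data set (rows 0, 1 and 9995)
sample = pd.DataFrame({
    "objid":  [1237646798137852371, 1237646798137918215, 1237651755091689608],
    "run":    [308, 308, 1462],
    "rerun":  [301, 301, 301],
    "camcol": [5, 5, 6],
})

# A column with a single distinct value carries no information for a model
constant_columns = [col for col in sample.columns if sample[col].nunique() == 1]
print(constant_columns)  # ['rerun']
```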

In [171]:
# Rename the class column to Class: `class` is a reserved word in Python, so df2.class would be a syntax error
df2 = df2.rename(columns={'class':'Class'})

In [172]:
# Class column
Class = df2.Class.astype('category')
Class

0       GALAXY
1       GALAXY
2         STAR
3         STAR
4         STAR
         ...  
9995    GALAXY
9996    GALAXY
9997      STAR
9998    GALAXY
9999    GALAXY
Name: Class, Length: 10000, dtype: category
Categories (3, object): ['GALAXY', 'QSO', 'STAR']

In [173]:
df2

Unnamed: 0,ra,dec,u,g,r,i,z,run,camcol,field,Class,redshift,plate,mjd,fiberid
0,121.820752,0.931990,19.37035,17.34262,16.35286,15.92400,15.58903,308,5,104,GALAXY,0.101993,4745,55892,975
1,122.087900,0.843147,19.05249,17.03777,16.07633,15.63148,15.31245,308,5,105,GALAXY,0.101533,4745,55892,18
2,122.863995,0.896151,18.20631,16.89692,16.46658,16.31574,16.28902,308,5,110,STAR,0.000488,2057,53816,354
3,122.981945,0.963857,17.63113,16.55926,16.24861,16.14775,16.13221,308,5,111,STAR,0.000132,2077,53846,328
4,122.988638,0.973743,15.99172,14.98865,14.70003,14.64919,14.50626,308,5,111,STAR,0.000093,2057,53816,352
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,197.462885,3.309304,18.56033,17.22085,16.41203,16.02500,15.73324,1462,6,481,GALAXY,0.110601,525,52295,377
9996,197.643419,3.361471,19.48133,18.36759,17.76637,17.34260,17.09888,1462,6,482,GALAXY,0.110570,525,52295,367
9997,197.665024,3.297791,19.27929,18.38898,18.03337,17.91430,17.85355,1462,6,482,STAR,-0.000107,4007,55327,712
9998,197.767442,3.418576,17.72724,15.89321,15.09945,14.69967,14.38937,1462,6,483,GALAXY,0.025301,525,52295,407


In [174]:
df2.columns


Index(['ra', 'dec', 'u', 'g', 'r', 'i', 'z', 'run', 'camcol', 'field', 'Class',
       'redshift', 'plate', 'mjd', 'fiberid'],
      dtype='object')

### Plotting the different objects: Galaxy, Star, QSO

In [175]:
# Bar chart with the number of objects in every class
alt.Chart(df2).mark_bar().encode(
    x=alt.X('Class:N', title='', axis=alt.Axis(labelAngle=0, labelFontSize=12)),
    y=alt.Y('count():Q', title='Number of objects'),
    color='Class:N'
).properties(
    title="Number of objects by class",
    width=600,
)

In [176]:
# Scatter matrix
source = df2
alt.Chart(source).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
    color="Class:N"
).properties(
    width=150,
    height=150
).repeat(
    row=['u', 'g', 'r', 'i', 'z'],
    column=['u', 'g', 'r', 'i', 'z']
).interactive()

### Grouping and Aggregating

In [177]:
(df2
  .groupby(['Class'])
  .agg(['min', 'max', 'median'])
  .loc[:, 'ra':'z']
  .T
)

Unnamed: 0,Class,GALAXY,QSO,STAR
ra,min,27.56775,27.898458,27.913685
ra,max,262.966558,262.903096,262.878974
ra,median,186.887991,188.177067,169.038647
dec,min,-8.479532,-6.844004,-8.09097
dec,max,68.08299,68.10677,67.968669
dec,median,51.603813,53.641737,49.173996
u,min,14.48426,15.59918,12.42139
u,max,19.59994,19.59841,19.59986
u,median,18.98151,19.10369,18.53671
g,min,12.66632,14.73122,13.05947


In [178]:
pd.crosstab(df2.Class, values=df2.dec, aggfunc=('min', 'median', 'max'),
            columns=df2.assign(val='ra').val)

Unnamed: 0_level_0,max,median,min
val,ra,ra,ra
Class,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
GALAXY,68.08299,51.603813,-8.479532
QSO,68.10677,53.641737,-6.844004
STAR,67.968669,49.173996,-8.09097
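
The crosstab above reports the minimum, median and maximum declination per class. As a sketch, a `groupby` on the single `dec` column should give the same statistics with flat column labels. The example uses the first five rows of the data set so that it runs on its own.

```python
import pandas as pd

# Class and declination of the first five rows of the data set
sample = pd.DataFrame({
    "Class": ["GALAXY", "GALAXY", "STAR", "STAR", "STAR"],
    "dec":   [0.931990, 0.843147, 0.896151, 0.963857, 0.973743],
})

# One row per class, one column per statistic
dec_summary = sample.groupby("Class")["dec"].agg(["min", "median", "max"])
print(dec_summary)
```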


## Data preparation – How do we organize the data for modeling?

## Modeling – What modeling techniques should we apply?

## Evaluation – Which model best meets the business objectives?

## Deployment – How do stakeholders access the results?