# About the SDSS Dataset

In [1]:
import re
import pandas as pd
from IPython.display import HTML
HTML(open("styles/stylesheet.css", "r").read())

## 1. Introduction

The <a href="http://www.sdss.org/" target="_blank">Sloan Digital Sky Survey</a> (SDSS) is a comprehensive survey of the northern sky. This notebook explains the dataset **`data/sdss_dr7_photometry.csv.gz`**, which contains 2.8 million objects that have been spectroscopically identified in the <a href="http://classic.sdss.org/dr7/" target="_blank">SDSS Data Release 7</a>. Our goal is to build a classifier that can predict whether an object is a galaxy, a star, or a quasar, based on its photometric measurements. As an example, here are the first five objects:

In [2]:
sdss = pd.read_csv("data/sdss_dr7_photometry.csv.gz", compression="gzip", index_col=["ra", "dec"])
sdss.head()

| ra | dec | class | subclass | redshift | redshiftErr | zWarning | psfMag_u | psfMagErr_u | psfMag_g | psfMagErr_g | psfMag_r | ... | petroMagErr_i | petroMag_z | petroMagErr_z | extinction_u | extinction_g | extinction_r | extinction_i | extinction_z | petroRad_r | petroRadErr_r |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 189.429821 | -0.131042 | Star | A0 | 0.0006484753 | 7e-06 | 0 | 17.84807 | 0.017034 | 16.66706 | 0.014412 | 16.85531 | ... | 0.020146 | 17.19835 | 0.042538 | 0.119657 | 0.088043 | 0.063856 | 0.04842 | 0.03433 | 1.286998 | 0.019537 |
| 189.453801 | -0.097313 | Star | F5 | 5.906141e-07 | 9e-06 | 0 | 17.66626 | 0.016574 | 16.64595 | 0.0144 | 16.28934 | ... | 0.00314 | 16.1933 | 0.009198 | 0.115112 | 0.084698 | 0.06143 | 0.046581 | 0.033026 | 1.265791 | 0.017794 |
| 189.468747 | -0.036 | Star | F5 | 0.0005791125 | 1.2e-05 | 0 | 17.28147 | 0.015869 | 16.2146 | 0.01434 | 15.76875 | ... | 0.002288 | 15.51777 | 0.005647 | 0.118604 | 0.087268 | 0.063294 | 0.047994 | 0.034028 | 1.265033 | 0.017849 |
| 196.23665 | 0.412347 | Galaxy |  | 0.6179363 | 0.000132 | 0 | 24.66923 | 0.841852 | 23.05412 | 0.206945 | 21.47531 | ... | 0.379935 | 19.40505 | 0.474357 | 0.102667 | 0.075542 | 0.054789 | 0.041545 | 0.029456 | 2.018322 | 0.390549 |
| 196.193644 | 0.389485 | Galaxy |  | 0.6413319 | 4.5e-05 | 0 | 23.43426 | 0.501488 | 22.80689 | 0.139163 | 21.82671 | ... | 0.126625 | 19.26611 | 0.342775 | 0.106684 | 0.078497 | 0.056933 | 0.04317 | 0.030608 | 2.855696 | 0.554692 |


The first two columns (**`ra`** and **`dec`**) are the right ascension and declination of the object in degrees. Together they form the row index of our DataFrame. The third column (**`class`**) is the spectroscopic class (Star, Galaxy, or Quasar) as determined by expert opinion. This will be the target vector in the classification. Some objects are further divided into subclasses (**`subclass`**). The columns **`redshift`** and **`redshiftErr`** give the redshift of the object and its error, also determined by expert opinion.

There are 11 columns that we can use as feature vectors. These are the different <a href="https://www.sdss3.org/dr10/algorithms/magnitudes.php#mag_psf" target="_blank">PSF</a> and <a href="https://www.sdss3.org/dr10/algorithms/magnitudes.php#mag_petro" target="_blank">Petrosian</a> magnitude measurements:

* **`psfMag_u`**: PSF magnitude measurement in u-band, assuming the object is a point source
* **`psfMag_g`**: PSF magnitude measurement in g-band, assuming the object is a point source
* **`psfMag_r`**: PSF magnitude measurement in r-band, assuming the object is a point source
* **`psfMag_i`**: PSF magnitude measurement in i-band, assuming the object is a point source
* **`psfMag_z`**: PSF magnitude measurement in z-band, assuming the object is a point source
* **`petroMag_u`**: Petrosian magnitude measurement in u-band, assuming the object is an extended source
* **`petroMag_g`**: Petrosian magnitude measurement in g-band, assuming the object is an extended source
* **`petroMag_r`**: Petrosian magnitude measurement in r-band, assuming the object is an extended source
* **`petroMag_i`**: Petrosian magnitude measurement in i-band, assuming the object is an extended source
* **`petroMag_z`**: Petrosian magnitude measurement in z-band, assuming the object is an extended source
* **`petroRad_r`**: size measurement of the object in r-band in arc seconds

Each of these 11 measurements also has an associated error.
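
As a rough sketch, the 11 feature columns and the `class` target could be pulled out of the DataFrame as shown here. The names `FEATURE_COLS`, `error_column`, and `split_features_target` are illustrative and not part of the original notebook. The error-column naming (`Err` inserted before the band suffix) is inferred from the columns visible in the preview, so the elided columns are assumed to follow the same pattern. The toy values are made up purely to show the shapes involved.

```python
import pandas as pd

BANDS = ["u", "g", "r", "i", "z"]

# The 11 feature columns listed above: 5 PSF magnitudes,
# 5 Petrosian magnitudes, and the Petrosian radius in the r-band.
FEATURE_COLS = (
    [f"psfMag_{band}" for band in BANDS]
    + [f"petroMag_{band}" for band in BANDS]
    + ["petroRad_r"]
)


def error_column(feature):
    """Return the assumed error-column name, e.g. psfMag_u -> psfMagErr_u."""
    prefix, band = feature.rsplit("_", 1)
    return f"{prefix}Err_{band}"


def split_features_target(df):
    """Return (X, y): the 11 photometric features and the class labels."""
    return df[FEATURE_COLS], df["class"]


# Toy frame with made-up values (not taken from the SDSS data).
toy = pd.DataFrame({**{col: [20.0, 18.5] for col in FEATURE_COLS},
                    "class": ["Galaxy", "Star"]})
X, y = split_features_target(toy)
print(X.shape, list(y))
```

With the real data, `split_features_target(sdss)` would return the feature matrix and target vector, both indexed by (`ra`, `dec`).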

## 2. Obtaining the Dataset

### 2.1. The Main Dataset (2.8 million objects)

If you would like to obtain the dataset yourself, create an account on the <a href="http://skyserver.sdss.org/CasJobs/" target="_blank">SDSS CasJobs</a> site and submit the following SQL query to the DR12 catalog:

The WHERE conditions ensure that we only select the best possible data.

### 2.2. The Full Dataset (800 million objects)

To obtain the photometric measurements of all objects, remove all the WHERE conditions and LEFT JOIN (instead of JOIN) PhotoObj with SpecObj. Note that since the full set is extremely large (around 200 GB), you will not be able to use CasJobs. Instead, you need to email the SDSS Help Desk directly for a custom transfer.
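
A file of this size will not fit in memory on most machines, so it has to be processed in pieces. The sketch below shows one way to do that with the `chunksize` option of `pd.read_csv`, assuming the full set arrives as a CSV with the same column names as the main dataset (an assumption, since the delivery format is not specified here). The function name `count_classes` and the label `"unlabelled"` are illustrative. Because of the LEFT JOIN, objects without a spectrum have no `class` value, and the sketch counts them separately.

```python
import io
from collections import Counter

import pandas as pd


def count_classes(source, chunksize=1_000_000):
    """Count class labels by streaming the CSV in chunks of `chunksize` rows."""
    counts = Counter()
    # Only the `class` column is read, which keeps each chunk small.
    for chunk in pd.read_csv(source, usecols=["class"], chunksize=chunksize):
        # Rows with no spectroscopic match have a missing class.
        counts.update(chunk["class"].fillna("unlabelled").tolist())
    return counts


# Tiny in-memory stand-in for the real file (made-up rows).
demo_csv = io.StringIO(
    "ra,dec,class\n"
    "1.0,2.0,Star\n"
    "3.0,4.0,Galaxy\n"
    "5.0,6.0,\n"
    "7.0,8.0,Star\n"
)
demo_counts = count_classes(demo_csv, chunksize=2)
print(demo_counts)
```

The same function accepts a file path in place of the `StringIO` object.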

## 4. Subclass

In the raw dataset, the `subclass` column is not formatted in a uniform way. If you need to work with subclasses, you might want to do some cleaning up. Here are some examples.

In [29]:
# remove null references (assign back instead of using inplace on a column)
sdss['subclass'] = sdss['subclass'].replace('null', '')

# remove HD catalog number (stored in parentheses)
sdss['subclass'] = sdss['subclass'].replace(r'\s*\(\d+\)\s*', '', regex=True)
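
To see what these two steps do, here is a small self-contained sketch that applies the same replacements to a handful of made-up subclass strings. The helper name `clean_subclass` and the sample values are illustrative and are not taken from the dataset.

```python
import pandas as pd


def clean_subclass(subclass):
    """Return a cleaned copy of a subclass Series."""
    # Entries that are exactly the string 'null' become empty strings.
    cleaned = subclass.replace("null", "")
    # Drop a parenthesised catalog number and any surrounding whitespace.
    cleaned = cleaned.replace(r"\s*\(\d+\)\s*", "", regex=True)
    return cleaned


# Made-up examples of the kinds of values described above.
raw = pd.Series(["A0", "null", "F5 (12345)", "K1 (678)"])
cleaned = clean_subclass(raw)
print(cleaned.tolist())
```

Returning a new Series rather than modifying the input keeps the raw column available for comparison.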