# Separating Stars and Galaxies from SDSS 

##### Version 0.1

***
By AA Miller 2018 Feb 21

**Getting started**

Python boundary conditions: before you begin work on this notebook, you will need to install a few non-standard `python` packages (note: I assume you already have `NumPy`, `matplotlib`, `pandas`, and `jupyter` installed). You are going to need [`scikit-learn`](http://scikit-learn.org/stable/) and `seaborn`. I highly recommend using a `python` package manager, like [miniconda](https://conda.io/miniconda.html), for handling python and the different versions you may want to install. With miniconda/anaconda, you can install `scikit-learn` and `seaborn` via

    conda install scikit-learn seaborn

Alternatively, you can use `pip` to install these packages:

    pip install -U scikit-learn
    pip install seaborn
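
To confirm that the installation worked, a quick check along the following lines may help. This is only a sketch: the helper name `missing_packages` is made up for illustration, and the package list is assembled from the packages named above plus `astropy`, which the first code cell of this notebook imports. Note that `scikit-learn` is imported under the name `sklearn`.

```python
from importlib import util

# Import names of the packages this notebook relies on.
# scikit-learn is imported as `sklearn`; astropy is used in the first code cell.
REQUIRED = ["numpy", "matplotlib", "pandas", "sklearn", "seaborn", "astropy"]


def missing_packages(names):
    """Return the subset of `names` that cannot be located for import.

    Hypothetical helper for illustration; it only checks that a package
    can be found, not that its version is recent enough.
    """
    return [name for name in names if util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Still need to install:", ", ".join(missing))
else:
    print("All required packages were found.")
```

If anything is reported as missing, rerun the relevant `conda` or `pip` command above and restart the notebook kernel.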

We will now follow the steps from the lecture to develop an end-to-end machine learning model that uses actual astronomical data to separate stars and galaxies. As a reminder, we covered 5 steps in the machine learning workflow:

1. Data Preparation
2. Model Building
3. Model Evaluation
4. Model Optimization
5. Model Predictions
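
As a rough preview of how the five steps above might map onto `scikit-learn` calls, here is a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic Gaussian blobs (not SDSS measurements), and the choice of a random forest, the hyperparameter grid, and all variable names are arbitrary rather than taken from the lecture.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(42)

# 1. Data preparation: synthetic stand-in data with 8 features and
#    two classes (0 and 1), split into training and held-out sets.
n_per_class = 500
X = np.vstack([rng.normal(0.0, 1.0, size=(n_per_class, 8)),
               rng.normal(1.5, 1.0, size=(n_per_class, 8))])
y = np.concatenate([np.zeros(n_per_class, dtype=int),
                    np.ones(n_per_class, dtype=int)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# 2. Model building: fit a classifier to the labeled training data.
model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# 3. Model evaluation: score the model on data it has not seen.
baseline_acc = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {baseline_acc:.3f}")

# 4. Model optimization: cross-validated search over a small,
#    arbitrary grid of hyperparameters.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [25, 50], "min_samples_leaf": [1, 5]},
    cv=3)
grid.fit(X_train, y_train)
print("best hyperparameters:", grid.best_params_)

# 5. Model predictions: classify new, unlabeled sources.
X_new = rng.normal(0.75, 1.0, size=(5, 8))
new_labels = grid.predict(X_new)
print("predicted labels:", new_labels)
```

The held-out set in step 1 is kept away from steps 2 and 4 so that the accuracy in step 3 is an honest estimate of how the model will behave on new sources.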

The data come from the [Sloan Digital Sky Survey](http://www.sdss.org) (SDSS), an imaging survey that has several similarities to LSST (though the SDSS telescope was significantly smaller than LSST's, and the survey did not cover as large an area of the sky).

*Science background*: Many (nearly all?) of the science applications for LSST data will rely on the accurate separation of stars and galaxies in the LSST imaging data. As an example, imagine measuring the structure of the Milky Way without knowing which sources are galaxies and which are stars. 

During this exercise, we will utilize supervised machine-learning methods to separate extended sources (galaxies) and point sources (stars) in imaging data. These methods are highly flexible, and as a result can classify sources at higher fidelity than methods that simply make cuts in a low-dimensional space.
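
To make the contrast with simple cuts concrete, here is a toy comparison. It is a sketch on synthetic data only: the two features, the circular class boundary, the helper `best_cut_accuracy`, and the choice of a k-nearest-neighbors classifier are all assumptions for illustration and say nothing about the actual SDSS features.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Toy data: two features, with class 1 inside a circle and class 0 outside.
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)
X_train, y_train = X[:1000], y[:1000]
X_test, y_test = X[1000:], y[1000:]


def best_cut_accuracy(x_train, y_train, x_test, y_test):
    """Choose the single-feature threshold (and direction) with the best
    training accuracy, then report its accuracy on the test set."""
    best_acc, best_rule = -1.0, None
    for threshold in np.linspace(-1.0, 1.0, 201):
        for sign in (1, -1):
            pred = (sign * (x_train - threshold) > 0).astype(int)
            acc = np.mean(pred == y_train)
            if acc > best_acc:
                best_acc, best_rule = acc, (threshold, sign)
    threshold, sign = best_rule
    test_pred = (sign * (x_test - threshold) > 0).astype(int)
    return np.mean(test_pred == y_test)


# A hard cut on the first feature alone.
cut_acc = best_cut_accuracy(X_train[:, 0], y_train, X_test[:, 0], y_test)

# A flexible supervised model that uses both features together.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_acc = knn.score(X_test, y_test)

print(f"single-feature cut accuracy: {cut_acc:.3f}")
print(f"k-nearest-neighbors accuracy: {knn_acc:.3f}")
```

No single threshold can trace a curved boundary, so the cut does little better than guessing the majority class, while the supervised model can adapt to the boundary's shape.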

    import numpy as np
    from astropy.table import Table
    import matplotlib.pyplot as plt
    %matplotlib notebook

## Problem 1) Examine the Training Data

For this problem, the training set, i.e. sources with known labels, includes stars and galaxies that have been confirmed with spectroscopic observations. The machine learning model is needed because there are $\gg 10^8$ sources with photometric observations in SDSS, and only $4 \times 10^6$ sources with spectroscopic observations. The model will allow us to translate our knowledge from the spectroscopic observations to the entire data set. The features are the $r$-band magnitude measurements made by SDSS (don't worry if you don't know what this means...). This yields 8 features to train the models (significantly fewer than the [454 properties measured for each source in SDSS](https://skyserver.sdss.org/dr12/en/help/browser/browser.aspx#&&history=description+PhotoObjAll+U)).