# Practical Machine Learning with SDSS Data

In this tutorial, we are going to use SDSS data to get some hands-on experience with machine learning. 

As a note in advance: none of the results you'll get out of this are science-worthy. This tutorial is meant to give you a first idea of how to set up your own machine learning model. But the first and most important lesson is this: **don't blindly trust your ML results.** 
As with any other science project, reporting or using results from a machine learning classifier or regressor requires a careful understanding of the biases, caveats, assumptions and limitations that come with the data and the algorithms chosen. Because the data sets you'll be using come straight out of the SDSS catalogue, you can expect funny effects (both subtle and not) that may mess up any classification you'd want to do. In a real-world setting, you would need to understand the limitations of the instrument and the data processing before drawing any scientific conclusions from your procedure.

With that out of the way, let's have some fun with machine learning! In this tutorial, we will use Python and a library called `scikit-learn` to do our machine learning, `pandas` to deal with data structures, and `matplotlib` and `seaborn` to do our plotting. 

In [1]:
# make plots interactive and import plotting functionality
%matplotlib notebook
import matplotlib.pyplot as plt

# pretty plotting
import seaborn as sns

# my standard styles for plots
sns.set_style("whitegrid")
sns.set_context("talk")

# Always need numpy
import numpy as np

# data array operations
import pandas as pd

### Load the Data

Our first task is loading the data. Every one of you should be assigned to a group, and each group should have been assigned a data set. Your task is to find the correct file in this folder and load the data into a `pandas.DataFrame` (if you've never worked with pandas, take a look at the `read_csv` function):

In [2]:
# add your code for loading the data here:
data = pd.read_csv("sdss_dataset1.csv")

The `head` method on your loaded `DataFrame` gives you a quick overview of what's in your data:

In [3]:
data.head()

| Unnamed: 0 | objid | ra | dec | dered_u | dered_g | dered_r | dered_i | dered_z | mag_u | mag_g | ... | u_g_color | g_r_color | r_i_color | i_z_color | class | diff_u | diff_g | diff_g1 | diff_i | diff_z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1237648721216405657 | 147.24805 | -0.035724 | 19.92408 | 17.75023 | 16.76577 | 16.31368 | 15.98227 | 21.0404 | 19.02146 | ... | 2.173847 | 0.984457 | 0.452093 | 0.331408 | GALAXY | -1.116318 | -1.271233 | -1.266117 | -1.325329 | -1.306914 |
| 1 | 1237648721216274720 | 146.92403 | -0.105143 | 23.7651 | 20.41189 | 18.60511 | 17.97406 | 17.66566 | 23.76242 | 21.48446 | ... | 3.353209 | 1.80678 | 0.63105 | 0.308403 | GALAXY | 0.002676 | -1.072573 | -1.060022 | -1.197628 | -1.155197 |
| 2 | 1237650795683446969 | 146.9201 | -0.306462 | 19.41834 | 18.03791 | 17.10992 | 16.70166 | 16.41131 | 20.86501 | 19.79343 | ... | 1.380424 | 0.927998 | 0.408258 | 0.290352 | GALAXY | -1.446667 | -1.755512 | -1.709305 | -1.716152 | -1.696211 |
| 3 | 1237650796220514489 | 147.32951 | 0.028903 | 17.59556 | 16.01916 | 15.31942 | 14.94442 | 14.66613 | 19.76252 | 18.20831 | ... | 1.576405 | 0.699732 | 0.374999 | 0.278297 | GALAXY | -2.16696 | -2.189154 | -2.128597 | -2.133708 | -2.111404 |
| 4 | 1237650795683512955 | 147.08969 | -0.266509 | 24.22258 | 21.01025 | 19.2916 | 18.66112 | 18.2066 | 24.17756 | 21.5875 | ... | 3.21233 | 1.718651 | 0.630482 | 0.454517 | GALAXY | 0.045025 | -0.577246 | -0.587679 | -0.610262 | -0.633143 |


Where you go from here depends on which data set you've downloaded. Specifically, for some data sets, you'll have to pull out the *classes* you want to classify, for others the continuous quantity you want to predict.

Classification
- `sdss_dataset1.csv`: extract the `class` column

Regression
- `sdss_dataset[2:9].csv`: extract the `spec_z` column

Clustering
- TODO: ADD CLUSTERING DATA SETS

**Note**: Don't forget to remove the column with your classes/regression variable from the DataFrame, otherwise you'll use the thing you want to find as a feature, which makes your ML performance *really* good, but also *really* wrong! Take a look at the `drop` method for DataFrames to help you achieve that.

Many of the steps are the same, so just follow along with the exercise below in either case!

Some quick lingo: In machine learning, the things we are trying to learn are often called **labels**, and the quantities we can use to learn them are **features**. For example, in some of the data sets, you're going to try and separate stars and galaxies by their magnitudes and colours. Here, for each **sample** in your data set, you have a bunch of magnitude and colour measurements, your features, and you're trying to predict whether that sample is a galaxy or a star, its label. For the photometric redshift estimation case, you similarly have magnitudes and colours as features, and you're trying to predict the redshifts (your labels). This is called **supervised learning**. 

Note that in this case, we always need examples where we *know* the ground truth: we need to know the class really well, or we need to know the redshift beyond a reasonable doubt (in our case here e.g. through precise spectroscopic measurements). This is often not the case in astronomy (or, indeed, science): we often don't know exactly what our labels should be. In these cases, **unsupervised learning** can be really helpful. Some of you have data sets without labels. You'll be playing around with clustering algorithms.

In [4]:
labels = data["class"]

In [5]:
features = data.drop("class", axis=1)
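For the regression data sets, the same two steps apply, with the `spec_z` column as the label. Here is a minimal sketch that uses a tiny made-up DataFrame in place of the real file; the values in `toy` are invented purely for illustration, and only the column names follow the ones mentioned above.

```python
import pandas as pd

# toy stand-in for one of the regression data sets; all values are made up
toy = pd.DataFrame({
    "dered_u": [19.9, 23.8, 19.4],
    "dered_g": [17.8, 20.4, 18.0],
    "spec_z": [0.08, 0.31, 0.12],
})

# the continuous quantity we want to predict
toy_labels = toy["spec_z"]

# everything else is a feature; drop returns a new DataFrame
# and leaves the original untouched
toy_features = toy.drop("spec_z", axis=1)

print(list(toy_features.columns))
```

Because `drop` returns a copy, `toy` itself still contains `spec_z` afterwards, so you can always go back to the full table.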

### Building a first classifier

Let's start by building a first, simple classifier/regressor. Normally, you wouldn't *start* by doing a classification. But there are some points we're going to make throughout this tutorial, and for those, a classification done without knowing much about the data serves as a useful baseline. In general, though, running an ML algorithm comes at the end of *many* important steps, which is part of the point of this entire tutorial.

We're going to start working with the [k-nearest neighbour algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm). This is one of the simplest machine learning algorithms out there. Essentially, it takes the $k$ nearest neighbours of a given sample and uses these neighbours to give an estimate of what the label for that sample should be. For the classification problem, it assigns the majority vote of the neighbour labels; for regression, it assigns the average of the values of the $k$ nearest neighbours.
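To make the idea concrete, here is a minimal sketch of the majority-vote step written by hand in NumPy. It is only an illustration of the principle, not how scikit-learn implements it; the function `knn_predict` and the toy data are made up for this example, and the distance is assumed to be Euclidean.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbours."""
    # Euclidean distance from the query to every training sample
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # majority vote among the neighbours' labels
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# toy data: two features per sample, invented for illustration
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
train_y = np.array(["STAR", "STAR", "STAR", "GALAXY", "GALAXY", "GALAXY"])

prediction = knn_predict(train_X, train_y, np.array([0.95, 1.0]), k=3)
print(prediction)
```

For regression, the only change would be to replace the vote with the mean of the neighbours' values.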

In `scikit-learn`, these live in `sklearn.neighbors` as [`KNeighborsClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) and [`KNeighborsRegressor`](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor). The unsupervised equivalent is in [`NearestNeighbors`](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors).

Try it out! Don't worry about the parameters right now; we'll get to those further down. 

**Hint**: Basically all algorithms implemented in scikit-learn share the same interface. Nearly all of them have a `fit` method that will fit your data, a `predict` method that will predict the classes/values of new samples, and a `score` method that tells you something about how good your algorithm is at making predictions. Some algorithms also have `transform` and `fit_transform` methods, which allow you to transform your features (e.g. dimensionality reduction algorithms like Principal Component Analysis). Many machine learning libraries outside of scikit-learn have adopted the same structure, which is super helpful when using these libraries for algorithms that are not implemented in scikit-learn. Learning how to do a workflow in scikit-learn is well worth the investment.
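As a quick illustration of that shared interface, the sketch below runs an estimator (`fit`, `predict`, `score`) and a transformer (`fit_transform`) on synthetic data. The data are random numbers generated just for this example and have nothing to do with SDSS.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)

# synthetic data: 100 samples, 4 features, two well-separated classes
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 4)),
               rng.normal(5.0, 1.0, size=(50, 4))])
y = np.array([0] * 50 + [1] * 50)

# estimator interface: fit, then predict and score
clf = KNeighborsClassifier()
clf.fit(X, y)
predicted = clf.predict(X[:5])
accuracy = clf.score(X, y)

# transformer interface: fit_transform learns the projection and applies it
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(predicted.shape, X_reduced.shape)
```

Swapping `KNeighborsClassifier` for another scikit-learn classifier should leave the rest of the code unchanged, which is the main benefit of the common interface.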

In [6]:
# import the correct class for your problem and instantiate it

from sklearn.neighbors import KNeighborsClassifier

kn = KNeighborsClassifier()

In [7]:
# fit the features and the labels we've just extracted
kn.fit(features, labels)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

Let's *score* the model's performance. For classifiers in scikit-learn, the `score` method calculates the *accuracy* of the predictions. That is, it uses the trained model to calculate predictions for all samples, counts the samples for which it predicted the right label, and divides that count by the total number of samples. So the accuracy is the *fraction of samples for which the algorithm predicted the correct label*. For regressors, `score` instead returns the coefficient of determination, $R^2$, where 1 corresponds to a perfect prediction. 

In [8]:
# calculate the score for your data
kn.score(features, labels)

0.999

What score did you get? It should hopefully be pretty high.
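If you want to convince yourself of what the accuracy actually is, you can compute it by hand and compare. The sketch below does this on synthetic data (random numbers, invented for this example); the variable names are placeholders.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(1)

# synthetic features and a noisy binary label that depends on the first feature
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = KNeighborsClassifier().fit(X, y)

# accuracy by hand: fraction of samples where prediction equals the true label
manual_accuracy = np.mean(model.predict(X) == y)

print(manual_accuracy, model.score(X, y))
```

The two printed numbers should agree, because for a classifier `score` is exactly this fraction.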

**Exercise**: Do you trust this score? Why? Why not? Discuss with your group! Are there ways you could make your score more trustworthy? 

## Training, Validation and Test Sets

Ok, so, are we done now? Obviously not, or this tutorial would be pretty short. :)
You hopefully just got a super high score on your classification/regression, but is this number actually useful? The answer is both yes and no. You just **trained** your model on a data set, but you also **tested** the model's performance on the *same* data. As a baseline, it's useful to know how well your algorithm does on the training data. In a simple example, imagine the quantity you're trying to predict lies on a parabola, but you're fitting a straight line to it. Even for your training data, the performance won't be great, because no matter what you do, your straight line won't do a good job of representing a parabola. 
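The parabola example is easy to try out. The sketch below fits a straight line to noiseless points on $y = x^2$ and scores it on the very same points; the data are generated here just for illustration. Note that for a regressor, `score` returns $R^2$ rather than an accuracy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# points on a parabola, symmetric around x = 0
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = x.ravel() ** 2

# fit a straight line and score it on the training data itself
line = LinearRegression().fit(x, y)
train_r2 = line.score(x, y)

print(train_r2)
```

The training score comes out close to zero: the best straight line through a symmetric parabola is flat, so it explains essentially none of the variation, even on the data it was fitted to.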

But there is another problem here. Your performance on the same data you used to train your algorithm won't tell you anything about how well your algorithm generalizes to **new** examples, which is what we ultimately care about. It is easy, especially for some of the more complex algorithms, to make an arbitrarily complicated function that will reproduce the training data really well, but because it's so specifically trained on a specific data set, it'll do horribly on new examples. This is often called **overfitting**.
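An extreme case makes the point. In the sketch below, the features are pure noise and the labels are coin flips, so there is nothing to learn. A nearest-neighbour classifier with $k=1$ still reproduces its training data, because every training sample is its own nearest neighbour. All data are randomly generated for this illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(42)

# features are pure noise and labels are coin flips: nothing to learn
X_train = rng.normal(size=(500, 5))
y_train = rng.randint(0, 2, size=500)

# a second, independent draw plays the role of "new" examples
X_new = rng.normal(size=(500, 5))
y_new = rng.randint(0, 2, size=500)

# with k=1 the model simply memorizes the training set
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
train_score = memorizer.score(X_train, y_train)
new_score = memorizer.score(X_new, y_new)

print(train_score, new_score)
```

The training accuracy is perfect, while the accuracy on the new samples should hover around chance level (about 0.5), which is what you'd expect from a model that has memorized noise.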

What we really need is some data the model has never seen before, but for which we know what the output should be. In machine learning, it is standard to separate out a **test set**, i.e. part of the data for which you know the answer, but which you will not look at *until the very end*.

**Note**: For the regression data sets, I have helpfully saved a test set for you in a different file which you'll look at later. You'll need this to compare your results with other groups.
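One common way to set aside a test set is `train_test_split` from `sklearn.model_selection`. The sketch below holds out 20% of a synthetic data set; the data, the 20% fraction and the fixed `random_state` are choices made for this example, not requirements.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# synthetic stand-in for features and labels
X = rng.normal(size=(100, 4))
y = rng.randint(0, 2, size=100)

# hold out 20% of the samples; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)
```

After the split, only `X_train` and `y_train` should be used for fitting and tuning; `X_test` and `y_test` stay untouched until the final evaluation.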