# This tutorial walks through an example classification with mock color-color data of stars and QSOs (quasi-stellar objects, or quasars).  We use logistic regression as the "Machine Learning" tool for classification.

## Author: Camille Avestruz 

## Date Created: July 2019

In [0]:
#  This is where we import "packages". Think of these as pre-written bits of code (that are known to already work!) 
#      that we can just use.  
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np

# Learning goal 1 - What does a classifier do?
## Let's start with some example data.  

Below is a plot of stars and quasars in color space: g-r color vs. u-g

In [0]:
# Generate fake g-r and u-g data that roughly resembles the SDSS plot 
#   (this avoids requiring access to astroML this morning)

gr_qso_vals = np.random.normal(0.1, 0.2, 300)
ug_qso_vals = np.random.normal(0.2, 0.3, 300)


gr_stars_vals = np.random.normal(0.8, 0.2, 1000)
ug_stars_vals = np.random.normal(.8, 0.25, 1000)
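
The draws above change every time the cell is run. If repeatable numbers are wanted, one option is a seeded NumPy generator. The following is a minimal sketch, not part of the original tutorial: the seed value and the `*_seeded` variable names are assumptions chosen for illustration.

```python
import numpy as np

# Assumption: the seed value 42 is arbitrary; any fixed integer gives repeatable draws.
rng = np.random.default_rng(42)

# Same means, widths, and sample sizes as the mock data above.
gr_qso_seeded = rng.normal(0.1, 0.2, 300)
ug_qso_seeded = rng.normal(0.2, 0.3, 300)
gr_stars_seeded = rng.normal(0.8, 0.2, 1000)
ug_stars_seeded = rng.normal(0.8, 0.25, 1000)

# Re-creating the generator with the same seed reproduces the same values.
rng_again = np.random.default_rng(42)
same_draw = np.allclose(gr_qso_seeded, rng_again.normal(0.1, 0.2, 300))
print('Draws repeat with the same seed:', same_draw)
```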

In [0]:
plt.plot(ug_qso_vals, gr_qso_vals, '.', ms=2, c='r', label='QSO')

plt.plot(ug_stars_vals, gr_stars_vals, '.', ms=2, c='b', label='stars')

plt.xlabel('u-g color',fontsize='xx-large')
plt.ylabel('g-r color',fontsize='xx-large')
plt.xlim(-.5,2.5)
plt.ylim(-1,1.5)
plt.legend(fontsize='large')

## Note that the above plot is color coded based on whether a data point corresponds to a QSO or a star, i.e. the data is labeled.  

## 1 min discussion with your neighbor: How would you pick out stars and QSOs from the following unlabeled dataset?

In [0]:
gr_vals_no_label = np.concatenate((np.random.normal(0.1, 0.2, 100), np.random.normal(0.8, 0.2, 100)))
ug_vals_no_label = np.concatenate((np.random.normal(0.2, 0.3, 100), np.random.normal(.8, 0.25, 100)))

plt.plot(ug_vals_no_label, gr_vals_no_label, '.', ms=2, c='k', label='no label')

plt.xlabel('u-g color',fontsize='xx-large')
plt.ylabel('g-r color',fontsize='xx-large')
plt.xlim(-.5,2.5)
plt.ylim(-1,1.5)
plt.legend(fontsize='large')

## What were your assumptions? (1min)

### Let's now build a model based on some assumptions (to discuss further later)
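
One of the assumptions behind logistic regression is that the probability of a label is a smooth "S-shaped" (sigmoid) function of a weighted sum of the features. The sketch below illustrates that idea only. The weights and intercept are hypothetical values picked by hand, not values learned by any model in this tutorial.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and intercept, chosen by hand for illustration only
# (a trained model would learn these from labeled data).
weights = np.array([4.0, 2.0])   # one weight per feature: g-r, u-g
intercept = -3.0

# Two example points: one near the mock QSO colors, one near the mock star colors.
points = np.array([[0.1, 0.2],
                   [0.8, 0.8]])

# Probability that each point is a star (label 1), then a hard label at the 0.5 threshold.
prob_star = sigmoid(points @ weights + intercept)
predicted_label = (prob_star >= 0.5).astype(int)
print(prob_star, predicted_label)
```

With these hand-picked numbers, the first point lands below the 0.5 threshold (QSO) and the second lands above it (star).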

In [0]:
# This is the model we will use to draw a decision boundary
from sklearn.linear_model import LogisticRegression

# We create an instance of the model (still untrained) here.
model = LogisticRegression()

In [0]:
# We have to train our model with our labeled data.  Feature values are commonly named X_?, 
#    and label values are commonly named y_?

# Feature 1
X_data_gr = np.concatenate((gr_qso_vals, gr_stars_vals)) 
print('g-r feature has ',X_data_gr.shape, 'values (column)')

# Feature 2
X_data_ug = np.concatenate((ug_qso_vals, ug_stars_vals))
print('u-g feature has ',X_data_ug.shape, 'values (column)')

# Features
X_data = np.array([X_data_gr, X_data_ug]).T
print('The features are of shape, ',X_data.shape, ' (two columns 1300 rows)')

# Labels - we will use 0 to indicate QSO, and 1 to indicate star.  (e.g. Is this a star?  False vs. True.)
y_data = np.concatenate( (np.zeros(len(gr_qso_vals)), np.ones(len(gr_stars_vals))) )
print('The labels are of length ',len(y_data), ', which is the total number of labeled data points we have.')

g-r feature has  (1300,) values (column)
u-g feature has  (1300,) values (column)
The features are of shape,  (1300, 2)  (two columns 1300 rows)
The labels are of length  1300 , which is the total number of labeled data points we have.


In [0]:
#  We train the model with our labeled data, but first we have to split the data (randomly) into some that will 
#    be used to train the model, and some that will be saved to later assess how well our model does on data it 
#    has never seen before

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.30)

model.fit(X_train,y_train)

predictions = model.predict(X_test)
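
The split above is random, so the exact train and test sets differ from run to run. As a minimal sketch (an optional variation, not the tutorial's method), `train_test_split` also accepts `random_state` for a repeatable split and `stratify` to keep the QSO/star ratio the same in both sets. The `*_demo`, `X_tr`, and `X_te` names and the seed values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: fixed seed so the mock data repeats between runs.
rng = np.random.default_rng(0)
X_demo = np.column_stack((
    np.concatenate((rng.normal(0.1, 0.2, 300), rng.normal(0.8, 0.2, 1000))),   # g-r
    np.concatenate((rng.normal(0.2, 0.3, 300), rng.normal(0.8, 0.25, 1000))),  # u-g
))
y_demo = np.concatenate((np.zeros(300), np.ones(1000)))  # 0 = QSO, 1 = star

# random_state makes the split repeatable; stratify keeps the class ratio in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo)

demo_model = LogisticRegression().fit(X_tr, y_tr)

# Fraction of held-out labels that the model predicts correctly.
test_accuracy = accuracy_score(y_te, demo_model.predict(X_te))
print('Training points:', len(y_tr), ' Test points:', len(y_te))
print('Fraction of test labels predicted correctly:', test_accuracy)
```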



# Let's visualize the decision boundary we just made.
The plotting code below is adapted from this scikit-learn example:  https://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html?fbclid=IwAR190kLQgXQIIjC6n5ZyPIZAu_PigFJAnyEP2PkXdgnGiUe8fKbmZzQHmgE#sphx-glr-auto-examples-linear-model-plot-iris-logistic-py
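
For a two-feature logistic regression, the decision boundary is a straight line: the set of points where the predicted probability is exactly 0.5. The sketch below, which regenerates mock data with an assumed seed and uses hypothetical `*_demo` names, reads the fitted `coef_` and `intercept_` attributes and checks that points on that line sit at a probability of 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumption: fixed seed and the same mock distributions as above.
rng = np.random.default_rng(1)
X_demo = np.column_stack((
    np.concatenate((rng.normal(0.1, 0.2, 300), rng.normal(0.8, 0.2, 1000))),   # g-r
    np.concatenate((rng.normal(0.2, 0.3, 300), rng.normal(0.8, 0.25, 1000))),  # u-g
))
y_demo = np.concatenate((np.zeros(300), np.ones(1000)))  # 0 = QSO, 1 = star

demo_model = LogisticRegression().fit(X_demo, y_demo)

# For a binary problem, coef_ has shape (1, 2): one weight per feature.
w_gr, w_ug = demo_model.coef_[0]
b = demo_model.intercept_[0]

# The boundary satisfies w_gr * (g-r) + w_ug * (u-g) + b = 0; solve for u-g.
gr_line = np.linspace(-0.5, 2.5, 5)
ug_line = -(w_gr * gr_line + b) / w_ug

# Points on the boundary should have a star probability of 0.5.
prob_on_line = demo_model.predict_proba(np.column_stack((gr_line, ug_line)))[:, 1]
print(prob_on_line)
```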

In [0]:
def visualize_boundary_of_trained_model_and_data(X_to_plot, y_to_plot, legend_data_name, trained_model,
    h = .02, xaxis_min = -.5, xaxis_max = 2.5, yaxis_min = -1., yaxis_max = 1.5): 
    '''This function is a visualization code to see the decision boundary compared with our data 
    based on code from the link above.

    Parameters
    ----------
    X_to_plot : array-like
        features, should be at least 2-d for 2-d visualization
    y_to_plot : array-like
        one dimensional labels either 0 or 1
    legend_data_name : str
        name of data you are overplotting 
    trained_model : sklearn.linear_model
        trained model
    h : float 
        step size of the mesh grid used to visualize areas of decision
    xaxis_min : float
        optional 
    xaxis_max : float
        optional
    yaxis_min : float
        optional 
    yaxis_max : float
        optional
        
    Returns
    -------
    None
        The plot is drawn on the current matplotlib axes.
    
    '''
    
    xx, yy = np.meshgrid(np.arange(xaxis_min, xaxis_max, h), np.arange(yaxis_min, yaxis_max, h))
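    # Use the model passed in as an argument (not the global `model`) to label every grid point
    Z = trained_model.predict(np.c_[xx.ravel(), yy.ravel()])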

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

    # Plot also the training points
    plt.scatter(X_to_plot[:, 0], X_to_plot[:, 1], c=y_to_plot, edgecolors='k', cmap=plt.cm.Paired, label=legend_data_name)
    plt.xlabel('g-r', fontsize='xx-large')
    plt.ylabel('u-g', fontsize='xx-large')

    plt.xlim(xaxis_min, xaxis_max)
    plt.ylim(yaxis_min, yaxis_max)

    plt.legend()


In [0]:
#  Here we visualize the decision boundary of the trained model, and overplot the training data used to train 
#    that model
visualize_boundary_of_trained_model_and_data(X_train, y_train, 'Training data', model)

## How well do you think we did above?
###  There are multiple ways to *quantify* how well we did.  Let's first visually see how well the model does on the test set, then use a *metric* to quantify it.

In [0]:
#  Here we show where the test data lies (which was NOT used to train the model)
visualize_boundary_of_trained_model_and_data(X_test, y_test, 'Test data', model)

## Let's now quantify how well our model is doing with a metric.  Recall the ROC curve from the slides: the area under the curve (AUC) is 1 for a perfect classifier, and 0.5 for a classifier that is as good as a coin flip!

In [0]:
from sklearn.metrics import roc_auc_score

print('When evaluating our model on the training set, our model has an AUC of:  ',  
      roc_auc_score(y_train, model.predict(X_train)))

#  Do you think the AUC of the model evaluated on the test set should be higher, lower, or the same as the AUC of the model evaluated on the training set?

In [0]:
print('When evaluating our model on the test set, our model has an AUC of:  ',  
      roc_auc_score(y_test, model.predict(X_test)))
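
The cells above pass hard 0/1 predictions to `roc_auc_score`, which evaluates the model at a single decision threshold. The function also accepts continuous scores, such as the star probability from `predict_proba`, which lets the ROC curve sweep over all thresholds. The sketch below compares the two on regenerated mock data; the seed and the `*_demo`, `X_tr`, and `X_te` names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumption: fixed seed and the same mock distributions as above.
rng = np.random.default_rng(2)
X_demo = np.column_stack((
    np.concatenate((rng.normal(0.1, 0.2, 300), rng.normal(0.8, 0.2, 1000))),   # g-r
    np.concatenate((rng.normal(0.2, 0.3, 300), rng.normal(0.8, 0.25, 1000))),  # u-g
))
y_demo = np.concatenate((np.zeros(300), np.ones(1000)))  # 0 = QSO, 1 = star

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=2)
demo_model = LogisticRegression().fit(X_tr, y_tr)

# AUC from hard 0/1 labels: only one threshold (0.5) is represented.
auc_from_labels = roc_auc_score(y_te, demo_model.predict(X_te))

# AUC from probabilities: column 1 of predict_proba is the probability of label 1 (star).
auc_from_scores = roc_auc_score(y_te, demo_model.predict_proba(X_te)[:, 1])

print('AUC from hard labels:  ', auc_from_labels)
print('AUC from probabilities:', auc_from_scores)
```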

#  2 min discussion:  
## What happens if stars and QSOs overlap even more?  How would our model "performance" change?  
## How might we improve the model performance?  (Hint:  Think about the movie likes/dislikes from the slides.  What if we only knew about Zootopia and Moonlight preferences, but not Deadpool preferences?)

#  Bonus Exercise:  Look at how we constructed the train/test split and evaluated the model on new data.  What if the parameters of the normal distribution were different between the labeled data used to train the model and the new data?  See how the model performs.  (Possible real astro reasons for this:  Sample Selection Bias.)
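
A possible starting point for the bonus exercise is sketched below. It trains on mock data drawn with the tutorial's parameters, then evaluates on new data whose star colors are drawn from a different normal distribution. The helper `make_mock_colors`, the shifted means (0.5, 0.5), and the seed are all hypothetical choices for illustration; try your own values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def make_mock_colors(rng, star_mean_gr, star_mean_ug, n_qso=1000, n_star=1000):
    """Hypothetical helper: draw mock (g-r, u-g) features and 0/1 labels (0 = QSO, 1 = star)."""
    gr = np.concatenate((rng.normal(0.1, 0.2, n_qso), rng.normal(star_mean_gr, 0.2, n_star)))
    ug = np.concatenate((rng.normal(0.2, 0.3, n_qso), rng.normal(star_mean_ug, 0.25, n_star)))
    labels = np.concatenate((np.zeros(n_qso), np.ones(n_star)))
    return np.column_stack((gr, ug)), labels

rng = np.random.default_rng(3)  # assumption: fixed seed for repeatability

# Train on data drawn with the tutorial's star parameters.
X_train_demo, y_train_demo = make_mock_colors(rng, 0.8, 0.8)
demo_model = LogisticRegression().fit(X_train_demo, y_train_demo)

# New data from the same distributions as the training data.
X_matched, y_matched = make_mock_colors(rng, 0.8, 0.8)
# New data where the star colors have shifted toward the QSO colors (assumed shift).
X_shifted, y_shifted = make_mock_colors(rng, 0.5, 0.5)

auc_matched = roc_auc_score(y_matched, demo_model.predict_proba(X_matched)[:, 1])
auc_shifted = roc_auc_score(y_shifted, demo_model.predict_proba(X_shifted)[:, 1])
print('AUC on new data like the training data:', auc_matched)
print('AUC on new data with shifted star colors:', auc_shifted)
```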