# Logistic Regression Classifier
In this exercise, we'll experiment with a multi-class logistic regression classifier.  We'll compare the performance of a one-versus-all variant to that of a multinomial variant.
See [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)

## Acknowledgements
Much of this work draws on material originally presented by Joe Findlay to the Fort Collins Data Science Meetup in June 2018.  See https://github.com/findaz/FoCoAstronomy. The feature set used in the modeling comes from the Coursera course [Data Driven Astronomy](https://www.coursera.org/learn/data-driven-astronomy/home/welcome).

## Citation

[Scikit-learn: Machine Learning in Python](http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html)


In [None]:
import numpy as np
import pandas as pd

In [None]:
# Set up matplotlib for inline plotting

import matplotlib.pyplot as plt

%matplotlib inline


For this exercise, we'll classify astronomical objects (stars, galaxies and quasars) based on their spectral characteristics.  We'll use data obtained from the [Sloan Digital Sky Survey](https://www.sdss.org/surveys/). The data has already been downloaded into a `csv` file for convenience. Here's the Python code that produced the data set we'll be working with:

```python
from astroquery.sdss import SDSS

# query stars, quasars and galaxies
NOBJECTS = 40000
query_text = ('\n'.join(
    ("SELECT TOP %i" % NOBJECTS,
    "   p.objid, s.class as objtype, p.u, p.g, p.r, p.i, p.z, s.z as redshift, s.zerr as redshift_err",
    "FROM PhotoObj AS p",
    "   JOIN SpecObj AS s ON s.bestobjid = p.objid",
    "WHERE ",
    "   p.u BETWEEN 0 AND 19.6",
    "   AND p.g BETWEEN 0 AND 20" ,
    "   AND (s.class = 'GALAXY' OR s.class = 'QSO' OR s.class = 'STAR')")))
    
res = SDSS.query_sql(query_text)

df = res.to_pandas()

```
The resulting data frame, `df`, was written to a `.csv` file called `star_data.csv` in the `data` folder of the current repo.

## Load the Data

In [None]:
path = 'data/star_data.csv'

In [None]:
stars = pd.read_csv(path)

In [None]:
stars.info()

In [None]:
stars.head()

## Exploratory Data Analysis

We're going to build a classifier which predicts an object's `objtype` based on its spectral characteristics `u`, `g`, `r`, `i` and `z`. What is the set of object types, and how prevalent is each in the data set? Use something like:
```python
stars.groupby('objtype').count().objid
```
to find out.
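
As a point of comparison, `value_counts` produces the same tally (when no values are missing) and can also report each type's share of the data. Here is a minimal sketch on a toy frame; the rows are invented for illustration and are not taken from `star_data.csv`:

```python
import pandas as pd

# Toy stand-in for the `stars` frame; these rows are invented for illustration.
toy = pd.DataFrame({
    'objid': [1, 2, 3, 4, 5],
    'objtype': ['GALAXY', 'STAR', 'GALAXY', 'QSO', 'GALAXY'],
})

# Absolute number of rows per object type.
counts = toy['objtype'].value_counts()

# Fraction of the data set in each class; handy for spotting class imbalance.
shares = toy['objtype'].value_counts(normalize=True)

print(counts)
print(shares)
```

A strongly unbalanced set of shares is worth keeping in mind later, since overall accuracy can hide poor performance on a rare class.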

In [None]:
# your code here to show counts by object type

To see it visually, try this:
```python
stars.groupby('objtype').count()['objid'].plot.bar()
```

In [None]:
# your code here to produce graph of counts by object type

In [None]:
types = ['STAR', 'QSO', 'GALAXY']
colors = ['red', 'green', 'blue']

Let's see how the object types map out relative to the spectral measurements.
Try this:
```python
fig, ax = plt.subplots(ncols=3, figsize=(12,4), sharey=True, sharex=True)
for t,c,a in zip(types, colors,ax):
    df = stars.query('objtype == @t')
    a.scatter(df.u, df.g, color=c, label=t)
    a.legend()
plt.tight_layout()
```
and repeat for the  `(r,i)` and `(i,z)` spectral pairs.

In [None]:
#your code here to show plots of object type in the u,g plane

In [None]:
#your code here to show plots of object type in the r,i plane

In [None]:
#your code here to show plots of object type in the i,z plane

## Data Wrangling

In [None]:
# some needed libraries
from sklearn.model_selection import train_test_split # to partition the dataset into training and test
from sklearn.metrics import accuracy_score, confusion_matrix # for model evaluation
from sklearn.linear_model import LogisticRegression # classifier model to test

The Coursera course cited above uses the __difference__ in adjacent spectral intensities as the features in its classification model.  Here we'll construct 4 features:
1. U minus G
1. G minus R
1. R minus I
1. I minus Z
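
To make the idea concrete, here is a small sketch of adjacent-band differences (astronomers usually call these color indices) computed with array slicing. The magnitudes are hypothetical, and the slicing approach is just one way to do it; the exercise below builds the same four columns one at a time from the data frame.

```python
import numpy as np

# Hypothetical magnitudes for two objects, columns ordered u, g, r, i, z.
mags = np.array([
    [19.0, 17.5, 16.8, 16.4, 16.1],
    [18.2, 18.0, 17.9, 17.7, 17.6],
])

# Subtract each band from the one before it: u-g, g-r, r-i, i-z.
color_indices = mags[:, :-1] - mags[:, 1:]

print(color_indices.shape)  # (2, 4)
print(color_indices)
```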

In [None]:
# function to create features from input data set
# returns (n,4) feature matrix and (n,) label vector
def get_features_labels(df):
    features = np.zeros((len(df), 4))
    features[:,0] = df.u-df.g
    #your code here
    features[:,1] = None # G minus R
    features[:,2] = None # R minus I
    features[:,3] = None # I minus Z
    
    # get the object type labels
    labels = None
    return features, labels

In [None]:
#get list of all possible labels (need this later)
labs = stars.objtype.unique().tolist()

Use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function in [sklearn.model_selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) to create an 80%/20% training and test split of the `stars` data frame.
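
As a rough illustration of the call, the sketch below splits a toy frame 80%/20%. The frame, the variable names `toy_train` and `toy_test`, and the choice of `random_state=0` are assumptions made for this example only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `stars`; the values are invented.
toy = pd.DataFrame({'u': range(10), 'objtype': ['STAR', 'GALAXY'] * 5})

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

print(len(toy_train), len(toy_test))  # 8 2
```

If the classes are unbalanced, passing `stratify=toy['objtype']` is one way to keep the class proportions similar in both parts.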

In [None]:
# partition data into training and test, 20% test
trainingSet, testSet = None

We further want to partition the training set into a development set and a validation set so that we can do model development and hyper-parameter selection without touching the test data.
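
Two successive 20% holdouts leave roughly 64% of the rows for development, 16% for validation and 20% for testing. The helper below, `split_sizes`, is a hypothetical function written only to show the arithmetic; scikit-learn's own rounding may differ by a row.

```python
def split_sizes(n_rows, holdout=0.2):
    """Return (dev, val, test) row counts for two successive holdout splits."""
    n_test = round(n_rows * holdout)    # first split: carve off the test set
    n_train = n_rows - n_test
    n_val = round(n_train * holdout)    # second split: carve validation out of training
    n_dev = n_train - n_val
    return n_dev, n_val, n_test

print(split_sizes(1000))  # (640, 160, 200)
```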

In [None]:
# further split training set into devel and validation, 20% validation
devSet, valSet = None

Now, get the features and labels from  each of the data sets: `dev`, `val` and `test`. 

In [None]:
#get the features and labels for each of the sets
x_dev,  y_dev  = get_features_labels(devSet)
x_val,  y_val  = None
x_test, y_test = None

## One versus Rest Classifier
The One versus Rest classifier reduces a classification problem with k labels to k binary classifications (each label against all the others), then picks the label with the highest probability. To use this technique, specify `ovr` as the value of the `multi_class` parameter of the logistic regression object, as in:
```python
clf_ovr = LogisticRegression(random_state=0, solver='liblinear', multi_class='ovr')
```
Then fit the model by passing the development set to the object's `.fit` method:
```python
clf_ovr.fit(x_dev, y_dev)
```
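
To see what the reduction to k binary problems looks like, here is a hand-rolled sketch on synthetic data. It is an illustration of the idea, not necessarily how scikit-learn implements `multi_class='ovr'` internally; the data, the cluster centers and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data with well-separated centers (invented for illustration).
X, y = make_blobs(n_samples=300, centers=[[-5, 0], [0, 5], [5, 0]],
                  cluster_std=0.8, random_state=0)
classes = np.unique(y)

# Fit one binary "this class vs. everything else" model per class and
# collect each model's probability for its own class as a column.
scores = np.column_stack([
    LogisticRegression().fit(X, (y == c).astype(int)).predict_proba(X)[:, 1]
    for c in classes
])

# The one-versus-rest prediction is the class whose binary model is most confident.
pred = classes[scores.argmax(axis=1)]
print('accuracy:', np.mean(pred == y))
```

If your scikit-learn version warns that `multi_class` is deprecated, the `sklearn.multiclass.OneVsRestClassifier` wrapper is an alternative that should give the same one-versus-rest behaviour.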

In [None]:
#your code here
clf_ovr = None

#fit the model on the dev set
None

Compute accuracy for both the development and the validation sets. The logistic regression object's `predict` method makes predictions given an input feature matrix; `sklearn.metrics.accuracy_score` computes accuracy by comparing the true labels with the predicted labels, as in:
```python
# calculate prediction accuracy on the dev set
pred_ovr_dev = clf_ovr.predict(x_dev)
acc_ovr_dev = accuracy_score(y_dev, pred_ovr_dev)
```
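
Accuracy is simply the fraction of predictions that match the true labels. The toy labels below are invented to show that `accuracy_score` agrees with computing that fraction by hand:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented labels: three of the four predictions match the truth.
y_true = np.array(['STAR', 'QSO', 'GALAXY', 'STAR'])
y_pred = np.array(['STAR', 'QSO', 'STAR', 'STAR'])

acc = accuracy_score(y_true, y_pred)
manual = np.mean(y_true == y_pred)   # same quantity computed by hand

print(acc, manual)  # 0.75 0.75
```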
In the cell below, compute the accuracy on the development set and the validation set.

In [None]:
#your code here
# calculate prediction accuracy on the dev set
pred_ovr_dev = None
acc_ovr_dev = None

#calculate prediction accuracy on the validation set
pred_ovr_val = None
acc_ovr_val = None

print('OVR Training Set Accuracy: ', acc_ovr_dev)
print('OVR Validation Set Accuracy: ', acc_ovr_val)

`sklearn.metrics.confusion_matrix` shows counts of the actual labels versus the predicted labels for each of the labels in the data sets.  It takes two required arguments: the first is the vector of true labels, and the second is the vector of predicted labels.  Call it as follows:
```python
cm_ovr = pd.DataFrame(confusion_matrix(y_val, pred_ovr_val, labels=labs),
                        index=labs, columns=labs)
```
Wrapping the function in a data frame as above puts the class labels on the result, making it easier to interpret.
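
Here is a small sketch of how to read the result, using invented labels. Rows are true labels and columns are predictions, so the diagonal holds the correct classifications. The `recall` series (the fraction of each true class that was recovered) is an extra added for this example and is not part of the exercise.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

toy_labels = ['STAR', 'QSO', 'GALAXY']   # row/column order for the matrix
y_true = ['STAR', 'QSO', 'GALAXY', 'GALAXY', 'STAR']
y_pred = ['STAR', 'GALAXY', 'GALAXY', 'GALAXY', 'QSO']

cm = pd.DataFrame(confusion_matrix(y_true, y_pred, labels=toy_labels),
                  index=toy_labels, columns=toy_labels)

# Dividing the diagonal by the row totals gives the fraction of each
# true class that was predicted correctly.
recall = pd.Series(cm.values.diagonal() / cm.values.sum(axis=1), index=toy_labels)

print(cm)
print(recall)
```

Per-class figures like these are what reveal whether one model handles a particular object type better than another, even when overall accuracy is similar.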

In [None]:
#Your code here
cm_ovr = None

#display it
cm_ovr

## Multinomial Classifier
The multinomial classifier fits a single model over all k classes at once, treating the label as a draw from a multinomial distribution rather than combining k separate binary models. To use this technique, specify `multinomial` as the value of the `multi_class` parameter of the logistic regression object, as in:
```python
clf_multi = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')
```
Then fit the model by passing the development set to the object's `.fit` method:
```python
clf_multi.fit(x_dev, y_dev)
```
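
In a multinomial logistic regression, each class gets a linear score and the scores are converted into probabilities with the softmax function. The sketch below shows that step in isolation; the weights `W`, intercepts `b` and feature vector `x` are hypothetical numbers chosen for illustration, not fitted values.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical weights (3 classes x 4 features) and intercepts, not fitted values.
W = np.array([[ 1.0, -0.5,  0.2,  0.0],
              [-0.3,  0.8,  0.1,  0.4],
              [ 0.0,  0.1, -0.6,  0.9]])
b = np.array([0.1, -0.2, 0.0])
x = np.array([1.5, 0.7, 0.4, 0.3])      # one object's four features

probs = softmax(W @ x + b)              # one probability per class
print(probs, probs.sum())
```

The predicted class is the one with the largest probability. Because all classes share one normalization, raising the probability of one class necessarily lowers the others.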

In [None]:
#Your code here
clf_multi = None

#fit the model on the dev set
None

Compute accuracy for both the development and the validation sets. The logistic regression object's `predict` method makes predictions given an input feature matrix; `sklearn.metrics.accuracy_score` computes accuracy by comparing the true labels with the predicted labels, as in:
```python
# calculate prediction accuracy on the dev set
pred_multi_dev = clf_multi.predict(x_dev)
acc_multi_dev = accuracy_score(y_dev, pred_multi_dev)
```
In the cell below, compute the accuracy on the development set and the validation set.

In [None]:
#Your code here
pred_multi_dev = None
acc_multi_dev = None

pred_multi_val = None
acc_multi_val = None

print('Multinomial Training Set Accuracy: ', acc_multi_dev)
print('Multinomial Validation Set Accuracy: ', acc_multi_val)


`sklearn.metrics.confusion_matrix` shows counts of the actual labels versus the predicted labels for each of the labels in the data sets.  It takes two required arguments: the first is the vector of true labels, and the second is the vector of predicted labels.  Call it as follows:
```python
cm_multi = pd.DataFrame(confusion_matrix(y_val, pred_multi_val, labels=labs),
                        index=labs, columns=labs)
```
Wrapping the function in a data frame as above puts the class labels on the result, making it easier to interpret.

In [None]:
#your code here
cm_multi = None

#display the confusion matrix
cm_multi

## Model Comparison

In [None]:
#make a data frame for tabular comparison of results
mod_res=pd.DataFrame([[acc_ovr_dev, acc_ovr_val], [acc_multi_dev, acc_multi_val]],
                    columns=['Development','Validation'],
                    index=['OneVersusAll','MultiNomial'])

In [None]:
mod_res

In [None]:
# show graphically
ax=mod_res.plot.bar(title='Model Accuracy')
ax.legend(loc='lower right', title='Data Set')
ax.set_ylabel('Accuracy')
ax.grid()
ax.set_xlabel('Model Type')

## Conclusion
There is not much difference between the overall performance of the two model types; however, the multinomial model does a better job of classifying QSOs.

Hopefully you've learned how to construct different logistic regression models and compare their performance.