# Homework 7 : Linear / Nonlinear Classification

Each subproblem in Problems 1 and 2 is worth 10 pts.  Problems 3 and 4 are worth 15 points each. All cells are marked with instructions to insert your code.  Please complete all cells as directed.

**What to turn in**:
 -  Please print the notebook containing the answers and results into a PDF file (you can use `File - Print`). Submit this PDF file to the main homework entry in Gradescope. Be sure to locate your answers for each problem when you submit, as usual. If for some reason you cannot print the notebook to a PDF file, you can create a Microsoft Word document and copy-paste screenshots showing your code and output, part by part.
 -  You also need to submit this Jupyter notebook file, filled in with your answers, to the code entry in Gradescope.

**Description**:
This homework will study 3-class classification in the famous "Iris" dataset.  The dataset was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear classifier to distinguish the species from each other. We will do the same using classifiers that we have learned.


In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
import warnings

# Suppress warnings
warnings.filterwarnings("ignore")
np.random.seed(0)

## Problem 1 : Load / Explore Data

### (a)

Because this dataset is so well-known, Scikit-Learn includes a special function for loading it, which is provided below.  Do the following in the cell below:
  * Load the data and create a Train / Test split with 25% test data (train_test_split() function)
  * For the above, make sure to use the provided random state so that results are repeatable
  * Display the training inputs (you can use function display())

Note: You will need the feature names later on.  It is helpful at this point to store them in a list using the DataFrame.columns property.
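
As a rough illustration of how `train_test_split` behaves, here is a minimal sketch on a small made-up DataFrame rather than the Iris data. The names `toy_X`, `toy_y`, and `feature_names` are placeholders chosen for this example only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 rows, 2 features, 2 classes.
toy_X = pd.DataFrame({"feat_a": range(8), "feat_b": range(8, 16)})
toy_y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1], name="label")

# test_size=0.25 holds out a quarter of the rows; a fixed random_state
# makes the shuffle, and therefore the split, repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=1234
)

# Feature names come straight from the DataFrame's columns.
feature_names = list(toy_X.columns)
print(X_tr.shape, X_te.shape, feature_names)
```

The inputs and labels are split with the same row permutation, so `X_tr.index` and `y_tr.index` stay aligned.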

In [None]:
# use this random state for train/test split
random_state=1234

iris = datasets.load_iris(as_frame=True)
X = iris.data
y = iris.target

# Insert code here

### (b)

Now we will explore our feature distributions.  In the cell below, use the Pandas DataFrame plot feature to plot the density of each feature in the training data.

[ Documentation - Pandas - DataFrame.plot ](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html)
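
For orientation, the following sketch shows one way `DataFrame.plot` can draw a kernel density estimate per column. It uses made-up Gaussian data (`toy` is a placeholder name), and the density plot kind relies on SciPy being installed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two features with different centers and spreads.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feat_a": rng.normal(0.0, 1.0, size=200),
    "feat_b": rng.normal(3.0, 0.5, size=200),
})

# kind="density" draws one kernel density estimate per column on shared axes.
ax = toy.plot(kind="density", title="Feature densities (toy data)")
ax.set_xlabel("feature value")
```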

In [None]:
# Insert code here

### (c)

Sometimes it is better to look at distributions of each feature plotted together.  In the cell below produce a boxplot (use plt.boxplot()) of each feature in the training data.  **Make sure to rotate X-tick labels 45-degrees so they are readable.**
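
A minimal sketch of a boxplot with rotated tick labels, again on made-up data (`toy` and its column names are placeholders). Setting the tick labels after the call is one of several ways to do this.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical toy data: 50 rows, 3 features.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)),
                   columns=["feat_a", "feat_b", "feat_c"])

fig, ax = plt.subplots()
# A 2-D array produces one box per column, drawn at positions 1..n.
ax.boxplot(toy.to_numpy())
ax.set_xticks(range(1, toy.shape[1] + 1))
# rotation=45 tilts the labels; ha="right" keeps them anchored under their box.
ax.set_xticklabels(toy.columns, rotation=45, ha="right")
fig.tight_layout()
```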

In [None]:
# Insert code here

### (d)

Now let's see how well we can separate classes from each pair of features.  In the cell below produce a scatterplot of **every pair of features** in the training data.  There will be 6 scatterplots in all.  Make sure to follow these instructions:
  * Color each marker red, green, or blue depending on the true class label
  * Use numpy.corrcoef to compute the correlation coefficient for each pair of features
  * Title each plot with the correlation coefficient
  * Label each axis using the corresponding feature name
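
The steps above can be sketched as follows on a made-up three-feature dataset, which gives 3 pairs (four features would give the 6 pairs mentioned above). The names `toy`, `labels`, and `palette` are assumptions for illustration. Note that `numpy.corrcoef` returns a 2x2 matrix for two inputs, and the off-diagonal entry is the pairwise correlation.

```python
from itertools import combinations

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical toy data: 60 rows, 3 features, 3 classes of 20 rows each.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(60, 3)),
                   columns=["feat_a", "feat_b", "feat_c"])
labels = np.repeat([0, 1, 2], 20)
palette = {0: "r", 1: "g", 2: "b"}
colors = [palette[k] for k in labels]

# Every unordered pair of columns, without repeats.
pairs = list(combinations(toy.columns, 2))
fig, axes = plt.subplots(1, len(pairs), figsize=(4 * len(pairs), 3.5))
corrs = []
for ax, (f1, f2) in zip(axes, pairs):
    r = np.corrcoef(toy[f1], toy[f2])[0, 1]  # off-diagonal entry
    corrs.append(r)
    ax.scatter(toy[f1], toy[f2], c=colors)
    ax.set_xlabel(f1)
    ax.set_ylabel(f2)
    ax.set_title(f"r = {r:.2f}")
fig.tight_layout()
```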

In [None]:
# Insert code here


## Problem 2 : Train a logistic regression classifier

### (a)

Now we will look at finding the best feature out of all the features. To do this, you will perform cross validation of logistic regression. We will break this into subproblems to walk through it. In the cell below do the following:
* Using LogisticRegressionCV perform 5-fold cross validation to train on each feature
* For each run use Matplotlib errorbar() to plot the average +/- standard deviation of the cross-validation accuracy versus the regularization coefficient (the property LogisticRegressionCV.Cs_) -- there should be 4 plots in total
* Set plot X-label with the feature name, and Y-label "Accuracy"
* Title each plot with the maximum achieved accuracy score
* Report the best accuracy from cross-validation
* Finally, report the best performing feature and save it for later

Make sure to set the following properties in LogisticRegressionCV:
* cv=5
* max_iter=10000 (pass an integer; recent Scikit-Learn versions reject the float 1e4)
* random_state=0
* multi_class='multinomial'
* Cs=10

[Documentation - Scikit-Learn - LogisticRegressionCV](https://scikit-learn.org/0.16/modules/generated/sklearn.linear_model.LogisticRegressionCV.html)
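
To show where the per-fold scores live, here is a hedged sketch on synthetic data from `make_classification`, not the Iris data. It assumes a Scikit-Learn version in which `scores_` is a dict mapping each class label to an array of shape (n_folds, n_Cs), with the same array repeated for every class in the multinomial case. The sketch omits `multi_class` because recent releases deprecate it, and the default solver already fits a multinomial model for multiclass targets. Keep in mind that `Cs_` holds inverse regularization strengths, so larger values mean weaker regularization.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Hypothetical synthetic 3-class problem with 4 features.
X_toy, y_toy = make_classification(
    n_samples=150, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

model = LogisticRegressionCV(cv=5, Cs=10, max_iter=10000, random_state=0)
model.fit(X_toy[:, [0]], y_toy)  # train on a single feature (column 0)

# Assumed layout: dict of class label -> array of shape (n_folds, n_Cs).
fold_scores = next(iter(model.scores_.values()))
mean_acc = fold_scores.mean(axis=0)  # average over folds, one value per C
std_acc = fold_scores.std(axis=0)

fig, ax = plt.subplots()
ax.errorbar(model.Cs_, mean_acc, yerr=std_acc, capsize=3)
ax.set_xscale("log")
ax.set_xlabel("feature 0")
ax.set_ylabel("Accuracy")
ax.set_title(f"max accuracy = {mean_acc.max():.3f}")
```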

In [None]:
# Insert code here

### (b)

Now let's look at all pairs of features.  The cell below provides a function plotLogreg2feat() to visualize the learned classifier for a pair of features.  This function will draw the decision boundaries for each of the three classes, which will give us a better picture of what's going on.  In the cell below that do the following:
* Loop over every pair of features (there are 6 pairs total)
* Using LogisticRegressionCV perform 5-fold cross validation to train a classifier on the pair of features
* Make sure to use **the same cross validation options as the previous experiment**
* Using plotLogreg2feat plot the learned classifier
* Title each plot with the maximum average accuracy from cross validation
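
The idea behind a decision-region plot is to evaluate the fitted classifier on a dense grid of points and color each grid cell by its predicted class. A minimal sketch of that grid evaluation, assuming a plain `LogisticRegression` on synthetic blobs (all names here are placeholders, and no plotting is done):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Hypothetical 2-feature, 3-class toy problem.
X_toy, y_toy = make_blobs(n_samples=90, centers=3, n_features=2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Build a 40 x 50 grid covering the data range.
xs = np.linspace(X_toy[:, 0].min(), X_toy[:, 0].max(), 50)
ys = np.linspace(X_toy[:, 1].min(), X_toy[:, 1].max(), 40)
xx, yy = np.meshgrid(xs, ys)

# Flatten the grid to (n_points, 2), predict, then restore the grid shape.
grid = np.c_[xx.ravel(), yy.ravel()]
Z = clf.predict(grid).reshape(xx.shape)
```

An array like `Z` is what a call such as `pcolormesh(xx, yy, Z)` would turn into colored regions.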

In [None]:
def plotLogreg2feat(X, featname_1, featname_2, model):
    '''
    INPUTS:
      X - Input DataFrame (assumes Nx2 for N data points and 2 features)
      featname_1, featname_2 - String containing feature names
      model - Fitted LogisticRegressionCV model

    OUTPUTS:
      ax - Returns figure axis object
    '''

    # make grid
    x_min, x_max = X[featname_1].min() - 0.5*X[featname_1].std(), X[featname_1].max() + 0.5*X[featname_1].std()
    y_min, y_max = X[featname_2].min() - 0.5*X[featname_2].std(), X[featname_2].max() + 0.5*X[featname_2].std()
    h = 0.02  # step size in the mesh
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    #plt.figure(1, figsize=(4, 3))
    fig, ax = plt.subplots()
    ax.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

    # Plot also the training points
    #plt.scatter(X_train[features[i]][ Y_train == 0 ], X_train[features[j]][ Y_train == 0 ], c='r')
    #plt.scatter(X_train[features[i]][ Y_train == 1 ], X_train[features[j]][ Y_train == 1 ], c='g')
    #plt.scatter(X_train[features[i]][ Y_train == 2 ], X_train[features[j]][ Y_train == 2 ], c='b')

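    # NOTE: `y` is read from the notebook's global scope (the full label Series).
    # pandas aligns the boolean mask to X's index, so X may be a subset of rows.
    ax.scatter(X[featname_1][ y == 0 ], X[featname_2][ y == 0 ], c='r')
    ax.scatter(X[featname_1][ y == 1 ], X[featname_2][ y == 1 ], c='g')
    ax.scatter(X[featname_1][ y == 2 ], X[featname_2][ y == 2 ], c='b')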

    ax.set_xlabel(featname_1)
    ax.set_ylabel(featname_2)


    #plt.xlim(xx.min(), xx.max())
    #plt.ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    return ax

In [None]:
# Insert code here

### (c)

Surprisingly, adding pairs of features doesn't seem to improve things.  Let's try training on all features.  In the cell below:
* Perform 5-fold cross validation (using all the same parameters as before) to train a logistic regression classifier on all features
* Report the maximum of the average scores.

In [None]:
# Insert code here

If your results are the same as mine, the maximum score over all features is the same as over the best single feature.

## Problem 3 : Support Vector Machine
We have trained several logistic regression classifiers, all of which achieve a cross-validation accuracy well into the 90% range.  In an effort to see if we can do better, let's train one last classifier--a support vector machine.  For this classifier we will introduce a nonlinear transformation using the Radial Basis kernel function.  In the cell below do the following:
* Using numpy.logspace create a logarithmically spaced set of 100 regularization coefficients in the range (1e-4, 1e4)
* For each coefficient define a support vector classifier with kernel='rbf' and set the regularization coefficient (C=coefficient)
* Perform 5-fold cross validation (e.g. using cross_val_score)
* Plot the average accuracy versus regularization coefficient and report the maximum accuracy and best coefficient
* Make sure to set the plot X-scale to 'log' and label axes and title

[ Documentation - Scikit-Learn - svm.SVC ](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html?highlight=svc#sklearn.svm.SVC.score)
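
A sketch of the sweep described above, on synthetic data and with only 9 coefficients to keep it quick (the assignment asks for 100). One detail that is easy to miss: `numpy.logspace` takes exponents, so the range 1e-4 to 1e4 is written `np.logspace(-4, 4, num)`. The names `X_toy`, `y_toy`, and `coefs` are placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical synthetic 3-class problem with 4 features.
X_toy, y_toy = make_classification(
    n_samples=150, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Exponents -4..4 give values from 1e-4 to 1e4, evenly spaced in log scale.
coefs = np.logspace(-4, 4, 9)

# One 5-fold cross-validation run per coefficient; keep the mean accuracy.
mean_acc = np.array([
    cross_val_score(SVC(kernel="rbf", C=c), X_toy, y_toy, cv=5).mean()
    for c in coefs
])
best_idx = int(mean_acc.argmax())
print(f"best C = {coefs[best_idx]:g}, accuracy = {mean_acc[best_idx]:.3f}")

fig, ax = plt.subplots()
ax.plot(coefs, mean_acc, marker="o")
ax.set_xscale("log")
ax.set_xlabel("C")
ax.set_ylabel("Mean CV accuracy")
ax.set_title("RBF SVM on toy data")
```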

In [None]:
# Insert code here

My results show slightly higher accuracy under the SVM classifier.  However, cross_val_score does not use the same cross validation splits as the built-in cross validation of LogisticRegressionCV (which randomizes splits).  So we shall see...

## Problem 4 : Evaluate on test
Now we will evaluate all classifiers on the test data.  Take the best logistic regression classifier and the best SVM classifier (with previously chosen parameters), train them on the training data, and evaluate them on the test data.  For each classifier report:
  * Test accuracy
  * Confusion matrix
  * Results of classification_report
  
We have only left a single cell below.  Feel free to insert additional cells and arrange output as you see fit.  Make sure it is readable.  Feel free to explore any additional visualizations or metrics.
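
The three reports can be produced with functions from `sklearn.metrics`, which is not imported in the setup cell at the top of this notebook. The sketch below uses hand-written labels (`y_true` and `y_pred` are made up for illustration) so the numbers are easy to check by eye: rows of the confusion matrix are true classes and columns are predicted classes.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels: one class-1 sample is misclassified as class 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

acc = accuracy_score(y_true, y_pred)   # fraction of correct predictions
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
report = classification_report(y_true, y_pred)

print(f"Test accuracy: {acc:.3f}")
print(cm)
print(report)
```

With a fitted classifier, `y_pred` would come from its `predict` method applied to the held-out test inputs.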

In [None]:
# Insert code here