# Project 3: Multiclass and Linear Models

UIC CS 412, Spring 2018

_If you have discussed this assignment with anyone, please state their name(s) here: [NAMES]. Keep in mind the expectations set in the Academic Honesty part of the syllabus._

There are two parts to this project. The first is on multiclass reductions. The second is on linear models and gradient descent. There is also a third part which gives you an opportunity for extra credit.

This assignment is adapted from the github materials for [A Course in Machine Learning](https://github.com/hal3/ciml).

## Due Date

This assignment is due at 11:59pm Thursday, March 15th. 

### Files You'll Edit

``multiclass.py``: The multiclass classification implementation you need to complete.

``gd.py``: The gradient descent file you need to edit.

``quizbowl.py``: Multiclass evaluation of the quiz bowl dataset (optional).

``predictions.txt``: This file is automatically generated as part of Part 3 (optional).

### Files you might want to look at
  
``binary.py``: Our generic interface for binary classifiers (actually
works for regression and other types of classification, too).

``datasets.py``: Where a handful of test data sets are stored.

``util.py``: A handful of useful utility functions: these will
undoubtedly be helpful to you, so take a look!

``runClassifier.py``: A few wrappers for doing useful things with
classifiers, like training them, generating learning curves, etc.

``mlGraphics.py``: A few useful plotting commands.

``data/*``: All of the datasets we'll use.

### What to Submit

You will hand in all of the Python files listed above as a single zip file **h3.zip** on Gradescope under *Homework 3*.  The programming part constitutes 60% of the grade for this homework. You also need to answer the questions marked **WU#** (and a kitten) in this notebook, which make up the other 40% of your homework grade. When you are done, export **hw3.ipynb** with your answers as a PDF file **hw3WrittenPart.pdf**, upload the PDF file to Gradescope under *Homework 3 - Written Part*, and tag each question on Gradescope. 

Your entire homework will be considered late if any of these parts are submitted late. 

#### Autograding

Your code will be autograded for technical correctness. Please **do
not** change the names of any provided functions or classes within the
code, or you will wreak havoc on the autograder. We have provided two simple test cases that you can try your code on; see ``run_tests_simple.py``. As usual, you should create more test cases to make sure your code runs correctly.

# Part 1: Multiclass Classification *[30% impl, 20% writeup]*

In this section, you will explore the differences between three
multiclass-to-binary reductions: one-versus-all (OVA), all-versus-all
(AVA), and a tree-based reduction (TREE).  The evaluation will be on different datasets from 
`datasets.py`.

The classification task we'll work with is wine classification. The dataset was downloaded from allwines.com. Your job is to predict the type of wine, given its description. There are two tasks: `WineData` has 20 different wines, and `WineDataSmall` is just the first five of those (sorted roughly by frequency). You can find the names of the wines in `WineData.labels` as well as in the file `wines.names`.
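
To get a feel for how the three reductions scale with the number of labels, here is a small sketch that counts how many binary problems each one trains. The function name `num_binary_classifiers` is made up for this illustration and is not part of `multiclass.py`; the counts assume the usual definitions (one classifier per class for OVA, one per unordered pair for AVA, one per internal node of a binary tree for TREE).

```python
from itertools import combinations

def num_binary_classifiers(num_classes):
    """Count the binary problems each reduction trains for num_classes labels."""
    ova = num_classes                                      # one "class k vs. rest" problem per class
    ava = len(list(combinations(range(num_classes), 2)))   # one problem per unordered pair of classes
    tree = num_classes - 1                                 # one problem per internal node of a binary tree
    return {"OVA": ova, "AVA": ava, "TREE": tree}

# The two wine tasks have 5 and 20 labels.
for k in (5, 20):
    print(k, num_binary_classifiers(k))
```

AVA trains many more classifiers than the other two, but each one only sees the examples from its two classes.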

To start out, let's import everything and train decision "stumps" (aka depth=1 decision trees) on the large data set:

In [None]:
from sklearn.tree import DecisionTreeClassifier
import multiclass
import util
from datasets import *
import importlib

h = multiclass.OVA(20, lambda: DecisionTreeClassifier(max_depth=1))
h.train(WineData.X, WineData.Y)
P = h.predictAll(WineData.Xte)
mean(P == WineData.Yte)
# 0.29499072356215211

That means 29% accuracy on this task. The most frequent class is:

In [None]:
print(mode(WineData.Y))
# 1
print(WineData.labels[1])
# Cabernet-Sauvignon

And if you were to always predict label 1, you would get the following accuracy:

In [None]:
mean(WineData.Yte == 1)
# 0.17254174397031541

So we're doing a bit better than that (about 12 percentage points) using decision stumps. 
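
The majority-class baseline used above can be computed for any label vector. The sketch below uses plain NumPy on toy labels invented for this example; `majority_baseline_accuracy` is a hypothetical helper, not a function from `util.py`.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    """Predict the most frequent training label for every test example."""
    labels, counts = np.unique(y_train, return_counts=True)
    majority = labels[np.argmax(counts)]
    return majority, float(np.mean(y_test == majority))

# Toy labels, made up for illustration only.
y_train = np.array([1, 1, 0, 2, 1, 0])
y_test = np.array([1, 0, 1, 2])
majority, acc = majority_baseline_accuracy(y_train, y_test)
print(majority, acc)
```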

The default implementation of OVA uses decision tree confidence (probability of prediction) to weight the votes. You can switch to zero/one predictions to see the effect:

In [None]:
P = h.predictAll(WineData.Xte, useZeroOne=True)
mean(P == WineData.Yte)
# 0.19109461966604824

As you can see, this is markedly worse.
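
One plausible reason for the gap is that zero/one votes produce many ties. The sketch below is only an illustration under assumed conventions (a signed confidence per "class k vs. rest" classifier, positive meaning "is class k"); the scores are invented, and the actual voting code in `multiclass.py` may differ in its details.

```python
import numpy as np

# Hypothetical signed confidences from three "class k vs. rest" classifiers
# for a single test example.
scores = np.array([0.2, 0.9, -0.4])

# Zero/one voting: each classifier casts a full vote or none at all.
zero_one_votes = (scores > 0).astype(float)      # classes 0 and 1 tie with one vote each
zero_one_pred = int(np.argmax(zero_one_votes))   # argmax resolves the tie in favor of class 0

# Confidence voting: each vote is scaled by how sure the classifier is.
confidence_pred = int(np.argmax(scores))         # class 1 has the largest score

print(zero_one_pred, confidence_pred)
```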

Switching to the smaller data set for a minute, we can train, say, depth 3 decision trees:

In [None]:
h = multiclass.OVA(5, lambda: DecisionTreeClassifier(max_depth=3))
h.train(WineDataSmall.X, WineDataSmall.Y)
P = h.predictAll(WineDataSmall.Xte)
print(mean(P == WineDataSmall.Yte))
# 0.590809628009
print(mean(WineDataSmall.Yte == 1))
# 0.407002188184

So using depth 3 trees we get an accuracy of about 60% (this number varies a bit), versus a baseline of 41%. That's not too terrible, but not great.

We can look at what this classifier is doing.

In [None]:
print(WineDataSmall.labels[0])
#'Sauvignon-Blanc'
util.showTree(h.f[0], WineDataSmall.words)
#citrus?
#-N-> lime?
#|    -N-> gooseberry?
#|    |    -N-> class 0	(356.0 for class 0, 10.0 for class 1)
#|    |    -Y-> class 1	(0.0 for class 0, 4.0 for class 1)
#|    -Y-> apple?
#|    |    -N-> class 1	(1.0 for class 0, 15.0 for class 1)
#|    |    -Y-> class 0	(2.0 for class 0, 0.0 for class 1)
#-Y-> grapefruit?
#|    -N-> flavors?
#|    |    -N-> class 1	(4.0 for class 0, 12.0 for class 1)
#|    |    -Y-> class 0	(11.0 for class 0, 5.0 for class 1)
#|    -Y-> opens?
#|    |    -N-> class 1	(0.0 for class 0, 14.0 for class 1)
#|    |    -Y-> class 0	(1.0 for class 0, 0.0 for class 1)

This should show the tree that's associated with predicting label 0 (which is stored in h.f[0]). The 1s mean "likely to be Sauvignon-Blanc" and the 0s mean "likely not to be".

Now, go in and complete the AVA implementation in `multiclass.py`. You should be able to train an AVA model on the small data set by:

In [None]:
h = multiclass.AVA(5, lambda: DecisionTreeClassifier(max_depth=3))
h.train(WineDataSmall.X, WineDataSmall.Y)
P = h.predictAll(WineDataSmall.Xte)

Next, you must implement a 
tree-based reduction in `multiclass.py`. Most of `train` is given to you, but you
must write `predict` entirely on your own. There is a tree class to help you:

In [None]:
t = multiclass.makeBalancedTree(range(5))
print(t)
# [[0 1] [2 [3 4]]]
print(t.isLeaf)
# False
print(t.getRight())
# [2 [3 4]]
print(t.getRight().getLeft())
# 2
print(t.getRight().getLeft().isLeaf)
# True
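
To make the shape of a balanced label tree concrete, here is a minimal sketch that uses nested tuples instead of the tree class in `multiclass.py`. All three helpers (`make_balanced`, `leaves`, `walk`) are hypothetical names for this illustration, and the `go_right` callback is a stand-in for whatever a trained binary classifier at each node would decide.

```python
def make_balanced(labels):
    """Recursively split a list of labels in half; a single label is a leaf."""
    labels = list(labels)
    if len(labels) == 1:
        return labels[0]
    mid = len(labels) // 2
    return (make_balanced(labels[:mid]), make_balanced(labels[mid:]))

def leaves(node):
    """All labels underneath a node."""
    if not isinstance(node, tuple):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)

def walk(node, go_right):
    """Follow binary decisions from the root until a single label is reached."""
    while isinstance(node, tuple):
        left, right = node
        node = right if go_right(leaves(left), leaves(right)) else left
    return node

tree = make_balanced(range(5))
print(tree)
# Stand-in for trained classifiers: always pick the side that contains label 3.
print(walk(tree, lambda left, right: 3 in right))
```

Each internal node separates the labels in its left subtree from those in its right subtree, so reaching a label takes one binary decision per level.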

You should be able to train an MCTree model by:

In [None]:
h = multiclass.MCTree(t, lambda: DecisionTreeClassifier(max_depth=3))
h.train(WineDataSmall.X, WineDataSmall.Y)
P = h.predictAll(WineDataSmall.Xte)

<img src="data/kitten.jpeg" width="100px" align="left" float="left"/>
<br><br><br>
## WU1 (10%):
Answer A, B, C for both OVA and AVA.

(A) What words are most indicative of being Sauvignon-Blanc? Which words are most indicative of not being Sauvignon-Blanc? What about Pinot-Noir (label==2)?

(B) Train depth 3 decision trees on the full WineData task (with 20 labels). What accuracy do you get? How long does this take (in seconds)? One of my least favorite wines is Viognier -- what words are indicative of this?

(C) Compare the accuracy using zero-one predictions versus using confidence. How much difference does it make?

In [None]:
# WU1 CODE HERE

[WU1 ANSWER HERE]

<img src="data/kitten.jpeg" width="100px" align="left" float="left"/>
<br><br><br>
## WU2 (10%):
Using decision trees of constant depth for each
classifier (but you choose it as well as you can!), train AVA, OVA and
Tree (using balanced trees) for the wine data. Which does best and why?

In [None]:
# WU2 CODE HERE

[WU2 ANSWER HERE]

<img src="data/kitten.jpeg" width="100px" align="left" float="left"/>
<br><br><br>
## WU-EC1 ExtraCredit (10%):
Build a better tree (any way you want) than the balanced binary tree.
Fill in your code for this in
`getMyTreeForWine`, which defaults to a balanced tree. To earn the extra
credit, its error should be at least 5 percentage points (absolute) lower.
Describe what you did.

[YOUR WU-EC1 ANSWER HERE]

# Part 2: Gradient Descent and Linear Classification *[30% impl, 20% writeup]*

To get started with linear models, we will implement a generic
gradient descent method.  This should go in `gd.py`, which
contains a single (short) function: `gd`. This takes five
parameters: the function we're optimizing, its gradient, an initial
position, a number of iterations to run, and an initial step size.

In each iteration of gradient descent, we will compute the gradient
and take a step in the opposite direction (downhill), with step size `eta`.  We
will have an *adaptive* step size, where `eta` is computed
as `stepSize` divided by the square root of the iteration
number (counting from one).
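
The update rule described above can be sketched in a few lines. This is only an illustration of the step-size schedule, not the reference solution for `gd.py`: the name `gd_sketch` and the choice to record the objective value before the first step and after every step are assumptions.

```python
import numpy as np

def gd_sketch(func, grad, x0, num_iter, step_size):
    """Gradient descent with eta = step_size / sqrt(t), t counted from one."""
    x = x0
    trajectory = [func(x)]                 # objective value at the starting point
    for t in range(1, num_iter + 1):
        eta = step_size / np.sqrt(t)       # step size shrinks as iterations go on
        x = x - eta * grad(x)              # move against the gradient to decrease func
        trajectory.append(func(x))
    return x, np.array(trajectory)

x, trajectory = gd_sketch(lambda x: x**2, lambda x: 2*x, 10, 10, 0.2)
print(x, trajectory[:3])
```

With a starting point of 10 and a step size of 0.2, the first step moves to 10 - 0.2 * 20 = 6, so the objective drops from 100 to 36.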

Once you have an implementation running, we can check it on a simple
example of minimizing the function `x^2`:

In [None]:
import gd

gd.gd(lambda x: x**2, lambda x: 2*x, 10, 10, 0.2)
#(1.0034641051795872, array([ 100.        ,   36.        ,   18.5153247 ,   10.95094653,
#          7.00860578,    4.72540613,    3.30810578,    2.38344246,
#          1.75697198,    1.31968118,    1.00694021]))

You can see that the "solution" found is about 1, which is not great
(it should be zero!), but it's better than the initial value of ten!
If yours is going up rather than going down, you probably have a sign
error somewhere!

We can let it run longer and plot the trajectory:

In [None]:
x, trajectory = gd.gd(lambda x: x**2, lambda x: 2*x, 10, 100, 0.2)
print(x)
# 0.003645900464603937
plot(trajectory)
show(False)

It's now found a value close to zero and you can see that the
objective is decreasing by looking at the plot.

<img src="data/kitten.jpeg" width="100px" align="left" float="left"/>
<br><br><br>
## WU3 (5%):
Find a few values of step size where it converges and
a few values where it diverges.  Where does the threshold seem to
be?

[Your WU3 answer here]

<img src="data/kitten.jpeg" width="100px" align="left" float="left"/>
<br><br><br>
## WU4 (10%):
Come up with a *non-convex* univariate
optimization problem.  Plot the function you're trying to minimize and
show two runs of `gd`, one where it gets caught in a local
minimum and one where it manages to make it to a global minimum.  (Use
different starting points to accomplish this.)

If you implemented it well, this should work in multiple dimensions,
too:

In [None]:
x, trajectory = gd.gd(lambda x: linalg.norm(x)**2, lambda x: 2*x, array([10,5]), 100, 0.2)
print(x)
# array([ 0.0036459 ,  0.00182295])
plot(trajectory)

Our generic linear classifier implementation is
in `linear.py`.  It works as follows.  We have an
interface `LossFunction` that we want to minimize.  It must
be able to compute the loss for a pair `Y` and `Yhat`,
where the former is the truth and the latter are the predictions.  It
must also be able to compute a gradient when additionally given the
data `X`.  This should be all you need for these.
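
When you write a new loss, a finite-difference check is a cheap way to catch gradient bugs. The sketch below assumes a squared loss of the form 0.5 * ||Y - Yhat||^2 with Yhat = Xw; the class `SquaredLossSketch`, its method names, and `numeric_gradient` are hypothetical and need not match the interface in `linear.py`.

```python
import numpy as np

class SquaredLossSketch:
    """Hypothetical stand-in for a loss object: half the sum of squared residuals."""
    def loss(self, Y, Yhat):
        return 0.5 * np.sum((Y - Yhat) ** 2)

    def loss_gradient(self, X, Y, Yhat):
        # Gradient with respect to w of 0.5 * ||Y - Xw||^2, where Yhat = Xw.
        return -X.T @ (Y - Yhat)

def numeric_gradient(loss_fn, X, Y, w, eps=1e-6):
    """Central finite differences, one weight at a time."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (loss_fn.loss(Y, X @ (w + step)) - loss_fn.loss(Y, X @ (w - step))) / (2 * eps)
    return g

# Random toy data, made up for the check.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
Y = rng.choice([-1.0, 1.0], size=8)
w = rng.normal(size=3)

sq = SquaredLossSketch()
analytic = sq.loss_gradient(X, Y, X @ w)
numeric = numeric_gradient(sq, X, Y, w)
print(np.max(np.abs(analytic - numeric)))
```

If the two gradients disagree by more than a small tolerance, the analytic gradient (or its sign) is the first place to look.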

There are three loss function stubs: `SquaredLoss` (which is
implemented for you!), `LogisticLoss` and `HingeLoss`
(both of which you'll have to implement).  My suggestion is to hold off
implementing the other two until you have the linear classifier
working.

The `LinearClassifier` class is a stub implementation of a
generic linear classifier with an l2 regularizer.  It
is *unbiased* (it has no bias term), so all you have to take care of are the weights.
Your implementation should go in `train`, which has a handful
of stubs.  The idea is to just pass appropriate functions
to `gd` and have it do all the work.  See the comments inline
in the code for more information.
 
Once you've implemented the function evaluation and gradient, we can
test this.  We'll begin with a very simple 2D example data set so that
we can plot the solutions.  We'll also start with *no
regularizer* to help you figure out where errors might be if you
have them.  (You'll have to import `mlGraphics` to make this
work.)

In [None]:
import datasets, linear, mlGraphics, runClassifier

f = linear.LinearClassifier({'lossFunction': linear.SquaredLoss(), 'lambda': 0, 'numIter': 100, 'stepSize': 0.5})
runClassifier.trainTestSet(f, datasets.TwoDAxisAligned)
# Training accuracy 0.91, test accuracy 0.86
print(f)
# w=array([ 2.73466371, -0.29563932])
mlGraphics.plotLinearClassifier(f, datasets.TwoDAxisAligned.X, datasets.TwoDAxisAligned.Y)
show(False)

Note that even though this data is clearly linearly separable,
the *unbiased* classifier is unable to perfectly separate it.
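
A one-dimensional analogue shows why a missing bias term can matter. The data below is invented for this illustration and is not `TwoDAxisAligned`; the point is only that a decision boundary forced through the origin cannot realize a threshold that sits away from it.

```python
import numpy as np

# Toy 1-D data: separable by the threshold x = 2, but every x is positive.
X = np.array([1.0, 1.5, 2.5, 3.0])
Y = np.array([-1, -1, 1, 1])

def accuracy(w, b=0.0):
    """Accuracy of the classifier sign(w * x + b) on the toy data."""
    return float(np.mean(np.sign(w * X + b) == Y))

# Without a bias, sign(w * x) is the same for every positive x,
# so at most half of the points can be classified correctly.
best_unbiased = max(accuracy(w) for w in np.linspace(-5, 5, 101))

# With a bias, the threshold can move to x = 2.
with_bias = accuracy(1.0, b=-2.0)
print(best_unbiased, with_bias)
```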

If we change the regularizer, we'll get a slightly different
solution:

In [None]:
f = linear.LinearClassifier({'lossFunction': linear.SquaredLoss(), 'lambda': 10, 'numIter': 100, 'stepSize': 0.5})
runClassifier.trainTestSet(f, datasets.TwoDAxisAligned)
# Training accuracy 0.9, test accuracy 0.86
print(f)
# w=array([ 1.30221546, -0.06764756])

As expected, the weights are *smaller*.
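
The shrinking effect can be reproduced in closed form for squared loss. The sketch below assumes the objective 0.5 * ||Y - Xw||^2 + (lambda / 2) * ||w||^2, which may be scaled differently from the one in `linear.py`, and it solves for the minimizer directly instead of running `gd`. The data is random and made up for the example; `ridge_weights` is a hypothetical helper.

```python
import numpy as np

def ridge_weights(X, Y, lam):
    """Minimizer of 0.5 * ||Y - Xw||^2 + (lam / 2) * ||w||^2 (assumed objective)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Random toy data: the label mostly follows the sign of the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Y = np.sign(X[:, 0] + 0.1 * rng.normal(size=50))

w_no_reg = ridge_weights(X, Y, 0.0)
w_reg = ridge_weights(X, Y, 10.0)
print(np.linalg.norm(w_no_reg), np.linalg.norm(w_reg))
```

A larger `lam` pulls the solution toward zero, which is the same qualitative behavior as in the two runs above.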

Now, we can try different loss functions.  Implement logistic loss and
hinge loss.  Here are some simple test cases:

In [None]:
f = linear.LinearClassifier({'lossFunction': linear.LogisticLoss(), 'lambda': 10, 'numIter': 100, 'stepSize': 0.5})
runClassifier.trainTestSet(f, datasets.TwoDDiagonal)
# Training accuracy 0.99, test accuracy 0.86
print(f)
# w=array([ 0.29809083,  1.01287561])

f = linear.LinearClassifier({'lossFunction': linear.HingeLoss(), 'lambda': 1, 'numIter': 100, 'stepSize': 0.5})
runClassifier.trainTestSet(f, datasets.TwoDDiagonal)
# Training accuracy 0.98, test accuracy 0.86
print(f)
# w=array([ 1.17110065,  4.67288657])
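
To compare how the three losses treat an example, it helps to evaluate them as a function of the margin `y * yhat`. The sketch below uses the textbook per-example definitions; the exact scaling and summation expected in `linear.py` is not shown here, so treat these formulas as assumptions and check the comments in the code. The margins are made up.

```python
import numpy as np

# Margins y * yhat for four hypothetical examples:
# badly wrong, on the boundary, just right, and confidently right.
margins = np.array([-2.0, 0.0, 1.0, 3.0])

squared = 0.5 * (1.0 - margins) ** 2      # equals 0.5 * (y - yhat)**2 when y is +1 or -1
logistic = np.log1p(np.exp(-margins))     # log(1 + exp(-y * yhat))
hinge = np.maximum(0.0, 1.0 - margins)    # max(0, 1 - y * yhat)

for name, values in (("squared", squared), ("logistic", logistic), ("hinge", hinge)):
    print(name, np.round(values, 4))
```

Note that squared loss also penalizes the confidently correct example (margin 3), while hinge loss is zero there and logistic loss is close to zero.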

<img src="data/kitten.jpeg" width="100px" align="left" float="left"/>
<br><br><br>
## WU5 (5%):
For each of the loss functions, train a model on the
binary version of the wine data (called WineDataBinary) and evaluate
it on the test data. You should use lambda=1 in all cases. Which works
best? For that best model, look at the learned weights. Find
the *words* corresponding to the weights with the greatest
positive value and those with the greatest negative value (this is
like LAB3). Hint: look at WineDataBinary.words to get the id-to-word
mapping. List the top 5 positive and top 5 negative and explain.
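
One way to pull out the extreme weights is `numpy.argsort`. The words and weights below are invented for illustration; in the assignment you would use `WineDataBinary.words` and the weight vector learned by your classifier (how that vector is exposed depends on your implementation).

```python
import numpy as np

# Toy vocabulary and weights, made up for this example.
words = np.array(["citrus", "tannins", "lime", "cherry", "oak", "crisp"])
weights = np.array([1.2, -0.9, 0.7, -1.5, -0.1, 0.4])

order = np.argsort(weights)              # indices from most negative to most positive
top_negative = words[order[:2]]          # two most negative weights
top_positive = words[order[::-1][:2]]    # two most positive weights
print(top_positive, top_negative)
```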

[Your WU5 answer here]

# Part 3: Classification with Many Classes *[0% -- up to 15% extra credit]*

Finally, we'll do multiclass classification using Scikit-learn functionality. You can find the documentation here: http://scikit-learn.org/stable/modules/multiclass.html.

Quiz bowl is a game in which two teams compete head-to-head to answer questions from different areas of knowledge. It lets players interrupt the reading of a question when they know the answer. The goal here is to see how well a classifier performs in predicting the `Answer` of a question when different portions of the question are revealed.

Here's an example question from the development data:

    206824,dev,History,Alan Turing,"This man and Donald Bayley created a secure voice communications machine called ""Delilah"". ||| The Chinese Room Experiment was developed by John Searle in response to one of this man's namesake tests. ||| He showed that the halting problem was undecidable. ||| He devised a bombe with Gordon Welchman that found the settings of an Enigma machine. ||| One of this man's eponymous machines which can perform any computing task is his namesake ""complete."" Name this man, whose eponymous test is used to determine if a machine can exhibit behavior indistinguishable from that of a human." 

The more of the question you get, the easier the problem becomes.

The default code below just runs OVA and AVA on top of a linear SVM (it might take a few seconds):

In [None]:
import sklearn.metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from numpy import *
import datasets
import importlib

importlib.reload(datasets)

if not datasets.Quizbowl.loaded:
    datasets.loadQuizbowl()

print('\n\nRUNNING ON EASY DATA\n')
    
print('training ova')
X = datasets.QuizbowlSmall.X
Y = datasets.QuizbowlSmall.Y
ova = OneVsOneClassifier(LinearSVC(random_state=0)).fit(X, Y)
print('predicting ova')
ovaDevPred = ova.predict(datasets.QuizbowlSmall.Xde)
print('error = {0}'.format(mean(ovaDevPred != datasets.QuizbowlSmall.Yde)))

print('training ava')
ava = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, Y)
print('predicting ava')
avaDevPred = ava.predict(datasets.QuizbowlSmall.Xde)
print('error = {0}'.format(mean(avaDevPred != datasets.QuizbowlSmall.Yde)))

print('\n\nRUNNING ON HARD DATA\n')
    
print('training ova')
X = datasets.QuizbowlHardSmall.X
Y = datasets.QuizbowlHardSmall.Y
ova = OneVsOneClassifier(LinearSVC(random_state=0)).fit(X, Y)
print('predicting ova')
ovaDevPred = ova.predict(datasets.QuizbowlHardSmall.Xde)
print('error = {0}'.format(mean(ovaDevPred != datasets.QuizbowlHardSmall.Yde)))

print('training ava')
ava = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, Y)
print('predicting ava')
avaDevPred = ava.predict(datasets.QuizbowlHardSmall.Xde)
print('error = {0}'.format(mean(avaDevPred != datasets.QuizbowlHardSmall.Yde)))

savetxt('predictions.txt', avaDevPred)

When you run the code above, you should see some statistics of the loaded datasets and the following error rates on two of the datasets `QuizbowlSmall` and `QuizbowlHardSmall` using OVA and AVA:

```
RUNNING ON EASY DATA

training ova
predicting ova
error = 0.293413
training ava
predicting ava
error = 0.218563


RUNNING ON HARD DATA

training ova
predicting ova
error = 0.595808
training ava
predicting ava
error = 0.553892
```

This is running on a shrunken version of the data (that only contains answers that occur at least 20 times in the data).
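
The kind of filtering described here can be sketched with `collections.Counter`. This is not the code in `datasets.py`; the helper `keep_frequent` and the toy answer list are assumptions for illustration, and the threshold is lowered to 2 so the example stays small.

```python
from collections import Counter

def keep_frequent(answers, min_count):
    """Keep only the answers that occur at least min_count times."""
    counts = Counter(answers)
    return [a for a in answers if counts[a] >= min_count]

# Toy answer list, made up for this example.
answers = ["Alan Turing"] * 3 + ["Ada Lovelace"] * 2 + ["Kurt Godel"]
kept = keep_frequent(answers, 2)
print(kept)
```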

The first ("easy") version is when you get to see the entire question. The second ("hard") version is when you only get to use the first two sentences. It's clearly significantly harder to answer!
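
To see what the "hard" setting hides, the sketch below truncates a question to its first few sentences. It assumes, based on the example question above, that `|||` separates sentences in the raw text; how `datasets.py` actually builds the easy and hard features is not shown here, and `reveal` is a hypothetical helper.

```python
def reveal(question, num_sentences=None):
    """Return the first num_sentences '|||'-separated pieces of a question (all if None)."""
    sentences = [s.strip() for s in question.split("|||")]
    if num_sentences is not None:
        sentences = sentences[:num_sentences]
    return " ".join(sentences)

# First three sentences of the example question shown earlier.
question = ("This man and Donald Bayley created a secure voice communications machine "
            "called \"Delilah\". ||| The Chinese Room Experiment was developed by John Searle "
            "in response to one of this man's namesake tests. ||| He showed that the halting "
            "problem was undecidable.")

hard = reveal(question, 2)   # only the first two sentences
easy = reveal(question)      # the whole question
print(hard)
```

The well-known clue about the halting problem only appears in the full text, which is one reason the truncated version is harder to classify.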

Your task is to achieve the lowest possible error on the development set for `QuizbowlSmall` and `QuizbowlHardSmall`. You will get 5% extra credit for getting an error at least 1 percentage point (absolute) lower on *either* dataset than the errors presented above (21.86% for `QuizbowlSmall` and 55.39% for `QuizbowlHardSmall`). 

You're free to use the training data in any way you want, but you must include your code in `quizbowl.py`, submit your predictions file(s), and provide a writeup here that says what you did, in order to receive the extra credit. The last line of the script `quizbowl.py` saves predictions to a text file `predictions.txt`. You need to edit this line to rename the file to either `predictionsQuizbowlSmall.txt` or `predictionsQuizbowlHardSmall.txt`, depending on the dataset: that's what you upload for the EC. 

## WU-EC2 (5%):

[YOUR WU-EC2 WRITEUP HERE] 

Additionally, you can get extra credit for providing the lowest-error solution on the full versions of the easy and hard problems, `Quizbowl` and `QuizbowlHard`, in comparison to your classmates' solutions. There will be two separate (hidden) leaderboards, one for each of these two datasets. You will receive 5% if your solution is the best for the respective dataset (first place), 3% for second place and 1% for third. We will reveal the top three scores for each dataset after the submission period is over, and you are welcome to compete in both. Note that this problem is much harder due to the larger number of class labels. A simple majority label classifier has an error of 99.89%.

You're free to use the training data in any way you want, but you must include your code in `quizbowl.py`, submit your predictions file(s) (`predictionsQuizbowl.txt` and/or `predictionsQuizbowlHard.txt`), and a writeup here that says what you did, in order to receive the extra credit.

<img src="data/kitten.jpeg" width="100px" align="left" float="left"/>
<br><br><br>
## WU-EC3 (up to 10%):

[YOUR WU-EC3 WRITEUP HERE] 