Andrew Carr

## Imports

In [54]:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import keras

In [221]:
import matplotlib.pyplot as plt

### Problem 1

Run a multiclass (softmax) logistic regression on the scikit-learn digits dataset with the same train-test split we have used in the past. Experiment with different regularization parameters and choose the best. Justify your choice.

### Load the digits dataset

In [14]:
data = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, train_size=0.7)

In [26]:
# Default C=1.0 (C is the inverse of the regularization strength)
clf_standard = LogisticRegression(multi_class='multinomial', solver='lbfgs')
clf_standard.fit(x_train, y_train)


clf = LogisticRegression(C=0.1,multi_class='multinomial', solver='lbfgs')
clf.fit(x_train, y_train)


print("prediction accuracy standard {}".format(clf_standard.score(x_test, y_test)))
print("prediction accuracy regularized {}".format(clf.score(x_test, y_test)))

prediction accuracy standard 0.9666666666666667
prediction accuracy regularized 0.9703703703703703


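After some experimentation I found that a regularization parameter of C = 0.1 works best with the current setup. The gain over the default C = 1 is small (524 versus 522 correct out of 540 test digits), and the split is not seeded, so the ranking could change from run to run.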
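Because a single unseeded split separates the two models by only a couple of test digits, one way to make the choice of C less dependent on the split is to cross-validate over a small grid on the training portion. The following is a minimal sketch, not the procedure used above: the grid `candidate_Cs`, the fixed `random_state`, the pixel scaling, and the raised `max_iter` are all assumptions added for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

data = datasets.load_digits()
# Assumption: scale the 0-16 pixel intensities to [0, 1] so lbfgs converges more easily.
X = data.data / 16.0
y = data.target

# Assumption: a fixed seed so the comparison is repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

candidate_Cs = [0.01, 0.1, 1.0, 10.0]  # hypothetical grid
cv_scores = {}
for C in candidate_Cs:
    # With solver='lbfgs', scikit-learn 0.22 and later fit a multinomial
    # (softmax) model for multiclass targets by default.
    model = LogisticRegression(C=C, solver='lbfgs', max_iter=1000)
    # Mean accuracy over 5 folds of the training data only.
    cv_scores[C] = cross_val_score(model, x_train, y_train, cv=5).mean()

best_C = max(cv_scores, key=cv_scores.get)
for C in candidate_Cs:
    print("C={:<6} mean CV accuracy {:.4f}".format(C, cv_scores[C]))
print("best C by 5-fold CV: {}".format(best_C))

# The held-out test set is touched once, after C has been chosen.
final = LogisticRegression(C=best_C, solver='lbfgs', max_iter=1000)
final.fit(x_train, y_train)
test_accuracy = final.score(x_test, y_test)
print("test accuracy with best C: {:.4f}".format(test_accuracy))
```

Selecting C on cross-validation folds and scoring the test set only once keeps the reported test accuracy from being tuned to the test data.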

### Problem 2

Install Keras and tensorflow on your computer. For most of you this can be done in one line with `conda install keras`

Done

### Problem 3

Load the full MNIST dataset with Keras's pre-chosen train-test split using
`from keras.datasets import mnist`
`(X_train, y_train), (X_test, y_test) = mnist.load_data()`
and flatten each image into a single vector
`input_dim = 784 #28*28`
`X_train = X_train.reshape(60000, input_dim)`
`X_test = X_test.reshape(10000, input_dim)`
You may also need to convert the data to floats (they come as ints).

In [27]:
from keras.datasets import mnist

In [69]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
input_dim = 784
X_train = X_train.reshape(60000, input_dim).astype(np.float32)
X_test = X_test.reshape(10000, input_dim).astype(np.float32)
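The cell above casts the pixels to floats but leaves them in the 0-255 range. A common extra step, not part of the assignment, is to rescale them to [0, 1]. The sketch below illustrates the reshape and the rescaling on a small synthetic array standing in for MNIST (an assumption, so that it runs without downloading the dataset); the names `images`, `flat`, and `scaled` are hypothetical.

```python
import numpy as np

# Synthetic stand-in for a few MNIST images (uint8, 28x28).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

input_dim = 28 * 28
# Flatten each image into one row and convert to float32, as in the cell above.
flat = images.reshape(len(images), input_dim).astype(np.float32)
# Added step: map pixel values from [0, 255] to [0, 1].
scaled = flat / 255.0

print(flat.shape, scaled.dtype, scaled.min(), scaled.max())
```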

### Problem 4

Construct the multi-class matrix from y
`from keras.utils import np_utils
Y = np_utils.to_categorical(y, nb_classes)`
and build a softmax classifier
`from keras.models import Sequential 
from keras.layers import Dense, Activation
output_dim = 10 # number of classes
soft = Sequential()
soft.add(Dense(output_dim, input_dim=input_dim, activation='softmax'))
soft.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])`

In [70]:
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Activation

In [71]:
nb_classes = 10
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
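For reference, the multi-class matrix is a one-hot encoding: row i has a 1 in the column given by label i and 0 elsewhere. The sketch below reproduces that encoding in plain NumPy; `one_hot` is a hypothetical helper written for illustration, not a Keras function. In newer Keras releases the same functionality is exposed as `keras.utils.to_categorical`, and the `np_utils` import may not be available.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) float32 matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    encoded = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
    # Set column labels[i] of row i to 1.
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

y_example = np.array([3, 0, 9])
Y_example = one_hot(y_example, 10)
print(Y_example.shape)           # (3, 10)
print(Y_example.argmax(axis=1))  # [3 0 9]
```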

In [72]:
output_dim = 10
soft = Sequential()
soft.add(Dense(output_dim, input_dim=input_dim, activation='softmax'))
# The assignment specifies 'sgd'; 'adam' is used here because it performed significantly better.
soft.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
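To make explicit what this one-layer model computes, here is a minimal NumPy sketch of its forward pass and loss: an affine map followed by a softmax, scored with categorical cross-entropy. It is an illustration under assumptions, not Keras's implementation; the weights are set to zero and the inputs are random so that the expected loss, log(10), is easy to verify by hand.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability; the result is unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(Y_true, probs, eps=1e-7):
    # Mean over samples of -sum_k Y_true[k] * log(probs[k]).
    clipped = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.sum(Y_true * np.log(clipped), axis=1)))

input_dim, output_dim = 784, 10
rng = np.random.default_rng(0)
X_batch = rng.random((4, input_dim))    # stand-in inputs
W = np.zeros((input_dim, output_dim))   # Dense kernel, zero for illustration
b = np.zeros(output_dim)                # Dense bias

probs = softmax(X_batch @ W + b)
Y_batch = np.eye(output_dim)[[1, 7, 2, 5]]  # one-hot targets
loss = categorical_crossentropy(Y_batch, probs)

print(probs.sum(axis=1))         # each row sums to 1
print(loss, np.log(output_dim))  # with zero weights the loss equals log(10)
```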

### Problem 5

Experiment with various parameters, including different batch sizes and numbers of epochs to find the combination that gives the best results on the MNIST data set:
`soft.fit(X_train, Y_train, batch_size=128, epochs=20, verbose=1, validation_data=(X_test, Y_test))`

In [74]:
# With 60000 training points, I chose a larger batch size so that each batch is a more representative sample of the data.
soft.fit(X_train, Y_train, batch_size=2000, epochs=20, verbose=1, validation_data=(X_test, Y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x133d1e0b8>

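It's interesting that the accuracy isn't that great. I expect that running it for significantly longer would give much higher accuracy. The pixel values were also cast to floats but left on the 0-255 scale, so rescaling them to [0, 1] may help as well.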
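One factor worth checking is how many weight updates each setting actually performs, since a larger batch size means fewer updates per epoch. The short calculation below compares the batch size suggested in the problem statement (128) with the one used here (2000) over 20 epochs; `total_updates` is a hypothetical helper, and it assumes the final partial batch counts as one update.

```python
import math

n_train = 60000
epochs = 20

def total_updates(batch_size):
    # Batches per epoch (the last one may be partial) times the number of epochs.
    return math.ceil(n_train / batch_size) * epochs

for batch_size in (128, 2000):
    print(batch_size, total_updates(batch_size))
# 128 -> 9380 updates, 2000 -> 600 updates
```

With the same number of epochs, the larger batch size takes roughly 16 times fewer optimizer steps, which is consistent with the expectation that longer training would help.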

### Problem 6

Identify a multi-class classification problem related to your final project, using your project data. Use a softmax regression and choose an appropriate regularization parameter and appropriate choices of other hyperparameters and training parameters. Clearly identify your final preferred model, and explain why you chose that over the other contenders. What conclusions can be drawn from your results about the original classification question you asked?

Note: this is not my entire dataset, but a representative and useful sample.

In [93]:
df = pd.read_csv("platplus.csv")

In [198]:
y = df['Role_Code']

# we are trying to predict which role a champion fits into
li = ['Win Percent','Minions Killed', 'Total Healing', 'Team Jungle CS','Kills', 'Assists', 'Deaths', 'Damage Dealt', 'Damage Taken']
x = df[li]

In [199]:
train_x, test_x, train_y, test_y = train_test_split(x,y, train_size=0.7)

In [227]:
clf = LogisticRegression(C=(10**-1),multi_class='multinomial', solver='lbfgs')
clf.fit(train_x, train_y)

# Library defaults, which gave a one-vs-rest model in the scikit-learn version used for this run
clf_or = LogisticRegression(C=(10**-1))
clf_or.fit(train_x, train_y)

print("softmax accuracy {}".format(clf.score(test_x, test_y)))
print("one vs rest accuracy {}".format(clf_or.score(test_x, test_y)))

softmax accuracy 0.819672131147541
one vs rest accuracy 0.9344262295081968


The results here are interesting. The softmax classifier (~82% accuracy) performs far worse than the one-vs-rest method used in the previous homework, which again reaches ~93%. I believe this is because a single champion can perform multiple roles, so the hard, mutually exclusive boundary in softmax is a negative feature that results in worse accuracy. The two models were also fit with different solver settings on unscaled features, so part of the gap may come from optimization rather than from the model form.
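One way to separate the effect of the model form from the effect of optimization would be to fit both variants with the same solver on standardized features and compare cross-validated accuracy. The sketch below shows that setup on synthetic stand-in data, since `platplus.csv` is not reproduced here; the data generator, the five classes, the column scaling, and the use of `OneVsRestClassifier` are all assumptions for illustration, and its numbers say nothing about the champion data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 9 features and 5 classes (hypothetical).
X, y = make_classification(
    n_samples=200, n_features=9, n_informative=6, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
# Stretch the columns to mimic features on very different scales.
X = X * np.logspace(0, 4, num=9)

# Softmax (multinomial) model: the lbfgs default for multiclass targets
# in scikit-learn 0.22 and later.
softmax_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=0.1, solver='lbfgs', max_iter=1000),
)
# One-vs-rest: one binary logistic regression per class, same solver and C.
ovr_model = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression(C=0.1, solver='lbfgs', max_iter=1000)),
)

softmax_acc = cross_val_score(softmax_model, X, y, cv=5).mean()
ovr_acc = cross_val_score(ovr_model, X, y, cv=5).mean()
print("softmax mean CV accuracy {:.3f}".format(softmax_acc))
print("one vs rest mean CV accuracy {:.3f}".format(ovr_acc))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into the scaling.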