In [1]:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from keras.datasets import mnist
from keras.utils import np_utils
from keras.models import Sequential 
from keras.layers import Dense, Activation

Using TensorFlow backend.


### Problem 1

Run a multiclass (softmax) logistic regression on the scikit-learn digits dataset with the same train-test split we have used in the past. Experiment with different regularization parameters and choose the best. Justify your choice.

In [2]:
digits = datasets.load_digits()
data = digits.data
n,m = data.shape
train_size = (7*n)//10
train_x = data[:train_size,:]
train_y = digits.target[:train_size]
test = data[train_size:,:]
test_act = digits.target[train_size:]

In [3]:
# Regularization strengths lambda. scikit-learn's C is the inverse: C = 1/lambda.
lambdas = [10**k for k in range(-10,11,1)]
accuracies = []
for lam in lambdas :
    classifier = LogisticRegression(C=1/lam,multi_class='multinomial',solver='lbfgs')
    classifier.fit(train_x,train_y)
    res = classifier.predict(test)
    # Accuracy = 1 - (fraction of misclassified test digits)
    acc = 1-np.count_nonzero(res != test_act)/len(res)
    accuracies.append(acc)
accur = max(accuracies)
ind = accuracies.index(accur)
print('The best one was lambda={} (C={}), with an accuracy of {}.'.format(lambdas[ind],1/lambdas[ind],accur))

The best one was lambda=10 (C=0.1), with an accuracy of 0.924074074074074.
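
The sweep above scores each candidate on the test split, which tends to make the reported accuracy optimistic. One alternative is to pick the parameter by cross-validation on the training split alone. This is a minimal sketch, not what the notebook did: the candidate grid, the 3-fold setting, `max_iter=200`, and the names `cv_train_x`, `cv_train_y`, `candidate_C`, and `best_C` are all assumptions made for illustration. It also assumes a scikit-learn version in which the `lbfgs` solver fits a multinomial model by default.

```python
import warnings
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

warnings.filterwarnings("ignore")  # hide lbfgs convergence warnings in this sketch

# Same 70/30 split as the cell above, but only the training part is used here
digits = datasets.load_digits()
n = digits.data.shape[0]
train_size = (7 * n) // 10
cv_train_x = digits.data[:train_size]
cv_train_y = digits.target[:train_size]

# Hypothetical candidate grid for scikit-learn's C (inverse regularization strength)
candidate_C = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]

cv_scores = []
for c in candidate_C:
    model = LogisticRegression(C=c, solver='lbfgs', max_iter=200)
    # Mean accuracy over 3 stratified folds of the training split
    scores = cross_val_score(model, cv_train_x, cv_train_y, cv=3)
    cv_scores.append(scores.mean())

best_C = candidate_C[int(np.argmax(cv_scores))]
print('Best C by 3-fold cross-validation: {}'.format(best_C))
```

The test split can then be used once, at the end, to report the accuracy of the model refit with `best_C`.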


### Problem 2

Install Keras and TensorFlow on your computer. For most of you this can be done in one line with `conda install keras`.

### Problem 3

Load the full MNIST dataset with Keras's pre-chosen train-test split using
`from keras.datasets import mnist`
`(X_train, y_train), (X_test, y_test) = mnist.load_data()`
and flatten the images into a single vector
`input_dim = 784 #28*28`
`X_train = X_train.reshape(60000, input_dim)`
`X_test = X_test.reshape(10000, input_dim)`
You may also need to convert the data to floats (they come as ints).

In [65]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
input_dim = X_train.shape[1]*X_train.shape[2]
# Flatten each 28x28 image into one row and convert the integer pixels to floats
X_train = X_train.reshape(X_train.shape[0],input_dim).astype(float)
X_test = X_test.reshape(X_test.shape[0],input_dim).astype(float)
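
The flattening step can be checked on a small stand-in array without downloading MNIST. The sketch below uses random data (hypothetical, named `fake_images`) in place of the real images. It also shows an optional rescaling of the pixels to $[0, 1]$, a common preprocessing step that the cell above does not perform.

```python
import numpy as np

# Stand-in for MNIST images: 5 random 28x28 uint8 arrays (hypothetical data)
rng = np.random.RandomState(0)
fake_images = rng.randint(0, 256, size=(5, 28, 28)).astype(np.uint8)

# Flatten each image into one row; -1 lets NumPy infer 28*28 = 784
flat = fake_images.reshape(fake_images.shape[0], -1).astype('float32')

# Optional step not in the cell above: rescale pixels from [0, 255] to [0, 1]
flat_scaled = flat / 255.0

print(flat_scaled.shape)
```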

### Problem 4

Construct the multi-class matrix from y
`from keras.utils import np_utils
Y = np_utils.to_categorical(y, nb_classes)`
and build a softmax classifier
`from keras.models import Sequential 
from keras.layers import Dense, Activation
output_dim = 10 # number of classes
soft = Sequential()
soft.add(Dense(output_dim, input_dim=input_dim, activation='softmax'))
soft.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])`

In [78]:
output_dim = 10 # number of classes
Y = np_utils.to_categorical(y_train, output_dim)
Y_test = np_utils.to_categorical(y_test, output_dim)
soft = Sequential()
soft.add(Dense(output_dim, input_dim=input_dim, activation='softmax'))
soft.compile(optimizer='sgd', loss='categorical_crossentropy', 
              metrics=['accuracy'])
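
To make the pieces of this model concrete, here is a minimal NumPy sketch of what `to_categorical`, the softmax `Dense` layer, and the categorical cross-entropy loss compute. It is an illustration under assumptions, not Keras's implementation: the helper names `one_hot`, `softmax`, and `cross_entropy` and the toy data are hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def cross_entropy(probs, targets):
    """Mean categorical cross-entropy between predicted probs and one-hot targets."""
    return -np.mean(np.sum(targets * np.log(probs + 1e-12), axis=1))

# Hypothetical toy problem: 4 samples, 3 features, 3 classes
rng = np.random.RandomState(0)
X_toy = rng.randn(4, 3)
y_toy = np.array([0, 2, 1, 2])
W = np.zeros((3, 3))   # weights of the single dense layer
b = np.zeros(3)        # biases

probs = softmax(X_toy @ W + b)
loss = cross_entropy(probs, one_hot(y_toy, 3))
print(loss)
```

With all-zero weights every class receives probability $1/3$, so the loss equals $\log 3$. Training lowers the loss from that kind of uninformed starting point.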

### Problem 5

Experiment with various parameters, including different batch sizes and numbers of epochs to find the combination that gives the best results on the MNIST data set:
`soft.fit(X_train, Y_train, batch_size=128, epochs=20, verbose=1, validation_data=(X_test, Y_test))`

In [79]:
# Fiddle with these for best results
epoch = 30
batch = 3072 #2560
soft.fit(X_train,Y, batch_size=batch, epochs=epoch , verbose=1, validation_data=(X_test, Y_test))

Train on 60000 samples, validate on 10000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x149737518>
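
When comparing batch sizes, it helps to count how many gradient updates a run actually performs, since each batch triggers one update. The arithmetic below is a small sketch using the numbers from this cell and from the problem statement. It assumes the final, partial batch of each epoch counts as an update.

```python
import math

n_train = 60000      # MNIST training samples (from the log above)
batch_size = 3072    # batch size used in the cell above
epochs = 30

# One update per batch, including the last partial batch of each epoch
updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)

# For comparison: batch_size=128 and epochs=20 from the problem statement
updates_small_batch = math.ceil(n_train / 128) * 20
print(updates_small_batch)
```

A large batch with few epochs therefore takes far fewer optimizer steps than the suggested setting, which is worth keeping in mind when comparing validation accuracies.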

### Problem 6

Identify a multi-class classification problem related to your final project, using your project data. Use a softmax regression and choose an appropriate regularization parameter and appropriate choices of other hyperparameters and training parameters. Clearly identify your final preferred model, and explain why you chose that over the other contenders. What conclusions can be drawn from your results about the original classification question you asked?

I'm using the same classification question I asked in Logistic Regression 2, namely determining whether a player is going to play an important role in tournament games.  I classified players according to the average percentage of possessions they used per game during that year's tournament ($x$).  Those who averaged $x < 12\%$ received a rating of 0, $12\% \leq x < 16\%$ a 1, $16\% \leq x < 20\%$ a 2, $20\% \leq x < 24\%$ a 3, and $x \geq 24\%$ a 4.  I chose these cutoffs because they are similar to those used by Ken Pomeroy to distinguish crucial from noncrucial players.  The difference is that I'm focusing only on the tournament data for my classification.
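
The rating rule described above can be written with `np.digitize`. This is a sketch only: the saved project files already contain the rating, and the usage percentages and the names `usage_bins`, `usage_pct`, and `ratings` below are hypothetical.

```python
import numpy as np

# Class boundaries from the text, in percent of possessions used
usage_bins = [12, 16, 20, 24]

# Hypothetical usage percentages for five players
usage_pct = np.array([8.5, 12.0, 19.9, 20.0, 31.2])

# np.digitize returns i such that bins[i-1] <= x < bins[i],
# so values below 12 map to 0 and values of 24 or more map to 4
ratings = np.digitize(usage_pct, usage_bins)
print(ratings)
```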

In [2]:
import pandas as pd
import os
path = '../../../../Senior Project/DATA/'

train_blocks = []
test_blocks = []

# Walk through player files
for dir_path , dir_name , file_names in os.walk(path) :
    for name in file_names :
        # Grab avgs file
        if name[-4:] == 'avgs' :
            data = pd.read_csv(os.path.join(dir_path,name))
            # .values replaces as_matrix(), which was removed in pandas 1.0
            block = data.drop(['Unnamed: 0'],axis=1).values
            # 2017 will be our testing set; everything else is training
            if '2017' in dir_path :
                test_blocks.append(block)
            else :
                train_blocks.append(block)

# Stack the per-file arrays into one matrix per split
train = np.vstack(train_blocks)
test = np.vstack(test_blocks)

# From the way the data is saved, the last column is the player's
#     contribution score (the class label described above).
train_x = train[:,:-1]
train_y = train[:,-1]
test_x = test[:,:-1]
test_y = test[:,-1]

In [23]:
output_dim = 5 # number of classes
Y = np_utils.to_categorical(train_y, output_dim)
Y_test = np_utils.to_categorical(test_y, output_dim)
prj_soft = Sequential()
prj_soft.add(Dense(output_dim, input_dim=train_x.shape[1], activation='softmax'))
prj_soft.compile(optimizer='sgd', loss='categorical_crossentropy', 
              metrics=['accuracy'])
epoch = 40
batch = 2949 #max is 2949

In [24]:
prj_soft.fit(train_x,Y, batch_size=batch, epochs=epoch , verbose=1, validation_data=(test_x, Y_test))

Train on 2949 samples, validate on 768 samples
Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<keras.callbacks.History at 0x12004ee10>

The model that I've picked uses 40 epochs and batches of size 2949 (the entire training set).  I use the whole set as a single batch because training is still very fast (as shown), and I use 40 epochs because that's usually when the accuracy plateaus.  While it reached its maximum of 47% earlier than that this time, there were many runs in which it didn't reach that point until around the $35^{th}$ epoch.  
The sad part is that it still plateaus around 47% accuracy, which is not very good.  What this tells me is that coaches probably use many more factors than average season statistics to determine who plays.  If I am going to weight higher-contributing players differently from lower-contributing ones, I'm going to have to either return to a simpler classification (significant contributor or not) or take more variables of the data into account.  Either that, or I need many more years of data to work with.
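
One way to put the 47% figure in context is to compare it with the accuracy of always predicting the most common class. The sketch below shows that baseline on made-up labels. The function name `majority_baseline_accuracy` and the toy labels are hypothetical, and no baseline number for the project data is claimed here.

```python
import numpy as np

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy from always predicting the most frequent training class."""
    train_labels = np.asarray(train_labels).astype(int)
    test_labels = np.asarray(test_labels).astype(int)
    majority_class = np.bincount(train_labels).argmax()
    return np.mean(test_labels == majority_class)

# Hypothetical labels for illustration only; not the project data
toy_train = [0, 0, 1, 2, 0, 3, 4, 1]
toy_test = [0, 1, 0, 2]
print(majority_baseline_accuracy(toy_train, toy_test))
```

In the notebook, the same function could be called as `majority_baseline_accuracy(train_y, test_y)`. A softmax model that barely beats this number would be learning little beyond the class frequencies.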