In [1]:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression

Using the digits database and scikit-learn

1. Split your data into 30% test and 70% training sets, 
2. For each value $C = 10^{k}$ with $k \in \{-10,\ldots,10\}$, train an $L^{2}$ regularized logistic regression model with regularization weight $\lambda = \frac{1}{C}$ (this is the default form for scikit-learn) on the training set and compute the mean accuracy on the test set for each model. Which performed best?
3. Repeat #2 with $L^{1}$ regularization instead of $L^{2}$.  Do the results suggest any features that can be dropped from the data set?
4. Scikit-learn does not have logistic regression without regularization.  What values of $C$ are most similar to an un-regularized model?

In [2]:
digits = datasets.load_digits()
data = digits.data
n,m = data.shape
train_size = (7*n)//10
train_x = data[:train_size,:]
train_y = digits.target[:train_size]
test = data[train_size:,:]
test_act = digits.target[train_size:]
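
The split above takes the first 70% of the rows in their stored order. As a hedged alternative sketch, scikit-learn's `train_test_split` can shuffle and stratify the split. The seed `random_state=0` and the `alt_` variable names are assumptions made here for illustration; the accuracies reported in this notebook come from the positional split, not from this one.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Shuffled 70/30 split, stratified so each digit keeps its class proportion.
# random_state=0 is an arbitrary seed chosen only for reproducibility.
alt_train_x, alt_test_x, alt_train_y, alt_test_y = train_test_split(
    digits.data, digits.target, test_size=0.3,
    stratify=digits.target, random_state=0)

print(alt_train_x.shape, alt_test_x.shape)
```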

In [3]:
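# NOTE: each grid value c is passed to scikit-learn as C=1/c, so c plays the
# role of lambda = 1/C, and the value printed as C below is 1/C in
# scikit-learn's terms.
C = [10**k for k in range(-10,11,1)]
accuracies = []
for c in C :
    classifier = LogisticRegression(C=1/c)
    classifier.fit(train_x,train_y)
    res = classifier.predict(test)
    # Fraction of test digits predicted correctly
    acc = 1-np.count_nonzero(res-test_act)/len(res)
    accuracies.append(acc)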
accur = max(accuracies)
ind = accuracies.index(accur)
print('The best one was C={}, with an accuracy of {}.'.format(C[ind],accur))

The best one was C=10, with an accuracy of 0.9148148148148149.


In [4]:
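# NOTE: as in the L2 search, each grid value c is passed as C=1/c, so the
# value printed as C below is lambda = 1/C in scikit-learn's terms.
C = [10**k for k in range(-10,11,1)]
accuracies = []
coeffs = {}
for c in C :
    # The L1 penalty needs a solver that supports it; liblinear does
    classifier = LogisticRegression(penalty='l1',solver='liblinear',C=1/c)
    classifier.fit(train_x,train_y)
    res = classifier.predict(test)
    # Fraction of test digits predicted correctly
    acc = 1-np.count_nonzero(res-test_act)/len(res)
    accuracies.append(acc)
    # Keep the coefficients to see which pixels L1 zeroed out
    coeffs[c] = classifier.coef_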
accur = max(accuracies)
ind = accuracies.index(accur)
print('The best one was C={}, with an accuracy of {}.'.format(C[ind],accur))

The best one was C=10, with an accuracy of 0.9203703703703704.


In [5]:
best_c = 10
print(coeffs[best_c])

[[  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00  -5.16608656e-02   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   1.66854393e-02   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   3.21280956e-02   0.00000000e+00
    0.00000000e+00   2.01381443e-01   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00  -1.25203503e-01
   -4.86223370e-01   0.00000000e+00   5.96869991e-02   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00  -6.21714254e-02
   -3.39824533e-01   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   1.55788903e-01  -1.56775978e-01
   -1.05180496e-01   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00  

Yes. The $L^{1}$ coefficients suggest that certain pixels (from observation, mostly around the outside and a few near the middle) are unnecessary for determining the digit. This makes sense, as those are probably spots that are always background and never actually part of the digit.
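
To make this observation concrete, one could map the coefficients back onto the 8x8 pixel grid. The following is a minimal sketch, not the method used for the answer: the helper name `unused_pixel_mask` is hypothetical, and it treats a pixel as droppable only if its coefficient is exactly zero for every class, which is an assumption about what "unnecessary" should mean.

```python
import numpy as np

def unused_pixel_mask(coef, shape=(8, 8)):
    """Return a boolean grid that is True where a pixel's coefficient is zero for all classes."""
    coef = np.atleast_2d(coef)
    return np.all(coef == 0, axis=0).reshape(shape)

# Toy coefficients (2 classes, 64 pixels), NOT the fitted values from this notebook.
demo = np.zeros((2, 64))
demo[0, 5] = -0.05
demo[1, 21] = 0.2
mask = unused_pixel_mask(demo)
print(mask.sum(), "of", mask.size, "pixels have all-zero coefficients")
```

With the fitted models, `unused_pixel_mask(coeffs[best_c])` would give the corresponding grid for the digits data.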
The models closest to an un-regularized one are those with very large values of $C$, because then $\lambda = \frac{1}{C}$ is very small, meaning the penalty term hardly affects the minimization.
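
This reasoning can be written out. Assuming the binary case with labels $y_i \in \{-1, 1\}$, weights $w$, and intercept $b$ (notation introduced here for illustration, and up to how scikit-learn scales the terms), the $L^{2}$ penalized objective and its rescaled form are:

```latex
\min_{w,b}\; \frac{1}{2} w^{T} w + C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i (x_i^{T} w + b)}\right)
\;\Longleftrightarrow\;
\min_{w,b}\; \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i (x_i^{T} w + b)}\right) + \frac{\lambda}{2}\, w^{T} w,
\qquad \lambda = \frac{1}{C}.
```

Dividing the first objective by $C$ gives the second without changing the minimizer. As $C \to \infty$, $\lambda \to 0$ and only the plain logistic loss remains.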

Identify a classification problem related to your final project, using your project data.  

1. Apply $L^2$ regularized logistic regression to model this with an appropriate choice of $C$ (or lambda).  Discuss how (and why) you chose your specific value of $C$.  
2. Apply $L^1$ regularized logistic regression to model this with an appropriate choice of $C$.  Discuss how (and why) you chose your specific value of $C$.  
3. Identify which features of your data to include and which to discard for a good logistic regression model for your problem.  Compare which features are suggested for removal by $L^1$ regularization (from scikit-learn) versus using the methods we have used for linear regression, including p-values, BIC, and AIC (from statsmodels).  
Clearly identify your final preferred model, and explain why you chose that over the other contenders. 
What conclusions can be drawn from your results about the original classification question you asked?

In [14]:
import pandas as pd
import os
path = '../../../../Senior Project/DATA/'

train_parts = []
test_parts = []

# Walk through player files
for dir_path , dir_names , file_names in os.walk(path) :
    # 2017 will be our testing set; everything else becomes our training set
    parts = test_parts if '2017' in dir_path else train_parts
    for name in file_names :
        # Grab avgs file
        if name.endswith('avgs') :
            data = pd.read_csv(os.path.join(dir_path,name))
            # .values replaces DataFrame.as_matrix(), which newer pandas removed
            parts.append(data.drop(['Unnamed: 0'],axis=1).values)

# Stack the per-player arrays into one matrix per split
train = np.vstack(train_parts)
test = np.vstack(test_parts)

# From the way the data is saved, the last column is a score for how much of a
#     contributor the player was during the season.

# !!! NOTE !!! : This ranking is currently arbitrary, and as such has no current
#                meaning.
train_x = train[:,:-1]
train_y = train[:,-1]
test_x = test[:,:-1]
test_y = test[:,-1]

In [None]:
C = [10**k for k in range(-10,11,1)]
accuracies = []
for c in C :
    # Pass c directly so the C reported below is scikit-learn's C
    classifier = LogisticRegression(C=c)
    classifier.fit(train_x,train_y)
    # Use the project test split (test_x/test_y), not the digits variables
    res = classifier.predict(test_x)
    acc = 1-np.count_nonzero(res != test_y)/len(res)
    accuracies.append(acc)
accur = max(accuracies)
ind = accuracies.index(accur)
print('The best one was C={}, with an accuracy of {}.'.format(C[ind],accur))

In [None]:
C = [10**k for k in range(-10,11,1)]
accuracies = []
coeffs = {}
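for c in C :
    # The L1 penalty needs a solver that supports it; liblinear does.
    # Pass c directly so the C reported below is scikit-learn's C.
    classifier = LogisticRegression(penalty='l1',solver='liblinear',C=c)
    classifier.fit(train_x,train_y)
    # Use the project test split (test_x/test_y), not the digits variables
    res = classifier.predict(test_x)
    acc = 1-np.count_nonzero(res != test_y)/len(res)
    accuracies.append(acc)
    coeffs[c] = classifier.coef_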
accur = max(accuracies)
ind = accuracies.index(accur)
print('The best one was C={}, with an accuracy of {}.'.format(C[ind],accur))

In [None]:
best_c = C[ind]  # best C found by the L1 search in the previous cell
print(coeffs[best_c])

The code above (for finding $C$) is meant to be run multiple times: first as displayed, and then, once promising intervals of $C$ are revealed, again with different $C$ values from those intervals.  When the best $C$ has been found (ideally as consistent as possible between $L^{2}$ and $L^{1}$), the $L^{1}$ coefficients will then indicate which features are and are not useful.  
Since this ranking is my personal one and I haven't come up with a good way to score the players yet, the code has not been run; any computation on the data as it stands would be meaningless.
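
The coarse-to-fine search described above could be sketched as follows. The helper name `refine_grid` and its parameters `decades` and `num` are hypothetical, and the value 10 in the example is purely illustrative since no search has been run on the project data.

```python
import numpy as np

def refine_grid(best_c, decades=1.0, num=9):
    """Log-spaced candidates spanning `decades` powers of ten on each side of best_c."""
    center = np.log10(best_c)
    return np.logspace(center - decades, center + decades, num)

# Example: candidates around an illustrative coarse-search winner of C = 10.
grid = refine_grid(10.0)
print(grid)
```

Each refined grid can be fed back into the search loop in place of the original list `C`, repeating until the best value stops changing noticeably.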