# Lab 04

## Conrad Appel, Erik Gabrielsen, Danh Nguyen

### Preparation and Overview (30 points total)

[5 points] Explain the task and what business-case or use-case it is designed to solve (or designed to investigate). Detail exactly what the classification task is and what parties would be interested in the results.

[10 points] (mostly the same processes as from lab one) Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis. Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).

[15 points] Divide your data into training and testing data using an 80% training and 20% testing split. Use the cross validation modules that are part of scikit-learn. Argue for or against splitting your data using an 80/20 split. That is, why is the 80/20 split appropriate (or not) for your dataset?

### Modeling (50 points total)

[20 points] Create a custom, one-versus-all logistic regression classifier using numpy and scipy to optimize. Use object oriented conventions identical to scikit-learn. You should start with the template used in the course. You should add the following functionality to the logistic regression classifier:
- Ability to choose the optimization technique when the class is instantiated: either steepest descent, stochastic gradient descent, or Newton's method.
- Update the gradient calculation to include a customizable regularization term (either using no regularization, L1 regularization, L2 regularization, or both L1/L2 norm of the weights).
- Associate a cost with the regularization term, "C", that can be adjusted when the class is instantiated.
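
As a rough illustration of the regularization options listed above, the gradient of the penalty term could be computed as in the sketch below. The function name `penalty_gradient` and the `penalty` keyword are hypothetical and are not part of the course template. The sketch assumes the bias weight sits at index 0 and is left unpenalized, and it uses the sign of the weights as a subgradient for the L1 term.

```python
import numpy as np

def penalty_gradient(w, C, penalty="l2"):
    """Gradient of the penalty term with respect to w (hypothetical helper).

    Assumes w[0] is the bias weight, which is left unpenalized.
    """
    if penalty not in ("none", "l1", "l2", "both"):
        raise ValueError("penalty must be one of: none, l1, l2, both")
    grad = np.zeros_like(w, dtype=float)
    if penalty in ("l2", "both"):
        grad[1:] += 2.0 * C * w[1:]      # derivative of C * sum(w^2)
    if penalty in ("l1", "both"):
        grad[1:] += C * np.sign(w[1:])   # subgradient of C * sum(|w|)
    return grad

w = np.array([1.0, -2.0, 3.0])
print(penalty_gradient(w, C=0.5, penalty="both"))
```

In an update that ascends the log-likelihood, this quantity would be subtracted from the data gradient; in an update that descends the negative log-likelihood, it would be added.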

[15 points] Train your classifier to achieve good generalization performance. That is, adjust the optimization technique and the value of the regularization term "C" to achieve the best performance on your test set. Is your method of selecting parameters justified? That is, do you think there is any "data snooping" involved with this method of selecting parameters?

[15 points] Compare the performance of your "best" logistic regression optimization procedure to the procedure used in scikit-learn. Visualize the performance differences in terms of training time, training iterations, and memory usage while training. Discuss the results. 
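
One possible way to collect the training time and memory figures for such a comparison is sketched below, using only the standard library. The helper name `measure` is hypothetical, and `tracemalloc` reports only the allocations it is able to trace, so the peak value should be read as an approximation rather than total process memory.

```python
import time
import tracemalloc

def measure(fn, *args, **kwargs):
    """Run fn once; return (result, seconds elapsed, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()  # always stop tracing, even if fn raises
    return result, elapsed, peak

# Stand-in workload; in the lab this would be a call such as a classifier's fit
result, seconds, peak_bytes = measure(lambda n: sum([i * i for i in range(n)]), 100_000)
print(f"{seconds:.4f} s, peak {peak_bytes / 1024:.1f} KiB")
```

Wrapping each classifier's `fit` call in the same helper would give comparable numbers for the custom and scikit-learn implementations.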

### Deployment (10 points total)

Which implementation of logistic regression would you advise be used in a deployed machine learning model, your implementation or scikit-learn (or other third party)? Why?

### Exceptional Work (10 points total)

You have free rein to provide additional analyses.
One idea: Make your implementation of logistic regression compatible with the GridSearchCV function that is part of scikit-learn.
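
A minimal sketch of what GridSearchCV compatibility typically involves is shown below: constructor arguments are stored unchanged under the same attribute names, and the class inherits `get_params`/`set_params` from `BaseEstimator` and `score` from `ClassifierMixin`. The `ThresholdClassifier` here is a hypothetical toy stand-in, not the lab's logistic regression, and the data is made up for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV

class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """Toy classifier (hypothetical): predicts 1 when the first feature exceeds a threshold."""
    def __init__(self, threshold=0.0):
        # store the argument as-is so get_params/set_params can clone the estimator
        self.threshold = threshold

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        return (X[:, 0] > self.threshold).astype(int)

# Made-up data: the label is 1 exactly when the feature is above 0.5
X = np.array([[0.1], [0.4], [0.6], [0.9], [0.2], [0.8], [0.3], [0.7]])
y = (X[:, 0] > 0.5).astype(int)

search = GridSearchCV(ThresholdClassifier(), {"threshold": [0.0, 0.5, 1.0]}, cv=2)
search.fit(X, y)
best = search.best_params_["threshold"]
print(best)
```

The same pattern should carry over to a custom logistic regression class, provided `__init__` does no work beyond assigning its arguments.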

In [1]:
## redoing large files
import pandas as pd
import numpy as np

%time df = pd.read_csv('~/Downloads/ss15pusb.csv') # read in the csv file



CPU times: user 1min 6s, sys: 18.1 s, total: 1min 24s
Wall time: 1min 37s


In [2]:
df.columns.values

array(['RT', 'SERIALNO', 'SPORDER', 'PUMA', 'ST', 'ADJINC', 'PWGTP',
       'AGEP', 'CIT', 'CITWP', 'COW', 'DDRS', 'DEAR', 'DEYE', 'DOUT',
       'DPHY', 'DRAT', 'DRATX', 'DREM', 'ENG', 'FER', 'GCL', 'GCM', 'GCR',
       'HINS1', 'HINS2', 'HINS3', 'HINS4', 'HINS5', 'HINS6', 'HINS7',
       'INTP', 'JWMNP', 'JWRIP', 'JWTR', 'LANX', 'MAR', 'MARHD', 'MARHM',
       'MARHT', 'MARHW', 'MARHYP', 'MIG', 'MIL', 'MLPA', 'MLPB', 'MLPCD',
       'MLPE', 'MLPFG', 'MLPH', 'MLPI', 'MLPJ', 'MLPK', 'NWAB', 'NWAV',
       'NWLA', 'NWLK', 'NWRE', 'OIP', 'PAP', 'RELP', 'RETP', 'SCH', 'SCHG',
       'SCHL', 'SEMP', 'SEX', 'SSIP', 'SSP', 'WAGP', 'WKHP', 'WKL', 'WKW',
       'WRK', 'YOEP', 'ANC', 'ANC1P', 'ANC2P', 'DECADE', 'DIS', 'DRIVESP',
       'ESP', 'ESR', 'FHICOVP', 'FOD1P', 'FOD2P', 'HICOV', 'HISP', 'INDP',
       'JWAP', 'JWDP', 'LANP', 'MIGPUMA', 'MIGSP', 'MSP', 'NAICSP',
       'NATIVITY', 'NOP', 'OC', 'OCCP', 'PAOC', 'PERNP', 'PINCP', 'POBP',
       'POVPIP', 'POWPUMA', 'POWSP', 'PRIVCOV', 'PUBC

In [3]:
df.head(10)
df.shape

new_df = df.filter(items=['AGEP','PINCP','QTRBIR','JWAP','JWDP','DECADE','YOEP','WAOB','RAC1P'])
new_df.shape
new_df.to_csv('~/Downloads/ss15pusb2.csv')

In [4]:
new_df[new_df['RAC1P']==1]

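|   | AGEP | PINCP | QTRBIR | JWAP | JWDP | DECADE | YOEP | WAOB | RAC1P |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 77 | 9100.0 | 1 | NaN | NaN | NaN | NaN | 1 | 1 |
| 1 | 30 | 5000.0 | 2 | NaN | NaN | NaN | NaN | 1 | 1 |
| 2 | 65 | 20000.0 | 4 | NaN | NaN | NaN | NaN | 1 | 1 |
| 3 | 69 | 11000.0 | 2 | NaN | NaN | NaN | NaN | 1 | 1 |
| 4 | 24 | 25300.0 | 2 | 82.0 | 34.0 | NaN | NaN | 1 | 1 |
| 5 | 25 | 15000.0 | 2 | 175.0 | 103.0 | NaN | NaN | 1 | 1 |
| 6 | 20 | 0.0 | 2 | NaN | NaN | NaN | NaN | 1 | 1 |
| 7 | 19 | 19600.0 | 4 | 84.0 | 43.0 | NaN | NaN | 1 | 1 |
| 8 | 58 | 143000.0 | 2 | 84.0 | 37.0 | NaN | NaN | 1 | 1 |
| 9 | 53 | 56000.0 | 1 | 78.0 | 37.0 | NaN | NaN | 1 | 1 |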


In [5]:
import numpy as np
import pandas as pd
from scipy.optimize import fmin_bfgs
from scipy.special import expit

In [6]:
class BinaryClassifierBase:
    def __init__(self, eta, iterations=20, cost=0.001):
        self.eta = eta
        self.cost = cost
        self.iters = iterations
    
    def fit(self, x, y):
        self.w_ = np.zeros((x.shape[1],1))
        for _ in range(self.iters):
            gradient = self._get_gradient(x,y)
            self.w_ += gradient*self.eta
    
    def predict_proba(self,x):
        return 1/(1+np.exp(-(x @ self.w_)))
    
    def predict(self, x):
        return (self.predict_proba(x)>0.5)
    
    
class BinaryStochDescClassifier(BinaryClassifierBase):
    def _get_gradient(self, x, y):
        idx = int(np.random.rand()*len(y)) # grab random instance
        ydiff = y[idx]-self.predict_proba(x[idx]) # get y difference (now scalar)
        gradient = x[idx] * ydiff[:,np.newaxis] # make ydiff a column vector and multiply through
        gradient = gradient.reshape(self.w_.shape)
        gradient[1:] += -2 * self.w_[1:] * self.cost
        return gradient
    
    
class BinarySteepDescClassifier(BinaryClassifierBase):
    def _get_gradient(self, x, y):
        ydiff = y-self.predict_proba(x).ravel()
        gradient = np.mean(x * ydiff[:,np.newaxis], axis=0)
        return gradient.reshape(self.w_.shape)

    
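class BinaryNewtonClassifier(BinaryClassifierBase):
    # Note: fmin_bfgs is BFGS, a quasi-Newton method that approximates the Hessian
    def fit(self, x, y):
        def obj_fn(w, x, y, c):
            # mean negative log-likelihood plus L2 penalty (w[0] excluded),
            # scaled so that obj_grad below is its exact gradient
            g = expit(x @ w)
            nll = -(np.sum(np.log(g[y==1])) + np.sum(np.log(1-g[y==0]))) / len(y)
            return nll + c*np.sum(w[1:]**2)
        
        def obj_grad(w, x, y, c):
            g = expit(x @ w)
            ydiff = y-g
            gradient = np.mean(x * ydiff[:,np.newaxis], axis=0)
            gradient = gradient.reshape(w.shape)
            gradient[1:] += -2 * w[1:] * c
            return -gradient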
        
        self.w_ = fmin_bfgs(obj_fn, 
                            np.zeros((x.shape[1], 1)), 
                            fprime=obj_grad, 
                            args=(x, y, self.cost), 
                            gtol=1e-03, 
                            maxiter=self.iters, 
                            disp=False).reshape((x.shape[1], 1))

        
class LogRegClassifier:
    def __init__(self, eta, iterations=20, optimize='steepdesc', cost=0.001):
        typesofoptimize = {
            'steepdesc': BinarySteepDescClassifier, 
            'stochdesc': BinaryStochDescClassifier, 
            'newton': BinaryNewtonClassifier
        }
        if optimize not in typesofoptimize.keys():
            raise ValueError('optimize must be one of: ' + ' '.join(typesofoptimize.keys()))
        
        self.eta = eta
        self.iters = iterations
        self.optimize = optimize
        self.classifier = typesofoptimize[optimize]
        self.cost = cost
        self.classifiers = [] # fill with binary classifiers during fit
    
    def _add_bias(self, x):
        return np.hstack((np.ones((x.shape[0],1)),x))
    
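    def fit(self, x, y):
        Xb = self._add_bias(x)
        self.classes_ = np.unique(y)
        self.classifiers = [] # reset so that refitting does not accumulate classifiers
        
        for cl in self.classes_:
            cur_y = y==cl # one-versus-all target for this class
            cur_classifier = self.classifier(self.eta, self.iters, cost=self.cost)
            cur_classifier.fit(Xb, cur_y) # train on the bias-augmented data
            self.classifiers.append(cur_classifier)
    
    def predict(self, x):
        if not self.classifiers:
            raise RuntimeError('Classifier not fit!')
        
        Xb = self._add_bias(x) # must match the columns used in fit
        probabilities = []
        for classifier in self.classifiers:
            probabilities.append(classifier.predict_proba(Xb))
        probabilities = np.hstack(probabilities)
        # map the winning column back to the original class label
        return self.classes_[np.argmax(probabilities,axis=1)]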

In [7]:
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
ds = load_iris()
regrs = {
    'newton': LogRegClassifier(.1, 500, optimize='newton'),
    'stoch': LogRegClassifier(.1, 500, optimize='stochdesc'), # Different results every time, random
    'steep': LogRegClassifier(.1, 500, optimize='steepdesc')
}
for regr in regrs.values():
    regr.fit(ds.data, ds.target)

for key, val in regrs.items():
    res = val.predict(ds.data)
    print(key + ': ' +str(accuracy_score(ds.target, res)))

steep: 0.96
stoch: 0.666666666667
newton: 0.966666666667


### Dividing Training And Test Data (15 pts)

In [8]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1528516 entries, 0 to 1528515
Data columns (total 9 columns):
AGEP      1528516 non-null int64
PINCP     1265774 non-null float64
QTRBIR    1528516 non-null int64
JWAP      657901 non-null float64
JWDP      657901 non-null float64
DECADE    173327 non-null float64
YOEP      173327 non-null float64
WAOB      1528516 non-null int64
RAC1P     1528516 non-null int64
dtypes: float64(5), int64(4)
memory usage: 105.0 MB


In [10]:
from sklearn.cross_validation import ShuffleSplit
df_imputed = new_df.copy()
# we want to predict the X and y data as follows:
if 'PINCP' in new_df:
    y = df_imputed['PINCP'].values # get the labels we want
    del df_imputed['PINCP'] # get rid of the class label
    X = df_imputed.values # use everything else to predict!

    ## X and y are now numpy matrices, by calling 'values' on the pandas data frames we
    #    have converted them into simple matrices to use with scikit learn
    
    
# to use the cross validation object in scikit learn, we need to grab an instance
#    of the object and set it up. This object will be able to split our data into 
#    training and testing splits
num_cv_iterations = 3
num_instances = len(y)
cv_object = ShuffleSplit(n=num_instances,
                         n_iter=num_cv_iterations,
                         test_size  = 0.2)
                         
print(cv_object)

ShuffleSplit(1528516, n_iter=3, test_size=0.2, random_state=None)
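
The `sklearn.cross_validation` module used above was removed in later scikit-learn releases in favor of `sklearn.model_selection`. A sketch of the equivalent 80/20 shuffle split with the newer interface follows; the small `X_demo` and `y_demo` arrays are made-up placeholders, and `random_state=0` is an added assumption for reproducibility.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Placeholder data standing in for the X and y arrays built above
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# n_splits replaces n_iter, and the number of instances is no longer passed in
cv = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

# the splitter is no longer iterable on its own; call split() with the data
splits = list(cv.split(X_demo, y_demo))
for train_indices, test_indices in splits:
    print(len(train_indices), len(test_indices))
```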


In [11]:
# run logistic regression and vary some parameters
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt

# first we create a reusable logistic regression object
#   here we can set up the object with different learning parameters and constants
lr_clf = LogisticRegression(penalty='l2', C=1.0, class_weight=None) # get object

# now we can use the cv_object that we set up before to iterate through the 
#    different training and testing sets. Each time we will reuse the logistic regression 
#    object, but it gets trained on different data each time we use it.

iter_num=0
# the indices are the rows used for training and testing in each iteration
for train_indices, test_indices in cv_object: 
    # I will create new variables here so that it is more obvious what 
    # the code is doing (you can compact this syntax and avoid duplicating memory,
    # but it makes this code less readable)
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]
    
    # train the reusable logistic regression model on the training data
    lr_clf.fit(X_train,y_train)  # train object
    y_hat = lr_clf.predict(X_test) # get test set predictions

    # now let's get the accuracy and confusion matrix for this iteration of training/testing
    acc = mt.accuracy_score(y_test,y_hat)
    conf = mt.confusion_matrix(y_test,y_hat)
    print("====Iteration",iter_num," ====")
    print("accuracy", acc )
    print("confusion matrix\n",conf)
    iter_num+=1
    
# Also note that every time you run the above code
#   it randomly creates a new training and testing set, 
#   so accuracy will be different each time

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
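
The error above is consistent with the `new_df.info()` output, which shows that PINCP, JWAP, JWDP, DECADE, and YOEP all contain missing values. One possible way to handle this before fitting is sketched below on made-up arrays: rows with a missing label are dropped, and missing feature values are filled with the column median. Whether median imputation is appropriate for these particular columns is an assumption that would need to be checked against the data.

```python
import numpy as np

# Made-up stand-ins for the X and y arrays built earlier
X_demo = np.array([[77.0, 1.0, np.nan],
                   [30.0, 2.0, 82.0],
                   [24.0, 2.0, 34.0],
                   [65.0, np.nan, 20.0]])
y_demo = np.array([9100.0, np.nan, 25300.0, 20000.0])

# 1) drop rows whose label is missing
label_ok = ~np.isnan(y_demo)
X_lab, y_lab = X_demo[label_ok], y_demo[label_ok]

# 2) fill missing feature values with the median of each column
col_median = np.nanmedian(X_lab, axis=0)
X_imputed = np.where(np.isnan(X_lab), col_median, X_lab)

print(X_imputed)
print(y_lab)
```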