### Using regularized logistic regression to classify email

In [1]:
import scipy.io
import utils
import numpy as np
from sklearn import linear_model

# No modifications are needed in this script.
# Complete the functions in utils.py, then run the script.

# load the spam data in

Xtrain,Xtest,ytrain,ytest = utils.load_spam_data()

# Preprocess the data 

Xtrain_std,mu,sigma = utils.std_features(Xtrain)
Xtrain_logt = utils.log_features(Xtrain)
Xtrain_bin = utils.bin_features(Xtrain)

Xtest_std = (Xtest - mu)/sigma
Xtest_logt = utils.log_features(Xtest)
Xtest_bin = utils.bin_features(Xtest)

# find good lambda by cross validation for these three sets

def run_dataset(X, ytrain, Xt, ytest, label, penalty):
    # Select lambda by cross-validation over the grid 0.1, 0.6, ..., up to 5.1
    best_lambda = utils.select_lambda_crossval(X, ytrain, 0.1, 5.1, 0.5, penalty)
    print("best_lambda = ", best_lambda)

    # train a classifier on best_lambda and run it
    # (sklearn's C is the inverse of the regularization strength lambda;
    #  lbfgs supports only the L2 penalty, so liblinear is used for L1)
    if penalty == "l2":
        lreg = linear_model.LogisticRegression(penalty=penalty, C=1.0/best_lambda, solver='lbfgs', fit_intercept=True)
    else:
        lreg = linear_model.LogisticRegression(penalty=penalty, C=1.0/best_lambda, solver='liblinear', fit_intercept=True)
    lreg.fit(X, ytrain)
    print("Coefficients = ", lreg.intercept_, lreg.coef_)
    predy = lreg.predict(Xt)
    print("Accuracy on set aside test set for ", label, " = ", np.mean(predy == ytest))

print("L2 Penalty experiments -----------")
run_dataset(Xtrain_std, ytrain, Xtest_std, ytest, "std", "l2")
run_dataset(Xtrain_logt, ytrain, Xtest_logt, ytest, "logt", "l2")
run_dataset(Xtrain_bin, ytrain, Xtest_bin, ytest, "bin", "l2")

print("L1 Penalty experiments -----------")
run_dataset(Xtrain_std, ytrain, Xtest_std, ytest, "std", "l1")
run_dataset(Xtrain_logt, ytrain, Xtest_logt, ytest, "logt", "l1")
run_dataset(Xtrain_bin, ytrain, Xtest_bin, ytest, "bin", "l1")

L2 Penalty experiments -----------
best_lambda =  0.1
Coefficients =  [-4.86311363] [[ -2.74146423e-02  -2.25297597e-01   1.21840933e-01   2.29362879e+00
    2.70425715e-01   2.32851163e-01   9.28595395e-01   2.95200236e-01
    1.62205936e-01   6.78260362e-02  -8.32604386e-02  -1.60373354e-01
   -4.72247682e-02   1.07677111e-02   1.87903360e-01   8.19771812e-01
    5.09528973e-01   3.98711504e-02   2.67729695e-01   3.47047564e-01
    2.60498923e-01   3.64605215e-01   7.25019578e-01   1.96728249e-01
   -3.15395701e+00  -4.03133789e-01  -1.25451044e+01  -6.16580960e-02
   -1.56114609e+00  -5.51429801e-02  -3.00815864e-02   4.07263543e-01
   -3.68156446e-01  -1.43611787e+00  -5.87180606e-01   4.44294891e-01
    4.23159462e-02  -1.56897094e-01  -4.55330838e-01  -1.02250289e-01
   -3.54273295e+00  -1.72944487e+00  -4.37529300e-01  -1.05999941e+00
   -9.18599328e-01  -1.75490328e+00  -1.67475856e-01  -9.56875266e-01
   -3.65653149e-01  -1.36535510e-01  -6.58692488e-02   2.06714030e-01
    1.
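
The script above relies on three feature transforms from `utils`, whose code is not shown here. As a rough illustration of what such transforms typically look like, here is a minimal sketch. The function names ending in `_sketch` are hypothetical, and the choices of `log(1 + x)` for the log transform and `x > 0` for binarization are assumptions, not necessarily what `utils` does. The one detail taken from the script is that the test set is standardized with the training mean and standard deviation.

```python
import numpy as np

def std_features_sketch(X):
    # Standardize each column to zero mean and unit variance.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # assumption: leave constant columns unscaled
    return (X - mu) / sigma, mu, sigma

def log_features_sketch(X):
    # Assumed form of the log transform: log(1 + x), defined at x = 0.
    return np.log1p(X)

def bin_features_sketch(X):
    # Assumed form of binarization: 1 if the feature is positive, else 0.
    return (X > 0).astype(float)

# Tiny made-up data, only to show the calls.
X_train = np.array([[0.0, 2.0], [1.0, 0.0], [3.0, 4.0]])
X_test = np.array([[2.0, 1.0]])

X_train_std, mu, sigma = std_features_sketch(X_train)
# As in the script, reuse the training statistics on the test set.
X_test_std = (X_test - mu) / sigma
X_train_logt = log_features_sketch(X_train)
X_train_bin = bin_features_sketch(X_train)
```

Reusing `mu` and `sigma` from the training set keeps test-set information out of the preprocessing step.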

### Comment on the model sparsities with L1 and L2 regularization

Which class of models will you recommend for this data set, and why?

L1 regularization produces sparser models than L2: the L1 penalty drives many coefficients exactly to zero, whereas the L2 penalty only shrinks them toward zero. I would recommend the L1 models for this data set, because not every feature is crucial for spam classification, while a few features should carry a high weight when deciding whether an email is spam. The sparsity of L1 regularization helps the model focus on the important features and reduces the impact of insignificant ones.
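
The sparsity claim can be checked by counting nonzero coefficients under each penalty. The sketch below uses synthetic data rather than the spam data: the data set, the regularization strength `lam = 20.0`, and the helper `count_nonzero_coefs` are all illustrative assumptions. It uses the same solvers as the script above (`lbfgs` for L2, `liblinear` for L1).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): 30 features,
# of which only the first 3 actually influence the label.
rng = np.random.RandomState(0)
n, d = 500, 30
X = rng.randn(n, d)
true_w = np.zeros(d)
true_w[:3] = [2.0, -2.0, 1.5]
y = (X.dot(true_w) + 0.5 * rng.randn(n) > 0).astype(int)

def count_nonzero_coefs(penalty, lam):
    # sklearn's C is the inverse of the regularization strength lambda.
    solver = "lbfgs" if penalty == "l2" else "liblinear"
    clf = LogisticRegression(penalty=penalty, C=1.0 / lam, solver=solver)
    clf.fit(X, y)
    return int(np.sum(np.abs(clf.coef_) > 1e-8))

lam = 20.0  # hypothetical value, chosen to make the contrast visible
nnz_l1 = count_nonzero_coefs("l1", lam)
nnz_l2 = count_nonzero_coefs("l2", lam)
print("nonzero coefficients with L1:", nnz_l1, "of", d)
print("nonzero coefficients with L2:", nnz_l2, "of", d)
```

Running the same count on `lreg.coef_` inside `run_dataset` would give the corresponding numbers for the spam features.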