# Comprehensive week 4 practice

This week you've learned about:

- Classification vs. regression
    - Predicting class labels
    - Predicted probability vs. predicted mean for target/dependent variables
- Categorical vs. continuous variables
    - Dummy coding representation in the X matrix
- The kNN classification algorithm
    - How choice of neighbors affects the bias-variance tradeoff
- The logistic regression algorithm
    - The logit/logistic link function
    - How logistic regression can still use the least squares loss function via the link function
    - Pros/Cons of logistic regression vs. kNN
- Validation of classifiers using cross-validation
- Benefits of predictor normalization
- How classification metrics differ from regression
    - Confusion matrices (TP, FP, TN, FN)
    - ROC curves
    - How changing predicted probability thresholds change confusion matrices
    - How context and goals inform your choice of threshold
- Regularization
    - How Lasso and Ridge change the least squared loss function
    - How regularization affects the bias-variance tradeoff
    - How to tune your regularization with cross-validation and gridsearching
    - How context of the problem informs which regularization to use (if any!)
    - Pros/cons to choice of Ridge or Lasso
   
---

### Now it's time to put it all together

As a class we're going to go through the process of classifying spam in a dataset with a wide variety of predictors. You will need to go through the full process.

The data has been pre-cleaned, so no need to go through that part of the process. We have been practicing that enough and I want you to focus on the new things we learned this week.

Given the thing's you've learned above, go through the process of classifying the **`is_spam`** column with some or all of the provided predictors!

The dataset path is provided for you in the cell below:

In [1]:
spam_path = '../assets/datasets/spam_modified.csv'

---

### Step 1: Load packages and spam dataset

In [37]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import patsy
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split, StratifiedKFold
%matplotlib inline

In [3]:
spam = pd.read_csv(spam_path)

In [6]:
spam.head()
print spam['is_spam']

0       1
1       1
2       1
3       1
4       1
5       1
6       1
7       1
8       1
9       1
10      1
11      1
12      1
13      1
14      1
15      1
16      1
17      1
18      1
19      1
20      1
21      1
22      1
23      1
24      1
25      1
26      1
27      1
28      1
29      1
       ..
4571    0
4572    0
4573    0
4574    0
4575    0
4576    0
4577    0
4578    0
4579    0
4580    0
4581    0
4582    0
4583    0
4584    0
4585    0
4586    0
4587    0
4588    0
4589    0
4590    0
4591    0
4592    0
4593    0
4594    0
4595    0
4596    0
4597    0
4598    0
4599    0
4600    0
Name: is_spam, dtype: int64


In [32]:
spam_sub = spam[spam.columns[:-26]]
target = spam['is_spam']

In [48]:
preds = [x for x in spam_sub.columns if x != 'is_spam']

In [50]:
spam_sub.ix[:, preds] = (spam_sub.ix[:, preds] - spam_sub.ix[:, preds].mean()) / spam_sub.ix[:, preds].std()

In [71]:
logmodel = LogisticRegression()

# 0 -> 10**5 (0, 10, 100, 1000, 10000)
# Put 20 points evenly spaced between 0 and 10000
Cs = 1./np.logspace(0.0, 4.0, 10)

#_jobs = 4 means run 4 threads at once on the computer.
# most modern CPUs do 4 with relative ease, but if you have a shit computer you should probably leave it out.
# (if you put -1 it will use all cores - not recommended.)
search_parameters = {
    "penalty":             ['l1','l2'],   # Used to specify the norm used in the penalization.
    "C":                   Cs,  # Regularization paramter
    "class_weight":        [None, "balanced"], # The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
    'n_jobs':              [4]
}

est = GridSearchCV(logmodel, search_parameters)

In [72]:
x = spam_sub[preds]
y = target

In [73]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)

In [74]:
grid = est.fit(x, y)

In [75]:
grid.grid_scores_

[mean: 0.85068, std: 0.12989, params: {'penalty': 'l1', 'C': 1.0, 'n_jobs': 4, 'class_weight': None},
 mean: 0.93827, std: 0.05329, params: {'penalty': 'l2', 'C': 1.0, 'n_jobs': 4, 'class_weight': None},
 mean: 0.85612, std: 0.13155, params: {'penalty': 'l1', 'C': 1.0, 'n_jobs': 4, 'class_weight': 'balanced'},
 mean: 0.93349, std: 0.06128, params: {'penalty': 'l2', 'C': 1.0, 'n_jobs': 4, 'class_weight': 'balanced'},
 mean: 0.84612, std: 0.12999, params: {'penalty': 'l1', 'C': 0.35938136638046275, 'n_jobs': 4, 'class_weight': None},
 mean: 0.94175, std: 0.04776, params: {'penalty': 'l2', 'C': 0.35938136638046275, 'n_jobs': 4, 'class_weight': None},
 mean: 0.85742, std: 0.13186, params: {'penalty': 'l1', 'C': 0.35938136638046275, 'n_jobs': 4, 'class_weight': 'balanced'},
 mean: 0.93588, std: 0.05845, params: {'penalty': 'l2', 'C': 0.35938136638046275, 'n_jobs': 4, 'class_weight': 'balanced'},
 mean: 0.84851, std: 0.13000, params: {'penalty': 'l1', 'C': 0.12915496650148842, 'n_jobs': 4, '

In [77]:
for params, mean_acc, std in grid.grid_scores_:
    print 'C:', params['C']
    print 'penalty:', params['penalty']
    print 'class weight:', params['class_weight']
    print 'mean accuracy:', mean_acc
    print '------------------------------\n'

C: 1.0
penalty: l1
class weight: None
mean accuracy 0.850684633775
------------------------------

C: 1.0
penalty: l2
class weight: None
mean accuracy 0.938274288198
------------------------------

C: 1.0
penalty: l1
class weight: balanced
mean accuracy 0.856118235166
------------------------------

C: 1.0
penalty: l2
class weight: balanced
mean accuracy 0.933492718974
------------------------------

C: 0.35938136638
penalty: l1
class weight: None
mean accuracy 0.846120408607
------------------------------

C: 0.35938136638
penalty: l2
class weight: None
mean accuracy 0.941751793088
------------------------------

C: 0.35938136638
penalty: l1
class weight: balanced
mean accuracy 0.8574222995
------------------------------

C: 0.35938136638
penalty: l2
class weight: balanced
mean accuracy 0.935883503586
------------------------------

C: 0.129154966501
penalty: l1
class weight: None
mean accuracy 0.848511193219
------------------------------

C: 0.129154966501
penalty: l2
class weight: 