# Lab 3: Logistic Regression, Support Vector Machines, and Evaluation


In this lab we'll get some hands-on experience with two more classifiers we've seen in class:
- Logistic Regression
- Support Vector Machines

We will also explore evaluation metrics that we covered in class and understand how to calculate them.

## Goals for this lab

- Understand the practical implications of changing the parameters used in Logistic Regression and Support Vector Machines
  
- Learn more about the evaluation metrics covered in class and learn how to calculate them (at different thresholds)
  - accuracy
  - precision
  - recall
  - AUC

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.tree as tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases
from sklearn.metrics import accuracy_score as accuracy
import graphviz # If you don't have this, install via pip/conda
%matplotlib inline

# exercise: what additional modules should you import?

# Data
We'll continue to use the same data as in the previous lab.

It is a subset of the data set from https://www.kaggle.com/new-york-state/nys-patient-characteristics-survey-pcs-2015

The data has been downloaded and modified, and is in the GitHub repo for the lab.

You should also try this with other data sets you have been provided for the homeworks.

In [None]:
# Change this to wherever you're storing your data
datafile = '../data/nysmedicaldata.csv'
df = pd.read_csv(datafile)

In [None]:
df.head()

In [None]:
df.dtypes

# Some Quick Data Exploration
Before running any sort of model on your dataset, it's always a good idea to do some quick data exploration to get a sense of what your data looks like. Try to answer the following questions with some sort of plot/histogram/etc:

1) What do the distributions of each feature look like?
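One way to get started on this question is sketched below. It uses a tiny toy dataframe rather than the lab data, and the column names and values in it are hypothetical, so swap in your own columns.

```python
import pandas as pd

# Toy stand-in for the lab data; column names and values are hypothetical
toy_df = pd.DataFrame({
    "Sex": ["MALE", "FEMALE", "FEMALE", "MALE", "FEMALE"],
    "Age Group": ["ADULT", "ADULT", "CHILD", "ADULT", "CHILD"],
})

# Share of rows taking each value, one Series per feature
distributions = {
    column: toy_df[column].value_counts(normalize=True)
    for column in toy_df.columns
}

for column, shares in distributions.items():
    print(column)
    print(shares)
    # In a notebook, shares.plot(kind="bar", title=column) draws the bar chart
```

For categorical columns, value counts are usually more informative than a numeric histogram.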

In [None]:
# Ex


# Using scikit-learn for classification

sklearn is a very useful Python package for building machine learning models. To build a model in sklearn, you need to have a matrix (or dataframe) with X and y columns. X is your set of features/predictors. y is a single column that is your label. We'll take the following steps:

1. Select/create column as label/outcome (y)
2. Select/create columns as features (X)
3. Create Training Set
4. Create Validation Set
5. Build model on Training Set
6. Predict risk scores for the Validation Set
7. Calculate performance metric(s)

## Some useful things to know in sklearn

fit = train an algorithm

predict_proba = predict a "risk" score for all possible classes for a given record (classification only)


## Important: never use .predict
There is also a function called "predict", which first runs predict_proba and then predicts a 1 if the score > 0.5 and 0 otherwise. *Never* use that function, since 0.5 is a completely arbitrary threshold to call a prediction 1 vs 0.
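Here is a minimal sketch of the fit / predict_proba pattern with an explicit threshold. The data is synthetic (generated with sklearn's make_classification), and the threshold value is only illustrative, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression()
model.fit(X_demo, y_demo)  # fit = train the algorithm

# predict_proba returns one column per class; column 1 is the score for class 1
scores = model.predict_proba(X_demo)[:, 1]

# Turn scores into 0/1 predictions at a threshold we choose explicitly
threshold = 0.3  # illustrative value only
predicted_labels = (scores > threshold).astype(int)
```

Keeping the scores around lets us try several thresholds without refitting the model.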



## 1. Create label/outcome
One thing we can do with this dataset is to try to use the various feature columns to classify whether a person has High Blood Pressure. Let's create a column that is 1 if a person has High Blood Pressure and 0 otherwise
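As a rough sketch of this step, assume the source column holds text values such as "YES" and "NO". The column name and values below are hypothetical, so check `df.head()` for the real ones.

```python
import pandas as pd

# Toy stand-in; the column name and its values are hypothetical
toy_df = pd.DataFrame(
    {"High Blood Pressure": ["YES", "NO", "NO", "YES", "UNKNOWN"]}
)

# 1 if the person has High Blood Pressure, 0 otherwise
toy_df["label"] = (toy_df["High Blood Pressure"] == "YES").astype(int)

# The mean of a 0/1 column is the share of rows labeled 1
share_positive = toy_df["label"].mean()
print(f"{share_positive:.0%} of the toy rows are labeled 1")
```

Note that this sketch treats "UNKNOWN" as 0, which is a modeling choice you may want to revisit for the real data.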

In [None]:
# code

### Question: what percentage of people have High Blood Pressure?

In [None]:
# code

## 2. Create or select existing predictors/features

For now, let's take a handful of existing columns to use.

sklearn needs features to be numeric rather than categorical, so we'll have to turn our selected features into binary columns (also known as dummy variables).
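One common way to build dummy variables is pandas' get_dummies. The sketch below runs on a toy dataframe with hypothetical column names, not on the lab data.

```python
import pandas as pd

# Toy stand-in; column names and values are hypothetical
toy_df = pd.DataFrame({
    "Sex": ["MALE", "FEMALE", "MALE"],
    "Age Group": ["ADULT", "CHILD", "ADULT"],
})

selected_features = ["Sex", "Age Group"]

# One 0/1 column per (feature, value) pair
X_toy = pd.get_dummies(toy_df[selected_features], columns=selected_features, dtype=int)
print(X_toy.columns.tolist())
```

Each new column is named after the original column and the value it represents, which makes model coefficients easier to read later.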

In [None]:
# code

In [None]:
# code

# Train/Test Splits

Create a train/test set split using sklearn's [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. We'll use these train/test splits for evaluating all our classification models.
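A minimal sketch of the split, using random stand-in arrays in place of the lab's X and y. The 80/20 split and the random_state are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data (an assumption; replace with your X and y)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = rng.integers(0, 2, size=100)

# Hold out 20% of the rows for evaluation; fix random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
```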

In [None]:
# code

# Logistic Regression
See the sklearn documentation on Logistic Regression to see its parameters. The ones we'll mostly be interested in are:
- penalty
- C

Remember that when training a model, **you should only use the training data!** The test set is reserved exclusively for evaluating your model. Now let's use the classifier:

In [1]:
# code


## Logistic Regression Tasks:

The goal here is to explore different penalty parameters and different C values. You can also try modifying other parameters to see their impact. How does accuracy change, using different thresholds, as you vary penalty and C values? You can write a nested for loop that loops over all the parameters and values and stores the results in a data frame (similar to last lab).

Ref: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

You'll notice that LogisticRegression takes a ton of parameters. We'll play around with the "penalty" and "C" parameters.
If we set the penalty parameter to ['l2'](http://mathworld.wolfram.com/L2-Norm.html), sklearn's LogisticRegression model solves the following minimization problem (with the labels coded as $y_i \in \{-1, +1\}$):

$$ \min_{\beta} \frac{1}{2} ||\beta||_2^2 + C \sum_{i} \log ( \exp( -y_i (X_i^T \beta) ) +1)$$

Similarly, if we set the penalty parameter to ['l1'](http://mathworld.wolfram.com/L1-Norm.html), LogisticRegression will solve the following minimization problem:

$$\min_{\beta} ||\beta||_1 + C \sum_{i} \log ( \exp( -y_i (X_i^T \beta) ) +1)$$

where $$||\beta||_2 = \sqrt { \sum_{j} \beta_j^2 }$$ and $$||\beta||_1 =  \sum_{j} | \beta_j | $$

In both cases C multiplies the loss term, so a smaller C means stronger regularization.

Try running logistic regression with both L1 and L2 penalties and a mix of C values. Something like $(10^{-2}, 10^{-1}, 1, 10, 10^2)$ is reasonable.
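The nested loop could look something like the sketch below. It runs on synthetic data and assumes a scikit-learn version in which LogisticRegression accepts the `penalty` argument. The `liblinear` solver is used because it supports both penalties, and the threshold list is just an illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

rows = []
for penalty in ["l1", "l2"]:
    for C in [0.01, 0.1, 1, 10, 100]:
        model = LogisticRegression(penalty=penalty, C=C, solver="liblinear")
        model.fit(X_train, y_train)  # train on the training set only
        scores = model.predict_proba(X_test)[:, 1]
        for threshold in [0.3, 0.5, 0.7]:
            predicted = (scores > threshold).astype(int)
            rows.append({
                "penalty": penalty,
                "C": C,
                "threshold": threshold,
                "accuracy": np.mean(predicted == y_test),
            })

results = pd.DataFrame(rows)
print(results.sort_values("accuracy", ascending=False).head())
```

Storing one row per (penalty, C, threshold) combination makes it easy to sort, group, or plot the results afterwards.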

In [4]:
# code

## Understanding what's going on inside Logistic Regression

To really see the difference between L1 and L2 regularization, we need to take a closer look at the models they produced. Plot a histogram of the weight values of LogisticRegression models for each C value. You can access these weight coefficients via the coef\_ attribute in LogisticRegression. Do you notice anything interesting happening as the C value varies?
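As a starting point, the sketch below fits both penalties at a few C values on synthetic data and counts how many weights are exactly zero. The data, the C values, and the `liblinear` solver are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with only a few informative features (an assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0, random_state=0
)

zero_counts = {}
for penalty in ["l1", "l2"]:
    for C in [0.01, 1, 100]:
        model = LogisticRegression(penalty=penalty, C=C, solver="liblinear")
        model.fit(X_demo, y_demo)
        weights = model.coef_.ravel()  # one weight per feature
        zero_counts[(penalty, C)] = int(np.sum(weights == 0))
        # In a notebook, plt.hist(weights) draws the histogram for this model

for key, count in zero_counts.items():
    print(key, "weights exactly zero:", count)
```

Comparing the zero counts across penalties and C values is a quick numeric complement to the histograms.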

In [9]:
# code

# Support Vector Machines

Ref: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC
The SVM Classifier also takes quite a few parameters. For now we will use Linear SVMs. The model is called LinearSVC in sklearn.

We will be playing with the following parameter:
* C: same as above

SVM tries to find the hyperplane that maximizes the "margin" between the two classes of points. The "C" parameter in [SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) has the same role as the "C" parameter in LogisticRegression: it controls how strongly the "size" of the weight vector is penalized relative to the training loss, with smaller C meaning stronger regularization. Note that SVC only allows for L2 regularization.



### Let's fit an SVM

### Now predict scores on the test set and plot the distribution of scores
You might notice that the function you've been using to predict so far does not work. Is there another function you need to use? Which one? Why?
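If you get stuck, here is one possible sketch on synthetic data. It relies on LinearSVC's decision_function, which returns a signed distance from the hyperplane rather than a probability. The threshold of 0 is just the natural starting point, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption; not the lab's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

svm = LinearSVC(C=1.0)
svm.fit(X_train, y_train)

# Scores are signed distances: positive values lean towards class 1
svm_scores = svm.decision_function(X_test)

# Scores are not confined to [0, 1], so pick the threshold on the score scale
threshold = 0.0
svm_accuracy = np.mean((svm_scores > threshold).astype(int) == y_test)
print("accuracy at threshold 0:", svm_accuracy)
```

Because these scores are unbounded, plotting their distribution first helps in choosing sensible thresholds.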

### Now we can select a threshold and calculate accuracy

### Let's now vary values of C and see the results.

# Evaluation Metrics

We covered several evaluation metrics in class:
- accuracy
- precision
- recall
- area under curve
- ROC curves
    
Although sklearn has built-in functions to calculate these metrics,
in this lab we want to give you an understanding of these metrics 
by writing functions to calculate them yourself.

Remember that accuracy, precision, and recall are calculated at a specific threshold for turning scores into 0 and 1.


### Set Threshold


In [None]:
threshold = 

### We will first create a confusion matrix based on this threshold

In [None]:
true_positives =
false_positives =
true_negatives =
false_negatives = 

### Let's now write functions that can calculate each metric

In [None]:
def calculate_accuracy_at_threshold(predicted_scores, true_labels, threshold):


def calculate_precision_at_threshold(predicted_scores, true_labels, threshold):


def calculate_recall_at_threshold(predicted_scores, true_labels, threshold):


### Now let's calculate all of these for a logistic regression model you built above
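To check your own functions against something, here is one possible sketch of the three metrics built on confusion-matrix counts. The function names and the toy scores are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
import numpy as np

def confusion_counts(predicted_scores, true_labels, threshold):
    """Return (tp, fp, tn, fn) after labeling scores above the threshold as 1."""
    predicted = np.asarray(predicted_scores) > threshold
    actual = np.asarray(true_labels) == 1
    tp = int(np.sum(predicted & actual))
    fp = int(np.sum(predicted & ~actual))
    tn = int(np.sum(~predicted & ~actual))
    fn = int(np.sum(~predicted & actual))
    return tp, fp, tn, fn

def accuracy_at_threshold(predicted_scores, true_labels, threshold):
    tp, fp, tn, fn = confusion_counts(predicted_scores, true_labels, threshold)
    return (tp + tn) / (tp + fp + tn + fn)

def precision_at_threshold(predicted_scores, true_labels, threshold):
    tp, fp, _, _ = confusion_counts(predicted_scores, true_labels, threshold)
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0  # no predicted positives

def recall_at_threshold(predicted_scores, true_labels, threshold):
    tp, _, _, fn = confusion_counts(predicted_scores, true_labels, threshold)
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0  # no actual positives

# Toy scores and labels, made up for illustration
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
toy_labels = [1, 0, 1, 0, 0]
print(accuracy_at_threshold(toy_scores, toy_labels, 0.5))
print(precision_at_threshold(toy_scores, toy_labels, 0.5))
print(recall_at_threshold(toy_scores, toy_labels, 0.5))
```

For a fitted model, the scores would come from predict_proba on the test set and the labels from the held-out y.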

### Now let's write a function that generates the precision, recall, k (% of population) graph that we covered in class
<img src="../imgs/prk.png">
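A rough sketch of how such a graph could be built is below. It assumes the graph ranks records by score and plots precision and recall for the top k fraction of the population. The function names are hypothetical, and the sketch assumes at least one positive label.

```python
import numpy as np
import matplotlib.pyplot as plt

def precision_recall_at_k(predicted_scores, true_labels):
    """Precision and recall when the top-k scored records are labeled 1."""
    scores = np.asarray(predicted_scores, dtype=float)
    labels = np.asarray(true_labels, dtype=int)
    order = np.argsort(-scores, kind="stable")  # highest scores first
    hits = np.cumsum(labels[order])             # true positives within top k
    k = np.arange(1, len(labels) + 1)
    pct_population = k / len(labels)
    precision = hits / k
    recall = hits / labels.sum()                # assumes at least one positive
    return pct_population, precision, recall

def plot_precision_recall_k_sketch(predicted_scores, true_labels):
    """Draw precision and recall against k on two y-axes and return the figure."""
    pct, precision, recall = precision_recall_at_k(predicted_scores, true_labels)
    fig, ax_precision = plt.subplots()
    ax_precision.plot(pct, precision, color="tab:blue")
    ax_precision.set_xlabel("k (fraction of population)")
    ax_precision.set_ylabel("precision", color="tab:blue")
    ax_recall = ax_precision.twinx()            # second y-axis for recall
    ax_recall.plot(pct, recall, color="tab:red")
    ax_recall.set_ylabel("recall", color="tab:red")
    return fig

# Toy scores and labels, made up for illustration
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
toy_labels = [1, 0, 1, 0, 0]
pct, precision, recall = precision_recall_at_k(toy_scores, toy_labels)
```

Separating the calculation from the plotting makes it easy to overlay curves from two models on the same axes.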


In [None]:
def plot_precision_recall_k(predicted_scores, true_labels):

### Let's plot it for the same logistic regression model

### Now we build the same graph for an SVM model and compare the two. Which one is better?