# Precision-recall vs. ROC lab

In this lab you'll explore the differences between ROC curves and precision-recall curves, as well as what changes when you optimize a gridsearch for a different metric: in this case, switching the scoring from accuracy to the F1 score (`'f1'`).

---

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import patsy

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

---

### Load a classification dataset

The data you use for this lab is up to you, but I've put two new datasets here if you want to try them out (they are used in the solutions).

The first is a dataset on police killings. This dataset is pretty small and the models I tried out don't fit that well, but your code will be fast.

The second is a cleaned-up dataset from a Stack Overflow survey given out this year. I have only recently cleaned this one and so haven't tried out many models. I think this would be the more interesting one, but it has more rows, so your code may take longer depending on your model specifications.

In [2]:
pk = pd.read_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/police_killings/police-killings.csv')

In [None]:
sos = pd.read_csv('/Users/kiefer/github-repos/DSI-SF-2/datasets/stack_overflow_surveys/survey_simple_cleaned_nonull.csv')

---

### Clean and/or explore the data

---

### Create two or three binary target variables to predict

---

### Using patsy (or manually) create corresponding predictor matrices for your target variables

---

### Break up your predictors and targets with train_test_split

**Use the `stratify` option! It takes as its argument the target variable vector.**

Choose a reasonable test size that's not too small. 0.3 to 0.4 is probably good, but depends on your data.
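
As a rough illustration of what `stratify` does, here is a minimal sketch. It assumes a synthetic, imbalanced dataset from `make_classification` as a stand-in for your own predictors and target; the variable names are placeholders, not part of the lab data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for your own X and y (about 80% of rows in class 0).
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.8, 0.2], random_state=42)

# stratify=y keeps the class proportions (nearly) equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# The positive-class rate should be almost identical in train and test.
print(y_train.mean(), y_test.mean())
```

Without `stratify`, a small or imbalanced dataset can end up with noticeably different class rates in the two splits, which makes the test metrics harder to interpret.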

---

### Gridsearch best parameters for your models

For each of your target variables, optimize parameters with gridsearch **fitting on the training data from above**. We will be saving the test data for later.

It is up to you whether you want to fit a LogisticRegression or KNeighborsClassifier (or both?).

Example parameters to search for with LogisticRegression:

    {'solver':['liblinear'], 'penalty':['l1','l2'], 'C':np.linspace(0.0001, 1000., 100)}
    
Example parameters to search for with KNeighborsClassifier:

    {'n_neighbors':range(1,100), 'weights':['uniform','distance']}
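
A gridsearch over the training data might look something like the sketch below. The data is synthetic (an assumption, standing in for your own split), and the grid is deliberately much smaller than the example above so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for your own training and test split.
X, y = make_classification(n_samples=400, n_features=8, n_clusters_per_class=1,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# A small grid for illustration; widen it for the lab.
lr_params = {
    'solver': ['liblinear'],
    'penalty': ['l1', 'l2'],
    'C': np.logspace(-3, 2, 6),
}

# With no scoring argument, a classifier is scored by accuracy.
lr_gs = GridSearchCV(LogisticRegression(), lr_params, cv=5)
lr_gs.fit(X_train, y_train)  # fit on the training data only

print(lr_gs.best_params_)
print(lr_gs.best_score_)   # mean cross-validated accuracy of the best parameters
best_lr = lr_gs.best_estimator_
```

Note that `best_score_` is a cross-validated score on the training data; the held-out test data is not touched here.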

---

### Compare performance of your models against baseline accuracy
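
Baseline accuracy is the accuracy you would get by always predicting the most common class. A minimal sketch, using a made-up label vector rather than the lab data:

```python
import numpy as np

# Hypothetical test-set labels: 70 negatives and 30 positives.
y_test = np.array([0] * 70 + [1] * 30)

# Baseline accuracy = share of the majority class.
baseline = np.bincount(y_test).max() / len(y_test)
print(baseline)  # → 0.7
```

A model is only useful in terms of accuracy if its test accuracy beats this number.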

---

### Create a function to plot an ROC curve

Your function will probably take a model, an X matrix, and a y target vector. This way you can use `roc_curve` and `auc` to compute the false positive rates, true positive rates, and area under the curve needed for the plot.

See the sklearn example here:

http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
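
One possible shape for such a function is sketched below. The name `plot_roc`, its arguments, and the synthetic data are assumptions for illustration; the model must support `predict_proba`, which both LogisticRegression and KNeighborsClassifier do.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split


def plot_roc(model, X, y, label='model', ax=None):
    """Plot the ROC curve of a fitted classifier and return its AUC."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 6))
    # Predicted probability of the positive class (second column).
    y_score = model.predict_proba(X)[:, 1]
    fpr, tpr, _ = roc_curve(y, y_score)
    roc_auc = auc(fpr, tpr)
    ax.plot(fpr, tpr, label='%s (AUC = %.2f)' % (label, roc_auc))
    # Diagonal reference line: a classifier that guesses at random.
    ax.plot([0, 1], [0, 1], linestyle='--', color='grey', label='chance')
    ax.set_xlabel('False positive rate')
    ax.set_ylabel('True positive rate')
    ax.set_title('ROC curve')
    ax.legend(loc='lower right')
    return roc_auc


# Synthetic stand-in for your own data and fitted model.
X, y = make_classification(n_samples=400, n_features=8, n_clusters_per_class=1,
                           weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)
model = LogisticRegression(solver='liblinear').fit(X_train, y_train)
roc_auc = plot_roc(model, X_test, y_test, label='logistic regression')
```

Passing an `ax` lets you draw several models on the same axes for comparison.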

---

### Plot ROC curves using your models and the test data you split off earlier

---

### Calculate confusion matrices for your models on the training and testing data

What do they tell you about the models?
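
A sketch of computing both matrices, again on synthetic stand-in data with a plain logistic regression (both are assumptions, not the lab's solution):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for your own data and fitted model.
X, y = make_classification(n_samples=400, n_features=8, n_clusters_per_class=1,
                           weights=[0.8, 0.2], random_state=2)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=2
)
model = LogisticRegression(solver='liblinear').fit(X_train, y_train)

# Rows are the actual classes, columns are the predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cms = {}
for name, X_part, y_part in [('train', X_train, y_train),
                             ('test', X_test, y_test)]:
    cms[name] = confusion_matrix(y_part, model.predict(X_part))
    print(name)
    print(cms[name])
```

A large gap between the train and test matrices suggests overfitting; many false negatives with few false positives suggests the model rarely predicts the minority class.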

---

### Write a function to plot the precision-recall curve

The code is very similar to the code for the ROC curve.

See here for an example:

http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html

In [3]:
from sklearn.metrics import (precision_recall_curve, average_precision_score, f1_score)
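
A precision-recall plotting function could be sketched along the following lines. The name `plot_precision_recall`, its arguments, and the synthetic data are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split


def plot_precision_recall(model, X, y, label='model', ax=None):
    """Plot the precision-recall curve and return the average precision."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 6))
    # Predicted probability of the positive class (second column).
    y_score = model.predict_proba(X)[:, 1]
    precision, recall, _ = precision_recall_curve(y, y_score)
    avg_precision = average_precision_score(y, y_score)
    ax.plot(recall, precision, label='%s (AP = %.2f)' % (label, avg_precision))
    # A no-skill classifier has precision equal to the positive-class rate.
    ax.axhline(np.mean(y), linestyle='--', color='grey', label='no skill')
    ax.set_xlabel('Recall')
    ax.set_ylabel('Precision')
    ax.set_title('Precision-recall curve')
    ax.legend(loc='lower left')
    return avg_precision


# Synthetic stand-in for your own data and fitted model.
X, y = make_classification(n_samples=400, n_features=8, n_clusters_per_class=1,
                           weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)
model = LogisticRegression(solver='liblinear').fit(X_train, y_train)
avg_precision = plot_precision_recall(model, X_test, y_test)
```

Unlike the ROC curve, the reference line here depends on the class balance, which is why precision-recall curves are often more informative for imbalanced targets.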

---

### Plot precision-recall curves using your models on the test data

---

### Run new gridsearches with keyword argument `scoring='f1'` for your data

The F1 score is the harmonic mean of precision and recall, computed at a single decision threshold. It is related to, but not the same as, the area under the precision-recall curve, which summarizes performance across all thresholds.

With this scoring, the gridsearch selects the parameters with the best cross-validated F1 score as opposed to the best accuracy!
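
To see the effect of the `scoring` argument, the sketch below runs the same KNeighborsClassifier grid twice, once per metric. The data is synthetic and the grid is reduced for speed; both are assumptions, so the results will differ on your data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in for your own data.
X, y = make_classification(n_samples=400, n_features=8, n_clusters_per_class=1,
                           weights=[0.85, 0.15], random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=3
)

knn_params = {
    'n_neighbors': list(range(1, 30, 2)),
    'weights': ['uniform', 'distance'],
}

results = {}
for scoring in ['accuracy', 'f1']:
    # Same model and grid; only the metric being optimized changes.
    gs = GridSearchCV(KNeighborsClassifier(), knn_params, scoring=scoring, cv=5)
    gs.fit(X_train, y_train)
    y_pred = gs.predict(X_test)  # uses the refit best estimator
    results[scoring] = {
        'best_params': gs.best_params_,
        'test_accuracy': accuracy_score(y_test, y_pred),
        'test_f1': f1_score(y_test, y_pred),
    }
    print(scoring, results[scoring])
```

The two searches may or may not pick different parameters. When they do, the F1-optimized model usually trades some accuracy for better recall or precision on the positive class.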

---

### Calculate the confusion matrices using the new models on the train or test data

Has anything changed? Why would that be the case?