# [Computational Social Science] Classification Part 2

## Classification Metrics

We will extend our work from last week to learn more about different classification metrics, as well as some other useful techniques.

## Data
We're going to use our Census Income dataset again for this lab. Load the dataset and explore it.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import pyplot
import seaborn as sns
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

%matplotlib inline
sns.set_style("darkgrid")

In [None]:
# Create a list of column names, found in "adult.names"
col_names = ['age', 'workclass', 'fnlwgt',
             'education', 'education-num',
             'marital-status', 'occupation',
             'relationship', 'race',
             'sex', 'capital-gain',
             'capital-loss', 'hours-per-week',
             'native-country', 'income-bracket']

# Read table from the data folder
census = pd.read_table("../../data/adult.data", sep = ',', names = col_names)
census.head()

Remember that before we train machine learning algorithms, we need to preprocess the data. Run the cell below to preprocess the data.

In [None]:
# Target
lb_style = LabelBinarizer()
y = census['income-bracket-binary'] = lb_style.fit_transform(census["income-bracket"])

# Features
X = census.drop(['income-bracket', 'income-bracket-binary'], axis = 1)
X = pd.get_dummies(X)
X.head()

## Fit a Logistic Regression Model

Before we explore metrics for evaluating classification algorithms, let's train a logistic regression to work with. Split the data into training, validation, and test sets, train a logistic regression model on the training set, and make predictions on the validation set. Later cells expect the names `logit_model`, `y_validate`, and `y_pred`.
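
One common way to get three splits is to call `train_test_split` twice: first hold out a test set, then carve a validation set out of what remains. The sketch below shows that pattern on synthetic data from `make_classification` rather than on the census data. The `_toy` names, the 60/20/20 proportions, and `random_state=10` are illustrative assumptions, not values this lab prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the census data): 1000 rows, 5 features
X_toy, y_toy = make_classification(n_samples=1000, n_features=5, random_state=10)

# First split: hold out 20% of all rows as the test set
X_rest_toy, X_test_toy, y_rest_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10)

# Second split: 25% of the remaining 80% becomes validation (20% overall)
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest_toy, y_rest_toy, test_size=0.25, random_state=10)

print(len(X_train_toy), len(X_val_toy), len(X_test_toy))  # 600 200 200
```

Passing `random_state` to each call makes the split repeatable regardless of NumPy's global seed.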

In [None]:
# Set seed
np.random.seed(10)

... = ...

... = ...

In [None]:
# create a model
...

# fit the model
...

...

## Accuracy

Recall the metrics we defined last week: **True Positives (TP)**, **False Positives (FP)**, **True Negatives (TN)**, and **False Negatives (FN)**. Accuracy can be expressed as:

$$
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
$$

**Question**: In plain language, what does this formula represent?

**Answer**: 

Write code to calculate the accuracy of your logistic regression. Calculate the number of true positives, false positives, true negatives, and false negatives and then calculate and print the accuracy.

In [None]:
TP = ...
FP = ...
TN = ...
FN = ...

for i in range(len(y_pred)):
    # True Positive. Hint: which two vectors need to equal each other and 1?
    if ... == ... == 1:
        TP += 1
    # False Positive. Hint: which vector needs to equal 1 while the other equals 0?
    if ... == 1 and ... != ...:
        FP += 1
    # True Negative
    if ... == ... == 0:
        TN += 1
    # False Negative
    if ... == 0 and y_pred[i] != ...:
        FN += 1

In [None]:
accuracy = ...
print("Accuracy is", accuracy)
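
A hand-written counting loop can be cross-checked against sklearn's `confusion_matrix`, which this notebook already imports. The sketch below uses made-up labels (the `_toy` names are hypothetical, chosen so they do not overwrite your own `TP`, `FP`, `TN`, `FN`) and applies the accuracy formula to the four counts.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary 0/1 labels, ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn_toy, fp_toy, fn_toy, tp_toy = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy_toy = (tp_toy + tn_toy) / (tp_toy + tn_toy + fp_toy + fn_toy)
print(tn_toy, fp_toy, fn_toy, tp_toy, accuracy_toy)  # 3 1 1 3 0.75
```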

## Precision

Precision measures how reliable the model's positive predictions are. The formula for precision is:

$$
Precision = \frac{TP}{TP + FP}
$$

**Question**: In plain language, what does this formula tell us?

**Answer**:

Calculate and print the precision for the logistic regression.

In [None]:
precision = ...
print("Precision is", precision)

## Recall

Recall is defined as:

$$
Recall = \frac{TP}{TP + FN}
$$

**Question**: In plain language, what does the formula tell us?

**Answer**: 

Calculate the recall for our logistic regression model.

In [None]:
recall = ...
print("Recall is", recall)
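
Precision and recall both depend on the threshold used to turn a predicted probability into a 0/1 label. The sketch below uses made-up labels and scores (not output from the census model) and a hypothetical helper, `precision_recall_at`, to show how the two metrics move as the threshold changes.

```python
def precision_recall_at(y_true, scores, threshold):
    """Return (precision, recall) when scores >= threshold are labeled 1."""
    y_hat = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 0)
    # Guard against dividing by zero when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Made-up labels and predicted probabilities for illustration only
y_true_toy = [0, 0, 1, 0, 1, 1, 0, 1]
scores_toy = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

for threshold in (0.30, 0.50, 0.85):
    p, r = precision_recall_at(y_true_toy, scores_toy, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold raises recall at the cost of precision, and raising it does the reverse.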

**Question**: How did we do on precision and recall? Could you optimize for one or the other?

**Answer**: 

## F1 Score

The precision-recall tradeoff can be managed in a few different ways. One popular metric is the F1 score. It is defined as:

$$
F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}
$$

Calculate and print the F1 score.

In [None]:
f1 = ...
print("F1 Score is", f1)
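
The F1 formula is the harmonic mean of precision and recall. As a small worked illustration on made-up precision/recall pairs, the sketch below compares it with the ordinary arithmetic mean; `f1_from` is a hypothetical helper name, not part of sklearn.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up (precision, recall) pairs for illustration only
for p, r in [(0.75, 0.75), (1.0, 0.25), (0.5, 1.0)]:
    arithmetic = (p + r) / 2
    print(f"precision={p:.2f} recall={r:.2f} mean={arithmetic:.3f} F1={f1_from(p, r):.3f}")
```

When the two metrics are equal, F1 matches them; when they diverge, F1 sits closer to the smaller of the two than the arithmetic mean does.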

**Question**: How does F1 trade off between precision and recall? What are the advantages and disadvantages?

**Answer**: 

## AUC-ROC

[Area Under the Curve - Receiver Operating Characteristic (AUC-ROC)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) is a popular way to measure how well an algorithm separates two classes. The ROC curve plots the True Positive Rate against the False Positive Rate at every classification threshold, and the AUC is the area under that curve. Let's define these quantities:

$$
True \space Positive \space Rate(TPR) = Sensitivity = \frac{TP}{TP + FN}
$$

Hm, this formula looks familiar. In fact, it is exactly the same as Recall! Meanwhile, the False Positive Rate is:

$$
False \space Positive \space Rate (FPR) = 1 - Specificity = \frac{FP}{TN + FP}
$$
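
Each classification threshold yields one (FPR, TPR) pair from the two rates defined above, and sweeping the threshold traces out the ROC curve. The sketch below does this by hand on made-up labels and scores. `roc_point` is a hypothetical helper, and it assumes both classes are present so that neither denominator is zero.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are labeled 1."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    n_pos = sum(1 for t in y_true if t == 1)  # TP + FN
    n_neg = len(y_true) - n_pos               # TN + FP
    return fp / n_neg, tp / n_pos

# Made-up labels and predicted probabilities for illustration only
y_true_toy = [0, 0, 1, 0, 1, 1, 0, 1]
scores_toy = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

# Sweep from a strict threshold (nothing predicted positive) to a lenient one
for threshold in (1.00, 0.85, 0.50, 0.30, 0.00):
    fpr, tpr = roc_point(y_true_toy, scores_toy, threshold)
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```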

**Question**: Why does plotting TPR against FPR express separability between class labels?

**Answer**: 

Fill in the following code to compute the AUC and plot the ROC curve for the logistic regression and a "no skill" model. Make sure to look up documentation as necessary.

In [None]:
# ROC curve and AUC

# generate a no-skill prediction (majority class)
ns_probs = [0 for _ in range(len(y_validate))]


# predict probabilities for logistic regression
lr_probs = logit_model.predict_proba(...)

# keep probabilities for the positive outcome only
lr_probs = ...

# calculate scores
ns_auc = roc_auc_score(...)
lr_auc = roc_auc_score(...)

# summarize scores
print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Logistic: ROC AUC=%.3f' % (lr_auc))

# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(...)
lr_fpr, lr_tpr, _ = roc_curve(...)

# plot the roc curve for the model
pyplot.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
pyplot.plot(lr_fpr, lr_tpr, marker='.', label='Logistic')
# axis labels
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
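
One way to read an AUC value: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes that quantity directly on made-up data. `pairwise_auc` is a hypothetical helper written for illustration, and `roc_auc_score` should agree with it on the same inputs.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    total = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                total += 1.0
            elif sp == sn:
                total += 0.5
    return total / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for illustration only
y_true_toy = [0, 0, 1, 0, 1, 1, 0, 1]
scores_toy = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
constant_toy = [0] * len(y_true_toy)  # same score for everyone, like a no-skill rule

print(pairwise_auc(y_true_toy, scores_toy))    # 0.75
print(pairwise_auc(y_true_toy, constant_toy))  # 0.5
```

A constant score ties every pair, which is why a rule that always predicts the same class lands at 0.5.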

**Question**: How did the logistic regression do on the AUC-ROC metric? Compared to the "no skill" decision rule, was it a meaningful improvement?

**Answer**: 

## Over and Under Sampling

We have seen that imbalanced data can cause all sorts of problems and give us misleading results, especially if we only focus on accuracy. How can we correct for these problems? One simple method is to **resample** the data. For example, you might **oversample** the minority class or **undersample** the majority class. Let's use the [**imblearn**](https://imbalanced-learn.readthedocs.io/en/stable/api.html) library to try this out. First, you might need to run the cell below to install the library. Prefixing a line with "!" in a Jupyter notebook runs it as a shell command rather than as Python.

In [None]:
#!pip install imblearn

Now let's import the RandomOverSampler and RandomUnderSampler classes. Then take a look at the first 15 values in y_train before we resample.

**Question**: Why would we resample only the training set, instead of the full dataset or the validation/test sets?

**Answer**: 

In [None]:
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

In [None]:
...

Next use either the RandomOverSampler or RandomUnderSampler to resample the training set.

In [None]:
random_over_sampler = RandomOverSampler(sampling_strategy=0.5)
random_under_sampler = RandomUnderSampler(sampling_strategy=0.5)

X_train_new, y_train_new = random_over_sampler...

Check the training labels again. Did anything change?

In [None]:
y_train_new[0:15]
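
To make the mechanics concrete, the sketch below imitates random oversampling by hand using only the standard library. It assumes that a float `sampling_strategy` is the desired ratio of minority to majority samples after resampling, which is how the imblearn documentation describes it for binary problems. This is an illustration on made-up labels, not imblearn's implementation, and every name in it is hypothetical.

```python
import random

rng = random.Random(10)  # fixed seed so the sketch is repeatable

# Made-up imbalanced labels: 10 majority (0) and 2 minority (1)
y_toy = [0] * 10 + [1] * 2
minority_idx = [i for i, label in enumerate(y_toy) if label == 1]
n_majority = len(y_toy) - len(minority_idx)

# Target: minority count = 0.5 * majority count after resampling
target_ratio = 0.5
n_minority_target = int(target_ratio * n_majority)
n_extra = n_minority_target - len(minority_idx)

# Draw extra minority rows with replacement, so they are exact duplicates
extra_idx = rng.choices(minority_idx, k=n_extra)
y_over_toy = y_toy + [y_toy[i] for i in extra_idx]

print(y_over_toy.count(0), y_over_toy.count(1))  # 10 5
```

Because the added rows are copies of existing minority rows, oversampling adds no new information about the minority class.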

**Question**: What do you notice about the resampled training targets? What might be some issues with over and undersampling?

**Answer**: 

Retrain the logistic regression model on the newly resampled data. How does AUC-ROC change?

In [None]:
...

**Answer**: 

Overall, while sklearn bundles many of the tools we need to train models, make predictions, and visualize results, there are a lot of substantive choices involved. As you can see, even a slightly imbalanced dataset can cause problems. If you optimize only on accuracy, you might miss relevant aspects of the problem. Be mindful of the various metrics available, and decide which ones best answer the scientific question you have in mind.

---
Authored by Aniket Kesari.