## Evaluating Classifiers

Today we are going to discuss metrics for comparing classifiers. We will bring back the rule-based classifier that was introduced yesterday, which used the proportion of word overlap between a headline and its article to make its decision. Here we restrict it so that it can only predict 'unrelated' or 'related'. We will compare its performance against a baseline classifier that labels every article-headline pair as 'unrelated'. We will see that accuracy is only one way to measure performance, and that it may not be the best metric when the training and test examples are unbalanced across classes.

Anytime you see ``______ # TODO: FILL IN HERE.`` in the code, you should replace the ``______`` with your own code.

As always, ask your neighbors or an instructor if you have any questions!

### 1. Load the data 

We have copied the functions that we created yesterday into a separate file (named 'initial_classifier_utils.py'). We will now import those functions here, along with the other Python packages we need:

In [None]:
import initial_classifier_utils as utils
import numpy as np
import pandas as pd
import sys
import os.path
# Adjust settings so that we can fully see the dataset below
pd.set_option('display.max_colwidth', None)

In [None]:
# Create the training data, if it doesn't already exist:
if not os.path.isfile('train_data.csv'):
    utils.merge_data("train_stances.csv", "train_bodies.csv", "train_data.csv")

In [None]:
# Create the test data, if it doesn't already exist:
if not os.path.isfile('test_data.csv'):
    utils.merge_data("competition_test_stances.csv", "competition_test_bodies.csv", "test_data.csv")

In [None]:
# Read in the training data and separate out the training data according to the value of the Stance variable:
train_data = pd.read_csv("train_data.csv", encoding = "utf-8")
unrelated_train = train_data[train_data['Stance'] == 'unrelated']

In [None]:
# TODO: FILL IN HERE
discuss_train = train_data[train_data['Stance'] == ________]  # TODO: FILL IN HERE
agree_train = train_data[train_data[______] == 'agree'] # TODO: FILL IN HERE
disagree_train = _______[______['Stance'] == 'disagree'] # TODO: FILL IN HERE

In [None]:
# Read in test data:
# TODO: FILL IN HERE 
test_data = pd.______("test_data.csv", encoding = "utf-8")

### 2. Load the classifier and make predictions on the test data

In [None]:
# We are going to compute the headline-article word overlap for each of the examples in the unrelated,
# discuss, agree and disagree categories in the training set.  
# Warning: this may take a minute - it is iterating through almost 50,000 examples!
proportions_train = utils.compute_proportions(unrelated_train, discuss_train, agree_train, disagree_train)

In [None]:
# We are going to modify the make_prediction function that we created yesterday, so that it only predicts 'unrelated'
# and 'related'. It will use the same method as our classifier yesterday used: the proportion overlap for an example
# will be compared with the overlap proportions for 'related' and 'unrelated' before a decision is made
def make_prediction(example, proportions_train):
    # Keep only the 'unrelated' and 'related' proportions:
    keys = ['unrelated', 'related']
    new_proportions = { key: proportions_train[key] for key in keys }
    proportions_stances = list(new_proportions.keys())
    proportion = utils.find_headline_in_article_proportion(example)
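    # Pick the stance whose training proportion is closest to this example's proportion:
    distances = np.abs(np.array(list(new_proportions.values())) - proportion)
    predicted_stance = proportions_stances[np.argmin(distances)]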
    return predicted_stance
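
The decision rule used by `make_prediction` can be tried in isolation. The sketch below is only an illustration: the two proportions are made-up values, not the ones computed from the training set, and `nearest_class` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical overlap proportions per class (made-up values for illustration only)
toy_proportions = {'unrelated': 0.05, 'related': 0.40}

def nearest_class(proportion, class_proportions):
    """Return the class whose stored proportion is closest to `proportion`."""
    labels = list(class_proportions.keys())
    distances = np.abs(np.array(list(class_proportions.values())) - proportion)
    return labels[np.argmin(distances)]

print(nearest_class(0.10, toy_proportions))  # 0.10 is closer to 0.05 than to 0.40
print(nearest_class(0.30, toy_proportions))  # 0.30 is closer to 0.40 than to 0.05
```

With these toy values, an overlap of 0.10 is assigned to 'unrelated' and an overlap of 0.30 to 'related'.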

In [None]:
# We are going to use a lot of the functions that were created yesterday, but we are also going to define a new one:
# The function below takes the test data set and iterates through each article-headline pair. For each article-headline pair
# it makes a prediction by calculating the headline-article word overlap and comparing this to the mean overlap values 
# for each of the categories in the training set. The function returns a list of predictions for every example in the
# test data set
def make_predictions(test_data, proportions_train):
    predictions_list = []
    for i in range(test_data.shape[0]):
        example = test_data.iloc[i]
        predicted_stance = make_prediction(example, proportions_train)
        # Append the predicted stance to the predictions_list - this will allow us to compare predictions and true 
        # values later on
        predictions_list.append(predicted_stance)
    return predictions_list

In [None]:
# Call the make_predictions function (it may take a minute, as the test dataset contains a lot of examples!) 
predictions = make_predictions(test_data, proportions_train)

In [None]:
# TODO: FILL IN HERE
# Check that the predictions list is the right size (this is one way to check that the make_predictions function 
# is performing as expected): 
# 1. From test_data, what do you expect the length of the predictions list to be? 
# (You may want to calculate an attribute of test_data you have seen on a previous day)
# 2. What is the length of the predictions list and does it match your expectation?

### 3. Compare predictions and ground truth

In [None]:
# We are going to create a new column (called 'new_stance') in the test_data dataframe so that any Stance value that is 
# 'agree', 'discuss', or 'disagree' is represented as 'related', while a value of 'unrelated' remains 'unrelated'
test_data['new_stance'] = test_data['Stance']
test_data.loc[test_data['Stance'].isin(['agree', 'disagree', 'discuss']), 'new_stance'] = 'related'

In [None]:
# We are going to append the predictions list we just obtained to the test_data dataframe as an additional column,
# along with one more column that predicts 'unrelated' for every example:
test_data['prediction_1'] = predictions
test_data['prediction_2'] = 'unrelated'

In [None]:
# TODO: FILL IN HERE
# Let's look at the modified test_data dataframe by viewing the first few examples: (use a function you have learned
# about on a previous day)
test_data._____

In [None]:
# Let's look at a few more examples (this time at the end of the dataset):
test_data[-3:]

### 4. Compare classifier performance

We are now going to introduce some tools that we can use for evaluating classifier performance. These are:
- Accuracy 
- Confusion Matrices
- Precision and Recall
- F1 score

#### 4a. Accuracy

Accuracy is defined as the number of examples that were correctly predicted divided by the total number of examples in the test dataset:
$\text{accuracy} = \dfrac{\text{number correct}}{\text{number of examples}}$
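
As a quick check of the formula, here is a sketch on five made-up labels (they are not taken from the dataset). It also hints at why a classifier that always predicts the most common class can still score well on accuracy.

```python
# Toy labels (made up for illustration): 4 of the 5 examples are 'unrelated'
truth = ['unrelated', 'unrelated', 'related', 'unrelated', 'unrelated']
always_unrelated = ['unrelated'] * len(truth)

# Count the positions where the prediction matches the true label
number_correct = sum(t == p for t, p in zip(truth, always_unrelated))
toy_accuracy = number_correct / len(truth)
print(toy_accuracy)  # 4 correct out of 5 -> 0.8
```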

Let's calculate accuracy values for the classifier we saw yesterday, and compare this to the classifier which predicts 'unrelated' for all examples:

In [None]:
def calculate_accuracy(test_data, truth_col_name, prediction_col_name):
    number_correct = (test_data[truth_col_name] == test_data[prediction_col_name]).sum()
    number_examples = test_data.shape[0]
    accuracy = number_correct/number_examples
    return accuracy

In [None]:
# Calculate accuracy for the classifier we created yesterday:
calculate_accuracy(test_data, 'new_stance', 'prediction_1')

In [None]:
# TODO: FILL IN HERE
# Calculate accuracy for the classifier that classifies all examples as 'unrelated':
calculate_accuracy(test_data, 'new_stance', _______)

#### 4b. Confusion Matrices

We have just seen that the classifier that predicts 'unrelated' for everything manages to obtain a high accuracy value. But there is something unsatisfying about this second classifier: if we were using it to flag articles that might be Fake News (because the headline and article are unrelated), every article would be marked as Fake News! This doesn't seem helpful!

Perhaps we should be using other tools, in addition to accuracy, to evaluate our classifier...

We are now going to calculate something called a 'confusion matrix' for each of our classifiers:

In [None]:
def calculate_confusion_matrix(test_data, truth_col_name, prediction_col_name):
    cross_tab = pd.crosstab(test_data[truth_col_name], test_data[prediction_col_name])
    column_names = cross_tab.columns.values
    row_names = cross_tab.index
    # Check each row name has an equivalent column; if not, add column and fill with 0s
    for row in row_names:
        if row not in column_names:
            cross_tab[row] = 0
    # reorder columns so that order matches that of rows and return this
    return cross_tab[row_names] 
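
To see why the zero-filled column matters, consider a sketch on four made-up labels where the classifier never predicts 'related'. `pd.crosstab` then returns a single column, and the table has to be padded before it is square. The example uses `reindex` as a compact alternative to the loop above; it is an illustration under that assumption, not the notebook's own helper.

```python
import pandas as pd

# Toy data (made up): two examples of each class, every prediction is 'unrelated'
truth = pd.Series(['related', 'unrelated', 'unrelated', 'related'], name='truth')
prediction = pd.Series(['unrelated'] * 4, name='prediction')

raw = pd.crosstab(truth, prediction)
print(list(raw.columns))  # only the 'unrelated' column is present

# Add the missing 'related' column (filled with 0) and order columns like the rows
square = raw.reindex(columns=list(raw.index), fill_value=0)
print(square)
```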

In [None]:
cf_mat_1 = calculate_confusion_matrix(test_data, 'new_stance', 'prediction_1')
cf_mat_1

What does the confusion matrix tell us? 

If we look at the first entry of the confusion matrix, with value 4491, this tells us that 4491 examples in the test dataset had label 'related' (row value) and were correctly classified by our classifier as 'related' (column value).

If we look at the entry in the second row and first column (value 1819), this tells us that 1819 examples in the test dataset had label 'unrelated', but were classified as 'related' by our classifier. Similarly, 2573 examples had true label 'related' but were classified as 'unrelated' by our classifier, and 16530 examples had true label 'unrelated' and were correctly predicted.
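
The four counts quoted above are enough to recover the accuracy of the first classifier: the correct predictions sit on the diagonal of the matrix. The sketch below rebuilds the matrix by hand from those counts (the variable names are illustrative) rather than from `test_data`.

```python
import pandas as pd

# Counts quoted in the text: rows are true labels, columns are predicted labels
counts = pd.DataFrame(
    [[4491, 2573],
     [1819, 16530]],
    index=['related', 'unrelated'],
    columns=['related', 'unrelated'],
)

# Correct predictions are the diagonal entries
correct = int(counts.loc['related', 'related'] + counts.loc['unrelated', 'unrelated'])
total = int(counts.values.sum())
print(correct, total, round(correct / total, 3))  # 21021 25413 0.827
```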

In [None]:
# TODO: FILL IN HERE
# Let's also look at the confusion matrix for the second classifier:
cf_mat_2 = calculate_confusion_matrix(test_data, 'new_stance', ___________)
cf_mat_2

Sometimes, when trying to interpret the confusion matrix, it helps to divide the entries in the confusion matrix by either the number of test examples altogether, or by the number of examples in each class (i.e. divide each entry in the matrix by the sum of the elements in its row). Let's do the latter:

In [None]:
def calculate_class_accuracies(confusion_matrix):
    # Calculate row sum for confusion matrix (total number of examples in a particular class):
    row_sums = confusion_matrix.sum(1)
    # Divide each element by its row sum to calculate class accuracies:
    class_acc_mat = confusion_matrix.divide(row_sums, axis = 0)
    # Add column which sums row proportions
    class_acc_mat['row_sum'] = class_acc_mat.sum(1)
    return class_acc_mat

In [None]:
class_acc_mat_1 = calculate_class_accuracies(cf_mat_1)
class_acc_mat_1

What does the scaled confusion matrix above tell us? If we look at the first entry (0.635759), this tells us that ~64% of the examples in our test dataset that were labelled 'related' were correctly classified by our classifier, while ~36% were incorrectly classified (entry in first row, second column). In comparison, ~90% of the 'unrelated' examples in the test dataset were correctly classified, while ~10% were incorrectly classified. This means that our classifier is better at identifying 'unrelated' examples than 'related' examples, and that we should think about other features we can use to distinguish 'related' examples from 'unrelated' examples.
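
The other scaling mentioned earlier, dividing every entry by the total number of test examples, is not computed in this notebook. A minimal sketch, rebuilding the first classifier's matrix by hand from the counts quoted in the text (variable names are illustrative):

```python
import pandas as pd

# Counts quoted in the text: rows are true labels, columns are predicted labels
counts = pd.DataFrame(
    [[4491, 2573],
     [1819, 16530]],
    index=['related', 'unrelated'],
    columns=['related', 'unrelated'],
)

# Each entry becomes the fraction of ALL test examples that fall in that cell
by_total = counts / counts.values.sum()
print(by_total.round(3))
```

In this version all four entries sum to 1, and the two diagonal entries add up to the overall accuracy.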

What happens when we calculate the class accuracies for the second classifier?

In [None]:
# TODO: FILL IN HERE
# What happens when we calculate the class accuracies for the second classifier?
class_acc_mat_2 = calculate_class_accuracies(_______)
class_acc_mat_2

When we calculate the scaled confusion matrix here, we see that, as expected, 100% of 'unrelated' examples are correctly classified, while 100% of 'related' examples are incorrectly classified.

#### 4c. Precision and Recall

From the confusion matrix, we can calculate some additional quantities. In particular, we can calculate **precision** and **recall** scores for our classifiers. Both are defined with respect to one class of interest; here we use 'unrelated'. In our case, precision is defined as:

$\text{precision} = \dfrac{\text{Number of unrelated examples correctly classified}}{\text{Number of examples classified as 'unrelated'}}$

In [None]:
def calculate_precision(confusion_matrix):
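    # Entry where the true label (row) and the predicted label (column) are both 'unrelated':
    unrelated_correct = confusion_matrix.loc['unrelated', 'unrelated']
    # Column sum: every example that was classified as 'unrelated'
    classified_as_unrelated = confusion_matrix.sum(0)['unrelated']
    precision = unrelated_correct / classified_as_unrelated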
    return precision

In [None]:
# Let's now calculate the precision value for the first classifier with its confusion matrix:
calculate_precision(cf_mat_1)

In [None]:
# TODO: FILL IN HERE
# Let's do the same for our second classifier:
calculate__________(cf_mat_2)

Recall is defined as:
$\text{recall} = \dfrac{\text{Number of unrelated examples correctly classified}}{\text{Number of unrelated examples}}$
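
Both definitions can be checked by hand against the counts quoted for the first classifier in section 4b. The sketch below uses those counts directly; the variable names are illustrative and not part of the notebook's utilities.

```python
# Counts for the first classifier, as quoted in section 4b
unrelated_as_unrelated = 16530  # true 'unrelated', predicted 'unrelated'
related_as_unrelated = 2573     # true 'related', predicted 'unrelated'
unrelated_as_related = 1819     # true 'unrelated', predicted 'related'

# Precision: of everything predicted 'unrelated', how much really was 'unrelated'?
precision = unrelated_as_unrelated / (unrelated_as_unrelated + related_as_unrelated)
# Recall: of everything that really was 'unrelated', how much did we find?
recall = unrelated_as_unrelated / (unrelated_as_unrelated + unrelated_as_related)

print(round(precision, 3), round(recall, 3))  # 0.865 0.901
```

The two quantities share a numerator and differ only in the denominator: a column sum for precision, a row sum for recall.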

In [None]:
def calculate_recall(confusion_matrix):
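    # Entry where the true label (row) and the predicted label (column) are both 'unrelated':
    unrelated_correct = confusion_matrix.loc['unrelated', 'unrelated']
    # Row sum: every example whose true label is 'unrelated'
    number_unrelated = confusion_matrix.sum(1)['unrelated']
    recall = unrelated_correct / number_unrelated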
    return recall

In [None]:
# Let's now calculate the recall value for the first classifier with its confusion matrix:
calculate_recall(cf_mat_1)

In [None]:
# TODO: FILL IN HERE
# Let's now calculate the recall value for the second classifier with its confusion matrix:
__________(cf_mat_2)

In this notebook, we have introduced **accuracy**, **confusion matrices**, **precision** and **recall** as tools for evaluating classifier performance. We have compared two possible classifiers for the Fake News Challenge, and have seen that, while one classifier may get a higher score according to one metric, it may get a lower score when using a different metric (our classifier from yesterday returns higher accuracy and precision scores compared to the classifier that labels everything as 'unrelated', but has a lower recall score). The metric that is used to evaluate classifier performance is often specific to the nature of the problem being studied, and often multiple metrics are used.
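
The list at the start of section 4 also mentions the F1 score, which is not worked through above. As a rough sketch, F1 is usually defined as the harmonic mean of precision and recall, so it is high only when both are high. The function name `calculate_f1` is hypothetical and simply follows the naming pattern of the other helpers.

```python
def calculate_f1(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example with made-up values: perfect recall but only 50% precision
print(calculate_f1(0.5, 1.0))
```

It could be called on the outputs of `calculate_precision` and `calculate_recall` for either classifier.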

### 5. Extra Challenge

1. Take a look at some of the examples that were misclassified by our proportion overlap classifier. Why do you think our proportion overlap model failed? Can you think of a rule that would properly classify these examples?  Why not test it out?
2. Yesterday we created a four-way classifier that was able to predict 'agree', 'disagree', 'discuss' and 'unrelated'. Calculate the confusion matrix for this classifier. In this case, how many rows and columns would the matrix have? Compare the confusion matrix for this classifier to the classifier that predicts 'unrelated' for everything. What does the confusion matrix tell us about each of these models? What are the strengths and weaknesses of both of these models?