Today we'll create some rule-based classifiers. In this workbook, we'll practice working with Python and the Fake News Challenge (FNC) data, and build a simple FNC stance predictor.

Anytime you see a blank such as ``____`` in the code, you should replace it with your own code.

As always, ask your neighbors or an instructor if you have any questions!

### 0. Import the packages we'll need.

In [1]:
import numpy as np
import pandas as pd
import sys

### 1. Load the data.

If you don't have ``train_data.csv`` and ``test_data.csv`` saved, run the following code to build and save CSV files of the train and test data. Each file joins the stance rows to their article bodies on the shared ``Body ID`` column.

In [2]:
def merge_data(stances_filename, bodies_filename, merged_filename):
    stances = pd.read_csv(stances_filename, encoding="utf-8")
    bodies = pd.read_csv(bodies_filename, encoding="utf-8")
    data = pd.merge(bodies, stances, on="Body ID")
    data.to_csv(merged_filename, index=False, encoding="utf-8")

In [4]:
merge_data("train_stances.csv", "train_bodies.csv", "train_data.csv")

In [5]:
merge_data("competition_test_stances.csv", "competition_test_bodies.csv", "test_data.csv")
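
To see what the merge does, here is a minimal sketch on made-up rows. Only ``Body ID`` and ``Stance`` are column names confirmed by this workbook; ``Headline`` and ``articleBody`` are assumptions used for illustration, so check the real files with ``train_data.columns`` before relying on them.

```python
import pandas as pd

# Toy stand-ins for the bodies and stances files (made-up rows).
# "Headline" and "articleBody" are assumed column names.
bodies = pd.DataFrame({
    "Body ID": [0, 1],
    "articleBody": ["Cats sleep most of the day.", "Rain is expected tomorrow."],
})
stances = pd.DataFrame({
    "Headline": ["Cats sleep a lot", "Dogs bark loudly", "Rain expected"],
    "Body ID": [0, 0, 1],
    "Stance": ["agree", "unrelated", "discuss"],
})

# Each stance row is paired with the article body that has the same Body ID.
merged = pd.merge(bodies, stances, on="Body ID")
print(merged)
```

Because two headlines point at body 0, that article text appears twice in the merged table, once per headline.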

### 2. Separate the dataset based on stance.

Fill in and run the following cells to separate the train and test sets based on their stance. This will help us generate and test our predictions.

In [6]:
train_data = pd.read_csv("train_data.csv", encoding = "utf-8")
unrelated_train = train_data[train_data['Stance'] == ____]
discuss_train = train_data[train_data['Stance'] == ____]
agree_train = train_data[train_data['Stance'] == ____]
disagree_train = train_data[train_data['Stance'] == ____]

test_data = pd.read_csv("test_data.csv", encoding = "utf-8")

### 3. Create a classifier based on shared words.

Our first classifier will be based on the proportion of headline words that appear in the article. We'll find the average proportion for each stance in the training data. When our classifier gets a headline-article pair, it'll compute the proportion of headline words in the article and then predict the stance whose average proportion is closest.

Before we move on, try to write out the steps of this classifier, and the components we'll need to write to create the classifier.

#### 3a. Write a function that finds the proportion of headline words in the article.

In [7]:
def find_headline_in_article_proportion(example):
    headline_words = str.split(example[____])
    article_words = str.split(example[____])
    counter = 0
    for word in headline_words:
        if word in ____:
            counter += 1
    proportion = ____/len(____)
    return proportion

#### 3b. Find the average proportion for each stance.

In [8]:
def compute_proportions(unrelated, discuss, agree, disagree):
    proportions_unrelated = []
    for i in range(unrelated.shape[0]):
        this_example = ____.iloc[i]
        proportions_unrelated.append(____(this_example))
    proportions_related = []
    proportions_discuss = []
    for i in range(discuss.shape[0]):
        this_example = ____.iloc[i]
        proportions_discuss.append(____(this_example))
        proportions_related.append(____(this_example))
    proportions_agree = []
    for i in range(agree.shape[0]):
        this_example = ____.iloc[i]
        proportions_agree.append(____(this_example))
        proportions_related.append(____(this_example))
    proportions_disagree = []
    for i in range(disagree.shape[0]):
        this_example = ____.iloc[i]
        proportions_disagree.append(____(this_example))
        proportions_related.append(____(this_example))
    return {
        "unrelated": np.mean(proportions_unrelated),
        "discuss": np.mean(proportions_discuss),
        "agree": np.mean(proportions_agree),
        "disagree": np.mean(proportions_disagree),
        "related": np.mean(proportions_related),
    }

In [9]:
proportions = compute_proportions(____, ____, ____, ____)

#### 3c. Write a prediction function based on the closest average proportion.

In [76]:
def make_prediction(example, proportions):
    proportions_stances = list(proportions.keys())
    proportion = ____(example)
    predicted_stance = proportions_stances[np.argmin(np.abs(np.array(list(proportions.values())) - proportion))]
    return predicted_stance
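
The one-line lookup in ``make_prediction`` packs several steps together. The sketch below unpacks the same nearest-average idea on made-up numbers; the names ``toy_averages`` and ``observed`` are hypothetical and the values are not computed from the FNC data.

```python
import numpy as np

# Made-up averages, for illustration only.
toy_averages = {"low": 0.10, "medium": 0.45, "high": 0.80}

# Dictionaries keep insertion order (Python 3.7+), so keys and values line up.
labels = list(toy_averages.keys())
values = np.array(list(toy_averages.values()))

observed = 0.52
distances = np.abs(values - observed)    # distance from each average
nearest = labels[np.argmin(distances)]   # label with the smallest distance
print(nearest)  # medium
```

Note that every key in the dictionary is a candidate prediction, so it is worth checking which keys ``proportions`` contains when interpreting the results.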

### 4. Test the classifier.

#### 4a. Write a function that runs and evaluates predictions on the test set.

In [57]:
def test_predictions(test_data, proportions):
    stance_counts = {"unrelated":0, "discuss":0, "agree":0, "disagree":0}
    stance_correct_counts = {"unrelated":0, "discuss":0, "agree":0, "disagree":0}
    for i in range(test_data.shape[0]):
        example = test_data.iloc[i]
        predicted_stance = ____(example, proportions)
        actual_stance = example[____]
        ____[____] += 1
        if predicted_stance == actual_stance:
            ____[____] += 1
    return {"unrelated":____["unrelated"]/____["unrelated"], "discuss":____["discuss"]/____["discuss"], "agree":____["agree"]/____["agree"], "disagree":____["disagree"]/____["disagree"]}

#### 4b. Test the classifier and examine the results.

In [79]:
correct_percentages = ____(test_data, proportions)

In [80]:
correct_percentages

{'agree': 0.4703100367840252,
 'disagree': 0.21377331420373027,
 'discuss': 0.0035842293906810036,
 'unrelated': 0.9000490489944956}

Which categories does the predictor do well on, and which does it do less well on? Why do you think this is? (Hint: take a look at the average proportion of in-article headline words for each stance.)

Take a look at the number of examples in each category, and compute the overall classification accuracy. Is this higher or lower than you would have expected?

In [81]:
unrelated_test = test_data[test_data['____'] == '____']
discuss_test = test_data[test_data['____'] == '____']
agree_test = test_data[test_data['____'] == '____']
disagree_test = test_data[test_data['____'] == '____']

In [86]:
overall_accuracy = (correct_percentages['____'] * agree_test.shape[0] \
                    + correct_percentages['____'] * discuss_test.shape[0] \
                    + correct_percentages['____'] * disagree_test.shape[0] \
                    + correct_percentages['____'] * unrelated_test.shape[0]) \
                    / ____.shape[0]

In [87]:
overall_accuracy

0.691575178058474
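
Overall accuracy is a weighted average of the per-category accuracies, with each category weighted by its number of examples. A minimal sketch with made-up categories and counts (``toy_accuracy`` and ``toy_sizes`` are hypothetical, not FNC results) shows why this can differ a lot from the plain average:

```python
# Made-up per-category accuracies and category sizes, for illustration only.
toy_accuracy = {"a": 0.9, "b": 0.5}
toy_sizes = {"a": 80, "b": 20}

total = sum(toy_sizes.values())

# Weighted: each category counts in proportion to its size.
weighted = sum(toy_accuracy[k] * toy_sizes[k] for k in toy_accuracy) / total

# Unweighted: every category counts equally, regardless of size.
unweighted = sum(toy_accuracy.values()) / len(toy_accuracy)

print(weighted, unweighted)  # roughly 0.82 and 0.7
```

When one category dominates the data, the overall accuracy is pulled toward that category's accuracy.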

### Extra Challenge

"Extra challenge" sections are a more unguided exploration into the concepts we've discussed. You'll notice less scaffolding for the code -- try implementing these concepts from scratch, and feel free to ask your neighbors or an instructor if you have any questions!

#### Challenge 1. Two-way classification.

Try adapting our four-way classification code (which chooses between 'unrelated', 'agree', 'disagree', and 'discuss') to two-way classification (between 'unrelated' and 'related'). To do this, we'll group 'agree', 'disagree', and 'discuss' examples into a single 'related' category.
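
One possible starting point, shown on a toy table, is to collapse the four labels into two before doing anything else. This is only a sketch of the grouping step; the column name ``Binary stance`` is a hypothetical choice, and the rest of the classifier is left for you to adapt.

```python
import pandas as pd

# Toy labels, for illustration only.
toy = pd.DataFrame(
    {"Stance": ["unrelated", "agree", "discuss", "disagree", "unrelated"]}
)

# Series.where keeps values where the condition is True
# and replaces all other values with "related".
toy["Binary stance"] = toy["Stance"].where(toy["Stance"] == "unrelated", "related")
print(toy)
```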

#### Challenge 2. Experiment with more statistics.

In today's classifier, we compared the proportion of in-article headline words to the mean of each stance category. Experiment with some different statistics. Some ideas:
- Use the median instead of the mean for each stance.
- The 'Jaccard index' between two bodies of text, A and B, is (number of words shared by A and B)/(number of words in either A or B). Try using the Jaccard index instead of the proportion of in-article headline words. (hint: look up the functions set(), intersection(), and union())
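
As a hint for the second idea, the Jaccard definition above can be sketched with Python sets. This is one possible version, assuming simple whitespace tokenization as in the rest of the workbook; the function name ``jaccard_index`` and the choice to return 0.0 for two empty texts are assumptions.

```python
def jaccard_index(text_a, text_b):
    """Shared unique words divided by all unique words in either text."""
    words_a = set(text_a.split())
    words_b = set(text_b.split())
    union = words_a.union(words_b)
    if not union:
        # Both texts are empty; returning 0.0 here is an arbitrary choice.
        return 0.0
    return len(words_a.intersection(words_b)) / len(union)

# Shared words: {"the", "cat"}; all words: {"the", "cat", "sat", "ran"}.
score = jaccard_index("the cat sat", "the cat ran")
print(score)  # 0.5
```

Unlike the headline-word proportion, this score is based on unique words, so a word repeated in the headline counts only once.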