Today we'll create some rule-based classifiers. In this workbook, we'll practice working with Python and the Fake News Challenge (FNC) data, and create a simple stance predictor for FNC headline-article pairs.

***Anytime you see ``____ # TODO: Fill in here.`` in the code, you should replace the ``____`` with your own code.***

As always, ask your neighbors or an instructor if you have any questions!

***There are some longer functions here, where many of the lines have already been written. Try to read through the code and understand why we're doing each step!***

### 0. Import the packages we'll need.

In [None]:
import numpy as np
import pandas as pd
import sys

### 1. Load the data.

If you don't have ``train_data.csv`` and ``test_data.csv`` saved, run the following code to construct and save csvs of the train and test data.

In [None]:
def merge_data(stances_filename, bodies_filename, merged_filename):
    stances = pd.read_csv(stances_filename, encoding = "utf-8")
    bodies = pd.read_csv(bodies_filename, encoding = "utf-8")
    data = pd.merge(bodies, stances, on='Body ID')
    data.to_csv(merged_filename, index=False, encoding = "utf-8")
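
To see what `pd.merge` does inside `merge_data`, here is a minimal sketch on two tiny, made-up tables. The rows below are hypothetical and are not taken from the FNC files; only the column names match the ones used in this workbook.

```python
import pandas as pd

# Hypothetical toy tables that mimic the layout of the bodies and stances files.
bodies = pd.DataFrame({
    "Body ID": [0, 1],
    "articleBody": ["first article text", "second article text"],
})
stances = pd.DataFrame({
    "Headline": ["headline a", "headline b", "headline c"],
    "Body ID": [1, 0, 1],
    "Stance": ["agree", "unrelated", "discuss"],
})

# Each headline row is paired with the article that has the same Body ID.
merged = pd.merge(bodies, stances, on="Body ID")
print(merged)
```

Notice that the article with `Body ID` 1 shows up twice in `merged`, once for each headline that points to it.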

**Where are the train_stances.csv and train_bodies.csv files?**

In [None]:
merge_data("../train_stances.csv", "../train_bodies.csv", "../train_data.csv")

In [None]:
merge_data("../competition_test_stances.csv", "../competition_test_bodies.csv", "../test_data.csv")

**Why do we add a "../" in front of the names of our .csv files?**

**Try to find all six .csv files in Finder.**

### 2. Separate the dataset based on stance.

Fill in and run the following cells to separate the train and test sets based on their stance.

In [None]:
train_data = pd.read_csv("../train_data.csv", encoding = "utf-8")
unrelated_train = train_data[train_data['Stance'] == ____] # TODO: Fill in here.
discuss_train = train_data[train_data['Stance'] == ____] # TODO: Fill in here.
agree_train = train_data[train_data['Stance'] == ____] # TODO: Fill in here.
disagree_train = train_data[train_data['Stance'] == ____] # TODO: Fill in here.

test_data = pd.read_csv("../test_data.csv", encoding = "utf-8")

### 3. Create a classifier based on shared words.

Our first classifier will be based on the percentage of headline words that appear in the article. We'll find the average percentage for each stance. When our classifier gets a headline-article pair, it'll find the percentage of headline words in the article and then predict the stance with the closest average percentage.

Before we move on, try to write out the steps of this classifier, and the components we'll need to write to create the classifier.

#### 3a. Write a function that finds the proportion of headline words in the article.

In [None]:
def find_headline_in_article_proportion(example):
    headline_words = str.split(example['Headline'])
    article_words = str.split(example[____])  # TODO: Fill in here.
    counter = 0
    for word in headline_words:
        if word in article_words:
            counter += ____ # TODO: Fill in here.
    proportion = ____/len(____)  # TODO: Fill in here.
    return proportion
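
It may help to know how `str.split` and `in` behave before testing the function. The sketch below uses made-up sentences (not FNC data): with no separator given, `str.split` splits on whitespace only, and `in` on a list looks for an exact match.

```python
headline_words = str.split("Test headline, with punctuation")
article_words = str.split("a test headline with punctuation")

print(headline_words)  # ['Test', 'headline,', 'with', 'punctuation']

print("Test" in article_words)       # False: 'Test' and 'test' differ in case
print("headline," in article_words)  # False: the comma stays attached to the word
print("with" in article_words)       # True: an exact match
```

Keep this in mind when you look at results on real headlines, which often contain capital letters and punctuation.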

**Explain in plain English what this function does:**

Let's test out this function. Before running the following code, try to predict what the output will be.

In [None]:
headline_text = "this is a test headline"
article_text = "this is a test article"
test_df = pd.DataFrame({'Headline':[headline_text], 'articleBody':[article_text]})
example = test_df.iloc[0]
find_headline_in_article_proportion(example)

Now test it out on a headline and article pair of your choosing:

In [None]:
headline_text = ____  # TODO: Fill in here.
article_text = ____  # TODO: Fill in here.
test_df = pd.DataFrame({'Headline':[headline_text], 'articleBody':[article_text]})
example = test_df.iloc[0]
find_headline_in_article_proportion(example)

Now test out the function on an example from the FNC data:

In [None]:
example = train_data.iloc[0]

In [None]:
print(example)

In [None]:
find_headline_in_article_proportion(example)

**If we wanted to use something other than the `print` function, how could we have seen what `example` contains?**

#### 3b. Find the average proportion for each stance.

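This function finds the average proportion of headline words that appear in the article body for each stance category. It also computes a combined 'related' average over the 'discuss', 'agree', and 'disagree' examples.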

In [None]:
def compute_proportions(unrelated, discuss, agree, disagree):
    proportions_unrelated = []
    for i in range(unrelated.shape[0]):
        this_example = unrelated.iloc[i]
        proportions_unrelated.append(find_headline_in_article_proportion(this_example))
    proportions_related = []
    proportions_discuss= []
    for i in range(discuss.shape[0]):
        this_example = discuss.iloc[i]
        proportions_discuss.append(find_headline_in_article_proportion(this_example))
        proportions_related.append(find_headline_in_article_proportion(this_example))
    proportions_agree= []
    for i in range(agree.shape[0]):
        this_example = agree.iloc[i]
        proportions_agree.append(find_headline_in_article_proportion(this_example))
        proportions_related.append(find_headline_in_article_proportion(this_example))
    proportions_disagree= []
    for i in range(disagree.shape[0]):
        this_example = disagree.iloc[i]
        proportions_disagree.append(find_headline_in_article_proportion(this_example))
        proportions_related.append(find_headline_in_article_proportion(this_example))
    return {
        "unrelated": np.mean(proportions_unrelated),
        "discuss": np.mean(proportions_discuss),
        "agree": np.mean(proportions_agree),
        "disagree": np.mean(proportions_disagree),
        "related": np.mean(proportions_related),
    }

In [None]:
proportions = compute_proportions(unrelated_train, discuss_train, agree_train, disagree_train)

Before looking at `proportions`, try to guess what the proportions are for each category.

Side note: We saved the proportions in a **dictionary**. If you're interested in learning more about dictionaries, ask an instructor or see this description: https://www.tutorialspoint.com/python/python_dictionary.htm
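
As a quick illustration of dictionaries, here is a small sketch. The keys and values are hypothetical and were chosen only to show the syntax; they have nothing to do with the FNC data.

```python
# Hypothetical values, chosen only to illustrate dictionary syntax.
average_heights = {"cat": 0.25, "dog": 0.6}

print(average_heights["dog"])          # look up a value by its key
print(list(average_heights.keys()))    # all of the keys
print(list(average_heights.values()))  # all of the values

average_heights["horse"] = 1.6         # add a new key-value pair
for animal, height in average_heights.items():
    print(animal, height)
```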

**Now look at the proportions for each category.**

In [None]:
proportions

#### 3c. Write a prediction function based on the closest average proportion.

Now, we'll write a function that predicts an example's class depending on which group's overlap proportion it's closest to.

In [None]:
def make_prediction(example, proportions):
    proportions_stances = list(proportions.keys())
    proportion = find_headline_in_article_proportion(____) # TODO: Fill in here.
    predicted_stance = proportions_stances[np.argmin(np.abs(np.array(list(proportions.values())) - proportion))]
    return predicted_stance
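
The line that sets `predicted_stance` packs several steps into one expression. The sketch below unpacks the same idea on hypothetical averages (the labels and numbers are made up, not computed from the FNC data).

```python
import numpy as np

# Hypothetical averages, not computed from the FNC data.
toy_proportions = {"low": 0.1, "medium": 0.5, "high": 0.9}
value = 0.58

labels = list(toy_proportions.keys())
averages = np.array(list(toy_proportions.values()))

distances = np.abs(averages - value)     # distance from value to each average
closest = labels[np.argmin(distances)]   # label whose average is nearest
print(closest)  # medium
```

`np.argmin` returns the position of the smallest distance, and that position is then used to pick the matching label.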

### 4. Test the classifier.

#### 4a. Write a function that runs and evaluates predictions on the test set.

This function computes the proportion of correct predictions for each category.

In [None]:
def test_predictions(test_data, proportions):
    stance_counts = {"unrelated":0, "discuss":0, "agree":0, "disagree":0}
    stance_correct_counts = {"unrelated":0, "discuss":0, "agree":0, "disagree":0}
    for i in range(test_data.shape[0]):
        example = test_data.iloc[i]
        predicted_stance = make_prediction(____, ____)  # TODO: Fill in here.
        actual_stance = example[____]  # TODO: Fill in here.
        stance_counts[actual_stance] += ____  # TODO: Fill in here.
        if predicted_stance == actual_stance:
            stance_correct_counts[actual_stance] += ____  # TODO: Fill in here.
    return {
        "unrelated": stance_correct_counts["unrelated"] / stance_counts["unrelated"],
        "discuss": stance_correct_counts["discuss"] / stance_counts["discuss"],
        "agree": stance_correct_counts["agree"] / stance_counts["agree"],
        "disagree": stance_correct_counts["disagree"] / stance_counts["disagree"],
    }

#### 4b. Test the classifier and examine the results.

In [None]:
correct_percentages = test_predictions(____, ____)  # TODO: Fill in here.

In [None]:
correct_percentages

**What categories does the predictor do well on, and what does the predictor do less well on? Why do you think this is? (hint: take a look at the average proportion of in-article headline words).**

It does very well on 'unrelated' and worst on 'discuss'. The mean proportion for 'unrelated' is very different from the other categories, while the 'discuss' proportion is sandwiched between the proportions for 'agree' and 'disagree'.

Take a look at the number of examples in each category, and compute the overall classification accuracy. Is this higher or lower than you would have expected?

In [None]:
unrelated_test = test_data[test_data['Stance'] == ____] # TODO: Fill in here.
discuss_test = test_data[test_data['Stance'] == ____] # TODO: Fill in here.
agree_test = test_data[test_data['Stance'] == ____] # TODO: Fill in here.
disagree_test = test_data[test_data['Stance'] == ____] # TODO: Fill in here.

In [None]:
overall_accuracy = (
    correct_percentages[____] * ____.shape[0]    # TODO: Fill in here.
    + correct_percentages[____] * ____.shape[0]  # TODO: Fill in here.
    + correct_percentages[____] * ____.shape[0]  # TODO: Fill in here.
    + correct_percentages[____] * ____.shape[0]  # TODO: Fill in here.
) / ____.shape[0]  # TODO: Fill in here.

In [None]:
overall_accuracy
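
Overall accuracy weights each category by how many examples it has, so it can differ a lot from a plain average of the per-category accuracies. Here is a small sketch with hypothetical categories and numbers (not FNC results) that shows the difference.

```python
# Hypothetical per-category results, not taken from the FNC data.
accuracy = {"cat": 0.75, "dog": 0.25}
counts = {"cat": 80, "dog": 20}

# Number of correct predictions: 0.75 * 80 + 0.25 * 20 = 65
correct = sum(accuracy[c] * counts[c] for c in accuracy)
total = sum(counts.values())

overall = correct / total                             # weighted by category size
unweighted = sum(accuracy.values()) / len(accuracy)   # ignores category size
print(overall, unweighted)  # 0.65 0.5
```

The large category pulls the overall accuracy toward its own accuracy.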

**What do you think of this accuracy? How much better do you think we can do?**

### 5. Create a simple classifier based on the most common category.

#### 5a. Find the most common category.

First, we need to find what the most common category is, using the training data. For something like this, there's often a function someone else has already written that we can borrow. The easiest way to find one is by googling -- for example, I googled "pandas series count values for each type" and this was the second result: https://stackoverflow.com/questions/22391433/count-the-frequency-that-a-value-occurs-in-a-dataframe-column

Check out this link, and try to use one line to find the number of occurrences of each category.

In [None]:
____  # TODO: Fill in here.

**What is the most common category?**

*Extra Challenge: Try determining the number of examples for each category without using these functions.*

#### 5b. Compute the test accuracy.

First, we need to find how many examples of each category there are in the test set.

In [None]:
____ # TODO: Fill in here.

**Now, what would the accuracy be if we always guessed the most common category?**

In [None]:
____/____.shape[0] # TODO: Fill in here.

### Extra Challenge

"Extra challenge" sections are a more unguided exploration into the concepts we've discussed. You'll notice less scaffolding for the code -- try implementing these concepts from scratch, and feel free to ask your neighbors or an instructor if you have any questions!

#### Challenge 1. Two-way classification.

Try adapting our code for four-way classification (between 'unrelated', 'agree', 'disagree', and 'discuss') for two-way classification (between 'unrelated' and 'related'). To do this, we'll group 'agree', 'disagree', and 'discuss' examples into a single 'related' category.

#### Challenge 2. Experiment with more statistics.

In today's classifier, we compared the proportion of in-article headline words to the mean of each stance category. Experiment with some different statistics. Some ideas:
- Use the median instead of the mean for each stance.
- The 'Jaccard index' between two bodies of text, A and B, is (number of words shared by A and B)/(number of words in either A or B). Try using the Jaccard index instead of the proportion of in-article headline words. (hint: look up the functions set(), intersection(), and union())
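
To make the Jaccard formula concrete, here is a minimal sketch on two made-up sentences (not FNC data). It is only one way to use the set functions from the hint; adapting it to headline-article pairs is left to you.

```python
text_a = "the cat sat on the mat"
text_b = "the dog sat on the log"

# Sets keep each distinct word once, so the repeated 'the' is counted a single time.
words_a = set(str.split(text_a))
words_b = set(str.split(text_b))

shared = words_a.intersection(words_b)  # words that appear in both texts
either = words_a.union(words_b)         # words that appear in at least one text

jaccard = len(shared) / len(either)     # 3 shared words / 7 distinct words
print(jaccard)
```

Here the shared words are 'the', 'sat', and 'on', out of seven distinct words overall.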