Today we'll create some rule-based classifiers. In this workbook, we'll practice working with Python and the FNC data, and create a simple FNC predictor.

Anytime you see ``______ # TODO: FILL IN HERE.`` in the code, you should replace the ``______`` with your own code.

As always, ask your neighbors or an instructor if you have any questions!

### 0. Import the packages we'll need.

In [1]:
import numpy as np
import pandas as pd
import sys

### 1. Load the data.

If you don't have ``train_data.csv`` and ``test_data.csv`` saved, run the following code to build and save CSV files of the train and test data.

In [2]:
def merge_data(stances_filename, bodies_filename, merged_filename):
    stances = pd.read_csv(stances_filename, encoding = "utf-8")
    bodies = pd.read_csv(bodies_filename, encoding = "utf-8")
    data = pd.merge(bodies, stances, on='Body ID')
    data.to_csv(merged_filename, index=False, encoding = "utf-8")

**Where are the train_stances.csv and train_bodies.csv files?**

They are inside the AI4ALL_NLP_student folder.

In [3]:
merge_data("../train_stances.csv", "../train_bodies.csv", "../train_data.csv")

In [4]:
merge_data("../competition_test_stances.csv", "../competition_test_bodies.csv", "../test_data.csv")

**Why do we add a "../" in front of the names of our .csv files?**

train_stances.csv, train_bodies.csv, competition_test_stances.csv, and competition_test_bodies.csv are saved in the folder one level above the folder we're currently in. We also want to save the merged train_data.csv and test_data.csv in the folder one level above the folder we're currently in. This will make it easier to access these files in future days, since we work in a separate folder (inside the AI4ALL_NLP_student folder) each day.
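To see how a ``../`` path resolves, here is a minimal sketch using only the standard library. The folder name ``day1`` is a hypothetical example of a daily working folder, not a name taken from the course materials.

```python
import os

# Hypothetical layout: we work inside a per-day folder within AI4ALL_NLP_student.
day_folder = os.path.join("AI4ALL_NLP_student", "day1")
relative_path = "../train_data.csv"

# Joining and normalizing shows which file the relative path points to:
# the ".." step cancels out the "day1" folder.
resolved = os.path.normpath(os.path.join(day_folder, relative_path))
print(resolved)
```

The resolved path sits directly inside ``AI4ALL_NLP_student``, which is why every daily folder can reach the same file with the same ``../`` prefix.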

**Try to find all six .csv files in Finder.**

### 2. Separate the dataset based on stance.

Run the following cell to load the train and test sets and to separate the training set based on stance.

In [5]:
train_data = pd.read_csv("../train_data.csv", encoding = "utf-8")
unrelated_train = train_data[train_data['Stance'] == 'unrelated']
discuss_train = train_data[train_data['Stance'] == 'discuss']
agree_train = train_data[train_data['Stance'] == 'agree']
disagree_train = train_data[train_data['Stance'] == 'disagree']

test_data = pd.read_csv("../test_data.csv", encoding = "utf-8")

### 3. Create a classifier based on shared words.

Our first classifier will be based on the percentage of headline words that appear in the article. We'll find the average percentage for each stance. When our classifier gets a headline-article pair, it'll find the percentage of headline words in the article and then predict the stance with the closest average percentage.

Before we move on, try to write out the steps of this classifier, and the components we'll need to write to create the classifier.

#### 3a. Write a function that finds the proportion of headline words in the article.

In [9]:
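def find_headline_in_article_proportion(example):
    # Split the headline and the article into words on whitespace.
    headline_words = example['Headline'].split()
    # A set makes each "word in article_words" check fast.
    article_words = set(example['articleBody'].split())
    # Count the headline words that also appear in the article.
    counter = 0
    for word in headline_words:
        if word in article_words:
            counter += 1
    proportion = counter/len(headline_words)
    return proportion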

**Explain in plain English what this function does:**

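This function calculates the proportion of headline words that also appear in the article body. Words are split on whitespace only, so matching is case-sensitive and punctuation stays attached to words.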

Let's test out this function. Before running the following code, try to predict what the output will be.

In [18]:
headline_text = "this is a test headline"
article_text = "this is a test article"
test_df = pd.DataFrame({'Headline':[headline_text], 'articleBody':[article_text]})
example = test_df.iloc[0]
find_headline_in_article_proportion(example)

0.8

Now test it out on a headline and article pair of your choosing:

In [19]:
headline_text = "test is this a"
article_text = "this is a test"
test_df = pd.DataFrame({'Headline':[headline_text], 'articleBody':[article_text]})
example = test_df.iloc[0]
find_headline_in_article_proportion(example)

1.0

Now test out the function on an example from the FNC data:

In [20]:
example = train_data.iloc[0]

In [21]:
print(example)

Body ID                                                        0
articleBody    A small meteorite crashed into a wooded area i...
Headline       Soldier shot, Parliament locked down after gun...
Stance                                                 unrelated
Name: 0, dtype: object


**If we wanted to use something other than the `print` function, how could we have seen what `example` contains?**

In [41]:
example

Body ID                                                        0
articleBody    A small meteorite crashed into a wooded area i...
Headline       Soldier shot, Parliament locked down after gun...
Stance                                                 unrelated
Name: 0, dtype: object

In [22]:
find_headline_in_article_proportion(example)

0.09090909090909091
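Because the function splits on whitespace only, a headline word like `shot,` will not match `shot` in the article. The sketch below compares raw matching with a normalized version that lowercases words and strips surrounding punctuation. The helper names `tokenize` and `overlap_proportion` are hypothetical and are not part of the workbook; this is one possible variation, not the classifier used in the rest of this notebook.

```python
import string

def tokenize(text):
    # Lowercase each word and strip punctuation from its ends.
    words = [word.strip(string.punctuation).lower() for word in text.split()]
    # Drop anything that was only punctuation.
    return [word for word in words if word]

def overlap_proportion(headline, article, normalize):
    if normalize:
        headline_words = tokenize(headline)
        article_words = set(tokenize(article))
    else:
        headline_words = headline.split()
        article_words = set(article.split())
    matches = sum(1 for word in headline_words if word in article_words)
    return matches / len(headline_words)

# Made-up texts for illustration.
headline = "Soldier shot, Parliament locked down"
article = "a soldier was shot near parliament"

raw = overlap_proportion(headline, article, normalize=False)
cleaned = overlap_proportion(headline, article, normalize=True)
print(raw, cleaned)
```

With raw matching none of the five headline words match, while the normalized version matches three of them ("soldier", "shot", "parliament").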

#### 3b. Find the average proportion for each stance.

This function will find the average proportion of headline words that appear in the article body for each category.

In [23]:
def compute_proportions(unrelated, discuss, agree, disagree):
    # Compute the overlap proportion for every example in one DataFrame.
    def proportions_for(data):
        return [find_headline_in_article_proportion(data.iloc[i]) for i in range(data.shape[0])]

    proportions_unrelated = proportions_for(unrelated)
    proportions_discuss = proportions_for(discuss)
    proportions_agree = proportions_for(agree)
    proportions_disagree = proportions_for(disagree)
    # 'related' pools the examples from the three related stances.
    proportions_related = proportions_discuss + proportions_agree + proportions_disagree
    return {
        "unrelated": np.mean(proportions_unrelated),
        "discuss": np.mean(proportions_discuss),
        "agree": np.mean(proportions_agree),
        "disagree": np.mean(proportions_disagree),
        "related": np.mean(proportions_related),
    }

In [24]:
proportions = compute_proportions(unrelated_train, discuss_train, agree_train, disagree_train)

Before looking at `proportions`, try to guess what the proportions are for each category.
Side note: We saved proportions in a **dictionary**. If you're interested in learning more about it, ask an instructor or see this description: https://www.tutorialspoint.com/python/python_dictionary.htm
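As a quick illustration of how a dictionary works, here is a small sketch. The values in `toy_proportions` are made up for the example and are not the real FNC averages.

```python
# Toy values for illustration only (not the real averages).
toy_proportions = {"unrelated": 0.1, "related": 0.4}

# Look up a value by its key.
print(toy_proportions["unrelated"])

# Add a new key-value pair.
toy_proportions["discuss"] = 0.45

# Keys and values come back in insertion order.
stance_names = list(toy_proportions.keys())
stance_values = list(toy_proportions.values())
print(stance_names)
print(stance_values)
```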

**Now look at the proportions for each category.**

In [44]:
proportions

{'agree': 0.44605676586182841,
 'disagree': 0.43070433968733673,
 'discuss': 0.43875075101834521,
 'related': 0.44024866842925497,
 'unrelated': 0.14348271330437723}

#### 3c. Write a prediction function based on the closest average proportion.

Now, we'll write a function that predicts an example's class depending on which group's overlap proportion it's closest to.

In [25]:
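def make_prediction(example, proportions):
    # Keep the stance names and their average proportions in matching order.
    proportions_stances = list(proportions.keys())
    average_proportions = np.array(list(proportions.values()))
    proportion = find_headline_in_article_proportion(example)
    # Predict the stance whose average is closest to this example's proportion.
    closest_index = np.argmin(np.abs(average_proportions - proportion))
    predicted_stance = proportions_stances[closest_index]
    return predicted_stance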
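The "closest average" rule can also be written in plain Python, which may make the logic easier to follow. This is a sketch with made-up averages; `closest_stance` and `toy_averages` are hypothetical names used only here.

```python
def closest_stance(proportion, averages):
    # Return the stance whose average has the smallest absolute
    # difference from the given proportion.
    return min(averages, key=lambda stance: abs(averages[stance] - proportion))

# Made-up averages for illustration.
toy_averages = {"unrelated": 0.14, "agree": 0.45}

low_overlap = closest_stance(0.2, toy_averages)
high_overlap = closest_stance(0.4, toy_averages)
print(low_overlap, high_overlap)
```

A proportion of 0.2 is 0.06 away from the 'unrelated' average and 0.25 away from the 'agree' average, so it is labeled 'unrelated'; 0.4 is closer to 'agree'.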

### 4. Test the classifier.

#### 4a.  Write a function that runs and evaluates predictions on the test set.

This function computes the proportion of correct predictions for each category.

In [26]:
def test_predictions(test_data, proportions):
    stance_counts = {"unrelated":0, "discuss":0, "agree":0, "disagree":0}
    stance_correct_counts = {"unrelated":0, "discuss":0, "agree":0, "disagree":0}
    for i in range(test_data.shape[0]):
        example = test_data.iloc[i]
        predicted_stance = make_prediction(example, proportions)
        actual_stance = example['Stance']
        stance_counts[actual_stance] += 1
        if predicted_stance == actual_stance:
            stance_correct_counts[actual_stance] += 1
    # Divide correct counts by total counts to get the accuracy for each stance.
    return {stance: stance_correct_counts[stance]/stance_counts[stance] for stance in stance_counts}

#### 4b. Test the classifier and examine the results.

In [27]:
correct_percentages = test_predictions(test_data, proportions)

In [28]:
correct_percentages

{'agree': 0.4703100367840252,
 'disagree': 0.21377331420373027,
 'discuss': 0.0035842293906810036,
 'unrelated': 0.9000490489944956}

**What categories does the predictor do well on, and what does the predictor do less well on? Why do you think this is? (hint: take a look at the average proportion of in-article headline words).**

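It does very well on 'unrelated' and worst on 'discuss'. The mean proportion for 'unrelated' (about 0.14) is far from the means of the other categories (all between about 0.43 and 0.45), so unrelated pairs are easy to separate. The 'discuss' mean is sandwiched between the means for 'agree' and 'disagree', so only a very narrow range of proportions is closest to it. Note that `proportions` also contains a 'related' entry, which the classifier can predict even though it never matches a true label in the test set.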
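To make this concrete, the sketch below works out which range of proportions is assigned to each stance. A nearest-average rule switches its prediction at the midpoint between two neighboring averages. The averages are the values from the `proportions` output above, rounded to four decimal places; the variable names are only for illustration.

```python
# Averages from the proportions output above, rounded to four decimals.
averages = {
    "unrelated": 0.1435,
    "disagree": 0.4307,
    "discuss": 0.4388,
    "related": 0.4402,
    "agree": 0.4461,
}

# Sort the stances by their average proportion.
ordered = sorted(averages.items(), key=lambda item: item[1])

# The prediction changes at the midpoint between neighboring averages.
boundaries = [(ordered[i][1] + ordered[i + 1][1]) / 2 for i in range(len(ordered) - 1)]
lower_bounds = [0.0] + boundaries
upper_bounds = boundaries + [1.0]

# Record how wide each stance's range of proportions is.
widths = {}
for (stance, _), low, high in zip(ordered, lower_bounds, upper_bounds):
    widths[stance] = high - low
    print(f"{stance:>9}: {low:.4f} to {high:.4f}")
```

With these values, 'discuss' is only predicted for proportions in a range less than 0.01 wide, which is consistent with its very low accuracy.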

Take a look at the number of examples in each category, and compute the overall classification accuracy. Is this higher or lower than you would have expected?

In [29]:
unrelated_test = test_data[test_data['Stance'] == 'unrelated']
discuss_test = test_data[test_data['Stance'] == 'discuss']
agree_test = test_data[test_data['Stance'] == 'agree']
disagree_test = test_data[test_data['Stance'] == 'disagree']

In [30]:
overall_accuracy = (correct_percentages['agree'] * agree_test.shape[0] \
                    + correct_percentages['discuss'] * discuss_test.shape[0] \
                    + correct_percentages['disagree'] * disagree_test.shape[0] \
                    + correct_percentages['unrelated'] * unrelated_test.shape[0]) \
                    / test_data.shape[0]

In [31]:
overall_accuracy

0.691575178058474

**What do you think of this accuracy? How much better do you think we can do?**

We'll see ;)

### 5. Create a simple classifier based on the most common category.

#### 5a. Find the most common category.

First, we need to find the most common category in the training data. For a task like this, there's often an existing function that someone else wrote and that we can borrow. The easiest way to find one is to search online. For example, I googled "pandas series count values for each type" and this was the second result: https://stackoverflow.com/questions/22391433/count-the-frequency-that-a-value-occurs-in-a-dataframe-column. Check out this link, and try to find the number of occurrences of each category in one line.

In [35]:
train_data.groupby('Stance').count()

| Stance | Body ID | articleBody | Headline |
|---|---|---|---|
| agree | 3678 | 3678 | 3678 |
| disagree | 840 | 840 | 840 |
| discuss | 8909 | 8909 | 8909 |
| unrelated | 36545 | 36545 | 36545 |


**What is the most common category?**

The most common category is 'unrelated'.

*Extra Challenge: Try determining the number of examples for each category without using these functions.*

In [38]:
stances = ['unrelated', 'agree', 'discuss', 'disagree']
for stance in stances:
    print("Number of " + stance + " examples is " + str(train_data[train_data['Stance'] == stance].shape[0]))

Number of unrelated examples is 36545
Number of agree examples is 3678
Number of discuss examples is 8909
Number of disagree examples is 840
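Another way to count categories without pandas helpers is `collections.Counter` from the standard library. The sketch below uses a short made-up list; passing the real column (for example `Counter(train_data['Stance'])`) should work the same way, though that call is an untested suggestion here.

```python
from collections import Counter

# A short made-up list of stances for illustration.
toy_stances = ["unrelated", "agree", "unrelated", "discuss", "unrelated"]

# Counter tallies how many times each value appears.
counts = Counter(toy_stances)
print(counts)

# most_common(1) returns a list with the single most frequent (value, count) pair.
most_common_stance, most_common_count = counts.most_common(1)[0]
print(most_common_stance, most_common_count)
```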


#### 5b. Compute the test accuracy.

First, we need to find how many examples of each category there are in the test set.

In [39]:
test_data.groupby('Stance').count()

| Stance | Body ID | articleBody | Headline |
|---|---|---|---|
| agree | 1903 | 1903 | 1903 |
| disagree | 697 | 697 | 697 |
| discuss | 4464 | 4464 | 4464 |
| unrelated | 18349 | 18349 | 18349 |


**Now, what would the accuracy be if we always guessed the most common category?**

In [40]:
18349/test_data.shape[0]

0.7220320308503522
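The same calculation can be written without hard-coding the total, which also makes it easy to compare against the shared-words classifier. The counts and the earlier accuracy are copied from the outputs above; the variable names are only for illustration.

```python
# Test-set counts copied from the table above.
test_counts = {"agree": 1903, "disagree": 697, "discuss": 4464, "unrelated": 18349}

# The most common category in the training data.
majority_stance = "unrelated"

# Always guessing the majority stance is correct exactly when
# the true stance is the majority stance.
baseline_accuracy = test_counts[majority_stance] / sum(test_counts.values())

# Overall accuracy of the shared-words classifier, from section 4b.
shared_words_accuracy = 0.691575178058474

print(baseline_accuracy)
print(baseline_accuracy > shared_words_accuracy)
```

On this test set, always guessing 'unrelated' scores slightly higher than the shared-words classifier.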

### Extra Challenge

"Extra challenge" sections are a more unguided exploration into the concepts we've discussed. You'll notice less scaffolding for the code -- try implementing these concepts from scratch, and feel free to ask your neighbors or an instructor if you have any questions!

#### Challenge 1. Two-way classification.

Try adapting our code for four-way classification (between 'unrelated', 'agree', 'disagree', and 'discuss') for two-way classification (between 'unrelated' and 'related'). To do this, we'll group 'agree', 'disagree', and 'discuss' examples into a single 'related' category.
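One possible starting point, as a rough sketch rather than a full solution: map each four-way label to a two-way label, and compare only against the 'unrelated' and 'related' averages. The helper names `to_two_way` and `predict_two_way` are hypothetical, and the averages are rounded values from the `proportions` output earlier in this workbook.

```python
def to_two_way(stance):
    # Group 'agree', 'disagree', and 'discuss' into a single 'related' label.
    return "unrelated" if stance == "unrelated" else "related"

def predict_two_way(proportion, averages):
    # Only consider the two averages that matter for two-way classification.
    two_way = {stance: averages[stance] for stance in ("unrelated", "related")}
    return min(two_way, key=lambda stance: abs(two_way[stance] - proportion))

# Rounded averages from the proportions output earlier in this workbook.
averages = {"unrelated": 0.1435, "related": 0.4402}

labels = [to_two_way(stance) for stance in ["agree", "unrelated", "discuss", "disagree"]]
print(labels)
print(predict_two_way(0.2, averages), predict_two_way(0.35, averages))
```

You would still need to adapt `test_predictions` so that it compares two-way predictions with two-way labels.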

#### Challenge 2. Experiment with more statistics.
In today's classifier, we compared the proportion of in-article headline words to the mean of each stance category. Experiment with some different statistics. Some ideas:
- Use the median instead of the mean for each stance.
- The 'Jaccard index' between two bodies of text, A and B, is (number of words shared by A and B)/(number of words in either A or B). Try using the Jaccard index instead of the proportion of in-article headline words. (hint: look up the functions set(), intersection(), and union())
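As a hint for the Jaccard idea, here is a minimal sketch of the index on two short strings. It assumes words are split on whitespace, as in the rest of this workbook; the function name `jaccard_index` is hypothetical.

```python
def jaccard_index(text_a, text_b):
    # Use sets so each distinct word is counted once.
    words_a = set(text_a.split())
    words_b = set(text_b.split())
    all_words = words_a.union(words_b)
    # Avoid dividing by zero when both texts are empty.
    if not all_words:
        return 0.0
    shared_words = words_a.intersection(words_b)
    return len(shared_words) / len(all_words)

score = jaccard_index("this is a test headline", "this is a test article")
print(score)
```

Here the two texts share four words ("this", "is", "a", "test") out of six distinct words in total, so the index is 4/6.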