# Exploring The Data: Fake News Challenge

In today's project session, we are going to think more about the Fake News Challenge dataset (https://github.com/FakeNewsChallenge/fnc-1) and the task of classifying the stance of the body text relative to the claim made in the headline into one of four categories:
1. Agrees: The body text agrees with the headline.
2. Disagrees: The body text disagrees with the headline.
3. Discusses: The body text discusses the same topic as the headline, but does not take a position.
4. Unrelated: The body text discusses a different topic than the headline.

Today we will:
1. Look at the distribution of examples across stance classes
2. Think about the rules that we, as humans, use to perform the article-headline classification task. We will then explore whether these rules enable the computer to separate the different stance classes.

Anytime you see ``______ # TODO: FILL IN HERE.`` in the code, you should replace the ``______`` with your own code.

As always, ask your neighbors or an instructor if you have any questions!

### 1. Read in data

In [10]:
# Import modules
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import os.path

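# Show the full text of each column (None means no truncation; -1 is deprecated in recent pandas versions)
pd.set_option('display.max_colwidth', None)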

In [7]:
# Function to create the training data if you don't already have it
def merge_data(stances_filename, bodies_filename, merged_filename):
    stances = pd.read_csv(stances_filename, encoding = "utf-8")
    bodies = pd.read_csv(bodies_filename, encoding = "utf-8")
    data = pd.merge(bodies, stances, on='Body ID')
    data.to_csv(merged_filename, index=False, encoding = "utf-8")
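
To see what the merge step does, here is a minimal sketch on two toy tables. The rows are made up for illustration; only the column names (`Body ID`, `articleBody`, `Headline`, `Stance`) are taken from this notebook. Each stance row is paired with the article body that shares its `Body ID`, so one article can appear in several rows.

```python
import pandas as pd

# Toy tables (made-up rows) with the same column names the notebook uses
toy_bodies = pd.DataFrame({
    'Body ID': [0, 1],
    'articleBody': ['Text of article zero.', 'Text of article one.'],
})
toy_stances = pd.DataFrame({
    'Headline': ['Headline A', 'Headline B', 'Headline C'],
    'Body ID': [0, 0, 1],
    'Stance': ['agree', 'unrelated', 'discuss'],
})

# Same call as in merge_data(): an inner join on the shared 'Body ID' column
toy_merged = pd.merge(toy_bodies, toy_stances, on='Body ID')
print(toy_merged)
```

The merged table has one row per headline, with the matching article text repeated wherever two headlines point to the same body.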

In [8]:
# Let's read in the dataset we created yesterday. If it doesn't exist yet, we create it first. You may need to
# adjust the file paths if Python cannot find the 'train_stances.csv' and 'train_bodies.csv' files:
if not os.path.isfile('train_data.csv'):
    merge_data("../train_stances.csv", "../train_bodies.csv", "train_data.csv")

In [9]:
# Read in the training data:
train_data = pd.read_csv("train_data.csv", encoding = "utf-8")

In [None]:
# Reminder: what do these datasets look like? Use a function that you encountered yesterday to display the first few 
# rows of the training dataset:
train_data._____() # TODO: FILL IN HERE 

### 2. Stance variable distribution

Reminder: the goal of this challenge is to teach the computer to 'read' article/headline pairs and predict the associated value of the Stance variable (UNRELATED/DISCUSS/AGREE/DISAGREE). First, we should look at how many examples of each Stance value we have available for teaching the computer:

In [None]:
# Let's look at the distribution of Stance values in the dataset:
train_data.Stance.value_counts()

In [None]:
# And as a proportion of the rows in the dataset: (use an attribute of train_data that you learned about yesterday to
# divide the count values by the number of rows in the dataset)
train_data.Stance.value_counts()/train_data.____[0] # TODO: FILL IN HERE

Thus, we have a very uneven dataset, with many more examples of UNRELATED article/headline pairs compared to DISCUSS/AGREE/DISAGREE article/headline pairs. We will discuss the consequences of this uneven distribution later.
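
As a small preview of why this matters, the sketch below uses made-up labels (not the real dataset counts) to show that a "classifier" which always predicts the most common class can look accurate without learning anything. The names `toy_stances`, `majority_class`, and `baseline_accuracy` are illustrative only.

```python
import pandas as pd

# Made-up labels for illustration only; these are NOT the real dataset counts
toy_stances = pd.Series(['unrelated'] * 7 + ['discuss'] * 2 + ['agree'] * 1)

# The most frequent class in the toy data
majority_class = toy_stances.value_counts().idxmax()

# Accuracy of a "classifier" that always predicts the majority class
baseline_accuracy = (toy_stances == majority_class).mean()
print(majority_class, baseline_accuracy)
```

On this toy data the always-UNRELATED guess is right 70% of the time, even though it never identifies a single DISCUSS or AGREE pair.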

### 3. Exploring the data - Can we find features that are different across stance groups?

In what follows, we are going to work with a reduced subset of the training dataset, which has an equal number of examples from each Stance category. We will return to working with the full dataset later.

#### 3a. Randomly sample 100 examples from each stance class

In [None]:
unrelated = train_data[train_data['Stance'] == 'unrelated'].sample(100)
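
Because `.sample()` draws rows at random, rerunning the cell above gives a different subset each time. If you want repeatable results, one option is the `random_state` argument of `DataFrame.sample`. Here is a minimal sketch on a toy DataFrame; the rows and the seed value 0 are arbitrary choices for illustration.

```python
import pandas as pd

# Toy DataFrame (made-up rows) standing in for train_data
toy = pd.DataFrame({'Stance': ['unrelated'] * 10, 'row': range(10)})

# Same seed -> same rows every time the cell is run
first_draw = toy.sample(3, random_state=0)
second_draw = toy.sample(3, random_state=0)
print(first_draw.index.tolist() == second_draw.index.tolist())  # prints True
```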

In [None]:
discuss = train_data[train_data['Stance'] == 'discuss'].sample(_____) # TODO: FILL IN HERE
agree = train_data[train_data['Stance'] == 'agree']._____(_____) # TODO: FILL IN HERE
disagree = train_data[train_data['Stance'] == _____].sample(100) # TODO: FILL IN HERE

In [None]:
# Check that each class has one hundred samples:
unrelated._____[0] # TODO: FILL IN HERE

#### 3b. What distinguishes each class? 
When headlines and articles are 'unrelated', is the proportion of headline words that also appear in the article different from that for the other stance classes?

In [None]:
# For a single headline-article example, let's examine the proportion of words in the headline that also appear in
# the article. We create a function that takes an example (a single row of any of the dataframes we created above)
# and returns the proportion of the headline that is contained in the article:
def find_headline_in_article_proportion(example):
    # Separate out all of the individual words in the headline:
    headline_words = example['Headline'].split()
    # Separate out all of the individual words in the article (a set makes the lookups below fast):
    article_words = set(example['articleBody'].split())
    # Counter to count the number of words in the headline that are contained in the article
    counter = 0 
    # Now iterate through each of the words in the headline and check if the word also exists in the article
    for word in headline_words:
        if word in article_words:
            # If the word in the headline is contained in the article, add 1 to the counter
            counter += 1
    # Since the length of each headline is different, what we really want to compare across examples is the proportion 
    # of the headline that is contained in the article 
    proportion = counter/len(headline_words)
    return proportion
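
To check your understanding of the function, here is a small worked example on a made-up headline and article. The function is repeated so that the block runs on its own, and `toy_example` is a hypothetical stand-in for a dataframe row. Note that the comparison is case-sensitive and splits on whitespace only, so 'Cat' and 'cat' count as different words.

```python
def find_headline_in_article_proportion(example):
    # Same logic as the function above, repeated so this block runs on its own
    headline_words = str.split(example['Headline'])
    article_words = str.split(example['articleBody'])
    counter = 0
    for word in headline_words:
        if word in article_words:
            counter += 1
    return counter / len(headline_words)

# Made-up example; a plain dict works because the function only uses [] lookups
toy_example = {
    'Headline': 'Cat rescued from tree',
    'articleBody': 'A cat was rescued from a tall tree on Monday',
}
toy_proportion = find_headline_in_article_proportion(toy_example)
print(toy_proportion)  # 0.75: 'rescued', 'from' and 'tree' match, but 'Cat' does not match 'cat'
```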

In [None]:
# Let's apply the function to all examples in each of the four dataframes we created above.
# We will save the results to a list:
proportions_unrelated = [] # Save results to this list
for i in range(unrelated.shape[0]):
    # Extract example
    this_example = unrelated.iloc[i]
    # Save result of calling find_headline_in_article_proportion() on example to proportions_unrelated list
    proportions_unrelated.append(find_headline_in_article_proportion(this_example))
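
As an aside, pandas can run a function over every row without an explicit loop. The sketch below compares the loop with `DataFrame.apply(..., axis=1)` on a toy dataframe. The rows are made up, and `headline_in_article_proportion` is a compact stand-in for the function defined earlier, not a replacement for it.

```python
import pandas as pd

def headline_in_article_proportion(example):
    # Compact stand-in for find_headline_in_article_proportion()
    headline_words = example['Headline'].split()
    article_words = set(example['articleBody'].split())
    return sum(word in article_words for word in headline_words) / len(headline_words)

toy_df = pd.DataFrame({
    'Headline': ['rain expected today', 'team wins final'],
    'articleBody': ['Heavy rain is expected across the region today',
                    'The council met on Tuesday'],
})

# Explicit loop, as in the cell above
loop_result = []
for i in range(toy_df.shape[0]):
    loop_result.append(headline_in_article_proportion(toy_df.iloc[i]))

# One-line equivalent: apply the function to every row (axis=1)
apply_result = toy_df.apply(headline_in_article_proportion, axis=1).tolist()
print(loop_result, apply_result)
```

Both approaches give the same list of proportions; the loop is easier to read step by step, which is why the exercises below keep using it.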

In [None]:
# Now do the same for each of the other stance classes:
# discuss class
proportions_discuss = []
for i in range(discuss.shape[0]):
    this_example = discuss.iloc[____] # TODO: FILL IN HERE
    proportions_discuss.append(find_headline_in_article_proportion(_____)) # TODO: FILL IN HERE

In [None]:
# agree class
proportions_agree = []
for i in range(agree.____): # TODO: FILL IN HERE
    this_example = agree.iloc[i]
    proportions_agree.append(find_headline_in_article_proportion(this_example))

In [None]:
# disagree class
proportions_______= [] # TODO: FILL IN HERE
for i in range(disagree.shape[0]):
    this_example = disagree.iloc[i]
    proportions_disagree.append(_______(this_example)) # TODO: FILL IN HERE

#### 3c. Calculate mean proportion for each stance class.
What do you notice?

In [None]:
# unrelated
np.mean(proportions_unrelated)

In [None]:
# agree
np.mean(________agree) # TODO: FILL IN HERE

In [None]:
# discuss
_____.mean(proportions_discuss) # TODO: FILL IN HERE

In [None]:
# disagree
np._____(proportions_disagree) # TODO: FILL IN HERE

#### 3d. Visualize the differences in proportions across stance classes
We will create a boxplot to show how the proportions differ across stance classes.

In [None]:
data_to_plot = [proportions_unrelated, proportions_agree, proportions_disagree, ___________] # TODO: FILL IN HERE

In [None]:
# Create a figure instance
fig = plt.figure(1, figsize=(9, 6))

# Create an axes instance
ax = fig.add_subplot(111)

# Create the boxplot
bp = ax.boxplot(data_to_plot)
# Set the x-axis tick labels
ax.set_xticklabels(['Unrelated', 'Agree', 'Disagree', 'Discuss'])
# Set the y-axis label
ax.set_ylabel('Proportion of headline words in article')

What do you notice from the boxplot? Based on it, could you come up with a rule for the computer to use to classify article-headline examples as UNRELATED/AGREE/DISAGREE/DISCUSS? What if the challenge were only to classify pairs as either UNRELATED or RELATED (with RELATED = AGREE + DISAGREE + DISCUSS)?

### Extra Challenge

"Extra challenge" sections are a more unguided exploration into the concepts we've discussed. You'll notice less scaffolding for the code -- try implementing these concepts from scratch, and feel free to ask your neighbors or an instructor if you have any questions!

After examining your boxplot's results and coming up with a rule for the computer to use in order to classify article-headline pairs as UNRELATED/RELATED, test out this rule on some examples from the training dataset. Is your rule able to correctly predict the true value of Stance for each example?