Today we will start working with the Fake News Challenge dataset (https://github.com/FakeNewsChallenge/fnc-1). By the end of today, we hope you will be comfortable:
1. Importing and exporting data from a Jupyter notebook
2. Examining the structure of the dataset (how many rows and columns are in the dataset? What does each row and column represent?)
3. Understanding the goal of the Fake News Challenge. How good are you at identifying misleading headlines? Do you think you can beat an AI?

Anytime you see a blank such as ``____`` in the code, you should replace it with your own code.

As always, ask your neighbors or an instructor if you have any questions!

# Data Representation and Exploration

### 0. Import the packages we'll need.

In [None]:
import pandas as pd
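# Adjust settings so that we can fully see the dataset below.
# None means "do not truncate column text" (older pandas versions used -1, which newer versions no longer accept).
pd.set_option('display.max_colwidth', None)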

### 1. Download the data

Go to https://github.com/FakeNewsChallenge/fnc-1 and download the following files:
    - train_stances.csv
    - train_bodies.csv

These files contain the data for the Fake News Challenge in the "csv" format. "csv" stands for "comma-separated values": each line of the file is one row, and commas separate the values in that row. We'll use this information later, when we tell the program how to load this data.
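To make the format concrete, here is a minimal sketch that reads a tiny, made-up CSV from a string instead of a file. The rows are invented for illustration and are not taken from the challenge data. Note how a value that itself contains a comma is wrapped in double quotes.

```python
import io
import pandas as pd

# Made-up file contents for illustration only (not real challenge data).
# The first line holds the column names; every later line is one row.
csv_text = """Headline,Body ID,Stance
"Example headline, with a comma",0,unrelated
Another example headline,1,agree
"""

# pd.read_csv accepts a file path or a file-like object such as io.StringIO.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```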

PRO TIP: Make sure that the downloaded datasets and this Jupyter notebook are in the same directory (folder). Otherwise, the code that loads the data later will not be able to find the files.

### 2. Understand what the data contains.

Before we start looking at the data, we should understand what it contains and where it comes from.

Go to http://www.fakenewschallenge.org/ and read up to, and including, the "DATA" section. Try to answer the following questions:

    - What information do these .csv files contain?
    - What is the classification goal posed by the Fake News Challenge?
    - How was this data collected?
    
Discuss your answers with your neighbor or an instructor.

### 3. Load the data.

Fill in the following code to load the data you downloaded into DataFrames.

In [None]:
train_stances = pd.read_csv("train_stances.csv", encoding = "utf-8")
train_bodies = pd.____("train_bodies.csv", encoding = "utf-8") 

### 4. Look at the layout of the data.

The ``.columns`` attribute of a DataFrame tells us the names of its columns. Run the following cells (filling in the blank in the second one) to examine the column names of the DataFrames we just created.

In [None]:
train_stances.columns

In [None]:
train_bodies.____

Notice that each dataset contains the column 'Body ID'. We will use this column later to match rows across the two datasets.

Now that we've looked at the columns of the dataset, let's look at the rows. How many rows are in each dataset? We can use the ``.shape`` attribute to tell us the number of rows in each dataset. Can you guess what the second number returned by ``.shape`` corresponds to?

In [None]:
train_stances.shape

In [None]:
train_bodies.shape

Notice that the number of rows in ``train_stances`` is different from the number of rows in ``train_bodies``. Why is this? How many unique entries are there for the ``Body ID`` variable in each dataset? We can use the ``pd.unique`` function to figure this out, and we can use ``[]`` (square brackets) to isolate a single column in each dataset:

In [None]:
len(pd.unique(train_stances['Body ID']))

In [None]:
len(pd.unique(train_bodies['Body ID']))

Aha! While the ``train_bodies`` dataset has one row for each value of the ``Body ID`` variable, there are multiple rows in ``train_stances`` corresponding to a single value of ``Body ID``.  Let's look at examples in each dataset corresponding to a value of 0 for the ``Body ID``:

In [None]:
# We expect there to be one row of the train_bodies dataset returned here:
train_bodies[train_bodies['Body ID'] == 0]

In [None]:
# Let's check that there is exactly one row in train_bodies corresponding to Body ID = 0 by using the .shape attribute
# discussed before:
train_bodies[train_bodies['Body ID'] == 0].shape

In [None]:
# We expect there to be multiple rows of the train_stances dataset returned here. (Note: the .head() method
# is really useful for returning just the first few rows.)
train_stances[train_stances['Body ID'] == 0].____()

In [None]:
# Let's use the .shape attribute once more to check:
train_stances[train_stances['Body ID'] == 0].shape

Hence, the ``train_stances`` dataset contains multiple headlines for each article in ``train_bodies``, each labeled with a stance. Our task is to train an AI to classify the stance of a headline relative to an article body: does the headline agree with, disagree with, or discuss the article, or is it unrelated?
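The row filtering and unique-value counting used in this section can be tried on a small table first. The sketch below uses a toy DataFrame invented for illustration; only the 'Body ID' column name comes from this notebook. It also shows ``.nunique()`` and ``.value_counts()``, two pandas methods not used above that may be handy alternatives.

```python
import pandas as pd

# Toy table invented for illustration; 'Body ID' is the only column name taken from this notebook.
toy_stances = pd.DataFrame({
    "Body ID": [0, 0, 1, 2, 2, 2],
    "Headline": ["h1", "h2", "h3", "h4", "h5", "h6"],
})

# Step 1: comparing a column to a value gives one True/False per row.
mask = toy_stances["Body ID"] == 0

# Step 2: indexing with that Boolean Series keeps only the rows marked True.
rows_for_body_0 = toy_stances[mask]

# .nunique() counts distinct values, like len(pd.unique(...)).
n_bodies = toy_stances["Body ID"].nunique()

# .value_counts() reports how many rows share each 'Body ID'.
rows_per_body = toy_stances["Body ID"].value_counts()

print(rows_for_body_0)
print(n_bodies)
print(rows_per_body)
```

Writing the mask on its own line is optional; ``toy_stances[toy_stances["Body ID"] == 0]`` does the same thing in one step, as in the cells above.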

### 5. Re-organize the data.

The data we loaded contains all the information we need, but it puts different pieces of information about the same article in different DataFrames. To make the data easier to work with, we'd like to put the information about each article in one DataFrame.

The ``pd.merge`` function combines two DataFrames into one by matching rows that share the same value in a common column, here ``Body ID``.

In [None]:
train_data = pd.merge(train_bodies, train_stances, on='Body ID')

In [None]:
# Let's examine the shape of the newly created dataset:
train_data.shape

In [None]:
# Let's also look at the first few rows of this merged dataset:
train_data.____()
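To see what a merge on a shared column does, it can help to try it on two tiny tables. The sketch below uses toy data invented for illustration; the column names other than 'Body ID' are chosen for the example. The ``validate`` argument is an optional safety check that is not used in the cell above.

```python
import pandas as pd

# Toy tables invented for illustration (not real challenge data).
toy_bodies = pd.DataFrame({
    "Body ID": [0, 1],
    "articleBody": ["text of article 0", "text of article 1"],
})
toy_stances = pd.DataFrame({
    "Headline": ["headline a", "headline b", "headline c"],
    "Body ID": [0, 0, 1],
    "Stance": ["unrelated", "agree", "discuss"],
})

# By default pd.merge does an inner join: each stance row is paired
# with the body row that has the same 'Body ID'.
toy_merged = pd.merge(toy_bodies, toy_stances, on="Body ID")
print(toy_merged)

# Optional check: validate="one_to_many" raises an error if a 'Body ID'
# appears more than once in the left table.
checked = pd.merge(toy_bodies, toy_stances, on="Body ID", validate="one_to_many")
```

The merged table has one row per headline, and the text of article 0 is repeated for each of its two headlines.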

### 6. Export the data.

Now that we have created an amalgamated dataset, we'd like to export this, so that we can use it in the future:

In [None]:
train_data.to_csv("train_data.csv", index=False, encoding = "utf-8")
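One way to convince yourself that an export worked is to read the file back in. The sketch below does this with a toy DataFrame and a temporary folder (both invented for illustration), and also shows why ``index=False`` is used: without it, the row labels are saved as an extra column.

```python
import os
import tempfile
import pandas as pd

# Toy table invented for illustration.
toy = pd.DataFrame({"Body ID": [0, 1], "Headline": ["headline a", "headline b"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # With index=False, only the real columns are written.
    clean_path = os.path.join(tmp_dir, "toy_clean.csv")
    toy.to_csv(clean_path, index=False, encoding="utf-8")
    clean = pd.read_csv(clean_path, encoding="utf-8")

    # With the default (index=True), the row labels become an extra first column.
    indexed_path = os.path.join(tmp_dir, "toy_indexed.csv")
    toy.to_csv(indexed_path, encoding="utf-8")
    indexed = pd.read_csv(indexed_path, encoding="utf-8")

print(list(clean.columns))
print(list(indexed.columns))
```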

### Extra Challenge

"Extra challenge" sections are a more unguided exploration into the concepts we've discussed. You'll notice less scaffolding for the code -- try implementing these concepts from scratch, and feel free to ask your neighbors or an instructor if you have any questions!

Now let's try looking at some other information about the data. Try to answer the following questions (and any others you think would be useful) -- feel free to look things up online along the way.

    - How many examples are there for each stance? (For instance, how many "unrelated" examples are there?).
        - Is each of the stances equally represented?
    - In general, how long are the headlines and article bodies?
        - Does this differ for different stances?
        - How much do these counts vary between different examples?
    - When you read the article and headline and try to decide whether the headline agrees with, disagrees with, discusses, or is unrelated to the content of the article, what factors do you consider? How can we code these up so that a computer can capture our intuition and perform this classification by itself?
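If you are unsure where to start, the sketch below shows a few pandas tools that may help with the counting and length questions. It runs on a toy table invented for illustration, so you will still need to adapt it to ``train_data`` and its real column names.

```python
import pandas as pd

# Toy table invented for illustration (not real challenge data).
toy = pd.DataFrame({
    "Headline": ["short one", "a somewhat longer headline", "tiny"],
    "Stance": ["agree", "unrelated", "unrelated"],
})

# How many rows are there for each distinct value in a column?
stance_counts = toy["Stance"].value_counts()

# Number of words in each headline: split on whitespace, then count the pieces.
toy["headline_words"] = toy["Headline"].str.split().str.len()

# Average word count within each stance group.
mean_words_by_stance = toy.groupby("Stance")["headline_words"].mean()

print(stance_counts)
print(mean_words_by_stance)
```

Counting words is only one possible measure of length; ``.str.len()`` applied directly to a text column counts characters instead.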