# <b>Data Processing: Fake & Satire</b>
#### <font color = 'gray'>Christine Gregg  |  November 2022</font>

This notebook includes processing and initial exploration of the Fake News vs Satire dataset from <a href="https://web.stanford.edu/~mattm401/docs/2018-Golbeck-WebSci-FakeNewsVsSatire.pdf">Golbeck, et al. (2018)</a>.

###<b>Initial Set-Up & Mount Google Drive</b>

In [1]:
import os
import glob
import numpy as np
import pandas as pd
from tqdm import tqdm
import plotly.express as px


# These paths will work if our project drive has been shared with you
LABEL_PATH = 'data\raw\satire_fake\labels.xlsx'
FAKE_PATH = 'data\raw\satire_fake\fake'
SATIRE_PATH = 'data\raw\satire_fake\satire'


# # If running in Colab: connecting to Shared Google Drive
# # Run this cell and select your UMich Google account in the pop-up
# from google.colab import drive
# import sys
# drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


###<b>Function Definitions</b>

Note that fake and satire articles are stored as separate .txt files with the following format:

*   File Name: Integer that corresponds to `article_id` in `df_labels`
*   First Line: Article title
*   Second Line: Article URL (a proxy for article source)
*   Remainder of File: Full plain text article

In [2]:
def process_article(article, separate_features=False):
    """
    Processes a single article and returns its features.

    Args:
        article: The text file to be processed.
        separate_features: When True, the .txt file will be separated into 
            distict features (columns) in the returned dataframe. When False, 
            the contents of the .txt file will be returned as a single 
            feature (column).
    Returns:
        file_id: The unique file name (integer).
        title: The title of the article (assumed to be the first line).
        url: The URL of the article (assumed to be the second line).
        contents: The full text of the article (third line through EOF).
        lines: The title, URL, and contents as a single blob of text.
    """
    if separate_features == True:
        with open(article, 'r', errors='ignore') as f:
            lines = f.readlines()
            f.close()

        file_id = os.path.basename(f.name).split('.')[0]
        title = lines[0].strip()
        url = lines[1].strip()
        contents = lines[2].strip()

        return file_id, title, url, contents
    else:
        with open(article, 'r', errors='ignore') as f:
            lines = f.read()
            f.close()
        
        file_id = os.path.basename(f.name).split('.')[0]

        return file_id, lines


def process_files(path, separate_features=False):
    """
    Iterates through a folder of text files.

    Args:
        path: The path of the folder to be processed.
        separate_features: When True, the .txt file will be separated into 
            distict features (columns) in the returned dataframe. When False, 
            the contents of the .txt file will be returned as a single 
            feature (column).  
    Returns:
        df: A dataframe containing the processed files as separate rows.
    """
    files = glob.glob(path + '/*.txt')
    articles = []

    if separate_features == True:
        for f in tqdm(files):
            file_id, title, url, contents = process_article(
                f, 
                separate_features
                )
            articles.append([file_id, title, url, contents])
        
        df = pd.DataFrame(
            articles,
            columns=['article_id', 'title', 'url', 'text']
            )

    else:
        for f in tqdm(files):
            file_id, contents = process_article(
                f, 
                separate_features
                )

        df = pd.DataFrame(articles, columns=['article_id', 'text'])

    return df

###<b>Load and Review Fake/Satire Class Labels</b>

In [3]:
labels_df = pd.read_excel(LABEL_PATH, names=['article_id', 'url', 'label'],
                          usecols=['article_id', 'label'],
                          dtype={'article_id': np.int32, 'label': str}
                          )

labels_df.label.describe()

  warn(msg)


count      492
unique       3
top       Fake
freq       291
Name: label, dtype: object

We are expecting 2 labels ("Fake" and "Satire"), so we need to clean up whatever is causing a third label to appear. In this case, it's whitespace.

In [4]:
labels_df.label.unique()

array(['Satire', 'Fake', 'Satire '], dtype=object)

In [5]:
# This should result in two unique labels
labels_df.label = labels_df.label.str.strip()
labels_df.label.describe()

count      492
unique       2
top       Fake
freq       291
Name: label, dtype: object

We have a slight class imbalance (~60% of the articles are labeled "Fake") and so we will need to watch out for the model favoring the majority class.

In [6]:
fig = px.bar(labels_df.groupby('label').count(),
             width=500, height=400)
fig.update_layout(showlegend=False, 
                  title_text='Fake-Satire Class Imbalance', title_x=0.5)
fig.show()

###<b>Load and Review Fake Articles</b>
We are using all articles accepted by the original paper authors. For the purposes of this project, we will not do any additional acceptance screening.

In [7]:
fake_df = process_files(FAKE_PATH, separate_features=True)
fake_df.sample(4)

100%|██████████| 283/283 [00:01<00:00, 170.89it/s]


Unnamed: 0,article_id,title,url,text
193,545,The DEA Just Raided A United States Senator-De...,http://breaking.newsfeedhunter.com/the-dea-jus...,The DEA just raided the vacation ranch of Demo...
32,488,Trump allowed Black homeless woman to live in ...,http://www.naturalnews.com/2016-12-13-trump-al...,
282,19,Breaking: Sharia Law is Gone--The House Made a...,http://24wpn.com/index.php/2017/04/29/breaking...,Former president Barack Obama and his Democrat...
275,290,Michelle Obama Deletes Hillary Clinton From Tw...,http://yournewswire.com/michelle-obama-deletes...,Michelle Obama deleted Hillary Clinton from bo...


In [8]:
fake_df.describe()

Unnamed: 0,article_id,title,url,text
count,283,283,283.0,283.0
unique,283,274,267.0,231.0
top,316,NPR: 25 Million Votes For Clinton ‘Completely ...,,
freq,1,2,16.0,49.0


###<b>Load and Review Satire Articles</b>
We are using all articles accepted by the original paper authors. For the purposes of this project, we will not do any additional acceptance screening.

In [9]:
satire_df = process_files(SATIRE_PATH, separate_features=True)
satire_df.sample(8)

100%|██████████| 203/203 [00:04<00:00, 50.68it/s] 


Unnamed: 0,article_id,title,url,text
159,280,Trump Selects McGruff The Crime Dog As FBI Dir...,http://babylonbee.com/news/trump-selects-mcgru...,"WASHINGTON, D.C.—Finally selecting James Comey..."
42,128,Hillary Clinton Actually President as Nation W...,http://www.satirenews.net/hillary-clinton-actu...,WASHINGTON—The morning after what had appeared...
180,441,Ohio Gov. John Kasich Legalizes Exhumation of ...,https://www.delawareohionews.com/national-news...,By Ricardo Paye 69784 69
129,594,BREAKING: Clinton Foundation Shipping Manifest...,http://thelastlineofdefense.org/breaking-clint...,According to the manifest of the seized Clinto...
163,126,Trump Creates 20% Tax On Tweets From Mexico,http://www.satirenews.net/trump-creates-20-tax...,WASHINGTON—Mexican government officials had no...
66,162,New Monument Avenue Snapchat Filter Makes You ...,https://thepeedmont.com/2017/09/06/new-monumen...,
85,426,Hillary Clinton Hospitalized With Exhaustion A...,http://realnewsrightnow.com/2017/09/hillary-cl...,"YONKERS, Ny. – Hillary Clinton was hospitalize..."
30,499,Now That President Trump Has Finally Declared ...,http://christwire.org/2017/07/president-trump-...,Today President Trump finally unveiled his tru...


In [10]:
satire_df.describe()

Unnamed: 0,article_id,title,url,text
count,203,203,203.0,203.0
unique,203,203,197.0,183.0
top,459,Trump Cuts NASA After Discovering Moon Not Mad...,,
freq,1,1,7.0,21.0


### <b>Remove Duplicates</b>
From a manual inspection, there seems to be at least one instance where an article is completely duplicated, but it may have been missed because one version included the URL and the other did not. We don't want duplicates to appear across the train/test split, so we will remove them if one or both of the following is true (blank strings are used instead of duplicates in this dataset):
* If the non-blank URLs is identical.
* If the non-blank text is identical.

In [11]:
# Identical URLs
duplicate_fake = fake_df[fake_df.duplicated('url') & fake_df['url'].ne('')]
duplicate_fake

Unnamed: 0,article_id,title,url,text
148,431,BREAKING: Hillary Clinton Has Third Heart Atta...,http://wazanews.tk/2017/07/22/breaking-hillary...,Hillary Clinton had a third and most-likely fa...


In [12]:
# Identical URLs - no duplicates for satire dataset
duplicate_satire = satire_df[satire_df.duplicated('url') & satire_df['url'].ne('')]
duplicate_satire

Unnamed: 0,article_id,title,url,text


In [13]:
# Drop duplicate URL records
fake_df.drop(duplicate_fake.index, axis=0,inplace=True)
satire_df.drop(duplicate_satire.index, axis=0,inplace=True)

In [14]:
# Identical text
duplicate_fake = fake_df[fake_df.duplicated('text') & fake_df['text'].ne('')]
duplicate_fake

Unnamed: 0,article_id,title,url,text
169,51,Nancy Pelosi In Critical Condition After Head-...,http://ourlandofthefree.com/2017/08/nancy-pelo...,Democrat Senator Nancy Pelosi was involved in ...
182,586,Burger King Admits To Using Horse Meat In Burg...,,The brand burger king admitted that they have ...
207,393,BREAKING: Malia Obama Busted Buying 6 POUNDS O...,http://xbn-news.com/index.php/2017/08/08/break...,If you thought that one little joint Malia Oba...
235,542,BREAKING: Obama PERSONALLY Called Harvard And ...,http://defensepatriot.site/2017/08/14/breaking...,"If you haven’t heard, Malia Obama was recently..."


In [15]:
# Identical text - no duplicates for satire dataset
duplicate_satire = satire_df[satire_df.duplicated('text') & satire_df['text'].ne('')]
duplicate_satire

Unnamed: 0,article_id,title,url,text


In [16]:
# Drop duplicate full text records
fake_df.drop(duplicate_fake.index, axis=0, inplace=True)
satire_df.drop(duplicate_satire.index, axis=0, inplace=True)

# Replace empty titles and text with whitespace
fake_df.replace('', np.nan, inplace=True)
fake_df.dropna(axis=0, subset=['title'], inplace=True)
fake_df.dropna(axis=0, subset=['text'], inplace=True)

satire_df.replace('', np.nan, inplace=True)
satire_df.dropna(axis=0, subset=['title'], inplace=True)
satire_df.dropna(axis=0, subset=['text'], inplace=True)

###<b>Save Processed DataFrames</b>

In [17]:
full_golbeck_df = pd.concat([fake_df, satire_df], axis=0, join='outer').reset_index(drop=True)
full_golbeck_df.drop('url', axis=1, inplace=True)
full_golbeck_df.dropna(axis=0, inplace=True)
full_golbeck_df.sample(5)

Unnamed: 0,article_id,title,text
345,437,Donald Trump tip-off sees Theresa May strike u...,Theresa May will use some of the control Brita...
373,37,Trump Wants His Limo To Be Bigger Than Obama’s,President Donald Trump seems to have an obsess...
378,495,Even Trump Surprised He Hasn’t Been Impeached Yet,"WASHINGTON, DC—Just hours after revealing high..."
409,137,President Obama Awards Himself Another Medal,President Obama awarded himself the prestigiou...
90,383,Richmond cop kills 3-yr old baby on Chamberlay...,There was a shootout today involving 2 suspect...


In [18]:
# Unify with labels
full_golbeck_df = full_golbeck_df.astype({'article_id': 'int32'})
full_golbeck_df = pd.concat([full_golbeck_df.set_index('article_id'), 
                             labels_df.set_index('article_id')], 
                            axis=1, join='inner').reset_index()
full_golbeck_df.sample(5)

Unnamed: 0,article_id,title,text,label
397,442,"Delaware, Ohio Police Reports Week of June 12,...",By Ricardo Paye 3201 0,Satire
295,77,Clinton Already Working On Follow-Up Book Cast...,"CHAPPAQUA, NY—Saying it would provide a candid...",Satire
407,256,The Biblical Case for Making Trump King,"This may be a bit controversial, but I’ve been...",Satire
324,3,Trump Vows To Protect The US From All Foreign ...,With the impending threat of Hurricane Irma in...,Satire
206,293,Police: Charlottesville Was ‘Inside Job’ To Ig...,A Charlottesville police officer claims the ri...,Fake


In [19]:
fig = px.bar(full_golbeck_df[['text', 'label']].groupby('label').count(),
             width=500, height=400)
fig.update_layout(showlegend=False, 
                  title_text='Final Fake-Satire Class Imbalance', title_x=0.5)
fig.show()

In [20]:
PROCESSED_PATH = 'data\processed\satire_fake'
labels_df.reset_index().to_csv(PROCESSED_PATH + '/labels_df')
satire_df.to_csv(PROCESSED_PATH + '/satire_df')
fake_df.to_csv(PROCESSED_PATH + '/fake_df')
full_golbeck_df.to_csv(PROCESSED_PATH + '/full_golbeck_df')