# Create Outcome Variable: Fake News or Factual?

Read in the subset of news source labels for sources that appear in our random sample of news articles, then compute the outcome variable for each news source from the available fact-checker labels. This is an instance of weakly supervised learning: labels are assigned not to individual observations but to groups of observations (here, articles grouped by news source).
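A minimal sketch of this group-level labeling, using hypothetical toy data (the source names, titles, and the `toy_` variables are illustrative assumptions, not values from the real files). Every article inherits the label of its news source:

```python
import pandas as pd

# Hypothetical source-level labels: one row per news source
toy_source_labels = pd.DataFrame({
    "news_source": ["Source A", "Source B"],
    "fake_news_binary": [0, 1],
})

# Hypothetical articles: several rows can share one news source
toy_articles = pd.DataFrame({
    "news_source": ["Source A", "Source B", "Source A"],
    "title": ["first title", "second title", "third title"],
})

# Propagate each source's label to all of its articles.
# validate="m:1" raises an error if a source appears twice in the label table.
toy_labeled = toy_articles.merge(
    toy_source_labels, on="news_source", how="left", validate="m:1"
)
print(toy_labeled)
```

The labels are noisy by construction: a generally reliable source can still publish an unreliable article, and the reverse.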

## Import modules

In [1]:
import pandas as pd
import re
from nltk.help import upenn_tagset
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer
# import other modules...

## Load data

In [2]:
labels = pd.read_csv("./labels_subset.csv")
labels.head()

Unnamed: 0.1,Unnamed: 0,"NewsGuard, Does not repeatedly publish false content","NewsGuard, Gathers and presents information responsibly","NewsGuard, Regularly corrects or clarifies errors","NewsGuard, Handles the difference between news and opinion responsibly","NewsGuard, Avoids deceptive headlines","NewsGuard, Website discloses ownership and financing","NewsGuard, Clearly labels advertising","NewsGuard, Reveals who's in charge, including any possible conflicts of interest","NewsGuard, Provides information about content creators",...,"Allsides, community_agree","Allsides, community_disagree","Allsides, community_label","BuzzFeed, leaning","PolitiFact, Pants on Fire!","PolitiFact, False","PolitiFact, Mostly False","PolitiFact, Half-True","PolitiFact, Mostly True","PolitiFact, True"
0,21stCenturyWire,,,,,,,,,,...,,,,left,,,,,,
1,ABC News,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,8964.0,6949.0,somewhat agree,,,,,,,
2,AMERICAblog News,,,,,,,,,,...,,,,left,,,,,,
3,Activist Post,,,,,,,,,,...,,,,left,,,,,,
4,Addicting Info,,,,,,,,,,...,,,,left,,,,,,


In [4]:
len(labels)

194

In [5]:
articles = pd.read_csv('./articles_df.csv', index_col = 0)
articles.head()

Unnamed: 0,news_source,pub_date,title,text
0,Talking Points Memo,2018-07-14,Trump Admin Will Start Reunifying Older Kids W...,The Trump administration says it expects to be...
1,Talking Points Memo,2018-07-14,Louie Gohmert Is Our Duke Of The Week,As our Allegra Kirkland and Tierney Sneed have...
2,Talking Points Memo,2018-07-14,Top State Election Officials Gather Amid Secur...,Trump has never condemned Russia over its medd...
3,Talking Points Memo,2018-07-14,Indictment Russia Hack Targeted Clinton Emails...,"In a July 27, 2016, speech, then-candidate Don..."
4,Talking Points Memo,2018-07-14,Nielsen No Indication Russia Targeting 2018 El...,PHILADELPHIA (AP) The U.S. homeland security ...


In [6]:
len(articles)

697695

## Create target variable

In [8]:
# view all veracity score columns
labels.columns

Index(['News_Source', 'NewsGuard, Does not repeatedly publish false content',
       'NewsGuard, Gathers and presents information responsibly',
       'NewsGuard, Regularly corrects or clarifies errors',
       'NewsGuard, Handles the difference between news and opinion responsibly',
       'NewsGuard, Avoids deceptive headlines',
       'NewsGuard, Website discloses ownership and financing',
       'NewsGuard, Clearly labels advertising',
       'NewsGuard, Reveals who's in charge, including any possible conflicts of interest',
       'NewsGuard, Provides information about content creators',
       'NewsGuard, score', 'NewsGuard, overall_class',
       'Pew Research Center, known_by_40%', 'Pew Research Center, total',
       'Pew Research Center, consistently_liberal',
       'Pew Research Center, mostly_liberal', 'Pew Research Center, mixed',
       'Pew Research Center, mostly conservative',
       'Pew Research Center, consistently conservative', 'Wikipedia, is_fake',
       'Open 

Notes on labeling conventions:
- NewsGuard, overall_class: 1 = good, 0 = bad
- Pew Research Center, total: 1 = trusted, 0 = undecided, -1 = not trusted
- Wikipedia, is_fake: 1 = fake
- Open Sources: number of tags
- Media Bias/Fact Check:
    - label: label (text)
    - factual_reporting: 1 (bad) to 5 (good)
    - everything else: 1 = true, 0 = false
- Allsides, bias_rating: label (text)
- BuzzFeed, leaning: label (text)
- PolitiFact: count per rating category

We will use the following fact-checker scores:
- NewsGuard, overall class & score
- Pew Research Center, total
- Media Bias/Fact Check, factual reporting score (1-5 scale)

In [9]:
# read in Josh's labels
fake_labels = pd.read_csv('Labels_done.csv', engine='python', encoding='utf-8')
len(fake_labels)

114

In [10]:
fake_labels.head()

Unnamed: 0.1,Unnamed: 0,"NewsGuard, score","NewsGuard, overall_class","Pew Research Center, total","Media Bias / Fact Check, label","Media Bias / Fact Check, factual_reporting",SCORE 1,SCORE 2,SCORE 3,SCORE 4,SCORE 5,AVG SCORE,LABEL
0,21stCenturyWire,,"""NA""","""NA""",1.0,3.0,,,,1.0,0.4,0.7,1
1,ABC News,95.0,1,1,0.25,4.0,0.05,0.0,0.0,0.25,0.2,0.1,0
2,Activist Post,,"""NA""","""NA""",1.0,2.0,,,,1.0,0.6,0.8,1
3,Addicting Info,,"""NA""","""NA""",0.5,3.0,,,,0.5,0.4,0.45,0
4,Al Jazeera,52.0,0,-1,0.25,4.0,0.48,1.0,1.0,0.25,0.2,0.586,1
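The derived columns in this preview (`SCORE 1` to `SCORE 5`, `AVG SCORE`, `LABEL`) are not documented in the notebook. The five rows shown are consistent with the reading sketched here, which is an inference from those rows only and not a confirmed description of how the labels were built: each fact-checker score is rescaled to [0, 1] with 1 meaning least reliable, the available scores are averaged, and the average is thresholded. The function name `fake_score`, the mapping of a Pew score of 0 to 0.5, and the strict `> 0.5` cutoff are assumptions.

```python
from statistics import mean

def fake_score(ng_score=None, ng_class=None, pew_total=None,
               mbfc_label=None, mbfc_factual=None):
    """Return (average unreliability score, binary label) for one source.

    Missing inputs are passed as None and skipped in the average.
    """
    parts = [
        # NewsGuard score, 0-100 with 100 = best
        None if ng_score is None else (100 - ng_score) / 100,
        # NewsGuard overall class, 1 = good, 0 = bad
        None if ng_class is None else 1 - ng_class,
        # Pew total, 1 = trusted, -1 = not trusted (0 -> 0.5 is an assumption)
        None if pew_total is None else (1 - pew_total) / 2,
        # Media Bias / Fact Check label, already numeric in the preview
        mbfc_label,
        # Media Bias / Fact Check factual reporting, 1 = bad, 5 = good
        None if mbfc_factual is None else (5 - mbfc_factual) / 5,
    ]
    available = [p for p in parts if p is not None]
    avg = mean(available)
    return avg, int(avg > 0.5)  # assumed cutoff

# Inputs copied from the preview rows
abc_avg, abc_label = fake_score(95, 1, 1, 0.25, 4)
alj_avg, alj_label = fake_score(52, 0, -1, 0.25, 4)
wire_avg, wire_label = fake_score(mbfc_label=1.0, mbfc_factual=3)
addict_avg, addict_label = fake_score(mbfc_label=0.5, mbfc_factual=3)
print(round(abc_avg, 3), abc_label)
print(round(alj_avg, 3), alj_label)
```

Under these assumptions the sketch gives the `AVG SCORE` and `LABEL` values shown for ABC News, Al Jazeera, 21stCenturyWire, and Addicting Info.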


In [11]:
fake_labels = fake_labels[['Unnamed: 0', 'LABEL']]
fake_labels.head()

Unnamed: 0.1,Unnamed: 0,LABEL
0,21stCenturyWire,1
1,ABC News,0
2,Activist Post,1
3,Addicting Info,0
4,Al Jazeera,1


In [12]:
# list the sources labeled as fake news (LABEL == 1)
print(*fake_labels.loc[fake_labels['LABEL'] == 1, 'Unnamed: 0'], sep='\n')

21stCenturyWire
Activist Post
Al Jazeera
Bipartisan Report
Breitbart
DC Gazette
Daily Kos
Daily Mail
Drudge Report
Freedom Daily
FrontPage Magazine
Infowars
Instapundit
Live Action
Palmer Report
Prison Planet
Sputnik
The Conservative Tree House
The D.C. Clothesline
The Gateway Pundit
The Political Insider
The Right Scoop
TheBlaze
True Activist
True Pundit
Western Journal


In [13]:
lean_labels = pd.read_csv('finished_labels.csv', engine='python', encoding='utf-8')
len(lean_labels)

114

In [14]:
lean_labels.head()

Unnamed: 0.1,Unnamed: 0,"Media Bias / Fact Check, label","Allsides, bias_rating","BuzzFeed, leaning",LEFT 1,LEFT 2,LEFT 3,AVG LEFT,LEFT LABEL,RIGHT 1,RIGHT 2,RIGHT 3,AVG RIGHT,RIGHT LABEL
0,21stCenturyWire,conspiracy_pseudoscience,,left,,,1.0,1.0,1,,,0.0,0.0,0
1,ABC News,left_center_bias,Lean Left,,0.5,0.5,,0.5,1,0.0,0.0,,0.0,0
2,Activist Post,conspiracy_pseudoscience,,left,,,1.0,1.0,1,,,0.0,0.0,0
3,Addicting Info,left_bias,,left,1.0,,1.0,1.0,1,0.0,,0.0,0.0,0
4,Al Jazeera,left_center_bias,Center,,0.5,0.0,,0.25,0,0.0,0.0,,0.0,0


In [15]:
lean_labels = lean_labels[['Unnamed: 0', 'AVG LEFT', 'AVG RIGHT']]
lean_labels.head()

Unnamed: 0.1,Unnamed: 0,AVG LEFT,AVG RIGHT
0,21stCenturyWire,1.0,0.0
1,ABC News,0.5,0.0
2,Activist Post,1.0,0.0
3,Addicting Info,1.0,0.0
4,Al Jazeera,0.25,0.0


In [16]:
# Remove sources with too few labels or with satire labels,
# i.e. keep only sources present in both label files
srces = set(fake_labels["Unnamed: 0"]).intersection(set(lean_labels['Unnamed: 0']))

# `articles` was loaded above; the filtered frame is called `articles_df` from here on
articles_df = articles[articles['news_source'].isin(srces)].reset_index(drop = True)
len(articles_df)

523105

In [17]:
# Attaching labels
both_labels = fake_labels.merge(lean_labels, on = 'Unnamed: 0')
rename_dict = {
    'LABEL': 'fake_news_binary',
    'AVG LEFT': 'left_bias_avg',
    'AVG RIGHT': 'right_bias_avg',
    'Unnamed: 0': 'News_Source'
}
both_labels = both_labels.rename(columns = rename_dict)
both_labels.head()

Unnamed: 0,News_Source,fake_news_binary,left_bias_avg,right_bias_avg
0,21stCenturyWire,1,1.0,0.0
1,ABC News,0,0.5,0.0
2,Activist Post,1,1.0,0.0
3,Addicting Info,0,1.0,0.0
4,Al Jazeera,1,0.25,0.0


In [18]:
articles_df = articles_df.merge(both_labels,
                                left_on = 'news_source',
                                right_on = 'News_Source').drop('News_Source', axis = 1)
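A merge like this one silently drops articles whose source has no label (inner join is the default) and duplicates articles if a source appears twice in the label table. A hedged sketch of how both could be checked, on hypothetical toy frames rather than the real data, uses the `validate` and `indicator` options of `DataFrame.merge`:

```python
import pandas as pd

# Hypothetical toy data: source "C" has no label
toy_articles = pd.DataFrame({
    "news_source": ["A", "B", "C"],
    "title": ["t1", "t2", "t3"],
})
toy_labels = pd.DataFrame({
    "News_Source": ["A", "B"],
    "fake_news_binary": [0, 1],
})

checked = toy_articles.merge(
    toy_labels,
    left_on="news_source",
    right_on="News_Source",
    how="left",          # keep unlabeled articles so they can be inspected
    validate="m:1",      # error if a source has more than one label row
    indicator=True,      # adds a "_merge" column describing each row's match
)

# Sources that found no match in the label table
unmatched = checked.loc[checked["_merge"] == "left_only", "news_source"].unique().tolist()
print(unmatched)
```

In the notebook itself the earlier filtering step should leave no unmatched sources, so a check of this kind would be expected to return an empty list.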

In [19]:
# creating net bias score: right minus left, so negative = left-leaning, positive = right-leaning
articles_df["net_bias"] = articles_df["right_bias_avg"] - articles_df["left_bias_avg"]

In [20]:
articles_df.head()

Unnamed: 0,news_source,pub_date,title,text,fake_news_binary,left_bias_avg,right_bias_avg,net_bias
0,Talking Points Memo,2018-07-14,Trump Admin Will Start Reunifying Older Kids W...,The Trump administration says it expects to be...,0,1.0,0.0,-1.0
1,Talking Points Memo,2018-07-14,Louie Gohmert Is Our Duke Of The Week,As our Allegra Kirkland and Tierney Sneed have...,0,1.0,0.0,-1.0
2,Talking Points Memo,2018-07-14,Top State Election Officials Gather Amid Secur...,Trump has never condemned Russia over its medd...,0,1.0,0.0,-1.0
3,Talking Points Memo,2018-07-14,Indictment Russia Hack Targeted Clinton Emails...,"In a July 27, 2016, speech, then-candidate Don...",0,1.0,0.0,-1.0
4,Talking Points Memo,2018-07-14,Nielsen No Indication Russia Targeting 2018 El...,PHILADELPHIA (AP) The U.S. homeland security ...,0,1.0,0.0,-1.0


In [21]:
# add column for outlier detection fake news labels (inlier = 1, outlier = -1)
# fake (1) -> outlier (-1), factual (0) -> inlier (1); any other value is left unchanged
articles_df['fake_news_outlier'] = articles_df['fake_news_binary'].replace({1: -1, 0: 1})
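The inlier = 1 / outlier = -1 coding matches the convention used by scikit-learn's outlier detectors such as `IsolationForest` and `OneClassSVM`, whose `predict` methods return +1 for inliers and -1 for outliers. Whether those estimators are used later is an assumption here. A small sketch on hypothetical toy values shows the mapping in both directions, so that detector-style predictions can be compared against `fake_news_binary`:

```python
import pandas as pd

# Hypothetical binary labels: 1 = fake, 0 = factual
toy_binary = pd.Series([0, 1, 0, 1])

# Binary label -> outlier coding (factual = inlier = 1, fake = outlier = -1)
toy_outlier = toy_binary.map({0: 1, 1: -1})

# Hypothetical detector output in the same +1 / -1 coding
toy_predictions = pd.Series([1, -1, -1, 1])

# Outlier coding -> binary label, for comparison with fake_news_binary
toy_predicted_binary = (toy_predictions == -1).astype(int)

print(toy_outlier.tolist())
print(toy_predicted_binary.tolist())
```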

In [22]:
articles_df['fake_news_binary'].value_counts()

0    392755
1    130350
Name: fake_news_binary, dtype: int64
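These counts imply that roughly one quarter of the labeled articles belong to the fake-news class. The arithmetic can be checked directly from the printed counts; the variable names below are illustrative:

```python
# Class counts copied from the value_counts() output above
counts = {0: 392755, 1: 130350}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

print(total)
for label, share in shares.items():
    print(label, round(share, 3))
```

The total equals the number of articles left after filtering. With classes this uneven, plain accuracy can be misleading, so it may be worth considering metrics or sampling schemes that account for the imbalance.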