# Create Outcome Variable: Fake News or Factual?

Read in the subset of news source labels for sources that appear in our random sample of news articles, then compute the outcome variable for each news source from the available set of fact-checker labels. This is an instance of weakly supervised learning: the labels are assigned not to individual observations but to groups of observations (in this case, articles grouped by news source).

## Import modules

In [10]:
import pandas as pd
#import re
#from nltk.help import upenn_tagset
#from nltk import word_tokenize, pos_tag
#from nltk.stem import WordNetLemmatizer
#from nltk.corpus import wordnet
#from sklearn.feature_extraction.text import CountVectorizer

## Load data

In [11]:
labels = pd.read_csv("./labels_subset.csv", index_col = 0)
labels.head()

Unnamed: 0,news_source,"NewsGuard, Does not repeatedly publish false content","NewsGuard, Gathers and presents information responsibly","NewsGuard, Regularly corrects or clarifies errors","NewsGuard, Handles the difference between news and opinion responsibly","NewsGuard, Avoids deceptive headlines","NewsGuard, Website discloses ownership and financing","NewsGuard, Clearly labels advertising","NewsGuard, Reveals who's in charge, including any possible conflicts of interest","NewsGuard, Provides information about content creators",...,"Allsides, community_agree","Allsides, community_disagree","Allsides, community_label","BuzzFeed, leaning","PolitiFact, Pants on Fire!","PolitiFact, False","PolitiFact, Mostly False","PolitiFact, Half-True","PolitiFact, Mostly True","PolitiFact, True"
0,21stCenturyWire,,,,,,,,,,...,,,,left,,,,,,
1,Al Jazeera,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,2204.0,3249.0,somewhat disagree,,0.0,1.0,0.0,0.0,0.0,0.0
2,Alternet,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1314.0,595.0,strongly agree,left,,,,,,
3,BBC,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7527.0,7177.0,somewhat agree,,,,,,,
4,BBC UK,,,,,,,,,,...,,,,,,,,,,


In [12]:
len(labels)

113

In [53]:
articles = pd.read_csv('./articles_df.csv', index_col = 0)
articles.head()

Unnamed: 0,news_source,pub_date,title,text,clean_txt,POS_tags
0,The Gateway Pundit,2018-07-14,REPORT House Conservatives Prepare to Impeach ...,House GOP lawmakers are preparing to push to i...,House GOP lawmaker be prepare push impeach Dep...,NNP NNP NNS VBP VBG TO NN TO NN NN NN NNP NN N...
1,oann,2018-03-24,French policeman who took place of hostage die...,PARIS (Reuters) A gendarme who was shot three...,PARIS Reuters gendarme who be shot time volunt...,NN ( NNS ) DT NN WP VBD NN CD NNS IN RB VBG DT...
2,New York Daily News,2018-03-24,Attorney for Roy Moore accuser was offered 10G...,"An attorney for Leigh Corfman, a woman who acc...",attorney Leigh Corfman woman who accuse fail S...,"DT NN IN JJ NN , DT NN WP VBN VBD NNP NNP NN N..."
3,Sputnik,2018-03-23,Martin Vizcarra is New Peruvian President Afte...,Martin Vizcarra is sworn in as Peruvian presid...,Martin Vizcarra be sworn Peruvian president pr...,NN NN VBZ NN IN IN JJ NN IN NN VBN IN NN NNS ....
4,oann,2018-04-02,Oil falls 2 percent on Russia output rise pote...,NEW YORK (Reuters) Oil fell by more than 2 pe...,NEW YORK Reuters Oil fell more percent Monday ...,"NN NN ( NNS ) NN VBD IN RBR IN CD NN IN NNP , ..."


In [14]:
len(articles)

498

## Create target variable

In [15]:
# view all veracity score columns
labels.columns

Index(['news_source', 'NewsGuard, Does not repeatedly publish false content',
       'NewsGuard, Gathers and presents information responsibly',
       'NewsGuard, Regularly corrects or clarifies errors',
       'NewsGuard, Handles the difference between news and opinion responsibly',
       'NewsGuard, Avoids deceptive headlines',
       'NewsGuard, Website discloses ownership and financing',
       'NewsGuard, Clearly labels advertising',
       'NewsGuard, Reveals who's in charge, including any possible conflicts of interest',
       'NewsGuard, Provides information about content creators',
       'NewsGuard, score', 'NewsGuard, overall_class',
       'Pew Research Center, known_by_40%', 'Pew Research Center, total',
       'Pew Research Center, consistently_liberal',
       'Pew Research Center, mostly_liberal', 'Pew Research Center, mixed',
       'Pew Research Center, mostly conservative',
       'Pew Research Center, consistently conservative', 'Wikipedia, is_fake',
       'Open 

Notes on labeling conventions:
- NewsGuard, overall_class: 1 = good, 0 = bad
- Pew Research Center, total: 1 = trusted, 0 = undecided, -1 = not trusted
- Wikipedia, is_fake: 1 = fake
- Open Sources: number of tags
- Media Bias/Fact Check:
    - label: label (text)
    - factual_reporting: 1 (bad) to 5 (good)
    - everything else: 1 = true, 0 = false
- Allsides, bias_rating: label (text)
- BuzzFeed, leaning: label (text)
- PolitiFact: counts per rating category

We will use the following fact-checker scores:
- NewsGuard, overall class
- Pew Research Center, total
- Open Sources:
    - reliable
    - fake
    - unreliable
    - conspiracy
    - junksci
- Media Bias/Fact Check
    - label (conspiracy_pseudoscience and questionable_source are treated as unreliable)
    - factual reporting score (1-5 scale)
- Wikipedia, fake label
- PolitiFact (all)

In [16]:
# pull out select veracity score columns
labels_select = pd.concat([labels.loc[:, ['news_source', 'NewsGuard, overall_class', 'Pew Research Center, total', 'Open Sources, conspiracy', 'Open Sources, junksci', 'Media Bias / Fact Check, label', 'Media Bias / Fact Check, factual_reporting']],
                           labels.loc[:, 'Wikipedia, is_fake':'Open Sources, unreliable'],
                           labels.loc[:, 'PolitiFact, Pants on Fire!':'PolitiFact, True']],
                          axis = 1)
labels_select.head()

Unnamed: 0,news_source,"NewsGuard, overall_class","Pew Research Center, total","Open Sources, conspiracy","Open Sources, junksci","Media Bias / Fact Check, label","Media Bias / Fact Check, factual_reporting","Wikipedia, is_fake","Open Sources, reliable","Open Sources, fake","Open Sources, unreliable","PolitiFact, Pants on Fire!","PolitiFact, False","PolitiFact, Mostly False","PolitiFact, Half-True","PolitiFact, Mostly True","PolitiFact, True"
0,21stCenturyWire,,,1.0,,conspiracy_pseudoscience,3.0,,,,,,,,,,
1,Al Jazeera,0.0,-1.0,,,left_center_bias,4.0,,,,,0.0,1.0,0.0,0.0,0.0,0.0
2,Alternet,1.0,,,,left_bias,3.0,,2.0,,,,,,,,
3,BBC,1.0,1.0,,,left_center_bias,4.0,,,,,,,,,,
4,BBC UK,,,,,,,,,,,,,,,,


In [17]:
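# NewsGuard
# NOTE: overall_class uses 1 = good, 0 = bad (see the notes above), so this loop
# maps good -> -1 and bad -> 1. That is the opposite sign from the other scores
# below, where -1 marks fake/unreliable. The outputs shown reflect the code as written.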
newsguard = []
for el in labels_select['NewsGuard, overall_class']:
    if el == 1:
        newsguard.append(-1)
    elif el == 0:
        newsguard.append(1)
    else:
        newsguard.append(None)

labels_select['newsguard'] = newsguard
labels_select['newsguard']

0      NaN
1      1.0
2     -1.0
3     -1.0
4      NaN
      ... 
108    NaN
109    NaN
110    NaN
111    NaN
112    NaN
Name: newsguard, Length: 113, dtype: float64
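
The recoding loop above can also be written as a single lookup. This is a minimal sketch, assuming a toy Series (hypothetical values) in place of `labels_select['NewsGuard, overall_class']`; it reproduces the mapping exactly as coded in the cell (1 to -1, 0 to 1, anything else missing) and takes no position on whether that sign is the intended one.

```python
import numpy as np
import pandas as pd

# Toy stand-in for labels_select['NewsGuard, overall_class'] (hypothetical values).
overall_class = pd.Series([np.nan, 0.0, 1.0, 1.0, np.nan])

# Same mapping as the loop: 1 -> -1, 0 -> 1; values not in the dict become NaN.
newsguard_vec = overall_class.map({1.0: -1.0, 0.0: 1.0})
print(newsguard_vec.tolist())
```

`Series.map` with a dict leaves unmatched and missing values as NaN, which matches the `None` branch of the loop.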

In [18]:
# Wikipedia
wiki = []
for el in labels_select['Wikipedia, is_fake']:
    if el == 1:
        wiki.append(-1)
    else:
        wiki.append(None)
labels_select['wiki'] = wiki
labels_select['wiki']

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
       ..
108   NaN
109   NaN
110   NaN
111   NaN
112   NaN
Name: wiki, Length: 113, dtype: float64

In [19]:
# PolitiFact
politifact = []
for el in range(len(labels_select)):
    # x == x is False only for NaN, so this skips sources without PolitiFact counts
    if labels_select['PolitiFact, Pants on Fire!'][el] == labels_select['PolitiFact, Pants on Fire!'][el]:
        # weighted sum: Pants on Fire! and False = -1, Mostly False = -0.5,
        # Half-True = 0, Mostly True = 0.5, True = 1
        num = -labels_select['PolitiFact, Pants on Fire!'][el] - labels_select['PolitiFact, False'][el] - 0.5*labels_select['PolitiFact, Mostly False'][el] + 0.5*labels_select['PolitiFact, Mostly True'][el] + labels_select['PolitiFact, True'][el]
        # normalize by the total count across all six categories
        num = num/(labels_select['PolitiFact, Pants on Fire!'][el] + labels_select['PolitiFact, False'][el] + labels_select['PolitiFact, Mostly False'][el] + labels_select['PolitiFact, Half-True'][el] + labels_select['PolitiFact, Mostly True'][el] + labels_select['PolitiFact, True'][el])
        politifact.append(num)
    else:
        politifact.append(None)
politifact

# anything with a score of less than 0 will be categorized as fake (-1);
# anything with a score of greater than 0 will be categorized as reliable (1);
# sources with a score equal to 0 will be considered undecided (0)
pfact_cat = []
for el in politifact:
    if el != None:
        if el > 0:
            pfact_cat.append(1)
        elif el < 0:
            pfact_cat.append(-1)
        elif el == 0:
            pfact_cat.append(0)
    else:
        pfact_cat.append(el)

labels_select['politifact'] = pfact_cat
labels_select['politifact']

0      NaN
1     -1.0
2      NaN
3      NaN
4      NaN
      ... 
108    NaN
109    NaN
110    NaN
111    NaN
112    NaN
Name: politifact, Length: 113, dtype: float64
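
The PolitiFact score above is a weighted average of the six counts, reduced to its sign. A vectorized sketch of the same idea follows, assuming toy counts (hypothetical values), shortened column names, and that a source has either all six counts or none. It also guards against a zero total, which the loop does not handle.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the six PolitiFact count columns (hypothetical values).
cols = ['Pants on Fire!', 'False', 'Mostly False', 'Half-True', 'Mostly True', 'True']
counts = pd.DataFrame(
    [[np.nan] * 6,          # source with no PolitiFact counts
     [0, 1, 0, 0, 0, 0],    # a single "False" rating
     [0, 0, 1, 1, 2, 3]],   # mostly accurate ratings
    columns=cols, dtype=float)

# Weights taken from the loop above.
weights = pd.Series([-1.0, -1.0, -0.5, 0.0, 0.5, 1.0], index=cols)

# min_count=1 keeps the result NaN when every count in the row is missing.
total = counts.sum(axis=1, min_count=1)
weighted = counts.mul(weights, axis=1).sum(axis=1, min_count=1)

# Mask a zero total so the division yields NaN instead of dividing by zero.
score = weighted / total.where(total > 0)

# -1 = fake, 0 = undecided, 1 = reliable; NaN stays NaN.
politifact_vec = np.sign(score)
print(politifact_vec.tolist())
```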

In [20]:
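# Open Sources
# score = (reliable tags - unreliable, fake, conspiracy and junksci tags) / total tags;
# x == x is False only for NaN, so missing tag counts are skipped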
open_sources = []
for el in range(len(labels_select)):
    num = None
    den = 0
    # reliable
    if labels_select['Open Sources, reliable'][el] == labels_select['Open Sources, reliable'][el]:
        num = labels_select['Open Sources, reliable'][el]
        den += labels_select['Open Sources, reliable'][el]
    # unreliable
    if labels_select['Open Sources, unreliable'][el] == labels_select['Open Sources, unreliable'][el]:
        if num == None:
            num = -labels_select['Open Sources, unreliable'][el]
        else:
            num -= labels_select['Open Sources, unreliable'][el]
        den += labels_select['Open Sources, unreliable'][el]
    # fake
    if labels_select['Open Sources, fake'][el] == labels_select['Open Sources, fake'][el]:
        if num == None:
            num = -labels_select['Open Sources, fake'][el]
        else:
            num -= labels_select['Open Sources, fake'][el]
        den += labels_select['Open Sources, fake'][el]
    # conspiracy
    if labels_select['Open Sources, conspiracy'][el] == labels_select['Open Sources, conspiracy'][el]:
        if num == None:
            num = -labels_select['Open Sources, conspiracy'][el]
        else:
            num -= labels_select['Open Sources, conspiracy'][el]
        den += labels_select['Open Sources, conspiracy'][el]
    # junk science
    if labels_select['Open Sources, junksci'][el] == labels_select['Open Sources, junksci'][el]:
        if num == None:
            num = -labels_select['Open Sources, junksci'][el]
        else:
            num -= labels_select['Open Sources, junksci'][el]
        den += labels_select['Open Sources, junksci'][el]
    # check if num is none
    if num != None:
        # normalize by number of tags
        num = num/den
    open_sources.append(num)

labels_select['open_sources'] = open_sources
labels_select['open_sources']

0     -1.0
1      NaN
2      1.0
3      NaN
4      NaN
      ... 
108    NaN
109    1.0
110    NaN
111    NaN
112    NaN
Name: open_sources, Length: 113, dtype: float64
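
The Open Sources score can be read as a signed tag count divided by the total tag count. A minimal sketch of that reading, assuming toy tag counts (hypothetical values) and shortened column names:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the five 'Open Sources, ...' tag-count columns (hypothetical values).
tags = pd.DataFrame({
    'reliable':   [np.nan, np.nan, 2.0,    1.0],
    'unreliable': [np.nan, np.nan, np.nan, 1.0],
    'fake':       [np.nan, np.nan, np.nan, np.nan],
    'conspiracy': [1.0,    np.nan, np.nan, 2.0],
    'junksci':    [np.nan, np.nan, np.nan, np.nan],
})

# 'reliable' counts in favor of the source, every other tag against it.
signs = pd.Series({'reliable': 1.0, 'unreliable': -1.0, 'fake': -1.0,
                   'conspiracy': -1.0, 'junksci': -1.0})

# Missing tags are skipped; a row with no tags at all stays NaN (min_count=1).
num = tags.mul(signs, axis=1).sum(axis=1, min_count=1)
den = tags.sum(axis=1, min_count=1)
open_sources_vec = num / den
print(open_sources_vec.tolist())
```

Unlike the other scores, this one is left as a value in [-1, 1] rather than reduced to its sign, as in the cell above.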

In [21]:
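# Media Bias/Fact Check
# -1 if the label is conspiracy_pseudoscience or questionable_source, or if
# factual_reporting < 3; 1 if factual_reporting >= 3; missing otherwise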
media_bias = []
for i in range(len(labels_select)):
    if (labels_select['Media Bias / Fact Check, label'][i] == 'conspiracy_pseudoscience') or (labels_select['Media Bias / Fact Check, label'][i] == 'questionable_source'):
        media_bias.append(-1)
    else:
        if labels_select['Media Bias / Fact Check, factual_reporting'][i] < 3:
            media_bias.append(-1)
        elif labels_select['Media Bias / Fact Check, factual_reporting'][i] >= 3:
            media_bias.append(1)
        else:
            media_bias.append(None)

labels_select['mediabias'] = media_bias
labels_select['mediabias']

0     -1.0
1      1.0
2      1.0
3      1.0
4      NaN
      ... 
108    1.0
109    NaN
110    NaN
111   -1.0
112    NaN
Name: mediabias, Length: 113, dtype: float64
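
The same rule can be expressed with `np.select`, which returns the choice for the first condition that holds. This is a sketch under the assumption of a small toy frame (hypothetical rows) with shortened column names:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two Media Bias / Fact Check columns (hypothetical rows).
mbfc = pd.DataFrame({
    'label': ['conspiracy_pseudoscience', 'left_center_bias', 'right_bias', None],
    'factual_reporting': [3.0, 4.0, 2.0, np.nan],
})

flagged = mbfc['label'].isin(['conspiracy_pseudoscience', 'questionable_source'])
low = mbfc['factual_reporting'] < 3     # comparisons with NaN are False
high = mbfc['factual_reporting'] >= 3

# Flagged labels win over the factual score; rows matching nothing stay NaN.
mediabias_vec = pd.Series(
    np.select([flagged | low, high], [-1.0, 1.0], default=np.nan),
    index=mbfc.index)
print(mediabias_vec.tolist())
```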

In [23]:
# take an explicit copy so that columns added later do not trigger SettingWithCopyWarning
our_labels = labels_select[['news_source', 'newsguard', 'wiki', 'mediabias', 'open_sources', 'politifact']].copy()
our_labels.head()

Unnamed: 0,news_source,newsguard,wiki,mediabias,open_sources,politifact
0,21stCenturyWire,,,-1.0,-1.0,
1,Al Jazeera,1.0,,1.0,,-1.0
2,Alternet,-1.0,,1.0,1.0,
3,BBC,-1.0,,1.0,,
4,BBC UK,,,,,


In [24]:
len(our_labels[our_labels['newsguard'].notnull() | our_labels['wiki'].notnull() | our_labels['mediabias'].notnull() | our_labels['open_sources'].notnull() | our_labels['politifact'].notnull()])

90
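
The count of sources with at least one score can also be obtained without chaining `|` across columns. A small sketch, assuming a toy frame (hypothetical values) with three of the five score columns:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the score columns of our_labels (hypothetical values).
scores = pd.DataFrame({
    'newsguard': [np.nan, 1.0, np.nan],
    'wiki':      [np.nan, np.nan, np.nan],
    'mediabias': [-1.0, 1.0, np.nan],
})

# A row counts as labeled if any of its score columns is non-missing.
n_labeled = int(scores.notna().any(axis=1).sum())
print(n_labeled)
```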

In [25]:
# compute average across newsguard, wikipedia, media bias, open sources, and politifact scores
ave_score = []
for i in range(len(our_labels)):
    den = 0
    num = None
    # newsguard
    if our_labels['newsguard'][i] == our_labels['newsguard'][i]:
        num = our_labels['newsguard'][i]
        den += 1
    # wikipedia
    if our_labels['wiki'][i] == our_labels['wiki'][i]:
        if num != None:
            num += our_labels['wiki'][i]
        else:
            num = our_labels['wiki'][i]
        den += 1
    # media bias
    if our_labels['mediabias'][i] == our_labels['mediabias'][i]:
        if num != None:
            num += our_labels['mediabias'][i]
        else:
            num = our_labels['mediabias'][i]
        den += 1
    # open sources
    if our_labels['open_sources'][i] == our_labels['open_sources'][i]:
        if num != None:
            num += our_labels['open_sources'][i]
        else:
            num = our_labels['open_sources'][i]
        den += 1
    # politifact
    if our_labels['politifact'][i] == our_labels['politifact'][i]:
        if num != None:
            num += our_labels['politifact'][i]
        else:
            num = our_labels['politifact'][i]
        den += 1
    
    if num != None:
        num = num/den
    ave_score.append(num)

our_labels['ave_score'] = ave_score


In [27]:
for i in range(len(our_labels)):
    if our_labels['ave_score'][i] == 0:
        print(our_labels['news_source'][i])

BBC
CBS News
CNBC
Chicago Sun-Times
Crooks and Liars
Daily Beast
Daily Signal
Foreign Policy
Fortune
Media Matters for America
Mercury News
NPR
New York Daily News
PBS
Pravada Report
Raw Story
Real Clear Politics
Reuters
Talking Points Memo
The American Conservative
The Atlantic
The Denver Post
The Gateway Pundit
The Huffington Post
The Independent
The Political Insider
The Verge
ThinkProgress
USA Today
Vox
Washington Post
Yahoo News


In [28]:
for i in range(len(our_labels)):
    if our_labels['ave_score'][i] < 0:
        print(our_labels['news_source'][i])

21stCenturyWire
Bipartisan Report
Breitbart
Buzzfeed
CNN
CNS News
Daily Mail
Daily Stormer
Fox News
Infowars
Intellihub
LewRockwell
MSNBC
Natural News
New York Post
News Busters
Newswars
Politicus USA
Prison Planet
The Beaverton
The D.C. Clothesline
The Daily Caller
The New York Times
TheAntiMedia
True Pundit
Veterans Today
sott.net


In [36]:
for i in range(len(our_labels)):
    if our_labels['ave_score'][i] <= -0.33:
        print(our_labels['news_source'][i])

21stCenturyWire
Bipartisan Report
Breitbart
Buzzfeed
CNN
CNS News
Daily Mail
Daily Stormer
Fox News
Infowars
Intellihub
LewRockwell
MSNBC
Natural News
New York Post
News Busters
Newswars
Politicus USA
Prison Planet
The Beaverton
The D.C. Clothesline
The Daily Caller
The New York Times
TheAntiMedia
True Pundit
Veterans Today
sott.net


In [29]:
for i in range(len(our_labels)):
    if our_labels['ave_score'][i] <= -0.5:
        print(our_labels['news_source'][i])

21stCenturyWire
Breitbart
CNS News
Daily Stormer
Infowars
Intellihub
LewRockwell
Natural News
News Busters
Newswars
Politicus USA
Prison Planet
The Beaverton
The D.C. Clothesline
Veterans Today
sott.net


In [37]:
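# if average score is greater than or equal to -0.5 --> "reliable" class (=0);
# if it is strictly less than -0.5 --> "unreliable" class (=1)
# (a score of exactly -0.5 therefore falls in the "reliable" class)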
our_class = []
for el in ave_score:
    k = None
    if el != None:
        if el < -0.5:
            k = 1
        else:
            k = 0
    our_class.append(k)

c = 0
for el in ave_score:
    if el == 0:
        c += 1
c

32

In [38]:
c = 0
for el in our_class:
    if el == 0:
        c += 1
c

76

In [40]:
c/len(our_labels[our_labels['newsguard'].notnull() | our_labels['wiki'].notnull() | our_labels['mediabias'].notnull() | our_labels['open_sources'].notnull() | our_labels['politifact'].notnull()])*100

84.44444444444444

In [41]:
our_labels['binary_target'] = our_class


In [42]:
our_labels.head()

Unnamed: 0,news_source,newsguard,wiki,mediabias,open_sources,politifact,ave_score,binary_target
0,21stCenturyWire,,,-1.0,-1.0,,-1.0,1.0
1,Al Jazeera,1.0,,1.0,,-1.0,0.333333,0.0
2,Alternet,-1.0,,1.0,1.0,,0.333333,0.0
3,BBC,-1.0,,1.0,,,0.0,0.0
4,BBC UK,,,,,,,
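
The `ave_score` and `binary_target` columns shown above can be reproduced with a row-wise mean that skips missing values, followed by the strict `< -0.5` threshold. A minimal sketch, using the five rows printed above as toy input:

```python
import numpy as np
import pandas as pd

# The five rows of our_labels shown above, score columns only.
scores = pd.DataFrame({
    'newsguard':    [np.nan, 1.0, -1.0, -1.0, np.nan],
    'wiki':         [np.nan] * 5,
    'mediabias':    [-1.0, 1.0, 1.0, 1.0, np.nan],
    'open_sources': [-1.0, np.nan, 1.0, np.nan, np.nan],
    'politifact':   [np.nan, -1.0, np.nan, np.nan, np.nan],
})

# Row-wise mean over the available scores; a row with no scores stays NaN.
ave = scores.mean(axis=1)

# 1 = unreliable (strictly below -0.5), 0 = reliable, NaN where no score exists.
binary = (ave < -0.5).astype(float).where(ave.notna())
print(pd.DataFrame({'ave_score': ave, 'binary_target': binary}))
```

Because the comparison is strict, a source whose average is exactly -0.5 lands in the reliable class, as in the loop.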


In [44]:
class_df = our_labels[['news_source', 'binary_target']]
class_df

Unnamed: 0,news_source,binary_target
0,21stCenturyWire,1.0
1,Al Jazeera,0.0
2,Alternet,0.0
3,BBC,0.0
4,BBC UK,
...,...,...
108,iPolitics,0.0
109,oann,0.0
110,rferl,
111,sott.net,1.0


## Merge with articles dataframe

In [49]:
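# inner join on news_source: articles from sources absent from class_df are dropped
articles = articles.merge(class_df, on = 'news_source', how = 'inner')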

# get news sources where we don't have a class label
articles[articles['binary_target'].isnull()]['news_source'].unique()

418

In [56]:
# drop rows where we don't have a label
articles = articles[articles['binary_target'].notnull()].reset_index(drop = True)
print(f'{len(articles)} articles in our dataset come from news sources for which we have a label.')

418 articles in our dataset come from news sources for which we have a label.
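
To see which articles lose their source during the join, one option is a left merge with `indicator=True`, which is a different join type from the inner merge used in this notebook. The sketch below assumes toy frames; `toy_articles`, `toy_classes`, and the source name `Unknown Blog` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for articles and class_df (hypothetical rows).
toy_articles = pd.DataFrame({
    'news_source': ['oann', 'oann', 'BBC UK', 'Unknown Blog'],
    'title': ['a', 'b', 'c', 'd'],
})
toy_classes = pd.DataFrame({
    'news_source': ['oann', 'BBC UK'],
    'binary_target': [0.0, np.nan],
})

# Left join keeps every article; validate checks each source has one label row.
merged = toy_articles.merge(toy_classes, on='news_source', how='left',
                            validate='many_to_one', indicator=True)

# Sources present in the articles but absent from the label table.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'news_source'].unique()

# Keep only articles whose source has a class label.
labeled = (merged[merged['binary_target'].notna()]
           .drop(columns='_merge')
           .reset_index(drop=True))
print(list(unmatched), len(labeled))
```

This separates two reasons an article can end up unlabeled: its source is missing from the label table, or the source is present but has no fact-checker score.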


In [57]:
articles['binary_target'].value_counts()

0.0    379
1.0     39
Name: binary_target, dtype: int64

In [46]:
# add column for outlier detection (inlier = 1, outlier = -1)
outlier = []
for el in articles['binary_target']:
    if el == 1:
        outlier.append(-1)
    elif el == 0:
        outlier.append(1)
    else:
        outlier.append(el)
articles['outlier_target'] = outlier
articles.head()

Unnamed: 0,news_source,pub_date,title,text,clean_txt,POS_tags,binary_target,outlier_target
0,The Gateway Pundit,2018-07-14,REPORT House Conservatives Prepare to Impeach ...,House GOP lawmakers are preparing to push to i...,House GOP lawmaker be prepare push impeach Dep...,NNP NNP NNS VBP VBG TO NN TO NN NN NN NNP NN N...,0.0,1.0
1,The Gateway Pundit,2018-05-20,FLASHBACK Barack Obama I Guarantee There Is No...,Barack Obama went on with Chris Wallace on FOX...,Barack Obama go Chris Wallace FOX News Sunday ...,NN NN VBD IN IN NN NN IN NN NNS NNP IN NNP CD ...,0.0,1.0
2,The Gateway Pundit,2018-08-14,Infowars Website Goes Down For Second Time Sin...,Infowars website run by Alex Jones went offlin...,Infowars website run Alex Jones go offline aga...,NNS NN VB IN NNP NNP VBD NN RB NNP IN DT JJ NN...,0.0,1.0
3,The Gateway Pundit,2018-08-15,DNC Releases Statement on Keith Ellison Domest...,Fellow Democrat and Minnesotan running against...,Fellow Democrat Minnesotan run Ellison AG prim...,"NN NNP CC NN VBG IN NN IN DT NNP NN , NN NN , ...",0.0,1.0
4,The Gateway Pundit,2018-07-18,Chuck Schumer Attacks Trump For Getting Cozy w...,Senator Chuck Schumer held a press conference ...,Senator Chuck Schumer held press conference fo...,NN NN NN NN DT NN NN VBG NNP NNS JJ VBN NN IN ...,0.0,1.0
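
The outlier recoding is a fixed lookup, so it can also be sketched with `Series.map`. The example assumes a toy Series (hypothetical values) in place of `articles['binary_target']`:

```python
import pandas as pd

# Toy stand-in for articles['binary_target'] (hypothetical values).
binary_target = pd.Series([0.0, 1.0, 0.0])

# unreliable (1) -> outlier (-1); reliable (0) -> inlier (1)
outlier_target = binary_target.map({1.0: -1, 0.0: 1})
print(outlier_target.tolist())
```

The inlier = 1, outlier = -1 coding matches the convention that scikit-learn's outlier detectors use for their predictions.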


In [None]:
# save updated articles data frame with labels; write over existing articles csv file
articles.to_csv('articles_df.csv')