# Fake News URL Datasets

There exist lists of fake news domains from three different data sources: **Politifacts**, **OpenSourceGroup**, and **Wikipedia**. 

In order to reliably identify fake news in our social media research studies, we want to reproducibly generate a data set of fake news domains from each of these data sources.

We therefore download and process the raw data from each data source in this Python notebook to generate de-duplicated lists of fake news domains for different research questions.

The process is as follows:

1. Download the different fake news URL datasets
2. Clean up data & remove duplicates
3. Compare different datasets
4. Output as .csv files


## 1) Download fake news URL datasets

The datasets have already been downloaded to this folder. The data sources are documented in the readme text files.
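
Because the cleanup cell reads every file from `data/<source>/<filename>`, an existence check before parsing can catch a missing download early. The following is a minimal sketch and not part of the original notebook; the helper name `missing_raw_files` and its `base_dir` parameter are assumptions introduced for illustration.

```python
import os

def missing_raw_files(data_sources, base_dir="data"):
    """Return the expected raw CSV paths that do not exist on disk."""
    missing = []
    for source, filenames in data_sources.items():
        for filename in filenames:
            path = os.path.join(base_dir, source, filename)
            if not os.path.isfile(path):
                missing.append(path)
    return missing

# hypothetical usage with one of the sources used in this notebook
example_sources = {"Politifacts": ["politifacts-raw-16Apr2020.csv"]}
for path in missing_raw_files(example_sources):
    print("missing:", path)
```

An empty result means every listed file is present in the expected folder layout.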


## 2) Clean up data & remove duplicates
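
The cleanup cell in this section reduces each entry to a bare domain by cutting it at the first `/`, `\`, `#`, or `?`. That works for entries without a scheme, but an entry such as `http://example.com/foo` would be cut down to `http:`. Whether the raw files contain such entries is not stated here, so the following is only a hedged alternative sketch based on the standard library's `urllib.parse.urlsplit`; the function name `normalize_domain` is hypothetical.

```python
from urllib.parse import urlsplit

def normalize_domain(entry):
    """Reduce a URL or domain string to a lower-case host without 'www.'."""
    # treat backslashes like forward slashes so the path is split off cleanly
    entry = entry.strip().replace("\\", "/")
    # urlsplit only recognises the host if a scheme or a leading '//' is present
    if "://" not in entry:
        entry = "//" + entry
    # .hostname is already lower-cased and excludes any port
    host = urlsplit(entry).hostname or ""
    # strip only a leading "www.", never an occurrence inside the name
    if host.startswith("www."):
        host = host[len("www."):]
    return host

print(normalize_domain("google.com/foo"))
print(normalize_domain("http://www.Example.com/a?b=1#c"))
```

Unlike the notebook's `clean_up_entry`, this sketch also drops a port number, which may or may not be desired for a given research question.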

In [1]:
import pandas as pd

data_sources = {
    "OpenSourceGroup": ["opensourcesgroup-raw-16Apr2020.csv"],
    "Politifacts": ["politifacts-raw-16Apr2020.csv"],
    "Wikipedia": ["wikipedia-fake-news-raw-16Apr2020.csv", "wikipedia-fake-news-usa-raw-16Apr2020.csv"]
}


dfs = {}

# function to clean up our domains
def clean_up_entry(str_domain_or_link):
    # if domain is google.com/foo then we only want google.com
    for sep in ['/', '\\', '#', '?']:
        if sep in str_domain_or_link:
            str_domain_or_link = str_domain_or_link.split(sep)[0]
            
    # make everything lowercase
    domain = str_domain_or_link.lower()
                
    # strip a leading "www." (only at the start, not inside the domain name)
    if domain.startswith("www."):
        domain = domain[len("www."):]
    
    return domain

# clean up a category (e.g. "fake news", "fake ", " fake" => "fake")
def clean_up_category(str_category):
    # lower-case and strip leading/trailing whitespace
    str_category = str_category.lower().strip()
    
    # only use first word
    first_part = str_category.split(" ")[0]
    
    return first_part

# remove categories that don't make sense for our "fake news"-focused study
bogus_categories = ["reliable", "political", "some", "state", "clickbait"]
def remove_bogus_categories(set_of_tuples):
    print("  remove_bogus_categories", len(set_of_tuples))
    retval = list(filter(lambda x: bool(x[1] not in bogus_categories), list(set_of_tuples)))
    print("   -> still remaining:", len(retval))
    return retval

# removes duplicate tuples, keeping the most important categories
category_importance = ["fake", "hate", "satire"]
def remove_duplicate_entries(list_of_tuples):
    unique_domains = set([x[0] for x in list_of_tuples])
    ret = []
    for domain in unique_domains:
        unique_categories = list(set([x[1] for x in list(filter(lambda x: x[0] == domain, list_of_tuples))]))
        if len(unique_categories) > 1:
            print("len > 1:", domain, unique_categories)
            selected_category = None
            for category in unique_categories:
                if not selected_category:
                    selected_category = category
                elif category not in category_importance:
                    continue # an unlisted category never replaces the current selection
                elif selected_category not in category_importance:
                    selected_category = category
                elif category_importance.index(category) < category_importance.index(selected_category):
                    selected_category = category
            print ("selected_category:", selected_category)
                
        else:
            selected_category = unique_categories[0]
        
        ret.append((domain, selected_category))                  
    return ret


# read all downloaded files into our data structure
for source in data_sources.keys():
    print("\nFrom", source + ":")
    for csv in data_sources[source]:
        # open + read csv file into pandas dataframe
        df = pd.read_csv("data/" + source + "/" + csv)
        
        # domains are always in first column of csv
        first_column = list(df[df.columns[0]])
        
        # clean up domains from rubbish data
        domains = list(map(clean_up_entry, first_column))
        
        # categories are always in second column of csv, 
        categories = list(df[df.columns[1]])
        
        # ... except for wikipedia, here everything should be seen as "fake"
        if source == "Wikipedia":
            categories = ["fake"] * len(domains)
        else:
            # clean up categories
            categories = list(map(clean_up_category, categories))
        
        print("  -", len(domains), "entries in", csv)
        
        # zip categories and domains into tuples
        new_data = list(zip(domains, categories))
        
        # store the data (merge with other data for same source if needed)
        if source in dfs.keys():
            dfs[source].extend(new_data)
        else:
            dfs[source] = new_data
    
    # remove duplicate tuples (sorting first is unnecessary, a set is unordered)
    dfs[source] = set(dfs[source])
    
    # remove bogus categories
    dfs[source] = remove_bogus_categories(dfs[source])
    
    # remove duplicate entries
    dfs[source] = remove_duplicate_entries(dfs[source])
    
    print("  => unique fake news domains:\t", len(dfs[source]))


From OpenSourceGroup:
  - 833 entries in opensourcesgroup-raw-16Apr2020.csv
  remove_bogus_categories 827
   -> still remaining: 723
len > 1: patriotnewsdaily.com ['bias', 'satire']
selected_category: satire
len > 1: centerforsecuritypolicy.org ['bias', 'hate']
selected_category: hate
len > 1: madworldnews.com ['fake', 'unreliable']
selected_category: fake
  => unique fake news domains:	 720

From Politifacts:
  - 327 entries in politifacts-raw-16Apr2020.csv
  remove_bogus_categories 327
   -> still remaining: 277
len > 1: civictribune.com ['fake', 'imposter']
selected_category: fake
  => unique fake news domains:	 276

From Wikipedia:
  - 82 entries in wikipedia-fake-news-raw-16Apr2020.csv
  - 18 entries in wikipedia-fake-news-usa-raw-16Apr2020.csv
  remove_bogus_categories 88
   -> still remaining: 88
  => unique fake news domains:	 88
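
The conflicts printed above are resolved by priority: `fake` beats `hate`, which beats `satire`, and any listed category beats an unlisted one. A compact way to express that rule is sketched below. It is an illustrative rewrite rather than the notebook's code; the name `pick_category` is hypothetical, and ties between two unlisted categories are broken alphabetically here, whereas the notebook keeps whichever one it encounters first.

```python
CATEGORY_IMPORTANCE = ["fake", "hate", "satire"]

def pick_category(categories, importance=CATEGORY_IMPORTANCE):
    """Pick the highest-priority category; unlisted categories rank last."""
    def rank(category):
        if category in importance:
            return importance.index(category)
        return len(importance)
    # sort first so ties between unlisted categories resolve alphabetically
    return min(sorted(categories), key=rank)

# the three conflicts reported for OpenSourceGroup above
print(pick_category(["bias", "satire"]))
print(pick_category(["bias", "hate"]))
print(pick_category(["fake", "unreliable"]))
```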


## 3) Compare different datasets

Now that we have parsed the data, cleaned it up and removed the duplicates, we want to compare the datasets.

We want to find the **fake news domains common to all datasets**.
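
As a small illustration of the idea, the sketch below intersects the domain sets of three toy sources. The source names and `.example` domains are made up for this example and do not come from the real data.

```python
from functools import reduce

# hypothetical (domain, category) tuples per source, for illustration only
toy_sources = {
    "SourceA": [("a.example", "fake"), ("b.example", "satire")],
    "SourceB": [("a.example", "fake"), ("c.example", "fake")],
    "SourceC": [("a.example", "hate"), ("b.example", "fake")],
}

# keep only the domain of each tuple, one set per source
domain_sets = [{domain for domain, _ in entries} for entries in toy_sources.values()]

# a domain is "common" only if it appears in every source
common = reduce(set.intersection, domain_sets)
print(sorted(common))
```

Note that the category is ignored at this stage: a domain counts as common even if the sources label it differently.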

In [2]:
from functools import reduce

# calculate list of common domains across all datasets
common_domains_across_all_datasets = reduce(set.intersection, [set([y[0] for y in x]) for x in dfs.values()])

print("Common domains across all data sets:",  len(common_domains_across_all_datasets))
print()
for domain in common_domains_across_all_datasets:
    print("  -", domain)

Common domains across all data sets: 29

  - americannews.com
  - now8news.com
  - disclose.tv
  - washingtonpost.com.co
  - newsexaminer.net
  - beforeitsnews.com
  - nationalreport.net
  - usadailyinfo.com
  - huzlers.com
  - worldtruth.tv
  - worldnewsdailyreport.com
  - newsbreakshere.com
  - christiantimesnewspaper.com
  - uspostman.com
  - empireherald.com
  - conservativedailypost.com
  - dailybuzzlive.com
  - empirenews.net
  - react365.com
  - undergroundnewsreport.com
  - prntly.com
  - dailyusaupdate.com
  - thenewyorkevening.com
  - truetrumpers.com
  - abcnews.com.co
  - thelastlineofdefense.org
  - empiresports.co
  - gummypost.com
  - kmt11.com
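
The three-way intersection does not show how strongly each pair of sources overlaps. A pairwise comparison could be sketched as follows; this is an optional addition under stated assumptions, with the helper name `pairwise_overlap` and the toy data both hypothetical.

```python
from itertools import combinations

def pairwise_overlap(domains_by_source):
    """Map each pair of source names to the number of domains they share."""
    overlaps = {}
    for left, right in combinations(sorted(domains_by_source), 2):
        shared = set(domains_by_source[left]) & set(domains_by_source[right])
        overlaps[(left, right)] = len(shared)
    return overlaps

# hypothetical domain sets, for illustration only
toy = {
    "SourceA": {"a.example", "b.example"},
    "SourceB": {"a.example", "c.example"},
    "SourceC": {"a.example", "b.example", "d.example"},
}
for pair, count in pairwise_overlap(toy).items():
    print(pair, count)
```

With the notebook's data, the input would be one set of domains per source, built from `dfs` before the output step adds further keys to it.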


In [3]:
# calculate intersection of domains labeled "fake"

# filter for "fake" category
common_domains_across_datasets_fake_only = [list(filter(lambda x: x[1] == "fake", arr)) for arr in dfs.values()]

# only use domain name
common_domains_across_datasets_fake_only = [set(map(lambda x: x[0], arr)) for arr in common_domains_across_datasets_fake_only]

# perform set intersection
common_domains_across_datasets_fake_only = list(reduce(set.intersection, common_domains_across_datasets_fake_only))

print("Common domains across all data sets (fake only):",  len(common_domains_across_datasets_fake_only))
print()
for domain in common_domains_across_datasets_fake_only:
    print("  -", domain)

Common domains across all data sets (fake only): 13

  - americannews.com
  - prntly.com
  - dailyusaupdate.com
  - newsbreakshere.com
  - conservativedailypost.com
  - beforeitsnews.com
  - dailybuzzlive.com
  - truetrumpers.com
  - gummypost.com
  - uspostman.com
  - usadailyinfo.com
  - newsexaminer.net
  - undergroundnewsreport.com


In [4]:
# collect all entries labeled "fake" from each dataset
all_domains_fake_only = [set(filter(lambda x: x[1] == "fake", arr)) for arr in dfs.values()]

# perform set union operation
all_domains_fake_only = set(reduce(set.union, [set([x[0] for x in arr]) for arr in all_domains_fake_only]))

print("Total number of 'fake' domains (all):", len(all_domains_fake_only))
print()
for domain in all_domains_fake_only:
    print("  -", domain)

Total number of 'fake' domains (all): 386

  - focusnews.info
  - infowars.com
  - healthycareandbeauty.com
  - theexaminer.site
  - farmwars.info
  - washingtonpost.com.co
  - breaking-cnn.com
  - usadailypolitics.com
  - thetruthseeker.co.uk
  - mzansiville.co.za
  - abcnewsgo.co
  - ladylibertysnews.com
  - givemeliberity01.com
  - americanoverlook.com
  - thelastgreatstand.com
  - dailyheadlines.net
  - usadailyinfo.com
  - 70news.wordpress.com
  - politicono.com
  - bb4sp.com
  - linkbeef.com
  - success-street.com
  - teaparty.org
  - newsdaily12.com
  - vigilantcitizen.com
  - proamericanews.com
  - aurora-news.us
  - voxtribune.com
  - konkonsagh.biz
  - bvanews.com
  - thetrumpmedia.com
  - redflagnews.com
  - newzmagazine.com
  - thereporterz.com
  - usanewshome.com
  - newsfrompolitics.com
  - newsbbc.net
  - nnn.is
  - newpoliticstoday.com
  - politicalsitenews.com
  - tmzworldnews.com
  - dailynewsposts.info
  - world.politics.com
  - dailyheadlines.com
  - usherald.com
  

In [5]:
# calculate set of all domains across all datasets, regardless of category
all_domains = reduce(set.union, [set([y[0] for y in x]) for x in dfs.values()])

print("Total number of domains (all):", len(all_domains))
print()
for domain in all_domains:
    print("  -", domain)

Total number of domains (all): 877

  - weaselzippers.us
  - stneotscitizen.com
  - infowars.com
  - healthycareandbeauty.com
  - libertyfederation.com
  - theexaminer.site
  - washingtonpost.com.co
  - whydontyoutrythis.com
  - thetruthseeker.co.uk
  - westernjournalism.com
  - katehon.com
  - collectivelyconscious.net
  - mzansiville.co.za
  - abcnewsgo.co
  - sputniknews.com
  - americanoverlook.com
  - speld.nl
  - unz.com
  - dailyheadlines.net
  - therealshtick.com
  - 70news.wordpress.com
  - silverstealers.net
  - linkbeef.com
  - freedomsphoenix.com
  - teaparty.org
  - truthrevolt.org
  - therightists.com
  - nationindistress.weebly.com
  - voxtribune.com
  - konkonsagh.biz
  - bvanews.com
  - newzmagazine.com
  - oilgeopolitics.net
  - themindunleashed.org
  - ancient-code.com
  - newsbbc.net
  - nnn.is
  - politicalsitenews.com
  - dailynewsposts.info
  - canadafreepress.com
  - dailysnark.com
  - anonjekloy.tk
  - revolutions2040.com
  - knowledgeoftoday.org
  - interestin

  - mississippiherald.com
  - infiniteunknown.net
  - bizstandardnews.com
  - supremepatriot.com
  - politicot.com
  - financialsurvivalnetwork.com
  - nunadisbereel.com
  - express.co.uk
  - uspoln.com
  - ironictimes.com
  - weconservative.com
  - freedomforceinternational.com
  - bostontribune.com
  - onepoliticalplaza.com
  - uspoliticsinfo.com
  - regated.com
  - thenewsnerd.com
  - winningdemocrats.com
  - journal-neo.org
  - glaringhypocrisy.com
  - antoniusaquinas.wp.com
  - blog.veterantv.net
  - healthy-holistic-living.com
  - newsuptoday.com
  - undergroundworldnews.com
  - myfreshnews.com
  - anotherdayintheempire.com
  - dcleaks.com
  - federalisttribune.com
  - redrocktribune.com
  - diyhours.net
  - rightwingnews.com
  - americankabuki.blogspot.com
  - empirenews.net
  - thebeaverton.com
  - notallowedto.com
  - zerohedge.com
  - naturalblaze.com
  - der-postillon.com
  - wikileaks.org
  - theunrealtimes.com
  - usdefensewatch.com
  - cnewsgo.com
  - dollarvigilante.com


## 4) Output list of fake news domains as .csv files

In [6]:
FAKE_SUFFIX = "-fake-category-only"

# calculate fake news only categories
new_keys = {}
for key in dfs:
    data = dfs[key]
    
    fake_only = list(filter(lambda x: x[1] == "fake", data))
    new_keys[key + FAKE_SUFFIX] = fake_only
dfs.update(new_keys)


# store data for saving
dfs["all"] = all_domains
dfs["all" + FAKE_SUFFIX] = all_domains_fake_only
dfs["intersection"] = common_domains_across_all_datasets
dfs["intersection" + FAKE_SUFFIX] = common_domains_across_datasets_fake_only

# store all results in csv files
print ("Saving files:\n")

import os

outputdir = "output/"
for key in dfs:
    # create filename
    filename = "fake-news-domains-" + key + "-Apr2020.csv"
    outputpath = os.path.join(outputdir, filename)

    # sort the data by domain; entries are either (domain, category) tuples
    # or plain domain strings, and sorted() orders both correctly
    # (itemgetter(0) on a plain string would sort by its first character only)
    entries = sorted(dfs[key])

    if entries and isinstance(entries[0], tuple):
        domains = [x[0] for x in entries]
        categories = [x[1] for x in entries]
    else:
        domains = entries
        categories = [""] * len(domains)

    # create dataframe
    df = pd.DataFrame({"fake_news_domain": domains, "category": categories })

    # store in csv
    df.to_csv(outputpath, header=True, index=False)

    print(" -", filename, " (" + str(len(dfs[key])) + " entries)")

Saving files:

 - fake-news-domains-OpenSourceGroup-Apr2020.csv  (720 entries)
 - fake-news-domains-Politifacts-Apr2020.csv  (276 entries)
 - fake-news-domains-Wikipedia-Apr2020.csv  (88 entries)
 - fake-news-domains-OpenSourceGroup-fake-category-only-Apr2020.csv  (236 entries)
 - fake-news-domains-Politifacts-fake-category-only-Apr2020.csv  (192 entries)
 - fake-news-domains-Wikipedia-fake-category-only-Apr2020.csv  (88 entries)
 - fake-news-domains-all-Apr2020.csv  (877 entries)
 - fake-news-domains-all-fake-category-only-Apr2020.csv  (386 entries)
 - fake-news-domains-intersection-Apr2020.csv  (29 entries)
 - fake-news-domains-intersection-fake-category-only-Apr2020.csv  (13 entries)
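
For downstream use, a saved list can be read back into a lookup table keyed by domain. The sketch below uses only the standard library and the two column names written above (`fake_news_domain`, `category`); the function name `load_domain_categories` and the sample rows are hypothetical.

```python
import csv
import io

def load_domain_categories(fileobj):
    """Read a saved domain list into a {domain: category} dict."""
    reader = csv.DictReader(fileobj)
    return {row["fake_news_domain"]: row["category"] for row in reader}

# hypothetical file content in the layout written by the cell above
sample = io.StringIO(
    "fake_news_domain,category\n"
    "a.example,fake\n"
    "b.example,satire\n"
)
lookup = load_domain_categories(sample)
print("a.example" in lookup, lookup.get("b.example"))
```

To read a real file, an opened file handle, for example from `open(path, newline="")`, can be passed in place of the `StringIO` object.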
