# Fake News URL Datasets

Lists of fake news domains are available from three data sources: **Politifacts**, **OpenSourceGroup**, and **Wikipedia**.

To identify fake news reliably in our social media research studies, we want to generate a data set of fake news domains from each of these data sources in a reproducible way.

This Python notebook therefore processes the raw data downloaded from each data source and generates unique lists of fake news domains for different research questions.

The process is as follows:

1. Download the different fake news URL datasets
2. Clean up data & remove duplicates
3. Compare different datasets
4. Output as .csv files


## 1) Download fake news URL datasets

The datasets have already been downloaded to this folder. The data sources are documented in the readme text files.
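
Before running the cleaning step, it can help to confirm that the raw files are where the notebook expects them. The sketch below assumes the `data/<source>/<file>` layout used by the `pd.read_csv` call in the next section; `missing_raw_files` is a hypothetical helper and not part of the notebook.

```python
from pathlib import Path

# Same mapping of source name -> raw CSV files as used by the cleaning cell.
data_sources = {
    "OpenSourceGroup": ["opensourcesgroup-raw-16Apr2020.csv"],
    "Politifacts": ["politifacts-raw-16Apr2020.csv"],
    "Wikipedia": ["wikipedia-fake-news-raw-16Apr2020.csv",
                  "wikipedia-fake-news-usa-raw-16Apr2020.csv"],
}

def missing_raw_files(base_dir="data"):
    """Return the expected raw files that are not present under base_dir."""
    missing = []
    for source, filenames in data_sources.items():
        for filename in filenames:
            path = Path(base_dir) / source / filename
            if not path.is_file():
                missing.append(path)
    return missing

# Report anything that still needs to be downloaded.
for path in missing_raw_files():
    print("missing:", path)
```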


## 2) Clean up data & remove duplicates

In [1]:
import pandas as pd

data_sources = {
    "OpenSourceGroup": ["opensourcesgroup-raw-16Apr2020.csv"],
    "Politifacts": ["politifacts-raw-16Apr2020.csv"],
    "Wikipedia": ["wikipedia-fake-news-raw-16Apr2020.csv", "wikipedia-fake-news-usa-raw-16Apr2020.csv"]
}


dfs = {}

# function to clean up our domains
def clean_up_entry(str_domain_or_link):
    # if domain is google.com/foo then we only want google.com
    for sep in ['/', '\\', '#', '?']:
        if sep in str_domain_or_link:
            str_domain_or_link = str_domain_or_link.split(sep)[0]
            
    # make everything lowercase
    domain = str_domain_or_link.lower()
                
    # strip a leading "www." (only at the start of the domain, not inside it)
    if domain.startswith("www."):
        domain = domain[len("www."):]
    
    return domain

# clean up a category (e.g. "fake news", "fake ", " fake" => "fake")
def clean_up_category(str_category):
    # lower-case and remove leading/trailing whitespace
    str_category = str_category.lower().strip()
    
    # only use first word
    first_part = str_category.split(" ")[0]
    
    return first_part

# remove categories that don't make any sense for our "fake news"-focused study
bogus_categories = ["reliable", "political", "some", "state", "clickbait"]
def remove_bogus_categories(set_of_tuples):
    print("  remove_bogus_categories", len(set_of_tuples))
    retval = [x for x in set_of_tuples if x[1] not in bogus_categories]
    print("   -> still remaining:", len(retval))
    return retval

# removes duplicate tuples, keeping the most important categories
category_importance = ["fake", "hate", "satire"]
def remove_duplicate_entries(list_of_tuples):
    unique_domains = set([x[0] for x in list_of_tuples])
    ret = []
    for domain in unique_domains:
        unique_categories = list(set([x[1] for x in list(filter(lambda x: x[0] == domain, list_of_tuples))]))
        if len(unique_categories) > 1:
            print("len > 1:", domain, unique_categories)
            selected_category = None
            for category in unique_categories:
                if not selected_category:
                    selected_category = category
                else:
                    if category in category_importance and selected_category not in category_importance:
                        selected_category = category
                    if category not in category_importance and selected_category in category_importance:
                        continue # do nothing
                    if category not in category_importance and selected_category not in category_importance:
                        continue # do nothing
                    if category_importance.index(category) < category_importance.index(selected_category):
                        selected_category = category
            print("selected_category:", selected_category)
                
        else:
            selected_category = unique_categories[0]
        
        ret.append((domain, selected_category))                  
    return ret


# read all downloaded files into our data structure
for source in data_sources.keys():
    print("\nFrom", source + ":")
    for csv in data_sources[source]:
        # open + read csv file into pandas dataframe
        df = pd.read_csv("data/" + source + "/" + csv)
        
        # domains are always in first column of csv
        first_column = list(df[df.columns[0]])
        
        # clean up domains from rubbish data
        domains = list(map(clean_up_entry, first_column))
        
        # categories are always in second column of csv, 
        categories = list(df[df.columns[1]])
        
        # ... except for wikipedia, here everything should be seen as "fake"
        if source == "Wikipedia":
            categories = ["fake"] * len(domains)
        else:
            # clean up categories
            categories = list(map(clean_up_category, categories))
        
        print("  -", len(domains), "entries in", csv)
        
        # zip categories and domains into tuples
        new_data = list(zip(domains, categories))
        
        # store the data (merge with other data for same source if needed)
        if source in dfs.keys():
            dfs[source].extend(new_data)
        else:
            dfs[source] = new_data
    
    # remove duplicate tuples (sorting first has no effect on the resulting set)
    dfs[source] = set(dfs[source])
    
    # remove bogus categories
    dfs[source] = remove_bogus_categories(dfs[source])
    
    # remove duplicate entries
    dfs[source] = remove_duplicate_entries(dfs[source])
    
    print("  => unique fake news domains:\t", len(dfs[source]))


From OpenSourceGroup:
  - 833 entries in opensourcesgroup-raw-16Apr2020.csv
  remove_bogus_categories 827
   -> still remaining: 723
len > 1: centerforsecuritypolicy.org ['hate', 'bias']
selected_category: hate
len > 1: patriotnewsdaily.com ['bias', 'satire']
selected_category: satire
len > 1: madworldnews.com ['unreliable', 'fake']
selected_category: fake
  => unique fake news domains:	 720

From Politifacts:
  - 327 entries in politifacts-raw-16Apr2020.csv
  remove_bogus_categories 327
   -> still remaining: 277
len > 1: civictribune.com ['imposter', 'fake']
selected_category: fake
  => unique fake news domains:	 276

From Wikipedia:
  - 82 entries in wikipedia-fake-news-raw-16Apr2020.csv
  - 18 entries in wikipedia-fake-news-usa-raw-16Apr2020.csv
  remove_bogus_categories 88
   -> still remaining: 88
  => unique fake news domains:	 88
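
The two cleaning steps above (reducing each entry to a bare domain, then keeping one category per domain) can also be sketched more compactly. This is an illustration under assumptions, not the notebook's code: `normalize_domain`, `category_rank`, and `pick_most_important` are hypothetical names, and the first function uses `urllib.parse.urlsplit` from the standard library instead of splitting on separators, so that entries with a scheme such as `http://` also reduce to their host name.

```python
from urllib.parse import urlsplit

category_importance = ["fake", "hate", "satire"]

def normalize_domain(entry):
    """Reduce a URL or bare domain to a lower-case host without a leading 'www.'."""
    entry = str(entry).strip().lower().replace("\\", "/")
    # urlsplit only fills in the host when a scheme or a leading '//' is present
    if "://" not in entry:
        entry = "//" + entry
    host = urlsplit(entry).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host

def category_rank(category):
    """Lower rank means more important; unlisted categories rank last."""
    if category in category_importance:
        return category_importance.index(category)
    return len(category_importance)

def pick_most_important(list_of_tuples):
    """Keep one (domain, category) tuple per domain, preferring listed categories."""
    best = {}
    for domain, category in list_of_tuples:
        if domain not in best or category_rank(category) < category_rank(best[domain]):
            best[domain] = category
    return sorted(best.items())

print(normalize_domain("http://www.Example.com/path?x=1"))  # example.com
print(pick_most_important([("madworldnews.com", "unreliable"),
                           ("madworldnews.com", "fake")]))
```

When two categories of the same domain are both missing from `category_importance`, this sketch keeps whichever it sees first, so the input order decides such ties.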


## 3) Compare different datasets

Now that we have parsed the data, cleaned it up, and removed the duplicates, we can compare the datasets.

We want to find the **fake news domains that are common to all datasets**.

In [2]:
from functools import reduce


# calculate list of common fake news domains across all datasets
common_fake_news_domains_across_all_datasets = reduce(set.intersection, 
                                                      [set([y[0] for y in x]) for x in dfs.values()])

print("Common fake news domains across all data sets:",  len(common_fake_news_domains_across_all_datasets))
print()
for domain in common_fake_news_domains_across_all_datasets:
    print("  -", domain)

Common fake news domains across all data sets: 29

  - empiresports.co
  - react365.com
  - thelastlineofdefense.org
  - americannews.com
  - uspostman.com
  - usadailyinfo.com
  - newsexaminer.net
  - huzlers.com
  - beforeitsnews.com
  - washingtonpost.com.co
  - nationalreport.net
  - dailyusaupdate.com
  - abcnews.com.co
  - disclose.tv
  - prntly.com
  - thenewyorkevening.com
  - conservativedailypost.com
  - christiantimesnewspaper.com
  - newsbreakshere.com
  - gummypost.com
  - undergroundnewsreport.com
  - kmt11.com
  - dailybuzzlive.com
  - worldtruth.tv
  - truetrumpers.com
  - worldnewsdailyreport.com
  - now8news.com
  - empireherald.com
  - empirenews.net


In [3]:
# calculate list of all fake news domains
all_fake_news_domains = reduce(set.union, 
                               [set([y[0] for y in x]) for x in dfs.values()])

print("Total number of fake news domains:", len(all_fake_news_domains))
print()
for domain in all_fake_news_domains:
    print("  -", domain)

Total number of fake news domains: 877

  - usatoday.com.co
  - newpoliticstoday.com
  - newsoftrump.com
  - ihr.org
  - rogue-nation3.com
  - yesimright.com
  - foodbabe.com
  - thelastgreatstand.com
  - newswatch28.com
  - sputniknews.com
  - intrendtoday.com
  - borowitzreport.com
  - persecutes.com
  - fedsalert.com
  - washingtonpost.com.co
  - jamesrgrangerjr.com
  - dailyusaupdate.com
  - glaringhypocrisy.com
  - wundergroundmusic.com
  - thepredicted.com
  - usfanzone.com
  - newyorker.com
  - wy21news.com
  - 16wmpo.com
  - prntly.com
  - americasfreedomfighters.com
  - viralliberty.com
  - bluevision.news
  - thefederalistpapers.org
  - tdtalliance.com
  - halfwaypost.com
  - presstv.ir
  - redrocktribune.com
  - conservativepapers.com
  - dailyinfobox.com
  - meanlefthook.com
  - rawforbeauty.com
  - americanborderpatrol.com
  - infiniteunknown.net
  - aheadoftheherd.com
  - drrichswier.com
  - wikileaks.org
  - x22report.com
  - returnofkings.com
  - breaking-cnn.com
  - co

  - shareblue.com
  - elmundotoday.com
  - truthkings.com
  - worldstoriestoday.com
  - veteransnewsnow.com
  - washingtonfeed.com
  - canadafreepress.com
  - stormcloudsgathering.com
  - interestingdailynews.com
  - fprnradio.com
  - theduran.com
  - derfmagazine.com
  - knowledgeoftoday.org
  - clickhole.com
  - molonlabemedia.com
  - dailysquib.co.uk
  - americanthinker.com
  - hangthebankers.com
  - denverguardian.com
  - channel18news.com
  - objectiveministries.org
  - flashnewscorner.com
  - corbettreport.com
  - dailyken.com
  - empirenews.net
  - channel24news.com
  - allnewspipeline.com
  - local31news.com
  - americantoday.us
  - kbc14.com
  - dcleaks.com
  - creativitymovement.net
  - whatdoesitmean.com
  - eyeopening.info
  - benjaminfulford.typepad.com
  - patriotnewsdaily.com
  - glossynews.com
  - barenakedislam.com
  - focusnews.info
  - damnleaks.com
  - freedomworldnews.com
  - madworldnews.com
  - alabamaobserver.com
  - usadailypost.us
  - conservativeview.info
  -

## 4) Output lists of fake news domains as .csv files

In [4]:
# store data for saving
dfs["intersection"] = common_fake_news_domains_across_all_datasets
dfs["all"] = all_fake_news_domains

# store all results in csv files
print("Saving files:\n")

outputdir = "output/"
for key in dfs:
    # create filename
    filename = "fake-news-domains-" + key + "-Apr2020.csv"
    
    # sort the data by domain; entries are either (domain, category) tuples
    # or plain domain strings, and the default ordering handles both
    # (itemgetter(0) on a plain string would sort by its first character only)
    tuples = sorted(dfs[key])
    
    if isinstance(tuples[0], tuple):
        domains = [x[0] for x in tuples]
        categories = [x[1] for x in tuples]
    else:
        domains = tuples
        categories = [""] * len(domains)
    
    # create dataframe
    df = pd.DataFrame({"fake_news_domain": domains, "category": categories })
    
    # store in csv
    df.to_csv(outputdir + filename, header=True, index=False)
    
    print(" -", filename, " (" + str(len(dfs[key])) + " entries)")

Saving files:

 - fake-news-domains-OpenSourceGroup-Apr2020.csv  (720 entries)
 - fake-news-domains-Politifacts-Apr2020.csv  (276 entries)
 - fake-news-domains-Wikipedia-Apr2020.csv  (88 entries)
 - fake-news-domains-intersection-Apr2020.csv  (29 entries)
 - fake-news-domains-all-Apr2020.csv  (877 entries)
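
A later analysis might read one of these files back in. The sketch below assumes the two column names written above (`fake_news_domain` and `category`) and uses only the standard library; `load_fake_news_domains` is a hypothetical helper, not part of the notebook.

```python
import csv

def load_fake_news_domains(path):
    """Read one of the saved CSV files into a {domain: category} dict."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return {row["fake_news_domain"]: row["category"] for row in reader}

# Example usage (the path assumes the output folder used above):
# domains = load_fake_news_domains("output/fake-news-domains-all-Apr2020.csv")
# print("empirenews.net" in domains)
```

For the `intersection` and `all` files the category column is empty, so every value in the returned dict is an empty string.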
