# Fake News URL Datasets

Lists of fake news domains are available from three different data sources: **Politifacts**, **OpenSourceGroup**, and **Wikipedia**.

In order to reliably identify fake news in our social media research studies, we want to reproducibly generate a data set of fake news domains from each of these data sources.

We therefore download and process the raw data from each data source in this Python notebook and generate deduplicated lists of fake news domains for different research questions.

The process is as follows:

1. Download the different fake news URL datasets
2. Clean up data & remove duplicates
3. Compare different datasets
4. Output as .csv files


## 1) Download fake news URL datasets

The datasets have already been downloaded to this folder. The data sources are documented in the readme text files.


## 2) Clean up data & remove duplicates

In [35]:
import pandas as pd

data_sources = {
    "OpenSourceGroup": ["opensourcesgroup-raw-16Apr2020.csv"],
    "Politifacts": ["politifacts-raw-16Apr2020.csv"],
    "Wikipedia": ["wikipedia-fake-news-raw-16Apr2020.csv", "wikipedia-fake-news-usa-raw-16Apr2020.csv"]
}


dfs = {}

# clean up a domain or link (e.g. "www.Google.com/foo" => "google.com")
def clean_up_entry(str_domain_or_link):
    # keep only the part before the first path, fragment or query separator
    for sep in ['/', '\\', '#', '?']:
        str_domain_or_link = str_domain_or_link.split(sep)[0]

    # make everything lowercase
    domain = str_domain_or_link.lower()

    # remove a leading "www." from the domain
    if domain.startswith("www."):
        domain = domain[len("www."):]

    return domain

# clean up a category (e.g. "fake news", "fake ", " fake" => "fake")
def clean_up_category(str_category):
    # lower-case and remove leading and trailing whitespace
    str_category = str_category.lower().strip()

    # only use first word
    first_part = str_category.split(" ")[0]

    return first_part


# read all downloaded files into our data structure
for source in data_sources.keys():
    print("\nFrom", source + ":")
    for csv in data_sources[source]:
        # open + read csv file into pandas dataframe
        df = pd.read_csv("data/" + source + "/" + csv)
        
        # domains are always in first column of csv
        first_column = list(df[df.columns[0]])
        
        # clean up domains from rubbish data
        domains = list(map(clean_up_entry, first_column))
        
        # categories are always in second column of csv, 
        categories = list(df[df.columns[1]])
        
        # ... except for wikipedia, here everything should be seen as "fake"
        if source == "Wikipedia":
            categories = ["fake"] * len(domains)
        else:
            # clean up categories
            categories = list(map(clean_up_category, categories))
        
        print("  -", len(domains), "entries in", csv)
        
        # zip categories and domains into tuples
        new_data = list(zip(domains, categories))
        
        # store the data (merge with other data for same source if needed)
        dfs.setdefault(source, []).extend(new_data)

    # remove duplicate (domain, category) entries
    dfs[source] = set(dfs[source])

    print("  => unique fake news domains:\t", len(dfs[source]))


From OpenSourceGroup:
  - 833 entries in opensourcesgroup-raw-16Apr2020.csv
  => unique fake news domains:	 827

From Politifacts:
  - 327 entries in politifacts-raw-16Apr2020.csv
  => unique fake news domains:	 327

From Wikipedia:
  - 82 entries in wikipedia-fake-news-raw-16Apr2020.csv
  - 18 entries in wikipedia-fake-news-usa-raw-16Apr2020.csv
  => unique fake news domains:	 88
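
`clean_up_entry` cuts each entry at the first `/`, so an entry that starts with a scheme (for example `http://example.com/page`) would be reduced to `http:`. Whether the raw files contain such entries is not checked here. As a minimal sketch, assuming such entries may occur, the standard library's `urllib.parse.urlsplit` can extract the host name instead. The helper name `normalize_domain` is hypothetical and not part of the notebook.

```python
import re
from urllib.parse import urlsplit


def normalize_domain(entry):
    """Hypothetical helper: reduce a URL or bare domain to a lowercase host name."""
    # lowercase, trim whitespace and treat backslashes like forward slashes
    text = str(entry).strip().lower().replace("\\", "/")

    # urlsplit only recognises the host when the string starts with a scheme
    # ("http://...") or with "//", so add "//" to bare domains and links
    if not re.match(r"[a-z][a-z0-9+.-]*://", text) and not text.startswith("//"):
        text = "//" + text

    host = urlsplit(text).hostname or ""

    # remove a leading "www." only, not occurrences inside the name
    if host.startswith("www."):
        host = host[len("www."):]

    return host


print(normalize_domain("http://www.Google.com/foo?x=1"))
print(normalize_domain("google.com/foo"))
```

Both calls reduce the entry to `google.com`. Entries that contain no host at all come back as an empty string, so they can be filtered out before deduplication.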


## 3) Compare different datasets

Now that we have parsed the data, cleaned it up and removed the duplicates, we want to compare the datasets.

We want to find the **fake news domains common to all datasets**.

In [21]:
from functools import reduce


# calculate list of common fake news domains across all datasets
common_fake_news_domains_across_all_datasets = reduce(set.intersection, 
                                                      [set([y[0] for y in x]) for x in dfs.values()])

print("Common fake news domains across all data sets:",  len(common_fake_news_domains_across_all_datasets))
print()
for domain in common_fake_news_domains_across_all_datasets:
    print("  -", domain)

Common fake news domains across all data sets: 31

  - nationalreport.net
  - undergroundnewsreport.com
  - americannews.com
  - disclose.tv
  - dailybuzzlive.com
  - gummypost.com
  - prntly.com
  - newsexaminer.net
  - washingtonpost.com.co
  - yournewswire.com
  - kmt11.com
  - empireherald.com
  - react365.com
  - uspostman.com
  - huzlers.com
  - thelastlineofdefense.org
  - empiresports.co
  - newsbreakshere.com
  - conservativedailypost.com
  - christiantimesnewspaper.com
  - dailyusaupdate.com
  - now8news.com
  - thegatewaypundit.com
  - abcnews.com.co
  - truetrumpers.com
  - usadailyinfo.com
  - worldtruth.tv
  - thenewyorkevening.com
  - empirenews.net
  - beforeitsnews.com
  - worldnewsdailyreport.com
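
The intersection above only shows domains that appear in every source. To see how much any two sources agree, the same set logic can be applied pair by pair. The following is a minimal sketch under the assumption that each value is a set of `(domain, category)` tuples, as `dfs` is after step 2. The function name `pairwise_overlap` and the toy data are hypothetical and only illustrate the shape of the input.

```python
from itertools import combinations


def pairwise_overlap(datasets):
    """Return {(name_a, name_b): shared_domains} for every pair of sources."""
    # reduce each source to its set of domains, ignoring the category
    domains = {
        name: {domain for domain, _category in entries}
        for name, entries in datasets.items()
    }
    return {
        (a, b): domains[a] & domains[b]
        for a, b in combinations(sorted(domains), 2)
    }


# toy data, not taken from the real datasets
toy = {
    "A": {("one.example", "fake"), ("two.example", "satire")},
    "B": {("two.example", "fake"), ("three.example", "fake")},
    "C": {("two.example", "fake"), ("one.example", "fake")},
}

overlaps = pairwise_overlap(toy)
for pair, shared in overlaps.items():
    print(pair, len(shared))
```

If applied to `dfs`, this would need to run before the `intersection` and `all` entries are added in step 4, because those hold plain domain strings rather than tuples.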


In [23]:
# calculate list of all fake news domains
all_fake_news_domains = reduce(set.union, 
                               [set([y[0] for y in x]) for x in dfs.values()])

print("Total number of fake news domains:", len(all_fake_news_domains))
print()
for domain in all_fake_news_domains:
    print("  -", domain)

Total number of fake news domains: 1013

  - prisonplanet.com
  - oftwominds.com
  - countercurrents.org
  - theblaze.com
  - newsninja2012.com
  - dailysnark.com
  - davidwolfe.com
  - thecommonsenseshow.com
  - focusnews.info
  - success-street.com
  - dailyoccupation.com
  - attitude.co.uk
  - yesimright.com
  - weeklyworldnews.com
  - americankabuki.blogspot.com
  - rogue-nation3.com
  - surrealscoop.com
  - endoftheamericandream.com
  - 24x365live.com
  - ecowatch.com
  - gaia.com
  - scrappleface.com
  - usafirstinformation.com
  - patriotpost.us
  - thestatelyharold.com
  - citizensunited.org
  - glaringhypocrisy.com
  - topinfopost.com
  - fourwinds10.net
  - usanewsflash.com
  - deadlyclear.wordpress.com
  - whitepower.com
  - conservativeoutfitters.com
  - electionnightgatekeepers.com
  - krbcnews.com
  - wakingupwisconsin.com
  - theamericanindependent.wordpress.com
  - independentminute.com
  - awm.com
  - freedomdaily.com
  - vigilantcitizen.com
  - 70news.wordpress.com
  

## 4) Output lists of fake news domains as .csv files

In [39]:
# add the combined sets so they are saved alongside the per-source data
dfs["intersection"] = common_fake_news_domains_across_all_datasets
dfs["all"] = all_fake_news_domains

# store all results in csv files
print("Saving files:\n")

outputdir = "output/"
for key in dfs:
    # create filename
    filename = "fake-news-domains-" + key + "-Apr2020.csv"

    # sort the data: per-source sets hold (domain, category) tuples,
    # the combined sets hold plain domain strings
    entries = sorted(dfs[key])

    if isinstance(entries[0], tuple):
        domains = [x[0] for x in entries]
        categories = [x[1] for x in entries]
    else:
        domains = entries
        categories = [""] * len(domains)

    # create dataframe
    df = pd.DataFrame({"fake_news_domain": domains, "category": categories})

    # store in csv
    df.to_csv(outputdir + filename, header=True, index=False)

    print(" -", filename, " (" + str(len(dfs[key])) + " entries)")

Saving files:

 - fake-news-domains-OpenSourceGroup-Apr2020.csv  (827 entries)
 - fake-news-domains-Politifacts-Apr2020.csv  (327 entries)
 - fake-news-domains-Wikipedia-Apr2020.csv  (88 entries)
 - fake-news-domains-intersection-Apr2020.csv  (31 entries)
 - fake-news-domains-all-Apr2020.csv  (1013 entries)
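
The `intersection` and `all` files are written with an empty `category` column. When such a file is read back with `pandas.read_csv`, empty fields are parsed as `NaN` by default. The sketch below shows one way to keep them as empty strings, using `keep_default_na=False`. It works on an in-memory buffer with made-up domains, so it is an illustration of the round trip rather than a check of the real output files.

```python
import io

import pandas as pd

# made-up rows in the same layout as the output files
df = pd.DataFrame({
    "fake_news_domain": ["b.example", "a.example"],
    "category": ["", ""],
})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.sort_values("fake_news_domain").to_csv(buffer, header=True, index=False)
buffer.seek(0)

# keep_default_na=False keeps empty fields as "" instead of NaN
loaded = pd.read_csv(buffer, keep_default_na=False)
print(loaded)
```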
