# Fake News URL Datasets

Lists of fake news domains are available from three different data sources: **Politifacts**, **OpenSourceGroup**, and **Wikipedia**.

In order to reliably identify fake news in our social media research studies, we want to reproducibly generate a data set of fake news domains from each of these data sources.

Therefore, we download and process the raw data from each data source in this Python notebook to generate unique lists of fake news domains for different research questions.

The process is as follows:

1. Download the different fake news URL datasets
2. Clean up data & remove duplicates
3. Compare different datasets
4. Output as .csv files


## 1) Download fake news URL datasets

The datasets have already been downloaded to this folder. The data sources are documented in the readme text files.
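
Before running the clean-up, it can help to confirm that every expected raw file is actually present. The following is a minimal sketch, assuming the `data/<source>/<file>` layout that the loading code in this notebook reads from. The helper name `find_missing_files` and the example source and file names are hypothetical and used for illustration only.

```python
import os
import tempfile

def find_missing_files(data_sources, base_dir="data"):
    """Return the paths of all expected files that do not exist on disk.

    data_sources maps a source name to a list of file names, and each file
    is expected at <base_dir>/<source>/<file name>.
    """
    missing = []
    for source, filenames in data_sources.items():
        for filename in filenames:
            path = os.path.join(base_dir, source, filename)
            if not os.path.isfile(path):
                missing.append(path)
    return missing

# demo in a temporary directory with made-up names (not the real datasets)
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "SourceA"))
    open(os.path.join(tmp, "SourceA", "present.csv"), "w").close()

    example_sources = {"SourceA": ["present.csv", "absent.csv"]}
    missing = find_missing_files(example_sources, base_dir=tmp)
    missing_names = [os.path.basename(path) for path in missing]

print(missing_names)
```

An empty result means all listed files were found; anything else points at a file that still needs to be downloaded.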


## 2) Clean up data & remove duplicates

In [1]:
import pandas as pd

data_sources = {
    "OpenSourceGroup": ["opensourcesgroup-raw-16Apr2020.csv"],
    "Politifacts": ["politifacts-raw-16Apr2020.csv"],
    "Wikipedia": ["wikipedia-fake-news-raw-16Apr2020.csv", "wikipedia-fake-news-usa-raw-16Apr2020.csv"]
}


dfs = {}

# function to clean up our domains
def clean_up_entry(str_domain_or_link):
    # if domain is google.com/foo then we only want google.com
    for sep in ['/', '\\', '#', '?']:
        if sep in str_domain_or_link:
            str_domain_or_link = str_domain_or_link.split(sep)[0]
            
    # make everything lowercase
    domain = str_domain_or_link.lower()
                
    # strip "www." only when it is at the front of the domain
    if domain.startswith("www."):
        domain = domain[len("www."):]
    
    return domain

# read all downloaded files into our data structure
for source in data_sources.keys():
    print("\nFrom", source + ":")
    for csv in data_sources[source]:
        # open + read csv file into pandas dataframe
        df = pd.read_csv("data/" + source + "/" + csv)
        
        # domains are always in first column of csv
        first_column = list(df[df.columns[0]])
        
        print("  -", len(first_column), "entries in", csv)
        
        # store the data (merge with other data for same source if needed)
        if source in dfs.keys():
            dfs[source].extend(first_column)
        else:
            dfs[source] = first_column
    
    # apply cleanup function to all entries
    dfs[source] = [clean_up_entry(entry) for entry in dfs[source]]
    
    # remove duplicate entries (a set is unordered, so sorting happens when saving)
    dfs[source] = set(dfs[source])
    
    print("  => unique fake news domains:\t", len(dfs[source]))


From OpenSourceGroup:
  - 833 entries in opensourcesgroup-raw-16Apr2020.csv
  => unique fake news domains:	 824

From Politifacts:
  - 327 entries in politifacts-raw-16Apr2020.csv
  => unique fake news domains:	 325

From Wikipedia:
  - 82 entries in wikipedia-fake-news-raw-16Apr2020.csv
  - 18 entries in wikipedia-fake-news-usa-raw-16Apr2020.csv
  => unique fake news domains:	 88
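
To make the effect of the clean-up step concrete, here is a minimal sketch applied to a few made-up entries. It mirrors the rules described in the cell above (cut at the first separator, lowercase, drop a leading `www.`). The function name `strip_to_domain` and the example domains are assumptions for illustration and are not taken from the datasets.

```python
def strip_to_domain(entry):
    """Hypothetical stand-in for the notebook's clean-up function."""
    # keep only the part before the first path, fragment or query separator
    for sep in ["/", "\\", "#", "?"]:
        entry = entry.split(sep)[0]

    # normalise case and surrounding whitespace
    domain = entry.strip().lower()

    # drop "www." only when it is a prefix, not when it appears elsewhere
    if domain.startswith("www."):
        domain = domain[len("www."):]

    return domain

# made-up entries: three spellings of one domain plus a subdomain
examples = [
    "www.Example.com/foo",
    "example.com?x=1",
    "Example.com#top",
    "news.example.com",
]

# collecting the cleaned entries in a set removes the duplicates
cleaned = {strip_to_domain(entry) for entry in examples}
print(sorted(cleaned))  # ['example.com', 'news.example.com']
```

Three of the four entries collapse into the same domain, which is why the unique counts reported above are lower than the raw entry counts. Note that subdomains are kept as distinct entries under these rules.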


## 3) Compare different datasets

Now that we have parsed the data, cleaned it up and removed the duplicates, we want to compare the datasets.

We want to find the **fake news domains common to all datasets**.

In [2]:
from functools import reduce

# compute the set of fake news domains common to all datasets
common_fake_news_domains_across_all_datasets = reduce(set.intersection, dfs.values())

print("Common fake news domains across all data sets:",  len(common_fake_news_domains_across_all_datasets))
print()
#for domain in common_fake_news_domains_across_all_datasets:
#    print("  -", domain)


# compute the set of all fake news domains (union of all datasets)
all_fake_news_domains = reduce(set.union, dfs.values())

print("Total number of fake news domains:", len(all_fake_news_domains))
print()
#for domain in all_fake_news_domains:
#    print("  -", domain)

Common fake news domains across all data sets: 31

Total number of fake news domains: 1013
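
The intersection across all sources is a strict criterion, so it may also be useful to look at how much each pair of sources overlaps. The sketch below shows one way to do this on toy data. The source names `A`, `B`, `C` and the `.example` domains are made up, and the pairwise comparison is an addition for illustration, not something the notebook computes.

```python
from functools import reduce
from itertools import combinations

# toy data standing in for the cleaned per-source sets (made-up domains)
toy_sources = {
    "A": {"a.example", "b.example", "c.example"},
    "B": {"b.example", "c.example", "d.example"},
    "C": {"c.example", "e.example"},
}

# domains present in every source, and domains present in at least one
common = reduce(set.intersection, toy_sources.values())
union = reduce(set.union, toy_sources.values())

# number of shared domains for every pair of sources
pairwise = {
    (left, right): len(toy_sources[left] & toy_sources[right])
    for left, right in combinations(sorted(toy_sources), 2)
}

print(sorted(common))  # ['c.example']
print(len(union))      # 5
for (left, right), shared in pairwise.items():
    print(left, "&", right, "share", shared, "domains")
```

A pair with a large overlap suggests the two lists draw on similar material, while a small overlap suggests they complement each other.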



## 4) Output lists of fake news domains as .csv files

In [3]:
# store data for saving
dfs["intersection"] = common_fake_news_domains_across_all_datasets
dfs["all"] = all_fake_news_domains

# store all results in csv files
print("Saving files:\n")

outputdir = "output/"
for key in dfs:
    filename = "fake-news-domains-" + key + "-Apr2020.csv"
    pd.DataFrame({"fake_news_domain": sorted(dfs[key])}).to_csv(outputdir + filename, header=False, index=False)
    print(" -", filename, " (" + str(len(dfs[key])) + " entries)")

Saving files:

 - fake-news-domains-OpenSourceGroup-Apr2020.csv  (824 entries)
 - fake-news-domains-Politifacts-Apr2020.csv  (325 entries)
 - fake-news-domains-Wikipedia-Apr2020.csv  (88 entries)
 - fake-news-domains-intersection-Apr2020.csv  (31 entries)
 - fake-news-domains-all-Apr2020.csv  (1013 entries)
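
Because the files are written without a header or index, each one is effectively a plain text list with one domain per line. The following sketch shows how such a file could be written and read back using only the standard library. The helper names `write_domain_list` and `read_domain_list`, the file name, and the example domains are hypothetical.

```python
import os
import tempfile

def write_domain_list(domains, path):
    """Write one domain per line, sorted so repeated runs give identical files."""
    with open(path, "w", encoding="utf-8") as handle:
        for domain in sorted(domains):
            handle.write(domain + "\n")

def read_domain_list(path):
    """Read a one-domain-per-line file back into a set, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}

# round trip in a temporary directory with made-up domains
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fake-news-domains-example.txt")
    write_domain_list({"b.example", "a.example"}, path)
    restored = read_domain_list(path)

print(sorted(restored))  # ['a.example', 'b.example']
```

Loading a list into a set in this way keeps membership checks fast when matching domains against a large collection of URLs.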
