
Author: Josh Smith

This is a Python program that uses PySpark and the GDELT Project's event data to answer the following research question: How did the global perception of the UK’s relationship with the EU, compared to the UK’s own perception, change following the Brexit vote in 2016?

The results are measured using the Goldstein Scale score of each event and the Tone score of each source article analysed. Each article's text is scraped and its word frequencies are computed; a source enters the overall analysis only if terms connected with Brexit appear among its words.
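
As a rough illustration of the qualification step, the sketch below counts word frequencies in an article and checks whether any Brexit-related term is among its most frequent words. The function name `qualifies`, the shortened term set, the `top_n` cut-off and the sample sentences are assumptions made for this example; the notebook itself ranks words by a TF-IDF score rather than by raw counts.

```python
from collections import Counter

# Hypothetical, shortened term set; the notebook defines its full keyword list in a later cell.
BREXIT_TERMS = {"brexit", "referendum", "remain", "eu"}

def qualifies(text, terms=BREXIT_TERMS, top_n=100):
    """Return True if any of the top_n most frequent words is a Brexit-related term."""
    # Lower-case each word and strip surrounding punctuation before counting
    counts = Counter(word.lower().strip(".,;:!?\"'") for word in text.split())
    top_words = {word for word, _ in counts.most_common(top_n)}
    return bool(top_words & set(terms))

sample = "Voters will decide in the referendum whether Britain should remain in the EU."
print(qualifies(sample))
print(qualifies("The weather was sunny all week."))
```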

The Goldstein Scale assigns each type of event a score from -10 to +10 that captures the likely impact that type of event will have on the political stability of a country.

The Tone of a source covering an event is measured on a scale of -100 (negative) to +100 (positive).

Note: This program takes over an hour to run when each data point covers 7 days or more (there are currently 3 data points). Future versions using Google Cloud and more refined parallelism will hopefully cut this time down.

Results from the time of project submission are at the bottom of this page. Re-running the program in its current state will take more than an hour and may produce different results.

Timer



In [None]:
import time
start_time = time.perf_counter()

Libraries setup

In [None]:
#library and code setup
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!pip install -q pyspark
!pip install gdelt
import pyspark, os
from pyspark import SparkConf, SparkContext
os.environ["PYSPARK_PYTHON"]="python3"
os.environ["JAVA_HOME"]="/usr/lib/jvm/java-8-openjdk-amd64/"

Collecting gdelt
Installing collected packages: gdelt
Successfully installed gdelt-0.1.10.6


In [None]:
#start spark local server
import sys, os
from operator import add
import time

os.environ["PYSPARK_PYTHON"]="python3"

import pyspark
from pyspark import SparkConf, SparkContext

try:
  conf = SparkConf().setMaster("local[*]").set("spark.executor.memory", "1g")
  sc = SparkContext(conf = conf)
except ValueError:
  pass

def dbg(x):
  """ A helper function to print debugging information on RDDs """
  if isinstance(x, pyspark.RDD):
    print([(t[0], list(t[1]) if 
            isinstance(t[1], pyspark.resultiterable.ResultIterable) else t[1])
           if isinstance(t, tuple) else t
           for t in x.take(100)])
  else:
    print(x)
    

External code for scraping websites from URLs, sourced from https://gist.github.com/linwoodc3/e12a7fbebfa755e897697165875f8fdb

In [None]:
!pip install requests
!pip install newspaper3k
!pip install bs4
!pip install readability-lxml


Text scraper from URL - Slightly modified from source to attain consistent syntax

In [None]:
# Author: Linwood Creekmore
# Email: valinvescap@gmail.com
# Description:  Python script to pull content from a website (works on news stories).

#Licensed under GNU GPLv3; see https://choosealicense.com/licenses/lgpl-3.0/ for details

# Notes
"""
23 Oct 2017: updated to include readability based on PyCon talk: https://github.com/DistrictDataLabs/PyCon2016/blob/master/notebooks/tutorial/Working%20with%20Text%20Corpora.ipynb
18 Jul 2018: added keywords and summary
"""

###################################
# Standard Library imports
###################################

import re
import pytz
import datetime
import platform


###################################
# Third party imports
###################################

import requests
from newspaper import Article
from bs4 import BeautifulSoup
from readability.readability import Document as Paper
from requests.packages.urllib3.exceptions import InsecureRequestWarning


requests.packages.urllib3.disable_warnings(InsecureRequestWarning)


done = {}


def textgetter(url):
    """Scrapes web news and returns the content
    Parameters
    ----------
    url : str
        web address to news report
    Returns 
    -------
    
    answer : dict
        Python dictionary with key/value pairs for:
            text (str) - Full text of article
            url (str) - url to article
            title (str) - extracted title of article
            author (str) - name of extracted author(s)
            base (str) - base url of where article was located
            provider (str) - string of the news provider from url
            published_date (str,isoformat) - extracted date of article
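            top_image (str) - extracted url of the top image for article
            keywords (list or str) - extracted keywords, or '_' if unavailable
            summary (str) - extracted summary, or '_' if unavailable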
    """
    global done
    TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p', 'li']

    # regex for url check
    s = re.compile(r'(http://|https://)([A-Za-z0-9_\.-]+)')
    u = re.compile(r"(http://|https://)(www.)?(.*)(\.[A-Za-z0-9]{1,4})$")
    if s.search(url):
        site = u.search(s.search(url).group()).group(3)
    else:
        site = None
    answer = {}
    # check that its an url
    if s.search(url):
        if url in done:
            # this url was already attempted: return the stored text and stop
            yield done[url]
            return
        try:
            # make a request to the url
            r = requests.get(url, verify=False, timeout=1)
        except Exception:
            # if the url does not return data, set to empty values
            done[url] = "Unable to reach website."
            answer['author'] = '_'
            answer['base'] = s.search(url).group()
            answer['provider']=site
            answer['published_date']='_'
            answer['text'] = "Unable to reach website."
            answer['title'] = '_'
            answer['top_image'] = '_'
            answer['url'] = url
            answer['keywords']='_'
            answer['summary']='_'
            yield answer
            # stop here: r is undefined when the request failed
            return
        # if url does not return successfully, set to empty values
        if r.status_code != 200:
            done[url] = "Unable to reach website."
            answer['author'] = '_'
            answer['base'] = s.search(url).group()
            answer['provider']=site
            answer['published_date']='_'
            answer['text'] = "Unable to reach website."
            answer['title'] = '_'
            answer['top_image'] = '_'
            answer['url'] = url
            answer['keywords']='_'
            answer['summary']='_'
            yield answer
            return

        # test if length of url content is greater than 500, if so, fill data
        if len(r.content)>500:
            # set article url
            article = Article(url)
            # test for python version because of html different parameters
            if int(platform.python_version_tuple()[0])==3:
                article.download(input_html=r.content)
            elif int(platform.python_version_tuple()[0])==2:
                article.download(html=r.content)
            # parse the url
            article.parse()
            article.nlp()
            # if the parse pulled enough text, fill the rest of the data
            if len(article.text) >= 200:
                answer['author'] = ", ".join(article.authors)
                answer['base'] = s.search(url).group()
                answer['provider']=site
                answer['published_date'] = article.publish_date
                answer['keywords']=article.keywords
                answer['summary']=article.summary
                # convert the data to isoformat; exception for naive date
                if isinstance(article.publish_date,datetime.datetime):
                    try:
                        answer['published_date']=article.publish_date.astimezone(pytz.utc).isoformat()
                    except:
                        answer['published_date']=article.publish_date.isoformat()
                

                answer['text'] = article.text
                answer['title'] = article.title
                answer['top_image'] = article.top_image
                answer['url'] = url
                
                

            # if previous didn't work, try another library
            else:
                doc = Paper(r.content)
                data = doc.summary()
                title = doc.title()
                soup = BeautifulSoup(data, 'lxml')
                newstext = " ".join([l.text for l in soup.find_all(TAGS)])

                # as we did above, pull text if it's greater than 200 length
                if len(newstext) > 200:
                    answer['author'] = '_'
                    answer['base'] = s.search(url).group()
                    answer['provider']=site
                    answer['published_date']='_'
                    answer['text'] = newstext
                    answer['title'] = title
                    answer['top_image'] = '_'
                    answer['url'] = url
                    answer['keywords']='_'
                    answer['summary']='_'
                # if nothing works above, use beautiful soup
                else:
                    newstext = " ".join([
                        l.text
                        for l in soup.find_all(
                            'div', class_='field-item even')
                    ])
                    done[url] = newstext
                    answer['author'] = '_'
                    answer['base'] = s.search(url).group()
                    answer['provider']=site
                    answer['published_date']='_'
                    answer['text'] = newstext
                    answer['title'] = title
                    answer['top_image'] = '_'
                    answer['url'] = url
                    answer['keywords']='_'
                    answer['summary']='_'
        # if nothing works, fill with empty values
        else:
            answer['author'] = '_'
            answer['base'] = s.search(url).group()
            answer['provider']=site
            answer['published_date']='_'
            answer['text'] = 'No text returned'
            answer['title'] = '_'
            answer['top_image'] = '_'
            answer['url'] = url
            answer['keywords']='_'
            answer['summary']='_'
        yield answer

    # the else clause to catch if invalid url passed in
    else:
        answer['author'] = '_'
        answer['base'] = '_' #s.search(url).group()
        answer['provider']=site
        answer['published_date']='_'
        answer['text'] = 'This is not a proper url'
        answer['title'] = '_'
        answer['top_image'] = '_'
        answer['url'] = url
        answer['keywords']='_'
        answer['summary']='_'
        yield answer
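
To make the `base` and `provider` fields more concrete, the sketch below applies the same two regular expressions that `textgetter` uses to a couple of URLs. The helper name `split_url` and the example URLs are assumptions for illustration only, and the `None` fallbacks are an addition that the original function does not have.

```python
import re

# Same patterns as in textgetter, written as raw strings
BASE_RE = re.compile(r'(http://|https://)([A-Za-z0-9_\.-]+)')
SITE_RE = re.compile(r'(http://|https://)(www.)?(.*)(\.[A-Za-z0-9]{1,4})$')

def split_url(url):
    """Return (base, provider) for a URL, or (None, None) if it has no http(s) scheme."""
    base_match = BASE_RE.search(url)
    if base_match is None:
        return None, None
    base = base_match.group()
    site_match = SITE_RE.search(base)
    # Hosts without a dot (e.g. "localhost") have no suffix to strip
    provider = site_match.group(3) if site_match else None
    return base, provider

print(split_url("https://www.bbc.co.uk/news/uk-politics"))
print(split_url("not a url"))
```

Only the final suffix is stripped, so for a `.co.uk` address the provider keeps the `.co` part.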

Fetch GDELT data for the date ranges
Note: Dates with no events in GDELT raise errors; these dates are removed from the date lists once the lists have been generated

In [None]:
from concurrent.futures import ProcessPoolExecutor
from datetime import date, timedelta
import pandas as pd
import gdelt
import os
import nltk
nltk.download('punkt')

gd = gdelt.gdelt(version=2)
executor = ProcessPoolExecutor()

def get_filename(x):
  date = x.strftime('%Y%m%d')
  return "{}_gdeltdata.csv".format(date)

def intofile(filename):
    try:
        if not os.path.exists(filename):
          date = filename.split("_")[0]
          data = gd.Search(date, table='events',coverage=False)
          data.to_csv(filename,encoding='utf-8',index=False)
    except:
        print("Error occurred at", filename)

def get_invalid_dates(filename):
    try:
        if not os.path.exists(filename):
            date = filename.split("_")[0]
            gd.Search(date, table='events',coverage=False)
    except:
      return filename

# pull the data from GDELT into one file per day, after removing dates that have no data in GDELT
def drop_invalid_dates(filenames):
  # query each date only once and keep the files whose date can be fetched
  return [name for name in filenames if get_invalid_dates(name) is None]

dates_7d_before = drop_invalid_dates([get_filename(x) for x in pd.date_range('2016 Jun 17','2016 Jun 23')])
# note: this window (19-25 June) overlaps the pre-referendum window above
dates_7d_after = drop_invalid_dates([get_filename(x) for x in pd.date_range('2016 Jun 19','2016 Jun 25')])
dates_2yrs_after = drop_invalid_dates([get_filename(x) for x in pd.date_range('2019 Jun 2','2019 Jun 8')])

results = list(executor.map(intofile,dates_7d_before+dates_7d_after+dates_2yrs_after))
# dbg(results)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
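
The file-naming scheme used above can be sketched without pandas or GDELT. The helper `daily_filenames` is a hypothetical name introduced here; it only builds the list of per-day CSV names for an inclusive date range and does not download anything.

```python
from datetime import date, timedelta

def daily_filenames(start, end):
    """Build one '<YYYYMMDD>_gdeltdata.csv' name per day from start to end (inclusive)."""
    n_days = (end - start).days + 1
    return [
        "{}_gdeltdata.csv".format((start + timedelta(days=i)).strftime("%Y%m%d"))
        for i in range(n_days)
    ]

names = daily_filenames(date(2016, 6, 17), date(2016, 6, 23))
print(len(names), names[0], names[-1])
```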


Read the data into Spark DataFrames


In [None]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

data_7d_before = sqlContext.read.option("header", "true").csv(dates_7d_before)
data_7d_after = sqlContext.read.option("header", "true").csv(dates_7d_after)
data_2yrs_after = sqlContext.read.option("header", "true").csv(dates_2yrs_after)

Event and keyword data

In [None]:
# events = ['101', '102', '105', '106', '107', '110', '111', '112', '114', '115',
#           '116', '120', '121', '122', '123', '124', '125', '126', '127', '128', 
#           '130', '131', '132', '133', '134', '136', '137', '138', '139', '140', 
#           '143', '144', '153', '154', '160', '161', '162', '163', '164', '165',
#           '166', '170', '171', '172', '173', '174', '175', '180', '181', '182',
#           '184', '185', '191', '192', '193', '194', '195', '196', '201', '202',
#           '203', '211', '213', '214', '231', '232', '234', '241', '242', '243',
#           '244', '252', '253', '254', '255', '256', '311', '312', '313', '314',
#           '331', '332', '333', '334', '355', '356', '811', '812', '813', '831', 
#           '833', '834', '841', '842', '861', '862', '863', '871', '872', '873', 
#           '874', '1012', '1013', '1014', '1033', '1034', '1041', '1042', '1043',
#           '1044', '1051', '1053', '1054', '1056', '1121', '1123', '1124', '1125',
#           '1211', '1212', '1213', '1222', '1224', '1231', '1232', '1233', '1241',
#           '1243', '1244', '1246', '1311', '1312', '1313', '1322', '1382', '1383',
#           '1384', '1385', '1412', '1413', '1414', '1431', '1621', '1623', '1662',
#           '1711', '1721', '1722', '1723', '1724', '1821', '1822', '1823', '1831', '1832'] #Didn't end up using these, may look into using these as a filter in future versions

keywords = ['backstop', 'brexit', 'brexiteer',
            'brexiter', 'brextremist', 'brexshit', 'brextension',
            'chequers', 'customs', 'union', 'divorce', 'bill', 'eu',
            'exit', 'flextension', 'hard', 'border', 'indicative', 'vote', 'implementation',
            'irish', 'leaver', 'lexit', 'meaningful', 'no-deal',
            "people's", 'declaration',
            'remain', 'remainer', 'second', 'referendum', 'slow', 'soft', 'withdrawal', 'agreement']

Text file of common English words sourced from https://simple.wikipedia.org/wiki/Wikipedia:List_of_1000_basic_words

In [None]:
%%writefile common_words.txt
a about above across act active activity add afraid after again age ago agree air all alone along already always am amount an and angry another answer any anyone anything anytime appear apple are area arm army around arrive art as ask at attack aunt autumn away baby back bad bag ball bank base basket bath be bean bear beautiful bed bedroom beer behave before begin behind bell below besides best better between big bird birth birthday bit bite black bleed block blood blow blue board boat body boil bone book border born borrow both bottle bottom bowl box boy branch brave bread break breakfast breathe bridge bright bring brother brown brush build burn business bus busy but buy by cake call can candle cap car card care careful careless carry case cat catch central century certain chair chance change chase cheap cheese chicken child children chocolate choice choose circle city class clever clean clear climb clock cloth clothes cloud cloudy close coffee coat coin cold collect colour comb comfortable common compare come complete computer condition continue control cook cool copper corn corner correct cost contain count country course cover crash cross cry cup cupboard cut dance dangerous dark daughter day dead decide decrease deep deer depend desk destroy develop die different difficult dinner direction dirty discover dish do dog door double down draw dream dress drink drive drop dry duck dust duty each ear early earn earth east easy eat education effect egg eight either electric elephant else empty end enemy enjoy enough enter equal entrance escape even evening event ever every everyone exact everybody examination example except excited exercise expect expensive explain extremely eye face fact fail fall false family famous far farm father fast fat fault fear feed feel female fever few fight fill film find fine finger finish fire first fish fit five fix flag flat float floor flour flower fly fold food fool foot football for force foreign forest forget forgive fork form fox 
four free freedom freeze fresh friend friendly from front fruit full fun funny furniture further future game garden gate general gentleman get gift give glad glass go goat god gold good goodbye grandfather grandmother grass grave great green gray ground group grow gun hair half hall hammer hand happen happy hat hate have he head healthy hear heavy heart heaven height hello help hen her here hers hide high hill him his hit hobby hold hole holiday home hope horse hospital hot hotel house how hundred hungry hour hurry husband hurt i ice idea if important in increase inside into introduce invent iron invite is island it its
jelly job join juice jump just keep key kill kind king kitchen knee knife knock know ladder lady lamp land large last late lately laugh lazy lead leaf learn leg left lend length less lesson let letter library lie life light like lion lip list listen little live lock lonely long look lose lot love low lower luck
machine main make male man many map mark market marry matter may me meal mean measure meat medicine meet member mention method middle milk million mind minute miss mistake mix model modern moment money monkey month moon more morning most mother mountain mouth move much music must my name narrow nation nature near nearly neck need needle neighbour neither net never new news newspaper next nice night nine no noble noise none nor north nose not nothing notice now number obey object ocean of off offer office often oil old on one only open opposite or orange order other our out outside over own page pain paint pair pan paper parent park part partner party pass past path pay peace pen pencil people pepper per perfect period person petrol photograph piano pick picture piece pig pin pink place plane plant plastic plate play please pleased plenty pocket point poison police polite pool poor popular position possible potato pour power present press pretty prevent price prince prison private prize probably problem produce promise proper protect provide public pull punish pupil push put queen question quick quiet quite radio rain rainy raise reach read ready real really receive record red remember remind remove rent repair repeat reply report rest restaurant result return rice rich ride right ring rise road rob rock room round rubber rude rule ruler run rush sad safe sail salt same sand save say school science scissors search seat see seem sell send sentence serve seven several sex shade shadow shake shape share sharp she sheep sheet shelf shine ship shirt shoe shoot shop short should shoulder shout show sick side signal silence silly silver similar simple single since sing sink sister sit six size skill skin skirt sky sleep slip small smell smile smoke snow so soap sock soft some someone something sometimes son soon sorry sound soup south space speak special speed spell spend spoon sport spread spring square stamp stand star start station stay steal steam step still stomach 
stone stop store storm story strange street strong structure student study stupid subject substance successful such sudden sugar suitable summer sun sunny support sure surprise sweet swim sword table take talk tall taste taxi tea teach team tear telephone television tell ten tennis terrible test than that the their then there therefore these thick thin thing think third this though threat three tidy tie title to today toe together tomorrow tonight too tool tooth top total touch town train tram travel tree trouble true trust twice try turn type ugly uncle under understand unit until up use useful usual usually vegetable very village voice visit wait wake walk want warm was wash waste watch water way we weak wear weather wedding week weight welcome were well west wet what wheel when where which while white who why wide wife wild will win wind window wine winter wire wise wish with without woman wonder word work world worry
yard yell yesterday yet you young your zero zoo

Writing common_words.txt
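
A minimal sketch of how a word list like this can be used to drop everyday vocabulary before scoring. It assumes a whitespace-separated list, as in `common_words.txt`; the short stand-in list and the function names `load_common_words` and `remove_common` are illustrative only.

```python
def load_common_words(raw):
    """Split a whitespace-separated word list into a set for fast membership tests."""
    return set(raw.split())

def remove_common(words, common_words):
    """Keep only the words that are not in the common-word set (case-insensitive)."""
    return [w for w in words if w.lower() not in common_words]

# Tiny stand-in for the file contents, including a line break and a trailing space
common_words = load_common_words("a about the\nwith without ")
print(remove_common(["The", "Brexit", "backstop", "with"], common_words))
```

In this notebook `border` and `soft` appear in both the keyword list and the common-word list, so they are removed before scoring and cannot contribute to a match.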


Code to answer research question

In [None]:
import numpy as np
from math import sqrt, log
from statistics import mean, median, stdev, variance

def get_source_country(data):
  # drop rows whose source is missing or is not a full URL before the URL is split
  rows = data.rdd.filter(lambda row: row['SOURCEURL'] is not None and row['SOURCEURL'].count('/') >= 2)
  country_sources = rows.map(lambda row: ((row['GLOBALEVENTID'], row['Actor1CountryCode'], row['Actor2CountryCode'], (row['SOURCEURL'].split('/'))[2].split('.')[-1], row['SOURCEURL'], row['EventCode'], row['GoldsteinScale'], row['AvgTone']), 1))
  country_sources = country_sources.filter(lambda line: 'GBR' in [line[0][1], line[0][2]])
  return country_sources

def get_word_counts(article):
  text = sc.parallelize(article.split(' '))
  words = text.flatMap(lambda line: [(word.lower(), 1) for word in line.split(" ")])
  counts = (words.reduceByKey(lambda a, b: a+b).sortBy(lambda x: x[1], False))
  return counts

def scrape_data(sources):
  scraped_data = sources.map(lambda row: [row[0][0], next(textgetter(row[0][4]))])
  # cache the scraped articles so that later actions do not download them again
  scraped_data = scraped_data.filter(lambda row: type(row[1]) is dict).cache()
  scraped_text = scraped_data.map(lambda row: row[1]['text'])
  # compare by value: `is not` tests identity and never excludes a string literal
  scraped_text = scraped_text.filter(lambda text: text.lower() != 'no text returned')
  return scraped_text, scraped_data

def calc_IDFi(sources):
  text, scraped_data = scrape_data(sources)
  # N is the number of scraped articles
  N = text.count()
  # document frequency: count each word at most once per article
  doc_freq = text.flatMap(lambda article: [(word, 1) for word in set(article.lower().split(" "))]).reduceByKey(lambda a, b: a+b)
  doc_freq = doc_freq.subtractByKey(common)
  IDFi = doc_freq.map(lambda x: (x[0], (log(N/x[1], 2))))
  return IDFi, scraped_data

def calc_TFij(s_data):
  words_scraped = s_data.map(lambda row: [row[0], row[1]['text']])
  words_scraped = words_scraped.filter(lambda row: row[1].lower() != 'no text returned')
  word_counts = words_scraped.map(lambda line: [line[0], [(word.lower(), 1) for word in line[1].split(" ")]])
  #Monotonisation
  articles = [x for x in word_counts.toLocalIterator()]
  TFij = []
  for article in articles:
    ID = article[0]
    article = article[1]
    #This has had to be monotonised because I couldn't get the mapping to work without the program crashing
    par_data = sc.parallelize(article)
    par_data = par_data.reduceByKey(lambda a, b: a+b).subtractByKey(common)
    #skip articles that contain only common words
    if par_data.isEmpty():
      continue
    #subtractByKey does not keep a sort order, so take the maximum count explicitly
    max_value = par_data.values().max()
    par_data = par_data.map(lambda word: [word[0], word[1]/max_value])
    TFij.append([ID, [x for x in par_data.toLocalIterator()]])
  return TFij

def calc_TFijxIDFi(sources):
  IDFi, s_data = calc_IDFi(sources)
  TFij = calc_TFij(s_data)
  #Monotonisation
  glo_TFIDF = []
  for ind in range(len(TFij)):
    #Monotonisation to avoid crashing
    ID = TFij[ind][0]
    TFij[ind] = TFij[ind][1]
    TF_art = sc.parallelize(TFij[ind])
    if TF_art.count() > 5:
      TFijxIDFi = TF_art.join(IDFi)
      TFijxIDFi = TFijxIDFi.map(lambda word: (word[0], word[1][0]*word[1][1]))
      TFijxIDFi = TFijxIDFi.sortBy(lambda word: word[1], False)
      glo_TFIDF.append([ID, [x for x in TFijxIDFi.toLocalIterator()]])
  return glo_TFIDF

def get_relevant_sources(TFIDF):
  #Monotonisation
  checked_sources = []
  for ID, scores in TFIDF:
    #scores are sorted by descending TF-IDF, so the first 100 are the top terms
    matches = [item for item in scores[:100] if item[0] in keywords]
    #keep a source only if at least one Brexit keyword is among its top terms
    if matches:
      checked_sources.append([ID, matches])
  return checked_sources

common = sc.textFile("common_words.txt").flatMap(lambda line: [(word, 1) for word in line.split(" ")])

sources_7d_before = get_source_country(data_7d_before)
TFijxIDFi_7d_before = calc_TFijxIDFi(sources_7d_before)
rel_srcs_7d_before = get_relevant_sources(TFijxIDFi_7d_before)

sources_7d_after = get_source_country(data_7d_after)
TFijxIDFi_7d_after = calc_TFijxIDFi(sources_7d_after)
rel_srcs_7d_after = get_relevant_sources(TFijxIDFi_7d_after)

sources_2yrs_after = get_source_country(data_2yrs_after)
TFijxIDFi_2yrs_after = calc_TFijxIDFi(sources_2yrs_after)
rel_srcs_2yrs_after = get_relevant_sources(TFijxIDFi_2yrs_after)
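
The weighting that `calc_TFij`, `calc_IDFi` and `calc_TFijxIDFi` aim for can be written compactly in plain Python. This is a sketch under assumptions, not the notebook's Spark implementation: term frequency is a word's count divided by the count of the article's most frequent word, inverse document frequency is log2(N / number of articles containing the word), and the function name `tf_idf` and the toy articles are invented for illustration.

```python
from collections import Counter
from math import log

def tf_idf(articles):
    """Return {article_id: [(word, score), ...]} sorted by descending TF-IDF score."""
    tokenised = {aid: text.lower().split() for aid, text in articles.items()}
    n_articles = len(tokenised)
    # Document frequency: number of articles in which each word occurs at least once
    doc_freq = Counter(word for words in tokenised.values() for word in set(words))
    scores = {}
    for aid, words in tokenised.items():
        counts = Counter(words)
        max_count = max(counts.values()) if counts else 1
        weighted = [
            (word, (count / max_count) * log(n_articles / doc_freq[word], 2))
            for word, count in counts.items()
        ]
        scores[aid] = sorted(weighted, key=lambda pair: pair[1], reverse=True)
    return scores

# Toy articles, not GDELT data
toy = {1: "brexit vote brexit", 2: "vote on trade"}
result = tf_idf(toy)
print(result[1][0])
```

A word that occurs in every article gets an IDF of zero, which is why `vote` scores lower than `brexit` in the toy input.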


In [None]:
def split_uk_non_uk(IDs, sources):
  data = sources.map(lambda row: [row[0][0], [row[0][7], row[0][6], row[0][3]]]) #EventID, Tone, GsS, URL suffix
  IDs = sc.parallelize(IDs)
  global_map = IDs.join(data)
  uk_sources = global_map.map(lambda row: [row[0], row[1][1]]).filter(lambda row: row[1][2] == 'uk')
  non_uk_sources = global_map.map(lambda row: [row[0], row[1][1]]).filter(lambda row: row[1][2] != 'uk')
  return uk_sources, non_uk_sources
  
def get_stats(rdd1):
  # collect once instead of iterating over the RDD six times
  rows = rdd1.collect()
  tones = [float(x[1][0]) for x in rows]
  GsS = [float(x[1][1]) for x in rows]
  spread_tone, spread_GsS = stdev(tones), stdev(GsS)
  median_tone, median_GsS = median(tones), median(GsS)
  mean_tone, mean_GsS = mean(tones), mean(GsS)
  return spread_tone, spread_GsS, median_tone, median_GsS, mean_tone, mean_GsS
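
The same split and summary can be tried on plain Python lists without Spark. In this sketch the helper names `url_suffix` and `summarise`, the URLs and the scores are all made up for illustration; `urllib.parse` is used in place of the notebook's string splitting, which is an assumption about a reasonable equivalent.

```python
from statistics import mean, median, stdev
from urllib.parse import urlsplit

def url_suffix(url):
    """Return the last label of the host name, e.g. 'uk' for a .co.uk site."""
    host = urlsplit(url).hostname or ""
    return host.split(".")[-1]

def summarise(values):
    """Return (standard deviation, median, mean) for a list of at least two numbers."""
    return stdev(values), median(values), mean(values)

# Made-up (url, tone) pairs, not GDELT data
rows = [("https://www.example.co.uk/a", -1.0),
        ("https://news.example.co.uk/b", -3.0),
        ("https://www.example.com/c", 0.5),
        ("https://www.example.de/d", -0.5)]
uk_tones = [tone for url, tone in rows if url_suffix(url) == "uk"]
other_tones = [tone for url, tone in rows if url_suffix(url) != "uk"]
print(summarise(uk_tones))
print(summarise(other_tones))
```

`stdev` needs at least two values, so a group containing a single article would raise `StatisticsError`.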

In [None]:
uk_7d_before, non_uk_7d_before = split_uk_non_uk(rel_srcs_7d_before, sources_7d_before)
spread_tone_uk_7d_before, spread_GsS_uk_7d_before, median_tone_uk_7d_before, median_GsS_uk_7d_before, mean_tone_uk_7d_before, mean_GsS_uk_7d_before = get_stats(uk_7d_before)
spread_tone_non_uk_7d_before, spread_GsS_non_uk_7d_before, median_tone_non_uk_7d_before, median_GsS_non_uk_7d_before, mean_tone_non_uk_7d_before, mean_GsS_non_uk_7d_before = get_stats(non_uk_7d_before)
print("7 days pre-referendum")
print("UK-internal statistics:")
print("Tone of articles:")
print("Standard Deviation:", spread_tone_uk_7d_before)
print("Median:", median_tone_uk_7d_before)
print("Mean:", mean_tone_uk_7d_before)
print()
print("Goldstein Scale of the event:")
print("Standard Deviation:", spread_GsS_uk_7d_before)
print("Median:", median_GsS_uk_7d_before)
print("Mean:", mean_GsS_uk_7d_before)
print()
print("UK-external statistics:")
print("Tone of articles:")
print("Standard Deviation:", spread_tone_non_uk_7d_before)
print("Median:", median_tone_non_uk_7d_before)
print("Mean:", mean_tone_non_uk_7d_before)
print()
print("Goldstein Scale of the event:")
print("Standard Deviation:", spread_GsS_non_uk_7d_before)
print("Median:", median_GsS_non_uk_7d_before)
print("Mean:", mean_GsS_non_uk_7d_before)
print()
uk_7d_after, non_uk_7d_after = split_uk_non_uk(rel_srcs_7d_after, sources_7d_after)
spread_tone_uk_7d_after, spread_GsS_uk_7d_after, median_tone_uk_7d_after, median_GsS_uk_7d_after, mean_tone_uk_7d_after, mean_GsS_uk_7d_after = get_stats(uk_7d_after)
spread_tone_non_uk_7d_after, spread_GsS_non_uk_7d_after, median_tone_non_uk_7d_after, median_GsS_non_uk_7d_after, mean_tone_non_uk_7d_after, mean_GsS_non_uk_7d_after = get_stats(non_uk_7d_after)
print("7 days post-referendum")
print("UK-internal statistics:")
print("Tone of articles:")
print("Standard Deviation:", spread_tone_uk_7d_after)
print("Median:", median_tone_uk_7d_after)
print("Mean:", mean_tone_uk_7d_after)
print()
print("Goldstein Scale of the event:")
print("Standard Deviation:", spread_GsS_uk_7d_after)
print("Median:", median_GsS_uk_7d_after)
print("Mean:", mean_GsS_uk_7d_after)
print()
print("UK-external statistics:")
print("Tone of articles:")
print("Standard Deviation:", spread_tone_non_uk_7d_after)
print("Median:", median_tone_non_uk_7d_after)
print("Mean:", mean_tone_non_uk_7d_after)
print()
print("Goldstein Scale of the event:")
print("Standard Deviation:", spread_GsS_non_uk_7d_after)
print("Median:", median_GsS_non_uk_7d_after)
print("Mean:", mean_GsS_non_uk_7d_after)
print()
uk_2yrs_after, non_uk_2yrs_after = split_uk_non_uk(rel_srcs_2yrs_after, sources_2yrs_after)
spread_tone_uk_2yrs_after, spread_GsS_uk_2yrs_after, median_tone_uk_2yrs_after, median_GsS_uk_2yrs_after, mean_tone_uk_2yrs_after, mean_GsS_uk_2yrs_after = get_stats(uk_2yrs_after)
spread_tone_non_uk_2yrs_after, spread_GsS_non_uk_2yrs_after, median_tone_non_uk_2yrs_after, median_GsS_non_uk_2yrs_after, mean_tone_non_uk_2yrs_after, mean_GsS_non_uk_2yrs_after = get_stats(non_uk_2yrs_after)
print("2 years post-referendum")
print("UK-internal statistics:")
print("Tone of articles:")
print("Standard Deviation:", spread_tone_uk_2yrs_after)
print("Median:", median_tone_uk_2yrs_after)
print("Mean:", mean_tone_uk_2yrs_after)
print()
print("Goldstein Scale of the event:")
print("Standard Deviation:", spread_GsS_uk_2yrs_after)
print("Median:", median_GsS_uk_2yrs_after)
print("Mean:", mean_GsS_uk_2yrs_after)
print()
print("UK-external statistics:")
print("Tone of articles:")
print("Standard Deviation:", spread_tone_non_uk_2yrs_after)
print("Median:", median_tone_non_uk_2yrs_after)
print("Mean:", mean_tone_non_uk_2yrs_after)
print()
print("Goldstein Scale of the event:")
print("Standard Deviation:", spread_GsS_non_uk_2yrs_after)
print("Median:", median_GsS_non_uk_2yrs_after)
print("Mean:", mean_GsS_non_uk_2yrs_after)

end_time = time.perf_counter()
print()
print("Time elapsed:", end_time-start_time, "seconds")

7 days pre-referendum
UK-internal statistics:
Tone of articles:
Standard Deviation: 1.7470287573543113
Median: -0.32786885245902003
Mean: -1.0406219730449306

Goldstein Scale of the event:
Standard Deviation: 2.7729990641043876
Median: 1.0
Mean: 1.5333333333333334

UK-external statistics:
Tone of articles:
Standard Deviation: 2.719328055462484
Median: -0.867827447519135
Mean: -1.457123810807092

Goldstein Scale of the event:
Standard Deviation: 4.772018697740349
Median: 1.9
Mean: 1.0204761904761905

7 days post-referendum
UK-internal statistics:
Tone of articles:
Standard Deviation: 2.725712730384123
Median: -0.5221932114882499
Mean: -2.1541859535737675

Goldstein Scale of the event:
Standard Deviation: 2.9391253224880227
Median: 0.7
Mean: 0.934375

UK-external statistics:
Tone of articles:
Standard Deviation: 2.44143811608839
Median: -1.13895216400911
Mean: -1.5637635275805966

Goldstein Scale of the event:
Standard Deviation: 4.319980870275943
Median: 1.9
Mean: 1.2025751072961373
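
To read the change across the vote directly from the figures above, the mean values can be subtracted period by period. The sketch below copies the four mean Tone values from the output, rounded to four decimal places; the dictionary layout and the name `shift` are assumptions for illustration, and no significance test is implied.

```python
# Mean Tone values copied from the output above, rounded to four decimal places
mean_tone = {
    "uk":     {"before": -1.0406, "after": -2.1542},
    "non_uk": {"before": -1.4571, "after": -1.5638},
}

def shift(group):
    """Change in mean Tone from the pre-referendum to the post-referendum window."""
    return mean_tone[group]["after"] - mean_tone[group]["before"]

for group in mean_tone:
    print(group, round(shift(group), 4))
```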
