# INM432: Big Data - Coursework (Part II)

## Predicting shifts in GBP-EUR exchange rates based on the content of UK parliamentary debates: A pySpark application

### Alexandros Dimitrios Nalmpantis; Georgios Kyriakopoulos (2017)

**Abstract.** This notebook presents pySpark code that implements a machine-learning pipeline, which (a) scrapes and processes daily reports on debates at the House of Commons and the House of Lords; (b) scrapes daily GBP-EUR exchange rates and calculates their day-to-day shifts; and (c) links the two to train a regression-based algorithm that predicts the latter from the former. The analysis focuses on predicting exchange-rate shifts (instead of raw exchange rates) to limit the effect of autocorrelation on the modelling process. Alternative model hyperparameters are systematically assessed using a grid-search approach, which involves training, validating, and testing the performance of alternative model specifications.

## 1. Modules

* The modules needed for the analysis are imported below. Some of them may need to be installed first from a terminal, with either **pip install <"name of module">** (e.g. pip install tqdm) or **conda install <"name of module">** (e.g. conda install tqdm).
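
Before running the import cell, it can help to check which third-party modules are missing from the current environment. The helper below is a minimal sketch: `missing_modules` is a hypothetical name introduced here, not part of the notebook, and the list of names simply mirrors the third-party imports in the cell that follows.

```python
import importlib.util

def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names used by the notebook's import cell (the package beautifulsoup4 is imported as "bs4")
required = ["bs4", "wget", "pandas", "tqdm", "tika"]
print("Missing modules:", missing_modules(required))
```

Any name that is printed still needs to be installed with pip or conda before the notebook is run.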

In [1]:
# Import modules for scraping links
from bs4 import BeautifulSoup
import urllib.request
import re
import datetime
from datetime import date,timedelta
import os

# Import modules for downloading links
import wget
import pandas as pd

# Import modules for parsing PDFs, showing progress bars and handling warnings
import warnings
from tqdm import tnrange, tqdm_notebook
from tika import parser
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

# Import modules for Spark ML, math and operators (re is already imported above)
from operator import add
#from pyspark.mllib.regression import LabeledPoint
#from pyspark.mllib.classification import LogisticRegressionWithLBFGS
#from pyspark.mllib.util import MLUtils
#from pyspark import SparkContext
from math import log
import time
from pprint import pprint
import sys

## 2. Data collection and pre-processing

### 2.1 Defining and calling wrapped procedures for scraping, downloading and converting parliamentary debates to text files


In [2]:
# Data control function: decides whether the reports are scraped or were supplied in advance
def data_control(page,start_date,trg):
    if trg=='yes':
        os.chdir(os.getcwd()) # data were supplied: keep the current folder as the working folder
    else:
        html_page = urllib.request.urlopen(page) # request the page with urllib
        soup = BeautifulSoup(html_page) # pass the page to BeautifulSoup to extract the links it contains
        #print (soup) # visually inspect the html structure
        hl = [] # list that stores the extracted hyperlinks
        # search the html for hyperlinks starting with http://qna
        for hyperlink in soup.findAll('a', attrs={'href': re.compile("^http://qna")}): 
            hl.append(hyperlink.get('href')) # store each hyperlink found
        
        url=[hl[1][:-20]+'Lords-',hl[1][:-20]+'Commons-'] # take the link at index 1, cut its last 20 characters (house and date) and append each house name
    
        # build the date interval to download
        today=datetime.datetime.today() # today's date and time
        cur_date = date(today.year,today.month,today.day) # current date in the format YYYY-MM-DD
        dt = cur_date - start_date # interval in days, used by the loop below
        os.makedirs(os.getcwd()+'/parliament_practicals', exist_ok=True) # make the download directory unless it already exists
    
        # loop through the interval one day at a time, appending the date to the url of each house (Lords or Commons)
        for ul in url:
        #    print ('Downloading: ',str('House of '+ul[112:-2]+'s'))
            for i in tnrange(dt.days + 1,desc='Downloading: '+str(ul[112:-2]+'s')):
                try: # some dates have no report (e.g. the House of Lords did not convene) and the HTTP request fails; skip those dates
                    filename = wget.download(ul+str(start_date + timedelta(days=i))+'.pdf',os.getcwd()+'/parliament_practicals') # store the pdf in the folder parliament_practicals
                except Exception:
                    continue
    
# function to convert the downloaded pdfs to text files
def convert_pdf_to_text(trg):
    if trg=='yes': # user input in case data are already given in appropriate format
        print ('====Data were given====')
    else:
        os.makedirs(os.getcwd()+'/textfiles', exist_ok=True) # make the output directory unless it already exists
        list_of_files=os.listdir(os.getcwd()+'/parliament_practicals') # create a list of pdf files to be converted
        for i in tnrange(len(list_of_files),desc='Converting pdf to txt'): # iterate through the files on the list with a progress bar
            if not list_of_files[i].endswith(".pdf"): # skip anything that is not a pdf file
                continue
            parsedPDF=parser.from_file(os.getcwd()+'/parliament_practicals/'+list_of_files[i]) # parse pdf file
            with open(os.getcwd()+'/textfiles/'+list_of_files[i][:-4]+'.txt', 'a') as text_file: # same filename with extension .txt
                text_file.write(parsedPDF["content"]) # write parsed pdf to text

In [12]:
# Set trg to 'yes' if the data are supplied in advance (to save time); otherwise leave it as 'no'
trg='no'
# set link to parliament daily questions and answers reports
page="http://www.parliament.uk/business/publications/written-questions-answers-statements/daily-reports/"
# set the start date in the format (YYYY, M, D); reports are downloaded from this date up to the current date
start_date = date(2016, 6, 23)

# Call the function to either download the data or keep the current folder as the working folder.
# If the data are supplied, they should be stored in the folder 'parliament_practicals'.
# We suggest running the scraping function, since it takes only about 2 minutes to download a year of reports.
data_control(page,start_date,trg) # download (or locate) the reports
convert_pdf_to_text(trg) # convert the PDFs to text files
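
The download loop above builds one URL per house and per day by appending the house name and an ISO date to a common prefix. The sketch below isolates that step so it can be checked without any network access. The function name `daily_report_urls` and the base URL are hypothetical and used for illustration only; the real prefix is scraped from the parliament page at run time.

```python
from datetime import date, timedelta

def daily_report_urls(base_url, start, end, houses=("Lords", "Commons")):
    """Build one PDF URL per house and per calendar day between start and end (inclusive)."""
    urls = []
    for house in houses:
        day = start
        while day <= end:
            # date.isoformat() gives YYYY-MM-DD, the same text that str(date) produces
            urls.append(f"{base_url}{house}-{day.isoformat()}.pdf")
            day += timedelta(days=1)
    return urls

# Hypothetical base URL, for illustration only
urls = daily_report_urls("http://example.org/Daily-Report-", date(2016, 6, 23), date(2016, 6, 25))
for u in urls:
    print(u)
```

Days on which a house did not sit have no report, so a download attempt for those URLs is expected to fail and should simply be skipped.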

### 2.2 Defining and calling wrapped procedures for scraping, downloading and converting time-series exchange-rate shifts to a dataframe

In [3]:
# define function to download exchange rates to text file in folder xr
def download_xr(html, trg):
    if trg=='yes': # user input in case data are already given in appropriate format
        print ('====Data were given====')
    else:
        os.makedirs(os.getcwd()+'/xr', exist_ok=True) # make the output directory unless it already exists
        soup_xr = BeautifulSoup(html) # parse the downloaded page
        xr = soup_xr.get_text() # keep only the text of the page
        #print(xr)
        with open(os.getcwd()+'/xr/'+'exchangeRates'+'.txt', "a") as text_xr: # append the text to xr/exchangeRates.txt
            text_xr.write(xr)

In [4]:
trg='no'
html_page_xr = urllib.request.urlopen("http://www.bankofengland.co.uk/boeapps/iadb/fromshowcolumns.asp?Travel=NIxIRxSUx&FromSeries=1&ToSeries=50&DAT=RNG&FD=1&FM=Jan&FY=1963&TD=11&TM=Apr&TY=2017&VFD=Y&CSVF=TT&C=C8J&Filter=N&html.x=11&html.y=9")
download_xr(html_page_xr, trg)
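
The Bank of England URL above hard-codes its query string. As a sketch, the same URL can be assembled from a dictionary with `urllib.parse.urlencode`, which makes the end date easier to change. The parameter names and values are copied from the URL above; reading `TD`, `TM` and `TY` as the end day, month and year is an assumption based on their values, and `boe_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE = "http://www.bankofengland.co.uk/boeapps/iadb/fromshowcolumns.asp"

def boe_url(to_day, to_month, to_year):
    """Rebuild the query string used in the notebook with a configurable end date."""
    params = {
        "Travel": "NIxIRxSUx", "FromSeries": 1, "ToSeries": 50, "DAT": "RNG",
        "FD": 1, "FM": "Jan", "FY": 1963,                # assumed: start of the range
        "TD": to_day, "TM": to_month, "TY": to_year,     # assumed: end of the range
        "VFD": "Y", "CSVF": "TT", "C": "C8J", "Filter": "N",
        "html.x": 11, "html.y": 9,
    }
    return BASE + "?" + urlencode(params)

print(boe_url(11, "Apr", 2017))
```

With the arguments shown, the function reproduces the URL used in the cell above character for character.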

In [38]:
# Function that cleans the downloaded exchange-rate data and transforms them to a pandas dataframe
def clean_ex(file_path):
    data = pd.read_csv(file_path,sep=" t ",header=None, encoding="ISO-8859-1") # load the text file
    data = data.loc[6:, 0:2] # remove headers
    data.reset_index(inplace=True,drop=True) # reset indices
    date=data.iloc[::2] # extract odd rows
    rate=data.iloc[1::2] # extract even rows
    date.reset_index(inplace=True,drop=True) # reset indices
    rate.reset_index(inplace=True,drop=True) # reset indices
    x=pd.concat([date,rate],axis=1) # concatenate date and rates
    x.columns=['Date','Rate'] # rename columns
    x['Date'] = pd.to_datetime(x['Date']) # convert date column to datetime
    x[['Rate']] = x[['Rate']].apply(pd.to_numeric) # convert exchange rate to float
    x = x.set_index('Date').diff() # calculate day-to-day shifts [rate(t) - rate(t-1)]; the first row has no predecessor and is NaN
    return x # return dataframe

In [39]:
data_xr = clean_ex(os.getcwd()+'/xr/exchangeRates.txt') # call clean function
print(data_xr.head()) # print head

              Rate
Date              
1999-01-05     NaN
1999-01-06  0.0082
1999-01-07  0.0007
1999-01-08  0.0068
1999-01-11  0.0011
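
The cleaning function relies on the text file alternating between a date line and a rate line, and then takes first differences. The sketch below reproduces that logic in plain Python on a few made-up values, to make the pairing and the meaning of each shift explicit. The function name `day_to_day_shifts` and the sample numbers are illustrative assumptions, not data from the Bank of England.

```python
def day_to_day_shifts(lines):
    """Pair alternating date/rate lines and return (date, rate(t) - rate(t-1)) tuples."""
    dates = lines[0::2]                       # 1st, 3rd, 5th ... lines hold the dates
    rates = [float(r) for r in lines[1::2]]   # 2nd, 4th, 6th ... lines hold the rates
    # The first observation has no predecessor, so its shift is undefined (None here, NaN in pandas)
    shifts = [None] + [round(b - a, 4) for a, b in zip(rates, rates[1:])]
    return list(zip(dates, shifts))

# Made-up values, for illustration only
sample = ["day 1", "1.50", "day 2", "1.52", "day 3", "1.51"]
for d, s in day_to_day_shifts(sample):
    print(d, s)
```

A positive shift means the rate rose relative to the previous observation, and a negative shift means it fell.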


### 2.3 Reading, cleaning, and reshaping text files to (file, word) tuples

In [70]:
def splitFileWords(file_text): # function (a) builds (file, word) tuples from a (file, text) tuple
    f,t = file_text # unpack the (file, text) tuple
    t = re.sub(r'\d+','', t) # remove numbers from the text
    t = re.sub(r'\[.*?\]|\(.*?\)|\W+', ' ', t) # replace bracketed text and punctuation with spaces
    file_word_List = [] # create an empty (file, word) list
    word_List = re.split(r'\W+',t) # split the text into words
    for w in word_List: 
        file_word_List.append((f,w.lower())) # append each word in lowercase, paired with its file
    return file_word_List

def read_file_word_RDD(argDir): # function (b) reads a directory and builds (file, word) tuples using function (a)
    file_text_RDD = sc.wholeTextFiles(argDir) # read the files and build (file, text) tuples; sc is the SparkContext of the running pySpark session
    file_word_RDD = file_text_RDD.flatMap(splitFileWords) # use function (a) to build (file, word) tuples
    #print('Read {} files from directory {}'.format(file_text_RDD.count(), argDir)) # print count and location of files used
    #print('file word count histogram')
    #print(file_word_RDD.map(lambda fwL: (len(fwL[1]))).histogram([0,10,100,500, 1000, 5000, 10000])) # print word-count histogram 
    return file_word_RDD 

file_word_RDD = read_file_word_RDD(os.getcwd()+'/textfiles') # apply function (b) on the text corpus for the analysis 
pprint(file_word_RDD.take(5)) # print (file, word) tuples indicatively

[('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Commons-2016-06-03.txt',
  ''),
 ('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Commons-2016-06-03.txt',
  'daily'),
 ('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Commons-2016-06-03.txt',
  'report'),
 ('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Commons-2016-06-03.txt',
  'friday'),
 ('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Commons-2016-06-03.txt',
  'june')]
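
The first tuple in the output above carries an empty string as its word, which happens when the text starts with a separator and `re.split` returns an empty leading item. One possible alternative, sketched below, is to extract the words directly with `re.findall`, so that empty tokens cannot occur. This is a simplification under stated assumptions: it keeps only the letters a to z, so it is not identical to function (a), which relies on `\W+` and also keeps underscores and non-ASCII letters. The name `tokenize` is hypothetical.

```python
import re

def tokenize(text):
    """Return the lowercase words of a text, skipping numbers, punctuation and bracketed passages."""
    text = re.sub(r"\[.*?\]|\(.*?\)", " ", text)   # drop bracketed text, as function (a) does
    return re.findall(r"[a-z]+", text.lower())      # keep runs of letters only; never yields ''

print(tokenize("Daily Report (Friday) 3 June 2016."))
```

In the Spark pipeline, a comparable effect could be obtained by filtering out tuples whose word is empty after the `flatMap` step.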


In [71]:
def file_word_RDD_map_reduce(file_word): # function (c) uses map and reduce to build ((file, word), count) tuples
    file_word_1_RDD = file_word.map(lambda x: (x,1)) # map each (file, word) tuple to ((file, word), 1)
    fileWord_count_RDD = file_word_1_RDD.reduceByKey(add) # sum the 1s for each (file, word) key
    from stop_words import get_stop_words # import the stop_words package
    stop_words = get_stop_words('english') # list of English stop words
    fileWord_count_RDD = fileWord_count_RDD.filter(lambda x: x[0][1] not in stop_words) # filter out stop words; x is ((file, word), count), so the word is x[0][1]
    return fileWord_count_RDD

fileWord_count_RDD = file_word_RDD_map_reduce(file_word_RDD) # build ((file, word), count) tuples
pprint(fileWord_count_RDD.take(50)) # print ((file, word), count) tuples indicatively

[(('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Commons-2016-10-13.txt',
   'tech'),
  2),
 (('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Commons-2017-01-11.txt',
   'feel'),
  2),
 (('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Lords-2016-11-08.txt',
   'special'),
  6),
 (('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Lords-2016-07-21.txt',
   'mutuals'),
  1),
 (('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Commons-2016-10-24.txt',
   'ups'),
  3),
 (('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Lords-2017-04-04.txt',
   'post'),
  3),
 (('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Lords-2016-06-30.txt',
   'implement'),
  1),
 (('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Lords-2016-10-11.txt',
   'remain

### 2.4 Creating ((file, word), count) tuples

In [74]:
def file_word_RDD_map_reduce(file_word): # function (c) uses map and reduce to build ((file, word), count) tuples
    file_word_1_RDD = file_word.map(lambda x: (x,1)) # map (file, word) tuples against 1
    fileWord_count_RDD = file_word_1_RDD.reduceByKey(add) # aggregate the (file, word) tuples
    return fileWord_count_RDD

fileWord_count_RDD = file_word_RDD_map_reduce(file_word_RDD) # build ((file, word), count) tuples
pprint(fileWord_count_RDD.take(5)) # print ((file, word), count) tuples indicatively

[(('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Commons-2016-10-13.txt',
   'tech'),
  2),
 (('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Commons-2017-01-11.txt',
   'feel'),
  2),
 (('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Lords-2016-11-08.txt',
   'special'),
  6),
 (('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Lords-2016-07-21.txt',
   'mutuals'),
  1),
 (('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Commons-2016-10-24.txt',
   'ups'),
  3)]
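
The `map` and `reduceByKey(add)` pair above is the classic word-count pattern: each (file, word) key is mapped to 1 and the 1s are summed per key. The sketch below shows the same aggregation on a small in-memory list with `collections.Counter`, purely to illustrate what the result means; it is not a replacement for the distributed version. The file names and the function name `count_file_words` are made up.

```python
from collections import Counter

def count_file_words(file_word_pairs):
    """Count how often each (file, word) pair occurs, like map(lambda x: (x, 1)).reduceByKey(add)."""
    return Counter(file_word_pairs)

# Made-up (file, word) tuples, for illustration only
pairs = [("a.txt", "tax"), ("a.txt", "trade"), ("a.txt", "tax"), ("b.txt", "tax")]
counts = count_file_words(pairs)
for key, n in sorted(counts.items()):
    print(key, n)
```

The same word in two different files is counted separately, because the file is part of the key.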


### 2.5 Creating (file, (word, count)) tuples

In [75]:
def reorganise_tuples(fw_c): # function (d) reorganises tuples from ((file, word), count) to (file, (word, count))
    fw,c = fw_c # unpack the ((file, word), count) tuple into its elements
    f,w = fw # unpack the nested (filename,word) tuple into its elements
    return (f,[(w,c)]) # reorganise the elements into the structure (file, [(word, count)]); the list lets reduceByKey(add) concatenate the pairs later

file_wordCount_RDD = fileWord_count_RDD.map(reorganise_tuples) 
pprint(file_wordCount_RDD.top(5))

#### Judging from the output below, the stop-word list that I used is poor and should be replaced,
#### but I cannot get nltk to run.

[('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Lords-2017-04-06.txt',
  [('zone', 1)]),
 ('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Lords-2017-04-06.txt',
  [('younger', 2)]),
 ('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Lords-2017-04-06.txt',
  [('young', 2)]),
 ('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Lords-2017-04-06.txt',
  [('you', 2)]),
 ('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Lords-2017-04-06.txt',
  [('yet', 1)])]


### 2.6 Creating normalised frequency vectors

In [77]:
def make_file_termFreq_norm_RDD(argDir): # function (e) produces normalised frequency vectors
    file_word_RDD = read_file_word_RDD(argDir) # use function (b) 
    fileWord_count_RDD = file_word_RDD_map_reduce(file_word_RDD) # use function (c)
    file_wordCount_RDD = fileWord_count_RDD.map(reorganise_tuples) # use function (d)
    file_wordCount2_RDD = file_wordCount_RDD.reduceByKey(add) # concatenate the [(word, count)] lists of each file
    file_wordCount_norm_RDD = file_wordCount2_RDD.map(lambda f_wcL:(f_wcL[0],[(w,c/sum([c for (w, c) in f_wcL[1]]))for (w,c) in f_wcL[1]])) # divide each count by the total word count of the file
    return file_wordCount_norm_RDD                                                

file_wordCount_norm_RDD = make_file_termFreq_norm_RDD(os.getcwd()+'/textfiles') # test on the text corpus
print(file_wordCount_norm_RDD.take(1))

word_count_norm = file_wordCount_norm_RDD.take(1)[0][1] # get the first normalised word-count list
pprint(sum([c for (w,c) in word_count_norm])) # check that the normalised frequencies sum to approximately 1

[('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Commons-2017-02-03.txt', [('now', 0.00020627062706270627), ('outgoing', 4.1254125412541255e-05), ('emission', 4.1254125412541255e-05), ('tougher', 4.1254125412541255e-05), ('certain', 4.1254125412541255e-05), ('robust', 4.1254125412541255e-05), ('reflect', 0.00012376237623762376), ('documentation', 8.250825082508251e-05), ('leading', 0.0002887788778877888), ('figure', 0.00012376237623762376), ('operators', 0.00012376237623762376), ('streaming', 4.1254125412541255e-05), ('wasted', 4.1254125412541255e-05), ('economy', 0.00016501650165016502), ('mandate', 4.1254125412541255e-05), ('publicised', 4.1254125412541255e-05), ('allowances', 0.00016501650165016502), ('normally', 4.1254125412541255e-05), ('isr', 4.1254125412541255e-05), ('adjusts', 4.1254125412541255e-05), ('comparative', 4.1254125412541255e-05), ('programming', 4.1254125412541255e-05), ('refit', 0.00012376237623762376), ('criteria', 0.00012376237623
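
The normalisation lambda in function (e) recomputes the total word count of a file once for every word in that file. A possible refinement, sketched below, computes the total once and then divides. `normalise` is a hypothetical helper name; in the pipeline it could be applied with `mapValues`, which is an assumption about how one might wire it in rather than what the notebook does.

```python
def normalise(word_counts):
    """Turn [(word, count)] into [(word, relative frequency)], computing the total only once."""
    total = sum(c for _, c in word_counts)
    if total == 0:          # an empty document has no frequencies to report
        return []
    return [(w, c / total) for w, c in word_counts]

# Made-up counts, for illustration only
freqs = normalise([("tax", 1), ("trade", 3)])
print(freqs)
print(sum(f for _, f in freqs))
```

The frequencies of each file sum to 1 up to floating-point rounding, which is the check performed in the cell above.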

### 2.7 Creating hash vectors

In [78]:
def hashing_vectorizer(word_count_list, N): # function (f) applies the hashing approach to creating a vector
    v = [0] * N # create a fixed-size vector of 0s
    for word_count in word_count_list: 
        word,count = word_count # unpack the (word, count) tuple
        h = hash(word) # get the hash value; Python 3 randomises str hashes per process unless PYTHONHASHSEED is set
        v[h % N] = v[h % N] + count # add the count to the bucket of the word
    return v # return the hashed word vector
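
Because the built-in `hash` of a string can differ between Python processes, the same word may land in different buckets on different Spark workers or in different runs unless `PYTHONHASHSEED` is fixed. The sketch below is one possible workaround, not the notebook's method: it swaps in an MD5-based hash from `hashlib`, which gives the same value everywhere. The names `stable_hash` and `stable_hashing_vectorizer` are hypothetical.

```python
import hashlib

def stable_hash(word):
    """Hash a word to a non-negative integer that is identical across processes and runs."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)

def stable_hashing_vectorizer(word_count_list, N):
    """Same bucketing idea as function (f), but with a deterministic hash."""
    v = [0] * N
    for word, count in word_count_list:
        v[stable_hash(word) % N] += count   # colliding words share a bucket and their counts add up
    return v

# Made-up normalised frequencies, for illustration only
vec = stable_hashing_vectorizer([("tax", 0.5), ("trade", 0.25), ("tax", 0.25)], 8)
print(vec, sum(vec))
```

The vector length N trades memory against collisions: a smaller N makes it more likely that unrelated words share a bucket.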

In [79]:
def make_file_wordHashVector_norm_RDD(file_wordCount_norm, argN): # function (g) applies the hashing vectoriser
    file_wordHashVector_norm_RDD = file_wordCount_norm.map(lambda f_wc: (f_wc[0],hashing_vectorizer(f_wc[1],argN))) 
    return file_wordHashVector_norm_RDD

N=500
file_wordHashVector_norm_RDD = make_file_wordHashVector_norm_RDD(make_file_termFreq_norm_RDD(os.getcwd()+'/textfiles'),N)
print(file_wordHashVector_norm_RDD.take(2)) # test
print(sum(file_wordHashVector_norm_RDD.take(1)[0][1])) # test

[('file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Commons-2017-02-03.txt', [0.001278877887788779, 0.0002887788778877888, 0.0009900990099009901, 0.0019389438943894386, 0.0008250825082508251, 0.0007838283828382838, 0.007962046204620462, 0.002062706270627063, 0.0008250825082508251, 0.0031353135313531345, 0.0018151815181518148, 0.0009488448844884489, 0.002227722772277227, 0.00024752475247524753, 0.0007013201320132013, 0.0014026402640264027, 0.002764026402640264, 0.0007013201320132013, 0.0019389438943894386, 0.0009488448844884489, 0.0015676567656765675, 0.0026815181518151814, 0.001113861386138614, 0.0007425742574257426, 0.007797029702970298, 0.018275577557755777, 0.0010726072607260724, 0.0014851485148514852, 0.0002887788778877888, 0.0010313531353135313, 0.010024752475247525, 0.00020627062706270627, 0.0024752475247524757, 0.0004950495049504951, 0.0014026402640264027, 0.0010313531353135313, 0.00024752475247524753, 0.0025577557755775576, 0.0009488448844884489, 0