# INM432: Big Data - Coursework (Part II)

## Using the content of post-EU-referendum Great British parliamentary debates to predict GBP-EUR exchange rates: A pySpark application

### Alexandros Dimitrios Nalmpantis; Georgios Kyriakopoulos (2017)

**Abstract.** This notebook presents pySpark code that implements a machine-learning pipeline, which (a) scrapes and processes daily reports on debates at the House of Commons and the House of Lords from 23 June 2016 onwards; (b) scrapes daily GBP-EUR exchange rates and calculates their lag; and (c) links the two to train a regression-based algorithm that predicts the latter from the former. Lags of exchange rates are used (instead of raw exchange rates) to address autocorrelation in the modelling process. Alternative model hyperparameters are assessed with a grid search, which involves training, validating, and testing model performance.

### 1. Modules

* The modules needed for the analysis are imported below.
* Some modules may need to be installed first by running one of the following commands in a terminal:
    **pip install <"name of module">**, e.g. pip install tqdm
    or **conda install <"name of module">**, e.g. conda install tqdm

In [4]:
# Import modules for scraping links
from bs4 import BeautifulSoup
import urllib.request
import re
import datetime
from datetime import date,timedelta
import os

# Import modules for downloading links
import wget
import pandas as pd

# Import modules for parsing PDFs, showing progress bars and handling warnings
import warnings
from tqdm import tnrange, tqdm_notebook
from tika import parser
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

# Import modules for Spark ML, math and operators
from operator import add
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.util import MLUtils
from pyspark import SparkContext
sc = SparkContext.getOrCreate() # 'sc' must be a SparkContext instance, not the SparkContext class itself
from math import log
import time
from pprint import pprint
import sys
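
Before running the imports above, it can help to check which modules are still missing from the environment. The snippet below is a minimal sketch, not part of the original pipeline: the helper `missing_modules` and the list `REQUIRED_MODULES` are hypothetical names introduced here, and the list holds import names, which are not always identical to the package names expected by pip or conda.

```python
import importlib.util

# Import names used in this notebook (assumption: these may differ from the
# package names that pip or conda expect)
REQUIRED_MODULES = ["bs4", "wget", "pandas", "tqdm", "tika", "pyspark"]

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Modules still to install:", missing_modules(REQUIRED_MODULES))
```

Any name printed here would then be installed with one of the commands listed at the start of this section.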

### 2. Data collection

### 2.1 Wrapped procedures for scraping, downloading and converting parliamentary debates to text files

Daily reports of debates at the House of Lords and the House of Commons are published on the parliament website as PDFs. Below we scrape and download these, before converting them into txt format. Using the trigger argument *trg* we offer the option of not downloading the reports, if these are already available.

In [10]:
# Data control function that controls whether the data will be scraped or were provided by the students
def data_control(page,start_date,trg):
    if trg=='yes':
        os.chdir(os.getcwd()) # data already provided: keep the current folder as the working folder
    else:
        html_page = urllib.request.urlopen(page) #request page with urllib
        soup = BeautifulSoup(html_page, 'html.parser') #pass the page to Beautiful Soup to extract the links contained in the webpage
        #print (soup) #visually inspect the html structure
        hl = [] #hyperlink list to store the extracted links
        ##search html for hyperlinks starting with http://qna
        for hyperlink in soup.findAll('a', attrs={'href': re.compile("^http://qna")}): 
            hl.append(hyperlink.get('href')) #store the hyperlinks found in the list
        #    print (hyperlink.get('href'))
        
        url=[hl[1][:-20]+'Lords-',hl[1][:-20]+'Commons-'] #take the second link found (hl[1]) and cut off the date and the category (Lords or Commons)
    
        #date interval search set and downloading of the pdf files
        ##create interval search date
        today=datetime.datetime.today() #today's date set
        cur_date = date(today.year,today.month,today.day)  # set current date in format of YYYY-MM-DD
    
        dt = cur_date - start_date #calculate interval in days to use for loop
        #make directory for the downloaded files (no error if it already exists)
        os.makedirs(os.getcwd()+'/parliament_practicals', exist_ok=True)
    
        #loop through the interval with a 1-day step and append the date to the url of each house (Lords or Commons)
        for ul in url:
        #    print ('Downloading: ',str('House of '+ul[112:-2]+'s'))
            for i in tnrange(dt.days + 1,desc='Downloading: '+str(ul[112:-2]+'s')):
                #some dates have no report (e.g. days on which a House does not convene) and the HTTP request fails; skip those dates
                #successful downloads are stored in the folder parliament_practicals
                try:
                    filename = wget.download(ul+str(start_date + timedelta(days=i))+'.pdf',os.getcwd()+'/parliament_practicals')
                except Exception:
                    continue
    
# function to convert the downloaded pdfs to text files
def convert_pdf_to_text(trg):
    if trg=='yes': # user input in case data are already given in appropriate format
        print ('====Data were given====')
    else:
        os.makedirs(os.getcwd()+'/textfiles', exist_ok=True) # make the directory for the text files if it does not exist yet
        list_of_files=os.listdir(os.getcwd()+'/parliament_practicals') # create a list of pdf files to be converted
        for i in tnrange(len(list_of_files),desc='Converting pdf to txt'): # iterate through the files in the list with a progress bar
            if not list_of_files[i].endswith(".pdf"): # skip anything that is not a pdf file
                continue
            parsedPDF=parser.from_file(os.getcwd()+'/parliament_practicals/'+list_of_files[i]) # parse pdf file
            if not parsedPDF.get("content"): # skip files for which no text was extracted
                continue
            # create new file with extension .txt; 'w' overwrites, so re-running does not duplicate the text
            with open(os.getcwd()+'/textfiles/'+list_of_files[i][:-4]+'.txt', 'w') as text_file:
                text_file.write(parsedPDF["content"]) # write parsed pdf to text
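
The download loop above builds one candidate URL per house and per calendar day by appending the house name, the ISO date and `.pdf` to a common prefix. The sketch below illustrates only that URL construction, under stated assumptions: `daily_report_urls` is a hypothetical helper and `http://example.org/reports/` is a placeholder prefix, not the real address scraped from the parliament website.

```python
from datetime import date, timedelta

def daily_report_urls(base_url, start, end, houses=("Lords", "Commons")):
    """Build one candidate PDF URL per house and per day from start to end (inclusive)."""
    n_days = (end - start).days + 1
    return [
        # str(date) yields YYYY-MM-DD, the same format the download loop appends
        f"{base_url}{house}-{start + timedelta(days=i)}.pdf"
        for house in houses
        for i in range(n_days)
    ]

# Placeholder prefix (assumption); three days starting on the referendum date
urls = daily_report_urls("http://example.org/reports/", date(2016, 6, 23), date(2016, 6, 25))
for u in urls:
    print(u)
```

Not every candidate URL exists, which is why the download loop wraps each request in a `try`/`except` and skips the dates that fail.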

###     2.2 User input of parameters and function calling
Here, we call the functions that scrape, download, and produce daily parliamentary-debate reports in txt format.

In [12]:
# If the data are already provided (to save time), set the following parameter to 'yes'
trg='no'
# set link to parliament daily questions and answers reports
page="http://www.parliament.uk/business/publications/written-questions-answers-statements/daily-reports/"
# set the start date; reports are downloaded from this date up to the current date, using date(YYYY, M, D)
start_date = date(2016, 6, 23)

# Call the function to either download the data or keep the current folder as the working folder. Please make sure that,
# if the data are given, they are stored in the folder 'parliament_practicals'
# We suggest running the scraping function, since downloading a year of reports only takes about 2 minutes
data_control(page,start_date,trg) # download the reports (or use the data provided)
convert_pdf_to_text(trg) # convert the pdfs to text files

### 2.3 Defining a function for sourcing daily exchange rates
Daily GBP-EUR exchange rates are published by the Bank of England. Below, we download these in txt format. Again, using the trigger argument *trg*, we offer the option of not downloading the exchange rates if these are already available.

In [3]:
# define function to download exchange rates to a text file in folder xr
def download_xr(html, trg):
    if trg=='yes': # user input in case data are already given in appropriate format
        print ('====Data were given====')
    else:
        os.makedirs(os.getcwd()+'/xr', exist_ok=True) # make the directory xr if it does not exist yet
        soup_xr = BeautifulSoup(html) # parse the downloaded page
        xr = soup_xr.get_text() # keep only the text of the page
        #print(xr)
        # 'w' overwrites the file, so re-running the cell does not duplicate the data
        with open(os.getcwd()+'/xr/'+'exchangeRates'+'.txt', "w") as text_xr:
            text_xr.write(xr)

### 2.4 User input of parameters and function calling
Here, we call the function that scrapes and downloads daily GBP-EUR exchange rates in txt format.

In [4]:
trg='no'
html_page_xr = urllib.request.urlopen("http://www.bankofengland.co.uk/boeapps/iadb/fromshowcolumns.asp?Travel=NIxIRxSUx&FromSeries=1&ToSeries=50&DAT=RNG&FD=1&FM=Jan&FY=1963&TD=11&TM=Apr&TY=2017&VFD=Y&CSVF=TT&C=C8J&Filter=N&html.x=11&html.y=9")
download_xr(html_page_xr, trg)
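
The request above hard-codes its date range in the query string. The sketch below shows one way to inspect and change those parameters with the standard library. It rests on an assumption that is not confirmed by the source: that `FD`/`FM`/`FY` and `TD`/`TM`/`TY` are the from and to day, month and year. The new end date (30 Jun 2017) is a hypothetical value chosen only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

url = ("http://www.bankofengland.co.uk/boeapps/iadb/fromshowcolumns.asp"
       "?Travel=NIxIRxSUx&FromSeries=1&ToSeries=50&DAT=RNG&FD=1&FM=Jan&FY=1963"
       "&TD=11&TM=Apr&TY=2017&VFD=Y&CSVF=TT&C=C8J&Filter=N&html.x=11&html.y=9")

parts = urlsplit(url)
# parse_qs returns a list per key; every key occurs once here, so keep the first value
params = {key: values[0] for key, values in parse_qs(parts.query).items()}
print("From:", params["FD"], params["FM"], params["FY"])
print("To:  ", params["TD"], params["TM"], params["TY"])

# Assumption: TD/TM/TY set the end of the range; the date below is hypothetical
params.update({"TD": "30", "TM": "Jun", "TY": "2017"})
new_url = urlunsplit(parts._replace(query=urlencode(params)))
print(new_url)
```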

In [70]:
# Clean the downloaded exchange-rate data and transform them to a pandas dataframe
def clean_ex(file_path):
    data = pd.read_csv(file_path,sep=" t ",header=None, encoding="ISO-8859-1") # load the text file
    data = data.loc[6:, 0:2] # remove headers
    data.reset_index(inplace=True,drop=True) # reset indexes
    date=data.iloc[::2] # extract every other row starting at index 0 (the dates)
    rate=data.iloc[1::2] # extract every other row starting at index 1 (the rates)
    date.reset_index(inplace=True,drop=True) # reset indexes
    rate.reset_index(inplace=True,drop=True) # reset indexes
    x=pd.concat([date,rate],axis=1) # concatenate date and rates
    x.columns=['Date','Rate'] # rename columns
    return x # return dataframe
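
The core of `clean_ex` is a de-interleaving step: after the header rows are dropped, dates and rates alternate on consecutive rows, and the two slices pair them up again. The sketch below shows the same idea on a plain Python list. The list `flat` is a hand-made sample whose values are taken from the first rows of the output further down, not a parsed file.

```python
# Hand-made sample: dates and rates alternate, as in the downloaded text file
flat = ["05 Jan 99", "1.4048", "06 Jan 99", "1.413", "07 Jan 99", "1.4137"]

dates = flat[::2]    # items at positions 0, 2, 4, ... are the dates
rates = flat[1::2]   # items at positions 1, 3, 5, ... are the rates

# Pair each date with the rate that follows it
rows = list(zip(dates, rates))
for row in rows:
    print(row)
```

This pairing only works if the two slices have the same length, so a stray header or footer row left in the data would shift every pair.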

In [71]:
data = clean_ex(os.getcwd()+'/xr/exchangeRates.txt') # call clean function
print(data.head()) # print head

        Date    Rate
0  05 Jan 99  1.4048
1  06 Jan 99   1.413
2  07 Jan 99  1.4137
3  08 Jan 99  1.4205
4  11 Jan 99  1.4216
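
The abstract states that lags of the exchange rate are used instead of raw rates. The notebook section shown here does not yet compute them, so the following is only a minimal sketch of what a one-day lag and a day-on-day change could look like, using the first rows printed above. The field names `rate_lag1` and `rate_change` are hypothetical, and reading "lag" as a one-observation shift is an assumption.

```python
from datetime import datetime

# First rows of the cleaned exchange-rate data shown above (dates and rates are still strings)
raw_rows = [("05 Jan 99", "1.4048"), ("06 Jan 99", "1.413"),
            ("07 Jan 99", "1.4137"), ("08 Jan 99", "1.4205")]

# Parse the dates and rates, then sort chronologically
parsed = sorted((datetime.strptime(d, "%d %b %y").date(), float(r)) for d, r in raw_rows)

lagged = []
for i, (day, rate) in enumerate(parsed):
    previous = parsed[i - 1][1] if i > 0 else None  # rate of the previous observation
    change = rate - previous if previous is not None else None
    lagged.append({"date": day, "rate": rate, "rate_lag1": previous, "rate_change": change})

for row in lagged:
    print(row)
```

On a pandas dataframe with a numeric `Rate` column, the same two quantities can be obtained with `Series.shift(1)` and `Series.diff()`.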


### 3. Load Spark, read the files and split them into (file, word) tuples

In [5]:
def splitFileWords(file_text): # function (a) builds (file, word) tuples from (file, text) tuples
    f,t = file_text # define the input to the function
    file_word_List = [] # create an empty (file,word) list
    word_List = re.split(r'\W+',t) # split texts into words using a regular expression (raw string avoids an invalid-escape warning)
    for w in word_List: 
        file_word_List.append((f,w.lower())) # append words in lowercase to their corresponding file
    return file_word_List

def read_file_word_RDD(argDir): # function (b) reads the files and builds (file, word) tuples using function (a)
    file_text_RDD = sc.wholeTextFiles(argDir) # read the files and build (file, text) tuples
    file_word_RDD = file_text_RDD.flatMap(splitFileWords) # use function (a) to build (file, word) tuples
    #print('Read {} files from directory {}'.format(file_text_RDD.count(), argDir)) # print count and location of files used
    #print('file word count histogram')
    #print(file_word_RDD.map(lambda fwL: (len(fwL[1]))).histogram([0,10,100,500, 1000, 5000, 10000])) # print word-count histogram 
    return file_word_RDD 

# create (or reuse) a SparkContext instance; calling wholeTextFiles() on the SparkContext class itself
# raises "TypeError: wholeTextFiles() missing 1 required positional argument: 'path'"
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

file_word_RDD = read_file_word_RDD(os.getcwd()+'/textfiles') # apply function (b) on the text corpus for the analysis 
pprint(file_word_RDD.take(2)) # print (file, word) tuples indicatively
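
To see what the splitting step produces without starting Spark, the sketch below applies the same logic to two in-memory (file, text) tuples. It is an illustration under stated assumptions: the file paths and sentences are made up, `split_file_words` is a hypothetical variant of `splitFileWords` that also drops the empty strings `re.split` can return at the edges of a text, and `itertools.chain.from_iterable` stands in for the flattening that `flatMap` performs on an RDD.

```python
import re
from itertools import chain

def split_file_words(file_text):
    """Turn one (file, text) tuple into a list of (file, lowercase word) tuples."""
    path, text = file_text
    # re.split can yield empty strings at the start or end of the text; skip them
    return [(path, word.lower()) for word in re.split(r"\W+", text) if word]

# Made-up stand-ins for the (file, text) tuples that wholeTextFiles() would return
files = [
    ("file:/textfiles/Lords-2016-06-23.txt", "The noble Lord asked."),
    ("file:/textfiles/Commons-2016-06-23.txt", "Order, order!"),
]

# Flatten the per-file lists into a single list of (file, word) tuples
pairs = list(chain.from_iterable(split_file_words(ft) for ft in files))
for pair in pairs:
    print(pair)
```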