# INM432: Big Data - Coursework (Part II)

## Predicting shifts in GBP-EUR exchange rates based on the content of UK parliamentary debates: A pySpark application

### Alexandros Dimitrios Nalmpantis; Georgios Kyriakopoulos (2017)

**Abstract.** This notebook presents pySpark code that implements a machine-learning pipeline, which (a) scrapes and processes daily reports on debates at the House of Commons and the House of Lords; (b) scrapes daily GBP-EUR exchange rates and calculates their day-to-day shifts; and (c) links the two to train a regression-based algorithm that predicts the latter from the former. The analysis focuses on predicting exchange-rate shifts (instead of raw exchange rates) to overcome the implications of autocorrelation for the modelling process. Alternative model hyperparameters are systematically assessed using a grid-search approach, which involves training, validating, and testing the performance of alternative model specifications.
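The autocorrelation argument can be illustrated with a minimal sketch. It uses a synthetic random walk as a stand-in for an exchange-rate series (an assumption made for illustration, not the GBP-EUR data) and a hypothetical helper `lag1_autocorrelation`. Under these assumptions the raw levels are almost perfectly correlated with their own previous value, whereas the day-to-day shifts are not.

```python
import random


def lag1_autocorrelation(series):
    """Sample lag-1 autocorrelation of a list of numbers."""
    n = len(series)
    mean = sum(series) / n
    numerator = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    denominator = sum((value - mean) ** 2 for value in series)
    return numerator / denominator


random.seed(0)  # fixed seed so the illustration is repeatable

# Synthetic "exchange rate": a random walk, NOT real GBP-EUR data
levels = [1.2]
for _ in range(2000):
    levels.append(levels[-1] + random.gauss(0, 0.005))

# Day-to-day shifts: level(t) - level(t-1)
shifts = [levels[t] - levels[t - 1] for t in range(1, len(levels))]

print("lag-1 autocorrelation of levels:", round(lag1_autocorrelation(levels), 3))
print("lag-1 autocorrelation of shifts:", round(lag1_autocorrelation(shifts), 3))
```

A model trained on raw levels could score well simply by repeating yesterday's rate; modelling the shifts removes that shortcut.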

## 1. Modules

* The modules needed for the analysis are imported below. Missing modules can be installed from a terminal with **pip install <"name of module">** (e.g. pip install tqdm) or with **conda install <"name of module">** (e.g. conda install tqdm).

In [225]:
# Import modules for scraping links
from bs4 import BeautifulSoup
import urllib.request
import re
import datetime
from datetime import date,timedelta
import os

# Import modules for downloading links
import wget
import pandas as pd

# Import modules for parsing pdf's,progress bars and handling errors
import warnings
from tqdm import tnrange, tqdm_notebook
from tika import parser
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

# Import modules for spark ML and SQL
from pyspark import SparkContext
from pyspark.ml import *
from pyspark.sql import *

# import various operators
from math import log
import time
from pprint import pprint
import sys
import matplotlib.pyplot as plt
import stop_words

# filter out all remaining warnings (the warnings module is already imported above)
warnings.filterwarnings("ignore")
# Set sparkcontext as sc
#sc=SparkContext()

## 2. Data collection and pre-processing

### 2.1 Defining and calling wrapped procedures for scraping, downloading and converting parliamentary debates to text files


In [72]:
# Define whether the report data were provided ('yes' - data were provided, 'no' - data need to be scraped)
trg='no'

# Set the first date from which reports are downloaded (up to the current date), as date(YYYY, M, D)
start_date = date(2016, 6, 23)


In [73]:
# Data control function that controls whether the data will be scraped or were provided by the students
def data_control(page,start_date,trg):
    if trg=='yes':
        os.chdir(os.getcwd()) # data were provided: keep the current folder as the working folder
    else:
        html_page = urllib.request.urlopen(page) # request the page with urllib
        soup = BeautifulSoup(html_page) # pass the page to BeautifulSoup in order to extract the links it contains
        #print (soup) # visually inspect the html structure
        hl = [] # list that stores the extracted links
        # search the html for hyperlinks starting with http://qna
        for hyperlink in soup.findAll('a', attrs={'href': re.compile("^http://qna")}):
            hl.append(hyperlink.get('href')) # store the hyperlinks found
        #    print (hyperlink.get('href'))

        url=[hl[1][:-20]+'Lords-',hl[1][:-20]+'Commons-'] # take the link at index 1 and cut off its date and house (Lords or Commons) suffix

        # build the date interval to search and download the pdf files
        cur_date = date.today() # current date
        dt = cur_date - start_date # interval in days used by the loop
        # make the directory for the downloaded files if it does not exist yet
        os.makedirs(os.getcwd()+'/parliament_practicals', exist_ok=True)

        # loop through the interval in 1-day steps and append each date to the url of either the House of Lords or the House of Commons
        for ul in url:
        #    print ('Downloading: ',str('House of '+ul[112:-2]+'s'))
            for i in tnrange(dt.days + 1,desc='Downloading: '+str(ul[112:-2]+'s')):
                try: # the Houses do not convene on every date, in which case the HTTP request returns an error; results are stored in the folder parliament_practicals
                    filename = wget.download(ul+str(start_date + timedelta(days=i))+'.pdf',os.getcwd()+'/parliament_practicals')
                except Exception:
                    continue # skip dates without a report
    
# function to convert the downloaded pdfs to text files
def convert_pdf_to_text(trg):
    if trg=='yes': # user input in case the data are already given in the appropriate format
        print ('====Data were given====')
    else:
        os.makedirs(os.getcwd()+'/textfiles', exist_ok=True) # make the directory for the text files if it does not exist yet
        list_of_files=os.listdir(os.getcwd()+'/parliament_practicals') # list the pdf files to be converted
        for i in tnrange(len(list_of_files),desc='Converting pdf to txt'): # iterate through the files with a progress bar
            if not list_of_files[i].endswith(".pdf"): # skip anything that is not a pdf file
                continue
            parsedPDF=parser.from_file(os.getcwd()+'/parliament_practicals/'+list_of_files[i]) # parse the pdf file
            # write the parsed text to a file with the same name and the extension .txt ('w' overwrites, so re-runs do not duplicate text)
            with open(os.getcwd()+'/textfiles/'+list_of_files[i][:-4]+'.txt', 'w') as text_file:
                text_file.write(parsedPDF["content"] or '') # tika returns None for pdfs without extractable text

def download_practicals_convert(start_date,trg):
    page="http://www.parliament.uk/business/publications/written-questions-answers-statements/daily-reports/" # link to the parliament daily questions and answers reports
    data_control(page,start_date,trg) # scrape the reports, or use the provided ones
    convert_pdf_to_text(trg) # convert the pdfs to text files
    print ('==Downloading and conversion to text files completed==')

# Call the function to either download the data or use the current folder as the working folder.
# If the data are given, they should be stored in the folder 'parliament_practicals'.
# We suggest running the scraping function, since downloading a year of reports only takes about 2 minutes.
download_practicals_convert(start_date,trg)







2017-04-22 19:06:08,513 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.14/tika-server-1.14.jar to /var/folders/0c/dkmpfbdd6h96whwytkxqnp880000gp/T/tika-server.jar.
2017-04-22 19:06:21,510 [MainThread  ] [INFO ]  Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.14/tika-server-1.14.jar.md5 to /var/folders/0c/dkmpfbdd6h96whwytkxqnp880000gp/T/tika-server.jar.md5.



==Downloading and conversion to text files completed==
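The download loop above builds one candidate URL per calendar day and per house, and simply skips the days for which no report exists. A minimal sketch of that URL construction follows. The helper name `daily_report_urls` and the base URL are hypothetical and used for illustration only; the real prefix is taken from the scraped page.

```python
from datetime import date, timedelta


def daily_report_urls(base_url, house, first_day, last_day):
    """Yield one candidate pdf URL per calendar day in [first_day, last_day]."""
    n_days = (last_day - first_day).days
    for offset in range(n_days + 1):
        day = first_day + timedelta(days=offset)
        # Formatting a date gives the ISO form YYYY-MM-DD used in the file names
        yield f"{base_url}{house}-{day}.pdf"


# HYPOTHETICAL base URL, for illustration only
base = "http://example.org/daily-reports/Written-Questions-"
urls = list(daily_report_urls(base, "Commons", date(2016, 6, 23), date(2016, 6, 26)))
for u in urls:
    print(u)
```

Generating every day and tolerating failed requests avoids having to know the sitting calendar of either house in advance.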


### 2.2 Defining and calling wrapped procedures for scraping, downloading and converting time-series exchange-rate shifts to a dataframe

In [151]:
# Define if exchange rate data were provided ( yes - data were provided, no - data need to be scraped)
trg='yes' # change this value to yes or no

In [153]:
# Clean the downloaded exchange-rate text file and transform it to a pandas dataframe
def clean_ex(file_path):
    data = pd.read_csv(file_path,sep=" t ",header=None, encoding="ISO-8859-1", engine="python") # load the text file; a multi-character separator needs the python engine
    data = data.loc[4:,:] # remove headers
    dates=data.iloc[::2].reset_index(drop=True) # rows 0, 2, 4, ... of the remaining file hold the dates
    rates=data.iloc[1::2].reset_index(drop=True) # rows 1, 3, 5, ... hold the rates
    x=pd.concat([rates,dates],axis=1) # concatenate rates and dates side by side
    x.columns=['Rate','Date'] # rename columns
    x=x.dropna() # drop missing values
    x['Date'] = pd.to_datetime(x['Date']) # convert date column to datetime
    x['Rate'] = pd.to_numeric(x['Rate']) # convert exchange rate to float
    x = x.set_index('Date').diff() # day-on-day difference: rate(t) - rate(t-1)
    x.reset_index(inplace=True) # restore the Date column
    x['Date']=x['Date'].dt.strftime('%Y-%m-%d') # convert to string for join matching operations
    x.Rate = x.Rate.shift(-1) # shift up by one row so that each date holds rate(t+1) - rate(t), to account for the delay of the parliament report
    x=x.dropna() # drop the last row, which has no following observation
    x.to_csv(os.getcwd()+'/xr/exchangeRates_diff.csv') # save to csv file
    return x # return dataframe

# Download the exchange rates to a text file in the folder xr (unless they were provided) and clean them
def download_xr(trg):
    if trg=='yes': # user input in case the data are already given in the appropriate format
        print ('====Data were given====')
    else:
        os.makedirs(os.getcwd()+'/xr', exist_ok=True) # make the directory xr if it does not exist yet
        # request the page only when the data need to be scraped
        html= urllib.request.urlopen("http://www.bankofengland.co.uk/boeapps/iadb/fromshowcolumns.asp?Travel=NIxIRxSUx&FromSeries=1&ToSeries=50&DAT=RNG&FD=1&FM=Jan&FY=1963&TD=11&TM=Apr&TY=2017&VFD=Y&CSVF=TT&C=C8J&Filter=N&html.x=11&html.y=9")
        soup_xr = BeautifulSoup(html)
        xr = soup_xr.get_text()
        #print(xr)
        with open(os.getcwd()+'/xr/'+'exchangeRates'+'.txt', "w") as text_xr: # 'w' overwrites, so re-runs do not append duplicate rows
            text_xr.write(xr)
    data=clean_ex(os.getcwd()+'/xr/exchangeRates.txt')
    return data

data_xr = download_xr(trg)
print(data_xr.head()) # print head

====Data were given====
         Date    Rate
0  1999-01-04 -0.0019
1  1999-01-05  0.0082
2  1999-01-06  0.0007
3  1999-01-07  0.0068
4  1999-01-08  0.0011
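The label construction in `clean_ex` (a difference followed by a one-row shift) can be sketched without pandas. The rates below are made up for illustration, and `next_day_shifts` is a hypothetical helper name; the point is only that each date ends up paired with the change to the next observation, and that the last date is dropped because it has no successor.

```python
def next_day_shifts(dated_rates):
    """Pair each date with rate(next observation) - rate(this observation)."""
    labelled = []
    for (day, rate), (_, next_rate) in zip(dated_rates, dated_rates[1:]):
        labelled.append((day, round(next_rate - rate, 4)))
    return labelled


# Made-up rates, for illustration only
rates = [
    ("2017-01-02", 1.1700),
    ("2017-01-03", 1.1750),
    ("2017-01-04", 1.1720),
    ("2017-01-05", 1.1790),
]

labels = next_day_shifts(rates)
for day, shift in labels:
    print(day, shift)
```

Pairing a report with the following shift means the text of day t is used to predict a movement that had not yet happened when the debate took place.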


In [226]:
# add libraries for plotting (the names below assume a recent Bokeh release, where bokeh.charts and hplot are no longer available)
from bokeh.io import output_notebook
from bokeh.layouts import row
from bokeh.plotting import figure,show

In [227]:
# bind bokeh to the notebook
output_notebook()

In [234]:
# add plots of exchange rate to bokeh
plot_dates = pd.to_datetime(data_xr.Date) # Date is stored as a string; the datetime axis needs datetime values

# set figure 1 properties
p1 = figure(title="Exchange rate day difference (£ to Euro)",
            x_axis_label="Date",
            y_axis_label='Day on day difference',
            x_axis_type="datetime",
            width=470,
            height=450,
            background_fill_color="beige",border_fill_color="whitesmoke")

# plot figure 1 with absolute and mean values
p1.line(plot_dates,data_xr.Rate,legend_label='Absolute Values')
p1.line(plot_dates,data_xr.Rate.rolling(window=5).mean(),legend_label='Rolling Mean (5 day window)',color='red')
# set figure 2 properties
p2 = figure(title="Exchange rate day difference (£ to Euro)",
            x_axis_label="Date",
            y_axis_label='Day on day difference',
            x_axis_type="datetime",
            width=470,
            height=450,
            background_fill_color="beige",border_fill_color="whitesmoke")
# plot figure 2 with absolute and mean values
p2.line(plot_dates,data_xr.Rate,legend_label='Absolute Values')
p2.line(plot_dates,data_xr.Rate.rolling(window=30).mean(),legend_label='Rolling Mean (30 day window)',color='red')
# plot the line charts horizontally
p = row(p1, p2)
show(p)
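The red lines in the two figures are trailing moving averages of the daily shifts. As a minimal sketch of what such a window computes (a plain-Python stand-in with a hypothetical helper `rolling_mean`, not the pandas implementation), each output value is the mean of the current value and the `window - 1` values before it, and is undefined until a full window is available.

```python
def rolling_mean(values, window):
    """Trailing moving average; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out


# Toy values, for illustration only
shifts = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
smoothed = rolling_mean(shifts, 3)
print(smoothed)
```

A wider window (30 days instead of 5) smooths out more of the day-to-day noise at the cost of reacting more slowly to genuine changes.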

## 3. Pipeline

### Task A: Select a dataset and make initial load and transformations

In [63]:
# Create dataframe of filename - text with numbers and punctuation removed

# Set stop word parameters - the stop_words package was used for stopword removal
# The StopWordsRemover from pyspark.ml.feature was also tested extensively but proved ineffective or buggy
from stop_words import get_stop_words
stop_words = set(get_stop_words('english')) # a set gives fast membership tests

def remove_n_p(text): # function that removes numbers, punctuation and stopwords, and lowercases the text
    text = re.sub(r'\d+','', text) # remove numbers from the text with a regular expression
    text = re.sub(r'\[.*?\]|\(.*?\)|\W+', ' ', text) # replace bracketed spans and runs of non-word characters with a space
    text=text.lower() # lowercase the text
    text = ' '.join([word for word in text.split() if word not in stop_words]) # remove stopwords
    return text

# extract date from filename function 
def trim_filename(filename):
    date=filename[-14:-4] # extract the timestamp from end of the file
    return date # return date
    
# sparksession added for spark dataframes
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext # spark context of the session, used by wholeTextFiles below

def make_dataFrame(dirPath): # make a dataFrame with filename and text
    ft_RDD = sc.wholeTextFiles(dirPath) # create an RDD of (file path, file content) pairs with wholeTextFiles
    spm_t_RDD = ft_RDD.map(lambda ft: (trim_filename(ft[0]), remove_n_p(ft[1]))) # keep the date from the filename and clean the text with remove_n_p
    file_text_df = spark.createDataFrame(spm_t_RDD,schema=['id','text']) # create a dataFrame with the columns id (date) and text
    return file_text_df

# file_text_df = make_dataFrame(os.getcwd()+'/textfiles') # assign dataframe to file_text_df
# file_text_df.show(5)

+----------+--------------------+
|        id|                text|
+----------+--------------------+
|2016-06-03|daily report frid...|
|2016-06-06|daily report mond...|
|2016-06-07|daily report tues...|
|2016-06-08|daily report wedn...|
|2016-06-09|daily report thur...|
+----------+--------------------+
only showing top 5 rows
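The effect of the two pre-processing helpers can be checked on a single string without Spark. The sketch below repeats the same regular expressions and the same filename slice, but uses a two-word stop list and made-up inputs (both assumptions for illustration; the notebook uses the full English list from the stop_words package). The names `clean_text`, `date_from_filename` and `DEMO_STOP_WORDS` are hypothetical.

```python
import re

# Tiny stand-in stop list, for illustration only
DEMO_STOP_WORDS = {"the", "of"}


def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Drop digits, punctuation and stop words, and lowercase the text."""
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'\[.*?\]|\(.*?\)|\W+', ' ', text)
    text = text.lower()
    return ' '.join(word for word in text.split() if word not in stop_words)


def date_from_filename(filename):
    """Take the YYYY-MM-DD stamp that precedes the 4-character extension."""
    return filename[-14:-4]


sample = "Daily Report: Friday 3 June 2016, the House of Commons!"
cleaned = clean_text(sample)
doc_id = date_from_filename("file:/tmp/textfiles/Daily-Report-Commons-2016-06-03.txt")
print(doc_id, "|", cleaned)
```

Because the date is taken by position, the slice only works for filenames that end in a ten-character date followed by a four-character extension such as `.txt`.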



In [74]:
# Convert currency pandas dataframe to Spark Dataframe and specify datatypes

def currency_df(data_xr):# function to create dataframe
    data_xr_DF=spark.createDataFrame(data_xr,schema=['Date','Rate']) # create dataframe
    data_xr_DF=data_xr_DF.withColumn('Rate', data_xr_DF['Rate'].cast('float'))# convert rate to float
    return data_xr_DF # return spark dataframe

# function to join the two dataframes on dates; the exchange-rate shifts have been moved back by one day
# in order to account for the delay of the parliament publication
def connect_xr_df(file_text_df,data_xr_DF): # the function takes the two inputs file_text_df and data_xr_DF (from the previous function)
    # left outer join on the date: every report is kept, and reports without a matching exchange-rate date get a null label
    file_text_Date_rate=file_text_df.join(data_xr_DF,file_text_df.id==data_xr_DF.Date,'left_outer')
    file_text_Date_rate.createOrReplaceTempView("temp") # create a temporary sql view
    file_text_Date_rate_sql = spark.sql("SELECT id,text,Rate as label FROM temp") # select the three columns required for the analysis and rename Rate to label
    return file_text_Date_rate_sql # return the dataframe for analysis

id_text_label=connect_xr_df(make_dataFrame(os.getcwd()+'/textfiles'),currency_df(data_xr))
id_text_label.show(5)

+----------+--------------------+-------+
|        id|                text|  label|
+----------+--------------------+-------+
|2017-02-24|daily report frid...|-0.0069|
|2017-02-24|friday february p...|-0.0069|
|2016-07-06|daily report wedn...| 0.0041|
|2016-07-06|wednesday july p ...| 0.0041|
|2017-03-08|daily report wedn...|-0.0044|
+----------+--------------------+-------+
only showing top 5 rows
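The join semantics can be sketched in plain Python. Each report keeps its row, reports that share a date (one per house, as in the output above) receive the same label, and a report whose date has no exchange-rate entry gets `None`, which is the counterpart of a null label in Spark. The third document and the helper name `left_join_labels` are made up for illustration.

```python
def left_join_labels(documents, rate_by_date):
    """Left outer join of (id, text) rows with a date -> shift lookup."""
    return [(doc_id, text, rate_by_date.get(doc_id)) for doc_id, text in documents]


documents = [
    ("2017-02-24", "daily report friday"),
    ("2017-02-24", "friday february"),
    ("2017-02-25", "saturday sitting"),  # hypothetical date without a rate
]
rate_by_date = {"2017-02-24": -0.0069}

rows = left_join_labels(documents, rate_by_date)
for row in rows:
    print(row)
```

Rows with a missing label would have to be dropped or filled before a regression model is trained on them.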



### Task B: Implement a machine learning pipeline in Spark, including feature extractors, transformers, and/or selectors. 

### Task C: Evaluate the performance of your pipeline using training and test set 