# INM432: Big Data - Coursework (Part II)

## Predicting shifts in GBP-EUR exchange rates based on the content of UK parliamentary debates: A pySpark application

### Alexandros Dimitrios Nalmpantis; Georgios Kyriakopoulos (2017)

**Abstract_** This notebook presents pySpark code that implements a machine-learning pipeline, which (a) scrapes and processes daily reports on debates at the House of Commons and the House of Lords; (b) scrapes daily GBP-EUR exchange rates and calculates their day-to-day shifts; and (c) links the two to train a regression-based algorithm that predicts the latter from the former. The analysis focuses on the prediction exchange-rate shifts(instead of raw exchange rates) to overcome the implications of autocorrelation for the modelling process. Alternative model hyperparameters are systematically assessed using a grid-search approach, which involves training, validating, and testing the performance of alternative model specifications.   

## 1. Modules

* Modules needed for the analysis are imported below. Some modules may need to be installed with the following commands to a termninal: **pip install <"name of module">** eg: pip install tqdm  or **with conda install <"name of module">** eg: conda install tqdm

In [12]:
# Import modules for scraping links
from bs4 import BeautifulSoup
import urllib.request
import re
import datetime
from datetime import date,timedelta
import os

# Import modules for downloading links
import wget
import pandas as pd

# Import midules for parsing pdf's,progress bars and handling errors
import warnings
from tqdm import tnrange, tqdm_notebook
from tika import parser
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

# Import modules for spark ML, math and operators
import re
from operator import add
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.util import MLUtils
from pyspark import SparkContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.feature import Tokenizer,HashingTF, IDF
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import CountVectorizer,CountVectorizerModel
from pyspark.ml.regression import LinearRegression
from math import log
import time
from pprint import pprint
import sys

# Set sparkcontext as sc
#sc=SparkContext() <<<<<<

## 2. Data collection and pre-processing

###     2.1 Defining and calling wrapped procedures for scraping, downloading and converting parliamentary debates to text files


In [3]:
# Data control function that controls wether the data will be scraped or were provided by students
def data_control(page,start_date,trg):
    if trg=='yes':
        os.chdir(os.getcwd)
    else:
        html_page = urllib.request.urlopen(page) #request page with urllib packages
        soup = BeautifulSoup(html_page) #pass the page to beautiful soup in order to extract the links contained in webpage 
        #print (soup) #visually inspect the html structure
        hl = [] #set hyperlink array to store the extracted links
        ##search html for hyperlinks starting with qna
        for hyperlink in soup.findAll('a', attrs={'href': re.compile("^http://qna")}): 
            hl.append(hyperlink.get('href')) #store the hyperlinks found on an array
        #   print (link.get('href'))
        
        url=[hl[1][:-20]+'Lords-',hl[1][:-20]+'Commons-'] #take first result and cut the dates and category of either lords or commons
    
        #date interval search set and downloading of the pdf files
        ##create interval search date
        today=datetime.datetime.today() #today's date set
        cur_date = date(today.year,today.month,today.day)  # set current date in format of YYYY-MM-DD
    
        dt = cur_date - start_date #calculate interval in days to use for loop
        #make directory to downloaded files
        try:
            os.makedirs(os.getcwd()+'/parliament_practicals') #make directory to downloaded files
        except:
            pass
    
        #loop throught the interval with 1 day step and append the date to url along with categories of either house of Lords or Commons
        for ul in url:
        #    print ('Downloading: ',str('House of '+ul[112:-2]+'s'))
            for i in tnrange(dt.days + 1,desc='Downloading: '+str(ul[112:-2]+'s')):
                try: #test for errors and pass since there are dates that the House of Lords do not convene and HTTP request returns error; Also store results on folder parliament practicals
                    filename = wget.download(ul+str(start_date + timedelta(days=i))+'.pdf',os.getcwd()+'/parliament_practicals')
                except:
                    next  
    
# function to convert the downloaded pdfs to text files
def convert_pdf_to_text(trg):
    if trg=='yes': # user input in case data are already given in appropriate format
        print ('====Data were given====')
    else:
        try: # test if directory textfiles already exists otherwise make the directory
            os.makedirs(os.getcwd()+'/textfiles') #make directory to downloaded files
        except:
            pass
        list_of_files=os.listdir(os.getcwd()+'/parliament_practicals') # create a list of pdf files to be converted
        for i in tnrange(len(list_of_files),desc='Converting pdf to txt'): # iterate throught the files on the list and install progress bar
            if list_of_files[i].endswith(".pdf"): # check that file input is pdf file
                parsedPDF=parser.from_file(os.getcwd()+'/parliament_practicals/'+list_of_files[i]) # parse pdf file
                text_file = open(os.getcwd()+'/textfiles/'+list_of_files[i][:-4]+'.txt', 'a') # create new filename with extension .txt
                text_file.write(parsedPDF["content"]) # write parsed pdf to text
                text_file.close() # close text file
            else: # if file other than pdf continue loop
                next

In [None]:
# If data are given for time saving purposes then set the following parameter to yes
trg='no'
# set link to parliament daily questions and answers reports
page="http://www.parliament.uk/business/publications/written-questions-answers-statements/daily-reports/" # set link to parliament daily questions and answers reports
# set from which date to current date the function will download reports in YYYY-M-D format below
start_date = date(2016, 6, 1)

# Call function to either download the data or set current folder as working folder...please make sure that
# if data are give then those should be stored on the folder: 'parliament_practicals'
# We suggest to run the scraping function since it only takes 2minutes for downloading a year of reports
data_control(page,start_date,trg) # call set data function
convert_pdf_to_text(trg) # convert to pdf function

### 2.2 Defining and calling wrapped procedures for scraping, downloading and converting time-series exchange-rate shifts to a dataframe

In [4]:
# define function to download exchange rates to text file in folder xr
def download_xr(html, trg):
    if trg=='yes': # user input in case data are already given in appropriate format
        print ('====Data were given====')
    else:
        try: # test if directory xr already exists otherwise make the directory
            os.makedirs(os.getcwd()+'/xr') #make directory to downloaded files
        except:
            pass
        soup_xr = BeautifulSoup(html)
        xr = soup_xr.get_text()
        #print(xr)
        text_xr = open(os.getcwd()+'/xr/'+'exchangeRates'+'.txt', "a")
        text_xr.write(xr)
        text_xr.close()
        next

In [None]:
trg='no'
html_page_xr = urllib.request.urlopen("http://www.bankofengland.co.uk/boeapps/iadb/fromshowcolumns.asp?Travel=NIxIRxSUx&FromSeries=1&ToSeries=50&DAT=RNG&FD=1&FM=Jan&FY=1963&TD=11&TM=Apr&TY=2017&VFD=Y&CSVF=TT&C=C8J&Filter=N&html.x=11&html.y=9")
download_xr(html_page_xr, trg)

In [6]:
# Clean exchange data download function and transform to pandas dataframe
def clean_ex(file_path):
    data = pd.read_csv(file_path,sep=" t ",header=None, encoding="ISO-8859-1") # load the text file
    data = data.loc[6:, 0:2] # remove headers
    data.reset_index(inplace=True,drop=True) # reset indices
    date=data.iloc[::2] # extract odd rows
    rate=data.iloc[1::2] # extract even rows
    date.reset_index(inplace=True,drop=True) # reset indices
    rate.reset_index(inplace=True,drop=True) # reset indices
    x=pd.concat([date,rate],axis=1) # concatenate date and rates
    x.columns=['Date','Rate'] # rename columns
    x=x.dropna()
    x['Date'] = pd.to_datetime(x['Date']) # convert date column to date
    x[['Rate']] = x[['Rate']].apply(pd.to_numeric) #convert exchange rate to float
    x.dtypes #check data types
    x = x.set_index('Date').diff() # calculate [rate(t+1) - rate(t)]
    return x # return dataframe

In [None]:
data_xr = clean_ex(os.getcwd()+'/xr/exchangeRates.txt') # call clean function
print(data_xr.head()) # print head

## 3. Pipeline

### Task A: Select a dataset and make initial load and transformations

In [7]:
# Create dataframe of filename - text with numbers and punctuation removed

def remove_n_p(text): # function that removes punctuation and numbers as well lowercasing the text
    text = re.sub(r'\d+','', text) # remove numbers from texts with regular expressions <<<<<
    text = re.sub(r'\[.*?\]|\(.*?\)|\W+', ' ', text)# remove punctuation from texts with regular expressions <<<<<
    text=text.lower() # lowercase the text
    return text

# we need a SparkSession to create DataFrames
spark = SparkSession.builder.getOrCreate()

def make_dataFrame(dirPath): # make a dataFrame with filename and text 
    ft_RDD = sc.wholeTextFiles(dirPath) # add code to create an RDD with wholeTextFiles
    spm_t_RDD = ft_RDD.map(lambda ft: (ft[0], remove_n_p(ft[1]))) # create RDD with filename and call remove_n_p function to text
    rows_DF = spark.createDataFrame(spm_t_RDD,schema=['id','text']) # create a dataFrame - filename - text
    return rows_DF

file_text_df = make_dataFrame(os.getcwd()+'/textfiles') # assign dataframe to file_text_df
file_text_df.take(1)

[Row(id='file:/c:/Spark/bin/textfiles/Written-Questions-Answers-Statements-Daily-Report-Commons-2016-06-03.txt', text=' daily report friday june this report shows written answers and statements provided on june and the information is correct at the time of publication p m june for the latest information on written questions and answers ministerial corrections and written statements please visit http www parliament uk writtenanswers contents answers business innovation and skills business stafford companies ownership department for business innovation and skills food disabled students allowances laboratories investment minimum wage minimum wage arrears nurses training overseas trade mexico property ownership universities admissions universities finance video games internet cabinet office anti corruption summit cabinet office food census armed forces passports treasury bank services bank services fees and charges bank services ict banks closures debts usa eu budget contributions immigrat

In [9]:
# Seperate dataset to train and test
[train_DF,test_DF] = file_text_df.randomSplit([8.0,2.0])

### Task B: Implement a machine learning pipeline in Spark, including feature extractors, transformers, and/or selectors. 

In [42]:

#Step1: use tokenizer to split word into array and sql to select the filename - word_array created
tokenizer = Tokenizer(inputCol="text", outputCol="words")

#Step2: make hashTF sparse vector with maximum 500 features
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")

#Step 3: feed hash vector to calculate idf
idf = IDF(inputCol=hashingTF.getOutputCol(), outputCol="idf") #<<<

#Step4: linear regression parameters
lr = LinearRegression()

#Step 5: configure alternative pipelines 
pipeline_hash = Pipeline(stages=[tokenizer, hashingTF, lr]) #with hash vector only
pipeline_hash_idf = Pipeline(stages=[tokenizer, hashingTF, idf, lr]) #with hash vector and idf <<

#Step 6: set parameter grid
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [500]) \
    .addGrid(lr.regParam, [0.3]) \
    .addGrid(lr.maxIter, [50]) \
    .addGrid (lr.elasticNetParam,[0.8])\
    .build()

In [32]:
# Train model
train = pipeline_hash.fit(train_DF)



In [None]:
# Test model <<< an exo katalavei kala, de prepei na xreiazetai
prediction = train.transform(test_DF)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    rid, text, prob, prediction = row
    print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), prediction))

### Task C: Evaluate the performance of your pipeline using training and test set 

In [None]:
# Validate the model
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)

# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(train)

# Make predictions on test data. model is the model with combination of parameters
# that performed best.
model.transform(test)\
    .select("features", "label", "prediction")\
    .show()

In [129]:
# ************ GEORGE's based on example data

#***
training = spark.read.format("libsvm")\
    .load("C:/Spark/data/mllib/sample_linear_regression_data.txt")
    
lr = LinearRegression()\
     .setFeaturesCol("features")\
     .setLabelCol('label')

grid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.3]) \
    .addGrid(lr.maxIter, [50]) \
    .addGrid (lr.elasticNetParam,[0.8])\
    .build()
    
pipe = Pipeline(stages=[lr])


# lrModel = lr.fit(training)
# Print the coefficients and intercept for linear regression
#print("Coefficients: %s" % str(lrModel.coefficients))
#print("Intercept: %s" % str(lrModel.intercept))
# Summarize the model over the training set and print out some metrics
#trainingSummary = lrModel.summary
#print("numIterations: %d" % trainingSummary.totalIterations)
#print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
#trainingSummary.residuals.show()
#print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
#print("r2: %f" % trainingSummary.r2)


from pyspark.ml.tuning import TrainValidationSplit
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="r2")

tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=grid,
                           evaluator =evaluator,
                           # 80% of the data will be used for training, 20% for validation
                           trainRatio=0.8)

# Run TrainValidationSplit, and choose the best set of parameters.
tvsModel = tvs.fit(training)
rsquared = evaluator.evaluate(tvsModel.transform(training))
print("--- R Squared is %s ---" % (rsquared))


--- R Squared is 0.022861466913958184 ---


### Task D: Implement a parameter grid

In [141]:
# Grid search on LR model

#***
training = spark.read.format("libsvm")\
    .load("C:/Spark/data/mllib/sample_linear_regression_data.txt")
    
lr = LinearRegression()\
     .setFeaturesCol("features")\
     .setLabelCol('label')

lrparamGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.3, 0.5, 1.0, 2.0]) \
    .addGrid(lr.maxIter, [5, 10, 20, 50]) \
    .addGrid (lr.elasticNetParam,[0.0, 0.1, 0.5, 0.8, 1.0])\
    .build()

pipe = Pipeline(stages=[lr])

from pyspark.ml.tuning import TrainValidationSplit
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="r2", labelCol="label", predictionCol="prediction")

for r in [0.6, 0.7, 0.8, 0.9]:
    tvs = TrainValidationSplit(estimator=lr,
    estimatorParamMaps=grid,
    evaluator =evaluator,
    trainRatio=r)

# Run TrainValidationSplit, and choose the best set of parameters.
import time
start_time = time.time()
tvsModel = tvs.fit(training)
print("--- Training-validation completed in %s seconds ---" % (time.time() - start_time))

print(tvsModel.bestModel)
paramMap = list(zip(tvsModel.getEstimatorParamMaps()))
print(paramMap)


--- Training-validation completed in 0.5165371894836426 seconds ---
LinearRegression_4be9b23af70bd255c483
[({Param(parent='LinearRegression_4bedabe28a6071094c04', name='regParam', doc='regularization parameter (>= 0).'): 0.3, Param(parent='LinearRegression_4bedabe28a6071094c04', name='maxIter', doc='max number of iterations (>= 0).'): 50, Param(parent='LinearRegression_4bedabe28a6071094c04', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.8},)]


In [149]:
print(tvsModel.bestModel)
paramMap = list(zip(tvsModel.getEstimatorParamMaps(),))
print(paramMap)

LinearRegression_4be9b23af70bd255c483
[({Param(parent='LinearRegression_4bedabe28a6071094c04', name='regParam', doc='regularization parameter (>= 0).'): 0.3, Param(parent='LinearRegression_4bedabe28a6071094c04', name='maxIter', doc='max number of iterations (>= 0).'): 50, Param(parent='LinearRegression_4bedabe28a6071094c04', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.8},)]
