# Data Collection: Scraping transcripts of parliamentary debates in Denmark


This Jupyter Notebook consists of five parts that make up my data collection:

1. Part 1 loads all the required packages.
2. Part 2 creates a function that collects URLs to transcripts of Danish parliamentary debates.
3. Part 3 creates a function that collects the title, date and content of a parliamentary debate from a URL.
4. Part 4 scrapes all the transcripts using both the URL function and the scraper function.
5. Part 5 saves both the URLs and the transcripts using pickle.

## Part 1: Loading packages

In this section, I load all required packages.

In [429]:
# Importing packages
import requests
import json
from bs4 import BeautifulSoup
import numpy as np
from datascience import *
import re
import datetime
import time
import math
import pickle

## Part 2: URL-collector

In this section, I create and test a function that collects URLs to transcripts of parliamentary debates in Denmark.

In [None]:
def ft_url_collect(year = None, month = None, day = None):
    """ Function that collects URLs to transcripts of Danish parliamentary debates.
        The function takes three inputs:
        1) The year of the start date you want (e.g. "2012"). The default is "2000".
        2) The two-digit month of the start date you want (e.g. "07"). The default is "01" (January).
        3) The two-digit day of the start date you want (e.g. "31"). The default is "01".
        
        As of 14 March 2018, the transcripts go back to 5 October 2004.
        """
    
    ### Part 1 of the function: Creating the urls ###
    base_url = "http://www.ft.dk/da/dokumenter/dokumentlister/referater?pageSize=200&startDate="
    
    if year is None:
        year = "2000"
    if month is None:
        month = "01"
    if day is None:
        day = "01"
    
    startdate = str(year)+str(month)+str(day) #creating start date
    url = base_url+startdate #creating url with links to debate transcripts
    
    response = requests.get(url) # GET-request
    soup = BeautifulSoup(response.content, 'html.parser')
    hits_str = soup.find("span", attrs={'class':'results'}).text #collecting the number of hits in a string
    hits = [int(s) for s in hits_str.split() if s.isdigit()][0] #convert number of hits to int
    number_of_pages = math.ceil(hits/200) #number of result pages (every page has 200 links)
    pages_numbers = np.arange(number_of_pages+1)[1:] #np.array with the number of each page (1, 2, ...)
    
    
    page_url_template = soup.find("li", attrs={'class':'next'}).find("a")["href"] #find the url template for the different pages
    page_urls = np.array("page_url") # placeholder np.array (the first item is dropped below)
    
    for page_number in pages_numbers: #creating urls for each page with links
        page_urls = np.append(page_urls, "http://www.ft.dk/da/dokumenter/dokumentlister/referater" + page_url_template.replace("pageNumber=2","pageNumber="+str(page_number)))
    page_urls = page_urls[1:] #deleting the irrelevant first item
    
    ### Part 2 of the function: Collecting the links to the debate transcripts ###
    np_links = np.array("link") # placeholder numpy array (the first item is dropped below)
    
    for page_url in page_urls:
        response = requests.get(page_url) # GET-request
        soup2 = BeautifulSoup(response.content, 'html.parser')
    
        # Loop that collects every link and keeps only the links whose path starts with "/forhandlinger"
        for link in soup2.find_all("a"):
            every_link = link.get("href")
            if every_link and every_link[1:14] == "forhandlinger": # skip <a> tags without an href
                np_links = np.append(np_links, "http://www.ft.dk"+every_link)

    links = np_links[1:] # drop the first irrelevant element
    links = np.unique(links) # drop duplicates
    
    return (links)

In [193]:
# Testing the collector
ft_url_collect(year = "1999", month = "07", day = "01")[0:5]

array(['http://www.ft.dk/forhandlinger/20041/20041_M10_helemoedet.pdf',
       'http://www.ft.dk/forhandlinger/20041/20041_M11_helemoedet.pdf',
       'http://www.ft.dk/forhandlinger/20041/20041_M12_helemoedet.pdf',
       'http://www.ft.dk/forhandlinger/20041/20041_M13_helemoedet.pdf',
       'http://www.ft.dk/forhandlinger/20041/20041_M14_helemoedet.pdf'],
      dtype='<U66')
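
The start-date string and the pagination arithmetic used in `ft_url_collect` can be checked without touching the network. The following is a minimal sketch: the helper names `build_start_date` and `page_numbers` are hypothetical and not part of the notebook, the page size of 200 comes from the `pageSize=200` query parameter in `base_url`, and the hit count of 1454 is only an illustrative value.

```python
import math

PAGE_SIZE = 200  # taken from the pageSize=200 query parameter in base_url


def build_start_date(year=2000, month=1, day=1):
    """Return the start date as a zero-padded YYYYMMDD string (hypothetical helper)."""
    return f"{int(year):04d}{int(month):02d}{int(day):02d}"


def page_numbers(hits, page_size=PAGE_SIZE):
    """Return the 1-based page numbers needed to list `hits` results (hypothetical helper)."""
    return list(range(1, math.ceil(hits / page_size) + 1))


start = build_start_date("2012", "7", 31)  # accepts strings or integers
pages = page_numbers(1454)                 # illustrative hit count
print(start, pages)
```

Zero-padding inside the helper means a call such as `build_start_date(2012, 7, 31)` cannot produce a malformed date like `2012731`, which plain string concatenation would.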

## Part 3: Transcript collector

In this section, I create and test a function that takes a URL as input and returns the title, date and text content of a parliamentary debate.

In [233]:
def scrape_title_date_text(debate_url):
    """This function takes as input a URL to the transcript of a parliamentary debate in HTML format 
       and returns a list with three elements: the title of the debate, the date of the debate, and a string 
       with the content of the debate. It returns None if the page has no transcript content."""
    
    response = requests.get(debate_url) # GET-request
    soup = BeautifulSoup(response.content, 'html.parser') #turn into a soup
    
    try: # Some htmls have no content due to special events e.g. election. This code ignores an error if this is the case
         # Finding element 1: The title of the debate
        title = soup.find("p", attrs={'class':'Titel'}).text
    
        # Finding element 2: The date and time of the debate
        date = soup.find("meta", attrs={'name':'DateOfSitting'}).get("content")
    
        # Finding element 3: The content of the debate (Everything that was said in the debate)
        all_text_parts = soup.find_all("p", attrs={'class':'Tekst'}) + soup.find_all("p", attrs={'class':'TekstIndryk'}) #getting a list with all text parts
        all_text = "" #creating an empty string
    
        for text_part in all_text_parts: #loop that takes all text parts and collects them in one string
            text = text_part.text
            all_text = all_text + text + " "
        all_text = all_text.replace("\n", "") #removing \n
    
        # Collecting all elements in one list
        result = [title, date, all_text] 
    
        return(result)
    
    except AttributeError: # soup.find returned None because the page has no transcript content
        return None
    

In [147]:
# Testing the transcript collector function
test_url = "http://www.ft.dk/forhandlinger/20171/20171M018_2017-11-14_1300.htm"
scrape_title_date_text(test_url)[0:2]

['18. møde', '2017-11-14T13:00:00']
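
The date string returned by the scraper and the file name in the transcript URL both encode when the sitting took place. The sketch below shows one way to turn them into `datetime` objects. The URL pattern is inferred only from the HTML URLs shown in this notebook (an assumption, not an official specification), and `parse_transcript_url` is a hypothetical helper.

```python
import re
from datetime import datetime

# Pattern inferred from URLs such as .../20171/20171M018_2017-11-14_1300.htm
# (an assumption based on the examples in this notebook)
URL_PATTERN = re.compile(r"/(\d{5})M(\d{3})_(\d{4}-\d{2}-\d{2})_(\d{4})\.htm$")


def parse_transcript_url(url):
    """Return (session, meeting_number, start_datetime), or None if the URL does not match."""
    match = URL_PATTERN.search(url)
    if match is None:
        return None
    session, meeting, date_part, time_part = match.groups()
    start = datetime.strptime(date_part + " " + time_part, "%Y-%m-%d %H%M")
    return session, int(meeting), start


# The DateOfSitting value returned by the scraper, parsed into a datetime
sitting = datetime.strptime("2017-11-14T13:00:00", "%Y-%m-%dT%H:%M:%S")

parsed = parse_transcript_url(
    "http://www.ft.dk/forhandlinger/20171/20171M018_2017-11-14_1300.htm"
)
print(parsed, sitting)
```

Comparing the date parsed from the URL with the `DateOfSitting` value is a cheap consistency check on the scraped records.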

## Part 4: Scraping transcripts for debates since 1-1-2000

In this section, I use the previously created functions. First, I collect URLs to all debates since 1 January 2000. Next, I exclude all URLs that do not point to transcripts in HTML format. Finally, I use the transcript-collector function to get the title, date and text content from each URL.

In [194]:
# Collecting urls to all available transcripts 
all_urls = ft_url_collect(year = "2000", month = "01", day = "01")

In [195]:
# Checking the number of urls collected
len(all_urls)

1454

In [215]:
# Separating transcripts in html and pdf
html_urls = [url for url in all_urls if "pdf" not in url]
pdf_urls = [url for url in all_urls if "pdf" in url]

In [237]:
# Number of HTML urls
len(html_urls)

1154

In [434]:
# Checking the result of html-urls
html_urls[0:3]

['http://www.ft.dk/forhandlinger/20071/20071M001_2007-10-02_1200.htm',
 'http://www.ft.dk/forhandlinger/20071/20071M002_2007-10-03_1300.htm',
 'http://www.ft.dk/forhandlinger/20071/20071M003_2007-10-04_1000.htm']

In [435]:
# Checking the result of pdf-urls
pdf_urls[0:3]

['http://www.ft.dk/forhandlinger/20041/20041_M10_helemoedet.pdf',
 'http://www.ft.dk/forhandlinger/20041/20041_M11_helemoedet.pdf',
 'http://www.ft.dk/forhandlinger/20041/20041_M12_helemoedet.pdf']
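
Checking for the substring "pdf" works for the URLs collected here, but it would also match a URL that happens to contain "pdf" elsewhere in its path. A slightly stricter variant, sketched below, compares the file extension instead. The helper `url_extension` is hypothetical, and the two sample URLs are copied from the output above.

```python
import os
from urllib.parse import urlparse


def url_extension(url):
    """Return the lower-cased file extension of the URL path, e.g. '.htm' or '.pdf'."""
    return os.path.splitext(urlparse(url).path)[1].lower()


sample_urls = [
    "http://www.ft.dk/forhandlinger/20071/20071M001_2007-10-02_1200.htm",
    "http://www.ft.dk/forhandlinger/20041/20041_M10_helemoedet.pdf",
]

html_sample = [u for u in sample_urls if url_extension(u) in (".htm", ".html")]
pdf_sample = [u for u in sample_urls if url_extension(u) == ".pdf"]
print(html_sample, pdf_sample)
```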

In the following code, I loop through all HTML URLs to collect transcripts of parliamentary debates in Denmark. I receive an error after 18 iterations because the server stops accepting my requests after a number of calls. I have randomized the time interval between calls in order to increase the number of calls I can make before the server cuts my access.

In [291]:
# Creating a loop that scrapes debate data from each url
debates_list = [] #creating an empty list
iteration = 1

for url in html_urls:
    debate_data = scrape_title_date_text(url) #scraping data from url using scraper-function
    debates_list.append(debate_data) #appending scraped data to list
    time.sleep(np.random.choice(20))  # waits a random whole number of seconds between 0 and 19 before next iteration
    print("Iteration number " + str(iteration) + " is done") #print which iteration is completed
    iteration = iteration + 1 
    

Iteration number 1 is done
Iteration number 2 is done
Iteration number 3 is done
Iteration number 4 is done
Iteration number 5 is done
Iteration number 6 is done
Iteration number 7 is done
Iteration number 8 is done
Iteration number 9 is done
Iteration number 10 is done
Iteration number 11 is done
Iteration number 12 is done
Iteration number 13 is done
Iteration number 14 is done
Iteration number 15 is done
Iteration number 16 is done
Iteration number 17 is done
Iteration number 18 is done


ChunkedEncodingError: ("Connection broken: ConnectionResetError(10054, 'En eksisterende forbindelse blev tvangsafbrudt af en ekstern vært', None, 10054, None)", ConnectionResetError(10054, 'En eksisterende forbindelse blev tvangsafbrudt af en ekstern vært', None, 10054, None))

Because the Danish Parliament's server limits the number of requests it accepts, I added the following code to resume the scraping process whenever the server drops my connection (the Danish error message above translates to "An existing connection was forcibly closed by the remote host"). I rerun the cell below until all debate transcripts have been scraped, which takes several runs.

In [373]:
# Continuing the scraping process

for url in html_urls[iteration-1:]:
    debate_data = scrape_title_date_text(url) #scraping data from url using scraper-function
    debates_list.append(debate_data) #appending scraped data to list
    time.sleep(np.random.choice(20))  #waits a random whole number of seconds between 0 and 19 before next iteration
    print("Iteration number " + str(iteration) + " is done") #print which iteration that is completed
    iteration = iteration + 1

Iteration number 1104 is done
Iteration number 1105 is done
Iteration number 1106 is done
Iteration number 1107 is done
Iteration number 1108 is done
Iteration number 1109 is done
Iteration number 1110 is done
Iteration number 1111 is done
Iteration number 1112 is done
Iteration number 1113 is done
Iteration number 1114 is done
Iteration number 1115 is done
Iteration number 1116 is done
Iteration number 1117 is done
Iteration number 1118 is done
Iteration number 1119 is done
Iteration number 1120 is done
Iteration number 1121 is done
Iteration number 1122 is done
Iteration number 1123 is done
Iteration number 1124 is done
Iteration number 1125 is done
Iteration number 1126 is done
Iteration number 1127 is done
Iteration number 1128 is done
Iteration number 1129 is done
Iteration number 1130 is done
Iteration number 1131 is done
Iteration number 1132 is done
Iteration number 1133 is done
Iteration number 1134 is done
Iteration number 1135 is done
Iteration number 1136 is done
Iteration 
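
Rerunning the cell by hand can in principle be replaced by a loop that retries a failed request and resumes from the number of results already collected. The sketch below is one possible approach, not the method used in this notebook. `scrape_with_resume` and its parameters are hypothetical, it assumes one result is stored per URL in order, and it catches `OSError` because the exceptions raised by `requests` derive from it. The demonstration runs offline with a stand-in scraper.

```python
import random
import time


def scrape_with_resume(urls, scrape, results=None, max_retries=3,
                       max_wait=20.0, sleep=time.sleep):
    """Scrape `urls` in order, resuming after the entries already in `results`.

    `scrape` is any callable that takes a URL (for example scrape_title_date_text).
    A failed call is retried up to `max_retries` times. If it still fails, the
    partial results are returned so that a later call can resume from there.
    """
    results = [] if results is None else results
    for url in urls[len(results):]:  # skip URLs that already have a result
        for attempt in range(1, max_retries + 1):
            try:
                results.append(scrape(url))
                break
            except OSError:
                # wait longer after each failed attempt before retrying
                sleep(random.uniform(0, max_wait) * attempt)
        else:
            return results  # every retry failed: stop here and resume later
        sleep(random.uniform(0, max_wait))  # random pause between requests
    return results


# Offline demonstration with a stand-in scraper that fails twice before succeeding
attempts = {"count": 0}


def flaky_scrape(url):
    attempts["count"] += 1
    if attempts["count"] in (2, 3):
        raise ConnectionResetError("connection reset by remote host")
    return [url, "date", "text"]


demo_urls = ["url_1", "url_2", "url_3"]
demo_results = scrape_with_resume(demo_urls, flaky_scrape, sleep=lambda seconds: None)
print(len(demo_results))
```

With the real scraper, the call would look like `debates_list = scrape_with_resume(html_urls, scrape_title_date_text, results=debates_list)`, repeated until `len(debates_list)` equals `len(html_urls)`.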

## Part 5: Saving the transcripts and urls using pickle

In this section, I save the data using pickle. I use pickle instead of CSV because I had used up almost all the RAM on my computer at this point, and pickle lets me write the Python objects to disk as they are, without first converting them to a tabular format.

In [283]:
# I delete one transcript (index 10). This transcript was empty due to a parliamentary election.
all_debates_clean = debates_list[:10] + debates_list[11:] 

In [433]:
# Save object with all debate urls
with open("debate_urls.p", "wb") as f: # the with-block closes the file after writing
    pickle.dump(all_urls, f)

In [436]:
# Save object with all debates
with open("debates.p", "wb") as f: # the with-block closes the file after writing
    pickle.dump(all_debates_clean, f)
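
A quick way to confirm that a pickle file was written correctly is to load it back and compare it with the original object. The sketch below does this with a small placeholder record in a temporary directory. The file name `debates_sample.p` and the record contents are illustrative only; the title and date are copied from the test output in Part 3.

```python
import os
import pickle
import tempfile

# Placeholder record in the same [title, date, text] layout as the scraper output
debates_sample = [["18. møde", "2017-11-14T13:00:00", "placeholder text"]]

path = os.path.join(tempfile.mkdtemp(), "debates_sample.p")

with open(path, "wb") as f:  # write the object to disk
    pickle.dump(debates_sample, f)

with open(path, "rb") as f:  # read it back
    restored = pickle.load(f)

print(restored == debates_sample)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.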

Go to the next Jupyter Notebook *(02.text_analysis.ipynb)* to see the code used for the text analysis. 
The code is also available here: https://github.com/basgpol/ps239t-final-project/blob/master/code/02.text_analysis.ipynb