# Transcripts of parliamentary debates in Denmark


This Jupyter Notebook consists of four parts:
    1. Part 1 loads all the required packages
    2. Part 2 creates a function that collects URLs to transcripts of Danish parliamentary debates
    3. Part 3 creates a function that collects the title, date and content of a parliamentary debate from a URL
    4. Part 4 collects the data and saves it as a CSV file

## Part 1: Loading packages

In [98]:
# Importing packages
import requests
import json
from bs4 import BeautifulSoup
import numpy as np
from datascience import *
import re
import datetime
import time

## Part 2: URL-collector

In [129]:
def ft_url_collect(year = None, month = None, day = None):
    """ Function that collects URLs to transcripts of Danish parliamentary debates.
        The function takes three arguments:
        
        1) The year of the start date you want (e.g. "2012"). The default is "2000".
        2) The month of the start date, written as two digits (e.g. "07"). The default is "01" (January).
        3) The day of the start date, written as two digits (e.g. "31"). The default is "01".
        
        As of 14 March 2018 the transcripts go back to 5 October 2004.
        """
    base_url = "http://www.ft.dk/da/dokumenter/dokumentlister/referater?pageSize=200&startDate="
    
    if year is None:
        year = "2000"
    if month is None:
        month = "01"
    if day is None:
        day = "01"
    
    startdate = str(year)+str(month)+str(day) #creating start date
    url = base_url+startdate #creating url with links to debate transcripts
    
    response = requests.get(url) # GET-request
    soup = BeautifulSoup(response.content, 'html.parser')
    
    links = [] # creating an empty list for the transcript URLs

    # Creating a loop that goes through every link and keeps only the links whose path starts with "/forhandlinger"
    for link in soup.find_all("a"):
        every_link = link.get("href")
        # anchors without an href attribute return None, so skip them
        if every_link is not None and every_link[1:14] == "forhandlinger": 
            links.append("http://www.ft.dk"+every_link)

    links = np.unique(links) # drop duplicates (this also sorts the URLs)
    
    return links
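
The start date in the URL is the year, month and day joined as `YYYYMMDD`, so a one-digit month such as `"7"` would produce a malformed date. Below is a minimal sketch of one way to build the URL more defensively. The helper name `build_start_url` is hypothetical and not part of the notebook; it assumes the listing page expects exactly the `YYYYMMDD` format used above.

```python
import datetime

BASE_URL = "http://www.ft.dk/da/dokumenter/dokumentlister/referater?pageSize=200&startDate="


def build_start_url(year=2000, month=1, day=1):
    """Hypothetical helper: validate the start date and return the listing URL."""
    # datetime.date raises ValueError for impossible dates such as month 13
    start = datetime.date(int(year), int(month), int(day))
    # strftime zero-pads the month and day, so 7 becomes "07"
    return BASE_URL + start.strftime("%Y%m%d")


print(build_start_url("2017", "01", "01"))
print(build_start_url(2017, 7, 1))
```

Accepting both strings and integers keeps the helper compatible with the string arguments used in this notebook.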

In [131]:
# Testing the collector
ft_url_collect(year = "1999", month = "07", day = "01")[0:5]

array(['http://www.ft.dk/forhandlinger/20151/20151M096_2016-05-18_1300.htm',
       'http://www.ft.dk/forhandlinger/20151/20151M097_2016-05-19_1000.htm',
       'http://www.ft.dk/forhandlinger/20151/20151M098_2016-05-20_1000.htm',
       'http://www.ft.dk/forhandlinger/20151/20151M099_2016-05-23_1000.htm',
       'http://www.ft.dk/forhandlinger/20151/20151M100_2016-05-24_1300.htm'],
      dtype='<U66')

## Part 3: Transcript collector

In [77]:
def scrape_title_date_text(debate_url):
    """This function takes as input a URL with the transcript of a parliamentary debate in HTML format 
       and returns a list with three elements: the title of the debate, the date of the debate, and a string 
       with the content of the debate"""
    
    response = requests.get(debate_url) # GET-request
    soup = BeautifulSoup(response.content, 'html.parser') #turn into a soup
    
    # Finding element 1: The title of the debate
    title = soup.find("p", attrs={'class':'Titel'}).text
    
    # Finding element 2: The date and time of the debate
    date = soup.find("meta", attrs={'name':'DateOfSitting'}).get("content")
    
    # Finding element 3: The content of the debate (everything that was said in the debate)
    # Note: all "Tekst" paragraphs come first in the list, followed by all "TekstIndryk" paragraphs
    all_text_parts = soup.find_all("p", attrs={'class':'Tekst'}) + soup.find_all("p", attrs={'class':'TekstIndryk'})
    all_text = "" #creating an empty character string
    
    for text_part in all_text_parts: #creating a loop that takes all text parts and collects them in one string
        text = text_part.text
        all_text = all_text + text + " "
    all_text = all_text.replace("\n", "") #removing \n
    
    # Collecting all elements in one list
    result = [title, date, all_text] 
    
    return result
    

In [147]:
# Testing the transcript collector function
test_url = "http://www.ft.dk/forhandlinger/20171/20171M018_2017-11-14_1300.htm"
scrape_title_date_text(test_url)[0:2]

['18. møde', '2017-11-14T13:00:00']
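
The date comes back as a plain string. If it is needed for sorting or filtering, it can be converted into a `datetime` object. The sketch below assumes that every `DateOfSitting` value follows the same `YYYY-MM-DDTHH:MM:SS` layout as the example output above; the helper name `parse_sitting_date` is hypothetical.

```python
import datetime


def parse_sitting_date(date_string):
    """Hypothetical helper: convert a 'DateOfSitting' string into a datetime object."""
    # Assumes the layout seen in the test output, e.g. '2017-11-14T13:00:00'
    return datetime.datetime.strptime(date_string, "%Y-%m-%dT%H:%M:%S")


sitting = parse_sitting_date("2017-11-14T13:00:00")
print(sitting.date(), sitting.time())
```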

## Part 4: Scraping transcripts for debates since 1-1-2017

In [138]:
# Collecting URLs with debate transcripts in HTML format since 1-1-2017
ft_urls = ft_url_collect(year = "2017", month = "01", day = "01")

In [133]:
# Checking the result
ft_urls[1:10]

array(['http://www.ft.dk/forhandlinger/20161/20161M041_2017-01-11_1300.htm',
       'http://www.ft.dk/forhandlinger/20161/20161M042_2017-01-12_1000.htm',
       'http://www.ft.dk/forhandlinger/20161/20161M043_2017-01-13_1000.htm',
       'http://www.ft.dk/forhandlinger/20161/20161M044_2017-01-17_1300.htm',
       'http://www.ft.dk/forhandlinger/20161/20161M045_2017-01-18_1300.htm',
       'http://www.ft.dk/forhandlinger/20161/20161M046_2017-01-19_1000.htm',
       'http://www.ft.dk/forhandlinger/20161/20161M047_2017-01-20_1000.htm',
       'http://www.ft.dk/forhandlinger/20161/20161M048_2017-01-24_1400.htm',
       'http://www.ft.dk/forhandlinger/20161/20161M049_2017-01-25_1300.htm'],
      dtype='<U66')
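
The URLs themselves appear to carry metadata. Judging only from the sample links in this notebook, the file name seems to contain a session code, a meeting number (`20171M018` corresponds to the title "18. møde" in the test above), the date and the start time. The sketch below extracts these parts with a regular expression. The field names and the pattern are assumptions inferred from the samples, not a documented format, and the helper name `parse_transcript_url` is hypothetical.

```python
import re

# Pattern inferred from the sample URLs; the field names are assumptions
URL_PATTERN = re.compile(
    r"/forhandlinger/(?P<session>\d+)/(?P=session)M(?P<meeting>\d+)"
    r"_(?P<date>\d{4}-\d{2}-\d{2})_(?P<time>\d{4})\.htm$"
)


def parse_transcript_url(url):
    """Hypothetical helper: split a transcript URL into its apparent parts."""
    match = URL_PATTERN.search(url)
    if match is None:
        return None  # the URL does not follow the assumed pattern
    return {
        "session": match.group("session"),
        "meeting": int(match.group("meeting")),
        "date": match.group("date"),
        "time": match.group("time"),
    }


example = "http://www.ft.dk/forhandlinger/20161/20161M041_2017-01-11_1300.htm"
print(parse_transcript_url(example))
```

Returning `None` for non-matching URLs makes it easy to spot links that break the assumed naming scheme.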

In [134]:
# Creating an empty table
t = Table().empty(make_array("Title", "Date", "Text"))
t



Title,Date,Text


In the following code, I use the scraper function to collect transcripts of parliamentary debates in Denmark. I cap the loop at 50 iterations because the website of the Danish parliament cuts us off after 50 URL calls.

In [None]:
# Creating a loop that scrapes debate data from each URL
debates_list = []  # creating an empty list

for url in ft_urls[0:50]:
    debate_data = scrape_title_date_text(url)  # scraping data from the URL using the scraper function
    debates_list.append(debate_data)  # appending scraped data to the list
    time.sleep(2)  # waits 2 seconds before the next iteration
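
To get past the first 50 URLs, one option is to process the list in batches of 50 with a longer pause between batches. This is only a sketch: it assumes that the limit resets after some waiting time, which this notebook has not verified, and the 600-second pause is an arbitrary placeholder. The names `in_batches` and `scrape_in_batches` are hypothetical.

```python
import time


def in_batches(items, batch_size=50):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def scrape_in_batches(urls, scrape, batch_size=50, pause_between_calls=2,
                      pause_between_batches=600, sleep=time.sleep):
    """Hypothetical helper: call scrape(url) for every URL, batch by batch."""
    results = []
    for batch_number, batch in enumerate(in_batches(list(urls), batch_size)):
        if batch_number > 0:
            # Assumption: the server lifts the limit again after a longer wait
            sleep(pause_between_batches)
        for url in batch:
            results.append(scrape(url))
            sleep(pause_between_calls)  # short pause between single requests
    return results


# Dry run with a stand-in scraper and no real waiting
demo = scrape_in_batches(range(120), scrape=str, sleep=lambda seconds: None)
print(len(demo))
```

In the notebook, the call would look like `scrape_in_batches(ft_urls, scrape_title_date_text)`. The `sleep` parameter exists only so that the pauses can be switched off in a dry run.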

In [142]:
debate_table = t.with_rows(debates_list)
debate_table

Title,Date,Text
40. møde,2017-01-10T13:00:00,Mødet er åbnet. 1) 3. behandling af lovforslag nr. L 92 ...
41. møde,2017-01-11T13:00:00,Mødet er åbnet. 1) Besvarelse af oversendte spørgsmål ti ...
42. møde,2017-01-12T10:00:00,Mødet er åbnet. 1) Spørgsmål om fremme af forespørgsel n ...
43. møde,2017-01-13T10:00:00,Mødet er åbnet. 1) 1. behandling af lovforslag nr. L 100 ...
44. møde,2017-01-17T13:00:00,Mødet er åbnet. 1) Spørgetime med statsministeren. Jeg g ...
45. møde,2017-01-18T13:00:00,Mødet er åbnet. 1) Besvarelse af oversendte spørgsmål ti ...
46. møde,2017-01-19T10:00:00,Mødet er åbnet. 1) Spørgsmål om fremme af forespørgsel n ...
47. møde,2017-01-20T10:00:00,Mødet er åbnet. 1) Spørgsmål om fremme af forespørgsel n ...
48. møde,2017-01-24T14:00:00,Mødet er åbnet. 1) Spørgsmål om fremme af forespørgsel n ...
49. møde,2017-01-25T13:00:00,Mødet er åbnet. 1) Besvarelse af oversendte spørgsmål ti ...


In [143]:
# Exporting the data as a CSV file
debate_table.to_df().to_csv("debate_text_data.csv")