# Text classification of speeches from the Danish Parliament

In this notebook, we show how to perform text classification. 
As a concrete example, we train a text classifier to predict which party a speech from the Danish Parliament belongs to. 

Steps in this tutorial:

1. Download the data
2. Extract the texts from the XML files
3. Preprocess the data (cleaning)
4. Prepare the data for training and evaluation
5. Train a model with fastText
6. Test and evaluate the performance

You can restart the tutorial at each step if you have previously saved the models/data. 
To reinitialize the notebook with required libraries and variables, run the following cell. 


In [None]:
# Run this cell for restarting from any step

import os
import glob
from zipfile import ZipFile
import xmltodict
import timestring
import json
import fasttext

# path to the (default) data folder 
# where we save/load data and models for this notebook 
DATA_DIR = "FT-data-DSpace"

# paths to the training and test files
train_path = os.path.join(DATA_DIR, "train.txt")
test_path = os.path.join(DATA_DIR, "test.txt")

## Step 1. Download the Data

The data we use comes from `The Danish Parliament Corpus 2009 - 2017, v1` (Hansen, Dorte Haltrup, 2018, CLARIN-DK-UCPH Centre Repository).

The corpus contains transcripts of parliamentary speeches. It consists of 10 XML files (one for each year). XML tags include meetings, item title and number, speeches, name and party of speakers, date, time, etc.


1. Download the data at http://hdl.handle.net/20.500.12115/8
2. Unzip the folder : `unzip FT-data-DSpace.zip`
3. Enter the folder and unzip the files : `cd FT-data-DSpace && unzip '*.zip'`
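
If you prefer to stay in Python, the outer archive can also be unpacked with the standard library. This is a minimal sketch, assuming the download was saved as `FT-data-DSpace.zip` in the working directory; the `extract_archive` helper is our own name, not part of any library. The yearly archives inside the folder can stay zipped if you wish, since the extraction code in Step 2 reads the XML directly from them.

```python
import os
from zipfile import ZipFile

def extract_archive(archive_path, target_dir="."):
    """Unpack a zip archive into target_dir and return the names of its members."""
    with ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()

# Assumed file name of the downloaded archive; adjust it to your download.
ARCHIVE = "FT-data-DSpace.zip"
if os.path.exists(ARCHIVE):
    members = extract_archive(ARCHIVE)
    print(len(members), "entries extracted")
```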


## Step 2. Extract the texts

For building and testing a model, we need labelled (classified) texts. 
Texts are transcripts of speeches; labels/classes are the parties they belong to. 
We start by extracting these data from the XML files. 

For each parliamentary year, we want to extract all the speeches that are attached to a party (some speeches have no party attached; we do not keep them in our data) and store them as a list of (year, party, speech) records. 
In the XML file, speeches are stored under the `<EdixiData><Møder><Møde><Dagsordenpunkt><Tale>` tags. The start time (from which we take the year), the name of the party and the transcript of the speech are respectively stored in the `<Starttid>`, `<Parti>` and `<Tekst>` tags. 
The structure of an XML file is as follows: 

~~~ xml
    <EdixiData>
        <Møder>
            <Samling>...</Samling>
            <Møde>
                <MeetingID>...</MeetingID>
                <Location>...</Location>
                <DateOfSitting>...</DateOfSitting>
                <Mødenummer>...</Mødenummer>
                <Dagsordenpunkt>
                    <Punktnummer>...</Punktnummer>
                    <Mødetitle>...</Mødetitle>
                    <Sagstype>...</Sagstype>
                    <Tale>
                        <Starttid>...</Starttid>
                        <Sluttid>...</Sluttid>
                        <Navn>...</Navn>
                        <Rolle>...</Rolle>
                        <Tekst>...</Tekst>
                    </Tale>
                    <Tale>
                    ...
                    </Tale>
                </Dagsordenpunkt>
            </Møde>
            ...
        </Møder>
    </EdixiData>
~~~
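
To make the tag layout concrete, here is a small sketch that parses a hand-written fragment with the standard library's `xml.etree.ElementTree`. It is not the `xmltodict` approach used in this notebook, the fragment is toy data rather than corpus content, and the timestamp format (a value starting with a four-digit year) is an assumption.

```python
import xml.etree.ElementTree as ET

# Hand-written toy fragment following the layout above (not real corpus data).
SAMPLE = """
<EdixiData>
  <Møder>
    <Møde>
      <Dagsordenpunkt>
        <Tale>
          <Starttid>2015-01-13T13:00:00</Starttid>
          <Parti>A</Parti>
          <Tekst>Første tale.</Tekst>
        </Tale>
        <Tale>
          <Starttid>2015-01-13T13:05:00</Starttid>
          <Tekst>Tale uden parti.</Tekst>
        </Tale>
      </Dagsordenpunkt>
    </Møde>
  </Møder>
</EdixiData>
"""

def iter_speeches(xml_text):
    """Yield a dict for every <Tale> that has a party, a start time and a text."""
    root = ET.fromstring(xml_text)
    for tale in root.iterfind("./Møder/Møde/Dagsordenpunkt/Tale"):
        party = tale.findtext("Parti")
        start = tale.findtext("Starttid")
        text = tale.findtext("Tekst")
        if not (party and start and text):
            continue  # speeches without a party are dropped
        # assumption: the start time begins with a four-digit year
        yield {"year": start[:4], "party": party, "text": text}

print(list(iter_speeches(SAMPLE)))
```

Only the first speech is kept, because the second one has no `<Parti>` tag.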


We first define a function for extracting this data from one xml file. 
 

In [None]:
def speeches_from_xml(xml_file):
    """Extract the speeches of one XML file as a list of dicts (year, party, text)."""

    speeches = []

    # converting xml structure to dict
    xml_data = xmltodict.parse(xml_file.read())
    xml_data = xml_data['EdixiData']['Møder']

    # Depending on its version, xmltodict builds OrderedDict or plain dict objects;
    # testing against dict accepts both (OrderedDict is a subclass of dict).
    for moder in xml_data:
        if not isinstance(moder, dict):
            continue
        if 'Møde' not in moder:
            continue
        for meeting in moder['Møde']:
            if not isinstance(meeting, dict):
                continue
            for dagsordenpunkt in meeting['Dagsordenpunkt']:
                if not isinstance(dagsordenpunkt, dict):
                    continue
                if 'Tale' not in dagsordenpunkt:
                    continue
                for tale in dagsordenpunkt['Tale']:
                    if not isinstance(tale, dict):
                        continue
                    # skip speeches without a party, a start time or a text
                    if not isinstance(tale.get('Parti'), str):
                        continue
                    if not isinstance(tale.get('Starttid'), str):
                        continue
                    if not isinstance(tale.get('Tekst'), str):
                        continue
                    # we only save the year, not the exact date of the speech
                    year = str(timestring.Date(tale['Starttid']).year)
                    party = tale['Parti']
                    text = tale['Tekst']
                    if len(text)<1:
                        continue
                    speeches.append({'year': year, 'party': party, 'text': text})

    return speeches

We run it on all the XML files, extracting the speeches from 2009 to 2017.

In [None]:
speeches = []
for xml_path in glob.glob(os.path.join(DATA_DIR, "EdixiXMLExport_*.zip")):
    filename = os.path.splitext(os.path.basename(xml_path))[0]
    print("Extract texts from ", filename)
    with ZipFile(xml_path) as xml_zip:
        with xml_zip.open(filename+'.xml') as xml_file:
            speeches += speeches_from_xml(xml_file)

In [None]:
print("Example\n-------")
print(speeches[0])
print()

# listing the years
# (we count with sum() rather than relying on the IPython-only `_` variable)
years = sorted(set(s['year'] for s in speeches))
for year in years:
    print(sum(1 for s in speeches if s['year']==year), " speeches in ", year)
print()

# listing the parties 
parties = sorted(set(s['party'] for s in speeches))
for party in parties:
    print(sum(1 for s in speeches if s['party']==party), " speeches from ", party)


We save the data so we can restart the notebook from the preprocessing step.

In [None]:
with open(os.path.join(DATA_DIR, "speeches.json"), 'w') as f:
    f.write(json.dumps(speeches, indent=4))


## Step 3. Preprocess (clean) the data

For preprocessing the texts, we use the Danish spaCy model.
Using this model, we can tokenize the texts and tag each token with its part of speech. 

First, we load the spaCy model.

In [None]:
from danlp.models import load_spacy_model
nlp = load_spacy_model()


We load the data.

In [None]:
speeches = []
with open(os.path.join(DATA_DIR, "speeches.json")) as f:
    speeches = json.loads(f.read())

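We then preprocess (clean) the texts by: 
- removing punctuation and symbols
- removing stop words and numbers
- lowercasing the tokens

We also store a lemmatized version of each cleaned text (computed with Lemmy) under the `lemmas` key; it is not used in the rest of this notebook.

(This process might take several minutes.)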

The purpose of this step is to reduce the vocabulary in order to speed up the training process. It is possible to skip some of the cleaning steps in order to improve the quality of the prediction (e.g., lowercasing might reduce the benefits of using word embeddings). 
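
To make the effect of these cleaning steps concrete, here is a rough standard-library sketch. It is not the spaCy pipeline used in this notebook: tokenization is a plain regular expression, and `TOY_STOP_WORDS` is a tiny hand-picked list chosen for illustration only.

```python
import re

# Tiny illustrative stop-word list (an assumption; the notebook relies on spaCy's list).
TOY_STOP_WORDS = {"og", "i", "at", "det", "er", "en", "den", "til", "på", "de", "jeg"}

def clean(text, stop_words=TOY_STOP_WORDS):
    """Lowercase the text, then drop punctuation, numbers and stop words."""
    # \w+ keeps runs of letters, digits and underscores, so punctuation and symbols fall away
    tokens = re.findall(r"\w+", text.lower())
    kept = [t for t in tokens if t not in stop_words and not t.isdigit()]
    return " ".join(kept)

print(clean("Jeg håber, at der er enighed om de 3 forslag!"))
# prints: håber der enighed om forslag
```

The example sentence shrinks from ten tokens to five, which is the kind of vocabulary reduction the cleaning aims for.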

In [None]:
from stop_words import get_stop_words
da_stopwords = get_stop_words('da')

import lemmy.pipe
lemmatizer = lemmy.load('da')

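for speech in speeches:
    text = speech['text']
    doc = nlp(text)
    pruned = []
    lemmas = []
    for tok in doc:
        # skip whitespace tokens (e.g. line breaks): they would break
        # the one-example-per-line file format used for training
        if tok.is_space:
            continue
        if tok.tag_ in ["PUNCT", "SYM"]:
            continue
        if tok.is_stop or tok.is_digit:
            continue
        pruned.append(tok.lower_)
        # the lemmatizer returns a list of candidate lemmas; we keep the first one
        lemmas.append(lemmatizer.lemmatize(tok.tag_, tok.lower_)[0])

    speech['preprocessed'] = " ".join(pruned)
    speech['lemmas'] = " ".join(lemmas)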

speeches[0]

In [None]:
# save the preprocessed data so we can restart the notebook from Step 4
with open(os.path.join(DATA_DIR, "speeches_pp.json"), 'w') as f:
    f.write(json.dumps(speeches, indent=4))

## Step 4. Prepare the data for training and testing

We re-load the preprocessed data.

In [None]:
speeches = []
with open(os.path.join(DATA_DIR, "speeches_pp.json")) as f:
    speeches = json.loads(f.read())


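We split the data into a training set and a test set.
We hold out the 2015 speeches for evaluation and build the model from the speeches of all the other years. Note that this includes the speeches recorded after 2015; to train on 2009 to 2014 only, filter on `int(sp['year']) < 2015` instead. 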

In [None]:

train_data = [(sp['party'],sp['preprocessed']) for sp in speeches if sp['year'] != '2015']
test_data = [(sp['party'],sp['preprocessed']) for sp in speeches if sp['year'] == '2015']

print("Training data : ", len(train_data), "speeches")
print("Test data : ", len(test_data), "speeches")
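
The year-based split can also be wrapped in a small helper that makes the choice of training years explicit. This is a sketch on made-up records; `split_by_year` and its `train_years` parameter are our own names.

```python
def split_by_year(speeches, test_year, train_years=None):
    """Split speeches into (train, test) lists of (party, text) pairs.

    If train_years is None, every year except test_year is used for training.
    """
    train, test = [], []
    for sp in speeches:
        pair = (sp["party"], sp["preprocessed"])
        if sp["year"] == test_year:
            test.append(pair)
        elif train_years is None or sp["year"] in train_years:
            train.append(pair)
    return train, test

# Toy records (made up for illustration)
toy = [
    {"year": "2014", "party": "A", "preprocessed": "tekst et"},
    {"year": "2015", "party": "B", "preprocessed": "tekst to"},
    {"year": "2016", "party": "A", "preprocessed": "tekst tre"},
]
train_all, test = split_by_year(toy, "2015")
past_years = {str(y) for y in range(2009, 2015)}
train_past, _ = split_by_year(toy, "2015", train_years=past_years)
print(len(train_all), len(train_past), len(test))
# prints: 2 1 1
```

Restricting `train_years` to the years before the test year avoids training on speeches that were recorded after the ones being predicted.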

We save the training and test data in the format expected by fastText, with one example per line and each label prefixed by `__label__`: 
```
__label__class1 text1
__label__class2 text2
...
__label__classN textN
```
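
A line in this format can be built with a small helper such as the one sketched here. The name `to_fasttext_line` is hypothetical, and replacing whitespace inside a label with underscores is our own convention, not something fastText prescribes.

```python
def to_fasttext_line(label, text):
    """Format one example as a fastText line: '__label__<label> <text>'."""
    # A label must not contain whitespace, and the text must stay on a single line.
    label = "_".join(label.split())
    text = " ".join(text.split())
    return "__label__" + label + " " + text

# Toy example: the line break inside the text is collapsed into a space
print(to_fasttext_line("S", "håber enighed\nsaglig høring"))
# prints: __label__S håber enighed saglig høring
```

Collapsing whitespace matters because a stray line break inside a text would otherwise start a new, unlabelled example in the file.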

In [None]:

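# " ".join(t.split()) collapses any whitespace so that each example stays on one line
with open(train_path, 'w') as f:
    for (p,t) in train_data:
        f.write("__label__"+p+" "+" ".join(t.split())+"\n")

with open(test_path, 'w') as f:
    for (p,t) in test_data:
        f.write("__label__"+p+" "+" ".join(t.split())+"\n")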


## Step 5. Learn a model with fastText

We load the Common Crawl word embeddings (`cc.da.wv`) from fastText using Gensim. 
If you prefer to use other embeddings from our library, you can have a look at our [page](https://github.com/alexandrainst/danlp/blob/master/docs/models/embeddings.md).


In [None]:
from danlp.models.embeddings import load_wv_with_gensim
wv = load_wv_with_gensim("cc.da.wv")


We save the embeddings in a format that is accepted by fastText (i.e. .vec).

In [None]:
wv.save_word2vec_format(os.path.join(DATA_DIR, "cc-wv.vec"), binary=False)
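
For reference, the textual `.vec` (word2vec) format starts with a header line giving the vocabulary size and the vector dimension, followed by one word and its vector per line. The sketch below writes and inspects a toy file with the standard library only; `write_vec` and `read_vec_header` are hypothetical helpers and the vectors are made up.

```python
import os
import tempfile

def write_vec(path, vectors):
    """Write a {word: [floats]} mapping in the textual word2vec (.vec) format."""
    dims = {len(v) for v in vectors.values()}
    if len(dims) != 1:
        raise ValueError("all vectors must have the same dimension")
    dim = dims.pop()
    with open(path, "w", encoding="utf-8") as f:
        # header line: vocabulary size and vector dimension
        f.write("{} {}\n".format(len(vectors), dim))
        for word, vec in vectors.items():
            f.write(word + " " + " ".join("{:.4f}".format(x) for x in vec) + "\n")

def read_vec_header(path):
    """Return (vocabulary size, dimension) from the first line of a .vec file."""
    with open(path, encoding="utf-8") as f:
        vocab_size, dim = f.readline().split()
    return int(vocab_size), int(dim)

toy_path = os.path.join(tempfile.mkdtemp(), "toy.vec")
write_vec(toy_path, {"hus": [0.1, 0.2, 0.3], "bil": [0.4, 0.5, 0.6]})
print(read_vec_header(toy_path))
# prints: (2, 3)
```

The dimension in the header is the value that fastText's `dim` argument has to match when `pretrainedVectors` is used.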


We train the model with fastText. We can tune the hyperparameters, e.g.:
- the number of epochs (recommended: from 1 to 50)
- the learning rate (recommended: from 0.1 to 1.0)
- the maximum length of word n-grams (recommended: from 1 to 5)
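
One simple way to explore these ranges is an exhaustive grid search. The sketch below is generic and runs without fastText: `grid_search`, `GRID` and `dummy_score` are hypothetical names, and `dummy_score` merely stands in for real training. In this notebook, the scoring function would wrap `fasttext.train_supervised(input=train_path, **params)` and return the precision reported by `model.test`, ideally on a separate validation file (an assumption, since the notebook does not create one) rather than on the test file.

```python
from itertools import product

def grid_search(train_and_score, grid):
    """Try every combination in grid and return (best_score, best_params)."""
    names = sorted(grid)
    best_score, best_params = None, None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(**params)
        if best_score is None or score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Candidate values picked from the recommended ranges
GRID = {"epoch": [5, 20, 50], "lr": [0.1, 0.5, 1.0], "wordNgrams": [1, 2, 3]}

def dummy_score(epoch, lr, wordNgrams):
    # Stand-in for real training, so that the sketch runs without fastText.
    return -abs(epoch - 20) - abs(lr - 0.5) - abs(wordNgrams - 2)

best_score, best_params = grid_search(dummy_score, GRID)
print(best_params)
```

With 3 values per hyperparameter this trains 27 models, so keep the grid small when each training run takes minutes.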

In [None]:
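import time 

start_time = time.time()
# dim must match the dimension of the pretrained vectors (300 for the fastText Common Crawl vectors)
model = fasttext.train_supervised(input=train_path, epoch=20, lr=0.2, dim=300, wordNgrams=2, pretrainedVectors=os.path.join(DATA_DIR, "cc-wv.vec"))
print("time :", time.time()-start_time)
# model.test returns (number of examples, precision@1, recall@1)
print("score : ", model.test(test_path))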

In [None]:
model.save_model(os.path.join(DATA_DIR, "model.bin"))

## Step 6. Test and Evaluate the performance

Load the model (and test on the test data).

In [None]:
model = fasttext.load_model(os.path.join(DATA_DIR, "model.bin"))
model.test(test_path)

To make a prediction with the model, you can use the following code (you can replace the text with any preprocessed text):

In [None]:

text = "håber enighed gennemføre saglig seriøs høring hele lovforslaget dets aspekter tror element frank aaen nævner vedkommende volde umiddelbart største problemer"
predicted_party = model.predict(text)
predicted_party = predicted_party[0][0].split("__")[-1]
print(predicted_party)
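
`model.predict` also accepts a `k` argument to return the `k` most probable labels together with their probabilities. The sketch below shows one way to turn such a result into (party, probability) pairs; `parse_prediction` is a hypothetical helper, and the labels and probabilities are made up so that the example runs without a trained model.

```python
def parse_prediction(labels, probabilities, prefix="__label__"):
    """Turn the (labels, probabilities) pair returned by predict into (party, probability) pairs."""
    parsed = []
    for label, prob in zip(labels, probabilities):
        # strip the fastText label prefix, if present
        party = label[len(prefix):] if label.startswith(prefix) else label
        parsed.append((party, float(prob)))
    return parsed

# Made-up values in the shape returned by model.predict(text, k=2)
labels, probabilities = ("__label__S", "__label__V"), [0.61, 0.27]
print(parse_prediction(labels, probabilities))
# prints: [('S', 0.61), ('V', 0.27)]
```

With the trained model, the call would look like `parse_prediction(*model.predict(text, k=2))`.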

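We calculate the accuracy per party (the share of each party's speeches that the model labels correctly, i.e. the per-class recall) to see how the model performs for each party, as well as the overall accuracy on the test set.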

In [None]:
import numpy as np
from sklearn.metrics import confusion_matrix

golds = []
preds = []
with open(test_path) as f:
    for line in f:
        line = line.strip()
        line = line.split(" ", maxsplit=1)
        if len(line) < 2:
            # no text left after preprocessing: nothing to predict on
            continue
        text = line[1]
        gold_party = line[0].split("__")[-1]
        pred_party = model.predict(text)[0][0].split("__")[-1]
        golds.append(gold_party)
        preds.append(pred_party)

golds = np.array(golds)
preds = np.array(preds)
labels = np.unique(golds)
cf = confusion_matrix(golds, preds, labels=labels)

print("Accuracy per party")
for i, label in enumerate(labels):
    tp = cf[i][i]
    tt = np.sum(golds == label)
    print(label, "\t{:.1%}\t({}/{})".format(tp/tt, tp, tt))

print("\nGlobal accuracy")
tp = np.sum(preds==golds)
tt = len(golds)
print("{:.1%}\t({}/{})".format(tp/tt, tp, tt))
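
The per-party accuracy above only tells how many of a party's speeches are recovered. A party that is predicted too often can score well on that measure while many of its predictions are wrong, so it can help to look at precision and F1 as well. Here is a minimal standard-library sketch on toy labels; `per_class_scores` is our own helper, and scikit-learn's `classification_report` offers a ready-made alternative.

```python
from collections import Counter

def per_class_scores(golds, preds):
    """Return {label: (precision, recall, f1)} computed from two parallel lists."""
    tp = Counter(g for g, p in zip(golds, preds) if g == p)
    gold_counts = Counter(golds)
    pred_counts = Counter(preds)
    scores = {}
    for label in sorted(set(golds) | set(preds)):
        precision = tp[label] / pred_counts[label] if pred_counts[label] else 0.0
        recall = tp[label] / gold_counts[label] if gold_counts[label] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores

# Toy labels (made up for illustration)
toy_golds = ["A", "A", "B", "B", "B"]
toy_preds = ["A", "B", "B", "B", "A"]
for label, (p, r, f1) in per_class_scores(toy_golds, toy_preds).items():
    print(label, "precision {:.2f} recall {:.2f} f1 {:.2f}".format(p, r, f1))
```

In the notebook, the function can be called directly on the `golds` and `preds` arrays built above.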