# Annotating the data
For the annotations of the sample I use the quantative content analysis (Lamnek 2005). Here three categories will be formed:
1. non-answer: The category encompasses every response where no reaction to the question occurs. Example: ""
2. evasive answer: This category is defined as reacting to the question in not or just partly answering the question. Example: "Sehr geehrter Herr W., haben Sie vielen Dank für Ihre Anfrage. Ich beteilige mich nicht länger am Portal abgeordnetenwatch.de. Um Ihre Frage dennoch zu beantworten, bitte ich um Mitteilung Ihrer E-Mail-Adresse an antje.tillmann@bundestag.de. Mit freundlichen Grüßen Antje Tillmann MdB"
3. answer: Every response which contains the answer to the questions in annotated in this category. Expample: "Sehr geehrter Herr Schellerich,die gesamte Fraktion DIE LINKE im Deutschen Bundestag wird dem ESM-Vertrag nicht zustimmen. Ich habe dies in meiner Rede vom 29.März im Bundestag auch versucht zu begründen. Mit freundlichen Grüßen Dr. Gysi"

The drawn sample will be mannualy annotated. Next the sample will be used to categorise the rest of the answers automatically.

In [112]:
# load libraries for data manipulation
import pandas as pd
import re
import regex
import numpy as np

# ML: Train/test splits, cross validation,
# gridsearch
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    GridSearchCV,
)

# load libraries for tokenization
import nltk
from nltk.tokenize import TreebankWordTokenizer, WhitespaceTokenizer
from nltk.corpus import stopwords
#nltk.download("stopwords")
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# load libraries for text cleaning
import spacy
import ufal.udpipe
from gensim.models import KeyedVectors, Phrases
from gensim.models.phrases import Phraser
from ufal.udpipe import Model, Pipeline
import conllu

# Supervised text classification
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
import joblib
#import eli5



## Preprocessing

In [113]:
# load data
sample_df = pd.read_csv("./data/stratified_sample.csv")

sample_df_2 = pd.read_csv("./data/stratified_sample_2.csv", sep=";")

sample_df = pd.concat([sample_df, sample_df_2])

In [114]:
# remove NaN for tokenizer to work
sample_df = sample_df.dropna(subset=["answer"])

In [115]:
sample_df = sample_df.drop_duplicates(["answer", "question_text", "party", "first_name", "last_name", "question_teaser"])

The next step comprises the preprocessing of the data. All answers will be converted to lowercase, punctuation and other noise will be removed. Lowercasing each word has the advantage that there no two different writing styles of a word. I.e. "die" and "Die" are now recognized as the same word.

In [116]:
# lower the answers to make the analysis easier
columns_to_process = ["answer", "question_text", "question_teaser"]
sample_df[columns_to_process] = sample_df[columns_to_process].apply(lambda col: col.str.lower())

# remove links and punctuation
sample_df[columns_to_process] = sample_df[columns_to_process].apply(lambda col: col.str.replace(r"\bhttps?://\S*|&\w+;|[\.,]", " ", regex=True))
sample_df[columns_to_process] = sample_df[columns_to_process].apply(lambda col: col.str.replace(r"\s+", " ", regex=True))

# combine the columns "answer", "question_text" and "question_teaser" to one column
sample_df["combined"] = sample_df[["question_text", "question_teaser"]].apply(lambda row: " ".join(row.values.astype(str)), axis=1)

## Analyzing

In the next code chunk the sample data will be split into a training and test set. On the data of the training set the model will train and with the testing set the trained model will be tested. This step is necessary to avoid overfitting and ensure the quality of the results. This classifier functions as a baseline.

In [117]:
# split data into training and testing set with a testing set size of 20% of the data.
X_train, X_test, y_train, y_test = train_test_split(
    sample_df["answer"],
    sample_df["answer_encoded"],
    test_size=0.2,
    random_state=42
)

# vectorizer with stopwords
vectorizer = CountVectorizer(
    stop_words=stopwords.words("german")
)

text_train = vectorizer.fit_transform(X_train)
text_test = vectorizer.transform(X_test)

# create mulitnominal naive bayes classifier
nb = MultinomialNB()
nb.fit(text_train, y_train)

y_pred = nb.predict(text_test)

rep = metrics.classification_report(y_test, y_pred)
print(rep)

                precision    recall  f1-score   support

        answer       0.71      0.95      0.81       242
evasive answer       0.75      0.30      0.43       131

      accuracy                           0.72       373
     macro avg       0.73      0.62      0.62       373
  weighted avg       0.73      0.72      0.68       373



### Creating a pipeline
In the next step a pipeline is created to efficiently test and tune different vectorizers and classifiers.

In [125]:
# split data into training and testing set with a testing set size of 20% of the data.
X_train, X_test, y_train, y_test = train_test_split(
    sample_df["answer"],
    sample_df["answer_encoded"],
    test_size=0.2,
    random_state=42
)

# create pipeline with vectorizer and classifier
pipeline = Pipeline(
    steps=[
        ("vectorizer", CountVectorizer()),
        ("classifier", MultinomialNB())
    ]
)

# create grid with different preprocessing steps
grid = {
    "vectorizer__stop_words" : [None, stopwords.words("german")],
    "vectorizer__ngram_range" : [(1,1), (1,2), (1,3)],
    "vectorizer__max_df" : [0.5, 1.0],
    "vectorizer__min_df" : [1, 5],
    #"classifier__C" : [0.01, 1, 100]
}

search = GridSearchCV(
    estimator=pipeline, n_jobs=-1, param_grid=grid, scoring="accuracy", cv=10
)

search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")

rep = metrics.classification_report(y_test, y_pred)
print(rep)

Best parameters: {'vectorizer__max_df': 0.5, 'vectorizer__min_df': 10, 'vectorizer__ngram_range': (1, 1), 'vectorizer__stop_words': None}
                precision    recall  f1-score   support

        answer       0.71      0.95      0.81       242
evasive answer       0.75      0.30      0.43       131

      accuracy                           0.72       373
     macro avg       0.73      0.62      0.62       373
  weighted avg       0.73      0.72      0.68       373

