# Annotating the data
To annotate the sample, I use quantitative content analysis (Lamnek 2005). Three categories are formed:
1. non-answer: This category encompasses every response in which no reaction to the question occurs. Example: ""
2. evasive answer: This category covers responses that react to the question but do not answer it, or answer it only partly. Example: "Sehr geehrter Herr W., haben Sie vielen Dank für Ihre Anfrage. Ich beteilige mich nicht länger am Portal abgeordnetenwatch.de. Um Ihre Frage dennoch zu beantworten, bitte ich um Mitteilung Ihrer E-Mail-Adresse an antje.tillmann@bundestag.de. Mit freundlichen Grüßen Antje Tillmann MdB" (English: "Dear Mr W., thank you very much for your inquiry. I no longer participate in the portal abgeordnetenwatch.de. To answer your question nevertheless, I ask you to send your e-mail address to antje.tillmann@bundestag.de. Kind regards, Antje Tillmann MdB")
3. answer: Every response that contains an answer to the question is annotated in this category. Example: "Sehr geehrter Herr Schellerich,die gesamte Fraktion DIE LINKE im Deutschen Bundestag wird dem ESM-Vertrag nicht zustimmen. Ich habe dies in meiner Rede vom 29.März im Bundestag auch versucht zu begründen. Mit freundlichen Grüßen Dr. Gysi" (English: "Dear Mr Schellerich, the entire parliamentary group DIE LINKE in the German Bundestag will not approve the ESM treaty. I also tried to explain the reasons for this in my speech of 29 March in the Bundestag. Kind regards, Dr. Gysi")

The drawn sample will be annotated manually. The annotated sample will then be used to categorise the remaining answers automatically.
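
The idea of training on the manually annotated sample and then labelling the remaining answers can be sketched roughly as follows. This is only a minimal illustration, not the final model: the toy texts, the function name `train_answer_classifier`, and the choice of TF-IDF features with logistic regression are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the manually annotated sample.
annotated_texts = [
    "",
    "",
    "thank you for your inquiry please send me your e-mail address",
    "i no longer participate in this portal please contact my office",
    "our group will not approve the treaty as i explained in my speech",
    "i voted against the bill because it raises costs for families",
]
annotated_labels = [
    "non-answer",
    "non-answer",
    "evasive answer",
    "evasive answer",
    "answer",
    "answer",
]


def train_answer_classifier(texts, labels):
    """Fit a TF-IDF + logistic regression pipeline (hypothetical helper)."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    return model


classifier = train_answer_classifier(annotated_texts, annotated_labels)

# Answers that were not annotated by hand receive a predicted category.
unlabelled_texts = [
    "please send your e-mail address to my office",
    "i will not approve the treaty",
]
predicted = classifier.predict(unlabelled_texts)
```

With the real data, `annotated_texts` and `annotated_labels` would come from the annotated sample, and `unlabelled_texts` from the rest of the corpus.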

In [23]:
# load libraries for data manipulation
import pandas as pd
import re
import regex
import numpy as np

# load libraries for tokenization
import nltk
from nltk.tokenize import TreebankWordTokenizer, WhitespaceTokenizer
from nltk.corpus import stopwords
#nltk.download("stopwords")
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# load libraries for text cleaning
import spacy
import ufal.udpipe
from gensim.models import KeyedVectors, Phrases
from gensim.models.phrases import Phraser
from ufal.udpipe import Model, Pipeline
import conllu

# load libraries for supervised text classification
# (CountVectorizer and TfidfVectorizer are already imported above)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
# note: this Pipeline shadows the ufal.udpipe Pipeline imported above
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
import joblib
#import eli5



## Preprocessing

In [2]:
# load data
sample_df = pd.read_csv("./data/stratified_sample.csv")
# remove NaN for tokenizer to work
sample_df = sample_df.dropna(subset=["answer"])

The next step is the preprocessing of the data. All answers are converted to lowercase, and stopwords, punctuation and other noise are removed. This step is necessary because the most common words are function words such as "die" and "mit". These words carry little meaning on their own and are not useful for further analysis. Lowercasing ensures that a word no longer appears in two different spellings: "die" and "Die", for example, are now recognized as the same word.

In [None]:
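# convert the answers to lowercase so that "Die" and "die" count as the same word
sample_df["answer"] = sample_df["answer"].str.lower()

# replace URLs, HTML entities (e.g. &amp;), periods and commas with a space
sample_df["answer"] = sample_df["answer"].str.replace(r"\bhttps?://\S*|&\w+;|[\.,]", " ", regex=True)
# collapse runs of whitespace into a single space
sample_df["answer"] = sample_df["answer"].str.replace(r"\s+", " ", regex=True)
# inspect the cleaned answer in the row labelled 10
sample_df.at[10, "answer"]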
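
The stopword removal described above could look roughly like the sketch below. The helper name `remove_stopwords` and the tiny stopword set are assumptions for illustration only; a full German list is available through `nltk.corpus.stopwords.words("german")` once the `stopwords` corpus has been downloaded.

```python
import re

# Tiny illustrative stopword set (an assumption, not the full list used later).
GERMAN_STOPWORDS = {"die", "der", "das", "und", "mit", "ich", "in"}


def remove_stopwords(text, stopwords=GERMAN_STOPWORDS):
    """Lowercase the text, strip punctuation and drop stopwords (hypothetical helper)."""
    # replace every character that is neither a word character nor whitespace
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # split on whitespace and keep only tokens that are not stopwords
    tokens = [token for token in text.split() if token not in stopwords]
    return " ".join(tokens)


cleaned = remove_stopwords("Die Fraktion stimmt mit Nein!")
```

Applied to the data frame, this would be something like `sample_df["answer"].apply(remove_stopwords)`.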