# Annotating the data
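For the annotation of the sample I use quantitative content analysis (Lamnek 2005). Three categories are formed:
1. non-answer: This category encompasses every response in which no reaction to the question occurs. Example: "" (an empty response)
2. evasive answer: This category covers responses that react to the question but do not answer it, or answer it only partly. Example: "Sehr geehrter Herr W., haben Sie vielen Dank für Ihre Anfrage. Ich beteilige mich nicht länger am Portal abgeordnetenwatch.de. Um Ihre Frage dennoch zu beantworten, bitte ich um Mitteilung Ihrer E-Mail-Adresse an antje.tillmann@bundestag.de. Mit freundlichen Grüßen Antje Tillmann MdB" (English: "Dear Mr W., thank you very much for your enquiry. I no longer participate in the portal abgeordnetenwatch.de. To answer your question nevertheless, please send your e-mail address to antje.tillmann@bundestag.de. Kind regards, Antje Tillmann MdB")
3. answer: Every response that contains an answer to the question is annotated in this category. Example: "Sehr geehrter Herr Schellerich,die gesamte Fraktion DIE LINKE im Deutschen Bundestag wird dem ESM-Vertrag nicht zustimmen. Ich habe dies in meiner Rede vom 29.März im Bundestag auch versucht zu begründen. Mit freundlichen Grüßen Dr. Gysi" (English: "Dear Mr Schellerich, the entire parliamentary group DIE LINKE in the German Bundestag will not approve the ESM treaty. I also tried to explain this in my speech in the Bundestag on 29 March. Kind regards, Dr. Gysi")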

The drawn sample is annotated manually. The annotated sample is then used to categorise the rest of the answers automatically.
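
The category scheme has to be turned into labels before a classifier can use it. The following is a minimal sketch, assuming hypothetical integer codes and a hypothetical helper `encode_annotations`; the notebook itself does not show how its `answer_encoded` column is built, so this is only one possible encoding.

```python
from collections import Counter

# Hypothetical integer codes for the three annotation categories.
CATEGORY_CODES = {"non-answer": 0, "evasive answer": 1, "answer": 2}


def encode_annotations(annotations):
    """Map category names to integer codes and reject unknown labels."""
    unknown = sorted(set(annotations) - set(CATEGORY_CODES))
    if unknown:
        raise ValueError(f"Unknown categories: {unknown}")
    return [CATEGORY_CODES[a] for a in annotations]


# Toy annotations, not taken from the real sample.
manual_annotations = ["answer", "evasive answer", "answer", "non-answer"]
encoded = encode_annotations(manual_annotations)

# The class distribution shows how balanced the annotated sample is.
distribution = Counter(manual_annotations)
print(encoded, distribution)
```

Rejecting unknown labels early helps to catch typos in the manual annotations before they end up as an extra class.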

In [49]:
# load libraries for data manipulation
import pandas as pd
import re
import regex
import numpy as np

# ML: Train/test splits, cross validation,
# gridsearch
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    GridSearchCV,
)

# load libraries for tokenization
import nltk
from nltk.tokenize import TreebankWordTokenizer, WhitespaceTokenizer
from nltk.corpus import stopwords
#nltk.download("stopwords")
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import preprocessing

# load libraries for text cleaning
import spacy
from spacy.lang.de.examples import sentences
# python -m spacy download de_core_news_sm
import ufal.udpipe
from gensim.models import KeyedVectors, Phrases
from gensim.models.phrases import Phraser
from ufal.udpipe import Model, Pipeline
import conllu

# Supervised text classification
# (CountVectorizer, TfidfVectorizer and GridSearchCV are already imported above)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# note: this Pipeline shadows the ufal.udpipe Pipeline imported above
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn import metrics
import joblib
#import eli5



## Preprocessing

In [50]:
# load data
sample_df = pd.read_csv("./data/stratified_sample.csv", sep=";")

In [51]:
# remove NaN for tokenizer to work
sample_df = sample_df.dropna(subset=["answer"])

In [52]:
sample_df = sample_df.drop_duplicates(["answer", "question_text", "party", "first_name", "last_name", "question_teaser"])

Normalising umlauts

In [53]:
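def remove_umlauts(text):
    # map each umlaut (lower- and uppercase) and ß to its ASCII transliteration;
    # uppercase umlauts would otherwise be deleted by the later ASCII-only cleaning
    umlauts = {
        "ä": "ae",
        "ö": "oe",
        "ü": "ue",
        "Ä": "Ae",
        "Ö": "Oe",
        "Ü": "Ue",
        "ß": "ss",
    }

    for original, repl in umlauts.items():
        text = text.replace(original, repl)

    return text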

In [54]:
sample_df["clean_answers"] = sample_df["answer"].apply(remove_umlauts)
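
Uppercase umlauts matter as much as lowercase ones here, because the later cleaning step keeps only ASCII letters and digits. As an alternative to chained `replace` calls, the transliteration can be sketched with a translation table. The names `UMLAUT_TABLE` and `transliterate_umlauts` are hypothetical and not part of the notebook.

```python
# Translation table covering lower- and uppercase umlauts plus sharp s.
UMLAUT_TABLE = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def transliterate_umlauts(text: str) -> str:
    """Replace German umlauts and sharp s in a single pass over the text."""
    return text.translate(UMLAUT_TABLE)


example = transliterate_umlauts("Änderung für die Straße")
print(example)  # Aenderung fuer die Strasse
```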

The next step preprocesses the data: all answers are converted to lowercase, and punctuation and other noise are removed. Lowercasing ensures that each word has only one spelling, e.g. "die" and "Die" are recognised as the same word.

In [55]:
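def text_preprocessing(text):
    # remove links and HTML entities first, before the catch-all pattern
    # below can split them into fragments
    text = re.sub(r"https?://\S+|&\w+;", " ", text)

    # replace punctuation and special characters with spaces
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)

    # collapse repeated whitespace
    text = re.sub(r"\s+", " ", text)

    # lowercase the text
    text = text.lower()

    # tokenize on whitespace
    words = text.split()

    # remove stopwords; the stopword list is transliterated with
    # remove_umlauts so that it matches the already normalised text
    german_stopwords = {remove_umlauts(w) for w in stopwords.words("german")}
    words = [w for w in words if w not in german_stopwords]

    # return joined text
    return " ".join(words)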

In [56]:
sample_df["clean_answers"] = sample_df["clean_answers"].apply(text_preprocessing)
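
When links and HTML entities are removed together with a catch-all character class, the order of the alternatives matters: the regex engine tries them left to right at each position, so a leading `[^a-zA-Z0-9]` consumes the `&` of an entity before `&\w+;` is ever tried. The following sketch compares a single combined pattern with a two-pass variant on a toy string. The helper names are hypothetical.

```python
import re

# Single pattern with the catch-all class listed first.
SINGLE_PASS = re.compile(r"[^a-zA-Z0-9]|\bhttps?://\S*|&\w+;|[\.,]")

# Two-pass variant: links and entities first, everything else afterwards.
LINKS_AND_ENTITIES = re.compile(r"https?://\S+|&\w+;")
NON_ALNUM = re.compile(r"[^a-zA-Z0-9]")


def squeeze(text):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean_single_pass(text):
    return squeeze(SINGLE_PASS.sub(" ", text))


def clean_two_pass(text):
    text = LINKS_AND_ENTITIES.sub(" ", text)
    return squeeze(NON_ALNUM.sub(" ", text))


sample = "Tom &amp; Jerry, siehe https://example.org/a?b=1"
print(clean_single_pass(sample))  # Tom amp Jerry siehe
print(clean_two_pass(sample))     # Tom Jerry siehe
```

In the single-pass variant the fragment `amp` survives as a token, which would later be counted as a word by the vectorizer.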

Lemmatisation

In [57]:
nlp = spacy.load("de_core_news_sm")

def text_lemmatization(text):
    doc = nlp(text)
    lemmas = [token.lemma_ for token in doc if not token.is_punct]
    return " ".join(lemmas)

In [58]:
sample_df["clean_answers"] = sample_df["clean_answers"].apply(text_lemmatization)

## Analyzing

In the next code chunk the sample data is split into a training set and a test set. The model is trained on the training set and then evaluated on the test set. Evaluating on held-out data is necessary to detect overfitting and to obtain a realistic estimate of the quality of the results. The resulting classifier serves as a baseline.
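
A minimal sketch of such a split on toy data. The `stratify` argument is an assumption and is not used in the notebook's own split; it keeps the class proportions equal in both sets, which may be useful when one category is much rarer than the others.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 70 "answer" and 30 "evasive answer" documents.
texts = [f"toy answer {i}" for i in range(100)]
labels = ["answer"] * 70 + ["evasive answer"] * 30

# Hold out 20% of the data; stratify keeps the 70/30 ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(train_counts, test_counts)
```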

### Creating a pipeline
In the next step several pipelines are created to test and tune different vectorizers and classifiers efficiently. CountVectorizer() and TfidfVectorizer() are each combined with MultinomialNB(), LogisticRegression() and RandomForestClassifier().

In [61]:
# split data into training and testing set with a testing set size of 20% of the data.
X_train, X_test, y_train, y_test = train_test_split(
    sample_df["clean_answers"],
    sample_df["answer_encoded"],
    test_size=0.2,
    random_state=42
)

pipes_and_grids = [
    {
        "pipeline" : Pipeline(
            steps=[
                ("vectorizer", CountVectorizer()),
                ("classifier", MultinomialNB())
            ]
        ),
        "grid" : {
                    "vectorizer__ngram_range" : [(1,1), (1,2), (1,3)],
                    "vectorizer__max_df" : [0.5, 1.0],
                    "vectorizer__min_df" : [1, 5],
        }     
    },
    {
        "pipeline" : Pipeline(
            steps=[
                ("vectorizer", TfidfVectorizer()),
                ("classifier", MultinomialNB())
            ]
        ),
        "grid" : {
                    "vectorizer__ngram_range" : [(1,1), (1,2), (1,3)],
                    "vectorizer__max_df" : [0.5, 1.0],
                    "vectorizer__min_df" : [1, 5],
        }  
    },
    {
        "pipeline" : Pipeline(
            steps=[
                ("vectorizer", CountVectorizer()),
                ("classifier", LogisticRegression(solver="lbfgs"))
            ]
        ),
        "grid" : {
                    "vectorizer__ngram_range" : [(1,1), (1,2), (1,3)],
                    "vectorizer__max_df" : [0.5, 1.0],
                    "vectorizer__min_df" : [1, 5],
                    "classifier__C" : [0.01, 1, 100]
        }  
    },
    {
        "pipeline" : Pipeline(
            steps=[
                ("vectorizer", TfidfVectorizer()),
                ("classifier", LogisticRegression(solver="lbfgs"))
            ]
        ),
        "grid" : {
                    "vectorizer__ngram_range" : [(1,1), (1,2), (1,3)],
                    "vectorizer__max_df" : [0.5, 1.0],
                    "vectorizer__min_df" : [1, 5],
                    "classifier__C" : [0.01, 1, 100]
        }  
    },
    {
        "pipeline" : Pipeline(
            steps=[
                ("vectorizer", CountVectorizer()),
                ("classifier", RandomForestClassifier(n_estimators=100))
            ]
        ),
        "grid" : {
                    "vectorizer__ngram_range" : [(1,1), (1,2), (1,3)],
                    "vectorizer__max_df" : [0.5, 1.0],
                    "vectorizer__min_df" : [1, 5],
        }
    },
    {
        "pipeline" : Pipeline(
            steps=[
                ("vectorizer", TfidfVectorizer()),
                ("classifier", RandomForestClassifier(n_estimators=100))
            ]
        ),
        "grid" : {
                    "vectorizer__ngram_range" : [(1,1), (1,2), (1,3)],
                    "vectorizer__max_df" : [0.5, 1.0],
                    "vectorizer__min_df" : [1, 5],
                    "vectorizer__max_features" : [1000, 5000, 10000]
        }
    },
]

for pipe_and_grid in pipes_and_grids:
    search = GridSearchCV(
        estimator=pipe_and_grid["pipeline"],
        param_grid=pipe_and_grid["grid"],
        scoring="accuracy",
        cv=10,
        n_jobs=-1,
    )

    search.fit(X_train, y_train)
    
    pred = search.predict(X_test)

    print(f"Vectorizer and classifier: {pipe_and_grid['pipeline']}")
    print(f"Best parameters: {search.best_params_}")

    rep = metrics.classification_report(y_test, pred)
    print(rep)

Vectorizer and classifier: Pipeline(steps=[('vectorizer', CountVectorizer()),
                ('classifier', MultinomialNB())])
Best parameters: {'vectorizer__max_df': 1.0, 'vectorizer__min_df': 1, 'vectorizer__ngram_range': (1, 1)}
                precision    recall  f1-score   support

        answer       0.75      0.94      0.84       253
evasive answer       0.74      0.36      0.48       121

      accuracy                           0.75       374
     macro avg       0.75      0.65      0.66       374
  weighted avg       0.75      0.75      0.72       374

Vectorizer and classifier: Pipeline(steps=[('vectorizer', TfidfVectorizer()),
                ('classifier', MultinomialNB())])
Best parameters: {'vectorizer__max_df': 1.0, 'vectorizer__min_df': 5, 'vectorizer__ngram_range': (1, 2)}
                precision    recall  f1-score   support

        answer       0.73      0.98      0.83       253
evasive answer       0.83      0.24      0.37       121

      accuracy           
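
The `macro avg` and `weighted avg` rows can be reproduced from the per-class values. A short sketch using the F1 scores and supports from the first report above (CountVectorizer with MultinomialNB); the variable names are only illustrative.

```python
# Per-class F1 and support copied from the first classification report.
report = {
    "answer": {"f1": 0.84, "support": 253},
    "evasive answer": {"f1": 0.48, "support": 121},
}

total = sum(row["support"] for row in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total

print(f"macro F1: {macro_f1:.2f}, weighted F1: {weighted_f1:.2f}")
# macro F1: 0.66, weighted F1: 0.72
```

The macro average treats both classes equally, so it reacts more strongly to the weak recall on evasive answers than the accuracy or the weighted average do. If that behaviour is wanted during tuning, `scoring="f1_macro"` is one option that `GridSearchCV` accepts.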

- SVM, k-nearest neighbors, gradient boosting
- word embeddings
- question_text / question_teaser

In [60]:
X_train, X_test, y_train, y_test = train_test_split(
    sample_df["answer"],
    sample_df["answer_encoded"],
    test_size=0.2,
    random_state=42
)

vectorizer = TfidfVectorizer(
    max_df=0.5,
    min_df=5,
    ngram_range=(1,3)
)

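# fit_transform() learns the vocabulary on the training set and returns the
# document-term matrix; fit() alone returns the vectorizer itself, which caused
# "TypeError: float() argument must be a string or a real number, not 'TfidfVectorizer'"
text_train = vectorizer.fit_transform(X_train)
# reuse the training vocabulary for the test set instead of refitting
text_test = vectorizer.transform(X_test)

# TF-IDF matrices are sparse, so centering has to be disabled
scaler = preprocessing.StandardScaler(with_mean=False).fit(text_train)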
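
`fit()` returns the fitted estimator rather than the transformed data, so passing its result to `StandardScaler` raises a `TypeError`. A minimal sketch on toy documents, assuming that scaling the TF-IDF features is the goal: the vectorizer is fitted on the training documents only, and `with_mean=False` is set because sparse matrices cannot be centered.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy documents, not taken from the real sample.
docs_train = [
    "vielen dank fuer ihre frage",
    "ich stimme dem vertrag nicht zu",
    "vielen dank fuer ihre anfrage",
]
docs_test = ["dank fuer die frage"]

vectorizer = TfidfVectorizer()
returned = vectorizer.fit(docs_train)  # fit() returns the vectorizer itself

# transform() returns the sparse document-term matrices.
train_matrix = vectorizer.transform(docs_train)
test_matrix = vectorizer.transform(docs_test)  # same vocabulary as training

# Scale without centering so that the matrices stay sparse.
scaler = StandardScaler(with_mean=False).fit(train_matrix)
train_scaled = scaler.transform(train_matrix)
test_scaled = scaler.transform(test_matrix)

print(train_scaled.shape, test_scaled.shape)
```

Fitting the vectorizer a second time on the test set would replace the training vocabulary, so the two matrices would no longer share the same columns.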