# Machine Learning on Text Data: GermEval 2018

*GermEval* (short for German Evaluation) is an annual competition in natural language processing for German-language text (see [https://germeval.github.io/](https://germeval.github.io/)).

In 2018 the task was to detect insults in tweets.

In [None]:
from collections import namedtuple

Record = namedtuple('Record', [ 'text', 'primary_label', 'secondary_label' ])

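# Each line holds one tweet: text, coarse label and fine label, separated by tabs.
with open('germeval2018.training.txt', 'r', encoding='utf-8') as file:
    training_data = [ Record(*line.rstrip('\n').split('\t')) for line in file ]

with open('germeval2018.test.txt', 'r', encoding='utf-8') as file:
    test_data = [ Record(*line.rstrip('\n').split('\t')) for line in file ]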

training_data[0:5]
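
The loader above expects one tweet per line with three tab-separated fields. The following is a minimal sketch of that parsing step on in-memory text, so it runs without the data files. The two sample lines are invented for illustration and the label names are assumptions about the label scheme, not excerpts from the data set.

```python
import io
from collections import namedtuple

Record = namedtuple('Record', ['text', 'primary_label', 'secondary_label'])

# Invented sample lines in the assumed format: text <TAB> coarse label <TAB> fine label
sample = io.StringIO(
    "Guten Morgen zusammen\tOTHER\tOTHER\n"
    "@user Du bist so ein Idiot\tOFFENSE\tINSULT\n"
)

# rstrip('\n') removes only the line break, so a last line without one stays intact
records = [Record(*line.rstrip('\n').split('\t')) for line in sample]
print(records[1].primary_label, records[1].secondary_label)
```

A line with more or fewer than three fields would raise a `TypeError` here, which is a cheap way to notice malformed input early.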

In [None]:
from collections import Counter

Counter([ (record.primary_label, record.secondary_label) for record in training_data ])

In [None]:
Counter([ (record.primary_label, record.secondary_label) for record in test_data ])

## Preprocessing the Tweets

For further processing we delete Twitter handles (`@username`) and strip the hashtag character. Quote characters are removed as well, and hyphens are replaced by spaces.

In [None]:
import re

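def clean_tweet(text):
    """ Preprocess a tweet: strip handles, hashtag signs, quotes and hyphens. """
    
    # remove handles, i.e. @username
    text = re.sub(r'@\w+', '', text)
    # remove hashtag signs and quote characters (the hashtag word itself is kept)
    text = re.sub(r'[#"\']+', '', text)
    # replace hyphens with spaces so that the parts of a compound become separate tokens
    text = text.replace('-', ' ')
    return text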

clean_tweet(training_data[4].text)
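
To make the effect of the cleaning rules visible without the data set, here is a small sketch on an invented tweet. `clean_tweet` is repeated so the block runs on its own. `clean_and_squeeze` is a hypothetical variant, not part of the notebook, that also collapses the whitespace left behind by the removals.

```python
import re

def clean_tweet(text):
    """Same rules as above: drop handles, hashtag signs and quotes, split hyphens."""
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'[#"\']+', '', text)
    return text.replace('-', ' ')

def clean_and_squeeze(text):
    """Hypothetical variant: additionally collapse runs of whitespace and trim the ends."""
    return re.sub(r'\s+', ' ', clean_tweet(text)).strip()

tweet = '@anna Das ist #super-toll, "oder"?'   # invented example
print(repr(clean_tweet(tweet)))        # ' Das ist super toll, oder?'
print(repr(clean_and_squeeze(tweet)))  # 'Das ist super toll, oder?'
```

For the vectorizers used later the extra whitespace should not matter, since they tokenize the text themselves.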

## Conversion to Tensors

For further processing with `scikit-learn` we convert the data into a suitable tensor structure (NumPy arrays).

In [None]:
import numpy as np
import sklearn

def convert_data(records):
    """ Convert a list of records into a dict of NumPy arrays. """
    data   = np.array([ clean_tweet(record.text) for record in records ])
    coarse = np.array([ record.primary_label for record in records ])
    fine   = np.array([ record.secondary_label for record in records ])
    
    return { 'data': data, 'coarse': coarse, 'fine': fine }

train = convert_data(training_data)
test  = convert_data(test_data)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import metrics


def evaluate(classifier):
    """ Print confusion matrix and classification report for the test set; return the accuracy. """
    predicted = classifier.predict(test['data'])
    print(f"Confusion matrix:\n{metrics.confusion_matrix(test['coarse'], predicted)}")
    print(metrics.classification_report(test['coarse'], predicted))
    return np.mean(predicted == test['coarse'])
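
`evaluate` returns only the accuracy, while the printed report also lists precision and recall per class. As a reminder of how those two values follow from the confusion matrix, here is a small sketch with invented counts (not results from this notebook). The helper name `precision_recall` is hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for one class from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented counts for one class: 30 found correctly, 10 false alarms, 70 missed
precision, recall = precision_recall(tp=30, fp=10, fn=70)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these numbers the classifier is right in 75 % of the cases it flags, but it finds only 30 % of the tweets that belong to the class.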
    

Following the cheat sheet, we first try a `LinearSVC`.

In [None]:
   
text_classifier = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LinearSVC())
])

text_classifier.fit(train['data'], train['coarse'])
evaluate(text_classifier)
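
The `CountVectorizer` step in this pipeline turns each tweet into a vector of word counts, which the classifier then consumes. A minimal sketch on two invented toy documents shows what that representation looks like. It assumes the default settings, which lowercase the text and keep tokens of at least two word characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Du bist toll", "Du bist doof doof"]   # invented toy documents

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)        # sparse matrix, one row per document

# vocabulary_ maps each term to its column index; list the terms in column order
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)          # ['bist', 'doof', 'du', 'toll']
print(counts.toarray())    # rows: [1 0 1 1] and [1 2 1 0]
```

Word order is lost in this representation; only the counts per term remain.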

The next candidate for text is naive Bayes, here `MultinomialNB`.

In [None]:
bayes_classifier = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])

bayes_classifier.fit(train['data'], train['coarse']) 
evaluate(bayes_classifier)

For comparison, `KNeighborsClassifier` and `DecisionTreeClassifier`.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

text_classifier = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', KNeighborsClassifier()),
])

text_classifier.fit(train['data'], train['coarse']) 
evaluate(text_classifier)

In [None]:
text_classifier = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', DecisionTreeClassifier()),
])

text_classifier.fit(train['data'], train['coarse']) 
evaluate(text_classifier)

So far, naive Bayes does in fact work best. However, its recall is still not particularly good. The question is whether word embeddings, for example, help here. To find out, we import word embeddings that Heidelberg University trained on German Twitter messages (see [Download](https://www.cl.uni-heidelberg.de/english/research/downloads/resource_pages/GermanTwitterEmbeddings/GermanTwitterEmbeddings_data.shtml)).

In [None]:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('twitter-de_d100_w5_min10.bin', binary=True)
print(model.most_similar(positive=['Merkel'], topn=10))

In [None]:
model.get_vector('Merkel')

Word embeddings became popular because they can capture semantic relationships, for example (mother minus woman plus man gives father):
```
'Mutter' - 'Frau' + 'Mann' = 'Vater'
```

In [None]:
vater = model.get_vector('Mann') - model.get_vector('Frau') + model.get_vector('Mutter')
model.distances(vater, ('Vater', 'Mutter'))
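
The distances reported by the model are, to my understanding, cosine distances, i.e. one minus the cosine similarity of the two vectors. The following sketch reproduces the analogy arithmetic with invented two-dimensional toy vectors instead of the pre-trained model. The vectors and the helper `cosine_distance` are assumptions made for illustration.

```python
import numpy as np

# Invented 2-d toy embeddings: axis 0 stands for "parent", axis 1 for gender
toy = {
    'Mutter': np.array([1.0,  1.0]),
    'Vater':  np.array([1.0, -1.0]),
    'Frau':   np.array([0.0,  1.0]),
    'Mann':   np.array([0.0, -1.0]),
}

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

vater = toy['Mann'] - toy['Frau'] + toy['Mutter']
d_vater = cosine_distance(vater, toy['Vater'])
d_mutter = cosine_distance(vater, toy['Mutter'])
print(d_vater, d_mutter)
```

In this constructed case the result coincides with the vector for `'Vater'`, so the distance is (up to rounding) zero. With real embeddings the match is only approximate.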

In [None]:
model.distance('Merkel', 'Bundeskanzlerin')

In [None]:
model.distance('Fachhochschule', 'Bundeskanzler')

In [None]:
model.distance('Fachhochschule', 'FH')

In [None]:
model.distance('Flüchtlinge', 'Nichtdeutsche')

In [None]:
model.distance('Fachhochschule', 'Nichtdeutsche')

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
import nltk

class EmbeddingVectorizer(TransformerMixin, BaseEstimator):
    """Convert a collection of text documents to a matrix of vectors created from word embeddings. """
    
    def __init__(self):
        self.model = gensim.models.KeyedVectors.load_word2vec_format('twitter-de_d100_w5_min10.bin', binary=True)
        self.tokenizer = nltk.tokenize.casual.TweetTokenizer()
        
    def fit(self, X, y=None, **fit_params):
        """Nothing to do here, we use a pre-trained model. """
        return self
    
    def transform(self, raw_documents):
        """Transform documents to embedding matrix by calculating the L2-normalized sum of the embeddings
        of individual words.
        """
        if isinstance(raw_documents, str):
            raise ValueError(
                "Iterable over raw text documents expected, "
                "string object received.")

        _X = []
        for doc in raw_documents:
            x = np.zeros(self.model.vector_size)
            for word in self.tokenizer.tokenize(doc):
                try:
                    x += self.model.get_vector(word)
                except KeyError:
                    #print(f"ignoring {word} not in vocabulary")
                    pass
                
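            # normalize to unit length; keep an all-zero vector (no known word) as is to avoid NaNs
            norm = np.linalg.norm(x)
            if norm > 0:
                x /= norm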
            _X.append(x)
        
        X = np.array(_X)
        return X
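
The core of `transform` is to sum the embeddings of all known words and then scale the sum to unit length. The sketch below shows that step with an invented two-dimensional dictionary standing in for the pre-trained model and a plain whitespace split instead of the `TweetTokenizer`. The function name `embed` is hypothetical.

```python
import numpy as np

# Invented 2-d embeddings standing in for the pre-trained model
toy_embeddings = {'gut': np.array([3.0, 0.0]), 'schlecht': np.array([0.0, 4.0])}

def embed(doc, embeddings, dim=2):
    """Sum the vectors of known words and scale the result to unit length."""
    x = np.zeros(dim)
    for word in doc.split():       # plain whitespace split, not the TweetTokenizer
        if word in embeddings:     # unknown words are skipped
            x += embeddings[word]
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x   # keep the zero vector if no word was known

print(embed('gut schlecht unbekannt', toy_embeddings))   # [0.6 0.8]
print(embed('nichts bekannt', toy_embeddings))           # [0. 0.]
```

The second call shows why the norm has to be checked: a document without any known word would otherwise be divided by zero and end up as a row of NaNs.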
       
    


In [None]:
vectorizer = EmbeddingVectorizer()

In [None]:
w2v_svc_classifier = Pipeline([
    ('vect', vectorizer),
    ('clf', LinearSVC()),
])

w2v_svc_classifier.fit(train['data'], train['coarse']) 
evaluate(w2v_svc_classifier)

In [None]:
from sklearn.ensemble import RandomForestClassifier

w2v_rf_classifier = Pipeline([
    ('vect', vectorizer),
    ('clf', RandomForestClassifier(n_estimators=100))
])

w2v_rf_classifier.fit(train['data'], train['coarse']) 
evaluate(w2v_rf_classifier)


In [None]:
from sklearn.ensemble import VotingClassifier


classifiers = [ 
    ('bayes', bayes_classifier),
    ('svc', w2v_svc_classifier),
    ('rf', w2v_rf_classifier)
]

voting_classifier = VotingClassifier(classifiers)

voting_classifier.fit(train['data'], train['coarse']) 
evaluate(voting_classifier)
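
With its default setting `voting='hard'`, the `VotingClassifier` predicts the label that most of its member classifiers vote for. The idea can be sketched in plain Python as follows. The predictions are invented, and the helper `majority_vote` with its alphabetical tie-break is a simplification for illustration, not the library's implementation.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label; ties go to the alphabetically first label."""
    counts = Counter(predictions)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)

# Invented predictions of three classifiers for four tweets
bayes = ['OTHER',   'OFFENSE', 'OTHER',   'OFFENSE']
svc   = ['OFFENSE', 'OFFENSE', 'OTHER',   'OTHER']
rf    = ['OTHER',   'OFFENSE', 'OFFENSE', 'OTHER']

voted = [majority_vote(votes) for votes in zip(bayes, svc, rf)]
print(voted)   # ['OTHER', 'OFFENSE', 'OTHER', 'OTHER']
```

With three voters and two classes a tie cannot occur, so the tie-break only matters for an even number of classifiers or more than two classes.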