In [1]:
#!pip install gensim nltk matplotlib numpy scipy scikit-learn spacy

In [2]:
#!python -m spacy download de_core_news_lg

In [3]:
# !wget https://int-emb-glove-de-wiki.s3.eu-central-1.amazonaws.com/vectors.txt

In [4]:
# !wget https://int-emb-glove-de-wiki.s3.eu-central-1.amazonaws.com/vocab.txt

In [5]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import spacy
import re
import os

from nltk.tokenize import sent_tokenize
from collections import defaultdict

In [6]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [7]:
flatten = lambda l: [item for sublist in l for item in sublist]

- [spacy](https://spacy.io/models/de)

- [fasttext models](https://fasttext.cc/docs/en/pretrained-vectors.html)
- [git for a gensim model](https://devmount.github.io/GermanWordEmbeddings/)
- [raw digit embeddings (used last time)](https://www.deepset.ai/german-word-embeddings)
    - [vectors.txt](https://int-emb-glove-de-wiki.s3.eu-central-1.amazonaws.com/vectors.txt)
    - [vocab.txt](https://int-emb-glove-de-wiki.s3.eu-central-1.amazonaws.com/vocab.txt)
    - [their git](https://gitlab.com/deepset-ai/open-source/glove-embeddings-de)
    
- [stopwords](https://github.com/stopwords-iso/stopwords-de)

- [German preprocessing pipeline](https://data-dive.com/german-nlp-binary-text-classification-of-reviews-part1)

- [original preprocessing from Just et al., 2020](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7466436/)


## Preprocess

**Ivan Nenchev**

_Regarding preprocessing, I have not done much of it yet, as I was thinking about what I should include. I will proceed with lemmatization and removal of the stop words, but the fillers have to be removed manually. To me, it seems like your transcripts do not include false starts, perseverations, or filled pauses, so there is no need to exclude them. Your transcripts are already separated into sentences, but I do not know what the procedure for separating them was. I did use TF-IDF weighting on my data, and I used a large web corpus dump to calculate it, but I do not know what to use for German, as I have no reference corpus._

_As for the analysis, I have done a variety of things in my thesis and I am not sure that we need to replicate all of them on this material. Local coherence (like in Bedi, you are correct!) and possibly global coherence make the most sense here. I would still suggest using tangentiality, which is inapplicable to my material but makes a lot of sense for yours._

_____

_Preprocessing: the NET interview was prepared by me by "normalizing" interviewer speech, i.e. by disregarding any speech by the interviewer that goes beyond the 12 questions asked in the NET interview. Next, Erik formed a "raw version" of the transcript by removing interviewer speech. He was still differentiating the responses though, i.e. he knew which response was fear1, fear2 etc. Next, he cleaned transcripts from stop words and I thought that applying GloVe deletes verbal fillers (?).
The transcripts had been separated into sentences by the transcribing person, i.e. mostly me, and not in a systematic fashion (just when I "felt" like the sentence was finished). I do not know whether Erik applied another method for separating the transcripts into sentences. This may be a weak point in the analysis._



_____
__Just et al., 2020__: __Preparation of Data__

The data set consisted of 513 min of 60 recorded NET interviews (see Table 2). Interview length ranged between 3 and 22 min, with an average length of 8.5 min. The interviewer’s speech was left out for complex analysis since it can be reduced to the questions mentioned above. After cleaning transcripts of the interviewer’s speech, the raw data set consisted of 46,375 words, ranging from 134 to 2,644 words, with an average of 772.92 words per participant. Examples for raw data are shown in Figure 1A. For the coherence models, verbal fillers and sentences only containing stop words were excluded from analysis because they can bias coherence measures (13). An example of this is shown in Figure 1B. Words not appearing in the reference corpus for the GloVe model were discarded for this model. Unknown words were saved for later examination, especially for the detection of neologisms. The GloVe model was provided by deepset (48) as open source who trained the model on a German Wikipedia dump.

In [8]:
# vocab_encoding_errors_dict = {'Ã¼':'ü', 'Ã¤': 'ä', 'Ã¶': 'ö', 'ÃŸ': 'ß'}
NET_encoding_errors_dict = {'д': 'ä', 'ь':'ü', 'ц': 'ö', 'Я': 'ß', 'Д': 'Ä', 'Ь': 'Ü', 'Ц': 'Ö',
                            'й': 'é', 'Й': 'É', 'б': 'à', 'Б': 'À'}

fillers = ['ähm', 'Ähm', 'äm', 'Äm', 'Ähn', 'än', 'Mhm', 'mhm', '[unv]', '(...)', '(..)', '[...]', 'Hm', 'Oh', 'Na', 'na', 'ne', 'Ne'] # 'Aslo', 'also'

In [9]:
stopwords = ["a","ab","aber","ach","acht","achte","achten","achter","achtes","ag","alle","allein","allem","allen","aller","allerdings","alles","allgemeinen","als","also","am","an","ander","andere","anderem","anderen","anderer","anderes","anderm","andern","anderr","anders","au","auch","auf","aus","ausser","ausserdem","außer","außerdem",
    "b","bald","bei","beide","beiden","beim","beispiel","bekannt","bereits","besonders","besser","besten","bin","bis","bisher","bist",
    "c","d","d.h","da","dabei","dadurch","dafür","dagegen","daher","dahin","dahinter","damals","damit","danach","daneben","dank","dann","daran","darauf","daraus","darf","darfst","darin","darum","darunter","darüber","das","dasein","daselbst","dass","dasselbe","davon","davor","dazu","dazwischen","daß","dein","deine","deinem","deinen","deiner","deines","dem","dementsprechend","demgegenüber","demgemäss","demgemäß","demselben","demzufolge","den","denen","denn","denselben","der","deren","derer","derjenige","derjenigen","dermassen","dermaßen","derselbe","derselben","des","deshalb","desselben","dessen","deswegen","dich","die","diejenige","diejenigen","dies","diese","dieselbe","dieselben","diesem","diesen","dieser","dieses","dir","doch","dort","drei","drin","dritte","dritten","dritter","drittes","du","durch","durchaus","durfte","durften","dürfen","dürft",
    "e","eben","ebenso","ehrlich","ei","ei,","eigen","eigene","eigenen","eigener","eigenes","ein","einander","eine","einem","einen","einer","eines","einig","einige","einigem","einigen","einiger","einiges","einmal","eins","elf","en","ende","endlich","entweder","er","ernst","erst","erste","ersten","erster","erstes","es","etwa","etwas","euch","euer","eure","eurem","euren","eurer","eures",
    "f","folgende","früher","fünf","fünfte","fünften","fünfter","fünftes","für",
    "g","gab","ganz","ganze","ganzen","ganzer","ganzes","gar","gedurft","gegen","gegenüber","gehabt","gehen","geht","gekannt","gekonnt","gemacht","gemocht","gemusst","genug","gerade","gern","gesagt","geschweige","gewesen","gewollt","geworden","gibt","ging","gleich","gott","gross","grosse","grossen","grosser","grosses","groß","große","großen","großer","großes","gut","gute","guter","gutes",
    "h","hab","habe","haben","habt","hast","hat","hatte","hatten","hattest","hattet","heisst","her","heute","hier","hin","hinter","hoch","hätte","hätten",
    "i","ich","ihm","ihn","ihnen","ihr","ihre","ihrem","ihren","ihrer","ihres","im","immer","in","indem","infolgedessen","ins","irgend","ist",
    "j","ja","jahr","jahre","jahren","je","jede","jedem","jeden","jeder","jedermann","jedermanns","jedes","jedoch","jemand","jemandem","jemanden","jene","jenem","jenen","jener","jenes","jetzt",
    "k","kam","kann","kannst","kaum","kein","keine","keinem","keinen","keiner","keines","kleine","kleinen","kleiner","kleines","kommen","kommt","konnte","konnten","kurz","können","könnt","könnte",
    "l","lang","lange","leicht","leide","lieber","los",
    "m","machen","macht","machte","mag","magst","mahn","mal","man","manche","manchem","manchen","mancher","manches","mann","mehr","mein","meine","meinem","meinen","meiner","meines","mensch","menschen","mich","mir","mit","mittel","mochte","mochten","morgen","muss","musst","musste","mussten","muß","mußt","möchte","mögen","möglich","mögt","müssen","müsst","müßt",
    "n","na","nach","nachdem","nahm","natürlich","neben","nein","neue","neuen","neun","neunte","neunten","neunter","neuntes","nicht","nichts","nie","niemand","niemandem","niemanden","noch","nun","nur",
    "o","ob","oben","oder","offen","oft","ohne","ordnung",
    "p","q","r","recht","rechte","rechten","rechter","rechtes","richtig","rund",
    "s","sa","sache","sagt","sagte","sah","satt","schlecht","schluss","schon","sechs","sechste","sechsten","sechster","sechstes","sehr","sei","seid","seien","sein","seine","seinem","seinen","seiner","seines","seit","seitdem","selbst","sich","sie","sieben","siebente","siebenten","siebenter","siebentes","sind","so","solang","solche","solchem","solchen","solcher","solches","soll","sollen","sollst","sollt","sollte","sollten","sondern","sonst","soweit","sowie","später","startseite","statt","steht","suche",
    "t","tag","tage","tagen","tat","teil","tel","tritt","trotzdem","tun",
    "u","uhr","um","und","uns","unse","unsem","unsen","unser","unsere","unserer","unses","unter",
    "v","vergangenen","viel","viele","vielem","vielen","vielleicht","vier","vierte","vierten","vierter","viertes","vom","von","vor",
    "w","wahr","wann","war","waren","warst","wart","warum","was","weg","wegen","weil","weit","weiter","weitere","weiteren","weiteres","welche","welchem","welchen","welcher","welches","wem","wen","wenig","wenige","weniger","weniges","wenigstens","wenn","wer","werde","werden","werdet","weshalb","wessen","wie","wieder","wieso","will","willst","wir","wird","wirklich","wirst","wissen","wo","woher","wohin","wohl","wollen","wollt","wollte","wollten","worden","wurde","wurden","während","währenddem","währenddessen","wäre","würde","würden",
    "x","y","z","z.b","zehn","zehnte","zehnten","zehnter","zehntes","zeit","zu","zuerst","zugleich","zum","zunächst","zur","zurück","zusammen","zwanzig","zwar","zwei","zweite","zweiten","zweiter","zweites","zwischen","zwölf",
    "über","überhaupt","übrigens"]
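
A list like this is safest to apply to whole tokens, because short entries such as "a" or "an" also occur inside longer words. The following is a minimal sketch rather than the notebook's actual filtering step: `drop_stopwords` and the three-entry `demo_stopwords` set are illustrative names, and the regex tokenisation is a simplification of what spaCy does.

```python
import re

# Hypothetical mini stopword set; the full list would be used in practice.
demo_stopwords = {"ich", "an", "a"}

def drop_stopwords(text, stopwords):
    """Remove whole-word stopwords (case-insensitive), keeping other words intact."""
    # \w+ matches Unicode word characters, so umlauts stay inside their words.
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if t.lower() not in stopwords]

sentence = "Ich kann tanzen"
print(drop_stopwords(sentence, demo_stopwords))  # whole-token filtering

# Substring replacement, by contrast, damages content words:
print(sentence.replace("an", ""))
```

Converting the list to a `set` once also makes each membership test constant-time, which matters little here but costs nothing.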

In [10]:
def replace_chars(text, char_map):
    for char in char_map:
        text = text.replace(char, char_map[char])
    return text
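
The keys in `NET_encoding_errors_dict` look like what appears when text saved in a Western European single-byte encoding is decoded as cp1251: "ä" becomes "д", "ü" becomes "ь", and so on. This is an inference from the mapping, not something the notebook states. The sketch below reproduces the effect and its inverse; `mojibake` and `repair` are illustrative helper names, and cp1252 as the original encoding is an assumption.

```python
def mojibake(text, written_as="cp1252", read_as="cp1251"):
    """Simulate decoding bytes with the wrong single-byte code page."""
    return text.encode(written_as).decode(read_as)

def repair(text, written_as="cp1252", read_as="cp1251"):
    """Undo the wrong decoding by re-encoding and decoding correctly."""
    return text.encode(read_as).decode(written_as)

garbled = mojibake("Ähm, schön groß")
print(garbled)          # Cyrillic letters in place of umlauts and ß
print(repair(garbled))  # round-trips to the original text
```

If the assumption holds for the transcripts, opening the files with the original encoding would make most of the character map unnecessary. The map's "б" entry points to "à" although the same byte corresponds to "á" in cp1252, so the map may also be normalising accents and is not a pure inverse.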

In [11]:
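def remove_timestamps(text):
    # Transcript timestamps have the form #hh:mm:ss-d#
    return re.sub(r'#\d\d:\d\d:\d\d-\d#', '', text)

def remove_stopwords(text, stopwords):
    # Remove whole words only; a plain str.replace would also delete
    # matches inside longer words (e.g. 'na' inside 'nach').
    for stopword in stopwords:
        text = re.sub(rf'(?<!\w){re.escape(stopword)}(?!\w)', '', text)
    return text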

def remove_hanging_punct(text):
    text = text.replace('. .', '.').replace('. ,', '.').replace(': .', ':')
    text = text.replace(' ,', ',').replace(' .', '.').replace(' . ', '. ')
    text = re.sub(':\s+, ', ': ', text)
    return text

punct = '\!"#$%&\'\(\)\*\+,\-\./:;<=>\?@[\\]^_`{|}~\^„“/…'

def test_unicode(text):
    non_unicode = re.findall(f'[^A-Za-zÀ-ž0-9\\n {punct}]', text)
    return not bool(non_unicode), non_unicode

def remove_extra_spaces(text):
    text = re.sub('[ \t]+', ' ', text)
    text = re.sub('\n+', '\n', text)
    return text

def fix_questions(text):
    text = text.replace('T: Was bedeutet denn Freude für Sie?', 'T: Was bedeutet Freude für Sie?')
    text = text.replace('T. Was bedeutet Wut für Sie?', 'T: Was bedeutet Wut für Sie?')
    text = text.replace('T: Und was bedeutet Wut für Sie?', 'T: Was bedeutet Wut für Sie?')
    text = text.replace('T: Was bedeutet für Sie Angst?', 'T: Was bedeutet Angst für Sie?')
    text = text.replace('T: Was bedeutet Trauigkeit für Sie?', 'T: Was bedeutet Traurigkeit für Sie?')
    return text

def fix_turn_beginnings(text):
    text = text.replace('B:', 'P:')
    return text

def preprocess(text, char_map, fillers=[], stopwords=[]):
    text = replace_chars(text, char_map)
    text = remove_timestamps(text)
#     text = remove_stopwords(text, fillers)
#     text = remove_stopwords(text, stopwords)
    text = fix_questions(text)
    text = fix_turn_beginnings(text)
    text = remove_hanging_punct(text)
    text = remove_extra_spaces(text)
    text = remove_hanging_punct(text)
    ok, non_unicode = test_unicode(text)
    assert ok, non_unicode
    text = text.strip().strip('\n').strip('1')
    return text
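
To make the cleaning steps concrete, here is a small self-contained sketch of timestamp removal followed by whitespace normalisation. The sample line is made up, and the timestamp shape is inferred from the pattern in `remove_timestamps`.

```python
import re

# Same pattern as in remove_timestamps, written as a raw string.
TIMESTAMP = re.compile(r"#\d\d:\d\d:\d\d-\d#")

sample = "P: Wut. #00:01:23-4# Ja. #00:01:30-1#"  # made-up transcript line

without_stamps = TIMESTAMP.sub("", sample)
# Removing a stamp leaves a double space behind, so collapse runs of blanks.
tidy = re.sub(r"[ \t]+", " ", without_stamps).strip()
print(tidy)
```

The order matters: collapsing spaces before removing timestamps would leave the double spaces in place, which is presumably why `preprocess` normalises whitespace late.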

In [12]:
with open('NET_NAP/NETAKUT049_MPPS_059_6M.txt', encoding='cp1251') as f:
    raw = f.read()

In [13]:
cleaned = preprocess(raw, char_map=NET_encoding_errors_dict, fillers=fillers)
print(cleaned)

T: Was bedeutet Wut für Sie? 
P: Wut. Ja. Wut ist ein Gefühl, dass ich seit meiner Jugend peu à peu verloren habe eigentlich. Ähm, ich war sehr wütend als Jugendlicher, wenn Kleinigkeiten aufraten. Ich mag es aber nach wie vor nicht untergebuttert zu werden. Also wenn, ja, genau. Oder wenn jemand sich nicht konform verhält, sag ich mal. Also ich habe ein sehr großes Raster, wie sich die Leute verhalten sollten. Das ist schon sehr großzügig geschneidert. Aber wenn da jemand partout nicht reinpasst, dann werde ich wütend. 
T: Fällt Ihnen eine Situation ein, in der Sie Wut erlebt haben? 
P: Streit mit einer Partnerin zum Beispiel, die sich im Ton vergreift. Oder handgreiflich wird. 
T: Warum haben Sie in dieser Situation Wut erlebt? 
P: Also in der damaligen Situation war es tatsächlich Schmerz dann, körperlicher Schmerz. Und dann war ich von einer Sekunde auf die andere, war ich hochtourig. Dann war ich auch richtig wütend nachher. Und ich glaube, ich habe dann auch [unv]. Ja. 
T: Was be

In [14]:
ids = []
transcripts = []

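for (root, dirs, files) in os.walk('NET_NAP', topdown=True):
    for file in files:
        # File names look like NETAKUT049_MPPS_059_6M.txt; keep the digits after the prefix.
        # str.strip removes a set of characters, not a prefix or suffix, so it is avoided here.
        stem = os.path.splitext(file)[0]
        ids.append(stem.split('_')[0].replace('NETAKUT', ''))
        # Join with root so that files in subdirectories are found as well.
        with open(os.path.join(root, file), encoding='cp1251') as f:
            raw = f.read()
        cleaned = preprocess(raw, char_map=NET_encoding_errors_dict, fillers=fillers)
        transcripts.append(cleaned)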

In [15]:
df = pd.DataFrame(index=ids)
df['raw'] = transcripts
df.head()

Unnamed: 0,raw
34,T: Was bedeutet Traurigkeit für Sie? \nP: Trau...
60,T: Was bedeutet Traurigkeit für Sie? \nP: Mh e...
76,T: Was bedeutet Traurigkeit für Sie? \nP: Verl...
84,T: Was bedeutet Traurigkeit für Sie? \nP: Geld...
88,"T: Was bedeutet Angst für Sie? \nP: Ja, Angst ..."


### Split into questions

In [16]:
def split_into_questions(interview):
    emotion_questions = {'wut': "T: Was bedeutet Wut für Sie?",
                        'angst': "T: Was bedeutet Angst für Sie?",
                        'traurigkeit': "T: Was bedeutet Traurigkeit für Sie?",
                        'freude': "T: Was bedeutet Freude für Sie?"}
    emotion_qids = {'wut': -1,
                    'angst': -1,
                    'traurigkeit': -1,
                    'freude': -1}

    for e, q in emotion_questions.items():
        assert q in interview, f'{q}, not in {interview}'
        emotion_qids[e] = interview.index(q)

    emotion_texts = {'wut': '',
                    'angst': '',
                    'traurigkeit': '',
                    'freude': ''}

    sorted_emotion_qids = sorted(emotion_qids.items(), key=lambda item: item[1])
    for i, item in enumerate(sorted_emotion_qids):
        e, qid = item
        if i + 1 < len(sorted_emotion_qids):
            next_qid = sorted_emotion_qids[i+1][1]
            emotion_texts[e] = interview[qid:next_qid]
        else:
            emotion_texts[e] = interview[qid:]

    return emotion_texts
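
The splitting logic boils down to: find where each opening question starts, sort those positions, and slice from one position to the next. A compact sketch of that idea on a toy interview follows; `split_by_markers` is a hypothetical generalisation, not a replacement for the function above.

```python
def split_by_markers(text, markers):
    """Slice `text` into segments that each start at one marker string.

    `markers` maps a label to the literal string that opens its segment.
    Every marker must occur in `text` (str.index raises ValueError otherwise);
    only the first occurrence of each marker is used.
    """
    positions = sorted((text.index(m), label) for label, m in markers.items())
    segments = {}
    for i, (start, label) in enumerate(positions):
        # A segment runs up to the next marker, or to the end of the text.
        end = positions[i + 1][0] if i + 1 < len(positions) else len(text)
        segments[label] = text[start:end]
    return segments

toy = "T: Q-B?\nP: b one.\nT: Q-A?\nP: a one.\nT: follow-up?\nP: a two."
parts = split_by_markers(toy, {"a": "T: Q-A?", "b": "T: Q-B?"})
print(parts["a"])
```

Two properties are worth keeping in mind: follow-up questions stay inside the segment of the preceding opening question, and any text before the first marker is dropped.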

In [17]:
questions_splits = []
for i, transcript in zip(ids, transcripts):
    questions_splits.append(split_into_questions(transcript))

df = pd.concat([df, pd.DataFrame(questions_splits, index=ids)], axis=1)
df.head()

Unnamed: 0,raw,wut,angst,traurigkeit,freude
34,T: Was bedeutet Traurigkeit für Sie? \nP: Trau...,T: Was bedeutet Wut für Sie? \nP: Richtig wüte...,T: Was bedeutet Angst für Sie? \nP: Angst davo...,T: Was bedeutet Traurigkeit für Sie? \nP: Trau...,T: Was bedeutet Freude für Sie? \nP: Freude is...
60,T: Was bedeutet Traurigkeit für Sie? \nP: Mh e...,T: Was bedeutet Wut für Sie? \nP: Wut? Ähm (.....,T: Was bedeutet Angst für Sie? \nP: Ähm (...) ...,T: Was bedeutet Traurigkeit für Sie? \nP: Mh e...,"T: Was bedeutet Freude für Sie? \nP: Wenn, wen..."
76,T: Was bedeutet Traurigkeit für Sie? \nP: Verl...,T: Was bedeutet Wut für Sie? \nP: Ich bin eige...,T: Was bedeutet Angst für Sie? \nP: Schwierig ...,T: Was bedeutet Traurigkeit für Sie? \nP: Verl...,T: Was bedeutet Freude für Sie? \nP: Freude be...
84,T: Was bedeutet Traurigkeit für Sie? \nP: Geld...,"T: Was bedeutet Wut für Sie? \nP: Wut. Na, da ...",T: Was bedeutet Angst für Sie? \nP: Angst? Ich...,T: Was bedeutet Traurigkeit für Sie? \nP: Geld...,T: Was bedeutet Freude für Sie? \nP: Freude be...
88,"T: Was bedeutet Angst für Sie? \nP: Ja, Angst ...",T: Was bedeutet Wut für Sie? \nP: Da ist es er...,"T: Was bedeutet Angst für Sie? \nP: Ja, Angst ...",T: Was bedeutet Traurigkeit für Sie? \nP: Nur ...,T: Was bedeutet Freude für Sie? \nP: Ich freue...


### Split into turns

In [18]:
def get_patient_turns(text):
    turns = text.split('\n')
    patient_turns = []
    for turn in turns:
        if turn.startswith('P:'):
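            # Slice off the speaker label; str.strip('P: ') would also eat leading or trailing 'P' characters.
            turn = turn[len('P:'):].strip().strip(',').strip('.').strip()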
            if turn:
                patient_turns.append(turn)
        elif not turn.startswith('T:') and turn not in ['', ' ']:
            print(turn)
    return patient_turns
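
A speaker label is a prefix, whereas `str.strip(chars)` removes any leading and trailing characters that belong to a set. The sketch below contrasts the two on a made-up turn; `strip_speaker` is an illustrative helper name.

```python
def strip_speaker(turn, prefix="P:"):
    """Drop a leading speaker label without touching the rest of the turn."""
    if turn.startswith(prefix):
        turn = turn[len(prefix):]
    return turn.strip()

turn = "P: Papa war da: PP"
print(turn.strip("P: "))    # character-set stripping eats into the words
print(strip_speaker(turn))  # prefix removal keeps them intact
```

On Python 3.9 and later, `str.removeprefix` does the same job as the slice.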

In [19]:
for colname in ['wut','angst','traurigkeit','freude']:
    df[colname+'_patient_turns'] = df[colname].apply(lambda x: get_patient_turns(x))
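    # The transcripts are German, so use the German Punkt model instead of the English default.
    df[colname+'_patient_turns_sents'] = df[colname+'_patient_turns'].apply(lambda x: [sent_tokenize(s, language='german') for s in x])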

In [20]:
df.head(2)

Unnamed: 0,raw,wut,angst,traurigkeit,freude,wut_patient_turns,wut_patient_turns_sents,angst_patient_turns,angst_patient_turns_sents,traurigkeit_patient_turns,traurigkeit_patient_turns_sents,freude_patient_turns,freude_patient_turns_sents
34,T: Was bedeutet Traurigkeit für Sie? \nP: Trau...,T: Was bedeutet Wut für Sie? \nP: Richtig wüte...,T: Was bedeutet Angst für Sie? \nP: Angst davo...,T: Was bedeutet Traurigkeit für Sie? \nP: Trau...,T: Was bedeutet Freude für Sie? \nP: Freude is...,"[Richtig wütend bin ich eigentlich nie, trauri...","[[Richtig wütend bin ich eigentlich nie, traur...","[Angst davor, das Leben nicht mehr so fristen ...","[[Angst davor, das Leben nicht mehr so fristen...",[Traurigkeit ist ein ziemlicher Teil von meine...,[[Traurigkeit ist ein ziemlicher Teil von mein...,"[Freude ist etwas, wonach ich immer strebe. Un...","[[Freude ist etwas, wonach ich immer strebe., ..."
60,T: Was bedeutet Traurigkeit für Sie? \nP: Mh e...,T: Was bedeutet Wut für Sie? \nP: Wut? Ähm (.....,T: Was bedeutet Angst für Sie? \nP: Ähm (...) ...,T: Was bedeutet Traurigkeit für Sie? \nP: Mh e...,"T: Was bedeutet Freude für Sie? \nP: Wenn, wen...","[Wut? Ähm (...) Ja, wenn man sich absolut nich...","[[Wut?, Ähm (...) Ja, wenn man sich absolut ni...","[Ähm (...) Naja Angst (...) Ich finde, man man...","[[Ähm (...) Naja Angst (...) Ich finde, man ma...",[Mh ein starker Einschnitt in mein Gefühlslebe...,[[Mh ein starker Einschnitt in mein Gefühlsleb...,"[Wenn, wenn es meiner Tochter gut geht. Wenn e...","[[Wenn, wenn es meiner Tochter gut geht., Wenn..."


In [29]:
csv = df[['wut_patient_turns',
 'angst_patient_turns',
 'traurigkeit_patient_turns',
 'freude_patient_turns']].applymap(lambda x: ' '.join(x)).to_csv(sep='\t', index=False)

with open('turns.tsv', 'w') as f:
  f.write(csv)

## Spacy w2v Vectors

In [None]:
nlp = spacy.load("de_core_news_lg")

In [None]:
doc = nlp('Fällt Ihnen eine Situation ein, in der Sie Angst erlebt haben? Ähm.. NOWORD 1')
print(doc.text)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.lemma_, token.is_oov)

Fällt Ihnen eine Situation ein, in der Sie Angst erlebt haben? Ähm.. NOWORD 1
Fällt VERB ROOT fallen False
Ihnen PRON da ihnen False
eine DET nk ein False
Situation NOUN sb Situation False
ein ADV svp ein False
, PUNCT punct -- False
in ADP mo in False
der PRON nk der False
Sie PRON sb sie False
Angst NOUN oa Angst False
erlebt VERB oc erleben False
haben AUX rc haben False
? PUNCT punct -- False
Ähm NOUN ROOT Ähm False
.. X nk .. False
NOWORD PROPN nk NOWORD True
1 NUM pnc 1 False


In [None]:
oov = set()

def vectorize_sent(text, oov=oov, stopwords=fillers):
    doc = nlp(text)
    oovs = set(token.text for token in doc if (token.text not in stopwords
                                               and token.is_oov
                                               and token.pos_ not in ['PUNCT', 'NUM']))
    oov |= oovs
    return [token.vector for token in doc if ((token.text not in stopwords
                                               and not token.is_oov
                                               and token.pos_ not in ['PUNCT', 'NUM']))]

def vectorize_turn_history(turn_history, stopwords=fillers):
    turns_sentences_word_vectors = [[vectorize_sent(s, stopwords=stopwords) for s in t] for t in turn_history]
    sentences_word_vectors = flatten(turns_sentences_word_vectors)
    sentence_vectors = [np.mean(t, axis=0) for t in sentences_word_vectors if t]
    return sentence_vectors
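
The sentence representation used here is the plain mean of the word vectors that survive filtering. A dependency-free sketch with toy three-dimensional vectors shows the arithmetic; `mean_vector`, `sentence_vector`, and `toy_vectors` are illustrative names, and real spaCy vectors have many more dimensions.

```python
def mean_vector(vectors):
    """Component-wise mean of equally sized vectors (lists of floats)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Toy "embeddings" standing in for token.vector lookups.
toy_vectors = {"wut": [1.0, 0.0, 2.0], "gefühl": [3.0, 2.0, 0.0]}

def sentence_vector(tokens, table):
    """Average the vectors of known tokens; return None if none is known."""
    known = [table[t] for t in tokens if t in table]
    return mean_vector(known) if known else None

print(sentence_vector(["wut", "ist", "gefühl"], toy_vectors))
print(sentence_vector(["ähm"], toy_vectors))
```

In `vectorize_turn_history`, sentences with no usable tokens are dropped and turns are flattened into one sequence. Neighbouring vectors in the result are therefore not always neighbouring sentences in the transcript, and some pairs span a turn boundary.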

In [None]:
for colname in ['wut','angst','traurigkeit','freude']:
    df[colname+'_sent_vectors'] = df[colname+'_patient_turns_sents'].apply(vectorize_turn_history)

In [None]:
def get_text_tokens(text, stopwords=fillers):
    doc = nlp(text)
    return [token.lemma_.lower() for token in doc if ((token.text not in stopwords
                                             and token.pos_ not in ['PUNCT', 'NUM']))]

In [None]:
for colname in ['wut','angst','traurigkeit','freude']:
    df[colname+'_tokens'] = df[colname+'_patient_turns'].apply(lambda x: get_text_tokens(' '.join(x)))

In [None]:
for colname in ['wut','angst','traurigkeit','freude']:
    df[colname+'_tokens_len'] = df[colname+'_tokens'].apply(len)
    df[colname+'_tokens_unique'] = df[colname+'_tokens'].apply(lambda x: len(set(x)))

In [None]:
df['total_tokens_wo_fillers'] = sum([df[colname+'_tokens_len'] for colname in ['wut','angst','traurigkeit','freude']])
df['unique_tokens_wo_fillers'] = sum([df[colname+'_tokens_unique'] for colname in ['wut','angst','traurigkeit','freude']])
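
Note that adding up the per-emotion unique counts counts a lemma once for every emotion it appears in, so the sum can exceed the number of distinct lemmas in the whole interview. A small sketch with made-up token lists shows the difference; whether the summed figure or the union is wanted depends on the analysis.

```python
# Made-up lemma lists for two emotions of one participant.
per_emotion_tokens = {
    "wut": ["ich", "sein", "wütend", "ich"],
    "angst": ["ich", "haben", "angst"],
}

# What summing the *_tokens_unique columns computes:
sum_of_uniques = sum(len(set(toks)) for toks in per_emotion_tokens.values())

# Distinct lemmas across all answers of the participant:
overall_unique = len(set().union(*per_emotion_tokens.values()))

print(sum_of_uniques, overall_unique)
```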

In [None]:
df[['wut_tokens_len',
 'wut_tokens_unique',
 'angst_tokens_len',
 'angst_tokens_unique',
 'traurigkeit_tokens_len',
 'traurigkeit_tokens_unique',
 'freude_tokens_len',
 'freude_tokens_unique',
 'unique_tokens_wo_fillers',
 'total_tokens_wo_fillers']].to_csv('total_tokens_with_stopwords.csv')

In [None]:
df.head(2)

In [None]:
oov2id = defaultdict(list)
for w in oov:
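    # regex=False: OOV items such as '24.9.73' must be matched literally, not as patterns.
    oov2id[df[df['raw'].str.contains(w, regex=False)].index[0]].append(w)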
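
`Series.str.contains` treats its argument as a regular expression by default, which matters for OOV items that contain characters like "." or "(". The sketch below uses only the `re` module to show how a pattern match differs from a literal substring test; the text is made up.

```python
import re

token = "24.9.73"            # an OOV item of the kind found in the transcripts
text = "geboren am 24x9y73"  # made-up text that does NOT contain the token

print(bool(re.search(token, text)))             # pattern match: '.' matches any character
print(token in text)                            # literal substring test
print(bool(re.search(re.escape(token), text)))  # escaped pattern behaves literally
```

In pandas, passing `regex=False` to `str.contains` gives the literal behaviour and also avoids errors from items with unbalanced brackets.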

In [None]:
ids, oovs_for_ids = list(zip(*((k, ', '.join(v)) for k, v in oov2id.items())))

In [None]:
oov_df = pd.DataFrame(index=ids)
oov_df['oov'] = oovs_for_ids

In [None]:
oov_df.to_csv('oov_table.csv')

In [None]:
print('\n'.join(sorted(oov)))

200000.
24.9.73
Abwärtstrip
Aggressionstrieb
Aktienwirtschaft
Allerweltsfall
Altersängste
Angst-Neurose
Angstmoment
Angstneurose
Angstneurose-Psychose
Angstpsychose
Angstsachen
Animalisch
Anstrebenswerte
Antriebsfaktor
Antriebslos
Asylbabys
Auserwähltsein
Baumsegel
Beklemmungsgefühle
Beobachtungsgefühl
Beobachtungsgefühle
Beruhigungstropfen
Bezugsbetreuer
Billigschüssel
Biokisten
Braunüle
Bredouillen
Brustgegend
Chemieraum
Chlorbecken
Dachpost
Damenrolle
Dauersituation
Depressionsphasen
Dienstberatung
Einengendes
Einliterflasche
Empathielosigkeit
Endloskette
Erfolgsereignisse
Erregungsniveau
Eskalieren
Euphorisches
Ex-Beziehung
Ex-Freunden
Farblichkeit
Feiergäste
Fettstoffänderungstabletten
Fluchtinstinkt
Fluchtreaktion
Foltermittel
Fortleitung
Fremdkontrolle
Freudauslöser
Freundschaftsmäßig
Geburtsschmerzen
Gedankenspiralen
Gefühlausbrüchen
Gefühlsliebe
Gestoßen
Gesundheitliches
Glaubenstod
Glückgefühl
Glückmomente
Goldbild
Grundgesetzkonform
Handedrücken
Handlungsalternative
Handlung

## Metrics

In [None]:
def cos_sim(v1, v2):
    return np.inner(v1, v2)/(np.linalg.norm(v1)*np.linalg.norm(v2))
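
As a worked check of the formula, the sketch below computes cosine similarity in plain Python for a few vectors whose result is easy to verify by hand. The zero-vector guard is an addition for illustration; the NumPy version above would divide by zero in that case and produce `nan`.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity of two equal-length vectors; None if either has zero norm."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        return None  # similarity is undefined for a zero vector
    return dot / (norm1 * norm2)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 2.0]))  # orthogonal
print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))  # 24 / 25
print(cosine_similarity([0.0, 0.0], [1.0, 1.0]))  # undefined
```

Cosine distance, where it is needed, is one minus this value.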

### Local Coherence
Let i be the cosine similarity between the averaged word vectors of two neighbouring clauses (`cos_sim` returns a similarity, so higher values mean more coherence).

A plot:
- x axis: number of the current clause
- y axis: i

- the differences between the averages for the groups
- the differences between the averages for the averages of the groups

In [None]:
def get_local_coherence_array(clause_vectors):
    """
    calculates local coherence

    :param clause_vectors: list of np.arrays (vectors) or a np.array of shape (sent_len, emb_size)
    :return: list of float of len sent_len - 1, cosine similarity of each clause to the next one
    """
    local_coherence_array = []
    length = len(clause_vectors)
    for i in range(length-1):
        local_coherence_array.append(cos_sim(clause_vectors[i], clause_vectors[i+1]))
    return local_coherence_array
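
A toy run makes the shape of the result clear: n sentence vectors give n - 1 similarities, and a single sentence gives an empty list. This is a plain-Python sketch with made-up two-dimensional vectors; `cosine` and `local_coherence` are illustrative names mirroring the functions above.

```python
import math

def cosine(v1, v2):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

def local_coherence(sentence_vectors):
    """Similarity of each sentence vector to the next one."""
    return [cosine(a, b) for a, b in zip(sentence_vectors, sentence_vectors[1:])]

toy = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, -1.0]]
scores = local_coherence(toy)
print([round(s, 3) for s in scores])
print(local_coherence(toy[:1]))  # one sentence: nothing to compare
```

Answers with fewer than two usable sentences therefore have no local coherence value, and taking `np.mean` of the empty list yields `nan`.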

In [None]:
for colname in ['wut','angst','traurigkeit','freude']:
    df[f'{colname}_lcoh'] = df[f'{colname}_sent_vectors' ][~pd.isnull(df[f'{colname}_sent_vectors'])].apply(get_local_coherence_array)
    df[f'{colname}_mean_lcoh'] = df[f'{colname}_lcoh'][~pd.isnull(df[f'{colname}_lcoh'])].apply(np.mean)
    df[f'{colname}_std_lcoh'] = df[f'{colname}_lcoh'][~pd.isnull(df[f'{colname}_lcoh'])].apply(np.std)

In [None]:
df.head(2)

Unnamed: 0,raw,wut,angst,traurigkeit,freude,wut_patient_turns,wut_patient_turns_sents,angst_patient_turns,angst_patient_turns_sents,traurigkeit_patient_turns,...,wut_std_lcoh,angst_lcoh,angst_mean_lcoh,angst_std_lcoh,traurigkeit_lcoh,traurigkeit_mean_lcoh,traurigkeit_std_lcoh,freude_lcoh,freude_mean_lcoh,freude_std_lcoh
52,T: Was bedeutet Traurigkeit für Sie? \nP: Sehr...,"T: Was bedeutet Wut für Sie? \nP: Naja, ist ei...",T: Was bedeutet Angst für Sie? \nP: Unbewussth...,T: Was bedeutet Traurigkeit für Sie? \nP: Sehr...,T: Was bedeutet Freude für Sie? \nP: Freude? (...,"[Naja, ist eine sehr heftige, also geballte Em...","[[Naja, ist eine sehr heftige, also geballte E...","[Unbewusstheit bedeutet das, glaube ich, inzwi...","[[Unbewusstheit bedeutet das, glaube ich, inzw...","[Sehr wichtige Emotion für mich. Ähm (...), ic...",...,0.171316,"[0.7017261, 0.47846112, 0.6442868]",0.608158,0.09466,"[0.78157634, 0.84219146, 0.6701517, 0.26192057...",0.644793,0.219183,"[0.29094148, 0.47891936, 0.34767905, 0.2080794...",0.37894,0.201185
12,T: Was bedeutet Angst für Sie? \nP: Angst ist ...,T: Was bedeutet Wut für Sie? \nP: Ärger und Wu...,T: Was bedeutet Angst für Sie? \nP: Angst ist ...,T: Was bedeutet Traurigkeit für Sie? \nP: Trau...,"T: Was bedeutet Freude für Sie? \nP: Freude, j...",[Ärger und Wut habe ich nicht. Aber ich würde ...,"[[Ärger und Wut habe ich nicht., Aber ich würd...",[Angst ist äh. Aus meiner Angst bin ich so ein...,"[[Angst ist äh., Aus meiner Angst bin ich so e...","[Traurigkeit, ähm, Emotion, Emotion Traurigkei...",...,0.234851,"[0.48715672, 0.8672949, 0.40815282, 0.47049156...",0.484985,0.208015,"[0.14959149, 0.788764, 0.01196405, 0.52236325,...",0.546182,0.22812,"[0.63875705, 0.6375287, 0.9454215, 0.7512653, ...",0.564329,0.228487


### Global coherence

Take a weighted average of the word vectors in each clause, then average these clause centroids within one text, then average the per-text averages over all the texts at hand.

- cosine similarity between the average of all the texts and the average of the current text for each participant
- cosine similarity between the average of all the control texts and the average of the current text for each participant
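
In symbols, and following the functions in this section rather than the weighted variant described above: let $s_{p,1}, \dots, s_{p,n_p}$ be the sentence vectors of participant $p$ for one emotion. The notation is introduced here for illustration only. The standard vector pools all sentences of all participants with equal weight, and each sentence is then compared with it:

$$
\bar{s} = \frac{1}{\sum_{p} n_p} \sum_{p} \sum_{i=1}^{n_p} s_{p,i},
\qquad
\mathrm{gcoh}_{p,i} = \frac{s_{p,i} \cdot \bar{s}}{\lVert s_{p,i} \rVert \, \lVert \bar{s} \rVert}
$$

Under this reading, participants with more sentences contribute more to $\bar{s}$, and each participant's own sentences are part of the standard they are compared with.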

In [None]:
def compare_text_to_a_standard_vector(clause_vectors, standard_vector):
    """
    cosine similarity of a given vector and a list of vectors

    :param clause_vectors: list of np.arrays (vectors) or a np.array of shape (sent_len, emb_size)
    :param standard_vector: np.array of int or float of shape (emb_size)
    """
    average_file_vector = np.average(clause_vectors, axis=0)
    return cos_sim(average_file_vector, standard_vector)

In [None]:
def get_standard_vectors(data, column_name):
    """
    gets the average vector across all vectors in a given column, unlisting them

    :param data: pd.DataFrame
    :param column_name: str, column with lists of lists or np.arrays of int or float of shape (sent_len, emb_size)
    :return: np.array of int or float of shape (emb_size)
    """
    column = data[column_name][~pd.isnull(data[column_name])]
    return np.nanmean(np.vstack(flatten(column.values)), axis=0)

In [None]:
def get_global_coherence(clause_vectors, standard_vector):
    """
    calculates global coherence

    :param clause_vectors: list of np.arrays (vectors) or a np.array of shape (sent_len, emb_size)
    :param standard_vector:  np.array of int or float of shape (emb_size)
    :return: list of float of len sent_len, global coherence of each clause
    """
    return [cos_sim(vec, standard_vector) for vec in clause_vectors]
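
The following toy run pools the sentence vectors of two hypothetical participants into one standard vector and scores every sentence against it. It is a plain-Python sketch of the procedure, with made-up two-dimensional vectors and illustrative helper names.

```python
import math

def cosine(v1, v2):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

# Toy sentence vectors for two hypothetical participants.
participants = {
    "p1": [[1.0, 0.0], [1.0, 0.0]],
    "p2": [[0.0, 1.0], [1.0, 1.0]],
}

# Pool every sentence vector of every participant, then average.
pooled = [vec for vecs in participants.values() for vec in vecs]
standard = mean_vector(pooled)

# Score each sentence against the pooled standard vector.
gcoh = {pid: [cosine(v, standard) for v in vecs] for pid, vecs in participants.items()}
print(standard)
print({pid: [round(x, 3) for x in vals] for pid, vals in gcoh.items()})
```

Because the standard leans towards the first axis here, the sentence `[1.0, 1.0]` scores higher than `[0.0, 1.0]`.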

In [None]:
for colname in ['wut','angst','traurigkeit','freude']:
    standard = get_standard_vectors(df, f'{colname}_sent_vectors')
    df[f'{colname}_gcoh'] = df[f'{colname}_sent_vectors'][~pd.isnull(df[f'{colname}_sent_vectors'])].apply(lambda vecs: get_global_coherence(vecs, standard))
    df[f'{colname}_mean_gcoh'] = df[f'{colname}_gcoh'][~pd.isnull(df[f'{colname}_gcoh'])].apply(np.mean)
    df[f'{colname}_std_gcoh'] = df[f'{colname}_gcoh'][~pd.isnull(df[f'{colname}_gcoh'])].apply(np.std)

In [None]:
df.head(2)

Unnamed: 0,raw,wut,angst,traurigkeit,freude,wut_patient_turns,wut_patient_turns_sents,angst_patient_turns,angst_patient_turns_sents,traurigkeit_patient_turns,...,wut_std_gcoh,angst_gcoh,angst_mean_gcoh,angst_std_gcoh,traurigkeit_gcoh,traurigkeit_mean_gcoh,traurigkeit_std_gcoh,freude_gcoh,freude_mean_gcoh,freude_std_gcoh
52,T: Was bedeutet Traurigkeit für Sie? \nP: Sehr...,"T: Was bedeutet Wut für Sie? \nP: Naja, ist ei...",T: Was bedeutet Angst für Sie? \nP: Unbewussth...,T: Was bedeutet Traurigkeit für Sie? \nP: Sehr...,T: Was bedeutet Freude für Sie? \nP: Freude? (...,"[Naja, ist eine sehr heftige, also geballte Em...","[[Naja, ist eine sehr heftige, also geballte E...","[Unbewusstheit bedeutet das, glaube ich, inzwi...","[[Unbewusstheit bedeutet das, glaube ich, inzw...","[Sehr wichtige Emotion für mich. Ähm (...), ic...",...,0.136596,"[0.92125434, 0.7638115, 0.59814364, 0.8714083]",0.788654,0.123837,"[0.8267782, 0.84174126, 0.89066446, 0.6630149,...",0.770828,0.14971,"[0.39391193, 0.83743906, 0.5015149, 0.57806367...",0.56058,0.249865
12,T: Was bedeutet Angst für Sie? \nP: Angst ist ...,T: Was bedeutet Wut für Sie? \nP: Ärger und Wu...,T: Was bedeutet Angst für Sie? \nP: Angst ist ...,T: Was bedeutet Traurigkeit für Sie? \nP: Trau...,"T: Was bedeutet Freude für Sie? \nP: Freude, j...",[Ärger und Wut habe ich nicht. Aber ich würde ...,"[[Ärger und Wut habe ich nicht., Aber ich würd...",[Angst ist äh. Aus meiner Angst bin ich so ein...,"[[Angst ist äh., Aus meiner Angst bin ich so e...","[Traurigkeit, ähm, Emotion, Emotion Traurigkei...",...,0.198651,"[0.51174384, 0.80317664, 0.6626376, 0.7543498,...",0.622313,0.180479,"[0.53814876, 0.45457947, 0.3572675, 0.45147625...",0.656356,0.17199,"[0.8547029, 0.79114085, 0.86506295, 0.80062115...",0.674217,0.189464


In [None]:
df[['wut_mean_lcoh', 'wut_std_lcoh','angst_mean_lcoh', 'angst_std_lcoh',
       'traurigkeit_mean_lcoh', 'traurigkeit_std_lcoh',
       'freude_mean_lcoh', 'freude_std_lcoh', 'wut_mean_gcoh',
       'wut_std_gcoh', 'angst_mean_gcoh', 'angst_std_gcoh',
       'traurigkeit_mean_gcoh', 'traurigkeit_std_gcoh',
       'freude_mean_gcoh', 'freude_std_gcoh']].to_csv('results.csv')

## GloVe

In [None]:
with open('vectors.txt', encoding='utf-8') as f:
    lines = f.readlines()

In [None]:
def process_line(line):
    key, str_value = line.strip().split(' ', 1)
    return key, np.array([float(x) for x in str_value.split()])
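
The parser assumes the usual GloVe text layout: one token per line, followed by space-separated floats. The sketch below parses such lines from an in-memory file and reads them one at a time instead of loading every line first. `parse_glove_line`, `load_glove`, and the `expected_dim` parameter are illustrative names, and the sample lines are made up.

```python
import io

def parse_glove_line(line):
    """Split 'word v1 v2 ...' into (word, list of floats)."""
    word, rest = line.rstrip("\n").split(" ", 1)
    return word, [float(x) for x in rest.split()]

def load_glove(handle, expected_dim=None):
    """Read vectors line by line; skip blank lines and lines of the wrong length."""
    table = {}
    for line in handle:
        if not line.strip():
            continue
        word, vector = parse_glove_line(line)
        if expected_dim is not None and len(vector) != expected_dim:
            continue  # malformed or truncated line
        table[word] = vector
    return table

fake_file = io.StringIO("angst 0.1 -0.2 0.3\nwut 0.5 0.5 0.5\nkaputt 1.0\n")
table = load_glove(fake_file, expected_dim=3)
print(sorted(table))
```

Iterating over the file handle keeps only the parsed vectors in memory, which can help with a large vectors file.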

In [None]:
glove = {}
for line in lines:
    key, value = process_line(line)
    if test_unicode(key)[0]:  # test_unicode returns (ok, offending_chars); a bare tuple is always truthy
        glove[key] = value

In [None]:
glove_oov = set()
def get_glove_vector(word):
    if word in glove:
        return glove[word]
    else:
        glove_oov.add(word)
        return None

In [None]:
get_glove_vector('angst')

array([-1.181350e-01, -7.878900e-02,  2.453920e-01, -3.412700e-02,
        5.615000e-02,  2.576550e-01, -3.381640e-01,  3.186940e-01,
       -6.616310e-01,  6.015050e-01, -1.578570e-01,  9.650700e-02,
        2.641200e-02, -6.702980e-01, -2.797610e-01,  3.632100e-01,
        1.418910e-01, -7.198290e-01,  3.404910e-01, -3.917510e-01,
        2.406550e-01,  8.361600e-02, -3.876580e-01,  2.529030e-01,
       -7.137030e-01, -1.325400e-01,  2.441550e-01,  2.154063e+00,
       -2.910310e-01,  5.262640e-01,  1.985460e-01, -2.016080e-01,
        5.033310e-01,  7.624320e-01, -5.731700e-02, -1.270140e-01,
       -8.508500e-02, -2.884040e-01, -4.987900e-02,  9.060770e-01,
        4.320700e-02, -4.392220e-01, -2.928250e-01, -1.866930e-01,
        6.359000e-02, -2.040010e-01,  4.675400e-02,  2.426800e-02,
       -1.074350e-01, -3.431470e-01,  2.668440e-01,  3.125020e-01,
       -1.773730e-01, -5.293330e-01,  1.405660e-01,  9.177200e-02,
        4.482180e-01, -9.321000e-03,  3.683500e-02, -6.769000e

In [None]:
get_glove_vector('Angst')

In [None]:
def vectorize_sent_glove(text, oov=glove_oov):
    # Look up each non-punctuation, non-numeric token by its lowercased lemma.
    # Misses are recorded in `oov` under the same form that was looked up.
    lemmas = [token.lemma_.lower() for token in nlp(text) if token.pos_ not in ('PUNCT', 'NUM')]
    oov |= {lemma for lemma in lemmas if lemma not in glove}
    return [glove[lemma] for lemma in lemmas if lemma in glove]

def vectorize_turn_history_glove(turn_history):
    turns_sentences_word_vectors = [[vectorize_sent_glove(s) for s in t] for t in turn_history]
    sentences_word_vectors = flatten(turns_sentences_word_vectors)
    sentence_vectors = [np.mean(t, axis=0) for t in sentences_word_vectors if t]
    return sentence_vectors
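
To illustrate the pooling step, here is a minimal self-contained sketch with made-up 3-dimensional vectors: each sentence becomes the mean of its word vectors, and a sentence whose tokens were all out of vocabulary (an empty list) is dropped, just as the `if t` filter does. `toy_vectors` and `pool_sentences` are hypothetical names used only here.

```python
import numpy as np

# Invented toy vocabulary; the real GloVe vectors have 300 dimensions.
toy_vectors = {
    'angst': np.array([1.0, 0.0, 0.0]),
    'gross': np.array([0.0, 1.0, 0.0]),
}

def pool_sentences(sentences):
    """Average the known word vectors of each tokenized sentence."""
    pooled = []
    for tokens in sentences:
        known = [toy_vectors[t] for t in tokens if t in toy_vectors]
        if known:  # skip sentences with no in-vocabulary tokens
            pooled.append(np.mean(known, axis=0))
    return pooled

# Three sentences go in; the middle one has no known tokens and is dropped.
sentence_vectors = pool_sentences([['angst', 'gross'], ['äh'], ['angst']])
```

One consequence is that the number of sentence vectors can be smaller than the number of transcript sentences, so positions in the coherence arrays do not map one-to-one onto sentence indices.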

In [None]:
vectorize_turn_history_glove(df['freude_patient_turns_sents'][0])[0].size

300

In [None]:
for colname in ['wut','angst','traurigkeit','freude']:
    df[colname+'_glove_sent_vectors'] = df[colname+'_patient_turns_sents'].apply(vectorize_turn_history_glove)

In [None]:
for colname in ['wut','angst','traurigkeit','freude']:
    sent_vectors = df[f'{colname}_glove_sent_vectors'].dropna()
    df[f'{colname}_glove_lcoh'] = sent_vectors.apply(get_local_coherence_array)
    df[f'{colname}_glove_mean_lcoh'] = df[f'{colname}_glove_lcoh'].dropna().apply(np.mean)
    df[f'{colname}_glove_std_lcoh'] = df[f'{colname}_glove_lcoh'].dropna().apply(np.std)

In [None]:
for colname in ['wut','angst','traurigkeit','freude']:
    standard = get_standard_vectors(df, f'{colname}_glove_sent_vectors')
    sent_vectors = df[f'{colname}_glove_sent_vectors'].dropna()
    df[f'{colname}_glove_gcoh'] = sent_vectors.apply(lambda vecs: get_global_coherence(vecs, standard))
    df[f'{colname}_glove_mean_gcoh'] = df[f'{colname}_glove_gcoh'].dropna().apply(np.mean)
    df[f'{colname}_glove_std_gcoh'] = df[f'{colname}_glove_gcoh'].dropna().apply(np.std)

In [None]:
df[['wut_glove_mean_lcoh', 'wut_glove_std_lcoh','angst_glove_mean_lcoh', 'angst_glove_std_lcoh',
       'traurigkeit_glove_mean_lcoh', 'traurigkeit_glove_std_lcoh',
       'freude_glove_mean_lcoh', 'freude_glove_std_lcoh', 'wut_glove_mean_gcoh',
       'wut_glove_std_gcoh', 'angst_glove_mean_gcoh', 'angst_glove_std_gcoh',
       'traurigkeit_glove_mean_gcoh', 'traurigkeit_glove_std_gcoh',
       'freude_glove_mean_gcoh', 'freude_glove_std_gcoh']].to_csv('glove_results.csv')

In [None]:
df.to_pickle('net.pkl')

In [None]:
df.describe()

Unnamed: 0,wut_mean_lcoh,wut_std_lcoh,angst_mean_lcoh,angst_std_lcoh,traurigkeit_mean_lcoh,traurigkeit_std_lcoh,freude_mean_lcoh,freude_std_lcoh,wut_mean_gcoh,wut_std_gcoh,...,freude_glove_mean_lcoh,freude_glove_std_lcoh,wut_glove_mean_gcoh,wut_glove_std_gcoh,angst_glove_mean_gcoh,angst_glove_std_gcoh,traurigkeit_glove_mean_gcoh,traurigkeit_glove_std_gcoh,freude_glove_mean_gcoh,freude_glove_std_gcoh
count,97.0,97.0,97.0,97.0,97.0,97.0,97.0,97.0,97.0,97.0,...,97.0,97.0,97.0,97.0,97.0,97.0,97.0,97.0,97.0,97.0
mean,0.580473,0.192661,0.573417,0.192877,0.546968,0.200989,0.55007,0.196787,0.719882,0.168106,...,0.774933,0.154846,0.86524,0.141837,0.863309,0.135806,0.846877,0.16486,0.864543,0.12693
std,0.105261,0.05853,0.102678,0.0629,0.105857,0.061933,0.123672,0.076333,0.078689,0.058515,...,0.108912,0.102802,0.062822,0.097478,0.071308,0.099949,0.074051,0.104779,0.065757,0.092712
min,0.209633,0.061162,0.225543,0.0,0.261755,0.014662,0.191178,0.000505,0.472199,0.060339,...,0.308797,0.01627,0.646079,0.016888,0.51557,0.019618,0.551581,0.013683,0.612007,0.015719
25%,0.529594,0.14918,0.52744,0.145447,0.500348,0.161008,0.457174,0.151975,0.679563,0.119899,...,0.728565,0.071266,0.835118,0.051109,0.836276,0.056111,0.808246,0.082371,0.843655,0.052279
50%,0.580113,0.188688,0.591546,0.191503,0.550391,0.205462,0.560988,0.197187,0.72459,0.168448,...,0.787381,0.144048,0.871962,0.131446,0.882323,0.107207,0.863933,0.139888,0.87571,0.105458
75%,0.635499,0.234851,0.648067,0.235477,0.61849,0.237954,0.640006,0.237145,0.772117,0.19552,...,0.851212,0.216673,0.912399,0.195774,0.907371,0.193759,0.899286,0.251892,0.905779,0.168264
max,0.778247,0.351956,0.751157,0.332566,0.771459,0.413998,0.780602,0.453238,0.850421,0.357959,...,0.926419,0.469865,0.950118,0.402712,0.947764,0.502517,0.958429,0.404647,0.945116,0.43837
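
To put the two sets of scores side by side, a sketch like the following could compute per-emotion gaps between the GloVe-based coherence means and the earlier (non-GloVe) ones. It assumes the column naming used above (`{emotion}_mean_lcoh` vs. `{emotion}_glove_mean_lcoh`). `coherence_gap` is a hypothetical helper, and the small frame is invented for illustration, not taken from the results.

```python
import pandas as pd

def coherence_gap(frame, emotions, measure='lcoh'):
    """Compare GloVe-based and earlier mean coherence columns per emotion."""
    rows = {}
    for emo in emotions:
        base = frame[f'{emo}_mean_{measure}']
        glove_col = frame[f'{emo}_glove_mean_{measure}']
        rows[emo] = {
            'base_mean': base.mean(),
            'glove_mean': glove_col.mean(),
            'mean_diff': (glove_col - base).mean(),  # paired, per-row difference
        }
    return pd.DataFrame.from_dict(rows, orient='index')

# Invented two-row frame that only mimics the column layout.
toy = pd.DataFrame({
    'wut_mean_lcoh': [0.5, 0.6],
    'wut_glove_mean_lcoh': [0.7, 0.9],
})
gap = coherence_gap(toy, ['wut'])
```

On the notebook's own data the call would be `coherence_gap(df, ['wut', 'angst', 'traurigkeit', 'freude'])`, with `measure='gcoh'` for the global scores.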
