# IMPLEMENTATION OF TEXT MINING METHODS 
The present notebook contains all code that was used to apply text mining methods on the glossary terms. This is the heart piece of the implementation. Please read the `README` first to make sure that the code will work on your device. 

**Important**: Due to the notation that was used in prior project regarding the glossary, we will in many cases use the term `pro` to refer to the corpus of climate supporters and the term `con(tra)` to refer to the corpus of climate skeptics. This notation is also used throughout the R notebooks. Accordingly, variable names containing `pro` or `con` are used to indicate that this variable exists for each of the corpora

## Requirements to run this Notebook

Please make sure that the following libraries are installed to be able to run all code of the notebook

In [1835]:
#!pip3 install wn
#!pip3 install spacy
#!python -m spacy download de_core_news_md
#!pip3 install networkx
#!pip3 install tabulate
#!pip3 install germansentiment
#!pip3 install -U numpy
#!pip3 install -U textblob-de

In [2]:
### IMPORT REQUIREMENTS
import pandas as pd
import numpy as np
import re

from matplotlib.pyplot import figure
import matplotlib.pyplot as plt
from ast import literal_eval
from tabulate import tabulate
from pathlib import Path
from collections import Counter
from nltk.metrics import distance as dist
from textblob_de import TextBlobDE as TextBlob

from germansentiment import SentimentModel
model = SentimentModel()

### IMPORT SPACY 
import spacy
from spacy import displacy
nlp = spacy.load('de_core_news_md')


### IMPORT WORDNET 
import wn
import wn.taxonomy
#wn.download('odenet')
#wn.download('oewn:2021')

# assign German and English WordNet source 
en = wn.Wordnet('oewn:2021')
de = wn.Wordnet('odenet:1.4')

from wn.similarity import path



# 0. Load Files

The file `compounds.csv` is the output of the `preprocessing.ipynb` file. The file is loaded here into a pandas data frame and columns are preprocessed such that they have the correct format for the upcoming pieces of code. These are the unchanged files that were originally used to run the upcoming text mining methods. They only contain simplistic information and the output of the corpus-based methods from R.

In [1762]:
# load compounds file (output of preprocessing.ipynb)
compounds = pd.read_csv("../files/compounds_info.csv")

# load context files (i.e. output of corpus-based methods notebook)
pro_context = pd.read_csv("../../R/output/pro_context.csv")
con_context = pd.read_csv("../../R/output/con_context.csv")

Or to load the final files that are composed in the course of this notebook (to avoid re-running all code) please run the following code:
- `knowledge_base` = the complete knowledge base containing all information about the compounds
- `pro_info`= knowledge base of methods that were applied to context of the climate supporters corpus
- `con_info`= knowledge base of methods that were applied to context of the climate skeptics corpus

In [3]:
# load final knowledge base 
knowledge_base = pd.read_csv("../output/knowledge_base.csv")

# load context files (i.e. output of corpus-based methods notebook in R)
pro_info = pd.read_csv("../output/pro_info.csv")
con_info = pd.read_csv("../output/con_info.csv")

# 1. Preprocessing

## 1.1 Convert Literals
For most of the `csv` files the literals are not evaluated correctly, i.e. columns containing a list of strings is evaluated as a string when loading them into Python using `pandas` (`read_csv`). This issue is addressed by applying the `literal_eval` function of the `ast` library to the columns for which we have the problem.

In [6]:
compounds['noun_forms'] = compounds.noun_forms.apply(lambda x: literal_eval(str(x)))
compounds['compound_forms'] = compounds.compound_forms.apply(lambda x: literal_eval(str(x)))

This may also be the case for the following columns if the final knowledge base was loaded into the notebook

In [7]:
compounds['compound_forms'] = compounds.compound_forms.apply(lambda x: literal_eval(str(x)))
compounds['related_words'] = compounds.related_words.apply(lambda x: literal_eval(str(x)))
compounds['hypernyms'] = compounds.hypernyms.apply(lambda x: literal_eval(str(x)))
compounds['en_hypernyms'] = compounds.en_hypernyms.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
compounds['definition'] = compounds.definition.apply(lambda x: literal_eval(str(x)))
compounds['PERS_pro'] = compounds.PERS_pro.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
compounds['PERS_con'] = compounds.PERS_con.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
compounds['ORG_pro'] = compounds.ORG_pro.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
compounds['ORG_con'] = compounds.ORG_con.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
compounds['similar_words'] = compounds.similar_words.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
compounds['pro_mods'] = compounds.pro_mods.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
compounds['con_mods'] = compounds.con_mods.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
compounds["PERS_pro"] = compounds.PERS_pro.apply(lambda x: Counter(x) if(str(x) != 'nan') else x)
compounds["PERS_con"] = compounds.PERS_con.apply(lambda x: Counter(x) if(str(x) != 'nan') else x)
compounds["ORG_pro"] = compounds.ORG_pro.apply(lambda x: Counter(x) if(str(x) != 'nan') else x)
compounds["ORG_con"] = compounds.ORG_con.apply(lambda x: Counter(x) if(str(x) != 'nan') else x)
compounds["pro_attr"] = compounds.pro_attr.apply(lambda x: "".join(literal_eval(str(x))) if(str(x) != 'nan') else x)
compounds["con_attr"] = compounds.con_attr.apply(lambda x: "".join(literal_eval(str(x))) if(str(x) != 'nan') else x)
compounds["pro_colls"] = compounds.pro_colls.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else list())
compounds["con_colls"] = compounds.con_colls.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else list())

In [15]:
pro_context['entities'] = pro_context.entities.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
pro_context['persons'] = pro_context.persons.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
pro_context['organisations'] = pro_context.organisations.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
pro_context['PERS'] = pro_context.PERS.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
pro_context['ORG'] = pro_context.ORG.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
pro_context['dependencies'] = pro_context.dependencies.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
pro_context['modifiers'] = pro_context.modifiers.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)

con_context['entities'] = con_context.entities.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
con_context['persons'] = con_context.persons.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
con_context['organisations'] = con_context.organisations.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
con_context['PERS'] = con_context.PERS.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
con_context['ORG'] = con_context.ORG.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
con_context['dependencies'] = con_context.dependencies.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)
con_context['modifiers'] = con_context.modifiers.apply(lambda x: literal_eval(str(x)) if(str(x) != 'nan') else x)

## 1.2 Reduce Concordances
The context that we retrieved via the concordances (`kwic`) in R unfortunately could only be performed on regex level. Accordingly also more complex forms of the compounds are contained in out data frame. E.g. for the compound *Klimagerechtigkeit* also concordances containing the key word *Klimagerechtigkeitspolitik* were included and for the compound *Klimaalarm* also key word phrases containin the adjective form *klimaalarmistisch* were retrieved. We want to get rid of the rows not containing **exact matches** of our compound word forms. 

Accordingly, we search for the rows that contain one of the compound forms and retrieve those columns for the data frames which we will then use for the upcoming analyses. 

To do this, we first get a list of the compound word forms from our `compounds` data frame.  

In [3118]:
compound_forms = compounds.compound_forms.tolist()
compound_forms = [item for sublist in compound_forms for item in sublist]

Then we apply the following function from `pandas` to check whether a string is contained in the row and to retrieve the according rows. 

In [3145]:
# retrieve all columns which contain a compound form
pro_context = pro_context[pro_context.keyword.str.contains(" |".join(compound_forms), case=False).groupby(level=0).any()]
con_context = con_context[con_context.keyword.str.contains(" |".join(compound_forms), case=False).groupby(level=0).any()]

# 2. Working with WordNet
In this section we will apply various function from the `WordNet` library to the `compounds` data frame, i.e. the list of compound words and word forms.

## 2.1 Exploring Hierarchical Structures (Hypernyms)

We create multiple functions which we need to retrieve the synset and additional more information of the noun from the `WordNet` library. 

In [206]:
def get_synset(string):
    
    """
    Returns the WordNet synsets of the input string.
    Arg: 
        string: a noun.
    Returns: 
        The synsets of the noun if available, else None.
    """
    
    try: 
        word = de.synsets(string, pos="n") # retrieve synsets for the string
        return word # return synsets
    except:
        return
    
def get_lemmas(string):
    """
    Returns the WordNet lemmas of a string.
    Arg: 
        string: a noun.
    Returns: 
        A list of lemmas (i.e. related words) of the noun if available, else None.
    """
    
    lemmas = [] # initiate empty list
    
    try:
        # for each synset
        for s in get_synset(string):
            lemmas.append(s.lemmas()) # retrieve lemmas and append to list
 
        return list(set([x for l in lemmas for x in l])) # return flattened list of lemmas 
    
    except:
        return
    
    
def get_hypernyms(string):
    
    """
    Returns the WordNet hypernyms of a string.
    Arg: 
        string: a noun.
    Returns: 
        A list of hypernyms of the noun if available, else empty list.
    """
       
    hypers = [] # initiate empty list
    
    try:
        # for each synset
        for s in get_synset(string):
            hypernyms = s.hypernyms() # retrieve hypernyms
            
            # for each hypernym
            for el in hypernyms:
                hypers.append(el.lemmas()) # retrieve lemmas
                
    except:
        pass
    
    return list(set([y for x in hypers for y in x])) # return flattened list of hypernyms 
      

# for each compound save lemmas (as related words), hypernyms, definition to compound dataframe 
compounds['related_words'] = compounds.second_part.apply(get_lemmas)
compounds['hypernyms'] = compounds.second_part.apply(get_hypernyms)

To visualize the hierarchical tree structure of the WordNet knowledge base, we will have a quick look at the following example. Here we see the hypernyms of the word "Betrug" (en: "fraud"):

In [204]:
# retrieve synsets
synsets = de.synsets('Betrug', pos='n')

# for each synset
p = 1
for s in synsets: 
    
    # for each hypernym path
    for path in wn.taxonomy.hypernym_paths(s):
        print("\nPath", p)
        for i, ss in enumerate(path):
            
            # print synset ID and lemma
            print(' ' * i, ss, ss.lemmas()[0]) 
        p += 1


Path 1
 Synset('odenet-10880-n') krimineller Akt
  Synset('odenet-15937-n') Frevel
   Synset('odenet-6999-n') Topf
    Synset('odenet-9850-n') Vermögen
     Synset('odenet-4667-n') Vermögen
      Synset('odenet-10390-n') Liegenschaft

Path 2
 Synset('odenet-10880-n') krimineller Akt
  Synset('odenet-5502-n') Handlung

Path 3
 Synset('odenet-8872-n') Rauheit
  Synset('odenet-25840-n') Unglück


Additionally, to retrieve the hypernym which is closest to the root of the knowledge base, we will use the following code:

In [2640]:
def get_roots(string):
    
    """
    Returns the German WordNet roots of a string.
    Arg: 
        string: a noun.
    Returns: 
        A list of root concepts of the noun if available, else empty list.
    """
    
    roots = [] # initiate empty list
    synsets = get_synset(string) # retrieve synsets

    # for each synset
    for s in synsets: 
        
        # for each hypernym path
        for path in wn.taxonomy.hypernym_paths(s):
            
            # retrieve root (i.e. last element of path)
            roots.append(path[-1].lemmas())
            
    # return flattened list of root hypernyms
    return [y for x in roots for y in x]

# apply function
compounds['roots'] = compounds.second_part.apply(get_roots)

To be able to specify whether the concept of the compound describes an action or a person, we will retrieve the english hypernym paths for each compound word. For this, we translate the synset and retrieve the hypernym paths as we did before for the German WordNet lexicon. Instead of only retrieving the root concept (which is always `entity` for the English lexicon), we retrieve the complete path and check for key words that give us the desired information.

In [2641]:
def get_en_hypernyms(string):
    
    """
    Returns the English WordNet hypernym paths of a string.
    Arg: 
        string: a noun.
    Returns: 
        A list of hypernyms of the noun if available, else NaN.
    """

    try:
        roots = [] # initiate empty list
        synsets = get_synset(string)[0].translate(lexicon='oewn:2021') # get English version of synset 
    
        # for each synset
        for s in synsets: 
            
            # for each hypernym path
            for path in wn.taxonomy.hypernym_paths(s):
                
                # retrieve lemmas of hypernyms and append to list 
                roots.append(x.lemmas() for x in path)
                
        # flatten list
        roots = [z for x in roots for y in x for z in y]
        
        return roots # return list
    
    # except no translation is available
    except:
        
        return np.nan # then return NaN

# apply function
compounds['en_hypernyms'] = compounds.second_part.apply(get_en_hypernyms)

For instance, let's retrieve the English hypernym paths for the following two words: "Betrüger" (person) and "Betrug" (action).

In [2642]:
word1 = "Betrüger"
word2 = "Betrug"

print(word1, "- Hypernym Paths:")
# for each hypernym path of the synset
for path in wn.taxonomy.hypernym_paths(get_synset(word1)[0].translate(lexicon='oewn:2021')[0]):
    for i, ss in enumerate(path):
        print(' ' * i, ss, ss.lemmas()[0]) # print synset and lemma 
    
print("\n")
print("_"*50)
print("\n")

print(word2, "- Hypernym Paths:")
# for each hypernym path of the synset
for path in wn.taxonomy.hypernym_paths(get_synset(word2)[0].translate(lexicon='oewn:2021')[0]):
    for i, ss in enumerate(path):
        print(' ' * i, ss, ss.lemmas()[0]) # print synset and lemma 

Betrüger - Hypernym Paths:
 Synset('oewn-09974494-n') chiseler
  Synset('oewn-10017621-n') slicker
   Synset('oewn-09657157-n') offender
    Synset('oewn-09851208-n') bad person
     Synset('oewn-00007846-n') soul
      Synset('oewn-00004475-n') being
       Synset('oewn-00004258-n') animate thing
        Synset('oewn-00003553-n') unit
         Synset('oewn-00002684-n') physical object
          Synset('oewn-00001930-n') physical entity
           Synset('oewn-00001740-n') entity
 Synset('oewn-09974494-n') chiseler
  Synset('oewn-10017621-n') slicker
   Synset('oewn-09657157-n') offender
    Synset('oewn-09851208-n') bad person
     Synset('oewn-00007846-n') soul
      Synset('oewn-00004475-n') being
       Synset('oewn-00007347-n') cause
        Synset('oewn-00001930-n') physical entity
         Synset('oewn-00001740-n') entity


__________________________________________________


Betrug - Hypernym Paths:
 Synset('oewn-00770581-n') fraud
  Synset('oewn-00767761-n') criminal offence
 

In the paths we can see the following key words for the two concepts:
- **Betrüger**: person, soul
- **Betrug**: activity, human action

Also for other concepts we identify the following hypernyms:
- **Vernunft**: psychological feature
- **Staat**: people, grouping

We finally decide for the *four* main categories which we want to use to specify the concepts by looking for the following key words:
- Abstraction: rational motive, motive, state, psychological feature, attribute,
                phenomenon, process, cause, physical object, abstract entity, artifact
- Person: person, soul, image, spiritual being, ideal
- Action: activity, human action, wrongdoing
- Group: grouping, people
- Location: location, area

Accordingly, we will create lists of those key words and specify the concept in the upcoming lines and save the information to a new column `concept`.

In [3166]:
# initiate concepts
abstraction = ["rational motive", "motive", "state", "psychological feature", "attribute",
                "phenomenon", "process", "cause", "physical object", "abstract entity", "artifact"]
person = ["person", "soul",  "image", "spiritual being", "ideal"] 
action = ["activity", "human action", "wrongdoing"]
group = ["grouping", "people"]
location = ["location", "area"]

In [3167]:
def specify_concept(hypernyms):
    
    """
    Returns the concept label for each compound word.
    Arg: 
        hypernyms: a list of hypernyms.
    Returns: 
        A concept label (either "person" or "action") for the compound word, else NaN.
    """
    
    try:
        
        # if there is at least one common element in list of hypernyms and list of action key words
        if len(set(location).intersection(set(hypernyms))) > 0:
            return "location" # label as location
        
        # if there is at least one common element in list of hypernyms and list of group key words    
        elif len(set(group).intersection(set(hypernyms))) > 0:
            return "group" # label as group
        
        # if there is at least one common element in list of hypernyms and list of action key words
        elif len(set(action).intersection(set(hypernyms))) > 0:
            return "action" # label as action
       
        # if there is at least one common element in list of hypernyms and list of person key words    
        elif len(set(person).intersection(set(hypernyms))) > 0:
            return "person" # label as person    
    
        # if there is at least one common element in list of hypernyms and list of abstraction key words    
        elif len(set(abstraction).intersection(set(hypernyms))) > 0:
            return "abstraction" # label as abstraction
    
        
        else:
            return(np.nan)
        
    # if there is no comparison available (e.g. bc. no hypernym list is available for the row)
    except:
        return np.nan # return NaN

In [3168]:
# apply function to en_hypernyms column
compounds['concept'] = compounds.en_hypernyms.apply(specify_concept)

# manually change the value of "Milliardär" since 
compounds.loc[(compounds.second_part == 'milliardär'),'concept']='person'

Next, we will save those columns for which we could not identify a concept to a csv file to manually add the concept of those compound words.

In [3185]:
# gather all columns that did not receive a concept
concept_manual = compounds[compounds['concept'].isna()][["original", "concept"]]

#concept_manual.to_csv("../evaluation/concept_manual.csv", index=False)

With the automatic method we could specify concepts for 80.65% of the compounds.

In [3186]:
100 - (len(concept_manual)/248)*100

80.24193548387098

After the manual annotation of the remaining 48 concepts, we load the table back into Python and merge the concepts to the final knowledeg base.

In [3187]:
# load annotated table
concept_annotated = pd.read_csv("../evaluation/concept_manual.csv", sep =";")

# reset index and update values
compounds = compounds.set_index('original')
concept_annotated = concept_annotated.set_index('original')
compounds.update(concept_annotated)
compounds.reset_index(inplace=True)

## 2.2 Definitions
In the following piece of code we retrieve the definitions of the synsets of a word, retrieved from the German `OdeNet` distribution of `WordNet`. (Note: This information will not be used in the final definition phrasing part, please see the paper for an explanation)

In [2655]:
def get_definition(string):

    """
    Returns the WordNet definitions of a string.
    Arg: 
        string: a noun.
    Returns: 
        A list of definitions of the noun if available, else None.
    """
    definition = [] # initiate empty list
    
    try:
        # for each synset
        for s in get_synset(string):
            definition.append(s.definition()) # retrieve definition and append to list
 
        return definition # return list of definitions
    
    except:
        return 
    
# create and save definition to new column in data frame
compounds['definition'] = compounds.second_part.apply(get_definition)

## 2.3 Similarity Measures
Via the `WordNet` library there are multiple options to compute the similarity of two input concepts. For our case we used the following two ways of retrieving a similarity score. 
Here, a score of 1 means that words are very similar and 0 indicating that words are not similar at all.

In the following example we can see how the two options differ in scoring for the same combination of words. While the `PATH` similarity score for **Betrug** and **Verbrechen** is 0.5, the `WUP` score is 0.9 - indicating a higher relatedness of the two concepts than the first method.

In [2656]:
# PATH similarity measure
# 1 is being "very similar", 0 is "not similar" (i.e. no connection in wn)
wn.similarity.path(get_synset("Betrug")[0], get_synset("Verbrechen")[0])

0.5

In [2657]:
# WUP similarity measure 
wn.similarity.wup(get_synset("Betrug")[0], get_synset("Verbrechen")[0], True)

0.9230769230769231

The retrieval of the common synset path results in the following: 

In [2658]:
# get list of common paths
paths = [list(reversed([get_synset("Betrug")[0]] + p)) for p in get_synset("Verbrechen")[0].hypernym_paths()]
print("Path of 'Betrug' and 'Verbrechen':\n")

# for each hypernym in the path
for el in paths:
    for item in el:
        print(item.lemmas()[0]) # print the lemma of the hypernym

Path of 'Betrug' and 'Verbrechen':

Handlung
Treulosigkeit
Liegenschaft
Vermögen
Vermögen
Topf
Frevel
Treulosigkeit


To illustrate the hypernym paths and the according path similarity scores for each edge between the nodes please see the following output:

In [2659]:
# initiate words that we want to compare 
word1 = "Betrug"
word2 = "Verbrechen"

# retrieve synset
synset1 = get_synset(word1)[0]
p = 1 # initiate path count

# for each path
for path in wn.taxonomy.hypernym_paths(synset1):
    print("\nPath",p, word1) 
    for i, ss in enumerate(path):
        print(' ' * i, ss, ss.lemmas()[0]) # print synsets and lemmas
        print(" " * i, "path:", wn.similarity.wup(word, ss)) # retrieve path similarity
    p = p+1
    
print("_"*45)    
synset2 = get_synset(word2)[0]
p = 1 # initiate path count

# for each path
for path in wn.taxonomy.hypernym_paths(synset2):
    print("\nPath",p, word2)
    for i, ss in enumerate(path):
        print(' ' * i, ss, ss.lemmas()[0]) # print synsets and lemmas
        print(" " * i, "path:", wn.similarity.wup(word, ss)) # retrieve path similarity
    p = p+1


Path 1 Betrug
 Synset('odenet-10880-n') krimineller Akt
 path: 0.9230769230769231
  Synset('odenet-15937-n') Frevel
  path: 0.8333333333333334
   Synset('odenet-6999-n') Topf
   path: 0.7272727272727273
    Synset('odenet-9850-n') Vermögen
    path: 0.6
     Synset('odenet-4667-n') Vermögen
     path: 0.4444444444444444
      Synset('odenet-10390-n') Liegenschaft
      path: 0.25

Path 2 Betrug
 Synset('odenet-10880-n') krimineller Akt
 path: 0.9230769230769231
  Synset('odenet-5502-n') Handlung
  path: 0.5
_____________________________________________

Path 1 Verbrechen
 Synset('odenet-5502-n') Handlung
 path: 0.5

Path 2 Verbrechen
 Synset('odenet-15937-n') Frevel
 path: 0.8333333333333334
  Synset('odenet-6999-n') Topf
  path: 0.7272727272727273
   Synset('odenet-9850-n') Vermögen
   path: 0.6
    Synset('odenet-4667-n') Vermögen
    path: 0.4444444444444444
     Synset('odenet-10390-n') Liegenschaft
     path: 0.25


In [2660]:
print("Common hypernyms of 'Betrug' and 'Verbrechen': \n")

for x in sorted(get_synset("betrug")[0].common_hypernyms(get_synset("verbrechen")[0])):
    print(x.lemmas()[0])

Common hypernyms of 'Betrug' and 'Verbrechen': 

Liegenschaft
krimineller Akt
Frevel
Vermögen
Handlung
Topf
Vermögen


The following code retrieves the **Lowest Common Hypernym** of both concepts.

In [207]:
# retrieve lowest common hypernym 
lowest_common = get_synset("Betrug")[0].lowest_common_hypernyms(get_synset("Verbrechen")[0])

for s in lowest_common:
    print(s.lemmas()[0]) # print lemma of hypernym

krimineller Akt


To work with similarity scores on our `compounds` data frame we first initiate a list of the nouns of our data frame to have a closer look at. Furthermore we initiate two empty data frames to compute and save the similarity scores. For this, we create a matrix containing all nouns as columns and as rows. The values then constitute the similarity scores for each combination of nouns. 

Then we run the `WUP` function and the `PATH` function on our nouns and save the computed similarity scores to the new data frames.

In [2662]:
# create list of nouns from compound words to work with for similarity measures
nouns = compounds.second_part.tolist()

# create data frame (matrix like) with all nouns as columns and rows to compute similarity 
nouns_wup = pd.DataFrame(index = nouns, columns = nouns)
nouns_sim = pd.DataFrame(index = nouns, columns = nouns)

In [2663]:
# to fill dataframe with similarity scores ("wup" function)

# iterate over columns and rows
for w in nouns:
    for ww in nouns:
        try: 
            wn_w = get_synset(w)[0] # retrieve synset information for column word and for row word
            wn_ww = get_synset(ww)[0]
            sim = wn.similarity.wup(wn_w, wn_ww, True) # compute similarity score
            nouns_wup[ww].loc[w] = sim # change value in cell
        except:
            nouns_wup[ww].loc[w] = np.NaN # if there is no score, return "None"
            
            
# round all values to 3 decimals 
nouns_wup = nouns_wup.round(3)

# to fill dataframe with similarity scores ("path" function)

# iterate over columns and rows
for w in nouns:
    for ww in nouns:
        try: 
            wn_w = get_synset(w)[0] # retrieve synset information for column word and for row word
            wn_ww = get_synset(ww)[0]
            sim = wn.similarity.path(wn_w, wn_ww, True) # compute similarity score
            nouns_sim[ww].loc[w] = sim # change value in cell
        except:
            nouns_sim[ww].loc[w] = np.NaN # if there is no score, return "None"

# round all values to 3 decimals 
nouns_sim = nouns_sim.round(3)

Our matrix-like data frames now look as follows: 

In [2664]:
nouns_sim.head()

Unnamed: 0,abzockerei,aktivismus,aktivist,aktivistin,alarm,alarmist,anbeter,apokalypse,apokalyptiker,apostel,...,zar,zerrüttung,zerstörer,zerstörung,zeugs,zipfel,zirkus,zunft,zwang,überhitzung
abzockerei,1.0,,0.142857,,0.2,,0.166667,0.2,,0.25,...,0.25,0.2,0.25,0.25,0.142857,0.25,0.125,0.166667,0.25,
aktivismus,,,,,,,,,,,...,,,,,,,,,,
aktivist,0.142857,,1.0,,0.142857,,0.166667,0.142857,,0.166667,...,0.166667,0.142857,0.166667,0.166667,0.111111,0.166667,0.1,0.125,0.166667,
aktivistin,,,,,,,,,,,...,,,,,,,,,,
alarm,0.2,,0.142857,,1.0,,0.166667,0.2,,0.25,...,0.25,0.2,0.25,0.25,0.142857,0.25,0.125,0.166667,0.25,


In [2665]:
nouns_wup.head()

Unnamed: 0,abzockerei,aktivismus,aktivist,aktivistin,alarm,alarmist,anbeter,apokalypse,apokalyptiker,apostel,...,zar,zerrüttung,zerstörer,zerstörung,zeugs,zipfel,zirkus,zunft,zwang,überhitzung
abzockerei,1.0,,0.25,,0.333333,,0.285714,0.333333,,0.4,...,0.4,0.333333,0.4,0.4,0.25,0.4,0.222222,0.285714,0.4,
aktivismus,,,,,,,,,,,...,,,,,,,,,,
aktivist,0.25,,1.0,,0.25,,0.285714,0.25,,0.285714,...,0.285714,0.25,0.285714,0.285714,0.2,0.285714,0.181818,0.222222,0.285714,
aktivistin,,,,,,,,,,,...,,,,,,,,,,
alarm,0.333333,,0.25,,1.0,,0.285714,0.333333,,0.4,...,0.4,0.333333,0.4,0.4,0.25,0.4,0.222222,0.285714,0.4,


In [2666]:
# save to csv file 
nouns_sim.to_csv("../output/nouns_sim.csv")
nouns_wup.to_csv("../output/nouns_wup.csv")

#### Retrieve Words

In a next step, we gather all cells with similarity scores higher than 0.5. The word combinations having a score higher than 0.5 are then saved to a dictionary from which we then create a new column in our original `compounds` data frame.

In [2667]:
# create a dictionary with keys from 
wup_sim_dict = dict.fromkeys(nouns)

# for each key look up similar words and save to dict 
for key, values in wup_sim_dict.items():
    
    # update values such that the list of "very similar" words is saved for each key 
    words = nouns_wup.index[nouns_wup[key] > 0.5].tolist()
    
    if key in words:
        words.remove(key) # remove the key word from the values list (since they always get score 1.0)
        
    wup_sim_dict[key] = words # save words to according dictionary key

# do for both data frames (similarity measures)
sim_dict = dict.fromkeys(nouns)

for key, values in sim_dict.items():
    
    # update values such that the list of "very similar" words is saved for each key 
    words = nouns_sim.index[nouns_sim[key] > 0.5].tolist()
    
    if key in words:
        words.remove(key) # remove the key word from the values list (since they always get score 1.0)
    
    sim_dict[key] = words # save words to according dictionary key
    
# add information to compounds data frame 
compounds['path'] = compounds.second_part.map(sim_dict)
compounds['wup'] = compounds.second_part.map(wup_sim_dict)

## 2.4 Stemming of second part nouns 
Stemming is performed using the following stemmers provided via the `nltk` library: `cistem`, `porter`, `lancaster`, `snowball`

We save all stemming results to our data frame to be able to combine the information of the output of all stemmers. 

In [2669]:
# load libraries
from nltk.stem.cistem import Cistem
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem.snowball import SnowballStemmer

# initiate stemmers 
cistem = Cistem(case_insensitive=True)
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("german")

# the following functions all take a string as an input and return the according stem for each stemmer 
def stem_cistem(string):
    return cistem.stem(string)

def stem_porter(string):
    return porter.stem(string)

def stem_lancaster(string):
    return lancaster.stem(string)

def stem_snowball(string):
    return snowball.stem(string)

# apply stemmers to dataframe 
compounds['stem_cistem'] = compounds.second_part.apply(stem_cistem)
compounds['stem_porter'] = compounds.second_part.apply(stem_porter)
compounds['stem_lancaster'] = compounds.second_part.apply(stem_lancaster)
compounds['stem_snowball'] = compounds.second_part.apply(stem_snowball)

After having applied each stemmer to our data frame, we check the output of each stemmer for duplicates. I.e. are there any compounds giving us the same stem? We conclude these compounds to be semantically related to each other in some way and therefore create a list of words with the same stem and save this list for each word to a new column `share_stemmer`. Since this is being done for each stemmer, we obtain four new columns.

In [2670]:
# for each stemming column check whether there are same entries of words that could be connected
# get duplicates of stem column
cistem_stem_words = compounds[compounds.duplicated(['stem_cistem'])].stem_cistem.tolist()
# create list of duplicate words 
cistem_duplicates = compounds[pd.DataFrame(compounds.stem_cistem.tolist()).isin(cistem_stem_words).any(1).values].second_part.tolist()

# get duplicates of stem column
porter_stem_words = compounds[compounds.duplicated(['stem_porter'])].stem_porter.tolist()
# create list of duplicate words 
porter_duplicates = compounds[pd.DataFrame(compounds.stem_porter.tolist()).isin(porter_stem_words).any(1).values].second_part.tolist()

# get duplicates of stem column
lancaster_stem_words = compounds[compounds.duplicated(['stem_lancaster'])].stem_lancaster.tolist()
# create list of duplicate words 
lancaster_duplicates = compounds[pd.DataFrame(compounds.stem_lancaster.tolist()).isin(lancaster_stem_words).any(1).values].second_part.tolist()

# get duplicates of stem column
snowball_stem_words = compounds[compounds.duplicated(['stem_snowball'])].stem_snowball.tolist()
# create list of duplicate words 
snowball_duplicates = compounds[pd.DataFrame(compounds.stem_snowball.tolist()).isin(snowball_stem_words).any(1).values].second_part.tolist()

# create new column and add common stem words there (do for every stemmer)

# cistem stemmer
compounds["share_cistem"] = np.NaN

# for each stem word that has duplicate in the data frame
for stem in cistem_stem_words:
    # for compound that has a shared stem with another compound
    idx = compounds.index[compounds['stem_cistem'] == stem]
    for i in idx:
        # get list of words that share stem
        share = compounds.second_part[compounds['stem_cistem'] == stem].tolist()
        # save that list to the new column
        compounds.share_cistem.loc[i] = share

# porter stemmer
compounds["share_porter"] = np.NaN
for stem in porter_stem_words:
    idx = compounds.index[compounds['stem_porter'] == stem]
    for i in idx:
        share = compounds.second_part[compounds['stem_porter'] == stem].tolist()
        compounds.share_porter.loc[i] = share

# lancaster stemmer
compounds["share_lancaster"] = np.NaN
for stem in lancaster_stem_words:
    idx = compounds.index[compounds['stem_lancaster'] == stem]
    for i in idx:
        share = compounds.second_part[compounds['stem_lancaster'] == stem].tolist()
        compounds.share_lancaster.loc[i] = share

# snowball stemmer
compounds["share_snowball"] = np.NaN
for stem in snowball_stem_words:
    idx = compounds.index[compounds['stem_snowball'] == stem]
    for i in idx:
        share = compounds.second_part[compounds['stem_snowball'] == stem].tolist()
        compounds.share_snowball.loc[i] = share

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


This is how the new columns look like .

In [2671]:
compounds

Unnamed: 0,original,second_part,noun_forms,lemma,genus,compound_forms,related_words,hypernyms,roots,en_hypernyms,...,path,wup,stem_cistem,stem_porter,stem_lancaster,stem_snowball,share_cistem,share_porter,share_lancaster,share_snowball
0,klimaabzockerei,abzockerei,"[abzockerei, abzockereien]",abzockerei,f,"[klimaabzockerei, klimaabzockereien]","[Abzocke, Geldschneiderei, Profitmacherei, Beu...","[Zinssatz, Zinsfuß]","[Zinssatz, Zinsfuß]","[robbery, stealing, thieving, theft, larceny, ...",...,[],[],abzockerei,abzockerei,abzockere,abzockerei,,,,
1,klimaaktivismus,aktivismus,[aktivismus],aktivismus,m,[klimaaktivismus],[],[],[],,...,[],[],aktivismu,aktivismu,aktivism,aktivismus,,,,
2,klimaaktivist,aktivist,"[aktivisten, aktivist]",aktivist,m,"[klimaaktivisten, klimaaktivist]","[aktiver Mitarbeiter, Aktivist, politisch akti...","[Volksvertreter, Politiker]","[jemand, irgendjemand, jeder beliebige]","[reformer, meliorist, crusader, social reforme...",...,[],"[demagoge, macher]",aktivi,aktivist,akt,aktivist,,,,
3,klimaaktivistin,aktivistin,"[aktivistinnen, aktivistin]",aktivistin,f,"[klimaaktivistinnen, klimaaktivistin]",[],[],[],,...,[],[],aktivisti,aktivistin,aktivistin,aktivistin,,,,
4,klimaalarm,alarm,"[alarms, alarm, alarmen, alarme, alarmes]",alarm,m,"[klimaalarms, klimaalarm, klimaalarmen, klimaa...","[Alarm, Notruf, Alarmruf, Warnton, Warnsignal,...","[Gunst, Geneigtheit, Wohlwollen, Gewogenheit, ...","[Gunst, Wohlwollen, Geneigtheit, Zugewandtheit...","[fear, fearfulness, fright, emotion, feeling, ...",...,[],[],alarm,alarm,alarm,alarm,,,"[alarm, alarmist]",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
243,klimazipfel,zipfel,"[zipfeln, zipfel, zipfels]",zipfel,m,"[klimazipfeln, klimazipfel, klimazipfels]","[bestes Stück, Zipfel, Schwengel, Stößel, Schn...",[],[],"[cylinder, round shape, form, shape, attribute...",...,[],[],zipfel,zipfel,zipfel,zipfel,,,,
244,klimazirkus,zirkus,"[zirkusse, zirkus, zirkussen, zirkusses]",zirkus,m,"[klimazirkusse, klimazirkus, klimazirkussen, k...","[Herumlärmen, Gelärme, Lärmerei, Rumlärmen, Zi...","[vorweisen, vorzeigen]","[Versuch, Vorsatz, Unternehmung, Rastlosigkeit...","[bowl, sports stadium, stadium, arena, constru...",...,[],"[kabarett, show]",zirku,zirku,zirk,zirkus,,,,
245,klimazunft,zunft,"[zunft, zünfte, zünften]",zunft,f,"[klimazunft, klimazünfte, klimazünften]","[Gewerbe, Gilde, Innung, Gewerk, Zunft, Amt, B...","[Arbeitsgemeinschaft, Arbeitskreis, Arbeitsgru...","[Gestaltung, Erreichung, Realisierung, Verwirk...",[],...,[],[],zunf,zunft,zunft,zunft,,,,
246,klimazwang,zwang,"[zwange, zwängen, zwangs, zwanges, zwang, zwänge]",zwang,m,"[klimazwange, klimazwängen, klimazwangs, klima...","[Erpressung, Nötigung, Bedingung, Auflage, Res...","[Befehlssatz, Befehlsvorrat, Befehlsrepertoire]","[Gruppierung, Clusterung, Bündelung]","[regulating, regulation, control, activity, hu...",...,[],[],zwang,zwang,zwang,zwang,,,,


## 2.5 Compute Distance
Additional to the stems, we now compute the distance of two strings by meanings of their letters. The `Jaro Distance` computes the similarity between two strings and is offered by `nltk`. An example output for the Lancaster Stemmer is shown below. Since the Lancaster Stemmer also puts the words **Professor** and **Presse** into relation, we retrospectively discard it for this procedure.

In [2673]:
# Compute Jaro Distance of stems 

dist_cistem = [] # initiate empty list 

# iterate over all rows in columns
for w in compounds.stem_cistem:
    for ww in compounds.stem_cistem:
        
        # compute distance score for all possible combinations
        dist = distance.jaro_similarity(w, ww) 
        
        # if distance score is between 0.87 and 1.0 (indicating exact same string)
        if dist >= 0.87 and dist < 1.0:
            # retrieve complete words 
            word1 = compounds[compounds.stem_cistem == w].second_part.values[0]
            word2 = compounds[compounds.stem_cistem == ww].second_part.values[0]
            dist_cistem.append([word1, word2]) # and save to list
            
# Do this for all stem columns 
dist_porter = []
for w in compounds.stem_porter:
    for ww in compounds.stem_porter:
        dist = distance.jaro_similarity(w, ww)
        if dist >= 0.87 and dist < 1.0:
            word1 = compounds[compounds.stem_porter == w].second_part.values[0]
            word2 = compounds[compounds.stem_porter == ww].second_part.values[0]
            dist_porter.append([word1, word2])
                   
dist_snowball = []
for w in compounds.stem_snowball:
    for ww in compounds.stem_snowball:
        dist = distance.jaro_similarity(w, ww)
        if dist >= 0.87 and dist < 1.0:
            word1 = compounds[compounds.stem_snowball == w].second_part.values[0]
            word2 = compounds[compounds.stem_snowball == ww].second_part.values[0]
            dist_snowball.append([word1, word2])
            
            
# to exclude: some combinations are not as accurate   
# Presse - Professor should not be connected 
dist_lancaster = []
for w in compounds.stem_lancaster:
    for ww in compounds.stem_lancaster:
        dist = distance.jaro_similarity(w, ww)
        if dist >= 0.87 and dist < 1.0:
            word1 = compounds[compounds.stem_lancaster == w].second_part.values[0]
            word2 = compounds[compounds.stem_lancaster == ww].second_part.values[0]
            print(word1,"=>", word2, "| Score:", dist)
            dist_lancaster.append([word1, word2])
            
            
# combine findings
dist_stemmer = dist_cistem + dist_porter + dist_snowball

# get unique values 
dist_stemmer = [list(x) for x in {tuple(x) for x in dist_stemmer}]

apokalypse => apokalyptiker | Score: 0.872053872053872
apokalyptiker => apokalypse | Score: 0.872053872053872
betrug => betrüger | Score: 0.888888888888889
betrüger => betrug | Score: 0.888888888888889
freund => freundin | Score: 0.9166666666666666
freundin => freund | Score: 0.9166666666666666
gerechtigkeit => ungerechtigkeit | Score: 0.9555555555555555
konfusion => konsens | Score: 0.8888888888888888
konsens => konfusion | Score: 0.8888888888888888
leugner => leugnung | Score: 0.875
leugnung => leugner | Score: 0.875
lüge => lügner | Score: 0.8888888888888888
lügner => lüge | Score: 0.8888888888888888
notfall => notlage | Score: 0.8888888888888888
notlage => notfall | Score: 0.8888888888888888
presse => professor | Score: 0.9047619047619048
professor => presse | Score: 0.9047619047619048
propaganda => propagandafilm | Score: 0.8809523809523809
propagandafilm => propaganda | Score: 0.8809523809523809
propagandafilm => propaganda | Score: 0.8809523809523809
propaganda => propagandafilm

After having retrieved the string similarity scores, we save the scores to the `compounds` data frame.

In [2674]:
# apply list of distances to data frame 
compounds["dist_stemmer"] = np.NaN

# for each word combination
for w in dist_stemmer:
    # retrieve index of first element 
    idx = compounds.index[compounds['second_part'] == w[0]][0]
    # save both elements to new column "dist_stemmer"
    compounds.dist_stemmer.loc[idx] = w

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


## 2.6 Combine Related Words
In a next step, we combine the information we just retrieved by looking for *stem duplicates* with the information we got from the *path similarity* and the computation of *distance*. This leads us to a list of **similar words** which are either similar based on the similarity score or on the stem of the word. This information is saved to a new column `similar_words`


In [2675]:
# merge the following columns into new column "similar_words"

# map the columns
mapping = {"wup": "similar_words",
           "share_cistem": "similar_words",
           "share_porter": "similar_words",
           "share_lancaster": "similar_words",
           "dist_stemmer": "similar_words"}

# and save to new column
compounds["similar_words"] = compounds.groupby(mapping, axis=1).sum()

In [2676]:
compounds.head()

Unnamed: 0,original,second_part,noun_forms,lemma,genus,compound_forms,related_words,hypernyms,roots,en_hypernyms,...,stem_cistem,stem_porter,stem_lancaster,stem_snowball,share_cistem,share_porter,share_lancaster,share_snowball,dist_stemmer,similar_words
0,klimaabzockerei,abzockerei,"[abzockerei, abzockereien]",abzockerei,f,"[klimaabzockerei, klimaabzockereien]","[Abzocke, Geldschneiderei, Profitmacherei, Beu...","[Zinssatz, Zinsfuß]","[Zinssatz, Zinsfuß]","[robbery, stealing, thieving, theft, larceny, ...",...,abzockerei,abzockerei,abzockere,abzockerei,,,,,,[]
1,klimaaktivismus,aktivismus,[aktivismus],aktivismus,m,[klimaaktivismus],[],[],[],,...,aktivismu,aktivismu,aktivism,aktivismus,,,,,"[aktivismus, aktivist]","[aktivismus, aktivist]"
2,klimaaktivist,aktivist,"[aktivisten, aktivist]",aktivist,m,"[klimaaktivisten, klimaaktivist]","[aktiver Mitarbeiter, Aktivist, politisch akti...","[Volksvertreter, Politiker]","[jemand, irgendjemand, jeder beliebige]","[reformer, meliorist, crusader, social reforme...",...,aktivi,aktivist,akt,aktivist,,,,,"[aktivist, aktivismus]","[demagoge, macher, aktivist, aktivismus]"
3,klimaaktivistin,aktivistin,"[aktivistinnen, aktivistin]",aktivistin,f,"[klimaaktivistinnen, klimaaktivistin]",[],[],[],,...,aktivisti,aktivistin,aktivistin,aktivistin,,,,,"[aktivistin, aktivist]","[aktivistin, aktivist]"
4,klimaalarm,alarm,"[alarms, alarm, alarmen, alarme, alarmes]",alarm,m,"[klimaalarms, klimaalarm, klimaalarmen, klimaa...","[Alarm, Notruf, Alarmruf, Warnton, Warnsignal,...","[Gunst, Geneigtheit, Wohlwollen, Gewogenheit, ...","[Gunst, Wohlwollen, Geneigtheit, Zugewandtheit...","[fear, fearfulness, fright, emotion, feeling, ...",...,alarm,alarm,alarm,alarm,,,"[alarm, alarmist]",,"[alarm, alarmist]","[alarm, alarmist, alarm, alarmist]"


# 3. Named Entity Recognition
In this section we will work with the context data frames that we created by retrieving the concordances in R, i.e. `pro_context` and `con_context` to extract entities that are used in the context of the compound words.

They contain, inter alia, the following columns:
- `pre`: contains up to 5 sentences that appear to the left of the key word phrase
- `keyword`: the sentence containing the key word
- `post`: contains up to 5 sentences that appear to the right of the key word phrase
- `pattern`: the key word 

The extraction of named entities seeks to identify entities such as **persons** and **organisations** from the input text. 

## 3.1 Preprocessing

In the following, we preprocess the csv files we retrieved from R. The function `get_full_text` takes a data frame as its input and combines the strings contained in the three columns `pre`, `keyword` and `post` to generate the full text in which a key word (in our case the compound word) is found. 
(Additionally we reset and drop the index.)

In [1804]:
# function to get full text (combination of pre, keyword and post column) for each keyword, i.e. compound word 

def get_full_text(df):
    
    """
    Retrieves the full text preceding and following the keyword phrase and saves full text to new column.
    Arg: 
        df: the data frame containing the keyword-in-context information (columns "pre", "keyword" and "post").
    Returns: 
        The original data frame with a new column "full" containing the combined text columns. 
    """
    
    df = df.replace(np.nan,'',regex=True)
    
    # create full text for each column
    df["full"] = df["pre"] + df["keyword"] + df["post"]
    
    # convert full text column to string
    df.full = df.full.astype(str)
    
    return df  

In [1805]:
# apply function to both data frames 
pro_context = get_full_text(pro_context)
con_context = get_full_text(con_context)

## 3.2 Retrieve Entities
The following functions use spaCys Named Entity Recognition pipelines to retrieve the entities for our data frame. Additionally we create columns for `Persons` and `Organisations`.

In [1806]:
# functions to retrieve information regarding named entities

def get_ner(text):
    
    """
    This function retrieves entities from an input text.
    Arg: 
        text: a string.
    Returns: 
        The entities for the given input string. 
    """
    
    doc = nlp(text) # create spaCy nlp element
    entities = [] # create empty list
    try: 
        for ent in doc.ents: # iterate over entities in nlp element
            entities.append([ent.text, ent.label_]) # and save word and according label as list
    except:
        entities.append("None")
        
    return entities # return list of entities

def get_persons(text):
    
    """
    This function retrieves entities with the label "PER" from an input text.
    Arg: 
        text: a string.
    Returns: 
        The "PER" entities for the given input string. 
    """
    
    persons = [] # create empty list
    for label in text: # for each text-label pair 
        if label[1] == "PER": # if label is "PERSON"
            persons.append(label[0]) # save text to list
            
    return set(persons) # return a set of the list to remove duplicates 

def get_organisations(text):
    
    """
    This function retrieves entities with the label "ORG" from an input text.
    Arg: 
        text: a string.
    Returns: 
        The "ORG" entities for the given input string. 
    """
    
    organisations = [] # create empty list
    for label in text: # for each text-label pair 
        if label[1] == "ORG": # if label is "ORGANISATION"
            organisations.append(label[0]) # save text to list
            
    return set(organisations) # return a set of the list to remove duplicates 

In [1807]:
# apply functions to data frames - note: the following lines take a while to run
pro_context["entities"] = pro_context.full.apply(get_ner)
con_context["entities"] = con_context.full.apply(get_ner)

pro_context["persons"] = pro_context.entities.apply(get_persons)
con_context["persons"] = con_context.entities.apply(get_persons)

pro_context["organisations"] = pro_context.entities.apply(get_organisations)
con_context["organisations"] = con_context.entities.apply(get_organisations)

## 3.3 Cleaning 

To evaluate and clean the entities that were retrieved via `spacy` we create a list of unique **persons** and **organisations** from both data frames and manually evaluate those lists. 

In [1809]:
### PERSONS
# retrieve persons from both data frames
pro_persons = pro_context.persons.apply(list).tolist()
con_persons = con_context.persons.apply(list).tolist()
persons = pro_persons + con_persons # concatenate lists 
persons = [pers for sublist in persons for pers in sublist] # flatten nested list
persons = [s.strip('.') for s in persons] # clean list (remove "." symbol from beginning of string)
persons = set(persons) # get list of unique persons

### ORGANISATIONS
# retrieve organizations from both data frames
pro_org = pro_context.organisations.apply(list).tolist()
con_org = con_context.organisations.apply(list).tolist()
organisations = pro_org + con_org # concatenate lists 
organisations = [org for sublist in organisations for org in sublist] # flatten nested list
organisations = [s.strip('.') for s in organisations] # clean list (remove "." symbol from beginning of string)
organisations = set(organisations) # get list of unique persons

In [1810]:
# save lists to text files 

# open file in write mode
with open('../evaluation/persons.txt', 'w') as file:
    for item in persons:
        # write each item on a new line
        file.write("%s\n" % item)

# open file in write mode
with open('../evaluation/organisations.txt', 'w') as file:
    for item in organisations:
        # write each item on a new line
        file.write("%s\n" % item)

Afterwards, we load the cleaned lists back into Python and apply those to our data frames to make sure we only keep the cleaned entities. For this, we compare the list of organisations and persons with the list of cleaned entities and their spellings. This gives us a final list of entities (in a unified spelling) for each occurrence of the compound.

In [2140]:
# load files back into python and preprocess tables 
persons_cleaned = pd.read_csv("../evaluation/persons_cleaned.csv", sep=";")
# create new column for each spelling
persons_cleaned.spellings = persons_cleaned.spellings.replace(np.nan, " ").str.split(", ") 
persons_cleaned = persons_cleaned.explode('spellings')

organisations_cleaned = pd.read_csv("../evaluation/organisations_cleaned.csv", sep=";")
# create new column for each spelling
organisations_cleaned.spellings = organisations_cleaned.spellings.replace(np.nan, " ").str.split(", ")
organisations_cleaned = organisations_cleaned.explode('spellings')

In [2145]:
def clean_organisations(df):
        
    """
    This function compares the organisation entities with a cleaned list of organisations and retrieves the full names.
    Arg: 
        df: a data frame containing ORG entities 
    Returns: 
        A list of cleaned ORG entities for the data frame. 
    """
    
    # initiate empty list
    ORG = []
    
    # for each list of "ORG" entities in the data frame
    for row in df.organisations:
        
        # initiate empty list
        org = []
        
        # for each entitiy in the row
        for entity in row:

            # replace double whitespaces by single one and strip whitespaces from left and right end
            entity_cleaned = entity.replace("  ", " ").strip()
            
            # if entity matches a full name in our cleaned list
            if entity_cleaned in organisations_cleaned.organisation.tolist():
                full_name = entity_cleaned # entity equal the full name of the person
                
            # if entity matches one of the spellings in cleaned list
            elif entity_cleaned in organisations_cleaned.spellings.tolist():
                # retrieve the full name from cleaned list
                full_name = organisations_cleaned[organisations_cleaned.spellings == entity_cleaned].organisation.values[0]
                
            
            else:
                full_name = " " # return empty string

      
            # append the full name to the list for each row 
            org.append(full_name)
            
            # remove the empty strings 
            org = [x for x in org if x != " "]

        # append the list of full names of each row to the final list
        ORG.append(set(org))

    # return full list of names 
    return ORG

In [2161]:
# apply to data frame 
pro_context["ORG"] = clean_organisations(pro_context)
con_context["ORG"] = clean_organisations(con_context)

In [2557]:
# retrieve PERSON entities

def clean_persons(df):
        
    """
    This function compares the person entities with a cleaned list of persons and retrieves the full names.
    Arg: 
        df: a data frame containing PER entities 
    Returns: 
        A list of cleaned PER entities for the data frame. 
    """

    # initiate empty list
    PERS = []

    # for each list of "PERS" entities in the data frame
    for row in df.persons:
        
        # initiate empty list
        pers = []
        
        # for each entitiy in the row
        for entity in row:

            # normalize string: replace full stop symbols 
            entity_cleaned = entity.replace("."," ")
            
            # replace double whitespaces by single one and strip whitespaces from left and right end
            entity_cleaned = entity_cleaned.replace("  ", " ").strip()
            
            # remove digits
            entity_cleaned = ''.join((x for x in entity_cleaned if not x.isdigit()))

            # if entity matches a full name in our cleaned list
            if entity_cleaned in persons_cleaned.full.tolist():
                full_name = entity_cleaned # entity equal the full name of the person

            # if entity matches a last name in our cleaned list
            elif entity_cleaned in persons_cleaned.last_name.tolist():
                # retrieve the full name from cleaned list
                full_name = persons_cleaned[persons_cleaned.last_name == entity_cleaned].full.values[0]

            # if entity matches one of the spellings in cleaned list
            elif entity_cleaned in persons_cleaned.spellings.tolist():
                # retrieve the full name from cleaned list
                full_name = persons_cleaned[persons_cleaned.spellings == entity_cleaned].full.values[0]
                
            # if entity is in potential genitive form (has an "s" at the end), e.g. "Greta Thunbergs"
            # strip the s from the end of the string and check again for matches to full names 
            elif entity_cleaned.rstrip("s") in persons_cleaned.full.tolist():
                full_name = entity_cleaned.strip("s")

            # if entity is in potential genitive form (has an "s" at the end), e.g. "Greta Thunbergs"
            # strip the s from the end of the string and check again for matches to last names
            elif entity_cleaned.rstrip("s") in persons_cleaned.last_name.tolist():
                # retrieve the full name from cleaned list
                full_name = persons_cleaned[persons_cleaned.last_name == (entity_cleaned.strip("s"))].full.values[0]

            # if no match is found
            else:
                full_name = " " # return empty string

            # append the full name to the list for each row 
            pers.append(full_name)
            
            # remove the empty strings 
            pers = [x for x in pers if x != " "]

        # append the list of full names of each row to the final list
        PERS.append(set(pers))

    # return full list of names 
    return PERS

In [2558]:
# apply to data frame 
pro_context["PERS"] = clean_persons(pro_context)
con_context["PERS"] = clean_persons(con_context)

Save the final information we need for the definitions to the `compounds` data frame.

In [2566]:
# replace empty values with nan
pro_context.PERS = pro_context.PERS.apply(lambda y: np.nan if len(y)==0 else y)
pro_context.ORG = pro_context.ORG.apply(lambda y: np.nan if len(y)==0 else y)
# convert values into list
pro_context.PERS = pro_context.PERS.map(list, na_action='ignore')
pro_context.ORG = pro_context.ORG.map(list, na_action='ignore')

# retrieve dictionary with PERSONS and ORGANISATIONS with each compound as a key and entities as values
pro_PERS = pro_context[pro_context.PERS.isna() == False].groupby(by=["pattern"], dropna=True)["PERS"].apply(list).to_dict()
pro_ORG = pro_context[pro_context.ORG.isna() == False].groupby(by=["pattern"], dropna=True)["ORG"].apply(list).to_dict()

# retrieve dictionary with PERSONS and ORGANISATIONS with each compound as a key and entities as values
con_PERS = con_context[con_context.PERS.isna() == False].groupby(by=["pattern"], dropna=True)["PERS"].apply(list).to_dict()
con_ORG = con_context[con_context.ORG.isna() == False].groupby(by=["pattern"], dropna=True)["ORG"].apply(list).to_dict()

# add information to compound data frame 
compounds['PERS_pro']= compounds['original'].map(pro_PERS)
compounds['ORG_pro']= compounds['original'].map(pro_ORG)

compounds['PERS_con']= compounds['original'].map(con_PERS)
compounds['ORG_con']= compounds['original'].map(con_ORG)

With the following function we flatten the nested lists of entities for the final knowledge base.

In [3031]:
def flatten_list(nested_list):
    
    """
    This function flattens a nested list. 
    Arg: 
        nested_list: a nested list of the format [[item1],[item2]].
    Returns: 
        A flattened version of the nested list, e.g. [item1,item2], else NaN.
     """       
    
    try:
        return [item for sublist in nested_list for item in sublist]
    except:
        return np.nan

In [3032]:
# flatten entity lists 
compounds["PERS_pro"] = compounds.PERS_pro.apply(flatten_list)
compounds["PERS_con"] = compounds.PERS_con.apply(flatten_list)
compounds["ORG_pro"] = compounds.ORG_pro.apply(flatten_list)
compounds["ORG_con"] = compounds.ORG_con.apply(flatten_list)

## 3.4 Visualization 
In the following we will have a look at the specific procedure that is done by `spacys` NER pipeline. To visualize an example, let's have a look at the following sentence that we retrieved from the pro corpus: 

**"Dank Greta und FFF ist endlich Bewegung in den Stillstand bei der Klimarettung gekommen."**

In [1823]:
#ex1 = "Gerrit Hansen von der Klimaaktivistengruppe Germanwatch."
ex2 = "Dank Greta und FFF ist endlich Bewegung in den Stillstand bei der Klimarettung gekommen."

In [1824]:
doc = nlp(ex2) # create spacy nlp object

# for each entity
for e in doc.ents:
    # print text, start and end character, label
    print(e.text, e.start_char, e.end_char, e.label_)

Greta 5 10 PER
FFF 15 18 ORG


In [1825]:
print(f"{'Token':{23}} {'BIO Tag':{15}} {'Entity':{20}} ")
# for each token
for token in doc:
    # print BIO tag and entity information
    print(f"{token.text:{25}} {token.ent_iob_:{13}}   {token.ent_type_:{20}} ")

Token                   BIO Tag         Entity               
Dank                      O                                    
Greta                     B               PER                  
und                       O                                    
FFF                       B               ORG                  
ist                       O                                    
endlich                   O                                    
Bewegung                  O                                    
in                        O                                    
den                       O                                    
Stillstand                O                                    
bei                       O                                    
der                       O                                    
Klimarettung              O                                    
gekommen                  O                                    
.                         O               

In [1826]:
# visualize the entities
displacy.render(doc, style="ent", jupyter=True)

# 4. Dependency Parsing
In this section we apply dependency parsing ot the context of the compounds to potentially obtain modifiers that further specify the compound words (in other words, they are dependent on the compound words).

For a full list of German/English dependency labels please see:

https://github.com/explosion/spaCy/blob/master/spacy/glossary.py  
https://www.ims.uni-stuttgart.de/documents/ressourcen/korpora/tiger-corpus/annotation/tiger_scheme-syntax.pdf

## 4.1 Retrieve Dependencies

Lets retrieve dependency information with the `spacy` library for our context data frames. 

The function `get_dependencies` retrieves all modifiers recursively to also check for cases such as found in: 

**"Über den weltweit bekanntesten (und wohl aggressivsten) Klimaaktivisten Bill McKibben"**

We want *weltweit*, *bekanntesten*, *wohl* and *aggressivsten* to be found as well. Therefore the following function takes heads which are modifiers of the compound and looks recursively for their dependents too. 

In [3544]:
def get_dependencies(df):
    
    """
    This function recursively retrieves the dependencies of words being dependent of a compound word. 
    Arg: 
        df: a data frame containing the compound words for which we want to count the POS tags of the dependent words.
    Returns: 
        A nested list consisting of the compound words as keys and the dependency information, 
        i.e. the dependent word, the POS tag of the dependent word, the dependency tag.
    """   
    
    #deps = dict.fromkeys(set(df.pattern)) # initiate a dictionary with the compounds as keys
    deps = []
    mods = ["mo", "mnr", "nk"] # the list of dependency tags we are interested in
    pos = ["ADJA", "ADJD", "PAV", "PROAV", "PDAT", "PIAT", "PIDAT",
          "PPOSAT", "PRELAT", "PTKA", "PWAT", "PWAV", "ADJ", "ADP", "ADV"] # the list of POS tags we are interested in
    
    # iterate over rows of data frame 
    for index, row in df.iterrows():
        
        doc = nlp(row["keyword"]) # retrieve sentence with keyword for according row
        
        tok_deps = [] # initiate empty list of dependencies for that row
        
        # for each token in the sentence 
        for token in doc:
        
            # check for words being dependent on our compound word, i.e. the compound is the head 
            if str(token.head.text).lower().startswith(row["pattern"].lower()):
                
                # if the token is a modifier (i.e. token has one of the modifier tags and one of the POS tags that we defined above)
                if token.dep_ in mods and token.tag_ in pos:
                        
                    # append information (lemma, token, pos tag, dependency tag, head word) to dependency list
                    tok_deps.append([token.lemma_, token.text, token.tag_, token.dep_, token.head.text])
                    
                    # use the new found word as new head 
                    new_head = token.text
                    
                    # check if we have words that are dependent on our new head
                    for token in doc:
                        
                        # if yes
                        if token.head.text == new_head:
                            
                            # and if the token is a modifier (i.e. token has one of the modifier tags and one of the POS tags that we defined above)
                            if token.dep_ in mods and token.tag_ in pos:
                                
                                # append it to our list
                                tok_deps.append([token.lemma_, token.text, token.tag_, token.dep_, token.head.text])


            # if we have a conjunct of one of the modifiers and another modifier
            if token.head.text == "und" and token.tag_ in pos and token.dep_ == "cj":
                
                # append to list
                tok_deps.append([token.lemma_, token.text, token.tag_, token.dep_, token.head.text])
                
                # use token as new head 
                next_head = token.text
                
                # check for dependent words 
                for token in doc:
                    
                    # if yes
                    if token.head.text == next_head: 
                        
                        # and if the token is a modifier (i.e. token has one of the modifier tags and one of the POS tags that we defined above)
                        if token.dep_ in mods and token.tag_ in pos:
                                
                            # append information to list
                            tok_deps.append([token.lemma_, token.text, token.tag_, token.dep_, token.head.text])

            # if none of the options apply, move on
            else:
                pass
                
        # append to final list 
        deps.append(tok_deps)
        
    # return final list with dependency information for each row in the data frame
    return deps 

## 4.2 Retrieve Modifiers from Dependencies

Next, since the `get_dependencies` function outputs a dictionary full of dependency information for each compound, we use the function `get_mods` to retrieve the modifiers from the `get_dependencies` output. 

In [3546]:
def get_mods(column):
    
    """
    This function retrieves the modifiers of the column created via the "get_dependencies" above. 
    Arg: 
        column: a data frame column containing list of dependencies from which we want to retrieve the very first element.
    Returns: 
        A list of modifier words for each compound word if possible, else 0.
    """ 
    
    mods = [] # initiate empty list
    
    # for each list of dependencies
    for deps in column:
        
        # if list ist empty
        if column == "[]":
            return 0 # return 0
           
        else:
            # if we have the string "innen" do not append to modifiers
            if deps[0] == "innen":
                pass
            
            # else 
            else:
            # get first element and save to new list
                mods.append(deps[0])
            
    return mods # return list of modifiers

Let's apply both functions and save the dependency information to a new column `dependencies` and the list of modifiers to a new column `modifiers`. 

In [3553]:
# apply function to data frames 
con_deps = get_dependencies(con_context)
pro_deps = get_dependencies(pro_context)

# retrieve dependencies and save to new column in our data frame 
pro_context["dependencies"] = pro_deps
con_context["dependencies"] = con_deps

In [3578]:
# apply modifier function to find all modifiers and save to new column "modifiers"
pro_context["modifiers"] = pro_context.dependencies.apply(get_mods)
con_context["modifiers"] = con_context.dependencies.apply(get_mods)

Add information the the `compounds` data frame:

In [3607]:
# convert modifiers column into dictionary
pro_mods = pro_context.groupby("pattern")["modifiers"].apply(list).to_dict()
con_mods = con_context.groupby("pattern")["modifiers"].apply(list).to_dict()

# map to compounds data frame
compounds["pro_mods"] = compounds.original.map(pro_mods)
compounds["pro_mods"] = compounds.pro_mods.apply(flatten_list)
compounds["con_mods"] = compounds.original.map(con_mods)
compounds["con_mods"] = compounds.con_mods.apply(flatten_list)

## 4.3 Cleaning
After the manual evaluation of the modifiers, we noticed that the list of modifiers still requires a lot of manual cleaning since some words contained in the modifiers column are neither adverbs nor adjectives (e.g. "Flashcrash") or are not useful for the definition phrasing part. 
Hence, we will manually clean the list of modifiers and load it back into Python to update the data frame with the cleaned version:

In [4]:
# get cleaned list of modifiers
mods_cleaned = pd.read_csv("../evaluation/modifiers_cleaned.csv", index_col = 0, sep=";")

# update knowledge base with cleaned list 
compounds = compounds.set_index('original')
compounds.update(mods_cleaned)
compounds.reset_index(inplace=True)

## 4.4 Visualization

To visualize what just happened in the functions we applied before, let's have a look at the following code. We will use the previously mentioned example phrase from the C2022 corpus: 

**Über den weltweit bekanntesten und wohl aggressivsten Klimaaktivisten Bill McKibben**

For this phrase, we will retrieve the *tokens*, *dependency labels*, the according *heads* and a brief *explanation* of the dependency tag. The ouput is shown below.

In [1831]:
text = "Über den weltweit bekanntesten und wohl aggressivsten Klimaaktivisten Bill McKibben"

print(f"{'Token':{15}} {'Dependence':{15}} {'Head Text':{20}}  {'Dependency Explained'} ")
# for each token in the text
for token in nlp(text):
    # print the dependency information
    print(f"{token.text:{15}} {token.dep_+' =>':{13}}   {token.head.text:{20}}  {spacy.explain(token.dep_)} ")

Token           Dependence      Head Text             Dependency Explained 
Über            ROOT =>         Über                  root 
den             nk =>           Klimaaktivisten       noun kernel element 
weltweit        mo =>           bekanntesten          modifier 
bekanntesten    nk =>           Klimaaktivisten       noun kernel element 
und             cd =>           bekanntesten          coordinating conjunction 
wohl            mo =>           aggressivsten         modifier 
aggressivsten   cj =>           und                   conjunct 
Klimaaktivisten nk =>           Über                  noun kernel element 
Bill            pnc =>          McKibben              proper noun component 
McKibben        nk =>           Klimaaktivisten       noun kernel element 


Heads are tagges as `ROOT`, a full list of the remaining dependency tags can be found here: https://github.com/explosion/spaCy/blob/master/spacy/glossary.py

Our function `get_dependencies_recursive` now seeks to retrieve all words that are being dependent on our compound word (and that are an adjective or adverb).

Accordingly, for this example, our code retrieves the words (in this order):

**bekanntesten** => **weltweit**  
**aggressivsten** => **wohl** 

The syntactic structure of the phrase can also be visualized as a graph of dependencies as given here:

In [212]:
ex1 = "Über den weltweit bekanntesten und wohl aggressivsten Klimaaktivisten Bill McKibben"

doc = nlp(ex1)
displacy.render(doc, style="dep")

# to save the plot please un-comment the following lines

#dep_plot = displacy.render(doc, style='dep', jupyter=False)
#output_path = Path("../../plots/dependency_plot.svg")
#output_path.open("w", encoding="utf-8").write(dep_plot);

Write necessary information about the modifiers to the `compounds` data frame.

In [2624]:
# retrieve dictionary with MODIFIERS with each compound as a key and entities as values
pro_mods = pro_context.set_index('pattern').to_dict()['modifiers']
con_mods = con_context.set_index('pattern').to_dict()['modifiers']

# 5. Sentiment Analysis
This section applies two sentiment models, namely `GermanBert` and `TextBlob` to the context of the compounds.

## 5.1 German Bert 
Firstly, we apply the `GermanBert` model to the data and save the output to a new column `bert` for each of the context data frames.

In [215]:
def get_bert(text):
    
    """
    This function retrieves the polarity of the German Bert model. 
    Arg: 
        text: a string for which we want to retrieve a polarity label.
    Returns: 
        A polarity label, i.e. "positive", "negative", or "neutral".
    """   
    
    # retrieve polarity from German Bert model
    pol = model.predict_sentiment([str(text)])
        
    return pol[0] # return polarity label

In [1838]:
# apply to data frames
pro_context["bert"] = pro_context.full.apply(get_bert)
con_context["bert"] = con_context.full.apply(get_bert)

## 5.2 TextBlob

Next, we apply the `TextBlob` model to the data and save the output to the column `blob`. The function `convert_sentiment` seeks to convert the continuous sentiment scores that are given by the `TextBlob` model into discrete polarity labels (neutral, negative or positive).

In [216]:
def get_blob(text):
    
    """
    This function retrieves the polarity of the TextBlob model. 
    Arg: 
        text: a string for which we want to retrieve a polarity label.
    Returns: 
        A sentiment score, ranging from -1.0 to 1.0
    """   
        
    # retrieve sentiment from TextBlob
    blob = TextBlob(str(text))
    
    return blob.sentiment[0] # return sentiment score 
        

def convert_sentiment(value, threshold):
    
    """
    This function converts the output of get_blob into discrete polarity labels, depending on the threshold that is set. 
    Arg: 
        value: a sentiment score, ranging from -1.0 to 1.0
        threshold: a threshold score, ranging from -1.0 to 1.0
    Returns: 
        A polarity label, i.e. "positive", "negative", or "neutral".
    """   
    
    # if sentiment score is higher or equal to threshold
    if value >= threshold:
        label = "positive" # polarity = positive
        
    # if sentiment score is lower or equal to the negative version of the threshold
    if value <= -(threshold):
        label = "negative" # polarity = negative
        
    # if sentiment score is between the negative version and the positive version of the threshold
    else:
        label = "neutral" # polarity = neutral
        
    return label # return label 

In [1844]:
# apply to data frame 
pro_context["blob"] = pro_context.full.apply(get_blob)
con_context["blob"] = con_context.full.apply(get_blob)

## 5.3 Evaluate Sentiment Tools

### 5.3.1 Check Sentiment
In the following we will retrieve those texts that obtain different polarities from the two sentiment models. The idea is to assign the sentiment for these entries manually and to see if the manual annotation is in accordance with one of the models. This could help us decide for one of the models without the need of annotating a larger sample of texts. 

Firstly, we convert the polarity scores we retrieved via the `TextBlob` tool into discrete labels: `negative`, `neutral`, `positive` to be able to compare them to the polarity labels we got via `GermanBert`. 

To see which threshold for `TextBlob` gives us the most similar polarity labels (compared to the `GermanBert` output), different threshold values (0.0, 0.1, 0.2, 0.3, 0.4) are tested for the conversion into the discrete label format. This is done in the `test_parameters` function below.

In [2167]:
def test_parameters(df):
    
    """
    This function tests different threshold parameters and applies the convert_sentiment function to retrieve an output for each threshold value. 
    It counts for each threshold value how many polarity labels are different for the TextBlob and the GermanBert output. 
    Arg: 
        df: a data frame.
    Returns: 
        An array consisting of threshold values and the according difference count.
    """   
    
    diff_scores = [] # initiate empty list
    
    # for different threshold values
    for x in np.arange(0.0, 0.50, 0.10):
        
        # get discrete labels
        df["blob_bin"] = df.blob.apply(convert_sentiment, threshold = round(x,2))
        
        # compute difference of columns
        diff = len(df[(df['bert'] == df['blob_bin']) == False])
        
        # save scores to list in format: threshold, difference count
        diff_scores.append([round(x,2), diff])
        
    # return minimum of difference count and list of difference counts
    return min(diff_scores, key=lambda x: x[1]), diff_scores

In [2168]:
# apply function to data frames 
pro_diff = test_parameters(pro_context)
con_diff = test_parameters(con_context)

Now, we can have a look at the difference counts that the range of parameters gives us (the first value is the threshold parameter, the second value is the difference count):

In [2171]:
print("Contra data frame differences:\n\n", tabulate(con_diff[1], headers=["threshold", "difference count"]))
print("\nPro data frame differences:\n\n", tabulate(pro_diff[1], headers=["threshold", "difference count"]))

Contra data frame differences:

   threshold    difference count
-----------  ------------------
        0                   753
        0.1                 615
        0.2                 550
        0.3                 519
        0.4                 515

Pro data frame differences:

   threshold    difference count
-----------  ------------------
        0                   279
        0.1                 212
        0.2                 188
        0.3                 176
        0.4                 173


The tables clearly show that a threshold of 0.4 would give the least differences between both sentiment models. Accordingly, we apply the conversion of the `TextBlob` labels with a threshold of 0.4. This means that all values smaller than -0.4 will be considered *negative*, all value between -0.4 and + 0.4 will be considered *neutral* and all values higher than + 0.4 are indicated as *positive*.

In [2174]:
# convert continuous sentiment score to discrete score 
con_context["blob_labels"] = con_context.blob.apply(convert_sentiment, threshold = 0.4)
pro_context["blob_labels"] = pro_context.blob.apply(convert_sentiment, threshold = 0.4)

Since the sample of texts with different polarity labels is still very large, we will try to further reduce it in the next step before we start with a manual annotation.

### 5.3.2 Retrieve main Polarity for each Klima Compound

In the end, we want to retrieve the prevailing polarity for each compound. For the manual evaluation of the polarity labels we now transform the data frame into a different format and get the full text of concordances for each compound word. This is done is `convert_dataframe`.

In [2178]:
def convert_dataframe(df):
    """
    This function groups a data frame by their key word (pattern) and concordances (i.e. context texts). 
    Arg: 
        df: a data frame.
    Returns: 
        A grouped data frame with a column for the key word (pattern) and a column containing all concordances for the according key word (text_by_compound)
    """   
    
    # group data frame by pattern and the full text column
    df["text_by_compound"] =  df[["pattern", "full"]].groupby(['pattern'])['full'].transform(lambda x: '//'.join(x))
    # drop duplicates
    df = df[['pattern','text_by_compound']].drop_duplicates()
    # reset index
    df = df.reset_index()
    # drop index column
    df = df.drop("index", axis = 1)
    # split text 
    df["text_by_compound"] = df["text_by_compound"].apply(lambda x: x.split("//"))

    return df # return grouped data frame 

In [2179]:
# apply function to data frame
pro_sentiment = convert_dataframe(pro_context)
con_sentiment = convert_dataframe(con_context)

This is how the grouped data frame looks like:

In [2181]:
con_sentiment.head()

Unnamed: 0,pattern,text_by_compound
0,klimaabzockerei,[Werner Kirstein Universität Leipzig Kirstein ...
1,klimaaktivismus,[Die SojalatteAdabeis der grünen Stadtbiotope ...
2,klimaaktivist,[Von diesem Beitrag werden Sie bei LeitMedien ...
3,klimaaktivistin,[Klimaaktivisten sind aus anderem Holz geschni...
4,klimaalarm,[Es ist die marxistische WassermelonenAgenda g...


To get the most common polarity label for each compound, we now group the data frames by pattern and polarity label and save the information to the transformed data frame. 

In [2182]:
# get most common polarity label for each compound
pro_sentiment_bert = pro_context.groupby(['pattern'])['bert'].max().tolist()
con_sentiment_bert = con_context.groupby(['pattern'])['bert'].max().tolist()
pro_sentiment_blob = pro_context.groupby(['pattern'])['blob_bin'].max().tolist()
con_sentiment_blob = con_context.groupby(['pattern'])['blob_bin'].max().tolist()

# and save to new column in info data frame 
pro_sentiment["bert"] = pro_sentiment_bert
con_sentiment["bert"] = con_sentiment_bert

pro_sentiment["blob"] = pro_sentiment_blob
con_sentiment["blob"] = con_sentiment_blob

The polarity labels that we retrieved before are now given in our tranformed data frame:

In [2183]:
pro_sentiment.head()

Unnamed: 0,pattern,text_by_compound,bert,blob
0,klimaaktivismus,[Aber das macht nichts . Denn Du kannst trotzd...,neutral,neutral
1,klimaaktivist,[Why should I be studying for a future that so...,positive,neutral
2,klimaaktivistin,[Why should I be studying for a future that so...,positive,neutral
3,klimaasyl,[Sie haben im Januar eine Podiumsdiskussion or...,neutral,neutral
4,klimaaufschrei,[Beispielsweise mittels zivilem Ungehorsam und...,neutral,neutral


Next, we want to retrieve for which compound words the polarity labels we obtained from `TextBlob` and `GermanBert` differ and save those rows to a csv file (`con_sentiment_diff_by_compound.csv` and `pro_sentiment_diff_by_compound.csv`).

In [2184]:
# retrieve columns where we have different sentiment scores 
con_sentiment_diff = con_sentiment[(con_sentiment['bert'] == con_sentiment['blob']) == False]
pro_sentiment_diff = pro_sentiment[(pro_sentiment['bert'] == pro_sentiment['blob']) == False]

This is the case for 13 compounds for the P2022 and 56 compound for the C2022 corpus.

In [2188]:
print("Number of compounds with different sentiment labels (P2022):", len(pro_sentiment_diff))
print("Number of compounds with different sentiment labels (C2022):", len(con_sentiment_diff))

Number of compounds with different sentiment labels (P2022): 13
Number of compounds with different sentiment labels (C2022): 56


In [2190]:
# save new information to csv files
#con_sentiment_diff.to_csv("../evaluation/con_sentiment_diff_by_compound.csv", index = False)
#pro_sentiment_diff.to_csv("../evaluation/pro_sentiment_diff_by_compound.csv", index = False)

For those files we now perform a manual annotation of the sentiment. This can be found in the files `pro_sentiment_diff_manual.csv` and `con_sentiment_diff_manual.csv` in the column `manual`. Since also the manual annotation did not help in deciding for one of the models, we discard this approach of obtaining polarities. (See the thesis paper for more information).

## 5.4 Derive Connotations
To derive the connotation of each climate compound in the discourse of climate change we exploit the simple assumption that connotation of a compound may be directly derived by the sentiment that the second constituent can be associated with. Accordingly, in the following, we will obtain a polarity label for each of the second constituents by re-applying both models and look for differences again.

In [217]:
# apply sentiment models to the second constituent of the compounds
compounds["bert"] = compounds.second_part.apply(get_bert)
compounds["blob"] = compounds.second_part.apply(get_blob)

After the application of both sentiment models to the second constituent of the compounds, we figured that a lot of compounds did not receive the expected sentiment label/score.
Accordingly, we repeat the procedure on the column containing the related words, since here we have more words that can contribute to find the prevailing sentiment of the second constituent. For some cases there are no related words available. Accordingly, for those cases we retrieve the sentiment label/score of the second constituent. 

In [218]:
def get_common_bert(row):
    
    """
    This function retrieves the sentiment labels from the GermanBert model for the related words in the compounds data frame.
    Arg: 
        row: a row with a list of related words.
    Returns: 
        The most common polarity label of all words in the list (GermanBert), else None.
    """   
    
    # initiate empty lists 
    bert = []
    if row:
        # for each string in the list of related words
        for string in row:
            bert.append(get_bert(string)) # retrieve sentiment label and append to list
        
        bert = Counter(bert).most_common(1)[0][0] # retrieve most common sentiment label for list of related words 
        
        return bert
    
    # if there is no list of related words, return None       
    else:        
        return 
    
    
def get_common_blob(row):
    
    """
    This function retrieves the sentiment scores from the TextBlob model for the related words in the compounds data frame.
    Arg: 
        row: a row with a list of related words.
    Returns: 
        The average sentiment score of all words in the list (TextBlob), else None.
    """   
    
    # initiate empty lists 
    blob = []
    if row:
        # for each string in the list of related words
        for string in row:
            blob.append(get_blob(string)) # retrieve sentiment score and append to list
            
        blob = sum(blob)/len(blob) # get average score for the list of words
        
        return blob 
    
    # if there is no list of related words, return None
    else:        
        return 

In [219]:
# apply bert function to the column of related words 
compounds["bert_related"] = compounds.related_words.apply(get_common_bert)
compounds.bert_related.fillna(compounds.bert, inplace=True) # if no related words are available, use sentiment label of second constituent


# apply blob function to the column of related words 
compounds["blob_related"] = compounds.related_words.apply(get_common_blob)
compounds.blob_related.fillna(compounds.blob, inplace=True) # if no related words are available, use sentiment scores of second constituent

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return self._update_inplace(result)


Next, we manually evaluate the sentiment of the compounds. Since in most cases `TextBlob` and `GermanBert` gave different labels and due to the fact that none of the models correctly identifies all sentiments, we decide to manually assign a polarity label to the compounds. This is done in the file `../evaluation/compounds_sentiment.csv`.

In [220]:
compounds_sentiment = pd.read_csv("../evaluation/compounds_sentiment.csv", sep = ";")

In [221]:
compounds_sentiment

Unnamed: 0,original,second_part,bert,bert_related,blob,blob_related,manual_sentiment
0,klimaabzockerei,abzockerei,negative,negative,0.0,0.000000,negative
1,klimaaktivismus,aktivismus,positive,positive,0.0,0.000000,neutral
2,klimaaktivist,aktivist,positive,positive,0.0,0.666667,neutral
3,klimaaktivistin,aktivistin,positive,positive,0.0,0.000000,neutral
4,klimaalarm,alarm,negative,negative,0.0,0.000000,negative
...,...,...,...,...,...,...,...
243,klimazeugs,zeugs,negative,negative,0.0,0.000000,neutral
244,klimazipfel,zipfel,positive,negative,0.0,0.029412,neutral
245,klimazirkus,zirkus,positive,negative,0.0,0.000000,neutral
246,klimazunft,zunft,negative,negative,0.0,0.000000,neutral


Let's have a look at the sentiment scores and polarity labels that we obtained by the two models: The `GermanBert` model has an accuracy of 0.56 for both columns (`bert` and `bert_related`) and `TextBlob` has an accuracy of 0.46 for the `blob` column and 0.57 for `blob_related`. For `TextBlob` we use a threshold value of 0.001 here since the manual evaluation suggests that all values higher or lower than 0.0 should be *positive* or *negative*.

In [222]:
print("Accuracy bert:")
len(compounds_sentiment[compounds_sentiment.bert == compounds_sentiment.manual_sentiment])/ len(compounds_sentiment)

Accuracy bert:


0.5443548387096774

In [223]:
print("Accuracy bert_related:")
len(compounds_sentiment[compounds_sentiment.bert_related == compounds_sentiment.manual_sentiment])/ len(compounds_sentiment)

Accuracy bert_related:


0.5645161290322581

In [224]:
compounds_sentiment["blob_label"] = compounds_sentiment.blob.apply(lambda x: convert_sentiment(x, 0.001))
compounds_sentiment["blob_related_label"] = compounds_sentiment.blob_related.apply(lambda x: convert_sentiment(x, 0.001))

print("Accuracy blob:")
len(compounds_sentiment[compounds_sentiment.blob_label == compounds_sentiment.manual_sentiment])/ len(compounds_sentiment)

Accuracy blob:


0.46774193548387094

In [225]:
print("Accuracy blob_related:")
len(compounds_sentiment[compounds_sentiment.blob_related_label == compounds_sentiment.manual_sentiment])/ len(compounds_sentiment)

Accuracy blob_related:


0.5846774193548387

Finally, we save the manual sentiment labels to the final `compounds` data frame

In [198]:
# append column with manual sentiment to compounds data frame
compounds = pd.concat([compounds, compounds_sentiment.manual_sentiment], 1)

  compounds = pd.concat([compounds, compounds_sentiment.manual_sentiment], 1)


In [199]:
compounds

Unnamed: 0,original,second_part,noun_forms,lemma,genus,compound_forms,related_words,hypernyms,roots,en_hypernyms,...,con_sarcasm,pro_attr,con_attr,tf_pro,tf_con,tfidf_pro,tfidf_con,pro_colls,con_colls,manual_sentiment
0,klimaabzockerei,abzockerei,"[abzockerei, abzockereien]",abzockerei,f,"[klimaabzockerei, klimaabzockereien]","[Abzocke, Geldschneiderei, Profitmacherei, Beu...","[Zinssatz, Zinsfuß]","['Zinssatz', 'Zinsfuß']","[robbery, stealing, thieving, theft, larceny, ...",...,0.000000,,,0,1,,0.000000,,,negative
1,klimaaktivismus,aktivismus,[aktivismus],aktivismus,m,[klimaaktivismus],[],[],[],,...,0.000000,,,3,1,0.015903,0.000000,,,neutral
2,klimaaktivist,aktivist,"[aktivisten, aktivist]",aktivist,m,"[klimaaktivisten, klimaaktivist]","[aktiver Mitarbeiter, Aktivist, politisch akti...","[Volksvertreter, Politiker]","['jemand', 'irgendjemand', 'jeder beliebige']","[reformer, meliorist, crusader, social reforme...",...,0.010753,{Self},{External},61,66,0.158891,0.608632,"[lieb, indigen, Klimaziel, liebe, schon]","[bekannt, Jahr, Bill]",neutral
3,klimaaktivistin,aktivistin,"[aktivistinnen, aktivistin]",aktivistin,f,"[klimaaktivistinnen, klimaaktivistin]",[],[],[],,...,0.000000,{Self},{External},24,9,0.141808,0.103716,"[Tan, Philippinen, Aktivist, Howey, Uganda, Va...","[schwedisch, Greta]",neutral
4,klimaalarm,alarm,"[alarms, alarm, alarmen, alarme, alarmes]",alarm,m,"[klimaalarms, klimaalarm, klimaalarmen, klimaa...","[Alarm, Notruf, Alarmruf, Warnton, Warnsignal,...","[Gunst, Geneigtheit, Wohlwollen, Gewogenheit, ...","['Gunst', 'Wohlwollen', 'Geneigtheit', 'Zugewa...","[fear, fearfulness, fright, emotion, feeling, ...",...,0.000000,,,0,38,,0.398275,,"[Erzeugung, Flashcrash, erklären, Schulz, posa...",negative
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
243,klimazipfel,zipfel,"[zipfeln, zipfel, zipfels]",zipfel,m,"[klimazipfeln, klimazipfel, klimazipfels]","[bestes Stück, Zipfel, Schwengel, Stößel, Schn...",[],[],"[cylinder, round shape, form, shape, attribute...",...,0.000000,,,0,1,,0.000000,,,neutral
244,klimazirkus,zirkus,"[zirkusse, zirkus, zirkussen, zirkusses]",zirkus,m,"[klimazirkusse, klimazirkus, klimazirkussen, k...","[Herumlärmen, Gelärme, Lärmerei, Rumlärmen, Zi...","[vorweisen, vorzeigen]","['Versuch', 'Vorsatz', 'Unternehmung', 'Rastlo...","[bowl, sports stadium, stadium, arena, constru...",...,0.000000,,,0,6,,0.068872,,,neutral
245,klimazunft,zunft,"[zunft, zünfte, zünften]",zunft,f,"[klimazunft, klimazünfte, klimazünften]","[Gewerbe, Gilde, Innung, Gewerk, Zunft, Amt, B...","[Arbeitsgemeinschaft, Arbeitskreis, Arbeitsgru...","['Gestaltung', 'Erreichung', 'Realisierung', '...",[],...,0.000000,,,0,1,,0.000000,,,neutral
246,klimazwang,zwang,"[zwange, zwängen, zwangs, zwanges, zwang, zwänge]",zwang,m,"[klimazwange, klimazwängen, klimazwangs, klima...","[Erpressung, Nötigung, Bedingung, Auflage, Res...","[Befehlssatz, Befehlsvorrat, Befehlsrepertoire]","['Gruppierung', 'Clusterung', 'Bündelung']","[regulating, regulation, control, activity, hu...",...,0.000000,,,0,1,,0.000000,,,neutral


# 6. Self- vs. External Attribution
To evaluate whether compounds words are used by the specific subdiscourse in terms of a self- or external attribution, the `pro_context` and `con_context` data frames will be manually examined. For this, we retrieve the concepts that we determined in `compounds` and add this information to the context data frames. 

In [None]:
# retrieve concept information from compounds data frame to the both context data frames 
concept_dict = dict(zip(compounds.original, compounds.concept))

# create copy of context data frames 
pro_sample = pro_context
con_sample = con_context

# map the concept categories to the new data frames
pro_sample["concept"] = pro_sample['pattern'].map(concept_dict) 
con_sample["concept"] = con_sample['pattern'].map(concept_dict) 

Then we filter the data frames for `persons` and `groups` since we attempt to identify the attribution of each subdiscourse for those concept categories. 

In [3198]:
# filter for persons and groups
pro_sample = pro_sample[(pro_sample.concept == "person") | (pro_sample.concept == "group")]
con_sample = con_sample[(con_sample.concept == "person") | (con_sample.concept == "group")]

# retrieve a random subset for 80% of each compound word 
pro_sample = pro_sample[["pattern","keyword", "concept"]].groupby("pattern").sample(frac=0.8, replace=False, random_state=1)
con_sample = con_sample[["pattern","keyword", "concept"]].groupby("pattern").sample(frac=0.8, replace=False, random_state=1)

# add origin tags 
pro_sample["origin"] = "P2022"
con_sample["origin"] = "C2022"

# concatenate both data frames 
full_sample =  pd.concat([pro_sample,con_sample])

In [3203]:
# save full sample to evaluation folder 
#full_sample.to_csv("../evaluation/attr_sample.csv")

After the manual annotation of whether a term is used to describe the own group (indicated by the tag `self`) or to refer to the opposing group (indicated by the tag `external`), we load the annotated data frame back into Python to save the information to our context data frames. The tag `None` is used if from the context phrases no attribution information could be obtained at all.

In [72]:
annotated_sample = pd.read_csv("../evaluation/attr_manual.csv", sep =";", index_col = 0)

In [73]:
annotated_sample

Unnamed: 0,concept,origin,pattern,keyword,annotation
514,person,C2022,klimabrandstifter,„ Wenn Sie einem Klimabrandstifter vier weiter...,
571,person,C2022,klimafachkraft,Doch eine Klimafachkraft mit der Ausbildung Ka...,
672,person,C2022,klimagott,Mit EMobil werden wir nichts ( in Worten : NIC...,Exernal
674,person,C2022,klimagott,"verbiegen , um dem Klimagott zu huldigen .",Exernal
676,person,C2022,klimagöttin,Politiker die sie regieren und der neuen „ Kli...,Exernal
...,...,...,...,...,...
1215,group,C2022,klimamafia,"Vor Jahren war der Mann noch nicht infiziert ,...",External
1217,group,C2022,klimamafia,"Jetzt wird es von der Klimamafia verwendet , u...",External
1282,group,C2022,klimaplanwirtschaft,So hat uns bald dahingerafft die KlimaPlanwirt...,External
1797,group,C2022,klimastaat,Aber der soziale Tod ist ihnen in dem von viel...,External


In [102]:
# retrieve attribution for both subdiscourses
pro_attribution = annotated_sample[annotated_sample.origin == "P2022"].groupby("pattern")["annotation"].apply(set).to_dict()
con_attribution = annotated_sample[annotated_sample.origin == "C2022"].groupby("pattern")["annotation"].apply(set).to_dict()

In [103]:
# and map to compounds data frame 
compounds["pro_attr"] = compounds.original.map(pro_attribution)
compounds["con_attr"] = compounds.original.map(con_attribution)

## 6.1 Simplistic Sarcasm Detection
To be able to give further information about the use of each compound in the according discourse, we use the following very simplistic approach to detect sarcastic mentionings of the compounds in the key word phrases:
The manual evaluation of section 6 suggests that if a compound appears in quotation marks it is very likely to be used sarcastically. 

In [66]:
def find_quotations(df):
    
    """
    This function checks for each key word sentence whether the compound words is mentioned in quotation marks. 
    Arg: 
        df: a data frame containing the compound and a key word phrase.
    Returns: 
        A new column with the binary label 1 if the compound is used in quotation marks, else 0.
    """ 
    
    # for each row in the data frame
    for idx, row in df.iterrows():
        
        string = row['keyword'].lower() # retrieve key word phrase
        compound = row['pattern'] # retrieve compound word
    
        # retrieve exact positions of compound word in the string
        for match in re.finditer(compound, string):
            start = match.start() # get start position
            end = match.end() # and end position

            try: 
                # if 1 position before the starting point, we have one of the following quotation marks
                if string[start-1] == '"' or string[start-1] == """'""" or string[start-1] == """„""":
                    x = True # set x to True

                # or if 2 position before the starting point, we have one of the following quotation marks
                elif string[start-2] == """ " """ or str1[start-2] == """'""" or string[start-2] == """„""":
                    x = True # set x to True
                    
                # else, x is False
                else:
                    x = False

            except:
                x = False
                y = False

            try:
                # if 1 position after the end point, we have one of the following quotation marks
                if string[end+1] == '"' or string[end+1] == """'""" or string[end+1] == """„""":
                    y = True # set y to True
                    
                # if 2 positions after the end point, we have one of the following quotation marks
                elif string[end+2] == '"' or string[end+2] == """'""" or string[end+2] == """“""":
                    y = True # set y to True
                    
                # else, y is False
                else:
                    y = False 
                    
            except:
                x = False
                y = False

            # if we have 2 quotations marks surrounding the compound
            if x and y == True:
                df.at[idx,'quotes'] = 1 # set value in new column to 1 
 
            # if not:
            else:
                df.at[idx,'quotes'] = 0 # set value in new column to 0

In [69]:
# apply function to both data frames 
find_quotations(pro_context)
find_quotations(con_context)

# convert type of number from float to integer
pro_context.quotes = pro_context.quotes.astype('int32')
con_context.quotes = con_context.quotes.astype('int32')

Then, we want to compute the proportion of how often the compound word is used in quotation marks and map the information to the compounds data frame  

In [75]:
# count proportion of each compound of how many appearances are sarcastic and how many are not
pro_sarcasm = pro_context.groupby('pattern')['quotes'].mean().to_dict()
con_sarcasm = con_context.groupby('pattern')['quotes'].mean().to_dict()

# map information for each compound to compounds data frame 
compounds["pro_sarcasm"] = compounds.original.map(pro_sarcasm)
compounds["con_sarcasm"] = compounds.original.map(con_sarcasm)

# 7. Term Frequencies
In the following, we will add the term frequencies and the TF-IDF scores that we computed in R to our compounds data frame.

In [152]:
# load tf data frame 
tf = pd.read_csv("../../R/output/tf_complete.csv", header=1, names=["original", "tf_pro", "tf_con"])

# load tfidf data frame
tf_idf = pd.read_csv("../../R/output/tfidf_complete.csv", header=0, index_col =0, names=["original", "tfidf_con", "tfidf_pro"])

In [149]:
tf.head()

Unnamed: 0,original,tf_pro,tf_con
0,klimaaktivistin,24,9
1,klimagerechtigkeit,280,16
2,klimaaktivist,61,66
3,klimapäckchen,9,0
4,klimarettung,13,48


In [150]:
tf_idf.head()

Unnamed: 0,original,tfidf_con,tfidf_pro
1,klimaabzockerei,0.0,
2,klimaaktivismus,0.0,0.015903
3,klimaaktivist,0.608632,0.158891
4,klimaaktivistin,0.103716,0.141808
5,klimaalarm,0.398275,


In [None]:
# set index
compounds = compounds.set_index('original')
tf = tf.set_index('original')
tf_idf = tf_idf.set_index('original')

# append new columns
compounds = pd.concat([compounds, tf.tf_pro, tf.tf_con, tf_idf.tfidf_pro, tf_idf.tfidf_con], 1)
compounds.tf_pro = compounds.tf_pro.astype('Int64') # convert into integers
compounds.tf_con = compounds.tf_con.astype('Int64') # convert into integers

# reset index
compounds.reset_index(inplace=True)

# 8. Collocations

Add collocation that we retrieved in R to the data frame.

In [178]:
# load collocations
colls_pro = pd.read_csv("../../R/output/top_colls_pro_cleaned.csv", sep=";", index_col = 0)
colls_con = pd.read_csv("../../R/output/top_colls_con_cleaned.csv", sep=";", index_col = 0)

In [179]:
colls_con

Unnamed: 0,word,n,keyword,tag
1,bekannt,4,klimaaktivist,pre
2,Jahr,2,klimaaktivist,pre
3,Bill,4,klimaaktivist,post
4,schwedisch,3,klimaaktivistin,pre
5,Greta,5,klimaaktivistin,post
...,...,...,...,...
134,breite,2,klimawahn,post
135,liegen,2,klimawahnsinn,pre
136,beenden,2,klimawahnsinn,post
137,Deutschland,2,klimawahnsinn,post


In [180]:
# map to compounds data frame
colls_con_dict = colls_con.groupby("keyword")["word"].apply(list).to_dict()
colls_pro_dict = colls_pro.groupby("keyword")["word"].apply(list).to_dict()

compounds["pro_colls"] = compounds.original.map(colls_pro_dict)
compounds["con_colls"] = compounds.original.map(colls_con_dict)

# TO DELETE

## UPDATE DATA FRAMES 

In [77]:
#pro_context.to_csv("../output/pro_info.csv", index = False)
#con_context.to_csv("../output/con_info.csv", index = False)

In [6]:
#knowledge_base.to_csv("../output/knowledge_base.csv", index = False)

In [145]:
#compounds = compounds.drop(['tf_pro', 'tf_con', 'tfidf_pro','tfidf_con'], axis=1)