In [None]:
import pandas as pd
import numpy as np
from statistics import mean

from textblob_de import TextBlobDE as TextBlob

import sys
sys.path.append("GerVADER/")
import GerVADER.vaderSentimentGER as gv

from typing import Callable, List, Dict, TextIO, Tuple

# Hyperbole Feature Engineering

## Introduction

The approach followed in this notebook is taken from the paper "A Computational Exploration of Exaggeration" [1].

[1] Troiano, Enrica, et al. "A computational exploration of 
exaggeration." Proceedings of the 2018 Conference on Empirical 
Methods in Natural Language Processing. 2018.

## Data

The following section briefly describes the datasets used in this notebook.

In [None]:
PATH_TO_DATA = "../../data/hyperbole_detection/"

### Hyperbole dataset

##### The dataset has been taken from the following source:

https://github.com/yunx-z/MOVER

##### The dataset has been published in this paper: 

Zhang, Yunxiang, and Xiaojun Wan. "MOVER: Mask, Over-generate and Rank for Hyperbole Generation." arXiv preprint arXiv:2109.07726 (2021).

##### Additional information

The HYPO-L dataset was translated using Google Sheets. Google Sheets has the ability to use the Google Translate API by invoking a simple function. 

This is only a temporary solution. The plan is to train the final model on a smaller, human-translated dataset. It will be interesting to compare the performance of a model trained on machine-translated data with that of a model trained on human-translated data.

In [None]:
HYPERBOLE_DATASET = PATH_TO_DATA + "hyperbole_machine_translations.csv"

### Data which is used to compute imageability

##### The dataset has been taken from the following source: 

https://www.ims.uni-stuttgart.de/forschung/ressourcen/experiment-daten/affective-norms/

##### The dataset has been published in this paper:

Maximilian Köper, Sabine Schulte im Walde
Automatically Generated Norms of Abstractness, Arousal, Imageability and Valence for 350,000 German Lemmas
In: Proceedings of the 10th Conference on Language Resources and Evaluation (LREC). Portorož, Slovenia, May 2016.

##### Additional citations (training resource publications)

* Võ et al. (2009)
* Lahl et al. (2009)
* Kanske and Kotz (2010)
* Schmidtke et al. (2014)
* The MRC Psycholinguistic Database
* Brysbaert et al. (2014)

In [None]:
IMAGEABILITY_DATASET = PATH_TO_DATA + "imageability/affective_norms.txt"

### Data which is used to compute unexpectedness

##### The GloVe embeddings have been taken from the following source:

https://www.deepset.ai/german-word-embeddings

https://int-emb-glove-de-wiki.s3.eu-central-1.amazonaws.com/vectors.txt

##### The GloVe embeddings have been published in the following repository:

https://gitlab.com/deepset-ai/open-source/glove-embeddings-de

##### The Word2Vec embeddings have been taken from the following source:

https://www.deepset.ai/german-word-embeddings

https://int-emb-word2vec-de-wiki.s3.eu-central-1.amazonaws.com/vectors.txt

##### The Word2Vec embeddings have been published in the following repository:

https://gitlab.com/deepset-ai/open-source/word2vec-embeddings-de

In [None]:
GLOVE_VECTORS_DATASET = PATH_TO_DATA + "glove/glove_vectors.txt"
WORD2VEC_VECTORS_DATASET = PATH_TO_DATA + "word2vec/word2vec_vectors.txt"

### Data which is used to compute polarity

##### The German version of the SentiWS dataset has been taken from the following source:

https://wortschatz.uni-leipzig.de/de/download

##### The dataset has been published in the following paper:

R. Remus, U. Quasthoff & G. Heyer: SentiWS - a Publicly Available German-language Resource for Sentiment Analysis.
In: Proceedings of the 7th International Language Resources and Evaluation (LREC'10), pp. 1168-1171, 2010

In [None]:
SENTI_WS_NEGATIVE_DATASET = PATH_TO_DATA + "polarity/SentiWS_v2.0/SentiWS_v2.0_Negative.txt"
SENTI_WS_POSITIVE_DATASET = PATH_TO_DATA + "polarity/SentiWS_v2.0/SentiWS_v2.0_Positive.txt"

### Main Dataframe

The following dataframe is used as the starting point. All feature engineering steps are executed on this dataframe. Existing values are not changed; the features are added as additional columns.

In [None]:
df_hyperbole = pd.read_csv(HYPERBOLE_DATASET)

In [None]:
df_hyperbole.columns

## Feature Engineering

### Imageability

Imageability is a feature that is commonly used in metaphor detection. Troiano et al. argued that a speaker might use hyperbole to convey strength, and in doing so might choose highly picturable vocabulary.

They computed the feature by averaging the imageability values of all words in a sentence. Because they worked with English, they used a different dataset; apart from that, the approach is followed as described in the paper.
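
As a minimal sketch of the averaging step, the snippet below uses a made-up three-entry lexicon. The words, scores, and the function name `mean_imageability` are illustrative assumptions, not values or names from the affective norms file. Like the implementation in this notebook, it divides by the total number of tokens, so words missing from the lexicon count as zero and lower the average.

```python
from typing import Dict

# Hypothetical toy lexicon; the real scores come from the affective norms file.
toy_imageability: Dict[str, float] = {"berg": 8.0, "riesig": 6.0, "idee": 2.0}


def mean_imageability(sentence: str, lexicon: Dict[str, float]) -> float:
    """Average the imageability over all whitespace-separated tokens."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    # Tokens that are missing from the lexicon contribute 0 to the sum.
    return sum(lexicon.get(word, 0.0) for word in words) / len(words)


# Two of the four tokens are in the lexicon: (8.0 + 6.0) / 4
print(mean_imageability("Der Berg ist riesig", toy_imageability))  # 3.5
```

Dividing by the number of matched words instead would give the mean over known words only; which variant is closer to the paper is not settled here.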

In [None]:
def generate_imageability_dict() -> Dict[str, float]:
    df = pd.read_csv(IMAGEABILITY_DATASET, sep="\t")
    df["WordLower"] = df["Word"].str.lower()
    
    return pd.Series(df.IMG.values,index=df.WordLower).to_dict()

def get_imageability_of_sentence(sentence: str, dict_imageability: Dict[str, float]) -> float:
    words = sentence.lower().split(" ")
    
    # Words without an imageability value contribute 0 to the sentence score
    score = sum(dict_imageability.get(word, 0) for word in words)
    
    return score / len(words)

def get_imageability(sentences: List[str], dict_imageability: Dict[str, float]) -> List[float]:
    return [get_imageability_of_sentence(sentence, dict_imageability) for sentence in sentences]

def append_imageabilty(df: pd.DataFrame, sentence_column: str, dict_imageability: Dict[str, float]) -> pd.DataFrame:
    
    sentences = df[sentence_column].tolist()
    
    values = get_imageability(sentences,dict_imageability)
    
    df["imageability"] = values
    
    return df

Generate the dictionary and append the imageability feature to the initial dataframe.

In [None]:
dict_imageability = generate_imageability_dict()

In [None]:
df_hyperbole = append_imageabilty(df_hyperbole, "german",dict_imageability)

In [None]:
df_hyperbole.head(5)

### Unexpectedness

Unexpectedness is a feature that tries to give a number to the predictability of the words in a sentence. The idea is to have a higher value for a word if the word is used in an unexpected manner. 

Troiano et al. argued that word vectors might capture whether a word is used unexpectedly, because they encode the contexts in which terms frequently occur, as well as contrasts and similarities among their meanings [1].

To compute unexpectedness, Troiano et al. used GloVe and Word2Vec embeddings. To replicate their approach, German GloVe and Word2Vec embeddings are used.

The embeddings are loaded into dictionaries. Following the approach by Troiano et al., the embedding of each word in a sentence is then compared with the embeddings of all other words in the sentence using cosine similarity.

The average and the minimum of these similarity values are appended to the initial dataframe as new feature columns. With two embedding models, this results in four new columns.
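
The pairwise comparison can be sketched with a few hand-made vectors. The two-dimensional vectors, the tokens, and the helper names below are assumptions chosen so the numbers are easy to check; real embeddings have many more dimensions. The sketch skips words without a vector, computes the cosine similarity for every unordered word pair, and returns the average and the minimum.

```python
import math
from itertools import combinations
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical toy embeddings, not taken from the GloVe or Word2Vec files.
toy_vectors: Dict[str, List[float]] = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def pairwise_similarity_stats(words: List[str],
                              vectors: Dict[str, List[float]]) -> Tuple[float, float]:
    """Return (average, minimum) cosine similarity over all unordered word pairs."""
    known = [word for word in words if word in vectors]
    sims = [cosine_similarity(vectors[w1], vectors[w2])
            for w1, w2 in combinations(known, 2)]
    if not sims:
        # Fewer than two known words: no pair to compare.
        return (0.0, 0.0)
    return (mean(sims), min(sims))


avg_sim, min_sim = pairwise_similarity_stats(["a", "b", "c", "unknown"], toy_vectors)
print(avg_sim, min_sim)
```

Here "a" and "b" are orthogonal, so the minimum is 0, while both pairs involving "c" have a similarity of 1/√2.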

In [None]:
def load_model(file_path: str, mode: str) -> Dict[str, np.ndarray]:
    
    # Fail loudly on an unknown mode instead of silently returning None
    if mode not in ["glove","word2vec"]:
        raise ValueError("Please specify either glove or word2vec as mode")
    
    print("Loading " + mode + " model")

    word_function = get_word_function(mode)
    
    # The files contain German words and are assumed to be UTF-8 encoded
    with open(file_path, 'r', encoding="utf-8") as f:
        model = get_model_dict(f, word_function)
            
    print(f"{len(model)} words loaded!")
    
    return model

def get_model_dict(file: TextIO, word_function: Callable[[List[str]], str]) -> Dict[str, np.ndarray]:
    
    model = {}
    
    for line in file:
        # Each line holds a token followed by its space-separated vector components
        split_line = line.split(" ")
        word = word_function(split_line)
        embedding = np.array(split_line[1:],dtype=np.float64)
        model[word] = embedding
    
    return model

def get_word_function(mode: str) -> Callable[[List[str]], str]:
    if mode == "word2vec":
        return get_word_word2vec_formating
    else:
        return get_word_glove_formatting

def get_word_word2vec_formating(split_line: List[str]) -> str:
    # Remove single quotes and the first character; this turns a token such as b'wort' into wort
    return split_line[0].replace("'","")[1:]

def get_word_glove_formatting(split_line: List[str]) -> str:
    return split_line[0]

The following functions compute the unexpectedness features.

In [None]:
def get_unexpectedness_for_sentence(sentence: str, glove_dict: Dict[str, np.array]) -> Tuple[float, float]:
    
    words = get_filtered_words(sentence,glove_dict)
    
    results = go_over_all_word_pairs(words,glove_dict)
    
    if len(results) == 0:
        return (0,0)
    
    return (mean(results),min(results))
    
def get_filtered_words(sentence: str, glove_dict: Dict[str, np.array]) -> List[str]:
    # Keep only the words for which an embedding is available
    words = sentence.lower().split(" ")
    return [word for word in words if word in glove_dict]

def go_over_all_word_pairs(words: List[str] ,glove_dict: Dict[str, np.array]) -> List[float]:
    # Compare every unordered pair of words exactly once
    from itertools import combinations

    return [get_cosine_similarity_for_word_pair(word, word_comp, glove_dict)
            for word, word_comp in combinations(words, 2)]

def get_cosine_similarity_for_word_pair(word1: str, word2: str, glove_dict: Dict[str, np.array]) -> float:
    
    if word1 in glove_dict.keys() and word2 in glove_dict.keys():
    
        vector_1 = glove_dict[word1]
        vector_2 = glove_dict[word2]
        
        return get_cosine_simlarity(vector_1,vector_2)
    else:
        return 0
    
def get_cosine_simlarity(a: np.array, b: np.array) -> float:
    return np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))

In [None]:
glove_model = load_model(GLOVE_VECTORS_DATASET,"glove")

In [None]:
word2vec_model = load_model(WORD2VEC_VECTORS_DATASET,"word2vec")

In [None]:
sentence = "Das hier ist ein Beispielsatz"

In [None]:
get_unexpectedness_for_sentence(sentence,glove_model)

In [None]:
get_unexpectedness_for_sentence(sentence,word2vec_model)

In [None]:
def get_unexpectedness_list(sentences: List[str], dict_model: Dict[str, np.array]) -> List[Tuple[float, float]]:
    
    res = []
    
    for sentence in sentences:
        res.append(get_unexpectedness_for_sentence(sentence,dict_model))
    
    return res

def get_avg_unexpectedness(unexpectedness_list: List[Tuple[float, float]]) -> List[float]:
    
    return get_values_at_index(unexpectedness_list, 0)

def get_min_unexpectedness(unexpectedness_list: List[Tuple[float, float]]) -> List[float]:
    
    return get_values_at_index(unexpectedness_list, 1)

def get_values_at_index(tuple_list: List[Tuple[float,float]], index: int) -> List[float]:  
    res = []
    
    for tup in tuple_list:
        res.append(tup[index])
    
    return res                     

def append_unexpectedness(df: pd.DataFrame, sentence_column: str, 
                          dict_model: Dict[str, np.array], model_name: str) -> pd.DataFrame:
    
    sentences = df[sentence_column].tolist()
    
    unexpectedness_list = get_unexpectedness_list(sentences, dict_model)
    
    values_avg = get_avg_unexpectedness(unexpectedness_list)
    values_min = get_min_unexpectedness(unexpectedness_list)
    
    column_avg = "unexpectedness_avg_{}".format(model_name)
    column_min = "unexpectedness_min_{}".format(model_name)
    
    df[column_avg] = values_avg
    df[column_min] = values_min
    
    return df

In [None]:
df_hyperbole = append_unexpectedness(df_hyperbole, "german",glove_model,"glove")

In [None]:
df_hyperbole = append_unexpectedness(df_hyperbole, "german",word2vec_model,"word2vec")

In [None]:
df_hyperbole.head(5)

### Polarity

Polarity stands for the sentiment of a sentence [1].

Troiano et al. used SentiWords and TextBlob to calculate the polarity values. A German version of TextBlob is available and is used below. In place of SentiWords, the German-language SentiWS lexicon is used, so that the approach can be followed.

The polarity is computed for each sentence.

#### SentiWS Polarity
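
The lexicon lookup can be sketched as follows. The entries, weights, and the function name `lexicon_polarity` are made up for illustration and are not taken from the SentiWS files. As in the implementation below, the summed weights are divided by the total number of tokens, so words without a lexicon entry dilute the score.

```python
from typing import Dict

# Hypothetical lexicon entries; the real weights come from the SentiWS files.
toy_pos: Dict[str, float] = {"gut": 0.5, "schön": 0.25}
toy_neg: Dict[str, float] = {"schlecht": -0.75}


def lexicon_polarity(sentence: str,
                     pos: Dict[str, float],
                     neg: Dict[str, float]) -> float:
    """Sum the weights of all known words and divide by the token count."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    score = 0.0
    for word in words:
        if word in pos:
            score += pos[word]
        elif word in neg:
            score += neg[word]
    return score / len(words)


# One positive and one negative word among eight tokens: (0.5 - 0.75) / 8
print(lexicon_polarity("Das Essen war gut aber der Service schlecht", toy_pos, toy_neg))
```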

In [None]:
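# The SentiWS files have no header row, so the column names are passed explicitly
sentiws_columns = ["Word","Value","WordForms"]

df_neg = pd.read_csv(SENTI_WS_NEGATIVE_DATASET,sep="\t",header=None,names=sentiws_columns)
df_pos = pd.read_csv(SENTI_WS_POSITIVE_DATASET,sep="\t",header=None,names=sentiws_columns)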

In [None]:
def senti_words_preprocessing(df: pd.DataFrame) -> Dict[str, float]:
    # Work on a copy so the assignment below does not trigger a SettingWithCopyWarning
    df_snip = df[['Value','WordForms']].copy()
    # Split the comma-separated word forms and create one row per form
    df_snip['WordForms'] = df_snip['WordForms'].str.lower().str.split(",")
    df_all_words = df_snip.explode('WordForms')
    df_all_words = df_all_words.reset_index(drop=True)
    final_dict = pd.Series(df_all_words.Value.values,index=df_all_words.WordForms).to_dict()
    
    return final_dict

def get_senti_words_polarity_of_sentence(sentence: str, 
                                         dict_pos: Dict[str, float], dict_neg: Dict[str, float]) -> float:
    score = 0
    words = sentence.lower().split(" ")
    
    for word in words:
        if word in dict_pos:
            score += dict_pos[word]
        elif word in dict_neg:
            score += dict_neg[word]
    
    return score / len(words)

def get_senti_words_polarity(sentences: List[str], 
                             dict_pos: Dict[str, float], dict_neg: Dict[str, float]) -> List[float]:
    
    res = []
    
    for sentence in sentences:
        res.append(get_senti_words_polarity_of_sentence(sentence,dict_pos,dict_neg))
    
    return res

def append_senti_ws_polarity(df: pd.DataFrame, sentence_column: str, 
                             dict_pos: Dict[str, float], dict_neg: Dict[str, float]) -> pd.DataFrame:
    
    sentences = df[sentence_column].tolist()
    
    values = get_senti_words_polarity(sentences,dict_pos,dict_neg)
    
    df["polarity_senti_ws"] = values
    
    return df

In [None]:
%%capture
dict_pos = senti_words_preprocessing(df_pos)
dict_neg = senti_words_preprocessing(df_neg)

In [None]:
df_hyperbole = append_senti_ws_polarity(df_hyperbole,"german",dict_pos,dict_neg)

In [None]:
df_hyperbole.head(7)

In [None]:
type(TextBlob("Hallo Welt").sentiment)

#### TextBlob Code (Used for Polarity and Subjectivity)

In [None]:
def get_text_blob_sentiment_of_sentence(sentence: str) -> Tuple[float, float]:
    blob = TextBlob(sentence)
    return blob.sentiment

def get_text_blob_sentiment(sentences: List[str]) -> List[Tuple[float, float]]:
    
    res = []
    
    for sentence in sentences:
        res.append(get_text_blob_sentiment_of_sentence(sentence))
    
    return res

def get_text_blob_polarity(text_blob_sentiment: List[Tuple[float, float]]) -> List[float]:
    
    # The first element of each sentiment tuple is the polarity
    return [sentiment[0] for sentiment in text_blob_sentiment]

def get_text_blob_subjectivity(text_blob_sentiment: List[Tuple[float, float]]) -> List[float]:
    
    # The second element of each sentiment tuple is the subjectivity
    return [sentiment[1] for sentiment in text_blob_sentiment]

def append_text_blob_polarity(df: pd.DataFrame, text_blob_sentiment: List[Tuple[float, float]]) -> pd.DataFrame:
    
    values = get_text_blob_polarity(text_blob_sentiment)
    
    df["polarity_text_blob"] = values
    
    return df

def append_text_blob_subjectivity(df: pd.DataFrame, text_blob_sentiment: List[Tuple[float, float]]) -> pd.DataFrame:
    
    values = get_text_blob_subjectivity(text_blob_sentiment)
    
    df["subjectivity_text_blob"] = values
    
    return df

In [None]:
sentences = df_hyperbole["german"].tolist()

In [None]:
text_blob_sentiment = get_text_blob_sentiment(sentences)

In [None]:
df_hyperbole = append_text_blob_polarity(df_hyperbole,text_blob_sentiment)

In [None]:
df_hyperbole.head(5)

### Subjectivity

Subjectivity is a feature that describes whether objective information or personal opinion is communicated by a statement [1].

TextBlob returns the subjectivity together with the polarity, so the sentiment values computed above are reused here.
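
As a small sketch of how one list of sentiment pairs yields both features, the snippet below splits hypothetical (polarity, subjectivity) tuples into two lists. The values are invented stand-ins for TextBlob output, and the variable names are assumptions.

```python
from typing import List, Tuple

# Hypothetical (polarity, subjectivity) pairs standing in for TextBlob output.
toy_sentiment: List[Tuple[float, float]] = [(0.7, 0.9), (0.0, 0.0), (-1.0, 1.0)]

# zip(*pairs) transposes the list of pairs into one tuple per component.
polarity, subjectivity = (list(column) for column in zip(*toy_sentiment))

print(polarity)      # [0.7, 0.0, -1.0]
print(subjectivity)  # [0.9, 0.0, 1.0]
```

The unpacking assumes a non-empty list; for an empty list `zip(*pairs)` yields nothing and the assignment fails.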

In [None]:
df_hyperbole = append_text_blob_subjectivity(df_hyperbole,text_blob_sentiment)

In [None]:
df_hyperbole.head(5)

In [None]:
table_subjectivity = pd.DataFrame(df_hyperbole.groupby(['subjectivity_text_blob']).size())
table_subjectivity.index = list(table_subjectivity.index)
table_subjectivity = table_subjectivity.reset_index()
table_subjectivity.columns = ["subjectivity_value","count"]

In [None]:
table_subjectivity

### Emotional Intensity

Emotional intensity describes the strength of sentiment [1].

Troiano et al. used the VADER library to calculate this value. A German adaptation of VADER, GerVADER, is used here.

Source: https://github.com/KarstenAMF/GerVADER

GerVADER returns a positive, a neutral, a negative, and a compound score for each sentence. All four scores are appended as new features, because the paper does not describe which values were used.
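
The step from per-sentence score dictionaries to feature columns can be sketched without the analyzer itself. The dictionaries below are invented examples that only reuse the keys `pos`, `neu`, `neg`, and `compound` read in this notebook; the helper name `scores_to_columns` is an assumption.

```python
from typing import Dict, List

# Maps the score keys to the column names used in this notebook.
COLUMN_NAMES: Dict[str, str] = {
    "pos": "vader_positive",
    "neu": "vader_neutral",
    "neg": "vader_negative",
    "compound": "vader_compound",
}

# Hypothetical score dictionaries, one per sentence.
toy_scores: List[Dict[str, float]] = [
    {"pos": 0.5, "neu": 0.5, "neg": 0.0, "compound": 0.6},
    {"pos": 0.0, "neu": 1.0, "neg": 0.0, "compound": 0.0},
]


def scores_to_columns(scores: List[Dict[str, float]]) -> Dict[str, List[float]]:
    """Collect each score type into its own list, keyed by column name."""
    return {column: [score[key] for score in scores]
            for key, column in COLUMN_NAMES.items()}


columns = scores_to_columns(toy_scores)
print(columns["vader_compound"])  # [0.6, 0.0]
```

A dictionary of equally long lists like this can be assigned to a dataframe column by column.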

##### GerVADER has been published in the following paper: 

Karsten Michael Tymann, Matthias Lutz, Patrick Palsbröker and Carsten Gips: GerVADER - A German adaptation of the VADER sentiment analysis tool for social media texts. In Proceedings of the Conference "Lernen, Wissen, Daten, Analysen" (LWDA 2019), Berlin, Germany, September 30 - October 2, 2019.

In [None]:
analyzer = gv.SentimentIntensityAnalyzer()

In [None]:
analyzer.polarity_scores("Hallo Welt")

In [None]:
def get_vader_scores_of_sentence(sentence: str, analyzer: gv.SentimentIntensityAnalyzer) -> Dict[str, float]:
    
    scores = analyzer.polarity_scores(sentence)
    
    return scores
    
def get_vader_scores(sentences: List[str], 
                     analyzer: gv.SentimentIntensityAnalyzer) -> Tuple[List[float], List[float], List[float], List[float]]:
    
    res_positive = []
    res_neutral = []
    res_negative = []
    res_compound = []
    
    for sentence in sentences:
        scores = get_vader_scores_of_sentence(sentence, analyzer)
        res_positive.append(scores["pos"])
        res_neutral.append(scores["neu"])
        res_negative.append(scores["neg"])
        res_compound.append(scores["compound"])
    
    return res_positive, res_neutral, res_negative, res_compound

def append_vader_scores(df: pd.DataFrame, sentence_column: str, 
                        analyzer: gv.SentimentIntensityAnalyzer) -> pd.DataFrame:
    
    sentences = df[sentence_column].tolist()
    
    result_tuple = get_vader_scores(sentences, analyzer)
    
    df["vader_positive"] = result_tuple[0]
    df["vader_neutral"] = result_tuple[1]
    df["vader_negative"] = result_tuple[2]
    df["vader_compound"] = result_tuple[3]
    
    return df

In [None]:
df_hyperbole = append_vader_scores(df_hyperbole, "german", analyzer)

In [None]:
column_selection = ["german","vader_positive","vader_neutral","vader_negative","vader_compound"]
df_hyperbole[column_selection].head(5)

## Export

The initial dataset, extended with all additional features, is exported.

In [None]:
df_hyperbole.to_csv(PATH_TO_DATA + "hyperboles_feature_engineered.csv",index=False)