# Text Data Augmentation

In this notebook, we show how to perform data augmentation on text. Data augmentation means making copies of the original text data and applying small, meaning-preserving changes to them. This increases the amount of training data without collecting or annotating new examples. The hope is that the additional data improves the performance of a model trained on it.

In this example we test whether augmenting ~1000 tweets can improve the performance of a sentiment model. I had expected an increase in performance, but the experiment did not show one. However, the method might be successful on other types of text data. I could also imagine that data augmentation will prove useful when the Danish resources for synonyms and word embeddings are more developed and mature, as data augmentation seems promising on English text data (see for instance the paper by Wei & Zou (2019): "EDA: Easy data augmentation techniques for boosting performance on text classification tasks" and the paper by Sun & He (2020): "A novel approach to generate a large scale of supervised data for short text sentiment analysis").

Another small experiment, carried out at the end of this tutorial, tests whether adding constructed tricky sentences with negations to the training data can bias the model towards handling ambiguous sentences better.

Steps in this tutorial:
1. Load libraries, data etc.
2. Preprocess the data (cleaning)
3. Augment the data: Add spelling mistakes, swap words with word embeddings and/or synonyms
4. Prepare the data for training and evaluation with SpaCy
5. Train two models: with and without the augmented data
6. Test the performance of the better performing model
7. Test negations

## 1. Load libraries, data etc. 

#### Before you start the tutorial, you need to install some packages. You install them by downloading the file "requirements_dataaugmentation.txt" and in your terminal type the command 'pip install -r requirements_dataaugmentation.txt'. 

#### Also, you need to install the latest version of the danlp package, which you do by typing the following command in the terminal: 'pip install git+https://github.com/alexandrainst/danlp.git'

After that, you are ready to import the libraries. 

In [1]:
import pandas as pd
import operator
import numpy as np
import re
import random
from random import shuffle
random.seed(1)
import os
import srsly
from sklearn.metrics import classification_report

import danlp
from danlp.download import DEFAULT_CACHE_DIR
from danlp import utils
from danlp.models import load_spacy_model
from danlp.models.embeddings  import load_wv_with_gensim
from danlp.datasets import DanNet
from spacy.gold import docs_to_json
import spacy

In [2]:
# load the spaCy model 
nlp = load_spacy_model()

# Load the word embeddings to use
word_embeddings = load_wv_with_gensim('conll17.da.wv')

# Load the synonyms to use 
dannet = DanNet()

The dataset used in this tutorial is the Twitter Sentiment dataset, a small dataset manually annotated by the Alexandra Institute. It contains tags in two sentiment dimensions: analytic: ['subjective' , 'objective'] and polarity: ['positive', 'neutral', 'negative' ]. In this tutorial we only use the polarity dimension. The data is split into a training part and a test part.


#### Due to Twitter's privacy policy, people are allowed to delete their tweets. Therefore, to download the actual tweet text, one needs a Twitter developer account and a set of login keys. Read how to get started here: https://python-twitter.readthedocs.io/en/latest/getting_started.html. 
#### The dataset can then be loaded with the DaNLP package (see the code block below) after setting the following environment variables to the keys you get when you create an account:

TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET, TWITTER_ACCESS_TOKEN, TWITTER_ACCESS_SECRET
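Before calling the loader, it can help to check that all four keys are visible to Python. The following is a minimal sketch of such a check; the helper name `missing_twitter_keys` is made up for illustration and is not part of the DaNLP package.

```python
import os

# The four variable names listed above
REQUIRED_KEYS = [
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
]

def missing_twitter_keys(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]

# An empty list means that all four keys are set in the current environment
print(missing_twitter_keys())
```

If the list is not empty, set the missing variables in the terminal (or through `os.environ`) before loading the dataset.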

In [None]:
from danlp.datasets import TwitterSent
twitSent = TwitterSent()

df_test, df_train_all = twitSent.load_with_pandas()

#### If you want to use your own data, you can also do that. Just make sure you have two columns called 'text' and 'polarity'. 

We select only the two relevant columns of the data

In [4]:
df_train_all = df_train_all[['text', 'polarity']]
df_test = df_test[['text', 'polarity']]

We split the training data into training and evaluation data (80/20), so that we do not select a model and test it on the same data.

In [5]:
df_train = df_train_all.sample(frac=0.8,random_state=200)
df_dev = df_train_all.drop(df_train.index)

## 2. Preprocess data

We make all letters lowercase and remove all punctuation in order to find as many word embeddings and synonyms as possible later in the process. We also remove hashtags and links to webpages.

In [6]:
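def cleantweets(tweet):
    clean_tweets = []
    for i in tweet:
        i = i.lower()
        # remove mentions, hashtags and links first, while '@', ':' and '/' are still in the text
        i = re.sub(r"(@[A-Za-z0-9]+)|(#[A-Za-z0-9]+)|(@ [A-Za-z0-9]+)|(# [A-Za-z0-9]+)|(\w+:\/\/\S+)"," ",i)
        # then remove the remaining punctuation
        i = re.sub(r'[\.\,\"\!\?\:\;\_\=\(\)\|\*\@\&\$\"\’\/\%\+]+',"",i)
        # collapse repeated whitespace
        clean_tweets.append(' '.join(i.split()))
    return clean_tweets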


In [7]:
df_train['text'] = cleantweets(df_train['text'])
df_dev['text'] = cleantweets(df_dev['text'])
df_test['text'] = cleantweets(df_test['text'])
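To see what the cleaning does to a single tweet, here is a small self-contained sketch. It assumes that mentions, hashtags and links are stripped before the punctuation; `clean_one` and the example tweet are made up for illustration.

```python
import re

def clean_one(tweet):
    """Lowercase, strip mentions/hashtags/links, then strip punctuation."""
    tweet = tweet.lower()
    # remove mentions, hashtags and links while '@', ':' and '/' are still present
    tweet = re.sub(r"(@ ?[A-Za-z0-9]+)|(# ?[A-Za-z0-9]+)|(\w+://\S+)", " ", tweet)
    # remove the remaining punctuation
    tweet = re.sub(r'[.,"!?:;_=()|*@&$’/%+]+', "", tweet)
    # collapse repeated whitespace
    return " ".join(tweet.split())

print(clean_one("Sikke en dejlig dag! #sommer @dmi https://t.co/abc123"))
# sikke en dejlig dag
```

The order matters: if the punctuation were removed first, the link would lose its "://" and would no longer match the link pattern.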

## 3.1 Augment the data: Add spelling mistakes

#### This section provides three different ways to add spelling mistakes to the data:
1. To randomly delete words 
2. To randomly delete characters
3. To randomly replace characters

Method 1 takes as input the sentence and the probability (p_w) with which each word is deleted

In [8]:
def random_deletion_word(sentence, p_w):
    words = [word for word in sentence.split(' ') if word != '']
    if len(words) <= 1: # if there is at most one word in the sentence, we will not delete any words
        return ' '.join(words)
    
    new_sentence = [word for word in words if random.uniform(0, 1) > p_w] # remove each word with probability p_w
    
    if len(new_sentence) == 0: # if there are no words left, we return a random word from the sentence
        return random.choice(words)
    new_sentence = ' '.join(new_sentence)
    return new_sentence

In [9]:
random_deletion_word('Denne funktion fjerner ord tilfældigt med den procentsats der er defineret i p_w', p_w=0.05)

'Denne funktion fjerner ord tilfældigt med den procentsats der defineret i p_w'
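Note that p_w is applied to each word independently, so it is a probability rather than an exact share: a short sentence often loses no words at all, while the share only approaches p_w over many words. The sketch below illustrates this with a simplified stand-in for the function above; `delete_words` and the counting sentence are made up for illustration.

```python
import random

def delete_words(sentence, p_w, rng):
    """Drop each word independently with probability p_w (always keeps at least one word)."""
    words = sentence.split()
    if len(words) <= 1:
        return sentence
    kept = [w for w in words if rng.uniform(0, 1) > p_w]
    return " ".join(kept) if kept else rng.choice(words)

rng = random.Random(1)
sentence = "en to tre fire fem seks syv otte ni ti"  # 10 words
trials = 5000

# count how many words survive across all trials
kept_total = sum(len(delete_words(sentence, 0.1, rng).split()) for _ in range(trials))
deleted_share = 1 - kept_total / (10 * trials)

# the overall share is close to 0.1, but individual sentences vary
print(round(deleted_share, 3))
```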

Method 2 takes as input the sentence and the probability (p_c_d) with which each character is deleted

In [10]:
def random_deletion_character(sentence, p_c_d):
    chars = [char for char in sentence]
    if len(chars) <= 1: # if there is at most one char in the sentence, we will not delete any chars
        return sentence
    
    new_sentence = [char for char in chars if random.uniform(0, 1) > p_c_d] # remove each character with probability p_c_d
    
    if len(new_sentence) == 0: # if there are no characters left, we return a random char from the sentence
        return random.choice(chars)
    new_sentence = ''.join(new_sentence)     
    return new_sentence

In [14]:
random_deletion_character('Denne funktion fjerner bogstaver tilfældigt med den procentsats der er defineret i p_c_d', p_c_d=0.05)

'Dnne funktion fjerner ogstaver tilfældigt med den procetsats der er defineet ip_c_d'

Method 3 takes as input the sentence and the probability (p_c_r) with which each character is replaced by a random letter

In [12]:
def character_replacement(sentence, p_c_r):
    alphabet = list('qwertyuiopåasdfghjklæøzxcvbnm') # letters to draw replacements from
    new_sentence = []
    
    for ch in sentence:
        r = random.uniform(0, 1)
        if r > p_c_r:
            new_sentence.append(ch) # keep the character
        else:
            new_sentence.append(random.choice(alphabet)) # replace it (spaces can be replaced too)
            
    new_sentence = ''.join(new_sentence)       
    return new_sentence

In [15]:
character_replacement('Denne funktion udskifter bogstaver med den procentsats der er defineret i p_c_r', p_c_r=0.05)

'Denne funktiqn udskifter bogstaver med denyprocentsais deq er defineret i p_c_r'
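The function above replaces a character with a random letter. Another common typo is two neighbouring characters changing places. The sketch below shows how such a transposition could be added; it is not part of the tutorial's pipeline, and the name `swap_adjacent_characters` and the parameter `p_c_s` are made up for illustration.

```python
import random

def swap_adjacent_characters(sentence, p_c_s, rng=random):
    """With probability p_c_s per position, swap a character with its right neighbour."""
    chars = list(sentence)
    i = 0
    while i < len(chars) - 1:
        if rng.uniform(0, 1) < p_c_s:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a character moves at most one step
        else:
            i += 1
    return "".join(chars)

print(swap_adjacent_characters("denne funktion bytter om på nabobogstaver", p_c_s=0.05))
```

Unlike replacement, a transposition keeps all the original characters, so the result has the same length and the same letters as the input.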

## 3.2 Augment the data: Swap words with word embeddings and/or synonyms

#### The function pos_extraction() takes a sentence as input and returns a list of [pos-tag, word] pairs, one for each token. 

We want to be able to swap words, but we also want the new text to be meaningful and grammatically correct. In the final augmentation functions, the pos-tags are used to ensure that only words with the selected pos-tags are swapped, and that a word is only swapped with a synonym or embedding neighbour that has the same pos-tag.  

In [16]:
def pos_extraction(sentence):
    doc = nlp(sentence)
    words = [token.text for token in doc]
    pos = [token.pos_ for token in doc]
    words_pos = [[i,j] for i, j in zip(pos, words)] 
    return words_pos

In [17]:
pos_extraction('Denne funktion returnerer sætningens pos-tags eller ordklasser som de også kaldes')

[['DET', 'Denne'],
 ['NOUN', 'funktion'],
 ['VERB', 'returnerer'],
 ['NOUN', 'sætningens'],
 ['NOUN', 'pos-tags'],
 ['CCONJ', 'eller'],
 ['NOUN', 'ordklasser'],
 ['ADP', 'som'],
 ['PRON', 'de'],
 ['ADV', 'også'],
 ['VERB', 'kaldes']]

#### The function get_embedding() takes a word as input and returns a list of the words closest to it in the word embedding space. 

Among the nearest neighbours returned by most_similar() (10 by default), it only keeps the words with a similarity score above 0.80 to the original word. This is to ensure meaningful swaps. 

In [18]:
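def get_embedding(word):
    new_embeddings = []
    try:
        # keep the nearest neighbours with a similarity score above 0.80
        new_embeddings = [em[0] for em in word_embeddings.most_similar(word) if em[1] > 0.80] 
        if word in new_embeddings:
            new_embeddings.remove(word)
    except KeyError:
        # the word is not in the embedding vocabulary, so we return an empty list
        pass
    return new_embeddings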


In [19]:
get_embedding('statsminister')

['partiformand',
 'partileder',
 'økonomiminister',
 'fogh',
 'thorning-',
 'finansminister',
 'thorning-schmidt',
 'udenrigsminister',
 'regeringsleder',
 'statminister']
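The similarity score used by gensim's most_similar() is the cosine similarity between word vectors. To make the 0.80 cut-off concrete, here is a minimal sketch of the same filtering on toy data; the two-dimensional vectors and the helper names `cosine` and `similar_words` are invented for illustration and have nothing to do with the real conll17 embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similar_words(word, vectors, threshold=0.80):
    """Words whose vector has a cosine similarity above threshold to the vector of word."""
    if word not in vectors:
        return []  # unknown word: nothing to swap with
    target = vectors[word]
    scored = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, score in scored if score > threshold]

# Toy 2-d vectors, invented for illustration only
toy_vectors = {
    "god": [1.0, 0.0],
    "fin": [0.9, 0.1],
    "dårlig": [0.0, 1.0],
}
print(similar_words("god", toy_vectors))  # ['fin']
```

A high similarity score only says that two words occur in similar contexts. As the 'statsminister' output shows, this includes related words such as other ministers, not only true synonyms.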

#### The function embedding_replacement() takes a sentence, the maximum number of words to swap (n_embed) and up to three pos-tags (pos1, pos2, pos3) as input, and returns the augmented sentence. 

Please note that even if you set n_embed to 5 or even 10, it might not swap that many words. It swaps up to 5 or 10 words, but can only swap as many words as it finds embeddings for. Most of the time the swaps are meaningful, but there will be cases where the swap makes no sense or where the replacement has the opposite meaning of the word it replaces.

This function uses pos_extraction() and get_embedding(). 

In [20]:
def embedding_replacement(sentence, n_embed, pos1,pos2=None,pos3=None):
    # extract the [pos-tag, word] pairs that have one of the selected pos-tags
    words_to_swap = [i for i in pos_extraction(sentence) if i[0] in {pos1,pos2,pos3}]
    
    new_sentence = [word for word in sentence.split(' ') if word != ''] # prepare for swapping
    num_replaced = 0
    for pos in random.sample(words_to_swap, len(words_to_swap)): # visit the candidate words in random order
        embeddings = get_embedding(pos[1])
        embed_pos_list = pos_extraction(' '.join(embeddings)) # get embeddings and the pos-tag for embeddings
        embed_to_be_used = [em[1] for em in embed_pos_list if em[0] == pos[0]] # keep only the same pos-tag
        if len(embed_to_be_used) >= 1:  # swap only if at least one candidate is left
            new_sentence = [random.choice(embed_to_be_used) if word == pos[1] else word for word in new_sentence]
            num_replaced += 1   
        if num_replaced >= n_embed: # stop after n_embed swaps
            break
    new_sentence = ' '.join(new_sentence)
    return new_sentence

In [24]:
embedding_replacement('det er sjovt at eksperimentere med sproget og skifte ordene ud', n_embed=5, pos1='NOUN',pos2='ADJ',pos3='ADV')

'det er sjovest at eksperimentere med sproget og skifte sætningen ud'
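The core of the replacement logic is the pos-tag filter: a candidate is only used if it has the same pos-tag as the word it replaces. The sketch below isolates that step with small lookup tables instead of spaCy and word embeddings. The tables `TOY_POS` and `TOY_NEIGHBOURS` and the function `replace_same_pos` are invented for illustration.

```python
import random

# Toy lookup tables, invented for illustration; the tutorial uses spaCy and word embeddings instead
TOY_POS = {"sjovt": "ADJ", "sjovest": "ADJ", "morsomt": "ADJ", "sjov": "NOUN",
           "ordene": "NOUN", "sætningen": "NOUN", "skifte": "VERB"}
TOY_NEIGHBOURS = {"sjovt": ["sjovest", "sjov", "morsomt"], "ordene": ["sætningen", "skifte"]}

def replace_same_pos(sentence, allowed_pos, n_swaps, rng):
    """Swap up to n_swaps words for a neighbour that shares their pos-tag."""
    words = sentence.split()
    candidates = [w for w in words if TOY_POS.get(w) in allowed_pos]
    rng.shuffle(candidates)  # visit the candidate words in random order
    swapped = 0
    for word in candidates:
        # keep only neighbours that share the pos-tag of the word being replaced
        options = [n for n in TOY_NEIGHBOURS.get(word, []) if TOY_POS.get(n) == TOY_POS[word]]
        if options:
            choice = rng.choice(options)
            words = [choice if w == word else w for w in words]
            swapped += 1
        if swapped >= n_swaps:
            break
    return " ".join(words)

print(replace_same_pos("det er sjovt at skifte ordene ud", {"NOUN", "ADJ"}, 2, random.Random(1)))
```

Here 'sjov' (a noun) is never chosen for the adjective 'sjovt', and 'skifte' (a verb) is never chosen for the noun 'ordene', even though both appear in the neighbour lists.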

#### The function synonym_replacement() takes a sentence, the maximum number of words to swap (n_syno) and up to three pos-tags (pos1, pos2, pos3) as input, and returns the augmented sentence. 

Please note that even if you set n_syno to 5 or even 10, it might not swap that many words. It swaps up to 5 or 10 words, but can only swap as many words as it finds synonyms for. Most of the time the swaps are meaningful, but there will be cases where the swap makes no sense because a word can have two different pos-tags depending on the context. 

In [25]:
def synonym_replacement(sentence, n_syno, pos1,pos2=None,pos3=None):
    # extract the [pos-tag, word] pairs that have one of the selected pos-tags
    words_to_swap = [i for i in pos_extraction(sentence) if i[0] in {pos1,pos2,pos3}]
    
    new_sentence = [word for word in sentence.split(' ') if word != ''] # prepare for swapping
    num_replaced = 0
    for pos in random.sample(words_to_swap, len(words_to_swap)): # visit the candidate words in random order
        syno_pos_list = pos_extraction(' '.join(dannet.synonyms(pos[1]))) # get synos and the pos-tag for synos
        syno_to_be_used = [syn[1] for syn in syno_pos_list if syn[0] == pos[0]] # keep only the same pos-tag
        if len(syno_to_be_used) >= 1: # swap only if at least one candidate is left
            new_sentence = [random.choice(syno_to_be_used) if word == pos[1] else word for word in new_sentence]
            num_replaced += 1   
        if num_replaced >= n_syno: # stop after n_syno swaps
            break
    new_sentence = ' '.join(new_sentence)
    return new_sentence

In [32]:
synonym_replacement('Det er en svær opgave at arbejde med sprog og teknologi', n_syno=5, pos1='NOUN',pos2='ADJ',pos3='ADV')

'Det er en vanskelig funktion at arbejde med sprogbrug og teknologi'

## 3.3 Augment the data: Putting it all together

#### This function takes a dataframe with the columns 'text' and 'polarity' as input, as well as up to three pos-tags to swap words from. It also takes the number of embeddings and the number of synonyms to swap. Finally, it takes the number of augmented sentences to create per sentence (n_sen), the probability of deleting a word (p_w), and the probabilities of deleting (p_c_d) and replacing (p_c_r) a character. It returns a new dataframe with both the original sentences and the augmented sentences. 

You have to specify at least one pos-tag; the other two are optional. You can select all or only some of the augmentation methods through the arguments. The default for the optional augmentation arguments (n_embed, n_syno, p_w, p_c_d, p_c_r) is 0, in which case the respective augmentation is skipped. Be aware that swapping with embeddings is very time-consuming.  

In [33]:
def augment_sentences(df,pos1,pos2=None,pos3=None, n_embed=0, n_syno=0, p_w=0, p_c_d=0, p_c_r=0, n_sen=1):
    augmented_sentences=[]
    polarity=[]
    for i in df['text']:
        augmented_sentences.append(i) # keep the original sentence
        for _ in range(n_sen):
            # note: each round starts from the output of the previous round,
            # so later copies drift further away from the original sentence
            if n_embed >= 1:
                i = embedding_replacement(i,n_embed,pos1,pos2,pos3)
            if n_syno >= 1:
                i = synonym_replacement(i,n_syno,pos1,pos2,pos3)
            if p_w > 0:
                i = random_deletion_word(i, p_w)
            if p_c_d > 0:
                i = random_deletion_character(i, p_c_d)
            if p_c_r > 0:
                i = character_replacement(i, p_c_r)
            augmented_sentences.append(i)
        
    for j in df['polarity']:
        # repeat each label for the original sentence and its n_sen augmented copies
        for _ in range(n_sen+1):
            polarity.append(j)    
    total_df = pd.DataFrame({'text':augmented_sentences,'polarity':polarity})
    return total_df

In this tutorial, we swap both synonyms and embeddings and add small spelling mistakes to the same sentence. This is done to achieve the most variation, so that the augmented sentences are not too similar to the original sentence and hopefully have a better chance of improving the model. 

We try two different combinations of pos-tags because the Danish resources are limited and we aim for as much variation as possible in the augmented sentences. 

In total we end up with 8 new sentences for each original sentence. This number is based on the recommendations from Wei & Zou (2019) in their paper "EDA: Easy data augmentation techniques for boosting performance on text classification tasks", while taking into account that the Danish resources are less developed than the English ones. Therefore we create 8 augmented sentences instead of the 12 that their recommendations would suggest for ~1000 sentences. 
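The count of 8 follows from running the augmentation twice (once per pos-tag combination) with n_sen=4. The sketch below spells out the arithmetic; the training set size `n_train` is a hypothetical number chosen for illustration, and `rows_returned` is a made-up helper.

```python
def rows_returned(n_sentences, n_sen):
    """Rows returned by one augment_sentences call: each original plus n_sen augmented copies."""
    return n_sentences * (n_sen + 1)

n_train = 800  # hypothetical number of training sentences, for illustration only
n_sen = 4      # augmented copies per sentence in each call
runs = 2       # two pos-tag combinations

augmented_per_sentence = runs * n_sen
rows_before_dedup = runs * rows_returned(n_train, n_sen)
# each original appears once per run, so at most this many rows remain
# after duplicate texts are dropped
rows_after_dedup_max = n_train * (1 + augmented_per_sentence)

print(augmented_per_sentence)  # 8
print(rows_before_dedup)       # 8000
print(rows_after_dedup_max)    # 7200
```

The last number is an upper bound: when an augmentation round changes nothing, or two rounds produce the same text, those rows are dropped as duplicates as well.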

In [None]:
df_noun_adj_adv = augment_sentences(df_train, pos1='NOUN',pos2='ADJ',pos3='ADV', n_embed=5, n_syno=5, p_w=0.01, p_c_d=0.005, p_c_r=0.005,n_sen=4)

In [None]:
df_verb_propn_adv = augment_sentences(df_train, pos1='VERB',pos2='PROPN',pos3='ADV', n_embed=5, n_syno=5, p_w=0.01, p_c_d=0.005, p_c_r=0.005,n_sen=4)

In [None]:
#combine the two created dataframes
df_ori_aug = pd.concat([df_noun_adj_adv, df_verb_propn_adv], ignore_index=True)

As the function keeps the original sentence, we now have each original sentence twice. Therefore we drop the duplicates.

In [None]:
df_ori_aug.sort_values("text", inplace = True) 
df_ori_aug.drop_duplicates(subset ="text", 
                     keep = 'first', inplace = True) 

## 4. Prepare the data for training and evaluation with SpaCy

#### In this section, we prepare the data for training two spaCy sentiment models: one on only the original data (df_train) and one on the original data plus the augmented data (df_ori_aug). 

To do so, we convert the polarity labels to integers, and the prepare_data function converts the data to the json format. 

In [35]:
df_train['polarity'] = df_train['polarity'].replace(['positiv','negativ','neutral'],[0,2,1])
df_ori_aug['polarity'] = df_ori_aug['polarity'].replace(['positiv','negativ','neutral'],[0,2,1])
df_dev['polarity'] = df_dev['polarity'].replace(['positiv','negativ','neutral'],[0,2,1])

In [None]:
nlp_prep = load_spacy_model()
nlp_prep.disable_pipes(*nlp_prep.pipe_names)
sentencizer = nlp_prep.create_pipe("sentencizer")
nlp_prep.add_pipe(sentencizer, first=True)
# function to read pandas dataFrame and save as json format expected by spaCy
def prepare_data(df, outputfile):
    # choose the name of the columns containing the text and labels
    label='polarity'
    text = 'text'
    def put_cat(x):
        # adapt the name and amount of labels
        return {'positiv': bool(x==0), 'neutral': bool(x==1), 'negativ': bool(x==2)} 
    
    cat = list(df[label].map(put_cat))
    texts, cats= (list(df[text]), cat)
    
    #Create the container doc object
    docs = []
    for i, doc in enumerate(nlp_prep.pipe(texts)):
        doc.cats = cats[i]
        docs.append(doc)
    # write the data to json file
    srsly.write_json(outputfile,[docs_to_json(docs)])

In [None]:
prepare_data(df_train, 'ori.json')
prepare_data(df_ori_aug, 'syn_ori_aug.json')
prepare_data(df_dev, 'dev.json')
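The label handling can be summarised as a round trip: an integer label becomes a dictionary with one True entry per document, and at prediction time the label with the highest score is picked from such a dictionary. The following is a minimal sketch of that round trip; `LABELS` and `top_label` are names made up for illustration, and the scores in the last line are invented.

```python
import operator

LABELS = ["positiv", "neutral", "negativ"]  # index = integer code used in the tutorial

def put_cat(code):
    """Integer label -> spaCy-style cats dict with exactly one True entry."""
    return {name: code == idx for idx, name in enumerate(LABELS)}

def top_label(cats):
    """cats dict (booleans or scores) -> label with the highest value."""
    return max(cats.items(), key=operator.itemgetter(1))[0]

print(put_cat(2))             # {'positiv': False, 'neutral': False, 'negativ': True}
print(top_label(put_cat(2)))  # negativ
print(top_label({"positiv": 0.1, "neutral": 0.2, "negativ": 0.7}))  # negativ
```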

## 5. Train two models: with and without the augmented data

#### We are now ready to train the models. You need to open your terminal (and activate your virtual environment if you are working in one) and go to the folder where this tutorial is placed. 

In [None]:
# You need to find the directory (DEFAULT_CACHE_DIR) of the spacy model and word embeddings that you will train on. 
from danlp.download import DEFAULT_CACHE_DIR
DEFAULT_CACHE_DIR

Change 'DEFAULT_CACHE_DIR' in the following two commands to the path printed by the cell above. The commands specify that a Danish model should be trained (da), which files hold the training and evaluation data, that a text classification model (textcat) is trained, which vectors to use, and how many iterations to run. 


#### When you have done that, copy the commands into the terminal one at a time and run the training. 

#### From the output in the terminal, you can see which of the 10 models performs the best. 

We want to choose the one with the highest F1-score and lowest loss. The best model from each training run is:

model_ori: model5         loss: 0.114   f1_score: 53.684

model_ori_aug: model5     loss: 0.002   f1_score: 50.829

#### In this example, the slightly better model is model5 trained on only the original data, so we test the performance of this model further. 
You might get different results depending on your choices in the augmentation process.

## 6. Test the performance of the better performing model

#### We load the model and use it to predict the sentiment of each test tweet, so that the prediction can be compared with the true sentiment.  

We then evaluate with a classification report that gives precision, recall and f1-score for each sentiment label, as well as a macro average of each measure.  

In [36]:
#load the trained model
output_dir ='model_ori/model5'
new_nlp = spacy.load(output_dir)

def predict(x):
    doc = new_nlp(x)
    return max(doc.cats.items(), key=operator.itemgetter(1))[0]

result = pd.DataFrame({'pred':[predict(i) for i in df_test['text']],'true':df_test['polarity'],'text':df_test['text']})


In [37]:
result['true'] = result['true'].replace([0,2,1],['positiv','negativ','neutral'])
print(classification_report(result['true'],result['pred']))

              precision    recall  f1-score   support

     negativ       0.64      0.87      0.74       271
     neutral       0.48      0.19      0.27       120
     positiv       0.53      0.40      0.46       121

    accuracy                           0.60       512
   macro avg       0.55      0.49      0.49       512
weighted avg       0.58      0.60      0.56       512



The model is clearly better at negative sentiment (f1-score 0.74) than at neutral sentiment (0.27), with positive sentiment in between (0.46).
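To make the columns of the report concrete, the sketch below computes precision, recall and f1-score for each label by hand, together with the macro average (the unweighted mean over the labels). The labels in `true` and `pred` are made up for illustration and are not taken from the test set; `per_label_scores` is a hypothetical helper.

```python
def per_label_scores(true, pred, label):
    """Precision, recall and F1 for one label, computed from raw counts."""
    tp = sum(t == label and p == label for t, p in zip(true, pred))
    fp = sum(t != label and p == label for t, p in zip(true, pred))
    fn = sum(t == label and p != label for t, p in zip(true, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels, for illustration only
true = ["negativ", "negativ", "negativ", "neutral", "positiv", "positiv"]
pred = ["negativ", "negativ", "positiv", "negativ", "positiv", "neutral"]

scores = {label: per_label_scores(true, pred, label) for label in sorted(set(true))}
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)

for label, (p, r, f1) in scores.items():
    print(f"{label:8s} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro f1={macro_f1:.2f}")
```

Because the macro average weights all labels equally, a weak label such as neutral pulls it down even when the largest class is predicted well.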

## 7. Test negations

#### Sentiment models usually have a hard time predicting the sentiment of sentences with negations. Therefore, in this section we test whether we can reduce this bias in the model's predictions by exposing the model to many different examples of such sentences. 

We take the better performing model from the previous experiment and compare it to a model trained on the same data plus constructed negation sentences that are both positive and negative. 

When both have been trained, we test their performance on ambiguous sentences with negations, but also on the test dataset we used previously. This way we explore whether the extra sentences with negations can improve the performance of the model. 

In [40]:
# Making lists to create different sentences
adj_pos=['glad','fornøjet','dejlig','tilpas','demokratisk','dygtig','flot','grundig','klog','menneskelig','opmærksom','sød','venlig','imødekommende']
adj_neg = ['sur','irriteret','utilfreds','negativ','frustreret','vred','ucharmerende','voldsom','utilgængelig','utilnærmelig']
adj_beskrivende = ['skøn','fremragende','dejlig','behagelig','underholdende']

adv = ['rigtig','meget','lidt','noget','virkelig','umådelig','rimelig','forholdsvis','temmelig']

noun_person = ['manden', 'drengen','damen','Hans','Ulla','Lotte','Carsten','han','hun','pigen',]
noun_profession = ['student','læge','pædagog','håndværker','bygmester','arkitekt','politiker','datalog','mor','far','musiker','underviser']
noun_ting = ['biograftur', 'middag','sejltur','oplevelse','debat']
noun_ting2 = ['maden','selskabet','servicen','stemningen','personalet']

verb_neg = ['afskyr','hader','foragter','ringeagter']
verb_pos = ['danse','synge','debattere','arbejde','spise','snakke']

#### Negations for training

In [41]:
# looping through different combinations of sentences
negations = []
label = []
words= []
l = [adj_pos, adj_neg, adj_beskrivende, adv, noun_person, noun_profession, noun_ting, noun_ting2,verb_neg, verb_pos]
sen_pos = ['{} er ikke {} i dag','{} er {} i dag','{} er en {} {}','Det har været en {} {}','Jeg vil {} mere','Hvem vil ikke gerne være en {} {}?','Der er ingen grund til at være {} når {} er så godt']
sen_neg = ['{} er ikke {} i dag','{} er {} i dag','{}! Men ellers tak','Hvis du er {}, kan du ikke være med','Jeg vil aldrig {} mere','hvem vil {}? Absolut ikke mig!', 'Jeg {} folk der ikke viser respekt']
for i in range(10):
    w = [random.choice(x) for x in l]
    negations.append(sen_pos[0].format(w[4],w[1]))
    negations.append(sen_pos[1].format(w[4],w[0]))
    negations.append(sen_pos[2].format(w[4],w[0],w[5]))
    negations.append(sen_pos[3].format(w[2],w[6]))
    negations.append(sen_pos[4].format(w[9]))
    negations.append(sen_pos[5].format(w[0],w[5]))
    negations.append(sen_pos[6].format(w[1],w[7]))   
    label += len(sen_pos) * [0] #positiv
    negations.append(sen_neg[0].format(w[4],w[0]))
    negations.append(sen_neg[1].format(w[4],w[1]))
    negations.append(sen_neg[2].format(w[2]))
    negations.append(sen_neg[3].format(w[1]))
    negations.append(sen_neg[4].format(w[9]))
    negations.append(sen_neg[5].format(w[9]))
    negations.append(sen_neg[6].format(w[8]))
    label += len(sen_neg) * [2] #negativ


In [42]:
# saving as a dataframe
negations_train = pd.DataFrame({'text':negations,'polarity':label})

#### Negations for testing

In [43]:
negations = []
label = []
l = [adj_pos, adj_neg, adj_beskrivende, adv, noun_person, noun_profession, noun_ting,noun_ting2,verb_neg, verb_pos]
sen_pos = ['{} er {} {}','Den {} gjorde at en {} begyndte at {} meget','Efter en gåtur var {} i en {} stemning','En {} {} kan løfte humøret på selv en ikke så lidt {} gammel mand']
sen_neg = ['{} er ikke {} {}','Den {} gjorde at en {} ikke gad at {}','Det har været en {} {}. Men med det sagt, {} jeg virkelig {}','Jeg vil ikke have hjælp fra {}']
for i in range(10):
    w = [random.choice(x) for x in l]
    negations.append(sen_pos[0].format(w[7],w[3],w[2]))
    negations.append(sen_pos[1].format(w[6],w[5],w[9]))
    negations.append(sen_pos[2].format(w[4],w[2]))
    negations.append(sen_pos[3].format(w[0],w[5],w[1]))
    label += len(sen_pos) * [0] #positiv
    negations.append(sen_neg[0].format(w[7],w[3],w[2]))
    negations.append(sen_neg[1].format(w[6],w[5],w[9]))
    negations.append(sen_neg[2].format(w[2],w[6],w[8],w[7]))
    negations.append(sen_neg[3].format(w[4]))
    label += len(sen_neg) * [2] #negativ

In [44]:
# saving as a dataframe
negations_test = pd.DataFrame({'text':negations,'polarity':label})

We append the negations to the dataset that returned the better performance. 

In [45]:
df_ori_negations = pd.concat([df_train, negations_train], ignore_index=True)

We reload the spaCy pipeline used for data preparation, and the prepare_data function converts the data to json format. 

In [None]:
# load the spaCy model with the adjusted pipeline for the preparation of data to work
nlp_prep = load_spacy_model()
nlp_prep.disable_pipes(*nlp_prep.pipe_names)
sentencizer = nlp_prep.create_pipe("sentencizer")
nlp_prep.add_pipe(sentencizer, first=True)

In [None]:
prepare_data(df_ori_negations, 'ori_negations.json')

#### As before, open the terminal and type in the following (remember to change 'DEFAULT_CACHE_DIR'):

The better model here is model8 with loss: 0.032   f1_score: 48.438

#### Now we can load both models and test the performance of the models on the created negation sentences for test

In [46]:
#load the models
output_dir1 ='model_ori/model5'
nlp = spacy.load(output_dir1)

output_dir2 ='model_ori_neg/model8'
nlp_negation = spacy.load(output_dir2)



In [47]:
#use the models for predictions
def predict(x):
    doc = nlp(x)
    return max(doc.cats.items(), key=operator.itemgetter(1))[0]

def predict_negation(x):
    doc = nlp_negation(x)
    return max(doc.cats.items(), key=operator.itemgetter(1))[0]

result_negations = pd.DataFrame({'pred':[predict(i) for i in negations_test['text']],'pred_negations':[predict_negation(i) for i in negations_test['text']], 'true':negations_test['polarity'],'text':negations_test['text']})
result_original = pd.DataFrame({'pred':[predict(i) for i in df_test['text']],'pred_negations':[predict_negation(i) for i in df_test['text']], 'true':df_test['polarity'],'text':df_test['text']})

In [48]:
result_negations['true'] = result_negations['true'].replace([0,2,1],['positiv','negativ','neutral'])

print(classification_report(result_negations['true'],result_negations['pred'],labels=['positiv','negativ']))
print(classification_report(result_negations['true'],result_negations['pred_negations'],labels=['positiv','negativ']))

              precision    recall  f1-score   support

     positiv       0.62      0.53      0.57        40
     negativ       0.47      0.42      0.45        40

   micro avg       0.54      0.47      0.51        80
   macro avg       0.54      0.47      0.51        80
weighted avg       0.54      0.47      0.51        80

              precision    recall  f1-score   support

     positiv       0.57      0.85      0.68        40
     negativ       0.70      0.35      0.47        40

    accuracy                           0.60        80
   macro avg       0.63      0.60      0.57        80
weighted avg       0.63      0.60      0.57        80



Here we see that the effect of training with more negation sentences is ambiguous. Even though the averaged performance is better, the predictions seem to be skewed towards positive: recall for the positive sentiment rises from 0.53 to 0.85 in the model trained on negation sentences, while recall for the negative sentiment falls from 0.42 to 0.35. The added sentences may therefore have worked too well in teaching the model to accept negations in sentences with positive sentiment. 
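The skew can be made visible with a back-of-the-envelope calculation from the second report: recall times support gives the number of correct predictions for a label, and dividing that by precision gives the total number of times the label was predicted. The sketch below does this; the results are approximate, since the report rounds its values to two decimals.

```python
support = 40  # sentences per class in the negation test set

# Values read from the second report (model trained with negation sentences)
reported = {"positiv": {"precision": 0.57, "recall": 0.85},
            "negativ": {"precision": 0.70, "recall": 0.35}}

predicted_counts = {}
for label, m in reported.items():
    true_positives = round(m["recall"] * support)                     # recall = TP / support
    predicted_counts[label] = round(true_positives / m["precision"])  # precision = TP / predicted

print(predicted_counts)  # {'positiv': 60, 'negativ': 20}
```

So the model trained with negation sentences labels roughly 60 of the 80 test sentences as positive, although only 40 of them are.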

#### We also test the performance on the original test set to evaluate whether the negation sentences have improved the general performance of the model.

In [49]:
result_original['true'] = result_original['true'].replace([0,2,1],['positiv','negativ','neutral'])

print(classification_report(result_original['true'],result_original['pred']))
print(classification_report(result_original['true'],result_original['pred_negations']))

              precision    recall  f1-score   support

     negativ       0.64      0.87      0.74       271
     neutral       0.48      0.19      0.27       120
     positiv       0.53      0.40      0.46       121

    accuracy                           0.60       512
   macro avg       0.55      0.49      0.49       512
weighted avg       0.58      0.60      0.56       512

              precision    recall  f1-score   support

     negativ       0.64      0.85      0.73       271
     neutral       0.50      0.14      0.22       120
     positiv       0.48      0.47      0.47       121

    accuracy                           0.59       512
   macro avg       0.54      0.49      0.48       512
weighted avg       0.57      0.59      0.55       512



Again, we see the tendency towards a higher recall on positive sentiment (0.47 versus 0.40). Apart from that, the negation sentences have not changed the performance of the model much. 