This notebook serves as a basis and reference for implementing text augmentation with the nlpaug library.


Inspired by: https://www.kaggle.com/code/andypenrose/text-augmentation-with-nlpaug

In [1]:
%load_ext autoreload
%autoreload 2

## Dataset

In [2]:
import pandas as pd

df = pd.read_csv("../../data/silver/df_cleantext_v0.csv")
df

Unnamed: 0,Category,Message
0,0,go until jurong point crazy available only in ...
1,0,ok lar joking wif u oni
2,1,free entry in number a wkly comp to win fa cup...
3,0,u dun say so early hor u c already then say
4,0,nah i dont think he goes to usf he lives aroun...
...,...,...
5152,1,this is the numbernd time we have tried number...
5153,0,will you b going to esplanade fr home
5154,0,pity was in mood for that soany other suggestions
5155,0,the guy did some bitching but i acted like id ...


In [3]:
import os
import sys
sys.path.append(os.path.abspath("../utils"))
from experiments_utils import print_and_highlight_diff

## Chunk for analysis

In [4]:
pd.set_option('display.max_colwidth', None)

sample = df['Message'].iloc[:5].astype(str).to_list()


In [5]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.word.context_word_embs as nawcwe
import nlpaug.augmenter.word.word_embs as nawwe
import nlpaug.augmenter.word.spelling as naws

## KeyboardAug method

In [6]:
aug = nac.KeyboardAug()
augmented_texts = aug.augment(sample)

print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go [31muGtiK[39m jurong point [31mcraAt[39m [31mavai<QblF[39m only in [31mbutiD[39m n [31mfreaf[39m world la e [31mbIff3t[39m cine there got amore wat 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok lar [31muoOing[39m wif u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
[31mCGee[39m [31memtrt[39m in number a [31mwllH[39m [31mcoNl[39m to win fa cup [31mBinQl[39m tkts numberst may [31mn&mbFr[39m [31mtDxF[39m fa to number to receive entry [31mquSstioGcGd[39m txt ratetcs apply [31mM&mbeDoverjumbefw[39m 
-----------------------------

Notes: 
1. The original cleaning criteria ignored special characters and numbers, so the text was cleaned accordingly. These two parameters should therefore be disabled, and the same goes for upper-case characters.
2. Several characters are changed in each modified word, so the results don't look realistic imo.


In [7]:
aug = nac.KeyboardAug(aug_word_p=0.1, include_numeric=False, include_special_char=False, include_upper_case=False)
augmented_texts = aug.augment(sample)

print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go until [31mnugong[39m point crazy available only in bugis n great world la e [31mgiffet[39m cine there got amore wat 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok lar [31mjoukng[39m wif u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
[31mffde[39m entry in number a wkly comp to win fa cup final tkts numberst may number text fa to [31mnumvsr[39m to receive entry questionstd txt ratetcs apply [31mnumnwrkvernumndfs[39m 
--------------------------------------------------
Original: 11
u dun say so early hor u c already then say
Augmented: 11
u dun s

Notes: 

The misspellings are still heavy-handed and artificial; they don't look like typos a person would actually make.
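To make "realistic" a bit more concrete, here is a minimal sketch of a typo that changes at most one character per word by swapping it for an adjacent key. The `NEIGHBOURS` map and the `single_key_typo` helper are hypothetical names written for this note, not part of nlpaug. The library's `aug_char_max` parameter may give similar control, but I have not verified that here.

```python
import random

# Hypothetical, partial QWERTY neighbour map (illustration only, not nlpaug's table).
NEIGHBOURS = {
    "a": "qwsz", "e": "wrsd", "i": "uojk", "o": "ipkl",
    "n": "bhjm", "r": "etdf", "s": "awedxz", "t": "rfgy",
}

def single_key_typo(word, rng):
    """Replace at most one character of `word` with an adjacent key."""
    positions = [i for i, ch in enumerate(word) if ch in NEIGHBOURS]
    if not positions:
        return word  # no character we know how to perturb
    i = rng.choice(positions)
    return word[:i] + rng.choice(NEIGHBOURS[word[i]]) + word[i + 1:]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
typo = single_key_typo("number", rng)
print(typo)
```

The result always has the same length as the input and differs from it in exactly one position, which is closer to a slip of the finger than the multi-character changes above.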


In [8]:
# help(nac.KeyboardAug)
# help(aug.augment)

## SpellingAug method

This method substitutes words using a dictionary of common spelling mistakes.

In [9]:
aug = naws.SpellingAug()
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go until jurong point [31mcraezy[39m available only in bugis n [31mgrait[39m world la [31mand[39m buffet [31mcinema[39m [31mtrere[39m got amore [31mwant[39m 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
[31mOK)][[...[39m lar joking [31mwife[39m u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
[31mfreer[39m entry [31mil[39m [31mlnumber[39m [31mal[39m wkly comp [31mtoo.[39m [31mwind[39m fa [31mcoop[39m final tkts numberst may [31mnumenber[39m text fa [31mtj[39m number to receive entry questionstd txt ratetcs apply numberovernumbe

Notes: 

This method creates more realistic results than the previous technique.

In [10]:
aug = naws.SpellingAug(aug_p = 0.5)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
[31mgou[39m until jurong point [31mcarzy[39m [31mavalible[39m [31mony[39m in bugis [31min[39m [31mgrate[39m world la [31mHe[39m buffet [31mcinema[39m there [31mgotten[39m amore [31meat[39m 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
[31mOk[39m [31mlaw[39m joking wif [31mYou[39m oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
free entry [31mim[39m [31mnubmer[39m a wkly comp [31mot[39m [31mwinne[39m fa cup final tkts numberst [31mmaybe[39m [31mnamber's[39m text fa [31mte[39m [31mnummer[39m [31mro[39m receive entry ques

Notes:

1. The results are way more realistic.
2. The dictionary includes upper-case entries (and some punctuation, e.g. "namber's") that should be handled later on.
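One way to handle the upper-case and punctuation entries is to pass the augmented text through the same kind of cleaning as the original data. The original cleaning code is not shown in this notebook, so the sketch below is only an assumed approximation (lower-case, letters and spaces only), and `renormalise` is a hypothetical helper name.

```python
import re

def renormalise(text):
    """Lower-case the text and keep only letters separated by single spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)      # drop digits, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

# Augmented strings taken from the outputs above.
print(renormalise("OK)][[... lar joking wife u oni"))
print(renormalise("Ok law joking wif You oni"))
```

Applied after `aug.augment(...)`, this keeps the augmented messages in the same character set as the cleaned dataset.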


In [11]:
#help(naws.SpellingAug())
# help(aug.augment)

## SynonymAug

Substitutes words with synonyms taken from WordNet or PPDB.

WordNet is the default.

In [12]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/maldu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [13]:
aug = naw.SynonymAug()
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 21
go [31mbad[39m [31muntil[39m [31mjurong[39m [31mperiod[39m [31mweirdo[39m [31mavailable[39m [31monly[39m [31min[39m [31mbugis[39m [31mn[39m [31mgreat[39m [31mworld[39m [31mla[39m [31me[39m [31mbuffet[39m [31mcine[39m [31mthere[39m [31mget[39m [31mamore[39m [31mwat[39m 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok lar joking wif u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 29
free [31mentrance[39m in number a wkly [31mcomprehensive[39m to win fa cup [31mlast[39m tkts numberst may [31mtelephone[39m [31mnumber[

Notes:

1. Some augmentations add extra words, and I don't understand why:

- Original: 11
- u dun say so early hor u c already then say
- Augmented: 12
- u dun say so other hor u ampere second already then say 

2. The substitutions look more realistic to me.
3. According to the function's parameters, it provides two synonym sources: 'wordnet' and 'ppdb'.
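A likely explanation for the extra words in note 1 is that WordNet lemma names can be multi-word expressions joined by underscores or hyphens, so a single token such as "c" can be replaced by "ampere second". I have not confirmed this against nlpaug's source, so treat it as an assumption. The sketch below uses a hand-written candidate list (not queried from WordNet) and two hypothetical helpers to show how such candidates could be filtered out.

```python
def to_tokens(lemma_name):
    """Split a WordNet-style lemma name into the tokens it would add to a sentence."""
    return lemma_name.replace("_", " ").replace("-", " ").lower().split()

def single_word_only(candidates):
    """Keep only candidates that replace one token with exactly one token."""
    return [c for c in candidates if len(to_tokens(c)) == 1]

# Hypothetical candidates for the token "c" (illustrative, not real WordNet output).
candidates = ["ampere-second", "coulomb", "carbon", "degree_centigrade"]
print(to_tokens("ampere-second"))
print(single_word_only(candidates))
```

Restricting substitutions to single-word candidates would keep the token count of each message unchanged.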

In [14]:
aug = naw.SynonymAug(aug_p= 0.3)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 21
go until jurong [31mfull[39m [31mstop[39m [31mcrazy[39m [31mavailable[39m [31mentirely[39m [31min[39m [31mbugis[39m [31mn[39m [31mgreat[39m [31mreality[39m [31mlanthanum[39m [31me[39m [31mbuffet[39m [31mcine[39m [31mthere[39m [31mgot[39m [31mamore[39m [31mwat[39m 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 7
[31mall[39m [31mright[39m [31mlar[39m [31mjoking[39m [31mwif[39m [31mu[39m [31moni[39m 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 29
[31mgratuitous[39m entry in number a wkly comp to win fa cup final tkts num

Notes: 

1. Indeed, we should be careful with the augmentation percentage (`aug_p`), because synonyms can add new words and change the length of the sentence.
2. Some synonyms don't fit the meaning of the sentence imo.

This cannot be used unless we find a way to force 1-to-1 swapping, and perhaps not even then. We should also restrict swaps to absolute synonyms :\
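Until a true 1-to-1 swap is available, one workaround is to discard any augmented message whose token count differs from the original. This is a minimal sketch under that assumption. `same_length_pairs` is a hypothetical helper, and the second augmented string is made up for illustration.

```python
def same_length_pairs(originals, augmented):
    """Keep only (original, augmented) pairs with the same whitespace token count."""
    return [
        (o, a) for o, a in zip(originals, augmented)
        if len(o.split()) == len(a.split())
    ]

originals = [
    "ok lar joking wif u oni",
    "u dun say so early hor u c already then say",
]
augmented = [
    "all right lar joking wif u oni",               # from the output above: 7 tokens vs 6
    "u dun say so other hor u c already then say",  # hypothetical 1-to-1 result
]
kept = same_length_pairs(originals, augmented)
print(len(kept))
```

This only enforces length; it does nothing about synonyms that change the meaning of the sentence.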

In [15]:
# help(naw.SynonymAug())

## WordEmbeddingAug method

This method inserts or substitutes words chosen by word-embedding similarity (only `action="substitute"` is used below).

:param str model_type: Model type of word embeddings. Expected values include 'word2vec', 'glove' and 'fasttext'.
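As a rough picture of what "similarity" means here, the sketch below replaces a word with its nearest neighbour by cosine similarity. The 3-dimensional vectors are invented for illustration, and all function names are hypothetical. Real models use hundreds of dimensions, and nlpaug's actual selection among similar words may differ from this top-1 choice.

```python
import math

# Toy vectors invented for illustration only.
VECTORS = {
    "free":   [0.9, 0.1, 0.0],
    "gratis": [0.8, 0.2, 0.1],
    "win":    [0.1, 0.9, 0.2],
    "prize":  [0.2, 0.8, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word):
    """Return the most similar other word in the toy vocabulary."""
    others = [w for w in VECTORS if w != word]
    return max(others, key=lambda w: cosine(VECTORS[word], VECTORS[w]))

def substitute(sentence, word):
    """Replace every occurrence of `word` with its nearest neighbour."""
    return " ".join(nearest(t) if t == word else t for t in sentence.split())

print(substitute("free entry to win", "free"))
```

Because the replacement depends only on vector proximity, words that are close in the embedding space but not true synonyms can still be swapped in.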

### Download models

The models used come from the gensim library because local resources are limited.


In [None]:
from nlpaug.util.file.download import DownloadUtil

DownloadUtil.download_word2vec(dest_dir='./models')
DownloadUtil.download_glove('glove.6B', './models')
DownloadUtil.download_fasttext('wiki-news-300d-1M', './models')


### word2vec

Too heavy for my machine; leaving it to run on GitHub Actions.

In [17]:

aug = naw.WordEmbsAug(
    model_type='word2vec', model_path='./models/GoogleNews-vectors-negative300.bin',
    action="substitute")
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)


2025-01-03 17:02:05,707 - INFO - loading projection weights from ./models/GoogleNews-vectors-negative300.bin


FileNotFoundError: [Errno 2] No such file or directory: './models/GoogleNews-vectors-negative300.bin'
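Since the model file may not be present on every machine, a small check before building the augmenter avoids the traceback above. This is only a sketch: `existing_models` is a hypothetical helper, and the paths are the ones used in this notebook.

```python
import os

def existing_models(paths):
    """Split candidate model paths into those present on disk and those missing."""
    found = [p for p in paths if os.path.isfile(p)]
    missing = [p for p in paths if p not in found]
    return found, missing

# Whether these files exist depends on the download step having run.
paths = [
    "./models/GoogleNews-vectors-negative300.bin",
    "./models/glove.6B.100d.txt",
]
found, missing = existing_models(paths)
for p in missing:
    print(f"skipping, file not found: {p}")
```

The cells below could then be run only for the paths in `found`.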

In [None]:
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path='./models/GoogleNews-vectors-negative300.bin',
    action="substitute", aug_p= 0.3)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

In [None]:
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path='./models/GoogleNews-vectors-negative300.bin',
    action="substitute", aug_p= 0.8)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

### glove 6B 100d

In [None]:

aug = naw.WordEmbsAug(
    model_type='glove', model_path='./models/glove.6B.100d.txt',
    action="substitute")
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

In [None]:

aug = naw.WordEmbsAug(
    model_type='glove', model_path='./models/glove.6B.100d.txt',
    action="substitute",  aug_p= 0.3)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

In [None]:

aug = naw.WordEmbsAug(
    model_type='glove', model_path='./models/glove.6B.100d.txt',
    action="substitute",  aug_p= 0.8)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

### fasttext

In [None]:
aug = naw.WordEmbsAug(
    model_type='fasttext', model_path='./models/wiki-news-300d-1M.vec',
    action="substitute")
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

In [None]:

aug = naw.WordEmbsAug(
    model_type='fasttext', model_path='./models/wiki-news-300d-1M.vec',
    action="substitute", aug_p= 0.3)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

In [None]:
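# NOTE: `model` expects a pre-loaded model object; passing the class path as a
# string fails with "AttributeError: 'str' object has no attribute 'get_vocab'".
aug = naw.WordEmbsAug(
    model_type='word2vec', model='nlpaug.model.word_embs.nmw.Word2vec()',
    action="substitute", aug_p=0.8)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)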

In [41]:
# import nlpaug.augmenter.word as naw
# aug = naw.WordEmbsAug(model_type='word2vec', model='nlpaug.model.word_embs.nmw.Word2vec()',action="substitute", aug_p= 0.8)
# augmented_texts = aug.augment(sample)
# print_and_highlight_diff(sample, augmented_texts)

AttributeError: 'str' object has no attribute 'get_vocab'

In [None]:
# help(naw.WordEmbsAug)

In [None]:
aug = naw.WordEmbsAug(
    model_type='fasttext', model_path='./models/wiki-news-300d-1M.vec',
    action="substitute", aug_p=0.8)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

## ContextualWordEmbs method

I'll try lightweight models such as SqueezeBERT since resources are limited; the first cell below uses DistilBERT.

In [None]:
aug = naw.ContextualWordEmbsAug(
    model_path='distilbert-base-uncased', action="substitute", aug_p= 0.3)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

In [31]:
# NOTE: `model_dir` is not defined in the cells shown here; it must be set beforehand.
aug = nawcwe.ContextualWordEmbsAug(
    model_path=model_dir,
    action='substitute',
    aug_p=0.8)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 1
Y
Augmented: 5
[31mmy[39m [31msample[39m [31mjust[39m [31mgoes[39m [31mhere[39m 


In [None]:
# help(nawcwe.ContextualWordEmbsAug)