This notebook serves as a base and reference for implementing text augmentation with the nlpaug library.


Inspired by: https://www.kaggle.com/code/andypenrose/text-augmentation-with-nlpaug

In [2]:
%load_ext autoreload
%autoreload 2

## Dataset

In [3]:
import pandas as pd

df = pd.read_csv("../../data/silver/df_cleantext_v0.csv")
df

Unnamed: 0,Category,Message
0,0,go until jurong point crazy available only in ...
1,0,ok lar joking wif u oni
2,1,free entry in number a wkly comp to win fa cup...
3,0,u dun say so early hor u c already then say
4,0,nah i dont think he goes to usf he lives aroun...
...,...,...
5152,1,this is the numbernd time we have tried number...
5153,0,will you b going to esplanade fr home
5154,0,pity was in mood for that soany other suggestions
5155,0,the guy did some bitching but i acted like id ...


In [4]:
import os
import sys
sys.path.append(os.path.abspath("../utils"))
from experiments_utils import print_and_highlight_diff
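
The source of `print_and_highlight_diff` is not shown in this notebook. Judging only from the outputs below (a word count per sentence, and changed words wrapped in the ANSI codes `[31m` and `[39m`, i.e. red on/off), a helper of this kind could look roughly like the sketch here. The names `highlight_diff` and `print_diff` are hypothetical, and the position-by-position comparison is an assumption, not the author's implementation.

```python
RED, RESET = "\033[31m", "\033[39m"  # ANSI: red foreground on / default foreground


def highlight_diff(original: str, augmented: str) -> str:
    """Wrap in red every augmented word that differs from the word at the same position."""
    orig_words = original.split()
    out = []
    for i, word in enumerate(augmented.split()):
        same = i < len(orig_words) and orig_words[i] == word
        out.append(word if same else f"{RED}{word}{RESET}")
    return " ".join(out)


def print_diff(originals, augmenteds):
    """Print each pair with its word counts, mimicking the layout seen in this notebook."""
    for o, a in zip(originals, augmenteds):
        print("-" * 50)
        print(f"Original: {len(o.split())}")
        print(o)
        print(f"Augmented: {len(a.split())}")
        print(highlight_diff(o, a))


print_diff(["ok lar joking wif u oni"], ["ok lar ioklng wif u oni"])
```

A positional comparison like this marks every word after an inserted token as changed, which would be consistent with the long red runs in the SynonymAug outputs further down.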

## Chunk for analysis

In [5]:
pd.set_option('display.max_colwidth', None)

sample = df['Message'].iloc[:5].astype(str).to_list()


In [6]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.word.context_word_embs as nawcwe
import nlpaug.augmenter.word.word_embs as nawwe
import nlpaug.augmenter.word.spelling as naws

## KeyboardAug method

In [7]:
aug = nac.KeyboardAug()
augmented_texts = aug.augment(sample)

print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go until [31mmufong[39m point crazy available [31mLHly[39m in [31mbuglC[39m n [31mg4eAt[39m [31mwLrlF[39m la e buffet [31mX8ne[39m there got amore wat 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok lar [31mUouing[39m wif u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
[31mGre2[39m entry in [31mnumh$r[39m a [31mwmlg[39m comp to win fa cup [31mfJnaP[39m tkts [31mn^mfsrst[39m may [31mnumged[39m [31mteZY[39m fa to number to receive entry questionstd txt [31mEatefds[39m [31mal9ly[39m numberovernumbers 
-----------------------------

Notes: 
1. The original cleaning step removed special characters and numbers and lowercased the text, so the augmenter should behave accordingly: `include_special_char` and `include_numeric` should be disabled, and likewise `include_upper_case`.
2. Several characters are changed in each modified word, so the results do not look realistic to me.


In [8]:
aug = nac.KeyboardAug(aug_word_p=0.1, include_numeric=False, include_special_char=False, include_upper_case=False)
augmented_texts = aug.augment(sample)

print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go until jurong point crazy available only in bugis n great [31mworir[39m la e buffet cine there got [31majorr[39m wat 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok lar [31mioklng[39m wif u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
free [31msntdy[39m in number a wkly comp to win fa cup final tkts [31mnjmbwrat[39m may [31mnunbet[39m text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers 
--------------------------------------------------
Original: 11
u dun say so early hor u c already then say
Augmented: 11
u dun s

Notes: 

The misspellings are still too aggressive and artificial; they don't look like plausible typos to me.
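
For comparison, here is a minimal standard-library sketch of a gentler keyboard-typo scheme: at most one adjacent-key slip per selected word. It is not nlpaug's algorithm; the `NEIGHBOURS` map is a small hand-written, partial QWERTY table and the function name `one_typo_per_word` is hypothetical. nlpaug's character augmenters may expose per-word character limits of their own, which `help(nac.KeyboardAug)` should confirm.

```python
import random

# Hypothetical, partial QWERTY neighbour map (an assumption, not nlpaug's own table).
NEIGHBOURS = {
    "a": "qwsz", "e": "wrsd", "i": "uojk", "o": "ipkl", "u": "yihj",
    "n": "bmhj", "r": "etdf", "s": "awedxz", "t": "ryfg", "l": "kop",
}


def one_typo_per_word(text, word_p=0.3, seed=0):
    """Replace at most one character per selected word with a neighbouring key."""
    rng = random.Random(seed)
    words = text.split()
    for idx, word in enumerate(words):
        # Only characters present in the neighbour map can be replaced.
        positions = [i for i, ch in enumerate(word) if ch in NEIGHBOURS]
        # Skip short words, words without replaceable characters, and unselected words.
        if len(word) < 4 or not positions or rng.random() >= word_p:
            continue
        pos = rng.choice(positions)
        replacement = rng.choice(NEIGHBOURS[word[pos]])
        words[idx] = word[:pos] + replacement + word[pos + 1:]
    return " ".join(words)


print(one_typo_per_word("free entry in number a wkly comp to win fa cup final"))
```

Limiting the change to a single character keeps each word recognisable, which is closer to how real typos in SMS text tend to look.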


In [9]:
# help(nac.KeyboardAug)
# help(aug.augment)

## SpellingAug method

This method substitutes words with misspelled variants taken from a dictionary of common spelling mistakes.
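
The idea can be illustrated with a minimal standard-library sketch. The `MISSPELLINGS` table is a tiny hypothetical stand-in for the real dictionary (its pairs are copied from the outputs in this notebook), and `spelling_augment` is an illustrative name, not part of nlpaug.

```python
import random

# Tiny hypothetical dictionary; the pairs are taken from outputs shown in this notebook.
MISSPELLINGS = {
    "until": ["utill", "till"],
    "only": ["olny", "noly"],
    "available": ["avilable"],
    "great": ["geart"],
}


def spelling_augment(text, aug_p=0.3, seed=0):
    """Swap each known word for one of its listed misspellings with probability aug_p."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        options = MISSPELLINGS.get(word)
        if options and rng.random() < aug_p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


print(spelling_augment("go until jurong point crazy available only", aug_p=1.0))
```

Because every replacement comes from a list of observed human mistakes, the output cannot contain the random-looking strings that keyboard augmentation produces.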

In [10]:
aug = naws.SpellingAug()
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go [31mutill[39m jurong [31mponint[39m crazy [31mavilable[39m [31molny[39m in bugis n great [31mworlth[39m la e buffet [31mcinema[39m there got amore wat 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok [31mlaw[39m joking wif [31myou[39m oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
free entry in number [31me[39m wkly comp [31mgo[39m [31mwinn[39m fa [31mcouple[39m [31mfinel[39m tkts numberst [31mmaybe[39m [31mnumbtr[39m text fa to [31mnouber[39m to [31mreceivement[39m entry questionstd txt ratetcs apply numberovernumbers 
----

Notes: 

This method creates more realistic results than the previous technique.

In [11]:
aug = naws.SpellingAug(aug_p = 0.5)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
[31mago[39m [31mtill[39m jurong point crazy available [31mnoly[39m in bugis [31mNo[39m [31mgeart[39m [31mwordl[39m la [31mold[39m [31mbuffe[39m cine [31mthear[39m got amore [31mwant[39m 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok [31mlaw[39m joking [31mwife[39m [31myou[39m oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
free entry in [31munmber[39m a wkly comp [31mth[39m [31mwinne[39m fa [31mcouple[39m [31mfinel[39m tkts numberst [31mMy[39m [31mnummber[39m text fa [31mtoa[39m number [31mte[39m receive entry questi

Notes:

1. The results are much more realistic.
2. The dictionary includes upper-case entries, which should be normalized later on.


In [12]:
#help(naws.SpellingAug())
# help(aug.augment)

## SynonymAug

Substitutes words with synonyms from WordNet or PPDB.

The default source is WordNet.

In [13]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/maldu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [14]:
aug = naw.SynonymAug()
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 21
go until jurong point crazy available only in bugis n [31mnifty[39m world [31mpelican[39m [31mstate[39m [31me[39m [31mbuffet[39m [31mcine[39m [31mthere[39m [31mgot[39m [31mamore[39m [31mwat[39m 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 10
[31mhunky[39m [31mdory[39m [31mlar[39m [31mjoking[39m [31mwif[39m [31mu[39m [31moffice[39m [31mof[39m [31mnaval[39m [31mintelligence[39m 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 30
free entry in number a wkly comp to [31mdeliver[39m [31mthe[39m [31mgoods[39m [31mfa[39m [31mcup

Notes:

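1. Some outputs gain extra words. This appears to happen when a single token is replaced by a multi-word synonym; in the example below, `c` becomes `ampere second`: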

- Original: 11
- u dun say so early hor u c already then say
- Augmented: 12
- u dun say so other hor u ampere second already then say 

2. The substitutions look more realistic to me.
3. According to the function's parameters, two synonym sources are available: 'wordnet' and 'ppdb'.
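
To see where the extra words come from, the two sentences from note 1 can be aligned token by token. The sketch below uses `difflib.SequenceMatcher` from the standard library; `changed_spans` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def changed_spans(original, augmented):
    """Return (original_tokens, augmented_tokens) for every non-matching aligned span."""
    a, b = original.split(), augmented.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


spans = changed_spans(
    "u dun say so early hor u c already then say",
    "u dun say so other hor u ampere second already then say",
)
print(spans)  # [(['early'], ['other']), (['c'], ['ampere', 'second'])]
```

The second span shows one token replaced by two, which accounts for the word count going from 11 to 12.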

In [15]:
aug = naw.SynonymAug(aug_p= 0.3)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go until jurong point [31mlooney[39m available only in bugis n [31mbig[39m world [31mlouisiana[39m e buffet cine there [31mpose[39m amore wat 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok lar joking wif u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
free entry in number a wkly [31mcomprehensive[39m to win fa cup final tkts numberst may number text fa to [31mroutine[39m to [31mencounter[39m entry questionstd txt ratetcs apply numberovernumbers 
--------------------------------------------------
Original: 11
u dun say so early hor u c already t

Notes: 

1. We should be careful with the proportion of augmented words (`aug_p`), because substitutions can add new words and change the length of the sentence.
2. Some synonyms do not fit the meaning of the sentence, in my opinion.

This cannot be used unless we find a way to force 1-to-1 swapping, and perhaps not even then. We should also restrict substitutions to absolute synonyms.
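
One coarse way to approximate 1-to-1 swapping, without touching the augmenter itself, would be to discard any augmented sentence whose token count differs from the original. This is only a sketch: `keep_one_to_one` is a hypothetical helper, and equal length does not guarantee that the synonyms fit the context.

```python
def keep_one_to_one(originals, augmented):
    """Keep only (original, augmented) pairs with the same number of tokens."""
    return [
        (o, a)
        for o, a in zip(originals, augmented)
        if len(o.split()) == len(a.split())
    ]


# Sentence pairs taken from the SynonymAug outputs in this notebook.
originals = [
    "go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat",
    "ok lar joking wif u oni",
    "u dun say so early hor u c already then say",
]
augmented = [
    "go until jurong point looney available only in bugis n big world louisiana e buffet cine there pose amore wat",
    "hunky dory lar joking wif u office of naval intelligence",
    "u dun say so other hor u ampere second already then say",
]

kept = keep_one_to_one(originals, augmented)
print(len(kept))  # 1
```

Only the first pair survives; the other two contain multi-word substitutions and are dropped.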

In [16]:
# help(naw.SynonymAug())

## WordEmbeddingAug method

This method inserts or substitutes words at random positions, choosing candidates by word-embedding similarity. The cells below use `action="substitute"`.

:param str model_type: Model type of word embeddings. Expected values include 'word2vec', 'glove' and 'fasttext'.
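
As a rough illustration of what "similarity" means here, the sketch below ranks candidate words by cosine similarity between embedding vectors. The 3-dimensional `VECTORS` are invented toy values, not taken from word2vec, GloVe or fastText, and `most_similar` is a hypothetical helper rather than nlpaug's internal logic.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
VECTORS = {
    "great": (0.9, 0.8, 0.1),
    "big": (0.8, 0.9, 0.2),
    "tiny": (-0.7, -0.8, 0.1),
    "buffet": (0.1, 0.0, 0.9),
    "cinema": (0.0, 0.2, 0.8),
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def most_similar(word, top_n=1):
    """Return the top_n other vocabulary words ranked by cosine similarity to `word`."""
    target = VECTORS[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in VECTORS.items()
        if other != word
    ]
    return [other for _, other in sorted(scored, reverse=True)[:top_n]]


print(most_similar("great"))   # ['big']
print(most_similar("buffet"))  # ['cinema']
```

With real embeddings the nearest neighbours are related words rather than strict synonyms, so the same caution about meaning drift applies as for SynonymAug.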

### Download models

The models come from the gensim library because local resources are limited.


In [22]:
# from nlpaug.util.file.download import DownloadUtil

# DownloadUtil.download_word2vec(dest_dir='./models')
# DownloadUtil.download_glove('glove.6B', './models')
# DownloadUtil.download_fasttext('wiki-news-300d-1M', './models')


Downloading...
From (original): https://drive.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM
From (redirected): https://drive.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&confirm=t&uuid=d99cbbcb-c3b8-4674-b260-d682736f47a2
To: /home/maldu/dscience/projects/spam_detector/research/03_text_augmentation/models/GoogleNews-vectors-negative300.bin.gz
100%|██████████| 1.65G/1.65G [02:20<00:00, 11.7MB/s]


### word2vec

Too heavy for my machine; leaving it to run on GitHub Actions.

In [None]:

aug = naw.WordEmbsAug(
    model_type='word2vec', model_path='./models/GoogleNews-vectors-negative300.bin',
    action="substitute")
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)


In [None]:
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path='./models/GoogleNews-vectors-negative300.bin',
    action="substitute", aug_p= 0.3)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

In [None]:
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path='./models/GoogleNews-vectors-negative300.bin',
    action="substitute", aug_p= 0.8)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

### glove 6B 100d

In [None]:

aug = naw.WordEmbsAug(
    model_type='glove', model_path='./models/glove.6B.100d.txt',
    action="substitute")
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

In [None]:

aug = naw.WordEmbsAug(
    model_type='glove', model_path='./models/glove.6B.100d.txt',
    action="substitute",  aug_p= 0.3)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

In [None]:

aug = naw.WordEmbsAug(
    model_type='glove', model_path='./models/glove.6B.100d.txt',
    action="substitute",  aug_p= 0.8)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

### fasttext

In [None]:
aug = naw.WordEmbsAug(
    model_type='fasttext', model_path='./models/wiki-news-300d-1M.vec',
    action="substitute")
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

In [None]:

aug = naw.WordEmbsAug(
    model_type='fasttext', model_path='./models/wiki-news-300d-1M.vec',
    action="substitute", aug_p= 0.3)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

In [None]:
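# NOTE: `model` appears to expect a loaded embedding model object, not a string.
# Passing a string, as here, fails with:
# AttributeError: 'str' object has no attribute 'get_vocab'
aug = naw.WordEmbsAug(
    model_type='word2vec', model='nlpaug.model.word_embs.nmw.Word2vec()',
    action="substitute", aug_p= 0.8)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)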

In [41]:
# import nlpaug.augmenter.word as naw
# aug = naw.WordEmbsAug(model_type='word2vec', model='nlpaug.model.word_embs.nmw.Word2vec()',action="substitute", aug_p= 0.8)
# augmented_texts = aug.augment(sample)
# print_and_highlight_diff(sample, augmented_texts)

AttributeError: 'str' object has no attribute 'get_vocab'

In [None]:
# help(naw.WordEmbsAug)

In [None]:
aug = naw.WordEmbsAug(
    model_type='fasttext', model_path='./models/wiki-news-300d-1M.vec',
    action="substitute", aug_p= 0.8)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

## ContextualWordEmbs method

I'll try lightweight models such as SqueezeBERT and DistilBERT since resources are limited.

In [None]:
aug = naw.ContextualWordEmbsAug(
    model_path='distilbert-base-uncased', action="substitute", aug_p= 0.3)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

In [31]:
aug = nawcwe.ContextualWordEmbsAug(
    model_path=model_dir, 
    action ='substitute', 
    aug_p= 0.8)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 1
Y
Augmented: 5
[31mmy[39m [31msample[39m [31mjust[39m [31mgoes[39m [31mhere[39m 


In [None]:
# help(nawcwe.ContextualWordEmbsAug)