This notebook serves as a base and reference for using the nlpaug library for text augmentation.


Inspiration taken from: https://www.kaggle.com/code/andypenrose/text-augmentation-with-nlpaug

In [1]:
%load_ext autoreload
%autoreload 2

## Dataset

In [2]:
import pandas as pd

df = pd.read_csv("../../data/silver/df_cleantext_v0.csv")
df

Unnamed: 0,Category,Message
0,0,go until jurong point crazy available only in ...
1,0,ok lar joking wif u oni
2,1,free entry in number a wkly comp to win fa cup...
3,0,u dun say so early hor u c already then say
4,0,nah i dont think he goes to usf he lives aroun...
...,...,...
5152,1,this is the numbernd time we have tried number...
5153,0,will you b going to esplanade fr home
5154,0,pity was in mood for that soany other suggestions
5155,0,the guy did some bitching but i acted like id ...


In [3]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.word.context_word_embs as nawcwe
import nlpaug.augmenter.word.word_embs as nawwe
import nlpaug.augmenter.word.spelling as naws

In [4]:
import os
import sys
sys.path.append(os.path.abspath("../utils"))
from experiments_utils import print_and_highlight_diff
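
`print_and_highlight_diff` is a local helper from `experiments_utils`, and its source is not shown in this notebook. Judging from the outputs in the cells that follow, it seems to print the token count of each text and to mark in red every augmented token that differs from the original token at the same position. The sketch below is a hypothetical reconstruction under that assumption; the names `highlight_diff_sketch` and `print_diff_sketch` are illustrative and are not the real implementation.

```python
# Hypothetical stand-in for the local helper; NOT the actual experiments_utils code.
RED, RESET = "\033[31m", "\033[39m"  # ANSI codes: red foreground / default foreground


def highlight_diff_sketch(original: str, augmented: str) -> str:
    """Wrap in red every augmented token that differs from the original
    token at the same position (purely positional comparison)."""
    orig_tokens = original.split()
    out = []
    for i, token in enumerate(augmented.split()):
        changed = i >= len(orig_tokens) or token != orig_tokens[i]
        out.append(f"{RED}{token}{RESET}" if changed else token)
    return " ".join(out)


def print_diff_sketch(originals, augmenteds):
    """Print each original/augmented pair with whitespace token counts."""
    for original, augmented in zip(originals, augmenteds):
        print("-" * 50)
        print("Original:", len(original.split()))
        print(original)
        print("Augmented:", len(augmented.split()))
        print(highlight_diff_sketch(original, augmented))
```

A positional comparison like this one marks every token after an inserted word as changed, which is worth keeping in mind when reading outputs where the augmented text is longer than the original.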

## Chunk for analysis

In [5]:
pd.set_option('display.max_colwidth', None)

sample = df['Message'].iloc[:5].astype(str).to_list()


## KeyboardAug method

In [6]:
aug = nac.KeyboardAug()
augmented_texts = aug.augment(sample)

print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go until jurong [31mpling[39m [31mcTaz^[39m [31mavxilaf>e[39m [31mon:T[39m in bugis n great world la e buffet cine [31mtTege[39m got [31mXmire[39m wat 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok lar [31mjok&hg[39m wif u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
[31mb5ee[39m entry in [31mnjmbed[39m a [31mw.<y[39m [31mcojO[39m to win fa cup [31mvinaI[39m tkts [31mn*mb3rs5[39m may number [31mtsst[39m fa to number to [31mrecF(v$[39m entry questionstd txt ratetcs [31mzp9ly[39m numberovernumbers 
-----------------------------

Notes: 
1. The original cleaning criteria dropped special characters and numbers, so the augmenter should not introduce them either: `include_special_char` and `include_numeric` should be set to `False`. The same applies to upper-case characters (`include_upper_case`).
2. Many characters are changed in each modified word, so the results don't look realistic in my opinion.


In [7]:
aug = nac.KeyboardAug(aug_word_p=0.1, include_numeric=False, include_special_char=False, include_upper_case=False)
augmented_texts = aug.augment(sample)

print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go [31muhtol[39m [31mjurogn[39m point crazy available only in bugis n great world la e buffet cine there got amore wat 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok lar [31miokong[39m wif u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
free entry in number a [31msjly[39m comp to win fa cup final tkts numberst may number text fa to [31mnjjber[39m to receive entry questionstd txt ratetcs apply [31mmkmgeekverjumbers[39m 
--------------------------------------------------
Original: 11
u dun say so early hor u c already then say
Augmented: 11
u dun s

Notes: 

The misspellings are still very aggressive and artificial. They don't look like typos a person would actually make.
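
To picture why the output looks so heavy, it helps to recall the mechanism: a keyboard augmenter swaps characters for keys that sit next to them, and several characters per word can be hit at once. The sketch below is not nlpaug's code. It is a minimal illustration with a hypothetical, partial QWERTY neighbor map (`NEIGHBORS`) and a hypothetical function `one_key_typo` that changes at most one character per word, which tends to give milder typos.

```python
import random

# Hypothetical, partial QWERTY neighbor map; a real one would cover every key.
NEIGHBORS = {
    "a": "qsz", "e": "wrd", "i": "uok", "o": "ipl", "n": "bm",
    "r": "etf", "s": "adw", "t": "ryg", "u": "yij",
}


def one_key_typo(word: str, rng: random.Random) -> str:
    """Replace at most one character of `word` with an adjacent key."""
    positions = [i for i, ch in enumerate(word) if ch in NEIGHBORS]
    if not positions:
        return word  # no character we know how to perturb
    i = rng.choice(positions)
    return word[:i] + rng.choice(NEIGHBORS[word[i]]) + word[i + 1:]


rng = random.Random(0)  # seeded so repeated runs give the same typos
print([one_key_typo(w, rng) for w in "free entry to win".split()])
```

`KeyboardAug` also exposes character-level limits such as `aug_char_max` and `aug_char_p`, which may be worth trying for the same purpose; `help(nac.KeyboardAug)` lists them.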


In [8]:
# help(nac.KeyboardAug)
# help(aug.augment)

## SpellingAug method

This method substitutes words using a dictionary of common spelling mistakes.

In [9]:
aug = naws.SpellingAug()
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go until jurong point crazy available only in bugis [31mOn[39m [31mgteat[39m world [31ma[39m e [31mbuffe[39m cine there [31mgate.[39m amore [31mway[39m 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
[31mOK)][[...[39m lar joking wif [31mYou[39m oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
[31mfee[39m entry [31mlin[39m [31mnummber[39m [31mg[39m wkly comp [31mro[39m win fa cup [31mfinel[39m tkts numberst [31mMay[39m number [31mtex[39m fa [31mou[39m number to receive entry questionstd txt ratetcs apply numberovernumbers 
---------

Notes: 

This method creates more realistic results than the previous technique.

In [16]:
aug = naws.SpellingAug(aug_p=0.5)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
[31mgoes[39m until jurong [31mopint[39m [31mcreasy[39m available only [31ming[39m bugis n [31mgreats[39m [31mworls[39m [31ma[39m [31ma[39m buffet [31mcinema[39m [31mwhere[39m got amore wat 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
[31mof[39m lar [31mchocking[39m wif [31mYou[39m oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
free entry [31mia[39m [31mnamber's[39m [31mde[39m wkly comp [31mty[39m win fa [31mcop[39m [31mfinel[39m tkts numberst [31mmays[39m [31mnamber[39m text fa to number [31mtu[39m receive entry ques

Notes:

1. The results are way more realistic.
2. The dictionary includes upper-case replacements (e.g. `You`), which should be handled later on, for instance by lowercasing the augmented text.
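
One way to handle the upper-case (and occasional punctuation) that the spelling dictionary introduces is to re-apply a light normalization to the augmented text. The sketch below assumes the cleaned corpus contains only lowercase ASCII letters and spaces, which matches the samples shown here but is not confirmed by the cleaning code. `normalize_augmented` is a hypothetical helper name.

```python
import re


def normalize_augmented(text: str) -> str:
    """Lowercase, drop everything except a-z and whitespace, collapse spaces.

    Assumption: the cleaned dataset only contains lowercase letters and spaces.
    """
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z\s]", "", lowered)
    return " ".join(letters_only.split())


# Applied to one of the augmented outputs above.
print(normalize_augmented("OK)][[... lar joking wif You oni"))
```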


In [None]:
# help(naws.SpellingAug)
# help(aug.augment)

## SynonymAug

Substitutes words with synonyms taken from WordNet or PPDB.

In [23]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/maldu/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [24]:
aug = naw.SynonymAug()
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 22
[31mmove[39m until jurong [31mitem[39m crazy available only in bugis n [31mgravid[39m [31mhumans[39m [31matomic[39m [31mnumber[39m [31m57[39m [31me[39m [31mbuffet[39m [31mcine[39m [31mthere[39m [31mgot[39m [31mamore[39m [31mwat[39m 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
[31mo.k.[39m lar joking wif u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
free entry in number a wkly comp to win fa cup [31mconcluding[39m tkts numberst may number text fa to number to [31mget[39m [31mentranceway[39m questionstd txt ratetcs apply nu

Notes:

- Original: 11
- u dun say so early hor u c already then say
- Augmented: 12
- u dun say so other hor u ampere second already then say 

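Some augmentations add extra words (11 tokens become 12 here). A likely explanation is that WordNet synonyms can be multi-word expressions: `c` became `ampere second` (another name for the coulomb, symbol C), and in the first example above `la` became `atomic number 57` (lanthanum, symbol La).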
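
Since some augmented texts come back with a different number of tokens than their originals, it can help to flag those pairs for inspection or filtering. The sketch below uses a hypothetical helper, `length_changed`, that compares whitespace token counts. It is independent of nlpaug and works on any pair of lists.

```python
def length_changed(originals, augmenteds):
    """Return (index, original_len, augmented_len) for every pair whose
    whitespace token count differs."""
    changed = []
    for i, (original, augmented) in enumerate(zip(originals, augmenteds)):
        n_orig, n_aug = len(original.split()), len(augmented.split())
        if n_orig != n_aug:
            changed.append((i, n_orig, n_aug))
    return changed


# Two pairs taken from the outputs above.
originals = [
    "ok lar joking wif u oni",
    "u dun say so early hor u c already then say",
]
augmenteds = [
    "o.k. lar joking wif u oni",
    "u dun say so other hor u ampere second already then say",
]
print(length_changed(originals, augmenteds))
```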

In [27]:
help(naw.SynonymAug())

Help on SynonymAug in module nlpaug.augmenter.word.synonym object:

class SynonymAug(nlpaug.augmenter.word.word_augmenter.WordAugmenter)
 |  SynonymAug(aug_src='wordnet', model_path=None, name='Synonym_Aug', aug_min=1, aug_max=10, aug_p=0.3, lang='eng', stopwords=None, tokenizer=None, reverse_tokenizer=None, stopwords_regex=None, force_reload=False, verbose=0)
 |  
 |  Augmenter that leverage semantic meaning to substitute word.
 |  
 |  :param str aug_src: Support 'wordnet' and 'ppdb' .
 |  :param str model_path: Path of dictionary. Mandatory field if using PPDB as data source
 |  :param str lang: Language of your text. Default value is 'eng'. For `wordnet`, you can choose lang from this list
 |      http://compling.hss.ntu.edu.sg/omw/. For `ppdb`, you simply download corresponding langauge pack from
 |      http://paraphrase.org/#/download.
 |  :param float aug_p: Percentage of word will be augmented.
 |  :param int aug_min: Minimum number of word will be augmented.
 |  :param int au