This notebook serves as a base and reference for implementing text augmentation with the nlpaug library.



In [1]:
%load_ext autoreload
%autoreload 2

## Dataset

In [2]:
import pandas as pd

df = pd.read_csv("../../data/silver/df_cleantext_v0.csv")
df

Unnamed: 0,Category,Message
0,0,go until jurong point crazy available only in ...
1,0,ok lar joking wif u oni
2,1,free entry in number a wkly comp to win fa cup...
3,0,u dun say so early hor u c already then say
4,0,nah i dont think he goes to usf he lives aroun...
...,...,...
5152,1,this is the numbernd time we have tried number...
5153,0,will you b going to esplanade fr home
5154,0,pity was in mood for that soany other suggestions
5155,0,the guy did some bitching but i acted like id ...


In [6]:
import os
import sys
sys.path.append(os.path.abspath("../utils"))
from experiments_utils import print_and_highlight_diff
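
`print_and_highlight_diff` comes from the project's `experiments_utils` module, which is not shown in this notebook. Judging only from the outputs below, it seems to print the word count of each text and colour in red the words that differ position by position. A minimal sketch of such a helper, with hypothetical names (`highlight_diff`, `print_diff`) and not the project's actual implementation:

```python
RED = "\033[31m"    # ANSI escape: switch the foreground colour to red
RESET = "\033[39m"  # ANSI escape: restore the default foreground colour


def highlight_diff(original: str, augmented: str) -> str:
    """Return `augmented` with every word that differs from the word at the
    same position in `original` wrapped in red ANSI codes."""
    orig_words = original.split()
    marked = []
    for i, word in enumerate(augmented.split()):
        same = i < len(orig_words) and orig_words[i] == word
        marked.append(word if same else f"{RED}{word}{RESET}")
    return " ".join(marked)


def print_diff(originals: list, augmenteds: list) -> None:
    """Print each original/augmented pair with word counts (hypothetical helper)."""
    for original, augmented in zip(originals, augmenteds):
        print("-" * 50)
        print(f"Original: {len(original.split())}")
        print(original)
        print(f"Augmented: {len(augmented.split())}")
        print(highlight_diff(original, augmented))
```

Because the comparison in this sketch is positional, a single inserted word shifts every later word and marks all of them as changed, which would be consistent with some of the outputs further down.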

## Chunk for analysis

In [3]:
pd.set_option('display.max_colwidth', None)

sample = df['Message'].iloc[:5].astype(str).to_list()


In [4]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.word.context_word_embs as nawcwe
import nlpaug.augmenter.word.word_embs as nawwe
import nlpaug.augmenter.word.spelling as naws

## KeyboardAug method

In [6]:
aug = nac.KeyboardAug()
augmented_texts = aug.augment(sample)

print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go [31m^ntip[39m jurong [31mpoib$[39m crazy available only in bugis n great world la e [31mHuffeR[39m [31mc(je[39m [31mhh4re[39m got [31mampTe[39m wat 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok lar [31mMok7ng[39m wif u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
free entry in number a wkly comp to win fa cup [31mfona<[39m tkts [31mnKNberZt[39m may number [31mteDG[39m fa to [31mnImbef[39m to receive [31men6r&[39m [31mq*wstipnstx[39m txt [31mrxt3tcA[39m [31mapp.6[39m [31mmumbRrovernuJNsgs[39m 
-----------------------------

Notes: 
1. The original cleaning criteria ignored special characters and numbers, so the cleaned text contains neither. The `include_special_char` and `include_numeric` parameters should therefore be set to `False`, and the same goes for `include_upper_case`. 
2. Several characters are changed in each modified word, so the results don't look realistic.


In [7]:
aug = nac.KeyboardAug(aug_word_p=0.1, include_numeric=False, include_special_char=False, include_upper_case=False)
augmented_texts = aug.augment(sample)

print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go until [31mjirojg[39m point crazy available only in bugis n great world la e buffet [31mdone[39m there got amore wat 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok lar [31mjokibt[39m wif u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
free [31mwgtry[39m in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs [31maoppy[39m [31mnhmberlfefhumberc[39m 
--------------------------------------------------
Original: 11
u dun say so early hor u c already then say
Augmented: 11
u dun s

Notes: 

Even with the restricted parameters, the misspellings are still very aggressive and artificial. They don't look like typos a person would make.
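
One way to get milder typos would be to change at most one character per selected word. The sketch below is a plain-Python illustration of that idea, not nlpaug code: the adjacency map, the function names and the default values are all assumptions made for the example. `KeyboardAug` also accepts an `aug_char_max` argument, which may be worth trying before writing anything custom.

```python
import random

# Partial QWERTY adjacency map (an assumption for illustration; extend as needed).
NEIGHBOURS = {
    "a": "qwsz", "b": "vghn", "c": "xdfv", "d": "serfcx", "e": "wsdr",
    "f": "drtgvc", "g": "ftyhbv", "h": "gyujnb", "i": "ujko", "j": "huikmn",
    "k": "jiolm", "l": "kop", "m": "njk", "n": "bhjm", "o": "iklp",
    "p": "ol", "q": "wa", "r": "edft", "s": "awedxz", "t": "rfgy",
    "u": "yhji", "v": "cfgb", "w": "qase", "x": "zsdc", "y": "tghu",
    "z": "asx",
}


def one_key_typo(word: str, rng: random.Random) -> str:
    """Replace exactly one letter of `word` with an adjacent key, if possible."""
    positions = [i for i, ch in enumerate(word) if ch in NEIGHBOURS]
    if not positions:
        return word  # nothing we know how to mistype
    i = rng.choice(positions)
    return word[:i] + rng.choice(NEIGHBOURS[word[i]]) + word[i + 1:]


def mild_keyboard_aug(text: str, word_p: float = 0.1, min_len: int = 4, seed: int = 0) -> str:
    """Apply at most one adjacent-key typo to a random subset of the longer words."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if len(word) >= min_len and rng.random() < word_p:
            out.append(one_key_typo(word, rng))
        else:
            out.append(word)
    return " ".join(out)
```

The word count never changes with this approach, and each touched word keeps its length, so the output stays close to the cleaned corpus.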


In [8]:
# help(nac.KeyboardAug)
# help(aug.augment)

## SpellingAug method

This method substitutes words using a dictionary of common spelling mistakes.

In [9]:
aug = naws.SpellingAug()
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
[31mgona[39m [31mutill[39m jurong [31mpiont[39m crazy available only [31mil[39m bugis n great [31mwould[39m la e buffet cine there got amore [31mwant[39m 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
[31mOK.[39m lar joking [31mwife[39m u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
[31mfrre[39m entry [31men[39m [31munmber[39m [31me[39m wkly comp to [31mwinn[39m fa cup [31mfinel[39m tkts numberst may [31mnambr[39m text fa [31mtoo.[39m number to [31mrecived[39m entry questionstd txt ratetcs apply numberovernumbers 
------------

Notes: 

This method creates more realistic results than the previous technique.

In [10]:
aug = naws.SpellingAug(aug_p = 0.5)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go [31muntill[39m jurong point [31mcraezy[39m [31mavilable[39m [31monley[39m [31mjin[39m bugis n [31mgrea[39m world [31ma[39m e buffet [31mcinema[39m [31mtheve[39m [31mgat[39m amore wat 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
[31mOK)][[...[39m lar [31mchocking[39m [31mwife[39m u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
[31mfreer[39m entry [31mith[39m number a wkly comp to win fa [31mcouple[39m [31mfinel[39m tkts numberst may [31mnumper[39m [31mtest[39m fa to [31mnuamber[39m [31mro[39m [31mrecidive[39m entr

Notes:

1. The results are way more realistic.
2. The dictionary includes upper-case entries (some with punctuation, e.g. "OK."), which should be normalized later on.
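
A possible post-processing step for point 2, assuming the augmented text should match the cleaned corpus (lower case, letters only). The original cleaning code is not part of this notebook, so `normalize_augmented` is a hypothetical helper that only approximates it:

```python
import re


def normalize_augmented(text: str) -> str:
    """Lowercase the text and keep only ASCII letters separated by single spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)       # drop digits, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace
```

For example, `normalize_augmented("OK. lar joking wife u oni ")` gives back a string without the capital letters or the full stop.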


In [11]:
#help(naws.SpellingAug())
# help(aug.augment)

## SynonymAug

Substitutes words with similar ones according to WordNet or PPDB synonyms.

The default source is WordNet.

In [12]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/maldu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

In [13]:
aug = naw.SynonymAug()
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go until jurong point [31mnutcase[39m available only in bugis n great world la e buffet cine there got amore wat 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok lar joking wif u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 31
free entry in number a wkly comp to [31mbring[39m [31mhome[39m [31mthe[39m [31mbacon[39m [31mfa[39m [31mcup[39m [31mfinal[39m [31mtkts[39m [31mnumberst[39m [31mmay[39m [31mnumber[39m [31mtext[39m [31mfa[39m [31mto[39m [31mnumeral[39m [31mto[39m [31mget[39m [31mentry[39m [31mquestionstd[39m [31mtxt[

Notes:

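1. Some substitutions add extra words. This appears to happen when the chosen WordNet synonym is a multi-word expression, e.g. "c" becomes "ampere second" here: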

- Original: 11
- u dun say so early hor u c already then say
- Augmented: 12
- u dun say so other hor u ampere second already then say 

2. The substitutions look more realistic to me.
3. Checking the parameters of the class, it provides two synonym sources: 'wordnet' and 'ppdb'.

In [14]:
aug = naw.SynonymAug(aug_p= 0.3)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 23
[31mhold[39m [31mup[39m [31muntil[39m [31mjurong[39m [31mhead[39m [31mcrazy[39m [31mavailable[39m [31monly[39m [31min[39m [31mbugis[39m [31mn[39m [31mgreat[39m [31mworld[39m [31mla[39m [31me[39m [31mbuffet[39m [31mcine[39m [31mat[39m [31mthat[39m [31mplace[39m [31mgot[39m [31mamore[39m [31mwat[39m 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 9
ok lar joking wif u [31moffice[39m [31mof[39m [31mnaval[39m [31mintelligence[39m 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 30
[31mloose[39m entry in [31mturn[39m a wk

Notes: 

1. We should be careful with the percentage of augmented words (`aug_p`), because multi-word synonyms add new words and change the length of the sentence. 
2. Some synonyms don't fit the meaning of the sentence.

This cannot be used unless we find a way to force 1-to-1 swapping, and maybe not even then. We should also restrict the swaps to absolute synonyms.
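
A simple way to approximate 1-to-1 swapping without touching the augmenter is to discard any output whose word count differs from the original. This is only a sketch with hypothetical helper names; it filters on length and does nothing about synonyms that don't fit the meaning.

```python
def same_word_count(original: str, augmented: str) -> bool:
    """True when both texts have the same number of whitespace-separated words."""
    return len(original.split()) == len(augmented.split())


def filter_one_to_one(originals: list, augmenteds: list) -> list:
    """Keep only the augmented texts that preserve the word count of their original."""
    return [a for o, a in zip(originals, augmenteds) if same_word_count(o, a)]
```

With the examples above, "ok lar joking wif u office of naval intelligence" would be dropped (9 words against 6), while a single-word swap such as "crazy" to "nutcase" would be kept.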

In [15]:
# help(naw.SynonymAug())

## WordEmbsAug method

This method randomly inserts or substitutes words based on word-embedding similarity. The cells below use `action="substitute"`.

:param str model_type: Model type of word embeddings. Expected values include 'word2vec', 'glove' and 'fasttext'.
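
As a rough intuition for similarity-based substitution, the toy sketch below picks the nearest neighbour of a word by cosine similarity. It is not nlpaug's implementation: the 3-dimensional vectors are invented for illustration, and real models such as word2vec use hundreds of dimensions and millions of entries.

```python
import math
from typing import Optional

# Toy vectors, invented for this example only.
TOY_VECTORS = {
    "free":   [0.9, 0.1, 0.0],
    "gratis": [0.8, 0.2, 0.1],
    "win":    [0.1, 0.9, 0.1],
    "prize":  [0.2, 0.8, 0.3],
    "home":   [0.0, 0.1, 0.9],
}


def cosine(u: list, v: list) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_word(word: str, vectors: dict) -> Optional[str]:
    """Return the most similar other word, or None if `word` has no vector."""
    if word not in vectors:
        return None
    candidates = [w for w in vectors if w != word]
    return max(candidates, key=lambda w: cosine(vectors[word], vectors[w]))
```

In this toy vocabulary a word like "wkly" has no vector at all, so there is nothing to compare it with. The nearest neighbour is also chosen without looking at the rest of the sentence.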

### Download models




In [17]:
os.makedirs("./models", exist_ok=True)

In [18]:
from nlpaug.util.file.download import DownloadUtil

DownloadUtil.download_word2vec(dest_dir='./models')
DownloadUtil.download_glove('glove.6B', './models')
DownloadUtil.download_fasttext('wiki-news-300d-1M', './models')


Downloading...
From (original): https://drive.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM
From (redirected): https://drive.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&confirm=t&uuid=27125d57-52fa-401b-8a73-de9429426f8e
To: /home/maldu/dscience/projects/spam_detector/research/03_text_augmentation/models/GoogleNews-vectors-negative300.bin.gz
100%|██████████| 1.65G/1.65G [02:27<00:00, 11.2MB/s]


### word2vec


In [19]:

aug = naw.WordEmbsAug(
    model_type='word2vec', model_path='./models/GoogleNews-vectors-negative300.bin',
    action="substitute")
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)


2025-01-03 18:14:20,279 - INFO - loading projection weights from ./models/GoogleNews-vectors-negative300.bin


2025-01-03 18:14:36,889 - INFO - KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from ./models/GoogleNews-vectors-negative300.bin', 'binary': True, 'encoding': 'utf8', 'datetime': '2025-01-03T18:14:36.889408', 'gensim': '4.3.3', 'python': '3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]', 'platform': 'Linux-6.8.0-50-generic-x86_64-with-glibc2.35', 'event': 'load_word2vec_format'}


--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
[31mwant[39m until jurong [31mJosh_Dotzler[39m crazy available [31mscarcely[39m in bugis n great world [31men_una[39m e buffet cine there got [31mla_musica[39m [31mmicheal[39m 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
[31mUmmmm[39m lar joking [31mgina[39m u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
[31mHEARING_LOSS_OUTREACH[39m entry [31mbefore[39m number a [31mavail_1Dec[39m [31mmcdonalds[39m to win [31missa[39m [31mCzech_Republic_Kveta_Peschke[39m final tkts numberst [31mmaynot[39m [31mboth[39m text fa to [31mgeograph

In [20]:
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path='./models/GoogleNews-vectors-negative300.bin',
    action="substitute", aug_p= 0.3)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go [31mbefore[39m jurong point crazy available only in bugis n great [31mBritain[39m la e buffet cine [31mwhatsoever[39m [31mStates_Morovich[39m [31mtutti[39m [31mkom[39m 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok [31mked[39m joking [31mjoanna[39m u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
[31mscoliosis_screenings[39m entry [31mover[39m [31mvarious[39m a [31mprkg[39m [31mnrl[39m to win [31mpo[39m cup final tkts numberst [31mdoesn_t[39m number text [31mba[39m to number to [31mgiving[39m entry questionstd txt ratetcs 

### glove 6B 100d

In [21]:

aug = naw.WordEmbsAug(
    model_type='glove', model_path='./models/glove.6B.100d.txt',
    action="substitute")
augmented_texts = aug.augment(sample)

print_and_highlight_diff(sample, augmented_texts)

2025-01-03 18:15:09,195 - INFO - loading projection weights from ./models/glove.6B.100d.txt
2025-01-03 18:15:24,849 - INFO - KeyedVectors lifecycle event {'msg': 'loaded (400000, 100) matrix of type float32 from ./models/glove.6B.100d.txt', 'binary': False, 'encoding': 'utf8', 'datetime': '2025-01-03T18:15:24.849573', 'gensim': '4.3.3', 'python': '3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]', 'platform': 'Linux-6.8.0-50-generic-x86_64-with-glibc2.35', 'event': 'load_word2vec_format'}


--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
[31myou[39m until jurong [31mmuch[39m crazy available only in bugis n great world la e buffet [31md'or[39m [31mthings[39m got [31mnaqoyqatsi[39m [31mprambanan[39m 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
[31m'd[39m lar joking wif [31mω[39m oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
free entry in number [31mtake[39m wkly comp [31mseek[39m [31msecured[39m fa cup final [31mfukuchiyama[39m numberst [31mend[39m [31m12[39m text fa to [31mfour[39m to [31msend[39m entry questionstd txt ratetcs [31mable[39m numberovernumbers 
-

In [22]:

aug = naw.WordEmbsAug(
    model_type='glove', model_path='./models/glove.6B.100d.txt',
    action="substitute",  aug_p= 0.3)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go until jurong point [31msomething[39m [31mdvd[39m only in [31mkaranga[39m n great world la [31mblog[39m buffet cine there got [31mloca[39m [31mbhagavathi[39m 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok [31mpra[39m joking wif [31myang[39m oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
free entry in [31mthough[39m a wkly comp to win [31mwembley[39m [31mwinner[39m [31mopening[39m [31mwhifflet[39m numberst [31mfall[39m [31mleast[39m [31mbook[39m fa [31mattempt[39m number to receive entry questionstd txt ratetcs apply numbero

### fasttext

In [24]:
aug = naw.WordEmbsAug(
    model_type='fasttext', model_path='./models/wiki-news-300d-1M.vec',
    action="substitute")
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

2025-01-03 18:15:42,896 - INFO - loading projection weights from ./models/wiki-news-300d-1M.vec


2025-01-03 18:17:30,377 - INFO - KeyedVectors lifecycle event {'msg': 'loaded (999994, 300) matrix of type float32 from ./models/wiki-news-300d-1M.vec', 'binary': False, 'encoding': 'utf8', 'datetime': '2025-01-03T18:17:30.377731', 'gensim': '4.3.3', 'python': '3.10.12 (main, Nov  6 2024, 20:22:13) [GCC 11.4.0]', 'platform': 'Linux-6.8.0-50-generic-x86_64-with-glibc2.35', 'event': 'load_word2vec_format'}


--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go until jurong point [31mhysterical[39m available [31mstill[39m in bugis n [31madmirable[39m world [31mdella[39m e [31mdinners[39m cine [31mthey[39m got amore wat 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok [31mdoc-[39m joking [31myeer[39m u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
[31mopen[39m [31mposting[39m [31mplaying[39m number a wkly comp to win fa [31mvase[39m [31m5th[39m tkts numberst may [31mNumbers[39m text fa [31mmoving[39m number to [31mgarner[39m [31mform[39m questionstd txt ratetcs apply numberovernumb

In [25]:

aug = naw.WordEmbsAug(
    model_type='fasttext', model_path='./models/wiki-news-300d-1M.vec',
    action="substitute", aug_p= 0.3)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
[31mreach[39m until jurong point [31mderanged[39m available only in bugis n [31mHUGE[39m [31msport[39m la e [31meating[39m cine [31manyhow[39m got amore wat 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok [31mlous[39m joking [31mcoc[39m u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
[31mzero-cost[39m [31mcopy[39m in number [31mover[39m wkly [31mUni[39m to win fa [31mVase[39m final tkts numberst may [31m119[39m text fa to [31m89[39m to [31mdistribute[39m [31mattempt[39m questionstd txt ratetcs apply numberovernumbers 
-------

Notes:

- It's always nice to try, but this time we weren't successful. None of the three models made substitutions that make sense. That is expected, since our texts are rather short and carry little semantic context. Augmentation based on keywords provided better results.

In [None]:
# help(naw.WordEmbsAug)

## ContextualWordEmbs method

I'll try DistilBERT (`distilbert-base-uncased`) since the resources are limited.

In [26]:
aug = naw.ContextualWordEmbsAug(
    model_path='distilbert-base-uncased', action="substitute", aug_p= 0.3)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]


In [27]:
aug = nawcwe.ContextualWordEmbsAug(
    model_path='distilbert-base-uncased', 
    action ='substitute', 
    aug_p= 0.8)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 20
go [31mgo[39m jurong [31mwent[39m crazy [31mbut[39m only in [31mcommercials[39m [31mlike[39m [31m[UNK][39m [31msoup[39m [31m[UNK][39m [31mmama[39m [31mdie[39m cine there got amore wat 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 5
[31mrv...[39m [31m≈[39m [31mה[39m [31m[UNK][39m [31m≤[39m 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
free entry in [31mserie[39m a wkly comp [31mweb[39m [31mview[39m fa [31mcups[39m final [31mround[39m [31mfixtures[39m may number text fa to [31mfifa[39m [31m11[39m receive entry [31mregistr

Notes:

As with the word embeddings augmentation method, the results don't make much sense. It also adds symbols, some of them from other scripts. That could be restricted by parameters, but after seeing the results I prefer to discard this method.
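
If this method were kept, one cheap safeguard would be to reject outputs containing `[UNK]` placeholders or non-ASCII symbols. A minimal sketch, with hypothetical helper names and assuming the corpus is meant to be plain ASCII:

```python
def looks_clean(text: str) -> bool:
    """True when the text is plain ASCII and contains no [UNK] placeholder."""
    return text.isascii() and "[UNK]" not in text


def filter_clean(texts: list) -> list:
    """Keep only the texts that pass `looks_clean`."""
    return [t for t in texts if looks_clean(t)]
```

This only removes the obviously broken outputs; it says nothing about whether the remaining substitutions make sense.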

In [None]:
# help(nawcwe.ContextualWordEmbsAug)

## BackTranslationAug method

In [9]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

aug = naw.BackTranslationAug(
    from_model_name='facebook/wmt19-en-de', 
    to_model_name='facebook/wmt19-de-en', 
)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

--------------------------------------------------
Original: 20
go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
Augmented: 33
[31mGo[39m [31mto[39m jurong point crazy available only in bugis n great world la e buffet cine there got amore wat [31mwat[39m [31ma[39m [31mbugis[39m [31mn[39m [31mgreat[39m [31mworld[39m [31mes[39m [31mgot[39m [31mamore[39m [31mwat,[39m [31mthe[39m [31mbuffet[39m [31mc[39m 
--------------------------------------------------
Original: 6
ok lar joking wif u oni
Augmented: 6
ok lar [31mjokes[39m wif u oni 
--------------------------------------------------
Original: 28
free entry in number a wkly comp to win fa cup final tkts numberst may number text fa to number to receive entry questionstd txt ratetcs apply numberovernumbers
Augmented: 28
[31mFree[39m entry [31mto[39m number [31mone[39m wkly comp to win fa cup final tkts [31mnumbered[39m [31mnumbered[39m [31mcan[

In [8]:
augmented_texts

['Go to jurong point crazy available only in bugis n great world la e buffet cine there got amore wat a bugis amore wat n great world it got amore wat a bugis amore wat n great world it got amore wat it got amore wat, the buffet, the buffet cine cine cine cine cine cine cine cine cine cine cine cine cine cine cine cine cine cine cine cine cine cine cine it got amore crazy, until crazy, until it crazy, until it crazy, until go to go, until go, until go to go, go, go, go to go, go, go, go, go, go, go, go, go, go up to go, go, go, go, go, go, go up, go up, go up, go up, go up, go up, go up, go up, go up, go up, go up, go up, go up, go up, go up, go up, go up, go up, go up, go up, go up, go up, go up, go up, go up, go only only only only only only, go, go, go, go up, go, go only only only only only only only go, go, go',
 'ok lar jokes wif u oni',
 'Free entry into number one wkly comp to win cup finals numbered numbered numbered numbered numbered numbered numbered numbered numbered number

Notes:

This method looks interesting, but the length of the sentences gets out of control. 

EDIT: there is no way to force the same length in every sentence without modifying the augmenter class, so this method is discarded.
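
For reference, runaway outputs like the first one above could at least be detected after the fact without modifying the augmenter. The sketch below flags a text when it is much longer than its original or when the same pair of consecutive words repeats too often. The function name and both thresholds are arbitrary assumptions, and this does not make the sentences the same length.

```python
from collections import Counter


def is_runaway(original: str, augmented: str,
               max_ratio: float = 1.5, max_repeat: int = 3) -> bool:
    """Flag outputs that are far longer than the input or that loop on a bigram."""
    orig_words = original.split()
    aug_words = augmented.split()
    # Length check: more than `max_ratio` times the original word count.
    if len(aug_words) > max_ratio * max(len(orig_words), 1):
        return True
    # Repetition check: any pair of consecutive words seen more than `max_repeat` times.
    bigrams = Counter(zip(aug_words, aug_words[1:]))
    return any(count > max_repeat for count in bigrams.values())
```

Flagged texts could then be dropped or replaced by their originals before training.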