This notebook serves as a base and reference for implementing text augmentation with the nlpaug library.


Inspiration taken from: https://www.kaggle.com/code/andypenrose/text-augmentation-with-nlpaug

In [1]:
%load_ext autoreload
%autoreload 2

## Dataset

In [3]:
import pandas as pd

df = pd.read_csv("../../data/silver/df_cleantext_v0.csv")
df

Unnamed: 0,Category,Message
0,0,go until jurong point crazy available only in ...
1,0,ok lar joking wif u oni
2,1,free entry in number a wkly comp to win fa cup...
3,0,u dun say so early hor u c already then say
4,0,nah i dont think he goes to usf he lives aroun...
...,...,...
5152,1,this is the numbernd time we have tried number...
5153,0,will you b going to esplanade fr home
5154,0,pity was in mood for that soany other suggestions
5155,0,the guy did some bitching but i acted like id ...


In [45]:
import os
import sys
sys.path.append(os.path.abspath("../utils"))
from experiments_utils import print_and_highlight_diff
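
The `print_and_highlight_diff` helper is imported from `../utils` and its code is not shown in this notebook. As a rough idea of what such a helper could do, here is a hypothetical stand-in built on `difflib`. The names `highlight_diff` and `print_diffs` and the `[[...]]` markers are assumptions of mine, not the real implementation.

```python
import difflib


def highlight_diff(original: str, augmented: str) -> str:
    """Return `augmented` with tokens that differ from `original` wrapped in [[ ]]."""
    orig_tokens = original.split()
    aug_tokens = augmented.split()
    matcher = difflib.SequenceMatcher(a=orig_tokens, b=aug_tokens, autojunk=False)
    out = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        chunk = aug_tokens[j1:j2]
        if tag == "equal":
            out.extend(chunk)
        else:
            # 'replace' and 'insert' contribute new tokens; 'delete' contributes none.
            out.extend(f"[[{tok}]]" for tok in chunk)
    return " ".join(out)


def print_diffs(originals, augmented):
    """Print each original next to its augmented version with changes marked."""
    for orig, aug in zip(originals, augmented):
        print("Original: ", orig)
        print("Augmented:", highlight_diff(orig, aug))
        print()
```

Comparing whitespace tokens rather than characters keeps the markers readable when an augmenter replaces whole words.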

## Chunk for analysis

In [14]:
pd.set_option('display.max_colwidth', None)

sample = df['Message'].iloc[:5].astype(str).to_list()


In [15]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.word.context_word_embs as nawcwe
import nlpaug.augmenter.word.word_embs as nawwe
import nlpaug.augmenter.word.spelling as naws

## KeyboardAug method

In [None]:
aug = nac.KeyboardAug()
augmented_texts = aug.augment(sample)

print_and_highlight_diff(sample, augmented_texts)

Notes: 
1. The original cleaning criteria dropped special characters and numbers, so the augmenter should not reintroduce them. The `include_special_char` and `include_numeric` parameters should be changed accordingly. The same applies to upper/lower case (`include_upper_case`).
2. Many characters in each modified word are changed, so the results don't look realistic to me.


In [None]:
aug = nac.KeyboardAug(aug_word_p=0.1, include_numeric=False, include_special_char=False, include_upper_case=False)
augmented_texts = aug.augment(sample)

print_and_highlight_diff(sample, augmented_texts)

Notes: 

The misspellings are still heavy-handed and artificial. They don't make sense to me.
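
One way to picture a milder typo model is to change at most one character per selected word, swapping it for a neighbouring key. The sketch below is a standalone illustration and not nlpaug's implementation. The function name `one_typo_per_word`, its parameters, and the simplified neighbour map (left/right keys on the same QWERTY row only) are all assumptions.

```python
import random

QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Map each letter to its left/right neighbours on the same row (a simplification).
NEIGHBOURS = {}
for row in QWERTY_ROWS:
    for i, ch in enumerate(row):
        NEIGHBOURS[ch] = [row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)]


def one_typo_per_word(text, word_p=0.1, min_len=4, rng=None):
    """Replace at most one character in each selected word with an adjacent key."""
    rng = rng or random.Random()
    words = text.split()
    for idx, word in enumerate(words):
        # Skip short words, and skip the rest with probability 1 - word_p.
        if len(word) < min_len or rng.random() >= word_p:
            continue
        positions = [i for i, ch in enumerate(word) if ch in NEIGHBOURS]
        if not positions:
            continue
        pos = rng.choice(positions)
        words[idx] = word[:pos] + rng.choice(NEIGHBOURS[word[pos]]) + word[pos + 1:]
    return " ".join(words)


print(one_typo_per_word("nah i dont think he goes to usf", word_p=0.5, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes the output reproducible, which helps when comparing augmenters side by side.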


In [8]:
# help(nac.KeyboardAug)
# help(aug.augment)

## SpellingAug method

This method substitutes words using a dictionary of common spelling mistakes.

In [None]:
aug = naws.SpellingAug()
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

Notes: 

This method creates more realistic results than the previous technique.

In [None]:
aug = naws.SpellingAug(aug_p=0.5)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

Notes:

1. The results are much more realistic.
2. The dictionary includes upper-case characters, which should be handled later on.


In [11]:
#help(naws.SpellingAug())
# help(aug.augment)

## SynonymAug

Substitutes words with synonyms from WordNet or PPDB.

The default source is WordNet.

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')

In [None]:
aug = naw.SynonymAug()
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

Notes:

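1. Some substitutions add extra words. This seems to happen when the chosen synonym is a multi-word expression: in the example below, `c` became `ampere second`, presumably because WordNet lists `C` as the coulomb unit.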

- Original: 11
- u dun say so early hor u c already then say
- Augmented: 12
- u dun say so other hor u ampere second already then say 

2. The substitutions look more realistic to me.
3. According to the function's parameters, it supports two synonym sources: 'wordnet' and 'ppdb'.

In [None]:
aug = naw.SynonymAug(aug_p=0.3)
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

Notes: 

1. Indeed, we should be careful with the percentage of augmented words, because substitutions can add new words and change the length of the sentence.
2. Some synonyms don't fit the meaning of the sentence, in my opinion.

This cannot be used unless we find a way to force 1-to-1 swapping, and perhaps not even then. We should also restrict the swaps to absolute synonyms.
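
Until a true 1-to-1 swap is available, one stopgap is to reject any augmented message whose token count differs from the original. The helper below is a hypothetical post-filter (the name `keep_same_length` is mine). It only compares whitespace token counts and says nothing about whether a synonym fits the meaning.

```python
def keep_same_length(originals, augmented):
    """Keep an augmented text only if it has as many tokens as its original.

    Falls back to the original text otherwise, so the output stays aligned
    with the input list.
    """
    kept = []
    for orig, aug in zip(originals, augmented):
        same_length = len(orig.split()) == len(aug.split())
        kept.append(aug if same_length else orig)
    return kept


originals = ["u dun say so early hor u c already then say"]
augmented = ["u dun say so other hor u ampere second already then say"]

# The augmented text has 12 tokens against 11, so the original is kept.
print(keep_same_length(originals, augmented))
```

Falling back to the original rather than dropping the row keeps labels aligned, at the cost of producing some unaugmented duplicates.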

In [24]:
# help(naw.SynonymAug())

## WordEmbeddingAug method

This method inserts or substitutes words based on word-embedding similarity. The cells below use `action="substitute"`.

The `model_type` parameter selects the embedding model. Expected values include 'word2vec', 'glove' and 'fasttext'.
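
The idea behind similarity-based substitution can be shown with a toy example: pick the vocabulary word whose vector has the highest cosine similarity to the target word. The three-dimensional vectors below are made up for illustration and do not come from word2vec, GloVe, or fastText. The helper names are also hypothetical and unrelated to nlpaug's internals.

```python
import math

# Made-up toy vectors; real embeddings have hundreds of dimensions.
TOY_VECTORS = {
    "free": [0.9, 0.1, 0.0],
    "complimentary": [0.8, 0.2, 0.1],
    "home": [0.0, 0.9, 0.3],
    "house": [0.1, 0.8, 0.4],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_word(word, vectors=TOY_VECTORS):
    """Return the most similar other word, or the word itself if out of vocabulary."""
    if word not in vectors:
        return word
    candidates = [w for w in vectors if w != word]
    return max(candidates, key=lambda w: cosine(vectors[word], vectors[w]))


print(nearest_word("free"))
print(nearest_word("home"))
```

Words missing from the vocabulary are returned unchanged, which matters for this dataset because much of the SMS slang is unlikely to appear in general-purpose embeddings.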

### Download models


In [7]:
import os 
models_dir = "./models"
os.makedirs(models_dir, exist_ok=True)

In [8]:
from huggingface_hub import hf_hub_download

# model_path = hf_hub_download(
#     repo_id="NathaNn1111/word2vec-google-news-negative-300-bin", 
#     filename="GoogleNews-vectors-negative300.bin",
#     local_dir=models_dir
# )

In [9]:
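import os
import zipfile

import requests

glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"
zip_path = "./models/glove.6B.zip"

# Stream the archive to disk in chunks instead of holding it all in memory.
with requests.get(glove_url, stream=True, timeout=60) as response:
    response.raise_for_status()  # fail early on HTTP errors
    with open(zip_path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=1 << 20):
            f.write(chunk)

# Extract the embedding files, then delete the archive to save disk space.
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall("./models")
os.remove(zip_path)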

In [5]:

model_path = hf_hub_download(
    repo_id="facebook/fasttext-language-identification",
    filename="model.bin",
    cache_dir=models_dir)



Model downloaded and saved to: ./models/models--facebook--fasttext-language-identification/snapshots/3af127d4124fc58b75666f3594bb5143b9757e78/model.bin


### word2vec

Too heavy for my machine; leaving this to run on GitHub Actions.

In [None]:
aug = naw.WordEmbsAug(
    model_type='word2vec', 
    model_path="./models/GoogleNews-vectors-negative300.bin",
    action="substitute")
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)

### glove 6B 100d

In [None]:
aug = naw.WordEmbsAug(
    model_type='glove', 
    model_path="./models/glove.6B.100d.txt",
    action="substitute")
augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)



### fasttext

In [None]:
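# Requires the nlpaug imports cell (which defines `naw`) to have been run first.
# Note: this repository hosts a language-identification model; whether nlpaug
# can load it as word embeddings is an assumption that still needs checking.
aug = naw.WordEmbsAug(
    model_type='fasttext', 
    # Snapshot path where hf_hub_download saved model.bin
    model_path="./models/models--facebook--fasttext-language-identification/snapshots/3af127d4124fc58b75666f3594bb5143b9757e78/model.bin",  
    action="substitute"  
)

augmented_texts = aug.augment(sample)
print_and_highlight_diff(sample, augmented_texts)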

NameError: name 'naw' is not defined

In [None]:
# help(naw.WordEmbsAug)

## ContextualWordEmbs method

I'll try SqueezeBERT because I want to run as much as possible locally.

In [None]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_dir = "./models/squeezebert"

model_name = "squeezebert/squeezebert-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

tokenizer.save_pretrained(model_dir)
model.save_pretrained(model_dir)


In [None]:
aug = nawcwe.ContextualWordEmbsAug(model_path=model_dir, action='substitute', aug_p=0.3)
augmented_texts = aug.augment(sample)
# augmented_texts = [x.replace(" ' ", "'") for x in augmented_texts]
print_and_highlight_diff(sample, augmented_texts)

In [None]:
help(nawcwe.ContextualWordEmbsAug)