# Sampling and testing the augmentation module

This notebook checks the data generated for specific augmentation techniques, such as the synonym dictionary used for synonym replacement.

It also contains code that exercises the functions planned for the data augmentation module.

## Imports and initializations

In [1]:
# Importing the required libraries and packages
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
import translators as ts
import translators.server as tss
from IPython.display import clear_output

Using state Sao Paulo server backend.


In [7]:
# Loading the synonyms into a pandas dataframe
synonyms = pq.read_table('../data/synonyms_pt_BR.parquet').to_pandas()

## Visualizing the data

In [10]:
print('Number of words: ' + str(len(synonyms)))
print('Number of words with each letter of the alphabet:')

alphabet = 'abcdefghijklmnopqrstuvwxyzáàâãéèêíìóòôõúùç'

# Count the number of words that start with each letter of the alphabet and store the result in a dictionary
letter_count = {}

for letter in alphabet:
    letter_count[letter] = 0
    
for word in synonyms['word']:
    # Lowercase the first character; skip empty words and initials outside the alphabet
    first = word[:1].lower()
    if first in letter_count:
        letter_count[first] += 1

# Display the number of words that start with each unaccented letter of the alphabet
for letter in 'abcdefghijklmnopqrstuvwxyz':
    print(letter + ': ' + str(letter_count[letter]))

Number of words: 210977
Number of words with each letter of the alphabet:
a: 31490
b: 5462
c: 21891
d: 23639
e: 27680
f: 7252
g: 3422
h: 1757
i: 11299
j: 830
k: 8
l: 4082
m: 7933
n: 1986
o: 3740
p: 16470
q: 621
r: 16751
s: 10846
t: 7240
u: 1268
v: 4369
w: 3
x: 166
y: 0
z: 539
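
The loop above tallies accented initials (á, é, ú and so on) under their own keys, while the display loop prints only a to z, so words such as "único" do not show up in the printed totals. A minimal sketch of one way to fold accents into the base letter before counting follows. The helper names `fold_accents` and `first_letter_counts` and the sample word list are illustrative assumptions, not part of the augmentation module.

```python
import unicodedata
from collections import Counter

def fold_accents(char):
    """Return the base letter of an accented character (e.g. 'á' -> 'a')."""
    # NFD splits a character into its base letter followed by combining marks
    return unicodedata.normalize('NFD', char)[0]

def first_letter_counts(words):
    """Count words by first letter, lowercased and accent-folded."""
    counts = Counter()
    for word in words:
        if not word:  # skip empty strings instead of raising IndexError
            continue
        counts[fold_accents(word[0].lower())] += 1
    return counts

# Hypothetical sample; with the DataFrame this would be synonyms['word']
sample_words = ['Abade', 'única', 'kit', 'água', 'Kaiser', '']
counts = first_letter_counts(sample_words)
print(counts['a'], counts['u'], counts['k'])
```

A `Counter` returns 0 for letters that never occur, so no pre-initialization loop is needed.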


In [11]:
# Displaying the dataframe
pd.set_option('display.max_colwidth', None)
synonyms

Unnamed: 0,word,synonyms
0,Abade,"[clérigo, confessor, cura, padre, prelado, pároco, sacerdote]"
1,Abadia,"[convento, mosteiro, presbitério, sé, basílica, catedral, igreja, santuário, templo, ádito]"
2,Abalo,"[trepar, concussão, mossa, efervescência, agitação, terremoto, emoção, comoção, choque, estremeção, trepidação, tremor, impulso, balanço, alvoroço, secussão, perturbação]"
3,Abarracamento,"[acampamento, aquartelamento, bivaque]"
4,Abrigada,"[resguardo, refúgio, abrigo, asilo, cobertura, reduto, valhacouto]"
...,...,...
210972,únguis,[úngue]
210973,única,"[uma, inédita]"
210974,único,"[sempar, singular, um, uno, incomparável, ímpar, só, inédito, inconfundível, sui generis, individual]"
210975,únicos,"[sós, uns, individuais, incomparáveis, incomparávéis, inconfundíveis, inéditos, ímpares, singulares]"


In [12]:
# Display the words and synonyms that start with a specific letter
letter = 'k'

synonyms[synonyms['word'].str.lower().str.startswith(letter)].reset_index(drop=True)

Unnamed: 0,word,synonyms
0,kafkiano,"[absurdo, confuso, surreal]"
1,kaiser,"[soberano, rei, majestade]"
2,kamikaze,"[camicase, suicida]"
3,kardecismo,[espiritismo]
4,kit,"[conjunto, coleção, estojo]"
5,kitsch,"[ridículo, brega, cafona]"
6,kiwi,"[quivi, quiuí]"
7,know-how,"[inaptidão, inexperiência]"


In [13]:
# Displaying the synonyms of a specific word
word = 'abreviação'

list(synonyms[synonyms['word'] == word]['synonyms'].values[0])

['abreviatura', 'abreviamento']
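
Filtering the DataFrame scans every row on each lookup. For repeated lookups, one option is to build a plain dictionary once and query it by key. The sketch below assumes lowercased keys and merges entries that differ only by case; `build_synonym_index` is a hypothetical helper, and the rows are copied from the outputs above.

```python
def build_synonym_index(pairs):
    """Map each lowercased word to a list of its synonyms.

    `pairs` is any iterable of (word, synonyms) tuples; with the DataFrame
    above this could be zip(synonyms['word'], synonyms['synonyms']).
    """
    index = {}
    for word, syns in pairs:
        # Entries that differ only by case share one bucket
        bucket = index.setdefault(word.lower(), [])
        for syn in syns:
            if syn not in bucket:
                bucket.append(syn)
    return index

# Rows copied from the outputs shown in this notebook
rows = [
    ('abreviação', ['abreviatura', 'abreviamento']),
    ('Abade', ['clérigo', 'confessor', 'cura', 'padre', 'prelado', 'pároco', 'sacerdote']),
    ('kit', ['conjunto', 'coleção', 'estojo']),
]
index = build_synonym_index(rows)

print(index.get('abreviação', []))
print(len(index.get('abade', [])))
print(index.get('inexistente', []))  # unknown words yield an empty list
```

Using `index.get(word, [])` avoids the `IndexError` that `.values[0]` raises when the word is missing from the DataFrame.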

## Testing the augmentation techniques

### Synonym replacement

In [14]:
# Function to augment a sentence by replacing words with synonyms
def synonyms_replacement(sentence, df):
    # Set the sentence to lowercase
    sentence = sentence.lower()
    
    # Split the sentence into words
    words = sentence.split()

    # For each word in the sentence, find the synonyms
    for i, word in enumerate(words):
        # Check if the word is in the DataFrame
        if word not in df['word'].values:
            continue

        # Named `candidates` to avoid shadowing the global `synonyms` DataFrame
        candidates = list(df[df['word'] == word]['synonyms'].values[0])

        # If there are synonyms, replace the word with one chosen at random
        if len(candidates) > 0:
            words[i] = np.random.choice(candidates)
            
    # Join the words into a sentence
    return ' '.join(words)

In [15]:
# Testing the synonyms replacement function
sentence = 'Teste de augmentação de texto para o projeto.'

augmented_sentence = synonyms_replacement(sentence, synonyms)

# print the original text and the augmented text
print(sentence)
print(augmented_sentence)

Teste de augmentação de texto para o projeto.
arguição de augmentação de teor contra isto projeto.
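
In the output above, the sentence is lowercased and almost every known word is replaced, while "projeto." stays unchanged, most likely because the trailing period prevents a match. The following is a sketch of a variant that keeps punctuation and capitalization and replaces each word only with probability `p`. It assumes a plain dictionary from lowercased words to synonym lists; the function name `replace_with_synonyms`, the parameter `p`, and the small `demo_index` are illustrative assumptions.

```python
import random
import re

def replace_with_synonyms(sentence, index, p=0.5, rng=None):
    """Replace known words with a random synonym, each with probability p."""
    rng = rng or random.Random()
    # Alternating word / non-word tokens, so punctuation and spacing survive
    tokens = re.findall(r'\w+|\W+', sentence)
    for i, token in enumerate(tokens):
        candidates = index.get(token.lower())
        if not candidates or rng.random() >= p:
            continue
        choice = rng.choice(candidates)
        # Preserve the capitalization of the original token's first letter
        if token[0].isupper():
            choice = choice[:1].upper() + choice[1:]
        tokens[i] = choice
    return ''.join(tokens)

# Hypothetical mini index; the pairs mirror replacements seen in the output above
demo_index = {'teste': ['arguição'], 'texto': ['teor']}

always = replace_with_synonyms('Teste de texto.', demo_index, p=1.0)
never = replace_with_synonyms('Teste de texto.', demo_index, p=0.0)
print(always)
print(never)
```

Passing a seeded `random.Random` as `rng` makes runs reproducible. One limitation of this tokenization is that hyphenated entries such as "know-how" are split into separate tokens and will not match.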


### Back translation

In [16]:
# Translate the sentence to another language (English in this example) and then back to Portuguese
english = ts.translate_text(sentence, translator='google', to_language='en')
portuguese = ts.translate_text(english, translator='google', to_language='pt')

print(english)
print(portuguese)

Text augmentation test for the project.
Teste de aumento de texto para o projeto.
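
The two calls above can be wrapped in a small helper. The sketch below takes the translation function as a parameter so it can be exercised without network access; `back_translate`, `google_translate`, and `replay_translate` are hypothetical names, and `google_translate` simply reuses the `ts.translate_text` call from the cell above.

```python
def back_translate(text, translate, pivot='en', source='pt'):
    """Translate `text` to the pivot language and back to the source language.

    `translate(text, to_language)` is any callable that returns a string.
    """
    pivot_text = translate(text, pivot)
    return translate(pivot_text, source)

def google_translate(text, to_language):
    # Imported lazily so back_translate can be used without the package installed
    import translators as ts
    return ts.translate_text(text, translator='google', to_language=to_language)

# Offline stand-in that replays the outputs recorded in the cell above;
# it is not a real translator and only knows these two strings
recorded = {
    ('Teste de augmentação de texto para o projeto.', 'en'):
        'Text augmentation test for the project.',
    ('Text augmentation test for the project.', 'pt'):
        'Teste de aumento de texto para o projeto.',
}

def replay_translate(text, to_language):
    return recorded[(text, to_language)]

original = 'Teste de augmentação de texto para o projeto.'
result = back_translate(original, replay_translate)
print(result)
```

With network access, `back_translate(original, google_translate)` would issue the same two requests as the cell above. Trying a different `pivot` language is one way to get more varied paraphrases.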


## Other tests (still in progress)

In [21]:
# Import WordNet from NLTK
import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')

# Look up the synsets for the term "set"
syns = wordnet.synsets("set")

# Print all the synsets
print(syns)

[Synset('set.n.01'), Synset('set.n.02'), Synset('set.n.03'), Synset('stage_set.n.01'), Synset('set.n.05'), Synset('bent.n.01'), Synset('set.n.07'), Synset('set.n.08'), Synset('hardening.n.02'), Synset('set.n.10'), Synset('set.n.11'), Synset('set.n.12'), Synset('set.n.13'), Synset('put.v.01'), Synset('determine.v.03'), Synset('specify.v.02'), Synset('set.v.04'), Synset('set.v.05'), Synset('set.v.06'), Synset('fix.v.12'), Synset('set.v.08'), Synset('set.v.09'), Synset('set.v.10'), Synset('arrange.v.06'), Synset('plant.v.01'), Synset('set.v.13'), Synset('jell.v.01'), Synset('typeset.v.01'), Synset('set.v.16'), Synset('set.v.17'), Synset('set.v.18'), Synset('sic.v.01'), Synset('place.v.11'), Synset('rig.v.04'), Synset('set_up.v.04'), Synset('adjust.v.01'), Synset('fructify.v.03'), Synset('dress.v.16'), Synset('fit.s.02'), Synset('fixed.s.02'), Synset('located.s.01'), Synset('laid.s.01'), Synset('set.s.05'), Synset('determined.s.04'), Synset('hardened.s.05')]


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\artur\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [141]:
# Load the scraped synonyms from the parquet file
df = pq.read_table('../scrapy-sinonimos/synonyms_scraper/synonyms.parquet').to_pandas()

df

Unnamed: 0,word,synonyms
0,"a ver-o-mar, amorim e terroso",[]
1,a do baço,[]
2,a ver-o-mar,[]
3,a da gorda,[]
4,a da beja,[]
...,...,...
7146,inchacha,[]
7147,indaiabira,[]
7148,incaia,[]
7149,imperador,"[kaiser, rei, soberano, majestade, monarca, senhor, avassalador, césar]"
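
Many of the scraped rows shown above have an empty synonym list. A quick sanity check is to count how many entries carry at least one synonym. The sketch below works on (word, synonyms) pairs; `synonym_coverage` is a hypothetical helper, and the sample rows are copied from the output above.

```python
def synonym_coverage(pairs):
    """Return (entries with at least one synonym, total entries)."""
    total = 0
    non_empty = 0
    for _word, syns in pairs:
        total += 1
        # len() is used instead of truthiness because list columns read from
        # parquet may come back as NumPy arrays, whose truth value is ambiguous
        if len(syns) > 0:
            non_empty += 1
    return non_empty, total

# Rows copied from the output above
scraped_rows = [
    ('a do baço', []),
    ('a ver-o-mar', []),
    ('inchacha', []),
    ('imperador', ['kaiser', 'rei', 'soberano', 'majestade',
                   'monarca', 'senhor', 'avassalador', 'césar']),
]
non_empty, total = synonym_coverage(scraped_rows)
print(f'{non_empty} of {total} entries have synonyms')
```

With the DataFrame, the same pairs could be produced with `zip(df['word'], df['synonyms'])`.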
