## Vocabulary Analysis

In [1]:
%load_ext autoreload
%autoreload 2
import transformers
from transformers import AutoModel, AutoTokenizer

model_name = 'dccuchile/bert-base-spanish-wwm-uncased'

model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


Some weights of BertModel were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [2]:
tokenizer.tokenize("CHUPAME LA ***")

['chupa', '##me', 'la', '*', '*', '*']

In [3]:
tokenizer.tokenize("pija")

['pi', '##ja']

In [4]:
tokenizer.tokenize("trolo")

['tro', '##lo']

In [5]:
tokenizer.tokenize("maricón")

['maricón']

In [6]:
tokenizer.tokenize("marica")

['marica']

In [7]:
tokenizer.tokenize("puto")

['puto']

In [8]:
tokenizer.tokenize("Hacete ortear viejo trolazo")

['hace', '##te', 'or', '##tear', 'viejo', 'tro', '##laz', '##o']

This is a problem: several slurs and insults are split into subword pieces that carry no meaning on their own. Let's see how we might add these tokens...
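
To make the splitting behaviour above concrete, here is a minimal sketch of the greedy longest-match-first strategy that WordPiece tokenizers use on a single word. The function name `wordpiece` and the tiny `toy_vocab` are assumptions made for illustration only; the real tokenizer also handles normalization, punctuation and a much larger vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word (illustrative only)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen to mirror the pieces seen in the outputs above.
toy_vocab = {"tro", "##lo", "##laz", "##o", "viejo", "puto"}

for w in ["puto", "trolo", "trolazo"]:
    print(w, wordpiece(w, toy_vocab))

# Once the full word is in the vocabulary, it is no longer split.
print(wordpiece("trolo", toy_vocab | {"trolo"}))
```

Under this scheme, a word stays whole only if it is itself a vocabulary entry, which is why adding domain-specific tokens changes the segmentation.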

In [10]:
tokenizer.special_tokens_map

{'unk_token': '[UNK]',
 'sep_token': '[SEP]',
 'pad_token': '[PAD]',
 'cls_token': '[CLS]',
 'mask_token': '[MASK]'}

In [20]:
tokenizer.add_tokens([
    "[HASHTAG]",
])

model.resize_token_embeddings(len(tokenizer))

Embedding(31003, 768)

## Training a new tokenizer

In [21]:
from hatedetection import load_datasets

train_dataset, dev_dataset, test_dataset = load_datasets()

In [22]:
from hatedetection.preprocessing import preprocess_tweet

preprocess_tweet("@clarincom jajajaja #NoVuelvenMas 🤣❌❌", hashtag_token="[HASHTAG]")

'[USER] jajajaja [HASHTAG] no vuelven mas [EMOJI]cara revolviéndose de la risa[EMOJI][EMOJI]marca de cruz[EMOJI][EMOJI]marca de cruz[EMOJI]'
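
The output above suggests that mentions are replaced by a placeholder and that camel-case hashtags are split into lowercase words. The following is a rough sketch of how such a step could be written with regular expressions. It is not the code of `hatedetection.preprocessing`: the helpers `replace_users` and `split_hashtag` are hypothetical names, and emoji handling is left out.

```python
import re


def replace_users(text, user_token="[USER]"):
    """Replace @mentions with a placeholder token (hypothetical helper)."""
    return re.sub(r"@\w+", user_token, text)


def split_hashtag(text, hashtag_token="[HASHTAG]"):
    """Turn '#NoVuelvenMas' into '[HASHTAG] no vuelven mas' (hypothetical helper)."""

    def repl(match):
        body = match.group(1)
        # Insert a space at every lowercase-to-uppercase boundary, then lowercase.
        words = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", body)
        return f"{hashtag_token} {words.lower()}"

    return re.sub(r"#(\w+)", repl, text)


tweet = "@clarincom jajajaja #NoVuelvenMas"
print(split_hashtag(replace_users(tweet)))
```

Splitting on case boundaries is only one possible heuristic; all-lowercase hashtags would pass through as a single word.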

Let's look at the tokens that appear more than 10 times.

In [25]:
from tokenizers import BertWordPieceTokenizer

new_tokenizer = BertWordPieceTokenizer(lowercase=True)
texts = [ex["text"] for ex in train_dataset]
new_tokenizer.train_from_iterator(
    texts, min_frequency=10
)

In [26]:
old_tokens = set(tokenizer.get_vocab())

missing_tokens = [tok for tok in new_tokenizer.get_vocab() if tok not in old_tokens]

len(missing_tokens)

2163

Previously there were ~2900 missing tokens, so we removed almost 800. Good!
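
A frequency threshold alone still lets through noise such as elongated laughter fragments (`##ajaja`, `##ajajaj`, ...). Below is a minimal sketch of one possible filter to apply before adding candidates to the vocabulary. The helpers `is_laughter` and `filter_candidates` are hypothetical, and the rule itself (a piece made only of the letters `a` and `j`, at least three characters long) is an assumption that would need tuning against the real list.

```python
import re

# Assumption: a "laughter" piece consists only of the letters a/j, length >= 3.
LAUGHTER = re.compile(r"[aj]{3,}")


def is_laughter(token):
    """Return True if the token (ignoring a leading ##) looks like 'jaja' noise."""
    core = token[2:] if token.startswith("##") else token
    return LAUGHTER.fullmatch(core) is not None


def filter_candidates(tokens):
    """Keep only the tokens that are not laughter fragments."""
    return [t for t in tokens if not is_laughter(t)]


# A few entries copied from the candidate list in this notebook.
sample = ["#", "##acion", "##aj", "##aja", "##ajajaj", "##ajajajj", "##bajo", "##ccion"]
kept = filter_candidates(sample)
print(kept)
```

The surviving tokens could then be passed to `tokenizer.add_tokens(...)`, followed by `model.resize_token_embeddings(len(tokenizer))`, as done earlier for `[HASHTAG]`.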

In [27]:
# List the candidate tokens in sorted order, numbered from 1
for i, tok in enumerate(sorted(missing_tokens), start=1):
    print(f"{i:<4} -- {tok}")

1    -- #
2    -- ##5n
3    -- ##aaa
4    -- ##aan
5    -- ##abon
6    -- ##acion
7    -- ##acto
8    -- ##aj
9    -- ##aja
10   -- ##ajaa
11   -- ##ajaj
12   -- ##ajaja
13   -- ##ajajaj
14   -- ##ajajaja
15   -- ##ajajajaj
16   -- ##ajajajaja
17   -- ##ajajajajajajajaj
18   -- ##ajajajajajajajajajajajajajajajaj
19   -- ##ajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajajaj
20   -- ##ajajajajj
21   -- ##ajajajj
22   -- ##ajajajjajajajj
23   -- ##ajajj
24   -- ##ajas
25   -- ##ajj
26   -- ##ajja
27   -- ##ajjaja
28   -- ##ajo
29   -- ##ajsj
30   -- ##aju
31   -- ##ajuaju
32   -- ##aleza
33   -- ##amer
34   -- ##anal
35   -- ##anan
36   -- ##andose
37   -- ##arde
38   -- ##arent
39   -- ##arentena
40   -- ##arma
41   -- ##aroni
42   -- ##atr
43   -- ##aur
44   -- ##bacion
45   -- ##baj
46   -- ##bajo
47   -- ##baron
48   -- ##carce
49   -- ##cci
50   -- ##ccion
51   -- ##cciones
52   -- ##cepcion
53   -- ##cepto
54   -- ##cero
55   -- ##chando
56   -- ##charon
57   -- ##chazo