## Vocabulary Analysis

In [1]:
%load_ext autoreload
%autoreload 2
import transformers
from transformers import AutoModel, AutoTokenizer

model_name = 'dccuchile/bert-base-spanish-wwm-uncased'

model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


Some weights of BertModel were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-uncased and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [2]:
tokenizer.tokenize("CHUPAME LA ***")

['chupa', '##me', 'la', '*', '*', '*']

In [3]:
tokenizer.tokenize("pija")

['pi', '##ja']

In [4]:
tokenizer.tokenize("trolo")

['tro', '##lo']

In [5]:
tokenizer.tokenize("maricón")

['maricón']

In [6]:
tokenizer.tokenize("marica")

['marica']

In [7]:
tokenizer.tokenize("puto")

['puto']

In [8]:
tokenizer.tokenize("Hacete ortear viejo trolazo")

['hace', '##te', 'or', '##tear', 'viejo', 'tro', '##laz', '##o']

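This is a problem: several of the terms we care about are split into subword pieces that carry little meaning on their own. Let's see how we might add these tokens to the vocabulary...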
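To make the splits above easier to reason about, here is a minimal sketch of the greedy longest-match-first procedure that WordPiece tokenizers use on a single word. The function `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and only for illustration; the real vocabulary of `dccuchile/bert-base-spanish-wwm-uncased` has roughly 31k entries, and the real tokenizer also handles lowercasing and punctuation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one (already lowercased) word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the "##" continuation prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (an assumption, not the model's real vocabulary).
toy_vocab = {"viejo", "puto", "tro", "##lo", "##laz", "##o"}

print(wordpiece_tokenize("trolazo", toy_vocab))  # ['tro', '##laz', '##o']

# Once the full word is in the vocabulary it is no longer split.
toy_vocab.add("trolazo")
print(wordpiece_tokenize("trolazo", toy_vocab))  # ['trolazo']
```

The second call shows why adding whole words to the vocabulary changes the output: the longest match is tried first, so a full-word entry always wins over its fragments.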

In [10]:
tokenizer.special_tokens_map

{'unk_token': '[UNK]',
 'sep_token': '[SEP]',
 'pad_token': '[PAD]',
 'cls_token': '[CLS]',
 'mask_token': '[MASK]'}

In [20]:
tokenizer.add_tokens([
    "[HASHTAG]",
])

model.resize_token_embeddings(len(tokenizer))

Embedding(31003, 768)

## Training a new tokenizer

In [53]:
from hatedetection import load_datasets

train_dataset, dev_dataset, test_dataset = load_datasets()

In [54]:
from hatedetection.preprocessing import preprocess_tweet

preprocess_tweet("@clarincom jajajaja #NoVuelvenMas 🤣❌❌", hashtag_token="[HASHTAG]")

'[USER] jaja [HASHTAG] no vuelven mas [EMOJI]cara revolviéndose de la risa[EMOJI][EMOJI]marca de cruz[EMOJI][EMOJI]marca de cruz[EMOJI]'

Let's look at the tokens that appear at least 10 times in the training set (`min_frequency=10`).

In [55]:
from tokenizers import BertWordPieceTokenizer

new_tokenizer = BertWordPieceTokenizer(lowercase=True)
texts = [ex["text"] for ex in train_dataset]
new_tokenizer.train_from_iterator(
    texts, min_frequency=10
)

In [56]:
old_tokens = set(tokenizer.get_vocab())

missing_tokens = [tok for tok in new_tokenizer.get_vocab() if tok not in old_tokens]

len(missing_tokens)

2154

Previously there were ~2900 missing tokens; we removed almost 800. Good!
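Not every missing token is equally worth adding. One possible way to prioritise them, sketched below, is to keep only whole-word tokens (those without the `##` continuation prefix) and rank them by how often they occur in the corpus. The helper `rank_missing_words` and the sample data are hypothetical; in the notebook, `missing_tokens` and `texts` would be the variables defined above, and whitespace splitting is only a rough stand-in for real pre-tokenization.

```python
from collections import Counter


def rank_missing_words(missing_tokens, texts, top_k=20):
    """Rank whole-word missing tokens by corpus frequency (rough sketch)."""
    # Keep only whole words; "##..." entries are continuation pieces.
    candidates = {tok for tok in missing_tokens if not tok.startswith("##")}
    counts = Counter(
        word
        for text in texts
        for word in text.lower().split()  # naive whitespace split (assumption)
        if word in candidates
    )
    return counts.most_common(top_k)


# Toy data standing in for the real `missing_tokens` and `texts`.
sample_missing = ["##chorros", "cuarentena", "kirchner", "##ccion"]
sample_texts = [
    "la cuarentena sigue",
    "Cuarentena otra vez",
    "kirchner y la cuarentena",
]

print(rank_missing_words(sample_missing, sample_texts))
# [('cuarentena', 3), ('kirchner', 1)]
```

The top of such a ranking could then be passed to `tokenizer.add_tokens`, followed by `model.resize_token_embeddings(len(tokenizer))` as in the earlier cell.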

In [57]:
# Print the missing tokens in alphabetical order, numbered from 1.
for i, tok in enumerate(sorted(missing_tokens)):
    print(f"{i+1:<4} -- {tok}")

1    -- #
2    -- ##5n
3    -- ##aaa
4    -- ##aan
5    -- ##abon
6    -- ##acion
7    -- ##aj
8    -- ##aleza
9    -- ##amer
10   -- ##anal
11   -- ##anan
12   -- ##andose
13   -- ##arde
14   -- ##arent
15   -- ##arentena
16   -- ##arma
17   -- ##aroni
18   -- ##bacion
19   -- ##bajo
20   -- ##baron
21   -- ##bica
22   -- ##carce
23   -- ##cci
24   -- ##ccion
25   -- ##cciones
26   -- ##cepcion
27   -- ##cepto
28   -- ##cero
29   -- ##chando
30   -- ##charon
31   -- ##chazo
32   -- ##chera
33   -- ##chita
34   -- ##chner
35   -- ##chor
36   -- ##chorros
37   -- ##choso
38   -- ##chul
39   -- ##ciada
40   -- ##cian
41   -- ##ciando
42   -- ##ciaron
43   -- ##ciela
44   -- ##cien
45   -- ##ciendose
46   -- ##cog
47   -- ##cras
48   -- ##cridad
49   -- ##ct
50   -- ##cter
51   -- ##cto
52   -- ##ctor
53   -- ##ctora
54   -- ##ctores
55   -- ##ctos
56   -- ##ctu
57   -- ##ctura
58   -- ##cua
59   -- ##cues
60   -- ##cun
61   -- ##cuper
62   -- ##dable
63   -- ##dalla
64   -- ##dando
65   

In [47]:
import re

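# Laughter pattern: a "j", a run of j/a, the pair "aj", then another run of j/a.
laughter_regex = re.compile(r"j[ja]+aj[ja]+")

# match() anchors at the start of the string, so this returns None:
# "ajjjjjj" starts with "a", not "j".
laughter_regex.match("ajjjjjj")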
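As a possible follow-up to the experiment above, here is a sketch of how laughter could be collapsed to a single canonical form, similar to the `jajajaja` to `jaja` normalisation seen in the `preprocess_tweet` output. This is not the actual implementation in `hatedetection`; the pattern `LAUGHTER` and the function `normalize_laughter` are assumptions made for illustration, and they only cover laughter written with "j" and "a".

```python
import re

# A standalone word made only of "j" and "a" that contains at least two
# "j...a..." groups, e.g. "jaja", "jajajaja", "ajjajaj". Hypothetical pattern.
LAUGHTER = re.compile(r"\b(?:a*(?:j+a+){2,}j*)\b", re.IGNORECASE)


def normalize_laughter(text, replacement="jaja"):
    """Replace every laughter-like word with a single canonical token."""
    return LAUGHTER.sub(replacement, text)


print(normalize_laughter("@clarincom jajajaja #NoVuelvenMas"))
# @clarincom jaja #NoVuelvenMas

# Ordinary words that merely contain "ja" are left alone.
print(normalize_laughter("baja la voz"))
# baja la voz
```

Using word boundaries and `sub` rather than `match` means the pattern applies anywhere in the tweet, and strings such as `ajjjjjj`, which contain no repeated "ja" group, are left unchanged.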