Load the downloaded toxicity classifier:

In [25]:
!unzip data.tgz

Archive:  data.tgz
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of data.tgz or
        data.tgz.zip, and cannot find data.tgz.ZIP, period.


In [26]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [24]:
!wget -O data.tgz https://disk.yandex.ru/d/9fAiLtgX-rMjtQ

--2021-10-17 06:37:25--  https://disk.yandex.ru/d/9fAiLtgX-rMjtQ
Resolving disk.yandex.ru (disk.yandex.ru)... 87.250.250.50, 2a02:6b8::2:50
Connecting to disk.yandex.ru (disk.yandex.ru)|87.250.250.50|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31095 (30K) [text/html]
Saving to: ‘data.tgz’


2021-10-17 06:37:26 (220 KB/s) - ‘data.tgz’ saved [31095/31095]
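
The wget log above reports a 30K `text/html` response, which suggests the share link returned a web page rather than the archive itself, and `unzip` would not open a `.tgz` in any case. A minimal sketch for checking what was actually saved, by looking at the leading bytes of the file; the helper name `sniff_archive_kind` and the sample byte strings are illustrative assumptions, not part of the notebook:

```python
def sniff_archive_kind(head: bytes) -> str:
    """Guess the container type from the first bytes of a file."""
    if head.startswith(b"\x1f\x8b"):
        return "gzip"      # .tgz / .tar.gz: extract with `tar -xzf`
    if head.startswith(b"PK\x03\x04"):
        return "zip"       # extract with `unzip`
    if head.lstrip().lower().startswith((b"<!doctype", b"<html")):
        return "html"      # a web page was saved instead of the archive
    return "unknown"


# Hypothetical samples standing in for the first bytes of data.tgz.
print(sniff_archive_kind(b"\x1f\x8b\x08\x00"))       # gzip
print(sniff_archive_kind(b"<!DOCTYPE html><html>"))  # html

# On the real file one would read a short prefix first:
# with open("data.tgz", "rb") as f:
#     print(sniff_archive_kind(f.read(16)))
```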



In [2]:
!pip install transformers torch sentencepiece gensim



In [10]:
import pandas as pd
import numpy as np

In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
  
tokenizer = AutoTokenizer.from_pretrained("unitary/multilingual-toxic-xlm-roberta")

model = AutoModelForSequenceClassification.from_pretrained("unitary/multilingual-toxic-xlm-roberta").cuda()

TOXIC_CLASS=-1
TOKENIZATION_TYPE='sentencepiece'


Downloading:   0%|          | 0.00/211 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/635 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/4.83M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/150 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.04G [00:00<?, ?B/s]

Below are the functions for applying the classifier.

In [4]:
from torch import softmax, sigmoid
import numpy as np


ALLOWED_ALPHABET=list(map(chr, range(ord('а'), ord('я') + 1)))
ALLOWED_ALPHABET.extend(map(chr, range(ord('a'), ord('z') + 1)))
ALLOWED_ALPHABET.extend(list(map(str.upper, ALLOWED_ALPHABET)))
ALLOWED_ALPHABET = set(ALLOWED_ALPHABET)


def logits_to_toxic_probas(logits):
    if logits.shape[-1] > 1:
        activation = lambda x: softmax(x, -1)
    else:
        activation = sigmoid
    return activation(logits)[:, TOXIC_CLASS].cpu().detach().numpy()


def is_word_start(token):
    if TOKENIZATION_TYPE == 'sentencepiece':
        return token.startswith('▁')
    if TOKENIZATION_TYPE == 'bert':
        return not token.startswith('##')
    raise ValueError("Unknown tokenization type")


def normalize(sentence, max_tokens_per_word=20):
    sentence = ''.join(map(lambda c: c if c.isalpha() else ' ', sentence.lower()))
    ids = tokenizer(sentence)['input_ids']
    tokens = tokenizer.convert_ids_to_tokens(ids)[1:-1]
    
    result = []
    num_continuation_tokens = 0
    for token in tokens:
        if not is_word_start(token):
            num_continuation_tokens += 1
            if num_continuation_tokens < max_tokens_per_word:
                result.append(token.lstrip('#▁'))
        else:
            num_continuation_tokens = 0
            result.extend([' ', token.lstrip('▁#')])
    
    return ''.join(result).strip()

def iterate_batches(data, batch_size=40):
    batch = []
    for x in data:
        batch.append(x)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if len(batch) > 0:
        yield batch

from tqdm.auto import tqdm
def predict_toxicity(sentences, batch_size=5, threshold=0.5, return_scores=False, verbose=True, device='cuda'):
    results = []
    tqdm_fn = tqdm if verbose else lambda x, total: x
    for batch in tqdm_fn(iterate_batches(sentences, batch_size), total=np.ceil(len(sentences) / batch_size)):
        normalized = [normalize(sent, max_tokens_per_word=5) for sent in batch]
        tokenized = tokenizer(normalized, return_tensors='pt', padding=True, max_length=512, truncation=True)
        
        logits = model.to(device)(**{key: val.to(device) for key, val in tokenized.items()}).logits
        preds = logits_to_toxic_probas(logits)
        if not return_scores:
            preds = preds >= threshold
        results.extend(preds)
    return results
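
The word-rebuilding loop inside `normalize` can be tried without loading the tokenizer. The sketch below repeats the same logic on a hand-written token list; `merge_pieces` and `toy_tokens` are illustrative names, and the tokens only imitate what a SentencePiece tokenizer might produce.

```python
def merge_pieces(tokens, max_tokens_per_word=20):
    """Rebuild words from SentencePiece-style tokens, capping very long words."""
    result = []
    num_continuation_tokens = 0
    for token in tokens:
        if token.startswith("▁"):
            # A leading "▁" marks the start of a new word.
            num_continuation_tokens = 0
            result.extend([" ", token.lstrip("▁")])
        else:
            # Continuation piece: keep it only while the word is under the cap.
            num_continuation_tokens += 1
            if num_continuation_tokens < max_tokens_per_word:
                result.append(token)
    return "".join(result).strip()


toy_tokens = ["▁he", "ll", "o", "▁wo", "r", "l", "d"]
print(merge_pieces(toy_tokens))                         # hello world
print(merge_pieces(toy_tokens, max_tokens_per_word=2))  # hell wor
```

With `max_tokens_per_word=5`, as used in `predict_toxicity`, a word keeps its first piece plus at most four continuation pieces.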


Read the test set

In [6]:
texts = []
with open('public_testset.txt', 'rt') as f:
    for line in f:
        texts.append(normalize(line)) 

Token indices sequence length is longer than the specified maximum sequence length for this model (533 > 512). Running this sequence through the model will result in indexing errors


Compute the toxicity of individual words

In [7]:
import torch

words = set()
for text in texts:
    words.update(text.split())
words = sorted(words)

with torch.inference_mode():
    word_toxicities = predict_toxicity(words, batch_size=100, return_scores=True)
    
toxicity = dict(zip(words, word_toxicities))
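
With `return_scores=True` the values stored in `toxicity` are probabilities rather than thresholded booleans. A small pure-Python sketch of the conversion done in `logits_to_toxic_probas`, assuming one row of logits at a time; `toxic_probability` is an illustrative name and the logit values are made up:

```python
import math


def toxic_probability(logits, toxic_class=-1):
    """Map one row of logits to the probability of the toxic class."""
    if len(logits) > 1:
        # Multi-class head: softmax over the row (shifted for numerical stability).
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        return exps[toxic_class] / sum(exps)
    # Single-logit head: plain sigmoid.
    return 1.0 / (1.0 + math.exp(-logits[0]))


# Made-up logits: two rows from a two-class head, one from a single-logit head.
scores = [toxic_probability(row) for row in ([2.0, 0.0], [0.0, 0.0], [0.0])]
flags = [score >= 0.5 for score in scores]  # what return_scores=False would give
print(scores, flags)
```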


  0%|          | 0/221.0 [00:00<?, ?it/s]

Below we read the word embeddings and define the functions for processing them

In [14]:
word_toxicity_df = pd.DataFrame.from_dict({'word': words, 'toxicity': word_toxicities})

In [17]:
word_toxicity_df.sort_values(by='toxicity', ascending=False).head(20)

        word  toxicity
1131   блядь  0.943845
1627    вали  0.933498
1628  валика  0.932709
1631  валить  0.932639
1129     бля  0.927966
1629   валит  0.927525
11    faggot  0.920538
4808   ебать  0.919889
4784      дь  0.917809
1133   бляха  0.916420


In [18]:
import gensim
from pymystem3 import Mystem

stemmer = Mystem()

Installing mystem to /root/.local/bin/mystem from http://download.cdn.yandex.net/mystem/mystem-3.1-linux-64bit.tar.gz


In [20]:
!wget http://vectors.nlpl.eu/repository/20/213.zip

--2021-10-17 06:22:26--  http://vectors.nlpl.eu/repository/20/213.zip
Resolving vectors.nlpl.eu (vectors.nlpl.eu)... 129.240.189.181
Connecting to vectors.nlpl.eu (vectors.nlpl.eu)|129.240.189.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1485270300 (1.4G) [application/zip]
Saving to: ‘213.zip’


2021-10-17 06:23:30 (22.5 MB/s) - ‘213.zip’ saved [1485270300/1485270300]



In [21]:
!unzip 213.zip

Archive:  213.zip
  inflating: meta.json               
  inflating: model.model             
  inflating: model.model.vectors_ngrams.npy  
  inflating: model.model.vectors.npy  
  inflating: model.model.vectors_vocab.npy  
  inflating: README                  


In [19]:
embs_file = np.load('embeddings_with_lemmas.npz', allow_pickle=True)
embs_vectors = embs_file['vectors']
embs_vectors_normed = embs_vectors / np.linalg.norm(embs_vectors, axis=1, keepdims=True)
embs_voc = embs_file['voc'].item()

embs_voc_by_id = [None for i in range(len(embs_vectors))]
for word, idx in embs_voc.items():
    if embs_voc_by_id[idx] is None:
        embs_voc_by_id[idx] = word

FileNotFoundError: ignored
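
The cell above could not run here because the `.npz` file is missing. Its two steps are row-wise L2 normalization and building an index-to-word list. They can be illustrated on toy data. This is a sketch with plain lists in place of NumPy arrays, and all `toy_*` names and values are assumptions:

```python
import math

toy_vectors = [[3.0, 4.0], [0.0, 2.0]]
toy_voc = {"cats": 0, "cat": 0, "dog": 1}  # several keys may share one row


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


normed = [l2_normalize(v) for v in toy_vectors]

# For each row keep the first word that points at it (dicts preserve insertion order).
voc_by_id = [None] * len(toy_vectors)
for word, idx in toy_voc.items():
    if voc_by_id[idx] is None:
        voc_by_id[idx] = word

print(normed)
print(voc_by_id)  # ['cats', 'dog']
```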

In [None]:
def get_w2v_indicies(a):
    res = []
    if isinstance(a, str):
        a = a.split()
    for w in a:
        if w in embs_voc:
            res.append(embs_voc[w])
        else:
            lemma = stemmer.lemmatize(w)[0]
            res.append(embs_voc.get(lemma, None))
    return res

def calc_embs(words):
    words = ' '.join(map(normalize, words))
    inds = get_w2v_indicies(words)
    return [None if i is None else embs_vectors[i] for i in inds]
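
The lookup above first tries the surface form and then falls back to the lemma. A self-contained sketch of that fallback, with a hypothetical dictionary `toy_lemmas` standing in for the Mystem lemmatizer:

```python
toy_voc = {"run": 0, "house": 1}
toy_lemmas = {"running": "run", "houses": "house"}  # hypothetical stand-in for Mystem


def lookup_indices(words, voc, lemmas):
    """Return a vocabulary index for each word, or None when neither form is known."""
    if isinstance(words, str):
        words = words.split()
    indices = []
    for w in words:
        if w in voc:
            indices.append(voc[w])
        else:
            # Fall back to the lemma; an unknown lemma yields None.
            indices.append(voc.get(lemmas.get(w, w)))
    return indices


print(lookup_indices("running houses zzz", toy_voc, toy_lemmas))  # [0, 1, None]
```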

Put the embeddings of non-toxic words into a k-d tree so that nearest neighbors can be found quickly

In [None]:
nontoxic_emb_inds = [ind for word, ind in embs_voc.items() if toxicity.get(word, 1.0) <= 0.5]
embs_vectors_normed_nontoxic = embs_vectors_normed[nontoxic_emb_inds]

In [None]:
from sklearn.neighbors import KDTree
embs_tree = KDTree(embs_vectors_normed_nontoxic, leaf_size=20)
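
A k-d tree searches by Euclidean distance. For unit-length vectors the squared distance equals 2 - 2·cos, so the Euclidean ranking matches the cosine-similarity ranking, provided the query vector is normalized in the same way as the indexed ones. A brute-force sketch of that equivalence on made-up 2-D vectors (no tree involved, all names are illustrative):

```python
import math


def unit(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    # Inputs are unit vectors, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


query = unit([1.0, 0.2])
candidates = {"a": unit([1.0, 0.0]), "b": unit([0.0, 1.0]), "c": unit([1.0, 1.0])}

by_distance = sorted(candidates, key=lambda k: euclid(query, candidates[k]))
by_cosine = sorted(candidates, key=lambda k: -cosine(query, candidates[k]))
print(by_distance, by_cosine)
```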

This function finds the closest non-toxic word using the precomputed word embeddings

In [None]:
from functools import lru_cache

@lru_cache()
def find_closest_nontoxic(word, threshold=0.5, allow_self=False):
    if toxicity.get(word, 1.0) <= threshold:
        return word
    
    if word not in toxicity and word not in embs_voc:
        return None
    
    threshold = min(toxicity.get(word, threshold), threshold)
    word = normalize(word)
    word_emb = calc_embs([word])
    if word_emb is None or word_emb[0] is None:
        return None
    
    for i in embs_tree.query(word_emb)[1][0]:
        other_word = embs_voc_by_id[nontoxic_emb_inds[i]]
        if (other_word != word or allow_self) and toxicity.get(other_word, 1.0) <= threshold:
            return other_word
    return None
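
Note that `KDTree.query` returns a single neighbor by default (`k=1`), so the loop above inspects only one candidate. The sketch below shows a variant of the same idea that examines the `k` nearest words and returns the first one below the threshold. It uses brute-force search over a toy dictionary instead of the tree, and unlike the notebook it searches all words and filters afterwards; every name and number in it is an assumption made for illustration.

```python
def closest_nontoxic(word, vectors, toxicity, threshold=0.5, k=3):
    """Return the word itself if it is non-toxic, else its nearest non-toxic neighbor."""
    if toxicity.get(word, 1.0) <= threshold:
        return word
    if word not in vectors:
        return None
    query = vectors[word]

    def sq_dist(other):
        return sum((a - b) ** 2 for a, b in zip(query, vectors[other]))

    # Brute-force k nearest neighbors, excluding the word itself.
    neighbours = sorted((w for w in vectors if w != word), key=sq_dist)[:k]
    for candidate in neighbours:
        if toxicity.get(candidate, 1.0) <= threshold:
            return candidate
    return None


toy_vectors = {"awful": [1.0, 0.0], "terrible": [0.9, 0.1],
               "bad": [0.8, 0.3], "tree": [0.0, 1.0]}
toy_toxicity = {"awful": 0.9, "terrible": 0.8, "bad": 0.2, "tree": 0.01}

print(closest_nontoxic("awful", toy_vectors, toy_toxicity))       # bad
print(closest_nontoxic("awful", toy_vectors, toy_toxicity, k=1))  # None
```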

Replace toxic words with their nearest non-toxic neighbors in embedding space

In [None]:
def detox(line):
    words = normalize(line).split()
    fixed_words = [find_closest_nontoxic(word, allow_self=True) or '' for word in words]
    return ' '.join(fixed_words)
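
When no replacement is found, `detox` substitutes an empty string, which leaves a double space after joining. A small sketch of the same word-by-word replacement that drops such words instead; `toy_replacements` is a hypothetical lookup table used in place of `find_closest_nontoxic`:

```python
toy_replacements = {"awful": "bad", "xyzzy": None}  # hypothetical lookup results


def detox_words(line, replace):
    """Replace each word via `replace` and drop words that have no replacement."""
    fixed = (replace(w) for w in line.split())
    return " ".join(w for w in fixed if w)


result = detox_words("this is awful xyzzy stuff",
                     lambda w: toy_replacements.get(w, w))
print(result)  # this is bad stuff
```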

In [None]:
fixed_texts = list(map(detox, tqdm(texts)))





Write the result to a file

In [None]:
with open('baseline_fixed.txt', 'wt') as f:
    for text in fixed_texts:
        print(text, file=f)

Score when the comments are left unchanged:

In [None]:
!python3.7 score.py public_testset.short.txt public_testset.short.txt  --embeddings embeddings_with_lemmas.npz --lm lm.binary --model ./trained_roberta/ --device cuda --score -

Loading tokenizer
Loading model
Loading texts
Loading LM
Loading embeddings
Scoring
 10%|████                                    | 50/500.0 [00:01<00:15, 29.21it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (534 > 512). Running this sequence through the model will result in indexing errors
100%|███████████████████████████████████████| 500/500.0 [00:20<00:00, 24.28it/s]
2500it [00:26, 95.03it/s] 
average toxicity: 0.6330938
mean lmdiff: 1.0
mean distance_score: 1.0
36.69


Baseline score:

In [None]:
!python3.7 score.py public_testset.short.txt baseline_fixed.txt  --embeddings embeddings_with_lemmas.npz --lm lm.binary --model ./trained_roberta/ --device cuda --score -

Loading tokenizer
Loading model
Loading texts
Loading LM
Loading embeddings
Scoring
 20%|███████▊                               | 100/500.0 [00:03<00:14, 27.69it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (593 > 512). Running this sequence through the model will result in indexing errors
100%|███████████████████████████████████████| 500/500.0 [00:19<00:00, 25.01it/s]
2500it [00:40, 62.24it/s]
average toxicity: 0.46444112
mean lmdiff: 0.9444674231112382
mean distance_score: 0.8119417961430562
42.11
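
`score.py` is not shown in this notebook, so how the final number is aggregated is not known here. The logged values still allow a quick arithmetic check: for the unchanged texts the score matches 100 · (1 − average toxicity), while for the baseline the product of the three logged means comes out at roughly 41.1 rather than 42.11. One possible reading, offered only as a guess, is that the factors are multiplied per text and then averaged.

```python
# Values copied from the two logs above.
score_unchanged = 100 * (1 - 0.6330938) * 1.0 * 1.0
print(round(score_unchanged, 2))  # 36.69

# Product of the logged means for the baseline run; this does not reproduce 42.11.
product_of_means = 100 * (1 - 0.46444112) * 0.9444674231112382 * 0.8119417961430562
print(round(product_of_means, 2))
```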


Save the data for the baseline of the online task

In [None]:
!mkdir -p online_baseline

In [None]:
import pickle as pkl

with open('./online_baseline/data.pkl', 'wb') as f:
    pkl.dump(toxicity, f)
    pkl.dump(nontoxic_emb_inds, f)
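
Because two objects are dumped into the same file one after the other, they have to be read back with two `load` calls in the same order. A minimal round-trip sketch using a temporary file and toy stand-ins for `toxicity` and `nontoxic_emb_inds`:

```python
import os
import pickle
import tempfile

toy_toxicity = {"word": 0.1}  # stand-in for the toxicity dict
toy_inds = [0, 5, 7]          # stand-in for nontoxic_emb_inds

path = os.path.join(tempfile.mkdtemp(), "data.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_toxicity, f)
    pickle.dump(toy_inds, f)

# Each load call reads the next pickled object from the stream.
with open(path, "rb") as f:
    loaded_toxicity = pickle.load(f)
    loaded_inds = pickle.load(f)

print(loaded_toxicity, loaded_inds)
```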