# Word probabilities

Following https://arxiv.org/pdf/1906.00363.pdf, we try to compute the predictability score of a sentence from the probabilities of its component tokens. We use a language model trained on a large corpus of literary text and treat the predictability score as a proxy for literary inventiveness: if the model can predict the tokens of a sentence with high probability, the sentence is likely hackneyed or clichéd.

The formula given in the paper is:

$\left(p(w_{1}|w_{2},w_{3},...,w_{n})\,p(w_{2}|w_{1},w_{3},...,w_{n})\cdots p(w_{n}|w_{1},w_{2},...,w_{n-1})\right)^{-1/n} = \left(\prod_{i=1}^{n}p(w_{i}|w_{1},...,w_{i-1},w_{i+1},...,w_{n})\right)^{-1/n}$
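
As a quick sanity check of the formula, here is a minimal sketch in plain Python. The function name `paper_score` and the toy probabilities are illustrative assumptions, not taken from the paper. The computation is done in log space, which is algebraically the same as raising the product to the power $-1/n$.

```python
import math

def paper_score(probs):
    """Return (prod p_i) ** (-1/n), computed in log space."""
    n = len(probs)
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / n)

# Toy per-token probabilities (made up for illustration).
predictable = [0.9, 0.8, 0.95]
surprising = [0.1, 0.05, 0.2]

print(paper_score(predictable))
print(paper_score(surprising))
```

With the negative exponent, the more predictable sentence gets the lower value (closer to 1), much like a perplexity.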

### Initializing 

First we define our preprocessing function (applied before the model tokenizer to better handle Korean morphosyntactic features), then we load our model, pre-trained on North Korean data.

In [1]:
from konlpy.tag import Komoran
komoran = Komoran()
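def preproc(sentence):
    # Split the sentence into morphemes with Komoran, join them with spaces,
    # and rewrite a literal MASK placeholder as the tokenizer's [MASK] token.
    return ' '.join(komoran.morphs(sentence)).replace('MASK', '[MASK]')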

In [44]:
from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("./jobert")
model = AutoModelWithLMHead.from_pretrained("./jobert")

I0908 19:21:52.470256 37080 configuration_utils.py:262] loading configuration file ./jobert\config.json
I0908 19:21:52.524113 37080 configuration_utils.py:300] Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 2,
  "vocab_size": 20839
}

I0908 19:21:52.539073 37080 tokenization_utils_base.py:1167] Model name './jobert' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-la

In [42]:
import logging
logging.getLogger("transformers.tokenization_utils").setLevel(logging.ERROR)

A function to compute the probability of a single word at the masked position, using BERT's masked language modelling head:

In [45]:
def compute_word_proba(sequence, word):
    # `sequence` must already contain the tokenizer's mask token.
    input_ids = tokenizer.encode(sequence, return_tensors="pt")
    mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]
    with torch.no_grad():  # inference only, no gradients needed
        token_logits = model(input_ids)[0]
    # Softmax over the vocabulary turns the logits at each mask into probabilities.
    mask_token_probs = torch.softmax(token_logits[0, mask_token_index, :], dim=1)
    # Only the first sub-token of `word` is scored.
    word_id = tokenizer.encode(word, add_special_tokens=False)[0]
    # If the sequence has several masks, return the probability at the first one.
    return mask_token_probs[:, word_id].numpy()[0]
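
To make the softmax-and-lookup step concrete without loading a model, here is a toy sketch in plain Python. The five-entry vocabulary and the logits are invented for illustration; in the real function the logits come from the model output at the `[MASK]` position and the index comes from the tokenizer.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits at the masked position.
vocab = ["[PAD]", "[MASK]", "몸", "마음", "생각"]
mask_logits = [-2.0, -3.0, 1.0, 2.5, 0.5]

probs = softmax(mask_logits)
word_id = vocab.index("마음")  # stands in for the tokenizer lookup
word_proba = probs[word_id]   # probability assigned to the sought-after word

print(round(word_proba, 3))
```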

This is then extended to compute the probability of each token in a sequence:

In [49]:
def compute_word_by_word_proba(sequence):
    preprocessed_sequence = preproc(sequence)
    word_dict = {}
    for token in preprocessed_sequence.split(' '):
        # Caveat: str.replace masks every occurrence of `token`, including
        # substrings of other tokens, and repeated tokens share one dict key.
        masked_sequence = preprocessed_sequence.replace(token, tokenizer.mask_token)
        word_dict[token] = compute_word_proba(masked_sequence, token)
    return word_dict
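
Because `str.replace` works on substrings, a token such as `도` would also be masked inside `도록`, and repeated tokens collapse into one dictionary entry. A possible alternative, sketched here under the assumption that tokens are separated by single spaces, masks one position at a time. The helper name `mask_each_position` is hypothetical.

```python
def mask_each_position(tokens, mask_token="[MASK]"):
    """Yield (index, token, masked_sequence) with exactly one token masked."""
    for i, token in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        yield i, token, " ".join(masked)

tokens = "아무것 도 가늠 하 ㄹ 수 없 도록".split(" ")
results = list(mask_each_position(tokens))
for i, token, masked in results:
    print(i, token, "->", masked)
```

Each masked sequence could then be passed to `compute_word_proba`, with the results stored in a list indexed by position rather than a dict keyed by token.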

In [50]:
compute_word_by_word_proba('아무것도  가늠할수  없도록  몸과  마음이  굳어져버 린속에  한가지  생각이  끌날처럼  머리속으로  비껴  들었다')

{'아무것': 0.018641446,
 '도': 0.99329877,
 '가늠': 0.002576774,
 '하': 0.97214746,
 'ㄹ': 0.99953544,
 '수': 0.9890463,
 '없': 0.9403979,
 '도록': 0.0008625049,
 '몸': 0.08521192,
 '과': 0.37629688,
 '마음': 0.17802098,
 '이': 0.9546155,
 '굳어지': 0.0019267205,
 '어': 0.9935823,
 '버': 0.016200697,
 '린': 0.0010652286,
 '속': 0.016586062,
 '에': 0.5310471,
 '한가지': 0.002677068,
 '생각': 0.13339023,
 '끌': 2.1601047e-05,
 '날': 0.00035557462,
 '처럼': 0.0044740182,
 '머리': 0.123638384,
 '으로': 0.012388377,
 '비끼': 0.010278981,
 '들': 0.3685645,
 '었': 0.9988689,
 '다': 0.9634075}

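The sentence score is then the geometric mean of the individual token probabilities. Note that this is the reciprocal of the formula above (exponent $1/n$ rather than $-1/n$), so here a higher score means a more predictable sentence: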

In [59]:
import numpy as np

def geometric_mean(series):
    return np.array(series).prod()**(1.0/len(series))

def compute_sentence_score(sentence):
    return geometric_mean(list(compute_word_by_word_proba(sentence).values()))
    
compute_sentence_score('아무것도  가늠할수  없도록  몸과  마음이  굳어져버 린속에  한가지  생각이  끌날처럼  머리속으로  비껴  들었다')

0.04743360799931327
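
For long sentences the product of many small probabilities can underflow to zero before the root is taken. A minimal sketch of a log-space variant, which is mathematically the same geometric mean, follows. The name `geometric_mean_log` and the toy inputs are assumptions for illustration.

```python
import math

def geometric_mean_log(series):
    """Geometric mean computed as exp(mean(log p)); needs all p > 0."""
    return math.exp(sum(math.log(p) for p in series) / len(series))

# 400 tokens at probability 0.01: the direct product underflows to 0.0.
probs = [0.01] * 400
naive = math.prod(probs) ** (1.0 / len(probs))
stable = geometric_mean_log(probs)

print(naive, stable)
```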

Running a test with a sentence full of clichés and one with a creative metaphor. If the score behaves as intended, the clichéd sentence should score higher:

In [61]:
cliche = '어둠속에  빛나는  그윽한  눈동자 에  웬  불빛이  어리여  불꽃처럼  아름답게  반짝이 고있었다.'
original = '단풍잎같은 손이 물을 연신 퍼서 뿌려댄다.'

In [62]:
print('Cliche sentence score: ', compute_sentence_score(cliche))
print('Original sentence score: ', compute_sentence_score(original))

Cliche sentence score:  0.03292289714058116
Original sentence score:  0.015907948372126945
