support kenlm models and surprisal from them #14

aalok-sathe · 2023-11-01T17:57:28Z

in this PR we add support for KenLM models using the KenLM python bindings. note that because installing KenLM can be complicated, we don't enforce it as a requirement for the repo; it only needs to be installed by anyone who wants to use this library for inference with KenLM n-gram models through the kenlm python interface.

aalok-sathe · 2023-11-08T16:11:04Z

pyproject.toml

@@ -17,6 +17,7 @@ plotext = "^5.0.2"
 matplotlib = "^3.5.2"
 pandas = "^1.4.3"
 openai = "^0.23.0"
+kenlm = {version = "^0.2.0", optional = true}


we don't want to force kenlm as a dependency; it should only be installed if people need it
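one way to keep the package importable when the optional dependency is missing is a guarded import. this is a minimal sketch, not the code in this PR: `optional_import` and `require_kenlm` are hypothetical helper names used only for illustration.

```python
import importlib


def optional_import(name: str):
    """Return the module called `name`, or None when it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# kenlm is optional, so a missing install must not break `import surprisal`.
kenlm = optional_import("kenlm")


def require_kenlm():
    """Fail with a readable message only when KenLM is actually requested."""
    if kenlm is None:
        raise ImportError(
            "kenlm is not installed; install it to use KenLM n-gram models"
        )
    return kenlm
```

with this pattern the error surfaces when someone constructs a KenLM-backed model, rather than at import time for every user.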

aalok-sathe · 2023-11-08T16:14:08Z

surprisal/model.py

+                accum += [m.BaseScore(st1, w, st2)]
+                st1, st2 = st2, st1
+            if eos:
+                accum += [m.BaseScore(st1, "</s>", st2)]


this part should maybe default to false, since it generates a score for EOS, which is inconsistent with the convention used for surprisal from huggingface models
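a sketch of what the loop could look like with `eos` off by default. the function name `score_tokens`, its `bos`/`eos` parameters, and the `UniformStub` class are assumptions for illustration; only `BaseScore`, `BeginSentenceWrite`, `NullContextWrite`, and `State` are names from the kenlm python bindings. the stub stands in for a real model so the example runs without kenlm installed.

```python
import math
from typing import Callable, List


def score_tokens(model, new_state: Callable[[], object], tokens: List[str],
                 bos: bool = True, eos: bool = False) -> List[float]:
    """Score each token given its left context; append </s> only if asked."""
    st1, st2 = new_state(), new_state()
    if bos:
        model.BeginSentenceWrite(st1)   # start from the <s> context
    else:
        model.NullContextWrite(st1)     # start from an empty context
    scores = []
    for w in tokens:
        scores.append(model.BaseScore(st1, w, st2))
        st1, st2 = st2, st1             # the output state becomes the next input
    if eos:
        scores.append(model.BaseScore(st1, "</s>", st2))
    return scores


class UniformStub:
    """Hypothetical stand-in for kenlm.Model: every word scores log10(0.1)."""

    def BeginSentenceWrite(self, state):
        pass

    def NullContextWrite(self, state):
        pass

    def BaseScore(self, in_state, word, out_state):
        return -1.0


scores = score_tokens(UniformStub(), object, ["hello", "world"])
# kenlm scores are log10 probabilities; negate and rescale for surprisal in bits.
surprisal_bits = [-s * math.log2(10) for s in scores]
```

with a real model the call would presumably be `score_tokens(kenlm.Model(path), kenlm.State, tokens)`, which returns one score per input token unless `eos=True` is passed.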

aalok-sathe · 2023-11-08T16:16:39Z

surprisal/utils.py

 from transformers import tokenization_utils_base


-def pick_matching_token_ixs(
+def hf_pick_matching_token_ixs(


such a method is probably not necessary for n-grams, but we need to check how punctuation gets tokenized:

In [10]: [ce] = k.tokenize('hello, my name')
In [11]: ce.tokens
Out[11]: ('hello', ',', 'my', 'name')
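for comparison, a regex-based splitter that separates punctuation from words gives the same tokens on this input. this is only a sketch of one possible tokenization scheme and is not necessarily what `k.tokenize` does internally; `split_words_and_punct` is a hypothetical name.

```python
import re
from typing import Tuple

# A run of word characters, or a single non-space, non-word character.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def split_words_and_punct(text: str) -> Tuple[str, ...]:
    """Split on whitespace and emit each punctuation mark as its own token."""
    return tuple(_TOKEN_RE.findall(text))


print(split_words_and_punct("hello, my name"))
# ('hello', ',', 'my', 'name')
```

whether the n-gram model's vocabulary treats `,` as a separate token depends on how its training corpus was tokenized, so the splitter should match that preprocessing.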

aalok-sathe · 2023-11-08T16:17:53Z

merging this since it doesn't change any existing behavior; it only adds a new implementation to support the kenlm model class. merging even though we have a few TODOs left to address.

aalok-sathe added 10 commits November 1, 2023 13:56

prefix with hf to indicate it works with HF-based tokenizer outputs

5052217

start writing kenLM implementation

b95a847

flesh out interface towards supporting CustomEncoding for custom-tokenized text, e.g. whitespace for `kenlm`

443855b

actually no point subclassing from tokenizers.Encoding

2304da9

move repr() to SurprisalArray rather than huggingfacesurprisal. complete CustomEncoding implementation.

0c64a91

add a KenLM and NGramSurprisal implementation

24a4b9c

make KenLMModel visible at the module level

e78b681

bugfixes; bump numpy version for typing

37038bb

bugfix in ids handling in CustomEncoding

b25f27e

OK, we have a MWE! still TODO: figure out </s> surprisal value: do we want that? maybe add an option to show but default to disabling it? do we also want bos?

83ba14d

aalok-sathe marked this pull request as ready for review November 8, 2023 16:07

aalok-sathe merged commit 4cbee05 into main Nov 8, 2023

aalok-sathe mentioned this pull request Nov 17, 2023

Indexing into SurprisalArray using singletons fails in NGramSurprisal. #18

Open
