This repository has been archived by the owner on Oct 31, 2023. It is now read-only.
BPC #17
Comments
I think we are computing bits per character, as mentioned in the paper.
Hmm, in the script https://raw.githubusercontent.com/salesforce/awd-lstm-lm/master/data/enwik8/prep_enwik8.py, but I don't see …
Sorry, I was confused. Yes, you're right, we're doing bits per byte, and we have to correct our paper to say that. Thanks for pointing this out! However, I think bits per byte is the standard for this particular dataset, as it's defined as "10^8 bytes of Wikipedia", and other papers use it that way.
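To make the distinction concrete, here is a minimal sketch (not code from this repo; the helpers and the loss value are hypothetical) of how the same total cross-entropy loss yields different numbers depending on whether it is normalized by bytes or by Unicode characters:

```python
import math

# Hypothetical helpers: a byte-level LM reports total cross-entropy in
# nats; dividing by ln(2) converts nats to bits, and the normalizer
# decides whether we get bits per byte (BPB) or bits per character (BPC).
def bits_per_byte(total_loss_nats: float, num_bytes: int) -> float:
    return total_loss_nats / (num_bytes * math.log(2))

def bits_per_char(total_loss_nats: float, num_chars: int) -> float:
    return total_loss_nats / (num_chars * math.log(2))

text = "zażółć"   # 6 characters, but 10 UTF-8 bytes (4 two-byte letters)
n_bytes = len(text.encode("utf-8"))
n_chars = len(text)
loss = 6.93       # hypothetical total loss in nats

print(bits_per_byte(loss, n_bytes))  # normalized over more units, so lower
print(bits_per_char(loss, n_chars))
```

For pure ASCII text the two coincide, since every character is one byte; the gap only appears for multi-byte scripts.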
Thanks!
Scripts in the `experiments` directory calculate bits per byte, not bits per character. Am I right? This matters when comparing character- or word-level perplexities.
For example, for English enwik8 the byte-to-character ratio is 1.0033040809995477:
BPB: 1.0 -> byte perplexity: 2.718 -> char perplexity: 2.727
For Polish the byte-to-character ratio is 1.0505100080652954:
BPB: 1.0 -> byte perplexity: 2.718 -> char perplexity: 2.859
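The conversion in the numbers above can be sketched as follows. Note this assumes the perplexities are taken base e (e^1.0 = 2.718 for BPB 1.0, matching the figures quoted), which is an assumption about the convention used:

```python
import math

def bpc_from_bpb(bpb: float, byte_char_ratio: float) -> float:
    """Convert bits per byte to bits per character using the corpus's
    byte-to-character ratio (total bytes / total characters)."""
    return bpb * byte_char_ratio

def char_perplexity(bpb: float, byte_char_ratio: float) -> float:
    # Assumed convention: perplexity = e^(bits), matching 2.718 above.
    return math.exp(bpc_from_bpb(bpb, byte_char_ratio))

# English enwik8: bytes/chars ≈ 1.0033
print(round(char_perplexity(1.0, 1.0033040809995477), 3))  # ≈ 2.727
# Polish: bytes/chars ≈ 1.0505
print(round(char_perplexity(1.0, 1.0505100080652954), 3))  # ≈ 2.859
```

The per-character number is always at least the per-byte number, since a character never encodes to fewer than one byte.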