This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

BPC #17

Closed
djstrong opened this issue Jan 24, 2020 · 6 comments

Comments

@djstrong

djstrong commented Jan 24, 2020

The scripts in the experiments directory compute bits per byte, not bits per character. Am I right?

This matters when comparing character- or word-level perplexities.

For example, for English enwik8 the ratio of bytes to characters is 1.0033040809995477:
BPB: 1.0 -> byte perplexity: 2.718 -> char perplexity: 2.727

For Polish the ratio of bytes to characters is 1.0505100080652954:
BPB: 1.0 -> byte perplexity: 2.718 -> char perplexity: 2.859
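The conversion above can be sketched as follows, assuming the per-byte loss is taken in nats (so a loss of 1.0 gives byte perplexity e ≈ 2.718, matching the figures above); the function name is illustrative, not from the repository:

```python
import math

def char_perplexity(loss_per_byte_nats, bytes_per_char):
    # Byte-level perplexity from the per-byte loss (in nats here,
    # matching the e ~= 2.718 figure in the comment above).
    byte_ppl = math.exp(loss_per_byte_nats)
    # A character costs `bytes_per_char` bytes on average, so its
    # perplexity is the byte perplexity raised to that ratio.
    return byte_ppl ** bytes_per_char

# English enwik8: ~1.0033 bytes per character
print(round(char_perplexity(1.0, 1.0033040809995477), 3))  # 2.727
# Polish: ~1.0505 bytes per character
print(round(char_perplexity(1.0, 1.0505100080652954), 3))  # 2.859
```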

@tesatory
Contributor

I think we are computing bit-per-character as mentioned in the paper.

@djstrong
Author

Hmm, data/enwik8/valid.txt.raw contains characters like ą.

>>> ord('ą')
261

but I don't see 261 in data/enwik8/valid.txt.
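The discrepancy can be reproduced directly: ą is a single code point (261) but two bytes in UTF-8, so a byte-level preprocessing step never produces the value 261 (a small illustrative check):

```python
text = 'ą'
print(ord(text))            # 261: the Unicode code point
raw = text.encode('utf-8')
print(list(raw))            # [196, 133]: two bytes, neither equal to 261
# A model trained on the byte stream therefore predicts 196 and 133
# as separate symbols and never sees the code point 261.
```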

@djstrong
Author

djstrong commented Jan 24, 2020

The script https://raw.githubusercontent.com/salesforce/awd-lstm-lm/master/data/enwik8/prep_enwik8.py
reads enwik8 as bytes:

data = zipfile.ZipFile('enwik8.zip').read('enwik8')
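`ZipFile.read` returns a `bytes` object, so any per-symbol split downstream operates on bytes, not characters. A hedged sketch with a tiny in-memory zip standing in for enwik8.zip (illustrative only, not the actual archive):

```python
import io
import zipfile

# Build a toy zip in memory containing a single member named 'enwik8'.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('enwik8', 'ą'.encode('utf-8'))

# Mirrors the prep script's read: the result is bytes, not str.
data = zipfile.ZipFile(buf).read('enwik8')
print(type(data))   # <class 'bytes'>
print(len(data))    # 2 -- two byte symbols for the single character 'ą'
```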


@tesatory
Contributor

tesatory commented Jan 27, 2020

Sorry, I was confused. Yes, you're right, we're doing bit-per-byte, and we have to correct our paper to say that. Thanks for pointing this out!

However, I think bit-per-byte is the standard for this particular dataset as it's defined as "10^8 bytes of wikipedia", and other papers use it that way.

@djstrong
Author

Thanks!
