This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

BPC #17

Closed
djstrong opened this issue Jan 24, 2020 · 6 comments

Comments

@djstrong

djstrong commented Jan 24, 2020

The scripts in the experiments directory compute bits per byte, not bits per character. Am I right?

This matters when comparing character- or word-level perplexities.

For example, for English enwik8 the ratio of bytes to characters is 1.0033040809995477:
BPB: 1.0 -> byte perplexity: 2.718 -> char perplexity: 2.727

For Polish the ratio of bytes to characters is 1.0505100080652954:
BPB: 1.0 -> byte perplexity: 2.718 -> char perplexity: 2.859
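The conversion above can be sketched as follows, assuming the per-byte loss is taken in nats (so a loss of 1.0 gives byte perplexity e ≈ 2.718, matching the figures above); the function name is illustrative, not from the repository:

```python
import math

def char_perplexity(loss_per_byte_nats, bytes_per_char):
    # Byte-level perplexity from the per-byte loss (in nats here,
    # matching the e ~= 2.718 figure in the comment above).
    byte_ppl = math.exp(loss_per_byte_nats)
    # A character costs `bytes_per_char` bytes on average, so its
    # perplexity is the byte perplexity raised to that ratio.
    return byte_ppl ** bytes_per_char

# English enwik8: ~1.0033 bytes per character
print(round(char_perplexity(1.0, 1.0033040809995477), 3))  # 2.727
# Polish: ~1.0505 bytes per character
print(round(char_perplexity(1.0, 1.0505100080652954), 3))  # 2.859
```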

@tesatory
Contributor

I think we are computing bit-per-character as mentioned in the paper.

@djstrong
Author

Hmm, data/enwik8/valid.txt.raw contains characters like ą.

>>> ord('ą')
261

but I don't see 261 in data/enwik8/valid.txt.
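The discrepancy can be reproduced directly: ą is a single code point (261) but two bytes in UTF-8, so a byte-level preprocessing step never produces the value 261 (a small illustrative check):

```python
text = 'ą'
print(ord(text))            # 261: the Unicode code point
raw = text.encode('utf-8')
print(list(raw))            # [196, 133]: two bytes, neither equal to 261
# A model trained on the byte stream therefore predicts 196 and 133
# as separate symbols and never sees the code point 261.
```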

@djstrong
Author

djstrong commented Jan 24, 2020

The script https://raw.githubusercontent.com/salesforce/awd-lstm-lm/master/data/enwik8/prep_enwik8.py
reads enwik8 as bytes:

data = zipfile.ZipFile('enwik8.zip').read('enwik8')
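`ZipFile.read` returns a `bytes` object, so any per-symbol split downstream operates on bytes, not characters. A hedged sketch with a tiny in-memory zip standing in for enwik8.zip (illustrative only, not the actual archive):

```python
import io
import zipfile

# Build a toy zip in memory containing a single member named 'enwik8'.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('enwik8', 'ą'.encode('utf-8'))

# Mirrors the prep script's read: the result is bytes, not str.
data = zipfile.ZipFile(buf).read('enwik8')
print(type(data))   # <class 'bytes'>
print(len(data))    # 2 -- two byte symbols for the single character 'ą'
```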


@tesatory
Contributor

tesatory commented Jan 27, 2020

Sorry, I was confused. Yes, you're right, we're doing bit-per-byte, and we have to correct our paper to say that. Thanks for pointing this out!

However, I think bit-per-byte is the standard for this particular dataset as it's defined as "10^8 bytes of wikipedia", and other papers use it that way.

@djstrong
Author

Thanks!
