discojs-core/models: add gpt #644
Conversation
Amazing work, thanks! I just left minor comments.
taskTitle: 'Wikitext 103 Raw',
summary: {
  preview: 'In this challenge, we ask you to do next word prediction on a dataset of Wikipedia articles.',
  overview: 'Wikitext-103-raw is a dataset comprising unprocessed text excerpts from Wikipedia articles, designed for tasks related to natural language processing and language modeling.'
Add a comment on which tokenizer is used (its type, and where the pretrained tokenizer is downloaded from ...).
And also a comment on what to expect (e.g. if you train alone for 5 min or 5 h, roughly what train/test loss you should reach, and whether the resulting model will somehow sound English).
There isn't really any tokenization happening: gpt-tokenizer cannot be compiled in the webapp (probably due to the lack of #632). Currently, it takes wikitext as a stream of characters and passes it through GPT.convertCharDataset to get the correct shape (I don't know what its purpose is, but it was required to make it work).
On the duration of training, I can only say that the loss is decreasing. I hope that it'll write some correct English, but I have no test to prove it.
Hopefully, #646 will help us make it work and generally try it out.
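To make that data flow a bit more concrete, here is a minimal sketch of slicing a raw character stream into (input, target) windows for next-token prediction. The function name and shapes are illustrative only, not the actual GPT.convertCharDataset implementation:

```ts
// Hypothetical sketch: slice a raw character stream into (xs, ys) training pairs.
// Each character's char code stands in as a "token" id, mirroring the fact that
// no real tokenizer is involved yet.
function textToWindows (text: string, blockSize: number): Array<{ xs: number[], ys: number[] }> {
  const ids = Array.from(text).map((c) => c.charCodeAt(0))
  const pairs: Array<{ xs: number[], ys: number[] }> = []

  // Every window of blockSize + 1 ids yields an input (first blockSize ids)
  // and a target shifted by one position (last blockSize ids).
  for (let i = 0; i + blockSize + 1 <= ids.length; i += blockSize + 1) {
    const sample = ids.slice(i, i + blockSize + 1)
    pairs.push({ xs: sample.slice(0, blockSize), ys: sample.slice(1) })
  }
  return pairs
}

// Example with a tiny block size for readability
console.log(textToWindows('hello world, this is wikitext', 4).slice(0, 2))
```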
I think running gpt-micro on Shakespeare for 1 epoch should already produce some "shakespearesque" output
private static readonly batchSize = 4
private static readonly blockSize = 128
private static readonly vocabSize = 50258
comment somewhere on where the tokenizer came from? (from the task definition maybe?)
I wonder if the tokenizer should be part of the task definition or be abstracted away. Is there really a use case for Disco where one might change this parameter? (Or even knows what a tokenizer is?)
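For illustration only, here is one hypothetical shape such a setting could take if it were exposed in the task definition; none of these field or type names are the actual Disco schema:

```ts
// Hypothetical sketch of a tokenizer entry in a task definition.
// The names only illustrate the trade-off between exposing the tokenizer
// and abstracting it away; they are not the real Disco types.
interface TokenizerConfig {
  kind: 'gpt2-bpe' | 'char'   // assumed variants
  vocabSize: number           // e.g. 50258, matching the GPT model defaults
  pretrainedSource?: string   // URL to download a pretrained tokenizer from, if any
}

const wikitextTokenizer: TokenizerConfig = {
  kind: 'gpt2-bpe',
  vocabSize: 50258,
  pretrainedSource: 'https://example.org/gpt2-tokenizer.json' // placeholder URL
}
```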
private convertCharDataset (dataset: Dataset): Dataset {
  const batchSize = 4
  const sampleSize = GPT.blockSize + 1
  const chunkSize = sampleSize * batchSize * 2
Comment that 2 bytes will be one token id? BTW, is it much more memory overhead to use an array (or stream) of ints (to represent the token stream input) than the 2 bytes here?
I am not sure if my code is wrong, but it is supposed to pull the exact number of bytes it needs for one batch. So chunkSize is the number of bytes in the read buffer, not the number of integers (4 bytes each).
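As a rough illustration of that arithmetic (a sketch only, assuming 2 bytes per token id and the batchSize/blockSize values from the snippet above, not the actual convertCharDataset code):

```ts
// Hypothetical sketch: read chunkSize bytes and decode them into 16-bit token ids,
// then split them into batchSize samples of blockSize + 1 ids each.
const batchSize = 4
const blockSize = 128
const sampleSize = blockSize + 1
const chunkSize = sampleSize * batchSize * 2 // bytes in the read buffer, 2 bytes per token id

function decodeChunk (chunk: Uint8Array): number[][] {
  if (chunk.length !== chunkSize) {
    throw new Error(`expected ${chunkSize} bytes, got ${chunk.length}`)
  }
  // Reinterpret the byte buffer as 16-bit token ids (sampleSize * batchSize of them),
  // assuming the underlying buffer is 2-byte aligned.
  const ids = new Uint16Array(chunk.buffer, chunk.byteOffset, chunk.length / 2)

  const samples: number[][] = []
  for (let i = 0; i < batchSize; i++) {
    samples.push(Array.from(ids.subarray(i * sampleSize, (i + 1) * sampleSize)))
  }
  return samples
}
```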
Great work thank you Valérian! I left some questions and opinions but you can proceed with the merge
Add a GPT model, tracked in #641: models.GPT, wrapping @peacefulotter's code.

This is a prototype that is not being tested (as we need tokenization to get meaning out of it); we only check that training the model reduces the loss.