discojs-core/models: add gpt #644
Conversation
Amazing work, thanks! I just left minor comments.
taskTitle: 'Wikitext 103 Raw',
summary: {
  preview: 'In this challenge, we ask you to do next word prediction on a dataset of Wikipedia articles.',
  overview: 'Wikitext-103-raw is a dataset comprising unprocessed text excerpts from Wikipedia articles, designed for tasks related to natural language processing and language modeling.'
Add a comment on which tokenizer is used (its type, and where the pretrained tokenizer is downloaded from ...).
And also a comment on what to expect (e.g. if you train alone for 5 min or 5 h, roughly what train/test loss you should reach, and whether the resulting model will somehow sound English).
There isn't really any tokenization happening: gpt-tokenizer cannot be compiled in the webapp (probably due to the lack of #632). Currently, it takes wikitext as a stream of characters and passes it through GPT.convertCharDataset to get the correct shape (I don't know what its purpose is, but it was required to make it work).
On the duration of training, I can only say that the loss is decreasing. I hope that it'll write some correct English, but I have no test to prove it.
Hopefully, #646 will help us make it work and generally try it out.
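To make that data flow a bit more concrete, here is a minimal sketch of slicing a raw character stream into (input, target) windows for next-token prediction. The function name and shapes are illustrative only, not the actual GPT.convertCharDataset implementation:

```ts
// Hypothetical sketch: slice a raw character stream into (xs, ys) training pairs.
// Each character's char code stands in as a "token" id, mirroring the fact that
// no real tokenizer is involved yet.
function textToWindows (text: string, blockSize: number): Array<{ xs: number[], ys: number[] }> {
  const ids = Array.from(text).map((c) => c.charCodeAt(0))
  const pairs: Array<{ xs: number[], ys: number[] }> = []

  // Every window of blockSize + 1 ids yields an input (first blockSize ids)
  // and a target shifted by one position (last blockSize ids).
  for (let i = 0; i + blockSize + 1 <= ids.length; i += blockSize + 1) {
    const sample = ids.slice(i, i + blockSize + 1)
    pairs.push({ xs: sample.slice(0, blockSize), ys: sample.slice(1) })
  }
  return pairs
}

// Example with a tiny block size for readability
console.log(textToWindows('hello world, this is wikitext', 4).slice(0, 2))
```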
I think running gpt-micro on Shakespeare for 1 epoch should already produce some "shakespearesque" output
private static readonly batchSize = 4
private static readonly blockSize = 128
private static readonly vocabSize = 50258
comment somewhere on where the tokenizer came from? (from the task definition maybe?)
I wonder if the tokenizer should be part of the task definition or be abstracted away. Is there really a use case for Disco where one might change this parameter? (Or even knows what a tokenizer is?)
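For illustration only, here is one hypothetical shape such a setting could take if it were exposed in the task definition; none of these field or type names are the actual Disco schema:

```ts
// Hypothetical sketch of a tokenizer entry in a task definition.
// The names only illustrate the trade-off between exposing the tokenizer
// and abstracting it away; they are not the real Disco types.
interface TokenizerConfig {
  kind: 'gpt2-bpe' | 'char'   // assumed variants
  vocabSize: number           // e.g. 50258, matching the GPT model defaults
  pretrainedSource?: string   // URL to download a pretrained tokenizer from, if any
}

const wikitextTokenizer: TokenizerConfig = {
  kind: 'gpt2-bpe',
  vocabSize: 50258,
  pretrainedSource: 'https://example.org/gpt2-tokenizer.json' // placeholder URL
}
```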
private convertCharDataset (dataset: Dataset): Dataset {
  const batchSize = 4
  const sampleSize = GPT.blockSize + 1
  const chunkSize = sampleSize * batchSize * 2
Comment that 2 bytes will be one token id? BTW, is it much more memory overhead to use an array (or stream) of ints (to represent the token stream input) than the 2 bytes here?
I am not sure if my code is wrong, but it is supposed to pull the exact number of bytes it needs for one batch. So chunkSize is the number of bytes in the read buffer, not the number of integers (4 bytes each).
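As a rough illustration of that arithmetic (a sketch only, assuming 2 bytes per token id and the batchSize/blockSize values from the snippet above, not the actual convertCharDataset code):

```ts
// Hypothetical sketch: read chunkSize bytes and decode them into 16-bit token ids,
// then split them into batchSize samples of blockSize + 1 ids each.
const batchSize = 4
const blockSize = 128
const sampleSize = blockSize + 1
const chunkSize = sampleSize * batchSize * 2 // bytes in the read buffer, 2 bytes per token id

function decodeChunk (chunk: Uint8Array): number[][] {
  if (chunk.length !== chunkSize) {
    throw new Error(`expected ${chunkSize} bytes, got ${chunk.length}`)
  }
  // Reinterpret the byte buffer as 16-bit token ids (sampleSize * batchSize of them),
  // assuming the underlying buffer is 2-byte aligned.
  const ids = new Uint16Array(chunk.buffer, chunk.byteOffset, chunk.length / 2)

  const samples: number[][] = []
  for (let i = 0; i < batchSize; i++) {
    samples.push(Array.from(ids.subarray(i * sampleSize, (i + 1) * sampleSize)))
  }
  return samples
}
```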
Great work thank you Valérian! I left some questions and opinions but you can proceed with the merge
Add a GPT model, tracked in #641: models.GPT, wrapping @peacefulotter's code.

This is a prototype that is not being tested (as we need tokenization to get meaning out of it); we only check that training the model reduces the loss.