
Thoughts on going forward with adding additional model specifications to the table #7

Closed
LudwigStumpp opened this issue May 6, 2023 · 2 comments

Comments

@LudwigStumpp
Collaborator

LudwigStumpp commented May 6, 2023

Summary

Currently, there is a proposal to add the following columns to the table:

  • length of context window
  • number of tokens trained

This issue is to discuss whether we want to do so and what the implications are. I believe this is an important decision for moving forward, so I would like to bring it to our attention here.

Implications

If we want to add these, we could have one separate row per published model version. "Model version" here means a standalone model variant published by the authors, arising either from different model sizes (see LLaMA-7B, 13B, 33B, 65B) or from different training procedures (MPT-7B-base vs. -instruct, -chat, -storywriter). This affects the properties assigned in our table (model size, number of tokens trained, context window, ...).

In short, including more information inside the table would lead to:

  • more columns, for more properties
  • more rows, as we need to differentiate between model versions (alternatively, one could indicate the span across all versions in a single row, e.g. 1T - 10T for the number of tokens or 1024 - 4096 for the context width)

with the following consequences for the audience:

  • more complete information, greater level of detail
  • harder to get a quick overview, which might reduce the table's clarity

What are your thoughts on this?

@LudwigStumpp
Collaborator Author

LudwigStumpp commented May 6, 2023

For simplicity and greater clarity, I would advocate NOT having a separate row per model version, but rather specifying the range of values, e.g. 1T - 10T for the number of tokens or 1024 - 4096 for the context width (as is already the case for the model size). With the number of new models being published, separate rows would otherwise become impossible to maintain.
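As a sketch of what a single range-valued row could look like (the model name is a placeholder, and the values simply reuse the example ranges above):

```markdown
| Model   | Model Size | Context Length | Tokens Trained |
| ------- | ---------- | -------------- | -------------- |
| SomeLLM | 7B - 65B   | 1024 - 4096    | 1T - 10T       |
```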

Also, I would prioritize what is important to the user:

  • Model size -> as it impacts inference requirements (time + memory).
  • Context length -> as it impacts the application.

So I would propose leaving out the number of tokens trained (right now, people seem to train roughly 20 tokens per model parameter, following the Chinchilla scaling laws, https://arxiv.org/abs/2203.15556). For me personally, the number of tokens trained scales with the model size, and the resulting eval benchmarks are sufficient to determine whether the training was reasonably good.
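A rough back-of-the-envelope sketch of this point: if labs roughly follow the ~20-tokens-per-parameter heuristic from the Chinchilla paper, the token count is largely implied by the model size. The function name and the exact 20:1 ratio are illustrative assumptions, not values from this thread.

```python
# Rough Chinchilla-style estimate: ~20 training tokens per parameter.
# The 20:1 ratio is an approximation of the compute-optimal ratio from
# Hoffmann et al. (https://arxiv.org/abs/2203.15556); real models vary.

TOKENS_PER_PARAM = 20  # assumed heuristic ratio

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal number of training tokens."""
    return TOKENS_PER_PARAM * n_params

# LLaMA model sizes from the thread, in billions of parameters
for params_b in (7, 13, 33, 65):
    tokens_b = chinchilla_optimal_tokens(params_b * 1e9) / 1e9
    print(f"{params_b}B params -> ~{tokens_b:.0f}B tokens")
```

This is exactly why the column adds little: once model size is known, the token count is (roughly) predictable, and eval benchmarks capture the rest.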

I would propose to add a column for the context length.

@eugeneyan
Owner

Your proposals to (i) add context length as a column containing the range of tokens across all model versions, and (ii) leave out tokens trained for now, are well thought out. Thank you for weighing the pros and cons and coming to a conclusion. I will update the list of suggestions.

eugeneyan added a commit that referenced this issue May 6, 2023
See [issue 7](#7): Thoughts on adding more model specs to the table