
Lilith

Using the Lilith optimizer on nanoGPT, experimenting with learning rates and multiple schedulers.

DeepSeek step-based scheduler implementation -> link (a sketch of the idea follows below)
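The exact implementation behind that link isn't reproduced here; below is a minimal sketch of a DeepSeek-style multi-step schedule, parameterized by the X:Y:Z splits used in the tests further down. The 0.316/0.1 decay factors are an assumption based on the DeepSeek LLM paper's scheduler, not values confirmed by this repo.

```python
def multistep_lr(step, max_steps, base_lr=3e-4,
                 splits=(8, 1, 1), factors=(1.0, 0.316, 0.1)):
    """DeepSeek-style step schedule sketch: hold base_lr for the first
    split of training, then drop to base_lr * factor at each boundary.
    splits=(2, 4, 4) gives the 2:4:4 variant tried below."""
    total = sum(splits)
    boundary = 0.0
    for split, factor in zip(splits, factors):
        boundary += max_steps * split / total
        if step < boundary:
            return base_lr * factor
    return base_lr * factors[-1]

# per-iteration usage: set every param group's lr before optimizer.step()
# for group in optimizer.param_groups:
#     group["lr"] = multistep_lr(it, max_iters)
```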

Running tests

New Lilith versions

  • Test 26: setting dropout to low values (~0.01) is beneficial, but the loss descends linearly compared to the smooth AdamW curves (a config sketch follows below).

(screenshot: Screen Shot 2024-02-22 at 12 12 05 PM)
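For reference, dropout in nanoGPT is a single scalar threaded into the model config; a low setting like Test 26's would look roughly like this. Field names follow nanoGPT's train.py, but the exact model dims used in these runs aren't stated, so the sizes below are illustrative.

```python
# nanoGPT-style model config with low dropout per Test 26.
# Sizes are illustrative; this README doesn't state the model dims.
dropout = 0.01
model_args = dict(n_layer=12, n_head=12, n_embd=768,
                  block_size=1024, bias=False, vocab_size=50304,
                  dropout=dropout)
```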

  • Test 25: DeepSeek scheduler with 2:4:4 and 8:1:1 splits, at acc=1000. Also, these runs now finishing in about a quarter of the time is really useful!

(screenshot: Screen Shot 2024-02-21 at 5 44 56 PM)

  • Test 24: graphs for acc=10, acc=50, and acc=1000. Acceleration boosts training a little early on: slightly better curves, slightly lower loss; it may be worth cranking this value for a +1% boost. Here, Lilith runs with acceleration and bs=48 come really close to an AdamW run with bs=180, and Lilith's train time was almost 5x faster (70 ms per step vs. 300 ms per step), using roughly 4x less memory for batches.

(screenshot: Screen Shot 2024-02-21 at 4 22 53 PM)

  • Test 23: acceleration at 4 matches acc=2, and yet these values look like they match larger batch sizes; this optimizer is fire.

(screenshot: Screen Shot 2024-02-21 at 2 31 37 PM)

  • Test 22: matched beta1_m to AdamW's beta1 and beta_v to near AdamW's beta2, and also tried acceleration set to 2. In the graph we have an overfitting AdamW vs. Lilith with acceleration=2, following the same path with less overfitting (a hypothetical parameter sketch follows below).

(screenshot: Screen Shot 2024-02-21 at 2 07 16 PM)
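Lilith's constructor isn't shown anywhere in this README, so the following is purely a hypothetical sketch: the parameter names (beta1_m, beta_v, acceleration, ema_k) are lifted from these test notes, the import path is made up, and the values mirror AdamW's defaults per Test 22.

```python
import torch
from lilith import Lilith  # hypothetical import path, not confirmed by this repo

model = torch.nn.Linear(64, 64)  # stand-in for the nanoGPT model

# All parameter names below are assumptions taken from these notes:
optimizer = Lilith(model.parameters(), lr=3e-4,
                   beta1_m=0.9,     # matched to AdamW beta1 (Test 22)
                   beta_v=0.999,    # near AdamW beta2 (Test 22)
                   acceleration=2,  # the Test 22/23 setting
                   ema_k=1)         # stability fix from Test 11
```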

  • Test 21: bs=600. My setup can't handle batch sizes of 600+ without OOMs (a gradient-accumulation workaround is sketched below). Lilith is about 10% faster; interestingly, faster and equal or better?!

(screenshot: Screen Shot 2024-02-21 at 10 46 54 AM)
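When bs=600+ OOMs, the usual workaround is gradient accumulation: split the effective batch into micro-batches and average the gradients (this is what nanoGPT's gradient_accumulation_steps config does). A minimal self-contained sketch with stand-in model and data:

```python
import torch

model = torch.nn.Linear(64, 64)                  # stand-in for the GPT
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def get_batch():                                 # stand-in data loader
    x = torch.randn(50, 64)
    return x, x

# effective batch of 600 = 12 micro-batches of 50, so nothing OOMs
accum_steps = 12
optimizer.zero_grad(set_to_none=True)
for micro_step in range(accum_steps):
    x, y = get_batch()
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / accum_steps).backward()              # average grads over micro-batches
optimizer.step()
```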

  • Test 20: AdamW can match Lilith at bs=180, so testing bs=360 (yellow and orange).

(screenshots: Screen Shot 2024-02-20 at 11 08 44 PM, Screen Shot 2024-02-20 at 11 43 57 PM)

  • Test 19: scaling batch size to 360 appears to have a similar effect so far, but better; this explains euclaise's tests, which used bs=1024.

(screenshot: Screen Shot 2024-02-20 at 1 33 30 PM)

  • Test 18: scaling batch size to 180 for a try, lr 3e-4, cosine schedule. SOTA result by a margin, beats AdamW?! It shows the same behaviour as AdamW on large batches, but better. Could this be the large-scale training optimizer?

(screenshot: Screen Shot 2024-02-20 at 12 11 06 PM)

  • Test 17: using the DeepSeek step schedule again; first graph 2:4:4, second graph 8:1:1. 8:1:1 is a really successful scheduler and achieved the same val loss as cosine AdamW.

(screenshots: Screen Shot 2024-02-20 at 11 18 09 AM, Screen Shot 2024-02-20 at 11 28 17 AM)

  • Test 16: brand new Lilith version; the graphs were lost to corruption, but the new good lr from this test is 3e-4.

  • Test 15: trying the DeepSeek-based LR steps once again, 2:4:4 (first graph, lr 1e-4 due to numerical instability) and 8:1:1 (second graph, lr 8e-5). The first step change in 2:4:4 worked, but it flatlined afterwards, so some progress on that end, while the run at the DeepSeek 8:1:1 values was much, much better, almost cosine.

(screenshot: Screen Shot 2024-02-19 at 7 45 18 PM)

(screenshot: Screen Shot 2024-02-19 at 7 54 41 PM)

  • Test 14: set beta1 and beta2 to 0.95 and 0.98, which was slightly worse; a trial of 0.98 and 0.999999 was even worse. Still, good tuning might give a +1% boost (see the EMA sketch below for what these betas control).

(screenshot: Screen Shot 2024-02-19 at 6 55 46 PM)

(screenshot: Screen Shot 2024-02-19 at 7 16 49 PM)
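For context on what those betas control (Lilith's actual update rule isn't shown in this README), here are the standard Adam-style moment EMAs, with Test 14's values as defaults:

```python
def ema_update(m, v, grad, beta1=0.95, beta2=0.98):
    """Standard Adam-style exponential moving averages, for reference only;
    Lilith's internals may differ. Higher betas mean smoother, slower-moving
    estimates, which is why beta2=0.999999 barely adapts (Test 14)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: smoothed gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: smoothed grad^2
    return m, v
```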

  • Test 13: lr 8e-5. It was initially 5e-5, but that was too low and barely moved the loss; 8e-5 appears to be an even better initial sweet spot than 1e-4, though it starts converging.

(screenshot: download (17))

  • Test 12: the same as Test 9, but testing batch_size and lowering iters for efficiency. Slightly above the SOTA run, but that's expected from larger batches: it trains on 1.2x more tokens than before in a third of the time. Lilith is scalable, just like AdamW.

(screenshot: download (16))

  • Test 11: changed ema_k from 0 to 1 for better numerical stability, using a cosine LR schedule, lr = 1e-3.

  • Note: there is numerical stability now (no NaNs), but the loss is very volatile, literally unlearning.

  • Test 10: using a triangular LR schedule (sketched below). It simply doesn't want to work, just like the previous TLR spike; going to stick with multistep or cosine.

(screenshot: download (15))
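For reference, a triangular schedule can be had from PyTorch's built-in CyclicLR; whether this matches the TLR used in these runs is an assumption.

```python
import torch

model = torch.nn.Linear(64, 64)  # stand-in model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# LR ramps linearly base_lr -> max_lr -> base_lr each cycle;
# cycle_momentum=False because AdamW has no classic momentum term.
sched = torch.optim.lr_scheduler.CyclicLR(
    opt, base_lr=1e-5, max_lr=1e-4,
    step_size_up=500, mode="triangular", cycle_momentum=False)
# call sched.step() once per training iteration
```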

  • Test 9: the orange line is the new Lilith, lr=1e-4 with a cosine scheduler (sketched below). It matches AdamW for a while before flattening earlier, but the val losses match at ~1.47, so maybe it's just not as prone to overfitting?

(screenshot: download (13))
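The cosine schedule here is presumably nanoGPT's warmup-plus-cosine-decay get_lr; a sketch of that function, with illustrative iteration counts (the actual warmup and decay horizons aren't stated in these notes):

```python
import math

learning_rate, min_lr = 1e-4, 1e-5        # Test 9 peak lr; min_lr assumed
warmup_iters, lr_decay_iters = 100, 5000  # illustrative horizons

def get_lr(it):
    if it < warmup_iters:                 # linear warmup
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:               # hold the floor after decay ends
        return min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```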

Old Lilith versions

  • Test 1: Lilith default params with cosine LR; AdamW params from Karpathy, also cosine LR.

(screenshot: download (1))

  • Test 2: Lilith with some slight LR changes (lr 1e-2) using TLR; AdamW params from Karpathy, cosine LR.

(screenshot: download (2))

  • Test 3: Lilith lr 3e-4 with cosine LR; AdamW the same.

(screenshot: download (6))

  • Test 4: current Lilith in blue, lr 1e-4, cosine LR.

(screenshot: download (7))

  • Test 5: current Lilith in green, lr 5e-5, cosine LR. Too low; the model can't seem to get as low as AdamW.
  • Further tests: try to reintroduce TLR, then try a DeepSeek-style stepwise LR.

(screenshot: download (9))

  • Test 6: TLR reintroduction (pink) vs. SOTA Lilith (blue) and AdamW (red), lr 1e-4. It didn't go well; TLR is too unstable. Will try the DeepSeek step-based LR later.

(screenshot: download (10))

  • Test 7: using the DeepSeek-based LR (yellow), lr 1e-4, 20%/40%/40% partitions. It didn't do anything, but that may just be my unfamiliarity with the step-based version.

(screenshot: download (11))

  • Test 8: using the same step partitions as the DeepSeek paper (teal line), lr 1e-4, 80%/10%/10% partitions. I need to fix it: the LR freaks out and goes to zero. But this optimizer doesn't seem to like the scheduler whatsoever either; literally no change/drop in any case.

(screenshot: download (12))
