Tiny Shakespeare Model Comparison

Compares an RWKV-inspired "Student" model against a GPT baseline, both trained on character-level Shakespeare using nanoGPT.

Results (1000 iterations)

Model	Parameters	Final Val Loss	Time/Iter
Student (RWKV-style)	0.83M	1.38	~155ms
GPT	10.65M	1.32	~525ms

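The Student model has about 13x fewer parameters (0.83M vs 10.65M) and runs about 3.4x faster per iteration (~155ms vs ~525ms), while its final validation loss is only 0.06 higher (1.38 vs 1.32).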
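
The ratios behind this comparison can be re-derived from the results table. The short check below uses only the figures reported there; the variable names are illustrative.

```python
# Figures copied from the results table (1000 iterations).
student = {"params_m": 0.83, "val_loss": 1.38, "ms_per_iter": 155}
gpt = {"params_m": 10.65, "val_loss": 1.32, "ms_per_iter": 525}

size_ratio = gpt["params_m"] / student["params_m"]        # how many times smaller
speed_ratio = gpt["ms_per_iter"] / student["ms_per_iter"]  # how many times faster
loss_gap = student["val_loss"] - gpt["val_loss"]           # extra validation loss

print(f"{size_ratio:.1f}x fewer parameters")       # 12.8x fewer parameters
print(f"{speed_ratio:.1f}x faster per iteration")  # 3.4x faster per iteration
print(f"{loss_gap:.2f} higher validation loss")    # 0.06 higher validation loss
```

Since the per-iteration timings are approximate, the speed ratio should be read as a rough figure.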

Student: 5 layers, 128 embed dim, linear attention with exponential decay
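
For readers unfamiliar with the idea, here is a minimal pure-Python sketch of causal linear attention with exponential decay. It is not the notebook's implementation: the function name, the per-channel `decay` parameter, and the absence of normalization, token shift, or other RWKV-specific components are all simplifying assumptions. It only illustrates why the cost per token is constant: a fixed-size state is decayed and updated at each step instead of attending over all previous tokens.

```python
def decayed_linear_attention(queries, keys, values, decay):
    """Causal linear attention with exponential decay (illustrative sketch).

    queries, keys: lists of T vectors, each of length d_k
    values:        list of T vectors, each of length d_v
    decay:         list of d_k floats in (0, 1), one decay rate per key channel
                   (hypothetical parameterization, not taken from the notebook)
    """
    d_k, d_v = len(keys[0]), len(values[0])
    # Running state: a d_k x d_v matrix summarizing all past (key, value) pairs.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Shrink the old state, then add the outer product of the current k and v.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = decay[i] * state[i][j] + k[i] * v[j]
        # Read out: the query vector times the state matrix.
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


# One channel, decay 0.5: each new value is added to half of the previous state.
out = decayed_linear_attention(
    queries=[[1.0], [1.0], [1.0]],
    keys=[[1.0], [1.0], [1.0]],
    values=[[1.0], [1.0], [1.0]],
    decay=[0.5],
)
print(out)  # [[1.0], [1.5], [1.75]]
```

The state has a fixed size regardless of sequence length, which is consistent with the lower time per iteration reported above, although the parameter count difference also contributes.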

GPT: 6 layers, 6 heads, 384 embed dim, standard transformer attention
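
As a sanity check on the 10.65M figure, the parameter count can be estimated from the layer count and embedding dimension. The sketch below assumes nanoGPT's stock character-level Shakespeare settings, which this README does not state: a 65-character vocabulary, no bias terms, a 4x MLP expansion, tied input/output embeddings, and position embeddings excluded from the reported total. The function name is hypothetical.

```python
def gpt_param_count(n_layer, n_embd, vocab_size):
    """Estimate GPT parameters, assuming no biases and tied embeddings."""
    attn = n_embd * 3 * n_embd + n_embd * n_embd      # QKV projection + output projection
    mlp = n_embd * 4 * n_embd + 4 * n_embd * n_embd   # up-projection + down-projection
    layer_norms = 2 * n_embd                          # two LayerNorm weight vectors per block
    per_block = attn + mlp + layer_norms
    final_layer_norm = n_embd
    token_embedding = vocab_size * n_embd             # shared with the output head (assumed)
    return n_layer * per_block + final_layer_norm + token_embedding


# vocab_size=65 is an assumption (the usual tiny Shakespeare character set).
total = gpt_param_count(n_layer=6, n_embd=384, vocab_size=65)
print(f"{total / 1e6:.2f}M")  # 10.65M
```

Under these assumptions the estimate matches the table. Note that the number of heads does not affect the count, since heads only partition the embedding dimension.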

Run the notebook (tiny_shakespeare_test.ipynb) in Google Colab with a GPU runtime.
