Commit

fixed typos and stuff in README
ejmichaud committed Oct 28, 2023
1 parent a4d2d8d commit 4f4f1a2
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -20,9 +20,9 @@ Scripts defining experiments (e.g. slurm job arrays for grid searches) are in `e

## Reproducing each figure

Here are rough instructions for reproducing most of the paper's figures. Note that these are not ready to be run: you will need to modify each to e.g. load data from the correct location on your system, save to the correct location on your system, etc. I have copied these over from a much messier repo, so the paths still reference the name of that old repo, which was `the-everything-machine`. It is possible that I have left out major steps in reproducing these from the descriptions below -- feel free to email me at ericjm [at] mit.edu if you have any questions, or create an issue on this repo.

**Figure 1** and **Figure 13**: Text snippets created in `notebooks/save-clusters.ipynb`. Running this notebook requires several experiments to be run first. First, one needs to download the test set of The Pile, `test.jsonl.zst`, either from https://pile.eleuther.ai/ or from [here](https://www.dropbox.com/scl/fi/njeocnzo8wzfeep8clm0z/test.jsonl.zst?rlkey=gz68ewdcyktfcekd7pz3n1xcx&dl=0). Then we must create our canonical tokenization of the dataset (which will allow us to consistently map integers to tokens in `test.jsonl.zst`), which can be done with `scripts/create_pile_canonical.py`. In addition to this, the notebook requires the `clusters_full_more.pkl` file containing the clusters from spectral clustering, as well as the `full_more.pt` file containing the Pile test set token indices that were used by QDG. These can be downloaded [here](https://www.dropbox.com/scl/fi/87eq1e6q59kuprimlzbtu/clusters_full_more.pkl?rlkey=5lfwf8grnhkp4af6v0vsbpkv4&dl=0) and [here](https://www.dropbox.com/scl/fi/mlm6jzjghcbcw7lxmqlww/full_more.pt?rlkey=s8y3sgipwimabxa87qj6g4dqh&dl=0), respectively. If you want to run QDG to create these yourself, there are several steps. The `full_more.pt` file is created by `experiments/clustering-0/compute_similarity_full_more.py`. This script requires the `zero_and_induction_idxs.pkl` file. This file contains indices of tokens in the test set of the Pile where `pythia-19m` achieves less than 0.1 nats of cross-entropy, and indices of tokens which are potentially predictable just via induction from their context (they are the third token in a trigram that occurred earlier in the context). We attempt to filter out these induction-predictable tokens since, for a small model like `pythia-19m`, a significant fraction of tokens on which the model achieves very low loss seem predictable in this way, which would make it harder to discover other quanta.
The `zero_and_induction_idxs.pkl` file can be downloaded [here](https://www.dropbox.com/scl/fi/v2et8npxbhnsym0d3c5n6/zero_and_induction_idxs.pkl?rlkey=fedbwii5dp560vtq81cws3yh8&dl=0) or created yourself with the `scripts/zero_and_induction_idxs.py` script. Note that this script requires the `pythia-2.npy` file, for which the instructions to download or create are below (for Figure 3). The `clusters_full_more.pkl` file is created by `experiments/clustering-0/compute_clusters_full_more.py`.
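The induction filter described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual logic of `scripts/zero_and_induction_idxs.py`, and the function name is hypothetical:

```python
def induction_candidate_idxs(tokens):
    """Return indices i where (tokens[i-2], tokens[i-1], tokens[i]) already
    occurred earlier in the context, i.e. the token could in principle be
    predicted by pure induction from a repeated trigram."""
    seen = set()
    idxs = []
    for i in range(2, len(tokens)):
        trigram = (tokens[i - 2], tokens[i - 1], tokens[i])
        if trigram in seen:
            idxs.append(i)
        seen.add(trigram)
    return idxs
```

Tokens flagged this way are excluded before computing QDG similarities, so that low-loss-via-induction tokens do not dominate the clustering.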
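As for the canonical tokenization step, the point is just to fix one tokenization of `test.jsonl.zst` so that integer token indices are stable across experiments. A minimal sketch of the bookkeeping this enables (the helper names here are hypothetical; the real logic lives in `scripts/create_pile_canonical.py`):

```python
import bisect

def build_canonical_index(doc_token_lists):
    """Given documents tokenized in a fixed, canonical order, return a
    function mapping a global token index to (doc_id, position_in_doc),
    along with the total token count."""
    offsets = []  # global index at which each document starts
    total = 0
    for toks in doc_token_lists:
        offsets.append(total)
        total += len(toks)

    def locate(global_idx):
        # find the document whose span contains global_idx
        d = bisect.bisect_right(offsets, global_idx) - 1
        return d, global_idx - offsets[d]

    return locate, total
```

With the same document order every run, something like `locate` can resolve any token index stored in files such as `full_more.pt` back to its source document and position.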

**Figure 2** - `figures/parameters-steps-data-emergence-and-scaling-scalingtop.png`: Created in `notebooks/combined-scaling-and-emergence-plots.ipynb`, using data from `experiments/P-scaling-15` and `experiments/D-scaling-6`.

