A note on preprocessing
In my experiments, I removed duplicates as part of the preprocessing step for each dataset below. I didn't think much of it at the time, but if I were to repeat these experiments, I would not dedupe. The main argument against deduping is that it gives the model a censored version of the density function you're trying to get it to learn. Deduping also means that the shape of the distribution changes with the amount of data you collect: as your training set grows, the modes of the data make up an ever-smaller proportion of it, and in the limit you would approach the uniform distribution. This concern is not entirely theoretical - for example, there are many GitHub repositories that seem to have been named completely randomly (example from the training set:
On the other hand, having regions of extremely high probability can hurt your model's mixing rate and make sampling more difficult, so it's not an obvious decision.
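To make the density argument concrete, here's a toy sketch (the names and counts below are invented for illustration): deduping erases exactly the relative frequencies a density model is supposed to learn.

```shell
# Toy corpus in which 'springfield' is a mode of the distribution
printf 'springfield\nspringfield\nspringfield\nriverton\n' > toy.txt

# The raw corpus preserves frequencies: a model trained on it can
# learn that 'springfield' is 3x as likely as 'riverton'
sort toy.txt | uniq -c

# The deduped corpus lists every name once, so its empirical
# distribution is uniform and the mode information is gone
sort -u toy.txt
```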
To download the geonames corpus of US geographical names:
wget http://download.geonames.org/export/dump/US.zip
unzip US.zip
cut -f 2 US.txt > usnames.txt
That should give you around 2.2m names.
Initially, I filtered out punctuation and numerals to make the problem as easy as possible, but empirically, allowing a few more characters doesn't slow down training that much, and doesn't seem to hurt sample quality.
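The exact character whitelist I used isn't specified above, but a filter along those lines can be sketched with tr (here keeping only letters, spaces, and newlines - adjust the set to taste; the sample lines are invented stand-ins for usnames.txt):

```shell
# Invented sample input standing in for usnames.txt
printf 'St. Marys (No. 2)\nPine Hill\n' > sample.txt

# -c complements the set, -d deletes: i.e. delete every character
# that is NOT a letter, space, or newline
tr -cd 'a-zA-Z \n' < sample.txt > filtered.txt

cat filtered.txt
```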
For more information on the geonames data, and to download names from other countries, check out the geonames website.
I used actors.list.gz from IMDB's public datasets. Note that this list only contains male names - if you want female names as well, you'll want to grab actresses.list.gz too.
There's nothing special about actor names in particular that I wanted to capture - this was just the easiest way to get a big list of full names.
Check out this directory for deduped lists of first/last names.
At around 60k tokens, this dataset is relatively small - you'll probably want to do many epochs of training.
I used Google BigQuery to grab all distinct repository names (n=3.7m) from GitHub's 2014 archive. This involved puzzling over a lot of help articles and giving Google my credit card information, so to make things easier for future interested parties, I've dumped the dataset into a GitHub repo.
There are around 80k games total, which is more than the personal names dataset, but these names are much longer and higher-entropy. I didn't have much luck learning a model of this data.
It's not hard to imagine other domains we could apply this to. For example, the names/titles of...
- prescription drugs
I was able to find large public datasets for some of these domains (e.g. the Project Gutenberg catalog for books), but a common problem was that they often contain names from many different languages mixed together, which makes the problem harder by making the data distribution more complex and multi-modal. It also makes it harder to qualitatively assess outputs.