Create a language model to generate fake lyrics using RNNs.
This project requires Python 3. Install the dependencies with:

```sh
pip3 install -r requirements.txt
```
Download the Kaggle dataset or use the provided zip file in `data/songdata`.
Use the `scripts/convert_kaggle.py` script to generate the word vocabulary and to pre-tokenize all the songs.
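The pre-tokenization step can be sketched as follows. This is a simplified stand-in, not the actual logic of `convert_kaggle.py`; the function name and the exact token pattern are assumptions:

```python
import re

def tokenize_song(lyrics):
    """Lowercase the lyrics and split them into word and punctuation
    tokens (hypothetical, simplified pre-tokenizer)."""
    return re.findall(r"[a-z']+|[,.!?;:]", lyrics.lower())
```

Keeping punctuation as separate tokens lets the word-level model learn line and phrase boundaries.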
Next, the vocabulary size needs to be reduced using the `gen_vocab_word.py` script, which applies Mikolov subsampling to generate a vocabulary file with less common words removed.
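The core of Mikolov-style subsampling can be sketched as below. This is not the project's actual implementation; the function name, the `min_count` pruning step, and the default threshold `t` are assumptions for illustration:

```python
import math
from collections import Counter

def subsample_vocab(tokens, t=1e-5, min_count=2):
    """Mikolov-style subsampling: a word with relative frequency f(w) is
    kept with probability min(1, sqrt(t / f(w))), so very frequent words
    are down-weighted. Words rarer than min_count are dropped entirely
    (hypothetical pruning step)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    keep_prob = {}
    for word, count in counts.items():
        if count < min_count:
            continue  # prune rare words from the vocabulary
        f = count / total
        keep_prob[word] = min(1.0, math.sqrt(t / f))
    return keep_prob
```

Note that the keep probability shrinks as a word becomes more frequent, which is what tames function words like "the" in the training data.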
To train the word-level model, the training data needs to be generated beforehand by running the `ngramify_word.py` script. For a fixed sequence length and a given vocabulary, the script generates a NumPy array of training samples, each encoded using the provided vocabulary.
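The sample generation amounts to sliding a fixed-length window over each encoded song. A minimal sketch (the function name and row layout are assumptions, not the script's actual output format):

```python
import numpy as np

def ngramify(token_ids, seq_len):
    """Slide a window of seq_len + 1 over the encoded song; each row
    holds seq_len input tokens followed by the next-token target."""
    samples = [token_ids[i:i + seq_len + 1]
               for i in range(len(token_ids) - seq_len)]
    return np.array(samples, dtype=np.int32)
```

For a song of n tokens this yields n - seq_len overlapping samples, which is why even a modest corpus produces a large training array.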
Now the word-level model can be trained using `train_word.py`. Make sure to create a directory named `weights` and to download the GloVe embedding vectors. The GloVe model needs to be converted into the word2vec format using `scripts/glove_conv.py`.
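The conversion itself is small: the word2vec text format is identical to GloVe's line format except for a header line stating the vocabulary size and vector dimension. A minimal sketch of what `scripts/glove_conv.py` has to do (function name assumed; gensim also ships a `glove2word2vec` helper for the same job):

```python
def glove_to_word2vec(glove_path, out_path):
    """Prepend the 'vocab_size dim' header that the word2vec text
    format requires; the vector lines themselves are unchanged."""
    with open(glove_path, encoding="utf-8") as f:
        lines = f.readlines()
    dim = len(lines[0].split()) - 1  # first field is the word itself
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(f"{len(lines)} {dim}\n")
        f.writelines(lines)
```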
To train the character-level model, training samples need to be generated beforehand as well. Run `ngramify_char.py`, analogous to the word-level model. The character-level model can then be trained using `train_char.py`.
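The character-level pipeline differs mainly in its vocabulary: every distinct character gets an index. A hypothetical sketch (the actual layout of `chars.pkl` may differ):

```python
def build_char_vocab(text):
    """Map each distinct character to an integer index,
    sorted for reproducibility (hypothetical helper)."""
    return {c: i for i, c in enumerate(sorted(set(text)))}

def encode_chars(text, vocab):
    """Encode a string as a list of character indices."""
    return [vocab[c] for c in text]
```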
## Setup

```sh
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip
python3 scripts/glove_conv.py
python3 scripts/convert_kaggle.py
mkdir weights
```
## Training the word-level model

```sh
python3 gen_vocab_word.py -o small_vocab.pkl
python3 ngramify_word.py -i small_vocab.pkl -o ngrams_small.npy
python3 train_word.py -i ngrams_small.npy -v small_vocab.pkl
```
## Training the character-level model

```sh
python3 ngramify_char.py -o ngrams_chars.npy
python3 train_char.py -i ngrams_chars.npy
```
- Vocabulary files will be saved as `*.pkl` files.
- Weights can be retrieved from the `weights` directory.
- Format: `weights_{word|char}_{k-fold number}_{epoch number}`
Sampling from the models can be done using the `sampler.py` module:

```python
from sampler import *
import util

# Load the samplers from the paths to their weights and vocabularies
ws = WordSampler.from_paths("weights/weights_word_k3_4.h5", "small_vocab.pkl", 6)
cs = CharacterSampler.from_paths("weights/weights_char_k3_10.h5", "chars.pkl", 8)

print(util.textify(ws.sample_sent("never gonna give you up, never gonna", 42)))
print(''.join(cs.sample("somebody once told me", 1337)))
```
To convert the tokens back into text, one can use the `textify` function of the `util` module.
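Detokenization of this kind usually boils down to joining tokens and repairing punctuation spacing. A simplified stand-in for what `util.textify` does (not the module's actual implementation):

```python
import re

def textify(tokens):
    """Join word tokens with spaces, then remove the space that the
    join inserts before punctuation tokens (simplified sketch)."""
    text = " ".join(tokens)
    return re.sub(r"\s+([,.!?;:'])", r"\1", text)
```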
The models are evaluated with k-fold cross-validation based on the model's perplexity. This happens automatically while the model is training (`train_word.py` or `train_char.py`).
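Perplexity is the exponential of the average negative log-likelihood the model assigns to held-out tokens; lower is better. A minimal sketch of the metric itself (not the training scripts' code; natural log assumed):

```python
import math

def perplexity(log_probs):
    """Perplexity = exp of the average negative log-likelihood over
    the evaluated tokens. A model that assigns each token probability
    1/V uniformly has perplexity exactly V."""
    return math.exp(-sum(log_probs) / len(log_probs))
```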
Previous development versions of the different models and a statistical analysis can be found in the various IPython notebooks.
- `sampler.py` - Samples from the character- and word-level models, given the paths to their weights and vocabularies
- `util.py` - Collection of functions used to encode and pre-process the data into the correct formats
- `train_word.py` - Word-level language model
- `train_char.py` - Character-level language model
- `ngramify_word.py` - Generates n-gram samples encoded using the vocabulary indices
- `ngramify_char.py` - Same as `ngramify_word.py`, but with the character vocabulary
- `gen_vocab_word.py` - Mikolov subsampling for words/tokens based on all songs