**This is an LSTM language model**

The input format is one sentence per line, with words separated by spaces.

- Tokenize the dataset.
- Use `--freqCut N` to determine the vocabulary size of your model. Words that appear `N` times or fewer in the training set will be replaced with `UNK`. During validation and testing, unknown words will also be mapped to `UNK` (see the sketch after this list).
- You can also use `--ingoreCase` to lowercase the dataset.
- Use `--seqLen N` to tell the model that the longest sentence length will be less than `N`.
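The sketch below is a rough illustration of the `--freqCut` / `UNK` behaviour described above, written in plain Lua. It is an assumption about how such a cutoff typically works, not the repository's actual implementation, and the names `buildVocab` and `mapLine` are hypothetical.

```lua
-- Hypothetical sketch (not the repository's code): build a vocabulary from a
-- tokenized training file, keeping words that occur more than freqCut times.
local function buildVocab(trainFile, freqCut)
  local counts = {}
  for line in io.lines(trainFile) do
    for word in line:gmatch('%S+') do
      counts[word] = (counts[word] or 0) + 1
    end
  end
  local vocab = { UNK = true }
  for word, cnt in pairs(counts) do
    if cnt > freqCut then vocab[word] = true end
  end
  return vocab
end

-- Map a tokenized line onto the vocabulary: out-of-vocabulary words become
-- unkToken, which defaults to 'UNK'.
local function mapLine(line, vocab, unkToken)
  unkToken = unkToken or 'UNK'
  local out = {}
  for word in line:gmatch('%S+') do
    out[#out + 1] = vocab[word] and word or unkToken
  end
  return table.concat(out, ' ')
end
```

The training command below shows the corresponding options in context.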
```bash
CUDA_VISIBLE_DEVICES=$ID th train.lua --useGPU \
  --dropout 0.2 --batchSize 20 --validBatchSize 20 --save $model --model LSTMLM \
  --freqCut 1 \
  --nlayers 1 \
  --seqLen 101 \
  --lr $lr \
  --optimMethod SGD \
  --nhid 200 \
  --nin 100 \
  --minImprovement 1.001 \
  --train $train \
  --valid $valid \
  --test $test \
  | tee $log
```
You can also take a look at `experiments/test/run.sgd.wiki.sh`.
- Disable `freqCut` by using `--freqCut 0`.
- Use `--defaultUNK xxx` to indicate that `xxx` represents the unknown words (see the usage sketch below).
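Continuing the hypothetical sketch above (still an assumption, not the repository's code), `--freqCut 0` would keep every training word, and the token passed to `--defaultUNK` would play the role of the unknown symbol:

```lua
-- 'train.txt' is a placeholder path. With a cutoff of 0 every training word is
-- kept; unseen words at validation/test time map to 'xxx' (cf. --defaultUNK xxx).
local vocab = buildVocab('train.txt', 0)
print(mapLine('some possibly unseen words', vocab, 'xxx'))
```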
You can also take a look at `experiments/test/run.sgd.ptb.sh`.