**This is an LSTM language model**

The input format is one sentence per line, with words separated by spaces.

- Tokenize the dataset.
- Use `--freqCut N` to determine the vocabulary size of your model. Words that appear `N` times or fewer in the training set will be replaced with `UNK`. During validation and testing, unknown words will also be mapped to `UNK` (see the sketch after this list).
- You can also use `--ingoreCase` to lowercase the dataset.
- Use `--seqLen N` to tell the model that the longest sentence length will be less than `N`.
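The sketch below is a rough illustration of the `--freqCut` / `UNK` behaviour described above, written in plain Lua. It is an assumption about how such a cutoff typically works, not the repository's actual implementation, and the names `buildVocab` and `mapLine` are hypothetical.

```lua
-- Hypothetical sketch (not the repository's code): build a vocabulary from a
-- tokenized training file, keeping words that occur more than freqCut times.
local function buildVocab(trainFile, freqCut)
  local counts = {}
  for line in io.lines(trainFile) do
    for word in line:gmatch('%S+') do
      counts[word] = (counts[word] or 0) + 1
    end
  end
  local vocab = { UNK = true }
  for word, cnt in pairs(counts) do
    if cnt > freqCut then vocab[word] = true end
  end
  return vocab
end

-- Map a tokenized line onto the vocabulary: out-of-vocabulary words become
-- unkToken, which defaults to 'UNK'.
local function mapLine(line, vocab, unkToken)
  unkToken = unkToken or 'UNK'
  local out = {}
  for word in line:gmatch('%S+') do
    out[#out + 1] = vocab[word] and word or unkToken
  end
  return table.concat(out, ' ')
end
```

The training command below shows the corresponding options in context.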
```bash
CUDA_VISIBLE_DEVICES=$ID th train.lua --useGPU \
  --dropout 0.2 --batchSize 20 --validBatchSize 20 --save $model --model LSTMLM \
  --freqCut 1 \
  --nlayers 1 \
  --seqLen 101 \
  --lr $lr \
  --optimMethod SGD \
  --nhid 200 \
  --nin 100 \
  --minImprovement 1.001 \
  --train $train \
  --valid $valid \
  --test $test \
  | tee $log
```
You can also take a look at `experiments/test/run.sgd.wiki.sh`.
- Disable `freqCut` by using `--freqCut 0`.
- Use `--defaultUNK xxx` to indicate that `xxx` represents the unknown words (see the usage sketch below).
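Continuing the hypothetical sketch above (still an assumption, not the repository's code), `--freqCut 0` would keep every training word, and the token passed to `--defaultUNK` would play the role of the unknown symbol:

```lua
-- 'train.txt' is a placeholder path. With a cutoff of 0 every training word is
-- kept; unseen words at validation/test time map to 'xxx' (cf. --defaultUNK xxx).
local vocab = buildVocab('train.txt', 0)
print(mapLine('some possibly unseen words', vocab, 'xxx'))
```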
You can also take a look at `experiments/test/run.sgd.ptb.sh`.