Hainan's RNNLM setup #37
…mit before fixing it
Arpa-reading and generating average prob distribution over words on different histories
nnet3 rnnlm lattice rescoring
some things I happened to notice.
no need to take action on these.
cat $train_text | awk -v w=$outdir/wordlist.all \
'BEGIN{while((getline<w)>0) v[$1]=1;}
{for (i=2;i<=NF;i++) if ($i in v) printf $i" ";else printf "<unk> ";print ""}'|sed 's/ $//g' \
| shuf --random-source=$train_text > $outdir/train.txt.0
We don't rely on shuf: it's not always installed, and we don't like adding dependencies. We typically use utils/shuffle_list.pl.
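To illustrate the point about avoiding `shuf`, here is a minimal sketch of a seeded shuffle built only from `awk`, `sort` and `cut`. The function name `shuffle_lines` and its seed argument are hypothetical and not part of the PR; in the tree itself, utils/shuffle_list.pl remains the tool to use.

```shell
# Hypothetical stand-in for shuf, using only awk, sort and cut.
# $1 is an assumed integer seed; the same seed and awk give the same order.
shuffle_lines() {
  seed=${1:-0}
  # Prefix each line with a pseudo-random key, sort on that key, then drop it.
  LC_ALL=C awk -v seed="$seed" \
      'BEGIN { srand(seed) } { printf "%.10f\t%s\n", rand(), $0 }' \
    | LC_ALL=C sort -k1,1n \
    | cut -f2-
}

# Usage, with the file name from the hunk above:
#   ... | shuffle_lines 777 > $outdir/train.txt.0
```

The order depends on the awk implementation's random number generator, so results are reproducible on one machine but not necessarily across different awk variants.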
num_words_out=10000

stage=-100
sos="<s>"
This is typically called bos, not sos.
stage=-100
sos="<s>"
eos="</s>"
oos="<oos>"
I assume this means "out of set". Maybe best to clarify via a comment?
if [ $stage -le -2 ]; then

steps/rnnlm/make_lstm_configs.py \
We'll want to have a more xconfig-based mechanism for this.
cat $outdir/train.txt.0 $outdir/wordlist.all | sed "s= =\n=g" | grep . | sort | uniq -c | sort -k1 -n -r | awk '{print $2,$1}' > $outdir/unigramcounts.txt

echo $sos 0 > $outdir/wordlist.in
Re the fact that you use the same symbol for both the BOS and EOS symbols: at some point we may want to revisit this. It's common in RNNLM work to share (tie) the input and output parameter matrices. In that case it may be desirable to have separate symbols for BOS and EOS; otherwise the model is forced to share their representation, which might not be ideal.
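One way to keep the two symbols apart is to give each its own id in a single shared word list, so a tied parameter matrix would have a separate row for each. The sketch below is only an illustration: the helper name `make_shared_wordlist` is hypothetical, and the `word id` line format is assumed from the `echo $sos 0` line in the hunk above.

```shell
# Hypothetical helper: number a vocabulary so BOS and EOS get distinct ids.
# Words are read from stdin, one per line; output lines are "word id".
make_shared_wordlist() {
  bos=$1   # e.g. "<s>"
  eos=$2   # e.g. "</s>"
  awk -v bos="$bos" -v eos="$eos" '
    BEGIN { print bos, 0; print eos, 1; n = 2 }
    # Skip blank lines, repeated words, and BOS/EOS if they occur in the input.
    NF > 0 && $1 != bos && $1 != eos && !($1 in seen) {
      seen[$1] = 1
      print $1, n++
    }'
}
```

For example, `printf 'the\ncat\n' | make_shared_wordlist '<s>' '</s>'` would assign ids 0 and 1 to the boundary symbols and 2 and 3 to the words.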
initial_learning_rate=0.01
final_learning_rate=0.0005
learning_rate_decline_factor=1.1
I think it would be more conventional to supply this as a factor less than one, and not divide; but I may be wrong.
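As a sketch of the multiplicative convention, the function below prints a schedule that starts at the initial rate and multiplies by a factor below one until it reaches the final rate. How the PR actually applies `learning_rate_decline_factor` is not shown here, so this assumes a simple per-step decay; the name `lr_schedule` is hypothetical. Dividing by 1.1 would correspond to a factor of roughly 0.909.

```shell
# Hypothetical helper: print a geometric learning-rate schedule.
# Arguments: initial rate, final rate, decay factor (assumed 0 < factor < 1).
lr_schedule() {
  LC_ALL=C awk -v lr="$1" -v final="$2" -v f="$3" 'BEGIN {
    if (f <= 0 || f >= 1) {
      print "decay factor must be in (0, 1)" > "/dev/stderr"
      exit 1
    }
    # Multiply rather than divide; stop once the rate would drop below final.
    while (lr > final) { printf "%.6f\n", lr; lr *= f }
    printf "%.6f\n", final
  }'
}

# Usage, with the rates from the hunk above and an assumed factor:
#   lr_schedule 0.01 0.0005 0.9
```

Rejecting factors outside (0, 1) makes the convention explicit and avoids a loop that never reaches the final rate.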