Hainan's RNNLM setup #37

Open: wants to merge 186 commits into base: master

@danpovey (Owner) commented Jun 7, 2017

No description provided.

@danpovey (Owner, Author) left a comment

Some things I happened to notice.
No need to take action on these.

cat $train_text | awk -v w=$outdir/wordlist.all \
'BEGIN{while((getline<w)>0) v[$1]=1;}
{for (i=2;i<=NF;i++) if ($i in v) printf $i" ";else printf "<unk> ";print ""}'|sed 's/ $//g' \
| shuf --random-source=$train_text > $outdir/train.txt.0

We don't rely on shuf: it's not always installed, and we don't like adding dependencies. We typically use utils/shuffle_list.pl.
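To illustrate the point about avoiding the shuf dependency, here is a minimal sketch of a seeded shuffle built only from awk, sort, and cut. It is a stand-in for illustration, not the utils/shuffle_list.pl script the comment refers to; the function name `shuffle_lines` and its seed argument are hypothetical.

```shell
# Hypothetical helper: shuffle stdin deterministically without shuf.
# $1 is an integer seed (an assumed parameter, not taken from the PR).
shuffle_lines() {
  # Prefix each line with a seeded pseudo-random key, sort numerically
  # on that key, then strip the key again.
  awk -v seed="$1" 'BEGIN { srand(seed) } { printf "%.12f\t%s\n", rand(), $0 }' \
    | LC_ALL=C sort -n \
    | cut -f2-
}
```

With this in place, the last stage of the pipeline above could read `| shuffle_lines 777 > $outdir/train.txt.0`. The same seed gives the same order on a given awk implementation, though the order may differ between awk implementations.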

num_words_out=10000

stage=-100
sos="<s>"

This is typically called bos, not sos.

stage=-100
sos="<s>"
eos="</s>"
oos="<oos>"

I assume this means "out of set". Maybe best to clarify via a comment?
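A sketch of what the commented variable block might look like, folding in the bos naming from the earlier comment. The meaning given for `oos` is only the reviewer's assumption ("out of set") and would need to be confirmed by the PR author.

```shell
bos="<s>"     # beginning-of-sentence symbol (named sos in the PR)
eos="</s>"    # end-of-sentence symbol
oos="<oos>"   # ASSUMED meaning: "out of set", i.e. a word outside the chosen word list
```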


if [ $stage -le -2 ]; then

steps/rnnlm/make_lstm_configs.py \

We'll want to have a more xconfig-based mechanism for this.


cat $outdir/train.txt.0 $outdir/wordlist.all | sed "s= =\n=g" | grep . | sort | uniq -c | sort -k1 -n -r | awk '{print $2,$1}' > $outdir/unigramcounts.txt

echo $sos 0 > $outdir/wordlist.in

Re: the fact that you use the same symbol for both BOS and EOS.
At some point we may want to revisit this. It's common in RNNLM setups to share (tie) the input and output parameter matrices. In that case it may be desirable to have separate symbols for BOS and EOS; otherwise the model is forced to share their representation, which might not be ideal.
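As a rough sketch of the alternative, the word list could give BOS and EOS their own indices so that tied input/output matrices do not force them onto one row. The "symbol index" file layout is inferred from the `echo $sos 0` line above, and the function name `make_wordlist` and its input file are hypothetical.

```shell
bos="<s>"
eos="</s>"

# Hypothetical helper: $1 is a file with one word per line (assumed input).
# Writes "symbol index" pairs, with BOS and EOS given distinct indices.
make_wordlist() {
  printf '%s 0\n%s 1\n' "$bos" "$eos"
  # Ordinary words start at index 2, after the two sentence-boundary symbols.
  awk '{ print $1, NR + 1 }' "$1"
}
```

Whether separate symbols actually help would need to be checked experimentally; the comment only raises it as something to revisit.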


initial_learning_rate=0.01
final_learning_rate=0.0005
learning_rate_decline_factor=1.1

I think it would be more conventional to supply this as a factor less than one, and not divide; but I may be wrong.
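A minimal sketch of the multiplicative convention, assuming the rate is decayed once per step and clamped at the final value. How the PR actually applies `learning_rate_decline_factor` is not shown here, so the variable name `learning_rate_decay`, the value 0.9, and the function `next_learning_rate` are all illustrative.

```shell
initial_learning_rate=0.01
final_learning_rate=0.0005
learning_rate_decay=0.9   # hypothetical: a factor below one, applied by multiplication

# Hypothetical helper: print the next learning rate given the current one ($1).
next_learning_rate() {
  awk -v lr="$1" -v f="$learning_rate_decay" -v min_lr="$final_learning_rate" \
    'BEGIN { lr *= f; if (lr < min_lr) lr = min_lr; printf "%.6g\n", lr }'
}
```

Dividing by 1.1 corresponds to multiplying by roughly 0.909, so the two conventions describe the same kind of schedule; only the way the option is specified differs.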

4 participants