
Training with Sejong Treebank corpus #4

Closed
xtknight opened this issue Jun 26, 2016 · 38 comments

xtknight commented Jun 26, 2016

Hello,

I was trying to train using the sejong_treebank.sample file, so I ran the following commands:
$ ./sejong/split.sh
$ ./sejong/c2d.sh
$ ./train_sejong.sh

But I got an error (the same as the one below: "Assign requires shapes of both tensors to match").
So I tried downloading a larger treebank corpus from sejong.or.kr (it seems to be the full version of the sejong_treebank.sample in your repository, but then again I'm not sure...), but the same thing happened.

My input file (I tried both the sample and the full corpus) is just a long stream of entries like the following in UTF-8, just like your sample Sejong file. Is there somewhere else I need to put it, or is there something else I need to do other than saving it as sejong/sejong_treebank.txt.v1 and running the scripts?

; 1993/06/08 19 
(NP     (NP 1993/SN + //SP + 06/SN + //SP + 08/SN)
        (NP 19/SN))
; 엠마누엘 웅가로 / 
(NP     (NP     (NP 엠마누엘/NNP)
                (NP 웅가로/NNP))
        (X //SP))
; 의상서 실내 장식품으로… 
(NP_AJT         (NP_AJT 의상/NNG + 서/JKB)
        (NP_AJT         (NP 실내/NNG)
                (NP_AJT 장식품/NNG + 으로/JKB + …/SE)))
; 디자인 세계 넓혀 
(VP     (NP_OBJ         (NP 디자인/NNG)
                (NP_OBJ 세계/NNG))
        (VP 넓히/VV + 어/EC))
; 프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내 장식용 직물 디자이너로 나섰다. 
(S      (NP_SBJ         (NP     (NP_MOD 프랑스/NNP + 의/JKG)
                        (NP     (VNP_MOD 세계/NNG + 적/XSN + 이/VCP + ᆫ/ETM)
                                (NP     (NP 의상/NNG)
                                        (NP 디자이너/NNG))))
                (NP_SBJ         (NP 엠마누엘/NNP)
                        (NP_SBJ 웅가로/NNP + 가/JKS)))
        (VP     (NP_AJT         (NP     (NP     (NP 실내/NNG)
                                        (NP 장식/NNG + 용/XSN))
                                (NP 직물/NNG))
                        (NP_AJT 디자이너/NNG + 로/JKB))
                (VP 나서/VV + 었/EP + 다/EF + ./SF)))

Here are the logs with all the verbose options.

andy@andy ~/Downloads/syntaxnet/models/syntaxnet/work $ ./sejong/split.sh  -v -v
+ '[' 0 '!=' 0 ']'
++++ readlink -f ./sejong/split.sh
+++ dirname /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/split.sh
++ readlink -f /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong
+ CDIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong
+ [[ -f /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/env.sh ]]
+ . /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/env.sh
++ set -o errexit
++ export LC_ALL=ko_KR.UTF-8
++ LC_ALL=ko_KR.UTF-8
++ export LANG=ko_KR.UTF-8
++ LANG=ko_KR.UTF-8
+++++ readlink -f /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/env.sh
++++ dirname /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/env.sh
+++ readlink -f /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong
++ CDIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong
+++++ readlink -f /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/env.sh
++++ dirname /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/env.sh
+++ readlink -f /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/..
++ PDIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/work
++ python=/usr/bin/python
+ make_calmness
+ exec
+ exec
+ child_verbose='-v -v'
+ '[' '!' -e /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/wdir ']'
+ WDIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/wdir
+ '[' '!' -e /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/log ']'
+ LDIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/log
+ /usr/bin/python /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/split.py --mode=0
+ /usr/bin/python /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/split.py --mode=1
+ /usr/bin/python /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/split.py --mode=2
+ close_fd
+ exec


andy@andy ~/Downloads/syntaxnet/models/syntaxnet/work $ ./sejong/c2d.sh  -v -v
+ '[' 0 '!=' 0 ']'
++++ readlink -f ./sejong/c2d.sh
+++ dirname /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/c2d.sh
++ readlink -f /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong
+ CDIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong
+ [[ -f /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/env.sh ]]
+ . /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/env.sh
++ set -o errexit
++ export LC_ALL=ko_KR.UTF-8
++ LC_ALL=ko_KR.UTF-8
++ export LANG=ko_KR.UTF-8
++ LANG=ko_KR.UTF-8
+++++ readlink -f /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/env.sh
++++ dirname /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/env.sh
+++ readlink -f /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong
++ CDIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong
+++++ readlink -f /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/env.sh
++++ dirname /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/env.sh
+++ readlink -f /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/..
++ PDIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/work
++ python=/usr/bin/python
+ make_calmness
+ exec
+ exec
+ child_verbose='-v -v'
+ '[' '!' -e /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/wdir ']'
+ WDIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/wdir
+ '[' '!' -e /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/log ']'
+ LDIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/log
+ for SET in training tuning test
+ /usr/bin/python /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/c2d.py --mode=0
+ /usr/bin/python /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/c2d.py --mode=1
+ /usr/bin/python /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/align.py
number_of_sent = 0, number_of_sent_skip = 0
+ for SET in training tuning test
+ /usr/bin/python /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/c2d.py --mode=0
+ /usr/bin/python /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/c2d.py --mode=1
+ /usr/bin/python /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/align.py
number_of_sent = 0, number_of_sent_skip = 0
+ for SET in training tuning test
+ /usr/bin/python /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/c2d.py --mode=0
+ /usr/bin/python /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/c2d.py --mode=1
+ /usr/bin/python /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/align.py
number_of_sent = 0, number_of_sent_skip = 0
+ close_fd
+ exec


andy@andy ~/Downloads/syntaxnet/models/syntaxnet/work $ ./train_sejong.sh  -v -v
+ '[' 0 '!=' 0 ']'
++++ readlink -f ./train_sejong.sh
+++ dirname /home/andy/Downloads/syntaxnet/models/syntaxnet/work/train_sejong.sh
++ readlink -f /home/andy/Downloads/syntaxnet/models/syntaxnet/work
+ CDIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/work
++++ readlink -f ./train_sejong.sh
+++ dirname /home/andy/Downloads/syntaxnet/models/syntaxnet/work/train_sejong.sh
++ readlink -f /home/andy/Downloads/syntaxnet/models/syntaxnet/work/..
+ PDIR=/home/andy/Downloads/syntaxnet/models/syntaxnet
+ make_calmness
+ exec
+ exec
+ cd /home/andy/Downloads/syntaxnet/models/syntaxnet
+ python=/usr/bin/python
+ SYNTAXNET_HOME=/home/andy/Downloads/syntaxnet/models/syntaxnet
+ BINDIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet
+ CONTEXT=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/context.pbtxt_p
+ TMP_DIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output
+ mkdir -p /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output
+ cat /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/context.pbtxt_p
+ sed s=OUTPATH=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output=
+ MODEL_DIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/models
+ HIDDEN_LAYER_SIZES=512,512
+ HIDDEN_LAYER_PARAMS=512,512
+ BATCH_SIZE=256
+ BEAM_SIZE=16
+ LP_PARAMS=512,512-0.08-4400-0.85
+ GP_PARAMS=512,512-0.02-100-0.9
+ pretrain_parser
+ /home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_trainer --arg_prefix=brain_parser --batch_size=256 --compute_lexicon --decay_steps=4400 --graph_builder=greedy --hidden_layer_sizes=512,512 --learning_rate=0.08 --momentum=0.85 --beam_size=1 --output_path=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output --task_context=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/context --projectivize_training_set --training_corpus=tagged-training-corpus --tuning_corpus=tagged-tuning-corpus --params=512,512-0.08-4400-0.85 --num_epochs=20 --report_every=100 --checkpoint_every=1000 --logtostderr
INFO:tensorflow:Computing lexicon...
I syntaxnet/lexicon_builder.cc:124] Term maps collected over 0 tokens from 0 documents
I syntaxnet/term_frequency_map.cc:137] Saved 0 terms to /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/word-map.
I syntaxnet/term_frequency_map.cc:137] Saved 0 terms to /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/lcword-map.
I syntaxnet/term_frequency_map.cc:137] Saved 0 terms to /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/tag-map.
I syntaxnet/term_frequency_map.cc:137] Saved 0 terms to /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/category-map.
I syntaxnet/term_frequency_map.cc:137] Saved 0 terms to /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/label-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 0 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.word input(1).word input(2).word input(3).word stack.word stack(1).word stack(2).word stack(3).word stack.child(1).word stack.child(1).sibling(-1).word stack.child(-1).word stack.child(-1).sibling(1).word stack(1).child(1).word stack(1).child(1).sibling(-1).word stack(1).child(-1).word stack(1).child(-1).sibling(1).word stack.child(2).word stack.child(-2).word stack(1).child(2).word stack(1).child(-2).word; input.tag input(1).tag input(2).tag input(3).tag stack.tag stack(1).tag stack(2).tag stack(3).tag stack.child(1).tag stack.child(1).sibling(-1).tag stack.child(-1).tag stack.child(-1).sibling(1).tag stack(1).child(1).tag stack(1).child(1).sibling(-1).tag stack(1).child(-1).tag stack(1).child(-1).sibling(1).tag stack.child(2).tag stack.child(-2).tag stack(1).child(2).tag stack(1).child(-2).tag; stack.child(1).label stack.child(1).sibling(-1).label stack.child(-1).label stack.child(-1).sibling(1).label stack(1).child(1).label stack(1).child(1).sibling(-1).label stack(1).child(-1).label stack(1).child(-1).sibling(1).label stack.child(2).label stack.child(-2).label stack(1).child(2).label stack(1).child(-2).label 
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: words;tags;labels
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 64;32;32
I syntaxnet/term_frequency_map.cc:101] Loaded 0 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/word-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 0 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/tag-map.
INFO:tensorflow:Preprocessing...
INFO:tensorflow:Training...
INFO:tensorflow:Building training network with parameters: feature_sizes: [20 20 12] domain_sizes: [3 3 3]
INFO:tensorflow:Initializing...
INFO:tensorflow:Training...
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.word input(1).word input(2).word input(3).word stack.word stack(1).word stack(2).word stack(3).word stack.child(1).word stack.child(1).sibling(-1).word stack.child(-1).word stack.child(-1).sibling(1).word stack(1).child(1).word stack(1).child(1).sibling(-1).word stack(1).child(-1).word stack(1).child(-1).sibling(1).word stack.child(2).word stack.child(-2).word stack(1).child(2).word stack(1).child(-2).word; input.tag input(1).tag input(2).tag input(3).tag stack.tag stack(1).tag stack(2).tag stack(3).tag stack.child(1).tag stack.child(1).sibling(-1).tag stack.child(-1).tag stack.child(-1).sibling(1).tag stack(1).child(1).tag stack(1).child(1).sibling(-1).tag stack(1).child(-1).tag stack(1).child(-1).sibling(1).tag stack.child(2).tag stack.child(-2).tag stack(1).child(2).tag stack(1).child(-2).tag; stack.child(1).label stack.child(1).sibling(-1).label stack.child(-1).label stack.child(-1).sibling(1).label stack(1).child(1).label stack(1).child(1).sibling(-1).label stack(1).child(-1).label stack(1).child(-1).sibling(1).label stack.child(2).label stack.child(-2).label stack(1).child(2).label stack(1).child(-2).label 
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: words;tags;labels
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 64;32;32
I syntaxnet/term_frequency_map.cc:101] Loaded 0 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/word-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 0 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/tag-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 0 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/label-map.
I syntaxnet/reader_ops.cc:141] Starting epoch 1
I syntaxnet/reader_ops.cc:141] Starting epoch 2
I syntaxnet/reader_ops.cc:141] Starting epoch 3
I syntaxnet/reader_ops.cc:141] Starting epoch 4
I syntaxnet/reader_ops.cc:141] Starting epoch 5
I syntaxnet/reader_ops.cc:141] Starting epoch 6
I syntaxnet/reader_ops.cc:141] Starting epoch 7
I syntaxnet/reader_ops.cc:141] Starting epoch 8
I syntaxnet/reader_ops.cc:141] Starting epoch 9
I syntaxnet/reader_ops.cc:141] Starting epoch 10
I syntaxnet/reader_ops.cc:141] Starting epoch 11
I syntaxnet/reader_ops.cc:141] Starting epoch 12
I syntaxnet/reader_ops.cc:141] Starting epoch 13
I syntaxnet/reader_ops.cc:141] Starting epoch 14
I syntaxnet/reader_ops.cc:141] Starting epoch 15
I syntaxnet/reader_ops.cc:141] Starting epoch 16
I syntaxnet/reader_ops.cc:141] Starting epoch 17
I syntaxnet/reader_ops.cc:141] Starting epoch 18
I syntaxnet/reader_ops.cc:141] Starting epoch 19
I syntaxnet/reader_ops.cc:141] Starting epoch 20
+ evaluate_pretrained_parser
+ for SET in training tuning test
+ /home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval --task_context=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/brain_parser/greedy/512,512-0.08-4400-0.85/context --batch_size=256 --hidden_layer_sizes=512,512 --beam_size=1 --input=tagged-training-corpus --output=parsed-training-corpus --arg_prefix=brain_parser --graph_builder=greedy --model_path=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/brain_parser/greedy/512,512-0.08-4400-0.85/model
I syntaxnet/term_frequency_map.cc:101] Loaded 0 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.word input(1).word input(2).word input(3).word stack.word stack(1).word stack(2).word stack(3).word stack.child(1).word stack.child(1).sibling(-1).word stack.child(-1).word stack.child(-1).sibling(1).word stack(1).child(1).word stack(1).child(1).sibling(-1).word stack(1).child(-1).word stack(1).child(-1).sibling(1).word stack.child(2).word stack.child(-2).word stack(1).child(2).word stack(1).child(-2).word; input.tag input(1).tag input(2).tag input(3).tag stack.tag stack(1).tag stack(2).tag stack(3).tag stack.child(1).tag stack.child(1).sibling(-1).tag stack.child(-1).tag stack.child(-1).sibling(1).tag stack(1).child(1).tag stack(1).child(1).sibling(-1).tag stack(1).child(-1).tag stack(1).child(-1).sibling(1).tag stack.child(2).tag stack.child(-2).tag stack(1).child(2).tag stack(1).child(-2).tag; stack.child(1).label stack.child(1).sibling(-1).label stack.child(-1).label stack.child(-1).sibling(1).label stack(1).child(1).label stack(1).child(1).sibling(-1).label stack(1).child(-1).label stack(1).child(-1).sibling(1).label stack.child(2).label stack.child(-2).label stack(1).child(2).label stack(1).child(-2).label 
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: words;tags;labels
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 64;32;32
I syntaxnet/term_frequency_map.cc:101] Loaded 0 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/word-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 0 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/tag-map.
INFO:tensorflow:Building training network with parameters: feature_sizes: [20 20 12] domain_sizes: [3 3 3]
Traceback (most recent call last):
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 149, in <module>
    tf.app.run()
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 145, in main
    Eval(sess, num_actions, feature_sizes, domain_sizes, embedding_dims)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 98, in Eval
    parser.saver.restore(sess, FLAGS.model_path)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/training/saver.py", line 1104, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/client/session.py", line 333, in run
    run_metadata_ptr)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/client/session.py", line 573, in _run
    feed_dict_string, options, run_metadata)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/client/session.py", line 653, in _do_run
    target_list, options, run_metadata)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/client/session.py", line 673, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [3,64] rhs shape= [485,64]
     [[Node: save/Assign_5 = Assign[T=DT_FLOAT, _class=["loc:@embedding_matrix_0"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](params/embedding_matrix_0/ExponentialMovingAverage, save/restore_slice_5)]]
Caused by op u'save/Assign_5', defined at:
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 149, in <module>
    tf.app.run()
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 145, in main
    Eval(sess, num_actions, feature_sizes, domain_sizes, embedding_dims)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 96, in Eval
    parser.AddSaver(FLAGS.slim_model)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/graph_builder.py", line 568, in AddSaver
    self.saver = tf.train.Saver(variables_to_save)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/training/saver.py", line 845, in __init__
    restore_sequentially=restore_sequentially)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/training/saver.py", line 515, in build
    filename_tensor, vars_to_save, restore_sequentially, reshape)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/training/saver.py", line 281, in _AddRestoreOps
    validate_shape=validate_shape))
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/ops/gen_state_ops.py", line 45, in assign
    use_locking=use_locking, name=name)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/ops/op_def_library.py", line 693, in apply_op
    op_def=op_def)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/framework/ops.py", line 2186, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/framework/ops.py", line 1170, in __init__
    self._traceback = _extract_stack()
xtknight (Author)

Okay, for the full corpus it seems like c2d was failing to create a deptree file in build_tree() because of an unaligned sentence; I'm not sure what's going on there. When I manually copied some files around and ran with just the sample corpus, I finally got something going again, but I'm running into what seems to be the same error.

andy@andy ~/Downloads/syntaxnet/models/syntaxnet/work $ ./train_sejong.sh -v -v 
+ '[' 0 '!=' 0 ']'
++++ readlink -f ./train_sejong.sh
+++ dirname /home/andy/Downloads/syntaxnet/models/syntaxnet/work/train_sejong.sh
++ readlink -f /home/andy/Downloads/syntaxnet/models/syntaxnet/work
+ CDIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/work
++++ readlink -f ./train_sejong.sh
+++ dirname /home/andy/Downloads/syntaxnet/models/syntaxnet/work/train_sejong.sh
++ readlink -f /home/andy/Downloads/syntaxnet/models/syntaxnet/work/..
+ PDIR=/home/andy/Downloads/syntaxnet/models/syntaxnet
+ make_calmness
+ exec
+ exec
+ cd /home/andy/Downloads/syntaxnet/models/syntaxnet
+ python=/usr/bin/python
+ SYNTAXNET_HOME=/home/andy/Downloads/syntaxnet/models/syntaxnet
+ BINDIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet
+ CONTEXT=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/context.pbtxt_p
+ TMP_DIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output
+ mkdir -p /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output
+ cat /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/context.pbtxt_p
+ sed s=OUTPATH=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output=
+ MODEL_DIR=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/models
+ HIDDEN_LAYER_SIZES=512,512
+ HIDDEN_LAYER_PARAMS=512,512
+ BATCH_SIZE=256
+ BEAM_SIZE=16
+ LP_PARAMS=512,512-0.08-4400-0.85
+ GP_PARAMS=512,512-0.02-100-0.9
+ pretrain_parser
+ /home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_trainer --arg_prefix=brain_parser --batch_size=256 --compute_lexicon --decay_steps=4400 --graph_builder=greedy --hidden_layer_sizes=512,512 --learning_rate=0.08 --momentum=0.85 --beam_size=1 --output_path=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output --task_context=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/context --projectivize_training_set --training_corpus=tagged-training-corpus --tuning_corpus=tagged-tuning-corpus --params=512,512-0.08-4400-0.85 --num_epochs=20 --report_every=100 --checkpoint_every=1000 --logtostderr
INFO:tensorflow:Computing lexicon...
I syntaxnet/lexicon_builder.cc:124] Term maps collected over 1745 tokens from 71 documents
I syntaxnet/term_frequency_map.cc:137] Saved 534 terms to /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/word-map.
I syntaxnet/term_frequency_map.cc:137] Saved 534 terms to /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/lcword-map.
I syntaxnet/term_frequency_map.cc:137] Saved 36 terms to /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/tag-map.
I syntaxnet/term_frequency_map.cc:137] Saved 36 terms to /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/category-map.
I syntaxnet/term_frequency_map.cc:137] Saved 27 terms to /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/label-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 27 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.word input(1).word input(2).word input(3).word stack.word stack(1).word stack(2).word stack(3).word stack.child(1).word stack.child(1).sibling(-1).word stack.child(-1).word stack.child(-1).sibling(1).word stack(1).child(1).word stack(1).child(1).sibling(-1).word stack(1).child(-1).word stack(1).child(-1).sibling(1).word stack.child(2).word stack.child(-2).word stack(1).child(2).word stack(1).child(-2).word; input.tag input(1).tag input(2).tag input(3).tag stack.tag stack(1).tag stack(2).tag stack(3).tag stack.child(1).tag stack.child(1).sibling(-1).tag stack.child(-1).tag stack.child(-1).sibling(1).tag stack(1).child(1).tag stack(1).child(1).sibling(-1).tag stack(1).child(-1).tag stack(1).child(-1).sibling(1).tag stack.child(2).tag stack.child(-2).tag stack(1).child(2).tag stack(1).child(-2).tag; stack.child(1).label stack.child(1).sibling(-1).label stack.child(-1).label stack.child(-1).sibling(1).label stack(1).child(1).label stack(1).child(1).sibling(-1).label stack(1).child(-1).label stack(1).child(-1).sibling(1).label stack.child(2).label stack.child(-2).label stack(1).child(2).label stack(1).child(-2).label 
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: words;tags;labels
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 64;32;32
I syntaxnet/term_frequency_map.cc:101] Loaded 534 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/word-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 36 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/tag-map.
INFO:tensorflow:Preprocessing...
INFO:tensorflow:Training...
INFO:tensorflow:Building training network with parameters: feature_sizes: [20 20 12] domain_sizes: [537  39  30]
INFO:tensorflow:Initializing...
INFO:tensorflow:Training...
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.word input(1).word input(2).word input(3).word stack.word stack(1).word stack(2).word stack(3).word stack.child(1).word stack.child(1).sibling(-1).word stack.child(-1).word stack.child(-1).sibling(1).word stack(1).child(1).word stack(1).child(1).sibling(-1).word stack(1).child(-1).word stack(1).child(-1).sibling(1).word stack.child(2).word stack.child(-2).word stack(1).child(2).word stack(1).child(-2).word; input.tag input(1).tag input(2).tag input(3).tag stack.tag stack(1).tag stack(2).tag stack(3).tag stack.child(1).tag stack.child(1).sibling(-1).tag stack.child(-1).tag stack.child(-1).sibling(1).tag stack(1).child(1).tag stack(1).child(1).sibling(-1).tag stack(1).child(-1).tag stack(1).child(-1).sibling(1).tag stack.child(2).tag stack.child(-2).tag stack(1).child(2).tag stack(1).child(-2).tag; stack.child(1).label stack.child(1).sibling(-1).label stack.child(-1).label stack.child(-1).sibling(1).label stack(1).child(1).label stack(1).child(1).sibling(-1).label stack(1).child(-1).label stack(1).child(-1).sibling(1).label stack.child(2).label stack.child(-2).label stack(1).child(2).label stack(1).child(-2).label 
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: words;tags;labels
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 64;32;32
I syntaxnet/term_frequency_map.cc:101] Loaded 534 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/word-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 36 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/tag-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 27 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/label-map.
I syntaxnet/reader_ops.cc:141] Starting epoch 1
INFO:tensorflow:Epochs: 1, num steps: 100, seconds elapsed: 2.40, avg cost: 0.26, 
I syntaxnet/reader_ops.cc:141] Starting epoch 2
INFO:tensorflow:Epochs: 2, num steps: 200, seconds elapsed: 3.87, avg cost: 0.14, 
INFO:tensorflow:Epochs: 2, num steps: 300, seconds elapsed: 5.26, avg cost: 0.11, 
I syntaxnet/reader_ops.cc:141] Starting epoch 3
INFO:tensorflow:Epochs: 3, num steps: 400, seconds elapsed: 6.80, avg cost: 0.19, 
I syntaxnet/reader_ops.cc:141] Starting epoch 4
INFO:tensorflow:Epochs: 4, num steps: 500, seconds elapsed: 8.13, avg cost: 0.06, 
INFO:tensorflow:Epochs: 4, num steps: 600, seconds elapsed: 9.68, avg cost: 0.21, 
I syntaxnet/reader_ops.cc:141] Starting epoch 5
INFO:tensorflow:Epochs: 5, num steps: 700, seconds elapsed: 11.13, avg cost: 0.14, 
INFO:tensorflow:Epochs: 5, num steps: 800, seconds elapsed: 12.51, avg cost: 0.11, 
I syntaxnet/reader_ops.cc:141] Starting epoch 6
INFO:tensorflow:Epochs: 6, num steps: 900, seconds elapsed: 14.05, avg cost: 0.19, 
I syntaxnet/reader_ops.cc:141] Starting epoch 7
INFO:tensorflow:Epochs: 7, num steps: 1000, seconds elapsed: 15.39, avg cost: 0.06, 
INFO:tensorflow:Evaluating training network.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.word input(1).word input(2).word input(3).word stack.word stack(1).word stack(2).word stack(3).word stack.child(1).word stack.child(1).sibling(-1).word stack.child(-1).word stack.child(-1).sibling(1).word stack(1).child(1).word stack(1).child(1).sibling(-1).word stack(1).child(-1).word stack(1).child(-1).sibling(1).word stack.child(2).word stack.child(-2).word stack(1).child(2).word stack(1).child(-2).word; input.tag input(1).tag input(2).tag input(3).tag stack.tag stack(1).tag stack(2).tag stack(3).tag stack.child(1).tag stack.child(1).sibling(-1).tag stack.child(-1).tag stack.child(-1).sibling(1).tag stack(1).child(1).tag stack(1).child(1).sibling(-1).tag stack(1).child(-1).tag stack(1).child(-1).sibling(1).tag stack.child(2).tag stack.child(-2).tag stack(1).child(2).tag stack(1).child(-2).tag; stack.child(1).label stack.child(1).sibling(-1).label stack.child(-1).label stack.child(-1).sibling(1).label stack(1).child(1).label stack(1).child(1).sibling(-1).label stack(1).child(-1).label stack(1).child(-1).sibling(1).label stack.child(2).label stack.child(-2).label stack(1).child(2).label stack(1).child(-2).label 
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: words;tags;labels
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 64;32;32
I syntaxnet/reader_ops.cc:141] Starting epoch 1
I syntaxnet/reader_ops.cc:141] Starting epoch 2
INFO:tensorflow:Seconds elapsed in evaluation: 0.50, eval metric: 8.42%
INFO:tensorflow:Writing out trained parameters.
INFO:tensorflow:Epochs: 7, num steps: 1100, seconds elapsed: 18.31, avg cost: 0.19, 
I syntaxnet/reader_ops.cc:141] Starting epoch 8
INFO:tensorflow:Epochs: 8, num steps: 1200, seconds elapsed: 19.78, avg cost: 0.13, 
INFO:tensorflow:Epochs: 8, num steps: 1300, seconds elapsed: 21.21, avg cost: 0.09, 
I syntaxnet/reader_ops.cc:141] Starting epoch 9
INFO:tensorflow:Epochs: 9, num steps: 1400, seconds elapsed: 22.86, avg cost: 0.16, 
I syntaxnet/reader_ops.cc:141] Starting epoch 10
INFO:tensorflow:Epochs: 10, num steps: 1500, seconds elapsed: 24.29, avg cost: 0.05, 
INFO:tensorflow:Epochs: 10, num steps: 1600, seconds elapsed: 25.93, avg cost: 0.15, 
I syntaxnet/reader_ops.cc:141] Starting epoch 11
INFO:tensorflow:Epochs: 11, num steps: 1700, seconds elapsed: 27.49, avg cost: 0.10, 
INFO:tensorflow:Epochs: 11, num steps: 1800, seconds elapsed: 28.93, avg cost: 0.07, 
I syntaxnet/reader_ops.cc:141] Starting epoch 12
INFO:tensorflow:Epochs: 12, num steps: 1900, seconds elapsed: 30.58, avg cost: 0.13, 
I syntaxnet/reader_ops.cc:141] Starting epoch 13
INFO:tensorflow:Epochs: 13, num steps: 2000, seconds elapsed: 31.98, avg cost: 0.05, 
INFO:tensorflow:Evaluating training network.
I syntaxnet/reader_ops.cc:141] Starting epoch 3
INFO:tensorflow:Seconds elapsed in evaluation: 0.46, eval metric: 80.29%
INFO:tensorflow:Writing out trained parameters.
INFO:tensorflow:Epochs: 13, num steps: 2100, seconds elapsed: 34.92, avg cost: 0.11, 
I syntaxnet/reader_ops.cc:141] Starting epoch 14
INFO:tensorflow:Epochs: 14, num steps: 2200, seconds elapsed: 36.47, avg cost: 0.09, 
INFO:tensorflow:Epochs: 14, num steps: 2300, seconds elapsed: 37.92, avg cost: 0.06, 
I syntaxnet/reader_ops.cc:141] Starting epoch 15
INFO:tensorflow:Epochs: 15, num steps: 2400, seconds elapsed: 39.51, avg cost: 0.11, 
I syntaxnet/reader_ops.cc:141] Starting epoch 16
INFO:tensorflow:Epochs: 16, num steps: 2500, seconds elapsed: 40.94, avg cost: 0.04, 
INFO:tensorflow:Epochs: 16, num steps: 2600, seconds elapsed: 42.58, avg cost: 0.09, 
I syntaxnet/reader_ops.cc:141] Starting epoch 17
INFO:tensorflow:Epochs: 17, num steps: 2700, seconds elapsed: 44.21, avg cost: 0.07, 
INFO:tensorflow:Epochs: 17, num steps: 2800, seconds elapsed: 45.59, avg cost: 0.05, 
I syntaxnet/reader_ops.cc:141] Starting epoch 18
INFO:tensorflow:Epochs: 18, num steps: 2900, seconds elapsed: 47.28, avg cost: 0.09, 
I syntaxnet/reader_ops.cc:141] Starting epoch 19
INFO:tensorflow:Epochs: 19, num steps: 3000, seconds elapsed: 48.66, avg cost: 0.04, 
INFO:tensorflow:Evaluating training network.
I syntaxnet/reader_ops.cc:141] Starting epoch 4
INFO:tensorflow:Seconds elapsed in evaluation: 0.50, eval metric: 78.45%
INFO:tensorflow:Writing out trained parameters.
INFO:tensorflow:Epochs: 19, num steps: 3100, seconds elapsed: 51.34, avg cost: 0.08, 
I syntaxnet/reader_ops.cc:141] Starting epoch 20
+ evaluate_pretrained_parser
+ for SET in training tuning test
+ /home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval --task_context=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/brain_parser/greedy/512,512-0.08-4400-0.85/context --batch_size=256 --hidden_layer_sizes=512,512 --beam_size=1 --input=tagged-training-corpus --output=parsed-training-corpus --arg_prefix=brain_parser --graph_builder=greedy --model_path=/home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/brain_parser/greedy/512,512-0.08-4400-0.85/model
I syntaxnet/term_frequency_map.cc:101] Loaded 27 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.word input(1).word input(2).word input(3).word stack.word stack(1).word stack(2).word stack(3).word stack.child(1).word stack.child(1).sibling(-1).word stack.child(-1).word stack.child(-1).sibling(1).word stack(1).child(1).word stack(1).child(1).sibling(-1).word stack(1).child(-1).word stack(1).child(-1).sibling(1).word stack.child(2).word stack.child(-2).word stack(1).child(2).word stack(1).child(-2).word; input.tag input(1).tag input(2).tag input(3).tag stack.tag stack(1).tag stack(2).tag stack(3).tag stack.child(1).tag stack.child(1).sibling(-1).tag stack.child(-1).tag stack.child(-1).sibling(1).tag stack(1).child(1).tag stack(1).child(1).sibling(-1).tag stack(1).child(-1).tag stack(1).child(-1).sibling(1).tag stack.child(2).tag stack.child(-2).tag stack(1).child(2).tag stack(1).child(-2).tag; stack.child(1).label stack.child(1).sibling(-1).label stack.child(-1).label stack.child(-1).sibling(1).label stack(1).child(1).label stack(1).child(1).sibling(-1).label stack(1).child(-1).label stack(1).child(-1).sibling(1).label stack.child(2).label stack.child(-2).label stack(1).child(2).label stack(1).child(-2).label 
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: words;tags;labels
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 64;32;32
I syntaxnet/term_frequency_map.cc:101] Loaded 534 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/word-map.
I syntaxnet/term_frequency_map.cc:101] Loaded 36 terms from /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/tag-map.
INFO:tensorflow:Building training network with parameters: feature_sizes: [20 20 12] domain_sizes: [537  39  30]
Traceback (most recent call last):
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 149, in <module>
    tf.app.run()
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 145, in main
    Eval(sess, num_actions, feature_sizes, domain_sizes, embedding_dims)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 98, in Eval
    parser.saver.restore(sess, FLAGS.model_path)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/training/saver.py", line 1104, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/client/session.py", line 333, in run
    run_metadata_ptr)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/client/session.py", line 573, in _run
    feed_dict_string, options, run_metadata)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/client/session.py", line 653, in _do_run
    target_list, options, run_metadata)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/client/session.py", line 673, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [256,55] rhs shape= [71,55]
     [[Node: save/Assign_15 = Assign[T=DT_FLOAT, _class=["loc:@transition_scores"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](transition_scores, save/restore_slice_15)]]
Caused by op u'save/Assign_15', defined at:
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 149, in <module>
    tf.app.run()
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 145, in main
    Eval(sess, num_actions, feature_sizes, domain_sizes, embedding_dims)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/parser_eval.py", line 96, in Eval
    parser.AddSaver(FLAGS.slim_model)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/__main__/syntaxnet/graph_builder.py", line 568, in AddSaver
    self.saver = tf.train.Saver(variables_to_save)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/training/saver.py", line 845, in __init__
    restore_sequentially=restore_sequentially)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/training/saver.py", line 515, in build
    filename_tensor, vars_to_save, restore_sequentially, reshape)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/training/saver.py", line 281, in _AddRestoreOps
    validate_shape=validate_shape))
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/ops/gen_state_ops.py", line 45, in assign
    use_locking=use_locking, name=name)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/ops/op_def_library.py", line 693, in apply_op
    op_def=op_def)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/framework/ops.py", line 2186, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/andy/Downloads/syntaxnet/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/tf/tensorflow/python/framework/ops.py", line 1170, in __init__
    self._traceback = _extract_stack()


dsindex commented Jun 26, 2016

@xtknight

Before executing train_sejong.sh:

  • place your large corpus in sejong/
12:14 $ ls sejong/sejong*
sejong/sejong_treebank.sample  sejong/sejong_treebank.txt.v1
  • sejong_treebank.txt.v1 has the same format as sejong_treebank.sample
  • and then
$ cd sejong
$ ./split.sh -v -v
$ ls wdir/sejong_treebank.txt.v1.*
wdir/sejong_treebank.txt.v1.test  wdir/sejong_treebank.txt.v1.training  wdir/sejong_treebank.txt.v1.tuning
$ ./c2d.sh -v -v
$ ls wdir/deptree.txt.v3*
wdir/deptree.txt.v3.test  wdir/deptree.txt.v3.training  wdir/deptree.txt.v3.tuning
  • check your context.pbtxt_p
$ cat sejong/context.pbtxt_p
...
input {
  name: 'tagged-training-corpus'
  record_format: 'conll-sentence'
  Part {
    file_pattern: 'work/sejong/wdir/deptree.txt.v3.training'
  }
}
input {
  name: 'tagged-tuning-corpus'
  record_format: 'conll-sentence'
  Part {
    file_pattern: 'work/sejong/wdir/deptree.txt.v3.tuning'
  }
}
input {
  name: 'tagged-test-corpus'
  record_format: 'conll-sentence'
  Part {
    file_pattern: 'work/sejong/wdir/deptree.txt.v3.test'
  }
}
...


xtknight commented Jun 26, 2016

Thanks for your quick response!
I was able to train it yesterday after copying some files around again (making the v1.training, v1.tuning, and v1.test files). I also realized my full corpus file was missing LF (line feeds) between training data entries, so I fixed that and it worked. That also fixed split.sh, so now I don't have to manually create the training, tuning, and test files.

When I use SyntaxNet for English, the result is a root word. In the training data the root of each training sentence is clearly specified. However, in the training data for the Korean Sejong Treebank text I don't see how the root is specified. How does that work?

English SyntaxNet training data (the root is clearly indicated on node 3)

1   Al  Al  PROPN   NNP Number=Sing 3   name    _   SpaceAfter=No
2   -   -   PUNCT   HYPH    _   3   punct   _   SpaceAfter=No
3   Zaman   Zaman   PROPN   NNP Number=Sing 0   **root**    _   _
4   :   :   PUNCT   :   _   3   punct   _   _
5   American    american    ADJ JJ  Degree=Pos  6   amod    _   _
6   forces  force   NOUN    NNS Number=Plur 7   nsubj   _   _
7   killed  kill    VERB    VBD Mood=Ind|Tense=Past|VerbForm=Fin    3   parataxis   _   _
8   Shaikh  Shaikh  PROPN   NNP Number=Sing 12  name    _   _
9   Abdullah    Abdullah    PROPN   NNP Number=Sing 12  name    _   _
10  al  al  PROPN   NNP Number=Sing 12  name    _   SpaceAfter=No
11  -   -   PUNCT   HYPH    _   12  punct   _   SpaceAfter=No
12  Ani Ani PROPN   NNP Number=Sing 7   dobj    _   SpaceAfter=No
13  ,   ,   PUNCT   ,   _   12  punct   _   _
14  the the DET DT  Definite=Def|PronType=Art   15  det _   _

Where is the root indicated in the training data here, and what is the difference between the v2 and v3 deptree files? I guess v3 goes into SyntaxNet for training, but I'm not sure about v2, and your sample output seems closer to v2 (10 nodes).

deptree.txt.v2.training

1       그      그/MM   DP      2
2       날      날/NNG  NP_AJT  10
3       지회에서는      지회/NNG + 에서/JKB + 는/JX     NP_AJT  10
4       해직교사        해직/NNG + 교사/NNG     NP      5
5       비상대책회의인가        비상/NNG + 대책/NNG + 회의/NNG + 이/VCP + ᆫ가/EC        VNP     6
6       하는    하/VV + 는/ETM  VP_MOD  9
7       조금은  조금/NNG + 은/JX        NP_AJT  8
8       살벌한  살벌/NNG + 하/XSA + ᆫ/ETM       VP_MOD  9
9       회의가  회의/NNG + 가/JKS       NP_SBJ  10
10      열렸다. 열리/VV + 었/EP + 다/EF + ./SF  VP      0

deptree.txt.v3.training

1       그      그      MM      MM      _       2       DP      _       _
2       날      날      NNG     NNG     _       22      NP_AJT  _       _
3       지회    지회    NNG     NNG     _       4       MOD     _       _
4       에서    에서    JKB     JKB     _       5       MOD     _       _
5       는      는      JX      JX      _       22      NP_AJT  _       _
6       해직    해직    NNG     NNG     _       7       MOD     _       _
7       교사    교사    NNG     NNG     _       8       NP      _       _
8       비상    비상    NNG     NNG     _       9       MOD     _       _
9       대책    대책    NNG     NNG     _       10      MOD     _       _
10      회의    회의    NNG     NNG     _       11      MOD     _       _
11      이      이      VCP     VCP     _       12      MOD     _       _
12      ᆫ가     ᆫ가     EC      EC      _       13      VNP     _       _
13      하      하      VV      VV      _       14      MOD     _       _
14      는      는      ETM     ETM     _       20      VP_MOD  _       _
15      조금    조금    NNG     NNG     _       16      MOD     _       _
16      은      은      JX      JX      _       17      NP_AJT  _       _
17      살벌    살벌    NNG     NNG     _       18      MOD     _       _
18      하      하      XSA     XSA     _       19      MOD     _       _
19      ᆫ       ᆫ       ETM     ETM     _       20      VP_MOD  _       _
20      회의    회의    NNG     NNG     _       21      MOD     _       _
21      가      가      JKS     JKS     _       22      NP_SBJ  _       _
22      열리    열리    VV      VV      _       23      MOD     _       _
23      었      었      EP      EP      _       24      MOD     _       _
24      다      다      EF      EF      _       25      MOD     _       _
25      .       .       SF      SF      _       0       VP      _       _

sample output

10  열렸다. 열리/VV + 었/EP + 다/EF + ./SF  VP  0   ROOT    0   SUCCESS

How is 열렸다 chosen as the root when SyntaxNet trains on deptree v3? Is VV just automatically considered the root, or is there some other mechanism?

Thanks for your hard work,
Andrew


dsindex commented Jun 27, 2016

@xtknight

The original Sejong constituent trees are filtered by tree2con() in c2d.py:
: sejong_treebank.txt.v1.* -> sejong_treebank.txt.v2.*

The filtered constituent trees are converted to dependency trees by tree2dep() in c2d.py:
: sejong_treebank.txt.v2.* -> deptree.txt.v2.*

Then align.py converts the eoj-based dependency trees into morph-based dependency trees:
: deptree.txt.v2.* -> deptree.txt.v3.*

Now we have the training data (deptree.txt.v3.*) for SyntaxNet.

I take your question to be 'where does ROOT come from?' The answer is the tree2dep() function.
There are a lot of rules in tree2dep(), but the most important one is 'head final'.
Because Korean is a head-final language, we can choose the governor (head) of a word to be the rightmost child of its parent, where the parent must have a right child other than the word itself.

https://github.com/dsindex/blog/wiki/%5Bparsing%5D-visualizer-for-the-Sejong-Tree-Bank

; 가계부의 틀이 달라지고 있다.
(S  (NP_SBJ (NP_MOD 가계부/NNG + 의/JKG)
        (NP_SBJ 틀/NNG + 이/JKS))
    (VP (VP 달라지/VV + 고/EC)
        (VP 있/VX + 다/EF + ./SF)))

In this case, the head of '틀이' would be '있다', but check_vx_rule() determines that '있다' is an auxiliary verb, so '달라지고' is selected instead.
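
A toy sketch of the head-final idea (not the actual tree2dep() code; it ignores check_vx_rule() and the many other rules, and simply takes the rightmost leaf among the right siblings):

def rightmost_leaf(node):
    # a leaf is an eoj string like '있/VX + 다/EF + ./SF'
    if isinstance(node, str):
        return node
    label, children = node
    return rightmost_leaf(children[-1])

def head_of(parent, child_index):
    # head-final: the governor of children[child_index] is the rightmost leaf
    # among its right siblings; None means it is already the rightmost child
    label, children = parent
    right_siblings = children[child_index + 1:]
    if not right_siblings:
        return None
    return rightmost_leaf(right_siblings[-1])

s = ('S', [('NP_SBJ', ['가계부의', '틀이']),
           ('VP', ['달라지고', '있다.'])])
print(head_of(s, 0))   # -> '있다.' (check_vx_rule() then backs off to '달라지고')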


xtknight commented Jun 28, 2016

Thanks for your response! Your explanation makes sense.
It seems like I trained the model (the copy_model part of train_sejong.sh succeeded).

However, I don't understand how to use the model. test.sh requires tagger-params(?), which I don't have, and I would like to enter a sentence as with the English SyntaxNet and see a tree. I am wondering if this is possible with this treebank model. Do I need to use some other model first to get the part-of-speech tags and then plug them into this model? (How did you test your treebank model?)


dsindex commented Jun 28, 2016

@xtknight

Unfortunately, it is not possible to tag sentences without a Korean morphological analyzer.
When we want to analyze '가계부의 틀이 달라지고 있다', we need a segmentation like '가계부/NNG+의/JKG 틀/NNG+이/JKO 달라지다/VV+고/EC 있다/VX+다/EP'.

If '먹은' appears in the corpus with several possible segmentations, such as '먹다+은' and '먹+은', it can't be trained directly from the Sejong tagged corpus using SyntaxNet.

So I recommend you use an available morphological analyzer:

http://eunjeon.blogspot.sg/
http://kkma.snu.ac.kr/documents/
http://konlpy.org/ko/v0.4.3/morph/

They are worth using.
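
For example, a minimal sketch (assuming KoNLPy and its Komoran backend are installed) of getting the kind of (morpheme, tag) segmentation described above:

# -*- coding: utf-8 -*-
from konlpy.tag import Komoran

tagger = Komoran()
print(tagger.pos(u'가계부의 틀이 달라지고 있다.'))
# prints a list of (morpheme, POS tag) pairs, e.g. ('가계부', 'NNG'), ('의', 'JKG'), ...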

By the way, I assume you have the Sejong corpus, and it may differ from mine.
What is your eoj-based UAS? In my case it is around 88%.


xtknight commented Jul 1, 2016

@dsindex
Sorry for my late response, I was gone for a while. Plus, I had to train the model for a couple of days.

I got the Sejong Treebank corpus (구문분석말뭉치?) from the sejong.or.kr miscellaneous files section by combining 15 different files (BGAA0001.txt, BGAA0164.txt, ...) into a single UTF-8 file, and then I wrote a script to try to remove invalid or odd entries from the files.

I think the number you are asking for is the accuracy: 0.886990.

+ /usr/bin/python /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/eval.py -a /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/wdir/deptree.txt.v2.test -b /home/andy/Downloads/syntaxnet/models/syntaxnet/work/sejong/tmp_p/syntaxnet-output/brain_parser/structured/512,512-0.02-100-0.9/beam-parsed-test-corpus-eoj
can't compare different sentences
skip_sentences, total_sentences = 1, 3777
accuracy(UAS) = 0.886990
+ copy_model

I have used KoNLPy before. I am going to try to develop something to connect the morphological analyzer to this structured syntactic analyzer.

If I have input tagged with POS tags as you described, how do I run the model, and which command should I use to create the tree?
'가계부/NNG+의/JKG 틀/NNG+이/JKO 달라지다/VV+고/EC 있다/VX+다/EP'

I was experimenting with the following command, but I'm confused about what the input needs to be if I have '가계부/NNG+의/JKG 틀/NNG+이/JKO 달라지다/VV+고/EC 있다/VX+다/EP'

${BINDIR}/parser_eval \
--task_context=${TMP_DIR}/brain_parser/structured/${GP_PARAMS}/context \
--batch_size=${BATCH_SIZE} \
--hidden_layer_sizes=${HIDDEN_LAYER_SIZES} \
--beam_size=${BEAM_SIZE} \
--input=stdin \
--output=beam-parsed-${SET}-corpus \
--arg_prefix=brain_parser \
--graph_builder=structured \
--model_path=${TMP_DIR}/brain_parser/structured/${GP_PARAMS}/model \
--alsologtostderr

From analyzing English SyntaxNet, I thought I should input something like this:

1   프랑스   프랑스   NNP NNP _   0       _   _
2   의 의 JKG JKG _   0       _   _
3   세계  세계  NNG NNG _   0       _   _
4   적 적 XSN XSN _   0       _   _

But then, the resulting beam-parsed-test-corpus has some weird output like:

1       1       프랑스  프랑스  NNP     NNP     _       0               _       _       _                       _       0       ROOT    _       _

1       2       의      의      JKG     JKG     _       0               _       _       _                       _       0       ROOT    _       _

1       3       세계    세계    NNG     NNG     _       0               _       _       _                       _       0       ROOT    _       _

1       4       적      적      XSN     XSN     _       0               _       _       _                       _       0       ROOT    _       _


dsindex commented Jul 4, 2016

@xtknight

I have made a modification for testing the Korean parser.
See https://github.com/dsindex/syntaxnet

If you format your input like sejong/tagged_input.sample, it will work fine.

And for convenience, using the Komoran tagger in KoNLPy, you can input a raw sentence:

$ echo "나는 학교에 간다." | python sejong/tagger.py | ./test_sejong.sh
Input: 나 는 학교 에 가 ㄴ다 .
Parse:
. SF ROOT
 +-- ㄴ다 EF MOD
     +-- 가 VV MOD
         +-- 는 JX NP_SBJ
         |   +-- 나 NP MOD
         +-- 에 JKB NP_AJT
             +-- 학교 NNG MOD


xtknight commented Jul 4, 2016

Thanks! I eventually figured this out. And I ended up using Komoran too.

I actually have another question for you. After running on a custom corpus, I made a program to recombine the parts of each eojeol, but as you can see, there is a problem with the word "위해" --> (위하아 VP). Komoran splits it into:

10  위하  _   VV  VV  _   11  MOD _   _
11  아 _   EC  EC  _   12  VP  _   _

I can't figure out why it wants to do this. 위해 may be 위하+ㅏ or 위하+ㅓ, but python-jamo doesn't want to produce 해 anyway. All I want to do is change this into "위해" in a consistent way.
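
Just to illustrate what I mean by "a consistent way": a toy sketch with a hand-written contraction table (it only knows the 하 + 아/여 case, everything else is plain concatenation):

# -*- coding: utf-8 -*-
CONTRACTIONS = {
    (u'하', u'아'): u'해',
    (u'하', u'여'): u'해',
}

def join_morphs(stem, ending):
    # restore the contracted surface form when joining a stem and an ending
    key = (stem[-1:], ending[:1])
    if key in CONTRACTIONS:
        return stem[:-1] + CONTRACTIONS[key] + ending[1:]
    return stem + ending

print(join_morphs(u'위하', u'아'))   # -> 위해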


I made a script to regenerate the tree based on the parts. I think you regenerated the parts based on the corpus, but because this is a new sentence I wasn't able to do that; I had to recombine the output from the POS tagger tree.

$ echo "수영할 때 눈을 보호하기 위해 쓰는 물안경은 렌즈의 굴절력과 고무 밴드의 내구성이 제품 선택의 포인트다." | ./demo.sh

Input: 수영할 때 눈을 보호하기 위하아 쓰는 물안경은 렌즈의 굴절력과 고무 밴드의 내구성이 제품 선택의 포인트다.
Parse:
포인트다.  ROOT
 +-- 물안경은  NP_SBJ
 |   +-- 쓰는  VP_MOD
 |       +-- 때  NP_AJT
 |       |   +-- 수영할  VP_MOD
 |       +-- 눈을  NP_OBJ
 |       +-- 위하아  VP
 |           +-- 보호하기  VP_OBJ
 +-- 내구성이  NP_SBJ
 |   +-- 굴절력과  NP_CNJ
 |   |   +-- 렌즈의  NP_MOD
 |   +-- 고무 밴드의  NP_MOD
 +-- 선택의  NP_MOD
     +-- 제품  NP

The intermediate step was this:

1   수영  _   NNG NNG _   2   MOD _   _
2   하 _   XSV XSV _   3   MOD _   _
3   ㄹ _   ETM ETM _   4   VP_MOD  _   _
4   때 _   NNG NNG _   12  NP_AJT  _   _
5   눈 _   NNG NNG _   6   MOD _   _
6   을 _   JKO JKO _   12  NP_OBJ  _   _
7   보호  _   NNG NNG _   8   MOD _   _
8   하 _   XSV XSV _   9   MOD _   _
9   기 _   ETN ETN _   10  VP_OBJ  _   _
10  위하  _   VV  VV  _   11  MOD _   _
11  아 _   EC  EC  _   12  VP  _   _
12  쓰 _   VV  VV  _   13  MOD _   _
13  는 _   ETM ETM _   14  VP_MOD  _   _
14  물안경   _   NNP NNP _   15  MOD _   _
15  은 _   JX  JX  _   29  NP_SBJ  _   _
16  렌즈  _   NNG NNG _   17  MOD _   _
17  의 _   JKG JKG _   18  NP_MOD  _   _
18  굴절  _   NNG NNG _   19  MOD _   _
19  력 _   NNG NNG _   20  MOD _   _
20  과 _   JC  JC  _   23  NP_CNJ  _   _
21  고무 밴드   _   NNP NNP _   22  MOD _   _
22  의 _   JKG JKG _   23  NP_MOD  _   _
23  내구  _   NNG NNG _   24  MOD _   _
24  성 _   XSN XSN _   25  MOD _   _
25  이 _   JKS JKS _   29  NP_SBJ  _   _
26  제품  _   NNG NNG _   27  NP  _   _
27  선택  _   NNG NNG _   28  MOD _   _
28  의 _   JKG JKG _   29  NP_MOD  _   _
29  포인트   _   NNG NNG _   30  MOD _   _
30  다 _   JX  JX  _   31  MOD _   _
31  .   _   SF  SF  _   0   ROOT    _   _

which gets converted to:

1   수영할   수영/NNG + 하/XSV + ㄹ/ETM  _   _   _   2   VP_MOD  _
2   때 때/NNG _   _   _   6   NP_AJT  _
3   눈을  눈/NNG + 을/JKO   _   _   _   6   NP_OBJ  _
4   보호하기    보호/NNG + 하/XSV + 기/ETN  _   _   _   5   VP_OBJ  _
5   위하아   위하/VV + 아/EC  _   _   _   6   VP  _
6   쓰는  쓰/VV + 는/ETM    _   _   _   7   VP_MOD  _
7   물안경은    물안경/NNP + 은/JX  _   _   _   14  NP_SBJ  _
8   렌즈의   렌즈/NNG + 의/JKG    _   _   _   9   NP_MOD  _
9   굴절력과    굴절/NNG + 력/NNG + 과/JC   _   _   _   11  NP_CNJ  _
10  고무 밴드의    고무 밴드/NNP + 의/JKG _   _   _   11  NP_MOD  _
11  내구성이    내구/NNG + 성/XSN + 이/JKS  _   _   _   14  NP_SBJ  _
12  제품  제품/NNG  _   _   _   13  NP  _
13  선택의   선택/NNG + 의/JKG    _   _   _   14  NP_MOD  _
14  포인트다.   포인트/NNG + 다/JX + ./SF   _   _   _   0   ROOT    _
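
A rough sketch of that conversion (a heuristic illustration, not the repo's script): treat a run of rows labelled MOD that each point at the next morpheme as one eojeol, take the head and label of its last morpheme, and remap morpheme-level heads to eojeol indices. Surface forms are joined naively here, so contractions such as 위하 + 아 -> 위해 are not handled, and the grouping breaks when the parser mislabels an inner morpheme.

def group_eojeols(rows):
    # rows: list of (idx, form, tag, head, deprel) with 1-based morpheme indices
    eojeols, current = [], []
    for idx, form, tag, head, deprel in rows:
        current.append((idx, form, tag, head, deprel))
        if not (deprel == "MOD" and head == idx + 1):   # last morpheme of this eojeol
            eojeols.append(current)
            current = []
    # map every morpheme index to the eojeol (1-based) that contains it
    morph_to_eoj = {m[0]: n for n, eoj in enumerate(eojeols, 1) for m in eoj}
    out = []
    for n, eoj in enumerate(eojeols, 1):
        form = "".join(m[1] for m in eoj)
        analyzed = " + ".join("%s/%s" % (m[1], m[2]) for m in eoj)
        _, _, _, head, deprel = eoj[-1]                 # head/label of the last morpheme
        out.append((n, form, analyzed, morph_to_eoj.get(head, 0), deprel))
    return out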

@dsindex
Owner

dsindex commented Jul 5, 2016

@xtknight
I think we need to know the eojeol (eoj) index of the input sentence, in addition to the (morph, tag) pairs from the Komoran tagger, in order to combine '위하' + '어' into '위해'. But that does not seem to be supported yet
(at least I couldn't find it).

There are also rules for combining a root with its functional word (eomi), but those are difficult to implement,
so I am not willing to recommend them to you.

@xtknight
Author

xtknight commented Jul 5, 2016

Yes, I tried to combine them, but it's not working great. At komoran-2.0-master/KOMORAN_2.0_beta/corpus_build/dic.irregular there is a list of irregular forms. It fixed "위하+아" but it doesn't handle inflection(conjugation) of verbs. I've dealt with conjugation before so I can probably figure it out, but I'm curious why the original information about the complete word is destroyed.

What about training the model on the 어절 instead of all the individual parts? I mean training the model based on "선택의"(NP_MOD) instead of "선택/NNG + 의/JKG", for example. Is there a good reason to separate them? I am curious why they decided to separate them all in the corpus.

I made a sample here that attempts recombination. But it fails sometimes because of the eoj problems or disagreement between Komoran POS tagger and Sejong corpus.
http://ec2-52-78-70-112.ap-northeast-2.compute.amazonaws.com/syntaxnet/syntaxnet.htm
"수영할 때 눈을 보호하기 위해 쓰는 물안경은 렌즈의 굴절력과 고무 밴드의 내구성이 제품 선택의 포인트다."

My next project is to hack the Komoran tagger to return the eoj index.
Apparently the latest version of Komoran is 2.4 (2014.11.24), but I'm not sure whether they provide the source code for it: http://shineware.tistory.com/entry/KOMORAN-ver-24
The latest commit on their GitHub appears to be from 2014.12.25, so I guess it must be up to date?

@dsindex
Owner

dsindex commented Jul 5, 2016

@xtknight
If you train the model on eoj forms, it will suffer from a data sparseness problem, so I would not recommend it.
By the way, please let me know if you manage to hack Komoran to return the eoj index.
I use our in-house POS tagger, which supports every kind of information, including eoj index, eoj offset/length, morph offset/length (limited), and so on,
so I hope the same can be done with Komoran or other taggers.

@xtknight
Author

xtknight commented Jul 6, 2016

Okay, that makes sense.

There is another thing I am curious about. When eval.py runs it does not consider the phrase structure tag 'ptst' (like NP or NP_OBJ or VP) as part of the matching criteria. It seems like only seq, analyzed, and gov are important.

(sejong/eval.py)

def compare(entry_a, entry_b) :
    '''
    -1 : not comparable
    0  : different
    1  : same

    entry : [seq, eoj, analyzed, ptst, gov]
    '''
    if entry_a[0] != entry_b[0] : return -1
    if entry_a[2].replace(' ','') != entry_b[2].replace(' ','') : return -1
    if entry_a[4] == entry_b[4] : return 1
    return 0

What is the reason for this? Is the ptst tag returned unimportant?

Because when I run the model on the following sentence I'm having trouble with the ptst tag:

서울에 4일 밤부터 100㎜ 넘는 폭우가 쏟아져 곳곳에서 도로함몰, 교통사고 등 비 피해가 속출했다.

...
17  도로  _   NNG NNG _   18  NP  _   _
18  함몰  _   NNG NNG _   19  MOD _   _
19  ,   _   SP  SP  _   20  NP_CNJ  _   _
...

I get this output from the model, but 도로 is marked as NP. But it seems like it should be MOD. It enters the model as NNG,NNG,SP.
According to Komoran it is marked like this:
Eojeol_parts[9]: 도로함몰, => [('도로', 'NNG'), ('함몰', 'NNG'), (',', 'SP')]

For my tree generation I expected that everything in a tagged Eojeol would be marked as MOD except for the last part (for example, MOD,MOD,NP_CNJ for 도로함몰). But I don't know why 도로 is being separated as NP. I wasn't aware the model would reclassify those parts.

@dsindex
Owner

dsindex commented Jul 7, 2016

@xtknight

The labeled attachment score (LAS) is less important than the unlabeled attachment score (UAS) for practical purposes, so eval.py only checks whether each token's governor is correct or not. You can also modify it to measure LAS (taking ptst into account), as sketched below.

As you may know, Korean is flexible about spacing within compound nouns; for example, both '도로함몰' and '도로 함몰' are acceptable. That is why the trained model often mis-classifies 'MOD' as 'NP'. But I think that is not important: if we get the eoj index from the POS tagger, we can correct the classification.
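
A minimal LAS-style variant of the compare() shown above (a modification for illustration, not part of the repo's eval.py) would also require the label to match:

def compare_las(entry_a, entry_b):
    '''
    -1 : not comparable
    0  : different
    1  : same

    entry : [seq, eoj, analyzed, ptst, gov]
    '''
    if entry_a[0] != entry_b[0]: return -1
    if entry_a[2].replace(' ', '') != entry_b[2].replace(' ', ''): return -1
    if entry_a[4] == entry_b[4] and entry_a[3] == entry_b[3]: return 1   # governor and ptst both match
    return 0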

@xtknight
Author

xtknight commented Jul 7, 2016

Yes, compound nouns seem to be the primary issue I run into. So, if that compound noun's first piece (도로) gets classified as NP, we can just change it to MOD according to the Komoran POS tagger Eojeol group and everything should be fine.

I wonder if this misclassification can be prevented while training the model if we use the POS Eojeol data.

But what happens if 도로's HEAD value is also classified wrong? Then the tree would be broken as the compound noun's pieces would be far away in the tree? Should I fix the HEAD and DEPREL values based on the POS tagger (modify HEAD=18, DEPREL=MOD)? I'm not sure how I should solve the problem when it happens even if I have the eoj index. And I am curious what happens if multiple parts of an Eojeol get misclassified or something that is not a compound noun...

(Possible misclassification example)

...
17  도로  _   NNG NNG _   2  NP  _   _
18  함몰  _   NNG NNG _   19  MOD _   _
19  ,   _   SP  SP  _   20  NP_CNJ  _   _
...

I have updated my online example to show logs and also the PSG tree.
http://ec2-52-78-70-112.ap-northeast-2.compute.amazonaws.com/syntaxnet/psg_tree.htm

Although I don't fully understand everything yet, I appreciate your explanations and hope there is some way for me to contribute to your project, especially for the Korean language related tagger and parsers.

@dsindex
Owner

dsindex commented Jul 8, 2016

@xtknight
I wonder if this misclassification can be prevented while training the model if we use the POS Eojeol data.

: I'm not sure :) but since SyntaxNet uses NN classifiers rather than rules, we cannot drive accuracy to 100% for a label like 'MOD'.

But what happens if 도로's HEAD value is also classified wrong? Then the tree would be broken as the compound noun's pieces would be far away in the tree?

: As you mentioned, you should fix the inner-eoj relations ('MOD') based on the eoj index. Only the governor of the last morph in an eoj is our concern, and that classification is solely the parser's (SyntaxNet's) responsibility.
: Moreover, Korean is a head-final language, so a morpheme's governor can never be linked backward. You may use this rule to fix the parser's results, as in the sketch below.
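
A rough sketch of this post-processing rule (an illustration only, assuming the POS tagger can report eojeol spans): force every non-final morpheme of an eojeol to depend on the next morpheme with label MOD, keep the parser's decision only for the eojeol-final morpheme, and flag any eojeol-final head that points backwards, which a head-final language rules out.

def fix_intra_eojeol(rows, eoj_spans):
    # rows: list of [idx, form, tag, head, deprel]
    # eoj_spans: (start, end) morpheme index pairs per eojeol, 1-based inclusive
    fixed = [list(r) for r in rows]
    for start, end in eoj_spans:
        for idx in range(start, end):          # every morpheme except the last one
            fixed[idx - 1][3] = idx + 1        # head = next morpheme in the same eojeol
            fixed[idx - 1][4] = "MOD"
        head = fixed[end - 1][3]
        if head != 0 and head <= end:          # backward head: impossible if head-final
            print("suspicious head %d for eojeol-final morpheme %d" % (head, end))
    return fixed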

Although I don't fully understand everything yet, I appreciate your explanations and hope there is some way for me to contribute to your project, especially for the Korean language related tagger and parsers.

: No problem :)
: I recently made a plan to launch a web server just like your 'psg_tree.htm',
: but `bazel-bin/syntaxnet/parser_eval` and its API are not that user-friendly.
: Anyway, I'd like to analyze parser_eval and parser_eval.py to get something like the style below:

model = initialize(context_path)
while 1 :
    ...
    analyzed_result = analyze(model, sentence)
    ...
model.finalize()

: It would be very welcome if you did it :)

By the way, would you mind if I add your `psg_tree.htm` to README.md?

@xtknight
Author

xtknight commented Jul 8, 2016

I am still trying to fix the tree issues via the POS tagger. I have made a hack to Komoran to always split words with spaces so that I can match the eoj index to the original sentence, but I decided it was probably the wrong way to fix it. So now I am trying to properly return the eoj index.

For parser_eval....
Do you mean that you want a SyntaxNet model to be callable via a Python API like your code? Yes I agree, the way to call the model is not very intuitive in terms of programming. It seems like output is going everywhere. It would be very nice if it was that easy but I wonder if it would require a lot of core code modification. Maybe I can make a messy hack to make it possible. Oh, just the thought of looking at that code.......... :) I also should probably investigate this for my project, anyway. The whole SyntaxNet project is really like a maze.

Sure, you can add a link to my psg_tree.htm (I will still be fixing some of the errors using POS tagger eoj). Basically it's just running the Python to run the demo.sh script on the backend and returns the tree with JSON along with the error log.

@xtknight
Author

xtknight commented Jul 24, 2016

I added some features and also the link of the PSG tree got changed due to my EC2 server crashing.
If you update to this link it will be fine.

http://sejongpsg.ddns.net/syntaxnet/psg_tree.htm

I tried to investigate implementing what you were talking about with parser_eval, but unfortunately I have no idea where to begin.

@dsindex
Owner

dsindex commented Jul 24, 2016

@xtknight
The page looks good! :) I've updated the link.

I can print out 'python_path' and 'program, args' in models/syntaxnet/bazel-bin/syntaxnet/parser_eval:

  try:
    sys.stdout.flush()
    sys.stderr.write('[python_path] : ' + python_path + '\n')
    sys.stderr.write('[execv()] : ' + program + ' '.join(args) +'\n')
    os.execv(program, args)
  except EnvironmentError as e:
    # This exception occurs when os.execv() fails for some reason.
    if not getattr(e, 'filename', None):
      e.filename = program  # Add info to error message
    raise

So I got the information needed for executing parser_eval.py:

  • export PYTHONPATH
$ export PYTHONPATH=/path/to/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles:/path/to/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/:/path/to/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/external/six_archive:/path/to/models/syntaxnet/bazel-bin/syntaxnet/parser_eval.runfiles/external/tf
  • execute parser_eval.py with batch_size=1
$ echo "나는 학교에 간다." | python work/sejong/tagger.py | python bazel-bin/syntaxnet/parser_eval.runfiles/syntaxnet/parser_eval.py --input=stdin-conll --output=stdout-conll --batch_size=1 --hidden_layer_sizes=512,512 --beam_size=16 --arg_prefix=brain_parser --graph_builder=structured --task_context=/path/to/models/syntaxnet/work/models_sejong/context.pbtxt_p --model_path=/path/to/models/syntaxnet/work/models_sejong/parser-params

...

1   나 _   NP  NP  _   2   MOD _   _
2   는 _   JX  JX  _   5   NP_SBJ  _   _
3   학교  _   NNG NNG _   4   MOD _   _
4   에 _   JKB JKB _   5   NP_AJT  _   _
5   가 _   VV  VV  _   6   MOD _   _
6   ㄴ다  _   EF  EF  _   7   MOD _   _
7   .   _   SF  SF  _   0   ROOT    _   _

and i thought i could modify 'parser_eval.py'.

def Eval(sess, num_actions, feature_sizes, domain_sizes, embedding_dims):
...
  parser.AddSaver(FLAGS.slim_model)
  sess.run(parser.inits.values())
  parser.saver.restore(sess, FLAGS.model_path)
  # -------> initialization model end

  # read conll input from stdin
  # can we change below logic to while loop style?

  sink_documents = tf.placeholder(tf.string)
  sink = gen_parser_ops.document_sink(sink_documents,
                                      task_context=FLAGS.task_context,
                                      corpus_name=FLAGS.output)
  t = time.time()
  num_epochs = None
  num_tokens = 0
  num_correct = 0
  num_documents = 0
  while True:
    tf_eval_epochs, tf_eval_metrics, tf_documents = sess.run([
        parser.evaluation['epochs'],
        parser.evaluation['eval_metrics'],
        parser.evaluation['documents'],
    ])

    if len(tf_documents):
      logging.info('Processed %d documents', len(tf_documents))
      num_documents += len(tf_documents)
      sess.run(sink, feed_dict={sink_documents: tf_documents})

    num_tokens += tf_eval_metrics[0]
    num_correct += tf_eval_metrics[1]
    if num_epochs is None:
      num_epochs = tf_eval_epochs
    elif num_epochs < tf_eval_epochs:
      break
    ...

def main(unused_argv):
  logging.set_verbosity(logging.INFO)
  with tf.Session() as sess:
    feature_sizes, domain_sizes, embedding_dims, num_actions = sess.run(
        gen_parser_ops.feature_size(task_context=FLAGS.task_context,
                                    arg_prefix=FLAGS.arg_prefix))

  with tf.Session() as sess:
    Eval(sess, num_actions, feature_sizes, domain_sizes, embedding_dims)


if __name__ == '__main__':
  tf.app.run()

but not tested yet ;;)

@dsindex
Owner

dsindex commented Jul 24, 2016

@xtknight
i found this syntaxnet pr for tensorflow serving.
tensorflow/models#250

@dsindex
Owner

dsindex commented Jul 24, 2016

@xtknight
Author

xtknight commented Jul 25, 2016

Oh, thank you for the links. I'm going to try the PR #250 one first and see how easy it is to use.

On a side note, do you have any idea how to setup CUDA with SyntaxNet, or do you have it working? Mine seems to not work at all when using SyntaxNet, yet other TensorFlow examples work fine with CUDA. I don't know why that is. I guess using GPU could make not only training faster but evaluation as well?

@dsindex
Owner

dsindex commented Jul 25, 2016

@xtknight

i checked GPU allocation for 'parser_eval.py' :

def main(unused_argv):
  logging.set_verbosity(logging.INFO)
  with tf.Session() as sess:
    feature_sizes, domain_sizes, embedding_dims, num_actions = sess.run(
        gen_parser_ops.feature_size(task_context=FLAGS.task_context,
                                    arg_prefix=FLAGS.arg_prefix))
  config=tf.ConfigProto(log_device_placement=True)
  with tf.Session(config=config) as sess:
    Eval(sess, num_actions, feature_sizes, domain_sizes, embedding_dims)
...
I external/tf/tensorflow/core/common_runtime/simple_placer.cc:665] evaluation/logits/MatMul: /job:localhost/replica:0/task:0/cpu:0
I external/tf/tensorflow/core/common_runtime/simple_placer.cc:665] evaluation/logits: /job:localhost/replica:0/task:0/cpu:0
I external/tf/tensorflow/core/common_runtime/simple_placer.cc:665] FeatureSize: /job:localhost/replica:0/task:0/cpu:0
...

it seems that parser_eval.py does not use GPU.
hmmmmm....

@xtknight
Author

Well, despite my best efforts, I keep running into weird different errors and environment problems with bazel. And it recompiles and redownloads a million packages even if I just want to compile one file. It's sometimes a different error each time because the build order is different. It's a disaster.

Otherwise, I think I'm going to try to rework the Tensorflow Serving code and make something myself. I got the Serving code for MNIST digits working fine but can't compile the parsey api...

@dsindex
Owner

dsindex commented Jul 26, 2016

@xtknight

I agree with you. I haven't investigated the code deeply, but I expect that we will be able to merge the code below :)

@xtknight
Author

@dsindex
dmansfield/parsey-mcparseface-api#1

I finally had some luck getting parsey_api to compile thanks to the help of the author. Now my next project is to 'export???' the Sejong model. Not sure what export is actually doing because I thought the exported model was already being used by SyntaxNet but maybe I don't understand the terminology. I'll investigate it!

@dsindex
Owner

dsindex commented Jul 28, 2016

@xtknight good job!

You may refer to this (I wrote it ;;):
https://github.com/dsindex/tensorflow#tensorflow-serving

I think we need to modify parser_eval.py to export the session and model just like mlp_mnist_export.py does.
So, say, a parser_export.py would have to be compiled properly using bazel.

@dsindex
Owner

dsindex commented Jul 28, 2016

Now I see that dmansfield has already done this, exporting the model via parsey_mcparseface.py.

And now
my concerns are:

  • parsey_api's input is the raw string itself.
  • but we need to feed a token array (sentence.proto) to parsey_api, because we don't have a Korean tagger in SyntaxNet.
  • since I want to generate a Python client for parsey_api, importing it into another Python server program will be very tricky.

I think syntaxnet/tensorflow_serving are such a mess ;; I am not sure whether or not I will end up hating a product that uses bazel :)

@dsindex
Owner

dsindex commented Jul 28, 2016

@xtknight @dmansfield (https://github.com/dmansfield)

Thanks to your instructions and dmansfield's work,
I got parsey_mcparseface.py working to export the model,
and created parsey_client.py.

Detailed instructions:
https://github.com/dsindex/syntaxnet/blob/master/README_api.md

- export
$ bazel-bin/tensorflow_serving/example/parsey_mcparseface --model_dir=syntaxnet/models/parsey_mcparseface --export_path=exported

- python client
$ bazel-bin/tensorflow_serving/example/parsey_client --server=localhost:9000

But as I mentioned before, we need to change the parsey_api protocol:

but we need to feed a token array (sentence.proto) to parsey_api, because we don't have a Korean tagger in SyntaxNet.

@xtknight
Author

xtknight commented Jul 28, 2016

A great success!!

Currently there is a way to get it to work without protocol changes, I think. But I don't understand the significance of the protocol. Isn't it just going to match CONLL?

andy@andy ~/Downloads/syntaxnet/parsey-mcparseface-api/parsey_client $ node ./index.js 
I0728 17:11:19.568875015   20946 ev_epoll_linux.c:84]        epoll engine will be using signal: 36
D0728 17:11:19.568936824   20946 ev_posix.c:106]             Using polling engine: epoll
{
  "result": [
    {
      "docid": "-:0",
      "text": "내 가 집 에 가 ㄴ다 .",
      "token": [
        {
          "word": "내",
          "start": 0,
          "end": 2,
          "head": 1,
          "tag": "NP",
          "category": "NP",
          "label": "MOD",
          "break_level": "SPACE_BREAK"
        },
        {
          "word": "가",
          "start": 4,
          "end": 6,
          "head": 4,
          "tag": "JKS",
          "category": "JKS",
          "label": "NP_SBJ",
          "break_level": "SPACE_BREAK"
        },
        {
          "word": "집",
          "start": 8,
          "end": 10,
          "head": 3,
          "tag": "NNG",
          "category": "NNG",
          "label": "MOD",
          "break_level": "SPACE_BREAK"
        },
        {
          "word": "에",
          "start": 12,
          "end": 14,
          "head": 4,
          "tag": "JKB",
          "category": "JKB",
          "label": "NP_AJT",
          "break_level": "SPACE_BREAK"
        },
        {
          "word": "가",
          "start": 16,
          "end": 18,
          "head": 5,
          "tag": "VV",
          "category": "VV",
          "label": "MOD",
          "break_level": "SPACE_BREAK"
        },
        {
          "word": "ㄴ다",
          "start": 20,
          "end": 25,
          "head": 6,
          "tag": "EF",
          "category": "EF",
          "label": "MOD",
          "break_level": "SPACE_BREAK"
        },
        {
          "word": ".",
          "start": 27,
          "end": 27,
          "head": -1,
          "tag": "SF",
          "category": "SF",
          "label": "ROOT",
          "break_level": "SPACE_BREAK"
        }
      ]
    }
  ]
}

Client side index.js:

var grpc = require('grpc');

var protoDescriptor = grpc.load({root: __dirname+'/api', file:'cali/nlp/parsey_api.proto'});

var service = new protoDescriptor.cali.nlp.ParseyService("127.0.0.1:9000", grpc.credentials.createInsecure());

var conllIn = '1    내 내 NP  NP  _   0   _   _   _\n\
2   가 가 JKS JKS _   0   _   _   _\n\
3   집 집 NNG NNG _   0   _   _   _\n\
4   에 에 JKB JKB _   0   _   _   _\n\
5   가 가 VV  VV  _   0   _   _   _\n\
6   ㄴ다  ㄴ다  EF  EF  _   0   _   _   _\n\
7   .   .   SF  SF  _   0   _   _   _\n';

service.parse([conllIn], function(err, response) {
    console.log(JSON.stringify(response,null,'  '));
});

parsey_mcparseface.py (changing the corpus name to stdin-conll is crucial, and the model path to latest-model). For task_context I just put 'context', not context.pbtxt. Also, I removed the brain_tagger part, which we don't have.

import os
import shutil

import tensorflow as tf

from tensorflow.python.platform import tf_logging as logging
from syntaxnet import parser_eval
from syntaxnet.ops import gen_parser_ops
from syntaxnet import structured_graph_builder
from tensorflow.contrib.session_bundle import exporter

flags = tf.app.flags
FLAGS = flags.FLAGS

flags.DEFINE_string('export_path', None, 'Path to export to instead of running the model.')

def Build(sess, document_source, FLAGS):
  """Builds a sub-network, which will be either the tagger or the parser

  Args:
    sess: tensorflow session to use
    document_source: the input of serialized document objects to process

  Flags: (taken from FLAGS argument)
    num_actions: number of possible golden actions
    feature_sizes: size of each feature vector
    domain_sizes: number of possible feature ids in each feature vector
    embedding_dims: embedding dimension for each feature group

    hidden_layer_sizes: Comma separated list of hidden layer sizes.
    arg_prefix: Prefix for context parameters.
    beam_size: Number of slots for beam parsing.
    max_steps: Max number of steps to take.
    task_context: Path to a task context with inputs and parameters for feature extractors.
    input: Name of the context input to read data from.
    graph_builder: 'greedy' or 'structured'
    batch_size: Number of sentences to process in parallel.
    slim_model: Whether to expect only averaged variables.
    model_path: Path to model parameters.

  Return:
    returns the tensor which will contain the serialized document objects.

  """
  task_context = FLAGS["task_context"]
  arg_prefix = FLAGS["arg_prefix"]
  num_actions = FLAGS["num_actions"]
  feature_sizes = FLAGS["feature_sizes"]
  domain_sizes = FLAGS["domain_sizes"]
  embedding_dims = FLAGS["embedding_dims"]
  hidden_layer_sizes = map(int, FLAGS["hidden_layer_sizes"].split(','))
  beam_size = FLAGS["beam_size"]
  max_steps = FLAGS["max_steps"]
  batch_size = FLAGS["batch_size"]
  corpus_name = FLAGS["input"]
  slim_model = FLAGS["slim_model"]
  model_path = FLAGS["model_path"]

  parser = structured_graph_builder.StructuredGraphBuilder(
        num_actions,
        feature_sizes,
        domain_sizes,
        embedding_dims,
        hidden_layer_sizes,
        gate_gradients=True,
        arg_prefix=arg_prefix,
        beam_size=beam_size,
        max_steps=max_steps)

  parser.AddEvaluation(task_context,
                       batch_size,
                       corpus_name=corpus_name,
                       evaluation_max_steps=max_steps,
               document_source=document_source)

  parser.AddSaver(slim_model)
  sess.run(parser.inits.values())
  parser.saver.restore(sess, model_path)

  return parser.evaluation['documents']

def GetFeatureSize(task_context, arg_prefix):
  with tf.variable_scope("fs_"+arg_prefix):
    with tf.Session() as sess:
      return sess.run(gen_parser_ops.feature_size(task_context=task_context,
                      arg_prefix=arg_prefix))

# export the model in various ways. this erases any previously saved model
def ExportModel(sess, model_dir, input, output, assets):
  if os.path.isdir(model_dir):
    shutil.rmtree(model_dir);

  # using TF Serving exporter to load into a TF Serving session bundle
  logging.info('Exporting trained model to %s', model_dir)
  saver = tf.train.Saver()
  model_exporter = exporter.Exporter(saver)
  signature = exporter.regression_signature(input_tensor=input,output_tensor=output)
  model_exporter.init(sess.graph.as_graph_def(),
                      default_graph_signature=signature,
                      assets_collection=assets)
  model_exporter.export(model_dir, tf.constant(1), sess)

  # using a SummaryWriter so graph can be loaded in TensorBoard
  writer = tf.train.SummaryWriter(model_dir, sess.graph)
  writer.flush()

  # exporting the graph as a text protobuf, to view the graph manually
  f1 = open(model_dir + '/graph.pbtxt', 'w+');
  print >>f1, str(tf.get_default_graph().as_graph_def())

def main(unused_argv):
  logging.set_verbosity(logging.INFO)

  model_dir = "/home/andy/Downloads/syntaxnet/try3_customSyntaxnetName/serving/sejong/brain_parser/structured/512,512-0.02-100-0.9"
  #model_dir="/home/andy/Downloads/syntaxnet/try3_customSyntaxnetName/serving/sejong"
  task_context="%s/context" % model_dir

  common_params = {
      "task_context":  task_context,
      "beam_size":     8,
      "max_steps":     1000,
      "graph_builder": "structured",
      "batch_size":    1024,
      "slim_model":    True,
      }

  model = {
        "brain_parser": {
            "arg_prefix":         "brain_parser",
            "hidden_layer_sizes": "512,512",
            # input is taken from input tensor, not from corpus
            "input":              None,
            "model_path":         "%s/latest-model" % model_dir,
            },
      }

  for prefix in ["brain_parser"]:
      model[prefix].update(common_params)
      feature_sizes, domain_sizes, embedding_dims, num_actions = GetFeatureSize(task_context, prefix)
      model[prefix].update({'feature_sizes': feature_sizes,
                               'domain_sizes': domain_sizes,
                               'embedding_dims': embedding_dims,
                               'num_actions': num_actions })

  with tf.Session() as sess:
      if FLAGS.export_path is not None:
          text_input = tf.placeholder(tf.string, [None])
      else:
          text_input = tf.constant(["parsey is the greatest"], tf.string)

      # corpus_name must be specified and valid because it indirectly informs
      # the document format ("english-text" vs "conll-sentence") used to parse
      # the input text
      document_source = gen_parser_ops.document_source(text=text_input,
                                                       task_context=task_context,
                                                       corpus_name="stdin-conll",
                                                       batch_size=common_params['batch_size'],
                                                                 documents_from_input=True)

      for prefix in ["brain_parser"]:
          with tf.variable_scope(prefix):
              if True or prefix == "brain_tagger":
                  #source = document_source.documents if prefix == "brain_tagger" else model["brain_tagger"]["documents"]
                  source = document_source.documents
                  model[prefix]["documents"] = Build(sess, source, model[prefix])

      if FLAGS.export_path is None:
          sink = gen_parser_ops.document_sink(model["brain_parser"]["documents"],
                                      task_context=task_context,
                                      corpus_name="stdout-conll")
          sess.run(sink)
      else:
          assets = []
          for model_file in os.listdir(model_dir):
              path = os.path.join(model_dir, model_file)
              if not os.path.isdir(path):
                assets.append(tf.constant(path))
          ExportModel(sess, FLAGS.export_path, text_input, model["brain_parser"]["documents"], assets)

if __name__ == '__main__':
  tf.app.run()

$ bazel-bin/tensorflow_serving/example/parsey_mcparseface --export_path=exported
$ bazel-bin/tensorflow_serving/example/parsey_api --port=9000 ./exported/00000001/

@dsindex
Owner

dsindex commented Jul 28, 2016

@xtknight

great!! a long way to here :)

@xtknight
Author

xtknight commented Jul 29, 2016

But the problem is, you're right... I think the protocol does need changing. The fields don't match what I see from my test website. I can see the CONLL output properly in parsey_api.cc, but I have no idea how it gets transformed into the JSON format... :\ Any clue what this does?

    const tensorflow::Status status1 = bundle_->session->Run(
      {{signature_.input().tensor_name(), input}}, // const std::vector< std::pair< string, Tensor > > &inputs
      {signature_.output().tensor_name()}, // const std::vector< string > &output_tensor_names
      {}, // const std::vector< string > &target_node_names
      &outputs
      );

@dsindex
Owner

dsindex commented Jul 29, 2016

@xtknight

I made a Python client program.
It uses Komoran, and protobuf_json for converting protobuf to JSON
(because Parse() returns a protobuf object, we need to convert it to JSON).

See: https://github.com/dsindex/syntaxnet/blob/master/README_api.md

And I couldn't understand exactly what you mean by the 'mismatch' between the output fields and your test website ;;)

@xtknight
Author

xtknight commented Jul 30, 2016

@dsindex
Yeah, I am stuck at the last step: figuring out the protocol and what all the fields in the current protocol mean. Somehow that C++ code I attached above magically converts CONLL into the protobuf fields, but I don't understand where that process takes place or how it works, so I'm wary (cautious) of trusting the output fields. The numbers in the output CONLL from the node client.js don't seem to match the CONLL I get from my EC2 SyntaxNet demo website.

But actually, it seems like the HEAD in the node.js is always 1 less than the HEAD shown with my normal demo.sh and website. Maybe that means it's working, but just as a 0-based index.

"내 가 집 에 가 ㄴ다 ."
HEAD for "내", "가" are 1,4 for node.js and 2,5 for demo.sh
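
A small sketch based on that observation (an assumption about the API output, not documented behaviour): the parsey_api response seems to use 0-based head indices with -1 for the root token, while CoNLL heads are 1-based with 0 for the root.

def api_head_to_conll(head):
    return 0 if head == -1 else head + 1

heads = [1, 4, 3, 4, 5, 6, -1]                  # from the node.js output above
print([api_head_to_conll(h) for h in heads])    # [2, 5, 4, 5, 6, 7, 0]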

Yes protobuf_json seems like it will be useful!

In the meantime, I have been trying to train the Korean POS tagger for SyntaxNet. I trained a little bit, but I noticed the main problem is figuring out the word components to be input into the tagger (e.g., splitting "수영할" into "수영", "하", "ㄹ"). Apparently Komoran does this entirely statistically and takes the best match across the whole sentence by probability. I didn't know even the word components were split probabilistically. But this means it will probably be difficult to train SyntaxNet for the POS tagger. What do you think?

@dsindex
Owner

dsindex commented Aug 1, 2016

@xtknight

I got it. I am also confused about where the conversion takes place ;;

As I mentioned before, training a Korean POS tagger using SyntaxNet is very tricky,
so I would not recommend it.

But there is an unconventional way to do the tagging.

In the training step:

  • convert your dependency corpus to a combinatory tag scheme, treating each combinatory tag as a single tag
1           프랑스의    NNP+JKG     NP_MOD      4
2           세계적인    NNG+XSN+VCP+ETM  VNP_MOD     4
3           의상        NNG    NP          4
4           디자이너    NNG            NP          6
5           엠마누엘    NNP            NP          6
6           웅가로가    NNP+JKS     NP_SBJ      11
7           실내        NNG    NP          8
8           장식용      NNG+XSN       NP          9
9           직물        NNG    NP          10
10          디자이너로  NNG+JKB   NP_AJT      11
11          나섰다.     VV+EP+EF+SF      VP          0
  • train via 'brain_pos'

In the evaluation step:

  • you can analyze a raw input sentence using the 'brain_pos' tagger.
in : '프랑스의 세계적인 의상 디자이너 엠마누엘 웅가로가 실내장식용 직물 디자이너로 변했다.'

out : 
프랑스의  NNP+JKG
...
웅가로가  NNP+JKS
...
실내장식용 NNG+NNG
직물  NNG
...
변했다.  VV+EP+EF+SF
  • post processing
1. if combinatory tag is single tag
   - take that tag
   - ex) '직물/NNG'

2. if combinatory tag is not single tag
   - do segmentation
   - ex) '프랑스의 NNP+JKG' -> '프랑스/NNP  의/JKG'
            '실내장식용 NNG+NNG' -> '실내/NNG 장식용/NNG'
   - this can be done by a dictionary mapping built from the tagged corpus (see the sketch after this list).
   - but for unseen 'eoj+combinatory tags', 
     - you need to implement segmentation module.
     - ex) '실내장식용 -> 실내 장식용', '변했다 -> 변하다 었 다'
     - actually, most morphological analyzers do this kind of segmentation first,
        but here it is the reverse.
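
A minimal sketch of the dictionary-mapping segmentation described above (an illustration only): map (eojeol, combinatory tag) pairs seen in the tagged corpus to their morpheme sequences, and fall back to leaving the eojeol unsplit for unseen pairs, where a real segmentation module would be needed.

SEG_DICT = {
    ("프랑스의", "NNP+JKG"): [("프랑스", "NNP"), ("의", "JKG")],
    ("실내장식용", "NNG+NNG"): [("실내", "NNG"), ("장식용", "NNG")],
}

def segment(eojeol, combi_tag):
    if "+" not in combi_tag:                    # case 1: a single tag
        return [(eojeol, combi_tag)]
    if (eojeol, combi_tag) in SEG_DICT:         # case 2: a pair seen in the corpus
        return SEG_DICT[(eojeol, combi_tag)]
    return [(eojeol, combi_tag)]                # unseen: a real segmenter is needed here

print(segment("직물", "NNG"))           # [('직물', 'NNG')]
print(segment("프랑스의", "NNP+JKG"))   # [('프랑스', 'NNP'), ('의', 'JKG')]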

But notice that a tagger following the above steps is likely to be weak on unknown eojeols (e.g., '의상디자이너', '엠마누엘웅가로가') and on eojeols with grammatical errors such as incorrect spacing.

@xtknight
Author

xtknight commented Aug 5, 2016

@dsindex
sorry for my late response. I decided POS tagging with SyntaxNet probably wasn't practical. (although I may try POS tagging based on your instructions later.)
I analyzed how Komoran worked and it's quite interesting. I guess it just builds a tree based on most common Jamo and decides based on that.

Seems like SyntaxNet hasn't considered agglutinative languages like Korean.
Especially based on this comment..

https://www.reddit.com/r/MachineLearning/comments/4j2caa/announcing_syntaxnet_the_worlds_most_accurate/d33qqg3

Do you know if SyntaxNet can perform anything beyond dependency parsing, like semantic role labeling? I haven't been able to find much information about it. And not sure where to obtain resources for semantic role labeling for Korean (that might be beyond the scope of this Issue but..)

Maybe this is a dumb question but I am curious if it is possible to convert between the Sejong corpus format and the output of the dependency parser. I was analyzing this sentence.
'엠마누엘 웅가로는 "실내 장식품을 디자인할 때 옷을 만들 때와는 다른 해방감을 느낀다"고 말한다.'

And it seems like it is organized like this in the corpus ("엠마누엘 웅가로는"):

(NP_SBJ  <----- where is this tag going??
    (NP 엠마누엘/NNP)
    (NP_SBJ 웅가로/NNP + 는/JX)
)

And then there is the example of "실내 장식품을 디자인할때"

(NP_AJT  <----- where is this tag going??
    (VP_MOD  <----- where is this tag going??
        (NP_OBJ  <----- where is this tag going??
            (NP 실내/NNG)
            (NP_OBJ 장식품/NNG + 을/JKO)
        )
        (VP_MOD 디자인/NNG + 하/XSV + ᆯ/ETM)
    )
    (NP_AJT 때/NNG)
)

But when I run the dependency parser, it seems like the tags of the leaf-less internal nodes are missing, as I labeled above. Did I do something wrong? Maybe I am just getting dizzy from looking at the tree. I notice that the leaf-less node tags NP_AJT, VP_MOD, NP_OBJ always end up on the last leaf node. Is that always the case??

Also it seems like 웅가로는 is becoming a child of 엠마누엘, instead of being on the same level. Is this intentional or is it just variation from the model?

Recombined

1   엠마누엘    엠마누엘/NNP    _   _   _   2   NP
2   웅가로는    웅가로는/NA _   _   _   3   NP
3   "실내 장식품을    "/SS + 실내 장식/NNP + 품/NNG + 을/JKO    _   _   _   4   NP_OBJ
4   디자인할    디자인/NNG + 하/XSV + ㄹ/ETM   _   _   _   5   VP_MOD
5   때 때/NNG _   _   _   7   NP_AJT
6   옷을  옷/NNG + 을/JKO   _   _   _   7   NP_OBJ
7   만들  만들/VV + ㄹ/ETM _   _   _   8   VP_MOD
8   때와는   때/NNG + 와/JKB + 는/JX  _   _   _   11  NP_AJT
9   다른  다른/MM   _   _   _   10  DP
10  해방감을    해방감/NNG + 을/JKO _   _   _   11  NP_OBJ
11  느낀다"고   느끼/VV + ㄴ다/EC + "/SS + 고/JKQ  _   _   _   12  X_CMP
12  말한다.  말/NNG + 하/XSV + ㄴ다/EF + ./SF    _   _   _   0   ROOT

Individual

1   엠마누엘    _   NNP NNP _   2   NP  _   _
2   웅가로는    _   NA  NA  _   3   NP  _   _
3   "   _   SS  SS  _   4   MOD _   _
4   실내 장식   _   NNP NNP _   5   MOD _   _
5   품 _   NNG NNG _   6   MOD _   _
6   을 _   JKO JKO _   7   NP_OBJ  _   _
7   디자인   _   NNG NNG _   8   MOD _   _
8   하 _   XSV XSV _   9   MOD _   _
9   ㄹ _   ETM ETM _   10  VP_MOD  _   _
10  때 _   NNG NNG _   13  NP_AJT  _   _
11  옷 _   NNG NNG _   12  MOD _   _
12  을 _   JKO JKO _   13  NP_OBJ  _   _
13  만들  _   VV  VV  _   14  MOD _   _
14  ㄹ _   ETM ETM _   15  VP_MOD  _   _
15  때 _   NNG NNG _   16  MOD _   _
16  와 _   JKB JKB _   17  MOD _   _
17  는 _   JX  JX  _   21  NP_AJT  _   _
18  다른  _   MM  MM  _   19  DP  _   _
19  해방감   _   NNG NNG _   20  MOD _   _
20  을 _   JKO JKO _   21  NP_OBJ  _   _
21  느끼  _   VV  VV  _   22  MOD _   _
22  ㄴ다  _   EC  EC  _   23  MOD _   _
23  "   _   SS  SS  _   24  MOD _   _
24  고 _   JKQ JKQ _   25  X_CMP   _   _
25  말 _   NNG NNG _   26  MOD _   _
26  하 _   XSV XSV _   27  MOD _   _
27  ㄴ다  _   EF  EF  _   28  MOD _   _
28  .   _   SF  SF  _   0   ROOT    _   _

Full Sejong tree

; 엠마누엘 웅가로는 "실내 장식품을 디자인할 때 옷을 만들 때와는 다른 해방감을 느낀다"고 말한다. 
(S
    (NP_SBJ
        (NP 엠마누엘/NNP)
        (NP_SBJ 웅가로/NNP + 는/JX)
    )
    (VP
        (VP_CMP
            (VP
                (L "/SS)
                (VP
                    (VP
                        (NP_AJT
                            (VP_MOD
                                (NP_OBJ
                                    (NP 실내/NNG)
                                    (NP_OBJ 장식품/NNG + 을/JKO)
                                )
                                (VP_MOD 디자인/NNG + 하/XSV + ᆯ/ETM)
                            )
                            (NP_AJT 때/NNG)
                        )
                        (VP
                            (NP_OBJ
                                (VP_MOD
                                    (NP_SBJ
                                        (VP_MOD
                                            (NP_OBJ 옷/NNG + 을/JKO)
                                            (VP_MOD 만들/VV + ᆯ/ETM)
                                        )
                                        (NP_SBJ 때/NNG + 와/JKB + 는/JX)
                                    )
                                    (VP_MOD 다르/VA + ᆫ/ETM)
                                )
                                (NP_OBJ 해방감/NNG + 을/JKO)
                            )
                            (VP 느끼/VV + ᆫ다/EC)
                        )
                    )
                    (R "/SS)
                )
            )
            (X_CMP 고/JKQ)
        )
        (VP 말/NNG + 하/XSV + ᆫ다/EF + ./SF)
    )
)

P.S. Also for VP_CMP which contains VP and X_CMP, I don't see VP_CMP in the tree.

@dsindex
Owner

dsindex commented Aug 8, 2016

  • In fact, what you are asking about is constituent parsing, or phrase structure parsing.
    If you look at http://nlp.stanford.edu:8080/parser/,
    you can see that the Stanford parser performs constituent parsing internally and
    converts to a dependency parse if the dependency option is given.
    The reverse conversion, from a dependency parse to a constituent parse, is very difficult.
    (You may refer to http://anthology.aclweb.org/H/H01/H01-1014.pdf .)
    The SyntaxNet algorithm is basically transition-based dependency parsing,
    so I think it is hard to construct a phrase structure tree from SyntaxNet's output.
    (It may somehow work by using the rules mentioned in the above paper.)
  • For semantic role labeling via SyntaxNet, we need a semantic-role-labeled corpus,
    but it seems there is no publicly available data.
    The only thing at hand is developing a semantic role labeling module using the dependency parse and the semantic information in the Sejong Dictionary.

[image: sejong_entry (an example of the semantic information for '가다' in the Sejong Dictionary)]

(reference: http://www.koreascience.or.kr/search/articlepdf_ocean.jsp?url=http://ocean.kisti.re.kr/downfile/volume/kiss/JBGHF3/2007/v34n2/JBGHF3_2007_v34n2_112.pdf&admNo=JBGHF3_2007_v34n2_112)

@xtknight
Author

I see! Thank you for the detailed information! These days I am off trying doc2vec and other technologies for Korean. I haven't really found an application for the dependency parsing itself. It seems like I need to get a licensed semantically labeled corpus through my university... but I'm not sure how that would be trained anyway.
