# COSI134 Project 4: Train a neural network parser

Due: December 18, 2018

In the fourth and final project, you are asked to train a neural network parser on the Penn Treebank data. You may use either the encoder-decoder framework or the span-scoring approach that we discussed in class. You are also free to consult any open-source code that you can find on the internet. If you choose the encoder-decoder framework, the bulk of the work will involve preparing the input data.

Like the POS tagging project, it's advisable to start small and make sure your code works on a smaller data sample. You also need to make sure that the output of your parser is well-formed, that is, it has matching parentheses, among other things. The output of your parser needs to be evaluated with a standard tool called "evalb". The original version of the software can be found here: https://nlp.cs.nyu.edu/evalb/. Java reimplementations of the software are also available in Stanford CoreNLP. The software outputs many metrics, but the main ones are labeled precision and labeled recall, which are based on counting the number of matching constituents between the gold parse tree and the system output.
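
To make the two main metrics concrete, here is a minimal sketch of labeled precision and recall over bracketed parse strings. It is only an illustration: the helper names `labeled_spans` and `labeled_prf` are hypothetical, and the sketch ignores the extra normalization that evalb applies through its parameter file, so its numbers will not necessarily match evalb's.

```python
import re
from collections import Counter

# "(" , ")" , or any run of characters that is neither whitespace nor a bracket
TOKEN = re.compile(r"\(|\)|[^\s()]+")

def labeled_spans(tree_str):
    """Return a multiset of (label, start, end) for each phrase-level constituent."""
    tokens = TOKEN.findall(tree_str)
    spans = Counter()
    stack = []       # (label, index of the first word covered)
    n_words = 0
    i = 0
    while i < len(tokens):
        if tokens[i] == '(':
            if tokens[i + 2] not in ('(', ')'):
                # preterminal "(TAG word)": consumes one word, not counted as a span
                n_words += 1
                i += 4
            else:
                stack.append((tokens[i + 1], n_words))
                i += 2
        else:
            label, start = stack.pop()
            spans[(label, start, n_words)] += 1
            i += 1
    return spans

def labeled_prf(gold_str, pred_str):
    """Labeled precision, recall and F1 for one sentence (no evalb normalization)."""
    gold, pred = labeled_spans(gold_str), labeled_spans(pred_str)
    matched = sum((gold & pred).values())
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

A constituent counts as matching only when its label and both span boundaries agree, which is why a single attachment error can cost several brackets at once.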

In your writeup:
- give a brief description of your code structure.
- give a description of your model, including input feature design, model architecture, hyperparameters you experimented with, etc.
- report experiments and results with different training data sizes: 5,000 sentences, 15,000 sentences, and all sentences.
- report your best experimental performance.
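
One possible way to set up the three training sizes is sketched below. The helper name `training_subsets` is hypothetical, and taking the first N sentences (rather than a random sample) is an assumption, since the assignment does not say how the subsets should be drawn.

```python
def training_subsets(sentences, sizes=(5000, 15000, None)):
    """Return (name, subset) pairs; a size of None means the full training set."""
    subsets = []
    for size in sizes:
        name = 'all' if size is None else str(size)
        # take the first `size` sentences (assumption: no shuffling)
        subset = sentences if size is None else sentences[:size]
        subsets.append((name, subset))
    return subsets
```

Usage would look like `for name, subset in training_subsets(list(reader_train.parsed_sents())): ...`, training and evaluating one model per subset.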

### Use nltk.corpus to access the data:


In [11]:
import nltk.corpus
reader_train = nltk.corpus.BracketParseCorpusReader(r'./train/', r'.*/wsj_.*\.mrg')
reader_dev = nltk.corpus.BracketParseCorpusReader(r'./dev/', r'.*/wsj_.*\.mrg')
reader_test = nltk.corpus.BracketParseCorpusReader(r'./test/', r'.*/wsj_.*\.mrg')

In [10]:
reader_train.sents()    # gives you plain sentences

[['In', 'an', 'Oct.', '19', 'review', 'of', '``', 'The', 'Misanthrope', "''", 'at', 'Chicago', "'s", 'Goodman', 'Theatre', '-LRB-', '``', 'Revitalized', 'Classics', 'Take', 'the', 'Stage', 'in', 'Windy', 'City', ',', "''", 'Leisure', '&', 'Arts', '-RRB-', ',', 'the', 'role', 'of', 'Celimene', ',', 'played', '*', 'by', 'Kim', 'Cattrall', ',', 'was', 'mistakenly', 'attributed', '*-2', 'to', 'Christina', 'Haag', '.'], ['Ms.', 'Haag', 'plays', 'Elianti', '.'], ...]

In [12]:
reader_train.parsed_sents()    # gives you gold parsed trees

[Tree('S', [Tree('PP-LOC', [Tree('IN', ['In']), Tree('NP', [Tree('NP', [Tree('DT', ['an']), Tree('NNP', ['Oct.']), Tree('CD', ['19']), Tree('NN', ['review'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('``', ['``']), Tree('NP-TTL', [Tree('DT', ['The']), Tree('NN', ['Misanthrope'])]), Tree("''", ["''"]), Tree('PP-LOC', [Tree('IN', ['at']), Tree('NP', [Tree('NP', [Tree('NNP', ['Chicago']), Tree('POS', ["'s"])]), Tree('NNP', ['Goodman']), Tree('NNP', ['Theatre'])])])])]), Tree('PRN', [Tree('-LRB-', ['-LRB-']), Tree('``', ['``']), Tree('S-HLN', [Tree('NP-SBJ', [Tree('VBN', ['Revitalized']), Tree('NNS', ['Classics'])]), Tree('VP', [Tree('VBP', ['Take']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['Stage'])]), Tree('PP-LOC', [Tree('IN', ['in']), Tree('NP', [Tree('NNP', ['Windy']), Tree('NNP', ['City'])])])])]), Tree(',', [',']), Tree("''", ["''"]), Tree('NP-TMP', [Tree('NN', ['Leisure']), Tree('CC', ['&']), Tree('NNS', ['Arts'])]), Tree('-RRB-', ['-RRB-'])])])]), Tree(',', [',']), 

In [13]:
str(reader_train.parsed_sents()[0])    # gives you the tree as a string. Make sure to remove the newline characters before training.

"(S\n  (PP-LOC\n    (IN In)\n    (NP\n      (NP (DT an) (NNP Oct.) (CD 19) (NN review))\n      (PP\n        (IN of)\n        (NP\n          (`` ``)\n          (NP-TTL (DT The) (NN Misanthrope))\n          ('' '')\n          (PP-LOC\n            (IN at)\n            (NP\n              (NP (NNP Chicago) (POS 's))\n              (NNP Goodman)\n              (NNP Theatre)))))\n      (PRN\n        (-LRB- -LRB-)\n        (`` ``)\n        (S-HLN\n          (NP-SBJ (VBN Revitalized) (NNS Classics))\n          (VP\n            (VBP Take)\n            (NP (DT the) (NN Stage))\n            (PP-LOC (IN in) (NP (NNP Windy) (NNP City)))))\n        (, ,)\n        ('' '')\n        (NP-TMP (NN Leisure) (CC &) (NNS Arts))\n        (-RRB- -RRB-))))\n  (, ,)\n  (NP-SBJ-2\n    (NP (NP (DT the) (NN role)) (PP (IN of) (NP (NNP Celimene))))\n    (, ,)\n    (VP\n      (VBN played)\n      (NP (-NONE- *))\n      (PP (IN by) (NP-LGS (NNP Kim) (NNP Cattrall))))\n    (, ,))\n  (VP\n    (VBD was)\n    (VP\n      (AD

In [59]:
str(reader_test.parsed_sents()[0]).replace('\n', '')

"(S  (INTJ (RB No))  (, ,)  (NP-SBJ (PRP it))  (VP (VBD was) (RB n't) (NP-PRD (NNP Black) (NNP Monday)))  (. .))"

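Removing only the newline characters leaves the indentation behind as runs of spaces, as the output above shows. A small sketch of a stricter alternative, which collapses every run of whitespace into a single space, is given here; the helper name `flatten_tree_string` is hypothetical.

```python
def flatten_tree_string(tree_str):
    """Collapse newlines and repeated spaces so the tree fits on one line."""
    return ' '.join(tree_str.split())
```

For example, `flatten_tree_string(str(reader_test.parsed_sents()[0]))` would give a one-line tree with single spaces between tokens.
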
In [78]:
def count_right_parentheses(tag):
    # number of closing brackets in the token, e.g. 'Monday)))' -> 3
    return tag.count(')')

In [80]:
count_right_parentheses('))))NP-TMP')

4

In [81]:
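# linearize a parse: drop terminal words, keep POS tags, and label each closing bracket
def preprosessing(sent):
    tag_stack = []    # opening tokens that have not been closed yet
    result = []
    for tag in sent:
        if '(' in tag:
            # opening token such as "(NP" or "(DT"
            tag_stack.append(tag)
            result.append(tag)
        else:
            # terminal word followed by one or more ")", e.g. "Monday)))"
            count = count_right_parentheses(tag)
            # the first ")" closes the POS tag: replace "(DT" with the bare tag "DT"
            result[-1] = tag_stack.pop().split('(')[-1]
            # every further ")" closes a phrase: emit a labeled closer such as ")NP"
            for _ in range(count - 1):
                result.append(')' + tag_stack.pop().split('(')[-1])
    return ' '.join(result)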

In [82]:
preprosessing(reader_test.parsed_sents()[0].__str__().replace('\n', '').split())

'(S (INTJ RB )INTJ , (NP-SBJ PRP )NP-SBJ (VP VBD RB (NP-PRD NNP NNP )NP-PRD )VP . )S'
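
For the encoder-decoder framework, each training example pairs the words of a sentence (source) with a linearized tree like the one above (target). The sketch below builds both sides in one pass over a bracketed string. It is an illustration only: `make_pair` is a hypothetical helper, and it tokenizes with a regular expression instead of splitting on whitespace, which is a different route to the same target format.

```python
import re

# "(" , ")" , or any run of characters that is neither whitespace nor a bracket
TOKEN = re.compile(r"\(|\)|[^\s()]+")

def make_pair(tree_str):
    """Return (source, target): the sentence words and the linearized tree."""
    tokens = TOKEN.findall(tree_str)
    words, target, stack = [], [], []
    i = 0
    while i < len(tokens):
        if tokens[i] == '(':
            label = tokens[i + 1]
            if tokens[i + 2] not in ('(', ')'):
                # preterminal "(TAG word)": keep the tag, move the word to the source
                target.append(label)
                words.append(tokens[i + 2])
                i += 4
            else:
                # phrase-level opener such as "(NP"
                stack.append(label)
                target.append('(' + label)
                i += 2
        else:
            # closing bracket: emit a labeled closer such as ")NP"
            target.append(')' + stack.pop())
            i += 1
    return ' '.join(words), ' '.join(target)
```

The sketch assumes trees without an empty top-level wrapper, which is how they appear in the reader output shown above.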

In [5]:
# repair unbalanced parentheses in the model output
def parentheses_match(sent):
    tag_stack = []
    result = []
    for tag in sent:
        if '(' in tag:
            tag_stack.append(tag)
            result.append(tag)
        elif ')' in tag:
            # keep a closer only if there is an open bracket left to close
            if tag_stack:
                tag_stack.pop()
                result.append(tag)
        else:
            result.append(tag)
    # close any brackets that are still open with a bare ")"
    result.extend(')' for _ in tag_stack)
    return ' '.join(result)

In [7]:
s='(S CC (SBAR-ADV IN (S (NP-SBJ DT NNP NNP NNP NNP )NP-SBJ (VP VBD RB (VP VB (PRT RP )PRT (NP-TMP NNP )NP-TMP (SBAR-TMP IN (S (NP-SBJ DT NNP NNP NNP NNP )NP-SBJ (VP VBD (NP-EXT CD NNS )NP-EXT )VP )S )SBAR-TMP )VP )VP )S )SBAR-ADV : (NP-SBJ (NP JJS )NP (PP IN (NP PRP )NP )PP (PP-TMP IN (NP DT JJ NN )NP )PP-TMP )NP-SBJ (VP VBD'
parentheses_match(s.split())

'(S CC (SBAR-ADV IN (S (NP-SBJ DT NNP NNP NNP NNP )NP-SBJ (VP VBD RB (VP VB (PRT RP )PRT (NP-TMP NNP )NP-TMP (SBAR-TMP IN (S (NP-SBJ DT NNP NNP NNP NNP )NP-SBJ (VP VBD (NP-EXT CD NNS )NP-EXT )VP )S )SBAR-TMP )VP )VP )S )SBAR-ADV : (NP-SBJ (NP JJS )NP (PP IN (NP PRP )NP )PP (PP-TMP IN (NP DT JJ NN )NP )PP-TMP )NP-SBJ (VP VBD ) )'
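
To measure how often the decoder output needs this kind of repair, a well-formedness check on the linearized tokens can help. The sketch below is a hypothetical helper (`is_well_formed`), and it is stricter than the repair function above: it also requires that a labeled closer such as `)NP` matches the label of the bracket it closes, while a bare `)` closes anything.

```python
def is_well_formed(tokens):
    """True if every opener has a matching closer and labeled closers agree."""
    stack = []
    for tok in tokens:
        if tok.startswith('('):
            stack.append(tok[1:])
        elif tok.startswith(')'):
            if not stack:
                return False          # closer without an opener
            label = stack.pop()
            if tok[1:] and tok[1:] != label:
                return False          # e.g. "(NP ... )VP"
    return not stack                  # leftover openers mean truncation
```

Counting the share of outputs for which this returns `False` before repair gives a rough picture of how reliable the decoder's bracket structure is.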

In [16]:
# when the number of words is equal to that of tags
# note: consumes terminal_words in place
def put_back_equal_words(sent, terminal_words):
    result = []
    for tag in sent:
        if '(' in tag:
            result.append(tag)
        elif ')' in tag:
            # attach the closing bracket to the previous token
            result[-1] += ')'
        else:
            # POS tag: re-attach the next word as "(TAG word)"
            result.append('(' + tag)
            result.append(terminal_words.pop(0) + ')')
    return ' '.join(result)

In [20]:
# count the number of open and close parentheses, and the number of POS tags
def count_parentheses(sent):
    left = 0
    right = 0
    pos_tag = 0
    for tag in sent:
        if '(' in tag:
            left += 1
        elif ')' in tag:
            right += 1
        else:
            pos_tag += 1
    return left, right, pos_tag

In [37]:
# when the number of words is less than that of tags
# note: consumes terminal_words in place
def put_back_less_words(sent, terminal_words):
    tag_stack = []
    result = []
    for tag in sent:
        if '(' in tag:
            tag_stack.append(tag)
            result.append(tag)
        elif ')' in tag:
            tag_stack.pop()
            result[-1] += ')'
        else:
            result.append('(' + tag)
            result.append(terminal_words.pop(0) + ')')
            if len(terminal_words) == 0:
                # out of words: close every open bracket and drop the remaining tags
                result[-1] += (')' * len(tag_stack))
                break
    return ' '.join(result)

In [61]:
# add the terminal words back into the output parse tree
def post_process(sent, terminal_words):
    new_sent = parentheses_match(sent).split()
    left, right, pos_tag = count_parentheses(new_sent)
#     print(left, right, pos_tag)
#     print("pos_tag, len of sent:", pos_tag, len(terminal_words))
    if pos_tag > len(terminal_words):
        # more predicted tags than words: stop once the words run out
        result = put_back_less_words(new_sent, terminal_words)
    elif pos_tag < len(terminal_words):
        # fewer predicted tags than words: pad with a dummy "(NNP ... )" constituent
        # so that every remaining word still receives a tag
        new_sent.append('(NNP')
        for _ in range(len(terminal_words) - pos_tag - 1):
            new_sent.append('NNP')
        new_sent.append('.')
        new_sent.append(')')
        result = put_back_equal_words(new_sent, terminal_words)
    else:
        result = put_back_equal_words(new_sent, terminal_words)
    return result

In [68]:
post_process(ssss.split(), reader_test.sents()[1])

20 20 45
pos_tag, len of sent: 45 41


"(S (CC But) (SBAR-ADV (IN while) (S (NP-SBJ (DT the) (NNP New) (NNP York) (NNP Stock) (NN Exchange) (NN did) (NN n't) (NN fall) (N apart) (N Friday) (N as) (N the) (N Dow) (n Jones) (n Industrial) (n Average) (n plunged) (n 190.58) (n points) (NNP --)) (VP (VBD most) (RB of) (VP (VB it) (PRT (RP in)) (NP-TMP (NNP the)) (SBAR-TMP (IN final) (S (NP-SBJ (DT hour) (NNP --) (NNP it) (NNP barely) (NNP managed)) (VP (VBD *-2) (NP-EXT (CD to) (NNS stay))))))))) (: this) (NP-SBJ (NP (JJS side)) (PP (IN of) (NP (PRP chaos))) (PP-TMP (IN .))))"

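Since the words are re-attached by position, it is worth checking that the leaves of each post-processed tree are exactly the words of the original sentence; evalb compares trees sentence by sentence and reports an error when the lengths differ. The sketch below is one way to do that check on a bracketed string. The helper names `leaves` and `tree_matches_sentence` are hypothetical.

```python
import re

# a preterminal "(TAG word)": capture only the word
LEAF = re.compile(r"\([^\s()]+ ([^\s()]+)\)")

def leaves(tree_str):
    """Return the terminal words of a bracketed tree, left to right."""
    return LEAF.findall(tree_str)

def tree_matches_sentence(tree_str, words):
    """True if the tree yields exactly the given word sequence."""
    return leaves(tree_str) == list(words)
```

Running this over every line before writing the evalb input file would flag any sentence where words were dropped or duplicated during post-processing.
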
In [70]:
# write the post-processed output into a file for EVALB
test_sents = reader_test.sents()
i = 0
# 'w' instead of 'a' so that re-running the cell does not append duplicate lines
with open('./output/output_attention_bilstm_test', 'r') as f_read, \
        open('evalb_output_attention_bilstm_test.txt', 'w') as f_write:
    for sent in f_read:
        # pass a copy because post_process consumes the word list
        f_write.write(post_process(sent.split(), list(test_sents[i])))
        f_write.write('\n')
        i += 1
print(i, len(test_sents))

5 5 8
pos_tag, len of sent: 8 8
20 20 30
pos_tag, len of sent: 30 41
22 22 31
pos_tag, len of sent: 31 36
22 22 28
pos_tag, len of sent: 28 37
...

In [71]:
# write the gold file for the test set for EVALB
# 'w' instead of 'a' so that re-running the cell does not append duplicate trees
with open('parsed_test.txt', 'w') as f_write:
    for sent in reader_test.parsed_sents():
        f_write.write(str(sent).replace('\n', ''))
        f_write.write('\n')
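
With a gold file and a system output file on disk, evalb can be run on the pair. The sketch below only assembles the command line and wraps it in `subprocess.run`. The location of the binary (`./EVALB/evalb`) and the helper names are assumptions; `COLLINS.prm` is the parameter file distributed with evalb, and whether to use it depends on the evaluation setup for the course.

```python
import subprocess

def evalb_command(gold_path, test_path, evalb_bin='./EVALB/evalb', param_file=None):
    """Build the argument list: evalb [-p param_file] gold-file test-file."""
    cmd = [evalb_bin]
    if param_file:
        cmd += ['-p', param_file]
    cmd += [gold_path, test_path]
    return cmd

def run_evalb(gold_path, test_path, **kwargs):
    """Run evalb and return its report as text (requires a compiled evalb binary)."""
    cmd = evalb_command(gold_path, test_path, **kwargs)
    completed = subprocess.run(cmd, capture_output=True, text=True)
    return completed.stdout
```

For example, `print(run_evalb('parsed_test.txt', 'evalb_output_attention_bilstm_test.txt', param_file='COLLINS.prm'))` would print the report for the attention model's output.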

In [104]:
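test_sents = reader_test.sents()
i = 0
# 'w' instead of 'a' so that re-running the cell does not append duplicate lines
with open('./output/output_test_uni', 'r') as f_read, \
        open('evalb_output_test_uni.txt', 'w') as f_write:
    for sent in f_read:
        # pass a copy because post_process consumes the word list
        f_write.write(post_process(sent.split(), list(test_sents[i])))
        f_write.write('\n')
        i += 1
print(i, len(test_sents))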

6 6 8
pos_tag, len of sent: 8 8
21 21 26
pos_tag, len of sent: 26 41
18 18 26
pos_tag, len of sent: 26 36
20 20 28
pos_tag, len of sent: 28 37
...