Commit
Fixed tokenization bug, which allowed whitespace inside punctuation tokens.
MihaiSurdeanu committed Feb 27, 2018
1 parent bdb2b90 commit b3aa05d
Showing 5 changed files with 231 additions and 226 deletions.
2 changes: 1 addition & 1 deletion antlr_tokenizer
@@ -5,4 +5,4 @@
 # java -jar <full path to your antlr-4.x-complete.jar file>
 #

-antlr4 main/src/main/java/org/clulab/processors/clulab/tokenizer/OpenDomainLexer.g
+antlr4 main/src/main/java/org/clulab/processors/clu/tokenizer/OpenDomainLexer.g
@@ -12,7 +12,7 @@ options {
 }

 @lexer::header {
-package org.clulab.processors.clulab.tokenizer;
+package org.clulab.processors.clu.tokenizer;
 }

 // parentheses in Treebank and OntoNotes
@@ -65,7 +65,7 @@ SMILEY: ('<'|'>')? (':'|';'|'=') ('-'|'o'|'*'|'\'')? ('('|')'|'D'|'P'|'d'|'p'|'O
 // TODO: phone numbers

 // punctuation
-EOS: PUNCTUATION (WHITESPACE? PUNCTUATION)* ;
+EOS: PUNCTUATION (PUNCTUATION)* ;

 // skip all white spaces
 WHITESPACE: ('\t'|' '|'\r'|'\n'|'\u000C'| '\u2028'|'\u2029'|'\u000B'|'\u0085'|'\u00A0'|('\u2000'..'\u200A')|'\u3000')+ -> skip ;
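
The effect of the EOS change can be approximated outside ANTLR with two regular expressions. This is only a rough sketch: the `PUNCTUATION` fragment is not shown in this diff, so the character class below is an assumed stand-in, and Python's `\s` only approximates the code points listed in the `WHITESPACE` rule. The names `PUNCT`, `WS`, `OLD_EOS`, `NEW_EOS`, and `eos_tokens` are hypothetical and do not appear in the repository.

```python
import re

# Assumption: the real PUNCTUATION fragment is not visible in this diff;
# this small character class is only an illustrative stand-in.
PUNCT = r"[.!?,;:]"
# Rough approximation of the WHITESPACE rule (not the exact code points).
WS = r"\s"

# Old rule: PUNCTUATION (WHITESPACE? PUNCTUATION)*
OLD_EOS = re.compile(rf"{PUNCT}(?:{WS}?{PUNCT})*")
# New rule: PUNCTUATION (PUNCTUATION)*
NEW_EOS = re.compile(rf"{PUNCT}(?:{PUNCT})*")

def eos_tokens(pattern, text):
    """Return every EOS-like match of `pattern` found in `text`."""
    return [m.group(0) for m in pattern.finditer(text)]

text = "Wait. ... really?!"
old = eos_tokens(OLD_EOS, text)
new = eos_tokens(NEW_EOS, text)

print(old)  # ['. ...', '?!']      whitespace ends up inside one token
print(new)  # ['.', '...', '?!']   each punctuation run is its own token
```

Under the old pattern, the period after "Wait" and the following ellipsis merge into a single token that contains a space. Under the new pattern, they stay separate.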