SVM PosTagger fails on document without recovery on error #127

Closed
benlabbe opened this issue Dec 3, 2021 · 4 comments

benlabbe commented Dec 3, 2021

Describe the bug
The SVM PosTagger sometimes fails on certain documents. The origin of the errors is not clear (either inside SVMTool or in the way we use it). The failure crashes the whole process (either analyzeText or analyzeXml).

To Reproduce
This issue is linked with #95 which describes one of the errors occasionally encountered.

More examples are needed.
@benlabbe, please upload sample XML files!

Expected behavior
Whatever the reason for the errors in the SVM PosTagger, text processing should continue without side effects on the text segments that remain to be analyzed (e.g. the following paragraphs in an XML file).

benlabbe commented Dec 9, 2021

Here are the first findings of my investigation.

  • The failure of the SVMTagger raises a LinguisticProcessingException.
  • In Release mode, this exception is supposed to be caught higher up the call stack, but it is not.
  • I found that the WITH_DEBUG_MESSAGES compile option does not behave as expected.
    • The DEBUG_LP macro, which controls whether exceptions are caught, is erroneously defined in Release mode.
  • I propose a fix in SetCompilerFlags.cmake that defines WITH_DEBUG_MESSAGES as a CMake option.
  • With this fix, the paragraphs (engText) responsible for the crashes in my input XML file are aborted and correctly closed in the .mult output file: they have no content, but readMultFile still reports some properties for these nodes.
  • The paragraphs (engText) that follow in my input XML file are correctly processed up to the last one, as seen in the .mult file.
  • The document is correctly closed in the .mult file.
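
To illustrate the recovery behaviour described in this list, here is a minimal C++ sketch of per-segment error isolation. It is not LIMA code: `SegmentAnalysisError` is a hypothetical stand-in for LinguisticProcessingException, and `analyzeSegments` and `toyAnalyze` are invented names used only for illustration. The point is simply that a failure aborts one segment (leaving it empty) while the following segments are still processed.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for LIMA's LinguisticProcessingException.
struct SegmentAnalysisError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Outcome for one segment (for example one engText paragraph).
struct SegmentResult {
    bool ok;             // false when the analysis was aborted
    std::string output;  // left empty for an aborted segment
};

// Analyze each segment independently: an error aborts only the
// current segment, and the loop carries on with the next one.
std::vector<SegmentResult> analyzeSegments(
    const std::vector<std::string>& segments,
    const std::function<std::string(const std::string&)>& analyze)
{
    std::vector<SegmentResult> results;
    results.reserve(segments.size());
    for (const auto& segment : segments) {
        try {
            results.push_back({true, analyze(segment)});
        } catch (const SegmentAnalysisError&) {
            // Record an empty, closed result instead of crashing.
            results.push_back({false, std::string{}});
        }
    }
    return results;
}

// Toy analyzer (assumption): rejects any segment containing a newline.
std::string toyAnalyze(const std::string& segment) {
    if (segment.find('\n') != std::string::npos) {
        throw SegmentAnalysisError("invalid token with newline");
    }
    return "tagged:" + segment;
}
```

Calling `analyzeSegments` with three segments where only the second contains a newline yields one aborted, empty result between two normal ones, which mirrors what the .mult file shows after the fix.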

Here is a sample XML file that causes SVMTagger to crash: 02552_GS_RC_MEC_682_EN_00.xml
Sample error log after my fix in SetCompilerFlags.cmake:

user:home$ analyzeXml -l eng -p TechnipTenderXML 02552_GS_RC_MEC_682_EN_00.xml
 : LP::PosTagger : 2021-12-09T15:26:39.586 ERROR 0x55e55a2fe5a0 Error in SVMTagger result line: did not get 2 elements in '  ' 
 : LP::CoreClient : 2021-12-09T15:26:39.586 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit" 
 : XML::DocumentsReader : 2021-12-09T15:26:39.587 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 6149 
 : LP::PosTagger : 2021-12-09T15:26:39.782 ERROR 0x55e55a2fe5a0 Error in SVMTagger result line: did not get 2 elements in '  ' 
 : LP::CoreClient : 2021-12-09T15:26:39.782 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit" 
 : XML::DocumentsReader : 2021-12-09T15:26:39.782 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 10389 
 : LP::PosTagger : 2021-12-09T15:26:41.809 ERROR 0x55e55a2fe5a0 Error in SVMTagger result alignement with analysis graph: got ' . : SENT ' from SVMTagger and ' "\n\n." ' from graph 
 : LP::CoreClient : 2021-12-09T15:26:41.810 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit" 
 : XML::DocumentsReader : 2021-12-09T15:26:41.810 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 52927 
 : LP::PosTagger : 2021-12-09T15:26:41.940 ERROR 0x55e55a2fe5a0 Error in SVMTagger result alignement with analysis graph: got ' .5 : NOUN ' from SVMTagger and ' "\n.5" ' from graph 
 : LP::CoreClient : 2021-12-09T15:26:41.940 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit" 
 : XML::DocumentsReader : 2021-12-09T15:26:41.940 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 55901 
 : LP::PosTagger : 2021-12-09T15:26:41.969 ERROR 0x55e55a2fe5a0 Error in SVMTagger result alignement with analysis graph: got ' . : SENT ' from SVMTagger and ' "\n\n." ' from graph 
 : LP::CoreClient : 2021-12-09T15:26:41.969 ERROR 0x55e55a2fe5a0 "/home/bl231006/WORK/Aymara/lima/lima_linguisticprocessing/src/linguisticProcessing/core/CoreLinguisticProcessingClient.cpp:255: analysis failed : receive status 1 from pipeline. exit" 
 : XML::DocumentsReader : 2021-12-09T15:26:41.970 ERROR 0x55e55a2fe5a0 StructuredDocumentXMLParser::endElement: error while handeling indexing element "engTEXT" absolute offset: 56480 
Total: 5317 ms

02552_GS_RC_MEC_682_EN_00.xml.zip

benlabbe commented Dec 9, 2021

Recovery on error now works in Release mode thanks to the fix to WITH_DEBUG_MESSAGES in commit e8e2e11.
This makes it possible to process large XML files in which each page is a node (engText), with minimal impact on the final result.

The SVMTagger crash itself is still not solved.

kleag added a commit that referenced this issue Dec 15, 2021
SVMToolPosTagger is failing when given tokens containing newlines. This
commit avoids the problem by replacing newlines with invisible non space
characters. It avoids crash and solves issues #95 and #127. It does not
solves the initial tokenization error though.
kleag commented Dec 15, 2021

Solved in commit 876c293:

gael@brezhoneg2:~/Téléchargements$ echo -e "\n.0" > test.txt; analyzeText --language=eng test.txt
Analyzing 1/1 (100.00%) 'test.txt' : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Error in SVMTagger. Invalid token with newline(s): "\n.0" 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Avoiding the problem but the tokenizer should be checked. 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Error in SVMTagger. Invalid token with newline(s): "\n.0" 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Avoiding the problem but the tokenizer should be checked. 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 No matching category found for tagger result  ".0"   "NOUN" 
 : LP::PosTagger : 2021-12-16T00:05:51.228  WARN 0x56012aba6760 Taking any one 
# global.columns = ID   FORM    LEMMA   UPOS    XPOS    FEATS   HEAD    DEPREL  DEPS    MISC
# sent_id = 1
# text =  .0 
1       \x0a.0
.0      NUM     _       _       _       _       _       NE=I-Numex.NUMBER|Pos=1|Len=3

gael@brezhoneg2:~/Téléchargements$ echo -e "some text.\n.\n" > test.txt; analyzeText --language=eng test.txt
Analyzing 1/1 (100.00%) 'test.txt' : LP::PosTagger : 2021-12-16T00:06:10.981  WARN 0x5614edead760 Error in SVMTagger. Invalid token with newline(s): ".\n." 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 Avoiding the problem but the tokenizer should be checked. 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 Error in SVMTagger. Invalid token with newline(s): ".\n." 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 Avoiding the problem but the tokenizer should be checked. 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 No matching category found for tagger result  ".\u200B."   "NOUN" 
 : LP::PosTagger : 2021-12-16T00:06:10.982  WARN 0x5614edead760 Taking any one 
# global.columns = ID   FORM    LEMMA   UPOS    XPOS    FEATS   HEAD    DEPREL  DEPS    MISC
# sent_id = 1
# text = some text.
1       some    some    DET     _       _       2       det     _       Pos=1|Len=4
2       text    text    NOUN    _       NUMBER=SING     3       Dummy   _       Pos=6|Len=4|SpaceAfter=No
3       .\x0a.  .
.       SENT    _       _       0       _       _       Pos=10|Len=3

But it does not solve the underlying tokenizer error.
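
The workaround described in the commit message can be sketched as follows. This is not the code from commit 876c293: `sanitizeToken` is a hypothetical name, and the choice of U+200B (ZERO WIDTH SPACE) as the replacement character is an assumption based on the `".\u200B."` that appears in the warning log above. The sketch works on UTF-8 bytes, where the byte 0x0A can only ever be a newline.

```cpp
#include <string>

// Replace every newline in a UTF-8 token with U+200B (ZERO WIDTH SPACE)
// so that a line-oriented tagger never sees a token spanning two lines.
// Hypothetical helper; the replacement character is an assumption.
std::string sanitizeToken(const std::string& token) {
    static const std::string zeroWidthSpace = "\xE2\x80\x8B";  // U+200B in UTF-8
    std::string out;
    out.reserve(token.size());
    for (char c : token) {
        if (c == '\n') {
            out += zeroWidthSpace;
        } else {
            out += c;
        }
    }
    return out;
}
```

Like the actual fix, this only hides the symptom: tokens such as "\n.0" should not come out of the tokenizer in the first place.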

@kleag kleag closed this as completed Dec 15, 2021
@benlabbe
Dear @kleag,

I have a new example that crashes the SVMPosTagger. The offending characters are a run of three dots: "...".
I managed to work around the issue by replacing them in the analyzed text with the Unicode character U+2026 (horizontal ellipsis) followed by two spaces.
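
As a rough illustration of this pre-processing step, the C++ sketch below replaces each run of three ASCII dots in a UTF-8 string with U+2026 followed by two spaces. `replaceThreeDots` is a hypothetical helper, not part of LIMA. The comment does not say why two spaces are added; one plausible reason is that the replacement keeps the same number of characters (three) as the original, so character offsets stay aligned.

```cpp
#include <cstddef>
#include <string>

// Replace each occurrence of "..." with U+2026 (HORIZONTAL ELLIPSIS)
// followed by two spaces. Input and output are UTF-8 encoded.
// Hypothetical helper written for illustration only.
std::string replaceThreeDots(const std::string& text) {
    static const std::string dots = "...";
    static const std::string replacement = "\xE2\x80\xA6  ";  // U+2026 + 2 spaces
    std::string out;
    std::size_t pos = 0;
    while (true) {
        const std::size_t hit = text.find(dots, pos);
        if (hit == std::string::npos) {
            out.append(text, pos, std::string::npos);  // copy the tail
            break;
        }
        out.append(text, pos, hit - pos);  // copy text before the match
        out += replacement;
        pos = hit + dots.size();
    }
    return out;
}
```

Text without a run of three dots is returned unchanged, and a run of four dots becomes the replacement followed by a single remaining dot.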
