The number of parsed senses seems very small #40

rabravo · 2017-03-05T21:13:27Z

Hi dkpro-jwktl team,

I cloned the dkpro-jwktl project and was able to parse the following Wiktionary dump, enwiki-20170301-pages-articles.xml.bz2, without a problem after adding two instructions to the private SAXParserFactory getParserFactory() method of XMLDumpParser that increase the number of entries that can be parsed. Before I added these instructions, the library threw an exception after parsing 650,000 entries. Here are the additional instructions that resolve this problem (I found this solution in another thread):

// Original instruction:
// return SAXParserFactory.newInstance("com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl", null);

System.setProperty("jdk.xml.totalEntitySizeLimit", "1500000000");
SAXParserFactory spf = SAXParserFactory.newInstance("com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl", null);
spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, false);
return spf;

After modifying the code, the parsing of the dump finished with no exceptions or errors. However, when I ran one of the examples, Example3_IterateEntries.java, the output showed the following results:

Pages: 10117574
Entries: 3776520
Senses: 986

The number of senses seems too small: I would expect it to be the largest of the three, or at least equal to the number of pages or entries. I also tried the Word Senses examples suggested in

https://dkpro.github.io/dkpro-jwktl/documentation/architecture/

with the word "Boat" (which certainly has many more instances) and got an IndexOutOfBoundsException. When I used "boat" instead, I got a NullPointerException. Do you have any idea why the library is not capturing enough senses? Thank you in advance for any help.

chmeyer · 2017-03-09T12:57:39Z

I tried an English dump from Feb 1, 2017 just recently and got

database.pages=5085081
database.entries=5721239
database.sense=13319857

I also tried parsing the most recent "boat" article page, which worked fine. Given that your number of pages is a lot higher than expected (about 10 million vs. about 5 million), I assume that there is a problem with your dump file. Assuming you didn't change the file name of your dump, the problem seems to be that you are trying to parse a WikiPEDIA dump file with the JWKTL WikTIONARY library. Depending on your goal, please download the enwiktionary-... dump or take a look at the https://dkpro.github.io/dkpro-jwpl/ library. Please reopen this issue if it is likely that the problem is somewhere else.
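
The file-name reasoning above can be turned into a quick sanity check before starting a long parse. This is only a minimal sketch and not part of JWKTL: the class DumpNameCheck and the method looksLikeWiktionaryDump are hypothetical names, and the check assumes the dump file still has the name it was downloaded under. The second file name in main is an assumed name for the Feb 1, 2017 dump.

```java
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper, not part of JWKTL: guesses the project from the file name.
public class DumpNameCheck {

    // Returns true if the file name looks like "<lang>wiktionary-...",
    // e.g. "enwiktionary-..."; a Wikipedia dump is named "<lang>wiki-...".
    static boolean looksLikeWiktionaryDump(String fileName) {
        String name = Paths.get(fileName).getFileName().toString()
                .toLowerCase(Locale.ROOT);
        return name.contains("wiktionary-");
    }

    public static void main(String[] args) {
        // The file from the original report: a Wikipedia dump.
        System.out.println(looksLikeWiktionaryDump(
                "enwiki-20170301-pages-articles.xml.bz2"));        // false
        // Assumed file name for an English Wiktionary dump.
        System.out.println(looksLikeWiktionaryDump(
                "enwiktionary-20170201-pages-articles.xml.bz2"));  // true
    }
}
```

A renamed file defeats this check, so it is best treated as a first hint rather than proof.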

rabravo · 2017-03-10T14:45:06Z

@chmeyer, your inference was correct. The dump I was using is the enwiki... one, which is a Wikipedia dump. This solves the mystery. All seems to work as it should. Thank you for taking the time to address my questions.

chmeyer closed this as completed Mar 9, 2017

chmeyer self-assigned this Mar 9, 2017

chmeyer added question wontfix labels Mar 9, 2017

chmeyer mentioned this issue Mar 9, 2017

About 2017 dump English Wiktionary #38

Closed
