The number of parsed senses seems very small #40

Closed
rabravo opened this issue Mar 5, 2017 · 2 comments

rabravo commented Mar 5, 2017

Hi dkpro-jwktl team,

I cloned the dkpro-jwktl project and was able to parse the dump enwiki-20170301-pages-articles.xml.bz2 without a problem, after adding two statements to the private SAXParserFactory getParserFactory() method of XMLDumpParser that raise the parser's entity size limit and turn off secure processing. Before I added them, the library threw an exception after parsing 650,000 entries. Here are the additional statements that resolve this problem (I found this solution in another thread):

// Original instruction
// return SAXParserFactory.newInstance("com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl", null);

System.setProperty("jdk.xml.totalEntitySizeLimit", "1500000000");
SAXParserFactory spf = SAXParserFactory.newInstance("com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl", null);
spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, false);
return spf;
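
For anyone who wants to try this workaround outside the JWKTL source tree, here is a minimal stand-alone sketch. It is not the actual XMLDumpParser code: the class name RelaxedSaxFactory and the create method are hypothetical, and it assumes the platform default factory from SAXParserFactory.newInstance() rather than naming the internal Xerces class. Turning off secure processing removes the JDK's XML safety limits, so this only makes sense for trusted dump files.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

/** Hypothetical stand-alone version of the workaround; not part of JWKTL. */
final class RelaxedSaxFactory {

    /**
     * Builds a SAXParserFactory with the JDK total entity size limit raised
     * and the secure-processing feature switched off.
     */
    static SAXParserFactory create(long totalEntitySizeLimit)
            throws ParserConfigurationException, SAXException {
        // JVM-wide JAXP property limiting the total size of all entities.
        System.setProperty("jdk.xml.totalEntitySizeLimit",
                Long.toString(totalEntitySizeLimit));
        // Assumption: use the platform default factory implementation.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        // setFeature declares checked exceptions, so the caller must handle them.
        spf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, false);
        return spf;
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory spf = create(1_500_000_000L);
        System.out.println("secure processing enabled: "
                + spf.getFeature(XMLConstants.FEATURE_SECURE_PROCESSING));
    }
}
```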

After I modified the code, the dump parsed without any exceptions or errors. However, when I ran one of the examples, Example3_IterateEntries.java, it printed the following:

Pages: 10117574
Entries: 3776520
Senses: 986

The sense count seems far too low: I would expect it to be the largest of the three numbers, or at least equal to the number of pages or entries. I also tried the word sense examples suggested in

https://dkpro.github.io/dkpro-jwktl/documentation/architecture/

with the word "Boat" (which surely has many more senses) and got an IndexOutOfBoundsException. When I used "boat" instead, I got a NullPointerException. Do you have any idea why the library is not capturing more senses? Thank you in advance for any help.

Member

chmeyer commented Mar 9, 2017

I tried an English dump from Feb 1, 2017 just recently and got

database.pages=5085081
database.entries=5721239
database.sense=13319857

I also tried parsing the most recent "boat" article page, which worked fine. Given that your page count is much higher than expected (10 million vs. 5 million), I assume that something is wrong with your dump file. Assuming you didn't rename the dump, the problem seems to be that you are trying to parse a WikiPEDIA dump file with the JWKTL WikTIONARY library. Depending on your goal, please download the enwiktionary-... dump or take a look at the https://dkpro.github.io/dkpro-jwpl/ library. Please reopen this issue if the problem turns out to lie somewhere else.
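
The file name check described above can be done up front, before a long parse run. The following is a rough sketch only: the DumpNameCheck class and its methods are hypothetical and not part of JWKTL, the pattern assumes the usual dump naming scheme of a database name followed by a date (or "latest"), and the enwiktionary file name in main is an illustrative example rather than a file mentioned in this thread.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of JWKTL: guesses the wiki project from a dump file name. */
final class DumpNameCheck {

    // Assumed naming scheme: <dbname>-<yyyymmdd or "latest">-<rest>,
    // e.g. "enwiki-..." for English Wikipedia, "enwiktionary-..." for English Wiktionary.
    private static final Pattern DUMP_NAME =
            Pattern.compile("^([a-z0-9_]+)-(\\d{8}|latest)-.*");

    /** Returns the database name prefix (e.g. "enwiki"), or null if the name does not match. */
    static String dbName(String dumpPath) {
        Path fileName = Paths.get(dumpPath).getFileName();
        if (fileName == null) {
            return null;
        }
        Matcher m = DUMP_NAME.matcher(fileName.toString());
        return m.matches() ? m.group(1) : null;
    }

    /** True if the database name ends in "wiktionary". */
    static boolean looksLikeWiktionaryDump(String dumpPath) {
        String db = dbName(dumpPath);
        return db != null && db.endsWith("wiktionary");
    }

    public static void main(String[] args) {
        String[] samples = {
            "enwiki-20170301-pages-articles.xml.bz2",       // file name from this thread
            "enwiktionary-20170301-pages-articles.xml.bz2"  // illustrative name only
        };
        for (String s : samples) {
            System.out.println(s + " looks like a Wiktionary dump: "
                    + looksLikeWiktionaryDump(s));
        }
    }
}
```

A name-based check cannot catch a renamed file, so it is only a first line of defence.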

Author

rabravo commented Mar 10, 2017

@chmeyer, your inference was correct. The dump I was using is the enwiki... file, which is a Wikipedia dump. This solves the mystery. Everything seems to work as it should. Thank you for taking the time to address my questions.
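
The mismatch that prompted this thread can also be expressed as a quick ratio check on the parse statistics. This is a sketch under stated assumptions: ParseStatsCheck is a hypothetical class, and the threshold of one sense per entry is a rule of thumb taken from the expectation voiced in the first comment, not something JWKTL enforces. The figures in main are the ones reported above.

```java
/** Hypothetical sanity check on parse statistics; the threshold is an assumption. */
final class ParseStatsCheck {

    /** Average number of senses per entry. */
    static double sensesPerEntry(long entries, long senses) {
        if (entries <= 0) {
            throw new IllegalArgumentException("entries must be positive");
        }
        return (double) senses / entries;
    }

    /** Flags a database that averages fewer than one sense per entry. */
    static boolean looksSuspicious(long entries, long senses) {
        return sensesPerEntry(entries, senses) < 1.0;
    }

    public static void main(String[] args) {
        // Figures reported in this thread.
        long wikipediaEntries = 3_776_520L, wikipediaSenses = 986L;
        long wiktionaryEntries = 5_721_239L, wiktionarySenses = 13_319_857L;

        System.out.printf("Wikipedia dump:  %.4f senses/entry, suspicious: %b%n",
                sensesPerEntry(wikipediaEntries, wikipediaSenses),
                looksSuspicious(wikipediaEntries, wikipediaSenses));
        System.out.printf("Wiktionary dump: %.4f senses/entry, suspicious: %b%n",
                sensesPerEntry(wiktionaryEntries, wiktionarySenses),
                looksSuspicious(wiktionaryEntries, wiktionarySenses));
    }
}
```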
