This repository has been archived by the owner on Nov 25, 2019. It is now read-only.

aardc ends earlier than 100% #25

Closed
Jules- opened this issue Dec 30, 2012 · 13 comments

Comments

@Jules-

Jules- commented Dec 30, 2012

I ran aardc

aardc wiki cswiki-latest-pages-articles.cdb --siteinfo cs.json --timeout=5 --show-legend

to convert the Czech wiki dump from 27.12.2012.
There seem to be some deprecation warnings:

/home/jules/Programy/programy/Linux/aarddict/env-aard/lib/python2.7/site-packages/aardtools/mwaardhtmlwriter.py:356: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
not (element.getchildren() or element.text or element.tail) and parent):

and some exceptions:

Exception RuntimeError: RuntimeError('cannot join current thread',) in <Finalize object, dead> ignored

Sometimes aardc ends at 80%, sometimes at more than 90%:

/home/jules/Programy/programy/Linux/aarddict/env-aard/lib/python2.7/site-packages/aardtools/mwaardhtmlwriter.py:356: FutureWarning: The behavior of this method will change in future versions. Use specific 'len(elem)' or 'elem is not None' test instead.
not (element.getchildren() or element.text or element.tail) and parent):
Exception RuntimeError: RuntimeError('cannot join current thread',) in <Finalize object, dead> ignored
Exception RuntimeError: RuntimeError('cannot join current thread',) in <Finalize object, dead> ignored
96.11% t: 9:26:05 avg: 11.6/s a: 240319 r: 152982 s: 0 e: 0 to: 36 f: 0
Compiling .aar files
Creating volume 1
Wrote volume 1
cswiki-latest-pages-articles.aar.1 sha1: 541ffe4376ab1bc1e67d59a6d902bb1946be3fd3
Created cswiki-latest-pages-articles.aar
Compilation took 9:59:52

Where is the problem?
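On the FutureWarning quoted above: that message is what ElementTree-style libraries emit when an element is truth-tested, which is what the trailing `and parent` on the flagged line does. The sketch below shows the explicit form the warning asks for, using the standard library's `xml.etree.ElementTree`. The function name `is_blank_child` is hypothetical, and treating `parent` as "a parent was supplied" is an assumption about what the original line intended; this is not the aardtools code.

```python
import xml.etree.ElementTree as ET


def is_blank_child(element, parent):
    """Return True if element has no children, text, or tail and a parent was given.

    Uses the explicit tests the warning recommends: len(elem) for
    "has children" and "elem is not None" for "exists".
    """
    has_content = len(element) > 0 or bool(element.text) or bool(element.tail)
    # Assumption: the original truth test on parent meant "a parent exists".
    return not has_content and parent is not None


root = ET.fromstring("<p><b/><i>x</i></p>")
empty_child, text_child = root[0], root[1]
print(is_blank_child(empty_child, root))  # <b/> carries nothing
print(is_blank_child(text_child, root))   # <i> carries text
```

Note that a bare truth test on an element is False for any element without children, so `parent is not None` is not a drop-in equivalent if the original relied on that behaviour.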

@itkach
Member

itkach commented Dec 30, 2012

96.11% t: 9:26:05 avg: 11.6/s a: 240319 r: 152982 s: 0 e: 0 to: 36 f: 0
Compiling .aar files
Creating volume 1
Wrote volume 1
cswiki-latest-pages-articles.aar.1 sha1: 541ffe4
Created cswiki-latest-pages-articles.aar
Compilation took 9:59:52

Where is the problem?

Not sure. Is there an actual problem with compiled dictionary though?



@Jules-
Author

Jules- commented Dec 31, 2012

I have a problem with the article count. I compiled 3 dictionaries from the same cdb directory, and each dictionary ended up with a different number of articles.

From the logs:
22:27:00 INFO [compiler] Done with cswiki-latest-pages-articles.aar.1
22:27:00 INFO [compiler] Wrote volume 1
22:27:00 INFO [compiler] Writing volume count 1 to all volumes as >H
22:27:00 INFO [compiler] Calculating checksum for cswiki-latest-pages-articles.aar.1
22:27:04 INFO [compiler] cswiki-latest-pages-articles.aar.1 sha1: e8ea6ad662d6e0ab813509e0864120cd943d81ad
22:27:04 INFO [compiler] Renaming cswiki-latest-pages-articles.aar.1 ==> cswiki-latest-pages-articles.aar
22:27:04 INFO [compiler] total: 409246, skipped: 0, failed: 0, empty: 0, timed out: 498, articles: 168095, redirects: 110448, average: 8.94/s elapsed: 8:39:57
22:27:04 INFO [compiler] Compression: _zlib - 138964, none - 109998, _bz2 - 29583
22:27:04 INFO [compiler] Compilation took 8:39:57.171867

498+168095+110448=279041

03:32:02 INFO [compiler] Done with cswiki-latest-pages-articles.aar.1
03:32:02 INFO [compiler] Wrote volume 1
03:32:02 INFO [compiler] Writing volume count 1 to all volumes as >H
03:32:03 INFO [compiler] Calculating checksum for cswiki-latest-pages-articles.aar.1
03:32:13 INFO [compiler] cswiki-latest-pages-articles.aar.1 sha1: 0392be63d3ec7e23327ae425f0b11d6e2ad7fb79
03:32:13 INFO [compiler] Renaming cswiki-latest-pages-articles.aar.1 ==> cswiki-latest-pages-articles.aar
03:32:13 INFO [compiler] total: 409246, skipped: 0, failed: 0, empty: 0, timed out: 198, articles: 201200, redirects: 128883, average: 10.80/s elapsed: 8:29:28
03:32:13 INFO [compiler] Compression: _zlib - 163847, none - 128372, _bz2 - 37866
03:32:13 INFO [compiler] Compilation took 8:29:28.933373

198+201200+128883=330281

18:17:01 INFO [compiler] Done with cswiki-latest-pages-articles.aar.1
18:17:01 INFO [compiler] Wrote volume 1
18:17:01 INFO [compiler] Writing volume count 1 to all volumes as >H
18:17:01 INFO [compiler] Calculating checksum for cswiki-latest-pages-articles.aar.1
18:17:16 INFO [compiler] cswiki-latest-pages-articles.aar.1 sha1: 541ffe4376ab1bc1e67d59a6d902bb1946be3fd3
18:17:16 INFO [compiler] Renaming cswiki-latest-pages-articles.aar.1 ==> cswiki-latest-pages-articles.aar
18:17:16 INFO [compiler] total: 409246, skipped: 0, failed: 0, empty: 0, timed out: 36, articles: 240319, redirects: 152982, average: 10.93/s elapsed: 9:59:52
18:17:16 INFO [compiler] Compression: _zlib - 194612, none - 152366, _bz2 - 46325
18:17:16 INFO [compiler] Compilation took 9:59:52.214186

36+240319+152982=393337

I don't know why the program doesn't process all articles. In every run the counts add up to less than the total of 409246. There are always two RuntimeError exceptions just before the end of the compilation, and the compilation never reaches 100%.
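The sums above can be reproduced with a short script. This is only a sketch: it assumes the summary-line layout shown in the logs, and the names `parse_summary` and `unaccounted` are made up for illustration.

```python
import re

# Summary lines copied from the three logs above (timestamps and rates dropped).
SUMMARIES = [
    "total: 409246, skipped: 0, failed: 0, empty: 0, timed out: 498, articles: 168095, redirects: 110448",
    "total: 409246, skipped: 0, failed: 0, empty: 0, timed out: 198, articles: 201200, redirects: 128883",
    "total: 409246, skipped: 0, failed: 0, empty: 0, timed out: 36, articles: 240319, redirects: 152982",
]

# Matches "name: integer" pairs such as "timed out: 498".
FIELD = re.compile(r"([a-z ]+): (\d+)")


def parse_summary(line):
    """Return a dict mapping counter name to integer for one summary line."""
    return {name.strip(): int(value) for name, value in FIELD.findall(line)}


def unaccounted(counters):
    """Return how many items of the total appear in no other counter."""
    accounted = sum(
        counters[key]
        for key in ("skipped", "failed", "empty", "timed out", "articles", "redirects")
    )
    return counters["total"] - accounted


for line in SUMMARIES:
    counters = parse_summary(line)
    print(counters["timed out"], "timed out,", unaccounted(counters), "unaccounted for")
```

A result of zero would mean every item in the dump landed in one of the reported counters.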

@itkach
Member

itkach commented Dec 31, 2012

Compiled with the latest version from @aarddict (using the fairly old mwlib 12.3):

100.00% t: 2:21:05 avg: 48.3/s a: 251391 r: 157855 s: 0 e: 0 to: 0 f: 0
11:49:58 INFO [compiler] Done with cswiki-20121225.aar.1
11:49:58 INFO [compiler] Wrote volume 1
11:49:58 INFO [compiler] Writing volume count 1 to all volumes as >H
11:49:58 INFO [compiler] Calculating checksum for cswiki-20121225.aar.1
11:50:00 INFO [compiler] cswiki-20121225.aar.1 sha1: 7422f22db99a5f98f964981526405806245b8fbb
11:50:00 INFO [compiler] Renaming cswiki-20121225.aar.1 ==> cswiki-20121225.aar
11:50:00 INFO [compiler] total: 409246, skipped: 0, failed: 0, empty: 0, timed out: 0, articles: 251391, redirects: 157855, average: 48.14/s elapsed: 2:21:42
11:50:00 INFO [compiler] Compression: _zlib - 202415, none - 777951, _bz2 - 49857
11:50:00 INFO [compiler] Compilation took 2:21:42.003559

Download: cswiki-20121225-mwlib-12.3.aar

Compiled with a slightly modified version from @doozan (using mwlib 14.1):

100.00% t: 2:12:15 avg: 51.6/s a: 250830 r: 157855 s: 0 e: 0 to: 0 f: 561

With the new version of mwlib, a fairly large number of articles (561) failed with this error:

Traceback (most recent call last):
  File "/home/itkach/aardtools-doozan/aardtools/wiki.py", line 116, in convert
    magicwords=wikidb.siteinfo['magicwords'])
  File "/home/itkach/.virtualenvs/aardtools-doozan/local/lib/python2.7/site-packages/mwlib/refine/uparser.py", line 34, in parseString
    input = te.expandTemplates(True)
  File "evaluate.py", line 295, in mwlib.templ.evaluate.Expander.expandTemplates (mwlib/templ/evaluate.c:6507)
  File "evaluate.py", line 285, in mwlib.templ.evaluate.Expander._expand (mwlib/templ/evaluate.c:6107)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)

Maybe these articles have bad markup and the new mwlib is stricter about it, or maybe the new mwlib introduced a bug. I don't know.
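For reference, this class of error appears whenever UTF-8 encoded bytes are decoded with the ASCII codec, which Python 2 does implicitly when byte strings and unicode strings are mixed. The snippet below only illustrates the error itself; the sample bytes are an assumption, and it says nothing about where in mwlib the decoding happens.

```python
def try_ascii_decode(raw):
    """Decode raw bytes as ASCII; return the text, or the error message on failure."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)


# Hypothetical sample: two ASCII letters followed by a no-break space,
# which UTF-8 encodes as the two bytes 0xc2 0xa0.
sample = "ab\u00a0".encode("utf-8")
message = try_ascii_decode(sample)
print(message)
```

Byte 0xc2 is the lead byte of many two-byte UTF-8 sequences, so any non-ASCII character in the U+0080 to U+00BF range in a template argument would be enough to produce it.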

23:28:28 INFO [compiler] Done with cswiki-20121225.aar.1
23:28:28 INFO [compiler] Wrote volume 1
23:28:28 INFO [compiler] Writing volume count 1 to all volumes as >H
23:28:28 INFO [compiler] Calculating checksum for cswiki-20121225.aar.1
23:28:30 INFO [compiler] cswiki-20121225.aar.1 sha1: 0ec083745977701acfc8fc170314f33738fe9a0c
23:28:30 INFO [compiler] Renaming cswiki-20121225.aar.1 ==> cswiki-20121225.aar
23:28:31 INFO [compiler] total: 409246, skipped: 0, failed: 561, empty: 0, timed out: 0, articles: 250830, redirects: 157855, average: 51.45/s elapsed: 2:12:41
23:28:31 INFO [compiler] Compression: _zlib - 202281, none - 774957, _bz2 - 49430
23:28:31 INFO [compiler] Compilation took 2:12:41.511623

Download: cswiki-20121225-mwlib-14.1.aar

Note that your system is quite a bit slower and you have a number of timed-out articles. Clearly the numbers are off when there are timed-out articles, and I need to look into this, but normally there should be no timed-out articles at all.

With some versions of mwlib/aardtools and some XML dumps, there are occasionally a few articles that neither convert in a reasonable amount of time nor error out. Such conversions are detected and aborted so that they do not stall the whole compilation process. This should be extremely rare. Having many timed-out articles usually means the timeout value is too low for the particular machine.
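To make the idea concrete, here is a minimal sketch of a per-article timeout. It is not how aardc implements it: it assumes a Unix system, must run in the main thread, and uses a hypothetical `convert` callable and the made-up names `ConversionTimeout` and `convert_with_timeout`.

```python
import signal
import time


class ConversionTimeout(Exception):
    """Raised inside the conversion when the timer expires."""


def _on_alarm(signum, frame):
    raise ConversionTimeout()


def convert_with_timeout(convert, article, timeout):
    """Run convert(article) and return (result, timed_out).

    Unix only, main thread only: relies on SIGALRM interrupting the call.
    """
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, timeout)
    try:
        return convert(article), False
    except ConversionTimeout:
        return None, True
    finally:
        # Cancel the timer and restore whatever handler was installed before.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


def slow_convert(article):
    time.sleep(2)  # stands in for an article that never finishes
    return article


print(convert_with_timeout(str.upper, "fast article", 1.0))
print(convert_with_timeout(slow_convert, "stuck article", 0.1))
```

With a limit this short on a slow machine, ordinary articles would start hitting the timeout too, which is the situation described above.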

It's also interesting to compare the number of articles that end up in the dictionary with the stats from http://meta.wikimedia.org/wiki/List_of_Wikipedias: cswiki is reported to have 251810 articles, and my compilations are pretty close to that (250830 and 251391). In your compilations, far more articles are actually lost or discarded than the timeout count suggests.

@Jules-
Author

Jules- commented Jan 1, 2013

Thank you for the compiled dictionary. I can try the compilation on a faster PC. AardDict is perfect. Thanks!

@MHBraun

MHBraun commented Jan 1, 2013

Compiled cswiki-20121225 with these results:
100.00% t: 2:03:54 avg: 55.0/s a: 251391 r: 157855 s: 0 e: 0 to: 0 f: 0
My Ubuntu runs in VMware and uses mwlib 12.13.
I will upload the file to Wuala for your reference: bit.ly/QAfUyD

@ghost

ghost commented Jan 20, 2013

Hello!
I am a noob, and when I convert the Polish wiki from 20130101 it ends at about 72%. I have the same symptoms as Jules. I do not know much about Linux, so could someone share a complete Ubuntu image for VMware with all the tools needed to convert the wiki correctly, and/or a compiled dictionary?
Sorry for my English.

@MHBraun

MHBraun commented Jan 22, 2013

I would be happy to. Is a raw Ubuntu Desktop 64-bit 12.04 LTS good enough? You will need to go through the pain of installing the aard package as I did. The size is approx. 2 GB.
I have no clue how to send you my 50 GB VMware installation including all the files...

@ghost

ghost commented Jan 23, 2013

But I installed a clean 32-bit 12.10, and earlier, on 12.04 LTS x64, I had the same problem. Maybe I did something wrong while installing the tools? I installed everything according to the tutorial on the aard page. The only difference from the tutorial is that I could not find libicu38, so I installed libicu48 and blahtexml from the Ubuntu app center.

@itkach
Member

itkach commented Jan 23, 2013

@barqqlsky it sounds like you just need a faster machine/VM

@MHBraun

MHBraun commented Jan 23, 2013

It must be a 64-bit machine. 32-bit will not work. I had the same issue ;-)

@itkach
Member

itkach commented Jan 23, 2013

@MHBraun indeed
@barqqlsky somehow I missed that you were talking about a 32-bit VM. plwiki is one of the big ones and definitely needs a 64-bit OS.

@ghost

ghost commented Jan 29, 2013

Thank you for your help and tips. This comment helped me: aarddict/tools#10 #issuecomment-1302371. Specifically, "--timeout 120" did the trick. Now I do not have any errors or timeouts. Once again, thank you itkach.
P.S. It works on a 32-bit VM.

@itkach
Member

itkach commented Jan 30, 2013

This is the issue tracker for desktop aarddict; moving to aarddict/tools#23

@itkach itkach closed this as completed Jan 30, 2013