Sections names absent #57
Comments
Use option
~/wikiextractor-master/WikiExtractor.py --sections -b 500M --lists -o extracted itwiki-latest-pages-articles.xml.bz2
Restored the option, in version 2.54.
Thank you for the quick response! Duino Gorin Carriera.
The problem is with this kind of data: a section header followed directly by a list ('section + list').
If I put some text before the list, it works:
I added after line 2190: and now it works for me
Thank you for the contribution. I merged it into version 2.55. |
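The 'section + list' behaviour reported above can be illustrated with a minimal sketch. This is a simplified, hypothetical model and not WikiExtractor's actual code: the names `HEADER`, `compact_sketch` and `flush_before_lists` are invented here, and the idea that a header is buffered until some content follows it is an assumption inferred from this thread.

```python
import re

# Matches wikitext headers such as "==Bibliografia==" (hypothetical helper).
HEADER = re.compile(r"^(=+)\s*(.*?)\s*\1$")

def compact_sketch(lines, flush_before_lists):
    """Simplified model of header buffering; NOT WikiExtractor's real code.

    Headers are held back until content follows them. If list items do not
    count as content (flush_before_lists=False), a header followed directly
    by a list is never emitted.
    """
    out = []
    pending = {}  # header level -> title, waiting for content to follow
    for line in lines:
        line = line.strip()
        if not line:
            continue
        m = HEADER.match(line)
        if m:
            level = len(m.group(1))
            # A new header discards pending headers at the same or deeper level.
            pending = {lvl: t for lvl, t in pending.items() if lvl < level}
            pending[level] = m.group(2)
        elif line.startswith("*"):
            if flush_before_lists:
                out.extend(pending[lvl] + "." for lvl in sorted(pending))
                pending.clear()
            # Drop the list marker and the surrounding italic quotes.
            out.append(line.lstrip("* ").strip("'"))
        else:
            # Plain text always emits the headers that precede it.
            out.extend(pending[lvl] + "." for lvl in sorted(pending))
            pending.clear()
            out.append(line)
    return out

sample = ["==Bibliografia==", "*''Almanacco illustrato del calcio 1982''"]
without_fix = compact_sketch(sample, flush_before_lists=False)
with_fix = compact_sketch(sample, flush_before_lists=True)
```

In this model `without_fix` contains only the list item, while `with_fix` also contains the section name, which mirrors the difference described in the comments above. Inserting a plain text line between the header and the list emits the header in both modes.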
@attardi How would I also keep the
Hi!
Some section names, e.g. 'Bibliografia', are removed.
For example, for this person
Duino Gorin
https://it.wikipedia.org/wiki/Duino_Gorin
In the XML file I can see a level 2 header:
==Bibliografia==
*''La Raccolta Completa degli Album Panini 1975-1976''
*''La Raccolta Completa degli Album Panini 1960-2004'' - Indici
*''Almanacco illustrato del calcio 1982''. edizione Pani
But the processed file contains just this (no 'Bibliografia' section):
Trascorse in rossonero tre stagioni, fino al 1977, quando passò al Monza.
How can I keep the section names, please?
Thanks!