Sections names absent #57

Closed
ghost opened this issue Mar 18, 2016 · 8 comments

Comments

ghost commented Mar 18, 2016

Hi!
Some section names, e.g. 'Bibliografia', are removed.
For example, for this person:
Duino Gorin
https://it.wikipedia.org/wiki/Duino_Gorin

In the XML file I can see a level 2 header:
==Bibliografia==
*''La Raccolta Completa degli Album Panini 1975-1976''
*''La Raccolta Completa degli Album Panini 1960-2004'' - Indici
*''Almanacco illustrato del calcio 1982''. edizione Pani

But the processed file contains only the list (no 'Bibliografia' section):

Trascorse in rossonero tre stagioni, fino al 1977, quando passò al Monza.

  • "La Raccolta Completa degli Album Panini 1975-1976"
  • "La Raccolta Completa degli Album Panini 1960-2004" - Indici
  • "Almanacco illustrato del calcio 1982". edizione Panini.

How can I keep the section names, please?

Thanks!

Owner

attardi commented Mar 19, 2016

Use option

--sections

attardi closed this as completed Mar 19, 2016
Author

ghost commented Mar 19, 2016

~/wikiextractor-master/WikiExtractor.py --sections -b 500M --lists -o extracted itwiki-latest-pages-articles.xml.bz2
usage: WikiExtractor.py [-h] [-o OUTPUT] [-b n[KMG]] [-c] [--html] [-l]
[--lists] [-ns ns1,ns2] [--templates TEMPLATES]
[--no-templates] [--escapedoc] [--processes PROCESSES]
[-q] [--debug] [-a] [-v]
input
WikiExtractor.py: error: unrecognized arguments: --sections

Owner

attardi commented Mar 19, 2016

Restored the option, in version 2.54.
Thank you.

Author

ghost commented Mar 19, 2016

Thank you for the quick response!
I re-ran it, but nothing changed. The 'Bibliografia' section is still missing:

Duino Gorin
Anche suo fratello minore Fabrizio è stato un calciatore.

Carriera.
Club.
Cresciuto nel Real San Marco di Venezia, esordì con la maglia del Venezia nel campionato di Serie
Trascorse in rossonero tre stagioni, fino al 1977, quando passò al Monza.

  • "La Raccolta Completa degli Album Panini 1975-1976"
  • "La Raccolta Completa degli Album Panini 1960-2004" - Indici
  • "Almanacco illustrato del calcio 1982". edizione Panini.

Author

ghost commented Mar 20, 2016

The problem occurs with this kind of data, where a section header is immediately followed by a list:
==Bibliografia==
*''La Raccolta Completa degli Album Panini 1975-1976''
*''La Raccolta Completa degli Album Panini 1960-2004'' - Indici
*''Almanacco illustrato del calcio 1982''. edizione Panini.

If I put some text before the list, it works:
==Bibliografia==
Bla Bla
-''La Raccolta Completa degli Album Panini 1975-1976''
-''La Raccolta Completa degli Album Panini 1960-2004'' - Indici
-''Almanacco illustrato del calcio 1982''. edizione Panini.

Author

ghost commented Mar 21, 2016

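I added the following after line 2190:

    if line:  # FIXME: n is '"'
        if Extractor.keepLists:
            if len(headers):
                if Extractor.keepSections:
                    # emit the pending section titles, ordered by header level
                    items = sorted(headers.items())
                    for i, v in items:
                        page.append(v)
                # the pending titles are now handled, so forget them
                headers.clear()

and now it works for me.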
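
For anyone reading later, here is a minimal, self-contained sketch of the idea behind the change. It is not the actual WikiExtractor code: `compact_sketch` and its parameters `keep_sections`, `keep_lists` and `flush_on_list` are hypothetical names, and the parsing is heavily simplified. It assumes that section titles are held back in a `headers` dict until some content follows them. With `flush_on_list=False`, only plain text triggers the pending titles, so a header directly followed by a list loses its title. With `flush_on_list=True`, list items trigger them as well.

```python
import re


def compact_sketch(lines, keep_sections=True, keep_lists=True, flush_on_list=True):
    """Simplified model of deferred section titles (hypothetical, for illustration)."""
    page = []
    headers = {}  # pending section titles, keyed by header level

    def flush_headers():
        # Emit pending titles in level order, then forget them.
        if headers:
            if keep_sections:
                page.extend(title for _, title in sorted(headers.items()))
            headers.clear()

    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = re.match(r'(==+)\s*(.*?)\s*\1$', line)
        if match:
            level = len(match.group(1))
            # A new header replaces pending headers at the same or a deeper level.
            for lev in [l for l in headers if l >= level]:
                del headers[lev]
            headers[level] = match.group(2) + '.'
            continue
        if line[0] in '*#':
            if not keep_lists:
                continue
            if flush_on_list:
                flush_headers()  # the behaviour added by the patch above
            page.append('- ' + line.lstrip('*#').strip())
            continue
        # Plain text always emits the pending titles first.
        flush_headers()
        page.append(line)
    return page


wikitext = [
    "==Carriera==",
    "Trascorse in rossonero tre stagioni.",
    "==Bibliografia==",
    "*Almanacco illustrato del calcio 1982",
]

before = compact_sketch(wikitext, flush_on_list=False)
after = compact_sketch(wikitext, flush_on_list=True)
print(before)
print(after)
```

In this sketch the 'Bibliografia.' title appears only in `after`, which mirrors what I saw in the extracted output before and after the change.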

Owner

attardi commented Mar 23, 2016

Thank you for the contribution. I merged it into version 2.55.

sooheon commented Jul 6, 2018

@attardi How would I also keep the == Section == formatting around the text?
