
Help! DO dump files contain the wikitable in the wikipedia? #247

Open
HamLaertes opened this issue Feb 27, 2021 · 1 comment

Comments


HamLaertes commented Feb 27, 2021

Hello everyone.
I downloaded the first dump file, enwiki-20210220-pages-articles1.xml-p1p41242.bz2, from the Wikipedia dump server, and I successfully extracted text by running the script. However, the output seems to omit the table information from the wiki pages, i.e. the wikitables.
Am I missing something, or do the dump files not contain the table information at all?
Thanks!
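For anyone checking the same thing: tables are present in the raw dump as wikitext markup, delimited by `{|` and `|}` (standard wikitext table syntax). A minimal sketch with an inline sample, assuming you just want to confirm that table blocks exist in the text before extraction:

```python
import re

# Standard wikitext table syntax: a table opens with "{|" and closes with "|}".
# The default plain-text extraction strips these blocks from the output.
TABLE_RE = re.compile(r'\{\|.*?\|\}', re.DOTALL)

sample = """Intro paragraph.
{| class="wikitable"
! Header 1 !! Header 2
|-
| cell A || cell B
|}
Closing paragraph."""

tables = TABLE_RE.findall(sample)
print(len(tables))  # -> 1: one table block found in the raw wikitext
```

The same regex can be run over the decompressed page text of the dump (e.g. read via `bz2.open`) to count table blocks before extraction.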


HamLaertes commented Feb 27, 2021

I think I've found the answer myself. The dump files do contain the wikitable information, just in a different form.
Passing the --html argument should keep the wikitables in the output more directly, but the code seems to have a bug when converting the wikitext to HTML.
It reports a KeyError as follows:

  File "/storage/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/storage/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/storage/fbzhu/yc/wikiextractor/wikiextractor/WikiExtractor.py", line 467, in extract_process
    Extractor(*job[:-1]).extract(out, html_safe)  # (id, urlbase, title, page)
  File "/storage/wikiextractor/wikiextractor/extract.py", line 857, in extract
    text = self.clean_text(text, html_safe=html_safe)
  File "/storage/wikiextractor/wikiextractor/extract.py", line 847, in clean_text
    text = compact(text, mark_headers=mark_headers)
  File "/storage/wikiextractor/wikiextractor/extract.py", line 256, in compact
    page.append(listItem[n] % line)
KeyError: '&'

I am using the XML files dumped on 20 Feb 2021 and wikiextractor version 3.0.5.
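The failing line, `page.append(listItem[n] % line)`, indexes a dict before the `%` formatting runs, so a `KeyError: '&'` suggests the lookup key `n` ended up being `'&'` rather than a recognised list marker. A minimal sketch of that failure mode, assuming (hypothetically, names taken from the traceback) that `listItem` maps wiki list-marker characters to HTML templates and `n` is a character taken from the current line:

```python
# Hypothetical reconstruction of the structures named in the traceback:
# a dict of per-marker HTML templates, indexed by a list-marker character.
listItem = {'*': '<li>%s</li>', '#': '<li>%s</li>',
            ';': '<dt>%s</dt>', ':': '<dd>%s</dd>'}

# An HTML-escaped line starting with an entity such as "&amp;" begins
# with '&', which is not one of the recognised list markers above.
line = '&amp; some escaped text'
n = line[0]  # '&'

try:
    html = listItem[n] % line  # dict lookup fails before % formatting runs
except KeyError as err:
    print('KeyError:', err)  # -> KeyError: '&'
```

If this is indeed the mechanism, the fix would be to validate `n` against the known marker characters before indexing the dict, but that is a guess pending a look at the actual `compact()` code.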
