
Help! DO dump files contain the wikitable in the wikipedia? #247

Open
HamLaertes opened this issue Feb 27, 2021 · 1 comment

Comments


HamLaertes commented Feb 27, 2021

Hello everyone.
I downloaded the first dump file, enwiki-20210220-pages-articles1.xml-p1p41242.bz2, from the Wikipedia dump server, and I successfully extracted text by running the script. However, the output seems to omit the table information from the wiki pages, i.e. the wikitables.
Am I missing something, or do the dump files not contain the table information at all?
Thanks!
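For anyone checking the same thing: tables are present in the raw dump as wikitext markup, delimited by `{|` and `|}` (standard wikitext table syntax). A minimal sketch with an inline sample, assuming you just want to confirm that table blocks exist in the text before extraction:

```python
import re

# Standard wikitext table syntax: a table opens with "{|" and closes with "|}".
# The default plain-text extraction strips these blocks from the output.
TABLE_RE = re.compile(r'\{\|.*?\|\}', re.DOTALL)

sample = """Intro paragraph.
{| class="wikitable"
! Header 1 !! Header 2
|-
| cell A || cell B
|}
Closing paragraph."""

tables = TABLE_RE.findall(sample)
print(len(tables))  # -> 1: one table block found in the raw wikitext
```

The same regex can be run over the decompressed page text of the dump (e.g. read via `bz2.open`) to count table blocks before extraction.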


HamLaertes commented Feb 27, 2021

I think I've found the answer myself. The dump files do contain the wikitable information, just in a different form.
Passing the --html argument should keep the wikitables in the output more directly, but the code seems to have a bug when converting the wikitext to HTML.
It reports a KeyError as follows:

  File "/storage/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/storage/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/storage/fbzhu/yc/wikiextractor/wikiextractor/WikiExtractor.py", line 467, in extract_process
    Extractor(*job[:-1]).extract(out, html_safe)  # (id, urlbase, title, page)
  File "/storage/wikiextractor/wikiextractor/extract.py", line 857, in extract
    text = self.clean_text(text, html_safe=html_safe)
  File "/storage/wikiextractor/wikiextractor/extract.py", line 847, in clean_text
    text = compact(text, mark_headers=mark_headers)
  File "/storage/wikiextractor/wikiextractor/extract.py", line 256, in compact
    page.append(listItem[n] % line)
KeyError: '&'

I am using the XML files dumped on 20 Feb 2021 and wikiextractor version 3.0.5.
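The failing line, `page.append(listItem[n] % line)`, indexes a dict before the `%` formatting runs, so a `KeyError: '&'` suggests the lookup key `n` ended up being `'&'` rather than a recognised list marker. A minimal sketch of that failure mode, assuming (hypothetically, names taken from the traceback) that `listItem` maps wiki list-marker characters to HTML templates and `n` is a character taken from the current line:

```python
# Hypothetical reconstruction of the structures named in the traceback:
# a dict of per-marker HTML templates, indexed by a list-marker character.
listItem = {'*': '<li>%s</li>', '#': '<li>%s</li>',
            ';': '<dt>%s</dt>', ':': '<dd>%s</dd>'}

# An HTML-escaped line starting with an entity such as "&amp;" begins
# with '&', which is not one of the recognised list markers above.
line = '&amp; some escaped text'
n = line[0]  # '&'

try:
    html = listItem[n] % line  # dict lookup fails before % formatting runs
except KeyError as err:
    print('KeyError:', err)  # -> KeyError: '&'
```

If this is indeed the mechanism, the fix would be to validate `n` against the known marker characters before indexing the dict, but that is a guess pending a look at the actual `compact()` code.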
