WikiExtractor doesn't extract text for bn, hi #175

arijitx · 2022-04-04T18:00:06Z

Hi, I tried WikiExtractor on the Wikisource dumps for bn, hi, and es. For bn and hi it doesn't work: it extracts only one or two words per page, for example:

{"id": "5", "url": "https://bn.wikisource.org/wiki?curid=5", "title": "সানাই/গানের জাল", "text": "সানাই/গানের জাল\n\n<pages index=\"সানাই-রবীন্দ্রনাথ ঠাকুর.djvu\" from=88 to=88 header=1/>"}

For es, it seems to be working.
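In the sample record above, the only thing left besides the title is a `<pages index=... />` tag, which on Wikisource normally pulls the actual text in from scanned-page entries, so the extractor appears to have had nothing to expand. A minimal sketch for counting how many records in a dump are affected, assuming the output is one JSON object per line with the `id`, `title`, and `text` fields shown above (the helper names `is_unexpanded` and `find_unexpanded` are hypothetical and not part of WikiExtractor):

```python
import json
import re

# Matches the <pages .../> tag seen in the sample record above.
PAGES_TAG = re.compile(r"<pages\b[^>]*/?>", re.IGNORECASE)


def is_unexpanded(record: dict) -> bool:
    """Return True if only the title remains once <pages> tags are removed."""
    text = PAGES_TAG.sub("", record.get("text", ""))
    title = record.get("title", "")
    # Drop the leading copy of the title and see whether anything is left.
    remainder = text.replace(title, "", 1).strip()
    return remainder == ""


def find_unexpanded(lines):
    """Yield the ids of records whose text is just a title plus <pages> tags."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if is_unexpanded(record):
            yield record["id"]


# The record from this report, kept as a raw string so the JSON escapes survive.
sample = r'{"id": "5", "url": "https://bn.wikisource.org/wiki?curid=5", "title": "সানাই/গানের জাল", "text": "সানাই/গানের জাল\n\n<pages index=\"সানাই-রবীন্দ্রনাথ ঠাকুর.djvu\" from=88 to=88 header=1/>"}'

affected = list(find_unexpanded([sample]))
print(affected)
```

Running the same check over the es output would show whether that dump really is unaffected or simply uses the tag less often.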


MichaelKohler · 2022-04-04T18:06:57Z

Which version of WikiExtractor are you using locally? The extraction process here uses an older version. Can you update your version and try again locally? If the problem persists on the latest version, I would say the bug should be reported at https://github.com/attardi/wikiextractor/issues. If it works with the latest version, I will need to look into updating the version we use in the extraction process.

MichaelKohler added the "needs debugging" and "waiting on feedback" labels on Apr 4, 2022

MichaelKohler closed this as completed Aug 5, 2022
