Templates don't get expanded #151

Open
dnishiyama opened this Issue Feb 18, 2018 · 3 comments

@dnishiyama

dnishiyama commented Feb 18, 2018

Any idea why none of the templates get expanded? To debug, I first ran WikiExtractor.py once and saved all templates to a file (named "templates"; it's 2358539 lines long). I'm trying to extract all Wiktionary articles, but the resulting text looks like this (blank text in place of templates):

"
dictionary
, from , from , from , perfect past participle of + .
For more, see

This was the command I ran:
python WikiExtractor.py -o extracted --debug --templates templates enwiktionary-sample-pages-articles.xml

This was the output:
INFO: Loading template definitions from: templates
INFO: Loaded 74373 templates in 24.6s
INFO: Starting page extraction from enwiktionary-sample-pages-articles.xml.
INFO: Using 7 extract processes.
INFO: 16 dictionary
INFO: 19 free
INFO: 20 thesaurus
DEBUG: EXPAND also|Dictionary
INFO: 27 encyclopedia
DEBUG: Quit extractor
INFO: 29 portmanteau
DEBUG: Quit extractor
DEBUG: <EXPAND Template:Also
DEBUG: EXPAND wikipedia|dab=Dictionary (disambiguation)|Dictionary
DEBUG: <EXPAND Template:Wikipedia
DEBUG: EXPAND PIE root|en|deyḱ
DEBUG: TEMPLATE Template:PIE root: {{catlangname|{{{1|}}}|terms derived from the PIE root *{{{2|}}}-{{#if:{{{id|{{{id1|}}}}}}| ({{{id|{{{id1|}}}}}})}}}}{{#if:{{{3|}}}|{{catlangname|{{{1|}}}|terms derived from the PIE root *{{{3}}}-{{#if:{{{id2|}}}| ({{{id2|}}})}}}}}}{{#if:{{{4|}}}|{{catlangname|{{{1|}}}|terms derived from the PIE root *{{{4}}}-{{#if:{{{id3|}}}| ({{{id3|}}})}}}}}}
DEBUG: EXPAND catlangname|en|terms derived from the PIE root *deyḱ-{{#if:| ()}}
DEBUG: <EXPAND Template:Catlangname
DEBUG: EXPAND #if:|{{catlangname|en|terms derived from the PIE root *-{{#if:| ()}}}}
DEBUG: EXPAND also|-free
DEBUG: <EXPAND #if
DEBUG: EXPAND #if:|{{catlangname|en|terms derived from the PIE root *-{{#if:| ()}}}}
DEBUG: EXPAND also|Thesaurus|thésaurus
DEBUG: <EXPAND #if
DEBUG: <EXPAND Template:PIE root
DEBUG: EXPAND bor|en|ML.|dictionarium|withtext=1
DEBUG: <EXPAND Template:Bor
DEBUG: EXPAND der|en|la|dictionarius
DEBUG: EXPAND was wotd|2007|March|8
DEBUG: <EXPAND Template:Der
DEBUG: EXPAND wikipedia

I have been working on extracting templates for months and this looks like an amazing tool if I can get it to work. Thanks for all the work you all are doing on it!

@mhagiwara

mhagiwara commented Jul 17, 2018

@dnishiyama Do you still have this issue? I ran into a similar problem, and it seems the current script has an issue when it is applied to Wiktionary dumps. Specifically, when it expands templates, it tries to "normalize" each template title by converting its first letter to upper case, even though the template titles are stored without that normalization, so the normalized title presumably no longer matches the stored one.

After removing those applications of ucfirst, things seem to be working correctly for me.
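
A minimal sketch of the mismatch described above, not the actual WikiExtractor code. The `templates` dictionary, the stored title `Template:der`, its body, and both lookup helpers are hypothetical stand-ins, chosen to mirror the `EXPAND der|...` / `<EXPAND Template:Der` lines in the debug log:

```python
def ucfirst(s):
    """Upper-case only the first character, leaving the rest unchanged."""
    return s[:1].upper() + s[1:]


# Hypothetical template store, keyed by the title exactly as it appears
# in the dump (lower-case first letter, as is common on Wiktionary).
templates = {"Template:der": "{{{3|}}}"}


def lookup_normalized(name):
    """Look up a template after ucfirst normalization (the failing path)."""
    return templates.get("Template:" + ucfirst(name))


def lookup_raw(name):
    """Look up a template using the name as written in the wikitext."""
    return templates.get("Template:" + name)


print(lookup_normalized("der"))  # key "Template:Der" is not in the store
print(lookup_raw("der"))         # key "Template:der" is found
```

If the real lookup behaves like `lookup_normalized`, every template whose stored title starts with a lower-case letter would expand to nothing, which would be consistent with the blank gaps in the extracted text.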

@dnishiyama

dnishiyama commented Jul 17, 2018

Thanks for the reply. I do still have the issue and have since moved on to a different technique to gather this data from Wiktionary (scrapy + bs4). If I get a chance, I'll check out your recommendation. This would be a much better option if it does work.

@KylePiira

KylePiira commented Aug 3, 2018

I am also encountering this problem on the July 20th, 2018 English Wikipedia dump. Here was my command:

python WikiExtractor.py --o 'articles/' --templates 'templates.temp' --filter_disambig_pages --json 'enwiki.xml'

Here is an example of an incorrectly extracted sentence from Wikipedia Page ID 12.

WikiExtractor Output: The word "anarchism" is composed from the word "anarchy" and the suffix -ism, themselves derived respectively from the Greek , i.e. "anarchy" (from , "anarchos", meaning "one without rulers"; from the privative prefix ἀν- ("an-", i.e. "without") and , "archos", i.e. "leader", "ruler"; (cf. "archon" or , "arkhē", i.e. "authority", "sovereignty", "realm", "magistracy")) and the suffix or ("-ismos", "-isma", from the verbal infinitive suffix , "-izein").

Real Wikipedia Value: The word "anarchism" is composed from the word "anarchy" and the suffix -ism, themselves derived respectively from the Greek ἀναρχία, i.e. anarchy (from ἄναρχος, anarchos, meaning "one without rulers"; from the privative prefix ἀν- (an-, i.e. "without") and ἀρχός, archos, i.e. "leader", "ruler"; (cf. archon or ἀρχή, arkhē, i.e. "authority", "sovereignty", "realm", "magistracy")) and the suffix -ισμός or -ισμα (-ismos, -isma, from the verbal infinitive suffix -ίζειν, -izein).

I've also found other types of template expansions missing such as distance measurements.
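
One rough way to spot text where a template was probably dropped is to look for punctuation that directly follows whitespace, as in "from the Greek , i.e." above. The sketch below is only a heuristic under that assumption; the function name `count_suspect_gaps` and the choice of punctuation characters are illustrative, not part of WikiExtractor:

```python
import re

# Whitespace immediately followed by , . ; or : usually means the word
# that should sit before the punctuation is missing.
GAP = re.compile(r"\s[,.;:]")


def count_suspect_gaps(text):
    """Count places where punctuation directly follows whitespace."""
    return len(GAP.findall(text))


broken = 'derived respectively from the Greek , i.e. "anarchy" (from , "anarchos"'
intact = "derived respectively from the Greek ἀναρχία, i.e. anarchy (from ἄναρχος, anarchos"

print(count_suspect_gaps(broken))
print(count_suspect_gaps(intact))
```

A non-zero count only flags a passage for manual review: text that legitimately puts a space before punctuation (for example an ellipsis written as " ...") would also match.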
