Template expansion does not seem to work for french #32

Closed
aadant opened this issue Aug 30, 2015 · 8 comments

Comments

@aadant

aadant commented Aug 30, 2015

First, get the template file as TEMPLATES; this requires parsing the whole dump file.

python extractPage.py --id 275 ../frwiki-20150602-pages-articles.xml.bz2 >aikibudo
python WikiExtractor.py -o extracted --templates ../TEMPLATES -a aikibudo

I get

L' est un art martial traditionnel d'origine japonaise ("budō") essentiellement basé sur des techniques de défense.

Correct sentence

L'aïkibudo (合気武道, aikibudō?) est un art martial traditionnel d'origine japonaise (budō) essentiellement basé sur des techniques de défense.

Wiki text :

L'{{japonais|'''aïkibudo'''|合気武道|aikibudō}} est un [[art martial]] traditionnel d'origine [[japon]]aise (''[[budō]]'') essentiellement basé sur des techniques de défense.

@aadant
Author

aadant commented Aug 30, 2015

I tried to troubleshoot, and part of the problem is that the French template namespace is localized, like this:

Modèle:

instead of

Template:

"Template:" is hardcoded in the Python script. When I fix this, the templating works. However, there are still issues with Lua modules.

Example: Modèle:lang points to Modèle:Langue, which uses Module:Langue!

This requires Lua support, so I guess the pass that collects the templates also needs to collect the Lua modules.
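
The localized-prefix problem described above could be avoided by reading the namespace names from the dump header rather than hardcoding them. The following is only a minimal sketch, not the script's actual code: the `SITEINFO` snippet is a hypothetical, trimmed header, and `namespace_prefixes` and `is_template_title` are illustrative names. It relies on MediaWiki's numeric namespace keys (10 for templates, 828 for Lua modules). A real dump declares an XML namespace on its root element, so the tag lookup would need adjusting there.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed <siteinfo> header such as a frwiki dump might contain.
SITEINFO = """<siteinfo>
  <namespaces>
    <namespace key="0" case="first-letter" />
    <namespace key="10" case="first-letter">Modèle</namespace>
    <namespace key="828" case="first-letter">Module</namespace>
  </namespaces>
</siteinfo>"""

TEMPLATE_NS_KEY = "10"   # MediaWiki's numeric key for the template namespace
MODULE_NS_KEY = "828"    # numeric key for Lua (Scribunto) modules


def namespace_prefixes(siteinfo_xml):
    """Return the localized (template_prefix, module_prefix), each ending in ':'."""
    root = ET.fromstring(siteinfo_xml)
    names = {ns.get("key"): (ns.text or "") for ns in root.iter("namespace")}
    return names[TEMPLATE_NS_KEY] + ":", names[MODULE_NS_KEY] + ":"


template_prefix, module_prefix = namespace_prefixes(SITEINFO)


def is_template_title(title):
    """True if the page title lives in the (localized) template namespace."""
    return title.startswith(template_prefix)
```

With this approach the same pass could also recognise `Module:` pages, which is where the Lua code lives.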

@aadant aadant closed this as completed Aug 30, 2015
@aadant aadant reopened this Aug 30, 2015
@attardi
Owner

attardi commented Aug 30, 2015

The problem is related to that. When loading previously saved templates, it assumes that the template namespace is 'Template'.
I am working on a fix.

@attardi
Owner

attardi commented Aug 30, 2015

It is fixed in release 2.35.

@attardi attardi closed this as completed Aug 30, 2015
@aadant
Author

aadant commented Aug 30, 2015

Thank you for the quick fix. There are other issues, though, and at least one of them is related to the original description. For some reason, the #redirect keywords are lower case in the French dump, while the regex only matches upper case. Adding re.IGNORECASE fixes this:

# check for redirects
m = re.match(r'#REDIRECT.*?\[\[([^\]]*)\]\]', page[0], re.IGNORECASE)
if m:
    redirects[title] = m.group(1) #normalizeTitle(m.group(1))
    return

or even like this

# check for redirects
m = re.match(r'#(REDIRECT|redirect).*?\[\[([^\]]*)\]\]', page[0], re.IGNORECASE)
if m:
    redirects[title] = m.group(2) #normalizeTitle(m.group(2))
    return
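
As a standalone illustration of the case-insensitive redirect check discussed above, here is a small sketch. The helper name `redirect_target` is hypothetical and not part of the script. With `re.IGNORECASE`, the explicit `(REDIRECT|redirect)` alternation is not needed, and the lazy `.*?` also tolerates extra characters between the keyword and the link.

```python
import re

# Case-insensitive redirect pattern; group 1 is the link target.
REDIRECT_RE = re.compile(r'#REDIRECT.*?\[\[([^\]]*)\]\]', re.IGNORECASE)


def redirect_target(first_line):
    """Return the redirect target named on the page's first line, or None."""
    m = REDIRECT_RE.match(first_line)
    return m.group(1) if m else None
```

For example, `redirect_target('#redirect [[Modèle:Langue]]')` gives `'Modèle:Langue'`, while ordinary text gives `None`.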

@aadant
Author

aadant commented Aug 30, 2015

After fixing an issue at line 478, I get :

L'aïkibudo (合気武道, #redirect ) est un art martial traditionnel d'origine japonaise ("budō") essentiellement basé sur des techniques de défense.

@attardi
Owner

attardi commented Aug 30, 2015

Got it, thank you.

@aadant
Author

aadant commented Aug 30, 2015

After fixing

m = re.match(r'#(REDIRECT|redirect).*?\[\[([^\]]*)\]\]', page[0], re.IGNORECASE)

L'aïkibudo (合気武道, ) est un art martial traditionnel d'origine japonaise ("budō") essentiellement basé sur des techniques de défense.

But now the Japanese transliteration is missing. The expected expansion is:

https://fr.wikipedia.org/w/api.php?action=expandtemplates&format=json&prop=wikitext&text={{japonais|%27%27%27a%C3%AFkibudo%27%27%27|%E5%90%88%E6%B0%97%E6%AD%A6%E9%81%93%7Caikibud%C5%8D}}

{"expandtemplates":{"wikitext":"'''a\u00efkibudo'''<span style="font-weight: normal"> (<span class="lang-ja" lang="ja" xml:lang="ja" title="Japonais">\u5408\u6c17\u6b66\u9053, <span class="t_nihongo_romaji" title="Transcription Hepburn"><span class="lang-ja-latn-alalc97" lang="ja-latn-alalc97">aikibud\u014d<span class="t_nihongo_help"><span class="t_nihongo_icon" style="color:#00e;font:bold 80% sans-serif;text-decoration:none;padding:0 .1em;">[[Aide:Japonais|?]])"}}
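
The reference expansion above comes from the wiki's `expandtemplates` API action. A minimal sketch of building such a query URL follows; it only constructs the URL and does not perform the request. The function name `expandtemplates_url` is an assumption for illustration; the endpoint and parameters are the ones in the URL quoted above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

API = "https://fr.wikipedia.org/w/api.php"


def expandtemplates_url(wikitext):
    """Build an expandtemplates query URL, percent-encoding the wikitext."""
    params = {
        "action": "expandtemplates",
        "format": "json",
        "prop": "wikitext",
        "text": wikitext,
    }
    return API + "?" + urlencode(params)


url = expandtemplates_url("{{japonais|'''aïkibudo'''|合気武道|aikibudō}}")
```

Comparing the API's output with the extractor's output for the same wikitext is a quick way to spot which nested template or module fails to expand.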

The missing part comes from a Lua module:

DEBUG: INVOCATION 0 japonais|'''aïkibudo'''|合気武道|aikibudō
DEBUG: TITLE japonais
DEBUG: INVOCATION 1 #if:aikibudō
|, {{lang|ja-Latn-alalc97|aikibudō}}
|, '''aïkibudo'''

DEBUG: TITLE #if:aikibudō
DEBUG: INVOCATION 1 lang|ja-Latn-alalc97|aikibudō
DEBUG: TITLE lang
DEBUG: INVOCATION 2 #invoke:Langue|langue
DEBUG: TITLE #invoke:Langue

see :
https://fr.wikipedia.org/wiki/Module:Langue

Hence the need to extract the modules in the same way as the templates and to evaluate them using Lua
(Lua binaries are required to decode Wikipedia, a nice Russian doll!)
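
To make the debug trace above concrete, here is a rough sketch of how an extractor might detect that an invocation such as `#invoke:Langue|langue` needs a Lua module. `parse_invoke` is a hypothetical helper, not part of the script. It splits naively on `|`, so it would mis-handle arguments that contain nested templates or links. It only identifies the module and function; actually running the module would still need a Lua interpreter.

```python
def parse_invoke(invocation):
    """Split '#invoke:Module|function|args...' into (module, function, args).

    Returns None when the invocation is not an #invoke call.
    """
    head, *rest = invocation.split("|")
    head = head.strip()
    prefix = "#invoke:"
    if not head.lower().startswith(prefix):
        return None
    module = head[len(prefix):].strip()
    function = rest[0].strip() if rest else None
    args = [a.strip() for a in rest[1:]]
    return module, function, args
```

A template-collection pass could use the returned module name to look up the matching `Module:` page in the dump.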

@attardi
Owner

attardi commented Aug 30, 2015

I know. The extensions are written in Lua, and you will need access to the code of those extensions.
