This repository has been archived by the owner on Nov 25, 2019. It is now read-only.

lang-links option while compiling wiki #30

Open
microspace opened this issue May 20, 2013 · 4 comments

@microspace

Wikidata is migrating interlanguage wiki links from individual articles into a central database to ease maintenance.
Detailed information is here: https://en.wikipedia.org/wiki/Wikipedia:Wikidata
All links are now stored in the Wikidata database, which is about 15 GB in size.
The article structure is something like this:

  Q12345
    enwiki: 'water'
    ruwiki: 'вода'

and so on. Each article now has its own identifier, starting with Q.
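For illustration, each such record can be read as a JSON payload once extracted from the dump. The field names below ("entity", "links") follow the parsing script's assumptions, and the Q-number is hypothetical:

```python
import json

# Hypothetical example of one record's JSON payload from the dump;
# the field names ("entity", "links") match what the parsing script
# expects, and the Q-number here is made up for illustration.
record = '{"entity": "Q12345", "links": {"enwiki": "water", "ruwiki": "\\u0432\\u043e\\u0434\\u0430"}}'

item = json.loads(record)
print(item["entity"])           # Q12345
print(item["links"]["ruwiki"])  # вода
```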
I have written a small script that parses that dump and creates a separate xdxf dictionary in "one direction", for example ENG -> RUS,FR: a sort of English-to-Russian dictionary based on Wikipedia.

@itkach
Member

itkach commented May 20, 2013

Interesting.

separate xdxf dictionary, in "one direction"

It doesn't have to be in one direction: there is no reason why you can't have both 'water' -> 'вода' and 'вода' -> 'water'. It's also not necessary to go via the xdxf format, unless you want to use the output with other xdxf programs.
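A minimal sketch of that point: the same link record can yield entries in both directions. The dict layout here is illustrative, not the actual dump format:

```python
# Sketch: one interlanguage-link record can produce dictionary entries
# in both directions. The dict layout is illustrative, not the dump format.
links = {"enwiki": "water", "ruwiki": "вода"}

def both_directions(links, a, b):
    """Yield (headword, translation) pairs for a -> b and b -> a."""
    if a in links and b in links:
        yield (links[a], links[b])
        yield (links[b], links[a])

print(list(both_directions(links, "enwiki", "ruwiki")))
# [('water', 'вода'), ('вода', 'water')]
```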

In any case, your issue description doesn't seem to describe an actual issue or request. What are you suggesting?

@microspace
Author

Yes, of course.
My suggestion is to mention it in http://aarddict.org/aardtools/doc/aardtools.html
and to add this feature to aardtools in the future.
Thank you.

@itkach
Member

itkach commented May 20, 2013

If there's some work to be done then the issue better stay open :) You mentioned you have some code written, care to share?

@itkach itkach reopened this May 20, 2013
@microspace
Author

I had some memory-leak issues, after which I found this article:
Processing every Wikipedia article
They use SAX to parse the wiki dump.
The module used is page_parser.py
Steps:

  1. Download the current Wikidata dump from here
  2. Unpack it with bunzip2
  3. Replace all &quot; entities with "
  4. Run the code below:

$ python psax.py > final.xml
import json
import page_parser

def yourCallback(page):
    try:
        j = json.loads(page.text)
        links = j['links']
        if 'trwiki' in links:
            print '<ar><k>' + links['trwiki'].encode('utf-8') + '</k>'
            print '   <def>' + j['entity'] + '</def>'
            if 'ruwiki' in links:
                print '   <def>ruwiki: <kref>' + links['ruwiki'].encode('utf-8') + '</kref></def>'
            if 'enwiki' in links:
                print '   <def>enwiki: <kref>' + links['enwiki'].encode('utf-8') + '</kref></def>'
            print '</ar>'
    except (ValueError, KeyError):
        # skip pages whose text is not a JSON entity (talk pages etc.)
        pass

page_parser.parseWithCallback("wikidatawiki-20130505-pages-meta-current.xml", yourCallback)

Final step: replace all &amp; entities with &
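The two manual replacements (step 3 above and this final step) could also be done in Python with xml.sax.saxutils.unescape, which handles &amp; by default and accepts extra entities such as &quot; explicitly. The sample string below is hypothetical:

```python
from xml.sax.saxutils import unescape

# The dump stores each JSON payload XML-escaped; undo the &quot; and
# &amp; entities in one pass. The sample string is hypothetical.
escaped = '{&quot;entity&quot;: &quot;Q12345&quot;, &quot;note&quot;: &quot;salt &amp; water&quot;}'
raw = unescape(escaped, {"&quot;": '"'})
print(raw)
# {"entity": "Q12345", "note": "salt & water"}
```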
