Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
mediawiki dump parser for loading up wikipedia data
Go
Branch: master

README.markdown

go-wikiparse

If you're like me, then you enjoy playing with lots of textual data and scour the internet for sources of it.

mediawiki's dumps are a pretty awesome chunk that's fun to work with.

Installation

go get github.com/dustin/go-wikiparse

Usage

The parser takes any io.Reader as a source assuming it's a complete XML dump and lets you pull wikiparse.Page objects out of it. These typically arrive as bzip2 files, so I make my program open the file and set up a bzip reader over it and all that. But you don't need to do that if you want to read off of stdin. Here's a complete example that emits page titles from a decompressing stream on stdin:

package main

import (
    "fmt"
    "os"

    "github.com/dustin/go-wikiparse"
)

func main() {
    p, err := wikiparse.NewParser(os.Stdin)
    if err != nil {
        fmt.Fprintf(os.Stderr, "Error setting up parser", err)
        os.Exit(1)
    }

    for err == nil {
        var page *wikiparse.Page
        page, err = p.Next()
        if err == nil {
            fmt.Println(page.Title)
        }
    }
}

Example invocation:

bzcat enwiki-20120211-pages-articles.xml.bz2 | ./sample

Geographical Information

Because it's interesting to me, I wrote a parser for the wikiproject geographical coordinates that are found on many pages. Use this on the page's content to find out if it's a place or not. Then go there.

Something went wrong with that request. Please try again.