Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP
Uses jQuery to return a structured JSON representation of a Wikipedia article.
Branch: master
Pull request Compare This branch is 4 commits behind bcoe:master.

Fetching latest commit…

Cannot retrieve the latest commit at this time

Failed to load latest commit information.
bin
lib
test
.gitignore
README.md
package.json

README.md

WikiFetch

Author: @benjamincoe

Problem

For some NLP research I'm currently doing, I was interested in parsing structured information from Wikipedia articles.

I did not want to use a full-featured MediaWiki parser:

  • this would be heavy-handed, all I really wanted was: the text contents from articles, images, and links to other articles.
  • I wanted to be able to extend the approach to other websites, e.g., news sites.
  • I wanted to use a crawler-based approach, rather than downloading a massive dataset.

The Solution

WikiFetch Crawls a Wikipedia article using Node.js and jQuery. It returns a structured JSON-representation of the page:

    {
        "title": "Foobar Article",
        "links": {
            "Link_to_another_article: {
                "text": "Another article.", // the text that was linked.
                "title": "Another_article.", // title attribute <a/> tag.
                "occurrences": 1 // number of times this article was linked.
            }
        },
        "sections": {
            "Section Heading": {
                text: "text contents of section.",
                images: ["http://foobar.jpg"] // images occurring within this section.
            }
        }
    }
  • Links within sections are replaced with [[article name]], which will have a corresponding entry in links.

Usage

npm install wikifetch -g
wikifetch --article=Dogs
Something went wrong with that request. Please try again.