Uses jQuery to return a structured JSON representation of a Wikipedia article.
Latest commit 41bfcb9 Jan 8, 2015 @bcoe switched jquery to cheerio.


Author: @benjamincoe


For some NLP research I'm currently doing, I was interested in parsing structured information from Wikipedia articles.

I did not want to use a full-featured MediaWiki parser:

  • This would be heavy-handed; all I really wanted was the text content of articles, their images, and links to other articles.
  • I wanted to be able to extend the approach to other websites, e.g., news sites.
  • I wanted to use a crawler-based approach, rather than downloading a massive dataset.

The Solution

WikiFetch crawls a Wikipedia article using Node.js and jQuery. It returns a structured JSON representation of the page:

    {
        "title": "Foobar Article",
        "links": {
            "Link_to_another_article": {
                "text": "Another article.", // the text that was linked.
                "title": "Another_article.", // title attribute of the <a/> tag.
                "occurrences": 1 // number of times this article was linked.
            }
        },
        "sections": {
            "Section Heading": {
                "text": "text contents of section.",
                "images": ["http://foobar.jpg"] // images occurring within this section.
            }
        }
    }
  • Links within sections are replaced with [[article name]], which will have a corresponding entry in links.
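As a rough illustration of how the [[article name]] markers relate to the links entries, here is a minimal sketch that swaps each marker back for its linked text. The sample object and the expandLinks helper are hypothetical and not part of WikiFetch, and the sketch assumes the name inside a marker is exactly the key used in links.

```javascript
// Hypothetical sample, shaped like the structure described above.
const article = {
  title: "Foobar Article",
  links: {
    "Another_article": {
      text: "another article",
      title: "Another article",
      occurrences: 1
    }
  },
  sections: {
    "Section Heading": {
      text: "This section mentions [[Another_article]] once.",
      images: ["http://foobar.jpg"]
    }
  }
};

// Hypothetical helper: replace each [[key]] marker in a section's text
// with the `text` of the matching entry in `links`.
// Assumption: the marker content is the same string as the key in `links`.
// Markers with no matching entry are left untouched.
function expandLinks(article) {
  const expanded = {};
  for (const [heading, section] of Object.entries(article.sections)) {
    expanded[heading] = section.text.replace(
      /\[\[([^\]]+)\]\]/g,
      (marker, key) => {
        const link = article.links[key];
        return link ? link.text : marker;
      }
    );
  }
  return expanded;
}

console.log(expandLinks(article));
```

With the sample above, the text under "Section Heading" becomes "This section mentions another article once." Leaving unmatched markers in place makes it easy to spot any link that is missing from links.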


Usage

npm install wikifetch -g
wikifetch --article=Dog