Uses jQuery to return a structured JSON representation of a Wikipedia article.
JavaScript
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
bin
lib
test
.gitignore
README.md
package.json

README.md

WikiFetch

Author: @benjamincoe

Problem

For some NLP research I'm currently doing, I was interested in parsing structured information from Wikipedia articles.

I did not want to use a full-featured MediaWiki parser:

  • this would be heavy-handed, all I really wanted was: the text contents from articles, images, and links to other articles.
  • I wanted to be able to extend the approach to other websites, e.g., news sites.
  • I wanted to use a crawler-based approach, rather than downloading a massive dataset.

The Solution

WikiFetch Crawls a Wikipedia article using Node.js and jQuery. It returns a structured JSON-representation of the page:

	{
		"title": "Foobar Article",
		"links": {
			"Link_to_another_article: {
				"text": "Another article.", // the text that was linked.
				"title": "Another_article.", // title attribute <a/> tag.
				"occurrences": 1 // number of times this article was linked.
			}
		},
		"sections": {
			"Section Heading": {
				text: "text contents of section.",
				images: ["http://foobar.jpg"] // images occurring within this section.
			}
		}
	}
  • Links within sections are replaced with [[article name]], which will have a corresponding entry in links.

Usage

npm install wikifetch -g
wikifetch --article=Dog