WikiFetch

Problem

For some NLP research I'm currently doing, I was interested in parsing structured information from Wikipedia articles.

I did not want to use a full-featured MediaWiki parser:

this would be heavy-handed, all I really wanted was: the text contents from articles, images, and links to other articles.
I wanted to be able to extend the approach to other websites, e.g., news sites.
I wanted to use a crawler-based approach, rather than downloading a massive dataset.

The Solution

WikiFetch Crawls a Wikipedia article using Node.js and jQuery. It returns a structured JSON-representation of the page:

	{
		"title": "Foobar Article",
		"links": {
			"Link_to_another_article: {
				"text": "Another article.", // the text that was linked.
				"title": "Another_article.", // title attribute <a/> tag.
				"occurrences": 1 // number of times this article was linked.
			}
		},
		"sections": {
			"Section Heading": {
				text: "text contents of section.",
				images: ["http://foobar.jpg"] // images occurring within this section.
			}
		}
	}

Links within sections are replaced with [[article name]], which will have a corresponding entry in links.

Usage

npm install wikifetch -g
wikifetch --article=Dog

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
bin		bin
lib		lib
test		test
.gitignore		.gitignore
README.md		README.md
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WikiFetch

Problem

The Solution

Usage

About

Releases

Packages

Languages

bcoe/wikifetch

Folders and files

Latest commit

History

Repository files navigation

WikiFetch

Problem

The Solution

Usage

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages