GitHub - abdnh/getting-to-philosophy: Finds a path from any Wikipedia article to the Philosophy article

A Wikipedia crawler that finds ~~the meaning of life~~ the chain of articles produced by starting from a given article, following the first article link in its main text, and repeating the process on each subsequent article until the Philosophy article is reached.

It's known that applying this process to most articles in the English Wikipedia eventually leads to Philosophy (see Wikipedia:Getting to Philosophy).

The Rules of the Game

The Wikipedia article on the subject lists the following rules for following links:

Clicking on the first non-parenthesized, non-italicized link
Ignoring external links, links to the current page, or red links (links to non-existent pages)
Stopping when reaching "Philosophy", a page with no links or a page that does not exist, or when a loop occurs
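The stopping conditions above can be sketched as a small loop. This is only an illustration, not the code in this repo: `follow_first_links`, the `first_link` callback, and the toy link table are hypothetical names, and the exact meaning of `limit` here (maximum number of links followed) is an assumption.

```python
from typing import Callable, List, Optional


def follow_first_links(
    start: str,
    first_link: Callable[[str], Optional[str]],
    target: str = "Philosophy",
    limit: int = 1000,
) -> Optional[List[str]]:
    """Follow first links from `start` until `target` is reached.

    `first_link(title)` is a hypothetical callback that returns the first
    valid link of a page, or None for a missing page or a page with no links.
    Returns the visited titles in order, or None if the walk stops early.
    """
    path = [start]
    seen = {start}
    current = start
    while current != target:
        if len(path) - 1 >= limit:
            return None  # gave up: too many links followed
        nxt = first_link(current)
        if nxt is None:
            return None  # dead end: no links or non-existent page
        if nxt in seen:
            return None  # loop detected
        path.append(nxt)
        seen.add(nxt)
        current = nxt
    return path


# Toy data standing in for Wikipedia (made up for this example).
TOY_FIRST_LINKS = {
    "Foobar": "Placeholder_name",
    "Placeholder_name": "Mathematics",
    "Mathematics": "Philosophy",
    "A": "B",
    "B": "A",
}

print(follow_first_links("Foobar", TOY_FIRST_LINKS.get))
print(follow_first_links("A", TOY_FIRST_LINKS.get))  # loop, so None
```

Passing the link lookup as a callback keeps the stopping logic separate from how pages are fetched and parsed.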

The first rule is not implemented by api_crawler.py.

Additionally, both web_crawler.py and api_crawler.py ignore namespace pages, though this can be changed from common.py.
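One simple way to recognize namespace pages is to check the title prefix. The sketch below is an assumption about how such a filter could look; the prefix list and the `is_namespace_title` helper are hypothetical and not taken from common.py.

```python
# Hypothetical, incomplete list of namespace prefixes to skip.
NAMESPACE_PREFIXES = (
    "Wikipedia:",
    "File:",
    "Help:",
    "Category:",
    "Template:",
    "Portal:",
    "Special:",
    "Talk:",
)


def is_namespace_title(title: str, prefixes: tuple = NAMESPACE_PREFIXES) -> bool:
    """Return True if the title starts with one of the namespace prefixes."""
    return title.startswith(prefixes)


print(is_namespace_title("Wikipedia:Getting_to_Philosophy"))  # True
print(is_namespace_title("Philosophy"))                       # False
```

Matching on a fixed prefix list rather than on any colon keeps ordinary article titles that contain a colon from being filtered out.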

Usage

There are two separate implementations of this process in this repo:

web_crawler.py: a web scraping implementation
api_crawler.py: an implementation using the MediaWiki API. This implementation actually follows all article links and not just the first one. There is no way to return links in their article order using the API, as far as I know.
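For reference, a links query against the MediaWiki Action API can be built as below. This is a minimal sketch that only constructs the request URL; the `build_links_query` helper is hypothetical, and api_crawler.py may issue its requests differently. As far as I know, `prop=links` results come back sorted by namespace and title rather than by position in the article, which is why the first-link rule cannot be applied to them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_links_query(title: str, limit: str = "max") -> str:
    """Build a URL asking the API for the article links on one page."""
    params = {
        "action": "query",
        "prop": "links",
        "titles": title,
        "plnamespace": 0,  # main (article) namespace only
        "pllimit": limit,  # as many links per request as allowed
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = build_links_query("C (programming language)")
print(url)
print(parse_qs(urlsplit(url).query)["titles"])
```

A page with many links needs follow-up requests using the continuation values the API returns; that part is left out here.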

web_crawler.py

This script provides the go_to_philosophy function that returns a list of articles starting from a given title to Philosophy (if there is any path leading there):

>>> from web_crawler import go_to_philosophy

>>> path = go_to_philosophy("Foobar")
>>> print(path)
['Foobar', 'Metasyntactic_variable', 'Placeholder_name', 'Free_variables_and_bound_variables', 'Mathematics', 'Epistemology', 'Ancient_Greek_language', 'Greek_language', 'Indo-European_languages', 'Language_family', 'Language', 'Communication', 'Self', 'Consciousness', 'Sentience', 'Emotion', 'Mental_state', 'Mind', 'Phenomenon', 'Philosophy']

The function accepts an optional limit argument to limit the recursion level before giving up. The default is 1000.

>>> import sys
>>> from web_crawler import go_to_philosophy

# go as far as we can in pursuit of philosophy!
>>> path = go_to_philosophy("Kasiski examination", limit=sys.maxsize)
>>> print(path)
['Kasiski examination', 'Cryptanalysis', 'Information_system', 'Sociotechnical', 'Organizational_development', 'Organizational_change', 'Human_behavior', 'Human', 'Species', 'Biology', 'Science', 'Scientific_method', 'Empirical_evidence', 'Proposition', 'Logic', 'Reason', 'Consciousness', 'Sentience', 'Emotion', 'Mental_state', 'Mind', 'Phenomenon', 'Philosophy']

There is also a more general go_to function that takes an additional argument for a destination different from Philosophy.

>>> from web_crawler import go_to

>>> path = go_to("Foobar", "Mathematics")
>>> print(path)
['Foobar', 'Metasyntactic_variable', 'Placeholder_name', 'Free_variables_and_bound_variables', 'Mathematics']

The scraped pages are cached in the pages directory.
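A file-based page cache of this kind might look like the sketch below. The file naming scheme, the `cached_page` helper, and the `fetch` callback are assumptions made for illustration; web_crawler.py may lay out the pages directory differently.

```python
import tempfile
from pathlib import Path
from typing import Callable
from urllib.parse import quote


def cached_page(
    title: str,
    fetch: Callable[[str], str],
    cache_dir: Path = Path("pages"),
) -> str:
    """Return the HTML for `title`, fetching it only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Percent-encode the title so it is safe to use as a file name.
    path = cache_dir / (quote(title, safe="") + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(title)
    path.write_text(html, encoding="utf-8")
    return html


# Demo with a fake fetcher and a temporary directory (no network access).
calls = []


def fake_fetch(title: str) -> str:
    calls.append(title)
    return f"<html>{title}</html>"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_page("Foobar", fake_fetch, Path(tmp))
    second = cached_page("Foobar", fake_fetch, Path(tmp))

print(calls)  # the fetcher ran once; the second call hit the cache
```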

api_crawler.py

This script implements a variant of the problem that follows all article links and not just the first one. It returns the first path to Philosophy it finds. It also crawls faster due to the use of the API.

>>> from api_crawler import PhilosophyCrawler
>>> philosophy_crawler = PhilosophyCrawler()
>>> path = philosophy_crawler.path_to_philosophy("C (programming language)")
>>> print(path)
['C (programming language)', '"Hello, World!" program', '.deb', 'Deb (file format)', '.NET Framework', '.NET', '.NET Bio', '.NET Foundation', '.NET Compact Framework', 'A Sharp (.NET)', 'Software design', 'Agency (philosophy)', 'Philosophy']

The PhilosophyCrawler class inherits from the more general PathToCrawler and WikipediaLinkCrawler classes. Read the source for more details.
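The all-links variant amounts to a graph search over article links. The README does not state which traversal order the crawler uses, so the sketch below picks breadth-first search as an assumption; `find_path`, `links_of`, and the toy graph are hypothetical and not the repo's classes.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional


def find_path(
    start: str,
    target: str,
    links_of: Callable[[str], Iterable[str]],
    max_pages: int = 10000,
) -> Optional[List[str]]:
    """Breadth-first search from `start` to `target` over article links."""
    parents: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue and len(parents) <= max_pages:
        current = queue.popleft()
        if current == target:
            # Walk the parent pointers back to the start.
            path: List[str] = []
            node: Optional[str] = current
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in links_of(current):
            if nxt not in parents:  # skip pages already discovered
                parents[nxt] = current
                queue.append(nxt)
    return None


# Toy link graph standing in for API results (made up for this example).
TOY_LINKS = {
    "C": ["Hello", "Software design"],
    "Hello": ["C"],
    "Software design": ["Agency"],
    "Agency": ["Philosophy"],
}


def toy_links_of(title: str) -> List[str]:
    return TOY_LINKS.get(title, [])


print(find_path("C", "Philosophy", toy_links_of))
```

Breadth-first order returns a path with the fewest links; a depth-first crawler would instead return whichever path it happens to reach first.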

Visualization

See neo4j_demo for an example of using the Neo4j graph database to store links between articles and visualize them.

References

Wikipedia:Getting_to_Philosophy#External_links
loneliness.one/philosophy: I used this site to test my implementation. My web crawler script reaches Philosophy from Logic faster due to a bug in that site in handling parenthesized links!
