ddrichman/wikicategories

wikicategories contains a set of data files and the scripts and programs that process them.

HOW TO USE:
skos_categories_en.nt is a very large text file from dbpedia.org that lists the relationships between all categories in the English Wikipedia.
article_categories_en.nt is a text file from dbpedia.org, lightly processed by me, that lists articles together with their categories.

WARNING: Because they operate on large files, many of these programs take a long time to run (3-5 minutes on a fast server). Some load ~1-2GB text files into memory.

RUN THE PROGRAMS IN THIS ORDER!!!

source_files/skos_parser.sh:
	- reads skos_categories_en.nt
	- generates categories_parsed.txt [the skos file with a lot of fluff removed]
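
As a rough illustration of the kind of "fluff removal" described above, here is a minimal Python sketch. It is not the actual skos_parser.sh. It assumes the input uses N-Triples lines with the SKOS "broader" predicate and the dbpedia.org resource prefix, and the (category, parent) pair it returns is a hypothetical stand-in for whatever format categories_parsed.txt really uses.

```python
import re

# Matches one N-Triples line whose subject, predicate, and object are all IRIs.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')
# Assumed prefix for category resources; not taken from the repository.
CATEGORY_PREFIX = "http://dbpedia.org/resource/Category:"


def parse_broader(line):
    """Return (category, parent) for a 'broader' triple, or None otherwise."""
    match = TRIPLE.match(line)
    if match is None:
        return None  # comment line, or a triple whose object is a literal
    subject, predicate, obj = match.groups()
    if not predicate.endswith("#broader"):
        return None
    if not (subject.startswith(CATEGORY_PREFIX) and obj.startswith(CATEGORY_PREFIX)):
        return None
    return subject[len(CATEGORY_PREFIX):], obj[len(CATEGORY_PREFIX):]


# Illustrative sample lines, not copied from the real data file.
sample = [
    '<http://dbpedia.org/resource/Category:Algebra> '
    '<http://www.w3.org/2004/02/skos/core#broader> '
    '<http://dbpedia.org/resource/Category:Mathematics> .',
    '<http://dbpedia.org/resource/Category:Algebra> '
    '<http://www.w3.org/2004/02/skos/core#prefLabel> "Algebra"@en .',
]
pairs = [p for p in map(parse_broader, sample) if p is not None]
```

Only the first sample line survives, because the second carries a label rather than a category-to-category relationship.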

source_files/categories_supercats_scores (a C++ program, compile with -std=c++11 -O2):
	- reads categories_parsed.txt
	- writes to stdout (you should redirect to CategoriesSupercats.txt)

	- scores the relationships between Wikipedia categories and a specified list of Wikipedia top-level categories (I refer to these as supercats).
	- gives the number of hops between each category and each of the ~23 supercats. One breadth-first search is run for each supercat, and one more is used to print out all the results.
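
The hop counting described above can be sketched in Python as follows. This is only an illustration of one breadth-first search per supercat, not the C++ program itself. The direction of traversal (from each supercat down through its subcategories), the dictionary layout, and the sample category names are all assumptions.

```python
from collections import defaultdict, deque


def hops_from(supercat, children):
    """Breadth-first search from one supercat down through its subcategories."""
    hops = {supercat: 0}
    queue = deque([supercat])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in hops:  # first visit is the shortest path in BFS
                hops[child] = hops[current] + 1
                queue.append(child)
    return hops


def score_categories(broader_pairs, supercats):
    """Run one BFS per supercat over (category, parent) pairs."""
    children = defaultdict(list)
    for category, parent in broader_pairs:
        children[parent].append(category)
    return {s: hops_from(s, children) for s in supercats}


# Hypothetical data; the last pair creates a cycle, which the
# visited check in hops_from handles without looping forever.
pairs = [
    ("Algebra", "Mathematics"),
    ("Linear_algebra", "Algebra"),
    ("Mathematics", "Science"),
    ("Physics", "Science"),
    ("Science", "Physics"),
]
scores = score_categories(pairs, ["Mathematics", "Science"])
```

Here scores["Science"]["Linear_algebra"] is 3 and scores["Mathematics"]["Linear_algebra"] is 2. A category that cannot be reached from a supercat is simply absent from that supercat's table.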

source_files/article_categories_parser.sh:
	- reads article_categories_en.nt
	- generates ArticleCategories.txt [lot of fluff removed]
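
To show what a stripped-down article list might look like, the following Python sketch groups categories by article. It assumes the input is made of N-Triples lines with dbpedia.org resource IRIs, which may not match the processed article_categories_en.nt exactly, and the dictionary it builds is a hypothetical stand-in for the real ArticleCategories.txt layout.

```python
import re
from collections import defaultdict

# Captures the subject and object IRIs of a triple and ignores the predicate.
TRIPLE = re.compile(r'^<([^>]*)>\s+<[^>]*>\s+<([^>]*)>\s*\.\s*$')
# Assumed prefixes; not taken from the repository.
RESOURCE = "http://dbpedia.org/resource/"
CATEGORY = RESOURCE + "Category:"


def group_article_categories(lines):
    """Map each article title to the list of categories it belongs to."""
    grouped = defaultdict(list)
    for line in lines:
        match = TRIPLE.match(line)
        if match is None:
            continue  # skip comments and malformed lines
        article, category = match.groups()
        if article.startswith(RESOURCE) and category.startswith(CATEGORY):
            grouped[article[len(RESOURCE):]].append(category[len(CATEGORY):])
    return dict(grouped)


# Illustrative sample lines, not copied from the real data file.
subject = "<http://purl.org/dc/terms/subject>"
sample = [
    "# comment line",
    f"<{RESOURCE}Anarchism> {subject} <{CATEGORY}Anarchism> .",
    f"<{RESOURCE}Anarchism> {subject} <{CATEGORY}Political_ideologies> .",
    f"<{RESOURCE}Algebra> {subject} <{CATEGORY}Mathematics> .",
]
articles = group_article_categories(sample)
```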

source_files/article_supercats (C++ program, compile with -std=c++11 -O2):
	- reads ArticleCategories.txt and CategoriesSupercats.txt
	- scores the relationships between Wikipedia articles and supercats, using the list of categories to which each article belongs and the scores generated by categories_supercats_scores
	- generates ArticlesSupercats.txt
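
The exact scoring formula used by article_supercats is not described here, so the Python sketch below is only one plausible way to combine the inputs. It assumes an article's score for a supercat is the smallest hop count among the article's categories. Both that rule and the data layout are assumptions, not the program's documented behavior.

```python
def score_article(categories, category_hops):
    """Smallest hop count to each supercat across an article's categories.

    The 'take the minimum' rule is an assumption made for illustration.
    """
    best = {}
    for supercat, hops in category_hops.items():
        reachable = [hops[c] for c in categories if c in hops]
        if reachable:
            best[supercat] = min(reachable)
    return best


# Hypothetical per-supercat hop tables, in the shape a BFS pass might produce.
category_hops = {
    "Mathematics": {"Algebra": 1, "Linear_algebra": 2},
    "Science": {"Algebra": 2, "Physics": 1},
}
article_scores = score_article(["Linear_algebra", "Physics"], category_hops)
unknown_scores = score_article(["Unlisted_category"], category_hops)
```

An article whose categories never reach a given supercat gets no entry for it, so unknown_scores is empty.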

source_files/sort_articles_supercats.sh:
	- reads and writes ArticlesSupercats.txt
	- sorts that file using the order expected by lookup_article.pl
	- NOTE: YOU MUST RUN THIS before using the Python lookup system, or some queries will FAIL SILENTLY!
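
The silent failure mentioned above is typical of binary search on unsorted data: the search finishes normally but reports a miss for an entry that is actually present. The short Python sketch below demonstrates this with the standard bisect module. The titles are made up, and the real lookup code may behave differently in detail.

```python
import bisect


def contains(keys, key):
    """Binary-search membership test; only correct if keys is sorted."""
    i = bisect.bisect_left(keys, key)
    return i < len(keys) and keys[i] == key


titles = ["Zebra", "Anarchism", "Mathematics"]  # deliberately unsorted

found_unsorted = contains(titles, "Anarchism")        # no error, wrong answer
found_sorted = contains(sorted(titles), "Anarchism")  # correct once sorted
```

No exception is raised in the unsorted case; the lookup just returns False, which is why the sort step is easy to forget.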

wikicategories_pylookup/py_categorizer2.py:
	- takes a query string as its arguments; queries the Solr db on seine and searches ArticlesSupercats.txt for relevant articles

[OLD SCRIPTS]
lookup_article.pl:
	- reads ArticlesSupercats.txt
	- uses a fast binary search to look up the desired article (this requires the file to have been sorted beforehand by sort_articles_supercats.sh).

	- Usage: ./lookup_article.pl Anarchism
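
For readers curious how a binary search over a large sorted text file can work without loading it into memory, here is a minimal Python sketch. It is not a port of lookup_article.pl. It assumes the file is sorted bytewise and that each line starts with the article title followed by a tab; both are assumptions, as are the function names and sample rows.

```python
import os
import tempfile


def next_full_line(f, offset):
    """Return the first complete line starting after offset (or at offset 0)."""
    f.seek(offset)
    if offset > 0:
        f.readline()  # discard the line that contains this offset
    return f.readline()


def lookup_line(path, title):
    """Binary-search byte offsets of a bytewise-sorted file for a title."""
    key = title.encode("utf-8")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            line = next_full_line(f, mid)
            # An empty read means end of file, which counts as "past the key".
            if not line or line.split(b"\t", 1)[0].rstrip(b"\n") >= key:
                hi = mid
            else:
                lo = mid + 1
        line = next_full_line(f, lo)
    if line and line.split(b"\t", 1)[0].rstrip(b"\n") == key:
        return line.decode("utf-8").rstrip("\n")
    return None


# Hypothetical rows in an assumed "title<TAB>scores" layout.
rows = [b"Algebra\tMathematics:2\n", b"Anarchism\tPolitics:1\n", b"Zebra\tNature:3\n"]
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.writelines(rows)
hit = lookup_line(tmp.name, "Anarchism")
miss = lookup_line(tmp.name, "Botany")
os.remove(tmp.name)
```

Each probe reads at most two lines, so the number of reads grows with the logarithm of the file size rather than with the file size itself.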

(perl_categorizer.pl: old version of py_categorizer2.py. Use py_categorizer2.py instead.)

About

Scripts and programs to compute a mapping table from Wikipedia articles to their top-level categories and poll a Solr wikipedia db.
