ddrichman/wikicategories
wikicategories contains a set of data files, plus the scripts and programs that process them.

HOW TO USE:

skos_categories_en.nt is a very large text file from dbpedia.org that lists the relationships between all categories in the English Wikipedia.

article_categories_en.nt is a text file from dbpedia.org, lightly processed by me, that lists articles along with their categories.

WARNING: Because they operate on large files, many of these programs take a long time to run (3-5 minutes on a fast server). Some load ~1-2GB text files into memory.

RUN THE PROGRAMS IN THIS ORDER!!!

source_files/skos_parser.sh:
- reads skos_categories_en.nt
- generates categories_parsed.txt [the skos file with a lot of fluff removed]

source_files/categories_supercats_scores (a C++ program, compile with -std=c++11 -O2):
- reads categories_parsed.txt
- writes to stdout (you should redirect to CategoriesSupercats.txt)
- scores the relationships between Wikipedia categories and a specified list of Wikipedia top-level categories
- gives the number of hops between each category and each of ~23 Wikipedia top-level categories (I refer to these as supercats). Multiple breadth-first searches are used: one for each of the 23 supercats, plus a 24th to print them all out.
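
The hop counting described for categories_supercats_scores can be sketched in Python as follows. This is only an illustration of one breadth-first search per supercat, not the C++ program itself: the edge list, the category names, the `hops_from` helper, and the use of -1 for "unreachable" are all assumptions made for the example, and the real format of categories_parsed.txt is not shown here.

```python
from collections import deque

def hops_from(supercat, narrower):
    """Breadth-first search: map each reachable category to its hop count."""
    dist = {supercat: 0}
    queue = deque([supercat])
    while queue:
        cat = queue.popleft()
        for child in narrower.get(cat, ()):
            if child not in dist:  # first visit is the shortest path in BFS
                dist[child] = dist[cat] + 1
                queue.append(child)
    return dist

# Hypothetical (category, broader category) pairs standing in for parsed data.
edges = [
    ("Physics", "Science"),
    ("Optics", "Physics"),
    ("Lasers", "Optics"),
    ("Lasers", "Technology"),
    ("Painting", "Arts"),
]

# Index the graph from broader to narrower so each search starts at a supercat.
narrower = {}
for child, parent in edges:
    narrower.setdefault(parent, []).append(child)

supercats = ["Science", "Technology", "Arts"]  # the real list has ~23 entries
hops = {s: hops_from(s, narrower) for s in supercats}

# Final pass: print one row per category, one column per supercat.
for cat in sorted({child for child, _ in edges}):
    print(cat, [hops[s].get(cat, -1) for s in supercats])
```

Running one search per supercat keeps each pass simple and gives a full row of distances per category, at the cost of walking the graph once for every supercat.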
source_files/article_categories_parser.sh:
- reads article_categories_en.nt
- generates ArticleCategories.txt [lot of fluff removed]

source_files/article_supercats (a C++ program, compile with -std=c++11 -O2):
- reads ArticleCategories.txt and CategoriesSupercats.txt
- scores relationships between Wikipedia articles and supercats, using the list of categories to which each article belongs and the set of scores generated by categories_supercats_scores
- generates ArticlesSupercats.txt

source_files/sort_articles_supercats.sh:
- reads and writes ArticlesSupercats.txt
- sorts that file into the order expected by lookup_article.pl
- NOTE: YOU MUST RUN THIS before using the python lookup system or some queries will FAIL SILENTLY!

wikicategories_pylookup/py_categorizer2.py:
- pass a query string as the argument; the script queries the Solr db on seine and searches ArticlesSupercats.txt for relevant articles

[OLD SCRIPTS]

lookup_article.pl:
- reads ArticlesSupercats.txt
- uses a fast binary search (which depends on the file having been sorted by sort_articles_supercats.sh) to look up the desired article
- Usage: ./lookup_article.pl Anarchism

perl_categorizer.pl:
- old version of py_categorizer2.py; use py_categorizer2.py instead
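
Why the sort step matters for lookups can be shown with a small Python sketch of a binary search over title-keyed lines. It is not the code of lookup_article.pl or py_categorizer2.py: the tab-separated line layout, the sample scores, and the `lookup` helper are assumptions made for illustration only.

```python
def lookup(sorted_lines, title):
    """Binary search for a line whose first tab-separated field equals title."""
    lo, hi = 0, len(sorted_lines)
    while lo < hi:
        mid = (lo + hi) // 2
        key = sorted_lines[mid].split("\t", 1)[0]
        if key < title:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sorted_lines) and sorted_lines[lo].split("\t", 1)[0] == title:
        return sorted_lines[lo]
    return None  # not found; no error is raised

# Hypothetical records: article title, a tab, then made-up hop scores.
lines = ["Anarchism\t3 5 2", "Zebra\t4 1 6", "Autism\t2 4 3"]

# Sort with the same comparison the search uses (plain string order on the title).
by_title = sorted(lines, key=lambda line: line.split("\t", 1)[0])

print(lookup(by_title, "Autism"))  # searched in the sorted data
print(lookup(lines, "Autism"))     # searched in the unsorted data
```

On the unsorted list the search simply returns nothing for an article that is present, which is the kind of silent failure the NOTE above warns about. The sort order also has to match the comparison used by the search; a locale-aware sort paired with a byte-wise comparison can produce the same kind of miss.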