|Failed to load latest commit information.|
|c||better comments about uncles|
|data||updated gitignore so empty data dir is pushed|
|py||there was a subtle bug in py/graph.py that added a space after lines …|
|.gitignore||proof-of-concept sys/queue.h usage with rbt nodes|
I've been meaning to hack on Wikipedia's link data so this is kind of the result of that. It's an ongoing process since there's so much data and many different things to do with it. Note: I'm running this project on a Thinkpad T410 Intel i5 with 8GB of RAM. If you have less RAM, it's possible that the final graph data won't fit in memory. I'm hoping to spend some time researching & developing a kind of in-memory and on-disk graph structure. A graph database would probably be valuable for this dataset. Also note: If you don't want to parse and generate gigabytes of data, I uploaded the core graph data that this project uses. You can download the files here. You'll need both graph.tar.bz2 and keys.tar.bz2 and they need to be extracted to the wikigraph/data/ directory. If you just use those two files, the only script you need to use is graph.py. Starting From Scratch This project uses the April 2011 pages-articles.xml.bz2 database dump from http://en.wikipedia.org/wiki/Wikipedia:Database_download After extracting the XML file, the first thing to do is replace the <mediawiki></mediawiki> tags with <xml></xml> tags so the SAX parser can do its job properly. This is how I did it: $ perl -pi -e 's/<mediawiki[^>]+>/<xml>/' enwiki.xml $ perl -pi -e 's/<\/mediawiki>/<\/xml>/' enwiki.xml enwiki.xml is just a renamed pages-articles.xml file. The next step is to run the SAX parser on the XML file. Since it's massive, it'll take quite awhile. It takes my T410 Intel i5 laptop about 40 minutes or so. The RAM footprint is negligible since it's a SAX parser. I was running at 100% CPU with 0.1% RAM used. You'll have to modify the filename at the bottom of saxparser.py since it's hardcoded for my path at the moment. Once it's done parsing, there will be a file called "pages_links". That's the raw graph data formatted like so: "page0","link00","link01","link02",...,"link0N"\n "page1","link10","link11","link12",...,"link1N"\n ... Each page has a list of links following it from that page. This is a page's adjacency list. Parse Page Titles The next script to run is keys.py. It will parse pages_links for page titles by using csv.reader(f, delimiter=',', quotechar='"') and reading line by line. The first element in a given row/line is the page title. Each page title is cleaned up a bit and some pages (Files:/Wikipedia:/Category:) are ignored completely. The page title is then used as a key into a defaultdict(int) object so I only have unique keys. I then sort all the keys and write them to disk in the file "data/keys". Parse Adjacent Pages/Links You then need to run values.py to parse each page's links and generate a usable link graph. This script starts by loading the data/keys file so I can do vertex lookups by page title. Each vertex has an adjacency set because there will be some duplicates and a set is an easy way to manage unique indexes. The script then parses the pages_links file for each page's links and adds them to the appropriate vertex's adjacency set. The adjacency sets use integer indexes looked up from the keytable by page title. This helps keep the memory footprint to a more manageable level. Only links that are stored in the keytable are allowed to be added to an adjacency set for a given vertex. Each vertex's adjacency set is then written to disk as "data/graph". A single line in the graph file is formatted like so: v1 v2 v3 v4\n Each adjacent vertex is an integer and separated by a space. A newline char indicates the end of that adjacency list. Each line maps to a vertex in the graph: line 0 in the graph file maps to vertex 0 in the graph. That means that line 0 contains vertex 0's adjacency list. Running Graph Algorithms You can either run graph.py directly: $ python graph.py or you can run it in the Python interpretor: $ python >>> from graph import WikiGraph >>> g = WikiGraph() >>> src = g.keytable["nazism"] >>> dst = g.keytable["philosophy"] >>> g.bfs(src) >>> g.path(src, dst) Some ideas for future work 1. Determine backlinks of each wikipedia page. 2. Web-based interface for running graph algorithms on the dataset. 3. Some kind of graph database. I calculated that pre-computing all possible paths in the link graph would require storage on the order of 300 TBs. That means for now I have to run algorithms on-the-fly and that can be slow if I'm trying to service web requests. Not to mention that each time a BFS is run, it has to allocate storage for predecessor trees. A few requests in a row would kill a server. 4. ??? 5. PROFIT!