
Using Map-Reduce to implement indexing on Wikipedia XML Dumps


dawn360/wikiDB-MRIndexing


Indexing Wiki Dumps

This indexing algorithm is a variation of the Dean & Ghemawat index algorithm.
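The general shape of a MapReduce index can be sketched in plain Java without a cluster. The example below is only an illustration of the technique, not the code in this repository: the class name InvertedIndexSketch, its methods, and the tokenization rule are assumptions made for the sketch, and the minWordLength parameter mirrors the min_word_length argument passed to the job. The map step emits (word, document) pairs, and the reduce step groups them into a sorted list of documents per word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory sketch of an inverted index built map-reduce style.
// It does not use the Hadoop API and is not the repository's GDIndex class.
public class InvertedIndexSketch {

    // Map phase: emit one (word, docId) pair for every token that is
    // at least minWordLength characters long.
    static List<String[]> map(String docId, String text, int minWordLength) {
        List<String[]> pairs = new ArrayList<>();
        // Assumed tokenization: lowercase, split on non-alphanumeric runs.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.length() >= minWordLength) {
                pairs.add(new String[] {token, docId});
            }
        }
        return pairs;
    }

    // Shuffle and reduce phase: group pairs by word and collect the
    // sorted, de-duplicated set of document IDs for each word.
    static Map<String, TreeSet<String>> reduce(List<String[]> pairs) {
        Map<String, TreeSet<String>> index = new TreeMap<>();
        for (String[] pair : pairs) {
            index.computeIfAbsent(pair[0], k -> new TreeSet<>()).add(pair[1]);
        }
        return index;
    }

    // Runs the map step over every document, then reduces all emitted pairs.
    static Map<String, TreeSet<String>> buildIndex(Map<String, String> docs, int minWordLength) {
        List<String[]> pairs = new ArrayList<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            pairs.addAll(map(doc.getKey(), doc.getValue(), minWordLength));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("Hadoop", "Hadoop runs MapReduce jobs");
        docs.put("MapReduce", "MapReduce is a programming model");
        // Print one line per word: the word, a tab, then its document IDs.
        buildIndex(docs, 4).forEach((word, ids) -> System.out.println(word + "\t" + ids));
    }
}
```

On a real cluster the grouping in reduce is done by the framework's shuffle, and the page titles would come from the Wikipedia XML dump rather than a hand-built map.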

Run on Hadoop or CDH Cluster

The project contains a pre-compiled JAR.

To compile

javac -classpath $(hadoop classpath) *.java

Create MR Jar

jar cvf .jar *.class

Run MR Job

hadoop jar .jar GDIndex enwiki.xml <min_word_length (number)>

View Results

hadoop fs -cat /part-r-* | less
