GitHub - apurvi96/Wikipedia-Search-Engine: A search engine using inverted indexing to query over wikipedia data dump

A search engine that answers queries over a Wikipedia data dump of about 60 GB, using inverted indexing, index merging, and ranking techniques.

Support for field queries: fields include Title, Infobox, Body, Category, Links, and References. This helps when, for example, a user searching for the movie ‘Up’ wants to see pages containing the word ‘Up’ in the title and the word ‘Pixar’ in the infobox.
The index size should be less than one-fourth of the dump size.
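
To illustrate how a field-aware inverted index can answer a query such as "‘Up’ in the title and ‘Pixar’ in the infobox", here is a minimal in-memory sketch. It is not the repository's implementation: the single-letter field codes, the `tokenize` and `build_index` helpers, and the sample pages are all assumptions made for illustration, and a real index over a 60 GB dump would be written to disk in blocks and merged.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build token -> field -> {doc_id: term frequency}.

    `pages` maps a doc id to a dict of hypothetical field codes
    (t=title, i=infobox, b=body) and their text.
    """
    index = defaultdict(lambda: defaultdict(dict))
    for doc_id, fields in pages.items():
        for field, text in fields.items():
            for token in tokenize(text):
                postings = index[token][field]
                postings[doc_id] = postings.get(doc_id, 0) + 1
    return index


# Made-up sample pages, for illustration only.
pages = {
    1: {"t": "Up", "i": "Pixar Animation Studios", "b": "Up is a 2009 film."},
    2: {"t": "Pixar", "b": "Pixar looked up to Disney."},
}
index = build_index(pages)

# Pages with "up" in the title AND "pixar" in the infobox.
hits = set(index["up"]["t"]) & set(index["pixar"]["i"])
print(sorted(hits))
```

Keeping postings separate per field is what makes field queries cheap: each field restriction is a lookup, and combining fields is a set intersection.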

bash index.sh <path_to_wiki_dump> <path_to_invertedindex_output> <invertedindex_stat.txt>

invertedindex_stat.txt: this file should contain two numbers on separate lines:
Total number of tokens (after converting to lowercase) encountered in the dump
Total number of tokens in the inverted index

bash search.sh <path_to_invertedindex_output> query_string
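
As a rough sketch of how the two numbers in invertedindex_stat.txt could be produced, the snippet below counts every lowercased token seen and the number of distinct tokens. Reading "tokens in the inverted index" as the count of distinct tokens is an interpretation, and `count_tokens` and `write_stats` are hypothetical helper names, not part of the repository.

```python
import re


def count_tokens(texts):
    """Return (total tokens seen, distinct tokens) after lowercasing."""
    total = 0
    vocabulary = set()
    for text in texts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        total += len(tokens)
        vocabulary.update(tokens)
    return total, len(vocabulary)


def write_stats(path, total, indexed):
    """Write the two counts on separate lines."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(f"{total}\n{indexed}\n")
```

For example, `write_stats("invertedindex_stat.txt", *count_tokens(page_texts))` would write the file once `page_texts` (an assumed iterable of page strings) has been extracted from the dump.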

Sachin Ramesh Tendulkar Hogwarts

t:World Cup i:2019 c:Cricket search for "World Cup" in Title, "2019" in Infobox and "Cricket" in Category

t:the two towers i:1954 search for "the two towers" in Title and "1954" in Infobox
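
The field queries above can be split into per-field terms with a small parser. This is a sketch under stated assumptions: the prefixes `t:`, `i:`, and `c:` come from the examples, while `b:`, `l:`, and `r:` for Body, Links, and References are guesses, and `parse_query` is a hypothetical helper rather than the repository's code.

```python
import re

# t, i, c appear in the examples; b, l, r are assumed prefixes.
FIELD_NAMES = {
    "t": "title", "i": "infobox", "c": "category",
    "b": "body", "l": "links", "r": "references",
}

# A field tag is a single known letter followed by ':' at the start
# of the query or after whitespace.
_FIELD_TAG = re.compile(r"(?:^|\s)([tibclr]):")


def parse_query(query):
    """Split a query into {field name: [lowercased terms]}.

    Terms with no field tag are collected under the key "any".
    """
    matches = list(_FIELD_TAG.finditer(query))
    if not matches:
        return {"any": query.lower().split()}

    parsed = {}
    leading = query[:matches[0].start()].lower().split()
    if leading:
        parsed["any"] = leading
    for match, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(query)
        terms = query[match.end():end].lower().split()
        parsed.setdefault(FIELD_NAMES[match.group(1)], []).extend(terms)
    return parsed


print(parse_query("t:World Cup i:2019 c:Cricket"))
print(parse_query("Sachin Ramesh Tendulkar"))
```

Each field's term list can then be looked up only in that field's postings, while untagged terms are searched across all fields.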
