Information Retrieval System

This project is implemented on the ISNA news agency dataset in two phases.

Brief description

This project consists of two phases; the first phase includes:

And the second phase consists of the following:

The most important functions of the first phase are listed below:

doc_preprocessing(): tokenizes the documents, removes stopwords, and stems the remaining tokens
create_inverted_index(): creates the positional index
Query_extraction class:
- multiword_extraction(): extracts the biwords from the query
- not_token_extraction(): extracts the words that must not appear in the results
showResultWithoutRanking()
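
As an illustration of the first-phase ideas above, here is a minimal sketch of a positional index with biword matching and NOT-term exclusion. It is not the code in phase1.py: the function names (build_positional_index, phrase_docs, exclude_docs), the toy English documents, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]}; docs is {doc_id: list of tokens}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_docs(index, first, second):
    """Return ids of documents where `second` occurs right after `first`."""
    hits = set()
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    # Only documents containing both words can contain the biword.
    for doc_id in first_postings.keys() & second_postings.keys():
        following = set(second_postings[doc_id])
        if any(pos + 1 in following for pos in first_postings[doc_id]):
            hits.add(doc_id)
    return hits

def exclude_docs(index, candidates, not_terms):
    """Drop candidate documents that contain any of the excluded terms."""
    banned = set()
    for term in not_terms:
        banned.update(index.get(term, {}))
    return candidates - banned

# Hypothetical toy corpus, already tokenized (no stopword removal or stemming).
docs = {
    1: "tehran stock market rises".split(),
    2: "stock prices fall in tehran market".split(),
    3: "football market news".split(),
}
index = build_positional_index(docs)
matches = phrase_docs(index, "stock", "market")
filtered = exclude_docs(index, {1, 2, 3}, ["football"])
```

Storing positions rather than only document ids is what makes the biword check possible: a document matches only when the second word sits exactly one position after the first.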

The most important functions of the second phase are listed below:

create_inverted_index(): improves the positional index by adding tf-idf weights
vectorize_query(): models the query in the vector space
similarity_DAAT(): document-at-a-time similarity algorithm, which is not efficient
similarity_TAAT(): term-at-a-time similarity algorithm, which greatly reduces the running time by using the index elimination technique
create_champion_list()
showRankedResult()
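
The second-phase ideas above can be sketched as follows. This is a hedged illustration, not the code in phase2.py: the names build_tfidf_index, champion_lists, and rank_term_at_a_time are hypothetical, the weighting (1 + log10 tf) * log10(N / df) is one common tf-idf variant chosen for the example, and the query is weighted by log term frequency only.

```python
import math
from collections import defaultdict

def build_tfidf_index(docs):
    """Return ({term: {doc_id: weight}}, {doc_id: vector length})."""
    n_docs = len(docs)
    tf = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term in tokens:
            tf[term][doc_id] = tf[term].get(doc_id, 0) + 1
    index = {}
    squared = defaultdict(float)
    for term, postings in tf.items():
        idf = math.log10(n_docs / len(postings))
        index[term] = {}
        for doc_id, count in postings.items():
            weight = (1 + math.log10(count)) * idf
            index[term][doc_id] = weight
            squared[doc_id] += weight * weight
    lengths = {doc_id: math.sqrt(s) for doc_id, s in squared.items()}
    return index, lengths

def champion_lists(index, r):
    """Keep only the r highest-weighted postings of each term."""
    return {
        term: dict(sorted(postings.items(), key=lambda kv: kv[1], reverse=True)[:r])
        for term, postings in index.items()
    }

def rank_term_at_a_time(query_tokens, index, lengths, k=10):
    """Accumulate scores one query term at a time, then length-normalize."""
    query_tf = defaultdict(int)
    for term in query_tokens:
        query_tf[term] += 1
    scores = defaultdict(float)
    # Only postings of query terms are visited, so documents that share
    # no term with the query are never scored.
    for term, count in query_tf.items():
        q_weight = 1 + math.log10(count)
        for doc_id, d_weight in index.get(term, {}).items():
            scores[doc_id] += q_weight * d_weight
    # The query length is the same for every document, so it is omitted.
    ranked = [(d, s / lengths[d]) for d, s in scores.items() if lengths[d] > 0]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical toy corpus, already tokenized.
docs = {
    1: "tehran stock market rises".split(),
    2: "stock prices fall in tehran market".split(),
    3: "football league news".split(),
}
index, lengths = build_tfidf_index(docs)
champions = champion_lists(index, r=1)
ranked = rank_term_at_a_time(["football", "news"], index, lengths)
```

Passing champions instead of index to rank_term_at_a_time restricts scoring to the top postings of each term, trading some recall for speed.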
