The project is an information retrieval system that provides several services built on NLP models and machine learning methods. These services include searching a database of documents for a sentence to find the relevant documents, classifying new documents into a few pre-defined classes using a training set of labeled documents, and grouping documents into clusters in an unsupervised manner. The project consists of three phases, which cover the following tasks:
- Preprocess and standardize the documents and sentences in four steps: normalization, tokenization, stemming and removing stop words.
- Build two indexes over the texts: a positional index and a bigram index.
- Compress the indexing systems with variable byte and gamma code techniques.
- Correct words in a sentence by replacing each with the word that occurs most often with its neighboring words.
- Find the documents most relevant to a query by searching the tf-idf vector space or by proximity search.
- Classify the news documents into four classes (World / Sports / Business / Sci/Tech) with four different methods: Naive Bayes, k-nearest-neighbor, SVM, and Random Forest.
- Report the accuracy, precision, and recall of each classifier.
- Crawl documents from Semantic Scholar.
- Cluster documents in the tf-idf and word2vec vector spaces with three methods: k-means, Gaussian mixture model, and hierarchical clustering.
- Run the PageRank algorithm on the crawled documents.

These tasks use the services implemented in the following files of the services folder:
- implementing the base methods for processing documents
- implementing the indexer classes
- running the classifiers implemented in the classifiers folder
- implementing clustering methods on the documents
- implementing the search through documents methods
- providing the basic methods for treating documents as vectors
- implementing compressing methods for indexers
- file_manager: implementing methods for working with compressed objects
- implementing the page rank algorithm on documents
- implementing correction methods on sentences
- implementing the basic method for visualization
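
As an illustration of the index compression step listed above, here is a minimal sketch of variable byte and gamma coding applied to the gaps of a posting list. It is not the project's own code: the function names (`to_gaps`, `vb_encode`, `gamma_encode`, and so on) are hypothetical, and the gamma code is kept as a string of `0`/`1` characters for readability, whereas a real implementation would presumably pack the bits into bytes.

```python
from typing import List


def to_gaps(postings: List[int]) -> List[int]:
    """Turn a sorted posting list into gaps (the first entry is kept as-is)."""
    return [cur - prev for cur, prev in zip(postings, [0] + postings[:-1])]


def vb_encode_number(n: int) -> bytes:
    """Variable byte code: 7 payload bits per byte, high bit set on the last byte."""
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.insert(0, n % 128)
        n //= 128
    chunks[-1] += 128  # mark the final byte of this number
    return bytes(chunks)


def vb_encode(numbers: List[int]) -> bytes:
    return b"".join(vb_encode_number(n) for n in numbers)


def vb_decode(data: bytes) -> List[int]:
    numbers, n = [], 0
    for byte in data:
        if byte < 128:  # high bit clear: more bytes follow
            n = 128 * n + byte
        else:  # high bit set: this byte ends the number
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers


def gamma_encode_number(n: int) -> str:
    """Gamma code: length of the offset in unary, then the offset itself."""
    if n < 1:
        raise ValueError("gamma code is only defined for integers >= 1")
    offset = bin(n)[3:]  # binary form without the '0b' prefix and the leading 1
    return "1" * len(offset) + "0" + offset


def gamma_encode(numbers: List[int]) -> str:
    return "".join(gamma_encode_number(n) for n in numbers)


def gamma_decode(bits: str) -> List[int]:
    numbers, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":  # read the unary length
            length += 1
            i += 1
        i += 1  # skip the terminating 0
        numbers.append(int("1" + bits[i:i + length], 2))
        i += length
    return numbers


postings = [824, 829, 215406]
gaps = to_gaps(postings)
assert vb_decode(vb_encode(gaps)) == gaps
assert gamma_decode(gamma_encode(gaps)) == gaps
```

Encoding gaps rather than raw document IDs keeps the numbers small, which is what makes both codes effective. Variable byte coding is byte-aligned and simple to decode, while gamma coding works at the bit level and is usually more compact for small gaps.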