MIRProject

The project is an information retrieval system that provides several services built on NLP models and machine learning methods. These services include searching a database of documents for those relevant to a query sentence, classifying new documents into a few predefined classes using a training set of labeled documents, and grouping documents into clusters in an unsupervised manner. The project consists of the following three phases:

Phase 1

Preprocess and standardize the documents and sentences in four steps: normalization, tokenization, stemming, and stop-word removal.
Build two indexes over the texts: a positional index and a bigram index.
Compress the indexes with variable-byte and gamma coding.
Correct misspelled words in a sentence by replacing them with the words that co-occur most often with their neighboring words.
Find the documents most relevant to a query by searching in the tf-idf vector space or by proximity search.
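
The compression step above can be illustrated with a minimal sketch. It assumes the textbook conventions for both codes (variable-byte with the high bit marking the last byte of a number, and Elias gamma as a unary length followed by the binary offset), applied to the gaps between sorted document IDs that start at 1. The function names are hypothetical and are not taken from compress.py, which may differ in its details.

```python
from typing import List


def to_gaps(postings: List[int]) -> List[int]:
    """Replace each sorted document ID by its distance from the previous one."""
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def vb_encode_number(n: int) -> bytes:
    """Variable-byte code: 7 payload bits per byte, high bit set on the last byte."""
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.insert(0, n % 128)
        n //= 128
    chunks[-1] += 128  # mark the final byte of this number
    return bytes(chunks)


def vb_encode(numbers: List[int]) -> bytes:
    return b"".join(vb_encode_number(n) for n in numbers)


def vb_decode(data: bytes) -> List[int]:
    numbers, n = [], 0
    for byte in data:
        if byte < 128:  # more bytes of the same number follow
            n = 128 * n + byte
        else:  # last byte of the number
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers


def gamma_encode_number(n: int) -> str:
    """Elias gamma code as a bit string: unary length, then the offset."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    offset = bin(n)[3:]  # binary digits of n without the leading 1
    return "1" * len(offset) + "0" + offset


def gamma_encode(numbers: List[int]) -> str:
    return "".join(gamma_encode_number(n) for n in numbers)


def gamma_decode(bits: str) -> List[int]:
    numbers, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":  # read the unary length
            length += 1
            i += 1
        i += 1  # skip the terminating 0
        numbers.append(int("1" + bits[i:i + length], 2))
        i += length
    return numbers
```

Both codes give small gaps short encodings, which is why the postings are converted to gaps first. Gamma coding works at the bit level and is usually more compact, while variable-byte coding stays byte-aligned and is simpler to decode.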

Phase 2

Classify news documents into four classes (World, Sports, Business, and Sci/Tech) using four methods: Naive Bayes, k-nearest neighbors, SVM, and random forest.
Report the accuracy, precision, and recall of each classifier.
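
As a rough illustration of the reported metrics, the sketch below computes overall accuracy and per-class precision and recall from two lists of labels. The function name `evaluate` and the shape of the returned dictionary are assumptions made for this example, not the interface of classify.py.

```python
from typing import Dict, List


def evaluate(y_true: List[str], y_pred: List[str]) -> Dict:
    """Return overall accuracy plus precision and recall for every class."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    report = {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "per_class": {},
    }
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        report["per_class"][label] = {
            # Define the ratio as 0.0 when the denominator is empty.
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return report


# Made-up labels, only to show the call.
truth = ["World", "Sports", "Business", "Sci/Tech", "Sports", "World"]
predicted = ["World", "Sports", "Sports", "Sci/Tech", "Sports", "Business"]
result = evaluate(truth, predicted)
```

Precision and recall are computed one class at a time by treating that class as positive and the other three as negative, so each classifier yields four precision values and four recall values alongside a single accuracy.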

Phase 3

Crawl documents from Semantic Scholar.
Cluster the documents in the tf-idf and word2vec vector spaces using three methods: k-means, Gaussian mixture models, and hierarchical clustering.
Run the PageRank algorithm on the crawled documents.
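
The PageRank step can be sketched with plain power iteration over a link graph. The example assumes the graph is a dictionary mapping each document ID to the IDs it links to, uses a damping factor of 0.85, and spreads the rank of documents without outgoing links evenly over all documents. These choices, and the function name `page_rank`, are assumptions for illustration; page_rank.py may make different ones.

```python
from typing import Dict, List


def page_rank(
    links: Dict[str, List[str]],
    damping: float = 0.85,
    iterations: int = 100,
    tol: float = 1e-10,
) -> Dict[str, float]:
    """Power-iteration PageRank over a dict of outgoing links."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is shared by everyone.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:  # stop early once the ranks have settled
            break
    return rank


# Hypothetical link graph between four documents.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = page_rank(graph)
```

In this small graph, document C receives links from all three other documents and ends up with the highest rank, while D, which nothing links to, keeps only the baseline share of (1 - damping) / n.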

Code Description

main.py and server.py: using the services implemented in the files of the services folder, which are listed below:
document_manager.py: implementing the base methods for processing documents
index.py: implementing the indexer classes
classify.py: running classifiers implemented in the classifiers folder (knn.py, naive_bayes.py, random_forest.py, and svm.py)
cluster.py: implementing clustering methods on the documents
search.py: implementing the methods for searching through documents
vectorspace.py: providing the basic methods for treating documents as vectors
compress.py: implementing compressing methods for indexers
file_manager: implementing methods for working with compressed objects
page_rank.py: implementing the PageRank algorithm on documents
spell_correction.py: implementing correction methods on sentences
visualize.py: implementing the basic method for visualization


asadi-ali/MIRProject

Folders and files

Latest commit

History

Repository files navigation

MIRProject

Phase 1

Phase 2

Phase 3

Code Description

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages