Clustering news documents using a bag-of-words model to classify documents

anjanatiha/Clustering-for-Document-Classification

Clustering for Document Classification (News Classification)

Description

  1. Used clustering algorithms to classify documents from the "20 newsgroups" dataset, based on the bag-of-words model.
  2. Converted each document to a TF-IDF vector, then ran the K-Means and Gaussian Mixture Model (GMM) algorithms.
  3. For evaluation, computed the weighted F1 score on the test set for both K-Means and GMM.
  4. Trained each clustering algorithm on the training set. For each test instance, predicted the cluster it belongs to and assigned it the majority topic of that cluster.
  5. Reported results for at least 5 different parameter settings (varying k for K-Means and the number of mixture components for GMM).
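
The steps above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the repository's actual code: the function names (`majority_topic_per_cluster`, `cluster_and_score`, `demo`), the toy corpus, and the GMM settings (`covariance_type="diag"`, `reg_covar=1e-3`) are all assumptions made for the example. The toy corpus stands in for the real data, which could be loaded with `sklearn.datasets.fetch_20newsgroups(subset="train")` and `subset="test"`.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.mixture import GaussianMixture


def majority_topic_per_cluster(cluster_ids, topics):
    """Map each cluster id to the most common training topic inside it."""
    mapping = {}
    for cluster in np.unique(cluster_ids):
        members = [t for c, t in zip(cluster_ids, topics) if c == cluster]
        mapping[int(cluster)] = Counter(members).most_common(1)[0][0]
    return mapping


def cluster_and_score(model, X_train, y_train, X_test, y_test):
    """Fit a clustering model, label clusters by majority topic, return weighted F1."""
    model.fit(X_train)
    mapping = majority_topic_per_cluster(model.predict(X_train), y_train)
    # Assumption: a cluster with no training members gets the overall majority topic.
    fallback = Counter(y_train).most_common(1)[0][0]
    y_pred = [mapping.get(int(c), fallback) for c in model.predict(X_test)]
    return f1_score(y_test, y_pred, average="weighted")


def demo():
    """Run K-Means and GMM for a few values of k on a toy two-topic corpus."""
    train_docs = [
        "the shuttle launch reached orbit around the moon",
        "nasa satellite orbit mission launch",
        "rocket engines lifted the satellite into orbit",
        "astronauts on the space station studied the moon",
        "space mission control tracked the rocket",
        "the goalie stopped the puck in the hockey game",
        "hockey team scored a goal on the ice",
        "the puck hit the ice during the playoff game",
        "players skated and the team won the hockey playoff",
        "the goalie and players celebrated the goal",
    ]
    train_topics = ["sci.space"] * 5 + ["rec.sport.hockey"] * 5
    test_docs = ["the rocket reached orbit", "the team won the hockey game"]
    test_topics = ["sci.space", "rec.sport.hockey"]

    vectorizer = TfidfVectorizer(stop_words="english")
    # GaussianMixture needs dense input, so both models share dense arrays here.
    X_train = vectorizer.fit_transform(train_docs).toarray()
    X_test = vectorizer.transform(test_docs).toarray()

    scores = {}
    for k in (2, 3):
        models = {
            "kmeans": KMeans(n_clusters=k, n_init=10, random_state=0),
            "gmm": GaussianMixture(n_components=k, covariance_type="diag",
                                   reg_covar=1e-3, random_state=0),
        }
        for name, model in models.items():
            scores[(name, k)] = cluster_and_score(
                model, X_train, train_topics, X_test, test_topics)
    return scores
```

Calling `demo()` returns a dictionary keyed by `(model name, k)`, which mirrors the idea of reporting one score per parameter setting. On the full dataset, the dense conversion would probably require capping the vocabulary (for example with the `max_features` argument of `TfidfVectorizer`) or reducing dimensionality first; that choice is not described in the original project.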

Tool Requirements: Anaconda, Python

Current Version: v1.0.0.0

Last Update: 11.02.2016