Document-Clustering-Scala

Used the following web pages to get the HTML documents

Web Page to Feature Vector (using the hashing trick and normalization):

I used the jsoup library to clean the web pages and extract the text from the body tags.
Removed all punctuation and non-ASCII characters (more cleaning could still have been done).
Tokenized the text by splitting on spaces. Only 1-grams are used; a better, though more complex, solution could have been built with 2-grams and 3-grams.
Removed stop words using a predefined list of stop words.
Fixed the length of the feature vector at 5000. For every token, I computed its hash and took the hash modulo the vector length to determine the token's index in the feature vector.
More elaborate collision-handling code could have been written to handle collisions better.
Normalized the feature vector by dividing it by its magnitude to get a unit vector.
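
The pipeline above can be sketched roughly as follows. This is a minimal illustration, not the code in NlpUtils.scala: the object and method names are hypothetical, the stop-word list is a tiny stand-in for the predefined one, and lowercasing and storing token counts in each bucket are assumptions the text does not confirm. HTML extraction with jsoup is left out so the sketch needs only the standard library.

```scala
object FeatureHashing {
  // Vector length taken from the text above.
  val VectorLength = 5000

  // Tiny illustrative stand-in for the predefined stop-word list (assumption).
  val StopWords: Set[String] = Set("a", "an", "the", "of", "and", "in", "is", "to")

  // Lowercase (assumption), replace punctuation and non-ASCII characters
  // with spaces, split on whitespace, then drop empty tokens and stop words.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase
      .replaceAll("[^a-z0-9\\s]", " ")
      .split("\\s+")
      .toSeq
      .filter(t => t.nonEmpty && !StopWords.contains(t))

  // Hashing trick: hash modulo vector length gives the index.
  // floorMod keeps the index non-negative when hashCode is negative.
  def bucket(token: String, size: Int = VectorLength): Int =
    Math.floorMod(token.hashCode, size)

  // Count tokens per bucket (assumption); colliding tokens simply share a bucket.
  def hashedCounts(tokens: Seq[String], size: Int = VectorLength): Array[Double] = {
    val v = new Array[Double](size)
    tokens.foreach(t => v(bucket(t, size)) += 1.0)
    v
  }

  // Divide by the Euclidean magnitude to get a unit vector;
  // an all-zero vector is returned unchanged to avoid dividing by zero.
  def normalize(v: Array[Double]): Array[Double] = {
    val norm = math.sqrt(v.map(x => x * x).sum)
    if (norm == 0.0) v else v.map(_ / norm)
  }

  def featurize(text: String): Array[Double] =
    normalize(hashedCounts(tokenize(text)))
}
```

Calling `FeatureHashing.featurize(bodyText)` on the text of each page would then give one 5000-dimensional unit vector per document, ready for clustering.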

KMeans clustering steps

Steps:

1. Choose k random vectors as centroids (k = 3 clusters in this case), and set the tolerance tol = 0.0003 and the maximum number of iterations maxIter = 100.
2. Assign each data point to the cluster whose centroid is nearest among the centroids of the 3 clusters.
3. Recalculate each centroid as the sum of all points in cluster Ci divided by the number of points in cluster Ci.
4. Repeat steps 2 and 3 until convergence.
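
The steps above might look roughly like this in Scala. It is a sketch under stated assumptions rather than the contents of KMeans.scala: the names are hypothetical, the initial centroids are drawn from the data points, distance is Euclidean, an empty cluster keeps its previous centroid, and convergence is taken to mean that no centroid moved by more than tol. The README does not spell out any of these choices.

```scala
import scala.util.Random

object KMeansSketch {
  type Vec = Array[Double]

  // Euclidean distance between two vectors of equal length (assumed metric).
  def distance(a: Vec, b: Vec): Double =
    math.sqrt(a.indices.map { i => val d = a(i) - b(i); d * d }.sum)

  // Component-wise mean: sum of the points divided by the number of points.
  def mean(points: Seq[Vec]): Vec = {
    val dim = points.head.length
    val sum = new Array[Double](dim)
    for (p <- points; i <- 0 until dim) sum(i) += p(i)
    sum.map(_ / points.size)
  }

  // Index of the centroid closest to p.
  def nearest(p: Vec, centroids: Seq[Vec]): Int =
    centroids.indices.minBy(i => distance(p, centroids(i)))

  // Returns the final centroids and the cluster label of each data point.
  def fit(data: Seq[Vec], k: Int = 3, tol: Double = 0.0003,
          maxIter: Int = 100, seed: Long = 42L): (Seq[Vec], Seq[Int]) = {
    require(data.size >= k, "need at least k data points")
    // Step 1: pick k random data points as the initial centroids (assumption).
    var centroids: Seq[Vec] = new Random(seed).shuffle(data).take(k)
    var iter = 0
    var shift = Double.MaxValue
    while (iter < maxIter && shift > tol) {
      // Step 2: assign each point to its nearest centroid.
      val labels = data.map(nearest(_, centroids))
      // Step 3: recompute each centroid; an empty cluster keeps its old one.
      val updated = centroids.indices.map { c =>
        val members = data.zip(labels).collect { case (p, l) if l == c => p }
        if (members.isEmpty) centroids(c) else mean(members)
      }
      // Step 4: stop once the largest centroid movement is within tol.
      shift = centroids.zip(updated).map { case (o, n) => distance(o, n) }.max
      centroids = updated
      iter += 1
    }
    (centroids, data.map(nearest(_, centroids)))
  }
}
```

Because the result depends on the random initial centroids, different seeds can give different clusterings; fixing the seed here only makes a run repeatable.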

Results

Here we see that cluster 0 contains the Greek history documents and the wiki document on Greece.
Cluster 1 contains the wiki pages on lakes.
Cluster 2 contains the wiki documents related to machine learning.

