
How can you use this to cluster textual data #7

Open
drreddy opened this issue Sep 1, 2015 · 3 comments

Comments

@drreddy

drreddy commented Sep 1, 2015

Hello,

I'm interested to know whether there is any provision to cluster text data; I see that you take coordinates as input. Can you explain how to extend it to cluster a set of text documents?

@mgaido91

mgaido91 commented Sep 1, 2015

The usual way to cluster text documents is as follows:
1 - transform the data using the TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf) normalization
2 - apply the clustering technique of choice (such as DBSCAN) to the transformed data, using the cosine distance measure

As far as the TF-IDF transformation is concerned, you have to implement it yourself or look for existing implementations. For DBSCAN, you can use my fork (https://github.com/speedymrk9/spark_dbscan), which extends @alitouka's repo to allow the use of the cosine distance measure, which is the one you need.

I hope that's clear; I'm at your disposal for any questions.
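The two steps above can be sketched as follows. This is only an illustrative stand-in using scikit-learn on a single machine, not this repo's Spark implementation; the sample documents and the `eps`/`min_samples` parameters are assumptions chosen for the example.

```python
# Sketch of the pipeline described above: (1) TF-IDF transformation,
# (2) DBSCAN with the cosine distance on the transformed data.
# scikit-learn is used here purely as an illustrative stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

docs = [
    "spark cluster computing",
    "distributed spark computing",
    "cats and dogs",
    "dogs and cats playing",
]

# Step 1: TF-IDF turns each document into a numeric vector.
tfidf = TfidfVectorizer().fit_transform(docs)

# Step 2: DBSCAN with metric="cosine"; eps and min_samples are
# illustrative and would need tuning on real data.
labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(tfidf)
print(labels)
```

On this toy input the two "spark" documents end up in one cluster and the two "cats/dogs" documents in another, because their pairwise cosine distances fall below `eps` while cross-topic pairs share no terms.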

@drreddy
Author

drreddy commented Sep 1, 2015

Hey, thanks for the quick reply, but if the cosine distance is computed between all pairs, the memory complexity would be O(n^2), right? What I meant is: is there any way to convert the text data into the coordinates your package requires, so that I can use your efficient implementation?

@mgaido91

mgaido91 commented Sep 1, 2015

Sorry, but I haven't understood what you said. Could you explain? The way to convert text documents into coordinates is the TF-IDF transformation.
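To illustrate that point: TF-IDF itself produces the coordinates, since each vocabulary term becomes one axis and each document becomes a point in that space. The sketch below is a minimal plain-Python TF-IDF (illustrative only; the documents and the weighting formula are assumptions, and real implementations add smoothing and normalization):

```python
# Minimal TF-IDF: each document becomes a vector with one coordinate
# per vocabulary term (tf * idf, with idf = log(n_docs / doc_freq)).
import math
from collections import Counter

docs = ["spark runs fast", "spark scales well", "cats purr"]
vocab = sorted({w for d in docs for w in d.split()})

def tfidf_vector(doc, docs, vocab):
    tf = Counter(doc.split())          # raw term frequencies
    n = len(docs)
    vec = []
    for term in vocab:
        df = sum(1 for d in docs if term in d.split())
        idf = math.log(n / df) if df else 0.0
        vec.append(tf[term] * idf)
    return vec

coords = [tfidf_vector(d, docs, vocab) for d in docs]
# Each document is now a point with len(vocab) coordinates.
print(len(coords[0]))
```

These vectors are exactly the "coordinates" a distance-based clustering algorithm consumes; the cosine distance is then preferred over Euclidean because document vectors are sparse and vary in length.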

