Add support for generating the word cloud from array-like of labels #271

Closed
soupault opened this issue Jun 7, 2017 · 7 comments

Comments

@soupault

soupault commented Jun 7, 2017

That is, something like generate_from_array(array), where array is an array-like of labels: ('a', 'b', 'c', 'a') / ['a', 'b', 'c', 'a'] / np.array([1, 2, 3, 2]).
The counting would run under the hood (using collections.Counter, for example).
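
For illustration, the proposed counting step could look roughly like the sketch below. The helper name frequencies_from_labels is hypothetical and not part of wordcloud; converting labels to str is an assumption made here so that numeric labels can serve as words.

```python
from collections import Counter


def frequencies_from_labels(labels):
    """Count occurrences of each label in an array-like.

    Hypothetical helper, not part of wordcloud. Labels are converted to
    str (an assumption) so that numeric labels become usable as words.
    """
    return dict(Counter(str(label) for label in labels))


freqs = frequencies_from_labels(("a", "b", "c", "a"))
print(freqs)  # {'a': 2, 'b': 1, 'c': 1}

# With wordcloud installed, the result could then be passed on:
# from wordcloud import WordCloud
# cloud = WordCloud().generate_from_frequencies(freqs)
```

The helper only builds the word-to-count mapping; rendering is left to the existing generate_from_frequencies method.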

Please let me know if you would be interested in having this feature. If so, I'll work on the implementation.

P.S. Thank you for the great tool :)

@soupault soupault changed the title Add support for generating the word cloud from arrya-like of labels Add support for generating the word cloud from array-like of labels Jun 7, 2017
@amueller
Owner

amueller commented Jun 7, 2017

I'm not sure what you mean by "under the hood" here. You need to provide the counts to the wordcloud in some way.
Can you maybe give an example of your use case?

@soupault
Author

soupault commented Jun 8, 2017

@amueller

> I'm not sure what you mean by "under the hood" here. You need to provide the counts to the wordcloud in some way.

Yes, I'm proposing that wordcloud take care of this for array-like input (which is a common case, I assume). Note that WordCloud().generate_from_text is essentially implemented in a similar way and performs the counting internally.

> Can you maybe give an example of your use case?

Basically, I'm doing multi-label classification and averaging predictions over the test set for visualization purposes. So I run inference on a list of samples, collect the results in a list of lists (each sublist stores the predicted labels for a single sample in order of decreasing confidence), flatten the outer list, count the number of occurrences of each label, and build the WordCloud.

This could also be applied to a multi-class classification problem for recommender systems, where one of the use cases is to explore the top-3/top-5/top-N predictions over the test set.
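
The pipeline described here (flatten a list of per-sample label lists, optionally keep only the top N per sample, then count) can be sketched with the standard library alone. The function name count_top_n and the sample data are made up for illustration.

```python
from collections import Counter
from itertools import chain


def count_top_n(predictions, n=None):
    """Count labels across samples, keeping only the first n per sample.

    `predictions` is a list of lists; each inner list holds one sample's
    predicted labels, most confident first. n=None keeps every label.
    """
    truncated = (labels[:n] for labels in predictions)
    return Counter(chain.from_iterable(truncated))


# Made-up predictions for three samples, purely for illustration.
predictions = [
    ["cat", "dog", "bird"],
    ["dog", "cat"],
    ["dog", "fish", "cat"],
]

print(count_top_n(predictions))       # counts over all predicted labels
print(count_top_n(predictions, n=1))  # counts over the top prediction only
```

A Counter is a dict subclass, so the result should be usable wherever a plain word-to-count mapping is expected.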

@amueller
Owner

amueller commented Jun 8, 2017 via email

@amueller
Owner

amueller commented Jun 8, 2017 via email

@soupault
Author

soupault commented Jun 8, 2017

@amueller

> You can either do " ".join(array) and pass it to generate_from_text, or call pandas value_counts on it and pass the result to generate_from_frequencies.

Of course, but there is an overhead in both cases (and, frankly speaking, in my pipeline as well): creating and splitting a potentially large string in the first, a pandas dependency and its containers in the second.

Going back to the original question :) : would you like to see this kind of input supported by wordcloud out of the box? To me, the current generate_from_text looks like a special case of the proposed generate_from_array, and could be built on top of the latter.

@amueller
Owner

amueller commented Jun 8, 2017

I'm wary of adding too many interfaces. You can also implement value_counts in your own code in three lines, which is exactly the code you'd add to wordcloud:

from collections import defaultdict

d = defaultdict(int)  # missing words start at a count of 0
for word in array:
    d[word] += 1

What's the problem with adding those to your code?
This is what process_tokens does, but it also does other processing that you don't want.
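
As a side note, the loop above and collections.Counter (mentioned in the opening post) produce the same mapping. A minimal check, using a made-up input list:

```python
from collections import Counter, defaultdict

array = ["a", "b", "c", "a"]  # made-up input for illustration

# Explicit loop, as in the comment above.
d = defaultdict(int)
for word in array:
    d[word] += 1

# Counter does the same counting in one call.
c = Counter(array)

assert dict(d) == dict(c)
print(c.most_common(1))  # [('a', 2)]
```

Either mapping can serve as the word-to-count input for generate_from_frequencies.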

@soupault
Author

soupault commented Jun 8, 2017

No problem at all, it is already implemented that way :). I was just wondering whether this is a common enough case.

Thank you very much for the feedback! Closing as wontfix.

@soupault soupault closed this as completed Jun 8, 2017