Add support for generating the word cloud from array-like of labels #271

soupault · 2017-06-07T13:42:05Z

I.e. something like generate_from_array(array), where array is supposed to be an array-like with labels: ('a', 'b', 'c', 'a') / ['a', 'b', 'c', 'a'] / np.array([1, 2, 3, 2]).
The counting is meant to run under the hood (using collections.Counter, for example).
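A minimal sketch of what such a helper might look like, assuming it only needs to turn an iterable of labels into a frequency mapping. The name `count_labels` is hypothetical and is not part of wordcloud; converting labels to `str` is also an assumption, made so that numeric labels can be drawn as words.

```python
from collections import Counter


def count_labels(labels):
    """Hypothetical helper: map each label (as a string) to its number of occurrences."""
    # str() is an assumption here: it lets numeric labels such as 1, 2, 3 be drawn as words.
    return dict(Counter(str(label) for label in labels))


frequencies = count_labels(("a", "b", "c", "a"))
numeric_frequencies = count_labels([1, 2, 3, 2])

# A dict like this is the kind of mapping that
# WordCloud().generate_from_frequencies(frequencies) accepts.
```

Any iterable works the same way, so a tuple, a list, or an array of labels would all go through the same code path.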

Please let me know if you would be interested in having this feature. If so, I'll work on the implementation.

P.S. Thank you for the great tool :)

amueller · 2017-06-07T20:49:00Z

I'm not sure what you mean by "under the hood" here. You need to provide the counts to the wordcloud in some way.
Can you maybe give an example for your usecase?

soupault · 2017-06-08T07:44:29Z

@amueller

> I'm not sure what you mean by "under the hood" here. You need to provide the counts to the wordcloud in some way.

Yes, I'm proposing that wordcloud take care of this for array-like input (which, I assume, is a common case). Note that WordCloud().generate_from_text is essentially implemented in a similar way and performs the counting internally.

> Can you maybe give an example for your usecase?

Basically, I'm doing multi-label classification and averaging predictions over the test set for visualization purposes. So I run inference on a list of samples, collect the results in a list of lists (each sublist stores the predicted labels for a single sample in order of decreasing confidence), flatten the outer list, count the number of occurrences of each label, and build the WordCloud.

This could also be applied to a multi-class classification problem for recommender systems, where one of the use cases is to explore the top3/top5/topN predictions over the test set.
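The pipeline described above could be sketched as follows. The prediction data is made up for illustration, and `top_n_frequencies` is a hypothetical name, not something from wordcloud; the sketch assumes each sublist is already sorted by decreasing confidence.

```python
from collections import Counter
from itertools import chain

# Made-up predictions: one sublist per sample, labels ordered by decreasing confidence.
predictions = [
    ["cat", "dog", "bird"],
    ["dog", "cat", "fish"],
    ["dog", "bird", "cat"],
]


def top_n_frequencies(predictions, n):
    """Count label occurrences over the first n predictions of every sample."""
    top_n = (sample[:n] for sample in predictions)
    # chain.from_iterable flattens the outer list without building an intermediate list.
    return dict(Counter(chain.from_iterable(top_n)))


all_counts = top_n_frequencies(predictions, n=3)
top1_counts = top_n_frequencies(predictions, n=1)
```

Either dict can then be passed to `generate_from_frequencies`; varying `n` gives the top3/top5/topN views mentioned above.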

amueller · 2017-06-08T07:52:31Z

Ah, so a list with repetitions. You can either do " ".join(array) and pass it to generate_from_text, or call pandas value_counts on it and pass the result to generate_from_frequencies.

amueller · 2017-06-08T07:55:03Z

I recommend the second, as that will bypass the tokenization: you already know what the tokens are supposed to be.
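To illustrate why bypassing tokenization matters, here is a small sketch with made-up labels that contain spaces. Note that `str.split()` is only a simplified stand-in for wordcloud's tokenizer (an assumption for illustration), and `collections.Counter` is used in place of pandas `value_counts` to keep the example dependency-free.

```python
from collections import Counter

# Made-up labels; some consist of more than one word.
labels = ["new york", "paris", "new york", "new delhi"]

# Option 1 (stand-in): join into one string, then split it again.
# Whitespace splitting breaks multi-word labels into separate tokens.
joined_counts = Counter(" ".join(labels).split())

# Option 2: count the labels directly, so each label stays intact.
direct_counts = Counter(labels)

print(joined_counts["new"])       # counted as its own token
print(direct_counts["new york"])  # counted as one label
```

With the first option the label boundaries are lost, while the second keeps "new york" and "new delhi" as distinct entries.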

soupault · 2017-06-08T08:29:36Z

@amueller

> You can either do " ".join(array) and pass it to generate_from_text, or call pandas value_counts on it and pass the result to generate_from_frequencies.

Of course, but there is overhead in both cases (and, frankly speaking, in my pipeline as well): creating and splitting a potentially large string in the first, and a pandas dependency with its containers in the second.

Going back to the original question :) : would you like to see this kind of input supported by wordcloud out of the box? To me, the current generate_from_text looks like a special case of the proposed generate_from_array, and could be built on top of the latter.

amueller · 2017-06-08T09:01:13Z

I'm wary of adding too many interfaces. You can also implement value_counts in your own code in three lines, which is exactly the code you'd add to wordcloud:

from collections import defaultdict

# Count how many times each word occurs in the array.
d = defaultdict(int)
for word in array:
    d[word] += 1

What's the problem with adding those to your code?
This is what process_tokens does, but it also does other processing that you don't want.

soupault · 2017-06-08T09:29:03Z

No problem at all, it is already implemented that way :). I was just wondering whether this is a common enough case.

Thank you very much for the feedback! Closing as wontfix.

soupault changed the title ~~Add support for generating the word cloud from arrya-like of labels~~ Add support for generating the word cloud from array-like of labels Jun 7, 2017

soupault closed this as completed Jun 8, 2017
