This repository has been archived by the owner on Dec 21, 2023. It is now read-only.

Dictionary input to count_words(....) #954

Closed
TobyRoseman opened this issue Aug 9, 2018 · 2 comments · Fixed by #2558

Comments

@TobyRoseman
Collaborator

The docstring for turicreate.text_analytics.count_words(...) contains the following example:

# Run count_words with dictionary input
>>> sa = turicreate.SArray([{'alice bob': 1, 'Bob alice': 0.5},
...                         {'a dog': 0, 'a dog cat': 5}])
>>> turicreate.text_analytics.count_words(sa)
dtype: dict
Rows: 2
[{'bob': 1.5, 'alice': 1.5}, {'a': 5, 'dog': 5, 'cat': 5}]

However the actual output is:

dtype: dict
Rows: 2
[{'bob': 2, 'alice': 2}, {'cat': 1, 'dog': 2, 'a': 2}]

It determines each count by looking only at how often a word occurs in the keys. However, the docstring (both the example and the description) claims it should sum the values of the keys that contain the word.
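
To make the difference concrete, here is a minimal pure-Python sketch of the two behaviors. It does not use turicreate and is not the library's implementation; the function names `weighted_word_counts` and `key_occurrence_counts` are hypothetical, and the tokenization (lowercase, split on whitespace) is an assumption chosen only because it reproduces both outputs above.

```python
from collections import defaultdict

def weighted_word_counts(row):
    """Documented behavior (as I read it): for each word, sum the
    values of the keys that contain it."""
    totals = defaultdict(float)
    for key, value in row.items():
        # Assumed tokenization: lowercase, split on whitespace.
        for word in key.lower().split():
            totals[word] += value
    return dict(totals)

def key_occurrence_counts(row):
    """Observed behavior: count how often each word occurs in the
    keys, ignoring the values entirely."""
    totals = defaultdict(int)
    for key in row:
        for word in key.lower().split():
            totals[word] += 1
    return dict(totals)

rows = [{'alice bob': 1, 'Bob alice': 0.5},
        {'a dog': 0, 'a dog cat': 5}]

documented = [weighted_word_counts(r) for r in rows]
observed = [key_occurrence_counts(r) for r in rows]
print(documented)
print(observed)
```

With the example rows, `documented` matches the docstring's expected result and `observed` matches what count_words currently returns.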

This is a bug, since things are not working as described, but a bigger question is: why do we want to support this use case? Why do we want to be able to tokenize strings/keys and add up their values for each token?

@nickjong
Collaborator

Yeah, I'm not sure I understand the use case, but the documented behavior makes the most sense given the shape of the API. If you treat each key as an atomic unit, then there's no real need for the API (and there's no notion of "word" involved). If you just count the number of keys containing each word, then the dictionary values are irrelevant and these should just be lists.

@dhivyaaxim
Contributor

@TobyRoseman Please assign this ticket to me; I would like to try fixing it and contribute. Thanks in advance!
