stats claims to be able to calculate mode, but cannot 😄 #72
Comments
Oops! - will fix! I guess the only efficient way to calculate this is frequencies. Hmm - in that case I specifically don't want the input cast to doubles - I want whatever came in originally.
Right! This is an interesting thing I'm coming up against in a few places. Some
Fixed in 10.000-beta-13
Check it:
    (defn mode
      "Return the most common occurrence in the data."
      [data]
      (->> (hamf/frequencies (or (dtype-base/as-reader data) data))
           (hamf/sort-by (hamf/obj->long e (val e)))
           (hamf/last)
           (key?)))
A potentially faster pathway for really large N is to use take-min with a custom comparator instead of sort-by.
Not as fancy but fastest - no sort or fancy data structure required -
    (defn mode
      "Return the most common occurrence in the data."
      [data]
      (let [fs (frequencies data)]
        (when-not (== 0 (count fs))
          (->> (apply max-key val fs)
               (key?)))))
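For readers without ham-fisted on the classpath, the frequencies-plus-max-key idea can be sketched with clojure.core alone. This is an illustration, not the code that shipped in dtype-next: `mode-sketch` is a hypothetical name, and plain `key` stands in for the `key?` call used in the thread. How ties are broken is left unspecified.

```clojure
;; Sketch only: clojure.core, no ham-fisted or dtype-next involved.
;; `mode-sketch` is a hypothetical name used for illustration.
(defn mode-sketch
  "Return the most common value in data, or nil when data is empty."
  [data]
  (let [fs (frequencies data)]   ; value -> count in one pass, values kept as they came in
    (when (seq fs)               ; max-key needs at least one argument
      (key (apply max-key val fs)))))

(mode-sketch [1 2 2 3])   ; => 2, still a long rather than a double
(mode-sketch [:a :b :b])  ; => :b, non-numeric input works as well
(mode-sketch [])          ; => nil
```

Because the counts are keyed by the original values, nothing is cast to double along the way, which matches the requirement stated earlier in the thread.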
And fastest - min-key isn't written all that well except for 1 or 2 key calls. So I wrote mmin-key in ham-fisted - 4-5 times faster and it handles the empty sequence case without complaining - https://github.com/cnuernber/ham-fisted/blob/master/src/ham_fisted/api.clj#L3080
    (defn mode
      "Return the most common occurrence in the data."
      [data]
      (->> (frequencies data)
           (mmax-key val)
           (key?)))
And - after working with a really large dataset I found one more piece - https://github.com/cnuernber/ham-fisted/blob/master/src/ham_fisted/api.clj#L3093
    (defn mode
      "Return the most common occurrence in the data."
      [data]
      (->> (frequencies {:map-fn java-hashmap} data)
           (mmax-key val)
           (key?)))
Turns out the Java HashMap wins by far against all the maps I tried.
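To make the idea of counting into a Java hash map concrete, here is a rough sketch using plain interop with `java.util.HashMap`. It does not reproduce what ham-fisted's `frequencies` does internally and makes no performance claim; `hashmap-mode-sketch` is a hypothetical name, and plain `key` and `max-key` are used in place of the thread's `key?` and `mmax-key`.

```clojure
(import '(java.util HashMap)
        '(java.util.function BiFunction))

;; Sketch only: hypothetical helper, not ham-fisted's implementation.
(defn hashmap-mode-sketch
  "Count occurrences in a mutable java.util.HashMap, then return the most
  common value, or nil when data is empty."
  [data]
  (let [counts (HashMap.)
        ;; merge function applied when the key is already present
        add    (reify BiFunction
                 (apply [_ old-count increment] (+ old-count increment)))]
    (doseq [x data]
      (.merge counts x 1 add))          ; insert 1, or add 1 to the existing count
    (when-not (.isEmpty counts)         ; max-key needs at least one entry
      (key (apply max-key val counts)))))

(hashmap-mode-sketch ["a" "b" "b"])  ; => "b"
(hashmap-mode-sketch [])             ; => nil
```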
Those are all clean - and yes, sorting is an algorithmic mistake when calculating mode since
Thanks for this - this stuff is getting good! Great work.
dtype-next/src/tech/v3/datatype/statistics.clj
Line 233 in c035733
yet, ...
Not a huge deal obviously - feeling grateful for this library 🦃 today.