stats claims to be able to calculate mode, but cannot 😄 #72

Closed
harold opened this issue Nov 24, 2022 · 9 comments

Comments

@harold
Collaborator

harold commented Nov 24, 2022

#{:min :quartile-1 :sum :mean :mode :median :quartile-3 :max

yet, ...

Execution error at tech.v3.datatype.statistics/calculate-descriptive-stat (statistics.clj:167).
Unrecognized descriptive statistic: :mode

Not a huge deal obviously - feeling grateful for this library 🦃 today.

@cnuernber
Owner

Oops! - will fix! I guess the only efficient way to calculate this is frequencies.

Hmm - in that case I specifically don't want the input cast to doubles - I want whatever came in originally.

@harold
Collaborator Author

harold commented Dec 16, 2022

Right! This is an interesting thing I'm coming up against in a few places.

Some int columns are categorical and some are quantitative. In a 5,000 row ds, an int column may only take a few different values (e.g., 67, 68, 71, 73), and then treating it as categorical and finding the mode and such makes a lot of sense. The same thing came up when choosing a slider vs. a multi-select box for filtering on an int column.
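
To make the distinction concrete, here is a minimal sketch using only clojure.core, not this library's API. The helper name `categorical-like?` and the threshold of 10 distinct values are assumptions made up for illustration:

```clojure
;; Hypothetical helper: treat a column as categorical when it has few
;; distinct values. The threshold of 10 is an arbitrary assumption.
(defn categorical-like?
  [column]
  (<= (count (distinct column)) 10))

;; A small int column that only takes a handful of values.
(def heights [67 68 71 73 68 71 68])

;; Few distinct values, so this column would be treated as categorical.
(categorical-like? heights)

;; Counts are keyed by the original integer values; nothing is cast to double.
(frequencies heights)
```

Counting on the original values is what keeps the mode of an int column an int.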

@cnuernber
Owner

Fixed in 10.000-beta-13

@cnuernber
Owner

cnuernber commented Dec 16, 2022

Check it:

(defn mode
  "Return the most common occurrence in the data."
  [data]
  (->> (hamf/frequencies (or (dtype-base/as-reader data) data))
       (hamf/sort-by (hamf/obj->long e (val e)))
       (hamf/last)
       (key)))
  • frequencies is highly optimized and potentially parallel for large datasets
  • sort-by is smart enough to look at key-fn and if it returns a long or a double and the user does not provide a custom comparator it uses the fastutil long or double array indirect sort pathways. Then it simply returns the input reindexed in-place.
  • last is smart enough that if the input is random access it just returns the last element in constant time.
  • There is no destructuring - just using the clojure key and val primitives which cast to Map.Entry and do the right thing.

A potentially faster pathway for really large N is to use take-min with a custom comparator instead of sort-by.

@cnuernber
Owner

Not as fancy but fastest - no sort or fancy datastructure required -

(defn mode
  "Return the most common occurrence in the data."
  [data]
  (let [fs (frequencies data)]
    (when-not (== 0 (count fs))
      (->> (apply max-key val fs)
           (key)))))

@cnuernber
Owner

cnuernber commented Dec 16, 2022

And fastest - min-key isn't written all that well except for 1 or 2 key calls. So I wrote mmin-key in ham-fisted - 4-5 times faster and it handles the empty sequence case without complaining - https://github.com/cnuernber/ham-fisted/blob/master/src/ham_fisted/api.clj#L3080

(defn mode
  "Return the most common occurrence in the data."
  [data]
  (->> (frequencies data)
       (mmax-key val)
       (key)))

@cnuernber
Owner

And - after working with a really large dataset found one more piece - https://github.com/cnuernber/ham-fisted/blob/master/src/ham_fisted/api.clj#L3093

(defn mode
  "Return the most common occurrence in the data."
  [data]
  (->> (frequencies {:map-fn java-hashmap} data)
       (mmax-key val)
       (key)))

Turns out the java hashmap wins by far against all available maps I tried.
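
For readers who want to see the idea without ham-fisted, here is a rough sketch of counting into a mutable `java.util.HashMap` from plain Clojure. It is not ham-fisted's implementation; the names `hashmap-frequencies` and `mode-via-hashmap` are hypothetical and exist only for this illustration:

```clojure
(import '(java.util HashMap))

;; Hypothetical helper: count occurrences into a mutable java.util.HashMap.
(defn hashmap-frequencies
  ^HashMap [data]
  (let [counts (HashMap.)]
    (doseq [x data]
      ;; getOrDefault avoids a separate containsKey check.
      (.put counts x (inc (long (.getOrDefault counts x 0)))))
    counts))

;; Hypothetical helper: pick the entry with the largest count.
(defn mode-via-hashmap
  [data]
  (let [counts (hashmap-frequencies data)]
    ;; max-key needs at least one argument, so guard the empty case.
    (when-not (.isEmpty counts)
      (key (apply max-key val counts)))))

(mode-via-hashmap [67 68 71 73 68 71 68])
```

This sketch makes no performance claim of its own; it only shows the shape of the approach.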

@harold
Collaborator Author

harold commented Dec 17, 2022

Those are all clean - and yes, sorting is an algorithmic mistake when calculating mode, since sorting costs O(n*log(n)) and mode can be found in O(n).
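
A small sketch of the two strategies side by side, using only clojure.core. The function names `mode-by-sort` and `mode-by-scan` are made up for this comparison and are not part of any library:

```clojure
;; Sort-based: count, then sort all entries by count and take the last one.
;; The sort costs O(k log k) for k distinct values.
(defn mode-by-sort
  [data]
  (some->> (frequencies data)
           (sort-by val)
           last
           key))

;; Scan-based: count, then make a single pass over the entries,
;; keeping the entry with the highest count seen so far.
(defn mode-by-scan
  [data]
  (some-> (reduce (fn [best entry]
                    (if (or (nil? best) (> (val entry) (val best)))
                      entry
                      best))
                  nil
                  (frequencies data))
          key))

(mode-by-sort [67 68 71 73 68 71 68])
(mode-by-scan [67 68 71 73 68 71 68])
```

Both return nil for empty input. When several values tie for the highest count, the two functions may pick different ones, since neither defines a tie-breaking rule.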

@harold
Collaborator Author

harold commented Dec 17, 2022

Thanks for this - this stuff is getting good! Great work.
