Extracted statistical methods from other gems that work well on an Enumerable
Ruby
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
bin
features
lib
spec
.gitignore
LICENSE
README.rdoc
Rakefile
VERSION.yml
just_enumerable_stats.gemspec

README.rdoc

Just Enumerable Stats

I had some tricky stuff in Statisticus and Sirb that were useful, but I ended up using someone else's library for plain-old calculations on a single enumerable. What a shame, I thought. So I extracted things out and made a simpler library that I'd use for my simpler needs.

I also included a FixedRange class for traversing floating point ranges a little more easily and a nearly-exact copy of the whole library in its own module for containing these methods in your own container. See the Known Issues section below and the specs for that module.

Usage

Would I release a gem without an IRB application included? Probably not. This gem has one, and it's called jes:

jes 
Loading Just Enumerable Stats version: 0.0.1

Looking at the library through jes, you can see that we have all the usual goodies:

>> [1,2,3].mean
=> 2
>> [1,2,3].std
=> 1.0
>> [1,2,3].cor [2,3,5]
=> 0.981980506061966

The list of methods are:

  • average

  • avg

  • cartesian_product

  • categories

  • category_values

  • compliment

  • cor

  • correlation

  • count_if

  • covariance

  • cum_max

  • cum_min

  • cum_prod

  • cum_sum

  • cumulative_max

  • cumulative_min

  • cumulative_product

  • cumulative_sum

  • default_block

  • default_block=

  • dichotomize

  • euclidian_distance

  • exclusive_not

  • intersect

  • is_numeric?

  • max

  • max_index

  • max_of_lists

  • mean

  • median

  • min

  • min_index

  • min_of_lists

  • new_sort

  • order

  • pearson_correlation

  • permutations

  • product

  • quantile

  • rand_in_range

  • range

  • range_as_range

  • range_class

  • range_instance

  • rank

  • set_range

  • set_range_class

  • sigma_pairs

  • standard_deviation

  • std

  • sum

  • tanimoto_correlation

  • tanimoto_pairs

  • to_f!

  • to_pairs

  • union

  • var

  • variance

  • yield_transpose

One of the more interesting methods is yield_transpose:

[1,2,3].yield_transpose([5,5,5], [2,2,2]) { |e| e.product }
# => [10, 20, 30]

The yield_transpose:

  • makes a list of lists (read matrix) out of the main object and one or more other lists

  • yields a block on the columns of the list

In this case, it multiplies 1 * 5 * 2, 2 * 5 * 2, and 3 * 5 * 2 to get the final result.

There are a lot of other interesting tools that do this (RNum, Matrix), but the ones I know about aren't as flexible as this simple implementation.

Another interesting feature is the default block getter and setter. Sometimes I need to filter, scale, or normalize a result. I can do that in the default block and still hold on to the original value Ultimately, it's more expensive to do things this way (every computation has to also go through a filter), but it's a little simpler sometimes. An example:

a = [1,2,3]
a.default_block = lambda {|e| e * 2}
a.sum
# => [2,4,6]
a.std
# => 2.0 instead of 1.0

Scaling

There are a few new features for scaling data:

>> a = [1,2,3]
=> [1, 2, 3]
>> a.scale(2)
=> [2, 4, 6]
>> a
=> [1, 2, 3]
>> a.scale!(2)
=> [2, 4, 6]
>> a
=> [2, 4, 6]
>> a.scale {|e| e - 1}
=> [1, 3, 5]
>> a
=> [2, 4, 6]
>> a.scale! {|e| e - 1}
=> [1, 3, 5]
>> a                   
=> [1, 3, 5]
>> a.scale_between(3,4)
=> [3.0, 3.5, 4.0]
>> a
=> [1, 3, 5]
>> a.scale_between!(3,4)
=> [3.0, 3.5, 4.0]
>> a
=> [3.0, 3.5, 4.0]
>> a = [1,2,3]
=> [1, 2, 3]
>> a.normalize
=> [0.166666666666667, 0.333333333333333, 0.5]
>> a      
=> [1, 2, 3]
>> a.normalize!
=> [0.166666666666667, 0.333333333333333, 0.5]
>> a
=> [0.166666666666667, 0.333333333333333, 0.5]
>> a = [-5,0,5]      
=> [-5, 0, 5]
>> a.scale_to_sigmoid
=> [0.00669285092428486, 0.5, 0.993307149075715]
>> a
=> [-5, 0, 5]
>> a.scale_to_sigmoid!
=> [0.00669285092428486, 0.5, 0.993307149075715]
>> a                  
=> [0.00669285092428486, 0.5, 0.993307149075715]

Basically:

  • scale can scale by a number or with a block. The block is a transformation for a single element.

  • scale_between sets the minimum and maximum values, and keeps each value proportionate to each other.

  • normalize calculates the percentage of an element to the whole.

  • scale_to_sigmoid uses the sigmoid function to scale a set between 0 and 1 with a Gaussian distribution on the numbers.

Categories

Once I started using this gem with my distribution table classes, I needed to have flexible categories on an enumerable. What that looks like is:

Loading Just Enumerable Stats version: 0.0.4
>> a = [1,2,3]
=> [1, 2, 3]
>> a.categories
=> [1, 2, 3]
>> a.category_values
=> {1=>[1], 2=>[2], 3=>[3]}
>> a.set_range_class FixedRange, 1, 3, 0.5
=> FixedRange
>> a.categories
=> [1.0, 1.5, 2.0, 2.5, 3.0]
>> a.category_values(true)
=> {1.0=>[1], 1.5=>[], 2.0=>[2], 2.5=>[], 3.0=>[3]}
>> a.set_range({
?> "<3" => lambda{|e| e < 3},
?> "3" => lambda{|e| e == 3}
>> })
=> ["<3", "3"]
>> a.categories
=> ["<3", "3"]
>> a.category_values(true)
=> {"<3"=>[1, 2], "3"=>[3]}
>> a.count_if {|e| e < 3}
=> 2
>> a.count_if {|e| e == 3}
=> 1
>> a.dichotomize(2, :small, :big)
=> [:small, :big]
>> a.categories
=> [:small, :big]
>> a.category_values[:small]
=> [1, 2]
>> a.category_values[:big]  
=> [3]

OK, here we go:

  • If you have facets installed (sudo gem install facets), it will use a Dictionary instead of a Hash, keeping the order of the hash values consistent with the order they were loaded. It's nice to have an ordered hash, so go ahead and install that.

  • The categories default as all the unique values in the enumerable

  • category_values is a hash or dictionary of values for each category. The values are returned as an array. This is cached, so call it like a.category_values(true) to reset the hash.

  • set_range_class sets an arbitrary class to use for calculating a range. It should respond to map that will return an array of all values in the range. The arguments after FixedRange are the arguments I want to use when instantiating the class. This is an arbitrary list as well. I could just as easily have typed a.set_range_class FixedRange, a.min, a.max, 0.5 with similar results.

  • FixedRange is a Range that can work with floating numbers. Range.new(1.0, 3.0).map chokes but FixedRange.new(1.0, 3.0).map does not.

  • set_range takes an arbitrary hash of lambdas and sets the categories to the keys of that hash.

  • The category_values calculated with a hash of lambdas makes a very flexible set interface for enumerables. There is no rule that the categories setup this way have to be mutually exclusive or collectively exhaustive (MECE), so interesting data sets can be setup here. However, MECE is generally a good guideline for most analysis.

  • count_if is just like a.select_all{|e| e < 3}.size, but a little more obvious.

  • dichotomize just splits the categories into two, with the first category less than or equal to the split value provided (2 in our case)

Obtrusiveness

This gem won't override methods, but it puts a lot into the Enumerable namespace. It's almost as abusive as ActiveSupport. I'll be testing this in a Rails environment to make sure this plays nicely with ActiveSupport. The issue there was that ActiveSupport also wanted to override sum on an enumerable. If you are in an environment where jes isn't overriding methods, then you're going to have to use the jes prefix for the calls. So, @a._jes_sum will work, as will @a._jes_standard_deviation, @a._jes_covariance, etc. I'm realizing that I always want to use this library in all of my new stuff, because it simplifies things so much. So, this is going to be an ever-more important point.

Installation

sudo gem install davidrichards-just_enumerable_stats

Dependencies

There's an optional dependency on facets/dictionary. You'll have fewer surprises if you use a Dictionary instead of a hash for categories.

Known Issues

  • I don't like the quantile methods. I found a different approach that I think is cleaner, that I should implement when I get the time.

  • This isn't really for any Enumerable. It's only tested on Arrays, though I'm pretty sure a lot of my repositories and other custom Enumerables will work well with this. Most importantly, a Hash will fall on its face here, so don't try it. If you need labeled data, keep an eye out for Marginal, a gem I'm cleaning up that offers log-linear methods on cross tables. That gem will use this gem, so whatever goodies I add here will be available there. Also, there is data_frame out right now that uses this gem. That is a simpler and very useful gem for labeled data.

  • I imagine the scope of this gem may grow by about a third more methods. It's not supposed to be an exhaustive list. TeguGears was developed to build these kinds of methods and have them work nicely with other tools. So, anything more than elementary statistics should become a TeguGears class.

  • I should probably rename the range methods.

COPYRIGHT

Copyright © 2009 David Richards. See LICENSE for details.