Principal component analysis (PCA) in Ruby
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
examples
lib
test
.gitignore
Gemfile
Gemfile.lock
MIT-LICENSE
README.md
Rakefile
pca.gemspec

README.md

Principal Component Analysis (PCA)

Principal component analysis in Ruby. Uses GSL for calculations.

PCA can be used to map data to a lower dimensional space while minimizing information loss. It's useful for data visualization, where you're limited to 2-D and 3-D plots.

For example, here's a plot of the 4-D iris flower dataset mapped to 2-D via PCA:

iris

PCA is also used to compress the features of a dataset before feeding it into a machine learning algorithm, potentially speeding up training time with a minimal loss of data detail.

Install

GSL must be installed first. On OS X it can be installed via homebrew: brew install gsl

gem install pca

Example Usage

require 'pca'

pca = PCA.new components: 1

data_2d = [ 
  [2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
  [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]
]

data_1d = pca.fit_transform data_2d

# Transforms 2d data into 1d:
# data_1d ~= [
#   [-0.8], [1.8], [-1.0], [-0.3], [-1.7],
#   [-0.9], [0.1], [1.1], [0.4], [1.2]
# ]

more_data_1d = pca.transform [ [3.1, 2.9] ]

# Transforms new data into previously fitted 1d space:
# more_data_1d ~= [ [-1.6] ]

reconstructed_2d = pca.inverse_transform data_1d

# Reconstructs original data (approximate, b/c data compression):
# reconstructed_2d ~= [
#   [2.4, 2.5], [0.6, 0.6], [2.5, 2.6], [2.0, 2.1], [2.9, 3.1]
#   [2.4, 2.6], [1.7, 1.8], [1.0, 1.1], [1.5, 1.6], [1.0, 1.0]
# ]

evr = pca.explained_variance_ratio

# Proportion of data variance explained by each component
# Here, the first component explains 99.85% of the data variance:
# evr ~= [0.99854]

See examples for more. Also, peruse the source code (~ 100 loc.)

Options

The following options can be passed in to PCA.new:

option default description
:components nil number of components to extract. If nil, will just rotate data onto first principal component
:scale_data false scales features before running PCA by dividing each feature by its standard deviation.

Working with Returned GSL::Matrix

PCA#transform, #fit_transform, #inverse_transform and #components return instances of GSL::Matrix.

Some useful methods to work with these are the #each_row and #each_col iterators, and the #row(i) and #col(i) accessors.

Or if you'd prefer to work with a standard Ruby Array, you can just call #to_a and get an array of row arrays.

See GSL::Matrix RDoc for more.

Plotting Results With GNUPlot

Requires GNUPlot and gnuplot gem.

require 'pca'
require 'gnuplot'

pca = PCA.new components: 2
data_2d = pca.fit_transform data

Gnuplot.open do |gp|
  Gnuplot::Plot.new(gp) do |plot|
    plot.title "Transformed Data"
    plot.terminal "png"
    plot.output "out.png"

    # Use #col accessor to get separate x and y arrays
    # #col returns a GSL::Vector, so be sure to call #to_a before passing to DataSet
    xy = [data_2d.col(0).to_a, data_2d.col(1).to_a]

    plot.data << Gnuplot::DataSet.new(xy) do |ds|
      ds.title = "Points"
    end
  end
end

Sources and Inspirations