dataframe

DataFrames for Clojure (inspired by Python's Pandas)

The dataframe package contains two core data structures:

  • A Series is a map of index keys to values. It is ordered and supports O(1) lookup of values by index as well as O(1) lookup of values by positional offset (based on the order of the index).
  • A Frame is a map of column names to column values, which are represented as Series, each with an identical index. A Frame may also be thought of as a map of index keys to maps, where each map is a row of a Frame that maps column names to the value in that row.
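
The lookup guarantees described for a Series can be pictured with plain Clojure data. The following is a minimal sketch and not the library's actual implementation: `make-series`, `by-key`, `by-pos`, and `pairs` are hypothetical names, and the representation (a vector of keys, a vector of values, and a map from key to position) is just one assumed way to support effectively constant-time lookup both by index key and by positional offset.

```clojure
;; Hypothetical model of a Series; not dataframe's real internals.
(defn make-series
  "Builds a plain-map stand-in for a Series from values and index keys."
  [values index]
  {:index  (vec index)
   :values (vec values)
   ;; index key -> positional offset, used for lookup by key
   :pos    (zipmap index (range))})

(defn by-key
  "Value stored under index key k, or nil if k is not in the index."
  [srs k]
  (get (:values srs) (get (:pos srs) k)))

(defn by-pos
  "Value at positional offset i, following the order of the index."
  [srs i]
  (nth (:values srs) i))

(defn pairs
  "The [index value] pairs of the series, in index order."
  [srs]
  (mapv vector (:index srs) (:values srs)))

(def s (make-series [10 20 30 40] [:a :b :c :d]))

(by-key s :c) ; 30
(by-pos s 0)  ; 10
(pairs s)     ; [[:a 10] [:b 20] [:c 30] [:d 40]]
```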

Series

A Series can be thought of as a 1-D vector of data with an index (a vector of keys) that assigns one key to every value. The keys are typically either integers or Clojure keywords, but can be any value. Any of the values may be nil, but the non-nil values must all be of the same type.

When iterated over, a Series is a collection of pairs of [index value].

index val
:a 10
:b 20
:c 30
:d 40

To create a Series, pass a sequence of values and an index sequence to the constructor function:

(require '[dataframe.core :as df])

(def srs (df/series [1 2 3] [:a :b :c]))
srs
=> class dataframe.series.Series
:a 1
:b 2
:c 3

DataFrame core has a number of functions for operating on or manipulating Series objects.

(df/ix srs :b)
; 2

(df/values srs)
; [1 2 3]

Arithmetic and comparison operations on a Series return a new Series. These operations follow broadcast rules: you may combine a scalar with a Series, which applies the operation to every element and returns a new Series with the same index as the input; or you may apply an element-by-element operation to two Series, provided their indices align exactly:

(df/add 1 srs)
=> class dataframe.series.Series
:a 2
:b 3
:c 4
(df/eq 2 srs)
=> class dataframe.series.Series
:a false
:b true
:c false
(df/add (df/series [1 2 3]) (df/series [10 20 30]))
=> class dataframe.series.Series
0 11
1 22
2 33
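
The broadcast rules can be illustrated without the library. This is a rough sketch under stated assumptions: a "series" here is only a vector of `[index value]` pairs, and `broadcast` is a hypothetical helper, not a function from dataframe.core. It shows the two cases described above, scalar with series and series with series on matching indices.

```clojure
;; Hypothetical stand-in: a "series" is a vector of [index value] pairs,
;; not a dataframe Series.
(defn broadcast
  "Applies binary function f to x and y, each a scalar or a pair-vector series."
  [f x y]
  (cond
    ;; series with series: the indices must match exactly
    (and (vector? x) (vector? y))
    (do (assert (= (map first x) (map first y)) "indices must align")
        (mapv (fn [[k a] [_ b]] [k (f a b)]) x y))

    ;; scalar with series: apply f against every element, keep the index
    (vector? y) (mapv (fn [[k b]] [k (f x b)]) y)
    (vector? x) (mapv (fn [[k a]] [k (f a y)]) x)

    ;; scalar with scalar
    :else (f x y)))

(def pair-srs [[:a 1] [:b 2] [:c 3]])

(broadcast + 1 pair-srs)        ; [[:a 2] [:b 3] [:c 4]]
(broadcast = 2 pair-srs)        ; [[:a false] [:b true] [:c false]]
(broadcast + pair-srs pair-srs) ; [[:a 2] [:b 4] [:c 6]]
```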

Frames

A Frame is a collection of named columns, each a Series, all sharing the same index.

When iterated over, a Frame is a collection of pairs, each consisting of an index key and the corresponding row as a map: [index {col->val}].

columns: :a :b :c
index
:x 10 2 100
:y 20 4 300
:z 30 6 600
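
The relationship between the column view and the row view of the table above can be sketched in plain Clojure. This is only an illustration under assumed names: `frame-index`, `frame-columns`, and `frame-rows` are hypothetical and do not come from the library.

```clojure
;; Hypothetical plain-data model of the example Frame; not dataframe's types.
(def frame-index [:x :y :z])
(def frame-columns {:a [10 20 30]
                    :b [2 4 6]
                    :c [100 300 600]})

(defn frame-rows
  "Turns an index and a column map into [index {col->val}] pairs."
  [index columns]
  (mapv (fn [i k]
          ;; build the row at offset i by picking the i-th value of each column
          [k (into {} (map (fn [[col vs]] [col (nth vs i)])) columns)])
        (range)
        index))

(frame-rows frame-index frame-columns)
;; [[:x {:a 10, :b 2, :c 100}]
;;  [:y {:a 20, :b 4, :c 300}]
;;  [:z {:a 30, :b 6, :c 600}]]
```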

There are several equivalent ways to create a Frame, all of which use the dataframe.core/frame constructor function:

  • Pass a map of column names to column values as well as an optional index (if no index is passed, then a standard index of integers starting at 0 will be used). The column values can either be sequences or they can be Series objects, but must all have the same length.
(require '[dataframe.core :as df])

(def frame (df/frame {:a [1 2 3] :b [10 20 30]} [:x :y :z]))
frame
=> class dataframe.frame.Frame
	:a	:b
:x	1	10
:y	2	20
:z	3	30

Here, :a and :b are the names of the columns and the index over rows is [:x :y :z].

  • Pass a list of pairs of index keys and rows-as-maps.
(def frame (df/frame [[:x {:a 1 :b 10}]
                      [:y {:a 2 :b 20}]
                      [:z {:a 3 :b 30}]]))
frame
=> class dataframe.frame.Frame
	:a	:b
:x	1	10
:y	2	20
:z	3	30

  • Pass a list of maps and an optional index sequence:
(def frame (df/frame [{:a 1 :b 10}
                      {:a 2 :b 20}
                      {:a 3 :b 30}]
                      [:x :y :z]))
frame
=> class dataframe.frame.Frame
	:a	:b
:x	1	10
:y	2	20
:z	3	30

Selecting

DataFrame core contains a number of functions for selecting specific subsets and items from Series and Frames.

We've already seen the ix function, which selects either a single value from a Series or a single row-map from a Frame.

(df/ix (df/series [1 2 3] [:x :y :z]) :x)
;1
(df/ix (df/frame [{:a 1 :b 10}
                  {:a 2 :b 20}
                  {:a 3 :b 30}]
                 [:x :y :z])
       :x)
;{:a 1 :b 10}

The loc function selects the subset of a Series or Frame whose index keys appear in a given list.

(df/loc (df/series [1 2 3] [:x :y :z]) [:x :y])
=> class dataframe.series.Series
:x 1
:y 2
(df/loc (df/frame [{:a 1 :b 10}
                   {:a 2 :b 20}
                   {:a 3 :b 30}]
                  [:x :y :z])
        [:x :y])
=> class dataframe.frame.Frame
	:a	:b
:x	1	10
:y	2	20

In addition to index-based selection, one can select values or rows using a Series of boolean values. The index of this boolean Series must align with the index of the Series or Frame being selected from:

(df/select (df/series [1 2 3] [:x :y :z])
           (df/series [true false true] [:x :y :z]))
=> class dataframe.series.Series
:x 1
:z 3
(df/select (df/frame [{:a 1 :b 10}
                      {:a 2 :b 20}
                      {:a 3 :b 30}]
                     [:x :y :z])
           (df/series [true false true] [:x :y :z]))
=> class dataframe.frame.Frame
	:a	:b
:x	1	10
:z	3	30
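
The boolean-mask selection shown above can be modelled in a few lines of plain Clojure. This is a sketch only: `select-by-mask` is a hypothetical helper operating on vectors of `[index value]` pairs, not the library's `select`, and the alignment check is an assumption about what "must align" means.

```clojure
;; Hypothetical helper; works on [index value] or [index row-map] pairs.
(defn select-by-mask
  "Keeps the pairs whose index key maps to true in mask."
  [pairs mask]
  (assert (= (map first pairs) (map first mask)) "indices must align")
  (let [keep? (into {} mask)] ; index key -> boolean
    (filterv (fn [[k _]] (keep? k)) pairs)))

(select-by-mask [[:x 1] [:y 2] [:z 3]]
                [[:x true] [:y false] [:z true]])
;; [[:x 1] [:z 3]]

(select-by-mask [[:x {:a 1 :b 10}] [:y {:a 2 :b 20}] [:z {:a 3 :b 30}]]
                [[:x true] [:y false] [:z true]])
;; [[:x {:a 1, :b 10}] [:z {:a 3, :b 30}]]
```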

Grouping

The group-by function takes a Frame and a Series whose index is aligned with the Frame's index, and returns a map of values to Frames. Each key of the map is a distinct value of that Series, and the corresponding Frame holds the rows whose index keys carry that value.

(def data (df/frame [{:a 1 :b 10}
                     {:a 2 :b 20}
                     {:a 3 :b 30}]
                    [:x :y :z]))

(df/group-by data (df/series [:foo :foo :bar] [:x :y :z]))
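
The README does not show the result of this call, so the following is only a plain-Clojure picture of the idea, using `clojure.core/group-by` on `[index row]` pairs rather than the library's function. The names `row-pairs`, `group-keys`, and `group-pairs` are hypothetical, and the library's actual return value holds Frames rather than vectors of pairs.

```clojure
;; Hypothetical plain-data stand-ins for the Frame and the grouping Series.
(def row-pairs [[:x {:a 1 :b 10}]
                [:y {:a 2 :b 20}]
                [:z {:a 3 :b 30}]])
(def group-keys {:x :foo, :y :foo, :z :bar}) ; index key -> group value

(defn group-pairs
  "Groups [index row] pairs by the value that group-keys assigns to each index key."
  [pairs group-keys]
  (group-by (fn [[k _]] (get group-keys k)) pairs))

(group-pairs row-pairs group-keys)
;; {:foo [[:x {:a 1, :b 10}] [:y {:a 2, :b 20}]]
;;  :bar [[:z {:a 3, :b 30}]]}
```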

One can also group by a function of each row using the group-by-fn function. This function should take the row as a map of column names to values and return a single value that represents the group value for that row:

(def data (df/frame [{:a 1 :b 10}
                     {:a 2 :b 20}
                     {:a 3 :b 30}]
                    [:x :y :z]))

(df/group-by-fn data (fn [row] (+ (:a row) (:b row))))
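
As a rough plain-Clojure analogue of grouping by a row function, the sketch below applies the function to each row map and collects the `[index row]` pairs under the resulting value. `fn-row-pairs` and `group-pairs-by-fn` are hypothetical names, and the library's own result is a map of values to Frames, which this sketch does not reproduce.

```clojure
;; Hypothetical stand-in for the Frame: a vector of [index row-map] pairs.
(def fn-row-pairs [[:x {:a 1 :b 10}]
                   [:y {:a 2 :b 20}]
                   [:z {:a 3 :b 30}]])

(defn group-pairs-by-fn
  "Groups [index row] pairs by the value of (f row)."
  [pairs f]
  (group-by (fn [[_ row]] (f row)) pairs))

;; Every row has a distinct sum here, so each group holds one row.
(group-pairs-by-fn fn-row-pairs (fn [row] (+ (:a row) (:b row))))
;; {11 [[:x {:a 1, :b 10}]]
;;  22 [[:y {:a 2, :b 20}]]
;;  33 [[:z {:a 3, :b 30}]]}

;; A coarser grouping function yields larger groups.
(group-pairs-by-fn fn-row-pairs (fn [row] (odd? (:a row))))
;; {true  [[:x {:a 1, :b 10}] [:z {:a 3, :b 30}]]
;;  false [[:y {:a 2, :b 20}]]}
```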

Joining

Two Frames may be joined together. dataframe supports inner, left, right, and outer joins, all of which match rows on the index keys of the two Frames.

(def left (df/frame [{:a 1 :b 10}
                     {:a 2 :b 20}
                     {:a 3 :b 30}]
                    [:x :y :z]))
                    
(def right (df/frame [{:c 100 :d "Foo"}
                      {:c 200 :d "Bar"}
                      {:c 300 :d "Baz"}]
                     [:w :x :y]))                    

(df/join left right :how :outer)
=> class dataframe.frame.Frame
    :b  :a  :c  :d 
:x  10   1 200 Bar 
:y  20   2 300 Baz 
:z  30   3 nil nil 
:w nil nil 100 Foo 
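
To make the index-matching behaviour concrete, here is a minimal sketch of an index join over plain `[index row]` pairs. It is not the library's implementation: `columns-of` and `join-pairs` are hypothetical helpers, and filling missing columns with nil is modelled on the output shown above.

```clojure
;; Hypothetical helpers; frames are modelled as vectors of [index row-map] pairs.
(defn columns-of
  "Distinct column names appearing in the rows."
  [pairs]
  (distinct (mapcat (comp keys second) pairs)))

(defn join-pairs
  "Joins two frames on their index keys. how is :inner, :left, :right, or :outer."
  [left right how]
  (let [l     (into {} left)
        r     (into {} right)
        lks   (map first left)
        rks   (map first right)
        ks    (case how
                :inner (filter #(contains? r %) lks)
                :left  lks
                :right rks
                :outer (distinct (concat lks rks)))
        ;; every column starts as nil, so unmatched sides are nil-filled
        blank (zipmap (concat (columns-of left) (columns-of right)) (repeat nil))]
    (mapv (fn [k] [k (merge blank (get l k) (get r k))]) ks)))

(def left-pairs  [[:x {:a 1 :b 10}] [:y {:a 2 :b 20}] [:z {:a 3 :b 30}]])
(def right-pairs [[:w {:c 100 :d "Foo"}] [:x {:c 200 :d "Bar"}] [:y {:c 300 :d "Baz"}]])

(join-pairs left-pairs right-pairs :outer)
;; rows for :x and :y are fully populated, :z has nil for :c and :d,
;; and :w has nil for :a and :b
```

With `:inner`, only the keys present on both sides (`:x` and `:y`) remain.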

Transforming

DataFrame core has a number of functions for operating on or manipulating Frames.

(def frame (df/frame [[:x {:a 1 :b 10}]
                      [:y {:a 2 :b 20}]
                      [:z {:a 3 :b 30}]]))

(df/ix frame :x)
;=> class dataframe.series.Series
;:b 10
;:a 1

(df/col frame :a)
;=> class dataframe.series.Series
;:x 1
;:y 2
;:z 3


(df/assoc-col frame :c (df/add (df/col frame :a) (df/col frame :b)))
;=> class dataframe.frame.Frame
;	:b	:a	:c
;:x	10	1	11
;:y	20	2	22
;:z	30	3	33

To make manipulating Frames easier, dataframe introduces the with-> macro, which combines Clojure's threading macro with notation for easily accessing the column of a Frame. This macro takes a Frame and threads it through a series of operations. In doing so, when it encounters a symbol of the form $col, it knows to replace it with a reference to a column in the dataframe whose name is the keyword :col (for this reason, it is preferred to use keywords as column names).

(require '[dataframe.core :refer :all])

(def my-df (frame {:a [1 2 3] :b [10 20 30]}))

(with-> my-df
        (assoc-col :c (add $a 5))
        (assoc-col :d (add $b $c)))
=> class dataframe.frame.Frame
	:a	:b	:c	:d
0	1	10	6	16
1	2	20	7	27
2	3	30	8	38

Notice how the uses of $a, $b, and $c are replaced by the corresponding columns, as Series objects, in the dataframe pipeline above. This allows us to leverage functions that act on Series objects to transform these columns and to use them to update the Frame object.
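
For readers curious how such a substitution could work, below is a simplified sketch of a threading macro that rewrites `$col` symbols. It is explicitly not the source of dataframe's `with->`: the macro `with-cols->` and the helpers `col-symbol?` and `replace-cols` are hypothetical, and the "frame" is assumed to be a plain map of column keyword to vector so that the example runs with clojure.core alone.

```clojure
(require '[clojure.walk :as walk])

;; Hypothetical sketch; with-cols-> is NOT dataframe's with-> macro.
;; A "frame" here is a plain map of column keyword -> vector of values.

(defn col-symbol?
  "True for unqualified symbols such as $a (a $ followed by a name)."
  [x]
  (and (symbol? x)
       (nil? (namespace x))
       (> (count (name x)) 1)
       (= \$ (first (name x)))))

(defn replace-cols
  "Rewrites each $col symbol in form into (get frame-sym :col)."
  [frame-sym form]
  (walk/postwalk
   (fn [x]
     (if (col-symbol? x)
       (list `get frame-sym (keyword (subs (name x) 1)))
       x))
   form))

(defmacro with-cols->
  "Threads frame through steps as the first argument, like ->,
  resolving $col symbols against the frame as it stands at each step."
  [frame & steps]
  (if (empty? steps)
    frame
    (let [f    (gensym "frame")
          step (replace-cols f (first steps))
          call (if (seq? step)
                 `(~(first step) ~f ~@(rest step))
                 (list step f))]
      ;; bind the current frame once, then recurse on the remaining steps
      `(let [~f ~frame]
         (with-cols-> ~call ~@(rest steps))))))

(with-cols-> {:a [1 2 3] :b [10 20 30]}
  (assoc :c (mapv (fn [v] (+ v 5)) $a))
  (assoc :d (mapv + $b $c)))
;; {:a [1 2 3], :b [10 20 30], :c [6 7 8], :d [16 27 38]}
```

Because each step is expanded against the frame produced by the previous step, `$c` in the second step can refer to the column added in the first.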

These pipelines can be arbitrarily complicated:

(def my-df (frame [[:w {:a 0 :b 8}]
                   [:x {:a 1 :b 2}]
                   [:y {:a 2 :b 4}]
                   [:z {:a 3 :b 8}]]))
                   
(with-> my-df
        (select (and (lte $a 2) (gte $b 4)))
        (assoc-col :c (add $a $b))
        (map-rows->df (fn [row] {:foo (+ (:a row) (:c row))
                                 :bar (- (:b row) (:c row))}))
        (sort-rows :foo :bar)
        head)                  
=> class dataframe.frame.Frame
	:bar	:foo
:y	-2	8
:w	0	8
:z	-3	14

dataframe is distributed under the MIT license.

Copyright © 2016 George Herbert Lewis