Multidimensional table processing tool
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
docs
src
test
.gitignore
.travis.yml
LICENSE.md
README.md
REQUIRE
appveyor.yml
mkdocs.yml

README.md

DataCubes.jl

Build Status Coverage Status

The DataCubes package provides several data types and tools to handle multidimensional tables. Below, we give a brief description of the package. For more detail, please refer to documentation.

Installation

At the Julia REPL, Pkg.add("DataCubes"). The package requires DataFrames to convert to and from DataFrame, , RDatasets to run some tests.

Getting Started

There are two important data types in DataCubes.

DictArray

DictArray is a multidimensional array whose element type is an ordered dictionary with common keys. The usual table

a b 
----
1 x 
2 y 
3 z 

can be thought of as a one dimensional array whose 3 elements are

[dict(:a=>1, :b=>:x), dict(:a=>2, :b=>:y), dict(:a=>3, :b=>:z)]

where dict stands for some type of ordered dictionary. With this correspondence, it is straightforward to generalize the table into a multidimensional array. In DataCubes, the function to create a DictArray is darr:

julia> d = darr(a=[1 2;3 4;5 6], b=[:x :y;:z :w;:u :v])
a b |a b 
----+----
1 x |2 y 
3 z |4 w 
5 u |6 v 

julia> d[1, 2]
DataCubes.All.LDict{Symbol,Nullable{T}} with 2 entries:
  :a => Nullable(2)
  :b => Nullable(:y)

Note that all elements in DictArray are Nullable. darr is a helper function to create DictArray, which lifts each array elements to Nullable.

If you want to create an array with null value, use the macro @darr:

julia> @darr(a=[1,2,NA], b=['x',NA,'z'])
a b 
----
1 x 
2   
  z 

Internally, a DictArray is stored as an ordered dictionary of Nullable arrays, and the ordered dictionary is implemented as a key vector and a value vector. Because a DictArray can be multidimensional, it will be misleading to call a and b above as column names. We will call them field names, and refer to the corresponding arrays ([1,2,NA] and ['x',NA,'z']) as field values.

Many of array related functions are implemented for DictArray. For example,

julia> d = @darr(a=[1 2 3;4 5 6], b=[11 12 13;14 15 16])
a  b |a  b |a  b 
-----+-----+-----
1 11 |2 12 |3 13 
4 14 |5 15 |6 16 

julia> size(d)
(2,3)

julia> transpose(d)

a  b |a  b 
-----+-----
1 11 |4 14 
2 12 |5 15 
3 13 |6 16 

julia> reshape(d, 1, 6)

a  b |a  b |a  b |a  b |a  b |a  b 
-----+-----+-----+-----+-----+-----
1 11 |4 14 |2 12 |5 15 |3 13 |6 16 

julia> mapslices(x->sum(x[:a]), d, [1])
3-element DataCubes.All.AbstractArrayWrapper{Nullable{Int64},1,Array{Nullable{Int64},1}}:
 Nullable(5)
 Nullable(7)
 Nullable(9)

LabeledArray

LabeledArray is a multidimensional array consisting of base and axes. A LabeledArray can be created using the larr function. For example,

julia> larr(a=[1 2;3 4;5 6], b=[:x :y;:y :z;:z :x], axis1=darr(k=['x','y','z']), axis2=[:u, :v])

  |u   |v   
--+----+----
k |a b |a b 
--+----+----
x |1 x |2 y 
y |3 y |4 z 
z |5 z |6 x 

Here, the base part is a DictArray whose fields are a and b. There are 2 axes. The first axis is another DictArray with a field k, and the other axis is a normal array [:u, :v].

Similar to DictArray, there is a macro version, @larr, which enables to enter a null value manually more easily:

julia> @larr(a=[1 2;NA 4;5 6], b=[:x :y;:y :z;NA :x], axis1[k=['x','y',NA]], axis2[:u, NA])

  |u   |    
--+----+----
k |a b |a b 
--+----+----
x |1 x |2 y 
y |  y |4 z 
  |5   |6 x 

Note the slightly different way of specifying the axes: the macro version specifies axes by using the form axisN[...] for the nth axis, whereas the function version axisN=[...].

Many array functions can be applied to LabeledArray with expected behavior:

julia> t = larr(a=[1 2;3 4;5 6], b=[:x :y;:y :z;:z :x], axis1=darr(k=['x','y','z']), axis2=[:u, :v])

  |u   |v   
--+----+----
k |a b |a b 
--+----+----
x |1 x |2 y 
y |3 y |4 z 
z |5 z |6 x 


julia> size(t)
(3,2)

julia> transpose(t)

k |x   |y   |z   
--+----+----+----
  |a b |a b |a b 
--+----+----+----
u |1 x |3 y |5 z 
v |2 y |4 z |6 x 


julia> reshape(t, 6)

k x1 |a b 
-----+----
x  u |1 x 
y  u |3 y 
z  u |5 z 
x  v |2 y 
y  v |4 z 
z  v |6 x 


julia> vcat(t, t)
  |u   |v   
--+----+----
k |a b |a b 
--+----+----
x |1 x |2 y 
y |3 y |4 z 
z |5 z |6 x 
x |1 x |2 y 
y |3 y |4 z 
z |5 z |6 x 

Select from a LabeledArray

@select and selct select and/or aggregate a LabeledArray and transform it into another. The function name is selct because what it does is not the same as what Base.select function does.

Here is an example usage of @select:

julia> t = larr(a=[1 2;3 4;5 6], b=[:x :y;:y :z;:z :x], axis1=darr(k=['x','y','z']), axis2=darr(r=[:u, :v]))

r |u   |v   
--+----+----
k |a b |a b 
--+----+----
x |1 x |2 y 
y |3 y |4 z 
z |5 z |6 x 


julia> @select(t, :b, :a)

r |u   |v   
--+----+----
k |b a |b a 
--+----+----
x |x 1 |y 2 
y |y 3 |z 4 
z |z 5 |x 6 


julia> @select(t, where[_b .== :x])

r |u   |v   
--+----+----
k |a b |a b 
--+----+----
x |1 x |    
z |    |6 x 


julia> @select(t, where[(_b .== :x) | (_b .== :y)], by[:b], count=length(_))

b |count 
--+------
x |    2 
y |    2 

In @select, where[...] chooses a portion of the LabeledArray. An underscore _ is treated as the array itself, and _field denotes the field field in the LabeledArray, when the field name is a symbol. In general, a field name can be of arbitrary type, in which case, _[field name] can be used to choose that field. A keyword key=>value creates a new field with name key and value defined by value just as in by[...]. A field name itself creates the same field in the returned array. Keyword/pair arguments determine how to aggregate the fields. In the last example, count=length(_) creates a field count whose value is the length of the selected values for each by-variable. Multiple where[...] are allowed and they are simply concatenated. Then the by[...] determines what variables to aggregate the table by: a field name in ... uses that field as the by-variable. Keyword/pair arguments in ... creates a new field and use those as by-variables.

selct is a function version of @select. It is similar but of course does not provide the underscore version of convenient way of creating a function. Here are some examples:

julia> t = larr(a=reshape(1:50,10,5), b=repmat(1:10, 1, 5))

   | 1    | 2    | 3    | 4    | 5    
---+------+------+------+------+------
   | a  b | a  b | a  b | a  b | a  b 
---+------+------+------+------+------
 1 | 1  1 |11  1 |21  1 |31  1 |41  1 
 2 | 2  2 |12  2 |22  2 |32  2 |42  2 
 3 | 3  3 |13  3 |23  3 |33  3 |43  3 
 4 | 4  4 |14  4 |24  4 |34  4 |44  4 
 5 | 5  5 |15  5 |25  5 |35  5 |45  5 
 6 | 6  6 |16  6 |26  6 |36  6 |46  6 
 7 | 7  7 |17  7 |27  7 |37  7 |47  7 
 8 | 8  8 |18  8 |28  8 |38  8 |48  8 
 9 | 9  9 |19  9 |29  9 |39  9 |49  9 
10 |10 10 |20 10 |30 10 |40 10 |50 10 


julia> selct(t, :b)

   | 1 | 2 | 3 | 4 | 5 
---+---+---+---+---+---
   | b | b | b | b | b 
---+---+---+---+---+---
 1 | 1 | 1 | 1 | 1 | 1 
 2 | 2 | 2 | 2 | 2 | 2 
 3 | 3 | 3 | 3 | 3 | 3 
 4 | 4 | 4 | 4 | 4 | 4 
 5 | 5 | 5 | 5 | 5 | 5 
 6 | 6 | 6 | 6 | 6 | 6 
 7 | 7 | 7 | 7 | 7 | 7 
 8 | 8 | 8 | 8 | 8 | 8 
 9 | 9 | 9 | 9 | 9 | 9 
10 |10 |10 |10 |10 |10 


julia> selct(t, :b, where=[d -> d[:a] .> 25])

   | 1 | 2 | 3 
---+---+---+---
   | b | b | b 
---+---+---+---
 1 |   | 1 | 1 
 2 |   | 2 | 2 
 3 |   | 3 | 3 
 4 |   | 4 | 4 
 5 |   | 5 | 5 
 6 | 6 | 6 | 6 
 7 | 7 | 7 | 7 
 8 | 8 | 8 | 8 
 9 | 9 | 9 | 9 
10 |10 |10 |10 


julia> selct(t, sum_a = d -> sum(d[:a]), where=d -> d[:a] .> 25, by=:b)

 b |sum_a 
---+------
 1 |   72 
 2 |   74 
 3 |   76 
 4 |   78 
 5 |   80 
 6 |  108 
 7 |  111 
 8 |  114 
 9 |  117 
10 |  120 

julia> selct(t, sum_a = d -> sum(d[:a]), where=[d -> d[:a] .> 25], by=:b, by=:a])

 a |   26 |   27 |   28 |   29 |   30 |   31 |   32 |   33 |   34 |   35 |   36 |   37 ...
---+------+------+------+------+------+------+------+------+------+------+------+------
 b |sum_a |sum_a |sum_a |sum_a |sum_a |sum_a |sum_a |sum_a |sum_a |sum_a |sum_a |sum_a ...
---+------+------+------+------+------+------+------+------+------+------+------+------
 1 |      |      |      |      |      |   31 |      |      |      |      |      |      ...
 2 |      |      |      |      |      |      |   32 |      |      |      |      |      ...
 3 |      |      |      |      |      |      |      |   33 |      |      |      |      ...
 4 |      |      |      |      |      |      |      |      |   34 |      |      |      ...
 5 |      |      |      |      |      |      |      |      |      |   35 |      |      ...
 6 |   26 |      |      |      |      |      |      |      |      |      |   36 |      ...
 7 |      |   27 |      |      |      |      |      |      |      |      |      |   37 ...
 8 |      |      |   28 |      |      |      |      |      |      |      |      |      ...
 9 |      |      |      |   29 |      |      |      |      |      |      |      |      ...
10 |      |      |      |      |   30 |      |      |      |      |      |      |      ...

As the last example shows, it is possible to aggregate a table using multiple variables to create a multidimensional LabeledArray.

update and @update works similarly to selct and @select but starts from the original table, not from scratch, modifies it and returns a new LabeledArray.

Join

leftjoin and innerjoin join two LabeledArrays.

julia> t1 = larr(a=[:k1,:k1,:k2,:k3,:k4],b=[1,2,3,4,5])

  | a b 
--+-----
1 |k1 1 
2 |k1 2 
3 |k2 3 
4 |k3 4 
5 |k4 5 


julia> t2 = @larr(axis1[a=[:k0,:k1,:k2,:k3]], axis2[r=[:m,:n]], c=[10 11;12 13;14 15;16 17])


 r | m | n 
---+---+---
 a | c | c 
---+---+---
k0 |10 |11 
k1 |12 |13 
k2 |14 |15 
k3 |16 |17 


julia> leftjoin(t1, t2, 1)

r | m      | n      
--+--------+--------
  | a b  c | a b  c 
--+--------+--------
1 |k1 1 12 |k1 1 13 
2 |k1 2 12 |k1 2 13 
3 |k2 3 14 |k2 3 15 
4 |k3 4 16 |k3 4 17 
5 |k4 5    |k4 5    


julia> innerjoin(t1, t2, 1)

r | m      | n      
--+--------+--------
  | a b  c | a b  c 
--+--------+--------
1 |k1 1 12 |k1 1 13 
2 |k1 2 12 |k1 2 13 
3 |k2 3 14 |k2 3 15 
4 |k3 4 16 |k3 4 17 

leftjoin(t1, t2, 1) left-joins t1 and t2 along the direction 1. Since ndims(t1) == 1 and ndims(t2) == 2 and there is 1 dimension to join along, the result LabeledArray is 1+2-1=2 dimensional. innerjoin works similarly to leftjoin but only keeps the keys in t1 that can be found in t2.

Documentation

More detailed documentation is available here.