# Data Structures in Julia

## Using the data structures package
https://github.com/JuliaLang/DataStructures.jl

In [54]:
using DataStructures

### Deque - Double-ended queues
The Deque type implements a double-ended queue using a list of blocks. This data structure supports constant-time insertion/removal of elements at both ends of a sequence.

In [55]:
a = Deque{Int}()
isempty(a)          # test whether the deque is empty
length(a)           # get the number of elements
push!(a, 10)        # add an element to the back
unshift!(a, 20)     # add an element to the front
front(a)            # get the element at the front (errors if the deque is empty)
back(a)             # get the element at the back
pop!(a)             # remove an element from the back
shift!(a)           # remove an element from the front

In [57]:
a = Deque{Int}()

Deque [[]]

In [59]:
isempty(a)

true

In [63]:
push!(a,10)

Deque [[10,10]]

In [64]:
pop!(a)

10

In [65]:
unshift!(a, 20)

Deque [[20,10]]

In [66]:
shift!(a)

20

In [67]:
front(a)

10

In [68]:
back(a)

10
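
The cells above use `unshift!` and `shift!`, which Julia 0.7 renamed to `pushfirst!` and `popfirst!`. As a rough sketch of the same double-ended operations on a current Julia 1.x release, assuming a plain `Vector` from Base rather than the package's block-based `Deque` (the variable names are illustrative):

```julia
# Sketch: double-ended operations on a Base Vector, using the Julia 1.x names.
v = Int[]
push!(v, 10)                  # add an element to the back
pushfirst!(v, 20)             # add an element to the front
front_elem = first(v)         # element at the front
back_elem = last(v)           # element at the back
popped_back = pop!(v)         # remove an element from the back
popped_front = popfirst!(v)   # remove an element from the front
```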

### Stacks and Queues
The Stack and Queue types are lightweight wrappers around a deque type. They provide FILO (first in, last out) and FIFO (first in, first out) access, respectively.

In [70]:
s = Stack(Int)
push!(s, 1)         # push a value onto the stack
x = top(s)          # look at the top element without removing it
x = pop!(s)         # remove and return the top element

In [71]:
s = Stack(Int)

Stack{Deque{Int64}}(Deque [[]])

In [73]:
push!(s, 1)

Stack{Deque{Int64}}(Deque [[1]])

In [75]:
x = top(s)

1

In [80]:
println(length(s))

0


In [83]:
q = Queue(Int)
enqueue!(q, 1)

Queue{Deque{Int64}}(Deque [[1]])

In [84]:
x = front(q)
y = back(q)
println(x == y)

true


In [85]:
x = dequeue!(q)

1
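
To make the difference between the two access orders concrete, here is a small sketch that feeds the same items through both. It assumes a current Julia 1.x release and uses plain Base vectors in place of the `Stack` and `Queue` types:

```julia
# Sketch: the same items leave in opposite orders under FILO and FIFO access.
items = [1, 2, 3]

stack = Int[]
for x in items
    push!(stack, x)
end
filo_order = [pop!(stack) for _ in 1:length(items)]        # last pushed leaves first

queue = Int[]
for x in items
    push!(queue, x)
end
fifo_order = [popfirst!(queue) for _ in 1:length(items)]   # first pushed leaves first
```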

### Disjoint Sets

In [86]:
a = IntDisjointSets(10)      # creates a forest comprised of 10 singletons
union!(a, 3, 5)             # merges the sets that contain 3 and 5 into one
in_same_set(a, x, y)        # determines whether x and y are in the same set
elem = push!(a)             # adds a single element in a new set; returns the new element
                            # (this operation is often called MakeSet)

11

In [None]:
a = DisjointSets{String}(["a", "b", "c", "d"])
union!(a, "a", "b")
in_same_set(a, "c", "d")
push!(a, "f")
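
For intuition about what a disjoint-set forest does, the following is a minimal union-find sketch for a current Julia 1.x release. The names `Forest`, `find_root`, `unite!` and `same_set` are hypothetical and chosen for illustration; this is not the package's implementation.

```julia
# Sketch of a union-find forest over the integers 1:n.
struct Forest
    parent::Vector{Int}
end
Forest(n::Integer) = Forest(collect(1:n))   # n singletons: every element is its own root

function find_root(f::Forest, x::Int)
    while f.parent[x] != x
        f.parent[x] = f.parent[f.parent[x]]   # path halving: point x at its grandparent
        x = f.parent[x]
    end
    return x
end

function unite!(f::Forest, x::Int, y::Int)
    rx, ry = find_root(f, x), find_root(f, y)
    rx != ry && (f.parent[ry] = rx)           # hang one root under the other
    return rx
end

same_set(f::Forest, x::Int, y::Int) = find_root(f, x) == find_root(f, y)

f = Forest(10)      # ten singletons
unite!(f, 3, 5)     # 3 and 5 now share a root
```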

### Heaps
Heaps are data structures that efficiently maintain the minimum (or maximum) for a set of data that may dynamically change.

In [97]:
h = binary_maxheap(Int)

BinaryHeap{Int64,GreaterThan}(GreaterThan(),[])

In [98]:
h = binary_maxheap([12,234,2,512,5,235,25,12,512,5,234,4])

BinaryHeap{Int64,GreaterThan}(GreaterThan(),[512,512,235,234,234,4,25,12,12,5,5,2])

In [100]:
# Let h be a heap, i be a handle, and v be a value.

length(h)         # returns the number of elements

isempty(h)        # returns whether the heap is empty

push!(h, 1651)       # add a value to the heap

top(h)            # return the top value of a heap

pop!(h)           # removes the top value, and returns it

1651

In [102]:
h = mutable_binary_maxheap([2414,14,61,7,2345,25,17,5,123])

MutableBinaryHeap(2414, 2345, 61, 123, 14, 25, 17, 5, 7)

In [105]:
update!(h, 2, 12)

`nlargest(n, a)` is equivalent to `sort(a, lt = >)[1:min(n, end)]`, and `nsmallest(n, a)` is equivalent to `sort(a, lt = <)[1:min(n, end)]`.

In [108]:
function nlargest(n, a::Array{Int})
    return sort(a, lt = >)[1:min(n, end)]
end

nlargest (generic function with 1 method)

In [110]:
function nsmallest(n, a::Array{Int})
    return sort(a, lt= <)[1:min(n, end)]
end

nsmallest (generic function with 1 method)

In [114]:
nlargest(3, [1,2,3,4,5])

3-element Array{Int64,1}:
 5
 4
 3

In [115]:
nsmallest(3, [5,4,3,2,1])

3-element Array{Int64,1}:
 1
 2
 3
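
Sorting the whole array does more work than needed when only a few elements are wanted. As a sketch, assuming a current Julia 1.x release where Base provides `partialsort`, the same results can be obtained without a full sort. The helper names `largest` and `smallest` are illustrative, chosen so they do not clash with the functions above.

```julia
# Sketch: the n largest / n smallest elements via a partial sort.
largest(n, a) = partialsort(a, 1:min(n, length(a)), rev = true)
smallest(n, a) = partialsort(a, 1:min(n, length(a)))

top3 = largest(3, [1, 2, 3, 4, 5])
bottom3 = smallest(3, [5, 4, 3, 2, 1])
```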

### OrderedDicts and OrderedSets
OrderedDicts are simply dictionaries whose entries have a particular order. For OrderedDicts (and OrderedSets), order refers to insertion order, which allows deterministic iteration over the dictionary or set:

In [118]:
d = OrderedDict{Char,Int}()
for c in 'a':'e'
    d[c] = c-'a'+1
end
collect(d)' # => [('a',1),('b',2),('c',3),('d',4),('e',5)]

1x5 Array{(Char,Int64),2}:
 ('a',1)  ('b',2)  ('c',3)  ('d',4)  ('e',5)

In [119]:
s = OrderedSet(π,e,γ,catalan,φ)
collect(s) # => [π = 3.1415926535897...,
           #     e = 2.7182818284590...,
           #     γ = 0.5772156649015...,
           #     catalan = 0.9159655941772...,
           #     φ = 1.6180339887498...]

5-element Array{Any,1}:
       π = 3.1415926535897...
       e = 2.7182818284590...
       γ = 0.5772156649015...
 catalan = 0.9159655941772...
       φ = 1.6180339887498...

### DefaultDict and DefaultOrderedDict
A DefaultDict allows specification of a default value to return when a requested key is not in a dictionary.

While the implementation is slightly different, a DefaultDict can be thought of as a normal Dict with a default value. A DefaultOrderedDict does the same for an OrderedDict.

In [122]:
dd = DefaultDict(1)               # create an (Any=>Any) DefaultDict with a default value of 1
dd = DefaultDict(String, Int, 0)  # create a (String=>Int) DefaultDict with a default value of 0

d = ['a'=>1, 'b'=>2]
dd = DefaultDict(0, d)            # provide a default value to an existing dictionary
dd['c'] == 0                      # true
#d['c'] == 0                      # false

dd = DefaultOrderedDict(time)     # call time() to provide the default value for an OrderedDict
dd = DefaultDict(Dict)            # Create a dictionary of dictionaries
                                  # Dict() is called to provide the default value
dd = DefaultDict(()->myfunc())    # call function myfunc to provide the default value

# create a Dictionary of type String=>DefaultDict{String, Int}, where the default of the
# inner set of DefaultDicts is zero
dd = DefaultDict(String, DefaultDict, ()->DefaultDict(String,Int,0))

DefaultDict{String,DefaultDict{K,V,F},Function} with 0 entries

In [124]:
d['c'] = 3

3

In [125]:
d

Dict{Char,Int64} with 3 entries:
  'b' => 2
  'c' => 3
  'a' => 1
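
For comparison, a plain `Dict` from Base can produce a similar effect at each lookup through `get` and `get!`. The sketch below assumes a current Julia 1.x release, and the variable names are illustrative:

```julia
# Sketch: default values with a plain Dict.
counts = Dict{Char,Int}()
for c in "banana"
    counts[c] = get(counts, c, 0) + 1   # get returns 0 for a missing key without storing it
end

# get! stores the default on first access, much like a function-valued default.
groups = Dict{Int,Vector{String}}()
for w in ["a", "bb", "cc", "d"]
    push!(get!(() -> String[], groups, length(w)), w)
end
```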

### Trie
An implementation of the Trie data structure. This is an associative structure, with String keys:

In [126]:
t=Trie{Int}()
t["Rob"]=42
t["Roger"]=24
haskey(t,"Rob") #true
get(t,"Rob",nothing) #42
keys(t) # "Rob", "Roger"

2-element Array{String,1}:
 "Rob"  
 "Roger"

In [132]:
#to test whether a trie contains any prefix of a given string, use:
seen_prefix(t::Trie, str) = any(v -> v.is_key, path(t, str))

seen_prefix (generic function with 1 method)

In [131]:
seen_prefix(t, "Rob")

true
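
To show what the prefix test walks over, here is a minimal character trie for a current Julia 1.x release. The names `Node`, `insert_key!` and `has_stored_prefix` are hypothetical; the package's `Trie` may be structured differently.

```julia
# Sketch of a character trie with a "does any stored key prefix this string?" test.
mutable struct Node
    children::Dict{Char,Node}
    is_key::Bool
    Node() = new(Dict{Char,Node}(), false)
end

function insert_key!(root::Node, key::AbstractString)
    node = root
    for c in key
        node = get!(() -> Node(), node.children, c)   # create the child on first visit
    end
    node.is_key = true
    return root
end

function has_stored_prefix(root::Node, str::AbstractString)
    node = root
    for c in str
        haskey(node.children, c) || return false   # fell off the trie
        node = node.children[c]
        node.is_key && return true                 # a stored key ends here
    end
    return false
end

root = Node()
insert_key!(root, "Rob")
insert_key!(root, "Roger")
```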

### Linked List

In [134]:
l1 = nil()

nil()

In [135]:
l2 = cons(1, l1)

list(1)

In [137]:
l3 = list(2,3)

list(2, 3)

In [138]:
l4 = cat(l1, l2, l3)

list(1, 2, 3)

In [139]:
l5 = map((x) -> x*2, l4)

list(2, 4, 6)

In [144]:
for i in l5; println(i); end

2
4
6


# Buffon's Needle

In [3]:
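function buffon(m)
    hit = 0
    for i = 1:m
        mp = rand()                      # needle midpoint, uniform between the lines at 0 and 1
        phi = (rand() * pi) - pi / 2     # needle angle, uniform in [-pi/2, pi/2]
        xrechts = mp + cos(phi) / 2      # right end of the unit-length needle
        xlinks = mp - cos(phi) / 2       # left end of the needle
        if xrechts >= 1 || xlinks <= 0   # the needle touches or crosses a line
            hit += 1
        end
    end
    miss = m - hit
    piapprox = m / hit * 2               # P(hit) = 2/pi, so pi is roughly 2m/hit
end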

buffon (generic function with 1 method)
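
A short derivation of why `m / hit * 2` approximates π, under the setup the function uses (needle length 1, lines at 0 and 1, midpoint $x$ uniform on $[0,1]$, angle $\varphi$ uniform on $[-\pi/2, \pi/2]$): the needle's horizontal half-extent is $h = \cos(\varphi)/2 \le 1/2$, and it touches a line when $x \le h$ or $x \ge 1 - h$, which for a fixed angle has probability $2h = \cos\varphi$. Averaging over the angle gives

$$
P(\text{hit}) = \frac{1}{\pi}\int_{-\pi/2}^{\pi/2} \cos\varphi \, d\varphi = \frac{2}{\pi}
\quad\Longrightarrow\quad
\pi \approx \frac{2m}{\text{hit}} .
$$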

In [34]:
"Time elapsed: " * string(@elapsed a = buffon(100000000)) * " seconds"

"Time elapsed: 4.659189982 seconds"

In [35]:
"Estimation for pi: " * string(a)

"Estimation for pi: 3.1416930423578924"

## Buffon's Needle in Parallel

In [7]:
function buffon_parallel(m)
    hit = @parallel (+) for i = 1:m
        mp = rand()
        phi = (rand() * pi) - pi / 2
        xrechts = mp + cos(phi) / 2
        xlinks = mp - cos(phi) / 2
        (xrechts >= 1 || xlinks <= 0) ? 1 : 0
    end
    miss = m - hit
    piapprox = m / hit * 2
end

buffon_parallel (generic function with 1 method)
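
`@parallel` is the macro name in the Julia version used here. From Julia 0.7 onward it is called `@distributed` and lives in the `Distributed` standard library. A rough equivalent for a current Julia 1.x release might look like the sketch below, where the name `buffon_distributed` is illustrative. The loop is only split across processes if workers have been added beforehand, for example with `addprocs`.

```julia
using Distributed   # standard library; provides @distributed on Julia 1.x

# Sketch: same simulation as buffon_parallel, with the per-iteration 0/1 results summed by (+).
function buffon_distributed(m)
    hit = @distributed (+) for i = 1:m
        mp = rand()                       # needle midpoint
        phi = (rand() * pi) - pi / 2      # needle angle
        half = cos(phi) / 2               # horizontal half-extent of the needle
        (mp + half >= 1 || mp - half <= 0) ? 1 : 0
    end
    return 2m / hit
end

estimate = buffon_distributed(200_000)
```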

In [31]:
"Time elapsed: " * string(@elapsed b = buffon_parallel(100000000)) * " seconds"

"Time elapsed: 4.610179443 seconds"

In [36]:
"Estimation for pi: " * string(b)

"Estimation for pi: 3.141909906393235"

In [37]:
function randmatstat(t; n=10)
    v = zeros(t)
    w = zeros(t)
    for i = 1:t
        a = randn(n,n)
        b = randn(n,n)
        c = randn(n,n)
        d = randn(n,n)
        P = [a b c d]
        Q = [a b; c d]
        v[i] = trace((P'*P)^4)
        w[i] = trace((Q'*Q)^4)
    end
    std(v)/mean(v), std(w)/mean(w)
end

randmatstat (generic function with 1 method)

In [38]:
randmatstat(100)

(0.35356334122197053,0.4004604306751935)

## Low Level Code

In [39]:
function qsort!(a,lo,hi)
    i, j = lo, hi
    while i < hi
        pivot = a[(lo+hi)>>>1]
        while i <= j
            while a[i] < pivot; i = i+1; end
            while a[j] > pivot; j = j-1; end
            if i <= j
                a[i], a[j] = a[j], a[i]
                i, j = i+1, j-1
            end
        end
        if lo < j; qsort!(a,lo,j); end
        lo, j = i, hi
    end
    return a
end

qsort! (generic function with 1 method)

In [45]:
c = randn(1000*1000, 100);

LoadError: interrupt
while loading In[45], in expression starting on line 1

## Using Python libraries within Julia

In [51]:
using PyCall
@pyimport math
math.sin(math.pi / 4) - sin(pi / 4)

0.0

In [52]:
@pyimport pylab
x = linspace(0,2*pi,1000); y = sin(3*x + 4*cos(2*x));
pylab.plot(x, y; color="red", linewidth=2.0, linestyle="--")
pylab.show()

# DataFrames in Julia

In [1]:
using DataFrames

In [2]:
df = DataFrame()
df[:A] = 1:8
df[:B] = ["M", "F", "F", "M", "F", "M", "M", "F"]
df

Row,A,B
1,1,M
2,2,F
3,3,F
4,4,M
5,5,F
6,6,M
7,7,M
8,8,F


In [3]:
head(df)

Row,A,B
1,1,M
2,2,F
3,3,F
4,4,M
5,5,F
6,6,M


In [5]:
using RDatasets
iris = dataset("datasets", "iris")
head(iris)

Row,SepalLength,SepalWidth,PetalLength,PetalWidth,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,1.4,0.2,setosa
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
5,5.0,3.6,1.4,0.2,setosa
6,5.4,3.9,1.7,0.4,setosa


In [17]:
iris.columns[5]     # Species is the fifth column

150-element PooledDataArray{ASCIIString,Uint8,1}:
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 ⋮          
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"

In [None]:
iris[:Species]

In [21]:
df = DataFrame(A = 1:10, B = 2:2:20)

Row,A,B
1,1,2
2,2,4
3,3,6
4,4,8
5,5,10
6,6,12
7,7,14
8,8,16
9,9,18
10,10,20


In [22]:
df[1:3, [:A, :B]]

Row,A,B
1,1,2
2,2,4
3,3,6


In [23]:
df[df[:A] % 2 .== 0, :]

Row,A,B
1,2,4
2,4,8
3,6,12
4,8,16
5,10,20


In [24]:
df[df[:B] % 2 .== 0, :]

Row,A,B
1,1,2
2,2,4
3,3,6
4,4,8
5,5,10
6,6,12
7,7,14
8,8,16
9,9,18
10,10,20


In [26]:
names = DataFrame(ID = [1, 2], Name = ["John Doe", "Jane Doe"])
jobs = DataFrame(ID = [1, 2], Job = ["Lawyer", "Doctor"])

Row,ID,Job
1,1,Lawyer
2,2,Doctor


In [27]:
full = join(names, jobs, on = :ID)

Row,ID,Name,Job
1,1,John Doe,Lawyer
2,2,Jane Doe,Doctor


In [28]:
a = DataFrame(ID = [1, 2], Name = ["A", "B"])
b = DataFrame(ID = [1, 3], Job = ["Doctor", "Lawyer"])
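join(a, b, on = :ID, kind = :inner)   # rows whose ID appears in both a and b
join(a, b, on = :ID, kind = :left)    # all rows of a, with NA where b has no match
join(a, b, on = :ID, kind = :right)   # all rows of b, with NA where a has no match
join(a, b, on = :ID, kind = :outer)   # all rows of both a and b
join(a, b, on = :ID, kind = :semi)    # rows of a that have a match in b, with a's columns only
join(a, b, on = :ID, kind = :anti)    # rows of a that have no match in b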

Row,ID,Name
1,2,B


### Split-Apply-Combine strategy
The `by` function implements the Split-Apply-Combine strategy. It takes three arguments: (1) a DataFrame, (2) a column to split the DataFrame on, and (3) a function or expression to apply to each subset of the DataFrame.

This is similar to a `GROUP BY` in SQL.

In [29]:
by(iris, :Species, size)

Row,Species,N
1,setosa,50
2,versicolor,50
3,virginica,50


In [30]:
by(iris, :Species, df -> mean(df[:PetalLength]))

Row,Species,x1
1,setosa,1.462
2,versicolor,4.26
3,virginica,5.552


In [31]:
by(iris, :Species, df -> DataFrame(N = size(df, 1)))

Row,Species,N
1,setosa,50
2,versicolor,50
3,virginica,50


A second approach to the Split-Apply-Combine strategy is implemented in the `aggregate` function, which also takes three arguments: (1) a DataFrame, (2) a column (or columns) to split the DataFrame on, and (3) a function (or several functions) used to compute a summary of each subset of the DataFrame. Each function is applied to each column that was not used to split the DataFrame, creating new columns named `$name_$function`, e.g. `SepalLength_mean`.

In [33]:
aggregate(iris, :Species, sum)

Row,Species,SepalLength_sum,SepalWidth_sum,PetalLength_sum,PetalWidth_sum
1,setosa,250.3,171.39999999999998,73.1,12.3
2,versicolor,296.8,138.5,213.0,66.30000000000001
3,virginica,329.4,148.7,277.6,101.3


In [34]:
aggregate(iris, :Species, [sum, mean])

Row,Species,SepalLength_sum,SepalLength_mean,SepalWidth_sum,SepalWidth_mean,PetalLength_sum,PetalLength_mean,PetalWidth_sum,PetalWidth_mean
1,setosa,250.3,5.005999999999999,171.39999999999998,3.4279999999999995,73.1,1.462,12.3,0.246
2,versicolor,296.8,5.936,138.5,2.77,213.0,4.26,66.30000000000001,1.3260000000000003
3,virginica,329.4,6.587999999999999,148.7,2.974,277.6,5.552,101.3,2.026
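
The column-naming rule can be mimicked in a few lines. The sketch below assumes a current Julia 1.x release and handles a single group only, using the first two iris rows shown earlier. `mean_of` is a hypothetical stand-in for `mean`, so the resulting names end in `_mean_of`.

```julia
# Sketch: apply every function to every column and name each result "<column>_<function>".
columns = [:SepalLength => [5.1, 4.9], :SepalWidth => [3.5, 3.0]]
mean_of(v) = sum(v) / length(v)   # hypothetical helper; avoids loading Statistics
funcs = [sum, mean_of]

result = Dict{Symbol,Float64}()
for (name, col) in columns, f in funcs
    result[Symbol(name, "_", nameof(f))] = f(col)
end
```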


#### Stacking

In [35]:
iris[:id] = 1:size(iris, 1)  # this makes it easier to unstack
d = stack(iris, [1:4])

Row,variable,value,Species,id
1,SepalLength,5.1,setosa,1
2,SepalLength,4.9,setosa,2
3,SepalLength,4.7,setosa,3
4,SepalLength,4.6,setosa,4
5,SepalLength,5.0,setosa,5
6,SepalLength,5.4,setosa,6
7,SepalLength,4.6,setosa,7
8,SepalLength,5.0,setosa,8
9,SepalLength,4.4,setosa,9
10,SepalLength,4.9,setosa,10


In [36]:
d = stack(iris, [:SepalLength, :SepalWidth], :Species)

Row,variable,value,Species
1,SepalLength,5.1,setosa
2,SepalLength,4.9,setosa
3,SepalLength,4.7,setosa
4,SepalLength,4.6,setosa
5,SepalLength,5.0,setosa
6,SepalLength,5.4,setosa
7,SepalLength,4.6,setosa
8,SepalLength,5.0,setosa
9,SepalLength,4.4,setosa
10,SepalLength,4.9,setosa


In [37]:
d = melt(iris, :Species)

Row,variable,value,Species
1,SepalLength,5.1,setosa
2,SepalLength,4.9,setosa
3,SepalLength,4.7,setosa
4,SepalLength,4.6,setosa
5,SepalLength,5.0,setosa
6,SepalLength,5.4,setosa
7,SepalLength,4.6,setosa
8,SepalLength,5.0,setosa
9,SepalLength,4.4,setosa
10,SepalLength,4.9,setosa


In [39]:
sort!(iris)

Row,SepalLength,SepalWidth,PetalLength,PetalWidth,Species,id
1,4.3,3.0,1.1,0.1,setosa,14
2,4.4,2.9,1.4,0.2,setosa,9
3,4.4,3.0,1.3,0.2,setosa,39
4,4.4,3.2,1.3,0.2,setosa,43
5,4.5,2.3,1.3,0.3,setosa,42
6,4.6,3.1,1.5,0.2,setosa,4
7,4.6,3.2,1.4,0.2,setosa,48
8,4.6,3.4,1.4,0.3,setosa,7
9,4.6,3.6,1.0,0.2,setosa,23
10,4.7,3.2,1.3,0.2,setosa,3


In [40]:
sort!(iris, rev = true)

sort!(iris, cols = [:SepalWidth, :SepalLength])

sort!(iris, cols = [order(:Species, by = uppercase),
                    order(:SepalLength, rev = true)])

Row,SepalLength,SepalWidth,PetalLength,PetalWidth,Species,id
1,5.8,4.0,1.2,0.2,setosa,15
2,5.7,3.8,1.7,0.3,setosa,19
3,5.7,4.4,1.5,0.4,setosa,16
4,5.5,3.5,1.3,0.2,setosa,37
5,5.5,4.2,1.4,0.2,setosa,34
6,5.4,3.4,1.7,0.2,setosa,21
7,5.4,3.4,1.5,0.4,setosa,32
8,5.4,3.7,1.5,0.2,setosa,11
9,5.4,3.9,1.7,0.4,setosa,6
10,5.4,3.9,1.3,0.4,setosa,17


The following two examples show two ways to sort the iris dataset with the same result: Species will be ordered in reverse lexicographic order, and within species, rows will be sorted by increasing sepal length and width:

In [41]:
sort!(iris, cols = (:Species, :SepalLength, :SepalWidth),
            rev = (true, false, false))

Row,SepalLength,SepalWidth,PetalLength,PetalWidth,Species,id
1,4.9,2.5,4.5,1.7,virginica,107
2,5.6,2.8,4.9,2.0,virginica,122
3,5.7,2.5,5.0,2.0,virginica,114
4,5.8,2.7,5.1,1.9,virginica,143
5,5.8,2.7,5.1,1.9,virginica,102
6,5.8,2.8,5.1,2.4,virginica,115
7,5.9,3.0,5.1,1.8,virginica,150
8,6.0,2.2,5.0,1.5,virginica,120
9,6.0,3.0,4.8,1.8,virginica,139
10,6.1,2.6,5.6,1.4,virginica,135


In [42]:
sort!(iris,
      cols = (order(:Species, rev = true), :SepalLength, :SepalWidth))

Row,SepalLength,SepalWidth,PetalLength,PetalWidth,Species,id
1,4.9,2.5,4.5,1.7,virginica,107
2,5.6,2.8,4.9,2.0,virginica,122
3,5.7,2.5,5.0,2.0,virginica,114
4,5.8,2.7,5.1,1.9,virginica,143
5,5.8,2.7,5.1,1.9,virginica,102
6,5.8,2.8,5.1,2.4,virginica,115
7,5.9,3.0,5.1,1.8,virginica,150
8,6.0,2.2,5.0,1.5,virginica,120
9,6.0,3.0,4.8,1.8,virginica,139
10,6.1,2.6,5.6,1.4,virginica,135
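
The same mixed-direction ordering can be expressed on plain data with a custom comparison. This sketch assumes a current Julia 1.x release and uses four iris rows taken from the tables above. `row_lt` is a hypothetical helper name.

```julia
# Sketch: species in reverse lexicographic order, then sepal length and width ascending.
rows = [(species = "setosa",    len = 5.1, wid = 3.5),
        (species = "virginica", len = 5.8, wid = 2.7),
        (species = "virginica", len = 4.9, wid = 2.5),
        (species = "setosa",    len = 4.9, wid = 3.0)]

function row_lt(a, b)
    a.species != b.species && return a.species > b.species   # descending on species
    return (a.len, a.wid) < (b.len, b.wid)                   # ascending on the rest
end

sorted_rows = sort(rows, lt = row_lt)
```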


## The Formula, ModelFrame and ModelMatrix Types

In [43]:
fm = Z ~ X + Y

Formula: Z ~ X + Y

In [44]:
df = DataFrame(X = randn(10), Y = randn(10), Z = randn(10))
mf = ModelFrame(Z ~ X + Y, df)

ModelFrame(10x3 DataFrame
| Row | Z         | X         | Y          |
|-----|-----------|-----------|------------|
| 1   | 0.794779  | -1.59965  | 1.01895    |
| 2   | -1.21096  | -0.321773 | 0.148022   |
| 3   | 0.419271  | -0.980067 | -0.346114  |
| 4   | -0.462574 | -0.206005 | 0.483565   |
| 5   | 0.967427  | -1.12538  | 1.14153    |
| 6   | -0.358823 | 0.355719  | -0.0709589 |
| 7   | -0.486693 | -0.56209  | 0.442729   |
| 8   | 0.823228  | 0.29741   | -0.418333  |
| 9   | 0.364264  | 0.0452285 | 0.0429356  |
| 10  | -0.109961 | -0.518664 | -1.47131   |,Terms({:X,:Y},{:Z,:X,:Y},3x3 Array{Int8,2}:
 1  0  0
 0  1  0
 0  0  1,[1,1,1],true,true),Bool[true,true,true,true,true,true,true,true,true,true])

In [45]:
mm = ModelMatrix(ModelFrame(Z ~ X + Y, df))

ModelMatrix{Float64}(10x3 Array{Float64,2}:
 1.0  -1.59965     1.01895  
 1.0  -0.321773    0.148022 
 1.0  -0.980067   -0.346114 
 1.0  -0.206005    0.483565 
 1.0  -1.12538     1.14153  
 1.0   0.355719   -0.0709589
 1.0  -0.56209     0.442729 
 1.0   0.29741    -0.418333 
 1.0   0.0452285   0.0429356
 1.0  -0.518664   -1.47131  ,[0,1,2])

Expressing interactions

In [46]:
mm = ModelMatrix(ModelFrame(Z ~ X + Y + X&Y, df))

ModelMatrix{Float64}(10x4 Array{Float64,2}:
 1.0  -1.59965     1.01895    -1.62997   
 1.0  -0.321773    0.148022   -0.0476295 
 1.0  -0.980067   -0.346114    0.339215  
 1.0  -0.206005    0.483565   -0.0996167 
 1.0  -1.12538     1.14153    -1.28466   
 1.0   0.355719   -0.0709589  -0.0252414 
 1.0  -0.56209     0.442729   -0.248854  
 1.0   0.29741    -0.418333   -0.124417  
 1.0   0.0452285   0.0429356   0.00194192
 1.0  -0.518664   -1.47131     0.763114  ,[0,1,2,3])

Expressing main effects and interactions

In [47]:
mm = ModelMatrix(ModelFrame(Z ~ X*Y, df))

ModelMatrix{Float64}(10x4 Array{Float64,2}:
 1.0  -1.59965     1.01895    -1.62997   
 1.0  -0.321773    0.148022   -0.0476295 
 1.0  -0.980067   -0.346114    0.339215  
 1.0  -0.206005    0.483565   -0.0996167 
 1.0  -1.12538     1.14153    -1.28466   
 1.0   0.355719   -0.0709589  -0.0252414 
 1.0  -0.56209     0.442729   -0.248854  
 1.0   0.29741    -0.418333   -0.124417  
 1.0   0.0452285   0.0429356   0.00194192
 1.0  -0.518664   -1.47131     0.763114  ,[0,1,2,3])
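
The two outputs above are identical, which suggests that `X*Y` expands to `X + Y + X&Y` and that the `X&Y` column is the element-wise product of `X` and `Y`. As a sketch of that reading for a current Julia 1.x release, the matrix can be assembled by hand from the first three rounded `X` and `Y` values printed above:

```julia
# Sketch: design matrix for Z ~ X + Y + X&Y built with Base only.
X = [-1.59965, -0.321773, -0.980067]
Y = [1.01895, 0.148022, -0.346114]

# Columns: intercept, X, Y, and the interaction X .* Y.
M = hcat(ones(length(X)), X, Y, X .* Y)
```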

## Pooling Data: Representing Factors

In [48]:
dv = @data(["Group A", "Group A", "Group A",
            "Group B", "Group B", "Group B"])

6-element DataArray{ASCIIString,1}:
 "Group A"
 "Group A"
 "Group A"
 "Group B"
 "Group B"
 "Group B"

In [49]:
pdv = @pdata(["Group A", "Group A", "Group A",
              "Group B", "Group B", "Group B"])

6-element PooledDataArray{ASCIIString,Uint32,1}:
 "Group A"
 "Group A"
 "Group A"
 "Group B"
 "Group B"
 "Group B"

In [51]:
levels(pdv)

2-element Array{ASCIIString,1}:
 "Group A"
 "Group B"

In [53]:
df = DataFrame(A = [1, 1, 1, 2, 2, 2],
               B = ["X", "X", "X", "Y", "Y", "Y"])
pool!(df, [:A, :B])
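
The idea behind pooling can be sketched without any package: store each distinct level once and keep a small integer reference per element. This is a hedged illustration for a current Julia 1.x release, with illustrative variable names. It is not how `PooledDataArray` is implemented internally.

```julia
# Sketch: a pooled representation as a level table plus integer references.
raw = ["Group A", "Group A", "Group A", "Group B", "Group B", "Group B"]

pool = unique(raw)                                        # distinct levels, in order of first appearance
refs = UInt8[findfirst(isequal(v), pool) for v in raw]    # one small integer per element
decoded = pool[refs]                                      # look the strings up again
```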