# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Dec 5, 2017**

A brief introduction to basic usage of `DataFrames`. Tested under `DataFrames` master on 2017-12-05.
I will try to keep it up to date as the package evolves. This tutorial covers `DataFrames`, `CSV`, `Missings` and `CategoricalArrays` only. It does not show any additional packages that can be used with `DataFrames`.

In [1]:
using DataFrames # load package

### Split-apply-combine

In [2]:
x = DataFrame(id=[1,2,3,4,1,2,3,4], id2=[1,2,1,2,1,2,1,2], v=rand(8))

Row,id,id2,v
1,1,1,0.892486
2,2,2,0.83603
3,3,1,0.225134
4,4,2,0.348906
5,1,1,0.884499
6,2,2,0.459531
7,3,1,0.942475
8,4,2,0.999226


In [3]:
gx1 = groupby(x, :id) # group by a single variable

DataFrames.GroupedDataFrame  4 groups with keys: Symbol[:id]
First Group:
2×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ id │ id2 │ v        │
├─────┼────┼─────┼──────────┤
│ 1   │ 1  │ 1   │ 0.892486 │
│ 2   │ 1  │ 1   │ 0.884499 │
⋮
Last Group:
2×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ id │ id2 │ v        │
├─────┼────┼─────┼──────────┤
│ 1   │ 4  │ 2   │ 0.348906 │
│ 2   │ 4  │ 2   │ 0.999226 │

In [4]:
gx2 = groupby(x, [:id, :id2]) # group by multiple variables

DataFrames.GroupedDataFrame  4 groups with keys: Symbol[:id, :id2]
First Group:
2×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ id │ id2 │ v        │
├─────┼────┼─────┼──────────┤
│ 1   │ 1  │ 1   │ 0.892486 │
│ 2   │ 1  │ 1   │ 0.884499 │
⋮
Last Group:
2×3 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ id │ id2 │ v        │
├─────┼────┼─────┼──────────┤
│ 1   │ 4  │ 2   │ 0.348906 │
│ 2   │ 4  │ 2   │ 0.999226 │
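
A `GroupedDataFrame` can also be iterated group by group. The following is a minimal sketch, assuming a DataFrames 1.x release on Julia 1.x (newer than the version this tutorial was tested on); the fixed `v` column and the names `gx`, `sizes` and `means` are chosen only for illustration.

```julia
using DataFrames
using Statistics  # `mean` lives in the Statistics stdlib on Julia 1.x

# Same layout as `x` above, but with fixed values instead of rand(8)
x = DataFrame(id=[1,2,3,4,1,2,3,4], id2=[1,2,1,2,1,2,1,2], v=collect(1.0:8.0))
gx = groupby(x, [:id, :id2])

# Each element of the iteration is a SubDataFrame (a view into `x`)
sizes = [nrow(g) for g in gx]    # number of rows in each group
means = [mean(g.v) for g in gx]  # mean of `v` within each group
```

Because the groups are views, no data is copied until a new data frame is built from them.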

In [5]:
vcat(gx2...) # back to a single DataFrame (same rows as the original, but ordered by group)

Row,id,id2,v
1,1,1,0.892486
2,1,1,0.884499
3,2,2,0.83603
4,2,2,0.459531
5,3,1,0.225134
6,3,1,0.942475
7,4,2,0.348906
8,4,2,0.999226
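
If the original row order matters, the grouped object still knows where it came from. A small sketch, assuming DataFrames 1.x (not the version used in this tutorial), where `parent` returns the data frame the groups were built from; the names `stacked` and `original` are illustrative only.

```julia
using DataFrames

x = DataFrame(id=[1,2,3,4,1,2,3,4], id2=[1,2,1,2,1,2,1,2], v=collect(1.0:8.0))
gx2 = groupby(x, [:id, :id2])

stacked = vcat(gx2...)   # new data frame, rows concatenated group by group
original = parent(gx2)   # the data frame that was grouped, in its original row order
```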


In [6]:
x = DataFrame(id = [missing, 5, 1, 3, missing], x = 1:5)

Row,id,x
1,missing,1
2,5,2
3,1,3
4,3,4
5,missing,5


In [7]:
showall(groupby(x, :id)) # by default groups include missing values and are not sorted

DataFrames.GroupedDataFrame  4 groups with keys: Symbol[:id]
gd[1]:
2×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ id      │ x │
├─────┼─────────┼───┤
│ 1   │ missing │ 1 │
│ 2   │ missing │ 5 │
gd[2]:
1×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ id │ x │
├─────┼────┼───┤
│ 1   │ 5  │ 2 │
gd[3]:
1×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ id │ x │
├─────┼────┼───┤
│ 1   │ 1  │ 3 │
gd[4]:
1×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ id │ x │
├─────┼────┼───┤
│ 1   │ 3  │ 4 │

In [8]:
showall(groupby(x, :id, sort=true, skipmissing=true)) # but we can change it :)

DataFrames.GroupedDataFrame  3 groups with keys: Symbol[:id]
gd[1]:
1×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ id │ x │
├─────┼────┼───┤
│ 1   │ 1  │ 3 │
gd[2]:
1×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ id │ x │
├─────┼────┼───┤
│ 1   │ 3  │ 4 │
gd[3]:
1×2 DataFrames.SubDataFrame{Array{Int64,1}}
│ Row │ id │ x │
├─────┼────┼───┤
│ 1   │ 5  │ 2 │
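
The effect of the two keyword arguments can be checked without printing every group. This is a sketch assuming DataFrames 1.x, where `groupby` accepts the same `sort` and `skipmissing` keywords; `gd_all`, `gd_clean` and `group_ids` are names made up for the example.

```julia
using DataFrames

x = DataFrame(id = [missing, 5, 1, 3, missing], x = 1:5)

gd_all = groupby(x, :id)                                 # missing forms its own group
gd_clean = groupby(x, :id, sort=true, skipmissing=true)  # drop missing, sort by key

# Key value of every remaining group, in iteration order
group_ids = [first(g.id) for g in gd_clean]
```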

In [9]:
x = DataFrame(id=rand('a':'d', 100), v=rand(100));
by(x, :id, v->mean(v[:v])) # apply a function to each group of a data frame

Row,id,x1
1,'d',0.580662
2,'a',0.506477
3,'b',0.566787
4,'c',0.463323


In [10]:
by(x, :id, v->mean(v[:v]), sort=true) # we can sort the output

Row,id,x1
1,'a',0.506477
2,'b',0.566787
3,'c',0.463323
4,'d',0.580662
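
In DataFrames 1.0 and later, `by` is no longer available; grouping and applying are written as `combine` over a `groupby`. A rough counterpart of the call above, assuming DataFrames 1.x on Julia 1.x and using small fixed data (hypothetical values) so the result is reproducible:

```julia
using DataFrames
using Statistics  # provides `mean` on Julia 1.x

# Fixed data instead of rand, so the result is reproducible
x = DataFrame(id=['b', 'a', 'b', 'a'], v=[1.0, 2.0, 3.0, 4.0])

# Rough counterpart of by(x, :id, v->mean(v[:v]), sort=true);
# `=> :x1` names the output column as in the tables above
res = combine(groupby(x, :id, sort=true), :v => mean => :x1)
```

Without the trailing `=> :x1`, the output column would be named `v_mean`.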


In [11]:
x = DataFrame(id=rand('a':'d', 100), x1=rand(100), x2=rand(100))
aggregate(x, :id, sum) # apply a function over all columns of a data frame in groups given by id

Row,id,x1_sum,x2_sum
1,'c',12.243,14.5939
2,'a',15.5228,11.0514
3,'d',12.2026,8.48168
4,'b',7.78458,10.3551


In [12]:
aggregate(x, :id, sum, sort=true) # also can be sorted

Row,id,x1_sum,x2_sum
1,'a',15.5228,11.0514
2,'b',7.78458,10.3551
3,'c',12.243,14.5939
4,'d',12.2026,8.48168
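
`aggregate` was likewise removed in DataFrames 1.0. A sketch of a rough equivalent, assuming DataFrames 1.x and fixed illustrative data; it produces the same `x1_sum` and `x2_sum` column names:

```julia
using DataFrames

x = DataFrame(id=['b', 'a', 'b', 'a'],
              x1=[1.0, 2.0, 3.0, 4.0],
              x2=[10.0, 20.0, 30.0, 40.0])

# Rough counterpart of aggregate(x, :id, sum, sort=true):
# broadcasting `.=>` builds one `column => sum` pair per value column
res = combine(groupby(x, :id, sort=true), [:x1, :x2] .=> sum)
```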


*I omit the discussion of `map`/`combine`, as I do not find them very useful (it is better to use `by`).*

In [13]:
x = DataFrame(rand(3, 5))

Row,x1,x2,x3,x4,x5
1,0.481785,0.983545,0.815865,0.482619,0.317799
2,0.497258,0.535293,0.282936,0.75274,0.651277
3,0.726341,0.764481,0.382081,0.717635,0.997081


In [14]:
map(mean, eachcol(x)) # map a function over each column and return a data frame

Row,x1,x2,x3,x4,x5
1,0.568461,0.761106,0.493628,0.650998,0.655386


In [15]:
foreach(c -> println(c[1], ": ", mean(c[2])), eachcol(x)) # a raw iteration returns a tuple with column name and values

x1: 0.5684614038386472
x2: 0.7611063750249469
x3: 0.4936275621904791
x4: 0.6509981731333249
x5: 0.6553855774049487


In [16]:
colwise(mean, x) # colwise is similar, but produces a vector

5-element Array{Float64,1}:
 0.568461
 0.761106
 0.493628
 0.650998
 0.655386
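
The column-wise operations above changed in later releases: `colwise` was removed, `eachcol` yields plain column vectors, and a matrix constructor needs `:auto` to generate column names. A sketch of rough equivalents, assuming DataFrames 1.x on Julia 1.x and a small fixed matrix (hypothetical values) in place of `rand(3, 5)`:

```julia
using DataFrames
using Statistics  # provides `mean` on Julia 1.x

# DataFrames 1.x needs :auto to generate the column names x1, x2, ...
x = DataFrame([1.0 4.0; 2.0 5.0; 3.0 6.0], :auto)

col_means = [mean(c) for c in eachcol(x)]  # plain vector, like colwise(mean, x)
means_df = mapcols(mean, x)                # one-row data frame, like map over eachcol above

# pairs(eachcol(x)) yields column name => column vector
for (name, col) in pairs(eachcol(x))
    println(name, ": ", mean(col))
end
```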

In [17]:
x[:id] = [1,1,2]
colwise(mean, groupby(x, :id)) # and works on GroupedDataFrame

2-element Array{Array{Float64,1},1}:
 [0.489521, 0.759419, 0.549401, 0.61768, 0.484538, 1.0] 
 [0.726341, 0.764481, 0.382081, 0.717635, 0.997081, 2.0]

In [18]:
map(r -> r[:x1]/r[:x2], eachrow(x)) # here each element passed to the function is a DataFrameRow, which works similarly to a one-row DataFrame

3-element Array{Float64,1}:
 0.489846
 0.928944
 0.950111
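
Row iteration reads naturally, but the same ratio can also be computed directly on the columns. A minimal sketch, assuming DataFrames 1.x property access (`r.x1`, `x.x1`) and fixed illustrative values; `ratios` and `ratios_vec` are names introduced only for this example.

```julia
using DataFrames

x = DataFrame(x1=[1.0, 2.0, 3.0], x2=[2.0, 4.0, 12.0])

# Each `r` is a DataFrameRow; columns can be read as r.x1 or r[:x1]
ratios = map(r -> r.x1 / r.x2, eachrow(x))

# The same result without row iteration, using broadcasting over columns
ratios_vec = x.x1 ./ x.x2
```

Broadcasting over columns avoids creating one `DataFrameRow` per row, which may matter for large data frames.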