# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), August 29, 2018**

In [1]:
using DataFrames # load package

## Split-apply-combine

In [2]:
x = DataFrame(id=[1,2,3,4,1,2,3,4], id2=[1,2,1,2,1,2,1,2], v=rand(8))

Row,id,id2,v
1,1,1,0.376992
2,2,2,0.315282
3,3,1,0.851024
4,4,2,0.678299
5,1,1,0.938288
6,2,2,0.315817
7,3,1,0.79761
8,4,2,0.947347


In [3]:
gx1 = groupby(x, :id)

GroupedDataFrame  4 groups with keys: Symbol[:id]
First Group:
2×3 SubDataFrame{Array{Int64,1}}
│ Row │ id │ id2 │ v        │
├─────┼────┼─────┼──────────┤
│ 1   │ 1  │ 1   │ 0.376992 │
│ 2   │ 1  │ 1   │ 0.938288 │
⋮
Last Group:
2×3 SubDataFrame{Array{Int64,1}}
│ Row │ id │ id2 │ v        │
├─────┼────┼─────┼──────────┤
│ 1   │ 4  │ 2   │ 0.678299 │
│ 2   │ 4  │ 2   │ 0.947347 │

In [4]:
gx2 = groupby(x, [:id, :id2])

GroupedDataFrame  4 groups with keys: Symbol[:id, :id2]
First Group:
2×3 SubDataFrame{Array{Int64,1}}
│ Row │ id │ id2 │ v        │
├─────┼────┼─────┼──────────┤
│ 1   │ 1  │ 1   │ 0.376992 │
│ 2   │ 1  │ 1   │ 0.938288 │
⋮
Last Group:
2×3 SubDataFrame{Array{Int64,1}}
│ Row │ id │ id2 │ v        │
├─────┼────┼─────┼──────────┤
│ 1   │ 4  │ 2   │ 0.678299 │
│ 2   │ 4  │ 2   │ 0.947347 │

In [5]:
vcat(gx2...) # recombine the groups into one DataFrame; rows come back in group order, not the original order

Row,id,id2,v
1,1,1,0.376992
2,1,1,0.938288
3,2,2,0.315282
4,2,2,0.315817
5,3,1,0.851024
6,3,1,0.79761
7,4,2,0.678299
8,4,2,0.947347
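
The round trip above can be checked on reproducible data. The following is a minimal sketch, assuming a recent DataFrames.jl release (1.x); the fixed `v` column replaces `rand(8)` purely for illustration. It suggests that `vcat` over the groups returns the same rows as the source, stacked group by group, so both sides are sorted before comparing.

```julia
using DataFrames

# Fixed values instead of rand(8) so the result is reproducible (illustrative assumption)
x = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4], id2=[1, 2, 1, 2, 1, 2, 1, 2], v=collect(1.0:8.0))

gx = groupby(x, [:id, :id2])   # split into 4 groups
y = vcat(gx...)                # stack the groups back into one DataFrame

# Same rows as x once both are sorted by the unique column v
same_rows = sort(y, :v) == sort(x, :v)
```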


In [6]:
x = DataFrame(id = [missing, 5, 1, 3, missing], x = 1:5)

Row,id,x
1,missing,1
2,5,2
3,1,3
4,3,4
5,missing,5


In [7]:
groupby(x, :id) # by default groups include missing values and are not sorted

GroupedDataFrame  4 groups with keys: Symbol[:id]
First Group:
2×2 SubDataFrame{Array{Int64,1}}
│ Row │ id      │ x │
├─────┼─────────┼───┤
│ 1   │ missing │ 1 │
│ 2   │ missing │ 5 │
⋮
Last Group:
1×2 SubDataFrame{Array{Int64,1}}
│ Row │ id │ x │
├─────┼────┼───┤
│ 1   │ 3  │ 4 │

In [8]:
groupby(x, :id, sort=true, skipmissing=true) # but we can change it :)

GroupedDataFrame  3 groups with keys: Symbol[:id]
First Group:
1×2 SubDataFrame{Array{Int64,1}}
│ Row │ id │ x │
├─────┼────┼───┤
│ 1   │ 1  │ 3 │
⋮
Last Group:
1×2 SubDataFrame{Array{Int64,1}}
│ Row │ id │ x │
├─────┼────┼───┤
│ 1   │ 5  │ 2 │

In [10]:
using Statistics
x = DataFrame(id=rand('a':'d', 100), v=rand(100));
by(x, :id, y->mean(y.v)) # apply a function to each group of a data frame

Row,id,x1
1,'c',0.53167
2,'b',0.536764
3,'d',0.533255
4,'a',0.500416


In [11]:
by(x, :id, y->mean(y.v), sort=true) # we can sort the output

Row,id,x1
1,'a',0.500416
2,'b',0.536764
3,'c',0.53167
4,'d',0.533255


In [12]:
by(x, :id, y->DataFrame(res=mean(y.v))) # returning a DataFrame lets us name the result column; @by from DataFramesMeta is more convenient

Row,id,res
1,'c',0.53167
2,'b',0.536764
3,'d',0.533255
4,'a',0.500416
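
`by` was deprecated in later DataFrames.jl releases and is not available in 1.x. The following is a minimal sketch of the same named-column aggregation, assuming DataFrames.jl 1.x and a small fixed data set chosen only for illustration.

```julia
using DataFrames, Statistics

# Small fixed data set (illustrative assumption, not the random data used above)
x = DataFrame(id=['a', 'b', 'a', 'b'], v=[1.0, 2.0, 3.0, 4.0])

# source column => function => name of the output column
res = combine(groupby(x, :id, sort=true), :v => mean => :res)
```

The `source => function => target` pair names the result column directly, so no wrapping `DataFrame` is needed.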


In [13]:
x = DataFrame(id=rand('a':'d', 100), x1=rand(100), x2=rand(100))
aggregate(x, :id, sum) # apply a function over all columns of a data frame in groups given by id

Row,id,x1_sum,x2_sum
1,'a',12.1865,15.3687
2,'c',10.487,9.84328
3,'b',13.6422,12.8255
4,'d',9.32685,9.8113



In [14]:
aggregate(x, :id, sum, sort=true) # also can be sorted

Row,id,x1_sum,x2_sum
1,'a',12.1865,15.3687
2,'b',13.6422,12.8255
3,'c',10.487,9.84328
4,'d',9.32685,9.8113
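
`aggregate` was likewise removed in DataFrames.jl 1.x. A possible equivalent, sketched under the assumption of DataFrames.jl 1.x and a small fixed data set used only for illustration, applies one function to several columns per group:

```julia
using DataFrames

# Small fixed data set (illustrative assumption)
x = DataFrame(id=['a', 'b', 'a', 'b'], x1=[1.0, 2.0, 3.0, 4.0], x2=[10.0, 20.0, 30.0, 40.0])

# Broadcasting .=> builds one `column => sum` pair per value column;
# the output columns are auto-named x1_sum and x2_sum
res = combine(groupby(x, :id, sort=true), [:x1, :x2] .=> sum)
```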


*We omit the discussion of map/combine, as I do not find them very useful (it is better to use by).*

In [15]:
x = DataFrame(rand(3, 5))

Row,x1,x2,x3,x4,x5
1,0.639392,0.651516,0.905496,0.096984,0.196036
2,0.58018,0.677913,0.766194,0.507495,0.461365
3,0.55038,0.797195,0.456577,0.765008,0.925907


In [16]:
map(mean, eachcol(x)) # map a function over each column and return a data frame

Row,x1,x2,x3,x4,x5
1,0.589984,0.708874,0.709422,0.456496,0.527769


In [17]:
foreach(c -> println(c[1], ": ", mean(c[2])), eachcol(x)) # a raw iteration returns a tuple with column name and values

x1: 0.5899838767033462
x2: 0.7088744581491442
x3: 0.7094221963890801
x4: 0.4564955170599234
x5: 0.5277693846746828


In [18]:
colwise(mean, x) # colwise is similar, but produces a vector

5-element Array{Float64,1}:
 0.5899838767033462
 0.7088744581491442
 0.7094221963890801
 0.4564955170599234
 0.5277693846746828
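
`colwise` is not available in DataFrames.jl 1.x, where `eachcol` iterates over the column vectors only. The following is a minimal sketch of column-wise reduction, assuming DataFrames.jl 1.x and a small fixed matrix chosen for illustration.

```julia
using DataFrames, Statistics

# In 1.x a matrix needs explicit column names; :auto generates x1, x2, ...
x = DataFrame([1.0 2.0; 3.0 4.0], :auto)

# One mean per column, collected into a plain Vector
col_means = [mean(col) for col in eachcol(x)]

# pairs(eachcol(x)) yields column name => column vector
named_means = [name => mean(col) for (name, col) in pairs(eachcol(x))]
```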

In [19]:
x[:id] = [1,1,2]
colwise(mean, groupby(x, :id)) # and works on GroupedDataFrame

2-element Array{Array{Float64,1},1}:
 [0.609786, 0.664714, 0.835845, 0.302239, 0.328701, 1.0]
 [0.55038, 0.797195, 0.456577, 0.765008, 0.925907, 2.0] 

In [20]:
map(r -> r.x1/r.x2, eachrow(x)) # each element passed to the function is a DataFrameRow, which works similarly to a one-row DataFrame

3-element Array{Float64,1}:
 0.9813913399427114
 0.8558320345818962
 0.6903961212620849
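
For a simple elementwise ratio like the one above, broadcasting over whole columns should give the same result as mapping over rows. The following is a minimal sketch on a small fixed data set (an illustrative assumption), using DataFrames.jl 1.x.

```julia
using DataFrames

# Small fixed data set (illustrative assumption)
x = DataFrame(x1=[1.0, 4.0], x2=[2.0, 8.0])

by_row = map(r -> r.x1 / r.x2, eachrow(x))   # one DataFrameRow at a time
by_col = x.x1 ./ x.x2                        # elementwise on whole columns
```

Row iteration remains the more general tool when the function needs several fields of a row and cannot be written as a broadcast.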