# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), September 26, 2018**

In [1]:
using DataFrames # load package

## Split-apply-combine

In [2]:
x = DataFrame(id=[1,2,3,4,1,2,3,4], id2=[1,2,1,2,1,2,1,2], v=rand(8))

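| Row | id | id2 | v |
|-----|----|-----|---|
| 1 | 1 | 1 | 0.928275 |
| 2 | 2 | 2 | 0.982743 |
| 3 | 3 | 1 | 0.236876 |
| 4 | 4 | 2 | 0.969184 |
| 5 | 1 | 1 | 0.0272118 |
| 6 | 2 | 2 | 0.342781 |
| 7 | 3 | 1 | 0.0338346 |
| 8 | 4 | 2 | 0.701404 |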


In [3]:
gx1 = groupby(x, :id)

GroupedDataFrame with 4 groups based on key: :id
First Group: 2 rows
│ Row │ id    │ id2   │ v         │
│     │ Int64 │ Int64 │ Float64   │
├─────┼───────┼───────┼───────────┤
│ 1   │ 1     │ 1     │ 0.928275  │
│ 2   │ 1     │ 1     │ 0.0272118 │
⋮
Last Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 4     │ 2     │ 0.969184 │
│ 2   │ 4     │ 2     │ 0.701404 │

In [4]:
gx2 = groupby(x, [:id, :id2])

GroupedDataFrame with 4 groups based on keys: :id, :id2
First Group: 2 rows
│ Row │ id    │ id2   │ v         │
│     │ Int64 │ Int64 │ Float64   │
├─────┼───────┼───────┼───────────┤
│ 1   │ 1     │ 1     │ 0.928275  │
│ 2   │ 1     │ 1     │ 0.0272118 │
⋮
Last Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 4     │ 2     │ 0.969184 │
│ 2   │ 4     │ 2     │ 0.701404 │

In [5]:
vcat(gx2...) # back to a single DataFrame (rows are now ordered by group, not as in the original)

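| Row | id | id2 | v |
|-----|----|-----|---|
| 1 | 1 | 1 | 0.928275 |
| 2 | 1 | 1 | 0.0272118 |
| 3 | 2 | 2 | 0.982743 |
| 4 | 2 | 2 | 0.342781 |
| 5 | 3 | 1 | 0.236876 |
| 6 | 3 | 1 | 0.0338346 |
| 7 | 4 | 2 | 0.969184 |
| 8 | 4 | 2 | 0.701404 |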


In [6]:
x = DataFrame(id = [missing, 5, 1, 3, missing], x = 1:5)

| Row | id | x |
|-----|----|---|
| 1 | missing | 1 |
| 2 | 5 | 2 |
| 3 | 1 | 3 |
| 4 | 3 | 4 |
| 5 | missing | 5 |


In [7]:
groupby(x, :id) # by default, groups include missing values and are not sorted

GroupedDataFrame with 4 groups based on key: :id
First Group: 2 rows
│ Row │ id      │ x     │
│     │ Int64⍰  │ Int64 │
├─────┼─────────┼───────┤
│ 1   │ missing │ 1     │
│ 2   │ missing │ 5     │
⋮
Last Group: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 3      │ 4     │

In [8]:
groupby(x, :id, sort=true, skipmissing=true) # but we can change it :)

GroupedDataFrame with 3 groups based on key: :id
First Group: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 1      │ 3     │
⋮
Last Group: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 5      │ 2     │

In [9]:
using Statistics
x = DataFrame(id=rand('a':'d', 100), v=rand(100));
by(x, :id, y->mean(y.v)) # apply a function to each group of a data frame

| Row | id | x1 |
|-----|----|----|
| 1 | 'b' | 0.504804 |
| 2 | 'c' | 0.495727 |
| 3 | 'd' | 0.519711 |
| 4 | 'a' | 0.556148 |


In [10]:
by(x, :id, y->mean(y.v), sort=true) # we can sort the output

| Row | id | x1 |
|-----|----|----|
| 1 | 'a' | 0.556148 |
| 2 | 'b' | 0.504804 |
| 3 | 'c' | 0.495727 |
| 4 | 'd' | 0.519711 |


In [11]:
by(x, :id, y->DataFrame(res=mean(y.v))) # returning a DataFrame lets us name the result column; @by from DataFramesMeta is more convenient

| Row | id | res |
|-----|----|-----|
| 1 | 'b' | 0.504804 |
| 2 | 'c' | 0.495727 |
| 3 | 'd' | 0.519711 |
| 4 | 'a' | 0.556148 |
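
To make the three steps behind `by` explicit, the same kind of result can be assembled by hand. This is only a minimal sketch on a small hypothetical data frame `df` (not the random data used above), and it assumes that iterating a `GroupedDataFrame` yields one sub-data-frame per group:

```julia
using DataFrames, Statistics

# Hypothetical toy data, chosen so the group means are easy to check by hand
df = DataFrame(id=['a', 'b', 'a', 'b'], v=[1.0, 2.0, 3.0, 4.0])

# split: one sub-data-frame per distinct value of :id
groups = groupby(df, :id)

# apply + combine: compute one value per group and collect the results
res = DataFrame(id=[first(g.id) for g in groups],
                res=[mean(g.v) for g in groups])

# the mean of v is 2.0 for 'a' and 3.0 for 'b'; the row order follows the group order
```

`by` wraps these steps in a single call, which is why it is usually the more convenient choice.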


In [12]:
x = DataFrame(id=rand('a':'d', 100), x1=rand(100), x2=rand(100))
aggregate(x, :id, sum) # apply a function over all columns of a data frame in groups given by id

| Row | id | x1_sum | x2_sum |
|-----|----|--------|--------|
| 1 | 'a' | 11.7049 | 14.2657 |
| 2 | 'b' | 12.1979 | 12.9316 |
| 3 | 'd' | 9.63083 | 13.0482 |
| 4 | 'c' | 12.1058 | 17.3369 |


In [13]:
aggregate(x, :id, sum, sort=true) # also can be sorted

| Row | id | x1_sum | x2_sum |
|-----|----|--------|--------|
| 1 | 'a' | 11.7049 | 14.2657 |
| 2 | 'b' | 12.1979 | 12.9316 |
| 3 | 'c' | 12.1058 | 17.3369 |
| 4 | 'd' | 9.63083 | 13.0482 |
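
The `by` and `aggregate` calls above reflect the DataFrames.jl release this notebook was written for. Assuming a newer release (1.0 or later), where these two functions are no longer available, a rough counterpart can be sketched with `groupby` followed by `combine`. The data frame `df` below is a hypothetical toy example, not the random data used above:

```julia
using DataFrames, Statistics

# Hypothetical toy data with easily checked group sums and means
df = DataFrame(id=['a', 'b', 'a', 'b'],
               x1=[1.0, 2.0, 3.0, 4.0],
               x2=[10.0, 20.0, 30.0, 40.0])

gd = groupby(df, :id, sort=true)  # sorted groups: 'a' first, then 'b'

# rough counterpart of by(df, :id, y -> DataFrame(res=mean(y.x1)))
means = combine(gd, :x1 => mean => :res)

# rough counterpart of aggregate(df, :id, sum); columns are named x1_sum and x2_sum
sums = combine(gd, [:x1, :x2] .=> sum)
```

The `source => function => target` form names the output column directly, so no wrapping `DataFrame(...)` call is needed.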


*We omit the discussion of map/combine, as I do not find them very useful (it is better to use `by`).*

In [14]:
x = DataFrame(rand(3, 5))

| Row | x1 | x2 | x3 | x4 | x5 |
|-----|----|----|----|----|----|
| 1 | 0.620795 | 0.0494765 | 0.328397 | 0.430159 | 0.526294 |
| 2 | 0.410581 | 0.64937 | 0.41514 | 0.735231 | 0.887564 |
| 3 | 0.148496 | 0.840524 | 0.394913 | 0.541614 | 0.046687 |


In [15]:
map(mean, eachcol(x)) # map a function over each column and return a data frame

| Row | x1 | x2 | x3 | x4 | x5 |
|-----|----|----|----|----|----|
| 1 | 0.393291 | 0.513123 | 0.379483 | 0.569001 | 0.486848 |


In [16]:
foreach(c -> println(c[1], ": ", mean(c[2])), eachcol(x)) # a raw iteration returns a tuple with column name and values

x1: 0.3932907243889327
x2: 0.5131234204230393
x3: 0.37948323368781134
x4: 0.5690013138736149
x5: 0.4868481864188075


In [17]:
colwise(mean, x) # colwise is similar, but produces a vector

5-element Array{Float64,1}:
 0.3932907243889327 
 0.5131234204230393 
 0.37948323368781134
 0.5690013138736149 
 0.4868481864188075 
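
The column-wise calls above also depend on the DataFrames.jl release in use. As a hedged sketch for newer releases (1.0 or later is assumed here), where `colwise` is gone and `eachcol` iterates over plain column vectors, the three operations might look like this on a hypothetical toy data frame `df`:

```julia
using DataFrames, Statistics

# Hypothetical toy data: column means are 2.0 and 5.0
df = DataFrame(x1=[1.0, 2.0, 3.0], x2=[4.0, 5.0, 6.0])

# a plain vector of column means, similar to colwise(mean, df)
col_means = mean.(eachcol(df))

# a one-row data frame of column means
one_row = mapcols(mean, df)

# iterate over (name, column) pairs
for (name, col) in pairs(eachcol(df))
    println(name, ": ", mean(col))
end
```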

In [18]:
x[:id] = [1,1,2]
colwise(mean, groupby(x, :id)) # also works on a GroupedDataFrame; note that the id column is averaged too

2-element Array{Array{Float64,1},1}:
 [0.515688, 0.349423, 0.371768, 0.582695, 0.706929, 1.0]
 [0.148496, 0.840524, 0.394913, 0.541614, 0.046687, 2.0]

In [19]:
map(r -> r.x1/r.x2, eachrow(x)) # here each element is a DataFrameRow, which works similarly to a one-row DataFrame

3-element Array{Float64,1}:
 12.54727988062303  
  0.6322765626507335
  0.176670179811385 