# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), January 3, 2019**

In [1]:
using DataFrames # load package

## Split-apply-combine

In [2]:
x = DataFrame(id=[1,2,3,4,1,2,3,4], id2=[1,2,1,2,1,2,1,2], v=rand(8))

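│ Row │ id    │ id2   │ v         │
│     │ Int64 │ Int64 │ Float64   │
├─────┼───────┼───────┼───────────┤
│ 1   │ 1     │ 1     │ 0.309601  │
│ 2   │ 2     │ 2     │ 0.53993   │
│ 3   │ 3     │ 1     │ 0.560226  │
│ 4   │ 4     │ 2     │ 0.949346  │
│ 5   │ 1     │ 1     │ 0.0227493 │
│ 6   │ 2     │ 2     │ 0.387667  │
│ 7   │ 3     │ 1     │ 0.629157  │
│ 8   │ 4     │ 2     │ 0.0943663 │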


In [3]:
gx1 = groupby(x, :id)

GroupedDataFrame{DataFrame} with 4 groups based on key: :id
First Group (2 rows): :id = 1
│ Row │ id    │ id2   │ v         │
│     │ Int64 │ Int64 │ Float64   │
├─────┼───────┼───────┼───────────┤
│ 1   │ 1     │ 1     │ 0.309601  │
│ 2   │ 1     │ 1     │ 0.0227493 │
⋮
Last Group (2 rows): :id = 4
│ Row │ id    │ id2   │ v         │
│     │ Int64 │ Int64 │ Float64   │
├─────┼───────┼───────┼───────────┤
│ 1   │ 4     │ 2     │ 0.949346  │
│ 2   │ 4     │ 2     │ 0.0943663 │
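
Conceptually, `groupby` performs the "split" step: it partitions the rows by key value, and a later step applies a function to each part and combines the results. The following is a minimal sketch of that idea in plain Julia. It makes no claim about how DataFrames.jl implements grouping internally, and the names `ids`, `v`, `groups`, and `group_means` are illustrative only.

```julia
using Statistics

ids = [1, 2, 3, 4, 1, 2, 3, 4]                 # same keys as the :id column above
v = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]   # illustrative values, not the random ones shown

# Split: map each key to the row numbers where it occurs
groups = Dict{Int,Vector{Int}}()
for (row, key) in enumerate(ids)
    push!(get!(groups, key, Int[]), row)
end

# Apply and combine: compute one mean per key
group_means = Dict(key => mean(v[rows]) for (key, rows) in groups)
```

Here `groups[1]` holds rows 1 and 5, which matches the two rows of the first group printed above.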

In [4]:
gx2 = groupby(x, [:id, :id2])

GroupedDataFrame{DataFrame} with 4 groups based on keys: :id, :id2
First Group (2 rows): :id = 1, :id2 = 1
│ Row │ id    │ id2   │ v         │
│     │ Int64 │ Int64 │ Float64   │
├─────┼───────┼───────┼───────────┤
│ 1   │ 1     │ 1     │ 0.309601  │
│ 2   │ 1     │ 1     │ 0.0227493 │
⋮
Last Group (2 rows): :id = 4, :id2 = 2
│ Row │ id    │ id2   │ v         │
│     │ Int64 │ Int64 │ Float64   │
├─────┼───────┼───────┼───────────┤
│ 1   │ 4     │ 2     │ 0.949346  │
│ 2   │ 4     │ 2     │ 0.0943663 │

In [5]:
vcat(gx2...) # back to a single DataFrame (rows are now ordered by group, not as in the original)

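│ Row │ id    │ id2   │ v         │
│     │ Int64 │ Int64 │ Float64   │
├─────┼───────┼───────┼───────────┤
│ 1   │ 1     │ 1     │ 0.309601  │
│ 2   │ 1     │ 1     │ 0.0227493 │
│ 3   │ 2     │ 2     │ 0.53993   │
│ 4   │ 2     │ 2     │ 0.387667  │
│ 5   │ 3     │ 1     │ 0.560226  │
│ 6   │ 3     │ 1     │ 0.629157  │
│ 7   │ 4     │ 2     │ 0.949346  │
│ 8   │ 4     │ 2     │ 0.0943663 │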


In [6]:
combine(gx2) # similar, but the grouping columns are prepended, so they appear twice (id_1 and id2_1 are the copies)

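│ Row │ id    │ id2   │ id_1  │ id2_1 │ v         │
│     │ Int64 │ Int64 │ Int64 │ Int64 │ Float64   │
├─────┼───────┼───────┼───────┼───────┼───────────┤
│ 1   │ 1     │ 1     │ 1     │ 1     │ 0.309601  │
│ 2   │ 1     │ 1     │ 1     │ 1     │ 0.0227493 │
│ 3   │ 2     │ 2     │ 2     │ 2     │ 0.53993   │
│ 4   │ 2     │ 2     │ 2     │ 2     │ 0.387667  │
│ 5   │ 3     │ 1     │ 3     │ 1     │ 0.560226  │
│ 6   │ 3     │ 1     │ 3     │ 1     │ 0.629157  │
│ 7   │ 4     │ 2     │ 4     │ 2     │ 0.949346  │
│ 8   │ 4     │ 2     │ 4     │ 2     │ 0.0943663 │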


In [7]:
x = DataFrame(id = [missing, 5, 1, 3, missing], x = 1:5)

│ Row │ id      │ x     │
│     │ Int64⍰  │ Int64 │
├─────┼─────────┼───────┤
│ 1   │ missing │ 1     │
│ 2   │ 5       │ 2     │
│ 3   │ 1       │ 3     │
│ 4   │ 3       │ 4     │
│ 5   │ missing │ 5     │


In [8]:
groupby(x, :id) # by default, groups include missing values and are not sorted

GroupedDataFrame{DataFrame} with 4 groups based on key: :id
First Group (2 rows): :id = missing
│ Row │ id      │ x     │
│     │ Int64⍰  │ Int64 │
├─────┼─────────┼───────┤
│ 1   │ missing │ 1     │
│ 2   │ missing │ 5     │
⋮
Last Group (1 row): :id = 3
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 3      │ 4     │

In [9]:
groupby(x, :id, sort=true, skipmissing=true) # but we can change it

GroupedDataFrame{DataFrame} with 3 groups based on key: :id
First Group (1 row): :id = 1
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 1      │ 3     │
⋮
Last Group (1 row): :id = 5
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 5      │ 2     │

In [10]:
using Statistics
x = DataFrame(id=rand('a':'d', 100), v=rand(100));
by(x, :id, :v=>mean) # apply a function to each group of a data frame

│ Row │ id   │ v_mean   │
│     │ Char │ Float64  │
├─────┼──────┼──────────┤
│ 1   │ 'a'  │ 0.445712 │
│ 2   │ 'd'  │ 0.527654 │
│ 3   │ 'b'  │ 0.517119 │
│ 4   │ 'c'  │ 0.498448 │


In [11]:
by(x, :id, :v=>mean, sort=true) # we can sort the output

│ Row │ id   │ v_mean   │
│     │ Char │ Float64  │
├─────┼──────┼──────────┤
│ 1   │ 'a'  │ 0.445712 │
│ 2   │ 'b'  │ 0.517119 │
│ 3   │ 'c'  │ 0.498448 │
│ 4   │ 'd'  │ 0.527654 │


In [12]:
by(x, :id, res=:v=>mean) # this way we can set a name for a column

│ Row │ id   │ res      │
│     │ Char │ Float64  │
├─────┼──────┼──────────┤
│ 1   │ 'a'  │ 0.445712 │
│ 2   │ 'd'  │ 0.527654 │
│ 3   │ 'b'  │ 0.517119 │
│ 4   │ 'c'  │ 0.498448 │


In [13]:
by(x, :id, res1=:v=>mean, res2=:v=>sum) # you can have multiple operations

│ Row │ id   │ res1     │ res2    │
│     │ Char │ Float64  │ Float64 │
├─────┼──────┼──────────┼─────────┤
│ 1   │ 'a'  │ 0.445712 │ 9.35996 │
│ 2   │ 'd'  │ 0.527654 │ 16.8849 │
│ 3   │ 'b'  │ 0.517119 │ 11.8937 │
│ 4   │ 'c'  │ 0.498448 │ 11.9627 │
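
The `by` calls above reflect the DataFrames.jl API at the time of writing. In DataFrames.jl 1.x, `by` is no longer available and the same operations are written as `combine` applied to a `groupby` result. The sketch below assumes a 1.x installation and uses a small fixed data frame (not the random one from the cells above) so that the results are reproducible. The variable names `gx`, `res`, and `res_sorted` are illustrative.

```julia
using DataFrames, Statistics

# Small fixed example (assumption: DataFrames.jl 1.x is installed)
x = DataFrame(id = ['a', 'b', 'a', 'b'], v = [1.0, 2.0, 3.0, 4.0])
gx = groupby(x, :id)

# `source => function => target` plays the role of the `res1=:v=>mean` keyword form
res = combine(gx, :v => mean => :res1, :v => sum => :res2)

# Without a target name the output column is auto-named v_mean;
# sort=true on groupby orders the groups by key
res_sorted = combine(groupby(x, :id, sort = true), :v => mean)
```

With this data the group `'a'` has mean 2.0 and sum 4.0, and the group `'b'` has mean 3.0 and sum 6.0.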


In [14]:
x = DataFrame(id=rand('a':'d', 100), x1=rand(100), x2=rand(100))
aggregate(x, :id, sum) # apply a function over all columns of a data frame in groups given by id

│ Row │ id   │ x1_sum  │ x2_sum  │
│     │ Char │ Float64 │ Float64 │
├─────┼──────┼─────────┼─────────┤
│ 1   │ 'c'  │ 15.5543 │ 10.8053 │
│ 2   │ 'b'  │ 12.6932 │ 11.9198 │
│ 3   │ 'd'  │ 7.47519 │ 9.23485 │
│ 4   │ 'a'  │ 15.4637 │ 16.1987 │


In [15]:
aggregate(x, :id, sum, sort=true) # also can be sorted

│ Row │ id   │ x1_sum  │ x2_sum  │
│     │ Char │ Float64 │ Float64 │
├─────┼──────┼─────────┼─────────┤
│ 1   │ 'a'  │ 15.4637 │ 16.1987 │
│ 2   │ 'b'  │ 12.6932 │ 11.9198 │
│ 3   │ 'c'  │ 15.5543 │ 10.8053 │
│ 4   │ 'd'  │ 7.47519 │ 9.23485 │
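
`aggregate` is likewise not available in DataFrames.jl 1.x. One way to get the same result there, sketched under the assumption of a 1.x installation, is to broadcast `=>` over the value columns so that each column is paired with the function. The data below is fixed and illustrative, and `agg` is a hypothetical name.

```julia
using DataFrames

# Small fixed example (assumption: DataFrames.jl 1.x is installed)
x = DataFrame(id = ['a', 'b', 'a', 'b'],
              x1 = [1.0, 2.0, 3.0, 4.0],
              x2 = [10.0, 20.0, 30.0, 40.0])

# `.=>` builds one `column => sum` pair per listed column;
# the outputs are auto-named x1_sum and x2_sum, as in the tables above
agg = combine(groupby(x, :id, sort = true), [:x1, :x2] .=> sum)
```

Listing the value columns explicitly keeps the grouping column out of the aggregation.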


A new feature is the `mapcols` convenience function, which applies a function to each column and returns a `DataFrame`.

In [16]:
x = DataFrame(rand(3, 5))

│ Row │ x1        │ x2         │ x3       │ x4       │ x5       │
│     │ Float64   │ Float64    │ Float64  │ Float64  │ Float64  │
├─────┼───────────┼────────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.569699  │ 0.297867   │ 0.526604 │ 0.970632 │ 0.474076 │
│ 2   │ 0.580745  │ 0.00741397 │ 0.196166 │ 0.906289 │ 0.757872 │
│ 3   │ 0.0545064 │ 0.227071   │ 0.472087 │ 0.411805 │ 0.618873 │


In [17]:
mapcols(mean, x)

│ Row │ x1      │ x2       │ x3       │ x4       │ x5      │
│     │ Float64 │ Float64  │ Float64  │ Float64  │ Float64 │
├─────┼─────────┼──────────┼──────────┼──────────┼─────────┤
│ 1   │ 0.40165 │ 0.177451 │ 0.398286 │ 0.762909 │ 0.61694 │


In [18]:
map(mean, eachcol(x, false)) # map a function over each column and return a vector

5-element Array{Float64,1}:
 0.40164989440640103
 0.17745077318753447
 0.39828573939784045
 0.7629086652300364 
 0.6169403523612847 

In [19]:
foreach(c -> println(c[1], ": ", mean(c[2])), eachcol(x, true)) # an iteration returns a Pair with column name and values

x1: 0.40164989440640103
x2: 0.17745077318753447
x3: 0.39828573939784045
x4: 0.7629086652300364
x5: 0.6169403523612847
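
For readers on DataFrames.jl 1.x, the column-iteration calls above look slightly different: a matrix needs an explicit `:auto` argument to generate column names, `eachcol` takes no Boolean argument, and `pairs(eachcol(x))` yields the name and column pairs. This is a sketch under that version assumption, with a fixed matrix in place of `rand(3, 5)`. The names `col_means_df` and `col_means` are illustrative.

```julia
using DataFrames, Statistics

# Assumption: DataFrames.jl 1.x; :auto generates the column names x1, x2
x = DataFrame([1.0 2.0; 3.0 4.0], :auto)

col_means_df = mapcols(mean, x)      # one-row DataFrame of column means
col_means = map(mean, eachcol(x))    # plain Vector of column means

# pairs(eachcol(x)) iterates over column name => column vector
for (name, col) in pairs(eachcol(x))
    println(name, ": ", mean(col))
end
```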


In [20]:
colwise([mean,minimum,maximum], x) # colwise is similar but accepts a vector of functions

3×5 Array{Float64,2}:
 0.40165    0.177451    0.398286  0.762909  0.61694 
 0.0545064  0.00741397  0.196166  0.411805  0.474076
 0.580745   0.297867    0.526604  0.970632  0.757872

In [21]:
x[:id] = [1,1,2]
colwise(mean,groupby(x, :id)) # and it also works on GroupedDataFrame

2-element Array{Array{Float64,1},1}:
 [0.575222, 0.152641, 0.361385, 0.93846, 0.615974, 1.0]  
 [0.0545064, 0.227071, 0.472087, 0.411805, 0.618873, 2.0]
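
`colwise` and the `x[:id] = ...` indexing form are also specific to the DataFrames.jl version used here. A rough equivalent for DataFrames.jl 1.x, assuming that version and using fixed illustrative data, can be written with comprehensions over `eachcol`. The names `stats` and `per_group` are hypothetical.

```julia
using DataFrames, Statistics

# Assumption: DataFrames.jl 1.x
x = DataFrame([1.0 2.0; 3.0 4.0; 5.0 6.0], :auto)

# One row per function, one column per data frame column
stats = [f(c) for f in (mean, minimum, maximum), c in eachcol(x)]

x.id = [1, 1, 2]   # property syntax in place of x[:id] = ...

# Iterating a GroupedDataFrame yields one sub-data-frame per group
per_group = [map(mean, eachcol(sdf)) for sdf in groupby(x, :id)]
```

As in the output above, each per-group vector ends with the mean of the `id` column itself, because `id` is an ordinary column of each group.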

In [22]:
map(r -> r.x1/r.x2, eachrow(x)) # each element passed to the function is a DataFrameRow, which works similarly to a one-row DataFrame

3-element Array{Float64,1}:
  1.912591271273047  
 78.33115962528011   
  0.24004117109896686