# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), December 9, 2018**

In [1]:
using DataFrames # load package

## Split-apply-combine

In [2]:
x = DataFrame(id=[1,2,3,4,1,2,3,4], id2=[1,2,1,2,1,2,1,2], v=rand(8))

| Row | id | id2 | v |
| --- | --- | --- | --- |
|  | Int64 | Int64 | Float64 |
| 1 | 1 | 1 | 0.572026 |
| 2 | 2 | 2 | 0.989102 |
| 3 | 3 | 1 | 0.556679 |
| 4 | 4 | 2 | 0.674911 |
| 5 | 1 | 1 | 0.181193 |
| 6 | 2 | 2 | 0.903744 |
| 7 | 3 | 1 | 0.242435 |
| 8 | 4 | 2 | 0.523293 |


In [3]:
gx1 = groupby(x, :id)

GroupedDataFrame{DataFrame} with 4 groups based on key: :id
First Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.572026 │
│ 2   │ 1     │ 1     │ 0.181193 │
⋮
Last Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 4     │ 2     │ 0.674911 │
│ 2   │ 4     │ 2     │ 0.523293 │

In [4]:
gx2 = groupby(x, [:id, :id2])

GroupedDataFrame{DataFrame} with 4 groups based on keys: :id, :id2
First Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 1     │ 1     │ 0.572026 │
│ 2   │ 1     │ 1     │ 0.181193 │
⋮
Last Group: 2 rows
│ Row │ id    │ id2   │ v        │
│     │ Int64 │ Int64 │ Float64  │
├─────┼───────┼───────┼──────────┤
│ 1   │ 4     │ 2     │ 0.674911 │
│ 2   │ 4     │ 2     │ 0.523293 │

In [5]:
vcat(gx2...) # concatenate the groups back into a single DataFrame (rows are now ordered by group)

| Row | id | id2 | v |
| --- | --- | --- | --- |
|  | Int64 | Int64 | Float64 |
| 1 | 1 | 1 | 0.572026 |
| 2 | 1 | 1 | 0.181193 |
| 3 | 2 | 2 | 0.989102 |
| 4 | 2 | 2 | 0.903744 |
| 5 | 3 | 1 | 0.556679 |
| 6 | 3 | 1 | 0.242435 |
| 7 | 4 | 2 | 0.674911 |
| 8 | 4 | 2 | 0.523293 |


In [6]:
combine(gx2) # similar, but the grouping columns are repeated as id_1 and id2_1; the warnings will go away in DataFrames 0.16


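| Row | id | id2 | id_1 | id2_1 | v |
| --- | --- | --- | --- | --- | --- |
|  | Int64 | Int64 | Int64 | Int64 | Float64 |
| 1 | 1 | 1 | 1 | 1 | 0.572026 |
| 2 | 1 | 1 | 1 | 1 | 0.181193 |
| 3 | 2 | 2 | 2 | 2 | 0.989102 |
| 4 | 2 | 2 | 2 | 2 | 0.903744 |
| 5 | 3 | 1 | 3 | 1 | 0.556679 |
| 6 | 3 | 1 | 3 | 1 | 0.242435 |
| 7 | 4 | 2 | 4 | 2 | 0.674911 |
| 8 | 4 | 2 | 4 | 2 | 0.523293 |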


In [7]:
x = DataFrame(id = [missing, 5, 1, 3, missing], x = 1:5)

| Row | id | x |
| --- | --- | --- |
|  | Int64⍰ | Int64 |
| 1 | missing | 1 |
| 2 | 5 | 2 |
| 3 | 1 | 3 |
| 4 | 3 | 4 |
| 5 | missing | 5 |


In [8]:
groupby(x, :id) # by default groups include missing values and are not sorted

GroupedDataFrame{DataFrame} with 4 groups based on key: :id
First Group: 2 rows
│ Row │ id      │ x     │
│     │ Int64⍰  │ Int64 │
├─────┼─────────┼───────┤
│ 1   │ missing │ 1     │
│ 2   │ missing │ 5     │
⋮
Last Group: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 3      │ 4     │

In [9]:
groupby(x, :id, sort=true, skipmissing=true) # but we can change it :)

GroupedDataFrame{DataFrame} with 3 groups based on key: :id
First Group: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 1      │ 3     │
⋮
Last Group: 1 row
│ Row │ id     │ x     │
│     │ Int64⍰ │ Int64 │
├─────┼────────┼───────┤
│ 1   │ 5      │ 2     │

In [10]:
using Statistics
x = DataFrame(id=rand('a':'d', 100), v=rand(100));
by(x, :id, :v=>mean) # apply a function to each group of a data frame

| Row | id | v_mean |
| --- | --- | --- |
|  | Char | Float64 |
| 1 | 'd' | 0.480063 |
| 2 | 'b' | 0.530918 |
| 3 | 'a' | 0.575516 |
| 4 | 'c' | 0.448413 |


In [11]:
by(x, :id, :v=>mean, sort=true) # we can sort the output

| Row | id | v_mean |
| --- | --- | --- |
|  | Char | Float64 |
| 1 | 'a' | 0.575516 |
| 2 | 'b' | 0.530918 |
| 3 | 'c' | 0.448413 |
| 4 | 'd' | 0.480063 |


In [12]:
by(x, :id, res=:v=>mean) # this way we can set a name for a column

| Row | id | res |
| --- | --- | --- |
|  | Char | Float64 |
| 1 | 'd' | 0.480063 |
| 2 | 'b' | 0.530918 |
| 3 | 'a' | 0.575516 |
| 4 | 'c' | 0.448413 |


In [13]:
by(x, :id, res1=:v=>mean, res2=:v=>sum) # you can have multiple operations

| Row | id | res1 | res2 |
| --- | --- | --- | --- |
|  | Char | Float64 | Float64 |
| 1 | 'd' | 0.480063 | 14.4019 |
| 2 | 'b' | 0.530918 | 12.2111 |
| 3 | 'a' | 0.575516 | 12.0858 |
| 4 | 'c' | 0.448413 | 11.6587 |
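
The `by` calls above follow the split-apply-combine pattern: rows are split by `:id`, a function is applied to the `:v` values of each group, and the per-group results are combined into one table. As a rough illustration of the pattern only, and not of how DataFrames implements it, the sketch below does the same with plain Julia vectors. The names `ids`, `vs`, `groups`, and `result`, as well as the toy data, are assumptions made for this example.

```julia
using Statistics  # provides mean

# Hypothetical toy data standing in for the :id and :v columns.
ids = ['a', 'b', 'a', 'c', 'b', 'a']
vs  = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Split: collect the values that belong to each key,
# remembering the keys in order of first appearance (unsorted, as in `by`).
keys_in_order = unique(ids)
groups = Dict(k => vs[ids .== k] for k in keys_in_order)

# Apply and combine: one (id, res1, res2) named tuple per group.
result = [(id = k, res1 = mean(groups[k]), res2 = sum(groups[k]))
          for k in keys_in_order]

foreach(println, result)
```

Sorting `keys_in_order` before the last step would correspond to passing `sort=true`.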


In [14]:
x = DataFrame(id=rand('a':'d', 100), x1=rand(100), x2=rand(100))
aggregate(x, :id, sum) # apply a function over all columns of a data frame in groups given by id

| Row | id | x1_sum | x2_sum |
| --- | --- | --- | --- |
|  | Char | Float64 | Float64 |
| 1 | 'a' | 9.86814 | 11.4617 |
| 2 | 'd' | 9.3865 | 10.2027 |
| 3 | 'b' | 13.4094 | 15.7414 |
| 4 | 'c' | 13.257 | 14.7559 |


In [15]:
aggregate(x, :id, sum, sort=true) # also can be sorted

| Row | id | x1_sum | x2_sum |
| --- | --- | --- | --- |
|  | Char | Float64 | Float64 |
| 1 | 'a' | 9.86814 | 11.4617 |
| 2 | 'b' | 13.4094 | 15.7414 |
| 3 | 'c' | 13.257 | 14.7559 |
| 4 | 'd' | 9.3865 | 10.2027 |


*We omit the discussion of map/combine until DataFrames 0.16*

A new feature is the `mapcols` convenience function.

In [16]:
x = DataFrame(rand(3, 5))

| Row | x1 | x2 | x3 | x4 | x5 |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.0674746 | 0.922701 | 0.0473791 | 0.274157 | 0.999944 |
| 2 | 0.715346 | 0.616982 | 0.382527 | 0.153836 | 0.524913 |
| 3 | 0.104946 | 0.585806 | 0.0970622 | 0.0980471 | 0.614058 |


In [17]:
mapcols(mean, x)

| Row | x1 | x2 | x3 | x4 | x5 |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.295922 | 0.708496 | 0.175656 | 0.175347 | 0.712972 |
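
Conceptually, `mapcols` applies a function to every column and keeps the column names. A minimal sketch of that idea follows, using a plain `NamedTuple` of vectors as a stand-in for a data frame. The names `cols` and `means` and the values are hypothetical, and this is not the DataFrames implementation.

```julia
using Statistics  # provides mean

# Hypothetical column table: a NamedTuple of equal-length vectors.
cols = (x1 = [1.0, 2.0, 3.0], x2 = [4.0, 5.0, 6.0])

# `map` over a NamedTuple applies the function to each value and keeps the names.
# Each mean is wrapped in a one-element vector so the result is again
# a column table, this time with a single row.
means = map(c -> [mean(c)], cols)

println(means)
```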


In [18]:
map(mean, eachcol(x)) # map a function over each column and return a data frame; this warning will go away in DataFrames 0.16


| Row | x1 | x2 | x3 | x4 | x5 |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.295922 | 0.708496 | 0.175656 | 0.175347 | 0.712972 |


In [19]:
foreach(c -> println(c[1], ": ", mean(c[2])), eachcol(x, true)) # an iteration returns a Pair with column name and values

x1: 0.2959219631186962
x2: 0.7084964494002054
x3: 0.17565596776152984
x4: 0.1753466545048854
x5: 0.7129715487597655


In [20]:
map(mean, eachcol(x, false)) # with false the iteration yields only the column vectors, without names

5-element Array{Float64,1}:
 0.2959219631186962 
 0.7084964494002054 
 0.17565596776152984
 0.1753466545048854 
 0.7129715487597655 

In [21]:
colwise(mean, x) # colwise is similar

5-element Array{Float64,1}:
 0.2959219631186962 
 0.7084964494002054 
 0.17565596776152984
 0.1753466545048854 
 0.7129715487597655 

In [22]:
x[:id] = [1,1,2]
colwise(mean, groupby(x, :id)) # but it also works on GroupedDataFrame

2-element Array{Array{Float64,1},1}:
 [0.39141, 0.769841, 0.214953, 0.213996, 0.762428, 1.0]   
 [0.104946, 0.585806, 0.0970622, 0.0980471, 0.614058, 2.0]

In [23]:
map(r -> r.x1/r.x2, eachrow(x)) # each row is now a DataFrameRow, which works similarly to a one-row DataFrame

3-element Array{Float64,1}:
 0.07312730316155054
 1.1594269576671856 
 0.1791471500413872 