# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), January 18, 2019**

In [1]:
using DataFrames # load package

## Split-apply-combine

In [2]:
x = DataFrame(id=[1,2,3,4,1,2,3,4], id2=[1,2,1,2,1,2,1,2], v=rand(8))

| Row | id | id2 | v |
|---|---|---|---|
|  | Int64 | Int64 | Float64 |
| 1 | 1 | 1 | 0.431948 |
| 2 | 2 | 2 | 0.877294 |
| 3 | 3 | 1 | 0.62008 |
| 4 | 4 | 2 | 0.400774 |
| 5 | 1 | 1 | 0.188222 |
| 6 | 2 | 2 | 0.00783845 |
| 7 | 3 | 1 | 0.533851 |
| 8 | 4 | 2 | 0.945553 |


In [3]:
gx1 = groupby(x, :id)

| Row | id | id2 | v |
|---|---|---|---|
|  | Int64 | Int64 | Float64 |
| 1 | 1 | 1 | 0.431948 |
| 2 | 1 | 1 | 0.188222 |

| Row | id | id2 | v |
|---|---|---|---|
|  | Int64 | Int64 | Float64 |
| 1 | 4 | 2 | 0.400774 |
| 2 | 4 | 2 | 0.945553 |


In [4]:
gx2 = groupby(x, [:id, :id2])

| Row | id | id2 | v |
|---|---|---|---|
|  | Int64 | Int64 | Float64 |
| 1 | 1 | 1 | 0.431948 |
| 2 | 1 | 1 | 0.188222 |

| Row | id | id2 | v |
|---|---|---|---|
|  | Int64 | Int64 | Float64 |
| 1 | 4 | 2 | 0.400774 |
| 2 | 4 | 2 | 0.945553 |
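
The grouping calls above return a `GroupedDataFrame`, which can be iterated group by group. The following is a minimal sketch with made-up, deterministic data (the values of `v` are an assumption, chosen so the results are easy to check by hand):

```julia
using DataFrames

# Illustrative data, not the notebook's random values
df = DataFrame(id=[1, 2, 3, 4, 1, 2, 3, 4],
               id2=[1, 2, 1, 2, 1, 2, 1, 2],
               v=collect(1.0:8.0))

gdf = groupby(df, :id)           # one group per distinct value of id
println(length(gdf))             # number of groups

for sub in gdf                   # each group is a view into df
    println("id = ", first(sub.id), ": ", nrow(sub), " rows, sum(v) = ", sum(sub.v))
end

gdf2 = groupby(df, [:id, :id2])  # group by the combination of both columns
```

Because `id2` is fully determined by `id` in this data, grouping by both columns produces the same number of groups as grouping by `id` alone.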


In [5]:
parent(gx2) # get the parent DataFrame 

| Row | id | id2 | v |
|---|---|---|---|
|  | Int64 | Int64 | Float64 |
| 1 | 1 | 1 | 0.431948 |
| 2 | 2 | 2 | 0.877294 |
| 3 | 3 | 1 | 0.62008 |
| 4 | 4 | 2 | 0.400774 |
| 5 | 1 | 1 | 0.188222 |
| 6 | 2 | 2 | 0.00783845 |
| 7 | 3 | 1 | 0.533851 |
| 8 | 4 | 2 | 0.945553 |


In [6]:
vcat(gx2...) # back to the DataFrame, but in a different order of rows than the original

| Row | id | id2 | v |
|---|---|---|---|
|  | Int64 | Int64 | Float64 |
| 1 | 1 | 1 | 0.431948 |
| 2 | 1 | 1 | 0.188222 |
| 3 | 2 | 2 | 0.877294 |
| 4 | 2 | 2 | 0.00783845 |
| 5 | 3 | 1 | 0.62008 |
| 6 | 3 | 1 | 0.533851 |
| 7 | 4 | 2 | 0.400774 |
| 8 | 4 | 2 | 0.945553 |


In [7]:
combine(gx2) # the same rows, but the grouping columns are repeated as id_1 and id2_1

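| Row | id | id2 | id_1 | id2_1 | v |
|---|---|---|---|---|---|
|  | Int64 | Int64 | Int64 | Int64 | Float64 |
| 1 | 1 | 1 | 1 | 1 | 0.431948 |
| 2 | 1 | 1 | 1 | 1 | 0.188222 |
| 3 | 2 | 2 | 2 | 2 | 0.877294 |
| 4 | 2 | 2 | 2 | 2 | 0.00783845 |
| 5 | 3 | 1 | 3 | 1 | 0.62008 |
| 6 | 3 | 1 | 3 | 1 | 0.533851 |
| 7 | 4 | 2 | 4 | 2 | 0.400774 |
| 8 | 4 | 2 | 4 | 2 | 0.945553 |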


In [8]:
x = DataFrame(id = [missing, 5, 1, 3, missing], x = 1:5)

| Row | id | x |
|---|---|---|
|  | Int64⍰ | Int64 |
| 1 | missing | 1 |
| 2 | 5 | 2 |
| 3 | 1 | 3 |
| 4 | 3 | 4 |
| 5 | missing | 5 |


In [9]:
groupby(x, :id) # by default, groups include missing values and are not sorted

| Row | id | x |
|---|---|---|
|  | Int64⍰ | Int64 |
| 1 | missing | 1 |
| 2 | missing | 5 |

| Row | id | x |
|---|---|---|
|  | Int64⍰ | Int64 |
| 1 | 3 | 4 |


In [10]:
groupby(x, :id, sort=true, skipmissing=true) # both defaults can be changed with keyword arguments

| Row | id | x |
|---|---|---|
|  | Int64⍰ | Int64 |
| 1 | 1 | 3 |

| Row | id | x |
|---|---|---|
|  | Int64⍰ | Int64 |
| 1 | 5 | 2 |
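
To make the effect of the two keyword arguments concrete, here is a small sketch that counts the groups with and without them. It reuses the same five-row data as the cell above; the variable names `g_all` and `g_clean` are only illustrative.

```julia
using DataFrames

df = DataFrame(id=[missing, 5, 1, 3, missing], x=1:5)

g_all = groupby(df, :id)                                 # missing forms its own group
g_clean = groupby(df, :id, sort=true, skipmissing=true)  # drop missing, order groups by key

println(length(g_all))    # groups: missing, 5, 1, 3
println(length(g_clean))  # groups: 1, 3, 5

# Keys of the cleaned grouping, in group order
keys_in_order = [first(sub.id) for sub in g_clean]
println(keys_in_order)
```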


In [11]:
using Statistics
x = DataFrame(id=rand('a':'d', 100), v=rand(100));
by(x, :id, :v=>mean) # apply a function to each group of a data frame

| Row | id | v_Statistics.mean |
|---|---|---|
|  | Char | Float64 |
| 1 | 'b' | 0.524993 |
| 2 | 'a' | 0.51178 |
| 3 | 'c' | 0.495782 |
| 4 | 'd' | 0.39401 |


In [12]:
by(x, :id, :v=>mean, sort=true) # we can sort the output

| Row | id | v_Statistics.mean |
|---|---|---|
|  | Char | Float64 |
| 1 | 'a' | 0.51178 |
| 2 | 'b' | 0.524993 |
| 3 | 'c' | 0.495782 |
| 4 | 'd' | 0.39401 |


In [13]:
by(x, :id, res=:v=>mean) # a keyword argument sets the name of the result column

| Row | id | res |
|---|---|---|
|  | Char | Float64 |
| 1 | 'b' | 0.524993 |
| 2 | 'a' | 0.51178 |
| 3 | 'c' | 0.495782 |
| 4 | 'd' | 0.39401 |


In [14]:
by(x, :id, res1=:v=>mean, res2=:v=>sum) # you can have multiple operations

| Row | id | res1 | res2 |
|---|---|---|---|
|  | Char | Float64 | Float64 |
| 1 | 'b' | 0.524993 | 13.1248 |
| 2 | 'a' | 0.51178 | 12.7945 |
| 3 | 'c' | 0.495782 | 13.8819 |
| 4 | 'd' | 0.39401 | 8.66822 |
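
The `by` calls above reflect the DataFrames.jl release this notebook was written against. In the 1.x series `by` is no longer available, and the usual replacement is `combine` applied to a `groupby` result. The sketch below assumes DataFrames.jl 1.x (an assumption, not something the notebook states) and uses made-up data so the numbers can be checked by hand:

```julia
using DataFrames, Statistics

# Illustrative data, not the notebook's random values
df = DataFrame(id=['a', 'b', 'a', 'b'], v=[1.0, 2.0, 3.0, 4.0])

# source column => function => name of the result column
res = combine(groupby(df, :id, sort=true),
              :v => mean => :res1,
              :v => sum => :res2)
println(res)
```

The `source => function => name` form plays the role of the keyword arguments `res1=:v=>mean, res2=:v=>sum` used with `by`.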


In [15]:
x = DataFrame(id=rand('a':'d', 100), x1=rand(100), x2=rand(100))
aggregate(x, :id, sum) # apply a function over all columns of a data frame in groups given by id

| Row | id | x1_sum | x2_sum |
|---|---|---|---|
|  | Char | Float64 | Float64 |
| 1 | 'a' | 10.7951 | 11.0517 |
| 2 | 'b' | 12.3498 | 13.2309 |
| 3 | 'c' | 12.888 | 11.4458 |
| 4 | 'd' | 13.3698 | 16.8223 |


In [16]:
aggregate(x, :id, sum, sort=true) # also can be sorted

| Row | id | x1_sum | x2_sum |
|---|---|---|---|
|  | Char | Float64 | Float64 |
| 1 | 'a' | 10.7951 | 11.0517 |
| 2 | 'b' | 12.3498 | 13.2309 |
| 3 | 'c' | 12.888 | 11.4458 |
| 4 | 'd' | 13.3698 | 16.8223 |
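
Like `by`, `aggregate` is not available in DataFrames.jl 1.x. A hedged sketch of one way to get the same per-group result there, assuming DataFrames.jl 1.x and made-up data, is to broadcast a function over all non-grouping columns with `valuecols`:

```julia
using DataFrames

# Illustrative data, not the notebook's random values
df = DataFrame(id=['a', 'b', 'a', 'b'],
               x1=[1.0, 2.0, 3.0, 4.0],
               x2=[10.0, 20.0, 30.0, 40.0])

gdf = groupby(df, :id, sort=true)

# valuecols(gdf) lists the non-grouping columns; `.=>` pairs each one with sum
res = combine(gdf, valuecols(gdf) .=> sum)
println(res)
```

The automatically generated column names (`x1_sum`, `x2_sum`) match the ones `aggregate` produced above.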


A new feature is the `mapcols` convenience function, which applies a function to every column and returns a `DataFrame`:

In [17]:
x = DataFrame(rand(3, 5))

| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
|  | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.222279 | 0.836563 | 0.0240105 | 0.407143 | 0.17952 |
| 2 | 0.581014 | 0.189129 | 0.379578 | 0.135048 | 0.340468 |
| 3 | 0.661496 | 0.361389 | 0.0572603 | 0.0828427 | 0.851504 |


In [18]:
mapcols(mean, x)

| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
|  | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.488263 | 0.462361 | 0.153616 | 0.208344 | 0.457164 |


In [19]:
map(mean, eachcol(x, false)) # map a function over each column and return a vector

5-element Array{Float64,1}:
 0.4882632939149169 
 0.46236054918197445
 0.15361644419219825
 0.20834448031473243
 0.45716366440042094

In [20]:
foreach(c -> println(c[1], ": ", mean(c[2])), eachcol(x, true)) # each iteration yields a Pair of column name and column values

x1: 0.4882632939149169
x2: 0.46236054918197445
x3: 0.15361644419219825
x4: 0.20834448031473243
x5: 0.45716366440042094
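
The three cells above differ mainly in what they return. The sketch below contrasts them on made-up data. It assumes DataFrames.jl 1.x, where `eachcol` takes no second argument and `pairs(eachcol(df))` supplies the column names; both points are assumptions about the installed version, not something the notebook states.

```julia
using DataFrames, Statistics

# Illustrative data, not the notebook's random values
df = DataFrame(x1=[1.0, 2.0, 3.0], x2=[4.0, 5.0, 9.0])

as_df = mapcols(mean, df)        # one-row DataFrame with the same column names
as_vec = map(mean, eachcol(df))  # plain Vector, one entry per column

# Iterate over (name, column) pairs
for (name, col) in pairs(eachcol(df))
    println(name, ": ", mean(col))
end
```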


In [21]:
colwise([mean,minimum,maximum], x) # colwise is similar but accepts a vector of functions

3×5 Array{Float64,2}:
 0.488263  0.462361  0.153616   0.208344   0.457164
 0.222279  0.189129  0.0240105  0.0828427  0.17952 
 0.661496  0.836563  0.379578   0.407143   0.851504
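
`colwise` is not part of DataFrames.jl 1.x. As a sketch under that assumption, the same functions-by-columns matrix can be built with an ordinary two-dimensional comprehension; the data and the name `fs` are illustrative only.

```julia
using DataFrames, Statistics

# Illustrative data, not the notebook's random values
df = DataFrame(a=[1.0, 2.0, 3.0], b=[4.0, 5.0, 9.0])

fs = [mean, minimum, maximum]

# Rows correspond to functions, columns to data frame columns
m = [f(col) for f in fs, col in eachcol(df)]
println(size(m))
```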

In [22]:
x[:id] = [1,1,2]
colwise(mean,groupby(x, :id)) # and it also works on GroupedDataFrame

2-element Array{Array{Float64,1},1}:
 [0.401647, 0.512846, 0.201795, 0.271095, 0.259994, 1.0]  
 [0.661496, 0.361389, 0.0572603, 0.0828427, 0.851504, 2.0]

In [23]:
map(r -> r.x1/r.x2, eachrow(x)) # each element is a DataFrameRow, which behaves much like a one-row DataFrame

3-element Array{Float64,1}:
 0.2657053196699975
 3.07205555708361  
 1.8304242736715894
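
For a computation like this one, which only combines values within each row, a broadcast over whole columns gives the same numbers. The sketch below compares the two approaches on made-up data; the variable names are illustrative.

```julia
using DataFrames

# Illustrative data, not the notebook's random values
df = DataFrame(x1=[1.0, 6.0, 9.0], x2=[2.0, 3.0, 3.0])

by_row = map(r -> r.x1 / r.x2, eachrow(df))  # visit one DataFrameRow at a time
by_col = df.x1 ./ df.x2                      # element-wise division of two columns

println(by_row == by_col)
```

Row iteration is the more general tool, since the function receives the whole row and can use any of its fields.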