# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), December 8, 2019**

In [1]:
using DataFrames

## Split-apply-combine

### Grouping a data frame

In [2]:
x = DataFrame(id=[1,2,3,4,1,2,3,4], id2=[1,2,1,2,1,2,1,2], v=rand(8))

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.969649
2,2,2,0.707424
3,3,1,0.556397
4,4,2,0.398913
5,1,1,0.536786
6,2,2,0.716584
7,3,1,0.726574
8,4,2,0.766508


In [3]:
groupby(x, :id)

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.969649
2,1,1,0.536786

Row,id,id2,v
,Int64,Int64,Float64
1,4,2,0.398913
2,4,2,0.766508


In [4]:
groupby(x, []) # with no grouping columns a single group is produced; this can be useful for aggregations over a whole data frame

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.969649
2,2,2,0.707424
3,3,1,0.556397
4,4,2,0.398913
5,1,1,0.536786
6,2,2,0.716584
7,3,1,0.726574
8,4,2,0.766508


In [5]:
gx2 = groupby(x, [:id, :id2])

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.969649
2,1,1,0.536786

Row,id,id2,v
,Int64,Int64,Float64
1,4,2,0.398913
2,4,2,0.766508


In [6]:
parent(gx2) # get the parent DataFrame 

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.969649
2,2,2,0.707424
3,3,1,0.556397
4,4,2,0.398913
5,1,1,0.536786
6,2,2,0.716584
7,3,1,0.726574
8,4,2,0.766508


In [7]:
vcat(gx2...) # back to a DataFrame, but with rows in a different order than in the original

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.969649
2,1,1,0.536786
3,2,2,0.707424
4,2,2,0.716584
5,3,1,0.556397
6,3,1,0.726574
7,4,2,0.398913
8,4,2,0.766508


In [8]:
DataFrame(gx2) # the same

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.969649
2,1,1,0.536786
3,2,2,0.707424
4,2,2,0.716584
5,3,1,0.556397
6,3,1,0.726574
7,4,2,0.398913
8,4,2,0.766508


In [9]:
groupvars(gx2) # vector of names of grouping variables

2-element Array{Symbol,1}:
 :id 
 :id2

In [10]:
groupindices(gx2) # group indices in parent(gx2)

8-element Array{Union{Missing, Int64},1}:
 1
 2
 3
 4
 1
 2
 3
 4

In [11]:
kgx2 = keys(gx2)

4-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a `GroupedDataFrame` like a vector or like a dictionary.
The dictionary-style form accepts a `GroupKey`, a `NamedTuple`, or a `Tuple`.

In [12]:
gx2

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.969649
2,1,1,0.536786

Row,id,id2,v
,Int64,Int64,Float64
1,4,2,0.398913
2,4,2,0.766508


In [13]:
k = keys(gx2)[1]

GroupKey: (id = 1, id2 = 1)

In [14]:
ntk = NamedTuple(k)

(id = 1, id2 = 1)

In [15]:
tk = Tuple(k)

(1, 1)

The operations below produce the same result:

In [16]:
gx2[1]

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.969649
2,1,1,0.536786


In [17]:
gx2[k]

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.969649
2,1,1,0.536786


In [18]:
gx2[ntk]

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.969649
2,1,1,0.536786


In [19]:
gx2[tk]

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.969649
2,1,1,0.536786


Handling missing values:

In [20]:
x = DataFrame(id = [missing, 5, 1, 3, missing], x = 1:5)

Row,id,x
,Int64⍰,Int64
1,missing,1
2,5,2
3,1,3
4,3,4
5,missing,5


In [21]:
groupby(x, :id) # by default, groups include missing values and are not sorted

Row,id,x
,Int64⍰,Int64
1,missing,1
2,missing,5

Row,id,x
,Int64⍰,Int64
1,3,4


In [22]:
groupby(x, :id, sort=true, skipmissing=true) # but we can change it

Row,id,x
,Int64⍰,Int64
1,1,3

Row,id,x
,Int64⍰,Int64
1,5,2


### Performing transformations by group using `by`

In [23]:
using Statistics
x = DataFrame(id=rand('a':'d', 100), v=rand(100));
by(x, :id, :v=>mean) # apply a function to each group of a data frame

Row,id,v_mean
,Char,Float64
1,'d',0.415229
2,'c',0.536199
3,'b',0.54997
4,'a',0.438933


In [24]:
by(x, :id, :v=>mean, sort=true) # we can sort the output

Row,id,v_mean
,Char,Float64
1,'a',0.438933
2,'b',0.54997
3,'c',0.536199
4,'d',0.415229


In [25]:
by(x, :id, res=:v=>mean) # this way we can set a name for a column

Row,id,res
,Char,Float64
1,'d',0.415229
2,'c',0.536199
3,'b',0.54997
4,'a',0.438933


In [26]:
by(x, :id, res1=:v=>mean, res2=:v=>sum) # you can have multiple operations

Row,id,res1,res2
,Char,Float64,Float64
1,'d',0.415229,7.88936
2,'c',0.536199,21.9842
3,'b',0.54997,12.0993
4,'a',0.438933,7.90079


### Aggregation of a data frame using `aggregate`

In [27]:
x = DataFrame(id=rand('a':'d', 100), x1=rand(100), x2=rand(100))
aggregate(x, :id, sum) # apply a function over all columns of a data frame in groups given by id

Row,id,x1_sum,x2_sum
,Char,Float64,Float64
1,'d',10.1913,9.79288
2,'a',18.0119,17.5088
3,'c',7.9908,7.4382
4,'b',15.6003,12.5855


In [28]:
aggregate(x, :id, sum, sort=true) # the output can also be sorted

Row,id,x1_sum,x2_sum
,Char,Float64,Float64
1,'a',18.0119,17.5088
2,'b',15.6003,12.5855
3,'c',7.9908,7.4382
4,'d',10.1913,9.79288
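
Like `by`, `aggregate` is deprecated in DataFrames.jl 0.21 and removed in 1.0. A rough equivalent, sketched here under the assumption of DataFrames.jl 1.0 or later and with illustrative data, broadcasts the function over all non-grouping columns:

```julia
using DataFrames

# Illustrative data with fixed values (an assumption, not the random data above)
df = DataFrame(id=['a', 'b', 'a', 'b'],
               x1=[1.0, 2.0, 3.0, 4.0],
               x2=[10.0, 20.0, 30.0, 40.0])

# Select every column except the grouping column
value_cols = names(df, Not(:id))

# Apply sum to each selected column within each group, with groups sorted by id
res = combine(groupby(df, :id, sort=true), value_cols .=> sum)
```

The result has the columns `id`, `x1_sum`, and `x2_sum`, matching the naming used by `aggregate` above.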


Additionally, you can use the `mapcols` convenience function:

In [29]:
x = DataFrame(rand(3, 5))

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.724584,0.564722,0.125285,0.139595,0.269535
2,0.0283743,0.426664,0.449505,0.390202,0.368115
3,0.726498,0.304478,0.842521,0.430402,0.0972168


In [30]:
mapcols(mean, x)

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.493152,0.431955,0.472437,0.320067,0.244955


### Mapping rows and columns using `eachcol` and `eachrow`

In [31]:
map(mean, eachcol(x, false)) # map a function over each column and return a vector

5-element Array{Float64,1}:
 0.4931520339797553 
 0.43195477494607304
 0.47243716191293056
 0.3200665438143661 
 0.2449554266873982 

In [32]:
foreach(c -> println(c[1], ": ", mean(c[2])), eachcol(x, true)) # each iteration returns a Pair of column name and column values

x1: 0.4931520339797553
x2: 0.43195477494607304
x3: 0.47243716191293056
x4: 0.3200665438143661
x5: 0.2449554266873982
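
The two-argument form `eachcol(x, true)` is deprecated in later DataFrames.jl releases. A minimal sketch of the same iteration, assuming DataFrames.jl 1.0 or later and using illustrative data, relies on `pairs(eachcol(df))` instead:

```julia
using DataFrames, Statistics

# Illustrative data with fixed values (an assumption, not the random data above)
df = DataFrame(a=[1.0, 2.0, 3.0], b=[4.0, 5.0, 6.0])

# Map a function over each column and collect the results in a vector
col_means = map(mean, eachcol(df))

# Iterate over column name => column vector pairs
for (name, col) in pairs(eachcol(df))
    println(name, ": ", mean(col))
end
```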


In [33]:
map(r -> r.x1/r.x2, eachrow(x)) # here each iterated element is a DataFrameRow, which works similarly to a one-row DataFrame

3-element Array{Float64,1}:
 1.2830801044067726 
 0.06650260896319257
 2.386046545972731  

In [34]:
er = eachrow(x) # it prints like a data frame, only the caption is different so that you know the type of the object

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.724584,0.564722,0.125285,0.139595,0.269535
2,0.0283743,0.426664,0.449505,0.390202,0.368115
3,0.726498,0.304478,0.842521,0.430402,0.0972168


In [35]:
er.x1 # you can access columns of a parent data frame directly

3-element Array{Float64,1}:
 0.7245836824840111  
 0.028374300180372458
 0.7264981192748825  

In [36]:
ec = eachcol(x) # it prints like a data frame, only the caption is different so that you know the type of the object

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.724584,0.564722,0.125285,0.139595,0.269535
2,0.0283743,0.426664,0.449505,0.390202,0.368115
3,0.726498,0.304478,0.842521,0.430402,0.0972168


In [37]:
ec.x1 # you can access columns of a parent data frame directly

3-element Array{Float64,1}:
 0.7245836824840111  
 0.028374300180372458
 0.7264981192748825  