# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), May 15, 2021**

In [1]:
using DataFrames

## Split-apply-combine

### Grouping a data frame

In [2]:
x = DataFrame(id=[1,2,3,4,1,2,3,4], id2=[1,2,1,2,1,2,1,2], v=rand(8))

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.413963
2,2,2,0.222133
3,3,1,0.448879
4,4,2,0.451822
5,1,1,0.850196
6,2,2,0.187787
7,3,1,0.934941
8,4,2,0.512361


In [3]:
groupby(x, :id)

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.413963
2,1,1,0.850196

Row,id,id2,v
,Int64,Int64,Float64
1,4,2,0.451822
2,4,2,0.512361


In [4]:
groupby(x, [])

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.413963
2,2,2,0.222133
3,3,1,0.448879
4,4,2,0.451822
5,1,1,0.850196
6,2,2,0.187787
7,3,1,0.934941
8,4,2,0.512361


In [5]:
gx2 = groupby(x, [:id, :id2])

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.413963
2,1,1,0.850196

Row,id,id2,v
,Int64,Int64,Float64
1,4,2,0.451822
2,4,2,0.512361


In [6]:
parent(gx2) # get the parent DataFrame 

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.413963
2,2,2,0.222133
3,3,1,0.448879
4,4,2,0.451822
5,1,1,0.850196
6,2,2,0.187787
7,3,1,0.934941
8,4,2,0.512361


In [7]:
vcat(gx2...) # back to the DataFrame, but in a different order of rows than the original

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.413963
2,1,1,0.850196
3,2,2,0.222133
4,2,2,0.187787
5,3,1,0.448879
6,3,1,0.934941
7,4,2,0.451822
8,4,2,0.512361


In [8]:
DataFrame(gx2) # the same

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.413963
2,1,1,0.850196
3,2,2,0.222133
4,2,2,0.187787
5,3,1,0.448879
6,3,1,0.934941
7,4,2,0.451822
8,4,2,0.512361


In [9]:
DataFrame(gx2, keepkeys=false) # drop grouping columns when creating a data frame

Row,v
,Float64
1,0.413963
2,0.850196
3,0.222133
4,0.187787
5,0.448879
6,0.934941
7,0.451822
8,0.512361


In [10]:
groupcols(gx2) # vector of names of grouping variables

2-element Vector{Symbol}:
 :id
 :id2

In [11]:
valuecols(gx2) # and non-grouping variables

1-element Vector{Symbol}:
 :v

In [12]:
groupindices(gx2) # index of the group that each row of parent(gx2) belongs to

8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4

In [13]:
kgx2 = keys(gx2)

4-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a `GroupedDataFrame` like a vector or like a dictionary.
The second form accepts a `GroupKey`, a `NamedTuple`, or a `Tuple`.

In [14]:
gx2

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.413963
2,1,1,0.850196

Row,id,id2,v
,Int64,Int64,Float64
1,4,2,0.451822
2,4,2,0.512361


In [15]:
k = keys(gx2)[1]

GroupKey: (id = 1, id2 = 1)

In [16]:
ntk = NamedTuple(k)

(id = 1, id2 = 1)

In [17]:
tk = Tuple(k)

(1, 1)

The operations below all produce the same result and are fast:

In [18]:
gx2[1]

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.413963
2,1,1,0.850196


In [19]:
gx2[k]

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.413963
2,1,1,0.850196


In [20]:
gx2[ntk]

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.413963
2,1,1,0.850196


In [21]:
gx2[tk]

Row,id,id2,v
,Int64,Int64,Float64
1,1,1,0.413963
2,1,1,0.850196
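
As a small companion to the lookups above, here is a minimal sketch of checking whether a key exists before indexing. The data frame `df`, its values, and the variable names are made up for illustration, and `sort=true` is used only so that the group order is predictable. `haskey` and `get` are the dictionary-style functions that DataFrames.jl provides for a `GroupedDataFrame`.

```julia
using DataFrames

# Illustrative data (not from the notebook); sort=true fixes the group order.
df = DataFrame(id=[1, 2, 1, 2], id2=[1, 2, 1, 2], v=[10.0, 20.0, 30.0, 40.0])
gdf = groupby(df, [:id, :id2], sort=true)

by_position = gdf[1]                # vector-style indexing
by_tuple = gdf[(1, 1)]              # dictionary-style lookup with a Tuple
by_namedtuple = gdf[(id=1, id2=1)]  # dictionary-style lookup with a NamedTuple

# Dictionary-style helpers avoid a KeyError for keys that are absent.
has_absent_key = haskey(gdf, (3, 3))
fallback = get(gdf, (3, 3), nothing)
```

Indexing with an absent key throws an error, so `get` with a default is a convenient choice when some keys may not be present.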


Handling missing values:

In [22]:
x = DataFrame(id = [missing, 5, 1, 3, missing], x = 1:5)

Row,id,x
,Int64?,Int64
1,missing,1
2,5,2
3,1,3
4,3,4
5,missing,5


In [23]:
groupby(x, :id) # by default groups include missing values and their order is not guaranteed to be sorted

Row,id,x
,Int64?,Int64
1,1,3

Row,id,x
,Int64?,Int64
1,missing,1
2,missing,5


In [24]:
groupby(x, :id, sort=true, skipmissing=true) # but we can change it

Row,id,x
,Int64?,Int64
1,1,3

Row,id,x
,Int64?,Int64
1,5,2


### Performing transformations by group using `combine`, `select`, `select!`, `transform`, and `transform!`

In [25]:
using Statistics
using Pipe

In [26]:
ENV["LINES"] = 15 # reduce the number of rows in the output

15

In [27]:
x = DataFrame(id=rand('a':'d', 100), v=rand(100))

Row,id,v
,Char,Float64
1,d,0.830276
2,a,0.0751451
3,c,0.495729
4,b,0.311539
5,d,0.800109
6,b,0.295747
7,b,0.277973
8,b,0.0517359
9,d,0.877831
10,c,0.291347


In [28]:
# apply a function to each group of a data frame
# combine keeps as many rows as are returned from the function
@pipe x |> groupby(_, :id) |> combine(_, :v=>mean)

Row,id,v_mean
,Char,Float64
1,d,0.480157
2,a,0.457628
3,c,0.475433
4,b,0.469266


In [29]:
x.id2 = axes(x, 1)

Base.OneTo(100)

In [30]:
# select and transform keep as many rows as are in the source data frame, in their original order
# additionally, transform keeps all columns from the source
@pipe x |> groupby(_, :id) |> transform(_, :v=>mean)

Row,id,v,id2,v_mean
,Char,Float64,Int64,Float64
1,d,0.830276,1,0.480157
2,a,0.0751451,2,0.457628
3,c,0.495729,3,0.475433
4,b,0.311539,4,0.469266
5,d,0.800109,5,0.480157
6,b,0.295747,6,0.469266
7,b,0.277973,7,0.469266
8,b,0.0517359,8,0.469266
9,d,0.877831,9,0.480157
10,c,0.291347,10,0.475433
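
To make the difference between `transform` and `select` concrete, here is a minimal sketch on a tiny, made-up data frame (the names `df`, `centered`, and `selected` are illustrative only). It centers `v` within each group, which relies on the function returning a vector of the same length as the group.

```julia
using DataFrames
using Statistics

# Illustrative data: two interleaved groups.
df = DataFrame(id=['a', 'b', 'a', 'b'], v=[1.0, 10.0, 3.0, 30.0])
gdf = groupby(df, :id)

# transform keeps all source columns and the original row order,
# and adds the group-centered values as a new column.
centered = transform(gdf, :v => (v -> v .- mean(v)) => :v_centered)

# select keeps the original row order too, but only the grouping
# column and the requested result column.
selected = select(gdf, :v => mean => :v_mean)
```

A scalar result such as the mean is repeated for every row of its group, which is why `selected` has as many rows as `df`.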


In [31]:
# note that combine orders the rows by group of the GroupedDataFrame
@pipe x |> groupby(_, :id) |> combine(_, :id2, :v=>mean)

Row,id,id2,v_mean
,Char,Int64,Float64
1,d,1,0.480157
2,d,5,0.480157
3,d,9,0.480157
4,d,13,0.480157
5,d,16,0.480157
6,d,17,0.480157
7,d,20,0.480157
8,d,24,0.480157
9,d,25,0.480157
10,d,26,0.480157


In [32]:
# we give a custom name for the result column
@pipe x |> groupby(_, :id) |> combine(_, :v=>mean=>:res)

Row,id,res
,Char,Float64
1,d,0.480157
2,a,0.457628
3,c,0.475433
4,b,0.469266


In [33]:
# you can have multiple operations
@pipe x |> groupby(_, :id) |> combine(_, :v=>mean=>:res1, :v=>sum=>:res2, nrow=>:n)

Row,id,res1,res2,n
,Char,Float64,Float64,Int64
1,d,0.480157,15.365,32
2,a,0.457628,10.9831,24
3,c,0.475433,9.03322,19
4,b,0.469266,11.7317,25


Additional notes:
* `select!` and `transform!` perform operations in-place
* the general syntax for a transformation is `source_columns => function => target_column`
* if you pass multiple columns to a function, they are passed as positional arguments
* `ByRow` and `AsTable` work exactly as discussed for operations on data frames in 05_columns.ipynb
* you can have the result of `combine`, `select`, etc. grouped again automatically by passing the `ungroup=false` keyword argument
* similarly, passing `keepkeys=false` drops the grouping columns from the resulting data frame
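
The notes above can be sketched on a small, made-up data frame. All names here (`df`, `dots`, `ratios`, `regrouped`, `nokeys`) are illustrative assumptions, and `sort=true` is passed only to make the group order predictable.

```julia
using DataFrames
using Statistics

df = DataFrame(g=[1, 1, 2, 2], a=[1.0, 2.0, 3.0, 4.0], b=[10.0, 20.0, 30.0, 40.0])
gdf = groupby(df, :g, sort=true)

# Two source columns arrive in the function as two positional arguments.
dots = combine(gdf, [:a, :b] => ((a, b) -> sum(a .* b)) => :dot)

# ByRow applies the function to each row instead of to whole group columns.
ratios = transform(gdf, [:a, :b] => ByRow((a, b) -> b / a) => :ratio)

# ungroup=false returns a GroupedDataFrame instead of a DataFrame.
regrouped = combine(gdf, :a => mean => :a_mean, ungroup=false)

# keepkeys=false drops the grouping column from the result.
nokeys = combine(gdf, :a => mean => :a_mean, keepkeys=false)
```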

You can also pass a function to all of these functions (as a special case, it may be given as the first argument). In this case the return value can be a table. In particular, this makes it easy to drop groups: return an empty table from the function.

If you pass a function, you can use the `do` block syntax. The function receives a `SubDataFrame` as its argument.

Here is an example:

In [34]:
combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) # drop groups with low number of rows
end

Row,id,n
,Char,Int64
1,d,32
2,b,25
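
The same `do` block pattern can be sketched on a small deterministic data frame. The data and the names `stats` and `big` are made up for illustration. The first call returns a `NamedTuple`, which yields one row with several columns per group. The second call mirrors the group-dropping idea from the example above.

```julia
using DataFrames

# Illustrative data: group 'a' has 3 rows, group 'b' has 2 rows.
df = DataFrame(id=['a', 'a', 'a', 'b', 'b'], v=[1.0, 2.0, 3.0, 4.0, 5.0])
gdf = groupby(df, :id, sort=true)

# Each group is passed to the block as a SubDataFrame; a NamedTuple of
# scalars produces one output row with one column per field.
stats = combine(gdf) do sdf
    (n = nrow(sdf), total = sum(sdf.v))
end

# Returning an empty data frame drops the group from the result.
big = combine(gdf) do sdf
    nrow(sdf) < 3 ? DataFrame() : DataFrame(n = nrow(sdf))
end
```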


You can also produce multiple columns in a single operation, e.g.:

In [35]:
df = DataFrame(id=[1,1,2,2], val=[1,2,3,4])

Row,id,val
,Int64,Int64
1,1,1
2,1,2
3,2,3
4,2,4


In [36]:
@pipe df |> groupby(_, :id) |> combine(_, :val => (x -> [x]) => AsTable)

Row,id,x1,x2
,Int64,Int64,Int64
1,1,1,2
2,2,3,4


In [37]:
@pipe df |> groupby(_, :id) |> combine(_, :val => (x -> [x]) => [:c1, :c2])

Row,id,c1,c2
,Int64,Int64,Int64
1,1,1,2
2,2,3,4
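
A variant of the multi-column pattern above, as a sketch: when the function returns a `NamedTuple` and the target is `AsTable`, the field names become the column names. The field names `lo` and `hi` and the variable `res` are illustrative choices, not part of the original notebook.

```julia
using DataFrames

df = DataFrame(id=[1, 1, 2, 2], val=[1, 2, 3, 4])
gdf = groupby(df, :id, sort=true)

# The NamedTuple fields (lo, hi) are expanded into separate columns.
res = combine(gdf, :val => (v -> (lo = minimum(v), hi = maximum(v))) => AsTable)
```

Naming the fields inside the function keeps the column names next to the logic that produces them.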


### Aggregation of a data frame using `mapcols`

In [38]:
x = DataFrame(rand(10, 10), :auto)

Row,x1,x2,x3,x4,x5,x6,x7,x8
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.192541,0.00823978,0.759743,0.627588,0.210741,0.641939,0.913316,0.373572
2,0.233219,0.774847,0.764817,0.236951,0.552773,0.293831,0.031392,0.472605
3,0.00496608,0.525278,0.0962988,0.260287,0.562414,0.164036,0.187594,0.0981683
4,0.702417,0.175973,0.889661,0.651011,0.195969,0.785807,0.547239,0.549654
5,0.831447,0.690645,0.208195,0.785246,0.231691,0.570581,0.562815,0.499155
6,0.811698,0.999773,0.152812,0.708421,0.980536,0.520621,0.667761,0.717624
7,0.0838621,0.891044,0.899605,0.137132,0.753841,0.266889,0.351575,0.34191
8,0.479779,0.0987629,0.31926,0.773762,0.775632,0.11155,0.83326,0.683657
9,0.597412,0.353388,0.0932035,0.570374,0.927963,0.341653,0.905626,0.825433
10,0.766673,0.156479,0.635945,0.427705,0.869142,0.478612,0.266469,0.128289


In [39]:
mapcols(mean, x)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.470401,0.467443,0.481954,0.517848,0.60607,0.417552,0.526705,0.469007,0.510761
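
For comparison, here is a sketch of how the same column-wise aggregation could be written with `combine`. The small data frame and the names `via_mapcols` and `via_combine` are illustrative. `renamecols=false` is used so that the output keeps the source column names instead of appending `_mean`.

```julia
using DataFrames
using Statistics

df = DataFrame(a=[1.0, 2.0, 3.0], b=[4.0, 5.0, 6.0])

# mapcols applies the function to every column and returns a data frame.
via_mapcols = mapcols(mean, df)

# The same result using the source => function syntax, broadcast over all columns.
via_combine = combine(df, names(df) .=> mean, renamecols=false)
```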


### Mapping rows and columns using `eachcol` and `eachrow`

In [40]:
map(mean, eachcol(x)) # map a function over each column and return a vector

10-element Vector{Float64}:
 0.47040144156855535
 0.46744292233241913
 0.4819539338753063
 0.5178477270202617
 0.6060702559218771
 0.4175518207310521
 0.5267046580086261
 0.46900663984942953
 0.5107610870865886
 0.3891861002348541

In [41]:
# iterating over pairs(eachcol(x)) yields Pairs of column name and column values
foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))

x1: 0.47040144156855535
x2: 0.46744292233241913
x3: 0.4819539338753063
x4: 0.5178477270202617
x5: 0.6060702559218771
x6: 0.4175518207310521
x7: 0.5267046580086261
x8: 0.46900663984942953
x9: 0.5107610870865886
x10: 0.3891861002348541


In [42]:
# now each element is a DataFrameRow, which works like a NamedTuple but is a view into the parent DataFrame
map(r -> r.x1/r.x2, eachrow(x))

10-element Vector{Float64}:
 23.367205350823458
  0.30098745348074873
  0.009454193184731584
  3.9916198369956133
  1.2038715564114404
  0.8118815979296972
  0.09411672133625724
  4.85789048301045
  1.6905290545338842
  4.899517214762139
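
Because a `DataFrameRow` is a view, writing to it changes the parent data frame. The sketch below illustrates this on a small, made-up data frame (`df` and `snapshot` are illustrative names). It also takes a `copy` of a row first, which gives an independent `NamedTuple`.

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3], b=[10, 20, 30])

# copy of a DataFrameRow is a NamedTuple that is detached from the parent.
snapshot = copy(first(eachrow(df)))

# Assigning through the row view writes into df itself.
for row in eachrow(df)
    row.b = row.a + row.b
end
```

After the loop, `df.b` holds the updated values, while `snapshot` still holds the values from before the update.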

In [43]:
# it prints like a data frame, only the caption is different so that you know the type of the object
er = eachrow(x)

Row,x1,x2,x3,x4,x5,x6,x7,x8
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.192541,0.00823978,0.759743,0.627588,0.210741,0.641939,0.913316,0.373572
2,0.233219,0.774847,0.764817,0.236951,0.552773,0.293831,0.031392,0.472605
3,0.00496608,0.525278,0.0962988,0.260287,0.562414,0.164036,0.187594,0.0981683
4,0.702417,0.175973,0.889661,0.651011,0.195969,0.785807,0.547239,0.549654
5,0.831447,0.690645,0.208195,0.785246,0.231691,0.570581,0.562815,0.499155
6,0.811698,0.999773,0.152812,0.708421,0.980536,0.520621,0.667761,0.717624
7,0.0838621,0.891044,0.899605,0.137132,0.753841,0.266889,0.351575,0.34191
8,0.479779,0.0987629,0.31926,0.773762,0.775632,0.11155,0.83326,0.683657
9,0.597412,0.353388,0.0932035,0.570374,0.927963,0.341653,0.905626,0.825433
10,0.766673,0.156479,0.635945,0.427705,0.869142,0.478612,0.266469,0.128289


In [44]:
er.x1 # you can access columns of a parent data frame directly

10-element Vector{Float64}:
 0.19254068014795966
 0.23321927121656771
 0.004966076118129603
 0.7024168971836942
 0.8314473241181026
 0.8116976375505227
 0.08386212871521459
 0.4797794062602667
 0.597412329499531
 0.766672664875564

In [45]:
# it prints like a data frame, only the caption is different so that you know the type of the object
ec = eachcol(x)

Row,x1,x2,x3,x4,x5,x6,x7,x8
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.192541,0.00823978,0.759743,0.627588,0.210741,0.641939,0.913316,0.373572
2,0.233219,0.774847,0.764817,0.236951,0.552773,0.293831,0.031392,0.472605
3,0.00496608,0.525278,0.0962988,0.260287,0.562414,0.164036,0.187594,0.0981683
4,0.702417,0.175973,0.889661,0.651011,0.195969,0.785807,0.547239,0.549654
5,0.831447,0.690645,0.208195,0.785246,0.231691,0.570581,0.562815,0.499155
6,0.811698,0.999773,0.152812,0.708421,0.980536,0.520621,0.667761,0.717624
7,0.0838621,0.891044,0.899605,0.137132,0.753841,0.266889,0.351575,0.34191
8,0.479779,0.0987629,0.31926,0.773762,0.775632,0.11155,0.83326,0.683657
9,0.597412,0.353388,0.0932035,0.570374,0.927963,0.341653,0.905626,0.825433
10,0.766673,0.156479,0.635945,0.427705,0.869142,0.478612,0.266469,0.128289


In [46]:
ec.x1 # you can access columns of a parent data frame directly

10-element Vector{Float64}:
 0.19254068014795966
 0.23321927121656771
 0.004966076118129603
 0.7024168971836942
 0.8314473241181026
 0.8116976375505227
 0.08386212871521459
 0.4797794062602667
 0.597412329499531
 0.766672664875564

### Transposing

You can transpose a data frame using `permutedims`:

In [47]:
df = DataFrame(reshape(1:12, 3, 4), :auto)

Row,x1,x2,x3,x4
,Int64,Int64,Int64,Int64
1,1,4,7,10
2,2,5,8,11
3,3,6,9,12


In [48]:
df.names = ["a", "b", "c"]

3-element Vector{String}:
 "a"
 "b"
 "c"

In [49]:
permutedims(df, :names)

Row,names,a,b,c
,String,Int64,Int64,Int64
1,x1,1,2,3
2,x2,4,5,6
3,x3,7,8,9
4,x4,10,11,12
