# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), October 5, 2022**

In [1]:
using DataFrames

## Split-apply-combine

### Grouping a data frame

In [2]:
x = DataFrame(id=[1,2,3,4,1,2,3,4], id2=[1,2,1,2,1,2,1,2], v=rand(8))

Row,id,id2,v
Unnamed: 0_level_1,Int64,Int64,Float64
1,1,1,0.687987
2,2,2,0.100706
3,3,1,0.323349
4,4,2,0.270926
5,1,1,0.484443
6,2,2,0.401948
7,3,1,0.0337727
8,4,2,0.05383


In [3]:
groupby(x, :id)

Row,id,id2,v
Unnamed: 0_level_1,Int64,Int64,Float64
1,1,1,0.687987
2,1,1,0.484443

Row,id,id2,v
Unnamed: 0_level_1,Int64,Int64,Float64
1,4,2,0.270926
2,4,2,0.05383


In [4]:
groupby(x, [])

Row,id,id2,v
Unnamed: 0_level_1,Int64,Int64,Float64
1,1,1,0.687987
2,2,2,0.100706
3,3,1,0.323349
4,4,2,0.270926
5,1,1,0.484443
6,2,2,0.401948
7,3,1,0.0337727
8,4,2,0.05383


In [5]:
gx2 = groupby(x, [:id, :id2])

Row,id,id2,v
Unnamed: 0_level_1,Int64,Int64,Float64
1,1,1,0.687987
2,1,1,0.484443

Row,id,id2,v
Unnamed: 0_level_1,Int64,Int64,Float64
1,4,2,0.270926
2,4,2,0.05383


In [6]:
parent(gx2) # get the parent DataFrame 

Row,id,id2,v
Unnamed: 0_level_1,Int64,Int64,Float64
1,1,1,0.687987
2,2,2,0.100706
3,3,1,0.323349
4,4,2,0.270926
5,1,1,0.484443
6,2,2,0.401948
7,3,1,0.0337727
8,4,2,0.05383


In [7]:
vcat(gx2...) # back to the DataFrame, but in a different order of rows than the original

Row,id,id2,v
Unnamed: 0_level_1,Int64,Int64,Float64
1,1,1,0.687987
2,1,1,0.484443
3,2,2,0.100706
4,2,2,0.401948
5,3,1,0.323349
6,3,1,0.0337727
7,4,2,0.270926
8,4,2,0.05383


In [8]:
DataFrame(gx2) # the same

Row,id,id2,v
Unnamed: 0_level_1,Int64,Int64,Float64
1,1,1,0.687987
2,1,1,0.484443
3,2,2,0.100706
4,2,2,0.401948
5,3,1,0.323349
6,3,1,0.0337727
7,4,2,0.270926
8,4,2,0.05383


In [9]:
DataFrame(gx2, keepkeys=false) # drop grouping columns when creating a data frame

Row,v
Unnamed: 0_level_1,Float64
1,0.687987
2,0.484443
3,0.100706
4,0.401948
5,0.323349
6,0.0337727
7,0.270926
8,0.05383


In [10]:
groupcols(gx2) # vector of names of grouping variables

2-element Vector{Symbol}:
 :id
 :id2

In [11]:
valuecols(gx2) # and non-grouping variables

1-element Vector{Symbol}:
 :v

In [12]:
groupindices(gx2) # group indices in parent(gx2)

8-element Vector{Union{Missing, Int64}}:
 1
 2
 3
 4
 1
 2
 3
 4

In [13]:
kgx2 = keys(gx2)

4-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (id = 1, id2 = 1)
 GroupKey: (id = 2, id2 = 2)
 GroupKey: (id = 3, id2 = 1)
 GroupKey: (id = 4, id2 = 2)

You can index into a `GroupedDataFrame` as you would into a vector or a dictionary.
The dictionary-style form accepts a `GroupKey`, a `NamedTuple`, or a `Tuple`.

In [14]:
gx2

Row,id,id2,v
Unnamed: 0_level_1,Int64,Int64,Float64
1,1,1,0.687987
2,1,1,0.484443

Row,id,id2,v
Unnamed: 0_level_1,Int64,Int64,Float64
1,4,2,0.270926
2,4,2,0.05383


In [15]:
k = keys(gx2)[1]

GroupKey: (id = 1, id2 = 1)

In [16]:
ntk = NamedTuple(k)

(id = 1, id2 = 1)

In [17]:
tk = Tuple(k)

(1, 1)

The operations below produce the same result and are fast:

In [18]:
gx2[1]

Row,id,id2,v
Unnamed: 0_level_1,Int64,Int64,Float64
1,1,1,0.687987
2,1,1,0.484443


In [19]:
gx2[k]

Row,id,id2,v
Unnamed: 0_level_1,Int64,Int64,Float64
1,1,1,0.687987
2,1,1,0.484443


In [20]:
gx2[ntk]

Row,id,id2,v
Unnamed: 0_level_1,Int64,Int64,Float64
1,1,1,0.687987
2,1,1,0.484443


In [21]:
gx2[tk]

Row,id,id2,v
Unnamed: 0_level_1,Int64,Int64,Float64
1,1,1,0.687987
2,1,1,0.484443
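
The dictionary-style access described above also works with `haskey`, `get`, and `pairs`. The following is a minimal sketch on a small hand-made table (the data frame `df` and its values are illustrative assumptions, not the data used in the cells above):

```julia
using DataFrames

# Small deterministic table (illustrative values, not taken from the notebook)
df = DataFrame(id=[1, 2, 1, 2], id2=[1, 2, 1, 2], v=[10, 20, 30, 40])
gdf = groupby(df, [:id, :id2])

# Vector-style and dictionary-style lookups return the same group
by_position = gdf[1]
by_tuple = gdf[(1, 1)]
by_namedtuple = gdf[(id=1, id2=1)]

# Dictionary-style helpers: test for a key, or fall back to a default value
has_group = haskey(gdf, (3, 3))       # there is no (3, 3) group in this table
fallback = get(gdf, (3, 3), nothing)

# Iterate over key => group pairs and collect the sum of v for each group
sums = Dict(Tuple(k) => sum(g.v) for (k, g) in pairs(gdf))
```

Using `get` with a default avoids a `KeyError` when a key combination may be absent from the data.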


Handling missing values:

In [22]:
x = DataFrame(id = [missing, 5, 1, 3, missing], x = 1:5)

Row,id,x
Unnamed: 0_level_1,Int64?,Int64
1,missing,1
2,5,2
3,1,3
4,3,4
5,missing,5


In [23]:
groupby(x, :id) # by default groups include missing values and their order is not guaranteed

Row,id,x
Unnamed: 0_level_1,Int64?,Int64
1,1,3

Row,id,x
Unnamed: 0_level_1,Int64?,Int64
1,missing,1
2,missing,5


In [24]:
groupby(x, :id, sort=true, skipmissing=true) # but we can change it; now they are sorted

Row,id,x
Unnamed: 0_level_1,Int64?,Int64
1,1,3

Row,id,x
Unnamed: 0_level_1,Int64?,Int64
1,5,2


In [25]:
groupby(x, :id, sort=false) # and now they are in the order they appear in the source data frame

Row,id,x
Unnamed: 0_level_1,Int64?,Int64
1,missing,1
2,missing,5

Row,id,x
Unnamed: 0_level_1,Int64?,Int64
1,3,4


### Performing transformations by group using `combine`, `select`, `select!`, `transform`, and `transform!`

In [26]:
using Statistics
using Chain

In [27]:
x = DataFrame(id=rand('a':'d', 100), v=rand(100))

Row,id,v
Unnamed: 0_level_1,Char,Float64
1,b,0.314588
2,b,0.571885
3,c,0.0996595
4,a,0.709889
5,b,0.289401
6,d,0.0122073
7,b,0.236957
8,d,0.661488
9,c,0.0276841
10,d,0.200754


In [28]:
# apply a function to each group of a data frame
# combine keeps as many rows as are returned from the function
@chain x begin
    groupby(:id)
    combine(:v => mean)
end

Row,id,v_mean
Unnamed: 0_level_1,Char,Float64
1,b,0.541344
2,c,0.387992
3,a,0.481815
4,d,0.425779
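
To illustrate the comment that `combine` keeps as many rows as the function returns, here is a small sketch on a hand-made table (the data and the column names `v_max` and `smallest` are assumptions made for this example):

```julia
using DataFrames

df = DataFrame(id=['a', 'a', 'a', 'b', 'b'], v=[3, 1, 2, 5, 4])
gdf = groupby(df, :id)

# The function returns one value per group, so the result has one row per group
one_row = combine(gdf, :v => maximum => :v_max)

# The function returns a 2-element vector per group, so each group yields two rows
two_rows = combine(gdf, :v => (v -> sort(v)[1:2]) => :smallest)
```

In the second call the grouping column is repeated for each returned element.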


In [29]:
x.id2 = axes(x, 1)

Base.OneTo(100)

In [30]:
# select and transform keep as many rows as are in the source data frame, in their original order
# additionally transform keeps all columns from the source
@chain x begin
    groupby(:id)
    transform(:v => mean)
end

Row,id,v,id2,v_mean
Unnamed: 0_level_1,Char,Float64,Int64,Float64
1,b,0.314588,1,0.541344
2,b,0.571885,2,0.541344
3,c,0.0996595,3,0.387992
4,a,0.709889,4,0.481815
5,b,0.289401,5,0.541344
6,d,0.0122073,6,0.425779
7,b,0.236957,7,0.541344
8,d,0.661488,8,0.425779
9,c,0.0276841,9,0.387992
10,d,0.200754,10,0.425779
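
The difference between `select` and `transform` on a grouped data frame can be seen side by side. This is a minimal sketch; the table and the `extra` column are hypothetical and only serve to show which columns survive:

```julia
using DataFrames, Statistics

df = DataFrame(id=['b', 'a', 'b', 'a'], v=[1.0, 2.0, 3.0, 4.0], extra=1:4)
gdf = groupby(df, :id)

# select keeps the grouping column and the requested result only
selected = select(gdf, :v => mean)

# transform keeps all source columns and appends the result
transformed = transform(gdf, :v => mean)
```

Both results have four rows in the order of `df`, with the group mean repeated within each group.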


In [31]:
# note that combine orders rows by group of the GroupedDataFrame
@chain x begin
    groupby(:id)
    combine(:id2, :v => mean)
end

Row,id,id2,v_mean
Unnamed: 0_level_1,Char,Int64,Float64
1,b,1,0.541344
2,b,2,0.541344
3,b,5,0.541344
4,b,7,0.541344
5,b,15,0.541344
6,b,16,0.541344
7,b,18,0.541344
8,b,21,0.541344
9,b,24,0.541344
10,b,30,0.541344


In [32]:
# we give a custom name for the result column
@chain x begin
    groupby(:id)
    combine(:v => mean => :res)
end

Row,id,res
Unnamed: 0_level_1,Char,Float64
1,b,0.541344
2,c,0.387992
3,a,0.481815
4,d,0.425779


In [33]:
# you can have multiple operations
@chain x begin
    groupby(:id)
    combine(:v => mean => :res1, :v => sum => :res2, nrow => :n)
end

Row,id,res1,res2,n
Unnamed: 0_level_1,Char,Float64,Float64,Int64
1,b,0.541344,14.6163,27
2,c,0.387992,9.69981,25
3,a,0.481815,12.0454,25
4,d,0.425779,9.79293,23


Additional notes:
* `select!` and `transform!` perform operations in-place
* the general syntax for a transformation is `source_columns => function => target_column`
* if you pass multiple columns to a function, they are passed as positional arguments
* `ByRow` and `AsTable` work exactly as discussed for operations on data frames in 05_columns.ipynb
* you can automatically group the result of `combine`, `select`, etc. again by passing the `ungroup=false` keyword argument to them
* similarly, the `keepkeys` keyword argument allows you to drop grouping columns from the resulting data frame
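
The notes above can be sketched on a small example. The table and the result column names (`dot`, `a_plus_b`, `total`) are assumptions chosen for illustration:

```julia
using DataFrames

df = DataFrame(id=[1, 1, 2, 2], a=[1, 2, 3, 4], b=[10, 20, 30, 40])
gdf = groupby(df, :id)

# Several source columns are passed to the function as positional arguments
dots = combine(gdf, [:a, :b] => ((a, b) -> sum(a .* b)) => :dot)

# ByRow wraps a function so that it is applied to each row rather than to whole columns
rowsums = transform(gdf, [:a, :b] => ByRow(+) => :a_plus_b)

# ungroup=false returns a GroupedDataFrame instead of a DataFrame
regrouped = combine(gdf, :a => sum => :total; ungroup=false)

# keepkeys=false drops the grouping column from the result
nokeys = combine(gdf, :a => sum => :total; keepkeys=false)
```

Returning a `GroupedDataFrame` with `ungroup=false` is convenient when another grouped operation follows immediately.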

You can also pass a function directly to any of these functions; as a special case, it may be given as the first argument. In this case the return value can be a table. In particular, returning an empty table from the function is an easy way to drop a group.

When you pass a function you can use the `do` block syntax. The function receives a `SubDataFrame` as its argument.

Here is an example:

In [34]:
combine(groupby(x, :id)) do sdf
    n = nrow(sdf)
    n < 25 ? DataFrame() : DataFrame(n=n) # drop groups with low number of rows
end

Row,id,n
Unnamed: 0_level_1,Char,Int64
1,b,27
2,c,25
3,a,25


You can also produce multiple columns in a single operation, e.g.:

In [35]:
df = DataFrame(id=[1,1,2,2], val=[1,2,3,4])

Row,id,val
Unnamed: 0_level_1,Int64,Int64
1,1,1
2,1,2
3,2,3
4,2,4


In [36]:
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => AsTable)
end

Row,id,x1,x2
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,1,2
2,2,3,4


In [37]:
@chain df begin
    groupby(:id)
    combine(:val => (x -> [x]) => [:c1, :c2])
end

Row,id,c1,c2
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,1,2
2,2,3,4


It is easy to unnest a column into multiple columns, e.g.:

In [38]:
df = DataFrame(a=[(p=1, q=2), (p=3, q=4)])

Row,a
Unnamed: 0_level_1,NamedTup…
1,"(p = 1, q = 2)"
2,"(p = 3, q = 4)"


In [39]:
select(df, :a => AsTable)

Row,p,q
Unnamed: 0_level_1,Int64,Int64
1,1,2
2,3,4


In [40]:
df = DataFrame(a=[[1, 2], [3, 4]])

Row,a
Unnamed: 0_level_1,Array…
1,"[1, 2]"
2,"[3, 4]"


In [41]:
select(df, :a => AsTable) # automatic column names generated

Row,x1,x2
Unnamed: 0_level_1,Int64,Int64
1,1,2
2,3,4


In [42]:
select(df, :a => [:C1, :C2]) # custom column names generated

Row,C1,C2
Unnamed: 0_level_1,Int64,Int64
1,1,2
2,3,4


Finally, observe that one can conveniently apply multiple transformations using broadcasting:

In [43]:
df = DataFrame(id=repeat(1:10, 10), x1=1:100, x2=101:200)

Row,id,x1,x2
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,1,101
2,2,2,102
3,3,3,103
4,4,4,104
5,5,5,105
6,6,6,106
7,7,7,107
8,8,8,108
9,9,9,109
10,10,10,110


In [44]:
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> minimum)
end

Row,id,x1_minimum,x2_minimum
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,1,101
2,2,2,102
3,3,3,103
4,4,4,104
5,5,5,105
6,6,6,106
7,7,7,107
8,8,8,108
9,9,9,109
10,10,10,110


In [45]:
@chain df begin
    groupby(:id)
    combine([:x1, :x2] .=> [minimum maximum])
end

Row,id,x1_minimum,x2_minimum,x1_maximum,x2_maximum
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Int64
1,1,1,101,91,191
2,2,2,102,92,192
3,3,3,103,93,193
4,4,4,104,94,194
5,5,5,105,95,195
6,6,6,106,96,196
7,7,7,107,97,197
8,8,8,108,98,198
9,9,9,109,99,199
10,10,10,110,100,200
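
To see why the last call produces four result columns, it may help to inspect what the broadcasted expressions evaluate to before they reach `combine`. This is a small sketch on a hypothetical table:

```julia
using DataFrames

cols = [:x1, :x2]

# Column vector .=> single function: one pair per column
single = cols .=> minimum

# Column vector .=> 1×2 row matrix of functions: a 2×2 matrix of pairs,
# that is, every column paired with every function
grid = cols .=> [minimum maximum]

df = DataFrame(id=[1, 1, 2, 2], x1=[1, 2, 3, 4], x2=[5, 6, 7, 8])
res = combine(groupby(df, :id), grid)
```

The matrix of pairs is processed column by column, which is why both `minimum` columns come before both `maximum` columns in the result.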


### Aggregation of a data frame using `mapcols`

In [46]:
x = DataFrame(rand(10, 10), :auto)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.609252,0.478449,0.147533,0.416274,0.0799526,0.20815,0.0492013,0.083144,0.696722,0.242335
2,0.608381,0.966422,0.143487,0.070214,0.788614,0.54208,0.812702,0.650973,0.15644,0.569921
3,0.734881,0.455496,0.137201,0.774702,0.315646,0.975013,0.447677,0.48016,0.523917,0.681097
4,0.747723,0.964219,0.293375,0.705086,0.618123,0.14647,0.497369,0.310807,0.531596,0.784373
5,0.917051,0.238184,0.542938,0.225385,0.84831,0.161894,0.156023,0.160125,0.547363,0.753897
6,0.00145541,0.964926,0.706856,0.431619,0.892142,0.160579,0.0160198,0.198285,0.0593324,0.204419
7,0.670264,0.468089,0.647681,0.909407,0.286941,0.962149,0.816631,0.894423,0.17228,0.853088
8,0.6642,0.701694,0.99797,0.0796954,0.319452,0.555587,0.590807,0.510126,0.0836374,0.100546
9,0.716944,0.56986,0.09389,0.978335,0.799974,0.31208,0.69401,0.407816,0.0845134,0.626323
10,0.586466,0.812984,0.836441,0.361244,0.528176,0.115888,0.722673,0.466461,0.943547,0.532951


In [47]:
mapcols(mean, x)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.625662,0.662032,0.454737,0.495196,0.547733,0.413989,0.480311,0.416232,0.379935,0.534895


### Mapping rows and columns using `eachcol` and `eachrow`

In [48]:
map(mean, eachcol(x)) # map a function over each column and return a vector

10-element Vector{Float64}:
 0.6256618320714122
 0.6620323039610124
 0.4547371137600953
 0.49519624682485375
 0.5477331719184032
 0.4139888970786079
 0.48031139667665457
 0.4162322284290608
 0.37993482842905174
 0.5348949585851259

In [49]:
# an iteration returns a Pair with column name and values
foreach(c -> println(c[1], ": ", mean(c[2])), pairs(eachcol(x)))

x1: 0.6256618320714122
x2: 0.6620323039610124
x3: 0.4547371137600953
x4: 0.49519624682485375
x5: 0.5477331719184032
x6: 0.4139888970786079
x7: 0.48031139667665457
x8: 0.4162322284290608
x9: 0.37993482842905174
x10: 0.5348949585851259


In [50]:
# now the returned value is DataFrameRow which works as a NamedTuple but is a view to a parent DataFrame
map(r -> r.x1/r.x2, eachrow(x))

10-element Vector{Float64}:
 1.2733889624168895
 0.6295189528968664
 1.6133650647551416
 0.77546997973602
 3.8501765481483945
 0.0015083071503702158
 1.4319177541455181
 0.9465663241387834
 1.2581062523658957
 0.7213747414586192
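
Because a `DataFrameRow` is a view, assigning to one of its fields writes into the parent data frame. The following is a minimal sketch with a hypothetical two-row table:

```julia
using DataFrames

df = DataFrame(x1=[1.0, 2.0], x2=[4.0, 8.0])

for r in eachrow(df)
    r.x1 = r.x1 / r.x2   # writes through the view into df
end

# copy on a DataFrameRow gives an independent NamedTuple
nt = copy(df[1, :])
```

If you need a snapshot of a row that is not affected by later changes to the parent, copy it as in the last line.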

In [51]:
# it prints like a data frame, only the caption is different so that you know the type of the object
er = eachrow(x)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.609252,0.478449,0.147533,0.416274,0.0799526,0.20815,0.0492013,0.083144,0.696722,0.242335
2,0.608381,0.966422,0.143487,0.070214,0.788614,0.54208,0.812702,0.650973,0.15644,0.569921
3,0.734881,0.455496,0.137201,0.774702,0.315646,0.975013,0.447677,0.48016,0.523917,0.681097
4,0.747723,0.964219,0.293375,0.705086,0.618123,0.14647,0.497369,0.310807,0.531596,0.784373
5,0.917051,0.238184,0.542938,0.225385,0.84831,0.161894,0.156023,0.160125,0.547363,0.753897
6,0.00145541,0.964926,0.706856,0.431619,0.892142,0.160579,0.0160198,0.198285,0.0593324,0.204419
7,0.670264,0.468089,0.647681,0.909407,0.286941,0.962149,0.816631,0.894423,0.17228,0.853088
8,0.6642,0.701694,0.99797,0.0796954,0.319452,0.555587,0.590807,0.510126,0.0836374,0.100546
9,0.716944,0.56986,0.09389,0.978335,0.799974,0.31208,0.69401,0.407816,0.0845134,0.626323
10,0.586466,0.812984,0.836441,0.361244,0.528176,0.115888,0.722673,0.466461,0.943547,0.532951


In [52]:
er.x1 # you can access columns of a parent data frame directly

10-element Vector{Float64}:
 0.6092521394634033
 0.6083808320990414
 0.7348807317901408
 0.7477226729643642
 0.9170514636754948
 0.0014554054082409618
 0.6702644125637675
 0.6642002609707035
 0.7169443028766239
 0.586466098902341

In [53]:
# it prints like a data frame, only the caption is different so that you know the type of the object
ec = eachcol(x)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.609252,0.478449,0.147533,0.416274,0.0799526,0.20815,0.0492013,0.083144,0.696722,0.242335
2,0.608381,0.966422,0.143487,0.070214,0.788614,0.54208,0.812702,0.650973,0.15644,0.569921
3,0.734881,0.455496,0.137201,0.774702,0.315646,0.975013,0.447677,0.48016,0.523917,0.681097
4,0.747723,0.964219,0.293375,0.705086,0.618123,0.14647,0.497369,0.310807,0.531596,0.784373
5,0.917051,0.238184,0.542938,0.225385,0.84831,0.161894,0.156023,0.160125,0.547363,0.753897
6,0.00145541,0.964926,0.706856,0.431619,0.892142,0.160579,0.0160198,0.198285,0.0593324,0.204419
7,0.670264,0.468089,0.647681,0.909407,0.286941,0.962149,0.816631,0.894423,0.17228,0.853088
8,0.6642,0.701694,0.99797,0.0796954,0.319452,0.555587,0.590807,0.510126,0.0836374,0.100546
9,0.716944,0.56986,0.09389,0.978335,0.799974,0.31208,0.69401,0.407816,0.0845134,0.626323
10,0.586466,0.812984,0.836441,0.361244,0.528176,0.115888,0.722673,0.466461,0.943547,0.532951


In [54]:
ec.x1 # you can access columns of a parent data frame directly

10-element Vector{Float64}:
 0.6092521394634033
 0.6083808320990414
 0.7348807317901408
 0.7477226729643642
 0.9170514636754948
 0.0014554054082409618
 0.6702644125637675
 0.6642002609707035
 0.7169443028766239
 0.586466098902341

### Transposing

You can transpose a data frame using `permutedims`:

In [55]:
df = DataFrame(reshape(1:12, 3, 4), :auto)

Row,x1,x2,x3,x4
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
1,1,4,7,10
2,2,5,8,11
3,3,6,9,12


In [56]:
permutedims(df)

Row,x1,x2,x3
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,2,3
2,4,5,6
3,7,8,9
4,10,11,12


In [57]:
df.names = ["a", "b", "c"]

3-element Vector{String}:
 "a"
 "b"
 "c"

In [58]:
permutedims(df, :names)

Row,names,a,b,c
Unnamed: 0_level_1,String,Int64,Int64,Int64
1,x1,1,2,3
2,x2,4,5,6
3,x3,7,8,9
4,x4,10,11,12
