# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), December 12, 2021**

In [1]:
using DataFrames

## Possible pitfalls

### Know what is copied when creating a `DataFrame`

In [2]:
x = DataFrame(rand(3, 5), :auto)

| Row | x1 | x2 | x3 | x4 | x5 |
|-----|----|----|----|----|----|
|     | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.345332 | 0.88169 | 0.636903 | 0.275687 | 0.441932 |
| 2 | 0.942451 | 0.243809 | 0.860925 | 0.710149 | 0.0612701 |
| 3 | 0.937008 | 0.67613 | 0.391914 | 0.549464 | 0.178747 |


In [3]:
y = copy(x)
x === y # not the same object

false

In [4]:
y = DataFrame(x)
x === y

false

In [5]:
any(x[!, i] === y[!, i] for i in 1:ncol(x)) # the columns are also not the same

false

In [6]:
y = DataFrame(x, copycols=false)
x === y

false

In [7]:
all(x[!, i] === y[!, i] for i in 1:ncol(x)) # the columns are the same

true

In [8]:
x = 1:3; y = [1, 2, 3]; df = DataFrame(x=x, y=y) # columns are also copied when creating a data frame using kwarg syntax

| Row | x | y |
|-----|---|---|
|     | Int64 | Int64 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |


In [9]:
y === df.y # different object

false

In [10]:
typeof(x), typeof(df.x) # range is converted to a vector

(UnitRange{Int64}, Vector{Int64})

In [11]:
y === df[:, :y] # slicing rows always creates a copy

false

You can avoid copying by passing the `copycols=false` keyword argument to functions that accept it, such as the `DataFrame` constructor and `select`.

In [12]:
df = DataFrame(x=x, y=y, copycols=false)

| Row | x | y |
|-----|---|---|
|     | Int64 | Int64 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |


In [13]:
y === df.y # now it is the same

true

In [14]:
select(df, :y)[!, 1] === y # not the same

false

In [15]:
select(df, :y, copycols=false)[!, 1] === y # the same

true
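
The flip side of `copycols=false` is that the data frame and the source vector share memory, so a later mutation of the vector shows up in the data frame as well. A minimal sketch of that effect follows; the names `v`, `df_alias`, and `df_copy` are hypothetical and not part of the notebook.

```julia
using DataFrames

v = [1, 2, 3]                               # hypothetical source vector
df_alias = DataFrame(v=v, copycols=false)   # column is the same object as v
df_copy = DataFrame(v=v)                    # default: column is a copy of v

v[1] = 100                                  # mutate the source vector

df_alias.v   # sees the change, because it aliases v
df_copy.v    # unaffected, because it owns its own copy
```

If the source vectors are used elsewhere after the data frame is built, the default copying behavior is usually the safer choice.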

### Do not modify the parent of `GroupedDataFrame` or `view`

In [16]:
x = DataFrame(id=repeat([1,2], outer=3), x=1:6)
g = groupby(x, :id)

| Row | id | x |
|-----|----|---|
|     | Int64 | Int64 |
| 1 | 1 | 1 |
| 2 | 1 | 3 |
| 3 | 1 | 5 |

| Row | id | x |
|-----|----|---|
|     | Int64 | Int64 |
| 1 | 2 | 2 |
| 2 | 2 | 4 |
| 3 | 2 | 6 |


In [17]:
x[1:3, 1] = [2,2,2]
g # wrong now: g is only a view, and its grouping was computed before x was modified

| Row | id | x |
|-----|----|---|
|     | Int64 | Int64 |
| 1 | 2 | 1 |
| 2 | 2 | 3 |
| 3 | 1 | 5 |

| Row | id | x |
|-----|----|---|
|     | Int64 | Int64 |
| 1 | 2 | 2 |
| 2 | 2 | 4 |
| 3 | 2 | 6 |
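
One way to avoid the stale grouping shown above is to finish modifying the parent first and only then call `groupby`. This is a sketch under that assumption; the variable name `src` is hypothetical, and groups are looked up by key so the example does not depend on group order.

```julia
using DataFrames

src = DataFrame(id=repeat([1, 2], outer=3), x=1:6)  # same data as x above
src[1:3, :id] = [2, 2, 2]   # modify the grouping column first
g = groupby(src, :id)       # then group, so the groups match the data

nrow(g[(id=1,)])   # number of rows whose id is 1
nrow(g[(id=2,)])   # number of rows whose id is 2
```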


In [18]:
s = view(x, 5:6, :)

| Row | id | x |
|-----|----|---|
|     | Int64 | Int64 |
| 1 | 1 | 5 |
| 2 | 2 | 6 |


In [19]:
delete!(x, 3:6)

| Row | id | x |
|-----|----|---|
|     | Int64 | Int64 |
| 1 | 2 | 1 |
| 2 | 2 | 2 |


In [20]:
s # error: the view still refers to rows 5:6 of x, which no longer exist

BoundsError: attempt to access 2-element Vector{Int64} at index [5:6]
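
If the selected rows must outlive changes to the parent, one option is to take a copy instead of a view. The sketch below assumes a DataFrames.jl version that provides `deleteat!` for data frames (1.4 or later, to my knowledge); `src` and `snapshot` are hypothetical names.

```julia
using DataFrames

src = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
snapshot = src[5:6, :]     # row slicing with a range copies the data
deleteat!(src, 3:6)        # shrink the source; assumed available in this version

snapshot                   # still holds the two rows it was created with
```

The copy costs memory, so a view remains the better choice when the parent is known to stay unchanged.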

### Single column selection for `DataFrame` creates aliases with `!` and `getproperty` syntax and copies with `:`

In [21]:
x = DataFrame(a=1:3)
x.b = x[!, 1] # alias
x.c = x[:, 1] # copy
x.d = x[!, 1][:] # copy
x.e = copy(x[!, 1]) # explicit copy
display(x)
x[1,1] = 100
display(x)

| Row | a | b | c | d | e |
|-----|---|---|---|---|---|
|     | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |


| Row | a | b | c | d | e |
|-----|---|---|---|---|---|
|     | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 100 | 100 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |
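
The alias versus copy distinction can also be checked directly with `===`, without mutating anything. A small sketch using a hypothetical data frame `df`:

```julia
using DataFrames

df = DataFrame(a=1:3)

df.a === df[!, :a]     # true: both return the stored column itself
df[:, :a] === df.a     # false: `:` allocates a new vector
df[:, :a] == df.a      # true: the copy has equal contents
```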


### When iterating rows of a data frame use `eachrow` to avoid compilation cost (wide tables), but `Tables.namedtupleiterator` for fast execution (tall tables)

This table is wide:

In [22]:
df1 = DataFrame([rand([1:2, 'a':'b', false:true, 1.0:2.0]) for i in 1:900], :auto)

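| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 |
|-----|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|
|     | Bool | Int64 | Char | Bool | Float64 | Char | Char | Char | Bool | Bool | Char | Float64 | Bool |
| 1 | 0 | 1 | a | 0 | 1.0 | a | a | a | 0 | 0 | a | 1.0 | 0 |
| 2 | 1 | 2 | b | 1 | 2.0 | b | b | b | 1 | 1 | b | 2.0 | 1 |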


In [23]:
@time collect(eachrow(df1))

  0.083117 seconds (86.08 k allocations: 4.741 MiB, 99.93% compilation time)


2-element Vector{DataFrameRow}:
 DataFrameRow
 Row │ x1     x2     x3    x4     x5       x6    x7    x8    x9     x10    x11 ⋯
     │ Bool   Int64  Char  Bool   Float64  Char  Char  Char  Bool   Bool   Cha ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ false      1  a     false      1.0  a     a     a     false  false  a   ⋯
                                                             890 columns omitted
 DataFrameRow
 Row │ x1    x2     x3    x4    x5       x6    x7    x8    x9    x10   x11   x ⋯
     │ Bool  Int64  Char  Bool  Float64  Char  Char

In [24]:
@time collect(Tables.namedtupleiterator(df1));

 11.948307 seconds (2.66 M allocations: 167.463 MiB, 0.68% gc time, 99.89% compilation time)


As you can see, the time to compile `Tables.namedtupleiterator` is very large in this case, and it would get much worse if the table were wider (that is why this tip is included in the pitfalls notebook).

The table below is tall:

In [25]:
df2 = DataFrame(rand(10^6, 10), :auto)

| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|-----|----|----|----|----|----|----|----|----|
|     | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.812598 | 0.199008 | 0.489529 | 0.173243 | 0.174659 | 0.629508 | 0.261476 | 0.384585 |
| 2 | 0.00280933 | 0.0697785 | 0.721708 | 0.985117 | 0.748635 | 0.747097 | 0.95652 | 0.234513 |
| 3 | 0.674185 | 0.642761 | 0.277017 | 0.576259 | 0.641213 | 0.905591 | 0.880482 | 0.558141 |
| 4 | 0.637642 | 0.0178559 | 0.938028 | 0.309252 | 0.446349 | 0.577177 | 0.966324 | 0.75673 |
| 5 | 0.847842 | 0.0951855 | 0.185581 | 0.437136 | 0.0211549 | 0.90453 | 0.382964 | 0.450601 |
| 6 | 0.85145 | 0.388947 | 0.265416 | 0.628984 | 0.433217 | 0.806872 | 0.39362 | 0.866561 |
| 7 | 0.808643 | 0.328728 | 0.753462 | 0.404849 | 0.107149 | 0.285132 | 0.82107 | 0.0583062 |
| 8 | 0.513336 | 0.771539 | 0.344971 | 0.71855 | 0.266392 | 0.745457 | 0.370414 | 0.996824 |
| 9 | 0.235142 | 0.7607 | 0.243749 | 0.386569 | 0.0957563 | 0.789125 | 0.390211 | 0.28802 |
| 10 | 0.970881 | 0.00220123 | 0.305367 | 0.34369 | 0.6262 | 0.251802 | 0.0153362 | 0.533094 |


In [26]:
@time map(sum, eachrow(df2))

  3.892058 seconds (60.19 M allocations: 1.061 GiB, 8.04% gc time, 3.45% compilation time)


1000000-element Vector{Float64}:
 3.5415480933137844
 5.205413796084415
 5.538985173625073
 5.679680368162268
 4.819966279265384
 6.039599576343484
 4.559158997855922
 5.506623184812048
 4.357268924654652
 3.9055762037646815
 5.468423085862514
 5.135525917068438
 6.650543085096766
 ⋮
 4.808328609279509
 4.135386777246283
 5.411053708396329
 6.091779993930741
 4.527597648004583
 6.795736552178264
 5.536045157550963
 4.279163510515187
 3.9958529869139956
 5.377263758627355
 5.361734558567568
 4.74749740730375

In [27]:
@time map(sum, eachrow(df2))

  3.526568 seconds (59.99 M allocations: 1.050 GiB, 2.39% gc time)



In [28]:
@time map(sum, Tables.namedtupleiterator(df2))

  0.275496 seconds (500.67 k allocations: 35.538 MiB, 93.38% compilation time)



In [29]:
@time map(sum, Tables.namedtupleiterator(df2))

  0.017857 seconds (13 allocations: 7.630 MiB)



As you can see, this time it is much faster to iterate a type-stable container.
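
To make the type-stability point concrete, the sketch below compares the element types produced by the two iterators on a small hypothetical data frame `df`. It relies on `Tables` being reachable after `using DataFrames`, as in the cells above.

```julia
using DataFrames

df = DataFrame(a=1:3, b=[1.0, 2.0, 3.0])

row = first(eachrow(df))                   # a DataFrameRow
nt = first(Tables.namedtupleiterator(df))  # a NamedTuple

typeof(row)   # does not encode the element types of columns a and b
typeof(nt)    # field names and element types are part of the type

# both iterators yield the same row-wise sums
map(sum, eachrow(df)) == map(sum, Tables.namedtupleiterator(df))
```

Because the `NamedTuple` type carries the column types, the compiler can specialize the loop body, which is also why compilation becomes expensive for very wide tables.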

Still, you might want to use the `select` syntax, which is optimized for such reductions:

In [30]:
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum # this includes compilation time

  1.123701 seconds (2.16 M allocations: 134.329 MiB, 2.20% gc time, 98.66% compilation time)



In [31]:
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum

  0.013828 seconds (151 allocations: 7.637 MiB)


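
For readers new to this syntax, here is the same `select` pattern on a tiny hypothetical data frame, where the result is easy to verify by hand. The output column name `"total"` is an arbitrary choice for this sketch.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)

# AsTable(:) passes each row to `sum` as a NamedTuple of all columns
res = select(df, AsTable(:) => ByRow(sum) => "total")

res.total   # row-wise sums: 1+4, 2+5, 3+6
```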