# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), July 17, 2021**

In [1]:
using DataFrames

## Possible pitfalls

### Know what is copied when creating a `DataFrame`

In [2]:
x = DataFrame(rand(3, 5), :auto)

| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.967459 | 0.604687 | 0.298296 | 0.976798 | 0.417961 |
| 2 | 0.772129 | 0.503242 | 0.394637 | 0.410221 | 0.908613 |
| 3 | 0.854332 | 0.294652 | 0.681667 | 0.177108 | 0.112114 |


In [3]:
y = copy(x)
x === y # not the same object

false

In [4]:
y = DataFrame(x)
x === y

false

In [5]:
any(x[!, i] === y[!, i] for i in 1:ncol(x)) # the columns are also not the same

false

In [6]:
y = DataFrame(x, copycols=false)
x === y

false

In [7]:
all(x[!, i] === y[!, i] for i in 1:ncol(x)) # the columns are the same

true
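The practical consequence of sharing columns is that a write through one data frame is visible in the other. The following is a minimal sketch of that effect; the names `src`, `shared`, and `copied` are illustrative and do not appear in the notebook.

```julia
using DataFrames

# Illustrative data; the variable names are assumptions made for this sketch.
src = DataFrame(a=[1, 2, 3])

shared = DataFrame(src, copycols=false) # reuses the column vectors of src
copied = DataFrame(src)                 # copies the column vectors (default)

# Write through the data frame that shares its columns with src.
shared.a[1] = 100

println(src.a[1])    # src sees the change made through `shared`
println(copied.a[1]) # the copy is unaffected
```

Use `copycols=false` only when both data frames are meant to stay in sync, or when the source is not used afterwards.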

In [8]:
x = 1:3; y = [1, 2, 3]; df = DataFrame(x=x,y=y) # columns are also copied when creating a data frame using keyword argument syntax

| Row | x | y |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |


In [9]:
y === df.y # different object

false

In [10]:
typeof(x), typeof(df.x) # range is converted to a vector

(UnitRange{Int64}, Vector{Int64})

In [11]:
y === df[:, :y] # slicing rows always creates a copy

false

You can avoid copying by passing the `copycols=false` keyword argument to functions that accept it, such as the `DataFrame` constructor and `select`.

In [12]:
df = DataFrame(x=x,y=y, copycols=false)

| Row | x | y |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |


In [13]:
y === df.y # now it is the same

true

In [14]:
select(df, :y)[!, 1] === y # not the same

false

In [15]:
select(df, :y, copycols=false)[!, 1] === y # the same

true
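When it is unclear whether two data frames share storage, the `===` checks above can be wrapped in a small diagnostic. This is only a sketch: `shared_columns` is a hypothetical helper written for this example, not a DataFrames.jl function.

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): names of the columns
# that are stored as the very same vector in both data frames.
shared_columns(a, b) =
    [n for n in names(a) if n in names(b) && a[!, n] === b[!, n]]

df = DataFrame(x=[1, 2, 3], y=[4, 5, 6])

with_copy = select(df, :y)                    # columns are copied by default
without_copy = select(df, :y, copycols=false) # column :y is reused

println(shared_columns(df, with_copy))
println(shared_columns(df, without_copy))
```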

### Do not modify the parent of `GroupedDataFrame` or `view`

In [16]:
x = DataFrame(id=repeat([1,2], outer=3), x=1:6)
g = groupby(x, :id)

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 1 |
| 2 | 1 | 3 |
| 3 | 1 | 5 |

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 2 | 2 |
| 2 | 2 | 4 |
| 3 | 2 | 6 |


In [17]:
x[1:3, 1] = [2,2,2]
g # wrong now: g is only a view of x and still uses the grouping computed before the change

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 2 | 1 |
| 2 | 2 | 3 |
| 3 | 1 | 5 |

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 2 | 2 |
| 2 | 2 | 4 |
| 3 | 2 | 6 |
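One way to stay safe, sketched below, is to call `groupby` again after every change to the grouping column and to stop using the old `GroupedDataFrame`. The names `parent_df`, `stale`, and `fresh` are illustrative.

```julia
using DataFrames

# Illustrative names; the data mirrors the example above.
parent_df = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
stale = groupby(parent_df, :id)

# Changing the grouping column invalidates `stale`; do not use it afterwards.
parent_df[1:3, :id] = [2, 2, 2]

# Regroup to get groups that match the current contents of parent_df.
fresh = groupby(parent_df, :id)

println(length(fresh))         # number of groups
println(nrow(fresh[(id=2,)]))  # rows whose id is now 2
println(nrow(fresh[(id=1,)]))  # rows whose id is still 1
```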


In [18]:
s = view(x, 5:6, :)

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 5 |
| 2 | 2 | 6 |


In [19]:
delete!(x, 3:6) # deprecated in DataFrames.jl 1.3 and later in favor of deleteat!(x, 3:6)

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 2 | 1 |
| 2 | 2 | 2 |


In [20]:
s # error: the view still refers to rows 5:6 of x, which no longer exist

BoundsError: BoundsError: attempt to access 2-element Vector{Int64} at index [5:6]
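If the contents of a view are needed after the parent shrinks, a possible approach is to materialize the view with `copy` first. The sketch below assumes this is acceptable memory-wise; it uses `filter!` as a stand-in for the in-place row deletion shown above, and the names `parent_df` and `snapshot` are illustrative.

```julia
using DataFrames

# Illustrative names; the data mirrors the example above.
parent_df = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
s = view(parent_df, 5:6, :)

# copy of a SubDataFrame allocates an independent DataFrame.
snapshot = copy(s)

# Drop rows from the parent in place (stand-in for the deletion above).
# After this call `s` must not be used any more.
filter!(:x => <=(2), parent_df)

println(nrow(parent_df))
println(snapshot)
```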

### Single column selection for `DataFrame` creates aliases with `!` and `getproperty` syntax and copies with `:`

In [21]:
x = DataFrame(a=1:3)
x.b = x[!, 1] # alias
x.c = x[:, 1] # copy
x.d = x[!, 1][:] # copy
x.e = copy(x[!, 1]) # explicit copy
display(x)
x[1,1] = 100
display(x)

| Row | a | b | c | d | e |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |


| Row | a | b | c | d | e |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 100 | 100 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |
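The same distinction can be checked directly with `===`, without looking at the printed table. This is a small sketch with illustrative variable names.

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])

alias_prop = df.a      # getproperty: the stored column itself
alias_bang = df[!, :a] # `!` row selector: the stored column itself
copied = df[:, :a]     # `:` row selector: a freshly allocated copy

println(alias_prop === alias_bang) # true: both are the stored column
println(copied === alias_bang)     # false: the copy is a different object

# A write to the data frame is visible through the aliases only.
df[1, :a] = 100
println((alias_bang[1], copied[1]))
```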


### When iterating rows of a data frame use `eachrow` to avoid compilation cost (wide tables), but `Tables.namedtupleiterator` for fast execution (tall tables)

This table is wide:

In [22]:
df1 = DataFrame([rand([1:2, 'a':'b', false:true, 1.0:2.0]) for i in 1:900], :auto)

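| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Bool | Bool | Int64 | Float64 | Bool | Int64 | Float64 | Char | Char |
| 1 | 1.0 | 1.0 | 0 | 0 | 1 | 1.0 | 0 | 1 | 1.0 | a | a |
| 2 | 2.0 | 2.0 | 1 | 1 | 2 | 2.0 | 1 | 2 | 2.0 | b | b |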


In [23]:
@time collect(eachrow(df1))

  0.052469 seconds (84.22 k allocations: 5.168 MiB, 99.92% compilation time)


2-element Vector{DataFrameRow}:
 DataFrameRow
 Row │ x1       x2       x3     x4     x5     x6       x7     x8     x9        ⋯
     │ Float64  Float64  Bool   Bool   Int64  Float64  Bool   Int64  Float64   ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │     1.0      1.0  false  false      1      1.0  false      1      1.0   ⋯
                                                             891 columns omitted
 DataFrameRow
 Row │ x1       x2       x3    x4    x5     x6       x7    x8     x9       x10 ⋯
     │ Float64  Float64  Bool  Bool  Int64  Float64  Bool  Int64  Floa

In [24]:
@time collect(Tables.namedtupleiterator(df1));

  6.964989 seconds (2.32 M allocations: 147.241 MiB, 0.51% gc time, 99.78% compilation time)


As you can see, `Tables.namedtupleiterator` takes very long to compile in this case (about 7 seconds, versus about 0.05 seconds for `eachrow`), and it would get much worse if the table were wider. That is why this tip is included in the pitfalls notebook.

The table below is tall:

In [25]:
df2 = DataFrame(rand(10^6, 10), :auto)

| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.620851 | 0.75956 | 0.086471 | 0.141431 | 0.258899 | 0.413615 | 0.934604 | 0.39581 |
| 2 | 0.105589 | 0.209177 | 0.685343 | 0.0660393 | 0.396761 | 0.913346 | 0.0248727 | 0.133875 |
| 3 | 0.983523 | 0.712555 | 0.0992709 | 0.759441 | 0.440898 | 0.10246 | 0.990721 | 0.904335 |
| 4 | 0.388133 | 0.102261 | 0.316264 | 0.900712 | 0.977021 | 0.968912 | 0.699624 | 0.163675 |
| 5 | 0.399831 | 0.967829 | 0.12189 | 0.722027 | 0.162325 | 0.209936 | 0.847496 | 0.0758826 |
| 6 | 0.405094 | 0.812244 | 0.784731 | 0.421858 | 0.800331 | 0.441914 | 0.315882 | 0.482087 |
| 7 | 0.435393 | 0.0532526 | 0.294183 | 0.996928 | 0.336332 | 0.307938 | 0.163823 | 0.542819 |
| 8 | 0.97882 | 0.686855 | 0.40084 | 0.914398 | 0.325452 | 0.343381 | 0.527434 | 0.865557 |
| 9 | 0.871545 | 0.231649 | 0.690776 | 0.169682 | 0.235853 | 0.747655 | 0.61211 | 0.769123 |
| 10 | 0.66691 | 0.609028 | 0.444258 | 0.611758 | 0.799476 | 0.992992 | 0.269694 | 0.169853 |


In [26]:
@time map(sum, eachrow(df2))

  2.381582 seconds (60.18 M allocations: 1.061 GiB, 9.77% gc time, 5.24% compilation time)


1000000-element Vector{Float64}:
 5.495798903655533
 3.8660165001175106
 6.862903729428506
 4.886936643625564
 4.225000638039363
 4.903868354093585
 3.8553432291243332
 6.0265198403912015
 5.386701419889809
 6.2912782382105865
 6.276551945400184
 5.60523928826555
 5.858276676796322
 ⋮
 4.447862691920903
 5.535517660729703
 5.814354029712671
 6.224623878720433
 4.494643103789899
 5.826518591421258
 4.64247460526621
 7.1289860081045235
 5.985524441433525
 5.580873911160428
 6.019561584680654
 4.878133442850752

In [27]:
@time map(sum, eachrow(df2))

  2.048501 seconds (59.99 M allocations: 1.050 GiB, 2.75% gc time)


1000000-element Vector{Float64}:
 5.495798903655533
 3.8660165001175106
 6.862903729428506
 4.886936643625564
 4.225000638039363
 4.903868354093585
 3.8553432291243332
 6.0265198403912015
 5.386701419889809
 6.2912782382105865
 6.276551945400184
 5.60523928826555
 5.858276676796322
 ⋮
 4.447862691920903
 5.535517660729703
 5.814354029712671
 6.224623878720433
 4.494643103789899
 5.826518591421258
 4.64247460526621
 7.1289860081045235
 5.985524441433525
 5.580873911160428
 6.019561584680654
 4.878133442850752

In [28]:
@time map(sum, Tables.namedtupleiterator(df2))

  0.230335 seconds (512.70 k allocations: 38.737 MiB, 3.45% gc time, 94.18% compilation time)


1000000-element Vector{Float64}:
 5.495798903655533
 3.8660165001175106
 6.862903729428506
 4.886936643625564
 4.225000638039363
 4.903868354093585
 3.8553432291243332
 6.0265198403912015
 5.386701419889809
 6.2912782382105865
 6.276551945400184
 5.60523928826555
 5.858276676796322
 ⋮
 4.447862691920903
 5.535517660729703
 5.814354029712671
 6.224623878720433
 4.494643103789899
 5.826518591421258
 4.64247460526621
 7.1289860081045235
 5.985524441433525
 5.580873911160428
 6.019561584680654
 4.878133442850752

In [29]:
@time map(sum, Tables.namedtupleiterator(df2))

  0.014752 seconds (17 allocations: 7.631 MiB)


1000000-element Vector{Float64}:
 5.495798903655533
 3.8660165001175106
 6.862903729428506
 4.886936643625564
 4.225000638039363
 4.903868354093585
 3.8553432291243332
 6.0265198403912015
 5.386701419889809
 6.2912782382105865
 6.276551945400184
 5.60523928826555
 5.858276676796322
 ⋮
 4.447862691920903
 5.535517660729703
 5.814354029712671
 6.224623878720433
 4.494643103789899
 5.826518591421258
 4.64247460526621
 7.1289860081045235
 5.985524441433525
 5.580873911160428
 6.019561584680654
 4.878133442850752

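As you can see, this time it is much faster to iterate a type-stable container: once compiled, `Tables.namedtupleiterator` takes about 0.015 seconds, versus about 2 seconds for `eachrow`.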
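The difference comes from what each iterator yields. The sketch below, on a tiny illustrative data frame, shows that `Tables.namedtupleiterator` produces `NamedTuple`s whose field types are part of the type, while `eachrow` produces `DataFrameRow` objects; both give the same row sums.

```julia
using DataFrames # Tables is available after this, as in the cells above

# Tiny illustrative table; the column names are assumptions for this sketch.
df = DataFrame(a=[1, 2, 3], b=[1.5, 2.5, 3.5])

first_nt = first(Tables.namedtupleiterator(df))
first_row = first(eachrow(df))

println(typeof(first_nt))          # column names and element types are in the type
println(first_row isa DataFrameRow)

# Both iterators produce the same per-row sums.
sums_nt = map(sum, Tables.namedtupleiterator(df))
sums_row = map(sum, eachrow(df))
println(sums_nt == sums_row)
```

Because the column names and types are encoded in the `NamedTuple` type, the compiler has to specialize on them. That is cheap for 10 columns but expensive for 900, which matches the timings above.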