# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), February 13, 2023**

In [1]:
using DataFrames

## Possible pitfalls

### Know what is copied when creating a `DataFrame`

In [2]:
x = DataFrame(rand(3, 5), :auto)

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.182172,0.0214047,0.298923,0.431279,0.872178
2,0.398011,0.0102083,0.321484,0.282343,0.463992
3,0.15172,0.591452,0.273743,0.636897,0.589512


In [3]:
y = copy(x)
x === y # not the same object

false

In [4]:
y = DataFrame(x)
x === y

false

In [5]:
any(x[!, i] === y[!, i] for i in 1:ncol(x)) # none of the columns are the same objects either

false

In [6]:
y = DataFrame(x, copycols=false)
x === y

false

In [7]:
all(x[!, i] === y[!, i] for i in 1:ncol(x)) # all the columns are the same objects

true

In [8]:
x = 1:3; y = [1, 2, 3]; df = DataFrame(x=x, y=y) # columns are also copied when creating data frames using keyword-argument syntax

Row,x,y
,Int64,Int64
1,1,1
2,2,2
3,3,3


In [9]:
y === df.y # different object

false

In [10]:
typeof(x), typeof(df.x) # range is converted to a vector

(UnitRange{Int64}, Vector{Int64})

In [11]:
y === df[:, :y] # selecting rows with `:` always creates a copy

false

You can avoid copying by passing the `copycols=false` keyword argument to functions that support it.

In [12]:
df = DataFrame(x=x,y=y, copycols=false)

Row,x,y
,Int64,Int64
1,1,1
2,2,2
3,3,3


In [13]:
y === df.y # now it is the same

true

In [14]:
select(df, :y)[!, 1] === y # not the same

false

In [15]:
select(df, :y, copycols=false)[!, 1] === y # the same

true
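
The practical consequence of `copycols=false` is that later changes to the source vector show up in the data frame. A minimal sketch of that effect, with illustrative variable names (`v`, `df_copy`, `df_alias`) that are not part of the notebook:

```julia
using DataFrames

v = [1, 2, 3]
df_copy = DataFrame(v=v)                   # default: the column is a copy of v
df_alias = DataFrame(v=v, copycols=false)  # the column is v itself

v[1] = 100                                 # mutate the source vector

df_copy.v[1]   # unchanged, because the column was copied
df_alias.v[1]  # reflects the change, because the column aliases v
```

Aliasing saves memory and time, but it means the data frame can change without being touched directly.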

### Do not modify the parent of `GroupedDataFrame` or `view`

In [16]:
x = DataFrame(id=repeat([1,2], outer=3), x=1:6)
g = groupby(x, :id)

Row,id,x
,Int64,Int64
1,1,1
2,1,3
3,1,5

Row,id,x
,Int64,Int64
1,2,2
2,2,4
3,2,6


In [17]:
x[1:3, 1] = [2,2,2]
g # the groups are wrong now: g is only a view of x and was not recomputed

Row,id,x
,Int64,Int64
1,2,1
2,2,3
3,1,5

Row,id,x
,Int64,Int64
1,2,2
2,2,4
3,2,6
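
Two possible ways around the stale grouping shown above are to group an independent copy, or to call `groupby` again after the parent has changed. A minimal sketch under those assumptions (the names `parent`, `g_snapshot` and `g_fresh` are illustrative):

```julia
using DataFrames

parent = DataFrame(id=repeat([1, 2], outer=3), x=1:6)

# Option 1: group a copy, so later changes to `parent` cannot affect the groups
g_snapshot = groupby(copy(parent), :id)

parent[1:3, :id] = [2, 2, 2]        # mutate the grouping column of the original

# Option 2: regroup after the mutation so the groups match the new ids
g_fresh = groupby(parent, :id)

nrow(g_snapshot[(id=1,)])  # 3 rows, as before the mutation
nrow(g_fresh[(id=2,)])     # 5 rows, matching the updated ids
```

Which option fits depends on whether the analysis should see the old or the new values.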


In [18]:
s = view(x, 5:6, :)

Row,id,x
,Int64,Int64
1,1,5
2,2,6


In [19]:
deleteat!(x, 3:6)

Row,id,x
,Int64,Int64
1,2,1
2,2,2


In [20]:
s # error: the view refers to rows that no longer exist in x

BoundsError: attempt to access 2-element Vector{Int64} at index [5:6]
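
If the selected rows must outlive changes to the parent, one option is to take a copy instead of a view. A minimal sketch, assuming a DataFrames.jl version that provides `deleteat!` for data frames; the data mirrors the example above:

```julia
using DataFrames

x = DataFrame(id=[2, 2, 2, 2, 1, 2], x=1:6)

s = x[5:6, :]       # indexing with a row range creates a new DataFrame, not a view
deleteat!(x, 3:6)   # shrinking the parent no longer affects s

s                   # still valid: it holds its own copy of rows 5 and 6
```

The cost is the extra allocation for the copied rows, which is the trade-off against the cheaper but fragile `view`.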

### Single column selection for `DataFrame` creates aliases with `!` and `getproperty` syntax and copies with `:`

In [21]:
x = DataFrame(a=1:3)
x.b = x[!, 1] # alias
x.c = x[:, 1] # copy
x.d = x[!, 1][:] # copy
x.e = copy(x[!, 1]) # explicit copy
display(x)
x[1,1] = 100
display(x)

Row,a,b,c,d,e
,Int64,Int64,Int64,Int64,Int64
1,1,1,1,1,1
2,2,2,2,2,2
3,3,3,3,3,3


Row,a,b,c,d,e
,Int64,Int64,Int64,Int64,Int64
1,100,100,1,1,1
2,2,2,2,2,2
3,3,3,3,3,3
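
When it is unclear whether two columns share memory, an identity check with `===` can answer the question before any value is modified. A minimal sketch with illustrative column names:

```julia
using DataFrames

df = DataFrame(a=1:3)

df.a === df[!, :a]   # true: both forms return the stored column itself
df.a === df[:, :a]   # false: `:` returns a fresh copy

df.b = df.a          # alias: column b is the same vector as column a
df.c = copy(df.a)    # independent copy

df.a[1] = 100        # changes a and b, leaves c untouched
```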


### When iterating rows of a data frame use `eachrow` to avoid compilation cost (wide tables), but `Tables.namedtupleiterator` for fast execution (tall tables)

This table is wide:

In [22]:
df1 = DataFrame([rand([1:2, 'a':'b', false:true, 1.0:2.0]) for i in 1:900], :auto)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,x37,x38,x39,x40,x41,x42,x43,x44,x45,x46,x47,x48,x49,x50,x51,x52,x53,x54,x55,x56,x57,x58,x59,x60,x61,x62,x63,x64,x65,x66,x67,x68,x69,x70,x71,x72,x73,x74,x75,x76,x77,x78,x79,x80,x81,x82,x83,x84,x85,x86,x87,x88,x89,x90,x91,x92,x93,x94,x95,x96,x97,x98,x99,x100,⋯
,Float64,Int64,Char,Char,Bool,Float64,Int64,Int64,Int64,Int64,Bool,Char,Bool,Bool,Int64,Float64,Int64,Float64,Char,Float64,Float64,Bool,Int64,Float64,Bool,Bool,Char,Int64,Bool,Bool,Int64,Int64,Bool,Bool,Float64,Bool,Float64,Char,Char,Char,Int64,Int64,Char,Float64,Bool,Float64,Int64,Float64,Bool,Int64,Bool,Bool,Float64,Char,Float64,Bool,Bool,Float64,Int64,Int64,Float64,Float64,Bool,Float64,Int64,Float64,Char,Bool,Int64,Int64,Int64,Bool,Char,Bool,Char,Float64,Bool,Float64,Int64,Bool,Char,Char,Float64,Char,Int64,Char,Bool,Bool,Bool,Char,Char,Float64,Bool,Int64,Char,Int64,Int64,Bool,Int64,Float64,⋯
1,1.0,1,a,a,false,1.0,1,1,1,1,false,a,false,false,1,1.0,1,1.0,a,1.0,1.0,false,1,1.0,false,false,a,1,false,false,1,1,false,false,1.0,false,1.0,a,a,a,1,1,a,1.0,false,1.0,1,1.0,false,1,false,false,1.0,a,1.0,false,false,1.0,1,1,1.0,1.0,false,1.0,1,1.0,a,false,1,1,1,false,a,false,a,1.0,false,1.0,1,false,a,a,1.0,a,1,a,false,false,false,a,a,1.0,false,1,a,1,1,false,1,1.0,⋯
2,2.0,2,b,b,true,2.0,2,2,2,2,true,b,true,true,2,2.0,2,2.0,b,2.0,2.0,true,2,2.0,true,true,b,2,true,true,2,2,true,true,2.0,true,2.0,b,b,b,2,2,b,2.0,true,2.0,2,2.0,true,2,true,true,2.0,b,2.0,true,true,2.0,2,2,2.0,2.0,true,2.0,2,2.0,b,true,2,2,2,true,b,true,b,2.0,true,2.0,2,true,b,b,2.0,b,2,b,true,true,true,b,b,2.0,true,2,b,2,2,true,2,2.0,⋯


In [23]:
@time collect(eachrow(df1))

  0.039638 seconds (66.81 k allocations: 4.511 MiB, 99.90% compilation time)


2-element Vector{DataFrameRow}:
 DataFrameRow
 Row │ x1       x2     x3    x4    x5     x6       x7     x8     x9     x10    ⋯
     │ Float64  Int64  Char  Char  Bool   Float64  Int64  Int64  Int64  Int64  ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │     1.0      1  a     a     false      1.0      1      1      1      1  ⋯
                                                             890 columns omitted
 DataFrameRow
 Row │ x1       x2     x3    x4    x5    x6       x7     x8     x9     x10     ⋯
     │ Float64  Int64  Char  Char  Bool  Float64  Int64  Int64  ⋯

In [24]:
@time collect(Tables.namedtupleiterator(df1));

  7.774781 seconds (1.64 M allocations: 116.145 MiB, 0.30% gc time, 99.88% compilation time)


As you can see, the time to compile `Tables.namedtupleiterator` is very large in this case, and it would get much worse if the table were wider (that is why this tip is included in the pitfalls notebook).
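
If a loop over a wide table only needs a few columns, one way that may limit this compilation cost is to narrow the table before creating the typed iterator, so that the row type contains only those columns. A minimal sketch under that assumption; the table `wide`, the column choice, and the accumulated quantity are illustrative, and `Tables` is accessed the same way as in the cells above:

```julia
using DataFrames

# An illustrative wide table with 900 numeric columns and 2 rows
wide = DataFrame([rand(2) for _ in 1:900], :auto)

# Keep only the columns the loop needs; copycols=false avoids copying them
narrow = select(wide, [:x1, :x2, :x3], copycols=false)

# Each row is now a NamedTuple with 3 fields instead of 900
total = sum(row.x1 + row.x2 + row.x3 for row in Tables.namedtupleiterator(narrow))
```

Actual timings depend on the Julia and package versions, so it is worth checking with `@time` on the real data.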

The table below is tall:

In [25]:
df2 = DataFrame(rand(10^6, 10), :auto)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.424322,0.0387172,0.165538,0.375359,0.124302,0.0640515,0.12196,0.586493,0.604638,0.703147
2,0.15205,0.272963,0.700006,0.829304,0.0415525,0.943028,0.477391,0.920519,0.733945,0.38232
3,0.725465,0.580529,0.50084,0.987348,0.075047,0.558718,0.676676,0.947075,0.409558,0.66122
4,0.925171,0.318553,0.812868,0.401339,0.604567,0.269032,0.916789,0.900742,0.147204,0.398988
5,0.246163,0.226364,0.31503,0.148128,0.369622,0.150685,0.102971,0.230194,0.182942,0.709785
6,0.0492007,0.794896,0.60956,0.816386,0.11619,0.801362,0.686194,0.301658,0.507337,0.135642
7,0.946628,0.0157797,0.568566,0.336721,0.414166,0.430364,0.842273,0.146307,0.964439,0.152112
8,0.821027,0.539961,0.951991,0.0557496,0.307656,0.00224519,0.948003,0.629882,0.076703,0.963577
9,0.215563,0.799691,0.430948,0.456424,0.750694,0.511513,0.102725,0.952334,0.443177,0.128398
10,0.351084,0.648419,0.922247,0.221091,0.321099,0.239239,0.367316,0.698181,0.137452,0.744402


In [26]:
@time map(sum, eachrow(df2))

  2.375425 seconds (60.11 M allocations: 1.058 GiB, 6.06% gc time, 4.08% compilation time)


1000000-element Vector{Float64}:
 3.2085280065644035
 5.453078720387793
 6.122475264709386
 5.695252453849797
 2.6818844243721185
 4.818425816634775
 4.817355199044771
 5.296794147868831
 4.791467327272049
 4.650529563996984
 3.6677680401851025
 4.705650893054499
 4.448097647798718
 ⋮
 4.026247702243423
 3.2464343021300115
 4.401648676752053
 4.5970393787236
 3.6877232629240018
 3.3405753484373895
 5.937165225771452
 3.521171557784275
 4.097246522668247
 5.602193794286321
 6.570343754548945
 3.8751540034469354

In [27]:
@time map(sum, eachrow(df2))

  2.304615 seconds (59.99 M allocations: 1.050 GiB, 6.11% gc time)


1000000-element Vector{Float64}:
 3.2085280065644035
 5.453078720387793
 6.122475264709386
 5.695252453849797
 2.6818844243721185
 4.818425816634775
 4.817355199044771
 5.296794147868831
 4.791467327272049
 4.650529563996984
 3.6677680401851025
 4.705650893054499
 4.448097647798718
 ⋮
 4.026247702243423
 3.2464343021300115
 4.401648676752053
 4.5970393787236
 3.6877232629240018
 3.3405753484373895
 5.937165225771452
 3.521171557784275
 4.097246522668247
 5.602193794286321
 6.570343754548945
 3.8751540034469354

In [28]:
@time map(sum, Tables.namedtupleiterator(df2))

  0.203013 seconds (270.07 k allocations: 25.450 MiB, 4.53% gc time, 93.26% compilation time)


1000000-element Vector{Float64}:
 3.2085280065644035
 5.453078720387793
 6.122475264709386
 5.695252453849797
 2.6818844243721185
 4.818425816634775
 4.817355199044771
 5.296794147868831
 4.791467327272049
 4.650529563996984
 3.6677680401851025
 4.705650893054499
 4.448097647798718
 ⋮
 4.026247702243423
 3.2464343021300115
 4.401648676752053
 4.5970393787236
 3.6877232629240018
 3.3405753484373895
 5.937165225771452
 3.521171557784275
 4.097246522668247
 5.602193794286321
 6.570343754548945
 3.8751540034469354

In [29]:
@time map(sum, Tables.namedtupleiterator(df2))

  0.011576 seconds (20 allocations: 7.631 MiB)


1000000-element Vector{Float64}:
 3.2085280065644035
 5.453078720387793
 6.122475264709386
 5.695252453849797
 2.6818844243721185
 4.818425816634775
 4.817355199044771
 5.296794147868831
 4.791467327272049
 4.650529563996984
 3.6677680401851025
 4.705650893054499
 4.448097647798718
 ⋮
 4.026247702243423
 3.2464343021300115
 4.401648676752053
 4.5970393787236
 3.6877232629240018
 3.3405753484373895
 5.937165225771452
 3.521171557784275
 4.097246522668247
 5.602193794286321
 6.570343754548945
 3.8751540034469354

As you can see, this time it is much faster to iterate a type-stable container.

Still, you might want to use the `select` syntax, which is optimized for such reductions:

In [30]:
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum # this includes compilation time

  0.045867 seconds (14.20 k allocations: 8.574 MiB, 78.01% compilation time)


1000000-element Vector{Float64}:
 3.2085280065644035
 5.453078720387793
 6.122475264709386
 5.695252453849797
 2.6818844243721185
 4.818425816634775
 4.817355199044771
 5.296794147868831
 4.791467327272049
 4.650529563996984
 3.6677680401851025
 4.705650893054499
 4.448097647798718
 ⋮
 4.026247702243423
 3.2464343021300115
 4.401648676752053
 4.5970393787236
 3.6877232629240018
 3.3405753484373895
 5.937165225771452
 3.521171557784275
 4.097246522668247
 5.602193794286321
 6.570343754548945
 3.8751540034469354

In [31]:
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum

  0.009749 seconds (163 allocations: 7.638 MiB)


1000000-element Vector{Float64}:
 3.2085280065644035
 5.453078720387793
 6.122475264709386
 5.695252453849797
 2.6818844243721185
 4.818425816634775
 4.817355199044771
 5.296794147868831
 4.791467327272049
 4.650529563996984
 3.6677680401851025
 4.705650893054499
 4.448097647798718
 ⋮
 4.026247702243423
 3.2464343021300115
 4.401648676752053
 4.5970393787236
 3.6877232629240018
 3.3405753484373895
 5.937165225771452
 3.521171557784275
 4.097246522668247
 5.602193794286321
 6.570343754548945
 3.8751540034469354
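
For this particular reduction, a column-wise formulation is another option that avoids row iteration altogether. It is not part of the notebook above; the sketch below assumes every column is numeric and uses a smaller illustrative table:

```julia
using DataFrames

df = DataFrame(rand(1000, 10), :auto)

# Add the column vectors together; element i of the result is the sum of row i
row_sums = reduce(+, eachcol(df))

# The same reduction written with the select syntax, for comparison
row_sums_select = select(df, AsTable(:) => ByRow(sum) => "sum").sum
```

The two results can differ in the last digits, because the floating-point additions happen in a different order.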