# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), May 15, 2021**

In [1]:
using DataFrames

## Possible pitfalls

### Know what is copied when creating a `DataFrame`

In [2]:
x = DataFrame(rand(3, 5), :auto)

| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.976481 | 0.275304 | 0.0564721 | 0.0683025 | 0.286872 |
| 2 | 0.725549 | 0.475192 | 0.117808 | 0.838682 | 0.411452 |
| 3 | 0.862387 | 0.120068 | 0.0747617 | 0.107699 | 0.407875 |


In [3]:
y = copy(x)
x === y # not the same object

false

In [4]:
y = DataFrame(x)
x === y

false

In [5]:
any(x[!, i] === y[!, i] for i in 1:ncol(x)) # the columns are also not the same

false

In [6]:
y = DataFrame(x, copycols=false)
x === y

false

In [7]:
all(x[!, i] === y[!, i] for i in 1:ncol(x)) # the columns are the same

true

In [8]:
x = 1:3; y = [1, 2, 3]; df = DataFrame(x=x,y=y) # the same when creating data frames using kwarg syntax

| Row | x | y |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |


In [9]:
y === df.y # different object

false

In [10]:
typeof(x), typeof(df.x) # range is converted to a vector

(UnitRange{Int64}, Vector{Int64})

In [11]:
y === df[:, :y] # slicing rows always creates a copy

false

You can avoid copying by passing the `copycols=false` keyword argument to functions that accept it.

In [12]:
df = DataFrame(x=x,y=y, copycols=false)

| Row | x | y |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |


In [13]:
y === df.y # now it is the same

true

In [14]:
select(df, :y)[!, 1] === y # not the same

false

In [15]:
select(df, :y, copycols=false)[!, 1] === y # the same

true
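
The practical consequence of sharing columns is that in-place changes made through one data frame become visible through the other. The following is a minimal sketch of that effect; the names `y_copy` and `y_alias` are only illustrative and do not appear in the cells above.

```julia
using DataFrames

x = DataFrame(a=[1, 2, 3])

y_copy = DataFrame(x)                   # default: columns are copied
y_alias = DataFrame(x, copycols=false)  # columns are shared with x

x.a[1] = 100                            # mutate the column vector of x in place

y_copy.a[1]   # unaffected by the change
y_alias.a[1]  # reflects the change, because it is the same vector
```

Use `copycols=false` when you want to save memory and are sure that neither object will be mutated unexpectedly.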

### Do not modify the parent of `GroupedDataFrame` or `view`

In [16]:
x = DataFrame(id=repeat([1,2], outer=3), x=1:6)
g = groupby(x, :id)

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 1 |
| 2 | 1 | 3 |
| 3 | 1 | 5 |

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 2 | 2 |
| 2 | 2 | 4 |
| 3 | 2 | 6 |


In [17]:
x[1:3, 1] = [2,2,2]
g # g is now wrong: it is only a view of x and is not updated when x changes

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 2 | 1 |
| 2 | 2 | 3 |
| 3 | 1 | 5 |

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 2 | 2 |
| 2 | 2 | 4 |
| 3 | 2 | 6 |
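
Two possible ways to avoid a stale `GroupedDataFrame` are sketched here: group a private copy of the data, or regroup after the parent has changed. The names `g_safe` and `g_new` are illustrative assumptions, and which option fits depends on whether you want the old or the new grouping.

```julia
using DataFrames

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)

g_safe = groupby(copy(x), :id)  # groups an independent copy of x
x[1:3, 1] = [2, 2, 2]           # modify the original afterwards

g_new = groupby(x, :id)         # regroup so the groups match the modified ids

nrow(g_safe[(id=1,)])  # group built from the copy still has its original rows
nrow(g_new[(id=2,)])   # group built after the change uses the new ids
```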


In [18]:
s = view(x, 5:6, :)

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 5 |
| 2 | 2 | 6 |


In [19]:
delete!(x, 3:6)

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 2 | 1 |
| 2 | 2 | 2 |


In [20]:
s # error

BoundsError: BoundsError: attempt to access 2-element Vector{Int64} at index [5:6]
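
If the selected rows must survive changes to the parent, one option is to materialize the view with `copy` before deleting rows. This sketch assumes a DataFrames.jl version that provides `deleteat!` for data frames (1.3 or later); on the older version used in this notebook, `delete!` plays the same role.

```julia
using DataFrames

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)

s = copy(view(x, 5:6, :))  # copy turns the view into an independent DataFrame
deleteat!(x, 3:6)          # shrinking the parent no longer affects s

s  # still holds the two rows selected before the deletion
```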

### Single column selection for `DataFrame` creates aliases with `!` and `getproperty` syntax and copies with `:`

In [21]:
x = DataFrame(a=1:3)
x.b = x[!, 1] # alias
x.c = x[:, 1] # copy
x.d = x[!, 1][:] # copy
x.e = copy(x[!, 1]) # explicit copy
display(x)
x[1,1] = 100
display(x)

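| Row | a | b | c | d | e |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |


| Row | a | b | c | d | e |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 100 | 100 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |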
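
Aliasing can also be detected directly with `===` instead of by mutating a value and inspecting the result. The helper `aliased_pairs` below is hypothetical (it is not part of DataFrames.jl) and simply lists pairs of columns that are the same vector object.

```julia
using DataFrames

# Hypothetical helper: returns the names of column pairs that share one vector.
aliased_pairs(df) = [(names(df)[i], names(df)[j])
                     for i in 1:ncol(df) for j in (i + 1):ncol(df)
                     if df[!, i] === df[!, j]]

x = DataFrame(a=1:3)
x.b = x[!, 1]  # alias
x.c = x[:, 1]  # copy

aliased_pairs(x)  # only the pair (a, b) shares storage
```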


### When iterating rows of a data frame use `eachrow` to avoid compilation cost (wide tables), but `Tables.namedtupleiterator` for fast execution (tall tables)

This table is wide:

In [22]:
df1 = DataFrame([rand([1:2, 'a':'b', false:true, 1.0:2.0]) for i in 1:900], :auto)

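| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Float64 | Bool | Char | Bool | Char | Char | Float64 | Char | Int64 | Int64 | Char | Char |
| 1 | 1.0 | 0 | a | 0 | a | a | 1.0 | a | 1 | 1 | a | a |
| 2 | 2.0 | 1 | b | 1 | b | b | 2.0 | b | 2 | 2 | b | b |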


In [23]:
@time collect(eachrow(df1))

  0.089423 seconds (84.22 k allocations: 5.169 MiB, 15.41% gc time, 99.86% compilation time)


2-element Vector{DataFrameRow}:
 DataFrameRow
 Row │ x1       x2     x3    x4     x5    x6    x7       x8    x9     x10    x ⋯
     │ Float64  Bool   Char  Bool   Char  Char  Float64  Char  Int64  Int64  C ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │     1.0  false  a     false  a     a         1.0  a         1      1  a ⋯
                                                             890 columns omitted
 DataFrameRow
 Row │ x1       x2    x3    x4    x5    x6    x7       x8    x9     x10    x11 ⋯
     │ Float64  Bool  Char  Bool  Char  Char  Float64

In [24]:
@time collect(Tables.namedtupleiterator(df1));

  8.521188 seconds (2.16 M allocations: 137.889 MiB, 0.48% gc time, 99.84% compilation time)


As you can see, compiling `Tables.namedtupleiterator` takes very long in this case (about 8.5 seconds versus about 0.09 seconds for `eachrow`), and it would get much worse if the table were wider. That is why this tip is included in the pitfalls notebook.

The table below is tall:

In [25]:
df2 = DataFrame(rand(10^6, 10), :auto)

| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.0884655 | 0.0510019 | 0.418001 | 0.275263 | 0.196886 | 0.577606 | 0.632162 | 0.365267 |
| 2 | 0.0238469 | 0.858567 | 0.873395 | 0.0980489 | 0.783281 | 0.992697 | 0.998073 | 0.154986 |
| 3 | 0.0993084 | 0.173046 | 0.461451 | 0.776819 | 0.688546 | 0.45026 | 0.575941 | 0.379185 |
| 4 | 0.192213 | 0.868402 | 0.810197 | 0.825681 | 0.107654 | 0.652194 | 0.995758 | 0.992064 |
| 5 | 0.484915 | 0.781096 | 0.188717 | 0.0551245 | 0.582094 | 0.10756 | 0.69763 | 0.440782 |
| 6 | 0.647591 | 0.380823 | 0.708158 | 0.688091 | 0.108107 | 0.470009 | 0.316223 | 0.440454 |
| 7 | 0.669742 | 0.790828 | 0.479396 | 0.494966 | 0.807786 | 0.761803 | 0.104153 | 0.643289 |
| 8 | 0.922929 | 0.645575 | 0.357334 | 0.866658 | 0.231557 | 0.952013 | 0.367215 | 0.0662225 |
| 9 | 0.495266 | 0.284814 | 0.0686187 | 0.158368 | 0.895262 | 0.13835 | 0.866163 | 0.558396 |
| 10 | 0.437317 | 0.161811 | 0.708778 | 0.163434 | 0.810803 | 0.0457139 | 0.339694 | 0.319919 |


In [26]:
@time map(sum, eachrow(df2))

  2.174921 seconds (60.18 M allocations: 1.062 GiB, 12.13% gc time, 6.81% compilation time)


1000000-element Vector{Float64}:
 3.087291028381489
 5.345627504830888
 5.235889869866188
 7.327630528555932
 4.064692319349455
 4.778736743160573
 5.772862362490546
 5.228269781093092
 4.816404857302385
 3.8098906622639586
 4.99862788869052
 4.917802521802978
 5.961152448858158
 ⋮
 6.0338386399380175
 2.454927895878434
 5.276538143330482
 5.357233468184416
 3.845470319321367
 6.355833709116523
 4.800912227041232
 5.151766316639874
 5.643323475970824
 4.8604648205245224
 3.534161598723104
 5.294559441519406

In [27]:
@time map(sum, eachrow(df2))

  2.005875 seconds (59.99 M allocations: 1.050 GiB, 4.06% gc time)


1000000-element Vector{Float64}:
 3.087291028381489
 5.345627504830888
 5.235889869866188
 7.327630528555932
 4.064692319349455
 4.778736743160573
 5.772862362490546
 5.228269781093092
 4.816404857302385
 3.8098906622639586
 4.99862788869052
 4.917802521802978
 5.961152448858158
 ⋮
 6.0338386399380175
 2.454927895878434
 5.276538143330482
 5.357233468184416
 3.845470319321367
 6.355833709116523
 4.800912227041232
 5.151766316639874
 5.643323475970824
 4.8604648205245224
 3.534161598723104
 5.294559441519406

In [28]:
@time map(sum, Tables.namedtupleiterator(df2))

  0.257138 seconds (508.38 k allocations: 38.440 MiB, 3.33% gc time, 90.47% compilation time)


1000000-element Vector{Float64}:
 3.087291028381489
 5.345627504830888
 5.235889869866188
 7.327630528555932
 4.064692319349455
 4.778736743160573
 5.772862362490546
 5.228269781093092
 4.816404857302385
 3.8098906622639586
 4.99862788869052
 4.917802521802978
 5.961152448858158
 ⋮
 6.0338386399380175
 2.454927895878434
 5.276538143330482
 5.357233468184416
 3.845470319321367
 6.355833709116523
 4.800912227041232
 5.151766316639874
 5.643323475970824
 4.8604648205245224
 3.534161598723104
 5.294559441519406

In [29]:
@time map(sum, Tables.namedtupleiterator(df2))

  0.015502 seconds (16 allocations: 7.631 MiB)


1000000-element Vector{Float64}:
 3.087291028381489
 5.345627504830888
 5.235889869866188
 7.327630528555932
 4.064692319349455
 4.778736743160573
 5.772862362490546
 5.228269781093092
 4.816404857302385
 3.8098906622639586
 4.99862788869052
 4.917802521802978
 5.961152448858158
 ⋮
 6.0338386399380175
 2.454927895878434
 5.276538143330482
 5.357233468184416
 3.845470319321367
 6.355833709116523
 4.800912227041232
 5.151766316639874
 5.643323475970824
 4.8604648205245224
 3.534161598723104
 5.294559441519406

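As you can see, this time it is much faster to iterate a type-stable container: once compiled, `Tables.namedtupleiterator` takes about 0.016 seconds, versus about 2 seconds for `eachrow`.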
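
The rule of thumb from this section can be wrapped in a small helper that picks the iterator from the shape of the table. This is only a sketch: `rowiterator` is a hypothetical function, and the 100-column threshold is an arbitrary assumption that you would need to tune for your own data. As in the cells above, `Tables` is used through `using DataFrames` alone.

```julia
using DataFrames

# Hypothetical helper: the 100-column threshold is an arbitrary assumption.
# Wide tables get eachrow (cheap to compile), narrower ones get the
# type-stable NamedTuple iterator (fast to run).
rowiterator(df; max_cols=100) =
    ncol(df) > max_cols ? eachrow(df) : Tables.namedtupleiterator(df)

tall = DataFrame(rand(1000, 3), :auto)
rowsums = map(sum, rowiterator(tall))  # iterates NamedTuples for this table

rowsums ≈ map(sum, eachrow(tall))      # both iterators give the same sums
```

Note that the helper itself is not type stable, because its return type depends on the number of columns, so pass its result to a separate function that does the actual work.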