# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), October 5, 2022**

In [1]:
using DataFrames

## Possible pitfalls

### Know what is copied when creating a `DataFrame`

In [2]:
x = DataFrame(rand(3, 5), :auto)

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.201581,0.808985,0.102587,0.965488,0.743211
2,0.00583326,0.302103,0.58221,0.458408,0.180463
3,0.159787,0.613589,0.612709,0.390491,0.197921


In [3]:
y = copy(x)
x === y # not the same object

false

In [4]:
y = DataFrame(x)
x === y

false

In [5]:
any(x[!, i] === y[!, i] for i in 1:ncol(x)) # the columns are also not the same

false

In [6]:
y = DataFrame(x, copycols=false)
x === y

false

In [7]:
all(x[!, i] === y[!, i] for i in 1:ncol(x)) # the columns are the same

true

In [8]:
x = 1:3; y = [1, 2, 3]; df = DataFrame(x=x,y=y) # the same when creating data frames using kwarg syntax

Row,x,y
,Int64,Int64
1,1,1
2,2,2
3,3,3


In [9]:
y === df.y # different object

false

In [10]:
typeof(x), typeof(df.x) # range is converted to a vector

(UnitRange{Int64}, Vector{Int64})

In [11]:
y === df[:, :y] # slicing rows always creates a copy

false

You can avoid copying by passing the `copycols=false` keyword argument to functions that support it.

In [12]:
df = DataFrame(x=x,y=y, copycols=false)

Row,x,y
,Int64,Int64
1,1,1
2,2,2
3,3,3


In [13]:
y === df.y # now it is the same

true

In [14]:
select(df, :y)[!, 1] === y # not the same

false

In [15]:
select(df, :y, copycols=false)[!, 1] === y # the same

true
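
The practical consequence of the checks above can be sketched as follows: when a column is aliased rather than copied, an in-place change to the source vector is visible through the data frame. This is a minimal illustration; the names `v`, `df_copy`, and `df_alias` are made up for the example.

```julia
using DataFrames

v = [1, 2, 3]
df_copy = DataFrame(v=v)                    # default: the column is a copy of `v`
df_alias = DataFrame(v=v, copycols=false)   # the column is the very same vector as `v`

v[1] = 100          # mutate the source vector in place

df_copy.v[1]        # unchanged, because the copy is independent of `v`
df_alias.v[1]       # reflects the change, because the column aliases `v`
df_alias.v === v    # identity check, as in the cells above
```

Aliasing saves memory and time, but it means the data frame and the source vector have to be treated as one object from then on.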

### Do not modify the parent of `GroupedDataFrame` or `view`

In [16]:
x = DataFrame(id=repeat([1,2], outer=3), x=1:6)
g = groupby(x, :id)

Row,id,x
,Int64,Int64
1,1,1
2,1,3
3,1,5

Row,id,x
,Int64,Int64
1,2,2
2,2,4
3,2,6


In [17]:
x[1:3, 1] = [2,2,2]
g # well - it is wrong now, g is only a view

Row,id,x
,Int64,Int64
1,2,1
2,2,3
3,1,5

Row,id,x
,Int64,Int64
1,2,2
2,2,4
3,2,6


In [18]:
s = view(x, 5:6, :)

Row,id,x
,Int64,Int64
1,1,5
2,2,6


In [19]:
delete!(x, 3:6)

Row,id,x
,Int64,Int64
1,2,1
2,2,2


In [20]:
s # error

BoundsError: BoundsError: attempt to access 2-element Vector{Int64} at index [5:6]
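
One way to stay clear of both problems is sketched below: take a copy instead of a view when the parent is going to change, and create the `GroupedDataFrame` only after all in-place changes are done. This is an illustrative pattern rather than the only option. It uses `deleteat!` in place of `delete!`, on the assumption that a recent DataFrames.jl release is installed.

```julia
using DataFrames

x = DataFrame(id=repeat([1, 2], outer=3), x=1:6)

# Slicing rows with `:` copies, so `s` is an independent DataFrame, not a view.
s = x[5:6, :]

# Finish every in-place change to the parent first...
x[1:3, :id] = [2, 2, 2]
deleteat!(x, 3:6)   # assumption: `deleteat!` is available (recent DataFrames.jl)

# ...and only then create the GroupedDataFrame.
g = groupby(x, :id)

s           # still holds the two rows copied earlier
length(g)   # number of groups in the updated parent
```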

### Single column selection for `DataFrame` creates aliases with `!` and `getproperty` syntax and copies with `:`

In [21]:
x = DataFrame(a=1:3)
x.b = x[!, 1] # alias
x.c = x[:, 1] # copy
x.d = x[!, 1][:] # copy
x.e = copy(x[!, 1]) # explicit copy
display(x)
x[1,1] = 100
display(x)

Row,a,b,c,d,e
,Int64,Int64,Int64,Int64,Int64
1,1,1,1,1,1
2,2,2,2,2,2
3,3,3,3,3,3


Row,a,b,c,d,e
,Int64,Int64,Int64,Int64,Int64
1,100,100,1,1,1
2,2,2,2,2,2
3,3,3,3,3,3
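
Instead of inspecting the printed values, the aliasing can also be checked directly with `===`. This short sketch repeats part of the setup above and compares column identities.

```julia
using DataFrames

x = DataFrame(a=1:3)
x.b = x[!, 1]   # alias: stores the same vector under a second name
x.c = x[:, 1]   # copy: stores a freshly allocated vector

x.a === x.b     # the two names refer to one vector
x.a === x.c     # the copy is a different object with equal contents

x[1, :a] = 100  # writing through `a`...
x.b[1]          # ...is visible through `b`, but not through `c`
```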


### When iterating rows of a data frame use `eachrow` to avoid compilation cost (wide tables), but `Tables.namedtupleiterator` for fast execution (tall tables)

This table is wide:

In [22]:
df1 = DataFrame([rand([1:2, 'a':'b', false:true, 1.0:2.0]) for i in 1:900], :auto)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,x37,x38,x39,x40,x41,x42,x43,x44,x45,x46,x47,x48,x49,x50,x51,x52,x53,x54,x55,x56,x57,x58,x59,x60,x61,x62,x63,x64,x65,x66,x67,x68,x69,x70,x71,x72,x73,x74,x75,x76,x77,x78,x79,x80,x81,x82,x83,x84,x85,x86,x87,x88,x89,x90,x91,x92,x93,x94,x95,x96,x97,x98,x99,x100,⋯
,Int64,Bool,Char,Bool,Int64,Float64,Bool,Float64,Float64,Char,Float64,Bool,Int64,Bool,Bool,Float64,Int64,Char,Char,Int64,Bool,Char,Bool,Char,Int64,Int64,Float64,Int64,Char,Bool,Char,Char,Bool,Float64,Bool,Float64,Bool,Bool,Int64,Char,Bool,Char,Char,Bool,Int64,Int64,Float64,Bool,Int64,Bool,Float64,Bool,Float64,Int64,Char,Char,Int64,Bool,Float64,Char,Char,Int64,Int64,Char,Char,Bool,Int64,Int64,Bool,Char,Float64,Float64,Char,Float64,Float64,Char,Float64,Bool,Char,Bool,Bool,Char,Int64,Char,Char,Int64,Char,Char,Float64,Int64,Bool,Bool,Float64,Bool,Int64,Int64,Float64,Float64,Int64,Bool,⋯
1,1,False,a,False,1,1.0,False,1.0,1.0,a,1.0,False,1,False,False,1.0,1,a,a,1,False,a,False,a,1,1,1.0,1,a,False,a,a,False,1.0,False,1.0,False,False,1,a,False,a,a,False,1,1,1.0,False,1,False,1.0,False,1.0,1,a,a,1,False,1.0,a,a,1,1,a,a,False,1,1,False,a,1.0,1.0,a,1.0,1.0,a,1.0,False,a,False,False,a,1,a,a,1,a,a,1.0,1,False,False,1.0,False,1,1,1.0,1.0,1,False,⋯
2,2,True,b,True,2,2.0,True,2.0,2.0,b,2.0,True,2,True,True,2.0,2,b,b,2,True,b,True,b,2,2,2.0,2,b,True,b,b,True,2.0,True,2.0,True,True,2,b,True,b,b,True,2,2,2.0,True,2,True,2.0,True,2.0,2,b,b,2,True,2.0,b,b,2,2,b,b,True,2,2,True,b,2.0,2.0,b,2.0,2.0,b,2.0,True,b,True,True,b,2,b,b,2,b,b,2.0,2,True,True,2.0,True,2,2,2.0,2.0,2,True,⋯


In [23]:
@time collect(eachrow(df1))

  0.043300 seconds (110.77 k allocations: 5.894 MiB, 99.90% compilation time)


2-element Vector{DataFrameRow}:
 [1mDataFrameRow[0m
[1m Row [0m│[1m x1    [0m[1m x2    [0m[1m x3   [0m[1m x4    [0m[1m x5    [0m[1m x6      [0m[1m x7    [0m[1m x8      [0m[1m x9      [0m[1m x10[0m ⋯
     │[90m Int64 [0m[90m Bool  [0m[90m Char [0m[90m Bool  [0m[90m Int64 [0m[90m Float64 [0m[90m Bool  [0m[90m Float64 [0m[90m Float64 [0m[90m Cha[0m ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │     1  false  a     false      1      1.0  false      1.0      1.0  a   ⋯
[36m                                                             891 columns omitted[0m
 [1mDataFrameRow[0m
[1m Row [0m│[1m x1    [0m[1m x2   [0m[1m x3   [0m[1m x4   [0m[1m x5    [0m[1m x6      [0m[1m x7   [0m[1m x8      [0m[1m x9      [0m[1m x10  [0m[1m [0m ⋯
     │[90m Int64 [0m[90m Bool [0m[90m Char [0m[90m Bool [0m[90m Int64 [0m[90m Float64 [0m[90m Bool [0m[90m Float64 [0m[90m Float64 [0m[9

In [24]:
@time collect(Tables.namedtupleiterator(df1));

  6.315561 seconds (2.07 M allocations: 119.292 MiB, 0.50% gc time, 99.87% compilation time)


As you can see, the time to compile `Tables.namedtupleiterator` is very large in this case, and it would get much worse if the table were wider (that is why this tip is included in the pitfalls notebook).
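
The reason for the difference can be seen by looking at the row types on a small example. A row produced by `Tables.namedtupleiterator` is a `NamedTuple` whose type lists every column name and element type, so code working on it has to be compiled for that exact schema. A `DataFrameRow` does not carry this information in its type. The data frame `df` below is made up for illustration.

```julia
using DataFrames

df = DataFrame(a=[1, 2], b=['x', 'y'], c=[1.0, 2.0])

row_nt = first(Tables.namedtupleiterator(df))
typeof(row_nt)    # a NamedTuple type naming all three columns and their element types

row_dfr = first(eachrow(df))
typeof(row_dfr)   # a DataFrameRow type that does not encode column names or element types
```

With 900 columns the `NamedTuple` type has 900 names and 900 element types, which is what makes compilation slow for wide tables.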

The table below is tall:

In [25]:
df2 = DataFrame(rand(10^6, 10), :auto)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.622619,0.436343,0.728031,0.105606,0.999174,0.0449576,0.19143,0.393511,0.485616,0.979407
2,0.775199,0.4118,0.325693,0.733648,0.802903,0.000505339,0.53319,0.505315,0.894738,0.720311
3,0.323058,0.45963,0.191191,0.0162073,0.824213,0.120045,0.014544,0.504028,0.804441,0.412626
4,0.487134,0.459817,0.953669,0.866734,0.29691,0.0539078,0.673938,0.744832,0.424905,0.855512
5,0.764617,0.104164,0.305518,0.882695,0.933912,0.58931,0.70341,0.54892,0.0380409,0.316546
6,0.817391,0.0734525,0.887336,0.853012,0.0613441,0.787182,0.00444137,0.687151,0.0325217,0.822289
7,0.168903,0.890934,0.88228,0.806285,0.559195,0.699344,0.296276,0.892094,0.133931,0.844231
8,0.873277,0.205052,0.937417,0.680343,0.506249,0.753223,0.529395,0.975386,0.829405,0.758151
9,0.775905,0.826503,0.0389343,0.510118,0.658464,0.565723,0.231966,0.608431,0.297745,0.370042
10,0.0285103,0.943569,0.796121,0.322704,0.470704,0.153233,0.483333,0.524271,0.121153,0.618462


In [26]:
@time map(sum, eachrow(df2))

  2.211330 seconds (60.25 M allocations: 1.064 GiB, 10.66% gc time, 4.70% compilation time)


1000000-element Vector{Float64}:
 4.986694968551295
 5.703301973928069
 3.6699821580942373
 5.8173604519654365
 5.187132612130035
 5.0261200831636526
 6.173471538651279
 7.047896934559805
 4.8838311978141355
 4.4620607062848165
 6.511812188675087
 2.65610417120726
 3.8126691051180788
 ⋮
 5.647798831400137
 5.482999732609254
 5.679812696627533
 4.737219922862144
 4.33495820390455
 5.7304529387305765
 5.897063907649167
 5.602893138716614
 3.392533429935587
 5.770531985593611
 5.445066567576743
 3.379936205888135

In [27]:
@time map(sum, eachrow(df2))

  1.945997 seconds (59.99 M allocations: 1.050 GiB, 3.50% gc time)


1000000-element Vector{Float64}:
 4.986694968551295
 5.703301973928069
 3.6699821580942373
 5.8173604519654365
 5.187132612130035
 5.0261200831636526
 6.173471538651279
 7.047896934559805
 4.8838311978141355
 4.4620607062848165
 6.511812188675087
 2.65610417120726
 3.8126691051180788
 ⋮
 5.647798831400137
 5.482999732609254
 5.679812696627533
 4.737219922862144
 4.33495820390455
 5.7304529387305765
 5.897063907649167
 5.602893138716614
 3.392533429935587
 5.770531985593611
 5.445066567576743
 3.379936205888135

In [28]:
@time map(sum, Tables.namedtupleiterator(df2))

  0.163732 seconds (440.27 k allocations: 30.963 MiB, 92.95% compilation time)


1000000-element Vector{Float64}:
 4.986694968551295
 5.703301973928069
 3.6699821580942373
 5.8173604519654365
 5.187132612130035
 5.0261200831636526
 6.173471538651279
 7.047896934559805
 4.8838311978141355
 4.4620607062848165
 6.511812188675087
 2.65610417120726
 3.8126691051180788
 ⋮
 5.647798831400137
 5.482999732609254
 5.679812696627533
 4.737219922862144
 4.33495820390455
 5.7304529387305765
 5.897063907649167
 5.602893138716614
 3.392533429935587
 5.770531985593611
 5.445066567576743
 3.379936205888135

In [29]:
@time map(sum, Tables.namedtupleiterator(df2))

  0.010230 seconds (13 allocations: 7.630 MiB)


1000000-element Vector{Float64}:
 4.986694968551295
 5.703301973928069
 3.6699821580942373
 5.8173604519654365
 5.187132612130035
 5.0261200831636526
 6.173471538651279
 7.047896934559805
 4.8838311978141355
 4.4620607062848165
 6.511812188675087
 2.65610417120726
 3.8126691051180788
 ⋮
 5.647798831400137
 5.482999732609254
 5.679812696627533
 4.737219922862144
 4.33495820390455
 5.7304529387305765
 5.897063907649167
 5.602893138716614
 3.392533429935587
 5.770531985593611
 5.445066567576743
 3.379936205888135

As you can see, this time it is much faster to iterate a type-stable container.

Still, you might want to use the `select` syntax, which is optimized for such reductions:

In [30]:
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum # this includes compilation time

  0.453571 seconds (912.14 k allocations: 56.067 MiB, 98.20% compilation time)


1000000-element Vector{Float64}:
 4.986694968551295
 5.703301973928069
 3.6699821580942373
 5.8173604519654365
 5.187132612130035
 5.0261200831636526
 6.173471538651279
 7.047896934559805
 4.8838311978141355
 4.4620607062848165
 6.511812188675087
 2.65610417120726
 3.8126691051180788
 ⋮
 5.647798831400137
 5.482999732609254
 5.679812696627533
 4.737219922862144
 4.33495820390455
 5.7304529387305765
 5.897063907649167
 5.602893138716614
 3.392533429935587
 5.770531985593611
 5.445066567576743
 3.379936205888135

In [31]:
@time select(df2, AsTable(:) => ByRow(sum) => "sum").sum

  0.006923 seconds (170 allocations: 7.638 MiB)


1000000-element Vector{Float64}:
 4.986694968551295
 5.703301973928069
 3.6699821580942373
 5.8173604519654365
 5.187132612130035
 5.0261200831636526
 6.173471538651279
 7.047896934559805
 4.8838311978141355
 4.4620607062848165
 6.511812188675087
 2.65610417120726
 3.8126691051180788
 ⋮
 5.647798831400137
 5.482999732609254
 5.679812696627533
 4.737219922862144
 4.33495820390455
 5.7304529387305765
 5.897063907649167
 5.602893138716614
 3.392533429935587
 5.770531985593611
 5.445066567576743
 3.379936205888135
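
As a quick sanity check, the three approaches timed above can be compared on a tiny table where the row sums are easy to verify by hand. The data frame `df` and the names `s1`, `s2`, `s3` are illustrative only.

```julia
using DataFrames

df = DataFrame(a=[1.0, 2.0], b=[3.0, 4.0], c=[5.0, 6.0])

s1 = map(sum, eachrow(df))                               # DataFrameRow iteration
s2 = map(sum, Tables.namedtupleiterator(df))             # NamedTuple iteration
s3 = select(df, AsTable(:) => ByRow(sum) => "sum").sum   # `select` with `ByRow`

s1 == s2 == s3    # all three produce the same row sums
```

The approaches differ in compilation and run-time cost, not in the result, so the choice can be made purely on the shape of the table.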