# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), November 20, 2020**

In [1]:
using DataFrames

## Possible pitfalls

### Know what is copied when creating a `DataFrame`

In [2]:
x = DataFrame(rand(3, 5), :auto)

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.594692,0.514567,0.18404,0.563166,0.0321274
2,0.437888,0.0622966,0.223676,0.788461,0.128907
3,0.52067,0.191295,0.923419,0.704108,0.0789202


In [3]:
y = copy(x)
x === y # not the same object

false

In [4]:
y = DataFrame(x)
x === y

false

In [5]:
any(x[!, i] === y[!, i] for i in 1:ncol(x)) # the columns are also not the same

false

In [6]:
y = DataFrame(x, copycols=false)
x === y

false

In [7]:
all(x[!, i] === y[!, i] for i in 1:ncol(x)) # the columns are the same

true

In [8]:
x = 1:3; y = [1, 2, 3]; df = DataFrame(x=x,y=y) # the same when creating data frames using kwarg syntax

Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,Int64
1,1,1
2,2,2
3,3,3


In [9]:
y === df.y # different object

false

In [10]:
typeof(x), typeof(df.x) # range is converted to a vector

(UnitRange{Int64}, Array{Int64,1})

In [11]:
y === df[:, :y] # slicing rows always creates a copy

false

You can avoid copying by passing the `copycols=false` keyword argument to the `DataFrame` constructor and to functions such as `select`.

In [12]:
df = DataFrame(x=x,y=y, copycols=false)

Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,Int64
1,1,1
2,2,2
3,3,3


In [13]:
y === df.y # now it is the same

true

In [14]:
select(df, :y)[!, 1] === y # not the same

false

In [15]:
select(df, :y, copycols=false)[!, 1] === y # the same

true
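To see why the distinction matters, here is a minimal sketch of what happens when a shared column is mutated. The helper `shares_columns` and the variable names are hypothetical and only serve the illustration.

```julia
using DataFrames

# Hypothetical helper: true if every column of `a` is the very same
# vector object as the column at the same position in `b`.
shares_columns(a, b) =
    ncol(a) == ncol(b) && all(a[!, i] === b[!, i] for i in 1:ncol(a))

src = DataFrame(a=[1, 2, 3], b=[4.0, 5.0, 6.0])
copied = DataFrame(src)                    # default: columns are copied
aliased = DataFrame(src, copycols=false)   # columns are shared with `src`

aliased.a[1] = 100   # writes into the vector that `src` also holds

println(shares_columns(src, copied))    # copied columns are distinct objects
println(shares_columns(src, aliased))   # aliased columns are the same objects
println((src.a[1], copied.a[1]))        # only `src` sees the change
```

With `copycols=false` a write through one data frame is visible through the other, so it is best reserved for cases where you control every holder of the column.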

### Do not modify the parent of `GroupedDataFrame` or `view`

In [16]:
x = DataFrame(id=repeat([1,2], outer=3), x=1:6)
g = groupby(x, :id)

Unnamed: 0_level_0,id,x
Unnamed: 0_level_1,Int64,Int64
1,1,1
2,1,3
3,1,5

Unnamed: 0_level_0,id,x
Unnamed: 0_level_1,Int64,Int64
1,2,2
2,2,4
3,2,6


In [17]:
x[1:3, 1] = [2,2,2]
g # the grouping is now stale: g is only a view of x, so the first group mixes ids 2 and 1

Unnamed: 0_level_0,id,x
Unnamed: 0_level_1,Int64,Int64
1,2,1
2,2,3
3,1,5

Unnamed: 0_level_0,id,x
Unnamed: 0_level_1,Int64,Int64
1,2,2
2,2,4
3,2,6
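A possible way to stay safe, sketched below under the assumption that regrouping is affordable, is to call `groupby` again after every change to the parent. The name `group_sizes` is hypothetical.

```julia
using DataFrames

df = DataFrame(id=repeat([1, 2], outer=3), x=1:6)
g = groupby(df, :id)

df[1:3, :id] = [2, 2, 2]   # mutating the parent leaves `g` with a stale grouping
g = groupby(df, :id)       # regroup so that the groups match the data again

# Map each group key to its number of rows.
group_sizes = Dict(first(sub.id) => nrow(sub) for sub in g)
println(group_sizes)
```

The alternative is to group a `copy` of the data frame and mutate only the original, which keeps the old grouping valid at the cost of extra memory.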


In [18]:
s = view(x, 5:6, :)

Unnamed: 0_level_0,id,x
Unnamed: 0_level_1,Int64,Int64
1,1,5
2,2,6


In [19]:
delete!(x, 3:6)

Unnamed: 0_level_0,id,x
Unnamed: 0_level_1,Int64,Int64
1,2,1
2,2,2


In [20]:
s # error

BoundsError: attempt to access 2-element Array{Int64,1} at index [5:6]
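If the rows of a view are still needed after the parent shrinks, one option is to materialize the view first. The sketch below uses `filter!` to remove rows in place so that it does not depend on which row-deletion function a given DataFrames.jl version provides; the name `snapshot` is hypothetical.

```julia
using DataFrames

df = DataFrame(id=[2, 2, 2, 2, 1, 2], x=1:6)
s = view(df, 5:6, :)

snapshot = copy(s)               # independent DataFrame holding rows 5 and 6
filter!(row -> row.x <= 2, df)   # shrink the parent in place; `s` is now invalid

println(snapshot)                # still usable, unlike `s`
println(nrow(df))
```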

### Single column selection for `DataFrame` creates aliases with `!` and `getproperty` syntax and copies with `:`

In [21]:
x = DataFrame(a=1:3)
x.b = x[!, 1] # alias
x.c = x[:, 1] # copy
x.d = x[!, 1][:] # copy
x.e = copy(x[!, 1]) # explicit copy
display(x)
x[1,1] = 100
display(x)

Unnamed: 0_level_0,a,b,c,d,e
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Int64
1,1,1,1,1,1
2,2,2,2,2,2
3,3,3,3,3,3


Unnamed: 0_level_0,a,b,c,d,e
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Int64
1,100,100,1,1,1
2,2,2,2,2,2
3,3,3,3,3,3
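The same rule can be checked directly with `===`. This is a small sketch with hypothetical variable names; it also covers the `getproperty` form (`df.a`) mentioned in the heading.

```julia
using DataFrames

df = DataFrame(a=[1, 2, 3])

alias_prop = df.a        # getproperty: the stored vector itself
alias_bang = df[!, :a]   # `!` row selector: also the stored vector
copied = df[:, :a]       # `:` row selector: a fresh copy

alias_prop[1] = 100      # writes into the column held by `df`

println(alias_prop === alias_bang)   # both names refer to one object
println(copied === alias_prop)       # the copy is a different object
println((df[1, :a], copied[1]))      # the copy keeps the old value
```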


### When iterating rows of a data frame use `eachrow` to avoid compilation cost (wide tables), but `Tables.namedtupleiterator` for fast execution (tall tables)

This table is wide:

In [22]:
df1 = DataFrame([rand([1:2, 'a':'b', false:true, 1.0:2.0]) for i in 1:900], :auto)

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11
Unnamed: 0_level_1,Char,Int64,Char,Int64,Float64,Int64,Float64,Float64,Int64,Float64,Float64
1,a,1,a,1,1.0,1,1.0,1.0,1,1.0,1.0
2,b,2,b,2,2.0,2,2.0,2.0,2,2.0,2.0


In [23]:
@time collect(eachrow(df1))

  0.062215 seconds (112.85 k allocations: 6.044 MiB)


2-element Array{DataFrameRow{DataFrame,DataFrames.Index},1}:
 DataFrameRow
 Row │ x1    x2     x3    x4     x5       x6     x7       x8       x9     x10  ⋯
     │ Char  Int64  Char  Int64  Float64  Int64  Float64  Float64  Int64  Floa ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ a         1  a         1      1.0      1      1.0      1.0  1      1.0  ⋯
                                                             891 columns omitted
 DataFrameRow
 Row │ x1    x2     x3    x4     x5       x6     x7       x8       x9     x10  ⋯
     │ Char  Int64  Char  Int64  Float64  Int64  Float64

In [24]:
@time collect(Tables.namedtupleiterator(df1));

  3.543209 seconds (5.56 M allocations: 313.787 MiB, 2.72% gc time)


As you can see, the time to compile `Tables.namedtupleiterator` is very large in this case, and it would get much worse if the table were wider (that is why this tip is included in the pitfalls notebook).
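One way to see where the compilation cost comes from is to compare the element types of the two iterators on small tables. In the sketch below the data frames `narrow` and `wider` are hypothetical, and `DataFrames.Tables` is used only so that the snippet does not assume `Tables` is already in scope.

```julia
using DataFrames

narrow = DataFrame(a=[1, 2], b=['x', 'y'])
wider = DataFrame(a=[1, 2], b=['x', 'y'], c=[1.0, 2.0])

# A DataFrameRow has the same type no matter which columns the table holds,
# so code working on it is compiled once.
row_narrow = first(eachrow(narrow))
row_wider = first(eachrow(wider))
println(typeof(row_narrow) == typeof(row_wider))

# A NamedTuple encodes every column name and element type in its type,
# so each distinct table layout needs its own compiled code.
nt_narrow = first(DataFrames.Tables.namedtupleiterator(narrow))
nt_wider = first(DataFrames.Tables.namedtupleiterator(wider))
println(typeof(nt_narrow))
println(typeof(nt_wider))
```

With 900 columns the `NamedTuple` type carries 900 names and 900 element types, which is what the compiler has to specialize on.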

The table below is tall:

In [25]:
df2 = DataFrame(rand(10^6, 10), :auto)

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.426059,0.742032,0.707272,0.652059,0.304092,0.719528,0.688369,0.632429
2,0.45075,0.218993,0.908634,0.279011,0.842885,0.107019,0.315626,0.241803
3,0.233813,0.215836,0.257729,0.740415,0.0919936,0.71226,0.914756,0.726123
4,0.965376,0.898253,0.95447,0.672381,0.83925,0.724675,0.150684,0.396315
5,0.184352,0.357508,0.528711,0.966319,0.917087,0.90029,0.230387,0.956348
6,0.927468,0.542864,0.978972,0.632794,0.19679,0.213441,0.752753,0.186146
7,0.479898,0.176363,0.831515,0.223991,0.3749,0.83241,0.497393,0.356694
8,0.560469,0.413727,0.95283,0.00153209,0.660829,0.240667,0.612188,0.637321
9,0.15406,0.917065,0.162022,0.332862,0.823867,0.988992,0.818119,0.205916
10,0.209826,0.479812,0.415616,0.764865,0.501729,0.0671917,0.0618787,0.0783432


In [26]:
@time map(sum, eachrow(df2))

  2.127842 seconds (60.20 M allocations: 1.061 GiB, 10.49% gc time)


1000000-element Array{Float64,1}:
 5.425498600888671
 3.7768052553526426
 4.659967330231552
 7.103370561604825
 5.814570363240507
 6.1859065433961336
 4.471149052140817
 5.476184303052102
 5.606651743657868
 3.47177416802785
 6.213498373088708
 5.435893647895248
 5.0660833071317
 ⋮
 4.725244826624812
 6.798193294947637
 6.194266127548412
 4.206175463282667
 5.705487427159872
 5.1125101149843895
 5.725319376702462
 3.463395534254586
 4.904474994182661
 3.5746522076541454
 3.5747126855179996
 6.369457105293115

In [27]:
@time map(sum, eachrow(df2))

  1.968601 seconds (59.99 M allocations: 1.050 GiB, 3.47% gc time)


1000000-element Array{Float64,1}:
 5.425498600888671
 3.7768052553526426
 4.659967330231552
 7.103370561604825
 5.814570363240507
 6.1859065433961336
 4.471149052140817
 5.476184303052102
 5.606651743657868
 3.47177416802785
 6.213498373088708
 5.435893647895248
 5.0660833071317
 ⋮
 4.725244826624812
 6.798193294947637
 6.194266127548412
 4.206175463282667
 5.705487427159872
 5.1125101149843895
 5.725319376702462
 3.463395534254586
 4.904474994182661
 3.5746522076541454
 3.5747126855179996
 6.369457105293115

In [28]:
@time map(sum, Tables.namedtupleiterator(df2))

  0.221442 seconds (516.88 k allocations: 34.874 MiB, 3.61% gc time)


1000000-element Array{Float64,1}:
 5.425498600888671
 3.7768052553526426
 4.659967330231552
 7.103370561604825
 5.814570363240507
 6.1859065433961336
 4.471149052140817
 5.476184303052102
 5.606651743657868
 3.47177416802785
 6.213498373088708
 5.435893647895248
 5.0660833071317
 ⋮
 4.725244826624812
 6.798193294947637
 6.194266127548412
 4.206175463282667
 5.705487427159872
 5.1125101149843895
 5.725319376702462
 3.463395534254586
 4.904474994182661
 3.5746522076541454
 3.5747126855179996
 6.369457105293115

In [29]:
@time map(sum, Tables.namedtupleiterator(df2))

  0.020637 seconds (19 allocations: 7.631 MiB)


1000000-element Array{Float64,1}:
 5.425498600888671
 3.7768052553526426
 4.659967330231552
 7.103370561604825
 5.814570363240507
 6.1859065433961336
 4.471149052140817
 5.476184303052102
 5.606651743657868
 3.47177416802785
 6.213498373088708
 5.435893647895248
 5.0660833071317
 ⋮
 4.725244826624812
 6.798193294947637
 6.194266127548412
 4.206175463282667
 5.705487427159872
 5.1125101149843895
 5.725319376702462
 3.463395534254586
 4.904474994182661
 3.5746522076541454
 3.5747126855179996
 6.369457105293115

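As you can see, this time it is much faster to iterate a type-stable container: once compiled, the `Tables.namedtupleiterator` run takes about 0.02 seconds, compared with about 2 seconds for `eachrow`.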
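The two tips can be combined in a small helper that picks the iterator from the table width. This is only a sketch: the function `row_iterator`, the keyword `max_static_cols`, and the cutoff of 100 columns are all assumptions made for illustration, not a recommendation from DataFrames.jl.

```julia
using DataFrames

# Hypothetical helper: use the type-stable NamedTuple iterator for narrow
# tables and fall back to `eachrow` for wide ones to limit compilation cost.
# The default cutoff of 100 columns is an arbitrary assumption.
function row_iterator(df::AbstractDataFrame; max_static_cols::Int=100)
    if ncol(df) <= max_static_cols
        return DataFrames.Tables.namedtupleiterator(df)
    else
        return eachrow(df)
    end
end

tall = DataFrame(rand(1_000, 10), :auto)
wide = DataFrame(rand(2, 150), :auto)

tall_sums = map(sum, row_iterator(tall))   # iterates NamedTuples
wide_sums = map(sum, row_iterator(wide))   # iterates DataFrameRows
println((length(tall_sums), length(wide_sums)))
```

The helper itself is not type stable, but passing its result straight to `map` acts as a function barrier: `map` is specialized on the concrete iterator type it receives, so the loop over rows still runs on type-stable elements in the narrow case.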