# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), December 9, 2018**

In [1]:
using DataFrames

## Possible pitfalls

### Know what is copied when creating a `DataFrame`

In [2]:
x = DataFrame(rand(3, 5))

│ Row │ x1       │ x2       │ x3       │ x4       │ x5       │
│     │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 0.641023 │ 0.848304 │ 0.49703  │ 0.216958 │ 0.585984 │
│ 2   │ 0.326486 │ 0.443172 │ 0.390857 │ 0.331882 │ 0.960979 │
│ 3   │ 0.282608 │ 0.53563  │ 0.444903 │ 0.698256 │ 0.867249 │


In [3]:
y = convert(DataFrame, x) # once https://github.com/JuliaData/DataFrames.jl/pull/1489 is released, DataFrame(x) will behave the same way
x === y # no copying performed

true

In [4]:
y = copy(x)
x === y # not the same object

false

In [5]:
all(x[i] === y[i] for i in 1:ncol(x)) # but the columns are the same objects

true

In [6]:
x = 1:3; y = [1, 2, 3]; df = DataFrame(x=x,y=y) # vectors passed as columns are stored without copying; ranges are the exception

│ Row │ x     │ y     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 2     │ 2     │
│ 3   │ 3     │ 3     │


In [7]:
y === df[:y] # the same object

true

In [8]:
typeof(x), typeof(df[:x]) # range is converted to a vector

(UnitRange{Int64}, Array{Int64,1})
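
The checks above depend on the difference between `===` (the very same object) and `==` (equal contents), and on the fact that a range has to be materialized before it can be stored as a vector. As a minimal sketch in base Julia, with no `DataFrame` involved and with illustrative variable names that are not part of the original notebook:

```julia
# Base-Julia illustration only; all variable names are hypothetical.
v = [1, 2, 3]
same = v            # binds a second name to the same vector
dup = copy(v)       # allocates a new vector with equal contents

same_is_v = same === v   # identical object
dup_is_v = dup === v     # a different object ...
dup_eq_v = dup == v      # ... with equal contents

r = 1:3
m = collect(r)      # materialize the range as a vector
range_is_vector = r isa Vector   # a range is not a Vector
m_is_vector = m isa Vector
```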

### Do not modify the parent of `GroupedDataFrame` or `view`

In [9]:
x = DataFrame(id=repeat([1,2], outer=3), x=1:6)
g = groupby(x, :id)

GroupedDataFrame{DataFrame} with 2 groups based on key: :id
First Group: 3 rows
│ Row │ id    │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 1     │
│ 2   │ 1     │ 3     │
│ 3   │ 1     │ 5     │
⋮
Last Group: 3 rows
│ Row │ id    │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 2     │
│ 2   │ 2     │ 4     │
│ 3   │ 2     │ 6     │

In [10]:
x[1:3, 1] = [2, 2, 2]
g # g is now wrong: it is only a view of x, and the grouping was computed before the change

GroupedDataFrame{DataFrame} with 2 groups based on key: :id
First Group: 3 rows
│ Row │ id    │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 1     │
│ 2   │ 2     │ 3     │
│ 3   │ 1     │ 5     │
⋮
Last Group: 3 rows
│ Row │ id    │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 2     │
│ 2   │ 2     │ 4     │
│ 3   │ 2     │ 6     │

In [11]:
s = view(x, 5:6, :)


│ Row │ id    │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 5     │
│ 2   │ 2     │ 6     │



In [12]:
deleterows!(x, 3:6)

│ Row │ id    │ x     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 2     │ 1     │
│ 2   │ 2     │ 2     │


In [13]:
s # error: s still refers to rows 5:6 of x, which no longer exist

BoundsError: BoundsError: attempt to access 2-element Array{Int64,1} at index [5:6]
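
Both pitfalls in this section come from the same mechanism: a derived object keeps referring to its parent instead of holding its own data. A rough analogy in base Julia, assuming nothing about how `GroupedDataFrame` is implemented internally (the names `ids`, `group1` and `v` are illustrative):

```julia
# Base-Julia analogy only; this is not how DataFrames stores groups.
ids = [1, 2, 1, 2, 1, 2]

# Row positions of group 1, computed once, like a grouping computed up front.
group1 = findall(==(1), ids)        # positions 1, 3, 5

# Mutating the parent afterwards does not update the stored positions.
ids[1:3] .= 2
group_is_stale = any(ids[group1] .!= 1)

# A view shares memory with its parent, so writes to the parent show through.
v = view(ids, 5:6)
ids[5] = 7
view_sees_write = v[1] == 7
```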

### Remember that you can filter columns of a `DataFrame` using booleans

In [14]:
using Random
Random.seed!(1)
x = DataFrame(rand(5, 5))

│ Row │ x1         │ x2       │ x3       │ x4        │ x5        │
│     │ Float64    │ Float64  │ Float64  │ Float64   │ Float64   │
├─────┼────────────┼──────────┼──────────┼───────────┼───────────┤
│ 1   │ 0.236033   │ 0.210968 │ 0.555751 │ 0.209472  │ 0.0769509 │
│ 2   │ 0.346517   │ 0.951916 │ 0.437108 │ 0.251379  │ 0.640396  │
│ 3   │ 0.312707   │ 0.999905 │ 0.424718 │ 0.0203749 │ 0.873544  │
│ 4   │ 0.00790928 │ 0.251662 │ 0.773223 │ 0.287702  │ 0.278582  │
│ 5   │ 0.488613   │ 0.986666 │ 0.28119  │ 0.859512  │ 0.751313  │


In [15]:
x[x[:x1] .< 0.25] # oops: a Boolean vector used as the only index selects columns, so we filtered columns, not rows

│ Row │ x1         │ x4        │
│     │ Float64    │ Float64   │
├─────┼────────────┼───────────┤
│ 1   │ 0.236033   │ 0.209472  │
│ 2   │ 0.346517   │ 0.251379  │
│ 3   │ 0.312707   │ 0.0203749 │
│ 4   │ 0.00790928 │ 0.287702  │
│ 5   │ 0.488613   │ 0.859512  │


In [16]:
x[x[:x1] .< 0.25, :] # probably this is what we wanted

│ Row │ x1         │ x2       │ x3       │ x4       │ x5        │
│     │ Float64    │ Float64  │ Float64  │ Float64  │ Float64   │
├─────┼────────────┼──────────┼──────────┼──────────┼───────────┤
│ 1   │ 0.236033   │ 0.210968 │ 0.555751 │ 0.209472 │ 0.0769509 │
│ 2   │ 0.00790928 │ 0.251662 │ 0.773223 │ 0.287702 │ 0.278582  │
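
The accident goes unnoticed here because the table is 5×5, so a Boolean vector of length 5 is a valid selector along either axis. A small sketch with a plain matrix in base Julia (a stand-in, not a `DataFrame`; the mask reproduces the result of `x[:x1] .< 0.25` for the values shown above) puts the two readings side by side:

```julia
# Base-Julia stand-in for the 5×5 table; names are illustrative.
m = reshape(1:25, 5, 5)
mask = [true, false, false, true, false]     # rows 1 and 4 satisfy the condition

by_cols = m[:, mask]    # mask applied to columns: keeps columns 1 and 4
by_rows = m[mask, :]    # mask applied to rows: keeps rows 1 and 4

size(by_cols), size(by_rows)
```

Writing both indices explicitly, as in `x[mask, :]`, leaves no doubt about which axis the mask applies to.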


### Column selection for `DataFrame` creates aliases unless explicitly copied

In [17]:
x = DataFrame(a=1:3)
x[:b] = x[1] # alias
x[:c] = x[:, 1] # also alias - this will change in the future
x[:d] = x[1][:] # copy
x[:e] = copy(x[1]) # explicit copy
display(x)
x[1,1] = 100
display(x)

│ Row │ a     │ b     │ c     │ d     │ e     │
│     │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 1     │ 1     │ 1     │ 1     │
│ 2   │ 2     │ 2     │ 2     │ 2     │ 2     │
│ 3   │ 3     │ 3     │ 3     │ 3     │ 3     │


│ Row │ a     │ b     │ c     │ d     │ e     │
│     │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┤
│ 1   │ 100   │ 100   │ 100   │ 1     │ 1     │
│ 2   │ 2     │ 2     │ 2     │ 2     │ 2     │
│ 3   │ 3     │ 3     │ 3     │ 3     │ 3     │
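
The second table follows from which columns share storage with `a`. The same effect can be sketched with plain vectors in base Julia, where `a[:]` and `copy(a)` both allocate new storage while plain assignment does not (variable names are illustrative, and no `DataFrame` is involved):

```julia
# Base-Julia illustration of aliasing versus copying.
a = [1, 2, 3]
b = a          # alias: same storage as a
d = a[:]       # indexing with a colon allocates a copy
e = copy(a)    # explicit copy

a[1] = 100     # mutate the original

firsts = (a[1], b[1], d[1], e[1])   # the alias changes, the copies do not
```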


