# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Apr 21, 2018**

### Reference

* https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

### Series

* https://deepstat.tistory.com/69 (01. constructors)(in English)
* https://deepstat.tistory.com/70 (01. constructors)(한글)
* https://deepstat.tistory.com/71 (02. basicinfo)(in English)
* https://deepstat.tistory.com/72 (02. basicinfo)(한글)
* https://deepstat.tistory.com/73 (03. missingvalues)(in English)
* https://deepstat.tistory.com/74 (03. missingvalues)(한글)
* https://deepstat.tistory.com/75 (04. loadsave)(in English)
* https://deepstat.tistory.com/76 (04. loadsave)(한글)
* https://deepstat.tistory.com/77 (05. columns)(in English)
* https://deepstat.tistory.com/78 (05. columns)(한글)
* https://deepstat.tistory.com/79 (06. rows)(in English)
* https://deepstat.tistory.com/80 (06. rows)(한글)
* https://deepstat.tistory.com/81 (07. factors)(in English)
* https://deepstat.tistory.com/82 (07. factors)(한글)
* https://deepstat.tistory.com/83 (08. joins)(in English)
* https://deepstat.tistory.com/84 (08. joins)(한글)
* https://deepstat.tistory.com/85 (09. reshaping)(in English)
* https://deepstat.tistory.com/86 (09. reshaping)(한글)
* https://deepstat.tistory.com/87 (10. transforms)(in English)
* https://deepstat.tistory.com/88 (10. transforms)(한글)
* https://deepstat.tistory.com/89 (11. performance)(in English)
* https://deepstat.tistory.com/90 (11. performance)(한글)

In [1]:
using DataFrames
using BenchmarkTools

## Performance tips

### Access by column number is faster than by name

In [2]:
x = DataFrame(rand(5, 1000))
@btime x[500];
@btime x[:x500];

  13.743 ns (0 allocations: 0 bytes)
  21.816 ns (0 allocations: 0 bytes)


### When working with a `DataFrame`, use barrier functions or type annotations

In [3]:
using Random

function f_bad() # this function will be slow
    Random.seed!(1); x = DataFrame(rand(1000000,2))
    y, z = x[1], x[2]
    p = 0.0
    for i in 1:nrow(x)
        p += y[i]*z[i]
    end
    p
end

@btime f_bad();

  107.867 ms (5999022 allocations: 122.06 MiB)


In [4]:
@code_warntype f_bad() # the reason is that Julia cannot infer the column types of a `DataFrame`

Body[91m[1m::Any[22m[39m
[90m[43G│╻            seed![1G[39m[90m4  [39m1 ── %1  = Random.GLOBAL_RNG[36m::MersenneTwister[39m
[90m[43G││╻╷╷╷         seed![1G[39m[90m   [39m│    %2  = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{UInt32,1}, svec(Any, Int64), :(:ccall), 2, Array{UInt32,1}, 0, 0))[36m::Array{UInt32,1}[39m
[90m[43G│││╻╷╷╷╷╷╷╷     make_seed[1G[39m[90m   [39m│    %3  = (Core.lshr_int)(1, 63)[36m::Int64[39m
[90m[43G││││┃│││││││     push![1G[39m[90m   [39m│    %4  = (Core.trunc_int)(Core.UInt8, %3)[36m::UInt8[39m
[90m[43G│││││┃││││││      _growend![1G[39m[90m   [39m│    %5  = (Core.eq_int)(%4, 0x01)[36m::Bool[39m
[90m[43G││││││┃││││        cconvert[1G[39m[90m   [39m└───       goto #3 if not %5
[90m[43G│││││││┃│││         convert[1G[39m[90m   [39m2 ──       invoke Core.throw_inexacterror(:check_top_bit::Symbol, Int64::Any, 1::Int64)
[90m[43G││││││││┃││          Type[1G[39m[90m   [39m└───       $(Expr(:unreachable))

In [5]:
# solution 1 is to use a barrier function (this should be possible in almost any code)
function f_inner(y,z)
   p = 0.0
   for i in 1:length(y)
       p += y[i]*z[i]
   end
   p
end

function f_barrier() # extract the work to an inner function
    Random.seed!(1); x = DataFrame(rand(1000000,2))
    f_inner(x[1], x[2])
end

using LinearAlgebra

function f_inbuilt() # or use inbuilt function if possible
    Random.seed!(1); x = DataFrame(rand(1000000,2))
    x[1] ⋅ x[2]
end

@btime f_barrier();
@btime f_inbuilt();

  8.436 ms (44 allocations: 30.52 MiB)
  9.642 ms (44 allocations: 30.52 MiB)
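
The barrier idea does not depend on `DataFrame` itself. Below is a minimal sketch that uses a `Vector{Any}` as a hypothetical stand-in for a container whose column types the compiler cannot infer. The names `cols`, `dot_unstable`, `dot_kernel` and `dot_barrier` are made up for this illustration and are not part of DataFrames.

```julia
# Hypothetical stand-in for a DataFrame: columns stored in a Vector{Any},
# so the compiler cannot infer the type of cols[1] or cols[2].
cols = Any[rand(1000), rand(1000)]

# The whole loop runs on values of unknown type (type-unstable).
function dot_unstable(cols)
    y, z = cols[1], cols[2]
    p = 0.0
    for i in 1:length(y)
        p += y[i] * z[i]
    end
    p
end

# Kernel: Julia compiles a specialized method for the concrete argument types.
function dot_kernel(y, z)
    p = 0.0
    for i in eachindex(y, z)
        p += y[i] * z[i]
    end
    p
end

# Barrier: one dynamic dispatch at this call, type-stable code inside the kernel.
dot_barrier(cols) = dot_kernel(cols[1], cols[2])

r_unstable = dot_unstable(cols)
r_barrier = dot_barrier(cols)
```

Both functions should return the same value. Only the place where the types become known differs, and that is what `@code_warntype` would be expected to show if you ran it on each of them.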


In [6]:
# solution 2 is to provide the types of extracted columns
# it is simpler but there are cases in which you will not know these types
function f_typed()
    Random.seed!(1); x = DataFrame(rand(1000000,2))
    y::Vector{Float64}, z::Vector{Float64} = x[1], x[2]
    p = 0.0
    for i in 1:nrow(x)
        p += y[i]*z[i]
    end
    p
end

@btime f_typed();

  8.464 ms (44 allocations: 30.52 MiB)


### Consider delaying `DataFrame` creation

In [7]:
function f1()
    x = DataFrame(Float64, 10^4, 100) # we work with DataFrame directly
    for c in 1:ncol(x)
        d = x[c]
        for r in 1:nrow(x)
            d[r] = rand()
        end
    end
    x
end

function f2()
    x = Vector{Any}(undef,100)
    for c in 1:length(x)
        d = Vector{Float64}(undef,10^4)
        for r in 1:length(d)
            d[r] = rand()
        end
        x[c] = d
    end
    DataFrame(x) # we delay creating the DataFrame until the work is done
end

@btime f1();
@btime f2();

  22.924 ms (1950037 allocations: 37.42 MiB)
  2.098 ms (937 allocations: 7.69 MiB)


### You can add rows to a `DataFrame` in place and it is fast

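- Note: the sizes reported after `push!` and `append!` are much larger than 1000001. `@btime` evaluates the expression many times, and every evaluation adds one more row to `x` in place. (The original tutorial does not comment on this.)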

In [8]:
x = DataFrame(rand(10^6, 5))
y = DataFrame(transpose(1.0:5.0))
z = [1.0:5.0;]
println("Size of original x = ",size(x))
@btime vcat($x, $y); # creates a new DataFrame - slow
println("Size of result after running vcat = ", size(vcat(x,y)))
@btime push!($x, $z); # add a single row in place - fast
println("Size of x after running push! = ", size(x))
println(" ")
x = DataFrame(rand(10^6, 5)) # reset to the same starting point
println("Size of original x = ", size(x))
@btime append!($x, $y); # in place - fastest
println("Size of x after running append! = ", size(x))

Size of original x = (1000000, 5)
  6.643 ms (135 allocations: 38.15 MiB)
Size of result after running vcat = (1000001, 5)
  204.251 ns (5 allocations: 80 bytes)
Size of x after running push! = (7350502, 5)
 
Size of original x = (1000000, 5)
  164.216 ns (1 allocation: 16 bytes)
Size of x after running append! = (9260502, 5)
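
If you want to time a mutating call without growing the original object, one option is the `setup` and `evals` keywords of BenchmarkTools. The sketch below uses a plain vector instead of a `DataFrame` to stay small. The variable names `y`, `n_after_naive` and `n_after_setup` are made up for this illustration, and the timings with `evals=1` are only rough because each sample is a single evaluation.

```julia
using BenchmarkTools

x = zeros(5)
# Naive benchmark: every evaluation pushes into the same global vector.
@btime push!($x, 1.0);
n_after_naive = length(x)   # far more than 6 elements by now

x = zeros(5)
# setup runs before each sample and evals=1 means one evaluation per sample,
# so each timed push! works on a fresh copy and x itself is left untouched.
@btime push!(y, 1.0) setup=(y = copy($x)) evals=1;
n_after_setup = length(x)   # still 5
```

The same pattern should carry over to `push!` and `append!` on a `DataFrame` by copying the data frame in `setup`, at the price of one copy per sample.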


### Allowing `missing` as well as `categorical` slows down computations

In [9]:
using StatsBase

function test(data) # uses countmap function to test performance
    println(eltype(data))
    x = rand(data, 10^6)
    y = categorical(x)
    println(" raw:")
    @btime countmap($x)
    println(" categorical:")
    @btime countmap($y)
    nothing
end

println("Using test(1:10)")
test(1:10)
println(" ")
println("Using test([randstring() for i in 1:10])")
test([randstring() for i in 1:10])
println(" ")
println("Using test(allowmissing(1:10))")
test(allowmissing(1:10))
println(" ")
println("Using test(allowmissing([randstring() for i in 1:10]))")
test(allowmissing([randstring() for i in 1:10]))


Using test(1:10)
Int64
 raw:
  5.340 ms (8 allocations: 7.63 MiB)
 categorical:
  20.467 ms (4 allocations: 608 bytes)
 
Using test([randstring() for i in 1:10])
String
 raw:
  33.041 ms (4 allocations: 608 bytes)
 categorical:
  38.489 ms (4 allocations: 608 bytes)
 
Using test(allowmissing(1:10))
Union{Missing, Int64}
 raw:
  13.648 ms (4 allocations: 624 bytes)
 categorical:
  20.305 ms (4 allocations: 608 bytes)
 
Using test(allowmissing([randstring() for i in 1:10]))
Union{Missing, String}
 raw:
  19.645 ms (4 allocations: 608 bytes)
 categorical:
  29.604 ms (4 allocations: 608 bytes)
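
Since allowing `missing` has a cost, it can be worth narrowing the element type once you know a column holds no missing values. Here is a minimal sketch using only Base Julia. The variable names are made up for this illustration. The `disallowmissing` function, the counterpart of the `allowmissing` used above, is meant for the same job.

```julia
# A column declared to allow missing values, although it contains none.
x = Union{Missing, Int}[1, 2, 3, 4]

has_missing = any(ismissing, x)

# Narrow the element type only when it is safe to do so.
narrow = has_missing ? x : convert(Vector{Int}, x)

# When missing values are really present, skipmissing skips them lazily.
w = Union{Missing, Int}[1, missing, 3]
total = sum(skipmissing(w))
```

After this, `eltype(narrow)` is `Int`, so code working on `narrow` no longer has to handle the `Missing` case.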
