# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), January 18, 2019**

In [1]:
using DataFrames # load package

## Load and save DataFrames
We do not cover all features of the packages used here. Please refer to their documentation to learn more.

Here we'll load `CSV` and `CSVFiles` to read and write CSV files, and `JLD2` and `Serialization`, which allow us to work with binary formats.

In [2]:
using CSV
using CSVFiles
using JLD2
using Serialization

Let's create a simple `DataFrame` for testing purposes,

In [3]:
x = DataFrame(A=[true, false, true], B=[1, 2, missing],
              C=[missing, "b", "c"], D=['a', missing, 'c'])


|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | true | 1 | missing | 'a' |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | 'c' |


and use `eltypes` to look at the columnwise types.

In [4]:
eltypes(x)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  
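The `Union{Missing, T}` element types come from the vectors passed to the constructor, not from `DataFrame` itself. As a minimal sketch, the plain vectors below stand in for the columns of `x` (the names `A` to `D` and `col_types` are only illustrative) and no packages are needed:

```julia
# Plain vectors standing in for the columns of `x`
A = [true, false, true]
B = [1, 2, missing]
C = [missing, "b", "c"]
D = ['a', missing, 'c']

# Element type of each vector; a vector holding `missing` gets a Union type
col_types = eltype.([A, B, C, D])
println(col_types)
```

Only `A` contains no `missing`, which is why it is the only column with a plain `Bool` element type.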

### CSV.jl

Let's use `CSV` to save `x` to disk; make sure `x1.csv` does not conflict with some file in your working directory.

In [5]:
CSV.write("x1.csv", x)

"x1.csv"

Now we can see how it was saved by reading `x1.csv`.

In [6]:
print(read("x1.csv", String))

A,B,C,D
true,1,,a
false,2,b,
true,,c,c


We can also load it back. Passing `use_mmap=false` disables memory mapping so that on Windows the file can be deleted in the same session; on other operating systems it is not needed.

In [7]:
y = CSV.read("x1.csv", use_mmap=false)

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool⍰ | Int64⍰ | String⍰ | String⍰ |
| 1 | true | 1 | missing | a |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | c |


When loading a `DataFrame` from a CSV file, all columns allow `Missing` by default. Note that the column types have changed: `:A` now allows missing values and `:D` was read as `String` instead of `Char`.

In [8]:
eltypes(y)

4-element Array{Union,1}:
 Union{Missing, Bool}  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}
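A CSV file stores plain text, so the difference between a `Char` and a one-character `String` cannot survive the round trip. If the `Char` column is needed again, it can be rebuilt after loading. The sketch below is one possible approach, not something the CSV package does for you: `d_read` is a stand-in for the loaded column and `to_char` is a hypothetical helper that assumes every non-missing entry holds exactly one character.

```julia
# Stand-in for column D as it comes back from the CSV file
d_read = Union{Missing, String}["a", missing, "c"]

# Hypothetical helper: keep `missing`, otherwise take the first character
to_char(s) = ismissing(s) ? missing : first(s)

d_char = map(to_char, d_read)
println(eltype(d_char))
```

The same idea applies to any column whose original type has no distinct textual representation.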

### CSVFiles.jl

Now we will use `CSVFiles` to achieve the same. First we save the file. Notice that we override the default `nastring`, which is `"NA"`, because we have missings in non-numeric columns.

In [9]:
x |> save("x2.csv", nastring="")

and peek at the saved file:

In [10]:
print(read("x2.csv", String))

"A","B","C","D"
true,1,,a
false,2,"b",
true,,"c",c


We can load it back using `load`:

In [11]:
y = load("x2.csv") |> DataFrame

|   | A | B | C | D |
|---|---|---|---|---|
|   | String | Int64⍰ | String | String |
| 1 | true | 1 |  | a |
| 2 | false | 2 | b |  |
| 3 | true | missing | c | c |


Let us check element types again:

In [12]:
eltypes(y)

4-element Array{Type,1}:
 String               
 Union{Missing, Int64}
 String               
 String               

Observe that in columns `:C` and `:D` missings were read back as empty strings, and that `:A` was read as `String` rather than `Bool`.
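If the empty strings should become `missing` again, they can be replaced after loading. This is a minimal sketch using only Base Julia; `c_read` is a stand-in for the loaded column `:C`, and the approach assumes that no entry was a genuine empty string before saving, since those would be converted as well.

```julia
# Stand-in for column C as loaded above: the missing value came back as ""
c_read = ["", "b", "c"]

# `replace` returns a copy and widens the element type to allow `missing`
c_fixed = replace(c_read, "" => missing)
println(eltype(c_fixed))
```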

### JLD2.jl

Now let's save `x` to a file in a binary format; make sure that `x.jld2` does not exist in your working directory.

In [13]:
@save "x.jld2" x

After loading in `x.jld2` as `y`, `y` is identical to `x`.

In [14]:
using FileIO # needed for function interface to JLD2, "jld2" extension is mandatory for filetype detection
y = load("x.jld2", "x")

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | true | 1 | missing | 'a' |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | 'c' |


Note that the column types of `y` are the same as those of `x`!

In [15]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  

### Serialization

Finally we use serialization to save `x`. Note that in general, this process will not work if the reading and writing are done by different versions of Julia, or an instance of Julia with a different system image.

In [16]:
open("x.bin", "w") do io
    serialize(io, x)
end

Now we load the saved file back into the variable `y`. Again `y` is identical to `x`.

In [17]:
y = open(deserialize, "x.bin")

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | true | 1 | missing | 'a' |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | 'c' |


In [18]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  
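The claim that serialization preserves both values and element types can also be checked programmatically. The sketch below does this for a single vector with an in-memory `IOBuffer` instead of a file, so it leaves nothing on disk; the names `col`, `col2` and `same` are illustrative only.

```julia
using Serialization

# A vector with a missing value, standing in for one column of `x`
col = [1, 2, missing]

# Serialize into an in-memory buffer instead of a file
io = IOBuffer()
serialize(io, col)

# Rewind the buffer and read the value back
seekstart(io)
col2 = deserialize(io)

# `isequal` treats `missing` as equal to `missing`, unlike `==`
same = isequal(col, col2) && eltype(col2) == eltype(col)
println(same)
```

Using `isequal` matters here because `==` returns `missing` when the compared data contains missing values.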

### Basic benchmarking

Next, we'll create the files `bigdf1.csv`, `bigdf2.csv`, `bigdf.jld2` and `bigdf.bin`, so be careful that you don't already have these files on disk!

In particular, we'll time how long it takes us to write a `DataFrame` with 10^3 rows and 10^2 columns to `.csv`, `.jld2` and serialized `.bin` files.

In [19]:
bigdf = DataFrame(Bool, 10^3, 10^2)
bigdf[1] = Int.(bigdf[1])
bigdf[2] = bigdf[2] .+ 0.5
bigdf[3] = string.(bigdf[3], ", as string")
println("First run")
@time CSV.write("bigdf1.csv", bigdf)
@time bigdf |> save("bigdf2.csv")
@time @save "bigdf.jld2" bigdf
@time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
println("Second run")
@time CSV.write("bigdf1.csv", bigdf)
@time bigdf |> save("bigdf2.csv")
@time @save "bigdf.jld2" bigdf
@time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
getfield.(stat.(["bigdf1.csv", "bigdf2.csv", "bigdf.jld2", "bigdf.bin"]), :size)

First run
  0.682687 seconds (1.86 M allocations: 89.554 MiB, 4.65% gc time)
  0.430858 seconds (862.30 k allocations: 44.037 MiB, 3.40% gc time)
  0.236200 seconds (700.10 k allocations: 34.019 MiB, 5.71% gc time)
  0.098760 seconds (440.94 k allocations: 21.759 MiB, 7.11% gc time)
Second run
  0.015536 seconds (57.43 k allocations: 1.094 MiB)
  0.006687 seconds (5.52 k allocations: 340.063 KiB)
  0.010677 seconds (19.27 k allocations: 654.722 KiB)
  0.007404 seconds (5.26 k allocations: 266.581 KiB)


4-element Array{Int64,1}:
 588374
 588574
 189346
  86404
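The first-run timings above include compilation of the code being called, which is why the second run is so much faster. The sketch below reproduces the same pattern with Base Julia only: `write_rows` is a hypothetical stand-in for a writer function, timed twice with `@elapsed`. It also shows `filesize`, which returns the same number as the `size` field of `stat`.

```julia
# Hypothetical writer, standing in for the package functions timed above
function write_rows(path, n)
    open(path, "w") do io
        for i in 1:n
            println(io, i, ",", i + 0.5, ",", i, " as string")
        end
    end
end

sizes = mktempdir() do dir
    path = joinpath(dir, "demo.csv")
    t1 = @elapsed write_rows(path, 10^3)  # first call also compiles the function
    t2 = @elapsed write_rows(path, 10^3)  # second call reuses the compiled code
    println("first: ", t1, " s, second: ", t2, " s")
    # Two equivalent ways to get the file size in bytes
    (filesize(path), stat(path).size)
end
println(sizes)
```

For more careful measurements, a dedicated benchmarking package that runs the code many times is usually a better choice than a single `@time` call.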

Finally, let's clean up. Do not run the next cell unless you are sure that it will not erase your important files.

In [20]:
foreach(rm, ["x1.csv", "x2.csv", "x.jld2", "x.bin", "bigdf1.csv", "bigdf2.csv", "bigdf.jld2", "bigdf.bin"])
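Calling `rm` on a file that does not exist throws an error, which can happen if only some of the cells above were run. A more forgiving variant is sketched below; `remove_if_present` is a hypothetical helper, not part of the original notebook, and it is demonstrated on a temporary directory so that it cannot touch your own files.

```julia
# Hypothetical helper: delete only the files that exist, report what was removed
function remove_if_present(paths)
    removed = String[]
    for p in paths
        if isfile(p)
            rm(p)
            push!(removed, p)
        end
    end
    return removed
end

# Demonstrate on a throwaway file in a temporary directory
removed = mktempdir() do dir
    existing = joinpath(dir, "x1.csv")
    absent = joinpath(dir, "x2.csv")
    write(existing, "A,B\n1,2\n")
    remove_if_present([existing, absent])
end
println(length(removed))
```

Base Julia also offers `rm(path; force=true)`, which silently ignores a path that does not exist.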