# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), April 27, 2019**

In [1]:
using DataFrames # load package

## Load and save DataFrames
We do not cover all features of the packages used below; please refer to their documentation for details.

Here we'll load `CSV` and `CSVFiles` to read and write CSV files, and `Feather` and `Serialization`, which allow us to work with binary formats.

In [2]:
using CSV
using CSVFiles
using Serialization
using Feather

Let's create a simple `DataFrame` for testing purposes,

In [3]:
x = DataFrame(A=[true, false, true], B=[1, 2, missing],
              C=[missing, "b", "c"], D=['a', missing, 'c'])


| Row | A | B | C | D |
|-----|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | true | 1 | missing | 'a' |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | 'c' |


and use `eltypes` to look at the columnwise types.

In [4]:
eltypes(x)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  
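
The `Union{Missing, ...}` element types reported above follow from how Julia infers the element type of a vector that contains `missing`. A minimal sketch using only Base vectors (no `DataFrame` involved), with the same values as the columns of `x`:

```julia
# Plain vectors holding the same values as the columns of `x`.
a = [true, false, true]
b = [1, 2, missing]
c = [missing, "b", "c"]
d = ['a', missing, 'c']

# A vector literal that contains `missing` gets a Union{Missing, T} element type.
for (name, col) in zip("ABCD", (a, b, c, d))
    println(name, ": ", eltype(col))
end
```

Only the first vector has a plain element type because it is the only one without a `missing` entry.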

### CSV.jl

Let's use `CSV` to save `x` to disk; make sure `x1.csv` does not conflict with some file in your working directory.

In [5]:
CSV.write("x1.csv", x)

"x1.csv"

Now we can see how it was saved by reading `x1.csv`.

In [6]:
print(read("x1.csv", String))

A,B,C,D
true,1,,a
false,2,b,
true,,c,c


We can also load it back. Here `use_mmap=false` disables memory mapping so that on Windows the file can be deleted in the same session; on other operating systems it is not needed.

In [7]:
y = CSV.read("x1.csv", use_mmap=false)

| Row | A | B | C | D |
|-----|---|---|---|---|
| | Bool⍰ | Int64⍰ | String⍰ | String⍰ |
| 1 | true | 1 | missing | a |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | c |


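When a `DataFrame` is loaded from a CSV file, all columns allow `Missing` by default. Note that the column types have changed: `:A` is now `Union{Missing, Bool}` rather than `Bool`, and `:D` was read back as `String` rather than `Char`.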

In [8]:
eltypes(y)

4-element Array{Union,1}:
 Union{Missing, Bool}  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}
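
If the original column types matter, they can be restored by hand after loading. The following is a minimal sketch using only Base, not a feature of `CSV`; the vectors `a_read` and `d_read` are hypothetical stand-ins for columns `:A` and `:D` of `y`:

```julia
# Stand-ins for columns :A and :D as they came back from the CSV file.
a_read = Union{Missing, Bool}[true, false, true]
d_read = Union{Missing, String}["a", missing, "c"]

# :A contains no missing values, so it can be narrowed to Vector{Bool};
# convert would throw an error if a missing value were present.
a_restored = convert(Vector{Bool}, a_read)

# :D held single characters; take the first character and keep missing as is.
d_restored = [ismissing(s) ? missing : first(s) for s in d_read]

println(typeof(a_restored))
println(typeof(d_restored))
```

This assumes that every non-missing entry of `:D` is a one-character string, which holds for the data written above but not for CSV files in general.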

### CSVFiles.jl

Now we will use `CSVFiles` to achieve the same. First we save the file. Notice that we override the default `nastring`, which is `"NA"`, because we have missing values in non-numeric columns.

In [9]:
x |> save("x2.csv", nastring="")

Then we peek at the saved file:

In [10]:
print(read("x2.csv", String))

"A","B","C","D"
true,1,,a
false,2,"b",
true,,"c",c


We can load it back using `load`:

In [11]:
y = load("x2.csv") |> DataFrame

| Row | A | B | C | D |
|-----|---|---|---|---|
| | String | Int64⍰ | String | String |
| 1 | true | 1 | | a |
| 2 | false | 2 | b | |
| 3 | true | missing | c | c |


Let us check element types again:

In [12]:
eltypes(y)

4-element Array{Type,1}:
 String               
 Union{Missing, Int64}
 String               
 String               

Observe that in columns `:C` and `:D` missings were read back as empty strings, and that column `:A` was read as `String` rather than `Bool`.

### Serialization

Now we use serialization to save `x`. Note that in general, this process will not work if the reading and writing are done by different versions of Julia, or an instance of Julia with a different system image.

In [13]:
open("x.bin", "w") do io
    serialize(io, x)
end

Now we load the saved file back into the variable `y`. This time `y` has the same contents and column types as `x`.

In [14]:
y = open(deserialize, "x.bin")

| Row | A | B | C | D |
|-----|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | true | 1 | missing | 'a' |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | 'c' |


In [15]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  
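
The same round trip can be tried without touching the disk. This is a minimal sketch that serializes to an in-memory `IOBuffer`; the named tuple `data` is a hypothetical stand-in for a `DataFrame`, so that only the `Serialization` standard library is needed:

```julia
using Serialization

# A named tuple of vectors standing in for two columns of a DataFrame.
data = (A = [true, false, true], D = ['a', missing, 'c'])

io = IOBuffer()
serialize(io, data)   # write the binary representation into the buffer
seekstart(io)         # rewind so that reading starts at the first byte
restored = deserialize(io)

println(isequal(restored, data))
println(eltype(restored.D))
```

Unlike the CSV round trips, the `Char` element type survives here because serialization stores Julia values together with their types.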

### Feather.jl

Finally, we use the Feather format, which allows, in particular, data interchange with R or Python.

In [16]:
x.D = passmissing(string).(x.D) # Feather format does not support Char type

3-element Array{Union{Missing, String},1}:
 "a"    
 missing
 "c"    

In [17]:
Feather.write("x.feather", x)

"x.feather"

In [18]:
y = Feather.materialize("x.feather") # Feather.read is a lazy alternative

| Row | A | B | C | D |
|-----|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | true | 1 | missing | a |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | c |


In [19]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}
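
The `passmissing(string)` call used before writing to Feather wraps `string` so that a `missing` input is returned unchanged instead of being converted to the text `"missing"`. A rough Base-only equivalent, where `string_or_missing` is a hypothetical helper name introduced for illustration:

```julia
# Hypothetical helper: convert a value to String, but pass missing through.
string_or_missing(v) = ismissing(v) ? missing : string(v)

d = ['a', missing, 'c']
converted = string_or_missing.(d)   # broadcast over the vector

println(eltype(converted))

# For comparison, plain `string` turns missing into the text "missing".
println(string.(d))
```

The comparison at the end shows why broadcasting `string` directly would not be suitable here: the missing entry would be lost as data.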

### Basic benchmarking

Next, we'll create some files, so be careful that you don't already have files with these names on disk!

In particular, we'll time how long it takes to write a `DataFrame` with 10^3 rows and 10^2 columns.

In [20]:
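bigdf = DataFrame(rand(Bool, 10^3, 10^2)) # 1000 rows, 100 Bool columns
bigdf[1] = Int.(bigdf[1]) # make the first column integer
bigdf[2] = bigdf[2] .+ 0.5 # make the second column floating point
bigdf[3] = string.(bigdf[3], ", as string") # make the third column a string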
println("First run")
@time CSV.write("bigdf1.csv", bigdf)
@time bigdf |> save("bigdf2.csv")
@time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
@time Feather.write("bigdf.feather", bigdf)
println("Second run")
@time CSV.write("bigdf1.csv", bigdf)
@time bigdf |> save("bigdf2.csv")
@time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
@time Feather.write("bigdf.feather", bigdf)
getfield.(stat.(["bigdf1.csv", "bigdf2.csv", "bigdf.bin", "bigdf.feather"]), :size) # file sizes in bytes

First run
  0.745948 seconds (814.64 k allocations: 40.707 MiB, 4.48% gc time)
  1.050906 seconds (830.88 k allocations: 42.539 MiB, 2.87% gc time)
  0.213652 seconds (427.49 k allocations: 21.266 MiB)
  0.318751 seconds (373.35 k allocations: 25.882 MiB, 9.61% gc time)
Second run
  0.024630 seconds (54.97 k allocations: 1.052 MiB)
  0.016964 seconds (5.53 k allocations: 340.375 KiB)
  0.022761 seconds (4.81 k allocations: 243.089 KiB)
  0.018969 seconds (34.02 k allocations: 9.008 MiB)


4-element Array{Int64,1}:
 558378
 558578
  84439
  54072

In [21]:
println("First run")
@time CSV.read("bigdf1.csv")
@time load("bigdf2.csv") |> DataFrame
@time open(deserialize, "bigdf.bin")
@time Feather.materialize("bigdf.feather")
println("Second run")
@time CSV.read("bigdf1.csv")
@time load("bigdf2.csv") |> DataFrame
@time open(deserialize, "bigdf.bin")
@time Feather.materialize("bigdf.feather");

First run
  0.622640 seconds (802.64 k allocations: 28.521 MiB, 2.97% gc time)
  1.679781 seconds (2.60 M allocations: 218.270 MiB, 8.63% gc time)
  0.042708 seconds (72.82 k allocations: 2.008 MiB)
  0.673740 seconds (639.10 k allocations: 32.375 MiB, 4.97% gc time)
Second run
  0.048690 seconds (52.07 k allocations: 1.326 MiB)
  0.099486 seconds (506.49 k allocations: 108.821 MiB, 54.91% gc time)
  0.006630 seconds (49.60 k allocations: 951.719 KiB)
  0.007290 seconds (26.39 k allocations: 1.200 MiB)
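
In both benchmarks the first run of each call is much slower than the second. In Julia this gap usually comes mostly from just-in-time compilation, which happens the first time a function is called with a given combination of argument types. A minimal sketch of the effect, where `sumsq` is a hypothetical function unrelated to the packages above and the measured times will differ between machines:

```julia
# `sumsq` is a hypothetical function used only to illustrate the effect.
sumsq(v) = sum(x -> x^2, v)

v = rand(10^5)
t_first = @elapsed sumsq(v)    # includes compilation of sumsq for Vector{Float64}
t_second = @elapsed sumsq(v)   # reuses the already compiled code
println("first call:  ", t_first, " seconds")
println("second call: ", t_second, " seconds")
```

For this reason the "Second run" timings are the more meaningful ones when comparing the formats.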


Finally, let's clean up. Do not run the next cell unless you are sure that it will not erase your important files.
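
The cleanup cell in this section assumes that every listed file exists, since `rm` throws an error for a missing path. A more forgiving variant, sketched here under the assumption that silently skipping absent files is acceptable, passes `force=true`. To keep the sketch harmless it works in a temporary directory, and the file names are only examples:

```julia
# Work in a fresh temporary directory so that no real files are touched.
dir = mktempdir()
files = joinpath.(dir, ["x1.csv", "x.bin"])
touch(files[1])   # create only the first file; the second one does not exist

# With force=true, rm does not raise an error for a path that is absent.
foreach(f -> rm(f; force=true), files)

println(any(isfile, files))
```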

In [22]:
foreach(rm, ["x1.csv", "x2.csv", "x.bin", "x.feather",
             "bigdf1.csv", "bigdf2.csv", "bigdf.bin", "bigdf.feather"])