# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), August 16, 2019**

In [1]:
using Pkg
Pkg.activate(".")

[32m[1mActivating[22m[39m environment at `d:\Dev\Julia\DataFrames_Tutorial\Project.toml`


In [2]:
using DataFrames

## Load and save DataFrames
We do not cover all features of the packages. Please refer to their documentation to learn them.

Here we'll load `CSV` and `CSVFiles` to read and write CSV files, `Feather` and `Serialization` to work with binary formats, and `JSONTables` for JSON interaction.

In [3]:
using CSV
using CSVFiles
using Serialization
using Feather
using JSONTables

┌ Info: Recompiling stale cache file C:\Users\bogum\.julia\compiled\v1.2\CSV\HHBkp.ji for CSV [336ed68f-0bac-5ca0-87d4-7b16caf5d00b]
└ @ Base loading.jl:1240
┌ Info: Recompiling stale cache file C:\Users\bogum\.julia\compiled\v1.2\CSVFiles\kq3Uy.ji for CSVFiles [5d742f6a-9f54-50ce-8119-2520741973ca]
└ @ Base loading.jl:1240
┌ Info: Recompiling stale cache file C:\Users\bogum\.julia\compiled\v1.2\Feather\RgcL0.ji for Feather [becb17da-46f6-5d3c-ad1b-1c5fe96bc73c]
└ @ Base loading.jl:1240


Let's create a simple `DataFrame` for testing purposes,

In [4]:
x = DataFrame(A=[true, false, true], B=[1, 2, missing],
              C=[missing, "b", "c"], D=['a', missing, 'c'])


|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | 1 | 1 | missing | 'a' |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | 'c' |


and use `eltypes` to look at the columnwise types.

In [5]:
eltypes(x)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  

### CSV.jl

Let's use `CSV` to save `x` to disk; make sure `x1.csv` does not conflict with some file in your working directory.

In [6]:
CSV.write("x1.csv", x)

"x1.csv"

Now we can see how it was saved by reading `x1.csv`.

In [7]:
print(read("x1.csv", String))

A,B,C,D
true,1,,a
false,2,b,
true,,c,c


We can also load it back. Setting `use_mmap=false` disables memory mapping so that on Windows the file can be deleted in the same session; on other operating systems it is not needed.

In [8]:
y = CSV.read("x1.csv", use_mmap=false)

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | 1 | 1 | missing | a |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | c |


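When loading a `DataFrame` from a CSV file, the column types can change: here `:D` held `Char` values in `x` but is read back as `String`. In this output only the columns that contain missing values allow `Missing`; `:A` is read back as plain `Bool`.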

In [9]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}

### CSVFiles.jl

Now we will use `CSVFiles` to achieve the same. First we save the file. Notice that we override the default `nastring`, which is `"NA"`, because we have missings in non-numeric columns.

In [10]:
x |> save("x2.csv", nastring="")

and peek at the saved file:

In [11]:
print(read("x2.csv", String))

"A","B","C","D"
true,1,,a
false,2,"b",
true,,"c",c


We can load it back using `load`:

In [12]:
y = load("x2.csv") |> DataFrame

|   | A | B | C | D |
|---|---|---|---|---|
|   | String | Int64⍰ | String | String |
| 1 | True | 1 |  | a |
| 2 | False | 2 | b |  |
| 3 | True | missing | c | c |


Let us check element types again:

In [13]:
eltypes(y)

4-element Array{Type,1}:
 String               
 Union{Missing, Int64}
 String               
 String               

Observe that in columns `:C` and `:D` missings were read back as empty strings, and that column `:A` was read back as `String` rather than `Bool`.
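
If the original types are needed, the columns can be repaired after loading. The following is a minimal sketch using only Base Julia; the vectors `a_raw` and `c_raw` are hypothetical stand-ins for the columns shown above, not part of the tutorial's data. It assumes that every empty string really was a missing value, which may not hold for your data.

```julia
# Hypothetical stand-ins for the columns read back by CSVFiles above.
a_raw = ["True", "False", "True"]   # column :A came back as String
c_raw = ["", "b", "c"]              # column :C, with a missing read as ""

# replace widens the element type so that `missing` fits in the result.
c_fixed = replace(c_raw, "" => missing)

# parse(Bool, ...) accepts "true"/"false", so lowercase the strings first.
a_fixed = parse.(Bool, lowercase.(a_raw))
```

In a `DataFrame` the same calls could be applied column by column, for example to `y.C` and `y.A`.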

### Serialization

Now we use serialization to save `x`. Note that in general, this process will not work if the reading and writing are done by different versions of Julia, or an instance of Julia with a different system image.

In [14]:
open("x.bin", "w") do io
    serialize(io, x)
end

Now we load the saved file back into the variable `y`. This time `y` is identical to `x`, including the `Char` column.

In [15]:
y = open(deserialize, "x.bin")

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | 1 | 1 | missing | 'a' |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | 'c' |


In [16]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  
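
The round trip above can also be tried without touching the disk. This is a small sketch, assuming an in-memory `IOBuffer` in place of the file `x.bin` and a hypothetical vector `col` standing in for one column of `x`; it only illustrates that element types such as `Char` survive serialization.

```julia
using Serialization

# Hypothetical stand-in for one column of `x`: Char values plus a missing.
col = ['a', missing, 'c']

io = IOBuffer()
serialize(io, col)          # write the binary representation into memory
seekstart(io)               # rewind the buffer before reading it back
restored = deserialize(io)  # same values and same element type as `col`
```

Because nothing is converted to text, no type information is lost, which is what distinguishes this approach from the CSV round trips earlier.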

### JSONTables.jl

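Often you might need to read and write data stored in JSON format. JSONTables.jl provides a way to process it in a row-oriented or a column-oriented layout. We present both options below: `arraytable` writes an array of row objects, and `objecttable` writes an object of column arrays.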

In [17]:
open(io -> arraytable(io, x), "x1.json", "w")

106

In [18]:
open(io -> objecttable(io, x), "x2.json", "w")

76

In [19]:
print(read("x1.json", String))

[{"A":true,"B":1,"C":null,"D":"a"},{"A":false,"B":2,"C":"b","D":null},{"A":true,"B":null,"C":"c","D":"c"}]

In [20]:
print(read("x2.json", String))

{"A":[true,false,true],"B":[1,2,null],"C":[null,"b","c"],"D":["a",null,"c"]}
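
The two files differ only in layout. As a rough illustration in plain Julia (the names `cols` and `rows` are hypothetical and only two columns are used), the same data can be held either as one vector per column or as one record per row:

```julia
# Column-oriented: one named vector per column, like the contents of x2.json.
cols = (A = [true, false, true], B = [1, 2, missing])

# Row-oriented: one named tuple per row, like the contents of x1.json.
rows = [(A = a, B = b) for (a, b) in zip(cols.A, cols.B)]

println(rows[2])  # the second row as a single record
```

The row-oriented file repeats every column name in every row, which is consistent with `x1.json` being the larger of the two files written above.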

In [21]:
y1 = open(jsontable, "x1.json") |> DataFrame

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | 1 | 1 | missing | a |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | c |


In [22]:
eltypes(y1)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}

In [23]:
y2 = open(jsontable, "x2.json") |> DataFrame

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | 1 | 1 | missing | a |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | c |


In [24]:
eltypes(y2)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}

### Feather.jl

Finally, we use the Feather format, which allows, in particular, for data interchange with R or Python.

In [25]:
x.D = passmissing(string).(x.D) # Feather format does not support Char type

3-element Array{Union{Missing, String},1}:
 "a"    
 missing
 "c"    
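
The call `passmissing(string)` above applies `string` to each value but leaves `missing` untouched. A minimal Base-only sketch of the same conversion, using a hypothetical vector `d` in place of `x.D`, could look like this:

```julia
# Hypothetical stand-in for column :D before the conversion.
d = ['a', missing, 'c']

# Convert each Char to a String, passing missing values through unchanged.
d_str = [ismissing(c) ? missing : string(c) for c in d]
```

Calling `string.(d)` directly would not do the same thing, because `string(missing)` returns the text `"missing"` instead of a missing value.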

In [26]:
Feather.write("x.feather", x)

"x.feather"

In [27]:
y = Feather.materialize("x.feather") # Feather.read is a lazy alternative

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | 1 | 1 | missing | a |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | c |


In [28]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}

### Basic benchmarking

Next, we'll create some files, so be careful that you don't already have files with these names on disk!

In particular, we'll time how long it takes us to write a `DataFrame` with 10^3 rows and 10^2 columns.

In [29]:
bigdf = DataFrame(rand(Bool, 10^3, 10^2))
bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")
println("First run")
@time CSV.write("bigdf1.csv", bigdf)
@time bigdf |> save("bigdf2.csv")
@time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
@time Feather.write("bigdf.feather", bigdf)
@time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
@time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
println("Second run")
@time CSV.write("bigdf1.csv", bigdf)
@time bigdf |> save("bigdf2.csv")
@time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
@time Feather.write("bigdf.feather", bigdf)
@time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
@time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
getfield.(stat.(["bigdf1.csv", "bigdf2.csv", "bigdf.bin", "bigdf.feather", "bigdf1.json", "bigdf2.json"]), :size)

First run
  0.447905 seconds (1.32 M allocations: 67.134 MiB, 4.91% gc time)
  0.477190 seconds (835.73 k allocations: 43.375 MiB, 2.77% gc time)
  0.099662 seconds (417.81 k allocations: 21.023 MiB, 7.08% gc time)
  0.116537 seconds (376.11 k allocations: 26.321 MiB, 5.44% gc time)
  0.304354 seconds (2.59 M allocations: 78.502 MiB, 7.63% gc time)
  0.085929 seconds (249.27 k allocations: 13.715 MiB)
Second run
  0.006285 seconds (6.51 k allocations: 290.703 KiB)
  0.007105 seconds (5.43 k allocations: 339.016 KiB)
  0.009119 seconds (4.59 k allocations: 239.058 KiB)
  0.018969 seconds (33.62 k allocations: 8.987 MiB, 43.89% gc time)
  0.132593 seconds (2.07 M allocations: 50.173 MiB, 5.21% gc time)
  0.009554 seconds (10.32 k allocations: 488.453 KiB)


6-element Array{Int64,1}:
  558570
  558770
   84028
   54064
 1152179
  558971

In [30]:
println("First run")
@time CSV.read("bigdf1.csv")
@time load("bigdf2.csv") |> DataFrame
@time open(deserialize, "bigdf.bin")
@time Feather.materialize("bigdf.feather")
@time open(jsontable, "bigdf1.json")
@time open(jsontable, "bigdf2.json")
println("Second run")
@time CSV.read("bigdf1.csv")
@time load("bigdf2.csv") |> DataFrame
@time open(deserialize, "bigdf.bin")
@time Feather.materialize("bigdf.feather")
@time open(jsontable, "bigdf1.json")
@time open(jsontable, "bigdf2.json");

First run
  0.053163 seconds (61.81 k allocations: 3.462 MiB)
  0.615034 seconds (2.28 M allocations: 202.273 MiB, 9.07% gc time)
  0.020315 seconds (71.66 k allocations: 1.974 MiB)
  0.244456 seconds (572.36 k allocations: 29.795 MiB, 6.08% gc time)
  0.005495 seconds (31 allocations: 1.100 MiB)
  0.003602 seconds (26 allocations: 547.375 KiB)
Second run
  0.004757 seconds (1.12 k allocations: 69.320 KiB)
  0.052420 seconds (502.31 k allocations: 107.977 MiB, 53.81% gc time)
  0.002674 seconds (49.60 k allocations: 951.563 KiB)
  0.003510 seconds (26.48 k allocations: 1.204 MiB)
  0.005838 seconds (31 allocations: 1.100 MiB)
  0.003441 seconds (26 allocations: 547.375 KiB)
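
Each operation is timed twice because the first call to a Julia function usually includes just-in-time compilation, so the second run is typically the better guide to steady-state speed. The sketch below shows the effect on a hypothetical function, `sum_of_squares`, that is unrelated to the packages above; the actual numbers depend on the machine.

```julia
# Hypothetical workload; any function that has not been called yet will do.
sum_of_squares(v) = sum(abs2, v)

v = rand(10^3)
first_call  = @elapsed sum_of_squares(v)   # usually includes compilation time
second_call = @elapsed sum_of_squares(v)   # runs the already compiled code

println("first: ", first_call, " s, second: ", second_call, " s")
```

For more careful measurements, repeating each call many times and looking at the minimum or median is a common approach.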


Finally, let's clean up. Do not run the next cell unless you are sure that it will not erase your important files.

In [31]:
foreach(rm, ["x1.csv", "x2.csv", "x.bin", "x.feather", "x1.json", "x2.json",
             "bigdf1.csv", "bigdf2.csv", "bigdf.bin", "bigdf.feather", "bigdf1.json", "bigdf2.json"])