# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), August 14, 2019**

In [1]:
using DataFrames

## Load and save DataFrames
We do not cover all features of the packages used here; please refer to their documentation for details.

Here we load `CSV` and `CSVFiles` to read and write CSV files, `Serialization` and `Feather` to work with binary formats, and `JSONTables` to work with JSON.

In [2]:
using CSV
using CSVFiles
using Serialization
using Feather
using JSONTables

Let's create a simple `DataFrame` for testing purposes,

In [3]:
x = DataFrame(A=[true, false, true], B=[1, 2, missing],
              C=[missing, "b", "c"], D=['a', missing, 'c'])


| | A | B | C | D |
|---|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | true | 1 | missing | 'a' |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | 'c' |


and use `eltypes` to look at the columnwise types.

In [4]:
eltypes(x)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  

### CSV.jl

Let's use `CSV` to save `x` to disk; make sure `x1.csv` does not conflict with some file in your working directory.

In [5]:
CSV.write("x1.csv", x)

"x1.csv"

Now we can see how it was saved by reading `x1.csv`.

In [6]:
print(read("x1.csv", String))

A,B,C,D
true,1,,a
false,2,b,
true,,c,c


We can also load it back. Passing `use_mmap=false` disables memory mapping so that on Windows the file can be deleted in the same session; on other operating systems it is not needed.

In [7]:
y = CSV.read("x1.csv", use_mmap=false)

| | A | B | C | D |
|---|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | true | 1 | missing | a |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | c |


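When loading a `DataFrame` from a CSV file, the columns that contain missing values are read back with element types that allow `Missing`. Note that the type of column `:D` has changed: it was `Char` in `x`, but a CSV file stores no type information, so the column is read back as `String`.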

In [8]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}
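
If the original `Char` type of column `:D` matters, it has to be restored by hand after loading. The following is a minimal sketch in Base Julia on a plain vector that stands in for `y.D`. The helper `tochar` is a hypothetical name, not part of CSV.jl or DataFrames.jl, and it assumes that every non-missing string is non-empty.

```julia
# Stand-in for y.D as read back from the CSV file: strings, not characters.
d = ["a", missing, "c"]

# Hypothetical helper: take the first character and propagate missing.
tochar(s) = ismissing(s) ? missing : first(s)

d_chars = tochar.(d)
println(eltype(d_chars))
```

With a `DataFrame`, the same conversion could be assigned back with `y.D = tochar.(y.D)`.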

### CSVFiles.jl

Now we will use `CSVFiles` to achieve the same. First we save the file. Notice that we override the default `nastring`, which is `"NA"`, because we have missing values in non-numeric columns.

In [9]:
x |> save("x2.csv", nastring="")

and peek at the saved file:

In [10]:
print(read("x2.csv", String))

"A","B","C","D"
true,1,,a
false,2,"b",
true,,"c",c


We can load it back using `load`:

In [11]:
y = load("x2.csv") |> DataFrame

| | A | B | C | D |
|---|---|---|---|---|
| | String | Int64⍰ | String | String |
| 1 | true | 1 | | a |
| 2 | false | 2 | b | |
| 3 | true | missing | c | c |


Let us check element types again:

In [12]:
eltypes(y)

4-element Array{Type,1}:
 String               
 Union{Missing, Int64}
 String               
 String               

Observe that in columns `:C` and `:D` missing values were read back as empty strings, and that column `:A` was read back as `String` rather than `Bool`.
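
These types can be repaired after loading. Here is a minimal sketch in Base Julia, using plain vectors as stand-ins for the columns of `y`. It assumes that an empty string never represents real data and that column `:A` holds the lowercase text seen in the saved file.

```julia
# Stand-ins for y.A and y.C as loaded above.
a = ["true", "false", "true"]
c = ["", "b", "c"]

# Parse the textual booleans back into Bool values.
a_bool = parse.(Bool, a)

# Turn empty strings back into missing; the element type widens automatically.
c_fixed = replace(c, "" => missing)

println(eltype(a_bool), " ", eltype(c_fixed))
```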

### Serialization

Now we use serialization to save `x`. Note that in general, this process will not work if the reading and writing are done by different versions of Julia, or an instance of Julia with a different system image.

In [13]:
open("x.bin", "w") do io
    serialize(io, x)
end

Now we load the saved file back into the variable `y`. This time `y` is identical to `x`, including the `Char` element type of column `:D`.

In [14]:
y = open(deserialize, "x.bin")

| | A | B | C | D |
|---|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | true | 1 | missing | 'a' |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | 'c' |


In [15]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  
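
The round trip preserves the types because the serializer stores Julia objects together with their type information. Below is a minimal sketch with the `Serialization` standard library. It uses an in-memory buffer and a `NamedTuple` of vectors as a stand-in for the `DataFrame`, so no file is created.

```julia
using Serialization

# Stand-in for the columns of x: a NamedTuple of vectors, not a DataFrame.
cols = (A = [true, false, true], B = [1, 2, missing], D = ['a', missing, 'c'])

io = IOBuffer()
serialize(io, cols)      # write the object to the buffer
seekstart(io)            # rewind before reading
restored = deserialize(io)

println(typeof(restored) == typeof(cols))
println(isequal(restored, cols))
```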

### JSONTables.jl

Often you might need to read and write data stored in JSON format. JSONTables.jl provides a way to process it in a row-oriented layout (`arraytable`) or a column-oriented layout (`objecttable`). We present both options below.

In [16]:
open(io -> arraytable(io, x), "x1.json", "w")

106

In [17]:
open(io -> objecttable(io, x), "x2.json", "w")

76

In [18]:
print(read("x1.json", String))

[{"A":true,"B":1,"C":null,"D":"a"},{"A":false,"B":2,"C":"b","D":null},{"A":true,"B":null,"C":"c","D":"c"}]

In [19]:
print(read("x2.json", String))

{"A":[true,false,true],"B":[1,2,null],"C":[null,"b","c"],"D":["a",null,"c"]}
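
The two files hold the same data in different layouts. As an illustration only, the sketch below uses Base Julia `NamedTuple`s and does not call JSONTables.jl. The column-oriented layout corresponds to one object with a vector per column, and the row-oriented layout to a vector with one object per row.

```julia
# Column-oriented: one NamedTuple holding a vector per column (like x2.json).
columns = (A = [true, false, true], B = [1, 2, missing])

# Row-oriented: a vector holding one NamedTuple per row (like x1.json).
rows = [(A = a, B = b) for (a, b) in zip(columns.A, columns.B)]

# Going back from rows to columns.
columns_again = (A = [r.A for r in rows], B = [r.B for r in rows])

println(length(rows), " rows, ", length(columns), " columns")
```

The row-oriented file repeats every column name in every row, which is why `x1.json` (106 bytes written) is larger than `x2.json` (76 bytes written).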

In [20]:
y1 = open(jsontable, "x1.json") |> DataFrame

| | A | B | C | D |
|---|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | true | 1 | missing | a |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | c |


In [21]:
eltypes(y1)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}

In [22]:
y2 = open(jsontable, "x2.json") |> DataFrame

| | A | B | C | D |
|---|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | true | 1 | missing | a |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | c |


In [23]:
eltypes(y2)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}

### Feather.jl

Finally, we use the Feather format, which allows, in particular, for data interchange with R or Python.
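
The next cell converts column `:D` with `passmissing(string)` rather than plain `string`, because `string(missing)` returns the text `"missing"`, which would silently turn the missing value into data. Here is a small sketch of the difference on a plain vector. The comprehension is a Base-only stand-in for what `passmissing(string)` does here, not its implementation.

```julia
d = ['a', missing, 'c']

# Naive conversion: the missing value becomes the string "missing".
naive = string.(d)

# Base-only stand-in for passmissing(string): keep missing as missing.
safe = [ismissing(c) ? missing : string(c) for c in d]

println(naive)
println(eltype(safe))
```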

In [24]:
x.D = passmissing(string).(x.D) # Feather format does not support Char type

3-element Array{Union{Missing, String},1}:
 "a"    
 missing
 "c"    

In [25]:
Feather.write("x.feather", x)

"x.feather"

In [26]:
y = Feather.materialize("x.feather") # Feather.read is a lazy alternative

| | A | B | C | D |
|---|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | true | 1 | missing | a |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | c |


In [27]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}

### Basic benchmarking

Next, we'll create some files, so be careful that you don't already have files with these names on disk!

In particular, we'll time how long it takes us to write, and then read back, a `DataFrame` with 10^3 rows and 10^2 columns.

In [28]:
bigdf = DataFrame(rand(Bool, 10^3, 10^2))
bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")
println("First run")
@time CSV.write("bigdf1.csv", bigdf)
@time bigdf |> save("bigdf2.csv")
@time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
@time Feather.write("bigdf.feather", bigdf)
@time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
@time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
println("Second run")
@time CSV.write("bigdf1.csv", bigdf)
@time bigdf |> save("bigdf2.csv")
@time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
@time Feather.write("bigdf.feather", bigdf)
@time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
@time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
getfield.(stat.(["bigdf1.csv", "bigdf2.csv", "bigdf.bin", "bigdf.feather", "bigdf1.json", "bigdf2.json"]), :size)

First run
  0.500097 seconds (1.37 M allocations: 68.644 MiB, 6.28% gc time)
  0.512602 seconds (929.64 k allocations: 47.472 MiB, 3.07% gc time)
  0.112955 seconds (427.33 k allocations: 21.272 MiB, 6.25% gc time)
  0.135171 seconds (397.79 k allocations: 27.029 MiB, 5.00% gc time)
  0.434002 seconds (4.08 M allocations: 107.839 MiB, 6.16% gc time)
  0.161530 seconds (1.47 M allocations: 36.141 MiB, 7.15% gc time)
Second run
  0.009997 seconds (6.53 k allocations: 291.453 KiB)
  0.007837 seconds (5.54 k allocations: 342.203 KiB)
  0.007937 seconds (4.81 k allocations: 243.620 KiB)
  0.008117 seconds (34.02 k allocations: 9.020 MiB)
  0.200410 seconds (3.42 M allocations: 73.130 MiB, 6.06% gc time)
  0.048523 seconds (1.17 M allocations: 19.665 MiB, 7.90% gc time)


6-element Array{Int64,1}:
  558403
  558603
   84362
   54088
 1152012
  558804
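
The gap between the first and the second run is largely Julia's just-in-time compilation, which happens the first time each method is called, so the second run is the better guide to steady-state speed. Below is a minimal sketch of the same measurement pattern using only the standard library. The helper `write_once` and the temporary path are illustrative assumptions, and `filesize` is used as a shorter way to query the size of a file.

```julia
using Serialization

data = rand(10^3, 10^2)                      # 10^3 x 10^2 matrix of Float64
path = joinpath(mktempdir(), "data.bin")     # temporary location, not the working directory

# Hypothetical helper wrapping the write we want to time.
write_once() = open(io -> serialize(io, data), path, "w")

t_first = @elapsed write_once()    # includes compilation
t_second = @elapsed write_once()   # compiled code only
nbytes = filesize(path)

println("first: ", t_first, " s, second: ", t_second, " s, size: ", nbytes, " bytes")
```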

In [29]:
println("First run")
@time CSV.read("bigdf1.csv")
@time load("bigdf2.csv") |> DataFrame
@time open(deserialize, "bigdf.bin")
@time Feather.materialize("bigdf.feather")
@time open(jsontable, "bigdf1.json")
@time open(jsontable, "bigdf2.json")
println("Second run")
@time CSV.read("bigdf1.csv")
@time load("bigdf2.csv") |> DataFrame
@time open(deserialize, "bigdf.bin")
@time Feather.materialize("bigdf.feather")
@time open(jsontable, "bigdf1.json")
@time open(jsontable, "bigdf2.json");

First run
  0.062096 seconds (86.70 k allocations: 4.852 MiB)
  0.789684 seconds (2.48 M allocations: 211.370 MiB, 19.56% gc time)
  0.020529 seconds (72.82 k allocations: 2.008 MiB)
  0.280229 seconds (640.21 k allocations: 32.480 MiB, 2.18% gc time)
  0.005569 seconds (41 allocations: 1.101 MiB)
  0.004199 seconds (36 allocations: 547.359 KiB)
Second run
  0.004470 seconds (2.15 k allocations: 85.773 KiB)
  0.052487 seconds (506.70 k allocations: 108.055 MiB, 51.70% gc time)
  0.002924 seconds (49.60 k allocations: 951.969 KiB)
  0.003304 seconds (26.39 k allocations: 1.201 MiB)
  0.005846 seconds (41 allocations: 1.101 MiB)
  0.003070 seconds (36 allocations: 547.359 KiB)


Finally, let's clean up. Do not run the next cell unless you are sure that it will not erase your important files.
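
The cleanup cell that follows calls `rm` on every file, and `rm` throws an error if a file does not exist, for example when an earlier cell was skipped. A more forgiving variant, sketched here in a temporary directory with made-up file names, passes `force=true` so that absent files are ignored:

```julia
dir = mktempdir()
existing = joinpath(dir, "x1.csv")
absent = joinpath(dir, "x2.csv")
touch(existing)                     # create only one of the two files

# force=true: a non-existent path is not treated as an error.
foreach(f -> rm(f; force=true), [existing, absent])

println(isfile(existing))
```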

In [30]:
foreach(rm, ["x1.csv", "x2.csv", "x.bin", "x.feather", "x1.json", "x2.json",
             "bigdf1.csv", "bigdf2.csv", "bigdf.bin", "bigdf.feather", "bigdf1.json", "bigdf2.json"])