# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), August 30, 2019**

In [1]:
using Pkg
Pkg.activate(".")

[32m[1mActivating[22m[39m environment at `C:\Users\RTX2080\git\Julia-DataFrames-Tutorial\Project.toml`


In [2]:
using DataFrames

## Load and save DataFrames
We do not cover all features of the packages. Please refer to their documentation to learn them.

Here we'll load `CSV` and `CSVFiles` to read and write CSV files; `Serialization`, `JLSO`, `JDF`, and `Feather` to work with binary formats; and `JSONTables` for JSON interaction. `CodecZlib` and `ZipFile` are used later for compressed files.

In [3]:
using CSV
using CSVFiles
using Serialization
using JLSO
using Feather
using JSONTables
using CodecZlib
using ZipFile
using JDF

┌ Info: Recompiling stale cache file C:\Users\RTX2080\.julia\compiled\v1.2\JDF\7FsP0.ji for JDF [babc3d20-cd49-4f60-a736-a8f9c08892d3]
└ @ Base loading.jl:1240


JDF: parallel save/load do not work in < Julia 1.3


Let's create a simple `DataFrame` for testing purposes,

In [4]:
x = DataFrame(A=[true, false, true], B=[1, 2, missing],
              C=[missing, "b", "c"], D=['a', missing, 'c'])


| Row | A | B | C | D |
|-----|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | 1 | 1 | missing | 'a' |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | 'c' |


and use `eltypes` to look at the columnwise types.

In [5]:
eltypes(x)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  

### CSV.jl

Let's use `CSV` to save `x` to disk; make sure `x1.csv` does not conflict with some file in your working directory.

In [6]:
CSV.write("x1.csv", x)

"x1.csv"

Now we can see how it was saved by reading `x1.csv`.

In [7]:
print(read("x1.csv", String))

A,B,C,D
true,1,,a
false,2,b,
true,,c,c


We can also load it back (`use_mmap=false` disables memory mapping so that on Windows the file can be deleted in the same session; on other operating systems it is not needed).

In [8]:
y = CSV.read("x1.csv", use_mmap=false)

| Row | A | B | C | D |
|-----|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | 1 | 1 | missing | a |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | c |


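When a `DataFrame` is loaded from a CSV file, the column types are inferred from the text. In the output above, the columns that contain missing values allow `Missing`, while `:A` is read back as plain `Bool`. Note that the column types have changed: `:D` held `Char` values in `x` but is now a `String` column!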

In [9]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}
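
The change of `:D` from `Char` to `String` follows from the fact that a CSV file stores only text, so a reader has to guess the types. The following is a minimal sketch of that idea using only Base Julia. The `parse_field` helper and its rules are hypothetical and much simpler than what CSV.jl actually does.

```julia
# Hypothetical, simplified inference rule (not CSV.jl's algorithm):
# empty field -> missing, parseable as Int -> Int, otherwise keep the text.
function parse_field(field::AbstractString)
    isempty(field) && return missing
    n = tryparse(Int, field)
    return n === nothing ? String(field) : n
end

raw_B = ["1", "2", ""]   # column B as it appears in x1.csv
raw_D = ["a", "", "c"]   # column D as it appears in x1.csv

B = parse_field.(raw_B)
D = parse_field.(raw_D)

println(eltype(B))
println(eltype(D))
```

Nothing in the text `a` says that it was originally a `Char`, so under this rule it can only come back as a `String`.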

### CSVFiles.jl

Now we will use `CSVFiles` to achieve the same. First we save the file. Notice that we override the default `nastring`, which is `"NA"`, because we have missings in non-numeric columns.

In [10]:
x |> save("x2.csv", nastring="")

and peek at the saved file:

In [11]:
print(read("x2.csv", String))

"A","B","C","D"
true,1,,a
false,2,"b",
true,,"c",c


We can load it back using `load`:

In [12]:
y = load("x2.csv") |> DataFrame

| Row | A | B | C | D |
|-----|---|---|---|---|
| | String | Int64⍰ | String | String |
| 1 | True | 1 | | a |
| 2 | False | 2 | b | |
| 3 | True | missing | c | c |


Let us check element types again:

In [13]:
eltypes(y)

4-element Array{Type,1}:
 String               
 Union{Missing, Int64}
 String               
 String               

Observe that in columns `:C` and `:D` missings were read back as empty strings, and that column `:A` was read back as `String` rather than `Bool`.

### Serialization, JDF.jl, and JLSO.jl

#### Serialization

Now we use serialization to save `x`.

There are two ways to perform serialization. The first is to use `Serialization.serialize`, as shown below.

Note that, in general, this process will not work if the reading and writing are done by different versions of Julia, or by an instance of Julia with a different system image.

In [14]:
open("x.bin", "w") do io
    serialize(io, x)
end

Now we load the saved file back into the variable `y`. This time `y` is identical to `x`, including the column element types. However, beware that if your session does not have DataFrames.jl loaded, the deserialized content may not be recognised as a `DataFrame`.

In [15]:
y = open(deserialize, "x.bin")

| Row | A | B | C | D |
|-----|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | 1 | 1 | missing | 'a' |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | 'c' |


In [16]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  
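
The round trip above can also be tried without any packages. The sketch below uses only the `Serialization` standard library and a temporary file; the `NamedTuple` of columns is an assumption that stands in for a `DataFrame` so that the example has no dependencies.

```julia
using Serialization

# A plain NamedTuple of columns stands in for a DataFrame here (assumption:
# this keeps the example free of package dependencies).
cols = (A = [true, false, true], B = [1, 2, missing], D = ['a', missing, 'c'])

path, io = mktemp()   # temporary file, so the working directory is not touched
serialize(io, cols)
close(io)

restored = open(deserialize, path)
rm(path)

# isequal treats missing as equal to missing, unlike ==
same_values = isequal(restored, cols)
same_types = eltype(restored.D) == Union{Missing, Char}
println((same_values, same_types))
```

Unlike the CSV round trip, the element types survive here because the serialized stream records them alongside the values.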

#### JDF.jl

[JDF.jl](https://github.com/xiaodaigh/JDF) is a relatively new package designed to serialize DataFrames. You can save a DataFrame with the `savejdf` function.

In [23]:
savejdf("x.jdf", x);

To load the saved JDF file, one can use the `loadjdf` function.

In [24]:
x_loaded = loadjdf("x.jdf")

| Row | A | B | C | D |
|-----|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | 1 | 1 | missing | 'a' |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | 'c' |


You can see that they are the same.

In [25]:
isequal(x_loaded, x)

true

JDF.jl can load only selected columns from disk, which helps when working with large files.

In [26]:
# set up a JDFFile, which is an on-disk representation of `x` backed by JDF.jl
x_ondisk = jdf"x.jdf"

JDFFile{String}("x.jdf")

We can see all the column names of `x` without loading it into memory.

In [27]:
names(x_ondisk)

4-element Array{Symbol,1}:
 :A
 :B
 :C
 :D

Below is an example of how to load only columns `:A` and `:D`.

In [29]:
xd = sloadjdf(x_ondisk; cols = [:A, :D])

| Row | A | D |
|-----|---|---|
| | Bool | Char⍰ |
| 1 | 1 | 'a' |
| 2 | 0 | missing |
| 3 | 1 | 'c' |


##### JDF.jl vs others

JDF.jl is specialized to DataFrames and only supports a restricted list of column types, so it cannot save DataFrames with arbitrary column types. However, this also means that JDF.jl has specialized algorithms to serialize the types it supports, which optimize speed, minimize disk usage, and reduce the chance of errors.

The column types supported by JDF include

```julia
WeakRefStrings.StringVector
Vector{T}, Vector{Union{Missing, T}}, Vector{Union{Nothing, T}}
CategoricalArrays.CategoricalVector{T}
```

where `T` can be `String`, `Bool`, `Symbol`, `Char`, `TimeZones.ZonedDateTime` (experimental), and `isbits` types, i.e. `UInt*`, `Int*`, `Float*`, `Date*` types, etc.

#### JLSO.jl

Another way to perform serialization is by using the [JLSO.jl](https://github.com/invenia/JLSO.jl) library:

In [19]:
JLSO.save("x.jlso", x)

Now we can load the file back into `y`.

In [18]:
y = JLSO.load("x.jlso")["data"]

| Row | A | B | C | D |
|-----|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | 1 | 1 | missing | 'a' |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | 'c' |


In [19]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  

### JSONTables.jl

Often you might need to read and write data stored in JSON format. JSONTables.jl provides a way to process them in row-oriented or column-oriented layout. We present both options below.

In [20]:
open(io -> arraytable(io, x), "x1.json", "w")

106

In [21]:
open(io -> objecttable(io, x), "x2.json", "w")

76

In [22]:
print(read("x1.json", String))

[{"A":true,"B":1,"C":null,"D":"a"},{"A":false,"B":2,"C":"b","D":null},{"A":true,"B":null,"C":"c","D":"c"}]

In [23]:
print(read("x2.json", String))

{"A":[true,false,true],"B":[1,2,null],"C":[null,"b","c"],"D":["a",null,"c"]}
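
The two files printed above hold the same data in different layouts: `x1.json` is row-oriented (one object per row) and `x2.json` is column-oriented (one array per column). The sketch below shows the same two layouts with plain Julia values; it does not use JSONTables.jl, and the variable names are chosen only for illustration.

```julia
# Column-oriented layout: one vector per column (the shape objecttable writes).
columns = (A = [true, false, true], B = [1, 2, missing])

# Row-oriented layout: one record per row (the shape arraytable writes).
rows = [(A = a, B = b) for (a, b) in zip(columns.A, columns.B)]

# Converting back: collect each field across all rows.
columns_again = (A = [r.A for r in rows], B = [r.B for r in rows])

println(rows[2])
println(isequal(columns_again, columns))
```

In the row-oriented layout the column names are repeated for every row, which is consistent with `x1.json` being larger than `x2.json` (106 versus 76 bytes written above).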

In [24]:
y1 = open(jsontable, "x1.json") |> DataFrame

| Row | A | B | C | D |
|-----|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | 1 | 1 | missing | a |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | c |


In [25]:
eltypes(y1)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}

In [26]:
y2 = open(jsontable, "x2.json") |> DataFrame

| Row | A | B | C | D |
|-----|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | 1 | 1 | missing | a |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | c |


In [27]:
eltypes(y2)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}

### Feather.jl

Finally, we use the Feather format, which allows, in particular, data interchange with R or Python.

In [28]:
x.D = passmissing(string).(x.D) # Feather format does not support Char type

3-element Array{Union{Missing, String},1}:
 "a"    
 missing
 "c"    

In [29]:
Feather.write("x.feather", x)

"x.feather"

In [30]:
y = Feather.materialize("x.feather") # Feather.read is a lazy alternative

| Row | A | B | C | D |
|-----|---|---|---|---|
| | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | 1 | 1 | missing | a |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | c |


In [31]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}
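
The `passmissing(string)` call used before writing the Feather file deserves a short note: `passmissing` (from Missings.jl) wraps a function so that `missing` inputs are returned unchanged. The sketch below reproduces that behaviour for a single argument with Base Julia only; the helper name `skipping_missing` is hypothetical.

```julia
# Hypothetical helper mimicking what passmissing(f) does for one argument:
# return missing untouched, otherwise apply f.
skipping_missing(f) = v -> ismissing(v) ? missing : f(v)

chars = ['a', missing, 'c']
strings = skipping_missing(string).(chars)

println(strings)
println(eltype(strings))

# Without the wrapper, missing would be turned into text:
println(string(missing))
```

Broadcasting plain `string` over the column would store the text `"missing"` in place of the missing value, which is why the wrapper is needed.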

### Basic benchmarking

Next, we'll create some files, so be careful that you don't already have these files in your working directory!

In particular, we'll time how long it takes us to write a `DataFrame` with 10^5 rows and 500 columns.

In [32]:
bigdf = DataFrame(rand(Bool, 10^5, 500))
bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")
println("First run")
println("CSV.jl")
@time CSV.write("bigdf1.csv", bigdf)
println("CSVFiles.jl")
@time bigdf |> save("bigdf2.csv")
println("Serialization")
@time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
println("JLSO.jl")
@time JLSO.save("bigdf.jlso", bigdf)
println("Feather.jl")
@time Feather.write("bigdf.feather", bigdf)
println("JSONTables.jl arraytable")
@time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
println("JSONTables.jl objecttable")
@time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
println("Second run")
println("CSV.jl")
@time CSV.write("bigdf1.csv", bigdf)
println("CSVFiles.jl")
@time bigdf |> save("bigdf2.csv")
println("Serialization")
@time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
println("JLSO.jl")
@time JLSO.save("bigdf.jlso", bigdf)
println("Feather.jl")
@time Feather.write("bigdf.feather", bigdf)
println("JSONTables.jl arraytable")
@time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
println("JSONTables.jl objecttable")
@time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")

First run
CSV.jl
  5.642190 seconds (52.09 M allocations: 887.398 MiB, 2.40% gc time)
CSVFiles.jl
  6.449236 seconds (3.35 M allocations: 212.784 MiB, 0.97% gc time)
Serialization
  0.468108 seconds (418.58 k allocations: 21.527 MiB, 1.20% gc time)
JLSO.jl
  7.928347 seconds (388.59 k allocations: 183.912 MiB, 0.62% gc time)
Feather.jl
  0.617448 seconds (801.51 k allocations: 663.334 MiB, 29.89% gc time)
JSONTables.jl arraytable
 71.049804 seconds (1.05 G allocations: 24.645 GiB, 3.61% gc time)
JSONTables.jl objecttable
  1.198479 seconds (651.50 k allocations: 33.517 MiB, 0.50% gc time)
Second run
CSV.jl
  5.159269 seconds (50.25 M allocations: 788.967 MiB, 0.94% gc time)
CSVFiles.jl
  2.897314 seconds (406.63 k allocations: 64.416 MiB, 0.14% gc time)
Serialization
  0.388764 seconds (5.39 k allocations: 761.731 KiB)
JLSO.jl
  7.804692 seconds (23.52 k allocations: 165.984 MiB, 0.54% gc time)
Feather.jl
  0.869113 seconds (458.98 k allocations: 645.998 MiB, 12.97% gc time)
JSONTables

275806947
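
Each write above is timed twice because the first call of a Julia function also includes the time needed to compile it, so the second run is usually the more representative one. A minimal sketch of the same measurement pattern, with a hypothetical workload `sum_of_squares` standing in for a write call:

```julia
# Hypothetical workload standing in for a write call.
sum_of_squares(v) = sum(x -> x^2, v)

data = rand(10^6)

first_run = @elapsed sum_of_squares(data)    # includes compilation of sum_of_squares
second_run = @elapsed sum_of_squares(data)   # compiled code is reused

println("first run:  ", first_run, " seconds")
println("second run: ", second_run, " seconds")
```

The exact numbers depend on the machine and the Julia version, so they are not shown here.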

In [33]:
data_files = ["bigdf1.csv", "bigdf2.csv", "bigdf.bin", "bigdf.feather", "bigdf1.json", "bigdf2.json"]
DataFrame(file = data_files,
          size = getfield.(stat.(data_files), :size))

| Row | file | size |
|-----|------|------|
| | String | Int64 |
| 1 | bigdf1.csv | 275804946 |
| 2 | bigdf2.csv | 275805946 |
| 3 | bigdf.bin | 28207930 |
| 4 | bigdf.feather | 9797672 |
| 5 | bigdf1.json | 615202555 |
| 6 | bigdf2.json | 275806947 |
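
The sizes above are obtained with `getfield.(stat.(data_files), :size)`. Base Julia also provides `filesize`, which returns the same number. A small sketch on a temporary file (the file contents are made up for the example):

```julia
path, io = mktemp()
write(io, "A,B\n1,2\n")   # 8 bytes of text
close(io)

size_via_stat = stat(path).size
size_via_filesize = filesize(path)
rm(path)

println((size_via_stat, size_via_filesize))
```

With this function the size column could equally be computed as `filesize.(data_files)`.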


In [34]:
println("First run")
println("CSV.jl")
@time CSV.read("bigdf1.csv")
println("CSVFiles.jl")
println("  disabled due to time-out")
# @time load("bigdf2.csv") |> DataFrame
println("Serialization")
@time open(deserialize, "bigdf.bin")
println("JLSO.jl")
@time JLSO.load("bigdf.jlso")
println("Feather.jl")
@time Feather.materialize("bigdf.feather")
println("JSONTables.jl arraytable")
@time open(jsontable, "bigdf1.json")
println("JSONTables.jl objecttable")
@time open(jsontable, "bigdf2.json")
println("Second run")
@time CSV.read("bigdf1.csv")
println("CSVFiles.jl")
println("  disabled due to time-out")
# @time load("bigdf2.csv") |> DataFrame
println("Serialization")
@time open(deserialize, "bigdf.bin")
println("JLSO.jl")
@time JLSO.load("bigdf.jlso")
println("Feather.jl")
@time Feather.materialize("bigdf.feather")
println("JSONTables.jl arraytable")
@time open(jsontable, "bigdf1.json")
println("JSONTables.jl objecttable")
@time open(jsontable, "bigdf2.json");

First run
CSV.jl
  1.916099 seconds (65.81 k allocations: 3.678 MiB)
CSVFiles.jl
  disabled due to time-out
Serialization
  2.067152 seconds (49.69 M allocations: 812.147 MiB, 6.80% gc time)
JLSO.jl
  1.622146 seconds (49.68 M allocations: 819.819 MiB, 8.72% gc time)
Feather.jl
  0.569779 seconds (776.24 k allocations: 145.573 MiB, 19.98% gc time)
JSONTables.jl arraytable
  2.759570 seconds (31 allocations: 586.705 MiB, 0.80% gc time)
JSONTables.jl objecttable
  2.000751 seconds (26 allocations: 263.031 MiB, 6.65% gc time)
Second run
  1.837442 seconds (5.13 k allocations: 292.148 KiB)
CSVFiles.jl
  disabled due to time-out
Serialization
  1.653366 seconds (49.65 M allocations: 810.168 MiB, 9.67% gc time)
JLSO.jl
  1.630883 seconds (49.66 M allocations: 819.162 MiB, 8.34% gc time)
Feather.jl
  0.302821 seconds (230.36 k allocations: 116.961 MiB, 13.92% gc time)
JSONTables.jl arraytable
  2.864430 seconds (31 allocations: 586.705 MiB, 5.12% gc time)
JSONTables.jl objecttable
  2.017142 

### Using gzip compression

A common user requirement is to load and save CSV files that are compressed using gzip.
Below we show how this can be accomplished using CodecZlib.jl.
The same pattern is applicable to JSONTables.jl compression/decompression.

Again, make sure that you do not have a file named `df_compress_test.csv.gz` in your working directory.

We first generate a random data frame:

In [35]:
df = DataFrame(rand(1:10, 10, 1000))

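| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 |
|-----|----|----|----|----|----|----|----|----|----|-----|-----|-----|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 4 | 8 | 10 | 6 | 6 | 3 | 10 | 6 | 1 | 1 | 5 | 5 |
| 2 | 8 | 9 | 7 | 6 | 3 | 10 | 10 | 2 | 6 | 8 | 4 | 3 |
| 3 | 3 | 1 | 4 | 2 | 4 | 4 | 1 | 3 | 2 | 8 | 7 | 8 |
| 4 | 6 | 1 | 4 | 3 | 5 | 7 | 1 | 7 | 2 | 7 | 2 | 4 |
| 5 | 3 | 1 | 7 | 8 | 10 | 1 | 5 | 6 | 2 | 9 | 5 | 7 |
| 6 | 5 | 10 | 8 | 9 | 7 | 4 | 2 | 4 | 4 | 5 | 10 | 1 |
| 7 | 6 | 4 | 6 | 2 | 2 | 1 | 4 | 7 | 5 | 3 | 6 | 6 |
| 8 | 6 | 1 | 10 | 2 | 1 | 4 | 2 | 3 | 4 | 10 | 9 | 5 |
| 9 | 9 | 2 | 6 | 1 | 1 | 10 | 3 | 9 | 3 | 4 | 5 | 2 |
| 10 | 1 | 4 | 10 | 3 | 3 | 6 | 5 | 4 | 4 | 7 | 9 | 6 |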


In [36]:
# GzipCompressorStream comes from CodecZlib

open("df_compress_test.csv.gz", "w") do io
    stream = GzipCompressorStream(io)
    CSV.write(stream, df)
    close(stream)
end

In [37]:
df2 = open("df_compress_test.csv.gz") do io
    stream = GzipDecompressorStream(io)
    res = CSV.read(stream)
    close(stream)
    res
end

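| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 |
|-----|----|----|----|----|----|----|----|----|----|-----|-----|-----|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 4 | 8 | 10 | 6 | 6 | 3 | 10 | 6 | 1 | 1 | 5 | 5 |
| 2 | 8 | 9 | 7 | 6 | 3 | 10 | 10 | 2 | 6 | 8 | 4 | 3 |
| 3 | 3 | 1 | 4 | 2 | 4 | 4 | 1 | 3 | 2 | 8 | 7 | 8 |
| 4 | 6 | 1 | 4 | 3 | 5 | 7 | 1 | 7 | 2 | 7 | 2 | 4 |
| 5 | 3 | 1 | 7 | 8 | 10 | 1 | 5 | 6 | 2 | 9 | 5 | 7 |
| 6 | 5 | 10 | 8 | 9 | 7 | 4 | 2 | 4 | 4 | 5 | 10 | 1 |
| 7 | 6 | 4 | 6 | 2 | 2 | 1 | 4 | 7 | 5 | 3 | 6 | 6 |
| 8 | 6 | 1 | 10 | 2 | 1 | 4 | 2 | 3 | 4 | 10 | 9 | 5 |
| 9 | 9 | 2 | 6 | 1 | 1 | 10 | 3 | 9 | 3 | 4 | 5 | 2 |
| 10 | 1 | 4 | 10 | 3 | 3 | 6 | 5 | 4 | 4 | 7 | 9 | 6 |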


In [38]:
df == df2

true
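
For small payloads the same gzip round trip can be done entirely in memory, which is convenient for checking that compression works before touching the file system. This is a sketch that assumes CodecZlib.jl is installed; it uses `transcode` with the `GzipCompressor` and `GzipDecompressor` codecs instead of the stream wrappers shown above, and the CSV text is made up for the example.

```julia
using CodecZlib

text = "x1,x2\n1,2\n3,4\n"

# Compress and decompress in memory; transcode works on byte vectors.
compressed = transcode(GzipCompressor, Vector{UInt8}(text))
restored = String(transcode(GzipDecompressor, compressed))

println(restored == text)
```

The stream-based approach remains the better fit for large data frames, since it avoids holding the whole compressed payload in memory.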

### Using zip files

Sometimes you may have files compressed inside a zip file.

In such a situation you may use [ZipFile.jl](https://github.com/fhs/ZipFile.jl) in conjunction with an appropriate reader to read the files.

Here we first create a ZIP file and then read back its contents into a `DataFrame`.

In [39]:
df1 = DataFrame(rand(1:10, 3, 4))

| Row | x1 | x2 | x3 | x4 |
|-----|----|----|----|----|
| | Int64 | Int64 | Int64 | Int64 |
| 1 | 5 | 7 | 2 | 7 |
| 2 | 7 | 9 | 4 | 8 |
| 3 | 8 | 6 | 8 | 6 |


In [40]:
df2 = DataFrame(rand(1:10, 3, 4))

| Row | x1 | x2 | x3 | x4 |
|-----|----|----|----|----|
| | Int64 | Int64 | Int64 | Int64 |
| 1 | 10 | 8 | 6 | 5 |
| 2 | 2 | 8 | 7 | 9 |
| 3 | 9 | 1 | 6 | 3 |


Here we also show yet another way to write a `DataFrame` into a CSV file:

In [41]:
# write a CSV file into the zip file
w = ZipFile.Writer("x.zip")

f1 = ZipFile.addfile(w, "x1.csv")
write(f1, sprint(show, "text/csv", df1))

# write a second CSV file into zip file
f2 = ZipFile.addfile(w, "x2.csv", method=ZipFile.Deflate)
write(f2, sprint(show, "text/csv", df2))

close(w)

Now we read the CSV we have written:

In [42]:
z = ZipFile.Reader("x.zip");

In [43]:
# find the index of the file called x1.csv
index_xcsv = findfirst(x->x.name == "x1.csv", z.files)
# to read the x1.csv file in the zip file
df1_2 = CSV.read(z.files[index_xcsv])

| Row | x1 | x2 | x3 | x4 |
|-----|----|----|----|----|
| | Int64 | Int64 | Int64 | Int64 |
| 1 | 5 | 7 | 2 | 7 |
| 2 | 7 | 9 | 4 | 8 |
| 3 | 8 | 6 | 8 | 6 |


In [44]:
df1_2 == df1

true

In [45]:
# find the index of the file called x2.csv
index_xcsv = findfirst(x->x.name == "x2.csv", z.files)
# to read the x2.csv file in the zip file
df2_2 = CSV.read(z.files[index_xcsv])

| Row | x1 | x2 | x3 | x4 |
|-----|----|----|----|----|
| | Int64 | Int64 | Int64 | Int64 |
| 1 | 10 | 8 | 6 | 5 |
| 2 | 2 | 8 | 7 | 9 |
| 3 | 9 | 1 | 6 | 3 |


In [46]:
df2_2 == df2

true

Note that once you read a given file from the `z` object, its stream is exhausted (it is at its end). Therefore, to read it again you need to close `z` and open it again.

Also do not forget to close the zip file once done.

In [47]:
close(z)
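
The remark about an exhausted stream applies to Julia I/O streams in general. The sketch below illustrates it with an in-memory `IOBuffer`, which is only an analogy for a file inside the zip archive and does not use ZipFile.jl.

```julia
# An in-memory IOBuffer stands in for a file inside the zip archive.
io = IOBuffer("x1,x2\n1,2\n")

first_read = read(io, String)    # consumes the whole stream
second_read = read(io, String)   # nothing is left to read

println(repr(first_read))
println(repr(second_read))
println(eof(io))
```

The second read returns an empty string because the read position is already at the end of the stream.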

Finally, let's clean up. Do not run the next cell unless you are sure that it will not erase your important files.

In [48]:
foreach(rm, ["x1.csv", "x2.csv", "x.bin", "x.jlso", "x.feather", "x1.json", "x2.json",
             "bigdf1.csv", "bigdf2.csv", "bigdf.bin", "bigdf.jlso", "bigdf.feather", "bigdf1.json", "bigdf2.json", 
             "df_compress_test.csv.gz", "x.zip"])