# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), August 29, 2019**

In [1]:
using Pkg
Pkg.activate(".")

[32m[1mActivating[22m[39m environment at `d:\Dev\Julia\DataFrames_Tutorial\Project.toml`


In [2]:
using DataFrames

## Load and save DataFrames
We do not cover all features of the packages. Please refer to their documentation to learn them.

Here we load `CSV` and `CSVFiles` to read and write CSV files; `Serialization`, `JLSO`, and `Feather` to work with binary formats; `JSONTables` for JSON; and `CodecZlib` and `ZipFile` for compressed files.

In [3]:
using CSV
using CSVFiles
using Serialization
using JLSO
using Feather
using JSONTables
using CodecZlib
using ZipFile

Let's create a simple `DataFrame` for testing purposes,

In [4]:
x = DataFrame(A=[true, false, true], B=[1, 2, missing],
              C=[missing, "b", "c"], D=['a', missing, 'c'])


|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | 1 | 1 | missing | 'a' |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | 'c' |


and use `eltypes` to look at the columnwise types.

In [5]:
eltypes(x)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  

### CSV.jl

Let's use `CSV` to save `x` to disk; make sure `x1.csv` does not conflict with some file in your working directory.

In [6]:
CSV.write("x1.csv", x)

"x1.csv"

Now we can see how it was saved by reading `x1.csv`.

In [7]:
print(read("x1.csv", String))

A,B,C,D
true,1,,a
false,2,b,
true,,c,c


We can also load it back (`use_mmap=false` disables memory mapping so that on Windows the file can be deleted in the same session; on other operating systems it is not needed).

In [8]:
y = CSV.read("x1.csv", use_mmap=false)

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | 1 | 1 | missing | a |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | c |


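When loading a `DataFrame` from a CSV file, note that the column types have changed! Column `:D` held `Char` values in `x` but is read back as `String`, because a CSV file stores no type information. Columns that contain missing values get an element type that allows `Missing`, while `:A`, which has none, stays `Bool`.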

In [9]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}

### CSVFiles.jl

Now we will use `CSVFiles` to achieve the same. First we save the file. Notice that we override the default `nastring`, which is `"NA"`, because we have missings in non-numeric columns.

In [10]:
x |> save("x2.csv", nastring="")

and peek at the saved file:

In [11]:
print(read("x2.csv", String))

"A","B","C","D"
true,1,,a
false,2,"b",
true,,"c",c


We can load it back using `load`:

In [12]:
y = load("x2.csv") |> DataFrame

|   | A | B | C | D |
|---|---|---|---|---|
|   | String | Int64⍰ | String | String |
| 1 | True | 1 |  | a |
| 2 | False | 2 | b |  |
| 3 | True | missing | c | c |


Let us check element types again:

In [13]:
eltypes(y)

4-element Array{Type,1}:
 String               
 Union{Missing, Int64}
 String               
 String               

Observe that in columns `:C` and `:D` missings were read back as empty strings, and that column `:A` was read back as `String` rather than `Bool`.

### Serialization

Now we use serialization to save `x`. Note that, in general, this process will not work if the reading and writing are done by different versions of Julia, or by an instance of Julia with a different system image. There are two ways to perform serialization. The first is to use `Serialization.serialize`, as below:

In [14]:
open("x.bin", "w") do io
    serialize(io, x)
end

Now we load the saved file back into the variable `y`. This time `y` is identical to `x`, including the column types. However, beware that if your session does not have DataFrames.jl loaded, the content may not be recognised as a `DataFrame`.

In [15]:
y = open(deserialize, "x.bin")

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | 1 | 1 | missing | 'a' |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | 'c' |


In [16]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  

The second way to perform serialization is by using the [JLSO.jl](https://github.com/invenia/JLSO.jl) library:

In [17]:
JLSO.save("x.jlso", x)

Now we can load the file back into `y`:

In [18]:
y = JLSO.load("x.jlso")["data"]

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | 1 | 1 | missing | 'a' |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | 'c' |
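
Because `Serialization` is part of Julia's standard library, the type-preserving round trip can also be tried without any extra packages. The following is only a minimal sketch, not part of the original notebook: `cols` is a hypothetical `NamedTuple` of vectors standing in for `x`, the file is written to a temporary path, and the file-name methods of `serialize` and `deserialize` are assumed to be available (Julia 1.1 or later).

```julia
using Serialization

# Hypothetical stand-in for the data frame `x`: a NamedTuple of column vectors.
cols = (A = [true, false, true], B = [1, 2, missing], D = ['a', missing, 'c'])

# Write to a temporary file so that nothing in the working directory is touched.
path = tempname()
serialize(path, cols)

# Read the object back; element types (including Char) are preserved.
restored = deserialize(path)
rm(path)

println(eltype(restored.D))
```

Unlike the CSV round trips, the `Char` column comes back with the element type it was written with.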


### JSONTables.jl

Often you might need to read and write data stored in JSON format. JSONTables.jl provides a way to process them in row-oriented or column-oriented layout. We present both options below.

In [19]:
open(io -> arraytable(io, x), "x1.json", "w")

106

In [20]:
open(io -> objecttable(io, x), "x2.json", "w")

76

In [21]:
print(read("x1.json", String))

[{"A":true,"B":1,"C":null,"D":"a"},{"A":false,"B":2,"C":"b","D":null},{"A":true,"B":null,"C":"c","D":"c"}]

In [22]:
print(read("x2.json", String))

{"A":[true,false,true],"B":[1,2,null],"C":[null,"b","c"],"D":["a",null,"c"]}
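
The two files differ only in layout: `arraytable` wrote an array of row objects, while `objecttable` wrote a single object holding whole columns. The sketch below illustrates the same two layouts with plain Julia values and no JSON package; `cols`, `rows`, and `cols_again` are hypothetical names used only for illustration, and this is not how JSONTables.jl is implemented.

```julia
# Column-oriented layout: one entry per column, each holding all its values.
cols = (A = [true, false, true], B = [1, 2, missing])

# Row-oriented layout: one small record per row.
rows = [(A = a, B = b) for (a, b) in zip(cols.A, cols.B)]

# Converting back to columns recovers the original data.
cols_again = (A = [r.A for r in rows], B = [r.B for r in rows])

println(length(rows))
println(isequal(cols, cols_again))
```

The row-oriented form repeats the column names in every record, which is consistent with `x1.json` being larger than `x2.json`.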

In [23]:
y1 = open(jsontable, "x1.json") |> DataFrame

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | 1 | 1 | missing | a |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | c |


In [24]:
eltypes(y1)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}

In [25]:
y2 = open(jsontable, "x2.json") |> DataFrame

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | 1 | 1 | missing | a |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | c |


In [26]:
eltypes(y2)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}

### Feather.jl

Finally, we use the Feather format, which allows, in particular, for data interchange with R or Python.

In [27]:
x.D = passmissing(string).(x.D) # Feather format does not support Char type

3-element Array{Union{Missing, String},1}:
 "a"    
 missing
 "c"    

In [28]:
Feather.write("x.feather", x)

"x.feather"

In [29]:
y = Feather.materialize("x.feather") # Feather.read is a lazy alternative

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | 1 | 1 | missing | a |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | c |


In [30]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}

### Basic benchmarking

Next, we'll create some files, so be careful that you don't already have these files in your working directory!

In particular, we'll time how long it takes us to write a `DataFrame` with 10^3 rows and 10^2 columns.

In [31]:
bigdf = DataFrame(rand(Bool, 10^3, 10^2))
bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")
println("First run")
@time CSV.write("bigdf1.csv", bigdf)
@time bigdf |> save("bigdf2.csv")
@time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
@time JLSO.save("bigdf.jlso", bigdf)
@time Feather.write("bigdf.feather", bigdf)
@time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
@time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
println("Second run")
@time CSV.write("bigdf1.csv", bigdf)
@time bigdf |> save("bigdf2.csv")
@time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
@time JLSO.save("bigdf.jlso", bigdf)
@time Feather.write("bigdf.feather", bigdf)
@time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
@time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
getfield.(stat.(["bigdf1.csv", "bigdf2.csv", "bigdf.bin", "bigdf.feather", "bigdf1.json", "bigdf2.json"]), :size)

First run
  0.451292 seconds (1.32 M allocations: 67.531 MiB, 5.81% gc time)
  0.494021 seconds (836.41 k allocations: 43.345 MiB, 3.17% gc time)
  0.085290 seconds (352.20 k allocations: 17.402 MiB, 12.41% gc time)
  0.084796 seconds (381.87 k allocations: 19.582 MiB, 8.71% gc time)
  0.022820 seconds (50.07 k allocations: 9.821 MiB)
  0.342492 seconds (2.58 M allocations: 78.066 MiB, 8.02% gc time)
  0.100681 seconds (235.54 k allocations: 12.973 MiB)
Second run
  0.006234 seconds (6.53 k allocations: 291.766 KiB)
  0.015849 seconds (5.43 k allocations: 339.016 KiB, 48.30% gc time)
  0.008055 seconds (4.66 k allocations: 241.620 KiB)
  0.016152 seconds (15.85 k allocations: 1.613 MiB)
  0.008623 seconds (33.62 k allocations: 8.998 MiB)
  0.146916 seconds (2.07 M allocations: 50.180 MiB, 6.74% gc time)
  0.009177 seconds (10.39 k allocations: 491.547 KiB)


6-element Array{Int64,1}:
  558357
  558557
   84349
   54088
 1151966
  558758

In [32]:
println("First run")
@time CSV.read("bigdf1.csv")
@time load("bigdf2.csv") |> DataFrame
@time open(deserialize, "bigdf.bin")
@time JLSO.load("bigdf.jlso")
@time Feather.materialize("bigdf.feather")
@time open(jsontable, "bigdf1.json")
@time open(jsontable, "bigdf2.json")
println("Second run")
@time CSV.read("bigdf1.csv")
@time load("bigdf2.csv") |> DataFrame
@time open(deserialize, "bigdf.bin")
@time JLSO.load("bigdf.jlso")
@time Feather.materialize("bigdf.feather")
@time open(jsontable, "bigdf1.json")
@time open(jsontable, "bigdf2.json");

First run
  0.014641 seconds (36.70 k allocations: 1.990 MiB)
  0.721832 seconds (2.26 M allocations: 201.152 MiB, 28.23% gc time)
  0.016557 seconds (71.72 k allocations: 1.978 MiB)
  0.013111 seconds (69.30 k allocations: 2.080 MiB)
  0.154732 seconds (331.85 k allocations: 17.423 MiB, 5.12% gc time)
  0.004817 seconds (31 allocations: 1.100 MiB)
  0.002774 seconds (26 allocations: 547.125 KiB)
Second run
  0.003720 seconds (1.12 k allocations: 69.320 KiB)
  0.047248 seconds (502.31 k allocations: 107.978 MiB, 59.39% gc time)
  0.002311 seconds (49.60 k allocations: 951.844 KiB)
  0.004142 seconds (56.02 k allocations: 1.423 MiB)
  0.002432 seconds (26.48 k allocations: 1.204 MiB)
  0.004706 seconds (31 allocations: 1.100 MiB)
  0.002745 seconds (26 allocations: 547.125 KiB)
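
Each operation above is timed twice because the first call to a Julia function also includes just-in-time compilation, so the "Second run" numbers are closer to the actual I/O cost. The sketch below shows that pattern in isolation using only the standard library; `time_twice` is a hypothetical helper introduced here for illustration, and the raw `write` of a `Bool` matrix merely stands in for the package writers benchmarked above.

```julia
# Hypothetical helper: run `f` twice and return both elapsed times in seconds.
function time_twice(f)
    first_run = @elapsed f()    # includes compilation of `f` and its callees
    second_run = @elapsed f()   # compiled code is reused here
    return (first_run = first_run, second_run = second_run)
end

# Write to a temporary path so that no file in the working directory is touched.
path = tempname()
data = rand(Bool, 10^3, 10^2)

timings = time_twice() do
    open(io -> write(io, data), path, "w")
end
rm(path)

println(timings)
```

For more careful measurements a dedicated benchmarking package is usually preferable to `@time` or `@elapsed`, but the two-run pattern is enough for a rough comparison.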


### Using gzip compression

A common user requirement is to be able to load and save CSV files that are compressed using gzip.
Below we show how this can be accomplished using CodecZlib.jl.
The same pattern is applicable to JSONTables.jl compression/decompression.

Again, make sure that you do not have a file named `df_compress_test.csv.gz` in your working directory.

We first generate a random data frame:

In [33]:
df = DataFrame(rand(1:10, 10, 1000))

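|   | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 6 | 10 | 9 | 6 | 10 | 3 | 10 | 10 | 9 | 5 | 1 | 4 |
| 2 | 9 | 4 | 5 | 7 | 2 | 7 | 2 | 1 | 7 | 2 | 2 | 2 |
| 3 | 5 | 4 | 2 | 2 | 4 | 5 | 1 | 10 | 10 | 9 | 7 | 6 |
| 4 | 8 | 8 | 6 | 6 | 8 | 8 | 8 | 5 | 5 | 10 | 7 | 10 |
| 5 | 6 | 6 | 10 | 6 | 9 | 3 | 1 | 9 | 7 | 7 | 5 | 7 |
| 6 | 1 | 7 | 2 | 5 | 4 | 4 | 8 | 10 | 7 | 7 | 9 | 4 |
| 7 | 8 | 10 | 1 | 8 | 3 | 9 | 5 | 9 | 1 | 10 | 8 | 2 |
| 8 | 5 | 8 | 10 | 9 | 10 | 1 | 2 | 9 | 1 | 8 | 6 | 9 |
| 9 | 2 | 10 | 1 | 5 | 4 | 1 | 2 | 8 | 6 | 8 | 2 | 6 |
| 10 | 1 | 5 | 3 | 7 | 2 | 7 | 9 | 2 | 1 | 10 | 1 | 8 |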


In [34]:
# GzipCompressorStream comes from CodecZlib

open("df_compress_test.csv.gz", "w") do io
    stream = GzipCompressorStream(io)
    CSV.write(stream, df)
    close(stream)
end

In [35]:
df2 = open("df_compress_test.csv.gz") do io
    stream = GzipDecompressorStream(io)
    res = CSV.read(stream)
    close(stream)
    res
end

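|   | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 6 | 10 | 9 | 6 | 10 | 3 | 10 | 10 | 9 | 5 | 1 | 4 |
| 2 | 9 | 4 | 5 | 7 | 2 | 7 | 2 | 1 | 7 | 2 | 2 | 2 |
| 3 | 5 | 4 | 2 | 2 | 4 | 5 | 1 | 10 | 10 | 9 | 7 | 6 |
| 4 | 8 | 8 | 6 | 6 | 8 | 8 | 8 | 5 | 5 | 10 | 7 | 10 |
| 5 | 6 | 6 | 10 | 6 | 9 | 3 | 1 | 9 | 7 | 7 | 5 | 7 |
| 6 | 1 | 7 | 2 | 5 | 4 | 4 | 8 | 10 | 7 | 7 | 9 | 4 |
| 7 | 8 | 10 | 1 | 8 | 3 | 9 | 5 | 9 | 1 | 10 | 8 | 2 |
| 8 | 5 | 8 | 10 | 9 | 10 | 1 | 2 | 9 | 1 | 8 | 6 | 9 |
| 9 | 2 | 10 | 1 | 5 | 4 | 1 | 2 | 8 | 6 | 8 | 2 | 6 |
| 10 | 1 | 5 | 3 | 7 | 2 | 7 | 9 | 2 | 1 | 10 | 1 | 8 |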


In [36]:
df == df2

true
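
When the data already sits in memory, the same gzip codec can be applied without opening a stream. The following is a sketch of that in-memory variant, not the streaming approach shown above: it assumes CodecZlib.jl is installed, uses its `transcode` methods with `GzipCompressor` and `GzipDecompressor`, and works on a hypothetical CSV-like string `csv_text` rather than on a data frame.

```julia
using CodecZlib

# Hypothetical CSV-like text: a header line followed by 1000 data lines.
csv_text = "x1,x2\n" * join(("$(i),$(i % 10)" for i in 1:1000), "\n") * "\n"

# Compress the raw bytes in memory, then decompress them again.
compressed = transcode(GzipCompressor, Vector{UInt8}(csv_text))
restored = String(transcode(GzipDecompressor, compressed))

println(sizeof(csv_text), " bytes before, ", length(compressed), " bytes after")
println(restored == csv_text)
```

The streaming form remains the better fit for large files, since it avoids holding both the plain and the compressed data in memory at once.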

### Reading compressed files directly from zip files

Sometimes you may have files compressed inside a zip file. In such situations you can use ZipFile.jl in conjunction with an appropriate reader to read them. For example, to read a CSV file from inside a zip file:

In [37]:
# write a CSV file into the zip file
w = ZipFile.Writer("x.zip");
f = ZipFile.addfile(w, "x.csv");
write(f, "a,b\n");
write(f, "1,2\n");

In [38]:
# write a second CSV file into the zip file
f2 = ZipFile.addfile(w, "x2.csv", method=ZipFile.Deflate);
write(f2, "d,e\n");
write(f2, "d,e\n");
close(w)

Now we read the CSV we've written:

In [39]:
z = ZipFile.Reader("x.zip");

In [40]:
# find the index of the file called x2.csv
index_xcsv = findfirst(x->x.name == "x2.csv", z.files)

2

In [41]:
# to read the x2.csv file in the zip file
x = CSV.read(z.files[index_xcsv])

|   | d | e |
|---|---|---|
|   | String | String |
| 1 | d | e |


In [42]:
# Don't forget to close the zip file once done
close(z)

Finally, let's clean up. Do not run the next cell unless you are sure that it will not erase your important files.

In [43]:
foreach(rm, ["x1.csv", "x2.csv", "x.bin", "x.feather", "x1.json", "x2.json",
             "bigdf1.csv", "bigdf2.csv", "bigdf.bin", "bigdf.feather", "bigdf1.json", "bigdf2.json", "x.jlso", 
             "df_compress_test.csv.gz", "x.zip", "bigdf.jlso"])