# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), August 29, 2019**

In [1]:
using Pkg
Pkg.activate(".")

[32m[1mActivating[22m[39m environment at `d:\Dev\Julia\DataFrames_Tutorial\Project.toml`


In [2]:
using DataFrames

## Load and save DataFrames
We do not cover all features of the packages. Please refer to their documentation to learn them.

Here we load `CSV` and `CSVFiles` to read and write CSV files, `Serialization` and `Feather` to work with binary formats, and `JSONTables` to read and write JSON.

In [3]:
using CSV
using CSVFiles
using Serialization
using Feather
using JSONTables

Let's create a simple `DataFrame` for testing purposes,

In [4]:
x = DataFrame(A=[true, false, true], B=[1, 2, missing],
              C=[missing, "b", "c"], D=['a', missing, 'c'])


|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | 1 | 1 | missing | 'a' |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | 'c' |


and use `eltypes` to look at the columnwise types.

In [5]:
eltypes(x)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  
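
In later DataFrames.jl releases `eltypes` was deprecated and then removed. A minimal sketch of an equivalent, assuming a DataFrames.jl version where `eachcol` iterates over the column vectors (the small data frame `df` is illustrative only):

```julia
using DataFrames

# Illustrative data frame: one plain column and one column that allows missing values.
df = DataFrame(A=[true, false, true], B=[1, 2, missing])

# Broadcast `eltype` over the columns to get one element type per column.
coltypes = eltype.(eachcol(df))
```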

### CSV.jl

Let's use `CSV` to save `x` to disk; make sure `x1.csv` does not conflict with some file in your working directory.

In [6]:
CSV.write("x1.csv", x)

"x1.csv"

Now we can see how it was saved by reading `x1.csv`.

In [7]:
print(read("x1.csv", String))

A,B,C,D
true,1,,a
false,2,b,
true,,c,c


We can also load it back. Passing `use_mmap=false` disables memory mapping so that on Windows the file can be deleted in the same session; on other operating systems it is not needed.

In [8]:
y = CSV.read("x1.csv", use_mmap=false)

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | 1 | 1 | missing | a |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | c |


Note that the column types have changed: `:D` was saved as `Char` but is read back as `String`. Columns that contain missing values are read with an element type that allows `Missing`.

In [9]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}
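
If the original `Char` column is needed after the round trip, one possible fix is to convert the strings back. A small sketch, assuming every non-missing entry is a one-character string (the vector `d` is a stand-in for column `:D` of `y`):

```julia
using DataFrames  # re-exports `passmissing` from Missings.jl

# Stand-in for the `:D` column as read back from the CSV file.
d = ["a", missing, "c"]

# `first` takes the first character of a string;
# `passmissing` lets missing values pass through unchanged.
restored = passmissing(first).(d)
```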

### CSVFiles.jl

Now we will use `CSVFiles` to achieve the same. First we save the file. Notice that we override the default `nastring`, which is `"NA"`, because we have missing values in non-numeric columns.

In [10]:
x |> save("x2.csv", nastring="")

and peek at the saved file:

In [11]:
print(read("x2.csv", String))

"A","B","C","D"
true,1,,a
false,2,"b",
true,,"c",c


We can load it back using `load`:

In [12]:
y = load("x2.csv") |> DataFrame

|   | A | B | C | D |
|---|---|---|---|---|
|   | String | Int64⍰ | String | String |
| 1 | True | 1 |  | a |
| 2 | False | 2 | b |  |
| 3 | True | missing | c | c |


Let us check element types again:

In [13]:
eltypes(y)

4-element Array{Type,1}:
 String               
 Union{Missing, Int64}
 String               
 String               

Observe that in columns `:C` and `:D` missing values were read back as empty strings, and that column `:A` was read back as `String` rather than `Bool`.
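
If missing values are needed, the empty strings can be mapped back after loading. A minimal sketch using `replace` from Base, assuming that no genuine empty strings occur in the data (the vector `c` is a stand-in for column `:C` of `y`):

```julia
# Stand-in for the `:C` column as read back by CSVFiles.
c = ["", "b", "c"]

# Map empty strings back to `missing`; the element type widens to allow `Missing`.
c_fixed = replace(c, "" => missing)
```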

### Serialization

Now we use serialization to save `x`. Note that in general, this process will not work if the reading and writing are done by different versions of Julia, or by an instance of Julia with a different system image. There are two ways to perform serialization. The first way is to use `Serialization.serialize`, as shown below:

In [14]:
using Serialization
open("x.bin", "w") do io
    serialize(io, x)
end

Now we load the saved file back into the variable `y`. This time `y` has the same contents and column types as `x`. However, beware that if your session does not have DataFrames.jl loaded, the deserialized content may not be recognised as a `DataFrame`.

In [15]:
using Serialization, DataFrames
y = open(deserialize, "x.bin")

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | Char⍰ |
| 1 | 1 | 1 | missing | 'a' |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | 'c' |


In [16]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, Char}  
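
Since Julia 1.1, `serialize` and `deserialize` also accept a file name directly, which avoids the explicit `open` call. A minimal sketch using only the standard library, with a `NamedTuple` standing in for the data frame and a temporary directory for the file (both choices are for illustration only):

```julia
using Serialization

# Stand-in for the data; any serializable Julia value works the same way.
data = (A=[true, false, true], B=[1, 2, missing])

# Write to a temporary directory so no file in the working directory is touched.
path = joinpath(mktempdir(), "data.bin")
serialize(path, data)

# Read the value back and compare; `isequal` treats `missing` as equal to `missing`.
restored = deserialize(path)
roundtrip_ok = isequal(restored, data)
```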

The second way to perform serialization is by using the [JLSO.jl](https://github.com/invenia/JLSO.jl) library:

In [17]:
using JLSO
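JLSO.save("x.jlso", "data" => x)  # store x under the key "data" used when loading; newer JLSO releases may expect a Symbol key such as :data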

Now we can load the file back into `y`:

using JLSO, DataFrames
y = JLSO.load("x.jlso")["data"]

### JSONTables.jl

Often you might need to read and write data stored in JSON format. JSONTables.jl provides a way to process it in either a row-oriented or a column-oriented layout. We present both options below.

In [18]:
open(io -> arraytable(io, x), "x1.json", "w")

106

In [19]:
open(io -> objecttable(io, x), "x2.json", "w")

76

In [20]:
print(read("x1.json", String))

[{"A":true,"B":1,"C":null,"D":"a"},{"A":false,"B":2,"C":"b","D":null},{"A":true,"B":null,"C":"c","D":"c"}]

In [21]:
print(read("x2.json", String))

{"A":[true,false,true],"B":[1,2,null],"C":[null,"b","c"],"D":["a",null,"c"]}

In [22]:
y1 = open(jsontable, "x1.json") |> DataFrame

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | 1 | 1 | missing | a |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | c |


In [23]:
eltypes(y1)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}

In [24]:
y2 = open(jsontable, "x2.json") |> DataFrame

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | 1 | 1 | missing | a |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | c |


In [25]:
eltypes(y2)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}

### Feather.jl

Finally, we use the Feather format, which allows, in particular, for data interchange with R or Python.

In [26]:
x.D = passmissing(string).(x.D) # Feather format does not support Char type

3-element Array{Union{Missing, String},1}:
 "a"    
 missing
 "c"    

In [27]:
Feather.write("x.feather", x)

"x.feather"

In [28]:
y = Feather.materialize("x.feather") # Feather.read is a lazy alternative

|   | A | B | C | D |
|---|---|---|---|---|
|   | Bool | Int64⍰ | String⍰ | String⍰ |
| 1 | 1 | 1 | missing | a |
| 2 | 0 | 2 | b | missing |
| 3 | 1 | missing | c | c |


In [29]:
eltypes(y)

4-element Array{Type,1}:
 Bool                  
 Union{Missing, Int64} 
 Union{Missing, String}
 Union{Missing, String}

### Basic benchmarking

Next, we'll create some files, so be careful that you don't already have these files in your working directory!

In particular, we'll time how long it takes us to write a `DataFrame` with 10^3 rows and 10^2 columns.

In [30]:
bigdf = DataFrame(rand(Bool, 10^3, 10^2))
bigdf[!, 1] = Int.(bigdf[!, 1])
bigdf[!, 2] = bigdf[!, 2] .+ 0.5
bigdf[!, 3] = string.(bigdf[!, 3], ", as string")
println("First run")
@time CSV.write("bigdf1.csv", bigdf)
@time bigdf |> save("bigdf2.csv")
@time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
@time Feather.write("bigdf.feather", bigdf)
@time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
@time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
println("Second run")
@time CSV.write("bigdf1.csv", bigdf)
@time bigdf |> save("bigdf2.csv")
@time open(io -> serialize(io, bigdf), "bigdf.bin", "w")
@time Feather.write("bigdf.feather", bigdf)
@time open(io -> arraytable(io, bigdf), "bigdf1.json", "w")
@time open(io -> objecttable(io, bigdf), "bigdf2.json", "w")
getfield.(stat.(["bigdf1.csv", "bigdf2.csv", "bigdf.bin", "bigdf.feather", "bigdf1.json", "bigdf2.json"]), :size)

First run
  0.479076 seconds (1.32 M allocations: 67.134 MiB, 5.64% gc time)
  0.494325 seconds (835.73 k allocations: 43.406 MiB, 3.28% gc time)
  0.102074 seconds (417.81 k allocations: 21.038 MiB, 7.82% gc time)
  0.125126 seconds (376.11 k allocations: 26.330 MiB, 6.28% gc time)
  0.335797 seconds (2.59 M allocations: 78.314 MiB, 8.20% gc time)
  0.093361 seconds (249.27 k allocations: 13.715 MiB)
Second run
  0.006423 seconds (6.53 k allocations: 291.672 KiB)
  0.008520 seconds (5.43 k allocations: 339.016 KiB)
  0.008966 seconds (4.59 k allocations: 238.526 KiB)
  0.018672 seconds (33.62 k allocations: 8.994 MiB, 50.37% gc time)
  0.142286 seconds (2.07 M allocations: 50.174 MiB, 6.23% gc time)
  0.010606 seconds (10.32 k allocations: 488.453 KiB)


6-element Array{Int64,1}:
  558200
  558400
   84110
   54088
 1151809
  558601

In [31]:
println("First run")
@time CSV.read("bigdf1.csv")
@time load("bigdf2.csv") |> DataFrame
@time open(deserialize, "bigdf.bin")
@time Feather.materialize("bigdf.feather")
@time open(jsontable, "bigdf1.json")
@time open(jsontable, "bigdf2.json")
println("Second run")
@time CSV.read("bigdf1.csv")
@time load("bigdf2.csv") |> DataFrame
@time open(deserialize, "bigdf.bin")
@time Feather.materialize("bigdf.feather")
@time open(jsontable, "bigdf1.json")
@time open(jsontable, "bigdf2.json");

First run
  0.059345 seconds (61.82 k allocations: 3.468 MiB)
  0.637866 seconds (2.28 M allocations: 202.282 MiB, 10.59% gc time)
  0.019549 seconds (71.66 k allocations: 1.974 MiB)
  0.264623 seconds (572.37 k allocations: 29.928 MiB, 7.48% gc time)
  0.008866 seconds (31 allocations: 1.100 MiB)
  0.003744 seconds (26 allocations: 547.000 KiB)
Second run
  0.005236 seconds (1.12 k allocations: 69.320 KiB)
  0.056800 seconds (502.31 k allocations: 107.978 MiB, 55.14% gc time)
  0.009780 seconds (49.60 k allocations: 951.906 KiB)
  0.005541 seconds (26.48 k allocations: 1.204 MiB)
  0.018484 seconds (31 allocations: 1.100 MiB)
  0.003305 seconds (26 allocations: 547.000 KiB)
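
The gap between the first and second runs comes largely from just-in-time compilation, which is included in the first call to each function. A small helper can make that comparison explicit. This is a sketch using only Base; `time_twice` is a hypothetical name, not part of any package:

```julia
# Run `f` twice and return the elapsed seconds of each call;
# the first call typically includes compilation time.
function time_twice(f)
    t1 = @elapsed f()
    t2 = @elapsed f()
    return (first_run=t1, second_run=t2)
end

# Example: time writing a random vector to an in-memory buffer.
timings = time_twice(() -> write(IOBuffer(), rand(10^4)))
```

With the cells above, a call such as `time_twice(() -> CSV.write("bigdf1.csv", bigdf))` would report both timings at once.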


Finally, let's clean up. Do not run the next cell unless you are sure that it will not erase your important files.

In [32]:
foreach(rm, ["x1.csv", "x2.csv", "x.bin", "x.feather", "x1.json", "x2.json",
             "bigdf1.csv", "bigdf2.csv", "bigdf.bin", "bigdf.feather", "bigdf1.json", "bigdf2.json"])

### Using gzip compression

A common user requirement is to be able to load and save CSV files that are compressed using gzip.
Below we show how this can be accomplished using CodecZlib.jl.
The same pattern is applicable to JSONTables.jl compression/decompression.

Again, make sure that you do not have a file named `df_compress_test.csv.gz` in your working directory.

In [33]:
using CodecZlib

We first generate a random data frame

In [34]:
df = DataFrame(rand(1:10, 10, 1000))

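|   | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 5 | 1 | 6 | 6 | 2 | 9 | 10 | 7 | 8 | 4 | 2 | 8 |
| 2 | 5 | 2 | 5 | 7 | 6 | 6 | 4 | 1 | 1 | 4 | 7 | 9 |
| 3 | 2 | 1 | 2 | 5 | 9 | 7 | 10 | 8 | 6 | 2 | 5 | 4 |
| 4 | 5 | 9 | 5 | 3 | 10 | 6 | 7 | 1 | 1 | 7 | 3 | 5 |
| 5 | 1 | 2 | 1 | 8 | 5 | 4 | 10 | 8 | 6 | 6 | 8 | 5 |
| 6 | 9 | 10 | 4 | 2 | 6 | 8 | 10 | 3 | 1 | 4 | 9 | 4 |
| 7 | 8 | 7 | 4 | 2 | 3 | 3 | 7 | 1 | 9 | 6 | 10 | 1 |
| 8 | 3 | 3 | 10 | 3 | 8 | 2 | 3 | 8 | 10 | 10 | 10 | 8 |
| 9 | 4 | 6 | 9 | 8 | 2 | 8 | 9 | 5 | 9 | 7 | 2 | 10 |
| 10 | 10 | 10 | 4 | 3 | 9 | 7 | 3 | 2 | 2 | 1 | 6 | 5 |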


In [35]:
open("df_compress_test.csv.gz", "w") do io
    stream = GzipCompressorStream(io)
    CSV.write(stream, df)
    close(stream)
end

In [36]:
df2 = open("df_compress_test.csv.gz") do io
    stream = GzipDecompressorStream(io)
    res = CSV.read(stream)
    close(stream)
    res
end

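|   | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|   | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 5 | 1 | 6 | 6 | 2 | 9 | 10 | 7 | 8 | 4 | 2 | 8 |
| 2 | 5 | 2 | 5 | 7 | 6 | 6 | 4 | 1 | 1 | 4 | 7 | 9 |
| 3 | 2 | 1 | 2 | 5 | 9 | 7 | 10 | 8 | 6 | 2 | 5 | 4 |
| 4 | 5 | 9 | 5 | 3 | 10 | 6 | 7 | 1 | 1 | 7 | 3 | 5 |
| 5 | 1 | 2 | 1 | 8 | 5 | 4 | 10 | 8 | 6 | 6 | 8 | 5 |
| 6 | 9 | 10 | 4 | 2 | 6 | 8 | 10 | 3 | 1 | 4 | 9 | 4 |
| 7 | 8 | 7 | 4 | 2 | 3 | 3 | 7 | 1 | 9 | 6 | 10 | 1 |
| 8 | 3 | 3 | 10 | 3 | 8 | 2 | 3 | 8 | 10 | 10 | 10 | 8 |
| 9 | 4 | 6 | 9 | 8 | 2 | 8 | 9 | 5 | 9 | 7 | 2 | 10 |
| 10 | 10 | 10 | 4 | 3 | 9 | 7 | 3 | 2 | 2 | 1 | 6 | 5 |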


In [37]:
df == df2

true

In [38]:
rm("df_compress_test.csv.gz")
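
The same pattern can be applied to JSONTables.jl. The following is a sketch, assuming that `objecttable` accepts any writable `IO` and that `jsontable` accepts the decompressed JSON as a `String`; the file name `df_compress_test.json.gz` and the small data frame are illustrative:

```julia
using CodecZlib, DataFrames, JSONTables

df_json = DataFrame(A=[1, 2, 3], B=["a", "b", "c"])

# Use a temporary directory so no file in the working directory is touched.
path = joinpath(mktempdir(), "df_compress_test.json.gz")

# Compress while writing: wrap the file handle in a gzip compressor stream.
open(path, "w") do io
    stream = GzipCompressorStream(io)
    objecttable(stream, df_json)
    close(stream)
end

# Decompress while reading, then parse the JSON text into a table.
df_json2 = open(path) do io
    stream = GzipDecompressorStream(io)
    res = jsontable(read(stream, String)) |> DataFrame
    close(stream)
    res
end

df_json == df_json2
```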

### Reading compressed files directly from zip files

Sometimes you may have files compressed inside a zip archive. In that situation you may use ZipFile.jl in conjunction with an appropriate reader to read the files. For example, to read a CSV file from inside a zip file:

```julia
using CSV, ZipFile

z = ZipFile.Reader("thezipfile.zip")  # open the archive for reading

# to read the first file in the zip file
x = CSV.read(z.files[1])

# Don't forget to close the zip file once done
close(z)
```