# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Apr 21, 2018**

In [1]:
using DataFrames # load package

## Load and save DataFrames
This section does not cover every feature of the packages used; please refer to their documentation for details.

Here we'll load `CSV` to read and write CSV files and `JLD`, which allows us to work with a Julia native binary format.

In [2]:
using CSV
using JLD

Let's create a simple DataFrame for testing purposes,

In [3]:
x = DataFrame(A=[true, false, true], B=[1, 2, missing],
              C=[missing, "b", "c"], D=['a', missing, 'c'])


| Row | A | B | C | D |
|---|---|---|---|---|
| 1 | true | 1 | missing | 'a' |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | 'c' |


and use `eltypes` to look at the columnwise types.

In [4]:
eltypes(x)

4-element Array{Type,1}:
 Bool                           
 Union{Int64, Missings.Missing} 
 Union{Missings.Missing, String}
 Union{Char, Missings.Missing}  
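
To see where these `Union` types come from, here is a minimal sketch using only Base Julia. It assumes Julia 1.0 or later, where `missing` lives in Base and the type prints as `Missing` rather than the `Missings.Missing` shown above; the vectors `b` and `v` are illustrative stand-ins for columns `A` and `B`.

```julia
# Assumption: Julia 1.0 or later (`missing` is defined in Base).
b = [true, false, true]      # no missing values, like column A
v = [1, 2, missing]          # one missing value, like column B

eltype(b)                    # Bool
eltype(v) == Union{Missing, Int}   # true: the element type is widened to allow `missing`
Missing <: eltype(v)         # true: a quick check whether a vector may hold `missing`

# skipmissing gives an iterator over the non-missing entries only
present = collect(skipmissing(v))  # [1, 2]
```

A vector without missing values keeps its plain element type, which is why only `A` is reported as `Bool`.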

Let's use `CSV` to save `x` to disk; make sure there is no existing file named `x.csv` in your working directory, because it would be overwritten.

In [5]:
CSV.write("x.csv", x)

CSV.Sink{Void,DataType}(    CSV.Options:
        delim: ','
        quotechar: '"'
        escapechar: '\\'
        missingstring: ""
        dateformat: nothing
        decimal: '.'
        truestring: 'true'
        falsestring: 'false', IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1), "x.csv", 8, true, String["A", "B", "C", "D"], 4, false, Val{false})

Now we can see how it was saved by reading `x.csv`.

In [6]:
print(read("x.csv", String))

A,B,C,D
true,1,,a
false,2,b,
true,,c,c


We can also load it back. `use_mmap=false` disables memory mapping so that on Windows the file can be deleted in the same session.

In [7]:
y = CSV.read("x.csv", use_mmap=false)

| Row | A | B | C | D |
|---|---|---|---|---|
| 1 | true | 1 | missing | a |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | c |


When a `DataFrame` is loaded from a CSV file, all columns allow `Missing` by default. Note that the column types have changed: `A` is now `Union{Bool, Missing}`, and `D` is read back as `String` rather than `Char`.

In [8]:
eltypes(y)

4-element Array{Type,1}:
 Union{Bool, Missings.Missing}  
 Union{Int64, Missings.Missing} 
 Union{Missings.Missing, String}
 Union{Missings.Missing, String}

Now let's save `x` to a file in a binary format; make sure that `x.jld` does not exist in your working directory.

In [9]:
save("x.jld", "x", x)

Loading `x.jld` back into `y` gives a `DataFrame` identical to `x`.

In [10]:
y = load("x.jld", "x")

| Row | A | B | C | D |
|---|---|---|---|---|
| 1 | true | 1 | missing | 'a' |
| 2 | false | 2 | b | missing |
| 3 | true | missing | c | 'c' |


Note that the column types of `y` are the same as those of `x`!

In [11]:
eltypes(y)

4-element Array{Type,1}:
 Bool                           
 Union{Int64, Missings.Missing} 
 Union{Missings.Missing, String}
 Union{Char, Missings.Missing}  

Next, we'll create the files `bigdf.csv` and `bigdf.jld`, so be careful that you don't already have these files on disk!

In particular, we'll time how long it takes to write a `DataFrame` with 10^3 rows and 10^2 columns to `.csv` and `.jld` files. *You can expect JLD to be faster!* Use `compress=true` to reduce file sizes.

In [12]:
bigdf = DataFrame(Bool, 10^3, 10^2)
@time CSV.write("bigdf.csv", bigdf)
@time save("bigdf.jld", "bigdf", bigdf)
getfield.(stat.(["bigdf.csv", "bigdf.jld"]), :size)

  0.782157 seconds (688.90 k allocations: 30.828 MiB, 1.08% gc time)
  0.018250 seconds (203.61 k allocations: 3.339 MiB)


2-element Array{Int64,1}:
 595307
 154487
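
The file sizes above are read from the `size` field of `stat`. As a hedged aside, Base Julia also provides `filesize`, which returns the same number. The sketch below assumes Julia 1.0 or later and uses two throwaway text files (hypothetical names) in a temporary directory instead of the `bigdf` files.

```julia
# Assumption: Julia 1.0 or later; the file names are illustrative only.
sizes, same = mktempdir() do dir
    small = joinpath(dir, "small.txt")
    large = joinpath(dir, "large.txt")
    write(small, "abc")                  # 3 bytes
    write(large, repeat("abc", 100))     # 300 bytes
    paths = [small, large]
    # filesize(path) reports the same value as stat(path).size
    filesize.(paths), filesize.(paths) == getfield.(stat.(paths), :size)
end

sizes   # [3, 300]
same    # true
```

Both spellings broadcast over a vector of paths, so which one to use is mostly a matter of taste.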

Finally, let's clean up. Do not run the next cell unless you are sure that it will not erase your important files.

In [13]:
foreach(rm, ["x.csv", "x.jld", "bigdf.csv", "bigdf.jld"])
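
If you would rather not touch your working directory at all, one possible approach is to do the file work inside a temporary directory. This is only a sketch in Base Julia (assuming Julia 1.0 or later), not what the cells above do: the plain `write` call stands in for `CSV.write` or `save`, and the file names are illustrative.

```julia
# Assumption: Julia 1.0 or later. mktempdir with a do-block removes the
# directory and everything in it once the block finishes.
file_names = ["x.csv", "bigdf.csv"]      # illustrative names

created, leftover = mktempdir() do dir
    paths = joinpath.(dir, file_names)
    for p in paths
        write(p, "A,B\n1,2\n")           # stand-in for CSV.write / save
    end
    all(isfile, paths), paths            # check while the directory still exists
end

created                  # true: the files existed inside the block
any(isfile, leftover)    # false: they are gone after the block
```

With this pattern no explicit `rm` call is needed, so there is no risk of deleting a file that was already in your working directory.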