# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Dec 2, 2017**

A brief introduction to basic usage of `DataFrames`. Tested under version `0.11`.
I will try to keep it up to date as the package evolves.

In [2]:
using DataFrames # load package

INFO: Recompiling stale cache file D:\Software\JULIA_PKG\lib\v0.6\DataFrames.ji for module DataFrames.

## Constructors

In [3]:
DataFrame() # empty DataFrame

In [4]:
DataFrame(A=1:3, B=rand(3), C=randstring.([3,3,3])) # keyword arguments

Row,A,B,C
1,1,0.973394,cMT
2,2,0.425512,AwK
3,3,0.358187,DLU


In [5]:
x = Dict("A" => [1,2], "B" => [true, false], "C" => ['a', 'b'])
DataFrame(x) # from dictionary, columns will be sorted

Row,A,B,C
1,1,true,'a'
2,2,false,'b'


In [6]:
DataFrame(:A => [1,2], :B => [true, false], :C => ['a', 'b']) # from pairs

Row,A,B,C
1,1,true,'a'
2,2,false,'b'


In [7]:
DataFrame([rand(3) for i in 1:3]) # from vector of vectors

Row,x1,x2,x3
1,0.775941,0.966468,0.694461
2,0.625562,0.634619,0.311111
3,0.783295,0.270914,0.893683


In [8]:
DataFrame(rand(3)) # edge case vector of atoms

Row,x1,x2,x3
1,0.368323,0.428365,0.48776


In [9]:
DataFrame(rand(3), [:A, :B, :C]) # pass second argument to give column names

Row,A,B,C
1,0.431197,0.0596815,0.704822


In [10]:
DataFrame(rand(3,4)) # from matrix

Row,x1,x2,x3,x4
1,0.574503,0.69332,0.00228494,0.920834
2,0.728191,0.835786,0.262539,0.777827
3,0.322564,0.259628,0.573652,0.664233


In [11]:
DataFrame([Int, Float64, Any], [:A, :B, :C], 1) # pass column types, names and number of rows
# we get missing because Any >: Missing

Row,A,B,C
1,3840,0.0,missing


In [12]:
DataFrame([Int, Float64, String], [:A, :B, :C], 1)
# the DataFrame is created, but the String value is #undef, so Jupyter cannot print it

UndefRefError: access to undefined reference

In [13]:
DataFrame([Int, Float64, String], [:A, :B, :C], 0) # columns are created, but there are no rows

Row,A,B,C


In [14]:
DataFrame(Int, 3, 5) # a quick way to create a homogeneous DataFrame

Row,x1,x2,x3,x4,x5
1,440544624,110690800,0,165676688,141539024
2,110723344,110690512,0,110723280,110723136
3,110723408,111469008,0,110821520,110821520


In [15]:
DataFrame([Int, Float64], 4) # similar, but with columns of different types

Row,x1,x2
1,461208592,5.51452e-316
2,110692240,5.46892e-316
3,461208624,5.46892e-316
4,459978384,5.47553e-316
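
The values printed for `DataFrame(Int, 3, 5)` and `DataFrame([Int, Float64], 4)` above are arbitrary, which suggests that the columns are allocated without being initialized. When defined starting values are needed, one option is to build the columns explicitly and pass them as keyword arguments, as in the keyword-argument constructor shown earlier. A minimal sketch, assuming only `using DataFrames`; the column names `n` and `score` are illustrative and not taken from the original:

```julia
using DataFrames

# Build each column with explicit initial values instead of relying on
# whatever happens to be in freshly allocated memory.
# The column names `n` and `score` are hypothetical.
df = DataFrame(n = zeros(Int, 4), score = fill(0.0, 4))

size(df)  # (4, 2)
```

This trades the brevity of the type-based constructors for predictable contents.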


In [16]:
x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"], D = [1, "a"])
convert(Array, x) # convert DataFrame to Matrix

2×4 Array{Any,2}:
 1  1.0       "a"  1   
 2   missing  "b"   "a"

## Getting basic information about a data frame

In [17]:
x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"], D = [1, "a"])

Row,A,B,C,D
1,1,1.0,a,1
2,2,missing,b,a


In [18]:
size(x), size(x, 1), size(x, 2)

((2, 4), 2, 4)

In [19]:
nrow(x), ncol(x), length(x)

(2, 4, 4)

In [20]:
describe(x)

A
Summary Stats:
Mean:           1.500000
Minimum:        1.000000
1st Quartile:   1.250000
Median:         1.500000
3rd Quartile:   1.750000
Maximum:        2.000000
Length:         2
Type:           Int64

B
Summary Stats:
Mean:           1.000000
Minimum:        1.000000
1st Quartile:   1.000000
Median:         1.000000
3rd Quartile:   1.000000
Maximum:        1.000000
Length:         2
Type:           Union{Float64, Missings.Missing}
Number Missing: 1
% Missing:      50.000000

C
Summary Stats:
Length:         2
Type:           String
Number Unique:  2

D
Summary Stats:
Length:         2
Type:           Any
Number Unique:  2
Number Missing: 0
% Missing:      0.000000



In [21]:
showcols(x)

2×4 DataFrames.DataFrame
│ Col # │ Name │ Eltype                           │ Missing │ Values          │
├───────┼──────┼──────────────────────────────────┼─────────┼─────────────────┤
│ 1     │ A    │ Int64                            │ 0       │ 1  …  2         │
│ 2     │ B    │ Union{Float64, Missings.Missing} │ 1       │ 1.0  …  missing │
│ 3     │ C    │ String                           │ 0       │ a  …  b         │
│ 4     │ D    │ Any                              │ 0       │ 1  …  a         │

In [22]:
names(x)

4-element Array{Symbol,1}:
 :A
 :B
 :C
 :D

In [23]:
eltypes(x)

4-element Array{Type,1}:
 Int64                           
 Union{Float64, Missings.Missing}
 String                          
 Any                             

In [24]:
y = DataFrame(rand(1:10, 20, 10))

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
1,7,3,6,10,9,7,3,10,5,5
2,3,1,9,7,9,1,10,1,8,5
3,4,5,7,10,2,8,7,3,2,2
4,4,3,6,8,8,6,2,4,2,10
5,10,3,1,5,9,10,8,10,2,4
6,7,3,9,6,4,7,5,3,8,7
7,9,3,9,2,1,4,2,6,9,9
8,3,3,4,2,9,10,3,9,2,5
9,9,4,9,2,3,1,4,10,1,7
10,7,2,10,6,4,5,1,6,2,9


In [25]:
head(y)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
1,7,3,6,10,9,7,3,10,5,5
2,3,1,9,7,9,1,10,1,8,5
3,4,5,7,10,2,8,7,3,2,2
4,4,3,6,8,8,6,2,4,2,10
5,10,3,1,5,9,10,8,10,2,4
6,7,3,9,6,4,7,5,3,8,7


In [26]:
tail(y, 3)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
1,3,9,7,5,6,7,6,2,8,10
2,10,2,5,7,10,1,10,10,10,5
3,10,1,6,5,6,4,3,7,5,4


## Handling missing values

In [27]:
missing, typeof(missing) # singleton type

(missing, Missings.Missing)

In [28]:
x = [1, 2, missing, 3] # arrays automatically create an appropriate union type

4-element Array{Union{Int64, Missings.Missing},1}:
 1       
 2       
  missing
 3       

In [29]:
ismissing(1), ismissing(missing), ismissing(x), ismissing.(x) # check if variable is missing

(false, true, false, Bool[false, false, true, false])

In [30]:
eltype(x), Missings.T(eltype(x)) # extract the type combined with Missing (useful for arrays)

(Union{Int64, Missings.Missing}, Int64)

In [31]:
missing == missing, missing != missing, missing < missing # missing comparisons produce missing

(missing, missing, missing)

In [32]:
1 == missing, 1 != missing, 1 < missing # the same with values of other types

(missing, missing, missing)

In [33]:
isequal(missing, missing), missing === missing, isequal(1, missing), isless(1, missing) # these always return a Bool

(true, true, false, true)
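
Because `==` with `missing` returns `missing` rather than a `Bool`, such a comparison cannot be used directly as an `if` condition. A minimal sketch of branching safely with `ismissing` and `isequal` follows. The helper name `label` is hypothetical, and `missing` is assumed to be in scope (via `using DataFrames` or `using Missings` on Julia 0.6; it is built in on Julia 1.0 and later):

```julia
# Hypothetical helper: classify a value without ever putting `missing`
# into a Boolean context.
function label(v)
    if ismissing(v)           # always returns a Bool
        return "missing"
    elseif isequal(v, 1)      # isequal also always returns a Bool
        return "one"
    else
        return "other"
    end
end

[label(v) for v in [1, missing, 2]]  # ["one", "missing", "other"]
```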

In [34]:
map(x -> x(missing), [sin, cos, zero, sqrt]) # many (not all) functions handle missing

4-element Array{Missings.Missing,1}:
 missing
 missing
 missing
 missing

In [35]:
map(x -> x(missing, 1), [+, - , *, /, div]) # part 2

5-element Array{Missings.Missing,1}:
 missing
 missing
 missing
 missing
 missing

In [36]:
map(x -> x([1,2,missing]), [minimum, maximum, extrema, mean, any, float]) # part 3

6-element Array{Any,1}:
 missing                                            
 missing                                            
 (missing, missing)                                 
 missing                                            
 missing                                            
 Union{Float64, Missings.Missing}[1.0, 2.0, missing]

In [37]:
collect(skipmissing([1, missing, 2, missing])) # skipmissing returns an iterator that skips missing values

2-element Array{Int64,1}:
 1
 2
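
Since reductions such as `minimum` and `maximum` return `missing` when any element is missing, the `skipmissing` iterator can also be passed straight to a reduction without `collect`. A small sketch, under the same assumption that `missing` and `skipmissing` are already in scope; the variable names are illustrative:

```julia
v = [1, 2, missing, 3]

total    = sum(skipmissing(v))      # 6, missing entries are ignored
largest  = maximum(skipmissing(v))  # 3
nmissing = count(ismissing, v)      # 1, number of missing entries
```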

In [38]:
collect(Missings.replace([1.0, missing, 2.0, missing], NaN)) # the same but replacing missings

4-element Array{Float64,1}:
   1.0
 NaN  
   2.0
 NaN  

In [39]:
unique([1, missing, 2, missing]), levels([1, missing, 2, missing]) # get unique values with or without missings

(Any[1, missing, 2], [1, 2])

In [40]:
x = DataFrame(Int, 2, 3)
showcols(x)
allowmissing!(x, 1) # make first column accept missings
allowmissing!(x, :x3) # make :x3 column accept missings
println("\n\nAfter: ", eltypes(x))

2×3 DataFrames.DataFrame
│ Col # │ Name │ Eltype │ Missing │ Values           │
├───────┼──────┼────────┼─────────┼──────────────────┤
│ 1     │ x1   │ Int64  │ 0       │ -1  …  110692240 │
│ 2     │ x2   │ Int64  │ 0       │ 447355136  …  0  │
│ 3     │ x3   │ Int64  │ 0       │ 0  …  0          │

After: Type[Union{Int64, Missings.Missing}, Int64, Union{Int64, Missings.Missing}]


In [41]:
x = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])
println("Complete cases:\n", completecases(x)) # find rows with all complete data
y = dropmissing(x) # remove rows with incomplete data from DataFrame, create a new DataFrame
dropmissing!(x) # the same but in-place
[x, y]

Complete cases:
Bool[true, false, false, true]


2-element Array{DataFrames.DataFrame,1}:
 2×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ A │
│ 2   │ 4 │ C │
 2×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ A │
│ 2   │ 4 │ C │

## Load and save DataFrames

In [44]:
using CSV # reading and writing CSV files
using JLD # Julia native binary format

INFO: Recompiling stale cache file D:\Software\JULIA_PKG\lib\v0.6\HDF5.ji for module HDF5.
INFO: Recompiling stale cache file D:\Software\JULIA_PKG\lib\v0.6\JLD.ji for module JLD.
INFO: Precompiling module Feather.

In [43]:
x = DataFrame(A=[true, false, true], B=[1,2,missing],
              C=[missing, "b", "c"], D=['a', missing, 'c']) # create a simple DataFrame for testing purposes


Row,A,B,C,D
1,true,1,missing,'a'
2,false,2,b,missing
3,true,missing,c,'c'


In [45]:
CSV.write("x.csv", x)

CSV.Sink{DateFormat{Symbol("yyyy-mm-dd"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'}}},DataType}(    CSV.Options:
        delim: ','
        quotechar: '"'
        escapechar: '\\'
        null: ""
        dateformat: dateformat"yyyy-mm-dd"
        decimal: '.'
        truestring: 'true'
        falsestring: 'false', IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1), "x.csv", 8, true, String["A", "B", "C", "D"], 4, false, Val{false})

In [47]:
y = CSV.read("x.csv")

Row,A,B,C,D
1,true,1,missing,a
2,false,2,b,missing
3,true,missing,c,c


In [51]:
eltypes(y) # notice that by default WeakRefString is used for efficiency

4-element Array{Type,1}:
 Bool                                         
 Union{Int64, Missings.Missing}               
 Union{Missings.Missing, WeakRefString{UInt8}}
 Union{Missings.Missing, WeakRefString{UInt8}}

In [55]:
save("x.jld", "x", x)

In [56]:
y = load("x.jld", "x") # this is identical to x

Row,A,B,C,D
1,true,1,missing,'a'
2,false,2,b,missing
3,true,missing,c,'c'


In [57]:
eltypes(y)

4-element Array{Type,1}:
 Bool                           
 Union{Int64, Missings.Missing} 
 Union{Missings.Missing, String}
 Union{Char, Missings.Missing}  

In [6]:
@time bigdf = DataFrame(Bool, 10^3, 10^5) # 10^3 rows, 10^5 columns
@time CSV.write("bigdf.csv", bigdf)
@time save("bigdf.jld", "bigdf", bigdf)
getfield.(stat.(["bigdf.csv", "bigdf.jld"]), :size) # get sizes of files
# expect JLD to be faster; relative file sizes may depend on the saved data

  0.672588 seconds (1.30 M allocations: 158.957 MiB, 14.23% gc time)
 69.622614 seconds (698.90 M allocations: 25.085 GiB, 7.31% gc time)
  9.266937 seconds (202.79 M allocations: 3.220 GiB, 12.56% gc time)


2-element Array{Int64,1}:
 600683777
 140505404

## Manipulating columns of a DataFrame

### Renaming

In [21]:
x = DataFrame(Bool, 3, 4)

Row,x1,x2,x3,x4
1,false,false,false,false
2,false,false,false,true
3,false,true,false,true


In [22]:
rename(x, :x1 => :A) # new data frame, also accepts collections of Pairs

Row,A,x2,x3,x4
1,false,false,false,false
2,false,false,false,true
3,false,true,false,true


In [23]:
rename!(c -> Symbol(string(c)^2), x) # in-place transformation by applying a function

Row,x1x1,x2x2,x3x3,x4x4
1,false,false,false,false
2,false,false,false,true
3,false,true,false,true


In [25]:
rename(x, names(x)[3] => :third) # change name of third variable, new data frame

Row,x1x1,x2x2,third,x4x4
1,false,false,false,false
2,false,false,false,true
3,false,true,false,true


In [26]:
names!(x, [:a, :b, :c, :d]) # change all names of variables

Row,a,b,c,d
1,false,false,false,false
2,false,false,false,true
3,false,true,false,true


In [27]:
names!(x, fill(:a, 4), allow_duplicates=true) # handle duplicates in passed names

Row,a,a_1,a_2,a_3
1,false,false,false,false
2,false,false,false,true
3,false,true,false,true
