# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Dec 2, 2017**

A brief introduction to basic usage of `DataFrames`. Tested under version `0.11`.
I will try to keep it up to date as the package evolves.

In [1]:
using DataFrames # load package

## Constructors

In [2]:
DataFrame() # empty DataFrame

In [3]:
DataFrame(A=1:3, B=rand(3), C=randstring.([3,3,3])) # keyword arguments

Row,A,B,C
1,1,0.559732,Bgn
2,2,0.848476,RYA
3,3,0.455683,j6K


In [4]:
x = Dict("A" => [1,2], "B" => [true, false], "C" => ['a', 'b'])
DataFrame(x) # from a dictionary; columns are sorted by name

Row,A,B,C
1,1,true,'a'
2,2,false,'b'


In [5]:
DataFrame(:A => [1,2], :B => [true, false], :C => ['a', 'b']) # from pairs

Row,A,B,C
1,1,true,'a'
2,2,false,'b'


In [6]:
DataFrame([rand(3) for i in 1:3]) # from vector of vectors

Row,x1,x2,x3
1,0.572434,0.787104,0.542594
2,0.511473,0.138666,0.738349
3,0.405708,0.703204,0.056101


In [7]:
DataFrame(rand(3)) # edge case: a vector of scalars gives a single row with one column per element

Row,x1,x2,x3
1,0.0635589,0.839034,0.633092


In [8]:
DataFrame(rand(3), [:A, :B, :C]) # pass second argument to give column names

Row,A,B,C
1,0.852616,0.361704,0.912633


In [9]:
DataFrame(rand(3,4)) # from matrix

Row,x1,x2,x3,x4
1,0.637503,0.169445,0.415154,0.732673
2,0.218318,0.0257092,0.491263,0.935891
3,0.573728,0.473092,0.906829,0.129402


In [10]:
DataFrame([Int, Float64, Any], [:A, :B, :C], 1) # pass column types, names and number of rows
# A and B hold uninitialized (arbitrary) values; C holds missing because Any >: Missing

Row,A,B,C
1,435284208,2.15059e-315,missing


In [11]:
DataFrame([Int, Float64, String], [:A, :B, :C], 1)
# the DataFrame is created correctly, but the String value is #undef, so Jupyter cannot print it

UndefRefError: access to undefined reference

In [12]:
DataFrame([Int, Float64, String], [:A, :B, :C], 0) # columns are created, but there are no rows

Row,A,B,C


In [13]:
DataFrame(Int, 3, 5) # a quick way to create a homogeneous DataFrame (values are uninitialized)

Row,x1,x2,x3,x4,x5
1,118349520,110876872,169972976,118349520,422508336
2,117255152,109917576,117276880,117255152,117245840
3,117245840,109992280,117276944,117245840,117245840


In [14]:
DataFrame([Int, Float64], 4) # similar, but with columns of different types

Row,x1,x2
1,422741008,2.08862e-315
2,422740336,2.08862e-315
3,184264272,2.08862e-315
4,0,2.08862e-315


In [15]:
x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"], D = [1, "a"])
convert(Array, x) # convert DataFrame to Matrix

2×4 Array{Any,2}:
 1  1.0       "a"  1   
 2   missing  "b"   "a"

## Getting basic information about a data frame

In [16]:
x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"], D = [1, "a"])

Row,A,B,C,D
1,1,1.0,a,1
2,2,missing,b,a


In [17]:
size(x), size(x, 1), size(x, 2)

((2, 4), 2, 4)

In [18]:
nrow(x), ncol(x), length(x) # length of a DataFrame is its number of columns

(2, 4, 4)

In [19]:
describe(x)

A
Summary Stats:
Mean:           1.500000
Minimum:        1.000000
1st Quartile:   1.250000
Median:         1.500000
3rd Quartile:   1.750000
Maximum:        2.000000
Length:         2
Type:           Int64

B
Summary Stats:
Mean:           1.000000
Minimum:        1.000000
1st Quartile:   1.000000
Median:         1.000000
3rd Quartile:   1.000000
Maximum:        1.000000
Length:         2
Type:           Union{Float64, Missings.Missing}
Number Missing: 1
% Missing:      50.000000

C
Summary Stats:
Length:         2
Type:           String
Number Unique:  2

D
Summary Stats:
Length:         2
Type:           Any
Number Unique:  2
Number Missing: 0
% Missing:      0.000000



In [20]:
showcols(x)

2×4 DataFrames.DataFrame
│ Col # │ Name │ Eltype                           │ Missing │ Values          │
├───────┼──────┼──────────────────────────────────┼─────────┼─────────────────┤
│ 1     │ A    │ Int64                            │ 0       │ 1  …  2         │
│ 2     │ B    │ Union{Float64, Missings.Missing} │ 1       │ 1.0  …  missing │
│ 3     │ C    │ String                           │ 0       │ a  …  b         │
│ 4     │ D    │ Any                              │ 0       │ 1  …  a         │

In [21]:
names(x)

4-element Array{Symbol,1}:
 :A
 :B
 :C
 :D

In [22]:
eltypes(x)

4-element Array{Type,1}:
 Int64                           
 Union{Float64, Missings.Missing}
 String                          
 Any                             

In [23]:
y = DataFrame(rand(1:10, 20, 10))

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
1,5,2,9,4,4,5,3,4,10,9
2,3,2,8,5,1,10,1,8,7,6
3,6,4,3,8,9,8,8,2,9,3
4,1,6,4,3,2,9,10,1,10,10
5,4,10,8,8,7,7,4,1,9,1
6,1,5,6,9,6,6,9,8,2,5
7,1,5,7,6,5,5,5,10,7,7
8,5,5,1,9,5,4,5,3,9,10
9,10,4,10,6,8,10,5,5,2,7
10,8,6,5,8,6,2,9,6,8,7


In [24]:
head(y)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
1,5,2,9,4,4,5,3,4,10,9
2,3,2,8,5,1,10,1,8,7,6
3,6,4,3,8,9,8,8,2,9,3
4,1,6,4,3,2,9,10,1,10,10
5,4,10,8,8,7,7,4,1,9,1
6,1,5,6,9,6,6,9,8,2,5


In [25]:
tail(y, 3)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
1,4,9,7,2,5,2,10,8,2,3
2,5,3,7,3,3,8,3,1,8,9
3,8,10,3,8,9,9,9,10,7,7


## Handling missing values

In [26]:
missing, typeof(missing) # singleton type

(missing, Missings.Missing)

In [27]:
x = [1, 2, missing, 3] # arrays automatically create an appropriate union type

4-element Array{Union{Int64, Missings.Missing},1}:
 1       
 2       
  missing
 3       

In [28]:
ismissing(1), ismissing(missing), ismissing(x), ismissing.(x) # check if variable is missing

(false, true, false, Bool[false, false, true, false])

In [29]:
eltype(x), Missings.T(eltype(x)) # extract the type that is combined with Missing in the Union (useful for arrays)

(Union{Int64, Missings.Missing}, Int64)

In [30]:
missing == missing, missing != missing, missing < missing # missing comparisons produce missing

(missing, missing, missing)

In [31]:
1 == missing, 1 != missing, 1 < missing # the same with values of other types

(missing, missing, missing)

In [32]:
isequal(missing, missing), missing === missing, isequal(1, missing), isless(1, missing) # these always produce a Bool result

(true, true, false, true)
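
Because `==` and `<` propagate `missing`, a comparison result cannot be used directly as a Boolean mask. The following is a minimal sketch of one common workaround, assuming Julia 1.0 or later, where `missing`, `ismissing` and `coalesce` are available from `Base` (this is an assumption; the notebook itself was run with the Missings package). The vector `v` is made up for illustration.

```julia
v = [1, missing, 3, 4]            # hypothetical data, not from the notebook

mask = v .> 2                     # element-wise comparison; the missing entry stays missing
keep = coalesce.(mask, false)     # replace missing with false to get a plain Boolean mask
selected = v[keep]                # elements known to be greater than 2

n_missing = count(ismissing, v)   # ismissing always returns a Bool, so it is safe to count with
```

Treating an unknown comparison as `false` is a design choice; passing `true` to `coalesce` instead would keep the rows whose status is unknown.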

In [33]:
map(x -> x(missing), [sin, cos, zero, sqrt]) # many (not all) functions handle missing

4-element Array{Missings.Missing,1}:
 missing
 missing
 missing
 missing

In [34]:
map(x -> x(missing, 1), [+, - , *, /, div]) # part 2: binary operators with a missing argument

5-element Array{Missings.Missing,1}:
 missing
 missing
 missing
 missing
 missing

In [35]:
map(x -> x([1,2,missing]), [minimum, maximum, extrema, mean, any, float]) # part 3: functions applied to an array containing missing

6-element Array{Any,1}:
 missing                                            
 missing                                            
 (missing, missing)                                 
 missing                                            
 missing                                            
 Union{Float64, Missings.Missing}[1.0, 2.0, missing]

In [36]:
collect(skipmissing([1, missing, 2, missing])) # skipmissing returns an iterator that skips missing values

2-element Array{Int64,1}:
 1
 2

In [37]:
collect(Missings.replace([1.0, missing, 2.0, missing], NaN)) # similar, but replaces missings with a given value instead of skipping them

4-element Array{Float64,1}:
   1.0
 NaN  
   2.0
 NaN  
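
Since reductions such as `minimum` or `maximum` return `missing` when any element is missing, the `skipmissing` iterator is usually passed straight to the reduction rather than collected first. A short sketch follows, assuming Julia 1.0 or later, where `skipmissing` and `coalesce` live in `Base`; the data are invented for illustration.

```julia
x = [1.0, missing, 2.0, missing]     # hypothetical data

total = sum(skipmissing(x))          # missings are skipped lazily, no intermediate array
largest = maximum(skipmissing(x))    # largest non-missing value
filled = coalesce.(x, NaN)           # replace each missing with NaN, element by element
```

Here `coalesce.(x, NaN)` plays the role that `Missings.replace` plays in the cell above; whether it is available without the Missings package depends on the Julia version.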

In [38]:
unique([1, missing, 2, missing]), levels([1, missing, 2, missing]) # get unique values with or without missings

(Any[1, missing, 2], [1, 2])

In [39]:
x = DataFrame(Int, 2, 3)
showcols(x)
allowmissing!(x, 1) # make first column accept missings
allowmissing!(x, :x3) # make :x3 column accept missings
println("\n\nAfter: ", eltypes(x))

2×3 DataFrames.DataFrame
│ Col # │ Name │ Eltype │ Missing │ Values                  │
├───────┼──────┼────────┼─────────┼─────────────────────────┤
│ 1     │ x1   │ Int64  │ 0       │ 4434  …  117278928      │
│ 2     │ x2   │ Int64  │ 0       │ -1  …  118225456        │
│ 3     │ x3   │ Int64  │ 0       │ 407641952  …  118310992 │

After: Type[Union{Int64, Missings.Missing}, Int64, Union{Int64, Missings.Missing}]


In [40]:
x = DataFrame(A=[1, missing, 3, 4], B=["A", "B", missing, "C"])
println("Complete cases:\n", completecases(x)) # find rows with all complete data
y = dropmissing(x) # remove rows with incomplete data from DataFrame, create a new DataFrame
dropmissing!(x) # the same but in-place
[x, y]

Complete cases:
Bool[true, false, false, true]


2-element Array{DataFrames.DataFrame,1}:
 2×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ A │
│ 2   │ 4 │ C │
 2×2 DataFrames.DataFrame
│ Row │ A │ B │
├─────┼───┼───┤
│ 1   │ 1 │ A │
│ 2   │ 4 │ C │

## Load and save DataFrames

In [41]:
using CSV # reading and writing CSV files
using JLD # Julia native binary format

In [42]:
x = DataFrame(A=[true, false, true], B=[1,2,missing],
              C=[missing, "b", "c"], D=['a', missing, 'c']) # create a simple DataFrame for testing purposes


Row,A,B,C,D
1,true,1,missing,'a'
2,false,2,b,missing
3,true,missing,c,'c'


In [43]:
CSV.write("x.csv", x)

CSV.Sink{DateFormat{Symbol("yyyy-mm-dd"),Tuple{Base.Dates.DatePart{'y'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'m'},Base.Dates.Delim{Char,1},Base.Dates.DatePart{'d'}}},DataType}(    CSV.Options:
        delim: ','
        quotechar: '"'
        escapechar: '\\'
        null: ""
        dateformat: dateformat"yyyy-mm-dd"
        decimal: '.'
        truestring: 'true'
        falsestring: 'false', IOBuffer(data=UInt8[...], readable=true, writable=true, seekable=true, append=false, size=0, maxsize=Inf, ptr=1, mark=-1), "x.csv", 8, true, String["A", "B", "C", "D"], 4, false, Val{false})

In [44]:
y = CSV.read("x.csv")

Row,A,B,C,D
1,true,1,missing,a
2,false,2,b,missing
3,true,missing,c,c


In [45]:
eltypes(y) # notice that by default WeakRefString is used for efficiency

4-element Array{Type,1}:
 Bool                                         
 Union{Int64, Missings.Missing}               
 Union{Missings.Missing, WeakRefString{UInt8}}
 Union{Missings.Missing, WeakRefString{UInt8}}

In [46]:
save("x.jld", "x", x)

In [47]:
y = load("x.jld", "x") # this is identical to x

Row,A,B,C,D
1,true,1,missing,'a'
2,false,2,b,missing
3,true,missing,c,'c'


In [48]:
eltypes(y)

4-element Array{Type,1}:
 Bool                           
 Union{Int64, Missings.Missing} 
 Union{Missings.Missing, String}
 Union{Char, Missings.Missing}  

In [57]:
bigdf = DataFrame(Bool, 10^3, 10^2) # 10^3 rows, 10^2 columns
@time CSV.write("bigdf.csv", bigdf)
@time save("bigdf.jld", "bigdf", bigdf)
getfield.(stat.(["bigdf.csv", "bigdf.jld"]), :size) #  you can expect JLD to be faster, file size depends on data

  0.090868 seconds (500.06 k allocations: 20.926 MiB)
  0.024537 seconds (203.74 k allocations: 3.345 MiB, 23.30% gc time)


2-element Array{Int64,1}:
 588833
 154487

## Manipulating columns of DataFrame

### Renaming columns

In [50]:
x = DataFrame(Bool, 3, 4)

Row,x1,x2,x3,x4
1,false,false,false,false
2,true,true,false,false
3,false,true,true,false


In [51]:
rename(x, :x1 => :A) # new data frame, also accepts collections of Pairs

Row,A,x2,x3,x4
1,false,false,false,false
2,true,true,false,false
3,false,true,true,false


In [52]:
rename!(c -> Symbol(string(c)^2), x) # in-place transformation by applying a function to each name

Row,x1x1,x2x2,x3x3,x4x4
1,false,false,false,false
2,true,true,false,false
3,false,true,true,false


In [53]:
rename(x, names(x)[3] => :third) # change name of third variable, new data frame

Row,x1x1,x2x2,third,x4x4
1,false,false,false,false
2,true,true,false,false
3,false,true,true,false


In [54]:
names!(x, [:a, :b, :c, :d]) # change all names of variables

Row,a,b,c,d
1,false,false,false,false
2,true,true,false,false
3,false,true,true,false


In [55]:
names!(x, fill(:a, 4), allow_duplicates=true) # handle duplicates in passed names

Row,a,a_1,a_2,a_3
1,false,false,false,false
2,true,true,false,false
3,false,true,true,false


### Reordering columns

In [59]:
x[shuffle(names(x))] # new DataFrame, reorder names(x) vector as needed

Row,a_3,a_1,a_2,a
1,false,false,false,false
2,false,true,false,true
3,false,true,true,false


### Merging columns

In [60]:
x = DataFrame([(i,j) for i in 1:3, j in 1:4])

Row,x1,x2,x3,x4
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


In [62]:
[x x] # horizontally concatenate two data frames; calling hcat directly is convenient when there are many to merge

Row,x1,x2,x3,x4,x1_1,x2_1,x3_1,x4_1
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)","(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)","(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)","(3, 1)","(3, 2)","(3, 3)","(3, 4)"


In [76]:
y = hcat(x, [1,2,3]) # append a vector as a new column; it gets an auto-generated name (here x1_1)

Row,x1,x2,x3,x4,x1_1
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)",1
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)",2
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)",3


In [79]:
y = hcat(x, DataFrame(A=[1,2,3])) # this is a bit more verbose but cleaner

Row,x1,x2,x3,x4,A
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)",1
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)",2
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)",3


In [88]:
y = [DataFrame(A=[1,2,3]) x] # a way to prepend a column to the DataFrame

Row,A,x1,x2,x3,x4
1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


In [86]:
using BenchmarkTools
@btime [$x[1:2] DataFrame(A=[1,2,3]) $x[3:4]] # insert a column in the middle, method 1

  17.261 μs (125 allocations: 9.17 KiB)


Row,x1,x2,A,x3,x4
1,"(1, 1)","(1, 2)",1,"(1, 3)","(1, 4)"
2,"(2, 1)","(2, 2)",2,"(2, 3)","(2, 4)"
3,"(3, 1)","(3, 2)",3,"(3, 3)","(3, 4)"


In [85]:
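@btime y = [$x DataFrame(A=[1,2,3])]; y[names(y)[[1,2,5,3,4]]] # method 2: hcat, then reorder columns by position; messier
# note: `;` ends the @btime expression, so the timing below covers only the hcat, not the reordering,
# and is not a like-for-like comparison with method 1; the reordering is applied to the global y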

  9.797 μs (60 allocations: 4.41 KiB)


Row,x1,x2,x4,A,x3
1,"(1, 1)","(1, 2)","(1, 4)",1,"(1, 3)"
2,"(2, 1)","(2, 2)","(2, 4)",2,"(2, 3)"
3,"(3, 1)","(3, 2)","(3, 4)",3,"(3, 3)"
