# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Dec 31, 2017**

A brief introduction to basic usage of `DataFrames`. Tested under `DataFrames` master on 2017-12-05.
I will try to keep it up to date as the package evolves.

In [1]:
using DataFrames # load package

## Constructors and conversion

### Constructors

In [2]:
DataFrame() # empty DataFrame

In [3]:
DataFrame(A=1:3, B=rand(3), C=randstring.([3,3,3])) # keyword arguments

|   | A | B | C |
|---|---|---|---|
| 1 | 1 | 0.551311 | T6G |
| 2 | 2 | 0.0843715 | npE |
| 3 | 3 | 0.36462 | sZT |


In [4]:
x = Dict("A" => [1,2], "B" => [true, false], "C" => ['a', 'b'])
DataFrame(x) # from dictionary, columns will be sorted

|   | A | B | C |
|---|---|---|---|
| 1 | 1 | true | 'a' |
| 2 | 2 | false | 'b' |


In [5]:
DataFrame(:A => [1,2], :B => [true, false], :C => ['a', 'b']) # from pairs

|   | A | B | C |
|---|---|---|---|
| 1 | 1 | true | 'a' |
| 2 | 2 | false | 'b' |


In [6]:
DataFrame([rand(3) for i in 1:3]) # from vector of vectors

Stacktrace:
 [1] depwarn(::String, ::Symbol) at .\deprecated.jl:70
 [2] DataFrames.DataFrame(::Array{Array{Float64,1},1}) at D:\Software\JULIA_PKG\v0.6\DataFrames\src\deprecated.jl:4
 [3] include_string(::String, ::String) at .\loading.jl:522
 [4] include_string(::Module, ::String, ::String) at D:\Software\JULIA_PKG\v0.6\Compat\src\Compat.jl:174
 [5] execute_request(::ZMQ.Socket, ::IJulia.Msg) at D:\Software\JULIA_PKG\v0.6\IJulia\src\execute_request.jl:154
 [6] (::Compat.#inner#16{Array{Any,1},IJulia.#execute_request,Tuple{ZMQ.Socket,IJulia.Msg}})() at D:\Software\JULIA_PKG\v0.6\Compat\src\Compat.jl:496
 [7] eventloop(::ZMQ.Sock

In [7]:
DataFrame(rand(3)) # edge case: a vector of atoms, each element becomes its own column

Stacktrace:
 [1] depwarn(::String, ::Symbol) at .\deprecated.jl:70
 [2] DataFrames.DataFrame(::Array{Float64,1}) at D:\Software\JULIA_PKG\v0.6\DataFrames\src\deprecated.jl:4
 [3] include_string(::String, ::String) at .\loading.jl:522
 [4] include_string(::Module, ::String, ::String) at D:\Software\JULIA_PKG\v0.6\Compat\src\Compat.jl:174
 [5] execute_request(::ZMQ.Socket, ::IJulia.Msg) at D:\Software\JULIA_PKG\v0.6\IJulia\src\execute_request.jl:154
 [6] (::Compat.#inner#16{Array{Any,1},IJulia.#execute_request,Tuple{ZMQ.Socket,IJulia.Msg}})() at D:\Software\JULIA_PKG\v0.6\Compat\src\Compat.jl:496
 [7] eventloop(::ZMQ.Socket)

|   | x1 | x2 | x3 |
|---|---|---|---|
| 1 | 0.262448 | 0.341091 | 0.535386 |


In [8]:
DataFrame(rand(3), [:A, :B, :C]) # pass second argument to give column names

Stacktrace:
 [1] depwarn(::String, ::Symbol) at .\deprecated.jl:70
 [2] #DataFrame#57(::Bool, ::Type{T} where T, ::Array{Float64,1}, ::Array{Symbol,1}) at D:\Software\JULIA_PKG\v0.6\DataFrames\src\dataframe\dataframe.jl:136
 [3] DataFrames.DataFrame(::Array{Float64,1}, ::Array{Symbol,1}) at D:\Software\JULIA_PKG\v0.6\DataFrames\src\dataframe\dataframe.jl:134
 [4] include_string(::String, ::String) at .\loading.jl:522
 [5] include_string(::Module, ::String, ::String) at D:\Software\JULIA_PKG\v0.6\Compat\src\Compat.jl:174
 [6] execute_request(::ZMQ.Socket, ::IJulia.Msg) at D:\Software\JULIA_PKG\v0.6\IJulia\src\execute_request.jl:154
 [7] (::Compat.#inner

|   | A | B | C |
|---|---|---|---|
| 1 | 0.668701 | 0.775651 | 0.653475 |


In [9]:
DataFrame(rand(3,4)) # from matrix

|   | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| 1 | 0.0545579 | 0.711941 | 0.391925 | 0.784698 |
| 2 | 0.217127 | 0.817265 | 0.723701 | 0.529613 |
| 3 | 0.0353567 | 0.604078 | 0.137163 | 0.0769358 |


In [10]:
DataFrame([Int, Float64, Any], [:A, :B, :C], 1) # pass column types, names and number of rows
# we get missing because Any >: Missing

|   | A | B | C |
|---|---|---|---|
| 1 | -1 | 2.17937e-315 | missing |
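
The comment in this cell relies on a subtype relation. As a minimal sketch, assuming Julia 1.x (where `Missing` is defined in Base; on Julia 0.6, as used in this notebook, it comes from the Missings package), the relation can be checked directly. The variable names are illustrative only.

```julia
# Assumes Julia 1.x, where Missing is part of Base.
any_accepts_missing   = Any >: Missing                  # Any is a supertype of every type
int_accepts_missing   = Int >: Missing                  # a concrete integer type cannot hold missing
union_accepts_missing = Union{Int, Missing} >: Missing  # an explicit Union can

println((any_accepts_missing, int_accepts_missing, union_accepts_missing))
```

This is consistent with the output of the cell: column `C` (type `Any`) shows `missing`, while `A` and `B` show arbitrary values that look like uninitialized memory.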


In [11]:
DataFrame([Int, Float64, String], [:A, :B, :C], 1)
# the DataFrame is created correctly, but the String value is #undef, so Jupyter fails to print it

UndefRefError: access to undefined reference

In [12]:
DataFrame([Int, Float64, String], [:A, :B, :C], 0) # columns are created, but there are no rows

|   | A | B | C |
|---|---|---|---|


In [13]:
DataFrame(Int, 3, 5) # a quick way to create a homogeneous DataFrame

|   | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| 1 | 151987024 | 458724656 | 458724752 | 458724784 | 150634704 |
| 2 | 150603664 | 150605488 | 150605488 | 458724816 | 150634768 |
| 3 | 150603664 | 150603664 | 456707184 | 449371856 | 150634832 |


In [14]:
DataFrame([Int, Float64], 4) # similar, but with nonhomogeneous columns

|   | x1 | x2 |
|---|---|---|
| 1 | 159884496 | 2.26819e-315 |
| 2 | 150603664 | 7.4409e-316 |
| 3 | 150603664 | 7.44081e-316 |
| 4 | 150769776 | 7.44074e-316 |


In [15]:
x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"], D = [1, "a"])
convert(Array, x) # convert DataFrame to Matrix

2×4 Array{Any,2}:
 1  1.0       "a"  1   
 2   missing  "b"   "a"

In [16]:
y = DataFrame(x) # returns x itself, no copy is made
z = copy(x) # copy (shallow)
(x === y), (x === z), isequal(x, z)

(true, false, true)
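
To illustrate what "shallow" means for `copy`, here is a rough analogy in Base Julia, not the DataFrames implementation: a vector of vectors stands in for a table's list of columns. The names `columns` and `shallow` are hypothetical.

```julia
# Analogy only: a vector of vectors stands in for a list of columns.
columns = [[1, 2], [3, 4]]
shallow = copy(columns)

same_container = columns === shallow     # the outer container is a new object
same_inner     = columns[1] === shallow[1] # the inner vectors are shared
equal_contents = columns == shallow      # contents compare equal

println((same_container, same_inner, equal_contents))

shallow[1][1] = 100   # mutating a shared inner vector...
println(columns[1])   # ...is also visible through the original
```

Under this analogy, a shallow copy gives a distinct object that compares equal to the original, which matches the `(x === z)` and `isequal(x, z)` results in the cell.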

### Conversion to a matrix

In [17]:
x = DataFrame(x=1:2, y=["A", "B"])

|   | x | y |
|---|---|---|
| 1 | 1 | A |
| 2 | 2 | B |


In [18]:
Matrix(x)

2×2 Array{Any,2}:
 1  "A"
 2  "B"

In [19]:
x = DataFrame(x=1:2, y=[missing,"B"])

|   | x | y |
|---|---|---|
| 1 | 1 | missing |
| 2 | 2 | B |


In [20]:
Matrix(x) # missings are OK

2×2 Array{Any,2}:
 1  missing
 2  "B"    

In [21]:
x = DataFrame(x=1:2, y=3:4)

|   | x | y |
|---|---|---|
| 1 | 1 | 3 |
| 2 | 2 | 4 |


In [22]:
Matrix(x) # type of Matrix is inferred

2×2 Array{Int64,2}:
 1  3
 2  4

In [23]:
x = DataFrame(x=1:2, y=[missing,4])

|   | x | y |
|---|---|---|
| 1 | 1 | missing |
| 2 | 2 | 4 |


In [24]:
Matrix(x) # correct identification that Union is needed here

2×2 Array{Union{Int64, Missings.Missing},2}:
 1   missing
 2  4       
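
The `Matrix(x)` results so far differ only in their element type. As a rough analogy in Base Julia (not the DataFrames implementation), assuming Julia 1.x where `Missing` is in Base, `hcat` on plain vectors shows the same three outcomes. The vectors stand in for DataFrame columns, and the variable names are illustrative.

```julia
# Assumes Julia 1.x; plain vectors stand in for DataFrame columns.
ints         = hcat([1, 2], [3, 4])        # both columns hold Int
with_missing = hcat([1, 2], [missing, 4])  # one column allows missing
mixed        = hcat([1, 2], ["A", "B"])    # Int and String share no narrower common type

println(eltype(ints))
println(eltype(with_missing))
println(eltype(mixed))
```

The element types come out as `Int`, a `Union` of `Int` and `Missing`, and `Any`, mirroring the three matrix types printed in the cells.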

In [25]:
Matrix{Int}(x) # error - conversion to Int type is impossible

LoadError: cannot convert a DataFrame containing missing values to array (found for column y)
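
The same kind of failure can be reproduced without DataFrames. The sketch below assumes Julia 1.x and uses a plain matrix as a stand-in for the DataFrame; Base reports the problem as a `MethodError` rather than the DataFrames message shown in the cell output. The names `m` and `err` are hypothetical.

```julia
# Assumes Julia 1.x; a plain matrix stands in for the DataFrame.
m = hcat([1, 2], [missing, 4])   # element type allows missing

# Converting to a plain Int matrix fails: missing has no Int representation.
err = try
    Matrix{Int}(m)
    nothing
catch e
    e
end

println(typeof(err))
```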