# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Apr 21, 2018**

Let's get started by loading the `DataFrames` package.

In [1]:
using DataFrames

## Constructors and conversion

### Constructors

In this section, you'll see many ways to create a `DataFrame` using the `DataFrame()` constructor.

First, we can create an empty `DataFrame`:

In [2]:
DataFrame() # empty DataFrame

Or we could call the constructor using keyword arguments to add columns to the `DataFrame`.

In [3]:
DataFrame(A=1:3, B=rand(3), C=randstring.([3,3,3]))

|  | A | B | C |
|---|---|---|---|
| 1 | 1 | 0.865057 | Jds |
| 2 | 2 | 0.870442 | hDG |
| 3 | 3 | 0.128666 | oyt |


We can create a `DataFrame` from a dictionary, in which case keys from the dictionary will be sorted to create the `DataFrame` columns.

In [4]:
x = Dict("A" => [1,2], "B" => [true, false], "C" => ['a', 'b'])
DataFrame(x)

|  | A | B | C |
|---|---|---|---|
| 1 | 1 | true | 'a' |
| 2 | 2 | false | 'b' |
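
The key-sorting behaviour can be checked with a small sketch. It assumes the same `DataFrames` version as this notebook; the dictionary `d` and its contents are made up for illustration. `string.` is applied to the result of `names` so that the comparison does not depend on how column names are represented.

```julia
using DataFrames

# Hypothetical dictionary whose keys are deliberately written out of order.
d = Dict("C" => ['a', 'b'], "A" => [1, 2], "B" => [true, false])
df = DataFrame(d)

# Columns follow the sorted keys, not the order in which they were written.
string.(names(df)) == sort(collect(keys(d)))
```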


Rather than explicitly creating a dictionary first, as above, we could pass `DataFrame` arguments with the syntax of dictionary key-value pairs. 

Note that in this case, we use symbols to denote the column names. For example, `:A` -- the symbol `A` -- is the name of the first column here:

In [5]:
DataFrame(:A => [1,2], :B => [true, false], :C => ['a', 'b'])

|  | A | B | C |
|---|---|---|---|
| 1 | 1 | true | 'a' |
| 2 | 2 | false | 'b' |
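
As a quick check, the keyword form and the pair form should describe the same table. The sketch below assumes the constructors behave as shown in the cells above; the variable names `kw` and `pr` are only illustrative.

```julia
using DataFrames

# Same columns, written once as keyword arguments and once as pairs.
kw = DataFrame(A=[1, 2], B=[true, false], C=['a', 'b'])
pr = DataFrame(:A => [1, 2], :B => [true, false], :C => ['a', 'b'])

# Compare the two tables (same column names and same contents).
isequal(kw, pr)
```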


Here we create a `DataFrame` from a vector of vectors, and each vector becomes a column.

In [6]:
DataFrame([rand(3) for i in 1:3])

|  | x1 | x2 | x3 |
|---|---|---|---|
| 1 | 0.738969 | 0.476396 | 0.926968 |
| 2 | 0.498559 | 0.190063 | 0.839678 |
| 3 | 0.0957712 | 0.843156 | 0.120698 |


For now, we can construct a `DataFrame` from a `Vector` of atoms, which creates a `DataFrame` with a single row. This behavior is deprecated, so do not rely on it in Julia 0.7 and beyond.

In [7]:
DataFrame(rand(3))

Stacktrace:
 [1] depwarn(::String, ::Symbol) at .\deprecated.jl:70
 [2] #DataFrame#57(::Bool, ::Type{T} where T, ::Array{Float64,1}, ::Array{Symbol,1}) at D:\Software\JULIA_PKG\v0.6\DataFrames\src\dataframe\dataframe.jl:154
 [3] DataFrames.DataFrame(::Array{Float64,1}) at D:\Software\JULIA_PKG\v0.6\DataFrames\src\dataframe\dataframe.jl:152
 [4] include_string(::String, ::String) at .\loading.jl:522
 [5] execute_request(::ZMQ.Socket, ::IJulia.Msg) at D:\Software\JULIA_PKG\v0.6\IJulia\src\execute_request.jl:158
 [6] (::Compat.#inner#17{Array{Any,1},IJulia.#execute_request,Tuple{ZMQ.Socket,IJulia.Msg}})() at D:\Software\JULIA_PKG\v0.6\Compat\src\Compat.jl:385

|  | x1 | x2 | x3 |
|---|---|---|---|
| 1 | 0.613673 | 0.9388 | 0.976714 |


Instead, use a transposed vector if you have a vector of atoms.

In [8]:
DataFrame(transpose([1, 2, 3]))

|  | x1 | x2 | x3 |
|---|---|---|---|
| 1 | 1 | 2 | 3 |


Pass a second argument to give the columns names.

In [9]:
DataFrame([1:3, 4:6, 7:9], [:A, :B, :C])

|  | A | B | C |
|---|---|---|---|
| 1 | 1 | 4 | 7 |
| 2 | 2 | 5 | 8 |
| 3 | 3 | 6 | 9 |


Here we create a `DataFrame` from a matrix,

In [10]:
DataFrame(rand(3,4))

|  | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| 1 | 0.874557 | 0.777246 | 0.949467 | 0.697868 |
| 2 | 0.579164 | 0.816029 | 0.191466 | 0.0563065 |
| 3 | 0.280777 | 0.795716 | 0.201309 | 0.191633 |


and here we do the same but also pass column names.

In [11]:
DataFrame(rand(3,4), Symbol.('a':'d'))

|  | a | b | c | d |
|---|---|---|---|---|
| 1 | 0.627336 | 0.850384 | 0.225164 | 0.617465 |
| 2 | 0.645045 | 0.709581 | 0.0780468 | 0.0941601 |
| 3 | 0.48563 | 0.608378 | 0.777213 | 0.630866 |
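
The expression `Symbol.('a':'d')` used for the column names may look cryptic. A minimal sketch of what it evaluates to, using only base Julia (the variable name `colnames` is illustrative):

```julia
# `'a':'d'` is a range of characters; the dot applies `Symbol` to each element.
colnames = Symbol.('a':'d')

# The result is an ordinary vector of symbols, one per column.
colnames == [:a, :b, :c, :d]
```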


We can also construct an uninitialized `DataFrame`.

Here we pass column types, names, and the number of rows. Columns `:A` and `:B` hold whatever happened to be in memory, while column `:C` contains `missing` because `Any >: Missing`.

In [12]:
DataFrame([Int, Float64, Any], [:A, :B, :C], 1)

|  | A | B | C |
|---|---|---|---|
| 1 | 147259888 | 1.08013e-319 | missing |


Here we create a `DataFrame` whose column `:C` is `#undef`, and Jupyter has a problem displaying it. (This works fine at the REPL.)

This will be fixed in the next release of DataFrames!

In [13]:
DataFrame([Int, Float64, String], [:A, :B, :C], 1)

UndefRefError: access to undefined reference

To initialize a `DataFrame` with column names but no rows, pass `0` as the number of rows:

In [14]:
DataFrame([Int, Float64, String], [:A, :B, :C], 0) 

|  | A | B | C |
|---|---|---|---|


This syntax gives us a quick way to create a homogeneous `DataFrame`. As above, the entries are uninitialized, so the values shown are arbitrary.

In [15]:
DataFrame(Int, 3, 5)

|  | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| 1 | 202568848 | 199866128 | 1649267441664 | 0 | 439701264 |
| 2 | 147292368 | 147292368 | 0 | 0 | 148642768 |
| 3 | 147293072 | 147292432 | 0 | 0 | 439701424 |


This example is similar, but has non-homogeneous columns.

In [16]:
DataFrame([Int, Float64], 4)

|  | x1 | x2 |
|---|---|---|
| 1 | 148644688 | 2.17246e-315 |
| 2 | 147261328 | 2.17242e-315 |
| 3 | 147261328 | 2.17246e-315 |
| 4 | 0 | 2.20078e-315 |


Finally, we can create a `DataFrame` by copying an existing `DataFrame`.

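Note that `copy` creates a shallow copy. In the cell below, `x` is still the dictionary defined in In [4], so `y` is a new `DataFrame` built from it and `z` is a shallow copy of that dictionary.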

In [17]:
y = DataFrame(x)
z = copy(x)
(x === y), (x === z), isequal(x, z)

(false, false, true)
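
What "shallow" means can be sketched with plain Julia containers. The example below uses a made-up `Dict` rather than a `DataFrame`, and only base functions (`copy`, `deepcopy`); it is an illustration of the general idea, not a statement about how any particular `DataFrames` version stores its columns.

```julia
# Illustrative dictionary (contents made up); the values are mutable arrays.
d = Dict("A" => [1, 2], "B" => [true, false])

shallow = copy(d)      # new Dict, but it points at the same arrays
deep = deepcopy(d)     # new Dict with independent arrays

d["A"][1] = 100        # mutate an array through the original

shallow["A"][1]        # sees the change, because the array is shared
deep["A"][1]           # unaffected, because the array was duplicated
```

A shallow copy is cheap, but mutating a shared array is visible through both containers; use `deepcopy` when the copies must be fully independent.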

### Conversion to a matrix

Let's start by creating a `DataFrame` with two rows and two columns.

In [18]:
x = DataFrame(x=1:2, y=["A", "B"])

|  | x | y |
|---|---|---|
| 1 | 1 | A |
| 2 | 2 | B |


We can create a matrix by passing this `DataFrame` to `Matrix`.

In [19]:
Matrix(x)

2×2 Array{Any,2}:
 1  "A"
 2  "B"

This would work even if the `DataFrame` had some `missing`s:

In [20]:
x = DataFrame(x=1:2, y=[missing,"B"])

|  | x | y |
|---|---|---|
| 1 | 1 | missing |
| 2 | 2 | B |


In [21]:
Matrix(x)

2×2 Array{Any,2}:
 1  missing
 2  "B"    

In the two previous matrix examples, Julia created matrices with elements of type `Any`. We can see more clearly that the type of matrix is inferred when we pass, for example, a `DataFrame` of integers to `Matrix`, creating a 2D `Array` of `Int64`s:

In [22]:
x = DataFrame(x=1:2, y=3:4)

|  | x | y |
|---|---|---|
| 1 | 1 | 3 |
| 2 | 2 | 4 |


In [23]:
Matrix(x)

2×2 Array{Int64,2}:
 1  3
 2  4

In this next example, Julia correctly identifies that `Union` is needed to express the type of the resulting `Matrix` (which contains `missing`s).

In [24]:
x = DataFrame(x=1:2, y=[missing,4])

|  | x | y |
|---|---|---|
| 1 | 1 | missing |
| 2 | 2 | 4 |


In [25]:
Matrix(x)

2×2 Array{Union{Int64, Missings.Missing},2}:
 1   missing
 2  4       
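
The three element types seen so far can be compared side by side. This sketch rebuilds the same small tables as in the cells above and inspects `eltype` of each resulting matrix; the variable names are illustrative.

```julia
using DataFrames

# All-integer columns: the matrix keeps a concrete integer element type.
t_int = eltype(Matrix(DataFrame(x=1:2, y=3:4)))             # Int

# Integers mixed with strings: no common concrete type, so Any.
t_any = eltype(Matrix(DataFrame(x=1:2, y=["A", "B"])))      # Any

# Integers mixed with missing: a small Union instead of Any.
t_union = eltype(Matrix(DataFrame(x=1:2, y=[missing, 4])))  # Union{Int, Missing}
```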

Note that we can't force a conversion of `missing` values to `Int`s!

In [26]:
Matrix{Int}(x)

LoadError: cannot convert a DataFrame containing missing values to array (found for column y)

### Handling of duplicate column names

We can pass the `makeunique` keyword argument to allow duplicate names. The duplicates are then renamed with a numeric suffix:

In [27]:
df = DataFrame(:a=>1, :a=>2, :a_1=>3; makeunique=true)

|  | a | a_2 | a_1 |
|---|---|---|---|
| 1 | 1 | 2 | 3 |
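
The renaming can also be checked programmatically. The sketch below assumes the same constructor as above but passes one-element vectors instead of scalars; `string.` is applied to `names` so the comparison does not depend on how column names are represented.

```julia
using DataFrames

# Same duplicate names as above, with one-element vectors as column values.
df = DataFrame(:a => [1], :a => [2], :a_1 => [3]; makeunique=true)

# The second `a` cannot become `a_1` (already taken), so it becomes `a_2`.
string.(names(df)) == ["a", "a_2", "a_1"]
```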


Without `makeunique`, duplicate names currently trigger a deprecation warning and will not be allowed in the future.

In [28]:
df = DataFrame(:a=>1, :a=>2, :a_1=>3)

Stacktrace:
 [1] depwarn(::String, ::Symbol) at .\deprecated.jl:70
 [2] #make_unique#3(::Bool, ::Function, ::Array{Symbol,1}) at D:\Software\JULIA_PKG\v0.6\DataFrames\src\other\utils.jl:61
 [3] (::DataFrames.#kw##make_unique)(::Array{Any,1}, ::DataFrames.#make_unique, ::Array{Symbol,1}) at .\<missing>:0
 [4] #Index#6 at D:\Software\JULIA_PKG\v0.6\DataFrames\src\other\index.jl:12 [inlined]
 [5] (::Core.#kw#Type)(::Array{Any,1}, ::Type{DataFrames.Index}, ::Array{Symbol,1}) at .\<missing>:0
 [6] #DataFrame#47(::Bool, ::Type{T} where T, ::Pair{Symbol,Int64}, ::Vararg{Pair{Symbol,Int64},N} where N) at D:\Software\JULIA_PKG\v0.6\DataFrames\src\dataframe\dataframe.jl:126
 [7] DataFrames

|  | a | a_2 | a_1 |
|---|---|---|---|
| 1 | 1 | 2 | 3 |


The keyword-argument form of the constructor is a corner case: every keyword, including `makeunique`, is treated as a column name.
You cannot pass `makeunique` to allow duplicates here; it simply becomes another column.

In [29]:
df = DataFrame(a=1, a=2, makeunique=true)

Stacktrace:
 [1] depwarn(::String, ::Symbol) at .\deprecated.jl:70
 [2] #make_unique#3(::Bool, ::Function, ::Array{Symbol,1}) at D:\Software\JULIA_PKG\v0.6\DataFrames\src\other\utils.jl:61
 [3] (::DataFrames.#kw##make_unique)(::Array{Any,1}, ::DataFrames.#make_unique, ::Array{Symbol,1}) at .\<missing>:0
 [4] #Index#6 at D:\Software\JULIA_PKG\v0.6\DataFrames\src\other\index.jl:12 [inlined]
 [5] (::Core.#kw#Type)(::Array{Any,1}, ::Type{DataFrames.Index}, ::Array{Symbol,1}) at .\<missing>:0
 [6] #DataFrame#47(::Bool, ::Type{T} where T, ::Pair{Symbol,Int64}, ::Vararg{Pair{Symbol,#s8} where #s8,N} where N) at D:\Software\JULIA_PKG\v0.6\DataFrames\src\dataframe\dataframe.jl:126
 [7] Da

|  | a | a_1 | makeunique |
|---|---|---|---|
| 1 | 1 | 2 | true |
