# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), May 23, 2018**

### Reference

* https://github.com/JuliaComputing/JuliaBoxTutorials/tree/master/introductory-tutorials/broader-topics-and-ecosystem/intro-to-julia-DataFrames

### Series

* https://deepstat.tistory.com/69 (01. constructors)(in English)
* https://deepstat.tistory.com/70 (01. constructors)(in Korean)

Let's get started by loading the `DataFrames` package.

In [1]:
using DataFrames

## Constructors and conversion

### Constructors

In this section, you'll see many ways to create a `DataFrame` using the `DataFrame()` constructor.

First, we could create an empty DataFrame,

In [2]:
DataFrame() # empty DataFrame

Or we could call the constructor using keyword arguments to add columns to the `DataFrame`.

In [3]:
DataFrame(A=1:3, B=rand(3), C=randstring.([3,3,3]))

UndefVarError: UndefVarError: randstring not defined

The call fails because `randstring` is not available by default; it lives in the `Random` standard library, so we load that first.

In [4]:
using Random
DataFrame(A=1:3, B=rand(3), C=Random.randstring.([3,3,3]))

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64 | String |
| 1 | 1 | 0.1272 | 2yx |
| 2 | 2 | 0.884208 | EFK |
| 3 | 3 | 0.561774 | tMQ |
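
The dot in `randstring.([3,3,3])` broadcasts the function over the vector of lengths. A minimal sketch using only the `Random` standard library (the variable names `lens` and `strs` are illustrative, not part of the tutorial):

```julia
using Random

# Broadcasting randstring over a vector of lengths yields one string per entry.
lens = [3, 3, 3]
strs = randstring.(lens)

println(strs)           # three random strings; values differ on every run
println(length.(strs))  # each string has the requested length
```

The result is a `Vector{String}`, which is why column `C` above has element type `String`.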


We can create a `DataFrame` from a dictionary, in which case keys from the dictionary will be sorted to create the `DataFrame` columns.

In [5]:
x = Dict("A" => [1,2], "B" => [true, false], "C" => ['a', 'b'])
DataFrame(x)

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Bool | Char |
| 1 | 1 | true | 'a' |
| 2 | 2 | false | 'b' |
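
To see the sorting of keys, here is a small sketch that builds the dictionary with its keys deliberately out of order. The names `d` and `col_names` are illustrative, and `string.` is applied to `names` only so the check does not depend on whether the installed DataFrames version returns symbols or strings.

```julia
using DataFrames

# Keys are written out of order on purpose.
d = Dict("C" => ['a', 'b'], "A" => [1, 2], "B" => [true, false])
df = DataFrame(d)

# Normalize column names to strings for a version-independent comparison.
col_names = string.(names(df))
println(col_names)
```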


Rather than explicitly creating a dictionary first, as above, we could pass `DataFrame` arguments with the syntax of dictionary key-value pairs. 

Note that in this case, we use symbols to denote the column names and arguments are not sorted. For example, `:A`, the symbol, produces `A`, the name of the first column here:

In [6]:
DataFrame(:A => [1,2], :B => [true, false], :C => ['a', 'b'])

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Bool | Char |
| 1 | 1 | true | 'a' |
| 2 | 2 | false | 'b' |
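
By contrast with the dictionary constructor, the pair syntax keeps the order in which the arguments are written. A short sketch with the pairs deliberately out of alphabetical order (variable names are illustrative):

```julia
using DataFrames

# Pairs are passed in the order C, A, B and the columns keep that order.
df = DataFrame(:C => ['a', 'b'], :A => [1, 2], :B => [true, false])

# Normalize column names to strings for a version-independent comparison.
col_names = string.(names(df))
println(col_names)
```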


Here we create a `DataFrame` from a vector of vectors, and each vector becomes a column.

In [7]:
DataFrame([rand(3) for i in 1:3])

| Row | x1 | x2 | x3 |
|---|---|---|---|
| | Float64 | Float64 | Float64 |
| 1 | 0.61836 | 0.956117 | 0.772712 |
| 2 | 0.339327 | 0.646796 | 0.0373572 |
| 3 | 0.383056 | 0.015768 | 0.568402 |


A `Vector` of atoms cannot be passed to the constructor directly to create a single-row `DataFrame`; this throws an error.

In [8]:
DataFrame(rand(3))

ArgumentError: ArgumentError: unable to construct DataFrame from Array{Float64,1}

Instead, if you have a vector of atoms, pass it transposed; this effectively gives the constructor a two-dimensional array, which is supported.

In [9]:
DataFrame(transpose([1, 2, 3]))

| Row | x1 | x2 | x3 |
|---|---|---|---|
| | Int64 | Int64 | Int64 |
| 1 | 1 | 2 | 3 |
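
The reason the transposed vector is accepted can be checked in plain Julia, without DataFrames: transposing a length-3 vector gives a 1×3 matrix-like object. The names `v` and `row` below are illustrative.

```julia
v = [1, 2, 3]
row = transpose(v)

println(size(v))                 # a one-dimensional vector
println(size(row))               # one row, three columns
println(row isa AbstractMatrix)  # the transpose behaves as a matrix
```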


Pass a second argument to give the columns names.

In [10]:
DataFrame([1:3, 4:6, 7:9], [:A, :B, :C])

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Int64 | Int64 |
| 1 | 1 | 4 | 7 |
| 2 | 2 | 5 | 8 |
| 3 | 3 | 6 | 9 |


Here we create a `DataFrame` from a matrix,

In [11]:
DataFrame(rand(3,4))

| Row | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.549398 | 0.536636 | 0.564859 | 0.406246 |
| 2 | 0.354452 | 0.233334 | 0.585349 | 0.239128 |
| 3 | 0.369394 | 0.38183 | 0.54614 | 0.672149 |


and here we do the same but also pass column names.

In [12]:
DataFrame(rand(3,4), Symbol.('a':'d'))

| Row | a | b | c | d |
|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.301636 | 0.575057 | 0.246931 | 0.0463783 |
| 2 | 0.483671 | 0.0564641 | 0.579429 | 0.231766 |
| 3 | 0.545517 | 0.214624 | 0.765022 | 0.8457 |
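
The column names in the cell above come from broadcasting `Symbol` over a character range. A quick sketch in plain Julia of what that expression evaluates to (the name `col_names` is illustrative):

```julia
# Broadcasting Symbol over the characters 'a' through 'd'
# produces one symbol per character.
col_names = Symbol.('a':'d')

println(col_names)
println(length(col_names))  # must match the number of matrix columns (4)
```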


We can also construct an uninitialized DataFrame.

Here we pass column types, names, and the number of rows. Columns `:A` and `:B` hold whatever happened to be in memory, while column `:C` is filled with `missing` because `Any >: Missing`.

In [13]:
DataFrame([Int, Float64, Any], [:A, :B, :C], 1)

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64 | Any |
| 1 | 140482106970544 | 6.94074e-310 | missing |


Here we create a `DataFrame` in the same way, but column `:C` is `#undef` because `String` cannot hold `missing`.

In [14]:
DataFrame([Int, Float64, String], [:A, :B, :C], 1)

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64 | String |
| 1 | 140479790317568 | 6.94068e-310 | #undef |
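
The difference between the garbage integers, the `#undef` string, and the `missing` in the `Any` column can be explored with plain Julia arrays. This is only a sketch of the underlying array behavior, not of how DataFrames allocates its columns internally; the names `ints` and `strs` are illustrative.

```julia
# Uninitialized storage for a bits type always contains *some* value,
# while a String slot starts out unassigned (shown as #undef).
ints = Vector{Int}(undef, 1)
strs = Vector{String}(undef, 1)

println(isassigned(ints, 1))  # the Int slot holds arbitrary memory contents
println(isassigned(strs, 1))  # the String slot is not assigned

# Only element types that admit Missing can be filled with `missing`.
println(Any >: Missing)
println(String >: Missing)
```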


To initialize a `DataFrame` with column names but no rows, use:

In [15]:
DataFrame([Int, Float64, String], [:A, :B, :C], 0) 

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64 | String |


This syntax gives us a quick way to create a homogeneous `DataFrame`.

In [16]:
DataFrame(Int, 3, 5)

| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 140482222175792 | 140482170689280 | 140480869092464 | 140482222232064 | 140482222232064 |
| 2 | 140482074905216 | 140482221606992 | 140480876470832 | 140482074902080 | 140482074902080 |
| 3 | 140482074905088 | 140482074902080 | 140482077542384 | 140480867023472 | 140480867023488 |


This example is similar, but has non-homogeneous columns.

In [17]:
DataFrame([Int, Float64], 4)

| Row | x1 | x2 |
|---|---|---|
| | Int64 | Float64 |
| 1 | 140482074898440 | 6.94074e-310 |
| 2 | 140480869273680 | 6.94068e-310 |
| 3 | 140480867045792 | 6.94068e-310 |
| 4 | 140479790317568 | 6.94068e-310 |


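Finally, we can create a `DataFrame` by copying an existing one.

Note that `copy` creates a shallow copy. In the cell below, `x` is still the dictionary from In [5], so `y` is a `DataFrame` built from it and `z` is a copy of the dictionary itself.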

In [18]:
y = DataFrame(x)
z = copy(x)
(x === y), (x === z), isequal(x, z)

(false, false, true)
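
What "shallow" means can be illustrated with plain Julia containers. The sketch below uses a vector of vectors as a stand-in for a table's columns; it is an analogy under that assumption, not a statement about DataFrames internals, and the names `cols`, `shallow`, and `deep` are illustrative.

```julia
cols = [[1, 2], [3, 4]]   # stand-in for two columns
shallow = copy(cols)      # new outer container, same inner vectors
deep = deepcopy(cols)     # everything duplicated

shallow[1][1] = 100       # mutate an inner vector through the shallow copy

println(shallow === cols)        # false: the outer containers are distinct
println(shallow[1] === cols[1])  # true: the inner vectors are shared
println(cols[1][1])              # the change is visible in the original
println(deep[1][1])              # the deep copy is unaffected
```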

### Conversion to a matrix

Let's start by creating a `DataFrame` with two rows and two columns.

In [19]:
x = DataFrame(x=1:2, y=["A", "B"])

| Row | x | y |
|---|---|---|
| | Int64 | String |
| 1 | 1 | A |
| 2 | 2 | B |


We can create a matrix by passing this `DataFrame` to `Matrix`.

In [20]:
Matrix(x)

2×2 Array{Any,2}:
 1  "A"
 2  "B"

This also works if the `DataFrame` contains `missing` values:

In [21]:
x = DataFrame(x=1:2, y=[missing,"B"])

| Row | x | y |
|---|---|---|
| | Int64 | String⍰ |
| 1 | 1 | missing |
| 2 | 2 | B |


In [22]:
Matrix(x)

2×2 Array{Any,2}:
 1  missing
 2  "B"    

In the two previous examples, Julia created matrices with elements of type `Any` because the columns mix integers and strings. The element type is inferred from the columns: passing a `DataFrame` of integers to `Matrix` creates a 2D `Array` of `Int64`s:

In [23]:
x = DataFrame(x=1:2, y=3:4)

| Row | x | y |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 3 |
| 2 | 2 | 4 |


In [24]:
Matrix(x)

2×2 Array{Int64,2}:
 1  3
 2  4

In this next example, Julia correctly identifies that `Union` is needed to express the type of the resulting `Matrix` (which contains `missing`s).

In [25]:
x = DataFrame(x=1:2, y=[missing,4])

| Row | x | y |
|---|---|---|
| | Int64 | Int64⍰ |
| 1 | 1 | missing |
| 2 | 2 | 4 |


In [26]:
Matrix(x)

2×2 Array{Union{Missing, Int64},2}:
 1   missing
 2  4       
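
The three outcomes above can be compared side by side by asking each matrix for its element type. A compact sketch that rebuilds the same small data frames (the names `m_int`, `m_any`, and `m_mis` are illustrative):

```julia
using DataFrames

m_int = Matrix(DataFrame(x=1:2, y=3:4))           # all-integer columns
m_any = Matrix(DataFrame(x=1:2, y=["A", "B"]))    # integers mixed with strings
m_mis = Matrix(DataFrame(x=1:2, y=[missing, 4]))  # integers with a missing value

println(eltype(m_int))
println(eltype(m_any))
println(eltype(m_mis))
```

`Int` is used in the checks rather than `Int64` so the comparison also holds on 32-bit systems.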

Note that we can't force a conversion of `missing` values to `Int`s!

In [27]:
Matrix{Int}(x)

ErrorException: cannot convert a DataFrame containing missing values to array (found for column y)
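
If an integer matrix is really needed, the `missing` values have to be dealt with before the conversion. One possible approach, sketched here, is to replace them with a placeholder using `coalesce`; the replacement value `0` is an arbitrary assumption for illustration, and whether it is appropriate depends on the data.

```julia
using DataFrames

y = [missing, 4]
filled = coalesce.(y, 0)  # replace each missing with 0 (arbitrary placeholder)

# With no missing values left, the typed conversion succeeds.
m = Matrix{Int}(DataFrame(x=1:2, y=filled))
println(m)
```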

### Handling of duplicate column names

We cannot use duplicate column names in a `DataFrame`. Passing the `makeunique` keyword argument deduplicates them.

In [28]:
df = DataFrame(:a=>1, :a=>2, :a_1=>3; makeunique=true)

| Row | a | a_2 | a_1 |
|---|---|---|---|
| | Int64 | Int64 | Int64 |
| 1 | 1 | 2 | 3 |
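
The second `a` is renamed to `a_2` rather than `a_1` because `a_1` is already taken by the third column. The sketch below contrasts this with a call where `a_1` is free; it uses one-element vectors instead of scalars, and `string.` on `names` so the check does not depend on the installed DataFrames version. The names `df1`, `df2`, `names1`, and `names2` are illustrative.

```julia
using DataFrames

# No column is called a_1, so the duplicate can take that name.
df1 = DataFrame(:a => [1], :a => [2]; makeunique=true)

# a_1 already exists, so the duplicate moves on to the next free suffix.
df2 = DataFrame(:a => [1], :a => [2], :a_1 => [3]; makeunique=true)

names1 = string.(names(df1))
names2 = string.(names(df2))
println(names1)
println(names2)
```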


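Without `makeunique`, duplicates are not allowed. In the version used here, the call only emits a warning (shown truncated below) and still deduplicates the names.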

In [29]:
df = DataFrame(:a=>1, :a=>2, :a_1=>3)

│   caller = ip:0x0
└ @ Core :-1


| Row | a | a_2 | a_1 |
|---|---|---|---|
| | Int64 | Int64 | Int64 |
| 1 | 1 | 2 | 3 |


A constructor that is passed column names as keyword arguments is a corner case.
You cannot pass `makeunique` to allow duplicates here, because Julia itself rejects a repeated keyword argument.

In [30]:
df = DataFrame(a=1, a=2, makeunique=true)

ErrorException: syntax: keyword argument "a" repeated in call to "DataFrame"
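
The error comes from Julia rather than from DataFrames, which can be checked with any function that accepts keyword arguments. In this sketch `f` is a hypothetical stand-in for the constructor, and the call is evaluated from a string so that the surrounding script itself still loads.

```julia
# Hypothetical function that accepts arbitrary keyword arguments.
f(; kwargs...) = kwargs

# Evaluate the offending call at run time and record whether it fails.
threw = try
    eval(Meta.parse("f(a=1, a=2)"))
    false
catch err
    println(err)  # the message reports the repeated keyword argument
    true
end

println(threw)
```

Since `f` would happily accept any distinct keywords, the failure can only be due to the repetition, so `makeunique` never gets a chance to act.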