# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), August 16, 2019**

Let's get started by loading the `DataFrames` package.

In [1]:
using Pkg
Pkg.activate(".")

"d:\\Dev\\Julia\\DataFrames_Tutorial\\Project.toml"

In [2]:
using DataFrames, Random

## Constructors and conversion

### Constructors

In this section, you'll see many ways to create a `DataFrame` using the `DataFrame()` constructor.

First, we could create an empty `DataFrame`:

In [3]:
DataFrame()

Or we could call the constructor using keyword arguments to add columns to the `DataFrame`.

In [4]:
DataFrame(A=1:3, B=rand(3), C=randstring.([3,3,3]))

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64 | String |
| 1 | 1 | 0.305455 | yex |
| 2 | 2 | 0.316977 | dpX |
| 3 | 3 | 0.651892 | hAb |


We can create a `DataFrame` from a dictionary, in which case keys from the dictionary will be sorted to create the `DataFrame` columns.

In [5]:
x = Dict("A" => [1,2], "B" => [true, false], "C" => ['a', 'b'])
DataFrame(x)

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Bool | Char |
| 1 | 1 | true | 'a' |
| 2 | 2 | false | 'b' |
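To make the sorting of keys visible, here is a small sketch in which the dictionary keys are deliberately written out of alphabetical order. The variable names `d`, `df`, and `column_order` are illustrative only, and `Symbol.(names(df))` is used on the assumption that `names` may return either symbols or strings depending on the DataFrames version.

```julia
using DataFrames

# Keys are written out of alphabetical order on purpose.
d = Dict("C" => ['a', 'b'], "A" => [1, 2], "B" => [true, false])
df = DataFrame(d)

# Symbol.() gives the same result whether names() returns Symbols or Strings.
column_order = Symbol.(names(df))
println(column_order)  # [:A, :B, :C]
```

The columns come out as `A`, `B`, `C` regardless of the order in which the keys were written.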


Rather than explicitly creating a dictionary first, as above, we could pass `DataFrame` arguments with the syntax of dictionary key-value pairs. 

Note that in this case, we use symbols to denote the column names and arguments are not sorted. For example, `:A`, the symbol, produces `A`, the name of the first column here:

In [6]:
DataFrame(:A => [1,2], :B => [true, false], :C => ['a', 'b'])

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Bool | Char |
| 1 | 1 | true | 'a' |
| 2 | 2 | false | 'b' |
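Because the example above happens to list the pairs alphabetically, it does not by itself show that the order is preserved. A minimal sketch with the pairs in a different order (variable names are illustrative) makes the contrast with the dictionary case clearer:

```julia
using DataFrames

# Pairs are passed in the order C, A, B.
df = DataFrame(:C => ['a', 'b'], :A => [1, 2], :B => [true, false])

# Symbol.() gives the same result whether names() returns Symbols or Strings.
column_order = Symbol.(names(df))
println(column_order)  # [:C, :A, :B]
```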


Here we create a `DataFrame` from a vector of vectors, and each vector becomes a column.

In [7]:
DataFrame([rand(3) for i in 1:3])

| Row | x1 | x2 | x3 |
|---|---|---|---|
| | Float64 | Float64 | Float64 |
| 1 | 0.551445 | 0.947995 | 0.724924 |
| 2 | 0.0822741 | 0.859683 | 0.895534 |
| 3 | 0.407038 | 0.611067 | 0.924225 |


Passing a vector of scalars to the `DataFrame` constructor is not allowed.

In [8]:
DataFrame(rand(3))

ArgumentError: 'Array{Float64,1}' iterates 'Float64' values, which don't satisfy the Tables.jl Row-iterator interface

Instead, if you have a vector of scalars, transpose it with `permutedims`. This effectively passes a two-dimensional array (a matrix with one row) to the constructor, which is supported.

In [9]:
DataFrame(permutedims([1, 2, 3]))

| Row | x1 | x2 | x3 |
|---|---|---|---|
| | Int64 | Int64 | Int64 |
| 1 | 1 | 2 | 3 |
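The reason this works can be checked with Base Julia alone: `permutedims` turns a 3-element vector into a 1×3 matrix, which is why the result above has one row and three columns. A minimal sketch (the names `v` and `row` are illustrative):

```julia
v = [1, 2, 3]
row = permutedims(v)

println(size(v))                  # (3,)
println(size(row))                # (1, 3)
println(row isa AbstractMatrix)   # true
```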


You can also pass a vector of `NamedTuple`s:

In [10]:
v = [(a=1, b=2), (a=3, b=4)]
DataFrame(v)

| Row | a | b |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 2 |
| 2 | 3 | 4 |


When you pass a vector of vectors, supply a second argument to name the columns.

In [11]:
DataFrame([1:3, 4:6, 7:9], [:A, :B, :C])

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Int64 | Int64 |
| 1 | 1 | 4 | 7 |
| 2 | 2 | 5 | 8 |
| 3 | 3 | 6 | 9 |


Alternatively you can pass a `NamedTuple` of vectors:

In [12]:
n = (a=1:3, b=11:13)
DataFrame(n)

| Row | a | b |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 11 |
| 2 | 2 | 12 |
| 3 | 3 | 13 |


Here we create a `DataFrame` from a matrix,

In [13]:
DataFrame(rand(3,4))

| Row | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.398882 | 0.533697 | 0.837649 | 0.10422 |
| 2 | 0.506718 | 0.665512 | 0.0604845 | 0.666706 |
| 3 | 0.484289 | 0.960548 | 0.81243 | 0.43562 |


and here we do the same but also pass column names.

In [14]:
DataFrame(rand(3,4), Symbol.('a':'d'))

| Row | a | b | c | d |
|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.266858 | 0.13173 | 0.748664 | 0.79571 |
| 2 | 0.265655 | 0.404239 | 0.586315 | 0.241154 |
| 3 | 0.880879 | 0.397492 | 0.483709 | 0.947983 |


We can also construct an uninitialized `DataFrame`.

Here we pass column types, names, and the number of rows. Columns `:A` and `:B` contain whatever happened to be in memory, while column `:C` is filled with `missing` because `Any >: Missing`.

In [15]:
DataFrame([Int, Float64, Any], [:A, :B, :C], 1)

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64 | Any |
| 1 | 95945280 | 4.9786e-316 | missing |
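The supertype test mentioned above can be evaluated directly in Base Julia. The sketch below only checks the type relations for the element types used in these examples; it makes no claim about how the constructor is implemented internally.

```julia
# Can a column of this element type hold `missing`?
println(Any >: Missing)                  # true
println(Int >: Missing)                  # false
println(Float64 >: Missing)              # false
println(String >: Missing)               # false
println(Union{Missing, Int} >: Missing)  # true
```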


Here we create a `DataFrame` where `:C` is `#undef`: a `String` column cannot hold `missing`, so its entries are left unassigned.

In [16]:
DataFrame([Int, Float64, String], [:A, :B, :C], 1)

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64 | String |
| 1 | 126244176 | 6.24972e-316 | #undef |


To initialize a `DataFrame` with column names but no rows, use:

In [17]:
DataFrame([Int, Float64, String], [:A, :B, :C])

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64 | String |


or

In [18]:
DataFrame(A=Int[], B=Float64[], C=String[])

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64 | String |
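A typed but empty `DataFrame` like this is typically filled row by row afterwards. The following is a minimal sketch of that usage, assuming the standard `push!` methods for tuples and `NamedTuple`s; the row values are made up for illustration.

```julia
using DataFrames

df = DataFrame(A=Int[], B=Float64[], C=String[])

# A row can be appended as a tuple in column order ...
push!(df, (1, 2.5, "first"))
# ... or as a NamedTuple whose fields match the column names.
push!(df, (A=2, B=3.5, C="second"))

println(size(df))  # (2, 3)
```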


Finally, we can create a `DataFrame` by copying an existing `DataFrame`.

Note that `copy` also copies the column vectors, so `x.a === y.a` is `false` below.

In [19]:
x = DataFrame(a=1:2, b='a':'b')
y = copy(x)
(x === y), isequal(x, y), (x.a == y.a), (x.a === y.a)

(false, true, true, false)

Calling `DataFrame` on a `DataFrame` object works like `copy`.

In [20]:
x = DataFrame(a=1:2, b='a':'b')
y = DataFrame(x)
(x === y), isequal(x, y), (x.a == y.a), (x.a === y.a)

(false, true, true, false)

You can avoid copying the columns of a data frame by passing the `copycols=false` keyword argument or by using the `DataFrame!` constructor.

In [21]:
x = DataFrame(a=1:2, b='a':'b')
y = DataFrame(x, copycols=false)
(x === y), isequal(x, y), (x.a == y.a), (x.a === y.a)

(false, true, true, true)

In [22]:
x = DataFrame(a=1:2, b='a':'b')
y = DataFrame!(x)
(x === y), isequal(x, y), (x.a == y.a), (x.a === y.a)

(false, true, true, true)

The same rule applies to other constructors:

In [23]:
a = [1, 2, 3]
df1 = DataFrame(a=a)
df2 = DataFrame(a=a, copycols=false)
df1.a === a, df2.a === a

(false, true)
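The practical consequence of `copycols=false` is aliasing: a later change to the source vector shows up in the data frame. A small sketch of this, with illustrative variable names:

```julia
using DataFrames

a = [1, 2, 3]
df_copy  = DataFrame(a=a)                   # column is a fresh copy of `a`
df_alias = DataFrame(a=a, copycols=false)   # column is `a` itself

a[1] = 100  # mutate the source vector after construction

println((df_copy.a[1], df_alias.a[1]))  # (1, 100)
```

Only the aliased data frame sees the change, so `copycols=false` is best reserved for cases where the source vector will not be modified elsewhere.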

You can create a similar uninitialized `DataFrame` based on an original one:

In [24]:
similar(x)

| Row | a | b |
|---|---|---|
| | Int64 | Char |
| 1 | 72880016 | '\x06\x78\xc9\x70' |
| 2 | 118764768 | '\0' |


In [25]:
similar(x, 0) # number of rows in a new DataFrame passed as a second argument

| Row | a | b |
|---|---|---|
| | Int64 | Char |


You can also create a new `DataFrame` from a `SubDataFrame` or a `DataFrameRow` (both are discussed in detail later in the tutorial):

In [26]:
sdf = view(x, [1,1], :)

| Row | a | b |
|---|---|---|
| | Int64 | Char |
| 1 | 1 | 'a' |
| 2 | 1 | 'a' |


In [27]:
typeof(sdf)

SubDataFrame{DataFrame,DataFrames.Index,Array{Int64,1}}

In [28]:
DataFrame(sdf)

| Row | a | b |
|---|---|---|
| | Int64 | Char |
| 1 | 1 | 'a' |
| 2 | 1 | 'a' |


In [29]:
dfr = x[1, :]

| Row | a | b |
|---|---|---|
| | Int64 | Char |
| 1 | 1 | 'a' |


In [30]:
DataFrame(dfr)

| Row | a | b |
|---|---|---|
| | Int64 | Char |
| 1 | 1 | 'a' |


### Conversion to a matrix

Let's start by creating a `DataFrame` with two rows and two columns.

In [31]:
x = DataFrame(x=1:2, y=["A", "B"])

| Row | x | y |
|---|---|---|
| | Int64 | String |
| 1 | 1 | A |
| 2 | 2 | B |


We can create a matrix by passing this `DataFrame` to `Matrix`.

In [32]:
Matrix(x)

2×2 Array{Any,2}:
 1  "A"
 2  "B"

This would work even if the `DataFrame` had some `missing`s:

In [33]:
x = DataFrame(x=1:2, y=[missing,"B"])

| Row | x | y |
|---|---|---|
| | Int64 | String⍰ |
| 1 | 1 | missing |
| 2 | 2 | B |


In [34]:
Matrix(x)

2×2 Array{Any,2}:
 1  missing
 2  "B"    

In the two previous matrix examples, Julia created matrices with elements of type `Any`. We can see more clearly that the type of matrix is inferred when we pass, for example, a `DataFrame` of integers to `Matrix`, creating a 2D `Array` of `Int64`s:

In [35]:
x = DataFrame(x=1:2, y=3:4)

| Row | x | y |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 3 |
| 2 | 2 | 4 |


In [36]:
Matrix(x)

2×2 Array{Int64,2}:
 1  3
 2  4

In this next example, Julia correctly identifies that a `Union` is needed to express the element type of the resulting `Matrix` (which contains a `missing`).

In [37]:
x = DataFrame(x=1:2, y=[missing,4])

| Row | x | y |
|---|---|---|
| | Int64 | Int64⍰ |
| 1 | 1 | missing |
| 2 | 2 | 4 |


In [38]:
Matrix(x)

2×2 Array{Union{Missing, Int64},2}:
 1   missing
 2  4       
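The element types seen in the outputs above can also be queried programmatically with `eltype`. This sketch rebuilds the three kinds of data frame used in this section under illustrative names (`mixed`, `ints`, `with_gap`) and compares the results:

```julia
using DataFrames

mixed    = DataFrame(x=1:2, y=["A", "B"])    # integers and strings
ints     = DataFrame(x=1:2, y=3:4)           # integers only
with_gap = DataFrame(x=1:2, y=[missing, 4])  # integers and a missing

println(eltype(Matrix(mixed)))     # Any
println(eltype(Matrix(ints)))      # Int64 on a 64-bit system
println(eltype(Matrix(with_gap)))  # Union{Missing, Int64} on a 64-bit system
```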

Note that we can't force a conversion of `missing` values to `Int`s!

In [39]:
Matrix{Int}(x)

ArgumentError: cannot convert a DataFrame containing missing values to Matrix{Int64} (found for column y)
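If an integer matrix is really what you need, one possible workaround (not the only one) is to remove the rows containing `missing` first. The sketch below assumes `dropmissing` is acceptable for your data, since it discards the incomplete rows; the names `df`, `complete`, and `m` are illustrative.

```julia
using DataFrames

df = DataFrame(x=1:2, y=[missing, 4])

# Keep only the rows without any missing value, then convert.
complete = dropmissing(df)
m = Matrix{Int}(complete)

println(m)  # [2 4]
```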

### Conversion to `NamedTuple`-related tabular structures

In [40]:
x = DataFrame(x=1:2, y=["A", "B"])

| Row | x | y |
|---|---|---|
| | Int64 | String |
| 1 | 1 | A |
| 2 | 2 | B |


In [41]:
using Tables

First we convert a `DataFrame` into a `NamedTuple` of vectors:

In [42]:
ct = Tables.columntable(x)

(x = [1, 2], y = ["A", "B"])

Next we convert the same `DataFrame` into a vector of `NamedTuple`s:

In [43]:
rt = Tables.rowtable(x)

2-element Array{NamedTuple{(:x, :y),Tuple{Int64,String}},1}:
 (x = 1, y = "A")
 (x = 2, y = "B")

We can perform the conversions back to a `DataFrame` using a standard constructor call:

In [44]:
DataFrame(ct)

| Row | x | y |
|---|---|---|
| | Int64 | String |
| 1 | 1 | A |
| 2 | 2 | B |


In [45]:
DataFrame(rt)

| Row | x | y |
|---|---|---|
| | Int64 | String |
| 1 | 1 | A |
| 2 | 2 | B |
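The two structures differ in how you reach a single value: a column table is indexed by column first, a row table by row first. The sketch below shows both access patterns and checks that each conversion round-trips back to an equal `DataFrame`. Variable names are illustrative, and `Tables` is assumed to be installed in the active project, as in the cells above.

```julia
using DataFrames, Tables

df = DataFrame(x=1:2, y=["A", "B"])
ct = Tables.columntable(df)  # NamedTuple of vectors
rt = Tables.rowtable(df)     # vector of NamedTuples

println(ct.y[2])  # column first, then row: B
println(rt[2].y)  # row first, then column: B

# Both representations convert back to a data frame equal to the original.
println((isequal(DataFrame(ct), df), isequal(DataFrame(rt), df)))  # (true, true)
```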


### Handling of duplicate column names

Passing `makeunique=true` allows duplicate column names; the duplicates are renamed automatically with a numeric suffix.

In [46]:
df = DataFrame(:a=>1, :a=>2, :a_1=>3; makeunique=true)

| Row | a | a_2 | a_1 |
|---|---|---|---|
| | Int64 | Int64 | Int64 |
| 1 | 1 | 2 | 3 |


Otherwise, duplicates are not allowed.

In [47]:
df = DataFrame(:a=>1, :a=>2, :a_1=>3)

ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.
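In the `makeunique=true` example above, the duplicate became `a_2` because the name `a_1` was already taken by the third column. As a simpler sketch of the same mechanism, with no competing name the first free suffix is `_1` (the column values here are made up for illustration):

```julia
using DataFrames

df = DataFrame(:a => [1, 2], :a => [3, 4]; makeunique=true)

# Symbol.() gives the same result whether names() returns Symbols or Strings.
unique_names = Symbol.(names(df))
println(unique_names)  # [:a, :a_1]
```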

Finally, observe that `nothing` is not printed when displaying a `DataFrame`:

In [48]:
DataFrame(x=[1, nothing], y=[nothing, "a"])

| Row | x | y |
|---|---|---|
| | Union… | Union… |
| 1 | 1 | |
| 2 | | a |
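Although the cells look empty, the values are still there. A short sketch (with illustrative variable names) confirms that the columns have a `Union` element type that includes `Nothing`, and that the blank cells hold `nothing`:

```julia
using DataFrames

df = DataFrame(x=[1, nothing], y=[nothing, "a"])

println(eltype(df.x) == Union{Nothing, Int})     # true
println(eltype(df.y) == Union{Nothing, String})  # true
println(df.x[2] === nothing)                     # true
println(df.y[1] === nothing)                     # true
```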
