# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), October 4, 2018**

In [1]:
using DataFrames # load package

## Manipulating columns of DataFrame

### Renaming columns

Let's start with a `DataFrame` of `Bool`s that has default column names. The entries are uninitialized, so the values shown below are arbitrary.

In [2]:
x = DataFrame(Bool, 3, 4)

| Row | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| | Bool | Bool | Bool | Bool |
| 1 | false | false | false | false |
| 2 | true | true | false | true |
| 3 | true | false | false | true |


With `rename`, we create a new `DataFrame`; here we rename the column `:x1` to `:A`. (`rename` also accepts collections of `Pair`s.)

In [3]:
rename(x, :x1 => :A)

| Row | A | x2 | x3 | x4 |
|---|---|---|---|---|
| | Bool | Bool | Bool | Bool |
| 1 | false | false | false | false |
| 2 | true | true | false | true |
| 3 | true | false | false | true |


With `rename!` we do an in-place transformation.

This time we apply a function to every column name.

In [4]:
rename!(c -> Symbol(string(c)^2), x)

| Row | x1x1 | x2x2 | x3x3 | x4x4 |
|---|---|---|---|---|
| | Bool | Bool | Bool | Bool |
| 1 | false | false | false | false |
| 2 | true | true | false | true |
| 3 | true | false | false | true |
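
To see what the renaming function does on its own, here is a small sketch in plain Julia that applies it to a vector of `Symbol`s standing in for the column names. The names `double_name`, `old_names` and `new_names` are introduced only for this illustration.

```julia
# `^` applied to a String repeats it, so "x1"^2 is "x1x1".
double_name(c) = Symbol(string(c)^2)

# Stand-in for the column names of `x` before the call to `rename!`.
old_names = [:x1, :x2, :x3, :x4]

# Apply the function to every name, one at a time.
new_names = map(double_name, old_names)
```

The resulting names match the header of the table above.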


We can also change the name of a particular column by position, without knowing its original name.

Here we change the name of the third column, creating a new `DataFrame`.

In [5]:
rename(x, names(x)[3] => :third)

| Row | x1x1 | x2x2 | third | x4x4 |
|---|---|---|---|---|
| | Bool | Bool | Bool | Bool |
| 1 | false | false | false | false |
| 2 | true | true | false | true |
| 3 | true | false | false | true |


With `names!`, we can change the names of all variables.

In [6]:
names!(x, [:a, :b, :c, :d])

| Row | a | b | c | d |
|---|---|---|---|---|
| | Bool | Bool | Bool | Bool |
| 1 | false | false | false | false |
| 2 | true | true | false | true |
| 3 | true | false | false | true |


We get an error when we try to provide duplicate names,

In [7]:
names!(x, fill(:a, 4))

ArgumentError: Duplicate variable names: Symbol[:a, :a, :a, :a].
Pass makeunique=true to make them unique using a suffix automatically.

unless we pass `makeunique=true`, which resolves duplicates in the passed names by adding a numeric suffix.

In [8]:
names!(x, fill(:a, 4), makeunique=true)

| Row | a | a_1 | a_2 | a_3 |
|---|---|---|---|---|
| | Bool | Bool | Bool | Bool |
| 1 | false | false | false | false |
| 2 | true | true | false | true |
| 3 | true | false | false | true |
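
The suffixing seen above can be sketched in plain Julia. This is only an illustration of the idea, not the package's implementation: `make_unique_names` is a hypothetical helper, and its behaviour in edge cases (for example when a suffixed name already exists later in the list) may differ from what DataFrames does.

```julia
# Hypothetical helper: give each repeated name the first free `_k` suffix.
function make_unique_names(names::Vector{Symbol})
    seen = Set{Symbol}()
    result = Symbol[]
    for name in names
        candidate = name
        suffix = 0
        # Keep incrementing the suffix until the candidate is unused.
        while candidate in seen
            suffix += 1
            candidate = Symbol(string(name, "_", suffix))
        end
        push!(seen, candidate)
        push!(result, candidate)
    end
    return result
end

unique_names = make_unique_names(fill(:a, 4))
```

For four copies of `:a` this sketch produces the same names as the table above.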


### Reordering columns

We can reorder the `names(x)` vector as needed, creating a new `DataFrame`.

In [9]:
using Random
Random.seed!(1234)
x[shuffle(names(x))]

| Row | a_1 | a_3 | a_2 | a |
|---|---|---|---|---|
| | Bool | Bool | Bool | Bool |
| 1 | false | false | false | false |
| 2 | true | true | false | true |
| 3 | false | true | false | true |


`permutecols!` can also be used to do this in place:

In [10]:
permutecols!(x, 4:-1:1); x

| Row | a_3 | a_2 | a_1 | a |
|---|---|---|---|---|
| | Bool | Bool | Bool | Bool |
| 1 | false | false | false | false |
| 2 | true | false | true | true |
| 3 | true | false | false | true |
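
As a minimal sketch of what the permutation `4:-1:1` means, the same indices can be applied to a plain vector of names. The variable names here are illustrative only; `col_names` stands in for the column names of `x` before the call.

```julia
# Column names before `permutecols!` was called.
col_names = [:a, :a_1, :a_2, :a_3]

# `4:-1:1` counts down from 4 to 1, i.e. it reverses the order.
perm = collect(4:-1:1)

# A permutation must contain every index exactly once.
valid = isperm(perm)

# Indexing with the permutation gives the new column order.
reordered = col_names[perm]
```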


### Merging/adding columns

In [11]:
x = DataFrame([(i,j) for i in 1:3, j in 1:4])

| Row | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| | Tuple… | Tuple… | Tuple… | Tuple… |
| 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) |
| 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) |
| 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) |


With `hcat` we can merge two `DataFrame`s. The `[x y]` syntax is also supported, but only when the `DataFrame`s have unique column names.

In [12]:
hcat(x, x, makeunique=true)

| Row | x1 | x2 | x3 | x4 | x1_1 | x2_1 | x3_1 | x4_1 |
|---|---|---|---|---|---|---|---|---|
| | Tuple… | Tuple… | Tuple… | Tuple… | Tuple… | Tuple… | Tuple… | Tuple… |
| 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) | (1, 1) | (1, 2) | (1, 3) | (1, 4) |
| 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) | (2, 1) | (2, 2) | (2, 3) | (2, 4) |
| 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) | (3, 1) | (3, 2) | (3, 3) | (3, 4) |
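
In Julia, the bracket form `[a b]` is syntax for a call to `hcat(a, b)`, which leaves no place for a keyword argument such as `makeunique=true`; this is presumably why the bracket form only works when the names are already unique. The sketch below shows the equivalence on plain vectors rather than on `DataFrame`s.

```julia
a = [1, 2, 3]
b = [4, 5, 6]

# Horizontal concatenation written with brackets ...
bracketed = [a b]

# ... and the same operation written as an explicit function call.
explicit = hcat(a, b)
```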


We can also use `hcat` to add a new column; a default name `:x1` will be used for this column, so `makeunique=true` is needed.

In [13]:
y = hcat(x, [1,2,3], makeunique=true)

| Row | x1 | x2 | x3 | x4 | x1_1 |
|---|---|---|---|---|---|
| | Tuple… | Tuple… | Tuple… | Tuple… | Int64 |
| 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) | 1 |
| 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) | 2 |
| 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) | 3 |


You can also prepend a vector with `hcat`.

In [14]:
hcat([1,2,3], x, makeunique=true)

| Row | x1 | x1_1 | x2 | x3 | x4 |
|---|---|---|---|---|---|
| | Int64 | Tuple… | Tuple… | Tuple… | Tuple… |
| 1 | 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) |
| 2 | 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) |
| 3 | 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) |


Alternatively, you could append a vector with the following syntax. It is a bit more verbose but cleaner, because the new column gets an explicit name.

In [15]:
y = [x DataFrame(A=[1,2,3])]

| Row | x1 | x2 | x3 | x4 | A |
|---|---|---|---|---|---|
| | Tuple… | Tuple… | Tuple… | Tuple… | Int64 |
| 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) | 1 |
| 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) | 2 |
| 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) | 3 |


Here we do the same but add column `:A` to the front.

In [16]:
y = [DataFrame(A=[1,2,3]) x]

| Row | A | x1 | x2 | x3 | x4 |
|---|---|---|---|---|---|
| | Int64 | Tuple… | Tuple… | Tuple… | Tuple… |
| 1 | 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) |
| 2 | 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) |
| 3 | 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) |


A column can also be added in the middle. Here a brute-force method is used and a new DataFrame is created.

In [17]:
using BenchmarkTools
@btime [$x[1:2] DataFrame(A=[1,2,3]) $x[3:4]]

  13.062 μs (120 allocations: 9.36 KiB)


| Row | x1 | x2 | A | x3 | x4 |
|---|---|---|---|---|---|
| | Tuple… | Tuple… | Int64 | Tuple… | Tuple… |
| 1 | (1, 1) | (1, 2) | 1 | (1, 3) | (1, 4) |
| 2 | (2, 1) | (2, 2) | 2 | (2, 3) | (2, 4) |
| 3 | (3, 1) | (3, 2) | 3 | (3, 3) | (3, 4) |


We could also do this with the specialized in-place method `insert!`. Let's add `:newcol` to the `DataFrame` `y`.

In [18]:
insert!(y, 2, [1,2,3], :newcol)

| Row | A | newcol | x1 | x2 | x3 | x4 |
|---|---|---|---|---|---|---|
| | Int64 | Int64 | Tuple… | Tuple… | Tuple… | Tuple… |
| 1 | 1 | 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) |
| 2 | 2 | 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) |
| 3 | 3 | 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) |


If you want to insert the same column name several times, `makeunique=true` is needed as usual.

In [19]:
insert!(y, 2, [1,2,3], :newcol, makeunique=true)

| Row | A | newcol_1 | newcol | x1 | x2 | x3 | x4 |
|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Tuple… | Tuple… | Tuple… | Tuple… |
| 1 | 1 | 1 | 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) |
| 2 | 2 | 2 | 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) |
| 3 | 3 | 3 | 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) |


Using `@btime`, we can see how much faster it is to insert a column with `insert!` than with `hcat`: about 1.6 μs below, including the `copy`, versus about 13 μs for the `hcat` version above.

In [20]:
@btime insert!(copy($x), 3, [1,2,3], :A)

  1.586 μs (17 allocations: 1.38 KiB)


| Row | x1 | x2 | A | x3 | x4 |
|---|---|---|---|---|---|
| | Tuple… | Tuple… | Int64 | Tuple… | Tuple… |
| 1 | (1, 1) | (1, 2) | 1 | (1, 3) | (1, 4) |
| 2 | (2, 1) | (2, 2) | 2 | (2, 3) | (2, 4) |
| 3 | (3, 1) | (3, 2) | 3 | (3, 3) | (3, 4) |


Let's use `insert!` to append a column in place,

In [21]:
insert!(x, ncol(x)+1, [1,2,3], :A)

| Row | x1 | x2 | x3 | x4 | A |
|---|---|---|---|---|---|
| | Tuple… | Tuple… | Tuple… | Tuple… | Int64 |
| 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) | 1 |
| 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) | 2 |
| 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) | 3 |


and to prepend a column in place.

In [22]:
insert!(x, 1, [1,2,3], :B)

| Row | B | x1 | x2 | x3 | x4 | A |
|---|---|---|---|---|---|---|
| | Int64 | Tuple… | Tuple… | Tuple… | Tuple… | Int64 |
| 1 | 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) | 1 |
| 2 | 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) | 2 |
| 3 | 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) | 3 |
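
The position argument used in the `insert!` calls above follows the same convention as Base Julia's `insert!` on a vector. The sketch below, which uses a plain vector of names as a stand-in for the columns (the variable `col_names` is illustrative), shows the three cases: middle, end (`length + 1`), and front.

```julia
# Stand-in for the column names of a four-column DataFrame.
col_names = [:x1, :x2, :x3, :x4]

# Insert in the middle: the new item ends up at position 3.
insert!(col_names, 3, :A)

# Position `length + 1` appends at the end.
insert!(col_names, length(col_names) + 1, :B)

# Position 1 prepends at the front.
insert!(col_names, 1, :C)
```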


With `merge!`, we merge the second `DataFrame` into the first, overwriting columns that have duplicate names.

In [23]:
df1 = DataFrame(x=1:3, y=4:6)

| Row | x | y |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 4 |
| 2 | 2 | 5 |
| 3 | 3 | 6 |


In [24]:
df2 = DataFrame(x='a':'c', z = 'd':'f', new=11:13)

| Row | x | z | new |
|---|---|---|---|
| | Char | Char | Int64 |
| 1 | 'a' | 'd' | 11 |
| 2 | 'b' | 'e' | 12 |
| 3 | 'c' | 'f' | 13 |


In [25]:
merge!(df1, df2)

| Row | x | y | z | new |
|---|---|---|---|---|
| | Char | Int64 | Char | Int64 |
| 1 | 'a' | 4 | 'd' | 11 |
| 2 | 'b' | 5 | 'e' | 12 |
| 3 | 'c' | 6 | 'f' | 13 |
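
The overwrite behaviour seen above resembles Base Julia's `merge!` on dictionaries, where values from the second collection replace values stored under the same key. The sketch below uses a `Dict` of column vectors as a rough stand-in for a `DataFrame`; unlike a `DataFrame`, a `Dict` does not keep its keys in order, and the names `cols1` and `cols2` are illustrative only.

```julia
# Columns of the first table, keyed by name.
cols1 = Dict{Symbol,Any}(:x => [1, 2, 3], :y => [4, 5, 6])

# Columns of the second table; `:x` clashes with the first.
cols2 = Dict{Symbol,Any}(:x => ['a', 'b', 'c'],
                         :z => ['d', 'e', 'f'],
                         :new => [11, 12, 13])

# Merge in place: `:x` is overwritten, `:z` and `:new` are added.
merge!(cols1, cols2)
```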


For comparison, `hcat` merges two `DataFrame`s but renames the duplicate names instead of overwriting.

In [26]:
df1 = DataFrame(x=1:3, y=4:6)
df2 = DataFrame(x='a':'c', z = 'd':'f', new=11:13)
hcat(df1, df2, makeunique=true)

| Row | x | y | x_1 | z | new |
|---|---|---|---|---|---|
| | Int64 | Int64 | Char | Char | Int64 |
| 1 | 1 | 4 | 'a' | 'd' | 11 |
| 2 | 2 | 5 | 'b' | 'e' | 12 |
| 3 | 3 | 6 | 'c' | 'f' | 13 |


### Subsetting/removing columns

Let's create a new `DataFrame` `x` and show a few ways to create `DataFrame`s with a subset of `x`'s columns.

In [27]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5])

| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| | Tuple… | Tuple… | Tuple… | Tuple… | Tuple… |
| 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) | (1, 5) |
| 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) | (2, 5) |
| 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) | (3, 5) |


First we could do this by index:

In [28]:
x[[1,2,4,5]]

| Row | x1 | x2 | x4 | x5 |
|---|---|---|---|---|
| | Tuple… | Tuple… | Tuple… | Tuple… |
| 1 | (1, 1) | (1, 2) | (1, 4) | (1, 5) |
| 2 | (2, 1) | (2, 2) | (2, 4) | (2, 5) |
| 3 | (3, 1) | (3, 2) | (3, 4) | (3, 5) |


or by column name:

In [29]:
x[[:x1, :x4]]

| Row | x1 | x4 |
|---|---|---|
| | Tuple… | Tuple… |
| 1 | (1, 1) | (1, 4) |
| 2 | (2, 1) | (2, 4) |
| 3 | (3, 1) | (3, 4) |


We can also choose to keep or exclude columns by `Bool` (we need a vector whose length is the number of columns in the original `DataFrame`).

In [30]:
x[[true, false, true, false, true]]

| Row | x1 | x3 | x5 |
|---|---|---|---|
| | Tuple… | Tuple… | Tuple… |
| 1 | (1, 1) | (1, 3) | (1, 5) |
| 2 | (2, 1) | (2, 3) | (2, 5) |
| 3 | (3, 1) | (3, 3) | (3, 5) |
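
Selecting by a `Bool` vector works like logical indexing on an ordinary Julia array. As a small sketch, here the same mask is applied to a plain vector of names standing in for the columns of `x` (the variable names are illustrative only).

```julia
# Stand-in for the column names of `x`.
col_names = [:x1, :x2, :x3, :x4, :x5]

# One entry per column: `true` keeps the column, `false` drops it.
mask = [true, false, true, false, true]

# Logical indexing keeps the positions where the mask is `true`.
kept = col_names[mask]

# Negating the mask element-wise selects the complementary columns.
dropped = col_names[.!mask]
```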


Here we create a single column `DataFrame`,

In [31]:
x[[:x1]]

| Row | x1 |
|---|---|
| | Tuple… |
| 1 | (1, 1) |
| 2 | (2, 1) |
| 3 | (3, 1) |


and here we access the vector contained in column `:x1`.

In [32]:
x[:x1]

3-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (2, 1)
 (3, 1)

In [33]:
x.x1 # the same

3-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (2, 1)
 (3, 1)

We could grab the same vector by column number

In [34]:
x[1]

3-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (2, 1)
 (3, 1)

and remove everything from a `DataFrame` with `empty!`.

In [35]:
empty!(y)

Here we create a copy of `x` and delete the 3rd column from the copy with `delete!`.

In [36]:
z = copy(x)
delete!(z, 3)

| Row | x1 | x2 | x4 | x5 |
|---|---|---|---|---|
| | Tuple… | Tuple… | Tuple… | Tuple… |
| 1 | (1, 1) | (1, 2) | (1, 4) | (1, 5) |
| 2 | (2, 1) | (2, 2) | (2, 4) | (2, 5) |
| 3 | (3, 1) | (3, 2) | (3, 4) | (3, 5) |


In [37]:
x

| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| | Tuple… | Tuple… | Tuple… | Tuple… | Tuple… |
| 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) | (1, 5) |
| 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) | (2, 5) |
| 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) | (3, 5) |


### Modify column by name

In [38]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5])

| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| | Tuple… | Tuple… | Tuple… | Tuple… | Tuple… |
| 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) | (1, 5) |
| 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) | (2, 5) |
| 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) | (3, 5) |


With the following syntax, the existing column is modified without performing any copying.

In [39]:
x[:x1] = x[:x2]
x

| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| | Tuple… | Tuple… | Tuple… | Tuple… | Tuple… |
| 1 | (1, 2) | (1, 2) | (1, 3) | (1, 4) | (1, 5) |
| 2 | (2, 2) | (2, 2) | (2, 3) | (2, 4) | (2, 5) |
| 3 | (3, 2) | (3, 2) | (3, 3) | (3, 4) | (3, 5) |
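
One consequence of assigning without copying is worth spelling out: if the same vector is stored under two names, a later change made through one name is visible through the other. The sketch below illustrates this with a plain `Dict` of vectors as a stand-in for a `DataFrame`; it is an analogy under that assumption, and the variable names are illustrative only.

```julia
# Two "columns" stored under different keys.
cols = Dict(:x1 => [1, 2, 3], :x2 => [4, 5, 6])

# Store the same array under a second key; nothing is copied.
cols[:x1] = cols[:x2]
same_object = cols[:x1] === cols[:x2]

# Mutating through one key is visible through the other.
cols[:x2][1] = 100
seen_via_x1 = cols[:x1][1]

# An explicit copy makes the two columns independent again.
cols[:x1] = copy(cols[:x2])
independent = cols[:x1] !== cols[:x2]
```

When independent columns are wanted, an explicit `copy` on the right-hand side avoids the sharing.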


We can also use the following syntax to add a new column at the end of a `DataFrame`.

In [40]:
x[:A] = [1,2,3]
x

| Row | x1 | x2 | x3 | x4 | x5 | A |
|---|---|---|---|---|---|---|
| | Tuple… | Tuple… | Tuple… | Tuple… | Tuple… | Int64 |
| 1 | (1, 2) | (1, 2) | (1, 3) | (1, 4) | (1, 5) | 1 |
| 2 | (2, 2) | (2, 2) | (2, 3) | (2, 4) | (2, 5) | 2 |
| 3 | (3, 2) | (3, 2) | (3, 3) | (3, 4) | (3, 5) | 3 |


A new column, with an automatically generated name, is also added to our `DataFrame` with the following syntax (7 is equal to `ncol(x)+1`).

In [41]:
x[7] = 11:13
x

| Row | x1 | x2 | x3 | x4 | x5 | A | x7 |
|---|---|---|---|---|---|---|---|
| | Tuple… | Tuple… | Tuple… | Tuple… | Tuple… | Int64 | Int64 |
| 1 | (1, 2) | (1, 2) | (1, 3) | (1, 4) | (1, 5) | 1 | 11 |
| 2 | (2, 2) | (2, 2) | (2, 3) | (2, 4) | (2, 5) | 2 | 12 |
| 3 | (3, 2) | (3, 2) | (3, 3) | (3, 4) | (3, 5) | 3 | 13 |


### Find column name

In [42]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5])

| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| | Tuple… | Tuple… | Tuple… | Tuple… | Tuple… |
| 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) | (1, 5) |
| 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) | (2, 5) |
| 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) | (3, 5) |


We can check if a column with a given name exists via

In [43]:
haskey(x, :x1)

true

and determine its index via

In [44]:
findfirst(isequal(:x2), names(x))

2
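
Since `names(x)` is an ordinary vector, the lookup above is plain Base Julia. As a small sketch on a stand-in vector (the variable names are illustrative only), note that `findfirst` returns `nothing` rather than throwing when the name is absent, so the result should be checked before it is used as an index.

```julia
# Stand-in for `names(x)`.
col_names = [:x1, :x2, :x3, :x4, :x5]

# Membership test, analogous to asking whether a column exists.
exists = :x1 in col_names

# Position of the first matching name.
idx_x2 = findfirst(isequal(:x2), col_names)

# A name that is not present yields `nothing`.
idx_missing = findfirst(isequal(:nope), col_names)
```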