# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), July 16, 2019**

In [1]:
using DataFrames

## Manipulating columns of a `DataFrame`

### Renaming columns

Let's start with a `DataFrame` of `Bool`s that has default column names.

In [2]:
x = DataFrame(rand(Bool, 3, 4))

Unnamed: 0_level_0,x1,x2,x3,x4
Unnamed: 0_level_1,Bool,Bool,Bool,Bool
1,False,False,False,True
2,True,True,False,False
3,True,True,True,True


With `rename`, we create a new `DataFrame`; here we rename the column `:x1` to `:A`. (`rename` also accepts collections of `Pair`s.)

In [3]:
rename(x, :x1 => :A)

Unnamed: 0_level_0,A,x2,x3,x4
Unnamed: 0_level_1,Bool,Bool,Bool,Bool
1,False,False,False,True
2,True,True,False,False
3,True,True,True,True
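
The note above mentions that `rename` also accepts collections of `Pair`s. A minimal sketch of that form, using a hypothetical three-column frame `df` (built with the keyword constructor so it does not depend on default column names):

```julia
using DataFrames

# Hypothetical three-column frame; names are given explicitly via keywords.
df = DataFrame(x1 = [true, false], x2 = [false, true], x3 = [true, true])

# A vector of Pairs renames several columns in one call and returns a new DataFrame.
renamed = rename(df, [:x1 => :A, :x3 => :C])

# `names` returns Symbols or Strings depending on the DataFrames.jl version,
# so convert to strings before printing.
println(string.(names(renamed)))
println(string.(names(df)))  # the original frame keeps its names
```

Columns that are not mentioned in the collection (here `x2`) keep their names.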


With `rename!` we do an in-place transformation.

This time we apply a function to every column name.

In [4]:
rename!(c -> Symbol(string(c)^2), x)

Unnamed: 0_level_0,x1x1,x2x2,x3x3,x4x4
Unnamed: 0_level_1,Bool,Bool,Bool,Bool
1,False,False,False,True
2,True,True,False,False
3,True,True,True,True


We can also change the name of a particular column without knowing the original.

Here we change the name of the third column, creating a new `DataFrame`.

In [5]:
rename(x, names(x)[3] => :third)

Unnamed: 0_level_0,x1x1,x2x2,third,x4x4
Unnamed: 0_level_1,Bool,Bool,Bool,Bool
1,False,False,False,True
2,True,True,False,False
3,True,True,True,True


With `names!`, we can change the names of all variables.

In [6]:
names!(x, [:a, :b, :c, :d])

Unnamed: 0_level_0,a,b,c,d
Unnamed: 0_level_1,Bool,Bool,Bool,Bool
1,False,False,False,True
2,True,True,False,False
3,True,True,True,True


We get an error when we try to provide duplicate names,

In [7]:
names!(x, fill(:a, 4))

ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.

unless we pass `makeunique=true`, which lets us handle duplicates in the passed names.

In [8]:
names!(x, fill(:a, 4), makeunique=true)

Unnamed: 0_level_0,a,a_1,a_2,a_3
Unnamed: 0_level_1,Bool,Bool,Bool,Bool
1,False,False,False,True
2,True,True,False,False
3,True,True,True,True


### Reordering columns

We can reorder the `names(x)` vector as needed, creating a new `DataFrame`.

In [9]:
using Random
Random.seed!(1234)
x[:, shuffle(names(x))]

Unnamed: 0_level_0,a_1,a_3,a_2,a
Unnamed: 0_level_1,Bool,Bool,Bool,Bool
1,False,True,False,False
2,True,False,False,True
3,True,True,True,True


`permutecols!` can also be used to achieve this in place:

In [10]:
permutecols!(x, 4:-1:1); x

Unnamed: 0_level_0,a_3,a_2,a_1,a
Unnamed: 0_level_1,Bool,Bool,Bool,Bool
1,True,False,False,False
2,False,False,True,True
3,True,True,True,True


### Merging/adding columns

In [11]:
x = DataFrame([(i,j) for i in 1:3, j in 1:4])

Unnamed: 0_level_0,x1,x2,x3,x4
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


With `hcat` we can merge two `DataFrame`s. The `[x y]` syntax is also supported, but only when the `DataFrame`s have unique column names.

In [12]:
hcat(x, x, makeunique=true)

Unnamed: 0_level_0,x1,x2,x3,x4,x1_1,x2_1,x3_1,x4_1
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)","(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)","(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)","(3, 1)","(3, 2)","(3, 3)","(3, 4)"


We can also use `hcat` to add a new column; a default name `:x1` will be used for this column, so `makeunique=true` is needed in our case.

In [13]:
y = hcat(x, [1,2,3], makeunique=true)

Unnamed: 0_level_0,x1,x2,x3,x4,x1_1
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…,Int64
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)",1
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)",2
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)",3


You can also prepend a vector with `hcat`.

In [14]:
hcat([1,2,3], x, makeunique=true)

Unnamed: 0_level_0,x1,x1_1,x2,x3,x4
Unnamed: 0_level_1,Int64,Tuple…,Tuple…,Tuple…,Tuple…
1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


Alternatively, you could append a column with the following syntax, which is a bit more verbose but cleaner.

In [15]:
y = [x DataFrame(A=[1,2,3])]

Unnamed: 0_level_0,x1,x2,x3,x4,A
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…,Int64
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)",1
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)",2
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)",3


Here we do the same but add column `:A` to the front.

In [16]:
y = [DataFrame(A=[1,2,3]) x]

Unnamed: 0_level_0,A,x1,x2,x3,x4
Unnamed: 0_level_1,Int64,Tuple…,Tuple…,Tuple…,Tuple…
1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


A column can also be added in the middle. Here a brute-force method is used and a new `DataFrame` is created.

In [17]:
using BenchmarkTools
@btime [$x[!, 1:2] DataFrame(A=[1,2,3]) $x[!, 3:4]]

  14.600 μs (134 allocations: 10.69 KiB)


Unnamed: 0_level_0,x1,x2,A,x3,x4
Unnamed: 0_level_1,Tuple…,Tuple…,Int64,Tuple…,Tuple…
1,"(1, 1)","(1, 2)",1,"(1, 3)","(1, 4)"
2,"(2, 1)","(2, 2)",2,"(2, 3)","(2, 4)"
3,"(3, 1)","(3, 2)",3,"(3, 3)","(3, 4)"


We could also do this with the specialized in-place method `insertcols!`. Let's add `:newcol` to the `DataFrame` `y`.

In [18]:
insertcols!(y, 2, newcol=[1,2,3])

Unnamed: 0_level_0,A,newcol,x1,x2,x3,x4
Unnamed: 0_level_1,Int64,Int64,Tuple…,Tuple…,Tuple…,Tuple…
1,1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


If you want to insert the same column name several times, `makeunique=true` is needed as usual.

In [19]:
insertcols!(y, 2, newcol=[1,2,3], makeunique=true)

Unnamed: 0_level_0,A,newcol_1,newcol,x1,x2,x3,x4
Unnamed: 0_level_1,Int64,Int64,Int64,Tuple…,Tuple…,Tuple…,Tuple…
1,1,1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,2,2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,3,3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


Using `@btime`, we can see how much faster it is to insert a column with `insertcols!` than with `hcat` (note that here we use `Pair` notation as an example).

In [20]:
@btime insertcols!(copy($x), 3, :A => [1,2,3])

  2.390 μs (30 allocations: 2.61 KiB)


Unnamed: 0_level_0,x1,x2,A,x3,x4
Unnamed: 0_level_1,Tuple…,Tuple…,Int64,Tuple…,Tuple…
1,"(1, 1)","(1, 2)",1,"(1, 3)","(1, 4)"
2,"(2, 1)","(2, 2)",2,"(2, 3)","(2, 4)"
3,"(3, 1)","(3, 2)",3,"(3, 3)","(3, 4)"


Let's use `insertcols!` to append a column in place,

In [21]:
insertcols!(x, ncol(x)+1, A=[1,2,3])

Unnamed: 0_level_0,x1,x2,x3,x4,A
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…,Int64
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)",1
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)",2
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)",3


and to prepend a column in place.

In [22]:
insertcols!(x, 1, B=[1,2,3])

Unnamed: 0_level_0,B,x1,x2,x3,x4,A
Unnamed: 0_level_1,Int64,Tuple…,Tuple…,Tuple…,Tuple…,Int64
1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)",1
2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)",2
3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)",3


### Subsetting/removing columns

Let's create a new `DataFrame` `x` and show a few ways to create DataFrames with a subset of `x`'s columns.

In [23]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5])

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)","(3, 5)"


First we could do this by index:

In [24]:
x[:, [1,2,4,5]] # use ! instead of : for non-copying operation

Unnamed: 0_level_0,x1,x2,x4,x5
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 4)","(3, 5)"


or by column name:

In [25]:
x[:, [:x1, :x4]]

Unnamed: 0_level_0,x1,x4
Unnamed: 0_level_1,Tuple…,Tuple…
1,"(1, 1)","(1, 4)"
2,"(2, 1)","(2, 4)"
3,"(3, 1)","(3, 4)"


We can also choose to keep or exclude columns by `Bool` (we need a vector whose length is the number of columns in the original `DataFrame`).

In [26]:
x[:, [true, false, true, false, true]]

Unnamed: 0_level_0,x1,x3,x5
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 3)","(1, 5)"
2,"(2, 1)","(2, 3)","(2, 5)"
3,"(3, 1)","(3, 3)","(3, 5)"
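
The `Bool` vector does not have to be typed by hand; it can be computed from the columns themselves. A minimal sketch, assuming a hypothetical frame `df` with mixed column types (not part of the original notebook), that keeps only the numeric columns:

```julia
using DataFrames

# Hypothetical frame mixing numeric and text columns.
df = DataFrame(id = [1, 2, 3], name = ["a", "b", "c"], score = [0.5, 0.7, 0.9])

# One Bool per column, in column order: true where the element type is numeric.
mask = [eltype(df[!, n]) <: Number for n in names(df)]

# Index with the mask; `:` in the row position copies the selected columns.
numeric = df[:, mask]
```

Because the mask is built from `names(df)`, its length always matches the number of columns, which is the requirement stated above.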


Here we create a single column `DataFrame`,

In [27]:
x[:, [:x1]]

Unnamed: 0_level_0,x1
Unnamed: 0_level_1,Tuple…
1,"(1, 1)"
2,"(2, 1)"
3,"(3, 1)"


and here we access the vector contained in column `:x1`.

In [28]:
x[!, :x1] # use : instead of ! to copy

3-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (2, 1)
 (3, 1)

In [29]:
x.x1 # the same

3-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (2, 1)
 (3, 1)

We could grab the same vector by column number

In [30]:
x[!, 1]

3-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (2, 1)
 (3, 1)

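Note that getting a single column with `!` returns the stored vector itself, without copying. Indexing with a vector of columns returns a new `DataFrame` instead, so the comparison below is `false` (it compares a vector with a `DataFrame`).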

In [31]:
x[!, 1] === x[!, [1]]

false
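
To see which indexing forms actually copy a column, the identity check can be applied to the column vectors themselves. A minimal sketch on a hypothetical two-column frame `df` (the variable names are illustrative only):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])  # hypothetical data

col = df[!, :a]  # the vector stored in the frame, not a copy

# `!` with a single column always hands back the same object.
same_object = col === df[!, :a]

# `:` with a single column returns a fresh copy.
row_colon_copies = df[:, :a] !== col

# `:` with a vector of columns builds a new DataFrame with copied columns.
new_frame_copies = df[:, [:a]][!, :a] !== col

# `!` with a vector of columns builds a new DataFrame that shares the columns.
new_frame_shares = df[!, [:a]][!, :a] === col

println((same_object, row_colon_copies, new_frame_copies, new_frame_shares))
```

All four flags are expected to be `true`, which matches the comments on `!` and `:` in the cells above.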

You can also use `Regex` and `Not` from InvertedIndices.jl for column selection:

In [32]:
x[!, r"[12]"]

Unnamed: 0_level_0,x1,x2
Unnamed: 0_level_1,Tuple…,Tuple…
1,"(1, 1)","(1, 2)"
2,"(2, 1)","(2, 2)"
3,"(3, 1)","(3, 2)"


In [33]:
x[!, Not(1)]

Unnamed: 0_level_0,x2,x3,x4,x5
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 2)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 2)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 2)","(3, 3)","(3, 4)","(3, 5)"


You can use the `select` and `select!` functions to select a subset of columns from a data frame. `select` creates a new data frame, while `select!` operates in place.

In [34]:
df = copy(x)

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)","(3, 5)"


In [35]:
df2 = select(df, [1, 2])

Unnamed: 0_level_0,x1,x2
Unnamed: 0_level_1,Tuple…,Tuple…
1,"(1, 1)","(1, 2)"
2,"(2, 1)","(2, 2)"
3,"(3, 1)","(3, 2)"


In [36]:
select(df, Not([1, 2]))

Unnamed: 0_level_0,x3,x4,x5
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…
1,"(1, 3)","(1, 4)","(1, 5)"
2,"(2, 3)","(2, 4)","(2, 5)"
3,"(3, 3)","(3, 4)","(3, 5)"


By default, `select` copies columns:

In [37]:
df2[!, 1] === df[!, 1]

false

This can be avoided by using the `copycols=false` keyword argument:

In [38]:
df2 = select(df, [1, 2], copycols=false)

Unnamed: 0_level_0,x1,x2
Unnamed: 0_level_1,Tuple…,Tuple…
1,"(1, 1)","(1, 2)"
2,"(2, 1)","(2, 2)"
3,"(3, 1)","(3, 2)"


In [39]:
df2[!, 1] === df[!, 1]

true

Using `select!` will modify the source data frame:

In [40]:
select!(df, [1,2])

Unnamed: 0_level_0,x1,x2
Unnamed: 0_level_1,Tuple…,Tuple…
1,"(1, 1)","(1, 2)"
2,"(2, 1)","(2, 2)"
3,"(3, 1)","(3, 2)"


In [41]:
df == df2

true

Here we create a copy of `x` and delete the 3rd column from the copy with `select!` and `Not`.

In [42]:
z = copy(x)
select!(z, Not(3))

Unnamed: 0_level_0,x1,x2,x4,x5
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 4)","(3, 5)"


Alternatively, we can achieve the same result with the `select` function, which returns a new `DataFrame`:

In [43]:
select(x, Not(3))

Unnamed: 0_level_0,x1,x2,x4,x5
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 4)","(3, 5)"


`x` stays unchanged:

In [44]:
x

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)","(3, 5)"


Note that we can also create a view of a `DataFrame` when we want a subset of its columns:

In [45]:
@btime x[:, [1,3,5]]

  2.280 μs (28 allocations: 2.33 KiB)


Unnamed: 0_level_0,x1,x3,x5
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 3)","(1, 5)"
2,"(2, 1)","(2, 3)","(2, 5)"
3,"(3, 1)","(3, 3)","(3, 5)"


In [46]:
@btime @view x[:, [1,3,5]]

  332.505 ns (4 allocations: 256 bytes)


Unnamed: 0_level_0,x1,x3,x5
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 3)","(1, 5)"
2,"(2, 1)","(2, 3)","(2, 5)"
3,"(3, 1)","(3, 3)","(3, 5)"


(Creating the `view` is currently slower than it could be; in upcoming releases of the DataFrames.jl package it is expected to become significantly faster.)
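
Besides being cheap to create, a view stays linked to its parent, whereas `x[:, cols]` produces an independent copy. A minimal sketch of that difference, assuming a hypothetical frame `df` (not part of the original notebook):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6], c = [7, 8, 9])  # hypothetical data

v = @view df[:, [:a, :c]]   # a SubDataFrame; no column data is copied
copied = df[:, [:a, :c]]    # an independent DataFrame with copied columns

df[1, :a] = 100             # modify the parent in place

println(v[1, :a])       # the view reads through to the parent
println(copied[1, :a])  # the copy still holds the value it was created with
```

This is worth keeping in mind when choosing between the two: a view is the cheaper option, but later changes to the parent show up in it.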

### Modify column by name

In [47]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5])

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)","(3, 5)"


With the following syntax, the existing column is replaced without performing any copying (this is discouraged, as it creates a column alias).

In [48]:
x[!, :x1] = x[!, :x2]
x

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 2)","(1, 2)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 2)","(2, 2)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 2)","(3, 2)","(3, 3)","(3, 4)","(3, 5)"


This syntax is safer, because `x[:, :x2]` copies the column before it is assigned:

In [49]:
x[!, :x1] = x[:, :x2]

3-element Array{Tuple{Int64,Int64},1}:
 (1, 2)
 (2, 2)
 (3, 2)
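
The reason the alias is discouraged is that both column names then refer to one and the same vector. A minimal sketch of the effect, assuming a hypothetical frame `df` with integer columns (not part of the original notebook):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [10, 20, 30])  # hypothetical data

# Alias: column :a now refers to the very same vector as column :b.
df[!, :a] = df[!, :b]
aliased = df[!, :a] === df[!, :b]

df[1, :b] = 99           # writing through :b ...
seen_through_a = df[1, :a]  # ... is also visible through :a

# Safer: `:` copies the column, so the two are independent again.
df[!, :a] = df[:, :b]
df[2, :b] = -1
independent_value = df[2, :a]  # unaffected by the write to :b

println((aliased, seen_through_a, independent_value))
```

After the copying assignment, a write to one column no longer leaks into the other.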

We can also use the following syntax to add a new column at the end of a `DataFrame`.

In [50]:
x[!, :A] = [1,2,3]
x

Unnamed: 0_level_0,x1,x2,x3,x4,x5,A
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…,Int64
1,"(1, 2)","(1, 2)","(1, 3)","(1, 4)","(1, 5)",1
2,"(2, 2)","(2, 2)","(2, 3)","(2, 4)","(2, 5)",2
3,"(3, 2)","(3, 2)","(3, 3)","(3, 4)","(3, 5)",3


A new column name will be added to our `DataFrame` with the following syntax as well:

In [51]:
x.B = 11:13
x

Unnamed: 0_level_0,x1,x2,x3,x4,x5,A,B
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…,Int64,Int64
1,"(1, 2)","(1, 2)","(1, 3)","(1, 4)","(1, 5)",1,11
2,"(2, 2)","(2, 2)","(2, 3)","(2, 4)","(2, 5)",2,12
3,"(3, 2)","(3, 2)","(3, 3)","(3, 4)","(3, 5)",3,13


### Find column name

In [52]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5])

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)","(3, 5)"


We can check if a column with a given name exists via

In [53]:
hasproperty(x, :x1)

true

and determine its index via

In [54]:
findfirst(isequal(:x2), names(x))

2
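
Note that `findfirst` returns `nothing` when no element matches, so the result should be checked before it is used as an index. A minimal sketch, where `column_position` is a hypothetical helper (not a DataFrames.jl function) that compares names as strings so that it does not depend on whether `names` returns `Symbol`s or `String`s:

```julia
using DataFrames

df = DataFrame(x1 = [1, 2], x2 = [3, 4], x3 = [5, 6])  # hypothetical data

# Hypothetical helper, not part of DataFrames.jl:
# returns the position of the column, or `nothing` if it is absent.
column_position(df, name) = findfirst(isequal(string(name)), string.(names(df)))

found = column_position(df, :x2)      # position of an existing column
absent = column_position(df, :nope)   # no such column

println(found)
println(absent === nothing)
```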