# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Apr 21, 2018**

In [1]:
using DataFrames # load package

## Manipulating columns of DataFrame

### Renaming columns

Let's start with a `DataFrame` of `Bool`s that has default column names.

In [2]:
x = DataFrame(Bool, 3, 4)

Row,x1,x2,x3,x4
1,false,false,false,false
2,false,false,false,false
3,false,false,false,false


With `rename`, we create a new data frame; here we rename the column `:x1` to `:A`. (`rename` also accepts collections of `Pair`s.)

In [3]:
rename(x, :x1 => :A)

Row,A,x2,x3,x4
1,false,false,false,false
2,false,false,false,false
3,false,false,false,false
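
The remark about collections of `Pair`s can be illustrated with a minimal sketch. The data frame `df` and the new names `:A` and `:C` are made up for this example, and the exact behavior may differ between DataFrames releases.

```julia
using DataFrames

# Hypothetical three-column data frame used only for this illustration.
df = DataFrame(x1 = [true, false], x2 = [false, true], x3 = [true, true])

# Pass a vector of old => new pairs to rename several columns at once.
# `rename` returns a new data frame and leaves `df` untouched.
renamed = rename(df, [:x1 => :A, :x3 => :C])
```

Columns that are not mentioned in the vector (here `:x2`) keep their names.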


With `rename!`, we do an in-place transformation.

This time we've applied a function to every column name.

In [4]:
rename!(c -> Symbol(string(c)^2), x)

Row,x1x1,x2x2,x3x3,x4x4
1,false,false,false,false
2,false,false,false,false
3,false,false,false,false


We can also change the name of a particular column without knowing the original.

Here we change the name of the third column, creating a new data frame.

In [5]:
rename(x, names(x)[3] => :third)

Row,x1x1,x2x2,third,x4x4
1,false,false,false,false
2,false,false,false,false
3,false,false,false,false


With `names!`, we can change the names of all variables.

In [6]:
names!(x, [:a, :b, :c, :d])

Row,a,b,c,d
1,false,false,false,false
2,false,false,false,false
3,false,false,false,false


We get an error if we try to provide duplicate names,

In [7]:
names!(x, fill(:a, 4))

LoadError: ArgumentError: Duplicate variable names: Symbol[:a, :a, :a, :a].
Pass makeunique=true to make them unique using a suffix automatically.

unless we pass `makeunique=true`, which makes duplicate names unique by appending a numeric suffix.

In [8]:
names!(x, fill(:a, 4), makeunique=true)

Row,a,a_1,a_2,a_3
1,false,false,false,false
2,false,false,false,false
3,false,false,false,false
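
To make the suffixing idea concrete, here is a simplified sketch in plain Julia. The helper `make_unique_names` is hypothetical and is not the DataFrames implementation; it only reproduces the `a, a_1, a_2, a_3` pattern shown above, and the package may treat more complicated inputs differently.

```julia
# Hypothetical helper: append _1, _2, ... to names that were already used.
function make_unique_names(names::Vector{Symbol})
    seen = Set{Symbol}()
    result = Symbol[]
    for name in names
        candidate = name
        k = 0
        # Keep increasing the suffix until the candidate is unused.
        while candidate in seen
            k += 1
            candidate = Symbol(string(name, "_", k))
        end
        push!(seen, candidate)
        push!(result, candidate)
    end
    return result
end

unique_names = make_unique_names(fill(:a, 4))
```

For four copies of `:a` this yields `[:a, :a_1, :a_2, :a_3]`, matching the column names in the output above.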


### Reordering columns

We can reorder the `names(x)` vector as needed and index with it, creating a new `DataFrame`.

In [9]:
srand(1234)
x[shuffle(names(x))]

Row,a_1,a_3,a_2,a
1,false,false,false,false
2,false,false,false,false
3,false,false,false,false
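
The same idea works with any reordering of the name vector, not only a random shuffle. The sketch below sorts the columns alphabetically; the data frame `df` is made up for this example, and the `df[:, cols]` form is used on the assumption that your DataFrames version accepts it.

```julia
using DataFrames

# Hypothetical data frame whose columns are not in alphabetical order.
df = DataFrame(b = [1, 2], c = [3, 4], a = [5, 6])

# Sort the column names and index with the sorted vector.
# This creates a new data frame; `df` keeps its original column order.
sorted_df = df[:, sort(names(df))]
```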


`permutecols!` will also be introduced in the next release of DataFrames.

### Merging/adding columns

In [10]:
x = DataFrame([(i,j) for i in 1:3, j in 1:4])

Row,x1,x2,x3,x4
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


With `hcat`, we can merge two data frames. The `[x y]` syntax is also supported, but only when the data frames have unique column names.

In [11]:
hcat(x, x, makeunique=true)

Row,x1,x2,x3,x4,x1_1,x2_1,x3_1,x4_1
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)","(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)","(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)","(3, 1)","(3, 2)","(3, 3)","(3, 4)"


We can also use `hcat` to add a new column; a default name `:x1` will be used for this column, so `makeunique=true` is needed.

In [12]:
y = hcat(x, [1,2,3], makeunique=true)

Row,x1,x2,x3,x4,x1_1
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)",1
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)",2
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)",3


You can also prepend a vector with `hcat`.

In [13]:
hcat([1,2,3], x, makeunique=true)

Row,x1,x1_1,x2,x3,x4
1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


Alternatively, you could append a vector with the following syntax. It is a bit more verbose but cleaner, because the new column gets an explicit name.

In [14]:
y = [x DataFrame(A=[1,2,3])]

Row,x1,x2,x3,x4,A
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)",1
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)",2
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)",3


Here we do the same but add column `:A` to the front.

In [15]:
y = [DataFrame(A=[1,2,3]) x]

Row,A,x1,x2,x3,x4
1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


A column can also be added in the middle. Here a brute-force method is used and a new DataFrame is created.

In [16]:
using BenchmarkTools
@btime [$x[1:2] DataFrame(A=[1,2,3]) $x[3:4]]

  20.993 μs (133 allocations: 10.20 KiB)


Row,x1,x2,A,x3,x4
1,"(1, 1)","(1, 2)",1,"(1, 3)","(1, 4)"
2,"(2, 1)","(2, 2)",2,"(2, 3)","(2, 4)"
3,"(3, 1)","(3, 2)",3,"(3, 3)","(3, 4)"


We could also do this with `insert!`, a specialized method that adds `:newcol` to the data frame in place.

In [17]:
insert!(y, 2, [1,2,3], :newcol)

Row,A,newcol,x1,x2,x3,x4
1,1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


If you want to insert the same column name several times, `makeunique=true` is needed, as usual.

In [18]:
insert!(y, 2, [1,2,3], :newcol, makeunique=true)

Row,A,newcol_1,newcol,x1,x2,x3,x4
1,1,1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,2,2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,3,3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


We can see how much faster it is to insert a column with `insert!` using `@btime`.

In [19]:
@btime insert!(copy($x), 3, [1,2,3], :A)

  5.598 μs (20 allocations: 1.45 KiB)


Row,x1,x2,A,x3,x4
1,"(1, 1)","(1, 2)",1,"(1, 3)","(1, 4)"
2,"(2, 1)","(2, 2)",2,"(2, 3)","(2, 4)"
3,"(3, 1)","(3, 2)",3,"(3, 3)","(3, 4)"


Let's use `insert!` to append a column in place,

In [20]:
insert!(x, ncol(x)+1, [1,2,3], :A)

Row,x1,x2,x3,x4,A
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)",1
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)",2
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)",3


and to prepend a column in place.

In [21]:
insert!(x, 1, [1,2,3], :B)

Row,B,x1,x2,x3,x4,A
1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)",1
2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)",2
3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)",3


With `merge!`, let's merge the second `DataFrame` into the first, overwriting duplicate columns.

In [22]:
df1 = DataFrame(x=1:3, y=4:6)
df2 = DataFrame(x='a':'c', z = 'd':'f', new=11:13)
df1, df2, merge!(df1, df2)

(3×4 DataFrames.DataFrame
│ Row │ x   │ y │ z   │ new │
├─────┼─────┼───┼─────┼─────┤
│ 1   │ 'a' │ 4 │ 'd' │ 11  │
│ 2   │ 'b' │ 5 │ 'e' │ 12  │
│ 3   │ 'c' │ 6 │ 'f' │ 13  │, 3×3 DataFrames.DataFrame
│ Row │ x   │ z   │ new │
├─────┼─────┼─────┼─────┤
│ 1   │ 'a' │ 'd' │ 11  │
│ 2   │ 'b' │ 'e' │ 12  │
│ 3   │ 'c' │ 'f' │ 13  │, 3×4 DataFrames.DataFrame
│ Row │ x   │ y │ z   │ new │
├─────┼─────┼───┼─────┼─────┤
│ 1   │ 'a' │ 4 │ 'd' │ 11  │
│ 2   │ 'b' │ 5 │ 'e' │ 12  │
│ 3   │ 'c' │ 6 │ 'f' │ 13  │)

For comparison, `hcat` with `makeunique=true` merges two data frames but renames duplicate names instead of overwriting them.

In [23]:
df1 = DataFrame(x=1:3, y=4:6)
df2 = DataFrame(x='a':'c', z = 'd':'f', new=11:13)
hcat(df1, df2, makeunique=true)

Row,x,y,x_1,z,new
1,1,4,'a','d',11
2,2,5,'b','e',12
3,3,6,'c','f',13


### Subsetting/removing columns

Let's create a new `DataFrame` `x` and show a few ways to create DataFrames with a subset of `x`'s columns.

In [24]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5])

Row,x1,x2,x3,x4,x5
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)","(3, 5)"


First we could do this by index

In [25]:
x[[1,2,4,5]]

Row,x1,x2,x4,x5
1,"(1, 1)","(1, 2)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 4)","(3, 5)"


or by column name.

In [26]:
x[[:x1, :x4]]

Row,x1,x4
1,"(1, 1)","(1, 4)"
2,"(2, 1)","(2, 4)"
3,"(3, 1)","(3, 4)"


We can also choose to keep or exclude columns by `Bool`. (We need a vector whose length is the number of columns in the original data frame.)

In [27]:
x[[true, false, true, false, true]]

Row,x1,x3,x5
1,"(1, 1)","(1, 3)","(1, 5)"
2,"(2, 1)","(2, 3)","(2, 5)"
3,"(3, 1)","(3, 3)","(3, 5)"

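
The `Bool` vector does not have to be written by hand. As a minimal sketch, it can be computed from the column names, for example to keep everything except one column without mutating the original. The data frame `df` and the excluded name `"x2"` are made up for this illustration, and the `df[:, mask]` form is used on the assumption that your DataFrames version accepts it.

```julia
using DataFrames

# Hypothetical data frame used only for this illustration.
df = DataFrame(x1 = [1, 2], x2 = [3, 4], x3 = [5, 6])

# One Bool per column: true keeps the column, false drops it.
# string(n) is used so the comparison does not depend on the name type.
keep = [string(n) != "x2" for n in names(df)]

# Index with the mask to get a new data frame without column x2.
subset = df[:, keep]
```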

Here we create a single-column `DataFrame`,

In [28]:
x[[:x1]]

Row,x1
1,"(1, 1)"
2,"(2, 1)"
3,"(3, 1)"


and here we access the vector contained in column `:x1`.

In [29]:
x[:x1]

3-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (2, 1)
 (3, 1)

We could grab the same vector by column number

In [30]:
x[1]

3-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (2, 1)
 (3, 1)

and remove everything from a data frame with `empty!`.

In [31]:
empty!(y)

Here we create a copy of `x` and delete the 3rd column from the copy with `delete!`.

In [32]:
z = copy(x)
x, delete!(z, 3)

(3×5 DataFrames.DataFrame
│ Row │ x1     │ x2     │ x3     │ x4     │ x5     │
├─────┼────────┼────────┼────────┼────────┼────────┤
│ 1   │ (1, 1) │ (1, 2) │ (1, 3) │ (1, 4) │ (1, 5) │
│ 2   │ (2, 1) │ (2, 2) │ (2, 3) │ (2, 4) │ (2, 5) │
│ 3   │ (3, 1) │ (3, 2) │ (3, 3) │ (3, 4) │ (3, 5) │, 3×4 DataFrames.DataFrame
│ Row │ x1     │ x2     │ x4     │ x5     │
├─────┼────────┼────────┼────────┼────────┤
│ 1   │ (1, 1) │ (1, 2) │ (1, 4) │ (1, 5) │
│ 2   │ (2, 1) │ (2, 2) │ (2, 4) │ (2, 5) │
│ 3   │ (3, 1) │ (3, 2) │ (3, 4) │ (3, 5) │)

### Modify column by name

In [33]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5])

Row,x1,x2,x3,x4,x5
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)","(3, 5)"


With the following syntax, the existing column is replaced without performing any copying, so `:x1` and `:x2` end up referring to the same vector.

In [34]:
x[:x1] = x[:x2]
x

Row,x1,x2,x3,x4,x5
1,"(1, 2)","(1, 2)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 2)","(2, 2)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 2)","(3, 2)","(3, 3)","(3, 4)","(3, 5)"


We can also use the following syntax to add a new column at the end of a data frame.

In [35]:
x[:A] = [1,2,3]
x

Row,x1,x2,x3,x4,x5,A
1,"(1, 2)","(1, 2)","(1, 3)","(1, 4)","(1, 5)",1
2,"(2, 2)","(2, 2)","(2, 3)","(2, 4)","(2, 5)",2
3,"(3, 2)","(3, 2)","(3, 3)","(3, 4)","(3, 5)",3


With the following syntax, a new column is added by index and its name (here `:x7`) is generated automatically.

In [36]:
x[7] = 11:13
x

Row,x1,x2,x3,x4,x5,A,x7
1,"(1, 2)","(1, 2)","(1, 3)","(1, 4)","(1, 5)",1,11
2,"(2, 2)","(2, 2)","(2, 3)","(2, 4)","(2, 5)",2,12
3,"(3, 2)","(3, 2)","(3, 3)","(3, 4)","(3, 5)",3,13


### Find column name

In [37]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5])

Row,x1,x2,x3,x4,x5
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)","(3, 5)"


We can check if a column with a given name exists via

In [38]:
:x1 in names(x) 

true

and determine its index via

In [39]:
findfirst(names(x), :x2)

2
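
As a hedged alternative, the lookup can also be written with a predicate, which avoids depending on the argument order of `findfirst`. The data frame `df` is made up for this example; comparing via `string` is a choice made here so that the code does not depend on the type of the elements returned by `names`. What is returned when no column matches depends on the Julia version, so the sketch only covers the case where the column exists.

```julia
using DataFrames

# Hypothetical data frame used only for this illustration.
df = DataFrame(x1 = [1, 2], x2 = [3, 4], x3 = [5, 6])

# Find the position of the first column whose name is "x2".
idx = findfirst(n -> string(n) == "x2", names(df))
```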