# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Dec 2, 2017**

A brief introduction to basic usage of `DataFrames`. Tested under version `0.11`.
I will try to keep it up to date as the package evolves.

In [1]:
using DataFrames # load package

## Manipulating columns of a DataFrame

### Renaming columns

In [2]:
x = DataFrame(Bool, 3, 4)

| Row | x1 | x2 | x3 | x4 |
|-----|----|----|----|----|
| 1 | false | false | false | false |
| 2 | false | false | false | false |
| 3 | false | false | false | false |


In [3]:
rename(x, :x1 => :A) # new data frame, also accepts collections of Pairs

| Row | A | x2 | x3 | x4 |
|-----|---|----|----|----|
| 1 | false | false | false | false |
| 2 | false | false | false | false |
| 3 | false | false | false | false |
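
The comment in the cell above notes that `rename` also accepts a collection of `Pair`s. A minimal sketch of that form, assuming a small hypothetical data frame `df` that is not part of the original notebook:

```julia
using DataFrames

# hypothetical two-column data frame used only for this illustration
df = DataFrame(x1 = [1, 2, 3], x2 = [4, 5, 6])

# rename several columns at once by passing a vector of old => new Pairs;
# rename (without !) returns a new data frame and leaves df untouched
renamed = rename(df, [:x1 => :A, :x2 => :B])

names(renamed) # column names of the new data frame
names(df)      # column names of the original, unchanged
```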


In [4]:
rename!(c -> Symbol(string(c)^2), x) # in-place transformation by applying a function

| Row | x1x1 | x2x2 | x3x3 | x4x4 |
|-----|------|------|------|------|
| 1 | false | false | false | false |
| 2 | false | false | false | false |
| 3 | false | false | false | false |


In [5]:
rename(x, names(x)[3] => :third) # change name of third variable, new data frame

| Row | x1x1 | x2x2 | third | x4x4 |
|-----|------|------|-------|------|
| 1 | false | false | false | false |
| 2 | false | false | false | false |
| 3 | false | false | false | false |


In [6]:
names!(x, [:a, :b, :c, :d]) # change all names of variables

| Row | a | b | c | d |
|-----|---|---|---|---|
| 1 | false | false | false | false |
| 2 | false | false | false | false |
| 3 | false | false | false | false |


In [7]:
names!(x, fill(:a, 4), allow_duplicates=true) # handle duplicates in passed names

| Row | a | a_1 | a_2 | a_3 |
|-----|---|-----|-----|-----|
| 1 | false | false | false | false |
| 2 | false | false | false | false |
| 3 | false | false | false | false |


### Reordering columns

In [8]:
x[shuffle(names(x))] # new DataFrame, reorder names(x) vector as needed

| Row | a_2 | a | a_3 | a_1 |
|-----|-----|---|-----|-----|
| 1 | false | false | false | false |
| 2 | false | false | false | false |
| 3 | false | false | false | false |


### Merging/adding columns

In [9]:
x = DataFrame([(i,j) for i in 1:3, j in 1:4])

| Row | x1 | x2 | x3 | x4 |
|-----|----|----|----|----|
| 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) |
| 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) |
| 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) |


In [10]:
[x x] # merge two data frames, also hcat if you have many columns to merge

| Row | x1 | x2 | x3 | x4 | x1_1 | x2_1 | x3_1 | x4_1 |
|-----|----|----|----|----|------|------|------|------|
| 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) | (1, 1) | (1, 2) | (1, 3) | (1, 4) |
| 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) | (2, 1) | (2, 2) | (2, 3) | (2, 4) |
| 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) | (3, 1) | (3, 2) | (3, 3) | (3, 4) |


In [11]:
y = hcat(x, [1,2,3]) # add a new column without a name

| Row | x1 | x2 | x3 | x4 | x1_1 |
|-----|----|----|----|----|------|
| 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) | 1 |
| 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) | 2 |
| 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) | 3 |


*TODO: describe what will change for `hcat` with a leading vector*

In [12]:
y = hcat(x, DataFrame(A=[1,2,3])) # this is a bit more verbose but cleaner

| Row | x1 | x2 | x3 | x4 | A |
|-----|----|----|----|----|---|
| 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) | 1 |
| 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) | 2 |
| 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) | 3 |


In [13]:
y = [DataFrame(A=[1,2,3]) x] # a way to prepend a vector as the first column of the DataFrame

| Row | A | x1 | x2 | x3 | x4 |
|-----|---|----|----|----|----|
| 1 | 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) |
| 2 | 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) |
| 3 | 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) |


In [14]:
using BenchmarkTools
@btime [$x[1:2] DataFrame(A=[1,2,3]) $x[3:4]] # and in the middle, method 1

  17.261 μs (125 allocations: 9.17 KiB)


| Row | x1 | x2 | A | x3 | x4 |
|-----|----|----|---|----|----|
| 1 | (1, 1) | (1, 2) | 1 | (1, 3) | (1, 4) |
| 2 | (2, 1) | (2, 2) | 2 | (2, 3) | (2, 4) |
| 3 | (3, 1) | (3, 2) | 3 | (3, 3) | (3, 4) |


In [15]:
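# method 2: hcat first, then reorder the columns (more messy)
# note: @btime times only the statement before `;` and the `y` assigned inside it is local,
# so the reordering below is applied to `y` from In [13], which explains the column order shown
@btime y = [$x DataFrame(A=[1,2,3])]; y[names(y)[[1,2,5,3,4]]]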

  9.797 μs (60 allocations: 4.41 KiB)


| Row | A | x1 | x4 | x2 | x3 |
|-----|---|----|----|----|----|
| 1 | 1 | (1, 1) | (1, 4) | (1, 2) | (1, 3) |
| 2 | 2 | (2, 1) | (2, 4) | (2, 2) | (2, 3) |
| 3 | 3 | (3, 1) | (3, 4) | (3, 2) | (3, 3) |


In [16]:
DataFrames.hcat!(x, [1,2,3]) # modify x in place

| Row | x1 | x2 | x3 | x4 | x1_1 |
|-----|----|----|----|----|------|
| 1 | (1, 1) | (1, 2) | (1, 3) | (1, 4) | 1 |
| 2 | (2, 1) | (2, 2) | (2, 3) | (2, 4) | 2 |
| 3 | (3, 1) | (3, 2) | (3, 3) | (3, 4) | 3 |


### Subsetting/removing columns

In [17]:
x[[1,2,4,5]] # by index

| Row | x1 | x2 | x4 | x1_1 |
|-----|----|----|----|------|
| 1 | (1, 1) | (1, 2) | (1, 4) | 1 |
| 2 | (2, 1) | (2, 2) | (2, 4) | 2 |
| 3 | (3, 1) | (3, 2) | (3, 4) | 3 |


In [18]:
x[[:x1, :x4]] # by name

| Row | x1 | x4 |
|-----|----|----|
| 1 | (1, 1) | (1, 4) |
| 2 | (2, 1) | (2, 4) |
| 3 | (3, 1) | (3, 4) |


In [19]:
x[[true, false, true]] # by Bool mask - the mask does not have to match the number of columns

| Row | x1 | x3 |
|-----|----|----|
| 1 | (1, 1) | (1, 3) |
| 2 | (2, 1) | (2, 3) |
| 3 | (3, 1) | (3, 3) |


In [20]:
x[[:x1]] # single-column DataFrame

| Row | x1 |
|-----|----|
| 1 | (1, 1) |
| 2 | (2, 1) |
| 3 | (3, 1) |


In [21]:
x[:x1] # vector contained in column :x1

3-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (2, 1)
 (3, 1)

In [22]:
x[1] # the same by column number

3-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (2, 1)
 (3, 1)

### Modify column by name

In [23]:
x[:x1] = x[:x2] # existing column is modified
x

| Row | x1 | x2 | x3 | x4 | x1_1 |
|-----|----|----|----|----|------|
| 1 | (1, 2) | (1, 2) | (1, 3) | (1, 4) | 1 |
| 2 | (2, 2) | (2, 2) | (2, 3) | (2, 4) | 2 |
| 3 | (3, 2) | (3, 2) | (3, 3) | (3, 4) | 3 |


In [24]:
x[:A] = [1,2,3] # new column - added at the end
x

| Row | x1 | x2 | x3 | x4 | x1_1 | A |
|-----|----|----|----|----|------|---|
| 1 | (1, 2) | (1, 2) | (1, 3) | (1, 4) | 1 | 1 |
| 2 | (2, 2) | (2, 2) | (2, 3) | (2, 4) | 2 | 2 |
| 3 | (3, 2) | (3, 2) | (3, 3) | (3, 4) | 3 | 3 |


### Find column name

In [25]:
:x1 in names(x) # does column exist?

true

In [26]:
findfirst(names(x), :x2) # position of column :x2

2
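
The call above passes the collection first and the value second. A sketch of an alternative that passes a predicate as the first argument to `findfirst`; the data frame `df` is hypothetical, and wrapping each name in `Symbol` is an assumption made so that the comparison works whether `names` returns symbols or strings:

```julia
using DataFrames

# hypothetical data frame used only for this illustration
df = DataFrame(x1 = [1, 2, 3], x2 = [4, 5, 6], x3 = [7, 8, 9])

# find the position of column :x2 by testing each name with a predicate;
# Symbol(n) normalizes the name before comparing it with :x2
idx = findfirst(n -> Symbol(n) == :x2, names(df))

idx # position of :x2 among the columns
```

What `findfirst` returns when no column matches depends on the Julia version, so check for that case before using `idx` to index into the data frame.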