# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Dec 6, 2017**

A brief introduction to basic usage of `DataFrames`. Tested under `DataFrames` master on 2017-12-05.
I will try to keep it up to date as the package evolves.

In [36]:
using DataFrames # load package

## Manipulating columns of DataFrame

### Renaming columns

In [37]:
x = DataFrame(Bool, 3, 4)

Row,x1,x2,x3,x4
1,false,false,false,false
2,false,false,false,false
3,false,false,false,false


In [38]:
rename(x, :x1 => :A) # new data frame, also accepts collections of Pairs

Row,A,x2,x3,x4
1,false,false,false,false
2,false,false,false,false
3,false,false,false,false


In [39]:
rename!(c -> Symbol(string(c)^2), x) # in-place transformation: apply a function to every column name

Row,x1x1,x2x2,x3x3,x4x4
1,false,false,false,false
2,false,false,false,false
3,false,false,false,false


In [40]:
rename(x, names(x)[3] => :third) # change name of third variable, new data frame

Row,x1x1,x2x2,third,x4x4
1,false,false,false,false
2,false,false,false,false
3,false,false,false,false


In [41]:
names!(x, [:a, :b, :c, :d]) # change all names of variables

Row,a,b,c,d
1,false,false,false,false
2,false,false,false,false
3,false,false,false,false


In [42]:
names!(x, fill(:a, 4)) # error - duplicate names

LoadError: ArgumentError: Duplicate variable names: Symbol[:a].
Pass allow_duplicates=true to make them unique using a suffix automatically.

In [43]:
names!(x, fill(:a, 4), allow_duplicates=true) # handle duplicates in passed names

Row,a,a_1,a_2,a_3
1,false,false,false,false
2,false,false,false,false
3,false,false,false,false
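
The difference between `rename` (returns a new data frame) and `rename!` (works in place), both used in this section, can be checked directly. The following is only a minimal sketch on a hypothetical two-column data frame `df` that does not appear in the cells above. The `Symbol.(...)` conversion is an assumption added so that the comparisons hold whether `names` returns symbols, as in the version used here, or strings, as in later DataFrames releases.

```julia
using DataFrames

# hypothetical example data, not taken from the cells above
df = DataFrame(x1=[1, 2, 3], x2=[4, 5, 6])

# rename returns a new data frame; df keeps its original names
renamed = rename(df, :x1 => :A)
@assert Symbol.(names(df)) == [:x1, :x2]
@assert Symbol.(names(renamed)) == [:A, :x2]

# rename! changes the names of df itself
rename!(df, :x2 => :B)
@assert Symbol.(names(df)) == [:x1, :B]
```

If the non-mutating form is used and its result is not assigned, the renaming is lost, which is why the cells above display the returned value.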


### Reordering columns

In [44]:
x[shuffle(names(x))] # new DataFrame, reorder names(x) vector as needed

Row,a_2,a,a_1,a_3
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False


### Merging/adding columns

In [45]:
x = DataFrame([(i,j) for i in 1:3, j in 1:4])

Row,x1,x2,x3,x4
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


In [46]:
[x x] # concatenate two data frames horizontally; hcat does the same and is handier when there are many to combine

Row,x1,x2,x3,x4,x1_1,x2_1,x3_1,x4_1
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)","(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)","(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)","(3, 1)","(3, 2)","(3, 3)","(3, 4)"


In [47]:
y = hcat(x, [1,2,3]) # add a new column from a vector; its name is generated automatically

Row,x1,x2,x3,x4,x1_1
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)",1
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)",2
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)",3


In [48]:
hcat([1,2,3], x) # also in front

Row,x1,x1_1,x2,x3,x4
1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


In [49]:
y = hcat(x, DataFrame(A=[1,2,3])) # this is a bit more verbose but cleaner

Row,x1,x2,x3,x4,A
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)",1
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)",2
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)",3


In [50]:
y = [DataFrame(A=[1,2,3]) x] # a way to prepend a named column to the DataFrame

Row,A,x1,x2,x3,x4
1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


In [51]:
using BenchmarkTools
@btime [$x[1:2] DataFrame(A=[1,2,3]) $x[3:4]] # and in the middle, brute-force method

  17.728 μs (125 allocations: 9.17 KiB)


In [52]:
insert!(y, 2, [1,2,3], :newcol) # add :newcol in second position in a data frame, in place

Row,A,newcol,x1,x2,x3,x4
1,1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


In [53]:
@btime insert!(copy($x), 3, [1,2,3], :A) # second way to insert a column in the middle, faster than the brute-force method

  5.676 μs (20 allocations: 1.45 KiB)


Row,x1,x2,A,x3,x4
1,"(1, 1)","(1, 2)",1,"(1, 3)","(1, 4)"
2,"(2, 1)","(2, 2)",2,"(2, 3)","(2, 4)"
3,"(3, 1)","(3, 2)",3,"(3, 3)","(3, 4)"


In [54]:
DataFrames.hcat!(x, [1,2,3]) # modify x in place

Row,x1,x2,x3,x4,x1_1
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)",1
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)",2
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)",3


In [55]:
df1 = DataFrame(x=1:3, y=4:6)
df2 = DataFrame(x='a':'c', z = 'd':'f')
df1, df2, merge!(df1, df2) # merge df2 into df1 in place, overwriting columns with duplicate names

(3×3 DataFrames.DataFrame
│ Row │ x   │ y │ z   │
├─────┼─────┼───┼─────┤
│ 1   │ 'a' │ 4 │ 'd' │
│ 2   │ 'b' │ 5 │ 'e' │
│ 3   │ 'c' │ 6 │ 'f' │, 3×2 DataFrames.DataFrame
│ Row │ x   │ z   │
├─────┼─────┼─────┤
│ 1   │ 'a' │ 'd' │
│ 2   │ 'b' │ 'e' │
│ 3   │ 'c' │ 'f' │, 3×3 DataFrames.DataFrame
│ Row │ x   │ y │ z   │
├─────┼─────┼───┼─────┤
│ 1   │ 'a' │ 4 │ 'd' │
│ 2   │ 'b' │ 5 │ 'e' │
│ 3   │ 'c' │ 6 │ 'f' │)

In [56]:
df1 = DataFrame(x=1:3, y=4:6)
df2 = DataFrame(x='a':'c', z = 'd':'f')
[df1 df2] # compare: concatenate two data frames, renaming duplicate names instead of overwriting them

Row,x,y,x_1,z
1,1,4,'a','d'
2,2,5,'b','e'
3,3,6,'c','f'


### Subsetting/removing columns

In [57]:
x[[1,2,4,5]] # by index

Row,x1,x2,x4,x1_1
1,"(1, 1)","(1, 2)","(1, 4)",1
2,"(2, 1)","(2, 2)","(2, 4)",2
3,"(3, 1)","(3, 2)","(3, 4)",3


In [58]:
x[[:x1, :x4]] # by name

Row,x1,x4
1,"(1, 1)","(1, 4)"
2,"(2, 1)","(2, 4)"
3,"(3, 1)","(3, 4)"


In [59]:
x[[true, false, true]] # by Bool mask - the mask may be shorter than the number of columns

Row,x1,x3
1,"(1, 1)","(1, 3)"
2,"(2, 1)","(2, 3)"
3,"(3, 1)","(3, 3)"


In [60]:
x[[:x1]] # a single-column DataFrame

Row,x1
1,"(1, 1)"
2,"(2, 1)"
3,"(3, 1)"


In [61]:
x[:x1] # a vector contained in column :x1

3-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (2, 1)
 (3, 1)

In [62]:
x[1] # the same by column number

3-element Array{Tuple{Int64,Int64},1}:
 (1, 1)
 (2, 1)
 (3, 1)

In [63]:
y, empty!(y) # remove everything from a data frame

(0×0 DataFrames.DataFrame
, 0×0 DataFrames.DataFrame
)

In [64]:
z = copy(x)
x, delete!(z, 3) # delete 3rd column in z

(3×5 DataFrames.DataFrame
│ Row │ x1     │ x2     │ x3     │ x4     │ x1_1 │
├─────┼────────┼────────┼────────┼────────┼──────┤
│ 1   │ (1, 1) │ (1, 2) │ (1, 3) │ (1, 4) │ 1    │
│ 2   │ (2, 1) │ (2, 2) │ (2, 3) │ (2, 4) │ 2    │
│ 3   │ (3, 1) │ (3, 2) │ (3, 3) │ (3, 4) │ 3    │, 3×4 DataFrames.DataFrame
│ Row │ x1     │ x2     │ x4     │ x1_1 │
├─────┼────────┼────────┼────────┼──────┤
│ 1   │ (1, 1) │ (1, 2) │ (1, 4) │ 1    │
│ 2   │ (2, 1) │ (2, 2) │ (2, 4) │ 2    │
│ 3   │ (3, 1) │ (3, 2) │ (3, 4) │ 3    │)

### Modify column by name

In [65]:
x[:x1] = x[:x2] # existing column is modified
x

Row,x1,x2,x3,x4,x1_1
1,"(1, 2)","(1, 2)","(1, 3)","(1, 4)",1
2,"(2, 2)","(2, 2)","(2, 3)","(2, 4)",2
3,"(3, 2)","(3, 2)","(3, 3)","(3, 4)",3


In [66]:
x[:A] = [1,2,3] # a new column - added at the end
x

Row,x1,x2,x3,x4,x1_1,A
1,"(1, 2)","(1, 2)","(1, 3)","(1, 4)",1,1
2,"(2, 2)","(2, 2)","(2, 3)","(2, 4)",2,2
3,"(3, 2)","(3, 2)","(3, 3)","(3, 4)",3,3


In [67]:
categorical!(x, :A) # make column :A categorical

Row,x1,x2,x3,x4,x1_1,A
1,"(1, 2)","(1, 2)","(1, 3)","(1, 4)",1,1
2,"(2, 2)","(2, 2)","(2, 3)","(2, 4)",2,2
3,"(3, 2)","(3, 2)","(3, 3)","(3, 4)",3,3


### Find column name

In [68]:
:x1 in names(x) # does a column exist?

true

In [69]:
findfirst(names(x), :x2) # what is its column number?

2

In [70]:
DataFrames.index(x)[:x2] # another way to get the same number

2
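
The lookups in this section can also be written with a predicate. The sketch below is an alternative offered under assumptions, not the notebook's own method: it uses a hypothetical data frame `df`, relies only on `findfirst` and `any` taking a function as their first argument, and converts each name with `Symbol` so that it does not matter whether `names` returns symbols or strings.

```julia
using DataFrames

# hypothetical example data, not taken from the cells above
df = DataFrame(x1=[1, 2, 3], x2=[4, 5, 6], x3=[7, 8, 9])

# position of the first column whose name matches :x2
pos = findfirst(n -> Symbol(n) == :x2, names(df))
@assert pos == 2

# existence check written the same way
@assert any(n -> Symbol(n) == :x3, names(df))
@assert !any(n -> Symbol(n) == :missing_col, names(df))
```

The predicate form avoids the unexported `DataFrames.index` helper, at the cost of a linear scan over the column names.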