# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), Dec 5, 2017**

A brief introduction to the basic usage of `DataFrames`, tested against `DataFrames` master on 2017-12-05.
I will try to keep it up to date as the package evolves. This tutorial covers only `DataFrames`, `CSV`, `Missings` and `CategoricalArrays`; it does not show any additional packages that can be used with `DataFrames`.

In [1]:
using DataFrames # load package

## Manipulating rows of a DataFrame

### Reordering rows

In [2]:
x = DataFrame(id=1:10, x = rand(10), y = [zeros(5); ones(5)]) # and we hope that x[:x] is not sorted :)

Row,id,x,y
1,1,0.376047,0.0
2,2,0.512854,0.0
3,3,0.183976,0.0
4,4,0.987844,0.0
5,5,0.918253,0.0
6,6,0.261753,1.0
7,7,0.617087,1.0
8,8,0.537077,1.0
9,9,0.0257101,1.0
10,10,0.525669,1.0


In [3]:
sort!(x, cols=:x) # sort x in place

Row,id,x,y
1,9,0.0257101,1.0
2,3,0.183976,0.0
3,6,0.261753,1.0
4,1,0.376047,0.0
5,2,0.512854,0.0
6,10,0.525669,1.0
7,8,0.537077,1.0
8,7,0.617087,1.0
9,5,0.918253,0.0
10,4,0.987844,0.0


In [4]:
y = sort(x, cols=:id) # returns a new DataFrame sorted by id; x itself is unchanged

Row,id,x,y
1,1,0.376047,0.0
2,2,0.512854,0.0
3,3,0.183976,0.0
4,4,0.987844,0.0
5,5,0.918253,0.0
6,6,0.261753,1.0
7,7,0.617087,1.0
8,8,0.537077,1.0
9,9,0.0257101,1.0
10,10,0.525669,1.0


In [5]:
sort(x, cols = (:y, :x), rev=(true, false)) # sort by y descending, then by x ascending

Row,id,x,y
1,9,0.0257101,1.0
2,6,0.261753,1.0
3,10,0.525669,1.0
4,8,0.537077,1.0
5,7,0.617087,1.0
6,3,0.183976,0.0
7,1,0.376047,0.0
8,2,0.512854,0.0
9,5,0.918253,0.0
10,4,0.987844,0.0


In [6]:
sort(x, cols = (order(:y, rev=true), :x)) # the same as above

Row,id,x,y
1,9,0.0257101,1.0
2,6,0.261753,1.0
3,10,0.525669,1.0
4,8,0.537077,1.0
5,7,0.617087,1.0
6,3,0.183976,0.0
7,1,0.376047,0.0
8,2,0.512854,0.0
9,5,0.918253,0.0
10,4,0.987844,0.0
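
As a rough illustration of what the two-key ordering above means, the sketch below reproduces it on plain vectors using only Base Julia (Julia 1.x assumed). The vectors `ycol` and `xcol` are made-up stand-ins for the `y` and `x` columns, not the data from the cells above, and `sortperm` with a tuple-valued `by` is just one way to emulate the behaviour; it is not a claim about how `DataFrames` implements it.

```julia
# Hypothetical stand-ins for the y and x columns (not the tutorial's data)
ycol = [0.0, 0.0, 1.0, 1.0]
xcol = [0.4, 0.1, 0.9, 0.2]

# Tuples compare lexicographically, so negating y gives
# "y descending, then x ascending" with an ordinary ascending sort.
perm = sortperm(1:length(ycol), by = i -> (-ycol[i], xcol[i]))

sorted_y = ycol[perm]   # y values in the new row order
sorted_x = xcol[perm]   # x values in the new row order
```

Within each group of equal `y`, the `x` values come out ascending, which is the pattern visible in the two tables above.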


In [7]:
sort(x, cols = (order(:y, rev=true), order(:x, by=v->rem(v,1)))) # custom ordering: :x is compared by its fractional part; rand gives values in [0,1), so the result is the same as above

Row,id,x,y
1,9,0.0257101,1.0
2,6,0.261753,1.0
3,10,0.525669,1.0
4,8,0.537077,1.0
5,7,0.617087,1.0
6,3,0.183976,0.0
7,1,0.376047,0.0
8,2,0.512854,0.0
9,5,0.918253,0.0
10,4,0.987844,0.0
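
Because every value produced by `rand` lies in [0, 1), `rem(v, 1)` leaves the `x` column unchanged and the table above matches the previous one. A minimal sketch with made-up values outside that range shows where a `by` function actually changes the order. It uses Base `sort` on a plain vector (Julia 1.x assumed), not a data frame.

```julia
v = [2.25, 1.75, 0.5, 3.125]   # hypothetical values, deliberately outside [0, 1)

plain   = sort(v)                        # ordinary ascending order
by_frac = sort(v, by = t -> rem(t, 1))   # compare by fractional part only
```

Here `plain` is ordered by value, while `by_frac` is ordered by the fractional parts 0.125, 0.25, 0.5 and 0.75.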


In [8]:
x[shuffle(1:10), :] # reorder rows (here randomly)

Row,id,x,y
1,10,0.525669,1.0
2,8,0.537077,1.0
3,7,0.617087,1.0
4,3,0.183976,0.0
5,5,0.918253,0.0
6,4,0.987844,0.0
7,1,0.376047,0.0
8,6,0.261753,1.0
9,2,0.512854,0.0
10,9,0.0257101,1.0


In [9]:
sort!(x, cols=:id) # restore the original order
x[[1,10],:] = x[[10,1],:] # swap rows 1 and 10
x

Row,id,x,y
1,10,0.525669,1.0
2,2,0.512854,0.0
3,3,0.183976,0.0
4,4,0.987844,0.0
5,5,0.918253,0.0
6,6,0.261753,1.0
7,7,0.617087,1.0
8,8,0.537077,1.0
9,9,0.0257101,1.0
10,1,0.376047,0.0


In [10]:
x[1,:], x[10,:] = x[10,:], x[1,:] # and swap again
x

Row,id,x,y
1,1,0.376047,0.0
2,2,0.512854,0.0
3,3,0.183976,0.0
4,4,0.987844,0.0
5,5,0.918253,0.0
6,6,0.261753,1.0
7,7,0.617087,1.0
8,8,0.537077,1.0
9,9,0.0257101,1.0
10,10,0.525669,1.0
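
The tuple-style swap above relies on the right-hand side being fully evaluated, as copies, before anything is assigned. The sketch below illustrates that idea on a plain Base `Matrix` (Julia 1.x assumed); the matrices `m` and `w` are hypothetical, and the analogy assumes that row indexing in the `DataFrames` version used here also returns a copy, which the successful swap above suggests.

```julia
m = [1 2; 3 4; 5 6]   # hypothetical 3x2 matrix standing in for a data frame

# m[3, :] and m[1, :] on the right are copies, taken before any assignment happens
m[1, :], m[3, :] = m[3, :], m[1, :]

# With views the right-hand side aliases w, so row 1 is already
# overwritten by the time it is read for the second assignment.
w = [1 2; 3 4; 5 6]
w[1, :], w[3, :] = view(w, 3, :), view(w, 1, :)
```

After this runs, `m` has rows 1 and 3 exchanged, while `w` ends up with the old row 3 in both positions.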


### Merging/adding rows

In [11]:
x = DataFrame(rand(3, 5))

Row,x1,x2,x3,x4,x5
1,0.800235,0.932343,0.631114,0.879989,0.97126
2,0.373127,0.239389,0.554891,0.567843,0.600885
3,0.995317,0.778797,0.904739,0.705626,0.817267


In [12]:
[x; x] # concatenate rows; the data frames must have the same column names; equivalent to vcat(x, x)

Row,x1,x2,x3,x4,x5
1,0.800235,0.932343,0.631114,0.879989,0.97126
2,0.373127,0.239389,0.554891,0.567843,0.600885
3,0.995317,0.778797,0.904739,0.705626,0.817267
4,0.800235,0.932343,0.631114,0.879989,0.97126
5,0.373127,0.239389,0.554891,0.567843,0.600885
6,0.995317,0.778797,0.904739,0.705626,0.817267


In [13]:
append!(x, x) # the same but modifies x

Row,x1,x2,x3,x4,x5
1,0.800235,0.932343,0.631114,0.879989,0.97126
2,0.373127,0.239389,0.554891,0.567843,0.600885
3,0.995317,0.778797,0.904739,0.705626,0.817267
4,0.800235,0.932343,0.631114,0.879989,0.97126
5,0.373127,0.239389,0.554891,0.567843,0.600885
6,0.995317,0.778797,0.904739,0.705626,0.817267


In [14]:
push!(x, 1:5) # add one row at the end of x; the number of values must match the number of columns and each value must be convertible to its column's type
x

Row,x1,x2,x3,x4,x5
1,0.800235,0.932343,0.631114,0.879989,0.97126
2,0.373127,0.239389,0.554891,0.567843,0.600885
3,0.995317,0.778797,0.904739,0.705626,0.817267
4,0.800235,0.932343,0.631114,0.879989,0.97126
5,0.373127,0.239389,0.554891,0.567843,0.600885
6,0.995317,0.778797,0.904739,0.705626,0.817267
7,1.0,2.0,3.0,4.0,5.0
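
The integers `1:5` show up as `1.0` to `5.0` in the new row because they are converted to the element type of each column. A minimal sketch of the same rule on a plain Base vector (Julia 1.x assumed) follows; `col` is a made-up `Float64` vector, and treating a data frame column like an ordinary vector here is an assumption made for illustration.

```julia
col = [0.5, 0.25]   # a Float64 column, similar to x1 above (values are made up)

push!(col, 1)       # the Int 1 is converted and stored as 1.0

# A value that cannot be converted to Float64 is rejected with an error
rejected = try
    push!(col, "one")
    false
catch err
    err isa MethodError
end
```

This is the sense in which a pushed row needs the "correct types": convertible values are accepted, anything else raises an error.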


In [15]:
push!(x, Dict(:x1=> 11, :x2=> 12, :x3=> 13, :x4=> 14, :x5=> 15)) # also works with dictionaries
x

Row,x1,x2,x3,x4,x5
1,0.800235,0.932343,0.631114,0.879989,0.97126
2,0.373127,0.239389,0.554891,0.567843,0.600885
3,0.995317,0.778797,0.904739,0.705626,0.817267
4,0.800235,0.932343,0.631114,0.879989,0.97126
5,0.373127,0.239389,0.554891,0.567843,0.600885
6,0.995317,0.778797,0.904739,0.705626,0.817267
7,1.0,2.0,3.0,4.0,5.0
8,11.0,12.0,13.0,14.0,15.0


### Subsetting/removing rows

In [16]:
x[1:2, :] # by index

Row,x1,x2,x3,x4,x5
1,0.800235,0.932343,0.631114,0.879989,0.97126
2,0.373127,0.239389,0.554891,0.567843,0.600885


In [17]:
view(x, 1:2) # the same rows, but as a view into x rather than a copy

Row,x1,x2,x3,x4,x5
1,0.800235,0.932343,0.631114,0.879989,0.97126
2,0.373127,0.239389,0.554891,0.567843,0.600885


In [18]:
x[repmat([true, false], 4), 1:3] # select rows with a Bool mask; the mask length must equal the number of rows

Row,x1,x2,x3
1,0.800235,0.932343,0.631114
2,0.995317,0.778797,0.904739
3,0.373127,0.239389,0.554891
4,1.0,2.0,3.0


In [19]:
view(x, repmat([true, false], 4), 1:3) # the same selection, as a view

Row,x1,x2,x3
1,0.800235,0.932343,0.631114
2,0.995317,0.778797,0.904739
3,0.373127,0.239389,0.554891
4,1.0,2.0,3.0
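
The difference between indexing and `view`, and the exact-length rule for Bool masks, can be sketched on a plain Base vector (Julia 1.x assumed, where `repeat` takes the place of `repmat`). The vector `v` is hypothetical, and the sketch only mirrors the idea; it does not use the data frame from the cells above.

```julia
v = collect(1:8)                  # hypothetical column with 8 entries
mask = repeat([true, false], 4)   # `repeat` replaces `repmat` on Julia 1.x

picked = v[mask]                  # copy of entries 1, 3, 5, 7
window = view(v, mask)            # view of the same entries, no copy

v[1] = 100                        # change the parent after selecting

# A mask whose length differs from the vector's length is an error
short_mask_fails = try
    v[[true, false]]
    false
catch err
    err isa BoundsError
end
```

The copy in `picked` keeps the old first value, while `window` reflects the change made to `v`.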


In [20]:
deleterows!(x, 7) # delete one row

Row,x1,x2,x3,x4,x5
1,0.800235,0.932343,0.631114,0.879989,0.97126
2,0.373127,0.239389,0.554891,0.567843,0.600885
3,0.995317,0.778797,0.904739,0.705626,0.817267
4,0.800235,0.932343,0.631114,0.879989,0.97126
5,0.373127,0.239389,0.554891,0.567843,0.600885
6,0.995317,0.778797,0.904739,0.705626,0.817267
7,11.0,12.0,13.0,14.0,15.0


In [21]:
deleterows!(x, 6:7) # delete collection of rows

Row,x1,x2,x3,x4,x5
1,0.800235,0.932343,0.631114,0.879989,0.97126
2,0.373127,0.239389,0.554891,0.567843,0.600885
3,0.995317,0.778797,0.904739,0.705626,0.817267
4,0.800235,0.932343,0.631114,0.879989,0.97126
5,0.373127,0.239389,0.554891,0.567843,0.600885


### Deduplicating

In [22]:
x = DataFrame(A=[1,2], B=["x","y"])
append!(x, x)
x[:C] = 1:4
x

Row,A,B,C
1,1,x,1
2,2,y,2
3,1,x,3
4,2,y,4


In [23]:
unique(x, [1,2]) # keep the first row for each distinct combination of values in columns 1 and 2

Row,A,B,C
1,1,x,1
2,2,y,2


In [24]:
unique(x) # compare whole rows; column C makes every row distinct, so nothing is dropped

Row,A,B,C
1,1,x,1
2,2,y,2
3,1,x,3
4,2,y,4


In [25]:
nonunique(x, :A) # true marks a row whose :A value already appeared in an earlier row

4-element Array{Bool,1}:
 false
 false
  true
  true
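
The indicator vector above can be reproduced by hand. The sketch below is a plain Base Julia version (Julia 1.x assumed) that flags every value already seen at an earlier position; `seen_before` is a hypothetical helper written for this illustration and is not part of `DataFrames`.

```julia
# Flag every element that already occurred earlier in the collection
function seen_before(values)
    seen = Set{eltype(values)}()
    flags = Bool[]
    for v in values
        push!(flags, v in seen)   # true only if v appeared before this position
        push!(seen, v)
    end
    return flags
end

A = [1, 2, 1, 2]                # same values as column A in the cells above
flags = seen_before(A)
first_rows = findall(.!flags)   # positions a "keep the first occurrence" rule retains
```

Keeping only the positions in `first_rows` corresponds to what `unique` does when restricted to that single column.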

In [26]:
unique!(x, :B) # drop rows with a repeated :B value, modifying x in place

Row,A,B,C
1,1,x,1
2,2,y,2
