# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), October 5, 2022**

In [1]:
using DataFrames

In [2]:
using Statistics

In [3]:
using Random

In [4]:
Random.seed!(1);

## Manipulating rows of DataFrame

### Selecting rows

In [5]:
df = DataFrame(rand(4, 5), :auto)

Row,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.0491718,0.691857,0.840384,0.198521,0.802561
2,0.119079,0.767518,0.89077,0.00819786,0.661425
3,0.393271,0.087253,0.138227,0.592041,0.347513
4,0.0240943,0.855718,0.347737,0.801055,0.778149


using `:` as row selector will copy columns

In [6]:
df[:, :]

Row,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.0491718,0.691857,0.840384,0.198521,0.802561
2,0.119079,0.767518,0.89077,0.00819786,0.661425
3,0.393271,0.087253,0.138227,0.592041,0.347513
4,0.0240943,0.855718,0.347737,0.801055,0.778149


this is the same as

In [7]:
copy(df)

Row,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.0491718,0.691857,0.840384,0.198521,0.802561
2,0.119079,0.767518,0.89077,0.00819786,0.661425
3,0.393271,0.087253,0.138227,0.592041,0.347513
4,0.0240943,0.855718,0.347737,0.801055,0.778149


you can get a subset of rows (and columns) of a data frame without copying by using `view`, which returns a `SubDataFrame`

In [8]:
sdf = view(df, 1:3, 1:3)

Row,x1,x2,x3
Unnamed: 0_level_1,Float64,Float64,Float64
1,0.0491718,0.691857,0.840384
2,0.119079,0.767518,0.89077
3,0.393271,0.087253,0.138227


the view keeps a reference to its parent and records which rows and columns of the parent it selects

In [9]:
parent(sdf), parentindices(sdf)

(4×5 DataFrame
 Row │ x1         x2        x3        x4          x5
     │ Float64    Float64   Float64   Float64     Float64
─────┼─────────────────────────────────────────────────────
   1 │ 0.0491718  0.691857  0.840384  0.198521    0.802561
   2 │ 0.119079   0.767518  0.89077   0.00819786  0.661425
   3 │ 0.393271   0.087253  0.138227  0.592041    0.347513
   4 │ 0.0240943  0.855718  0.347737  0.801055    0.778149, (1:3, 1:3))
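
To make the no-copy behaviour concrete, here is a minimal sketch on a small hypothetical data frame (`parent_df` and its columns are made up for illustration). It assumes only the indexing rules described above: writing through a view changes the parent, while writing to a copy obtained with regular indexing does not.

```julia
using DataFrames

parent_df = DataFrame(a=[1, 2, 3, 4], b=[5, 6, 7, 8])  # hypothetical data

v = view(parent_df, 2:3, :)   # SubDataFrame, shares memory with parent_df
v[1, :a] = 100                # row 1 of the view is row 2 of the parent

c = parent_df[2:3, :]         # regular indexing with `:` copies the data
c[1, :a] = -1                 # only the copy is modified

parent_df.a                   # [1, 100, 3, 4]
```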

selecting a single row returns a `DataFrameRow` object which is also a view

In [10]:
dfr = df[3, :]

Row,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
3,0.393271,0.087253,0.138227,0.592041,0.347513


In [11]:
parent(dfr), parentindices(dfr), rownumber(dfr)

(4×5 DataFrame
 Row │ x1         x2        x3        x4          x5
     │ Float64    Float64   Float64   Float64     Float64
─────┼─────────────────────────────────────────────────────
   1 │ 0.0491718  0.691857  0.840384  0.198521    0.802561
   2 │ 0.119079   0.767518  0.89077   0.00819786  0.661425
   3 │ 0.393271   0.087253  0.138227  0.592041    0.347513
   4 │ 0.0240943  0.855718  0.347737  0.801055    0.778149, (3, Base.OneTo(5)), 3)

let us add a column to a data frame by broadcasting the assignment of a scalar

In [12]:
df[!, :Z] .= 1

4-element Vector{Int64}:
 1
 1
 1
 1

In [13]:
df

Row,x1,x2,x3,x4,x5,Z
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Int64
1,0.0491718,0.691857,0.840384,0.198521,0.802561,1
2,0.119079,0.767518,0.89077,0.00819786,0.661425,1
3,0.393271,0.087253,0.138227,0.592041,0.347513,1
4,0.0240943,0.855718,0.347737,0.801055,0.778149,1


Earlier we used `:` for column selection in a view (`SubDataFrame` and `DataFrameRow`).
In this case a view will have all columns of the parent after the parent is mutated.

In [14]:
dfr

Row,x1,x2,x3,x4,x5,Z
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Int64
3,0.393271,0.087253,0.138227,0.592041,0.347513,1


In [15]:
parent(dfr), parentindices(dfr), rownumber(dfr)

(4×6 DataFrame
 Row │ x1         x2        x3        x4          x5        Z
     │ Float64    Float64   Float64   Float64     Float64   Int64
─────┼────────────────────────────────────────────────────────────
   1 │ 0.0491718  0.691857  0.840384  0.198521    0.802561      1
   2 │ 0.119079   0.767518  0.89077   0.00819786  0.661425      1
   3 │ 0.393271   0.087253  0.138227  0.592041    0.347513      1
   4 │ 0.0240943  0.855718  0.347737  0.801055    0.778149      1, (3, Base.OneTo(6)), 3)

Note that `parent` and `parentindices` refer to the true source of data for a `DataFrameRow`, while `rownumber` refers to the row number in the direct object that was used to create the `DataFrameRow` (in the example below, row 1 of the view `dfv` is row 3 of `df`).

In [16]:
df = DataFrame(a=1:4)

Row,a
Unnamed: 0_level_1,Int64
1,1
2,2
3,3
4,4


In [17]:
dfv = view(df, [3,2], :)

Row,a
Unnamed: 0_level_1,Int64
1,3
2,2


In [18]:
dfr = dfv[1, :]

Row,a
Unnamed: 0_level_1,Int64
3,3


In [19]:
parent(dfr), parentindices(dfr), rownumber(dfr)

(4×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4, (3, Base.OneTo(1)), 1)

### Reordering rows

We create some random data frame (and hope that `x.x` is not sorted :), which is quite likely with 12 rows)

In [20]:
x = DataFrame(id=1:12, x = rand(12), y = [zeros(6); ones(6)])

Row,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,1,0.830334,0.0
2,2,0.573132,0.0
3,3,0.176625,0.0
4,4,0.114935,0.0
5,5,0.7864,0.0
6,6,0.892598,0.0
7,7,0.452015,1.0
8,8,0.206873,1.0
9,9,0.286582,1.0
10,10,0.918916,1.0


check if a DataFrame or a subset of its columns is sorted

In [21]:
issorted(x), issorted(x, :x)

(true, false)

we sort `x` in place

In [22]:
sort!(x, :x)

Row,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,4,0.114935,0.0
2,3,0.176625,0.0
3,8,0.206873,1.0
4,9,0.286582,1.0
5,7,0.452015,1.0
6,2,0.573132,0.0
7,5,0.7864,0.0
8,12,0.796831,1.0
9,1,0.830334,0.0
10,6,0.892598,0.0


now we create a new DataFrame

In [23]:
y = sort(x, :id)

Row,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,1,0.830334,0.0
2,2,0.573132,0.0
3,3,0.176625,0.0
4,4,0.114935,0.0
5,5,0.7864,0.0
6,6,0.892598,0.0
7,7,0.452015,1.0
8,8,0.206873,1.0
9,9,0.286582,1.0
10,10,0.918916,1.0


here we sort by two columns, first is decreasing, second is increasing

In [24]:
sort(x, [:y, :x], rev=[true, false])

Row,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,8,0.206873,1.0
2,9,0.286582,1.0
3,7,0.452015,1.0
4,12,0.796831,1.0
5,10,0.918916,1.0
6,11,0.991071,1.0
7,4,0.114935,0.0
8,3,0.176625,0.0
9,2,0.573132,0.0
10,5,0.7864,0.0


In [25]:
sort(x, [order(:y, rev=true), :x]) # the same as above

Row,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,8,0.206873,1.0
2,9,0.286582,1.0
3,7,0.452015,1.0
4,12,0.796831,1.0
5,10,0.918916,1.0
6,11,0.991071,1.0
7,4,0.114935,0.0
8,3,0.176625,0.0
9,2,0.573132,0.0
10,5,0.7864,0.0


now we try some more fancy sorting stuff

In [26]:
sort(x, [order(:y, rev=true), order(:x, by=v->-v)])

Row,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,11,0.991071,1.0
2,10,0.918916,1.0
3,12,0.796831,1.0
4,7,0.452015,1.0
5,9,0.286582,1.0
6,8,0.206873,1.0
7,6,0.892598,0.0
8,1,0.830334,0.0
9,5,0.7864,0.0
10,2,0.573132,0.0
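
As a small cross-check of the `order` specifications used above, the following sketch (on a hypothetical four-row data frame `d`) illustrates that, for numeric columns without ties, `by=v->-v` gives the same row order as `rev=true`. This illustrates the equivalence and is not the only way to express it.

```julia
using DataFrames

d = DataFrame(g=[1, 1, 2, 2], x=[0.3, 0.1, 0.4, 0.2])  # hypothetical data

# both columns decreasing, expressed in two ways
s1 = sort(d, [order(:g, rev=true), order(:x, by=v -> -v)])
s2 = sort(d, [:g, :x], rev=true)

s1.x   # [0.4, 0.2, 0.3, 0.1]
```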


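this is how you can reorder rows (here randomly); note that `shuffle(1:10)` permutes only the first 10 of the 12 rows of `x`, so the last two rows are not part of the result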

In [27]:
x[shuffle(1:10), :]

Row,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,12,0.796831,1.0
2,6,0.892598,0.0
3,2,0.573132,0.0
4,5,0.7864,0.0
5,9,0.286582,1.0
6,3,0.176625,0.0
7,1,0.830334,0.0
8,8,0.206873,1.0
9,7,0.452015,1.0
10,4,0.114935,0.0
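
If the goal is to permute every row regardless of how many rows there are, one option is to build the permutation from `nrow`. This is a minimal sketch on a hypothetical data frame `d`; the seed value is arbitrary.

```julia
using DataFrames, Random

Random.seed!(1234)               # arbitrary seed, only for reproducibility
d = DataFrame(id=1:12)           # hypothetical data

perm = shuffle(1:nrow(d))        # a permutation of all row indices
shuffled = d[perm, :]            # new data frame with every row, reordered
```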


or for shuffling just

In [28]:
shuffle(x)

Row,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,10,0.918916,1.0
2,2,0.573132,0.0
3,5,0.7864,0.0
4,8,0.206873,1.0
5,3,0.176625,0.0
6,11,0.991071,1.0
7,7,0.452015,1.0
8,9,0.286582,1.0
9,1,0.830334,0.0
10,6,0.892598,0.0


it is also easy to swap rows using broadcasted assignment

In [29]:
sort!(x, :id)
x[[1,10],:] .= x[[10,1],:]
x

Row,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,10,0.918916,1.0
2,2,0.573132,0.0
3,3,0.176625,0.0
4,4,0.114935,0.0
5,5,0.7864,0.0
6,6,0.892598,0.0
7,7,0.452015,1.0
8,8,0.206873,1.0
9,9,0.286582,1.0
10,1,0.830334,0.0


In [30]:
x

Row,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,10,0.918916,1.0
2,2,0.573132,0.0
3,3,0.176625,0.0
4,4,0.114935,0.0
5,5,0.7864,0.0
6,6,0.892598,0.0
7,7,0.452015,1.0
8,8,0.206873,1.0
9,9,0.286582,1.0
10,1,0.830334,0.0


### Merging/adding rows

In [31]:
x = DataFrame(rand(3, 5), :auto)

Row,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.566724,0.841692,0.301389,0.636364,0.796357
2,0.251823,0.343484,0.0551143,0.518067,0.0991427
3,0.293575,0.688629,0.0234869,0.394761,0.927945


merge by rows; the data frames must have the same column names, and this is the same as calling `vcat`

In [32]:
[x; x]

Row,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.566724,0.841692,0.301389,0.636364,0.796357
2,0.251823,0.343484,0.0551143,0.518067,0.0991427
3,0.293575,0.688629,0.0234869,0.394761,0.927945
4,0.566724,0.841692,0.301389,0.636364,0.796357
5,0.251823,0.343484,0.0551143,0.518067,0.0991427
6,0.293575,0.688629,0.0234869,0.394761,0.927945


you can efficiently `vcat` a vector of `DataFrame`s using `reduce`

In [33]:
reduce(vcat, [x, x, x])

Row,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.566724,0.841692,0.301389,0.636364,0.796357
2,0.251823,0.343484,0.0551143,0.518067,0.0991427
3,0.293575,0.688629,0.0234869,0.394761,0.927945
4,0.566724,0.841692,0.301389,0.636364,0.796357
5,0.251823,0.343484,0.0551143,0.518067,0.0991427
6,0.293575,0.688629,0.0234869,0.394761,0.927945
7,0.566724,0.841692,0.301389,0.636364,0.796357
8,0.251823,0.343484,0.0551143,0.518067,0.0991427
9,0.293575,0.688629,0.0234869,0.394761,0.927945


create `y` with the columns of `x` in reverse order

In [34]:
y = x[:, reverse(names(x))]

Row,x5,x4,x3,x2,x1
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.796357,0.636364,0.301389,0.841692,0.566724
2,0.0991427,0.518067,0.0551143,0.343484,0.251823
3,0.927945,0.394761,0.0234869,0.688629,0.293575


`vcat` is still possible as it does column name matching

In [35]:
vcat(x, y)

Row,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.566724,0.841692,0.301389,0.636364,0.796357
2,0.251823,0.343484,0.0551143,0.518067,0.0991427
3,0.293575,0.688629,0.0234869,0.394761,0.927945
4,0.566724,0.841692,0.301389,0.636364,0.796357
5,0.251823,0.343484,0.0551143,0.518067,0.0991427
6,0.293575,0.688629,0.0234869,0.394761,0.927945


but column names must still match

In [36]:
vcat(x, y[:, 1:3])

LoadError: ArgumentError: column(s) x1 and x2 are missing from argument(s) 2

unless you pass `:intersect`, `:union` or specific column names as keyword argument `cols`

In [37]:
vcat(x, y[:, 1:3], cols=:intersect)

Row,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64
1,0.301389,0.636364,0.796357
2,0.0551143,0.518067,0.0991427
3,0.0234869,0.394761,0.927945
4,0.301389,0.636364,0.796357
5,0.0551143,0.518067,0.0991427
6,0.0234869,0.394761,0.927945


In [38]:
vcat(x, y[:, 1:3], cols=:union)

Row,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64?,Float64?,Float64,Float64,Float64
1,0.566724,0.841692,0.301389,0.636364,0.796357
2,0.251823,0.343484,0.0551143,0.518067,0.0991427
3,0.293575,0.688629,0.0234869,0.394761,0.927945
4,missing,missing,0.301389,0.636364,0.796357
5,missing,missing,0.0551143,0.518067,0.0991427
6,missing,missing,0.0234869,0.394761,0.927945


In [39]:
vcat(x, y[:, 1:3], cols=[:x1, :x5])

Row,x1,x5
Unnamed: 0_level_1,Float64?,Float64
1,0.566724,0.796357
2,0.251823,0.0991427
3,0.293575,0.927945
4,missing,0.796357
5,missing,0.0991427
6,missing,0.927945
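
The `cols` keyword can also be combined with the `reduce(vcat, ...)` pattern shown earlier. The sketch below assumes two hypothetical data frames with partially overlapping columns; columns absent from a part are filled with `missing`.

```julia
using DataFrames

parts = [DataFrame(a=[1, 2]), DataFrame(a=[3], b=["x"])]  # hypothetical data

combined = reduce(vcat, parts; cols=:union)

names(combined)   # ["a", "b"]
```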


`append!` modifies `x` in place

In [40]:
append!(x, x)

Row,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.566724,0.841692,0.301389,0.636364,0.796357
2,0.251823,0.343484,0.0551143,0.518067,0.0991427
3,0.293575,0.688629,0.0234869,0.394761,0.927945
4,0.566724,0.841692,0.301389,0.636364,0.796357
5,0.251823,0.343484,0.0551143,0.518067,0.0991427
6,0.293575,0.688629,0.0234869,0.394761,0.927945


here the sets of column names must match (the order may differ, as in `y`) unless the `cols` keyword argument is passed

In [41]:
append!(x, y)

Row,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.566724,0.841692,0.301389,0.636364,0.796357
2,0.251823,0.343484,0.0551143,0.518067,0.0991427
3,0.293575,0.688629,0.0234869,0.394761,0.927945
4,0.566724,0.841692,0.301389,0.636364,0.796357
5,0.251823,0.343484,0.0551143,0.518067,0.0991427
6,0.293575,0.688629,0.0234869,0.394761,0.927945
7,0.566724,0.841692,0.301389,0.636364,0.796357
8,0.251823,0.343484,0.0551143,0.518067,0.0991427
9,0.293575,0.688629,0.0234869,0.394761,0.927945


you can also use `prepend!` to add a table at the beginning.
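
A minimal sketch of both in-place functions, using hypothetical one-column data frames. The second call assumes `cols=:union`, which lets `append!` add a column that the target does not have yet.

```julia
using DataFrames

a = DataFrame(x=[1, 2])                  # hypothetical data

prepend!(a, DataFrame(x=[0]))            # rows are added at the beginning
append!(a, DataFrame(x=[3], y=[1.5]);    # rows are added at the end and
        cols=:union)                     # the new column `y` is created

a.x   # [0, 1, 2, 3]
```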

standard `repeat` function works on rows; also `inner` and `outer` keyword arguments are accepted

In [42]:
repeat(x, 2)

Row,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.566724,0.841692,0.301389,0.636364,0.796357
2,0.251823,0.343484,0.0551143,0.518067,0.0991427
3,0.293575,0.688629,0.0234869,0.394761,0.927945
4,0.566724,0.841692,0.301389,0.636364,0.796357
5,0.251823,0.343484,0.0551143,0.518067,0.0991427
6,0.293575,0.688629,0.0234869,0.394761,0.927945
7,0.566724,0.841692,0.301389,0.636364,0.796357
8,0.251823,0.343484,0.0551143,0.518067,0.0991427
9,0.293575,0.688629,0.0234869,0.394761,0.927945
10,0.566724,0.841692,0.301389,0.636364,0.796357


`push!` adds one row to `x` at the end; one must pass a correct number of values unless `cols` keyword argument is passed

In [43]:
push!(x, 1:5)
x

Row,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.566724,0.841692,0.301389,0.636364,0.796357
2,0.251823,0.343484,0.0551143,0.518067,0.0991427
3,0.293575,0.688629,0.0234869,0.394761,0.927945
4,0.566724,0.841692,0.301389,0.636364,0.796357
5,0.251823,0.343484,0.0551143,0.518067,0.0991427
6,0.293575,0.688629,0.0234869,0.394761,0.927945
7,0.566724,0.841692,0.301389,0.636364,0.796357
8,0.251823,0.343484,0.0551143,0.518067,0.0991427
9,0.293575,0.688629,0.0234869,0.394761,0.927945
10,1.0,2.0,3.0,4.0,5.0


also works with dictionaries

In [44]:
push!(x, Dict(:x1=> 11, :x2=> 12, :x3=> 13, :x4=> 14, :x5=> 15))
x

Row,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.566724,0.841692,0.301389,0.636364,0.796357
2,0.251823,0.343484,0.0551143,0.518067,0.0991427
3,0.293575,0.688629,0.0234869,0.394761,0.927945
4,0.566724,0.841692,0.301389,0.636364,0.796357
5,0.251823,0.343484,0.0551143,0.518067,0.0991427
6,0.293575,0.688629,0.0234869,0.394761,0.927945
7,0.566724,0.841692,0.301389,0.636364,0.796357
8,0.251823,0.343484,0.0551143,0.518067,0.0991427
9,0.293575,0.688629,0.0234869,0.394761,0.927945
10,1.0,2.0,3.0,4.0,5.0


and `NamedTuple`s via name matching

In [45]:
push!(x, (x2=2, x1=1, x4=4, x3=3, x5=5))

Row,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.566724,0.841692,0.301389,0.636364,0.796357
2,0.251823,0.343484,0.0551143,0.518067,0.0991427
3,0.293575,0.688629,0.0234869,0.394761,0.927945
4,0.566724,0.841692,0.301389,0.636364,0.796357
5,0.251823,0.343484,0.0551143,0.518067,0.0991427
6,0.293575,0.688629,0.0234869,0.394761,0.927945
7,0.566724,0.841692,0.301389,0.636364,0.796357
8,0.251823,0.343484,0.0551143,0.518067,0.0991427
9,0.293575,0.688629,0.0234869,0.394761,0.927945
10,1.0,2.0,3.0,4.0,5.0


and `DataFrameRow` also via name matching

In [46]:
push!(x, x[1, :])

Row,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.566724,0.841692,0.301389,0.636364,0.796357
2,0.251823,0.343484,0.0551143,0.518067,0.0991427
3,0.293575,0.688629,0.0234869,0.394761,0.927945
4,0.566724,0.841692,0.301389,0.636364,0.796357
5,0.251823,0.343484,0.0551143,0.518067,0.0991427
6,0.293575,0.688629,0.0234869,0.394761,0.927945
7,0.566724,0.841692,0.301389,0.636364,0.796357
8,0.251823,0.343484,0.0551143,0.518067,0.0991427
9,0.293575,0.688629,0.0234869,0.394761,0.927945
10,1.0,2.0,3.0,4.0,5.0


Also supported are `pushfirst!` and `insert!`.
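
As an illustration, here is a short sketch of both functions on a hypothetical data frame. It assumes a DataFrames.jl version that provides `insert!` for data frames, with the row position passed as the second argument.

```julia
using DataFrames

d = DataFrame(a=[2, 3])      # hypothetical data

pushfirst!(d, (a=1,))        # add a row at the beginning
insert!(d, 2, (a=10,))       # add a row so that it becomes row 2

d.a   # [1, 10, 2, 3]
```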

Please consult the documentation of `push!`, `append!` and `vcat` (and related functions) for allowed values of `cols` keyword argument.
This keyword argument governs the way these functions perform column matching of passed arguments. Also `append!` and `push!` support a `promote` keyword argument that decides if column type promotion is allowed.

Let us here just give a quick example of how heterogeneous data can be stored in the data frame using these functionalities:

In [47]:
source = [(a=1, b=2), (a=missing, b=10, c=20), (b="s", c=1, d=1)]

3-element Vector{NamedTuple}:
 (a = 1, b = 2)
 (a = missing, b = 10, c = 20)
 (b = "s", c = 1, d = 1)

In [48]:
df = DataFrame()

In [49]:
for row in source
    push!(df, row, cols=:union) # if cols is :union then promote is true by default
end

In [50]:
df

Row,a,b,c,d
Unnamed: 0_level_1,Int64?,Any,Int64?,Int64?
1,1,2,missing,missing
2,missing,10,20,missing
3,missing,s,1,1


and we see that `push!` dynamically added columns as needed and updated their element types
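
The `promote` keyword can also be used without `cols=:union`. The following sketch uses a hypothetical integer column and shows the column being widened when a floating point value is pushed.

```julia
using DataFrames

d = DataFrame(a=[1, 2])              # column `a` holds Int values

push!(d, (a=2.5,); promote=true)     # the column is widened to fit 2.5

eltype(d.a)   # Float64
```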

### Subsetting/removing rows

In [51]:
x = DataFrame(id=1:10, val='a':'j')

Row,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,2,b
3,3,c
4,4,d
5,5,e
6,6,f
7,7,g
8,8,h
9,9,i
10,10,j


by using indexing

In [52]:
x[1:2, :]

Row,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,2,b


a single row selection creates a `DataFrameRow`

In [53]:
x[1, :]

Row,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a


but this is a `DataFrame`

In [54]:
x[1:1, :]

Row,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a


the same but a view

In [55]:
view(x, 1:2, :)

Row,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,2,b


selects columns 1 and 2

In [56]:
view(x, :, 1:2)

Row,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,2,b
3,3,c
4,4,d
5,5,e
6,6,f
7,7,g
8,8,h
9,9,i
10,10,j


indexing by `Bool`, exact length match is required

In [57]:
x[repeat([true, false], 5), :]

Row,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,3,c
3,5,e
4,7,g
5,9,i


alternatively we can also create a view

In [58]:
view(x, repeat([true, false], 5), :)

Row,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,3,c
3,5,e
4,7,g
5,9,i


we can delete one row in place

In [59]:
deleteat!(x, 7)

Row,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,2,b
3,3,c
4,4,d
5,5,e
6,6,f
7,8,h
8,9,i
9,10,j


or a collection of rows, also in place

In [60]:
deleteat!(x, 6:7)

Row,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,2,b
3,3,c
4,4,d
5,5,e
6,9,i
7,10,j


Also a similar function `keepat!` is supported.
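
A minimal sketch of `keepat!` on a hypothetical data frame: it is the complement of `deleteat!`, keeping the listed rows in place and dropping all others.

```julia
using DataFrames

d = DataFrame(id=1:5, val='a':'e')   # hypothetical data

keepat!(d, [1, 3])                   # only rows 1 and 3 remain

d.id   # [1, 3]
```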

you can also create a new `DataFrame` when deleting rows using `Not` indexing

In [61]:
x[Not(1:2), :]

Row,id,val
Unnamed: 0_level_1,Int64,Char
1,3,c
2,4,d
3,5,e
4,9,i
5,10,j


In [62]:
x

Row,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,2,b
3,3,c
4,4,d
5,5,e
6,9,i
7,10,j


now we move to row filtering

In [63]:
x = DataFrame([1:4, 2:5, 3:6], :auto)

Row,x1,x2,x3
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,2,3
2,2,3,4
3,3,4,5
4,4,5,6


create a new `DataFrame` where filtering function operates on `DataFrameRow`

In [64]:
filter(r -> r.x1 > 2.5, x)

Row,x1,x2,x3
Unnamed: 0_level_1,Int64,Int64,Int64
1,3,4,5
2,4,5,6


In [65]:
filter(r -> r.x1 > 2.5, x, view=true) # the same but as a view

Row,x1,x2,x3
Unnamed: 0_level_1,Int64,Int64,Int64
1,3,4,5
2,4,5,6


or

In [66]:
filter(:x1 => >(2.5), x)

Row,x1,x2,x3
Unnamed: 0_level_1,Int64,Int64,Int64
1,3,4,5
2,4,5,6


in place modification of `x`, an example with `do`-block syntax

In [67]:
filter!(x) do r
    if r.x1 > 2.5
        return r.x2 < 4.5
    end
    r.x3 < 3.5
end

Row,x1,x2,x3
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,2,3
2,3,4,5


A common operation is selection of rows for which a value in a column is contained in a given set. Here are a few ways in which you can achieve this.

In [68]:
df = DataFrame(x=1:12, y=mod1.(1:12, 4))

Row,x,y
Unnamed: 0_level_1,Int64,Int64
1,1,1
2,2,2
3,3,3
4,4,4
5,5,1
6,6,2
7,7,3
8,8,4
9,9,1
10,10,2


We select rows for which column `y` has value `1` or `4`.

In [69]:
filter(row -> row.y in [1,4], df)

Row,x,y
Unnamed: 0_level_1,Int64,Int64
1,1,1
2,4,4
3,5,1
4,8,4
5,9,1
6,12,4


In [70]:
filter(:y => in([1,4]), df)

Row,x,y
Unnamed: 0_level_1,Int64,Int64
1,1,1
2,4,4
3,5,1
4,8,4
5,9,1
6,12,4


In [71]:
df[in.(df.y, Ref([1,4])), :]

Row,x,y
Unnamed: 0_level_1,Int64,Int64
1,1,1
2,4,4
3,5,1
4,8,4
5,9,1
6,12,4
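
The same selection can be sketched with `subset` and `ByRow`. Using a `Set` for the lookup is an assumption made here for illustration; a plain vector works as well.

```julia
using DataFrames

d = DataFrame(x=1:12, y=mod1.(1:12, 4))
wanted = Set([1, 4])                      # values of `y` to keep

r1 = subset(d, :y => ByRow(in(wanted)))   # condition applied row by row
r2 = filter(:y => in(wanted), d)          # filter works row-wise by itself

r1.x   # [1, 4, 5, 8, 9, 12]
```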


DataFrames.jl also provides a `subset` function that works on whole columns and allows for multiple conditions:

In [72]:
x = DataFrame([1:4, 2:5, 3:6], :auto)

Row,x1,x2,x3
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,2,3
2,2,3,4
3,3,4,5
4,4,5,6


In [73]:
subset(x, :x1 => x -> x .< mean(x), :x2 => ByRow(<(2.5)))

Row,x1,x2,x3
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,2,3


Similarly an in-place `subset!` function is provided.
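
One practical difference from `filter` is the handling of `missing`. The sketch below, on a hypothetical column with a missing value, assumes the `skipmissing` keyword of `subset` and `subset!`, which treats a `missing` condition as `false`.

```julia
using DataFrames

d = DataFrame(v=[1, missing, 3])    # hypothetical data

# the condition is `missing` for row 2, so that row is dropped
kept = subset(d, :v => ByRow(>(1)); skipmissing=true)

subset!(d, :v => ByRow(>(1)); skipmissing=true)   # same, but in place
```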

### Deduplicating

In [74]:
x = DataFrame(A=[1,2], B=["x","y"])
append!(x, x)
x.C = 1:4
x

Row,A,B,C
Unnamed: 0_level_1,Int64,String,Int64
1,1,x,1
2,2,y,2
3,1,x,3
4,2,y,4


keep the first row for each unique combination of values in the given columns (here columns 1 and 2)

In [75]:
unique(x, [1,2])

Row,A,B,C
Unnamed: 0_level_1,Int64,String,Int64
1,1,x,1
2,2,y,2


now we look at whole rows

In [76]:
unique(x)

Row,A,B,C
Unnamed: 0_level_1,Int64,String,Int64
1,1,x,1
2,2,y,2
3,1,x,3
4,2,y,4


get indicators of non-unique rows

In [77]:
nonunique(x, :A)

4-element Vector{Bool}:
 0
 0
 1
 1

modify `x` in place

In [78]:
unique!(x, :B)

Row,A,B,C
Unnamed: 0_level_1,Int64,String,Int64
1,1,x,1
2,2,y,2


### Extracting one row from a `DataFrame` into standard collections

In [79]:
x = DataFrame(x=[1,missing,2], y=["a", "b", missing], z=[true,false,true])

Row,x,y,z
Unnamed: 0_level_1,Int64?,String?,Bool
1,1,a,true
2,missing,b,false
3,2,missing,true


In [80]:
cols = [:y, :z]

2-element Vector{Symbol}:
 :y
 :z

you can use a conversion to a `Vector` or an `Array`

In [81]:
Vector(x[1, cols])

2-element Vector{Any}:
     "a"
 true

In [82]:
Array(x[1, cols]) # the same

2-element Vector{Any}:
     "a"
 true

now you will get a vector of vectors

In [83]:
[Vector(x[i, cols]) for i in axes(x, 1)]

3-element Vector{Vector{Any}}:
 ["a", true]
 ["b", false]
 [missing, true]

it is easy to convert a `DataFrameRow` into a `NamedTuple`

In [84]:
copy(x[1, cols])

NamedTuple{(:y, :z), Tuple{Union{Missing, String}, Bool}}(("a", true))

or a `Tuple`

In [85]:
Tuple(x[1, cols])

("a", true)
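
The same conversion can be applied to every row at once. This is a minimal sketch on a hypothetical data frame, broadcasting `copy` over `eachrow` to obtain a vector of `NamedTuple`s.

```julia
using DataFrames

d = DataFrame(y=["a", "b", missing], z=[true, false, true])

rows = copy.(eachrow(d))    # one NamedTuple per row

rows[1]                     # (y = "a", z = true)
```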

### Working with a collection of rows of a data frame

You can use `eachrow` to get a vector-like collection of `DataFrameRow`s

In [86]:
df = DataFrame(reshape(1:12, 3, 4), :auto)

Row,x1,x2,x3,x4
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
1,1,4,7,10
2,2,5,8,11
3,3,6,9,12


In [87]:
er_df = eachrow(df)

Row,x1,x2,x3,x4
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
1,1,4,7,10
2,2,5,8,11
3,3,6,9,12


In [88]:
er_df[1]

Row,x1,x2,x3,x4
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
1,1,4,7,10


In [89]:
last(er_df)

Row,x1,x2,x3,x4
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
3,3,6,9,12


In [90]:
er_df[end]

Row,x1,x2,x3,x4
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
3,3,6,9,12


As a `DataFrameRows` object keeps a connection to the parent data frame, you can get the columns of the parent using `getproperty`

In [91]:
er_df.x1

3-element Vector{Int64}:
 1
 2
 3
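
A typical use of `eachrow` is iteration. The sketch below uses a hypothetical two-column data frame; since each element is a `DataFrameRow` view, assigning to a field of the row writes to the parent.

```julia
using DataFrames

d = DataFrame(p=[1, 2, 3], q=[10, 11, 12])   # hypothetical data

sums = [r.p + r.q for r in eachrow(d)]       # read values row by row

for r in eachrow(d)
    r.p *= 2                                 # modifies column `p` of `d`
end

d.p   # [2, 4, 6]
```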

### Flattening a data frame

Occasionally you have a data frame in which one column is a vector of collections. You can expand (*flatten*) such a column using the `flatten` function

In [92]:
df = DataFrame(a = 'a':'c', b = [[1, 2, 3], [4, 5], 6])

Row,a,b
Unnamed: 0_level_1,Char,Any
1,a,"[1, 2, 3]"
2,b,"[4, 5]"
3,c,6


In [93]:
flatten(df, :b)

Row,a,b
Unnamed: 0_level_1,Char,Int64
1,a,1
2,a,2
3,a,3
4,b,4
5,b,5
6,c,6
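
`flatten` also accepts several columns, which are then expanded together. In this sketch (hypothetical data) the collections in `b` and `c` have matching lengths in every row, which is what this usage requires.

```julia
using DataFrames

d = DataFrame(a=[1, 2],
              b=[[1, 2], [3]],
              c=[["x", "y"], ["z"]])   # hypothetical data

flat = flatten(d, [:b, :c])            # `a` is repeated as needed

flat.a   # [1, 1, 2]
```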


### Only one row

`only` from Julia Base is also supported in DataFrames.jl and succeeds if the data frame has only one row, in which case it is returned.

In [94]:
df = DataFrame(a=1)

Row,a
Unnamed: 0_level_1,Int64
1,1


In [95]:
only(df)

Row,a
Unnamed: 0_level_1,Int64
1,1


In [96]:
df2 = repeat(df, 2)

Row,a
Unnamed: 0_level_1,Int64
1,1
2,1


In [97]:
only(df2)

LoadError: ArgumentError: data frame must contain exactly 1 row
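
If an error is not desired, the row count can be checked first. The helper `single_or_nothing` below is hypothetical (it is not part of DataFrames.jl) and only sketches one way to guard the call.

```julia
using DataFrames

# hypothetical helper: return the single row, or `nothing` otherwise
single_or_nothing(d) = nrow(d) == 1 ? only(d) : nothing

d1 = DataFrame(a=1)
d2 = repeat(d1, 2)

single_or_nothing(d1)   # a DataFrameRow
single_or_nothing(d2)   # nothing
```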