## Manipulating rows of a DataFrame
## Selecting rows

In [71]:
using DataFrames
using Random
using Statistics

In [6]:
df = DataFrame(rand(4, 5), :auto)

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.683838,0.584953,0.944772,0.727937,0.8273
2,0.636507,0.307688,0.89018,0.294209,0.0301562
3,0.6738,0.85877,0.118572,0.0700025,0.0100401
4,0.687528,0.318514,0.0187954,0.838922,0.265266


In [7]:
# using : as row selector will copy columns

df[:, :]

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.683838,0.584953,0.944772,0.727937,0.8273
2,0.636507,0.307688,0.89018,0.294209,0.0301562
3,0.6738,0.85877,0.118572,0.0700025,0.0100401
4,0.687528,0.318514,0.0187954,0.838922,0.265266


In [8]:
# this is the same as

copy(df)

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.683838,0.584953,0.944772,0.727937,0.8273
2,0.636507,0.307688,0.89018,0.294209,0.0301562
3,0.6738,0.85877,0.118572,0.0700025,0.0100401
4,0.687528,0.318514,0.0187954,0.838922,0.265266
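
The difference between copying and sharing columns can be checked directly. The following is a minimal sketch on a small hypothetical data frame (the names `small`, `copied` and `shared` are illustrative and do not appear above). It assumes the documented behaviour that `df[!, :]` builds a new data frame that reuses the parent's column vectors.

```julia
using DataFrames

small = DataFrame(a=1:3, b=4:6)          # hypothetical example data

copied = small[:, :]                     # `:` as row selector copies the columns
shared = small[!, :]                     # `!` reuses the same column vectors

copied_is_fresh = copied.a !== small.a   # independent storage
shared_is_alias = shared.a === small.a   # the very same vector object

copied[1, :a] = 100                      # does not affect `small`
```

Because `copied` owns its columns, the assignment in the last line leaves `small` unchanged.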


In [9]:
# you can get a subset of rows of a data frame without copying using view to get a SubDataFrame

sdf = view(df, 1:3, 1:3)


Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Float64,Float64,Float64
1,0.683838,0.584953,0.944772
2,0.636507,0.307688,0.89018
3,0.6738,0.85877,0.118572


In [10]:
# you still have a detailed reference to the parent

parent(sdf), parentindices(sdf)

(4×5 DataFrame
 Row │ x1        x2        x3         x4         x5
     │ Float64   Float64   Float64    Float64    Float64
─────┼─────────────────────────────────────────────────────
   1 │ 0.683838  0.584953  0.944772   0.727937   0.8273
   2 │ 0.636507  0.307688  0.89018    0.294209   0.0301562
   3 │ 0.6738    0.85877   0.118572   0.0700025  0.0100401
   4 │ 0.687528  0.318514  0.0187954  0.838922   0.265266, (1:3, 1:3))
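
Since a `SubDataFrame` does not copy data, writing through it changes the parent. A minimal sketch, using a small hypothetical data frame rather than the random one above:

```julia
using DataFrames

df = DataFrame(a=1:4, b=5:8)      # hypothetical example data
sdf = view(df, 2:3, :)            # SubDataFrame over rows 2 and 3

sdf[1, :a] = 100                  # row 1 of the view is row 2 of the parent

copied = df[2:3, :]               # a regular (copied) subset for comparison
copied[1, :a] = -1                # leaves `df` untouched
```

After running this, the parent holds the value written through the view, while the write to `copied` is not visible in `df`.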

In [11]:
# selecting a single row returns a DataFrameRow object which is also a view

dfr = df[3, :]

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
3,0.6738,0.85877,0.118572,0.0700025,0.0100401


In [12]:
parent(dfr), parentindices(dfr), rownumber(dfr)

(4×5 DataFrame
 Row │ x1        x2        x3         x4         x5
     │ Float64   Float64   Float64    Float64    Float64
─────┼─────────────────────────────────────────────────────
   1 │ 0.683838  0.584953  0.944772   0.727937   0.8273
   2 │ 0.636507  0.307688  0.89018    0.294209   0.0301562
   3 │ 0.6738    0.85877   0.118572   0.0700025  0.0100401
   4 │ 0.687528  0.318514  0.0187954  0.838922   0.265266, (3, Base.OneTo(5)), 3)

In [13]:
# let us add a column to a data frame by broadcasting a scalar assignment

df[!, :Z] .= 1

4-element Vector{Int64}:
 1
 1
 1
 1

In [14]:
df

Unnamed: 0_level_0,x1,x2,x3,x4,x5,Z
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Int64
1,0.683838,0.584953,0.944772,0.727937,0.8273,1
2,0.636507,0.307688,0.89018,0.294209,0.0301562,1
3,0.6738,0.85877,0.118572,0.0700025,0.0100401,1
4,0.687528,0.318514,0.0187954,0.838922,0.265266,1


In [15]:
# Earlier we used : for column selection in a view (SubDataFrame and DataFrameRow). In this case a view will have all columns of the parent after the parent is mutated.

dfr


Unnamed: 0_level_0,x1,x2,x3,x4,x5,Z
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Int64
3,0.6738,0.85877,0.118572,0.0700025,0.0100401,1
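
The role of the `:` column selector can be seen by comparing it with an explicit column list. This is a minimal sketch on hypothetical data (`v_all` and `v_fixed` are illustrative names); it relies on the behaviour described in the comment above.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)          # hypothetical example data
v_all   = view(df, :, :)              # columns selected with `:`
v_fixed = view(df, :, [:a, :b])       # columns listed explicitly

df[!, :c] .= 0                        # mutate the parent: add a column

names_all   = names(v_all)            # follows the parent's columns
names_fixed = names(v_fixed)          # stays with the listed columns
```

Only the view created with `:` picks up the new column.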


In [17]:
# Note that parent and parentindices refer to the true source of data for a DataFrameRow and rownumber refers to row number in the direct object that was used to create DataFrameRow

parent(dfr), parentindices(dfr), rownumber(dfr)


(4×6 DataFrame
 Row │ x1        x2        x3         x4         x5         Z
     │ Float64   Float64   Float64    Float64    Float64    Int64
─────┼────────────────────────────────────────────────────────────
   1 │ 0.683838  0.584953  0.944772   0.727937   0.8273         1
   2 │ 0.636507  0.307688  0.89018    0.294209   0.0301562      1
   3 │ 0.6738    0.85877   0.118572   0.0700025  0.0100401      1
   4 │ 0.687528  0.318514  0.0187954  0.838922   0.265266       1, (3, Base.OneTo(6)), 3)

In [18]:
df = DataFrame(a=1:4)

Unnamed: 0_level_0,a
Unnamed: 0_level_1,Int64
1,1
2,2
3,3
4,4


In [19]:
dfv = view(df, [3,2], :)

Unnamed: 0_level_0,a
Unnamed: 0_level_1,Int64
1,3
2,2


In [20]:
dfr = dfv[1, :]

Unnamed: 0_level_0,a
Unnamed: 0_level_1,Int64
3,3


In [21]:
parent(dfr), parentindices(dfr), rownumber(dfr)

(4×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3
   4 │     4, (3, Base.OneTo(1)), 1)

## Reordering rows
We create some random data frame (and hope that x.x is not sorted :), which is quite likely with 12 rows)

In [22]:
x = DataFrame(id=1:12, x = rand(12), y = [zeros(6); ones(6)])

Unnamed: 0_level_0,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,1,0.552987,0.0
2,2,0.375738,0.0
3,3,0.00541398,0.0
4,4,0.215497,0.0
5,5,0.0381143,0.0
6,6,0.0736901,0.0
7,7,0.0917023,1.0
8,8,0.824968,1.0
9,9,0.935277,1.0
10,10,0.644843,1.0


In [23]:
# check if a DataFrame or a subset of its columns is sorted

issorted(x), issorted(x, :x)

(true, false)

In [24]:
# we sort x in place

sort!(x, :x)

Unnamed: 0_level_0,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,3,0.00541398,0.0
2,5,0.0381143,0.0
3,6,0.0736901,0.0
4,7,0.0917023,1.0
5,4,0.215497,0.0
6,2,0.375738,0.0
7,1,0.552987,0.0
8,10,0.644843,1.0
9,12,0.728604,1.0
10,11,0.774834,1.0


In [25]:
# now we create a new DataFrame

y = sort(x, :id)

Unnamed: 0_level_0,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,1,0.552987,0.0
2,2,0.375738,0.0
3,3,0.00541398,0.0
4,4,0.215497,0.0
5,5,0.0381143,0.0
6,6,0.0736901,0.0
7,7,0.0917023,1.0
8,8,0.824968,1.0
9,9,0.935277,1.0
10,10,0.644843,1.0


In [26]:
# here we sort by two columns, first is decreasing, second is increasing

sort(x, [:y, :x], rev=[true, false])


Unnamed: 0_level_0,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,7,0.0917023,1.0
2,10,0.644843,1.0
3,12,0.728604,1.0
4,11,0.774834,1.0
5,8,0.824968,1.0
6,9,0.935277,1.0
7,3,0.00541398,0.0
8,5,0.0381143,0.0
9,6,0.0736901,0.0
10,4,0.215497,0.0


In [27]:
sort(x, [order(:y, rev=true), :x]) # the same as above

Unnamed: 0_level_0,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,7,0.0917023,1.0
2,10,0.644843,1.0
3,12,0.728604,1.0
4,11,0.774834,1.0
5,8,0.824968,1.0
6,9,0.935277,1.0
7,3,0.00541398,0.0
8,5,0.0381143,0.0
9,6,0.0736901,0.0
10,4,0.215497,0.0


In [28]:
# a fancier sort: y in decreasing order, then x in decreasing order via by=v->-v

sort(x, [order(:y, rev=true), order(:x, by=v->-v)])

Unnamed: 0_level_0,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,9,0.935277,1.0
2,8,0.824968,1.0
3,11,0.774834,1.0
4,12,0.728604,1.0
5,10,0.644843,1.0
6,7,0.0917023,1.0
7,1,0.552987,0.0
8,2,0.375738,0.0
9,4,0.215497,0.0
10,6,0.0736901,0.0
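
When only the row order is needed, `sortperm` accepts the same column specifications as `sort` and returns a permutation. A minimal sketch on a small hypothetical data frame (the columns `g` and `v` are illustrative):

```julia
using DataFrames

df = DataFrame(id=1:4, g=[1, 2, 1, 2], v=[0.3, 0.1, 0.2, 0.4])  # hypothetical data

# permutation that sorts by g decreasing, then v increasing
p = sortperm(df, [order(:g, rev=true), :v])

reordered = df[p, :]                              # apply the permutation by indexing
sorted    = sort(df, [order(:g, rev=true), :v])   # the direct equivalent
```

The permutation can be reused, for example to reorder another object that is aligned with the rows of `df`.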


In [31]:
# reorder rows (here randomly); note that 1:10 picks only the first 10 of the 12 rows

x[shuffle(1:10), :]

Unnamed: 0_level_0,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,11,0.774834,1.0
2,4,0.215497,0.0
3,10,0.644843,1.0
4,5,0.0381143,0.0
5,7,0.0917023,1.0
6,2,0.375738,0.0
7,6,0.0736901,0.0
8,1,0.552987,0.0
9,3,0.00541398,0.0
10,12,0.728604,1.0


In [32]:
# it is also easy to swap rows using broadcasted assignment

sort!(x, :id)
x[[1,10],:] .= x[[10,1],:]
x

Unnamed: 0_level_0,id,x,y
Unnamed: 0_level_1,Int64,Float64,Float64
1,10,0.644843,1.0
2,2,0.375738,0.0
3,3,0.00541398,0.0
4,4,0.215497,0.0
5,5,0.0381143,0.0
6,6,0.0736901,0.0
7,7,0.0917023,1.0
8,8,0.824968,1.0
9,9,0.935277,1.0
10,1,0.552987,0.0


## Merging/adding rows


In [33]:
x = DataFrame(rand(3, 5), :auto)
# merge by rows - data frames must have the same column names; this is the same as vcat

[x; x]


Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.968842,0.503856,0.324074,0.280555,0.997833
2,0.172332,0.386847,0.29139,0.116583,0.798872
3,0.015591,0.0841733,0.115784,0.42946,0.904764
4,0.968842,0.503856,0.324074,0.280555,0.997833
5,0.172332,0.386847,0.29139,0.116583,0.798872
6,0.015591,0.0841733,0.115784,0.42946,0.904764


In [34]:
x

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.968842,0.503856,0.324074,0.280555,0.997833
2,0.172332,0.386847,0.29139,0.116583,0.798872
3,0.015591,0.0841733,0.115784,0.42946,0.904764


In [35]:
# you can efficiently vcat a vector of DataFrames using reduce

reduce(vcat, [x, x, x])

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.968842,0.503856,0.324074,0.280555,0.997833
2,0.172332,0.386847,0.29139,0.116583,0.798872
3,0.015591,0.0841733,0.115784,0.42946,0.904764
4,0.968842,0.503856,0.324074,0.280555,0.997833
5,0.172332,0.386847,0.29139,0.116583,0.798872
6,0.015591,0.0841733,0.115784,0.42946,0.904764
7,0.968842,0.503856,0.324074,0.280555,0.997833
8,0.172332,0.386847,0.29139,0.116583,0.798872
9,0.015591,0.0841733,0.115784,0.42946,0.904764


In [36]:
# create y with the columns of x in reversed order
y = x[:, reverse(names(x))]

Unnamed: 0_level_0,x5,x4,x3,x2,x1
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.997833,0.280555,0.324074,0.503856,0.968842
2,0.798872,0.116583,0.29139,0.386847,0.172332
3,0.904764,0.42946,0.115784,0.0841733,0.015591


In [37]:
# vcat is still possible as it does column name matching

vcat(x, y)

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.968842,0.503856,0.324074,0.280555,0.997833
2,0.172332,0.386847,0.29139,0.116583,0.798872
3,0.015591,0.0841733,0.115784,0.42946,0.904764
4,0.968842,0.503856,0.324074,0.280555,0.997833
5,0.172332,0.386847,0.29139,0.116583,0.798872
6,0.015591,0.0841733,0.115784,0.42946,0.904764


In [38]:
# but column names must still match

vcat(x, y[:, 1:3])


LoadError: ArgumentError: column(s) x1 and x2 are missing from argument(s) 2

In [39]:
# unless you pass :intersect, :union or specific column names as keyword argument cols

vcat(x, y[:, 1:3], cols=:intersect)


Unnamed: 0_level_0,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64
1,0.324074,0.280555,0.997833
2,0.29139,0.116583,0.798872
3,0.115784,0.42946,0.904764
4,0.324074,0.280555,0.997833
5,0.29139,0.116583,0.798872
6,0.115784,0.42946,0.904764


In [40]:
vcat(x, y[:, 1:3], cols=:union)


Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64?,Float64?,Float64,Float64,Float64
1,0.968842,0.503856,0.324074,0.280555,0.997833
2,0.172332,0.386847,0.29139,0.116583,0.798872
3,0.015591,0.0841733,0.115784,0.42946,0.904764
4,missing,missing,0.324074,0.280555,0.997833
5,missing,missing,0.29139,0.116583,0.798872
6,missing,missing,0.115784,0.42946,0.904764


In [41]:
vcat(x, y[:, 1:3], cols=[:x1, :x5])


Unnamed: 0_level_0,x1,x5
Unnamed: 0_level_1,Float64?,Float64
1,0.968842,0.997833
2,0.172332,0.798872
3,0.015591,0.904764
4,missing,0.997833
5,missing,0.798872
6,missing,0.904764
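
The `cols` keyword can also be combined with `reduce` when stacking many data frames whose columns differ. A minimal sketch with two hypothetical pieces (`parts` and `combined` are illustrative names):

```julia
using DataFrames

parts = [DataFrame(a=1:2, b=3:4),     # hypothetical pieces with
         DataFrame(b=5:6, c=7:8)]     # partially overlapping columns

# keep every column seen in any piece; gaps are filled with missing
combined = reduce(vcat, parts; cols=:union)
```

Columns that are absent from a piece are filled with `missing`, so their element type is widened accordingly.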


In [42]:
# append! modifies x in place

append!(x, x)

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.968842,0.503856,0.324074,0.280555,0.997833
2,0.172332,0.386847,0.29139,0.116583,0.798872
3,0.015591,0.0841733,0.115784,0.42946,0.904764
4,0.968842,0.503856,0.324074,0.280555,0.997833
5,0.172332,0.386847,0.29139,0.116583,0.798872
6,0.015591,0.0841733,0.115784,0.42946,0.904764


In [43]:
# here the set of column names must match (their order may differ, as in y) unless cols keyword argument is passed

append!(x, y)


Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.968842,0.503856,0.324074,0.280555,0.997833
2,0.172332,0.386847,0.29139,0.116583,0.798872
3,0.015591,0.0841733,0.115784,0.42946,0.904764
4,0.968842,0.503856,0.324074,0.280555,0.997833
5,0.172332,0.386847,0.29139,0.116583,0.798872
6,0.015591,0.0841733,0.115784,0.42946,0.904764
7,0.968842,0.503856,0.324074,0.280555,0.997833
8,0.172332,0.386847,0.29139,0.116583,0.798872
9,0.015591,0.0841733,0.115784,0.42946,0.904764


In [44]:
# standard repeat function works on rows; also inner and outer keyword arguments are accepted

repeat(x, 2)


Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.968842,0.503856,0.324074,0.280555,0.997833
2,0.172332,0.386847,0.29139,0.116583,0.798872
3,0.015591,0.0841733,0.115784,0.42946,0.904764
4,0.968842,0.503856,0.324074,0.280555,0.997833
5,0.172332,0.386847,0.29139,0.116583,0.798872
6,0.015591,0.0841733,0.115784,0.42946,0.904764
7,0.968842,0.503856,0.324074,0.280555,0.997833
8,0.172332,0.386847,0.29139,0.116583,0.798872
9,0.015591,0.0841733,0.115784,0.42946,0.904764
10,0.968842,0.503856,0.324074,0.280555,0.997833
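
The `inner` and `outer` keyword arguments mentioned above behave as they do for vectors. A minimal sketch on a hypothetical two-row data frame:

```julia
using DataFrames

df = DataFrame(a=[1, 2])              # hypothetical example data

by_inner = repeat(df; inner=2)        # each row is repeated in place
by_outer = repeat(df; outer=2)        # the whole data frame is repeated
both     = repeat(df; inner=2, outer=2)
```

Here `inner` duplicates each row consecutively, while `outer` stacks whole copies of the data frame.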


In [45]:
# push! adds one row to x at the end; one must pass a correct number of values unless cols keyword argument is passed

push!(x, 1:5)
x

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.968842,0.503856,0.324074,0.280555,0.997833
2,0.172332,0.386847,0.29139,0.116583,0.798872
3,0.015591,0.0841733,0.115784,0.42946,0.904764
4,0.968842,0.503856,0.324074,0.280555,0.997833
5,0.172332,0.386847,0.29139,0.116583,0.798872
6,0.015591,0.0841733,0.115784,0.42946,0.904764
7,0.968842,0.503856,0.324074,0.280555,0.997833
8,0.172332,0.386847,0.29139,0.116583,0.798872
9,0.015591,0.0841733,0.115784,0.42946,0.904764
10,1.0,2.0,3.0,4.0,5.0


In [46]:
# push! also works with dictionaries via name matching
push!(x, Dict(:x1=> 11, :x2=> 12, :x3=> 13, :x4=> 14, :x5=> 15))
x

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.968842,0.503856,0.324074,0.280555,0.997833
2,0.172332,0.386847,0.29139,0.116583,0.798872
3,0.015591,0.0841733,0.115784,0.42946,0.904764
4,0.968842,0.503856,0.324074,0.280555,0.997833
5,0.172332,0.386847,0.29139,0.116583,0.798872
6,0.015591,0.0841733,0.115784,0.42946,0.904764
7,0.968842,0.503856,0.324074,0.280555,0.997833
8,0.172332,0.386847,0.29139,0.116583,0.798872
9,0.015591,0.0841733,0.115784,0.42946,0.904764
10,1.0,2.0,3.0,4.0,5.0


In [47]:
# and NamedTuples via name matching
push!(x, (x2=2, x1=1, x4=4, x3=3, x5=5))

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.968842,0.503856,0.324074,0.280555,0.997833
2,0.172332,0.386847,0.29139,0.116583,0.798872
3,0.015591,0.0841733,0.115784,0.42946,0.904764
4,0.968842,0.503856,0.324074,0.280555,0.997833
5,0.172332,0.386847,0.29139,0.116583,0.798872
6,0.015591,0.0841733,0.115784,0.42946,0.904764
7,0.968842,0.503856,0.324074,0.280555,0.997833
8,0.172332,0.386847,0.29139,0.116583,0.798872
9,0.015591,0.0841733,0.115784,0.42946,0.904764
10,1.0,2.0,3.0,4.0,5.0


In [48]:
# and DataFrameRow also via name matching

push!(x, x[1, :])

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.968842,0.503856,0.324074,0.280555,0.997833
2,0.172332,0.386847,0.29139,0.116583,0.798872
3,0.015591,0.0841733,0.115784,0.42946,0.904764
4,0.968842,0.503856,0.324074,0.280555,0.997833
5,0.172332,0.386847,0.29139,0.116583,0.798872
6,0.015591,0.0841733,0.115784,0.42946,0.904764
7,0.968842,0.503856,0.324074,0.280555,0.997833
8,0.172332,0.386847,0.29139,0.116583,0.798872
9,0.015591,0.0841733,0.115784,0.42946,0.904764
10,1.0,2.0,3.0,4.0,5.0


Please consult the documentation of push!, append! and vcat for the allowed values of the cols keyword argument, which governs how these functions match columns across the passed arguments. In addition, append! and push! support a promote keyword argument that decides whether column type promotion is allowed.

Let us here just give a quick example of how heterogeneous data can be stored in the data frame using these functionalities:

In [49]:
source = [(a=1, b=2), (a=missing, b=10, c=20), (b="s", c=1, d=1)]

3-element Vector{NamedTuple}:
 (a = 1, b = 2)
 (a = missing, b = 10, c = 20)
 (b = "s", c = 1, d = 1)

In [50]:
df = DataFrame()
for row in source
    push!(df, row, cols=:union) # if cols is :union then promote is true by default
end

df


Unnamed: 0_level_0,a,b,c,d
Unnamed: 0_level_1,Int64?,Any,Int64?,Int64?
1,1,2,missing,missing
2,missing,10,20,missing
3,missing,s,1,1


We see that push! dynamically added columns as needed and updated their element types.
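
The effect of the promote keyword can be isolated in a smaller case. The following is a minimal sketch on a hypothetical one-column data frame; the variable `rejected` is illustrative. It assumes the default `promote=false` that applies when `cols` is not `:union` or `:subset`.

```julia
using DataFrames

df = DataFrame(a=[1, 2])                 # column `a` has element type Int64

# Without promotion, a value that does not fit the column type is rejected
# and the data frame is left unchanged.
rejected = try
    push!(df, (a=missing,))
    false
catch
    true
end

# With promote=true the column is widened to Union{Missing, Int64}.
push!(df, (a=missing,); promote=true)
```

The failed call may log an error message before rethrowing, which is expected.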

## Subsetting/removing rows


In [51]:
x = DataFrame(id=1:10, val='a':'j')

# by using indexing

x[1:2, :]

Unnamed: 0_level_0,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,2,b


In [52]:
# a single row selection creates a DataFrameRow

x[1, :]

Unnamed: 0_level_0,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a


In [53]:
# but this is a DataFrame

x[1:1, :]


Unnamed: 0_level_0,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a


In [54]:
# rows 1 and 2 again, but as a view (no copy is made)

view(x, 1:2, :)

Unnamed: 0_level_0,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,2,b


In [55]:
# selects columns 1 and 2

view(x, :, 1:2)

Unnamed: 0_level_0,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,2,b
3,3,c
4,4,d
5,5,e
6,6,f
7,7,g
8,8,h
9,9,i
10,10,j


In [56]:
# indexing by Bool, exact length match is required

x[repeat([true, false], 5), :]

Unnamed: 0_level_0,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,3,c
3,5,e
4,7,g
5,9,i


In [57]:
# alternatively we can also create a view

view(x, repeat([true, false], 5), :)


Unnamed: 0_level_0,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,3,c
3,5,e
4,7,g
5,9,i


In [58]:
# we can delete one row in place

deleteat!(x, 7)

Unnamed: 0_level_0,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,2,b
3,3,c
4,4,d
5,5,e
6,6,f
7,8,h
8,9,i
9,10,j


In [59]:
# or a collection of rows, also in place

deleteat!(x, 6:7)


Unnamed: 0_level_0,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,2,b
3,3,c
4,4,d
5,5,e
6,9,i
7,10,j


In [60]:
# you can also create a new DataFrame when deleting rows using Not indexing

x[Not(1:2), :]


Unnamed: 0_level_0,id,val
Unnamed: 0_level_1,Int64,Char
1,3,c
2,4,d
3,5,e
4,9,i
5,10,j


In [61]:
x

Unnamed: 0_level_0,id,val
Unnamed: 0_level_1,Int64,Char
1,1,a
2,2,b
3,3,c
4,4,d
5,5,e
6,9,i
7,10,j


In [62]:
# now we move to row filtering

x = DataFrame([1:4, 2:5, 3:6], :auto)

# create a new DataFrame where filtering function operates on DataFrameRow

filter(r -> r.x1 > 2.5, x)


Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Int64,Int64,Int64
1,3,4,5
2,4,5,6


In [63]:
filter(r -> r.x1 > 2.5, x, view=true) # the same but as a view


Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Int64,Int64,Int64
1,3,4,5
2,4,5,6


In [64]:
filter(:x1 => >(2.5), x)

Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Int64,Int64,Int64
1,3,4,5
2,4,5,6
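
The `column => predicate` form also accepts several columns, in which case the predicate receives one positional argument per column. A minimal sketch on the same kind of data (the threshold 6 is an arbitrary illustration):

```julia
using DataFrames

x = DataFrame([1:4, 2:5, 3:6], :auto)

# keep rows where the sum of x1 and x3 exceeds 6
kept = filter([:x1, :x3] => (a, c) -> a + c > 6, x)
```

Passing the needed columns explicitly avoids constructing a `DataFrameRow` for every row, which is typically faster than the `r -> ...` form.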


In [65]:
# in place modification of x, an example with do-block syntax

filter!(x) do r
    if r.x1 > 2.5
        return r.x2 < 4.5
    end
    r.x3 < 3.5
end


Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,2,3
2,3,4,5


In [66]:
# A common operation is selection of rows for which a value in a column is contained in a given set. Here are a few ways in which you can achieve this.

df = DataFrame(x=1:12, y=mod1.(1:12, 4))


Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,Int64
1,1,1
2,2,2
3,3,3
4,4,4
5,5,1
6,6,2
7,7,3
8,8,4
9,9,1
10,10,2


In [67]:
# We select rows for which column y has value 1 or 4.

filter(row -> row.y in [1,4], df)


Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,Int64
1,1,1
2,4,4
3,5,1
4,8,4
5,9,1
6,12,4


In [68]:
filter(:y => in([1,4]), df)

Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,Int64
1,1,1
2,4,4
3,5,1
4,8,4
5,9,1
6,12,4


In [69]:
df[in.(df.y, Ref([1,4])), :]


Unnamed: 0_level_0,x,y
Unnamed: 0_level_1,Int64,Int64
1,1,1
2,4,4
3,5,1
4,8,4
5,9,1
6,12,4
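
The same selection can be written with `subset`, wrapping the predicate in `ByRow` because `subset` passes whole columns. A minimal sketch that rebuilds the data frame used above:

```julia
using DataFrames

df = DataFrame(x=1:12, y=mod1.(1:12, 4))

via_subset = subset(df, :y => ByRow(in([1, 4])))   # row-wise membership test
via_filter = filter(:y => in([1, 4]), df)          # for comparison
```

Both calls return a new data frame with the same rows.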


In [72]:
# DataFrames.jl also provides a subset function that works on whole columns and allows for multiple conditions:

x = DataFrame([1:4, 2:5, 3:6], :auto)

subset(x, :x1 => x -> x .< mean(x), :x2 => ByRow(<(2.5)))


Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,2,3


Similarly an in-place subset! function is provided.
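
One practical difference from `filter` is the `skipmissing` keyword of `subset`, which treats a `missing` condition as `false` instead of raising an error. A minimal sketch on hypothetical data with a missing value:

```julia
using DataFrames

df = DataFrame(x=[1, missing, 3])     # hypothetical example data

# `missing > 1` evaluates to missing; skipmissing=true drops such rows
kept = subset(df, :x => ByRow(>(1)); skipmissing=true)
```

Without `skipmissing=true` the same call would throw, because the condition for the second row is `missing` rather than `true` or `false`.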

## Deduplicating


In [74]:
x = DataFrame(A=[1,2], B=["x","y"])
append!(x, x)
x.C = 1:4
x

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,String,Int64
1,1,x,1
2,2,y,2
3,1,x,3
4,2,y,4


In [75]:
# keep the first row for each unique combination of values in the given columns (here columns 1 and 2)

unique(x, [1,2])


Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,String,Int64
1,1,x,1
2,2,y,2


In [76]:
# now we look at whole rows

unique(x)

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,String,Int64
1,1,x,1
2,2,y,2
3,1,x,3
4,2,y,4


In [77]:
# get indicators of non-unique rows

nonunique(x, :A)

4-element Vector{Bool}:
 0
 0
 1
 1

In [79]:
# modify x in place

unique!(x, :B)

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,String,Int64
1,1,x,1
2,2,y,2
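
The relationship between `nonunique` and `unique` can be made explicit: negating the indicator vector and indexing with it gives the deduplicated rows. A minimal sketch that rebuilds the four-row data frame from above:

```julia
using DataFrames

x = DataFrame(A=[1, 2, 1, 2], B=["x", "y", "x", "y"], C=1:4)

mask = nonunique(x, [:A, :B])         # true marks a repeated (A, B) combination
deduplicated = x[.!mask, :]           # keep only first occurrences
```

The mask is also useful on its own, for example to inspect the duplicated rows with `x[mask, :]` before dropping them.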


## Extracting one row from a DataFrame into standard collections

In [80]:
x = DataFrame(x=[1,missing,2], y=["a", "b", missing], z=[true,false,true])

Unnamed: 0_level_0,x,y,z
Unnamed: 0_level_1,Int64?,String?,Bool
1,1,a,1
2,missing,b,0
3,2,missing,1


In [81]:
cols = [:y, :z]

2-element Vector{Symbol}:
 :y
 :z

In [82]:
# you can use a conversion to a Vector or an Array

Vector(x[1, cols])

2-element Vector{Any}:
     "a"
 true

In [83]:
Array(x[1, cols]) # the same


2-element Vector{Any}:
     "a"
 true

In [85]:
# now you will get a vector of vectors

[Vector(x[i, cols]) for i in axes(x, 1)]

3-element Vector{Vector{Any}}:
 ["a", true]
 ["b", false]
 [missing, true]

In [86]:
# it is easy to convert a DataFrameRow into a NamedTuple

copy(x[1, cols])

NamedTuple{(:y, :z), Tuple{Union{Missing, String}, Bool}}(("a", true))

In [87]:
# or a Tuple

Tuple(x[1, cols])

("a", true)
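
A `DataFrameRow` can also be converted with the `NamedTuple` constructor or turned into a dictionary keyed by column name. A minimal sketch on the same data; the use of `Dict(pairs(row))` is one possible way to build the dictionary, not the only one.

```julia
using DataFrames

x = DataFrame(x=[1, missing, 2], y=["a", "b", missing], z=[true, false, true])
row = x[1, [:y, :z]]                  # a DataFrameRow restricted to two columns

nt = NamedTuple(row)                  # field names are the column names
d  = Dict(pairs(row))                 # keys are Symbols, values are the cell values
```

A NamedTuple preserves the column order, whereas a `Dict` does not.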

## Working with a collection of rows of a data frame
You can use eachrow to get a vector-like collection of DataFrameRows.

In [88]:
df = DataFrame(reshape(1:12, 3, 4), :auto)

Unnamed: 0_level_0,x1,x2,x3,x4
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
1,1,4,7,10
2,2,5,8,11
3,3,6,9,12


In [89]:
er_df = eachrow(df)

Unnamed: 0_level_0,x1,x2,x3,x4
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
1,1,4,7,10
2,2,5,8,11
3,3,6,9,12


In [90]:
er_df[1]


Unnamed: 0_level_0,x1,x2,x3,x4
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
1,1,4,7,10


In [91]:
last(er_df)

Unnamed: 0_level_0,x1,x2,x3,x4
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
3,3,6,9,12


In [92]:
er_df[end]

Unnamed: 0_level_0,x1,x2,x3,x4
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
3,3,6,9,12


In [93]:
# As a DataFrameRows object keeps a connection to the parent data frame, you can get the columns of the parent using getproperty

er_df.x1

3-element Vector{Int64}:
 1
 2
 3
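
The collection returned by `eachrow` can be iterated or broadcast over like a vector. A minimal sketch on the same data (the names `row_sums` and `as_tuples` are illustrative):

```julia
using DataFrames

df = DataFrame(reshape(1:12, 3, 4), :auto)

# compute a value per row by accessing columns on each DataFrameRow
row_sums = [r.x1 + r.x4 for r in eachrow(df)]

# materialize every row as a NamedTuple (copy of a DataFrameRow)
as_tuples = copy.(eachrow(df))
```

Iterating rows this way is convenient but not type stable, so for large data frames column-wise operations are usually preferable.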

## Flattening a data frame
Occasionally you have a data frame in which one column is a vector of collections. You can expand (flatten) such a column using the flatten function.

In [95]:
df = DataFrame(a = 'a':'c', b = [[1, 2, 3], [4, 5], 6])


Unnamed: 0_level_0,a,b
Unnamed: 0_level_1,Char,Any
1,a,"[1, 2, 3]"
2,b,"[4, 5]"
3,c,6


In [96]:
flatten(df, :b)

Unnamed: 0_level_0,a,b
Unnamed: 0_level_1,Char,Int64
1,a,1
2,a,2
3,a,3
4,b,4
5,b,5
6,c,6
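
`flatten` also accepts several columns at once, provided the collections in those columns have matching lengths in every row. A minimal sketch on hypothetical data:

```julia
using DataFrames

# hypothetical data: b and c hold collections of equal length in each row
df = DataFrame(a=[1, 2], b=[[1, 2], [3]], c=[["x", "y"], ["z"]])

flat = flatten(df, [:b, :c])          # expand both columns in lockstep
```

Values of the remaining columns (here `a`) are repeated for each expanded element.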


## Only one row
The only function from Julia Base is also supported in DataFrames.jl. It succeeds if the data frame has exactly one row, in which case that row is returned as a DataFrameRow; otherwise it throws an error.

In [97]:
df = DataFrame(a=1)

Unnamed: 0_level_0,a
Unnamed: 0_level_1,Int64
1,1


In [98]:
only(df)


Unnamed: 0_level_0,a
Unnamed: 0_level_1,Int64
1,1


In [100]:
df2 = repeat(df, 2)

Unnamed: 0_level_0,a
Unnamed: 0_level_1,Int64
1,1
2,1


In [101]:
only(df2)

LoadError: ArgumentError: data frame must contain exactly 1 row
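
If an exception is not desired, the row count can be checked first. The following is a minimal sketch; `single_row_or_nothing` is a hypothetical helper and not part of DataFrames.jl.

```julia
using DataFrames

# Hypothetical helper: return the single row, or `nothing` when the
# data frame does not have exactly one row.
single_row_or_nothing(df) = nrow(df) == 1 ? only(df) : nothing

df1 = DataFrame(a=1)
df2 = repeat(df1, 2)

row  = single_row_or_nothing(df1)     # a DataFrameRow
none = single_row_or_nothing(df2)     # nothing, since df2 has two rows
```

Whether to return `nothing` or let `only` throw depends on whether more than one row signals a bug in the calling code.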