## Manipulate columns of a dataframe

In [2]:
using DataFrames
using Pkg
Pkg.add("BenchmarkTools")
x = DataFrame(rand(Bool, 3, 4), :auto)

Row,x1,x2,x3,x4
,Bool,Bool,Bool,Bool
1,0,0,0,1
2,0,0,1,1
3,0,1,0,1


In [3]:
# Rename a column
rename(x, :x1 => :A)

Row,A,x2,x3,x4
,Bool,Bool,Bool,Bool
1,0,0,0,1
2,0,0,1,1
3,0,1,0,1


In [4]:
# With rename! we do an in-place transformation.

# This time we apply a function to every column name (note that the function receives each column name as a string).

rename!(c -> c^2, x)


Row,x1x1,x2x2,x3x3,x4x4
,Bool,Bool,Bool,Bool
1,0,0,0,1
2,0,0,1,1
3,0,1,0,1


In [5]:
# We can also change the name of a particular column without knowing the original.

# Here we change the name of the third column, creating a new DataFrame.

rename(x, 3 => :third)


Row,x1x1,x2x2,third,x4x4
,Bool,Bool,Bool,Bool
1,0,0,0,1
2,0,0,1,1
3,0,1,0,1


In [6]:
# If we pass a vector of names to rename!, we can change the names of all variables.

rename!(x, [:a, :b, :c, :d])

Row,a,b,c,d
,Bool,Bool,Bool,Bool
1,0,0,0,1
2,0,0,1,1
3,0,1,0,1


In [7]:
# In all the above examples you could have used strings instead of symbols, e.g.

rename!(x, string.('a':'d'))

Row,a,b,c,d
,Bool,Bool,Bool,Bool
1,0,0,0,1
2,0,0,1,1
3,0,1,0,1


In [8]:
rename!(x, "a"=>"d", "d"=>"a")


Row,d,b,c,a
,Bool,Bool,Bool,Bool
1,0,0,0,1
2,0,0,1,1
3,0,1,0,1


In [9]:
# We get an error when we try to provide duplicate names

rename(x, fill(:a, 4))

LoadError: ArgumentError: Duplicate variable names: :a. Pass makeunique=true to make them unique using a suffix automatically.

In [10]:
# unless we pass makeunique=true, which allows us to handle duplicates in passed names.

rename(x, fill(:a, 4), makeunique=true)

Row,a,a_1,a_2,a_3
,Bool,Bool,Bool,Bool
1,0,0,0,1
2,0,0,1,1
3,0,1,0,1
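
The renaming variants above can be collected in one minimal sketch. The data frame `df_demo` and its column names are hypothetical and exist only for illustration; the calls themselves (`rename` with a pair, with a function, and with `makeunique=true`) are the ones used in the cells above.

```julia
using DataFrames

# Hypothetical data frame used only for this illustration.
df_demo = DataFrame(x1 = [1, 2], x2 = [3, 4])

# rename returns a new data frame; df_demo keeps its original names.
renamed = rename(df_demo, :x1 => :id)

# Function form: the function receives each column name as a String.
shouted = rename(uppercase, df_demo)

# Duplicate target names are only accepted with makeunique=true.
deduped = rename(df_demo, [:a, :a], makeunique=true)

println(names(renamed))  # ["id", "x2"]
println(names(shouted))  # ["X1", "X2"]
println(names(deduped))  # ["a", "a_1"]
println(names(df_demo))  # ["x1", "x2"]
```

Because none of these calls uses the `!` variant, the source data frame is left untouched.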


## Reordering columns
We can index with a reordered names(x) vector as needed, which creates a new DataFrame.

In [11]:
using Random
Random.seed!(1234)
x[:, shuffle(names(x))]

Row,b,a,c,d
,Bool,Bool,Bool,Bool
1,0,1,0,0
2,0,1,1,0
3,1,1,0,0


In [12]:
# select! can also be used to achieve this in place (or select to create a copy):

select!(x, 4:-1:1);
x

Row,a,c,b,d
,Bool,Bool,Bool,Bool
1,1,0,0,0
2,1,1,0,0
3,1,0,1,0
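
As a further illustration of reordering, here is a small sketch that moves one column to the front and reverses the column order with `select`. The data frame `df_demo` is a hypothetical example; `Cols(:c, :)` is assumed to behave as a union of selectors in the order given.

```julia
using DataFrames

# Hypothetical data frame used only for this illustration.
df_demo = DataFrame(a = 1:2, b = 3:4, c = 5:6)

# Cols takes the union of its selectors in order, so :c comes first
# and : contributes the remaining columns.
c_first = select(df_demo, Cols(:c, :))

# Any reordered vector of names works as a selector too.
reversed = select(df_demo, reverse(names(df_demo)))

println(names(c_first))   # ["c", "a", "b"]
println(names(reversed))  # ["c", "b", "a"]
```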


## Merging/adding columns

In [13]:
x = DataFrame([(i,j) for i in 1:3, j in 1:4], :auto)


Row,x1,x2,x3,x4
,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


In [14]:
# With hcat we can merge two DataFrames. The [x y] syntax is also supported, but only when the DataFrames have unique column names.

hcat(x, x, makeunique=true)


Row,x1,x2,x3,x4,x1_1,x2_1,x3_1,x4_1
,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)","(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)","(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)","(3, 1)","(3, 2)","(3, 3)","(3, 4)"


In [15]:
# You can append a vector to a data frame as a new column by wrapping it in a DataFrame:

y = [x DataFrame(A=[1,2,3])]


Row,x1,x2,x3,x4,A
,Tuple…,Tuple…,Tuple…,Tuple…,Int64
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)",1
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)",2
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)",3


In [16]:
# Here we do the same but add column :A to the front.

y = [DataFrame(A=[1,2,3]) x]

Row,A,x1,x2,x3,x4
,Int64,Tuple…,Tuple…,Tuple…,Tuple…
1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


In [17]:
# A column can also be added in the middle. Here a brute-force method is used and a new DataFrame is created.

using BenchmarkTools
@btime [$x[!, 1:2] DataFrame(A=[1,2,3]) $x[!, 3:4]]



In [18]:
# We could also do this with the specialized in-place function insertcols!. Let's add :newcol to the DataFrame y.

insertcols!(y, 2, "newcol" => [1,2,3])


Row,A,newcol,x1,x2,x3,x4
,Int64,Int64,Tuple…,Tuple…,Tuple…,Tuple…
1,1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


In [19]:
# If you want to insert the same column name several times, makeunique=true is needed as usual.

insertcols!(y, 2, :newcol => [1,2,3], makeunique=true)


Row,A,newcol_1,newcol,x1,x2,x3,x4
,Int64,Int64,Int64,Tuple…,Tuple…,Tuple…,Tuple…
1,1,1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)"
2,2,2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)"
3,3,3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)"


In [20]:
# We can use @btime to see how much faster it is to insert a column with insertcols! than with hcat (note that we use Pair notation here as an example).

@btime insertcols!(copy($x), 3, :A => [1,2,3])



In [21]:
# Let's use insertcols! to append a column in place (note that we dropped the index at which we insert the column)

insertcols!(x, :A => [1,2,3])

Row,x1,x2,x3,x4,A
,Tuple…,Tuple…,Tuple…,Tuple…,Int64
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)",1
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)",2
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)",3


In [22]:
# and to prepend a column in place.
insertcols!(x, 1, :B => [1,2,3])

Row,B,x1,x2,x3,x4,A
,Int64,Tuple…,Tuple…,Tuple…,Tuple…,Int64
1,1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)",1
2,2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)",2
3,3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)",3


In [23]:
# Note that insertcols! can be used to insert several columns into a data frame at once and that it performs broadcasting if needed:

df = DataFrame(a = [1, 2, 3])
insertcols!(df, :b => "x", :c => 'a':'c', :d => Ref([1,2,3]))


Row,a,b,c,d
,Int64,String,Char,Array…
1,1,x,a,"[1, 2, 3]"
2,2,x,b,"[1, 2, 3]"
3,3,x,c,"[1, 2, 3]"
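
The broadcasting and positioning rules of `insertcols!` can be sketched on a small example. The data frame `df_demo` and the column names `flag`, `id`, and `v` are hypothetical; the sketch assumes, as in the cell above, that scalars are broadcast and that `Ref` protects a vector from being interpreted as a column.

```julia
using DataFrames

# Hypothetical data frame used only for this illustration.
df_demo = DataFrame(a = [1, 2, 3])

# A scalar value is broadcast to every row.
insertcols!(df_demo, :flag => true)

# An integer position inserts the new column at that index.
insertcols!(df_demo, 1, :id => ["r1", "r2", "r3"])

# Ref protects the vector from being treated as a column of length 2,
# so each row stores the whole vector.
insertcols!(df_demo, :v => Ref([10, 20]))

println(names(df_demo))  # ["id", "a", "flag", "v"]
println(df_demo.flag)    # Bool[1, 1, 1]
println(df_demo.v[3])    # [10, 20]
```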


In [24]:
# Interestingly, we can emulate hcat while mutating the data frame in place using insertcols!:

df1 = DataFrame(a=[1,2])


Row,a
,Int64
1,1
2,2


In [25]:
df2 = DataFrame(b=[2,3], c=[3,4])


Row,b,c
,Int64,Int64
1,2,3
2,3,4


In [26]:
hcat(df1, df2)


Row,a,b,c
,Int64,Int64,Int64
1,1,2,3
2,2,3,4


In [27]:
df1 # df1 is not touched


Row,a
,Int64
1,1
2,2


In [28]:
insertcols!(df1, pairs(eachcol(df2))...)

Row,a,b,c
,Int64,Int64,Int64
1,1,2,3
2,2,3,4


In [30]:
df1 # now we have changed df1

Row,a,b,c
,Int64,Int64,Int64
1,1,2,3
2,2,3,4


## Subsetting/removing columns
Let's create a new DataFrame x and show a few ways to create DataFrames with a subset of x's columns.



In [31]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5], :auto)

Row,x1,x2,x3,x4,x5
,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)","(3, 5)"


In [32]:
# First we could do this by index:

x[:, [1,2,4,5]] # use ! instead of : for non-copying operation

Row,x1,x2,x4,x5
,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 4)","(3, 5)"


In [33]:
# or by column name:

x[:, [:x1, :x4]]

Row,x1,x4
,Tuple…,Tuple…
1,"(1, 1)","(1, 4)"
2,"(2, 1)","(2, 4)"
3,"(3, 1)","(3, 4)"


In [34]:
# We can also choose to keep or exclude columns by Bool (we need a vector whose length is the number of columns in the original DataFrame).

x[:, [true, false, true, false, true]]

Row,x1,x3,x5
,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 3)","(1, 5)"
2,"(2, 1)","(2, 3)","(2, 5)"
3,"(3, 1)","(3, 3)","(3, 5)"


In [35]:
# Here we create a single-column DataFrame:

x[:, [:x1]]

Row,x1
,Tuple…
1,"(1, 1)"
2,"(2, 1)"
3,"(3, 1)"


In [36]:
x.x1 # the same column, but returned as a vector rather than a DataFrame


3-element Vector{Tuple{Int64, Int64}}:
 (1, 1)
 (2, 1)
 (3, 1)

In [37]:
# We could grab the same vector by column number
x[!, 1]

3-element Vector{Tuple{Int64, Int64}}:
 (1, 1)
 (2, 1)
 (3, 1)

In [38]:
# Note that getting a single column with ! returns it without copying, while creating a new DataFrame with : copies the column

x[!, 1] === x[:, [1]][!, 1]

false
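
The difference between `!` and `:` can be checked directly with `===`. This is a minimal sketch on a hypothetical data frame `df_demo`; it only uses indexing forms that appear in the cells above.

```julia
using DataFrames

# Hypothetical data frame used only for this illustration.
df_demo = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

# Single-column access: ! returns the stored vector, : returns a copy.
same_vec = df_demo[!, :a] === df_demo.a
copy_vec = df_demo[:, :a] === df_demo.a

# Multi-column access: both return a new DataFrame,
# but only : copies the columns.
shared = df_demo[!, [:a]]
copied = df_demo[:, [:a]]

println(same_vec)                 # true
println(copy_vec)                 # false
println(shared.a === df_demo.a)   # true
println(copied.a === df_demo.a)   # false
```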

In [39]:
# you can also use Regex, All, Between, Cols, and Not (the last one from InvertedIndices.jl) for column selection:

x[!, r"[12]"]


Row,x1,x2
,Tuple…,Tuple…
1,"(1, 1)","(1, 2)"
2,"(2, 1)","(2, 2)"
3,"(3, 1)","(3, 2)"


In [40]:
x[!, Not(1)]


Row,x2,x3,x4,x5
,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 2)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 2)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 2)","(3, 3)","(3, 4)","(3, 5)"


In [41]:
x[!, Between(:x2, :x4)]


Row,x2,x3,x4
,Tuple…,Tuple…,Tuple…
1,"(1, 2)","(1, 3)","(1, 4)"
2,"(2, 2)","(2, 3)","(2, 4)"
3,"(3, 2)","(3, 3)","(3, 4)"


In [42]:
x[!, Cols(:x1, Between(:x3, :x5))]


Row,x1,x3,x4,x5
,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 3)","(3, 4)","(3, 5)"


In [43]:
select(x, :x1, Between(:x3, :x5), copycols=false) # the same as above


Row,x1,x3,x4,x5
,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 3)","(3, 4)","(3, 5)"


In [44]:
# you can use the select and select! functions to select a subset of columns from a data frame: select creates a new data frame, while select! operates in place

df = copy(x)
df2 = select(df, [1, 2])


Row,x1,x2
,Tuple…,Tuple…
1,"(1, 1)","(1, 2)"
2,"(2, 1)","(2, 2)"
3,"(3, 1)","(3, 2)"


In [45]:
select(df, Not([1, 2]))


Row,x3,x4,x5
,Tuple…,Tuple…,Tuple…
1,"(1, 3)","(1, 4)","(1, 5)"
2,"(2, 3)","(2, 4)","(2, 5)"
3,"(3, 3)","(3, 4)","(3, 5)"


In [46]:
# by default select copies columns

df2[!, 1] === df[!, 1]

false

In [47]:
# this can be avoided by using copycols=false keyword argument

df2 = select(df, [1, 2], copycols=false)

Row,x1,x2
,Tuple…,Tuple…
1,"(1, 1)","(1, 2)"
2,"(2, 1)","(2, 2)"
3,"(3, 1)","(3, 2)"


In [48]:
df2[!, 1] === df[!, 1]

true

In [49]:
# using select! will modify the source data frame

select!(df, [1,2])


Row,x1,x2
,Tuple…,Tuple…
1,"(1, 1)","(1, 2)"
2,"(2, 1)","(2, 2)"
3,"(3, 1)","(3, 2)"


In [50]:
df == df2

true

In [51]:
# Here we create a copy of x and delete the 3rd column from the copy with select! and Not.

z = copy(x)
select!(z, Not(3))

Row,x1,x2,x4,x5
,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 4)","(3, 5)"


In [52]:
# alternatively we can achieve the same by using the select function

select(x, Not(3))


Row,x1,x2,x4,x5
,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 4)","(3, 5)"


In [53]:
# x stays unchanged

x

Row,x1,x2,x3,x4,x5
,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)","(3, 5)"


## Views
Note that you can also create a view of a DataFrame when you want a subset of its columns:


In [54]:
@btime $x[:, [1,3,5]]


In [55]:
@btime @view $x[:, [1,3,5]]



(Currently, creating the view is slow, but it is expected to become significantly faster in upcoming releases of the DataFrames.jl package.)
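
Independently of timing, a view differs from a copy in that it keeps referring to its parent. The following is a minimal sketch on a hypothetical data frame `df_demo`; it assumes that `@view` on a data frame returns a `SubDataFrame`.

```julia
using DataFrames

# Hypothetical data frame used only for this illustration.
df_demo = DataFrame(a = [1, 2, 3], b = [4, 5, 6], c = [7, 8, 9])

# The view is a SubDataFrame that refers back to df_demo.
v = @view df_demo[:, [:a, :c]]

# Mutating the parent's column is visible through the view.
df_demo.a[1] = 100

println(v isa SubDataFrame)     # true
println(parent(v) === df_demo)  # true
println(names(v))               # ["a", "c"]
println(v.a[1])                 # 100
```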

## Modify column by name


In [56]:
x = DataFrame([(i,j) for i in 1:3, j in 1:5], :auto)

# With the following syntax, the existing column is replaced without performing any copying (this is discouraged as it creates a column alias).

x[!, :x1] = x[!, :x2]
x

Row,x1,x2,x3,x4,x5
,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 2)","(1, 2)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 2)","(2, 2)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 2)","(3, 2)","(3, 3)","(3, 4)","(3, 5)"


In [57]:
# this syntax is safer, because x[:, :x2] copies the column before it is assigned

x[!, :x1] = x[:, :x2]


3-element Vector{Tuple{Int64, Int64}}:
 (1, 2)
 (2, 2)
 (3, 2)
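
Why the alias is discouraged can be seen in a short sketch. The data frame `df_demo` is hypothetical; the two assignment forms are the ones shown in the cells above.

```julia
using DataFrames

# Hypothetical data frame used only for this illustration.
df_demo = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

# Aliasing: both names now refer to the same vector.
df_demo[!, :a] = df_demo[!, :b]
aliased = df_demo.a === df_demo.b
df_demo.b[1] = 99
leaked = df_demo.a[1]   # the change to :b shows up in :a

# Safer: [:, :b] copies, so the columns are independent.
df_demo[!, :a] = df_demo[:, :b]
independent = df_demo.a !== df_demo.b
df_demo.b[1] = 0

println(aliased)        # true
println(leaked)         # 99
println(independent)    # true
println(df_demo.a[1])   # 99
```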

In [58]:
# We can also use the following syntax to add a new column at the end of a DataFrame.
x[!, :A] = [1,2,3]
x

Row,x1,x2,x3,x4,x5,A
,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…,Int64
1,"(1, 2)","(1, 2)","(1, 3)","(1, 4)","(1, 5)",1
2,"(2, 2)","(2, 2)","(2, 3)","(2, 4)","(2, 5)",2
3,"(3, 2)","(3, 2)","(3, 3)","(3, 4)","(3, 5)",3


In [59]:
# A new column will be added to our DataFrame with the following syntax as well:

x.B = 11:13
x

Row,x1,x2,x3,x4,x5,A,B
,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…,Int64,Int64
1,"(1, 2)","(1, 2)","(1, 3)","(1, 4)","(1, 5)",1,11
2,"(2, 2)","(2, 2)","(2, 3)","(2, 4)","(2, 5)",2,12
3,"(3, 2)","(3, 2)","(3, 3)","(3, 4)","(3, 5)",3,13


## Find column name

In [60]:

x = DataFrame([(i,j) for i in 1:3, j in 1:5], :auto)

Row,x1,x2,x3,x4,x5
,Tuple…,Tuple…,Tuple…,Tuple…,Tuple…
1,"(1, 1)","(1, 2)","(1, 3)","(1, 4)","(1, 5)"
2,"(2, 1)","(2, 2)","(2, 3)","(2, 4)","(2, 5)"
3,"(3, 1)","(3, 2)","(3, 3)","(3, 4)","(3, 5)"


In [61]:
# We can check if a column with a given name exists via

hasproperty(x, :x1)

true

In [62]:
# and determine its index via

columnindex(x, :x2)


2
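
A minimal sketch of these lookups on a hypothetical data frame `df_demo`, including what happens for a column that does not exist. The name `:nope` is made up for the illustration; `columnindex` is assumed to return 0 for a missing column, as its documentation states.

```julia
using DataFrames

# Hypothetical data frame used only for this illustration.
df_demo = DataFrame(x1 = 1:3, x2 = 4:6)

println(hasproperty(df_demo, :x2))     # true
println("x2" in names(df_demo))        # true
println(columnindex(df_demo, :x2))     # 2
println(columnindex(df_demo, :nope))   # 0
```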

## Advanced ways of column selection
These are most useful for non-standard column names (e.g. containing spaces).

In [63]:
df = DataFrame()
df.x1 = 1:3
df[!, "column 2"] = 4:6
df

Row,x1,column 2
,Int64,Int64
1,1,4
2,2,5
3,3,6


In [64]:
df."column 2"


3-element Vector{Int64}:
 4
 5
 6

In [65]:
df[:, "column 2"]

3-element Vector{Int64}:
 4
 5
 6

In [66]:
# or you can interpolate a column name using the :() syntax

for n in names(df)
    println(n, "\n", df.:($n), "\n")
end

x1
[1, 2, 3]

column 2
[4, 5, 6]
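
The access forms above all reach the same stored column. The following sketch checks that on a hypothetical data frame `df_demo` and then uses string names in a loop; the `totals` dictionary is an illustrative helper, not part of the tutorial.

```julia
using DataFrames

# Hypothetical data frame with a non-standard column name.
df_demo = DataFrame("x1" => 1:3, "column 2" => 4:6)

# Three equivalent, non-copying ways to reach the column.
by_index = df_demo[!, "column 2"]
by_property = df_demo."column 2"
by_getproperty = getproperty(df_demo, "column 2")

# String names returned by names() can be used directly for indexing.
totals = Dict(n => sum(df_demo[!, n]) for n in names(df_demo))

println(by_index === by_property)        # true
println(by_index === by_getproperty)     # true
println(totals["column 2"])              # 15
```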



## Working on a collection of columns

When using eachcol on a data frame, the resulting object retains a reference to its parent and can, for example, be queried with getproperty.

In [67]:
df = DataFrame(reshape(1:12, 3, 4), :auto)

Row,x1,x2,x3,x4
,Int64,Int64,Int64,Int64
1,1,4,7,10
2,2,5,8,11
3,3,6,9,12


In [68]:
ec_df = eachcol(df)

Row,x1,x2,x3,x4
,Int64,Int64,Int64,Int64
1,1,4,7,10
2,2,5,8,11
3,3,6,9,12


In [69]:
ec_df[1]


3-element Vector{Int64}:
 1
 2
 3

In [70]:
ec_df.x1

3-element Vector{Int64}:
 1
 2
 3
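
Since the object returned by `eachcol` can be iterated, it is convenient for applying a function to every column. This is a minimal sketch on a hypothetical data frame `df_demo` built the same way as `df` above; `col_sums` and `totals` are illustrative names.

```julia
using DataFrames

# Hypothetical data frame used only for this illustration.
df_demo = DataFrame(reshape(1:12, 3, 4), :auto)
ec = eachcol(df_demo)

# Apply a function to every column.
col_sums = map(sum, ec)

# pairs gives column name => column vector, as used with insertcols! earlier.
totals = Dict(name => sum(col) for (name, col) in pairs(ec))

println(col_sums)                  # [6, 15, 24, 33]
println(totals[:x2])               # 15
println(ec.x2 === df_demo.x2)      # true
```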