# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), November 20, 2020**

In [1]:
using DataFrames

## Getting basic information about a data frame

Let's start by creating a `DataFrame` object, `x`, so that we can learn how to get information on that data frame.

In [2]:
x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64? | String |
| 1 | 1 | 1.0 | a |
| 2 | 2 | missing | b |


The standard `size` function works to get dimensions of the `DataFrame`,

In [3]:
size(x), size(x, 1), size(x, 2)

((2, 3), 2, 3)

as do `nrow` and `ncol`, which mirror the R functions of the same names.

In [4]:
nrow(x), ncol(x)

(2, 3)

`describe` gives basic summary statistics of the data in your `DataFrame` (check the help of `describe` for information on how to customize the statistics shown).

In [5]:
describe(x)

| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | Type |
| 1 | A | 1.5 | 1 | 1.5 | 2 | 0 | Int64 |
| 2 | B | 1.0 | 1.0 | 1.0 | 1.0 | 1 | Union{Missing, Float64} |
| 3 | C | | a | | b | 0 | String |
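
As a minimal sketch of such customization, you can pass the names of the statistics you want as positional `Symbol` arguments. The variable name `stats` is ours, and the small data frame is rebuilt so that the snippet runs on its own:

```julia
using DataFrames

x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

# ask only for the mean and the number of missing values of each column
stats = describe(x, :mean, :nmissing)
```

The result is again a data frame with one row per column of `x`. No mean can be computed for the string column `C`, so that entry holds `nothing`.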


You can limit the columns shown by `describe` using the `cols` keyword argument:

In [6]:
describe(x, cols=1:2)

| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Float64 | Real | Float64 | Real | Int64 | Type |
| 1 | A | 1.5 | 1.0 | 1.5 | 2.0 | 0 | Int64 |
| 2 | B | 1.0 | 1.0 | 1.0 | 1.0 | 1 | Union{Missing, Float64} |


`names` returns the names of all columns as strings:

In [7]:
names(x)

3-element Array{String,1}:
 "A"
 "B"
 "C"

You can also get the names of the columns with a given `eltype`:

In [8]:
names(x, String)

1-element Array{String,1}:
 "C"
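
The sketch below shows a few more ways to query column names, on the assumption that `names` accepts the same column selectors as indexing does (a type, a `Regex`, or `Not`). The variable names are ours:

```julia
using DataFrames

x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

# columns whose element type is a number, possibly allowing missing
numeric = names(x, Union{Missing, Number})

# column B is excluded here because its eltype, Union{Missing, Float64}, is not a subtype of Number
strict = names(x, Number)

# columns whose name matches a regular expression
by_regex = names(x, r"^[AB]$")

# all columns except A
all_but_a = names(x, Not(:A))
```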

Use `propertynames` to get a vector of `Symbol`s:

In [9]:
propertynames(x)

3-element Array{Symbol,1}:
 :A
 :B
 :C

Broadcasting `eltype` over `eachcol(x)` returns the element types of the columns:

In [10]:
eltype.(eachcol(x))

3-element Array{Type,1}:
 Int64
 Union{Missing, Float64}
 String

Here we create a larger `DataFrame`

In [11]:
y = DataFrame(rand(1:10, 1000, 10), :auto)

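| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 5 | 9 | 3 | 2 | 8 | 8 | 8 | 3 | 8 | 1 |
| 2 | 4 | 10 | 3 | 7 | 1 | 4 | 5 | 6 | 8 | 5 |
| 3 | 10 | 4 | 1 | 4 | 4 | 3 | 9 | 1 | 2 | 9 |
| 4 | 10 | 10 | 4 | 3 | 3 | 3 | 4 | 5 | 1 | 2 |
| 5 | 4 | 5 | 5 | 4 | 1 | 1 | 6 | 1 | 8 | 4 |
| 6 | 10 | 9 | 2 | 1 | 1 | 9 | 10 | 6 | 3 | 7 |
| 7 | 1 | 6 | 6 | 9 | 3 | 9 | 6 | 6 | 7 | 8 |
| 8 | 7 | 9 | 3 | 4 | 4 | 2 | 1 | 9 | 6 | 4 |
| 9 | 10 | 3 | 7 | 10 | 5 | 8 | 9 | 9 | 4 | 2 |
| 10 | 7 | 3 | 2 | 10 | 10 | 2 | 2 | 4 | 2 | 6 |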


and then we can use `first` to peek into its first few rows

In [12]:
first(y, 5)

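| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 5 | 9 | 3 | 2 | 8 | 8 | 8 | 3 | 8 | 1 |
| 2 | 4 | 10 | 3 | 7 | 1 | 4 | 5 | 6 | 8 | 5 |
| 3 | 10 | 4 | 1 | 4 | 4 | 3 | 9 | 1 | 2 | 9 |
| 4 | 10 | 10 | 4 | 3 | 3 | 3 | 4 | 5 | 1 | 2 |
| 5 | 4 | 5 | 5 | 4 | 1 | 1 | 6 | 1 | 8 | 4 |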


and `last` to see its bottom rows.

In [13]:
last(y, 3)

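| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 8 | 1 | 1 | 10 | 3 | 7 | 1 | 2 | 6 | 9 |
| 2 | 2 | 8 | 9 | 3 | 3 | 7 | 5 | 5 | 4 | 9 |
| 3 | 8 | 6 | 4 | 1 | 5 | 9 | 5 | 9 | 3 | 5 |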


Using `first` and `last` without a number of rows returns the first or last row of the `DataFrame` as a `DataFrameRow`:

In [14]:
first(y)

| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 5 | 9 | 3 | 2 | 8 | 8 | 8 | 3 | 8 | 1 |


In [15]:
last(y)

| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|---|---|---|---|---|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1000 | 8 | 6 | 4 | 1 | 5 | 9 | 5 | 9 | 3 | 5 |


### Displaying large data frames

Create a wide and tall data frame:

In [16]:
df = DataFrame(rand(100, 100), :auto)

| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 |
|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.0407606 | 0.117232 | 0.348204 | 0.269438 | 0.452216 | 0.31456 | 0.677597 | 0.719277 |
| 2 | 0.376506 | 0.482919 | 0.669578 | 0.43432 | 0.975168 | 0.03292 | 0.181948 | 0.232741 |
| 3 | 0.753126 | 0.00911066 | 0.991773 | 0.296264 | 0.270301 | 0.0932391 | 0.494824 | 0.97819 |
| 4 | 0.814347 | 0.312632 | 0.805042 | 0.545419 | 0.169538 | 0.688139 | 0.600611 | 0.591047 |
| 5 | 0.584648 | 0.449159 | 0.781232 | 0.704917 | 0.401786 | 0.262193 | 0.763988 | 0.114793 |
| 6 | 0.557345 | 0.106035 | 0.788548 | 0.216535 | 0.678853 | 0.572301 | 0.231112 | 0.367835 |
| 7 | 0.29653 | 0.932721 | 0.316042 | 0.667673 | 0.66609 | 0.0961946 | 0.837986 | 0.901083 |
| 8 | 0.0647716 | 0.423136 | 0.449464 | 0.0786289 | 0.114109 | 0.186914 | 0.321308 | 0.212699 |
| 9 | 0.650907 | 0.443851 | 0.433511 | 0.0636988 | 0.153576 | 0.919171 | 0.140223 | 0.0825558 |
| 10 | 0.958762 | 0.368439 | 0.0191283 | 0.717186 | 0.0771803 | 0.386313 | 0.05567 | 0.574292 |


We can see that 92 of its columns were not printed. Also, only its first 30 rows are shown. You can easily change this behavior by changing the values of `ENV["LINES"]` and `ENV["COLUMNS"]`.

In [17]:
ENV["LINES"] = 10

10

In [18]:
ENV["COLUMNS"] = 200

200

In [19]:
df

| Row | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 | x16 | x17 | x18 | x19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.0407606 | 0.117232 | 0.348204 | 0.269438 | 0.452216 | 0.31456 | 0.677597 | 0.719277 | 0.615861 | 0.36204 | 0.671318 | 0.45735 | 0.238713 | 0.647599 | 0.393895 | 0.611443 | 0.0909405 | 0.236412 | 0.990843 |
| 2 | 0.376506 | 0.482919 | 0.669578 | 0.43432 | 0.975168 | 0.03292 | 0.181948 | 0.232741 | 0.0667648 | 0.160548 | 0.65104 | 0.336965 | 0.94655 | 0.751701 | 0.137363 | 0.0701963 | 0.846639 | 0.422752 | 0.429095 |
| 3 | 0.753126 | 0.00911066 | 0.991773 | 0.296264 | 0.270301 | 0.0932391 | 0.494824 | 0.97819 | 0.996029 | 0.222652 | 0.949151 | 0.851107 | 0.497406 | 0.131852 | 0.288838 | 0.906171 | 0.111243 | 0.366607 | 0.693664 |
| 4 | 0.814347 | 0.312632 | 0.805042 | 0.545419 | 0.169538 | 0.688139 | 0.600611 | 0.591047 | 0.596966 | 0.370545 | 0.922549 | 0.478018 | 0.0408179 | 0.400419 | 0.761952 | 0.752568 | 0.0598793 | 0.762053 | 0.765239 |
| 5 | 0.584648 | 0.449159 | 0.781232 | 0.704917 | 0.401786 | 0.262193 | 0.763988 | 0.114793 | 0.852059 | 0.338727 | 0.117684 | 0.746356 | 0.115795 | 0.184791 | 0.140508 | 0.187075 | 0.347514 | 0.211822 | 0.261907 |
| 6 | 0.557345 | 0.106035 | 0.788548 | 0.216535 | 0.678853 | 0.572301 | 0.231112 | 0.367835 | 0.0288492 | 0.0479094 | 0.612561 | 0.0194852 | 0.0324983 | 0.150302 | 0.524042 | 0.548574 | 0.652731 | 0.780659 | 0.644513 |
| 7 | 0.29653 | 0.932721 | 0.316042 | 0.667673 | 0.66609 | 0.0961946 | 0.837986 | 0.901083 | 0.0271284 | 0.23356 | 0.822109 | 0.807471 | 0.302836 | 0.00490122 | 0.416054 | 0.844796 | 0.133597 | 0.555466 | 0.742581 |
| 8 | 0.0647716 | 0.423136 | 0.449464 | 0.0786289 | 0.114109 | 0.186914 | 0.321308 | 0.212699 | 0.453385 | 0.697554 | 0.55713 | 0.675983 | 0.53533 | 0.676487 | 0.317385 | 0.359657 | 0.79792 | 0.243943 | 0.46387 |
| 9 | 0.650907 | 0.443851 | 0.433511 | 0.0636988 | 0.153576 | 0.919171 | 0.140223 | 0.0825558 | 0.0033033 | 0.847624 | 0.191551 | 0.574725 | 0.341964 | 0.659708 | 0.701146 | 0.0615678 | 0.566943 | 0.35424 | 0.818846 |
| 10 | 0.958762 | 0.368439 | 0.0191283 | 0.717186 | 0.0771803 | 0.386313 | 0.05567 | 0.574292 | 0.0668751 | 0.563988 | 0.446098 | 0.838466 | 0.211583 | 0.671035 | 0.770441 | 0.0935197 | 0.536912 | 0.718397 | 0.966319 |
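
If you prefer not to change environment variables, a possible alternative is to call `show` with the `allrows` and `allcols` keyword arguments. The sketch below writes to an `IOBuffer` instead of the screen (our choice, to keep the example quiet); the names `wide`, `io` and `text` are ours:

```julia
using DataFrames

wide = DataFrame(rand(100, 100), :auto)

# render every row and every column, regardless of the display size
io = IOBuffer()
show(io, wide, allrows = true, allcols = true)
text = String(take!(io))
```

Calling `show(wide, allrows = true, allcols = true)` without the buffer prints the same full table to standard output.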


### Most elementary get and set operations

Given the `DataFrame` `x` we have created earlier, here are various ways to grab one of its columns as a `Vector`.

In [20]:
x

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64? | String |
| 1 | 1 | 1.0 | a |
| 2 | 2 | missing | b |


In [21]:
x.A, x[!, 1], x[!, :A] # all get the vector stored in our DataFrame without copying it

([1, 2], [1, 2], [1, 2])

In [22]:
x."A", x[!, "A"] # the same using string indexing

([1, 2], [1, 2])

In [23]:
x[:, 1] # note that this creates a copy

2-element Array{Int64,1}:
 1
 2

In [24]:
x[:, 1] === x[:, 1]

false
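
To make the difference concrete, here is a minimal sketch (the names `alias` and `snapshot` are ours) showing that a change made through the non-copying access is visible in the data frame, while the copy is unaffected:

```julia
using DataFrames

x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

alias = x.A          # the very vector stored in x
snapshot = x[:, :A]  # an independent copy of that vector

alias[1] = 100       # mutates the column held by x
```

After this, `x.A[1]` is `100`, whereas `snapshot` still holds the original values.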

To grab one row as a `DataFrame`, we can index as follows.

In [25]:
x[1:1, :]

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64? | String |
| 1 | 1 | 1.0 | a |


In [26]:
x[1, :] # this produces a DataFrameRow, which is treated as a 1-dimensional object similar to a NamedTuple

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64? | String |
| 1 | 1 | 1.0 | a |
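
A `DataFrameRow` is a view into its parent data frame, so writing to it changes the parent. The sketch below (with our own names `row` and `nt`) also uses `copy`, which we understand to turn a `DataFrameRow` into a detached `NamedTuple`:

```julia
using DataFrames

x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

row = x[1, :]   # a DataFrameRow pointing at the first row of x
row.A = 10      # writes through to x

nt = copy(row)  # a NamedTuple snapshot, no longer linked to x
row.A = 20      # changes x again, but not nt
```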


We can grab a single cell or element with the same syntax that is used to grab an element of an array.

In [27]:
x[1, 1]

1

or a new `DataFrame` that is a subset of rows and columns

In [28]:
x[1:2, 1:2]

| Row | A | B |
|---|---|---|
| | Int64 | Float64? |
| 1 | 1 | 1.0 |
| 2 | 2 | missing |


You can also use a `Regex` to select columns, and `Not` from InvertedIndices.jl to select both rows and columns:

In [29]:
x[Not(1), r"A"]

| Row | A |
|---|---|
| | Int64 |
| 1 | 2 |


In [30]:
x[!, Not(1)] # ! indicates that underlying columns are not copied

| Row | B | C |
|---|---|---|
| | Float64? | String |
| 1 | 1.0 | a |
| 2 | missing | b |


In [31]:
x[:, Not(1)] # : means that the columns will get copied

| Row | B | C |
|---|---|---|
| | Float64? | String |
| 1 | 1.0 | a |
| 2 | missing | b |


A scalar can be assigned to a range of cells of a data frame using broadcasting:

In [32]:
x[1:2, 1:2] .= 1
x

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64? | String |
| 1 | 1 | 1.0 | a |
| 2 | 1 | 1.0 | b |


A vector whose length equals the number of assigned rows can also be assigned using broadcasting:

In [33]:
x[1:2, 1:2] .= [1,2]
x

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64? | String |
| 1 | 1 | 1.0 | a |
| 2 | 2 | 2.0 | b |


Another data frame of matching size and column names can be assigned as well, again using broadcasting:

In [34]:
x[1:2, 1:2] .= DataFrame([5 6; 7 8], [:A, :B])
x

| Row | A | B | C |
|---|---|---|---|
| | Int64 | Float64? | String |
| 1 | 5 | 6.0 | a |
| 2 | 7 | 8.0 | b |


**Caution**

With the `df[!, :col]` and `df.col` syntax you get direct (non-copying) access to a column of a data frame.
This is potentially unsafe, as you can easily corrupt the data in the `df` data frame if you resize, sort, etc. the column obtained in this way.
Therefore such access should be used with caution.

Similarly, `df[!, cols]`, when `cols` is a collection of columns, produces a new data frame that holds the same (not copied) columns as the source `df` data frame. Modifying the data frame obtained via `df[!, cols]` might therefore cause problems with the consistency of `df`.

The `df[:, :col]` and `df[:, cols]` syntaxes always copy columns so they are safe to use (and should generally be preferred except for performance or memory critical use cases).
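
The following minimal sketch illustrates both points on a small made-up data frame (the column names `id` and `name` and the variable names are ours): `df[!, cols]` shares its columns with `df`, and sorting a directly accessed column silently breaks the alignment of rows.

```julia
using DataFrames

df = DataFrame(id = [3, 1, 2], name = ["c", "a", "b"])

shared = df[!, [:id]]  # new data frame that holds the same vector as df
copied = df[:, [:id]]  # new data frame that holds a copy

# sorts only the stored :id vector in place; :name keeps its old order,
# so the rows of df no longer pair each id with its original name
sort!(df.id)
```

After the call, `df.id` is `[1, 2, 3]` while `df.name` is still `["c", "a", "b"]`, and `shared` has changed along with `df`, whereas `copied` still holds `[3, 1, 2]`.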

Here are examples of how `Cols` and `Between` can be used to select columns of a data frame.

In [35]:
x = DataFrame(rand(4, 5), :auto)

| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.779175 | 0.571052 | 0.00519813 | 0.580951 | 0.727203 |
| 2 | 0.284709 | 0.264599 | 0.618271 | 0.625157 | 0.265699 |
| 3 | 0.956069 | 0.0278224 | 0.875838 | 0.0169878 | 0.396787 |
| 4 | 0.853433 | 0.577229 | 0.378074 | 0.225203 | 0.747455 |


In [36]:
x[:, Between(:x2, :x4)]

| Row | x2 | x3 | x4 |
|---|---|---|---|
| | Float64 | Float64 | Float64 |
| 1 | 0.571052 | 0.00519813 | 0.580951 |
| 2 | 0.264599 | 0.618271 | 0.625157 |
| 3 | 0.0278224 | 0.875838 | 0.0169878 |
| 4 | 0.577229 | 0.378074 | 0.225203 |


In [37]:
x[:, Cols("x1", Between("x2", "x4"))]

| Row | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.779175 | 0.571052 | 0.00519813 | 0.580951 |
| 2 | 0.284709 | 0.264599 | 0.618271 | 0.625157 |
| 3 | 0.956069 | 0.0278224 | 0.875838 | 0.0169878 |
| 4 | 0.853433 | 0.577229 | 0.378074 | 0.225203 |
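
Since the selected values are random, it can be easier to see what a selector picks by passing it to `names`. This is a small sketch with our own variable names, assuming that `names` accepts these selectors just as indexing does:

```julia
using DataFrames

x = DataFrame(rand(4, 5), :auto)

# a contiguous block of columns
block = names(x, Between(:x2, :x4))

# Cols takes the union of its arguments, in the order in which they are given
reordered = names(x, Cols(:x5, Between(:x1, :x2)))

# selectors of different kinds can be mixed, here a Regex and a Symbol
mixed = names(x, Cols(r"[12]$", :x5))
```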


### Views

You can easily create a view of a `DataFrame` (this is more efficient than creating a materialized selection). Here are the possible return types.

In [38]:
@view x[1:2, 1]

2-element view(::Array{Float64,1}, 1:2) with eltype Float64:
 0.7791753981936183
 0.2847094117872502

In [39]:
@view x[1,1]

0-dimensional view(::Array{Float64,1}, 1) with eltype Float64:
0.7791753981936183

In [40]:
@view x[1, 1:2] # a DataFrameRow, the same as for x[1, 1:2] without a view

| Row | x1 | x2 |
|---|---|---|
| | Float64 | Float64 |
| 1 | 0.779175 | 0.571052 |


In [41]:
@view x[1:2, 1:2] # a SubDataFrame

| Row | x1 | x2 |
|---|---|---|
| | Float64 | Float64 |
| 1 | 0.779175 | 0.571052 |
| 2 | 0.284709 | 0.264599 |
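
Because a view does not copy data, writing through it changes the parent data frame. Here is a minimal sketch on a small made-up data frame (the names `t`, `v` and `materialized` are ours):

```julia
using DataFrames

t = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

v = @view t[1:2, :]     # a SubDataFrame covering the first two rows
v[1, :a] = 100          # the write lands in t

materialized = copy(v)  # an independent DataFrame with the selected rows
```

Here `parent(v)` is `t` itself, and later changes to `t` do not affect `materialized`.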


### Adding new columns to a data frame

In [42]:
df = DataFrame()

using `setproperty!`

In [43]:
x = [1, 2, 3]
df.a = x
df

| Row | a |
|---|---|
| | Int64 |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |


In [44]:
df.a === x # no copy is performed

true

using `setindex!`

In [45]:
df[!, :b] = x
df[:, :c] = x
df

| Row | a | b | c |
|---|---|---|---|
| | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 |


In [46]:
df.b === x # no copy

true

In [47]:
df.c === x # copy

false

In [48]:
df[!, :d] .= x
df[:, :e] .= x
df

| Row | a | b | c | d | e |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |


In [49]:
df.d === x, df.e === x # both copy, so in this case `!` and `:` have the same effect

(false, false)
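
The two forms do differ when the column already exists. In the sketch below (the names `t` and `inplace_error` are ours), `t[:, :A] .= v` writes into the existing `Int64` vector and therefore cannot store `1.5`, while `t[!, :A] .= v` replaces the column with a newly allocated vector of a matching element type:

```julia
using DataFrames

t = DataFrame(A = [1, 2])

# in-place broadcasting must convert 1.5 to Int64, which fails
inplace_error = try
    t[:, :A] .= 1.5
    nothing
catch err
    err
end

# replacing the column allocates a fresh Float64 vector
t[!, :A] .= 1.5
```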

Note that in our data frame columns `:a` and `:b` store the vector `x` itself (not a copy):

In [50]:
df.a === df.b === x

true

This can lead to silent errors. For example, the following code leads to a bug (note that calling `pairs` on `eachcol(df)` creates an iterator of (column name, column) pairs):

In [51]:
for (n, c) in pairs(eachcol(df))
    println("$n: ", pop!(c))
end

a: 3
b: 2
c: 3
d: 3
e: 3


Note that for column `:b` we printed `2`, because `3` had already been removed from it when we used `pop!` on column `:a`.
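
One way to avoid this kind of aliasing is to let DataFrames.jl copy the vectors. As a sketch with our own variable names, the keyword constructor copies its arguments by default, and sharing has to be requested explicitly with `copycols = false`:

```julia
using DataFrames

x = [1, 2, 3]

safe = DataFrame(a = x, b = x)                      # every column is a separate copy
shared = DataFrame(a = x, b = x, copycols = false)  # both columns are the vector x

# with independent columns, each pop! removes the last element of its own vector
popped = [pop!(c) for c in eachcol(safe)]
```

Here every column of `safe` gives up its own last element, so the data frame stays consistent and `x` is left untouched.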

Such mistakes sometimes happen. Because of this, DataFrames.jl performs consistency checks before doing an expensive operation (most notably before showing a data frame).

In [52]:
df

AssertionError: Data frame is corrupt: length of column :c (2) does not match length of column 1 (1). The column vector has likely been resized unintentionally (either directly or because it is shared with another data frame).

We can investigate the columns to find out what happened:

In [53]:
collect(pairs(eachcol(df)))

5-element Array{Pair{Symbol,AbstractArray{T,1} where T},1}:
 :a => [1]
 :b => [1]
 :c => [1, 2]
 :d => [1, 2]
 :e => [1, 2]

The output confirms that the data frame `df` got corrupted.
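
A quick way to spot such corruption yourself is to compare the lengths of the columns. The following is only an illustrative sketch on a deliberately corrupted data frame (all names are ours), not the check that DataFrames.jl performs internally:

```julia
using DataFrames

x = [1, 2, 3]
bad = DataFrame(a = x, b = x, c = [4, 5, 6], copycols = false)

pop!(bad.a)  # also shortens :b, which is the same vector as :a

# collect the length of every column and see whether they all agree
lens = [length(c) for c in eachcol(bad)]
consistent = length(unique(lens)) == 1
```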

DataFrames.jl supports a complete set of `getindex`, `getproperty`, `setindex!`, `setproperty!`, `view`, broadcasting, and broadcasting assignment operations. The details are explained here: http://juliadata.github.io/DataFrames.jl/latest/lib/indexing/.

### Comparisons

In [54]:
using DataFrames

In [55]:
df = DataFrame(rand(2,3), :auto)

| Row | x1 | x2 | x3 |
|---|---|---|---|
| | Float64 | Float64 | Float64 |
| 1 | 0.35208 | 0.855444 | 0.0661244 |
| 2 | 0.814754 | 0.591507 | 0.744025 |


In [56]:
df2 = copy(df)

| Row | x1 | x2 | x3 |
|---|---|---|---|
| | Float64 | Float64 | Float64 |
| 1 | 0.35208 | 0.855444 | 0.0661244 |
| 2 | 0.814754 | 0.591507 | 0.744025 |


In [57]:
df == df2 # compares column names and contents

true

Create a minimally different data frame and use `isapprox` for comparison:

In [58]:
df3 = df2 .+ eps()

| Row | x1 | x2 | x3 |
|---|---|---|---|
| | Float64 | Float64 | Float64 |
| 1 | 0.35208 | 0.855444 | 0.0661244 |
| 2 | 0.814754 | 0.591507 | 0.744025 |


In [59]:
df == df3

false

In [60]:
isapprox(df, df3)

true

In [61]:
isapprox(df, df3, atol = eps()/2)

false
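
Besides `atol`, `isapprox` on data frames also takes a relative tolerance, `rtol`, in the same way as `isapprox` for arrays. This is a small deterministic sketch with made-up values (the names are ours):

```julia
using DataFrames

a = DataFrame(v = [1.0, 2.0])
b = DataFrame(v = [1.0, 2.0 + 1e-10])

exact = a == b                         # the values differ, so this is false
default_tol = isapprox(a, b)           # within the default relative tolerance
strict = isapprox(a, b, rtol = 1e-12)  # the difference exceeds this tighter tolerance
```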

`missing` values are handled as in Julia Base:

In [62]:
df = DataFrame(a=missing)

| Row | a |
|---|---|
| | Missing |
| 1 | missing |


In [63]:
df == df

missing

In [64]:
df === df

true

In [65]:
isequal(df, df)

true
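
The same three-valued logic applies when only some cells are missing. In the sketch below (the variable names are ours), `==` returns `missing` when the non-missing cells agree, `false` as soon as a definite mismatch is found, and `isequal` always returns a `Bool`:

```julia
using DataFrames

a = DataFrame(v = [1, missing])
b = DataFrame(v = [1, missing])
c = DataFrame(v = [2, missing])

undecided = a == b       # missing: equality of the missing cells is unknown
mismatch = a == c        # false: 1 and 2 definitely differ
same = isequal(a, b)     # true: isequal treats missing as equal to missing
```

If you need a `Bool` from `==`, wrapping the comparison in `coalesce(a == b, false)` treats the undecided case as `false`.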