# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), July 17, 2021**

In [1]:
using DataFrames

## Getting basic information about a data frame

Let's start by creating a `DataFrame` object, `x`, so that we can learn how to get information on that data frame.

In [2]:
x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
2,2,missing,b


The standard `size` function works to get dimensions of the `DataFrame`,

In [3]:
size(x), size(x, 1), size(x, 2)

((2, 3), 2, 3)

as do `nrow` and `ncol`, whose names will be familiar from R.

In [4]:
nrow(x), ncol(x)

(2, 3)

`describe` gives basic summary statistics of data in your `DataFrame` (check out the help of `describe` for information on how to customize shown statistics).

In [5]:
describe(x)

Row,variable,mean,min,median,max,nmissing,eltype
,Symbol,Union…,Any,Union…,Any,Int64,Type
1,A,1.5,1,1.5,2,0,Int64
2,B,1.0,1.0,1.0,1.0,1,"Union{Missing, Float64}"
3,C,,a,,b,0,String
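
As a hedged illustration of the customization mentioned above, the sketch below passes statistic names positionally to `describe`, and also supplies a `function => name` pair. The data frame `demo` and the column name `:total` are hypothetical names chosen for this example only.

```julia
using DataFrames

demo = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

# Request only selected built-in statistics by passing their names as Symbols.
stats = describe(demo, :mean, :nmissing)

# A custom statistic can be given as a `function => name` pair;
# here it is restricted to column :A, and `:total` is a made-up name.
custom = describe(demo, :min, sum => :total; cols = [:A])
```

The result of `describe` is itself a `DataFrame`, so it can be indexed and filtered like any other data frame.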


You can limit the columns shown by `describe` using the `cols` keyword argument:

In [6]:
describe(x, cols=1:2)

Row,variable,mean,min,median,max,nmissing,eltype
,Symbol,Float64,Real,Float64,Real,Int64,Type
1,A,1.5,1.0,1.5,2.0,0,Int64
2,B,1.0,1.0,1.0,1.0,1,"Union{Missing, Float64}"


`names` returns the names of all columns as strings:

In [7]:
names(x)

3-element Vector{String}:
 "A"
 "B"
 "C"

You can also get the names of columns with a given `eltype`:

In [8]:
names(x, String)

1-element Vector{String}:
 "C"
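
The second argument of `names` is not limited to a concrete type. The sketch below, which uses a separate data frame `demo` (a name assumed for this example), shows an abstract type, a `Union` that admits `missing`, `Not`, and a regular expression.

```julia
using DataFrames

demo = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

real_cols     = names(demo, Real)                  # element type is a subtype of Real
real_or_miss  = names(demo, Union{Missing, Real})  # also admits columns allowing missing
all_but_a     = names(demo, Not(:A))               # every column except :A
matched_regex = names(demo, r"^[AB]$")             # names matching a regular expression
```

Note that column `B` is excluded by `Real` because its element type is `Union{Missing, Float64}`, which is not a subtype of `Real`.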

Use `propertynames` to get a vector of `Symbol`s:

In [9]:
propertynames(x)

3-element Vector{Symbol}:
 :A
 :B
 :C

Broadcasting `eltype` over `eachcol(x)` returns the element types of the columns:

In [10]:
eltype.(eachcol(x))

3-element Vector{Type}:
 Int64
 Union{Missing, Float64}
 String

Here we create a larger `DataFrame`

In [11]:
y = DataFrame(rand(1:10, 1000, 10), :auto)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,9,10,5,7,8,7,9,4,1,2
2,2,1,7,7,1,8,10,5,4,3
3,9,6,2,9,10,7,10,1,6,10
4,2,7,6,7,8,1,6,3,10,3
5,7,1,8,5,1,2,10,8,7,7
6,5,9,3,9,4,8,7,9,2,4
7,8,4,3,1,5,10,1,3,8,2
8,4,7,3,1,7,10,8,4,6,6
9,6,6,10,7,3,5,4,3,7,7
10,4,6,3,7,8,4,4,5,8,3


and then we can use `first` to peek into its first few rows

In [12]:
first(y, 5)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,9,10,5,7,8,7,9,4,1,2
2,2,1,7,7,1,8,10,5,4,3
3,9,6,2,9,10,7,10,1,6,10
4,2,7,6,7,8,1,6,3,10,3
5,7,1,8,5,1,2,10,8,7,7


and `last` to see its bottom rows.

In [13]:
last(y, 3)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,5,10,5,10,3,7,10,8,1,10
2,2,6,4,6,8,10,10,3,2,1
3,6,3,1,9,4,4,7,5,7,5


Calling `first` or `last` without a number of rows returns the first or last row of the `DataFrame` as a `DataFrameRow`:

In [14]:
first(y)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,9,10,5,7,8,7,9,4,1,2


In [15]:
last(y)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1000,6,3,1,9,4,4,7,5,7,5


### Displaying large data frames

Create a wide and tall data frame:

In [16]:
df = DataFrame(rand(100, 100), :auto)

Row,x1,x2,x3,x4,x5,x6,x7,x8
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.404855,0.243012,0.683807,0.192505,0.177699,0.490366,0.256095,0.905991
2,0.690703,0.98421,0.562601,0.328666,0.744332,0.720237,0.274809,0.251483
3,0.360016,0.543438,0.665666,0.108831,0.97722,0.907352,0.882782,0.89743
4,0.40676,0.138815,0.836288,0.152535,0.363914,0.92364,0.968768,0.239175
5,0.574124,0.0588055,0.479767,0.550021,0.852979,0.349545,0.193424,0.410679
6,0.741206,0.360742,0.456086,0.807961,0.184329,0.202235,0.931272,0.0470354
7,0.583758,0.508866,0.992778,0.421772,0.65335,0.481035,0.853926,0.512148
8,0.89155,0.493118,0.654586,0.111203,0.486452,0.895107,0.923835,0.136125
9,0.0134661,0.652482,0.482317,0.0963468,0.59272,0.474086,0.582232,0.44114
10,0.433474,0.0102393,0.18827,0.0993683,0.140597,0.199437,0.179752,0.14468


We can see that 92 of its columns were not printed, and that only its first 30 rows are shown. You can change this behavior by setting `ENV["LINES"]` and `ENV["COLUMNS"]`.

In [17]:
ENV["LINES"] = 10

10

In [18]:
ENV["COLUMNS"] = 200

200

In [19]:
df

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19
,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.404855,0.243012,0.683807,0.192505,0.177699,0.490366,0.256095,0.905991,0.387573,0.676251,0.342257,0.75176,0.873887,0.780598,0.37259,0.602779,0.351328,0.898146,0.136442
2,0.690703,0.98421,0.562601,0.328666,0.744332,0.720237,0.274809,0.251483,0.345267,0.532479,0.586223,0.398477,0.0954523,0.343082,0.810024,0.906448,0.313045,0.515277,0.986998
3,0.360016,0.543438,0.665666,0.108831,0.97722,0.907352,0.882782,0.89743,0.244234,0.642323,0.138691,0.619279,0.924233,0.0231141,0.795032,0.715749,0.157171,0.155569,0.603605
4,0.40676,0.138815,0.836288,0.152535,0.363914,0.92364,0.968768,0.239175,0.121966,0.473963,0.810705,0.535522,0.368712,0.496021,0.422308,0.176252,0.94233,0.376803,0.945638
5,0.574124,0.0588055,0.479767,0.550021,0.852979,0.349545,0.193424,0.410679,0.105166,0.659435,0.25631,0.718698,0.000407438,0.0910939,0.313009,0.0928969,0.361912,0.276834,0.0685941
6,0.741206,0.360742,0.456086,0.807961,0.184329,0.202235,0.931272,0.0470354,0.945089,0.33734,0.754415,0.0654029,0.788944,0.691528,0.284889,0.491287,0.297232,0.280993,0.14589
7,0.583758,0.508866,0.992778,0.421772,0.65335,0.481035,0.853926,0.512148,0.550709,0.436073,0.0559767,0.971778,0.729512,0.546261,0.856627,0.452861,0.710556,0.276183,0.978197
8,0.89155,0.493118,0.654586,0.111203,0.486452,0.895107,0.923835,0.136125,0.642167,0.263163,0.327283,0.114689,0.755289,0.704462,0.834631,0.0254315,0.876798,0.657908,0.19089
9,0.0134661,0.652482,0.482317,0.0963468,0.59272,0.474086,0.582232,0.44114,0.466024,0.49818,0.504244,0.229541,0.358785,0.111,0.358165,0.119663,0.629892,0.348444,0.462189
10,0.433474,0.0102393,0.18827,0.0993683,0.140597,0.199437,0.179752,0.14468,0.728592,0.913592,0.733236,0.929452,0.64385,0.786097,0.213996,0.945879,0.284861,0.228,0.743188
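
Besides the environment variables, a one-off alternative for plain-text output is to call `show` with the `allcols` and `allrows` keyword arguments. The sketch below assumes text display (not the HTML rendering a notebook uses); `wide` and `full_text` are names chosen for this example.

```julia
using DataFrames

wide = DataFrame(rand(100, 100), :auto)

# Print every column; rows are still truncated to fit the display height.
show(wide, allcols = true)

# Render every row and column into a string instead of printing,
# which can be convenient for logging or searching the output.
full_text = sprint((io, d) -> show(io, d; allrows = true, allcols = true), wide)
```

Unlike changing `ENV`, these keyword arguments affect only the single call in which they are used.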


### Most elementary get and set operations

Given the `DataFrame` `x` we have created earlier, here are various ways to grab one of its columns as a `Vector`.

In [20]:
x

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
2,2,missing,b


In [21]:
x.A, x[!, 1], x[!, :A] # all get the vector stored in our DataFrame without copying it

([1, 2], [1, 2], [1, 2])

In [22]:
x."A", x[!, "A"] # the same using string indexing

([1, 2], [1, 2])

In [23]:
x[:, 1] # note that this creates a copy

2-element Vector{Int64}:
 1
 2

In [24]:
x[:, 1] === x[:, 1]

false

To grab one row as a `DataFrame`, we can index as follows.

In [25]:
x[1:1, :]

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a


In [26]:
x[1, :] # this produces a DataFrameRow, which is treated as a 1-dimensional object similar to a NamedTuple

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
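
To make the `DataFrameRow` behavior concrete, here is a small sketch on a separate data frame `demo` (a name assumed for this example, so that `x` is left untouched). It shows that the row is a view into its parent, and that `copy` materializes it as a `NamedTuple`.

```julia
using DataFrames

demo = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

r = demo[1, :]        # a DataFrameRow: a view into row 1 of demo
first_a = r.A         # fields can be read by property ...
first_c = r[:C]       # ... or by index
nt = copy(r)          # copy materializes the row as a NamedTuple

r.A = 100             # writing through the row updates the parent data frame
demo.A                # the first element is now 100; nt still holds the old value
```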


We can grab a single cell or element with the same syntax that is used to grab an element of an array,

In [27]:
x[1, 1]

1

or a new `DataFrame` that is a subset of rows and columns

In [28]:
x[1:2, 1:2]

Row,A,B
,Int64,Float64?
1,1,1.0
2,2,missing


You can also use a `Regex` to select columns, and `Not` from InvertedIndices.jl to select both rows and columns:

In [29]:
x[Not(1), r"A"]

Row,A
,Int64
1,2


In [30]:
x[!, Not(1)] # ! indicates that underlying columns are not copied

Row,B,C
,Float64?,String
1,1.0,a
2,missing,b


In [31]:
x[:, Not(1)] # : means that the columns will get copied

Row,B,C
,Float64?,String
1,1.0,a
2,missing,b


Assignment of a scalar to a data frame can be done in ranges using broadcasting:

In [32]:
x[1:2, 1:2] .= 1
x

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
2,1,1.0,b


Assignment of a vector whose length equals the number of assigned rows also uses broadcasting:

In [33]:
x[1:2, 1:2] .= [1,2]
x

Row,A,B,C
,Int64,Float64?,String
1,1,1.0,a
2,2,2.0,b


The same works for assignment of another data frame of matching size and column names, again using broadcasting:

In [34]:
x[1:2, 1:2] .= DataFrame([5 6; 7 8], [:A, :B])
x

Row,A,B,C
,Int64,Float64?,String
1,5,6.0,a
2,7,8.0,b
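
When broadcasting into a whole existing column, the choice of row selector matters. As a minimal sketch (on a separate data frame `demo`, a name assumed for this example): `:` writes into the existing vector and so must respect its element type, while `!` allocates a new column whose element type may differ.

```julia
using DataFrames

demo = DataFrame(A = [1, 2], B = [3, 4])

# `:` broadcasts into the existing Int64 vector, so a non-integer value fails
# (to my understanding with an InexactError; any error is caught here).
inplace_failed = try
    demo[:, :A] .= 1.5
    false
catch
    true
end

# `!` allocates a fresh column, so the element type may change.
demo[!, :A] .= 1.5
new_eltype = eltype(demo.A)
```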


**Caution**

With the `df[!, :col]` and `df.col` syntax you get direct (non-copying) access to a column of a data frame.
This is potentially unsafe as you can easily corrupt data in the `df` data frame if you resize, sort, etc. the column obtained in this way.
Therefore such access should be used with caution.

Similarly, when `cols` is a collection of columns, `df[!, cols]` produces a new data frame that holds the same (not copied) columns as the source `df` data frame. Modifying the data frame obtained via `df[!, cols]` might therefore cause problems with the consistency of `df`.

The `df[:, :col]` and `df[:, cols]` syntaxes always copy columns so they are safe to use (and should generally be preferred except for performance or memory critical use cases).
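
The difference can be seen in a short sketch. The data frame `demo` and the variable names `alias` and `safe` are chosen for this example only; sorting the aliased column silently breaks the pairing between the two columns.

```julia
using DataFrames

demo = DataFrame(a = [3, 1, 2], b = ["x", "y", "z"])

alias = demo[!, :a]   # the very vector stored in demo
safe  = demo[:, :a]   # an independent copy

sort!(safe)           # demo is untouched
untouched = copy(demo.a)

sort!(alias)          # reorders demo.a in place, but not demo.b
demo                  # rows are now mismatched: 1 is paired with "x" instead of "y"
```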

Here are examples of how `Cols` and `Between` can be used to select columns of a data frame.

In [35]:
x = DataFrame(rand(4, 5), :auto)

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.69221,0.903738,0.335539,0.11739,0.320136
2,0.573932,0.106309,0.240108,0.807816,0.834898
3,0.886848,0.120244,0.583345,0.268339,0.497417
4,0.902572,0.987881,0.924044,0.539209,0.990526


In [36]:
x[:, Between(:x2, :x4)]

Row,x2,x3,x4
,Float64,Float64,Float64
1,0.903738,0.335539,0.11739
2,0.106309,0.240108,0.807816
3,0.120244,0.583345,0.268339
4,0.987881,0.924044,0.539209


In [37]:
x[:, Cols("x1", Between("x2", "x4"))]

Row,x1,x2,x3,x4
,Float64,Float64,Float64,Float64
1,0.69221,0.903738,0.335539,0.11739
2,0.573932,0.106309,0.240108,0.807816
3,0.886848,0.120244,0.583345,0.268339
4,0.902572,0.987881,0.924044,0.539209
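
`Cols` takes the union of its arguments in the order given, and those arguments can mix selector kinds. The sketch below uses `names` with a selector to show which columns would be picked, without printing any data; `demo` is a name assumed for this example.

```julia
using DataFrames

demo = DataFrame(rand(4, 5), :auto)

# A regular expression combined with a single column name.
regex_and_name = names(demo, Cols(r"[12]$", :x5))

# The order of the arguments determines the order of the selected columns.
reordered = names(demo, Cols(:x5, Between(:x1, :x2)))
```

The same selectors can be used directly for indexing, for example `demo[:, Cols(:x5, Between(:x1, :x2))]`.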


### Views

You can simply create a view of a `DataFrame` (it is more efficient than creating a materialized selection). Here are the possible return value options.

In [38]:
@view x[1:2, 1]

2-element view(::Vector{Float64}, 1:2) with eltype Float64:
 0.6922102533868262
 0.5739317182365873

In [39]:
@view x[1,1]

0-dimensional view(::Vector{Float64}, 1) with eltype Float64:
0.6922102533868262

In [40]:
@view x[1, 1:2] # a DataFrameRow, the same as for x[1, 1:2] without a view

Row,x1,x2
,Float64,Float64
1,0.69221,0.903738


In [41]:
@view x[1:2, 1:2] # a SubDataFrame

Row,x1,x2
,Float64,Float64
1,0.69221,0.903738
2,0.573932,0.106309
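
Because a view shares storage with its parent, writing through it changes the parent, whereas a materialized selection is independent. A minimal sketch, using names (`demo`, `v`, `c`) chosen for this example:

```julia
using DataFrames

demo = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

v = @view demo[1:2, :]   # a SubDataFrame sharing storage with demo
v[1, :a] = 100           # writes through to the parent
after_view = copy(demo.a)

c = demo[1:2, :]         # a materialized copy, independent of demo
c[1, :a] = -1
after_copy = copy(demo.a)
```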


### Adding new columns to a data frame

In [42]:
df = DataFrame()

using `setproperty!`

In [43]:
x = [1, 2, 3]
df.a = x
df

Row,a
,Int64
1,1
2,2
3,3


In [44]:
df.a === x # no copy is performed

true

using `setindex!`

In [45]:
df[!, :b] = x
df[:, :c] = x
df

Row,a,b,c
,Int64,Int64,Int64
1,1,1,1
2,2,2,2
3,3,3,3


In [46]:
df.b === x # no copy

true

In [47]:
df.c === x # copy

false

In [48]:
df[!, :d] .= x
df[:, :e] .= x
df

Row,a,b,c,d,e
,Int64,Int64,Int64,Int64,Int64
1,1,1,1,1,1
2,2,2,2,2,2
3,3,3,3,3,3


In [49]:
df.d === x, df.e === x # both copy, so in this case `!` and `:` have the same effect

(false, false)

Note that in our data frame columns `:a` and `:b` store the vector `x` itself (not a copy):

In [50]:
df.a === df.b === x

true

This can lead to silent errors. For example this code leads to a bug (note that calling `pairs` on `eachcol(df)` creates an iterator of (column name, column) pairs):

In [51]:
for (n, c) in pairs(eachcol(df))
    println("$n: ", pop!(c))
end

a: 3
b: 2
c: 3
d: 3
e: 3


Note that for column `:b` we printed `2`, because `3` had already been removed from it when we called `pop!` on column `:a` (both columns are the same vector).

Such mistakes sometimes happen. Because of this, DataFrames.jl performs consistency checks before expensive operations (most notably before showing a data frame).

In [52]:
df

AssertionError: AssertionError: Data frame is corrupt: length of column :c (2) does not match length of column 1 (1). The column vector has likely been resized unintentionally (either directly or because it is shared with another data frame).

We can investigate the columns to find out what happened:

In [53]:
collect(pairs(eachcol(df)))

5-element Vector{Pair{Symbol, AbstractVector{T} where T}}:
 :a => [1]
 :b => [1]
 :c => [1, 2]
 :d => [1, 2]
 :e => [1, 2]

The output confirms that the data frame `df` got corrupted.
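
One way to avoid this kind of aliasing is to make sure each column owns its own vector. The sketch below relies on the `DataFrame` constructor copying its input vectors by default, and on `:` or an explicit `copy` when adding columns later; the names `v` and `safe` are chosen for this example.

```julia
using DataFrames

v = [1, 2, 3]

# The constructor copies its input vectors by default,
# so columns built from the same vector do not alias each other.
safe = DataFrame(a = v, b = v)
constructor_aliases = (safe.a === v, safe.a === safe.b)

# When adding a column later, `:` (or an explicit copy) keeps it independent.
safe[:, :c] = v
safe.d = copy(v)
added_aliases = (safe.c === v, safe.d === v)
```

Mutating `v` afterwards, for example with `pop!(v)`, leaves every column of `safe` at its original length.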

DataFrames.jl supports a complete set of `getindex`, `getproperty`, `setindex!`, `setproperty!`, `view`, broadcasting, and broadcasting assignment operations. The details are explained here: http://juliadata.github.io/DataFrames.jl/latest/lib/indexing/.

### Comparisons

In [54]:
using DataFrames

In [55]:
df = DataFrame(rand(2,3), :auto)

Row,x1,x2,x3
,Float64,Float64,Float64
1,0.164412,0.466921,0.700084
2,0.399886,0.000657272,0.531395


In [56]:
df2 = copy(df)

Row,x1,x2,x3
,Float64,Float64,Float64
1,0.164412,0.466921,0.700084
2,0.399886,0.000657272,0.531395


In [57]:
df == df2 # compares column names and contents

true

Create a minimally different data frame and use `isapprox` for comparison:

In [58]:
df3 = df2 .+ eps()

Row,x1,x2,x3
,Float64,Float64,Float64
1,0.164412,0.466921,0.700084
2,0.399886,0.000657272,0.531395


In [59]:
df == df3

false

In [60]:
isapprox(df, df3)

true

In [61]:
isapprox(df, df3, atol = eps()/2)

false

`missing` values are handled as in Julia Base:

In [62]:
df = DataFrame(a=missing)

Row,a
,Missing
1,missing


In [63]:
df == df

missing

In [64]:
df === df

true

In [65]:
isequal(df, df)

true
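
The three-valued logic becomes clearer when the data frames also contain non-missing values. In the sketch below (with data frames `a`, `b`, and `c` made up for this example), `==` returns `missing` only when no definite difference is found, and `coalesce` can turn an unknown result into a `Bool`.

```julia
using DataFrames

a = DataFrame(v = [1, missing])
b = DataFrame(v = [1, missing])
c = DataFrame(v = [2, missing])

unknown = a == b                   # missing: comparing the missing entries is undecided
different = a == c                 # false: the first entries already differ
same = isequal(a, b)               # true: isequal treats missing as equal to missing
as_bool = coalesce(a == b, false)  # treat an undecided comparison as false
```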