# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), May 15, 2021**

In [None]:
using DataFrames

## Getting basic information about a data frame

Let's start by creating a `DataFrame` object, `x`, so that we can learn how to get information on that data frame.

In [2]:
x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64?,String
1,1,1.0,a
2,2,missing,b


The standard `size` function works to get dimensions of the `DataFrame`,

In [3]:
size(x), size(x, 1), size(x, 2)

((2, 3), 2, 3)

as do `nrow` and `ncol`, whose names are borrowed from R.

In [4]:
nrow(x), ncol(x)

(2, 3)

`describe` gives basic summary statistics of data in your `DataFrame` (check out the help of `describe` for information on how to customize shown statistics).

In [5]:
describe(x)

Unnamed: 0_level_0,variable,mean,min,median,max,nmissing,eltype
Unnamed: 0_level_1,Symbol,Union…,Any,Union…,Any,Int64,Type
1,A,1.5,1,1.5,2,0,Int64
2,B,1.0,1.0,1.0,1.0,1,"Union{Missing, Float64}"
3,C,,a,,b,0,String


You can limit the columns shown by `describe` using the `cols` keyword argument:

In [6]:
describe(x, cols=1:2)

Unnamed: 0_level_0,variable,mean,min,median,max,nmissing,eltype
Unnamed: 0_level_1,Symbol,Float64,Real,Float64,Real,Int64,Type
1,A,1.5,1.0,1.5,2.0,0,Int64
2,B,1.0,1.0,1.0,1.0,1,"Union{Missing, Float64}"
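As a hedged illustration of customizing the statistics: `describe` accepts the names of the requested statistics as positional `Symbol` arguments. The sketch below rebuilds the same toy data under the hypothetical name `t` (so that `x` is left alone) and asks only for a few statistics.

```julia
using DataFrames

# `t` is a hypothetical stand-in with the same contents as `x` above
t = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

# request only selected statistics; the `variable` column is always included
stats = describe(t, :mean, :nmissing)

# statistics and column selection can be combined
stats_ab = describe(t, :min, :max, cols = [:A, :B])
```

The help entry of `describe` lists the statistic names that are available.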


`names` returns the names of all columns as strings:

In [7]:
names(x)

3-element Vector{String}:
 "A"
 "B"
 "C"

You can also get the names of columns with a given `eltype`:

In [8]:
names(x, String)

1-element Vector{String}:
 "C"
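The second argument of `names` is not limited to a concrete type. The following sketch, using a hypothetical copy `t` of the data above, shows a few selectors that `names` also accepts: an abstract or `Union` type, `Not`, and a regular expression.

```julia
using DataFrames

# `t` is a hypothetical stand-in with the same contents as `x` above
t = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

ints    = names(t, Int)                    # columns whose element type is a subtype of Int
numeric = names(t, Union{Missing, Real})   # numeric columns, with or without missing values
rest    = names(t, Not(:A))                # every column except :A
matched = names(t, r"^[AB]$")              # columns whose name matches a regular expression
```

Note that `names(t, Real)` alone would skip column `B`, because its element type `Union{Missing, Float64}` is not a subtype of `Real`.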

Use `propertynames` to get a vector of `Symbol`s:

In [9]:
propertynames(x)

3-element Vector{Symbol}:
 :A
 :B
 :C

Broadcasting `eltype` over `eachcol(x)` returns the element types of the columns:

In [10]:
eltype.(eachcol(x))

3-element Vector{Type}:
 Int64
 Union{Missing, Float64}
 String

Here we create a larger `DataFrame`

In [11]:
y = DataFrame(rand(1:10, 1000, 10), :auto)

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,1,5,7,7,4,7,5,10,7,5
2,1,2,3,5,10,8,5,1,1,5
3,5,3,3,9,4,1,2,10,8,10
4,1,5,2,4,4,3,5,6,10,1
5,5,1,9,1,6,2,7,9,10,4
6,3,10,7,10,10,9,2,2,4,2
7,2,8,2,9,3,3,6,1,6,6
8,9,5,5,5,2,7,1,4,4,10
9,5,2,8,5,10,9,6,7,10,9
10,10,6,3,2,9,10,9,7,8,10


and then we can use `first` to peek into its first few rows

In [12]:
first(y, 5)

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,1,5,7,7,4,7,5,10,7,5
2,1,2,3,5,10,8,5,1,1,5
3,5,3,3,9,4,1,2,10,8,10
4,1,5,2,4,4,3,5,6,10,1
5,5,1,9,1,6,2,7,9,10,4


and `last` to see its bottom rows.

In [13]:
last(y, 3)

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,2,1,4,1,10,2,6,3,2,1
2,10,2,1,6,5,3,6,4,6,2
3,1,1,6,4,3,8,9,1,2,10


Calling `first` or `last` without a number of rows returns the first or last row of the `DataFrame` as a `DataFrameRow`:

In [14]:
first(y)

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,1,5,7,7,4,7,5,10,7,5


In [15]:
last(y)

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1000,1,1,6,4,3,8,9,1,2,10


### Displaying large data frames

Create a wide and tall data frame:

In [16]:
df = DataFrame(rand(100, 100), :auto)

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.284863,0.150485,0.737636,0.806588,0.316241,0.743701,0.218102,0.165176
2,0.501957,0.231188,0.551661,0.360076,0.769242,0.700422,0.0486096,0.7185
3,0.973025,0.537029,0.00941453,0.52119,0.5629,0.250874,0.140555,0.112987
4,0.842876,0.730808,0.903605,0.946731,0.977526,0.333802,0.400038,0.33784
5,0.65979,0.401221,0.313394,0.676907,0.850611,0.0627812,0.810162,0.102764
6,0.822837,0.0276197,0.658882,0.016395,0.0616261,0.310589,0.436686,0.484467
7,0.0833133,0.311927,0.0744152,0.0912951,0.361815,0.472003,0.559648,0.319237
8,0.817371,0.265196,0.463726,0.91228,0.483566,0.581756,0.00719598,0.35739
9,0.409563,0.418353,0.256868,0.926944,0.0589892,0.291366,0.326054,0.461798
10,0.0243913,0.742843,0.188017,0.694479,0.666269,0.129156,0.672293,0.858619


We can see that 92 of its columns were not printed, and that only its first 30 rows are shown. You can change this behavior by setting `ENV["LINES"]` and `ENV["COLUMNS"]`.

In [17]:
ENV["LINES"] = 10

10

In [18]:
ENV["COLUMNS"] = 200

200

In [19]:
df

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.284863,0.150485,0.737636,0.806588,0.316241,0.743701,0.218102,0.165176,0.0818365,0.0983644,0.0770015,0.583222,0.296157,0.837324,0.741328,0.545649,0.425016,0.0116202,0.918861
2,0.501957,0.231188,0.551661,0.360076,0.769242,0.700422,0.0486096,0.7185,0.705136,0.45167,0.192877,0.046497,0.173873,0.479886,0.586149,0.251211,0.189006,0.225503,0.348623
3,0.973025,0.537029,0.00941453,0.52119,0.5629,0.250874,0.140555,0.112987,0.194617,0.160313,0.977295,0.168291,0.953982,0.874838,0.805281,0.871176,0.548642,0.691007,0.552542
4,0.842876,0.730808,0.903605,0.946731,0.977526,0.333802,0.400038,0.33784,0.37407,0.860041,0.171666,0.91238,0.870106,0.80575,0.573598,0.90186,0.170766,0.0448987,0.656011
5,0.65979,0.401221,0.313394,0.676907,0.850611,0.0627812,0.810162,0.102764,0.463117,0.589625,0.611264,0.087908,0.972714,0.552724,0.424026,0.452282,0.71269,0.869421,0.552599
6,0.822837,0.0276197,0.658882,0.016395,0.0616261,0.310589,0.436686,0.484467,0.0648184,0.800449,0.00449887,0.623002,0.537856,0.189178,0.480624,0.0794668,0.639608,0.756184,0.324802
7,0.0833133,0.311927,0.0744152,0.0912951,0.361815,0.472003,0.559648,0.319237,0.62895,0.496908,0.736253,0.910113,0.173795,0.90739,0.159655,0.639726,0.150445,0.340965,0.942088
8,0.817371,0.265196,0.463726,0.91228,0.483566,0.581756,0.00719598,0.35739,0.259512,0.495318,0.48536,0.741115,0.800019,0.52507,0.580446,0.990122,0.920072,0.256567,0.451351
9,0.409563,0.418353,0.256868,0.926944,0.0589892,0.291366,0.326054,0.461798,0.626255,0.539294,0.00208776,0.354188,0.744345,0.99916,0.27808,0.553075,0.70034,0.707614,0.253815
10,0.0243913,0.742843,0.188017,0.694479,0.666269,0.129156,0.672293,0.858619,0.897235,0.36655,0.833211,0.170462,0.119561,0.911127,0.536527,0.861722,0.667115,0.806371,0.449196
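Changing `ENV` affects every later display. As an alternative sketch for a one-off printout, the `show` method for data frames takes `allrows` and `allcols` keyword arguments. The name `wide` below is hypothetical; it has the same shape as `df`.

```julia
using DataFrames

wide = DataFrame(rand(100, 100), :auto)  # hypothetical name; same shape as `df` above

# print all columns of the first three rows, leaving ENV untouched
show(first(wide, 3), allcols = true)

# the same keywords work when rendering into a buffer
io = IOBuffer()
show(io, wide, allcols = true)
text = String(take!(io))
```

Passing `allrows = true` in the same way prints every row.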


### Most elementary get and set operations

Given the `DataFrame` `x` we have created earlier, here are various ways to grab one of its columns as a `Vector`.

In [20]:
x

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64?,String
1,1,1.0,a
2,2,missing,b


In [21]:
x.A, x[!, 1], x[!, :A] # all get the vector stored in our DataFrame without copying it

([1, 2], [1, 2], [1, 2])

In [22]:
x."A", x[!, "A"] # the same using string indexing

([1, 2], [1, 2])

In [23]:
x[:, 1] # note that this creates a copy

2-element Vector{Int64}:
 1
 2

In [24]:
x[:, 1] === x[:, 1]

false
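The practical consequence of copying versus not copying shows up as soon as the vector is mutated. A minimal sketch on a hypothetical one-column table `t`:

```julia
using DataFrames

t = DataFrame(A = [1, 2])   # hypothetical small table

alias = t[!, :A]      # the vector stored in `t`
snapshot = t[:, :A]   # an independent copy

alias[1] = 100        # writes through to the data frame

same_object = t[!, :A] === alias   # the very same vector
in_table = t.A[1]                  # reflects the change
in_copy = snapshot[1]              # unaffected by the change
```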

To grab one row as a `DataFrame`, we can index as follows.

In [25]:
x[1:1, :]

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64?,String
1,1,1.0,a


In [26]:
x[1, :] # this produces a DataFrameRow, which is treated as a 1-dimensional object similar to a NamedTuple

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64?,String
1,1,1.0,a
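To make the comparison with a `NamedTuple` concrete, here is a small sketch on a hypothetical copy `t` of the data. Unlike a `NamedTuple`, a `DataFrameRow` is a view, so assigning to one of its fields changes the parent data frame.

```julia
using DataFrames

# `t` is a hypothetical stand-in with the same contents as `x` above
t = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

row = t[1, :]            # DataFrameRow: a view of row 1
fields = (row.A, row[:C])  # fields are reachable by property or by index
nt = NamedTuple(row)     # materialize the row as an independent NamedTuple

row.A = 10               # writes through to `t`; `nt` keeps the old value
```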


We can grab a single cell with the same syntax that is used to grab an element of an array.

In [27]:
x[1, 1]

1

or a new `DataFrame` that is a subset of rows and columns

In [28]:
x[1:2, 1:2]

Unnamed: 0_level_0,A,B
Unnamed: 0_level_1,Int64,Float64?
1,1,1.0
2,2,missing


You can also use a `Regex` to select columns, and `Not` from InvertedIndices.jl to select both rows and columns

In [29]:
x[Not(1), r"A"]

Unnamed: 0_level_0,A
Unnamed: 0_level_1,Int64
1,2


In [30]:
x[!, Not(1)] # ! indicates that underlying columns are not copied

Unnamed: 0_level_0,B,C
Unnamed: 0_level_1,Float64?,String
1,1.0,a
2,missing,b


In [31]:
x[:, Not(1)] # : means that the columns will get copied

Unnamed: 0_level_0,B,C
Unnamed: 0_level_1,Float64?,String
1,1.0,a
2,missing,b


Assignment of a scalar to a data frame can be done in ranges using broadcasting:

In [32]:
x[1:2, 1:2] .= 1
x

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64?,String
1,1,1.0,a
2,1,1.0,b


Assignment of a vector of length equal to the number of assigned rows, also using broadcasting:

In [33]:
x[1:2, 1:2] .= [1,2]
x

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64?,String
1,1,1.0,a
2,2,2.0,b


Or assignment of another data frame of matching size and column names, again using broadcasting:

In [34]:
x[1:2, 1:2] .= DataFrame([5 6; 7 8], [:A, :B])
x

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64?,String
1,5,6.0,a
2,7,8.0,b


**Caution**

With the `df[!, :col]` and `df.col` syntax you get direct (non-copying) access to a column of a data frame.
This is potentially unsafe as you can easily corrupt data in the `df` data frame if you resize, sort, etc. the column obtained in this way.
Therefore such access should be used with caution.

Similarly, `df[!, cols]`, when `cols` is a collection of columns, produces a new data frame that holds the same (not copied) columns as the source `df` data frame. Modifying the data frame obtained via `df[!, cols]` might therefore cause problems with the consistency of `df`.

The `df[:, :col]` and `df[:, cols]` syntaxes always copy columns so they are safe to use (and should generally be preferred except for performance or memory critical use cases).
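The difference between the two multi-column forms can be checked directly with `===`. A minimal sketch, using a hypothetical table `t`:

```julia
using DataFrames

t = DataFrame(a = [1, 2, 3], b = [4, 5, 6])   # hypothetical table

shared = t[!, [:a, :b]]   # new data frame, same column vectors
copied = t[:, [:a, :b]]   # new data frame, freshly copied columns

aliased = shared.a === t.a       # true: columns are shared with `t`
independent = copied.a !== t.a   # true: columns are separate vectors

copied.a[1] = -1          # safe: `t` is untouched
```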

Here are examples of how `Cols` and `Between` can be used to select columns of a data frame.

In [35]:
x = DataFrame(rand(4, 5), :auto)

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.596819,0.502579,0.680181,0.28736,0.594995
2,0.601399,0.456482,0.256213,0.209147,0.225149
3,0.328791,0.278495,0.409022,0.3291,0.600518
4,0.360823,0.275127,0.693335,0.629522,0.395221


In [36]:
x[:, Between(:x2, :x4)]

Unnamed: 0_level_0,x2,x3,x4
Unnamed: 0_level_1,Float64,Float64,Float64
1,0.502579,0.680181,0.28736
2,0.456482,0.256213,0.209147
3,0.278495,0.409022,0.3291
4,0.275127,0.693335,0.629522


In [37]:
x[:, Cols("x1", Between("x2", "x4"))]

Unnamed: 0_level_0,x1,x2,x3,x4
Unnamed: 0_level_1,Float64,Float64,Float64,Float64
1,0.596819,0.502579,0.680181,0.28736
2,0.601399,0.456482,0.256213,0.209147
3,0.328791,0.278495,0.409022,0.3291
4,0.360823,0.275127,0.693335,0.629522
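Two further patterns that follow from how `Cols` combines its arguments, sketched on a hypothetical table `t` with the same column names: columns appear in the order in which they are first selected, and selectors of different kinds can be mixed.

```julia
using DataFrames

t = DataFrame(rand(4, 5), :auto)   # hypothetical table with columns x1 ... x5

# `Cols` keeps the order in which columns are first selected,
# so this moves :x5 to the front and keeps the rest after it
front = t[:, Cols(:x5, :)]

# a `Between` range and a regular expression inside one `Cols`
mixed = t[:, Cols(Between(:x1, :x2), r"5")]
```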


### Views

You can create a view of a `DataFrame` with `@view`; this is more efficient than creating a materialized selection because no data is copied. The type of the returned object depends on the kind of indices used, as the following cases show.

In [38]:
@view x[1:2, 1]

2-element view(::Vector{Float64}, 1:2) with eltype Float64:
 0.5968194305122594
 0.601398691761077

In [39]:
@view x[1,1]

0-dimensional view(::Vector{Float64}, 1) with eltype Float64:
0.5968194305122594

In [40]:
@view x[1, 1:2] # a DataFrameRow, the same as for x[1, 1:2] without a view

Unnamed: 0_level_0,x1,x2
Unnamed: 0_level_1,Float64,Float64
1,0.596819,0.502579


In [41]:
@view x[1:2, 1:2] # a SubDataFrame

Unnamed: 0_level_0,x1,x2
Unnamed: 0_level_1,Float64,Float64
1,0.596819,0.502579
2,0.601399,0.456482
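Because a view does not copy data, writing into it changes the parent data frame. A short sketch on a hypothetical table `t`:

```julia
using DataFrames

t = DataFrame(rand(4, 5), :auto)   # hypothetical table
before = copy(t.x2)                # remember the original second column

v = @view t[1:2, 1:2]   # SubDataFrame over rows 1:2 and columns x1, x2
v[1, 1] = 0.0           # writes into the parent
v[:, :x2] .= 1.0        # in-place broadcast, limited to the rows of the view

is_parent = parent(v) === t
```

Rows outside the view (here rows 3 and 4) are not touched by the broadcast.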


### Adding new columns to a data frame

In [42]:
df = DataFrame()

using `setproperty!`

In [43]:
x = [1, 2, 3]
df.a = x
df

Unnamed: 0_level_0,a
Unnamed: 0_level_1,Int64
1,1
2,2
3,3


In [44]:
df.a === x # no copy is performed

true

using `setindex!`

In [45]:
df[!, :b] = x
df[:, :c] = x
df

Unnamed: 0_level_0,a,b,c
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,1,1
2,2,2,2
3,3,3,3


In [46]:
df.b === x # no copy

true

In [47]:
df.c === x # copy

false

In [48]:
df[!, :d] .= x
df[:, :e] .= x
df

Unnamed: 0_level_0,a,b,c,d,e
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Int64
1,1,1,1,1,1
2,2,2,2,2,2
3,3,3,3,3,3


In [49]:
df.d === x, df.e === x # both copy, so in this case `!` and `:` have the same effect

(false, false)
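For a column that already exists the two forms do differ. The sketch below assumes the DataFrames.jl 1.x indexing rules, under which `df[!, col] .= v` replaces the column with a freshly allocated vector while `df[:, col] .= v` writes into the existing one. The tables `t` and `u` are hypothetical.

```julia
using DataFrames

t = DataFrame(a = [1, 2, 3])
t[!, :a] .= 1.5          # replaces :a with a new Float64 column

u = DataFrame(a = [1, 2, 3])
failed = try
    u[:, :a] .= 1.5      # in-place write into an Int64 column; 1.5 cannot be stored
    false
catch err
    err isa InexactError
end
```

So `!` can change the element type of a column, whereas `:` keeps the existing column and its element type.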

Note that in our data frame columns `:a` and `:b` store the vector `x` itself (not a copy)

In [50]:
df.a === df.b === x

true

This can lead to silent errors. For example this code leads to a bug (note that calling `pairs` on `eachcol(df)` creates an iterator of (column name, column) pairs):

In [51]:
for (n, c) in pairs(eachcol(df))
    println("$n: ", pop!(c))
end

a: 3
b: 2
c: 3
d: 3
e: 3


Note that for column `:b` we printed `2`, because `3` had already been removed from it when we used `pop!` on column `:a`.

Such mistakes sometimes happen. Because of this DataFrames.jl performs consistency checks before doing an expensive operation (most notably before showing a data frame).

In [52]:
df

AssertionError: AssertionError: Data frame is corrupt: length of column :c (2) does not match length of column 1 (1). The column vector has likely been resized unintentionally (either directly or because it is shared with another data frame).

We can investigate the columns to find out what happened:

In [53]:
collect(pairs(eachcol(df)))

5-element Vector{Pair{Symbol, AbstractVector{T} where T}}:
 :a => [1]
 :b => [1]
 :c => [1, 2]
 :d => [1, 2]
 :e => [1, 2]

The output confirms that the data frame `df` got corrupted.
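One way to avoid this kind of aliasing, sketched here with a hypothetical vector `v`, is to let the `DataFrame` constructor copy its inputs, which is its default behavior. Passing `copycols = false` opts back into sharing.

```julia
using DataFrames

v = [1, 2, 3]   # hypothetical source vector

safe = DataFrame(a = v, b = v)                       # default: each column is a copy
shared = DataFrame(a = v, b = v; copycols = false)   # columns alias `v`

distinct = safe.a !== safe.b      # true: separate vectors
aliased = shared.a === shared.b   # true: one vector stored twice

# every column of `safe` is resized exactly once, so the lengths stay equal
for c in eachcol(safe)
    pop!(c)
end
rows_left = nrow(safe)
```

Resizing columns by hand is still best avoided; the loop is only there to mirror the failing example above.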

DataFrames.jl supports a complete set of `getindex`, `getproperty`, `setindex!`, `setproperty!`, `view`, broadcasting, and broadcasting assignment operations. The details are explained here: http://juliadata.github.io/DataFrames.jl/latest/lib/indexing/.

### Comparisons

In [54]:
using DataFrames

In [55]:
df = DataFrame(rand(2,3), :auto)

Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Float64,Float64,Float64
1,0.862639,0.505863,0.397636
2,0.81065,0.543768,0.667132


In [56]:
df2 = copy(df)

Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Float64,Float64,Float64
1,0.862639,0.505863,0.397636
2,0.81065,0.543768,0.667132


In [57]:
df == df2 # compares column names and contents

true

Create a minimally different data frame and use `isapprox` for comparison

In [58]:
df3 = df2 .+ eps()

Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Float64,Float64,Float64
1,0.862639,0.505863,0.397636
2,0.81065,0.543768,0.667132


In [59]:
df == df3

false

In [60]:
isapprox(df, df3)

true

In [61]:
isapprox(df, df3, atol = eps()/2)

false

`missing` values are handled as in Julia Base

In [62]:
df = DataFrame(a=missing)

Unnamed: 0_level_0,a
Unnamed: 0_level_1,Missing
1,missing


In [63]:
df == df

missing

In [64]:
df === df

true

In [65]:
isequal(df, df)

true
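The three-valued result of `==` matters as soon as the comparison is used in a condition. A small sketch with hypothetical tables `m1`, `m2`, and `m3`:

```julia
using DataFrames

m1 = DataFrame(a = [1, missing])
m2 = DataFrame(a = [1, missing])
m3 = DataFrame(a = [2, missing])

undetermined = m1 == m2     # missing: the missing cells cannot be compared
different = m1 == m3        # false: the first row already differs
identical = isequal(m1, m2) # true: isequal treats missing as equal to missing

# `if m1 == m2` would throw because `missing` is not a Bool;
# `coalesce` turns the undetermined result into a definite one
same = coalesce(m1 == m2, false)
```

Use `isequal` when two missing values should count as equal, and `coalesce` when an undetermined comparison should count as a mismatch.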