# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), December 12, 2021**

In [1]:
using DataFrames

## Getting basic information about a data frame

Let's start by creating a `DataFrame` object, `x`, so that we can learn how to get information on that data frame.

In [2]:
x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64?,String
1,1,1.0,a
2,2,missing,b


The standard `size` function works to get dimensions of the `DataFrame`,

In [3]:
size(x), size(x, 1), size(x, 2)

((2, 3), 2, 3)

as do `nrow` and `ncol`, whose names are borrowed from R.

In [4]:
nrow(x), ncol(x)

(2, 3)

`describe` gives basic summary statistics of data in your `DataFrame` (check out the help of `describe` for information on how to customize shown statistics).

In [5]:
describe(x)

Unnamed: 0_level_0,variable,mean,min,median,max,nmissing,eltype
Unnamed: 0_level_1,Symbol,Union…,Any,Union…,Any,Int64,Type
1,A,1.5,1,1.5,2,0,Int64
2,B,1.0,1.0,1.0,1.0,1,"Union{Missing, Float64}"
3,C,,a,,b,0,String
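As a minimal sketch of the customization mentioned above, the statistics to compute can be passed to `describe` as `Symbol`s after the data frame. The name `x_demo` is a hypothetical copy of the example data introduced only for this illustration.

```julia
using DataFrames

# Hypothetical copy of the example data, so the original `x` is left untouched.
x_demo = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

# Request only selected built-in statistics by name.
stats = describe(x_demo, :min, :max, :nmissing)

# The result is itself a DataFrame with one row per described column.
println(names(stats))
println(stats.nmissing)
```

Because the result is an ordinary `DataFrame`, it can be indexed and filtered like any other data frame.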


You can limit the columns shown by `describe` using the `cols` keyword argument:

In [6]:
describe(x, cols=1:2)

Unnamed: 0_level_0,variable,mean,min,median,max,nmissing,eltype
Unnamed: 0_level_1,Symbol,Float64,Real,Float64,Real,Int64,Type
1,A,1.5,1.0,1.5,2.0,0,Int64
2,B,1.0,1.0,1.0,1.0,1,"Union{Missing, Float64}"


`names` returns the names of all columns as strings:

In [7]:
names(x)

3-element Vector{String}:
 "A"
 "B"
 "C"

You can also get the names of columns with a given `eltype`:

In [8]:
names(x, String)

1-element Vector{String}:
 "C"

Use `propertynames` to get a vector of `Symbol`s:

In [9]:
propertynames(x)

3-element Vector{Symbol}:
 :A
 :B
 :C
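Returning to `names` for a moment: the type passed as the second argument is matched against each column's element type with a subtype check, so a column that allows `missing` is not matched by its bare type. The sketch below illustrates this under that assumption, and also shows that `names` accepts a column selector such as `Not`. The name `x_demo` is hypothetical.

```julia
using DataFrames

# Hypothetical copy of the example data.
x_demo = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

# Column B has eltype Union{Missing, Float64}, which is not a subtype of Float64.
only_float = names(x_demo, Float64)

# Widening the type to allow missing values matches column B.
float_or_missing = names(x_demo, Union{Missing, Float64})

# Any real-valued column, with or without missing values.
real_or_missing = names(x_demo, Union{Missing, Real})

# A column selector works too: everything except column A.
all_but_a = names(x_demo, Not(:A))

println((only_float, float_or_missing, real_or_missing, all_but_a))
```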

Broadcasting `eltype` over `eachcol(x)` returns the element types of the columns:

In [10]:
eltype.(eachcol(x))

3-element Vector{Type}:
 Int64
 Union{Missing, Float64}
 String

Here we create a larger `DataFrame`

In [11]:
y = DataFrame(rand(1:10, 1000, 10), :auto)

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,3,6,9,6,9,5,9,8,5,1
2,7,8,4,8,1,8,8,4,4,2
3,5,7,10,6,10,4,4,1,5,1
4,1,10,5,2,6,9,5,2,3,6
5,8,3,6,8,6,9,6,2,4,3
6,9,2,2,10,5,2,1,3,8,5
7,10,4,10,10,10,8,6,9,4,8
8,7,7,8,4,10,1,4,6,6,2
9,9,5,10,1,5,7,3,4,7,10
10,9,4,5,3,6,6,6,1,2,9


and then we can use `first` to peek into its first few rows

In [12]:
first(y, 5)

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,3,6,9,6,9,5,9,8,5,1
2,7,8,4,8,1,8,8,4,4,2
3,5,7,10,6,10,4,4,1,5,1
4,1,10,5,2,6,9,5,2,3,6
5,8,3,6,8,6,9,6,2,4,3


and `last` to see its bottom rows.

In [13]:
last(y, 3)

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,1,9,8,7,4,4,1,6,2,4
2,6,3,9,2,7,9,1,5,3,10
3,10,6,7,5,7,3,1,6,9,10


Using `first` and `last` without the number of rows returns the first or last row of the `DataFrame` as a `DataFrameRow`:

In [14]:
first(y)

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,3,6,9,6,9,5,9,8,5,1


In [15]:
last(y)

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1000,10,6,7,5,7,3,1,6,9,10


### Displaying large data frames

Create a wide and tall data frame:

In [16]:
df = DataFrame(rand(100, 100), :auto)

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.937537,0.434762,0.624101,0.330633,0.780235,0.966165,0.300685,0.196115
2,0.316273,0.933712,0.297821,0.455944,0.135766,0.342949,0.676665,0.55228
3,0.704614,0.161441,0.93509,0.362899,0.78428,0.825509,0.935901,0.0804371
4,0.871178,0.371284,0.105463,0.156164,0.280341,0.31227,0.0334978,0.12334
5,0.0807914,0.89206,0.288906,0.278503,0.464292,0.0720536,0.685766,0.84284
6,0.0850526,0.823181,0.541579,0.18709,0.113331,0.682002,0.119516,0.565451
7,0.316889,0.0158169,0.364496,0.506836,0.89823,0.0234474,0.862016,0.443777
8,0.558646,0.730943,0.267209,0.679209,0.903464,0.491354,0.401884,0.414432
9,0.768379,0.880277,0.628593,0.619127,0.777861,0.442033,0.801788,0.64136
10,0.177642,0.0513107,0.221894,0.089259,0.727725,0.164175,0.483483,0.775985


We can see that 92 of its columns were not printed, and that only its first 30 rows are shown. You can change this behavior by setting `ENV["LINES"]` and `ENV["COLUMNS"]`.

In [17]:
ENV["LINES"] = 10

10

In [18]:
ENV["COLUMNS"] = 200

200

In [19]:
df

Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,0.937537,0.434762,0.624101,0.330633,0.780235,0.966165,0.300685,0.196115,0.999093,0.490282,0.19984,0.950018,0.905897,0.164993,0.655526,0.475493,0.221768,0.611937,0.680317
2,0.316273,0.933712,0.297821,0.455944,0.135766,0.342949,0.676665,0.55228,0.456232,0.237123,0.458375,0.632109,0.877576,0.510561,0.00994217,0.465554,0.554434,0.783488,0.786656
3,0.704614,0.161441,0.93509,0.362899,0.78428,0.825509,0.935901,0.0804371,0.478275,0.96748,0.338668,0.121199,0.875487,0.163666,0.319268,0.317752,0.131977,0.271312,0.313141
4,0.871178,0.371284,0.105463,0.156164,0.280341,0.31227,0.0334978,0.12334,0.184612,0.724425,0.290446,0.00651331,0.0486466,0.440407,0.996949,0.182879,0.530507,0.885594,0.849986
5,0.0807914,0.89206,0.288906,0.278503,0.464292,0.0720536,0.685766,0.84284,0.903505,0.459453,0.0234239,0.588885,0.0362375,0.766705,0.69943,0.994435,0.686968,0.176177,0.51632
6,0.0850526,0.823181,0.541579,0.18709,0.113331,0.682002,0.119516,0.565451,0.368162,0.394141,0.753103,0.970445,0.488772,0.300935,0.4077,0.147689,0.332845,0.759781,0.280839
7,0.316889,0.0158169,0.364496,0.506836,0.89823,0.0234474,0.862016,0.443777,0.761403,0.411127,0.62072,0.261861,0.204788,0.0613125,0.0802998,0.92124,0.439635,0.49084,0.976735
8,0.558646,0.730943,0.267209,0.679209,0.903464,0.491354,0.401884,0.414432,0.443621,0.895188,0.204415,0.466727,0.481289,0.654467,0.614662,0.791214,0.00829345,0.641154,0.493376
9,0.768379,0.880277,0.628593,0.619127,0.777861,0.442033,0.801788,0.64136,0.607006,0.219696,0.406148,0.967594,0.517552,0.862473,0.666879,0.713703,0.695816,0.182099,0.833874
10,0.177642,0.0513107,0.221894,0.089259,0.727725,0.164175,0.483483,0.775985,0.522092,0.733456,0.616819,0.693912,0.241367,0.985453,0.869485,0.0106348,0.219196,0.845455,0.0601673
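Besides the environment variables, `show` itself takes `allrows` and `allcols` keyword arguments that control cropping for a single call. The following is a small sketch that writes to an in-memory buffer so the result can be inspected; the names `wide`, `io`, and `text` are hypothetical.

```julia
using DataFrames

# A hypothetical wide data frame with 100 columns.
wide = DataFrame(rand(3, 100), :auto)

# Render into a buffer, explicitly asking for every column to be printed.
io = IOBuffer()
show(io, wide; allcols = true)
text = String(take!(io))

# The last column name appears in the rendered text.
println(occursin("x100", text))
```

In an interactive session, `show(wide, allcols = true)` prints to standard output in the same way.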


### Most elementary get and set operations

Given the `DataFrame` `x` we have created earlier, here are various ways to grab one of its columns as a `Vector`.

In [20]:
x

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64?,String
1,1,1.0,a
2,2,missing,b


In [21]:
x.A, x[!, 1], x[!, :A] # all get the vector stored in our DataFrame without copying it

([1, 2], [1, 2], [1, 2])

In [22]:
x."A", x[!, "A"] # the same using string indexing

([1, 2], [1, 2])

In [23]:
x[:, 1] # note that this creates a copy

2-element Vector{Int64}:
 1
 2

In [24]:
x[:, 1] === x[:, 1]

false
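The practical consequence of the difference between `!` and `:` can be sketched as follows: a vector obtained with `!` is the stored column itself, so writing to it changes the data frame, while a vector obtained with `:` is independent. The names `df_demo`, `col_ref`, and `col_copy` are hypothetical.

```julia
using DataFrames

# Hypothetical data frame used only for this illustration.
df_demo = DataFrame(A = [1, 2], B = [1.0, missing])

col_ref = df_demo[!, :A]   # the stored vector, no copy
col_copy = df_demo[:, :A]  # a fresh copy

col_ref[1] = 100   # visible in df_demo
col_copy[2] = -1   # not visible in df_demo

println(df_demo.A)
println(col_ref === df_demo.A)
println(col_copy === df_demo.A)
```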

To grab one row as a `DataFrame`, we can index as follows.

In [25]:
x[1:1, :]

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64?,String
1,1,1.0,a


In [26]:
x[1, :] # this produces a DataFrameRow, which is treated as a 1-dimensional object similar to a NamedTuple

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64?,String
1,1,1.0,a
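To make the comparison with a `NamedTuple` concrete, here is a short sketch of working with a `DataFrameRow`. Unlike a `NamedTuple`, it is a view into the parent data frame, so assigning to one of its fields writes through to the parent. The names `x_demo`, `row`, and `nt` are hypothetical.

```julia
using DataFrames

# Hypothetical copy of the example data.
x_demo = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

row = x_demo[1, :]

# Fields can be read by property, Symbol, or string.
println((row.A, row[:C], row["B"]))

# Assignment writes through to the parent data frame.
row.A = 10
println(x_demo.A)

# Converting to a NamedTuple gives an independent snapshot of the row.
nt = NamedTuple(row)
println(nt)
```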


We can grab a single cell or element with the same syntax used to grab an element of an array.

In [27]:
x[1, 1]

1

We can also create a new `DataFrame` that is a subset of rows and columns:

In [28]:
x[1:2, 1:2]

Unnamed: 0_level_0,A,B
Unnamed: 0_level_1,Int64,Float64?
1,1,1.0
2,2,missing


You can also use a `Regex` to select columns, and `Not` from InvertedIndices.jl to select both rows and columns:

In [29]:
x[Not(1), r"A"]

Unnamed: 0_level_0,A
Unnamed: 0_level_1,Int64
1,2


In [30]:
x[!, Not(1)] # ! indicates that underlying columns are not copied

Unnamed: 0_level_0,B,C
Unnamed: 0_level_1,Float64?,String
1,1.0,a
2,missing,b


In [31]:
x[:, Not(1)] # : means that the columns will get copied

Unnamed: 0_level_0,B,C
Unnamed: 0_level_1,Float64?,String
1,1.0,a
2,missing,b


A scalar can be assigned to a range of rows and columns of a data frame using broadcasting:

In [32]:
x[1:2, 1:2] .= 1
x

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64?,String
1,1,1.0,a
2,1,1.0,b


A vector whose length equals the number of assigned rows can be assigned using broadcasting:

In [33]:
x[1:2, 1:2] .= [1,2]
x

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64?,String
1,1,1.0,a
2,2,2.0,b


Another data frame of matching size and column names can be assigned as well, again using broadcasting:

In [34]:
x[1:2, 1:2] .= DataFrame([5 6; 7 8], [:A, :B])
x

Unnamed: 0_level_0,A,B,C
Unnamed: 0_level_1,Int64,Float64?,String
1,5,6.0,a
2,7,8.0,b


**Caution**

With the `df[!, :col]` and `df.col` syntax you get direct (non-copying) access to a column of a data frame.
This is potentially unsafe, as you can easily corrupt the data in `df` if you resize, sort, or otherwise mutate the column obtained in this way.
Therefore such access should be used with caution.

Similarly, `df[!, cols]`, where `cols` is a collection of columns, produces a new data frame that holds the same (not copied) columns as the source `df`. Modifying the data frame obtained via `df[!, cols]` might therefore cause problems with the consistency of `df`.

The `df[:, :col]` and `df[:, cols]` syntaxes always copy columns so they are safe to use (and should generally be preferred except for performance or memory critical use cases).
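The multi-column case described in this caution can be sketched as follows. The names `src`, `shared`, and `copied` are hypothetical.

```julia
using DataFrames

# Hypothetical source data frame.
src = DataFrame(A = [1, 2, 3], B = [4, 5, 6], C = [7, 8, 9])

shared = src[!, [:A, :B]]  # new data frame, same column vectors
copied = src[:, [:A, :B]]  # new data frame, copied column vectors

println(shared.A === src.A)
println(copied.A === src.A)

copied.A[1] = 100  # src is unaffected
shared.B[1] = -4   # src.B changes as well

println(src)
```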

Here are examples of how `Cols` and `Between` can be used to select columns of a data frame.

In [35]:
x = DataFrame(rand(4, 5), :auto)

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.691577,0.52603,0.895686,0.0981766,0.20148
2,0.372255,0.696497,0.729393,0.50415,0.422465
3,0.856247,0.150694,0.318079,0.613301,0.862029
4,0.628479,0.122975,0.0826925,0.615152,0.498344


In [36]:
x[:, Between(:x2, :x4)]

Unnamed: 0_level_0,x2,x3,x4
Unnamed: 0_level_1,Float64,Float64,Float64
1,0.52603,0.895686,0.0981766
2,0.696497,0.729393,0.50415
3,0.150694,0.318079,0.613301
4,0.122975,0.0826925,0.615152


In [37]:
x[:, Cols("x1", Between("x2", "x4"))]

Unnamed: 0_level_0,x1,x2,x3,x4
Unnamed: 0_level_1,Float64,Float64,Float64,Float64
1,0.691577,0.52603,0.895686,0.0981766
2,0.372255,0.696497,0.729393,0.50415
3,0.856247,0.150694,0.318079,0.613301
4,0.628479,0.122975,0.0826925,0.615152
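Selectors can be combined further. As a small sketch, `Between` also accepts integer positions, and `Cols` takes the union of its arguments in the order they are given, so it can mix a `Regex` with explicit names. Using `names` with a selector makes it easy to check what would be picked without printing the data. The name `wide5` is hypothetical.

```julia
using DataFrames

# Hypothetical data frame with columns x1 through x5.
wide5 = DataFrame(rand(2, 5), :auto)

# Between with integer positions.
by_position = names(wide5, Between(2, 4))

# A regular expression combined with an explicit column name.
regex_and_name = names(wide5, Cols(r"x[12]", :x5))

# The order of arguments in Cols determines the column order.
reordered = names(wide5, Cols(:x5, Between(:x1, :x2)))

println((by_position, regex_and_name, reordered))
```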


### Views

You can create a view of a `DataFrame` with `@view`; this is more efficient than creating a materialized selection because no data is copied. Here are the possible return value options.

In [38]:
@view x[1:2, 1]

2-element view(::Vector{Float64}, 1:2) with eltype Float64:
 0.6915767270073034
 0.37225450151199246

In [39]:
@view x[1,1]

0-dimensional view(::Vector{Float64}, 1) with eltype Float64:
0.6915767270073034

In [40]:
@view x[1, 1:2] # a DataFrameRow, the same as for x[1, 1:2] without a view

Unnamed: 0_level_0,x1,x2
Unnamed: 0_level_1,Float64,Float64
1,0.691577,0.52603


In [41]:
@view x[1:2, 1:2] # a SubDataFrame

Unnamed: 0_level_0,x1,x2
Unnamed: 0_level_1,Float64,Float64
1,0.691577,0.52603
2,0.372255,0.696497
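Because a view does not copy data, writing through it changes the parent data frame. A minimal sketch, with hypothetical names `base_df` and `v`:

```julia
using DataFrames

# Hypothetical parent data frame.
base_df = DataFrame(a = [1, 2, 3], b = [10, 20, 30])

# A view of the first two rows and all columns.
v = @view base_df[1:2, :]

println(v isa SubDataFrame)
println(parent(v) === base_df)

# Writing through the view modifies the parent.
v[1, :a] = 100
println(base_df.a)
```

If an independent object is needed, `base_df[1:2, :]` without `@view` materializes a copy instead.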


### Adding new columns to a data frame

In [42]:
df = DataFrame()

Using `setproperty!`:

In [43]:
x = [1, 2, 3]
df.a = x
df

Unnamed: 0_level_0,a
Unnamed: 0_level_1,Int64
1,1
2,2
3,3


In [44]:
df.a === x # no copy is performed

true

Using `setindex!`:

In [45]:
df[!, :b] = x
df[:, :c] = x
df

Unnamed: 0_level_0,a,b,c
Unnamed: 0_level_1,Int64,Int64,Int64
1,1,1,1
2,2,2,2
3,3,3,3


In [46]:
df.b === x # no copy

true

In [47]:
df.c === x # copy

false

In [48]:
df[!, :d] .= x
df[:, :e] .= x
df

Unnamed: 0_level_0,a,b,c,d,e
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Int64
1,1,1,1,1,1
2,2,2,2,2,2
3,3,3,3,3,3


In [49]:
df.d === x, df.e === x # both copy, so in this case `!` and `:` have the same effect

(false, false)
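The equivalence noted above applies when the column is new. For a column that already exists, the two forms differ: as far as I know, `df[:, col] .= v` writes into the existing vector in place (so the element type must be able to hold the new values), while `df[!, col] .= v` allocates a fresh column. The sketch below illustrates this; the names `d` and `inplace_failed` are hypothetical.

```julia
using DataFrames

# Hypothetical data frame with an integer column.
d = DataFrame(A = [1, 2])

# In-place broadcasting cannot store 1.5 in a Vector{Int64}, so it throws.
inplace_failed = try
    d[:, :A] .= 1.5
    false
catch err
    true
end
println(inplace_failed)

# Replacing the column with `!` allocates a new vector of a suitable type.
d[!, :A] .= 1.5
println(eltype(d.A))
```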

Note that in our data frame, columns `:a` and `:b` both store the vector `x` itself (not a copy):

In [50]:
df.a === df.b === x

true

This can lead to silent errors. For example, the following code leads to a bug (note that calling `pairs` on `eachcol(df)` creates an iterator of (column name, column) pairs):

In [51]:
for (n, c) in pairs(eachcol(df))
    println("$n: ", pop!(c))
end

a: 3
b: 2
c: 3
d: 3
e: 3


Note that for column `:b` we printed `2`, because `3` had already been removed from the shared vector when we called `pop!` on column `:a`.

Such mistakes sometimes happen. Because of this, DataFrames.jl performs consistency checks before doing an expensive operation (most notably before showing a data frame).

In [52]:
df

AssertionError: AssertionError: Data frame is corrupt: length of column :c (2) does not match length of column 1 (1). The column vector has likely been resized unintentionally (either directly or because it is shared with another data frame).

We can investigate the columns to find out what happened:

In [53]:
collect(pairs(eachcol(df)))

5-element Vector{Pair{Symbol, AbstractVector}}:
 :a => [1]
 :b => [1]
 :c => [1, 2]
 :d => [1, 2]
 :e => [1, 2]

The output confirms that the data frame `df` got corrupted.
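One way to avoid this kind of aliasing is to let the constructor copy the columns, which it does by default (`copycols = true`), or to call `copy` explicitly when assigning a column. A minimal sketch, with hypothetical names `v`, `safe`, and `aliased`:

```julia
using DataFrames

v = [1, 2, 3]

# By default the constructor copies, so every column is a distinct vector.
safe = DataFrame(a = v, b = v)
println((safe.a !== v, safe.a !== safe.b))

# With copycols = false the columns alias the same vector, as in the bug above.
aliased = DataFrame(a = v, b = v, copycols = false)
println(aliased.a === aliased.b === v)

# An explicit copy on assignment also avoids sharing.
safe.c = copy(v)
println(safe.c !== v)
```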

DataFrames.jl supports a complete set of `getindex`, `getproperty`, `setindex!`, `setproperty!`, `view`, broadcasting, and broadcasting assignment operations. The details are explained here: http://juliadata.github.io/DataFrames.jl/latest/lib/indexing/.

### Comparisons

In [54]:
using DataFrames

In [55]:
df = DataFrame(rand(2,3), :auto)

Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Float64,Float64,Float64
1,0.416058,0.255371,0.781103
2,0.973929,0.272253,0.697177


In [56]:
df2 = copy(df)

Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Float64,Float64,Float64
1,0.416058,0.255371,0.781103
2,0.973929,0.272253,0.697177


In [57]:
df == df2 # compares column names and contents

true

Create a minimally different data frame and use `isapprox` for comparison:

In [58]:
df3 = df2 .+ eps()

Unnamed: 0_level_0,x1,x2,x3
Unnamed: 0_level_1,Float64,Float64,Float64
1,0.416058,0.255371,0.781103
2,0.973929,0.272253,0.697177


In [59]:
df == df3

false

In [60]:
isapprox(df, df3)

true

In [61]:
isapprox(df, df3, atol = eps()/2)

false
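The last result may look surprising. As with `isapprox` in Julia Base, when a positive `atol` is given the relative tolerance defaults to zero, so only the absolute tolerance is applied. The sketch below uses fixed numbers instead of random ones so the effect is reproducible; the names `base_vals` and `shifted` are hypothetical.

```julia
using DataFrames

# Hypothetical deterministic data.
base_vals = DataFrame(a = [1.0, 2.0])
shifted = base_vals .+ 1e-10

# Default: a relative tolerance of about sqrt(eps()) is used.
default_ok = isapprox(base_vals, shifted)

# A tight absolute tolerance (relative tolerance is then zero) rejects the difference.
tight = isapprox(base_vals, shifted, atol = 1e-12)

# A looser absolute tolerance accepts it.
loose = isapprox(base_vals, shifted, atol = 1e-9)

println((default_ok, tight, loose))
```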

`missing` values are handled as in Julia Base:

In [62]:
df = DataFrame(a=missing)

Unnamed: 0_level_0,a
Unnamed: 0_level_1,Missing
1,missing


In [63]:
df == df

missing

In [64]:
df === df

true

In [65]:
isequal(df, df)

true
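Since `==` can return `missing`, it cannot be used directly as a condition in an `if` statement. The following sketch shows two common ways to get a definite `Bool`: `isequal`, which treats `missing` as equal to `missing`, and `coalesce`, which substitutes a fallback value. The names `m`, `res`, and `differs` are hypothetical.

```julia
using DataFrames

# Hypothetical data frame containing a missing value.
m = DataFrame(a = [1, missing])

# Three-valued logic: the comparison is undetermined.
res = m == m
println(ismissing(res))

# isequal always returns a Bool.
println(isequal(m, m))

# coalesce turns an undetermined comparison into a chosen default.
println(coalesce(m == m, false))

# A definite mismatch in a non-missing value still yields false.
differs = DataFrame(a = [1, missing]) == DataFrame(a = [2, missing])
println(differs)
```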