# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), August 26, 2019**

In [1]:
using Pkg
Pkg.activate(".");

Activating environment at `d:\Dev\Julia\DataFrames_Tutorial\Project.toml`


In [2]:
using DataFrames

## Getting basic information about a data frame

Let's start by creating a `DataFrame` object, `x`, so that we can learn how to get information on that data frame.

In [3]:
x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

Row,A,B,C
,Int64,Float64⍰,String
1,1,1.0,a
2,2,missing,b


The standard `size` function works to get dimensions of the `DataFrame`,

In [4]:
size(x), size(x, 1), size(x, 2)

((2, 3), 2, 3)

as do `nrow` and `ncol`, familiar from R (`length` used to return the number of columns but is now deprecated).

In [5]:
nrow(x), ncol(x)

(2, 3)

`describe` gives basic summary statistics of the data in your `DataFrame` (check out the help of `describe` for information on how to customize the statistics shown).

In [6]:
describe(x)

Row,variable,mean,min,median,max,nunique,nmissing,eltype
,Symbol,Union…,Any,Union…,Any,Union…,Union…,Type
1,A,1.5,1,1.5,2,,,Int64
2,B,1.0,1.0,1.0,1.0,,1.0,"Union{Missing, Float64}"
3,C,,a,,b,2.0,,String
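
As a minimal sketch of such customization, the statistics can be requested by name. This assumes an installed DataFrames.jl release in which `describe` accepts the requested statistics as positional `Symbol` arguments; other releases may expect a different form (for example a `stats` keyword), so check `?describe` for your version.

```julia
using DataFrames

x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

# Request only the mean and the number of missing values for each column.
# Assumption: positional Symbol arguments are supported by the installed version.
s = describe(x, :mean, :nmissing)
```

The result is itself a `DataFrame` with one row per column of `x`, so it can be indexed like any other data frame.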


`names` will return the names of all columns,

In [7]:
names(x)

3-element Array{Symbol,1}:
 :A
 :B
 :C

and `eltypes` returns their types.

In [8]:
eltypes(x)

3-element Array{Type,1}:
 Int64                  
 Union{Missing, Float64}
 String                 
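
The same information can be obtained by broadcasting `eltype` over the columns. This is only a sketch: it assumes that `eachcol(x)` iterates over the column vectors, and it may be the form to use in releases where `eltypes` is no longer available.

```julia
using DataFrames

x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

# Element type of each column vector, in column order.
coltypes = eltype.(eachcol(x))
```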

Here we create a larger `DataFrame`

In [9]:
y = DataFrame(rand(1:10, 1000, 10))

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,5,8,9,9,7,1,3,9,1,3
2,2,3,2,8,6,1,3,5,3,7
3,2,5,9,10,9,4,5,8,5,9
4,4,2,7,8,2,6,9,9,4,3
5,3,6,9,2,7,4,2,7,7,8
6,2,8,8,2,2,5,2,8,6,5
7,7,4,9,5,8,2,7,6,7,1
8,6,2,4,3,10,4,10,9,3,6
9,1,1,4,5,9,10,1,3,10,3
10,2,3,1,8,2,10,1,10,9,1


and then we can use `first` to peek into its first few rows

In [10]:
first(y, 5)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,5,8,9,9,7,1,3,9,1,3
2,2,3,2,8,6,1,3,5,3,7
3,2,5,9,10,9,4,5,8,5,9
4,4,2,7,8,2,6,9,9,4,3
5,3,6,9,2,7,4,2,7,7,8


and `last` to see its bottom rows.

In [11]:
last(y, 3)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,9,7,6,7,3,9,5,1,4,4
2,7,5,5,7,4,3,9,9,10,6
3,1,6,3,8,9,1,8,10,7,10


Calling `first` or `last` without a number of rows returns the first or last row of the `DataFrame` as a `DataFrameRow`

In [12]:
first(y)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1,5,8,9,9,7,1,3,9,1,3


In [13]:
last(y)

Row,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10
,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64
1000,1,6,3,8,9,1,8,10,7,10
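
The difference between the two call forms is in the type of the result. The sketch below uses a small hypothetical data frame (not the `y` above) to make this explicit.

```julia
using DataFrames

# A small illustrative data frame; the column names are made up for this sketch.
y = DataFrame(a = collect(1:10), b = collect(11:20))

top3    = first(y, 3)   # a DataFrame holding the first 3 rows
bottom2 = last(y, 2)    # a DataFrame holding the last 2 rows
r       = first(y)      # a DataFrameRow referring to row 1 of y
```

Fields of a `DataFrameRow` can be read by name, for example `r.a`.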


### Most elementary get and set operations

Given the `DataFrame`, `x`, here are four ways to grab one of its columns as a `Vector`.

In [14]:
x

Row,A,B,C
,Int64,Float64⍰,String
1,1,1.0,a
2,2,missing,b


In [15]:
x.A, x[!, 1], x[!, :A] # all get the vector stored in our DataFrame without copying it

([1, 2], [1, 2], [1, 2])

In [16]:
x[:, 1] # note that this creates a copy

2-element Array{Int64,1}:
 1
 2

In [17]:
x[:, 1] === x[:, 1]

false
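
To make the distinction concrete, here is a small sketch that compares object identity for the different access forms and shows that changing a copy leaves the data frame untouched.

```julia
using DataFrames

x = DataFrame(A = [1, 2], B = [1.0, missing], C = ["a", "b"])

same   = x.A === x[!, :A]   # both refer to the vector stored in x
copied = x[:, :A] === x.A   # `:` allocates a fresh vector on every call

c = x[:, :A]
c[1] = 100                  # modifies only the copy, not x
```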

To grab one row as a `DataFrame`, we can index as follows.

In [18]:
x[1:1, :]

Row,A,B,C
,Int64,Float64⍰,String
1,1,1.0,a


In [19]:
x[1, :] # this produces a DataFrameRow

Row,A,B,C
,Int64,Float64⍰,String
1,1,1.0,a


We can grab a single cell (element) using the same syntax as for an element of an array,

In [20]:
x[1, 1]

1

or a new `DataFrame` that is a subset of rows and columns

In [21]:
x[1:2, 1:2]

Row,A,B
,Int64,Float64⍰
1,1,1.0
2,2,missing


You can also use a `Regex` to select columns, and `Not` from InvertedIndices.jl to select both rows and columns

In [22]:
x[Not(1), r"A"]

Row,A
,Int64
1,2


In [23]:
x[!, Not(1)] # ! indicates that underlying columns are not copied

Row,B,C
,Float64⍰,String
1,1.0,a
2,missing,b


In [24]:
x[:, Not(1)] # : means that the columns will get copied

Row,B,C
,Float64⍰,String
1,1.0,a
2,missing,b


A block of rows and columns can be set to a scalar using broadcasting:

In [25]:
x[1:2, 1:2] .= 1
x

Row,A,B,C
,Int64,Float64⍰,String
1,1,1.0,a
2,1,1.0,b


to a vector of length equal to the number of assigned rows using broadcasting

In [26]:
x[1:2, 1:2] .= [1,2]
x

Row,A,B,C
,Int64,Float64⍰,String
1,1,1.0,a
2,2,2.0,b


or to another data frame of matching size and column names, again using broadcasting:

In [27]:
x[1:2, 1:2] .= DataFrame([5 6; 7 8], [:A, :B])
x

Row,A,B,C
,Int64,Float64⍰,String
1,5,6.0,a
2,7,8.0,b
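
The vector case follows the usual array broadcasting rule: the vector is laid along the rows and repeated for every selected column. A small sketch comparing a data frame with a plain matrix (the names `df` and `m` are illustrative only):

```julia
using DataFrames

df = DataFrame(A = [0, 0], B = [0.0, 0.0])
df[1:2, 1:2] .= [1, 2]   # each selected column receives the values 1, 2

m = zeros(Int, 2, 2)
m[1:2, 1:2] .= [1, 2]    # the same rule applied to a Matrix
```

Note that the values are converted to the element type of each target column, so column `B` keeps holding floating point numbers.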


**Caution**

With the `df[!, :col]` and `df.col` syntax you get direct (non-copying) access to a column of a data frame.
This is potentially unsafe, as you can easily corrupt the data in `df` if you resize, sort, etc. the column obtained in this way.
Therefore such access should be used with caution.

Similarly, `df[!, cols]`, when `cols` is a collection of columns, produces a new data frame that holds the same (not copied) columns as the source `df` data frame. Modifying the data frame obtained via `df[!, cols]` might therefore cause problems with the consistency of `df`.

The `df[:, :col]` and `df[:, cols]` syntaxes always copy columns so they are safe to use (and should generally be preferred except for performance or memory critical use cases).
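
As an illustration of the risk, the sketch below sorts a column obtained by direct access and thereby breaks the pairing between columns, while the same operation on a copy is harmless. The data frame and its column names are hypothetical.

```julia
using DataFrames

df = DataFrame(id = [3, 1, 2], name = ["c", "a", "b"])

col = df.id      # direct access: `col` is the very vector stored in df
sort!(col)       # reorders df.id in place, but df.name is left as it was

# The rows are now misaligned: id 1 sits next to "c" instead of "a".

safe = df[:, :name]   # a copy of the column
sort!(safe)           # df.name is not affected
```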

Here are examples of how `All` and `Between` can be used to select columns of a data frame.

In [28]:
x = DataFrame(rand(4, 5))

Row,x1,x2,x3,x4,x5
,Float64,Float64,Float64,Float64,Float64
1,0.549554,0.262778,0.279193,0.256622,0.278516
2,0.340675,0.433378,0.729121,0.607762,0.412879
3,0.594728,0.114828,0.628797,0.904247,0.868032
4,0.130788,0.219316,0.620918,0.291334,0.0503846


In [29]:
x[:, Between(:x2, :x4)]

Row,x2,x3,x4
,Float64,Float64,Float64
1,0.262778,0.279193,0.256622
2,0.433378,0.729121,0.607762
3,0.114828,0.628797,0.904247
4,0.219316,0.620918,0.291334


In [30]:
x[:, All(:x1, Between(:x2, :x4))]

Row,x1,x2,x3,x4
,Float64,Float64,Float64,Float64
1,0.549554,0.262778,0.279193,0.256622
2,0.340675,0.433378,0.729121,0.607762
3,0.594728,0.114828,0.628797,0.904247
4,0.130788,0.219316,0.620918,0.291334
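
Depending on the installed version, combining several selectors may need to be written with `Cols` rather than `All` with arguments. The following is a sketch under that assumption (a release that exports `Cols`); the data frame is built from explicitly named columns so that it does not depend on automatic column naming.

```julia
using DataFrames

x = DataFrame(x1 = rand(4), x2 = rand(4), x3 = rand(4), x4 = rand(4), x5 = rand(4))

between  = x[:, Between(:x2, :x4)]              # columns x2, x3, x4
# Assumption: `Cols` is available in the installed version.
combined = x[:, Cols(:x1, Between(:x2, :x4))]   # x1 followed by x2 through x4
```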


### Views

You can easily create a view of a `DataFrame`, which is more efficient than creating a materialized selection. Here are the possible return types.

In [31]:
@view x[1:2, 1]

2-element view(::Array{Float64,1}, 1:2) with eltype Float64:
 0.549554006434269  
 0.34067516965756894

In [32]:
@view x[1,1]

0-dimensional view(::Array{Float64,1}, 1) with eltype Float64:
0.549554006434269

In [33]:
@view x[1, 1:2] # a DataFrameRow, the same as for x[1, 1:2] without a view

Row,x1,x2
,Float64,Float64
1,0.549554,0.262778


In [34]:
@view x[1:2, 1:2] # a SubDataFrame

Row,x1,x2
,Float64,Float64
1,0.549554,0.262778
2,0.340675,0.433378
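
Because a view does not copy data, writing through it changes the parent data frame. A minimal sketch with a small hypothetical data frame:

```julia
using DataFrames

df = DataFrame(x1 = [0.1, 0.2, 0.3], x2 = [1.0, 2.0, 3.0])

v = @view df[1:2, 1:2]   # a SubDataFrame: no data is copied
v[1, :x1] = 100.0        # writing through the view also changes df

r = @view df[1, 1:2]     # a DataFrameRow
```

This sharing is what makes views cheap, and it is also why they call for the same care as direct column access with `!`.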
