# A deep dive into DataFrames.jl indexing
# Part 1: indexing in DataFrames.jl by example
### Bogumił Kamiński

A `DataFrame` is an object that holds a collection of named columns stored as vectors.

In this tutorial we discuss how to get or set values of these columns.

What are we going to cover:
* `getindex`, a.k.a. `x[...]`
* `setindex!`, a.k.a. `x[...] =`
* `broadcast`, a.k.a. `fun.(x)`
* `broadcast!`, a.k.a. `x .= ...`
* `getproperty`, a.k.a. `x.field` and `x.field .= ...`
* `setproperty!`, a.k.a. `x.field = ...`

Indexable types that DataFrames.jl defines:
* `DataFrame`
* `SubDataFrame`
* `DataFrameRow`
* `DataFrameRows`
* `DataFrameColumns`
* `GroupedDataFrame`
* `GroupKeys`
* `GroupKey`
* `StackedVector`
* `RepeatedVector`

### Environment setup

In [1]:
using BenchmarkTools

In [2]:
using CSV

In [3]:
using DataFrames

In [4]:
using Dates

In [5]:
using ShiftedArrays

In [6]:
using Statistics

Typical initial setup steps to override the default settings of Jupyter Notebook:

In [7]:
ENV["COLUMNS"] = 500

500

In [8]:
ENV["LINES"] = 15

15

Let us read in the data set we will work with.

Make sure you have the required file in the working directory.

Detailed instructions on how to get it are at https://github.com/bkamins/JuliaCon2020-DataFrames-Tutorial/blob/master/README.md

In [9]:
fh5 = CSV.File("fh_5yrs.csv") |> DataFrame

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2020-07-02,257500,17.64,17.74,17.62,17.71,17.71,AAAU
2,2020-07-01,468100,17.73,17.73,17.54,17.68,17.68,AAAU
3,2020-06-30,319100,17.65,17.8,17.61,17.78,17.78,AAAU
4,2020-06-29,405500,17.67,17.69,17.63,17.68,17.68,AAAU
5,2020-06-26,335100,17.49,17.67,17.42,17.67,17.67,AAAU
6,2020-06-25,246800,17.6,17.6,17.52,17.59,17.59,AAAU
7,2020-06-24,329200,17.61,17.71,17.56,17.61,17.61,AAAU
8,2020-06-23,351800,17.55,17.66,17.55,17.66,17.66,AAAU
9,2020-06-22,308300,17.5,17.57,17.44,17.5,17.5,AAAU
10,2020-06-19,153800,17.27,17.4,17.26,17.4,17.4,AAAU


#### Warm up exercises

*Get a short description of the columns in our data frame*

In [10]:
describe(fh5)

Unnamed: 0_level_0,variable,mean,min,median,max,nunique,nmissing,eltype
Unnamed: 0_level_1,Symbol,Union…,Any,Union…,Any,Union…,Nothing,DataType
1,date,,2015-01-02,,2020-07-02,1385.0,,Date
2,volume,1015410.0,1,120600.0,2156725200,,,Int64
3,open,298.086,0.001,24.95,6.91553e7,,,Float64
4,high,305.876,0.0,25.11,7.05886e7,,,Float64
5,low,291.014,0.0,24.75,6.84387e7,,,Float64
6,close,296.783,0.001,24.94,6.95136e7,,,Float64
7,adjclose,293.231,-3.77096,23.3258,6.90222e7,,,Float64
8,symbol,,AAAU,,ZYXI,6335.0,,String


(see https://github.com/JuliaData/DataFrames.jl/issues/2269 for a discussion of the design decisions here, feel free to comment there if you have an opinion)

*Get information about exact types of the columns stored in the data frame*

In [11]:
typeof.(eachcol(fh5))

8-element Array{DataType,1}:
 Array{Date,1}
 Array{Int64,1}
 Array{Float64,1}
 Array{Float64,1}
 Array{Float64,1}
 Array{Float64,1}
 Array{Float64,1}
 PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}}

*Get names of columns as strings*

In [12]:
names(fh5)

8-element Array{String,1}:
 "date"
 "volume"
 "open"
 "high"
 "low"
 "close"
 "adjclose"
 "symbol"

*Get names of columns as `Symbol`s*

In [13]:
propertynames(fh5)

8-element Array{Symbol,1}:
 :date
 :volume
 :open
 :high
 :low
 :close
 :adjclose
 :symbol

## `getindex`

Get a single column as a whole without copying

In [14]:
unique([fh5.date,
        fh5."date",
        fh5[!, 1],
        fh5[!, :date],
        fh5[!, "date"]])

1-element Array{Array{Date,1},1}:
 [Date("2020-07-02"), Date("2020-07-01"), Date("2020-06-30"), Date("2020-06-29"), Date("2020-06-26"), Date("2020-06-25"), Date("2020-06-24"), Date("2020-06-23"), Date("2020-06-22"), Date("2020-06-19")  …  Date("2015-01-21"), Date("2015-01-20"), Date("2015-01-16"), Date("2015-01-14"), Date("2015-01-13"), Date("2015-01-12"), Date("2015-01-09"), Date("2015-01-07"), Date("2015-01-05"), Date("2015-01-02")]

In [15]:
unique([getproperty(fh5, :date),
        getproperty(fh5, "date"),
        getindex(fh5, !, 1),
        getindex(fh5, !, :date),
        getindex(fh5,!, "date")])

1-element Array{Array{Date,1},1}:
 [Date("2020-07-02"), Date("2020-07-01"), Date("2020-06-30"), Date("2020-06-29"), Date("2020-06-26"), Date("2020-06-25"), Date("2020-06-24"), Date("2020-06-23"), Date("2020-06-22"), Date("2020-06-19")  …  Date("2015-01-21"), Date("2015-01-20"), Date("2015-01-16"), Date("2015-01-14"), Date("2015-01-13"), Date("2015-01-12"), Date("2015-01-09"), Date("2015-01-07"), Date("2015-01-05"), Date("2015-01-02")]

Get a single column as a whole with copying

In [16]:
unique([copy(fh5.date),
        copy(fh5."date"),
        fh5[:, 1],
        fh5[:, :date],
        fh5[:, "date"]])

1-element Array{Array{Date,1},1}:
 [Date("2020-07-02"), Date("2020-07-01"), Date("2020-06-30"), Date("2020-06-29"), Date("2020-06-26"), Date("2020-06-25"), Date("2020-06-24"), Date("2020-06-23"), Date("2020-06-22"), Date("2020-06-19")  …  Date("2015-01-21"), Date("2015-01-20"), Date("2015-01-16"), Date("2015-01-14"), Date("2015-01-13"), Date("2015-01-12"), Date("2015-01-09"), Date("2015-01-07"), Date("2015-01-05"), Date("2015-01-02")]
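
The difference between the non-copying (`!`, `df.col`) and copying (`:`) forms is easiest to see on a small data frame. The following is a minimal sketch; the data frame `df` and the names `col_alias` and `col_copy` are made up for illustration and are not part of the tutorial data.

```julia
using DataFrames

# A tiny hypothetical data frame used only for this illustration.
df = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])

col_alias = df[!, :a]   # the vector stored inside `df`, no copy is made
col_copy  = df[:, :a]   # a freshly allocated copy of that vector

@show col_alias === df.a   # same object
@show col_copy === df.a    # different object
@show col_copy == df.a     # but equal contents

col_copy[1] = 100    # changes only the copy
col_alias[2] = 200   # changes `df` in place
@show df.a
```

Mutating `col_copy` leaves `df` untouched, while mutating `col_alias` changes the data frame, which is why the non-copying forms should be used with care.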

Let us compare the performance of various ways to get a column without copying

In [17]:
@btime $fh5.date
@btime $fh5."date"
@btime $fh5[!, 1]
@btime $fh5[!, :date]
@btime $fh5[!, "date"];

  13.326 ns (0 allocations: 0 bytes)
  36.894 ns (0 allocations: 0 bytes)
  4.500 ns (0 allocations: 0 bytes)
  13.426 ns (0 allocations: 0 bytes)
  36.895 ns (0 allocations: 0 bytes)


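`@btime` comes from the BenchmarkTools.jl package. We prefix `fh5` with `$` so that its value is interpolated into the benchmarked expression; otherwise the measurement would also include the cost of accessing a non-constant global variable.
This `$` syntax is specific to the BenchmarkTools.jl macros (it resembles the `$` used in string interpolation).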
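
As a minimal sketch of the `$` interpolation on a standalone example (the vector `x` and the variable `t` are hypothetical names, and the `samples` and `evals` values are arbitrary choices to keep the run short):

```julia
using BenchmarkTools

x = rand(1000)

# `$x` splices the value of `x` into the benchmarked expression, so the
# benchmark operates on the value itself rather than on a global variable.
# `@belapsed` returns the minimum elapsed time in seconds instead of printing it.
t = @belapsed sum($x) samples=100 evals=10

println("summing 1000 elements took about ", t, " seconds")
```

Dropping the `$` would still run at top level, but the reported time would typically be less representative of the operation itself.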

#### Exercise

Check the same but with copying

In [18]:
@btime copy($fh5.date)
@btime copy($fh5."date")
@btime $fh5[:, 1]
@btime $fh5[:, :date]
@btime $fh5[:, "date"];

  19.042 ms (2 allocations: 52.28 MiB)
  19.034 ms (2 allocations: 52.28 MiB)
  18.979 ms (2 allocations: 52.28 MiB)
  19.051 ms (2 allocations: 52.28 MiB)
  19.009 ms (2 allocations: 52.28 MiB)


Let us check how lookup speed scales with the number of columns:

In [19]:
@time df_tmp = DataFrame(ones(1, 100_000))

  0.090328 seconds (599.57 k allocations: 47.574 MiB)


Unnamed: 0_level_0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,x20,x21,x22,x23,x24,x25,x26,x27,x28,x29,x30,x31,x32,x33,x34,x35,x36,x37,x38,x39,x40,x41,x42,x43,x44,x45,x46,x47,x48,x49,x50,x51,x52,x53,x54,x55,x56
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [20]:
@btime $df_tmp.x100000
@btime $df_tmp."x100000"
@btime $df_tmp[!, 100000];

  15.129 ns (0 allocations: 0 bytes)
  56.211 ns (0 allocations: 0 bytes)
  4.599 ns (0 allocations: 0 bytes)


<div class="alert alert-block alert-info">
<b>Tip:</b>
    
DataFrames.jl is specifically designed to handle very wide data frames with heterogeneous column types, and in-place changes to the schema of a data frame, without huge compilation costs.
(It is also a good choice if you do not want to think about whether you will run into these issues.)
</div>

Get a single column, but take a subset of rows: you can either make a copy or get a view

In [21]:
fh5[1:2, :date]

2-element Array{Date,1}:
 2020-07-02
 2020-07-01

In [22]:
view(fh5, 1:2, :date)

2-element view(::Array{Date,1}, 1:2) with eltype Date:
 2020-07-02
 2020-07-01

This is the same as, e.g.:

In [23]:
fh5.date[1:2]

2-element Array{Date,1}:
 2020-07-02
 2020-07-01

In [24]:
@view fh5.date[1:2]

2-element view(::Array{Date,1}, 1:2) with eltype Date:
 2020-07-02
 2020-07-01

You can use `Not` for inverted selection of rows:

In [25]:
fh5[Not(3:end), :date]

2-element Array{Date,1}:
 2020-07-02
 2020-07-01
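
The three ways of subsetting rows of a single column can be compared side by side. This is a small sketch on a made-up data frame (`df`, `copied`, `viewed` and `kept` are illustrative names only):

```julia
using DataFrames

# Hypothetical data used only for this illustration.
df = DataFrame(id = 1:5, x = [10, 20, 30, 40, 50])

copied = df[1:2, :x]          # new Vector holding rows 1 and 2
viewed = view(df, 1:2, :x)    # SubArray pointing into `df.x`
kept   = df[Not(3:end), :x]   # drop rows 3 to 5, i.e. the same rows as 1:2

viewed[1] = -1   # writes through to `df`
copied[2] = -2   # affects only the copy
@show df.x
@show kept
```

Only the assignment through `viewed` is visible in `df`; `copied` and `kept` are independent vectors.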

Get a single cell in a data frame: you can either get a value or a view

In [26]:
fh5[1, :date]

2020-07-02

In [27]:
fh5[CartesianIndex(1, 1)]

2020-07-02

In [28]:
@view fh5[1, "date"]

0-dimensional view(::Array{Date,1}, 1) with eltype Date:
Date("2020-07-02")

In what cases might you want to use a view instead of getting a value?

Check the consequences of running the following lines:

In [29]:
tmp_cell = view(fh5, 1, :date)

0-dimensional view(::Array{Date,1}, 1) with eltype Date:
Date("2020-07-02")

In [30]:
tmp_cell2 = getindex(fh5, 1, :date)

2020-07-02

In [31]:
tmp_cell[] = Date("2222-07-02")

2222-07-02

In [32]:
fh5

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2222-07-02,257500,17.64,17.74,17.62,17.71,17.71,AAAU
2,2020-07-01,468100,17.73,17.73,17.54,17.68,17.68,AAAU
3,2020-06-30,319100,17.65,17.8,17.61,17.78,17.78,AAAU
4,2020-06-29,405500,17.67,17.69,17.63,17.68,17.68,AAAU
5,2020-06-26,335100,17.49,17.67,17.42,17.67,17.67,AAAU
6,2020-06-25,246800,17.6,17.6,17.52,17.59,17.59,AAAU
7,2020-06-24,329200,17.61,17.71,17.56,17.61,17.61,AAAU
8,2020-06-23,351800,17.55,17.66,17.55,17.66,17.66,AAAU
9,2020-06-22,308300,17.5,17.57,17.44,17.5,17.5,AAAU
10,2020-06-19,153800,17.27,17.4,17.26,17.4,17.4,AAAU


Revert the change we have just made

In [33]:
tmp_cell[] = tmp_cell2

2020-07-02

In [34]:
fh5

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2020-07-02,257500,17.64,17.74,17.62,17.71,17.71,AAAU
2,2020-07-01,468100,17.73,17.73,17.54,17.68,17.68,AAAU
3,2020-06-30,319100,17.65,17.8,17.61,17.78,17.78,AAAU
4,2020-06-29,405500,17.67,17.69,17.63,17.68,17.68,AAAU
5,2020-06-26,335100,17.49,17.67,17.42,17.67,17.67,AAAU
6,2020-06-25,246800,17.6,17.6,17.52,17.59,17.59,AAAU
7,2020-06-24,329200,17.61,17.71,17.56,17.61,17.61,AAAU
8,2020-06-23,351800,17.55,17.66,17.55,17.66,17.66,AAAU
9,2020-06-22,308300,17.5,17.57,17.44,17.5,17.5,AAAU
10,2020-06-19,153800,17.27,17.4,17.26,17.4,17.4,AAAU


To conclude note that with `view` there is no difference between `!` and `:`:

In [35]:
@view fh5[!, 1]

6852038-element view(::Array{Date,1}, :) with eltype Date:
 2020-07-02
 2020-07-01
 2020-06-30
 2020-06-29
 2020-06-26
 ⋮
 2015-01-12
 2015-01-09
 2015-01-07
 2015-01-05
 2015-01-02

In [36]:
@view fh5[:, 1]

6852038-element view(::Array{Date,1}, :) with eltype Date:
 2020-07-02
 2020-07-01
 2020-06-30
 2020-06-29
 2020-06-26
 ⋮
 2015-01-12
 2015-01-09
 2015-01-07
 2015-01-05
 2015-01-02

#### Exercise

Why is it useful to support `@view` both for `!` and `:`?

*Answer: because it makes it easy to use @views on whole expressions*
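
A minimal sketch of what this answer means in practice, using a made-up data frame (`df`, `total` and `col` are illustrative names): `@views` turns every indexing expression in the annotated code into a view, regardless of whether the row selector is `!` or `:`.

```julia
using DataFrames

# Hypothetical data used only for this illustration.
df = DataFrame(a = [1.0, 2.0, 3.0], b = [4.0, 5.0, 6.0])

# Both indexing expressions below are converted to views by `@views`,
# so neither `df[:, :a]` nor `df[!, :b]` copies a column here.
total = @views sum(df[:, :a]) + sum(df[!, :b])

col = @views df[:, :a]
@show col isa SubArray
@show total
```

If `view` rejected one of the two row selectors, such code would have to be rewritten before `@views` could be applied to it.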

<div class="alert alert-block alert-info">
<b>Tip:</b>
    
Passing a single column as an integer, `Symbol` or string drops one dimension of
a data frame and allows you to select or subset a column from it (depending on the row selector you choose).
</div>

Multiple column selection options are:
* a vector of `Symbol` (does not have to be a subtype of `AbstractVector{Symbol}`, e.g. `Any[:date]`);
* a vector of `AbstractString` (does not have to be a subtype of `AbstractVector{<:AbstractString}`, e.g. `Any["date"]`);
* a vector of `Integer` other than `Bool` (does not have to be a subtype of `AbstractVector{<:Integer}`, e.g. `Any[1]`);
* a vector of `Bool` that has to be a subtype of `AbstractVector{Bool}`;
* a regular expression, which gets expanded to a vector of matching column names;
* a `Not` expression;
* an `All` or `Between` expression;
* a colon literal `:`.

The type of the result depends on the row selector:
* if it is a single row you get a `DataFrameRow` (a dimension is dropped)
* if it is a collection of rows you get a data frame
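
The rules above can be checked on a small example. The sketch below uses a hypothetical three-column data frame and a few of the listed column selectors; it is meant as an illustration rather than an exhaustive test.

```julia
using DataFrames

# Hypothetical data used only for this illustration.
df = DataFrame(a1 = 1:3, a2 = 4:6, b = ["x", "y", "z"])

@show df[1, :a1] isa Int                         # single row, single column: a value
@show df[1:2, :a1] isa Vector                    # many rows, single column: a vector
@show df[1, [:a1, :b]] isa DataFrameRow          # single row, many columns
@show df[1:2, [:a1, :b]] isa DataFrame           # many rows, many columns
@show view(df, 1:2, [:a1, :b]) isa SubDataFrame  # the same selection as a view

@show names(df[1:2, r"^a"])                 # regular expression
@show names(df[1:2, Not(:b)])               # inverted selection
@show names(df[1:2, Between(:a1, :a2)])     # range of columns
@show names(df[1:2, [true, false, true]])   # Bool mask
```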

Single row selection is always a view that is `DataFrameRow`:

In [37]:
fh5[1, [:date]]

Unnamed: 0_level_0,date
Unnamed: 0_level_1,Date
1,2020-07-02


In [38]:
@view fh5[1, [:date]]

Unnamed: 0_level_0,date
Unnamed: 0_level_1,Date
1,2020-07-02


Note that `DataFrameRow` is one-dimensional (as usual - single value indexing drops a dimension). You can think of it as a mutable `NamedTuple`.
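
To make the "mutable `NamedTuple`" analogy concrete, here is a minimal sketch on a made-up data frame (`df` and `row` are illustrative names). It shows that writing to a `DataFrameRow` changes its parent, and that the row can be converted to an ordinary `NamedTuple`.

```julia
using DataFrames

# Hypothetical data used only for this illustration.
df = DataFrame(a = [1, 2], b = ["x", "y"])

row = df[1, :]        # a DataFrameRow, i.e. a view of row 1
row.a = 10            # property access, like a NamedTuple field
row[:b] = "changed"   # indexing by column name also works

@show df.a
@show df.b
@show length(row)        # one entry per column
@show NamedTuple(row)    # an immutable snapshot of the row
```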

Making `fh5[1, [:date]]` copy the data would incur overhead that was not considered justified by the typical use cases of this syntax (but if you disagree, please open an issue on GitHub).

Multiple row selection is a `DataFrame` for `getindex`:

In [39]:
fh5[1:2, 1:2]

Unnamed: 0_level_0,date,volume
Unnamed: 0_level_1,Date,Int64
1,2020-07-02,257500
2,2020-07-01,468100


Using `!` as the row selector does not copy columns and is fast:

In [40]:
df_tmp = fh5[!, 1:2]

Unnamed: 0_level_0,date,volume
Unnamed: 0_level_1,Date,Int64
1,2020-07-02,257500
2,2020-07-01,468100
3,2020-06-30,319100
4,2020-06-29,405500
5,2020-06-26,335100
6,2020-06-25,246800
7,2020-06-24,329200
8,2020-06-23,351800
9,2020-06-22,308300
10,2020-06-19,153800


In [41]:
df_tmp.date === fh5.date

true

Using `view` creates a `SubDataFrame`

In [42]:
dfv_tmp = view(fh5, 1:2, 1:2)

Unnamed: 0_level_0,date,volume
Unnamed: 0_level_1,Date,Int64
1,2020-07-02,257500
2,2020-07-01,468100


In [43]:
typeof(dfv_tmp)

SubDataFrame{DataFrame,DataFrames.SubIndex{DataFrames.Index,UnitRange{Int64},UnitRange{Int64}},UnitRange{Int64}}
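
Because a `SubDataFrame` is a view, writes to it are visible in the parent data frame. A minimal sketch, assuming a small made-up data frame (`df` and `sdf` are illustrative names):

```julia
using DataFrames

# Hypothetical data used only for this illustration.
df = DataFrame(a = [1, 2, 3, 4], b = [0.1, 0.2, 0.3, 0.4])

sdf = view(df, 2:3, [:a])   # rows 2 and 3, column :a only

sdf[1, :a] = 100            # row 1 of the view is row 2 of the parent
@show df.a
@show parent(sdf) === df
```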

#### Exercise

Check that `view(fh5, !, :)` and `view(fh5, :, :)` produce the same result

In [44]:
dump(view(fh5, !, :), maxdepth=2)

SubDataFrame{DataFrame,DataFrames.Index,Base.OneTo{Int64}}
  parent: DataFrame
    columns: Array{AbstractArray{T,1} where T}((8,))
    colindex: DataFrames.Index
  colindex: DataFrames.Index
    lookup: Dict{Symbol,Int64}
    names: Array{Symbol}((8,))
  rows: Base.OneTo{Int64}
    stop: Int64 6852038


In [45]:
dump(view(fh5, :, :), maxdepth=2)

SubDataFrame{DataFrame,DataFrames.Index,Base.OneTo{Int64}}
  parent: DataFrame
    columns: Array{AbstractArray{T,1} where T}((8,))
    colindex: DataFrames.Index
  colindex: DataFrames.Index
    lookup: Dict{Symbol,Int64}
    names: Array{Symbol}((8,))
  rows: Base.OneTo{Int64}
    stop: Int64 6852038


As a warning remember that when you modify the parent of a `SubDataFrame` (or `DataFrameRow`) you may get an error when trying to access it:

In [46]:
df_tmp = fh5[1:3, 1:4]

Unnamed: 0_level_0,date,volume,open,high
Unnamed: 0_level_1,Date,Int64,Float64,Float64
1,2020-07-02,257500,17.64,17.74
2,2020-07-01,468100,17.73,17.73
3,2020-06-30,319100,17.65,17.8


In [47]:
dfv_tmp = view(df_tmp, 1:2, 1:3)

Unnamed: 0_level_0,date,volume,open
Unnamed: 0_level_1,Date,Int64,Float64
1,2020-07-02,257500,17.64
2,2020-07-01,468100,17.73


In [48]:
select!(df_tmp, 1) # note that in `select` et al. a data frame is always produced

Unnamed: 0_level_0,date
Unnamed: 0_level_1,Date
1,2020-07-02
2,2020-07-01
3,2020-06-30


In [49]:
dfv_tmp

BoundsError: BoundsError: attempt to access 1-element Array{Symbol,1} at index [1:3]

A special case is when you use `:` as a column selection with a `view`. In this case the `SubDataFrame` and `DataFrameRow` always get updated with the changed columns:

In [50]:
df_tmp = fh5[1:3, 1:4]

Unnamed: 0_level_0,date,volume,open,high
Unnamed: 0_level_1,Date,Int64,Float64,Float64
1,2020-07-02,257500,17.64,17.74
2,2020-07-01,468100,17.73,17.73
3,2020-06-30,319100,17.65,17.8


In [51]:
dfv_tmp = view(df_tmp, 1:2, :)

Unnamed: 0_level_0,date,volume,open,high
Unnamed: 0_level_1,Date,Int64,Float64,Float64
1,2020-07-02,257500,17.64,17.74
2,2020-07-01,468100,17.73,17.73


In [52]:
select!(df_tmp, 1, :open => :newcol)

Unnamed: 0_level_0,date,newcol
Unnamed: 0_level_1,Date,Float64
1,2020-07-02,17.64
2,2020-07-01,17.73
3,2020-06-30,17.65


In [53]:
dfv_tmp

Unnamed: 0_level_0,date,newcol
Unnamed: 0_level_1,Date,Float64
1,2020-07-02,17.64
2,2020-07-01,17.73


The reason for this behavior is that subsetting of a data frame by only rows (and taking all columns) is very common, and in this case we can create and index such views much faster. In particular `DataFrameRow`s produced by `eachrow` are efficient this way:

In [54]:
@btime mean(x -> x.open, eachrow(fh5))

  1.010 s (34259693 allocations: 522.76 MiB)


298.08612911254846

In [55]:
@btime mean(i -> fh5[i, :open], 1:nrow(fh5))

Of course, type-stable operation would be faster (but sometimes processing data row-wise is more convenient):

In [56]:
@btime mean(fh5.open)

  3.001 ms (1 allocation: 16 bytes)


298.08612911254846

or, if your table is not very wide (so that you are not penalized by the compilation cost of `NamedTuple`) you can use:

In [57]:
@btime mean(x -> x.open, Tables.namedtupleiterator(fh5))

  10.517 ms (18 allocations: 1.31 KiB)


298.0861291125547

<div class="alert alert-block alert-info">
<b>Tip:</b>
    

If your table is not very wide then you can use `Tables.namedtupleiterator` or `Tables.columntable` to switch a `DataFrame` into a type-stable mode (at the cost of fixing its schema). Both are non-allocating. You can also use `Tables.rowtable`, but it is allocating, so usually `Tables.namedtupleiterator` is preferred.
</div>
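
A minimal sketch of the two type-stable representations on a made-up data frame. The functions are accessed here as `DataFrames.Tables.…`, which refers to the Tables.jl module that DataFrames.jl itself loads; `df`, `ct` and `rows` are illustrative names.

```julia
using DataFrames, Statistics

# Hypothetical data used only for this illustration.
df = DataFrame(open = [1.0, 2.0, 3.0], close = [1.5, 2.5, 2.0])

ct = DataFrames.Tables.columntable(df)            # NamedTuple of column vectors
rows = DataFrames.Tables.namedtupleiterator(df)   # iterator of NamedTuple rows

@show typeof(ct)                # the schema is now part of the type
@show mean(r -> r.open, rows)   # each `r` is a NamedTuple with a concrete type
@show first(rows)
```

Since the column names and element types are encoded in the type, code working on `ct` or `rows` can be compiled for this particular schema, which is the source of both the speed and the compilation cost for very wide tables.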

Note though that `DataFrameRow` allows you to modify the source data frame, while iterating `NamedTuple`s is read-only (more on `setindex!` later).

In [58]:
df_tmp = copy(fh5)

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2020-07-02,257500,17.64,17.74,17.62,17.71,17.71,AAAU
2,2020-07-01,468100,17.73,17.73,17.54,17.68,17.68,AAAU
3,2020-06-30,319100,17.65,17.8,17.61,17.78,17.78,AAAU
4,2020-06-29,405500,17.67,17.69,17.63,17.68,17.68,AAAU
5,2020-06-26,335100,17.49,17.67,17.42,17.67,17.67,AAAU
6,2020-06-25,246800,17.6,17.6,17.52,17.59,17.59,AAAU
7,2020-06-24,329200,17.61,17.71,17.56,17.61,17.61,AAAU
8,2020-06-23,351800,17.55,17.66,17.55,17.66,17.66,AAAU
9,2020-06-22,308300,17.5,17.57,17.44,17.5,17.5,AAAU
10,2020-06-19,153800,17.27,17.4,17.26,17.4,17.4,AAAU


#### Exercise

In `df_tmp` find rows in which `:high` is less than `:low` and swap these values.
We will discuss three ways to do it.

In [59]:
bad_idx = findall(df_tmp.low .> df_tmp.high)
@show df_tmp[bad_idx, :]
df_tmp.low[bad_idx], df_tmp.high[bad_idx] = df_tmp.high[bad_idx], df_tmp.low[bad_idx]
@show df_tmp[bad_idx, :]

df_tmp[bad_idx, :] = 2×8 DataFrame
│ Row │ date       │ volume │ open    │ high    │ low     │ close   │ adjclose │ symbol │
│     │ Date       │ Int64  │ Float64 │ Float64 │ Float64 │ Float64 │ Float64  │ String │
├─────┼────────────┼────────┼─────────┼─────────┼─────────┼─────────┼──────────┼────────┤
│ 1   │ 2016-10-19 │ 468    │ 6345.0  │ 6250.0  │ 6345.0  │ 6250.0  │ 6250.0   │ PALC   │
│ 2   │ 2016-10-07 │ 88860  │ 6250.0  │ 6150.0  │ 6250.0  │ 6150.0  │ 6150.0   │ PALC   │
df_tmp[bad_idx, :] = 2×8 DataFrame
│ Row │ date       │ volume │ open    │ high    │ low     │ close   │ adjclose │ symbol │
│     │ Date       │ Int64  │ Float64 │ Float64 │ Float64 │ Float64 │ Float64  │ String │
├─────┼────────────┼────────┼─────────┼─────────┼─────────┼─────────┼──────────┼────────┤
│ 1   │ 2016-10-19 │ 468    │ 6345.0  │ 6345.0  │ 6250.0  │ 6250.0  │ 6250.0   │ PALC   │
│ 2   │ 2016-10-07 │ 88860  │ 6250.0  │ 6250.0  │ 6150.0  │ 6150.0  │ 6150.0   │ PALC   │


Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2016-10-19,468,6345.0,6345.0,6250.0,6250.0,6250.0,PALC
2,2016-10-07,88860,6250.0,6250.0,6150.0,6150.0,6150.0,PALC


In [60]:
df_tmp = copy(fh5)
@show df_tmp[bad_idx, :]
foreach(eachrow(df_tmp)) do row
    if row.low >  row.high
        row.low, row.high = row.high, row.low
    end
end
@show df_tmp[bad_idx, :]

df_tmp[bad_idx, :] = 2×8 DataFrame
│ Row │ date       │ volume │ open    │ high    │ low     │ close   │ adjclose │ symbol │
│     │ Date       │ Int64  │ Float64 │ Float64 │ Float64 │ Float64 │ Float64  │ String │
├─────┼────────────┼────────┼─────────┼─────────┼─────────┼─────────┼──────────┼────────┤
│ 1   │ 2016-10-19 │ 468    │ 6345.0  │ 6250.0  │ 6345.0  │ 6250.0  │ 6250.0   │ PALC   │
│ 2   │ 2016-10-07 │ 88860  │ 6250.0  │ 6150.0  │ 6250.0  │ 6150.0  │ 6150.0   │ PALC   │
df_tmp[bad_idx, :] = 2×8 DataFrame
│ Row │ date       │ volume │ open    │ high    │ low     │ close   │ adjclose │ symbol │
│     │ Date       │ Int64  │ Float64 │ Float64 │ Float64 │ Float64 │ Float64  │ String │
├─────┼────────────┼────────┼─────────┼─────────┼─────────┼─────────┼──────────┼────────┤
│ 1   │ 2016-10-19 │ 468    │ 6345.0  │ 6345.0  │ 6250.0  │ 6250.0  │ 6250.0   │ PALC   │
│ 2   │ 2016-10-07 │ 88860  │ 6250.0  │ 6250.0  │ 6150.0  │ 6150.0  │ 6150.0   │ PALC   │


Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2016-10-19,468,6345.0,6345.0,6250.0,6250.0,6250.0,PALC
2,2016-10-07,88860,6250.0,6250.0,6150.0,6150.0,6150.0,PALC


In [61]:
df_tmp = copy(fh5)
@show df_tmp[bad_idx, :]
transform!(df_tmp, [[:low, :high]] .=> ByRow.([min, max]) .=> [:low, :high])
@show df_tmp[bad_idx, :]

df_tmp[bad_idx, :] = 2×8 DataFrame
│ Row │ date       │ volume │ open    │ high    │ low     │ close   │ adjclose │ symbol │
│     │ Date       │ Int64  │ Float64 │ Float64 │ Float64 │ Float64 │ Float64  │ String │
├─────┼────────────┼────────┼─────────┼─────────┼─────────┼─────────┼──────────┼────────┤
│ 1   │ 2016-10-19 │ 468    │ 6345.0  │ 6250.0  │ 6345.0  │ 6250.0  │ 6250.0   │ PALC   │
│ 2   │ 2016-10-07 │ 88860  │ 6250.0  │ 6150.0  │ 6250.0  │ 6150.0  │ 6150.0   │ PALC   │
df_tmp[bad_idx, :] = 2×8 DataFrame
│ Row │ date       │ volume │ open    │ high    │ low     │ close   │ adjclose │ symbol │
│     │ Date       │ Int64  │ Float64 │ Float64 │ Float64 │ Float64 │ Float64  │ String │
├─────┼────────────┼────────┼─────────┼─────────┼─────────┼─────────┼──────────┼────────┤
│ 1   │ 2016-10-19 │ 468    │ 6345.0  │ 6345.0  │ 6250.0  │ 6250.0  │ 6250.0   │ PALC   │
│ 2   │ 2016-10-07 │ 88860  │ 6250.0  │ 6250.0  │ 6150.0  │ 6150.0  │ 6150.0   │ PALC   │


Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2016-10-19,468,6345.0,6345.0,6250.0,6250.0,6250.0,PALC
2,2016-10-07,88860,6250.0,6250.0,6150.0,6150.0,6150.0,PALC
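
The broadcasted expression passed to `transform!` in the last solution is compact, so it may help to see what it expands to. The sketch below uses a small made-up data frame (`df` and `ops` are illustrative names):

```julia
using DataFrames

# Hypothetical data: in row 2 `low` is greater than `high`.
df = DataFrame(low = [1.0, 5.0, 3.0], high = [2.0, 4.0, 6.0])

# Broadcasting builds one operation per target column:
#   [:low, :high] => (ByRow(min) => :low)
#   [:low, :high] => (ByRow(max) => :high)
ops = [[:low, :high]] .=> ByRow.([min, max]) .=> [:low, :high]
@show length(ops)

# Both operations read the original `low` and `high` columns.
transform!(df, ops)
@show df.low
@show df.high
```

The outer `[[:low, :high]]` is a one-element vector, so the same pair of source columns is reused for both the `min` and the `max` operation.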


Below are several column selectors. Can you predict the effect of each of them when running `fh5[1:2, selector]`?
Write code that tests your predictions.

In [62]:
selectors = [Between(1, 10), Between(:low, :high), [:low, :low], All(:low, :low), All(:low, :), All()]

6-element Array{Any,1}:
 Between{Int64,Int64}(1, 10)
 Between{Symbol,Symbol}(:low, :high)
 [:low, :low]
 All{Tuple{Symbol,Symbol}}((:low, :low))
 All{Tuple{Symbol,Colon}}((:low, Colon()))
 All{Tuple{}}(())

In [63]:
for selector in selectors
    @show selector
    try
        println(fh5[1:2, selector])
    catch
        println("errored")
    end
end

selector = Between{Int64,Int64}(1, 10)
errored
selector = Between{Symbol,Symbol}(:low, :high)
0×0 DataFrame

selector = [:low, :low]
errored
selector = All{Tuple{Symbol,Symbol}}((:low, :low))
2×1 DataFrame
│ Row │ low     │
│     │ Float64 │
├─────┼─────────┤
│ 1   │ 17.62   │
│ 2   │ 17.54   │
selector = All{Tuple{Symbol,Colon}}((:low, Colon()))
2×8 DataFrame
│ Row │ low     │ date       │ volume │ open    │ high    │ close   │ adjclose │ symbol │
│     │ Float64 │ Date       │ Int64  │ Float64 │ Float64 │ Float64 │ Float64  │ String │
├─────┼─────────┼────────────┼────────┼─────────┼─────────┼─────────┼──────────┼────────┤
│ 1   │ 17.62   │ 2020-07-02 │ 257500 │ 17.64   │ 17.74   │ 17.71   │ 17.71    │ AAAU   │
│ 2   │ 17.54   │ 2020-07-01 │ 468100 │ 17.73   │ 17.73   │ 17.68   │ 17.68    │ AAAU   │
selector = All{Tuple{}}(())
2×8 DataFrame
│ Row │ date       │ volume │ open    │ high    │ low  

<div class="alert alert-block alert-info">
<b>Tip:</b>
    
In general `df.single_col` and `df[!, single_col]` produce the same result in `getindex`.

The exceptions are:
* `@view` and `@views` do not affect `df.single_col`.
* in `df.single_col` you cannot pass `single_col` as a variable (unless you write `getproperty(df, single_col)`)
* only `df[!, single_col]` allows integer indexing

</div>
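
The exceptions listed in the tip can be illustrated with a short sketch on a made-up data frame (`df`, `colname` and `v` are illustrative names):

```julia
using DataFrames

# Hypothetical data used only for this illustration.
df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

colname = :b                              # column chosen at run time
@show getproperty(df, colname) === df.b   # property access through a variable
@show df[!, colname] === df.b             # indexing accepts the variable directly
@show df[!, 2] === df.b                   # integer position works only with indexing

# `@views` rewrites indexing expressions only, so the result here is a view,
# whereas `df.b` would still return the stored vector itself.
v = @views df[!, colname]
@show v isa SubArray
```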


### Indexing `GroupedDataFrame`

A `GroupedDataFrame` is a view into a data frame which defines a key allowing a fast lookup (and in particular this key is then automatically used in split-apply-combine operations with `select`, `select!`, `transform`, `transform!` and `combine`).

In [64]:
gdf = groupby(fh5, :symbol)

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2020-07-02,257500,17.64,17.74,17.62,17.71,17.71,AAAU
2,2020-07-01,468100,17.73,17.73,17.54,17.68,17.68,AAAU
3,2020-06-30,319100,17.65,17.8,17.61,17.78,17.78,AAAU
4,2020-06-29,405500,17.67,17.69,17.63,17.68,17.68,AAAU
5,2020-06-26,335100,17.49,17.67,17.42,17.67,17.67,AAAU
6,2020-06-25,246800,17.6,17.6,17.52,17.59,17.59,AAAU
7,2020-06-24,329200,17.61,17.71,17.56,17.61,17.61,AAAU
8,2020-06-23,351800,17.55,17.66,17.55,17.66,17.66,AAAU
9,2020-06-22,308300,17.5,17.57,17.44,17.5,17.5,AAAU
10,2020-06-19,153800,17.27,17.4,17.26,17.4,17.4,AAAU

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2020-07-02,1072200,24.54,26.84,24.44,25.12,25.12,ZYXI
2,2020-07-01,630100,24.77,24.85,23.95,24.41,24.41,ZYXI
3,2020-06-30,1054800,23.05,25.24,22.81,24.87,24.87,ZYXI
4,2020-06-29,757500,22.92,23.46,21.94,23.18,23.18,ZYXI
5,2020-06-26,1061400,25.09,25.14,22.75,22.82,22.82,ZYXI
6,2020-06-25,901800,23.41,25.49,23.19,24.74,24.74,ZYXI
7,2020-06-24,777900,23.86,24.35,22.59,23.59,23.59,ZYXI
8,2020-06-23,675800,24.46,24.72,23.8,24.34,24.34,ZYXI
9,2020-06-22,821200,24.48,24.49,23.5,24.44,24.44,ZYXI
10,2020-06-19,1892700,24.5,25.71,22.63,24.06,24.06,ZYXI


In [65]:
gdf_keys = keys(gdf)

6335-element DataFrames.GroupKeys{GroupedDataFrame{DataFrame}}:
 GroupKey: (symbol = "AAAU",)
 GroupKey: (symbol = "AACG",)
 GroupKey: (symbol = "AADR",)
 GroupKey: (symbol = "AAL",)
 GroupKey: (symbol = "AAMC",)
 ⋮
 GroupKey: (symbol = "ZUO",)
 GroupKey: (symbol = "ZVO",)
 GroupKey: (symbol = "ZYME",)
 GroupKey: (symbol = "ZYNE",)
 GroupKey: (symbol = "ZYXI",)

As usual - indexing by a single value drops a dimension (you get a `SubDataFrame`)

In [66]:
gdf[1]

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2020-07-02,257500,17.64,17.74,17.62,17.71,17.71,AAAU
2,2020-07-01,468100,17.73,17.73,17.54,17.68,17.68,AAAU
3,2020-06-30,319100,17.65,17.8,17.61,17.78,17.78,AAAU
4,2020-06-29,405500,17.67,17.69,17.63,17.68,17.68,AAAU
5,2020-06-26,335100,17.49,17.67,17.42,17.67,17.67,AAAU
6,2020-06-25,246800,17.6,17.6,17.52,17.59,17.59,AAAU
7,2020-06-24,329200,17.61,17.71,17.56,17.61,17.61,AAAU
8,2020-06-23,351800,17.55,17.66,17.55,17.66,17.66,AAAU
9,2020-06-22,308300,17.5,17.57,17.44,17.5,17.5,AAAU
10,2020-06-19,153800,17.27,17.4,17.26,17.4,17.4,AAAU


In [67]:
gdf_keys[1]

GroupKey: (symbol = "AAAU",)

In [68]:
gdf[gdf_keys[1]]

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2020-07-02,257500,17.64,17.74,17.62,17.71,17.71,AAAU
2,2020-07-01,468100,17.73,17.73,17.54,17.68,17.68,AAAU
3,2020-06-30,319100,17.65,17.8,17.61,17.78,17.78,AAAU
4,2020-06-29,405500,17.67,17.69,17.63,17.68,17.68,AAAU
5,2020-06-26,335100,17.49,17.67,17.42,17.67,17.67,AAAU
6,2020-06-25,246800,17.6,17.6,17.52,17.59,17.59,AAAU
7,2020-06-24,329200,17.61,17.71,17.56,17.61,17.61,AAAU
8,2020-06-23,351800,17.55,17.66,17.55,17.66,17.66,AAAU
9,2020-06-22,308300,17.5,17.57,17.44,17.5,17.5,AAAU
10,2020-06-19,153800,17.27,17.4,17.26,17.4,17.4,AAAU


In [69]:
gdf[(symbol="AAAU",)]

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2020-07-02,257500,17.64,17.74,17.62,17.71,17.71,AAAU
2,2020-07-01,468100,17.73,17.73,17.54,17.68,17.68,AAAU
3,2020-06-30,319100,17.65,17.8,17.61,17.78,17.78,AAAU
4,2020-06-29,405500,17.67,17.69,17.63,17.68,17.68,AAAU
5,2020-06-26,335100,17.49,17.67,17.42,17.67,17.67,AAAU
6,2020-06-25,246800,17.6,17.6,17.52,17.59,17.59,AAAU
7,2020-06-24,329200,17.61,17.71,17.56,17.61,17.61,AAAU
8,2020-06-23,351800,17.55,17.66,17.55,17.66,17.66,AAAU
9,2020-06-22,308300,17.5,17.57,17.44,17.5,17.5,AAAU
10,2020-06-19,153800,17.27,17.4,17.26,17.4,17.4,AAAU


In [70]:
gdf[("AAAU",)]

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2020-07-02,257500,17.64,17.74,17.62,17.71,17.71,AAAU
2,2020-07-01,468100,17.73,17.73,17.54,17.68,17.68,AAAU
3,2020-06-30,319100,17.65,17.8,17.61,17.78,17.78,AAAU
4,2020-06-29,405500,17.67,17.69,17.63,17.68,17.68,AAAU
5,2020-06-26,335100,17.49,17.67,17.42,17.67,17.67,AAAU
6,2020-06-25,246800,17.6,17.6,17.52,17.59,17.59,AAAU
7,2020-06-24,329200,17.61,17.71,17.56,17.61,17.61,AAAU
8,2020-06-23,351800,17.55,17.66,17.55,17.66,17.66,AAAU
9,2020-06-22,308300,17.5,17.57,17.44,17.5,17.5,AAAU
10,2020-06-19,153800,17.27,17.4,17.26,17.4,17.4,AAAU


And indexing by a collection produces a subsetted `GroupedDataFrame`:

In [71]:
gdf[1:2]

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2020-07-02,257500,17.64,17.74,17.62,17.71,17.71,AAAU
2,2020-07-01,468100,17.73,17.73,17.54,17.68,17.68,AAAU
3,2020-06-30,319100,17.65,17.8,17.61,17.78,17.78,AAAU
4,2020-06-29,405500,17.67,17.69,17.63,17.68,17.68,AAAU
5,2020-06-26,335100,17.49,17.67,17.42,17.67,17.67,AAAU
6,2020-06-25,246800,17.6,17.6,17.52,17.59,17.59,AAAU
7,2020-06-24,329200,17.61,17.71,17.56,17.61,17.61,AAAU
8,2020-06-23,351800,17.55,17.66,17.55,17.66,17.66,AAAU
9,2020-06-22,308300,17.5,17.57,17.44,17.5,17.5,AAAU
10,2020-06-19,153800,17.27,17.4,17.26,17.4,17.4,AAAU

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2020-07-02,46800,1.3,1.39,1.3,1.31,1.31,AACG
2,2020-07-01,95400,1.25,1.4,1.18,1.31,1.31,AACG
3,2020-06-30,40200,1.16,1.27,1.16,1.26,1.26,AACG
4,2020-06-29,46900,1.15,1.25,1.15,1.17,1.17,AACG
5,2020-06-26,43700,1.12,1.18,1.12,1.15,1.15,AACG
6,2020-06-25,72900,1.22,1.25,1.11,1.23,1.23,AACG
7,2020-06-24,80400,1.14,1.3,1.14,1.25,1.25,AACG
8,2020-06-23,57000,1.19,1.22,1.14,1.17,1.17,AACG
9,2020-06-22,115500,1.0,1.17,0.97,1.14,1.14,AACG
10,2020-06-19,52500,0.96,1.03,0.95,0.95,0.95,AACG


In [72]:
gdf[tuple.(["AAAU", "AACG"])]

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2020-07-02,257500,17.64,17.74,17.62,17.71,17.71,AAAU
2,2020-07-01,468100,17.73,17.73,17.54,17.68,17.68,AAAU
3,2020-06-30,319100,17.65,17.8,17.61,17.78,17.78,AAAU
4,2020-06-29,405500,17.67,17.69,17.63,17.68,17.68,AAAU
5,2020-06-26,335100,17.49,17.67,17.42,17.67,17.67,AAAU
6,2020-06-25,246800,17.6,17.6,17.52,17.59,17.59,AAAU
7,2020-06-24,329200,17.61,17.71,17.56,17.61,17.61,AAAU
8,2020-06-23,351800,17.55,17.66,17.55,17.66,17.66,AAAU
9,2020-06-22,308300,17.5,17.57,17.44,17.5,17.5,AAAU
10,2020-06-19,153800,17.27,17.4,17.26,17.4,17.4,AAAU

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2020-07-02,46800,1.3,1.39,1.3,1.31,1.31,AACG
2,2020-07-01,95400,1.25,1.4,1.18,1.31,1.31,AACG
3,2020-06-30,40200,1.16,1.27,1.16,1.26,1.26,AACG
4,2020-06-29,46900,1.15,1.25,1.15,1.17,1.17,AACG
5,2020-06-26,43700,1.12,1.18,1.12,1.15,1.15,AACG
6,2020-06-25,72900,1.22,1.25,1.11,1.23,1.23,AACG
7,2020-06-24,80400,1.14,1.3,1.14,1.25,1.25,AACG
8,2020-06-23,57000,1.19,1.22,1.14,1.17,1.17,AACG
9,2020-06-22,115500,1.0,1.17,0.97,1.14,1.14,AACG
10,2020-06-19,52500,0.96,1.03,0.95,0.95,0.95,AACG


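Lookup by key in a `GroupedDataFrame` is backed by a `Dict`, so it is fast. Compare it with scanning the whole `:symbol` column using `findall` or `filter`: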

In [73]:
@btime $gdf[("AACG",)]

  556.757 ns (2 allocations: 2.77 KiB)


Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2020-07-02,46800,1.3,1.39,1.3,1.31,1.31,AACG
2,2020-07-01,95400,1.25,1.4,1.18,1.31,1.31,AACG
3,2020-06-30,40200,1.16,1.27,1.16,1.26,1.26,AACG
4,2020-06-29,46900,1.15,1.25,1.15,1.17,1.17,AACG
5,2020-06-26,43700,1.12,1.18,1.12,1.15,1.15,AACG
6,2020-06-25,72900,1.22,1.25,1.11,1.23,1.23,AACG
7,2020-06-24,80400,1.14,1.3,1.14,1.25,1.25,AACG
8,2020-06-23,57000,1.19,1.22,1.14,1.17,1.17,AACG
9,2020-06-22,115500,1.0,1.17,0.97,1.14,1.14,AACG
10,2020-06-19,52500,0.96,1.03,0.95,0.95,0.95,AACG


In [74]:
@btime @view $fh5[findall(==("AACG"), $fh5.symbol), :]

  23.967 ms (14 allocations: 8.53 KiB)


Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2020-07-02,46800,1.3,1.39,1.3,1.31,1.31,AACG
2,2020-07-01,95400,1.25,1.4,1.18,1.31,1.31,AACG
3,2020-06-30,40200,1.16,1.27,1.16,1.26,1.26,AACG
4,2020-06-29,46900,1.15,1.25,1.15,1.17,1.17,AACG
5,2020-06-26,43700,1.12,1.18,1.12,1.15,1.15,AACG
6,2020-06-25,72900,1.22,1.25,1.11,1.23,1.23,AACG
7,2020-06-24,80400,1.14,1.3,1.14,1.25,1.25,AACG
8,2020-06-23,57000,1.19,1.22,1.14,1.17,1.17,AACG
9,2020-06-22,115500,1.0,1.17,0.97,1.14,1.14,AACG
10,2020-06-19,52500,0.96,1.03,0.95,0.95,0.95,AACG


In [75]:
@btime filter(:symbol => ==("AACG"), fh5)

  27.506 ms (39 allocations: 1.10 MiB)


Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String
1,2020-07-02,46800,1.3,1.39,1.3,1.31,1.31,AACG
2,2020-07-01,95400,1.25,1.4,1.18,1.31,1.31,AACG
3,2020-06-30,40200,1.16,1.27,1.16,1.26,1.26,AACG
4,2020-06-29,46900,1.15,1.25,1.15,1.17,1.17,AACG
5,2020-06-26,43700,1.12,1.18,1.12,1.15,1.15,AACG
6,2020-06-25,72900,1.22,1.25,1.11,1.23,1.23,AACG
7,2020-06-24,80400,1.14,1.3,1.14,1.25,1.25,AACG
8,2020-06-23,57000,1.19,1.22,1.14,1.17,1.17,AACG
9,2020-06-22,115500,1.0,1.17,0.97,1.14,1.14,AACG
10,2020-06-19,52500,0.96,1.03,0.95,0.95,0.95,AACG


<div class="alert alert-block alert-info">
<b>Tip:</b>
    

Think of `GroupedDataFrame` as a wrapper around a data frame that caches information
about the parent data frame to speed up operations that rely on a row index. Currently this cache is used in:
* split/apply/combine
* lookup

In other words: if you like row indices in Pandas, then `GroupedDataFrame` is the way to get similar functionality in DataFrames.jl.

Notably:
* you can set multiple row indices on the same data frame, just by creating different `GroupedDataFrame` objects on top of it;
* the cache in a grouped data frame is created lazily (it is computed only when needed); cache computation is thread safe.

</div>

In [76]:
for i in 1:3
    local gdf
    @show i
    @time gdf = groupby(fh5, :symbol)
    @time gdf[("AACG",)]
    @time gdf[("AACG",)]
    @time gdf[("AACG",)]
end

i = 1
  0.026666 seconds (41 allocations: 52.334 MiB)
  0.056631 seconds (22.98 k allocations: 53.392 MiB)
  0.000008 seconds (4 allocations: 2.844 KiB)
  0.000004 seconds (4 allocations: 2.844 KiB)
i = 2
  0.026115 seconds (41 allocations: 52.334 MiB)
  0.055984 seconds (22.98 k allocations: 53.392 MiB)
  0.000008 seconds (4 allocations: 2.844 KiB)
  0.000003 seconds (4 allocations: 2.844 KiB)
i = 3
  0.026122 seconds (41 allocations: 52.334 MiB)
  0.056337 seconds (22.98 k allocations: 53.392 MiB)
  0.000006 seconds (4 allocations: 2.844 KiB)
  0.000003 seconds (4 allocations: 2.844 KiB)

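To make the row-index analogy concrete, here is a minimal sketch on a hypothetical toy data frame (the name `prices` and its values are made up for illustration and are not part of the `fh5` data set). It builds two different "indices" on top of the same parent and looks rows up by key.

```julia
using DataFrames

# Hypothetical toy data standing in for the tutorial's fh5 table
prices = DataFrame(symbol=["A", "A", "B", "B"],
                   year=[2019, 2020, 2019, 2020],
                   close=[1.0, 2.0, 3.0, 4.0])

# Two independent "row indices" over the same parent data frame
by_symbol = groupby(prices, :symbol)
by_year = groupby(prices, :year)

# Lookup by key tuple returns a view of the matching rows
b_close = by_symbol[("B",)].close      # close prices of symbol "B"
y2020_close = by_year[(2020,)].close   # close prices recorded in 2020

# Both grouped objects wrap the very same parent data frame
same_parent = parent(by_symbol) === prices && parent(by_year) === prices
```

Since each `GroupedDataFrame` only wraps the parent, keeping several of them around should be cheap compared to copying the data.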

## setindex!

`setindex!` is defined only for `DataFrame`, `SubDataFrame` and `DataFrameRow` (other types are read-only).

Intended rules:
* the dimensions of left hand side and right hand side must match;
* if right hand side has names they must match the names in left hand side.

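A small sketch of the two intended rules, using a hypothetical data frame `target` and a helper `throws` that exists only for this illustration. The exact exception types are not checked here, as they may differ between DataFrames.jl versions.

```julia
using DataFrames

target = DataFrame(a=[1, 2], b=[3, 4])

# Shape and column names agree, so the values are written in place
target[:, [:a, :b]] = DataFrame(a=[10, 20], b=[30, 40])

# Hypothetical helper for this sketch: returns true if calling f throws
function throws(f)
    try
        f()
        return false
    catch
        return true
    end
end

# Right-hand side has different column names than the selected columns
bad_names = throws() do
    target[:, [:a, :b]] = DataFrame(x=[1, 2], y=[3, 4])
end

# Right-hand side has three elements for a two-row slice
bad_size = throws() do
    target[1:2, :a] = [1, 2, 3]
end
```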
Special cases (that might be surprising):

In [77]:
x = rand(5, 5)

5×5 Array{Float64,2}:
 0.686374  0.0361668  0.381712   0.838498   0.334841
 0.378644  0.984343   0.614099   0.433745   0.515516
 0.144086  0.745478   0.0693587  0.896013   0.883914
 0.504572  0.378033   0.175026   0.797727   0.13199
 0.518882  0.351102   0.348487   0.0651846  0.353089

In [78]:
df = DataFrame(x)

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.686374,0.0361668,0.381712,0.838498,0.334841
2,0.378644,0.984343,0.614099,0.433745,0.515516
3,0.144086,0.745478,0.0693587,0.896013,0.883914
4,0.504572,0.378033,0.175026,0.797727,0.13199
5,0.518882,0.351102,0.348487,0.0651846,0.353089


In [79]:
x[1, 1:4] = [1 2; 3 4]

2×2 Array{Int64,2}:
 1  2
 3  4

In [80]:
x

5×5 Array{Float64,2}:
 1.0       3.0       2.0        4.0        0.334841
 0.378644  0.984343  0.614099   0.433745   0.515516
 0.144086  0.745478  0.0693587  0.896013   0.883914
 0.504572  0.378033  0.175026   0.797727   0.13199
 0.518882  0.351102  0.348487   0.0651846  0.353089

In [81]:
x[1:4, 1] = [1 2; 3 4]

DimensionMismatch: DimensionMismatch("tried to assign 2×2 array to 4×1 destination")

#### Exercise

Check what happens if you try the same operations on `df`.

In [82]:
df

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.686374,0.0361668,0.381712,0.838498,0.334841
2,0.378644,0.984343,0.614099,0.433745,0.515516
3,0.144086,0.745478,0.0693587,0.896013,0.883914
4,0.504572,0.378033,0.175026,0.797727,0.13199
5,0.518882,0.351102,0.348487,0.0651846,0.353089


In [83]:
df[1:4, 1] = [1 2; 3 4]

DimensionMismatch: DimensionMismatch("array could not be broadcast to match destination")

In [84]:
df

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,0.686374,0.0361668,0.381712,0.838498,0.334841
2,0.378644,0.984343,0.614099,0.433745,0.515516
3,0.144086,0.745478,0.0693587,0.896013,0.883914
4,0.504572,0.378033,0.175026,0.797727,0.13199
5,0.518882,0.351102,0.348487,0.0651846,0.353089


In [85]:
df[1, 1:4] = [1 2; 3 4]

2×2 Array{Int64,2}:
 1  2
 3  4

In [86]:
df

Unnamed: 0_level_0,x1,x2,x3,x4,x5
Unnamed: 0_level_1,Float64,Float64,Float64,Float64,Float64
1,1.0,3.0,2.0,4.0,0.334841
2,0.378644,0.984343,0.614099,0.433745,0.515516
3,0.144086,0.745478,0.0693587,0.896013,0.883914
4,0.504572,0.378033,0.175026,0.797727,0.13199
5,0.518882,0.351102,0.348487,0.0651846,0.353089


<div class="alert alert-block alert-info">
<b>Tip:</b>
    
We want DataFrames.jl to match 100% what Base does with indexing (except for cases where matching column names matters).

If you find cases when it does not please report an issue.

</div>

The most important case that does not work as in Base (as a `Matrix` is not resizable) is the creation of new columns.

In [87]:
df = DataFrame(rand(2, 2))

Unnamed: 0_level_0,x1,x2
Unnamed: 0_level_1,Float64,Float64
1,0.665823,0.674426
2,0.471903,0.883762


In [88]:
x3 = [1, 2]

2-element Array{Int64,1}:
 1
 2

In [89]:
df.x3_1 = x3

2-element Array{Int64,1}:
 1
 2

In [90]:
df[:, :x3_2] = x3

2-element Array{Int64,1}:
 1
 2

In [91]:
df[!, :x3_3] = x3

2-element Array{Int64,1}:
 1
 2

In [92]:
df

Unnamed: 0_level_0,x1,x2,x3_1,x3_2,x3_3
Unnamed: 0_level_1,Float64,Float64,Int64,Int64,Int64
1,0.665823,0.674426,1,1,1
2,0.471903,0.883762,2,2,2


#### Exercise
Check in which cases `x3` got copied.

In [93]:
eachcol(df[!, r"x3"]) .=== Ref(x3)

3-element BitArray{1}:
 1
 0
 1

A special case in which the column always gets copied is assignment of a range object:

In [94]:
fh5.col_idx = axes(fh5, 1)

Base.OneTo(6852038)

In [95]:
fh5.col_idx

6852038-element Array{Int64,1}:
       1
       2
       3
       4
       5
       ⋮
 6852034
 6852035
 6852036
 6852037
 6852038

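The same behavior can be checked on a small example. This is a sketch on a hypothetical data frame `t`; it contrasts a plain `Vector`, which is stored as-is, with a range, which is materialized into a `Vector`.

```julia
using DataFrames

t = DataFrame(a=[1.0, 2.0, 3.0])

v = [1, 2, 3]
t.v = v      # a Vector is stored without copying
t.r = 1:3    # a range is collected into a freshly allocated Vector

stored_without_copy = t.v === v
range_materialized = t.r isa Vector{Int} && t.r == 1:3
```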
Let us try `setindex!` for multi-column indexing.

In [96]:
df

Unnamed: 0_level_0,x1,x2,x3_1,x3_2,x3_3
Unnamed: 0_level_1,Float64,Float64,Int64,Int64,Int64
1,0.665823,0.674426,1,1,1
2,0.471903,0.883762,2,2,2


In [97]:
df2 = DataFrame(x1=[10,20], x2=["a","b"])

Unnamed: 0_level_0,x1,x2
Unnamed: 0_level_1,Int64,String
1,10,a
2,20,b


In [98]:
df[:, 1:2] = df2

MethodError: MethodError: Cannot `convert` an object of type String to an object of type Float64
Closest candidates are:
  convert(::Type{T}, !Matched::T) where T<:Number at number.jl:6
  convert(::Type{T}, !Matched::Number) where T<:Number at number.jl:7
  convert(::Type{T}, !Matched::Base.TwicePrecision) where T<:Number at twiceprecision.jl:250
  ...

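Warning: this operation is currently not atomic. As shown below, column `x1` was already overwritten before the error was thrown for column `x2`.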

In [99]:
df

Unnamed: 0_level_0,x1,x2,x3_1,x3_2,x3_3
Unnamed: 0_level_1,Float64,Float64,Int64,Int64,Int64
1,10.0,0.674426,1,1,1
2,20.0,0.883762,2,2,2


In [100]:
df[!, 1:2] = df2

Unnamed: 0_level_0,x1,x2
Unnamed: 0_level_1,Int64,String
1,10,a
2,20,b


In [101]:
df

Unnamed: 0_level_0,x1,x2,x3_1,x3_2,x3_3
Unnamed: 0_level_1,Int64,String,Int64,Int64,Int64
1,10,a,1,1,1
2,20,b,2,2,2


<div class="alert alert-block alert-info">
<b>Tip:</b>

In `setindex!` context:
* when `:` is used to select rows, the assignment is in-place (`:` works just like any other row selector), except when adding a new column, in which case the column is copied;
* when `!` is used to select rows, the column is always replaced: if a single column is selected, the right-hand side vector is stored without copying; if multiple columns are selected, a copy is always made;
* `df.single_col = v` is exactly the same as `df[!, single_col] = v` if `single_col` is a `Symbol` or string literal;
* it is not allowed to add columns to a `SubDataFrame` or a `DataFrameRow`.

</div>

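The difference between `:` and `!` for an existing column can be sketched as follows (the data frame `t` and the variable names are hypothetical). With `:` the old vector is reused and the values are converted to its element type; with `!` the column is swapped for the vector that was passed.

```julia
using DataFrames

t = DataFrame(a=[1.0, 2.0])
original = t.a

# `:` writes into the existing Float64 vector
t[:, :a] = [10, 20]
inplace = t.a === original && eltype(t.a) == Float64

# `!` replaces the column, so the element type may change
replacement = ["x", "y"]
t[!, :a] = replacement
replaced = t.a === replacement && t.a !== original
```

After these steps `original` still holds `[10.0, 20.0]`, but it is no longer a column of `t`.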
## broadcast

A data frame behaves in broadcasting just like a matrix, with two exceptions:
* it forces the style of the result to be a `DataFrame`
* if several data frames take part in broadcasting they must have matching column names

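Both exceptions can be seen in a short sketch on hypothetical data frames `a`, `b` and `c`. The exact exception type thrown on a name mismatch is not checked, as it is an assumption that may vary between versions.

```julia
using DataFrames

a = DataFrame(x=[1, 2], y=[3, 4])
b = DataFrame(x=[10, 20], y=[30, 40])

# The result of broadcasting is a DataFrame with the same column names
total = a .+ b

# A vector broadcasts along rows, just as it would with a matrix
scaled = a .* [1, 10]

# Data frames with different column names cannot be broadcast together
c = DataFrame(p=[1, 2], q=[3, 4])
names_mismatch = try
    a .+ c
    false
catch
    true
end
```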
In [102]:
log.(fh5[!, Between(:open, :close)])

Unnamed: 0_level_0,open,high,low,close
Unnamed: 0_level_1,Float64,Float64,Float64,Float64
1,2.87017,2.87582,2.86903,2.87413
2,2.87526,2.87526,2.86448,2.87243
3,2.87074,2.8792,2.86847,2.87807
4,2.87187,2.873,2.8696,2.87243
5,2.86163,2.87187,2.85762,2.87187
6,2.8679,2.8679,2.86334,2.86733
7,2.86847,2.87413,2.86562,2.86847
8,2.86505,2.8713,2.86505,2.8713
9,2.8622,2.86619,2.85877,2.8622
10,2.84897,2.85647,2.84839,2.85647


In [103]:
log.(fh5[!, Between(:open, :adjclose)])

DomainError: DomainError with -0.8700000643730164:
log will only return a complex result if called with a complex argument. Try log(Complex(x)).

In [104]:
filter(:adjclose => <=(0), fh5)

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol,col_idx
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String,Int64
1,2020-05-13,507400,3.4,3.4,2.99,3.02,-0.87,HCHC,2814814
2,2020-05-12,658600,3.22,3.48,3.11,3.19,-0.918974,HCHC,2814815
3,2020-05-11,564700,3.17,3.43,3.12,3.18,-0.916093,HCHC,2814816
4,2020-05-08,264600,2.92,3.08,2.8,3.06,-0.881523,HCHC,2814817
5,2020-05-07,246900,3.06,3.06,2.67,2.8,-0.806623,HCHC,2814818
6,2020-05-06,219500,2.87,3.08,2.78,2.98,-0.858477,HCHC,2814819
7,2020-05-05,190900,2.7,2.84,2.7,2.8,-0.806623,HCHC,2814820
8,2020-05-04,139800,2.58,2.8,2.58,2.69,-0.774934,HCHC,2814821
9,2020-05-01,288800,2.64,2.77,2.54,2.65,-0.763411,HCHC,2814822
10,2020-04-30,199700,2.88,2.88,2.61,2.69,-0.774934,HCHC,2814823


In [105]:
describe(fh5)

Unnamed: 0_level_0,variable,mean,min,median,max,nunique,nmissing,eltype
Unnamed: 0_level_1,Symbol,Union…,Any,Union…,Any,Union…,Nothing,DataType
1,date,,2015-01-02,,2020-07-02,1385.0,,Date
2,volume,1015410.0,1,120600.0,2156725200,,,Int64
3,open,298.086,0.001,24.95,6.91553e7,,,Float64
4,high,305.876,0.0,25.11,7.05886e7,,,Float64
5,low,291.014,0.0,24.75,6.84387e7,,,Float64
6,close,296.783,0.001,24.94,6.95136e7,,,Float64
7,adjclose,293.231,-3.77096,23.3258,6.90222e7,,,Float64
8,symbol,,AAAU,,ZYXI,6335.0,,String
9,col_idx,3426020.0,1,3426020.0,6852038,,,Int64


#### Exercise

Replace non-positive values in `:adjclose` with the corresponding value from `:close` and store the result in a new column `:adjclose_fix`.

In [106]:
fh5.adjclose_fix = ifelse.(fh5.adjclose .<= 0, fh5.close, fh5.adjclose)

6852038-element Array{Float64,1}:
 17.709999084472656
 17.68000030517578
 17.780000686645508
 17.68000030517578
 17.670000076293945
  ⋮
  0.1363018900156021
  0.1460377424955368
  0.1460377424955368
  0.1557735800743103
  0.1752452850341797

In [107]:
filter(:adjclose => <=(0), fh5)

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol,col_idx,adjclose_fix
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String,Int64,Float64
1,2020-05-13,507400,3.4,3.4,2.99,3.02,-0.87,HCHC,2814814,3.02
2,2020-05-12,658600,3.22,3.48,3.11,3.19,-0.918974,HCHC,2814815,3.19
3,2020-05-11,564700,3.17,3.43,3.12,3.18,-0.916093,HCHC,2814816,3.18
4,2020-05-08,264600,2.92,3.08,2.8,3.06,-0.881523,HCHC,2814817,3.06
5,2020-05-07,246900,3.06,3.06,2.67,2.8,-0.806623,HCHC,2814818,2.8
6,2020-05-06,219500,2.87,3.08,2.78,2.98,-0.858477,HCHC,2814819,2.98
7,2020-05-05,190900,2.7,2.84,2.7,2.8,-0.806623,HCHC,2814820,2.8
8,2020-05-04,139800,2.58,2.8,2.58,2.69,-0.774934,HCHC,2814821,2.69
9,2020-05-01,288800,2.64,2.77,2.54,2.65,-0.763411,HCHC,2814822,2.65
10,2020-04-30,199700,2.88,2.88,2.61,2.69,-0.774934,HCHC,2814823,2.69


And now something more advanced: for each stock, calculate a column with the log-returns of `:close` and store it in `:close_returns`:

In [108]:
df = sort(fh5, :date)

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol,col_idx,adjclose_fix
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String,Int64,Float64
1,2015-01-02,2000,37.25,37.25,36.64,36.64,35.3998,AADR,2114,35.3998
2,2015-01-02,10748600,54.28,54.6,53.07,53.91,51.0799,AAL,3499,51.0799
3,2015-01-02,11500,308.0,348.59,308.0,327.18,327.18,AAMC,4798,327.18
4,2015-01-02,11400,3.99,4.03,3.98,4.03,3.91772,AAME,6079,3.91772
5,2015-01-02,898900,30.81,30.86,30.04,30.62,30.0588,AAN,7464,30.0588
6,2015-01-02,184600,11.28,11.28,10.72,10.79,10.79,AAOI,8840,10.79
7,2015-01-02,90700,22.55,22.68,21.6,21.93,20.9958,AAON,10225,20.9958
8,2015-01-02,509800,160.85,162.5,157.47,158.56,156.251,AAP,11610,156.251
9,2015-01-02,53204600,111.39,111.44,107.35,109.33,99.9459,AAPL,12995,99.9459
10,2015-01-02,107500,40.12,40.46,39.95,40.37,35.0019,AAT,14380,35.0019


In [109]:
gdf = groupby(df, :symbol)

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol,col_idx,adjclose_fix
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String,Int64,Float64
1,2018-08-15,27300,11.84,11.84,11.74,11.74,11.74,AAAU,474,11.74
2,2018-08-16,428400,11.78,11.8,11.74,11.74,11.74,AAAU,473,11.74
3,2018-08-17,52400,11.8,11.82,11.77,11.82,11.82,AAAU,472,11.82
4,2018-08-20,28700,11.88,11.91,11.85,11.9,11.9,AAAU,471,11.9
5,2018-08-21,30600,11.92,11.95,11.89,11.93,11.93,AAAU,470,11.93
6,2018-08-22,101200,11.98,11.98,11.94,11.96,11.96,AAAU,469,11.96
7,2018-08-23,54800,11.91,11.92,11.85,11.85,11.85,AAAU,468,11.85
8,2018-08-24,106500,11.97,12.08,11.96,12.05,12.05,AAAU,467,12.05
9,2018-08-27,295100,12.06,12.13,12.06,12.1,12.1,AAAU,466,12.1
10,2018-08-28,30400,12.13,12.13,12.01,12.01,12.01,AAAU,465,12.01

Unnamed: 0_level_0,date,volume,open,high,low,close,adjclose,symbol,col_idx,adjclose_fix
Unnamed: 0_level_1,Date,Int64,Float64,Float64,Float64,Float64,Float64,String,Int64,Float64
1,2015-01-02,100,0.18,0.18,0.18,0.18,0.175245,ZYXI,6852038,0.175245
2,2015-01-05,33200,0.22,0.22,0.14,0.16,0.155774,ZYXI,6852037,0.155774
3,2015-01-07,8100,0.14,0.15,0.13,0.15,0.146038,ZYXI,6852036,0.146038
4,2015-01-09,200,0.15,0.15,0.15,0.15,0.146038,ZYXI,6852035,0.146038
5,2015-01-12,10000,0.14,0.14,0.14,0.14,0.136302,ZYXI,6852034,0.136302
6,2015-01-13,27800,0.14,0.19,0.14,0.19,0.184981,ZYXI,6852033,0.184981
7,2015-01-14,33800,0.13,0.19,0.13,0.19,0.184981,ZYXI,6852032,0.184981
8,2015-01-16,8000,0.12,0.18,0.12,0.18,0.175245,ZYXI,6852031,0.175245
9,2015-01-20,1800,0.18,0.18,0.17,0.18,0.175245,ZYXI,6852030,0.175245
10,2015-01-21,7800,0.13,0.18,0.12,0.18,0.175245,ZYXI,6852029,0.175245


In [110]:
df2 = transform(gdf, :close => (x -> log.(x ./ lag(x))) => :close_returns)

Unnamed: 0_level_0,symbol,date,volume,open,high,low,close,adjclose,col_idx,adjclose_fix,close_returns
Unnamed: 0_level_1,String,Date,Int64,Float64,Float64,Float64,Float64,Float64,Int64,Float64,Float64?
1,AADR,2015-01-02,2000,37.25,37.25,36.64,36.64,35.3998,2114,35.3998,missing
2,AAL,2015-01-02,10748600,54.28,54.6,53.07,53.91,51.0799,3499,51.0799,missing
3,AAMC,2015-01-02,11500,308.0,348.59,308.0,327.18,327.18,4798,327.18,missing
4,AAME,2015-01-02,11400,3.99,4.03,3.98,4.03,3.91772,6079,3.91772,missing
5,AAN,2015-01-02,898900,30.81,30.86,30.04,30.62,30.0588,7464,30.0588,missing
6,AAOI,2015-01-02,184600,11.28,11.28,10.72,10.79,10.79,8840,10.79,missing
7,AAON,2015-01-02,90700,22.55,22.68,21.6,21.93,20.9958,10225,20.9958,missing
8,AAP,2015-01-02,509800,160.85,162.5,157.47,158.56,156.251,11610,156.251,missing
9,AAPL,2015-01-02,53204600,111.39,111.44,107.35,109.33,99.9459,12995,99.9459,missing
10,AAT,2015-01-02,107500,40.12,40.46,39.95,40.37,35.0019,14380,35.0019,missing


In [111]:
combine(sdf -> first(sdf, 3), groupby(df2, :symbol), ungroup=false)

Unnamed: 0_level_0,symbol,date,volume,open,high,low,close,adjclose,col_idx,adjclose_fix,close_returns
Unnamed: 0_level_1,String,Date,Int64,Float64,Float64,Float64,Float64,Float64,Int64,Float64,Float64?
1,AAAU,2018-08-15,27300,11.84,11.84,11.74,11.74,11.74,474,11.74,missing
2,AAAU,2018-08-16,428400,11.78,11.8,11.74,11.74,11.74,473,11.74,0.0
3,AAAU,2018-08-17,52400,11.8,11.82,11.77,11.82,11.82,472,11.82,0.00679119

Unnamed: 0_level_0,symbol,date,volume,open,high,low,close,adjclose,col_idx,adjclose_fix,close_returns
Unnamed: 0_level_1,String,Date,Int64,Float64,Float64,Float64,Float64,Float64,Int64,Float64,Float64?
1,ZYXI,2015-01-02,100,0.18,0.18,0.18,0.18,0.175245,6852038,0.175245,missing
2,ZYXI,2015-01-05,33200,0.22,0.22,0.14,0.16,0.155774,6852037,0.155774,-0.117783
3,ZYXI,2015-01-07,8100,0.14,0.15,0.13,0.15,0.146038,6852036,0.146038,-0.0645385

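To see what the function `x -> log.(x ./ lag(x))` computes within a single group, here is a dependency-free sketch. The helper `log_returns` is hypothetical and written with a plain loop instead of `lag`; it is applied to the first three `:close` values of the `AAAU` group shown above.

```julia
# Hypothetical helper: log-return of each price relative to the previous one;
# the first element has no predecessor, so it stays `missing`
function log_returns(prices::AbstractVector{<:Real})
    returns = Vector{Union{Missing, Float64}}(missing, length(prices))
    for i in 2:length(prices)
        returns[i] = log(prices[i] / prices[i - 1])
    end
    return returns
end

# First three close prices of AAAU, taken from the output above
aaau_close = [11.74, 11.74, 11.82]
r = log_returns(aaau_close)
```

The second element is `0.0` (the price did not change) and the third is approximately `0.00679`, in line with the `:close_returns` values displayed for `AAAU`.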

#### Exercise

Check our code to make sure that in each group in `df2` the first, and only the first, element of `:close_returns` is `missing`.

In [112]:
all(sdf -> findall(ismissing, sdf.close_returns) == [1], groupby(df2, :symbol))

true

#### Exercises

Check what happens if you try to broadcast a sum of a 1-row `DataFrame` with an array having multiple rows.

In [113]:
[1 2] .+ rand(2,2)

2×2 Array{Float64,2}:
 1.45186  2.33647
 1.73167  2.58812

In [114]:
DataFrame([1 2]) .+ rand(2,2)

Unnamed: 0_level_0,x1,x2
Unnamed: 0_level_1,Float64,Float64
1,1.63575,2.45478
2,1.83101,2.667


Check what happens if you try to broadcast a sum of a 0-row `DataFrame` with an array having multiple rows.

In [115]:
zeros(0, 2) .+ rand(2,2)

DimensionMismatch: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 0 and 2")

In [116]:
DataFrame(x1=[], x2=[]) .+ rand(2,2)

DimensionMismatch: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 0 and 2")

Check what happens when you try to broadcast a sum of a `DataFrame` with a 3D array.

In [117]:
[1 2] .+ rand(2,2,2)

2×2×2 Array{Float64,3}:
[:, :, 1] =
 1.76137  2.22359
 1.12264  2.1019

[:, :, 2] =
 1.69563  2.94204
 1.67432  2.48171

In [118]:
DataFrame([1 2]) .+ rand(2,2,2)

DimensionMismatch: DimensionMismatch("cannot broadcast a data frame into 3 dimensions")

Check if broadcasting is defined for `DataFrameRow`.

In [119]:
dfr = DataFrame(x1=1, x2=2)[1, :]

Unnamed: 0_level_0,x1,x2
Unnamed: 0_level_1,Int64,Int64
1,1,2


In [120]:
dfr .+ 1

ArgumentError: ArgumentError: broadcasting over `DataFrameRow`s is reserved

In [121]:
nt = NamedTuple(dfr)

(x1 = 1, x2 = 2)

In [122]:
nt .+ 1

ArgumentError: ArgumentError: broadcasting over dictionaries and `NamedTuple`s is reserved

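Since broadcasting over a `DataFrameRow` is reserved, one possible workaround (a sketch, not the only option) is to convert the row first: to a `Vector` if positional results are enough, or to a `NamedTuple` combined with `map` if the names should be kept.

```julia
using DataFrames

row = DataFrame(x1=1, x2=2)[1, :]

# Positional result: broadcast over a plain Vector of the row's values
as_vector = Vector(row) .+ 1

# Named result: `map` is defined for NamedTuple even though broadcasting is not
as_named = map(v -> v + 1, NamedTuple(row))
```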
<div class="alert alert-block alert-info">
<b>Tip:</b>

In broadcasting `df.single_col` and `df[!, single_col]` behave in the same way (the same exceptions as in `getindex` apply).

</div>

## broadcast!

It is possible to assign a value to `AbstractDataFrame` and `DataFrameRow` objects using the `.=` operator.

In such an operation `AbstractDataFrame` is considered as two-dimensional and `DataFrameRow` as single-dimensional (columnar).

#### Special cases:

Broadcasting into a single cell unwraps it before the operation:

In [123]:
df = DataFrame(a=[[1,2], [3,4]])

Unnamed: 0_level_0,a
Unnamed: 0_level_1,Array…
1,"[1, 2]"
2,"[3, 4]"


In [124]:
df[1,1] .= [10, 20]

2-element Array{Int64,1}:
 10
 20

In [125]:
df

Unnamed: 0_level_0,a
Unnamed: 0_level_1,Array…
1,"[10, 20]"
2,"[3, 4]"


Broadcasting into any slice of a data frame is in-place except for two cases:
* `!` is used as a row selector
* `:` is used as a row selector and column selector is a single column that does not exist in a data frame

In these cases a new column is allocated, except if the data frame is a `SubDataFrame`, in which case adding columns is disallowed.

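A minimal sketch of these rules on a hypothetical data frame `t`. It uses only the explicit `t[:, col] .= v` and `t[!, col] .= v` forms, and the behavior shown is the one described in this tutorial.

```julia
using DataFrames

t = DataFrame(a=[1, 2, 3])
original = t.a

# `:` as row selector on an existing column: values are written in place
t[:, :a] .= 0
inplace = t.a === original && original == [0, 0, 0]

# `!` as row selector: a new column is allocated, so the eltype may change
t[!, :a] .= 1.5
reallocated = t.a !== original && eltype(t.a) == Float64

# `!` with a column name that does not exist yet creates the column
t[!, :b] .= "x"
```

Note that `original` keeps the value `[0, 0, 0]` after the second step, because the column stored in `t` was replaced and not mutated.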
<div class="alert alert-block alert-info">
<b>Tip:</b>

These rules mean that `df.single_col .= v` behaves in the same way as `df[:, single_col] .= v`

(note that in this context we use `:`, not `!`),

except that `df.single_col .= v` does not allow creation of new columns, integer indexing, nor using a variable to pass a name.

</div>

## Summary

All the rules describing how indexing works in DataFrames.jl are specified here:

https://juliadata.github.io/DataFrames.jl/latest/lib/indexing/

If you find some cases where the behavior does not match these rules please report an issue.

<div class="alert alert-block alert-info">
<b>Tip:</b>

Rules for most common operations:
* get a column without copying: `df.single_col` or `df[!, single_col]`
* get a column with copying: `df[:, single_col]`
* assign a column without copying: `df.single_col = vector` or `df[!, single_col] = vector` (except for ranges)
* assign a column with copying: `df[:, single_col] = vector`
* assign a value in-place with broadcasting if the column exists: `df.single_col .= value` or `df[:, single_col] .= value`
* assign a value with copying, creating the column if it does not exist: `df[!, single_col] .= value`
* for convenience, `df[:, single_col] .= value` also creates a column with copying if it does not exist

</div>
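
The practical consequence of the first two rules can be sketched on a hypothetical data frame `t`: a column obtained with `!` is an alias of the stored vector, while a column obtained with `:` is an independent copy.

```julia
using DataFrames

t = DataFrame(a=[1, 2, 3])

alias = t[!, :a]      # the stored vector itself
snapshot = t[:, :a]   # a freshly allocated copy

# Mutating the alias changes the data frame; the copy is unaffected
alias[1] = 100

same_object = alias === t.a
```

After the mutation, `t.a[1]` is `100`, while `snapshot[1]` is still `1`.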