# A deep dive into DataFrames.jl indexing
# Part 2: implementation of indexing in DataFrames.jl
### Bogumił Kamiński

This part does not cover every scenario in the implementation of indexing in DataFrames.jl; instead, I focus on scenarios that are non-obvious (at least to me).

In general to provide support for indexing and broadcasting for your type you should follow:
* https://docs.julialang.org/en/latest/manual/interfaces/#Indexing-1
* https://docs.julialang.org/en/latest/manual/interfaces/#man-interfaces-broadcasting-1

In effect, this tutorial is mostly about how you can dig into what Julia does under the hood when it processes your code.

I also hope it will show package developers how hard it is to define your own types that fully support indexing and broadcasting.

Finally, this notebook is more advanced and refers to the source code a lot. I expect it will be hard to follow without watching the video recording of the tutorial given at JuliaCon 2020.

In [1]:
using DataFrames

#### Example 1: Consequences of the fact that `DataFrame` can be resized

In [2]:
df = DataFrame()

In [3]:
size(df)

(0, 0)

`size` reports `0` rows, but `setindex!` and `setproperty!` treat the number of rows of a data frame without columns as *undefined*, so the first column added can have any length:

In [4]:
df.x = [1, 2, 3]

3-element Array{Int64,1}:
 1
 2
 3

In [5]:
df

3×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3


In [6]:
df.y = [1, 2]

LoadError: ArgumentError: New columns must have the same length as old columns

In [7]:
@less df.y = [1, 2]

"""
    DataFrame <: AbstractDataFrame

An AbstractDataFrame that stores a set of named columns

The columns are normally AbstractVectors stored in memory,
particularly a Vector or CategoricalVector.

# Constructors
```julia
DataFrame(pairs::Pair...; makeunique::Bool=false, copycols::Bool=true)
DataFrame(pairs::AbstractVector{<:Pair}; makeunique::Bool=false, copycols::Bool=true)
DataFrame(ds::AbstractDict; copycols::Bool=true)
DataFrame(kwargs..., copycols::Bool=true)

DataFrame(columns::AbstractVecOrMat, names::Union{AbstractVector, Symbol};
          makeunique::Bool=false, copycols::Bool=true)

DataFrame(table; copycols::Bool=true)
DataFrame(::DataFrameRow)
DataFrame(::GroupedDataFrame; keepkeys::Bool=true)
```

# Keyword arguments

- `copycols` : whether vectors passed as columns should be copied; by default set
  to `true` and the vectors are copied; if set to `false` then the constructor
  will still copy the passed columns if it is not possible to construct a
  `DataFrame` witho

          makeunique::Bool=false) =
    DataFrame(columns, Symbol.(cnames); makeunique=makeunique)

function DataFrame(columns::AbstractMatrix, cnames::Symbol)
    if cnames !== :auto
        throw(ArgumentError("if the first positional argument to DataFrame " *
                            "constructor is a matrix and a second " *
                            "positional argument is passed then the second " *
                            "argument must be a vector of column names or :auto"))
    end
    return DataFrame(columns, gennames(size(columns, 2)), makeunique=false)
end


##############################################################################
##
## AbstractDataFrame interface
##
##############################################################################

index(df::DataFrame) = getfield(df, :colindex)
_columns(df::DataFrame) = getfield(df, :columns)

# note: these type assertions are required to pass tests
nrow(df::DataFrame) = ncol(df) > 0 ? length(_columns(df)[1])::Int

In [8]:
@less df[!, :y] = [1, 2]

"""
    DataFrame <: AbstractDataFrame

An AbstractDataFrame that stores a set of named columns

The columns are normally AbstractVectors stored in memory,
particularly a Vector or CategoricalVector.

# Constructors
```julia
DataFrame(pairs::Pair...; makeunique::Bool=false, copycols::Bool=true)
DataFrame(pairs::AbstractVector{<:Pair}; makeunique::Bool=false, copycols::Bool=true)
DataFrame(ds::AbstractDict; copycols::Bool=true)
DataFrame(kwargs..., copycols::Bool=true)

DataFrame(columns::AbstractVecOrMat, names::Union{AbstractVector, Symbol};
          makeunique::Bool=false, copycols::Bool=true)

DataFrame(table; copycols::Bool=true)
DataFrame(::DataFrameRow)
DataFrame(::GroupedDataFrame; keepkeys::Bool=true)
```

# Keyword arguments

- `copycols` : whether vectors passed as columns should be copied; by default set
  to `true` and the vectors are copied; if set to `false` then the constructor
  will still copy the passed columns if it is not possible to construct a
  `DataFrame` witho

                   makeunique::Bool=false, copycols::Bool=true)::DataFrame
    if !(eltype(columns) <: AbstractVector) && !all(col -> isa(col, AbstractVector), columns)
        throw(ArgumentError("columns argument must be a vector of AbstractVector objects"))
    end
    return DataFrame(collect(AbstractVector, columns),
                     Index(convert(Vector{Symbol}, cnames), makeunique=makeunique),
                     copycols=copycols)
end

DataFrame(columns::AbstractVector, cnames::AbstractVector{<:AbstractString};
          makeunique::Bool=false, copycols::Bool=true) =
    DataFrame(columns, Symbol.(cnames), makeunique=makeunique, copycols=copycols)

DataFrame(columns::AbstractVector{<:AbstractVector}, cnames::AbstractVector{Symbol};
          makeunique::Bool=false, copycols::Bool=true)::DataFrame =
    DataFrame(collect(AbstractVector, columns),
              Index(convert(Vector{Symbol}, cnames), makeunique=makeunique),
              copycols=copycols)

DataFrame(columns:

In [9]:
@less DataFrames.insert_single_column!(df, [1, 2], :y)


Note that `broadcast!` treats such a data frame as having `0` rows, to be consistent with the value returned by `size`:

In [10]:
df = DataFrame()
df[!, :x] .= 1

Int64[]

In [11]:
df

0×1 DataFrame


However, the pseudo-broadcasting that DataFrames.jl provides in `DataFrame`, `insertcols!` and `combine` expands scalars into one row, as this is usually what the user expects.

In [12]:
df = DataFrame(:a => 1)

1×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1


In [13]:
insertcols!(DataFrame(), :a => 1)

1×1 DataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1


In [14]:
combine(DataFrame(), nrow)

1×1 DataFrame
 Row │ nrow
     │ Int64
─────┼───────
   1 │     0


but not in `select` and `transform`, as in this case we keep the number of rows of the source:

In [15]:
select(DataFrame(), nrow)

0×1 DataFrame


In [16]:
transform(DataFrame(), nrow)

0×1 DataFrame


#### Example 2: broadcasting assignment of getproperty

In [17]:
df = DataFrame(x=1:2)

2×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2


A very common question is why the following statement fails (if you have an opinion on this, please comment in https://github.com/JuliaLang/julia/issues/36741):

In [18]:
df.y .= 2

LoadError: ArgumentError: column name :y not found in the data frame; existing most similar names are: :x

while this works:

In [19]:
df[!, :y] .= 1

2-element Array{Int64,1}:
 1
 1

In [20]:
df

2×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      1


Here is the way to check what is going on:

In [21]:
@code_warntype (df -> df.z .= 1)(df)

Variables
  #self#::Core.Compiler.Const(var"#1#2"(), false)
  df::DataFrame

Body::Any
1 ─ %1 = Base.getproperty(df, :z)::AbstractArray{T,1} where T
│   %2 = Base.broadcasted(Base.identity, 1)::Core.Compiler.Const(Base.Broadcast.Broadcasted(identity, (1,)), false)
│   %3 = Base.materialize!(%1, %2)::Any
└──      return %3


vs

In [22]:
@code_warntype (df -> df[:, :z] .= 1)(df)

Variables
  #self#::Core.Compiler.Const(var"#3#4"(), false)
  df::DataFrame

Body::Any
1 ─ %1 = Base.dotview(df, Main.:(:), :z)::Union{DataFrames.LazyNewColDataFrame{Symbol}, SubArray}
│   %2 = Base.broadcasted(Base.identity, 1)::Core.Compiler.Const(Base.Broadcast.Broadcasted(identity, (1,)), false)
│   %3 = Base.materialize!(%1, %2)::Any
└──      return %3


We see that in `df.z .= 1` Julia does the following steps:
1. takes a property `:z` from `df`
2. does broadcasting into the result of `df.z`

And since `:z` does not exist in `df` we get an error.
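The same two steps can be reproduced with a plain struct, without DataFrames.jl. The sketch below is only an illustration: `Holder` is a hypothetical type that is not part of any package, and it assumes the default property access of a struct.

```julia
# Hypothetical container used only for illustration.
struct Holder
    v::Vector{Int}
end

h = Holder([1, 2])

# `h.v .= 5` first fetches `h.v` and then broadcasts into the fetched vector.
h.v .= 5

# Spelling out the two steps by hand has the same effect on the stored vector.
dest = getproperty(h, :v)
Base.materialize!(dest, Base.broadcasted(identity, 7))

# A property that does not exist already fails in the first step.
failed = try
    h.w .= 1
    false
catch
    true
end
```

Because the destination is fetched before any broadcasting happens, there is no point at which a missing property could be created.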

As an application of this observation consider:

In [23]:
df.x .= "a"

LoadError: MethodError: Cannot `convert` an object of type String to an object of type Int64
Closest candidates are:
  convert(::Type{T}, !Matched::T) where T<:Number at number.jl:6
  convert(::Type{T}, !Matched::Number) where T<:Number at number.jl:7
  convert(::Type{T}, !Matched::Ptr) where T<:Integer at pointer.jl:23
  ...

We also get an error, and now we understand why: we try to broadcast `"a"` into `df.x`, which allows only integer values.

In `df[:, :z] .= 1`, on the other hand, Julia broadcasts into the result of `Base.dotview(df, :, :z)`.

Let us check what it returns:

In [24]:
Base.dotview(df, :, :z)

DataFrames.LazyNewColDataFrame{Symbol}(2×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      1, :z)

In [25]:
Base.dotview(df, :, :x)

2-element view(::Array{Int64,1}, :) with eltype Int64:
 1
 2

In [26]:
Base.dotview(df, !, :z)

DataFrames.LazyNewColDataFrame{Symbol}(2×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      1, :z)

In [27]:
Base.dotview(df, !, :x)

DataFrames.LazyNewColDataFrame{Symbol}(2×2 DataFrame
 Row │ x      y
     │ Int64  Int64
─────┼──────────────
   1 │     1      1
   2 │     2      1, :x)

In [28]:
@less Base.dotview(df, !, :x)

### Broadcasting

Base.getindex(df::AbstractDataFrame, idx::CartesianIndex{2}) = df[idx[1], idx[2]]
Base.view(df::AbstractDataFrame, idx::CartesianIndex{2}) = view(df, idx[1], idx[2])
Base.setindex!(df::AbstractDataFrame, val, idx::CartesianIndex{2}) =
    (df[idx[1], idx[2]] = val)

Base.broadcastable(df::AbstractDataFrame) = df

struct DataFrameStyle <: Base.Broadcast.BroadcastStyle end

Base.Broadcast.BroadcastStyle(::Type{<:AbstractDataFrame}) =
    DataFrameStyle()

Base.Broadcast.BroadcastStyle(::DataFrameStyle, ::Base.Broadcast.BroadcastStyle) =
    DataFrameStyle()
Base.Broadcast.BroadcastStyle(::Base.Broadcast.BroadcastStyle, ::DataFrameStyle) =
    DataFrameStyle()
Base.Broadcast.BroadcastStyle(::DataFrameStyle, ::DataFrameStyle) = DataFrameStyle()

function copyto_widen!(res::AbstractVector{T}, bc::Base.Broadcast.Broadcasted,
                       pos, col) where T
    for i in pos:length(axes(bc)[1])
        val = bc[CartesianIndex(i, col)]
        S = typeof(val)
        

Note that `dotview` is defined only when special treatment is needed:

In [29]:
methods(Base.dotview, DataFrames)

as "normally" the default implementation is just enough:

In [30]:
Base.dotview(df, 1:1, 1:1)

1×1 SubDataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1


In [31]:
typeof(Base.dotview(df, 1:1, 1:1))

SubDataFrame{DataFrame,DataFrames.SubIndex{DataFrames.Index,UnitRange{Int64},UnitRange{Int64}},UnitRange{Int64}}

So we can see that:
1. if we use `df[:, :x]` (an existing column), we get just a view into it; one consequence is that we cannot change the `eltype` of the column (just like with `df.x .= 1`)
2. if we use `df[!, ...]` (any column) or `df[:, :z]` (a non-existing column), we get a `LazyNewColDataFrame` object.

Importantly, in the indexing context `x[y] .= z` the meaning of `x[y]` can be controlled by the package developer.

Conversely, in the context `x.y .= z` the meaning of `x.y` is currently predefined in Base (https://github.com/JuliaLang/julia/issues/36741 proposes to make this more flexible).
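To make the first point concrete, here is a minimal sketch of a toy container that creates a column on `t[name] .= value`. It is not how DataFrames.jl is implemented: `ToyTable` and `LazyNewCol` are hypothetical names, columns are restricted to `Vector{Int}`, and the code relies on the way Base dispatches broadcasting assignment through `Base.dotview`, `axes` and `copyto!`, which I assume behaves the same on the Julia version you use.

```julia
# Toy container: named integer columns that all share a fixed number of rows.
struct ToyTable
    cols::Dict{Symbol,Vector{Int}}
    nrows::Int
end

# Hypothetical placeholder for a column that does not exist yet.
struct LazyNewCol
    table::ToyTable
    name::Symbol
end

# In `t[name] .= rhs` the left-hand side is produced by `Base.dotview(t, name)`,
# so this method decides what the destination of the broadcast is.
Base.dotview(t::ToyTable, name::Symbol) =
    haskey(t.cols, name) ? t.cols[name] : LazyNewCol(t, name)

# Base asks the destination for its dimensionality and axes.
Base.ndims(::Type{LazyNewCol}) = 1
Base.axes(x::LazyNewCol) = (Base.OneTo(x.table.nrows),)

# Last step of broadcasting assignment: allocate the column, fill it, store it.
function Base.copyto!(dest::LazyNewCol, bc::Base.Broadcast.Broadcasted)
    col = Vector{Int}(undef, dest.table.nrows)
    copyto!(col, bc)
    dest.table.cols[dest.name] = col
    return dest
end

t = ToyTable(Dict(:x => [1, 2]), 2)
t[:z] .= 7   # column :z does not exist, so it is created
t[:x] .= 0   # column :x exists, so it is updated in place
```

The design choice worth noting is that the placeholder object carries just enough information (the parent and the column name) to finish the assignment once the right-hand side is known.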

Let us try to understand what `LazyNewColDataFrame` does.

For this we need to dig into how broadcasting assignment works.

In [32]:
df = DataFrame(x = [1, 2])

2×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2


We want to manually recreate the process of execution of `df[:, :z] .= 1`

In [33]:
dest = Base.dotview(df, :, :z)

DataFrames.LazyNewColDataFrame{Symbol}(2×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2, :z)

In [34]:
bc = Base.broadcasted(identity, 1)

Base.Broadcast.Broadcasted(identity, (1,))

In [35]:
@less Base.materialize!(dest, bc)

# This file is a part of Julia. License is MIT: https://julialang.org/license

"""
    Base.Broadcast

Module containing the broadcasting implementation.
"""
module Broadcast

using .Base.Cartesian
using .Base: Indices, OneTo, tail, to_shape, isoperator, promote_typejoin,
             _msk_end, unsafe_bitgetindex, bitcache_chunks, bitcache_size, dumpbitcache, unalias
import .Base: copy, copyto!, axes
export broadcast, broadcast!, BroadcastStyle, broadcast_axes, broadcastable, dotview, @__dot__, broadcast_preserving_zero_d

## Computing the result's axes: deprecated name
const broadcast_axes = axes

### Objects with customized broadcasting behavior should declare a BroadcastStyle

"""
`BroadcastStyle` is an abstract type and trait-function used to determine behavior of
objects under broadcasting. `BroadcastStyle(typeof(x))` returns the style associated
with `x`. To customize the broadcasting behavior of a type, one can declare a style
by defining a type/method pair

    struct MyContain

instantiate(bc::Broadcasted{<:AbstractArrayStyle{0}}) = bc
# Tuples don't need axes, but when they have axes (for .= assignment), we need to check them (#33020)
instantiate(bc::Broadcasted{Style{Tuple}, Nothing}) = bc
function instantiate(bc::Broadcasted{Style{Tuple}})
    check_broadcast_axes(bc.axes, bc.args...)
    return bc
end
## Flattening

"""
    bcf = flatten(bc)

Create a "flat" representation of a lazy-broadcast operation.
From
   f.(a, g.(b, c), d)
we produce the equivalent of
   h.(a, b, c, d)
where
   h(w, x, y, z) = f(w, g(x, y), z)
In terms of its internal representation,
   Broadcasted(f, a, Broadcasted(g, b, c), d)
becomes
   Broadcasted(h, a, b, c, d)

This is an optional operation that may make custom implementation of broadcasting easier in
some cases.
"""
function flatten(bc::Broadcasted{Style}) where {Style}
    isflat(bc) && return bc
    # concatenate the nested arguments into {a, b, c, d}
    args = cat_nested(bc)
    # build a function `makeargs` that takes a

So we see that Base first checks what the style of the output should be:

In [36]:
Base.Broadcast.combine_styles(dest, bc)

Base.Broadcast.DefaultArrayStyle{1}()

but e.g.

In [37]:
Base.Broadcast.combine_styles(df, bc)

DataFrames.DataFrameStyle()

as we insist that if a data frame takes part in broadcasting the result should be a data frame (more on this later).
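For comparison, style selection can be inspected for plain Base types as well. This is a small sketch using only functions from `Base.Broadcast`; note that `combine_styles` is an internal, unexported function, so treating it as stable across Julia versions is an assumption.

```julia
using Base.Broadcast: BroadcastStyle, DefaultArrayStyle, combine_styles

# A scalar is treated as 0-dimensional, a vector as 1-dimensional.
scalar_style = BroadcastStyle(typeof(1))
vector_style = BroadcastStyle(typeof([1, 2]))

# When the styles of all arguments are combined, the higher-dimensional
# default array style wins.
combined = combine_styles([1, 2], 1)
```

A custom style such as `DataFrameStyle` takes part in the same mechanism, except that its binary `BroadcastStyle` rules make it win over any other style.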

In [38]:
@less Base.materialize!(Base.Broadcast.combine_styles(dest, bc), dest, bc)

# This file is a part of Julia. License is MIT: https://julialang.org/license

"""
    Base.Broadcast

Module containing the broadcasting implementation.
"""
module Broadcast

using .Base.Cartesian
using .Base: Indices, OneTo, tail, to_shape, isoperator, promote_typejoin,
             _msk_end, unsafe_bitgetindex, bitcache_chunks, bitcache_size, dumpbitcache, unalias
import .Base: copy, copyto!, axes
export broadcast, broadcast!, BroadcastStyle, broadcast_axes, broadcastable, dotview, @__dot__, broadcast_preserving_zero_d

## Computing the result's axes: deprecated name
const broadcast_axes = axes

### Objects with customized broadcasting behavior should declare a BroadcastStyle

"""
`BroadcastStyle` is an abstract type and trait-function used to determine behavior of
objects under broadcasting. `BroadcastStyle(typeof(x))` returns the style associated
with `x`. To customize the broadcasting behavior of a type, one can declare a style
by defining a type/method pair

    struct MyContain

Base.IteratorSize(::Type{<:Broadcasted{<:Any,<:NTuple{N,Base.OneTo}}}) where {N} = Base.HasShape{N}()
Base.IteratorEltype(::Type{<:Broadcasted}) = Base.EltypeUnknown()

## Instantiation fills in the "missing" fields in Broadcasted.
instantiate(x) = x

"""
    Broadcast.instantiate(bc::Broadcasted)

Construct and check the axes for the lazy Broadcasted object `bc`.

Custom [`BroadcastStyle`](@ref)s may override this default in cases where it is fast and easy
to compute and verify the resulting `axes` on-demand, leaving the `axis` field
of the `Broadcasted` object empty (populated with [`nothing`](@ref)).
"""
@inline function instantiate(bc::Broadcasted{Style}) where {Style}
    if bc.axes isa Nothing # Not done via dispatch to make it easier to extend instantiate(::Broadcasted{Style})
        axes = combine_axes(bc.args...)
    else
        axes = bc.axes
        check_broadcast_axes(axes, bc.args...)
    end
    return Broadcasted{Style}(bc.f, bc.args, axes)
end
instantiate(bc::Broadca

In [39]:
typeof(bc)

Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0},Nothing,typeof(identity),Tuple{Int64}}

In [40]:
@less axes(dest)


In [41]:
inst = Base.Broadcast.instantiate(Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}}((bc.f, bc.args), axes(dest)))

Base.Broadcast.Broadcasted((identity, (1,)), (Base.OneTo(2),))

In [42]:
@less copyto!(dest, inst)


Why is a special path for 0-dimensional objects required?

In [43]:
Base.Broadcast.materialize(inst)

LoadError: MethodError: objects of type Tuple{typeof(identity),Tuple{Int64}} are not callable

#### Example 3: avoiding dispatch ambiguity

In [44]:
df = DataFrame([1 2 3 4], :auto)

1×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1      2      3      4


In [45]:
df[1, Not(1)] = [11, 12, 13]

3-element Array{Int64,1}:
 11
 12
 13

In [46]:
df

1×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1     11     12     13


In [47]:
@edit df[1, Not(1)] = [11, 12, 13] # note @eval in the source code

Why is this needed?
Because we are flexible in both row indexing and column indexing options.

Here is a simple worked example:

In [48]:
f(x::Union{Float64, Int64}, y::Int64) = 1
f(x::Int64, y) = 2

f (generic function with 2 methods)

In [49]:
f(1, 1)

LoadError: MethodError: f(::Int64, ::Int64) is ambiguous. Candidates:
  f(x::Union{Float64, Int64}, y::Int64) in Main at In[48]:1
  f(x::Int64, y) in Main at In[48]:2
Possible fix, define
  f(::Int64, ::Int64)

In [50]:
for T in (Float64, Int)
    @eval g(x::$T, y::Int64) = 1
end
g(x::Int64, y) = 2

g (generic function with 3 methods)

In [51]:
g(1, 1)

1

In more complex scenarios it gets very complicated to ensure that you cover every possible ambiguity (you have to think of the Cartesian product of the options), so it is simpler to unwrap the `Union`.
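Ambiguities like the one above can also be checked programmatically instead of waiting for a call to fail. The sketch below uses `Base.isambiguous` on pairs of methods; the function names `amb_f` and `safe_g` are hypothetical and only mirror the two definitions shown earlier.

```julia
# Same shape as the ambiguous definition: a Union in the first argument.
amb_f(x::Union{Float64, Int64}, y::Int64) = 1
amb_f(x::Int64, y) = 2

# Same shape as the fixed definition: the Union is unwrapped with @eval.
for T in (Float64, Int64)
    @eval safe_g(x::$T, y::Int64) = 1
end
safe_g(x::Int64, y) = 2

# Report whether any two distinct methods of `fun` are ambiguous with each other.
function has_ambiguity(fun)
    ms = collect(methods(fun))
    return any(Base.isambiguous(a, b) for a in ms, b in ms if a !== b)
end

amb_found = has_ambiguity(amb_f)
safe_found = has_ambiguity(safe_g)
```

For a whole package, `Test.detect_ambiguities` from the standard library performs a similar check over all methods defined in a module.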

Also have a look at this one to see how to define non-standard indices:

In [52]:
df

1×4 DataFrame
 Row │ x1     x2     x3     x4
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1     11     12     13


In [53]:
@edit df[:, :] = rand(Int, 1, 4) # note how `!` and `Not` are referenced

#### Example 4: defining broadcasting

Your type should support `CartesianIndex` indexing because it can later be used by the broadcasting machinery (which was not obvious to me initially).

In [54]:
@less df[CartesianIndex(1, 1)] = 1

### Broadcasting

Base.getindex(df::AbstractDataFrame, idx::CartesianIndex{2}) = df[idx[1], idx[2]]
Base.view(df::AbstractDataFrame, idx::CartesianIndex{2}) = view(df, idx[1], idx[2])
Base.setindex!(df::AbstractDataFrame, val, idx::CartesianIndex{2}) =
    (df[idx[1], idx[2]] = val)

Base.broadcastable(df::AbstractDataFrame) = df

struct DataFrameStyle <: Base.Broadcast.BroadcastStyle end

Base.Broadcast.BroadcastStyle(::Type{<:AbstractDataFrame}) =
    DataFrameStyle()

Base.Broadcast.BroadcastStyle(::DataFrameStyle, ::Base.Broadcast.BroadcastStyle) =
    DataFrameStyle()
Base.Broadcast.BroadcastStyle(::Base.Broadcast.BroadcastStyle, ::DataFrameStyle) =
    DataFrameStyle()
Base.Broadcast.BroadcastStyle(::DataFrameStyle, ::DataFrameStyle) = DataFrameStyle()

function copyto_widen!(res::AbstractVector{T}, bc::Base.Broadcast.Broadcasted,
                       pos, col) where T
    for i in pos:length(axes(bc)[1])
        val = bc[CartesianIndex(i, col)]
        S = typeof(val)
        

The listing above also shows how we use `BroadcastStyle` to force broadcasting to return a `DataFrame`.

Now, to overcome the problem that `DataFrame` column access is not type stable, broadcasting has to process the data frame column by column.

In [55]:
f(df) = df .+ 1

f (generic function with 3 methods)

In [56]:
@code_warntype f(df)

Variables
  #self#::Core.Compiler.Const(f, false)
  df::DataFrame

Body::DataFrame
1 ─ %1 = Base.broadcasted(Main.:+, df, 1)::Core.Compiler.PartialStruct(Base.Broadcast.Broadcasted{DataFrames.DataFrameStyle,Nothing,typeof(+),Tuple{DataFrame,Int64}}, Any[Core.Compiler.Const(+, false), Core.Compiler.PartialStruct(Tuple{DataFrame,Int64}, Any[DataFrame, Core.Compiler.Const(1, false)]), Core.Compiler.Const(nothing, false)])
│   %2 = Base.materialize(%1)::DataFrame
└──      return %2


In [57]:
@less Base.materialize(Base.broadcasted(+, df, 1))


So we see that, essentially, we need to define `copy` for our broadcast style:

In [58]:
edit(copy, (Base.Broadcast.Broadcasted{DataFrames.DataFrameStyle},)) # note getcolbc! and copyto_widen!
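The column-by-column idea can be sketched with a plain matrix standing in for a data frame. This is only an illustration of how a lazy `Broadcasted` object is indexed with `CartesianIndex`; it is not the DataFrames.jl implementation and it skips the element type widening that `copyto_widen!` takes care of.

```julia
A = [1 2; 3 4]

# Build the lazy representation of `A .+ 1` and fill in its axes.
bc = Base.Broadcast.instantiate(Base.broadcasted(+, A, 1))
nrows, ncols = length.(axes(bc))

# Materialize one column at a time by indexing the lazy object element-wise.
cols = [[bc[CartesianIndex(i, j)] for i in 1:nrows] for j in 1:ncols]
```

Each inner vector is built independently, which is what lets every column of the result have its own element type.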

#### Example 5: unaliasing in broadcasting assignment

What is aliasing?

Assume we have:

In [59]:
x = [1, 2, 3]

3-element Array{Int64,1}:
 1
 2
 3

In [60]:
y = @view x[3:-1:1]

3-element view(::Array{Int64,1}, 3:-1:1) with eltype Int64:
 3
 2
 1

now we call:

In [61]:
x .= y

3-element Array{Int64,1}:
 3
 2
 1

In [62]:
x

3-element Array{Int64,1}:
 3
 2
 1

and all is OK.

But assume we had implemented broadcasting naively:

In [63]:
x = [1, 2, 3]
y = @view x[3:-1:1]

3-element view(::Array{Int64,1}, 3:-1:1) with eltype Int64:
 3
 2
 1

In [64]:
naive_broadcast!(x, y) = foreach(i -> x[i] = y[i], eachindex(x, y))

naive_broadcast! (generic function with 1 method)

In [65]:
naive_broadcast!(x, y)

In [66]:
x

3-element Array{Int64,1}:
 3
 2
 3

The broadcasting mechanism in Base avoids this problem in the `Base.Broadcast.preprocess` function (which should be called before assigning the source to the target). This function internally calls `Base.Broadcast.broadcast_unalias`, which should be implemented for your custom type.
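As a rough illustration of what unaliasing buys us, the sketch below applies `Base.Broadcast.broadcast_unalias` by hand to the vector and view from the example above and then runs the naive loop. Both `Base.mightalias` and `broadcast_unalias` are internal to Base, so relying on them directly like this is an assumption made for demonstration only.

```julia
x = [1, 2, 3]
y = @view x[3:-1:1]

# `y` shares memory with `x`, so writing to `x` while reading `y` is unsafe.
aliased = Base.mightalias(x, y)

# `broadcast_unalias` returns a source that no longer shares memory with `x`.
safe_y = Base.Broadcast.broadcast_unalias(x, y)

# The naive element-by-element loop now reads from the independent copy.
for i in 1:length(x)
    x[i] = safe_y[i]
end
```

The copy made during unaliasing is the cost paid for this safety.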

In [67]:
methods(Base.Broadcast.broadcast_unalias)

In [68]:
edit(Base.Broadcast.broadcast_unalias, (AbstractDataFrame, Any)) # this is a first method of several

Unfortunately, this process is expensive, but we want to stay safe:

In [69]:
df = DataFrame(x=[1,2,3])

3×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3


In [70]:
y = view(df, 3:-1:1, 1)

3-element view(::Array{Int64,1}, 3:-1:1) with eltype Int64:
 3
 2
 1

In [71]:
df .= y
df

3×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     3
   2 │     2
   3 │     1


In [72]:
y

3-element view(::Array{Int64,1}, 3:-1:1) with eltype Int64:
 1
 2
 3

In [73]:
df .= y
df

3×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3


When is unaliasing triggered by DataFrames.jl?

Well, we already know that ultimately `copyto!` is called in broadcasting assignment:

In [74]:
methods(copyto!, DataFrames)

Let us have a look at how they are implemented:

In [75]:
edit(Base.copyto!, (AbstractDataFrame, Base.Broadcast.Broadcasted))

#### That is all for today!

I hope this part of the tutorial gave you some insight into how indexing and broadcasting are implemented in DataFrames.jl and what you should take into account when designing your own types that are expected to support indexing and broadcasting.