# A deep dive into DataFrames.jl indexing
# Part 2: implementation of indexing in DataFrames.jl
### Bogumił Kamiński

In [1]:
using DataFrames

In [2]:
df = DataFrame()

In [3]:
size(df)

(0, 0)

`size` reports `0` rows, but for `setindex!` the number of rows of a data frame with no columns is treated as *undefined*, so a column of any length can be added:

In [8]:
df.x = [1, 2, 3]

3-element Array{Int64,1}:
 1
 2
 3

In [9]:
df

|   | x     |
|---|-------|
|   | Int64 |
| 1 | 1     |
| 2 | 2     |
| 3 | 3     |


In [10]:
df.y = [1,2]

ArgumentError: New columns must have the same length as old columns

For `broadcast!`, however, a data frame with no columns is treated as having `0` rows (some row count has to be assumed):

In [11]:
df = DataFrame()
df[!, :x] .= 1

Int64[]

In [12]:
df

|   | x     |
|---|-------|
|   | Int64 |


A very common question is why the following statement fails:

In [13]:
df.y .= 1

ArgumentError: column name :y not found in the data frame; existing most similar names are: :x

while this works:

In [14]:
df[!, :y] .= 1

Int64[]

In [15]:
df

|   | x     | y     |
|---|-------|-------|
|   | Int64 | Int64 |


To understand this consider:

In [16]:
@code_warntype (df -> df.z .= 1)(df)

Variables
  #self#::Core.Compiler.Const(var"#1#2"(), false)
  df::DataFrame

Body::Any
1 ─ %1 = Base.getproperty(df, :z)::AbstractArray{T,1} where T
│   %2 = Base.broadcasted(Base.identity, 1)::Core.Compiler.Const(Base.Broadcast.Broadcasted(identity, (1,)), false)
│   %3 = Base.materialize!(%1, %2)::Any
└──      return %3


vs

In [17]:
@code_warntype (df -> df[:, :z] .= 1)(df)

Variables
  #self#::Core.Compiler.Const(var"#3#4"(), false)
  df::DataFrame

Body::Any
1 ─ %1 = Base.dotview(df, Main.:(:), :z)::Union{DataFrames.LazyNewColDataFrame{Symbol}, SubArray}
│   %2 = Base.broadcasted(Base.identity, 1)::Core.Compiler.Const(Base.Broadcast.Broadcasted(identity, (1,)), false)
│   %3 = Base.materialize!(%1, %2)::Any
└──      return %3


We see that in `df.z .= 1` Julia does the following steps:
1. takes a property `:z` from `df`
2. does broadcasting into the result of `df.z`

And since `:z` does not exist in `df` we get an error.
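The two steps can be reproduced without DataFrames.jl. The sketch below uses a hypothetical `Holder` struct (an assumption made for illustration, not a DataFrames.jl type) and assumes the lowering shown in the `@code_warntype` output above, where the property is read first and the broadcast then writes into whatever was returned.

```julia
# `Holder` is a hypothetical stand-in for a container with properties;
# it is not part of DataFrames.jl.
struct Holder
    x::Vector{Int}
end

h = Holder([1, 2, 3])

# Step 1 reads the property, step 2 broadcasts into the array it returned.
h.x .= 0
@assert h.x == [0, 0, 0]

# The same two steps written out by hand.
target = getproperty(h, :x)
Base.materialize!(target, Base.broadcasted(identity, 7))
@assert h.x == [7, 7, 7]

# A property that does not exist fails at step 1, before any broadcasting.
failed = try
    h.z .= 1
    false
catch
    true
end
@assert failed
```

The assignment never gets a chance to create anything: the failure happens while reading `h.z`, which is the same reason `df.z .= 1` errors for a missing column.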

Now note that if we do

In [18]:
df.x .= "a"

MethodError: Cannot `convert` an object of type String to an object of type Int64
Closest candidates are:
  convert(::Type{T}, !Matched::T) where T<:Number at number.jl:6
  convert(::Type{T}, !Matched::Number) where T<:Number at number.jl:7
  convert(::Type{T}, !Matched::Ptr) where T<:Integer at pointer.jl:23
  ...

We also get an error, and now we understand why: we try to broadcast `"a"` into `df.x`, which allows only integer values.
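The same behavior can be seen with a plain vector, which is a reasonable mental model for an existing column. This is a minimal sketch using only Base Julia; the variable names are illustrative.

```julia
v = [1, 2, 3]            # Vector{Int}; the element type is fixed

v .= 2.0                 # works: 2.0 converts to the integer 2
@assert v == [2, 2, 2]
@assert eltype(v) == Int

# A String has no conversion to Int, so the in-place broadcast throws.
err = try
    v .= "a"
    nothing
catch e
    e
end
@assert err isa MethodError
```

In-place broadcasting can convert values to the destination's element type, but it never changes that element type.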

In `df[:, :z] .= 1`, by contrast, we broadcast into the result of `Base.dotview(df, :, :z)`.

Let us check what it returns:

In [19]:
Base.dotview(df, :, :z)

DataFrames.LazyNewColDataFrame{Symbol}(0×2 DataFrame
, :z)

In [20]:
Base.dotview(df, :, :x)

0-element view(::Array{Int64,1}, :) with eltype Int64

In [22]:
Base.dotview(df, !, :z)

DataFrames.LazyNewColDataFrame{Symbol}(0×2 DataFrame
, :z)

In [23]:
Base.dotview(df, !, :x)

DataFrames.LazyNewColDataFrame{Symbol}(0×2 DataFrame
, :x)

So we can see that:
1. if we use `df[:, :x]` (an existing column), we get just a view into it; a particular consequence is that we cannot change the `eltype` of the column (just like with `df.x .= 1`)
2. if we use `df[!, ...]` (any column) or `df[:, :z]` (a non-existing column), we get a `LazyNewColDataFrame` object.
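The practical difference between the two cases can be sketched on a small non-empty data frame. The example assumes a DataFrames.jl version with the `!` and `:` row selectors described here; the column names `:z` and `:w` are chosen only for illustration.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3])

# `:` broadcasts in place into the existing Int column, so a String is rejected.
inplace_failed = try
    df[:, :x] .= "a"
    false
catch
    true
end
@assert inplace_failed
@assert eltype(df.x) == Int

# `!` allocates a new column and replaces the old one, so the eltype can change.
df[!, :x] .= "a"
@assert df.x == ["a", "a", "a"]
@assert eltype(df.x) == String

# For a column that does not exist yet, both forms create it.
df[:, :z] .= 1
df[!, :w] .= 2.5
@assert df.z == [1, 1, 1]
@assert df.w == [2.5, 2.5, 2.5]
```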

Let us try to understand what `LazyNewColDataFrame` does.

First check what needs to be implemented for custom broadcasting to work: https://docs.julialang.org/en/latest/manual/interfaces/#man-interfaces-broadcasting-1

In [27]:
methods(copyto!, (DataFrames.LazyNewColDataFrame, Any))