Agnostic data container proposal #15

ablaom · 2018-11-05T02:46:34Z

How about overloading the getindex and setindex! methods for each container we want to support, to give the different containers common functionality? We can do this without interfering with the existing index methods (or, worse, wrapping the containers in a common struct) as follows:

# types for dispatch:
struct Rows end
struct Cols end
struct Names end

Base.getindex(df::AbstractDataFrame, ::Type{Rows}, r) = df[r,:]
Base.getindex(df::AbstractDataFrame, ::Type{Cols}, c) = df[:, c]  # two-index form; single-index df[c] was later removed from DataFrames.jl
Base.getindex(df::AbstractDataFrame, ::Type{Names}) = names(df)

Base.getindex(df::JuliaDB.Table, ::Type{Rows}, r) = df[r]
Base.getindex(df::JuliaDB.Table, ::Type{Cols}, c) = select(df, c)
Base.getindex(df::JuliaDB.Table, ::Type{Names}) = fieldnames(typeof(df.columns.columns))  # getfields is not a Base function; assumes the columns are held in a NamedTuple

Base.getindex(A::AbstractMatrix, ::Type{Rows}, r) = A[r,:]
Base.getindex(A::AbstractMatrix, ::Type{Cols}, c) = A[:,c]
Base.getindex(A::AbstractMatrix, ::Type{Names}) = 1:size(A, 2)

Base.getindex(v::AbstractVector, ::Type{Rows}, r) = v[r]

Then, for example, df[Rows, 3:7] returns rows 3-7 of df whether df is a DataFrame, a JuliaDB Table, a Matrix or a vector.
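
A minimal sketch of the same pattern using only Base, restricted to the Matrix and Vector methods so that it runs without DataFrames or JuliaDB installed. The helper `take_rows` and the sample data are hypothetical and only there to show that calling code does not need to know the container type.

```julia
# Dispatch tags, named as in the proposal above.
struct Rows end
struct Cols end
struct Names end

Base.getindex(A::AbstractMatrix, ::Type{Rows}, r) = A[r, :]
Base.getindex(A::AbstractMatrix, ::Type{Cols}, c) = A[:, c]
Base.getindex(A::AbstractMatrix, ::Type{Names}) = 1:size(A, 2)
Base.getindex(v::AbstractVector, ::Type{Rows}, r) = v[r]

# Hypothetical helper: works for any container with a `Rows` method.
take_rows(X, r) = X[Rows, r]

A = collect(reshape(1:20, 10, 2))   # 10×2 matrix, columns 1:10 and 11:20
v = collect(101:110)                # vector of length 10

take_rows(A, 3:7)   # rows 3 to 7 of the matrix
take_rows(v, 3:7)   # elements 3 to 7 of the vector
A[Cols, 2]          # second column
A[Names]            # column "names" of a matrix are just 1:2
```

Because `Rows`, `Cols` and `Names` are types owned by the defining package, the new methods do not shadow any existing indexing behaviour: `A[3:7, :]` and `v[3:7]` continue to work as before.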


vollmersj · 2018-11-05T16:19:00Z

I was wondering whether data containers are already handled well by packages like https://github.com/queryverse/Queryverse.jl

vollmersj · 2018-11-05T20:09:59Z

https://github.com/queryverse/Query.jl

ablaom · 2018-11-05T22:55:08Z

Currently Query.jl supports only in-memory data sources, so it does not cover JuliaDB.

vollmersj · 2018-11-05T23:03:29Z

Sorry, I mixed it up with https://github.com/queryverse/IterableTables.jl/blob/master/docs/src/integrationguide.md which does.

ayush1999 · 2018-11-06T16:18:45Z

I've tried IterableTables a bit and it is much easier to get along with.

ablaom · 2018-11-23T06:58:22Z

Here is another challenge for agnostic data containers, which concerns categorical data. Some data containers do not handle factors well.

In a DataFrame and other column-based containers, a column for a categorical feature can store metadata encoding all possible classes for the feature, so that if I restrict to some subset of rows, I still know what all the possible values are, even if they do not occur in the restricted frame. If a model needs such data one-hot encoded, for example, the encoder can get all the information it needs just from, say, the training set, and later we can apply the same transformation to the test set. This means wrapping such models in the encoding transformation is not problematic.

However, in JuliaDB for example, which is row-based, there is no such column metadata. One must see all the data before one can know how to one-hot encode a categorical feature. That's fine if we do our preprocessing "a priori" but precludes wrapping a model in an encoder, because we only train on training data and the training data may not contain all classes seen in the test set.

fkiraly · 2018-11-23T08:22:37Z

I simply think a "proper" data container (for data scientific modelling) needs to be type aware. I.e., it needs to provide a method by which the user can query what type they would expect at a position within - including whether it is a factor variable and if yes which levels it has.

ablaom · 2018-11-28T11:22:22Z

A problem with using Query.jl, Tables.jl or similar as our interface point to data is that these are designed to handle very general source types. In particular, they do not allow random access to rows. They just generate row iterators, meaning random access is slow. Since the dominant use-case is data we load into memory, this is not satisfactory. I see we have three options: (a) write our own interface for each type we want to support (not too bad, as we mostly have DataFrames and JuliaDB on our wish list); (b) abandon our desire to support multiple containers; or (c) use IterableTables or similar to convert any data type into our favourite kind (which we could just as well require the user to do, reverting to (b)).

The other possibility is that there is a common random-access interface for in-memory data types out there, but I am not aware of one.
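
To make the cost concrete, here is a rough Base-only sketch contrasting a source that exposes rows only through iteration with one that stores columns as indexable vectors. The helper `nth_row` and the toy data are hypothetical; no Query.jl or Tables.jl calls are used.

```julia
# Rows exposed only as an iterator (a generator of NamedTuples).
rows = ((id = i, x = i^2) for i in 1:1_000)

# Hypothetical helper: with iteration alone, reaching row `i`
# means stepping past the i - 1 rows that precede it.
nth_row(itr, i) = first(Iterators.drop(itr, i - 1))

r_iter = nth_row(rows, 500)          # walks past 499 rows first

# The same data held as in-memory columns: any row is a direct lookup.
cols = (id = collect(1:1_000), x = collect(1:1_000) .^ 2)
r_direct = map(c -> c[500], cols)    # one index operation per column
```

Both calls return the same row, but the first is linear in the row index, which matters when resampling repeatedly draws arbitrary subsets of rows.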

ablaom · 2018-11-28T16:53:23Z

Our agnostic data interface doesn't have to do much, but at minimum I think MLJ needs to be able to access arbitrary rows by index or slice (for, e.g., resampling) and request a subset of columns by name or index/slice. We should be able to extract the column names. It would be nice to be able to change columns in-place (e.g., to standardise data) and to add columns (e.g., one-hot encode). Vertical and horizontal concatenation would also be nice. We probably don't need general joins or group-by.
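
As a rough illustration of how small this wish list is, the operations can be sketched in Base alone for the simplest column table, a NamedTuple of equal-length vectors. All function names here (`selectrows`, `selectcols`, `colnames`, `addcol`, `vcat_tables`) are hypothetical and chosen only for this sketch; they are not an existing API.

```julia
# A column table in the simplest sense: a NamedTuple of equal-length vectors.
X = (height = [1.6, 1.8, 1.7, 1.5], weight = [60.0, 80.0, 70.0, 50.0])

# Hypothetical helpers covering the operations listed above.
selectrows(X, r) = map(col -> col[r], X)                  # rows by index or slice
selectcols(X, names::Tuple) = NamedTuple{names}(X)        # columns by name
colnames(X) = keys(X)                                     # column names
addcol(X, name::Symbol, col) = merge(X, NamedTuple{(name,)}((col,)))
vcat_tables(X, Y) = map(vcat, X, Y)                       # vertical concatenation

sub     = selectrows(X, 2:3)
weights = selectcols(X, (:weight,))
wider   = addcol(X, :ratio, X.weight ./ X.height)
stacked = vcat_tables(sub, sub)

# In-place change of a column: centre `weight` on its mean.
w = X.weight
w .-= sum(w) / length(w)
```

Note that `selectcols` returns the same underlying vectors rather than copies, so the in-place centring at the end is also visible through `weights`.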

ablaom · 2018-11-28T17:06:18Z

Following the earlier discussion of nominal features, the proposal is now the following: MLJ will require nominal features to have a nominal feature element type. I.e., each element knows all the possible levels the feature can take, as in R, and as in CategoricalArrays.jl. We could use CategoricalArrays.CategoricalValue and CategoricalArrays.CategoricalString as the required types. (CSV.jl can read specified columns into these formats.) So we would not require this information to be encoded in column metadata.
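
The idea of elements that know their levels can be sketched without any package. The `Nominal` type, the `nominal` constructor and the `onehot` encoder below are hypothetical stand-ins written for this illustration; they are not the CategoricalArrays.jl API, which provides its own element types for this purpose.

```julia
# Hypothetical element type: every value carries the full pool of levels.
struct Nominal
    levels::Vector{String}   # all classes the feature can take (shared, not copied)
    ref::Int                 # position of this value's class in `levels`
end

# Hypothetical constructor for a column of nominal values.
nominal(values, levels) = [Nominal(levels, findfirst(==(v), levels)) for v in values]

# One-hot encode a column using the levels stored in its elements,
# not the values that happen to occur in the column.
function onehot(col::AbstractVector{Nominal})
    pool = first(col).levels
    return [x.ref == j for x in col, j in eachindex(pool)]
end

colour = nominal(["red", "green", "red", "blue"], ["red", "green", "blue"])
train  = colour[1:3]     # "blue" never occurs in the training rows
test   = colour[4:4]

E_train = onehot(train)  # still has a column for "blue"
E_test  = onehot(test)   # same column layout as the training encoding
```

Since the levels travel with each element, the encoder sees three classes even when fitted on rows that contain only two, and this holds for any container that can store the elements, with or without column metadata.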

fkiraly · 2018-11-28T21:53:53Z

Much agreed, the use of a categorical data type for type-aware indexing and queries is natural. I assume this is the most widely adopted (or only available) solution in Julia?

ablaom · 2018-11-29T08:38:20Z

Yes, I think CategoricalArrays are the most widely adopted.

ablaom · 2018-11-29T08:42:35Z

To clarify, CategoricalArrays also have column metadata, but as the elements also know the levels, we just need the element type defined in that package. So, I could have a JuliaDB Table (which has no column metadata) with a column whose element type is CategoricalArrays.CategoricalValue and that's fine.

ablaom · 2018-11-30T15:55:18Z

I have lodged a feature request at Tables.jl for a row getindex method.

ablaom · 2019-01-27T20:18:30Z

After revisiting Tables.jl (and properly reading the docs :-) ) I have concluded that this package provides us with the most we can presently expect from a generic tabular interface. I had been flirting with Query/TableTraits in MLJBase for a while, but have now switched in an upcoming PR. The impact on MLJ will, for practical purposes, be nil.

ablaom mentioned this issue Nov 5, 2018

Goals and collaboration on tasks #13

Closed

ablaom assigned MikeInnes Nov 28, 2018

ysimillides added the "design discussion" label Dec 11, 2018

ablaom mentioned this issue Jan 23, 2019

Related efforts in the Julia ecosystem. PP, autoML, formulae, visualization, and others. #47

Open

ablaom closed this as completed Jan 27, 2019
