DataFrame Discussion #26845

Open · MgSam opened this Issue Feb 5, 2018 · 35 comments

MgSam commented Feb 5, 2018

With work underway on Tensor, and the new perf optimizations available via Span, I think it's time the .NET team seriously considered adding a DataFrame type. Working with large amounts of data has become critical in many applications these days, and is the reason libraries like Pandas for Python have been so successful. It's time to bring similar capabilities to .NET, and it needs to start with the framework itself adding a DataFrame type.

There are DataFrame libraries available for .NET, but each has its own DataFrame implementation that is minimally compatible with the BCL or with any of the other libraries. In my experience, support for many of these libraries is also pretty weak.

I think the BCL team implementing its own DataFrame is the first, most important step to improving the state of working with data in .NET. All we have right now is DataTable, a powerful but ancient type that is not well optimized for many scenarios and which I've variously seen .NET team members refer to as "legacy".

I'm creating this issue to garner opinions on the topic. I don't have a specific API in mind to propose at this time, as gauging interest first is probably more important.

Let's make .NET as great for data analysis as any other platform.

danmosemsft (Member) commented Feb 5, 2018

Great to get community interest.
cc @eerhardt @ericstj who have been thinking about this.

mgravell (Contributor) commented Feb 5, 2018

Could you define "DataFrame" for the purpose of this discussion? Because that is a vague, overloaded, and open-ended term - it would be good to know the intent as it relates to this discussion. For example, pandas defines it as:

Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects.

which sounds ... awful :) I can't think of many scenarios where that would be preferable to, say, a List<T> or Span<T>...

eerhardt (Member) commented Feb 6, 2018

which sounds ... awful :) I can't think of many scenarios where that would be preferable to, say, a List&lt;T&gt; or Span&lt;T&gt;...

I think the scenarios come from a more "data scientist" and Machine Learning angle. Take a look at https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb, starting with the In [5]: housing = load_housing_data() line.

Imagine trying to explore a data set (loaded through .csv or otherwise) like this in .NET code, and how much more code you would need to write to accomplish something like:

housing["ocean_proximity"].value_counts()

mgravell (Contributor) commented Feb 6, 2018

that sounds like something that LINQ would excel at, with little more than a few helper extension methods - i.e.

var counts = housing.ValueCounts<string>("ocean_proximity");
// the <string> here just makes the return type stronger, could be done without

with an extension method something like below (completely untested). Note: I'm totally not a fan of DataSet for most purposes - I'm just trying to understand what problem this suggestion is trying to solve, and how it would be different to DataTable. I'm very cautious of the idea of adding another type that is basically the same as DataTable to represent ad-hoc data without a very good reason. If the reason is that DataTable is too heavy and has a lot of things (like change tracking, etc) that never apply, then that's great - I just think having a very clear statement of the problem is a prerequisite to having a good solution.

public static Dictionary<TKey, int> ValueCounts<TKey>(this DataTable table, string columnName)
{
    var col = table.Columns[columnName];
    return table.Rows.Cast<DataRow>().GroupBy(x => x.Field<TKey>(col)).ToDictionary(
        x => x.Key, x => x.Count());
}

Edit: personally, I'd much rather use a class HousingData POCO that looks like the shape of the data, i.e. has a Longitude, Latitude etc - along with great tools for loading that; but I acknowledge that not all scenarios fit the "we know the shape ahead of time" world.

@karelz karelz added the area-Meta label Feb 6, 2018

@tarekgh tarekgh added this to the Future milestone Feb 7, 2018

MgSam commented Feb 9, 2018

@mgravell I think the Pandas definition is as good as any.

You're right, it is very similar to a DataTable. However, DataTables are heavy, not generic (and thus box any value types), and do not support many of the useful operations of DataFrames.

I love Linq, but it is a different type of abstraction. Common operations with DataFrames include aligning columns, performing operations between columns, joining frames, etc. Writing ad-hoc types or using Linq for this is much more verbose and time-consuming. A key insight about DataFrames vs. Linq is that you work with whole columns or frames of data rather than with individual items, as you do with Linq.
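To make that columns-vs-items distinction concrete, here is a purely illustrative contrast; the frame type and its string indexer below are invented names, not an existing .NET API:

// Hypothetical column-oriented API (invented names, for illustration only):
// a single expression operates on whole columns at once.
frame["value3"] = frame["value1"] * frame["value2"];

// Roughly the same computation with LINQ over a List<Row> of POCOs:
// the unit of work is the individual row.
var value3 = rows.Select(r => r.Value1 * r.Value2).ToList();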

I strongly suggest looking at some example usage in Pandas or checking out Deedle. (Deedle has a pretty good implementation of a DataFrame, but it is written in F# and thus uses a lot of F#-isms that are problematic for other .NET languages; it also appears to be essentially abandoned.)

ghost commented Feb 9, 2018

but it is written in F# and thus uses a lot of F#-isms that are problematic for other .NET languages

Can you explain a bit what an F#-ism is and what those problems are? If I am relying on a NuGet package with a .NET assembly, I shouldn't care about the source language the IL was produced from. Their usage docs have C# examples: http://bluemountaincapital.github.io/Deedle/csharpseries.html

MgSam commented Feb 9, 2018

@kasper3 F# adds a lot of its own types and conventions on top of the .NET Framework. It's in FSharp.Core.dll. You absolutely can use it from C#, and it is very powerful, but you quickly run into a lot of annoying impedance mismatches.

The biggest one is that F# has an Option&lt;T&gt; type, so Deedle is written to work with Option&lt;T&gt; rather than nullables. This means that for double?, you have three different ways to say null: Double.NaN, null, and Option&lt;double?&gt;.None. It makes analysis super annoying and error prone.
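A minimal illustration of the mismatch (FSharpOption&lt;T&gt; comes from the FSharp.Core package; the containing class is just scaffolding for the example):

using Microsoft.FSharp.Core;   // from the FSharp.Core package

class MissingValueExample
{
    // Three coexisting ways to say "no value" for a double when an
    // F#-based library such as Deedle is consumed from C#:
    double? viaNaN = double.NaN;                                   // floating-point NaN
    double? viaNull = null;                                        // nullable null
    FSharpOption<double?> viaOption = FSharpOption<double?>.None;  // F# option's None
}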

Though I digress. Really I'm just trying to show why there is a need for a DataFrame type to be baked into the core framework.

mattdot commented May 4, 2018

I've been trying to build a UWP app which reads data from IoT sensors, and I've been banging my head against this exact problem. I can't use pandas or R because I'm in UWP-land, and C# doesn't have a decent DataFrame API to use MathNet with in the same way I'd use SciPy with Pandas. I'm pretty close to just abandoning .NET and building the whole thing in Python with the desktop bridge. Unfortunate, but there's really a gap in this area where Python and R excel. This would obviously be most useful in server-side ML pipelines, but I wanted to highlight how it is also useful to client developers, particularly when they want to pre-process data they're feeding into ONNX models for WindowsML.

voltcode commented May 9, 2018

@mattdot check out the issue I referenced; it seems there's a new dataframe-ish abstraction called IDataView in the just-open-sourced ML.NET library. Maybe it can be useful to you?

ghost commented May 9, 2018

I'm pretty close to just abandoning .NET and building the whole thing

If there is some API shape proposal that you have in mind that would help your use case, please do share. That will increase the chances of moving this discussion forward.

veikkoeeva commented May 17, 2018

@MgSam So DataFrame is column oriented instead of row oriented? Do chime in to https://github.com/dotnet/machinelearning to share your views. Why not @mattdot also, maybe that's doable already.

MgSam commented May 18, 2018

@veikkoeeva I'm not sure column-oriented vs row-oriented is the right description. It's a tabular data structure. It definitely makes column-based operations easy and much more readable. This definition for an R data frame is pretty good.

As soon as I get some free time I intend to play around with the IDataView.

glebuk commented May 18, 2018

ML.NET has the concept of an "IDataView" (IDV). We are porting documentation for it in this PR 173.
We hope it provides some context for this discussion.
It is similar to a DataFrame but has significant differences. A DataFrame is a fully materialized, schematized data table in memory that supports random access. An IDV is similar to a SQL view: it is also a schematized view of data, but it is accessed with cursors, which allows for streaming. While a DataFrame is very good for ad-hoc data exploration, the IDV is optimized as a very efficient, scalable (unbounded rows by billions of features) data plane for machine learning pipelines.
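To make the cursor-vs-materialized distinction concrete, here is a hypothetical sketch; these interfaces are invented for illustration and are not the actual ML.NET IDataView surface:

// Invented-for-illustration streaming view: rows are pulled one at a time
// through a cursor, so the full dataset never has to be materialized.
interface IStreamingView
{
    int ColumnCount { get; }
    string ColumnName(int index);
    IRowCursor GetCursor();
}

interface IRowCursor : System.IDisposable
{
    bool MoveNext();             // advance to the next row; false at end of data
    T Get<T>(int columnIndex);   // read one value from the current row
}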

bryanrcarlson commented May 24, 2018

Please excuse my extreme naivete, but does a .NET Core implementation of Apache Arrow relate to this?

Wes McKinney recently posted on his blog announcing Ursa Labs, which could open .NET up to a whole new crowd if there were collaboration.

voltcode commented May 24, 2018

@bryanrcarlson I mentioned Apache Arrow as a means for ML.NET to integrate with R and Python - check out dotnet/machinelearning#69 if you're interested

mungojam commented Jun 4, 2018

I've been a C# programmer for 10 years, but in the last few years have learnt R and now Python. I love C#, but it is really lacking in this area when compared with Pandas in Python and dplyr/tidyr in R.

Those tools make it so easy to work with tabular data, and for people we want to move out of Excel for line-of-business data tools, they are a much easier sell than C# at present.

Both R and Python benefit here during the exploration stage by being dynamically typed. They also benefit from having a really good REPL story.

I think a neat C# implementation would start in the dynamic world using ExpandoObject or similar, and would need good REPL runtime intellisense. It would then let coders generate the statically typed classes and code when they are ready to build an actual tool. The readr package in R applies a similar approach when reading CSV files: it can auto-detect the column data types when you first import a file, but then prints out the R code you should use to specify those types explicitly (and encourages you to do so).

The column vs. row thing is important for performance. Columnar data is generally the way things are going for performance reasons, e.g. the Parquet data format. A lot of the operations people do on data tables benefit from using SIMD along a column of contiguous memory.
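As a minimal sketch of that last point, here is a SIMD sum over a column stored as a contiguous double[] using the BCL's System.Numerics vectors (the APIs are standard; the scenario is illustrative):

using System.Numerics;

static class ColumnOps
{
    // SIMD sum over a contiguous column, with a scalar loop for the tail.
    public static double Sum(double[] column)
    {
        var acc = Vector<double>.Zero;
        int i = 0;
        for (; i <= column.Length - Vector<double>.Count; i += Vector<double>.Count)
            acc += new Vector<double>(column, i);         // load one lane-width at a time
        double sum = Vector.Dot(acc, Vector<double>.One); // horizontal sum of the lanes
        for (; i < column.Length; i++)
            sum += column[i];
        return sum;
    }
}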

MovGP0 commented Jun 20, 2018

I agree that a DataFrame would be useful, especially in the Data Science and Machine Learning area.

It's basically an n-dimensional tensor (it could be a scalar, vector, matrix, or higher-order tensor) that has label vectors for the individual dimensions.

The labels should be preserved as far as possible when mathematical operations are applied.
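A bare-bones sketch of that idea (LabeledTensor is invented for illustration; DenseTensor&lt;T&gt; is from the preview System.Numerics.Tensors package mentioned at the top of this thread):

using System.Numerics.Tensors;

// Invented wrapper: a tensor plus one label vector per dimension, so that
// e.g. rows keep their ids and columns keep their names through operations.
sealed class LabeledTensor<T>
{
    public DenseTensor<T> Values { get; }
    public string[][] Labels { get; }   // Labels[d] labels dimension d

    public LabeledTensor(DenseTensor<T> values, string[][] labels)
    {
        Values = values;
        Labels = labels;
    }
}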

MovGP0 commented Jun 20, 2018

I guess a DataFrame type need not (but may anyway) be part of the core library. Using something like the Deedle NuGet package should do the job nicely, especially if you can convert to and from the Tensor type.

totalgit74 commented Jul 3, 2018

I arrived at this discussion as I too am searching for a suitable data analysis framework. Deedle is OK but suffers very bad performance on even moderately large datasets. Extreme Optimization has an implementation that is very fast, from what I have played around with.
I am mainly using R for data analysis, particularly the data.table class, which has extremely good performance even into the hundreds of millions of rows.
The problem I am finding is that I want the performance and flexibility of data.table, but in a language like C# that is much better suited to implementing financial instruments (from simple swaps to structured deal types): OO, extension methods, interfaces, abstract classes, etc. The lack of a suitable implementation in C# means I am forced into using S4 classes and OO in a functional language.
Another issue is the lack of a decent IDE. I still think nothing comes close to the abilities of the MS IDEs.

@danmosemsft danmosemsft added this to To do in ML.NET support work via automation Jul 3, 2018

danmosemsft (Member) commented Jul 3, 2018

Adding to ML.NET "project" just to keep track (no decision implied but this is a good discussion)

hrichardlee commented Jul 18, 2018

I am extremely in favor of this! Here goes a long post; thank you to anyone who gets through it all.

API shape proposal-ish

To try to move the discussion forward, I'll put forward something that looks kind of like an API shape proposal. It's actually just a few core operations from pandas, but hopefully it gives a sense of what operations a real proposal would have to cover; a real proposal would also need to be much more C#-like and cover a much wider set of scenarios. I'm using pandas here, but Matlab and R basically smell the same, the way that C/C++, Java, and C# all smell the same.

First I'll define a dataframe to make all of the examples actually runnable.

import pandas as pd
df = pd.DataFrame([[1, 2, 'a', pd.Timestamp('2016-01-01')],
                   [3, 4, 'b', pd.Timestamp('2016-01-02')],
                   [5, 6, 'c', pd.Timestamp('2016-01-03')],
                   [7, 8, 'a', pd.Timestamp('2016-01-04')],
                   [9, 10, 'b', pd.Timestamp('2016-01-05')],
                   [11, 12, 'c', pd.Timestamp('2016-01-06')]],
                  columns=['int1', 'int2', 'strs', 'dts'])

Here are the core operations that a dataframe needs to be able to support:

Selection:

# this selects just the rows where int1 is even (selecting a subset of columns isn't shown here)
df[df['int1'] % 2 == 0]

Indexes for fast selection:

# efficiently select just the rows between these two dates
df = df.set_index('dts')
df.loc[pd.Timestamp('2016-01-01'):pd.Timestamp('2016-01-04')]

Projection and Append:

# for each row, this multiplies int1 by 2 and adds int2 (i.e. a projection)
# and puts that new value in the int3 column (i.e. appending a column)
# appending a row is not shown
df['int3'] = (df['int1'] * 2) + df['int2']

Assignment:

# updates the int1 column in row 2 to be 100
df.loc[2, 'int1'] = 100

(this seems a little contrived, but this is the building block for common operations like replacing all NaNs with 0)

GroupBy:

df2 = df.groupby('strs', as_index=False)['int1'].sum()

> strs  int1
> a        8
> b       12
> c       16

Row/column labels:

# columns and rows have names that can be changed
df2 = df2.rename(columns={'int1': 'int1_summed'})

Join:

pd.merge(df, df2, how='inner', on='strs')

Sort:

df.sort_values(by=['strs', 'int1'])

There are other things, like reading/writing CSVs and SQL tables, resampling and other time-related functions, rolling window calculations, more advanced selection, pivoting and transposing (e.g. see the table of contents at https://pandas.pydata.org/pandas-docs/stable/), but I think those will fall into place fairly easily once the core operations described above are defined. No comment here on dataframes being mutable/immutable (some of the syntax I showed here strongly suggests mutability because that's how pandas/Matlab/R work, but I don't think that has to be the case).
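For a sense of what these operations could look like in C#, here is a deliberately rough, invented sketch; DataFrame and all of its members below are hypothetical, not a real or proposed API:

// All names invented for illustration; not a real API.
var df = DataFrame.FromColumns(
    ("int1", new[] { 1, 3, 5, 7, 9, 11 }),
    ("int2", new[] { 2, 4, 6, 8, 10, 12 }),
    ("strs", new[] { "a", "b", "c", "a", "b", "c" }));

var evens = df[df["int1"] % 2 == 0];          // selection by boolean mask
df["int3"] = df["int1"] * 2 + df["int2"];     // projection + appended column
df[2, "int1"] = 100;                          // assignment to a single cell
var sums = df.GroupBy("strs").Sum("int1");    // group-by aggregation
var joined = df.Join(sums, on: "strs");       // join on a key column
var sorted = df.SortValues("strs", "int1");   // sort by columns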

How is this different from LINQ on a List<T> of POCOs?

  • Appending columns, selecting a subset of columns, or renaming column labels doesn't work well if you have a typed POCO. Even if you use ExpandoObject, ExpandoObject doesn't allow setting new properties with a string (e.g. expandoObjectRow.Set("nameOfProperty", value)), which is an important limitation. So if you go down this road you have to choose between creating a new type for every stage of your operation and boxing value types.
  • Joining is hard (aka verbose). E.g. doing an outer join on xs and ys on on would require something like:
var ysDict = ys.ToDictionary(y => y[on], y => y);
var xsDict = xs.ToDictionary(x => x[on], x => x);
// illustrative sketch: assumes rows are indexable by column name, and that
// the tuple element types are annotated as needed for the nulls to compile
xs.Select(x => (x, ysDict.TryGetValue(x[on], out var y) ? y : null))
    .Concat(ys.Where(y => !xsDict.ContainsKey(y[on])).Select(y => (null, y)));

(and this hasn't even solved the appending columns problem!)

  • The performance benefits of having columnar data in memory that @mungojam mentions above are important when you have, say, 100 million rows.
  • Indexes for fast selection are not available, especially for a range query like the one I show above.

How is this different from DataTable?

In addition to the good points above by @mgravell (heavy, i.e. change tracking) and @MgSam (boxes all value types), I'll also point out that DataTable doesn't have the required richness. I'm not super familiar with DataTable, so I might make some mistakes here, but I believe it doesn't support joins, renaming columns, indexes, assignment to a single cell, or any of the "bonus" things I mentioned above as non-core functionality, like resampling and other time-related functions, pivoting and transposing.

Why should dataframes be added to .NET?

  • Developers that love .NET are abandoning .NET for Python because of the lack of dataframes and the associated ecosystem. At my company, we've used solely .NET for years, and have recently had to add first-class support for Python in order to support researchers/scientists/data analysts that need tools that .NET doesn't have. To this point, it doesn't just end at DataFrames. All of the comments asking how this is different from LINQ and DataTable etc. are right in a way--if .NET just added DataFrames and stopped there, especially just the core operations that I proposed above, it wouldn't be an earth-shaking development in the history of .NET. But if everything else came together, either from the .NET team or from the open source community, like a fully rich dataframe API (i.e. feature parity with pandas/R/Matlab), a REPL (we're almost there but we need 64-bit support!), Jupyter notebook support, a plotting library on par with matplotlib/Matlab/R, a library of statistical functions on par with pandas+scipy/Matlab/R, and a machine learning framework on par with Tensorflow/PyTorch (i.e. ML.NET), then that would basically open up .NET to an entire new set of use cases. This obviously won't happen overnight, and there's a reasonable argument that the Python/pandas ecosystem has an insurmountable lead, but changing the world requires unreasonable people :).
  • There's an opportunity. Python/pandas currently holds the lead, I believe, because R and Matlab don't have a good story for general purpose programming, i.e., you can't write a web server in R/Matlab the way you can with Python/Django. But, pandas is at a critical juncture, because datasets have grown to 100 million rows or more, and the current architecture of pandas is limiting: http://wesmckinney.com/blog/apache-arrow-pandas-internals/. I think if you look at the Arrow project and the problems it's trying to solve, that gives a pretty good description of the opportunity. I'll give my spin on it here:
    • Pandas is "slow" at manipulating very large datasets.
    • Another aspect of the opportunity is the IDataView thing, i.e., the ability to reference an underlying potentially massive datastore and perform lazy operations on it, and then materialize the data and continue working on it using the exact same API. This is related to the previous point that pandas wasn't built in a world of large datasets, and that means that it needs a new paradigm for loading large datasets, not just the data manipulation aspect. (I haven't dug deeply into IDataView much yet, so I might be misunderstanding this.)
    • There's a distributed computation aspect as well. This is what Spark dataframes offer, which is why pandas-Spark compatibility is an important angle of the Arrow project. DryadLinq anyone? :) Also see dask: https://dask.pydata.org/en/latest/
  • Finally, I think there's an opportunity to make dataframes optionally typed like @mungojam talks about above, which would be really amazing. Also see @tpetricek's paper http://tomasp.net/academic/papers/pivot/pivot-ecoop17.pdf, especially figure 4. It's always a tough and contentious argument to make, but I believe (mostly on faith) that being able to optionally specify types statically would make data scientists more productive the way that some people believe that static types make general purpose programmers more productive.

A full example of how dataframes are used in real life

I'll go through a real-ish example here in pandas to demonstrate how dataframes are used and how C#/.NET as it stands are inadequate for the use case. Let's say you have a financial dataset as a CSV that has an "id" column (which might represent a stock or a bond), a "date" column, and a "value1" and "value2" column. Let's say you have a second dataset that contains returns for "id"s and "date"s. The first thing you need to do is read the CSVs, but I'm going to start with some synthetic data for convenience:

import random
import string
import numpy as np
import pandas as pd
import sklearn.linear_model

def random_string(n):
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(n))

ids = np.array([random_string(6) for _ in range(900)])
dates = pd.date_range('2013-01-01', '2018-06-01')
df1 = pd.DataFrame({'id': np.tile(ids, len(dates)),
                    'date': np.repeat(dates, len(ids)),
                    'value1': np.random.rand(len(ids) * len(dates)),
                    'value2': np.random.rand(len(ids) * len(dates))})

df2 = pd.DataFrame({'id': np.tile(ids, len(dates)),
                    'date': np.repeat(dates, len(ids)),
                    'returns': np.random.rand(len(ids) * len(dates))})

Here's what df1 looks like:

>>>               date      id    value1    value2
>>> 0       2013-01-01  132ZXV  0.769004  0.237304
>>> 1       2013-01-01  HLX7J7  0.241722  0.273988
>>>             ...     ...       ...       ...
>>> 1780198 2018-06-01  N5EIUS  0.655888  0.220536
>>> 1780199 2018-06-01  MF8YVT  0.458046  0.118847             

Let's say you want to replace any NaNs in the data with 0.0, multiply value1 and value2 together, take a 5-observation rolling sum of that per id, take the zscore of that for each date, join that against the returns, and do a rolling 250-day linear regression on each day. Here's what that would look like in pandas:

df1 = df1.fillna(0.0)
df1['value3'] = df1['value1'] * df1['value2']
# there's some ugly cruft here--no one said pandas is perfect!
df1['value3_rolling_sum'] = df1.groupby('id')['value3'].rolling(5).sum() \
    .reset_index().drop('id', axis=1).sort_values(by='level_1').set_index('level_1')['value3']
df1['value3_zscore'] = df1.groupby('date')['value3_rolling_sum'].apply(lambda gr: (gr - gr.mean()) / gr.std())
merged = pd.merge(df1, df2, how='left', on=['id', 'date'])
merged = merged.set_index('date')
models = []
for d in merged.index.unique():
    regression_inputs = merged.loc[d - pd.DateOffset(250):d]
    model = sklearn.linear_model.LinearRegression()
    models.append(model.fit(regression_inputs[['value3_rolling_sum']], regression_inputs['returns']))

This is very long, but I wrote out the roughly equivalent thing in C# to try to bring home my point:

// prepare synthetic data
var random = new Random();

var alphanumeric = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
string RandomString(int n)
    => new string(Enumerable.Range(0, n).Select(_ => alphanumeric[random.Next(alphanumeric.Length)]).ToArray());

var ids = Enumerable.Range(0, 900).Select(_ => RandomString(6)).ToList();
var dates = Enumerable.Range(0, 1978).Select(i => new DateTimeOffset(2013, 1, 1, 0, 0, 0, TimeSpan.Zero) + TimeSpan.FromDays(i)).ToList();

var df1 = dates.SelectMany(date => ids.Select(id => (Id: id, Date: date, Value1: random.NextDouble(), Value2: random.NextDouble()))).ToList();
var df2 = dates.SelectMany(date => ids.Select(id => (Id: id, Date: date, Returns: random.NextDouble()))).ToList();


// replace nans and multiply Value1 and Value2
var value3s = df1.Select(row => (double.IsNaN(row.Value1) ? 0.0 : row.Value1) * (double.IsNaN(row.Value2) ? 0.0 : row.Value2));
// Within each id, rolling 5 day sum
var value3RollingSums = df1.Zip(value3s, (origRow, value3) => (origRow, value3))
    .GroupBy(row => row.origRow.Id, row => row.value3)
    .SelectMany(gr =>
    {
        var values = gr.ToList();
        var result = new double[values.Count];
        for (var i = 0; i < values.Count; i++)
            for (var j = Math.Max(0, i - 4); j <= i; j++)
                result[i] += values[j];
        return result;
    });
// Within each date, compute a zscore
var value3Zscores = df1.Zip(value3RollingSums, (origRow, value3RollingSum) => (origRow, value3RollingSum))
    .GroupBy(row => row.origRow.Date, row => row.value3RollingSum)
    .SelectMany(gr =>
    {
        var values = gr.ToList();
        var average = values.Average(); // assume an implementation
        var standardDeviation = values.StandardDeviation(); // assume an implementation
        return values.Select(v => (v - average) / standardDeviation);
    });
// merge with df2
var df2Dict = df2.ToDictionary(row => (row.Id, row.Date));
var merged = df1.Zip(value3Zscores, (origRow, value3Zscore) => (origRow, value3Zscore, df2Row: df2Dict[(origRow.Id, origRow.Date)])).ToList();
var models = new List<LinearRegressionModel>();
// do linear regression
foreach (var date in merged.Select(row => row.origRow.Date).Distinct().OrderBy(date => date))
{
    var startDate = date - TimeSpan.FromDays(250);
    var data = merged.Where(row => row.origRow.Date >= startDate && row.origRow.Date <= date).ToList();
    // assume an implementation of LinearRegressionModel
    var model = LinearRegressionModel.Fit(data.Select(row => row.value3Zscore).ToArray(), data.Select(row => row.df2Row.Returns).ToArray());
    models.Add(model);
}

If anything, I would say that the C# version here looks better than it actually is in real life. In real life, this isn't the only computation we're doing. We're trying out other ways of normalizing the data, we're inspecting the dataframe after every step to double-check that our calculations are doing what we meant to do, we're trying other models like Lasso and Ridge and Random Forests, we have more than just one variable that we're trying to fit, we're trying rolling sums and regressions based on days rather than observations and vice versa. Throwing all of these things into the mix makes it even harder to do in C# and only trivially more difficult in Python/pandas.

totalgit74 commented Jul 19, 2018

I think the choice of indexing will be interesting.
The df[df['int1'] % 2 == 0] style of indexing can, at best, be described as fugly. It smells a bit MatLab-ish.
The R data.table syntax of dt[int1 %% 2 == 0] is somewhat more elegant and a touch SQL/LINQ-like. It also handles numeric references such as dt[1:20], column subsetting such as dt[1:20, list(int1, int2, int3)], as well as grouping within the brackets. Now, some of these things may be a step too far, with a .group() method being more appropriate, for instance.
The ability to reference from the end of the dataframe, much like MatLab's m[end-9:end], or an open request like m[end-9:] or m[-9:], would also be good.
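For what it's worth, newer C# (8.0+) index/range syntax already covers the end-relative case for arrays, which a dataframe indexer could plausibly mirror:

using System.Linq;

var xs = Enumerable.Range(0, 100).ToArray();
var lastTen = xs[^10..];   // the last ten elements, like MatLab's m[end-9:end]
var middle = xs[10..20];   // a contiguous slice (zero-based, end-exclusive)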

What pandas does bring to the table are the built-in interpolation, periodicity-change, rolling, etc. calculations, whereas in R you resort to other libraries such as zoo.

I think that, as .NET is a little late here, it is worth trying to utilise the Apache Arrow and Parquet work, as you don't want to be both late to the party and the odd one out. Both Python and R are headed down this road and, if .NET wants to be part of the data analysis / machine learning party, .NET should go that way too.

mungojam commented Sep 18, 2018

Sorry to spam, but is there any update on whether this is likely to happen? Our department might come back from the brink of Python if a pandas-like library were on the horizon.

joshnyce commented Oct 11, 2018

Just to chime in, this would make all the difference. Pandas/DataFrames have made Python exceptional for working with timeseries. LINQ and SQL Server are both very weak for timeseries. I don't know about ML, but there are lots of uses for Pandas outside of ML. Take finance for example. We need tools for working with timeseries and operations like joins in-memory because performance is critical, but we also need code to be stable and performant. It doesn't make sense to use Python for executing trades, capturing data from APIs, etc, so we end up passing data to/from Python. Oracle and timeseries-based databases like KDB can be helpful, but performing operations in code is often critical. Python.NET can bridge Pandas and .NET to a degree, but there is no efficient way to pass data to Python as a dataframe or to convert C# arrays or dictionaries to dataframes in Python. Time keeps going by and Microsoft keeps falling further behind, and some of the comments in this thread give the impression that Microsoft still fails to understand why Pandas is so popular.

nh43de commented Oct 25, 2018

> Just to chime in, this would make all the difference. Pandas/DataFrames have made Python exceptional for working with timeseries. [...]

This is a great conversation.

Has anyone started working on a spec for what this might look like as part of .NET?

chris-lauren commented Oct 25, 2018

We are beginning to dig into this but are still fairly early in the process. I can assure you all we love pandas. It's an amazing library, and we fully appreciate the need for an equivalent in .NET. We are simply resource constrained at the moment and would love contributions / help spec'ing out and implementing.

nh43de commented Oct 25, 2018

Awesome! Any existing resources you can point us to? Would love to help out

bryanrcarlson commented Oct 25, 2018

I'm interested in helping too. I'm not going to steer the ship but I can help with grunt work.

voltcode commented Oct 25, 2018

Some time has passed since my first comment in this thread. I'd seriously recommend that anyone considering specifying this look into the Apache Arrow project (an in-memory column-store model) and the others building on top of it (like https://rapids.ai/, also https://arrow.apache.org/powered_by/). It's a huge chance to catch up by plugging into the existing data science ecosystem. It's too late to try to beat the others (Python, R); the last chance is to join them.

In terms of the actual API design, .NET could again leverage Arrow implementers' experience and pick the best route. Being able to efficiently interop on large data sets with other processes would be a huge win for all data-processing programs. GPU-interop is another great boon.

To sum up, I think Apache Arrow .NET implementation could be a good foundation for .NET data frame.

veikkoeeva commented Oct 25, 2018

As a complete layman, I find looking at Arrow very appealing. It took a while to dig the Arrow pieces out of this thread, so I gather them here to save some trouble for others:

#26845 (comment)
#26845 (comment)

And the posts just below these.

nh43de commented Oct 27, 2018

Looks like someone just finished an initial C# implementation of Arrow

https://github.com/apache/arrow/tree/master/csharp

bryanrcarlson commented Oct 31, 2018

Good find, @nh43de.

Is this the way to go, then: to build upon RecordBatch and implement an API on top (with array and index notation similar to pandas, and/or data manipulation verbs similar to dplyr)? Or should we still consider defining a DataFrame type that can load the Arrow format?

PyArrow has a Table class that is a "tool to help with wrangling multiple record batches" (see the PyArrow Tables docs).
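For anyone exploring that route, here is a rough sketch against the apache/arrow C# implementation; the builder APIs are still evolving, so the exact names and signatures below are assumptions to verify against the csharp tree:

using Apache.Arrow;
using Apache.Arrow.Types;

// Build a single-column Arrow RecordBatch in memory (illustrative only).
var ints = new Int32Array.Builder().AppendRange(new[] { 1, 2, 3 }).Build();
var schema = new Schema.Builder()
    .Field(f => f.Name("int1").DataType(Int32Type.Default))
    .Build();
var batch = new RecordBatch(schema, new IArrowArray[] { ints }, ints.Length);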

veikkoeeva commented Nov 1, 2018

One thing to note is that a lot of data is in netCDF or GeoTIFF format; by "a lot" I mean many of the scientific data sets with large amounts of data. It would be great if the .NET story for working with these formats were better. If nothing else, maybe reading something like GeoTIFF (these are used as "raster data sets") and storing the data in Arrow format for other work. Right now the tools are fairly non-existent, and even the ones wrapping tools like gdal_translate (to give search engine fodder) could benefit from more people improving them. I think this is closely related to the issues with .NET and geographic types, for which there's some push currently.

(There's https://github.com/predictionmachines/SDSlite but it is abandoned.)

@danmosemsft danmosemsft moved this from To do to In progress in ML.NET support work Dec 5, 2018
