DataFrame Discussion #24920
Comments
Could you define "DataFrame" for the purpose of this discussion? Because that is a vague, overloaded, and open-ended term - it would be good to know the intent as it relates to this discussion. For example, pandas defines it as:
which sounds ... awful :) I can't think of many scenarios where that would be preferable to, say, a |
I think the scenarios come from a more "data scientist" and Machine Learning angle. Take a look at https://github.com/ageron/handson-ml/blob/master/02_end_to_end_machine_learning_project.ipynb starting with the
Imagine trying to explore a data set (loaded through .csv or otherwise) like this in .NET code, and how much more code you would need to write to accomplish something like:
|
that sounds like something that LINQ would excel at, with little more than a few helper extension methods - i.e.
with an extension method something like below (completely untested). Note: I'm totally not a fan of
Edit: personally, I'd much rather use a |
@mgravell I think the Pandas definition is as good as any. You're right, it is very similar to a DataTable. However, DataTables are heavy, not generic (and thus box any value types) and do not support many of the useful properties of DataFrames. I love Linq, but it is a different type of abstraction. Common operations with DataFrames include aligning columns, performing operations between columns, joining frames, etc. Writing ad-hoc types or using Linq for this is much more verbose and time-consuming. A key insight to realize about DataFrames vs Linq is that you're working with columns or frames of data vs individual items, as you do with Linq. I strongly suggest looking at some example usage in Pandas or checking out Deedle. (Deedle has a pretty good implementation of a DataFrame, but it is written in F# and thus uses a lot of F#-isms that are problematic for other .NET languages; it also appears to be essentially abandoned). |
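To make the columns-versus-items point a bit more concrete, here is a rough sketch in plain C#. The Column type and its operators are hypothetical, invented only for this illustration; they are not taken from Deedle, pandas, or any .NET library. It computes the same quantity once as a whole-column expression and once item by item with LINQ.

```csharp
using System;
using System.Linq;

// Hypothetical minimal column type, invented for this sketch; not a real library API.
public sealed class Column
{
    private readonly double[] _values;
    public string Name { get; }

    public Column(string name, params double[] values)
    {
        Name = name;
        _values = values;
    }

    public int Length => _values.Length;
    public double this[int i] => _values[i];

    // Element-wise operators: the caller works with whole columns, not items.
    public static Column operator *(Column left, Column right) =>
        Zip(left, right, (a, b) => a * b, $"{left.Name}*{right.Name}");

    public static Column operator -(Column left, Column right) =>
        Zip(left, right, (a, b) => a - b, $"{left.Name}-{right.Name}");

    public double Sum() => _values.Sum();

    private static Column Zip(Column l, Column r, Func<double, double, double> op, string name)
    {
        if (l.Length != r.Length)
            throw new ArgumentException("Columns must have the same length.");
        var result = new double[l.Length];
        for (int i = 0; i < result.Length; i++)
            result[i] = op(l._values[i], r._values[i]);
        return new Column(name, result);
    }
}

public static class Program
{
    public static void Main()
    {
        var price = new Column("price", 10.0, 20.0, 30.0);
        var quantity = new Column("quantity", 1.0, 2.0, 3.0);
        var cost = new Column("cost", 4.0, 15.0, 50.0);

        // Column-at-a-time: the expression reads like the formula itself.
        Column profit = price * quantity - cost;
        Console.WriteLine($"{profit.Name}: total {profit.Sum()}");

        // Item-at-a-time LINQ over an ad-hoc row type computes the same total.
        var rows = new[]
        {
            new { Price = 10.0, Quantity = 1.0, Cost = 4.0 },
            new { Price = 20.0, Quantity = 2.0, Cost = 15.0 },
            new { Price = 30.0, Quantity = 3.0, Cost = 50.0 },
        };
        double linqTotal = rows.Sum(r => r.Price * r.Quantity - r.Cost);
        Console.WriteLine($"LINQ total {linqTotal}");
    }
}
```

Both approaches give the same number; the difference is that the column version needs no row type to be declared up front, which is the part that matters during exploration.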
Can you explain a bit what is F#-ism and what are those problems? If I am relying on a nuget package with .NET assembly, I shouldn't care about the source language the IL was produced from. Their usage docs has C# examples: http://bluemountaincapital.github.io/Deedle/csharpseries.html |
@kasper3 F# adds a lot of its own types and conventions on top of the .NET Framework. It's in FSharp.Core.dll. You absolutely can use it from C#, and it is very powerful, but you quickly run into a lot of annoying impedance mismatches. The biggest one being that F# has an Though I digress. Really I'm just trying to show why there is a need for a DataFrame type to be baked into the core framework. |
I've been trying to build a UWP app which reads data from IOT sensors and I've been banging my head against this exact problem. I can't use pandas or R because I'm in UWP-land and C# doesn't have a decent DataFrame API to use MathNet with in the same way I'd use SciPy with Pandas. I'm pretty close to just abandoning .NET and building the whole thing in Python with the desktop bridge. Unfortunate but there's really a gap in this area where Python and R excel. This would obviously be most useful on the server-side ML pipelines, but wanted to highlight how this is also useful to client developers. Particularly when developers will want to pre-process data they're feeding into ONNX models for WindowsML. |
@mattdot check out the issue I referenced; it seems there's a new dataframe-ish abstraction called IDataView in the just open-sourced ML.NET library. Maybe it can be useful to you? |
If there is some API shape proposal that you have in mind that would help your use case, please do share. That will increase the chances of moving this discussion forward. |
@MgSam So |
@veikkoeeva I'm not sure column-oriented vs row-oriented is the right description. It's a tabular data structure. It definitely makes column-based operations easy and much more readable. This definition for an R data frame is pretty good. As soon as I get some free time I intend to play around with the IDataView. |
ML.NET has the concept of an "IDataView" (IDV). We are porting documentation for it in PR 173. |
Please excuse my extreme naivete, but does a .net core implementation of Apache Arrow relate to this? Wes McKinney recently posted on his blog announcing Ursa Labs which would open .net up to a whole new crowd if there were collaboration. |
@bryanrcarlson I mentioned Apache Arrow as a means for ML.NET to integrate with R and Python - check out dotnet/machinelearning#69 if you're interested |
I've been a C# programmer for 10 years, but in the last few years have learnt R and now Python. I love C#, but it is really lacking in this area when compared with Pandas in Python and dplyr/tidyr in R. Those tools make it so easy to work with tabular data and for people who we want to move out of Excel for LoB data tools they are a much easier sell than C# at present. Both R and Python benefit here during the exploration stage by being dynamically typed. They also benefit from having a really good REPL story. I think a neat C# implementation would start in the The column vs. row thing is important for performance. Columnar data is generally the way things are going for performance reasons, e.g. the Parquet data format. A lot of the operations people do on data tables benefit from using SIMD along a column of contiguous memory data. |
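As a rough illustration of why contiguous column storage matters for SIMD, here is a small sketch that sums an int column with System.Numerics.Vector<T> from the BCL. The ColumnMath class and the SumSimd name are made up for this example, and the whole thing is a toy under those assumptions rather than how any particular DataFrame library does it.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, invented for this sketch.
public static class ColumnMath
{
    // Sums a contiguous int column, processing Vector<int>.Count elements per step.
    public static int SumSimd(int[] column)
    {
        int width = Vector<int>.Count;
        Vector<int> acc = Vector<int>.Zero;
        int i = 0;

        // Vectorized body: each iteration adds one full vector of elements.
        for (; i <= column.Length - width; i += width)
            acc += new Vector<int>(column, i);

        // Horizontal add of the accumulator lanes via a dot product with all ones.
        int total = Vector.Dot(acc, Vector<int>.One);

        // Scalar tail for the elements that did not fill a whole vector.
        for (; i < column.Length; i++)
            total += column[i];

        return total;
    }
}

public static class Program
{
    public static void Main()
    {
        int[] column = new int[1000];
        for (int i = 0; i < column.Length; i++) column[i] = i + 1;

        int scalar = 0;
        foreach (int v in column) scalar += v;

        Console.WriteLine(
            $"SIMD: {ColumnMath.SumSimd(column)}, scalar: {scalar}, " +
            $"accelerated: {Vector.IsHardwareAccelerated}");
    }
}
```

This only works because the values of one column sit next to each other in memory. With row-oriented storage (an array of row objects), each value would live in a different object and could not be loaded into a vector directly.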
I agree that a It's basically an n-dimensional tensor (could be a scalar, vector, matrix, or higher level tensor) which has label-vectors for the individual dimensions. The labels should be preserved as far as possible when mathematical operations are applied. |
I guess a |
I arrived at this discussion as I too am searching for a suitable data analysis framework. Deedle is ok but suffers from very bad performance on even moderately large datasets. Extreme Optimisation has an implementation that is very fast from what I have played around with. |
Adding to ML.NET "project" just to keep track (no decision implied but this is a good discussion) |
I am extremely in favor of this! Here goes a long post; thank you to anyone who gets through it all.
API shape proposal-ish
To try to move the discussion forward, I'll put forward something that looks kind of like an API shape proposal. It's actually just a few core operations from pandas, but hopefully it gives a sense of what operations a real API shape proposal would have to cover. A real API shape proposal would need to be much more C#-like and cover a much wider set of scenarios. I'm using pandas here, but Matlab and R basically smell the same, the way that C/C++, Java, and C# all smell the same. First I'll define a dataframe to make all of the examples actually runnable.
Here are the core operations that a dataframe needs to be able to support: Selection:
Indexes for fast selection:
Projection and Append:
Assignment:
(this seems a little contrived, but this is the building block for common operations like replacing all
GroupBy:
Row/column labels:
Join:
Sort:
There are other things, like reading/writing CSVs and SQL tables, resampling and other time-related functions, rolling window calculations, more advanced selection, pivoting and transposing (e.g. see the table of contents at https://pandas.pydata.org/pandas-docs/stable/) but I think those will fall into place fairly easily once the core operations described above are defined. No comment here on dataframes being mutable/immutable (some of the syntax I showed here strongly suggests mutability because that's how pandas/Matlab/R work, but I don't think that has to be the case). How is this different from LINQ on a
|
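For readers who think in C# rather than pandas, here is a very rough sketch of what a few of the core operations listed above (selection, projection, sort, group-by) could look like over column-wise storage. ToyFrame and every member on it are hypothetical names made up for this illustration. It boxes values and skips indexes, labels, and joins entirely, so treat it as a conversation aid under those assumptions and not as an API proposal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical toy frame: named columns of equal length, stored column-wise.
public sealed class ToyFrame
{
    private readonly Dictionary<string, object[]> _columns;

    public ToyFrame(Dictionary<string, object[]> columns)
    {
        if (columns.Values.Select(c => c.Length).Distinct().Count() > 1)
            throw new ArgumentException("All columns must have the same length.");
        _columns = columns;
    }

    public int RowCount => _columns.Count == 0 ? 0 : _columns.Values.First().Length;
    public IEnumerable<string> ColumnNames => _columns.Keys;
    public object[] this[string name] => _columns[name];

    // Selection: keep the rows where a predicate on one column holds.
    public ToyFrame Where(string column, Func<object, bool> predicate)
    {
        int[] keep = Enumerable.Range(0, RowCount)
            .Where(i => predicate(_columns[column][i]))
            .ToArray();
        return Take(keep);
    }

    // Projection: keep only the named columns (arrays are shared, not copied).
    public ToyFrame Select(params string[] names) =>
        new ToyFrame(names.ToDictionary(n => n, n => _columns[n]));

    // Sort: reorder every column by the values of one column.
    public ToyFrame SortBy(string column)
    {
        int[] order = Enumerable.Range(0, RowCount)
            .OrderBy(i => (IComparable)_columns[column][i])
            .ToArray();
        return Take(order);
    }

    // GroupBy: sum a numeric column per distinct key.
    public Dictionary<object, double> GroupSum(string keyColumn, string valueColumn) =>
        Enumerable.Range(0, RowCount)
            .GroupBy(i => _columns[keyColumn][i])
            .ToDictionary(
                g => g.Key,
                g => g.Sum(i => Convert.ToDouble(_columns[valueColumn][i])));

    // Builds a new frame containing the given row indices, in the given order.
    private ToyFrame Take(int[] rows) =>
        new ToyFrame(_columns.ToDictionary(
            kv => kv.Key,
            kv => rows.Select(i => kv.Value[i]).ToArray()));
}

public static class Program
{
    public static void Main()
    {
        var frame = new ToyFrame(new Dictionary<string, object[]>
        {
            ["city"] = new object[] { "Oslo", "Lima", "Oslo", "Lima" },
            ["sales"] = new object[] { 10, 20, 30, 5 },
        });

        ToyFrame big = frame.Where("sales", v => (int)v >= 10);
        ToyFrame sorted = frame.SortBy("sales");
        Dictionary<object, double> totals = frame.GroupSum("city", "sales");

        Console.WriteLine(big.RowCount);
        Console.WriteLine(string.Join(", ", sorted["sales"]));
        Console.WriteLine(totals["Oslo"]);
    }
}
```

The casts in the caller (for example (int)v) are exactly the friction that a typed column design would try to remove.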
I think the choice of indexing will be interesting. What pandas does bring to the table are the inbuilt interpolation, periodicity change, rolling etc calculations whereas in R you resort to other libraries such as zoo. I think that, as .Net is a little late here, it is worth trying to utilise the Apache Arrow and Parquet work as you don't want to be both late to the party and the odd one out. Both Python and R are headed down this road and, if .Net wants to be part of the data analysis / machine learning party, .Net should go that way too. |
Sorry to spam, but any update on whether this is likely to happen? Our department might come back from the brink of python if a pandas-like library was on the horizon |
Just to chime in, this would make all the difference. Pandas/DataFrames have made Python exceptional for working with timeseries. LINQ and SQL Server are both very weak for timeseries. I don't know about ML, but there are lots of uses for Pandas outside of ML. Take finance for example. We need tools for working with timeseries and operations like joins in-memory because performance is critical, but we also need code to be stable and performant. It doesn't make sense to use Python for executing trades, capturing data from APIs, etc, so we end up passing data to/from Python. Oracle and timeseries-based databases like KDB can be helpful, but performing operations in code is often critical. Python.NET can bridge Pandas and .NET to a degree, but there is no efficient way to pass data to Python as a dataframe or to convert C# arrays or dictionaries to dataframes in Python. Time keeps going by and Microsoft keeps falling further behind, and some of the comments in this thread give the impression that Microsoft still fails to understand why Pandas is so popular. |
This is a great conversation. Has anyone started working on a spec for what this might look like as part of .NET? |
We are beginning to dig into this but still fairly early in the process. I can assure you all we love pandas. It's an amazing library and we do fully appreciate the need for an equivalent in .NET. We simply are resource constrained at the moment and would love contributions / help spec'ing out and implementing |
Awesome! Any existing resources you can point us to? Would love to help out |
I'm interested in helping too. I'm not going to steer the ship but I can help with grunt work. |
(small puff of smoke)
I HAVE BEEN SUMMONED
A few notes from what I see in the (wonderful) discussion above.
Typed and untyped need not be mutually exclusive
There appears to be a fairly even split between people who want the more concise, untyped DataFrame type and those who want a boxing-free, type-safe DataFrame. I would like to suggest that we may be able to achieve a superposition of having and eating cake here. For instance, we may be able to have a situation wherein:
Alternatively, we could have:
But the former would be far better. I think of it as a parallel to the relationship between LambdaExpression and Expression.
I would personally set an extremely high bar for having separate implementations for ML.Net and for DataFrame
The scenarios in ML.Net and those described above have significant overlap. ML.Net has not yet achieved a 1.0 release. As noted above already, there is already a DataTable class (one that has an extraordinary number of shortcomings relative to the needs stated by people here) but if we also have a DataFrame class that's aaaaaaaalmost IDataView but not quite then I think we have failed. To that end, I hope the corefx and ML.Net teams are talking like nowish about where this goes. It need not be that DataFrame and IDataView refer to literally the same class, as it could be like the relationship between DataTable and DataView in the Fx. But there's too much overlap in scenario here for there to be divergent evolution.
Trill could join this ecosystem as well if we had first-class time support
The main difference between an abstraction like DataFrame and IStreamable (Trill's basic abstraction) is support for the progression of time. ML.Net may also be looking to Trill for some of its capabilities such as windowing, so this is likely a good discussion to keep going.
Some of the more untyped-friendly operations can be done in the typed world with a little imagination and some code generation
Adding a column was brought up as a great example of an operation that can be ridiculous to do in a typed environment but much easier over an untyped set. One of the things we did in Trill was implement a set of overloads to the .Select() method that take operations like adding a column and take it from ridiculous down to maybe a mild laugh. It's not as concise as in untyped-world, but still better than otherwise. For adding a column, the result looks something like:
It allows you to do something like this:
Again, not perfect, but better than the alternative. Any field that is the same between OldType and NewType in the above is carried over automatically because MAGIC er I mean code generation. Trill's got a few methods that contain magic like that; for instance, Trill streams have Pivot() and Unpivot() methods. Anyway, those are my first thoughts on the matter. Feel free to conjure me again. (disappears in a smaller puff of smoke) |
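One possible reading of the "typed and untyped need not be mutually exclusive" idea, sketched at the column level: an untyped base class for exploration and a generic subclass that holds the same data without boxing, loosely mirroring how a generic expression type can derive from a non-generic one. UntypedColumn and TypedColumn<T> are hypothetical names invented here; the snippets from the original comment did not survive, so this is a guess at the shape and not a reconstruction of them.

```csharp
using System;
using System.Collections.Generic;

// Untyped view: convenient for exploration, but GetValue boxes value types.
public abstract class UntypedColumn
{
    public string Name { get; }

    protected UntypedColumn(string name)
    {
        Name = name;
    }

    public abstract int Length { get; }
    public abstract Type ElementType { get; }
    public abstract object GetValue(int index);
}

// Typed view over the same data: no boxing, compile-time type checks.
public sealed class TypedColumn<T> : UntypedColumn
{
    private readonly T[] _values;

    public TypedColumn(string name, params T[] values) : base(name)
    {
        _values = values;
    }

    public override int Length => _values.Length;
    public override Type ElementType => typeof(T);
    public override object GetValue(int index) => _values[index];

    // Strongly typed indexer for callers that know the element type.
    public T this[int index] => _values[index];
}

public static class Program
{
    public static void Main()
    {
        // A frame-like collection can hold columns of mixed element types.
        var columns = new List<UntypedColumn>
        {
            new TypedColumn<string>("Name", "Mary", "Sue", "John"),
            new TypedColumn<int>("Age", 50, 15, 35),
        };

        // Untyped access: enough for printing and quick inspection.
        foreach (UntypedColumn column in columns)
            Console.WriteLine($"{column.Name}: {column.ElementType.Name}, first = {column.GetValue(0)}");

        // Typed access: recover the generic type when the work is numeric.
        if (columns[1] is TypedColumn<int> ages)
        {
            int total = 0;
            for (int i = 0; i < ages.Length; i++) total += ages[i];
            Console.WriteLine($"Total age: {total}");
        }
    }
}
```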
Closing, as work is actively proceeding in https://github.com/dotnet/corefxlab/tree/master/src/Microsoft.Data |
@mungojam @hrichardlee I've been working on a DataFrame library for .NET which implements many of the ideas in this thread and I'd appreciate some feedback when you have time: |
@allisterb This is great. Implementing IEnumerable (making your dataframe implementation LINQ-friendly) sets it apart from Deedle. I haven't tried the commercial dataframe products for .NET, so I don't know if they are LINQ-friendly, but it looks like the implementation in corefxlab doesn't implement IEnumerable yet. |
Not yet, but we plan on it - see eerhardt/corefxlab@5cf06f4. I just haven't had time to submit the PR yet, but will this week. |
If there is anything I can do to contribute to this, or if you'd like me to be a reviewer on any work in this space, please let me know. |
@joshnyce Thanks, I tried to address what I thought were some of the shortcomings of Deedle, and being able to use .NET LINQ as the main query library was one of them, as .NET developers are already familiar with how LINQ works. |
@cybertyche Hey thanks a lot! I guess the main thing I need help with is getting the word out. I really think .NET developers would prefer to stay in C# or F# for doing data analysis instead of using pandas or R and I think F# + Sylvester's data frame + Azure Notebooks offers a comparable experience. |
Wondering if Microsoft.Data will solve the same problems, or will we still need 3rd party frameworks? |
We now have a preview of DataFrame out on Nuget here. We'd definitely appreciate any feedback we get! Another exciting development is .NET Core support on Jupyter Notebooks. Check out https://devblogs.microsoft.com/dotnet/net-core-with-juypter-notebooks-is-here-preview-1/ to get started. And finally, this is a sample showing Jupyter Notebook + .NET DataFrame + Charting! Click on https://github.com/dotnet/try/blob/master/Notebook.md to try it out on binder! |
Can you give an example of some basic usage? I was unable to get it to work for a few basic cases.
var a = new DataFrame();
a.Columns.Add(new PrimitiveDataFrameColumn<DateTime>("DOB"));
a.Columns.Add(new StringDataFrameColumn("Name", 100)); //What does the length param refer to?
a.Columns.Add(new PrimitiveDataFrameColumn<int>("Age")); //This throws
a.Append(new object[] { DateTime.Parse("2017/01/01"), "Mary", 50 });
a.Append(new object[] { DateTime.Parse("2011/03/01"), "Sue" , 15 });
a.Append(new object[] { DateTime.Parse("2015/05/01"), "John", 35 });
A few thoughts/questions:
Thanks for all the hard work! |
@pgovind Trying my luck and hinting a bit, would you happen to have examples coming on how to read netCDF/HDF5 files? :) |
Here's a quick example of DataFrame in action:
PrimitiveDataFrameColumn<DateTime> dateTimes = new PrimitiveDataFrameColumn<DateTime>("DateTimes"); // Default length is 0.
PrimitiveDataFrameColumn<int> ints = new PrimitiveDataFrameColumn<int>("Ints", 3); // Makes a column of length 3. Filled with nulls initially
StringDataFrameColumn strings = new StringDataFrameColumn("Strings", 3); // Makes a column of length 3. Filled with nulls initially
// Append 3 values to dateTimes
dateTimes.Append(DateTime.Parse("2017/01/01"));
dateTimes.Append(DateTime.Parse("2017/01/01"));
dateTimes.Append(DateTime.Parse("2017/01/01"));
DataFrame df = new DataFrame(new List<DataFrameColumn> { dateTimes, ints, strings }); // This will throw if the columns are of different lengths
// To change a value directly through df
df[0, 1] = null; // 0 is the rowIndex, and 1 is the columnIndex. This sets the 0th value in the Ints columns to null
// Modify ints and strings columns by indexing
ints[1] = 100;
strings[1] = "Foo!";
// Indexing can throw when types don't match.
// ints[1] = "this will throw because I am a string";
// DataType can be used to figure out the type of data in a column.
Type intsType = ints.DataType; // System.Int32
// Add 5 to ints in place
ints.Add(5, inPlace: true);
// Add 5 to ints through the DataFrame
df["Ints"].Add(5, inPlace: true);
// We can also use binary operators. Binary operators produce a copy, so assign it back to our Ints column
df["Ints"] = (ints / 5) * 100;
// Fill nulls in our columns, if any. ints[0], ints[2], strings[0] and strings[2] are null
df["Ints"].FillNulls(-1, inPlace: true);
df["Strings"].FillNulls("Bar", inPlace: true);
// To inspect the first row
IList<object> row0 = df[0];
// Filter rows based on equality
DataFrame filtered = df.Filter(strings.ElementwiseNotEquals("Foo!"));
// Sort our dataframe using the Ints column
DataFrame sorted = df.Sort("Ints", ascending: true);
// GroupBy
GroupBy groupBy = df.GroupBy("DateTimes");
// Count of values in each group
DataFrame grouped = groupBy.Count(); // Alternatively find the count in just the desired columns
Many of the APIs expose an "inPlace" parameter. The aim here is ease of use to inspect values in a notebook. @MgSam : |
Suggestion. How about adding a tutorial notebook on dotnet/try with some dummy data so that we can start to explore the library on binder directly. It will provide testers a taste of all functionalities before documentations are out. |
That's the plan. I'll be adding a sample notebook and an accompanying blogpost going over the major features in the preview soon. |
@veikkoeeva : We only support reading CSV files into a DataFrame at the moment. Or, you can make a dataframe from an Arrow RecordBatch. I opened https://github.com/dotnet/corefxlab/issues/2785 |
@pgovind
|
I have not tried this yet, and it will take me some weeks to get back from a business trip, but here is some quick feedback so that I don't miss the chance:
|
Is there a design document that shows why the API is built the way it is? Also interested in performance and implementation concerns. |
We've got
The next release will add a
Yup:). However, wait till the blog post and we might win you over to the notebook side where
Agreed. Will definitely make it into the next version
That would be awesome. Maybe put up a PR and tag me? |
Nope. But we can add the overload if you feel like it's important. I suggest waiting for the first blog post to come out. It goes over how to use DataFrame + .NET notebook and adds custom formatting for the DataFrame that solves many pain points.
We actually don't have a |
No. My forthcoming blog post will go over some of the design and implementation considerations. It's a little early to talk about performance numbers IMO. I've done very little perf work for the 0.1.0 release, save for minor optimization and profiling. Understanding the performance concerns requires going over the
Finally, we haven't implemented any SIMD ops and/or multi-threading yet. So there's still loads of perf to gain :) |
Here's the promised blog post :) |
Thanks for the blog post! Some more feedback:
class DataFrameColumn
{
public static PrimitiveDataFrameColumn<T> Create<T>(string name, IEnumerable<T> values) where T : unmanaged
{
return new PrimitiveDataFrameColumn<T>(name, values);
}
public static StringDataFrameColumn Create(string name, IEnumerable<string> values)
{
return new StringDataFrameColumn(name, values);
}
//... Also overloads for all the other constructors for PrimitiveDataFrameColumn and StringDataFrameColumn...
}
Usage:
var doubles = new[] { 3.0, 4.0, 5.0 };
var ints = new[] { 3, 4, 5 };
var strings = new[] { "foo", "bar", "baz" };
var col1 = DataFrameColumn.Create("doubles", doubles);
var col2 = DataFrameColumn.Create("ints", ints);
var col3 = DataFrameColumn.Create("strings", strings);
EDIT: Happy to try my hand at a PR for this if the API looks ok. |
@MgSam : I agree. The API looks good to me. Feel free to tag me on the PR when you put it up and we can get it into the next preview :) |
+1 for the ToCsv method! That would be so useful. |
With work underway on Tensor, and the new perf optimizations available via Span, I think it's time the .NET team seriously considered adding a DataFrame type. Working with large amounts of data has become critical in many applications these days, and is the reason libraries like Pandas for Python have been so successful. It's time to bring similar capabilities to .NET, and it needs to start with the framework itself adding a DataFrame type.
There are DataFrame libraries out there available for .NET, but the problem is that they each have their own implementations of DataFrames, that are minimally compatible with the BCL or any of the other libraries out there. From my experience, support for many of these various libraries is also pretty weak.
I think the BCL team implementing its own DataFrame is the first, most important step to improving the state of working with data in .NET. All we have right now is DataTable, a powerful but ancient type that is not well optimized for many scenarios and which I've variously seen .NET team members refer to as "legacy".
I'm creating this issue to garner opinions on the topic. I don't have a specific API in mind to propose at this time, as gauging interest first is probably more important.
Let's make .NET as great for data analysis as any other platform.