Dataframe MVP #14

Closed
datapythonista opened this issue Jun 15, 2020 · 12 comments

@datapythonista
Member

datapythonista commented Jun 15, 2020

We already have several useful discussions open on different topics. To give the conversations a bit of structure, I propose we start with an initial MVP (minimum viable product) and iterate on it.

This is a draft of the topics that we may want to discuss, and a possible order to discuss them:

The idea would be to discuss and decide on each topic incrementally, and keep defining an API that can be used end to end (with very limited functionality at the beginning). So, focusing on being able to write code with the API, we should identify, for each topic, the questions that need to be answered to construct the API, and then add the API definition to the RFC based on the agreements.

Below is a very simple example of dataframe usage, followed by the questions that need to be answered to define a basic API for it.

>>> from whatever import dataframe

>>> data = {'a': [1, 2], 'b': [3, 4], 'c': [5, 6]}

>>> df = dataframe.load(data, format='dict')
>>> df
a b c
-----
1 3 5
2 4 6

>>> len(df)
2
>>> len(df.columns)
3

>>> df.dtypes
[int, int, int]

>>> df.columns
['a', 'b', 'c']

>>> df.columns = 'x', 'y', 'z'
>>> df.columns
['x', 'y', 'z']

>>> df
x y z
-----
1 3 5
2 4 6

>>> df['q'] = [7, 8]
>>> df
x y z q
-------
1 3 5 7
2 4 6 8

>>> df['y']
y
-
3
4

>>> df['z', 'x']
z x
---
5 1
6 2

>>> df.dump(format='dict')
{'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'q': [7, 8]}
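
To make the example above concrete, here is a minimal sketch of a class that would support that session. Everything in it is an assumption for illustration: the name ToyFrame, the module-level load function, and the rule that a column's dtype is the Python type of its first value. None of it is an agreed API, and printing/repr is left out.

```python
class ToyFrame:
    """Hypothetical column store backing the session above; not an agreed API."""

    def __init__(self, data):
        # Copy each column so the frame does not alias the caller's lists.
        self._data = {name: list(values) for name, values in data.items()}
        lengths = {len(values) for values in self._data.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")

    def __len__(self):
        # Number of rows; 0 for a frame with no columns.
        return len(next(iter(self._data.values()), []))

    @property
    def columns(self):
        return list(self._data)

    @columns.setter
    def columns(self, names):
        names = list(names)
        if len(names) != len(self._data):
            raise ValueError("expected one name per existing column")
        self._data = dict(zip(names, self._data.values()))

    @property
    def dtypes(self):
        # Assumption: a column's dtype is the Python type of its first value.
        return [type(v[0]) if v else None for v in self._data.values()]

    def __getitem__(self, key):
        # df['y'] selects one column; df['z', 'x'] selects several, in order.
        names = [key] if isinstance(key, str) else list(key)
        return ToyFrame({name: self._data[name] for name in names})

    def __setitem__(self, name, values):
        values = list(values)
        if self._data and len(values) != len(self):
            raise ValueError("new column must match the number of rows")
        self._data[name] = values

    def dump(self, format="dict"):
        if format != "dict":
            raise ValueError(f"unsupported format: {format!r}")
        return {name: list(values) for name, values in self._data.items()}


def load(data, format="dict"):
    # Hypothetical stand-in for `dataframe.load` in the session above.
    if format != "dict":
        raise ValueError(f"unsupported format: {format!r}")
    return ToyFrame(data)


df = load({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
df.columns = "x", "y", "z"
df["q"] = [7, 8]
print(len(df), len(df.columns))  # 2 4
print(df["z", "x"].dump())       # {'z': [5, 6], 'x': [1, 2]}
print(df.dump())                 # {'x': [1, 2], 'y': [3, 4], 'z': [5, 6], 'q': [7, 8]}
```

Even a toy like this forces the questions below: what len returns, whether columns is a settable property, and whether selecting a single column returns a frame or something else.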

The simpler questions that need to be answered to define this MVP API are:

  • Name of the dataframe class. I can think of two main options (feel free to propose more):
    • DataFrame or Dataframe, to be consistent with Python class capitalization
    • dataframe, using Python type capitalization (as in int, bool, datetime.datetime...)
  • How to obtain the size of the dataframe?
    • Properties (num_columns, num_rows)
    • Using Python len: len(df), len(df.columns)
    • shape (it allows for N dimensions, which for a dataframe is not needed, since it's always 2D)
  • How to obtain the dtypes (is a dtypes property enough?)
  • Setting and getting column names
    • Is using a Python property enough?
    • What should be the name? columns, column_names...

The next two questions are also needed, but they are more complex, so I'll be creating separate issues for them:

  • Loading and exporting data

    • Should the dataframe class provide a constructor? If it does, should it support different formats (like pandas)?
    • Should we have different syntax (as in pandas) for loading data from disk (pandas.read_csv...) and for loading data from memory (DataFrame.from_dict)? Or is a standard way for all loading/exporting preferred?
  • How to access and set columns in a dataframe

    • With __getitem__ directly (df[col] / df[col] = foo)
    • With __getitem__ over a property (df.col[col] / df.col[col] = foo)
    • With methods (df.get(col) / df.set(col=foo))
    • Is more than one way needed/preferred?
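
The three column-access spellings listed above could look roughly like this when implemented on the same storage. This is only a sketch: AccessFrame and ColumnAccessor are hypothetical names, and each column is a plain Python list.

```python
class ColumnAccessor:
    """Backs the hypothetical `df.col[...]` spelling."""

    def __init__(self, data):
        self._data = data  # shared with the owning frame, not copied

    def __getitem__(self, name):
        return self._data[name]

    def __setitem__(self, name, values):
        self._data[name] = list(values)


class AccessFrame:
    """Hypothetical frame exposing all three access styles at once."""

    def __init__(self, data):
        self._data = {name: list(values) for name, values in data.items()}
        # Option 2: __getitem__ over a property-like attribute.
        self.col = ColumnAccessor(self._data)

    # Option 1: __getitem__ / __setitem__ directly on the frame.
    def __getitem__(self, name):
        return self._data[name]

    def __setitem__(self, name, values):
        self._data[name] = list(values)

    # Option 3: plain methods.
    def get(self, name):
        return self._data[name]

    def set(self, **columns):
        for name, values in columns.items():
            self._data[name] = list(values)


df = AccessFrame({"x": [1, 2]})
df["q"] = [7, 8]        # option 1
df.col["r"] = [9, 10]   # option 2
df.set(s=[11, 12])      # option 3
print(df["q"], df.col["r"], df.get("s"))  # [7, 8] [9, 10] [11, 12]
```

A detail that falls out of the sketch: the keyword form df.set(col=foo) only works for column names that are valid Python identifiers, whereas the two __getitem__ spellings accept any hashable name.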
@TomAugspurger

meta-question (which is probably appropriate for the Thursday call): why do this in a GitHub issue rather than the RFC document? IMO this document is a better home for it so that we can comment inline.

@tdimitri

I hope we can call it something other than "DataFrame". To me, dataframe implies the pandas version of this concept, and I prefer another term to help distinguish the two. For example, Matlab has Datasets and Tables, R has Tables and xtabs.

Examples: Table, Grid, Dataset, DataGrid, DataMatrix, DataSheet, DataPage, RowCol, Screen, Slate, Panel, Lattice, Board, DataBoard, etc.

I don't care much, as long as we don't overload the term DataFrame.
It will get confusing when pandas APIs for its DataFrames are different from this group's recommended APIs for DataFrames.

@TomAugspurger

TomAugspurger commented Jun 15, 2020 via email

@tdimitri

Yeah I see R with data.frame vs data.table. For many python data analytics users, the word dataframe is intertwined with pandas dataframe. Feels like this is the "pandas" club, and I hope we can break out from its orbit and declare independence. Is this group's goal to

  1. Cleanup pandas APIs, consolidate them, and get rid of duplicate methods
  2. Come up with a better model for the common tasks at hand

If no. 1, we should just state that is what this group is doing. At which point I'm disappointed and feeling hoodwinked.

If no. 2, we should then indicate the most common tasks a user wants to perform, including two-step tasks that can be consolidated into one. For instance, set_index followed by pivot can be made into one task, and it illustrates some of the problems of following the pandas API model.

One step to breaking free from pandas APIs is to change the name. Further, having the same name and methods that do things differently is more confusing for future users.

@TomAugspurger

TomAugspurger commented Jun 16, 2020 via email

@tdimitri

tdimitri commented Jun 16, 2020

Yes, the impression I get from most of those was a conscious effort to be similar enough to the pandas dataframe so that existing code could be ported over more easily, and common methods could be used in the same way.

In riptide we called it a Dataset, and the name difference worked out well. Our examples use "ds" instead of "df". If we called it DataTable or DataGrid we could use "dt" or "dg". Then it would be easier to explain to users why we have different APIs with different kwargs.

Is the goal to be ~80% like the pandas Dataframe API or to be a new class with similar but different methods? Different group members may have different opinions.

Perhaps we can take a vote on this because I do care about the name. From my experience, I think calling it Dataframe is bound to lead to similar methods. Maybe that is what most people want, but it is not what I signed up for.

@jack-pappas

I think it's best to just vote on it -- we can either do that at tomorrow's meeting, or we can decide to pick a working name for now then vote on a final name later on (when we're finalizing the spec).

DataFrame is more recognizable to end users (due to its widespread use in existing projects), but re-using that name could also lead to some confusion on their part if the specification we produce has non-trivial differences from those existing implementations. I assume that's why R's data.table project chose a different name -- its API is still recognizable as a "DataFrame" in spirit but diverges enough from the built-in R data.frame that calling it the same thing would have been confusing to end users.

Some additional data points:

Note: I don't have a preference towards any particular name. I do feel like the Data prefix is somewhat implied though, so I'd lean towards a simpler name like Frame or Table.

@jack-pappas

@datapythonista The naming discussion has more-or-less taken over this issue, but I think it's important we address the other points you brought up as well. Maybe we just rename / simplify this issue to be about the naming only and copy the rest of your original post to another issue so we can discuss those points?

@datapythonista
Member Author

My main idea here, more than making decisions on the specific points, was about the methodology to move forward. We've got several issues now, with very interesting discussions. But I felt that instead of ending up with even more open discussions, finding the points for a minimal API, and deciding on those, could help get started.

Then we could work as follows:

  • We decide the initial points to discuss (e.g. class name, get/set columns...)
  • We open issues for those, and we try to reach decisions
  • Once there is agreement, we write the minimal API in the intended formats (RFC, not sure if we want a Python definition too, or something else)
  • Then, for the rest of the issues, once there seems to be consensus on one of them, we open a PR to the RFC... with the outcome, to finalize the discussion.

I think this should help us stay focused: we can see progress on what has already been agreed, and we limit the number of open discussions at any one time. But that's just a personal preference for working in a more structured way. Surely other people will have ideas on how to work efficiently.

We can discuss if this makes sense in the call tomorrow. We can also discuss whether we want to take a subset of pandas as a starting point, as was asked here, and which points we want to start with, if people like this approach.

@rgommers
Member

We can discuss if this makes sense in the call tomorrow.

Agreed, methodology for constructing the APIs will be the main topic for tomorrow's call. We'll start with arrays, where we're a little further along with tooling, and where obtaining data and making decisions is generally a bit easier. Then for dataframes we should consider which parts of that methodology are applicable, and how to deal with dataframe-specific pain points.

@tdimitri

When designing an API I often request good use case examples. I am not sure we have those yet (perhaps we do and I missed it). Different industries have different use cases, which can help us determine the APIs. For instance, if I want to select all rows with last name 'Dimitri' and first name 'Thomas', what are the ways to do that? (That is just a simple one; I am hoping for more sophisticated ones.)

Therefore I think one step in the methodology for constructing the APIs involves good use cases from different industries.
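
For the simple row-selection case mentioned above, two possible spellings can be sketched in plain Python over a dict of columns. The sample rows, the id column, and the helper names filter_rows and where are all made up for illustration; this is not a proposal for the actual API.

```python
# Invented sample data: one list per column, all the same length.
people = {
    "first": ["Thomas", "Ana", "Thomas"],
    "last": ["Dimitri", "Dimitri", "Smith"],
    "id": [1, 2, 3],
}


def filter_rows(columns, mask):
    """Keep the rows where the boolean mask is True, column by column."""
    return {
        name: [value for value, keep in zip(values, mask) if keep]
        for name, values in columns.items()
    }


# Way 1: build a boolean mask from the relevant columns, then apply it.
mask = [
    first == "Thomas" and last == "Dimitri"
    for first, last in zip(people["first"], people["last"])
]
by_mask = filter_rows(people, mask)


# Way 2: pass a predicate that is evaluated once per row.
def where(columns, predicate):
    names = list(columns)
    rows = zip(*columns.values())
    row_mask = [predicate(dict(zip(names, row))) for row in rows]
    return filter_rows(columns, row_mask)


by_predicate = where(
    people, lambda row: row["first"] == "Thomas" and row["last"] == "Dimitri"
)

print(by_mask)       # {'first': ['Thomas'], 'last': ['Dimitri'], 'id': [1]}
print(by_predicate)  # {'first': ['Thomas'], 'last': ['Dimitri'], 'id': [1]}
```

The mask form works column-wise and maps naturally onto vectorized backends, while the predicate form reads more like the question being asked; a collection of use cases would help show which spelling users reach for.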

@datapythonista
Member Author

This has been superseded by #25, closing.
