Tabular data container (data frames) #15

Open
burner opened this issue May 11, 2019 · 5 comments
Comments

burner commented May 11, 2019

Pandas, R and Julia have made data frames very popular. As D is getting more
interest from data scientists (e.g. at eBay or AdRoll), it would be very
beneficial to use one language for the entire data analysis pipeline,
especially considering that D (in contrast to popular languages like Python,
R or Julia) is compiled to native machine code and gets optimized by the
sophisticated LLVM backend.

Minimum requirements:

  • conversion to and from CSV
  • multi-indexing
  • column binary operations, e.g. `column1 * column2`
  • group-by on an arbitrary number of columns
  • column/group aggregations
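
As a rough illustration of two of these requirements (column binary operations and group aggregation), here is a minimal sketch that uses only Phobos. The names `multiplyColumns` and `groupSum` are hypothetical and exist in no library; columns are assumed to be plain `double[]` arrays and group keys plain strings, which a real data frame type would wrap.

```d
import std.algorithm : map;
import std.array : array;
import std.range : zip;

/// Hypothetical helper: element-wise product of two equally long columns.
double[] multiplyColumns(double[] a, double[] b)
{
    assert(a.length == b.length, "columns must have the same length");
    return zip(a, b).map!(t => t[0] * t[1]).array;
}

/// Hypothetical helper: sum of `values` for each distinct key in `keys`.
double[string] groupSum(string[] keys, double[] values)
{
    assert(keys.length == values.length, "one key per value is required");
    double[string] totals;
    foreach (i, key; keys)
    {
        if (auto p = key in totals)
            *p += values[i];        // key already seen: accumulate
        else
            totals[key] = values[i]; // first occurrence of this key
    }
    return totals;
}

unittest
{
    auto price = [2.0, 3.0, 4.0];
    auto quantity = [10.0, 20.0, 30.0];

    // Column binary operation, i.e. `price * quantity`.
    auto revenue = multiplyColumns(price, quantity);
    assert(revenue == [20.0, 60.0, 120.0]);

    // Group-by on a single key column followed by a sum aggregation.
    auto totals = groupSum(["a", "b", "a"], revenue);
    assert(totals["a"] == 140.0);
    assert(totals["b"] == 60.0);
}
```

The unittest can be run with `dmd -unittest -main`. Grouping on several columns could follow the same pattern with a tuple of the key columns as the associative array key.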

burner commented May 11, 2019

This is being worked on by Prateek Nayak during GSoC 2019.

@wilzbach

CC @Kriyszig

Kriyszig commented May 12, 2019

Yes, I will be working on this project.

So far I have contacted the mentors and am exploring ndslice in mir-algorithm, while also looking into displaying the dataframe on the terminal with properly aligned columns. I'm a bit tight on time until this weekend because of final examinations, but after that I'll be working at full capacity to realize the project.
We still need to discuss the structure of the index used to represent multi-indexed dataframes, after which I'll move on to parsing CSV files into dataframes.
At that point the dataframes will support adding multi-indexed data, parsing from files and writing to CSV.
Next I will deal with element access and column binary ops.

I'm mostly looking into Pandas and its implementation of dataframes, because I have worked quite extensively with Python in the past.
I'll update the issue with any and all progress made on the dataframe project.

Laeeth commented May 25, 2019

Interop with pandas via JSON and msgpack might be quite helpful. I have written a streaming msgpack decoder (using msgpack-d) to work with our own simple data frame implementation, and there is some old code for reading and writing to hdf5 too.

9il commented May 26, 2019

Initial support for dataframes has been added to mir-algorithm.
Only allocation and label access are supported for now.

```d
@safe pure unittest
{
    import mir.ndslice.slice;
    import mir.ndslice.allocation: slice;

    import std.datetime.date;

    auto dataframe = slice!(double, Date, string)(4, 3);
    assert(dataframe.length == 4);
    assert(dataframe.length!1 == 3);
    assert(dataframe.elementCount == 4 * 3);

    static assert(is(typeof(dataframe) ==
        Slice!(double*, 2, Contiguous, Date*, string*)));

    // Dataframe labels are contiguous 1-dimensional slices.

    // Fill row labels
    dataframe.label[] = [
        Date(2019, 1, 24),
        Date(2019, 2, 2),
        Date(2019, 2, 4),
        Date(2019, 2, 5),
    ];

    assert(dataframe.label!0[2] == Date(2019, 2, 4));

    // Fill column labels
    dataframe.label!1[] = ["income", "outcome", "balance"];

    assert(dataframe.label!1[2] == "balance");

    // Change label element
    dataframe.label!1[2] = "total";
    assert(dataframe.label!1[2] == "total");

    // Attach a newly allocated label
    dataframe.label!1 = ["Income", "Outcome", "Balance"].sliced;

    assert(dataframe.label!1[2] == "Balance");
}
```

RazvanN7 removed the gsoc19 label on Jul 10, 2023