
Clean ups around DataFrame.read_csv/2 #48

Merged (4 commits, Sep 13, 2021)

Conversation

@losvedir (Contributor) commented Sep 5, 2021

Thanks for the great project! I was excited to play around with it but ran into some issues loading my CSVs. This PR does a few things:

  • Adds tests for the various read_csv options - I like to look at tests to get a feel for using a function, and I didn't see any about loading CSVs or how the options worked. Writing them also helped me realize that three listed options (:null_character, :names, and :dtypes) had issues. I'm not sure of the best place to put these tests: whether they should be for the Polars backend specifically, or doc tests, or something else. I'm happy to move them.
  • Fixes a bug with the null_character option - The Native.df_read_csv function call didn't have its arguments in the same order as the corresponding Rust function.
  • Fixes the comment documentation for the dtypes option, which said it takes a keyword list of column names (as atoms) to dtypes. It didn't specify the dtype format, but I assumed :string, :integer, etc. In fact it takes a list of tuples of column name strings, with the type specified like "str" or "i64". Here I just documented the current behavior correctly. But if you want the dtype to be specified like :string, for example, I could update dataframe.rs to work with atoms and do that. I know Rust somewhat, but not Rustler specifically, and this change seems straightforward enough. From a skim of the docs, though, I'm not entirely sure how to turn the columns, which aren't known at compile time, into atoms.
  • Removes the documentation for the names option which allegedly allows you to rename columns upon reading the CSV, but which appears to not be implemented. From a scan of the polars docs I didn't see any methods provided to handle it.
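For concreteness, the dtypes mismatch described above looks roughly like this (a sketch; the calls are illustrative, not taken from the current API):

```elixir
# As documented (hypothetically): a keyword list of atom column names to atom dtypes.
# DataFrame.read_csv("data.csv", dtypes: [id: :integer, name: :string])

# As actually implemented at the time of this PR: string column names paired
# with polars type strings.
# DataFrame.read_csv("data.csv", dtypes: [{"id", "i64"}, {"name", "str"}])
```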

* Adds tests for the various read_csv options
* Fixes a bug with null_character
* Fixes documentation
@cigrainger (Member) left a comment

Thanks so much for this! I'm wholeheartedly in favour of adding many more tests!

There's actually a function for converting the canonical Explorer dtypes to the types expected by polars. One of the guiding principles in Explorer is that it's an API first and foremost. The polars backend is just one (the first) implementation. And its job is largely to translate the Explorer API into what polars expects.

So with that in mind, I'd like to keep the docs the same and instead fix the implementation :). The place to do that is in lib/explorer/polars_backend/data_frame.ex: https://github.com/elixir-nx/explorer/blob/52cd6cfadd1f9975cb32f2ad2ec8bb651a3e5b8b/lib/explorer/polars_backend/data_frame.ex#L19-L47

There are two things to do there: convert the atoms from the keyword list to strings, and convert the dtypes to the Explorer dtypes. We can achieve both by adding this to the top of that function:

dtypes = Enum.map(dtypes, fn {colname, dtype} -> {Atom.to_string(colname), Shared.normalise_dtype(dtype)} end)

That will require something I should have done in the first place, which is to add the dtype conversions going the other way! You can see them going from Explorer to polars here:
https://github.com/elixir-nx/explorer/blob/52cd6cfadd1f9975cb32f2ad2ec8bb651a3e5b8b/lib/explorer/polars_backend/shared.ex#L49-L57

So we'd just need to go back the other way:

  def normalise_dtype(:integer), do: "i64"
  def normalise_dtype(:float), do: "f64"
  def normalise_dtype(:boolean), do: "bool"
  def normalise_dtype(:string), do: "str"
  def normalise_dtype(:date), do: "date32(days)"
  def normalise_dtype(:datetime), do: "date64(ms)"

WDYT?

@cigrainger (Member) commented Sep 5, 2021

Oh and regarding the names issue, good catch! It's not provided by polars but is a common convenience in other libraries. My intent there (and I thought I had implemented it but clearly not!) was just to optionally use rename after reading in. So that can be in Explorer.PolarsBackend.DataFrame.read_csv/10 as well.

Possibly like this?

def read_csv(...) do
    ...
    case {df, names} do
      {{:ok, df}, nil} -> {:ok, Shared.to_dataframe(df)}
      {{:ok, df}, names} -> {:ok, df |> Shared.to_dataframe() |> rename(names)}
      {{:error, error}, _} -> {:error, error}
    end
end

The only caveat there is that we wouldn't get particularly useful error messages. So perhaps more like:

def read_csv(...) do
    ...
    wrap_read_csv(df, names)
end
 
defp wrap_read_csv({:ok, df}, nil), do: {:ok, Shared.to_dataframe(df)}
defp wrap_read_csv({:error, _} = err, _), do: err
defp wrap_read_csv({:ok, df}, names) do
  df = Shared.to_dataframe(df)
  df =
    if n_cols(df) == length(names),
      do: rename(df, names),
      else:
        raise(
          ArgumentError,
          "Expected length of provided names (#{length(names)}) to match number of columns in dataframe (#{n_cols(df)})."
        )
  {:ok, df}
end

WDYT?

@losvedir (Contributor, Author) commented Sep 6, 2021

Makes sense. I can work on that!

def tmp_csv(contents) do
  :ok = File.write!("tmp_test.csv", contents)
  "tmp_test.csv"
end
@josevalim (Member) commented Sep 6, 2021

It is better to write to a tmp directory. Since this requires Elixir v1.12, we can leverage the tmp_dir tag:

@tag :tmp_dir
test "my test", config do
  csv = tmp_csv(config, """
  ...
  """)

where tmp_csv is:

defp tmp_csv(config, contents) do
  path = Path.join(config.tmp_dir, "tmp.csv")
  :ok = File.write!(path, contents)
  path
end

Member

TIL! I must have missed the tmp_dir tag in all the other stuff to come out with 1.12.

@josevalim (Member)

What happens if you provide both dtypes and names? dtypes already has the names, so should we push towards specifying dtypes only?

@cigrainger btw, I think we should keep the precision of the numbers. For example, when converting to tensors and similar, we will want to know the exact precision and we may also want to do data casts too. Nx uses u8, u16, s8, s16, etc. Should we follow the same convention?

@cigrainger (Member) commented Sep 6, 2021

What happens if you provide both dtypes and names? dtypes already has the names, so should we push towards specifying dtypes only?

Good question. One benefit of having both is that you can build it up programmatically and not have to worry about the order in the file. polars puts both together as a schema. pandas has both (names as an array and dtype as a dict of colname -> dtype). R doesn't have dicts but readr is very similar, taking col_names (a character vector of names, or a boolean which also stands in for whether the file has a header row or not) and a cols specification which is strange but similar to a dict.

And when you provide both, the order of operations is dtypes, then names. That is, dtypes is used during the actual reading and names is applied via a rename after.
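Sketched concretely (illustrative, not verified against the implementation):

```elixir
# dtypes is applied while the CSV is parsed; names triggers a rename afterwards.
# DataFrame.read_csv("data.csv",
#   dtypes: [a: :integer, b: :string],
#   names: ["id", "label"]
# )
# Column "a" is parsed as an integer under its header name, then renamed to "id".
```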

@cigrainger btw, I think we should keep the precision of the numbers. For example, when converting to tensors and similar, we will want to know the exact precision and we may also want to do data casts too. Nx uses u8, u16, s8, s16, etc. Should we follow the same convention?

I avoided it at first for simplicity, but I'd be receptive to it. The main argument against it is that it's not that important in the major df libraries (pandas borrows from numpy but mostly simplifies down to approximately the list I ended up with, and R just has float, int, bool, char, etc.). Going with the most information (64-bit floats and signed ints) means you can cast it down when converting to a tensor, etc. So IMO the main benefit of keeping the precision is memory. That can certainly be a compelling case. I could be missing something (possibly something really obvious).

@josevalim (Member) commented Sep 6, 2021

Thanks, it all sounds good to me! Let’s just make sure to also test when both dtypes and names are given!

@losvedir (Contributor, Author)

Pushed up a few more commits (let me know if you want me to re-arrange or squash or otherwise change the git history here).

I addressed the things you brought up, with some small tweaks:

  • I felt re-using normalise_dtype was a bit weird or confusing: if normalise_dtype("i64") is :integer, it didn't feel right for normalise_dtype(:integer) to be "i64". Naively I'd expect it to stay :integer. To me "normalise" means "put in canonical form", so a value already in that form shouldn't leave it. Considering that :integer is the normalised form, I named the function I added for the other direction internal_from_dtype. What do you think? I can, of course, change it to normalise_dtype if you prefer that.
  • I did basically the suggested wrap_read_csv approach you laid out, but I called the function checked_rename, since that felt clearer to me what it was actually doing.
  • I re-did the tests using @josevalim 's @tag :tmp_dir suggestion (very cool! I also missed that in the Elixir 1.12 news), adding a test for the combination of the dtypes and names options.

One more thing that came up, most noticeable in the dtypes and names test, is the inconsistency of referring to columns sometimes as atoms and sometimes as strings. with_columns and names both take the column names as strings, while dtypes expects a keyword list where they're atoms. Should I add another commit to settle on one or the other? I could also make either work, but that gives me flashbacks to Rails, where controller params keys could be either strings or atoms, which caused a lot of confusion, and which Phoenix rightly (IMO) fixed by standardizing on just strings.
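The inconsistency in question, sketched (illustrative calls, not taken from the docs):

```elixir
# dtypes expects atom keys...
# DataFrame.read_csv(path, dtypes: [a: :integer])
# ...while names and with_columns expect strings:
# DataFrame.read_csv(path, names: ["id"], with_columns: ["id"])
```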

@cigrainger (Member) left a comment

Fantastic, thank you! I like your thinking and reasoning for all of the above.

@cigrainger (Member) commented Sep 13, 2021

One more thing that came up, most noticeable in the dtypes and names test, is the inconsistency of referring to columns sometimes as atoms and sometimes as strings. with_columns and names both take the column names as strings, while dtypes expects a keyword list where they're atoms. Should I add another commit to settle on one or the other? I could also make either work, but that gives me flashbacks to Rails, where controller params keys could be either strings or atoms, which caused a lot of confusion, and which Phoenix rightly (IMO) fixed by standardizing on just strings.

On this, I think the entire API is a bit split on this. I've tried to make both work in many cases where I felt it was more ergonomic, but I'm feeling the same as you that they should be standardised on one or the other. I think the obvious way to do that is to settle on strings. Let's leave this the way it is right now and we can revisit this in a broader context.

@cigrainger cigrainger merged commit a4e5a21 into elixir-explorer:main Sep 13, 2021