
Clean ups around DataFrame.read_csv/2 #48

Merged (4 commits, Sep 13, 2021)

Conversation

@losvedir (Contributor) commented Sep 5, 2021

Thanks for the great project! I was excited to play around with it but ran into some issues loading my CSVs. This PR does a few things:

  • Adds tests for the various read_csv options - I like to look at tests to get a feel for using a function, and I didn't see any about loading CSVs or how the options worked. Writing them also helped me realize that three listed options (:null_character, :names, and :dtypes) had issues. I'm not sure of the best place to put these tests: whether they should be for the Polars backend specifically, or doc tests, or something else. I'm happy to move them.
  • Fixes a bug with the null_character option - The Native.df_read_csv function call didn't have its arguments in the same order as the corresponding Rust function.
  • Fixes the comment documentation for the dtypes option, which said it takes a keyword list of column names (as atoms) to dtypes. It didn't specify the dtype format, but I assumed :string, :integer, etc. In fact it takes a list of tuples of column name strings, with the type specified like "str" or "i64". Here I just documented the current behavior correctly. But if you want the dtype to be specified like :string, for example, I could update dataframe.rs to work with atoms and do that. I know Rust somewhat, but not Rustler specifically, and this change seems straightforward enough. From a skim of the docs, though, I'm not entirely sure how to turn the columns, which aren't known at compile time, into atoms.
  • Removes the documentation for the names option which allegedly allows you to rename columns upon reading the CSV, but which appears to not be implemented. From a scan of the polars docs I didn't see any methods provided to handle it.
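For concreteness, the dtypes mismatch described above looks roughly like this (a sketch; the calls are illustrative, not taken from the current API):

```elixir
# As documented (hypothetically): a keyword list of atom column names to atom dtypes.
# DataFrame.read_csv("data.csv", dtypes: [id: :integer, name: :string])

# As actually implemented at the time of this PR: string column names paired
# with polars type strings.
# DataFrame.read_csv("data.csv", dtypes: [{"id", "i64"}, {"name", "str"}])
```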

* Adds tests for the various read_csv options
* Fixes a bug with null_character
* Fixes documentation
@cigrainger (Member) left a comment

Thanks so much for this! I'm wholeheartedly in favour of adding many more tests!

There's actually a function for converting the canonical Explorer dtypes to the types expected by polars. One of the guiding principles in Explorer is that it's an API first and foremost. The polars backend is just one (the first) implementation. And its job is largely to translate the Explorer API into what polars expects.

So with that in mind, I'd like to keep the docs the same and instead fix the implementation :). The place to do that is in lib/explorer/polars_backend/data_frame.ex: https://github.com/elixir-nx/explorer/blob/52cd6cfadd1f9975cb32f2ad2ec8bb651a3e5b8b/lib/explorer/polars_backend/data_frame.ex#L19-L47

There are two things to do there: convert the atoms from the keyword list to strings, and convert the dtypes to the Explorer dtypes. We can achieve both by adding this to the top of that function:

dtypes = Enum.map(dtypes, fn {colname, dtype} -> {Atom.to_string(colname), Shared.normalise_dtype(dtype)} end)

That will require something I should have done in the first place, which is to add the dtype conversions going the other way! You can see them going from Explorer to polars here:
https://github.com/elixir-nx/explorer/blob/52cd6cfadd1f9975cb32f2ad2ec8bb651a3e5b8b/lib/explorer/polars_backend/shared.ex#L49-L57

So we'd just need to go back the other way:

  def normalise_dtype(:integer), do: "i64"
  def normalise_dtype(:float), do: "f64"
  def normalise_dtype(:boolean), do: "bool"
  def normalise_dtype(:string), do: "str"
  def normalise_dtype(:date), do: "date32(days)"
  def normalise_dtype(:datetime), do: "date64(ms)"

WDYT?

@cigrainger (Member) commented Sep 5, 2021

Oh and regarding the names issue, good catch! It's not provided by polars but is a common convenience in other libraries. My intent there (and I thought I had implemented it but clearly not!) was just to optionally use rename after reading in. So that can be in Explorer.PolarsBackend.DataFrame.read_csv/10 as well.

Possibly like this?

def read_csv(...) do
    ...
    case {df, names} do
      {{:ok, df}, nil} -> {:ok, Shared.to_dataframe(df)}
      {{:ok, df}, names} -> {:ok, df |> Shared.to_dataframe() |> rename(names)}
      {{:error, error}, _} -> {:error, error}
    end
end

The only caveat there is that we wouldn't get particularly useful error messages. So perhaps more like:

def read_csv(...) do
    ...
    wrap_read_csv(df, names)
end
 
defp wrap_read_csv({:ok, df}, nil), do: {:ok, Shared.to_dataframe(df)}
defp wrap_read_csv({:error, _} = err, _), do: err
defp wrap_read_csv({:ok, df}, names) do
  df = Shared.to_dataframe(df)
  df =
    if n_cols(df) == length(names),
      do: rename(df, names),
      else:
        raise(
          ArgumentError,
          "Expected length of provided names (#{length(names)}) to match number of columns in dataframe (#{n_cols(df)})."
        )
  {:ok, df}
end

WDYT?

@losvedir (Contributor, Author) commented Sep 6, 2021

Makes sense. I can work on that!

def tmp_csv(contents) do
  :ok = File.write!("tmp_test.csv", contents)
  "tmp_test.csv"
end
@josevalim (Member) commented Sep 6, 2021

It is better to write to a tmp directory. Since this requires Elixir v1.12, we can leverage the tmp_dir tag:

@tag :tmp_dir
test "my test", config do
  csv = tmp_csv(config, """
  ...
  """)

where tmp_csv is:

defp tmp_csv(config, contents) do
  path = Path.join(config.tmp_dir, "tmp.csv")
  :ok = File.write!(path, contents)
  path
end

Member

TIL! I must have missed the tmp_dir tag in all the other stuff to come out with 1.12.

@josevalim (Member)

What happens if you provide both dtypes and names? dtypes already has the names, so should we push towards specifying dtypes only?

@cigrainger btw, I think we should keep the precision of the numbers. For example, when converting to tensors and similar, we will want to know the exact precision and we may also want to do data casts too. Nx uses u8, u16, s8, s16, etc. Should we follow the same convention?

@cigrainger (Member) commented Sep 6, 2021

What happens if you provide both dtypes and names? dtypes already has the names, so should we push towards specifying dtypes only?

Good question. One benefit of having both is that you can build it up programmatically and not have to worry about the order in the file. polars puts both together as a schema. pandas has both (names as an array and dtype as a dict of colname -> dtype). R doesn't have dicts but readr is very similar, taking col_names (a character vector of names, or a boolean which also stands in for whether the file has a header row or not) and a cols specification which is strange but similar to a dict.

And when you provide both, the order of operations is dtypes, then names. That is, dtypes is used during the actual reading and names is applied via a rename after.
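Sketched concretely (illustrative, not verified against the implementation):

```elixir
# dtypes is applied while the CSV is parsed; names triggers a rename afterwards.
# DataFrame.read_csv("data.csv",
#   dtypes: [a: :integer, b: :string],
#   names: ["id", "label"]
# )
# Column "a" is parsed as an integer under its header name, then renamed to "id".
```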

@cigrainger btw, I think we should keep the precision of the numbers. For example, when converting to tensors and similar, we will want to know the exact precision and we may also want to do data casts too. Nx uses u8, u16, s8, s16, etc. Should we follow the same convention?

I avoided it at first for simplicity, but I'd be receptive to it. The main argument against it is that it's not that important in the major df libraries (pandas borrows from numpy but mostly simplifies down to approximately the list I ended up with, and R just has float, int, bool, char, etc.). Going with the most information (64-bit floats and signed ints) means you can cast it down when converting to a tensor, etc. So IMO the main benefit of keeping the precision is memory. That can certainly be a compelling case. I could be missing something (possibly something really obvious).

@josevalim (Member) commented Sep 6, 2021

Thanks, it all sounds good to me! Let’s just make sure to also test when both dtypes and names are given!

@losvedir (Contributor, Author)

Pushed up a few more commits (let me know if you want me to re-arrange or squash or otherwise change the git history here).

I addressed the things you brought up, with some small tweaks:

  • I felt re-using normalise_dtype was a bit weird or confusing: if normalise_dtype("i64") is :integer, it didn't feel right for normalise_dtype(:integer) to be "i64". Naively I'd expect it to stay :integer. To me "normalise" means "put in canonical form", so a value already in that form shouldn't leave it. Considering that :integer is the normalised form, I named the function I added for the other direction internal_from_dtype. What do you think? I can, of course, change it to normalise_dtype if you prefer that.
  • I did basically the suggested wrap_read_csv approach you laid out, but I called the function checked_rename, since that felt clearer to me what it was actually doing.
  • I re-did the tests using @josevalim 's @tag :tmp_dir suggestion (very cool! I also missed that in the Elixir 1.12 news), adding a test for the combination of the dtypes and names options.

One more thing that came up, most noticeable in the dtypes and names test, is the inconsistency of referring to columns sometimes as atoms and sometimes as strings. with_columns and names both take the column names as strings, while dtypes expects a keyword list where they're atoms. Should I add another commit to settle on one or the other? I could also make either work, but that gives me flashbacks to Rails, where controller params keys could be either strings or atoms, which caused a lot of confusion, and which Phoenix rightly (IMO) fixed by standardizing on just strings.
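The inconsistency in question, sketched (illustrative calls, not taken from the docs):

```elixir
# dtypes expects atom keys...
# DataFrame.read_csv(path, dtypes: [a: :integer])
# ...while names and with_columns expect strings:
# DataFrame.read_csv(path, names: ["id"], with_columns: ["id"])
```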

@cigrainger (Member) left a comment

Fantastic, thank you! I like your thinking and reasoning for all of the above.

@cigrainger (Member) commented Sep 13, 2021

One more thing that came up, most noticeable in the dtypes and names test, is the inconsistency of referring to columns sometimes as atoms and sometimes as strings. with_columns and names both take the column names as strings, while dtypes expects a keyword list where they're atoms. Should I add another commit to settle on one or the other? I could also make either work, but that gives me flashbacks to Rails, where controller params keys could be either strings or atoms, which caused a lot of confusion, and which Phoenix rightly (IMO) fixed by standardizing on just strings.

On this, I think the entire API is a bit split on this. I've tried to make both work in many cases where I felt it was more ergonomic, but I'm feeling the same as you that they should be standardised on one or the other. I think the obvious way to do that is to settle on strings. Let's leave this the way it is right now and we can revisit this in a broader context.

@cigrainger cigrainger merged commit a4e5a21 into elixir-explorer:main Sep 13, 2021