Clean ups around DataFrame.read_csv/2 #48
Conversation
* Adds tests for the various `read_csv` options
* Fixes a bug with `null_character`
* Fixes documentation
Thanks so much for this! I'm wholeheartedly in favour of adding many more tests!
There's actually a function for converting the canonical Explorer dtypes to the types expected by polars. One of the guiding principles in Explorer is that it's an API first and foremost. The polars backend is just one (the first) implementation, and its job is largely to translate the Explorer API into what polars expects.

So with that in mind, I'd like to keep the docs the same and instead fix the implementation :). The place to do that is in `lib/explorer/polars_backend/data_frame.ex`: https://github.com/elixir-nx/explorer/blob/52cd6cfadd1f9975cb32f2ad2ec8bb651a3e5b8b/lib/explorer/polars_backend/data_frame.ex#L19-L47
There are two things to do there: convert the atoms from the keyword list to strings, and convert the Explorer dtypes to the polars dtypes. We can achieve both by adding this to the top of that function:

```elixir
dtypes =
  Enum.map(dtypes, fn {colname, dtype} ->
    {Atom.to_string(colname), Shared.normalise_dtype(dtype)}
  end)
```
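For instance, that mapping turns the user-facing keyword list into the list of string tuples polars expects. A minimal standalone sketch (the column names are made up, and `normalise_dtype` is stubbed as an anonymous function here since the real one lives in `Shared`):

```elixir
# Hypothetical :dtypes option as a user would pass it to the Explorer API.
dtypes = [id: :integer, name: :string]

# Stand-in for Shared.normalise_dtype/1, mapping Explorer dtypes to polars dtype strings.
normalise = fn
  :integer -> "i64"
  :string -> "str"
end

Enum.map(dtypes, fn {colname, dtype} -> {Atom.to_string(colname), normalise.(dtype)} end)
# => [{"id", "i64"}, {"name", "str"}]
```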
That will require something I should have done in the first place, which is to add the dtype conversions going the other way! You can see them going from polars to Explorer here:

https://github.com/elixir-nx/explorer/blob/52cd6cfadd1f9975cb32f2ad2ec8bb651a3e5b8b/lib/explorer/polars_backend/shared.ex#L49-L57
So we'd just need to go back the other way:
```elixir
def normalise_dtype(:integer), do: "i64"
def normalise_dtype(:float), do: "f64"
def normalise_dtype(:boolean), do: "bool"
def normalise_dtype(:string), do: "str"
def normalise_dtype(:date), do: "date32(days)"
def normalise_dtype(:datetime), do: "date64(ms)"
```
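With both sets of clauses in place the conversion is a simple round trip. A self-contained sketch of the idea (a toy module with made-up function names, since the real clauses live in `Explorer.PolarsBackend.Shared`, and only two dtypes shown):

```elixir
defmodule DtypeRoundTrip do
  # polars dtype string -> Explorer dtype (the existing direction)
  def to_explorer("i64"), do: :integer
  def to_explorer("str"), do: :string

  # Explorer dtype -> polars dtype string (the new clauses above)
  def to_polars(:integer), do: "i64"
  def to_polars(:string), do: "str"
end

:integer |> DtypeRoundTrip.to_polars() |> DtypeRoundTrip.to_explorer()
# => :integer
```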
WDYT?
Oh, and regarding the `names` option: possibly like this?

```elixir
def read_csv(...) do
  ...

  case {df, names} do
    {{:ok, df}, nil} -> {:ok, Shared.to_dataframe(df)}
    {{:ok, df}, names} -> {:ok, df |> Shared.to_dataframe() |> rename(names)}
    {{:error, error}, _} -> {:error, error}
  end
end
```

The only caveat there is that we wouldn't get particularly useful error messages. So perhaps more like:

```elixir
def read_csv(...) do
  ...
  wrap_read_csv(df, names)
end

defp wrap_read_csv({:ok, df}, nil), do: {:ok, Shared.to_dataframe(df)}
defp wrap_read_csv({:error, _} = err, _), do: err

defp wrap_read_csv({:ok, df}, names) do
  df = Shared.to_dataframe(df)

  df =
    if n_cols(df) == length(names) do
      rename(df, names)
    else
      raise ArgumentError,
            "Expected length of provided names (#{length(names)}) to match " <>
              "number of columns in dataframe (#{n_cols(df)})."
    end

  {:ok, df}
end
```

WDYT?
Makes sense. I can work on that!
```elixir
def tmp_csv(contents) do
  :ok = File.write!("tmp_test.csv", contents)
  "tmp_test.csv"
end
```
It is better to write to a tmp directory. Since this requires Elixir v1.12, we can leverage the `tmp_dir` tag:
```elixir
@tag :tmp_dir
test "my test", config do
  csv = tmp_csv(config, """
  ...
  """)
```
where `tmp_csv` is:
```elixir
defp tmp_csv(config, contents) do
  path = Path.join(config.tmp_dir, "tmp.csv")
  :ok = File.write!(path, contents)
  path
end
```
TIL! I must have missed the `tmp_dir` tag in all the other stuff to come out with 1.12.
What happens if you provide both dtypes and names? dtypes already have the names, so should we push towards specifying dtypes only?

@cigrainger btw, I think we should keep the precision of the numbers. For example, when converting to tensors and similar, we will want to know the exact precision and we may also want to do data casts too. Nx uses u8, u16, s8, s16, etc. Should we follow the same convention?
Good question. One benefit of having both is that you can build it up programmatically and not have to worry about the order in the file. And when you provide both, the order of operations is dtypes, then names. That is,
I avoided it at first for simplicity but I'd be receptive to it. The main argument against it is that it's not that important in the major df libraries (
Thanks, it all sounds good to me! Let's just make sure to also test when both dtypes and names are given!
Pushed up a few more commits (let me know if you want me to re-arrange or squash or otherwise change the git history here). I addressed the things you brought up, with some small tweaks:
One more thing that came up, most noticeable in the
Fantastic, thank you! I like your thinking and reasoning for all of the above.
On this: I think the entire API is a bit split. I've tried to make both work in many cases where I felt it was more ergonomic, but I'm feeling the same as you that they should be standardised on one or the other. I think the obvious way to do that is to settle on strings. Let's leave this the way it is right now and we can revisit this in a broader context.
Thanks for the great project! I was excited to play around with it but ran into some issues loading my CSVs. This PR does a few things:

* Adds tests for the various `read_csv` options. The three options (`:null_character`, `:names`, and `:dtypes`) had issues. Not sure the best place to put these tests; if they should be for the Polars backend specifically, or doc tests, or something. I'm happy to move them.
* Fixes a bug with the `null_character` option - the `Native.df_read_csv` function call didn't have the arguments in the same order as the corresponding Rust function.
* Fixes documentation for the `dtypes` option, which is said to take a keyword list of column name (as atom) to dtype. It doesn't specify the dtype format, but I assumed `:string`, `:integer`, etc. But it actually takes a list of tuples of column strings, and the type is specified like `"str"` or `"i64"`. Here I just documented the current behavior correctly. But if you want the `dtype` to be specified like `:string`, for example, I could update `dataframe.rs` to work with atoms and do that. I know Rust somewhat, but not Rustler specifically, and this change seems straightforward enough. But from a skim of the docs I'm not entirely sure how to make the columns, which aren't known at compile time, atoms.
* Fixes documentation for the `names` option, which allegedly allows you to rename columns upon reading the CSV, but which appears to not be implemented. From a scan of the polars docs I didn't see any methods provided to handle it.