Add Struct json_decode/3 for decoding json from string #841

lkarthee · 2024-01-29T17:25:47Z

iex(1)> require Explorer.DataFrame, as: DF
Explorer.DataFrame
iex(2)> d = DF.new([%{a: "{\"n\": 1}"}])
#Explorer.DataFrame<
  Polars[1 x 1]
  a string ["{\"n\": 1}"]
>
iex(3)> DF.mutate(d, aj: json_decode(a))
#Explorer.DataFrame<
  Polars[1 x 2]
  a string ["{\"n\": 1}"]
  aj struct[1] [%{"n" => 1}]
>

josevalim

What happens if the string is not valid JSON?

lkarthee · 2024-01-29T17:38:01Z

Currently I get error unless I pass dtype: {:struct, %{"n" => {:s, 64}}} - due to out_df and check_df mismatch. When dtype is not passed, it is equivalent of from_list() scenario.

Please advise on how to handle this:

Should we accept the check_df.dtypes returned by polars in case of :df_mutate_with_exprs ?
Should we narrow down check to don't worry about {:struct, _} when out_df.dtypes != check_df.dtypes and accept check_df if everything else matches ?

** (RuntimeError) DataFrame mismatch.

expected:

    names: ["a", "aj"]
    dtypes: %{"a" => :string, "aj" => nil}

got:

    names: ["a", "aj"]
    dtypes: %{"a" => :string, "aj" => {:struct, %{"n" => {:s, 64}}}}

    (explorer 0.9.0-dev) lib/explorer/polars_backend/shared.ex:62: Explorer.PolarsBackend.Shared.apply_dataframe/4
    iex:3: (file)

explorer/lib/explorer/polars_backend/shared.ex

Lines 42 to 79 in 8b03726

    
             def apply_dataframe(%DataFrame{} = df, %DataFrame{} = out_df, fun, args) do 
        
               case apply(Native, fun, [df.data | args]) do 
        
                 {:ok, %module{} = new_df} when module in @polars_df -> 
        
                   if @check_frames do 
        
                     # We need to collect here, because the lazy frame may not have 
        
                     # the full picture of the result yet. 
        
                     check_df = 
        
                       if match?(%PolarsLazyFrame{}, new_df) do 
        
                         {:ok, new_df} = Native.lf_collect(new_df) 
        
                         create_dataframe(new_df) 
        
                       else 
        
                         create_dataframe(new_df) 
        
                       end 
        
                     if Enum.sort(out_df.names) != Enum.sort(check_df.names) or 
        
                          out_df.dtypes != check_df.dtypes do 
        
                       raise """ 
        
                       DataFrame mismatch. 
        
                       expected: 
        
                           names: #{inspect(out_df.names)} 
        
                           dtypes: #{inspect(out_df.dtypes)} 
        
                       got: 
        
                           names: #{inspect(check_df.names)} 
        
                           dtypes: #{inspect(check_df.dtypes)} 
        
                       """ 
        
                     end 
        
                   end 
        
                   %{out_df | data: new_df} 
        
                 {:error, error} -> 
        
                   raise runtime_error(error) 
        
               end 
        
             end

lkarthee · 2024-01-29T17:41:11Z

What happens if the string is not valid JSON?

Raises a polars error.

iex(6)> d = DF.new([%{a: "{\"n\": 1"}])
iex(7)> DF.mutate(d, aj: json_decode(a, dtype: {:struct, %{"n" => {:s, 64}}}))
** (RuntimeError) Polars Error: error deserializing JSON: json parsing error: 'ExpectedObjectContent at character 8 (']')'
    (explorer 0.9.0-dev) lib/explorer/polars_backend/shared.ex:81: Explorer.PolarsBackend.Shared.apply_dataframe/4
    iex:7: (file)

josevalim · 2024-01-29T20:41:29Z

I would start with the simplest API possible:

no options
document it errors on invalid json

Update expressions.rs add tests Update series.ex

lkarthee · 2024-01-30T04:21:18Z

native/explorer/Cargo.toml

@@ -36,6 +36,9 @@ object_store = { version = "0.8", default-features = false, optional = true }
 [target.'cfg(not(any(all(windows, target_env = "gnu"), all(target_os = "linux", target_env = "musl"))))'.dependencies]
 mimalloc = { version = "*", default-features = false }

+[patch.crates-io]


context here - pola-rs/polars#14008

lib/explorer/polars_backend/shared.ex

lkarthee · 2024-01-31T04:31:00Z

lib/explorer/polars_backend/expression.ex

@@ -172,6 +173,11 @@ defmodule Explorer.PolarsBackend.Expression do
    raise ArgumentError, "missing #{inspect(__MODULE__)} nodes: #{inspect(missing)}"
  end

+  def to_expr(%LazySeries{op: :from_json, args: _}) do


Suggested change

def to_expr(%LazySeries{op: :from_json, args: _}) do

def to_expr(%LazySeries{op: :from_json}) do

@josevalim If i drop args: _ i get a lot of warnings even though all tests pass. Why does this happen ?

Log containing warnings

09:59 $ mix ci Finished dev [unoptimized + debuginfo] target(s) in 0.42s Compiling 3 files (.ex) warning: this clause cannot match because a previous clause at line 265 always matches │ 314 │ def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do │ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ │ └─ lib/explorer/polars_backend/expression.ex:314 warning: this clause cannot match because a previous clause at line 265 always matches │ 314 │ def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do │ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ │ └─ lib/explorer/polars_backend/expression.ex:314 warning: this clause cannot match because a previous clause at line 265 always matches │ 314 │ def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do │ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ │ └─ lib/explorer/polars_backend/expression.ex:314 warning: this clause cannot match because a previous clause at line 265 always matches │ 314 │ def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do │ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ │ └─ lib/explorer/polars_backend/expression.ex:314 warning: this clause cannot match because a previous clause at line 265 always matches │ 314 │ def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do │ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ │ └─ lib/explorer/polars_backend/expression.ex:314 warning: this clause cannot match because a previous clause at line 265 always matches │ 314 │ def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do │ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ │ └─ lib/explorer/polars_backend/expression.ex:314 warning: this clause cannot match because a previous clause at line 265 always matches │ 314 │ def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do │ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ │ └─ lib/explorer/polars_backend/expression.ex:314 warning: this clause cannot match because a previous clause at line 265 always matches │ 314 │ def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do │ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ │ └─ lib/explorer/polars_backend/expression.ex:314 warning: this clause cannot match because a previous clause at line 265 always matches │ 314 │ def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do │ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ │ └─ lib/explorer/polars_backend/expression.ex:314 warning: this clause cannot match because a previous clause at line 265 always matches │ 314 │ def to_expr(%LazySeries{op: unquote(op), args: unquote(args)}) do │ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ │ └─ lib/explorer/polars_backend/expression.ex:314 Compiling crate explorer in debug mode (native/explorer) Compiling explorer v0.1.0 (explorer/native/explorer) Finished dev [unoptimized + debuginfo] target(s) in 5.26s Excluding tags: [:cloud_integration] .................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................*................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................ Finished in 2.8 seconds (2.8s async, 0.00s sync) 582 doctests, 1390 tests, 0 failures, 18 excluded, 1 skipped </details>

josevalim · 2024-01-31T08:44:36Z

I think supporting option 2 is a low hanging fruit:

Unfortunately I think this is inconsistent. All of our from_* operations work on data types outside of Series and DataFrame. So we would be breaking a rule here.

Therefore, if we introduce from_json (or read_json), the argument must be either a file or a string. Ideally we would support it in the dataframe, because all of our from_json (or read_json) functions are defined in the dataframe. However, from_json supports files and URLs, so it is quite some work, but it would be consistent with everything else including polars itself: https://docs.pola.rs/py-polars/html/reference/api/polars.read_json.html

So, after sleeping on it, I think we should first add read_json/write_json, and then from_json and to_json. If it is too much work, let us know, and @philss can pick it up soon (he pretty much wrote all file integration stuff :D).

lkarthee · 2024-01-31T14:01:46Z

@josevalim I can contribute an hour a day for few more months. Now that I am familiar explorer/polars - it does not take much time for implementing things from polars. Maintainers can feel free to tag me or assign PRs/bugs to me - i will try my best to resolve them.

I am confused about your comments and from_json - feel free to correct me.

Json Decode

example_file.csv

active,ask_code,ask_id,ask_price,code,data,expiry,id,index,inserted_at,item_id,level,project_id,properties,state,type,updated_at,user_id,value
True,,,,pull,{},1970-01-01T00:00:00Z,752e072c-9907-4fd0-9276-8fd30f7d1d1c,{},2023-02-24 15:41:57+00:00,e71a8d18-16c0-4efa-8442-dc17344ae5f8,1,890d0555-f266-42b2-a1ba-66925d8d47f6,"{'r_total': 0, 'u_total': 1, 'xp_total': 1}",ok,durable,2023-03-20 09:09:44+00:00,5755aa4d-d91e-4ab4-830e-791996f41a99,1

Description about data:

json_decode is useful in scenario where data frame is already loaded from a non json file like from .csv
Not all columns are json, rather some columns are json columns - rest are primitive types.

Possible Operations:

convert those json columns to struct pl.col('data').str.json_decode() .
extract data from a json path using pl.col('data').str.json_path_match('$. xp_total').alias('xp') .

Problems with having Series.json_decode as DataFrame operation:

we don't know dtype in advance, hence we can't use DF.mutate as apply_dataframe validates out_df.types == check_df.dtypes

Solution:

implement Series.json_decode as a series operation which wont work as data frame operation/expression.
raise error suggesting user to use Series.json_decode/2 rather than json_decode/2 in data frame mutate.

`from_json`

This is a complete different operation from json_decode:

from_json whole file is json
from_ndjson every line is json
json_decode one or more variables/columns are json. there might be columns which are just primitives, example csv file

example.json

{
  "columns": [
    {"name": "integer",  "datatype": "Int64",  "bit_settings": "",  "values": [1,  2,  3 ]},
    {"name": "float", "datatype": "Float64", "bit_settings": "",  "values": [4, 5, 6 ]},
    {"name": "string",  "datatype": "String", "bit_settings": "", "values": ["a", "b", "c"] }
  ]
}

df = pl.DataFrame(
    {
        "integer": [1, 2, 3],
        "float": [4.0, 5.0, 6.0],
        "string": ["a", "b", "c"],
    }
)

df.write_json("output.json")
df_json = pl.read_json("output.json")
print(df_json)

josevalim · 2024-01-31T14:13:15Z

Thank you for the context. I see what you mean and I don't have a good answer for it.

The best I can think of is to call it "json_decode" but always expect the "schema"/"dtype" to be given, so we don't have to guess.

josevalim · 2024-01-31T14:27:48Z

We could but I would like to avoid "branching" for now, if that makes any sense.

lkarthee · 2024-01-31T14:34:43Z

We could but I would like to avoid "branching" for now, if that makes any sense.

Jose reply is to my question - whether we could have Series only op that way we could have avoid passing dtype?

I felt my question was partly answered in his previous reply and deleted it (without seeing him answer it). Posting it again after seeing Jose's reply.

lkarthee · 2024-01-31T15:23:59Z

I removed infer_schema_length, we don't need it as we are passing dtype.

lib/explorer/polars_backend/expression.ex

lib/explorer/backend/lazy_series.ex

lib/explorer/series.ex

test/explorer/series_test.exs

lib/explorer/series.ex

josevalim · 2024-02-01T11:29:02Z

Thank you! It was a bit bumpy but I am glad with where we arrived!

josevalim · 2024-02-01T11:29:08Z

💚 💙 💜 💛 ❤️

lkarthee · 2024-02-01T11:35:29Z

Thank you! It was a bit bumpy but I am glad with where we arrived!

Thank you Jose, I don't mind bumpy rides.

lkarthee added 2 commits January 29, 2024 21:43

Add json_decode/3 for decoding string to struct

edcd02b

Merge branch 'main' into struct_json_decode

a3d30b3

josevalim reviewed Jan 29, 2024

View reviewed changes

lkarthee force-pushed the struct_json_decode branch from 3a83583 to 846eff0 Compare January 30, 2024 04:02

tests, document error

f48624d

Update expressions.rs add tests Update series.ex

lkarthee force-pushed the struct_json_decode branch from 846eff0 to f48624d Compare January 30, 2024 04:05

lkarthee commented Jan 30, 2024

View reviewed changes

lkarthee force-pushed the struct_json_decode branch from 3482d27 to 24c4f82 Compare January 31, 2024 04:16

lkarthee commented Jan 31, 2024

View reviewed changes

add tests

33a31a7

lkarthee force-pushed the struct_json_decode branch from 24c4f82 to 33a31a7 Compare January 31, 2024 15:16