[FR] join 2 CSV files #93

Minyus · 2022-12-10T11:02:42Z

Hi @dcmoura , I see examples to join a JSON file in the document, would joining 2 CSV files be supported?
https://spyql.readthedocs.io/en/latest/recipes.html?highlight=join#equi-joins

dcmoura · 2022-12-10T12:36:30Z

Right now we only support joining with a JSON file. The main data source can be in any data format, but the secondary data source must be JSON. You can translate a CSV to JSON, though. Here is an example: https://danielcmoura.com/blog/2022/spyql-cell-towers/

dcmoura · 2022-12-10T12:39:02Z

Proper JOINs are in the roadmap but it might take months until getting there, unless someone steps in.

Minyus · 2022-12-11T03:15:09Z

Thank you, @dcmoura !
Proper JOINs would not be easy from scratch, but I would suggest using pandas like this:

    import pandas as pd

    left_df = pd.read_csv(left_file, dtype=str, keep_default_na=False)
    right_df = pd.read_csv(right_file, dtype=str, keep_default_na=False)
    df = left_df.merge(right_df, **kwargs)
    df.to_csv(out_file, index=False)

pandas does not need to be a requirement of spyql.
An error can be returned if join feature is used but pandas is not installed.

Pipe feature (with "spy" output) for pandas dataframes can be omitted if difficult (in the first release).

Minyus · 2022-12-11T03:28:59Z

Currently, I use pandas to join CSV files, which means I need to prepare and manage Python scripts.
If spyql supports joining CSV files, I can save more time.

dcmoura · 2022-12-11T09:39:59Z

That is good to know, thank you for your feedback! Let me think on it.

dcmoura · 2022-12-13T11:24:31Z

Thank you, @dcmoura ! Proper JOINs would not be easy from scratch, but I would suggest using pandas like this:

Using pandas goes against the principles of SPyQL. Pandas adds a very large overhead, and loads everything into memory. Still, thank you for your suggestion.

In the meanwhile, I suggest you use something like the following. I am showing how can we JOIN two csv files using spyql.

$ cat example1.csv
id, name, age
1, Ana, 26
2, Jane, 31
3, Richard, 42
4, Samuel, 23

$ cat example2.csv
date, ammount, user_id
2022-02-01, 100.0, 3
2022-03-05, 25.1, 1
2022-03-15, 93.2, 1
2022-04-01, 50.0, 2

example1.csv is a list of users. We want to add user info to example2.csv. So, we will convert example1.csv to JSON and then use it to JOIN with example2.csv.

$ spyql "SELECT dict_agg(id, .) AS json FROM csv('example1.csv') TO json" > example1.json

$ spyql -Jusers=example1.json "SELECT *, users[user_id].name AS user_name, users[user_id].age AS user_age FROM csv('example2.csv') TO pretty"
date          ammount    user_id  user_name      user_age
----------  ---------  ---------  -----------  ----------
2022-02-01      100            3  Richard              42
2022-03-05       25.1          1  Ana                  26
2022-03-15       93.2          1  Ana                  26
2022-04-01       50            2  Jane                 31

Hope this helps.

Minyus · 2023-01-08T06:53:17Z

Thank you for your suggestion, @dcmoura !

Hmmm, unfortunately, that seems a bit complicated to me. Hope syntax similar to SQL can be used, as supported by clickhouse local as follows:

./clickhouse local -q "SELECT u.full_name, h.text FROM file('hackernews.csv', CSVWithNames) h \
JOIN file('users.tsv', TSVWithNames) u ON (u.username = h.by) WHERE NOT empty(text) AND length(text) < 50"

References:
https://clickhouse.com/blog/extracting-converting-querying-local-files-with-sql-clickhouse-local
https://clickhouse.com/docs/en/operations/utilities/clickhouse-local/

How about modin as an optional pandas alternative? modin should be able to run pandas code (without modifying). modin claims: "Modin is a drop-in replacement for pandas. ... Modin works especially well on larger datasets, where pandas becomes painfully slow or runs out of memory."

Install

pip install modin

Python code:

try:
    import modin.pandas as pd
except Exception:
    import pandas as pd

dcmoura · 2023-01-10T15:51:39Z

Hmmm, unfortunately, that seems a bit complicated to me. Hope syntax similar to SQL can be used, as supported by clickhouse local

Right now supporting JOINs (as in the SQL syntax) is not our top priority, but I hope we get there soon enough. Out of curiosity @Minyus, is there any particular advantage of SPyQL over clickhouse local for your use case?

How about modin as an optional pandas alternative? modin should be able to run pandas code (without modifying).

Thank you again for your suggestion, but we try to keep our list of dependencies as short as possible. And I would not put a core feature (such as a JOIN) depending on an optional package. The JOIN most probably will have to be implemented from scratch.

Minyus · 2023-01-12T15:50:50Z

Advantages of SPyQL over clickhouse local for my use case are:

SpyQL can be installed easily by pip
SpyQL supports Python objects/functions (e.g. pathlib.Path)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FR] join 2 CSV files #93

[FR] join 2 CSV files #93

Minyus commented Dec 10, 2022

dcmoura commented Dec 10, 2022

dcmoura commented Dec 10, 2022

Minyus commented Dec 11, 2022

Minyus commented Dec 11, 2022

dcmoura commented Dec 11, 2022

dcmoura commented Dec 13, 2022

Minyus commented Jan 8, 2023

dcmoura commented Jan 10, 2023

Minyus commented Jan 12, 2023

[FR] join 2 CSV files #93

[FR] join 2 CSV files #93

Comments

Minyus commented Dec 10, 2022

dcmoura commented Dec 10, 2022

dcmoura commented Dec 10, 2022

Minyus commented Dec 11, 2022

Minyus commented Dec 11, 2022

dcmoura commented Dec 11, 2022

dcmoura commented Dec 13, 2022

Minyus commented Jan 8, 2023

dcmoura commented Jan 10, 2023

Minyus commented Jan 12, 2023