[FR] join 2 CSV files #93
Comments
Right now we only support joining with a JSON file. The main data source can be in any data format, but the secondary data source must be JSON. You can translate a CSV to JSON, though. Here is an example: https://danielcmoura.com/blog/2022/spyql-cell-towers/ |
Proper JOINs are on the roadmap, but it might take months to get there unless someone steps in. |
Thank you, @dcmoura !
import pandas as pd
left_df = pd.read_csv(left_file, dtype=str, keep_default_na=False)
right_df = pd.read_csv(right_file, dtype=str, keep_default_na=False)
df = left_df.merge(right_df, **kwargs)
df.to_csv(out_file, index=False)
pandas does not need to be a requirement of spyql. Pipe feature (with "spy" output) for pandas dataframes can be omitted if difficult (in the first release). |
Currently, I use pandas to join CSV files, which means I need to prepare and manage Python scripts. |
That is good to know, thank you for your feedback! Let me think on it. |
Using pandas goes against the principles of SPyQL. Pandas adds a very large overhead and loads everything into memory. Still, thank you for your suggestion. In the meantime, I suggest you use something like the following, which shows how to join two CSV files using spyql.
$ cat example1.csv
id, name, age
1, Ana, 26
2, Jane, 31
3, Richard, 42
4, Samuel, 23
$ cat example2.csv
date, ammount, user_id
2022-02-01, 100.0, 3
2022-03-05, 25.1, 1
2022-03-15, 93.2, 1
2022-04-01, 50.0, 2
$ spyql "SELECT dict_agg(id, .) AS json FROM csv('example1.csv') TO json" > example1.json
$ spyql -Jusers=example1.json "SELECT *, users[user_id].name AS user_name, users[user_id].age AS user_age FROM csv('example2.csv') TO pretty"
date          ammount    user_id  user_name      user_age
----------  ---------  ---------  -----------  ----------
2022-02-01      100            3  Richard              42
2022-03-05       25.1          1  Ana                  26
2022-03-15       93.2          1  Ana                  26
2022-04-01       50            2  Jane                 31
Hope this helps. |
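For readers who want to see what the two commands above amount to, here is a rough standard-library Python sketch of the same idea: load the smaller file into a dictionary keyed by `id`, then stream the larger file and look each `user_id` up. It is not how SPyQL is implemented. The helper names `load_lookup` and `join_rows` are made up for illustration, and `skipinitialspace=True` is assumed in order to cope with the spaces after the commas in the sample files.

```python
import csv
import io


def load_lookup(lines, key):
    """Read the smaller CSV fully into a dict keyed by the `key` column."""
    reader = csv.DictReader(lines, skipinitialspace=True)
    return {row[key]: row for row in reader}


def join_rows(lines, lookup, key, fields, prefix="user_"):
    """Stream the larger CSV and add the looked-up fields to each row."""
    for row in csv.DictReader(lines, skipinitialspace=True):
        match = lookup.get(row[key], {})  # empty dict when there is no match
        for field in fields:
            row[prefix + field] = match.get(field)
        yield row


# Same sample data as example1.csv and example2.csv above.
users_csv = """\
id, name, age
1, Ana, 26
2, Jane, 31
3, Richard, 42
4, Samuel, 23
"""

payments_csv = """\
date, ammount, user_id
2022-02-01, 100.0, 3
2022-03-05, 25.1, 1
2022-03-15, 93.2, 1
2022-04-01, 50.0, 2
"""

users = load_lookup(io.StringIO(users_csv), "id")
joined = list(join_rows(io.StringIO(payments_csv), users, "user_id", ["name", "age"]))

for row in joined:
    print(row["date"], row["ammount"], row["user_name"], row["user_age"])
```

Only the lookup file is held in memory; the second file is read one row at a time. In this sketch every value stays a string, and a `user_id` with no match yields `None` for the added columns.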
Thank you for your suggestion, @dcmoura !
./clickhouse local -q "SELECT u.full_name, h.text FROM file('hackernews.csv', CSVWithNames) h \
JOIN file('users.tsv', TSVWithNames) u ON (u.username = h.by) WHERE NOT empty(text) AND length(text) < 50"
References:
Install
pip install modin
Python code:
try:
    import modin.pandas as pd
except Exception:
    import pandas as pd |
Right now supporting JOINs (as in the SQL syntax) is not our top priority, but I hope we get there soon enough. Out of curiosity @Minyus, is there any particular advantage of SPyQL over clickhouse local for your use case?
Thank you again for your suggestion, but we try to keep our list of dependencies as short as possible. And I would not put a core feature (such as a JOIN) depending on an optional package. The JOIN most probably will have to be implemented from scratch. |
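To make the "from scratch, no extra dependencies" idea a bit more concrete, here is a minimal sketch of an equi-join over iterables of dicts using only the standard library. It is purely illustrative and not a proposal for SPyQL's actual design. The function name `hash_join` and its `how` parameter are hypothetical. It indexes the right side in memory, streams the left side, and allows several right rows per key.

```python
from collections import defaultdict


def hash_join(left, right, left_key, right_key, how="inner"):
    """Yield merged rows where left[left_key] == right[right_key].

    Hypothetical helper: `right` is indexed in memory, `left` is streamed.
    On column name clashes, the value from the right row wins.
    """
    if how not in ("inner", "left"):
        raise ValueError("how must be 'inner' or 'left'")

    # Build the index: one list of rows per key, so duplicate keys are kept.
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)

    for row in left:
        matches = index.get(row[left_key])  # .get() does not create new entries
        if matches:
            for match in matches:
                yield {**row, **match}
        elif how == "left":
            yield dict(row)  # keep the unmatched left row as-is


users = [
    {"id": "1", "name": "Ana"},
    {"id": "2", "name": "Jane"},
]
payments = [
    {"user_id": "1", "ammount": "25.1"},
    {"user_id": "1", "ammount": "93.2"},
    {"user_id": "9", "ammount": "5.0"},
]

inner = list(hash_join(payments, users, "user_id", "id"))
left = list(hash_join(payments, users, "user_id", "id", how="left"))
per_user = list(hash_join(users, payments, "id", "user_id"))
```

Here `inner` keeps only the two payments with a known user, `left` also keeps the payment for the unknown user `9` without a `name` column, and `per_user` repeats Ana once per payment.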
Advantages of SPyQL over clickhouse local for my use case are:
|
Hi @dcmoura , I see examples to join a JSON file in the document, would joining 2 CSV files be supported?
https://spyql.readthedocs.io/en/latest/recipes.html?highlight=join#equi-joins