[FEATURE] Add tsv (tab separated) as save format #331

Closed
keiranmraine opened this issue Jun 17, 2022 · 7 comments

@keiranmraine

Is your feature request related to a problem? Please describe.
TSV files are used extensively in bioinformatic spaces.

Describe the solution you'd like
Support SAVE directly to TSV.

Describe alternatives you've considered
Outputter following SAVE AND LOAD

@kvnkho
Collaborator

kvnkho commented Jun 17, 2022

Just adding notes: I was testing this, and the issue is that we don't allow filenames with a .tsv extension. We should support it. For a lot of the engines, it's just a matter of changing the delimiter in the method that reads CSVs.
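
To illustrate the point about the delimiter, here is a minimal sketch using only Python's standard `csv` module rather than any Fugue engine. The helper names `write_delimited` and `read_delimited` and the sample rows are made up for illustration; only the column names come from the query later in this thread.

```python
import csv
import io

# Made-up sample rows; the first row is the header.
rows = [
    ["uuid", "icgc_donor_id"],
    ["u-1", "DO1"],
]


def write_delimited(rows, delimiter):
    """Serialize rows to text using the given field delimiter."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


def read_delimited(text, delimiter):
    """Parse delimited text back into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# The same code path handles CSV and TSV; only the delimiter differs.
csv_text = write_delimited(rows, ",")
tsv_text = write_delimited(rows, "\t")
```

Reading the text back with the matching delimiter returns the original rows in both cases, which is why a TSV format can often be a thin wrapper around an existing CSV reader and writer.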

@goodwanghan
Collaborator

image

We can support TSV already

@keiranmraine
Author

FYI, this doesn't handle Windows file paths:

File <string>:1
    "c:\Users\XXX\cleaned.tsv"
                                                                                                      ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

I had to do:

fe_output = f_output.replace("\\", "\\\\")
%%fsql
SELECT df_links.uuid, df_pcawg.*
    FROM df_pcawg
    INNER JOIN df_links ON df_pcawg.icgc_donor_id = df_links.DO
SAVE OVERWRITE CSV "{{fe_output}}" (sep="\t", header=TRUE)
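
For readers hitting the same problem, here is a small sketch of two ways to prepare a Windows path before it is substituted into a double-quoted string. The helper names are hypothetical. The first helper mirrors the `replace` workaround above; the second converts to forward slashes with `pathlib`, and whether the `SAVE` statement accepts that form on Windows is an assumption that has not been verified here.

```python
from pathlib import PureWindowsPath


def escape_backslashes(path: str) -> str:
    """Double every backslash so a quoted-string parser reads them literally."""
    return path.replace("\\", "\\\\")


def to_forward_slashes(path: str) -> str:
    """Rewrite a Windows path with forward slashes, avoiding backslashes entirely."""
    return PureWindowsPath(path).as_posix()


# A raw string keeps the single backslashes exactly as typed.
f_output = r"c:\Users\XXX\cleaned.tsv"

fe_output = escape_backslashes(f_output)    # same result as the workaround above
alt_output = to_forward_slashes(f_output)   # backslash-free alternative (assumption)
```

Either value can then be passed into the template in place of the raw path.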

@keiranmraine
Author

Changing the last directive to SAVE AND USE OVERWRITE CSV "{{fe_output}}" (sep="\t", header=TRUE) results in an error:

File c:\Users\XXX\.venv\lib\site-packages\fugue\workflow\workflow.py:1518, in FugueWorkflow.run(self, *args, **kwargs)
   1516         if ctb is None:  # pragma: no cover
   1517             raise
-> 1518         raise ex.with_traceback(ctb)
   1519     self._computed = True
   1520 return DataFrames(
   1521     {
...
    231         pdf = _safe_load_csv(
    232             p.uri, **{"index_col": False, "header": None, "names": columns, **kw}
    233         )

ValueError: columns must be set if without header

@goodwanghan
Collaborator

FYI, this doesn't handle Windows file paths:

File <string>:1
    "c:\Users\XXX\cleaned.tsv"
                                                                                                      ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

I had to do:

fe_output = f_output.replace("\\", "\\\\")
%%fsql
SELECT df_links.uuid, df_pcawg.*
    FROM df_pcawg
    INNER JOIN df_links ON df_pcawg.icgc_donor_id = df_links.DO
SAVE OVERWRITE CSV "{{fe_output}}" (sep="\t", header=TRUE)

This is actually correct behavior. It may be tricky to understand, so let me use native Python code to illustrate:
image

The eval here tries to parse the expression inside double quotes, which is very similar to what fsql does, and you can see that I can reproduce the error you had. Let me know if what I'm trying to explain makes sense.
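
The screenshot is not preserved in this text, so the following is only a guess at the kind of check described, not the original code. It assumes nothing about Fugue itself: it wraps the path in double quotes and hands it to Python's built-in `eval`, which parses it as a string literal. The helper name `parse_quoted` is hypothetical.

```python
# Hypothetical reconstruction; not the code from the original screenshot.
path = r"c:\Users\XXX\cleaned.tsv"  # raw string keeps the single backslashes


def parse_quoted(value: str) -> str:
    """Wrap value in double quotes and parse it as a Python string literal."""
    return eval('"' + value + '"')


try:
    parse_quoted(path)
    error = None
except SyntaxError as exc:
    # A backslash followed by "U" starts an 8-digit unicode escape,
    # and the characters after it here are not valid hex digits.
    error = str(exc)

# Doubling each backslash makes the parser read each pair as one literal backslash.
escaped = path.replace("\\", "\\\\")
parsed = parse_quoted(escaped)
```

The first call fails with the same `unicodeescape` error reported above, while the escaped version parses back to the original path. This matches the explanation that the quoted string is interpreted with escape sequences, so backslashes have to be doubled before substitution.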

@goodwanghan
Collaborator

@keiranmraine please see issue #332 ^^^

@goodwanghan
Collaborator

Closing
