Add convenient support for stdin #117

Open
jbdesbas opened this issue Oct 10, 2023 · 3 comments

@jbdesbas
Contributor

Hi,

I just discovered this great project, thanks a lot for this amazing work 😃

Since CSV processing usually happens as one step in a data pipeline, it would be great to make reading CSV data from stdin more convenient.

Writing to stdout is already easy, because sys.stdout is passed directly to csv.writer, but reading is a bit more tricky.

import io
from sys import stdout, stdin

import clevercsv
import chardet

# read
input_data = stdin.buffer.read() # Read as binary
detected_encoding = chardet.detect(input_data)['encoding'] # Guess encoding

csvfile = io.StringIO(input_data.decode(detected_encoding))

dialect = clevercsv.Sniffer().sniff(csvfile.read())
csvfile.seek(0)

reader = clevercsv.reader(csvfile, dialect)
rows = reader


# write
writer = clevercsv.write.writer(stdout, encoding='utf8')  # use the imported stdout; the sys module itself is not imported
writer.writerows(rows)
@GjjvdBurg
Collaborator

Hi @jbdesbas, thanks for the kind words and for opening this issue! What exactly do you have in mind for the functionality that we can add to CleverCSV to make this easier? Perhaps a wrapper function that returns the dicts or rows of the CSV file, similar to stream_table and stream_dicts (or a modification of these to accept sys.stdin)?

Note that the example you shared is very similar to the standardize command in the CLI. If that command is what you're looking for, issue #107 could capture your request too (please let me know).
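
To make the wrapper idea concrete, here is a minimal sketch of a row generator that accepts a binary stream such as sys.stdin.buffer instead of a filename. It is not CleverCSV's API: the name stream_rows and the encodings parameter are hypothetical, the standard library's csv.Sniffer and csv.reader stand in for the CleverCSV equivalents, and a simple chain of trial decodings stands in for real encoding detection.

```python
import csv
import io
from typing import BinaryIO, Iterator, List, Sequence


def stream_rows(
    binary_stream: BinaryIO,
    encodings: Sequence[str] = ("utf-8-sig", "cp1252"),
) -> Iterator[List[str]]:
    """Yield rows from a binary stream (hypothetical helper, not part of CleverCSV)."""
    raw = binary_stream.read()  # stdin cannot be rewound, so buffer it once

    # Assumption: try a few encodings in order instead of real detection.
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        text = raw.decode("latin-1")  # maps every byte, so it cannot fail

    # Stand-in for clevercsv.Sniffer().sniff(...)
    dialect = csv.Sniffer().sniff(text)
    yield from csv.reader(io.StringIO(text, newline=""), dialect)
```

It could then be called as `for row in stream_rows(sys.stdin.buffer): ...`. Buffering the whole input keeps the sketch short, at the cost of holding the data in memory.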

@jbdesbas
Contributor Author

Hi @GjjvdBurg
Yes, I think having the read/stream table functions accept sys.stdin instead of just a filename would be a great improvement. 👍

My need is slightly different from what the standardize command does: standardize keeps the original encoding for the output file, but I need a UTF-8 file as output (regardless of the original encoding). Additionally, my original script does other things between reading and writing (it adds suffixes to deduplicate column names).
However, standardize should accept stdin as input too.
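
For illustration, the two steps described above (suffixing duplicate column names, then always writing UTF-8) might look roughly like this. The function names and the "_2", "_3" suffix scheme are assumptions for this sketch, not what the original script does, and the standard library's csv.writer stands in for the CleverCSV writer.

```python
import csv
import io
from collections import Counter
from typing import BinaryIO, Iterable, List


def dedupe_header(names: List[str]) -> List[str]:
    """Suffix repeated names with _2, _3, ...; the first occurrence is unchanged.

    Assumption: no existing column is already named like "name_2".
    """
    seen: Counter = Counter()
    result = []
    for name in names:
        seen[name] += 1
        result.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return result


def write_utf8(rows: Iterable[List[str]], binary_out: BinaryIO) -> None:
    """Write rows as UTF-8 CSV, whatever the encoding of the input was."""
    text_out = io.TextIOWrapper(binary_out, encoding="utf-8", newline="")
    csv.writer(text_out).writerows(rows)
    text_out.flush()
    text_out.detach()  # leave binary_out open for the caller
```

With sys.stdout.buffer passed as binary_out, the output encoding no longer depends on the locale or on the input file.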

@lisad

lisad commented Jun 12, 2024

I would have used this too. I'm trying to wrap or adapt some part of this library and add the ability to remove completely empty lines or lines of only commas (these frequently get added at the end of an Excel table) from the file before trying to turn lines into dicts. It seems less reliable to detect a line of only commas after the line has been parsed into a dict, when I have to check each value. I'm also adding logic to detect duplicate column names BEFORE turning rows into dicts.

Things I've tried or thought of:

  • Try to process lines of a file before passing them into stream_dicts -- doesn't work because stream_dicts does not take any data formats besides a filename
  • Try to write my own version of "stream_dicts" that calls DictReader directly -- doesn't work because DictReader needs to know the encoding, and the get_encoding method is not exported for me to use
  • Open the file, read through it all, save it again stripping empty rows, THEN call stream_dicts on the new file? I could do it, but then I need file write permissions and it takes longer
  • Write my own version of stream_dicts that copies code from clevercsv, opens the file itself, uses a generator to not pass on empty lines, then uses DictReader... crossing my fingers that I don't run into problems since I skipped the encoding detection because it's not exported.

Having stream_dicts (etc.) accept IO objects other than a filename would open this up enough for me to fix my problem more easily, but so would making get_encoding an exported part of the library, or a number of other things.
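
As a rough sketch of the line-filtering approach described in the list above, assuming the text is already decoded and using the standard library's csv.reader in place of CleverCSV's DictReader (the function names here are hypothetical):

```python
import csv
from typing import Dict, Iterable, Iterator


def drop_blank_lines(lines: Iterable[str], delimiter: str = ",") -> Iterator[str]:
    """Skip lines that contain nothing but delimiters and whitespace."""
    for line in lines:
        # Stripping delimiters and whitespace leaves "" for lines like ",,,\n".
        if line.strip(delimiter + " \t\r\n"):
            yield line


def dict_rows(lines: Iterable[str], delimiter: str = ",") -> Iterator[Dict[str, str]]:
    """Yield dicts, checking for duplicate column names before building any."""
    reader = csv.reader(drop_blank_lines(lines, delimiter), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return  # no non-blank lines at all
    duplicates = sorted({name for name in header if header.count(name) > 1})
    if duplicates:
        raise ValueError(f"duplicate column names: {duplicates}")
    for row in reader:
        yield dict(zip(header, row))
```

One caveat: filtering raw lines is only safe when quoted fields never contain embedded newlines, since a blank line inside a multi-line quoted field would be dropped too. Also note that zip silently truncates rows whose length differs from the header.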
