New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a _CSVSource and CSVInput subclass of FileInput #245
Conversation
@@ -101,6 +104,50 @@ def build_part(self, for_part, resume_state): | |||
assert for_part == str(self._path), "Can't resume reading from different file" | |||
return _FileSource(self._path, resume_state) | |||
|
|||
class CSVInput(FileInput): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe call this DictCSVInput
? This input won't work correctly on CSV files without a header and that would parallel the name of the csv.DictReader
in the stdlib that works with header-based CSVs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure it makes sense to provide a non-header version right now because that would be the same as lines that you could parse downstream with next(csv.reader([line]))
if you don't care about the header.
I'd like to leave it as is for right now and we can always add a way to skip the header or provide your own header.
Co-authored-by: David Selassie <david@bytewax.io> Signed-off-by: Zander <6073079+awmatheson@users.noreply.github.com>
Co-authored-by: David Selassie <david@bytewax.io> Signed-off-by: Zander <6073079+awmatheson@users.noreply.github.com>
Co-authored-by: David Selassie <david@bytewax.io> Signed-off-by: Zander <6073079+awmatheson@users.noreply.github.com>
Co-authored-by: David Selassie <david@bytewax.io> Signed-off-by: Zander <6073079+awmatheson@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. It seems there's no better way than to reinitialize a csv.reader every time, since you can't use self_.f.tell()
otherwise, that's not really nice but well, it's what we have.
For the typing discussion I saw in the comments, keep in mind that type hints are never used by python to check anything, you'd have to use mypy
explicitely to make use of them. But, users of the library could decide to run mypy
on their codebase, and if they pass an argument that doesn't uphold the signature, they'd get an error.
So if we want to use type hints in public functions it's important to add all the types we want to allow.
I wanted to leverage the FileInput to read a csv as it is a common file type for tabular data.
This PR adds a step to FileInput that checks the suffix and then returns a _CSVSource instead of a _FileSource. The file is then read as a csv instead of just as lines. The output returns the headers and values in a dictionary for easy processing downstream.
example input:
example output: