Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a _CSVSource and CSVInput subclass of FileInput #245

Merged
merged 11 commits into from May 30, 2023
Merged

Conversation

awmatheson
Copy link
Contributor

I wanted to leverage the FileInput to read a csv as it is a common file type for tabular data.

This PR adds a step to FileInput that checks the suffix and then returns a _CSVSource instead of a _FileSource. The file is then read as a csv instead of just as lines. The output returns the headers and values in a dictionary for easy processing downstream.

example input:

index timestamp value instance
0 2022-02-24 11:42:08 0.132 24ae8d
0 2022-02-24 11:42:08 0.066 c6585a
0 2022-02-24 11:42:08 42.652 ac20cd
0 2022-02-24 11:42:08 51.846 5f5533

example output:

{'index': '0', 'timestamp': '2022-02-24 11:42:08', 'value': '0.132', 'instance': '24ae8d'}
{'index': '0', 'timestamp': '2022-02-24 11:42:08', 'value': '0.066', 'instance': 'c6585a'}
{'index': '0', 'timestamp': '2022-02-24 11:42:08', 'value': '42.652', 'instance': 'ac20cd'}
{'index': '0', 'timestamp': '2022-02-24 11:42:08', 'value': '51.846', 'instance': '5f5533'}

@awmatheson awmatheson added the ready for review Ready for review label May 23, 2023
@awmatheson awmatheson changed the title Add a _CSVSource to the files option Add a _CSVSource to the FileInput option May 23, 2023
pysrc/bytewax/connectors/files.py Outdated Show resolved Hide resolved
pysrc/bytewax/connectors/files.py Outdated Show resolved Hide resolved
pysrc/bytewax/connectors/files.py Outdated Show resolved Hide resolved
@awmatheson awmatheson changed the title Add a _CSVSource to the FileInput option Add a _CSVSource and CSVInput subclass of FileInput May 23, 2023
pysrc/bytewax/connectors/files.py Outdated Show resolved Hide resolved
@@ -101,6 +104,50 @@ def build_part(self, for_part, resume_state):
assert for_part == str(self._path), "Can't resume reading from different file"
return _FileSource(self._path, resume_state)

class CSVInput(FileInput):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe call this DictCSVInput? This input won't work correctly on CSV files without a header and that would parallel the name of the csv.DictReader in the stdlib that works with header-based CSVs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure it makes sense to provide a non-header version right now because that would be the same as lines that you could parse downstream with next(csv.reader([line])) if you don't care about the header.

I'd like to leave it as is for right now and we can always add a way to skip the header or provide your own header.

pysrc/bytewax/connectors/files.py Outdated Show resolved Hide resolved
pysrc/bytewax/connectors/files.py Show resolved Hide resolved
pysrc/bytewax/connectors/files.py Outdated Show resolved Hide resolved
pysrc/bytewax/connectors/files.py Outdated Show resolved Hide resolved
pysrc/bytewax/connectors/files.py Outdated Show resolved Hide resolved
awmatheson and others added 4 commits May 23, 2023 12:17
Co-authored-by: David Selassie <david@bytewax.io>
Signed-off-by: Zander <6073079+awmatheson@users.noreply.github.com>
Co-authored-by: David Selassie <david@bytewax.io>
Signed-off-by: Zander <6073079+awmatheson@users.noreply.github.com>
Co-authored-by: David Selassie <david@bytewax.io>
Signed-off-by: Zander <6073079+awmatheson@users.noreply.github.com>
awmatheson and others added 2 commits May 23, 2023 14:28
Co-authored-by: David Selassie <david@bytewax.io>
Signed-off-by: Zander <6073079+awmatheson@users.noreply.github.com>
Copy link
Contributor

@Psykopear Psykopear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. It seems there's no better way than to reinitialize a csv.reader every time, since you can't use self_.f.tell() otherwise, that's not really nice but well, it's what we have.

For the typing discussion I saw in the comments, keep in mind that type hints are never used by python to check anything, you'd have to use mypy explicitely to make use of them. But, users of the library could decide to run mypy on their codebase, and if they pass an argument that doesn't uphold the signature, they'd get an error.
So if we want to use type hints in public functions it's important to add all the types we want to allow.

@awmatheson awmatheson merged commit 9e1ebe3 into main May 30, 2023
21 checks passed
@davidselassie davidselassie deleted the add_csv branch August 29, 2023 20:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready for review Ready for review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants