Fix escape chars in output when `linearize: False` by tomreitz · Pull Request #98 · edanalytics/earthmover

tomreitz · 2024-06-12T18:34:12Z

This PR fixes a bug introduced in earthmover 0.2.0. Writing out destination files was done with Dask.to_csv() and some "invisible" escape characters, however when a destination declared linearize: False (which prevents stripping out newline characters from each rendered Jinja template) these "invisible" characters were present in the output file.

Now, like we render_row() to render a Jinja template for each row of a dataframe, we similarly write_row() to append the rendered row to an open file handle.

I tested performance by loading a 31MB, 1M-row, 3-column CSV file and immediately writing it to a destination with the following template:

{
    {% for key, value in __row_data__.items() -%}
    "{{key}}": "{{value|trim}}"{% if not loop.last %},{% endif %}
    {% endfor %}
}

Results:

old .to_csv() method with linearize: True took 636s
old .to_csv() method with linearize: False took 456s
new map_partitions() method with linearize: True took 441s
new map_partitions() method with linearize: False took 431s

So the new method at least as fast as .to_csv(). I also checked the output file for malformed lines (in case different Dask processes writing to the same file conflicted somehow) but it looked fine.

…hen linearize is False

earthmover/nodes/destination.py

* Update CHANGELOG and VERSION. * Hotfix/optional parquet sources (#86) * Update optional file check in FileSource to build an empty dataframe if an empty folder is passed. * Remove explicit file check in compile. * Re-add filesize check in FileSource.execute(). * Move FtpSource connect from compile to execute. * Fix attribute naming bug. * Fix bug. * Allow filepaths to be passed in optional FileSources, and check the existance of the path before loading the dataframe. * Update CHANGELOG. * fix add_columns typo in readme * update changelog * Feature/union all columns (#94) * Add 'fill_missing' optional field to UnionOperation that uses default Pandas concat logic without erroring out. Still raise a debug message when applicable. * Rename new field to 'fill_missing_columns' for clarity. * Update dataframe.py Rename fill_missing_columns to fill_missing. * Update dataframe.py * Update CHANGELOG.md * Update CHANGELOG.md * Rename UnionOperation's fill_missing field to fill_missing_columns; update README. * Git clone timeout when running `earthmover deps` (#93) * try using subprocess with timeout * Update error message * tweak timeouts * switch to makedirs * don't error if dir already exists * remove package path on failure * adjust deletes * typo * switch to rmtree * remove gitpython dependency * remove unused import * remove unused var * add optional git timeout config * reverse accidentally removed kwargs * add notes on git_auth_timeout config to readme * code cleanup * Update README. --------- Co-authored-by: jayckaiser <jayckaiser@gmail.com> * Update changelog. * Fix escape chars in output when `linearize: False` (#98) * fixes a bug where escape characters were present in the output file when linearize is False * remove unneeded Dask import * update return value and comment based on notes from Jay --------- Co-authored-by: Tom Reitz <treitz@edanalytics.org> * fixing a bug introduced in the last version where nested JSON would be loaded as a stringified Python dictionaty, which is difficult to use in downstream Jinja (#97) Co-authored-by: Tom Reitz <treitz@edanalytics.org> * Only write `earthmover_compiled.yaml` on compile, not run (#91) * only write to disk on compile, not run * update readme with change to earthmover_compiled.yaml * Add `earthmover clean` command and some CLI error handling (#87) * add 'clean' command and clean up CLI messaging * comment justifying dictionary * update changlog * remove skip_mkdir, make compiled_yaml_file a class attribute * replace dict with list of constntas --------- Co-authored-by: Jay Kaiser <jayckaiser@gmail.com> * Update CHANGELOG with new features. * Fix `__row_data__` in `add_columns` and `modify_columns` operations (#99) * fix __row_data__ in Jinja expressions of add_columns and modify_columns operations * update how __row_data__ is added to prefent an error about modifying row --------- Co-authored-by: Tom Reitz <treitz@edanalytics.org> * Feature: Refactor Destination Execute (#95) * Update config parsing to use ErrorHandler.assert_get_key() for all fields; move and unify Jinja template processing to execute. * Update destination.py * Update CHANGELOG. * makes destination template optional (#88) * makes destination template optional; when not specified, each row is turned into a JSON object where column names become object properties * implement changes based on feedback from Jay * bugfix * Minor cleanup. --------- Co-authored-by: Tom Reitz <treitz@edanalytics.org> Co-authored-by: jayckaiser <jayckaiser@gmail.com> * Update CHANGELOG. * adding the `debug` operation (#100) * adding debug operation * Update dataframe.py Refactor code to improve readability and reference to existing Node attributes. --------- Co-authored-by: Tom Reitz <treitz@edanalytics.org> Co-authored-by: Jay Kaiser <jayckaiser@gmail.com> * Use Node.full_name in Node.check_expectations(), instead of redefining the string manually. * Update CHANGELOG. * Feature/flatten operation whitespace cleanup (#101) * adding a flatten_operation * README tweak * implement changes based on feedback from Jay * Clean up comments and whitespace in new FlattenOperation. * Add print statements to debug tuple problem. * Minor cleanup. * Minor cleanup. * Add single quotes to strip and trim variables in FlattenOperation. * Fix single quote representation in trim_whitespace. --------- Co-authored-by: Tom Reitz <treitz@edanalytics.org> * Update CHANGELOG. --------- Co-authored-by: johncmerfeld <John.Merfeld@gmail.com> Co-authored-by: Samantha LeBlanc <56237580+sleblanc23@users.noreply.github.com> Co-authored-by: Tom Reitz <tom@tomreitz.com> Co-authored-by: Tom Reitz <treitz@edanalytics.org>

Tom Reitz added 2 commits June 12, 2024 11:54

fixes a bug where escape characters were present in the output file w…

385e568

…hen linearize is False

remove unneeded Dask import

5060631

tomreitz requested a review from jayckaiser June 12, 2024 18:34

jayckaiser reviewed Jun 12, 2024

View reviewed changes

earthmover/nodes/destination.py Outdated Show resolved Hide resolved

jayckaiser reviewed Jun 12, 2024

View reviewed changes

earthmover/nodes/destination.py Show resolved Hide resolved

update return value and comment based on notes from Jay

11e8b0b

jayckaiser approved these changes Jun 12, 2024

View reviewed changes

jayckaiser merged commit 7b29a91 into rc/0.3.2 Jun 12, 2024

jayckaiser deleted the fix/escape_chars_in_output_when_linearize_false branch June 12, 2024 20:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix escape chars in output when `linearize: False`#98

Fix escape chars in output when `linearize: False`#98
jayckaiser merged 3 commits intorc/0.3.2from
fix/escape_chars_in_output_when_linearize_false

tomreitz commented Jun 12, 2024

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tomreitz commented Jun 12, 2024

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants