scrubcsv: Remove bad lines from a CSV file and normalize the rest
This is a CSV cleaning tool based on BurntSushi's
excellent csv library. It's
intended to be used for cleaning up and normalizing large data sets before
feeding them to other CSV parsers, at the cost of discarding the occasional
row. This program may further mangle syntactically-invalid CSV data!
See below for details.
Installing and using
To install, first install Rust if you haven't already:
curl https://sh.rustup.rs -sSf | sh
Then install scrubcsv using Cargo:
cargo install scrubcsv
Run it:
$ scrubcsv giant.csv > scrubbed.csv
3000001 rows (1 bad) in 51.58 seconds, 72.23 MiB/sec
For more options, run:
scrubcsv --help
Data cleaning notes
We assume that, given hundreds of gigabytes of CSV from many sources, many files will contain a few unparsable lines.
Lines of the following form:
Name,Phone
"Robert "Bob" Smith",(202) 555-1212
...are invalid according to RFC 4180 because the quotes around "Bob" are
not escaped. The creator of the file probably intended to write:
Name,Phone
"Robert ""Bob"" Smith",(202) 555-1212
scrubcsv will currently output this as:
Name,Phone
"Robert Bob"" Smith""",(202) 555-1212
If the resulting line has the wrong number of columns, it will be discarded. The precise details of cleanup and discarding are subject to change. The goal is to preserve data in valid CSV files, and to make a best effort to salvage or discard records that can't be parsed without being too picky about the details.
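As a rough illustration of the two ideas in this section, here is a minimal Rust sketch using only the standard library. It is not scrubcsv's actual code: `quote_field` and `drop_ragged_rows` are hypothetical helpers, and the sketch assumes rows have already been split into fields. The first function shows the RFC 4180 escaping the file's creator probably intended; the second shows one simple way to discard rows whose column count differs from the header.

```rust
/// Quote one CSV field in the RFC 4180 style: wrap it in double quotes
/// and double every embedded quote. Hypothetical helper for illustration.
fn quote_field(field: &str) -> String {
    let mut out = String::with_capacity(field.len() + 2);
    out.push('"');
    for ch in field.chars() {
        if ch == '"' {
            out.push('"'); // escape an embedded quote by doubling it
        }
        out.push(ch);
    }
    out.push('"');
    out
}

/// Keep only rows with the same number of fields as the first (header) row.
/// Returns the kept rows and the number of rows discarded.
/// Hypothetical helper; the real tool's discard rules may differ.
fn drop_ragged_rows(rows: Vec<Vec<String>>) -> (Vec<Vec<String>>, usize) {
    // With no rows at all there is nothing to compare against or discard.
    let expected = rows.first().map_or(0, |header| header.len());
    let total = rows.len();
    let kept: Vec<Vec<String>> = rows
        .into_iter()
        .filter(|row| row.len() == expected)
        .collect();
    let bad = total - kept.len();
    (kept, bad)
}
```

Passing `Robert "Bob" Smith` to `quote_field` yields the escaped form shown above, and `drop_ragged_rows` reports a discard count comparable in spirit to the "(1 bad)" figure that scrubcsv prints.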
Performance notes
This is designed to be relatively fast. For comparison purposes, on one particular laptop:
- cat /dev/zero | pv > /dev/null shows a throughput of about 5 GB/s.
- The raw output string-writing routines in scrubcsv can reach about 3.5 GB/s.
- The csv parser can reach roughly 235 MB/s in zero-copy mode.
- With full processing, scrubcsv hits 67 MB/s.
- A lot of old-school C command-line tools hit about 50 to 75 MB/s.
Unfortunately, we can't really use csv's zero-copy mode because we need
to see an entire row at once to decide whether or not it's valid before
deciding to output it. We could, I suppose, memmove each field as we see
it into an existing buffer to avoid malloc overhead (which is almost
certainly the bottleneck here), but that would require more code. Still,
file an issue if performance is a problem. We could probably make this
maybe two to four times faster (and it would be fun to optimize).
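The buffer-reuse idea mentioned above could look something like the following sketch. It is only an illustration under assumptions: `RowBuffer` is a hypothetical type, not taken from scrubcsv, and it assumes fields arrive one at a time as byte slices. Each field is copied onto the end of a single shared `Vec<u8>`, so a whole row can be inspected before deciding whether to write it, and clearing the buffer between rows keeps its capacity rather than allocating again.

```rust
/// A reusable row buffer: all field bytes live in one `Vec<u8>`, and
/// `ends[i]` records the offset where field `i` stops.
/// Hypothetical type for illustration only.
struct RowBuffer {
    data: Vec<u8>,
    ends: Vec<usize>,
}

impl RowBuffer {
    fn new() -> Self {
        RowBuffer { data: Vec::new(), ends: Vec::new() }
    }

    /// Forget the previous row but keep the allocated capacity.
    fn clear(&mut self) {
        self.data.clear();
        self.ends.clear();
    }

    /// Copy one field's bytes onto the end of the shared buffer.
    fn push_field(&mut self, field: &[u8]) {
        self.data.extend_from_slice(field);
        self.ends.push(self.data.len());
    }

    /// Number of fields in the current row.
    fn len(&self) -> usize {
        self.ends.len()
    }

    /// Borrow field `i` of the current row (panics if `i` is out of range).
    fn field(&self, i: usize) -> &[u8] {
        let start = if i == 0 { 0 } else { self.ends[i - 1] };
        &self.data[start..self.ends[i]]
    }
}
```

A caller would `clear()` the buffer at the start of each row, `push_field` every field as the parser produces it, then check `len()` against the expected column count before writing the row out. Whether this would actually deliver the speedup estimated above is untested here.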