# scrubcsv: Remove bad lines from a CSV file and normalize the rest
This is a CSV cleaning tool based on BurntSushi's `csv` library. It's intended to be used for cleaning up and normalizing large data sets before feeding them to other CSV parsers, at the cost of discarding the occasional row. This program may further mangle syntactically-invalid CSV data! See below for details.
## Installing and using
To install, first install Rust if you haven't already:
```sh
curl https://sh.rustup.rs -sSf | sh
```
Then install `scrubcsv` using Cargo:
```sh
cargo install scrubcsv
```
Run it:

```sh
$ scrubcsv giant.csv > scrubbed.csv
3000001 rows (1 bad) in 51.58 seconds, 72.23 MiB/sec
```
For more options, run:

```sh
scrubcsv --help
```
## Data cleaning notes
We assume that, given hundreds of gigabytes of CSV from many sources, many files will contain a few unparsable lines.
Lines of the following form:
Name,Phone "Robert "Bob" Smith",(202) 555-1212
...are invalid according to RFC 4180 because the quotes around "Bob" are not escaped. The creator of the file probably intended to write:
Name,Phone "Robert ""Bob"" Smith",(202) 555-1212
`scrubcsv` will currently output this as:
Name,Phone "Robert Bob"" Smith""",(202) 555-1212
If the resulting line has the wrong number of columns, it will be discarded. The precise details of cleanup and discarding are subject to change. The goal is to preserve data in valid CSV files, and to make a best effort to salvage or discard records that can't be parsed without being too picky about the details.
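This isn't `scrubcsv`'s actual source, but a minimal sketch of the same idea using BurntSushi's `csv` crate (names like `expected_fields` are illustrative): parse leniently, take the first row's field count as the expected width, and drop rows that either fail to parse or come out with the wrong number of columns.

```rust
use std::io;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Parse leniently so we can see every row, even ones with odd field counts.
    let mut reader = csv::ReaderBuilder::new()
        .has_headers(false)
        .flexible(true)
        .from_reader(io::stdin());
    let mut writer = csv::WriterBuilder::new().from_writer(io::stdout());

    let mut expected_fields: Option<usize> = None;
    let mut bad = 0u64;

    for result in reader.byte_records() {
        let record = match result {
            Ok(record) => record,
            // A row the parser can't handle at all: drop it.
            Err(_) => {
                bad += 1;
                continue;
            }
        };
        // The first row (the header) establishes the expected column count.
        let expected = *expected_fields.get_or_insert(record.len());
        if record.len() == expected {
            writer.write_byte_record(&record)?;
        } else {
            // Wrong number of columns after cleanup: drop it.
            bad += 1;
        }
    }
    writer.flush()?;
    eprintln!("{} bad rows discarded", bad);
    Ok(())
}
```

Note that in this lenient mode the parser tends to reinterpret stray quotes rather than reject the row outright, which is why the mangled output above can still come out with the right number of columns and be kept.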
## Performance

This is designed to be relatively fast. For comparison purposes, on a particular laptop:
- `cat /dev/zero | pv > /dev/null` shows a throughput of about 5 GB/s.
- The original raw output string-writing routines in `scrubcsv` could reach about 3.5 GB/s.
- The `csv` parser can reach roughly 235 MB/s in zero-copy mode.
- With various levels of processing, `scrubcsv` hits 49 to 125 MB/s.
- A lot of old-school C command-line tools hit about 50 to 75 MB/s.
Unfortunately, we can't really use `csv`'s zero-copy mode, because we need to see an entire row at once to decide whether or not it's valid before deciding to output it. We could, I suppose, `memmove` each field as we see it into an existing buffer to avoid `malloc` overhead (which is almost certainly the bottleneck here), but that would require more code. Still, file an issue if performance is a problem. We could probably make this maybe two to four times faster (and it would be fun to optimize).
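As a rough illustration of that buffer-reuse idea (an approach the `csv` crate supports via `read_byte_record`, not necessarily how `scrubcsv` is written today): each row is copied into a single caller-owned `ByteRecord`, so the whole record is still visible for validation without a fresh allocation per row. `count_rows` below is a hypothetical helper, not part of `scrubcsv`.

```rust
use std::io;

// Hypothetical helper: scan stdin while reusing one allocation for every record.
fn count_rows() -> csv::Result<u64> {
    let mut reader = csv::ReaderBuilder::new()
        .has_headers(false)
        .flexible(true)
        .from_reader(io::stdin());
    // Allocated once; `read_byte_record` copies each row's fields into it,
    // so we still see the entire row before deciding what to do with it.
    let mut record = csv::ByteRecord::new();
    let mut rows = 0;
    while reader.read_byte_record(&mut record)? {
        rows += 1;
    }
    Ok(rows)
}

fn main() -> csv::Result<()> {
    eprintln!("{} rows", count_rows()?);
    Ok(())
}
```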