Joel Natividad edited this page Oct 13, 2021 · 14 revisions

Why create a new, renamed xsv fork?

xsv has proven to be an indispensable part of our data integration work. We've used it in several projects, large and small, and it has made data-wrangling much easier. We've used a bevy of tools before, both open-source and commercial (mainly OpenRefine, Trifacta née Data Wrangler, a library of python scripts, even sponsoring an open source project), and there was nothing that approached the speed and convenience of xsv.

In early 2021, there were several pending PRs we were interested in, as well as features we wanted to contribute ourselves.

As it happens, there was a maintainership discussion on GitHub at the time, where @BurntSushi, xsv's original author (of ripgrep fame and a prolific Rust contributor), suggested that "if folks want to carry on in a fork, that might be the best path forward. I might request giving the project a different name though, because I do at least intend to at some point breath life back into xsv."

And that's how qsv came to be. Itch scratched.

Why the name qsv?

Q stands for Quick (written in Rust), Queries (with joins and regular expressions!), Querl (sounds like curl), and Quartiles (check out stats!). qsv can handle large Quantities of data (most of its commands do not need to load the entire CSV into memory and can deal with very large files) and Quickly improve data Quality, leading to a Quantum increase in your productivity!

Also, my middle name is Queaño... 😁

What are your plans for qsv?

We've worked with a lot of jurisdictions on their open data efforts using CKAN, and we've seen how big a problem data wrangling is. We see qsv as an integral part of our data pipelines, one that will dramatically lower the barrier to publishing high-quality data.

From screening for PII; to slicing data to manageable, logical partitions; to geocoding; to prepping/normalizing data from various IoT vendors; to automatically computing their data dictionaries, qsv can be the "data wrangling duct tape" that will allow them to compose robust data pipelines.

We've also worked with the private sector, and we see the same issues...
