Joel Natividad edited this page Oct 13, 2021 · 14 revisions

Why create a new, renamed xsv fork?

xsv has proven to be an indispensable part of our data integration work. We've used it in several projects, large and small, and it has made data-wrangling much easier. We've used a bevy of tools before, both open-source and commercial (mainly OpenRefine, Trifacta née Data Wrangler, and a library of Python scripts; we even sponsored an open source project), and nothing approached the speed and convenience of xsv.

In early 2021, there were several pending PRs we were interested in, along with features we wanted to contribute ourselves.

As it happens, there was a maintainership discussion on GitHub at the time, where @BurntSushi, xsv's creator (of ripgrep fame and a prolific Rust contributor), suggested that "if folks want to carry on in a fork, that might be the best path forward. I might request giving the project a different name though, because I do at least intend to at some point breath life back into xsv."

And that's how qsv came to be. Itch scratched.

Why the name qsv?

Q stands for Quick (written in Rust), Queries (with joins and regular expressions!), Querl (sounds like curl), and Quartiles (check out stats!). qsv can handle large Quantities of data (most of its commands do not need to load the entire CSV into memory and can deal with very large files) and Quickly improve data Quality, leading to a Quantum leap in your productivity!

Also, my middle name is Queaño... 😁

Another CLI tool for handling CSVs? There are more robust tools out there!

CSV is a universally supported format. Just about any system can export and import CSVs.

We'll continue to use the right tool for the job and not aim to solve all our data integration problems with the qsv hammer. qsv will primarily serve as our "interoperability duct tape": a "first mile/last mile" connector, massager, and cleaner of data.

As any "data scientist/engineer" will tell you, data-wrangling continues to take an inordinate amount of time, and real-life data pipelines are brittle in nature, as data sources and business requirements inevitably change.

qsv affords you the raw speed and agility to quickly adapt to these ad-hoc changes on the edges. The complex transformations and analytical heavy lifting still happen in your tool of choice.

Loosely-coupled, microservices, call it what you want. It's the composable Unix Philosophy that's been with us since the 70s.

Wait! Aren't you violating the Unix philosophy by overloading qsv with all these features?

One can argue that, but if you look at the source code, qsv is basically composed of subcommands (most of which are less than 100 lines of code) sharing common CSV, regular expression, and command-line parsing engines (all of which happen to be written by BurntSushi as well!).

Our ambition for qsv is to be a "CoreUtils of CSVs", with qsv's subcommands analogous to the coreutils programs.

What are your plans for qsv?

In the public sector, we've worked with a lot of jurisdictions on their open data efforts using CKAN, and we've seen how pervasive a problem data wrangling is. We see qsv as an integral part of our data pipelines moving forward, one that will dramatically lower the barrier to publishing high quality data.

From screening for PII; to slicing data to manageable, logical partitions; to geocoding; to prepping/normalizing data from various IoT vendors; to automatically creating validation schemas and data dictionaries using descriptive statistics - qsv will allow users to compose robust data pipelines with other best-of-breed tools.

In our private sector projects, we also see qsv becoming a useful tool for enterprises standing up Data Management Systems, as CSV is the lingua franca of Data Exchange.

Is there a roadmap?

qsv hews to xsv's original goals. As we move the fork forward, we are only changing the tense, since we feel xsv has largely achieved them.

  1. Simple tasks should be easy -> Simple tasks are easy.
  2. Performance trade offs should be exposed in the CLI interface -> Performance trade offs are exposed in the CLI interface.
  3. Composition should not come at the expense of performance -> Composition does not come at the expense of performance.

As a data engineering company, we'll prioritize features that will make qsv a better fit into composable data pipelines - logging, comprehensive test coverage & benchmarks, and integration into the CKAN ecosystem.

Beyond that, there is no formal roadmap right now, but there is a Project Backlog and a Discussion section that folks are welcome to contribute to.

After all, qsv started with us pulling in various pending pull requests from the xsv community that happened to align with our requirements.

CSV validation with JSON Schemas is shortlisted for the upcoming 0.18.0 release. Most of the shortlisted features are driven by current projects, as we scratch itches.

Does qsv support other file formats?

Yes. qsv supports TSV and TAB files as well. Also, the jsonl command converts a CSV/TSV/TAB file to the jsonl/ndjson format.

If your file uses an unconventional delimiter, you can specify it with the --delimiter option.
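To make the jsonl/ndjson format and the role of a custom delimiter concrete, here is a minimal Python sketch of the same idea. It is not how qsv implements its jsonl command. The function name `csv_to_jsonl` and the sample data are made up for illustration, and the sketch keeps every value as a string, which may differ from what qsv emits.

```python
import csv
import io
import json


def csv_to_jsonl(text: str, delimiter: str = ",") -> str:
    """Convert delimited text with a header row into JSON Lines.

    Hypothetical helper for illustration only; not part of qsv.
    """
    # DictReader uses the first row as the field names.
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    # JSON Lines: one JSON object per record, one record per line.
    return "\n".join(json.dumps(row) for row in reader)


if __name__ == "__main__":
    # Made-up sample that uses an unconventional ';' delimiter.
    sample = "name;qty\nwidget;3\ngadget;5\n"
    print(csv_to_jsonl(sample, delimiter=";"))
```

Passing `delimiter="\t"` would handle TSV/TAB input the same way. Because each output line is a standalone JSON object, the result can be processed one line at a time by downstream tools.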

If you want to convert your CSV to JSON or vice versa, you can use other composable, open source command line tools like jq and dasel.

Other tools that deal with CSVs, such as mlr and csvkit, both complement and overlap with qsv, and we highly recommend them as well.

I have a qsv recipe I'd like to share.

Feel free to share your recipe in the Cookbook!

I have more questions. Where can I find out more?

Go to the General area of the Discussion section and post your questions there.
