xsv has proven to be an indispensable part of our data integration work. We've used it in several projects, large and small, and it has made data-wrangling much easier. We had used a bevy of tools before, both open-source and commercial (mainly OpenRefine, Trifacta née Data Wrangler, and a library of Python scripts; we even sponsored an open source project), and nothing approached the speed and convenience of xsv.
In early 2021, there were several pending PRs we were interested in, along with features we wanted to contribute ourselves.
As it happens, there was a maintainership discussion on GitHub at the time, where @BurntSushi, xsv's creator (of ripgrep fame and a prolific Rust contributor), suggested that "if folks want to carry on in a fork, that might be the best path forward. I might request giving the project a different name though, because I do at least intend to at some point breath life back into xsv."
And that's how qsv came to be. Itch scratched.
Q stands for Quick (written in Rust), Queries (with joins and regular expressions!), Querl (sounds like curl), and Quartiles (check out stats!). qsv can handle large Quantities of data (most of its commands do not need to load the entire CSV into memory and can deal with very large files) and Quickly improve data Quality, leading to a Quantum leap in your productivity!
Also, my middle name is Queaño... 😁
CSV is a universally supported format. Just about any system can export and import CSVs.
We will continue to use the right tool for the job, using qsv primarily as "interoperability duct tape": a "first mile/last mile" connector, massager, and cleaner of data. As any "data scientist" will tell you, data-wrangling continues to take an inordinate amount of time, and real-life data pipelines are brittle in nature, as data sources and business requirements inevitably change.
qsv affords you the raw speed and agility to quickly adapt to these changes on the edges. The complex transformations and analytical heavy lifting still happen in your tool of choice.
In the public sector, we've worked with a lot of jurisdictions on their open data efforts using CKAN, and we've seen how data wrangling is a pervasive problem. We see qsv as an integral part of our data pipelines moving forward, one that will dramatically lower the barrier to publishing high-quality data.
From screening for PII; to slicing data into manageable, logical partitions; to geocoding; to prepping/normalizing data from various IoT vendors; to automatically creating validation schemas and data dictionaries using descriptive statistics - qsv will allow users to compose robust data pipelines with other best-of-breed tools.
In the private sector, we also see qsv becoming a vital tool in enterprises standing up Data Management Systems, as CSV is the lingua franca of Data Exchange, spanning everything from the oldest legacy mainframe-based systems to the latest analytics frameworks.
There is no formal roadmap right now, but there is a Project Backlog and a Discussion section that folks are welcome to contribute to.
After all, qsv started with us pulling in various pending pull requests from the xsv community that happened to align with our requirements. We welcome and look forward to your contributions!
Feel free to share your recipe in the Cookbook!
Go to the General area of the Discussion section and post it there.
qsv is dual-licensed under MIT / Unlicense.