Skip to content
This repository

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Utilities for processing tab-separated files

branch: master
Octocat-spinner-32 test test for xlsx2tsv (doesnt automatically run yet, but does do the correct
Octocat-spinner-32 xlsx_test first commit
Octocat-spinner-32 .gitignore add cleaning to bridge-maker (now bridges.rb)
Octocat-spinner-32 README.md fix indents
Octocat-spinner-32 bridges.rb add cleaning to bridge-maker (now bridges.rb)
Octocat-spinner-32 csv2tsv Merge branch 'master' of github.com:brendano/tsvutils
Octocat-spinner-32 eq2tsv handle non-eq lines on top, though messily
Octocat-spinner-32 fmt2tsv more documentation
Octocat-spinner-32 gs2tsv google spreadsheet => tsv
Octocat-spinner-32 html2tsv html2tsv (inverse of tsv2html) for HTML tables
Octocat-spinner-32 hwrap merged
Octocat-spinner-32 json2tsv Merge branch 'master' of github.com:brendano/tsvutils
Octocat-spinner-32 lamecut doc tweaks
Octocat-spinner-32 namecut doc tweaks
Octocat-spinner-32 sas2tsv Convert a particular fixed-width SAS format ...
Octocat-spinner-32 ssv2tsv more documentation
Octocat-spinner-32 tabawk bunch of stuff
Octocat-spinner-32 tabsort more documentation
Octocat-spinner-32 tar2tsv tar2tsv: file contents as one-line JSON strings.
Octocat-spinner-32 tsv2csv first commit
Octocat-spinner-32 tsv2fmt fiddle with 2tsv and bridges
Octocat-spinner-32 tsv2html sortable HTML table
Octocat-spinner-32 tsv2my bugfixes etc
Octocat-spinner-32 tsv2ssv tsv => space-separated ... i seem to recall,
Octocat-spinner-32 tsv2tex bunch of stuff
Octocat-spinner-32 tsv2yaml add tsv2yaml
Octocat-spinner-32 tsvawk tsvawk colname bugfix
Octocat-spinner-32 tsvcat fixes
Octocat-spinner-32 tsvsort tsvsort
Octocat-spinner-32 tsvutil.py bunch of stuff
Octocat-spinner-32 uniq2tsv more documentation
Octocat-spinner-32 xls2csvfiles xls2csvfile, xls2tsv: use the python xlrd library
Octocat-spinner-32 xls2tsv xls2csvfile, xls2tsv: use the python xlrd library
Octocat-spinner-32 xlsx2tsv doc tweaks
Octocat-spinner-32 yaml2tsv doc tweaks
README.md

tsvutils -- utilities for processing tab-separated files

tsvutils are scripts that can convert and manipulate tabular data in the TSV file format: tab-separated values, sometimes with a header. They build on top of standard Unix utilities to allow ad-hoc, efficient, and reliable processing and summarization of tabular data from the shell.

Overview of scripts

Convert into tsv:

  • csv2tsv - convert from Excel-compatible csv.
  • json2tsv - convert from concatenated JSON records.
  • xlsx2tsv - convert from Excel's .xlsx format.
  • others: eq2tsv ssv2tsv uniq2tsv yaml2tsv ...

Manipulate tsv:

  • tsvawk - gives you column names in your awk.
  • hwrap - wraps pipeline process but preserves stdin's header.
  • tsvcat - concatenate tsv's, aligning common columns.
  • namecut - like 'cut' but with header names.
  • tabsort, tabawk - wrappers for tab delimitation.

Convert out of tsv:

  • tsv2csv - convert tsv to Excel-compatible csv.
  • tsv2my - load tsv into a new MySQL table.
  • tsv2fmt - format as ASCII-art table.
  • tsv2html - format as HTML table.
  • others: tsv2yaml tsv2tex ...

By "tsv" we mean honest-to-goodness tab-separated values, often with a header. No quoting, escaping, or comments. All rows should have the same number of fields. Rows end with a unix \n newline. Cell values cannot have tabs or newlines.

These conditions are all enforced in scripts that output tsv. For programs that take tsv input, if these assumptions do not hold, the script's behavior is undefined.

TSV is an easy format for other programs to handle:

  • After stripping the newline, split("\t") correctly parses a row.
  • To strip out the header beforehand: "tail -n+2" or "tail +2".

Weak naming convention: programs that don't work well with headers call that format "tab"; ones that either need a header or are agnostic call that "tsv". E.g., for tabsort you don't want to sort the header, but tsv2my is impossible without the header. csv2tsv and tsv2csv are agnostic, since a csv file may or may not have a header.

Examples: pipelines

The TSV format is intended to work with many other pipeline-friendly programs. Examples include:

  • General
    • cat, head, tail, tail -n+X, cut, merge, diff, comm, sort, uniq, uniq -c, wc -l
  • Multipurpose
    • perl -pe, ruby -ne, awk, sed, tr
  • SQL to TSV
    • echo 'select a,b from bla' | mysql
    • echo a b | ssv2tsv; echo "select a,b from bla" | sqlite3 -separator $(echo -e '\t')
    • echo a b | ssv2tsv; echo "select a,b from bla" | psql -tqAF $(echo -e '\t')
  • GUI to TSV
    • Excel: copy-and-paste cells <-> text as tsv (though kills double quotes)
    • Web browsers: copy rendered HTML table -> text as tsv
  • Misc

The tsvutils scripts' comments include further examples.

Examples: Named columns in programs

Here are examples of parsing TSV-with headers in several script-y languages, such that you get to refer to columns by their names, instead of positions. This makes the scripts much more maintainable.

Python has a built-in facility for TSV-with-headers:

tsv_reader = lambda f: csv.DictReader(f, dialect=None, delimiter='\t', quoting=csv.QUOTE_NONE)
for record in tsv_reader(sys.stdin):
  print record  # => hash of key/values

Or equivalently:

cols = sys.stdin.readline()[:-1]
for line in sys.stdin:
  vals = line[:-1].split("\t")
  record = dict((cols[j],vals[j]) for j in range(len(cols)))
  print record  # => hash of key/values

Ruby:

cols = STDIN.readline.chomp.split("\t")
STDIN.each do |line|
  vals = line.chomp.split("\t")
  record = (0...cols.size).map {|j| [cols[j], vals[j]]}.to_h
  p record  # => hash of key/values
end

R has a built-in facility:

data = read.delim("data.tsv", sep="\t")

Installation

It's probably useful to look at or tweak these scripts, so you're best off just putting the entire directory on your PATH.

The philosophy of tsvutils

There are many data processing and analysis situations where data consists of tables. A "table" is a list of flat records each with the same set of named attributes, where it's easy to manipulate a particular attribute across all records -- a "column". The main data structures in SQL, R, and Excel are tables.

TSV-with-headers sits in a sweet spot on the spectrum of data format complexity.

  • A more complex alternative is to encode in arbitrarily nested structures (XML, JSON). These have greater representational capacity, but are less convenient for data analysis. Since they can have high structural complexity, it's often error-prone to use them -- ad-hoc querying is generally difficult. Furthermore, it's wasteful of space to repeat key names over and over if all records are known to have the same set of keys. Finally, when doing data analysis, especially statistical analysis, you want to turn columns into vectors, which presupposes a flatter, more table-like structure.
  • A simpler alternative is a table with positional, un-named columns. The main weakness is that for more than several columns, it's hard to remember which column is which. Named columns improve maintainability.

But: SQL databases and Excel spreadsheets are often inconvenient data management environments compared to the filesystem on the unix commandline. Unfortunately, the most common file format for tables is CSV, which is complex and has several incompatible versions. It plays only moderately nicely with the unix commandline, which is the best ad-hoc processing tool for filesystem data. Often the only correct way to handle CSV is to use a parsing library, but it's inconvenient to fire up a python/perl/ruby session just to do simple sanity checks and casually inspect data.

To balance these needs, so far I've found that TSV-with-headers is the most convenient canonical format for table data in the filesystem/commandline environment, or at least the lingua franca in shell pipelines. These utilities are just a little bit of glue to make TSV play nicely with CSV, Excel, MySQL, and Unix tools. Interfaces in and out of other table-centric environments could easily be added.

On the philosophy of having NO escaping or special data value conventions: If you want those things in your data, make up your own convention (like backslash escaping, URL escaping, or whatever) and have your application be aware of it. Our philosophy is, a data processing utility should ignore that stuff in order to have safe and predictable behavior. I've seen too many bugs because some intermediate program imposed a special meaning on "NA" or "\N" or "NULL", etc., when really a program further downstream should have had sole responsibility for this interpretation.

In Conclusion

Hope you enjoy tsvutils!

Something went wrong with that request. Please try again.