
doc tweaks

1 parent f62e636 commit 49ff4620d4cbbcac04da45060b22d71b079753fc @brendano committed Mar 16, 2009
Showing with 121 additions and 30 deletions.
  1. +56 −19 README.md
  2. +14 −2 hwrap
  3. +4 −1 lamecut
  4. +12 −2 namecut
  5. +2 −1 ssv2tsv
  6. +9 −0 tabawk
  7. +6 −3 tsvcat
  8. +2 −0 tsvutil.py
  9. +4 −2 xlsx2tsv
  10. +12 −0 yaml2tsv
75 README.md
@@ -16,9 +16,9 @@ Convert to tsv:
Manipulate tsv:
* namecut - like 'cut' but with header names.
-* tabsort - 'sort' wrapper with tab delimiter
-* tabawk - 'awk' wrapper with tab delimiter
-* hwrap - wraps anything but passes through header line
+* tabsort - 'sort' wrapper with tab delimiter.
+* tabawk - 'awk' wrapper with tab delimiter.
+* hwrap - wraps anything, preserving header.
Convert out of tsv:
@@ -27,40 +27,76 @@ Convert out of tsv:
Here, the "tsv" file format is honest-to-goodness tab-separated values, usually with a header. No quoting, escaping, or comments. All rows should have the same number of fields. Rows end with a unix \n newline. Cell values cannot have tabs or newlines.
-These conditions are all enforced in scripts that convert to tsv. For a program that convert *out* of tsv, if these assumptions do not hold, the script's behavior is undefined.
+These conditions are all enforced in scripts that convert to tsv. For programs that convert *out* of tsv, if these assumptions do not hold, the script's behavior is undefined.
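These invariants are simple enough to check mechanically. A minimal sketch of such a validator (illustration only, not one of this repo's scripts):

```python
def check_tsv(lines):
    # Verify the format described above: every row ends in a unix
    # newline and has the same number of tab-separated fields. No
    # quoting rules to worry about -- a cell simply cannot contain
    # a tab or newline. Returns the column count.
    ncols = None
    for i, line in enumerate(lines, 1):
        if not line.endswith("\n"):
            raise ValueError("line %d: missing unix newline" % i)
        fields = line[:-1].split("\t")
        if ncols is None:
            ncols = len(fields)
        elif len(fields) != ncols:
            raise ValueError("line %d: %d fields, expected %d"
                             % (i, len(fields), ncols))
    return ncols
```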
-TSV is an easy format for other programs to handle: after removing the newline, split("\t") correctly parses a row.
+TSV is an easy format for other programs to handle: after removing the newline, split("\t") correctly parses a row.
-Note that "tail +2" or "tail -n+2" strips out a tsv file's header. A common pattern is to preserve preserve the header while manipulating the rows. For example, the following sorts a file. ("hwrap" does this too.)
-
- $ (head -1 file; tail +2 file | tabsort -k2) > outfile
+Note that "tail +2" or "tail -n+2" strips out a tsv file's header.
Weak naming convention: programs that don't work well with headers call that format "tab"; ones that either need a header or are agnostic call that "tsv". E.g., for tabsort you don't want to sort the header, but tsv2my is impossible without the header. csv2tsv and tsv2csv are agnostic, since a csv file may or may not have a header.
+
+Examples
+--------
+
The TSV format is intended to work with many other pipeline-friendly programs. Examples include:
* cat, head, tail, tail -n+X, cut, merge, diff, comm, sort, uniq, uniq -c, wc -l
* perl -pe, ruby -ne, awk, sed, tr
* echo 'select a,b from bla' | mysql
* echo -e "a\tb"; echo "select a,b from bla" | sqlite3 -separator $(echo -e '\t')
* echo -e "a\tb"; echo "select a,b from bla" | psql -tqAF $(echo -e '\t')
-* [shuffle][1]
-* [md5sort][2]
-* [setdiff][3]
-* [pv][4]
-* (GUI) Excel: copy-and-paste cells <-> text as tsv
+* [shuffle][sh]
+* [md5sort][md]
+* [setdiff][sd]
+* [mapagg][ma]
+* [pv][pv]
+* (GUI) Excel: copy-and-paste cells <-> text as tsv (though it kills double quotes)
* (GUI) Web browsers: copy rendered HTML table -> text as tsv
-[1]: http://www.w3.org/People/Bos/Shuffle
-[2]: http://gist.github.com/22959
-[3]: http://gist.github.com/22958
-[4]: http://www.ivarch.com/programs/pv.shtml
+[sh]: http://www.w3.org/People/Bos/Shuffle
+[md]: http://gist.github.com/22959
+[sd]: http://gist.github.com/22958
+[ma]: http://gist.github.com/67656
+[pv]: http://www.ivarch.com/programs/pv.shtml
+
+
+A common pattern is to preserve the header while manipulating the rows. For example, the following sorts a file.
+
+ $ (head -1 file; tail +2 file | tabsort -k2) > outfile
+
+Or equivalently:
+
+ $ hwrap tabsort -k2 <file >outfile
+
+Parsing TSV-with-headers in Ruby:
+
+ cols = STDIN.readline.chomp.split("\t")
+ STDIN.each do |line|
+ vals = line.chomp.split("\t")
+ record = (0...cols.size).map {|j| [cols[j], vals[j]]}.to_h
+ pp record # => hash of key/values
+ end
+
+Parsing TSV-with-headers in Python:
+
    cols = sys.stdin.readline()[:-1].split("\t")
+ for line in sys.stdin:
+ vals = line[:-1].split("\t")
+ record = dict((cols[j],vals[j]) for j in range(len(cols)))
+ print record # => hash of key/values
+
+Or equivalently,
+
    tsv_reader = lambda f: csv.DictReader(f, dialect=None, delimiter='\t', quoting=csv.QUOTE_NONE)
+ for record in tsv_reader(sys.stdin):
+ print record # => hash of key/values
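Going the other direction, the same csv-module settings can emit TSV-with-header; a sketch using csv.DictWriter (not part of this repo):

```python
import csv
import sys

def write_tsv(records, cols, out=sys.stdout):
    # Mirror image of the reader above: tab delimiter, no quoting,
    # unix newlines. Per the format's rules, a stray tab or newline
    # inside a value would corrupt the output.
    w = csv.DictWriter(out, cols, delimiter='\t',
                       quoting=csv.QUOTE_NONE, lineterminator='\n')
    w.writeheader()
    w.writerows(records)
```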
Installation
------------
-Lots of these scripts aren't very polished -- needing fixes for python 2.5 vs 2.6's handling of utf-8 stdin/stdout and the like -- so you're best off just putting the entire directory on your PATH in case you need to hack up the scripts.
+Lots of these scripts aren't very polished -- needing fixes for python utf-8 stdin/stdout and the like -- so you're best off just putting the entire directory on your PATH in case you need to hack up the scripts.
The philosophy of tsvutil
@@ -72,9 +108,10 @@ These utilities are good at data munging back and forth between MySQL and Excel.
Long version:
-There are many data processing and analysis situations where data consists of tables. A "table" is a list of flat records with identical sets of named attributes, where it's easy to manipulate a particular attribute across all records -- a "column". The main data structures in SQL, R, and Excel are tables. A more complex alternative is to encode in arbitrarily nested structures (XML, JSON). Due to their potential complexity, it's always error-prone to use them. Ad-hoc querying is generally difficult if not impossible. A simpler alternative is a table with positional, un-named columns, but it's difficult to remember which column is which. Tables with named columns hit a sweet spot of both maintainability and simplicity.
+There are many data processing and analysis situations where data consists of tables. A "table" is a list of flat records each with the same set of named attributes, where it's easy to manipulate a particular attribute across all records -- a "column". The main data structures in SQL, R, and Excel are tables. A more complex alternative is to encode in arbitrarily nested structures (XML, JSON). Due to their potential complexity, it's always error-prone to use them. Ad-hoc querying is generally difficult. A simpler alternative is a table with positional, un-named columns, but it's difficult to remember which column is which. Tables with named columns hit a sweet spot of both maintainability and simplicity.
But SQL databases and Excel spreadsheets are often inconvenient data management environments compared to the filesystem on the unix commandline. Unfortunately, the most common file format for tables is CSV, which is complex and has several incompatible versions. It plays only moderately nicely with the unix commandline, which is the best ad-hoc processing tool for filesystem data. Often the only correct way to handle CSV is to use a parsing library, but it's inconvenient to fire up a python/perl/ruby session just to do simple sanity checks and casually inspect data.
To balance these needs, so far I've found that TSV-with-headers is the most convenient canonical format for table data in the filesystem/commandline environment. It's also good as a lingua franca intermediate format in shell pipelines. These utilities are just a little bit of glue to make TSV play nicely with Excel, MySQL, and Unix tools. Interfaces in and out of other table-centric environments could easily be added.
+
16 hwrap
@@ -1,7 +1,19 @@
#!/usr/bin/env python
-# assume stdin has a header and the rest are rows.
-# pass-through the header, then execute command only on the rows.
+"""
+hwrap [pipeline command to wrap]
+
+Assume stdin has a header and the rest are rows.
+Print header, then pass on only the rows to wrapped command's stdin.
+Useful for "sort", "grep", "head", "tail"
+and other commands that don't muck with rows' internal structure.
+If you want to wrap a command requiring shell metacharacters -- like pipe |'s
+then try: hwrap bash -c "bla | bla | bla"
+"""
+
import sys,os
+if sys.stdin.isatty():
+ print>>sys.stderr, __doc__.strip()
+ sys.exit(1)
sys.stdin = open('/dev/stdin','U',buffering=0)
sys.stdout = open('/dev/stdout','w',buffering=0)
header = sys.stdin.readline()
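The hunk above stops right after the header is read; the rest of the wrapper presumably hands the remaining rows to the wrapped command. In modern Python 3 terms the idea is roughly this (an illustrative sketch, not the script's actual code):

```python
import subprocess
import sys

def hwrap(cmd, src=sys.stdin, dst=sys.stdout):
    # Pass the header line through untouched...
    header = src.readline()
    dst.write(header)
    dst.flush()
    # ...then feed everything after it to the wrapped command
    # (sort, grep, head, ...). The real script streams through
    # unbuffered /dev/stdin rather than slurping the rest of the
    # input into memory as done here.
    subprocess.run(cmd, input=src.read(), stdout=dst,
                   text=True, check=True)
```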
5 lamecut
@@ -1,6 +1,9 @@
#!/usr/bin/env python
""" like 'cut' but assumes a header and you specify the columns you want by
-name instead of position"""
+name instead of position
+
+not as good as 'namecut', use that instead
+"""
import sys
names = []
14 namecut
@@ -1,7 +1,17 @@
#!/usr/bin/env ruby
-# usage: give list of column names or indexes to extract (like sql select)
-# assumes first line is a header
+$doc = %{
+namecut [column names and/or 1-based indexes]
+
+Assume first line is a header.
+Those columns are extracted (like sql select).
+Like 'cut' but with names; and jankier.
+}
+
+if STDIN.tty?
+ STDERR.puts $doc.strip
+ exit 1
+end
colnames = $stdin.readline.chomp.split("\t")
cols = ARGV
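The hunk ends just as the real work starts. The selection it describes -- columns by header name or 1-based index, like a sql select -- boils down to something like the following (a Python sketch for illustration; the script itself is Ruby):

```python
def namecut(lines, wanted):
    # Resolve each argument to a 0-based column index: all-digit
    # arguments are treated as 1-based positions, anything else as
    # a header name. Then project every row onto those columns.
    it = iter(lines)
    header = next(it).rstrip("\n").split("\t")
    idx = [int(w) - 1 if w.isdigit() else header.index(w) for w in wanted]
    yield "\t".join(header[j] for j in idx) + "\n"
    for line in it:
        vals = line.rstrip("\n").split("\t")
        yield "\t".join(vals[j] for j in idx) + "\n"
```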
3 ssv2tsv
@@ -1,5 +1,6 @@
#!/usr/bin/env python
-""" Space-separated fields => tab separated. very tolerant of inconsistent numbers of spaces. Use case: uniq -c | ssv2tsv
+""" Space-separated fields => tab separated. very tolerant of inconsistent
+numbers of spaces. Use case: uniq -c | ssv2tsv
space2tab a better name? vaguely violates naming convention? """
import sys
import tsvutil
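The conversion itself is tiny: Python's str.split() with no argument already collapses runs of whitespace, which is what makes this tolerant of 'uniq -c' style output (a sketch of the idea, not the script's actual body):

```python
def ssv2tsv_line(line):
    # split() with no separator collapses any run of spaces/tabs
    # and drops leading/trailing whitespace (including the newline),
    # e.g. the ragged left-padding that 'uniq -c' produces.
    return "\t".join(line.split())
```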
9 tabawk
@@ -1,4 +1,13 @@
#!/usr/bin/env ruby
+$doc = %{
+tabawk [same args as awk]
+
+Wrapper around 'awk' for tab-separation for both input and output records
+}
+if ARGV.size==0 #and STDIN.tty?
+ STDERR.puts $doc.strip
+ exit 1
+end
awk_args = ARGV.clone
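What such a wrapper amounts to is setting awk's input and output field separators to a tab before handing over the user's arguments. A guess at the shape of it, in Python for illustration (the script itself is Ruby):

```python
import subprocess

def tabawk(args, **kwargs):
    # -F sets awk's input field separator; the OFS variable controls
    # what awk prints between output fields. Everything else is passed
    # through to awk unchanged.
    cmd = ["awk", "-F", "\t", "-v", "OFS=\t"] + list(args)
    return subprocess.run(cmd, **kwargs)
```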
9 tsvcat
@@ -1,6 +1,9 @@
-#!/usr/bin/env python
-""" tsvcat - concatenates TSV-with-header files, aligning columns with same
-name. can rename columns and match columns across files with different names.
+#!/usr/bin/env python2.6
+"""
+tsvcat [files]
+
+Concatenates TSV-with-header files, aligning columns with same name.
+Can rename columns and match columns across files with different names.
"""
import sys,itertools
2 tsvutil.py
@@ -1,3 +1,5 @@
+"""Miscellaneous utilities to support some of the tsvutils scripts"""
+
import sys,csv,codecs
warning_count = 0
6 xlsx2tsv
@@ -18,7 +18,9 @@ To read from a pipe in python, ironically it takes a few tricks to disable the
csv module's complicated excel-compatible default rules:
csv.DictReader(os.popen("xlsx2tsv filename.xlsx"), dialect=None, delimiter='\\t', quoting=csv.QUOTE_NONE)
-brendan o'connor - anyall.org - gist.github.com/22764
+brendan o'connor - anyall.org
+originally at gist.github.com/22764
+but new home is github.com/brendano/tsvutils
"""
#from __future__ import print_function
@@ -36,7 +38,7 @@ if args:
elif not sys.stdin.isatty():
z = zipfile.ZipFile(sys.stdin)
else:
- print __doc__.strip()
+ print>>sys.stderr, __doc__.strip()
sys.exit(1)
n=lambda x: "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}%s" % x
12 yaml2tsv
@@ -1,8 +1,20 @@
#!/usr/bin/env ruby
+$doc = %{
+yaml2tsv
+
+Takes a YAML document stream -- that is, a concatenation of YAML top-level
+objects -- and treats them as key/value records to be turned into
+tsv-with-header.
+}
require 'yaml'
require 'pp'
+if STDIN.tty?
+ STDERR.puts $doc.strip
+ exit 1
+end
+
all = []
YAML.load_documents(STDIN) do |ydoc|
all << ydoc
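Once the YAML stream is loaded, the remaining step -- turning a list of key/value records into tsv-with-header -- looks roughly like this (a Python sketch of the idea; the script itself is Ruby and may differ, e.g. in how it handles keys missing from some records):

```python
def records_to_tsv(records):
    # Header = union of keys across all records, in first-seen order;
    # a key absent from a record becomes an empty cell.
    cols = []
    for r in records:
        for k in r:
            if k not in cols:
                cols.append(k)
    lines = ["\t".join(cols)]
    for r in records:
        lines.append("\t".join(str(r.get(k, "")) for k in cols))
    return "\n".join(lines) + "\n"
```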
