more documentation

commit 02bbd47b7cb8287265343b180afeeeea0fe2a299 1 parent f8f6976
@brendano authored
24 README.md
@@ -1,7 +1,7 @@
tsvutils -- utilities for processing tab-separated files
=====================================================================
-*tsvutils* are scripts that can convert and manipulate the TSV file format: tab-separated values, sometimes with a header. They are intended to allow ad-hoc but reliable processing and summarization of tabular data, with interfaces to Excel and MySQL.
+*tsvutils* are scripts that can convert and manipulate the TSV file format: tab-separated values, sometimes with a header. They build on top of standard Unix utilities to allow ad-hoc, efficient, and reliable processing and summarization of tabular data.
github.com/brendano/tsvutils - by Brendan O'Connor - anyall.org
@@ -13,11 +13,15 @@ Convert to tsv:
* xlsx2tsv - convert from Excel's .xlsx format.
* others: eq2tsv ssv2tsv uniq2tsv yaml2tsv
-Manipulate tsv:
+Manipulate tsv; header smartness:
-* namecut - like 'cut' but with header names.
-* tsvcat - concatenate tsv's, aligning common columns.
+* tsvawk - gives you column names in your awk.
* hwrap - wraps pipeline process but preserves stdin's header.
+* tsvcat - concatenate tsv's, aligning common columns.
+* namecut - like 'cut' but with header names.
+
+Manipulate tsv; wrappers for Unix utilities:
+
* tabsort - 'sort' wrapper with tab delimiter.
* tabawk - 'awk' wrapper with tab delimiter.
@@ -26,12 +30,14 @@ Convert out of tsv:
* tsv2csv - convert tsv to Excel-compatible csv.
* tsv2my - load tsv into a new MySQL table.
* tsv2fmt - format as ASCII-art table.
+* tsv2html - format as HTML table.
+* others: tsv2yaml tsv2tex
-Here, the "tsv" file format is honest-to-goodness tab-separated values, usually with a header. No quoting, escaping, or comments. All rows should have the same number of fields. Rows end with a unix \n newline. Cell values cannot have tabs or newlines.
+By "tsv" we mean honest-to-goodness tab-separated values, often with a header. No quoting, escaping, or comments. All rows should have the same number of fields. Rows end with a unix \n newline. Cell values cannot have tabs or newlines. (If you want those things in your data, make up your own convention (like backslash escaping) and have your application be aware of it. Our philosophy is, a data processing utility should ignore that stuff in order to have safe and predictable behavior.)
These conditions are all enforced in scripts that convert to tsv. For programs that convert *out* of tsv, if these assumptions do not hold, the script's behavior is undefined.
-TSV is an easy format for other programs to handle: after removing the newline, split("\t") correctly parses a row.
+TSV is an easy format for other programs to handle: after stripping the newline, split("\t") correctly parses a row.
Note that "tail +2" or "tail -n+2" strips out a tsv file's header.
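Note on the format rules in this hunk: the README says that, after stripping the newline, split("\t") correctly parses a row. A minimal Python sketch of that claim is below. `read_tsv` is a hypothetical helper, not part of tsvutils, and it assumes the first line is a header and that the input obeys the strict rules above (no quoting, no escaping, the same number of fields in every row).

```python
import io

def read_tsv(stream):
    """Yield each row as a dict keyed by the header's column names.

    Hypothetical helper, not part of tsvutils. Assumes a header line and
    the strict tsv rules: no quoting, no escaping, equal field counts.
    """
    header_line = next(stream, None)
    if header_line is None:
        return  # empty input: no header, no rows
    # Strip only the newline; a trailing empty cell must survive.
    header = header_line.rstrip("\n").split("\t")
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(header):
            raise ValueError(
                "expected %d fields, got %d" % (len(header), len(fields)))
        yield dict(zip(header, fields))

# The second row ends with an empty cell, which is preserved as "".
rows = list(read_tsv(io.StringIO("id\tname\n1\tapple\n2\t\n")))
```

Stripping with `rstrip("\n")` rather than a bare `strip()` matters here: a bare `strip()` would also eat a trailing tab and silently drop an empty last cell.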
@@ -102,11 +108,11 @@ Loading in R:
Installation
------------
-Some of these scripts aren't very polished -- might need utf-8 fixes or something -- so you're best off just putting the entire directory on your PATH in case you need to hack up the scripts.
+It's probably useful to look at or tweak these scripts, so you're best off just putting the entire directory on your PATH.
-The philosophy of tsvutil
--------------------------
+The philosophy of tsvutils
+--------------------------
Short version:
6 csv2tsv
@@ -1,8 +1,8 @@
#!/usr/bin/env python
-"""
+r"""
Input is Excel-style CSV. Either stdin or filename.
- (We can handle Mac Excel's \\r-delimited csv)
-Output is honest-to-goodness tsv: no quoting or any \\n\\r\\t.
+ (We can handle Mac Excel's \r-delimited csv)
+Output is honest-to-goodness tsv: no quoting or any \n\r\t.
"""
#from __future__ import print_function
4 fmt2tsv
@@ -1,3 +1,5 @@
#!/bin/bash
-# kinda silly but sometimes useful
+# Converts out of the pipe-delimited (and whitespace-padded) format
+# from 'tsv2fmt'.
+# The existence of this script at all is kinda silly but sometimes useful.
perl -pe 's/^\| *//g; s/ *\|$//; s/ *\| */\t/g'
10 hwrap
@@ -4,10 +4,12 @@ hwrap [pipeline command to wrap]
Assume stdin has a header and the rest are rows.
Print header, then pass on only the rows to wrapped command's stdin.
-Useful for "sort", "grep", "head", "tail"
-and other commands that don't muck with rows' internal structure.
-If you want to wrap a command requiring shell metacharacters -- like pipe |'s
-then try: hwrap bash -c "bla | bla | bla"
+Examples:
+
+cat file_with_header.tsv | hwrap tabsort -k3
+cat file_with_header.tsv | hwrap tail
+cat file_with_header.tsv | hwrap grep bla
+cat file_with_header.tsv | hwrap bash -c "grep bla | head"
"""
import sys,os
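Note on the hwrap docstring: the behavior it describes (print the header, pass only the rows to the wrapped command) can be sketched roughly as follows. This is not the script's actual implementation, which is not shown in this diff. The function below is a hypothetical in-memory version that buffers its input, and the example command uses the running Python interpreter as a stand-in for `sort` so that it does not depend on any external tool.

```python
import subprocess
import sys

def hwrap(lines, command):
    """Return the header line followed by command's output on the other rows.

    Hypothetical sketch of the idea behind hwrap; it buffers all input,
    which a real pipeline tool would probably avoid.
    """
    lines = list(lines)
    if not lines:
        return []
    header, body = lines[0], lines[1:]
    result = subprocess.run(
        command,
        input="".join(body),   # only the rows reach the wrapped command
        capture_output=True,
        text=True,
        check=True,
    )
    return [header] + result.stdout.splitlines(keepends=True)

# Stand-in for "sort": a child Python process that sorts its stdin lines.
sort_cmd = [sys.executable, "-c",
            "import sys; sys.stdout.writelines(sorted(sys.stdin))"]
out = hwrap(["name\tcount\n", "b\t2\n", "a\t1\n"], sort_cmd)
```

The header stays first even though it would sort after the data rows, which is the whole point of the wrapper.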
6 json2tsv
@@ -1,4 +1,10 @@
#!/usr/bin/env python
+"""
+Convert a stream of JSON objects to TSV, extracting the keys
+you specify into columns.
+
+If you don't specify any keys, it tries to figure out the set of all keys.
+"""
#from __future__ import print_function
import simplejson
#import json as simplejson # py 3.0
14 ssv2tsv
@@ -1,7 +1,15 @@
#!/usr/bin/env python
-""" Space-separated fields => tab separated. very tolerant of inconsistent
-numbers of spaces. One use case: uniq -c | ssv2tsv (though uniq2tsv better)
-space2tab a better name? vaguely violates naming convention? """
+"""
+Space-separated fields => tab separated. very tolerant of inconsistent
+numbers of spaces. Examples:
+
+uniq -c | ssv2tsv
+echo id name count | ssv2tsv
+
+This really should be called 'ssv2tab' or 'space2tab' to be more in-line with
+the naming conventions elsewhere, but I personally find those names harder to
+remember.
+"""
import sys
import tsvutil
tsvutil.fix_stdio()
1  tabsort
@@ -1,3 +1,4 @@
#!/bin/bash
+# Wrapper for 'sort' with tab-delimiting.
export TAB=$(echo -e "\t")
exec sort "-t$TAB" "$@"
3  tsv2fmt
@@ -1,4 +1,7 @@
#!/usr/bin/env python
+"""
+Outputs an ASCII-art table, reminiscent of the Postgres and MySQL command-line clients.
+"""
#from __future__ import print_function
import sys
from collections import defaultdict
1  tsv2tex
@@ -1,4 +1,5 @@
#!/usr/bin/env ruby
+# Outputs to a basic TeX table format.
puts "\\hline"
for line in STDIN
11 tsvawk
@@ -1,11 +1,16 @@
#!/usr/bin/env python
"""
-Wrapper around tabawk, supplying column names as integers for their
-corresponding column numbers, so you can do things like
+Wrapper around tabawk, letting you use column names instead of positions. It
+makes column names from the header into awk global variables that are integers
+for positions, so you can do things like
tsvawk '{print $id,$name}'
+tsvawk '$count >= 5'
-Your awk script will *not* see the header line from the file.
+Your awk script will *not* see the header line from the file; this script absorbs it.
+
+The consequence is that the output doesn't get a header; you could always do
+echo c1 c2 | ssv2tsv to get it yourself, perhaps.
"""
import sys,os
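Note on the tsvawk docstring: "column names as awk global variables that are integers for positions" could be done by turning the header into a list of `-v name=position` arguments for awk, so that `$id` expands to `$1`. The sketch below shows only that mapping. `awk_var_args` is a hypothetical name, the script's real mechanism is not visible in this diff, and the sketch assumes every column name is already a valid awk identifier.

```python
def awk_var_args(header_line):
    """Map a tsv header to awk '-v name=position' arguments.

    Hypothetical sketch; assumes column names are valid awk identifiers.
    """
    names = header_line.rstrip("\n").split("\t")
    args = []
    for position, name in enumerate(names, start=1):  # awk fields start at 1
        args += ["-v", "%s=%d" % (name, position)]
    return args

args = awk_var_args("id\tname\tcount\n")
```

With these arguments, an awk program such as `$count >= 5` would compare the third field, since `count` holds 3.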
2  uniq2tsv
@@ -1,2 +1,4 @@
#!/bin/sh
+# USAGE:
+# ... | uniq -c | uniq2tsv
perl -ne 'print "$1\t$2\n" if /^ *(\d+) (.*)/ or die "doesnt look like uniq -c format: $_"'
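Note on the uniq2tsv one-liner: for readers less familiar with Perl, the same regular expression can be written in Python as below. This is an illustrative equivalent rather than a proposed replacement, and `uniq2tsv_line` is a hypothetical name.

```python
import re

# Same pattern as the Perl one-liner: optional leading spaces, a count,
# one space, then the rest of the line.
UNIQ_LINE = re.compile(r"^ *(\d+) (.*)")

def uniq2tsv_line(line):
    """Convert one line of 'uniq -c' output to 'count<TAB>value'."""
    match = UNIQ_LINE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("doesn't look like uniq -c format: %r" % line)
    return match.group(1) + "\t" + match.group(2)

converted = uniq2tsv_line("      3 apple pie\n")
```

Only the first space after the count is treated as a separator, so spaces inside the value are kept as they are.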