more documentation

commit 02bbd47b7cb8287265343b180afeeeea0fe2a299 (1 parent: f8f6976)
Authored by Brendan O'Connor
README.md (24 changes)

@@ -1,7 +1,7 @@
 tsvutils -- utilities for processing tab-separated files
 =====================================================================
 
-*tsvutils* are scripts that can convert and manipulate the TSV file format: tab-separated values, sometimes with a header. They are intended to allow ad-hoc but reliable processing and summarization of tabular data, with interfaces to Excel and MySQL.
+*tsvutils* are scripts that can convert and manipulate the TSV file format: tab-separated values, sometimes with a header. They build on top of standard Unix utilities to allow ad-hoc, efficient, and reliable processing and summarization of tabular data.
 
 github.com/brendano/tsvutils - by Brendan O'Connor - anyall.org
 
@@ -13,11 +13,15 @@ Convert to tsv:
 * xlsx2tsv - convert from Excel's .xlsx format.
 * others: eq2tsv ssv2tsv uniq2tsv yaml2tsv
 
-Manipulate tsv:
+Manipulate tsv; header smartness:
 
-* namecut - like 'cut' but with header names.
-* tsvcat - concatenate tsv's, aligning common columns.
+* tsvawk - gives you column names in your awk.
 * hwrap - wraps pipeline process but preserves stdin's header.
+* tsvcat - concatenate tsv's, aligning common columns.
+* namecut - like 'cut' but with header names.
+
+Manipulate tsv; wrappers for Unix utilities:
+
 * tabsort - 'sort' wrapper with tab delimiter.
 * tabawk - 'awk' wrapper with tab delimiter.
 
@@ -26,12 +30,14 @@ Convert out of tsv:
 * tsv2csv - convert tsv to Excel-compatible csv.
 * tsv2my - load tsv into a new MySQL table.
 * tsv2fmt - format as ASCII-art table.
+* tsv2html - format as HTML table.
+* others: tsv2yaml tsv2tex
 
-Here, the "tsv" file format is honest-to-goodness tab-separated values, usually with a header. No quoting, escaping, or comments. All rows should have the same number of fields. Rows end with a unix \n newline. Cell values cannot have tabs or newlines.
+By "tsv" we mean honest-to-goodness tab-separated values, often with a header. No quoting, escaping, or comments. All rows should have the same number of fields. Rows end with a unix \n newline. Cell values cannot have tabs or newlines. (If you want those things in your data, make up your own convention (like backslash escaping) and have your application be aware of it. Our philosophy is that a data processing utility should ignore that stuff in order to have safe and predictable behavior.)
 
 These conditions are all enforced in scripts that convert to tsv. For programs that convert *out* of tsv, if these assumptions do not hold, the script's behavior is undefined.
 
-TSV is an easy format for other programs to handle: after removing the newline, split("\t") correctly parses a row.
+TSV is an easy format for other programs to handle: after stripping the newline, split("\t") correctly parses a row.
 
 Note that "tail +2" or "tail -n+2" strips out a tsv file's header.
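The split("\t") recipe above can be sketched in a few lines of Python (the data values here are made up for illustration):

```python
def read_tsv(lines):
    """Parse honest-to-goodness TSV: strip the newline, then split on tabs."""
    rows = [line.rstrip("\n").split("\t") for line in lines]
    return rows[0], rows[1:]  # header, body -- rows[1:] is the "tail -n+2" part

header, body = read_tsv(["id\tname\n", "1\talice\n", "2\tbob\n"])
print(header)  # ['id', 'name']
print(body)    # [['1', 'alice'], ['2', 'bob']]
```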
37 43
@@ -102,11 +108,11 @@ Loading in R:
102 108 Installation
103 109 ------------
104 110
105   -Some of these scripts aren't very polished -- might need utf-8 fixes or something -- so you're best off just putting the entire directory on your PATH in case you need to hack up the scripts.
  111 +It's probably useful to look at or tweak these scripts, so you're best off just putting the entire directory on your PATH.
106 112
107 113
108   -The philosophy of tsvutil
109   --------------------------
  114 +The philosophy of tsvutils
  115 +--------------------------
110 116
111 117 Short version:
112 118
csv2tsv (6 changes)

@@ -1,8 +1,8 @@
 #!/usr/bin/env python
-"""
+r"""
 Input is Excel-style CSV. Either stdin or filename.
- (We can handle Mac Excel's \\r-delimited csv)
-Output is honest-to-goodness tsv: no quoting or any \\n\\r\\t.
+ (We can handle Mac Excel's \r-delimited csv)
+Output is honest-to-goodness tsv: no quoting or any \n\r\t.
 """
 
 #from __future__ import print_function
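The conversion the docstring describes can be sketched with Python's standard csv module. This is not the actual csv2tsv implementation, just the core idea; rejecting embedded tabs and newlines follows the format rules stated in the README:

```python
import csv
import io

def csv_to_tsv(csv_text):
    """Re-emit Excel-style CSV as tsv: no quoting, no tabs or newlines in cells."""
    out_rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        for cell in row:
            if any(c in cell for c in "\t\n\r"):
                raise ValueError("cell not representable in tsv: %r" % cell)
        out_rows.append("\t".join(row))
    return "\n".join(out_rows) + "\n"

# Quoted commas survive; the comma inside the quoted cell is not a delimiter.
print(csv_to_tsv('id,name\n1,"O\'Connor, Brendan"\n'), end="")
```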
fmt2tsv (4 changes)

@@ -1,3 +1,5 @@
 #!/bin/bash
-# kinda silly but sometimes useful
+# Converts out of the pipe-delimited (and whitespace-padded) format
+# from 'tsv2fmt'.
+# The existence of this script at all is kinda silly but sometimes useful.
 perl -pe 's/^\| *//g; s/ *\|$//; s/ *\| */\t/g'
hwrap (10 changes)

@@ -4,10 +4,12 @@ hwrap [pipeline command to wrap]
 
 Assume stdin has a header and the rest are rows.
 Print header, then pass on only the rows to wrapped command's stdin.
-Useful for "sort", "grep", "head", "tail"
-and other commands that don't muck with rows' internal structure.
-If you want to wrap a command requiring shell metacharacters -- like pipe |'s
-then try: hwrap bash -c "bla | bla | bla"
+Examples:
+
+cat file_with_header.tsv | hwrap tabsort -k3
+cat file_with_header.tsv | hwrap tail
+cat file_with_header.tsv | hwrap grep bla
+cat file_with_header.tsv | hwrap bash -c "grep bla | head"
 """
 
 import sys,os
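The header-preserving behavior described above can be sketched as follows. The real hwrap streams rows through to the wrapped command; buffering the whole body, as done here, is a simplification:

```python
import subprocess

def hwrap(cmd, lines):
    """Emit the first line (the header) untouched; run cmd over only the body rows."""
    header, body = lines[0], lines[1:]
    result = subprocess.run(cmd, input="".join(body), capture_output=True, text=True)
    return header + result.stdout

# The header stays on top even though 'sort' would otherwise move it.
out = hwrap(["sort"], ["id\tname\n", "2\tbob\n", "1\talice\n"])
print(out, end="")
```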
json2tsv (6 changes)

@@ -1,4 +1,10 @@
 #!/usr/bin/env python
+"""
+Convert a stream of JSON objects to TSV, extracting the keys
+you specify into columns.
+
+If you don't specify any keys, it tries to figure out the set of all keys.
+"""
 #from __future__ import print_function
 import simplejson
 #import json as simplejson # py 3.0
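A minimal sketch of the idea (using the stdlib json module rather than simplejson; the key autodetection mentioned in the docstring is omitted):

```python
import json

def json_stream_to_tsv(lines, keys):
    """One JSON object per input line; one tsv row per object, header row first."""
    rows = ["\t".join(keys)]
    for line in lines:
        obj = json.loads(line)
        # Missing keys become empty cells.
        rows.append("\t".join(str(obj.get(k, "")) for k in keys))
    return "\n".join(rows) + "\n"

print(json_stream_to_tsv(['{"id": 1, "name": "alice"}'], ["id", "name"]), end="")
```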
ssv2tsv (14 changes)

@@ -1,7 +1,15 @@
 #!/usr/bin/env python
-""" Space-separated fields => tab separated. very tolerant of inconsistent
-numbers of spaces. One use case: uniq -c | ssv2tsv (though uniq2tsv better)
-space2tab a better name? vaguely violates naming convention? """
+"""
+Space-separated fields => tab-separated. Very tolerant of inconsistent
+numbers of spaces. Examples:
+
+uniq -c | ssv2tsv
+echo id name count | ssv2tsv
+
+This really should be called 'ssv2tab' or 'space2tab' to be more in line with
+the naming conventions elsewhere, but I personally find those names harder to
+remember.
+"""
 import sys
 import tsvutil
 tsvutil.fix_stdio()
tabsort (1 change)

@@ -1,3 +1,4 @@
 #!/bin/bash
+# Wrapper for 'sort' with tab-delimiting.
 export TAB=$(echo -e "\t")
 exec sort "-t$TAB" "$@"
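The wrapper's whole job is to hand 'sort' a literal tab as its field delimiter; a usage sketch (the data here is made up):

```shell
# Sort tsv rows by their 2nd field; -t"$(printf '\t')" is what the
# tabsort wrapper supplies for you via its TAB variable.
printf '1\tb\n2\ta\n' | sort -t"$(printf '\t')" -k2,2
```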
tsv2fmt (3 changes)

@@ -1,4 +1,7 @@
 #!/usr/bin/env python
+"""
+Outputs an ASCII-art table, reminiscent of the Postgres and MySQL command lines.
+"""
 #from __future__ import print_function
 import sys
 from collections import defaultdict
tsv2tex (1 change)

@@ -1,4 +1,5 @@
 #!/usr/bin/env ruby
+# Outputs a basic TeX table format.
 
 puts "\\hline"
 for line in STDIN
tsvawk (11 changes)

@@ -1,11 +1,16 @@
 #!/usr/bin/env python
 """
-Wrapper around tabawk, supplying column names as integers for their
-corresponding column numbers, so you can do things like
+Wrapper around tabawk, letting you use column names instead of positions. It
+turns the column names from the header into awk global variables holding their
+integer positions, so you can do things like
 
 tsvawk '{print $id,$name}'
+tsvawk '$count >= 5'
 
-Your awk script will *not* see the header line from the file.
+Your awk script will *not* see the header line from the file; this script absorbs it.
+
+The consequence is that the output doesn't get a header; you could always do
+echo c1 c2 | ssv2tsv to get it yourself, perhaps.
 """
 
 import sys,os
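The name-to-position mapping described in the docstring can be sketched like this (a hypothetical helper; the real tsvawk's mechanics may differ, but awk's -v var=val flag is the standard way to pass such variables in):

```python
def awk_name_args(header_line):
    """Turn each header name into an awk -v assignment of its 1-based position."""
    args = []
    for i, name in enumerate(header_line.rstrip("\n").split("\t"), start=1):
        args += ["-v", "%s=%d" % (name, i)]
    return args

print(awk_name_args("id\tname\tcount\n"))
# ['-v', 'id=1', '-v', 'name=2', '-v', 'count=3']
```

With those assignments in place, an awk body like '{print $id,$name}' resolves $id to $1 and $name to $2.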
uniq2tsv (2 changes)

@@ -1,2 +1,4 @@
 #!/bin/sh
+# USAGE:
+# ... | uniq -c | uniq2tsv
 perl -ne 'print "$1\t$2\n" if /^ *(\d+) (.*)/ or die "doesnt look like uniq -c format: $_"'
