
first commit

0 parents commit 2251b070bb65eb55338db2565360e240e559d947 @brendano committed Mar 2, 2009
Showing with 454 additions and 0 deletions.
  1. +78 −0 README.md
  2. +45 −0 csv2tsv
  3. +22 −0 json2tsv
  4. +24 −0 lamecut
  5. +26 −0 namecut
  6. +8 −0 ssv2tsv
  7. +3 −0 tabawk
  8. +3 −0 tabsort
  9. +12 −0 tsv2csv
  10. +64 −0 tsv2my
  11. +18 −0 tsvutil.py
  12. +147 −0 xlsx2tsv
  13. BIN xlsx_test/ragged.xlsx
  14. +2 −0 xlsx_test/s.xml
  15. +2 −0 xlsx_test/ss.xml
78 README.md
@@ -0,0 +1,78 @@
+tsvutils -- utilities for processing tab-separated files
+=====================================================================
+
+*tsvutils* is a set of scripts for converting and manipulating the TSV file format: tab-separated values, sometimes with a header. They are intended to allow ad-hoc but reliable processing and summarization of tabular data.
+
+[github.com/brendano/tsvutils](http://github.com/brendano/tsvutils) - Brendan O'Connor - anyall.org
+
+Convert to tsv:
+
+* csv2tsv - convert Excel-compatible csv to tsv.
+* json2tsv - convert JSON array of records to tsv.
+* ssv2tsv - convert space-separated values to tsv.
+* xlsx2tsv - convert Excel's .xlsx format to tsv.
+
+Manipulate tsv:
+
+* namecut - like 'cut' but with header names.
+* tabsort - 'sort' wrapper with tab delimiter
+* tabawk - 'awk' wrapper with tab delimiter
+
+Convert out of tsv:
+
+* tsv2csv - convert tsv to Excel-compatible csv.
+* tsv2my - load tsv into a new MySQL table.
+
+Here, the "tsv" file format is honest-to-goodness tab-separated values, usually with a header. No quoting, escaping, or comments. All rows should have the same number of fields. Rows end with a unix \n newline. Cell values cannot have tabs or newlines.
+
+These conditions are all enforced in scripts that convert to tsv. For programs that convert *out* of tsv, the behavior is undefined if these assumptions do not hold.
+
+TSV is an easy format for other programs to handle: after removing the newline, split("\t") correctly parses a row.
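That rule can be sketched as a tiny strict-TSV reader (hypothetical illustration code, not part of tsvutils):

```python
# A minimal strict-TSV parser per the rule above: strip the trailing
# newline, split on tabs, and insist every row has the same width.
# Because cells can never contain tabs or newlines, this naive split
# is already a complete parser -- which is the point of the format.
def parse_tsv(lines):
    rows = [line.rstrip("\n").split("\t") for line in lines]
    width = len(rows[0])
    for i, row in enumerate(rows):
        if len(row) != width:
            raise ValueError("row %d has %d fields, expected %d"
                             % (i, len(row), width))
    return rows
```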
+
+Note that "tail +2" or "tail -n+2" strips off a tsv file's header. A common pattern is to preserve the header while manipulating the rows. For example, to sort a file:
+
+ $ (head -1 file; tail +2 file | tabsort -k2) > outfile
+
+Weak naming convention: programs that don't work well with headers call that format "tab"; ones that either need a header or are agnostic call that "tsv". E.g., for tabsort you don't want to sort the header, but tsv2my is impossible without the header. csv2tsv and tsv2csv are agnostic, since a csv file may or may not have a header.
+
+The TSV format is intended to work with many other pipeline-friendly programs. Examples include:
+
+* cat, head, tail, tail -n+X, cut, merge, diff, comm, sort, uniq, uniq -c, wc -l
+* perl -pe, ruby -ne, awk, sed, tr
+* echo 'select a,b from bla' | mysql
+* echo -e 'a\tb'; echo 'select a,b from bla' | sqlite3 -separator "$(echo -e '\t')"
+* echo -e 'a\tb'; echo 'select a,b from bla' | psql -tqAF "$(echo -e '\t')"
+* [shuffle][1]
+* [md5sort][2]
+* [setdiff][3]
+* [pv][4]
+* (GUI) Excel: copy-and-paste cells <-> text as tsv
+* (GUI) Web browsers: copy rendered HTML table -> text as tsv
+
+[1]: http://www.w3.org/People/Bos/Shuffle
+[2]: http://gist.github.com/22959
+[3]: http://gist.github.com/22958
+[4]: http://www.ivarch.com/programs/pv.shtml
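To illustrate the pipeline style with only standard tools (inline data stands in for a real tsv file here):

```shell
# Count the values in column 2 of a headered tsv: drop the header,
# cut the column, then the classic sort | uniq -c.
printf 'name\tcity\nana\tNYC\nbob\tLA\ncal\tNYC\n' \
  | tail -n +2 \
  | cut -f2 \
  | sort \
  | uniq -c
```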
+
+
+Installation
+------------
+
+Lots of these scripts aren't very polished -- needing fixes for python 2.5 vs 2.6's utf-8 and the like -- so you're best off just putting the entire directory on your PATH in case you need to hack up the scripts.
+
+
+The philosophy of tsvutil
+-------------------------
+
+Short version:
+
+These utilities are good at data munging back and forth between MySQL and Excel.
+
+Long version:
+
+There are many data processing and analysis situations where data consists of tables. A "table" is a list of flat records with identical sets of named attributes, where it's easy to manipulate a particular attribute across all records -- a "column". The main data structures in SQL, R, and Excel are tables. A more complex alternative is to encode data in arbitrarily nested structures (XML, JSON). Their flexibility makes them error-prone to process, and ad-hoc querying is generally difficult if not impossible. A simpler alternative is a table with positional, un-named columns, but then it's hard to remember which column is which. Tables with named columns hit a sweet spot of maintainability and simplicity.
+
+But SQL databases and Excel spreadsheets are often inconvenient data management environments compared to the filesystem on the unix commandline. Unfortunately, the most common file format for tables is CSV, which is complex and has several incompatible versions. It plays only moderately nicely with the unix commandline, which is the best ad-hoc processing tool for filesystem data. Often the only correct way to handle CSV is to use a parsing library, but it's inconvenient to fire up a python/perl/ruby session just to do simple sanity checks and casually inspect data.
+
+To balance these needs, so far I've found that TSV-with-headers is the most convenient canonical format for table data in the filesystem/commandline environment. It's also good as a lingua franca intermediate format in shell pipelines. These utilities are just a little bit of glue to make this TSV play nicely with Excel, MySQL, and Unix tools. Interfaces in and out of other table-centric environments could easily be added.
+
45 csv2tsv
@@ -0,0 +1,45 @@
+#!/usr/bin/env python2.6
+"""
+Input is Excel-style CSV. Either stdin or filename.
+ (We handle Mac Excel's csv which is \\r-delimited)
+Output is honest-to-goodness tsv: no quoting or any \\n\\r\\t.
+"""
+
+from __future__ import print_function
+import csv, sys
+
+
+from tsvutil import cell_text_clean, warning
+
+def clean_row(row):
+ return [cell_text_clean(x) for x in row]
+ #return [x.replace("\n"," ").replace("\t"," ").replace("\r"," ") for x in row]
+ #print row
+ #return [x.encode('utf-8').replace("\n"," ").replace("\t"," ").replace("\r"," ") for x in row]
+ #return [x.replace("\n"," ").replace("\t"," ").replace("\r"," ").encode('utf-8') for x in row]
+
+args = sys.argv[:]
+args.pop(0)
+if len(args)==1:
+ reader = csv.reader(open(args[0],'U'))
+elif len(args) > 1:
+ raise Exception("No support for multiple files yet")
+ # could try to enforce conformity, or meld them together, etc.
+elif not sys.stdin.isatty():
+ reader = csv.reader(sys.stdin)
+else:
+ print(__doc__.strip())
+ sys.exit(1)
+
+header = reader.next()
+print(*clean_row(header), sep="\t")
+
+for row in reader:
+ if len(row) < len(header):
+ # warning("Row with %d values is too short; padding with %d blanks" % (len(row),len(header)-len(row)))
+ row += [''] * (len(header) - len(row))
+ print(*clean_row(row), sep="\t")
+
+
+
+
22 json2tsv
@@ -0,0 +1,22 @@
+#!/usr/bin/env python2.6
+from __future__ import print_function
+import simplejson
+#import json as simplejson # py 3.0
+import sys,re
+
+json = simplejson.load(sys.stdin)
+
+assert len(json)>0
+item1 = json[0]
+keys = item1.keys()
+keys.sort()
+
+BAD = re.compile("[\r\n\t]")
+
+def clean_cell(x):
+ if x is None: return ""
+ return BAD.sub(" ", unicode(x))
+
+for row in json:
+    print(*[clean_cell(row.get(k)) for k in keys], sep="\t")  # .get() tolerates records missing a key
+
24 lamecut
@@ -0,0 +1,24 @@
+#!/usr/bin/env python
+""" like 'cut' but assumes a header and you specify the columns you want by
+name instead of position"""
+import sys
+
+names = []
+for x in sys.argv:
+ if x.startswith('-f'):
+ name = x[2:]
+ if not name: raise Exception("need column name")
+ names.append(name)
+#print names
+
+input = sys.stdin
+header = input.readline()[:-1].split("\t")
+if any(n not in header for n in names):
+ raise Exception("all specified names must be in the header")
+name_pos = [header.index(n) for n in names]
+
+print "\t".join([header[i] for i in name_pos])
+for line in input:
+ pieces = line[:-1].split("\t")
+ out = [pieces[i] for i in name_pos]
+ print "\t".join(out)
26 namecut
@@ -0,0 +1,26 @@
+#!/usr/bin/env ruby
+
+# usage: give list of column names or indexes to extract (like sql select)
+# assumes first line is a header
+
+colnames = $stdin.readline.chomp.split("\t")
+cols = ARGV
+problems = []
+col_inds = cols.map { |c|
+ if c =~ /^\d+$/
+ c.to_i
+ else
+ c = $1 if c =~ /^-f(.*)/
+ raise "don't support cut's full -f syntax yet" if c =~ /,/
+ colnames.index(c) or problems << c
+ end
+}
+if problems.size > 0
+ problems.each{|problem| $stderr.puts "No column with name: #{problem}"}
+ exit -1
+end
+
+$stdin.each do |line|
+ parts = line.chomp.split("\t")
+ puts col_inds.map{|i| parts[i]}.join("\t")
+end
8 ssv2tsv
@@ -0,0 +1,8 @@
+#!/usr/bin/env python
+""" Space-separated fields => tab separated. very tolerant of inconsistent numbers of spaces. Use case: uniq -c | ssv2tsv
+space2tab a better name? vaguely violates naming convention? """
+import sys
+import tsvutil
+
+for line in sys.stdin:
+ print "\t".join([tsvutil.cell_text_clean(x) for x in line.split()])
3 tabawk
@@ -0,0 +1,3 @@
+#!/bin/bash
+export TAB=$(echo -e "\t")
+exec awk "-F$TAB" "$@"
3 tabsort
@@ -0,0 +1,3 @@
+#!/bin/bash
+export TAB=$(echo -e "\t")
+exec sort "-t$TAB" "$@"
12 tsv2csv
@@ -0,0 +1,12 @@
+#!/usr/bin/env python
+import csv,sys
+
+def tsv_reader(f):
+ #return csv.DictReader(f, dialect=None,delimiter="\t",quoting=csv.QUOTE_NONE)
+ return csv.reader(f, dialect=None,delimiter="\t",quoting=csv.QUOTE_NONE)
+
+w = csv.writer(sys.stdout) # encoding issue
+for row in tsv_reader(sys.stdin):
+ w.writerow(row)
+
+
64 tsv2my
@@ -0,0 +1,64 @@
+#!/usr/bin/env python
+"""
+USAGES
+ tsv2my mydb.temptable < data.tsv
+    xlsx2tsv data.xlsx | tsv2my mydb.temptable
+Loads TSV-with-header data into a new mysql table.
+All columns are created as TEXT.
+Uses strict TSV - no quoting or comments, and no tabs or newlines in values.
+"""
+import sys,os,re
+from collections import defaultdict
+
+if len(sys.argv) == 1:
+ print "need DBNAME.TABLENAME argument"
+ sys.exit(1)
+db_table_spec = sys.argv[1]
+assert '.' in db_table_spec
+db_name,table_name = db_table_spec.split('.')
+
+def uniq_c(seq):
+ ret = defaultdict(lambda:0)
+ for x in seq:
+ ret[x] += 1
+ return dict(ret)
+
+input = sys.stdin
+header = input.readline()
+columns = [re.sub('[,:; -]','_',h.strip()) for h in header[:-1].split("\t")]
+dups = set(col for col,count in uniq_c(columns).items() if count>1)
+#print dups
+if dups:
+ dup_counts = defaultdict(lambda:0)
+ for i,col in enumerate(columns):
+ if col in dups:
+ dup_counts[col] += 1
+ columns[i] = "%s%d" % (col, dup_counts[col])
+#print columns
+#print len(columns)
+assert len(set(columns)) == len(columns)
+
+import MySQLdb
+conn = MySQLdb.connect(user='root', db=db_name)
+curs = conn.cursor()
+
+curs.execute("drop table if exists `%s`" % table_name)
+
+max_size = 2000
+sql = "create table `%s` (" % table_name
+# sql += ",".join( "`%s` varchar(%d)" % (c,max_size) for c in columns )
+sql += ",".join( "`%s` text" % c for c in columns )
+sql += ")"
+curs.execute(sql)
+
+
+insert_sql = "insert into %s values (%s)" % (table_name, ",".join(["%s"] * len(columns)))
+#print insert_sql
+for line in input:
+ values = line[:-1].split("\t")
+ #print values
+ #print len(values)
+ # if any(len(v)>max_size for v in values):
+ # print "Warning, value truncated"
+ curs.execute(insert_sql, values)
+
18 tsvutil.py
@@ -0,0 +1,18 @@
+import sys
+
+warning_count = 0
+warning_max = 20
+def warning(s):
+ global warning_count
+ warning_count += 1
+ if warning_count > warning_max: return
+ print>>sys.stderr, "WARNING: %s" % s
+
+def cell_text_clean(text):
+ s = text.encode("utf8")
+ if "\t" in s: warning("Clobbering embedded tab")
+ if "\n" in s: warning("Clobbering embedded newline")
+ if "\r" in s: warning("Clobbering embedded carriage return")
+ s = s.replace("\t"," ").replace("\n"," ").replace("\r"," ")
+ return s
+