Pipe-friendly command line tools for processing CSV files with headers. The design goals of the tools are:
- ease of use - fields are given by name instead of index
- interoperability within the package through pipes
- minimal interface (i.e. exclusive use of standard input/output wherever possible)
select - select, reorder and rename fields in a csv stream

rmfields - remove fields from a csv stream

extract_map - extract a group of fields into a new file with an id
- only distinct values are stored
- the output will receive the id field
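The extract-map behaviour described above can be sketched in Python; this is an in-memory toy, not the tool's actual implementation, and the function and field names are illustrative:

```python
def extract_map(rows, fields, id_field="id"):
    """Replace the given fields in each row with an id and collect the
    distinct extracted value combinations (a sketch of the behaviour
    described above; names are illustrative, not the tool's real API)."""
    seen = {}              # value combination -> assigned id
    main, mapping = [], []
    for row in rows:
        key = tuple(row[f] for f in fields)
        if key not in seen:            # only distinct values are stored
            seen[key] = len(seen) + 1
            mapping.append({id_field: seen[key], **dict(zip(fields, key))})
        out = {k: v for k, v in row.items() if k not in fields}
        out[id_field] = seen[key]      # the output receives the id field
        main.append(out)
    return main, mapping
```

Repeated values of the extracted fields are stored once in the mapping, and each main-stream row refers to them through the id.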
csv2tsv - convert a csv stream to a tsv stream

tsv2csv - convert a tsv stream to a csv stream
csv_to_postgres - create and populate a postgres table:

    csv... | csv_to_postgres new-table-name | psql -q [connection options]

All fields are created as VARCHAR NOT NULL. Possible NULL values in the csv (unquoted empty strings) are imported as empty strings.
Possible future improvements:
- serial primary key column
- per-field data types defined by a config file (a field-to-data-type map)
- custom index/indices (including primary key, unique constraint)
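The table creation described above could look roughly like the following sketch, which derives a CREATE TABLE statement from a csv header. Identifier quoting is simplified, and `create_table_sql` is an illustrative name, not the tool's API:

```python
import csv
import io

def create_table_sql(table, header):
    # Every field becomes VARCHAR NOT NULL, as described above.
    cols = ", ".join('"{}" VARCHAR NOT NULL'.format(name) for name in header)
    return 'CREATE TABLE "{}" ({});'.format(table, cols)

# Read the header from a csv stream and build the statement from it.
header = next(csv.reader(io.StringIO("name,city\nalice,london\n")))
sql = create_table_sql("people", header)
```

The generated SQL can then be piped to `psql -q` together with the statements that populate the table.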
unzip - split a csv file into two by columns, in a reversible way
- standard input: csv stream with header
- parameters:
  - fields
  - name of the file to receive the fields not explicitly specified
  - (optional) zip-id field name: --id=zip-id, defaults to id
- standard output: csv stream with the zip-id field and the specified fields
- file whose name was given as parameter: csv file with the zip-id field and the fields not on stdout
TBD
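The reversible split can be sketched in memory as follows; a generated row number stands in for the zip-id, and the names are illustrative, not the tool's implementation:

```python
def unzip(rows, fields, id_field="id"):
    """Split each row into two rows sharing a generated zip-id, so the
    split can be reversed by joining on that id (an in-memory sketch of
    the behaviour above; names are illustrative)."""
    kept, rest = [], []
    for i, row in enumerate(rows, 1):
        # Specified fields go to one side, the remaining fields to the
        # other; both sides carry the zip-id.
        kept.append({id_field: i, **{f: row[f] for f in fields}})
        rest.append({id_field: i, **{k: v for k, v in row.items() if k not in fields}})
    return kept, rest
```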
zip - join two csv files by a common sorted id field - the reverse of an unzip
Note: by default the field joined on is also removed.
- standard input: csv stream with header
- parameters:
  - name of the other file to join with
  - (optional) --keep-id to keep the join field in the output
  - (optional) --rm to remove the other file after the join
The field to join on is given implicitly, as the only common field name.
- standard output: joined csv stream
TBD
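The join can be sketched in memory like this; the real tool works on sorted streams, while here a dict index stands in for the merge, and the names are illustrative:

```python
def zip_join(left, right, keep_id=False):
    """Join two row lists on their single common field name; by default
    the join field is removed from the result, as described above."""
    common = (set(left[0]) & set(right[0])).pop()  # the only shared field name
    index = {row[common]: row for row in right}
    out = []
    for row in left:
        merged = {**row, **index[row[common]]}
        if not keep_id:
            del merged[common]       # default: drop the field joined on
        out.append(merged)
    return out
```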
split - split standard input into file chunks of a given size
- standard input: csv stream with header
- parameters:
  - output chunk size (in data rows)
  - output file prefix
- standard output: nothing
- output: files named prefix.N (the output file prefix, a dot and the file number)

Split a csv stream with header into files. Each file has the same header as the input and contains exactly the given number of data rows; only the last output file may contain fewer than the chunk size.
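The chunking rule above (full chunks, a possibly smaller last chunk) can be sketched as:

```python
import itertools

def chunks(rows, size):
    """Yield lists of at most `size` rows; only the last chunk may be
    smaller, matching the description above."""
    it = iter(rows)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk
```

Each chunk would then be written to `prefix.N` together with a copy of the input header.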
divide - split standard input into exactly the given number of equal-sized files
concatenate - the reverse of split
Concatenate the inputs so that redundant headers are skipped: only the first header is written.
- parameters:
  - input file or input file prefix
  - (optional) further input files or input file prefixes
  - ...
The inputs are either file names or prefixes. An input is a file if it exists as a file; it is a prefix if it does not exist as a file but prefix.0 does. When the input is a prefix, the files to be concatenated are all files named prefix.N, where a file exists for every non-negative integer N less than their number. Thus it is possible to give multiple explicit file names to concatenate, but it is also possible to give only a prefix for a whole series of files.
- standard output: concatenated csv stream
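The input-resolution rule above can be sketched like this; the `exists` parameter is injected so the sketch is testable without real files:

```python
import os

def resolve_inputs(names, exists=os.path.exists):
    """Expand each argument: keep it if it is an existing file, otherwise
    treat it as a prefix and expand to prefix.0, prefix.1, ... for as
    long as those files exist (mirrors the rules described above)."""
    resolved = []
    for name in names:
        if exists(name):
            resolved.append(name)
        else:
            n = 0
            while exists("{}.{}".format(name, n)):
                resolved.append("{}.{}".format(name, n))
                n += 1
    return resolved
```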
weave - the reverse of divide
Planned tools are under development; existing tools are in production use, even if only invoked manually and rarely.

This documentation still needs more content, improvement and formatting.
Currently the tools can be invoked from anywhere by commands like

    python -m csvtools.tool tool-arguments

where tool is one of the tools above: select, rmfields, ...
In the future the tools will become console scripts, so that they can be invoked like

    csv_select arguments
    csv_rmfields arguments
    csv_extract_map arguments
    csv_split arguments
    csv_divide arguments
    csv_concatenate arguments
    csv_cat arguments
    csv_weave arguments
    csv2tsv
    tsv2csv