Permalink
Fetching contributors…
Cannot retrieve contributors at this time
216 lines (156 sloc) 11.9 KB

csvpys

Description

Run arbitrary python expression on each CSV row to compute value of a new column. It is possible to compute many new columns at once using -s option multiple times:

usage: csvpys [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
          [-p ESCAPECHAR] [-z MAXFIELDSIZE] [-e ENCODING] [-v] [-l]
          [--zero] [-n] [-s NEW_COLUMN PY_SCRIPT]
          [FILE]

Python scripting in CSV files. Run arbitrary python expression on each CSV row
to compute value of a new column.

positional arguments:
  FILE                  The CSV file to operate on. If omitted, will accept
                        input on STDIN.

optional arguments:

  -n, --names           Display column names and indices from the input CSV
                        and exit.
  -s NEW_COLUMN PY_SCRIPT, --script NEW_COLUMN PY_SCRIPT
                        New column name and a python script.". Inside script
                        use c[0] or ch['header'] locals.

Also see: :doc:`common_arguments`.

Scripting language

There is no custom scripting language. Just use plain python expressions. Internally expressions are evaluated using python standard eval() function (they are also precompiled for efficiency). Python expression can use following locals and globals:

  • Locals references two variables representing current row:
    • c - list of column values indexed with column id (default: 1-based indexing, with --zero enabled indexing is 0-based),
    • ch - dict with values accessible via columns name.
  • Globals are supplied with following modules:
    • re and Rex (rex is regular expressions for humans, specially written by experiences using csvkit scripting),
    • math,
    • random,
    • collections.

Note

Remember that both c and ch variables have unicode type. Need to be converted if used in other contexts than strings.

Examples

For following data:

$ csvcut -c 1,4,5 examples/realdata/FY09_EDU_Recipients_by_State.csv | csvlook
|--------------------+--------------------------------+----------------------------------------|
|  State Name        | Montgomery GI Bill-Active Duty | Montgomery GI Bill- Selective Reserve  |
|--------------------+--------------------------------+----------------------------------------|
|  ALABAMA           | 6,718                          | 1,728                                  |
|  ALASKA            | 776                            | 154                                    |
|  ARIZONA           | 26,822                         | 2,005                                  |
|  ARKANSAS          | 2,061                          | 988                                    |
|  CALIFORNIA        | 34,942                         | 2,987                                  |
|  COLORADO          | 10,389                         | 914                                    |
|  CONNECTICUT       | 1,771                          | 490                                    |

    ...                ...                              ...

|  WYOMING           | 686                            | 212                                    |
|  PUERTO RICO       | 822                            | 1,107                                  |
|                    |                                |                                        |
|--------------------+--------------------------------+----------------------------------------|

Let's sum values for both Montgomery GI Bill:

$ csvcut -c 1,4,5 examples/realdata/FY09_EDU_Recipients_by_State.csv | csvpys -s TOTAL "int(c[2].replace(',', '') if c[2] != '' else 0) + int(c[3].replace(',', '') if c[3] != '' else 0)" | csvlook
|--------------------+--------------------------------+---------------------------------------+--------|
|  State Name        | Montgomery GI Bill-Active Duty | Montgomery GI Bill- Selective Reserve | TOTAL  |
|--------------------+--------------------------------+---------------------------------------+--------|
|  ALABAMA           | 6,718                          | 1,728                                 | 8446   |
|  ALASKA            | 776                            | 154                                   | 930    |
|  ARIZONA           | 26,822                         | 2,005                                 | 28827  |
|  ARKANSAS          | 2,061                          | 988                                   | 3049   |
|  CALIFORNIA        | 34,942                         | 2,987                                 | 37929  |
|  COLORADO          | 10,389                         | 914                                   | 11303  |
|  CONNECTICUT       | 1,771                          | 490                                   | 2261   |

    ...                ...                              ...

|  WYOMING           | 686                            | 212                                   | 898    |
|  PUERTO RICO       | 822                            | 1,107                                 | 1929   |
|                    |                                |                                       | 0      |
|--------------------+--------------------------------+---------------------------------------+--------|

Note

Expression int(c[2].replace(',', '') if c[2] != '' else 0) converts a string e.g. "4,156" or empty string to a proper int value (4156), by removing semicolon and casting to int or returning 0 on empty string.

The same using column names:

$ csvcut -c 1,4,5 examples/realdata/FY09_EDU_Recipients_by_State.csv | csvpys -s TOTAL "int(ch['Montgomery GI Bill-Active Duty'].replace(',', '') if ch['Montgomery GI Bill-Active Duty'] != '' else 0) + int(ch['Montgomery GI Bill- Selective Reserve'].replace(',', '') if ch['Montgomery GI Bill- Selective Reserve'] != '' else 0)" | csvlook

Other example, let's play with data:

$ csvcut -c 1,8 examples/realdata/FY09_EDU_Recipients_by_State.csv | csvlook
|--------------------+------------------------------------------------------------|
|  State Name        | Post-Vietnam Era Veteran's Educational Assistance Program  |
|--------------------+------------------------------------------------------------|
|  ALABAMA           | 8                                                          |
|  ALASKA            | 2                                                          |
|  ARIZONA           | 11                                                         |
|  ARKANSAS          | 3                                                          |
|  CALIFORNIA        | 48                                                         |
|  COLORADO          | 10                                                         |
|  CONNECTICUT       | 4                                                          |

    ...                ...

|  WYOMING           | 1                                                          |
|  PUERTO RICO       | 3                                                          |
|                    |                                                            |
|--------------------+------------------------------------------------------------|

The task is to classify as True all states that have a value greater or equal than 10 in second column:

$ csvcut -c 1,8 examples/realdata/FY09_EDU_Recipients_by_State.csv | csvpys -s Classify "bool(int(c[2])>=10) if c[2] != '' else ''" | csvlook
|--------------------+-----------------------------------------------------------+-----------|
|  State Name        | Post-Vietnam Era Veteran's Educational Assistance Program | Classify  |
|--------------------+-----------------------------------------------------------+-----------|
|  ALABAMA           | 8                                                         | False     |
|  ALASKA            | 2                                                         | False     |
|  ARIZONA           | 11                                                        | True      |
|  ARKANSAS          | 3                                                         | False     |
|  CALIFORNIA        | 48                                                        | True      |
|  COLORADO          | 10                                                        | True      |
|  CONNECTICUT       | 4                                                         | False     |

    ...                ...                                                          ...

|  WYOMING           | 1                                                         | False     |
|  PUERTO RICO       | 3                                                         | False     |
|                    |                                                           |           |
|--------------------+-----------------------------------------------------------+-----------|

Note

If statement is only needed because we need to deal with last line which has empty string ''.

OK, within the last example we will calculate number of A's in state names:

$ csvcut -c 1 examples/realdata/FY09_EDU_Recipients_by_State.csv | csvpys -s "A letter count" "collections.Counter(c[1])['A']" | csvlook
|--------------------+-----------------|
|  State Name        | A letter count  |
|--------------------+-----------------|
|  ALABAMA           | 4               |
|  ALASKA            | 3               |
|  ARIZONA           | 2               |
|  ARKANSAS          | 3               |
|  CALIFORNIA        | 2               |
|  COLORADO          | 1               |
|  CONNECTICUT       | 0               |

    ...                 ...

|  WYOMING           | 0               |
|  PUERTO RICO       | 0               |
|                    | 0               |
|--------------------+-----------------|

Regular expressions can also be very useful in scripting. You can use very simple module named Rex, because standard python re implementation is not very continence to use in online expressions. For more information on Rex please refer to documentation.

Let's use already nice regular expressions support of csvkit, and grep columns to leave only states with started with "NEW":

$ csvcut -c 1-3 examples/realdata/FY09_EDU_Recipients_by_State.csv | csvgrep -c 1 -r "^NEW" | csvlook
|----------------+------------------+-------|
|  State Name    | State Abbreviate | Code  |
|----------------+------------------+-------|
|  NEW HAMPSHIRE | NH               | 33    |
|  NEW JERSEY    | NJ               | 34    |
|  NEW MEXICO    | NM               | 35    |
|  NEW YORK      | NY               | 36    |
|----------------+------------------+-------|

Let's say we would like to extract the second part of the state name and concatenate it with the state abbreviate (e.g. for NEW YORK => YORK (NY)). Of course it could be done in plenty of ways, but we would like to use regular expressions for that:

$ csvcut -c 1-3 examples/realdata/FY09_EDU_Recipients_by_State.csv | csvgrep -c 1 -r "^NEW" | csvpys -s "New name" "(c[1] == rex('/^new (.*)$/i'))[1] + ' (' + c[2] + ')'" | csvlook
|----------------+------------------+------+-----------------|
|  State Name    | State Abbreviate | Code | New name        |
|----------------+------------------+------+-----------------|
|  NEW HAMPSHIRE | NH               | 33   | HAMPSHIRE (NH)  |
|  NEW JERSEY    | NJ               | 34   | JERSEY (NJ)     |
|  NEW MEXICO    | NM               | 35   | MEXICO (NM)     |
|  NEW YORK      | NY               | 36   | YORK (NY)       |
|----------------+------------------+------+-----------------|

Yes, it's that simple and powerful!