# README

---

This document is a companion guide and index to the command-line tools discussed in the notebooks. It lists the relevant commands (verbs) from each tool for working with tabular data in common data-analysis tasks.

Expand each section for the names of the commands that help with that task, then open the corresponding notebook for specific examples.

---

# $00 - Import, Inspect$

## `conversion`

- _from/to csv_

---

```bash
in2csv, csv2json
csvtk csv2tab, space2tab, tab2csv
xsv fmt
mlr cat

## compressed data?
mlr --prepipe
```

---
## `display`

```bash
head, tail
csvlook
csvtk pretty
xsv table
mlr head, tail
```

---
## `count`

- _rows, columns_

```bash
wc
xsv count
```
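
As a minimal sketch of counting with plain coreutils, the snippet below builds a small hypothetical file (`people.csv`, invented for illustration) and counts its data rows and columns. It assumes a single header row and no quoted fields containing commas.

```bash
# Hypothetical sample data: a header row plus three records.
workdir=$(mktemp -d)
printf 'name,age,city\nana,31,Lima\nbo,27,Oslo\ncy,45,Pune\n' > "$workdir/people.csv"

# Rows: count lines, then subtract one for the header.
total_lines=$(wc -l < "$workdir/people.csv")
rows=$((total_lines - 1))

# Columns: count comma-separated fields in the header line
# (naive split; breaks on quoted fields that contain commas).
cols=$(head -n 1 "$workdir/people.csv" | awk -F, '{ print NF }')

echo "rows=$rows cols=$cols"
rm -rf "$workdir"
```

For files with quoted fields, a CSV-aware tool such as `xsv count` is the safer choice.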

---
## `types`

- _detect types, conversion_

---

```bash
mlr put is_*
mlr put boolean, int, float, string
```

---
## `column names`

```bash
csvcut -n
xsv headers
csvtk headers
mlr label
```

# $01 - Subset$

---
## `rename`

-  _one/many columns_



```bash
csvtk rename, rename2
mlr rename
```

---
## `index`

- _create row names/identifiers_



```bash
nl
xsv index
```

---
## `subset columns` 

- _select/exclude cols_


```bash
cut
csvcut
csvtk select
xsv select
mlr cut
mlr having-fields
```
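
A small sketch of column selection with `cut`, assuming a hypothetical `people.csv` with simple, unquoted comma-separated fields:

```bash
workdir=$(mktemp -d)
printf 'name,age,city\nana,31,Lima\nbo,27,Oslo\n' > "$workdir/people.csv"

# Select columns 1 and 3 by position (naive comma split, no quoted fields).
selected=$(cut -d, -f1,3 "$workdir/people.csv")

# Exclude column 1 by keeping everything from column 2 onward.
without_first=$(cut -d, -f2- "$workdir/people.csv")

printf '%s\n' "$selected"
printf '%s\n' "$without_first"
rm -rf "$workdir"
```

`cut` only knows positions; the CSV-aware tools in the list can select by column name instead.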

---
## `subset rows` 

- _select/exclude rows_

```bash
grep
csvgrep
csvtk filter, filter2, grep
xsv search
mlr filter
```
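
Row filtering can be sketched with standard `awk` and `grep`. The file name and values below are hypothetical, and the comma split is naive (no quoted fields).

```bash
workdir=$(mktemp -d)
printf 'name,age,city\nana,31,Lima\nbo,27,Oslo\ncy,45,Pune\n' > "$workdir/people.csv"

# Keep the header (NR == 1) plus rows whose second column exceeds 30.
older=$(awk -F, 'NR == 1 || $2 > 30' "$workdir/people.csv")

# Exclude rows that mention Lima anywhere on the line.
not_lima=$(grep -v 'Lima' "$workdir/people.csv")

printf '%s\n' "$older"
printf '%s\n' "$not_lima"
rm -rf "$workdir"
```

Note that `grep` matches the whole line, so a pattern can hit an unintended column; `csvgrep` or `xsv search` can restrict the match to one column.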

---
## `sample`

- _with (bootstrap) or without replacement (permutation)_


```bash
shuf
csvtk sample
xsv sample
mlr bootstrap, sample, shuffle, decimate
```
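
The difference between the two sampling modes can be sketched with GNU `shuf` (assumed to be available; the `-r` flag is GNU-specific). The file `ids.csv` is hypothetical, and the output is random, so only the row counts are predictable.

```bash
workdir=$(mktemp -d)
printf 'id\n1\n2\n3\n4\n5\n' > "$workdir/ids.csv"

# Without replacement: 3 distinct data rows (header skipped with tail).
without=$(tail -n +2 "$workdir/ids.csv" | shuf -n 3)

# With replacement (bootstrap): 8 draws from 5 rows, so repeats must occur.
with=$(tail -n +2 "$workdir/ids.csv" | shuf -r -n 8)

printf 'without replacement:\n%s\n' "$without"
printf 'with replacement:\n%s\n' "$with"
rm -rf "$workdir"
```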

---
## `split`

- _large file into smaller files_


---

```bash
split
csplit
xsv split
```
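
A rough sketch of chunking a file with `split`, using a hypothetical `ids.csv`. The header is dropped first, so the resulting pieces contain data rows only.

```bash
workdir=$(mktemp -d)
printf 'id\n1\n2\n3\n4\n5\n' > "$workdir/ids.csv"

# Skip the header, then write chunks of at most 2 lines each.
# split names the pieces part_aa, part_ab, part_ac, ...
tail -n +2 "$workdir/ids.csv" | split -l 2 - "$workdir/part_"

n_parts=$(ls "$workdir"/part_* | wc -l)
last_part=$(cat "$workdir/part_ac")

echo "parts=$n_parts last_part=$last_part"
rm -rf "$workdir"
```

If every piece needs its own header row, `xsv split` handles that without extra scripting.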

# $02-Clean$

## `missing`


- _detect, count, replace_


---

```bash
awk
csvstat --nulls
mlr put is_null, is_not_null
```
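
One way to count and replace missing values with `awk`, sketched on a hypothetical `ages.csv`. It treats an empty field as missing and uses `NA` as an arbitrary placeholder.

```bash
workdir=$(mktemp -d)
printf 'name,age\nana,31\nbo,\ncy,45\ndee,\n' > "$workdir/ages.csv"

# Count: data rows whose second field is empty.
missing=$(awk -F, 'NR > 1 && $2 == "" { n++ } END { print n + 0 }' "$workdir/ages.csv")

# Replace: fill empty second fields with the placeholder "NA".
filled=$(awk -F, -v OFS=, 'NR > 1 && $2 == "" { $2 = "NA" } { print }' "$workdir/ages.csv")

echo "missing=$missing"
printf '%s\n' "$filled"
rm -rf "$workdir"
```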

## `duplicates`

- _identify, remove dups_

---

```bash
mlr repeat # create dups
datamash rmdup
```
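
Where `datamash` is not installed, whole-line duplicates can be handled with `sort`, `uniq`, and `awk`. This is a sketch on a hypothetical headerless file, not a replacement for key-based deduplication.

```bash
workdir=$(mktemp -d)
printf 'ana,31\nbo,27\nana,31\ncy,45\nbo,27\n' > "$workdir/rows.csv"

# Identify: lines that occur more than once (uniq needs sorted input).
dups=$(sort "$workdir/rows.csv" | uniq -d)

# Remove: keep the first occurrence of each line, preserving input order.
deduped=$(awk '!seen[$0]++' "$workdir/rows.csv")

printf 'duplicated:\n%s\n' "$dups"
printf 'deduplicated:\n%s\n' "$deduped"
rm -rf "$workdir"
```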

# $03-Mutate$

---

## `mutate`

- _create/drop rows, cols_


```bash
awk
mlr put
csvtk mutate
```
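
As an illustration of creating a derived column with `awk`, the sketch below appends a `total` column to a hypothetical `items.csv`. Column names and values are made up, and fields are assumed to be unquoted.

```bash
workdir=$(mktemp -d)
printf 'item,qty,price\npen,2,3\nink,4,5\n' > "$workdir/items.csv"

# Append a header for the new column, then qty * price for each data row.
mutated=$(awk -F, -v OFS=, '
  NR == 1 { print $0, "total"; next }
          { print $0, $2 * $3 }
' "$workdir/items.csv")

printf '%s\n' "$mutated"
rm -rf "$workdir"
```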

---

## `format`

- _numerical formatting_

```bash
numfmt
datamash round, ceil, floor, trunc, frac
```

---

## `time conversions`

-  _from/to epoch_

```bash
mlr put with strftime, strptime
mlr sec2gmt, sec2gmtdate
```

---

## `functions`

- _apply, map_


```bash
awk
mlr put

```

---

## `discretize`


- _cut numerics into categoricals_


---

```bash
datamash bin
mlr histogram --nbins
```

---

## `reshape`


- _long/wide to wide/long_


```bash
paste
csvtk transpose
datamash transpose

# pandas-like reshape
mlr reshape
```

---

# $04-Merge/Join/Concat$

---
## `join`

- _merge tables_


```bash
join
csvtk join
xsv join
```
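
A minimal sketch of inner and left joins with coreutils `join`. It assumes two hypothetical headerless files whose first field is the key and which are already sorted on that key, as `join` requires.

```bash
workdir=$(mktemp -d)
printf '1,ana\n2,bo\n3,cy\n' > "$workdir/names.csv"
printf '1,Lima\n3,Pune\n4,Rome\n' > "$workdir/cities.csv"

# Inner join on the first field of both files.
inner=$(join -t, "$workdir/names.csv" "$workdir/cities.csv")

# Left join: -a 1 also prints unmatched lines from the first file.
left=$(join -t, -a 1 "$workdir/names.csv" "$workdir/cities.csv")

printf 'inner:\n%s\n' "$inner"
printf 'left:\n%s\n' "$left"
rm -rf "$workdir"
```

`csvtk join` and `xsv join` work on column names and do not need pre-sorted input.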

---
## `concat`

- _append/concat tables_


```bash
cat
csvstack
xsv cat

# concat when cols are not same
mlr unsparsify
```

---
## `compare, intersect`

- _rows in A & B, rows in A not in B etc._


```bash
comm
csvtk intersect
```
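
The set operations above can be sketched with `comm`, which compares whole lines and expects both inputs to be sorted. The files `a.txt` and `b.txt` are hypothetical.

```bash
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/a.txt"
printf 'b\nc\nd\n' > "$workdir/b.txt"

# -12 suppresses columns 1 and 2, leaving lines common to both files.
in_both=$(comm -12 "$workdir/a.txt" "$workdir/b.txt")

# -23 leaves lines only in A; -13 leaves lines only in B.
only_a=$(comm -23 "$workdir/a.txt" "$workdir/b.txt")
only_b=$(comm -13 "$workdir/a.txt" "$workdir/b.txt")

echo "both: $(echo $in_both) | only A: $only_a | only B: $only_b"
rm -rf "$workdir"
```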

# $05-Explore$ 

## `aggregate`

- _group-by, pivot_


```bash
datamash
mlr
```
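
For a sense of what a group-by does, here is a sketch of a grouped sum in plain `awk` on a hypothetical `sales.csv`. `datamash` and `mlr` express the same idea more compactly.

```bash
workdir=$(mktemp -d)
printf 'city,amount\nLima,10\nOslo,5\nLima,7\n' > "$workdir/sales.csv"

# Accumulate amount per city; awk's array order is unspecified, so sort after.
totals=$(awk -F, '
  NR > 1 { sum[$1] += $2 }
  END    { for (k in sum) print k "," sum[k] }
' "$workdir/sales.csv" | sort)

printf '%s\n' "$totals"
rm -rf "$workdir"
```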

---
## `sort`

```bash
sort
csvsort
```
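
Plain `sort` is not header-aware, so a common workaround is to print the header separately. The sketch below sorts a hypothetical `people.csv` numerically on its second column.

```bash
workdir=$(mktemp -d)
printf 'name,age\nana,31\nbo,27\ncy,45\n' > "$workdir/people.csv"

# Emit the header untouched, then sort the remaining rows by column 2 (numeric).
sorted=$({
  head -n 1 "$workdir/people.csv"
  tail -n +2 "$workdir/people.csv" | sort -t, -k2,2n
})

printf '%s\n' "$sorted"
rm -rf "$workdir"
```

`csvsort` handles the header and quoted fields on its own.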

---
## `uniques`



```bash
uniq
csvtk uniq
```

---
## `frequencies`


```bash
uniq -c
csvtk freq
xsv frequency
mlr top
mlr least-frequent, most-frequent
mlr fraction # convert frequencies to percentages
```
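
The classic `sort | uniq -c` idiom can be sketched as follows on a hypothetical single-column file. The final `awk` step only reshapes the counts into `value,count` form and assumes values contain no spaces.

```bash
workdir=$(mktemp -d)
printf 'city\nLima\nOslo\nLima\nPune\nLima\nOslo\n' > "$workdir/cities.csv"

# Skip the header, group identical values, count them, most frequent first.
freq=$(tail -n +2 "$workdir/cities.csv" \
  | sort | uniq -c | sort -rn \
  | awk '{ print $2 "," $1 }')

printf '%s\n' "$freq"
rm -rf "$workdir"
```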

---
## `crosstabs`


```bash
datamash crosstab
```

---

# $06-Analyze/Visualize$

---
## `univariate`

- _mean/stddev, median, percentiles, skewness/kurtosis, mode, min/max_

```bash
csvstat
csvtk stats, stats2
xsv stats
mlr stats1, stats2
datamash
```
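
A few of these statistics can be computed in one `awk` pass. This is a sketch on a hypothetical single-column file `x.csv` with a header row; it covers only count, mean, min, and max.

```bash
workdir=$(mktemp -d)
printf 'x\n2\n4\n9\n' > "$workdir/x.csv"

# Initialise min/max from the first data row, then update on every row.
summary=$(awk '
  NR == 2 { min = max = $1 + 0 }
  NR > 1  { v = $1 + 0; sum += v; n++
            if (v < min) min = v
            if (v > max) max = v }
  END     { printf "n=%d mean=%.2f min=%d max=%d\n", n, sum / n, min, max }
' "$workdir/x.csv")

echo "$summary"
rm -rf "$workdir"
```

Median, percentiles, and higher moments are easier to get from `datamash` or `mlr stats1`.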

---
## `bivariate`

- _correlation/covariance, regression, r-squared_

```bash
mlr stats2
```

---
## `visualize`

- _histograms, scatterplots_

```bash
csvtk plot
mlr bar
```

# $07-Advanced$

<br><br><br><br>

---
## `generate`


- _random data_

```bash
seq, shuf, pr
mlr seqgen
mlr put "urand(), urandint()"
```

---
## `query`


- _run sql queries_

```bash
csvsql
q -H -d, """query"""

# generate a CREATE TABLE query for your csv
# super useful when pushing data into a local db (mysql, postgresql etc.)
csvsql -i sqlite joined.csv
```

---
<br><br><br><br>

# $Appendix$

---

Expand the sections below to read through the help pages of these tools and for a list of the most frequently used verbs.

## `datamash`

---


```bash
Primary operations:
  groupby, crosstab, transpose, reverse, check
  
Line-Filtering operations:
  rmdup

Per-Line operations:
  base64, debase64, md5, sha1, sha256, sha512,
  bin, strbin, round, floor, ceil, trunc, frac

Numeric Grouping operations:
  sum, min, max, absmin, absmax

Textual/Numeric Grouping operations:
  count, first, last, rand, unique, collapse, countunique

Statistical Grouping operations:
  mean, median, q1, q3, iqr, mode, antimode, pstdev, sstdev, pvar,
  svar, mad, madraw, pskew, sskew, pkurt, skurt, dpo, jarque,
  scov, pcov, spearson, ppearson
```

!datamash --help

Usage: datamash [OPTION] op [fld] [op fld ...]

Performs numeric/string operations on input from stdin.

'op' is the operation to perform.  If a primary operation is used,
it must be listed first, optionally followed by other operations.
'fld' is the input field to use.  'fld' can be a number (1=first field),
or a field name when using the -H or --header-in options.
Multiple fields can be listed with a comma (e.g. 1,6,8).  A range of
fields can be listed with a dash (e.g. 2-8).  Use colons for operations
which require a pair of fields (e.g. 'pcov 2:6').


Primary operations:
  groupby, crosstab, transpose, reverse, check
Line-Filtering operations:
  rmdup
Per-Line operations:
  base64, debase64, md5, sha1, sha256, sha512,
  bin, strbin, round, floor, ceil, trunc, frac
Numeric Grouping operations:
  sum, min, max, absmin, absmax
Textual/Numeric Grouping operations:
  count, first, last, rand, unique, collapse, countunique
Statistical Grouping operations:
  mean,

## `mlr`

---


```bash
Verbs

   bar bootstrap cat check count-distinct cut decimate filter grep group-by
   group-like having-fields head histogram join label least-frequent
   merge-fields most-frequent nest nothing fraction put regularize rename
   reorder repeat reshape sample sec2gmt sec2gmtdate seqgen shuffle sort stats1
   stats2 step tac tail tee top uniq unsparsify

   Use "mlr {verb} -h" for help

Functions (for the `filter` and `put` verbs)

    # arithmetic, logical, conditional operators
   + + - - * / // % ** | ^ & ~ << >> == != =~ !=~ > >= < <= && || ^^ ! ? : .
   
   # string functions
   gsub strlen sub substr tolower toupper   
   
   # math
   abs ceil floor log log10 log1p
   max min msub exp pow qnorm  
   sgn sqrt cbrt
   
   # random numbers
   urand urand32 urandint 
 
   # date, time
   dhms2fsec dhms2sec fsec2dhms fsec2hms
   gmt2sec hms2fsec hms2sec sec2dhms sec2gmt sec2gmt sec2gmtdate sec2hms
   strftime strptime systime 
   
   # booleans
   is_absent is_bool is_boolean is_empty is_empty_map
   is_float is_int is_map is_nonempty_map is_not_empty is_not_map is_not_null
   is_null is_numeric is_present is_string 
  
   # type-conversion, rounding, formatting
   boolean float int string
   round roundm
   fmtnum hexfmt 
   
Use "mlr --help-function {function name}" for function-specific help.
```

!mlr -h

Usage: mlr [I/O options] {verb} [verb-dependent options ...] {zero or more file names}

Command-line-syntax examples:
  mlr --csv cut -f hostname,uptime mydata.csv
  mlr --tsv --rs lf filter '$status != "down" && $upsec >= 10000' *.tsv
  mlr --nidx put '$sum = $7 < 0.0 ? 3.5 : $7 + 2.1*$8' *.dat
  grep -v '^#' /etc/group | mlr --ifs : --nidx --opprint label group,pass,gid,member then sort -f group
  mlr join -j account_id -f accounts.dat then group-by account_name balances.dat
  mlr --json put '$attr = sub($attr, "([0-9]+)_([0-9]+)_.*", "\1:\2")' data/*.json
  mlr stats1 -a min,mean,max,p10,p50,p90 -f flag,u,v data/*
  mlr stats2 -a linreg-pca -f u,v -g shape data/*
  mlr put -q '@sum[$a][$b] += $x; end {emit @sum, "a", "b"}' data/*
  mlr --from estimates.tbl put '
  for (k,v in $*) {
    if (is_numeric(v) && k =~ "^[t-z].*$") {
      $sum += v; $count += 1
    }
  }
  $mean = $sum / $count # no assignment if count unset'
  mlr --from infile.dat put -f analyze.mlr
 

## `csvtk`

---

!csvtk

A cross-platform, efficient and practical CSV/TSV toolkit

Version: 0.7.1

Author: Wei Shen <shenwei356@gmail.com>

Documents  : http://shenwei356.github.io/csvtk
Source code: https://github.com/shenwei356/csvtk

Attention:

    1. The CSV parser requires all the lines have same number of fields/columns.
       Even lines with spaces will cause error.
    2. By default, csvtk thinks your files have header row, if not, switch flag "-H" on.
    3. Column names better be unique.
    4. By default, lines starting with "#" will be ignored, if the header row
       starts with "#", please assign flag "-C" another rare symbol, e.g. '$'.
    5. By default, csvtk handles CSV files, use flag "-t" for tab-delimited files.
    6. If " exists in tab-delimited files, use flag "-l".

Usage:
  csvtk [command]

Available Commands:
  csv2md      convert CSV to markdown format
  csv2tab     convert CSV to tabular format
  cut         select parts of fields
  filter      filter 

## `xsv`

---

!xsv

xsv is a suite of CSV command line utilities.

Please choose one of the following commands:
    cat         Concatenate by row or column
    count       Count records
    fixlengths  Makes all records have same length
    flatten     Show one field per line
    fmt         Format CSV output (change field delimiter)
    frequency   Show frequency tables
    headers     Show header names
    help        Show this usage message.
    index       Create CSV index for faster access
    input       Read CSV data with special quoting rules
    join        Join CSV files
    sample      Randomly sample CSV data
    search      Search CSV data with regexes
    select      Select columns from CSV
    slice       Slice records from CSV
    sort        Sort CSV data
    split       Split CSV data into many files
    stats       Compute basic statistics
    table       Align CSV data into columns

