
# Section 1. Manipulating files and directories

## Basic commands: `pwd`, `ls`

- `pwd` — print working directory

- `ls`: list the contents of a directory
  - `ls -l`: long listing that shows permissions, owner, size, and modification time for each item
  - `ls -RF`: recursive listing of the current directory, with type markers
  - `ls -RF directoryA`, `ls -FR directoryA`, `ls -F -R directoryA`, `ls -R -F directoryA`: all equivalent; flags can be combined or given separately, in any order
    - `-R`: recursive, list everything below a directory
    - `-F`: print a `/` after the name of every directory and a `*` after the name of every runnable program

## Basic commands: Directory

`.` — the current directory

`..` — the directory above the one I’m currently in

`~` — your home directory

## Basic commands: `cp`, `mv`

- `cp` — copy

  - `cp fileA fileB` —  copy fileA to fileB

  - `cp fileA fileB fileC NewDirectory` —  copy fileA fileB fileC to NewDirectory

- `mv`: move or rename (same syntax as `cp`)
  - `mv fileA fileB`: rename fileA to fileB (if fileB is an existing directory, fileA is moved into it)
  - `mv fileA fileB fileC NewDirectory`: move fileA, fileB, and fileC into NewDirectory

**Caution: both `cp` and `mv` overwrite an existing destination file without warning. Add `-i` to be asked before overwriting.**
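The overwrite behaviour can be checked safely in a scratch directory. This is a minimal sketch; the file names `report.txt`, `draft.txt`, and `draft2.txt` are made up for illustration.

```shell
# Work in a throwaway directory so nothing real is touched
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo "old report" > report.txt
echo "new draft" > draft.txt

# cp silently replaces the contents of an existing destination
cp draft.txt report.txt
after_cp=$(cat report.txt)

# Guard: only copy when the destination does not exist yet
echo "another draft" > draft2.txt
if [ -e report.txt ]; then
    echo "report.txt exists, not overwriting"
else
    cp draft2.txt report.txt
fi
after_guard=$(cat report.txt)
echo "$after_cp / $after_guard"
```

The `[ -e file ]` test is one simple way to protect a file in a script; interactively, `cp -i` and `mv -i` ask before overwriting.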

## Basic commands: `rm`, `rmdir`, `mkdir`

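- `rm`: remove files
  - `rm fileA fileB`
  - `rm -r dir1`: delete a directory and everything inside it, recursively (also works on an empty directory)
  - `rm -rf /`: force deletion of everything under the root directory. **Never run this**; it is listed only as a warning (GNU `rm` refuses it unless `--no-preserve-root` is given)
  - `rm -vrf dir1`: verbose, prints the name of each item as it is deleted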

See more details here at https://www.cyberciti.biz/faq/how-to-remove-non-empty-directory-in-linux/

**Use quotes (`"..."`) when a directory path (absolute or relative) or name contains spaces.**

**Caution: `rm` does not use a trash folder; once a file is removed, it is gone for good.**

- `rmdir` — remove an empty directory
  - `rmdir directoryA`
- `mkdir` — make a directory
  - `mkdir directoryA`
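The difference between `rmdir` and `rm -r` can be tried out in a scratch directory. A minimal sketch, with `directoryA` and `fileA` as placeholder names:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir directoryA
touch directoryA/fileA

# rmdir refuses to remove a directory that still has content
if rmdir directoryA 2>/dev/null; then
    rmdir_result="removed"
else
    rmdir_result="refused"
fi

# rm -r removes the directory together with its contents
rm -r directoryA
if [ -e directoryA ]; then still_there=yes; else still_there=no; fi
echo "rmdir: $rmdir_result, directory still there: $still_there"
```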



# Section 2. Manipulating data

## `'cat'` (Show file contents, concatenate)

- `cat` — concatenate, show file contents
  - `cat fileA`

## `'less'` (show file contents)

- `less`: show file contents one page at a time
  - `less fileA fileB`

- While viewing files in `less`:
  - space: page down
  - `:n`: move to the next file
  - `:p`: go back to the previous file
  - `:q` (or just `q`): quit

## `'head'` (display first few lines of a file)
- `head` — display first 10 lines of a file
  - `head fileA`

## `'tail'` (display last few lines of a file)

- `tail`: display the last 10 lines of a file
  - `tail -n 1 fileA`: display the last line of fileA
  - `tail -n +2 fileA`: display fileA from the 2nd line to the end (skips the first line)

## Command-line flag
- `head -n 3 fileA`: display the first $n$ lines of a file ($n = 3$ here)
  - `-n`: number of lines
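The `-n` variants of `head` and `tail` can be compared on a small file. This is a sketch using a made-up four-line CSV as a stand-in for real data:

```shell
datafile=$(mktemp)
printf 'Date,Tooth\n2017-01-25,wisdom\n2017-02-19,canine\n2017-02-24,canine\n' > "$datafile"

first_two=$(head -n 2 "$datafile")    # header plus first data row
last_one=$(tail -n 1 "$datafile")     # only the final line
no_header=$(tail -n +2 "$datafile")   # everything from line 2 onward

echo "$first_two"
echo "$last_one"
echo "$no_header"
```

`tail -n +2` is a common way to drop a header row before passing data on to another command.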

## `'man'` (manual)

- `man`: manual
- `man head`

`man` automatically invokes `less`, so you may need to press the spacebar to page through the information
and `q` (or `:q`) to quit.

The one-line description under NAME tells you briefly what the command does, and the summary
under SYNOPSIS lists all the flags it understands.

Anything that is optional is shown in square brackets `[...]`, either/or alternatives are separated by `|`,
and things that can be repeated are shown by `...`. So `head`'s manual page is telling you that you can
either give a line count with `-n` or a byte count with `-c`, and that you can give it any number of filenames.

## `'cut'` (Select Columns)

- `cut -f 2-5,8 -d , fileA.csv`
  - `-f 2-5,8`: fields, select columns 2 through 5 and column 8
  - `-d ,`: delimiter, use `,` as the field separator (without `-d`, the separator is a tab)

- `cut -f2 -d , fileA.csv` and `cut -f 2 -d , fileA.csv` are the same: the space between `-f` and `2` is optional.
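A small runnable illustration of column selection follows. The three-column CSV is invented for the example (the `Count` column is an assumption, not part of the `seasonal` data).

```shell
csv=$(mktemp)
printf 'Date,Tooth,Count\n2017-01-25,wisdom,2\n2017-02-19,canine,1\n' > "$csv"

# Column 2 only
teeth=$(cut -d , -f 2 "$csv")

# Columns 1 and 3, joined again with the same delimiter
date_count=$(cut -d , -f 1,3 "$csv")

echo "$teeth"
echo "$date_count"
```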

## `'history'`, `'!'` (Retrieve past commands)

- `history` — print a list of commands you have run recently
- `!some_command`: run the most recent command that starts with `some_command` again (e.g. `!cut`)
  - `!2`: run command number 2 as listed by `history`

## `'grep'` (Find Matches of Pattern in file)

- `grep patternA fileA`: print the lines of `fileA` that contain `patternA`
  - `-c`: print a count of matching lines rather than the lines themselves
  - `-h`: do not print the names of files when searching multiple files
  - `-i`: ignore case (e.g., treat “Regression” and “regression” as matches)
  - `-l`: print the names of files that contain matches, not the matches
  - `-n`: print line numbers for matching lines
  - `-v`: invert the match, i.e., only show lines that don’t match
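The flags above can be tried on a tiny file. This is a sketch with made-up rows; `autumn.csv` here is a stand-in, not the real `seasonal/autumn.csv`.

```shell
dir=$(mktemp -d)
printf 'Date,Tooth\n2017-02-01,molar\n2017-03-11,Canine\n2017-05-25,molar\n' > "$dir/autumn.csv"

match_count=$(grep -c molar "$dir/autumn.csv")     # number of lines containing "molar"
numbered=$(grep -n -v molar "$dir/autumn.csv")     # non-matching lines, with line numbers
any_case=$(grep -i canine "$dir/autumn.csv")       # also matches "Canine"

echo "$match_count"
echo "$numbered"
echo "$any_case"
```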



---



---



`grep molar seasonal/autumn.csv`

2017-02-01,molar

2017-05-25,molar


---



`grep -nv molar seasonal/spring.csv`

1:Date,Tooth

2:2017-01-25,wisdom

3:2017-02-19,canine

...

22:2017-08-13,incisor

23:2017-08-13,wisdom


---



---



`grep -c incisor seasonal/autumn.csv seasonal/winter.csv`

seasonal/autumn.csv:3

seasonal/winter.csv:6

## `'paste'` (Combine data files)

- `paste`: combine data files side by side, joining corresponding lines



---



---



- `paste -d , seasonal/autumn.csv seasonal/winter.csv`

Date,Tooth,Date,Tooth

2017-01-05,canine,2017-01-03,bicuspid

2017-01-17,wisdom,2017-01-05,incisor

2017-01-18,canine,2017-01-21,wisdom

...

2017-08-16,canine,2017-07-01,incisor,

2017-07-17,canine,2017-08-10,incisor
...

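**The last few rows have the wrong number of columns** because the two files have different numbers of lines, and `paste` leaves the fields of the shorter file empty once it runs out.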
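The column mismatch is easy to reproduce with two files of different lengths. A minimal sketch, using invented files `short.txt` and `long.txt`:

```shell
dir=$(mktemp -d)
printf 'a1\na2\n' > "$dir/short.txt"
printf 'b1\nb2\nb3\n' > "$dir/long.txt"

pasted=$(paste -d , "$dir/short.txt" "$dir/long.txt")
echo "$pasted"

# Once short.txt runs out, its column is empty, so the row starts with a comma
first_row=$(echo "$pasted" | head -n 1)
last_row=$(echo "$pasted" | tail -n 1)
```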

# Section 3: Combining tools



## `'>'` (Store output in a file) 

- `some_command > new_file`
  - eg. `tail -n 5 seasonal/winter.csv > last.csv`
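A short sketch of redirection in action. The file contents are made up, and `winter.csv` and `last.csv` live in a scratch directory here.

```shell
dir=$(mktemp -d)
printf 'Date,Tooth\n2017-01-03,bicuspid\n2017-01-05,incisor\n2017-01-21,wisdom\n' > "$dir/winter.csv"

# Nothing is printed: the output goes into last.csv instead of the screen
tail -n 2 "$dir/winter.csv" > "$dir/last.csv"
saved=$(cat "$dir/last.csv")

# Redirecting to the same file again replaces its previous contents
head -n 1 "$dir/winter.csv" > "$dir/last.csv"
replaced=$(cat "$dir/last.csv")

echo "$saved"
echo "$replaced"
```

Because `>` replaces the target file, it deserves the same care as `cp` and `mv`.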

## ` '|' ` (Combine/Pipeline commands)

- `command_1 | command_2 | …`
  - `cut -f 2 -d , seasonal/summer.csv | grep -v Tooth`

 - Command 1: select field (`-f`) `2`, using comma as the delimiter (`-d ,`), from the file `seasonal/summer.csv`
 - Command 2: invert the match (`-v`), printing only the lines of Command 1's output that do **not** contain `Tooth`

---
canine

wisdom

bicuspid
...



## `'wc'` (Wordcount)

- `wc`: word count, prints the number of `lines`, `words`, and `bytes` in a file, in that order. You can make it print
only one of these using `-l`, `-w`, or `-c` respectively (`-m` counts characters rather than bytes).

  - `grep 2017-07 seasonal/spring.csv | wc -l`
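Counting matching rows with `grep` and `wc -l` can be sketched as follows. The rows are invented, and `tr -d ' '` is only there because some systems pad the number with spaces.

```shell
f=$(mktemp)
printf 'Date,Tooth\n2017-07-10,incisor\n2017-07-10,wisdom\n2017-08-13,wisdom\n' > "$f"

# Rows from July 2017
july_rows=$(grep 2017-07 "$f" | wc -l | tr -d ' ')

# Reading from standard input makes wc print the number without the file name
total_lines=$(wc -l < "$f" | tr -d ' ')

echo "$july_rows of $total_lines lines are from 2017-07"
```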

## `'*'`: (wildcards, match >=0 chars)

- `*`, which means “match zero or more characters”.



---


`head -n 3 seasonal/s*.csv`


---


==> seasonal/spring.csv <==

Date,Tooth

2017-01-25,wisdom

2017-02-19,canine

==> seasonal/summer.csv <==

Date,Tooth

2017-01-11,canine

2017-01-18,wisdom


## `'?'`, `'[]'`, `'{}'` (wildcards with more control) 

- `?` matches a single character,
  - so `201?.txt` will match `2017.txt` or `2018.txt`, but not `2017-01.txt`.

- `[...]` matches any one of the characters inside the square brackets,
  - so `201[78].txt` matches `2017.txt` or `2018.txt`, but not `2016.txt`.
- `{...}` matches any of the comma-separated patterns inside the curly brackets,
  - so `{*.txt,*.csv}` matches any file whose name ends with `.txt` or `.csv`, but not files whose names end with `.pdf`.
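One way to see what a pattern matches is to give it to `echo`, since the shell expands wildcards before the command runs. A sketch with empty, made-up files (only `?` and `[...]` are shown):

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
touch 2016.txt 2017.txt 2018.txt 2017-01.txt

# ? stands for exactly one character, so 2017-01.txt is not matched
q_match=$(echo 201?.txt)

# [78] stands for one character that is either 7 or 8
b_match=$(echo 201[78].txt)

echo "$q_match"
echo "$b_match"
```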

## `'sort'` (sort lines in csv)
- `sort`: sort the lines of its input; by default in ascending alphabetical order
  - `-n` and `-r` can be used to sort numerically and reverse the order of its output
  - `-b` tells it to ignore leading blanks
  - `-f` tells it to fold case (i.e., be case-insensitive)



---
`sort -r seasonal/summer.csv`


---


Date,Tooth

2017-08-04,canine

2017-08-03,bicuspid

2017-08-02,canine
...
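The difference between the default alphabetical order and `-n` shows up as soon as numbers have different lengths. A minimal sketch with three made-up values:

```shell
f=$(mktemp)
printf '10\n9\n100\n' > "$f"

alpha=$(sort "$f")                  # compares character by character: "100" sorts before "9"
numeric=$(sort -n "$f")             # compares by numeric value
reverse_numeric=$(sort -n -r "$f")  # numeric, largest first

echo "$alpha"
echo "$numeric"
echo "$reverse_numeric"
```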


## `'uniq'` (Remove duplicated lines **after sort**)

- `uniq`: remove **adjacent** duplicated lines


---


If a file contains lines **already sorted with duplicates bunched together**:

2017-07-03

2017-07-03

2017-08-03

2017-08-03


---
Then `uniq` will produce:


---
2017-07-03

2017-08-03


---
but if it contains lines **NOT sorted with duplicates bunched together**:

2017-07-03

2017-08-03

2017-07-03

2017-08-03

then `uniq` will **print all four lines**.


The reason is that `uniq` is built to work with very large files. 

In order to remove non-adjacent duplicate lines
from a file, it would have to keep the whole file in memory (or at least, all the unique lines seen so
far).

By only removing adjacent duplicates, it only has to keep the most recent unique line in memory.
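The adjacency rule can be checked with the four unsorted dates from above. This is a sketch; the `sed` step only strips the padding that `uniq -c` puts in front of each count.

```shell
f=$(mktemp)
printf '2017-07-03\n2017-08-03\n2017-07-03\n2017-08-03\n' > "$f"

# No two identical lines are adjacent, so uniq removes nothing
unsorted_lines=$(uniq "$f" | wc -l | tr -d ' ')

# Sorting first bunches the duplicates together
sorted_lines=$(sort "$f" | uniq | wc -l | tr -d ' ')

# uniq -c prefixes each distinct line with the number of times it occurred
counts=$(sort "$f" | uniq -c | sed 's/^ *//')

echo "$unsorted_lines lines without sorting, $sorted_lines after sorting"
echo "$counts"
```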


---



---



- get the second column from `seasonal/winter.csv`,
- remove the header word `Tooth` from the output so that only tooth names are displayed
- `sort` the output so that all occurrences of a particular tooth name are **adjacent**
- display each tooth name once, along with a count of how often it occurs (`uniq -c`)


---

`cut -d , -f 2 seasonal/winter.csv | grep -v Tooth | sort | uniq -c`


---



---


4 bicuspid

7 canine

6 incisor

4 molar
 
4 wisdom


## Combination of Commands with `'|'`

`wc -l seasonal/*.csv`


---


21 seasonal/autumn.csv

24 seasonal/spring.csv 

25 seasonal/summer.csv

26 seasonal/winter.csv

96 total



---


`wc -l seasonal/*.csv | grep -v total`


---


21 seasonal/autumn.csv

24 seasonal/spring.csv

25 seasonal/summer.csv

26 seasonal/winter.csv


---


`wc -l seasonal/*.csv | grep -v total | sort -n | head -n 1`


---


21 seasonal/autumn.csv

# Section 4: Batch Processing




## Environment Variables


<table>
<tr>
<th>Variable</th>
<th>Purpose</th>
<th>Value</th>
</tr>
<tr>
<td>HOME</td>
<td>User’s home directory</td>
<td>/home/repl</td>
</tr>
<tr>
<td>PWD</td>
<td>Present working directory</td>
<td>Same as pwd command</td>
</tr>
<tr>
<td>SHELL</td>
<td>Which shell program is being used</td>
<td>/bin/bash</td>
</tr>
<tr>
<td>USER</td>
<td>User’s ID</td>
<td>repl</td>
</tr>
</table>

To get a complete list (which is quite long), you can type `set` in the shell.

e.g. `HISTFILESIZE` determines how many old commands are stored in your command history.


---

`set | grep HISTFILESIZE`


---



---


HISTFILESIZE=2000

## Print variable - `'echo $variable_name'`

`echo` — prints its arguments.

To get the variable’s value, you must put a dollar sign `$` in front of it.

This is true everywhere: to get the value of a variable called X, you must write `$X`. (This is so that the
shell can tell whether you mean “a file named X” or “the value of a variable named X”.)

`echo $OSTYPE`


---


linux-gnu




## Shell variable - `'variable_name=value'`

To create a shell variable, you simply assign a value to a name without any spaces before or after
the `=` sign.

`testing=seasonal/winter.csv`

`head -n 1 $testing`
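A runnable version of the same idea follows, using a scratch copy of a two-line file (the rows are made up). Quoting `"$testing"` is an addition here; it keeps the command working if the path contains spaces.

```shell
dir=$(mktemp -d)
printf 'Date,Tooth\n2017-01-03,bicuspid\n' > "$dir/winter.csv"

# No spaces around "=" when assigning
testing="$dir/winter.csv"
header=$(head -n 1 "$testing")

# Without the dollar sign the shell sees the literal word "testing"
literal=$(echo testing)

echo "$header"
echo "$literal"
```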


## For Loops

`for filetype in gif jpg png; do echo $filetype; done`

produces:


---
gif

jpg

png


1. The structure is `for …variable… in …list…; do …body…; done`.


2. The list is the set of things the loop is to process (in our case, the words gif, jpg, and png).

Notice that the body uses `$filetype` to get the variable’s value instead of just filetype, just like
it does with any other shell variable. 

Also notice where the semi-colons go: 

- the first one comes
between the list and the keyword do, 

- and the second comes between the body and the keyword done.
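The same loop can also be spread over several lines, in which case line breaks take the place of the semicolons. A small sketch; the `image.` prefix is only there to show the variable used inside a longer string.

```shell
# Multi-line form: no semicolons needed
for filetype in gif jpg png
do
    echo "image.$filetype"
done

# One-line form, with its output captured in a variable
one_line=$(for filetype in gif jpg png; do echo "image.$filetype"; done)
```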

## loops with wildcard *

`for filename in seasonal/*.csv; do echo $filename; done`

seasonal/autumn.csv

seasonal/spring.csv

seasonal/summer.csv

seasonal/winter.csv

## loops with variable `$`

`$ files=seasonal/*.csv`

`$ for f in $files; do echo $f; done`


---


seasonal/autumn.csv

seasonal/spring.csv

seasonal/summer.csv

seasonal/winter.csv

## loops with pipe `|`

`for file in seasonal/*.csv; do head -n 2 $file | tail -n 1; done`



---
2017-01-05,canine

2017-01-25,wisdom

2017-01-11,canine

2017-01-03,bicuspid


`for file in seasonal/*.csv; do grep -h 2017-07 $file; done`



---
2017-07-10,incisor

2017-07-10,wisdom

2017-07-20,incisor

...


## Avoiding spaces in file names

Use single quotes (`'`) or double quotes (`"`) if there is a space in a file name:

`mv 'July 2017.csv' '2017 July data.csv'`
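The effect of the quotes can be checked in a scratch directory. In this sketch, `set --` is used only to count how many separate words the shell would see without quotes.

```shell
dir=$(mktemp -d)
cd "$dir" || exit 1
echo "Date,Tooth" > 'July 2017.csv'

# Quoted: source and destination are one argument each
mv 'July 2017.csv' '2017 July data.csv'
if [ -f '2017 July data.csv' ]; then renamed=yes; else renamed=no; fi

# Unquoted, the same name is split into three separate words
set -- 2017 July data.csv
unquoted_words=$#

echo "renamed: $renamed, unquoted name becomes $unquoted_words words"
```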

## loops with several commands

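Separate commands with `;`. Without a semicolon after `echo $f`, everything up to the pipe becomes arguments to `echo`:

`for f in seasonal/*.csv; do echo $f head -n 2 $f | tail -n 1; done`

`seasonal/autumn.csv head -n 2 seasonal/autumn.csv`

`seasonal/spring.csv head -n 2 seasonal/spring.csv`

`seasonal/summer.csv head -n 2 seasonal/summer.csv`

`seasonal/winter.csv head -n 2 seasonal/winter.csv`

With the semicolon, `echo` and the `head | tail` pipeline run as two separate commands:

`for f in seasonal/*.csv; do echo $f; head -n 2 $f | tail -n 1; done`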


---



seasonal/autumn.csv

2017-01-05,canine

seasonal/spring.csv

2017-01-25,wisdom

seasonal/summer.csv

2017-01-11,canine

seasonal/winter.csv

2017-01-03,bicuspid


## Save history commands for future use

`cp seasonal/s*.csv ~/`

`grep -hv Tooth s*.csv > temp.csv`

`history | tail -n 3 > steps.txt`

`cat steps.txt`


---

9  cp seasonal/s*.csv ~/

10 grep -hv Tooth s*.csv > temp.csv

11 history | tail -n 3 > steps.txt

## Run shell script using Bash

Run a shell script with `bash script.sh`.

`nano dates.sh`

`cat dates.sh`


---



---


`cut -d , -f 1 seasonal/*.csv`


---


`bash dates.sh`


---

Date

2017-01-05

2017-01-17

2017-01-18

...

## Save a script output into a file

`nano teeth.sh`

`cat teeth.sh` 


---


`cut -d , -f 2 seasonal/*.csv | grep -v Tooth | sort | uniq -c`


---


`bash teeth.sh > teeth.out`

`cat teeth.out`


---


15 bicuspid

31 canine

18 incisor

11 molar

17 wisdom

## Pass filenames to scripts with `$@` (all command-line parameters)

if `unique-lines.sh` contains this:

`sort $@ | uniq`

then when you run:

`bash unique-lines.sh seasonal/summer.csv`

the shell replaces `$@` with `seasonal/summer.csv` and processes one file. 



---



If you run this:

`bash unique-lines.sh seasonal/summer.csv seasonal/autumn.csv`

it processes two data files, and so on.
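Both calls can be tried end to end in a scratch directory. This is a sketch: `one.txt` and `two.txt` are invented inputs, and the script is written with `"$@"` in quotes, a small variation that keeps file names containing spaces intact.

```shell
dir=$(mktemp -d)
printf 'b\na\nb\n' > "$dir/one.txt"
printf 'c\na\n' > "$dir/two.txt"

# Create a stand-in for unique-lines.sh
printf '%s\n' 'sort "$@" | uniq' > "$dir/unique-lines.sh"

# One file name: $@ expands to that single name
one_file=$(bash "$dir/unique-lines.sh" "$dir/one.txt")

# Two file names: $@ expands to both, and sort merges them
two_files=$(bash "$dir/unique-lines.sh" "$dir/one.txt" "$dir/two.txt")

echo "$one_file"
echo "$two_files"
```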


---



---



`nano count-records.sh`

`cat count-records.sh`


---


`tail -q -n +2 $@ | wc -l`


---
`bash count-records.sh seasonal/*.csv > num-records.out`

`head num-records.out`



---

92




## Command-line parameters

As well as `$@`, the shell lets you use `$1`, `$2`, and so on to refer to specific command-line parameters.


The script `get-field.sh` is supposed to take a filename, the number of the row to select, the
number of the column to select, and print just that field from a CSV file.

e.g. `bash get-field.sh seasonal/summer.csv 4 2`
should select the **second field** from **line 4** of `seasonal/summer.csv`.


---



---



`head -n $2 $1 | tail -n 1 | cut -d , -f $3`

`bash get-field.sh seasonal/summer.csv 4 2`
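The positional parameters can be traced with a small stand-in file. In this sketch the fourth row of `summer.csv` is made up, and the parameters are quoted inside the script as a precaution.

```shell
dir=$(mktemp -d)
printf 'Date,Tooth\n2017-01-11,canine\n2017-01-18,wisdom\n2017-01-21,bicuspid\n' > "$dir/summer.csv"

# $1 = file name, $2 = line number, $3 = column number
printf '%s\n' 'head -n "$2" "$1" | tail -n 1 | cut -d , -f "$3"' > "$dir/get-field.sh"

# Line 4 is "2017-01-21,bicuspid"; its second field is the tooth name
field=$(bash "$dir/get-field.sh" "$dir/summer.csv" 4 2)
echo "$field"
```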

## Bash Scripts with 2 or more lines

`cat range.sh`


---


`wc -l $@ | grep -v total | sort -n | head -n 1`

`wc -l $@ | grep -v total | sort -nr | head -n 1`


---


`bash range.sh seasonal/*.csv > range.out`

`head range.out`


---


`21 seasonal/autumn.csv`

`26 seasonal/winter.csv`

## Bash Scripts with Loops

`cat date-range.sh`


---


---



`# Print the first and last date from each data file.`

`for filename in $@`

`do`

`cut -d , -f 1 $filename | grep -v Date | sort | head -n 1`

`cut -d , -f 1 $filename | grep -v Date | sort | tail -n 1`

`done`
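The loop body can be tested without creating a script file by wrapping it in a shell function. This is only a sketch: the function name `date_range` and the two small CSV files are assumptions made for the example.

```shell
dir=$(mktemp -d)
printf 'Date,Tooth\n2017-03-01,canine\n2017-01-05,wisdom\n' > "$dir/a.csv"
printf 'Date,Tooth\n2017-08-16,molar\n2017-02-02,incisor\n' > "$dir/b.csv"

# Same logic as the script, with "$@" now holding the function's arguments
date_range() {
    for filename in "$@"
    do
        cut -d , -f 1 "$filename" | grep -v Date | sort | head -n 1
        cut -d , -f 1 "$filename" | grep -v Date | sort | tail -n 1
    done
}

# Two lines per file: earliest date, then latest date
ranges=$(date_range "$dir/a.csv" "$dir/b.csv")
echo "$ranges"
```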



---


---


`bash date-range.sh seasonal/*.csv`


---


2017-01-05

2017-08-16

2017-01-25

...


---


`bash date-range.sh seasonal/*.csv | sort`



---


2017-01-03

2017-01-05

2017-01-11

...
