A crash course in bash for the data scientist.
"This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface." -- Doug McIlroy
List directory contents with ls, create a directory with mkdir, copy files with cp, move and rename files with mv, and remove files and directories with rm and rmdir.
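The commands above can be tried safely in a throwaway directory. This is a minimal sketch; the file and directory names (notes.csv, raw) are made up for illustration.

```shell
#!/usr/bin/env bash
# Work inside a temporary directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir"

mkdir raw                      # create a directory
echo "a,b" > notes.csv         # create a small file to play with
cp notes.csv raw/              # copy it into raw/
mv notes.csv renamed.csv       # rename (move) the original
ls raw                         # list the contents of raw/
rm raw/notes.csv renamed.csv   # remove both files
rmdir raw                      # rmdir only removes empty directories

# Count what is left in the working directory.
remaining=$(ls -A | wc -l | tr -d ' ')
echo "$remaining"
```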
Most importantly, use man to read the manual page for a command (for example, man ls). For commands built into Bash itself, such as cd, use help instead.
Command-line keyboard shortcuts are immensely useful. For example, ctrl+a moves the cursor to the beginning of the line and ctrl+e moves the cursor to the end. Additionally, ctrl+k deletes everything from the cursor to the end of the line, and ctrl+l clears the whole terminal window, which is equivalent to running clear.
See a list of shortcuts here: Wikipedia article on Bash, shortcut section
Not only can you use the * wildcard, but bash also supports bracket expressions, which match any single character from the listed set. Show all tsv and csv files in the current directory:
ls -l *.[tc]sv
Expand items with braces {}. Bash generates one pattern per item, so this lists only files with the given extensions:
ls -l *.{csv,tsv,sql,json}
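To see what each pattern expands to, a small experiment helps. The sketch below assumes Bash (brace expansion is not available in plain sh) and uses made-up file names in a temporary directory.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir"
touch a.csv b.tsv c.sql d.json e.txt

# Bracket expression: exactly one character, either t or c, before "sv".
bracket_matches=$(echo *.[tc]sv)
echo "$bracket_matches"        # a.csv b.tsv

# Brace expansion runs first and produces *.csv *.tsv *.sql *.json,
# then each of those patterns is matched against the directory.
brace_matches=$(echo *.{csv,tsv,sql,json})
echo "$brace_matches"          # a.csv b.tsv c.sql d.json
```

Using echo instead of ls is a handy way to preview an expansion before handing it to a command that changes files.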
Another useful feature at the interactive prompt is the ability to repeat the last argument of the previous command with !$. Show all csv files, then move them to a directory, data/:
ls -l *.csv
mv !$ data/
The find command has several different applications.
To search all sub-directories within a directory called zipfian for Python files:
find ~/zipfian -name "*.py"
The find command can also execute a given command on all of the files returned by the search using the -exec flag. To search for all Python files within the zipfian directory that contain the string "RandomForestClassifier":
find ~/zipfian -name "*.py" -exec grep -l "RandomForestClassifier" {} \;
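The same search can be tried on a disposable directory tree. This is a sketch with hypothetical file names; it uses the {} + form of -exec, which hands many files to a single grep call instead of starting one grep per file as {} \; does.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
mkdir -p "$workdir/models" "$workdir/utils"
echo "from sklearn.ensemble import RandomForestClassifier" > "$workdir/models/forest.py"
echo "print('hello')" > "$workdir/utils/hello.py"
echo "RandomForestClassifier notes" > "$workdir/notes.txt"

# Only .py files are passed to grep; grep -l prints the names of files that match.
matches=$(find "$workdir" -name "*.py" -exec grep -l "RandomForestClassifier" {} +)
echo "$matches"
```

Only models/forest.py is reported: hello.py lacks the string, and notes.txt is filtered out by -name before grep ever sees it.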
Many more examples of the find command here: Blog post by Alvin Alexander
echo $PATH
$PATH is an environment variable that holds a list of directories separated by colons. When you type a command, Bash searches those directories from left to right and runs the first match, so the left-most directory takes precedence when more than one contains a command with the same name.
It is very common for $PATH to be defined or extended in your shell initialization script. For login shells that script is typically ~/.bash_profile; interactive non-login shells read ~/.bashrc instead.
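A short sketch of inspecting and extending the search path. The directory $HOME/bin is only an assumed example; substitute whatever directory holds your own scripts.

```shell
#!/usr/bin/env bash
# Print each directory in $PATH on its own line, in search order.
echo "$PATH" | tr ':' '\n'

# Prepend a (hypothetical) personal bin directory so its commands
# win over same-named commands found later in the list.
my_bin="$HOME/bin"
export PATH="$my_bin:$PATH"

# Strip everything from the first colon onward to get the left-most entry.
first_entry=${PATH%%:*}
echo "$first_entry"
```

Placing the export line in your initialization script makes the change apply to every new shell rather than just the current one.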
The most basic system commands are located in the /bin and /usr/bin directories.
Have a look at those commands using ls:
ls -l /usr/bin
Or search those commands by piping the above output to grep. The example below searches for the word "grep" to list every command in the grep family.
ls -l /usr/bin | grep "grep"
Note that the quotation marks are not necessary here, because the pattern contains no spaces or characters that are special to the shell. Quoting is still a good habit for patterns that do.
To show the origin of a specific command, link, or alias, you can use either which or type. Also, most properly implemented commands and interpreters have a --version flag that will display the version of the command.
which python
python --version
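A minimal sketch of a few related lookups, assuming Bash. It adds command -v, a standard way to test whether a command exists before calling it, which is useful in scripts where python may or may not be installed.

```shell
#!/usr/bin/env bash
# type reports how Bash would resolve a name: alias, builtin, function, or file.
type cd
type -a ls       # -a lists every match, not just the first one

# command -v prints the resolved path and returns non-zero if nothing is found.
ls_path=$(command -v ls)
echo "$ls_path"

if command -v python >/dev/null 2>&1; then
    python --version
else
    echo "python is not on PATH"
fi
```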
#!
The hashbang! Also known as the shebang.
Example Python script: environ.py
#!/usr/bin/env python
import os
# Print every environment variable as a (name, value) pair.
for item in os.environ.items():
    print(item)
Supplying #!/usr/bin/env python as the first line of a script tells the operating system to run the script with the first python found in your $PATH. In other words, executing the script is equivalent to running python environ.py in your current shell.
But there's one more step: add the executable permission to your file using chmod +x.
chmod +x environ.py
Now the file can be executed by typing out its relative or absolute path. In the current working directory, that is ./ followed by the filename.
./environ.py
For Bash scripts, it is common to just provide the path to bash in the shebang.
Example Bash script: environ.sh
#!/bin/bash
# printenv writes every environment variable to standard output.
printenv > environment.txt
echo "environment variables stored in environment.txt"
Learn more about the shebang here: Shebang Wikipedia article
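The whole write, chmod, run cycle for a Bash script can be reproduced in a temporary directory. This sketch assumes bash lives at /bin/bash and that the temporary directory allows executing files.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir"

# A quoted here-document writes the script text verbatim, without expanding anything.
cat > environ.sh <<'EOF'
#!/bin/bash
printenv > environment.txt
echo "environment variables stored in environment.txt"
EOF

chmod +x environ.sh        # without this, ./environ.sh fails with "Permission denied"
message=$(./environ.sh)    # run it by relative path and capture what it prints
echo "$message"
```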
To run commands one after another, separate them with semicolons:
command1 ; command2
Or, to run both commands concurrently:
command1 & command2
Note: command1 starts running in the background, and command2 runs in the foreground without waiting for it.
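A small sketch of both forms, using sleep as a stand-in for a long-running command. The $! variable and the wait built-in are additions not covered above: $! holds the process id of the most recent background command, and wait blocks until it finishes.

```shell
#!/usr/bin/env bash
# Sequential: the second command starts only after the first has finished.
echo "first" ; echo "second"

# Concurrent: sleep runs in the background while the script carries on.
sleep 1 &
bg_pid=$!
echo "foreground work continues while process $bg_pid sleeps"

# Block until the background process ends and collect its exit status.
wait "$bg_pid"
bg_status=$?
echo "background job exited with status $bg_status"
```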
Quit a process running in the foreground with ctrl+c. For programs that read from standard input, ctrl+d signals end of input instead, which usually makes them exit.
Suspend a process running in the foreground with ctrl+z. You will see something like this:
^Z
[2] + 71826 suspended ./energy.py
Resume a suspended process in the foreground with fg. Or you can resume it in the background with bg, which has the same effect as having started it with ./energy.py &.
Use jobs -l to list running and suspended jobs with their pid (process id). The output looks something like this:
[1] - 71979 running ./pima.py
[2] + 71826 suspended ./energy.py
Finally, you can terminate that suspended process with kill:
kill 71826
To display the processes started from the current terminal use ps, and to see all processes on the system use ps ax. Or, to see which process is eating up your CPU or memory, have a look at the top command.
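The job-control steps above can be rehearsed on a disposable process. This sketch uses sleep as a placeholder for a real script; kill -0 is an addition that sends no signal and only checks whether the process still exists.

```shell
#!/usr/bin/env bash
# Start a long-running placeholder process in the background.
sleep 300 &
pid=$!
jobs -l                      # lists the job together with its pid

kill "$pid"                  # sends SIGTERM, the default signal
wait "$pid" 2>/dev/null      # reap the process and collect its exit status
status=$?
echo "exit status: $status"  # 143, i.e. 128 + 15 (SIGTERM)

# kill -0 fails once the process no longer exists.
kill -0 "$pid" 2>/dev/null || echo "process $pid is gone"
```

If a process ignores the default signal, kill -9 sends SIGKILL, which cannot be caught; it is best kept as a last resort because the process gets no chance to clean up.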
When CSVs are improperly formatted, sed and awk can come to the rescue.
A somewhat useful blog post on csv manipulation: Sultan of Awk
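As a small illustration of that kind of rescue, the sketch below repairs a made-up file that uses semicolons as delimiters and contains a blank line. It is deliberately naive: it does not handle quoted fields that contain delimiters.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical messy input: semicolon-delimited, with an empty line in the middle.
printf 'name;age\nada;36\n\ngrace;45\n' > messy.csv

# sed: replace every semicolon with a comma, then delete empty lines.
sed -e 's/;/,/g' -e '/^$/d' messy.csv > clean.csv
cat clean.csv

# awk: -F sets the field separator; skip the header (NR > 1) and
# print the first column for rows whose second column exceeds 40.
older=$(awk -F',' 'NR > 1 && $2 > 40 { print $1 }' clean.csv)
echo "$older"   # grace
```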
CSVkit is a super-useful collection of commands that operate on CSVs. It is implemented in python, and can be easily installed with:
pip install csvkit
To display a csv in an easy-to-read format, use csvlook:
cat example.csv | csvlook
If the csv has many columns and rows, the output may be too wide to read comfortably. In that case, you can pipe the output to less -S for easier viewing. The -S flag tells less to truncate long lines instead of wrapping them, so you can scroll sideways.
cat example.csv | csvlook | less -S
Furthermore, you can choose columns with csvcut and sort on specified columns with csvsort.
CSVkit goes one step further. You can run summary statistics on a csv with the use of csvstat.
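A hedged sketch of chaining these tools. The sample data and column names are invented for illustration, and the csvkit commands only run if csvkit is actually installed.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample data.
printf 'name,age,city\nada,36,London\ngrace,45,New York\n' > example.csv
csv_rows=$(wc -l < example.csv | tr -d ' ')

if command -v csvcut >/dev/null 2>&1; then
    # Keep two columns, sort by age in descending order (-r), and pretty-print.
    csvcut -c name,age example.csv | csvsort -c age -r | csvlook
    # Summary statistics for every column.
    csvstat example.csv
else
    echo "csvkit is not installed; try: pip install csvkit"
fi
```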
Another super-useful tool for displaying and manipulating data on the command line is jq. You can install jq with Homebrew:
brew install jq
Or, if you don't have Homebrew, you can find it here: JQ
jq can pretty-print, extract, and manipulate JSON data on the command line.
cat example.json | jq ".[0]"
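A few more filters in the same spirit, run against made-up data. The commands are guarded so the sketch still runs where jq is not installed.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample data: an array of two objects.
printf '[{"name":"ada","age":36},{"name":"grace","age":45}]\n' > example.json

if command -v jq >/dev/null 2>&1; then
    jq '.' example.json             # pretty-print the whole document
    jq '.[0]' example.json          # first element of the array
    jq -r '.[].name' example.json   # every name, as raw strings without quotes
    jq 'length' example.json        # number of elements in the array
else
    echo "jq is not installed"
fi
```

Passing the file name as an argument avoids the extra cat process, although piping works just as well.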