# Introduction to Shell
## 1. Manipulating files and directories
## 2. Manipulating data
## 3. Combining tools
## 4. Batch processing
## 5. Creating new tools

## 1. Manipulating files and directories

**How does the shell compare to a desktop interface?**

An operating system like Windows, Linux, or Mac OS is a special kind of program. It controls the computer's processor, hard drive, and network connection, but its most important job is to run other programs.

Since human beings aren't digital, they need an interface to interact with the operating system. The most common one these days is a graphical file explorer, which translates clicks and double-clicks into commands to open files and run programs. Before computers had graphical displays, though, people typed instructions into a program called a command-line shell. Each time a command is entered, the shell runs some other programs, prints their output in human-readable form, and then displays a prompt to signal that it's ready to accept the next command. (Its name comes from the notion that it's the "outer shell" of the computer.)

Typing commands instead of clicking and dragging may seem clumsy at first, but as you will see, once you start spelling out what you want the computer to do, you can combine old commands to create new ones and automate repetitive operations with just a few keystrokes.

What is the relationship between the graphical file explorer that most people use and the command-line shell?

**Possible Answers**

- [ ] The file explorer lets you view and edit files, while the shell lets you run programs.
- [ ] The file explorer is built on top of the shell.
- [ ] The shell is part of the operating system, while the file explorer is separate.
- [x] They are both interfaces for issuing commands to the operating system.

**Where am I?**

The filesystem manages files and directories (or folders). Each is identified by an absolute path that shows how to reach it from the filesystem's root directory: `/home/repl` is the directory `repl` in the directory `home`, while `/home/repl/course.txt` is a file `course.txt` in that directory, and `/` on its own is the root directory.

To find out where you are in the filesystem, run the command `pwd` (short for "print working directory"). This prints the absolute path of your current working directory, which is where the shell runs commands and looks for files by default.

Run `pwd`. Where are you right now?

**Possible Answers**

- [ ] `/home`
- [ ] `/repl`
- [x] `/home/repl`

**How can I identify files and directories?**

`pwd` tells you where you are. To find out what's there, type `ls` (which is short for "listing") and press the enter key. On its own, `ls` lists the contents of your current directory (the one displayed by `pwd`). If you add the names of some files, `ls` will list them, and if you add the names of directories, it will list their contents. For example, `ls /home/repl` shows you what's in your starting directory (usually called your home directory).

Use `ls` with an appropriate argument to list the files in the directory `/home/repl/seasonal` (which holds information on dental surgeries by date, broken down by season). Which of these files is not in that directory?

**Possible Answers**

- [ ] autumn.csv
- [x] fall.csv
- [ ] spring.csv
- [ ] winter.csv

**How else can I identify files and directories?**

```
ls course.txt
ls seasonal/summer.csv
ls people
```

**How can I move to another directory?**

```
cd seasonal
pwd
ls
```

**How can I move up a directory?**

The parent of a directory is the directory above it. For example, `/home` is the parent of `/home/repl`, and `/home/repl` is the parent of `/home/repl/seasonal`. You can always give the absolute path of your parent directory to commands like `cd` and `ls`. More often, though, you will take advantage of the fact that the special path `..` (two dots with no spaces) means "the directory above the one I'm currently in". If you are in `/home/repl/seasonal`, then `cd ..` moves you up to `/home/repl`. If you use `cd ..` once again, it puts you in `/home`. One more `cd ..` puts you in the root directory `/`, which is the very top of the filesystem. (Remember to put a space between `cd` and `..`: it is a command and a path, not a single four-letter command.)

A single dot on its own, `.`, always means "the current directory", so `ls` on its own and `ls .` do the same thing, while `cd .` has no effect (because it moves you into the directory you're currently in).

One final special path is `~` (the tilde character), which means "your home directory", such as `/home/repl`. No matter where you are, `ls ~` will always list the contents of your home directory, and `cd ~` will always take you home.
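
The special paths can be tried out safely in a throwaway directory. The sketch below assumes a bash shell and uses `mktemp -d` to create a scratch area; the names `sandbox`, `project`, and `data` are made up for illustration and are not part of the course filesystem.

```shell
# Hypothetical sandbox; the directory names are made up for this demo.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/project/data"

cd "$sandbox/project/data"
cd ..                               # up one level, into project
after_dotdot=$(basename "$PWD")
echo "$after_dotdot"                # project

cd .                                # the current directory: nothing changes
after_dot=$(basename "$PWD")
echo "$after_dot"                   # project

cd ~                                # home, no matter where you started
pwd
```

The final `pwd` prints whatever your own home directory is, so its output differs from one machine to the next.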

If you are in `/home/repl/seasonal`, where does `cd ~/../.` take you?

**Possible Answers**

- [ ] /home/repl
- [x] /home
- [ ] /home/repl/seasonal
- [ ] / (the root directory)


**How can I copy files?**

```
cp seasonal/summer.csv backup/summer.bck
cp seasonal/spring.csv seasonal/summer.csv backup
```

**How can I move a file?**

```
mv seasonal/spring.csv seasonal/summer.csv backup
```

**How can I rename files?**

```
cd seasonal
mv winter.csv winter.csv.bck
ls
```

**How can I delete files?**

```
cd seasonal
rm autumn.csv
cd
rm seasonal/summer.csv
```

**How can I create and delete directories?**

```
rm people/agarwal.txt
rmdir people
mkdir yearly
mkdir yearly/2017
```

**Wrapping up**

```
cd /tmp
ls
mkdir scratch
mv ~/people/agarwal.txt scratch
```

## 2. Manipulating data

**How can I view a file's contents?**

```
cat course.txt
```

**How can I view a file's contents piece by piece?**

```
# You can leave out the '| cat' part here:
less seasonal/spring.csv seasonal/summer.csv | cat
```

**How can I look at the start of a file?**

The first thing most data scientists do when given a new dataset to analyze is figure out what fields it contains and what values those fields have. If the dataset has been exported from a database or spreadsheet, it will often be stored as comma-separated values (CSV). A quick way to figure out what it contains is to look at the first few rows.

We can do this in the shell using a command called head. As its name suggests, it prints the first few lines of a file (where "a few" means 10), so the command:

`head seasonal/summer.csv`

displays:

```
Date,Tooth
2017-01-11,canine
2017-01-18,wisdom
2017-01-21,bicuspid
2017-02-02,molar
2017-02-27,wisdom
2017-02-27,wisdom
2017-03-07,bicuspid
2017-03-15,wisdom
2017-03-20,canine
```

What does head do if there aren't 10 lines in the file? (To find out, use it to look at the top of people/agarwal.txt.)

**Possible Answers**

- [ ] Print an error message because the file is too short.
- [x] Display as many lines as there are.
- [ ] Display enough blank lines to bring the total to 10.

**How can I type less?**

```
head seasonal/autumn.csv
head seasonal/spring.csv
```

**How can I control what commands do?**

```
head -n 5 seasonal/winter.csv
```

**How can I list everything below a directory?**

```
ls -R -F /home/repl
```

**How can I get help for a command?**

```
# Run the following command *without* '| cat':
man tail | cat
tail -n +7 seasonal/spring.csv
```

**How can I select columns from a file?**

head and tail let you select rows from a text file. If you want to select columns, you can use the command cut. It has several options (use man cut to explore them), but the most common is something like:

`cut -f 2-5,8 -d , values.csv`

which means "select columns 2 through 5 and column 8, using comma as the separator". `cut` uses `-f` (meaning "fields") to specify columns and `-d` (meaning "delimiter") to specify the separator. You need to specify the latter because some files may use spaces, tabs, or colons to separate columns.
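
To see the field selection in action without needing a `values.csv` file, here is a small sketch that feeds `cut` a single made-up row. The field names `f1` to `f8` are placeholders chosen so the selected positions are easy to read.

```shell
# Made-up row with eight comma-separated fields.
row='f1,f2,f3,f4,f5,f6,f7,f8'
selected=$(printf '%s\n' "$row" | cut -f 2-5,8 -d ,)
echo "$selected"          # f2,f3,f4,f5,f8

# The same fields separated by colons need a different delimiter.
colon_row='f1:f2:f3:f4:f5:f6:f7:f8'
selected_colon=$(printf '%s\n' "$colon_row" | cut -f 2-5,8 -d :)
echo "$selected_colon"    # f2:f3:f4:f5:f8
```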

What command will select the first column (containing dates) from the file spring.csv?

**Possible Answers**

- [ ] cut -d , -f 1 seasonal/spring.csv
- [ ] cut -d, -f1 seasonal/spring.csv
- [x] Either of the above.
- [ ] Neither of the above, because -f must come before -d.

**What can't cut do?**

cut is a simple-minded command. In particular, it doesn't understand quoted strings. If, for example, your file is:

```
Name,Age
"Johel,Ranjit",28
"Sharma,Rupinder",26
```

then:

```
cut -f 2 -d , everyone.csv
```

will produce:

```
Age
Ranjit"
Rupinder"
```

rather than everyone's age, because it will think the comma between last and first names is a column separator.

What is the output of cut -d : -f 2-4 on the line:

`first:second:third:`

(Note the trailing colon.)

**Possible Answers**

- [ ] second
- [ ] second:third
- [x] second:third:
- [ ] None of the above, because there aren't four fields.

**How can I repeat commands?**

```
head summer.csv
cd seasonal
!head
history
!3
```

**How can I select lines containing specific values?**

```
grep molar seasonal/autumn.csv
grep -v -n molar seasonal/spring.csv
grep -c incisor seasonal/autumn.csv seasonal/winter.csv
```

**Why isn't it always safe to treat data as text?**

The SEE ALSO section of the manual page for cut refers to a command called paste that can be used to combine data files instead of cutting them up.

Read the manual page for paste, and then run paste to combine the autumn and winter data files in a single table using a comma as a separator. What's wrong with the output from a data analysis point of view?

**Possible Answers**

- [ ] The column headers are repeated.
- [x] The last few rows have the wrong number of columns.
- [ ] Some of the data from winter.csv is missing.

## 3. Combining tools

**How can I store a command's output in a file?**

```
tail -n 5 seasonal/winter.csv > last.csv
```

**How can I use a command's output as an input?**

```
tail -n 2 seasonal/winter.csv > bottom.csv
head -n 1 bottom.csv
```

**What's a better way to combine commands?**
```
cut -d , -f 2 seasonal/summer.csv | grep -v Tooth
```

**How can I combine many commands?**

```
cut -d , -f 2 seasonal/summer.csv | grep -v Tooth | head -n 1
```

**How can I count the records in a file?**

```
grep 2017-07 seasonal/spring.csv | wc -l
```

**How can I specify many files at once?**

```
head -n 3 seasonal/s* # ...or seasonal/s*.csv, or even s*/s*.csv
```

**What other wildcards can I use?**

The shell has other wildcards as well, though they are less commonly used:

- `?` matches a single character, so `201?.txt` will match `2017.txt` or `2018.txt`, but not `2017-01.txt`.
- `[...]` matches any one of the characters inside the square brackets, so `201[78].txt` matches `2017.txt` or `2018.txt`, but not `2016.txt`.
- `{...}` matches any of the comma-separated patterns inside the curly brackets, so `{*.txt,*.csv}` (written without a space after the comma, because the shell splits words on spaces) matches any file whose name ends with `.txt` or `.csv`, but not files whose names end with `.pdf`.
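
The three wildcards can be compared side by side in a scratch directory. This is a minimal sketch assuming bash (brace expansion is not available in every shell); the file names are made up to mirror the examples in the list, and the brace pattern is written without spaces after the comma.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
touch 2016.txt 2017.txt 2018.txt 2017-01.txt data.csv report.pdf

single_char=$(echo 201?.txt)
echo "$single_char"             # 2016.txt 2017.txt 2018.txt

one_of=$(echo 201[78].txt)
echo "$one_of"                  # 2017.txt 2018.txt

# set -- stores the expanded names as positional parameters; $# counts them.
set -- {*.txt,*.csv}
brace_count=$#
echo "$brace_count"             # 5
```

Using `echo` with a pattern is a harmless way to preview which files a wildcard will match before handing it to a command that changes anything.
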

Which expression would match singh.pdf and johel.txt but not sandhu.pdf or sandhu.txt?

**Possible Answers**

- [ ] `[sj]*.{.pdf,.txt}`
- [ ] `{s*.pdf,j*.txt}`
- [ ] `[singh,johel]{*.pdf,*.txt}`
- [x] `{singh.pdf,j*.txt}`

**How can I sort lines of text?**

```
cut -d , -f 2 seasonal/winter.csv | grep -v Tooth | sort -r
```

**How can I remove duplicate lines?**

```
cut -d , -f 2 seasonal/winter.csv | grep -v Tooth | sort | uniq -c
```

**How can I save the output of a pipe?**

The shell lets us redirect the output of a sequence of piped commands:

`cut -d , -f 2 seasonal/*.csv | grep -v Tooth > teeth-only.txt`

However, > must appear at the end of the pipeline: if we try to use it in the middle, like this:

`cut -d , -f 2 seasonal/*.csv > teeth-only.txt | grep -v Tooth`

then all of the output from cut is written to teeth-only.txt, so there is nothing left for grep: it reads an empty input, prints nothing, and exits.
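
As a small self-contained sketch of the working form, with `>` at the end of the pipeline, the commands below build a three-line stand-in file (`sample.csv` and its contents are made up, not the course data) and save the filtered result.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'Date,Tooth\n2017-01-05,canine\n2017-01-17,wisdom\n' > sample.csv

# Redirection comes last, so it captures what grep prints.
cut -d , -f 2 sample.csv | grep -v Tooth > teeth-only.txt
teeth=$(cat teeth-only.txt)
echo "$teeth"
# canine
# wisdom
```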

What happens if we put redirection at the front of a pipeline as in:

`> result.txt head -n 3 seasonal/winter.csv`

**Possible Answers**

- [x] The command's output is redirected to the file as usual.
- [ ] The shell reports it as an error.
- [ ] The shell waits for input forever.

**How can I stop a running program?**

```
# Simply type head, hit Enter and exit the running program with `Ctrl` + `C`.
```

**Wrapping up**

```
wc -l seasonal/*.csv
wc -l seasonal/*.csv | grep -v total
wc -l seasonal/*.csv | grep -v total | sort -n | head -n 1
```

## 4. Batch processing

**How does the shell store information?**

Like other programs, the shell stores information in variables. Some of these, called environment variables, are available all the time. Environment variables' names are conventionally written in upper case, and a few of the more commonly used ones are shown below.

| Variable | Purpose | Value |
|---|---|---|
| `HOME` | User's home directory | `/home/repl` |
| `PWD` | Present working directory | Same as `pwd` command |
| `SHELL` | Which shell program is being used | `/bin/bash` |
| `USER` | User's ID | `repl` |

To get a complete list (which is quite long), you can type set in the shell.
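
If you only care about the handful of variables listed above, you can print them one at a time. This is a minimal sketch: the values depend on your own machine, and some variables may be empty in unusual environments. The names `pwd_output` and `pwd_matches` are made up for the demo.

```shell
echo "HOME:  $HOME"
echo "PWD:   $PWD"
echo "SHELL: $SHELL"
echo "USER:  $USER"

# PWD tracks the working directory, so it agrees with the pwd command.
pwd_output=$(pwd)
[ "$PWD" = "$pwd_output" ] && pwd_matches=yes || pwd_matches=no
echo "$pwd_matches"       # yes
```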

Use set and grep with a pipe to display the value of HISTFILESIZE, which determines how many old commands are stored in your command history. What is its value?

**Possible Answers**

- [ ] 10
- [ ] 500
- [x] 2000
- [ ] The variable is not there.


**How can I print a variable's value?**

```
echo $OSTYPE
```

**How else does the shell store information?**

```
testing=seasonal/winter.csv
head -n 1 $testing
```

**How can I repeat a command many times?**

```
for filetype in docx odt pdf; do echo $filetype; done
```

**How can I repeat a command once for each file?**

```
for filename in people/*; do echo $filename; done
```

**How can I record the names of a set of files?**

People often set a variable using a wildcard expression to record a list of filenames. For example, if you define datasets like this:

`datasets=seasonal/*.csv`

you can display the files' names later using:

`for filename in $datasets; do echo $filename; done`

This saves typing and makes errors less likely.
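
One detail worth seeing for yourself is that the variable holds the pattern itself, and the shell only turns it into filenames when the variable is used without quotes. The sketch below assumes bash and builds a made-up `seasonal` directory with two CSV files, so the count differs from the course data.

```shell
sandbox=$(mktemp -d)
mkdir "$sandbox/seasonal"
touch "$sandbox/seasonal/autumn.csv" "$sandbox/seasonal/spring.csv" "$sandbox/seasonal/notes.txt"
cd "$sandbox"

datasets=seasonal/*.csv          # stored as the literal pattern
echo "$datasets"                 # quoted, so it prints seasonal/*.csv

count=0
for filename in $datasets; do    # unquoted, so the pattern expands here
    echo "$filename"
    count=$((count + 1))
done
echo "$count"                    # 2
```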

If you run these two commands in your home directory, how many lines of output will they print?

```
files=seasonal/*.csv
for f in $files; do echo $f; done
```

**Possible Answers**

- [ ] None: since files is defined on a separate line, it has no value in the second line.
- [ ] One: the word "files".
- [x] Four: the names of all four seasonal data files.

**A variable's name versus its value**

A common mistake is to forget to use $ before the name of a variable. When you do this, the shell uses the name you have typed rather than the value of that variable.

A more common mistake for experienced users is to mis-type the variable's name. For example, if you define datasets like this:

`datasets=seasonal/*.csv`

and then type:

`echo $datsets`

the shell doesn't print anything, because datsets (without the second "a") isn't defined.
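
A quick way to see the difference is to wrap each value in square brackets, so that an empty result is visible. This is only a sketch: `typo_value` is a made-up helper name, and it assumes the shell has not been told to treat unset variables as errors.

```shell
datasets='seasonal/*.csv'        # the correctly spelled variable
typo_value="$datsets"            # misspelled on purpose; this name was never set
echo "[$typo_value]"             # []
echo "[$datasets]"               # [seasonal/*.csv]
```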

If you were to run these two commands in your home directory, what output would be printed?

```
files=seasonal/*.csv
for f in files; do echo $f; done
```

(Read the first part of the loop carefully before answering.)

**Possible Answers**

- [x] One line: the word "files".
- [ ] Four lines: the names of all four seasonal data files.
- [ ] Four blank lines: the variable f isn't assigned a value.


**How can I run many commands in a single loop?**

```
for file in seasonal/*.csv; do grep 2017-07 $file | tail -n 1; done
```

**Why shouldn't I use spaces in filenames?**

It's easy and sensible to give files multi-word names like July 2017.csv when you are using a graphical file explorer. However, this causes problems when you are working in the shell. For example, suppose you wanted to rename July 2017.csv to be 2017 July data.csv. You cannot type:

`mv July 2017.csv 2017 July data.csv`

because it looks to the shell as though you are trying to move four files called July, 2017.csv, 2017, and July (again) into a directory called data.csv. Instead, you have to quote the files' names so that the shell treats each one as a single parameter:

`mv 'July 2017.csv' '2017 July data.csv'`
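
The quoted form can be checked in a scratch directory. This sketch creates an empty file with the name used above inside a temporary directory (the `sandbox` name is made up) and renames it.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
touch 'July 2017.csv'            # quotes keep the name as one argument

mv 'July 2017.csv' '2017 July data.csv'
listing=$(ls)
echo "$listing"                  # 2017 July data.csv
```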

If you have two files called current.csv and last year.csv (with a space in its name) and you type:

`rm current.csv last year.csv`

what will happen?

**Possible Answers**

- [ ] The shell will print an error message because last and year.csv do not exist.
- [ ] The shell will delete current.csv.
- [x] Both of the above.
- [ ] Nothing.


**How can I do many things in a single loop?**

The loops you have seen so far all have a single command or pipeline in their body, but a loop can contain any number of commands. To tell the shell where one ends and the next begins, you must separate them with semi-colons:

`for f in seasonal/*.csv; do echo $f; head -n 2 $f | tail -n 1; done`

```
seasonal/autumn.csv
2017-01-05,canine
seasonal/spring.csv
2017-01-25,wisdom
seasonal/summer.csv
2017-01-11,canine
seasonal/winter.csv
2017-01-03,bicuspid
```

Suppose you forget the semi-colon between the echo and head commands in the previous loop, so that you ask the shell to run:

`for f in seasonal/*.csv; do echo $f head -n 2 $f | tail -n 1; done`

What will the shell do?

**Possible Answers**

- [ ] Print an error message.
- [x] Print one line for each of the four files.
- [ ] Print one line for autumn.csv (the first file).
- [ ] Print the last line of each file.

## 5. Creating new tools

**How can I edit a file?**

```
# This solution uses `cp` instead of `nano`
# because our automated tests can't edit files interactively.
cp /solutions/names.txt /home/repl
```

**How can I record what I just did?**

```
cp seasonal/s* ~
grep -h -v Tooth spring.csv summer.csv > temp.csv
history | tail -n 3 > steps.txt
```

**How can I save commands to re-run later?**

```
# This solution uses `cp` instead of `nano`
# because our automated tests can't edit files interactively.
cp /solutions/dates.sh ~
bash dates.sh
```

**How can I re-use pipes?**

```
# This solution uses `cp` instead of `nano`
# because our automated tests can't edit files interactively.
cp /solutions/teeth.sh ~
bash teeth.sh > teeth.out
cat teeth.out
```

**How can I pass filenames to scripts?**

```
# This solution uses `cp` instead of `nano`
# because our automated tests can't edit files interactively.
cp /solutions/count-records.sh ~
bash count-records.sh seasonal/*.csv > num-records.out
```

**How can I process a single argument?**

As well as $@, the shell lets you use $1, $2, and so on to refer to specific command-line parameters. You can use this to write commands that feel simpler or more natural than the shell's. For example, you can create a script called column.sh that selects a single column from a CSV file when the user provides the filename as the first parameter and the column as the second:

`cut -d , -f $2 $1`

and then run it using:

`bash column.sh seasonal/autumn.csv 1`

Notice how the script uses the two parameters in reverse order.
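
To try this end to end, the sketch below writes the one-line script to a file and runs it against a small stand-in CSV. The file `sample.csv` and its three lines are made up for the demo; the script body is the one shown above.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'Date,Tooth\n2017-01-05,canine\n2017-01-17,wisdom\n' > sample.csv

# Single quotes stop the shell from expanding $1 and $2 while writing the file.
printf '%s\n' 'cut -d , -f $2 $1' > column.sh

first_column=$(bash column.sh sample.csv 1)
echo "$first_column"
# Date
# 2017-01-05
# 2017-01-17
```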

The script get-field.sh is supposed to take a filename, the number of the row to select, the number of the column to select, and print just that field from a CSV file. For example:

`bash get-field.sh seasonal/summer.csv 4 2`

should select the second field from line 4 of seasonal/summer.csv. Which of the following commands should be put in get-field.sh to do that?

**Possible Answers**

- [ ] `head -n $1 $2 | tail -n 1 | cut -d , -f $3`
- [x] `head -n $2 $1 | tail -n 1 | cut -d , -f $3`
- [ ] `head -n $3 $1 | tail -n 1 | cut -d , -f $2`
- [ ] `head -n $2 $3 | tail -n 1 | cut -d , -f $1`


**How can one shell script do many things?**

```
# This solution uses `cp` instead of `nano`
# because our automated tests can't edit files interactively.
cp /solutions/range-1.sh range.sh
cp /solutions/range-2.sh range.sh
cp /solutions/range-3.sh range.sh
bash range.sh seasonal/*.csv > range.out
```

**How can I write loops in a shell script?**

```
# This solution uses `cp` instead of `nano`
# because our automated tests can't edit files interactively.
cp /solutions/date-range.sh date-range.sh
bash date-range.sh seasonal/*.csv
bash date-range.sh seasonal/*.csv | sort
```

**What happens when I don't provide filenames?**

A common mistake in shell scripts (and interactive commands) is to put filenames in the wrong place. If you type:

`tail -n 3`

then since tail hasn't been given any filenames, it waits to read input from your keyboard. This means that if you type:

`head -n 5 | tail -n 3 somefile.txt`

then tail goes ahead and prints the last three lines of somefile.txt, but head waits forever for keyboard input, since it wasn't given a filename and there isn't anything ahead of it in the pipeline.
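
For comparison, the intended pipeline gives the filename to the first command, so that `head` reads the file and `tail` reads what `head` prints. A minimal sketch with a made-up seven-line `somefile.txt`:

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'line %s\n' 1 2 3 4 5 6 7 > somefile.txt

# head keeps lines 1-5, then tail keeps the last three of those.
middle=$(head -n 5 somefile.txt | tail -n 3)
echo "$middle"
# line 3
# line 4
# line 5
```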

Suppose you do accidentally type:

`head -n 5 | tail -n 3 somefile.txt`

What should you do next?

**Possible Answers**

- [ ] Wait 10 seconds for head to time out.
- [ ] Type somefile.txt and press Enter to give head some input.
- [x] Use Ctrl + C to stop the running head program.