<a href="https://colab.research.google.com/github/rzl-ds/gu511/blob/master/003_linux_1.ipynb" target="_parent">
    <img src="https://colab.research.google.com/assets/colab-badge.svg"/>
</a>

# linux crash course

[linux](https://en.wikipedia.org/wiki/Linux) is one of the most popular operating systems, and perhaps the most powerful for data science. familiarity with the ins and outs of the linux operating system is an absolute must in modern data science, and in cloud computing / big data in particular.

at the same time, "linux" as a topic is enough to fill several courses (let alone slides). much of the material I leave out here *will* be relevant to you in your future.

this is all to say: I've tried to pare down the linux world to the things I think are *most useful* and *most frequently seen* in the wild. at the very least, I hope that when you encounter these commands and utilities in the wild you will think "I think we talked about that that one time..." -- those mental guideposts can be a lifesaver.

**<div align="center">start up and log in to your `ec2` instance (using `ssh`) and follow along</div>**

## file system, paths, and organization

most users are familiar with the windows/dos "drive" concept of file system organization -- you have lettered drives (e.g. the `C:\\`) and some top-level folders in that drive:

+ `C:\\`
    + Program Files
        + Google Chrome
    + Program Files (x86)
        + Minesweeper
    + Users
        + myname
            + AppData
            + My Documents
+ `D:\\`
    + no one knows. floppy disk? insanity.

over time and use, you've perhaps gotten used to knowing "where" things are, so when you need to tweak something, install something, or save something, you have some instinct built up.

we should start by developing that instinct in the `linux` world

there is a pretty tried-and-true filesystem hierarchy in the linux world. knowing the organization can often prove helpful

### paths

you've probably heard this phrase before, but for completeness' sake: a "path" is a sequence of "directories" ("folders" in the windows and mac worlds) and possibly a file name with an extension. directories also have paths (hence "possibly" a file name).

`/Users/zach.lamberty/Documents/Programming in Scala, 3rd Edition.pdf`

<br><div align="center"><img width="400px" src="http://drive.google.com/uc?export=view&id=18PuLx1cWTiQvYayh0mQl4GpISfoizM4r"></div>

basically, a *path* is a text string which uniquely specifies the "location" of a file or directory on the file-system

the directories in these sequences are separated by the forward slash character (`/`) in the `mac` and `linux` world, and a back slash (`\`) in the `windows` world

#### working directory

whenever you are using a terminal, you are "working in" a specific directory. this is similar to windows file explorer or mac finder: you have a window open and you are "in" your `Documents` folder, e.g.

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1p4x6b09_CxYrNHaZ7_5Q85ABZf6hruiZ"></div>

in `linux`, there is a similar notion called your *working directory*. when you *do* something, you are "doing it **in**" a working directory.

it is important to know what your working directory is at any time

you can always print your working directory with the command `pwd`:

```sh
pwd
```

this string is the *path* that your terminal is *working in*

#### "absolute" vs. "relative" paths

it is standard to talk about paths in two different but related ways:

*absolute*:

this is a path string that contains every single directory (folder) relative to some shared root point (called the "root" folder, discussed below)

*relative*:

1. this is a text string which tells you the sub-folders and file name for a file *relative to* the place you are currently working (your working directory)
1. by default, your session will start in your home directory (discussed below), and paths can be relative to this point in the file system
1. already, it's important to know where you "are"

an analogy: imagine you have windows file explorer open and you're looking at your user's `Documents` directory.

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1xwbAWAQ2lvz1JoFVQac1hYrEbHCuxyGF"></div>

+ your "working directory" is `C:\\Users\zach\Documents`
+ the "absolute" path is `C:\\Users\zach\Documents\top_secret\extremely_complex_analysis\data.h5`
+ the "relative" path (relative to your working directory) is `top_secret\extremely_complex_analysis\data.h5`

if you "clicked up into" `C:\\Users\zach`, how would those paths update?
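the same idea carries over directly to a `linux` terminal. here is a minimal sketch, assuming a throwaway directory created with `mktemp -d` (the `top_secret/...` names are just borrowed from the analogy above, they are not real files on your `ec2`). the absolute path works from anywhere, while the relative path changes whenever the working directory changes. this sketch moves one level *down* rather than up, but the logic is the same.

```bash
# make a scratch directory tree to play in (hypothetical names)
scratch=$(mktemp -d)
mkdir -p "$scratch/top_secret/extremely_complex_analysis"
touch "$scratch/top_secret/extremely_complex_analysis/data.h5"

# working directory: the scratch directory
cd "$scratch"
ls top_secret/extremely_complex_analysis/data.h5              # relative path
ls "$scratch/top_secret/extremely_complex_analysis/data.h5"   # absolute path

# move one level down: the relative path gets shorter, the absolute path is unchanged
cd top_secret
ls extremely_complex_analysis/data.h5
ls "$scratch/top_secret/extremely_complex_analysis/data.h5"
```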

#### `glob`s

[`glob`s](http://tldp.org/LDP/abs/html/globbingref.html) are closely related to paths. they are strings that look like a path, but contain "wild card" characters (`*`, `?`, and a handful of others) which are allowed to match arbitrary content.

this means you can use glob expressions to match several paths.

for example, if you want to represent the path of every `csv` file in the directory `/home/users/zach/data`, you could use the `glob`

```bash
/home/users/zach/data/*.csv
```

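the `*` character will match any run of characters within a single file or directory name (it will not cross a `/`), and `?` will match exactly one character.

advanced: globs are not equivalent to regular expressions. they have a smaller, simpler syntax and can express far fewer patterns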
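to see the wild cards in action, here is a small sketch, assuming a scratch directory with a few made-up file names (none of these exist on your `ec2` until you create them). note that the glob characters sit *outside* the quotes, so the shell is free to expand them.

```bash
# a scratch directory with some hypothetical files
demo=$(mktemp -d)
touch "$demo/a.csv" "$demo/b.csv" "$demo/ab.csv" "$demo/notes.txt"

# * matches any run of characters: all three csv files, but not notes.txt
ls "$demo"/*.csv

# ? matches exactly one character: a.csv and b.csv, but not ab.csv
ls "$demo"/?.csv
```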

**<div align="center">what are your questions?</div>**

### root folder

the "root" folder is (as the name implies) the base of all file paths on your computer. this is similar to the `C:\\` in windows-world except there is no analogy to "other" root folders.

in `linux` and `mac` environments, the root folder is symbolized by just one single "`/`" character.

in your `ec2` terminal, type the following:

``` bash
ls -lh /
```

*we'll get into the details later but this command "lists" (`ls`) the contents of the directory "`/`", and makes the list "long" (`-l`) and human-readable (`-h`)*

the contents of the [root folder](http://www.tldp.org/LDP/intro-linux/html/sect_03_01.html) are pretty established in linux.

as our `ec2` instances are all the same (Ubuntu 18.04) you should be seeing the same thing I just did

from linux machine to machine there may be some changes, but you can usually count on the same structure.

*note: the linux documentation project (or "tldp") which I linked above is a great beginner's resource. it has become outdated over time, but is often the best and fullest resource for explaining linux concepts -- keep an eye out while googling!*

There are a couple of directories in `/` that deserve special mention.

#### `/bin`

the `/bin` directory (short for "binaries") contains common executable stuff (that is, stuff you can run which will do something).

the word "binary" is a bit of an anachronism now, as non-binary things commonly live in the `/bin` directory

check out the results of

```bash
ls -alh /bin/
```

notice anything?


#### `/etc`

Although the name comes from "*et cetera*" (it originally held all sorts of things), in modern times this folder is home to pretty much one type of file: system-wide configuration files. of all of these directories, this is the one that will most benefit you to know about.

we will cover configuration files in detail in the next section, but it suffices to say for now that `/etc` is the home of the *configuration files* that specify the default behaviors for programs you can run on your `linux` machine.

if you -- or, importantly, an administrator, or a `linux` developer -- think there is a best default for some setting, that default will be set in a file inside of the `/etc` directory

in the `ssh` lecture we talked about one particular `config` file: `~/.ssh/config`. this file allows you to create special shorthand names for collections of parameters that modify how the `ssh` command runs

there are also *global* defaults that are specified for all users in `/etc/ssh/ssh_config`

let's check out `/etc`

```bash
ls -alh /etc
```


generally speaking, you should not mess around with files in `/etc`.

... generally speaking.

that being said, eventually they will come up and you'll have to do something about it. you'll want to know how to allow a made-up user to access a `postgres` database, or how to allow `http` access to a `neo4j` database, or how to restrict access to your cloud computer to be *only* via `ssh` key authentication (a really good idea!).

#### `/home`

the linux os developed out of an environment where multiple users utilized the same resources (an im-personal computer, so to speak), so keeping different users' items separate was important.

the `/home` directory (other systems use a different name, e.g. `/Users` on a `mac`) contains separate, isolated directories for each user of the computer.

it is considered to be best practice to physically or "virtually physically" isolate these directories from everything else. this is so that you can copy your profile and configuration files from one installation / distribution / computer to the next and keep “your” stuff.

let's see who all has a `/home` directory on your `ec2` -- e.g. does your user (ubuntu) have one:

```bash
ls -alh /home
```


#### `/mnt`

short for "mount", this directory is where external or network drives are "mounted" (*i.e.* attached to the file system so that they can be accessed). these could be networked drives or external hard drives.

eventually we may connect our `s3` buckets here.

you probably don't have anything mounted to your cloud computer, but take a peek:

```bash
ls -alh /mnt
```

#### `/opt`

historically, this directory contained "optional" add-on software packages. this is still often the place on the filesystem you or your sysad might install "special" things (e.g. Citrix for making remote desktop connections, or special text editors). these things are sometimes also installed in another directory (`/usr/local`), so this isn't a hard-and-fast rule.

*note: not a lot of people know that this exists or that this is what it should be used for!
if there is a thing you want to install and you think "where should I put this..." and you think maybe other users might also use it, a great answer is `/opt`*

what's in your `/opt`?

```bash
ls -alh /opt
```


#### `/tmp`

the name "tmp" is (as you guessed) short for "temporary". this directory provides an ephemeral (*i.e.* it gets wiped with some frequency, usually on restart) bucket for any temporary thing you may want to save.

this includes cache files (spotify or chrome downloads), backups of files (autosaved by some editors), *etc*. you should feel absolutely comfortable using it in your work, with the obvious caveat that you shouldn't put anything **important** in a place called "temporary"

what's already in your temporary folder?

```bash
ls -alh /tmp
```


#### `/var`

here "var" is short for "variable". this is not in the sense of a programmatic or mathematical variable, but rather an item of unknown duration, number, or volume. these are files which are separated from all other files because they *must* be writable and changeable for a computer to work.

TLDP has [a good write-up](http://www.tldp.org/LDP/Linux-Filesystem-Hierarchy/html/var.html) of the sorts of items kept in `/var`: backups, cache files, logs, and database instances

Pay attention to `/var/log` in particular -- in linux world, when things go wrong this is often where you need to go to figure out what happened

```bash
ls -alh /var
```


**<div align="center">what are your questions?</div>**

### the `.` and `..` paths

you may have noticed above that every call to `ls -alh` starts with two special lines that end in `.` and `..`. Those are special path tokens that refer to

+ `.`: assuming a command refers to a path, this is shorthand for that path
    + e.g. if you provide nothing to `ls`, this is the current working directory
    + e.g. if you wrote `ls /etc`, this would refer to the directory `/etc` itself
+ `..`: the parent directory of `.`, whatever `.` may be
    + e.g. if you provide nothing to `ls`, this is the parent directory of the current working directory
    + e.g. if you wrote `ls /etc`, this would refer to the root directory `/`
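these two tokens work anywhere a path is expected, not only in `ls` listings. a minimal sketch, using `cd` (change directory) and `pwd` to watch the working directory move; it assumes `/etc` is an ordinary directory, as it is on ubuntu:

```bash
cd /etc
pwd        # /etc

cd .       # "." is the directory we are already in, so nothing changes
pwd        # /etc

cd ..      # ".." is the parent of /etc, which is the root directory
pwd        # /
```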

### home folders and the `~` symbol

the `/home` directory contains an isolated sub-directory for each user. this is called that user's home directory, and is so important that there is a permanent shortcut for getting there: the `~` character, which `bash` replaces with the path stored in the `HOME` environment variable (more on environment variables below). to show this, let's print it out -- `echo` is the most common "print" command in linux, and will print to the screen whatever is written after it (after resolving variables and shortcuts like `~`):

```bash
echo ~
```

your terminal should return `/home/ubuntu`, since your user name is `ubuntu`.
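a small sketch of how `~` relates to the `HOME` variable, assuming `HOME` is set in the usual way for your user. the shell swaps `~` for that path *before* `echo` ever runs, unless the `~` is wrapped in quotes:

```bash
echo ~          # the shell expands ~ before echo runs
echo "$HOME"    # the HOME variable holds the same path

# inside quotes the ~ is left alone and printed literally
echo "~"
```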


Let's go a step further and check out what is in that directory:

```bash
ls -alh ~
```


note that almost all of the items in the `/` directory had the phrase "`root root`" in the lines describing the files, and the files in this directory instead have `ubuntu ubuntu`.

what do you think is going on there?

those two words are the `owner` and `group` of the file. They don't have to be the same and they can change. we'll cover permissions and permissions structure in just a second.

for now, though, just notice that this directory contains files that are considered "yours" in a particular way -- hence, your "user home directory"

### hidden folders

let's finish our discussion of the file system with one last call out: `linux` has a notion of **hidden** directories or files.

this is very similar to hidden files in `mac` or `windows` systems, which you are perhaps used to seeing as sort of transparent items in a finder window

<br><div align="center"><img width=1000 src="https://photos5.appleinsider.com/gallery/26815-38789-hidden-files-macos-00002-l.jpg"></div>

in `linux`, *any* file or folder with a name that starts with a `period` will be considered "hidden", and commands that interact with files (e.g. `ls`) will only do so if they are specifically told to -- by default, those files and folders will appear as if they don't exist

to see this, try executing both of the following:

```bash
# no hidden files
ls -lh ~/

# "a"ll files
ls -alh ~/
```
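if your home directory is too cluttered to spot the difference, here is a self-contained sketch, assuming a scratch directory and two made-up file names. it uses `-A`, a variation of `-a` that shows hidden items but leaves out the `.` and `..` entries:

```bash
# a scratch directory with one ordinary file and one hidden file
hid=$(mktemp -d)
touch "$hid/visible.txt" "$hid/.hidden.txt"

ls "$hid"       # only visible.txt
ls -A "$hid"    # .hidden.txt and visible.txt
```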


### file system, paths, and organization recap

`linux` is a files-first operating system. you need to know and care about files!

we talk about `files` by giving their `paths`

`paths` are either

+ **absolute**: `/this/long/path/started/with/slash`
+ **relative**: `with/slash`
    + *relative* paths are always *relative* to the **current working directory**
    + in the example above, the working directory is `/this/long/path/started`

we can use **wildcards** like `*` and `?` to represent arbitrary characters in a path

e.g. `/this/absolute/path/to/data/*.csv` will match any `path` (file) that starts with `/this/absolute/path/to/data/` and ends with `.csv`

the base shared files are all in the `root` directory which is represented by a single slash (`/`)

within `root` we have executable things (`/bin`), configuration things (`/etc`), users' home directories (`/home`), external file system sources (`/mnt`), a catch-all location for optional software (`/opt`), a space for temporary, deletable files (`/tmp`), and a place for variable-sized data like log files (`/var`)

in addition, there are a number of special characters that we can include in paths to represent special places:

+ `.` in a list refers to the directory that list or function is describing
+ `..` is the directory one element above `.` (closer to `/`)
+ `~` is shorthand for our `home` directory (on our `ec2` boxes, `/home/ubuntu`)
+ a leading `.` in a name (e.g. `.something`) means that file or directory is hidden

## configuration

we mentioned up above that the `/etc` directory contains many **configuration** files. what are configuration files? what does it mean to configure something?

by "configuration" we mean the ability to specify how a given program will execute. when you write functions, you often write them to have *parameters* (e.g. in the function `f(x, y, z)` I am suggesting there are parameters `x`, `y`, and `z` that you can change).

configuration is a generalization of that parameterization idea to software programs you can execute

linux has a multi-level understanding of *configuration* (that is, specifying how a program should run):

1. this command's configuration as I invoke it right now
    1. example: command line flags (`myapp --env dev --db mysql`)
1. "nearby" configuration
    1. example: a configuration file (`myapp.conf`) in the current working directory, or e.g. in `[current working directory]/config`
1. user level configuration
    1. example: a configuration file (`~/.myapp/config` or `~/.myapprc`) in a user's home directory
        1. appending `rc` is a common way to signal a file is a configuration file
    1. example: an environment variable (`$HOME`) in a user's current terminal session (more on this later)
1. system level configuration
    1. example: a configuration file (`/etc/myapp/myapp.conf`)
1. default values (if any) written in the code of the program itself

you have seen a little bit of this in the `ssh` command:

+ you could pass your key file as a command line flag: `-i /path/to/my/key`
+ you could write it permanently in a user-specific config file: `~/.ssh/config`

there's also a system-level configuration in `/etc` -- let's look at it with the `cat` command:

```sh
cat /etc/ssh/ssh_config
```

the order of the configuration levels listed above (command line, nearby file, user-level file, system-level file, built-in default) represents the order of importance: the higher up in the list, the "closer" that configuration is to the user typing the command that is about to run, and the more important it is. "close" configurations take precedence.

for example, if I write

```bash
ssh -i /path/to/my/file username@hostname
```

I have configured the identity file (the option that `ssh` config files call `IdentityFile`), and this will override anything provided "further" from the command line (values in the `~/.ssh/config` file, or values in `/etc/ssh/ssh_config`)
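to make the "closest wins" rule concrete, here is a rough sketch of how a program *might* resolve one setting. everything in it is hypothetical: the `resolve_db` function, the `MYAPP_DB` variable, and the `sqlite` default are made up for illustration, and config files are left out to keep it short. real programs each implement this in their own way.

```bash
# hypothetical: how "myapp" might decide which database to use
resolve_db() {
    local flag_value="$1"           # value of a --db flag, empty if not given
    if [ -n "$flag_value" ]; then
        echo "$flag_value"          # 1. command line: closest, so it wins
    elif [ -n "$MYAPP_DB" ]; then
        echo "$MYAPP_DB"            # 2. user-level environment variable
    else
        echo "sqlite"               # 3. default written into the program
    fi
}

unset MYAPP_DB                          # start from a clean slate
resolve_db ""                           # sqlite
MYAPP_DB=postgres resolve_db ""         # postgres
MYAPP_DB=postgres resolve_db "mysql"    # mysql
```

each level is only consulted when every "closer" level had nothing to say.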

**<div align="center">what are your questions?</div>**

## `bash` and environment variables

up above we looked at the contents of our "home" directory by executing

```bash
ls -alh ~/
```


among those files, there are a few worth pointing out right away -- the ones beginning with the phrase `.bash`. let's list them in our terminal, utilizing a glob (note the `*` character):

```bash
ls -alh ~/.bash*
```

this should match every file in our home directory `~` that starts with a `.bash` and ends with any number of characters of any kind


we actually already know a lot about these files, just because we know about `linux` paths! let's use some of our `linux` file system instincts:

+ these files start with a `.` character so they are *hidden*
    + we can only see them when we pass the `-a` flag to `ls` (compare `ls -l ~/` with `ls -al ~/`)
+ one of these files is of the form `.****rc`, so it is probably a *configuration file*

these `~/.bash*` files are indeed all **configuration files**, and they configure how the [*bash*](https://en.wikipedia.org/wiki/Bash_%28Unix_shell%29) (Bourne-again shell) program operates. these files are responsible for a lot of your experience while you interact with the terminal.

programs written by other developers (or yourself) can check to see if certain variables are set and change the way they operate as a result

within the main file (`.bashrc`) we can set our own *environment variables*, change program configurations, change the color and wording in our command prompts, and so on.

speaking of environment variables...

the commands you are executing in your terminal are a disparate set of decades-old-but-evolving programs and scripts, many of them written in `C` and some in languages like `python`.

the process which exposes them to you as commands you can type is a *shell*, and there are different implementations of shells. the most common -- and the default in ubuntu (and therefore the one on your ec2 instances) -- is one called the "Bourne-again shell", or `bash`. as the name hints, `bash` is a more powerful successor to an older program called the Bourne shell, or `sh` for short.

aside from being a way of interactively executing commands, `bash` is also a scripting language -- you can save a sequence of commands you enter in your terminal for repeat use. this is basically how batch processing works in the `linux` world.

*this is actually just like `R` and `python`*
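a tiny sketch of what "saving a sequence of commands" can look like, assuming a throwaway file created by `mktemp` stands in for a script you would normally write in an editor:

```bash
# save a few commands into a file (a throwaway one, made by mktemp)
script=$(mktemp)
cat > "$script" <<'EOF'
echo "home is: $HOME"
echo "working in: $PWD"
EOF

# run the saved commands as often as you like
bash "$script"
bash "$script"
```

the quoted `'EOF'` keeps the variables from being filled in while the file is written, so they are resolved each time the script runs.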

again, entire textbooks have been written about just `bash` and `bash` scripting; we don't have the time to cover it all in this course. for now, just the basics.

### environment variables

in `python` you are familiar with code like

```python
x = 1
y = x + 3
print(y)
```

where you create *variables* `x` and `y` to hold values like `1` and `1 + 3`

just like `python` or `R`, shell sessions have a concept of variables. they are strings of characters following a dollar sign:

```bash
VARIABLE_NAME=1
echo $VARIABLE_NAME
```

and they usually come in that all-caps snake case.

a lot of environment variables are already set -- look at the following, for example:

```sh
echo $USER
echo $HOME
echo $PWD
```

those are your user name, your home directory, and your current working directory (`pwd` is short for `p`rint `w`orking `d`irectory).


many of these variables were set on a system level (in the aforementioned `/etc` files), and each user's `.bashrc` gives them the ability to make updates as desired.

Let's check out what variables already exist without us making changes via our `.bashrc` files. we can do this with the `env` (environment) command:

```bash
env
```


That's a lot!

note that `USER`, `HOME`, and `PWD` all exist in that list, along with many more.

#### the `PATH` variable

At the moment, I'll only point out one other important variable, because it comes up all the time: the `PATH` variable.

```bash
echo $PATH
```


the `PATH` variable is a `:`-separated list of absolute paths to directories. any time you want to run a command, the shell process will look through the directories in this `PATH` list, in order, to see if it finds that command.

suppose you just installed an awesome program called `l33tmode` and you want to run it from the command line. when you type in `l33tmode`, the bash process will check those directories one at a time to see if `l33tmode` is within any of them. as soon as it finds `l33tmode`, it will stop looking and execute it. if it makes it through every directory and finds nothing, it will return an error

```bash
command_that_doesnt_exist
```
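you can also ask the shell to show its work. a short sketch using two standard tools: `tr`, to put each `PATH` directory on its own line, and `command -v`, which reports what the shell would run for a given name (in an interactive session it may print an alias definition instead of a path):

```bash
# print each PATH directory on its own line, in the order they are searched
echo "$PATH" | tr ':' '\n'

# ask the shell where it found a command
command -v ls

# for a name that is nowhere on the PATH, command -v prints nothing and fails
command -v command_that_doesnt_exist || echo "not found on PATH"
```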

**why bring this up?**

whether or not a given thing is found in a directory in your path is often a primary cause of unexpected problems. it's also something that you often find yourself updating based on trouble-shooting or installation instructions, so it's good to know what it's there for.

for example, suppose you installed a program you want to use in `/opt`. while that may have been a good place to install it, without adding the directory that holds it (here, `/opt/zachs_super_programs`) to your `PATH`, you will need to be explicit when calling that command -- `/opt/zachs_super_programs/l33tmode`, for example.
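here is a sketch of that situation that you can run safely. it assumes a throwaway directory stands in for `/opt/zachs_super_programs`, and the `l33tmode` "program" is just a made-up two-line script. the change to `PATH` only lasts for the current session.

```bash
# stand-in for /opt/zachs_super_programs: a throwaway directory with one program in it
bindir=$(mktemp -d)
printf '#!/bin/sh\necho "l33tmode engaged"\n' > "$bindir/l33tmode"
chmod +x "$bindir/l33tmode"    # mark the file as executable

# not on the PATH yet, so we have to spell out the full path
command -v l33tmode || echo "l33tmode is not on the PATH"
"$bindir/l33tmode"

# append the directory to PATH (for this session only) and try again
PATH="$PATH:$bindir"
l33tmode
```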

#### setting variables

you are welcome to set variables in your current session, though I would recommend against updating some of the ones that have already been set (they've been set because programs for sure use those variables, and you will be changing how those programs work -- which you can do! you're allowed, that's the point).

try

```bash
# no spaces here!
MY_VAR=1

echo $MY_VAR
```

note that we *set* variables by just saying the name (e.g. `MY_VAR`) but we *reference* variables with a `$`. the `$` character is how we "tell" the `bash` session that what follows is a variable, and should be replaced by its value before `bash` continues executing the command

```sh
echo $MY_VAR  # --> echo 1
```

### sessions

if you wanted to work in `python` or `R` you would double click some icon or go to the terminal and type one of those two commands and "something" (some process) would start and then you'd have a "place" you could start entering your `python` or `R` commands.

what you are doing is creating a *session*, an interactive interpreter for `python` or `R` commands that can receive commands as text you enter, convert them into computations, and spit back out an answer

`bash` is similar: there is a `bash` command that creates a *session* with an interpreter of the `bash` language commands. the prompt in `python` looks like

```
ubuntu@ip-172-31-23-11:~$ python3
Python 3.6.8 (default, Aug 20 2019, 17:12:48) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
```

and in `bash` looks like

```
ubuntu@ip-172-31-23-11:~$ 
```

when we make an `ssh` connection to our `ec2` box, the `ssh` server on our `ec2` box is responsible for creating a `shell` for us to start executing commands -- it basically runs that first `bash` command for us, and we are put *directly* into the running `bash` session

unlike `python` or `R` though, we have the ability to open *other* `bash` sessions, or run entire programs written in `bash` within our current one.

for a bit of `bash`ception, try (in your current `bash` session, which we have called your `ec2` terminal),

```bash
bash
```

this will *look* like nothing changed, but you're actually in a `bash` program running inside that first `bash` program

type `exit` once to exit the `bash > bash` program.

type `exit` a second time to exit the top-level `bash` program and close your `ssh` connection

so aside from being cool, there is one other reason to know about this right now: often, when you execute code, you are creating new `bash` sessions, or running programs in sub-sessions.

there are a handful of environment variables -- those listed by the `env` command -- that will get passed into those sub-commands.

up above we wrote `MY_VAR=1`, and after that we can reference `$MY_VAR` and get the value 1 -- **in the current session**.

if we want to be able to access `$MY_VAR` in sub-sessions or other programs, we need to go a little further -- we need to `export` that variable.

`export` is equivalent to saying "set `MY_VAR` to be equal to 1, **but also** do that any time you start a sub-session too"

to see the difference, type

```bash
env
```

and verify `MY_VAR` is not listed. then type

```bash
export MY_VAR=1
env
```

and verify that `MY_VAR` now appears in the list

it might not seem like it matters, but this pattern (creating and `export`ing specific variables for use in other programs) actually comes up quite a bit. often there are helper functions that set these variables for you, but you will need to know what they are to set them yourself from time to time
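a compact way to watch this happen, as a sketch: the variable names `NOT_EXPORTED` and `EXPORTED` are made up, and `bash -c '...'` starts a sub-session that runs one command and exits. the single quotes stop the current session from filling in the variables itself, so the sub-session has to look them up on its own.

```bash
NOT_EXPORTED=1         # set in this session only
export EXPORTED=2      # set here and passed along to sub-sessions

# the sub-session can see EXPORTED but not NOT_EXPORTED
bash -c 'echo "NOT_EXPORTED=[$NOT_EXPORTED] EXPORTED=[$EXPORTED]"'
# prints: NOT_EXPORTED=[] EXPORTED=[2]
```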

### `.bashrc`

as I mentioned above, `bash` has a primary configuration file for each user located at `~/.bashrc`. 

let's look at the contents of that file using the `cat` (concatenate) command 

```bash
cat ~/.bashrc
```


**diversion**: 

why did we `cat` this time and `echo` before?

what's different about the two programs?

+ `echo` will print exactly what is passed (after resolving variables) back to the screen
    + we used it above to *expand* the shortcut `~` and then print it so we could read the result
    + if we attempted to `echo` that file, it would literally echo us -- it would take that file name and print it to screen. try it
+ `cat` is expecting a file, and it will open that file and print the contents

```bash
echo ~/.bashrc
```
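the same contrast on a file whose contents we control, as a small sketch (the file is a throwaway one created by `mktemp`, and `>` writes the output of `echo` into it):

```bash
# a throwaway file with one line of text in it
note=$(mktemp)
echo "hello from inside the file" > "$note"

echo "$note"   # prints the path of the file
cat "$note"    # prints the contents of the file
```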


Many things are done in `.bashrc` files, but three are most common:

1. run a command every time we start a command line session
    1. example: `conda activate my_conda_environment`
2. update or set our own variables
    1. example: `export ENV=DEV`
3. create aliases
    1. example: `alias l33t='my_complicated --expression -t hat -i am --tired oftyping'`
    2. an alias is a shortcut for a larger command or sequence of commands
    3. technically, this is just a type of the first element

let's focus on piece 3 -- creating an alias. Note that we've typed the same thing many times above:

```bash
ls -alh /some/path
```

it would be convenient if we didn't have to type that all the time -- maybe we should make an alias? Try

```bash
ll ~
```


on your ec2 system, that should have worked -- someone already put that alias in your `.bashrc` (that someone is the AMI creator).

this command -- and others you will get used to, over time -- are not pre-configured on every machine. eventually you'll get to a machine where you don't have the `ll` command, and you'll have to remember to create the alias and where you save aliases.
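for that day, a minimal sketch of creating one by hand. the name `lsa` is made up so that it doesn't clash with the `ll` you already have, and the line that would make it permanent is left commented out:

```bash
# define an alias for the current session (lsa is a made-up name)
alias lsa='ls -alh'

# given just a name, alias prints the definition back
alias lsa

# to keep it for future sessions, the same definition would go in ~/.bashrc, e.g.
# echo "alias lsa='ls -alh'" >> ~/.bashrc
```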

here's [a list of awesome aliases](https://www.cyberciti.biz/tips/bash-aliases-mac-centos-linux-unix.html). don't add them yet -- learning the commands the "long" way is useful itself, and gives you the appreciation of the reasons for these shorthands.

### keeping things separate

perhaps at this point you may have an appreciation for an off-hand comment back in the "file structure and organization" segment: the files in `/home` are separated from the rest for a specific reason. As time goes on, you will tweak and update your `.bashrc` file to get your bash session *juuuuust* the way you want it. You'll do this for many other files too, and those tweaks and updates will live in your `~` directory.

when the sysad wants to do something reckless (like update the OS version overnight, for example), it will be good to have all the files that define *your* experience living in a single place that isn't disrupted by global changes.

it's also easier, when migrating to a new computer, to just zip up the contents of your home directory and move them to your new computer. Anyone who has ever bought a new laptop in the pre-cloud days knows the dance of keeping the old computer around long enough to be sure you haven't missed any super important files while copying over -- well, in the linux world that problem is solved via discipline and the `/home` directory

it is also not uncommon to compile your particular `.bashrc` modifications and keep them on `git` where they are always available when you enter a fresh new `linux` environment

## permissions

Let's dig in to the printout information that we see every time we run the `ll -h` command:

```
drwxr-xr-x 4 ubuntu ubuntu 4.0K Aug 23 02:58 ./
drwxr-xr-x 3 root   root   4.0K Aug 23 02:55 ../
-rw-r--r-- 1 ubuntu ubuntu  220 Aug 31  2015 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3.7K Aug 31  2015 .bashrc
drwx------ 2 ubuntu ubuntu 4.0K Aug 23 02:58 .cache/
-rw-r--r-- 1 ubuntu ubuntu  655 May 16 12:49 .profile
drwx------ 2 ubuntu ubuntu 4.0K Aug 23 02:55 .ssh/
```

each of the above lines comes with 7 elements:

```
-rw-r--r-- 1 ubuntu ubuntu 3.7K Aug 31  2015 .bashrc
```

1. the permission bits
    1. a 10-character string with characters d, r, w, x, and -
2. the number of "links"
    1. for directories, the number of sub-directories including `.` and `..`
    2. for files, the number of hard links (usually 1)
3. the owner name
    1. a user name (`ubuntu` or `root`, here)
4. the owner group
    1. a group name (`ubuntu` or `root`, here)
5. the file size
    1. the flag `-h` writes them as `h`uman readable
6. a timestamp
    1. when the file was last modified
7. the file or directory name
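to pick a listing apart yourself, here is a sketch that labels the first five elements of one line. it leans on `awk`, a text-processing tool we haven't covered, which splits each line on whitespace and calls the pieces `$1`, `$2`, and so on. the file is a throwaway one created by `mktemp`, and there is no `-h` flag, so the size is in bytes.

```bash
f=$(mktemp)    # a throwaway file to inspect

ls -l "$f" | awk '{
    print "permission bits:", $1
    print "links:          ", $2
    print "owner:          ", $3
    print "group:          ", $4
    print "size (bytes):   ", $5
}'
```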

we're going to focus on the permission items -- 1, 3, and 4

### permissions: user and group

the first thing to know is that the linux world considers three levels of permissions for every file on the system

1. user
2. group
3. global (everyone else; `linux` tools usually call this "other")

every file has permissions that apply to the user that owns it, the group that owns it, and then everyone who does not fall into those first two buckets.

right now, your user is `ubuntu`. You could find this with either of the two following commands:

```bash
whoami
```

or

```bash
echo $USER
```

In [None]:
%%bash
whoami

In [None]:
%%bash
echo $USER

each user is also a member of some number of groups (a "group" is a collection of users or other groups). By default, every user in the ubuntu OS is put into a group with the same name as the user name, but you may be in many more

to check the groups your user is in:

```bash
groups
```

In [None]:
%%bash
groups
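
the `id` command reports the same user and group information in one place, which is handy inside scripts. a small sketch (the variable names are just for illustration):

```bash
user_name=$(id -un)       # same answer as whoami
primary_group=$(id -gn)   # the group new files you create will usually get
all_groups=$(id -Gn)      # same answer as groups

echo "user:          $user_name"
echo "primary group: $primary_group"
echo "all groups:    $all_groups"
```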

so, returning for a moment to our home directory:

```
drwxr-xr-x 4 ubuntu ubuntu 4.0K Aug 23 02:58 ./
drwxr-xr-x 3 root   root   4.0K Aug 23 02:55 ../
-rw-r--r-- 1 ubuntu ubuntu  220 Aug 31  2015 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3.7K Aug 31  2015 .bashrc
drwx------ 2 ubuntu ubuntu 4.0K Aug 23 02:58 .cache/
-rw-r--r-- 1 ubuntu ubuntu  655 May 16 12:49 .profile
drwx------ 2 ubuntu ubuntu 4.0K Aug 23 02:55 .ssh/
```

we see that the parent directory (`..`) is owned by user `root` and group `root`, but every other item is owned by us (user `ubuntu` and group `ubuntu`)

how does that compare to the root directory? 

to `/var/log`?

*what is the command for displaying these directories, and who are the users and groups which own those files?*

#### changing user or group ownership

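**if** you are the user which owns the file, you can change its group to any group you are a member of.

changing the *user* which owns a file (including giving it away) generally requires super user privileges on linux (more on the super user below). simply being a member of the group which owns the file is not enough to change either one.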


the command to `ch`ange the `own`er is `chown`

the command to `ch`ange the `gr`ou`p` is `chgrp`
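
both commands take the new owner first and the file second. here is a hedged sketch on a throwaway file, assuming GNU `stat` is available to show the result. on most linux systems handing a file to a *different* user needs super user rights, so that line is left as a comment.

```bash
scratch_file=$(mktemp)
my_group=$(id -gn)

# syntax: chgrp GROUP FILE
# this works because we own the file and belong to the group
chgrp "$my_group" "$scratch_file"

# syntax: chown USER FILE (or chown USER:GROUP FILE to set both at once)
# left commented out: it would need super user rights
# chown root "$scratch_file"

owner_and_group=$(stat -c '%U:%G' "$scratch_file")
echo "now owned by $owner_and_group"
```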

### permissions: mode bit string

the real secret sauce of linux permissioning is these leading ten characters, and learning to read them and modify them is a huge win.

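the very first character is usually either a `d` (for "directory"), or a `-` (a regular file). other file types get their own letter (e.g. `l` for a symbolic link), but these two are the ones you will see most.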

the remaining 9 characters are actually 3 groups of 3 characters.

the first group of 3 characters is for the **user** which owns the file

the second group of 3 characters is for the **group** which owns the file

the final group of 3 characters is for **everyone else**

each group of 3 characters lays out the privilege level of the user, group, or everyone else:

1. a `r` (for "read") or a `-`
2. a `w` (for "write") or a `-`
3. a `x` (for "execute") or a `-`

"execute" means the file is something which can be run (if the system knows how), or is a directory users can enter.

again, returning to the `ll -h` results:

```
drwxr-xr-x 4 ubuntu ubuntu 4.0K Aug 23 02:58 ./
drwxr-xr-x 3 root   root   4.0K Aug 23 02:55 ../
-rw-r--r-- 1 ubuntu ubuntu  220 Aug 31  2015 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3.7K Aug 31  2015 .bashrc
drwx------ 2 ubuntu ubuntu 4.0K Aug 23 02:58 .cache/
-rw-r--r-- 1 ubuntu ubuntu  655 May 16 12:49 .profile
drwx------ 2 ubuntu ubuntu 4.0K Aug 23 02:55 .ssh/
```

what are the permissions on `.bashrc`? consult with your neighbor.

what are the permissions on `.ssh`?

why might the permissions between the two be different?
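
if it helps to see the pieces pulled apart, here is a small sketch that slices a permission string into its four parts with `cut`. the string below is made up for the example (it is not one of the files above), and the variable names are just for illustration.

```bash
mode='drwxr-x---'

file_type=$(printf '%s' "$mode" | cut -c1)      # d or -
user_perms=$(printf '%s' "$mode" | cut -c2-4)   # the owning user
group_perms=$(printf '%s' "$mode" | cut -c5-7)  # the owning group
other_perms=$(printf '%s' "$mode" | cut -c8-10) # everyone else

echo "type:  $file_type"
echo "user:  $user_perms"
echo "group: $group_perms"
echo "other: $other_perms"
```

read that as: a directory, its owner can do anything, members of its group can read and enter it, and nobody else can touch it.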

#### bit representation

there is a common shorthand for discussing those 9 permission characters as a set of 3 digits. the rule is as follows:

+ start with a value of 0
+ if the `x` flag is set, add 1 ($2^0$)
+ if the `w` flag is set, add 2 ($2^1$)
+ if the `r` flag is set, add 4 ($2^2$)

this guarantees a unique permission number for every one of the 8 possible permission combinations:

| permission string | binary value | decimal value |
|-|-|-|
| `---` | `000` | 0 |
| `--x` | `001` | 1 |
| `-w-` | `010` | 2 |
| `-wx` | `011` | 3 |
| `r--` | `100` | 4 |
| `r-x` | `101` | 5 |
| `rw-` | `110` | 6 |
| `rwx` | `111` | 7 |

because of this, a sequence of 9 characters is often discussed as 3 numbers (spaces added for emphasis):

+ `rwx rwx rwx` is `111 111 111` is `7 7 7`
+ `rw- r-- ---` is `110 100 000` is `6 4 0`
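
the add-1, add-2, add-4 rule is easy to turn into code. this is a minimal sketch, and `triplet_to_digit` is a made-up helper name rather than a standard command.

```bash
# convert one 3-character group (e.g. rw-) into its digit
triplet_to_digit() {
    value=0
    case $1 in r??) value=$((value + 4)) ;; esac   # read    adds 4
    case $1 in ?w?) value=$((value + 2)) ;; esac   # write   adds 2
    case $1 in ??x) value=$((value + 1)) ;; esac   # execute adds 1
    echo "$value"
}

# rw- r-- --- from the example above
octal_mode="$(triplet_to_digit rw-)$(triplet_to_digit r--)$(triplet_to_digit ---)"
echo "$octal_mode"   # 640
```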

#### changing permission mode string

above we saw how to change the owner and group of a file with `chown` and `chgrp`. we `ch`ange the permission `mod`e string with `chmod`.

it works with either the character representation (e.g. `rw-r-----`) or the bit representation (e.g. `640`)

to update the permission mode string using the character representation, you ask the following:

+ am I changing permissions for
    + `u`: the user which owns the file
    + `g`: the group which owns the file
    + `o`: all others
    + `a`: all of the above
+ am I
    + `+`: adding permission
    + `-`: revoking permission
+ is the permission
    + `r`: read
    + `w`: write
    + `x`: execute
    
given the above answers, you concatenate them. if you want all users (`a`) to gain (`+`) read permission (`r`) you would run

```bash
chmod a+r /my/file
```
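
here is a sketch of a few character-style changes on a throwaway file, printing the permission string after each step. it assumes GNU `stat`; the letter `o` (for "others") and combining two letters such as `go` are standard `chmod` syntax.

```bash
scratch_file=$(mktemp)
chmod 600 "$scratch_file"       # start from a known state
stat -c '%A' "$scratch_file"    # -rw-------

chmod a+r "$scratch_file"       # everyone gains read
stat -c '%A' "$scratch_file"    # -rw-r--r--

chmod u+x "$scratch_file"       # the owning user gains execute
stat -c '%A' "$scratch_file"    # -rwxr--r--

chmod go-r "$scratch_file"      # group and others lose read
final_permissions=$(stat -c '%A' "$scratch_file")
echo "$final_permissions"       # -rwx------
```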

to update the permission mode string using the numeric representation, you simply calculate the exact permission string you wish to have as a number and assign it.

for example, to give the owning user and group complete control over a file (`rwx` is `7`), but completely restrict the outside world (`---` is `0`), you would apply

```bash
chmod 770 /my/file
```

for your `ssh` key, you probably had to make the permissions more restrictive -- this is to make sure that the file is readable by you (`r--` is `4`) but inaccessible to anyone else (`---` is `0`).

```bash
chmod 400 ~/.ssh/my_aws_private_key.pem
```
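
to check your arithmetic, you can ask `stat` to print the numeric and character forms side by side. a minimal sketch on a throwaway file, assuming GNU `stat` (`%a` is the numeric form, `%A` the character form):

```bash
scratch_file=$(mktemp)

chmod 770 "$scratch_file"
stat -c '%a %A' "$scratch_file"    # 770 -rwxrwx---

chmod 400 "$scratch_file"
final_mode=$(stat -c '%a %A' "$scratch_file")
echo "$final_mode"                 # 400 -r--------
```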

changing file permissions is a bit of a black art at first. you'll open the same webpages for the same tutorials every time. you'll learn that there are special characters beyond `rwx` and they do wild and mysterious things.

eventually, you'll get the hang of it!

at the very least, you should know how to read these permission strings. they will explain so many of the errors you encounter.

## super user, *aka* [`sudo`](https://xkcd.com/149/)

there is one major exception to the rules listed above: the super user.

`linux` has a concept called "super user" which is, as the name implies, pretty super. this user can be thought of as the administrator account on the `linux` machine, and as administrator it owns many of the most important configuration files and runs most of the essential processes.

anything that the super user does is protected by the permission structure -- only the super user may change the files or alter the processes run by the super user.

the super user is often colloquially and sometimes literally called `root` (literally on ubuntu -- you will notice that a user named `root` owns and runs most things).

this all begs the question: when I want to break everything, how do I do it? if this super user has been put in place to properly configure everything on my machine and keep things running smoothly, how am I expected to make a mess of it?

`linux` developers thought of this, too, and have constructed a system whereby non-`root` users can be given root permissions in some or all cases.

on your `ec2` instance, user `ubuntu` has already been granted these permissions -- `aws` took care of that for us, thanks gang.

in the real world, the sysad will make this decision. if a sysad *doesn't* give you root privileges, know that while that is annoying, it is *actually a very good sign* -- that sysad has standards and control policies in place and there is likely a centralized way of doing things.

*weeds note: granting sudo permission is done by adding users to a "sudoers" file, where you can configure the level and types of access granted to each user on the machine. it can be configured down to the command level*

you *use* your sudo privileges by "becoming" the super user. you can do this in two ways:

1. `sudo su`
    1. this will have you "log in" and create a new shell in which you are acting in all ways as the super user.
    2. every command you type after this until you `exit` will be performed as the root user
    3. **BE CAREFUL!**
    4. probably don't do this
2. `sudo [type my command here]`
    1. this will momentarily log you in as the root user and execute the provided command, then log you out
    2. **STILL BE CAREFUL!**
    3. still probably don't
    
note: `sudo su` is technically just a special instance of `sudo [type my command here]` where the command is `su`.
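
a harmless way to see the difference is to run `whoami` with and without `sudo`. this sketch only calls `sudo` if it can do so without prompting for a password (the `-n` flag), so it is safe to paste anywhere; the variable names are just for illustration.

```bash
current_user=$(whoami)

# only try sudo if it exists and will not stop to ask for a password
if command -v sudo >/dev/null 2>&1 && sudo -n true 2>/dev/null; then
    elevated_user=$(sudo -n whoami)
else
    elevated_user="(sudo is not available without a password here)"
fi

echo "plain whoami: $current_user"
echo "sudo whoami:  $elevated_user"
```

on your `ec2` instance the second line should name the super user rather than `ubuntu`.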

## files are just text, extensions are *hints*, not *rules*

we've talked so far about *paths* -- ways to describe files. files themselves are just sequences of characters, sometimes binary, sometimes in human readable text.

files have names and those names often have *extensions* (e.g. `.pdf`, `.png`, `.exe`). from your time living in the mac and windows worlds, you are trained to think that applications know how to handle some extensions and not others, and that there are "correct" ways to open a `.pdf` file, for example.

because of this, it can be jarring to see files with *no extension at all*

what type of files even are these?

it might seem confusing, but it gets at the heart of an important fact about how our computers work:

+ *extensions* (e.g. `.pdf`) are a convention for giving hints to the OS on how to interpret the contents of the file
+ *file formats* are conventions for how the contents of a file should be laid out so that programs (like `adobe acrobat`) can read them

there are [many, many](https://en.wikipedia.org/wiki/List_of_file_formats) file formats

all programs *really* expect is *some* text strings formatted in *some* way. the extension is a nice tip to the operating system or a given program that "the text inside this file is formatted like a `pdf`, so just assume that first before freaking out and guessing"

for example, I could create the following file:

```sh
echo "print('hello world')" > hw.pdf
```

this is *`python`* code (formatted such that the `python` program will understand it) in a file with a `.pdf` extension. will the `python` program be able to run it? let's see:

```sh
python3 hw.pdf
```

I can assure you adobe acrobat would not be happy to open that "pdf"
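
you can check for yourself that the extension is only a label by looking at the first few characters of the file. real `pdf` files begin with the characters `%PDF`; ours does not. a minimal sketch, using a scratch directory so nothing lands in your home directory:

```bash
workdir=$(mktemp -d)
echo "print('hello world')" > "$workdir/hw.pdf"

# look at the first 4 characters of the file, ignoring its name entirely
first_four=$(head -c 4 "$workdir/hw.pdf")

if [ "$first_four" = "%PDF" ]; then
    echo "the contents look like a pdf"
else
    echo "named .pdf, but the contents start with: $first_four"
fi
```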

should you run around naming things with arbitrary extensions? no, probably not. there are a lot of ways in which the extension is a useful heads up to programs that care, not to mention people that don't want to figure out why your `R` code is in `mpg` format.

the file extension is meaningless to the `bash` program -- it is a helpful hint the author of that file gave to **you**, the user, on how to run it

that being said, you will often see code that has no extension, especially in the `bash` context. this doesn't mean it does nothing, or that no program can read it.

### the shebang: `#!`

in addition to giving users a helpful hint about what program to use via the file name extension, authors of plain text files can also give the `shell` a hint for how to proceed. it is commonly called the [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix)) and is represented by the characters `#!` at the very start of a file

in any text file, the author can write

```sh
#!interpreter [optional-arg]
```

where `interpreter` is some interpreter program (specified by absolute path), optionally followed by an argument

you have perhaps seen this at the top of `python` files and wondered wtf was happening:

```python
#!/usr/bin/env python3
```

you are giving the `shell` program a polite heads up that as it reads what follows in `myfile.py` it should use the program `/usr/bin/env` to find the program called `python3` and pass all the subsequent commands through it

you're doing `bash` a solid.
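
to see the shebang and the execute permission working together, here is a sketch that writes a tiny script with *no extension at all*, marks it executable, and runs it by path. it uses `/bin/sh` as the interpreter instead of `python3` only so that it runs on any machine; the file name is made up.

```bash
workdir=$(mktemp -d)
script_path="$workdir/say_hello"    # note: no extension

# line 1 is the shebang, line 2 is the program
printf '%s\n' '#!/bin/sh' 'echo "hello from a script with no extension"' > "$script_path"

# without the x permission the system will refuse to run it
chmod u+x "$script_path"

# run it by path: the shebang decides which program reads the file
script_output=$("$script_path")
echo "$script_output"
```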

## using the terminal

there are a handful of useful tips you should know about how to enter commands in the terminal

### autocomplete

in most scenarios, if you are trying to enter something for a command (the command name, the path to a file) `bash` supports autocompletion by pressing tab, or tab completion. in your terminal, try typing

```bash
pyt<PRESS TAB>
```

and then

```bash
py<PRESS TAB TWICE>
```

the way this works is fairly straightforward. there is a large list of known commands or file paths on your machine, and every time you press `TAB` the `bash` session will filter that list for commands or paths that start with that same set of characters.

+ if there is only one known command starting with those characters, it will completely fill it in for you
+ if there is a set of commands starting with those characters,
    + if all those commands have the same following characters (e.g. `ab-cd-one`, `ab-cd-two`, `ab-cd-three` are all "found" by typing `ab-` and pressing `TAB`), it will fill in the shared characters (`ab-cd-`) and stop
    + if you press `TAB TAB` after this, it will list the collection of items that match up to that point
    + as a result (one `TAB` finishes as much as possible and two subsequent `TAB`s list the options), it is common to write a part of a command and press `TAB TAB TAB`

try the following and look at the items that are still in the list after each step

```bash
py<TAB TAB>
pyt<TAB>
python3<TAB TAB>
python3-<TAB>
python3-jsond<TAB>
```

as I said above, this also works for files, and is the fastest way to "peek" at the contents of the file system while executing commands:

```bash
ls /home/ub<TAB>
ls /home/ubuntu/<TAB TAB>
ls /home/ubuntu/.b<TAB TAB TAB>
# et cetera
```

while it might seem like the best reason to do this is speed, don't discount the value of avoiding typos -- you don't need to be careful about the paths to files if you tab complete them; you *know* those files exist and you *know* the string representing that path is correct
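
if you are curious what list `bash` is filtering, the `compgen` builtin will print completion candidates without you pressing `TAB` at all. this is a sketch that assumes you are running `bash` (it is not a separate program); what it prints depends on what is installed on your machine.

```bash
# command names starting with "py" -- roughly what py<TAB TAB> would list
compgen -c py | sort -u | head -n 5

# directories whose path starts with "/us"
compgen -d /us

# keep the candidates in a variable to inspect them later
ls_matches=$(compgen -c ls)
echo "$ls_matches" | sort -u | head -n 5
```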

### command history

the commands you've executed so far are saved in a database that is accessible in a number of ways:

1. the `history` command
1. pressing `up` and `down`
1. reverse search via `ctrl + r`

the `history` command will print out your `history` of executed commands

```bash
history
```

this is often most useful in combination with another tool (`grep`); more on that another time

at any point you can cycle through your recent commands by pressing `up` and `down`. right now, pressing `up` will likely bring up `history`.

try it out!

finally, the most common situation is

> I know I typed that thing I want to type yesterday, what was it...

rather than pressing up 35 times, if you know *any* of what you want to type, press

```
<ctrl + r>
```

and you will be moved into a "reverse-i-search".

anything you type here will look for the most recent command in your history that contains those characters (anywhere in the command, not just at the start).

+ press `ctrl + r` multiple times and you will cycle through multiple past commands
+ press `tab` to put the current highlighted command into the terminal
+ press `enter` to skip straight to executing that command

try

```bash
<ctrl + r><type "ls "><press ctrl + r multiple times>
```

## philosophy

there is a much ballyhooed [list of linux and unix philosophies](https://en.wikipedia.org/wiki/Unix_philosophy#Mike_Gancarz:_The_UNIX_Philosophy):

1. small is beautiful
2. do one thing and do it well (DOTADIW)
3. build a prototype as soon as possible
4. choose portability over efficiency
5. store data in flat text files
6. use software leverage to your advantage
7. use shell scripts to increase leverage and portability
8. avoid captive user interfaces
9. make every program a filter

most of these are guidelines on how to *develop* linux, but there are important lessons in here about how you are intended to *use* linux.

##### do one thing and do it well (DOTADIW)

this is the most commonly cited linux philosophy, and is pretty central to the identity. I'd note that this is the opposite of many windows or mac world solutions, which attempt to act as swiss army knives across multiple domains.

if you want to do something, there is probably one way to do it. and there is hopefully just one way to do it. and that's all that thing will ever do.

avoid swiss army knives!

##### store data in flat text files

many of the `linux` tools you will use are optimized for reading and writing text files. this is contrary to some data science and data engineering instincts, which might favor databases. this changes as data gets larger, and flat text files take over again. there's a spectrum of storage needs

as a result, keep it simple from the start and complicate things only as you need -- start with local flat text files as the default; advanced storage methods are the exception.

##### avoid captive user interfaces

this is related to a basic usability question: should running a program require active participation from a user?

in `linux` the attitude is no.

stretching this all a bit for data scientists, this crops up when you make a tool you wish to share (e.g. a data science pipeline). the end goal of that software should tend toward one where a user sets it up, presses enter, and walks away.

for shareable, production code (not exploration, a different paradigm entirely) users should not have to manipulate your code, tweaking values, re-running cells

data science exploration often occurs in notebook-like settings, where you are re-defining parameters, re-running cells, and jumping around throughout the notebook.

for someone else to use your code, they need to know where to find these special variables, which cells to run and in which order -- a recipe for disaster

what is another option?

use the `linux` configuration paradigm. whatever these changeable parameters are, abstract them to be command line flags (`-X` or `--xyz`). convert the most important set of cells into functions in a `python` package.
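
as a rough sketch of what "abstract them to be command line flags" can look like, here is a shell function that reads two flags with the standard `getopts` builtin. the function name, the flags `-m` and `-b`, and the default values are all made up for the example; the "pipeline" just prints what it would run.

```bash
run_pipeline() {
    # defaults, used when a flag is not given
    model="linear_regression"
    bootstrap="false"

    OPTIND=1    # reset so the function can be called more than once
    while getopts "m:b" flag; do
        case $flag in
            m) model=$OPTARG ;;       # -m takes a value
            b) bootstrap="true" ;;    # -b is an on/off switch
            *) echo "usage: run_pipeline [-m MODEL] [-b]" >&2; return 1 ;;
        esac
    done

    echo "model=$model bootstrap=$bootstrap"
}

default_run=$(run_pipeline)
custom_run=$(run_pipeline -m lasso -b)
echo "$default_run"    # model=linear_regression bootstrap=false
echo "$custom_run"     # model=lasso bootstrap=true
```

the person running it changes the behavior without ever opening the code.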

consider creating external configuration files, such that users never touch your code but simply update `.config` files and re-run

this has several benefits:

1. the user doesn't change *the code* in any way to run many particular versions of the process
2. the configuration files can be version controlled separately
3. the configuration files can stand in as a structured representation of the particular modelling process
4. the user can create 10 different `config` files, write a script to run them all, and walk away for a day.

let's make it at least a little bit more concrete.

when you start to have several feature selection options, several modelling approaches, several evaluation criteria, several test/train split methods.... you see how it might be nice to have a compact representation of those options. for example, in `python` I could write

```python
# file: my_fav_params.conf
my_good_params = {
    'featureselectors': ['boruta', 'lasso'],
    'bootstrap': True,
    'models': ['neuralnet', 'lasso', 'adaboost'],
}

clients_not_so_good_params = {
    'featureselectors': ['identity'],
    'bootstrap': False,
    'models': ['linear_regression'],
}
```

you could write your script to expect a file with a `params` dictionary in it. if your code is slightly more generic, it can handle different configurations without changing the code
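
the "write 10 config files and walk away" idea can be sketched in the shell as well. this version assumes a much simpler config format than the `python` dictionaries above: plain `KEY=value` lines that the shell can load directly. the file names, keys, and the `run_pipeline` function are all hypothetical, and loading a config this way runs it as shell code, so only do it with files you trust.

```bash
config_dir=$(mktemp -d)

# two hypothetical config files, one setting per line
printf 'MODEL=lasso\nBOOTSTRAP=true\n' > "$config_dir/mine.conf"
printf 'MODEL=linear_regression\nBOOTSTRAP=false\n' > "$config_dir/client.conf"

run_pipeline() {
    # load the settings from the config file given as the first argument
    . "$1"
    echo "running model=$MODEL bootstrap=$BOOTSTRAP"
}

# the code never changes; only the config files do
for config_file in "$config_dir"/*.conf; do
    run_pipeline "$config_file"
done
```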

##### make every program a filter

most `linux` programs are designed to take in lines of text, perform some calculation, and spit out lines of text. in modern data engineering parlance, `linux` programs are ETL processes where the extract and load steps are both normalizing results to sequences of text lines.

these sequences of text lines are pulled from or put into buffer objects called `stdin` ("standard input") and `stdout` ("standard output"), respectively. if errors occur, they are put into a separate buffer called `stderr` ("standard error")

the output of any command (`stdout`) is printed to the terminal by default. you may wish to save it to file instead -- you can change this output by

1. writing it as a file with the `>` characters, or
2. appending it to an existing file with the `>>` characters

There are actually several options, and [this Ask Ubuntu answer](https://askubuntu.com/a/731237) lays them out well.
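
a short sketch of the two redirections above, plus one extra that is not covered in the text: `2>` sends `stderr` to a file instead of `stdout`. everything is written to a scratch directory, and the file names are made up.

```bash
workdir=$(mktemp -d)

echo "first line"  >  "$workdir/notes.txt"   # > creates (or overwrites) the file
echo "second line" >> "$workdir/notes.txt"   # >> appends to it
line_count=$(wc -l < "$workdir/notes.txt")
echo "notes.txt has $line_count lines"

# ls on a missing file prints nothing to stdout and an error to stderr;
# "|| true" keeps a script going even though ls reports failure
ls "$workdir/no-such-file" > "$workdir/stdout.txt" 2> "$workdir/stderr.txt" || true
cat "$workdir/stderr.txt"
```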

one of the advantages of this ETL approach is the results of one program can be passed directly to another without saving and loading -- this is called "piping" in the linux world, and the character which does the passing around of intermediate results is "`|`" (`shift` + backslash on most keyboards). This character is often called "pipe" because of this.

take, for example, the following command, where three pipes pass results between four commands

```sh
who | awk '{print $1}' | sort | uniq
```

In [None]:
%%bash
who | awk '{print $1}' | sort | uniq

what just happened?

In [None]:
%%bash
whatis who
whatis awk
whatis sort
whatis uniq

In [None]:
%%bash
who

In [None]:
%%bash
who | awk '{print $1}'

In [None]:
%%bash
who | awk '{print $1}' | sort

In [None]:
%%bash
who | awk '{print $1}' | sort | uniq
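
the output of `who` depends on who is logged in, so here is the same pipeline run on fixed, made-up input where you can predict the answer. the three fake lines only mimic the shape of `who` output.

```bash
fake_who_output='ubuntu   pts/0  2024-01-01 10:00
root     tty1   2024-01-01 09:00
ubuntu   pts/1  2024-01-01 10:05'

# keep column 1, sort so duplicates sit next to each other, then drop repeats
unique_users=$(printf '%s\n' "$fake_who_output" | awk '{print $1}' | sort | uniq)
echo "$unique_users"
# root
# ubuntu
```

`uniq` only removes *adjacent* duplicates, which is why the `sort` step comes first.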

<div align="center">we're half way there!</div>
<div align="center"><img src="https://www.eskimo.com/~lo/linux/tuxqqmerge.jpg"></div>

# END OF LECTURE

next lecture: [Linux pt 2](003_linux_2.ipynb)