# 3. Command Line Note
### linked index
1. [Working with programs](#working-with-programs)
2. [Command line python scripting](#command-line-python-scripting)
3. [Working with jupyter console](#working-with-jupyter-console)
4. [Piping and redirecting output](#piping-and-redirecting-output)
5. [Data cleaning and exploration using Csvkit](#csvkit)

## <a name="working-with-programs"></a>Working with Programs

While there are many UNIX shells, Bash is one of the most popular. Bash is the default shell on most Linux and OS X computers.<br><br>Bash is essentially a program that lets us run other programs. It does this by implementing a command language. This language specifies how to type and structure the commands we want to execute.

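When assigning a variable, avoid stray spaces around the equals sign, and wrap any value that contains spaces in quotes. For example, the first assignment below fails because the value is unquoted, while the second works:

```bash
ANIMAL=Shark with a laser beam on its head    # fails: Bash tries to run a command named "with"
ANIMAL="Shark with a laser beam on its head"  # works: the quotes keep the value together
```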

* Type export FOOD="Chicken and waffles" to create an environment variable called FOOD.

```bash
export FOOD="Chicken and waffles"
```

We can run many programs from Bash, including Python. To run the Python interpreter from the Bash shell, we type python at the command prompt.<br><br>Once we're inside the Python interpreter, we can access the environment variables with code that looks like this:

```python
import os
print(os.environ["HOME"])  # variable names are case-sensitive
```
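As a small illustration of the access pattern above, the sketch below reads environment variables with a fallback value. It sets `FOOD` inside the script only to stay self-contained (normally the shell's `export` would have done that), and `read_var` and `NOT_SET_ANYWHERE_12345` are hypothetical names used for illustration.

```python
import os

# Stand-in for `export FOOD="Chicken and waffles"`; set here only so the
# sketch runs without depending on the surrounding shell.
os.environ["FOOD"] = "Chicken and waffles"


def read_var(name, default=None):
    """Return an environment variable, or `default` when it is not set."""
    return os.environ.get(name, default)


food = read_var("FOOD")
missing = read_var("NOT_SET_ANYWHERE_12345", "unset")  # hypothetical name

print(food)
print(missing)
```

`os.environ[...]` raises a `KeyError` for a missing variable, so `.get` with a default is often the gentler choice in scripts.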

On the last screen, we typed /usr/bin/python to access the Python interpreter. If the Python interpreter is at that location, though, how come we can also access it by typing python?

* We can do this because of the PATH environment variable, which is configured to point to several folders (creating a "shortcut"). We can run any program in any one of these folders just by typing the program's name. Because /usr/bin is one of the folders in PATH and python is in that folder, we can access the python interpreter just by typing python, instead of the full path.
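The PATH lookup described above can also be inspected from Python. This is a minimal sketch, assuming a Unix-like system; `find_program` is a hypothetical helper name, and whether `python` is found depends on the machine.

```python
import os
import shutil

# PATH is one string of folders joined by os.pathsep (":" on Linux and macOS).
path_dirs = os.environ.get("PATH", "").split(os.pathsep)


def find_program(name):
    """Return the full path that would be run for `name`, or None."""
    return shutil.which(name)


for folder in path_dirs:
    print(folder)

print(find_program("python"))  # may print None if no `python` is on PATH
```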

### ls -flags

reference page : http://man7.org/linux/man-pages/man1/ls.1.html

```bash
ls -al --ignore=test.txt
ls -la --ignore=test.txt
```

## <a name="command-line-python-scripting"></a>Command Line Python Scripting

We can make a file that Python can execute on the command line by adding some lines of Python code to a blank file. Here's an example of Python code:

```python
if __name__ == "__main__":
    print("Welcome to a Python script")
```

The code above will print Welcome to a Python script when we run it from the command line. To run it, we just need to put those lines into a file, save the file as file.py, and then call it with python file.py.

We can also edit a file directly from the terminal, without redirection. While there are a few programs that let us do this, the simplest is called nano. Nano is a command line text editor that lets us edit and save files directly from the terminal.

* To run nano, type nano, followed by the name of the file you want to edit. For example, nano test.txt will open the test.txt file for editing.

Once a file is open, we can make whatever changes we want, then **hit ctrl+x to quit. When we quit, the terminal will prompt us to save our work. Typing Y (for yes), then pressing Enter will save all changes.**

```bash
touch script.py
nano script.py
python script.py
```

Install packages with pip:

```bash
pip install requests
```

### Virtual environments

On the previous screen, we used the default version of pip to install requests for the python executable, which is Python version 2.

#### What if we had wanted to install requests for Python 3 instead?
Different projects can require different packages and Python versions. This type of version switching can become confusing.<br>

Normally, a computer system has one python executable, and we have to install all packages and libraries globally. **This means that every single project on a machine has to use the same version of Python, and the same version of every package.**<br>

By default, we can't use different versions of Python without some hacks. One such hack is renaming python to python3 so we can have access to both Python 2 and Python 3.<br>

#### A better solution is for each project we write to have its own version of Python, along with its own packages. 
This way, we don't need to worry that upgrading the version of a package will affect other projects on the system and cause them to stop working.

Virtual environments, or virtualenvs, let us do this. We can create a new virtualenv with the virtualenv command. While we normally have to install the virtualenv package first in order to access this command, we've already installed it for you to simplify the process.

Typing virtualenv main will create a virtualenv named main. It will create a folder in the current directory called main that will hold all of the packages we install into the virtual environment.

```bash
virtualenv python2
dir
```
* Note how it makes a folder called python2.

**By default,** virtualenv will use the python executable when it makes a new virtualenv, which means that it has the **same version of Python as the system.** In this case, we want to use python3 for our virtualenv instead. In order to do this, we pass the `-p` flag to the virtualenv command, which will allow us to change the Python interpreter that virtualenv uses.

* In this case, we can type `virtualenv -p /usr/bin/python3 python3` to use Python 3 instead of Python 2.

```bash
virtualenv -p /usr/bin/python3 python3
source python3/bin/activate
```
* assuming that the virtualenv is called `python3`
* assuming that the folder for the virtualenv is in our current directory
* how to exit : `deactivate`

We can also look up which packages are currently installed (along with their versions) with `pip freeze`. 
* If we activate a virtualenv, all of the packages, including pip, will be from the virtualenv instead of the main system Python executable.
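To confirm from inside Python which interpreter is active, a check along these lines can help. This is a sketch rather than an official recipe: `in_virtualenv` is a hypothetical helper, and it relies on the interpreter exposing `sys.base_prefix` (or `sys.real_prefix` in older virtualenv releases).

```python
import sys


def in_virtualenv():
    """Best-effort check for whether this interpreter runs inside a virtualenv."""
    # Older virtualenv releases set sys.real_prefix; newer tools make
    # sys.prefix differ from sys.base_prefix.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base


print(sys.executable)   # path of the interpreter currently running
print(in_virtualenv())
```

Running it before and after `source python3/bin/activate` should show the interpreter path change.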

One of the great things about Python is that:
### we can import functions from a package into a file. 

We can also import functions and classes from one file into another file. This gives us a powerful way to structure larger projects without having to put everything into one file.

* We'll experiment with this style of import by writing a function in a file, and then importing it into another file.

If there's a file named utils.py, we can import it into another file in the same directory using import utils. All of the functions and classes defined in utils.py will then be available using dot notation. If there's a function called keep_time() in utils.py, we can access it with utils.keep_time() after importing it.

Create a file called utils.py that contains the following code:

```bash
touch utils.py
nano utils.py

...

def print_message():
    print("Hello from another file!")
```

Modify the original script.py file to contain this code instead:
```bash
nano script.py

...

import utils

if __name__ == "__main__":
    utils.print_message()
```
* Both script.py and utils.py should be in the same folder.
* Finally, run python script.py to print out the message.

```bash
python script.py
```

We can also pass command line options into Python scripts. We can retrieve them from inside the script through the `sys` package.<br>

Once we import the `sys` package, the `argv` list will allow us to retrieve the positional arguments passed into the script. We learned about positional arguments in the last mission -- they're the arguments that come after the command name. `python script.py 82` is one example. The first positional argument is `script.py`, and the second is `82`.<br>

The following code will read input from the command line and print it back out. If the code is in a file named `script.py`, we'd call `python script.py "Hello from the command line"` to pass in the text we want to display.

```bash
nano script.py

...

import sys

if __name__ == "__main__":
    print(sys.argv[1])

...

python script.py "Hello from the command line"
```
* Notice that we printed the second item in the argv list (sys.argv[1]). This is because the arguments come after the python command, so the first argument is the name of the file we want to run. The second argument is the actual text that we want to print.
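The script above raises an `IndexError` when no text is passed. A slightly more defensive variant might look like the sketch below, where `get_message` and its default text are hypothetical additions rather than part of the original exercise.

```python
import sys


def get_message(argv, default="(no message given)"):
    """Return the first argument after the script name, or a default."""
    # argv[0] is the script name; user-supplied arguments start at index 1.
    return argv[1] if len(argv) > 1 else default


if __name__ == "__main__":
    print(get_message(sys.argv))
```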

## <a name="working-with-jupyter-console"></a>Working with Jupyter console

The **[Jupyter console](https://github.com/jupyter/jupyter_console)**, formerly known as **IPython**, is **an enhanced Python interpreter**. From our earlier missions, you may recall that by typing python on the command line, you get access to an interactive shell that lets you write and execute Python code. Jupyter console enhances this shell, and adds several niceties that make working with data easier.<br>

Generally, it's useful to use the shell **in situations where you need to quickly test some code you're writing**. This happens frequently when you're writing data analysis scripts. It can also be used to **quickly explore datasets and do basic analysis**. Another use case is **prototyping code** before later saving it to a script file.<br>

### The main difference between Jupyter console and Jupyter notebook is that:
the console functions in interactive mode. Whenever you type a line of code, it is immediately executed, and you can see the results. 
* If you want to write medium-length pieces of code or do deep exploration of a dataset, the notebook is better. 
* If you want to test out code you're writing, or run quick commands, the console is better.

The Jupyter project is in the midst of rebranding from IPython to Jupyter. Depending on the version of Jupyter you have installed, you can access the console by typing either jupyter console or ipython at the command line.

Jupyter console has a robust built-in help system. You can get help in several ways:

* You can type ? after starting the console. This will display help about Jupyter. You can exit by typing q.
* You can type %quickref. This is a magic that will tell you some useful commands. We'll talk more about Jupyter magics shortly.
* If you want information about a variable, just type the name of the variable, followed by ?. For information on the dq variable, you'd type dq?.
* Type help() to get access to Python help. This will enable you to get help on all the modules and functions currently available. You can quit by typing quit.
* If you want to use the Python help system to get information on a variable, type help(variable_name). If you wanted help with the variable dq, you'd type help(dq).

```ipython
/home/dq$ ipython
In [1]: print(10)
10
In [2]: dq=5
In [3]: ?dq
In [4]: help(dq)
In [5]: dq_10=dq*10
```

You may have used the %quickref Jupyter magic in the last screen. Magics are special Jupyter commands that always start with %. They enable you to access Jupyter-specific functionality, without Python executing your commands.<br>

Some useful magics are:

* %run -- allows you to run an external Python script. Any variables in the script will be stored in the current kernel session.
* %edit -- opens a file editor. Any code you type into the editor will be executed by Jupyter when you exit the editor.
* %debug -- if there's an error in any of your code, running %debug afterwards will open an interactive debugger you can use to trace the error.
* %history -- shows you the last few commands you ran.
* %save -- saves the last few commands you ran to a file.
* %who -- print all the variables in the session.
* %reset -- resets the session, and removes all stored variables.

You can see a full list of magics [here](http://ipython.readthedocs.org/en/stable/interactive/magics.html).

You can use the %run, %who, and %debug magics to iteratively develop scripts with Jupyter console:

1. Have your favorite editor open, and start writing a Python script. In a separate shell, open Jupyter console.
2. As you get to checkpoints in your script where you want to test it out, use the %run magic to run the script.
3. Check the values of the variables using the %who magic.
4. If you see any errors, debug them with the %debug magic.
5. If you want to clear the session, use %reset.

```ipython
/home/dq$ nano test.py

print('test')
variable = 'test'

/home/dq$ ipython
In [1]: %run test.py
test

In [2]: %who
variable

```

[Tips]
* If you hit the **TAB key** while typing a variable name, Jupyter will show you all the possible variables it could be, or auto-complete the name if there's only a single option. If you hit TAB after typing a variable name, Jupyter will show you the methods on the variable.

* You can run shell commands in Jupyter console. Just **prefix your shell commands with an exclamation point(!)**. Running `!ls` in Jupyter will show the contents of the current directory. This can be useful **when you want to quickly inspect a file or check on the contents of a folder**.

You'll often want to paste code into Jupyter console to see if it runs properly. Because of how Python handles indentation, nested for loops, functions, and if statements will fail if you just copy and paste them in.<br>

In order to paste in code with indents, you'll need to use paste magics:

* **%cpaste** -- opens a special editing area where you can paste in code normally, without whitespace being a problem. You can type `--` alone on a line to exit. After you exit, any code you pasted in will be immediately executed.
* **%paste** -- takes code from your clipboard and runs it in Jupyter. This doesn't work on remote systems, where Jupyter doesn't have access to your clipboard.

We encourage you to keep exploring Jupyter console. Some specific explorations you can try:

* Explore more of the magics.
* Try using Jupyter to debug exceptions.
* Develop a Python script locally, and see if Jupyter can help with your workflow.

## <a name='piping-and-redirecting-output'></a>Piping(` | `) and redirecting output

```bash
/home/dq$ echo "99 bottles of beer on the wall..." > beer.txt
/home/dq$ echo "Take one down, pass it around, 98 bottles of beer on the wall..." >> beer.txt
```

In the commands above, `>` writes the output to beer.txt (overwriting any existing file), while `>>` appends to it.

The [Linux sort](https://en.wikipedia.org/wiki/Sort) command will sort the lines of a file in alphabetical order. If we pass the `-r` flag, the lines will be sorted in reverse order.

```bash
/home/dq$ sort < beer.txt
99 bottles of beer on the wall...
Take one down, pass it around, 98 bottles of beer on the wall...

/home/dq$ sort -r < beer.txt
Take one down, pass it around, 98 bottles of beer on the wall...
99 bottles of beer on the wall...
```

Sometimes, we'll want to search through the contents of a set of files to find a specific line of text. We can use the [grep](http://www.gnu.org/software/grep/manual/grep.html) command for this.

```bash
grep "pass" beer.txt
```

The above command will print any lines in beer.txt where the string pass appears, and highlight the string pass when color output is enabled.

```bash
grep "beer" beer.txt coffee.txt
```

This will show all lines from either file that contain the string beer.

#### But what if we wanted to search through all 1000 files in a folder? We definitely wouldn't want to type out all of the names. Let's say we have the following files in a directory:

* beer.txt
* beer1.txt
* beer2.txt
* coffee.txt
* better_coffee.txt

```bash
grep "beer" beer?.txt
```
The `?` wildcard matches exactly one character, so the pattern above will match both beer1.txt and beer2.txt, but not beer.txt. We can use as many wildcards as we want in a filename.

We can use the `*` character to match any number of characters, including zero.

```bash
grep "beer" beer*.txt
```
<br>
We can also use the wildcard to match more than 1 character:

```bash
grep "beer" *.txt
```

<br>
We can use wildcards anytime we would otherwise enter a filename. For example:

```bash
ls *.txt
```
The above command will list any files with names ending in .txt in the current directory.
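The three patterns can be tried against the example file names without creating any files. The sketch below uses Python's `fnmatch` module, which follows shell-style wildcard rules but is not identical to Bash globbing in every detail (hidden files, for instance, are treated differently).

```python
import fnmatch

# The example directory listing from above.
names = ["beer.txt", "beer1.txt", "beer2.txt", "coffee.txt", "better_coffee.txt"]

one_char = fnmatch.filter(names, "beer?.txt")   # ? = exactly one character
any_chars = fnmatch.filter(names, "beer*.txt")  # * = zero or more characters
all_txt = fnmatch.filter(names, "*.txt")

print(one_char)
print(any_chars)
print(all_txt)
```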

**The pipe character, `|`, allows you to send the standard output from one command to the standard input of another command.** This can be very useful for chaining together commands.<br>

For example, let's say we had a file called logs.txt with 100000 lines. We only want to search the last 10 lines for the string Error. We can use the tail -n 10 logs.txt to get the last 10 lines of logs.txt. We can then use the pipe character to chain it with a grep command to perform the search:

```bash
tail -n 10 logs.txt | grep "Error"
```
The above command will search the last 10 lines of logs.txt for the string Error.
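To see what the pipeline does step by step, here is a rough Python analogue of the two stages. It is a sketch only: `tail_grep` is a hypothetical helper, the log lines are made up, and real `grep` matches patterns rather than plain substrings.

```python
from collections import deque


def tail_grep(lines, n, pattern):
    """Keep the last `n` lines, then return those containing `pattern`."""
    last_lines = deque(lines, maxlen=n)  # plays the role of `tail -n`
    return [line for line in last_lines if pattern in line]  # plays `grep`


# Hypothetical log lines, standing in for logs.txt.
logs = ["Error: early failure"] + [f"ok {i}" for i in range(20)] + ["Error: disk full"]
matches = tail_grep(logs, 10, "Error")
print(matches)
```

The early error is not reported because it falls outside the last 10 lines, which mirrors how the first stage limits what the second stage sees.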

<br>
We can also pipe the output of a Python script. Let's say we had this script called `rand.py`:

```python
import random
for i in range(10000):
    print(random.randint(1,10))
```
This command will run the script, and search each line of output to see if a 9 occurs:

```bash
python rand.py | grep 9
```
Any lines that output a 9 will be printed.

If we want to run two commands sequentially, but not pass output between them, we can use && to chain them. The second command runs only if the first one succeeds. Let's say we want to add some content to a file, then print the whole file:

```bash
echo "All the beers are gone" >> beer.txt && cat beer.txt
```

This will first add the string All the beers are gone to the file beer.txt, then print the entire contents of beer.txt.

There are quite a few special characters that bash uses. 
* A full list can be found [here](http://tldp.org/LDP/abs/html/special-chars.html).

#### escaping(\)
Escaping tells the shell to not treat the character as special, but to treat it as a plain character instead. Here's an example:

```bash
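# Without escaping, the inner quotes end the string early and are dropped from the output:
echo ""Get out of here," said Neil Armstrong to the moon people." >> famous_quotes.txt

# Escaping the inner quotes with \ keeps them as plain characters:
echo "\"Get out of here,\" said Neil Armstrong to the moon people." >> famous_quotes.txt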
```
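When a command string is built from a Python script, the standard `shlex` module can do the escaping for us. The sketch below parses the escaped command from above with shell-like rules and then builds an equivalent command with `shlex.quote`; the variable names are illustrative only.

```python
import shlex

quote = '"Get out of here," said Neil Armstrong to the moon people.'

# The escaped command from above, parsed with shell-like (POSIX) rules.
escaped_cmd = r'echo "\"Get out of here,\" said Neil Armstrong to the moon people."'
parsed = shlex.split(escaped_cmd)

# shlex.quote builds a safely quoted argument without hand-written escapes.
safe_cmd = "echo " + shlex.quote(quote)
round_trip = shlex.split(safe_cmd)

print(parsed[1])
print(safe_cmd)
```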

## <a name='csvkit'></a>Data cleaning and exploration using Csvkit

In this mission, we'll learn about the Csvkit library, which supercharges your workflow by adding 13 new command line tools specifically for working with CSV files. We'll focus on these 5 tools from Csvkit:

* `csvstack`: for stacking rows from multiple CSV files.
* `csvlook`: renders CSV in pretty table format.
* `csvcut`: for selecting specific columns from a CSV file.
* `csvstat`: for calculating descriptive statistics for some or all columns.
* `csvgrep`: for filtering tabular data using specific criteria.

Csvkit installation doc : https://csvkit.readthedocs.io/en/0.9.1/install.html

### csvstack
* usage: `csvstack [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b] [-p ESCAPECHAR] [-z MAXFIELDSIZE] [-e ENCODING] [-S] [-H] [-v] [-l] [--zero] [-g GROUPS] [-n GROUP_NAME] [--filenames] FILE [FILE ...]`

```bash

# basic form
csvstack file1.csv file2.csv file3.csv > final.csv

# -n names a new grouping column and -g gives one value per input file:
# rows from file1.csv get 1 in the origin column, rows from file2.csv get 2, and so on.
csvstack -n origin -g 1,2,3 file1.csv file2.csv file3.csv > final.csv


# practice: combine the 3 datasets on U.S. housing affordability from the last challenge
csvstack -n year -g 2005,2007,2013 Hud_2005.csv Hud_2007.csv Hud_2013.csv > Combined_hud.csv


# show the first 5 lines
head -5 Combined_hud.csv

# validate merged file with 154118 rows
wc -l Combined_hud.csv
```
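For intuition about what stacking with a grouping column produces, here is a rough equivalent using Python's built-in `csv` module. It is a sketch, not how csvkit is implemented: `stack_csv` is a hypothetical helper, the sample rows are made up, and it assumes every input shares the same header.

```python
import csv
import io


def stack_csv(sources, group_name):
    """Stack CSV texts that share a header, prefixing a grouping column.

    `sources` maps a group value (for example a year) to CSV text.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header_written = False
    for group, text in sources.items():
        rows = csv.reader(io.StringIO(text))
        header = next(rows)
        if not header_written:
            writer.writerow([group_name] + header)  # header appears only once
            header_written = True
        for row in rows:
            writer.writerow([group] + row)
    return out.getvalue()


# Hypothetical one-row samples, not the real HUD files.
samples = {
    "2005": "AGE1,FMR\n43,680\n",
    "2007": "AGE1,FMR\n-9,702\n",
}
combined = stack_csv(samples, "year")
print(combined)
```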


### csvlook
The [csvlook tool](http://csvkit.readthedocs.io/en/0.9.1/scripts/csvlook.html) parses CSV formatted data from its stdin and outputs a **pretty formatted table representation** of that data to its stdout:

```bash
head -10 final.csv | csvlook

```

### csvcut
Let's now explore individual columns using the csvcut tool. Using the [csvcut](http://csvkit.readthedocs.io/en/0.9.1/scripts/csvcut.html) command with just the -n flag parses and displays all the columns in a CSV file along with a unique integer identifier for each column:

```bash
csvcut -n Combined_hud.csv


# will output
# 1: year
# 2: AGE1
# 3: BURDEN
# 4: FMR
# 5: FMTBEDRMS
# 6: FMTBUILT
# 7: TOTSAL
```

You can use the integer identifier for each column and the -c flag to select just a specific column:


```bash
csvcut -c 1 Combined_hud.csv
```
* This outputs the entire year column, one value per row of the file.
* To preview just the first n rows instead, you can pipe the column output to head.

```bash

csvcut -c 2 Combined_hud.csv | head -10

```
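Selecting a column by its 1-based identifier and previewing the first few values can be mimicked with the `csv` module. This is a minimal sketch with made-up sample data; `cut_column` is a hypothetical helper and does not reproduce csvcut's other options.

```python
import csv
import io


def cut_column(text, index):
    """Return one column (1-based index, header included) from CSV text."""
    return [row[index - 1] for row in csv.reader(io.StringIO(text))]


# Hypothetical sample data using column names from the listing above.
sample = "year,AGE1,BURDEN\n2005,43,0.3\n2007,-9,0.5\n2013,50,0.2\n"
age_preview = cut_column(sample, 2)[:3]  # similar in spirit to `| head -3`
print(age_preview)
```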

### csvstat
Now that we know how to select specific columns, we can select a column and pipe it to the csvstat tool to calculate summary statistics for that column:

```bash

csvcut -c 4 Combined_hud.csv | csvstat

```
This calculates a full suite of summary statistics, including:

* max,
* min,
* sum,
* mean,
* median,
* standard deviation.

```bash

# Just the max value.
csvcut -c 2 Combined_hud.csv | csvstat --max
# Just the mean value.
csvcut -c 2 Combined_hud.csv | csvstat --mean
# Just the number of null values.
csvcut -c 2 Combined_hud.csv | csvstat --nulls

```
csvstat docs : http://csvkit.readthedocs.io/en/0.9.1/scripts/csvstat.html#description

```
csvstat --mean Combined_hud.csv

#equivalent to:
csvcut -c 1 Combined_hud.csv | csvstat --mean
csvcut -c 2 Combined_hud.csv | csvstat --mean
csvcut -c 3 Combined_hud.csv | csvstat --mean
csvcut -c 4 Combined_hud.csv | csvstat --mean
csvcut -c 5 Combined_hud.csv | csvstat --mean
csvcut -c 6 Combined_hud.csv | csvstat --mean
csvcut -c 7 Combined_hud.csv | csvstat --mean

#result
1. year: 2008.9044232628457
2. AGE1: 46.511215505103266
3. BURDEN: 5.303764743668771
4. FMR: 1037.1186695822005
5. FMTBEDRMS: None
6. FMTBUILT: None
7. TOTSAL: 44041.841931779105
```

Let's use csvcut and csvstat to search for any problematic values in the AGE1 column.
```bash

csvcut -c 2 Combined_hud.csv | csvstat

# result

1. AGE1
        <class 'int'>
        Nulls: False 
        Min: -9 
        Max: 93 
        Sum: 7168169
        Mean: 46.511215505103266
        Median: 48
        Standard Deviation: 23.04901451351246
        Unique values: 80
        5 most frequent values:
                -9:  11553
                50:  3208
                45:  3056
                40:  3040
                48:  3006
```
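The same kind of summary can be reproduced on a small scale with Python's standard library, which makes it easier to see why a placeholder like -9 distorts the statistics. The values below are hypothetical, not taken from the HUD data.

```python
import statistics
from collections import Counter

# Hypothetical AGE1 values; -9 plays the role of the suspicious placeholder.
ages = [-9, 50, 45, -9, 40, 48, -9, 50]

summary = {
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "most_common": Counter(ages).most_common(2),  # (value, count) pairs
}

for name, value in summary.items():
    print(name, value)
```

A negative minimum that is also the most frequent value is a strong hint that the number is a missing-data code rather than a real age.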

### csvgrep
We can use csvgrep to select all the rows that match a specific pattern to dive a bit deeper. By default, [csvgrep](http://csvkit.readthedocs.io/en/0.9.1/scripts/csvgrep.html) will search all of the rows in the dataset but we can restrict the search to specific columns using the -c flag (just like with csvcut). We then use the -m flag to specify the pattern:

```bash

csvgrep -c 2 -m -9 Combined_hud.csv

```
This command will return all rows from Combined_hud.csv with -9 as the value for the AGE1 column. The behavior of csvgrep can be customized using the flags.
* For example, you can use the -r flag to pass in a regular expression as the pattern instead.

We're now going to combine several of the tools we've talked about so far so that you can see the real power of using the csvkit tools together with other CLI tools.

Display the first 10 rows from Combined_hud.csv where the value for the AGE1 column is -9 in a pretty table format.

```bash

csvgrep -c 2 -m -9 Combined_hud.csv | head -10 | csvlook

```

Let's now filter out all of these problematic rows from the dataset since they have data quality issues. Csvkit wasn't developed with a sharp focus on editing existing files, and the easiest way to filter rows is to create a separate file with just the rows we're interested in. 
* To accomplish this, we can redirect the output of csvgrep to a file. 
* So far, we've only used csvgrep to select rows that match a specific pattern. 
* We need to instead select the rows that don't match a pattern, which we can specify with the -i flag. 

You can read more about this flag in the [documentation](http://csvkit.readthedocs.io/en/0.9.1/scripts/csvgrep.html).


Select all rows where the value for AGE1 isn't -9 and write just those rows to positive_ages_only.csv.

```bash

csvgrep -c 2 -m -9 -i Combined_hud.csv > positive_ages_only.csv

```
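The inverted filter can be sketched in plain Python as well. Note the assumptions: `drop_matching` is a hypothetical helper, the sample rows are made up, and it compares values with exact equality rather than searching for a pattern, so it is an analogue of the command above rather than a replica of csvgrep.

```python
import csv
import io


def drop_matching(text, column, value):
    """Return CSV text without the rows whose `column` equals `value`."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row[column] != value:  # keep only non-matching rows, like -i
            writer.writerow(row)
    return out.getvalue()


# Hypothetical sample rows, not the real Combined_hud.csv.
sample = "year,AGE1\n2005,43\n2007,-9\n2013,50\n"
cleaned = drop_matching(sample, "AGE1", "-9")
print(cleaned)
```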