# Introduction to Jupyter Notebooks

Jupyter notebooks are a handy tool that lets you write code and annotate it with text-based sections.  Notebooks are divided into "cells", which in the standard form include `Markdown` and `Code` types.

The `Markdown` cells (such as the one in which these words appear) are useful for writing plain text, displaying static images, linking to external websites or sources, and using $\LaTeX$ formatting for math or other things.

The `Code` cells are where your actual code goes.  The programming language to be used is determined by the kernel, which in our case is the Python 3 kernel (see the upper-right corner of the notebook page).  This means every code cell must contain valid Python 3 code.

It is possible to install kernels for other programming languages; however, the process is somewhat more involved and is not necessary for this workshop.

____

Each notebook can be thought of as an interactive code session.  Every cell that has been "run" updates the overall session for that notebook.  In this way, you can divide up larger tasks into smaller pieces, carry variable information from one cell to the next, and define functions or set up workflows early on to then use later in the notebook.  There are also a number of ways to interact with things outside the notebook, such as files or even the system itself.

In the first few cells, we have the variable `x` being given different values.  The middle cell prints out the current value of `x` whenever you run it.  Run the first three cells in order, then run the `print(x)` command cell again and see how it changes the output.  This is to illustrate the point above, that every cell changes the overall environment as you run it.  So be careful about what variables you use, when and where you use them, and how they might get reassigned.

The fourth cell also deletes the variable entirely.  Try it out, followed again by the `print(x)` cell, to see the effects.

In [1]:
x = 0

In [2]:
print(x)

0


In [3]:
x = 1

In [4]:
del x

If we combined the overall effects of the instructions above into a single cell, it would look like this.

In [5]:
x = 0
print(x)
x = 1
print(x)
del x
print(x)

0
1


NameError: name 'x' is not defined

Notice that once we delete our variable using `del`, the variable is completely lost to us, and attempting to use it again without assigning a value results in an error.
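
If a cell might run after its variable has been deleted (or before it was ever assigned), one defensive option is to catch the `NameError`.  The sketch below only illustrates the idea; the fallback value and the `message` name are assumptions made for this example, not part of the cells above.

```python
# Recreate the situation from the cells above: assign, then delete.
x = 0
del x

try:
    print(x)  # x no longer exists, so this line raises NameError
except NameError as error:
    message = str(error)
    print("Caught an error:", message)
    x = 0  # hypothetical fallback: give x a value again

print(x)  # x exists again, so this works
```

In practice, re-running the cell that assigns the variable is usually the simpler fix; catching the error is mostly useful inside longer workflows.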

Now let's look at some basic libraries that let your Python code interact with the rest of the system.

### The `os` Library

The `os` Python library allows users to interact with the operating system, and is very useful for things ranging from file and directory manipulation up to actual command line interactions.

To start, let's import the `os` library and see what our current location is.

In [None]:
import os

os.getcwd()

Now let's create a new directory in this location.  We'll call it `new_dir/`.  Also, if it already exists, we'll make sure Python doesn't raise an error and just accepts it.  This is the Python equivalent of the bash command `mkdir -p new_dir/`.

In [None]:
os.makedirs("new_dir/",exist_ok=True)

Now that the directory exists, we can move into it.  We'll also confirm that's our current location.

In [None]:
os.chdir("new_dir/")

In [None]:
os.getcwd()

We can move back and forth easily; just keep in mind that your current location is maintained for the entire notebook session.  If you change your working directory (`os.chdir()`), you'll need to account for that in subsequent cells if you're looking for data files in specific locations.
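
One way to keep a directory change from leaking into later cells is to remember where you started and always go back.  The sketch below is a minimal illustration under assumptions: it uses a temporary scratch directory from the standard `tempfile` module rather than `new_dir/`, and the variable names are made up for the example.

```python
import os
import tempfile

start_dir = os.getcwd()  # remember where we started

with tempfile.TemporaryDirectory() as scratch_dir:
    os.chdir(scratch_dir)
    try:
        inside = os.getcwd()  # work that depends on the new location goes here
    finally:
        os.chdir(start_dir)  # always return, even if the work above fails

restored = os.getcwd()
print("Back in the starting directory:", restored == start_dir)
```

The `try`/`finally` pair means the notebook returns to its starting directory even if the code in between raises an error.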

We can also save the path into a variable, and then use that variable for future things.

In [None]:
os.chdir("../")
my_location = os.getcwd()
print(my_location)

What if we want to refer to this notebook by its filename?  If we assign just the notebook name to a variable, it doesn't include any directory information.  This is a "relative path", which is interpreted relative to the current working directory.  We can use the `os.path` module to get the "absolute path".

In [None]:
this_notebook = "Jupyter_Notebook_Basics.ipynb"
print(this_notebook)

In [None]:
os.path.abspath(this_notebook)

We can see that the module joined the current location with the string contained in the `this_notebook` variable, creating the absolute path.  However, let's try this with a file that doesn't exist in this folder.

In [None]:
os.path.abspath("does_not_exist.txt")

The same thing happens, even though this file doesn't exist.  The module has just given us a filepath that would lead to that file in this folder; it didn't actually check whether the file exists.  Fortunately, we can do that on our own.

In [None]:
os.path.exists("does_not_exist.txt")

Now we can see that the file doesn't actually exist, reported as a boolean (`True`/`False`).
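
Note that `os.path.exists` returns `True` for directories as well as files.  When the difference matters, `os.path.isfile` and `os.path.isdir` are more specific.  The sketch below is a small illustration; the scratch directory and the `notes.txt` file name are assumptions for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch_dir:
    file_path = os.path.join(scratch_dir, "notes.txt")  # hypothetical file name
    with open(file_path, "w") as handle:
        handle.write("hello\n")

    # Run each check while the scratch directory still exists.
    checks = {
        "file exists": os.path.exists(file_path),
        "file is a file": os.path.isfile(file_path),
        "directory exists": os.path.exists(scratch_dir),
        "directory is a file": os.path.isfile(scratch_dir),
        "directory is a directory": os.path.isdir(scratch_dir),
    }

for label, result in checks.items():
    print(label, "->", result)
```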

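Going back to the notebook file, what if we wanted additional information about it?  Let's say we wanted to know when the file was created.  We can use the `getctime` function, with one caveat: it reports the creation time on Windows, but on Linux and other Unix-like systems it reports the time of the file's last metadata change instead.  We can also use `getmtime` to see when the file's contents were last modified.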

In [None]:
create_time = os.path.getctime("Jupyter_Notebook_Basics.ipynb")
print(create_time)

That doesn't look like a date/time at all.  That's because the value is reported as the number of seconds since the epoch (Unix start of all time, which was **January 1st, 1970 at 00:00:00 UTC**).
If we want a more useful timestamp, we'll need to bring in the `time` library.


In [None]:
import time
local_time = time.ctime(create_time)
print(local_time)

That's much better.  Let's also get the file's extension, base name, and parent directory.

In [None]:
full_notebook_path = os.path.abspath("Jupyter_Notebook_Basics.ipynb")
print("\tAbsolute Path")
print(full_notebook_path)
print("")
base_name,extension = os.path.splitext(full_notebook_path)
print("\tBase Name")
print(base_name)
print("")
print("\tExtension")
print(extension)
print("")
parent_directory = os.path.dirname(full_notebook_path)
print("\tParent Directory")
print(parent_directory)

We can also "nest" the `dirname` function multiple times to go up multiple levels.

In [None]:
grandparent_directory = os.path.dirname(parent_directory)
print(grandparent_directory)
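
Notice that `os.path.splitext` on a full path leaves the directories attached to the "base name".  If only the file name is wanted, `os.path.split` (or `os.path.basename`) can separate it first.  The sketch below uses a made-up path purely for illustration; the path does not need to exist, because these functions only manipulate strings.

```python
import os

# Hypothetical path, used only to show the string handling.
example_path = "/home/user/workshop/Jupyter_Notebook_Basics.ipynb"

directory, file_name = os.path.split(example_path)   # split off the last component
stem, extension = os.path.splitext(file_name)        # split the name at the last dot
two_levels_up = os.path.dirname(os.path.dirname(example_path))  # nested dirname

print(directory)      # /home/user/workshop
print(file_name)      # Jupyter_Notebook_Basics.ipynb
print(stem)           # Jupyter_Notebook_Basics
print(extension)      # .ipynb
print(two_levels_up)  # /home/user
```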

### The `subprocess` library

One of the ways we can interact with the rest of the system is to use the `subprocess` library.  This library contains a number of methods for these kinds of interactions, depending on what you need from each interaction.

First, we'll look at the `.call()` function, which is used when you just need to issue a command and wait for it to finish but don't need the actual outputs from that command.  There are two common ways to use the `.call()` function: with a list of individual parts, or with a single string and the `shell=True` argument included.  The list form is useful when you have variables that need to be included in your command (and you don't want to use f-strings).

In [2]:
import subprocess
list_call  = subprocess.call(["echo","chickens"])
shell_call = subprocess.call("echo chickens",shell=True)

chickens
chickens


In a notebook environment, the command's output still appears below the cell.  However, the variables we assigned (`list_call` and `shell_call`) hold only the command's **return code**, not its output.  In this case, a return code of `0` means the process completed with no errors and exited cleanly.

In [None]:
shell_call

In [None]:
list_call
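
To make the return code idea more concrete, the sketch below builds a list-form command from a variable and compares a successful call with a failing one.  The failing command is an assumption chosen for the example: it starts a second Python interpreter (via `sys.executable`) that exits with code 3 on purpose.

```python
import subprocess
import sys

word = "chickens"  # a variable dropped straight into the list form

# A command that succeeds returns 0.
ok_code = subprocess.call(["echo", word])

# A command that fails returns a non-zero code; here a child Python
# process deliberately exits with code 3.
fail_code = subprocess.call([sys.executable, "-c", "import sys; sys.exit(3)"])

for label, code in [("echo", ok_code), ("failing command", fail_code)]:
    status = "succeeded" if code == 0 else "failed"
    print(label, status, "with return code", code)
```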

If we want to actually capture the results of a command so we can use them, we need `subprocess.Popen()`, which takes a few additional arguments.  First, we need to include the `stdout` and `stderr` keywords to capture the output and errors from our command.  For both of these, we can pass `subprocess.PIPE`, a constant provided by the `subprocess` library.  Giving each its own pipe keeps outputs and errors separate from each other, which can be useful in many ways.

In [None]:
process = subprocess.Popen("ls -lrth", shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

In [None]:
process

Notice that instead of the return value of 0, we now get a `Popen` object along with its memory address (for our purposes, we don't really need to care about the memory address).  What we want, though, is the actual results from the command we ran.  To get that, we need to use the object's built-in `.communicate()` method.  This method returns a tuple of **TWO** values (the standard output and the standard error), so we'll unpack them into two variables of our own to make them easier to work with.  You can choose to keep them together as a single tuple, but I find that to be more trouble than it's worth later on.

In [None]:
out,err = process.communicate()

In [None]:
out

In [None]:
err

You may notice that both `out` and `err` look like strings, except they begin with a `b` prefix.  This indicates that the data is stored as a `bytes` object, not an ordinary string (`str`).  It's important to know this if you intend to use the information contained in these results, because `str` and `bytes` objects can't always be used interchangeably with other code.

To get `out` into a more useful (for now) string format, we'll need to decode it using the `utf-8` character encoding, which is the default on most Linux systems.  This is generally the most common encoding, though there are others out there.

In [None]:
output_string = out.decode('utf-8')
print(output_string)

Notice how, when we used the `print()` function on our new `output_string`, it's properly formatted with multiple lines, just as we'd expect from the `ls -lrth` command in bash.  This also means we can treat this string similar to how we might treat a block of text from a file.
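
Treating the decoded output like a block of text from a file might look like the sketch below.  It is a minimal illustration that swaps in a simpler command (`echo potato; echo cabbage`) so the result is predictable; `str.splitlines()` breaks the string into a list with one entry per line.

```python
import subprocess

# A simpler stand-in for `ls -lrth`, chosen so the output is predictable.
proc = subprocess.Popen(
    "echo potato; echo cabbage",
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
out, err = proc.communicate()

# Decode the bytes, then split the text into one string per line.
lines = out.decode("utf-8").splitlines()

for number, line in enumerate(lines, start=1):
    print(number, line)
```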

What about errors?  In the previous example, `err` was empty (`b''`).  Let's try the same steps as before, except we'll change the command to force an error.  Below, I will "forget" to include the dash (`-`) in our `ls -lrth` command.  I'll also decode both the `out` and `err` variables so we can see the results of each.

In [None]:
process = subprocess.Popen("ls lrth", shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out,err = process.communicate()
output_string = out.decode('utf-8')
error_string  = err.decode('utf-8')

In [None]:
print(output_string)

In [None]:
print(error_string)

Now we can see how separating the output and error can be useful.  If you are running something a bit more complex, getting a non-empty error message can be a quick way to identify problems and halt the program.
We can run effectively any bash command from a Python environment using the `.call()` and `.Popen()` functions, including several commands chained together on one line with semicolons (`;`).
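
As an aside, the standard library also offers `subprocess.run`, which bundles the launch, wait, and capture steps into one call.  The sketch below is an alternative to the `.Popen()` pattern above, not what this notebook uses; it assumes Python 3.7 or newer (for `capture_output` and `text`), and the directory name is deliberately made up so that `ls` fails.

```python
import subprocess

result = subprocess.run(
    "ls no_such_directory_for_this_sketch",  # hypothetical name, meant to fail
    shell=True,
    capture_output=True,  # capture stdout and stderr separately
    text=True,            # decode the bytes to str for us
)

# A non-zero return code plus a non-empty stderr signals a problem.
if result.returncode != 0:
    print("Command failed with return code", result.returncode)
    print(result.stderr.strip())
```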

The next examples will print out a word, then wait for three seconds, and then print a second word.  When you run the first two cells below, pay careful attention to the behaviors.

In [3]:
subprocess.call("echo potato; sleep 3; echo cabbage",shell=True)

potato
cabbage


0

In [4]:
proc = subprocess.Popen("echo potato; sleep 3; echo cabbage",shell=True,stdout=subprocess.PIPE,stderr=subprocess.PIPE)

In [None]:
out,err = proc.communicate()
output = out.decode()
print(output)

The use of `.call()` allows each step of the output to be printed immediately to our console as the process runs, while `.Popen()` returns immediately, before the command itself has finished.  The next cell with `.communicate()` doesn't give us anything until the entire process has completed.  This is fine if we just need to capture the entire output or if we just need the process to start while the rest of our program continues on.  We can, however, use another method to reclaim the immediacy of `.call()`, which can be useful if we need to watch for a specific pattern in the output.  In the expanded example below, we've added another `sleep 3` command, followed by `echo porcupine`.  This means the entire process should take about six seconds (printing the text is effectively zero time for our purposes).

In [None]:
proc = subprocess.Popen("echo potato; sleep 3; echo cabbage; sleep 3; echo porcupine",shell=True,stdout=subprocess.PIPE,stderr=subprocess.PIPE)
for line in proc.stdout:
    out_string = line.decode('utf-8')
    print(out_string)

In [None]:
proc = subprocess.Popen("echo potato; sleep 3; echo cabbage; sleep 3; echo porcupine",shell=True,stdout=subprocess.PIPE,stderr=subprocess.PIPE)
for line in proc.stdout:
    out_string = line.decode('utf-8')
    print(out_string)
    if "cabbage" in out_string:
        break

Notice that the first example of `potato, cabbage, porcupine` printed out each word with a delay in between, just as when we used `.call()`.  In the second example, we added a check to see if we'd encountered `cabbage` in the output, and when we did, the `break` command immediately ended the loop.  However, the process itself is still running in the background.  If we combine pieces from the previous two examples, we can highlight this effect.

In [None]:
proc = subprocess.Popen("echo potato; sleep 3; echo cabbage; sleep 3; echo porcupine",shell=True,stdout=subprocess.PIPE,stderr=subprocess.PIPE)
for line in proc.stdout:
    out_string = line.decode('utf-8')
    print(out_string)
    if "cabbage" in out_string:
        print("I've got to break free!")
        break

for line in proc.stdout:
    out_string = line.decode('utf-8')
    print(out_string)

The above example shows that even though we escaped the loop, the actual background process was still running, and by starting another loop with the same `proc`, we can just resume the observation.  This is to highlight that unless you actively **STOP** the process you've started, it will continue as normal.  Let's modify the example above to use `.kill()` to terminate the process the moment we encounter a `cabbage` (as one does).  I've also increased the second `sleep` command to 10 seconds to illustrate another point.

In [None]:
proc = subprocess.Popen("echo potato; sleep 3; echo cabbage; sleep 10; echo porcupine",shell=True,stdout=subprocess.PIPE,stderr=subprocess.PIPE)
for line in proc.stdout:
    out_string = line.decode('utf-8')
    print(out_string)
    if "cabbage" in out_string:
        print("I've got to break free!")
        proc.kill()

for line in proc.stdout:
    out_string = line.decode('utf-8')
    print(out_string)
print("Now the process has been terminated.")

Even though we sent the kill signal immediately upon detecting a `cabbage`, the final line didn't print until after the `sleep 10` command was complete.  This is because `.kill()` only terminates the process that `subprocess` started, which here is the shell running our command string.  By that point the shell had already launched `sleep 10` as a separate child process; that child keeps running and keeps the output pipe open, so the loop over `proc.stdout` waits until it exits.  The shell itself was stopped right away, which is why `echo porcupine` never ran and we never saw a `porcupine` in our outputs.
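
If the goal is to stop the shell *and* whatever it has already launched, one option on Linux and macOS is to start the command in its own session and signal the whole process group.  The sketch below is a POSIX-only illustration, not the method used in this notebook; `start_new_session`, `os.killpg`, and `os.getpgid` are standard library features, while the shortened first `sleep` and the variable names are assumptions for the example.

```python
import os
import signal
import subprocess
import time

start = time.time()
proc = subprocess.Popen(
    "echo potato; sleep 1; echo cabbage; sleep 10; echo porcupine",
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    start_new_session=True,  # POSIX only: give the shell its own process group
)

seen = []
for line in proc.stdout:
    word = line.decode("utf-8").strip()
    seen.append(word)
    print(word)
    if word == "cabbage":
        # Signal every process in the group: the shell and its children.
        os.killpg(os.getpgid(proc.pid), signal.SIGTERM)

proc.wait()  # collect the finished shell process
elapsed = time.time() - start
print("Stopped after about", round(elapsed), "seconds")
```

Because the `sleep 10` child is terminated along with the shell, the loop ends almost as soon as `cabbage` appears rather than ten seconds later.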

The `subprocess` library can also be useful when using Python to create more complicated files and scripts that will then be executed.  However, be careful when combining bash commands into a script if you think you'll need to interrupt it.  Using `subprocess` to run the command `sh ./myscript.sh` starts a single shell process that works through *everything* inside `myscript.sh`, and `.kill()` only targets that shell, so a command the script has already launched can keep running after the shell is gone.  If you need finer control, one possibly useful workaround is to launch the individual commands from Python with separate `subprocess` calls, so that each one is a process you can stop directly.

To help with the next section, we'll also use a `.Popen()` command to quickly create a file inside the `new_dir` folder we made in the `os` library segment.

In [13]:
proc = subprocess.Popen("echo 'endless garlic breadsticks' >> new_dir/something.txt",shell=True)
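
Because `.Popen()` returns before the command finishes, the file may not exist yet if the very next line tries to use it.  The sketch below shows two hedged options: calling `.wait()` before reading, and appending with Python's built-in `open` instead of the shell.  It writes into a temporary scratch directory rather than `new_dir/`, and it uses `shlex.quote` to make the path safe to paste into a shell command; both choices are assumptions for the example.

```python
import os
import shlex
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as scratch_dir:
    target = os.path.join(scratch_dir, "something.txt")

    # Option 1: use the shell, then wait for it to finish before reading.
    command = "echo 'endless garlic breadsticks' >> " + shlex.quote(target)
    proc = subprocess.Popen(command, shell=True)
    proc.wait()

    # Option 2: append from Python directly, with no shell involved.
    with open(target, "a") as handle:
        handle.write("endless garlic breadsticks\n")

    with open(target) as handle:
        contents = handle.read()

print(contents)
```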

### The `glob` library

The `glob` library is fairly simple, in that it only has a few functions of any real importance to us (at least for this workshop).  It is primarily used to find all files matching a pattern.  The pattern can be specific to a single filename, or it can include wildcards, or even recursively search subdirectories.  Let's take a look at some examples using our current location.

First, let's look for this specific notebook, which *should* still be titled `Jupyter_Notebook_Basics.ipynb`.

In [7]:
import glob
glob.glob("Jupyter_Notebook_Basics.ipynb")

['Jupyter_Notebook_Basics.ipynb']

Note that it gives us the result as a single-element list, not just a string.  Always be sure to pay attention to the kinds of data you're getting from functions, because datatypes matter!

Now let's find **ALL** the notebook files in this folder.  We'll use the wildcard character (`*`) in our search pattern.

In [10]:
glob.glob("*.ipynb")

['Introduction_to_Loops.ipynb',
 'Introduction_to_Pandas.ipynb',
 'Introduction_to_RDKit.ipynb',
 'Introduction_to_Matplotlib.ipynb',
 'Introduction_to_Numpy.ipynb',
 'Jupyter_Notebook_Basics.ipynb']

We have a list of all of our notebooks ready to go.  You might notice they're not in any particular order.
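
If a predictable order matters (for example, when processing files in sequence), wrapping the result in `sorted()` is a simple fix.  The sketch below creates a few empty files in a temporary scratch directory so it can run anywhere; the file names are made up for the example.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as scratch_dir:
    # Create a few empty, hypothetical files to search through.
    for name in ["b_notebook.ipynb", "a_notebook.ipynb", "c_notes.txt"]:
        open(os.path.join(scratch_dir, name), "w").close()

    # sorted() returns the matches in alphabetical order.
    matches = sorted(glob.glob(os.path.join(scratch_dir, "*.ipynb")))
    names = [os.path.basename(path) for path in matches]

print(names)
```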

We can also search for directories.

In [11]:
glob.glob("*/")

['new_dir/']

And finally, a search for any files that have extensions, including in any subdirectories.

In [14]:
glob.glob("**/*.*",recursive=True)

['Introduction_to_Loops.ipynb',
 'Introduction_to_Pandas.ipynb',
 'Introduction_to_RDKit.ipynb',
 'Introduction_to_Matplotlib.ipynb',
 'Introduction_to_Numpy.ipynb',
 'Jupyter_Notebook_Basics.ipynb',
 'new_dir/something.txt']
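
The `recursive=True` argument is what gives `**` its special meaning; without it, `**` behaves like a plain `*` and matches only a single directory level.  The sketch below compares the two on a small, made-up directory tree built in a temporary scratch location.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a hypothetical tree: top.txt, sub/inner.txt, sub/deeper/leaf.txt
    os.makedirs(os.path.join(root, "sub", "deeper"))
    for parts in [("top.txt",), ("sub", "inner.txt"), ("sub", "deeper", "leaf.txt")]:
        open(os.path.join(root, *parts), "w").close()

    pattern = os.path.join(root, "**", "*.txt")

    # With recursive=True, ** matches zero or more directory levels.
    deep = sorted(os.path.relpath(p, root) for p in glob.glob(pattern, recursive=True))

    # Without it, ** acts like *, so only files exactly one level down match.
    shallow = sorted(os.path.relpath(p, root) for p in glob.glob(pattern))

print("recursive=True :", deep)
print("default        :", shallow)
```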

### Partial Imports

Sometimes, it's useful and/or safer to only import the specific functions and classes you actually need from a library.  Or, it can be helpful to import different sections of a library with a different assigned name for each.

In the case of the `glob` library above, we only really ever use `glob.glob`, where the first `glob` is the library name and the second `glob` is the function itself.

Since we only need that one function, we can do a **partial import**.

The syntax is simple:
```python
from <library> import <function>
```

In [3]:
from glob import glob

With this, we have imported the **function** directly, which means we can use it without the first `glob.` part.

In [4]:
glob("*.ipynb")

['Introduction_to_Matplotlib.ipynb',
 'Introduction_to_Numpy.ipynb',
 'Jupyter_Notebook_Basics.ipynb',
 'Introduction_to_RDKit.ipynb',
 'Introduction_to_Pandas.ipynb',
 'Introduction_to_Loops.ipynb']

This can also be useful if there's a specific function buried deep within a library that you want, but not necessarily everything else in that library.  When multiple libraries use the same function name, importing only what you need (and, as we'll see next, giving it a different name) also helps you make sure that one does not overwrite the other.
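
One caveat is worth seeing in action: when a function shares its library's name, as `glob` does, the partial import replaces the module name in your session.  The sketch below shows what happens to the name `glob` and one way around it; the alias `glob_module` is a made-up name for the example.

```python
import glob                    # the name `glob` refers to the module
kind_before = type(glob).__name__

from glob import glob          # now the name `glob` refers to the function
kind_after = type(glob).__name__

try:
    glob.glob("*.ipynb")       # the function has no attribute called `glob`
    shadowed = False
except AttributeError:
    shadowed = True

import glob as glob_module     # hypothetical alias that keeps the module reachable
still_works = callable(glob_module.glob)

print(kind_before, "->", kind_after)
print("glob.glob(...) now fails:", shadowed)
print("glob_module.glob is usable:", still_works)
```

In a notebook, this means a cell that calls `glob.glob(...)` can stop working if it is re-run after a cell containing `from glob import glob`.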

In fact, let's pretend that the `glob` function was actually buried way down inside the library and we wanted a quick and easy way to call it without having to type a bunch of code.

We can reassign the name we use like this:

```python
from <library> import <function> as <name>
```
which will let us then use `name` in place of everything that would normally be required for that particular function/module/sublibrary.

In [5]:
from glob import glob as G
G("*.ipynb")

['Introduction_to_Matplotlib.ipynb',
 'Introduction_to_Numpy.ipynb',
 'Jupyter_Notebook_Basics.ipynb',
 'Introduction_to_RDKit.ipynb',
 'Introduction_to_Pandas.ipynb',
 'Introduction_to_Loops.ipynb']

Now, instead of needing `glob.glob()` or even just `glob()`, we can simply use `G()`.  This is a handy way to cut down on typing, and it keeps your code readable as long as it's clear what the specific functions you're renaming actually do.

#### Topics Covered

- `import os` 
  - `os.getcwd`
  - `os.chdir`
  - `os.makedirs`
  - `os.path.abspath`
  - `os.path.exists`
  - `os.path.getctime`
  - `os.path.getmtime`
  - `os.path.splitext`
  - `os.path.dirname`
- `import time`
  - `time.ctime`
- `import subprocess`
  - `subprocess.call`
  - `subprocess.Popen`
  - `subprocess.Popen.communicate`
  - `subprocess.Popen.stdout`
  - `subprocess.Popen.kill`
- `import glob`
  - `glob.glob`
- Partial imports
- Library name assignments