# Quick Introduction to Google Colaboratory

Google Colaboratory (commonly abbreviated as Google Colab) is a development environment that runs in the browser using Google Cloud.

The [official notebooks](https://colab.research.google.com/notebooks/) provided by Google Colab are a good place to start learning.

## Notebooks

A notebook represents the source file in Colab.
A notebook is a web application that contains live code, equations, visualizations or text.
The Colab kernel currently supports Python2/3 and Swift.
Kernels can also be installed for other languages such as Scala or R.

A Colab notebook consists of two types of `cells`: `run` and `markdown`.
A run cell is an executable unit that expresses the action to be carried out.

You may run a cell either by clicking the button on the far left of the cell or with `SHIFT + Enter`.
The button on the left also signifies whether a cell is currently running, or if it has finished, its running order.

Here is a Python cell that outputs `Hello world!`:

In [None]:
print('Hello world!')

**Exercise:**
Replace `world` in the above line with your name and re-run the cell.

### Shell

Shell scripts/commands can also be executed in a cell.
Here are some resources to brush up on shell scripting:

- [Interactive tutorial](https://www.learnshell.org/)
- ...

However, unlike Python code, each line has to be prepended with an exclamation mark, `!`, so that the notebook recognizes the Shell command.

For example, let's check our working directory:

In [None]:
!pwd
!ls

As you can see, we are currently in `/content`, which only contains the `sample_data` directory.

**Exercise:**
Create a file called `name.txt` that contains your name with `echo`.

In [None]:
!echo "Name" > name.txt

### Magics

Notebooks define a set of system commands called `magics`.
Line magics are prepended with a single `%` character and only affect the line that they are on.
More commonly, cell magics are expressed with `%%` and govern the whole cell that they precede.

A cell magic that is going to come up often in the following notebooks is `%%capture`, which suppresses the output of the respective cell.
It is especially useful when running a cell with verbose logging, e.g., downloading or extracting files.
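
Outside a notebook, a rough plain-Python analogue of this behaviour can be sketched with `contextlib.redirect_stdout` from the standard library. This only illustrates the idea and is not how the magic itself is implemented; `noisy_step` is a hypothetical stand-in for a verbose command.

```python
import contextlib
import io


def noisy_step():
    """Hypothetical stand-in for a verbose step such as a download."""
    for i in range(3):
        print('processing chunk {}'.format(i))
    return 'done'


# Redirect everything printed inside the block into an in-memory buffer.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    result = noisy_step()

# Nothing was shown while the step ran, but the text is still available.
captured = buffer.getvalue()
print(result)
print(len(captured.splitlines()), 'lines were captured')
```

Unlike this sketch, which only redirects standard output, `%%capture` in a notebook also hides error output and rich display output by default.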

Some of the other magics include:
- `%timeit`: automatically determines the execution time of a single-line Python statement
- `%%html`: treats the entire cell as an HTML block
- `%%bash`: treats the entire cell as a Bash script

For others, check out the [Jupyter documentation](https://nbviewer.jupyter.org/github/ipython/ipython/blob/1.x/examples/notebooks/Cell%20Magics.ipynb).
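
The measurement that `%timeit` performs can be approximated outside a notebook with the standard `timeit` module. Here is a minimal sketch, where the statement and the number of runs are arbitrary choices made for illustration:

```python
import timeit

# Statement to measure; any single-line Python expression works here.
statement = 'sum(range(1000))'

# Run the statement 1,000 times and measure the total elapsed seconds.
runs = 1000
total_seconds = timeit.timeit(statement, number=runs)

print('{:.2f} microseconds per run'.format(total_seconds / runs * 1e6))
```

The reported time varies from machine to machine and from run to run, so only the order of magnitude is meaningful.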

**Exercise:**
Suppress the output of the `Hello world!` Python example with `%%capture`.

In [None]:
%%capture

print('Hello world!')

### Data

Colab keeps a temporary storage space for data and other intermediate files.
However, it is also possible to work with external data stored on your local file system, Google Drive or Google Cloud Storage (GCS), so that it persists beyond the session.
To learn more about different storage options, please visit [this webpage](https://colab.research.google.com/notebooks/io.ipynb).

We have already created a [bucket](https://cloud.google.com/storage/docs/json_api/v1/buckets) on GCS for the tutorial.
Let's check its contents with the `gsutil` command-line tool:

In [None]:
!gsutil ls gs://afirm2020

The `gs://` prefix means that our bucket, `afirm2020`, is located on Google Cloud Storage.

Colab notebooks are automatically saved in a folder named `Colab Notebooks` in your Google Drive.
You may also trigger this action at any time with the keyboard shortcut `CTRL+S`.
It is also possible to save a copy of your notebook to a GitHub repo through the Colab interface: `File -> Save a copy in GitHub...`

**Exercise:**
Copy the file `name.txt` that you created earlier to the Google bucket.

In [None]:
!gsutil cp name.txt gs://afirm2020/name.txt

### Restarting the Shell

Sometimes you may want to revert an unintended side-effect or simply start from scratch.
In this case, you may reset the Python shell through: `Runtime -> Restart runtime...`.
To reset the Colab image completely, do `Runtime -> Restart all runtimes...`.

### Using GPUs

Colab gives access to a GPU (or TPU, for more compute-intensive tasks).
Unfortunately, it is common for your session to time out after ~12 hours.
For this reason, a good practice is to save your checkpoints to GCS so that you can resume your training easily.
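
The checkpointing idea can be sketched in plain Python as follows. Everything here is a hypothetical illustration: the `checkpoint.json` file name, the `save_checkpoint` and `load_checkpoint` helpers, and the toy training loop are all assumptions. The file is written to a temporary local directory rather than to GCS; in Colab it could afterwards be copied to a bucket with `gsutil cp`, as in the earlier exercise.

```python
import json
import os
import tempfile

# Hypothetical checkpoint location; a temporary directory keeps the sketch self-contained.
checkpoint_path = os.path.join(tempfile.mkdtemp(), 'checkpoint.json')


def save_checkpoint(path, state):
    """Write the training state to disk as JSON."""
    with open(path, 'w') as f:
        json.dump(state, f)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {'step': 0, 'loss': None}


def train(path, until_step):
    """Toy loop that resumes from the last checkpoint and saves after each step."""
    state = load_checkpoint(path)
    for step in range(state['step'], until_step):
        state = {'step': step + 1, 'loss': 1.0 / (step + 1)}  # placeholder "training"
        save_checkpoint(path, state)
    return state


train(checkpoint_path, until_step=3)          # first session stops after 3 steps
state = train(checkpoint_path, until_step=5)  # a later session resumes at step 3
print(state['step'])
```

Because the second call starts from the saved step instead of from zero, a session that times out only loses the work done since the last checkpoint.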

You may enable the GPU runtime through `Runtime -> Change runtime type -> Hardware accelerator -> GPU`.
We will need the TPU runtime for the final activity, where we will also learn how to enable it programmatically.

# Text Processing with Python

Disclaimer: If you are already familiar with Python, it is probably safe to skip this section.

---

[Python](https://www.python.org/) is an interpreted, high-level programming language:

- **Strongly typed:** types are enforced
- **Dynamically, implicitly typed:** you don't have to explicitly specify the type of a variable (unlike in languages such as Java or C)
- **Case sensitive:** `afirm` and `AFIRM` represent different variables
- **Object-oriented:** everything is an object!
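
Each of these properties can be observed in a few lines. The following is a small sketch; the variable names and values are chosen only for illustration.

```python
# Dynamic, implicit typing: no type declaration is needed, and a name can
# be rebound to a value of a different type.
value = 42
print(type(value).__name__)
value = 'forty-two'
print(type(value).__name__)

# Strong typing: mixing incompatible types raises an error instead of
# silently converting one of them.
try:
    result = 'afirm' + 2020
except TypeError as error:
    result = None
    print('TypeError:', error)

# Case sensitivity: these are two distinct variables.
afirm = 'lowercase'
AFIRM = 'uppercase'
print(afirm != AFIRM)

# Everything is an object: even an integer literal has methods.
print((255).bit_length())
```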

In this tutorial, we are going to use Python 3.
Now we will go through a few basic steps of text processing to introduce Python.
As you will see, the Python syntax is quite clean and intuitive, which makes it the ideal choice for data manipulation.


For the purposes of this activity, let's take a sample query and passage pair from the [MS MARCO passage ranking collection](http://www.msmarco.org/):

In [None]:
query = 'Why did the US make the atomic bomb?'
passage = '''
The presence of communication amid scientific minds was equally important \
to the success of the Manhattan Project as scientific intellect was. The \
only cloud hanging over the impressive achievement of the atomic researchers \
and engineers is what their success truly meant; hundreds of thousands of \
innocent lives obliterated.
'''

Comments start with the pound sign, `#`, and are ignored during interpretation.

A string in Python is enclosed in single or double quotes.
On the other hand, a multiline string has triple quotes around it.
Inside a multiline string, a `\` at the end of a line suppresses the line break, so a long sentence can be spread over several lines of source without adding newline characters.
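
The difference between these forms is easiest to see side by side. Here is a short sketch with made-up strings:

```python
# Single and double quotes produce the same kind of string.
single = 'atomic bomb'
double = "atomic bomb"
print(single == double)

# A triple-quoted string may span several lines; each line break becomes
# a newline character in the string.
with_newlines = '''first line
second line'''
print(with_newlines.count('\n'))

# A backslash at the end of a line suppresses that line break, so the
# string continues as if it were written on one line.
joined = '''first line \
second line'''
print(joined)
print('\n' in joined)
```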

Let's first normalize the case with the built-in string method `lower()`:

In [None]:
query = query.lower()
passage = passage.lower()

print('{query}: {passage}'.format(query=query, passage=passage))

**Exercise:**
Note that the `query` and `passage` fields in the format string are bound by name to the variables passed to `format`.
Try reversing the order of the arguments inside `format` and see if the output changes.

The output of `print` can be formatted as above, where the curly braces in the string are placeholders for the variables passed to `format` inside the parentheses.
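
The following sketch compares a few ways of filling the placeholders. The `term` and `count` variables are hypothetical and are used here so that the `query` and `passage` variables of this notebook stay untouched.

```python
term = 'atomic'
count = 2

# Named fields are matched by name, so the order of the keyword
# arguments passed to format() does not matter.
first = '{term}: {count}'.format(term=term, count=count)
second = '{term}: {count}'.format(count=count, term=term)
print(first == second)

# Positional fields are filled in the order the arguments are given.
print('{}: {}'.format(term, count))

# An f-string (Python 3.6+) reads the variables directly from the scope.
print(f'{term}: {count}')
```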

Now, let's split `passage` into individual tokens.

In [None]:
tokens = passage.split()
print(tokens)

**Exercise:**
Split query into tokens.
Find the tokens shared between the query and the passage.

In [None]:
query_tokens = query.split()
set(tokens) & set(query_tokens)

The result of `split()`, stored in `tokens`, is a Python list.
Let's check the number of tokens (equivalent to the length of the list):

In [None]:
len(tokens)

Indexing in Python starts with 0.
Let's search for the 7th (index 6) token of the query in the passage:

In [None]:
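query_tokens = query.split()  # index tokens; query[6] would be a single character
print(query_tokens[6])
query_tokens[6] in tokens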

**Exercise:**
Extract the first 5 tokens of the passage.

In [None]:
tokens[:5]
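
The slice used in this exercise is one case of Python's general indexing syntax. Here is a short sketch of the common forms, using a hypothetical `sample_tokens` list so that the notebook's `tokens` variable is left unchanged:

```python
sample_tokens = ['why', 'did', 'the', 'us', 'make', 'the', 'atomic', 'bomb?']

print(sample_tokens[0])     # first element
print(sample_tokens[-1])    # negative indices count from the end
print(sample_tokens[2:5])   # start is inclusive, end is exclusive
print(sample_tokens[:3])    # omitted start defaults to the beginning
print(sample_tokens[5:])    # omitted end defaults to the end of the list

# Strings support the same syntax, but index characters rather than tokens.
print('atomic'[:4])
```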

We can count the frequency of each token with `Counter`.
This class is defined in the `collections` module, which we need to import before using it.
Note that a `Counter` is a subclass of the Python dictionary where each key is a token and each value is its number of occurrences.

In [None]:
import collections

counts = collections.Counter(tokens)
counts

Note that Python does not require a statement termination character, such as `;`: a statement normally ends at the end of its line.
Similarly, blocks are delimited by indentation rather than by braces.

Let's traverse the dictionary and only output terms that appear more than once:

In [None]:
# Each item is a (token, frequency) pair.
for token, frequency in counts.items():
  if frequency > 1:
    print(token)

**Exercise:**
Combine query and passage into a single text, and count its tokens.
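
One possible solution is sketched below. The `count_tokens` helper and the two short sample texts are hypothetical and only keep the example self-contained.

```python
import collections


def count_tokens(*texts):
    """Join the texts with spaces, split on whitespace, and count the tokens."""
    combined = ' '.join(texts)
    return collections.Counter(combined.split())


# Hypothetical short texts standing in for the notebook's query and passage.
sample_query = 'why did the us make the atomic bomb?'
sample_passage = 'the success of the manhattan project'

combined_counts = count_tokens(sample_query, sample_passage)
print(combined_counts['the'])
print(combined_counts.most_common(1))
```

In the notebook itself, `count_tokens(query, passage)` would apply the same steps to the variables defined earlier.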

**Exercise:**
Implement TF-IDF in Python.
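
As a starting point, here is a minimal sketch of one common TF-IDF formulation: relative term frequency multiplied by `log(N / df)`. Many other weighting variants exist, so this is not the only valid answer. The three-document toy collection and all names below are assumptions made for illustration.

```python
import collections
import math

# Hypothetical toy collection; the tutorial's real collection is not used here.
documents = [
    'the manhattan project built the atomic bomb',
    'the project was a success',
    'scientists shared ideas',
]
tokenized = [doc.split() for doc in documents]

# Document frequency: the number of documents that contain each term.
document_frequency = collections.Counter()
for doc_tokens in tokenized:
    document_frequency.update(set(doc_tokens))


def tf_idf(term, doc_tokens):
    """TF-IDF with relative term frequency and idf = log(N / df)."""
    if document_frequency[term] == 0:
        return 0.0  # the term never occurs in the collection
    tf = doc_tokens.count(term) / len(doc_tokens)
    idf = math.log(len(tokenized) / document_frequency[term])
    return tf * idf


# Score every distinct term of the first document, highest first.
scores = {term: tf_idf(term, tokenized[0]) for term in set(tokenized[0])}
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print('{:<10} {:.3f}'.format(term, score))
```

Terms that occur in only one document, such as `manhattan`, score higher than terms that are frequent but spread across the collection, such as `the`.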

## Packages

Many Python packages come pre-installed in a new Colab notebook.
You may install any additional packages that you need through [pip](https://pypi.org/project/pip/), Python's de facto package management system.

Let's install `matplotlib` and `numpy` to plot the histogram of tokens in `passage`:

In [None]:
!pip install matplotlib
!pip install numpy

In [None]:
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline
x = np.arange(len(counts))
y = list(counts.values())

plt.bar(x, y, align='center', alpha=0.5)
plt.xticks(x, list(counts.keys()), rotation='vertical')
plt.ylabel('Term Frequency')
plt.show()