<a href="https://colab.research.google.com/github/castorini/anserini-notebooks-afirm2020/blob/master/afirm2020_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quick Introduction to Google Colaboratory

Google Colaboratory (commonly abbreviated as Google Colab) is a development environment that runs in the browser using Google Cloud.

The [official notebooks](https://colab.research.google.com/notebooks/) provided by Google Colab are a good place to start learning.

## Notebooks

A notebook represents the source file in Colab.
It is essentially a web application that contains live code, equations, visualizations or text.
The Colab kernel officially supports Python and Swift.
Kernels can also be installed for other languages such as Scala or R.

A Colab notebook consists of two types of `cells`: `run` and `markdown`.
A run cell is an executable unit of code, whereas a markdown cell holds formatted text such as this paragraph.

You may run a cell either by clicking the button on the far left of the cell or by pressing `SHIFT + Enter`.
The button on the left also shows whether a cell is currently running or, once it has finished, the order in which it was executed.

Here is a Python cell that outputs `Hello world!`:

In [0]:
print('Hello world!')

Hello world!


**Exercise:**
Replace `world` in the above line with your name and re-run the cell.

### Shell

Check out this [interactive tutorial](https://www.learnshell.org/) or consult the cheatsheet in the suggested reading to brush up on shell scripting.

Shell scripts/commands can also be executed in a cell.
However, unlike Python code, each line has to be prepended with an exclamation mark, `!`, so that the notebook recognizes the Shell command.

For example, let's check our working directory:

In [0]:
!pwd
!ls

/content
sample_data


As you can see, we are currently in `/content`, which only contains the `sample_data` directory.

**Exercise:**
Create a file called `your_name.txt` that contains your name with `echo`.
Make sure to change the filename, since each file in the shared bucket needs a unique name.
Confirm the creation of the file in your environment.

In [0]:
!echo "Name" > your_name.txt
!ls

### Magics

Notebooks define a set of system commands called `magics`.
Line magics are prepended with a single `%` character and only affect the line that they are on.
More commonly, cell magics are expressed with `%%` and govern the whole cell that they precede.

A cell magic that is going to come up often in the following notebooks is `%%capture`, which suppresses the output of the respective cell.
It is especially useful when running a cell with verbose logging, e.g., downloading or extracting files.

Some of the other magics include:
- `%timeit`: automatically determines the execution time of a single-line Python statement
- `%%html`: treats the entire cell as an HTML block
- `%%bash`: treats the entire cell as a Bash script

For others, check out the [Jupyter documentation](https://nbviewer.jupyter.org/github/ipython/ipython/blob/1.x/examples/notebooks/Cell%20Magics.ipynb).
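
Magics only work inside a notebook cell. As a rough point of comparison, the measurement that `%timeit` performs can be sketched in plain Python with the standard `timeit` module. The statement being timed and the repetition count below are arbitrary choices for illustration, not part of the tutorial.

```python
import timeit

# Time a single-line statement, similar in spirit to the `%timeit` line magic.
# `number` is how many times the statement is executed in total.
elapsed = timeit.timeit("'Hello world!'.lower()", number=100_000)

print(f'{elapsed:.4f} seconds for 100,000 runs')
```

Unlike `%timeit`, this sketch does not choose the number of repetitions automatically.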

**Exercise:**
Suppress the output of the `Hello World` Python example with `%%capture`.

In [0]:
%%capture

print('Hello world!')

### Data

Colab keeps a temporary storage space for data and other intermediate files.
However, it is also possible to load data from, and persist data to, your local file system, Google Drive or Google Cloud Storage (GCS).
To learn more about different storage options, please visit [this webpage](https://colab.research.google.com/notebooks/io.ipynb).

We have already created a [bucket](https://cloud.google.com/storage/docs/json_api/v1/buckets) on GCS for the tutorial.
Let's check its contents with the `gsutil` command-line tool:

In [0]:
!gsutil ls gs://afirm2020

gs://afirm2020/anserini-0.7.2-SNAPSHOT-fatjar.jar
gs://afirm2020/collection.tsv
gs://afirm2020/qrels.dev.small.tsv
gs://afirm2020/queries.dev.small.tsv
gs://afirm2020/queries.eval.small.tsv
gs://afirm2020/saved.msmarco_mb_1
gs://afirm2020/indexes/
gs://afirm2020/msmarco_passage/
gs://afirm2020/tfrecord/
gs://afirm2020/trec_eval.9.0.4/
gs://afirm2020/unique_name/


The `gs://` prefix indicates that our bucket, `afirm2020`, lives in Google Cloud Storage.

Colab notebooks are automatically saved in a folder named `Colab Notebooks` in your Google Drive.
You may also trigger this action at any time with the keyboard shortcut `CTRL+S`.
It is also possible to save a copy of your notebook to a GitHub repo through the Colab interface: `File -> Save a copy in GitHub...`

**Exercise:**
Copy the file `your_name.txt` that you created earlier to the Google bucket.

In [0]:
!gsutil cp your_name.txt gs://afirm2020/your_name.txt

Copying file://name.txt [Content-Type=text/plain]...
/ [1 files][    5.0 B/    5.0 B]                                                
Operation completed over 1 objects/5.0 B.                                        


### Restarting the Shell

Sometimes you may want to revert an unintended side-effect or simply start from scratch.
In this case, you may reset the Python shell through: `Runtime -> Restart runtime...`.
To reset the Colab image completely, do `Runtime -> Restart all runtimes...`.

### Using GPUs

Colab gives access to a GPU (or TPU, for more compute-intensive tasks).
Unfortunately, it is common for your session to time out after ~12 hours.
For this reason, a good practice is to save your checkpoints to GCS so that you can resume your training easily.

You may enable the GPU runtime through `Runtime -> Change runtime type -> Hardware accelerator -> GPU`.
We will need the TPU runtime for the final activity, where we will also learn how to enable it programmatically.

# Text Processing with Python

[Python](https://www.python.org/) is an interpreted, high-level programming language:

- **Strongly typed:** types are enforced
- **Dynamically and implicitly typed:** you don't have to explicitly specify the type of a variable (unlike languages such as Java or C)
- **Case sensitive:** `afirm` and `AFIRM` represent different variables
- **Object-oriented:** everything is an object!
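
The properties above can be illustrated with a short sketch. The variable names and values are hypothetical and chosen only for this illustration.

```python
# Dynamic typing: the same name can be rebound to values of different types.
value = 42
print(type(value).__name__)   # int
value = 'afirm'
print(type(value).__name__)   # str

# Strong typing: Python refuses to silently mix incompatible types.
try:
    result = 'afirm' + 2020
except TypeError as err:
    result = None
    print('TypeError:', err)

# An explicit conversion makes the intent clear.
label = 'afirm' + str(2020)
print(label)                  # afirm2020

# Case sensitivity: these are two distinct variables.
afirm, AFIRM = 1, 2
print(afirm != AFIRM)         # True

# Everything is an object, so even an integer has methods.
print((255).bit_length())     # 8
```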

Now we will go through a few basic steps of text processing to introduce Python.
As you will see, the Python syntax is quite clean and intuitive, which makes it the ideal choice for data manipulation.


For the purposes of this activity, let's take a sample query and passage pair from the [MS MARCO passage ranking collection](http://www.msmarco.org/):

In [0]:
query = 'Why did the US make the atomic bomb?'
passage = '''
The presence of communication amid scientific minds was equally important \
to the success of the Manhattan Project as scientific intellect was. The \
only cloud hanging over the impressive achievement of the atomic researchers \
and engineers is what their success truly meant; hundreds of thousands of \
innocent lives obliterated.
'''

Comments start with the pound sign, `#`, and are ignored during interpretation.

A string in Python is enclosed in single or double quotes.
A multiline string, on the other hand, has triple quotes around it.
Inside a string, a `\` at the end of a line continues the string on the next line without inserting a line break, which is how the long passage above is spread over several lines.
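
The quoting rules can be checked with a small sketch; the strings below are arbitrary examples rather than part of the tutorial data.

```python
# Single and double quotes produce the same kind of string.
single = 'atomic bomb'
double = "atomic bomb"
print(single == double)        # True

# Triple quotes keep the line breaks that appear in the source.
multiline = '''first line
second line'''
print(multiline.count('\n'))   # 1

# A trailing backslash joins the next line without adding a line break.
joined = '''first line \
second line'''
print(joined)                  # first line second line
```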

Let's first normalize the case with the built-in string method `lower()`:

In [0]:
query = query.lower()
passage = passage.lower()

print('{query}: {passage}'.format(query=query, passage=passage))

why did the us make the atomic bomb?: 
the presence of communication amid scientific minds was equally important to the success of the manhattan project as scientific intellect was. the only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.



**Exercise:**
Note that the placeholders `{query}` and `{passage}` are bound by name to the arguments passed to `format`.
Try reversing the order of the arguments inside `format` and see if the output changes.

The output of `print` can be formatted as above, where each pair of curly braces names a value passed to `format` inside the parentheses.
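
As a minimal sketch of how named and positional placeholders differ, consider the following. The values `q` and `p` are hypothetical stand-ins for the real query and passage.

```python
q = 'sample query'      # hypothetical stand-in for `query`
p = 'sample passage'    # hypothetical stand-in for `passage`

# Named placeholders are matched by name, so argument order does not matter.
a = '{query}: {passage}'.format(query=q, passage=p)
b = '{query}: {passage}'.format(passage=p, query=q)
print(a == b)   # True

# Positional placeholders are filled in order, so swapping changes the output.
c = '{}: {}'.format(q, p)
d = '{}: {}'.format(p, q)
print(c == d)   # False

# An f-string (Python 3.6+) reads the variables directly.
e = f'{q}: {p}'
print(e == a)   # True
```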

Now, let's split `passage` into individual tokens.

In [0]:
tokens = passage.split()
print(tokens)

['the', 'presence', 'of', 'communication', 'amid', 'scientific', 'minds', 'was', 'equally', 'important', 'to', 'the', 'success', 'of', 'the', 'manhattan', 'project', 'as', 'scientific', 'intellect', 'was.', 'the', 'only', 'cloud', 'hanging', 'over', 'the', 'impressive', 'achievement', 'of', 'the', 'atomic', 'researchers', 'and', 'engineers', 'is', 'what', 'their', 'success', 'truly', 'meant;', 'hundreds', 'of', 'thousands', 'of', 'innocent', 'lives', 'obliterated.']
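
Splitting on whitespace leaves punctuation attached to tokens such as `was.` and `meant;`. The tutorial keeps these tokens as they are; the sketch below only illustrates one simple way to strip leading and trailing punctuation, assuming that is acceptable for the task at hand.

```python
import string

# A few tokens taken from the output above.
raw_tokens = ['intellect', 'was.', 'meant;', 'obliterated.']

# str.strip removes any of the given characters from both ends of a token.
clean_tokens = [t.strip(string.punctuation) for t in raw_tokens]

print(clean_tokens)  # ['intellect', 'was', 'meant', 'obliterated']
```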


**Exercise:**
Split query into tokens.
Find the tokens shared between the query and the passage.

In [0]:
query_tokens = query.split()
set(tokens) & set(query_tokens)

{'atomic', 'the'}

The `tokens` variable returned by `split()` is a Python list.
Let's check the number of tokens (equivalent to the length of the list):

In [0]:
len(tokens)

48

Indexing in Python starts with 0.
Let's search for the 7th (index 6) token of the query in the passage:

In [0]:
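query_tokens = query.split()  # index tokens, not characters of the string
query_tokens[6] in tokens     # is the 7th query token, 'atomic', in the passage?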

True

**Exercise:**
Extract the first 5 tokens of the passage.

In [0]:
tokens[:5]

['the', 'presence', 'of', 'communication', 'amid']
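
Beyond a simple prefix, a few other indexing and slicing forms are worth knowing. The sketch below uses the first seven passage tokens from the output above.

```python
tokens = ['the', 'presence', 'of', 'communication', 'amid', 'scientific', 'minds']

print(tokens[0])          # the  (first element)
print(tokens[-1])         # minds  (negative indices count from the end)
print(tokens[2:4])        # ['of', 'communication']  (end index is exclusive)
print(tokens[-2:])        # ['scientific', 'minds']  (last two elements)
print('amid' in tokens)   # True  (membership test)
```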

We can count the frequency of each token with `Counter`.
This class is defined in the `collections` module, which we need to import before using `Counter`.
Note that a `Counter` is a specialized Python dictionary where the key is the token and the value is the number of occurrences.

In [0]:
import collections

counts = collections.Counter(tokens)
counts

Counter({'achievement': 1,
         'amid': 1,
         'and': 1,
         'as': 1,
         'atomic': 1,
         'cloud': 1,
         'communication': 1,
         'engineers': 1,
         'equally': 1,
         'hanging': 1,
         'hundreds': 1,
         'important': 1,
         'impressive': 1,
         'innocent': 1,
         'intellect': 1,
         'is': 1,
         'lives': 1,
         'manhattan': 1,
         'meant;': 1,
         'minds': 1,
         'obliterated.': 1,
         'of': 5,
         'only': 1,
         'over': 1,
         'presence': 1,
         'project': 1,
         'researchers': 1,
         'scientific': 2,
         'success': 2,
         'the': 6,
         'their': 1,
         'thousands': 1,
         'to': 1,
         'truly': 1,
         'was': 1,
         'was.': 1,
         'what': 1})
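
Because a `Counter` behaves like a dictionary, it supports the usual dictionary operations plus a few conveniences. The sketch below uses a short hypothetical token list rather than the full passage.

```python
import collections

# Hypothetical token list, used only for illustration.
tokens = 'the success of the project as the success'.split()
counts = collections.Counter(tokens)

print(isinstance(counts, dict))   # True
print(counts['the'])              # 3
print(counts['bomb'])             # 0  (missing keys count as zero)
print(counts.most_common(2))      # [('the', 3), ('success', 2)]
```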

Note that Python does not require a statement termination character, such as `;`.
Statements end at the line break, and blocks are specified by indentation.

Let's traverse the dictionary and only output terms that appear more than once:

In [0]:
for (k, v) in counts.items():
  if v > 1:
    print(k)

the
of
scientific
success


**Exercise:**
Combine query and passage into a single text, and count its tokens.
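
One possible approach is sketched below. To keep the example self-contained, it assumes the lowercased query and only the first sentence of the passage; in the notebook you would use the existing `query` and `passage` variables instead.

```python
import collections

query = 'why did the us make the atomic bomb?'
# Assumption: only the first sentence of the passage, already lowercased.
passage = ('the presence of communication amid scientific minds was equally '
           'important to the success of the manhattan project as scientific '
           'intellect was.')

# Join the two texts with a space so the boundary tokens do not merge.
combined = query + ' ' + passage
combined_tokens = combined.split()
combined_counts = collections.Counter(combined_tokens)

print(len(combined_tokens))
print(combined_counts['the'])
```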

**Exercise:**
Implement [TF-IDF](http://www.tfidf.com/) in Python over the toy collection provided below (corresponding to the first 10 passages of MS MARCO).
For the purposes of this exercise, assume that a token is case-insensitive, i.e., normalize the case.

```
0	The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.
1	The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.
2	Essay on The Manhattan Project - The Manhattan Project The Manhattan Project was to see if making an atomic bomb possible. The success of this project would forever change the world forever making it known that something this powerful can be manmade.
3	The Manhattan Project was the name for a project conducted during World War II, to develop the first atomic bomb. It refers specifically to the period of the project from 194 … 2-1946 under the control of the U.S. Army Corps of Engineers, under the administration of General Leslie R. Groves.
4	versions of each volume as well as complementary websites. The first website - The Manhattan Project: An Interactive History - is available on the Office of History and Heritage Resources website, http://www.cfo. doe.gov/me70/history. The Office of History and Heritage Resources and the National Nuclear Security
5	The Manhattan Project. This once classified photograph features the first atomic bomb - a weapon that atomic scientists had nicknamed Gadget.. The nuclear age began on July 16, 1945, when it was detonated in the New Mexico desert.
6	Nor will it attempt to substitute for the extraordinarily rich literature on the atomic bombs and the end of World War II. This collection does not attempt to document the origins and development of the Manhattan Project.
7	Manhattan Project. The Manhattan Project was a research and development undertaking during World War II that produced the first nuclear weapons. It was led by the United States with the support of the United Kingdom and Canada. From 1942 to 1946, the project was under the direction of Major General Leslie Groves of the U.S. Army Corps of Engineers. Nuclear physicist Robert Oppenheimer was the director of the Los Alamos Laboratory that designed the actual bombs. The Army component of the project was designated the
8	In June 1942, the United States Army Corps of Engineersbegan the Manhattan Project- The secret name for the 2 atomic bombs.
9	One of the main reasons Hanford was selected as a site for the Manhattan Project's B Reactor was its proximity to the Columbia River, the largest river flowing into the Pacific Ocean from the North American coast.
```

Build a dictionary where each key, i.e., a term in the collection, is mapped to a list of pairs of passage IDs and term frequencies.
For example:

`manhattan -> (0, 1), (1, 1), (2, 3), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1)`
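
A minimal sketch of one way to approach this exercise follows. It assumes a three-passage subset of the toy collection (passages 1, 8 and 9), whitespace tokenization, and one common TF-IDF formulation: term frequency normalized by passage length, multiplied by the natural log of the number of passages divided by the document frequency. The names `index`, `doc_lengths` and `tf_idf` are hypothetical.

```python
import collections
import math

# Assumption: a three-passage subset of the toy collection, keyed by passage ID.
collection = {
    1: "The Manhattan Project and its atomic bomb helped bring an end to "
       "World War II. Its legacy of peaceful uses of atomic energy continues "
       "to have an impact on history and science.",
    8: "In June 1942, the United States Army Corps of Engineersbegan the "
       "Manhattan Project- The secret name for the 2 atomic bombs.",
    9: "One of the main reasons Hanford was selected as a site for the "
       "Manhattan Project's B Reactor was its proximity to the Columbia "
       "River, the largest river flowing into the Pacific Ocean from the "
       "North American coast.",
}

# Inverted index: term -> list of (passage ID, term frequency) pairs.
index = collections.defaultdict(list)
doc_lengths = {}
for doc_id, text in collection.items():
    tokens = text.lower().split()          # normalize case, split on whitespace
    doc_lengths[doc_id] = len(tokens)
    for term, tf in collections.Counter(tokens).items():
        index[term].append((doc_id, tf))


def tf_idf(term, doc_id):
    """TF-IDF of `term` in passage `doc_id` (0.0 if the term is absent)."""
    postings = dict(index.get(term, []))
    if doc_id not in postings:
        return 0.0
    tf = postings[doc_id] / doc_lengths[doc_id]
    idf = math.log(len(collection) / len(postings))
    return tf * idf


print(index['manhattan'])       # [(1, 1), (8, 1), (9, 1)]
print(tf_idf('manhattan', 1))   # 0.0, the term occurs in every passage
print(tf_idf('atomic', 1))      # positive, 'atomic' is missing from passage 9
```

A term that appears in every passage, such as `manhattan` here, gets a score of zero under this formulation, which is the intended effect of the IDF factor.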

## Packages

Many Python packages come pre-installed in a new Colab notebook.
You may install any additional packages that you need through [pip](https://pypi.org/project/pip/), Python's de facto package management system.

Let's install `matplotlib` and `numpy` to plot the histogram of tokens in `passage`:

In [0]:
!pip install matplotlib
!pip install numpy



In [0]:
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline
x = np.arange(len(counts))
y = list(counts.values())

plt.bar(x, y, align='center', alpha=0.5)
plt.xticks(x, list(counts.keys()), rotation='vertical')
plt.ylabel('Term Frequency')
plt.show()
