<p><img alt="Colaboratory logo" height="45px" src="/img/colab_favicon.ico" align="left" hspace="10px" vspace="0px"></p>

<h1>What is Colaboratory?</h1>

Colaboratory, or 'Colab' for short, allows you to write and execute Python in your browser, with:
- Zero configuration required
- Free access to GPUs
- Easy sharing

Whether you're a <strong>student</strong>, a <strong>data scientist</strong> or an <strong>AI researcher</strong>, Colab can make your work easier. Watch <a href="https://www.youtube.com/watch?v=inN8seMm7UI">Introduction to Colab</a> to find out more, or just get started below!

## <strong>Getting started</strong>

The document that you are reading is not a static web page, but an interactive environment called a <strong>Colab notebook</strong> that lets you write and execute code.

For example, here is a <strong>code cell</strong> with a short Python script that computes a value, stores it in a variable and prints the result:

In [None]:
seconds_in_a_day = 24 * 60 * 60
seconds_in_a_day

To execute the code in the above cell, select it with a click and then either press the play button to the left of the code, or use the keyboard shortcut 'Command/Ctrl+Enter'. To edit the code, just click the cell and start editing.

Variables that you define in one cell can later be used in other cells:

In [None]:
seconds_in_a_week = 7 * seconds_in_a_day
seconds_in_a_week

Colab notebooks allow you to combine <strong>executable code</strong> and <strong>rich text</strong> in a single document, along with <strong>images</strong>, <strong>HTML</strong>, <strong>LaTeX</strong> and more. When you create your own Colab notebooks, they are stored in your Google Drive account. You can easily share your Colab notebooks with co-workers or friends, allowing them to comment on your notebooks or even edit them. To find out more, see <a href="/notebooks/basic_features_overview.ipynb">Overview of Colab</a>. To create a new Colab notebook you can use the File menu above, or use the following link: <a href="http://colab.research.google.com#create=true">Create a new Colab notebook</a>.

Colab notebooks are Jupyter notebooks that are hosted by Colab. To find out more about the Jupyter project, see <a href="https://www.jupyter.org">jupyter.org</a>.

## Data science

With Colab you can use popular Python libraries to analyse and visualise data. The code cell below imports several libraries, so that we can reuse solutions to problems that other people have already solved. Each library is loaded with `import`, and some are given a shorter alias, such as `pandas` as `pd`.

In [None]:
import json
import os
from ast import literal_eval

import pandas as pd
import requests
from sklearn.metrics import jaccard_score

# https://stackoverflow.com/questions/25351968/how-to-display-full-non-truncated-dataframe-information-in-html-when-convertin
pd.set_option("display.max_colwidth", None)

We can interact with GOV.UK search programmatically using the GOV.UK search API. In other words, instead of typing a query into the search bar on GOV.UK by hand, we can have a computer run the search for us and pass back the results.

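What does API mean?

API stands for Application Programming Interface: a defined way for one program to request data or services from another. These [docs](https://docs.publishing.service.gov.uk/apis/search/search-api.html) provide a bit more background.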
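As a rough sketch of what an API request looks like, the snippet below builds a search URL using only Python's standard library and sends nothing over the network. The endpoint and the `q`, `count` and `fields` parameters are the ones used in the cells further down; the helper name `build_search_url` is made up for illustration.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.gov.uk/api/search.json"


def build_search_url(search_term, top_n_results=5, fields="title"):
    """Return the URL for a GOV.UK search query (hypothetical helper)."""
    # urlencode escapes characters such as spaces so the URL stays valid
    query = urlencode({"q": search_term, "count": top_n_results, "fields": fields})
    return SEARCH_ENDPOINT + "?" + query


print(build_search_url("universal credit", 3))
```

Pasting the printed URL into a browser should show the same JSON that the code cells in this notebook work with.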

Below we assign some variables. Work through this example, then try changing some of the values. What happens when you run the code cell after this one?

In [None]:
# inputs to modify api response
search_term = "tax"
top_n_results = 5

In [None]:
# this blog post might help you understand this https://dataingovernment.blog.gov.uk/2016/05/26/use-the-search-api-to-get-useful-information-about-gov-uk-content/
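resp = requests.get(
    "https://www.gov.uk/api/search.json",
    # requests URL-encodes these values, so search terms with spaces are safe
    params={"q": search_term, "count": top_n_results, "fields": "title"},
)
# Stop early if the API did not return a successful response
if resp.status_code != 200:
    raise RuntimeError("GET /api/search.json returned {}".format(resp.status_code))

# Parse the JSON body and show the ID and score of each result
data = json.loads(resp.content.decode("utf-8"))
pd.DataFrame.from_records(data["results"], columns=["_id", "combined_score"])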

Let's turn this into a function for ease of use

In [None]:
def govuk_search(search_term="tax", top_n_results=5):
    """Query the GOV.UK search API

    Usage::
        >>> df = govuk_search(search_term = 'tax', top_n_results = 5)

    :param search_term: Query str to search for.
    :param top_n_results: int, the number of results to request.
    :return: DataFrame with one row per search result.
    """
    resp = requests.get(
        "https://www.gov.uk/api/search.json",
        # requests URL-encodes these values, so search terms with spaces are safe
        params={"q": search_term, "count": top_n_results, "fields": "content_id"},
    )
    # Stop early if the API did not return a successful response
    if resp.status_code != 200:
        raise RuntimeError("GET /api/search.json returned {}".format(resp.status_code))

    data = json.loads(resp.content.decode("utf-8"))

    df = pd.DataFrame.from_records(data["results"], columns=["_id", "combined_score"])

    return df
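To see what happens to the response without calling the API, here is a small offline sketch. The JSON text is invented for illustration (the IDs and scores are not real search results); only the `results`, `_id` and `combined_score` names come from the code above.

```python
import json

# Made-up payload shaped like the response that the code above expects
sample_payload = """
{
  "results": [
    {"_id": "/example-page-a", "combined_score": 3.2},
    {"_id": "/example-page-b", "combined_score": 2.7}
  ]
}
"""

data = json.loads(sample_payload)

# Keep only the two fields that the DataFrame columns refer to
rows = [(result["_id"], result["combined_score"]) for result in data["results"]]

for result_id, score in rows:
    print(result_id, score)
```

Each tuple in `rows` corresponds to one row of the DataFrame that the function returns.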

In [None]:
df = govuk_search("universal credit", 3)
df

What if a user doesn't know the name "universal credit" and searches for "benefits" instead? Do they get similar results?

In [None]:
df = govuk_search("benefits", 3)
df

## Jaccard: how similar are two sets of search results?

### References

1. [Wikipedia](https://en.wikipedia.org/wiki/Jaccard_index)
2. [StackOverflow](https://stackoverflow.com/a/47016862/937932)

### Definition

Jaccard similarity is the amount of overlap between two sets of items.

Here is the mathematical notation.

$$J(A,B) = {{|A \cap B|}\over{|A \cup B|}} = {{|A \cap B|}\over{|A| + |B| - |A \cap B|}}$$

The first part, $J(A,B)$, means "Do a calculation on two sets, called $A$ and $B$."

The middle part, ${{|A \cap B|}\over{|A \cup B|}}$, means "Count the number of items that appear in both $A$ and $B$, and divide that by the number of items that appear in $A$ or $B$ or both, counting each item once."

The symbol $\cap$ means "intersection", which is the items that are in both $A$ and $B$.  The symbol $\cup$ means "union", the items that are in either $A$ or $B$ or both.

The final part, ${{|A \cap B|}\over{|A| + |B| - |A \cap B|}}$, is another way to write the middle part.  It shows a way to calculate ${|A \cup B|}$, and it means "Count the number of items in $A$, and the number of items in $B$, add them, and then subtract the number of items that are in both $A$ and $B$."

Let's write code to do this calculation.

In [None]:
set_1 = set(["c", "a", "t"])
set_2 = set(["m", "a", "t"])

intersection = set_1 & set_2
union = set_1 | set_2

jaccard = float(len(intersection)) / len(union)

jaccard

Check that the result 0.5 is correct, by doing the calculation by hand.

* **Intersection**: letters 'a' and 't' are in both sets.  So the size of the intersection is 2 (two letters).
* **Union**: there are four letters in total: 'c', 'a', 't' and 'm'.  Each letter is counted only once, even though letters 'a' and 't' appear twice each.  So the size of the union is 4.

The Jaccard index is the size of the intersection, 2, divided by the size of the union, 4.  The result is 0.5, as we calculated.
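The same example can be used to check the second form of the formula, where the size of the union is worked out as $|A| + |B| - |A \cap B|$. This is only a quick sanity check on the two sets used above; the variable names are chosen for illustration.

```python
A = {"c", "a", "t"}
B = {"m", "a", "t"}

# Size of the union, counted directly
size_of_union = len(A | B)

# Size of the union, from the final part of the formula
size_by_formula = len(A) + len(B) - len(A & B)

# Jaccard index computed with each denominator
jaccard_direct = len(A & B) / size_of_union
jaccard_by_formula = len(A & B) / size_by_formula

print(size_of_union, size_by_formula)
print(jaccard_direct, jaccard_by_formula)
```

Both denominators come out the same, so both versions of the calculation give the same index.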

Let's make this into a function too.

In [None]:
def jaccard(set_1, set_2):
    # Ensure that set_1 and set_2 are python set objects, not merely list objects
    set_1 = set(set_1)
    set_2 = set(set_2)

    # Calculate the Jaccard index
    return float(len(set_1 & set_2) / len(set_1 | set_2))

In [None]:
jaccard(["c", "a", "r"], ["m", "a", "t"])

Try calculating the Jaccard index of different sets.  For example, change `["c", "a", "r"]` to `["c", "a", "t"]` or `["c", "a", "t", "s"]`.  Can you get a score above 1?  What about a negative score?
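As a starting point for experimenting, the sketch below tries three cases: identical sets, sets with nothing in common, and sets that partly overlap. The helper `jaccard_index` is a stand-alone copy of the calculation above, renamed so that this cell can run on its own.

```python
def jaccard_index(items_1, items_2):
    """Size of the intersection divided by the size of the union."""
    set_1, set_2 = set(items_1), set(items_2)
    return len(set_1 & set_2) / len(set_1 | set_2)


identical = jaccard_index(["c", "a", "t"], ["c", "a", "t"])
disjoint = jaccard_index(["c", "a", "t"], ["d", "o", "g"])
partial = jaccard_index(["c", "a", "t", "s"], ["m", "a", "t"])

print(identical, disjoint, partial)
```

Compare these three scores with the ones you get from your own sets.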

Now we can easily compare the results of different GOV.UK search queries.  Try to trick the system: do plurals or misspellings return similar results?

In [None]:
jaccard(govuk_search("benifit", 10)["_id"], govuk_search("benefits", 10)["_id"])
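One thing to keep in mind when reading these scores is that the Jaccard index treats results as a set, so it ignores the order in which they are ranked. The offline sketch below illustrates this with two made-up lists of result IDs (they are not real GOV.UK pages), using a stand-alone copy of the calculation called `jaccard_index`.

```python
def jaccard_index(items_1, items_2):
    """Size of the intersection divided by the size of the union."""
    set_1, set_2 = set(items_1), set(items_2)
    return len(set_1 & set_2) / len(set_1 | set_2)


# Hypothetical result IDs for two different search queries
results_a = ["/page-1", "/page-2", "/page-3", "/page-4"]
results_b = ["/page-4", "/page-3", "/page-5", "/page-6"]

score = jaccard_index(results_a, results_b)

# Reversing the ranking of the first list does not change the score
score_reversed = jaccard_index(list(reversed(results_a)), results_b)

print(score, score_reversed)
```

Two queries can therefore score highly even when the shared pages appear in very different positions.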

## More resources

### Working with notebooks in Colab
- [Overview of Colaboratory](/notebooks/basic_features_overview.ipynb)
- [Guide to markdown](/notebooks/markdown_guide.ipynb)
- [Importing libraries and installing dependencies](/notebooks/snippets/importing_libraries.ipynb)
- [Saving and loading notebooks in GitHub](https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb)
- [Interactive forms](/notebooks/forms.ipynb)
- [Interactive widgets](/notebooks/widgets.ipynb)
- <img src="/img/new.png" height="20px" align="left" hspace="4px" alt="New"></img>
 [TensorFlow 2 in Colab](/notebooks/tensorflow_version.ipynb)

<a name="working-with-data"></a>
### Working with data
- [Loading data: Drive, Sheets and Google Cloud Storage](/notebooks/io.ipynb) 
- [Charts: visualising data](/notebooks/charts.ipynb)
- [Getting started with BigQuery](/notebooks/bigquery.ipynb)

### Machine learning crash course
These are a few of the notebooks from Google's online machine learning course. See the <a href="https://developers.google.com/machine-learning/crash-course/">full course website</a> for more.
- [Intro to Pandas](/notebooks/mlcc/intro_to_pandas.ipynb)
- [TensorFlow concepts](/notebooks/mlcc/tensorflow_programming_concepts.ipynb)

<a name="using-accelerated-hardware"></a>
### Using accelerated hardware
- [TensorFlow with GPUs](/notebooks/gpu.ipynb)
- [TensorFlow with TPUs](/notebooks/tpu.ipynb)

<a name="machine-learning-examples"></a>

## Machine learning examples

To see end-to-end examples of the interactive machine-learning analyses that Colaboratory makes possible, take a look at these tutorials using models from <a href="https://tfhub.dev">TensorFlow Hub</a>.

A few featured examples:

- <a href="https://tensorflow.org/hub/tutorials/tf2_image_retraining">Retraining an Image Classifier</a>: Build a Keras model on top of a pre-trained image classifier to distinguish flowers.
- <a href="https://tensorflow.org/hub/tutorials/tf2_text_classification">Text Classification</a>: Classify IMDB film reviews as either <em>positive</em> or <em>negative</em>.
- <a href="https://tensorflow.org/hub/tutorials/tf2_arbitrary_image_stylization">Style Transfer</a>: Use deep learning to transfer style between images.
- <a href="https://tensorflow.org/hub/tutorials/retrieval_with_tf_hub_universal_encoder_qa">Multilingual Universal Sentence Encoder Q&amp;A</a>: Use a machine-learning model to answer questions from the SQuAD dataset.
- <a href="https://tensorflow.org/hub/tutorials/tweening_conv3d">Video Interpolation</a>: Predict what happened in a video between the first and the last frame.
