# Assignment 1: checking your environment

## Introduction

The goal of this first assignment is to check whether your Python environment
is correctly set up for using Jupyter Notebooks. If you are viewing this file
on Google Colab and would like to use it in your local Jupyter environment,
you can click *File -> Download -> Download .ipynb*. You can then open the
file with Jupyter Lab/Notebook.

## Preparation

In this assignment, you will use Hugging Face Datasets to download a data set. I
recommend that you read the [quick tour](https://huggingface.co/docs/datasets/quicktour.html)
before working on this assignment.

## Notebook environments

As discussed during the first lecture, there are several ways to use
notebooks. I recommend one of the following:

* [Google Colab](https://colab.research.google.com/) if you want to do the
  absolute minimum of setup. Once we start training heavier models, I also
  recommend that you do this on Google Colab if you do not have an NVIDIA
  GPU that is powerful enough.
* [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/) if you want to
  run the notebooks on your own machine. This also works offline and you can
  use your own CUDA-capable GPU (if you have one).
* [PyCharm Professional](https://www.jetbrains.com/pycharm/). This is a
  full Python IDE and offers completion, variable inspection, and other
  nice features. Even though PyCharm Professional is a commercial product,
  students can
  [apply for a free license](https://www.jetbrains.com/community/education/#students).

## Getting started with JupyterLab

I recommend that you use [Poetry](https://python-poetry.org/) to install
Python dependencies. I have provided an [example project](https://github.com/danieldk/tranformers-examples/tree/main/notebook-env) that you
can use as a starting point. Once you have downloaded the `pyproject.toml`
and `poetry.lock` files to a directory, you can install the packages
that we use throughout this course by running the following in that directory:

```
$ poetry install
```

This step only has to be performed once. Afterwards, you can always go
to this project directory and start a shell in a Python virtual
environment:

```
$ poetry shell
```

You can then start JupyterLab from within this shell:

```
$ jupyter-lab
```

## Getting started with PyCharm Professional

Install [Poetry](https://python-poetry.org/) and get the [example project](https://github.com/danieldk/tranformers-examples/tree/main/notebook-env)
described in the previous section. Start PyCharm and install the [Poetry
plugin](https://plugins.jetbrains.com/plugin/14307-poetry) from the plugins
section. After this is done, you can open the Poetry project in PyCharm.

## What if I use Windows?

I highly recommend that you use a Linux/Mac environment (or Google Colab). It is
more pleasant for machine learning and data science, and you will encounter
far fewer paper cuts.

Luckily, you can nowadays [run Linux on Windows](https://docs.microsoft.com/en-us/windows/wsl/install-win10)
through the Windows Subsystem for Linux 2, which is available in recent
versions of Windows 10.

## Imports

We will start with some imports of modules that can be used during the assignment.

In [None]:
# Uncomment the following line if you use Google Colab. If you use the
# Poetry project, then all necessary dependencies are already installed.
# !pip install datasets

%matplotlib inline
import re

import datasets
import matplotlib.pyplot as plt
import numpy as np

## 1.1 Load a data set

First we need to load a data set to analyze. Load the *train* fold of the
[banking77](https://huggingface.co/datasets/banking77) data set, using the
`datasets` package.

In [None]:
# Your code here

## 1.2 Inspect the metadata

Query the data set for the following properties:

* Find out what kind of data the data set contains.
* Find the number of instances in the training fold.
* Find the features of the data set (what does each instance contain?).

In [None]:
# Your code here

## 1.3 Average customer query length

Compute the approximate average length of the customer service queries in tokens. Since
this is only an approximation, it suffices to use whitespace as a token boundary.
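
As a hedged illustration of whitespace tokenization, the sketch below computes the average token count for a small list of made-up queries. The `queries` list is hypothetical and only stands in for the text column of the data set; it is not taken from banking77.

```python
# Made-up queries standing in for the data set's text column (illustration only).
queries = [
    "I lost my card",
    "How do I top up my account?",
    "Why was my transfer declined",
]

# str.split() without arguments splits on any run of whitespace,
# so repeated spaces do not produce empty tokens.
lengths = [len(query.split()) for query in queries]

# Average number of whitespace-separated tokens per query.
average_length = sum(lengths) / len(lengths)
print(lengths)
print(average_length)
```

Note that punctuation stays attached to the neighbouring word here (`account?` counts as one token), which is one reason the result is only approximate.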

In [None]:
# Your code here

## 1.4 Customer query length distribution

Plot a histogram of the text lengths with [matplotlib](https://matplotlib.org/), using
20 bins.
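
A minimal sketch of a 20-bin histogram with `plt.hist`, assuming the lengths are already available as a list of integers. The `lengths` values below are made up for illustration and the axis labels are only a suggestion.

```python
import matplotlib.pyplot as plt

# Made-up token counts, one per query (illustration only).
lengths = [3, 5, 5, 6, 7, 7, 8, 9, 11, 12, 14, 15, 18, 21, 25, 30]

# bins=20 asks matplotlib for 20 equal-width bins between the
# smallest and the largest value.
counts, bin_edges, _ = plt.hist(lengths, bins=20)

plt.xlabel("Query length (whitespace tokens)")
plt.ylabel("Number of queries")
plt.title("Query length distribution")
```

With `%matplotlib inline`, the figure is rendered below the cell. `plt.hist` also returns the per-bin counts and the bin edges, which can be handy for checking the plot numerically.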

In [None]:
# Your code here

## 1.5 Label distribution

Plot a histogram of label frequencies in the training fold. Is the
distribution of labels balanced?
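
One way to back up a visual impression of balance with numbers is to count the labels and compare the most and least frequent class. The sketch below does this on a made-up `labels` list; the `imbalance_ratio` name and the idea of using a max/min ratio are assumptions for illustration, not a prescribed method.

```python
from collections import Counter

# Made-up integer class labels (illustration only).
labels = [0, 2, 1, 0, 2, 2, 1, 0, 2, 2]

# Count how often each label occurs.
label_counts = Counter(labels)

most_common = max(label_counts.values())
least_common = min(label_counts.values())

# A ratio close to 1.0 suggests a balanced distribution;
# larger values indicate that some labels dominate.
imbalance_ratio = most_common / least_common
print(label_counts)
print(imbalance_ratio)
```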

In [None]:
# Your code here