# Training a short text classifier of German business names

<a target="_blank" href="https://recognai.github.io/biome-text/v3.1.0/documentation/tutorials/1-Training_a_text_classifier.html"><img class="icon" src="https://recognai.github.io/biome-text/v3.1.0/assets/img/biome-isotype.svg" width=24 /></a>
[View on recogn.ai](https://recognai.github.io/biome-text/v3.1.0/documentation/tutorials/1-Training_a_text_classifier.html)

<a target="_blank" href="https://colab.research.google.com/github/recognai/biome-text/blob/v3.1.0/docs/docs/documentation/tutorials/1-Training_a_text_classifier.ipynb"><img class="icon" src="https://www.tensorflow.org/images/colab_logo_32px.png" width=24 /></a>
[Run in Google Colab](https://colab.research.google.com/github/recognai/biome-text/blob/v3.1.0/docs/docs/documentation/tutorials/1-Training_a_text_classifier.ipynb)

<a target="_blank" href="https://github.com/recognai/biome-text/blob/v3.1.0/docs/docs/documentation/tutorials/1-Training_a_text_classifier.ipynb"><img class="icon" src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" width=24 /></a>
[View source on GitHub](https://github.com/recognai/biome-text/blob/v3.1.0/docs/docs/documentation/tutorials/1-Training_a_text_classifier.ipynb)

When running this tutorial in Google Colab, make sure to install *biome.text* first:

In [None]:
!pip install -U pip
!pip install -U biome-text
exit(0)  # Force restart of the runtime

*If* you want to log your runs with [WandB](https://wandb.ai/home), don't forget to install its client and log in.

In [None]:
!pip install wandb
!wandb login

## Introduction

In this tutorial we will train a basic short-text classifier for predicting the sector of a business based only on its business name. 
For this we will use a training data set with business names and business categories in German.

### Imports

Let us first import everything we need for this tutorial:

In [None]:
from biome.text import Pipeline, Dataset, Trainer
from biome.text.configuration import VocabularyConfiguration, WordFeatures, TrainerConfiguration

## Explore the training data

Let's take a look at the data we will use for training. For this we will use the [`Dataset`](https://recognai.github.io/biome-text/v3.1.0/api/biome/text/dataset.html#dataset) class that is a very thin wrapper around HuggingFace's awesome [datasets.Dataset](https://huggingface.co/docs/datasets/master/package_reference/main_classes.html#datasets.Dataset).
We will download the data first to create `Dataset` instances.

Apart from the training data we will also download an optional validation data set to estimate the generalization error.

In [None]:
# Downloading the dataset first
!curl -O https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/text_classifier/business.cat.train.csv
!curl -O https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/text_classifier/business.cat.valid.csv

In [None]:
# Loading from local
train_ds = Dataset.from_csv("business.cat.train.csv")
valid_ds = Dataset.from_csv("business.cat.valid.csv")

Most of HuggingFace's `Dataset` API is exposed and you can check out their nice [documentation](https://huggingface.co/docs/datasets/master/processing.html) on how to work with data in a `Dataset`. For example, let's quickly check the size of our training data and print the first 10 examples as a pandas DataFrame:

In [None]:
len(train_ds)

In [None]:
train_ds.head()

As we can see we have two relevant columns *label* and *text*. Our classifier will be trained to predict the *label* given the *text*.

::: tip Tip

The [TaskHead](https://recognai.github.io/biome-text/v3.1.0/api/biome/text/modules/heads/task_head.html#taskhead) of our model below will expect a *text* and a *label* column to be present in the `Dataset`. In our data set this is already the case; otherwise we would need to change or map the corresponding column names via `Dataset.rename_column_()` or `Dataset.map()`.

:::
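
If a CSV file uses other column names, one option is to rewrite its header before loading it. The following is a minimal standard-library sketch of that idea. The column names `company_name` and `sector` and the sample rows are hypothetical, and within *biome.text* the `Dataset` methods mentioned in the tip are the more direct route.

```python
import csv
import io

# Hypothetical raw data whose columns are not yet named "text" and "label".
raw_csv = io.StringIO(
    "company_name,sector\n"
    "Autohaus Beispiel,Autohaus\n"
    "Friseur Beispiel,Friseure\n"
)

# Mapping from the hypothetical source column names to the expected ones.
column_map = {"company_name": "text", "sector": "label"}

reader = csv.DictReader(raw_csv)
renamed_rows = [
    {column_map.get(name, name): value for name, value in row.items()}
    for row in reader
]

# Write the renamed rows back out as CSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["text", "label"])
writer.writeheader()
writer.writerows(renamed_rows)
print(out.getvalue())
```

In practice the input and output would be files on disk rather than in-memory buffers, and the rewritten file could then be passed to `Dataset.from_csv()`.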

We can also quickly check the distribution of our labels. Use `Dataset.head(None)` to return the complete data set as a pandas DataFrame:

In [None]:
train_ds.head(None)["label"].value_counts()

The `Dataset` class also provides access to Hugging Face's extensive NLP datasets collection via the `Dataset.load_dataset()` method. Have a look at their [quicktour](https://huggingface.co/docs/datasets/master/quicktour.html) for more details about their awesome library.

## Configure your *biome.text* Pipeline

A typical [Pipeline](https://recognai.github.io/biome-text/v3.1.0/api/biome/text/pipeline.html#pipeline) consists of tokenizing the input, extracting features, applying a language encoding (optionally) and executing a task-specific head in the end.

After training a pipeline, you can use it to make predictions.

As a first step we must define a configuration for our pipeline. 
In this tutorial we will create a configuration dictionary and use the `Pipeline.from_config()` method to create our pipeline, but there are [other ways](https://recognai.github.io/biome-text/v3.1.0/api/biome/text/pipeline.html#pipeline).

A *biome.text* pipeline has the following main components:

```yaml
name: # a descriptive name of your pipeline

tokenizer: # how to tokenize the input

features: # input features of the model

encoder: # the language encoder

head: # your task configuration

```

See the [Configuration section](https://recognai.github.io/biome-text/v3.1.0/documentation/user-guides/2-configuration.html) for a detailed description of how these main components can be configured.

Our complete configuration for this tutorial will be the following:

In [None]:
pipeline_dict = {
    "name": "german_business_names",
    "tokenizer": {
        "text_cleaning": {
            "rules": ["strip_spaces"]
        }
    },
    "features": {
        "word": {
            "embedding_dim": 64,
            "lowercase_tokens": True,
        },
        "char": {
            "embedding_dim": 32,
            "lowercase_characters": True,
            "encoder": {
                "type": "gru",
                "num_layers": 1,
                "hidden_size": 32,
                "bidirectional": True,
            },
            "dropout": 0.1,
        },
    },
    "head": {
        "type": "TextClassification",
        "labels": train_ds.unique("label"),
        "pooler": {
            "type": "gru",
            "num_layers": 1,
            "hidden_size": 32,
            "bidirectional": True,
        },
        "feedforward": {
            "num_layers": 1,
            "hidden_dims": [32],
            "activations": ["relu"],
            "dropout": [0.0],
        },
    },       
}

With this dictionary we can now create a `Pipeline`:

In [None]:
pl = Pipeline.from_config(pipeline_dict)
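
To get a feeling for how the numbers in the configuration fit together, the vector sizes can be worked out by hand. The sketch below is plain arithmetic, not *biome.text* code. It assumes that the word and character features are concatenated for each token and that a bidirectional recurrent encoder outputs twice its `hidden_size`; the helper `rnn_output_dim` is introduced here for illustration only.

```python
# Feature and pooler settings copied from the configuration above.
features = {
    "word": {"embedding_dim": 64},
    "char": {"encoder": {"hidden_size": 32, "bidirectional": True}},
}
pooler = {"hidden_size": 32, "bidirectional": True}


def rnn_output_dim(cfg):
    """Output size of a recurrent encoder: doubled when bidirectional."""
    return cfg["hidden_size"] * (2 if cfg["bidirectional"] else 1)


word_dim = features["word"]["embedding_dim"]
char_dim = rnn_output_dim(features["char"]["encoder"])
# Assumption: word and char features are concatenated per token.
token_dim = word_dim + char_dim
# Size of the single vector the pooler produces for the whole text.
pooled_dim = rnn_output_dim(pooler)

print(token_dim, pooled_dim)
```

Under these assumptions each token is represented by a 128-dimensional vector, and the pooler reduces the token sequence to a 64-dimensional vector that is fed into the feedforward layer.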

## Configure the vocabulary

The default behavior of *biome.text* is to add all tokens from the training data set to the pipeline's vocabulary. 
This is done automatically when training the pipeline for the first time.

If you want to have more control over this step, you can define a `VocabularyConfiguration` and pass it to the [`Trainer`](https://recognai.github.io/biome-text/v3.1.0/api/biome/text/trainer.html) later on.
In our business name classifier we only want to include words with a general meaning to our word feature vocabulary (like "Computer" or "Autohaus", for example), and want to exclude specific names that will not help to generally classify the kind of business.
This can be achieved by including only the most frequent words in our training set via the `min_count` argument. For a complete list of available arguments see the [VocabularyConfiguration API](https://recognai.github.io/biome-text/v3.1.0/api/biome/text/configuration.html#vocabularyconfiguration).

In [None]:
vocab_config = VocabularyConfiguration(min_count={WordFeatures.namespace: 20})
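
The effect of a minimum count can be illustrated without the library. The following sketch is not how *biome.text* builds its vocabulary internally. It uses a hypothetical four-name corpus, whitespace splitting as a stand-in for real tokenization, and a threshold of 2 instead of the 20 used above.

```python
from collections import Counter

# Hypothetical, tiny corpus of business names (not the tutorial data).
names = [
    "Autohaus Müller",
    "Autohaus Schmidt",
    "Computer Service Schmidt",
    "Computer Laden",
]

min_count = 2  # the tutorial uses 20 on the real training set

# Lowercase and split on whitespace as a stand-in for real tokenization.
counts = Counter(token for name in names for token in name.lower().split())

# Keep only tokens that occur at least `min_count` times.
vocab = sorted(token for token, n in counts.items() if n >= min_count)
print(vocab)
```

In this toy corpus the rare tokens are dropped, while a surname that happens to be frequent survives the filter. A frequency threshold therefore reduces, but does not completely remove, specific names from the vocabulary.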

## Configure the trainer

As a next step we have to configure the [`Trainer`](https://recognai.github.io/biome-text/v3.1.0/api/biome/text/trainer.html), which is essentially a light wrapper around the amazing [PyTorch Lightning Trainer](https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html).

The default trainer configuration has sensible values and should work well in most cases.
In this tutorial, however, we want to tune the learning rate a bit and limit the training time to only three epochs.
We also want to modify the monitored validation metric (by default it is the `validation_loss`) that is used to rank the checkpoints, as well as for the early stopping mechanism and to load the best model weights at the end of the training.
For a complete list of available arguments see the [TrainerConfiguration API](https://recognai.github.io/biome-text/v3.1.0/api/biome/text/configuration.html#trainerconfiguration).

::: tip Tip

By default we will use a CUDA device if one is available. If you prefer not to use it, just set `gpus=0` in the `TrainerConfiguration`.

:::

::: tip Tip

The default [WandB](https://wandb.ai/site) logger will log the runs to the "biome" project. 
You can easily change this by setting the `WANDB_PROJECT` env variable:
```python
import os
os.environ["WANDB_PROJECT"] = "my_project"
```

:::

In [None]:
trainer_config = TrainerConfiguration(
    optimizer={
        "type": "adam",
        "lr": 0.01,
    },
    max_epochs=3,
    monitor="validation_accuracy",
    monitor_mode="max"
)
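
The role of `monitor` and `monitor_mode` can be pictured as choosing the best entry from a list of per-epoch metrics. The sketch below only illustrates that selection logic. It is not the trainer's implementation, the metric values are hypothetical, and `best_epoch` is a helper defined here for illustration.

```python
# Hypothetical per-epoch validation metrics (not actual tutorial results).
history = [
    {"epoch": 0, "validation_accuracy": 0.85, "validation_loss": 0.52},
    {"epoch": 1, "validation_accuracy": 0.90, "validation_loss": 0.41},
    {"epoch": 2, "validation_accuracy": 0.89, "validation_loss": 0.43},
]


def best_epoch(history, monitor, mode):
    """Return the entry with the best monitored value for the given mode."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    pick = max if mode == "max" else min
    return pick(history, key=lambda entry: entry[monitor])


print(best_epoch(history, "validation_accuracy", "max")["epoch"])
print(best_epoch(history, "validation_loss", "min")["epoch"])
```

An accuracy-like metric should be paired with `"max"` and a loss-like metric with `"min"`. Mixing them up would rank the worst checkpoint first.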

## Train your model

Now we have everything ready to start the training of our model:
- training data set
- pipeline
- trainer configuration

In a first step we have to create a `Trainer` instance and pass in the pipeline, the training/validation data, the trainer configuration and our vocabulary configuration.
This will load the data into memory (unless you specify `lazy=True`) and build the vocabulary.

In [None]:
trainer = Trainer(
    pipeline=pl,
    train_dataset=train_ds,
    valid_dataset=valid_ds,
    trainer_config=trainer_config,
    vocab_config=vocab_config,
)

In a second step we simply have to call the `Trainer.fit()` method to start the training.
By default, at the end of the training the trained pipeline and the training metrics will be saved in a folder called `output`.
The trained pipeline is saved as a `model.tar.gz` file that contains the pipeline configuration, the model weights and the vocabulary.
The metrics are saved to a `metrics.json` file.

During the training the `Trainer` will also create a logging folder called `training_logs` by default.
You can modify this path via the `default_root_dir` option in your `TrainerConfiguration`, which also supports remote addresses such as S3 or HDFS.
This logging folder contains all your checkpoints and logged metrics, like the ones logged for [TensorBoard](https://www.tensorflow.org/tensorboard/) for example.

In [None]:
trainer.fit()
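
Once `fit()` has finished, the saved artifacts described above can be inspected with the standard library. This is a minimal sketch that assumes the default `output` folder and the file names `model.tar.gz` and `metrics.json` mentioned in the text. The helper `inspect_output` is hypothetical, and no assumption is made about which entries the archive or the metrics file contain.

```python
import json
import tarfile
from pathlib import Path


def inspect_output(folder="output"):
    """Return the archive member names and the parsed metrics, if present."""
    folder = Path(folder)
    members, metrics = [], {}

    archive = folder / "model.tar.gz"
    if archive.exists():
        # List the files packed into the trained pipeline archive.
        with tarfile.open(archive, "r:gz") as tar:
            members = tar.getnames()

    metrics_file = folder / "metrics.json"
    if metrics_file.exists():
        # Load the metrics written at the end of the training.
        metrics = json.loads(metrics_file.read_text())

    return members, metrics


members, metrics = inspect_output("output")
print(members)
print(metrics)
```

If the folder or one of the files is missing, the function returns an empty list or dictionary instead of raising an error.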

After 3 epochs we achieve a validation accuracy of about 0.91.
The validation loss seems to be decreasing further, though, so we could probably train the model for a few more epochs without overfitting the training data.
For this we could simply reinitialize the `Trainer` and call `Trainer.fit(exist_ok=True)` again.

::: tip Tip

If for some reason the training gets interrupted, you can continue from the last saved checkpoint by setting the `resume_from_checkpoint` option in the `TrainerConfiguration`.

:::

::: tip Tip

If you receive warnings about the data loader being a bottleneck, try increasing the `num_workers_for_dataloader` parameter in the `TrainerConfiguration` (up to the number of CPUs on your machine).

:::

## Make your first predictions

Now that we have trained our model, we can go on to make our first predictions.
We provide the input expected by our `TaskHead` of the model to the `Pipeline.predict()` method.
In our case it is a `TextClassification` head that classifies a `text` input:

In [None]:
pl.predict(text="Autohaus biome.text")

The output of the `Pipeline.predict()` method is a dictionary with a `labels` and a `probabilities` key, containing a list of labels and their corresponding probabilities, ordered from most to least likely.
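
A dictionary of this shape is easy to post-process. The sketch below works on a hand-written example rather than on real model output. The label names and probabilities are made up, and `top_k` is a hypothetical helper.

```python
# Hypothetical prediction with the structure described above;
# the label names and probabilities are made up for illustration.
prediction = {
    "labels": ["Autohaus", "Friseure", "Tiefbau"],
    "probabilities": [0.92, 0.05, 0.03],
}


def top_k(prediction, k=1):
    """Pair labels with probabilities and return the k most likely ones."""
    pairs = zip(prediction["labels"], prediction["probabilities"])
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


print(top_k(prediction, k=2))
```

The helper sorts explicitly, so it does not rely on the order of the input lists.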

::: tip Tip

When configuring the pipeline in the first place, we recommend checking that it is set up correctly by using the `predict` method.
Since the pipeline is not yet trained at that point, the predictions will be arbitrary.

:::

We can also load the trained pipeline from the training output. This is useful in case you trained the pipeline in some earlier session, and want to continue your work with the inference steps: 

In [None]:
pl_trained = Pipeline.from_pretrained("output/model.tar.gz")