<table class="tfo-notebook-buttons" align="top">
  <td>
    <a target="_blank" href="https://www.recogn.ai/biome-text/documentation/tutorials/Training_a_text_classifier.html"><img src="https://www.recogn.ai/biome-text/assets/img/biome-isotype.svg" width=32 />View on recogn.ai</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/recognai/biome-text/blob/master/docs/docs/documentation/tutorials/Training_a_text_classifier.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/recognai/biome-text/blob/master/docs/docs/documentation/tutorials/Training_a_text_classifier.ipynb"><img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" width=32 />View source on GitHub</a>
  </td>
</table>

When running this tutorial in Google Colab, make sure to install *biome.text* first:
```
!pip install -U git+https://github.com/recognai/biome-text.git
```
Ignore warnings and restart your runtime afterwards (Runtime -> Restart runtime).

# Training a short text classifier of German business names

In this tutorial we will train a basic short-text classifier for predicting the sector of a business based only on its business name. For this we will use a training dataset with business names and business categories in German.

The tutorial will guide you through the following steps:


[[toc]]



## Explore and prepare training data

In [1]:
from biome.text.data import DataSource

Let's take a look at the data we will use for training. For this we create a `DataSource` instance providing a path to our data.

In [2]:
ds_train = DataSource("https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/text_classifier/business.cat.train.csv")
ds_train.head(10)

| | label | text | path |
|---|---|---|---|
| 0 | Edv | Cse Gmbh Computer Edv-service Bürobedarf | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 1 | Maler | Malerfachbetrieb U. Nee | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 2 | Gebrauchtwagen | Sippl Automobilverkäufer Hausmann | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 3 | Handelsvermittler Und -vertreter | Strenge Handelsagentur Werth | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 4 | Gebrauchtwagen | Dzengel Autohaus Gordemitz Rusch | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 5 | Apotheken | Schinkel-apotheke Bitzer | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 6 | Tiefbau | Franz Möbius Mehrings-bau-hude Und Stigge | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 7 | Handelsvermittler Und -vertreter | Kontze Hdl.vertr. Lau | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 8 | Autowerkstätten | Keßler Kfz-handel | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 9 | Gebrauchtwagen | Diko Lack Und Schrift Betriebsteil Der Autocen... | https://biome-tutorials-data.s3-eu-west-1.amaz... |


As we can see, we have two relevant columns: *label* and *text*. The *path* column is added automatically by the `DataSource` class to keep track of the source file.

Our classifier will be trained to predict the *label* given a *text*.

The `DataSource` class stores the data in an underlying [Dask DataFrame](https://docs.dask.org/en/latest/dataframe.html) that you can easily access.
For example, let's check the size of our training data:

In [3]:
len(ds_train.to_dataframe())

8000

Or let's check the distribution of our labels:

In [4]:
df = ds_train.to_dataframe().compute()
df.label.value_counts()

Unternehmensberatungen              632
Friseure                            564
Tiefbau                             508
Dienstleistungen                    503
Gebrauchtwagen                      449
Elektriker                          430
Restaurants                         422
Architekturbüros                    417
Vereine                             384
Versicherungsvermittler             358
Maler                               330
Sanitärinstallationen               323
Edv                                 318
Werbeagenturen                      294
Apotheken                           289
Physiotherapie                      286
Vermittlungen                       277
Hotels                              274
Autowerkstätten                     263
Elektrotechnik                      261
Allgemeinärzte                      216
Handelsvermittler Und -vertreter    202
Name: label, dtype: int64
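
The label counts above also give a quick reference point for judging the trained model. The following is a minimal sketch in plain Python (no `biome.text` involved) that computes the accuracy of a naive classifier that always predicts the most frequent label, plus the ratio between the largest and smallest class. The counts are copied from the output above; the variable names are only illustrative.

```python
# Label counts copied from the value_counts() output above.
label_counts = {
    "Unternehmensberatungen": 632,
    "Friseure": 564,
    "Tiefbau": 508,
    "Dienstleistungen": 503,
    "Gebrauchtwagen": 449,
    "Elektriker": 430,
    "Restaurants": 422,
    "Architekturbüros": 417,
    "Vereine": 384,
    "Versicherungsvermittler": 358,
    "Maler": 330,
    "Sanitärinstallationen": 323,
    "Edv": 318,
    "Werbeagenturen": 294,
    "Apotheken": 289,
    "Physiotherapie": 286,
    "Vermittlungen": 277,
    "Hotels": 274,
    "Autowerkstätten": 263,
    "Elektrotechnik": 261,
    "Allgemeinärzte": 216,
    "Handelsvermittler Und -vertreter": 202,
}

total = sum(label_counts.values())
majority_label, majority_count = max(label_counts.items(), key=lambda kv: kv[1])
minority_label, minority_count = min(label_counts.items(), key=lambda kv: kv[1])

# Accuracy on the training set of always predicting the most frequent label.
majority_baseline = majority_count / total
# How much larger the biggest class is than the smallest one.
imbalance_ratio = majority_count / minority_count

print(f"examples: {total}, classes: {len(label_counts)}")
print(f"majority baseline accuracy: {majority_baseline:.3f}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

With 22 moderately imbalanced classes, the majority baseline on the training set is roughly 8%, so any useful classifier should land well above that.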

::: tip

The [TaskHead](../../api/biome/text/modules/heads/task_head.html#taskhead) of our model will expect a *text* and a *label* column to be present in the dataframe. 
Since they are already present, there is no need for a [mapping](../../api/biome/text/data/datasource.html#datasource) in the `DataSource`.

:::

## Configure your `biome.text` Pipeline

In [5]:
from biome.text import Pipeline

A typical [Pipeline](../../api/biome/text/pipeline.html#pipeline) consists of tokenizing the input, extracting features, optionally applying a language encoding, and finally executing a task-specific head.

After training a pipeline, you can use it to make predictions or explore the underlying model via the [UI](../../documentation/user-guides/02.explore.html).

As a first step we must define a configuration for our pipeline. 
In this tutorial we will create a configuration dictionary and use the `Pipeline.from_config()` method to create our pipeline, but there are [other ways](../../api/biome/text/pipeline.html#pipeline).

A `biome.text` pipeline has the following main components:

```yaml
name: # a descriptive name of your pipeline

tokenizer: # how to tokenize the input

features: # input features of the model

encoder: # the language encoder

head: # your task configuration

```

See the [Configuration section](../../documentation/user-guides/05.configuration.html) for a detailed description of how these main components can be configured.

Our complete configuration for this tutorial is the following:

In [6]:
pipeline_dict = {
    "name": "german_business_names",
    "tokenizer": {
        "text_cleaning": {
            "rules": ["strip_spaces"]
        }
    },
    "features": {
        "word": {
            "embedding_dim": 32,
            "lowercase_tokens": True,
        },
        "char": {
            "embedding_dim": 32,
            "lowercase_characters": True,
            "encoder": {
                "type": "gru",
                "num_layers": 1,
                "hidden_size": 32,
                "bidirectional": True,
            },
            "dropout": 0.0,
        },
    },
    "head": {
        "type": "TextClassification",
        "labels": list(df.label.value_counts().index),
        "pooler": {
            "type": "gru",
            "num_layers": 1,
            "hidden_size": 32,
            "bidirectional": True,
        },
        "feedforward": {
            "num_layers": 1,
            "hidden_dims": [32],
            "activations": ["relu"],
            "dropout": [0.0],
        },
    },       
}

With this dictionary we can now create a `Pipeline`:

In [7]:
pl = Pipeline.from_config(pipeline_dict)

## Create a vocabulary

Before we can start the training we need to create the vocabulary for our model.
For this we define a `VocabularyConfiguration`.

In [8]:
from biome.text.configuration import VocabularyConfiguration, WordFeatures

In our business name classifier we only want to include words with a general meaning in our word feature vocabulary (like "Computer" or "Autohaus", for example), and we want to exclude specific names that do not help to classify the kind of business in general.
This can be achieved by including only the most frequent words in our training set via the `min_count` argument. For a complete list of available arguments see the [VocabularyConfiguration API](../../api/biome/text/configuration.html#vocabularyconfiguration).

In [9]:
vocab_config = VocabularyConfiguration(sources=[ds_train], min_count={WordFeatures.namespace: 100})
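
To illustrate the idea behind a minimum-count threshold, here is a minimal sketch in plain Python. It is not how `biome.text` builds its vocabulary internally: the whitespace tokenization, the `build_word_vocab` helper and the toy business names are assumptions made only for this example.

```python
from collections import Counter


def build_word_vocab(texts, min_count):
    """Keep lowercased whitespace tokens that occur at least `min_count` times.

    Hypothetical helper for illustration only; the real vocabulary is
    created by the pipeline and its own tokenizer.
    """
    counts = Counter(token for text in texts for token in text.lower().split())
    return {token for token, n in counts.items() if n >= min_count}


# Toy data: generic words repeat across businesses, proper names do not.
toy_names = [
    "Autohaus Rusch",
    "Autohaus Sippl",
    "Computer Service Bitzer",
    "Computer Handel Lau",
]

vocab = build_word_vocab(toy_names, min_count=2)
print(sorted(vocab))
```

Only "autohaus" and "computer" survive the threshold here, while the one-off names are dropped. In the actual model, words outside the vocabulary can still contribute through the character feature.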

We then pass this configuration to our `Pipeline` to create the vocabulary:

In [10]:
pl.create_vocabulary(vocab_config)

After creating the vocabulary we can check the size of our entire model in terms of trainable parameters:

In [11]:
pl.trainable_parameters

43926
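
As a rough cross-check of this number, the parameters of the non-embedding layers can be counted by hand. The sketch below rests on several assumptions that are not stated in this tutorial: the GRUs follow the PyTorch parameter layout (three gates, two bias vectors per gate), the word and char features are concatenated before the pooler, and the head ends in a single linear layer that maps to the 22 labels. The helper names are hypothetical.

```python
def gru_params(input_size, hidden_size, bidirectional=True):
    """Parameter count of a single-layer GRU, assuming the PyTorch layout."""
    per_direction = 3 * (
        input_size * hidden_size      # input-to-hidden weights
        + hidden_size * hidden_size   # hidden-to-hidden weights
        + 2 * hidden_size             # two bias vectors
    )
    return per_direction * (2 if bidirectional else 1)


def linear_params(in_features, out_features):
    """Weights plus biases of a dense layer."""
    return in_features * out_features + out_features


word_dim, char_dim, hidden, num_labels = 32, 32, 32, 22

char_encoder = gru_params(char_dim, hidden)    # char-level GRU
token_dim = word_dim + 2 * hidden              # assumed: word + bidirectional char output
pooler = gru_params(token_dim, hidden)         # pooler GRU in the head
feedforward = linear_params(2 * hidden, 32)    # one hidden layer with 32 units
output_layer = linear_params(32, num_labels)   # assumed final classification layer

non_embedding = char_encoder + pooler + feedforward + output_layer
remaining = 43926 - non_embedding  # would be left for the embedding tables

print(non_embedding, remaining)
```

Under these assumptions the layers account for 40438 parameters, which leaves 3488 for the word and char embedding tables. That remainder is a multiple of the embedding dimension 32, which is at least consistent with the assumed architecture, but the exact split depends on the vocabulary created above.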

## Configure the trainer

As a next step we have to configure the *trainer*.

In [12]:
from biome.text.configuration import TrainerConfiguration

The default trainer configuration has sensible defaults and should work well in most cases.
In this tutorial, however, we want to tune the learning rate a bit and limit the training to a single epoch.
For a complete list of available arguments see the [TrainerConfiguration API](../../api/biome/text/configuration.html#trainerconfiguration).

If you have a CUDA device available, you can also specify it here.

In [13]:
trainer_config = TrainerConfiguration(
    optimizer={
        "type": "adam",
        "lr": 0.01,
    },
    num_epochs=1,
    # cuda_device=0,
)

## Train your model

Now we have everything ready to start the training of our model:
- training data set
- vocabulary
- trainer

Optionally we can provide a validation data set to estimate the generalization error.
For this we will create another `DataSource` pointing to our validation data.

In [14]:
ds_valid = DataSource("https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/text_classifier/business.cat.valid.csv")

The training output will be saved in a folder specified by the `output` argument. It contains the trained model weights and the metrics, as well as the vocabulary and a *log* folder for visualizing the training process with [tensorboard](https://www.tensorflow.org/tensorboard/).

In [None]:
pl.train(
    output="output",
    training=ds_train,
    validation=ds_valid,
    trainer=trainer_config,
)

::: tip

If for some reason the training gets interrupted, you can continue where you left off by setting the `restore` argument in the `Pipeline.train()` method to `True`. 
If you want to train your model for a few more epochs, you can also use the `restore` argument, but you have to modify the `num_epochs` argument in your `TrainerConfiguration` to reflect the total number of epochs you aim for.

:::

## Make your first predictions

Now that we have trained our model, we can go on to make our first predictions.
First we must load our trained model into a new `Pipeline`:

In [16]:
pl_trained = Pipeline.from_pretrained("output/model.tar.gz")

INFO:filelock:Lock 140282581614672 acquired on /tmp/tmpa8czwzha/vocabulary/.lock
INFO:filelock:Lock 140282581614672 released on /tmp/tmpa8czwzha/vocabulary/.lock


We then pass the input expected by the model's `TaskHead` to the `Pipeline.predict()` method.
In our case it is a `TextClassification` head that classifies a `text` input:

In [17]:
pl_trained.predict(text="Autohaus biome.text")

{'logits': array([ -1.583786  ,  -6.9499893 ,   0.07082307,   0.8625684 ,
          9.981598  ,  -4.9741173 ,  -3.002375  , -15.961031  ,
        -11.91686   ,  -7.106097  ,  -2.3369193 ,  -6.279768  ,
         -2.2998297 ,  -8.338451  ,  -1.1450654 , -13.137184  ,
         -1.1679922 ,   0.814244  ,   4.70496   ,  -3.7449305 ,
        -14.830839  ,  -4.7980075 ], dtype=float32),
 'probs': array([9.4377356e-06, 4.4091546e-08, 4.9369104e-05, 1.0896972e-04,
        9.9460405e-01, 3.1802821e-07, 2.2844549e-06, 5.3815784e-12,
        3.0709402e-10, 3.7718824e-08, 4.4441222e-06, 8.6184414e-08,
        4.6120508e-06, 1.0999018e-08, 1.4635306e-05, 9.0633917e-11,
        1.4303575e-05, 1.0382911e-04, 5.0820094e-03, 1.0871634e-06,
        1.6662715e-11, 3.7927040e-07], dtype=float32),
 'classes': {'Gebrauchtwagen': tensor(0.9946),
  'Autowerkstätten': tensor(0.0051),
  'Dienstleistungen': tensor(0.0001),
  'Hotels': tensor(0.0001),
  'Tiefbau': tensor(4.9369e-05),
  'Apotheken': tensor(1.4635e-

The returned dictionary contains the logits and probabilities of all labels (classes).
The label with the highest probability is stored under the `label` key, together with its probability under the `prob` key.
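
The relation between the `logits` and `probs` arrays can be checked by hand: the probabilities are the softmax of the logits. The following minimal sketch in plain Python uses the logits printed above and assumes that the label order is the one passed to the head in the pipeline configuration (`list(df.label.value_counts().index)`). The `softmax` helper is written out here only for illustration.

```python
import math

# Label order as passed to the head in the pipeline configuration.
labels = [
    "Unternehmensberatungen", "Friseure", "Tiefbau", "Dienstleistungen",
    "Gebrauchtwagen", "Elektriker", "Restaurants", "Architekturbüros",
    "Vereine", "Versicherungsvermittler", "Maler", "Sanitärinstallationen",
    "Edv", "Werbeagenturen", "Apotheken", "Physiotherapie", "Vermittlungen",
    "Hotels", "Autowerkstätten", "Elektrotechnik", "Allgemeinärzte",
    "Handelsvermittler Und -vertreter",
]

# Logits copied from the prediction output above.
logits = [
    -1.583786, -6.9499893, 0.07082307, 0.8625684,
    9.981598, -4.9741173, -3.002375, -15.961031,
    -11.91686, -7.106097, -2.3369193, -6.279768,
    -2.2998297, -8.338451, -1.1450654, -13.137184,
    -1.1679922, 0.814244, 4.70496, -3.7449305,
    -14.830839, -4.7980075,
]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], round(probs[best], 4))
```

This reproduces the top entry of the `classes` dictionary, "Gebrauchtwagen" with a probability of about 0.9946, and the second-highest probability falls on "Autowerkstätten".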

::: tip

When configuring the pipeline in the first place, we recommend checking that it is set up correctly by calling the `predict` method.
Since the pipeline is still not trained at that moment, the predictions will be arbitrary.

:::

## Explore the model's predictions

To check and understand the predictions of the model, you can use the **biome.text explore UI**.
For the UI to work, you need a running [Elasticsearch](https://www.elastic.co/elasticsearch/) instance.

In [18]:
pl_trained.explore(ds_valid)

INFO:biome.text.ui.ui:Running biome UI on http://0.0.0.0:9999 with elasticsearch backend http://localhost:9200


| | text | label | annotation | metadata |
|---|---|---|---|---|
| npartitions=4 | | | | |
| | object | object | object | object |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |
