# Training a short text classifier of German business names

In this tutorial we will train a basic short-text classifier for predicting the sector of a business based only on its business name. For this we will use a training dataset with business names and business categories in German.

The tutorial will guide you through the following steps:


[[toc]]



## Explore and prepare training data

In [1]:
from biome.text.data import DataSource

Let's take a look at the data we will use for training. For this we create a `DataSource` instance providing a path to our data.

In [2]:
ds_train = DataSource("https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/text_classifier/business.cat.train.csv")
ds_train.head(10)

| Unnamed: 0 | label | text | path |
|---|---|---|---|
| 0 | Edv | Cse Gmbh Computer Edv-service Bürobedarf | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 1 | Maler | Malerfachbetrieb U. Nee | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 2 | Gebrauchtwagen | Sippl Automobilverkäufer Hausmann | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 3 | Handelsvermittler Und -vertreter | Strenge Handelsagentur Werth | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 4 | Gebrauchtwagen | Dzengel Autohaus Gordemitz Rusch | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 5 | Apotheken | Schinkel-apotheke Bitzer | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 6 | Tiefbau | Franz Möbius Mehrings-bau-hude Und Stigge | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 7 | Handelsvermittler Und -vertreter | Kontze Hdl.vertr. Lau | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 8 | Autowerkstätten | Keßler Kfz-handel | https://biome-tutorials-data.s3-eu-west-1.amaz... |
| 9 | Gebrauchtwagen | Diko Lack Und Schrift Betriebsteil Der Autocen... | https://biome-tutorials-data.s3-eu-west-1.amaz... |


As we can see, there are two relevant columns: *label* and *text*. The *path* column is added automatically by the `DataSource` class to keep track of the source file.

Our classifier will be trained to predict the *label* given a *text*.

The `DataSource` class stores the data in an underlying [Dask DataFrame](https://docs.dask.org/en/latest/dataframe.html) that you can easily access.
For example, let's check the size of our training data:

In [3]:
len(ds_train.to_dataframe())

8000

Or let's check the distribution of our labels:

In [4]:
df = ds_train.to_dataframe().compute()
df.label.value_counts()

Unternehmensberatungen              632
Friseure                            564
Tiefbau                             508
Dienstleistungen                    503
Gebrauchtwagen                      449
Elektriker                          430
Restaurants                         422
Architekturbüros                    417
Vereine                             384
Versicherungsvermittler             358
Maler                               330
Sanitärinstallationen               323
Edv                                 318
Werbeagenturen                      294
Apotheken                           289
Physiotherapie                      286
Vermittlungen                       277
Hotels                              274
Autowerkstätten                     263
Elektrotechnik                      261
Allgemeinärzte                      216
Handelsvermittler Und -vertreter    202
Name: label, dtype: int64
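
The counts above also give a quick sense of class imbalance and of the accuracy a trivial majority-class predictor would reach. The following sketch recomputes both from the numbers printed above; the variable names are illustrative and not part of `biome.text`.

```python
# Label counts copied from the `value_counts()` output above.
label_counts = {
    "Unternehmensberatungen": 632,
    "Friseure": 564,
    "Tiefbau": 508,
    "Dienstleistungen": 503,
    "Gebrauchtwagen": 449,
    "Elektriker": 430,
    "Restaurants": 422,
    "Architekturbüros": 417,
    "Vereine": 384,
    "Versicherungsvermittler": 358,
    "Maler": 330,
    "Sanitärinstallationen": 323,
    "Edv": 318,
    "Werbeagenturen": 294,
    "Apotheken": 289,
    "Physiotherapie": 286,
    "Vermittlungen": 277,
    "Hotels": 274,
    "Autowerkstätten": 263,
    "Elektrotechnik": 261,
    "Allgemeinärzte": 216,
    "Handelsvermittler Und -vertreter": 202,
}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
minority_label = min(label_counts, key=label_counts.get)

# Accuracy of a classifier that always predicts the most frequent label.
majority_baseline = label_counts[majority_label] / total
# Ratio between the most and the least frequent label.
imbalance_ratio = label_counts[majority_label] / label_counts[minority_label]

print(f"{total} examples, {len(label_counts)} labels")
print(f"majority baseline accuracy: {majority_baseline:.3f}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

Any trained model should comfortably beat the majority baseline; if it does not, the data preparation or the configuration is worth a second look.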

::: tip

The [TaskHead](link/to/API) of our model will expect a *text* and a *label* column to be present in the dataframe. 
Since they are already present, there is no need for a [mapping](link/to/API) in the `DataSource`.

:::

## Configure your `biome.text` Pipeline

In [5]:
from biome.text import Pipeline

A typical [Pipeline](link/to/API) consists of tokenizing the input, extracting features, optionally applying a language encoder, and finally executing a task-specific head.

After training a pipeline, you can use it to make predictions or explore the underlying model via the [UI](link/to/API).

As a first step we must define a configuration for our pipeline. 
In this tutorial we will create a configuration dictionary and use the `Pipeline.from_config()` method to create our pipeline, but there are [other ways](link/to/API).

A `biome.text` pipeline has the following main components:

```yaml
name: # a descriptive name of your pipeline

tokenizer: # how to tokenize the input

features: # input features of the model

encoder: # the language encoder

head: # your task configuration

```

See the [Configuration section](link/to/config) for a detailed description of how these main components can be configured.

Our complete configuration for this tutorial is the following:

In [6]:
pipeline_dict = {
    "name": "german_business_names",
    "tokenizer": {
        "text_cleaning": {
            "rules": ["strip_spaces"]
        }
    },
    "features": {
        "word": {
            "embedding_dim": 32,
            "lowercase_tokens": True,
        },
        "char": {
            "embedding_dim": 32,
            "lowercase_characters": True,
            "encoder": {
                "type": "gru",
                "num_layers": 1,
                "hidden_size": 32,
                "bidirectional": True,
            },
            "dropout": 0.0,
        },
    },
    "head": {
        "type": "TextClassification",
        "labels": list(df.label.value_counts().index),
        "pooler": {
            "type": "gru",
            "num_layers": 1,
            "hidden_size": 32,
            "bidirectional": True,
        },
        "feedforward": {
            "num_layers": 1,
            "hidden_dims": [32],
            "activations": ["relu"],
            "dropout": [0.0],
        },
    },       
}

With this dictionary we can now create a `Pipeline`:

In [7]:
pl = Pipeline.from_config(pipeline_dict)

## Create a vocabulary

Before we can start the training we need to create the vocabulary for our model.
For this we define a `VocabularyConfiguration`.

In [8]:
from biome.text.configuration import VocabularyConfiguration, WordFeatures

In our business name classifier we only want to include words with a general meaning in our word feature vocabulary (like "Computer" or "Autohaus", for example), and we want to exclude specific names that will not help to classify the kind of business in general.
This can be achieved by including only the most frequent words in our training set via the `min_count` argument. For a complete list of available arguments see the [VocabularyConfiguration API](link/to/api).

In [9]:
vocab_config = VocabularyConfiguration(sources=[ds_train], min_count={WordFeatures.namespace: 100})
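
To make the effect of `min_count` more concrete, here is a minimal pure-Python sketch of frequency-based vocabulary filtering. It is not the `biome.text` implementation: the whitespace tokenizer, the toy business names, and the `build_vocab` helper are assumptions made for illustration only.

```python
from collections import Counter


def build_vocab(texts, min_count):
    """Keep only lowercased tokens that occur at least `min_count` times."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return {token for token, count in counts.items() if count >= min_count}


# Toy business names (hypothetical, not taken from the training data).
names = [
    "Autohaus Müller",
    "Autohaus Schmidt",
    "Autohaus Weber",
    "Friseur Klein",
    "Friseur Lang",
]

vocab = build_vocab(names, min_count=2)
print(sorted(vocab))
```

In this toy example the generic words "autohaus" and "friseur" pass the threshold, while the surnames, which each occur only once, are left out of the vocabulary.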

We then pass this configuration to our `Pipeline` to create the vocabulary:

In [None]:
pl.create_vocabulary(vocab_config)


After creating the vocabulary we can check the size of our entire model in terms of trainable parameters:

In [None]:
pl.trainable_parameters
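
For intuition about where these parameters come from, the sketch below counts the weights of a single GRU layer such as the one configured for the `char` feature encoder. It assumes the PyTorch `nn.GRU` parameter layout (three gates, each with input weights, recurrent weights, and two bias vectors) and that the GRU receives the 32-dimensional character embeddings as input. It says nothing about how `biome.text` sums up the rest of the model, and `gru_param_count` is a hypothetical helper.

```python
def gru_param_count(input_size, hidden_size, bidirectional=False):
    """Number of weights in a one-layer GRU, following PyTorch's layout."""
    per_direction = 3 * (
        input_size * hidden_size      # input-to-hidden weights
        + hidden_size * hidden_size   # hidden-to-hidden weights
        + 2 * hidden_size             # two bias vectors
    )
    return per_direction * (2 if bidirectional else 1)


# Values from the tutorial config: embedding_dim=32, hidden_size=32, bidirectional.
print(gru_param_count(32, 32, bidirectional=True))
```

The embedding layers add further parameters that scale with the vocabulary size, which is why the total can only be computed after the vocabulary has been created.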

## Configure the trainer

As a next step we have to configure the *trainer*.

In [None]:
from biome.text.configuration import TrainerConfiguration

The default trainer has sensible defaults and should work well in most cases.
In this tutorial, however, we want to tune the learning rate a bit and limit the training to a single epoch.
For a complete list of the available arguments see the [TrainerConfiguration API](link/to/api).

In [None]:
trainer_config = TrainerConfiguration(
    optimizer={
        "type": "adam",
        "lr": 0.01,
    },
    num_epochs=1,
    # cuda_device=0,
)

## Train your model

Now we have everything ready to start the training of our model:
- training data set
- vocabulary
- trainer

Optionally we can provide a validation data set to estimate the generalization error.
For this we will create another `DataSource` pointing to our validation data.

In [None]:
ds_valid = DataSource("https://biome-tutorials-data.s3-eu-west-1.amazonaws.com/text_classifier/business.cat.valid.csv")

The training output will be saved in a folder specified by the `output` argument. It contains the trained model weights and the metrics, as well as the vocabulary and a *log* folder for visualizing the training process with [TensorBoard](https://www.tensorflow.org/tensorboard/).

In [16]:
pl.train(
    output="output",
    training=ds_train,
    validation=ds_valid,
    trainer=trainer_config,
)

INFO:allennlp.common.params:validation_dataset_reader = None
INFO:allennlp.common.params:train_data_path = output/.datasources/training_business.cat.train.csv.yml
INFO:biome.text._model:Reading training data from output/.datasources/training_business.cat.train.csv.yml
INFO:allennlp.common.params:validation_data_path = output/.datasources/validation_business.cat.valid.csv.yml
INFO:biome.text._model:Reading validation data from output/.datasources/validation_business.cat.valid.csv.yml
INFO:allennlp.common.params:test_data_path = None
INFO:allennlp.common.params:random_seed = 13370
INFO:allennlp.common.params:numpy_seed = 1337
INFO:allennlp.common.params:pytorch_seed = 133
INFO:allennlp.common.checks:Pytorch version: 1.5.0
INFO:biome.text._model:Serialization directory (output) already exists and is not empty.
INFO:filelock:Lock 139922988593680 acquired on output/vocabulary/.lock
INFO:filelock:Lock 139922988593680 released on output/vocabulary/.lock
INFO:allennlp.common.params:trainer.no_


INFO:allennlp.training.trainer:Validating






INFO:allennlp.training.tensorboard_writer:                                                Training |  Validation
INFO:allennlp.training.tensorboard_writer:_recall/Elektriker                          |     0.800  |     0.952
INFO:allennlp.training.tensorboard_writer:_recall/Autowerkstätten                     |     0.646  |     0.594
INFO:allennlp.training.tensorboard_writer:_precision/Tiefbau                          |     0.720  |     0.927
INFO:allennlp.training.tensorboard_writer:_fscore/Unternehmensberatungen              |     0.784  |     0.941
INFO:allennlp.training.tensorboard_writer:_fscore/Werbeagenturen                      |     0.784  |     0.885
INFO:allennlp.training.tensorboard_writer:micro/fscore                                |     0.749  |     0.897
INFO:allennlp.training.tensorboard_writer:_recall/Vereine                             |     0.862  |     0.875
INFO:allennlp.training.tensorboard_writer:_precision/Friseure                         |     0.833  |     0.921
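
The log reports per-label precision, recall, and F-score, plus a micro-averaged F-score. As a reminder of what these numbers mean, here is a small sketch that computes them from scratch for a toy set of predictions. The gold labels and predictions are made up, and the helper names are assumptions, not `biome.text` or AllenNLP functions.

```python
def per_label_counts(gold, predicted, label):
    """Return (true positives, false positives, false negatives) for one label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    return tp, fp, fn


def precision_recall_fscore(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    fscore = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fscore


# Hypothetical gold labels and model predictions.
gold = ["Maler", "Maler", "Hotels", "Hotels", "Edv"]
predicted = ["Maler", "Hotels", "Hotels", "Hotels", "Edv"]

maler = precision_recall_fscore(*per_label_counts(gold, predicted, "Maler"))
hotels = precision_recall_fscore(*per_label_counts(gold, predicted, "Hotels"))

# Micro average: sum the counts over all labels first, then compute the scores.
all_counts = [per_label_counts(gold, predicted, label) for label in sorted(set(gold))]
totals = [sum(column) for column in zip(*all_counts)]
micro = precision_recall_fscore(*totals)

print("Maler ", maler)
print("Hotels", hotels)
print("micro ", micro)
```

Because the micro average pools the counts, frequent labels influence it more than rare ones, which matters for an imbalanced label distribution like the one in this dataset.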




::: tip

If for some reason the training gets interrupted, you can continue where you left off by setting the `restore` argument in the `Pipeline.train()` method to `True`. 
If you want to train your model for a few more epochs, you can also use the `restore` argument, but you have to modify the `num_epochs` argument in your `TrainerConfiguration` to reflect the total number of epochs you aim for.

:::

## Make your first predictions

Now that we have trained our model, we can go on to make our first predictions.
First we must load our trained model into a new `Pipeline`:

In [17]:
pl_trained = Pipeline.from_pretrained("output/model.tar.gz")

INFO:filelock:Lock 139926434191760 acquired on /tmp/tmpkrtxqu4v/vocabulary/.lock
INFO:filelock:Lock 139926434191760 released on /tmp/tmpkrtxqu4v/vocabulary/.lock


We then provide the input expected by our `TaskHead` of the model to the `Pipeline.predict()` method.
In our case it is a `TextClassification` head that classifies a `text` input:

In [18]:
pl_trained.predict(text="Autohaus biome.text")

{'logits': array([ -4.8484344 ,  -6.2444367 ,  -4.7423077 ,  -0.64842075,
          8.957563  ,  -4.9662757 ,  -1.4878473 ,  -4.364673  ,
        -15.998326  , -10.032035  ,  -3.9810708 , -11.56705   ,
         -4.8242006 ,  -9.484177  ,  -4.2695546 ,  -1.1502581 ,
        -11.844207  ,  -9.359994  ,   3.3760784 ,  -1.0784173 ,
        -16.371155  ,  -3.1280544 ], dtype=float32),
 'probs': array([1.0055722e-06, 2.4896445e-07, 1.1181589e-06, 6.7058871e-05,
        9.9605185e-01, 8.9378949e-07, 2.8966611e-05, 1.6312027e-06,
        1.4456938e-11, 5.6390328e-09, 2.3938924e-06, 1.2149434e-09,
        1.0302388e-06, 9.7529504e-09, 1.7939808e-06, 4.0598581e-05,
        9.2084884e-10, 1.1042501e-08, 3.7520956e-03, 4.3622556e-05,
        9.9577083e-12, 5.6177801e-06], dtype=float32),
 'classes': {'Gebrauchtwagen': tensor(0.9961),
  'Autowerkstätten': tensor(0.0038),
  'Dienstleistungen': tensor(6.7059e-05),
  'Elektrotechnik': tensor(4.3623e-05),
  'Physiotherapie': tensor(4.0599e-05),
  'Rest

The returned dictionary contains the logits and probabilities of all labels (classes).
The label with the highest probability is stored under the `label` key, together with its probability under the `prob` key.
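
The relation between `logits` and `probs` is a softmax. The sketch below recomputes the top prediction from the logits printed above. It assumes that the label order follows the `labels` list passed to the head, that is, the order of the `value_counts()` output shown earlier; this ordering is inferred from the printed output and is not confirmed by the API documentation.

```python
import math

# Logits copied from the prediction output above.
logits = [
    -4.8484344, -6.2444367, -4.7423077, -0.64842075,
    8.957563, -4.9662757, -1.4878473, -4.364673,
    -15.998326, -10.032035, -3.9810708, -11.56705,
    -4.8242006, -9.484177, -4.2695546, -1.1502581,
    -11.844207, -9.359994, 3.3760784, -1.0784173,
    -16.371155, -3.1280544,
]

# Assumed label order: the `value_counts()` index used in the pipeline config.
labels = [
    "Unternehmensberatungen", "Friseure", "Tiefbau", "Dienstleistungen",
    "Gebrauchtwagen", "Elektriker", "Restaurants", "Architekturbüros",
    "Vereine", "Versicherungsvermittler", "Maler", "Sanitärinstallationen",
    "Edv", "Werbeagenturen", "Apotheken", "Physiotherapie",
    "Vermittlungen", "Hotels", "Autowerkstätten", "Elektrotechnik",
    "Allgemeinärzte", "Handelsvermittler Und -vertreter",
]


def softmax(values):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], round(probs[best], 4))
```

Under this assumption the most probable label and its probability agree with the first entry of the `classes` dictionary in the output above.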

::: tip

When first configuring the pipeline, we recommend checking that it is set up correctly by calling the `predict` method.
Since the pipeline is still not trained at that moment, the predictions will be arbitrary.

:::

## Explore the model's predictions

To check and understand the predictions of the model, you can use the **biome.text explore UI**.
For the UI to work you need a running [Elasticsearch]() instance.

In [8]:
pl_trained.explore(ds_valid)

INFO:biome.text.ui.ui:Running biome UI on http://0.0.0.0:9999 with elasticsearch backend http://localhost:9200


Unnamed: 0_level_0,text,label,annotation,metadata
npartitions=4,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
,object,object,object,object
,...,...,...,...
,...,...,...,...
,...,...,...,...
,...,...,...,...


ERROR:biome.text.ui.app:Exception on /97e11d14-a5c3-11ea-9274-74d83e8f6ee3 [GET]
Traceback (most recent call last):
  File "/home/david/recognai/biome/biome-text/src/biome/text/ui/app.py", line 68, in static_proxy
    return send_from_directory(statics_dir, path)
  File "/home/david/miniconda3/envs/biome/lib/python3.7/site-packages/flask/helpers.py", line 767, in send_from_directory
    raise NotFound()
werkzeug.exceptions.NotFound: 404 Not Found: The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/david/miniconda3/envs/biome/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/david/miniconda3/envs/biome/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/h