# Training a short text classifier of German business names

In this tutorial we will train a basic short-text classifier for predicting the sector of a business based only on its business name. For this we will use a training dataset with business names and business categories in German.

The tutorial will guide you through the following steps:


[[toc]]



## Explore and prepare training and evaluation data

Let's take a look at the data we will use for training.

In [29]:
import pandas as pd

In [30]:
df = pd.read_csv("s3://biome-tutorials-data/text_classifier/business.cat.10k.csv")

In [31]:
df.head(10)

| | label | text |
|---|---|---|
| 0 | Tiefbau | Baugeschäft Haßmann Gmbh Wörblitz |
| 1 | Restaurants | Gaststätten, Restaurants - Sucos Do Brasil Coc... |
| 2 | Autowerkstätten | Lankes Kfz-werkstatt |
| 3 | Werbeagenturen | Feine Reklame Gesellschaft Für Strategische Kr... |
| 4 | Maler | Müller Vladimir & Co. Malermeister |
| 5 | Allgemeinärzte | Renninger Arztpraxis Für Allgemeinmedizin Dr. |
| 6 | Friseure | Coiffeur La Vie |
| 7 | Maler | Kiesewalter Malermeister Thomas |
| 8 | Dienstleistungen | Gerhard Pflaum Minden-herforder-verkehrs-servi... |
| 9 | Physiotherapie | Hellriegel - Thoms - Feliksßen Rückenzentrum K... |


As we can see, the dataset has two relevant columns: `label` and `text`.

Our classifier will be trained to predict the `label` given a `text`.

Let's check the distribution of our `label` column:

In [8]:
pd.DataFrame(df.label.value_counts())

| | label |
|---|---|
| Unternehmensberatungen | 775 |
| Friseure | 705 |
| Tiefbau | 627 |
| Dienstleistungen | 613 |
| Gebrauchtwagen | 567 |
| Restaurants | 526 |
| Architekturbüros | 523 |
| Elektriker | 513 |
| Vereine | 488 |
| Versicherungsvermittler | 462 |
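
Relative frequencies can put these absolute counts in perspective. The sketch below uses only the ten labels visible in the `df.head(10)` output above as a toy sample (an illustrative assumption, not the full dataset) and computes each label's share with the standard library. On the real DataFrame, `df.label.value_counts(normalize=True)` should give the same kind of result.

```python
from collections import Counter

# Toy sample: the ten labels shown in the df.head(10) output above.
sample_labels = [
    "Tiefbau", "Restaurants", "Autowerkstätten", "Werbeagenturen", "Maler",
    "Allgemeinärzte", "Friseure", "Maler", "Dienstleistungen", "Physiotherapie",
]

counts = Counter(sample_labels)
total = sum(counts.values())

# Relative frequency of each label, most frequent first.
shares = {label: n / total for label, n in counts.most_common()}

for label, share in shares.items():
    print(f"{label:<20} {share:.0%}")
```

A strongly skewed distribution would make plain accuracy harder to interpret, so this check is worth doing before training.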


In [72]:
from biome.text import Pipeline

In [89]:
pipeline_dict = {
    "name": "german_business_names",
    "features": {
        "word": {
            "embedding_dim": 16,
            "lowercase_tokens": True,
        },
        "char": {
            "embedding_dim": 16,
            "encoder": {
                "type": "gru",
                "num_layers": 1,
                "hidden_size": 32,
                "bidirectional": True,
            },
            "dropout": 0.1,
        },
    },
    "head": {
        "type": "TextClassification",
        "labels": list(df.label.value_counts().index),
        "pooler": {
            "type": "gru",
            "num_layers": 1,
            "hidden_size": 16,
            "bidirectional": True,
        },
        "feedforward": {
            "num_layers": 1,
            "hidden_dims": [16],
            "activations": ["relu"],
            "dropout": [0.1],
        },
    },       
}

In [90]:
import yaml

# Serialize the configuration dict to a YAML string
yaml_str = yaml.safe_dump(pipeline_dict)

pl = Pipeline.from_config(yaml_str)

In [91]:
pl.trainable_parameters

19974

In [97]:
pl.config.as_dict()

{'name': 'german_business_names',
 'tokenizer': {'lang': 'en',
  'skip_empty_tokens': False,
  'max_sequence_length': None,
  'max_nr_of_sentences': None,
  'text_cleaning': None,
  'segment_sentences': False},
 'features': {'word': <biome.text.featurizer.WordFeatures at 0x1443f8310>,
  'char': <biome.text.featurizer.CharFeatures at 0x1443f8790>},
 'head': {'feedforward': {'activations': ['relu'],
   'dropout': [0.1],
   'hidden_dims': [16],
   'num_layers': 1},
  'labels': ['Unternehmensberatungen',
   'Friseure',
   'Tiefbau',
   'Dienstleistungen',
   'Gebrauchtwagen',
   'Restaurants',
   'Architekturbüros',
   'Elektriker',
   'Vereine',
   'Versicherungsvermittler',
   'Sanitärinstallationen',
   'Edv',
   'Maler',
   'Physiotherapie',
   'Werbeagenturen',
   'Apotheken',
   'Vermittlungen',
   'Hotels',
   'Autowerkstätten',
   'Elektrotechnik',
   'Allgemeinärzte',
   'Handelsvermittler Und -vertreter'],
  'pooler': {'bidirectional': True,
   'hidden_size': 16,
   'num_layers':

In [92]:
from biome.text.configuration import TrainerConfiguration

In [94]:
trainer = TrainerConfiguration(
    optimizer={
        "type": "adam",
    }
)

Init signature:
TrainerConfiguration(
    optimizer: Dict[str, Any],
    validation_metric: str = '-loss',
    patience: Union[int, NoneType] = None,
    shuffle: bool = True,
    num_epochs: int = 20,
    cuda_device: int = -1,
    grad_norm: Union[float, NoneType] = None

In [54]:
TaskHeadSpec

Init signature: TaskHeadSpec(*args, **kwds)
Docstring:      Layer spec for TaskHead components
File:           ~/recognai/biome/biome-text/src/biome/text/modules/heads/defs.py
Type:           type
Subclasses:     


Init signature:
PipelineConfiguration(
    name: str,
    features: biome.text.configuration.FeaturesConfiguration,
    head: biome.text.modules.heads.defs.TaskHeadSpec,
    tokenizer: Union[biome.text.configuration.TokenizerConfiguration, NoneType] = None,
    encoder: Union[biome.text.modules.specs.allennlp

## Configure your `biome.text` Pipeline

In [12]:
from biome.text.api_new import Pipeline
from biome.text.api_new.configuration import TrainerConfiguration
from biome.text.api_new.helpers import yaml_to_dict

### Pipeline configuration from YAML

A `biome.text` pipeline has the following main components:

```yaml
name: # the name

tokenizer: # how to tokenize input text

features: # the input features of the model

encoder: # the backbone model encoder

head: # your task configuration


```
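
The same layout can be written as a Python dictionary. The sketch below is illustrative only: `REQUIRED_COMPONENTS` and `missing_components` are hypothetical helpers, not part of `biome.text`. Treating `name`, `features` and `head` as required is an assumption based on the `PipelineConfiguration` signature shown earlier, where those three parameters have no default value.

```python
# Hypothetical helper, not part of biome.text: check a configuration dict
# for the components that (by assumption) must always be present.
REQUIRED_COMPONENTS = ("name", "features", "head")


def missing_components(config: dict) -> list:
    """Return the required top-level keys that are absent from `config`."""
    return [key for key in REQUIRED_COMPONENTS if key not in config]


# A skeleton that mirrors the YAML layout above; the values are placeholders.
config = {
    "name": "my-pipeline",
    "features": {"word": {"embedding_dim": 16}},
    "head": {"type": "TextClassification", "labels": ["a", "b"]},
}

print(missing_components(config))         # []
print(missing_components({"name": "x"}))  # ['features', 'head']
```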

Our complete configuration for this tutorial is:

```yaml
name: german-business-categories

features:
    words:
        embedding_dim: 100
        lowercase_tokens: true
    chars:
        embedding_dim: 8
        encoder:
            type: cnn
            num_filters: 50
            ngram_filter_sizes: [ 4 ]
        dropout: 0.2

encoder:
    hidden_size: 512
    num_layers: 2
    dropout: 0.5
    type: lstm

head:
    type: TextClassification
    pooler:
        type: boe
    labels: ['Allgemeinärzte', 'Apotheken', 'Architekturbüros',
             'Autowerkstätten', 'Dienstleistungen', 'Edv', 'Elektriker',
             'Elektrotechnik', 'Friseure', 'Gebrauchtwagen',
             'Handelsvermittler Und -vertreter', 'Hotels', 'Maler',
             'Physiotherapie', 'Restaurants', 'Sanitärinstallationen',
             'Tiefbau', 'Unternehmensberatungen', 'Vereine', 'Vermittlungen',
             'Versicherungsvermittler', 'Werbeagenturen']
```
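
The encoder's input size follows from the feature configuration, because the word and character features are combined for each token. Here is a minimal sketch of the arithmetic. It assumes the two feature vectors are concatenated and that the CNN character encoder outputs `num_filters` values per n-gram filter size, which is how AllenNLP's `CnnEncoder` behaves when no output projection is set. `encoder_input_size` is a hypothetical helper written for this illustration.

```python
def encoder_input_size(word_dim: int, num_filters: int, ngram_filter_sizes: list) -> int:
    """Word embedding size plus the character CNN output size."""
    # One max-pooled vector of length num_filters per n-gram filter size.
    char_dim = num_filters * len(ngram_filter_sizes)
    return word_dim + char_dim


# Values taken from the YAML configuration above.
size = encoder_input_size(word_dim=100, num_filters=50, ngram_filter_sizes=[4])
print(size)  # 150
```

This agrees with the `input_size` of 150 that `pl.config.as_dict()` reports for the encoder in this tutorial.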

In [13]:
pl = Pipeline.from_file("configs/text_classifier.yml")

In [15]:
pl.config.as_dict()

{'name': 'business-categories',
 'tokenizer': {'lang': 'en',
  'skip_empty_tokens': False,
  'max_sequence_length': None,
  'max_nr_of_sentences': None,
  'text_cleaning': None,
  'segment_sentences': False},
 'features': {'words': {'embedding_dim': 100, 'lowercase_tokens': True},
  'chars': {'embedding_dim': 8,
   'encoder': {'type': 'cnn', 'num_filters': 50, 'ngram_filter_sizes': [4]},
   'dropout': 0.2}},
 'encoder': {'hidden_size': 512,
  'num_layers': 2,
  'dropout': 0.5,
  'type': 'lstm',
  'input_size': 150},
 'head': {'type': 'TextClassification',
  'pooler': {'type': 'boe'},
  'labels': ['Allgemeinärzte',
   'Apotheken',
   'Architekturbüros',
   'Autowerkstätten',
   'Dienstleistungen',
   'Edv',
   'Elektriker',
   'Elektrotechnik',
   'Friseure',
   'Gebrauchtwagen',
   'Handelsvermittler Und -vertreter',
   'Hotels',
   'Maler',
   'Physiotherapie',
   'Restaurants',
   'Sanitärinstallationen',
   'Tiefbau',
   'Unternehmensberatungen',
   'Vereine',
   'Vermittlungen',
  

### Testing our pipeline before training

It is recommended to check that our pipeline is set up correctly by calling its `predict` method.

::: warning

Our pipeline has not been trained yet, so its weights are random. Do not expect its predictions to make sense at this point.

:::


In [13]:
pl.predict('Some text')

{'logits': array([-0.0333772 , -0.01114595,  0.08185824,  0.00720856, -0.01808064,
         0.0209163 , -0.04119281,  0.0234425 ,  0.00120479,  0.04529068,
        -0.02560528,  0.03243363, -0.02825472,  0.01238234,  0.00707909,
        -0.05999601,  0.05878261,  0.03128546, -0.01267068,  0.00673078,
         0.01568662,  0.02453783], dtype=float32),
 'probs': array([0.04366268, 0.04464422, 0.04899553, 0.04547121, 0.0443357 ,
        0.04609881, 0.04332276, 0.04621541, 0.04519903, 0.04723625,
        0.04400334, 0.04663282, 0.04388691, 0.04570708, 0.04546533,
        0.04251576, 0.04787787, 0.04657931, 0.04457621, 0.04544949,
        0.04585836, 0.04626606], dtype=float32),
 'classes': {'Architekturbüros': 0.04899553209543228,
  'Tiefbau': 0.047877874225378036,
  'Gebrauchtwagen': 0.04723624885082245,
  'Hotels': 0.04663281515240669,
  'Unternehmensberatungen': 0.046579305082559586,
  'Werbeagenturen': 0.046266064047813416,
  'Elektrotechnik': 0.04621541127562523,
  'Edv': 0.0460988096
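
The near-uniform probabilities are what we would expect from an untrained model: with 22 labels, a uniform guess gives 1/22 ≈ 0.045 per class. As a sanity check, the sketch below applies a plain softmax to the logits printed above and compares the result with the reported `probs`. It assumes the classification head turns logits into probabilities with a softmax, which is consistent with the numbers shown.

```python
import math

# Logits copied from the prediction output above.
logits = [
    -0.0333772, -0.01114595, 0.08185824, 0.00720856, -0.01808064,
    0.0209163, -0.04119281, 0.0234425, 0.00120479, 0.04529068,
    -0.02560528, 0.03243363, -0.02825472, 0.01238234, 0.00707909,
    -0.05999601, 0.05878261, 0.03128546, -0.01267068, 0.00673078,
    0.01568662, 0.02453783,
]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)

print(f"uniform baseline:    {1 / len(logits):.4f}")
print(f"highest probability: {probs[best]:.4f} at index {best}")
```

The highest probability lands on index 2, which is `Architekturbüros` in the alphabetically ordered label list and the first entry under `classes` in the output.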

In [17]:
yaml_to_dict("configs/trainer.yml")

{'batch_size': 64,
 'num_epochs': 100,
 'optimizer': {'type': 'adam', 'lr': 0.01},
 'validation_metric': '-loss',
 'patience': 2}

In [16]:
trainer = TrainerConfiguration(**yaml_to_dict("configs/trainer.yml"))
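
In this configuration, `validation_metric: '-loss'` together with `patience: 2` suggests early stopping on the validation loss: training would end once the loss has failed to improve for two consecutive epochs, possibly well before the 100-epoch limit. The sketch below illustrates that rule with made-up loss values. `epochs_until_stop` is a hypothetical function for illustration and is not how `biome.text` implements its trainer.

```python
def epochs_until_stop(val_losses, patience, max_epochs):
    """Number of epochs run before early stopping (or max_epochs) is hit."""
    best = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for loss in val_losses[:max_epochs]:
        epochs_run += 1
        if loss < best:
            # Lower loss is better, so this epoch counts as an improvement.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return epochs_run


# Made-up validation losses, one per epoch (not results from this tutorial).
losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.60]
print(epochs_until_stop(losses, patience=2, max_epochs=100))  # 5
```

With these values the loss stops improving after the third epoch, so the run ends after epoch 5 and never reaches the lower loss at epoch 6. A small `patience` trades that risk for shorter training.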