# Classification on DeepPavlov

**Task**:
Intent recognition on the SNIPS dataset (https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines), which has already been converted to `csv` format and can be downloaded from http://files.deeppavlov.ai/datasets/snips_intents/train.csv

fastText English word embeddings (~8 GB): http://files.deeppavlov.ai/deeppavlov_data/embeddings/wiki.en.bin

## Plan of the notebook:

1. [Data aggregation](#Data-aggregation)
     * [DatasetReader](#DatasetReader)
     * [DatasetIterator](#DatasetIterator)
2. [Data preprocessing](#Data-preprocessing)
     * [Lowercasing](#Lowercasing)
     * [Tokenization](#Tokenization)
     * [Vocabulary](#Vocabulary)
3. [Featurization](#Featurization)
    * [Bag-of-words embedder](#Bag-of-words)
    * [TF-IDF vectorizer](#TF-IDF-Vectorizer)
    * [fastText embedder](#fastText-embedder)
    * [fastText weighted by TF-IDF embedder](#fastText-weighted-by-TF-IDF-embedder)
    * [Mean fastText embedder](#Mean-fastText-embedder)
4. [Models](#Models)
    * [Building models in python](#Models-in-python)
        - [Sklearn component classifiers](#SklearnComponent-classifier-on-Tfidf-features-in-python)
        - [Keras classification models on fastText emb](#KerasClassificationModel-on-fastText-embeddings-in-python)
        - [Keras classification models on fastText weighted emb](#KerasClassificationModel-on-fastText-weighted-by-TF-IDF-embeddings-in-python)
    * [Building models from configs](#Models-from-configs)
        - [Sklearn component classifiers](#SklearnComponent-classifier-on-Tfidf-features-from-config)
        - [Keras classification models](#KerasClassificationModel-on-fastText-embeddings-from-config)
        - [Keras classification models on fastText weighted emb](#KerasClassificationModel-on-fastText-weighted-by-TF-IDF-embeddings-from-config)
    * [Bonus: pre-trained CNN model in DeepPavlov](#Bonus:-pre-trained-CNN-model-in-DeepPavlov)

## Data aggregation

First of all, let's download the data we will work with and take a look at it.

In [1]:
from deeppavlov.core.data.utils import simple_download
from deeppavlov.core.commands.utils import set_deeppavlov_root

# assign root for all the file paths to current directory
set_deeppavlov_root(config={"deeppavlov_root": "."})

#download train data file for SNIPS
simple_download(url="http://files.deeppavlov.ai/datasets/snips_intents/train.csv", 
                destination="./snips/train.csv")

2018-10-29 16:32:05.72 INFO in 'deeppavlov.core.data.utils'['utils'] at line 59: Downloading from http://files.deeppavlov.ai/datasets/snips_intents/train.csv to snips/train.csv
100%|██████████| 981k/981k [00:00<00:00, 22.3MB/s]


In [2]:
! head -n 15 snips/train.csv

text,intents
Add another song to the Cita Romántica playlist. ,AddToPlaylist
add clem burke in my playlist Pre-Party R&B Jams,AddToPlaylist
Add Live from Aragon Ballroom to Trapeo,AddToPlaylist
add Unite and Win to my night out,AddToPlaylist
Add track to my Digster Future Hits,AddToPlaylist
add the piano bar to my Cindy Wilson,AddToPlaylist
Add Spanish Harlem Incident to cleaning the house,AddToPlaylist
add The Greyest of Blue Skies in Indie Español my playlist,AddToPlaylist
Add the name kids in the street to the plylist New Indie Mix,AddToPlaylist
add album radar latino,AddToPlaylist
Add Tranquility to the Latin Pop Rising playlist. ,AddToPlaylist
Add d flame to the Dcode2016 playlist.,AddToPlaylist
Add album to my fairy tales,AddToPlaylist
I need another artist in the New Indie Mix playlist. ,AddToPlaylist


### DatasetReader

Read the data using `BasicClassificationDatasetReader` from DeepPavlov.

In [3]:
from deeppavlov.dataset_readers.basic_classification_reader import BasicClassificationDatasetReader



In [4]:
# read data from particular columns of `.csv` file
dr = BasicClassificationDatasetReader().read(
    data_path='./snips/',
    train='train.csv',
    x = 'text',
    y = 'intents'
)



We don't have a ready train/valid/test split.

In [5]:
# check train/valid/test sizes
[(k, len(dr[k])) for k in dr.keys()]

[('train', 15884), ('valid', 0), ('test', 0)]

### DatasetIterator

Use `BasicClassificationDatasetIterator` to split `train` into `train` and `valid` and to generate batches of samples.

In [6]:
from deeppavlov.dataset_iterators.basic_classification_iterator import BasicClassificationDatasetIterator

In [7]:
# initialize data iterator, splitting the `train` field into `train` and `valid` in proportion 8/2
train_iterator = BasicClassificationDatasetIterator(data=dr,
                                                    field_to_split='train',
                                                    split_fields=['train', 'valid'],
                                                    split_proportions=[0.8, 0.2],
                                                    split_seed=23,
                                                    seed=42)

2018-10-29 15:14:47.555 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
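
The split performed by the iterator can be pictured with a minimal sketch in plain Python. It assumes a seeded shuffle followed by slicing; the helper name `split_by_proportions` is hypothetical, and the exact shuffling used inside DeepPavlov may differ.

```python
import random

def split_by_proportions(samples, proportions, seed):
    """Shuffle `samples` with a fixed seed and cut them into consecutive parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for i, p in enumerate(proportions):
        # the last part takes whatever remains, so no sample is lost to rounding
        if i == len(proportions) - 1:
            end = len(shuffled)
        else:
            end = start + round(p * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts

# toy data in the same (text, [labels]) shape as the dataset samples
data = [("text %d" % i, ["SomeIntent"]) for i in range(10)]
train_part, valid_part = split_by_proportions(data, [0.8, 0.2], seed=23)
print(len(train_part), len(valid_part))
```

Fixing the seed makes the split reproducible, which is why the same `split_seed` gives the same `train`/`valid` partition on every run.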


Let's look at a few training samples.

In [8]:
# one can get train instances (or any other data type including `all`)
x_train, y_train = train_iterator.get_instances(data_type='train')
for x, y in list(zip(x_train, y_train))[:5]:
    print('x:', x)
    print('y:', y)
    print('=================')

x: Is it freezing in Offerman, California?
y: ['GetWeather']
x: put this song in the playlist Trap Land
y: ['AddToPlaylist']
x: show me a textbook with a rating of 2 and a maximum rating of 6 that is current
y: ['RateBook']
x: Will the weather be okay in Northern Luzon Heroes Hill National Park 4 and a half months from now?
y: ['GetWeather']
x: Rate the current album a four
y: ['RateBook']


## Data preprocessing

We will lowercase and tokenize the texts as data preparation.

### Lowercasing

`StrLower` lowercases texts.

In [9]:
from deeppavlov.models.preprocessors.str_lower import StrLower



In [10]:
str_lower = StrLower()
str_lower(['Is it freezing in Offerman, California?'])

['is it freezing in offerman, california?']

### Tokenization

`NLTKMosesTokenizer` splits strings into tokens.

In [11]:
from deeppavlov.models.tokenizers.nltk_moses_tokenizer import NLTKMosesTokenizer

In [12]:
tokenizer = NLTKMosesTokenizer()
tokenizer(['Is it freezing in Offerman, California?'])

[['Is', 'it', 'freezing', 'in', 'Offerman', ',', 'California', '?']]
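
To make the two preprocessing steps concrete without DeepPavlov installed, here is a rough sketch using only the standard library. The regular expression is an assumption that merely approximates the tokenizer above (words and single punctuation marks); the Moses tokenizer handles many more cases, such as contractions and abbreviations.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

batch = ['Is it freezing in Offerman, California?']
# lowercase first, then tokenize every sample of the batch
tokens = [simple_tokenize(text.lower()) for text in batch]
print(tokens)
```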

Let's preprocess the whole `train` part of the dataset.

In [13]:
train_x_lower_tokenized = str_lower(tokenizer(train_iterator.get_instances(data_type='train')[0]))

### Vocabulary

Now we are ready to use vocabularies. They are very useful for:
* extracting class labels and converting labels to indices and vice versa,
* building character or token vocabularies.

In [14]:
from deeppavlov.core.data.simple_vocab import SimpleVocabulary

In [15]:
# initialize a simple vocabulary to collect all classes that occur in the dataset
classes_vocab = SimpleVocabulary(
    save_path='./snips/classes.dict',
    load_path='./snips/classes.dict')

2018-10-29 15:16:12.681 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from snips/classes.dict]


In [16]:
classes_vocab.fit((train_iterator.get_instances(data_type='train')[1]))
classes_vocab.save()

2018-10-29 15:16:14.938 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 86: [saving vocabulary to snips/classes.dict]


Let's see what classes the dataset contains and which index the vocabulary assigns to each of them.

In [17]:
list(classes_vocab.items())

[('GetWeather', 0),
 ('PlayMusic', 1),
 ('SearchScreeningEvent', 2),
 ('BookRestaurant', 3),
 ('RateBook', 4),
 ('SearchCreativeWork', 5),
 ('AddToPlaylist', 6)]

In [18]:
# one can also collect a vocabulary of textual tokens that appear 2 or more times in the dataset
token_vocab = SimpleVocabulary(
    save_path='./snips/tokens.dict',
    load_path='./snips/tokens.dict',
    min_freq=2,
    special_tokens=('<PAD>', '<UNK>',),
    unk_token='<UNK>')

2018-10-29 15:17:24.959 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from snips/tokens.dict]


In [19]:
token_vocab.fit(train_x_lower_tokenized)
token_vocab.save()

2018-10-29 15:17:28.606 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 86: [saving vocabulary to snips/tokens.dict]


In [20]:
# number of tokens in dictionary
len(token_vocab)

4564

In [21]:
# 10 most common words and the number of times they appeared
token_vocab.freqs.most_common()[:10]

[('the', 6953),
 ('a', 3917),
 ('in', 3265),
 ('to', 3203),
 ('for', 2814),
 ('of', 2401),
 ('.', 2400),
 ('i', 2079),
 ('at', 1935),
 ('play', 1703)]

In [22]:
token_ids = token_vocab(str_lower(tokenizer(['Is it freezing in Offerman, California?'])))
token_ids

[[13, 36, 244, 4, 1, 29, 996, 20]]

In [23]:
tokenizer(token_vocab(token_ids))

['is it freezing in <UNK>, california?']
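
The behaviour seen above (a frequency threshold, special tokens placed first, and a fallback to `<UNK>` for unseen tokens) can be sketched as follows. `TinyVocab` is a hypothetical stand-in written for illustration, not the `SimpleVocabulary` implementation; the ordering of tokens by frequency is an assumption.

```python
from collections import Counter

class TinyVocab:
    """Hypothetical token vocabulary with special tokens and a frequency threshold."""

    def __init__(self, special_tokens=("<PAD>", "<UNK>"), unk_token="<UNK>", min_freq=1):
        self.special_tokens = list(special_tokens)
        self.unk_token = unk_token
        self.min_freq = min_freq
        self.idx2tok = []
        self.tok2idx = {}

    def fit(self, tokenized_samples):
        freqs = Counter(tok for sample in tokenized_samples for tok in sample)
        # special tokens come first, then frequent tokens by decreasing count
        self.idx2tok = self.special_tokens + [
            tok for tok, count in freqs.most_common() if count >= self.min_freq
        ]
        self.tok2idx = {tok: idx for idx, tok in enumerate(self.idx2tok)}

    def encode(self, sample):
        unk_idx = self.tok2idx[self.unk_token]
        return [self.tok2idx.get(tok, unk_idx) for tok in sample]

    def decode(self, ids):
        return [self.idx2tok[idx] for idx in ids]

vocab = TinyVocab(min_freq=2)
vocab.fit([["play", "the", "song"], ["play", "the", "album"]])
ids = vocab.encode(["play", "the", "opera"])
print(ids, vocab.decode(ids))
```

Tokens below the threshold (`song`, `album`) and unseen tokens (`opera`) all map to the `<UNK>` index, so decoding cannot restore them.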

## Featurization

### Bag-of-words

Maps each text sample to a binary vector $v$: text -> \[0, 1, 0, 0, 0, 1, ..., 1, 0, 1\].

The dimensionality of $v$ is equal to the vocabulary size.

$v_i = 1$ if word $i$ is in the text,

$v_i = 0$ otherwise.
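
As a minimal sketch of this definition, the function below builds such a binary vector from token indices. The name `bag_of_words` and the example indices are illustrative assumptions; index 1 is assumed to be the slot of an unknown-token marker.

```python
def bag_of_words(token_ids, depth):
    """Return a binary vector of length `depth` with ones at the given indices."""
    vector = [0] * depth
    for idx in token_ids:
        vector[idx] = 1
    return vector

# example indices of one tokenized sample; index 1 is assumed to be the <UNK> slot
ids = [13, 36, 244, 4, 1, 29, 996, 20]
v = bag_of_words(ids, depth=4564)
print(len(v), sum(v))
```

Because the vector is binary, repeated tokens in a text do not increase the sum: it counts distinct indices only.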

In [24]:
import numpy as np
from deeppavlov.models.embedders.bow_embedder import BoWEmbedder

In [25]:
# initialize bag-of-words embedder giving total number of tokens
bow = BoWEmbedder(depth=token_vocab.len)
# it assumes indexed tokenized samples
bow(token_vocab(str_lower(tokenizer(['Is it freezing in Offerman, California?']))))

[array([0, 1, 0, ..., 0, 0, 0], dtype=int32)]

In [26]:
# 8 distinct indices are set: 7 known tokens plus `<UNK>` (index 1) for the out-of-vocabulary `offerman`
sum(bow(token_vocab(str_lower(tokenizer(['Is it freezing in Offerman, California?']))))[0])

8

### TF-IDF Vectorizer

Maps each text sample to a vector $v$ from $R^N$, where $N$ is the vocabulary size.

$TF\text{-}IDF(token, document) = TF(token, document) * IDF(token, all\_documents)$

$TF$ is the term frequency, where $n_{token}$ is the number of occurrences of the token in the document and the sum runs over all tokens of the document:

$TF(token, document) = \frac{n_{token}}{\sum_{k}n_k}.$

$IDF$ is the inverse document frequency:

$IDF(token, all\_documents) = \frac{\text{total number of documents}}{\text{number of documents in which } token \text{ occurs}}.$
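
The formulas above can be checked on a toy corpus with a few lines of plain Python. This sketch follows the definitions exactly as written here; note that scikit-learn's `TfidfVectorizer` by default uses a smoothed logarithmic IDF and L2-normalizes each row, so its values will differ from these.

```python
from collections import Counter

def tf(token, document):
    """Share of `token` among all tokens of a tokenized document."""
    return Counter(document)[token] / len(document)

def idf(token, all_documents):
    """Number of documents divided by the number of documents containing `token`.

    Assumes the token occurs in at least one document.
    """
    containing = sum(1 for doc in all_documents if token in doc)
    return len(all_documents) / containing

def tf_idf(token, document, all_documents):
    return tf(token, document) * idf(token, all_documents)

docs = [["play", "a", "song"], ["rate", "a", "book"], ["play", "jazz"]]
print(tf_idf("a", docs[0], docs))     # token shared by two documents
print(tf_idf("song", docs[0], docs))  # token unique to one document
```

A token that appears in fewer documents receives a larger weight, which is the point of the IDF factor.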

`SklearnComponent` in DeepPavlov is a universal wrapper for any vectorizer/estimator from the `sklearn` package. The only requirement is that the model class and the name of the inference method are passed as parameters.
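
The idea of such a wrapper (a class given as a `"module:Class"` string plus the name of the method to call at inference time) can be illustrated with a small sketch. `ClassWrapper` is hypothetical and is not how `SklearnComponent` is implemented; a standard-library class is used only to keep the example free of dependencies.

```python
import importlib

class ClassWrapper:
    """Hypothetical wrapper: builds an object from 'module:Class' and calls one method on it."""

    def __init__(self, model_class, infer_method, **model_params):
        module_name, class_name = model_class.split(":")
        cls = getattr(importlib.import_module(module_name), class_name)
        # remaining keyword arguments are forwarded to the wrapped class
        self.model = cls(**model_params)
        self.infer_method = infer_method

    def __call__(self, *args, **kwargs):
        # dispatch the call to the method chosen at construction time
        return getattr(self.model, self.infer_method)(*args, **kwargs)

wrapper = ClassWrapper(model_class="collections:Counter",
                       infer_method="most_common",
                       a=3, b=1)
print(wrapper(1))
```

Resolving the class from a string is what allows the same component to be described in a JSON-like config as well as constructed in Python.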

In [27]:
from deeppavlov.models.sklearn import SklearnComponent

In [28]:
# initialize TF-IDF vectorizer sklearn component with `transform` as infer method
tfidf = SklearnComponent(
    model_class="sklearn.feature_extraction.text:TfidfVectorizer",
    infer_method="transform",
    save_path='./tfidf_v0.pkl',
    load_path='./tfidf_v0.pkl',
    mode='train')

2018-10-29 15:21:35.121 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 164: Initializing model sklearn.feature_extraction.text:TfidfVectorizer from scratch


In [29]:
# fit on textual train instances and save it
tfidf.fit(str_lower(train_iterator.get_instances(data_type='train')[0]))
tfidf.save()

2018-10-29 15:22:00.409 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 107: Fitting model sklearn.feature_extraction.text:TfidfVectorizer
2018-10-29 15:22:00.512 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 239: Saving model to tfidf_v0.pkl


In [30]:
tfidf(str_lower(['Is it freezing in Offerman, California?']))

<1x10709 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [31]:
len(tfidf.model.vocabulary_)

10709

### fastText embedder

In [32]:
from deeppavlov.models.embedders.fasttext_embedder import FasttextEmbedder

In [None]:
simple_download(url="http://files.deeppavlov.ai/deeppavlov_data/embeddings/wiki.en.bin", 
                destination="./wiki.en.bin")

In [33]:
embedder = FasttextEmbedder(load_path='./wiki.en.bin',
                            dim=300)

2018-10-29 15:28:11.569 INFO in 'deeppavlov.models.embedders.fasttext_embedder'['fasttext_embedder'] at line 52: [loading fastText embeddings from `wiki.en.bin`]


In [34]:
# output shape is (batch_size x num_tokens x embedding_dim)
embedded_batch = embedder(str_lower(tokenizer(['Is it freezing in Offerman, California?']))) 
len(embedded_batch), len(embedded_batch[0]), embedded_batch[0][0].shape

(1, 8, (300,))

### fastText weighted by TF-IDF embedder

In [35]:
from deeppavlov.models.embedders.tfidf_weighted_embedder import TfidfWeightedEmbedder

In [36]:
weighted_embedder = TfidfWeightedEmbedder(
    embedder=embedder,  # our fastText embedder instance
    tokenizer=tokenizer,  # our tokenizer instance
    mean=False,  # whether to return one vector per sample or embed every token separately
    vectorizer=tfidf  # our TF-IDF vectorizer
)

In [37]:
# output shape is (batch_size x num_tokens x embedding_dim)
embedded_batch = weighted_embedder(str_lower(tokenizer(['Is it freezing in Offerman, California?']))) 
len(embedded_batch), len(embedded_batch[0]), embedded_batch[0][0].shape

(1, 8, (300,))

### Mean fastText embedder

The embedder returns a vector per token, while we want a vector per text sample. Therefore, let's calculate the mean of the token embeddings.
For that we can either initialize `FasttextEmbedder` with the `mean=True` parameter (`mean=False` by default), or pass `mean=True` when calling it (in this case the `mean` value applies only to that call).
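
Averaging token vectors into one sample vector is simple enough to sketch in plain Python. The helper `mean_vector` and the 3-dimensional toy vectors are assumptions for illustration; real fastText vectors here have 300 dimensions.

```python
def mean_vector(token_vectors):
    """Average a list of equally sized token vectors into a single vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# two toy token embeddings of dimension 3
token_vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(mean_vector(token_vectors))
```

The result has the embedding dimension regardless of how many tokens the text contains, which is what makes it usable as a fixed-size feature vector.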

In [38]:
# output shape is (batch_size x embedding_dim)
embedded_batch = embedder(str_lower(tokenizer(['Is it freezing in Offerman, California?'])), mean=True) 
len(embedded_batch), embedded_batch[0].shape

(1, (300,))

## Models

In [39]:
from deeppavlov.metrics.accuracy import sets_accuracy

In [40]:
# get all train and valid data from iterator
x_train, y_train = train_iterator.get_instances(data_type="train")
x_valid, y_valid = train_iterator.get_instances(data_type="valid")

### Models in python

#### SklearnComponent classifier on Tfidf-features in python

In [41]:
# initialize sklearn classifier; any parameters of the classifier can be passed as keyword arguments
cls = SklearnComponent(
    model_class="sklearn.linear_model:LogisticRegression",
    infer_method="predict",
    save_path='./logreg_v0.pkl',
    load_path='./logreg_v0.pkl',
    C=1,
    mode='train')

2018-10-29 15:31:41.382 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 164: Initializing model sklearn.linear_model:LogisticRegression from scratch


In [42]:
# fit sklearn classifier and save it
cls.fit(tfidf(x_train), y_train)
cls.save()

2018-10-29 15:31:42.706 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 107: Fitting model sklearn.linear_model:LogisticRegression
2018-10-29 15:31:42.897 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 239: Saving model to logreg_v0.pkl


In [43]:
y_valid_pred = cls(tfidf(x_valid))

In [44]:
# let's calculate sets accuracy (because each element is a list of labels)
sets_accuracy(np.squeeze(y_valid), y_valid_pred)

0.982373308152345
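
Since every target is a list of labels, accuracy is computed over label sets: a sample counts as correct only when the predicted set equals the true set. A minimal sketch of that metric, with the hypothetical name `set_accuracy` and toy data:

```python
def as_label_set(labels):
    """Treat a bare string as a single label, anything else as a collection of labels."""
    return {labels} if isinstance(labels, str) else set(labels)

def set_accuracy(y_true, y_pred):
    """Share of samples whose predicted label set equals the true label set."""
    hits = sum(as_label_set(t) == as_label_set(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

y_true = [["GetWeather"], ["RateBook"], ["PlayMusic"], ["AddToPlaylist"]]
y_pred = [["GetWeather"], "RateBook", ["PlayMusic"], ["GetWeather"]]
print(set_accuracy(y_true, y_pred))
```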

#### KerasClassificationModel on fastText embeddings in python

In [45]:
from deeppavlov.models.classifiers.keras_classification_model import KerasClassificationModel
from deeppavlov.models.preprocessors.one_hotter import OneHotter
from deeppavlov.models.classifiers.proba2labels import Proba2Labels

Using TensorFlow backend.


In [46]:
# Initialize `KerasClassificationModel` that builds a shallow-and-wide CNN
# (named here `cnn_model`)
cls = KerasClassificationModel(save_path="./cnn_model_v0", 
                               load_path="./cnn_model_v0", 
                               embedding_size=embedder.dim,
                               n_classes=classes_vocab.len,
                               model_name="cnn_model",
                               text_size=15, # number of tokens
                               kernel_sizes_cnn=[3, 5, 7],
                               filters_cnn=128,
                               dense_size=100,
                               optimizer="Adam",
                               learning_rate=0.1,
                               learning_rate_decay=0.01,
                               loss="categorical_crossentropy")

2018-10-29 15:32:03.788 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 242: [initializing `KerasClassificationModel` from scratch as cnn_model]
2018-10-29 15:32:04.160 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 134: Model was successfully initialized!
Model summary:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 15, 300)      0                                            
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 15, 128)      115328      input_1[0][0]                    
__________________________________________________________________________________________________
conv1d_2 (Conv1D)  

In [47]:
# `KerasClassificationModel` expects a one-hot encoded distribution of classes per sample.
# `OneHotter` converts indices to one-hot vector representations.
# To obtain indices we can use our `classes_vocab`, initialized and fitted above.
onehotter = OneHotter(depth=classes_vocab.len)
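
For reference, the conversion from class indices to one-hot vectors can be sketched as below. The function `one_hot` is an illustrative assumption, written for lists of indices per sample, as produced when each sample carries a list of labels.

```python
def one_hot(batch_indices, depth):
    """Turn each list of class indices into a 0/1 vector of length `depth`."""
    vectors = []
    for indices in batch_indices:
        vector = [0] * depth
        for idx in indices:
            vector[idx] = 1
        vectors.append(vector)
    return vectors

# two samples: class 0 and class 3, with 7 classes in total
print(one_hot([[0], [3]], depth=7))
```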

In [48]:
# Train for 10 epochs
for ep in range(10):
    for x, y in train_iterator.gen_batches(batch_size=64, 
                                           data_type="train"):
        x_embed = embedder(tokenizer(str_lower(x)))
        y_onehot = onehotter(classes_vocab(y))
        cls.train_on_batch(x_embed, y_onehot)

In [49]:
# Save model weights and parameters
cls.save()

2018-10-29 15:32:28.509 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v0_opt.json]


In [50]:
# Inferring on validation data, we get a probability distribution over classes for each sample.
y_valid_pred = cls(embedder(tokenizer(str_lower(x_valid))))
y_valid_pred[0]

[0.0001224217121489346,
 5.4968368203844875e-05,
 0.00048470927868038416,
 0.9979118704795837,
 0.00019703485304489732,
 0.0011488927993923426,
 0.0005156921688467264]

In [51]:
# To convert a probability distribution to labels,
# we first need to convert probabilities to indices,
# and then use the `classes_vocab` vocabulary to convert indices to labels.
#
# `Proba2Labels` converts probabilities to indices and supports three different modes:
# if `max_proba` is true, it returns indices of the highest probabilities
# if `confident_threshold` is given, it returns indices with probabilities higher than the threshold
# if `top_n` is given, it returns `top_n` indices with the highest probabilities
prob2labels = Proba2Labels(max_proba=True)
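
The three modes described in the comments can be sketched for a single probability vector as follows. `proba_to_indices` is a hypothetical helper; the precedence between the modes and the list-shaped return value are assumptions of this sketch, not a statement about `Proba2Labels` internals.

```python
def proba_to_indices(probas, max_proba=False, confident_threshold=None, top_n=None):
    """Pick class indices from one probability vector using one of three modes."""
    if confident_threshold is not None:
        # every class whose probability exceeds the threshold
        return [i for i, p in enumerate(probas) if p > confident_threshold]
    # class indices ordered from most to least probable
    ranked = sorted(range(len(probas)), key=lambda i: probas[i], reverse=True)
    if max_proba:
        return ranked[:1]
    if top_n is not None:
        return ranked[:top_n]
    raise ValueError("choose one of max_proba, confident_threshold, top_n")

probas = [0.0001, 0.00005, 0.0005, 0.9979, 0.0002, 0.0011, 0.0005]
print(proba_to_indices(probas, max_proba=True))
print(proba_to_indices(probas, top_n=2))
print(proba_to_indices(probas, confident_threshold=0.001))
```

For single-label intent recognition, the `max_proba` mode is the natural choice; a threshold is more useful when a sample may belong to several classes or to none.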

In [52]:
# calculate sets accuracy
sets_accuracy(y_valid, classes_vocab(prob2labels(y_valid_pred)))

0.9889833175952156

#### KerasClassificationModel on fastText weighted by TF-IDF embeddings in python

In [60]:
# Initialize `KerasClassificationModel` that builds a shallow-and-wide CNN
# (named here `cnn_model`)
cls = KerasClassificationModel(save_path="./cnn_model_v1", 
                               load_path="./cnn_model_v1", 
                               embedding_size=weighted_embedder.dim,
                               n_classes=classes_vocab.len,
                               model_name="cnn_model",
                               text_size=15, # number of tokens
                               kernel_sizes_cnn=[3, 5, 7],
                               filters_cnn=128,
                               dense_size=100,
                               optimizer="Adam",
                               learning_rate=0.1,
                               learning_rate_decay=0.01,
                               loss="categorical_crossentropy")

2018-10-29 15:43:10.787 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 242: [initializing `KerasClassificationModel` from scratch as cnn_model]
2018-10-29 15:43:11.134 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 134: Model was successfully initialized!
Model summary:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 15, 300)      0                                            
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 15, 128)      115328      input_1[0][0]                    
__________________________________________________________________________________________________
conv1d_2 (Conv1D)  

In [61]:
# `KerasClassificationModel` expects a one-hot encoded distribution of classes per sample.
# `OneHotter` converts indices to one-hot vector representations.
# To obtain indices we can use our `classes_vocab`, initialized and fitted above.
onehotter = OneHotter(depth=classes_vocab.len)

In [62]:
# Train for 10 epochs
for ep in range(10):
    for x, y in train_iterator.gen_batches(batch_size=64, 
                                           data_type="train"):
        x_embed = weighted_embedder(tokenizer(str_lower(x)))
        y_onehot = onehotter(classes_vocab(y))
        cls.train_on_batch(x_embed, y_onehot)

In [63]:
cls.save()

2018-10-29 15:47:03.867 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v1_opt.json]


In [64]:
# Inferring on validation data, we get a probability distribution over classes for each sample.
y_valid_pred = cls(weighted_embedder(tokenizer(str_lower(x_valid))))
y_valid_pred[0]

[8.803312084637582e-05,
 0.00012861542927566916,
 0.0001089772122213617,
 0.9999517202377319,
 0.00046508893137797713,
 0.00012267997954040766,
 0.00035249924985691905]

In [65]:
# To convert a probability distribution to labels,
# we first need to convert probabilities to indices,
# and then use the `classes_vocab` vocabulary to convert indices to labels.
#
# `Proba2Labels` converts probabilities to indices and supports three different modes:
# if `max_proba` is true, it returns indices of the highest probabilities
# if `confident_threshold` is given, it returns indices with probabilities higher than the threshold
# if `top_n` is given, it returns `top_n` indices with the highest probabilities
prob2labels = Proba2Labels(max_proba=True)

In [66]:
# calculate sets accuracy
sets_accuracy(y_valid, classes_vocab(prob2labels(y_valid_pred)))

0.9798552093169657

### Models from configs

In [2]:
from deeppavlov.core.commands.infer import build_model_from_config
from deeppavlov.core.commands.train import train_evaluate_model_from_config, _test_model

#### SklearnComponent classifier on Tfidf-features from config

In [68]:
logreg_config = {"deeppavlov_root": ".",
  "dataset_reader": {
    "name": "basic_classification_reader",
    "x": "text",
    "y": "intents",
    "data_path": "./snips"
  },
  "dataset_iterator": {
    "name": "basic_classification_iterator",
    "seed": 42,
    "split_seed": 23,
    "field_to_split": "train",
    "split_fields": [
      "train",
      "valid"
    ],
    "split_proportions": [
      0.9,
      0.1
    ]
  },
  "chainer": {
    "in": [
      "x"
    ],
    "in_y": [
      "y"
    ],
    "pipe": [
      {
        "id": "classes_vocab",
        "name": "simple_vocab",
        "fit_on": [
          "y"
        ],
        "save_path": "./snips/classes.dict",
        "load_path": "./snips/classes.dict",
        "in": "y",
        "out": "y_ids"
      },
      {
        "in": [
          "x"
        ],
        "out": [
          "x_vec"
        ],
        "fit_on": [
          "x",
          "y_ids"
        ],
        "id": "tfidf_vec",
        "name": "sklearn_component",
        "save_path": "tfidf_v1.pkl",
        "load_path": "tfidf_v1.pkl",
        "model_class": "sklearn.feature_extraction.text:TfidfVectorizer",
        "infer_method": "transform"
      },
      {
        "in": "x",
        "out": "x_tok",
        "id": "my_tokenizer",
        "name": "nltk_moses_tokenizer",
        "tokenizer": "wordpunct_tokenize"
      },
      {
        "in": [
          "x_vec"
        ],
        "out": [
          "y_pred"
        ],
        "fit_on": [
          "x_vec",
          "y"
        ],
        "name": "sklearn_component",
        "main": True,
        "save_path": "logreg_v1.pkl",
        "load_path": "logreg_v1.pkl",
        "model_class": "sklearn.linear_model:LogisticRegression",
        "infer_method": "predict",
        "ensure_list_output": True
      }
    ],
    "out": [
      "y_pred"
    ]
  },
  "train": {
    "batch_size": 64,
    "metrics": [
      "accuracy"
    ],
    "validate_best": True,
    "test_best": False
  }
}
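
In the config, components are wired together through the variable names listed under `in` and `out`: each step reads named values and writes new ones for later steps. A minimal sketch of that idea, assuming a plain sequential runner (the `run_pipe` function and its `fn` key are hypothetical; DeepPavlov's chainer additionally handles fitting via `fit_on`, saving and loading, which are not modelled here):

```python
def run_pipe(pipe, inputs):
    """Run steps in order; each step reads named variables and writes new ones."""
    memory = dict(inputs)
    for step in pipe:
        args = [memory[name] for name in step["in"]]
        results = step["fn"](*args)
        if len(step["out"]) == 1:
            # a single output is stored as is, not unpacked
            results = (results,)
        memory.update(zip(step["out"], results))
    return memory

pipe = [
    {"in": ["x"], "out": ["x_lower"], "fn": lambda xs: [s.lower() for s in xs]},
    {"in": ["x_lower"], "out": ["x_tok"], "fn": lambda xs: [s.split() for s in xs]},
    {"in": ["x_tok"], "out": ["n_tokens"], "fn": lambda xs: [len(t) for t in xs]},
]
memory = run_pipe(pipe, {"x": ["Play some Jazz"]})
print(memory["x_tok"], memory["n_tokens"])
```

Because steps communicate only through names, a step whose output nobody reads (such as a tokenizer feeding no later component) is harmless but also has no effect on the prediction.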


In [69]:
# we can train and evaluate model from config
m = train_evaluate_model_from_config(logreg_config)

2018-10-29 16:03:26.562 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
2018-10-29 16:03:26.565 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from snips/classes.dict]
2018-10-29 16:03:26.572 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 86: [saving vocabulary to snips/classes.dict]
2018-10-29 16:03:26.573 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 164: Initializing model sklearn.feature_extraction.text:TfidfVectorizer from scratch
2018-10-29 16:03:26.599 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 107: Fitting model sklearn.feature_extraction.text:TfidfVectorizer
2018-10-29 16:03:26.719 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 239: Saving model to tfidf_v1.pkl
2018-10-29 16

{"valid": {"eval_examples_count": 1589, "metrics": {"accuracy": 0.983}, "time_spent": "0:00:01"}}


In [70]:
# or we can just load the trained model (the same one we trained above)
m = build_model_from_config(logreg_config)

2018-10-29 16:03:27.689 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from snips/classes.dict]
2018-10-29 16:03:27.689 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.feature_extraction.text:TfidfVectorizer from tfidf_v1.pkl
2018-10-29 16:03:27.694 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 208: Model sklearn.feature_extraction.textTfidfVectorizer loaded  with parameters
2018-10-29 16:03:27.696 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.linear_model:LogisticRegression from logreg_v1.pkl
2018-10-29 16:03:27.697 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 208: Model sklearn.linear_model.logisticLogisticRegression loaded  with parameters


In [71]:
m(["Is it freezing in Offerman, California?"])

[['GetWeather']]

#### KerasClassificationModel on fastText embeddings from config

In [72]:
cnn_config = {"deeppavlov_root": ".",
  "dataset_reader": {
    "name": "basic_classification_reader",
    "x": "text",
    "y": "intents",
    "data_path": "snips"
  },
  "dataset_iterator": {
    "name": "basic_classification_iterator",
    "seed": 42,
    "split_seed": 23,
    "field_to_split": "train",
    "split_fields": [
      "train",
      "valid"
    ],
    "split_proportions": [
      0.9,
      0.1
    ]
  },
  "chainer": {
    "in": [
      "x"
    ],
    "in_y": [
      "y"
    ],
    "pipe": [
      {
        "id": "classes_vocab",
        "name": "simple_vocab",
        "fit_on": [
          "y"
        ],
        "level": "token",
        "save_path": "./snips/classes.dict",
        "load_path": "./snips/classes.dict",
        "in": "y",
        "out": "y_ids"
      },
      {
        "in": "x",
        "out": "x_tok",
        "id": "my_tokenizer",
        "name": "nltk_tokenizer",
        "tokenizer": "wordpunct_tokenize"
      },
      {
        "in": "x_tok",
        "out": "x_emb",
        "id": "my_embedder",
        "name": "fasttext",
        "load_path": "./wiki.en.bin"
      },
      {
        "in": "y_ids",
        "out": "y_onehot",
        "name": "one_hotter",
        "depth": "#classes_vocab.len"
      },
      {
        "in": [
          "x_emb"
        ],
        "in_y": [
          "y_onehot"
        ],
        "out": [
          "y_pred_probas"
        ],
        "main": True,
        "name": "keras_classification_model",
        "save_path": "./cnn_model_v2",
        "load_path": "./cnn_model_v2",
        "embedding_size": "#my_embedder.dim",
        "n_classes": "#classes_vocab.len",
        "kernel_sizes_cnn": [
          1,
          2,
          3
        ],
        "filters_cnn": 256,
        "optimizer": "Adam",
        "learning_rate": 0.01,
        "learning_rate_decay": 0.1,
        "loss": "categorical_crossentropy",
        "text_size": 15,
        "coef_reg_cnn": 1e-4,
        "coef_reg_den": 1e-4,
        "dropout_rate": 0.5,
        "dense_size": 100,
        "model_name": "cnn_model"
      },
      {
        "in": "y_pred_probas",
        "out": "y_pred_ids",
        "name": "proba2labels",
        "max_proba": True
      },
      {
        "in": "y_pred_ids",
        "out": "y_pred_labels",
        "ref": "classes_vocab"
      }
    ],
    "out": [
      "y_pred_labels"
    ]
  },
  "train": {
    "epochs": 10,
    "batch_size": 64,
    "metrics": [
      "sets_accuracy",
      "f1_macro",
      {
        "name": "roc_auc",
        "inputs": ["y_onehot", "y_pred_probas"]
      }
    ],
    "validation_patience": 5,
    "val_every_n_epochs": 1,
    "log_every_n_epochs": 1,
    "show_examples": True,
    "validate_best": True,
    "test_best": False
  }
}
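
The `chainer` section of the config above wires components together by name: each step reads the variables listed under `in` and writes its results under the names listed in `out`. The following is a minimal sketch of that data flow in plain Python. The `run_pipe` helper and the toy `fn` callables are hypothetical and only illustrate the idea; they are not DeepPavlov's actual implementation.

```python
def as_list(names):
    """Accept either a single name or a list of names, as the config does."""
    return names if isinstance(names, list) else [names]


def run_pipe(pipe, inputs):
    """Run toy steps in order, passing values through a shared namespace."""
    memory = dict(inputs)
    for step in pipe:
        args = [memory[name] for name in as_list(step["in"])]
        result = step["fn"](*args)
        out_names = as_list(step["out"])
        if len(out_names) == 1:
            memory[out_names[0]] = result
        else:
            memory.update(zip(out_names, result))
    return memory


# Hypothetical stand-ins for real components (a tokenizer and a length counter).
toy_pipe = [
    {"in": "x", "out": "x_tok",
     "fn": lambda xs: [x.lower().split() for x in xs]},
    {"in": ["x_tok"], "out": ["x_len"],
     "fn": lambda toks: [len(t) for t in toks]},
]

state = run_pipe(toy_pipe, {"x": ["Is it freezing in Offerman, California?"]})
print(state["x_tok"])
print(state["x_len"])
```

Because every step only names its inputs and outputs, a component such as the tokenizer can be swapped without touching the steps that consume `x_tok`.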


In [73]:
# we can train and evaluate the model from the config
m = train_evaluate_model_from_config(cnn_config)

2018-10-29 16:09:09.865 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
2018-10-29 16:09:09.871 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from snips/classes.dict]
2018-10-29 16:09:09.879 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 86: [saving vocabulary to snips/classes.dict]
2018-10-29 16:09:09.880 INFO in 'deeppavlov.models.embedders.fasttext_embedder'['fasttext_embedder'] at line 52: [loading fastText embeddings from `wiki.en.bin`]
2018-10-29 16:09:52.603 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 242: [initializing `KerasClassificationModel` from scratch as cnn_model]
2018-10-29 16:09:53.31 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 134: Model was successfully 

{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.134, "f1_macro": 0.0501, "roc_auc": 0.5041}, "time_spent": "0:00:01", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 0, "batches_seen": 0, "train_examples_seen": 0, "impatience": 0, "patience_limit": 5}}


2018-10-29 16:09:57.924 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9465
2018-10-29 16:09:57.925 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-10-29 16:09:57.925 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v2_opt.json]


{"train": {"epochs_done": 1, "batches_seen": 224, "train_examples_seen": 14295, "metrics": {"sets_accuracy": 0.9, "f1_macro": 0.8994, "roc_auc": 0.9846}, "time_spent": "0:00:05", "examples": [{"x": "Add lisa m to my guitar hero live playlist", "y_predicted": "y_pred_labels", "y_true": ["AddToPlaylist"]}], "loss": 1.4812453889421053}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9465, "f1_macro": 0.9459, "roc_auc": 0.9959}, "time_spent": "0:00:05", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 1, "batches_seen": 224, "train_examples_seen": 14295, "impatience": 0, "patience_limit": 5}}


2018-10-29 16:10:00.528 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9578
2018-10-29 16:10:00.528 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-10-29 16:10:00.529 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v2_opt.json]


{"train": {"epochs_done": 2, "batches_seen": 448, "train_examples_seen": 28590, "metrics": {"sets_accuracy": 0.9582, "f1_macro": 0.9583, "roc_auc": 0.9974}, "time_spent": "0:00:08", "examples": [{"x": "I give The Monkey and the Tiger a rating of 2 points.", "y_predicted": "y_pred_labels", "y_true": ["RateBook"]}], "loss": 1.337195518293551}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9578, "f1_macro": 0.9572, "roc_auc": 0.9972}, "time_spent": "0:00:08", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 2, "batches_seen": 448, "train_examples_seen": 28590, "impatience": 0, "patience_limit": 5}}


2018-10-29 16:10:03.231 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9635
2018-10-29 16:10:03.232 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-10-29 16:10:03.232 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v2_opt.json]


{"train": {"epochs_done": 3, "batches_seen": 672, "train_examples_seen": 42885, "metrics": {"sets_accuracy": 0.9668, "f1_macro": 0.9669, "roc_auc": 0.9982}, "time_spent": "0:00:11", "examples": [{"x": "play Iheart tunes by Neil Finn", "y_predicted": "y_pred_labels", "y_true": ["PlayMusic"]}], "loss": 1.2842914696250642}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9635, "f1_macro": 0.9626, "roc_auc": 0.9977}, "time_spent": "0:00:11", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 3, "batches_seen": 672, "train_examples_seen": 42885, "impatience": 0, "patience_limit": 5}}


2018-10-29 16:10:05.811 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9679
2018-10-29 16:10:05.811 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-10-29 16:10:05.812 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v2_opt.json]


{"train": {"epochs_done": 4, "batches_seen": 896, "train_examples_seen": 57180, "metrics": {"sets_accuracy": 0.9717, "f1_macro": 0.9717, "roc_auc": 0.9986}, "time_spent": "0:00:13", "examples": [{"x": "Please play a song off the Curtis Lee album Rough Diamonds", "y_predicted": "y_pred_labels", "y_true": ["PlayMusic"]}], "loss": 1.2475687728396483}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9679, "f1_macro": 0.9671, "roc_auc": 0.9979}, "time_spent": "0:00:13", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 4, "batches_seen": 896, "train_examples_seen": 57180, "impatience": 0, "patience_limit": 5}}


2018-10-29 16:10:08.438 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9692
2018-10-29 16:10:08.439 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-10-29 16:10:08.439 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v2_opt.json]


{"train": {"epochs_done": 5, "batches_seen": 1120, "train_examples_seen": 71475, "metrics": {"sets_accuracy": 0.975, "f1_macro": 0.9751, "roc_auc": 0.9988}, "time_spent": "0:00:16", "examples": [{"x": "Give me Slovakia's weather forecast for eight am", "y_predicted": "y_pred_labels", "y_true": ["GetWeather"]}], "loss": 1.2212532593735628}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9692, "f1_macro": 0.9683, "roc_auc": 0.9981}, "time_spent": "0:00:16", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 5, "batches_seen": 1120, "train_examples_seen": 71475, "impatience": 0, "patience_limit": 5}}


2018-10-29 16:10:11.19 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9704
2018-10-29 16:10:11.20 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-10-29 16:10:11.20 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v2_opt.json]


{"train": {"epochs_done": 6, "batches_seen": 1344, "train_examples_seen": 85770, "metrics": {"sets_accuracy": 0.9758, "f1_macro": 0.9758, "roc_auc": 0.9989}, "time_spent": "0:00:18", "examples": [{"x": "rate this current textbook 0 points", "y_predicted": "y_pred_labels", "y_true": ["RateBook"]}], "loss": 1.2003759257495403}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9704, "f1_macro": 0.9695, "roc_auc": 0.9982}, "time_spent": "0:00:18", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 6, "batches_seen": 1344, "train_examples_seen": 85770, "impatience": 0, "patience_limit": 5}}


2018-10-29 16:10:13.603 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9711
2018-10-29 16:10:13.603 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-10-29 16:10:13.603 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v2_opt.json]


{"train": {"epochs_done": 7, "batches_seen": 1568, "train_examples_seen": 100065, "metrics": {"sets_accuracy": 0.9771, "f1_macro": 0.9771, "roc_auc": 0.9991}, "time_spent": "0:00:21", "examples": [{"x": "I need a bar for four that serves argentinian in D'Iberville, WY for twelve PM", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "loss": 1.1837445806179727}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9711, "f1_macro": 0.9702, "roc_auc": 0.9984}, "time_spent": "0:00:21", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 7, "batches_seen": 1568, "train_examples_seen": 100065, "impatience": 0, "patience_limit": 5}}


2018-10-29 16:10:16.180 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9729
2018-10-29 16:10:16.180 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-10-29 16:10:16.181 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v2_opt.json]


{"train": {"epochs_done": 8, "batches_seen": 1792, "train_examples_seen": 114360, "metrics": {"sets_accuracy": 0.9778, "f1_macro": 0.9778, "roc_auc": 0.9991}, "time_spent": "0:00:24", "examples": [{"x": "play The Sea Cabinet", "y_predicted": "y_pred_labels", "y_true": ["SearchCreativeWork"]}], "loss": 1.1712328317974294}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9729, "f1_macro": 0.9721, "roc_auc": 0.9984}, "time_spent": "0:00:24", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 8, "batches_seen": 1792, "train_examples_seen": 114360, "impatience": 0, "patience_limit": 5}}


2018-10-29 16:10:18.838 INFO in 'deeppavlov.core.commands.train'['train'] at line 525: Did not improve on the sets_accuracy of 0.9729


{"train": {"epochs_done": 9, "batches_seen": 2016, "train_examples_seen": 128655, "metrics": {"sets_accuracy": 0.9789, "f1_macro": 0.9789, "roc_auc": 0.9992}, "time_spent": "0:00:26", "examples": [{"x": "add Ik Tara to laundry playlst", "y_predicted": "y_pred_labels", "y_true": ["AddToPlaylist"]}], "loss": 1.158443774495806}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9717, "f1_macro": 0.9709, "roc_auc": 0.9985}, "time_spent": "0:00:26", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 9, "batches_seen": 2016, "train_examples_seen": 128655, "impatience": 1, "patience_limit": 5}}


2018-10-29 16:10:21.388 INFO in 'deeppavlov.core.commands.train'['train'] at line 525: Did not improve on the sets_accuracy of 0.9729
2018-10-29 16:10:21.726 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from snips/classes.dict]


{"train": {"epochs_done": 10, "batches_seen": 2240, "train_examples_seen": 142950, "metrics": {"sets_accuracy": 0.9794, "f1_macro": 0.9795, "roc_auc": 0.9993}, "time_spent": "0:00:29", "examples": [{"x": "Is it going to be hot in Karthaus at 7 AM?", "y_predicted": "y_pred_labels", "y_true": ["GetWeather"]}], "loss": 1.1475949878139156}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9729, "f1_macro": 0.9721, "roc_auc": 0.9986}, "time_spent": "0:00:29", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 10, "batches_seen": 2240, "train_examples_seen": 142950, "impatience": 2, "patience_limit": 5}}


2018-10-29 16:10:21.740 INFO in 'deeppavlov.models.embedders.fasttext_embedder'['fasttext_embedder'] at line 52: [loading fastText embeddings from `wiki.en.bin`]
2018-10-29 16:11:22.285 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 272: [initializing `KerasClassificationModel` from saved]
2018-10-29 16:11:22.621 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 282: [loading weights from cnn_model_v2.h5]
2018-10-29 16:11:22.848 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 134: Model was successfully initialized!
Model summary:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 15, 300)      0                                         

{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9729, "f1_macro": 0.9721, "roc_auc": 0.9984}, "time_spent": "0:00:01", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}]}}
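
The `impatience` and `patience_limit` fields in the log above reflect the `validation_patience` setting: the counter grows whenever the validation metric fails to improve and resets on a new best. Below is a minimal sketch of that bookkeeping, assuming a strict "greater than" comparison and a stop once the counter reaches the limit. The `track_impatience` function is hypothetical and not taken from DeepPavlov; the scores are the validation `sets_accuracy` values from the log.

```python
def track_impatience(scores, patience_limit):
    """Return the best score and the impatience counter after each validation."""
    best = float("-inf")
    impatience = 0
    history = []
    for score in scores:
        if score > best:       # a strict improvement resets the counter
            best = score
            impatience = 0
        else:
            impatience += 1
        history.append(impatience)
        if impatience >= patience_limit:
            break              # assumed early-stopping condition
    return best, history


# Validation sets_accuracy copied from the log above (epochs 0 to 10).
valid_accuracy = [0.134, 0.9465, 0.9578, 0.9635, 0.9679, 0.9692,
                  0.9704, 0.9711, 0.9729, 0.9717, 0.9729]

best, history = track_impatience(valid_accuracy, patience_limit=5)
print(best)
print(history)
```

Under these assumptions the counter ends at 2, matching the last `valid` record, so training here finishes because `epochs` is reached rather than because patience ran out.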


In [74]:
# or we can just load the pre-trained model (coincides with what we did above)
m = build_model_from_config(logreg_config)

2018-10-29 16:11:23.410 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from snips/classes.dict]
2018-10-29 16:11:23.424 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.feature_extraction.text:TfidfVectorizer from tfidf_v1.pkl
2018-10-29 16:11:23.440 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 208: Model sklearn.feature_extraction.textTfidfVectorizer loaded  with parameters
2018-10-29 16:11:23.444 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.linear_model:LogisticRegression from logreg_v1.pkl
2018-10-29 16:11:23.455 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 208: Model sklearn.linear_model.logisticLogisticRegression loaded  with parameters


In [75]:
m(["Is it freezing in Offerman, California?"])

[['GetWeather']]
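
The last steps of the pipeline turn class probabilities into label strings: `one_hotter` encodes class ids for training, `proba2labels` with `max_proba` picks the most probable class id, and the `classes_vocab` reference maps that id back to a label. A minimal sketch of these steps in plain Python follows. The helper functions, the three-class list and its order, and the probabilities are made up for illustration and are not DeepPavlov code.

```python
def one_hot(ids, depth):
    """Encode each class id as a vector of length `depth` with a single 1.0."""
    return [[1.0 if i == idx else 0.0 for i in range(depth)] for idx in ids]


def probas_to_ids(probas_batch):
    """Pick the index of the highest probability for each sample."""
    return [max(range(len(p)), key=p.__getitem__) for p in probas_batch]


# Hypothetical subset of the intent classes, in an assumed order.
classes = ["AddToPlaylist", "BookRestaurant", "GetWeather"]

# Made-up model outputs for two utterances.
probas = [[0.1, 0.2, 0.7], [0.8, 0.15, 0.05]]

ids = probas_to_ids(probas)
labels = [[classes[i]] for i in ids]
print(ids)
print(labels)
print(one_hot(ids, depth=len(classes)))
```

Each prediction is wrapped in its own list, which mirrors the nested output shape `[['GetWeather']]` returned by the model above.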

#### KerasClassificationModel on fastText weighted by TF-IDF embeddings from config

In [76]:
cnn_config = {
  "deeppavlov_root": ".",
  "dataset_reader": {
    "name": "basic_classification_reader",
    "x": "text",
    "y": "intents",
    "data_path": "snips"
  },
  "dataset_iterator": {
    "name": "basic_classification_iterator",
    "seed": 42,
    "split_seed": 23,
    "field_to_split": "train",
    "split_fields": [
      "train",
      "valid"
    ],
    "split_proportions": [
      0.9,
      0.1
    ]
  },
  "chainer": {
    "in": [
      "x"
    ],
    "in_y": [
      "y"
    ],
    "pipe": [
      {
        "id": "classes_vocab",
        "name": "simple_vocab",
        "fit_on": [
          "y"
        ],
        "save_path": "./snips/classes.dict",
        "load_path": "./snips/classes.dict",
        "in": "y",
        "out": "y_ids"
      },
      {
        "in": [
          "x"
        ],
        "out": [
          "x_vec"
        ],
        "fit_on": [
          "x",
          "y_ids"
        ],
        "id": "my_tfidf_vectorizer",
        "name": "sklearn_component",
        "save_path": "tfidf_v2.pkl",
        "load_path": "tfidf_v2.pkl",
        "model_class": "sklearn.feature_extraction.text:TfidfVectorizer",
        "infer_method": "transform"
      },
      {
        "in": "x",
        "out": "x_tok",
        "id": "my_tokenizer",
        "name": "nltk_moses_tokenizer"
      },
      {
        "in": "x_tok",
        "out": "x_emb",
        "id": "my_embedder",
        "name": "fasttext",
        "save_path": "wiki.en.bin",
        "load_path": "wiki.en.bin",
        "dim": 300
      },
      {
        "name": "one_hotter",
        "id": "my_onehotter",
        "depth": "#classes_vocab.len",
        "in": "y_ids",
        "out": "y_onehot"
      },
      {
        "in": "x_tok",
        "out": "x_weighted_emb",
        "name": "tfidf_weighted",
        "id": "my_weighted_embedder",
        "embedder": "#my_embedder",
        "tokenizer": "#my_tokenizer",
        "vectorizer": "#my_tfidf_vectorizer"
      },
      {
        "in": [
          "x_weighted_emb"
        ],
        "in_y": [
          "y_onehot"
        ],
        "out": [
          "y_pred_probas"
        ],
        "main": True,
        "name": "keras_classification_model",
        "save_path": "./cnn_model_v3",
        "load_path": "./cnn_model_v3",
        "embedding_size": "#my_embedder.dim",
        "n_classes": "#classes_vocab.len",
        "kernel_sizes_cnn": [
          1,
          2,
          3
        ],
        "filters_cnn": 256,
        "optimizer": "Adam",
        "learning_rate": 0.01,
        "learning_rate_decay": 0.1,
        "loss": "categorical_crossentropy",
        "text_size": 15,
        "coef_reg_cnn": 1e-4,
        "coef_reg_den": 1e-4,
        "dropout_rate": 0.5,
        "dense_size": 100,
        "model_name": "cnn_model"
      },
      {
        "in": "y_pred_probas",
        "out": "y_pred_ids",
        "name": "proba2labels",
        "max_proba": True
      },
      {
        "in": "y_pred_ids",
        "out": "y_pred_labels",
        "ref": "classes_vocab"
      }
    ],
    "out": [
      "y_pred_labels"
    ]
  },
  "train": {
    "epochs": 10,
    "batch_size": 64,
    "metrics": [
      "sets_accuracy",
      "f1_macro",
      {
        "name": "roc_auc",
        "inputs": ["y_onehot", "y_pred_probas"]
      }
    ],
    "show_examples": False,
    "validate_best": True,
    "test_best": False
  }
}


In [77]:
# we can train and evaluate the model from the config
m = train_evaluate_model_from_config(cnn_config)

2018-10-29 16:21:51.996 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
2018-10-29 16:21:52.1 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from snips/classes.dict]
2018-10-29 16:21:52.7 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 86: [saving vocabulary to snips/classes.dict]
2018-10-29 16:21:52.38 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 164: Initializing model sklearn.feature_extraction.text:TfidfVectorizer from scratch
2018-10-29 16:21:52.65 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 107: Fitting model sklearn.feature_extraction.text:TfidfVectorizer
2018-10-29 16:21:52.177 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 239: Saving model to tfidf_v2.pkl
2018-10-29 16:21:52

2018-10-29 16:27:42.532 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 272: [initializing `KerasClassificationModel` from saved]
2018-10-29 16:27:42.948 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 282: [loading weights from cnn_model_v3.h5]
2018-10-29 16:27:43.134 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 134: Model was successfully initialized!
Model summary:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 15, 300)      0                                            
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 15, 256)      77056 

{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.961, "f1_macro": 0.9607, "roc_auc": 0.9974}, "time_spent": "0:00:03"}}


In [78]:
# or we can just load the pre-trained model (coincides with what we did above)
m = build_model_from_config(cnn_config)

2018-10-29 16:27:46.76 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from snips/classes.dict]
2018-10-29 16:27:46.88 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.feature_extraction.text:TfidfVectorizer from tfidf_v2.pkl
2018-10-29 16:27:46.106 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 208: Model sklearn.feature_extraction.textTfidfVectorizer loaded  with parameters
2018-10-29 16:27:46.108 INFO in 'deeppavlov.models.embedders.fasttext_embedder'['fasttext_embedder'] at line 52: [loading fastText embeddings from `wiki.en.bin`]
2018-10-29 16:28:41.707 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 272: [initializing `KerasClassificationModel` from saved]
2018-10-29 16:28:42.78 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 282

In [79]:
m(["Is it freezing in Offerman, California?"])

[['GetWeather']]
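
In this config the `tfidf_weighted` component combines the fastText embedder with the fitted TF-IDF vectorizer. One way to picture the combination is that each token vector is scaled by that token's TF-IDF weight, optionally followed by a weighted mean over the tokens. The sketch below uses made-up 3-dimensional embeddings and weights; both helper functions are hypothetical and are not DeepPavlov's code.

```python
# Made-up 3-dimensional "embeddings" and TF-IDF weights for two tokens.
toy_embeddings = {"is": [1.0, 0.0, 0.0], "freezing": [0.0, 2.0, 4.0]}
toy_tfidf = {"is": 0.25, "freezing": 0.75}


def weight_tokens(tokens, embeddings, weights, dim=3):
    """Scale each token vector by its weight; unknown tokens become zeros."""
    zero = [0.0] * dim
    return [[weights.get(tok, 0.0) * value for value in embeddings.get(tok, zero)]
            for tok in tokens]


def weighted_mean(tokens, embeddings, weights, dim=3):
    """Average the token vectors, using the TF-IDF weights as coefficients."""
    total = sum(weights.get(tok, 0.0) for tok in tokens)
    if total == 0:
        return [0.0] * dim
    scaled = weight_tokens(tokens, embeddings, weights, dim)
    return [sum(column) / total for column in zip(*scaled)]


tokens = ["is", "freezing"]
print(weight_tokens(tokens, toy_embeddings, toy_tfidf))
print(weighted_mean(tokens, toy_embeddings, toy_tfidf))
```

Per-token scaling keeps one vector per token, which fits a convolutional model that expects a sequence, while the weighted mean collapses the utterance into a single vector.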

## Bonus: pre-trained CNN model in DeepPavlov

Download the model files (including the 8Gb `wiki.en.bin` embeddings):

! python -m deeppavlov download intents_snips_big

Evaluate metrics on the validation set (no test set is provided):

! python -m deeppavlov evaluate intents_snips_big

Alternatively, the model can be used from Python code:

In [3]:
from pathlib import Path

import deeppavlov
from deeppavlov.core.commands.infer import build_model_from_config
from deeppavlov.core.commands.train import train_evaluate_model_from_config
from deeppavlov.download import deep_download

config_path = Path(deeppavlov.__file__).parent.joinpath('configs/classifiers/intents_snips_big.json')

In [81]:
# let's download all the required data - model files, embeddings, vocabularies
deep_download(config_path)

2018-10-29 16:29:11.234 INFO in 'deeppavlov.download'['download'] at line 112: Downloading...
2018-10-29 16:29:11.240 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 208: Starting new HTTP connection (1): files.deeppavlov.ai
2018-10-29 16:29:11.273 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 396: http://files.deeppavlov.ai:80 "GET /datasets/snips_intents/train.csv HTTP/1.1" 200 980824
2018-10-29 16:29:11.275 INFO in 'deeppavlov.core.data.utils'['utils'] at line 59: Downloading from http://files.deeppavlov.ai/datasets/snips_intents/train.csv to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/download/snips/train.csv
100%|██████████| 981k/981k [00:00<00:00, 21.7MB/s]
2018-10-29 16:29:11.326 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 208: Starting new HTTP connection (1): files.deeppavlov.ai
2018-10-29 16:29:11.365 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 396: http://files.deeppavlov.ai:80 "GET /deeppavlov_data/classifiers/

In [4]:
# now one can initialize the model
m = build_model_from_config(config_path)

2018-10-29 16:32:41.595 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/download/classifiers/intents_snips_v8/classes.dict]
[nltk_data] Downloading package punkt to /home/dilyara/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/dilyara/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package perluniprops to
[nltk_data]     /home/dilyara/nltk_data...
[nltk_data]   Package perluniprops is already up-to-date!
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /home/dilyara/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!
2018-10-29 16:32:42.519 INFO in 'deeppavlov.models.embedders.fasttext_embedder'['fasttext_embedder'] at line 52: [loading fastText embeddings from `/home/dilyara/Documents/GitHub/reserve/DeepPavlov/dow

2018-10-29 16:33:05.288 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 272: [initializing `KerasClassificationModel` from saved]
2018-10-29 16:33:05.649 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 282: [loading weights from model.h5]
2018-10-29 16:33:05.840 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 134: Model was successfully initialized!
Model summary:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 15, 300)      0                                            
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 15, 256)      77056       i

In [5]:
m(["Is it freezing in Offerman, California?"])

[['GetWeather']]

In [6]:
# or one can evaluate model WITHOUT training
train_evaluate_model_from_config(config_path, to_train=False)

2018-10-29 16:33:07.842 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
2018-10-29 16:33:07.847 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/download/classifiers/intents_snips_v8/classes.dict]
2018-10-29 16:33:07.848 INFO in 'deeppavlov.models.embedders.fasttext_embedder'['fasttext_embedder'] at line 52: [loading fastText embeddings from `/home/dilyara/Documents/GitHub/reserve/DeepPavlov/download/embeddings/wiki.en.bin`]
2018-10-29 16:33:28.444 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 272: [initializing `KerasClassificationModel` from saved]
2018-10-29 16:33:28.825 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 282: [loading weights from m

{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9811, "f1_macro": 0.9808, "roc_auc": 0.9989}, "time_spent": "0:00:01", "examples": [{"x": "Put some mac wiseman in my latino caliente playlist. ", "y_predicted": "y_pred_labels", "y_true": ["AddToPlaylist"]}]}}


{'valid': OrderedDict([('sets_accuracy', 0.9811),
              ('f1_macro', 0.9808),
              ('roc_auc', 0.9989)])}
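
The `sets_accuracy` metric reported above can be read as the share of samples whose predicted set of labels exactly equals the true set. The sketch below is an illustrative reimplementation under that assumption, using made-up labels; it is not the library's own function.

```python
def sets_accuracy(y_true, y_predicted):
    """Share of samples whose predicted label set equals the true label set."""
    if not y_true:
        return 0.0
    correct = sum(set(t) == set(p) for t, p in zip(y_true, y_predicted))
    return correct / len(y_true)


# Made-up labels: three of the four predictions match the true intent.
y_true = [["GetWeather"], ["PlayMusic"], ["RateBook"], ["AddToPlaylist"]]
y_pred = [["GetWeather"], ["PlayMusic"], ["RateBook"], ["PlayMusic"]]

print(sets_accuracy(y_true, y_pred))
```

With a single label per utterance, as in this dataset, the measure reduces to ordinary accuracy.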