## You can also run the notebook in [COLAB](https://colab.research.google.com/github/deepmipt/DeepPavlov/blob/master/examples/tutorials/05_deeppavlov_classification.ipynb).

In [1]:
!pip3 install deeppavlov

# Classification on DeepPavlov

**Task**:
Intent recognition on the SNIPS dataset: https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines. The data has already been converted to `csv` format and can be downloaded from http://files.deeppavlov.ai/datasets/snips_intents/train.csv

FastText English word embeddings ~8Gb: http://files.deeppavlov.ai/deeppavlov_data/embeddings/wiki.en.bin

## Plan of the notebook with documentation links:

1. [Data aggregation](#Data-aggregation)
     * [DatasetReader](#DatasetReader): [docs link](https://deeppavlov.readthedocs.io/en/latest/apiref/dataset_readers.html)
     * [DatasetIterator](#DatasetIterator): [docs link](https://deeppavlov.readthedocs.io/en/latest/apiref/dataset_iterators.html)
2. [Data preprocessing](#Data-preprocessing): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/data_processors.html)
     * [Lowercasing](#Lowercasing)
     * [Tokenization](#Tokenization)
     * [Vocabulary](#Vocabulary)
3. [Featurization](#Featurization): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/data_processors.html), [pre-trained embeddings link](https://deeppavlov.readthedocs.io/en/latest/intro/pretrained_vectors.html)
    * [Bag-of-words embedder](#Bag-of-words)
    * [TF-IDF vectorizer](#TF-IDF-Vectorizer)
    * [GloVe embedder](#GloVe-embedder)
    * [Mean GloVe embedder](#Mean-GloVe-embedder)
    * [GloVe weighted by TF-IDF embedder](#GloVe-weighted-by-TF-IDF-embedder)
4. [Models](#Models): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/classifiers.html)
    * [Building models in python](#Models-in-python)
        - [Sklearn component classifiers](#SklearnComponent-classifier-on-Tfidf-features-in-python)
        - [Keras classification model on GloVe emb](#KerasClassificationModel-on-GloVe-embeddings-in-python)
        - [Sklearn component classifier on GloVe weighted emb](#SklearnComponent-classifier-on-GloVe-weighted-by-TF-IDF-embeddings-in-python)
    * [Building models from configs](#Models-from-configs)
        - [Sklearn component classifiers](#SklearnComponent-classifier-on-Tfidf-features-from-config)
        - [Keras classification model](#KerasClassificationModel-on-GloVe-embeddings-from-config)
        - [Sklearn component classifier on GloVe weighted emb](#SklearnComponent-classifier-on-GloVe-weighted-by-TF-IDF-embeddings-from-config)
    * [Bonus: pre-trained CNN model in DeepPavlov](#Bonus:-pre-trained-CNN-model-in-DeepPavlov)

## Data aggregation

First, let's download the data and take a look at it.

In [2]:
from deeppavlov.core.data.utils import simple_download

# download the SNIPS training data file
simple_download(url="http://files.deeppavlov.ai/datasets/snips_intents/train.csv", 
                destination="./snips/train.csv")

2018-12-13 18:38:48.743 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 205: Starting new HTTP connection (1): files.deeppavlov.ai:80
2018-12-13 18:38:48.750 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 393: http://files.deeppavlov.ai:80 "GET /datasets/snips_intents/train.csv HTTP/1.1" 200 980824
2018-12-13 18:38:48.754 INFO in 'deeppavlov.core.data.utils'['utils'] at line 63: Downloading from http://files.deeppavlov.ai/datasets/snips_intents/train.csv to snips/train.csv
100%|██████████| 981k/981k [00:00<00:00, 22.0MB/s]


In [3]:
! head -n 15 snips/train.csv

text,intents
Add another song to the Cita Romántica playlist. ,AddToPlaylist
add clem burke in my playlist Pre-Party R&B Jams,AddToPlaylist
Add Live from Aragon Ballroom to Trapeo,AddToPlaylist
add Unite and Win to my night out,AddToPlaylist
Add track to my Digster Future Hits,AddToPlaylist
add the piano bar to my Cindy Wilson,AddToPlaylist
Add Spanish Harlem Incident to cleaning the house,AddToPlaylist
add The Greyest of Blue Skies in Indie Español my playlist,AddToPlaylist
Add the name kids in the street to the plylist New Indie Mix,AddToPlaylist
add album radar latino,AddToPlaylist
Add Tranquility to the Latin Pop Rising playlist. ,AddToPlaylist
Add d flame to the Dcode2016 playlist.,AddToPlaylist
Add album to my fairy tales,AddToPlaylist
I need another artist in the New Indie Mix playlist. ,AddToPlaylist


### DatasetReader

Read the data using `BasicClassificationDatasetReader` from DeepPavlov.

In [4]:
from deeppavlov.dataset_readers.basic_classification_reader import BasicClassificationDatasetReader

In [5]:
# read data from particular columns of `.csv` file
dr = BasicClassificationDatasetReader().read(
    data_path='./snips/',
    train='train.csv',
    x = 'text',
    y = 'intents'
)
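
To make the reader's job concrete, here is a minimal standard-library sketch of turning two `csv` columns into `(text, [label])` pairs. The helper name `read_classification_csv` and the inline sample are illustrative assumptions; this is not DeepPavlov's implementation, which also handles file paths and further options.

```python
import csv
import io

# Two rows copied from the `head` output above; in the notebook the
# real data lives in ./snips/train.csv.
SAMPLE_CSV = """text,intents
add album radar latino,AddToPlaylist
Add album to my fairy tales,AddToPlaylist
"""


def read_classification_csv(handle, x="text", y="intents"):
    """Return a list of (text, [label]) pairs taken from columns `x` and `y`."""
    return [(row[x], [row[y]]) for row in csv.DictReader(handle)]


samples = read_classification_csv(io.StringIO(SAMPLE_CSV))
print(samples[0])
```

Each label is wrapped in a list so that a sample can, in principle, carry several classes.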



We don't have a ready train/valid/test split.

In [6]:
# check train/valid/test sizes
[(k, len(dr[k])) for k in dr.keys()]

[('train', 15884), ('valid', 0), ('test', 0)]

### DatasetIterator

Use `BasicClassificationDatasetIterator` to split `train` into `train` and `valid` and to generate batches of samples.

In [7]:
from deeppavlov.dataset_iterators.basic_classification_iterator import BasicClassificationDatasetIterator

In [8]:
# initialize the data iterator, splitting the `train` field into `train` and `valid` in proportion 0.8/0.2
train_iterator = BasicClassificationDatasetIterator(
    data=dr,
    field_to_split='train',  # field that will be split
    split_fields=['train', 'valid'],   # fields into which the field above will be split
    split_proportions=[0.8, 0.2],  # proportions for splitting
    split_seed=23,  # seed for splitting dataset
    seed=42)  # seed for iteration over dataset

2018-12-13 18:38:50.180 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
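
A seeded split of one field into several can be sketched in a few lines of plain Python. The function `split_by_proportions` and its rounding rule are assumptions made for illustration; the iterator's actual splitting and rounding may differ.

```python
import random


def split_by_proportions(samples, proportions, seed):
    """Shuffle a copy of `samples` with a fixed seed and cut it into consecutive parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for i, share in enumerate(proportions):
        # the last part takes the remainder so no sample is lost to rounding
        is_last = i == len(proportions) - 1
        end = len(shuffled) if is_last else start + round(share * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts


train, valid = split_by_proportions(range(15884), [0.8, 0.2], seed=23)
print(len(train), len(valid))
```

Fixing the seed makes the split reproducible, which matters when models trained in different runs are compared on the same validation part.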


Let's look at some training samples.

In [9]:
# one can get train instances (or any other data type including `all`)
x_train, y_train = train_iterator.get_instances(data_type='train')
for x, y in list(zip(x_train, y_train))[:5]:
    print('x:', x)
    print('y:', y)
    print('=================')

x: Is it freezing in Offerman, California?
y: ['GetWeather']
x: put this song in the playlist Trap Land
y: ['AddToPlaylist']
x: show me a textbook with a rating of 2 and a maximum rating of 6 that is current
y: ['RateBook']
x: Will the weather be okay in Northern Luzon Heroes Hill National Park 4 and a half months from now?
y: ['GetWeather']
x: Rate the current album a four
y: ['RateBook']


## Data preprocessing

We will be using lowercasing and tokenization as data preparation. 

DeepPavlov also contains several other preprocessors and tokenizers.

### Lowercasing

`StrLower` lowercases texts.

In [10]:
from deeppavlov.models.preprocessors.str_lower import StrLower

[nltk_data] Downloading package punkt to /home/dilyara/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/dilyara/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package perluniprops to
[nltk_data]     /home/dilyara/nltk_data...
[nltk_data]   Package perluniprops is already up-to-date!
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /home/dilyara/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!


In [11]:
str_lower = StrLower()
str_lower(['Is it freezing in Offerman, California?'])

['is it freezing in offerman, california?']

### Tokenization

`NLTKMosesTokenizer` splits strings into tokens.

In [12]:
from deeppavlov.models.tokenizers.nltk_moses_tokenizer import NLTKMosesTokenizer

In [13]:
tokenizer = NLTKMosesTokenizer()
tokenizer(['Is it freezing in Offerman, California?'])

[['Is', 'it', 'freezing', 'in', 'Offerman', ',', 'California', '?']]

Let's preprocess the whole `train` part of the dataset.

In [14]:
train_x_lower_tokenized = str_lower(tokenizer(train_iterator.get_instances(data_type='train')[0]))

### Vocabulary

Now we are ready to use vocabularies. They are very useful for:
* extracting class labels and converting labels to indices and vice versa,
* building vocabularies of characters or tokens.

In [15]:
from deeppavlov.core.data.simple_vocab import SimpleVocabulary

In [16]:
# initialize a simple vocabulary to collect all classes that appear in the dataset
classes_vocab = SimpleVocabulary(
    save_path='./snips/classes.dict',
    load_path='./snips/classes.dict')

2018-12-13 18:38:51.509 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 100: [loading vocabulary from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/classes.dict]


In [17]:
classes_vocab.fit((train_iterator.get_instances(data_type='train')[1]))
classes_vocab.save()

2018-12-13 18:38:51.536 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 89: [saving vocabulary to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/classes.dict]


Let's see what classes the dataset contains and their indices in the vocabulary.

In [18]:
list(classes_vocab.items())

[('GetWeather', 0),
 ('PlayMusic', 1),
 ('SearchScreeningEvent', 2),
 ('BookRestaurant', 3),
 ('RateBook', 4),
 ('SearchCreativeWork', 5),
 ('AddToPlaylist', 6)]

In [19]:
# one can also collect a vocabulary of tokens that appear at least 2 times in the dataset
token_vocab = SimpleVocabulary(
    save_path='./snips/tokens.dict',
    load_path='./snips/tokens.dict',
    min_freq=2,
    special_tokens=('<PAD>', '<UNK>',),
    unk_token='<UNK>')

2018-12-13 18:38:51.550 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 100: [loading vocabulary from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/tokens.dict]


In [20]:
token_vocab.fit(train_x_lower_tokenized)
token_vocab.save()

2018-12-13 18:38:51.685 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 89: [saving vocabulary to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/tokens.dict]


In [21]:
# number of tokens in dictionary
len(token_vocab)

4564

In [22]:
# 10 most common words and the number of times they appeared
token_vocab.freqs.most_common()[:10]

[('the', 6953),
 ('a', 3917),
 ('in', 3265),
 ('to', 3203),
 ('for', 2814),
 ('of', 2401),
 ('.', 2400),
 ('i', 2079),
 ('at', 1935),
 ('play', 1703)]

In [23]:
token_ids = token_vocab(str_lower(tokenizer(['Is it freezing in Offerman, California?'])))
token_ids

[[13, 36, 244, 4, 1, 29, 996, 20]]

In [24]:
tokenizer(token_vocab(token_ids))

['is it freezing in <UNK>, california?']
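
The behaviour seen above (special tokens first, a frequency cut-off, unknown tokens mapped to `<UNK>`) can be illustrated with a small sketch. `build_vocab` and `encode` are hypothetical helpers written for this example, not part of `SimpleVocabulary`, and the exact index order in DeepPavlov may differ.

```python
from collections import Counter


def build_vocab(tokenized_texts, min_freq=2, special_tokens=("<PAD>", "<UNK>")):
    """Return (token -> index, index -> token) with special tokens placed first."""
    freqs = Counter(tok for text in tokenized_texts for tok in text)
    index_to_token = list(special_tokens)
    # keep only tokens that occur at least `min_freq` times, most frequent first
    index_to_token += [tok for tok, n in freqs.most_common() if n >= min_freq]
    token_to_index = {tok: i for i, tok in enumerate(index_to_token)}
    return token_to_index, index_to_token


def encode(tokens, token_to_index, unk_token="<UNK>"):
    """Map tokens to indices, falling back to the index of `unk_token`."""
    return [token_to_index.get(tok, token_to_index[unk_token]) for tok in tokens]


texts = [["play", "a", "song"], ["play", "a", "movie"]]
token_to_index, index_to_token = build_vocab(texts)
ids = encode(["play", "a", "opera"], token_to_index)
print(ids, [index_to_token[i] for i in ids])
```

Decoding the indices gives back `<UNK>` for the rare word, which mirrors the round trip in the two cells above.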

## Featurization

This part covers several ways to featurize text samples. Choose the vectorizer or embedder that fits the available resources and the task at hand.

Bag-of-words (BoW) and TF-IDF vectorizers convert text samples to vectors (one vector per sample), while the fastText, GloVe, and TF-IDF-weighted embedders produce either an embedding vector per token or an embedding vector per text sample (if `mean` is set to `True`).

### Bag-of-words

Maps each text sample to a vector indicating which words appear in it: text -> binary vector $v$: \[0, 1, 0, 0, 0, 1, ..., 1, 0, 1\].

The dimensionality of $v$ equals the vocabulary size.

$v_i = 1$ if word $i$ is in the text,

$v_i = 0$ otherwise.
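
The definition translates directly into code. Below is a minimal sketch in plain Python; `bow_vector` is a hypothetical helper, and the token indices and vocabulary size are the ones printed in the Vocabulary section.

```python
def bow_vector(token_ids, vocab_size):
    """Binary bag-of-words: position i is 1 if token index i occurs in the sample."""
    vector = [0] * vocab_size
    for index in token_ids:
        vector[index] = 1
    return vector


# indices of 'is it freezing in <UNK> , california ?' from the vocabulary above
token_ids = [13, 36, 244, 4, 1, 29, 996, 20]
v = bow_vector(token_ids, vocab_size=4564)
print(len(v), sum(v))
```

Repeated tokens set the same position only once, so the sum counts distinct indices rather than tokens.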

In [25]:
import numpy as np
from deeppavlov.models.embedders.bow_embedder import BoWEmbedder

In [26]:
# initialize bag-of-words embedder giving total number of tokens
bow = BoWEmbedder(depth=token_vocab.len)
# it assumes indexed tokenized samples
bow(token_vocab(str_lower(tokenizer(['Is it freezing in Offerman, California?']))))

[array([0, 1, 0, ..., 0, 0, 0], dtype=int32)]

In [27]:
# 8 distinct indices are set: 7 in-vocabulary tokens plus `<UNK>` for 'offerman'
sum(bow(token_vocab(str_lower(tokenizer(['Is it freezing in Offerman, California?']))))[0])

8

### TF-IDF Vectorizer

Maps each text sample to a vector: text -> vector $v$ from $R^N$, where $N$ is the vocabulary size.

$TF\text{-}IDF(token, document) = TF(token, document) \cdot IDF(token, all\_documents)$

$TF$ is the term frequency:

$TF(token, document) = \frac{n_{token}}{\sum_{k}n_k},$

where $n_{token}$ is the number of times the token occurs in the document and the sum runs over all tokens of that document.

$IDF$ is the inverse document frequency:

$IDF(token, all\_documents) = \log \frac{Total\ number\ of\ documents}{number\ of\ documents\ where\ token\ appeared}.$
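
A small worked example may help here. The sketch below computes the weights for a toy corpus in plain Python, assuming the logarithmic form of IDF; `tf_idf` is a hypothetical helper. `sklearn`'s `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will not match this sketch exactly.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {token: weight} dict per tokenized document."""
    n_docs = len(docs)
    # document frequency: in how many documents each token appears
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weights


docs = [["play", "music"], ["play", "movie"]]
w = tf_idf(docs)
print(w[0])
```

The token `play` occurs in every document, so its IDF is $\log 1 = 0$ and it contributes nothing, while `music` gets the weight $0.5 \cdot \log 2$.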

`SklearnComponent` in DeepPavlov is a universal wrapper for any vectorizer/estimator from the `sklearn` package. To use it, pass the model class and the name of the inference method as parameters.
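
The idea of addressing a class by a `"module:ClassName"` string and choosing the inference method by name can be sketched with the standard library alone. `ComponentWrapper` is a hypothetical stand-in written for this illustration; it is not how `SklearnComponent` is implemented, and it wraps `collections.Counter` only to stay dependency-free.

```python
import importlib


class ComponentWrapper:
    """Instantiate a class given as 'module:ClassName' and expose one method as the call."""

    def __init__(self, model_class, infer_method, **params):
        module_name, class_name = model_class.split(":")
        klass = getattr(importlib.import_module(module_name), class_name)
        self.model = klass(**params)  # remaining keyword arguments go to the model
        self.infer = getattr(self.model, infer_method)

    def __call__(self, *args):
        return self.infer(*args)


# any importable class works; Counter is used here purely as an example
wrapper = ComponentWrapper("collections:Counter", "most_common", a=3, b=1)
print(wrapper(1))
```

With this pattern the same wrapper can serve a vectorizer (`infer_method="transform"`) and a classifier (`infer_method="predict"`) without any class-specific code.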

In [28]:
from deeppavlov.models.sklearn import SklearnComponent

In [29]:
# initialize TF-IDF vectorizer sklearn component with `transform` as infer method
tfidf = SklearnComponent(
    model_class="sklearn.feature_extraction.text:TfidfVectorizer",
    infer_method="transform",
    save_path='./tfidf_v0.pkl',
    load_path='./tfidf_v0.pkl',
    mode='train')

2018-12-13 18:38:51.757 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.feature_extraction.text:TfidfVectorizer from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/tfidf_v0.pkl
2018-12-13 18:38:51.763 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 208: Model sklearn.feature_extraction.textTfidfVectorizer loaded  with parameters


In [30]:
# fit on textual train instances and save it
tfidf.fit(str_lower(train_iterator.get_instances(data_type='train')[0]))
tfidf.save()

2018-12-13 18:38:51.779 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 107: Fitting model sklearn.feature_extraction.textTfidfVectorizer
2018-12-13 18:38:51.887 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 239: Saving model to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/tfidf_v0.pkl


In [31]:
tfidf(str_lower(['Is it freezing in Offerman, California?']))

<1x10709 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [32]:
# number of tokens in the TF-IDF vocabulary
len(tfidf.model.vocabulary_)

10709

### GloVe embedder

[GloVe](https://nlp.stanford.edu/projects/glove/) is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

In [33]:
from deeppavlov.models.embedders.glove_embedder import GloVeEmbedder

2018-12-13 18:38:52.55 INFO in 'summarizer.preprocessing.cleaner'['textcleaner'] at line 37: 'pattern' package not found; tag filters are not available for English


Let's download the GloVe embedding file.

In [34]:
simple_download(url="http://files.deeppavlov.ai/embeddings/glove.6B.100d.txt", 
                destination="./glove.6B.100d.txt")

2018-12-13 18:38:52.74 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 205: Starting new HTTP connection (1): files.deeppavlov.ai:80
2018-12-13 18:38:52.79 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 393: http://files.deeppavlov.ai:80 "GET /embeddings/glove.6B.100d.txt HTTP/1.1" 200 None
2018-12-13 18:38:52.80 INFO in 'deeppavlov.core.data.utils'['utils'] at line 63: Downloading from http://files.deeppavlov.ai/embeddings/glove.6B.100d.txt to glove.6B.100d.txt
347MB [00:07, 47.5MB/s] 


In [35]:
embedder = GloVeEmbedder(load_path='./glove.6B.100d.txt',
                         dim=100, pad_zero=True)

2018-12-13 18:38:59.447 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `/home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt`]
2018-12-13 18:38:59.449 INFO in 'gensim.models.utils_any2vec'['utils_any2vec'] at line 170: loading projection weights from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt
2018-12-13 18:38:59.450 DEBUG in 'smart_open.smart_open_lib'['smart_open_lib'] at line 149: {'kw': {}, 'mode': 'rb', 'uri': '/home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt'}
2018-12-13 18:38:59.451 DEBUG in 'smart_open.smart_open_lib'['smart_open_lib'] at line 621: encoding_wrapper: {'errors': 'strict', 'encoding': None, 'mode': 'rb', 'fileobj': <_io.BufferedReader name='/home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt'>}
2018-12-13 18:39:23.878 INFO in 'gensim.models.utils_any2

In [36]:
# output shape is (batch_size x max_num_tokens_in_the_batch x embedding_dim)
embedded_batch = embedder(str_lower(tokenizer(['Is it freezing in Offerman, California?']))) 
len(embedded_batch), len(embedded_batch[0]), embedded_batch[0][0].shape

(1, 8, (100,))

### Mean GloVe embedder

The embedder returns a vector per token, while we want a vector per text sample. Therefore, let's calculate the mean of the token embeddings. 
To do that, we can either initialize `GloVeEmbedder` with `mean=True` (`mean=False` by default) or pass `mean=True` when calling it (in that case `mean` applies only to that call).
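
Averaging token vectors into one sample vector can be sketched as follows. `mean_embedding` is a hypothetical helper on plain Python lists, and returning a zero vector for an empty sample is an assumption of this sketch.

```python
def mean_embedding(token_vectors, dim):
    """Average per-token vectors component-wise into a single sample vector."""
    if not token_vectors:
        return [0.0] * dim  # assumption: an empty sample maps to the zero vector
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


sample = [[1.0, 2.0], [3.0, 4.0]]  # two tokens with 2-dimensional embeddings
print(mean_embedding(sample, dim=2))
```

The result has the embedding dimensionality regardless of how many tokens the sample contains, which is what a classifier with a fixed input size needs.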

In [37]:
# output shape is (batch_size x embedding_dim)
embedded_batch = embedder(str_lower(tokenizer(['Is it freezing in Offerman, California?'])), mean=True) 
len(embedded_batch), embedded_batch[0].shape

(1, (100,))

### GloVe weighted by TF-IDF embedder

One possible way to combine a TF-IDF vectorizer with any token embedder is to weight token embeddings by TF-IDF coefficients. Setting `mean` to `True` is therefore required to obtain the per-sample embeddings of interest; **by default** the embedder still returns per-token embeddings.
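
One way to read "weighted by TF-IDF" is a weighted average of token vectors, where each weight is the token's TF-IDF coefficient. The sketch below illustrates that reading; `weighted_mean_embedding` is hypothetical, and normalizing by the sum of weights is an assumption, so the exact formula used by `TfidfWeightedEmbedder` may differ.

```python
def weighted_mean_embedding(token_vectors, weights):
    """Weighted average of token vectors; weights play the role of TF-IDF coefficients."""
    dim = len(token_vectors[0])
    total = sum(weights)
    if total == 0:
        return [0.0] * dim  # assumption: no informative tokens gives the zero vector
    return [
        sum(w * vec[i] for vec, w in zip(token_vectors, weights)) / total
        for i in range(dim)
    ]


vectors = [[1.0, 0.0], [0.0, 1.0]]
weights = [3.0, 1.0]  # the first token is three times as informative
print(weighted_mean_embedding(vectors, weights))
```

Compared with a plain mean, frequent but uninformative tokens such as `the` pull the sample vector far less.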

In [38]:
from deeppavlov.models.embedders.tfidf_weighted_embedder import TfidfWeightedEmbedder

In [39]:
weighted_embedder = TfidfWeightedEmbedder(
    embedder=embedder,  # our GloVe embedder instance
    tokenizer=tokenizer,  # our tokenizer instance
    mean=True,  # to return one vector per sample
    vectorizer=tfidf  # our TF-IDF vectorizer
)

In [40]:
# output shape is (batch_size x embedding_dim)
embedded_batch = weighted_embedder(str_lower(tokenizer(['Is it freezing in Offerman, California?']))) 
len(embedded_batch), embedded_batch[0].shape

(1, (100,))

## Models

In [41]:
from deeppavlov.metrics.accuracy import sets_accuracy

In [42]:
# get all train and valid data from iterator
x_train, y_train = train_iterator.get_instances(data_type="train")
x_valid, y_valid = train_iterator.get_instances(data_type="valid")

### Models in python

#### SklearnComponent classifier on Tfidf-features in python

In [43]:
# initialize the sklearn classifier; any parameters of the classifier can be passed
cls = SklearnComponent(
    model_class="sklearn.linear_model:LogisticRegression",
    infer_method="predict",
    save_path='./logreg_v0.pkl',
    load_path='./logreg_v0.pkl',
    C=1,
    mode='train')

2018-12-13 18:39:24.55 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.linear_model:LogisticRegression from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/logreg_v0.pkl
2018-12-13 18:39:24.56 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 208: Model sklearn.linear_model.logisticLogisticRegression loaded  with parameters


In [44]:
# fit sklearn classifier and save it
cls.fit(tfidf(x_train), y_train)
cls.save()

2018-12-13 18:39:24.707 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 107: Fitting model sklearn.linear_model.logisticLogisticRegression
2018-12-13 18:39:24.919 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 239: Saving model to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/logreg_v0.pkl


In [45]:
y_valid_pred = cls(tfidf(x_valid))

In [46]:
# Let's look into obtained result
print("Text sample: {}".format(x_valid[0]))
print("True label: {}".format(y_valid[0]))
print("Predicted label: {}".format(y_valid_pred[0]))

Text sample: I need seating at Floating restaurant in Tennessee for a group of 9
True label: ['BookRestaurant']
Predicted label: BookRestaurant


In [47]:
# let's calculate sets accuracy (because each element is a list of labels)
sets_accuracy(np.squeeze(y_valid), y_valid_pred)

0.982373308152345
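
For reference, a set-based accuracy can be sketched as the share of samples whose predicted label set equals the true label set. `set_accuracy` below is a hypothetical re-implementation for illustration and makes no claim to match `sets_accuracy` line for line.

```python
def set_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label set equals the true label set."""
    if not y_true:
        return 0.0
    correct = sum(set(t) == set(p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_true = [["GetWeather"], ["RateBook", "PlayMusic"], ["AddToPlaylist"]]
y_pred = [["GetWeather"], ["PlayMusic", "RateBook"], ["BookRestaurant"]]
print(set_accuracy(y_true, y_pred))
```

Label order inside a sample does not matter. Labels are passed as lists here because a set built from a bare string would contain its characters rather than the label itself.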

#### KerasClassificationModel on GloVe embeddings in python

In [48]:
from deeppavlov.models.classifiers.keras_classification_model import KerasClassificationModel
from deeppavlov.models.preprocessors.one_hotter import OneHotter
from deeppavlov.models.classifiers.proba2labels import Proba2Labels

Using TensorFlow backend.


In [49]:
# Initialize `KerasClassificationModel`, which builds a shallow-and-wide CNN
# (named here `cnn_model`)
cls = KerasClassificationModel(save_path="./cnn_model_v0", 
                               load_path="./cnn_model_v0", 
                               embedding_size=embedder.dim,
                               n_classes=classes_vocab.len,
                               model_name="cnn_model",
                               text_size=15, # number of tokens
                               kernel_sizes_cnn=[3, 5, 7],
                               filters_cnn=128,
                               dense_size=100,
                               optimizer="Adam",
                               learning_rate=0.1,
                               learning_rate_decay=0.01,
                               loss="categorical_crossentropy")

2018-12-13 18:39:26.47 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 287: [initializing `KerasClassificationModel` from saved]
2018-12-13 18:39:26.428 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 297: [loading weights from cnn_model_v0.h5]
2018-12-13 18:39:26.636 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 137: Model was successfully initialized!
Model summary:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 15, 100)      0                                            
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 15, 128)      38528  

In [50]:
# `KerasClassificationModel` expects a one-hot class distribution per sample.
# `OneHotter` converts indices to one-hot vectors.
# To obtain indices we can use the `classes_vocab` initialized and fitted above
onehotter = OneHotter(depth=classes_vocab.len, single_vector=True)

In [51]:
# Train for 10 epochs
for ep in range(10):
    for x, y in train_iterator.gen_batches(batch_size=64, 
                                           data_type="train"):
        x_embed = embedder(tokenizer(str_lower(x)))
        y_onehot = onehotter(classes_vocab(y))
        cls.train_on_batch(x_embed, y_onehot)
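
The training loop relies on the iterator handing out shuffled mini-batches. A minimal sketch of such a generator is given below; `iter_batches` is a hypothetical helper, and the iterator's own `gen_batches` may shuffle and seed differently.

```python
import random


def iter_batches(x, y, batch_size, seed=42):
    """Yield aligned (x_batch, y_batch) pairs in a seeded random order."""
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        picked = order[start:start + batch_size]
        yield [x[i] for i in picked], [y[i] for i in picked]


xs = list(range(130))
ys = [value * 10 for value in xs]
batches = list(iter_batches(xs, ys, batch_size=64))
print([len(x_batch) for x_batch, _ in batches])
```

The last batch is smaller when the dataset size is not a multiple of `batch_size`, so a model trained this way has to accept variable batch sizes.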

In [52]:
# Save model weights and parameters
cls.save()

2018-12-13 18:40:00.378 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 371: [saving model to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/cnn_model_v0_opt.json]


In [53]:
# Inferring on validation data yields a probability distribution over classes for each sample.
y_valid_pred = cls(embedder(tokenizer(str_lower(x_valid))))

In [54]:
# To convert probability distribution to labels, 
# we first need to convert probabilities to indices,
# and then using vocabulary `classes_vocab` convert indices to labels.
# 
# `Proba2Labels` converts probabilities to indices and supports three different modes:
# if `max_proba` is true, returns indices of the highest probabilities
# if `confident_threshold` is given, returns indices with probabilities higher than the threshold
# if `top_n` is given, returns `top_n` indices with the highest probabilities
prob2labels = Proba2Labels(max_proba=True)
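
The three modes can be illustrated on a single probability vector. `probas_to_indices` is a hypothetical function written for this sketch; in particular, the order in which the options take precedence is an assumption and may not match `Proba2Labels`.

```python
def probas_to_indices(probas, max_proba=False, confident_threshold=None, top_n=None):
    """Convert one probability vector to class indices in one of three modes."""
    if confident_threshold is not None:
        # every class whose probability exceeds the threshold
        return [i for i, p in enumerate(probas) if p > confident_threshold]
    ranked = sorted(range(len(probas)), key=lambda i: probas[i], reverse=True)
    if max_proba:
        return ranked[0]  # single most probable class
    if top_n is not None:
        return ranked[:top_n]  # the `top_n` most probable classes
    raise ValueError("choose one of max_proba, confident_threshold or top_n")


probas = [0.1, 0.7, 0.2]
print(probas_to_indices(probas, max_proba=True))
print(probas_to_indices(probas, confident_threshold=0.15))
print(probas_to_indices(probas, top_n=2))
```

The threshold mode can return several indices or none at all, which suits multi-label tasks; `max_proba` always returns exactly one class, which suits single-label intent recognition.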

In [55]:
# Let's look into obtained result
print("Text sample: {}".format(x_valid[0]))
print("True label: {}".format(y_valid[0]))
print("Predicted probability distribution: {}".format(dict(zip(classes_vocab.keys(), 
                                                               y_valid_pred[0]))))
print("Predicted label: {}".format(classes_vocab(prob2labels(y_valid_pred))[0]))

Text sample: I need seating at Floating restaurant in Tennessee for a group of 9
True label: ['BookRestaurant']
Predicted probability distribution: {'GetWeather': 1.6747324480093084e-05, 'PlayMusic': 1.4119169463810977e-05, 'SearchScreeningEvent': 9.400721864949446e-06, 'BookRestaurant': 0.9570612907409668, 'RateBook': 1.5034107491374016e-05, 'SearchCreativeWork': 3.359359106980264e-05, 'AddToPlaylist': 1.8879594790632837e-05}
Predicted label: ['BookRestaurant']


In [56]:
# calculate sets accuracy
sets_accuracy(y_valid, classes_vocab(prob2labels(y_valid_pred)))

0.983947119924457

#### SklearnComponent classifier on GloVe weighted by TF-IDF embeddings in python

In [57]:
# initialize the sklearn classifier; any parameters of the classifier can be passed
cls = SklearnComponent(
    model_class="sklearn.linear_model:LogisticRegression",
    infer_method="predict",
    save_path='./logreg_v1.pkl',
    load_path='./logreg_v1.pkl',
    C=1,
    mode='train')

2018-12-13 18:40:01.246 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.linear_model:LogisticRegression from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/logreg_v1.pkl
2018-12-13 18:40:01.247 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 208: Model sklearn.linear_model.logisticLogisticRegression loaded  with parameters


In [58]:
# fit sklearn classifier and save it
cls.fit(weighted_embedder(str_lower(tokenizer(x_train))), y_train)
cls.save()

2018-12-13 18:40:26.836 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 107: Fitting model sklearn.linear_model.logisticLogisticRegression
2018-12-13 18:40:28.674 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 239: Saving model to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/logreg_v1.pkl


In [59]:
y_valid_pred = cls(weighted_embedder(str_lower(tokenizer(x_valid))))

In [60]:
# Let's look into obtained result
print("Text sample: {}".format(x_valid[0]))
print("True label: {}".format(y_valid[0]))
print("Predicted label: {}".format(y_valid_pred[0]))

Text sample: I need seating at Floating restaurant in Tennessee for a group of 9
True label: ['BookRestaurant']
Predicted label: BookRestaurant


In [61]:
# let's calculate sets accuracy (because each element is a list of labels)
sets_accuracy(np.squeeze(y_valid), y_valid_pred)

0.9184765502045955

### Let's free the memory used by embeddings and models

In [62]:
embedder.reset()
cls.reset()

### Models from configs

In [63]:
from deeppavlov import build_model
from deeppavlov import train_model

#### SklearnComponent classifier on Tfidf-features from config

In [64]:
logreg_config = {
  "dataset_reader": {
    "class_name": "basic_classification_reader",
    "x": "text",
    "y": "intents",
    "data_path": "./snips"
  },
  "dataset_iterator": {
    "class_name": "basic_classification_iterator",
    "seed": 42,
    "split_seed": 23,
    "field_to_split": "train",
    "split_fields": [
      "train",
      "valid"
    ],
    "split_proportions": [
      0.9,
      0.1
    ]
  },
  "chainer": {
    "in": [
      "x"
    ],
    "in_y": [
      "y"
    ],
    "pipe": [
      {
        "id": "classes_vocab",
        "class_name": "simple_vocab",
        "fit_on": [
          "y"
        ],
        "save_path": "./snips/classes.dict",
        "load_path": "./snips/classes.dict",
        "in": "y",
        "out": "y_ids"
      },
      {
        "in": [
          "x"
        ],
        "out": [
          "x_vec"
        ],
        "fit_on": [
          "x",
          "y_ids"
        ],
        "id": "tfidf_vec",
        "class_name": "sklearn_component",
        "save_path": "tfidf_v1.pkl",
        "load_path": "tfidf_v1.pkl",
        "model_class": "sklearn.feature_extraction.text:TfidfVectorizer",
        "infer_method": "transform"
      },
      {
        "in": "x",
        "out": "x_tok",
        "id": "my_tokenizer",
        "class_name": "nltk_moses_tokenizer",
        "tokenizer": "wordpunct_tokenize"
      },
      {
        "in": [
          "x_vec"
        ],
        "out": [
          "y_pred"
        ],
        "fit_on": [
          "x_vec",
          "y"
        ],
        "class_name": "sklearn_component",
        "main": True,
        "save_path": "logreg_v2.pkl",
        "load_path": "logreg_v2.pkl",
        "model_class": "sklearn.linear_model:LogisticRegression",
        "infer_method": "predict",
        "ensure_list_output": True
      }
    ],
    "out": [
      "y_pred"
    ]
  },
  "train": {
    "batch_size": 64,
    "metrics": [
      "accuracy"
    ],
    "validate_best": True,
    "test_best": False
  }
}
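
In the `chainer` section, each component reads the variables named in `in` and writes its result under the names in `out`, so later components can pick them up. The toy runner below sketches that wiring under simplifying assumptions: `run_pipe` and the two lambda components are hypothetical, there is no fitting, saving or loading, and this is not DeepPavlov's implementation.

```python
def run_pipe(pipe, inputs):
    """Run components in order; each reads its `in` names and writes its `out` names."""
    memory = dict(inputs)
    for step in pipe:
        args = [memory[name] for name in step["in"]]
        result = step["component"](*args)
        outs = step["out"]
        if len(outs) == 1:
            memory[outs[0]] = result
        else:
            memory.update(zip(outs, result))
    return memory


# toy components standing in for the vectorizer and the classifier
pipe = [
    {"in": ["x"], "out": ["x_vec"],
     "component": lambda batch: [len(text.split()) for text in batch]},
    {"in": ["x_vec"], "out": ["y_pred"],
     "component": lambda sizes: [["long"] if n > 3 else ["short"] for n in sizes]},
]
memory = run_pipe(pipe, {"x": ["play music", "is it freezing in california"]})
print(memory["y_pred"])
```

Because components communicate only through named variables, a step such as the tokenizer can be added or removed without touching the others, as long as the names still line up.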


In [65]:
# we can train and evaluate the model from the config
m = train_model(logreg_config)

2018-12-13 18:40:35.904 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
2018-12-13 18:40:35.907 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 100: [loading vocabulary from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/classes.dict]
2018-12-13 18:40:36.38 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 89: [saving vocabulary to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/classes.dict]
2018-12-13 18:40:36.39 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.feature_extraction.text:TfidfVectorizer from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/tfidf_v1.pkl
2018-12-13 18:40:36.44 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 208: Model sklea

{"valid": {"eval_examples_count": 1589, "metrics": {"accuracy": 0.983}, "time_spent": "0:00:01"}}


In [66]:
# or we can just load the pre-trained model (coincides with what we did above)
m = build_model(logreg_config)

2018-12-13 18:40:37.330 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 100: [loading vocabulary from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/classes.dict]
2018-12-13 18:40:37.331 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.feature_extraction.text:TfidfVectorizer from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/tfidf_v1.pkl
2018-12-13 18:40:37.336 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 208: Model sklearn.feature_extraction.textTfidfVectorizer loaded  with parameters
2018-12-13 18:40:37.338 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.linear_model:LogisticRegression from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/logreg_v2.pkl
2018-12-13 18:40:37.338 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklear

In [67]:
m(["Is it freezing in Offerman, California?"])

[['GetWeather']]

#### KerasClassificationModel on GloVe embeddings from config

In [68]:
cnn_config = {
  "dataset_reader": {
    "class_name": "basic_classification_reader",
    "x": "text",
    "y": "intents",
    "data_path": "snips"
  },
  "dataset_iterator": {
    "class_name": "basic_classification_iterator",
    "seed": 42,
    "split_seed": 23,
    "field_to_split": "train",
    "split_fields": [
      "train",
      "valid"
    ],
    "split_proportions": [
      0.9,
      0.1
    ]
  },
  "chainer": {
    "in": [
      "x"
    ],
    "in_y": [
      "y"
    ],
    "pipe": [
      {
        "id": "classes_vocab",
        "class_name": "simple_vocab",
        "fit_on": [
          "y"
        ],
        "level": "token",
        "save_path": "./snips/classes.dict",
        "load_path": "./snips/classes.dict",
        "in": "y",
        "out": "y_ids"
      },
      {
        "in": "x",
        "out": "x_tok",
        "id": "my_tokenizer",
        "class_name": "nltk_tokenizer",
        "tokenizer": "wordpunct_tokenize"
      },
      {
        "in": "x_tok",
        "out": "x_emb",
        "id": "my_embedder",
        "class_name": "glove",
        "load_path": "./glove.6B.100d.txt",
        "dim": 100,
        "pad_zero": True
      },
      {
        "in": "y_ids",
        "out": "y_onehot",
        "class_name": "one_hotter",
        "depth": "#classes_vocab.len",
        "single_vector": True
      },
      {
        "in": [
          "x_emb"
        ],
        "in_y": [
          "y_onehot"
        ],
        "out": [
          "y_pred_probas"
        ],
        "main": True,
        "class_name": "keras_classification_model",
        "save_path": "./cnn_model_v1",
        "load_path": "./cnn_model_v1",
        "embedding_size": "#my_embedder.dim",
        "n_classes": "#classes_vocab.len",
        "kernel_sizes_cnn": [
          1,
          2,
          3
        ],
        "filters_cnn": 256,
        "optimizer": "Adam",
        "learning_rate": 0.01,
        "learning_rate_decay": 0.1,
        "loss": "categorical_crossentropy",
        "coef_reg_cnn": 1e-4,
        "coef_reg_den": 1e-4,
        "dropout_rate": 0.5,
        "dense_size": 100,
        "model_name": "cnn_model"
      },
      {
        "in": "y_pred_probas",
        "out": "y_pred_ids",
        "class_name": "proba2labels",
        "max_proba": True
      },
      {
        "in": "y_pred_ids",
        "out": "y_pred_labels",
        "ref": "classes_vocab"
      }
    ],
    "out": [
      "y_pred_labels"
    ]
  },
  "train": {
    "epochs": 10,
    "batch_size": 64,
    "metrics": [
      "sets_accuracy",
      "f1_macro",
      {
        "name": "roc_auc",
        "inputs": ["y_onehot", "y_pred_probas"]
      }
    ],
    "validation_patience": 5,
    "val_every_n_epochs": 1,
    "log_every_n_epochs": 1,
    "show_examples": True,
    "validate_best": True,
    "test_best": False
  }
}
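The `one_hotter` and `proba2labels` steps of this pipeline turn class ids into one-hot targets and class probabilities back into labels. The sketch below mimics that logic in plain Python. It assumes that `one_hotter` with `depth` builds an indicator vector and that `proba2labels` with `max_proba` takes the argmax; it is not DeepPavlov's actual implementation. The helper names and the three-class vocabulary are hypothetical.

```python
# Toy illustration of one-hot encoding and argmax decoding.
# `classes`, `one_hot`, and `probas_to_label` are hypothetical names,
# not DeepPavlov APIs.
classes = ["AddToPlaylist", "GetWeather", "PlayMusic"]  # assumed toy vocabulary


def one_hot(class_id, depth):
    """Return an indicator vector of length `depth` with a 1 at `class_id`."""
    return [1 if i == class_id else 0 for i in range(depth)]


def probas_to_label(probas, vocab):
    """Pick the most probable class id and map it back to its label."""
    best_id = max(range(len(probas)), key=lambda i: probas[i])
    return vocab[best_id]


y_onehot = one_hot(classes.index("GetWeather"), depth=len(classes))
y_pred_label = probas_to_label([0.05, 0.90, 0.05], classes)
print(y_onehot, y_pred_label)
```

In the config itself, `"depth": "#classes_vocab.len"` plays the role of `len(classes)` here: the vector length is taken from the fitted class vocabulary rather than hard-coded.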


In [69]:
# we can train and evaluate the model from the config
m = train_model(cnn_config)

2018-12-13 18:40:38.266 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
2018-12-13 18:40:38.271 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 100: [loading vocabulary from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/classes.dict]
2018-12-13 18:40:38.390 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 89: [saving vocabulary to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/classes.dict]
2018-12-13 18:40:38.391 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `/home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt`]
2018-12-13 18:40:38.391 INFO in 'gensim.models.utils_any2vec'['utils_any2vec'] at line 170: loading projection weights from /home/dilyara/Documents/GitHu

{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.1624, "f1_macro": 0.0796, "roc_auc": 0.42}, "time_spent": "0:00:01", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": ["GetWeather"], "y_true": ["BookRestaurant"]}, {"x": "Rate the current textbook one of 6 stars", "y_predicted": ["GetWeather"], "y_true": ["RateBook"]}, {"x": "find a nearby movie schedule for movies", "y_predicted": ["GetWeather"], "y_true": ["SearchScreeningEvent"]}, {"x": "what is the Mississippi for the week", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "Play me a song from 1968 on Spotify", "y_predicted": ["GetWeather"], "y_true": ["PlayMusic"]}, {"x": "Book a table for me, naomi and elisabeth at a brasserie with wifi", "y_predicted": ["PlayMusic"], "y_true": ["BookRestaurant"]}, {"x": "The current album gets three out of 6 points", "y_predicted": ["GetWeather"], "y_true": ["RateBook"]}, {"x": "find Goodrich Quality Theaters 

2018-12-13 18:41:10.571 INFO in 'deeppavlov.core.commands.train'['train'] at line 565: New best sets_accuracy of 0.939
2018-12-13 18:41:10.573 INFO in 'deeppavlov.core.commands.train'['train'] at line 567: Saving model
2018-12-13 18:41:10.575 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 371: [saving model to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/cnn_model_v1_opt.json]


{"train": {"epochs_done": 1, "batches_seen": 224, "train_examples_seen": 14295, "metrics": {"sets_accuracy": 0.9088, "f1_macro": 0.9089, "roc_auc": 0.9815}, "time_spent": "0:00:08", "examples": [{"x": "Add lisa m to my guitar hero live playlist", "y_predicted": ["AddToPlaylist"], "y_true": ["AddToPlaylist"]}, {"x": "Play Ciribiribin by Sandeep Khare", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "What's the forecast for Spenard, GU?", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "Book a highly rated coffeehouse for four people.", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "Tell me when Howling II: Your Sister Is a Werewolf is playing.", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "Add the tune to bandas sonoras", "y_predicted": ["AddToPlaylist"], "y_true": ["AddToPlaylist"]}, {"x": "Add Hanging On to my just dance by aftercluv playlist.", "y_predicted": ["AddToPlaylist"], "y_true": 

{"train": {"epochs_done": 2, "batches_seen": 448, "train_examples_seen": 28590, "metrics": {"sets_accuracy": 0.9582, "f1_macro": 0.9585, "roc_auc": 0.9974}, "time_spent": "0:00:13", "examples": [{"x": "I give The Monkey and the Tiger a rating of 2 points.", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "Find the weather prediction for Camdeboo-Nationalpark for jan. eleventh, 2037.", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "find Back for Good, a novel I want to read", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchCreativeWork"]}, {"x": "what is the movie schedule for animated movies playing close by", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "Tell me when it'll be hot in Melbourne, NJ", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "Find LaserLight.", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchCreativeWork"]}, {"x": "find animated movies close by with a movie sched

2018-12-13 18:41:14.735 INFO in 'deeppavlov.core.commands.train'['train'] at line 565: New best sets_accuracy of 0.9503
2018-12-13 18:41:14.736 INFO in 'deeppavlov.core.commands.train'['train'] at line 567: Saving model
2018-12-13 18:41:14.737 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 371: [saving model to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/cnn_model_v1_opt.json]


{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9503, "f1_macro": 0.9498, "roc_auc": 0.9971}, "time_spent": "0:00:13", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "Rate the current textbook one of 6 stars", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "find a nearby movie schedule for movies", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "what is the Mississippi for the week", "y_predicted": ["SearchCreativeWork"], "y_true": ["GetWeather"]}, {"x": "Play me a song from 1968 on Spotify", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "Book a table for me, naomi and elisabeth at a brasserie with wifi", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "The current album gets three out of 6 points", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "find Go

2018-12-13 18:41:18.879 INFO in 'deeppavlov.core.commands.train'['train'] at line 565: New best sets_accuracy of 0.9528
2018-12-13 18:41:18.880 INFO in 'deeppavlov.core.commands.train'['train'] at line 567: Saving model
2018-12-13 18:41:18.880 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 371: [saving model to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/cnn_model_v1_opt.json]


{"train": {"epochs_done": 3, "batches_seen": 672, "train_examples_seen": 42885, "metrics": {"sets_accuracy": 0.9633, "f1_macro": 0.9636, "roc_auc": 0.998}, "time_spent": "0:00:17", "examples": [{"x": "play Iheart tunes by Neil Finn", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "Book a reservation for my parents and I at Red Crown Tourist Court in Slovakia", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "I'd like to get reservations for four at a restaurant that serves apple sauce.", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "Give the current novel three stars", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "Play some John Oates on Youtube.", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "book a restaurant in Puerto Rico", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "Give five out of 6 stars to The Plucker", "y_predicted": ["RateBook"], "y_true": ["RateBook

{"train": {"epochs_done": 4, "batches_seen": 896, "train_examples_seen": 57180, "metrics": {"sets_accuracy": 0.9654, "f1_macro": 0.9656, "roc_auc": 0.9982}, "time_spent": "0:00:21", "examples": [{"x": "Please play a song off the Curtis Lee album Rough Diamonds", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "find a movie schedule for United Paramount Theatres", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "Book a restaurant in Papua New Guinea for me and my daughters", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "show me movie schedules", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "looking for Liberalism and the Limits of Justice", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchCreativeWork"]}, {"x": "Add tune to my Para comer", "y_predicted": ["AddToPlaylist"], "y_true": ["AddToPlaylist"]}, {"x": "add Georgetown University Alma Mater to my evening acous

2018-12-13 18:41:23.250 INFO in 'deeppavlov.core.commands.train'['train'] at line 565: New best sets_accuracy of 0.9547
2018-12-13 18:41:23.251 INFO in 'deeppavlov.core.commands.train'['train'] at line 567: Saving model
2018-12-13 18:41:23.252 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 371: [saving model to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/cnn_model_v1_opt.json]


{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9547, "f1_macro": 0.954, "roc_auc": 0.9978}, "time_spent": "0:00:21", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "Rate the current textbook one of 6 stars", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "find a nearby movie schedule for movies", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "what is the Mississippi for the week", "y_predicted": ["SearchCreativeWork"], "y_true": ["GetWeather"]}, {"x": "Play me a song from 1968 on Spotify", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "Book a table for me, naomi and elisabeth at a brasserie with wifi", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "The current album gets three out of 6 points", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "find Goo

2018-12-13 18:41:27.516 INFO in 'deeppavlov.core.commands.train'['train'] at line 565: New best sets_accuracy of 0.9559
2018-12-13 18:41:27.516 INFO in 'deeppavlov.core.commands.train'['train'] at line 567: Saving model
2018-12-13 18:41:27.516 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 371: [saving model to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/cnn_model_v1_opt.json]


{"train": {"epochs_done": 5, "batches_seen": 1120, "train_examples_seen": 71475, "metrics": {"sets_accuracy": 0.9676, "f1_macro": 0.9677, "roc_auc": 0.9984}, "time_spent": "0:00:25", "examples": [{"x": "Give me Slovakia's weather forecast for eight am", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "what time is The Challenge showing at the local movie house", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "rate the saga In Mortal Hands five out of 6 stars", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "Will it be hot in Keachi", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "Buy novel Brokeback Mountain", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchCreativeWork"]}, {"x": "Play ballad music from 1958", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "What will the weather be in East Liberty MN?", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "Will there b

{"train": {"epochs_done": 6, "batches_seen": 1344, "train_examples_seen": 85770, "metrics": {"sets_accuracy": 0.9684, "f1_macro": 0.9685, "roc_auc": 0.9985}, "time_spent": "0:00:30", "examples": [{"x": "rate this current textbook 0 points", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "Find me the book The Van Dyke Show", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchCreativeWork"]}, {"x": "Book a table for nine people next mar.", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "Look for a photograph of Tailwind", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchCreativeWork"]}, {"x": "Play some 1993 Mark Maclaine on Deezer", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "What's the weather close to Cambodia at 05:44:13", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "Know Ye Not Agincourt? gets 4 out of 6 points", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "What will the weathe

2018-12-13 18:41:32.498 INFO in 'deeppavlov.core.commands.train'['train'] at line 565: New best sets_accuracy of 0.9597
2018-12-13 18:41:32.499 INFO in 'deeppavlov.core.commands.train'['train'] at line 567: Saving model
2018-12-13 18:41:32.499 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 371: [saving model to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/cnn_model_v1_opt.json]


{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9597, "f1_macro": 0.9591, "roc_auc": 0.998}, "time_spent": "0:00:30", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "Rate the current textbook one of 6 stars", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "find a nearby movie schedule for movies", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "what is the Mississippi for the week", "y_predicted": ["SearchCreativeWork"], "y_true": ["GetWeather"]}, {"x": "Play me a song from 1968 on Spotify", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "Book a table for me, naomi and elisabeth at a brasserie with wifi", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "The current album gets three out of 6 points", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "find Goo

2018-12-13 18:41:36.785 INFO in 'deeppavlov.core.commands.train'['train'] at line 565: New best sets_accuracy of 0.961
2018-12-13 18:41:36.786 INFO in 'deeppavlov.core.commands.train'['train'] at line 567: Saving model
2018-12-13 18:41:36.786 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 371: [saving model to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/cnn_model_v1_opt.json]


{"train": {"epochs_done": 7, "batches_seen": 1568, "train_examples_seen": 100065, "metrics": {"sets_accuracy": 0.9702, "f1_macro": 0.9703, "roc_auc": 0.9986}, "time_spent": "0:00:35", "examples": [{"x": "I need a bar for four that serves argentinian in D'Iberville, WY for twelve PM", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "where can I find the game A Little Bit of Mambo", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchCreativeWork"]}, {"x": "In Bon Secour National Wildlife Refuge at twelve pm will it be chilly", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "What time is Between Tears and Smiles playing", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "Rate this chronicle a 3", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "play some Bertine Zetlitz record", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "I want to hear Box Of Rain by Skeets Mcdonald", "y_predi

2018-12-13 18:41:40.990 INFO in 'deeppavlov.core.commands.train'['train'] at line 572: Did not improve on the sets_accuracy of 0.961


{"train": {"epochs_done": 8, "batches_seen": 1792, "train_examples_seen": 114360, "metrics": {"sets_accuracy": 0.9708, "f1_macro": 0.9709, "roc_auc": 0.9987}, "time_spent": "0:00:39", "examples": [{"x": "play The Sea Cabinet", "y_predicted": ["PlayMusic"], "y_true": ["SearchCreativeWork"]}, {"x": "I want to hear music from carman from the 1966 album", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "find the television show Birth of the Cool", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchCreativeWork"]}, {"x": "Please provide me with movie schedules.", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "put on a Serge Robert track", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "Give this textbook zero stars", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "put K Maro track on my soul lounge list", "y_predicted": ["AddToPlaylist"], "y_true": ["AddToPlaylist"]}, {"x": "Find a trailer called No Reserv

2018-12-13 18:41:45.526 INFO in 'deeppavlov.core.commands.train'['train'] at line 565: New best sets_accuracy of 0.9622
2018-12-13 18:41:45.527 INFO in 'deeppavlov.core.commands.train'['train'] at line 567: Saving model
2018-12-13 18:41:45.527 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 371: [saving model to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/cnn_model_v1_opt.json]


{"train": {"epochs_done": 9, "batches_seen": 2016, "train_examples_seen": 128655, "metrics": {"sets_accuracy": 0.971, "f1_macro": 0.9712, "roc_auc": 0.9987}, "time_spent": "0:00:43", "examples": [{"x": "add Ik Tara to laundry playlst", "y_predicted": ["AddToPlaylist"], "y_true": ["AddToPlaylist"]}, {"x": "rat the current textbook a two out of 6 points", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "Play music on Iheart", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "Give Anatomy of a Typeface a 1 rating.", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "What is the movie schedule for films nearby", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "I would give The Minority Report a rating of 0 points", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "Show me the song 15 Storeys High ", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchCreativeWork"]}, {"x": "add artist steve cropper to Rh

{"train": {"epochs_done": 10, "batches_seen": 2240, "train_examples_seen": 142950, "metrics": {"sets_accuracy": 0.972, "f1_macro": 0.9722, "roc_auc": 0.9988}, "time_spent": "0:00:48", "examples": [{"x": "Is it going to be hot in Karthaus at 7 AM?", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "I would give French Poets and Novelists a best rating of 6 and a value of three", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "Find movie schedules for Bow Tie Cinemas.", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "What time is Careful, He Might Hear You playing at the cinema", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "What are the movie times for films playing in the area?", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "I'd like to give a two rating to The Abolition of Britain.", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "

2018-12-13 18:41:50.85 INFO in 'deeppavlov.core.commands.train'['train'] at line 565: New best sets_accuracy of 0.9629
2018-12-13 18:41:50.85 INFO in 'deeppavlov.core.commands.train'['train'] at line 567: Saving model
2018-12-13 18:41:50.86 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 371: [saving model to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/cnn_model_v1_opt.json]
2018-12-13 18:41:50.150 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 100: [loading vocabulary from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/classes.dict]
2018-12-13 18:41:50.151 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `/home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt`]
2018-12-13 18:41:50.151 INFO in 'gensim.models.utils_any2vec'['utils_any2vec'] at line 170: loading p

{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9629, "f1_macro": 0.9624, "roc_auc": 0.9983}, "time_spent": "0:00:48", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "Rate the current textbook one of 6 stars", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "find a nearby movie schedule for movies", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "what is the Mississippi for the week", "y_predicted": ["SearchCreativeWork"], "y_true": ["GetWeather"]}, {"x": "Play me a song from 1968 on Spotify", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "Book a table for me, naomi and elisabeth at a brasserie with wifi", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "The current album gets three out of 6 points", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "find Go

2018-12-13 18:42:12.550 INFO in 'gensim.models.utils_any2vec'['utils_any2vec'] at line 232: loaded (400000, 100) matrix from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt
2018-12-13 18:42:12.554 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 287: [initializing `KerasClassificationModel` from saved]
2018-12-13 18:42:12.877 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 297: [loading weights from cnn_model_v1.h5]
2018-12-13 18:42:13.206 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 137: Model was successfully initialized!
Model summary:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None, 

{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9629, "f1_macro": 0.9624, "roc_auc": 0.9983}, "time_spent": "0:00:01", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "Rate the current textbook one of 6 stars", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "find a nearby movie schedule for movies", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "what is the Mississippi for the week", "y_predicted": ["SearchCreativeWork"], "y_true": ["GetWeather"]}, {"x": "Play me a song from 1968 on Spotify", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "Book a table for me, naomi and elisabeth at a brasserie with wifi", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "The current album gets three out of 6 points", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "find Go

2018-12-13 18:42:36.563 INFO in 'gensim.models.utils_any2vec'['utils_any2vec'] at line 232: loaded (400000, 100) matrix from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt
2018-12-13 18:42:36.577 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 287: [initializing `KerasClassificationModel` from saved]
2018-12-13 18:42:36.898 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 297: [loading weights from cnn_model_v1.h5]
2018-12-13 18:42:37.76 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 137: Model was successfully initialized!
Model summary:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None, 1
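The log above alternates between "New best sets_accuracy" and "Did not improve" messages. A minimal sketch of that bookkeeping follows, assuming `validation_patience` counts consecutive validations without improvement before training stops. The function name and the score list are made up for illustration, and this is not DeepPavlov's training loop.

```python
def track_best(scores, patience):
    """Walk through validation scores, remembering the best one.

    Returns (best_score, checkpoints_saved, stopped_early).
    `patience` is the number of consecutive non-improving validations
    tolerated before stopping (an assumption about the semantics).
    """
    best = float("-inf")
    saved = 0
    misses = 0
    for score in scores:
        if score > best:
            best = score
            saved += 1      # corresponds to a "Saving model" log line
            misses = 0
        else:
            misses += 1     # corresponds to a "Did not improve" log line
            if misses >= patience:
                return best, saved, True
    return best, saved, False


# Hypothetical validation scores, not taken from the run above.
result = track_best([0.90, 0.93, 0.92, 0.95], patience=5)
print(result)
```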

In [70]:
# or we can just load the pre-trained model (coincides with what we did above)
m = build_model(cnn_config)

2018-12-13 18:42:37.80 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 100: [loading vocabulary from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/classes.dict]
2018-12-13 18:42:37.81 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `/home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt`]
2018-12-13 18:42:37.81 INFO in 'gensim.models.utils_any2vec'['utils_any2vec'] at line 170: loading projection weights from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt
2018-12-13 18:42:37.82 DEBUG in 'smart_open.smart_open_lib'['smart_open_lib'] at line 149: {'kw': {}, 'mode': 'rb', 'uri': '/home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt'}
2018-12-13 18:42:37.82 DEBUG in 'smart_open.smart_open_lib'['smart_open_lib'] at line 621: encoding_wrapper: {'errors': 'strict', 'encoding'

In [71]:
m(["Is it freezing in Offerman, California?"])

[['GetWeather']]
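Before the sentence reaches the embedder, the `nltk_tokenizer` step with `wordpunct_tokenize` splits it into word and punctuation tokens. NLTK's `wordpunct_tokenize` is based on the regular expression `\w+|[^\w\s]+`, so the same split can be sketched with the standard library alone. The function name below is hypothetical.

```python
import re

# Same pattern NLTK's WordPunctTokenizer uses: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
_WORDPUNCT = re.compile(r"\w+|[^\w\s]+")


def wordpunct_sketch(text):
    """Split text into alphanumeric and punctuation tokens."""
    return _WORDPUNCT.findall(text)


tokens = wordpunct_sketch("Is it freezing in Offerman, California?")
print(tokens)
```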

#### SklearnComponent classifier on GloVe weighted by TF-IDF embeddings from config

In [72]:
logreg_config = {
  "dataset_reader": {
    "class_name": "basic_classification_reader",
    "x": "text",
    "y": "intents",
    "data_path": "snips"
  },
  "dataset_iterator": {
    "class_name": "basic_classification_iterator",
    "seed": 42,
    "split_seed": 23,
    "field_to_split": "train",
    "split_fields": [
      "train",
      "valid"
    ],
    "split_proportions": [
      0.9,
      0.1
    ]
  },
  "chainer": {
    "in": [
      "x"
    ],
    "in_y": [
      "y"
    ],
    "pipe": [
      {
        "id": "classes_vocab",
        "class_name": "simple_vocab",
        "fit_on": [
          "y"
        ],
        "save_path": "./snips/classes.dict",
        "load_path": "./snips/classes.dict",
        "in": "y",
        "out": "y_ids"
      },
      {
        "in": [
          "x"
        ],
        "out": [
          "x_vec"
        ],
        "fit_on": [
          "x",
          "y_ids"
        ],
        "id": "my_tfidf_vectorizer",
        "class_name": "sklearn_component",
        "save_path": "tfidf_v2.pkl",
        "load_path": "tfidf_v2.pkl",
        "model_class": "sklearn.feature_extraction.text:TfidfVectorizer",
        "infer_method": "transform"
      },
      {
        "in": "x",
        "out": "x_tok",
        "id": "my_tokenizer",
        "class_name": "nltk_moses_tokenizer"
      },
      {
        "in": "x_tok",
        "out": "x_emb",
        "id": "my_embedder",
        "class_name": "glove",
        "save_path": "./glove.6B.100d.txt",
        "load_path": "./glove.6B.100d.txt",
        "dim": 100,
        "pad_zero": True
      },
      {
        "class_name": "one_hotter",
        "id": "my_onehotter",
        "depth": "#classes_vocab.len",
        "in": "y_ids",
        "out": "y_onehot",
        "single_vector": True
      },
      {
        "in": "x_tok",
        "out": "x_weighted_emb",
        "class_name": "tfidf_weighted",
        "id": "my_weighted_embedder",
        "embedder": "#my_embedder",
        "tokenizer": "#my_tokenizer",
        "vectorizer": "#my_tfidf_vectorizer",
        "mean": True
      },
      {
        "in": [
          "x_weighted_emb"
        ],
        "out": [
          "y_pred"
        ],
        "fit_on": [
          "x_weighted_emb",
          "y"
        ],
        "class_name": "sklearn_component",
        "main": True,
        "save_path": "logreg_v3.pkl",
        "load_path": "logreg_v3.pkl",
        "model_class": "sklearn.linear_model:LogisticRegression",
        "infer_method": "predict",
        "ensure_list_output": True
      }
    ],
    "out": [
      "y_pred"
    ]
  },
  "train": {
    "epochs": 10,
    "batch_size": 64,
    "metrics": [
      "sets_accuracy"
    ],
    "show_examples": False,
    "validate_best": True,
    "test_best": False
  }
}
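The `tfidf_weighted` step combines the tokenizer, the GloVe embedder, and the TF-IDF vectorizer into one sentence vector. The sketch below shows one way to compute a TF-IDF-weighted mean of token vectors in plain Python. The 2-dimensional vectors and the weights are made-up toy values, the function name is hypothetical, and the exact normalization used by DeepPavlov may differ.

```python
def weighted_mean_embedding(tokens, vectors, weights, dim):
    """Average token vectors, weighting each token by its TF-IDF score.

    Tokens missing from `vectors` or `weights` are skipped; if nothing
    is left, a zero vector is returned.
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok in vectors and tok in weights:
            w = weights[tok]
            total = [t + w * v for t, v in zip(total, vectors[tok])]
            weight_sum += w
    if weight_sum == 0.0:
        return total
    return [t / weight_sum for t in total]


# Toy data (assumed values, not real GloVe vectors or TF-IDF scores).
toy_vectors = {"freezing": [1.0, 0.0], "in": [0.0, 1.0]}
toy_weights = {"freezing": 3.0, "in": 1.0}
sentence_vec = weighted_mean_embedding(["freezing", "in", "offerman"],
                                       toy_vectors, toy_weights, dim=2)
print(sentence_vec)
```

The informative token ("freezing") dominates the result, while a frequent token ("in") contributes less. That is the motivation for weighting by TF-IDF instead of taking a plain mean.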


In [73]:
# we can train and evaluate the model from the config
m = train_model(logreg_config)

2018-12-13 18:43:00.371 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
2018-12-13 18:43:00.374 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 100: [loading vocabulary from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/classes.dict]
2018-12-13 18:43:00.389 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 89: [saving vocabulary to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/classes.dict]
2018-12-13 18:43:00.390 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 164: Initializing model sklearn.feature_extraction.text:TfidfVectorizer from scratch
2018-12-13 18:43:00.433 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 107: Fitting model sklearn.feature_extraction.text:TfidfVectorizer
2018-12-13 18:4

2018-12-13 18:44:12.336 DEBUG in 'smart_open.smart_open_lib'['smart_open_lib'] at line 621: encoding_wrapper: {'errors': 'strict', 'encoding': None, 'mode': 'rb', 'fileobj': <_io.BufferedReader name='/home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt'>}


{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9283}, "time_spent": "0:00:03"}}


2018-12-13 18:44:33.740 INFO in 'gensim.models.utils_any2vec'['utils_any2vec'] at line 232: loaded (400000, 100) matrix from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt
2018-12-13 18:44:33.753 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.linear_model:LogisticRegression from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/logreg_v3.pkl
2018-12-13 18:44:33.753 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 208: Model sklearn.linear_model.logisticLogisticRegression loaded  with parameters


In [74]:
# or we can just load the pre-trained model (this coincides with what we did above)
m = build_model(logreg_config)

2018-12-13 18:44:33.796 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 100: [loading vocabulary from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/classes.dict]
2018-12-13 18:44:33.798 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.feature_extraction.text:TfidfVectorizer from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/tfidf_v2.pkl
2018-12-13 18:44:33.804 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 208: Model sklearn.feature_extraction.textTfidfVectorizer loaded  with parameters
2018-12-13 18:44:33.805 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `/home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt`]
2018-12-13 18:44:33.805 INFO in 'gensim.models.utils_any2vec'['utils_any2vec'] at line 170: loading projection 

In [75]:
m(["Is it freezing in Offerman, California?"])

[['GetWeather']]

In [76]:
# let's free memory
del m

## Bonus: pre-trained CNN model in DeepPavlov

Download the model files (including the 8 GB `wiki.en.bin` embeddings):

! python -m deeppavlov download intents_snips_big

Evaluate metrics on the validation set (no test set is provided):

! python -m deeppavlov evaluate intents_snips_big

Alternatively, the model can be used from Python code:

In [77]:
from pathlib import Path

import deeppavlov
from deeppavlov import build_model, evaluate_model
from deeppavlov.download import deep_download

config_path = Path(deeppavlov.__file__).parent.joinpath('configs/classifiers/intents_snips_big.json')

In [78]:
# let's download all the required data - model files, embeddings, vocabularies
deep_download(config_path)

2018-12-13 18:44:55.284 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 205: Starting new HTTP connection (1): files.deeppavlov.ai:80
2018-12-13 18:44:55.341 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 393: http://files.deeppavlov.ai:80 "GET /datasets/snips_intents/train.csv.md5 HTTP/1.1" 200 44
2018-12-13 18:44:55.346 INFO in 'deeppavlov.download'['download'] at line 115: Skipped http://files.deeppavlov.ai/datasets/snips_intents/train.csv download because of matching hashes
2018-12-13 18:44:55.348 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 205: Starting new HTTP connection (1): files.deeppavlov.ai:80
2018-12-13 18:44:55.540 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 393: http://files.deeppavlov.ai:80 "GET /deeppavlov_data/classifiers/intents_snips_v10.tar.gz.md5 HTTP/1.1" 200 193
2018-12-13 18:44:55.589 INFO in 'deeppavlov.download'['download'] at line 115: Skipped http://files.deeppavlov.ai/deeppavlov_data/classifiers/inte
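
The log above shows downloads being skipped when hashes match. As a rough illustration of that idea (not DeepPavlov's actual implementation), a local file's MD5 digest can be compared with an expected digest before re-downloading. The helper names `file_md5` and `needs_download` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path, expected_md5):
    """True if the file is missing or its digest differs from the expected one."""
    path = Path(path)
    return not path.is_file() or file_md5(path) != expected_md5


# Demo on a temporary file instead of a real dataset.
with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / "train.csv"
    local.write_bytes(b"text,intents\n")
    expected = hashlib.md5(b"text,intents\n").hexdigest()
    print(needs_download(local, expected))   # hashes match, download can be skipped
    print(needs_download(local, "0" * 32))   # digest differs, download is needed
```

Reading the file in chunks keeps memory use flat, which matters for multi-gigabyte files such as the embeddings.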

In [79]:
# now we can initialize the model
m = build_model(config_path)

2018-12-13 18:45:11.621 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 100: [loading vocabulary from /home/dilyara/.deeppavlov/models/classifiers/intents_snips_v10/classes.dict]
2018-12-13 18:45:11.632 INFO in 'deeppavlov.models.embedders.fasttext_embedder'['fasttext_embedder'] at line 52: [loading fastText embeddings from `/home/dilyara/.deeppavlov/downloads/embeddings/wiki.en.bin`]
2018-12-13 18:45:32.229 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 287: [initializing `KerasClassificationModel` from saved]
2018-12-13 18:45:32.554 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 297: [loading weights from model.h5]
2018-12-13 18:45:32.772 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 137: Model was successfully initialized!
Model summary:
______________________________________________________

In [80]:
m(["Is it freezing in Offerman, California?"])

[['GetWeather']]

In [81]:
# let's free memory
del m

In [82]:
# or we can evaluate the model WITHOUT training
evaluate_model(config_path)

2018-12-13 18:45:33.676 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
2018-12-13 18:45:33.679 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 100: [loading vocabulary from /home/dilyara/.deeppavlov/models/classifiers/intents_snips_v10/classes.dict]
2018-12-13 18:45:33.680 INFO in 'deeppavlov.models.embedders.fasttext_embedder'['fasttext_embedder'] at line 52: [loading fastText embeddings from `/home/dilyara/.deeppavlov/downloads/embeddings/wiki.en.bin`]
2018-12-13 18:45:54.568 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 287: [initializing `KerasClassificationModel` from saved]
2018-12-13 18:45:54.913 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 297: [loading weights from model.h5]
2018-12-13 18:45:55.112 INFO in 'deepp

{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9824, "f1_macro": 0.982, "roc_auc": 0.9986}, "time_spent": "0:00:01"}}


{'valid': OrderedDict([('sets_accuracy', 0.9824),
              ('f1_macro', 0.982),
              ('roc_auc', 0.9986)])}
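
The `sets_accuracy` metric reported above treats each sample's labels as a set, so it applies to single-label outputs such as `[['GetWeather']]` as well as multi-label ones. A minimal sketch of this idea, assuming an exact-match comparison of label sets per sample (the function name `sets_accuracy_sketch` and the toy labels are hypothetical, and this is not DeepPavlov's own code):

```python
def sets_accuracy_sketch(y_true, y_pred):
    """Share of samples whose predicted label set equals the true label set."""
    if not y_true:
        return 0.0
    matches = sum(set(t) == set(p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Toy labels: the third prediction is wrong, the other three are correct.
y_true = [["GetWeather"], ["PlayMusic"], ["BookRestaurant"], ["GetWeather"]]
y_pred = [["GetWeather"], ["PlayMusic"], ["GetWeather"], ["GetWeather"]]
print(sets_accuracy_sketch(y_true, y_pred))  # 0.75
```

Because the comparison is on sets, the order of labels within one sample does not affect the score.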