## You can also run the notebook in [COLAB](https://colab.research.google.com/github/deepmipt/DeepPavlov/blob/master/examples/classification_tutorial.ipynb).

In [1]:
!pip3 install deeppavlov

# Classification on DeepPavlov

**Task**:
Intent recognition on the SNIPS dataset: https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines that has already been converted to `csv` format and can be downloaded from http://files.deeppavlov.ai/datasets/snips_intents/train.csv

FastText English word embeddings ~8Gb: http://files.deeppavlov.ai/deeppavlov_data/embeddings/wiki.en.bin

## Plan of the notebook with documentation links:

1. [Data aggregation](#Data-aggregation)
     * [DatasetReader](#DatasetReader): [docs link](https://deeppavlov.readthedocs.io/en/latest/apiref/dataset_readers.html)
     * [DatasetIterator](#DatasetIterator): [docs link](https://deeppavlov.readthedocs.io/en/latest/apiref/dataset_iterators.html)
2. [Data preprocessing](#Data-preprocessing): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/data_processors.html)
     * [Lowercasing](#Lowercasing)
     * [Tokenization](#Tokenization)
     * [Vocabulary](#Vocabulary)
3. [Featurization](#Featurization): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/data_processors.html), [pre-trained embeddings link](https://deeppavlov.readthedocs.io/en/latest/intro/pretrained_vectors.html)
    * [Bag-of-words embedder](#Bag-of-words)
    * [TF-IDF vectorizer](#TF-IDF-Vectorizer)
    * [GloVe embedder](#GloVe-embedder)
    * [Mean GloVe embedder](#Mean-GloVe-embedder)
    * [GloVe weighted by TF-IDF embedder](#GloVe-weighted-by-TF-IDF-embedder)
4. [Models](#Models): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/classifiers.html)
    * [Building models in python](#Models-in-python)
        - [Sklearn component classifiers](#SklearnComponent-classifier-on-Tfidf-features-in-python)
        - [Keras classification model on GloVe emb](#KerasClassificationModel-on-GloVe-embeddings-in-python)
        - [Sklearn component classifier on GloVe weighted emb](#SklearnComponent-classifier-on-GloVe-weighted-by-TF-IDF-embeddings-in-python)
    * [Building models from configs](#Models-from-configs)
        - [Sklearn component classifiers](#SklearnComponent-classifier-on-Tfidf-features-from-config)
        - [Keras classification model](#KerasClassificationModel-on-GloVe-embeddings-from-config)
        - [Sklearn component classifier on GloVe weighted emb](#SklearnComponent-classifier-on-GloVe-weighted-by-TF-IDF-embeddings-from-config)
    * [Bonus: pre-trained CNN model in DeepPavlov](#Bonus:-pre-trained-CNN-model-in-DeepPavlov)

## Data aggregation

First, let's download the data we will work with and take a look at it.

In [1]:
from deeppavlov.core.data.utils import simple_download

# download the SNIPS training data file
simple_download(url="http://files.deeppavlov.ai/datasets/snips_intents/train.csv", 
                destination="./snips/train.csv")

2019-02-12 12:14:21.101 INFO in 'deeppavlov.core.data.utils'['utils'] at line 63: Downloading from http://files.deeppavlov.ai/datasets/snips_intents/train.csv to snips/train.csv
100%|██████████| 981k/981k [00:00<00:00, 63.5MB/s]


In [2]:
! head -n 15 snips/train.csv

text,intents
Add another song to the Cita Romántica playlist. ,AddToPlaylist
add clem burke in my playlist Pre-Party R&B Jams,AddToPlaylist
Add Live from Aragon Ballroom to Trapeo,AddToPlaylist
add Unite and Win to my night out,AddToPlaylist
Add track to my Digster Future Hits,AddToPlaylist
add the piano bar to my Cindy Wilson,AddToPlaylist
Add Spanish Harlem Incident to cleaning the house,AddToPlaylist
add The Greyest of Blue Skies in Indie Español my playlist,AddToPlaylist
Add the name kids in the street to the plylist New Indie Mix,AddToPlaylist
add album radar latino,AddToPlaylist
Add Tranquility to the Latin Pop Rising playlist. ,AddToPlaylist
Add d flame to the Dcode2016 playlist.,AddToPlaylist
Add album to my fairy tales,AddToPlaylist
I need another artist in the New Indie Mix playlist. ,AddToPlaylist


### DatasetReader

Read the data using `BasicClassificationDatasetReader` from DeepPavlov.

In [3]:
from deeppavlov.dataset_readers.basic_classification_reader import BasicClassificationDatasetReader

In [4]:
# read data from particular columns of `.csv` file
dr = BasicClassificationDatasetReader().read(
    data_path='./snips/',
    train='train.csv',
    x = 'text',
    y = 'intents'
)



We don't have a ready train/valid/test split.

In [5]:
# check train/valid/test sizes
[(k, len(dr[k])) for k in dr.keys()]

[('train', 15884), ('valid', 0), ('test', 0)]

### DatasetIterator

Use `BasicClassificationDatasetIterator` to split `train` into `train` and `valid` and to generate batches of samples.

In [6]:
from deeppavlov.dataset_iterators.basic_classification_iterator import BasicClassificationDatasetIterator

In [7]:
# initialize data iterator splitting `train` field to `train` and `valid` in proportion 0.8/0.2
train_iterator = BasicClassificationDatasetIterator(
    data=dr,
    field_to_split='train',  # field that will be split
    split_fields=['train', 'valid'],   # fields into which the field above will be split
    split_proportions=[0.8, 0.2],  # proportions for splitting
    split_seed=23,  # seed for splitting dataset
    seed=42)  # seed for iteration over dataset

2019-02-12 12:14:23.557 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>


Let's look at some training samples.

In [8]:
# one can get train instances (or any other data type including `all`)
x_train, y_train = train_iterator.get_instances(data_type='train')
for x, y in list(zip(x_train, y_train))[:5]:
    print('x:', x)
    print('y:', y)
    print('=================')

x: Is it freezing in Offerman, California?
y: ['GetWeather']
x: put this song in the playlist Trap Land
y: ['AddToPlaylist']
x: show me a textbook with a rating of 2 and a maximum rating of 6 that is current
y: ['RateBook']
x: Will the weather be okay in Northern Luzon Heroes Hill National Park 4 and a half months from now?
y: ['GetWeather']
x: Rate the current album a four
y: ['RateBook']
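The seeded split performed by the iterator can be pictured with a small stand-alone sketch. It is not DeepPavlov's implementation: `split_field` and the toy data are hypothetical names introduced here, and the exact shuffling and rounding used by the library may differ.

```python
import random

def split_field(samples, proportions, seed):
    """Shuffle `samples` with a fixed seed and cut them into consecutive parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for i, share in enumerate(proportions):
        # the last part takes everything that is left, so no sample is dropped
        is_last = i == len(proportions) - 1
        end = len(shuffled) if is_last else start + int(share * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts

# toy (text, labels) pairs standing in for the SNIPS samples
toy = [("text %d" % i, ["Label"]) for i in range(10)]
train_part, valid_part = split_field(toy, proportions=[0.8, 0.2], seed=23)
print(len(train_part), len(valid_part))  # 8 2
```

Fixing the seed makes the split reproducible: calling `split_field` again with the same arguments returns the same two parts.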


## Data preprocessing

We will be using lowercasing and tokenization as data preparation. 

DeepPavlov also contains several other preprocessors and tokenizers.

### Lowercasing

`str_lower` lowercases texts.

In [9]:
from deeppavlov.models.preprocessors.str_lower import str_lower

[nltk_data] Downloading package punkt to /home/vimary/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/vimary/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package perluniprops to
[nltk_data]     /home/vimary/nltk_data...
[nltk_data]   Package perluniprops is already up-to-date!
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /home/vimary/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!


In [10]:
str_lower(['Is it freezing in Offerman, California?'])

['is it freezing in offerman, california?']

### Tokenization

`NLTKMosesTokenizer` splits strings into tokens.

In [11]:
from deeppavlov.models.tokenizers.nltk_moses_tokenizer import NLTKMosesTokenizer

In [12]:
tokenizer = NLTKMosesTokenizer()
tokenizer(['Is it freezing in Offerman, California?'])

[['Is', 'it', 'freezing', 'in', 'Offerman', ',', 'California', '?']]

Let's preprocess the whole `train` part of the dataset.

In [13]:
train_x_lower_tokenized = str_lower(tokenizer(train_iterator.get_instances(data_type='train')[0]))
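For readers without DeepPavlov installed, the two preprocessing steps can be approximated in plain Python. This is only a sketch: `lower_and_tokenize` is a hypothetical helper, and it uses the regular expression behind NLTK's `wordpunct_tokenize`, which may split some texts differently from the Moses tokenizer used above.

```python
import re

def lower_and_tokenize(batch):
    """Lowercase each text, then split it into word and punctuation tokens."""
    # runs of word characters, or runs of non-space punctuation
    pattern = re.compile(r"\w+|[^\w\s]+")
    return [pattern.findall(text.lower()) for text in batch]

tokens = lower_and_tokenize(['Is it freezing in Offerman, California?'])
print(tokens)
# [['is', 'it', 'freezing', 'in', 'offerman', ',', 'california', '?']]
```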

### Vocabulary

Now we are ready to use vocabularies. They are very useful for:
* extracting class labels and converting labels to indices and vice versa,
* building vocabularies of characters or tokens.

In [14]:
from deeppavlov.core.data.simple_vocab import SimpleVocabulary

In [15]:
# initialize a simple vocabulary to collect all classes that appear in the dataset
classes_vocab = SimpleVocabulary(
    save_path='./snips/classes.dict',
    load_path='./snips/classes.dict')

In [16]:
classes_vocab.fit((train_iterator.get_instances(data_type='train')[1]))
classes_vocab.save()

2019-02-12 12:14:25.35 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 89: [saving vocabulary to /home/vimary/ipavlov/Pilot/examples/tutorials/snips/classes.dict]


Let's see what classes the dataset contains and their indices in the vocabulary.

In [17]:
list(classes_vocab.items())

[('GetWeather', 0),
 ('PlayMusic', 1),
 ('SearchScreeningEvent', 2),
 ('BookRestaurant', 3),
 ('RateBook', 4),
 ('SearchCreativeWork', 5),
 ('AddToPlaylist', 6)]

In [18]:
# one can also collect a vocabulary of textual tokens that appear 2 or more times in the dataset
token_vocab = SimpleVocabulary(
    save_path='./snips/tokens.dict',
    load_path='./snips/tokens.dict',
    min_freq=2,
    special_tokens=('<PAD>', '<UNK>',),
    unk_token='<UNK>')

In [19]:
token_vocab.fit(train_x_lower_tokenized)
token_vocab.save()

2019-02-12 12:14:25.157 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 89: [saving vocabulary to /home/vimary/ipavlov/Pilot/examples/tutorials/snips/tokens.dict]


In [20]:
# number of tokens in dictionary
len(token_vocab)

4564

In [21]:
# 10 most common words and the number of times they appeared
token_vocab.freqs.most_common()[:10]

[('the', 6953),
 ('a', 3917),
 ('in', 3265),
 ('to', 3203),
 ('for', 2814),
 ('of', 2401),
 ('.', 2400),
 ('i', 2079),
 ('at', 1935),
 ('play', 1703)]

In [22]:
token_ids = token_vocab(str_lower(tokenizer(['Is it freezing in Offerman, California?'])))
token_ids

[[13, 36, 244, 4, 1, 29, 996, 20]]

In [23]:
tokenizer(token_vocab(token_ids))

['is it freezing in <UNK>, california?']
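The behaviour seen above (special tokens first, a frequency cut-off, and `<UNK>` for unseen tokens) can be sketched with a toy class. `ToyVocab` is a hypothetical illustration and not the `SimpleVocabulary` implementation; in particular, the ordering of tokens with equal frequency is an assumption of this sketch.

```python
from collections import Counter

class ToyVocab:
    """Token <-> index mapping with special tokens and a frequency cut-off."""

    def __init__(self, special_tokens=('<PAD>', '<UNK>'), unk_token='<UNK>', min_freq=2):
        self.itos = list(special_tokens)  # index -> token
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}  # token -> index
        self.unk_index = self.stoi[unk_token]
        self.min_freq = min_freq

    def fit(self, tokenized_batch):
        freqs = Counter(tok for sample in tokenized_batch for tok in sample)
        # most frequent tokens get the smallest indices after the special tokens
        for tok, count in freqs.most_common():
            if count >= self.min_freq and tok not in self.stoi:
                self.stoi[tok] = len(self.itos)
                self.itos.append(tok)

    def encode(self, tokenized_batch):
        return [[self.stoi.get(tok, self.unk_index) for tok in sample]
                for sample in tokenized_batch]

    def decode(self, ids_batch):
        return [[self.itos[i] for i in sample] for sample in ids_batch]

vocab = ToyVocab()
vocab.fit([['play', 'a', 'song'], ['play', 'a', 'movie']])
ids = vocab.encode([['play', 'a', 'podcast']])
print(ids)                # [[2, 3, 1]]
print(vocab.decode(ids))  # [['play', 'a', '<UNK>']]
```

Here `song` and `movie` occur only once, so with `min_freq=2` they are left out, and the unseen `podcast` is mapped to the `<UNK>` index.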

## Featurization

This part covers several possible ways to featurize text samples. One can choose any appropriate vectorizer/embedder according to the available resources and the given task.

Bag-of-words (BoW) and TF-IDF vectorizers convert text samples to vectors (one vector per sample), while the fastText, GloVe, and TF-IDF-weighted embedders produce either an embedding vector per token or an embedding vector per text sample (if `mean` is set to True).

### Bag-of-words

Maps each text sample to a binary vector $v$ that indicates which words appear in the sample: text -> $v$ = \[0, 1, 0, 0, 0, 1, ..., 1, 0, 1\].

The dimensionality of $v$ is equal to the vocabulary size.

$v_i = 1$ if word $i$ is in the text,

$v_i = 0$ otherwise.

In [24]:
import numpy as np
from deeppavlov.models.embedders.bow_embedder import BoWEmbedder

In [25]:
# initialize bag-of-words embedder giving total number of tokens
bow = BoWEmbedder(depth=token_vocab.len)
# it assumes indexed tokenized samples
bow(token_vocab(str_lower(tokenizer(['Is it freezing in Offerman, California?']))))

[array([0, 1, 0, ..., 0, 0, 0], dtype=int32)]

In [26]:
# all 8 tokens map to distinct vocabulary indices (the unknown `offerman` maps to `<UNK>`)
sum(bow(token_vocab(str_lower(tokenizer(['Is it freezing in Offerman, California?']))))[0])

8
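The definition of $v$ can be written out in a few lines of plain Python. This is a sketch of the idea only; `bag_of_words` is a hypothetical helper, not the `BoWEmbedder` code.

```python
def bag_of_words(token_ids, vocab_size):
    """Return a binary vector with 1 at every index that occurs in `token_ids`."""
    vector = [0] * vocab_size
    for idx in token_ids:
        vector[idx] = 1  # repeated indices do not increase the value
    return vector

v = bag_of_words([3, 1, 3, 5], vocab_size=6)
print(v)       # [0, 1, 0, 1, 0, 1]
print(sum(v))  # 3
```

The sum counts distinct indices, which is why a sample with a repeated token would give a sum smaller than its number of tokens.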

### TF-IDF Vectorizer

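Maps each text sample to a vector: text -> vector $v$ from $\mathbb{R}^N$, where $N$ is the vocabulary size.

$\text{TF-IDF}(token, document) = TF(token, document) \cdot IDF(token, all\_documents)$

$TF$ is the term frequency:

$TF(token, document) = \frac{n_{token}}{\sum_{k}n_k},$

where $n_{token}$ is the number of times the token occurs in the document and the sum runs over all tokens $k$ of that document.

$IDF$ is the inverse document frequency: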

$IDF(token, all\_documents) = \frac{Total\ number\ of\ documents}{number\ of\ documents\ where\ token\ appeared}.$
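A short worked example may help to check the formulas by hand. The sketch below implements exactly the ratios written above on a toy corpus; the helper names `tf` and `idf` and the documents are made up for illustration.

```python
from collections import Counter

def tf(token, document):
    """Share of `token` among all tokens of one tokenized document."""
    return Counter(document)[token] / len(document)

def idf(token, all_documents):
    """Number of documents divided by the number of documents containing `token`."""
    containing = sum(1 for doc in all_documents if token in doc)
    return len(all_documents) / containing

docs = [['play', 'a', 'jazz', 'song'],
        ['book', 'a', 'table'],
        ['play', 'jazz'],
        ['rate', 'this', 'book']]

# 'a': 1 of 4 tokens in docs[0], present in 2 of 4 documents -> 0.25 * 2.0
print(tf('a', docs[0]) * idf('a', docs))        # 0.5
# 'song': 1 of 4 tokens in docs[0], present in 1 of 4 documents -> 0.25 * 4.0
print(tf('song', docs[0]) * idf('song', docs))  # 1.0
```

The rarer token `song` gets the higher weight. Note that scikit-learn's `TfidfVectorizer`, with its default settings, uses a smoothed logarithmic IDF and L2-normalizes each row, so its values will not match these raw ratios.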

`SklearnComponent` in DeepPavlov is a universal wrapper for any vectorizer/estimator from the `sklearn` package. The only requirement is that the model class and the name of the inference method are passed as parameters.

In [27]:
from deeppavlov.models.sklearn import SklearnComponent

In [28]:
# initialize TF-IDF vectorizer sklearn component with `transform` as infer method
tfidf = SklearnComponent(
    model_class="sklearn.feature_extraction.text:TfidfVectorizer",
    infer_method="transform",
    save_path='./tfidf_v0.pkl',
    load_path='./tfidf_v0.pkl',
    mode='train')

2019-02-12 12:14:25.269 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 165: Initializing model sklearn.feature_extraction.text:TfidfVectorizer from scratch


In [29]:
# fit on textual train instances and save it
tfidf.fit(str_lower(train_iterator.get_instances(data_type='train')[0]))
tfidf.save()

2019-02-12 12:14:25.296 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 108: Fitting model sklearn.feature_extraction.text:TfidfVectorizer
2019-02-12 12:14:25.395 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 240: Saving model to /home/vimary/ipavlov/Pilot/examples/tutorials/tfidf_v0.pkl


In [30]:
tfidf(str_lower(['Is it freezing in Offerman, California?']))

<1x10709 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [31]:
# number of tokens in the TF-IDF vocabulary
len(tfidf.model.vocabulary_)

10709

### GloVe embedder

[GloVe](https://nlp.stanford.edu/projects/glove/) is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

In [32]:
from deeppavlov.models.embedders.glove_embedder import GloVeEmbedder

Using TensorFlow backend.


Let's download the GloVe embedding file.

In [33]:
simple_download(url="http://files.deeppavlov.ai/embeddings/glove.6B.100d.txt", 
                destination="./glove.6B.100d.txt")

2019-02-12 12:14:26.153 INFO in 'deeppavlov.core.data.utils'['utils'] at line 63: Downloading from http://files.deeppavlov.ai/embeddings/glove.6B.100d.txt to glove.6B.100d.txt
347MB [00:06, 50.0MB/s] 


In [34]:
embedder = GloVeEmbedder(load_path='./glove.6B.100d.txt',
                         dim=100, pad_zero=True)

2019-02-12 12:14:33.99 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `/home/vimary/ipavlov/Pilot/examples/tutorials/glove.6B.100d.txt`]


In [35]:
# output shape is (batch_size x max_num_tokens_in_the_batch x embedding_dim)
embedded_batch = embedder(str_lower(tokenizer(['Is it freezing in Offerman, California?']))) 
len(embedded_batch), len(embedded_batch[0]), embedded_batch[0][0].shape

(1, 8, (100,))

### Mean GloVe embedder

The embedder returns a vector per token, while we want a vector per text sample. Therefore, let's calculate the mean of the token embeddings.
To do so, we can either initialize `GloVeEmbedder` with `mean=True` (`mean=False` is the default) or pass `mean=True` when calling the embedder (in that case the value applies to that call only).

In [36]:
# output shape is (batch_size x embedding_dim)
embedded_batch = embedder(str_lower(tokenizer(['Is it freezing in Offerman, California?'])), mean=True) 
len(embedded_batch), embedded_batch[0].shape

(1, (100,))
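What averaging does to the shapes can be seen on a tiny example. The sketch below uses made-up 3-dimensional vectors instead of real GloVe embeddings, and `mean_embedding` is a hypothetical helper.

```python
def mean_embedding(token_vectors):
    """Average the per-token vectors of one sample into a single vector."""
    dim = len(token_vectors[0])
    n_tokens = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / n_tokens for d in range(dim)]

# hypothetical embeddings for a two-token sample: shape (2 tokens x 3 dims)
sample = [[1.0, 0.0, 2.0],
          [3.0, 2.0, 0.0]]
print(mean_embedding(sample))  # [2.0, 1.0, 1.0]
```

The token axis disappears, which matches the change from `(batch_size x max_num_tokens_in_the_batch x embedding_dim)` to `(batch_size x embedding_dim)` in the cells above.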

### GloVe weighted by TF-IDF embedder

One possible way to combine a TF-IDF vectorizer with a token embedder is to weight token embeddings by their TF-IDF coefficients. Setting `mean=True` is required to obtain one such weighted vector per sample; **by default** the embedder still returns per-token embeddings.

In [37]:
from deeppavlov.models.embedders.tfidf_weighted_embedder import TfidfWeightedEmbedder

In [38]:
weighted_embedder = TfidfWeightedEmbedder(
    embedder=embedder,  # our GloVe embedder instance
    tokenizer=tokenizer,  # our tokenizer instance
    mean=True,  # to return one vector per sample
    vectorizer=tfidf  # our TF-IDF vectorizer
)

In [39]:
# output shape is (batch_size x  embedding_dim)
embedded_batch = weighted_embedder(str_lower(tokenizer(['Is it freezing in Offerman, California?']))) 
len(embedded_batch), embedded_batch[0].shape

(1, (100,))
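The weighting idea can be illustrated with a plain weighted average. This is a sketch under assumptions: the vectors and weights are made up, `weighted_mean_embedding` is a hypothetical helper, and normalizing by the sum of the weights is a choice of this sketch that may differ from what `TfidfWeightedEmbedder` does internally.

```python
def weighted_mean_embedding(token_vectors, weights):
    """Average token vectors with per-token weights (e.g. TF-IDF coefficients)."""
    total = sum(weights)
    dim = len(token_vectors[0])
    return [sum(w * vec[d] for vec, w in zip(token_vectors, weights)) / total
            for d in range(dim)]

# hypothetical 2-dimensional embeddings and TF-IDF weights for two tokens
vectors = [[1.0, 0.0],
           [3.0, 4.0]]
print(weighted_mean_embedding(vectors, weights=[1.0, 3.0]))  # [2.5, 3.0]
```

Compared with the unweighted mean `[2.0, 2.0]`, the result is pulled toward the second token, which has the larger weight.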

## Models

In [40]:
from deeppavlov.metrics.accuracy import sets_accuracy

In [41]:
# get all train and valid data from iterator
x_train, y_train = train_iterator.get_instances(data_type="train")
x_valid, y_valid = train_iterator.get_instances(data_type="valid")

### Models in python

#### SklearnComponent classifier on Tfidf-features in python

In [42]:
# initialize sklearn classifier; any of the classifier's parameters (such as `C`) can be passed
cls = SklearnComponent(
    model_class="sklearn.linear_model:LogisticRegression",
    infer_method="predict",
    save_path='./logreg_v0.pkl',
    load_path='./logreg_v0.pkl',
    C=1,
    mode='train')

2019-02-12 12:14:53.75 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 165: Initializing model sklearn.linear_model:LogisticRegression from scratch


In [43]:
# fit sklearn classifier and save it
cls.fit(tfidf(x_train), y_train)
cls.save()

2019-02-12 12:14:53.591 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 108: Fitting model sklearn.linear_model:LogisticRegression
2019-02-12 12:14:53.756 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 240: Saving model to /home/vimary/ipavlov/Pilot/examples/tutorials/logreg_v0.pkl


In [44]:
y_valid_pred = cls(tfidf(x_valid))

In [45]:
# Let's look at the obtained result
print("Text sample: {}".format(x_valid[0]))
print("True label: {}".format(y_valid[0]))
print("Predicted label: {}".format(y_valid_pred[0]))

Text sample: I need seating at Floating restaurant in Tennessee for a group of 9
True label: ['BookRestaurant']
Predicted label: BookRestaurant


In [46]:
# let's calculate sets accuracy (because each element is a list of labels)
sets_accuracy(np.squeeze(y_valid), y_valid_pred)

0.982373308152345
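As a rough picture of what a set-based accuracy measures, the sketch below counts a sample as correct when its predicted label set equals its true label set. This is an assumption about the metric's behaviour written for illustration; `toy_sets_accuracy` is a hypothetical helper, not the DeepPavlov function.

```python
def toy_sets_accuracy(y_true, y_pred):
    """Share of samples whose predicted label set equals the true label set."""
    correct = sum(set(t) == set(p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [['GetWeather'], ['RateBook'], ['PlayMusic'], ['AddToPlaylist']]
y_pred = [['GetWeather'], ['RateBook'], ['AddToPlaylist'], ['AddToPlaylist']]
print(toy_sets_accuracy(y_true, y_pred))  # 0.75
```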

#### KerasClassificationModel on GloVe embeddings in python

In [47]:
from deeppavlov.models.classifiers.keras_classification_model import KerasClassificationModel
from deeppavlov.models.preprocessors.one_hotter import OneHotter
from deeppavlov.models.classifiers.proba2labels import Proba2Labels

In [48]:
# Initialize `KerasClassificationModel` that builds a shallow-and-wide CNN
# (named here `cnn_model`)
cls = KerasClassificationModel(save_path="./cnn_model_v0", 
                               load_path="./cnn_model_v0", 
                               embedding_size=embedder.dim,
                               n_classes=classes_vocab.len,
                               model_name="cnn_model",
                               text_size=15, # number of tokens
                               kernel_sizes_cnn=[3, 5, 7],
                               filters_cnn=128,
                               dense_size=100,
                               optimizer="Adam",
                               learning_rate=0.1,
                               learning_rate_decay=0.01,
                               loss="categorical_crossentropy")

2019-02-12 12:14:54.421 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 272: [initializing `KerasClassificationModel` from scratch as cnn_model]
2019-02-12 12:14:54.818 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 136: Model was successfully initialized!
Model summary:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 15, 100)      0                                            
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 15, 128)      38528       input_1[0][0]                    
__________________________________________________________________________________________________
conv1d_2 (Conv1D)  

In [49]:
# `KerasClassificationModel` expects a one-hot encoded class distribution per sample.
# `OneHotter` converts indices to one-hot vector representations.
# To obtain indices we can use the `classes_vocab` initialized and fitted above
onehotter = OneHotter(depth=classes_vocab.len, single_vector=True)

In [50]:
# Train for 10 epochs
for ep in range(10):
    for x, y in train_iterator.gen_batches(batch_size=64, 
                                           data_type="train"):
        x_embed = embedder(tokenizer(str_lower(x)))
        y_onehot = onehotter(classes_vocab(y))
        cls.train_on_batch(x_embed, y_onehot)

In [51]:
# Save model weights and parameters
cls.save()

2019-02-12 12:15:22.184 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 386: [saving model to /home/vimary/ipavlov/Pilot/examples/tutorials/cnn_model_v0_opt.json]


In [52]:
# Inferring on validation data gives a probability distribution over classes for each sample.
y_valid_pred = cls(embedder(tokenizer(str_lower(x_valid))))

In [53]:
# To convert probability distribution to labels, 
# we first need to convert probabilities to indices,
# and then using vocabulary `classes_vocab` convert indices to labels.
# 
# `Proba2Labels` converts probabilities to indices and supports three different modes:
# if `max_proba` is true, returns indices of the highest probabilities
# if `confidence_threshold` is given, returns indices with probabilities higher than the threshold
# if `top_n` is given, returns `top_n` indices with highest probabilities
prob2labels = Proba2Labels(max_proba=True)
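The three modes listed in the comments above can be sketched in plain Python for a single probability vector. `proba_to_indices` is a hypothetical helper; its priority between the modes and its return type (always a list) are assumptions of this sketch and may differ from `Proba2Labels`.

```python
def proba_to_indices(probas, max_proba=False, confidence_threshold=None, top_n=None):
    """Turn one probability vector into class indices using one of three modes."""
    if confidence_threshold is not None:
        # every class whose probability exceeds the threshold
        return [i for i, p in enumerate(probas) if p > confidence_threshold]
    # class indices sorted from the most to the least probable
    ranked = sorted(range(len(probas)), key=lambda i: probas[i], reverse=True)
    if max_proba:
        return ranked[:1]
    if top_n is not None:
        return ranked[:top_n]
    raise ValueError("choose max_proba, confidence_threshold or top_n")

probas = [0.05, 0.6, 0.35]
print(proba_to_indices(probas, max_proba=True))            # [1]
print(proba_to_indices(probas, confidence_threshold=0.3))  # [1, 2]
print(proba_to_indices(probas, top_n=2))                   # [1, 2]
```

The threshold mode can return several classes or none at all, which is useful for multi-label tasks; for the single-label intents in this notebook, the maximum-probability mode is the natural choice.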

In [54]:
# Let's look at the obtained result
print("Text sample: {}".format(x_valid[0]))
print("True label: {}".format(y_valid[0]))
print("Predicted probability distribution: {}".format(dict(zip(classes_vocab.keys(), 
                                                               y_valid_pred[0]))))
print("Predicted label: {}".format(classes_vocab(prob2labels(y_valid_pred))[0]))

Text sample: I need seating at Floating restaurant in Tennessee for a group of 9
True label: ['BookRestaurant']
Predicted probability distribution: {'GetWeather': 4.443174475454725e-05, 'PlayMusic': 0.0002085473679471761, 'SearchScreeningEvent': 6.492184911621734e-05, 'BookRestaurant': 0.9995043277740479, 'RateBook': 0.00021818796813022345, 'SearchCreativeWork': 0.0013526129769161344, 'AddToPlaylist': 8.029041782720014e-05}
Predicted label: ['BookRestaurant']


In [55]:
# calculate sets accuracy
sets_accuracy(y_valid, classes_vocab(prob2labels(y_valid_pred)))

0.982373308152345

#### SklearnComponent classifier on GloVe weighted by TF-IDF embeddings in python

In [56]:
# initialize sklearn classifier; any of the classifier's parameters (such as `C`) can be passed
cls = SklearnComponent(
    model_class="sklearn.linear_model:LogisticRegression",
    infer_method="predict",
    save_path='./logreg_v1.pkl',
    load_path='./logreg_v1.pkl',
    C=1,
    mode='train')

2019-02-12 12:15:22.962 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 165: Initializing model sklearn.linear_model:LogisticRegression from scratch


In [57]:
# fit sklearn classifier and save it
cls.fit(weighted_embedder(str_lower(tokenizer(x_train))), y_train)
cls.save()

2019-02-12 12:15:44.521 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 108: Fitting model sklearn.linear_model:LogisticRegression
2019-02-12 12:15:46.59 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 240: Saving model to /home/vimary/ipavlov/Pilot/examples/tutorials/logreg_v1.pkl


In [58]:
y_valid_pred = cls(weighted_embedder(str_lower(tokenizer(x_valid))))

In [59]:
# Let's look at the obtained result
print("Text sample: {}".format(x_valid[0]))
print("True label: {}".format(y_valid[0]))
print("Predicted label: {}".format(y_valid_pred[0]))

Text sample: I need seating at Floating restaurant in Tennessee for a group of 9
True label: ['BookRestaurant']
Predicted label: BookRestaurant


In [60]:
# let's calculate sets accuracy (because each element is a list of labels)
sets_accuracy(np.squeeze(y_valid), y_valid_pred)

0.9184765502045955

### Let's free the memory used by embeddings and models

In [61]:
embedder.reset()
cls.reset()

### Models from configs

In [62]:
from deeppavlov import build_model
from deeppavlov import train_model

#### SklearnComponent classifier on Tfidf-features from config

In [63]:
logreg_config = {
  "dataset_reader": {
    "class_name": "basic_classification_reader",
    "x": "text",
    "y": "intents",
    "data_path": "./snips"
  },
  "dataset_iterator": {
    "class_name": "basic_classification_iterator",
    "seed": 42,
    "split_seed": 23,
    "field_to_split": "train",
    "split_fields": [
      "train",
      "valid"
    ],
    "split_proportions": [
      0.9,
      0.1
    ]
  },
  "chainer": {
    "in": [
      "x"
    ],
    "in_y": [
      "y"
    ],
    "pipe": [
      {
        "id": "classes_vocab",
        "class_name": "simple_vocab",
        "fit_on": [
          "y"
        ],
        "save_path": "./snips/classes.dict",
        "load_path": "./snips/classes.dict",
        "in": "y",
        "out": "y_ids"
      },
      {
        "in": [
          "x"
        ],
        "out": [
          "x_vec"
        ],
        "fit_on": [
          "x",
          "y_ids"
        ],
        "id": "tfidf_vec",
        "class_name": "sklearn_component",
        "save_path": "tfidf_v1.pkl",
        "load_path": "tfidf_v1.pkl",
        "model_class": "sklearn.feature_extraction.text:TfidfVectorizer",
        "infer_method": "transform"
      },
      {
        "in": "x",
        "out": "x_tok",
        "id": "my_tokenizer",
        "class_name": "nltk_moses_tokenizer",
        "tokenizer": "wordpunct_tokenize"
      },
      {
        "in": [
          "x_vec"
        ],
        "out": [
          "y_pred"
        ],
        "fit_on": [
          "x_vec",
          "y"
        ],
        "class_name": "sklearn_component",
        "main": True,
        "save_path": "logreg_v2.pkl",
        "load_path": "logreg_v2.pkl",
        "model_class": "sklearn.linear_model:LogisticRegression",
        "infer_method": "predict",
        "ensure_list_output": True
      }
    ],
    "out": [
      "y_pred"
    ]
  },
  "train": {
    "batch_size": 64,
    "metrics": [
      "accuracy"
    ],
    "validate_best": True,
    "test_best": False
  }
}


In [64]:
# we can train and evaluate model from config
m = train_model(logreg_config)

2019-02-12 12:15:52.311 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
2019-02-12 12:15:52.322 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 103: [loading vocabulary from /home/vimary/ipavlov/Pilot/examples/tutorials/snips/classes.dict]
2019-02-12 12:15:52.339 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 89: [saving vocabulary to /home/vimary/ipavlov/Pilot/examples/tutorials/snips/classes.dict]
2019-02-12 12:15:52.341 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 165: Initializing model sklearn.feature_extraction.text:TfidfVectorizer from scratch
2019-02-12 12:15:52.389 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 108: Fitting model sklearn.feature_extraction.text:TfidfVectorizer
2019-02-12 12:15:52.493 INFO in 'deeppavlov.models.sklearn.sk

{"valid": {"eval_examples_count": 1589, "metrics": {"accuracy": 0.983}, "time_spent": "0:00:01"}}


In [65]:
# or we can just load the trained model (it coincides with the one we trained above)
m = build_model(logreg_config)

2019-02-12 12:15:53.359 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 103: [loading vocabulary from /home/vimary/ipavlov/Pilot/examples/tutorials/snips/classes.dict]
2019-02-12 12:15:53.360 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 202: Loading model sklearn.feature_extraction.text:TfidfVectorizer from /home/vimary/ipavlov/Pilot/examples/tutorials/tfidf_v1.pkl
2019-02-12 12:15:53.366 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 209: Model sklearn.feature_extraction.textTfidfVectorizer loaded  with parameters
2019-02-12 12:15:53.368 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 202: Loading model sklearn.linear_model:LogisticRegression from /home/vimary/ipavlov/Pilot/examples/tutorials/logreg_v2.pkl
2019-02-12 12:15:53.369 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 209: Model sklearn.linear_model.logisticLogisti

In [66]:
m(["Is it freezing in Offerman, California?"])

[['GetWeather']]
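The way the `"in"` and `"out"` names of the config wire components together can be pictured as a shared dictionary of named variables. The sketch below is a simplified illustration and not DeepPavlov's chainer: `run_pipe` and the `"component"` key are hypothetical, and each step here takes a list of input names and a single output name.

```python
def run_pipe(pipe, inputs):
    """Run components in order; each reads its `in` names and writes its `out` name."""
    memory = dict(inputs)  # named variables shared by all steps
    for step in pipe:
        args = [memory[name] for name in step["in"]]
        memory[step["out"]] = step["component"](*args)
    return memory

# toy components standing in for the lowercasing, tokenizing and model steps
pipe = [
    {"in": ["x"], "out": "x_lower", "component": lambda batch: [t.lower() for t in batch]},
    {"in": ["x_lower"], "out": "x_tok", "component": lambda batch: [t.split() for t in batch]},
    {"in": ["x_tok"], "out": "n_tokens", "component": lambda batch: [len(t) for t in batch]},
]
result = run_pipe(pipe, {"x": ["Play some Jazz"]})
print(result["x_tok"])     # [['play', 'some', 'jazz']]
print(result["n_tokens"])  # [3]
```

A later step can only use names that the pipeline inputs or an earlier step have produced, which is why the order of entries in `"pipe"` matters.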

#### KerasClassificationModel on GloVe embeddings from config

In [67]:
cnn_config = {
  "dataset_reader": {
    "class_name": "basic_classification_reader",
    "x": "text",
    "y": "intents",
    "data_path": "snips"
  },
  "dataset_iterator": {
    "class_name": "basic_classification_iterator",
    "seed": 42,
    "split_seed": 23,
    "field_to_split": "train",
    "split_fields": [
      "train",
      "valid"
    ],
    "split_proportions": [
      0.9,
      0.1
    ]
  },
  "chainer": {
    "in": [
      "x"
    ],
    "in_y": [
      "y"
    ],
    "pipe": [
      {
        "id": "classes_vocab",
        "class_name": "simple_vocab",
        "fit_on": [
          "y"
        ],
        "level": "token",
        "save_path": "./snips/classes.dict",
        "load_path": "./snips/classes.dict",
        "in": "y",
        "out": "y_ids"
      },
      {
        "in": "x",
        "out": "x_tok",
        "id": "my_tokenizer",
        "class_name": "nltk_tokenizer",
        "tokenizer": "wordpunct_tokenize"
      },
      {
        "in": "x_tok",
        "out": "x_emb",
        "id": "my_embedder",
        "class_name": "glove",
        "load_path": "./glove.6B.100d.txt",
        "dim": 100,
        "pad_zero": True
      },
      {
        "in": "y_ids",
        "out": "y_onehot",
        "class_name": "one_hotter",
        "depth": "#classes_vocab.len",
        "single_vector": True
      },
      {
        "in": [
          "x_emb"
        ],
        "in_y": [
          "y_onehot"
        ],
        "out": [
          "y_pred_probas"
        ],
        "main": True,
        "class_name": "keras_classification_model",
        "save_path": "./cnn_model_v1",
        "load_path": "./cnn_model_v1",
        "embedding_size": "#my_embedder.dim",
        "n_classes": "#classes_vocab.len",
        "kernel_sizes_cnn": [
          1,
          2,
          3
        ],
        "filters_cnn": 256,
        "optimizer": "Adam",
        "learning_rate": 0.01,
        "learning_rate_decay": 0.1,
        "loss": "categorical_crossentropy",
        "coef_reg_cnn": 1e-4,
        "coef_reg_den": 1e-4,
        "dropout_rate": 0.5,
        "dense_size": 100,
        "model_name": "cnn_model"
      },
      {
        "in": "y_pred_probas",
        "out": "y_pred_ids",
        "class_name": "proba2labels",
        "max_proba": True
      },
      {
        "in": "y_pred_ids",
        "out": "y_pred_labels",
        "ref": "classes_vocab"
      }
    ],
    "out": [
      "y_pred_labels"
    ]
  },
  "train": {
    "epochs": 10,
    "batch_size": 64,
    "metrics": [
      "sets_accuracy",
      "f1_macro",
      {
        "name": "roc_auc",
        "inputs": ["y_onehot", "y_pred_probas"]
      }
    ],
    "validation_patience": 5,
    "val_every_n_epochs": 1,
    "log_every_n_epochs": 1,
    "show_examples": True,
    "validate_best": True,
    "test_best": False
  }
}


In [68]:
# we can train and evaluate model from config
m = train_model(cnn_config)

2019-02-12 12:15:54.313 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
2019-02-12 12:15:54.319 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 103: [loading vocabulary from /home/vimary/ipavlov/Pilot/examples/tutorials/snips/classes.dict]
2019-02-12 12:15:54.335 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 89: [saving vocabulary to /home/vimary/ipavlov/Pilot/examples/tutorials/snips/classes.dict]
2019-02-12 12:15:54.337 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `/home/vimary/ipavlov/Pilot/examples/tutorials/glove.6B.100d.txt`]
2019-02-12 12:16:14.207 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 272: [initializing `KerasClassificationModel` from scratch as cnn_model]
2019-02-12 12:16:14.

{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.1479, "f1_macro": 0.044, "roc_auc": 0.5499}, "time_spent": "0:00:01", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": ["GetWeather"], "y_true": ["BookRestaurant"]}, {"x": "Rate the current textbook one of 6 stars", "y_predicted": ["GetWeather"], "y_true": ["RateBook"]}, {"x": "find a nearby movie schedule for movies", "y_predicted": ["GetWeather"], "y_true": ["SearchScreeningEvent"]}, {"x": "what is the Mississippi for the week", "y_predicted": ["RateBook"], "y_true": ["GetWeather"]}, {"x": "Play me a song from 1968 on Spotify", "y_predicted": ["GetWeather"], "y_true": ["PlayMusic"]}, {"x": "Book a table for me, naomi and elisabeth at a brasserie with wifi", "y_predicted": ["GetWeather"], "y_true": ["BookRestaurant"]}, {"x": "The current album gets three out of 6 points", "y_predicted": ["GetWeather"], "y_true": ["RateBook"]}, {"x": "find Goodrich Quality Theaters 

2019-02-12 12:16:19.387 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 163: New best sets_accuracy of 0.9434
2019-02-12 12:16:19.388 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 165: Saving model
2019-02-12 12:16:19.388 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 386: [saving model to /home/vimary/ipavlov/Pilot/examples/tutorials/cnn_model_v1_opt.json]


{"train": {"eval_examples_count": 64, "metrics": {"sets_accuracy": 0.9375, "f1_macro": 0.9421, "roc_auc": 0.9938}, "time_spent": "0:00:05", "examples": [{"x": "Please find me the work, Instrumental Directions.", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchCreativeWork"]}, {"x": "What weather will it be in Battlement Mesa?", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "play theme by Yanni on Vimeo", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "rate the Beyond Black saga a one", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "Find the schedule for The Tooth Will Out at sunrise.", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "Rate Lords of the Rim zero stars", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "play an Masaki Aiba tune", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "I need a table for 5 at the restaurant I ate at last Oct.", "y_predicted": ["Boo

2019-02-12 12:16:21.734 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 163: New best sets_accuracy of 0.9515
2019-02-12 12:16:21.735 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 165: Saving model
2019-02-12 12:16:21.735 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 386: [saving model to /home/vimary/ipavlov/Pilot/examples/tutorials/cnn_model_v1_opt.json]


{"train": {"eval_examples_count": 64, "metrics": {"sets_accuracy": 0.9688, "f1_macro": 0.9623, "roc_auc": 0.999}, "time_spent": "0:00:08", "examples": [{"x": "She me movie times", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "I'd like a table in a smoking room in a taverna on sep. 23, 2023", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "find a movie called No More Sadface", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchCreativeWork"]}, {"x": "A rating of 5 of 6 points goes to Dickson McCunn trilogy", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "For the book The Mirrored Heavens  I give one of a possiable 6 stars", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "What are the weather conditions in Patagonia, South Africa?", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "Where can I watch the trailer for Home Economics", "y_predicted": ["SearchCreativeWork"], "y_true

2019-02-12 12:16:24.94 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 163: New best sets_accuracy of 0.9553
2019-02-12 12:16:24.94 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 165: Saving model
2019-02-12 12:16:24.95 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 386: [saving model to /home/vimary/ipavlov/Pilot/examples/tutorials/cnn_model_v1_opt.json]


{"train": {"eval_examples_count": 64, "metrics": {"sets_accuracy": 0.9844, "f1_macro": 0.9859, "roc_auc": 0.9998}, "time_spent": "0:00:10", "examples": [{"x": "find the trailer for Hit the Ice", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchCreativeWork"]}, {"x": "Can I get the movies  showtimes for the closest movie house.", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "I want to give this book zero", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "give The Creator zero points out of 6", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "Find the movie schedules for Cineplex Odeon Corporation.", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "Get soundtrack of Comprehensive Knowledge Archive Network", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchCreativeWork"]}, {"x": "play Pandora tunes from the fourties", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"

2019-02-12 12:16:26.435 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 163: New best sets_accuracy of 0.9566
2019-02-12 12:16:26.435 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 165: Saving model
2019-02-12 12:16:26.436 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 386: [saving model to /home/vimary/ipavlov/Pilot/examples/tutorials/cnn_model_v1_opt.json]


{"train": {"eval_examples_count": 64, "metrics": {"sets_accuracy": 0.9531, "f1_macro": 0.9521, "roc_auc": 0.999}, "time_spent": "0:00:12", "examples": [{"x": "Book a northeastern brazilian restaurant for 10 am", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "Rate The Life and Loves of a She-Devil 5 out of 6", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "what is the forecast for Montana at dinner", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "Where is The Toxic Avenger II playing", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "Play some music on Last Fm", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "Tell me the weather forecast one year from now in Kulpsville, Togo", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "show freezing forcast now within the same area in North Dakota", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "Play some m

2019-02-12 12:16:28.776 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 163: New best sets_accuracy of 0.9585
2019-02-12 12:16:28.776 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 165: Saving model
2019-02-12 12:16:28.777 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 386: [saving model to /home/vimary/ipavlov/Pilot/examples/tutorials/cnn_model_v1_opt.json]


{"train": {"eval_examples_count": 64, "metrics": {"sets_accuracy": 0.9688, "f1_macro": 0.9702, "roc_auc": 1.0}, "time_spent": "0:00:15", "examples": [{"x": "Play Pandora on Last Fm", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "add this artist to my Sinfonía Hipster", "y_predicted": ["AddToPlaylist"], "y_true": ["AddToPlaylist"]}, {"x": "play some movement by Franky Gee", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "When and where is Nefertiti, Queen of the Nile playing?", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "What movies are playing at Loews Cineplex?", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "Include hohenfriedberger marsch to my Novedades Pop list.", "y_predicted": ["AddToPlaylist"], "y_true": ["AddToPlaylist"]}, {"x": "Find movie schedules at IMAX Corporation", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "Rate A Tale

2019-02-12 12:16:31.141 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 163: New best sets_accuracy of 0.9604
2019-02-12 12:16:31.141 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 165: Saving model
2019-02-12 12:16:31.142 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 386: [saving model to /home/vimary/ipavlov/Pilot/examples/tutorials/cnn_model_v1_opt.json]


{"train": {"eval_examples_count": 64, "metrics": {"sets_accuracy": 0.9844, "f1_macro": 0.9808, "roc_auc": 0.9935}, "time_spent": "0:00:17", "examples": [{"x": "Will it be freezing on 4/20/2038 in AMerican Beach NC", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "put live and rare into dancehall official", "y_predicted": ["AddToPlaylist"], "y_true": ["AddToPlaylist"]}, {"x": "What is the weather like in Wyatte", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "rate The Descendants two points", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "Find the movie schedule for animated movies in the neighbourhood.", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "What time is The Bride’s Journey playing at Star Theatres?", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "Help me find the saga titled The Eternal Return", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchC

2019-02-12 12:16:33.547 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 163: New best sets_accuracy of 0.9622
2019-02-12 12:16:33.548 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 165: Saving model
2019-02-12 12:16:33.548 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 386: [saving model to /home/vimary/ipavlov/Pilot/examples/tutorials/cnn_model_v1_opt.json]


{"train": {"eval_examples_count": 64, "metrics": {"sets_accuracy": 0.9375, "f1_macro": 0.9374, "roc_auc": 0.997}, "time_spent": "0:00:19", "examples": [{"x": "How much wind will there be in NM on november 11th", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "find The Many Loves of Dobie Gillis", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchCreativeWork"]}, {"x": "Put Jazzy B on Lazy Chill Afternoon playlist", "y_predicted": ["PlayMusic"], "y_true": ["AddToPlaylist"]}, {"x": "What time is The Bride from Hell playing at Malco Theatres", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "I am giving the book After Henry a rating of 0 out of 6 stars", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "I need to add an artist to one of my playlists, Classical New Releases Spotify Picks.", "y_predicted": ["AddToPlaylist"], "y_true": ["AddToPlaylist"]}, {"x": "Will it be warm here in one hour", "y_predicted": ["GetWeath

2019-02-12 12:16:35.979 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 169: Did not improve on the sets_accuracy of 0.9622


{"train": {"eval_examples_count": 64, "metrics": {"sets_accuracy": 0.9844, "f1_macro": 0.9849, "roc_auc": 1.0}, "time_spent": "0:00:22", "examples": [{"x": "can you get me the trailer of The Multiversity?", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchCreativeWork"]}, {"x": "Find the films at ArcLight Hollywood.", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "Will the weather be temperate 22 minutes from now in Alba", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "I'm looking for a picture titled Rock Painting", "y_predicted": ["SearchCreativeWork"], "y_true": ["SearchCreativeWork"]}, {"x": "What's the weather forecast for Haigler?", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "Plpay my Disco Fever playlist.", "y_predicted": ["AddToPlaylist"], "y_true": ["PlayMusic"]}, {"x": "Add artist to playlist Epic Gaming", "y_predicted": ["AddToPlaylist"], "y_true": ["AddToPlaylist"]}, {"x": "Show me Rapid Ci

2019-02-12 12:16:38.311 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 163: New best sets_accuracy of 0.9629
2019-02-12 12:16:38.312 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 165: Saving model
2019-02-12 12:16:38.312 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 386: [saving model to /home/vimary/ipavlov/Pilot/examples/tutorials/cnn_model_v1_opt.json]


{"train": {"eval_examples_count": 64, "metrics": {"sets_accuracy": 0.9844, "f1_macro": 0.9837, "roc_auc": 0.9983}, "time_spent": "0:00:24", "examples": [{"x": "Rate my current essay 1 out of 6 stars", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "What's the weather in FL?", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "play me some Dom Pachino", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "Is cloudy in Lyncourt?", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "Is temperature in Hanksville freezing ?", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "play some Bertine Zetlitz record", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "play latest George Ducas music", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "For The Curious Incident of the Dog in the Nightdress I rate it 2 of 6 points", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "book verdure serving re

2019-02-12 12:16:40.661 INFO in 'deeppavlov.core.trainers.nn_trainer'['nn_trainer'] at line 169: Did not improve on the sets_accuracy of 0.9629
2019-02-12 12:16:40.693 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 103: [loading vocabulary from /home/vimary/ipavlov/Pilot/examples/tutorials/snips/classes.dict]
2019-02-12 12:16:40.693 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `/home/vimary/ipavlov/Pilot/examples/tutorials/glove.6B.100d.txt`]


{"train": {"eval_examples_count": 64, "metrics": {"sets_accuracy": 1.0, "f1_macro": 1.0, "roc_auc": 0.9996}, "time_spent": "0:00:26", "examples": [{"x": "book in town for 3 at a restaurant outdoor that is not far", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "Need a table for the day after tomorrow in Clarenceville at the Black Rapids Roadhouse", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "What will the weather be like this tuesday in the area neighboring Rendezvous Mountain Educational State Forest?", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "Rate The CIA and the Cult of Intelligence a 5.", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "Is the forecast windy in Nigeria on Nov. the 6th", "y_predicted": ["GetWeather"], "y_true": ["GetWeather"]}, {"x": "Book the nearby Meriton Grand Hotel Tallinn in Missouri.", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "please give 

2019-02-12 12:17:00.634 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 302: [initializing `KerasClassificationModel` from saved]
2019-02-12 12:17:00.963 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 312: [loading weights from cnn_model_v1.h5]
2019-02-12 12:17:01.131 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 136: Model was successfully initialized!
Model summary:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None, 100)    0                                            
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, None, 256)    25856 

{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9629, "f1_macro": 0.9623, "roc_auc": 0.9983}, "time_spent": "0:00:01", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "Rate the current textbook one of 6 stars", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "find a nearby movie schedule for movies", "y_predicted": ["SearchScreeningEvent"], "y_true": ["SearchScreeningEvent"]}, {"x": "what is the Mississippi for the week", "y_predicted": ["SearchScreeningEvent"], "y_true": ["GetWeather"]}, {"x": "Play me a song from 1968 on Spotify", "y_predicted": ["PlayMusic"], "y_true": ["PlayMusic"]}, {"x": "Book a table for me, naomi and elisabeth at a brasserie with wifi", "y_predicted": ["BookRestaurant"], "y_true": ["BookRestaurant"]}, {"x": "The current album gets three out of 6 points", "y_predicted": ["RateBook"], "y_true": ["RateBook"]}, {"x": "find 

2019-02-12 12:17:21.399 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 302: [initializing `KerasClassificationModel` from saved]
2019-02-12 12:17:21.744 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 312: [loading weights from cnn_model_v1.h5]
2019-02-12 12:17:21.909 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 136: Model was successfully initialized!
Model summary:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None, 100)    0                                            
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, None, 256)    25856 
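
The "New best" and "Did not improve" messages in the log above come from tracking the best validation score, which `validation_patience` builds on. The sketch below shows the general early-stopping idea only. It is not the `nn_trainer` code: the function name `train_with_patience` and the scores are invented for the example. In the log shown above the non-improving epochs are isolated, so a patience of 5 is not exhausted within the 10 epochs.

```python
def train_with_patience(valid_scores, patience):
    """Walk through per-epoch validation scores and stop once `patience`
    consecutive epochs fail to beat the best score seen so far."""
    best = float("-inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, score in enumerate(valid_scores, start=1):
        if score > best:
            # A real trainer would save the model at this point.
            best, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # out of patience: stop training early
    return best, best_epoch, epoch


scores = [0.90, 0.93, 0.92, 0.94, 0.93, 0.93]  # made-up validation accuracies
best, best_epoch, last_epoch = train_with_patience(scores, patience=2)
print(best, best_epoch, last_epoch)
```

Because the model is saved only on improvement, the checkpoint on disk corresponds to `best_epoch` and not to the last epoch that ran.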

In [69]:
# or we can just load the pre-trained model (it coincides with the one we trained above)
m = build_model(cnn_config)

2019-02-12 12:17:21.914 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 103: [loading vocabulary from /home/vimary/ipavlov/Pilot/examples/tutorials/snips/classes.dict]
2019-02-12 12:17:21.915 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `/home/vimary/ipavlov/Pilot/examples/tutorials/glove.6B.100d.txt`]
2019-02-12 12:17:42.89 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 302: [initializing `KerasClassificationModel` from saved]
2019-02-12 12:17:42.406 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 312: [loading weights from cnn_model_v1.h5]
2019-02-12 12:17:42.569 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 136: Model was successfully initialized!
Model summary:
_______________________________________________________________

In [70]:
m(["Is it freezing in Offerman, California?"])

[['GetWeather']]

#### SklearnComponent classifier on GloVe weighted by TF-IDF embeddings from config

In [71]:
logreg_config = {
  "dataset_reader": {
    "class_name": "basic_classification_reader",
    "x": "text",
    "y": "intents",
    "data_path": "snips"
  },
  "dataset_iterator": {
    "class_name": "basic_classification_iterator",
    "seed": 42,
    "split_seed": 23,
    "field_to_split": "train",
    "split_fields": [
      "train",
      "valid"
    ],
    "split_proportions": [
      0.9,
      0.1
    ]
  },
  "chainer": {
    "in": [
      "x"
    ],
    "in_y": [
      "y"
    ],
    "pipe": [
      {
        "id": "classes_vocab",
        "class_name": "simple_vocab",
        "fit_on": [
          "y"
        ],
        "save_path": "./snips/classes.dict",
        "load_path": "./snips/classes.dict",
        "in": "y",
        "out": "y_ids"
      },
      {
        "in": [
          "x"
        ],
        "out": [
          "x_vec"
        ],
        "fit_on": [
          "x",
          "y_ids"
        ],
        "id": "my_tfidf_vectorizer",
        "class_name": "sklearn_component",
        "save_path": "tfidf_v2.pkl",
        "load_path": "tfidf_v2.pkl",
        "model_class": "sklearn.feature_extraction.text:TfidfVectorizer",
        "infer_method": "transform"
      },
      {
        "in": "x",
        "out": "x_tok",
        "id": "my_tokenizer",
        "class_name": "nltk_moses_tokenizer"
      },
      {
        "in": "x_tok",
        "out": "x_emb",
        "id": "my_embedder",
        "class_name": "glove",
        "save_path": "./glove.6B.100d.txt",
        "load_path": "./glove.6B.100d.txt",
        "dim": 100,
        "pad_zero": True
      },
      {
        "class_name": "one_hotter",
        "id": "my_onehotter",
        "depth": "#classes_vocab.len",
        "in": "y_ids",
        "out": "y_onehot",
        "single_vector": True
      },
      {
        "in": "x_tok",
        "out": "x_weighted_emb",
        "class_name": "tfidf_weighted",
        "id": "my_weighted_embedder",
        "embedder": "#my_embedder",
        "tokenizer": "#my_tokenizer",
        "vectorizer": "#my_tfidf_vectorizer",
        "mean": True
      },
      {
        "in": [
          "x_weighted_emb"
        ],
        "out": [
          "y_pred"
        ],
        "fit_on": [
          "x_weighted_emb",
          "y"
        ],
        "class_name": "sklearn_component",
        "main": True,
        "save_path": "logreg_v3.pkl",
        "load_path": "logreg_v3.pkl",
        "model_class": "sklearn.linear_model:LogisticRegression",
        "infer_method": "predict",
        "ensure_list_output": True
      }
    ],
    "out": [
      "y_pred"
    ]
  },
  "train": {
    "epochs": 10,
    "batch_size": 64,
    "metrics": [
      "sets_accuracy"
    ],
    "show_examples": False,
    "validate_best": True,
    "test_best": False
  }
}
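
To make the `tfidf_weighted` step of the config above more concrete, here is a minimal pure-Python sketch of a TF-IDF-weighted mean of token embeddings. It is an illustration under stated assumptions and not the DeepPavlov component: the 2-dimensional vectors, the three toy documents, and the helpers `idf_table` and `weighted_mean_embedding` are hypothetical, and the IDF formula is the smoothed variant that scikit-learn's `TfidfVectorizer` uses by default.

```python
import math
from collections import Counter

# Toy 2-dimensional "embeddings"; a real run would look tokens up in GloVe.
toy_vectors = {"play": [1.0, 0.0], "some": [0.0, 1.0], "jazz": [1.0, 1.0]}
docs = [["play", "some", "jazz"], ["play", "some", "rock"], ["play", "music"]]


def idf_table(tokenized_docs):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + cnt)) + 1 for tok, cnt in df.items()}


def weighted_mean_embedding(tokens, vectors, idf, dim=2):
    """Average the token vectors, weighting each occurrence by its IDF.
    Tokens without a vector or an IDF entry are skipped."""
    total, weight_sum = [0.0] * dim, 0.0
    for tok in tokens:
        if tok in vectors and tok in idf:
            w = idf[tok]
            total = [t + w * v for t, v in zip(total, vectors[tok])]
            weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total


idf = idf_table(docs)
x_weighted_emb = weighted_mean_embedding(["play", "some", "jazz"], toy_vectors, idf)
print(x_weighted_emb)
```

Rare tokens such as "jazz" get a larger weight than tokens that occur in every document such as "play", so the resulting sentence vector leans towards the more distinctive words. The result is a single fixed-size vector per sample, which is what a classifier like logistic regression expects as input.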


In [72]:
# we can train and evaluate the model from the config
m = train_model(logreg_config)

2019-02-12 12:32:01.418 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
2019-02-12 12:32:01.421 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 103: [loading vocabulary from /home/vimary/ipavlov/Pilot/examples/tutorials/snips/classes.dict]
2019-02-12 12:32:01.439 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 89: [saving vocabulary to /home/vimary/ipavlov/Pilot/examples/tutorials/snips/classes.dict]
2019-02-12 12:32:01.441 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 165: Initializing model sklearn.feature_extraction.text:TfidfVectorizer from scratch
2019-02-12 12:32:01.486 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 108: Fitting model sklearn.feature_extraction.text:TfidfVectorizer
2019-02-12 12:32:01.587 INFO in 'deeppavlov.models.sklearn.sk

{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9283}, "time_spent": "0:00:03"}}


2019-02-12 12:33:27.702 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 202: Loading model sklearn.linear_model:LogisticRegression from /home/vimary/ipavlov/Pilot/examples/tutorials/logreg_v3.pkl
2019-02-12 12:33:27.702 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 209: Model sklearn.linear_model.logisticLogisticRegression loaded  with parameters


In [73]:
# or we can just load the pre-trained model (it coincides with the one we trained above)
m = build_model(logreg_config)

2019-02-12 12:33:27.742 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 103: [loading vocabulary from /home/vimary/ipavlov/Pilot/examples/tutorials/snips/classes.dict]
2019-02-12 12:33:27.743 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 202: Loading model sklearn.feature_extraction.text:TfidfVectorizer from /home/vimary/ipavlov/Pilot/examples/tutorials/tfidf_v2.pkl
2019-02-12 12:33:27.748 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 209: Model sklearn.feature_extraction.textTfidfVectorizer loaded  with parameters
2019-02-12 12:33:27.750 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `/home/vimary/ipavlov/Pilot/examples/tutorials/glove.6B.100d.txt`]
2019-02-12 12:33:47.483 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 202: Loading model sklearn.linear_model:LogisticRegression from /home/vim

In [74]:
m(["Is it freezing in Offerman, California?"])

[['GetWeather']]

In [75]:
# let's free memory
del m