## You can also run the notebook in [COLAB](https://colab.research.google.com/github/deepmipt/DeepPavlov/blob/master/examples/tutorials/05_deeppavlov_classification.ipynb).

In [None]:
!pip3 install deeppavlov

# Classification on DeepPavlov

**Task**:
Intent recognition on the SNIPS dataset (https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines), which has already been converted to `csv` format and can be downloaded from http://files.deeppavlov.ai/datasets/snips_intents/train.csv

FastText English word embeddings ~8Gb: http://files.deeppavlov.ai/deeppavlov_data/embeddings/wiki.en.bin

## Plan of the notebook with documentation links:

1. [Data aggregation](#Data-aggregation)
     * [DatasetReader](#DatasetReader): [docs link](https://deeppavlov.readthedocs.io/en/latest/apiref/dataset_readers.html)
     * [DatasetIterator](#DatasetIterator): [docs link](https://deeppavlov.readthedocs.io/en/latest/apiref/dataset_iterators.html)
2. [Data preprocessing](#Data-preprocessing): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/data_processors.html)
     * [Lowercasing](#Lowercasing)
     * [Tokenization](#Tokenization)
     * [Vocabulary](#Vocabulary)
3. [Featurization](#Featurization): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/data_processors.html), [pre-trained embeddings link](https://deeppavlov.readthedocs.io/en/latest/intro/pretrained_vectors.html)
    * [Bag-of-words embedder](#Bag-of-words)
    * [TF-IDF vectorizer](#TF-IDF-Vectorizer)
    * [GloVe embedder](#GloVe-embedder)
    * [Mean GloVe embedder](#Mean-GloVe-embedder)
    * [GloVe weighted by TF-IDF embedder](#GloVe-weighted-by-TF-IDF-embedder)
4. [Models](#Models): [docs link](https://deeppavlov.readthedocs.io/en/latest/components/classifiers.html)
    * [Building models in python](#Models-in-python)
        - [Sklearn component classifiers](#SklearnComponent-classifier-on-Tfidf-features-in-python)
        - [Keras classification model on GloVe emb](#KerasClassificationModel-on-GloVe-embeddings-in-python)
        - [Sklearn component classifier on GloVe weighted emb](#SklearnComponent-classifier-on-GloVe-weighted-by-TF-IDF-embeddings-in-python)
    * [Building models from configs](#Models-from-configs)
        - [Sklearn component classifiers](#SklearnComponent-classifier-on-Tfidf-features-from-config)
        - [Keras classification model](#KerasClassificationModel-on-GloVe-embeddings-from-config)
        - [Sklearn component classifier on GloVe weighted emb](#SklearnComponent-classifier-on-GloVe-weighted-by-TF-IDF-embeddings-from-config)
    * [Bonus: pre-trained CNN model in DeepPavlov](#Bonus:-pre-trained-CNN-model-in-DeepPavlov)

## Data aggregation

First of all, let's download the data we will work with and take a look at it.

In [1]:
from deeppavlov.core.data.utils import simple_download

#download train data file for SNIPS
simple_download(url="http://files.deeppavlov.ai/datasets/snips_intents/train.csv", 
                destination="./snips/train.csv")

2018-11-09 16:33:35.510 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 208: Starting new HTTP connection (1): files.deeppavlov.ai
2018-11-09 16:33:35.514 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 396: http://files.deeppavlov.ai:80 "GET /datasets/snips_intents/train.csv HTTP/1.1" 200 980824
2018-11-09 16:33:35.515 INFO in 'deeppavlov.core.data.utils'['utils'] at line 63: Downloading from http://files.deeppavlov.ai/datasets/snips_intents/train.csv to snips/train.csv
100%|██████████| 981k/981k [00:00<00:00, 22.2MB/s]


In [2]:
! head -n 15 snips/train.csv

text,intents
Add another song to the Cita Romántica playlist. ,AddToPlaylist
add clem burke in my playlist Pre-Party R&B Jams,AddToPlaylist
Add Live from Aragon Ballroom to Trapeo,AddToPlaylist
add Unite and Win to my night out,AddToPlaylist
Add track to my Digster Future Hits,AddToPlaylist
add the piano bar to my Cindy Wilson,AddToPlaylist
Add Spanish Harlem Incident to cleaning the house,AddToPlaylist
add The Greyest of Blue Skies in Indie Español my playlist,AddToPlaylist
Add the name kids in the street to the plylist New Indie Mix,AddToPlaylist
add album radar latino,AddToPlaylist
Add Tranquility to the Latin Pop Rising playlist. ,AddToPlaylist
Add d flame to the Dcode2016 playlist.,AddToPlaylist
Add album to my fairy tales,AddToPlaylist
I need another artist in the New Indie Mix playlist. ,AddToPlaylist


### DatasetReader

Read data using `BasicClassificationDatasetReader` from DeepPavlov.

In [4]:
from deeppavlov.dataset_readers.basic_classification_reader import BasicClassificationDatasetReader

In [5]:
# read data from particular columns of `.csv` file
dr = BasicClassificationDatasetReader().read(
    data_path='./snips/',
    train='train.csv',
    x = 'text',
    y = 'intents'
)



We don't have a ready train/valid/test split.

In [6]:
# check train/valid/test sizes
[(k, len(dr[k])) for k in dr.keys()]

[('train', 15884), ('valid', 0), ('test', 0)]

### DatasetIterator

Use `BasicClassificationDatasetIterator` to split `train` into `train` and `valid` and to generate batches of samples.

In [7]:
from deeppavlov.dataset_iterators.basic_classification_iterator import BasicClassificationDatasetIterator

In [8]:
# initialize data iterator splitting `train` field to `train` and `valid` in proportion 0.8/0.2
train_iterator = BasicClassificationDatasetIterator(
    data=dr,
    field_to_split='train',  # field that will be split
    split_fields=['train', 'valid'],   # fields into which the field above will be split
    split_proportions=[0.8, 0.2],  # proportions for splitting
    split_seed=23,  # seed for splitting dataset
    seed=42)  # seed for iteration over dataset

2018-11-09 16:34:30.398 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
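
The effect of `split_proportions` and `split_seed` can be illustrated without DeepPavlov. The sketch below is a simplified stand-in, not the iterator's actual implementation: `split_field` is a hypothetical helper that shuffles the samples with a fixed seed and cuts them into consecutive parts.

```python
import random


def split_field(samples, proportions, seed):
    """Shuffle `samples` with a fixed seed and cut them into consecutive parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for i, p in enumerate(proportions):
        if i == len(proportions) - 1:
            # the last part takes whatever is left, so no sample is lost to rounding
            end = len(shuffled)
        else:
            end = start + int(round(p * len(shuffled)))
        parts.append(shuffled[start:end])
        start = end
    return parts


# toy data in the same (text, [label]) layout as the samples in this notebook
toy = [("text %d" % i, ["Label"]) for i in range(10)]
train_part, valid_part = split_field(toy, [0.8, 0.2], seed=23)
print(len(train_part), len(valid_part))  # 8 2
```

Because the shuffle is seeded, repeating the call with the same seed yields the same split, which is what makes the train/valid partition reproducible between runs.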


Let's look into training samples. 

In [9]:
# one can get train instances (or any other data type including `all`)
x_train, y_train = train_iterator.get_instances(data_type='train')
for x, y in list(zip(x_train, y_train))[:5]:
    print('x:', x)
    print('y:', y)
    print('=================')

x: Is it freezing in Offerman, California?
y: ['GetWeather']
x: put this song in the playlist Trap Land
y: ['AddToPlaylist']
x: show me a textbook with a rating of 2 and a maximum rating of 6 that is current
y: ['RateBook']
x: Will the weather be okay in Northern Luzon Heroes Hill National Park 4 and a half months from now?
y: ['GetWeather']
x: Rate the current album a four
y: ['RateBook']


## Data preprocessing

We will be using lowercasing and tokenization as data preparation. 

DeepPavlov also contains several other preprocessors and tokenizers.

### Lowercasing

`StrLower` lowercases texts.

In [10]:
from deeppavlov.models.preprocessors.str_lower import StrLower

[nltk_data] Downloading package punkt to /home/dilyara/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/dilyara/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package perluniprops to
[nltk_data]     /home/dilyara/nltk_data...
[nltk_data]   Package perluniprops is already up-to-date!
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /home/dilyara/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!


In [11]:
str_lower = StrLower()
str_lower(['Is it freezing in Offerman, California?'])

['is it freezing in offerman, california?']

### Tokenization

`NLTKMosesTokenizer` splits a string into tokens.

In [12]:
from deeppavlov.models.tokenizers.nltk_moses_tokenizer import NLTKMosesTokenizer

In [13]:
tokenizer = NLTKMosesTokenizer()
tokenizer(['Is it freezing in Offerman, California?'])

[['Is', 'it', 'freezing', 'in', 'Offerman', ',', 'California', '?']]

Let's preprocess the whole `train` part of the dataset.

In [14]:
train_x_lower_tokenized = str_lower(tokenizer(train_iterator.get_instances(data_type='train')[0]))
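
Note that the cell above lowercases a batch of token lists rather than a batch of strings. A minimal sketch of how a lowercasing step can handle both, assuming it simply recurses into nested lists (`lower_nested` is a hypothetical helper, not DeepPavlov code):

```python
def lower_nested(batch):
    """Lowercase a string, or every string inside arbitrarily nested lists."""
    if isinstance(batch, str):
        return batch.lower()
    return [lower_nested(item) for item in batch]


print(lower_nested(["Is it freezing?"]))            # batch of strings
print(lower_nested([["Is", "it", "freezing", "?"]]))  # batch of token lists
```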

### Vocabulary

Now we are ready to use vocabularies. They are very useful for:
* extracting class labels and converting labels to indices and vice versa,
* building vocabularies of characters or tokens.

In [15]:
from deeppavlov.core.data.simple_vocab import SimpleVocabulary

In [16]:
# initialize a simple vocabulary to collect all classes that appear in the dataset
classes_vocab = SimpleVocabulary(
    save_path='./snips/classes.dict',
    load_path='./snips/classes.dict')

In [17]:
classes_vocab.fit((train_iterator.get_instances(data_type='train')[1]))
classes_vocab.save()

2018-11-09 16:34:36.222 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 89: [saving vocabulary to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/classes.dict]


Let's see what classes the dataset contains and their indices in the vocabulary.

In [18]:
list(classes_vocab.items())

[('GetWeather', 0),
 ('PlayMusic', 1),
 ('SearchScreeningEvent', 2),
 ('BookRestaurant', 3),
 ('RateBook', 4),
 ('SearchCreativeWork', 5),
 ('AddToPlaylist', 6)]

In [19]:
# one can also collect a vocabulary of the textual tokens that appear 2 or more times in the dataset
token_vocab = SimpleVocabulary(
    save_path='./snips/tokens.dict',
    load_path='./snips/tokens.dict',
    min_freq=2,
    special_tokens=('<PAD>', '<UNK>',),
    unk_token='<UNK>')

In [20]:
token_vocab.fit(train_x_lower_tokenized)
token_vocab.save()

2018-11-09 16:34:38.172 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 89: [saving vocabulary to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/tokens.dict]


In [21]:
# number of tokens in dictionary
len(token_vocab)

4564

In [22]:
# 10 most common words and the number of times they appeared
token_vocab.freqs.most_common()[:10]

[('the', 6953),
 ('a', 3917),
 ('in', 3265),
 ('to', 3203),
 ('for', 2814),
 ('of', 2401),
 ('.', 2400),
 ('i', 2079),
 ('at', 1935),
 ('play', 1703)]

In [23]:
token_ids = token_vocab(str_lower(tokenizer(['Is it freezing in Offerman, California?'])))
token_ids

[[13, 36, 244, 4, 1, 29, 996, 20]]

In [24]:
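# called on indices, the vocabulary maps them back to tokens;
# called on token lists, the tokenizer joins them back into strings
tokenizer(token_vocab(token_ids))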

['is it freezing in <UNK>, california?']
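
The behaviour seen above (special tokens, a frequency cut-off, and unknown words mapped to `<UNK>`) can be sketched in a few lines. `ToyVocab` is a hypothetical stand-in written for illustration only; it does not reproduce `SimpleVocabulary` internals such as saving and loading.

```python
from collections import Counter


class ToyVocab:
    """Hypothetical token vocabulary; not DeepPavlov's implementation."""

    def __init__(self, special_tokens=("<PAD>", "<UNK>"), unk_token="<UNK>", min_freq=1):
        self.special_tokens = list(special_tokens)
        self.unk_token = unk_token
        self.min_freq = min_freq
        self.idx2tok = []
        self.tok2idx = {}

    def fit(self, tokenized_samples):
        freqs = Counter(tok for sample in tokenized_samples for tok in sample)
        # special tokens come first, then frequent-enough tokens by decreasing frequency
        kept = [tok for tok, n in freqs.most_common() if n >= self.min_freq]
        self.idx2tok = self.special_tokens + kept
        self.tok2idx = {tok: i for i, tok in enumerate(self.idx2tok)}

    def encode(self, tokens):
        unk = self.tok2idx[self.unk_token]
        return [self.tok2idx.get(tok, unk) for tok in tokens]

    def decode(self, indices):
        return [self.idx2tok[i] for i in indices]


vocab = ToyVocab(min_freq=2)
vocab.fit([["play", "the", "song"], ["rate", "the", "song"], ["play", "the", "album"]])
ids = vocab.encode(["play", "the", "opera"])
print(ids, vocab.decode(ids))
```

Tokens seen fewer than `min_freq` times ("rate", "album") and tokens never seen at all ("opera") both end up as `<UNK>`, so encoding followed by decoding is lossy for rare words.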

## Featurization

This part contains several possible ways of featurizing text samples. One can choose any appropriate vectorizer/embedder according to the available resources and the given task.

Bag-of-words (BoW) and TF-IDF vectorizers convert text samples to vectors (one vector per sample), while fastText, GloVe, and TF-IDF-weighted embedders produce either an embedding vector per token or an embedding vector per text sample (if `mean` is set to `True`).

### Bag-of-words

Maps each text sample to a vector indicating which words appear in it: text -> binary vector $v$: \[0, 1, 0, 0, 0, 1, ..., 1, 0, 1\].

The dimensionality of vector $v$ is equal to the vocabulary size.

$v_i = 1$ if word $i$ is in the text,

$v_i = 0$ otherwise.
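
The definition above can be sketched in plain Python. `bag_of_words` is a hypothetical helper that takes token indices and the vocabulary size; it only illustrates the binary encoding and is not the DeepPavlov component.

```python
def bag_of_words(token_ids, vocab_size):
    """Binary vector: position i is 1 if index i occurs in the sample."""
    vector = [0] * vocab_size
    for idx in token_ids:
        vector[idx] = 1
    return vector


v = bag_of_words([3, 2, 1, 3], vocab_size=6)
print(v)       # [0, 1, 1, 1, 0, 0]
print(sum(v))  # 3: repeated indices are counted once
```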

In [25]:
import numpy as np
from deeppavlov.models.embedders.bow_embedder import BoWEmbedder

In [26]:
# initialize bag-of-words embedder giving total number of tokens
bow = BoWEmbedder(depth=token_vocab.len)
# it assumes indexed tokenized samples
bow(token_vocab(str_lower(tokenizer(['Is it freezing in Offerman, California?']))))

[array([0, 1, 0, ..., 0, 0, 0], dtype=int32)]

In [27]:
# the sample maps to 8 distinct indices: 7 known tokens plus `<UNK>` for the out-of-vocabulary 'offerman'
sum(bow(token_vocab(str_lower(tokenizer(['Is it freezing in Offerman, California?']))))[0])

8

### TF-IDF Vectorizer

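Maps each text sample to a vector: text -> vector $v$ from $R^N$, where $N$ is the vocabulary size.

$TF\text{-}IDF(token, document) = TF(token, document) \cdot IDF(token, all\_documents)$

$TF$ is the term frequency, where $n_{token}$ is the number of times the token occurs in the document and the sum runs over all tokens $k$ of the document:

$TF(token, document) = \frac{n_{token}}{\sum_{k}n_k}.$

$IDF$ is the inverse document frequency:

$IDF(token, all\_documents) = \log\frac{Total\ number\ of\ documents}{number\ of\ documents\ where\ token\ appeared}.$

`TfidfVectorizer` from `sklearn`, used below, applies a smoothed variant of this formula and L2-normalizes the resulting vectors by default.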
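
As a worked example, the snippet below computes TF-IDF by hand on three toy documents, using the common logarithmic form of IDF, $\log(N / df)$. The documents and helper names are made up for illustration, and the numbers will differ from those produced by `sklearn`, which smooths and normalizes.

```python
import math
from collections import Counter

docs = [["play", "the", "song"], ["rate", "the", "book"], ["play", "music"]]


def tf(token, doc):
    """Share of the document's tokens that are `token`."""
    return Counter(doc)[token] / len(doc)


def idf(token, all_docs):
    """Log of (number of documents / number of documents containing `token`)."""
    containing = sum(1 for d in all_docs if token in d)
    return math.log(len(all_docs) / containing)  # assumes the token occurs somewhere


def tf_idf(token, doc, all_docs):
    return tf(token, doc) * idf(token, all_docs)


for token in docs[0]:
    print(token, round(tf_idf(token, docs[0], docs), 3))
```

"song" occurs in only one document, so it receives a higher weight than "play" and "the", which each occur in two.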

`SklearnComponent` in DeepPavlov is a universal wrapper for any vectorizer/estimator from the `sklearn` package. The only requirement is that the model class and the name of the infer method are passed as parameters.

In [28]:
from deeppavlov.models.sklearn import SklearnComponent

In [29]:
# initialize TF-IDF vectorizer sklearn component with `transform` as infer method
tfidf = SklearnComponent(
    model_class="sklearn.feature_extraction.text:TfidfVectorizer",
    infer_method="transform",
    save_path='./tfidf_v0.pkl',
    load_path='./tfidf_v0.pkl',
    mode='train')

2018-11-09 16:34:54.331 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 164: Initializing model sklearn.feature_extraction.text:TfidfVectorizer from scratch


In [30]:
# fit on textual train instances and save it
tfidf.fit(str_lower(train_iterator.get_instances(data_type='train')[0]))
tfidf.save()

2018-11-09 16:34:55.565 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 107: Fitting model sklearn.feature_extraction.text:TfidfVectorizer
2018-11-09 16:34:55.686 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 239: Saving model to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/tfidf_v0.pkl


In [31]:
tfidf(str_lower(['Is it freezing in Offerman, California?']))

<1x10709 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [32]:
# number of tokens in the TF-IDF vocabulary
len(tfidf.model.vocabulary_)

10709

### GloVe embedder

[GloVe](https://nlp.stanford.edu/projects/glove/) is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

In [33]:
from deeppavlov.models.embedders.glove_embedder import GloVeEmbedder

2018-11-09 16:34:58.552 INFO in 'summarizer.preprocessing.cleaner'['textcleaner'] at line 37: 'pattern' package not found; tag filters are not available for English


Let's download the GloVe embedding file.

In [34]:
simple_download(url="http://files.deeppavlov.ai/embeddings/glove.6B.100d.txt", 
                destination="./glove.6B.100d.txt")

2018-11-09 16:35:00.958 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 208: Starting new HTTP connection (1): files.deeppavlov.ai
2018-11-09 16:35:00.981 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 396: http://files.deeppavlov.ai:80 "GET /embeddings/glove.6B.100d.txt HTTP/1.1" 200 None
2018-11-09 16:35:00.982 INFO in 'deeppavlov.core.data.utils'['utils'] at line 63: Downloading from http://files.deeppavlov.ai/embeddings/glove.6B.100d.txt to glove.6B.100d.txt
347MB [00:11, 29.2MB/s] 


In [35]:
embedder = GloVeEmbedder(load_path='./glove.6B.100d.txt',
                         dim=100)

2018-11-09 16:38:06.81 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `/home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt`]
2018-11-09 16:38:06.82 INFO in 'gensim.models.utils_any2vec'['utils_any2vec'] at line 170: loading projection weights from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt
2018-11-09 16:38:06.82 DEBUG in 'smart_open.smart_open_lib'['smart_open_lib'] at line 149: {'kw': {}, 'mode': 'rb', 'uri': '/home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt'}
2018-11-09 16:38:06.83 DEBUG in 'smart_open.smart_open_lib'['smart_open_lib'] at line 621: encoding_wrapper: {'errors': 'strict', 'encoding': None, 'mode': 'rb', 'fileobj': <_io.BufferedReader name='/home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/glove.6B.100d.txt'>}
2018-11-09 16:38:32.92 INFO in 'gensim.models.utils_any2vec'[

In [36]:
# output shape is (batch_size x num_tokens x embedding_dim)
embedded_batch = embedder(str_lower(tokenizer(['Is it freezing in Offerman, California?']))) 
len(embedded_batch), len(embedded_batch[0]), embedded_batch[0][0].shape

(1, 8, (100,))

### Mean GloVe embedder

The embedder returns a vector per token, while we want a vector per text sample. Therefore, let's calculate the mean vector of the token embeddings.
For that we can either initialize `GloVeEmbedder` with `mean=True` (`mean=False` by default) or pass `mean=True` when calling the embedder (in this case the value applies only to that call).

In [37]:
# output shape is (batch_size x embedding_dim)
embedded_batch = embedder(str_lower(tokenizer(['Is it freezing in Offerman, California?'])), mean=True) 
len(embedded_batch), embedded_batch[0].shape

(1, (100,))
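
Mean pooling itself is a one-liner per dimension. The sketch below uses made-up 2-dimensional vectors instead of real GloVe embeddings; `mean_embedding` is a hypothetical helper.

```python
def mean_embedding(token_vectors):
    """Average a list of equally sized token vectors into one sample vector."""
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors) for i in range(dim)]


# toy 2-dimensional "embeddings", invented for this example
toy_vectors = {"is": [1.0, 0.0], "it": [0.0, 1.0], "freezing": [2.0, 2.0]}
sample = ["is", "it", "freezing"]
print(mean_embedding([toy_vectors[t] for t in sample]))  # [1.0, 1.0]
```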

### GloVe weighted by TF-IDF embedder

One possible way to combine a TF-IDF vectorizer with any token embedder is to weigh token embeddings by their TF-IDF coefficients. Setting `mean` to `True` is therefore required to obtain one weighted embedding per sample, because **by default** the embedder still returns embeddings of tokens.

In [38]:
from deeppavlov.models.embedders.tfidf_weighted_embedder import TfidfWeightedEmbedder

In [39]:
weighted_embedder = TfidfWeightedEmbedder(
    embedder=embedder,  # our GloVe embedder instance
    tokenizer=tokenizer,  # our tokenizer instance
    mean=True,  # to return one vector per sample
    vectorizer=tfidf  # our TF-IDF vectorizer
)

In [40]:
# output shape is (batch_size x  embedding_dim)
embedded_batch = weighted_embedder(str_lower(tokenizer(['Is it freezing in Offerman, California?']))) 
len(embedded_batch), embedded_batch[0].shape

(1, (100,))
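
The idea of weighting can be sketched as a weighted average, assuming each token vector is multiplied by its TF-IDF coefficient and the sum is divided by the total weight. This is an illustrative assumption about the pooling step, with made-up vectors and weights, and `weighted_mean_embedding` is a hypothetical helper.

```python
def weighted_mean_embedding(token_vectors, weights):
    """Weighted average of token vectors; weights play the role of TF-IDF coefficients."""
    total = sum(weights)
    dim = len(token_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, token_vectors)) / total
        for i in range(dim)
    ]


vectors = [[1.0, 0.0], [0.0, 1.0]]
print(weighted_mean_embedding(vectors, weights=[3.0, 1.0]))  # [0.75, 0.25]
```

With equal weights the result reduces to the plain mean, so informative (high TF-IDF) tokens pull the sample vector towards themselves only when the weights differ.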

## Models

In [41]:
from deeppavlov.metrics.accuracy import sets_accuracy

In [42]:
# get all train and valid data from iterator
x_train, y_train = train_iterator.get_instances(data_type="train")
x_valid, y_valid = train_iterator.get_instances(data_type="valid")

### Models in python

#### SklearnComponent classifier on Tfidf-features in python

In [43]:
# initialize sklearn classifier, all parameters for classifier could be passed
cls = SklearnComponent(
    model_class="sklearn.linear_model:LogisticRegression",
    infer_method="predict",
    save_path='./logreg_v0.pkl',
    load_path='./logreg_v0.pkl',
    C=1,
    mode='train')

2018-11-09 16:38:44.882 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 164: Initializing model sklearn.linear_model:LogisticRegression from scratch


In [44]:
# fit sklearn classifier and save it
cls.fit(tfidf(x_train), y_train)
cls.save()

2018-11-09 16:38:46.393 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 107: Fitting model sklearn.linear_model:LogisticRegression
2018-11-09 16:38:46.625 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 239: Saving model to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/logreg_v0.pkl


In [45]:
y_valid_pred = cls(tfidf(x_valid))

In [46]:
# Let's look into obtained result
print("Text sample: {}".format(x_valid[0]))
print("True label: {}".format(y_valid[0]))
print("Predicted label: {}".format(y_valid_pred[0]))

Text sample: I need seating at Floating restaurant in Tennessee for a group of 9
True label: ['BookRestaurant']
Predicted label: BookRestaurant


In [47]:
# let's calculate sets accuracy (because each element is a list of labels)
sets_accuracy(np.squeeze(y_valid), y_valid_pred)

0.982373308152345
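
A minimal sketch of a set-based accuracy, assuming a sample counts as correct when the set of predicted labels equals the set of true labels. `toy_sets_accuracy` is a hypothetical re-implementation for illustration, and the labels are toy data.

```python
def toy_sets_accuracy(y_true, y_pred):
    """Share of samples whose predicted label set equals the true label set."""
    if not y_true:
        return 0.0
    correct = sum(set(t) == set(p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_true_toy = [["GetWeather"], ["RateBook"], ["PlayMusic", "AddToPlaylist"]]
y_pred_toy = [["GetWeather"], ["PlayMusic"], ["AddToPlaylist", "PlayMusic"]]
print(toy_sets_accuracy(y_true_toy, y_pred_toy))  # 2 of 3 samples match
```

The order of labels inside a sample does not matter, which is why the third sample counts as correct.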

#### KerasClassificationModel on GloVe embeddings in python

In [47]:
from deeppavlov.models.classifiers.keras_classification_model import KerasClassificationModel
from deeppavlov.models.preprocessors.one_hotter import OneHotter
from deeppavlov.models.classifiers.proba2labels import Proba2Labels

In [48]:
# Initialize `KerasClassificationModel` that builds a shallow-and-wide CNN
# (named here `cnn_model`)
cls = KerasClassificationModel(save_path="./cnn_model_v0", 
                               load_path="./cnn_model_v0", 
                               embedding_size=embedder.dim,
                               n_classes=classes_vocab.len,
                               model_name="cnn_model",
                               text_size=15, # number of tokens
                               kernel_sizes_cnn=[3, 5, 7],
                               filters_cnn=128,
                               dense_size=100,
                               optimizer="Adam",
                               learning_rate=0.1,
                               learning_rate_decay=0.01,
                               loss="categorical_crossentropy")

2018-11-01 11:09:46.672 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 242: [initializing `KerasClassificationModel` from scratch as cnn_model]
2018-11-01 11:09:46.994 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 134: Model was successfully initialized!
Model summary:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 15, 100)      0                                            
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 15, 128)      38528       input_1[0][0]                    
__________________________________________________________________________________________________
conv1d_2 (Conv1D)  

In [49]:
# `KerasClassificationModel` expects a one-hot distribution of classes per sample.
# `OneHotter` converts indices to one-hot vector representations.
# To obtain indices we can use our `classes_vocab` initialized and fitted above
onehotter = OneHotter(depth=classes_vocab.len)
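
For a single label per sample, one-hot encoding can be sketched as follows. `one_hot` is a hypothetical helper; the real component also has to deal with samples that carry several labels, which this sketch ignores.

```python
def one_hot(indices, depth):
    """Turn each class index into a vector with a single 1 at that index."""
    return [[1 if i == idx else 0 for i in range(depth)] for idx in indices]


# 7 classes, as in the SNIPS class vocabulary above
print(one_hot([0, 3], depth=7))
```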

In [50]:
# Train for 10 epochs
for ep in range(10):
    for x, y in train_iterator.gen_batches(batch_size=64, 
                                           data_type="train"):
        x_embed = embedder(tokenizer(str_lower(x)))
        y_onehot = onehotter(classes_vocab(y))
        cls.train_on_batch(x_embed, y_onehot)

In [51]:
# Save model weights and parameters
cls.save()

2018-11-01 11:12:21.498 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v0_opt.json]


In [52]:
# Inferring on validation data gives a probability distribution over classes for each sample.
y_valid_pred = cls(embedder(tokenizer(str_lower(x_valid))))

In [53]:
# To convert probability distribution to labels, 
# we first need to convert probabilities to indices,
# and then using vocabulary `classes_vocab` convert indices to labels.
# 
# `Proba2Labels` converts probabilities to indices and supports three different modes:
# if `max_proba` is true, returns indices of the highest probabilities
# if `confident_threshold` is given, returns indices with probabilities higher than the threshold
# if `top_n` is given, returns `top_n` indices with the highest probabilities
prob2labels = Proba2Labels(max_proba=True)
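
The three modes listed in the comments can be illustrated for a single sample. `proba_to_indices` is a hypothetical helper, and the order in which the modes are checked here is an assumption made for the sketch, not a statement about `Proba2Labels`.

```python
def proba_to_indices(probas, max_proba=False, confident_threshold=None, top_n=None):
    """Convert one probability distribution to class indices in one of three modes."""
    if confident_threshold is not None:
        # every class whose probability exceeds the threshold
        return [i for i, p in enumerate(probas) if p > confident_threshold]
    if max_proba:
        # the single most probable class
        return max(range(len(probas)), key=lambda i: probas[i])
    if top_n is not None:
        # the `top_n` most probable classes, best first
        ranked = sorted(range(len(probas)), key=lambda i: probas[i], reverse=True)
        return ranked[:top_n]
    raise ValueError("choose one of max_proba, confident_threshold, top_n")


p = [0.05, 0.6, 0.35]
print(proba_to_indices(p, max_proba=True))           # 1
print(proba_to_indices(p, confident_threshold=0.3))  # [1, 2]
print(proba_to_indices(p, top_n=2))                  # [1, 2]
```

`max_proba` suits single-label tasks such as this one, while a threshold or `top_n` is useful when a sample may belong to several classes.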

In [54]:
# Let's look into obtained result
print("Text sample: {}".format(x_valid[0]))
print("True label: {}".format(y_valid[0]))
print("Predicted probability distribution: {}".format(dict(zip(classes_vocab.keys(), 
                                                               y_valid_pred[0]))))
print("Predicted label: {}".format(classes_vocab(prob2labels(y_valid_pred))[0]))

Text sample: I need seating at Floating restaurant in Tennessee for a group of 9
True label: ['BookRestaurant']
Predicted probability distribution: {'GetWeather': 0.00014258868759498, 'PlayMusic': 0.00028503243811428547, 'SearchScreeningEvent': 0.00023403449449688196, 'BookRestaurant': 0.9942096471786499, 'RateBook': 0.0005214287666603923, 'SearchCreativeWork': 0.0004547557036858052, 'AddToPlaylist': 0.0007166761788539588}
Predicted label: ['BookRestaurant']


In [55]:
# calculate sets accuracy
sets_accuracy(y_valid, classes_vocab(prob2labels(y_valid_pred)))

0.9842618822788795

#### SklearnComponent classifier on GloVe weighted by TF-IDF embeddings in python

In [56]:
# initialize sklearn classifier, all parameters for classifier could be passed
cls = SklearnComponent(
    model_class="sklearn.linear_model:LogisticRegression",
    infer_method="predict",
    save_path='./logreg_v1.pkl',
    load_path='./logreg_v1.pkl',
    C=1,
    mode='train')

2018-11-01 11:12:26.171 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 164: Initializing model sklearn.linear_model:LogisticRegression from scratch


In [57]:
# fit sklearn classifier and save it
cls.fit(weighted_embedder(str_lower(tokenizer(x_train))), y_train)
cls.save()

2018-11-01 11:12:50.390 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 107: Fitting model sklearn.linear_model:LogisticRegression
2018-11-01 11:12:52.105 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 239: Saving model to logreg_v1.pkl


In [58]:
y_valid_pred = cls(weighted_embedder(str_lower(tokenizer(x_valid))))

In [59]:
# Let's look into obtained result
print("Text sample: {}".format(x_valid[0]))
print("True label: {}".format(y_valid[0]))
print("Predicted label: {}".format(y_valid_pred[0]))

Text sample: I need seating at Floating restaurant in Tennessee for a group of 9
True label: ['BookRestaurant']
Predicted label: BookRestaurant


In [60]:
# let's calculate sets accuracy (because each element is a list of labels)
sets_accuracy(np.squeeze(y_valid), y_valid_pred)

0.9184765502045955

### Let's free our memory from embeddings and models

In [61]:
embedder.reset()
cls.reset()

### Models from configs

In [49]:
from deeppavlov import build_model
from deeppavlov.core.commands.train import train_evaluate_model_from_config, _test_model

#### SklearnComponent classifier on Tfidf-features from config

In [52]:
logreg_config = {
  "dataset_reader": {
    "class_name": "basic_classification_reader",
    "x": "text",
    "y": "intents",
    "data_path": "./snips"
  },
  "dataset_iterator": {
    "class_name": "basic_classification_iterator",
    "seed": 42,
    "split_seed": 23,
    "field_to_split": "train",
    "split_fields": [
      "train",
      "valid"
    ],
    "split_proportions": [
      0.9,
      0.1
    ]
  },
  "chainer": {
    "in": [
      "x"
    ],
    "in_y": [
      "y"
    ],
    "pipe": [
      {
        "id": "classes_vocab",
        "class_name": "simple_vocab",
        "fit_on": [
          "y"
        ],
        "save_path": "./snips/classes.dict",
        "load_path": "./snips/classes.dict",
        "in": "y",
        "out": "y_ids"
      },
      {
        "in": [
          "x"
        ],
        "out": [
          "x_vec"
        ],
        "fit_on": [
          "x",
          "y_ids"
        ],
        "id": "tfidf_vec",
        "class_name": "sklearn_component",
        "save_path": "tfidf_v1.pkl",
        "load_path": "tfidf_v1.pkl",
        "model_class": "sklearn.feature_extraction.text:TfidfVectorizer",
        "infer_method": "transform"
      },
      {
        "in": "x",
        "out": "x_tok",
        "id": "my_tokenizer",
        "class_name": "nltk_moses_tokenizer",
        "tokenizer": "wordpunct_tokenize"
      },
      {
        "in": [
          "x_vec"
        ],
        "out": [
          "y_pred"
        ],
        "fit_on": [
          "x_vec",
          "y"
        ],
        "class_name": "sklearn_component",
        "main": True,
        "save_path": "logreg_v2.pkl",
        "load_path": "logreg_v2.pkl",
        "model_class": "sklearn.linear_model:LogisticRegression",
        "infer_method": "predict",
        "ensure_list_output": True
      }
    ],
    "out": [
      "y_pred"
    ]
  },
  "train": {
    "batch_size": 64,
    "metrics": [
      "accuracy"
    ],
    "validate_best": True,
    "test_best": False
  }
}


In [53]:
# we can train and evaluate model from config
m = train_evaluate_model_from_config(logreg_config)

2018-11-09 16:41:44.919 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
2018-11-09 16:41:44.924 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 100: [loading vocabulary from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/classes.dict]
2018-11-09 16:41:44.950 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 89: [saving vocabulary to /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/classes.dict]
2018-11-09 16:41:44.952 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 164: Initializing model sklearn.feature_extraction.text:TfidfVectorizer from scratch
2018-11-09 16:41:45.13 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 107: Fitting model sklearn.feature_extraction.text:TfidfVectorizer
2018-11-09 16:41

{"valid": {"eval_examples_count": 1589, "metrics": {"accuracy": 0.983}, "time_spent": "0:00:01"}}


In [54]:
# or we can just load the already trained model (it coincides with what we trained above)
m = build_model(logreg_config)

2018-11-09 16:41:48.877 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 100: [loading vocabulary from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/snips/classes.dict]
2018-11-09 16:41:48.878 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.feature_extraction.text:TfidfVectorizer from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/tfidf_v1.pkl
2018-11-09 16:41:48.883 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 208: Model sklearn.feature_extraction.textTfidfVectorizer loaded  with parameters
2018-11-09 16:41:48.886 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.linear_model:LogisticRegression from /home/dilyara/Documents/GitHub/reserve/DeepPavlov/examples/tutorials/logreg_v2.pkl
2018-11-09 16:41:48.888 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklear

In [55]:
m(["Is it freezing in Offerman, California?"])

[['GetWeather']]
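
The `"in"` and `"out"` fields of the `pipe` in the config name the variables that each component reads and writes. A deliberately simplified sketch of that dataflow is shown below; `run_pipe` and the `"fn"` key are hypothetical and exist only for this illustration, and the real chainer additionally handles fitting, `in_y`, references and more.

```python
def run_pipe(pipe, inputs):
    """Pass named variables through components: each reads "in" names and writes "out" names."""
    memory = dict(inputs)
    for step in pipe:
        args = [memory[name] for name in step["in"]]
        results = step["fn"](*args)
        if len(step["out"]) == 1:
            results = (results,)
        memory.update(zip(step["out"], results))
    return memory


toy_pipe = [
    {"in": ["x"], "out": ["x_lower"], "fn": lambda batch: [s.lower() for s in batch]},
    {"in": ["x_lower"], "out": ["x_tok"], "fn": lambda batch: [s.split() for s in batch]},
    {"in": ["x_tok"], "out": ["n_tokens"], "fn": lambda batch: [len(t) for t in batch]},
]
result = run_pipe(toy_pipe, {"x": ["Is it freezing"]})
print(result["x_tok"], result["n_tokens"])  # [['is', 'it', 'freezing']] [3]
```

Each step can only consume names produced by the inputs or by an earlier step, which is why the order of components in `pipe` matters.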

#### KerasClassificationModel on GloVe embeddings from config

In [67]:
cnn_config = {
  "dataset_reader": {
    "class_name": "basic_classification_reader",
    "x": "text",
    "y": "intents",
    "data_path": "snips"
  },
  "dataset_iterator": {
    "class_name": "basic_classification_iterator",
    "seed": 42,
    "split_seed": 23,
    "field_to_split": "train",
    "split_fields": [
      "train",
      "valid"
    ],
    "split_proportions": [
      0.9,
      0.1
    ]
  },
  "chainer": {
    "in": [
      "x"
    ],
    "in_y": [
      "y"
    ],
    "pipe": [
      {
        "id": "classes_vocab",
        "class_name": "simple_vocab",
        "fit_on": [
          "y"
        ],
        "level": "token",
        "save_path": "./snips/classes.dict",
        "load_path": "./snips/classes.dict",
        "in": "y",
        "out": "y_ids"
      },
      {
        "in": "x",
        "out": "x_tok",
        "id": "my_tokenizer",
        "class_name": "nltk_tokenizer",
        "tokenizer": "wordpunct_tokenize"
      },
      {
        "in": "x_tok",
        "out": "x_emb",
        "id": "my_embedder",
        "class_name": "glove",
        "load_path": "./glove.6B.100d.txt",
        "dim": 100
      },
      {
        "in": "y_ids",
        "out": "y_onehot",
        "class_name": "one_hotter",
        "depth": "#classes_vocab.len"
      },
      {
        "in": [
          "x_emb"
        ],
        "in_y": [
          "y_onehot"
        ],
        "out": [
          "y_pred_probas"
        ],
        "main": True,
        "class_name": "keras_classification_model",
        "save_path": "./cnn_model_v1",
        "load_path": "./cnn_model_v1",
        "embedding_size": "#my_embedder.dim",
        "n_classes": "#classes_vocab.len",
        "kernel_sizes_cnn": [
          1,
          2,
          3
        ],
        "filters_cnn": 256,
        "optimizer": "Adam",
        "learning_rate": 0.01,
        "learning_rate_decay": 0.1,
        "loss": "categorical_crossentropy",
        "text_size": 15,
        "coef_reg_cnn": 1e-4,
        "coef_reg_den": 1e-4,
        "dropout_rate": 0.5,
        "dense_size": 100,
        "model_name": "cnn_model"
      },
      {
        "in": "y_pred_probas",
        "out": "y_pred_ids",
        "class_name": "proba2labels",
        "max_proba": True
      },
      {
        "in": "y_pred_ids",
        "out": "y_pred_labels",
        "ref": "classes_vocab"
      }
    ],
    "out": [
      "y_pred_labels"
    ]
  },
  "train": {
    "epochs": 10,
    "batch_size": 64,
    "metrics": [
      "sets_accuracy",
      "f1_macro",
      {
        "name": "roc_auc",
        "inputs": ["y_onehot", "y_pred_probas"]
      }
    ],
    "validation_patience": 5,
    "val_every_n_epochs": 1,
    "log_every_n_epochs": 1,
    "show_examples": True,
    "validate_best": True,
    "test_best": False
  }
}
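
The last two steps of the pipe turn class probabilities into label strings: `proba2labels` with `"max_proba": True` keeps the most probable class per sample, and the `classes_vocab` reference maps the resulting ids back to intent names. The sketch below illustrates that idea in plain Python. It is not DeepPavlov's implementation: the helper names, the class order in `CLASSES`, and the probabilities are hypothetical, and the nested output shape simply mirrors the `[['GetWeather']]` results shown in this notebook.

```python
from typing import List, Sequence


def proba_to_label_ids(probas: Sequence[Sequence[float]]) -> List[int]:
    """For each sample, return the index of its highest probability."""
    return [max(range(len(row)), key=row.__getitem__) for row in probas]


def ids_to_labels(ids: Sequence[int], vocab: Sequence[str]) -> List[List[str]]:
    """Map class ids to label strings, one single-item list per sample."""
    return [[vocab[i]] for i in ids]


# Hypothetical class order; the real order is whatever classes_vocab learned.
CLASSES = [
    "AddToPlaylist",
    "BookRestaurant",
    "GetWeather",
    "PlayMusic",
    "RateBook",
    "SearchCreativeWork",
]

# Hypothetical model output for two utterances (one row per sample).
y_pred_probas = [
    [0.01, 0.02, 0.90, 0.03, 0.02, 0.02],
    [0.05, 0.80, 0.05, 0.04, 0.03, 0.03],
]

y_pred_ids = proba_to_label_ids(y_pred_probas)
y_pred_labels = ids_to_labels(y_pred_ids, CLASSES)
print(y_pred_labels)
```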


In [68]:
# train and evaluate the model from the config
m = train_evaluate_model_from_config(cnn_config)

2018-11-01 11:17:31.292 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
2018-11-01 11:17:31.297 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from snips/classes.dict]
2018-11-01 11:17:31.302 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 86: [saving vocabulary to snips/classes.dict]
2018-11-01 11:17:31.303 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `glove.6B.100d.txt`]
2018-11-01 11:17:31.303 INFO in 'gensim.models.keyedvectors'['keyedvectors'] at line 204: loading projection weights from glove.6B.100d.txt
2018-11-01 11:17:31.304 DEBUG in 'smart_open.smart_open_lib'['smart_open_lib'] at line 149: {'kw': {}, 'mode': 'rb', 'uri': 'glove.6B.100d.txt'}
2018-11-01 11:17:31.304 DEBUG in 'smart_open.smart_open_lib'['smart_open_lib

{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.1517, "f1_macro": 0.1107, "roc_auc": 0.5186}, "time_spent": "0:00:01", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 0, "batches_seen": 0, "train_examples_seen": 0, "impatience": 0, "patience_limit": 5}}


2018-11-01 11:17:57.713 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9327
2018-11-01 11:17:57.714 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-11-01 11:17:57.714 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v1_opt.json]


{"train": {"epochs_done": 1, "batches_seen": 224, "train_examples_seen": 14295, "metrics": {"sets_accuracy": 0.9027, "f1_macro": 0.9029, "roc_auc": 0.9797}, "time_spent": "0:00:05", "examples": [{"x": "Add lisa m to my guitar hero live playlist", "y_predicted": "y_pred_labels", "y_true": ["AddToPlaylist"]}], "loss": 1.3867814327989305}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9327, "f1_macro": 0.9328, "roc_auc": 0.9959}, "time_spent": "0:00:05", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 1, "batches_seen": 224, "train_examples_seen": 14295, "impatience": 0, "patience_limit": 5}}


2018-11-01 11:18:00.79 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9478
2018-11-01 11:18:00.80 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-11-01 11:18:00.80 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v1_opt.json]


{"train": {"epochs_done": 2, "batches_seen": 448, "train_examples_seen": 28590, "metrics": {"sets_accuracy": 0.9543, "f1_macro": 0.9548, "roc_auc": 0.9974}, "time_spent": "0:00:07", "examples": [{"x": "I give The Monkey and the Tiger a rating of 2 points.", "y_predicted": "y_pred_labels", "y_true": ["RateBook"]}], "loss": 1.2632445618510246}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9478, "f1_macro": 0.9473, "roc_auc": 0.997}, "time_spent": "0:00:07", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 2, "batches_seen": 448, "train_examples_seen": 28590, "impatience": 0, "patience_limit": 5}}


2018-11-01 11:18:02.625 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9497
2018-11-01 11:18:02.625 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-11-01 11:18:02.626 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v1_opt.json]


{"train": {"epochs_done": 3, "batches_seen": 672, "train_examples_seen": 42885, "metrics": {"sets_accuracy": 0.9593, "f1_macro": 0.9596, "roc_auc": 0.9979}, "time_spent": "0:00:10", "examples": [{"x": "play Iheart tunes by Neil Finn", "y_predicted": "y_pred_labels", "y_true": ["PlayMusic"]}], "loss": 1.2202769278415613}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9497, "f1_macro": 0.9492, "roc_auc": 0.9974}, "time_spent": "0:00:10", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 3, "batches_seen": 672, "train_examples_seen": 42885, "impatience": 0, "patience_limit": 5}}


2018-11-01 11:18:04.987 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9534
2018-11-01 11:18:04.988 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-11-01 11:18:04.988 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v1_opt.json]


{"train": {"epochs_done": 4, "batches_seen": 896, "train_examples_seen": 57180, "metrics": {"sets_accuracy": 0.9633, "f1_macro": 0.9636, "roc_auc": 0.9982}, "time_spent": "0:00:12", "examples": [{"x": "Please play a song off the Curtis Lee album Rough Diamonds", "y_predicted": "y_pred_labels", "y_true": ["PlayMusic"]}], "loss": 1.1903010856892382}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9534, "f1_macro": 0.953, "roc_auc": 0.9976}, "time_spent": "0:00:12", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 4, "batches_seen": 896, "train_examples_seen": 57180, "impatience": 0, "patience_limit": 5}}


2018-11-01 11:18:07.329 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9553
2018-11-01 11:18:07.329 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-11-01 11:18:07.330 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v1_opt.json]


{"train": {"epochs_done": 5, "batches_seen": 1120, "train_examples_seen": 71475, "metrics": {"sets_accuracy": 0.9659, "f1_macro": 0.9662, "roc_auc": 0.9984}, "time_spent": "0:00:14", "examples": [{"x": "Give me Slovakia's weather forecast for eight am", "y_predicted": "y_pred_labels", "y_true": ["GetWeather"]}], "loss": 1.1678305316184248}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9553, "f1_macro": 0.9548, "roc_auc": 0.9977}, "time_spent": "0:00:15", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 5, "batches_seen": 1120, "train_examples_seen": 71475, "impatience": 0, "patience_limit": 5}}


2018-11-01 11:18:09.857 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9578
2018-11-01 11:18:09.857 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-11-01 11:18:09.858 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v1_opt.json]


{"train": {"epochs_done": 6, "batches_seen": 1344, "train_examples_seen": 85770, "metrics": {"sets_accuracy": 0.9676, "f1_macro": 0.9678, "roc_auc": 0.9985}, "time_spent": "0:00:17", "examples": [{"x": "rate this current textbook 0 points", "y_predicted": "y_pred_labels", "y_true": ["RateBook"]}], "loss": 1.1507995431976659}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9578, "f1_macro": 0.9574, "roc_auc": 0.9979}, "time_spent": "0:00:17", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 6, "batches_seen": 1344, "train_examples_seen": 85770, "impatience": 0, "patience_limit": 5}}


2018-11-01 11:18:12.237 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9585
2018-11-01 11:18:12.238 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-11-01 11:18:12.238 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v1_opt.json]


{"train": {"epochs_done": 7, "batches_seen": 1568, "train_examples_seen": 100065, "metrics": {"sets_accuracy": 0.9687, "f1_macro": 0.9688, "roc_auc": 0.9986}, "time_spent": "0:00:19", "examples": [{"x": "I need a bar for four that serves argentinian in D'Iberville, WY for twelve PM", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "loss": 1.137291977448123}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9585, "f1_macro": 0.958, "roc_auc": 0.998}, "time_spent": "0:00:19", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 7, "batches_seen": 1568, "train_examples_seen": 100065, "impatience": 0, "patience_limit": 5}}


2018-11-01 11:18:14.601 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9604
2018-11-01 11:18:14.601 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-11-01 11:18:14.601 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v1_opt.json]


{"train": {"epochs_done": 8, "batches_seen": 1792, "train_examples_seen": 114360, "metrics": {"sets_accuracy": 0.97, "f1_macro": 0.9702, "roc_auc": 0.9987}, "time_spent": "0:00:22", "examples": [{"x": "play The Sea Cabinet", "y_predicted": "y_pred_labels", "y_true": ["SearchCreativeWork"]}], "loss": 1.1260945099805082}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9604, "f1_macro": 0.9599, "roc_auc": 0.9981}, "time_spent": "0:00:22", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 8, "batches_seen": 1792, "train_examples_seen": 114360, "impatience": 0, "patience_limit": 5}}


2018-11-01 11:18:17.115 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.961
2018-11-01 11:18:17.116 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-11-01 11:18:17.116 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v1_opt.json]


{"train": {"epochs_done": 9, "batches_seen": 2016, "train_examples_seen": 128655, "metrics": {"sets_accuracy": 0.97, "f1_macro": 0.9702, "roc_auc": 0.9987}, "time_spent": "0:00:24", "examples": [{"x": "add Ik Tara to laundry playlst", "y_predicted": "y_pred_labels", "y_true": ["AddToPlaylist"]}], "loss": 1.1170877803649222}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.961, "f1_macro": 0.9605, "roc_auc": 0.9981}, "time_spent": "0:00:24", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 9, "batches_seen": 2016, "train_examples_seen": 128655, "impatience": 0, "patience_limit": 5}}


2018-11-01 11:18:19.616 INFO in 'deeppavlov.core.commands.train'['train'] at line 518: New best sets_accuracy of 0.9622
2018-11-01 11:18:19.617 INFO in 'deeppavlov.core.commands.train'['train'] at line 520: Saving model
2018-11-01 11:18:19.617 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 356: [saving model to cnn_model_v1_opt.json]
2018-11-01 11:18:19.671 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from snips/classes.dict]
2018-11-01 11:18:19.672 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `glove.6B.100d.txt`]
2018-11-01 11:18:19.672 INFO in 'gensim.models.keyedvectors'['keyedvectors'] at line 204: loading projection weights from glove.6B.100d.txt
2018-11-01 11:18:19.673 DEBUG in 'smart_open.smart_open_lib'['smart_open_lib'] at line 149: {'kw': {}, 'mode': 'rb', 'uri': 'glove.6B.100d.txt'}
2018-11-01 11:18:19.6

{"train": {"epochs_done": 10, "batches_seen": 2240, "train_examples_seen": 142950, "metrics": {"sets_accuracy": 0.9713, "f1_macro": 0.9715, "roc_auc": 0.9988}, "time_spent": "0:00:27", "examples": [{"x": "Is it going to be hot in Karthaus at 7 AM?", "y_predicted": "y_pred_labels", "y_true": ["GetWeather"]}], "loss": 1.1077362325574671}}
{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9622, "f1_macro": 0.9617, "roc_auc": 0.9982}, "time_spent": "0:00:27", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}], "epochs_done": 10, "batches_seen": 2240, "train_examples_seen": 142950, "impatience": 0, "patience_limit": 5}}


2018-11-01 11:18:42.734 INFO in 'gensim.models.keyedvectors'['keyedvectors'] at line 266: loaded (400000, 100) matrix from glove.6B.100d.txt
2018-11-01 11:18:42.740 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 272: [initializing `KerasClassificationModel` from saved]
2018-11-01 11:18:43.157 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 282: [loading weights from cnn_model_v1.h5]
2018-11-01 11:18:43.409 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 134: Model was successfully initialized!
Model summary:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 15, 100)      0                                            
_________________

{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9622, "f1_macro": 0.9617, "roc_auc": 0.9982}, "time_spent": "0:00:01", "examples": [{"x": "Book a table at Carter House Inn in Saint Bonaventure, Alaska.", "y_predicted": "y_pred_labels", "y_true": ["BookRestaurant"]}]}}
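
Each report in the training output above is a single-line JSON object, so the numbers can be checked programmatically. The sketch below parses two of those lines (copied from the first epoch, with the `examples` and `loss` fields dropped for brevity) and verifies the bookkeeping against the config: 14,295 training examples at `batch_size` 64 give 224 batches per epoch, and the 1,589 validation examples are about 10% of the total, as `split_proportions` requests. The helper `parse_reports` is a hypothetical name, not part of DeepPavlov.

```python
import json
import math

# Report lines copied from the first epoch of the training output,
# with the "examples" and "loss" fields removed for brevity.
REPORT_LINES = [
    '{"train": {"epochs_done": 1, "batches_seen": 224, '
    '"train_examples_seen": 14295, "metrics": {"sets_accuracy": 0.9027, '
    '"f1_macro": 0.9029, "roc_auc": 0.9797}, "time_spent": "0:00:05"}}',
    '{"valid": {"eval_examples_count": 1589, "metrics": '
    '{"sets_accuracy": 0.9327, "f1_macro": 0.9328, "roc_auc": 0.9959}, '
    '"time_spent": "0:00:05", "epochs_done": 1, "batches_seen": 224, '
    '"train_examples_seen": 14295, "impatience": 0, "patience_limit": 5}}',
]


def parse_reports(lines):
    """Return (split, report) pairs, skipping lines that are not JSON objects."""
    reports = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # plain log line, not a report
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed report
        for split, report in record.items():
            reports.append((split, report))
    return reports


reports = parse_reports(REPORT_LINES)
train = next(r for s, r in reports if s == "train")
valid = next(r for s, r in reports if s == "valid")

batch_size = 64  # "batch_size" from the "train" section of cnn_config
expected_batches = math.ceil(train["train_examples_seen"] / batch_size)
total = train["train_examples_seen"] + valid["eval_examples_count"]
valid_share = valid["eval_examples_count"] / total

print(expected_batches, train["batches_seen"])
print(total, round(valid_share, 4))
```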


In [69]:
# or we can just load the pre-trained model (coincides with what we did above)
m = build_model(cnn_config)

2018-11-01 11:18:43.729 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from snips/classes.dict]
2018-11-01 11:18:43.731 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `glove.6B.100d.txt`]
2018-11-01 11:18:43.731 INFO in 'gensim.models.keyedvectors'['keyedvectors'] at line 204: loading projection weights from glove.6B.100d.txt
2018-11-01 11:18:43.732 DEBUG in 'smart_open.smart_open_lib'['smart_open_lib'] at line 149: {'kw': {}, 'mode': 'rb', 'uri': 'glove.6B.100d.txt'}
2018-11-01 11:18:43.733 DEBUG in 'smart_open.smart_open_lib'['smart_open_lib'] at line 621: encoding_wrapper: {'errors': 'strict', 'encoding': None, 'mode': 'rb', 'fileobj': <_io.BufferedReader name='glove.6B.100d.txt'>}
2018-11-01 11:19:06.676 INFO in 'gensim.models.keyedvectors'['keyedvectors'] at line 266: loaded (400000, 100) matrix from glove.6B.100d.txt
2018-11-01 11:19:06.691 INFO in 'deeppavlov.models.cl

In [70]:
m(["Is it freezing in Offerman, California?"])

[['GetWeather']]

#### SklearnComponent classifier on GloVe weighted by TF-IDF embeddings from config

In [71]:
logreg_config = {
  "dataset_reader": {
    "class_name": "basic_classification_reader",
    "x": "text",
    "y": "intents",
    "data_path": "snips"
  },
  "dataset_iterator": {
    "class_name": "basic_classification_iterator",
    "seed": 42,
    "split_seed": 23,
    "field_to_split": "train",
    "split_fields": [
      "train",
      "valid"
    ],
    "split_proportions": [
      0.9,
      0.1
    ]
  },
  "chainer": {
    "in": [
      "x"
    ],
    "in_y": [
      "y"
    ],
    "pipe": [
      {
        "id": "classes_vocab",
        "class_name": "simple_vocab",
        "fit_on": [
          "y"
        ],
        "save_path": "./snips/classes.dict",
        "load_path": "./snips/classes.dict",
        "in": "y",
        "out": "y_ids"
      },
      {
        "in": [
          "x"
        ],
        "out": [
          "x_vec"
        ],
        "fit_on": [
          "x",
          "y_ids"
        ],
        "id": "my_tfidf_vectorizer",
        "class_name": "sklearn_component",
        "save_path": "tfidf_v2.pkl",
        "load_path": "tfidf_v2.pkl",
        "model_class": "sklearn.feature_extraction.text:TfidfVectorizer",
        "infer_method": "transform"
      },
      {
        "in": "x",
        "out": "x_tok",
        "id": "my_tokenizer",
        "class_name": "nltk_moses_tokenizer"
      },
      {
        "in": "x_tok",
        "out": "x_emb",
        "id": "my_embedder",
        "class_name": "glove",
        "save_path": "./glove.6B.100d.txt",
        "load_path": "./glove.6B.100d.txt",
        "dim": 100
      },
      {
        "class_name": "one_hotter",
        "id": "my_onehotter",
        "depth": "#classes_vocab.len",
        "in": "y_ids",
        "out": "y_onehot"
      },
      {
        "in": "x_tok",
        "out": "x_weighted_emb",
        "class_name": "tfidf_weighted",
        "id": "my_weighted_embedder",
        "embedder": "#my_embedder",
        "tokenizer": "#my_tokenizer",
        "vectorizer": "#my_tfidf_vectorizer",
        "mean": True
      },
      {
        "in": [
          "x_weighted_emb"
        ],
        "out": [
          "y_pred"
        ],
        "fit_on": [
          "x_weighted_emb",
          "y"
        ],
        "class_name": "sklearn_component",
        "main": True,
        "save_path": "logreg_v3.pkl",
        "load_path": "logreg_v3.pkl",
        "model_class": "sklearn.linear_model:LogisticRegression",
        "infer_method": "predict",
        "ensure_list_output": True
      }
    ],
    "out": [
      "y_pred"
    ]
  },
  "train": {
    "epochs": 10,
    "batch_size": 64,
    "metrics": [
      "sets_accuracy"
    ],
    "show_examples": False,
    "validate_best": True,
    "test_best": False
  }
}
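
The `tfidf_weighted` component with `"mean": True` combines the token embeddings of an utterance into one vector, giving tokens with higher TF-IDF scores more influence. The sketch below shows the general idea as a weighted average in plain Python. It is an illustration under stated assumptions, not DeepPavlov's implementation: the function name, the toy 2-dimensional embeddings, and the TF-IDF weights are all hypothetical, and the library may treat unknown tokens or zero weights differently.

```python
from typing import Dict, List, Sequence


def weighted_mean_embedding(
    tokens: Sequence[str],
    embeddings: Dict[str, List[float]],
    weights: Dict[str, float],
    dim: int,
) -> List[float]:
    """Average token vectors, weighting each token by its TF-IDF score.

    Tokens without an embedding or with zero weight are skipped; if nothing
    contributes, a zero vector is returned.
    """
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        vector = embeddings.get(token)
        weight = weights.get(token, 0.0)
        if vector is None or weight == 0.0:
            continue
        for i, value in enumerate(vector):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [value / weight_sum for value in total]


# Toy data: hypothetical 2-dimensional embeddings and TF-IDF weights.
TOY_EMBEDDINGS = {"is": [1.0, 0.0], "freezing": [0.0, 1.0]}
TOY_TFIDF = {"is": 0.25, "freezing": 0.75}

vec = weighted_mean_embedding(
    ["is", "it", "freezing"], TOY_EMBEDDINGS, TOY_TFIDF, dim=2
)
print(vec)
```

In this toy case the rarer, more informative token "freezing" dominates the sentence vector, which is the motivation for weighting by TF-IDF rather than taking a plain mean.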


In [72]:
# train and evaluate the model from the config
m = train_evaluate_model_from_config(logreg_config)

2018-11-01 11:24:06.729 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
2018-11-01 11:24:06.732 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from snips/classes.dict]
2018-11-01 11:24:06.737 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 86: [saving vocabulary to snips/classes.dict]
2018-11-01 11:24:06.739 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 164: Initializing model sklearn.feature_extraction.text:TfidfVectorizer from scratch
2018-11-01 11:24:06.763 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 107: Fitting model sklearn.feature_extraction.text:TfidfVectorizer
2018-11-01 11:24:06.870 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 239: Saving model to tfidf_v2.pkl
2018-11-01 11

{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9283}, "time_spent": "0:00:03"}}


In [73]:
# or we can just load the pre-trained model (coincides with what we did above)
m = build_model(logreg_config)

2018-11-01 11:25:17.318 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from snips/classes.dict]
2018-11-01 11:25:17.319 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.feature_extraction.text:TfidfVectorizer from tfidf_v2.pkl
2018-11-01 11:25:17.324 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 208: Model sklearn.feature_extraction.textTfidfVectorizer loaded  with parameters
2018-11-01 11:25:17.326 INFO in 'deeppavlov.models.embedders.glove_embedder'['glove_embedder'] at line 52: [loading GloVe embeddings from `glove.6B.100d.txt`]
2018-11-01 11:25:17.326 INFO in 'gensim.models.keyedvectors'['keyedvectors'] at line 204: loading projection weights from glove.6B.100d.txt
2018-11-01 11:25:17.326 DEBUG in 'smart_open.smart_open_lib'['smart_open_lib'] at line 149: {'kw': {}, 'mode': 'rb', 'uri': 'glove.6B.100d.txt'}
2018-11-01 11:25:17.327 DEBUG in '

In [74]:
m(["Is it freezing in Offerman, California?"])

[['GetWeather']]

In [75]:
# let's free memory
del m

## Bonus: pre-trained CNN model in DeepPavlov

Download the model files (including the ~8 GB `wiki.en.bin` embeddings):

! python -m deeppavlov download intents_snips_big

Evaluate metrics on the validation set (no test set is provided):

! python -m deeppavlov evaluate intents_snips_big

Alternatively, use the model from Python code:

In [56]:
from pathlib import Path

import deeppavlov
from deeppavlov import build_model
from deeppavlov.download import deep_download

config_path = Path(deeppavlov.__file__).parent.joinpath('configs/classifiers/intents_snips_big.json')

In [78]:
# let's download all the required data - model files, embeddings, vocabularies
deep_download(config_path)

2018-10-31 17:10:28.776 INFO in 'deeppavlov.download'['download'] at line 112: Downloading...
2018-10-31 17:10:28.778 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 208: Starting new HTTP connection (1): files.deeppavlov.ai
2018-10-31 17:10:28.801 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 396: http://files.deeppavlov.ai:80 "GET /datasets/snips_intents/train.csv HTTP/1.1" 200 980824
2018-10-31 17:10:28.802 INFO in 'deeppavlov.core.data.utils'['utils'] at line 59: Downloading from http://files.deeppavlov.ai/datasets/snips_intents/train.csv to /home/dilyara/Documents/GitHub/DeepPavlov/download/snips/train.csv
100%|██████████| 981k/981k [00:00<00:00, 13.5MB/s]
2018-10-31 17:10:28.880 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 208: Starting new HTTP connection (1): files.deeppavlov.ai
2018-10-31 17:10:28.908 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 396: http://files.deeppavlov.ai:80 "GET /deeppavlov_data/classifiers/intents_

In [79]:
# now we can initialize the model
m = build_model(config_path)

2018-10-31 17:12:51.320 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from /home/dilyara/Documents/GitHub/DeepPavlov/download/classifiers/intents_snips_v8/classes.dict]
2018-10-31 17:12:51.656 INFO in 'deeppavlov.models.embedders.fasttext_embedder'['fasttext_embedder'] at line 52: [loading fastText embeddings from `/home/dilyara/Documents/GitHub/DeepPavlov/download/embeddings/wiki.en.bin`]
2018-10-31 17:13:13.599 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 272: [initializing `KerasClassificationModel` from saved]
2018-10-31 17:13:14.75 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 282: [loading weights from model.h5]
2018-10-31 17:13:14.309 INFO in 'deeppavlov.models.classifiers.keras_classification_model'['keras_classification_model'] at line 134: Model was successfully initialized!
Model summary:
________________________

In [80]:
m(["Is it freezing in Offerman, California?"])

[['GetWeather']]

In [82]:
# let's free memory
del m

In [3]:
# or evaluate the model WITHOUT training
train_evaluate_model_from_config(config_path, to_train=False)

2018-10-31 17:16:00.88 INFO in 'deeppavlov.dataset_iterators.basic_classification_iterator'['basic_classification_iterator'] at line 73: Splitting field <<train>> to new fields <<['train', 'valid']>>
2018-10-31 17:16:00.99 INFO in 'deeppavlov.core.data.simple_vocab'['simple_vocab'] at line 97: [loading vocabulary from /home/dilyara/Documents/GitHub/DeepPavlov/download/classifiers/intents_snips_v8/classes.dict]
[nltk_data] Downloading package punkt to /home/dilyara/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/dilyara/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package perluniprops to
[nltk_data]     /home/dilyara/nltk_data...
[nltk_data]   Package perluniprops is already up-to-date!
[nltk_data] Downloading package nonbreaking_prefixes to
[nltk_data]     /home/dilyara/nltk_data...
[nltk_data]   Package nonbreaking_prefixes is already up-to-date!
2018-10-31

{"valid": {"eval_examples_count": 1589, "metrics": {"sets_accuracy": 0.9811, "f1_macro": 0.9808, "roc_auc": 0.9989}, "time_spent": "0:00:02", "examples": [{"x": "Put some mac wiseman in my latino caliente playlist. ", "y_predicted": "y_pred_labels", "y_true": ["AddToPlaylist"]}]}}


{'valid': OrderedDict([('sets_accuracy', 0.9811),
              ('f1_macro', 0.9808),
              ('roc_auc', 0.9989)])}