# DeepPavlov: Utterance classification with a small training set (autoFAQ models)


This notebook contains code snippets for [DeepPavlov](https://github.com/deepmipt/DeepPavlov), an open-source conversational AI framework. The snippets show how to interact with text classification models that were developed specifically to be effective when training data are limited. A popular use case for these models is to match a user utterance to one of the questions in an FAQ and retrieve the corresponding answer (autoFAQ models). As a testbed, we used the students' FAQ from the [MIPT website](https://mipt.ru/english/edu/faqs/), which contains the most popular questions asked by first-year students along with the corresponding answers.
The framework allows you to train models, tune hyperparameters, and test models.

# Requirements

First, install all required packages.

In [None]:
# !pip install spacy
# !python -m spacy download en_core_web_sm

In [None]:
%%bash
# pip install deeppavlov
# pip install spacy
python -m spacy download en_core_web_sm

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz (37.4MB)

    Linking successful
    /usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
    /usr/local/lib/python3.6/dist-packages/spacy/data/en_core_web_sm

    You can now load the model via spacy.load('en_core_web_sm')



# Model Description

DeepPavlov contains several text classification models that work well with only a few training pairs. All of them are based on two major text representations: fastText word embeddings and tf-idf vectors. The models are described in separate configuration files in the [configs/faq folder](https://github.com/deepmipt/DeepPavlov/tree/master/deeppavlov/configs/faq). A config file consists of four main sections: **dataset_reader**, **dataset_iterator**, **chainer**, and **train**.

The **dataset_iterator** section specifies how to split the data into train, validation, and test sets. The **chainer** section contains the pipeline of components required to interact with the model: a tokenizer, a lemmatizer, a tf-idf vectorizer, and others. The tokenizer splits a string into tokens, and the lemmatizer converts the tokens into lemmas. The tf-idf vectorizer transforms the lemmas into tf-idf vectors. Each component's input and output are defined by its **in** and **out** keys, respectively.
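
To make the vectorization step concrete, here is a minimal pure-Python sketch of tf-idf weighting. It is not the DeepPavlov or scikit-learn implementation: the whitespace tokenizer is only a stand-in for the spaCy tokenizer and lemmatizer, and the weighting (smoothed idf followed by L2 normalization) is assumed to mirror the defaults of scikit-learn's `TfidfVectorizer`. The names `tokenize`, `fit_idf`, and `tfidf_vector` and the toy questions are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (stand-in for the spaCy tokenizer and lemmatizer)."""
    return text.lower().split()


def fit_idf(documents):
    """Compute a smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document.
        doc_freq.update(set(tokenize(doc)))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1 (assumed to match scikit-learn's default).
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0 for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Map a text to an L2-normalized {term: weight} dict; terms unseen in training are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


# Toy corpus, invented for illustration.
questions = [
    "where is the library",
    "where is the dormitory",
    "how to get a visa",
]
idf = fit_idf(questions)
vector = tfidf_vector("where is the library", idf)
print(vector)
```

Terms that occur in fewer questions (here `library`) receive a larger weight than terms shared by several questions (`where`, `is`, `the`), which is what lets a linear classifier separate questions that differ in only a few words.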

The [configuration file](https://github.com/deepmipt/DeepPavlov/blob/master/deeppavlov/configs/faq/tfidf_logreg_en_faq.json) for the model based on logistic regression is shown below.

In [None]:
{
  "dataset_reader": {
    "name": "faq_reader",
    "x_col_name": "Question",
    "y_col_name": "Answer",
    "data_url": "http://files.deeppavlov.ai/faq/mipt/faq.csv"
  },
  "dataset_iterator": {
    "name": "data_learning_iterator"
  },
  "chainer": {
    "in": "q",
    "pipe": [
      {
        "name": "stream_spacy_tokenizer",
        "in": "q",
        "id": "my_tokenizer",
        "lemmas": true,
        "out": "q_token_lemmas"
      },
      {
        "ref": "my_tokenizer",
        "in": "q_token_lemmas",
        "out": "q_lem"
      },
      {
        "in": [
          "q_lem"
        ],
        "out": [
          "q_vect"
        ],
        "fit_on": [
          "q_lem"
        ],
        "id": "tfidf_vec",
        "name": "sklearn_component",
        "save_path": "faq/mipt/en_mipt_faq_v1/tfidf.pkl",
        "load_path": "faq/mipt/en_mipt_faq_v1/tfidf.pkl",
        "model_class": "sklearn.feature_extraction.text:TfidfVectorizer",
        "infer_method": "transform"
      },
      {
        "in": "q_vect",
        "fit_on": [
          "q_vect",
          "y"
        ],
        "out": [
          "answer"
        ],
        "name": "sklearn_component",
        "main": true,
        "save_path": "faq/mipt/en_mipt_faq_v1/logreg.pkl",
        "load_path": "faq/mipt/en_mipt_faq_v1/logreg.pkl",
        "model_class": "sklearn.linear_model:LogisticRegression",
        "infer_method": "predict",
        "C": 1000,
        "penalty": "l2"
      }
    ],
    "out": [
      "answer"
    ]
  },
  "train": {
    "validate_best": false,
    "test_best": false
  },
  "metadata": {
    "requirements": [
      "../dp_requirements/spacy.txt",
      "../dp_requirements/en_core_web_sm.txt"
    ],
    "download": [
      {
        "url": "http://files.deeppavlov.ai/faq/mipt/en_mipt_faq_v1.tar.gz",
        "subdir": "faq/mipt"
      }
    ]
  }
}
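
The data flow that the **in** and **out** keys define can be traced programmatically. The sketch below uses only the standard library: it walks an abridged copy of the `pipe` list from the config above and checks that every component reads only variables produced earlier in the pipe or passed into the chainer. `as_list` and `trace_pipe` are hypothetical helpers written for this illustration; they are not part of DeepPavlov.

```python
def as_list(value):
    """Config values for `in`/`out` may be a single string or a list of strings."""
    return [value] if isinstance(value, str) else list(value)


def trace_pipe(chainer):
    """Return (label, inputs, outputs) triples and verify that every input was
    produced earlier in the pipe or is an input of the chainer itself."""
    available = set(as_list(chainer["in"]))
    steps = []
    for component in chainer["pipe"]:
        # Components that reuse another component carry `ref` instead of `name`.
        label = component.get("name") or "ref:" + component["ref"]
        inputs = as_list(component.get("in", []))
        outputs = as_list(component.get("out", []))
        missing = [name for name in inputs if name not in available]
        if missing:
            raise ValueError(f"{label} reads undefined variables: {missing}")
        available.update(outputs)
        steps.append((label, inputs, outputs))
    return steps


# Abridged copy of the chainer section shown above (paths and hyperparameters omitted).
chainer = {
    "in": "q",
    "pipe": [
        {"name": "stream_spacy_tokenizer", "in": "q", "id": "my_tokenizer",
         "lemmas": True, "out": "q_token_lemmas"},
        {"ref": "my_tokenizer", "in": "q_token_lemmas", "out": "q_lem"},
        {"name": "sklearn_component", "id": "tfidf_vec",
         "in": ["q_lem"], "out": ["q_vect"], "fit_on": ["q_lem"]},
        {"name": "sklearn_component", "main": True,
         "in": "q_vect", "out": ["answer"], "fit_on": ["q_vect", "y"]},
    ],
    "out": ["answer"],
}

steps = trace_pipe(chainer)
for label, inputs, outputs in steps:
    print(f"{label}: {inputs} -> {outputs}")
```

Reading the printed steps from top to bottom gives the chain `q`, `q_token_lemmas`, `q_lem`, `q_vect`, `answer`, which matches the tokenizer, lemmatizer, vectorizer, and classifier stages described earlier.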


# Interacting with the model

The DeepPavlov framework contains several models pre-trained on the aforementioned MIPT FAQ corpus. The files with the pre-trained models are listed in the **metadata: download** section of the model's configuration file. You can interact with a model by running DeepPavlov from the command line with the ***interact*** command and the name of the model's configuration file (the `-d` flag downloads all required files). Before the first run, install the model's requirements with the ***install*** command.

In [None]:
!python -m deeppavlov install tfidf_logreg_en_faq

2018-12-06 15:26:50.590 INFO in 'deeppavlov.core.common.file'['file'] at line 31: Interpreting 'tfidf_logreg_en_faq' as '/usr/local/lib/python3.6/dist-packages/deeppavlov/configs/faq/tfidf_logreg_en_faq.json'
Collecting spacy==2.0.5
  Downloading https://files.pythonhosted.org/packages/eb/21/d0370cd5d6b7061b1fe09d9d4266cc0a82e3feb39de4aa22f6c574ae84b7/spacy-2.0.5.tar.gz (13.3MB)
Collecting murmurhash<0.29,>=0.28 (from spacy==2.0.5)
  Downloading https://files.pythonhosted.org/packages/82/55/7f050e9f73c9a58df219c63e77304b0ff01676847061dc99abb484cff3a8/murmurhash-0.28.0-cp36-cp36m-manylinux1_x86_64.whl
Collecting cymem<1.32,>=1.30 (from spacy==2.0.5)
  Downloading https://files.pythonhosted.org/packages/a5/0f/d29aa68c55db37844c77e7e96143bd96651fd0f4453c9f6ee043ac846b77/cymem-1.31.2-cp36-cp36m-manylinux1_x86_64.whl
Collecting preshed<2.0.0,>=1.0.0 (from spacy==2.0.5)
  Downloading https://files.pythonhosted.org/pac

In [None]:
!python -m deeppavlov interact tfidf_logreg_en_faq -d

2018-12-06 15:17:11.217 INFO in 'deeppavlov.core.common.file'['file'] at line 31: Interpreting 'tfidf_logreg_en_faq' as '/usr/local/lib/python3.6/dist-packages/deeppavlov/configs/faq/tfidf_logreg_en_faq.json'
2018-12-06 15:17:11.222 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 208: Starting new HTTP connection (1): files.deeppavlov.ai
2018-12-06 15:17:11.531 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 396: http://files.deeppavlov.ai:80 "GET /faq/mipt/en_mipt_faq_v1.tar.gz.md5 HTTP/1.1" 200 119
2018-12-06 15:17:11.533 INFO in 'deeppavlov.download'['download'] at line 115: Skipped http://files.deeppavlov.ai/faq/mipt/en_mipt_faq_v1.tar.gz download because of matching hashes
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package perluniprops to /root/nltk_

Alternatively, you can build the model from Python code with ***build_model***, as in the example below.

In [None]:
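from deeppavlov.deep import find_config
from deeppavlov.core.commands.infer import build_model

# Resolve the config name to the path of the JSON file shipped with the library.
config_path = find_config('tfidf_logreg_en_faq')
# Download the pre-trained files if needed and load the trained model.
faq = build_model(config_path, load_trained=True, download=True)

# The model takes a batch (list) of questions and returns a list of answers.
a = faq(["I need help"])
a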

2018-12-06 15:39:10.567 INFO in 'deeppavlov.core.common.file'['file'] at line 31: Interpreting 'tfidf_logreg_en_faq' as '/usr/local/lib/python3.6/dist-packages/deeppavlov/configs/faq/tfidf_logreg_en_faq.json'
2018-12-06 15:39:10.581 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 208: Starting new HTTP connection (1): files.deeppavlov.ai
2018-12-06 15:39:10.985 DEBUG in 'urllib3.connectionpool'['connectionpool'] at line 396: http://files.deeppavlov.ai:80 "GET /faq/mipt/en_mipt_faq_v1.tar.gz.md5 HTTP/1.1" 200 119
2018-12-06 15:39:10.990 INFO in 'deeppavlov.download'['download'] at line 115: Skipped http://files.deeppavlov.ai/faq/mipt/en_mipt_faq_v1.tar.gz download because of matching hashes
2018-12-06 15:39:11.637 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 201: Loading model sklearn.feature_extraction.text:TfidfVectorizer from /root/.deeppavlov/models/faq/mipt/en_mipt_faq_v1/tfidf.pkl
2018-12-06 15:39:11.639 INFO in 'deeppavlov.models.

['If you have any problems you can address to Department of Foreign Students: +7 (495) 408-70-43 (Auditorium building, room 315).']

In [None]:
import en_core_web_sm
m = en_core_web_sm.load()
m

<spacy.lang.en.English at 0x7f6f319c99e8>

# Training the model

You can train a model by running the library with the ***train*** command. The model is trained on the dataset defined in the **dataset_reader** section of the configuration file. If the **metrics** key and either **validate_best** or **test_best** are set in the **train** section, the model is also validated or tested on the corresponding split defined in the **dataset_iterator** section.
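
The rule in the paragraph above can be written as a small check over a config dictionary. This is a sketch that mirrors only the rule as stated here, not DeepPavlov's actual training code. `evaluation_plan` is a hypothetical helper, and the metric name `"accuracy"` is an illustrative placeholder.

```python
def evaluation_plan(config):
    """Report which data splits would be evaluated after training, following the
    rule stated above: `metrics` must be set together with `validate_best`
    and/or `test_best` in the `train` section."""
    train = config.get("train", {})
    has_metrics = bool(train.get("metrics"))
    return {
        "valid": has_metrics and bool(train.get("validate_best")),
        "test": has_metrics and bool(train.get("test_best")),
    }


# The `train` section of tfidf_logreg_en_faq as shown earlier: no evaluation.
shipped = {"train": {"validate_best": False, "test_best": False}}

# A modified section; the metric name "accuracy" is an illustrative placeholder.
modified = {"train": {"metrics": ["accuracy"], "validate_best": True, "test_best": False}}

print(evaluation_plan(shipped))
print(evaluation_plan(modified))
```

With the shipped configuration both flags are `false` and no metrics are listed, so training finishes without any validation or test pass.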

In [None]:
!python -m deeppavlov train tfidf_logreg_en_faq

Let's modify the training data and retrain the model.

In [None]:
%%bash
wget -q http://files.deeppavlov.ai/faq/mipt/faq.csv -O faq.csv
echo "What's iPavlov?,iPavlov is the project of the Neural Networks and Deep Learning lab at MIPT" >> faq.csv
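
Appending a row with `echo` works only while neither field contains a comma or a quote character. A more robust alternative, sketched here with Python's standard `csv` module, lets the writer quote fields as needed. The sketch assumes the two-column `Question`/`Answer` layout named in the **dataset_reader** section; `append_faq_row` is a hypothetical helper, the answer text is altered to contain a comma for illustration, and the demo writes to a throwaway file rather than to the downloaded `faq.csv`.

```python
import csv
import os
import tempfile


def append_faq_row(path, question, answer):
    """Append one question/answer pair, quoting fields that contain commas or quotes."""
    with open(path, "a", newline="", encoding="utf8") as f:
        csv.writer(f).writerow([question, answer])


# Demonstrate on a throwaway file with the same two-column layout.
path = os.path.join(tempfile.mkdtemp(), "faq.csv")
with open(path, "w", newline="", encoding="utf8") as f:
    csv.writer(f).writerow(["Question", "Answer"])

# The comma inside the answer would split the row if written with plain `echo`.
append_faq_row(path, "What's iPavlov?",
               "iPavlov is a project of the Neural Networks and Deep Learning lab, MIPT")

# Read the file back to confirm that the row still has exactly two fields.
with open(path, newline="", encoding="utf8") as f:
    rows = list(csv.reader(f))
print(rows[-1])
```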

In [None]:
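import json
from deeppavlov import configs, train_model

# Load the shipped config as a dict so that it can be modified in memory.
config = json.loads(configs.faq.tfidf_logreg_en_faq.read_text(encoding='utf8'))
# Read the training data from the extended local copy instead of downloading it.
# The path assumes faq.csv was saved to /content (the default working directory on Google Colab).
config["dataset_reader"]["data_path"] = "/content/faq.csv"
config["dataset_reader"]["data_url"] = None

# Retrain on the new data and query the resulting model.
faq = train_model(config)
a = faq(["tell me about iPavlov"])
a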

# About Us

We are iPavlov. Our story started in 2017, when we decided to build a conversational AI framework that would contain all the NLP components required to build chatbots and would also be easy to use. This work resulted in the release of the DeepPavlov library. Our lab at MIPT is honored with Facebook AI Academic Partnership and NVIDIA GPU Research Center status. We combine research and extreme coding in our week-long DeepHack.me hackathons: DeepHack.Game, DeepHack.Q&A, and DeepHack.RL. We serve the global AI community by organizing the NIPS Conversational Challenge to evaluate state-of-the-art techniques in the field of dialog systems and to collect open-source dialog datasets.

# Useful links

[DeepPavlov repository](https://github.com/deepmipt/DeepPavlov)

[DeepPavlov demo page](http://demo.ipavlov.ai)

[DeepPavlov documentation](http://docs.deeppavlov.ai)