Permalink
6302b7c Nov 26, 2018
3 contributors

Users who have contributed to this file

@seliverstov @yoptar @dilyararimovna
178 lines (103 sloc) 5.36 KB

Frequently Asked Questions (FAQ)

This is implementation of FAQ skill which helps to classify incoming questions.

:: What is your open hours?
>> 8am - 8pm

Quick Start

Building

from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model

faq = build_model(configs.faq.tfidf_logreg_en_faq, load_trained=True)

Inference

result = faq(['What is your open hours?'])

If some required packages are missing, install all the requirements by running in command line:

python -m deeppavlov install fasttext_avg_autofaq
python -m deeppavlov install fasttext_tfidf_autofaq
python -m deeppavlov install tfidf_autofaq
python -m deeppavlov install tfidf_logreg_autofaq
python -m deeppavlov install tfidf_logreg_en_faq

Config

As usual, config consists of:

  • dataset_reader
  • dataset_iterator
  • chainer

You can use you own dataset_reader, dataset_iterator for speficic data. Let's consider chainer in more details.

Config Structure

  • chainer - pipeline manager
    • in - pipeline input data: question
    • out - pipeline output data: answer + score[0,1]
  • preprocessing - it can be tokenization, lemmatization, stemming and etc. In example tfidf_logreg_autofaq.json there are tokenization and lemmatization.
  • vectorizer - vectorizer of incoming sentences. It can be word embeddings vectorizer, bag of words vectorizer, tf-idf vectorizer and etc. Th output is vectorized sentences (numeric vectors).
  • classifier - This is faq model that classify incoming question. Model receive vectorized train sentences and vectorized question for inference. Output is classified answer from train dataset.

Vectorizers

Vectorizers produce numeric vectors of input sentences

  • sentence2vector_v2w_tfidf - Sentence vectorizer: weighted sum of word embeddings from sentence
    • in - input data: question
    • fit_on - train data: [token lemmas of question, word embeddings]
    • save_path - path where to save model
    • load_path - path where to load model
    • out - output data: vectorized sentence

Classifiers for FAQ

This is models that classify incoming question and find corresponding answer

  • cos_sim_classifier - Classifier based on cosine similarity
    • in - input data: question
    • fit_on - train data: [vectorized sentences, answers]
    • save_path - path where to save model
    • load_path - path where to load model
    • out - output data: [answer, score]
  • logreg_classifier - Logistic Regression classifier, that output most probable answer with score
    • in - input data: question
    • fit_on - train data: [vectorized sentences, answers]
    • c - regularization parameter for logistic regression model
    • penalty - regularization type: 'l1' or 'l2'
    • save_path - path where to save model
    • load_path - path where to load model
    • out - output data: [answer, score]

Running FAQ

Training

To train your own model by running command train, for example:

python -m deeppavlov train tfidf_autofaq

Interacting

After model has trained, you can use it for inference: model will return answers from FAQ data that used for train.

python -m deeppavlov interact fidf_autofaq -d

Inference example:

:: What is your open hours?
>> 8am - 8pm

Available Data and Pretrained Models

As an example you can try pretrained models on FAQ dataset in English: MIPT FAQ for entrants - https://mipt.ru/english/edu/faqs/

tfidf_logreg_classifier_en_mipt_faq  - http://files.deeppavlov.ai/faq/mipt/tfidf_logreg_classifier_en_mipt_faq.pkl
tfidf_vectorizer_en_mipt_faq         - http://files.deeppavlov.ai/faq/mipt/tfidf_vectorizer_en_mipt_faq.pkl
  • tfidf_logreg_classifier_en_mipt_faq.pkl - pre-trained logistic regression classifier for classifying input question (vectorized by tfidf)
  • tfidf_vectorizer_en_mipt_faq.pkl - pre-trained model for TF-IDF vectorizer based on MIPT FAQ

Example config - :config:`tfidf_logreg_en_faq.json <faq/tfidf_logreg_en_faq.json>`

Also you can use pretrained model on Russan FAQ dataset from school-site: http://www.ftl.name/page/989

tfidf_cos_sim_classifier    - http://files.deeppavlov.ai/faq/school/faq_tfidf_cos_model.pkl
tfidf_logreg_classifier     - http://files.deeppavlov.ai/faq/school/faq_tfidf_logreg_model.pkl
fasttext_cos_classifier     - http://files.deeppavlov.ai/faq/school/faq_fasttext_cos_model.pkl
tfidf_vectorizer_ruwiki     - http://files.deeppavlov.ai/vectorizer/tfidf_vectorizer_ruwiki.pkl
  • tfidf_cos_sim_classifier.pkl - pre-trained cosine similarity classifier for classifying input question (vectorized by tfidf)
  • tfidf_logreg_classifier.pkl - pre-trained logistic regression classifier for classifying input question (vectorized by tfidf)
  • fasttext_cos_classifier.pkl - pre-trained cosine similarity classifier for classifying input question (vectorized by word embeddings)
  • tfidf_vectorizer_ruwiki.pkl - pre-trained model for TF-IDF vectorizer based on Russian Wikipedia