# Fine-tuning a model on your own data

This tutorial shows how to fine-tune a pretrained model on your own dataset for question answering.

In [1]:
from haystack import Finder
from haystack.database.sql import SQLDocumentStore
from haystack.indexing.cleaning import clean_wiki_text
from haystack.indexing.io import write_documents_to_db, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.retriever.tfidf import TfidfRetriever
from haystack.utils import print_answers

## Training
We take a reader as the base model and fine-tune it on our own custom dataset. The dataset should be in a SQuAD-like format.
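
To make "SQuAD-like format" concrete, the sketch below builds a one-question dataset in the SQuAD 2.0 JSON layout and checks that each answer offset matches the context. The ID `example-0001`, the file name `my_train.json`, the question wording, and the helper `check_answer_offsets` are hypothetical names chosen for this illustration; they are not part of Haystack.

```python
import json
import os
import tempfile

# Context sentence taken from the Game of Thrones articles used later in this tutorial.
context = "Arya Stark is the third child and younger daughter of Eddard and Catelyn Stark."
answer_text = "Eddard and Catelyn Stark"

# Minimal dataset in the SQuAD 2.0 layout: data -> paragraphs -> qas -> answers.
squad_like = {
    "version": "v2.0",
    "data": [
        {
            "title": "Arya Stark",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "example-0001",  # hypothetical ID
                            "question": "Who are the parents of Arya Stark?",
                            "is_impossible": False,
                            "answers": [
                                {
                                    "text": answer_text,
                                    # Character offset of the answer within the context
                                    "answer_start": context.index(answer_text),
                                }
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}


def check_answer_offsets(dataset):
    """Return the IDs of questions whose answer text does not match its offset."""
    bad_ids = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if paragraph["context"][start:end] != answer["text"]:
                        bad_ids.append(qa["id"])
    return bad_ids


# Write the dataset to a temporary directory (hypothetical file name).
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "my_train.json")
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(squad_like, f)
```

A file like this could then be passed to `reader.train` through `data_dir` and `train_filename`. Running an offset check first is a cheap way to catch misaligned annotations before a long training run.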

In [2]:
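# Load a pretrained reader as the base model (CPU only)
reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=False)
train_data = "data/squad20"
# train_data = "PATH/TO_YOUR/TRAIN_DATA"
# Fine-tune for one epoch; note that the SQuAD 2.0 dev file is used as the training file here
reader.train(data_dir=train_data, train_filename="dev-v2.0.json", use_gpu=False, n_epochs=1)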

03/17/2020 12:49:49 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
03/17/2020 12:49:49 - INFO - farm.infer -   Could not find `distilbert-base-uncased-distilled-squad` locally. Try to download from model hub ...
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
03/17/2020 12:49:59 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
03/17/2020 12:49:59 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
Preprocessing Dataset squad20/dev-v2.0.json: 100%|██████████| 1204/1204 [00:19<00:00, 62.20 Dicts/s]
Train epoch 1/1 (Cur. train loss: 4.7116):   0%|          | 5/1193 [00:56<3:40:15, 11.12s/it]



## Use trained model to ask questions
### Indexing & cleaning documents

In [3]:
# Download the data (Game of Thrones articles from Wikipedia)
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

03/17/2020 12:51:27 - INFO - haystack.indexing.io -   Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip to `data/article_txt_got`

100%|██████████| 1167348/1167348 [00:01<00:00, 881727.47B/s]


True

In [4]:
# Initialize the document store and write the documents to it
document_store = SQLDocumentStore(url="sqlite:///qa.db")
write_documents_to_db(
    document_store=document_store,
    document_dir=doc_dir,
    clean_func=clean_wiki_text,
    only_empty_db=True
)

03/17/2020 12:51:36 - INFO - haystack.indexing.io -   Wrote 517 docs to DB


### Initialize Reader, Retriever & Finder
A retriever identifies the k most promising chunks of text that might contain the answer to our question. The Finder combines the reader and the retriever in a pipeline to answer the actual questions.

Retrievers use a simple but fast algorithm; here we use TF-IDF.
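
To illustrate what TF-IDF ranking does, here is a minimal pure-Python sketch that scores paragraphs against a question by cosine similarity of TF-IDF vectors. It is an illustration only and not how `TfidfRetriever` is implemented: the function names `tokenize` and `tfidf_top_k`, the smoothed IDF variant, and the toy paragraphs are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_top_k(query, paragraphs, k=2):
    """Return the k best (paragraph_index, score) pairs for the query."""
    tokenized = [tokenize(p) for p in paragraphs]
    n = len(paragraphs)
    # Document frequency: in how many paragraphs does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one common variant, assumed here)
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

    def vectorize(tokens):
        counts = Counter(tokens)
        # Terms never seen in the paragraphs carry no weight
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    query_vec = vectorize(tokenize(query))
    scored = [(i, cosine(query_vec, vectorize(tokens))) for i, tokens in enumerate(tokenized)]
    # Highest score first; ties are broken by paragraph order
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Toy paragraphs for illustration only
paragraphs = [
    "Arya Stark is the third child and younger daughter of Eddard and Catelyn Stark.",
    "The Dothraki vocabulary was created by a linguist for the television series.",
    "She travels with her father, Eddard, to King's Landing when he is made Hand of the King.",
]
top = tfidf_top_k("Who created the Dothraki vocabulary?", paragraphs, k=2)
```

Because the score depends only on word overlap and term weights, this kind of ranking is fast, but it cannot match paraphrases. That is why the retriever only narrows down the candidates and leaves the detailed answer extraction to the reader.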

In [5]:
retriever = TfidfRetriever(document_store=document_store)

03/17/2020 12:51:40 - INFO - haystack.retriever.tfidf -   Found 2811 candidate paragraphs from 517 docs in DB


In [6]:
finder = Finder(reader, retriever)

### Voilà! Ask a question!
You can configure how many candidates the reader and the retriever return.
A higher `top_k_retriever` gives the reader more candidate paragraphs to search, which tends to improve the answers but also makes each query slower.
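
The effect of `top_k_retriever` can be seen in a toy retrieve-then-read pipeline. The sketch below is not the Haystack `Finder`: `toy_retrieve`, `toy_read`, and `toy_find` are hypothetical stand-ins, with word overlap replacing the retriever and a regular expression replacing the neural reader. It only shows that the reader can find an answer solely in paragraphs the retriever passes on.

```python
import re


def toy_retrieve(question, paragraphs, top_k):
    """Rank paragraphs by the number of words they share with the question."""
    q_words = set(re.findall(r"\w+", question.lower()))

    def overlap(paragraph):
        return len(q_words & set(re.findall(r"\w+", paragraph.lower())))

    return sorted(paragraphs, key=overlap, reverse=True)[:top_k]


def toy_read(paragraph):
    """Stand-in for a reader: extract the name that follows the word 'father'."""
    return re.findall(r"father,? (\w+)", paragraph)


def toy_find(question, paragraphs, top_k_retriever, top_k_reader):
    """Run the reader only on the paragraphs the retriever selected."""
    answers = []
    for paragraph in toy_retrieve(question, paragraphs, top_k_retriever):
        answers.extend(toy_read(paragraph))
    return answers[:top_k_reader]


paragraphs = [
    "Arya Stark is the third child and younger daughter of Eddard and Catelyn Stark.",
    "She travels with her father, Eddard, to King's Landing when he is made Hand of the King.",
]
question = "Who is the father of Arya Stark?"

# The second paragraph ranks lower on word overlap, so one candidate is not enough here.
narrow = toy_find(question, paragraphs, top_k_retriever=1, top_k_reader=5)
wide = toy_find(question, paragraphs, top_k_retriever=2, top_k_reader=5)
```

With one candidate the toy reader never sees the paragraph that mentions the father, while with two it does. The cost is that the reader, the slow part of the pipeline, runs on twice as many paragraphs.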

In [7]:
prediction = finder.get_answers(question="Who is the father of Arya Stark?", top_k_retriever=10, top_k_reader=5)

#prediction = finder.get_answers(question="Who created the Dothraki vocabulary?", top_k_reader=5)
#prediction = finder.get_answers(question="Who is the sister of Sansa?", top_k_reader=5)

03/17/2020 12:51:43 - INFO - haystack.retriever.tfidf -   Identified 10 candidates via retriever:
  paragraph_id  document_id

03/17/2020 12:51:43 - INFO - haystack.finder -   Applying the reader now to look for the answer in detail ...

Inferencing: 100%|██████████| 1/1 [00:08<00:00,  8.58s/it]


In [8]:
print_answers(prediction, details="minimal")

[   {   'answer': 'Eddard and Catelyn Stark',
        'context': 'tark ===\n'
                   'Arya Stark is the third child and younger daughter of '
                   'Eddard and Catelyn Stark. She serves as a POV character '
                   "for 33 chapters throughout ''A "},
    {   'answer': 'Eddard',
        'context': 's Nymeria after a legendary warrior queen. She travels '
                   "with her father, Eddard, to King's Landing when he is made "
                   'Hand of the King. Before she leaves,'},
    {   'answer': 'Eddard and Catelyn Stark.',
        'context': 'ark ===\n'
                   'Arya Stark is the third child and younger daughter of '
                   'Eddard and Catelyn Stark. She serves as a POV character '
                   "for 33 chapters throughout ''A G"},
    {   'answer': 'Joffrey Baratheon',
        'context': '==\n'
                   'Sansa Stark begins the novel by being betrothed to Crown '
                   'Prince Joffrey 