# Fine-tuning a Model on Your Own Data

EXECUTABLE VERSION: [*colab*](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial2_Finetune_a_model_on_your_data.ipynb)

For many use cases, it is sufficient to use one of the existing public models that were trained on SQuAD or other public QA datasets (e.g. Natural Questions).
However, if you have domain-specific questions, fine-tuning your model on custom examples will very likely boost your performance.
While this varies by domain, we saw that about 2,000 examples can easily increase performance by 5-20%.

This tutorial shows you how to fine-tune a pretrained model on your own dataset.

In [None]:
# Install the latest release of Haystack in your own environment 
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install urllib3==1.25.4

Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-dijo635w
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-dijo635w
Collecting farm==0.5.0
  Downloading https://files.pythonhosted.org/packages/a3/e4/2f47c850732a1d729e74add867e967f058370f29a313da05dc871ff8465e/farm-0.5.0-py3-none-any.whl (207kB)
     |████████████████████████████████| 215kB 2.8MB/s 
Collecting fastapi
  Downloading https://files.pythonhosted.org/packages/4c/0b/5df17eaadb7fe39dad349f484e551e802ce0581be672822f010c530d5e75/fastapi-0.61.2-py3-none-any.whl (48kB)
     |████████████████████████████████| 51kB 4.6MB/s 
Collecting uvicorn
  Downloading https://files.pythonhosted.org/packages/30/cc/01cc4cb980dfcf04eb283b6497c7f280928a0b02c68c0f85b6901e7716ae/uvicorn-0.12.2-py3-none-any.whl (45kB)
     |████████████████████████████████| 51kB 4.6MB/s 
Collecting

In [None]:
from haystack.reader.farm import FARMReader

11/10/2020 06:56:54 - INFO - faiss -   Loading faiss with AVX2 support.
11/10/2020 06:56:54 - INFO - faiss -   Loading faiss.



## Create Training Data

There are two ways to generate training data:

1. **Annotation**: You can use the [annotation tool](https://github.com/deepset-ai/haystack#labeling-tool) to label your data, i.e. to highlight the answers to your questions in a document. The tool supports structuring your workflow with organizations, projects, and users. The labels can be exported in SQuAD format, which is compatible with training in Haystack.

![Snapshot of the annotation tool](https://raw.githubusercontent.com/deepset-ai/haystack/master/docs/img/annotation_tool.png)

2. **Feedback**: For production systems, you can collect training data from direct user feedback via Haystack's [REST API](https://github.com/deepset-ai/haystack#rest-api). It includes a customizable feedback endpoint for providing feedback on the answers returned by the API, and a feedback export endpoint for retrieving the collected feedback to fine-tune your model further.
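
For orientation, the SQuAD-style layout mentioned above can be sketched as a nested structure of documents, paragraphs, and question-answer pairs. The snippet below is a minimal illustration that uses only the standard library. The example text, the question, the `id` value, and the helper name `check_squad_answers` are invented for illustration and are not part of Haystack. The helper checks that every `answer_start` offset really points at the answer text, which is a common source of errors in hand-edited files.

```python
import json

# A tiny, hypothetical dataset in SQuAD-like format (all values are invented).
squad_like = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example document",
            "paragraphs": [
                {
                    "context": "Haystack is an open-source framework for question answering.",
                    "qas": [
                        {
                            "id": "example-1",
                            "question": "What is Haystack?",
                            "is_impossible": False,
                            "answers": [
                                {"text": "an open-source framework", "answer_start": 12}
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}


def check_squad_answers(dataset):
    """Return the ids of questions whose answer offsets do not match the context."""
    bad_ids = []
    for document in dataset["data"]:
        for paragraph in document["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa.get("answers", []):
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    # The slice of the context must equal the labeled answer text.
                    if context[start:end] != answer["text"]:
                        bad_ids.append(qa["id"])
    return bad_ids


print(check_squad_answers(squad_like))

# Serialize the structure as it would be stored in a training file.
serialized = json.dumps(squad_like, ensure_ascii=False, indent=2)
```

An empty list means all offsets are consistent; any id that is returned points at a label worth re-checking before training.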


## Fine-tune your model

Once you have collected training data, you can fine-tune your base model.
We initialize a reader as the base model and fine-tune it on our own custom dataset, which should be in a SQuAD-like format.
We recommend using a base model that has already been trained on SQuAD or a similar QA dataset so that you benefit from transfer learning.

**Recommendation**: Run training on a GPU.
If you are using Colab, enable it in the menu under "Runtime" > "Change runtime type" by selecting "GPU" in the dropdown.
Then set the `use_gpu` arguments below to `True`.
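
Before relying on `use_gpu=True`, it can help to confirm that the runtime actually exposes a GPU. The following is a minimal sketch, assuming PyTorch is installed (FARM builds on it); the helper name `gpu_available` is hypothetical. If no GPU is visible, the run falls back to the CPU, which shows up as `device: cpu n_gpu: 0` in the logs.

```python
def gpu_available():
    """Return True if PyTorch is importable and reports a CUDA device."""
    try:
        import torch
    except ImportError:
        # Without PyTorch we cannot query CUDA, so report no GPU.
        return False
    return bool(torch.cuda.is_available())


use_gpu = gpu_available()
print(f"GPU available: {use_gpu}")
```

The resulting `use_gpu` value could then be passed to `FARMReader(...)` and `reader.train(...)` instead of a hard-coded `True`.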

In [None]:
reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)
train_data = "data/"
# train_data = "PATH/TO_YOUR/TRAIN_DATA" 
reader.train(data_dir=train_data, train_filename="dev-v2.0.json", use_gpu=True, n_epochs=1, save_dir="my_model")

11/10/2020 06:58:22 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
11/10/2020 06:58:22 - INFO - farm.infer -   Could not find `distilbert-base-uncased-distilled-squad` locally. Try to download from model hub ...
11/10/2020 06:58:23 - INFO - filelock -   Lock 139895293915032 acquired on /root/.cache/torch/transformers/e88f38f2c8bc669ef7873de68f36bf764d4f64b9833ca8401efe271aab476745.0f15800a5b4c30725c555e054e3d0262e9916635f0de9d397c30acd86c21dc73.lock


11/10/2020 06:58:23 - INFO - filelock -   Lock 139895293915032 released on /root/.cache/torch/transformers/e88f38f2c8bc669ef7873de68f36bf764d4f64b9833ca8401efe271aab476745.0f15800a5b4c30725c555e054e3d0262e9916635f0de9d397c30acd86c21dc73.lock





11/10/2020 06:58:23 - INFO - filelock -   Lock 139895293987696 acquired on /root/.cache/torch/transformers/dfa987aac92dc15d249af90a287974fd64aedb6548e287a4c031a16b06eb173c.f4565e3948d4331d7e0460adbcbdcac536e9886f24a2fad1190d6b53c231a3a3.lock


11/10/2020 06:58:28 - INFO - filelock -   Lock 139895293987696 released on /root/.cache/torch/transformers/dfa987aac92dc15d249af90a287974fd64aedb6548e287a4c031a16b06eb173c.f4565e3948d4331d7e0460adbcbdcac536e9886f24a2fad1190d6b53c231a3a3.lock





	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
11/10/2020 06:58:32 - INFO - filelock -   Lock 139895280541032 acquired on /root/.cache/torch/transformers/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock


11/10/2020 06:58:32 - INFO - filelock -   Lock 139895280541032 released on /root/.cache/torch/transformers/9b3c03a36e83b13d5ba95ac965c9f9074a99e14340c523ab405703179e79fc46.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.lock
11/10/2020 06:58:32 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None





11/10/2020 06:58:32 - INFO - farm.infer -   Got ya 1 parallel workers to do inference ...
11/10/2020 06:58:32 - INFO - farm.infer -    0 
11/10/2020 06:58:33 - INFO - farm.infer -   /w\
11/10/2020 06:58:33 - INFO - farm.infer -   /'\
11/10/2020 06:58:33 - INFO - farm.infer -   
11/10/2020 06:58:33 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
Preprocessing Dataset data/dev-v2.0.json:   0%|          | 0/62 [00:00<?, ? Dicts/s]
11/10/2020 06:58:34 - ERROR - farm.data_handler.processor -   Could not convert this sample to features: 
 

      .--.        _____                       _      
    .'_\/_'.     / ____|                     | |     
    '. /\ .'    | (___   __ _ _ __ ___  _ __ | | ___ 
      "||"       \___ \ / _` | '_ ` _ \| '_ \| |/ _ \ 
       || /\     ____) | (_| | | | | | | |_) | |  __/
    /\ ||//\)   |_____/ \__,_|_| |_| |_| .__/|_|\___|
   (/\||/                             |_|           
______\||/____

In [None]:
# Saving the model happens automatically at the end of training into the `save_dir` you specified
# However, you could also save a reader manually again via:
reader.save(directory="my_model")

In [None]:
from haystack import Finder
from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers

# # If you want to load it at a later point, just do:
# new_reader = FARMReader(model_name_or_path="my_model")

11/10/2020 07:35:02 - INFO - faiss -   Loading faiss with AVX2 support.
11/10/2020 07:35:02 - INFO - faiss -   Loading faiss.


In [None]:
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.6.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.6.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.6.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30
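
A fixed `sleep 30` can be too short on a slow machine and longer than needed on a fast one. As an alternative, the sketch below polls an HTTP address until it answers. It uses only the standard library; the helper name `wait_for_http` and the timeout values are illustrative, and it assumes Elasticsearch listens on `http://localhost:9200`, the address that appears in the logs of the following cells.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout_s=60.0, interval_s=1.0):
    """Poll `url` until it returns any HTTP response or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval_s):
                return True
        except urllib.error.HTTPError:
            # The server answered, just not with a 2xx status: it is up.
            return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timed out, or reset: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, `wait_for_http("http://localhost:9200", timeout_s=120)` returns `True` as soon as the server responds and `False` if it is still unreachable once the time limit has passed.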

In [None]:
# Connect to Elasticsearch

from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

11/10/2020 08:04:47 - INFO - elasticsearch -   HEAD http://localhost:9200/document [status:200 request:0.106s]
11/10/2020 08:04:47 - INFO - elasticsearch -   GET http://localhost:9200/document [status:200 request:0.008s]
11/10/2020 08:04:47 - INFO - elasticsearch -   PUT http://localhost:9200/document/_mapping [status:200 request:0.082s]
11/10/2020 08:04:47 - INFO - elasticsearch -   HEAD http://localhost:9200/label [status:200 request:0.003s]


In [None]:
doc_dir = "data/article_txt_got1"
# s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
# fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Convert files to dicts
# You can optionally supply a cleaning function that is applied to each doc (e.g. to remove footers)
# It must take a str as input, and return a str.
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# We now have a list of dictionaries that we can write to our document store.
# If your texts come from a different source (e.g. a DB), you can of course skip convert_files_to_dicts() and create the dictionaries yourself.
# The default format here is:
# {
#    'text': "<DOCUMENT_TEXT_HERE>",
#    'meta': {'name': "<DOCUMENT_NAME_HERE>", ...}
#}
# (Optionally: you can also add more key-value-pairs here, that will be indexed as fields in Elasticsearch and
# can be accessed later for filtering or shown in the responses of the Finder)

# Let's have a look at the first 3 entries:
print(dicts[:3])

# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(dicts)

11/10/2020 08:04:49 - INFO - haystack.preprocessor.utils -   Converting data/article_txt_got1/thông tin sinh viên.txt
11/10/2020 08:04:49 - INFO - haystack.preprocessor.utils -   Converting data/article_txt_got1/thi môn năng khiếu ở trường khác.txt
11/10/2020 08:04:49 - INFO - haystack.preprocessor.utils -   Converting data/article_txt_got1/Chi tiết và thời gian xét tuyển phương thức 1 đợt 1.txt
11/10/2020 08:04:49 - INFO - haystack.preprocessor.utils -   Converting data/article_txt_got1/Chỉ tiêu.txt
11/10/2020 08:04:49 - INFO - haystack.preprocessor.utils -   Converting data/article_txt_got1/KHOA KHOA HỌC XÃ HỘI_& NHÂN VĂN.txt
11/10/2020 08:04:49 - INFO - haystack.preprocessor.utils -   Converting data/article_txt_got1/Chi tiết và thời gian xét tuyển phương thức 1 đợt 2.txt
11/10/2020 08:04:49 - INFO - haystack.preprocessor.utils -   Converting data/article_txt_got1/KHOA QUẢN TRỊ KINH DOANH.txt
11/10/2020 08:04:49 - INFO - haystack.preprocessor.utils -   Converting data/article_txt_go

[{'text': 'Sau khi in phiếu thành công mục nào để trống thì để nguyên các bạn không điền gì thêm vào.', 'meta': {'name': 'thông tin sinh viên.txt'}}, {'text': 'Thí sinh đăng ký xét tuyển vào nhóm ngành môn năng khiếu thì thí sinh phải tham gia thi môn năng khiếu do TDTU tổ chức. TDTU không nhận điểm thi năng khiếu của các Trường khác chuyển sang.TDTU sẽ tổ chức 2 đợt thi môn năng khiếu tại cơ sở chính của Trường. Dự kiến Đợt 1: ngày 18 - 19/7/2020 và Đợt 2: ngày 21 - 22/8/2020 (Thí sinh được quyền đăng ký thi cả 2 đợt để lấy điểm cao nhất xét tuyển).Thí sinh tham khảo thông tin chi tiết trong thông báo thi năng khiếu trên website: admission.tdtu.edu.vn', 'meta': {'name': 'thi môn năng khiếu ở trường khác.txt'}}, {'text': 'ĐỢT 1: DÀNH CHO HỌC SINH CÁC TRƯỜNG THPT ĐÃ KÝ KẾT HỢP TÁC VỚI TRƯỜNG ĐẠI HỌC TÔN ĐỨC THẮNG Thời gian đăng ký xét tuyển trực tuyến: từ 15/4/2020 – 30/6/2020 Chi tiết về quy định xét tuyển theo kết quả học tập 05 Học kỳ đợt 1 :https://admission.tdtu.edu.vn/sites/admiss

11/10/2020 08:04:50 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.955s]
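
As the comments in the cell above note, documents from another source can be turned into dictionaries by hand instead of going through `convert_files_to_dicts()`. Here is a minimal sketch, assuming the records arrive as `(name, text)` pairs. The helper name `rows_to_dicts` and the sample rows are hypothetical; only the `text`/`meta` layout comes from the tutorial.

```python
def rows_to_dicts(rows):
    """Convert (name, text) pairs into the {'text': ..., 'meta': {...}} layout."""
    dicts = []
    for name, text in rows:
        text = text.strip()
        if not text:
            # Skip empty documents; they would add nothing to the index.
            continue
        dicts.append({"text": text, "meta": {"name": name}})
    return dicts


# Hypothetical rows, e.g. fetched from a database query.
rows = [
    ("faq_1.txt", "  Applications open on 15 April.  "),
    ("empty.txt", "   "),
]
custom_dicts = rows_to_dicts(rows)
print(custom_dicts)
```

A list built this way could then be passed to `document_store.write_documents()` in the same way as `dicts` above.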


In [None]:
from haystack.retriever.sparse import ElasticsearchRetriever
retriever = ElasticsearchRetriever(document_store=document_store)

In [None]:
reader = FARMReader(model_name_or_path="my_model", use_gpu=True)

11/10/2020 08:04:50 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
11/10/2020 08:04:52 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
11/10/2020 08:04:52 - INFO - farm.infer -   Got ya 1 parallel workers to do inference ...
11/10/2020 08:04:52 - INFO - farm.infer -    0 
11/10/2020 08:04:52 - INFO - farm.infer -   /w\
11/10/2020 08:04:52 - INFO - farm.infer -   /'\
11/10/2020 08:04:52 - INFO - farm.infer -   


In [None]:
finder = Finder(reader, retriever)

In [None]:
# The question is in Vietnamese and means "How many admission methods are there?"
prediction = finder.get_answers(question="Tuyển sinh theo mấy phương thức ?", top_k_retriever=10, top_k_reader=5)

11/10/2020 08:12:08 - INFO - elasticsearch -   POST http://localhost:9200/document/_search [status:200 request:0.035s]
11/10/2020 08:12:08 - INFO - haystack.finder -   Got 10 candidates from retriever
11/10/2020 08:12:08 - INFO - haystack.finder -   Reader is looking for detailed answer in 3768 chars ...

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  1.57 Batches/s]

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.04 Batches/s]

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.03 Batches/s]

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  2.03 Batches/s]

Inferencing Samples:   0%|          | 0/1 [00:00<?, ? Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  
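
The dictionary returned by `finder.get_answers()` can also be processed directly instead of only being printed. This is a minimal sketch, assuming the prediction holds an `answers` list whose entries carry `answer` and `context` keys, as in the printed output of this tutorial. The helper name `top_answers` and the sample prediction are hypothetical.

```python
def top_answers(prediction, limit=3):
    """Return up to `limit` (answer, context) pairs from a prediction dict."""
    pairs = []
    for entry in prediction.get("answers", [])[:limit]:
        pairs.append((entry.get("answer"), entry.get("context")))
    return pairs


# Hypothetical prediction with the same keys as the tutorial output.
sample_prediction = {
    "answers": [
        {"answer": "three methods", "context": "Admission uses three methods in 2020."},
        {"answer": "two rounds", "context": "The aptitude test is held in two rounds."},
    ]
}

for answer, context in top_answers(sample_prediction, limit=2):
    print(f"{answer!r} found in: {context!r}")
```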

In [None]:
print_answers(prediction, details="all")

{   'answers': [   {   'answer': 'Năm 2020, Trường Đại học Tôn Đức Thắng tuyển '
                                 'sinh theo 03 phương thức xét tuyển:Phương '
                                 'thức 1: Xét tuyển theo kết quả quá trình học '
                                 'tập bậc THPTPhương thức 2: Xét tuyển theo '
                                 'kết quả thi tốt nghiệp THPT năm 2020.Phương '
                                 'thức 3: Xét tuyển thẳng.Về Điều kiện, Hồ sơ, '
                                 'thủ tục và thời gian công bố kết quả bạn có '
                                 'thể xem thông tin chi tiết tại: '
                                 'https://admission.tdtu.edu.vn',
                       'context': 'Năm 2020, Trường Đại học Tôn Đức Thắng '
                                  'tuyển sinh theo 03 phương thức xét '
                                  'tuyển:Phương thức 1: Xét tuyển theo kết quả '
                                  'quá trình học tập bậc THPTPhương thức 2: '
