# Extending your Metadata using DocumentClassifiers at Index Time

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial8_Preprocessing.ipynb)

A DocumentClassifier adds its classification result (label and score) to the `meta` property of each Document.
This lets us classify documents at index time and use the result at query time, for example by applying a filter on "classification.label".

This tutorial will show you how to integrate a classification model into your preprocessing steps and how you can filter for this additional metadata at query time.
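
To make filtering on "classification.label" more concrete, here is a minimal plain-Python sketch of how a dotted filter key can be resolved against a nested `meta` dictionary. The helper names `resolve_dotted_key` and `matches_filters` and the sample documents are hypothetical and only illustrate the semantics. In the tutorial itself the document store does the filtering, not code like this.

```python
from typing import Any, Dict, List


def resolve_dotted_key(meta: Dict[str, Any], dotted_key: str) -> Any:
    """Follow a key such as 'classification.label' into nested dicts."""
    value: Any = meta
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None  # the key path does not exist in this meta dict
        value = value[part]
    return value


def matches_filters(meta: Dict[str, Any], filters: Dict[str, List[Any]]) -> bool:
    """Return True if every filter key resolves to one of its allowed values."""
    return all(
        resolve_dotted_key(meta, key) in allowed
        for key, allowed in filters.items()
    )


# Made-up documents that mimic the meta layout produced by a document classifier
docs = [
    {"content": "first passage", "meta": {"classification": {"label": "anger", "score": 0.52}}},
    {"content": "second passage", "meta": {"classification": {"label": "joy", "score": 0.94}}},
    {"content": "unclassified passage", "meta": {}},
]

joy_docs = [d for d in docs if matches_filters(d["meta"], {"classification.label": ["joy"]})]
print([d["content"] for d in joy_docs])
```

Documents without a classification entry simply do not match, which is why it can be worth checking that every indexed document was classified.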

In [None]:
# Let's start by installing Haystack

# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install grpcio-tools==1.34.1
!pip install git+https://github.com/deepset-ai/haystack.git
!wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.03.tar.gz
!tar -xvf xpdf-tools-linux-4.03.tar.gz && sudo cp xpdf-tools-linux-4.03/bin64/pdftotext /usr/local/bin

# If you run this notebook on Google Colab, you might need to
# restart the runtime after installing haystack.

In [1]:
# Here are the imports we need
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack.nodes import PreProcessor, TransformersDocumentClassifier, FARMReader, ElasticsearchRetriever
from haystack.schema import Document
from haystack.utils import convert_files_to_dicts, fetch_archive_from_http, print_answers, launch_es

In [2]:
# This fetches some sample files to work with

doc_dir = "data/preprocessing_tutorial"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

False

## Read and preprocess documents


In [3]:
# Note: you can also run the document classifier before applying the PreProcessor, e.g. before splitting your documents

all_docs = convert_files_to_dicts(dir_path=doc_dir)
preprocessor_sliding_window = PreProcessor(
    split_overlap=3,
    split_length=10,
    split_respect_sentence_boundary=False
)
docs_sliding_window = preprocessor_sliding_window.process(all_docs)

pdftotext version 0.86.1
Copyright 2005-2020 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
100%|██████████| 3/3 [00:00<00:00, 324.71docs/s]
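
The PreProcessor above uses `split_length=10` and `split_overlap=3`, so consecutive passages share three words. The following is a rough plain-Python sketch of that sliding-window idea, assuming a simple split on single spaces. The function `sliding_window_split` is hypothetical, and Haystack's actual implementation may differ in its details.

```python
from typing import List


def sliding_window_split(text: str, split_length: int = 10, split_overlap: int = 3) -> List[str]:
    """Split text into word windows of split_length that overlap by split_overlap words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split(" ")
    stride = split_length - split_overlap  # how far the window moves each step
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Illustrative input: 20 placeholder words w0 ... w19
sample_text = " ".join(f"w{i}" for i in range(20))
chunks = sliding_window_split(sample_text)
for chunk in chunks:
    print(chunk)
```

With these settings the window advances by seven words at a time, so the last three words of one passage reappear at the start of the next.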


## DocumentClassifier

We can enrich the document metadata at index time using any Transformers document classification model.
Here we use an emotion model that distinguishes between 'sadness', 'joy', 'love', 'anger', 'fear' and 'surprise'.
These labels can later be used at query time, for example as filters.

In [4]:
doc_classifier_model = 'bhadresh-savani/distilbert-base-uncased-emotion'
doc_classifier = TransformersDocumentClassifier(model_name_or_path=doc_classifier_model)

In [5]:
# Convert the dicts to Document objects; use field_map if the text to classify lives in a custom content field
field_map = {}
docs_to_classify = [Document.from_dict(d, field_map=field_map) for d in docs_sliding_window]

In [8]:
# classify using gpu, batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify, batch_size=16)
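
As a rough illustration of why `batch_size` bounds memory usage, the sketch below processes documents in fixed-size batches and attaches each prediction to the document's `meta`. Everything here is a plain-Python stand-in: `toy_classify_batch` is a hypothetical keyword rule, not a real model, and `classify_in_batches` is not Haystack's implementation.

```python
from typing import Callable, Dict, List

seen_batch_sizes: List[int] = []  # recorded only to show how the input is chunked


def toy_classify_batch(texts: List[str]) -> List[Dict[str, object]]:
    """Hypothetical stand-in for a model: one prediction per input text."""
    seen_batch_sizes.append(len(texts))
    return [{"label": "joy" if "love" in text else "anger", "score": 1.0} for text in texts]


def classify_in_batches(
    docs: List[dict],
    classify_batch: Callable[[List[str]], List[Dict[str, object]]],
    batch_size: int = 16,
) -> List[dict]:
    """Classify docs batch by batch and store each result under meta['classification']."""
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        predictions = classify_batch([doc["content"] for doc in batch])
        for doc, prediction in zip(batch, predictions):
            doc.setdefault("meta", {})["classification"] = prediction
    return docs


# Illustrative input: 40 placeholder documents
sample_docs = [{"content": f"text {i}", "meta": {}} for i in range(40)]
classified = classify_in_batches(sample_docs, toy_classify_batch, batch_size=16)
print(seen_batch_sizes)
print(classified[0]["meta"])
```

Only one batch of texts is handed to the classifier at a time, so peak memory depends on `batch_size` rather than on the total number of documents.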

In [9]:
# convert back to dicts if you want, note that DocumentStore.write_documents() can handle Documents too
# classified_docs = [doc.to_dict(field_map=field_map) for doc in classified_docs]

In [10]:
# let's see how it looks: there should be a classification result in the meta entry containing label and score.
print(classified_docs[0].to_dict(field_map=field_map))

{'content': 'Heavy metal\n\nHeavy metal (or simply metal) is a genre of', 'content_type': 'text', 'score': None, 'meta': {'name': 'heavy_metal.docx', '_split_id': 0, 'classification': {'label': 'anger', 'score': 0.5151640772819519}}, 'embedding': None, 'id': '9903d23737f3d05a9d9ee170703dc245'}


In [None]:
# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT
es_server = Popen(['elasticsearch-7.9.2/bin/elasticsearch'],
                   stdout=PIPE, stderr=STDOUT,
                   preexec_fn=lambda: os.setuid(1)  # as daemon
                  )
# wait until ES has started
! sleep 30

In [11]:
# Connect to Elasticsearch
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

In [12]:
# Now, let's write the docs to our DB.
document_store.write_documents(classified_docs)

In [13]:
# check if indexed docs contain classification results
test_doc = document_store.get_all_documents()[0]
print(f'document {test_doc.id} has label {test_doc.meta["classification"]["label"]}')

document 9903d23737f3d05a9d9ee170703dc245 has label anger
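
Before filtering at query time, it can help to know how many documents carry each label. The sketch below counts labels over plain dictionaries with made-up values. The helper `label_distribution` is hypothetical; with Haystack Document objects you would read `doc.meta` instead of `doc["meta"]`.

```python
from collections import Counter
from typing import Dict, List


def label_distribution(docs: List[Dict]) -> Counter:
    """Count classification labels, skipping documents that were never classified."""
    return Counter(
        doc["meta"]["classification"]["label"]
        for doc in docs
        if "classification" in doc.get("meta", {})
    )


# Made-up documents that mimic the indexed meta layout
indexed_docs = [
    {"meta": {"classification": {"label": "anger", "score": 0.52}}},
    {"meta": {"classification": {"label": "joy", "score": 0.94}}},
    {"meta": {"classification": {"label": "joy", "score": 0.98}}},
    {"meta": {}},
]

distribution = label_distribution(indexed_docs)
for label, count in distribution.most_common():
    print(f"{label}: {count}")
```

A label with very few documents is a hint that a filter on it may leave the retriever with little to work with.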


In [None]:
# Initialize QA-Pipeline
from haystack.pipelines import ExtractiveQAPipeline
retriever = ElasticsearchRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
pipe = ExtractiveQAPipeline(reader, retriever)    

In [15]:
# Voilà! Ask a question while filtering for "joy"-only documents
prediction = pipe.run(
    query="How is heavy metal?",
    params={
        "Retriever": {"top_k": 10, "filters": {"classification.label": ["joy"]}},
        "Reader": {"top_k": 5},
    },
)

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  9.04 Batches/s]


In [16]:
print_answers(prediction, details="high")

{   'answers': [   Answer(answer='thick, massive sound', type='extractive', score=0.6946582794189453, context=',[6] heavy metal bands developed a thick, massive sound, characterized', offsets_in_document=[Span(start=35, end=55)], offsets_in_context=[Span(start=35, end=55)], document_id='b69a8816c2c8d782dceb412b80a4bf6e', meta={'_split_id': 5, 'classification': {'label': 'joy', 'score': 0.9432986974716187}, 'name': 'heavy_metal.docx'}),
                   Answer(answer='modified heavy metal into more accessible forms', type='extractive', score=0.2581551596522331, context='Several American bands modified heavy metal into more accessible forms', offsets_in_document=[Span(start=23, end=70)], offsets_in_context=[Span(start=23, end=70)], document_id='5a9cefc4732e4b2f97529a79231345ec', meta={'_split_id': 13, 'classification': {'label': 'joy', 'score': 0.9782285094261169}, 'name': 'heavy_metal.docx'}),
                   Answer(answer='fans', type='extractive', score=0.08210562914609909, conte

## About us

This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany.

We bring NLP to the industry via open source!  
Our focus: industry-specific language models & large-scale QA systems.  
  
Some of our other work: 
- [German BERT](https://deepset.ai/german-bert)
- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)
- [FARM](https://github.com/deepset-ai/FARM)

Get in touch:
[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)

By the way: [we're hiring!](https://www.deepset.ai/jobs)
