
Introduce Attachment type for attachment mapping #2085

@russcam

The mapper-attachment plugin allows indexing and searching of documents in various formats such as PDF and Word.

This is exposed in NEST with AttachmentProperty, which can be used to map the properties of a CLR object to the attachment field mapping.

In its simplest form, indexing an attachment need only pass a base64-encoded string of the document as the value of the attachment field:

PUT http://localhost:9200/docs/document/1?pretty=true&refresh=true 
{
  "id": 1,
  "title": "Some document",
  "file": "some base64 encoded string"
}
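
As a minimal sketch of preparing that request body, the snippet below base64-encodes raw file bytes and places them under the attachment field. The helper name `build_simple_document` and the sample bytes are assumptions for illustration; only the `id`, `title` and `file` field names come from the request above.

```python
import base64
import json


def build_simple_document(doc_id, title, raw_bytes):
    """Build an index body with the attachment passed as a bare base64 string."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return {"id": doc_id, "title": title, "file": encoded}


# Sample bytes stand in for the contents of a real PDF or Word file.
body = build_simple_document(1, "Some document", b"hello attachment")
print(json.dumps(body, indent=2))
```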

If the attachment mapping specifies other metadata fields to be indexed, such as content_type, language, etc., these will be extracted from the content field and indexed as requested (NOTE: they do not exist in _source, only in the index).

The plugin also allows explicit metadata field values to be passed when indexing, by using the name of the metadata field prefixed with an underscore:

PUT http://localhost:9200/docs/document/2?pretty=true&refresh=true 
{
  "id": 2,
  "title": "Another document",
  "file": {
    "_content":  "some base64 encoded string",
    "_content_type": "text/plain"
  }
}
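
The object form can be built the same way. This is only a sketch under the assumption that every explicit metadata value is supplied as a keyword argument; the helper `build_document_with_metadata` is hypothetical and simply applies the underscore-prefix convention described above.

```python
import base64


def build_document_with_metadata(doc_id, title, raw_bytes, **metadata):
    """Build an index body whose attachment field is an object.

    Each keyword argument becomes an underscore-prefixed metadata key,
    e.g. content_type="text/plain" becomes "_content_type".
    """
    attachment = {"_content": base64.b64encode(raw_bytes).decode("ascii")}
    for name, value in metadata.items():
        attachment["_" + name] = value
    return {"id": doc_id, "title": title, "file": attachment}


doc = build_document_with_metadata(
    2, "Another document", b"plain text body", content_type="text/plain"
)
```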

Having explicit metadata fields affects the structure of the source returned in search results, e.g.:


POST http://localhost:9200/docs/document/_search?pretty=true 
{
  "query": {
    "match": {
      "file.content": {
        "query": "NEST mapper"
      }
    }
  }
}

Status: 200
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.01125201,
    "hits" : [ {
      "_index" : "docs",
      "_type" : "document",
      "_id" : "2",
      "_score" : 0.01125201,
      "_source" : {
        "id" : 2,
        "title" : "Another document",
        "file" : {
          "_content" : "some base64 encoded string",
          "_content_type" : "text/plain"
        }
      }
    }, {
      "_index" : "docs",
      "_type" : "document",
      "_id" : "1",
      "_score" : 0.01125201,
      "_source" : {
        "id" : 1,
        "title" : "Some document",
        "file" : "some base64 encoded string"
      }
    } ]
  }
}

Here, the first document has an explicit metadata field for _content_type, and has therefore passed the attachment content to be indexed as _content. The second document passed the attachment content against the name of the attachment field.

(Using explicit metadata fields does not affect the attachment mapping, i.e. new fields beginning with an underscore are not added to the mapping.)
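
Because the attachment field in _source can be either a string or an object, client code reading search hits has to handle both shapes. A rough sketch of that branching follows; the function name `attachment_content` and the sample documents are illustrative assumptions.

```python
import base64


def attachment_content(source, field="file"):
    """Return the decoded attachment bytes from a hit's _source.

    Handles both shapes: a bare base64 string, or an object
    carrying the content under "_content".
    """
    value = source[field]
    if isinstance(value, dict):
        value = value["_content"]
    return base64.b64decode(value)


# Two sample _source documents mirroring the shapes in the response above.
encoded = base64.b64encode(b"NEST mapper").decode("ascii")
simple = {"id": 1, "title": "Some document", "file": encoded}
explicit = {
    "id": 2,
    "title": "Another document",
    "file": {"_content": encoded, "_content_type": "text/plain"},
}
```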

The extracted values are available in the hits when requested through fields:

POST http://localhost:9200/docs/document/_search?pretty=true 
{
  "fields": [
    "file.content_type"
  ],
  "query": {
    "match": {
      "file.content_type": {
        "query": "pdf"
      }
    }
  }
}

Status: 200
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "docs",
      "_type" : "document",
      "_id" : "1",
      "_score" : 0.19178301,
      "fields" : {
        "file.content_type" : [ "application/pdf" ]
      }
    } ]
  }
}
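
Reading those values back out of a response is straightforward. The sketch below works on a trimmed copy of the response above; `field_values` is a hypothetical helper, and it assumes each hit carries a `fields` object whose values are arrays, as shown.

```python
def field_values(response, field):
    """Map each hit's _id to the list of values returned for `field`."""
    return {
        hit["_id"]: hit.get("fields", {}).get(field, [])
        for hit in response["hits"]["hits"]
    }


# Trimmed copy of the search response shown above.
response = {
    "hits": {
        "total": 1,
        "hits": [
            {
                "_index": "docs",
                "_type": "document",
                "_id": "1",
                "fields": {"file.content_type": ["application/pdf"]},
            }
        ],
    }
}

content_types = field_values(response, "file.content_type")
```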

I think we should introduce an Attachment type to make using the mapper-attachment plugin with NEST easier. It would:

  1. Handle the serialization of properties when indexing
  2. Handle the deserialization of content along with any additional explicit metadata fields
  3. Allow any of the Attachment type's properties to be used for strongly typed access to fields
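
To make the first two points concrete, here is a language-neutral sketch in Python of what such a type's round trip might look like. It is not NEST code and not an existing API: the class, its property names and the `to_source`/`from_source` methods are all assumptions, modelled only on the underscore-prefixed fields shown earlier.

```python
import base64
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Attachment:
    """Hypothetical model of the proposed type; not an existing NEST class."""

    content: bytes
    content_type: Optional[str] = None
    language: Optional[str] = None

    def to_source(self) -> Union[str, dict]:
        """Serialize: a bare string when no metadata is set, an object otherwise."""
        encoded = base64.b64encode(self.content).decode("ascii")
        meta = {"_content_type": self.content_type, "_language": self.language}
        meta = {key: value for key, value in meta.items() if value is not None}
        return {"_content": encoded, **meta} if meta else encoded

    @classmethod
    def from_source(cls, value: Union[str, dict]) -> "Attachment":
        """Deserialize either shape of the attachment field found in _source."""
        if isinstance(value, str):
            return cls(content=base64.b64decode(value))
        return cls(
            content=base64.b64decode(value["_content"]),
            content_type=value.get("_content_type"),
            language=value.get("_language"),
        )
```

With this shape, a caller sets typed properties and never deals with the string-versus-object distinction directly.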

Thoughts?

/cc @Mpdreamz, @gmarz
