Introduce Attachment type for attachment mapping #2085

Closed
russcam opened this issue May 10, 2016 · 16 comments

@russcam
Contributor

russcam commented May 10, 2016

The mapper-attachment plugin allows indexing and searching of documents in various formats such as PDF, Word, etc.

This is exposed in NEST with AttachmentProperty, which can be used to map the properties of a CLR object to the attachment field mapping.

In its simplest form, indexing an attachment only requires passing a base64-encoded string of the document as the value of the attachment field:

PUT http://localhost:9200/docs/document/1?pretty=true&refresh=true 
{
  "id": 1,
  "title": "Some document",
  "file": "some base64 encoded string"
}
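
As a minimal sketch of how the request body above could be produced (Python is used purely for illustration; `build_simple_body` and the sample bytes are hypothetical and not part of NEST or the plugin), the attachment field is just the base64 text of the file bytes:

```python
import base64
import json


def build_simple_body(doc_id, title, file_bytes):
    """Build the JSON body for the simplest form of attachment indexing.

    The "file" key matches the attachment field name used in the examples
    in this issue; a different mapping would use a different field name.
    """
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return {"id": doc_id, "title": title, "file": encoded}


# Sample bytes stand in for the contents of a real PDF or Word file.
body = build_simple_body(1, "Some document", b"NEST mapper attachment")
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of the PUT request shown above.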

If the attachment mapping specifies other metadata fields to be indexed, such as content_type, language, etc., these will be extracted from the content field and indexed as requested (NOTE: they do not exist in source, only in the index).

The plugin also allows explicit metadata field values to be passed when indexing, by using the name of the metadata field prefixed with an underscore:

PUT http://localhost:9200/docs/document/2?pretty=true&refresh=true 
{
  "id": 2,
  "title": "Another document",
  "file": {
    "_content":  "some base64 encoded string",
    "_content_type": "text/plain"
  }
}
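
The object form above can be sketched as follows. This is an illustration only: `build_attachment_field` is a hypothetical helper, and it assumes that every explicit metadata value is sent under its metadata field name prefixed with an underscore, with the content itself under `_content`.

```python
import base64
import json


def build_attachment_field(file_bytes, **metadata):
    """Return the object form of the attachment field.

    Content goes under "_content"; each explicit metadata value is sent
    under its metadata field name prefixed with an underscore.
    """
    field = {"_content": base64.b64encode(file_bytes).decode("ascii")}
    for name, value in metadata.items():
        field["_" + name] = value
    return field


body = {
    "id": 2,
    "title": "Another document",
    "file": build_attachment_field(b"plain text", content_type="text/plain"),
}
print(json.dumps(body, indent=2))
```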

Sending explicit metadata fields affects the structure of the _source returned in search results, e.g.

POST http://localhost:9200/docs/document/_search?pretty=true 
{
  "query": {
    "match": {
      "file.content": {
        "query": "NEST mapper"
      }
    }
  }
}

Status: 200
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.01125201,
    "hits" : [ {
      "_index" : "docs",
      "_type" : "document",
      "_id" : "2",
      "_score" : 0.01125201,
      "_source" : {
        "id" : 2,
        "title" : "Another document",
        "file" : {
          "_content" : "some base64 encoded string",
          "_content_type" : "text/plain"
        }
      }
    }, {
      "_index" : "docs",
      "_type" : "document",
      "_id" : "1",
      "_score" : 0.01125201,
      "_source" : {
        "id" : 1,
        "title" : "Some document",
        "file" : "some base64 encoded string"
      }
    } ]
  }
}

Here, the first hit (document 2) was indexed with an explicit _content_type metadata field, so its attachment content had to be passed as _content. The second hit (document 1) passed the attachment content directly as the value of the attachment field.

(Using explicit metadata fields does not affect the attachment mapping, i.e. new fields beginning with an underscore are not added to the mapping.)
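
Because the attachment field can come back from _source either as a plain string or as an object with underscore-prefixed keys, a client has to handle both shapes. A rough sketch of that normalisation, in Python for illustration only (`read_attachment` is a hypothetical name, not an existing NEST or plugin API):

```python
def read_attachment(file_value):
    """Normalise an attachment value from _source into (content, metadata).

    A plain string is the base64 content on its own; an object carries the
    content under "_content" plus any explicit metadata fields.
    """
    if isinstance(file_value, str):
        return file_value, {}
    content = file_value.get("_content")
    metadata = {
        key[1:]: value  # strip the leading underscore
        for key, value in file_value.items()
        if key.startswith("_") and key != "_content"
    }
    return content, metadata


# The two _source documents from the search response above.
sources = [
    {"id": 2, "title": "Another document",
     "file": {"_content": "some base64 encoded string",
              "_content_type": "text/plain"}},
    {"id": 1, "title": "Some document", "file": "some base64 encoded string"},
]
for source in sources:
    content, metadata = read_attachment(source["file"])
    print(source["id"], content, metadata)
```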

The extracted values can be returned in the hits by requesting them with fields:

POST http://localhost:9200/docs/document/_search?pretty=true 
{
  "fields": [
    "file.content_type"
  ],
  "query": {
    "match": {
      "file.content_type": {
        "query": "pdf"
      }
    }
  }
}

Status: 200
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "docs",
      "_type" : "document",
      "_id" : "1",
      "_score" : 0.19178301,
      "fields" : {
        "file.content_type" : [ "application/pdf" ]
      }
    } ]
  }
}
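
Note that the extracted values come back as arrays under fields rather than in _source. A small sketch of reading them out of a response (Python for illustration; `extracted_values` is a hypothetical helper and the dictionary below is a trimmed copy of the response above):

```python
def extracted_values(response, field_name):
    """Map each hit's _id to the list of values returned for field_name."""
    return {
        hit["_id"]: hit.get("fields", {}).get(field_name, [])
        for hit in response["hits"]["hits"]
    }


# Trimmed copy of the search response shown above.
response = {
    "hits": {
        "total": 1,
        "hits": [
            {
                "_index": "docs",
                "_type": "document",
                "_id": "1",
                "fields": {"file.content_type": ["application/pdf"]},
            }
        ],
    }
}
print(extracted_values(response, "file.content_type"))
```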

I think we should introduce an Attachment type to make using the mapper-attachment plugin with NEST easier. It would:

  1. Handle the serialization of properties when indexing
  2. Handle the deserialization of content with any additional explicit meta fields
  3. Allow any of the Attachment type properties to be used for strongly typed access to fields

Thoughts?

/cc @Mpdreamz, @gmarz

@nasreekar

@russcam Hi Russ, thanks for the detailed explanation of the attachment plugin.
I have the same problem as issue 1828.

@nasreekar

@russcam Hi Russ. Is there any plugin or method to check whether new files (doc, pdf or ppt) have been added to a folder and, if so, index them into the ES server as attachments? TIA :)

@russcam
Contributor Author

russcam commented May 10, 2016

@nasreekar that question would be better asked on discuss.elastic.co as it will have more visibility

@nasreekar

@russcam Sure, thanks. Asked on the forum. :)

@dadoonet
Member

-1 as is, because we are deprecating this plugin in 5.0.
It's replaced by ingest attachment, which you might need to support with an approach similar to the one you wrote.

@russcam
Contributor Author

russcam commented May 13, 2016

@dadoonet Deprecated but not removed in 5?

This is aimed more at NEST 2.x, although I think it would also be useful to have in NEST 5.x, marked as deprecated. This would help those using the mapper-attachments plugin in 2.x to migrate to using ingest in 5.x more smoothly.

@dadoonet
Member

Indeed. My comment was just to highlight the fact that this code might be removed in the near future, so it might not be worth investing too much in it.

@russcam
Contributor Author

russcam commented May 13, 2016

Gotcha. I think it's a small piece of work to have something here to help with using the plugin with NEST.

russcam added a commit that referenced this issue May 13, 2016
@piyushgiri

piyushgiri commented May 13, 2016

@russcam Hi Russ, is the issue resolved for #2085?

@russcam
Contributor Author

russcam commented May 14, 2016

@piyushgiri It's not an issue but an enhancement/feature :) There's a PR open with an Attachment type; you may want to do something similar in your own project.

This doesn't mean that it's going to make it into NEST though; it still needs review and discussion.

@nasreekar

nasreekar commented May 24, 2016

Is there any way I can search for phrases with special characters in the content of the attached file? For example, I want to search for asn@gmail.com in the contents of all the indexed files.

I tried it in my code and it seems that '@' is not taken into account; instead it searches for asn, gmail and com separately and returns results for those. How can I solve this problem? TIA

@gmarz gmarz added v2.3.3 and removed v2.3.2 labels May 25, 2016
@gmarz
Contributor

gmarz commented May 25, 2016

Bumped to 2.3.3...but this change could arguably target 2.4 instead.

russcam added a commit that referenced this issue May 27, 2016
@nasreekar

@russcam Hi Russ. I have a question about using the attachments plugin. When we attach a document (say a pdf or doc) as an attachment, are the meta fields extracted by default or should we set them explicitly? (NEST)

For example, previously I was doing something like the following:

// Build the attachment; encodedData is the base64-encoded content of the file
Attachment attach = new Attachment
{
    Name = Path.GetFileNameWithoutExtension(file),
    Content = encodedData,
    // Note: Path.GetExtension returns the extension (e.g. ".pdf"), not a MIME type
    ContentType = Path.GetExtension(file)
};

doc = new Document
{
    Title = Path.GetFileNameWithoutExtension(file),
    FilePath = Path.GetFullPath(file), // added to get the path of the file
    CreatedDate = DateTime.Now,
    File = attach
};

class Document
{
    public string Title { get; set; }

    public string FilePath { get; set; }

    public DateTime CreatedDate { get; set; } // timestamp set when the document is built

    public Attachment File { get; set; }
}

class Attachment
{
    [String(Name = "_name")]
    public string Name { get; set; } // name of the document

    [String(Name = "_content", TermVector = TermVectorOption.WithPositionsOffsets)]
    public string Content { get; set; } // content of the document

    [String(Name = "_content_type")]
    public string ContentType { get; set; } // pdf or doc or etc
}

Is there any other way to index a document as an attachment with some more meta fields (for example, I want to add the author and modified date of the document)?

TIA

russcam added a commit that referenced this issue Jun 16, 2016
russcam added a commit that referenced this issue Jun 16, 2016
@russcam
Contributor Author

russcam commented Jun 16, 2016

@nasreekar, the mapper attachments plugin extracts metadata fields from the file where they are available and the plugin can extract them. Some of the metadata fields can also be sent explicitly when indexing the document; as far as I know, these are the following (in addition to content, which is sent as _content):

  • content type (sent explicitly as _content_type)
  • name (sent explicitly as _name)
  • language (sent explicitly as _language)

@dadoonet, is this correct?

If you wish to index other data alongside an attachment, then you create other properties on your Document class as you have done.

I've just merged a PR into 2.x that should make it a little easier to use the attachment plugin with NEST. This will go into the next 2.x release.

@nasreekar

@russcam I got it. Thanks Russ.

@dadoonet
Member

Correct. https://www.elastic.co/guide/en/elasticsearch/plugins/current/mapper-attachments-usage.html

@russcam russcam closed this as completed Jun 16, 2016