Welcome to the Dropbox River Plugin for Elasticsearch
This river plugin indexes documents from your Dropbox account.
WARNING: You need to have the Attachment Plugin installed.
Dropbox River Plugin | ElasticSearch | Attachment Plugin |
master (0.2.0) | 0.21.0.Beta1-SNAPSHOT | 1.6.0 |
0.1.0 | 0.20.4 | 1.6.0 |
Thanks to CloudBees for the build status.
To install the plugin, run:
$ bin/plugin -install fr.pilato.elasticsearch.river/dropbox/0.1.0
This will do the job. The output should look like this:
-> Installing fr.pilato.elasticsearch.river/dropbox/0.1.0...
Trying http://download.elasticsearch.org/fr.pilato.elasticsearch.river/dropbox/dropbox-0.1.0.zip...
Trying http://search.maven.org/remotecontent?filepath=fr/pilato/elasticsearch/river/dropbox/0.1.0/dropbox-0.1.0.zip...
Trying https://oss.sonatype.org/service/local/repositories/releases/content/fr/pilato/elasticsearch/river/dropbox/0.1.0/dropbox-0.1.0.zip...
Downloading ......DONE
Installed dropbox
First, you need to create your own application in Dropbox Developers.
If you create a Full Dropbox application, you will have access to all folders.
If you create an App folder application, you will only have access to the files in your app folder. Accessing any other folder fails with Dropbox HTTP Error 403 : {"error": "Forbidden"}.
Note your AppKey and your AppSecret.
Next, you need to get an authorization from the user for this new application.
Open the _dropbox REST endpoint with your AppKey and AppSecret as parameters: http://localhost:9200/_dropbox/oauth/AppKey/AppSecret
$ curl http://localhost:9200/_dropbox/oauth/AppKey/AppSecret
You will get back a URL:
{
"oauth_token":"OAUTHTOKEN",
"oauth_secret":"OAUTHSECRET",
"url" : "https://www.dropbox.com/1/oauth/authorize?oauth_token=OAUTHTOKEN"
}
Open the URL in your browser. Dropbox will ask you to allow your application to access your Dropbox account.
If you added an oauth_callback parameter to the URL, Dropbox will redirect your user to that endpoint.
For example, https://www.dropbox.com/1/oauth/authorize?oauth_token=OAUTHTOKEN&oauth_callback=http://yourwebserver/callback will redirect your user to http://yourwebserver/callback if the user allows your application to access their Dropbox folders.
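As an illustration of the callback variant, the following Python sketch appends an oauth_callback parameter to the authorization URL returned by the plugin. The function name `authorize_url` is hypothetical and not part of the plugin; the base URL is the one shown in the response above. The sketch percent-encodes the callback value, which is the safe way to embed one URL inside another.

```python
from urllib.parse import urlencode

# Base URL taken from the "url" field of the response shown above.
AUTHORIZE_URL = "https://www.dropbox.com/1/oauth/authorize"


def authorize_url(oauth_token, callback=None):
    """Build the authorization URL (hypothetical helper, not a plugin API).

    The callback, if given, is percent-encoded by urlencode.
    """
    params = {"oauth_token": oauth_token}
    if callback is not None:
        params["oauth_callback"] = callback
    return AUTHORIZE_URL + "?" + urlencode(params)


url = authorize_url("OAUTHTOKEN", "http://yourwebserver/callback")
print(url)
```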
Once you get back the success reply from Dropbox, you can get the user Token and Secret by calling:
$ curl http://localhost:9200/_dropbox/oauth/apptoken/appsecret/OAUTHTOKEN/OAUTHSECRET
You will get back a JSON document like the following:
{
"token" : "yourtoken",
"secret" : "yoursecret"
}
Use this token and secret when you create the river (see below).
By the way, you can use the SettingUpDropboxTestsCases test class to get a token and a secret for your user.
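The second call of the exchange reuses the values returned by the first one. The sketch below parses the first JSON reply and assembles the second endpoint URL from it. The helper name `access_token_endpoint` is hypothetical, and it assumes that the first two path segments of the second call are the same application key and secret used in the first call. It only builds the URL and does not contact Elasticsearch.

```python
import json

ES_BASE = "http://localhost:9200"  # address used throughout this document


def access_token_endpoint(app_key, app_secret, first_reply):
    """Build the URL of the second _dropbox/oauth call (hypothetical helper).

    first_reply is the JSON text returned by the first call; it must contain
    the "oauth_token" and "oauth_secret" fields shown above.
    """
    reply = json.loads(first_reply)
    segments = [
        ES_BASE, "_dropbox", "oauth",
        app_key, app_secret,
        reply["oauth_token"], reply["oauth_secret"],
    ]
    return "/".join(segments)


first_reply = """{
  "oauth_token": "OAUTHTOKEN",
  "oauth_secret": "OAUTHSECRET",
  "url": "https://www.dropbox.com/1/oauth/authorize?oauth_token=OAUTHTOKEN"
}"""
print(access_token_endpoint("AppKey", "AppSecret", first_reply))
```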
First, we create an index to store our documents (optional):
$ curl -XPUT 'localhost:9200/mydocs/' -d '{}'
We create the river with the following properties:
- AppKey: AAAAAAAAAAAAAAAA
- AppSecret: BBBBBBBBBBBBBBBB
- Token: XXXXXXXXXXXXXXXX
- Secret: YYYYYYYYYYYYYYYY
- Dropbox directory URL: /tmp
- Update Rate: every 15 minutes (15 * 60 * 1000 = 900000 ms)
- Get only docs like *.doc and *.pdf
- Don't index resume*
$ curl -XPUT 'localhost:9200/_river/mydocs/_meta' -d '{
"type": "dropbox",
"dropbox": {
"appkey": "AAAAAAAAAAAAAAAA",
"appsecret": "BBBBBBBBBBBBBBBB",
"token": "XXXXXXXXXXXXXXXX",
"secret": "YYYYYYYYYYYYYYYY",
"name": "My tmp dropbox dir",
"url": "/tmp",
"update_rate": 900000,
"includes": "*.doc,*.pdf",
"excludes": "resume"
}
}'
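The same river definition can be generated programmatically, which avoids computing the update rate by hand. The sketch below is one way to do it; `minutes_to_ms` and `river_meta` are hypothetical helper names, and the field names are the ones used in the request above. It only prints the JSON body; sending it is left to curl or an HTTP client.

```python
import json


def minutes_to_ms(minutes):
    """Convert minutes to the milliseconds that update_rate expects."""
    return minutes * 60 * 1000


def river_meta(appkey, appsecret, token, secret, name, url, minutes,
               includes=None, excludes=None):
    """Build the _meta document for a dropbox river (hypothetical helper)."""
    dropbox = {
        "appkey": appkey,
        "appsecret": appsecret,
        "token": token,
        "secret": secret,
        "name": name,
        "url": url,
        "update_rate": minutes_to_ms(minutes),
    }
    # Optional filters are only emitted when they are set.
    if includes:
        dropbox["includes"] = includes
    if excludes:
        dropbox["excludes"] = excludes
    return {"type": "dropbox", "dropbox": dropbox}


meta = river_meta(
    "AAAAAAAAAAAAAAAA", "BBBBBBBBBBBBBBBB",
    "XXXXXXXXXXXXXXXX", "YYYYYYYYYYYYYYYY",
    name="My tmp dropbox dir", url="/tmp", minutes=15,
    includes="*.doc,*.pdf", excludes="resume",
)
print(json.dumps(meta, indent=2))
```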
We add another river with the following properties:
- AppKey: AAAAAAAAAAAAAAAA
- AppSecret: BBBBBBBBBBBBBBBB
- Token: 2XXXXXXXXXXXXXXX
- Secret: 2YYYYYYYYYYYYYYY
- Dropbox directory URL: /tmp2
- Update Rate: every hour (60 * 60 * 1000 = 3600000 ms)
- Get only docs like *.doc, *.xls and *.pdf
This river is also configured to index into the same index/type as the previous one:
- index: mydocs
- type: doc
$ curl -XPUT 'localhost:9200/_river/mynewriver/_meta' -d '{
"type": "dropbox",
"dropbox": {
"appkey": "AAAAAAAAAAAAAAAA",
"appsecret": "BBBBBBBBBBBBBBBB",
"token": "2XXXXXXXXXXXXXXX",
"secret": "2YYYYYYYYYYYYYYY",
"name": "My tmp2 dropbox dir",
"url": "/tmp2",
"update_rate": 3600000,
"includes": [ "*.doc" , "*.xls", "*.pdf" ]
},
"index": {
"index": "mydocs",
"type": "doc",
"bulk_size": 50
}
}'
Note that you can index for another Dropbox application (appkey and appsecret may differ from those of the previous river).
Note that you can use the same credentials (appkey, appsecret, token, secret) as the previous river if you only want to index another directory for the same user.
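Hand-written river definitions must be strict JSON: every key needs double quotes and trailing commas are not allowed. A quick local check before sending the body can catch such mistakes. The sketch below uses only the Python standard library; `check_river_body` is a hypothetical helper name, and the two sample strings are shortened versions of the index section above.

```python
import json


def check_river_body(body):
    """Return None if body is valid JSON, else a short error description."""
    try:
        json.loads(body)
    except json.JSONDecodeError as err:
        return "line {}, column {}: {}".format(err.lineno, err.colno, err.msg)
    return None


good = '{"index": {"index": "mydocs", "type": "doc", "bulk_size": 50}}'
bad = '{"index": {"index": "mydocs", "type": "doc", bulk_size: 50}}'

print(check_river_body(good))  # valid: nothing to report
print(check_river_body(bad))   # unquoted key: an error description
```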
This is a common use case in Elasticsearch: we want to search for something ;-)
$ curl -XGET http://localhost:9200/mydocs/doc/_search -d '{
"query" : {
"match" : {
"_all" : "I am searching for something !"
}
}
}'
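For readers who prefer a script to curl, the sketch below prepares the same full-text search with the Python standard library. It assumes the `mydocs` index created earlier and the `doc` type; `search_request` is a hypothetical helper name. The request is sent as a POST, which the Elasticsearch `_search` endpoint accepts as well as GET. The sketch only builds the request object; the commented line shows how it could be sent.

```python
import json
from urllib.request import Request


def search_request(index, doc_type, text, base="http://localhost:9200"):
    """Prepare a match query on the _all field (hypothetical helper)."""
    body = {"query": {"match": {"_all": text}}}
    return Request(
        "{}/{}/{}/_search".format(base, index, doc_type),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = search_request("mydocs", "doc", "I am searching for something !")
print(req.get_method(), req.full_url)
# To run it against a live node:
#   from urllib.request import urlopen
#   print(urlopen(req).read().decode("utf-8"))
```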
When the Dropbox River detects a new type, it automatically creates a mapping for this type.
{
"doc" : {
"properties" : {
"file" : {
"type" : "attachment",
"path" : "full",
"fields" : {
"file" : {
"type" : "string",
"store" : "yes",
"term_vector" : "with_positions_offsets"
},
"author" : {
"type" : "string"
},
"title" : {
"type" : "string",
"store" : "yes"
},
"name" : {
"type" : "string"
},
"date" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"keywords" : {
"type" : "string"
},
"content_type" : {
"type" : "string"
}
}
},
"name" : {
"type" : "string",
"analyzer" : "keyword"
},
"pathEncoded" : {
"type" : "string",
"analyzer" : "keyword"
},
"postDate" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"rootpath" : {
"type" : "string",
"analyzer" : "keyword"
},
"virtualpath" : {
"type" : "string",
"analyzer" : "keyword"
}
}
}
}
If you want to define your own mapping, for example to set analyzers, you can push the mapping before starting the Dropbox River.
{
"doc" : {
"properties" : {
"file" : {
"type" : "attachment",
"path" : "full",
"fields" : {
"file" : {
"type" : "string",
"store" : "yes",
"term_vector" : "with_positions_offsets",
"analyzer" : "french"
},
"author" : {
"type" : "string"
},
"title" : {
"type" : "string",
"store" : "yes"
},
"name" : {
"type" : "string"
},
"date" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"keywords" : {
"type" : "string"
},
"content_type" : {
"type" : "string"
}
}
},
"name" : {
"type" : "string",
"analyzer" : "keyword"
},
"pathEncoded" : {
"type" : "string",
"analyzer" : "keyword"
},
"postDate" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"rootpath" : {
"type" : "string",
"analyzer" : "keyword"
},
"virtualpath" : {
"type" : "string",
"analyzer" : "keyword"
}
}
}
}
To send the mapping to Elasticsearch, refer to the Put Mapping API.
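As a rough sketch of that step, the code below derives a custom mapping from a default one by setting the analyzer of the attachment content field, then prepares a Put Mapping request. It assumes the `/{index}/{type}/_mapping` URL form used by Elasticsearch releases of this era, and the `mydocs` index and `doc` type from the examples above. Both helper names are hypothetical. The mapping is trimmed for brevity; a real call should send the complete mapping shown above.

```python
import copy
import json
from urllib.request import Request


def with_content_analyzer(mapping, analyzer, doc_type="doc"):
    """Return a copy of mapping with an analyzer on the file content field."""
    updated = copy.deepcopy(mapping)  # leave the caller's mapping untouched
    updated[doc_type]["properties"]["file"]["fields"]["file"]["analyzer"] = analyzer
    return updated


def put_mapping_request(index, doc_type, mapping, base="http://localhost:9200"):
    """Prepare (but do not send) a Put Mapping request."""
    return Request(
        "{}/{}/{}/_mapping".format(base, index, doc_type),
        data=json.dumps(mapping).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Trimmed version of the default mapping, for illustration only.
default_mapping = {"doc": {"properties": {"file": {
    "type": "attachment",
    "path": "full",
    "fields": {"file": {
        "type": "string",
        "store": "yes",
        "term_vector": "with_positions_offsets",
    }},
}}}}

custom = with_content_analyzer(default_mapping, "french")
req = put_mapping_request("mydocs", "doc", custom)
print(req.get_method(), req.full_url)
```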
The Dropbox River creates the following meta fields:
Field | Description | Example |
name | Original file name | mydocument.pdf |
pathEncoded | BASE64 encoded file path (for internal use) | 112aed83738239dbfe4485f024cd4ce1 |
postDate | Indexing date | 1312893360000 |
rootpath | BASE64 encoded root path (for internal use) | 112aed83738239dbfe4485f024cd4ce1 |
virtualpath | Relative path | mydir/otherdir |
You can search on these meta fields.
$ curl -XGET http://localhost:9200/mydocs/doc/_search -d '{
"query" : {
"term" : {
"name" : "mydocument.pdf"
}
}
}'
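A term query is not analyzed, and the default mapping shown earlier indexes name, pathEncoded, rootpath and virtualpath with the keyword analyzer, so the value passed to the query has to be the exact stored value. The sketch below builds such a query for any of the meta fields listed in the table; `meta_term_query` is a hypothetical helper name, and rejecting unknown field names is a design choice of this sketch, not a behavior of the plugin.

```python
import json

# Meta fields listed in the table above.
META_FIELDS = ("name", "pathEncoded", "postDate", "rootpath", "virtualpath")


def meta_term_query(field, value):
    """Build a term query body for one meta field (hypothetical helper)."""
    if field not in META_FIELDS:
        raise ValueError("unknown meta field: {}".format(field))
    return {"query": {"term": {field: value}}}


print(json.dumps(meta_term_query("virtualpath", "mydir/otherdir")))
```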
TO BE COMPLETED
This software is licensed under the Apache 2 license, quoted below.
Copyright 2011-2013 David Pilato
Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.