bulk does not index source #229

Closed
tomron opened this issue May 5, 2015 · 10 comments

@tomron

tomron commented May 5, 2015

I ran the following code and expected it to index the content (i.e., _source) as well. Instead, I get the following output:

{u'_score': 1.0, u'_type': u'doc', u'_id': u'0', u'_source': {}, u'_index': u'test'}
{u'_score': 1.0, u'_type': u'doc', u'_id': u'3', u'_source': {}, u'_index': u'test'}
{u'_score': 1.0, u'_type': u'doc', u'_id': u'1', u'_source': {}, u'_index': u'test'}
{u'_score': 1.0, u'_type': u'doc', u'_id': u'2', u'_source': {}, u'_index': u'test'}

from elasticsearch import Elasticsearch

es_client = Elasticsearch(hosts = [{ "host" : "localhost", "port" : 9200 }])

index_name = "test"

if es_client.indices.exists(index_name):
    print("deleting '%s' index..." % (index_name))
    print(es_client.indices.delete(index = index_name, ignore=[400, 404]))

print("creating '%s' index..." % (index_name))
print(es_client.indices.create(index = index_name))

bulk_data = []

for i in range(4):
    bulk_data.append({
        "index": {
            "_index": index_name, 
            "_type": 'doc', 
            "_id": i,
            "_source": {"hello": "world"}
        }
    })
    bulk_data.append({ "idx": i })

print("bulk indexing...")
res = es_client.bulk(index=index_name,body=bulk_data,refresh=True)
print(res)

print("results:")
for doc in es_client.search(index=index_name)['hits']['hits']:
    print(doc)
@honzakral
Contributor

You shouldn't specify _source in the metadata line; that is what causes the bulk request to misfire.

This is a bug in Elasticsearch: it ignores some parts of the request if _source is present in the action line. I will file a ticket there.

For a simpler interface, you can also look at the bulk helper, which should be a lot friendlier and do the right thing here: http://elasticsearch-py.readthedocs.org/en/latest/helpers.html#elasticsearch.helpers.bulk

Hope this helps.
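
To illustrate the point about the metadata line, here is a minimal sketch of how the raw bulk body from the report could be built with the document on its own line after each action line. The helper name build_bulk_body and the merged document contents are assumptions made for illustration; the _type field reflects the 1.x-era API used in this thread.

```python
def build_bulk_body(index_name, docs, doc_type="doc"):
    """Build alternating action and source lines for the raw bulk API.

    Hypothetical helper, not part of elasticsearch-py.
    """
    body = []
    for doc_id, source in enumerate(docs):
        # Action (metadata) line: index, type and id only, no _source key.
        body.append({"index": {"_index": index_name, "_type": doc_type, "_id": doc_id}})
        # Source line: the document itself, directly after its action line.
        body.append(source)
    return body


# Assumed document contents, combining the two dicts from the original report.
bulk_data = build_bulk_body("test", [{"hello": "world", "idx": i} for i in range(4)])

# With a running cluster, this would be sent as in the original report:
# es_client.bulk(index="test", body=bulk_data, refresh=True)
```

Each document occupies two consecutive entries, so four documents produce eight entries in bulk_data.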

@honzakral
Contributor

Issue filed as elastic/elasticsearch#10977

@tomron
Author

tomron commented May 5, 2015

It's still not working with the following code:

from elasticsearch import Elasticsearch, helpers

es_client = Elasticsearch(hosts = [{ "host" : "localhost", "port" : 9200 }])

index_name = "test"
j = 0
actions = []
while (j <= 10):
    action = {
        "_index": index_name,
        "_type": "doc",
        "_id": j,
        "_source": {
            "any":"data" + str(j)
            }
        }
    actions.append(action)
    j += 1

helpers.bulk(es_client, actions)

print("results:")
for doc in es_client.search(index=index_name)['hits']['hits']:
    print(doc)

@honzakral
Contributor

helpers.bulk simply doesn't call refresh. Either add refresh=True to the call, or refresh manually by calling es_client.indices.refresh(index=index_name) after you are done indexing.
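
The manual-refresh variant could look roughly like the sketch below. The function names make_actions and index_and_refresh are hypothetical; the calls to helpers.bulk and es_client.indices.refresh are the ones named in this thread, and a reachable cluster is assumed when index_and_refresh is actually called.

```python
def make_actions(index_name, count, doc_type="doc"):
    """Yield one action per document in the flat format passed to helpers.bulk."""
    for j in range(count):
        yield {
            "_index": index_name,
            "_type": doc_type,
            "_id": j,
            "_source": {"any": "data" + str(j)},
        }


def index_and_refresh(es_client, index_name, count):
    """Bulk-index the documents, then refresh so a following search sees them."""
    # Imported here so make_actions can be used without the client library installed.
    from elasticsearch import helpers

    helpers.bulk(es_client, make_actions(index_name, count))
    # Explicit refresh once indexing is done, as suggested above.
    es_client.indices.refresh(index=index_name)
```

For the loop in the report, which runs j from 0 to 10 inclusive, the equivalent call would be index_and_refresh(es_client, "test", 11).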

@tomron
Author

tomron commented May 5, 2015

Sorry, it's still not working:

from elasticsearch import Elasticsearch, helpers
from time import sleep

es_client = Elasticsearch(hosts = [{ "host" : "localhost", "port" : 9200 }])

index_name = "test"
j = 0
actions = []
while (j <= 10):
    action = {
        "_index": index_name,
        "_type": "doc",
        "_id": j,
        "_source": {
            "any":"data" + str(j)
            }
        }
    actions.append(action)
    j += 1

helpers.bulk(es_client, actions, refresh=True)
sleep(1)
es_client.indices.refresh(index=index_name) 
print("results:")
for doc in es_client.search(index=index_name)['hits']['hits']:
    print(doc)

@honzakral
Contributor

This is what your code gives me, which is correct:

{'_id': '4', '_type': 'doc', '_source': {'any': 'data4'}, '_score': 1.0, '_index': 'test'}
{'_id': '9', '_type': 'doc', '_source': {'any': 'data9'}, '_score': 1.0, '_index': 'test'}
{'_id': '0', '_type': 'doc', '_source': {'any': 'data0'}, '_score': 1.0, '_index': 'test'}
{'_id': '5', '_type': 'doc', '_source': {'any': 'data5'}, '_score': 1.0, '_index': 'test'}
{'_id': '1', '_type': 'doc', '_source': {'any': 'data1'}, '_score': 1.0, '_index': 'test'}
{'_id': '6', '_type': 'doc', '_source': {'any': 'data6'}, '_score': 1.0, '_index': 'test'}
{'_id': '2', '_type': 'doc', '_source': {'any': 'data2'}, '_score': 1.0, '_index': 'test'}
{'_id': '7', '_type': 'doc', '_source': {'any': 'data7'}, '_score': 1.0, '_index': 'test'}
{'_id': '3', '_type': 'doc', '_source': {'any': 'data3'}, '_score': 1.0, '_index': 'test'}
{'_id': '8', '_type': 'doc', '_source': {'any': 'data8'}, '_score': 1.0, '_index': 'test'}

Could you provide some more information about what exactly isn't working?

@tomron
Author

tomron commented May 5, 2015

The output I get is shown at the end of this comment.
I am using the Python package elasticsearch==1.4.0 with the following Elasticsearch version:

"version" : {
  "number" : "1.4.4",
  "build_hash" : "c88f77ffc81301dfa9dfd81ca2232f09588bd512",
  "build_timestamp" : "2015-02-19T13:05:36Z",
  "build_snapshot" : false,
  "lucene_version" : "4.10.3"
}

I get no errors when I check the stats.

{u'_score': 1.0, u'_type': u'doc', u'_id': u'0', u'_source': {}, u'_index': u'test'}
{u'_score': 1.0, u'_type': u'doc', u'_id': u'3', u'_source': {}, u'_index': u'test'}
{u'_score': 1.0, u'_type': u'doc', u'_id': u'6', u'_source': {}, u'_index': u'test'}
{u'_score': 1.0, u'_type': u'doc', u'_id': u'9', u'_source': {}, u'_index': u'test'}
{u'_score': 1.0, u'_type': u'doc', u'_id': u'10', u'_source': {}, u'_index': u'test'}
{u'_score': 1.0, u'_type': u'doc', u'_id': u'1', u'_source': {}, u'_index': u'test'}
{u'_score': 1.0, u'_type': u'doc', u'_id': u'4', u'_source': {}, u'_index': u'test'}
{u'_score': 1.0, u'_type': u'doc', u'_id': u'7', u'_source': {}, u'_index': u'test'}
{u'_score': 1.0, u'_type': u'doc', u'_id': u'2', u'_source': {}, u'_index': u'test'}
{u'_score': 1.0, u'_type': u'doc', u'_id': u'5', u'_source': {}, u'_index': u'test'}

@honzakral
Contributor

Even with the exact same versions, I cannot replicate your issue. Could you please make sure you are deleting and recreating the index correctly? Thanks.
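
One way to make the delete-and-recreate step explicit is sketched below. The function name reset_index is hypothetical; the indices.exists, indices.delete and indices.create calls, including ignore=[400, 404], are taken from the first snippet in this thread and assume the 1.x-era client used there.

```python
def reset_index(es_client, index_name):
    """Delete the index if it exists, then create it again empty."""
    if es_client.indices.exists(index=index_name):
        # ignore=[400, 404] is copied from the original report; it tells the
        # client not to raise on those HTTP status codes.
        es_client.indices.delete(index=index_name, ignore=[400, 404])
    es_client.indices.create(index=index_name)
```

Calling reset_index(es_client, "test") before each bulk run would start every attempt from an empty index, so documents left over from an earlier run cannot show up in the search results.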

@tomron
Author

tomron commented May 6, 2015

I restarted the cluster, and now it works fine.

@honzakral
Contributor

Thanks, closing the issue.
