support dynamic index/type #175

Closed
cynosureabu opened this Issue Mar 25, 2014 · 5 comments

cynosureabu commented Mar 25, 2014

My data is in a partitioned Hive table. I know I can read it partition by partition and create an external table with es.resource pointing to that partition. But is it possible to read multiple partitions together and have each partition's data written to a different type/index?

something like
es.resource.index = column_name1,
es.resource.type = column_partition_column name

Is there such functionality already?
Thanks,
Chen

costin commented Mar 25, 2014

Let me first start by saying that partitions and external tables are clunky and buggy.
If I understand correctly you'd like to use the partitions of a table as the type for writing data to Elasticsearch.
This should be fairly easy to achieve by parameterizing your Hive script: create a simple script that reads the table for partition {X} and writes to Elasticsearch under index/{X}.
Then run the Hive script, binding {X} to the partition you want.
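
As a rough illustration of the parameterized approach described above, here is a minimal Python sketch that renders one Hive script per partition. The `EsStorageHandler` class and the `es.resource` property come from elasticsearch-hadoop; every table, column, index, and type name (`logs`, `dt`, `entry`, the `id`/`message` columns) is a made-up placeholder, and the one-index-per-partition layout is just one possible choice.

```python
import re

# Hive script template. All table/column/index names are hypothetical.
HIVE_TEMPLATE = """\
CREATE EXTERNAL TABLE IF NOT EXISTS es_{table_suffix} (id STRING, message STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = '{index_prefix}-{partition}/{es_type}');

INSERT OVERWRITE TABLE es_{table_suffix}
SELECT id, message FROM {source_table} WHERE {partition_column} = '{partition}';
"""


def render_hive_script(partition, source_table="logs", partition_column="dt",
                       index_prefix="logs", es_type="entry"):
    """Return the Hive script text for a single partition value."""
    # Hive table names cannot contain '-', so replace non-word characters.
    table_suffix = re.sub(r"\W", "_", partition)
    return HIVE_TEMPLATE.format(
        table_suffix=table_suffix,
        partition=partition,
        source_table=source_table,
        partition_column=partition_column,
        index_prefix=index_prefix,
        es_type=es_type,
    )


if __name__ == "__main__":
    # One script per partition; each could be saved and run with `hive -f`.
    for value in ("2014-03-24", "2014-03-25"):
        print(render_hive_script(value))
```

Hive's own variable substitution (`${hiveconf:X}` together with `hive -hiveconf X=... -f script.hql`) should give the same result without an external generator.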

As a side note, partitioning is typically used to improve performance. If you deal with large volumes of data, you probably want each partition to point to a different index rather than a different type, since using types would push all the data under the same index.

costin added the hive label Mar 25, 2014

cynosureabu commented Mar 25, 2014

"Parameterizing your Hive script" is what I am currently doing. The issue is that I cannot query multiple partitions at the same time.

I also see lots of connection exceptions (out of nodes and retry?) in my mappers. Is there any known tuning I could do to avoid this? I have tried increasing the timeout to 10m, which seems to help, but I would like to know if there are better ways.

My cluster has 6 machines, each running ES with 20 GB of memory, and each partition is around 30 million records (with 4-5 string fields).

costin commented Mar 25, 2014

From an Elasticsearch perspective, if you push your data under the same index, you can access it with the same query. If you have multiple indices, you can write a single query that targets all of them.
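
To make the multi-index case concrete, here is a small Python sketch that builds a `_search` URL targeting several indices at once. Elasticsearch accepts a comma-separated list of index names (and wildcards) in the request path; the host `http://localhost:9200` and the `logs-...` index names are assumptions for illustration only.

```python
from urllib.parse import quote


def search_url(indices, host="http://localhost:9200"):
    """Build a _search URL that targets several indices in one request."""
    if not indices:
        raise ValueError("at least one index is required")
    # Comma-separated index names; '*' is kept so wildcards like logs-* work.
    target = ",".join(quote(name, safe="*") for name in indices)
    return f"{host}/{target}/_search"


if __name__ == "__main__":
    # Hypothetical per-partition indices.
    print(search_url(["logs-2014-03-24", "logs-2014-03-25"]))
    print(search_url(["logs-*"]))
```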

The connection exceptions can have many causes, and without concrete information I can only guess at the issue. If you are writing, consider reducing the bulk size (we'll lower the default in the next release); if you are reading, it depends on how you stream the data.

You can always turn on logging on the various packages in org.elasticsearch.hadoop to see what's wrong. I also recommend trying the latest master.

cynosureabu commented Mar 25, 2014

Thanks, Costin. All my operations are writes. I will try decreasing the bulk size and turning on the logging, and will keep you posted.

costin commented Mar 25, 2014

Try a bulk size of 5 MB and adjust from there. Note that this is the batch size per task, so the total scales with the number of tasks: at 5 MB per task, a job with 10 tasks sends up to 50 MB of bulk data at once, 20 tasks send 100 MB, and so on.
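
The per-task scaling can be checked with a few lines of Python. This is only back-of-the-envelope arithmetic, assuming every task has one full bulk request in flight at the same time; the function name is made up for this sketch. In elasticsearch-hadoop the per-task bulk size is most likely the `es.batch.size.bytes` setting, but verify that against the version you run.

```python
MB = 1024 * 1024


def aggregate_bulk_bytes(bulk_size_bytes, concurrent_tasks):
    """Upper bound on bulk data in flight if every task flushes at once."""
    return bulk_size_bytes * concurrent_tasks


if __name__ == "__main__":
    for tasks in (10, 20):
        total = aggregate_bulk_bytes(5 * MB, tasks)
        print(f"{tasks} tasks x 5 MB = {total // MB} MB in flight")
```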
