I started a project to benchmark ElasticSearch. I used YCSB as part of my testing process and I would like to share my ElasticSearch database implementation.
Questions and feedback are welcome.
First crack at creating a YCSB client for ElasticSearch
Updated the client code to fix some issues and created a unit test fo…
Further cleanup to ensure that the node settings are configurable. Also added README instructions for how to use the client.
gh-93 New ElasticSearch Database Implementation
Sorry it took a while to get to this. Thanks for the patch!
I am new to both YCSB and ElasticSearch. I was able to run YCSB easily for Cassandra. However, I have not been able to do the same with ES (or perhaps I have but I am not sure).
Following the steps documented in YCSB/elasticsearch, I was able to start the test and even got results. But I am not sure which ES instance it is running against. For Cassandra, I had to start the Cassandra server myself and then run the tests (providing the host details along with the ycsb command). ES, on the other hand, does not seem to require anything of that sort. So how does YCSB run these tests? I didn't even have my local ES instance up, but the tests gave results.
Any insights would really help.
ElasticSearch enables you to start a local node ("embedded" instance inside the JVM) and that's exactly what is being used in YCSB. ElasticSearch can be confusing initially but it all comes down to understanding the fact that you can create a local node that does or doesn't contain data and that clients are obtained via these nodes. If a node doesn't contain data then it simply routes data to other nodes in the cluster.
You can also use a TransportClient to avoid having to create a node. If you wish to test against a "remote" node, you will have to create your own properties file (e.g. "myproperties.data") that contains your custom ElasticSearch node configuration and pass it into YCSB as described in the documentation. If you're planning on doing some funky and cool "remote" ElasticSearch testing, be sure to explicitly overwrite the default properties, as that will ensure your configuration file is in full command of ElasticSearch.
Please take a look at "http://www.elasticsearch.org/guide/reference/java-api/client.html" and note the following:
"[C]ommon usage is to start the Node and use the Client in unit/integration tests. In such a case, we would like to start a “local” Node (with a “local” discovery and transport). Again, this is just a matter of a simple setting when starting the Node. Note, “local” here means local on the JVM (well, actually class loader) level, meaning that two local servers started within the same JVM will discover themselves and form a cluster."
And the current ESClient in YCSB does not implement TransportClient. Thanks for all the information and the link. Got to know a lot about ES this way.
There really isn't a good reason to use Transport Client as it is less efficient. For maximum performance it is recommended that you start a local dataless node that connects to your cluster and obtain a client from it.
If you still want to benchmark ES using Transport Client feel free to change the code to fit your test case.
Well you're right. TransportClient does look less efficient. Therefore, I completely ignored that option.
A local dataless node looks like the way to go. But how do I verify that the local node is able to connect to my cluster? As far as I know, my other ES instances will show connection info as soon as another ES instance comes up. But that doesn't happen when I connect with the local node.
UPDATE: Running everything locally. Also, on checking /tmp/esdata manually, I was able to confirm that the supposedly dataless node is writing data to itself on running bin/ycsb load .... And the other ES node does not get any writes. Exactly the opposite is happening, most probably because the local node is not able to join the cluster.
bin/ycsb load ...
Any clue what might be wrong?
I suspect you don't have auto-discovery turned on or you're not using the same cluster name on both nodes. Try setting the following on both your local and remote nodes:
Also, you may want to check out one of the many front ends available to monitor your cluster:
"I was able to confirm that the supposedly dataless node is writing data to itself on running"
Ensure that "node.data" is set to false or "node.client" to true, and try setting "node.local" to false.
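To check whether the client node actually joined, one option is to ask one of your remote nodes for its cluster health and look at the node counts. Here is a rough sketch in Python, assuming the node's HTTP port is reachable at localhost:9200 (adjust to your setup). The function names are made up for illustration and are not part of YCSB or ElasticSearch.

```python
import json
import urllib.request


def parse_node_counts(health_json):
    """Return (total nodes, data nodes) from a _cluster/health response body."""
    health = json.loads(health_json)
    return health["number_of_nodes"], health["number_of_data_nodes"]


def fetch_node_counts(base_url="http://localhost:9200"):
    """Query the cluster health endpoint of one node over HTTP.

    base_url is an assumption; point it at one of your remote data nodes.
    """
    with urllib.request.urlopen(base_url + "/_cluster/health", timeout=5) as resp:
        return parse_node_counts(resp.read().decode("utf-8"))
```

Call fetch_node_counts() while the YCSB load is running. If the total node count does not go up by one compared to when YCSB is idle, the embedded node most likely formed its own cluster instead of joining yours. A joined client node should raise the total count but not the data node count.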
One more thing: if you have a firewall or proxy in place, it can prevent you from using multicast discovery, and you may need to open those ports for communication. An alternative to multicast discovery is unicast discovery. Assuming you have correctly configured your remote data nodes and know their IP addresses and ports, here's a surefire configuration that should work.
# all nodes must have the same cluster name
# create a dataless client node
# make sure this is unique and each node has its own name or elasticsearch generated name
# disable multicast and enable unicast
# you will need to make sure these match the IP addresses and ports of your remote nodes
# In this case I have two local nodes started and I made sure to explicitly
# set their "network.host" config properties to 127.0.0.1 and their "transport.tcp.port"
# to 9301 and 9302 respectively
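As a sketch, the settings those comments describe might look like this in a properties file. The setting names are assumed from the ElasticSearch 0.19/0.20 configuration, and the cluster name and node name are placeholders, so check them against your own setup.

```properties
# assumed setting names; cluster and node names below are placeholders
cluster.name=my-cluster
node.client=true
node.name=ycsb-client-1
discovery.zen.ping.multicast.enabled=false
# hosts and ports of the two local nodes described above
discovery.zen.ping.unicast.hosts=127.0.0.1:9301,127.0.0.1:9302
```

Saved as "myproperties.data", this would be passed to YCSB with an extra -P argument, as described in the README.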
Hope this helps.
I tried multicast following your directions and got a warning saying "failed to receive confirmation on sent ping response...". After trying a couple of things, I realized that this might be a firewall issue that I was not able to ID. Therefore, I switched to unicast with your settings. That failed again on my machine.
Then I set up a fresh Ubuntu install and tried unicast again. Finally I figured out that the client version specified in elasticsearch/pom.xml was 0.19.8, whereas I was using the latest ES server, 0.20.1. So the Java client being used in YCSB/elasticsearch was not compatible with the server version, and that's why everything was failing. I changed the version in pom.xml, rebuilt YCSB, and it worked!
Perhaps you should point this out somewhere in the README.
Thanks for your patience and all the help! :)