Add a track for the NYC taxi rides dataset. #5

jpountz · 2016-08-04T10:16:43Z

The idea for this track comes from http://tech.marksblogg.com/all-billion-nyc-taxi-rides-elasticsearch.html. The original data can be downloaded from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. This PR only adds the December 2015 subset, which contains 11.5M documents, but we could easily have the whole 1.1B dataset if we wanted.

This dataset is interesting because it is very structured: it has 10 float fields, 2 geo_point fields, 2 date fields, 1 int field, 6 low-cardinality keyword fields and 1 text field.

danielmitterdorfer · 2016-08-05T11:47:07Z

Thanks for your PR. I'll try to test it within the next few days.

danielmitterdorfer · 2016-08-08T05:44:06Z

LGTM

danielmitterdorfer · 2016-08-08T05:52:56Z

> This PR only adds the December 2015 subset, which contains 11.5M documents but we could easily have the whole 1.1B dataset if we wanted.

The resulting index size was roughly 2 GB, so as a ballpark estimate the whole dataset would result (very) roughly in a 200 GB index. Do you know how many documents the 2015 subset contains? I guess it should be roughly 140M docs, so using the 2015 subset could be a good compromise.
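
For reference, the ballpark above can be reproduced with a simple linear extrapolation by document count. This is only a sketch: it assumes index size scales linearly with the number of documents, the function name is made up for illustration, and the 140M figure is the guess from this comment, not a measured count.

```python
def estimate_index_size_gb(sample_docs, sample_size_gb, target_docs):
    """Scale a measured index size linearly by document count (an assumption)."""
    return sample_size_gb * target_docs / sample_docs


# Figures quoted in this thread: December 2015 subset, ~11.5M docs, ~2 GB index.
december_docs = 11.5e6
december_index_gb = 2.0

# Whole dataset (~1.1B docs) and the guessed 2015 subset (~140M docs).
full_gb = estimate_index_size_gb(december_docs, december_index_gb, 1.1e9)
year_2015_gb = estimate_index_size_gb(december_docs, december_index_gb, 140e6)

print(f"whole dataset: ~{full_gb:.0f} GB")
print(f"2015 subset:   ~{year_2015_gb:.0f} GB")
```

Under these assumptions the whole dataset lands a little under 200 GB and the 2015 subset at a few tens of GB.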

jpountz · 2016-08-08T13:00:50Z

Thanks for having a look, I'll look into making the dataset bigger as you suggested.

djschny · 2016-08-08T13:17:19Z

I went to look at the source data, but the links to the dataset files do not appear to be working. Curious how you are fetching them, @danielmitterdorfer?

For example - https://storage.googleapis.com/tlc-trip-data/2015/yellow_tripdata_2015-01.csv

danielmitterdorfer · 2016-08-08T13:19:15Z

@djschny I fetched them from the S3 bucket where Adrien has put them (http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/nyc_taxis/documents.json.bz2)

jpountz · 2016-08-08T13:19:26Z

@djschny I am having the same issue. It was working two or three weeks ago when I downloaded the data.

danielmitterdorfer · 2016-08-08T13:33:55Z

Seems they've hit some quota on GCS. I've pinged them on Twitter.

djschny · 2016-08-08T13:35:15Z

Thanks all, I'll use our internal copy/hosted version of the data in the interim.

jpountz · 2016-08-11T10:51:16Z

I updated the PR so that the dataset contains green and yellow taxi rides of 2015, or about 165M documents/75GB of data, 4.5GB once compressed. It took 2 hours and 45 minutes to run on my machine:

|                                                   Metric |     Value |
|---------------------------------------------------------:|----------:|
|                         Min Indexing Throughput [docs/s] |     19185 |
|                      Median Indexing Throughput [docs/s] |     20887 |
|                         Max Indexing Throughput [docs/s] |     25697 |
|                                      Indexing time [min] |   498.225 |
|                                         Merge time [min] |   310.807 |
|                                       Refresh time [min] |   2.90512 |
|                                         Flush time [min] |   4.34565 |
|                                Merge throttle time [min] |   163.515 |
|             Query latency default (90.0 percentile) [ms] |   1728.03 |
|             Query latency default (99.0 percentile) [ms] |   1900.94 |
|              Query latency default (100 percentile) [ms] |   2343.27 |
|               Query latency range (90.0 percentile) [ms] |   2380.11 |
|               Query latency range (99.0 percentile) [ms] |   2518.81 |
|                Query latency range (100 percentile) [ms] |   2614.53 |
| Query latency distance_amount_agg (90.0 percentile) [ms] |   1966.39 |
| Query latency distance_amount_agg (99.0 percentile) [ms] |    2090.8 |
|  Query latency distance_amount_agg (100 percentile) [ms] |   2290.11 |
|                             Median CPU usage (index) [%] |     365.6 |
|                            Median CPU usage (search) [%] |     109.9 |
|                                   Total Young Gen GC [s] |   295.445 |
|                                     Total Old Gen GC [s] |    75.868 |
|                                          Index size [GB] |   27.5261 |
|                                     Totally written [GB] |   265.138 |
|                              Heap used for segments [MB] |   70.2554 |
|                            Heap used for doc values [MB] | 0.0788841 |
|                                 Heap used for terms [MB] |   34.8949 |
|                                Heap used for points [MB] |   31.0068 |
|                         Heap used for stored fields [MB] |   4.27471 |
|                                            Segment count |        57 |
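
As a rough cross-check of the figures in the table, the snippet below derives two ratios from them. It is a sketch under stated assumptions: the document count is the approximate 165M mentioned above, the median throughput is treated as if it held for the whole run, and the variable names are illustrative only.

```python
# Values copied from the report above; the document count is approximate.
docs = 165e6                      # "about 165M documents"
median_throughput = 20887         # docs/s, median indexing throughput
index_size_gb = 27.5261           # final index size
total_written_gb = 265.138        # total bytes written during the run

# Wall-clock time spent indexing if the median rate were sustained throughout.
indexing_minutes = docs / median_throughput / 60

# How many times more data was written to disk than ends up in the index.
write_ratio = total_written_gb / index_size_gb

print(f"estimated indexing wall-clock time: {indexing_minutes:.0f} min")
print(f"written / final index size: {write_ratio:.1f}x")
```

The estimated indexing time is somewhat below the reported 2 hours and 45 minutes for the whole run, which also includes the query phase.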

danielmitterdorfer · 2016-08-11T10:54:32Z

Great. :) I'll review it ASAP.

danielmitterdorfer · 2016-08-12T10:28:56Z

Thanks for the PR. LGTM.

jpountz · 2016-08-16T11:22:09Z

@dm is there anything left to be done to have this track running at https://elasticsearch-benchmarks.elastic.co/ ?

danielmitterdorfer · 2016-08-16T11:23:12Z

@jpountz Yes. It needs to be added to our benchmark suite. I can take care of that.

danielmitterdorfer added the Review label Aug 5, 2016

danielmitterdorfer self-assigned this Aug 5, 2016

This was referenced Aug 5, 2016

Consider adding an Uber-benchmark #1

Closed

Backport NYC taxi track #7

Closed

danielmitterdorfer removed the Review label Aug 8, 2016

danielmitterdorfer added Review and removed Review labels Aug 11, 2016

danielmitterdorfer removed their assignment Aug 12, 2016

Add a track for the NYC taxi rides dataset.

cddb1b7

jpountz force-pushed the nyc_taxis branch from e1475c3 to cddb1b7 Compare August 16, 2016 11:20

jpountz merged commit 0802c11 into elastic:master Aug 16, 2016

jpountz deleted the nyc_taxis branch August 16, 2016 11:20
