Add a track for the NYC taxi rides dataset. #5

Merged
jpountz merged 1 commit into elastic:master on Aug 16, 2016

jpountz
Contributor

@jpountz jpountz commented Aug 4, 2016

The idea of this track comes from http://tech.marksblogg.com/all-billion-nyc-taxi-rides-elasticsearch.html. The original data can be downloaded from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. This PR only adds the December 2015 subset, which contains 11.5M documents, but we could easily have the whole 1.1B dataset if we wanted.

This dataset is interesting because it is very structured: it has 10 float fields, 2 geo_point fields, 2 date fields, 1 int field, 6 low-cardinality keyword fields and 1 text field.

@danielmitterdorfer
Member

Thanks for your PR. I'll try to test it within the next few days.

@danielmitterdorfer
Member

LGTM

@danielmitterdorfer
Member

> This PR only adds the December 2015 subset, which contains 11.5M documents but we could easily have the whole 1.1B dataset if we wanted.

The resulting index size was roughly 2 GB, so as a (very) rough ballpark estimate the whole dataset would result in a 200 GB index. Do you know how many documents the 2015 subset contains? I guess it should be roughly 140M docs, so using the 2015 subset could be a good compromise.
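
The ballpark figures in this comment follow from a simple linear extrapolation. Here is a minimal sketch of that arithmetic, assuming index size scales linearly with document count (which ignores compression and merge effects) and that every month of 2015 holds about as many documents as December. The function and variable names are illustrative only.

```python
def extrapolate_index_size_gb(sample_docs, sample_size_gb, target_docs):
    """Scale index size linearly with document count (a rough assumption)."""
    return sample_size_gb * target_docs / sample_docs


dec_2015_docs = 11.5e6   # documents in the December 2015 subset
dec_2015_size_gb = 2.0   # approximate index size observed for that subset

# Whole dataset: about 1.1B documents.
full_size_gb = extrapolate_index_size_gb(dec_2015_docs, dec_2015_size_gb, 1.1e9)

# 2015 only: assumes every month resembles December (an assumption).
docs_2015 = 12 * dec_2015_docs

print(f"full dataset: ~{full_size_gb:.0f} GB")
print(f"2015 subset: ~{docs_2015 / 1e6:.0f}M docs")
```

Under these assumptions the sketch gives about 191 GB and 138M documents, consistent with the rounded 200 GB and 140M figures.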

@jpountz
Contributor Author

jpountz commented Aug 8, 2016

Thanks for having a look, I'll look into making the dataset bigger as you suggested.

@djschny

djschny commented Aug 8, 2016

I went to look at the source data, but the links to the dataset files appear not to be working. Curious how you are fetching them, @danielmitterdorfer.

For example - https://storage.googleapis.com/tlc-trip-data/2015/yellow_tripdata_2015-01.csv

@danielmitterdorfer
Member

@djschny I fetched them from the S3 bucket where Adrien has put them (http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/nyc_taxis/documents.json.bz2)
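
For anyone who wants to inspect such a file without decompressing it fully, here is a minimal sketch of streaming documents out of a bz2 archive. It assumes the archive holds one JSON document per line, which is not confirmed in this thread. The function name and the sample field name are hypothetical.

```python
import bz2
import io
import json


def iter_documents(compressed_stream):
    """Yield one parsed JSON document per non-empty line of a bz2 stream."""
    # bz2.open accepts a file name or an existing binary file object.
    with bz2.open(compressed_stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)


# Self-contained demo on in-memory data; the field name is made up.
sample = b'{"passenger_count": 1}\n{"passenger_count": 2}\n'
docs = list(iter_documents(io.BytesIO(bz2.compress(sample))))
print(len(docs))
```

With a locally downloaded copy, the same function could be given a file opened in binary mode, for example `open("documents.json.bz2", "rb")`.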

@jpountz
Contributor Author

jpountz commented Aug 8, 2016

@djschny I am having the same issue. It was working two or three weeks ago when I downloaded the data.

@danielmitterdorfer
Member

Seems they've hit some quota on GCS. I've pinged them on Twitter.

@djschny

djschny commented Aug 8, 2016

Thanks all, I'll use our internal copy/hosted version of the data in the interim.

@jpountz
Contributor Author

jpountz commented Aug 11, 2016

I updated the PR so that the dataset contains the green and yellow taxi rides of 2015: about 165M documents and 75GB of data (4.5GB once compressed). It took 2 hours and 45 minutes to run on my machine:

|                                                   Metric |     Value |
|---------------------------------------------------------:|----------:|
|                         Min Indexing Throughput [docs/s] |     19185 |
|                      Median Indexing Throughput [docs/s] |     20887 |
|                         Max Indexing Throughput [docs/s] |     25697 |
|                                      Indexing time [min] |   498.225 |
|                                         Merge time [min] |   310.807 |
|                                       Refresh time [min] |   2.90512 |
|                                         Flush time [min] |   4.34565 |
|                                Merge throttle time [min] |   163.515 |
|             Query latency default (90.0 percentile) [ms] |   1728.03 |
|             Query latency default (99.0 percentile) [ms] |   1900.94 |
|              Query latency default (100 percentile) [ms] |   2343.27 |
|               Query latency range (90.0 percentile) [ms] |   2380.11 |
|               Query latency range (99.0 percentile) [ms] |   2518.81 |
|                Query latency range (100 percentile) [ms] |   2614.53 |
| Query latency distance_amount_agg (90.0 percentile) [ms] |   1966.39 |
| Query latency distance_amount_agg (99.0 percentile) [ms] |    2090.8 |
|  Query latency distance_amount_agg (100 percentile) [ms] |   2290.11 |
|                             Median CPU usage (index) [%] |     365.6 |
|                            Median CPU usage (search) [%] |     109.9 |
|                                   Total Young Gen GC [s] |   295.445 |
|                                     Total Old Gen GC [s] |    75.868 |
|                                          Index size [GB] |   27.5261 |
|                                     Totally written [GB] |   265.138 |
|                              Heap used for segments [MB] |   70.2554 |
|                            Heap used for doc values [MB] | 0.0788841 |
|                                 Heap used for terms [MB] |   34.8949 |
|                                Heap used for points [MB] |   31.0068 |
|                         Heap used for stored fields [MB] |   4.27471 |
|                                            Segment count |        57 |
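
A few derived ratios make these numbers easier to compare with other runs. The sketch below is plain arithmetic on figures quoted in this comment. The document count and raw size are the approximate values given above, and the variable names are illustrative. It also assumes the raw data size and the reported index size use the same GB unit.

```python
docs = 165e6                # approximate document count
raw_gb = 75.0               # uncompressed source data
compressed_gb = 4.5         # source data once compressed
index_size_gb = 27.5261     # "Index size [GB]"
written_gb = 265.138        # "Totally written [GB]"
median_docs_per_s = 20887   # "Median Indexing Throughput [docs/s]"

# How well the source data compresses.
compression_ratio = raw_gb / compressed_gb

# Final index size relative to the raw source data.
index_to_raw = index_size_gb / raw_gb

# Bytes written to disk per byte of final index (merges rewrite segments).
write_amplification = written_gb / index_size_gb

# Time to index every document at the median throughput.
indexing_hours = docs / median_docs_per_s / 3600

print(f"compression ratio:   {compression_ratio:.1f}x")
print(f"index / raw size:    {index_to_raw:.2f}")
print(f"write amplification: {write_amplification:.1f}x")
print(f"indexing at median:  {indexing_hours:.1f} h")
```

Indexing at the median throughput accounts for roughly 2.2 of the 2 hours and 45 minutes. The remainder presumably covers the query benchmarks and other overhead, though the thread does not break that down.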

@danielmitterdorfer
Member

Great. :) I'll review it ASAP.

@danielmitterdorfer danielmitterdorfer removed their assignment Aug 12, 2016
@danielmitterdorfer
Member

Thanks for the PR. LGTM.

@jpountz jpountz merged commit 0802c11 into elastic:master Aug 16, 2016
@jpountz jpountz deleted the nyc_taxis branch August 16, 2016 11:20
@jpountz
Contributor Author

jpountz commented Aug 16, 2016

@dm is there anything left to be done to have this track running at https://elasticsearch-benchmarks.elastic.co/ ?

@danielmitterdorfer
Member

@jpountz Yes. It needs to be added to our benchmark suite. I can take care of that.
