Add a track for the NYC taxi rides dataset. #5
Thanks for your PR. I'll try to test it within the next few days.
LGTM
The resulting index size was roughly 2 GB, so as a ballpark estimate the whole dataset would result (very) roughly in a 200 GB index. Do you know how many documents the 2015 subset contains? I guess it should be roughly 140M docs, so using the 2015 subset could be a good compromise.
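For reference, the ballpark above can be reproduced with a quick sketch. It assumes index size scales linearly with document count, which is only a rough approximation, and it takes the 11.5M-document count for December 2015 and the 1.1B total from the PR description; the function and constant names are purely illustrative.

```python
# Back-of-the-envelope index size estimate.
# Assumption: index size grows linearly with the number of documents.
DEC_2015_DOCS = 11.5e6     # documents in the December 2015 subset (PR description)
DEC_2015_INDEX_GB = 2.0    # approximate index size observed for that subset


def estimate_index_gb(doc_count):
    """Linearly extrapolate index size in GB from the December 2015 measurement."""
    return doc_count / DEC_2015_DOCS * DEC_2015_INDEX_GB


full_dataset_gb = estimate_index_gb(1.1e9)   # whole dataset, about 1.1B documents
subset_2015_gb = estimate_index_gb(140e6)    # guessed size of the 2015 subset

print(f"whole dataset: ~{full_dataset_gb:.0f} GB")
print(f"2015 subset:   ~{subset_2015_gb:.0f} GB")
```

Under that assumption the whole dataset lands a little under 200 GB and a 140M-document subset in the low tens of GB.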
Thanks for having a look. I'll look into making the dataset bigger as you suggested.
I went to look at the source data, but the links to the dataset files appear not to be working. Curious how you are fetching them, @danielmitterdorfer. For example: https://storage.googleapis.com/tlc-trip-data/2015/yellow_tripdata_2015-01.csv
@djschny I fetched them from the S3 bucket where Adrien has put them (http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/nyc_taxis/documents.json.bz2)
@djschny I am having the same issue; it was working two or three weeks ago when I downloaded the data.
Seems they've hit some quota on GCS. I've pinged them on Twitter.
Thanks all, I'll use our internal copy/hosted version of the data in the interim.
I updated the PR so that the dataset contains green and yellow taxi rides of 2015, or about 165M documents/75GB of data, 4.5GB once compressed. It took 2 hours and 45 minutes to run on my machine:
Great. :) I'll review it ASAP.
Thanks for the PR. LGTM.
@dm is there anything left to be done to have this track running at https://elasticsearch-benchmarks.elastic.co/ ?
@jpountz Yes. It needs to be added to our benchmark suite. I can take care of that.
The idea for this track comes from http://tech.marksblogg.com/all-billion-nyc-taxi-rides-elasticsearch.html. The original data can be downloaded from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. This PR only adds the December 2015 subset, which contains 11.5M documents, but we could easily add the whole 1.1B-document dataset if we wanted.
This dataset is interesting because it is very structured: it has 10 float fields, 2 geo_point fields, 2 date fields, 1 int field, 6 low-cardinality keyword fields, and 1 text field.