
[LDI] Improve performance of historical data import #11

Open
amotl opened this issue Oct 1, 2019 · 4 comments

amotl commented Oct 1, 2019

Coming from #9 and #8, we see that the ingest performance for historical LDI data, acquired from http://archive.luftdaten.info/ through wget, could well be improved. We will outline some thoughts about this here (in no particular order):

  • Better tune the current implementation with respect to buffering, to better balance memory consumption against ingest performance (see the sketch after this list).
  • See whether using tablib's Dataset for ingesting the raw CSV files gains better performance. The underlying machinery is based on Pandas.
  • Stop ingesting LDI CSV files altogether and use the Parquet files instead; see also Reading Parquet files.
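
As a quick illustration of the first point, here is a minimal sketch of chunked CSV reading with pandas. The function name, chunk size, and example file name are made up for illustration; the actual luftdatenpumpe implementation may look different.

```python
# Minimal sketch: stream an LDI CSV file in bounded-size chunks so memory
# consumption stays flat, while larger chunks amortize per-write overhead.
# The default of 10,000 rows is just a starting point for tuning.
import pandas as pd

def read_ldi_csv_chunked(path, chunksize=10_000):
    # LDI archive CSV files are semicolon-separated.
    for chunk in pd.read_csv(path, sep=";", chunksize=chunksize):
        yield chunk

for frame in read_ldi_csv_chunked("2019-10-01_sds011_sensor_12345.csv"):
    ...  # convert each frame to points and hand it to the writer
```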
amotl changed the title from "Improve performance of historical data import from LDI" to "[LDI] Improve performance of historical data import" on Dec 9, 2019

d-roet commented Jan 23, 2020

I stumbled across an InfluxDB blog post mentioning enhancements that improved ingestion speed by switching from JSON to the native line protocol format. A Python function to translate JSON records into this line protocol format is included. See Writing Data to InfluxDB with Python.

Perhaps this idea is useful for luftdatenpumpe too?
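
For illustration, a hypothetical, simplified renderer of a single line protocol point might look like this (real implementations also need to escape special characters in keys and values, which is omitted here):

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    # Render one point as InfluxDB line protocol:
    #   <measurement>,<tag_set> <field_set> <timestamp>
    # Escaping of special characters is intentionally omitted.
    tag_part = ",".join(f"{key}={value}" for key, value in sorted(tags.items()))
    field_part = ",".join(f"{key}={value}" for key, value in fields.items())
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"

print(to_line_protocol(
    "ldi_readings",
    tags={"sensor_id": "12345"},
    fields={"P1": 7.5, "P2": 3.2},
    timestamp_ns=1579795200000000000,
))
# ldi_readings,sensor_id=12345 P1=7.5,P2=3.2 1579795200000000000
```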


amotl commented Jan 23, 2020

Thanks for mentioning that detail. However, it's a bit misleading: there is no JSON on the wire at all! Data is always transferred to InfluxDB using the line protocol. If data is passed in as a Python dictionary, it gets converted into line protocol through the make_lines routine.

I believe tuning the batch_size parameter will be more promising, as the data has to be converted to line protocol anyway. Using a more lightweight variant of make_lines might improve performance, but it might also miss some edge cases the full implementation handles.

Please note that luftdatenpumpe also provides the option of submitting data using UDP [1] instead of HTTP over TCP. I am wondering why this detail was not mentioned at all in the blog post you referenced.

Do you still have trouble with ingest performance, @d-roet? I would be happy to look into it if time permits.

[1] https://docs.influxdata.com/influxdb/v1.7/supported_protocols/udp/
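
For reference, here is a hedged sketch of both knobs using the influxdb Python client. Host, ports, database name, and the sample point are made up; the UDP variant additionally requires a matching [[udp]] listener in the InfluxDB server configuration.

```python
from influxdb import InfluxDBClient

points = [{
    "measurement": "ldi_readings",
    "tags": {"sensor_id": "12345"},
    "fields": {"P1": 7.5, "P2": 3.2},
    "time": "2020-01-23T00:00:00Z",
}]

# HTTP transport: write_points() converts the dictionaries to line
# protocol internally (via make_lines) and submits them in batches.
client = InfluxDBClient(host="localhost", port=8086, database="ldi")
client.write_points(points, batch_size=10000)

# UDP transport: fire-and-forget, avoids HTTP overhead, but gives no
# acknowledgement of whether the points actually arrived.
udp_client = InfluxDBClient(host="localhost", use_udp=True, udp_port=8089)
udp_client.write_points(points)
```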


d-roet commented Jan 25, 2020

Thanks very much for mentioning the UDP option. I tried it following the docs you linked, but even the high-traffic UDP settings mentioned there did not speed things up compared to the default HTTP ingestion. I also experimented a bit with the batch-size and batch-pending options in the UDP configuration, but that did not improve things much either.

As a workaround, I have tried filtering our locally mirrored Luftdaten archive snapshot by keeping only the CSV archives of sensor IDs that are relevant for our case. Not surprisingly, that speeds things up a lot (a factor of 20-30× faster than before). I imagine this is because luftdatenpumpe no longer needs to sift through all of the offered CSVs and filter by --country using geocoding, as I did before.

So for my case this is a workable solution, and it probably does not warrant further investigation into optimizations inside luftdatenpumpe itself.
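
For reference, a hypothetical sketch of that filtering step. It assumes the archive's usual file naming scheme <date>_<sensor_type>_sensor_<id>.csv; adjust the pattern and the set of IDs if your mirror differs.

```python
import re
from pathlib import Path

# Sensor IDs relevant for our case (example values).
RELEVANT_IDS = {"12345", "67890"}

def relevant_files(archive_root):
    # Yield only CSV files whose names carry one of the relevant sensor IDs.
    for path in Path(archive_root).rglob("*.csv"):
        match = re.search(r"_sensor_(\d+)\.csv$", path.name)
        if match and match.group(1) in RELEVANT_IDS:
            yield path

for csvfile in relevant_files("/var/spool/archive.luftdaten.info"):
    print(csvfile)
```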


amotl commented Feb 12, 2020

> Stop ingesting LDI CSV files altogether and use the Parquet files instead; see also Reading Parquet files.

I've just made a gist showing how to acquire compressed Parquet files from archive.sensor.community and store their contents into InfluxDB. If this works out well, I will be happy to integrate it into Luftdatenpumpe appropriately.

The synopsis is easy to grok:

python ldi-parquet-to-influxdb.py <parquetfile> <database> <measurement>

However, please note the Parquet files are partitioned by time range, so there is no way to filter by country before actually reading them.
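
Without reproducing the gist here, a rough sketch of what such a script might do, assuming pandas (with pyarrow) and the influxdb client are installed, and assuming the Parquet files carry a "timestamp" column; the actual gist may differ.

```python
import sys

import pandas as pd
from influxdb import DataFrameClient

parquetfile, database, measurement = sys.argv[1:4]

# Read the (compressed) Parquet file; pandas delegates to pyarrow.
df = pd.read_parquet(parquetfile)

# DataFrameClient expects a DatetimeIndex; "timestamp" is an assumed
# column name, adjust it to the actual schema of the Parquet files.
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.set_index("timestamp")

client = DataFrameClient(host="localhost", port=8086, database=database)
client.create_database(database)
client.write_points(df, measurement, batch_size=10000)
```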

cc @d-roet, @wetterfrosch
