
[LDI] Improve performance of historical data import #11

Open
amotl opened this issue Oct 1, 2019 · 4 comments

amotl commented Oct 1, 2019

Coming from #9 and #8, we see that the ingest performance for historical LDI data, acquired from http://archive.luftdaten.info/ through wget, could well be improved. We will outline some thoughts about this here (in no particular order):

  • Better tune the current implementation with respect to buffering, to better balance memory consumption against ingest performance (see the sketch after this list).
  • See whether using tablib's Dataset for ingesting the raw CSV files gains better performance. The underlying machinery is based on Pandas.
  • Stop ingesting LDI CSV files altogether and use the Parquet files instead; see also Reading Parquet files.
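
As a quick illustration of the first point, here is a minimal sketch of chunked CSV reading with pandas. The function name, chunk size, and example file name are made up for illustration; the actual luftdatenpumpe implementation may look different.

```python
# Minimal sketch: stream an LDI CSV file in bounded-size chunks so memory
# consumption stays flat, while larger chunks amortize per-write overhead.
# The default of 10,000 rows is just a starting point for tuning.
import pandas as pd

def read_ldi_csv_chunked(path, chunksize=10_000):
    # LDI archive CSV files are semicolon-separated.
    for chunk in pd.read_csv(path, sep=";", chunksize=chunksize):
        yield chunk

for frame in read_ldi_csv_chunked("2019-10-01_sds011_sensor_12345.csv"):
    ...  # convert each frame to points and hand it to the writer
```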
amotl changed the title from "Improve performance of historical data import from LDI" to "[LDI] Improve performance of historical data import" on Dec 9, 2019

d-roet commented Jan 23, 2020

I stumbled across an InfluxDB blog post mentioning enhancements that improved ingestion speed by switching from JSON to the native line protocol format. A Python function to translate JSON records into this line protocol format is included. See Writing Data to InfluxDB with Python.

Perhaps this idea is useful for luftdatenpumpe too?
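
For illustration, a hypothetical, simplified renderer of a single line protocol point might look like this (real implementations also need to escape special characters in keys and values, which is omitted here):

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    # Render one point as InfluxDB line protocol:
    #   <measurement>,<tag_set> <field_set> <timestamp>
    # Escaping of special characters is intentionally omitted.
    tag_part = ",".join(f"{key}={value}" for key, value in sorted(tags.items()))
    field_part = ",".join(f"{key}={value}" for key, value in fields.items())
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"

print(to_line_protocol(
    "ldi_readings",
    tags={"sensor_id": "12345"},
    fields={"P1": 7.5, "P2": 3.2},
    timestamp_ns=1579795200000000000,
))
# ldi_readings,sensor_id=12345 P1=7.5,P2=3.2 1579795200000000000
```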


amotl commented Jan 23, 2020

Thanks for mentioning that detail. However, it's a bit misleading: there is no JSON on the wire at all! Data is always transferred to InfluxDB using the line protocol. If data is passed in as a Python dictionary, it gets converted into line protocol through the make_lines routine.

I believe tuning the batch_size parameter will be more promising, as the data has to be converted to line protocol anyway. Using a more lightweight variant of make_lines might improve performance, but it might also miss some edge cases the full implementation handles.

Please note that luftdatenpumpe also provides the option of submitting data using UDP [1] instead of HTTP over TCP. I am wondering why this detail was not mentioned at all in the blog post you referenced.

Do you still have trouble with ingest performance, @d-roet? I would be happy to look into it if time permits.

[1] https://docs.influxdata.com/influxdb/v1.7/supported_protocols/udp/
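
For reference, here is a hedged sketch of both knobs using the influxdb Python client. Host, ports, database name, and the sample point are made up; the UDP variant additionally requires a matching [[udp]] listener in the InfluxDB server configuration.

```python
from influxdb import InfluxDBClient

points = [{
    "measurement": "ldi_readings",
    "tags": {"sensor_id": "12345"},
    "fields": {"P1": 7.5, "P2": 3.2},
    "time": "2020-01-23T00:00:00Z",
}]

# HTTP transport: write_points() converts the dictionaries to line
# protocol internally (via make_lines) and submits them in batches.
client = InfluxDBClient(host="localhost", port=8086, database="ldi")
client.write_points(points, batch_size=10000)

# UDP transport: fire-and-forget, avoids HTTP overhead, but gives no
# acknowledgement of whether the points actually arrived.
udp_client = InfluxDBClient(host="localhost", use_udp=True, udp_port=8089)
udp_client.write_points(points)
```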


d-roet commented Jan 25, 2020

Thanks very much for mentioning the UDP option. I tried it following the docs you linked, but even the high-traffic UDP settings mentioned there did not speed things up compared to the default HTTP ingestion. I also experimented a bit with the batch-size and batch-pending options in the UDP configuration, but that did not improve things much either.

As a workaround, I have tried filtering our locally mirrored Luftdaten archive snapshot by keeping only the CSV archives of sensor IDs that are relevant for our case. Not surprisingly, that speeds things up a lot (a factor of 20-30× faster than before). I imagine this is because luftdatenpumpe no longer needs to sift through all of the offered CSVs and filter by --country using geocoding, as I did before.

So for my case this is a workable solution, and it probably does not warrant further investigation into optimizations inside luftdatenpumpe itself.
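
For reference, a hypothetical sketch of that filtering step. It assumes the archive's usual file naming scheme <date>_<sensor_type>_sensor_<id>.csv; adjust the pattern and the set of IDs if your mirror differs.

```python
import re
from pathlib import Path

# Sensor IDs relevant for our case (example values).
RELEVANT_IDS = {"12345", "67890"}

def relevant_files(archive_root):
    # Yield only CSV files whose names carry one of the relevant sensor IDs.
    for path in Path(archive_root).rglob("*.csv"):
        match = re.search(r"_sensor_(\d+)\.csv$", path.name)
        if match and match.group(1) in RELEVANT_IDS:
            yield path

for csvfile in relevant_files("/var/spool/archive.luftdaten.info"):
    print(csvfile)
```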


amotl commented Feb 12, 2020

> Stop ingesting LDI CSV files altogether and use the Parquet files instead; see also Reading Parquet files.

I've just made a gist showing how to acquire compressed Parquet files from archive.sensor.community and store their contents into InfluxDB. If this works out well, I will be happy to integrate it into Luftdatenpumpe appropriately.

The synopsis is easy to grok:

python ldi-parquet-to-influxdb.py <parquetfile> <database> <measurement>

However, please note the Parquet files are partitioned by time range, so there is no way to filter by country before actually reading them.
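
Without reproducing the gist here, a rough sketch of what such a script might do, assuming pandas (with pyarrow) and the influxdb client are installed, and assuming the Parquet files carry a "timestamp" column; the actual gist may differ.

```python
import sys

import pandas as pd
from influxdb import DataFrameClient

parquetfile, database, measurement = sys.argv[1:4]

# Read the (compressed) Parquet file; pandas delegates to pyarrow.
df = pd.read_parquet(parquetfile)

# DataFrameClient expects a DatetimeIndex; "timestamp" is an assumed
# column name, adjust it to the actual schema of the Parquet files.
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.set_index("timestamp")

client = DataFrameClient(host="localhost", port=8086, database=database)
client.create_database(database)
client.write_points(df, measurement, batch_size=10000)
```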

cc @d-roet, @wetterfrosch
