
Introduce persistent storage to Azure exporter #632

Merged
merged 23 commits into from
Apr 29, 2019

Conversation

reyang
Contributor

@reyang reyang commented Apr 26, 2019

NOTE: discussion of data reliability and persistence will continue in #633.

@c24t @songy23 I hope to get your early feedback before I finish the docs/tests. And to see if we could take the same approach for agents / other SDKs.

I plan to add local file persistence for the Azure trace exporter, to help with the following cases:

  1. The application is experiencing networking issues, or the ingestion/backend is not responding. Instead of accumulating traces in memory, users can opt in and configure the exporter to dump information to a local file in order to reduce memory pressure. [Memory increase due to too many traces #590]
  2. In case of application crash / restart, we need a way to pick up the leftover traces.
  3. The storage should provide consistency support under multi-processing / multi-threading.
  4. For console applications (e.g. backend jobs, periodic tasks, command line tools), there could be unsent traces after the grace period, either due to a massive amount of traces or due to intermittent transmission errors.

The LocalFileStorage is designed to take a minimal dependency on operating-system-level synchronization primitives; the only requirement is that file rename is mutually exclusive (atomic). This makes it easier to port the logic to other languages/platforms.

from opencensus.ext.azure.common.storage import LocalFileStorage

if __name__ == '__main__':
    with LocalFileStorage('test', maintenance_period=5) as stor:
        for blob in stor.gets():
            print(blob.fullpath)
        blob = stor.put(['Hello, World!'])
        blob = stor.get()
        if blob and blob.lease(10):
            print(blob.get())
            blob.delete()

@c24t
Member

c24t commented Apr 26, 2019

The implementation looks good, but if you have the option to export to the agent without buffering and then exporting from the agent to some persistent sink (a file, a message queue, etc.), I think this would be better than building filesystem access into the exporters.

I don't know if this would solve your use cases though. It would solve (3), but not (2), although I'm not sure that this solves (2) either. I agree that we need a better general solution for (1), e.g. letting the user configure whether to drop messages when the backend is down, which messages to drop when we hit the memory limit, etc.

@songy23 can speak to persistence in the agent.

@bogdandrutu

Why not have a FileExporter, and run a daemon on that machine (oc-agent, for example) to read from these files and export traces/metrics?

@bogdandrutu

Why I think this is better:

  1. You don't need to implement the retrieval logic in all the languages;
  2. In systems like K8s, you may not be rescheduled on the same host after a restart;

@reyang
Contributor Author

reyang commented Apr 26, 2019

Why not have a FileExporter, and run a daemon on that machine (oc-agent, for example) to read from these files and export traces/metrics?

For Microsoft, there are client scenarios (e.g. command line tools) which don't allow an agent.

@bogdandrutu

I don't know what command-line tools you have there, but usually they run for a short period of time. I think having a small document where we try to capture all the requirements and possible implementations and analyze tradeoffs would be great for this feature.

I am not against having this capability but I would like to understand what design fits best our requirements.

@reyang
Contributor Author

reyang commented Apr 26, 2019

I don't know what command-line tools you have there, but usually they run for a short period of time. I think having a small document where we try to capture all the requirements and possible implementations and analyze tradeoffs would be great for this feature.

I've described the requirements in this PR. Let's use this PR for now; if the conversation grows too long to fit here, I'm okay with creating a separate document.

I am not against having this capability but I would like to understand what design fits best our requirements.

Yep, totally understand.

@reyang reyang closed this Apr 26, 2019
@reyang reyang reopened this Apr 26, 2019
@bogdandrutu

Couple of questions about the design:

  • Does written data have a TTL?
  • Do we have a size limit on the disk?
  • Do we want to support something like file rotations (similar to logs)?
  • Do we want to use the logging system for this?

@reyang
Contributor Author

reyang commented Apr 26, 2019

@bogdandrutu these are great questions!

Here goes the definition of the LocalFileStorage class:

class LocalFileStorage(object):
    def __init__(
        self,
        path,
        max_size=100*1024*1024,  # 100MB
        maintenance_period=60,  # 1 minute
        retention_period=7*24*60*60,  # 7 days
        write_timeout=60,  # 1 minute
    ):
  • TTL is controlled by retention_period.
  • Size limit is controlled by max_size (although I haven't implemented it yet).
  • Rotation is done by the blobs API (we have lease and a cleanup job).
  • For the logging system, I would prefer to write locally and have an out-of-proc agent (e.g. FluentBit) deliver it. This should be more efficient/reliable/performant. What the OpenCensus library would do is 1) integrate with the runtime logging, 2) enrich the log data with trace_id/span_id/tags/etc. There are corner cases where we don't have an agent, where we might need to provide some alternative option (but we shouldn't optimize for this corner scenario).
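To make the retention_period behavior concrete, here is a hypothetical maintenance pass (a sketch, not the actual exporter code) assuming each blob file name encodes its creation timestamp in milliseconds, e.g. `1556236800000.blob`:

```python
import os
import time

def drop_expired_blobs(path, retention_period):
    """Remove blob files older than retention_period seconds.

    Hypothetical sketch: assumes file names look like '<epoch_ms>.blob'.
    """
    now = time.time()
    dropped = []
    for name in os.listdir(path):
        if not name.endswith('.blob'):
            continue
        timestamp_ms = int(name.split('.')[0])
        if now - timestamp_ms / 1000.0 > retention_period:
            os.remove(os.path.join(path, name))
            dropped.append(name)
    return dropped
```

With the default retention_period of 7 days, a blob written 8 days ago would be removed on the next maintenance run, while recent blobs stay eligible for retry.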

@reyang
Contributor Author

reyang commented Apr 26, 2019

The implementation looks good, but if you have the option to export to the agent without buffering and then exporting from the agent to some persistent sink (a file, a message queue, etc.), I think this would be better than building filesystem access into the exporters.

Definitely. For Windows we have ETW. If we can have a cross-platform mechanism in OpenCensus, that'll be fantastic!

I don't know if this would solve your use cases though. It would solve (3), but not (2), although I'm not sure that this solves (2) either. I agree that we need a better general solution for (1), e.g. letting the user configure whether to drop messages when the backend is down, which messages to drop when we hit the memory limit, etc.

For (2), it is tricky, given OpenCensus is not designed to be fully transactional (e.g. when we end a span, we put it in the memory queue and wait for the exporter to pick it up later; this means the span data could get lost if the application crashes in the middle). Making it fully transactional is way too expensive for the major scenarios, and is probably not what we want to shoot for. (@c24t and I discussed this, and so far we only see an auditing scenario which might require it.)

In this PR, I want to provide a certain level of solution to (2), instead of a 100% guarantee.
For example, in case we get "Server Too Busy" from the agent/ingestion, there are three options:

  • Simply drop the data.
  • Put the data back in the queue, which could cause significant memory pressure. Also, the more we put in memory, the more data is lost if there is a crash/restart.
  • Persist the data locally, and retry later.
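The third option (persist locally, retry later) could be sketched roughly as follows. This is hypothetical illustration code: `send` stands in for an assumed transmit function returning True on success, and `storage` for any object with the put/get/delete interface shown earlier in the PR description.

```python
def export_with_fallback(batch, send, storage):
    """Try to send a batch; on failure, persist it for a later retry."""
    try:
        if send(batch):
            return 'sent'
    except Exception:
        pass  # e.g. connection reset, server too busy
    storage.put(batch)  # keep the data on disk instead of in memory
    return 'stored'

def retry_stored(send, storage):
    """Drain previously stored batches while the backend accepts them."""
    blob = storage.get()
    while blob is not None:
        if not send(blob.get()):
            break  # backend still unavailable; try again next period
        blob.delete()  # only delete after a confirmed send
        blob = storage.get()
```

The key point is that the memory queue stays bounded: on a transient failure the batch moves to disk, and a periodic job retries whatever survived a crash or restart.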

@SergeyKanzhelev
Member

@reyang thank you! Persistent storage will definitely help in specific scenarios and is a valuable addition to the Azure exporter. It's great that you are working on it in a generic fashion so it may be reused. On this path, you will probably hit the need to introduce the SpanData exporting feature census-instrumentation/opencensus-specs#255 that is currently implemented in C# and the OpenConsensus project. And then to solve the problem of potentially writing a span to disk again.

Long term we will need to post guidance on how persistence can be achieved via the combination of flush on exit, agent-based persistence, SDK-based persistence, and going fully transactional. Hopefully, the need for SDK-based persistence will be minimal, as it's the hardest to implement.

Some things to remember while implementing:

  • Have a comment on "path" noting that specifying a temp folder has potential privacy and security concerns, as telemetry will be shared with other apps and accounts deployed on the same host.
  • It may be a good idea to allow using environment variables or resource attributes as part of a file name or path. This way one misbehaving app cannot "attack" another by creating too many random files.
  • Some way of mutexing may be needed between processes while reading and removing files, to avoid double-upload issues.
  • Some way of limiting the oldest span start time should be in place, so very old spans are eventually dropped rather than written and re-sent over and over.

@reyang
Contributor Author

reyang commented Apr 27, 2019

  • Some way of mutexing may be needed between processes while reading and removing files, to avoid double-upload issues.

We're using file rename as the synchronization primitive among processes.
One clarification: we're trying to mitigate double upload, though strictly speaking this is unavoidable from the client side (e.g. if the client gets a connection reset, there is no way to tell whether the server accepted the data at that very moment). De-duplication will need help from the backend.

  • Some way of limiting the oldest span start time should be in place, so very old spans are eventually dropped rather than written and re-sent over and over.

We have retention_period in LocalFileStorage which is 7 days by default. Anything older than 7 days will be dropped.
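The rename-as-mutex idea could be sketched like this (a hypothetical helper, not the actual implementation; it relies on rename being atomic on POSIX filesystems and NTFS, and on the destination name encoding the lease expiry):

```python
import os
import time

def try_lease(path, period):
    """Attempt to lease a blob by atomically renaming it.

    Only one process can win the rename, so no OS-level lock is needed.
    Returns the leased path on success, or None if another process
    already leased (renamed) the blob.
    """
    expiry_ms = int((time.time() + period) * 1000)
    leased_path = '{}@{}.lock'.format(path, expiry_ms)
    try:
        os.rename(path, leased_path)  # atomic: exactly one caller succeeds
        return leased_path
    except OSError:
        return None  # file already renamed away by a competing process
```

A cleanup job can later scan for `.lock` files whose embedded expiry has passed and rename them back, so blobs from crashed processes are not lost.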

@reyang reyang changed the title [WIP] Introduce persistent storage Introduce persistent storage to Azure exporter Apr 29, 2019
Member

@c24t c24t left a comment


A few comments, but otherwise LGTM. We can revisit before moving this out of the azure exporter.

},
timeout=self.options.timeout,
)
except Exception as ex:
Member


You might want to catch RequestException specifically here to tell retryable from non-retryable errors.

Contributor Author


Added a TODO comment for now; will revisit when refactoring the exporter-specific context (e.g. instead of having the exporter manipulate the blacklist, having a dedicated context flag in the exporter logic, so that integrations like requests would not intercept such activities and cause an infinite loop).

:param args: The args passed in while calling `function`.
"""

def __init__(self, interval, function, args=None, kwargs=None):
Member


The signature change here means we lose access to the other Thread constructor args, but that's probably fine.

elapsed_time = time.time() - start_time
wait_time = max(self.interval - elapsed_time, 0)

def cancel(self):
Member


I called this stop instead of cancel originally because it's still possible to call it after the function has run, but the change makes sense if you want this to mimic Timer.

Contributor Author


It is hard to name things; here we take the easy approach and mimic Timer.

My personal thinking: stop sounds like a synchronous operation, which notifies the thread and waits for it to join (e.g. I would expect the thread to be stopped after stop returns).

opencensus/metrics/transport.py (outdated comment, resolved)
start_time = time.time()
self.function(*self.args, **self.kwargs)
elapsed_time = time.time() - start_time
wait_time = max(self.interval - elapsed_time, 0)
Member


This is better than the original.
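The quoted loop compensates for the function's own run time, so ticks stay aligned to the interval instead of drifting by the work done each cycle. A self-contained sketch of that pattern (hypothetical names; clock and sleep are injectable purely to make the timing testable):

```python
import time

def run_periodic(interval, function, ticks, clock=time.time, sleep=time.sleep):
    """Call `function` every `interval` seconds, for `ticks` iterations.

    The sleep is shortened by the time `function` itself took, clamped
    at zero so a slow call never produces a negative wait.
    """
    for _ in range(ticks):
        start_time = clock()
        function()
        elapsed_time = clock() - start_time
        wait_time = max(interval - elapsed_time, 0)
        sleep(wait_time)
```

If the function takes 0.3s out of a 1.0s interval, the loop sleeps only 0.7s, keeping the period close to 1.0s overall.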

else:
yield LocalFileBlob(path)

def get(self):
Member


Looks like this is unused, what do you need it for?

Contributor Author


It's currently not used except as a util for test cases. I was thinking of keeping it symmetrical to put?

@reyang reyang merged commit 992b223 into master Apr 29, 2019
@reyang
Contributor Author

reyang commented Apr 29, 2019

Discussion of data reliability and persistence will continue in #633.

@reyang reyang deleted the azure branch April 30, 2019 03:00