# Sampling Traces using OpenTelemetry

In this section, we continue exploring sampling options with OpenTelemetry, namely tail-based sampling.

## Tail Sampling

Tail sampling is where the decision to sample a trace takes place by considering all or most of the spans within the trace. Tail Sampling gives you the option to **sample your traces based on specific criteria derived from different parts of a trace**, which isn’t an option with Head Sampling.

Some examples of how you can use Tail Sampling include:

* Always sampling traces that contain an **error**
* Sampling traces based on overall **latency**
* Sampling traces based on the presence or value of specific attributes on one or more spans in a trace
    - for example, sampling more traces originating from a newly deployed service
* Applying different sampling rates to traces based on certain criteria, such as traces that come only from low-volume services versus traces from high-volume services.
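
To make the idea concrete, here is a minimal sketch of a tail-sampling decision in plain Python. It is not how the Collector implements it; the `Span` class, `keep_trace`, and `tail_sample` are hypothetical names used only for illustration. The point is that spans are grouped by trace ID first, and the keep/drop decision is made once per trace using information from all of its spans.

```python
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical, simplified span: just enough fields to make a decision.
    trace_id: str
    start_ms: int
    end_ms: int
    is_error: bool = False


def keep_trace(spans, latency_threshold_ms=3000):
    """Decide on a whole trace: keep it if any span errored or the trace was slow."""
    has_error = any(s.is_error for s in spans)
    duration_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    return has_error or duration_ms > latency_threshold_ms


def tail_sample(spans):
    # Group spans by trace ID first; a decision needs the full trace.
    traces = {}
    for span in spans:
        traces.setdefault(span.trace_id, []).append(span)
    return {trace_id: keep_trace(group) for trace_id, group in traces.items()}


spans = [
    Span("a", 0, 500), Span("a", 100, 400),                 # fast and healthy
    Span("b", 0, 3500), Span("b", 200, 900),                # slow
    Span("c", 0, 800), Span("c", 100, 300, is_error=True),  # contains an error
]
decisions = tail_sample(spans)
print(decisions)  # {'a': False, 'b': True, 'c': True}
```

A head sampler could not make the decision for trace `c`, because the error only appears in a child span that arrives after the root span has started.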

The downsides of tail sampling today are:

1. It can be difficult to implement. Depending on the sampling techniques available to you, it is not always a "set and forget" kind of thing.
    - As your systems change, so too will your sampling strategies. For a large and sophisticated distributed system, rules that implement sampling strategies can also be large and sophisticated.
2. It can be difficult to operate. The component(s) that implement tail sampling must be stateful systems that can **accept and store a large amount of data**.
    - Depending on traffic patterns, this can require a large number of compute nodes that all utilize resources differently.
    - Furthermore, a tail sampler might need to "fall back" to less computationally intensive sampling techniques if it is unable to keep up with the volume of data it is receiving.

Because of these factors, it is **critical to monitor tail-sampling components** to ensure that they have the resources they need to make the correct sampling decisions.

## Modify the OpenTelemetry Collector config

Here we will reconfigure our Collector to use the [Tail Sampling Processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor). This processor works *only* with traces.

1. Edit `config.yaml` and add `tail_sampling` to the `processors` section, after `probabilistic_sampler`:

    ```yaml
    processors:
      batch:
        timeout: 2s
      probabilistic_sampler:
        sampling_percentage: 15        
      tail_sampling:
        decision_wait: 10s
        num_traces: 100
        expected_new_traces_per_sec: 10
        decision_cache:
          sampled_cache_size: 100000
        policies: [
            {
                name: policy1-always_sample,
                type: always_sample
            },
            {
                name: policy2-latency_gt_3000,
                type: latency,
                latency: {threshold_ms: 3000}
            },
            {
                name: policy3-status_code_error,
                type: status_code,
                status_code: {status_codes: [ERROR]}
            },
        ]
    ```
    
2. Include the `tail_sampling` in the `service.pipelines.traces.processors` section:

    ```yaml
        traces:
          receivers: [otlp]
          processors: [tail_sampling, batch]
          exporters: [debug/basic, datadog/connector, datadog]
    ```

    <div class="alert alert-block alert-danger">Be sure to remove the <b>probabilistic_sampler</b> from the previous lab from the <b>service.pipelines.traces.processors</b> section.</div>

    > NOTE: We followed a common practice of *defining* multiple processor configurations (i.e., `probabilistic_sampler` *and* `tail_sampling`) while using only one of them (`tail_sampling`) in the `service.pipelines.traces.processors` definition. Only the **receivers**, **processors**, and **exporters** that are included in the `service.pipelines` section get used when the Collector is running.

3. Save the config and restart the Collector.
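
The buffer settings in the `tail_sampling` block from step 1 are related to each other. As a rough rule of thumb (an assumption for sizing purposes, not a statement about the processor's internals), the number of traces waiting for a decision at any moment is about the arrival rate of new traces multiplied by `decision_wait`, and `num_traces` should be at least that large. The helper name `traces_in_flight` below is hypothetical.

```python
def traces_in_flight(new_traces_per_sec, decision_wait_sec):
    """Rough number of traces buffered while they wait for a sampling decision."""
    return new_traces_per_sec * decision_wait_sec


# Values taken from the tail_sampling config in step 1.
decision_wait_sec = 10
num_traces = 100
expected_new_traces_per_sec = 10

needed = traces_in_flight(expected_new_traces_per_sec, decision_wait_sec)
fits = needed <= num_traces
print(f"in flight: {needed}, num_traces: {num_traces}, fits: {fits}")
```

With the lab's values the buffer is exactly large enough. If real traffic exceeds the expected rate, traces may be evicted before `decision_wait` has elapsed, which is the kind of condition the processor's `..._sampling_trace_dropped_too_early` metric is meant to surface.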

   

## Import OpenTelemetry Modules for Traces

In [None]:
from opentelemetry import baggage, trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    ConsoleSpanExporter,
    BatchSpanProcessor
)
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.trace import Status, StatusCode
import datetime, random, socket, time, uuid
from tqdm.notebook import tqdm

## Send Traces to the Collector

The code to produce traces has been modified to test out **Tail Sampling policies** in the following ways:

* To test **policy2-latency_gt_3000**, random processing delays have been introduced to simulate latency between service invocations.
    - This should result in some traces that exceed the maximum allowable latency defined in the policy causing the entire trace to be sampled and sent to Datadog.
* To test **policy3-status_code_error**, the random errors that are produced in the **Payment Service** will be used to evaluate spans that return ERROR codes.
    - When an error occurs in a span, the entire trace should be sampled and sent to Datadog.
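
Before running the cell, it can help to estimate how often each policy should match. The sketch below mirrors the delays in the trace-producing code without actually sleeping: two random sleeps in the frontend, three in checkout, one in shipping, and one in payment that is skipped when the simulated error (20% chance) occurs. It assumes each sleep is uniform between 0 and 1 second and ignores any other overhead, so treat the numbers as approximate.

```python
import random


def simulate_trace(rng):
    """Return (latency_ms, has_error) for one simulated transaction, without sleeping."""
    has_error = rng.random() < 0.2
    # Sleeps: 2 in frontend, 3 in checkout, 1 in shipping, and 1 in payment
    # that only happens when no error is raised.
    sleeps = 6 if has_error else 7
    latency_ms = sum(rng.random() for _ in range(sleeps)) * 1000
    return latency_ms, has_error


rng = random.Random(42)  # fixed seed so repeated runs give the same estimate
runs = [simulate_trace(rng) for _ in range(100_000)]
slow = sum(1 for latency, _ in runs if latency > 3000) / len(runs)
errors = sum(1 for _, err in runs if err) / len(runs)
print(f"slow (> 3000 ms): {slow:.0%}, errors: {errors:.0%}")
```

Under these assumptions roughly 70% of traces exceed the 3000 ms threshold and about 20% contain an error, so a run of only 10 transactions can easily deviate from those proportions.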

In [None]:
def getTracer(service_name):
    # Each simulated service gets its own TracerProvider so that its spans carry
    # a distinct service.name resource attribute.
    provider = TracerProvider(resource=Resource.create({
        "service.name": service_name,
        "service.instance.id": str(uuid.uuid4()),
        "deployment.environment": "otel-adventure",
        "host.name": socket.gethostname(),
    }))
    # Export spans in batches to the local Collector over OTLP/gRPC.
    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True)))
    return trace.get_tracer("python", tracer_provider=provider)
    
def frontend():
    frontend_tracer = getTracer("frontend")
    with frontend_tracer.start_as_current_span("frontend") as frontend_span:
        print("Processing web transaction...")
        start = time.time()
        time.sleep(random.random())
        handle_checkout()
        time.sleep(random.random())
    
        frontend_span.set_status(Status(StatusCode.OK))
        elapsed = int((time.time() - start) * 1000)
        print(f"Transaction complete. {elapsed} msec.")

def handle_checkout():
    checkout_tracer = getTracer("checkout")
    with checkout_tracer.start_as_current_span("checkout") as checkout_span:
        # print("Handling checkout...")
        checkout_span.set_attribute("order_num", int(datetime.datetime.timestamp(datetime.datetime.now())*1000) % 100000)
        time.sleep(random.random())
        handle_payment()
        time.sleep(random.random())
        handle_shipping()
        time.sleep(random.random())
    
        checkout_span.set_status(Status(StatusCode.OK))
        # print("Checkout complete.")
        
def handle_payment():
    payment_tracer = getTracer("payment")
    with payment_tracer.start_as_current_span("payment") as payment_span:
        # print("Handling payment...")
        payment_span.set_attribute("payment_id", str(uuid.uuid4()))
        if random.random() < 0.2:  # simulate a payment failure about 20% of the time
            payment_span.set_status(Status(StatusCode.ERROR, "Failed to process credit card payment."))
            print(f"Simulated error, payment service, trace id: {trace.format_trace_id(payment_span.get_span_context().trace_id)}")
        else:
            time.sleep(random.random())
            payment_span.set_status(Status(StatusCode.OK))
        # print("Payment complete.")
    
def handle_shipping():
    shipping_tracer = getTracer("shipping")
    with shipping_tracer.start_as_current_span("shipping") as shipping_span:
        # print("Handling shipping...")
        shipping_span.set_attribute("tracking_num", str(uuid.uuid4()))
        time.sleep(random.random())
        # print("Shipping complete.")

for n in tqdm(range(10)):
    frontend()

## Verify Results

<div class="alert alert-block alert-warning"><b>DID IT WORK???</b></div>

### How can we verify results?

For this example, consider the following output from the previous section:

```
Processing web transaction...
Transaction complete. 4327.771902084351 msec.
Processing web transaction...
Transaction complete. 2657.130002975464 msec.
Processing web transaction...
Transaction complete. 3325.37579536438 msec.
Processing web transaction...
Simulated error, payment service, trace id: c2a0172b8f9f31ab73dc03323fdee573
Transaction complete. 3863.502264022827 msec.
Processing web transaction...
Transaction complete. 4370.552062988281 msec.
Processing web transaction...
Transaction complete. 3984.522819519043 msec.
Processing web transaction...
Transaction complete. 4206.815958023071 msec.
Processing web transaction...
Transaction complete. 4184.7779750823975 msec.
Processing web transaction...
Transaction complete. 4402.576923370361 msec.
Processing web transaction...
Transaction complete. 2019.6621417999268 msec.
```

#### Take note of the following:

1. There were 10 total executions, resulting in 10 total traces produced,
2. 8 out of the 10 had execution times > 3000 milliseconds,
3. 1 out of the 10 had an ERROR code.
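
These counts can be derived from the printed output rather than by eye. The following sketch parses the sample output shown above (only the lines that carry information are included) and counts total transactions, those slower than 3000 ms, and simulated errors.

```python
import re

# Sample output from the run above; "Processing web transaction..." lines omitted.
output = """\
Transaction complete. 4327.771902084351 msec.
Transaction complete. 2657.130002975464 msec.
Transaction complete. 3325.37579536438 msec.
Simulated error, payment service, trace id: c2a0172b8f9f31ab73dc03323fdee573
Transaction complete. 3863.502264022827 msec.
Transaction complete. 4370.552062988281 msec.
Transaction complete. 3984.522819519043 msec.
Transaction complete. 4206.815958023071 msec.
Transaction complete. 4184.7779750823975 msec.
Transaction complete. 4402.576923370361 msec.
Transaction complete. 2019.6621417999268 msec.
"""

# Extract every reported latency and count the error messages.
latencies = [float(ms) for ms in re.findall(r"Transaction complete\. ([\d.]+) msec", output)]
slow = [ms for ms in latencies if ms > 3000]
errors = output.count("Simulated error")

print(f"total: {len(latencies)}, slow: {len(slow)}, errors: {errors}")
# total: 10, slow: 8, errors: 1
```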

### tail_sampling metrics

Open the [documentation](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/documentation.md) for the `tail_sampling` processor.

There are numerous metrics emitted by this processor:

|Metric|Description|
|---|---|
|`otelcol_processor_tail_sampling_count_spans_sampled`|Count of spans that were sampled or not per sampling policy|
|`otelcol_processor_tail_sampling_count_traces_sampled`|Count of traces that were sampled or not per sampling policy|
|`otelcol_processor_tail_sampling_early_releases_from_cache_decision`|Number of spans that were able to be immediately released due to a decision cache hit.|
|`otelcol_processor_tail_sampling_global_count_traces_sampled`|Global count of traces that were sampled or not by at least one policy|
|`otelcol_processor_tail_sampling_new_trace_id_received`|Counts the arrival of new traces|
|`otelcol_processor_tail_sampling_sampling_decision_latency`|Latency (in microseconds) of a given sampling policy|
|`otelcol_processor_tail_sampling_sampling_decision_timer_latency`|Latency (in microseconds) of each run of the sampling decision timer|
|`otelcol_processor_tail_sampling_sampling_late_span_age`|Time (in seconds) from the sampling decision was taken and the arrival of a late span|
|`otelcol_processor_tail_sampling_sampling_policy_evaluation_error`|Count of sampling policy evaluation errors|
|`otelcol_processor_tail_sampling_sampling_trace_dropped_too_early`|Count of traces that needed to be dropped before the configured wait time|
|`otelcol_processor_tail_sampling_sampling_trace_removal_age`|Time (in seconds) from arrival of a new trace until its removal from memory|
|`otelcol_processor_tail_sampling_sampling_traces_on_memory`|Tracks the number of traces current on memory|


### Review the Collector metrics

1. Either open the Collector's metrics at [http://localhost:8888/metrics](http://localhost:8888/metrics) or execute this shortcut:

In [None]:
!curl -s http://localhost:8888/metrics | grep otelcol_processor_tail_sampling_count_traces_sampled

2. Search for the metric name: `otelcol_processor_tail_sampling_count_traces_sampled`. There should be five instances of the same metric:

    * one for **policy1** where `sampled="true"`,
    * two for **policy2**: one where `sampled="true"` and the other where `sampled="false"`,
    * two for **policy3**: one where `sampled="true"` and the other where `sampled="false"`.

    ```
    # HELP otelcol_processor_tail_sampling_count_traces_sampled Count of traces that were sampled or not per sampling policy
    # TYPE otelcol_processor_tail_sampling_count_traces_sampled counter
    otelcol_processor_tail_sampling_count_traces_sampled{policy="policy1-always_sample",sampled="true",service_instance_id="040a2a22-7be8-4cb8-8fba-9aac40ba86dc",service_name="otelcol-contrib",service_version="0.112.0"} 10
    otelcol_processor_tail_sampling_count_traces_sampled{policy="policy2-latency_gt_3000",sampled="false",service_instance_id="040a2a22-7be8-4cb8-8fba-9aac40ba86dc",service_name="otelcol-contrib",service_version="0.112.0"} 2
    otelcol_processor_tail_sampling_count_traces_sampled{policy="policy2-latency_gt_3000",sampled="true",service_instance_id="040a2a22-7be8-4cb8-8fba-9aac40ba86dc",service_name="otelcol-contrib",service_version="0.112.0"} 8
    otelcol_processor_tail_sampling_count_traces_sampled{policy="policy3-status_code_error",sampled="false",service_instance_id="040a2a22-7be8-4cb8-8fba-9aac40ba86dc",service_name="otelcol-contrib",service_version="0.112.0"} 9
    otelcol_processor_tail_sampling_count_traces_sampled{policy="policy3-status_code_error",sampled="true",service_instance_id="040a2a22-7be8-4cb8-8fba-9aac40ba86dc",service_name="otelcol-contrib",service_version="0.112.0"} 1
    ```

3. We can verify our sampling policies are working by looking at the metrics for each policy.

    - Specifically, **policy1-always_sample** (which samples **everything**) has 10 traces:

        ```
        otelcol_processor_tail_sampling_count_traces_sampled{policy="policy1-always_sample",sampled="true",service_instance_id="040a2a22-7be8-4cb8-8fba-9aac40ba86dc",service_name="otelcol-contrib",service_version="0.112.0"} 10
        ```

    - **policy2-latency_gt_3000** should show 8 sampled traces (those that took > 3000 ms to execute) and 2 unsampled traces:


        ```
        otelcol_processor_tail_sampling_count_traces_sampled{policy="policy2-latency_gt_3000",sampled="false",service_instance_id="040a2a22-7be8-4cb8-8fba-9aac40ba86dc",service_name="otelcol-contrib",service_version="0.112.0"} 2
        otelcol_processor_tail_sampling_count_traces_sampled{policy="policy2-latency_gt_3000",sampled="true",service_instance_id="040a2a22-7be8-4cb8-8fba-9aac40ba86dc",service_name="otelcol-contrib",service_version="0.112.0"} 8
        ```

    - Lastly, the **policy3-status_code_error** had 1 sampled trace (that had an error status) and 9 unsampled traces:

        ```
        otelcol_processor_tail_sampling_count_traces_sampled{policy="policy3-status_code_error",sampled="false",service_instance_id="040a2a22-7be8-4cb8-8fba-9aac40ba86dc",service_name="otelcol-contrib",service_version="0.112.0"} 9
        otelcol_processor_tail_sampling_count_traces_sampled{policy="policy3-status_code_error",sampled="true",service_instance_id="040a2a22-7be8-4cb8-8fba-9aac40ba86dc",service_name="otelcol-contrib",service_version="0.112.0"} 1
        ```
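
The per-policy check above can also be scripted. This is a minimal sketch that parses the Prometheus text format with a regular expression; the function name `per_policy_counts` is hypothetical, and the label sets in `metrics_text` are shortened from the real output for readability. In the notebook you could instead feed it the text returned by the `curl` command.

```python
import re
from collections import defaultdict

# Shortened copy of the metric lines shown above (service_* labels removed).
metrics_text = '''\
# TYPE otelcol_processor_tail_sampling_count_traces_sampled counter
otelcol_processor_tail_sampling_count_traces_sampled{policy="policy1-always_sample",sampled="true"} 10
otelcol_processor_tail_sampling_count_traces_sampled{policy="policy2-latency_gt_3000",sampled="false"} 2
otelcol_processor_tail_sampling_count_traces_sampled{policy="policy2-latency_gt_3000",sampled="true"} 8
otelcol_processor_tail_sampling_count_traces_sampled{policy="policy3-status_code_error",sampled="false"} 9
otelcol_processor_tail_sampling_count_traces_sampled{policy="policy3-status_code_error",sampled="true"} 1
'''

LINE = re.compile(
    r'^otelcol_processor_tail_sampling_count_traces_sampled'
    r'\{(?P<labels>[^}]*)\}\s+(?P<value>\d+(?:\.\d+)?)$'
)


def per_policy_counts(text):
    """Return {policy: {"true": n, "false": m}} from Prometheus text output."""
    counts = defaultdict(dict)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skips HELP/TYPE comments and unrelated metrics
        labels = dict(re.findall(r'(\w+)="([^"]*)"', match["labels"]))
        counts[labels["policy"]][labels["sampled"]] = int(float(match["value"]))
    return dict(counts)


counts = per_policy_counts(metrics_text)
for policy, by_decision in counts.items():
    print(policy, by_decision, "total:", sum(by_decision.values()))
```

Each policy evaluates every trace, so the `true` and `false` counts for a policy should add up to the number of traces produced (10 in this run).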
   

## What's wrong with what we've done?

By sampling spans as they pass through the Collector, we've impacted our ability to accurately calculate APM statistics (request counts, error counts, and latency measures). These calculations are performed by the `datadog/connector` Connector, which only sees what is exported from the `traces` pipeline, that is, the spans that remain after `tail_sampling` has run. Recall the Collector config:

```yaml
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [debug/basic, datadog/connector, datadog]
```      

As such, our APM Metrics are now inaccurate.
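
A small worked example shows the kind of skew involved. It reuses the ten sample transactions from the verification step (latencies rounded down to whole milliseconds) and assumes, purely for illustration, a configuration in which only the latency and error policies decide what is exported, that is, without `policy1-always_sample`, which keeps every trace.

```python
# (latency_ms, has_error) for the ten sample transactions shown earlier.
traces = [
    (4327, False), (2657, False), (3325, False), (3863, True), (4370, False),
    (3984, False), (4206, False), (4184, False), (4402, False), (2019, False),
]


def stats(sample):
    """Return (request_count, error_rate) for a list of (latency_ms, has_error)."""
    requests = len(sample)
    error_rate = sum(1 for _, err in sample if err) / requests
    return requests, error_rate


# Assumption: a trace is exported only if it errored or took longer than 3000 ms.
exported = [(ms, err) for ms, err in traces if err or ms > 3000]

all_requests, all_error_rate = stats(traces)
exp_requests, exp_error_rate = stats(exported)
print(f"all traces:      {all_requests} requests, {all_error_rate:.1%} errors")
print(f"exported traces: {exp_requests} requests, {exp_error_rate:.1%} errors")
```

Statistics computed from the exported traces undercount requests (8 instead of 10) and overstate the error rate (12.5% instead of 10.0%), because the fast, healthy traces were dropped before the statistics were calculated.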

### Solution

To solve this, we need **two** `traces` pipelines. Create a new `traces` pipeline as follows:

```yaml
    traces/alltraces:
      receivers:
        - otlp
      processors:
        - transform/datadog_metadata
        - batch
      exporters:
        - datadog/connector
```

This pipeline will be used to receive traces from any and all receivers, hence the name `traces/alltraces`.

Next, we'll modify the original `traces` pipeline to receive **only** from the `datadog/connector` instance:

```yaml
        traces:
          receivers: [datadog/connector]
          processors: [tail_sampling, batch]
          exporters: [debug/basic, datadog]
```

Notice that we've also removed the `datadog/connector` reference from the `exporters`.

The resulting `service.pipelines` section should look like this now:

```yaml
  pipelines:
    metrics:
      receivers: [otlp, datadog/connector]
      processors: [batch]
      exporters: [debug/basic, datadog]

    traces/alltraces:
      receivers:
        - otlp
      processors:
        - batch
      exporters:
        - datadog/connector
        
    traces:
      receivers: [datadog/connector]
      processors: [tail_sampling, batch]
      exporters: [debug/basic, datadog]

    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug/basic, datadog]
```

Restart the Collector for the changes to take effect.
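
As a sanity check on the wiring, the pipelines can be modeled as plain data. This sketch is not a Collector API; `connector_wiring` is a hypothetical helper, and the dictionary follows the layout described in the text, with the `traces` pipeline receiving from `datadog/connector`. It confirms that the connector is fed by an unsampled pipeline and consumed by the metrics and sampled-trace pipelines.

```python
# Simplified model of service.pipelines as described in the text above.
pipelines = {
    "metrics": {
        "receivers": ["otlp", "datadog/connector"],
        "processors": ["batch"],
        "exporters": ["debug/basic", "datadog"],
    },
    "traces/alltraces": {
        "receivers": ["otlp"],
        "processors": ["batch"],
        "exporters": ["datadog/connector"],
    },
    "traces": {
        "receivers": ["datadog/connector"],
        "processors": ["tail_sampling", "batch"],
        "exporters": ["debug/basic", "datadog"],
    },
}


def connector_wiring(pipelines, connector):
    """Return (pipelines that export to the connector, pipelines that receive from it)."""
    feeds = [name for name, p in pipelines.items() if connector in p["exporters"]]
    consumes = [name for name, p in pipelines.items() if connector in p["receivers"]]
    return feeds, consumes


feeds, consumes = connector_wiring(pipelines, "datadog/connector")
# The pipelines feeding the connector must not sample, or the stats are skewed.
unsampled = all("tail_sampling" not in pipelines[name]["processors"] for name in feeds)
print(f"feeds: {feeds}, consumes: {consumes}, unsampled input: {unsampled}")
```

A connector has to appear as an exporter in at least one pipeline and as a receiver in at least one other, which is why both lists should be non-empty.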

## Open Service Catalog

View the results by opening [https://app.datadoghq.com/services?env=otel-adventure](https://app.datadoghq.com/services?env=otel-adventure). Previously, the **REQUESTS**, **ERROR RATE**, and **P95 LATENCY** values were calculated only for sampled (i.e., egressed) spans, which is **not** an accurate accounting of the spans hitting the Collector. With this new configuration in place, these metrics are calculated for all spans sent to the Collector.

![image.png](imgs/service-catalog.png)


#### End of Section