#### Note:
This is a Jupyter Notebook, which means all code sections are actually executed live. Please select "Cell -> Run All" before reading for the first time to generate the plots/widgets.

#### Note:
Some raw code for this IPython notebook is hidden by default for easier reading. If you want to see it, clicking on the toggle buttons will allow you to read and edit it.

In [1]:
from IPython.display import HTML

# function to create those toggle buttons, adapted from something from stackoverflow
def add_toggle_button(desc, *inputs):
    x = ', '.join([str(x) for x in inputs])
    n = str(inputs[0])
    code = '''
        <script>
        var others%n = [%x];
        var code_shown%n = true; 
        function code_toggle%n () {
            var selector = "div.input";
            var inputs = $(selector).toArray();
            if (code_shown%n) {
                for (var i in others%n) {
                    var x = others%n[i];
                    $(inputs[x]).hide();
                }
            }
            else {
                for (var i in others%n) {
                    var x = others%n[i];
                    $(inputs[x]).show();
                }
            }

            code_shown%n = !code_shown%n;
        } 
        $( document ).ready(code_toggle%n);
        </script>
        <form action="javascript:code_toggle%n()">
        <input type="submit" value="Toggle on/off the display of the %d code.">
        </form>'''
    
    return code.replace('%d', desc).replace('%n', str(n)).replace('%x', x)

HTML(add_toggle_button('document setup', 0, 1, 2))

In [2]:
%%html
<style>
/* prevent truncation of the slider labels */
.widget-label, .widget-button { width: unset !important; }
</style>

In [3]:
import numpy
from bqplot import LinearScale, Axis, Lines, Figure, Hist, Bars, Scatter
from ipywidgets import HBox, VBox, SelectionSlider, Button

## Investigation into Adaptive Sampling

This notebook contains some simulations we created to investigate the performance of the Better CAT v1 adaptive sampling algorithm. 

To start with, we gathered some data from a random customer on the number of transaction events they generated over a random 24-hour period. (Well, the customer wasn't selected truly at random; it was just the APM link from the last support ticket I had worked on.)

In [4]:
# APM data for the Python/RequestSampler/requests metric,
# which describes the number of transaction events seen in
# a given harvest. It is lumped into per-hour counts when
# inspected in APM. So te_per_hour[0] is from midnight to
# 1:00am, te_per_hour[1] is 1:00am to 2:00am, etc.
te_per_hour = [
    294, 254, 325, 516, 886, 3022,
    10109, 16507, 22262, 24752, 26362, 25034,
    27172, 27208, 26751, 20724, 14936, 10160,
    5808, 3053, 1397, 571, 152, 351
]

initial_harvester = 'Per-Spec'

# baseline samples-per-harvest, which is set to "10" in the agent spec
baseline_sph = 10 

# the number of harvests per hour (60 means one harvest per minute)
initial_hpm = 60

initial_seed = 1234567
red_height = 500.0

Next, we'll implement a Mock Agent to run our simulation, which will simply be to cram our customer data through a loop that calls adaptively_sample() on each transaction event.

Unfortunately, we don't know when each transaction event actually occurred, so we'll have to use inverse transform sampling (ITS) to "generate" the finer-grained data that we want. To do that, we need a model for how transaction events are distributed within a given period of time. Although the number of transactions per hour varies greatly throughout the day (low at night, high during the day), we can assume that the arrival rate is roughly constant *within* a given hour. Thus, we can model arrivals with the [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution), which means the gaps between events can be drawn from the exponential distribution.
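As a minimal sketch of what inverse transform sampling means for the exponential distribution: draw a uniform value `u` and push it through the inverse of the CDF `F(x) = 1 - exp(-x / mean)`. The helper name `exponential_its` and the parameter values are hypothetical and only for illustration; the notebook itself calls NumPy's built-in `exponential` sampler instead.

```python
import numpy as np

def exponential_its(mean, size, rng):
    """Draw exponential variates by inverting the CDF F(x) = 1 - exp(-x / mean)."""
    u = rng.uniform(0.0, 1.0, size)   # uniform draws in [0, 1)
    return -mean * np.log1p(-u)       # F^-1(u) = -mean * ln(1 - u)

rng = np.random.RandomState(1234567)
gaps = exponential_its(mean=2.0, size=100000, rng=rng)

# the sample mean should land close to the requested mean of 2.0
print(gaps.mean())
```

Because `u` never reaches 1, the logarithm stays finite and every generated gap is non-negative.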

After our data is set up, the rest of our mock harvester is relatively simple: just loop through each harvest and call adaptively_sample() on each transaction event. We'll also add some extra instrumentation later so that we can toggle our plot with sliders.

In [5]:
class Harvester(object):
    def __init__(self, seed, target_samples_per_harvest, harvests_per_hour):
        # the delta-timestep for each next sample after the previous one
        self.timesteps_per_sample = []
        
        # how many total transaction events that occur per harvest
        self.te_per_harvest = []
        
        # an array that will contain the number of sampled=True for each harvest in the day
        self.num_sampled = []
        
        self.xdata = []
        
        # downsampled version of the data, for faster plot rendering
        self.xdata_ds = []
        self.sph_ds = []

        self.seed = seed
        self.target_samples_per_harvest = target_samples_per_harvest
        self.harvests_per_hour = harvests_per_hour
        
        self.current_timestamp = 0
        self.rand = numpy.random.RandomState(self.seed)
        
    # prep_data performs a 2-layered inverse-transform sampling on the data set
    def prep_data(self, te_per_hour):
        for h in range(0, len(te_per_hour)):
            
            # first layer: assume that within a given hour the number of samples per harvest
            #              will be distributed according to a Poisson distribution. Therefore,
            #              we can inverse-sample them via the exponential distribution.
            sh = te_per_hour[h]
            raw_sph = self.rand.exponential(sh/self.harvests_per_hour, self.harvests_per_hour)
            sph = numpy.rint(raw_sph*(sh/sum(raw_sph)))
            self.te_per_harvest = numpy.append(self.te_per_harvest, sph)

            timesteps = []
            for v in range(0, self.harvests_per_hour):
                # format x as "HOUR.FRACTIONOFHOUR"
                # x is gonna be a point for our plot of samples over time, so
                # we don't need it to have a precise value. 
                xpoint = h + float(v)/float(self.harvests_per_hour)
                self.xdata.append(xpoint)
                
                shh = sph[v]
                if shh == 0:
                    self.timesteps_per_sample.append([])
                else:
                    # second layer: assume that the events within a harvest *also* follow
                    #               a Poisson distribution, meaning that for a given N, we
                    #               can sample from the exponential distribution again to
                    #               find an estimated timestamp for every sample for the
                    #               entire day.
                    #               
                    raw_sps = self.rand.exponential(self.harvests_per_hour/shh, int(shh))
                    sps = raw_sps*(sum(raw_sps)/shh)
                    self.timesteps_per_sample.append(sps)

        # finally, downsample our data to just 240 points (10 per hour) for the plot, so that
        # our sliders won't be really clunky
        ds_factor = int(len(self.te_per_harvest)/240.0)
        for i in range(0, 240):
            self.xdata_ds.append(self.xdata[ds_factor*i])
            self.sph_ds.append(self.te_per_harvest[ds_factor*i])

    def simulate(self):
        for i in range(len(self.te_per_harvest)):
            self.harvest(i)

    # perform a single harvest, counting the number of transaction events
    # that are assigned sampled=True
    def harvest(self, i):
        samples = self.timesteps_per_sample[i]

        last_harvest_count = -1
        if i > 0:
            last_harvest_count = int(self.te_per_harvest[i-1])

        sampled_count = 0
        for timestep in samples:
            self.current_timestamp += timestep

            # note: the spec states that we should stop sampling once we hit
            #       2*self.target_samples_per_harvest
            if sampled_count < 2*self.target_samples_per_harvest:
                sampled_count += int(self.adaptively_sample(self.current_timestamp, last_harvest_count))

        self.num_sampled.append(sampled_count)

    # API: returns "True" if we should set Sampled=True, "False" otherwise
    def adaptively_sample(self, timestamp, last_harvest_count):
        raise NotImplementedError('IMPLEMENT THIS, YA NUMBSKULL!')
        
#HTML(add_toggle_button('simulation', 3, 8))

Next, we'll create a couple of subclasses of our Harvester to implement various adaptive sampling algorithms. We'll do a base case that just randomly assigns Sampled=True, and then another that implements the adaptive sampling algorithm in the spec.

In [6]:
class RandomHarvester(Harvester):
    def adaptively_sample(self, timestamp, last_harvest_count):
        if last_harvest_count == -1:
            return False
        
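        # uniform() with no arguments draws from [0, 1); uniform(1) would set
        # low=1 with high=1.0 and therefore always return 1.0
        if self.rand.uniform() > 0.5: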
            return True
        
        return False

In [7]:
class SpecHarvester(Harvester):
    def adaptively_sample(self, timestamp, last_harvest_count):
        if last_harvest_count == -1:
            return False

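        # note: uniform(x) sets low=x and leaves high at its default of 1.0, so
        #       this draws a value between 1 and last_harvest_count
        if self.rand.uniform(last_harvest_count) < self.target_samples_per_harvest: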
            return True

        return False

In [20]:
harvester_classes = {
    'Per-Spec': SpecHarvester,
    'Random': RandomHarvester
}

# some orchestration around the harvester to allow us to try out various parameters, and then
# cache the results so that the sliders won't be slow due to recalculating the plot every damn time
current_harvester = None
tas_cache = {}
def test_adaptive_sampling(class_name, seed, target_samples_per_harvest, harvests_per_hour):
    global current_harvester
    global tas_cache
    
    cls = harvester_classes[class_name]
    tag = "%s/%s/%s" % (cls.__name__, seed, harvests_per_hour)
    
    if tag in tas_cache:
        harvester = tas_cache[tag]
    else:
        harvester = cls(seed, target_samples_per_harvest, harvests_per_hour)
        harvester.prep_data(te_per_hour)
        harvester.simulate()
        tas_cache[tag] = harvester
    
    current_harvester = harvester
    return harvester


def cond_sum(harvester, ns):
    a = numpy.array(harvester.num_sampled)
    return numpy.sum([numpy.sum(a==n) for n in ns])

def count_output(harvester):
    total_sum = numpy.sum(harvester.num_sampled)
    diff_from_middle = harvester.target_samples_per_harvest//2
    middle_percent = 100.0*cond_sum(harvester, range(diff_from_middle, 3*diff_from_middle))/total_sum
    print("For harvests_per_hour=%d, target_samples_per_hour=%d, and random seed=%d, the percentage of samples in the middle / outside the middle: %5.2f%% / %5.2f%%.\n" % (
        int(harvester.harvests_per_hour), int(harvester.target_samples_per_harvest),
        harvester.seed, middle_percent, 100.0-middle_percent)
     )


Finally, we'll add some sliders and plot the result. On the left is a view of what our ITS data looks like (although note that it is downsampled somewhat so that it plots faster). On the right is a histogram of the number of transaction events per harvest that were selected to have Sampled=True over the course of the 24-hour period.

Note: you can adjust the sliders to try out different parameters.

In [21]:
initial_data = test_adaptive_sampling(initial_harvester, initial_seed, baseline_sph, initial_hpm)

# first set up our scales and axes
bx_sc = LinearScale()
by_sc = LinearScale()
bax_x = Axis(label='x', scale=bx_sc, grid_lines='solid')
bax_y = Axis(label='y', scale=by_sc, orientation='vertical', side='left', grid_lines='solid')

hx_sc = LinearScale()
hy_sc = LinearScale()
hax_x = Axis(label='x', scale=hx_sc, grid_lines='solid')
hax_y = Axis(label='y', scale=hy_sc, orientation='vertical', side='left', grid_lines='solid')

# cache-busting button
cache_button = Button(
    description='Click to clear the plotting cache.',
    disabled=False,
    button_style='',
    tooltip='Click to clear the plot-cache.'
)

# input sliders
ss = SelectionSlider(
    options=['Random', 'Per-Spec'],
    value='Per-Spec',
    description='Sampling Algorithm',
)

rss = SelectionSlider(
    options=[1234567, 23523465, 57433462],
    value=initial_seed,
    description='Random Seed',
)

hphs = SelectionSlider(
    options=[30, 60, 120, 240],
    value=initial_hpm,
    description='Harvests Per Hour',
)

# input scatter plot
te_scatter = Scatter(
    x=initial_data.xdata_ds,
    y=initial_data.sph_ds,
    scales={'x': bx_sc, 'y': by_sc},
    visible=True)

# histogram for the count of the number of sampled=True payloads
samples_hist = Hist(
    sample=initial_data.num_sampled,
    scales={'sample': hx_sc, 'count': hy_sc},
    bins=20)

# red bar in the middle of the samples plot
red_line = Lines(
    x=[baseline_sph-0.5, baseline_sph-0.5, baseline_sph+0.5, baseline_sph+0.5],
    y=[0, red_height, red_height, 0],
    colors=['red'],
    fill='inside',
    scales={'x': hx_sc, 'y': hy_sc})

# hook up the cache-buster
def bust_cache(button):
    # clear in place so the module-level cache used by test_adaptive_sampling is emptied
    tas_cache.clear()
cache_button.on_click(bust_cache)

#HTML(add_toggle_button('plotting', 8, 9, 10))

In [22]:
def update_plots(slider):
    class_name = ss.value
    random_seed = rss.value
    harvests_per_hour = hphs.value
    
    factor = 60.0/harvests_per_hour
    target_samples_per_harvest = int(baseline_sph*factor)
    
    harvester = test_adaptive_sampling(
        class_name, random_seed,
        target_samples_per_harvest, harvests_per_hour
    )
    
    te_scatter.x = harvester.xdata_ds
    te_scatter.y = harvester.sph_ds
    
    hax_x.tick_values = range(0, 2*target_samples_per_harvest+1)
    samples_hist.bins = 2*target_samples_per_harvest
    samples_hist.sample = harvester.num_sampled
    
    l = target_samples_per_harvest - 0.5
    h = target_samples_per_harvest + 0.5
    red_line.x = [l, l, h, h]
    red_line.y = [0, red_height/factor, red_height/factor, 0]
    count_output(harvester)
    
ss.observe(update_plots, names='value')
rss.observe(update_plots, names='value')
hphs.observe(update_plots, names='value')

In [26]:
fig_bar = Figure(marks=[te_scatter], axes=[bax_x, bax_y], title='Transaction Events Per Harvest')
fig_hist = Figure(marks=[samples_hist, red_line], axes=[hax_x, hax_y], title='Super-Samples per Harvest')

display(VBox([HBox([cache_button, ss]), HBox([hphs, rss]), HBox([fig_bar, fig_hist])]))
count_output(initial_data)

[Interactive widget output: cache button and sliders above the 'Transaction Events Per Harvest' scatter plot and the 'Super-Samples per Harvest' histogram]

For harvests_per_hour=60, target_samples_per_hour=10, and random seed=1234567, the percentage of samples in the middle / outside the middle:  3.31% / 96.69%.



My interpretation of this result is that if the purpose of the adaptive sampling algorithm is to ensure that we get approximately a certain number of sampled=True payloads per harvest, then it doesn't really work as expected. For example, when harvesting once a minute (i.e. 60 times an hour) and targeting 10 samples per harvest, the Spec Sampler assigns between 5 and 15 samples per harvest only 3.31% of the time! The rest of the time, it will be near the min or the max: 0 or 20.

This is because our rate of throughput is not constant. If our throughput were constant for the entire 24-hour period, then we would expect roughly half of the harvests to end up with between 5 and 15 sampled=True payloads. However, a more realistic workload varies across time. Thus, if a given harvest has a low throughput, and the next harvest has a much higher one, then basing our sample rate on the previous harvest's throughput means that it is likely that we will oversample it. Likewise, if a given harvest has a high throughput, and the next one ends up with a low throughput, then it is likely that we will undersample it.
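To make the effect concrete, here is a rough back-of-the-envelope sketch. It assumes each event in the current harvest is kept with probability of about `target / prev_count` and that sampling stops at `2 * target`; the helper name `estimated_sampled` and the example counts are hypothetical, and the estimate simply clips the uncapped expectation at the cap rather than computing the exact capped expectation.

```python
def estimated_sampled(prev_count, curr_count, target=10):
    """Rough estimate of sampled=True events in the current harvest.

    Assumes each event is kept with probability min(1, target / prev_count)
    and that sampling stops once 2 * target events have been kept.
    """
    if prev_count <= target:
        uncapped = float(curr_count)                  # every event is kept
    else:
        uncapped = curr_count * target / prev_count   # scaled by the previous harvest
    return min(uncapped, 2.0 * target)

print(estimated_sampled(200, 200))  # steady throughput -> 10.0
print(estimated_sampled(100, 400))  # throughput jumps  -> 20.0 (hits the cap)
print(estimated_sampled(400, 100))  # throughput drops  -> 2.5
```

Only the steady case lands on the target; a jump or drop between consecutive harvests pushes the estimate toward the cap or toward zero.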

For harvests_per_hour=60 and target_samples_per_hour=10, the percentage of samples in the middle / outside the middle:  3.31% / 96.69%
