In [56]:
import numpy as np
import scipy as sp

from pybdm import BDM
from pybdm import PartitionIgnore
from pyinform.blockentropy import block_entropy
from compress import Compressor

# plotting tools
from bokeh.io import output_notebook, show
from bokeh.layouts import gridplot
from bokeh.plotting import figure
from bokeh.models import CustomJS, Slider, ColumnDataSource

output_notebook()

In this notebook we study the relative mathematical randomness of a few binary strings and relate it to the number of parameters a machine learning model needs in practice to produce each string.

Useful links:

pybdm: [https://pybdm-docs.readthedocs.io/en/latest/index.html](https://pybdm-docs.readthedocs.io/en/latest/index.html)

pyinform: [https://elife-asu.github.io/PyInform/index.html](https://elife-asu.github.io/PyInform/index.html)

compress: [https://pypi.org/project/compress/](https://pypi.org/project/compress/)

bokeh: [https://docs.bokeh.org/en/latest/index.html](https://docs.bokeh.org/en/latest/index.html)


For plotting, use the following code. Make sure to label all plots properly.

```
# width/height are the current figure size arguments;
# plot_width/plot_height were removed in Bokeh 3
plot_options = dict(width=450,
                    height=250,
                    tools='pan,wheel_zoom,reset,save')

bent_plot = figure(**plot_options)
bent_plot.line(<x_data>,
                <y_data>,
                line_width=2)

bent_plot.xaxis.axis_label = '<xlabel>'
bent_plot.yaxis.axis_label = '<ylabel>'

show(bent_plot)

```

In [4]:
## Set of strings
strings = [
    [ 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0 ], # String 1
    [ 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1 ], # String 2
    [ 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1 ], # String 3
    [ 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1 ], # String 4
    [ 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1 ], # String 5
    [ 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1 ], # String 6
    [ 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0 ], # String 7
    [ 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1 ]  # String 8
]

In [20]:
## These values are taken from actual experiments
## on the minimum number of parameters required to
## obtain the corresponding string as output from a machine learning
## model (a sequence generation LSTM)
lstm_min_nparams = [ 185, 379, 460, 602, 632, 637, 581, 601 ]

In [22]:
plot_options = dict(width=450,
                    height=250,
                    tools='pan,wheel_zoom,reset,save')

lstm_plot = figure(**plot_options)
# one x value per string, so index over `strings` (the list), not a single string
lstm_plot.line(range(len(strings)),
                lstm_min_nparams,
                line_width=2)
lstm_plot.title.text = "LSTM Parameter Trend"
lstm_plot.xaxis.axis_label = 'string index'
lstm_plot.yaxis.axis_label = 'lstm minimum number of parameters'
show(lstm_plot)



## Shannon Principles

First we check what Shannon's notion of entropy says about the strings.

The question is how well this corresponds to the source that generates the strings.

- **TASK**: compute and plot `block_entropy` (as a function of block size) for each of the strings.

In [16]:
## INSERT CODE HERE
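
As a starting point, here is a minimal sketch of what block entropy measures, written with only the standard library so it can be checked by hand. It assumes overlapping blocks and base-2 logarithms, which is my understanding of how `pyinform`'s `block_entropy(series, k)` behaves; `block_entropy_sketch` is a hypothetical helper name, not part of any library.

```python
from collections import Counter
from math import log2

def block_entropy_sketch(series, k):
    """Shannon entropy (bits) of the overlapping length-k blocks of `series`."""
    blocks = [tuple(series[i:i + k]) for i in range(len(series) - k + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return -sum((n / total) * log2(n / total) for n in counts.values())

string_1 = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # String 1 from the list above

# entropy as a function of block size (the range 1..4 is an arbitrary choice)
for k in range(1, 5):
    print(k, round(block_entropy_sketch(string_1, k), 3))
```

The printed pairs can be passed directly as x and y data to the plotting template above.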

Did you notice something informative in the actual values of entropy calculated?

```The strings are chosen so that all of them except the first have consistently high entropy/block entropy.```

**TASK**: We now have block entropy as a function of block size. For each string, take the minimum block entropy over the block sizes as that string's block entropy, then plot these minima against the string index, in the same order as above.
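
One caveat when taking the minimum over block sizes: once the block size equals the string length there is only a single block, so the entropy is trivially 0 and the range of block sizes has to be capped. The sketch below caps it at a hypothetical `max_k = 4` (an assumption, not a value from the original experiments) and uses a hand-rolled entropy helper with overlapping blocks rather than `pyinform`.

```python
from collections import Counter
from math import log2

def entropy_of_blocks(series, k):
    # entropy in bits of the overlapping length-k blocks
    blocks = [tuple(series[i:i + k]) for i in range(len(series) - k + 1)]
    total = len(blocks)
    return -sum(n / total * log2(n / total) for n in Counter(blocks).values())

def min_block_entropy(series, max_k=4):
    # smallest entropy over block sizes 1..max_k (max_k is an assumed cap)
    return min(entropy_of_blocks(series, k) for k in range(1, max_k + 1))

examples = [
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],  # String 1
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],  # String 3
]
min_b_ent_sketch = [min_block_entropy(s) for s in examples]
print(min_b_ent_sketch)

# with k equal to the full length there is one block, so the entropy collapses to 0
print(entropy_of_blocks(examples[1], len(examples[1])))
```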

In [18]:
bent_plot = figure(**plot_options)
bent_plot.line(range(len(strings)),
                min_b_ent,
                line_width=2)

bent_plot.xaxis.axis_label = 'string_index'
bent_plot.yaxis.axis_label = 'entropy'
show(bent_plot)

What do you observe in the trend? What do you think it says about the relative randomness among the strings? Which strings are the least and most random? How well does this correlate with the `LSTM Parameter Trend` plot?

```Except for the first string, all strings have high entropy, so Shannon block entropy says that every string but the first has the same (high) entropy. The string with the least information is the first one. This contrasts with the precomputed LSTM parameter trend: where Shannon entropy says that strings 2-8 are equally random, the LSTM parameter counts do not follow that pattern.```

## Compression schemes

An alternative way to measure randomness in strings is to run compression algorithms on them: the size after compression says something about the regularities present in the input.
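
To make the idea concrete, here is a small sketch using Python's standard-library `zlib` (not the `compress` wrapper used in this notebook). It compares a periodic string with a pseudo-random one of the same length; the length of 1000 symbols and the seed are arbitrary choices for illustration. Longer inputs are used because, for very short inputs, fixed header and checksum bytes make up a large share of the compressed size.

```python
import random
import zlib

random.seed(0)  # arbitrary seed so the sketch is repeatable

n = 1000
periodic = '10' * (n // 2)
pseudo_random = ''.join(random.choice('01') for _ in range(n))

len_periodic = len(zlib.compress(periodic.encode('utf-8')))
len_random = len(zlib.compress(pseudo_random.encode('utf-8')))

# the regular string should compress to far fewer bytes than the irregular one
print(len_periodic, len_random)
```

The same overhead effect is worth keeping in mind when reading compressed lengths for 12-symbol strings.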

In [54]:
## Use the compression module create one plot with each of the following compressors
## zlib, bz2, lzma
## Each line should be distinguished by different colors
## TASK: string vs compressed length plot

z_lib = []
bz2 = []
lzma = []

def bitstring_to_bytes(s):
    # parse s (e.g. '101010101010') as base 2 and round the byte count up
    return int(s, 2).to_bytes((len(s) + 7) // 8, byteorder='big')

c = Compressor()

for string in strings:
    # join the bits into text such as '1010...' and encode it once
    data = ''.join(str(val) for val in string).encode('utf-8')

    c.use_zlib()
    z_lib.append(len(c.compress(data)))

    c.use_bz2()
    bz2.append(len(c.compress(data)))

    c.use_lzma()
    lzma.append(len(c.compress(data)))

comp_plot = figure(**plot_options)
comp_plot.line(range(len(strings)),
                z_lib,
                legend_label="zlib",
                line_color='red',
                line_width=2)
comp_plot.line(range(len(strings)),
                bz2,
                legend_label="bz2",
                line_color='green',
                line_width=2)
comp_plot.line(range(len(strings)),
                lzma,
                legend_label="lzma",
                line_color='blue',
                line_width=2)

comp_plot.legend.click_policy="hide"
comp_plot.xaxis.axis_label = 'string_index'
comp_plot.yaxis.axis_label = 'compressed length'
show(comp_plot)

In [55]:
## Plot **only** the best compression algorithm here
comp_plot = figure(**plot_options)
comp_plot.line(range(len(strings)),
                z_lib,
                legend_label="zlib",
                line_color='red',
                line_width=2)

comp_plot.legend.click_policy="hide"
comp_plot.xaxis.axis_label = 'string_index'
comp_plot.yaxis.axis_label = 'compressed length'
show(comp_plot)

Which compression scheme performs best? What do you observe in the trend? What do you think it says about the relative randomness among the strings? Which strings are the least and most random? How well does this correlate with the `LSTM Parameter Trend` plot?

```The size needed to encode the strings does not increase regularly, but it does show a pattern: on average it goes up. zlib, which gave the smallest compressed sizes, says that string 2 is less random than string 1. There are some irregularities in the overall trend.```

Now that you have seen statistical techniques for estimating randomness, what do you think about using them to say something about the predictability of strings? What advantages and disadvantages do you see in them?

## CTM

Now let us look at another definition of randomness in strings, one connected to Kolmogorov's algorithmic information theory. This theory is founded on Turing machines and is therefore universal, in the sense that we do not need to make any assumptions about the data in order to get these values. Such a powerful theory has drawbacks in practical usability: Kolmogorov complexity is provably uncomputable in general. For small strings, however, we can find fairly tight upper bounds, which is a nice thing to have in such a general theory.

The method used to compute this is called CTM (Coding Theorem Method). The papers are a nice read on how exactly it is calculated, but at a high level the estimate is obtained by enumerating a large set of small Turing machines, running them, and recording which of them produce the string.
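
The final step of that procedure, turning output frequencies into complexity estimates, can be sketched as follows. By the coding theorem, a string produced by a fraction `m(s)` of the enumerated machines gets an estimate of roughly `-log2(m(s))` bits. The counts below are made-up toy numbers for illustration only, not values from the CTM datasets, and `ctm_estimate` is a hypothetical helper.

```python
from math import log2

# toy counts: how many (imaginary) halting machines output each string
toy_output_counts = {'0': 16, '1': 16, '00': 8, '01': 8, '10': 8, '11': 8}

def ctm_estimate(s, counts):
    # coding-theorem style estimate: -log2 of the output frequency of s
    total = sum(counts.values())
    return -log2(counts[s] / total)

for s in toy_output_counts:
    print(s, ctm_estimate(s, toy_output_counts))
```

Strings that many machines produce get a low estimate, and rarely produced strings get a high one.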

**TASK**: Compute and plot the CTM values of the strings, similar to what we have been doing. You can use the `BDM` (Block Decomposition Method) class from the `pybdm` module to do this. `BDM` is a generalization of `CTM`: with `ndim=1`, each 12-symbol string here should fit in a single block under the `pybdm` defaults, so `BDM` reduces to a plain CTM lookup.

In [62]:
plot_options = dict(width=450,
                    height=250,
                    tools='pan,wheel_zoom,reset,save')
ctm = []

bdm = BDM(ndim=1, partition=PartitionIgnore)
for string in strings:
    # one complexity estimate (in bits) per string
    ctm.append(bdm.bdm(np.array(string)))

ctm_plot = figure(**plot_options)
ctm_plot.line(range(len(strings)),
                ctm,
                line_width=2)
# the x axis indexes the strings, not block sizes
ctm_plot.xaxis.axis_label = 'string index'
ctm_plot.yaxis.axis_label = 'Kolmogorov Complexity (bits)'
show(ctm_plot)

What do you observe in the trend? What do you think it says about the relative randomness among the strings? Which strings are the least and most random? How well does this correlate with the `LSTM Parameter Trend` plot?

```The CTM plot shows an increasing trend in the amount of mathematical randomness in the strings. The randomness seems to be correlated with the number of parameters in the LSTM model. It is not yet clear what this means, but it warrants further exploration.```

Now that we have explored three different methods, it is time to compare them. Plot all three (block entropy, compression, Kolmogorov complexity) in the same plot. Note that the actual values differ for each method, so you will need to normalize them to get the relative randomness in the range \[0, 1\].

In [65]:
def normalize(a):
    a = np.array(a)
    return (a - np.min(a))/(np.max(a) - np.min(a))

ctm = normalize(ctm)
z_lib = normalize(z_lib)
min_b_ent = normalize(min_b_ent)
lstm_min_nparams = normalize(lstm_min_nparams)

plot_options = dict(width=550,
                    height=300,
                    tools='pan,wheel_zoom,reset,save')

ctm_plot = figure(**plot_options)
ctm_plot.line(range(len(strings)),
              ctm,
              line_color="red",
              line_width=2,
              legend_label="ctm")
ctm_plot.line(range(len(strings)),
              z_lib,
              line_color="green",
              line_width=2,
              legend_label="zlib")
ctm_plot.line(range(len(strings)),
              min_b_ent,
              line_color="blue",
              line_width=2,
              legend_label="block entropy")
ctm_plot.line(range(len(strings)),
              lstm_min_nparams,
              line_color="orange",
              line_width=2,
              legend_label="LSTM")
ctm_plot.xaxis.axis_label = 'string index'
ctm_plot.legend.location = 'bottom_right'
ctm_plot.yaxis.axis_label = 'Randomness Estimation'
show(ctm_plot)

Which curve do you think best approximates the relative randomness computed by the CTM method?

The `LSTM` curve was able to capture the relative trends in randomness.
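
One way to put a number on "captures the trend" is a correlation coefficient. The sketch below computes the Pearson correlation between the LSTM parameter counts given earlier and zlib-compressed lengths. It calls the standard-library `zlib` directly as a stand-in for the `compress` wrapper, so the byte counts may not match the notebook's `z_lib` list exactly; `pearson` is a hypothetical helper, and the same call works for any of the other measures.

```python
import zlib
from math import sqrt

strings = [
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1],
]
lstm_min_nparams = [185, 379, 460, 602, 632, 637, 581, 601]

def pearson(x, y):
    # Pearson correlation coefficient of two equal-length sequences
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# compressed size in bytes of each string written as ASCII '0'/'1' characters
zlib_lengths = [len(zlib.compress(''.join(map(str, s)).encode('utf-8')))
                for s in strings]

r = pearson(zlib_lengths, lstm_min_nparams)
print(zlib_lengths, round(r, 3))
```

With only eight data points the coefficient is a rough indicator at best, so it complements the plot rather than replacing it.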

(BONUS) Create a sequence generation model with `L0 regularization` and find the minimum number of parameters in the model. Run the model on the given strings and try to reproduce our `lstm_min_nparams` list.