# DCGraM Algorithm

This notebook implements the D-Markov with Clustering and Graph Minimization (DCGraM) Algorithm. Its objective is to model a discrete dynamical system using a Probabilistic Finite State Automaton (PFSA).

Given a sequence *X* of length *N* over the alphabet $\Sigma$, produced as an output of the original dynamical system, DCGraM works by:

1. Creating a D-Markov model for the original system for a given *D*;
2. Using a clustering algorithm on the D-Markov model states in order to create an initial partition;
3. Using a graph minimization algorithm to refine the initial partition until the final reduced PFSA is obtained.

## Initialization
First, it is necessary to create the directories that store the working files for the current system. The first cell sets the system's name and the tag to be used in the current run. The following cell only has to be run when modeling a new system. It creates a directory named after the system and, inside it, subdirectories that contain the sequence, PFSA and result files.

In [5]:
import pandas as pd
import yaml
import sequenceanalyzer as sa
import dmarkov

In [2]:
name = 'ternary_even_shift'
tag = 'v1'

In [3]:
import os
if not os.path.exists(name):
    os.makedirs(name)
    os.makedirs(name + '/sequences')
    os.makedirs(name + '/pfsa')
    os.makedirs(name + '/results')
    os.makedirs(name + '/results/probabilities')
    os.makedirs(name + '/results/probabilities/conditional')
    os.makedirs(name + '/results/cond_entropies')
    os.makedirs(name + '/results/kldivergences')
    os.makedirs(name + '/results/autocorrelations')
    os.makedirs(name + '/results/prob_distances')
    os.makedirs(name + '/results/plots')

### Parameters

The next cell initializes the parameters that are used throughout the code. They are listed as:

  * `N`: the length *N* of the original sequence, which is also the length of the sequences generated by the PFSA that DCGraM produces;
  * `drange`: range of values of *D* for which D-Markov and DCGraM machines will be generated;
  * `a`: value up to which the autocorrelation is computed.

In [4]:
N = 10000000
drange = range(4,11)
a = 20

## Original Sequence Analysis

Make sure that the original sequence of length `N` is stored in the correct directory and run the cell to load it to `X`. After this, run the cells corresponding to the computation of the subsequence probabilities and the conditional probabilities for the value `d_max`, which is the last value in `drange`. Additional results can also be computed in the respective cells (autocorrelation and conditional entropy).
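
As a rough illustration of what the subsequence probabilities are, the sketch below estimates them by counting sliding windows over the sequence. It is not the `sequenceanalyzer` implementation: the function name `subsequence_probs`, the string representation of the sequence and the nested-dictionary output are assumptions made for this example only.

```python
from collections import Counter

def subsequence_probs(seq, max_len):
    """Estimate P(w) for every subsequence w of length 1..max_len.

    Hypothetical helper: seq is a string of symbols and the result maps
    each length to a {subsequence: relative frequency} dictionary.
    """
    probs = {}
    for length in range(1, max_len + 1):
        n_windows = len(seq) - length + 1
        if n_windows <= 0:
            break  # the sequence is shorter than the requested length
        counts = Counter(seq[i:i + length] for i in range(n_windows))
        probs[length] = {w: c / n_windows for w, c in counts.items()}
    return probs

# Toy usage on a short binary string
example_probs = subsequence_probs("0110110", 2)
```

For a sequence of length `N` this takes time proportional to `N * max_len`, so for `N = 10000000` a vectorized or streaming approach may be preferable.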

In [None]:
#Open original sequence from yaml file
with open(name + '/sequences/original_len_' + str(N) + '_' + tag + '.yaml', 'r') as f:
    X = yaml.safe_load(f)
    
#Value up to which results are computed
d_max = drange[-1]

In [None]:
#Compute subsequence probabilities of occurrence up to length d_max
p, alphabet = sa.calc_probs(X, d_max)
p.to_csv(name + '/results/probabilities/original_' + tag + '.csv')
with open(name + '/alphabet.yaml', 'w') as f:
    yaml.dump(alphabet, f)

In [None]:
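#If p has been previously computed, use this cell to load the values
#(checking globals() avoids a NameError when p is undefined and the
# ambiguous truth value of a DataFrame when it is defined)
if 'p' not in globals():
    p = pd.read_csv(name + '/results/probabilities/original_' + tag + '.csv')
    with open(name + '/alphabet.yaml', 'r') as f:
        alphabet = yaml.safe_load(f)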

In [None]:
#Compute conditional probabilities of subsequences occurring after given each symbol of the alphabet
#One of the two previous cells needs to be executed first.
if 'p' in globals():
    p_cond = sa.calc_cond_probs(p, alphabet, d_max)
    p_cond.to_csv(name + '/results/probabilities/conditional/original_' + tag + '.csv')
else:
    print("Run a cell that either computes or opens the probabilities.")
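
The conditional probabilities can be pictured as ratios of subsequence probabilities, $\Pr(\sigma|w) = P(w\sigma)/P(w)$. The sketch below assumes this definition and a nested-dictionary input keyed by subsequence length; both the helper `conditional_probs` and its data layout are hypothetical and may differ from what `sa.calc_cond_probs` does.

```python
def conditional_probs(probs_by_length, alphabet, max_len):
    """Return {w: {sigma: Pr(sigma | w)}} for words w of length 1..max_len-1.

    Assumes probs_by_length[L] maps each length-L word to its probability.
    """
    cond = {}
    for length in range(1, max_len):
        longer = probs_by_length[length + 1]
        for word, p_w in probs_by_length[length].items():
            if p_w == 0:
                continue  # Pr(sigma | w) is undefined when P(w) = 0
            cond[word] = {s: longer.get(word + s, 0.0) / p_w for s in alphabet}
    return cond

# Toy usage: an i.i.d. binary source with P(0) = 0.25 and P(1) = 0.75
toy_probs = {
    1: {"0": 0.25, "1": 0.75},
    2: {"00": 0.0625, "01": 0.1875, "10": 0.1875, "11": 0.5625},
}
toy_cond = conditional_probs(toy_probs, "01", 2)
```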

In [None]:
#If p_cond has been previously computed, use this cell to load the values
if 'p_cond' not in globals():
    p_cond = pd.read_csv(name + '/results/probabilities/conditional/original_' + tag + '.csv')

In [None]:
#Compute conditional entropy
if 'p' in globals() and 'p_cond' in globals():
    h = sa.calc_cond_entropy(p, p_cond, d_max)
    h.to_csv(name + '/results/cond_entropies/original_' + tag + '.csv')
else:
    print("Run the conditional probabilities cell first.")
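
For reference, one common definition of the conditional entropy for contexts of a fixed length is $h = -\sum_{w} P(w) \sum_{\sigma \in \Sigma} \Pr(\sigma|w) \log_2 \Pr(\sigma|w)$. The sketch below assumes this definition; the helper name and the dictionary inputs are illustrative and are not taken from `sequenceanalyzer`.

```python
import math

def conditional_entropy(word_probs, cond_probs):
    """Conditional entropy in bits for one context length.

    word_probs: {w: P(w)}; cond_probs: {w: {sigma: Pr(sigma | w)}}.
    """
    h = 0.0
    for word, p_w in word_probs.items():
        for p_s in cond_probs.get(word, {}).values():
            if p_s > 0:  # 0 * log(0) is taken as 0
                h -= p_w * p_s * math.log2(p_s)
    return h

# Toy usage: i.i.d. binary source with P(0) = 0.25, so the context is irrelevant
toy_word_probs = {"0": 0.25, "1": 0.75}
toy_cond_probs = {"0": {"0": 0.25, "1": 0.75}, "1": {"0": 0.25, "1": 0.75}}
toy_h = conditional_entropy(toy_word_probs, toy_cond_probs)
```

For an i.i.d. source like the toy one, the result equals the entropy of a single symbol, since the context carries no information.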

In [None]:
#If h has been previously computed, use this cell to load the values
if 'h' not in globals():
    h = pd.read_csv(name + '/results/cond_entropies/original_' + tag + '.csv')

In [None]:
#Compute autocorrelation
aut = sa.calc_autocorr(X, a)
aut.to_csv(name + '/results/autocorrelations/original_' + tag + '.csv')
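
Definitions of autocorrelation vary, so the following is only one possible reading: a mean-removed autocorrelation normalized by the lag-0 value, computed for lags 0 to `max_lag`. The function `autocorrelation` is a hypothetical helper that expects a numeric sequence; whether `sa.calc_autocorr` uses the same normalization is not stated in this notebook.

```python
def autocorrelation(values, max_lag):
    """Normalized autocorrelation of a numeric sequence for lags 0..max_lag.

    Assumes max_lag < len(values) and that the sequence is not constant.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    variance_sum = sum(d * d for d in dev)
    result = []
    for lag in range(max_lag + 1):
        cov = sum(dev[t] * dev[t + lag] for t in range(n - lag))
        result.append(cov / variance_sum)
    return result

# Toy usage: an alternating sequence is anticorrelated at odd lags
toy_autocorr = autocorrelation([0, 1] * 50, 2)
```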

In [None]:
#If aut has been previously computed, use this cell to load the values
if 'aut' not in globals():
    aut = pd.read_csv(name + '/results/autocorrelations/original_' + tag + '.csv')

## D-Markov Machines

The next step of DCGraM consists of generating D-Markov Machines for each value of *D* in `drange` defined above. This requires `p_cond`, so it has to be computed or loaded in the cells above. A D-Markov Machine is a PFSA with $|\Sigma|^D$ states, each one labeled with one of the subsequences of length $D$. Given a state $\omega = \sigma_1\sigma_2\ldots\sigma_D$, for each $\sigma \in \Sigma$, it transitions to the state $\sigma_2\sigma_3\ldots\sigma_D\sigma$ with probability $\Pr(\sigma|\omega)$. This is done for all states in the D-Markov machine.
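
The construction described above can be sketched in a few lines. This is a minimal illustration and not the `dmarkov.create` implementation: the function `build_dmarkov`, the callable `cond_prob` and the dictionary layout of the machine are all assumptions made for the example.

```python
from itertools import product

def build_dmarkov(alphabet, D, cond_prob):
    """Return {state: {symbol: (next_state, probability)}}.

    cond_prob(symbol, state) is assumed to return Pr(symbol | state).
    """
    machine = {}
    for letters in product(alphabet, repeat=D):
        state = "".join(letters)
        machine[state] = {}
        for symbol in alphabet:
            prob = cond_prob(symbol, state)
            if prob > 0:
                # Drop the oldest symbol and append the new one
                next_state = state[1:] + symbol
                machine[state][symbol] = (next_state, prob)
    return machine

# Toy usage: binary alphabet, D = 2, every symbol equally likely
toy_machine = build_dmarkov("01", 2, lambda symbol, state: 0.5)
```

Transitions with zero probability are omitted here to keep the table small; keeping them with probability 0 would be an equally valid choice.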

In [None]:
dmark_machines = []

In [None]:
#If the D-Markov machines have not been previously created, generate them with this cell
for D in drange:
    dmark_machines.append(dmarkov.create(p_cond, D))
    dmark_machines[-1].to_csv(name + '/pfsa/dmarkov_D' + str(D) + '_' + tag + '.csv')

In [1]:
#On the other hand, if there already are D-Markov machines, load them with this cell
if not dmark_machines:
    for D in drange:
        dmark_machines.append(pd.read_csv(name + '/pfsa/dmarkov_D' + str(D) + '_' + tag + '.csv'))

### D-Markov Machine Analysis

First of all, sequences should be generated from the D-Markov Machines. The same parameters computed in the analysis of the original sequence should be computed for the D-Markov Machines' sequences. Besides those parameters, the Kullback-Leibler Divergence and Distribution Distance between these sequences and the original sequence should also be computed.
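
As an illustration of the Kullback-Leibler Divergence mentioned above, the sketch below computes $D_{KL}(P\|Q) = \sum_w P(w) \log_2 \frac{P(w)}{Q(w)}$ between two subsequence distributions of the same length. The helper `kl_divergence` and its dictionary inputs are assumptions for this example; the Distribution Distance is not defined in this notebook, so it is not sketched here.

```python
import math

def kl_divergence(p, q):
    """KL divergence D(P || Q) in bits between two {word: probability} dicts."""
    total = 0.0
    for word, p_w in p.items():
        if p_w == 0:
            continue  # terms with P(w) = 0 contribute nothing
        q_w = q.get(word, 0.0)
        if q_w == 0:
            return math.inf  # P puts mass where Q has none
        total += p_w * math.log2(p_w / q_w)
    return total

# Toy usage: a fair coin compared against a biased one
fair = {"0": 0.5, "1": 0.5}
biased = {"0": 0.25, "1": 0.75}
toy_kl = kl_divergence(fair, biased)
```

Note that the divergence is not symmetric, so the order of the original and generated distributions matters.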

In [None]:
dmark_seqs = []

In [None]:
#Generate sequences:
for D, machine in zip(drange, dmark_machines):
    seq = machine.generate_sequence(N)
    with open(name + '/sequences/dmarkov_D' + str(D) + '_' + tag + '.yaml', 'w') as f:
        yaml.dump(seq, f)
    dmark_seqs.append(seq)
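
Generating a sequence from a PFSA amounts to a random walk over its states, emitting one symbol per transition. The sketch below shows the idea under stated assumptions: the machine is a plain dictionary `{state: {symbol: (next_state, probability)}}`, and both this layout and the standalone `generate_sequence` function are hypothetical, not the method used by the machine objects in this notebook.

```python
import random

def generate_sequence(machine, start_state, length, seed=None):
    """Emit `length` symbols by walking the machine from start_state."""
    rng = random.Random(seed)
    state = start_state
    out = []
    for _ in range(length):
        symbols = list(machine[state])
        weights = [machine[state][s][1] for s in symbols]
        # Pick the next symbol according to the transition probabilities
        symbol = rng.choices(symbols, weights=weights, k=1)[0]
        out.append(symbol)
        state = machine[state][symbol][0]
    return "".join(out)

# Toy usage: a deterministic two-state machine that alternates 0 and 1
toy = {"A": {"0": ("B", 1.0)}, "B": {"1": ("A", 1.0)}}
toy_sequence = generate_sequence(toy, "A", 6)
```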

In [None]:
#If the sequences have been previously generated, load them here:
if not dmark_seqs:
    for D in drange:
        #Open in read mode: 'w' would truncate the stored sequence
        with open(name + '/sequences/dmarkov_D' + str(D) + '_' + tag + '.yaml', 'r') as f:
            dmark_seqs.append(yaml.safe_load(f))

In [None]:
#Compute subsequence probabilities of occurrence of the D-Markov sequences
p_dmark = []
for D, seq in zip(drange, dmark_seqs):
    p_dm, alphabet = sa.calc_probs(seq, d_max)
    p_dm.to_csv(name + '/results/probabilities/dmarkov_D' + str(D) + '_' + tag + '.csv')
    p_dmark.append(p_dm)

In [None]:
#If p_dmark has been previously computed, use this cell to load the values
if not p_dmark:
    for D in drange:
        p_dm = pd.read_csv(name + '/results/probabilities/dmarkov_D' + str(D) + '_' + tag + '.csv')
        p_dmark.append(p_dm)
    with open(name + '/alphabet.yaml', 'r') as f:
        alphabet = yaml.safe_load(f)

In [None]:
#Compute conditional probabilities of subsequences occurring after given each symbol of the alphabet
#One of the two previous cells needs to be executed first.
p_cond_dmark = []
if p_dmark:
    for D, p_dm in zip(drange, p_dmark):
        p_cond_dm = sa.calc_cond_probs(p_dm, alphabet, d_max)
        p_cond_dm.to_csv(name + '/results/probabilities/conditional/dmarkov_D' + str(D) + '_' + tag + '.csv')
        p_cond_dmark.append(p_cond_dm)
else:
    print("Run a cell that either computes or opens the probabilities.")

In [None]:
#If p_cond has been previously computed, use this cell to load the values
if 'p_cond' not in globals():
    p_cond = pd.read_csv(name + '/results/probabilities/conditional/original_' + tag + '.csv')

In [None]:
#Compute conditional entropy
if 'p' in globals() and 'p_cond' in globals():
    h = sa.calc_cond_entropy(p, p_cond, d_max)
    h.to_csv(name + '/results/cond_entropies/original_' + tag + '.csv')
else:
    print("Run the conditional probabilities cell first.")

In [None]:
#If h has been previously computed, use this cell to load the values
if 'h' not in globals():
    h = pd.read_csv(name + '/results/cond_entropies/original_' + tag + '.csv')

In [None]:
#Compute autocorrelation
aut = sa.calc_autocorr(X, a)
aut.to_csv(name + '/results/autocorrelations/original_' + tag + '.csv')

In [None]:
#If aut has been previously computed, use this cell to load the values
if 'aut' not in globals():
    aut = pd.read_csv(name + '/results/autocorrelations/original_' + tag + '.csv')

## Clustering

## Graph Minimization

## DCGraM Analysis

### Plots