# DCGraM Algorithm

This notebook implements the D-Markov with Clustering and Graph Minimization (DCGraM) algorithm. Its objective is to model a discrete dynamical system using a Probabilistic Finite State Automaton (PFSA).

Given a sequence *X* of length *N* over the alphabet $\Sigma$, produced as an output of the original dynamical system, DCGraM works by:

1. Creating a D-Markov model for the original system for a given *D*;
2. Using a clustering algorithm on the D-Markov model states in order to create an initial partition;
3. Using a graph minimization algorithm to refine the initial partition until the final reduced PFSA is obtained.
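
Step 1 can be illustrated with a minimal sketch. It assumes the sequence is a plain Python string and that a D-Markov model is estimated by counting which symbol follows each word of length *D*; the function name `d_markov_model` is hypothetical and is not part of `sequenceanalyzer`.

```python
from collections import Counter, defaultdict

def d_markov_model(sequence, D):
    """Hypothetical helper: estimate a D-Markov model from `sequence`.

    Each state is a word of length D seen in the sequence; it maps to the
    empirical distribution of the symbol that follows it.
    """
    counts = defaultdict(Counter)
    for i in range(len(sequence) - D):
        state = sequence[i:i + D]      # the last D symbols observed
        symbol = sequence[i + D]       # the symbol emitted from that state
        counts[state][symbol] += 1
    model = {}
    for state, followers in counts.items():
        total = sum(followers.values())
        model[state] = {s: n / total for s, n in followers.items()}
    return model

# Toy binary string, used only for illustration
toy = "0101011"
model = d_markov_model(toy, D=1)
# In this toy string, '0' is always followed by '1'
print(model)
```

In such a model, emitting symbol `s` from state `w` leads to the state `w[1:] + s`, so the number of states grows with the alphabet size raised to *D*. Steps 2 and 3 aim to reduce that number.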

## Initialization
First, it is necessary to create the directories that store the working files for the current system. After the imports, the next cell sets the system's name and the tag that identifies the files of the current run. The cell after it only has to be run when modeling a new system: it creates a directory named after the system and, inside it, subdirectories that contain the sequence, PFSA and result files.

In [5]:
import pandas as pd
import yaml
import sequenceanalyzer as sa

In [2]:
name = 'ternary_even_shift'
tag = 'v1'

In [3]:
import os
if not os.path.exists(name):
    os.makedirs(name)
    os.makedirs(name + '/sequences')
    os.makedirs(name + '/graphs')
    os.makedirs(name + '/results')
    os.makedirs(name + '/results/probabilities')
    os.makedirs(name + '/results/probabilities/conditional')
    os.makedirs(name + '/results/cond_entropies')
    os.makedirs(name + '/results/kldivergences')
    os.makedirs(name + '/results/autocorrelations')
    os.makedirs(name + '/results/prob_distances')
    os.makedirs(name + '/results/plots')

### Parameters

The next cell initializes the parameters that are used throughout the code. They are listed as:

  * `N`: the length *N* of the original sequence, which is also the length of the sequences generated by the PFSA that DCGraM produces;
  * `drange`: the range of values of *D* for which D-Markov and DCGraM machines will be generated.

In [4]:
N = 10000000
drange = range(4,11)

## Original Sequence Analysis

Make sure that the original sequence of length `N` is stored in the `sequences` directory and run the cell to load it into `X`. After this, run the cells that compute the subsequence probabilities and the conditional probabilities up to the value `d_max`, which is the last value in `drange`. Additional results (autocorrelation and conditional entropy) can also be computed in their respective cells.

In [None]:
#Open original sequence from yaml file
with open(name + '/sequences/original_len_' + str(N) + '_' + tag + '.yaml', 'r') as f:
    X = yaml.safe_load(f)  # yaml.load needs an explicit Loader in PyYAML >= 6
    
#Value up to which results are computed
d_max = drange[-1]

In [None]:
#Compute subsequence probabilities of occurrence up to length d_max
p, alphabet = sa.calc_probs(X, d_max)
p.to_csv(name + '/results/probabilities/original_' + tag + '.csv')
with open(name + '/alphabet.yaml', 'w') as f:
    yaml.dump(alphabet, f)

In [None]:
#If p has been previously computed, use this cell to load the values
p = pd.read_csv(name + '/results/probabilities/original_' + tag + '.csv')
with open(name + '/alphabet.yaml', 'r') as f:
    alphabet = yaml.safe_load(f)  # yaml.load needs an explicit Loader in PyYAML >= 6

In [None]:
#Compute conditional probabilities of subsequences occurring after each given symbol of the alphabet
p_cond = sa.calc_cond_probs(p, alphabet, d_max)
p_cond.to_csv(name + '/results/probabilities/conditional/original_' + tag + '.csv')
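
For readers without access to `sequenceanalyzer`, the quantities computed above can be approximated with the standalone sketch below. It is not the library's implementation: the function names are hypothetical, the sequence is assumed to be a string, and the conditional probability is assumed to be that of a symbol given the word preceding it, which may differ from the convention used by `sa.calc_cond_probs`.

```python
from collections import Counter

def subsequence_probs(sequence, max_len):
    """Hypothetical helper: relative frequency of every word of length
    1..max_len, estimated with a sliding window over `sequence`."""
    probs = {}
    for length in range(1, max_len + 1):
        windows = len(sequence) - length + 1
        counts = Counter(sequence[i:i + length] for i in range(windows))
        for word, n in counts.items():
            probs[word] = n / windows
    return probs

def cond_prob(probs, alphabet, word, symbol):
    """Hypothetical helper: estimate P(symbol | word) by normalizing
    P(word + s) over every symbol s of the alphabet."""
    total = sum(probs.get(word + s, 0.0) for s in alphabet)
    if total == 0:
        return 0.0  # `word` was never followed by any symbol
    return probs.get(word + symbol, 0.0) / total

# Toy binary string, used only for illustration
toy = "0101011"
probs = subsequence_probs(toy, 2)
print(cond_prob(probs, "01", "1", "0"))
```

Normalizing over the alphabet, instead of dividing by `probs[word]`, keeps the estimates for a given word summing to one even on short sequences, where the window counts for different word lengths differ.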

In [None]:
#Compute conditional entropy
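
As a rough illustration of what this step could compute, the sketch below estimates the conditional entropy of a symbol given the `L - 1` symbols before it, as the difference between the block entropies of words of length `L` and `L - 1`. The function names are hypothetical and the choice of estimator is an assumption; the notebook may define the quantity differently.

```python
import math
from collections import Counter

def block_entropy(sequence, length):
    """Hypothetical helper: Shannon entropy (in bits) of the empirical
    distribution of words of the given length."""
    if length == 0:
        return 0.0
    windows = len(sequence) - length + 1
    counts = Counter(sequence[i:i + length] for i in range(windows))
    return -sum((n / windows) * math.log2(n / windows) for n in counts.values())

def conditional_entropy(sequence, length):
    """Hypothetical helper: H(length) - H(length - 1), an estimate of the
    uncertainty of a symbol given the previous length - 1 symbols."""
    return block_entropy(sequence, length) - block_entropy(sequence, length - 1)

# Toy periodic string: the next symbol is fully determined by the previous one
toy = "01" * 50
print(conditional_entropy(toy, 1), conditional_entropy(toy, 2))
```

On finite sequences the estimate for a deterministic source is close to zero rather than exactly zero, because the number of windows changes with the word length.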

In [None]:
#Compute autocorrelation
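
One possible way to compute this result is sketched below. It assumes that the symbols can be mapped to numbers (for example `'0'`, `'1'`, `'2'` to 0, 1, 2) and uses the normalized, biased sample autocorrelation. Both choices are assumptions, and the function name is hypothetical.

```python
def autocorrelation(values, max_lag):
    """Hypothetical helper: normalized sample autocorrelation of a numeric
    sequence for lags 0..max_lag (biased estimator, lag 0 equals 1)."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    variance = sum(c * c for c in centered)
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / variance
        for lag in range(max_lag + 1)
    ]

# Toy alternating string, mapped from symbols to integers
toy = [int(s) for s in "01" * 50]
acf = autocorrelation(toy, 2)
print(acf)
```

For a sequence as long as `N = 10000000`, this pure-Python loop would be slow; a vectorized or FFT-based computation would be the more practical choice.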

## D-Markov Machines

### D-Markov Machine Analysis

## Clustering

## Graph Minimization

## DCGraM Analysis

### Plots