#Sequence Modeling with EDeN
##The case for real-valued vector labels

**Aim:** Suppose you are given two sets of sequences. Each sequence is composed of characters from a finite alphabet, but the characters are not all equally distinct: some pairs of characters are more similar to each other than others. We want to build a predictive model that can discriminate between the two sets.

##Artificial Dataset

Let's build an artificial case. We construct two classes as follows: for each class we start from a random seed sequence specific to that class, and each sequence in the class is generated from the seed by swapping the positions of k pairs of characters chosen at random.
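Each swap changes at most two positions, so a sequence produced by k swaps differs from its seed in at most 2k positions. The sketch below checks this bound empirically. The names `swap_once` and `hamming` are illustrative only and are not part of the notebook's helper code.

```python
import random

def swap_once(seq, rng):
    """Swap the characters at two randomly chosen positions (they may coincide)."""
    chars = list(seq)
    i = rng.randrange(len(chars))
    j = rng.randrange(len(chars))
    chars[i], chars[j] = chars[j], chars[i]
    return ''.join(chars)

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)
seed = 'abcdefgh'
k = 2  # number of swaps applied to the seed

distances = []
for _ in range(1000):
    seq = seed
    for _ in range(k):
        seq = swap_once(seq, rng)
    distances.append(hamming(seed, seq))

max_distance = max(distances)
print(max_distance)
```

With a sequence length of 8 and k = 2, every generated sequence therefore shares at least half of its positions with the seed, which is what makes the two classes separable in the first place.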

To simulate the relationship between characters, we select some characters at random and capitalize them. For the machine, a capitalized character is a completely different symbol from its lowercase counterpart, whereas a human reader easily sees that the two are related.

Assume the similarity between characters is given as a symmetric matrix. We can then compute a low-dimensional embedding of the similarity matrix (e.g. MDS in $\mathbb{R}^4$) and obtain a vector representation for each character such that Euclidean distances are proportional to dissimilarities. Let's assume we are already given the vector representation. In our case we simply take random vectors, as they will be roughly equally distant from each other. To simulate that the capitalized version of a character should be similar to its lowercase counterpart, we add a small amount of noise to the vector representation of one of the two.
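For the case where a similarity matrix is actually available, the embedding step could look roughly like the sketch below. It uses classical (Torgerson) MDS implemented with NumPy, which is one possible choice and not necessarily the variant intended here. The similarity values (0.9 between the two cases of a letter, 0.1 otherwise) and the function name `classical_mds` are invented for illustration.

```python
import numpy as np

def classical_mds(dissimilarity, n_components):
    """Embed points so that Euclidean distances approximate the dissimilarities."""
    d2 = np.asarray(dissimilarity, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # double centering turns squared distances into a Gram matrix
    gram = -0.5 * centering.dot(d2).dot(centering)
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    # keep the n_components largest eigenvalues
    order = np.argsort(eigenvalues)[::-1][:n_components]
    scales = np.sqrt(np.maximum(eigenvalues[order], 0.0))
    return eigenvectors[:, order] * scales

chars = ['a', 'A', 'b', 'B', 'c', 'C']
# hypothetical similarity matrix: 0.9 within a case pair, 0.1 otherwise
similarity = np.full((6, 6), 0.1)
for i in range(0, 6, 2):
    similarity[i, i + 1] = similarity[i + 1, i] = 0.9
np.fill_diagonal(similarity, 1.0)

dissimilarity = 1.0 - similarity
vectors = classical_mds(dissimilarity, n_components=4)
encoding = {c: list(v) for c, v in zip(chars, vectors)}

def dist(x, y):
    """Euclidean distance between the vectors of two characters."""
    return float(np.linalg.norm(np.array(encoding[x]) - np.array(encoding[y])))

print(dist('a', 'A'), dist('a', 'b'))
```

The resulting `encoding` dictionary has the same shape as the encodings built later from random vectors: one list of floats per character, with case variants placed close together.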

###Auxiliary Code

In [1]:
#code for making artificial dataset
import random

def swap_two_characters(seq):
    '''Swap two characters at random positions in a string (the two positions may coincide).'''
    line = list(seq)
    id_i = random.randint(0,len(line)-1)
    id_j = random.randint(0,len(line)-1)
    line[id_i], line[id_j] = line[id_j], line[id_i]
    return ''.join(line)

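def swap_characters(seed, n):
    '''Apply n random swaps to seed and return the resulting string.'''
    seq = seed
    for _ in range(n):
        seq = swap_two_characters(seq)
    return seq

def make_seed(start=0, end=26):
    '''Return the lowercase letters with alphabet indices start..end-1 in shuffled order.'''
    seq = ''.join(chr(97 + i) for i in range(start, end))
    return swap_characters(seq, end - start)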
    
def make_dataset(n_sequences=None, seed=None, n_swaps=None):
    '''Return the seed followed by n_sequences variants, each obtained from the seed by n_swaps random swaps.'''
    seqs = [seed]
    for _ in range(n_sequences):
        seqs.append(swap_characters(seed, n_swaps))
    return seqs

def random_capitalize(seqs, p=0.5):
    new_seqs=[]
    for seq in seqs:
        new_seq = [c.upper() if random.random() < p else c for c in seq ]
        new_seqs.append(''.join(new_seq))
    return new_seqs

def make_artificial_dataset(sequence_length=None, n_sequences=None, n_swaps=None):
    '''Build one class: return train/test splits of the original and of the randomly capitalized sequences.'''
    seed = make_seed(start=0, end=sequence_length)
    print('Seed:  %s' % seed)
    seqs = make_dataset(n_sequences=n_sequences, seed=seed, n_swaps=n_swaps)
    half = len(seqs) // 2  #integer division, required for slicing
    train_seqs_orig = seqs[:half]
    test_seqs_orig = seqs[half:]
    seqs = random_capitalize(seqs, p=0.5)
    print('Sample with random capitalization: %s' % seqs[:7])
    train_seqs = seqs[:half]
    test_seqs = seqs[half:]
    return train_seqs_orig, test_seqs_orig, train_seqs, test_seqs

In [7]:
#code to estimate predictive performance on categorical labeled sequences

def discriminative_estimate(train_pos_seqs, train_neg_seqs, test_pos_seqs, test_neg_seqs):
    '''Fit a classifier on the training sequences and report its performance on the test sequences.

    Relies on the global variables complexity and n_iter_search defined in the parameters cell.
    '''
    from eden.graph import Vectorizer
    from eden.converter.graph.sequence import sequence_to_eden
    from eden.util import fit, estimate

    vectorizer = Vectorizer(complexity=complexity)

    #fit on the training sequences converted to graphs
    iterable_pos = sequence_to_eden(train_pos_seqs)
    iterable_neg = sequence_to_eden(train_neg_seqs)
    estimator = fit(iterable_pos, iterable_neg, vectorizer, n_iter_search=n_iter_search)

    #evaluate on the test sequences
    iterable_pos = sequence_to_eden(test_pos_seqs)
    iterable_neg = sequence_to_eden(test_neg_seqs)
    estimate(iterable_pos, iterable_neg, estimator, vectorizer)

In [8]:
#code to create real vector labels
def make_encoding(encoding_vector_dimension=3, sequence_length=None, noise_size=0.01):
    '''Map each char and its capitalized version to a random vector; the two differ only by small noise.'''
    import numpy as np
    #default vector (all zeros), later passed as the default to the pre_processor
    default_encoding = [0]*encoding_vector_dimension
    #list of the first sequence_length lowercase chars
    char_list = [chr(97+i) for i in range(sequence_length)]

    encodings = {}
    codes = np.random.rand(len(char_list), encoding_vector_dimension)
    for c, code in zip(char_list, codes):
        encodings[c] = list(code)
        #add noise for the encoding of capitalized chars
        noise = np.random.rand(encoding_vector_dimension)*noise_size
        encodings[c.upper()] = list(code + noise)
    return encodings, default_encoding

def make_encodings(n_encodings=3, encoding_vector_dimension=3, sequence_length=None, noise_size=0.01):
    encodings=[]
    for i in range(1,n_encodings+1):
        encoding, default_encoding = make_encoding(encoding_vector_dimension, sequence_length, noise_size=noise_size)
        encodings.append(encoding)
    return encodings, default_encoding
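
The construction above rests on two properties: random vectors for different characters lie far apart, while a character and its noisy capitalized copy lie close together. The sketch below checks both on a small example, assuming the same sampling scheme as `make_encoding` (uniform random codes plus uniform noise scaled by `noise_size`). The variable names are local to this example.

```python
import numpy as np

rng = np.random.RandomState(0)
dim = 9            # plays the role of encoding_vector_dimension
noise_size = 0.05  # plays the role of noise_size
n_chars = 8

codes = rng.rand(n_chars, dim)                       # lowercase vectors
noisy = codes + rng.rand(n_chars, dim) * noise_size  # capitalized vectors

# distance between each char and its capitalized counterpart
same_char = np.linalg.norm(codes - noisy, axis=1)
# distances between different lowercase chars
different = [np.linalg.norm(codes[i] - codes[j])
             for i in range(n_chars) for j in range(i + 1, n_chars)]

print(same_char.max(), min(different))
```

Since every noise component is below `noise_size`, the distance within a case pair is bounded by `noise_size * sqrt(dim)`, which is 0.15 for the values used here.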

##Artificial data generation

In [9]:
#problem parameters
random.seed(1)
sequence_length = 8 #length of each sequence
n_sequences = 50    #num sequences in positive and negative set
n_swaps = 2         #num pairs of chars that are swapped at random
n_iter_search = 30  #num parameter configurations that are evaluated in hyperparameter optimization
complexity = 2      #feature complexity for the vectorizer 
n_encodings = 5     #num vector encoding schemes for chars
encoding_vector_dimension = 9 #vector dimension for char encoding
noise_size = 0.05   #amount of random noise 

In [10]:
print('Positive examples:')
train_pos_seqs_orig, test_pos_seqs_orig, train_pos_seqs, test_pos_seqs = make_artificial_dataset(sequence_length, n_sequences, n_swaps)
print('Negative examples:')
train_neg_seqs_orig, test_neg_seqs_orig, train_neg_seqs, test_neg_seqs = make_artificial_dataset(sequence_length, n_sequences, n_swaps)

Positive examples:
Seed:  dgbcefah
Sample with random capitalization: ['DgBcEFaH', 'ghbCEFad', 'egBhDFAc', 'cdBgeFaH', 'DGbceFAh', 'bCdGEfAH', 'dFBCAgEh']
Negative examples:
Seed:  afbdhgce
Sample with random capitalization: ['AfbdhGce', 'afcdeGBh', 'dfBGhacE', 'aFBDhGCE', 'ahbcfgde', 'afGdEBch', 'efBdhgcA']


##Discriminative model on categorical labels

In [11]:
%%time
#let's estimate the predictive performance of a classifier on the original sequences
print('Predictive performance on original sequences')
discriminative_estimate(train_pos_seqs_orig, train_neg_seqs_orig, test_pos_seqs_orig, test_neg_seqs_orig)
print('\n\n')
#let's estimate the predictive performance of a classifier on the capitalized sequences
print('Predictive performance on sequences with random capitalization')
discriminative_estimate(train_pos_seqs, train_neg_seqs, test_pos_seqs, test_neg_seqs)

Predictive performance on original sequences




Test set
Instances: 52 ; Features: 1048577 with an avg of 63 features per instance
--------------------------------------------------------------------------------
Test Estimate
             precision    recall  f1-score   support

         -1       0.85      0.85      0.85        26
          1       0.85      0.85      0.85        26

avg / total       0.85      0.85      0.85        52

APR: 0.897
ROC: 0.901



Predictive performance on sequences with random capitalization
Test set
Instances: 52 ; Features: 1048577 with an avg of 63 features per instance
--------------------------------------------------------------------------------
Test Estimate
             precision    recall  f1-score   support

         -1       0.70      0.81      0.75        26
          1       0.77      0.65      0.71        26

avg / total       0.74      0.73      0.73        52

APR: 0.853
ROC: 0.848
CPU times: user 2.9 s, sys: 698 ms, total: 3.6 s
Wall time: 29.5 s


**Note:** as expected, the capitalization makes the predictive task harder, since it expands the vocabulary size and adds variation that looks random.
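To make the vocabulary expansion concrete, the sketch below counts distinct symbols and distinct adjacent pairs (bigrams) before and after random capitalization. Bigrams are used here only as a simple stand-in for local features and are not the features EDeN computes. The helper names are illustrative.

```python
import random

rng = random.Random(0)

def random_capitalize_one(seq, p, rng):
    """Upper-case each character independently with probability p."""
    return ''.join(c.upper() if rng.random() < p else c for c in seq)

def bigrams(seqs):
    """Set of distinct adjacent character pairs over a list of sequences."""
    return set(s[i:i + 2] for s in seqs for i in range(len(s) - 1))

originals = ['dgbcefah'] * 50  # 50 copies of the positive seed shown earlier
capitalized = [random_capitalize_one(s, 0.5, rng) for s in originals]

n_symbols_orig = len(set(''.join(originals)))
n_symbols_cap = len(set(''.join(capitalized)))
n_bigrams_orig = len(bigrams(originals))
n_bigrams_cap = len(bigrams(capitalized))
print(n_symbols_orig, n_symbols_cap, n_bigrams_orig, n_bigrams_cap)
```

A single sequence of 8 distinct characters has 8 symbols and 7 bigrams. With capitalization the same sequence can show up to 16 symbols and up to 7 × 4 = 28 bigrams, so the same number of training examples has to cover a much larger feature space.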

##Discriminative model on real-valued vector labels

In [15]:
#let's make a vector encoding for the chars, simply using a random encoding
#and a small amount of noise for the capitalized versions

#we can generate a few encodings and let the algorithm choose the best one
encodings, default_encoding = make_encodings(n_encodings, encoding_vector_dimension, sequence_length, noise_size)

In [16]:
#let's define the 3 main machines: 1) pre_processor, 2) vectorizer, 3) estimator

#the pre_processor takes the raw format and makes graphs
def pre_processor( seqs, encoding=None, default_encoding=None, **args ):
    #convert sequences to path graphs
    from eden.converter.graph.sequence import sequence_to_eden
    graphs = sequence_to_eden(seqs)
    
    #relabel nodes with corresponding vector encoding
    from eden.modifier.graph.vertex_attributes import translate 
    graphs = translate(graphs, label_map = encoding, default = default_encoding)
    
    return graphs  

#the vectorizer takes graphs and makes sparse vectors
from eden.graph import Vectorizer
vectorizer = Vectorizer()

#the estimator takes a sparse data matrix and a target column vector and makes a predictive model 
from sklearn.linear_model import SGDClassifier
estimator = SGDClassifier(class_weight='auto', shuffle=True)

#the model takes a pre_processor, a vectorizer, an estimator and returns the predictive model
from eden.model import ActiveLearningBinaryClassificationModel
model = ActiveLearningBinaryClassificationModel(pre_processor=pre_processor, 
                                                estimator=estimator, 
                                                vectorizer=vectorizer, 
                                                fit_vectorizer=True )
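
The relabeling step in `pre_processor` can be pictured as a dictionary lookup with a fallback. The sketch below works on plain strings rather than graphs and assumes that `translate` behaves as its argument names `label_map` and `default` suggest; the function `relabel` and the toy values are hypothetical.

```python
def relabel(seq, label_map, default):
    """Replace each character by its vector; unknown characters get `default`."""
    return [label_map.get(c, default) for c in seq]

# toy 2-dimensional encoding, values invented for illustration
toy_encoding = {'a': [0.10, 0.90], 'A': [0.12, 0.91], 'b': [0.80, 0.20]}
toy_default = [0, 0]

# '?' has no entry in toy_encoding, so it falls back to toy_default
vectors = relabel('aAb?', toy_encoding, toy_default)
print(vectors)
```

After this step the classifier no longer sees 'a' and 'A' as unrelated symbols: their labels are nearby points, so features built from them can be shared.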

In [17]:
#let's define the hyper-parameter value ranges
from numpy.random import randint
from numpy.random import uniform

pre_processor_parameters={'encoding':encodings, 'default_encoding':[default_encoding]}

vectorizer_parameters={'complexity':[complexity],
                       'n_discretization_levels':randint(3, 20, size=n_iter_search)}

estimator_parameters={'n_iter':randint(5, 100, size=n_iter_search),
                      'penalty':['l1','l2','elasticnet'],
                      'l1_ratio':uniform(0.1,0.9, size=n_iter_search), 
                      'loss':['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron'],
                      'power_t':uniform(0.1, size=n_iter_search),
                      'alpha': [10**x for x in range(-8,0)],
                      'eta0': [10**x for x in range(-4,-1)],
                      'learning_rate': ["invscaling", "constant", "optimal"]}
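
Each dictionary above maps a parameter name to a list (or array) of candidate values. A random search over such dictionaries amounts to drawing one candidate per parameter for every trial. The sketch below shows that idea in generic form; whether `model.optimize` samples in exactly this way is an assumption, and `sample_configuration` is a hypothetical helper.

```python
import random

def sample_configuration(parameter_ranges, rng):
    """Draw one value per parameter, uniformly from its list of candidates."""
    return dict((name, rng.choice(list(values)))
                for name, values in sorted(parameter_ranges.items()))

rng = random.Random(0)
# reduced ranges for illustration, in the same format as estimator_parameters
toy_ranges = {'penalty': ['l1', 'l2', 'elasticnet'],
              'alpha': [10 ** x for x in range(-8, 0)],
              'complexity': [2]}

# one configuration per search iteration
configurations = [sample_configuration(toy_ranges, rng) for _ in range(30)]
print(configurations[0])
```

A parameter with a single candidate, such as `complexity` here, is effectively held fixed across all trials.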

##Model Auto Optimization

In [22]:
from eden.util import configure_logging
import logging
configure_logging(logging.getLogger(),verbosity=1)

In [23]:
%%time
#optimize hyperparameters and fit a predictive model

#determine optimal parameter configuration
model.optimize(train_pos_seqs, train_neg_seqs,
               model_name='my_seq.model', 
               n_active_learning_iterations=0,
               n_iter=n_iter_search, cv=3,
               pre_processor_parameters=pre_processor_parameters, 
               vectorizer_parameters=vectorizer_parameters, 
               estimator_parameters=estimator_parameters)

#print optimal parameter configuration
print(model.get_parameters())

#evaluate predictive performance
apr, roc = model.estimate(test_pos_seqs, test_neg_seqs)



	Iteration: 1/30 (after 1.6 sec; 0:00:01.624881)
Best score (roc_auc): 0.813 (0.899 +- 0.086)

Data:
Instances: 50 ; Features: 1048577 with an avg of 517 features per instance
class: 1 count:25 (0.50)	class: -1 count:25 (0.50)	

	Model parameters:

Pre_processor:
default_encoding: [0, 0, 0, 0, 0, 0, 0, 0, 0]
  encoding: {'a': [0.60132554724186549, 0.8631803661199251, 0.40420333955508791, 0.061365764853678395, 0.1269206253692915, 0.10386918941107937, 0.62089986283970311, 0.1697580667768378, 0.21587501676108245], 'A': [0.65022318265111645, 0.86912060291309101, 0.44425605078408009, 0.064745313713752681, 0.15240438849998827, 0.13461961331590541, 0.65820273873693247, 0.19996591601087288, 0.21940676979574097], 'c': [0.48674198105432087, 0.4253183530023732, 0.07527793126056781, 0.22434226478314656, 0.29682116536138714, 0.49688532896919957, 0.024118609267932523, 0.38621563341208975, 0.28462328713849172], 'B': [0.72433964909888227, 0.46636688978674029, 0.76701010379428503, 0.36851552745793392