# Random Forest playground 
## My goals are
- Explore RF with tranditional decision trees (kd-trees)
- Explore RF approach with RP-trees
- Explore RF with oblique splits (PCA, LDA, etc)
- Explore RF with preprocessings
 * Random rotation (Vempala)
 * Randomer forest
 * Output space random projections

Building of random forest (bagging technique) is adapted from tutorial: http://machinelearningmastery.com/implement-random-forest-scratch-python/
and part of the design of my spatial-tree class is inspired by "spatialtree" package:https://github.com/bmcfee/spatialtree
(note: I believe my design is better than that of Brian McFee, since it is more concise and flexible)

On sources of data compression:
 - Forest has three sources of data compression, which may or may not involve randomness: 
  * preprocessing such as PCA, random projection, random rotation, etc, which can be applied either for a single tree or for the entire forest and achieves
    - feature selection/extraction for the entire forest
      * This includes the theoretically justified randomly oriented kd-tree in [Vempala 12]
    - feature selection/extraction for a single tree
  * data subsampling for each tree, which adds randomness
  * feature selection/extraction at each tree node (before splitting direction and value are determined)

 - A generalized forest where methods to achieve compression vary at the node level in [Tomita, Maggioni, Vogelstein 16], where at each step a p by d projection matrix A is chosen; this subsumes
  * Forest-IC, Forest-RC in Breiman: sparsity constrained projection matrix sampling
  * Rotation forest: deterministic projection matrix via PCA
  * Randomer forest: sparse projection matrix via "very sparse random projections" 
  * RP-tree proposed in Dasgupta can also be viewed as a special case here, where the matrix A is a 1-D dense or sparse projection vector
(note: when d>1, after projection, the splitting rule must be able to pick a coordinate for splitting, so the first three approaches are more suitable for supervised tasks where coordinate-selection can be made via labels)

In [97]:
import urllib
from six.moves import cPickle as pickle
import random 
from csv import reader
from math import sqrt
from math import floor
import os
import numpy as np
from sklearn import random_projection
from pandas import DataFrame
from IPython.display import display

In [32]:
###---------- Utility functions
## computing spread of one-dim data
def spread_1D(data_1D):
    return np.max(data_1D)-np.min(data_1D)

## computing data diamter of a cell
def data_diameter(data):
    """
    Input data is assumed to be confined in the
    desired cell
    """
    dist, indx, indy = 0, None, None
    for i in range(data.shape[0]):
        for j in range(i+1, data.shape[0]):
            dist_new = np.linalg.norm(data[i,:]-data[j,:])
            if dist_new > dist:
                dist, indx, indy = dist_new, i, j
    return dist, indx, indy

###---------- compressive projection matrix designs
def comp_projmat(data, **kwargs):
    """
    returns a projection matrix
    """
    namelist = ['breiman','ho','tomita', 'dasgupta']
    assert kwargs['name'] in namelist, "No such method for constructing projection matrix!"
    
    if kwargs['name'] == 'breiman':
        ## Breiman's Forest-IC and Forest-RC
        s = kwargs['sparsity']
        d = kwargs['target_dim']
        A = np.zeros((data.shape[1],d))
        ## sample sparsity-constrained A
        for i in range(d):
            ind = np.random.choice(data.shape[1], size=s, replace=False)
            if s == 1:
                A[ind,i] = 1
            else:
                for j in range(len(ind)):
                    A[ind[j], i] = np.random.uniform(-1,1)
    
    elif kwargs['name'] == 'ho':
        ## rotation forest
        d = kwargs['target_dim']
        ## find A by PCA
    elif kwargs['name'] == 'tomita':
        ## randomer forest
        d = kwargs['target_dim']
        ## sample sparse A via very sparse rp
        density = 1/(data.shape[1]**(1/2)) #default density value
        if 'density' in kwargs:
            if kwargs['density'] <= 1 and kwargs['density']>0:
                density = kwargs['density']
        
        transformer = random_projection.SparseRandomProjection(n_components=d, density=density)    
        transformer.fit(data)
        A = transformer.components_.copy()
        A = A.T
        
    else:
        ## dasgupta rp-tree 
        d = 1 # default to a random vector          
        if 'target_dim' in kwargs:
            d = kwargs['target_dim']
        n_features = data.shape[1]
        A = np.zeros((data.shape[1], d))
        # sample dense projection matrix
        for i in range(d):
            A[:,i] = np.random.normal(0, 1/np.sqrt(n_features), n_features)

    return A

#######-------split designs

## cart splits
def cart_split(data, proj_mat, labels=None, regress=False):
    # test for the best split feature and threshold on data CART criterion
    data_trans = np.dot(data, proj_mat) # n-by-d
    score, ind, thres = -999, None, None
    for i in range(data_trans.shape[1]):
        if not regress:
            # classification
            score_new, thres_new = cscore(data_trans[:,i], labels)
        else:
            # regression
            score_new, thres_new = rscore(data_trans[:,i], labels)
        if score_new > score:
            score = score_new
            ind = i
            thres = thres_new
    w = proj_mat[:,ind]
    return score, w, thres

def cscore(data_1D, labels):
    ## cart classification criterion
    n = len(labels)
    p1 = np.mean(labels)
    data_sorted, ind_sorted = np.sort(data_1D), np.argsort(data_1D)
    score, thres = -999, None
    for i in range(1,n):
        cell_l = ind_sorted[:i]
        cell_r = ind_sorted[i:]
        split_val = data_sorted[i]
        p1_l = np.mean(labels[cell_l])
        p1_r = np.mean(labels[cell_r])
        n_l = len(cell_l)
        score_new = p1*(1-p1) - (n_l/n)*(1-p1_l)*p1_l - (n-n_l)/n*(1-p1_r)*p1_r  
        if score_new > score:
            score = score_new
            thres = split_val
    return score, thres

def rscore(data_1D, labels):
    ## cart regression criterion
    n = len(labels)
    ybar = np.mean(labels)
    data_sorted, ind_sorted = np.sort(data_1D), np.argsort(data_1D)
    score, thres = -999, None
    for i in range(1,n):
        cell_l = ind_sorted[:i]
        cell_r = ind_sorted[i:]
        split_val = data_sorted[i]
        ybar_l = np.mean(labels[cell_l])
        ybar_r = np.mean(labels[cell_r])
        score_new =(np.sum((labels-ybar)**2)-np.sum((labels[cell_l]-ybar_l)**2)\
         -np.sum((labels[cell_r]-ybar_r)**2))/n
        if score_new > score:
            score = score_new
            thres = split_val
    return score, thres

##### median splits

def median_split(data, proj_mat, labels=None):
    data_transformed = np.dot(data, proj_mat)
    if proj_mat.ndim > 1:
        score, ind = 0, 0
        for i in range(proj_mat.shape[1]):
            score_new = spread_1D(data_transformed[:,i])
            if score_new > score:
                score, ind = score_new, i
        w = proj_mat[:, ind]
        thres = np.median(data_transformed[:,ind])
    else:
        thres = np.median(data_transformed)
        w = proj_mat
        score = spread_1D(data_transformed)
    return score, w, thres

def median_perturb_split(data, proj_mat, node_height, labels=None, diameter=None, root_height=None):
    assert root_height is not None, "Please provide the level of the root!"
    if (node_height+root_height) % 2 == 0:
        # normal median splits
        return median_split(data, proj_mat, labels=labels)
    else:
        assert diameter is not None, "Diameter of the cell must be given!"
        # noisy splits
        data_transformed = np.dot(data, proj_mat)
        if proj_mat.ndim > 1:
            score, ind = 0, 0
            for i in range(proj_mat.shape[1]):
                score_new = spread_1D(data_transformed[:, i])
                if score_new > score:
                    score, ind = score_new, i
            w = proj_mat[:, ind]
            jitter = np.random.uniform(-1,1) * 6/np.sqrt(data.shape[1]) * diameter
            thres = np.median(data_transformed[:,ind])+jitter
        
        else:
            jitter = np.random.uniform(-1,1) * 6/np.sqrt(data.shape[1]) * diameter
            thres = np.median(data_transformed)+jitter
            w = proj_mat
            score = spread_1D(data_transformed)
        
    
        return score, w, thres

## 2-means split
def two_means_split(data, proj_mat, labels=None):
    # this essentially defines a hierarchical clustering on data
    return score, w, thres

#######-------stop rules
def naive_stop_rule(data, height=None):
    if data.shape[0] < 4:
        return True
    if height > 10:
        ## DO NOT ever make it exceed 15!!!
        return True
    
    return False

def cell_size_rule(data, target_diameter=None):
    ddiameter,_,_ = data_diameter(data)
    
    if ddiameter <= target_diameter:
        return True
    
    else:
        return False
    

In [15]:
## A class of binary spatial-trees 
        
class flex_binary_trees(object):
    """
    A recursive data structure based on binary trees
     - at each node, it contains data, left, right child (or none if leaf), just as any other binary tree
     - it also knows its height
     - additionally, it has meta information about split direction and split threshold
     - to incorporate the use of master-slave trees (see below), it also has an optional reference
      to a master tree
     
    On splitting method
     - if rpart or cpart are used, labels must be provided
    """

    def __init__(self, data, data_indices=None, 
                 proj_design={'name':'projmat','params':{'name':'breiman','sparsity':3,'target_dim':10}}, 
                 split_design={'name':'cart', 'params':{'regress':False}}, 
                 stop_design={'name':'naive'}, 
                 height=0, labels=None, master_tree=None):
        """
        data: n by d matrix, the entire dataset assigned to the tree
        data_indices: the subset of indices assigned to this node
        proj_design: A dictionary that contains name and params of a method; 
          the method returns one or more splitting directions (projection matrix)
        split_design: A dict that contains name and params of a function;
          the function s.t. given 1D projected data, it must return the splitting threshold
        stop_rule: a boolean function of data_indices and height
        height: height of current node (root has 0 height)
        """
        self.data = data
        self.data_ind = data_indices
        if data_indices is None:
            self.data_ind = np.ones(data.shape[0], dtype=bool)
        self.proj_design = proj_design
        self.split_design = split_design 
        self.stop_design = stop_design
        
        self.height_ = height
        self.labels = labels
        self.master_tree = master_tree
        self.leftChild_ = None
        self.rightChild_ = None
            
            
        ## set stop rule: a boolean function
        if self.stop_design['name'] == 'naive':
            self.isLeaf = naive_stop_rule(self.data[self.data_ind,:], height=self.height_) 
            
        elif self.stop_design['name'] == 'cell_size':
            self.isLeaf = cell_size_rule(self.data[self.data_ind,:], 
                                         target_diameter=self.stop_design['diameter']/2)
            
    
    def proj_rule_function(self):
        """
        A function such that given name of a method, returns a projection vector (splitting direction)
        Can be override (user-defined)
        returns a n_features by n_projected_dim projection matrix, A
        """
        name_list = ['projmat', 'cyclic', 'full']
        
        method = self.proj_design['name']
        
        assert method in name_list, 'No such rule implemented in the current tree class!'
        
        if method == 'projmat':
            
            return comp_projmat(self.data[self.data_ind,:], **self.proj_design['params'])
        
        elif method == 'cyclic':
            # cycle through features using height information
            # here A is 1D
            n_features = self.data.shape[1]
            A = np.zeros(n_features)
            A[self.height_ % n_features] = 1
            return A
        else:
            # no compression, 'full'
            return np.eye(self.data.shape[1])
        
    def split_rule_function(self, A):
        name_list = ['cart', 'median', 'median_perturb', '2means']
        
        method = self.split_design['name']
        assert method in name_list, 'No such split rule implemented in current tree class!'
        
        if 'params' in self.split_design:
            params = self.split_design['params']
        else:
            params = dict()
        
        if method == 'cart':
            return cart_split(self.data[self.data_ind,:], A, self.labels[self.data_ind], **params)
        elif method == 'median':
            return median_split(self.data[self.data_ind,:], A, **params)
        elif method == 'median_perturb':
            node_height = self.height_
            return median_perturb_split(self.data[self.data_ind,:], A, node_height, **params)
        else:
            return two_means_split(self.data[self.data_ind,:], A, **params)
        
    
    def buildtree(self):
        """
        Recursively build a tree starting from current node as root
        Constructs w (projection direction) and threshold for each node
        """
        if not self.isLeaf:
            ## set projection/transformation/selection matrix
            A = self.proj_rule_function() 
            ## transform data to get one or more candidate features 
            #projected_data = np.dot(self.data[self.data_ind,:], self.w_)
    
            ## find the best split feature and the best threshold
            #_, self.thres_ = self.split_rule(projected_data) ###### change the input to *args in the future
            split_rule = self.split_design['name']
            _, self.w_, self.thres_ = self.split_rule_function(A)
            
            projected_data = np.dot(self.data[self.data_ind, :], self.w_) ## project data to 1-D
            
            data_indices = []
            for i in range(len(self.data_ind)):
                if self.data_ind[i] == 1:
                    data_indices.append(i)
            assert len(data_indices) == len(projected_data)
            data_indices = np.array(data_indices)
            
            ## split data into left and right
            left_indices = projected_data < self.thres_
            right_indices = projected_data >= self.thres_
            
            assert np.sum(left_indices)+np.sum(right_indices) == len(data_indices)
            
            left_ind = data_indices[left_indices]
            right_ind = data_indices[right_indices]
            ##
            n_data = self.data.shape[0]
            left = np.zeros(n_data, dtype=bool)
            left[left_ind] = 1
            right = np.zeros(n_data, dtype=bool)
            right[right_ind] = 1
            ## build subtrees on left and right partitions
            self.leftChild_ = flex_binary_trees(self.data, left, self.proj_design, self.split_design, self.stop_design, self.height_+1, self.labels)
            self.leftChild_.buildtree()
            self.rightChild_ = flex_binary_trees(self.data, right, self.proj_design, self.split_design, self.stop_design, self.height_+1, self.labels)
            self.rightChild_.buildtree()
            
        
    def train(self):
        self.buildtree()
        
    def predict_one(self, point, predict_type='class'):
        return predict_one_bt(self, point, predict_type=predict_type)
        
    def predict(self, test, predict_type='class'):
        return predict_bt(self, test, predict_type=predict_type)
            
            
####-------- Utility functions for binary trees   

def retrievalLeaf(btree, query):
    """
    Given a binary partition tree
    returns leaf node that contains query
    """
    if btree.leftChild_ is None and btree.rightChild_ is None:
        return btree
    if np.dot(btree.w_, query) < btree.thres_:
        return retrievalLeaf(btree.leftChild_, query)
    else:
        return retrievalLeaf(btree.rightChild_, query)
    
def retrievalSet(btree, query):
    """
    Given a binary partition tree
    returns indices of points in the cell that contains query point
    """
    ## base case: return data indices if leaf is reached    
    if btree.leftChild_ is None and btree.rightChild_ is None:
        return btree.data_ind
        
    ## check which subset does the query belong to
    if np.dot(btree.w_, query) < btree.thres_: 
        return retrievalSet(btree.leftChild_, query)
    else:
        return retrievalSet(btree.rightChild_, query)
        
def getDepth(btree, depth):
    """
    find out the depth (maximal height of all branch) of a binary tree
    depth is the depth of current node
    """
    ## via DFS
    if btree.leftChild_ is None and btree.rightChild_ is None:
        return depth
        
    depthL = getDepth(btree.leftChild_, depth+1)
    depthR = getDepth(btree.rightChild_, depth+1)
        
    if depthL >= depthR:
        return depthL
    else:
        return depthR
        
def predict_one_bt(btree, point, predict_type='class'):
    assert btree.labels is not None, "This tree has no associated data labels!"
    set_ind = retrievalSet(btree, point)
    
    if predict_type == 'class':
        if not list(set_ind):
            return round(np.mean(btree.labels))
        return round(np.mean(btree.labels[set_ind]))
        
    else:
        # regression
        if not list(set_ind):
            return np.mean(btree.labels)
        return np.mean(btree.labels[set_ind])
        
    
def predict_bt(btree, test, predict_type='class'):
    """
    Returns a list of predictions corresponding to test set
    """
    predictions = list()
    for point in test:
        predictions.append(predict_one_bt(btree, point, predict_type=predict_type))
    return predictions

        
def k_nearest(tree, query):
    """
    Given dataset organized in a tree structure  ----- TO BE IMPLEMENTED
    find the approximate k nearest neighbors of query point
    """
    return k_nearest
        
    
###
def printPartition(tree, level):
    """
    Starting from root of the tree, traverse each node at given level
    and print the partitioning it holds
    Can be used for testing purposes
    """
    if tree.height_ == level or (tree.leftChild_ is None and tree.rightChild_ is None):
        ind_set = []
        for i in range(len(tree.data_ind)):
            if tree.data_ind[i] == 1:
                ind_set.append(i)
        print(ind_set)
    else:
        printPartition(tree.leftChild_, level)
        printPartition(tree.rightChild_, level)
        
def traverseLeaves(tree):
    
    if tree.leftChild_ is None and tree.rightChild_ is None:
        yield tree
    
    if tree.leftChild_ is not None:
        for t in traverseLeaves(tree.leftChild_):
            yield t
    if tree.rightChild_ is not None:
        for t in traverseLeaves(tree.rightChild_):
            yield t
        


In [4]:
#######------- A class of master trees inspired by Kpotufe's adaptive tree structure
## these are not binary trees but are "meta-trees" built on binary trees
## it is used to prune a binary tree as we grow it

class master_trees(object):
    
    def __init__(self, data, labels=None,slave_tree=None,
                 slave_tree_params={'name':'coreRP','repeat':0}, 
                 curr_height=0, max_height=5):
        self.data = data
        self.labels = labels
        self.slave_tree_params={'name':'coreRP','repeat':0}
        self.curr_height = curr_height
        self.max_height = max_height
        self.leaves_list = list()
        
        ## initialize slave tree
        if slave_tree is None:
            if slave_tree_params['name'] == 'coreRP':
                data_ind = np.ones(self.data.shape[0], dtype=bool)
                proj_design = {'name':'projmat', 'params':{'name':'dasgupta'}}
            
                ddiameter,_,_ = data_diameter(self.data)
                split_design = {'name':'median_perturb', 'params':{'root_height':curr_height, 'diameter':ddiameter}}
                stop_design = {'name':'cell_size', 'diameter':ddiameter}
            
                self.slave_tree = flex_binary_trees(self.data, data_ind, proj_design=proj_design, 
                                                split_design=split_design, stop_design=stop_design, 
                                                height=self.curr_height, labels=self.labels, master_tree=self)
        else:
            self.slave_tree = slave_tree
            self.slave_tree.master_tree = self
            
            
        
    def grow_leaves(self):
        self.slave_tree.buildtree()
        depth = getDepth(self.slave_tree, 0)
        
        for i in range(self.slave_tree_params['repeat']):
            ## select tree with shortest depth
            slave_tree = flex_binary_trees(self.data, data_ind, proj_design=proj_design, split_design=split_design, 
                              stop_design=stop_design, height=self.curr_height, labels=self.labels, master_tree=self)
            slave_tree.buildtree()
            if depth > getDepth(slave_tree, 0):
                self.slave_tree = slave_tree
        
        
    def iter_leaves(self):
        return traverseLeaves(self.slave_tree)
        
    def test_stop(self):
        if getDepth(self.slave_tree, self.curr_height) >= self.max_height:
            return True
        else:
            return False
        
    def build_master_trees(self):
        self.grow_leaves()
        leaves_gen = self.iter_leaves() # up to this point, nodes are binary trees
        
        for leaf in leaves_gen:
            new_master_tree = master_trees(data, labels=self.labels, slave_tree=leaf,
                                           slave_tree_params=self.slave_tree_params, 
                                            curr_height=leaf.height_ ,max_height=self.max_height )
            self.leaves_list.append(new_master_tree)
            if not self.test_stop():
                ## if max height is not achieved in any cell of slave tree
                # continue grow it
                new_master_tree.build_master_trees()
                
    def train(self):
        ## interface with cross-validation evalutaion
        self.build_master_trees()
        
    def predict_one(self, point, predict_type='class'):
        return predict_one_mt(self, point, predict_type=predict_type)
     
    def predict(self, test, predict_type='class'):
        ## interface with cross-validation evaluation
        return predict_mt(self, test, predict_type=predict_type)

            

####----------- Utility functions for master trees

def retrievalLeaf_mtree(mtree, query):
    ## base case
    if not mtree.leaves_list:
        return mtree
    
    ##
    slave_leaf = retrievalLeaf(mtree.slave_tree, query) #find the leaf containing query using its binary slave tree
    return retrievalLeaf_mtree(slave_leaf.master_tree, query) #recurse on master tree of the found leaf

def traverseLeaves_mtree(mtree):
    if not mtree.leaves_list:
        yield mtree
    
    for leaf in mtree.iter_leaves():
        traverseLeaves_mtree(leaf.master_tree)
        
        
def predict_one_mt(mtree, point, predict_type='class'):
    mcell = retrievalLeaf_mtree(mtree, point)
    set_ind = mcell.slave_tree.data_ind
    
    if predict_type == 'class':
        #print(round(np.mean(mcell.slave_tree.labels[set_ind])))
        if not list(set_ind):
            # if set index is an empty list
            return round(np.mean(mcell.slave_tree.labels))
        return round(np.mean(mcell.slave_tree.labels[set_ind]))
        
    else:
        # regression
        if not list(set_ind):
            return np.mean(mcell.slave_tree.labels)
        return np.mean(mcell.slave_tree.labels[set_ind])
    
def predict_mt(mtree, test, predict_type='class'):
    """
    Returns a list of predictions corresponding to test set
    """
    predictions = list()
    for point in test:
        predictions.append(predict_one_mt(mtree, point, predict_type=predict_type))
    return predictions        
            

In [5]:
## Some helper functions

def subsample(data, n_samples, n_features=None):
    """
    sample WITH replacement
    """
    ind_data = np.random.choice(data.shape[0], size=n_samples, replace=True)
    if n_features is not None:
        ind_features = np.random.choice(data.shape[1], size=n_features, replace=True)
        return data[ind_data, ind_features]
        
    return data[ind_data,:] 

def cross_valid_split(n_data, n_folds):
    """
    Given size of the data and number of folds
    Returns n_folds disjoint sets of indices, where indices
    in each fold are chosen u.a.r. without replacement
    """
    data_ind = range(n_data)
    folds = list()
    fold_size = floor(n_data/n_folds)
    for i in range(n_folds):
        if i < n_folds-1:
            fold = list()
            while len(fold) <= fold_size:
                index = random.randrange(len(data_ind))
                fold.append(data_ind.pop(index))
            folds.append(fold)
        else:
            ## assign all remaining data to the last fold
            folds.append(data_ind)
    return folds

def zero_one_loss(labels, predictions):
    correct = 0
    for i in range(len(labels)):
        if labels[i] == predictions[i]:
            correct += 1
    loss = (len(labels)-correct)/float(len(labels))
    return loss

def explained_var_loss(labels, predictions):
    #res_var = np.sum(np.array([diff**2 for diff in labels-predictions]))
    res_var = np.var(np.array(labels)-np.array(predictions))
    tot_var = np.var(np.array(labels))
    
    return 1-res_var/tot_var

def l2_loss(labels, predictions):
    loss = np.linalg.norm(np.array(labels)-np.array(predictions))
    return loss/(len(labels)**(1/2))

def csize_decrease_rate(data, tree):
    """
    This is an unsupervised evaluation, which tries to capture how fast
    the data size of a cell decreases after building a tree
    """
    diam_s,_,_ = data_diameter(data)
    diam_f = 0
    
    if hasattr(tree, 'slave_tree'):
        ## if this is a master tree
        for leaf in traverseLeaves_mtree(tree):
            diam, _, _ = data_diameter(leaf.slave_tree.data[leaf.slave_tree.data_ind, :])
            if diam > diam_f:
                diam_f = diam
            
    else:
        ## if this is normal binary tree
        for leaf in traverseLeaves(tree):
            diam, _, _ = data_diameter(leaf.data[leaf.data_ind,:])
            if diam > diam_f:
                diam_f = diam
        
    return (diam_s/diam_f)/getDepth(tree)
        

def cross_valid_eval(data, labels, n_folds, loss, algorithm, **kwargs):
    """
    Given data and labels, a loss function, and a method
    generate a list of cv-losses
    
    """
    ## generate random folds
    folds_ind = cross_valid_split(data.shape[0], n_folds)
    losses = list()
    for fold_ind in folds_ind:
        test_ind = fold_ind
        folds_ind_ = list(folds_ind) # this ensures we are not modifying the original list!
        folds_ind_.remove(fold_ind)
        train_ind = [item for sublist in folds_ind_ for item in sublist] #flatten remaining index set
        data_tr = data[train_ind,:]
        labels_tr = labels[train_ind]
        data_tt = data[test_ind,:]
        labels_tt = labels[test_ind]
        # train the algorithm 
        alg = algorithm(data_tr, labels=labels_tr, **kwargs) #init
        alg.train()
        # calculate loss
        losses.append(loss(labels_tt, alg.predict(data_tt)))
        #print(labels[0],alg.predict(data_tt)[0])
        del alg
    return losses
        
"""
Exp
"""

class forest(object):
    def __init__(self, data, labels=None, tree_type=None, predictor_type='regress',
                  n_trees=10, n_samples=100, n_features=None):
        """
        tree_type: A dictionary containing 
          - a proj_design dictionary, 
          - a split_design dictionary,
          - a stop_rule method
        
        """
        self.data = data
        self.labels = labels
        self.tree_type = tree_type
        self.predictor_type = predictor_type
        self.n_trees = n_trees
        self.trees = list() # store trees for re-use
        self.n_samples = n_samples
        self.n_features = n_features
        
        
    def reset_sample_size(self, n_samples=None, n_features=None):
        if n_samples is not None:
            self.n_samples = n_samples
        if n_features is not None:
            self.n_features = n_features
            
    def reset_predictor_type(self, method):
        self.predictor_type = method
        
    def build_forest_classic(self, isPredict=True):
        if isPredict:
            assert labels is not None, "Data labels missing!"
        tree_type = self.tree_type['tree']
        for i in range(self.n_trees):
            #sample data points with replacement
            data_ind = np.random.choice(self.data.shape[0], self.n_samples, replace=True)
            data_tree = self.data[data_ind,:] # data unique to this tree
            
            if self.n_features is not None:
                ## optionally subsample features
                feature_ind = np.random.choice(self.data.shape[1], self.n_features, replace=True)
                data_tree = data_tree[:,feature_ind] #features unique to this tree
            
            if isPredict:
                labels_tree = self.labels[data_ind]
            else:
                labels_tree = None
            
            if tree_type == 'flex':
                proj_design, split_design, stop_design = self.tree_type['proj_design'],\
                   self.tree_type['split_design'],self.tree_type['stop_design']
                self.trees.append(flex_binary_trees(data_tree, np.ones(data_tree.shape[0], dtype=bool), 
                                        proj_design,split_design, stop_design, labels=labels_tree))
            else:
                ## use Kpotufe's adaptive RP tree
                self.trees.append(master_trees(data_tree, labels=labels_tree))
    
    
    def build_forest_with_tree_preproc(self, method):
        pass
        
    def build_forest_with_forest_preproc(self, method):
        pass
    
    
    def train(self):
        self.build_forest_classic(isPredict=True)
    
    def predict_one(self, point):
        """
        Predictor_type can be either 'class' for classification,
        'regress' for regression, or a user-defined callable function
        """
        assert self.trees, "You must first build a forest"
        
        if self.predictor_type == 'class':
            ## binary classification
            avg_predict = 0
            for tree in self.trees:
                avg_predict += tree.predict_one(point, predict_type='class')
            if avg_predict/float(len(self.trees)) > 0.5:
                return 1
            else:
                return 0
        elif self.predictor_type == 'regress':
            ## regression
            avg_predict = 0
            for tree in self.trees:
                avg_predict += tree.predict_one(point, predict_type='regress')
            return avg_predict/float(len(self.trees))
        else:
            print("Unrecognized prediction method!")
           
    def predict(self, test):
        predictions = list()
        for point in test:
            predictions.append(self.predict_one(point))
        return predictions
                
        
            
            

In [6]:
# Utility function: Get data from url link and store
def download_data(fname, url_name, force = False):
    if force or not os.path.exists(fname):
        print('Downloading data from the internet...')
        try:
            urllib.urlretrieve(url_name, fname)
        except Exception as e:
            print("Unable to retrieve file from given url")
    else:
        print("File already exists")
        
        
    

# Utility function: pickle or get pickled data with desired dataname
def maybe_pickle(dataname, data = None, force = False, verbose = True):
    """
    Process and pickle a dataset if not present
    """
    filename = dataname + '.pickle'
    if force or not os.path.exists(filename):
        # pickle the dataset
        print('Pickling data to file %s' % filename)
        try:
            with open(filename, 'wb') as f:
                pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
        except Exception as e:
            print('Unable to save to', filename, ':', e) 
    else:
        print('%s already present - Skipping pickling.' % filename)
        with open(filename, 'rb') as f:
            data = pickle.load(f)

    return data


In [7]:
## Download sonar dataset for classification problems
fname = "sonar.all_data.csv"
url_name = "https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data"
download_data(fname, url_name)

File already exists


In [8]:
#### module for preprocessing raw data and pickling it

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset
 
# Convert string column to float
# this is for data conversion
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())
 
# Convert string
# this is for label conversion
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup


### load and preprocess data
fname = 'sonar.all_data.csv'
data = load_csv(fname)
for i in range(len(data[0])-1):
    str_column_to_float(data, i)
    
lookup = str_column_to_int(data, -1)
data = np.array(data)
## random shuffling
pInd = np.random.permutation(data.shape[0])
data = data[pInd]
X = data[:,:data.shape[-1]-1]
labels = data[:, -1]
print("Preprocessed data has shape %d by %d" % X.shape)
print("Preprocessed labels has length %d" % len(labels))
## pickle data
maybe_pickle('sonar.all_data', {'data':X, 'labels':labels}) 

Preprocessed data has shape 208 by 60
Preprocessed labels has length 208
sonar.all_data.pickle already present - Skipping pickling.


{'data': array([[ 0.0151,  0.032 ,  0.0599, ...,  0.0019,  0.0023,  0.0062],
        [ 0.0336,  0.0294,  0.0476, ...,  0.0015,  0.0069,  0.0051],
        [ 0.0195,  0.0213,  0.0058, ...,  0.0095,  0.0021,  0.0053],
        ..., 
        [ 0.0519,  0.0548,  0.0842, ...,  0.0047,  0.0048,  0.0053],
        [ 0.1371,  0.1226,  0.1385, ...,  0.0079,  0.0146,  0.0051],
        [ 0.0089,  0.0274,  0.0248, ...,  0.0069,  0.006 ,  0.0018]]),
 'labels': array([ 0.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,
         1.,  1.,  1.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,
         1.,  1.,  0.,  1.,  0.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,
         1.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,
         1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,
         0.,  0.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  0.,  0.,  1.,
         0.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  0.,
         1.,  1.,  0.,  1.,  1.,  0.,  0.

In [10]:
data_dict = maybe_pickle('sonar.all_data')
data = data_dict['data']
labels = data_dict['labels']
data_tr = data[:104,:]
labels_tr = labels[:104]
data_tt = data[104:,:]
labels_tt = labels[104:]

sonar.all_data.pickle already present - Skipping pickling.


In [49]:
### Test: CART-decision tree + Breiman's features selection/projection rule
sparsity = [1, 3, 10]
target_dim = [1, 5, 10]
stop_rule = ['naive', 'cell_size']
scores_c1 = list()
params_c1 = list()

##
for s in sparsity:
    for t_dim in target_dim:
        for s_rule in stop_rule:
            proj_design={'name':'projmat','params':{'name':'breiman','sparsity':s,'target_dim':t_dim}}
            split_design={'name':'cart'} 
            stop_design={'name':s_rule}
            if s_rule == 'cell_size':
                ddiam, _, _ = data_diameter(data_tr)
                stop_design['diameter'] = ddiam
            kwargs = {'proj_design':proj_design, 'split_design':split_design, 'stop_design':stop_design}
            scores_c1.append(cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, flex_binary_trees, **kwargs))
            params_c1.append([s, t_dim, s_rule])
            
#scores = cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, flex_binary_trees)
#score1 = np.mean(scores)

## best param: 3, 10, cell_size

In [106]:
s_dict = {'sparsity':list(), 'target_dim':list(), 'stop_rule': list(), 'zero-one-loss':list()}
count = 0
for s in sparsity:
    for t_dim in target_dim:
        for s_rule in stop_rule:
            s_dict['sparsity'].append(s)
            s_dict['target_dim'].append(t_dim)
            s_dict['stop_rule'].append(s_rule)
            s_dict['zero-one-loss'].append(np.mean(scores_c1[count]))
            count += 1  
s_dict_c1 = s_dict
df = DataFrame(s_dict_c1)

print('CART-DCTree-Breiman')
display(df)

CART-DCTree-Breiman


Unnamed: 0,sparsity,stop_rule,target_dim,zero-one-loss
0,1,naive,1,0.421429
1,1,cell_size,1,0.538095
2,1,naive,5,0.393333
3,1,cell_size,5,0.460952
4,1,naive,10,0.39381
5,1,cell_size,10,0.459524
6,3,naive,1,0.411905
7,3,cell_size,1,0.404762
8,3,naive,5,0.432381
9,3,cell_size,5,0.471429


In [55]:
### Test: CART-decision tree + RP-dense projection (dasgupta)
target_dim = [1, 5, 10]
stop_rule = ['naive', 'cell_size']
scores_c2 = list()
params_c2 = list()
##
for t_dim in target_dim:
    for s_rule in stop_rule:
        proj_design = {'name':'projmat','params':{'name':'dasgupta','target_dim':t_dim}}
        split_design = {'name':'cart'}
        stop_design = {'name':s_rule}
        if s_rule == 'cell_size':
                ddiam, _, _ = data_diameter(data_tr)
                stop_design['diameter'] = ddiam
        kwargs = {'proj_design':proj_design, 'split_design':split_design, 'stop_design':stop_design}
        scores_c2.append(cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, flex_binary_trees, **kwargs))
        params_c2.append([t_dim, s_rule])

## best param: 10, cell_size

In [108]:
s_dict = {'target_dim':list(), 'stop_rule': list(), 'zero-one-loss':list()}
count = 0

for t_dim in target_dim:
    for s_rule in stop_rule:
        s_dict['target_dim'].append(t_dim)
        s_dict['stop_rule'].append(s_rule)
        s_dict['zero-one-loss'].append(np.mean(scores_c2[count]))
        count += 1  
s_dict_c2 = s_dict
df = DataFrame(s_dict_c2)

print('CART-DCTree-RPdense')
display(df)

CART-DCTree-RPdense


Unnamed: 0,stop_rule,target_dim,zero-one-loss
0,naive,1,0.452857
1,cell_size,1,0.364762
2,naive,5,0.393333
3,cell_size,5,0.470476
4,naive,10,0.433333
5,cell_size,10,0.288571


In [58]:
### Test: CART-decision tree + RP-sparse projection (randomer forest paper)
target_dim = [1, 5, 10]
stop_rule = ['naive', 'cell_size']
scores_c3 = list()
params_c3 = list()
##
for t_dim in target_dim:
    for s_rule in stop_rule:
        proj_design = {'name':'projmat','params':{'name':'tomita','target_dim':t_dim}}
        split_design = {'name':'cart'}
        stop_design = {'name':s_rule}
        if s_rule == 'cell_size':
                ddiam, _, _ = data_diameter(data_tr)
                stop_design['diameter'] = ddiam
        kwargs = {'proj_design':proj_design, 'split_design':split_design, 'stop_design':stop_design}
        scores_c3.append(cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, flex_binary_trees, **kwargs))
        params_c3.append([t_dim, s_rule])

## best param: 1, cell_size

In [109]:
s_dict = {'target_dim':list(), 'stop_rule': list(), 'zero-one-loss':list()}
count = 0

for t_dim in target_dim:
    for s_rule in stop_rule:
        s_dict['target_dim'].append(t_dim)
        s_dict['stop_rule'].append(s_rule)
        s_dict['zero-one-loss'].append(np.mean(scores_c3[count]))
        count += 1  
s_dict_c3 = s_dict
df = DataFrame(s_dict_c3)

print('CART-DCTree-RPsparse')
display(df)

CART-DCTree-RPsparse


Unnamed: 0,stop_rule,target_dim,zero-one-loss
0,naive,1,0.396667
1,cell_size,1,0.26
2,naive,5,0.402857
3,cell_size,5,0.412857
4,naive,10,0.393333
5,cell_size,10,0.470476


In [60]:
## Test: Median split tree + Breiman's feature selection/projection rule
sparsity = [1, 3, 10]
target_dim = [1, 5, 10]
stop_rule = ['naive', 'cell_size']
perturb = [True, False]
scores_m1 = list()
params_m1 = list()

##
for s in sparsity:
    for t_dim in target_dim:
        for s_rule in stop_rule:
            for p in perturb:
                proj_design={'name':'projmat','params':{'name':'breiman','sparsity':s,'target_dim':t_dim}}
                if p:
                    ddiam,_,_ = data_diameter(data_tr)
                    split_design={'name':'median_perturb', 'params':{'diameter':ddiam, 'root_height':0}}
                else:
                    split_design = {'name':'median'}
                stop_design={'name':s_rule}
                if s_rule == 'cell_size':
                    ddiam, _, _ = data_diameter(data_tr)
                    stop_design['diameter'] = ddiam
                kwargs = {'proj_design':proj_design, 'split_design':split_design, 'stop_design':stop_design}
                scores_m1.append(cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, flex_binary_trees, **kwargs))
                params_m1.append([s, t_dim, s_rule, p])

## best param: 10,10,cell_size,True

In [114]:
sparsity = [1, 3, 10]
target_dim = [1, 5, 10]
stop_rule = ['naive', 'cell_size']
perturb = [True, False]
##
s_dict = {'sparsity':list(), 'target_dim':list(), 'stop_rule': list(), 'perturb':list(), 'zero-one-loss':list()}
count = 0
for s in sparsity:
    for t_dim in target_dim:
        for s_rule in stop_rule:
            for p in perturb:
                s_dict['sparsity'].append(s)
                s_dict['target_dim'].append(t_dim)
                s_dict['stop_rule'].append(s_rule)
                s_dict['perturb'].append(str(p))
                s_dict['zero-one-loss'].append(np.mean(scores_m1[count]))
                count += 1  
s_dict_m1 = s_dict
df = DataFrame(s_dict_m1)

print('MedianSplitTree-Breiman')
display(df)

MedianSplitTree-Breiman


Unnamed: 0,perturb,sparsity,stop_rule,target_dim,zero-one-loss
0,True,1,naive,1,0.36619
1,False,1,naive,1,0.384762
2,True,1,cell_size,1,0.346667
3,False,1,cell_size,1,0.404286
4,True,1,naive,5,0.490952
5,False,1,naive,5,0.442381
6,True,1,cell_size,5,0.328095
7,False,1,cell_size,5,0.335714
8,True,1,naive,10,0.432857
9,False,1,naive,10,0.365714


In [116]:
## Test: Median split tree + RP-dense projection
target_dim = [1, 5, 10]
stop_rule = ['naive', 'cell_size']
perturb = [True, False]
scores_m2 = list()
params_m2 = list()
## best params: 5, naive, True

##
for t_dim in target_dim:
    for s_rule in stop_rule:
        for p in perturb:
            proj_design={'name':'projmat','params':{'name':'dasgupta', 'target_dim':t_dim}}
            if p:
                ddiam,_,_ = data_diameter(data_tr)
                split_design={'name':'median_perturb', 'params':{'diameter':ddiam, 'root_height':0}}
            else:
                split_design = {'name':'median'}
            stop_design={'name':s_rule}
            if s_rule == 'cell_size':
                ddiam, _, _ = data_diameter(data_tr)
                stop_design['diameter'] = ddiam
            kwargs = {'proj_design':proj_design, 'split_design':split_design, 'stop_design':stop_design}
            scores_m2.append(cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, flex_binary_trees, **kwargs))
            params_m2.append([t_dim, s_rule, p])

In [118]:
target_dim = [1, 5, 10]
stop_rule = ['naive', 'cell_size']
perturb = [True, False]
##
s_dict = {'target_dim':list(), 'stop_rule': list(), 'perturb':list(), 'zero-one-loss':list()}
count = 0

for t_dim in target_dim:
    for s_rule in stop_rule:
        for p in perturb:
            s_dict['target_dim'].append(t_dim)
            s_dict['stop_rule'].append(s_rule)
            s_dict['perturb'].append(str(p))
            s_dict['zero-one-loss'].append(np.mean(scores_m2[count]))
            count += 1  
s_dict_m2 = s_dict
df = DataFrame(s_dict_m2)

print('MedianSplitTree-RPdense')
display(df)

MedianSplitTree-RPdense


Unnamed: 0,perturb,stop_rule,target_dim,zero-one-loss
0,True,naive,1,0.412381
1,False,naive,1,0.42381
2,True,cell_size,1,0.433333
3,False,cell_size,1,0.403333
4,True,naive,5,0.344762
5,False,naive,5,0.345714
6,True,cell_size,5,0.365238
7,False,cell_size,5,0.365238
8,True,naive,10,0.414762
9,False,naive,10,0.424286


In [62]:
## Test: Median split tree + RP-spare projection
target_dim = [1, 5, 10]
stop_rule = ['naive', 'cell_size']
perturb = [True, False]
scores_m3 = list()
params_m3 = list()
## best params: 5, naive, False

##
for t_dim in target_dim:
    for s_rule in stop_rule:
        for p in perturb:
            proj_design={'name':'projmat','params':{'name':'tomita', 'target_dim':t_dim}}
            if p:
                ddiam,_,_ = data_diameter(data_tr)
                split_design={'name':'median_perturb', 'params':{'diameter':ddiam, 'root_height':0}}
            else:
                split_design = {'name':'median'}
            stop_design={'name':s_rule}
            if s_rule == 'cell_size':
                ddiam, _, _ = data_diameter(data_tr)
                stop_design['diameter'] = ddiam
            kwargs = {'proj_design':proj_design, 'split_design':split_design, 'stop_design':stop_design}
            scores_m3.append(cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, flex_binary_trees, **kwargs))
            params_m3.append([t_dim, s_rule, p])

In [131]:
target_dim = [1, 5, 10]
stop_rule = ['naive', 'cell_size']
perturb = [True, False]
##
s_dict = {'target_dim':list(), 'stop_rule': list(), 'perturb':list(), 'zero-one-loss':list()}
count = 0

for t_dim in target_dim:
    for s_rule in stop_rule:
        for p in perturb:
            s_dict['target_dim'].append(t_dim)
            s_dict['stop_rule'].append(s_rule)
            s_dict['perturb'].append(str(p))
            s_dict['zero-one-loss'].append(np.mean(scores_m3[count]))
            count += 1  
s_dict_m3 = s_dict
df = DataFrame(s_dict_m3)

print('5 fold CV-loss of MedianSplitTree-RPsparse')
display(df)

5 fold CV-loss of MedianSplitTree-RPsparse


Unnamed: 0,perturb,stop_rule,target_dim,zero-one-loss
0,True,naive,1,0.373333
1,False,naive,1,0.385238
2,True,cell_size,1,0.453333
3,False,cell_size,1,0.43381
4,True,naive,5,0.431429
5,False,naive,5,0.307619
6,True,cell_size,5,0.403333
7,False,cell_size,5,0.450952
8,True,naive,10,0.395714
9,False,naive,10,0.365714


In [140]:
######-------- Testing adaptive tree construction rules with fixed params (chosen from best above)


### Test adaptive RP-tree (Kpotufe's paper)
#scores_k = cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, master_trees)
scores_k = cross_valid_eval(data_tt, labels_tt, 5, zero_one_loss, master_trees)
print('5 fold CV-loss of adaptive RP-tree in Kpotufe is %f' %np.mean(scores_k))

5 fold CV-loss of adaptive RP-tree in Kpotufe is 0.405714


In [141]:
### Test adaptive CART-decision tree + Breiman's features selection/projection rule
data_ind = np.ones(data_tr.shape[0], dtype=bool)
proj_design={'name':'projmat','params':{'name':'breiman','sparsity':3,'target_dim':10}}
ddiam, _, _ = data_diameter(data_tr)
split_design={'name':'cart'} 
stop_design={'name': 'cell_size', 'diameter':ddiam}
kwargs = {'proj_design':proj_design, 'split_design':split_design, 'stop_design':stop_design}
base_slave_tree = flex_binary_trees(data_tr, labels=labels_tr, **kwargs)
#scores_k1 = cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, master_trees, slave_tree=base_slave_tree)
scores_k1 = cross_valid_eval(data_tt, labels_tt, 5, zero_one_loss, master_trees, slave_tree=base_slave_tree)

In [142]:
print('5 fold CV-Loss of adaptive CART-Breiman is %f' %np.mean(scores_k1))

5 fold CV-Loss of adaptive CART-Breiman is 0.404286


In [143]:
### Test adaptive CART-decision tree + RP-dense projection rule
data_ind = np.ones(data_tr.shape[0], dtype=bool)
proj_design={'name':'projmat','params':{'name':'dasgupta','target_dim':10}}
split_design={'name':'cart'} 
ddiam, _, _ = data_diameter(data_tr)
stop_design={'name': 'cell_size', 'diameter':ddiam}
kwargs = {'proj_design':proj_design, 'split_design':split_design, 'stop_design':stop_design}
base_slave_tree = flex_binary_trees(data_tr, labels=labels_tr, **kwargs)
#scores_k2 = cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, master_trees, slave_tree=base_slave_tree)
scores_k2 = cross_valid_eval(data_tt, labels_tt, 5, zero_one_loss, master_trees, slave_tree=base_slave_tree)

In [144]:
print('5 fold CV-Loss of adaptive CART-RPdense is %f' %np.mean(scores_k2))

5 fold CV-Loss of adaptive CART-RPdense is 0.453333


In [145]:
### Test adaptive CART-decision tree + RP-sparse projection rule
data_ind = np.ones(data_tr.shape[0], dtype=bool)
proj_design={'name':'projmat','params':{'name':'tomita','target_dim':1}}
split_design={'name':'cart'} 
ddiam, _, _ = data_diameter(data_tr)
stop_design={'name': 'cell_size', 'diameter':ddiam}
kwargs = {'proj_design':proj_design, 'split_design':split_design, 'stop_design':stop_design}
base_slave_tree = flex_binary_trees(data_tr, labels=labels_tr, **kwargs)
#scores_k3 = cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, master_trees, slave_tree=base_slave_tree)
scores_k3 = cross_valid_eval(data_tt, labels_tt, 5, zero_one_loss, master_trees, slave_tree=base_slave_tree)

In [146]:
print('5 fold CV-Loss of adaptive CART-RPsparse is %f' %np.mean(scores_k3))

5 fold CV-Loss of adaptive CART-RPsparse is 0.414762


In [147]:
### Test adaptive median-split tree + Breiman's features selection/projection rule
data_ind = np.ones(data_tr.shape[0], dtype=bool)
proj_design={'name':'projmat','params':{'name':'breiman','sparsity':10,'target_dim':10}}
ddiam, _, _ = data_diameter(data_tr)
split_design={'name':'median_perturb', 'params':{'diameter':ddiam, 'root_height':0}} 
stop_design={'name': 'cell_size', 'diameter':ddiam}
kwargs = {'proj_design':proj_design, 'split_design':split_design, 'stop_design':stop_design}
base_slave_tree = flex_binary_trees(data_tr, labels=labels_tr, **kwargs)
#scores_km1 = cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, master_trees, slave_tree=base_slave_tree)
scores_km1 = cross_valid_eval(data_tt, labels_tt, 5, zero_one_loss, master_trees, slave_tree=base_slave_tree)

In [148]:
print('5 fold CV-Loss of adaptive MedianSplit-Breiman is %f' %np.mean(scores_km1))

5 fold CV-Loss of adaptive MedianSplit-Breiman is 0.363810


In [149]:
### Test adaptive median-split tree + RP-dense projection rule
data_ind = np.ones(data_tr.shape[0], dtype=bool)
proj_design={'name':'projmat','params':{'name':'dasgupta','target_dim':5}}
split_design={'name':'median'} 
stop_design={'name': 'naive'}
kwargs = {'proj_design':proj_design, 'split_design':split_design, 'stop_design':stop_design}
base_slave_tree = flex_binary_trees(data_tr, labels=labels_tr, **kwargs)
#scores_km2 = cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, master_trees, slave_tree=base_slave_tree)
scores_km2 = cross_valid_eval(data_tt, labels_tt, 5, zero_one_loss, master_trees, slave_tree=base_slave_tree)

In [150]:
print('5 fold CV-Loss of adaptive MedianSplit-RPdense is %f' %np.mean(scores_km2))

5 fold CV-Loss of adaptive MedianSplit-RPdense is 0.357143


In [151]:
### Test adaptive median-split tree + RP-sparse projection rule
data_ind = np.ones(data_tr.shape[0], dtype=bool)
proj_design={'name':'projmat','params':{'name':'tomita','target_dim':5}}
split_design={'name':'median'} 
ddiam, _, _ = data_diameter(data_tr)
stop_design={'name': 'cell_size','diameter':ddiam}
kwargs = {'proj_design':proj_design, 'split_design':split_design, 'stop_design':stop_design}
base_slave_tree = flex_binary_trees(data_tr, labels=labels_tr, **kwargs)
#scores_km3 = cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, master_trees, slave_tree=base_slave_tree)
scores_km3 = cross_valid_eval(data_tt, labels_tt, 5, zero_one_loss, master_trees, slave_tree=base_slave_tree)

In [152]:
print('5 fold CV-Loss of adaptive MedianSplit-RPsparse is %f' %np.mean(scores_km3))

5 fold CV-Loss of adaptive MedianSplit-RPsparse is 0.383333


In [88]:
#############--- Testing Forest ensembles + X tree method
n_trees_list = [10,100,500]
n_samples_list = [10, 50, 100, 200]
n_features = 5

In [91]:
## Test: forest + CART breiman-tree   

## Best params (they are roughly the same): 500, 200
proj_design={'name':'projmat','params':{'name':'breiman','sparsity':3,'target_dim':10}}
split_design={'name':'cart'} 
ddiam, _, _ = data_diameter(data_tr)
stop_design={'name':'cell_size', 'diameter':ddiam}
kwargs = {'tree_type':{"tree":'flex','proj_design':proj_design,'split_design':split_design,'stop_design':stop_design}, 
          'predictor_type':'class'}
scores_fc1 = list()
params_fc1 = list()
for n_trees in n_trees_list:
    for n_samples in n_samples_list:
        kwargs['n_trees'] = n_trees
        kwargs['n_samples'] = n_samples
        kwargs['n_features'] = n_features
        scores_fc1.append(cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, forest, **kwargs))
        params_fc1.append([n_trees, n_samples])


In [154]:
## Test error
proj_design={'name':'projmat','params':{'name':'breiman','sparsity':3,'target_dim':10}}
split_design={'name':'cart'} 
ddiam, _, _ = data_diameter(data_tr)
stop_design={'name':'cell_size', 'diameter':ddiam}
kwargs = {'tree_type':{"tree":'flex','proj_design':proj_design,'split_design':split_design,'stop_design':stop_design}, 
          'predictor_type':'class'}
scores_fc1_test = cross_valid_eval(data_tt, labels_tt, 5, zero_one_loss, forest, **kwargs)
display(np.mean(scores_fc1_test))

0.49190476190476184

In [93]:
## Test: forest + CART + RP-dense   

## best params (they are roughly the same): 10, 10
proj_design={'name':'projmat','params':{'name':'dasgupta', 'target_dim':10}}
split_design={'name':'cart'} 
ddiam, _, _ = data_diameter(data_tr)
stop_design={'name':'cell_size', 'diameter':ddiam}
kwargs = {'tree_type':{"tree":'flex','proj_design':proj_design,'split_design':split_design,'stop_design':stop_design}, 
          'predictor_type':'class'}
scores_fc2 = list()
params_fc2 = list()
for n_trees in n_trees_list:
    for n_samples in n_samples_list:
        kwargs['n_trees'] = n_trees
        kwargs['n_samples'] = n_samples
        kwargs['n_features'] = n_features
        scores_fc2.append(cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, forest, **kwargs))
        params_fc2.append([n_trees, n_samples])

In [155]:
## Test error
proj_design={'name':'projmat','params':{'name':'dasgupta', 'target_dim':10}}
split_design={'name':'cart'} 
ddiam, _, _ = data_diameter(data_tr)
stop_design={'name':'cell_size', 'diameter':ddiam}
kwargs = {'tree_type':{"tree":'flex','proj_design':proj_design,'split_design':split_design,'stop_design':stop_design}, 
          'predictor_type':'class'}
scores_fc2_test = cross_valid_eval(data_tt, labels_tt, 5, zero_one_loss, forest, **kwargs)
display(np.mean(scores_fc2_test))

0.51904761904761909

In [None]:
## Test: forest + CART + RP-sparse   
proj_design={'name':'projmat','params':{'name':'tomita', 'target_dim':1}}
split_design={'name':'cart'} 
ddiam, _, _ = data_diameter(data_tr)
stop_design={'name':'cell_size', 'diameter':ddiam}
kwargs = {'tree_type':{"tree":'flex','proj_design':proj_design,'split_design':split_design,'stop_design':stop_design}, 
          'predictor_type':'class'}
scores_fc3 = list()
params_fc3 = list()
for n_trees in n_trees_list:
    for n_samples in n_samples_list:
        kwargs['n_trees'] = n_trees
        kwargs['n_samples'] = n_samples
        kwargs['n_features'] = n_features
        scores_fc3.append(cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, forest, **kwargs))
        params_fc3.append([n_trees, n_samples])

In [138]:
## Test: Forest + adaptive CART Breiman

#### NEED TO FIRST MODIFY FOREST class to allow parameter passing for master trees
proj_design={'name':'projmat','params':{'name':'breiman','sparsity':3,'target_dim':10}}
split_design={'name':'cart'} 
ddiam, _, _ = data_diameter(data_tr)
stop_design={'name':'cell_size', 'diameter':ddiam}
kwargs = {'tree_type':{"tree":'master','proj_design':proj_design,'split_design':split_design,'stop_design':stop_design}, 
          'predictor_type':'class'}
scores_fc_k1 = list()
params_fc_k1 = list()

kwargs['n_trees'] = 10
kwargs['n_samples'] = 100

scores_fc_k1.append(cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, forest, **kwargs))
params_fc_k1.append([n_trees, n_samples])

In [157]:
proj_design={'name':'projmat','params':{'name':'breiman','sparsity':3,'target_dim':10}}
split_design={'name':'cart'} 
ddiam, _, _ = data_diameter(data_tr)
stop_design={'name':'cell_size', 'diameter':ddiam}
kwargs = {'tree_type':{"tree":'master','proj_design':proj_design,'split_design':split_design,'stop_design':stop_design}, 
          'predictor_type':'class'}

scores_fc_k1 = cross_valid_eval(data_tt, labels_tt, 5, zero_one_loss, forest, **kwargs)
display(np.mean(scores_fc_k1))

0.61666666666666659

In [None]:
## Test: forest+ adaptiveRP
kwargs = {'tree_type':{"tree":'master'},'predictor_type':'class','n_trees':100, 'n_samples':50, 'n_features':5}

scores = cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, forest, **kwargs)
score4 = np.mean(scores)


In [None]:
## Test: forest + naive RP
proj_design={'name':'projmat','params':{'name':'dasgupta'}}
split_design={'name':'median_perturb', 'params':{'regress':False}}
stop_design={'name':'naive'}
kwargs = {'tree_type':{"tree":'flex','proj_design':proj_design,'split_design':split_design,'stop_design':stop_design}, 
          'predictor_type':'class','n_trees':100, 'n_samples':50, 'n_features':5}

scores = cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, forest, **kwargs)
score5 = np.mean(scores)

In [None]:
## Test: forest + naive RP
proj_design={'name':'projmat','params':{'name':'dasgupta'}}
split_design={'name':'median_perturb', 'params':{'regress':False}}
stop_design={'name':'naive'}
kwargs = {'tree_type':{"tree":'flex','proj_design':proj_design,'split_design':split_design,'stop_design':stop_design}, 
          'predictor_type':'class','n_trees':100, 'n_samples':50, 'n_features':5}

scores = cross_valid_eval(data_tr, labels_tr, 5, zero_one_loss, forest, **kwargs)
score5 = np.mean(scores)

In [139]:
avg_scores = np.mean(np.array(scores_fc_k1), axis=1)
for res in zip(params_fc_k1, avg_scores):
    print(res)

([500, 200], 0.4238095238095238)


### My observations from preliminary experiments:
- Forest type estimators seem to perform uniformly worse than single-tree based estimators on sonar data
- Trees with adaptive pruning strategy (inspired by Kpotufe's paper) seems to perform significantly better than others, though it takes a lot more time 
- But combining forest with adaptive pruning doesn't help; it achieves similar bad performance as forest with naive stopping criteria