## Examples of WAVE Ensemble Model Fitting  

This notebook walks through examples of fitting ensemble models with WAVE.

### 1. Import WAVE and other modules

The **WAVE class** is implemented in the **wave.py** file. To import it, wave.py must be saved in the same directory as this notebook. The WAVE class is built on two other modules: **numpy** and **sklearn**. Both are imported at the top of wave.py, so the star import below also brings them into the notebook namespace (for example, numpy as `np`, which is used later in this notebook). We also import **pandas** for data processing.

In [1]:
# import wave
from wave import *

# import pandas
import pandas as pd
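
Python's standard library also ships a module named `wave` (for WAV audio files), so it can be worth confirming that the local wave.py is the one that was imported. The sketch below is only a sanity check, not part of WAVE itself; the helper name `describe_wave_import` is hypothetical.

```python
import importlib


def describe_wave_import():
    """Return the path of the imported `wave` module and whether it defines WAVE."""
    module = importlib.import_module("wave")
    path = getattr(module, "__file__", None)
    return path, hasattr(module, "WAVE")


path, has_wave_class = describe_wave_import()
print("wave module loaded from:", path)
print("defines a WAVE class:", has_wave_class)
```

If the second line prints `False`, the standard-library module was picked up instead of the local file, and the notebook should be started from the directory that contains wave.py.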

A simple description of WAVE class can be found by running the following cell:

In [2]:
?WAVE()

### 2. Example: Fit a Weight-Adjusted CERP (WACERP) Ensemble on Imprinting Data 

Weight-Adjusted CERP is an ensemble method designed for high-dimensional data. It applies WAVE to the CERP base ensemble. This example uses a high-dimensional data set called the imprinting data set, which is included in the repo as imp.txt.

**Load up the imprinting data as a data frame:**

In [3]:
imp = pd.read_csv("imp.txt", sep=" ")

**Check the first 5 instances of the imprinting data:**

In [4]:
imp.head()

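| | SIMREP.UPSS5 | SIMREP.UPSC5 | SIMREP.DNSC5 | SIMREP.DNSS5 | SIMREP.DNES5 | SIMREP.DNEC5 | SIMREP.BDYS10 | SIMREP.BDYC10 | SIMREP.UPSS10 | SIMREP.UPSC10 | ... | s14 | s15 | s16 | s17 | s18 | s19 | s20 | m1 | m2 | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 2 | 66 | 34 | 1 | 66 | 2 | 0 | 0 | ... | 0 | 3 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 629 | 4 | 66 | 2 | ... | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 3 | 1 | 1 |
| 2 | 0 | 0 | 1 | 41 | 275 | 2 | 325 | 5 | 283 | 2 | ... | 2 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 40 | 1 | 0 | 0 | 0 | 0 | ... | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 7 | 1 |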


**The imprinting data has 131 instances, each of which has 1254 features. The column "Y" denotes the target variable, which can take either 0 or 1.**

In [5]:
imp.shape

(131, 1255)

**Randomly split the data set into training data and testing data. The testing data is set to be about 10% of the whole data set.**

In [6]:
# import the method from sklearn for splitting data set into training and testing data sets
from sklearn.model_selection import train_test_split

# specify the name of features' columns 
xcols = [col for col in imp.columns if col not in ["Y"]]

# extract features and targets from the imprinting data
features = imp[xcols].values
targets = imp["Y"].values


train_X, test_X, train_y, test_y = train_test_split(features, targets,
                                                    test_size = 0.1,
                                                    random_state = 817)

In [7]:
print ("the number of instances in training data set is {}".format(len(train_X)))
print ("the number of instances in testing data set is {}".format(len(test_X)))

the number of instances in training data set is 117
the number of instances in testing data set is 14
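
The 117/14 split follows from how `train_test_split` handles a fractional `test_size`: the test count is rounded up, and the remaining instances go to the training set. The sketch below reproduces the arithmetic on placeholder arrays with the same number of rows (the zero-filled features and alternating labels are made up for illustration).

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples, test_size = 131, 0.1

# Placeholder data with the same number of instances as the imprinting data.
X_demo = np.zeros((n_samples, 3))
y_demo = np.arange(n_samples) % 2

train_demo, test_demo, _, _ = train_test_split(
    X_demo, y_demo, test_size=test_size, random_state=817
)

# The test count is ceil(test_size * n_samples); the rest is training data.
expected_test = math.ceil(test_size * n_samples)
print(len(train_demo), len(test_demo), expected_test)
```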


In [8]:
print ("labels of the instances in testing set are:")
print (test_y)

labels of the instances in testing set are:
[1 1 0 0 0 0 0 1 0 0 0 0 1 1]


**Initialize the Weight-Adjusted CERP (WACERP) model:**

In [9]:
# In the initialization, we specify the base ensemble to be cerp.
# Also, we set the ensemble size to be 10
wacerp = WAVE(base_ensemble="cerp", ensemble_size=10)

**Train the WACERP model:**

In [10]:
wacerp.fit(train_X, train_y)

After the model is trained, we can inspect the base classifiers and their corresponding weights using methods of the WAVE class:  
**Get base classifiers:**

In [11]:
# call get_base_classifiers() to return a list of base classifiers
base_classifiers = wacerp.get_base_classifiers()

# the first base classifier in the list of base classifiers
base_classifiers[0]

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=5, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

**Get weights of base classifiers:**

In [12]:
# call get_weights() method to get weight vector of base classifiers
weights = wacerp.get_weights()

# print the weight vector.
# for example, 0.1105 denotes the weight for the first base classifier
print (weights)

[[ 0.11053991]
 [ 0.0972806 ]
 [ 0.11053991]
 [ 0.07746206]
 [ 0.10398347]
 [ 0.11690922]
 [ 0.08416423]
 [ 0.10407805]
 [ 0.09096449]
 [ 0.10407805]]


**Make predictions on testing set:**  
The return type can be either "prob" or "label".

In [13]:
# when return_type is set to "prob",
# the prediction for each testing instance is returned as a python dictionary mapping each label to its probability
# for the first instance in testing set, the probability of label 0 is 0.272, and the probability of label 1 is 0.728
wacerp.predict(test_X, return_type="prob")

[{0: 0.27216620086202475, 1: 0.72783379913797541},
 {0: 0.38328482596071134, 1: 0.61671517403928888},
 {0: 0.77255086789426708, 1: 0.22744913210573314},
 {0: 0.68130398450060037, 1: 0.31869601549939991},
 {0: 0.89592194724366592, 1: 0.1040780527563344},
 {0: 0.82525733896014641, 1: 0.17474266103985386},
 {0: 0.91583577197505217, 1: 0.084164228024948107},
 {0: 0.22089269302597295, 1: 0.77910730697402719},
 {0: 0.90903550757690588, 1: 0.090964492423094401},
 {0: 0.39615807834974548, 1: 0.60384192165025463},
 {0: 0.68139856634996387, 1: 0.31860143365003635},
 {0: 0.79212628545790376, 1: 0.20787371454209649},
 {0: 0.1040780527563344, 1: 0.89592194724366592},
 {0: 0.61025331680889228, 1: 0.38974668319110795}]
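
The printed numbers are consistent with a weighted vote: the weights sum to 1, and each probability equals the total weight of the base classifiers voting for that label. This is an inference from the outputs shown here, not a statement about the internals of wave.py. The sketch below uses the printed weight vector and a hypothetical vote pattern in which only the seventh base classifier votes for label 1; the function name `weighted_vote` is made up for illustration.

```python
import numpy as np

# Weight vector copied from the get_weights() output above.
weights = np.array([0.11053991, 0.0972806, 0.11053991, 0.07746206, 0.10398347,
                    0.11690922, 0.08416423, 0.10407805, 0.09096449, 0.10407805])


def weighted_vote(votes, weights, labels=(0, 1)):
    """Sum the weights of the classifiers voting for each label."""
    votes = np.asarray(votes)
    return {label: float(weights[votes == label].sum()) for label in labels}


# Hypothetical votes: only the seventh classifier predicts label 1.
votes = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
vote_probs = weighted_vote(votes, weights)
print(weights.sum())
print(vote_probs)
```

Up to rounding of the printed weights, this reproduces the seventh row of the output above (about 0.916 for label 0 and 0.084 for label 1).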

In [14]:
# when return_type is set to "label",
# the predictions for the testing set are returned as a list of predicted labels
# for example, the first instance in testing set is predicted as label 1
wacerp.predict(test_X, return_type="label")

[1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
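
The label output matches picking the most probable label from each "prob" dictionary. As a minimal sketch, the first four dictionaries printed earlier can be converted by hand; how WAVE itself breaks exact ties is not shown in this notebook, so that case is left out.

```python
# First four probability dictionaries copied from the "prob" output above.
probs = [
    {0: 0.27216620086202475, 1: 0.72783379913797541},
    {0: 0.38328482596071134, 1: 0.61671517403928888},
    {0: 0.77255086789426708, 1: 0.22744913210573314},
    {0: 0.68130398450060037, 1: 0.31869601549939991},
]

# For each instance, keep the label with the largest probability.
labels = [max(p, key=p.get) for p in probs]
print(labels)  # [1, 1, 0, 0]
```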

**Accuracy on testing set:**

In [15]:
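# one way to compute accuracy is to compare the predictions with test_y
# (predict() returns labels by default; np is available through the star import of wave)
predictions = wacerp.predict(test_X)
accuracy = np.mean(np.asarray(predictions) == test_y)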
print ("accuracy on testing set is {}".format(accuracy))

accuracy on testing set is 0.8571428571428571


In [18]:
# another way to compute accuracy is to call the accuracy() method of the WAVE class directly
accuracy = wacerp.accuracy(test_X, test_y)
print ("accuracy on testing set is {}".format(accuracy))

accuracy on testing set is 0.8571428571428571
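
The reported accuracy can be checked by hand from the labels and predictions printed above: 12 of the 14 testing instances are classified correctly. The sketch below uses sklearn's metrics on those two printed lists; the confusion matrix is an extra view that the WAVE class is not claimed to provide.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# True labels and predicted labels copied from the outputs above.
test_y_shown = np.array([1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1])
predicted_shown = np.array([1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0])

acc = accuracy_score(test_y_shown, predicted_shown)
cm = confusion_matrix(test_y_shown, predicted_shown)

print(acc)  # 12 / 14
print(cm)   # rows: true label 0, 1; columns: predicted label 0, 1
```

The matrix shows one instance of each class misclassified (the tenth and the last testing instance).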


**Control base tree complexity:**  
When initializing the WACERP model, we can control the complexity of each base tree classifier by specifying the min_samples_split_cerp argument. This argument is the minimum number of samples required to split an internal node for trees in CERP; the default value is 5. Let's fit another WACERP model.

In [17]:
#initialize another WACERP model that has ensemble size 20, and set min_samples_split_cerp to be 10
wacerp2 = WAVE(base_ensemble="cerp", ensemble_size=20, min_samples_split_cerp=10)

# train the model
wacerp2.fit(train_X, train_y)

# compute the accuracy on testing set by this model
accuracy = wacerp2.accuracy(test_X, test_y)

#print the accuracy
print ("accuracy by another WACERP on testing set is {}".format(accuracy))

accuracy by another WACERP on testing set is 0.9285714285714286
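
The base classifier printed earlier is a sklearn `DecisionTreeClassifier` with `min_samples_split=5`, so the effect of this setting can be illustrated on a single tree. The sketch below assumes that min_samples_split_cerp is passed through to that parameter; it uses synthetic data from `make_classification`, not the imprinting data, and the helper name `internal_node_sizes` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only.
X_syn, y_syn = make_classification(n_samples=200, n_features=20, random_state=0)


def internal_node_sizes(min_samples_split):
    """Fit one tree and return its node count and the sizes of its split nodes."""
    tree = DecisionTreeClassifier(min_samples_split=min_samples_split, random_state=0)
    tree.fit(X_syn, y_syn)
    is_internal = tree.tree_.children_left != -1  # leaves have child index -1
    return tree.tree_.node_count, tree.tree_.n_node_samples[is_internal]


count5, sizes5 = internal_node_sizes(5)
count10, sizes10 = internal_node_sizes(10)
print("min_samples_split=5 :", count5, "nodes, smallest split node:", sizes5.min())
print("min_samples_split=10:", count10, "nodes, smallest split node:", sizes10.min())
```

No node holding fewer samples than the threshold is ever split, which is why a larger value tends to give shallower, simpler trees.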


### 3. Example: Fit a Weight-Adjusted Random Forest (WARF) on Breast Cancer Dataset

Weight-Adjusted Random Forest is an ensemble method that applies WAVE to the Random Forest base ensemble. This example fits WARF models on the Breast Cancer Wisconsin dataset.

**Load the breast cancer dataset from sklearn:**

In [2]:
# load dataset from sklearn
from sklearn.datasets import load_breast_cancer

# extract features and targets separately by specifying return_X_y to be True
features, targets = load_breast_cancer(return_X_y = True)

The dataset has 569 instances, each of which has 30 features. 

In [3]:
features.shape

(569, 30)

The target value is either 0 or 1.

In [47]:
np.unique(targets)

array([0, 1])
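
To see what the two target values stand for and how many instances each class has, the dataset can also be loaded as a Bunch object, which carries the class names. A short sketch using only sklearn and numpy:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Count instances per class; target_names maps 0 and 1 to class names.
counts = np.bincount(data.target)
for label, (name, count) in enumerate(zip(data.target_names, counts)):
    print(label, name, count)
```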

**Randomly split dataset into training and testing datasets:**

In [5]:
# import the method from sklearn for splitting data set into training and testing data sets
from sklearn.model_selection import train_test_split

# the testing set is about 20% of the whole dataset
train_X, test_X, train_y, test_y = train_test_split(features, targets,
                                                    test_size = 0.2,
                                                    random_state = 817)

**Initialize a WARF with an ensemble size of 15:**

In [37]:
warf1 = WAVE(base_ensemble="rf", ensemble_size=15)

**Train the WARF:**

In [38]:
warf1.fit(train_X, train_y)

**Look into the weight vector of base classifiers:**

In [40]:
print (warf1.get_weights())

[[ 0.06678138]
 [ 0.06636731]
 [ 0.06472468]
 [ 0.07007673]
 [ 0.07091804]
 [ 0.07008807]
 [ 0.07009323]
 [ 0.06885567]
 [ 0.06760836]
 [ 0.06470502]
 [ 0.06180222]
 [ 0.06470855]
 [ 0.07050676]
 [ 0.05972108]
 [ 0.06304291]]
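
One way to read this weight vector is to compare it with uniform weighting, where each of the 15 base classifiers would get 1/15 (about 0.0667). The sketch below does this with the printed values; it is only a reading aid and does not call WAVE.

```python
import numpy as np

# Weight vector copied from the get_weights() output above.
warf_weights = np.array([0.06678138, 0.06636731, 0.06472468, 0.07007673, 0.07091804,
                         0.07008807, 0.07009323, 0.06885567, 0.06760836, 0.06470502,
                         0.06180222, 0.06470855, 0.07050676, 0.05972108, 0.06304291])

uniform = 1.0 / len(warf_weights)
ratio = warf_weights / uniform  # above 1: weighted more than a plain average

print("sum of weights:", warf_weights.sum())
print("heaviest classifier index:", int(ratio.argmax()), "ratio:", ratio.max())
print("lightest classifier index:", int(ratio.argmin()), "ratio:", ratio.min())
```

All ratios stay within roughly 10% of 1, so in this run the weighted vote is close to a plain average over the trees.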


**Predictions on testing set:**

In [41]:
print (warf1.predict(test_X))

[1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0]


**Accuracy on testing set:**

In [42]:
warf1.accuracy(test_X, test_y)

0.93859649122807021

**Fit a second WARF with an ensemble size of 100 and compute its accuracy:**

In [43]:
warf2 = WAVE(base_ensemble="rf", ensemble_size=100)

In [44]:
warf2.fit(train_X, train_y)

In [45]:
warf2.accuracy(test_X, test_y)

0.94736842105263153
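
With a 20% test split of 569 instances, the testing set has 114 instances, so these two accuracies can be converted back into counts of correct predictions. A minimal sketch of that arithmetic, assuming the test size is rounded up as `train_test_split` does:

```python
import math

n_test = math.ceil(0.2 * 569)  # size of the testing set

accuracies = {"warf1 (15 trees)": 0.93859649122807021,
              "warf2 (100 trees)": 0.94736842105263153}

# Convert each accuracy into a number of correctly classified instances.
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}
for name, n_correct in correct.items():
    print("{}: {} of {} correct".format(name, n_correct, n_test))
```

The two models differ by a single testing instance, so the gap between them should be read with caution.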