# Example of training a MultiViewStacking classifier.

This document shows how to use the `multiviewstacking` library on a dataset with two views. The library supports an arbitrary number of views (limited only by your computer's memory).

### The HTAD dataset.

The Home-Tasks Activities Dataset (HTAD) contains wrist-accelerometer and audio features collected from individuals performing at-home tasks such as sweeping, brushing teeth, washing hands, and watching TV. The dataset is included in the same directory as this document ('htad.csv'). We will use this dataset to build a Multi-View Stacking model with two views: one for the audio features and one for the accelerometer features. For more information about the dataset, see [this preprint](https://osf.io/preprints/osf/j723c).


### Load the dataset.

First, let's load the dataset into a pandas data frame and display the first rows.
The first column is the **class** and the remaining ones are the features. The feature names have a prefix of **v1_** or **v2_**. The features prefixed with `v1_` are mel frequency cepstral coefficients extracted from audio signals. Features prefixed with `v2_` are summary statistics extracted from accelerometer signals. Note that column names can be anything. In this case, a prefix was added to make it easier to get each view's column indices.


In [None]:
import pandas as pd
import numpy as np

# Read file.
df = pd.read_csv('htad.csv')

# Display the first rows of the data.
df.head()

### Encode the labels as integers

`multiviewstacking` expects the labels to be in integer format. You can use a `LabelEncoder()` to do that.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Create the LabelEncoder object.
le = LabelEncoder()

# Apply the transformation to the "class" column.
df["class"] = le.fit_transform(df["class"])

# Check the unique new values.
np.unique(df["class"])
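
To see which integer was assigned to each label, you can inspect the `classes_` attribute of the fitted encoder. The following is a minimal sketch that uses made-up activity labels (not the actual contents of `htad.csv`). `LabelEncoder` sorts the unique labels, so the position of a label in `classes_` is its integer code.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for the "class" column (assumption for illustration).
labels = ["sweep", "brush_teeth", "watch_tv", "sweep", "wash_hands"]

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ holds the original labels in sorted order;
# the label at position i is encoded as the integer i.
mapping = {str(label): code for code, label in enumerate(le.classes_)}

print(mapping)
print(encoded)
```

Keeping this mapping at hand makes it easier to read integer predictions before converting them back to strings.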

Now let's store the class in the variable `y` and the features in the variable `X`. We also store the column names so later we can get the column indices.

In [None]:
y = df["class"]

X = df.drop(["class"], axis = 1)

# Store column names.
colnames = list(X.columns)

### Split the dataset

Now we split the dataset into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split

# Split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size = 0.5,
                                                    stratify = y, 
                                                    random_state = 200)

### Get column indices

The `MultiViewStacking` object expects the features of each view to be passed as column indices, either as a tuple that describes a range or as the complete list of indices. In this case we will pass the complete list of indices. Since we stored the column names, we can extract their indices by searching for those that start with "v1_" and "v2_". This can also be done manually or using a different method. The important thing is to have one list per view with the corresponding column indices with respect to the input training data `X`. In the example below, the indices for the first view (audio features) go from $0$ to $35$. This could also have been represented with a tuple like `(0,35)`. The `multiviewstacking` library allows the list format in case the features are not contiguous.

In [None]:
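# Get column indices for each view.
# Match on the prefix only, so a name that merely contains "v1_" is not picked up.
ind_v1 = [i for i, name in enumerate(colnames) if name.startswith("v1_")]
ind_v2 = [i for i, name in enumerate(colnames) if name.startswith("v2_")]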

print(ind_v1)
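
The relation between the list format and the tuple format can be illustrated with a small sketch. The column names below are made up, and `view_indices` and `to_range_tuple` are hypothetical helpers that are not part of `multiviewstacking`. The sketch assumes that the tuple bounds are inclusive, as the `(0,35)` example above suggests.

```python
# Made-up column names; the real ones come from htad.csv.
colnames = ["v1_a", "v1_b", "v1_c", "v2_x", "v2_y"]


def view_indices(names, prefix):
    """Return the positions of the columns whose name starts with `prefix`."""
    return [i for i, name in enumerate(names) if name.startswith(prefix)]


def to_range_tuple(indices):
    """Collapse a sorted, contiguous index list into an inclusive (first, last) tuple.

    Returns None when the indices are empty or not contiguous, in which
    case the complete list has to be used instead.
    """
    if not indices:
        return None
    first, last = indices[0], indices[-1]
    if indices == list(range(first, last + 1)):
        return (first, last)
    return None


ind_v1 = view_indices(colnames, "v1_")
ind_v2 = view_indices(colnames, "v2_")

print(ind_v1, to_range_tuple(ind_v1))
print(ind_v2, to_range_tuple(ind_v2))
print(to_range_tuple([0, 2, 5]))  # Not contiguous, so no tuple form.
```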

### Defining the first-level-learners and meta-learner

Let's define the first-level-learners for each of the views and the meta-learner. The `multiviewstacking` library supports most `scikit-learn` classifiers. A `MultiViewStacking` model is not limited to a single type of model but supports heterogeneous types of models. For example, if you know that a KNN classifier is more suitable for audio classification and Gaussian Naive Bayes is better for the accelerometer view, you can specify a different model for each view. Furthermore, you can even specify a different model for the meta-learner. In this case, we use a Random Forest.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# Define the first-level-learner for the audio view.
# In this case, a KNN classifier with k=3. 
m_v1 = KNeighborsClassifier(n_neighbors=3)

# Define the first-level-learner for the accelerometer view.
# In this case, a Naive Bayes classifier.
m_v2 = GaussianNB()

# Define the meta-learner.
m_meta = RandomForestClassifier(n_estimators=50, random_state=123)

### Create the MultiViewStacking classifier

Now we are ready to create our `MultiViewStacking` classifier. We first pass the `views_indices` parameter as a list of lists. The first list contains the indices of the first view (audio) and the second list contains the indices of the second view (accelerometer). Then, we pass a list of `first_level_learners`. **Note that the order of the views for all parameters must be the same.** That is, if in `views_indices` you pass the indices of some view $A$ and view $B$, then in `first_level_learners` you must pass a list with the corresponding models for view $A$ and view $B$ in the same order.

Then, we specify the `meta_learner` and `k`. The parameter `k` specifies the number of folds in the internal cross-validation of the Multi-View Stacking algorithm. See [here](https://enriquegit.github.io/behavior-free/ensemble.html#stacked-generalization) for details of the algorithm.

Finally, we set the `random_state` parameter for reproducibility. The `random_state` value is passed to the internal cross-validation procedure that splits the data into folds. This parameter is optional and has a default value of `123`.
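
To give an intuition of what the internal cross-validation does, here is a rough sketch of the general stacking idea written with plain `scikit-learn` on synthetic data. It is not the library's implementation. In particular, using out-of-fold class probabilities as meta-features, the synthetic dataset, and the variable names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 8 features, the first 4 play the role of view 1
# and the last 4 play the role of view 2.
X, y = make_classification(n_samples=200, n_features=8, n_informative=6,
                           n_redundant=0, n_classes=3, random_state=0)
views = [list(range(0, 4)), list(range(4, 8))]
learners = [KNeighborsClassifier(n_neighbors=3), GaussianNB()]

# k-fold splitter; random_state controls how the data is split into folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=100)

# For each view, get out-of-fold class probabilities and stack them side by side.
# Every row is predicted by a model that did not see that row during training.
meta_features = np.hstack([
    cross_val_predict(m, X[:, idx], y, cv=cv, method="predict_proba")
    for m, idx in zip(learners, views)
])

# Train the meta-learner on the stacked out-of-fold predictions.
meta = RandomForestClassifier(n_estimators=50, random_state=123)
meta.fit(meta_features, y)

# Refit each first-level-learner on all the training data of its view.
fitted = [m.fit(X[:, idx], y) for m, idx in zip(learners, views)]

print(meta_features.shape)
```

With 3 classes and 2 views, the meta-learner in this sketch is trained on 6 columns, one probability per class and view.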

In [None]:
from multiviewstacking import MultiViewStacking

model = MultiViewStacking(views_indices = [ind_v1, ind_v2],
                          first_level_learners = [m_v1, m_v2],
                          meta_learner = m_meta,
                          k = 10,
                          random_state = 100)

### Train the model

Once the model has been created, we can proceed to train it.

In [None]:
# Now it's time to fit the model with the training data.
model.fit(X_train, y_train)

### Test the model

Now you can test your model by making predictions on the test set and computing the accuracy.

In [None]:
predictions = model.predict(X_test)

# Print accuracy.
print(np.sum(y_test == predictions) / len(y_test))
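
The accuracy can also be computed with `accuracy_score` from `scikit-learn`, and a confusion matrix shows which classes get confused with each other. The sketch below uses small made-up arrays instead of the real predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions (assumption for illustration).
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

# Same value as np.sum(y_true == y_pred) / len(y_true).
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(acc)
print(cm)
```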

### Convert predictions to original strings

You can use the `LabelEncoder` to convert the integer predictions back to strings with its method `inverse_transform()`.

In [None]:
# Print first 10 predictions with the original names.
string_predictions = le.inverse_transform(predictions)
print(string_predictions[0:10])

You can try changing the first-level-learners and meta-learner to see if you get better results. Be aware that if you try different classifiers and/or parameters, you should compare them on an independent validation set. Once you have found the best combination of models and parameters, test the final model on an independent test set only once.
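
One possible way to obtain independent train, validation, and test sets is to call `train_test_split` twice. The sketch below uses toy data, and the 60/20/20 proportions are an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 4 features, 2 balanced classes.
X = np.arange(400).reshape(100, 4)
y = np.array([0, 1] * 50)

# First, hold out the final test set (20% of the data).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=200)

# Then, split the remainder into train (60% of the total)
# and validation (20% of the total) sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=200)

print(len(X_train), len(X_val), len(X_test))
```

Candidate models are fitted on the train set and compared on the validation set. The test set is used only once, for the final model.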

### Testing the first-level-learners individually

The fitted `MultiViewStacking` model has several attributes including `fitted_first_level_learners_`. You can use this to test the performance of the individual models.

In [None]:
# Get the fitted KNN.
fitted_view1 = model.fitted_first_level_learners_[0]

# Get the fitted Naive Bayes.
fitted_view2 = model.fitted_first_level_learners_[1]

In [None]:
# Make predictions on the test set using only the view 1 (audio) model.
predictions_v1 = fitted_view1.predict(X_test.values[:,ind_v1])

# Print accuracy.
print(np.sum(y_test == predictions_v1) / len(y_test))

In [None]:
# Make predictions on the test set using only the view 2 (accelerometer) model.
predictions_v2 = fitted_view2.predict(X_test.values[:,ind_v2])

# Print accuracy.
print(np.sum(y_test == predictions_v2) / len(y_test))

Note that for each view's fitted model we only pass the corresponding feature columns, `ind_v1` and `ind_v2`. In this case, the individual models performed much worse than the combination of the views with Multi-View Stacking.
