# Web Search 2018 - Tutorial 5: Multi-Feature Label Propagation
## Contents

1. [Overview](#head1)
  1. [Code Imports](#head11)
2. [Iterative Label Propagation on Web Data](#head2)
  1. [Multi-label LP Algorithm](#head21)
  2. [Implement the Multi-label LP Algorithm](#head22)
  3. [Evaluation](#head23)
  4. [Exercises](#head24)
3. [Multi-Feature Iterative Label Propagation](#head3)
  1. [Implement the Multi-Feature Iterative Label Propagation](#head31)
  2. [Exercises](#head32)

## <a name="head1"></a> Overview

In the previous lab you implemented the Iterative Label Propagation (LP) algorithm, a semi-supervised, graph-based approach for annotating uncategorized/unlabelled data starting from a small set of categorized/labelled data.

The target dataset was the MNIST Digits, which is only adequate for implementation purposes (i.e. testing, debugging, etc.). In this lab, the first step will be to apply the LP algorithm to Web data and analyse its behaviour.


Additionally, in the LP implementation of the previous lab, semantic affinity between documents (images/texts) was computed based on a **single feature space**. In this lab the LP algorithm definition will be revisited in order to accommodate the computation of semantic affinity between documents under **multiple feature spaces**. This allows the construction of a much richer graph, supporting the propagation of labels according to different similarity criteria.


**Lab objectives:**
* Apply the iterative version of the Label Propagation algorithm to a Web data scenario and analyse the results;
* Implement the Multi-feature iterative Label Propagation algorithm.

### <a name="head11"></a> Code Imports

In [1]:
import numpy as np
from numpy.random import shuffle

# <a name="head2"></a> Iterative Label Propagation on Web Data

Consider a dataset $X=\{x_1, x_2, \ldots, x_L, \ \ x_{L+1}, \ldots, x_N\}$, with $N$ data points, where each $x_i$ consists of some feature representation of document $i$. Given a categories set $C=\{1, 2, \ldots, |C|\}$, it is assumed that the first $L$ data points are labelled with a label $c \in C$, and the remaining ones are unlabelled.

Please refer to the "Mining Data Graphs" class (lectured on 29/10), namely slides 28, 29 and 30, for a description of the algorithm steps.

For more information, you can check the original paper: Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, Bernhard Schoelkopf. Learning with local and global consistency (2004) http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.115.3219
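
As a reminder of the iteration itself, the following is a minimal sketch based on the cited paper (the course slides may use slightly different notation). The function names and the `damping` parameter are illustrative assumptions; `damping` plays the role of the propagation parameter in the paper and is unrelated to the feature weights used later in this lab.

```python
import numpy as np

def normalize_affinity(W):
    """Symmetric normalization S = D^{-1/2} W D^{-1/2}, with D the degree matrix of W."""
    degrees = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(degrees, 1e-12))  # guard against isolated nodes
    return W * inv_sqrt[:, None] * inv_sqrt[None, :]

def iterative_lp(W, Y0, damping=0.9, n_iter=50):
    """Repeat F <- damping * S F + (1 - damping) * Y0 for a fixed number of iterations."""
    S = normalize_affinity(W)
    Y0 = Y0.astype(float)
    F = Y0.copy()
    for _ in range(n_iter):
        F = damping * (S @ F) + (1.0 - damping) * Y0
    return F

# Toy graph (made-up values): nodes 0-1 and nodes 2-3 form two tight clusters.
W_toy = np.array([[0.00, 1.00, 0.01, 0.01],
                  [1.00, 0.00, 0.01, 0.01],
                  [0.01, 0.01, 0.00, 1.00],
                  [0.01, 0.01, 1.00, 0.00]])
# Node 0 is labelled with class 0, node 2 with class 1; nodes 1 and 3 are unlabelled.
Y0_toy = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])
F_toy = iterative_lp(W_toy, Y0_toy)
```

In this toy graph each unlabelled node ends up with its highest score in the class of the labelled node in its own cluster.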


## <a name="head21"></a> Multi-label LP Algorithm

Apply the Iterative LP algorithm on your project dataset and discuss its effectiveness.

The dataset has a total of 13 categories. The categories of each document are available in the corresponding line of that document (column 'gt_class'), in the provided .csv file. Multiple categories are separated by a comma ','. You should represent each document's categories as a 13-dimensional binary vector, analogous to the one-hot encoding you used for the MNIST dataset. The difference is that here more than one dimension may be active (a multi-hot encoding).
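
One possible way to build these vectors is sketched below. The category names (`class_0`, `class_1`, ...) and the `class_to_index` mapping are hypothetical placeholders; replace them with the actual values found in the 'gt_class' column of your .csv file.

```python
import numpy as np

NUM_CLASSES = 13  # total number of categories in the dataset

def encode_labels(gt_class, class_to_index, num_classes=NUM_CLASSES):
    """Turn a comma-separated category string into a binary (multi-hot) vector."""
    y = np.zeros(num_classes, dtype=float)
    for name in str(gt_class).split(","):
        name = name.strip()
        if name:  # skip empty fragments such as a trailing comma
            y[class_to_index[name]] = 1.0
    return y

# Hypothetical category names, used only for illustration.
class_to_index = {f"class_{i}": i for i in range(NUM_CLASSES)}

# A document annotated with two categories has two active dimensions.
y_example = encode_labels("class_0, class_4", class_to_index)
```

Stacking one such vector per document (for instance with `np.vstack`) gives an $N \times 13$ label matrix.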

### <a name="head22"></a> Implement the Multi-label LP Algorithm

**Multi-label LP:** As each document has multiple categories, you will need to modify your LP implementation. Instead of an **argmax** to select the final category of each document, you will need to **select the top-k categories**, by applying a threshold on the number of categories assigned.

**Discuss:** Discuss examples of thresholds (e.g. select the top-3 labels, keep all categories whose score is $>\alpha$, etc.).
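
The two selection strategies mentioned above could be sketched as follows. The function names, the default `k`, and the default `threshold` are illustrative assumptions, and `F` stands for the score matrix produced by your LP implementation (one row per document, one column per category).

```python
import numpy as np

def select_top_k(F, k=3):
    """Mark the k highest-scoring categories of each document with 1, the rest with 0."""
    top = np.argsort(-F, axis=1)[:, :k]      # column indices of the k largest scores per row
    Y = np.zeros_like(F, dtype=int)
    np.put_along_axis(Y, top, 1, axis=1)
    return Y

def select_above(F, threshold=0.5):
    """Mark every category whose score is strictly above the threshold."""
    return (F > threshold).astype(int)

# Made-up scores for two documents and three categories.
F_example = np.array([[0.1, 0.7, 0.2],
                      [0.5, 0.4, 0.6]])
top2 = select_top_k(F_example, k=2)
above = select_above(F_example, threshold=0.45)
```

Top-k always assigns exactly `k` categories per document, while a score threshold lets the number of assigned categories vary from document to document.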


### <a name="head23"></a>  Evaluation
Evaluate the results of each run of the Iterative LP.

In [None]:
# The function classification_report computes and prints a set of commonly used metrics.
# docs: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report
from sklearn.metrics import classification_report

# Get the predictions of the unlabeled documents.
# Y is assumed to be the binarized (0/1) output of your LP run, with one row per document.
Y_pred = Y[indices_unlabeled, :]

# Get the corresponding groundtruth
y_gt = Y_true[indices_unlabeled, :]

print(classification_report(y_gt, Y_pred))
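
To help interpret the report, the per-class precision and recall can also be computed by hand. The sketch below is a simplified illustration on made-up binary matrices, not a replacement for `classification_report`; the function name is an assumption.

```python
import numpy as np

def per_class_precision_recall(y_true, y_pred):
    """Per-class precision and recall for binary (0/1) label matrices."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)  # true positives per class
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)  # false positives per class
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)  # false negatives per class
    # When a denominator is zero the numerator is zero too, so the result is 0.
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    return precision, recall

# Made-up ground truth and predictions: 3 documents, 3 classes.
y_true_toy = np.array([[1, 0, 1],
                       [0, 1, 0],
                       [1, 1, 0]])
y_pred_toy = np.array([[1, 0, 0],
                       [0, 1, 1],
                       [1, 0, 0]])
precision_toy, recall_toy = per_class_precision_recall(y_true_toy, y_pred_toy)
```

Here the first class is predicted perfectly, the second is found in only one of its two documents, and the third is never predicted correctly.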

### <a name="head24"></a>  Exercises

In [2]:
# Which feature spaces are more effective? Do a per-class inspection and understand which feature spaces are more effective for each class and why. 

# How does the LP algorithm behave when you change the number of initial labels? (variable labeled_set_size in lab 4)

# Note that documents (Tweets) from your project's dataset have multiple labels, i.e. each document may belong to 1 or more classes. 
# Discuss how this impacts the label propagation. 

# <a name="head3"></a> Multi-Feature Iterative Label Propagation 

Recall the computation of the affinity matrix S.

Given some feature space representation, each entry $w_{ij}$, for $i\neq j$, is computed as:

$$
\begin{align}
w_{ij} = \exp\Big({-\frac{\|x_i - x_j\|^2}{2\sigma^2}}\Big),
\end{align}
$$
where a Gaussian kernel is applied over the distance on the considered feature space.
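
A vectorized sketch of this computation is shown below. The function name and the default `sigma` are illustrative assumptions, and the diagonal is set to zero on the assumption that a document has no edge to itself, as in the cited paper.

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    """Pairwise Gaussian-kernel affinities for the rows of X, with a zero diagonal."""
    sq_norms = np.sum(X ** 2, axis=1)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by rounding
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # assumption: no self-affinity (w_ii = 0)
    return W

# Two made-up points at Euclidean distance 5 from each other.
X_toy = np.array([[0.0, 0.0],
                  [3.0, 4.0]])
W_toy = gaussian_affinity(X_toy, sigma=5.0)
```

With `sigma=5.0` the off-diagonal entries equal $\exp(-25/50) = \exp(-0.5)$; a larger $\sigma$ makes distant documents look more similar.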


In order to compute affinity by considering **multiple feature spaces**, the above expression can be extended as:

$$
\begin{align}
w_{ij} = \exp\Big({-\frac{\Big[\sum_{f}\alpha_f\cdot\|x_i^f - x_j^f\|\Big]^2}{2\sigma^2}}\Big),
\end{align}
$$
where each $x^f$ denotes a given feature space and $\alpha_f$ the weight associated with that space. The weights should be defined such that $\sum_f \alpha_f = 1$.

You can define the contribution of each feature space by adequately setting its associated weight $\alpha_f$.
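
The extended expression could be implemented along the following lines. This is a sketch under stated assumptions: the function names are illustrative, each feature space is given as a separate matrix with one row per document (all in the same document order), and the diagonal is set to zero.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance between every pair of rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(sq_dists, 0.0))  # clip tiny negatives before the root

def multi_feature_affinity(feature_spaces, weights, sigma=1.0):
    """Gaussian affinity over a weighted sum of per-feature-space distances."""
    weights = np.asarray(weights, dtype=float)
    if len(weights) != len(feature_spaces):
        raise ValueError("one weight per feature space is required")
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("the weights alpha_f must sum to 1")
    combined = sum(a * pairwise_distances(X) for a, X in zip(weights, feature_spaces))
    W = np.exp(-combined ** 2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # assumption: no self-affinity
    return W

# Two made-up feature spaces describing the same two documents.
X_a = np.array([[0.0], [2.0]])            # distance 2 between the documents
X_b = np.array([[0.0, 0.0], [3.0, 4.0]])  # distance 5 between the documents
W_multi = multi_feature_affinity([X_a, X_b], weights=[0.5, 0.5])
W_single = multi_feature_affinity([X_a, X_b], weights=[1.0, 0.0])
```

Setting one weight to 1 and the others to 0 recovers the single-feature affinity. Since distances in different feature spaces can have very different scales, it may be worth rescaling them before combining, so that the weights reflect the intended contribution of each space.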

##  <a name="head31"></a> Implement the Multi-Feature Iterative Label Propagation

Note that given the Multi-label implementation of Iterative LP, you should only need to change the computation of each $w_{ij}$.

### <a name="head32"></a>  Exercises

In [4]:
# Discuss the effectiveness of the multi-feature approach versus the single-feature variant. Namely, compare HoC+HoG with VGG.

# Change the weights of each feature space and interpret the results. Which features better contribute to the overall effectiveness?