# Modeling Interestingness with Deep Neural Networks

    Jianfeng Gao, Patrick Pantel, Michael Gamon, Xiaodong He, Li Deng
    2014

https://www.microsoft.com/en-us/research/wp-content/uploads/2014/10/604_Paper.pdf

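## Summary
Two tasks:

1. Automatically highlight the entities of interest (i.e., the keywords of the current document), which then serve as the basis for recommending further documents
1. Recommend new, interesting documents about those entities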

Model architecture:

1. Input Layer x
    1. Each input word is a one-hot vector; the vocabulary used in the experiments has 150K words
    1. Each word w is also broken down into a tri-letter vector (30K)
    1. build x
1. Convolutional Layer u
    1. Extracts local features for each word w
1. Max-pooling Layer v
    1. Produces the global feature vector
1. Fully-Connected Layers h and y

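Training the DSSM:

Modeling interestingness requires labeled target documents, so training is supervised.

Using the DSSM:

1. As a feature generator
1. Directly as the predictor of interestingness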

## Introduction
- `Automatic Highlighting`. In this task we want a recommendation system to automatically discover the entities (e.g., a person, location, or organization) that interest a user when reading a document and to highlight the corresponding text spans, hereafter referred to as `keywords`.
- `Contextual entity search`. After identifying the keywords that represent the entities of interest to the user, we also want the system to recommend new, interesting documents by searching the Web for supplementary information about these entities. 

However, the earlier DSSM was designed to represent the `relevance` between queries and documents, which differs from the notion of `interestingness` between documents studied in this paper. A user is often interested in a document because it provides supplementary information about the entities or concepts she encounters when reading another document, even though the overall content of the second document is not highly relevant to the first.

To better model interestingness,

1. DSSM treats a document as a sequence of words and tries to discover prominent keywords. These keywords represent the entities or concepts that might interest users, via the convolutional and max-pooling layers. The DSSM then forms the high-level semantic representation of the whole document based on these keywords.
1. We feed the features derived from the semantic representations of documents to a ranker which is trained in a supervised manner.

## The Notion of Interestingness
Let 𝐷 be the set of all documents. We formally define the `interestingness` modeling task as learning the mapping function:

$\sigma:D \times D \to \mathbb{R}^+$

where the function 𝜎(𝑠, 𝑡) is the quantified degree of interest that the user has in the target document 𝑡 ∈ 𝐷 after or while reading the source document 𝑠 ∈ 𝐷.

## A Deep Semantic Similarity Model (DSSM)
### Network Architecture
![1](http://ou8qjsj0m.bkt.clouddn.com//17-8-6/13679357.jpg)

#### Input Layer x
1. convert each word w in d to a word vector
    1. represent w by a one-hot vector using a vocabulary that contains the N most frequent words (N = 150K in this study)
    1. map w to a separate tri-letter vector. Consider the word “#dog#”, where # is a word boundary symbol. The nonzero elements in its tri-letter vector are “#do”, “dog”, and “og#”.
1. build x by concatenating these word vectors

Although the number of unique English words on the Web is extremely large, the total number of distinct tri-letters in English is limited.
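The tri-letter decomposition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the four-entry `toy_index` are hypothetical, lowercasing is an assumed normalization, and a real index would hold on the order of the 30K tri-letters mentioned in the summary.

```python
def tri_letters(word):
    """Return the letter trigrams of a word padded with '#' boundary symbols."""
    padded = "#" + word.lower() + "#"  # lowercasing is an assumption of this sketch
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def tri_letter_vector(word, index):
    """Count vector over a (hypothetical) tri-letter index; unknown trigrams are skipped."""
    vec = [0] * len(index)
    for tri in tri_letters(word):
        if tri in index:
            vec[index[tri]] += 1
    return vec


# Toy index with four entries, for illustration only.
toy_index = {"#do": 0, "dog": 1, "og#": 2, "cat": 3}

print(tri_letters("dog"))                    # ['#do', 'dog', 'og#']
print(tri_letter_vector("dog", toy_index))   # [1, 1, 1, 0]
```

Because every word maps onto the same fixed set of tri-letters, unseen or misspelled words still receive a non-empty representation as long as some of their trigrams are in the index.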

#### Convolutional Layer u
A convolutional layer extracts local features around each word $w_i$ in a word sequence of length I as follows.

1. generate a contextual vector $c_i$ by concatenating the word vectors of $w_i$ and its surrounding words defined by a window (the window size is set to 3 in this paper).
1. we generate for each word a local feature vector $u_i$ using a tanh activation function and a linear projection matrix $W_c$, which is the same across all windows 𝑖 in the word sequence, as:

$u_i=\tanh(W_c^T c_i), \quad i=1,\cdots,I$ (1)
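A minimal pure-Python sketch of equation (1) follows. The helper names are hypothetical, the matrix `W_c` is stored as a list of rows with shape (window × dim, K), and zero-padding at the sequence boundaries is an assumption of this sketch rather than something stated in these notes.

```python
import math


def context_windows(word_vecs, window=3):
    """Concatenate each word vector with its neighbours (zero-padded at the edges)."""
    half = window // 2
    dim = len(word_vecs[0])
    pad = [[0.0] * dim] * half          # assumption: boundary positions are all zeros
    padded = pad + word_vecs + pad
    return [sum(padded[i:i + window], []) for i in range(len(word_vecs))]


def conv_layer(word_vecs, W_c, window=3):
    """u_i = tanh(W_c^T c_i), with the same W_c shared by every window i."""
    features = []
    for c in context_windows(word_vecs, window):
        u = [math.tanh(sum(W_c[r][k] * c[r] for r in range(len(c))))
             for k in range(len(W_c[0]))]
        features.append(u)
    return features


# Toy input: 3 words with 2-dimensional word vectors, K = 2 local features.
word_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_c = [[0.0, 0.0] for _ in range(6)]
W_c[2][0] = 1.0   # feature 0 reads the first component of the centre word
W_c[4][1] = 1.0   # feature 1 reads the first component of the next word

U = conv_layer(word_vecs, W_c)
print([[round(x, 3) for x in u] for u in U])
# [[0.762, 0.0], [0.0, 0.762], [0.762, 0.0]]
```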

#### Max-pooling Layer v
$v(j)=\max_{i=1,\cdots,I} \{u_i(j)\}$ (2)

The procedure in Figure 2 demonstrates how the convolutional and max-pooling layers are able to discover prominent keywords of a document.

![2](http://ou8qjsj0m.bkt.clouddn.com//17-8-6/45479331.jpg)

1. the convolutional layer of (1) generates for each word in a 5-word document a 4-dimensional local feature vector, which represents a distribution of four `topics`
    1. the most prominent topic of $w_2$ within its three word context window is the first topic, denoted by $u_2(1)$
    1. the most prominent topic of $w_5$ is $u_5(3)$
1. use max-pooling of (2) to form a global feature vector, which represents the topic distribution of the whole document
    1. v(1) and v(3) are two prominent topics
    1. for each prominent topic, we trace back to the local feature vector that survives max-pooling: $v(1)=\max_{i=1,\cdots,5}\{u_i(1)\}=u_2(1)$ and $v(3)=\max_{i=1,\cdots,5}\{u_i(3)\}=u_5(3)$
1. label the corresponding words of these local feature vectors, $w_2$ and $w_5$, as keywords of the document
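The trace-back procedure can be sketched as below. The numbers in `U` are made up to mirror the shape of the example (5 words, 4 topics) and are not taken from Figure 2; the function names and the `top_k` rule for deciding which topics count as "prominent" are assumptions of this sketch.

```python
def max_pool_with_trace(U):
    """Return v(j) = max_i u_i(j) and, for each j, the word index i that attains it."""
    dims = range(len(U[0]))
    v = [max(u[j] for u in U) for j in dims]
    winners = [max(range(len(U)), key=lambda i: U[i][j]) for j in dims]
    return v, winners


def keywords(words, U, top_k=2):
    """Label as keywords the words whose local features win the top_k largest v(j)."""
    v, winners = max_pool_with_trace(U)
    top_dims = sorted(range(len(v)), key=lambda j: v[j], reverse=True)[:top_k]
    return [words[winners[j]] for j in sorted(top_dims)]


# Illustrative local feature vectors u_1 .. u_5 (rows) over 4 topics (columns).
U = [
    [0.1, 0.2, 0.1, 0.0],
    [0.9, 0.1, 0.2, 0.1],
    [0.3, 0.3, 0.1, 0.2],
    [0.2, 0.1, 0.3, 0.1],
    [0.1, 0.2, 0.8, 0.3],
]
words = ["w1", "w2", "w3", "w4", "w5"]

v, winners = max_pool_with_trace(U)
print(v)                    # [0.9, 0.3, 0.8, 0.3]
print(keywords(words, U))   # ['w2', 'w5']
```

Note that indices in the code are 0-based, so `winners[0] == 1` corresponds to $u_2(1)$ in the notation above.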

Figure 3 presents a sample of document snippets and their keywords detected by the DSSM according to the procedure elaborated in Figure 2.

![3](http://ou8qjsj0m.bkt.clouddn.com//17-8-6/59818700.jpg)

#### Fully-Connected Layers h and y
$h=\tanh(W_1^T v)$ (3)

$y=\tanh(W_2^T h)$ (4)

where $W_1$ and $W_2$ are learned linear projection matrices.
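Equations (3) and (4) are two applications of the same affine-free tanh layer. The sketch below uses toy sizes and hand-picked weights, all of which are hypothetical; in the model the matrices are learned.

```python
import math


def dense_tanh(W, x):
    """Compute tanh(W^T x) for W stored as a list of rows with shape (len(x), out_dim)."""
    return [math.tanh(sum(W[r][k] * x[r] for r in range(len(x))))
            for k in range(len(W[0]))]


# Toy sizes (hypothetical): 4 pooled features -> 3 hidden units -> 2 outputs.
v = [0.9, 0.3, 0.8, 0.3]
W_1 = [[0.5, -0.2, 0.1],
       [0.0, 0.3, -0.4],
       [0.2, 0.2, 0.2],
       [-0.1, 0.0, 0.6]]
W_2 = [[1.0, -1.0],
       [0.5, 0.5],
       [-0.3, 0.8]]

h = dense_tanh(W_1, v)   # equation (3)
y = dense_tanh(W_2, h)   # equation (4)
print(len(h), len(y))    # 3 2
```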

### Training the DSSM
To optimize the parameters of the DSSM of Figure 1, i.e., 𝛉 = {𝐖𝑐, 𝐖1, 𝐖2}, we use a pair-wise rank loss as the objective.

Consider a source document 𝑠 and two candidate target documents 𝑡1 and 𝑡2, where 𝑡1 is more interesting than 𝑡2 to a user when reading 𝑠. We construct two pairs of documents (𝑠,𝑡1) and (𝑠,𝑡2), where the former is preferred and should have a higher interestingness score. Let ∆ be the difference of their interestingness scores: ∆ = 𝜎(𝑠,𝑡1) − 𝜎(𝑠,𝑡2), where 𝜎 is the interestingness score, computed as the cosine similarity:

$\sigma(s,t) \equiv sim_{\theta}(s,t) = \frac{y_s^T y_t}{\lVert y_s \rVert \lVert y_t \rVert}$ (5)

where 𝐲𝑠 and 𝐲𝑡 are the feature vectors of 𝑠 and 𝑡, respectively, which are generated using the DSSM, parameterized by 𝛉. Intuitively, we want to learn 𝛉 to maximize ∆.

We use the following logistic loss over ∆, which can be shown to upper bound the pairwise ranking error:

$\mathbb{L}(\Delta;\theta)=\log(1+\exp(-\gamma\Delta))$ (6)
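Equations (5) and (6) can be combined into a short sketch. Here 𝛾 is treated as a scaling hyperparameter; the default `gamma=10.0` is an assumption chosen for illustration, the function names are hypothetical, and the output vectors are assumed to be non-zero so that the cosine is defined.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors, as in equation (5)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def pairwise_logistic_loss(y_s, y_t1, y_t2, gamma=10.0):
    """Equation (6): log(1 + exp(-gamma * delta)), where t1 is the preferred target."""
    delta = cosine(y_s, y_t1) - cosine(y_s, y_t2)
    return math.log1p(math.exp(-gamma * delta))


y_s = [1.0, 0.0]
y_good = [1.0, 0.0]   # same direction as the source: cosine 1
y_bad = [0.0, 1.0]    # orthogonal to the source: cosine 0

loss_correct = pairwise_logistic_loss(y_s, y_good, y_bad)   # delta = +1, loss near 0
loss_wrong = pairwise_logistic_loss(y_s, y_bad, y_good)     # delta = -1, loss near gamma
print(loss_correct < loss_wrong)   # True
```

Since the cosine lies in [-1, 1], ∆ lies in [-2, 2]; a larger `gamma` makes the loss penalize mis-ordered pairs more sharply.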

### Using the DSSM
1. We use the DSSM as a feature generator. The output layer of the DSSM can be seen as a set of semantic features, which can be incorporated into a boosted-tree-based ranker
1. We use the DSSM as a direct implementation of the interestingness function $\sigma$