# 11. TEXT MINING IN MULTIMEDIA

* Babelfish (바벨피쉬): NLPpy - Text Mining (1)
* 김무성

# Contents

1. Introduction
2. Surrounding Text Mining
3. Tag Mining
   - 3.1 Tag Ranking
   - 3.2 Tag Refinement
   - 3.3 Tag Information Enrichment
4. Joint Text and Visual Content Mining
   - 4.1 Visual Re-ranking
5. Cross Text and Visual Content Mining
6. Summary and Open Issues
   - Joint text and visual content multimedia ranking
   - Scalable text mining for large-scale multimedia management
   - Multimedia social network mining

# Abstract

* A multimedia entity does not appear in isolation, but is accompanied by various forms of metadata, such as 
    - surrounding text, 
    - user tags, 
    - ratings, and 
    - comments etc.
* Specifically, the survey focuses on four aspects: 
    - (a) surrounding text mining; 
    - (b) tag mining; 
    - (c) joint text and visual content mining; and 
    - (d) cross text and visual content mining. 

# Keywords :
* Text Mining, 
* Multimedia, 
* Surrounding Text, 
* Tagging, 
* Social Network

# 1. Introduction

<font color="red">On the other hand, a multimedia entity does not appear in isolation but is accompanied by various forms of textual metadata.</font>

<img src="figures/cap11.1.png" width=600 />

<img src="figures/cap11.2.png" width=600 />

<img src="figures/cap11.3.png" width=600 />

# 2. Surrounding Text Mining

* Developing an effective extraction algorithm for the comprehensive analysis of surrounding text has been a very challenging task. 
    - In many cases, automatically determining which page region is more relevant to the image than the others could be difficult.
    - Moreover, how large the region nearby should be considered is still an open question. 
    - Further, the quality of surrounding texts could be low and inconsistent.

#### The earliest efforts on modeling and analyzing surrounding texts to facilitate <font color="red">multimedia retrieval</font> 
* AltaVista’s A/V Photo Finder [1]
    - The indexing terms are precomputed based on the HTML documents containing the Web images.
* WebSeer system [12]
    - With a similar approach, the WebSeer system harvests the information for indexing Web images from two different sources:
        - the related HTML text
            - It extracts keywords from page title, file name, caption, alternative text, image hyperlinks, and body text titles.
            - A weight is calculated for each keyword based on its location inside a page.
        - and the embedded image itself
* PICTION system [40]
    - exploits both textual and visual information to index a pictorial database
    - Image captions serve as an important cue for identifying the faces that appear in a related newspaper photograph.
    - While the system can be successfully adopted for accessing photographs in newspapers or magazines, it is not straightforward to apply it to Web image retrieval.
* WebSeek [39]
    - Smith and Chang proposed the WebSeek framework designed to search images from the Web.
    - The key idea is to analyze and classify the Web multimedia objects into a predefined taxonomy of categories.
    - Thus, an initial search can be performed to explore a catalog associated with the query terms.
    - The image attribute (e.g., color histogram for images) is then computed for similarity matching within the category.
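
The location-based keyword weighting described for WebSeer can be sketched as follows. This is only an illustration: the field names and the numeric weights in `FIELD_WEIGHTS` are assumptions made for the example, since the notes only state that a keyword's weight depends on where it appears in the page.

```python
from collections import defaultdict

# Hypothetical weights per HTML location; the source gives no actual values.
FIELD_WEIGHTS = {
    "title": 3.0,
    "file_name": 4.0,
    "caption": 5.0,
    "alt_text": 5.0,
    "link_text": 2.0,
    "body_heading": 1.0,
}


def keyword_weights(fields, field_weights=FIELD_WEIGHTS):
    """Add the weight of a field to every keyword that occurs in it."""
    scores = defaultdict(float)
    for field, text in fields.items():
        weight = field_weights.get(field, 0.0)
        for token in text.lower().split():
            scores[token] += weight
    return dict(scores)


# Text harvested from the page that embeds one image (toy data).
page = {
    "title": "sunset beach",
    "alt_text": "sunset",
    "body_heading": "travel photos",
}
weights = keyword_weights(page)
best_keyword = max(weights, key=weights.get)
```

Here a keyword that appears in several strongly weighted locations, such as the title and the alternative text, ends up as the top indexing term for the image.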

#### Besides its efficacy in image retrieval, surrounding text has been explored for <font color="red">image annotation</font> recently.
* predefined semantic concepts [9]
    - To achieve better annotation effectiveness, 
    - a co-training scheme is designed to explore the association between 
        - the text features computed using corresponding HTML documents and 
        - visual features extracted from image content.
    - Iterative Similarity Propagation
        - Observing that the links between the visual content and the surrounding texts can be modeled via Web page analysis
        - a novel method called Iterative Similarity Propagation is proposed to refine the closeness between the Web images and their annotations [50]
* Consequently, accurate clustering is a very crucial technique to facilitate Web multimedia search and many algorithms have recently been proposed based on the analysis of surrounding texts and low level visual features [3][13][34].
    - For example, Cai et al. [3] proposed a hierarchical clustering method that exploits
        - visual, 
        - textual, and 
        - link analysis.
    - By using block-level link analysis techniques, an image graph is constructed. They then applied spectral techniques to find a Euclidean embedding of the images.
        - As a result, each image has three types of representations: 
            - visual feature, 
            - textual feature, and 
            - graph-based representation.

# 3. Tag Mining
* 3.1 Tag Ranking
* 3.2 Tag Refinement
* 3.3 Tag Information Enrichment

<font color="red">In newly emerging social media sharing services, such as Flickr and YouTube, users are encouraged to share multimedia data on the Web and annotate content with tags.</font>

The tags can be used to index multimedia data and support efficient tag-based search.

The existing works mainly focus on the following three aspects: 
* (a) tag ranking, 
    - which aims to differentiate the tags associated with the images with various levels of relevance; 
* (b) tag refinement 
    - with the purpose to refine the unreliable human-provided tags; and 
* (c) tag information enrichment, 
    - which aims to supplement tags with additional information [26].

## 3.1 Tag Ranking

* As shown in [25], the relevance level of the tags cannot be distinguished from the tag list of an image. 
* <font color="red">The lack of relevance information in the tag list</font> has limited the application of tags.

### Works
* Liu et al. [25] 
    - proposed to estimate tag relevance scores using kernel density estimation, 
    - and then employ random walk to boost this primary estimation.
* Li et al. [22] 
    - proposed a data driven method for tag ranking. 
    - They learned the relevance scores of tags by a neighborhood voting approach.
        - Given an image and one of its associated tags, the relevance score is learned by accumulating the votes from the visual neighbors of the image.
    - In [23], the work is extended to multiple visual spaces.
        - They learned the relevance scores of tags and ranked them by neighborhood voting in different feature spaces, and the results are aggregated with a score fusion or rank fusion method.
        - Different aggregation methods have been investigated, such as the average score fusion, Borda count and RankBoost.
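
The neighborhood voting idea can be illustrated with a small sketch. It assumes each image is already described by a low-dimensional visual feature vector, uses plain Euclidean distance to find neighbors, and subtracts the number of votes expected from the tag's overall frequency; the function name, the data layout, and the toy collection are all hypothetical.

```python
import math


def neighbor_vote(target_features, tag, collection, k=3):
    """Votes for `tag` among the k visually nearest images, minus the
    number of votes expected from the tag's frequency in the collection."""
    by_distance = sorted(
        collection,
        key=lambda item: math.dist(target_features, item["features"]),
    )
    neighbors = by_distance[:k]
    votes = sum(1 for item in neighbors if tag in item["tags"])
    tagged = sum(1 for item in collection if tag in item["tags"])
    prior = k * tagged / len(collection)
    return votes - prior


# Toy collection: three seaside images and three street images.
collection = [
    {"features": (0.0, 0.1), "tags": {"sea", "beach"}},
    {"features": (0.1, 0.0), "tags": {"sea"}},
    {"features": (0.2, 0.1), "tags": {"sea", "2010"}},
    {"features": (5.0, 5.0), "tags": {"street", "2010"}},
    {"features": (5.1, 4.9), "tags": {"street"}},
    {"features": (4.9, 5.2), "tags": {"2010"}},
]
target = (0.05, 0.05)
scores = {
    tag: neighbor_vote(target, tag, collection)
    for tag in ("sea", "2010", "street")
}
ranked_tags = sorted(scores, key=scores.get, reverse=True)
```

A tag such as "2010" is frequent but not tied to visual content, so its votes barely exceed what its frequency predicts and it ranks below "sea".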

## 3.2 Tag Refinement

<font color="red">User-provided tags are often noisy and incomplete.</font>

Tag refinement technologies are proposed aiming at obtaining more accurate and complete tags for multimedia description.

<img src="figures/cap11.4.png" width=600 />

A lot of tag refinement approaches have been developed based on various statistical learning techniques. Most of them are based on the following three assumptions.
* The refined tags <font color="red">should not change too much from those provided by the users</font>. 
    - This assumption is usually used to <font color="blue">regularize</font> the tag refinement.
* <font color="red">The tags of visually similar images should be closely related</font>. 
    - This is a natural assumption that most <font color="blue">automatic tagging</font> methods are also built upon.
* <font color="red">Semantically close or correlative tags should appear with high correlation</font>. 
    - For example, when a tag “sea” exists for an image, the tags “beach” and “water” should be assigned with higher confidence while the tag “street” should have low confidence.
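
The third assumption can be made concrete with a co-occurrence estimate. The sketch below is not taken from any of the cited methods: it simply scores a candidate tag by the average estimated probability of seeing it together with each tag the image already has, using a toy tagged collection.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_confidence(candidate, existing_tags, tagged_images):
    """Average estimate of P(candidate | existing tag) over the existing tags."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in tagged_images:
        tag_counts.update(tags)
        pair_counts.update(
            frozenset(pair) for pair in combinations(sorted(tags), 2)
        )
    estimates = []
    for tag in existing_tags:
        if tag_counts[tag]:
            together = pair_counts[frozenset((candidate, tag))]
            estimates.append(together / tag_counts[tag])
    return sum(estimates) / len(estimates) if estimates else 0.0


# Toy tag sets, one per image.
tagged_images = [
    {"sea", "beach", "water"},
    {"sea", "beach"},
    {"sea", "water"},
    {"street", "car"},
    {"street", "sea"},
]
beach_conf = cooccurrence_confidence("beach", ["sea"], tagged_images)
street_conf = cooccurrence_confidence("street", ["sea"], tagged_images)
```

With this toy data, "beach" receives a higher confidence than "street" for an image tagged "sea", which matches the intuition in the example above.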

### Works
* Chen et al. [6] 
    - first trained an SVM classifier for each tag with the loosely labeled positive and negative samples. 
    - The classifiers are used to estimate the initial relevance scores of tags.
    - They then refined the scores with a graph-based method 
        - that simultaneously considers the similarity between images and semantic correlation among tags.
* Xu et al. [52]
    - proposed a tag refinement algorithm from topic modeling point of view.
    - regularized latent Dirichlet allocation (rLDA)
* Zhu et al. [64]
    - proposed a matrix decomposition method. 
    - They used a matrix to represent the image-tag relationship
* Fan et al. [8] 
    - grouped images with a target tag into clusters. 
    - Each cluster is regarded as a unit. 
    - The initial relevance scores of the clusters are estimated and then refined by a random walk process.
* Liu et al. [24] 
    - adopted a three-step approach. 
        - The first step filters out tags that are intrinsically content-unrelated based on the ontology in WordNet. 
        - The second step refines the tags based on the consistency of visual similarity and semantic similarity of images. 
        - The last step performs tag enrichment, which expands the tags with their appropriate synonyms and hypernyms.

## 3.3 Tag Information Enrichment

* In the manual tagging process, generally human labelers will only assign appropriate tags to multimedia entities without any additional information, such as the image regions depicted by the corresponding tags. 
* But <font color="red">by employing computer vision and machine learning technologies</font>, certain information of the tags, such as the descriptive regions and saliency, <font color="red">can be automatically obtained</font>.
* We refer to these as tag information enrichment.
    - Most existing works employ the following two steps for tag information enrichment. 
        - First, <font color="red">tags</font> are <font color="red">localized</font> into regions of images or sub-clips of videos. 
        - Second, the characteristics of the regions or sub-clips are <font color="red">analyzed</font>, 
            - and the information about the tags is enriched accordingly.

<img src="figures/cap11.5.png" width=600 />

### Works
* Liu et al. [28] 
    - proposed a method to locate image tags to corresponding regions. 
        - They first performed over-segmentation to decompose each image into patches 
        - and then discovered the relationship between patches and tags via sparse coding. 
        - The over-segmented regions are then merged to accomplish the tag-to-region process.
    - Liu et al. extended the approach based on image search [29].
        - For a tag of the target image, they collected a set of images by using the tag as query with an image search engine. 
        - They then learned the relationship between the tag and the patches in this image set.
    - Liu et al. [27] accomplished the tag-to-region task 
        - by regarding an image as a bag of regions 
        - and then performed tag propagation on a graph, 
            - in which vertices are images and edges are constructed based on the visual link of regions.
* Feng et al. [10] 
    - proposed a tag saliency learning scheme,
        - which is able to rank tags according to their saliency levels to an image’s content.
        - They first located tags to images’ regions with a multi-instance learning approach.
            - In multi-instance learning, an image is regarded as a bag of multiple instances, i.e., regions [58]
        - They then analyzed the saliency values of these regions.
* Yang et al. [55] 
    - proposed a method to associate a tag with a set of properties, 
        - including location, color, texture, shape, size and dominance.
        - They employed a multi-instance learning method 
            - to establish the region that each tag corresponds to, 
            - and the region is then analyzed to establish the properties, 
                - as shown in Figure 11.5 (b).
* Sun and Bhowmick [41] 
    - defined a tag’s visual representativeness
        - based on a large image set and the subset that is associated with the tag. 
    - They employed two distance metrics,
        - cohesion and 
        - separation, 
            - to estimate the visual representativeness measure.
* Ulges et al. [43] 
    - proposed an approach to localize video-level tags to keyframes. 
        - Given a tag, it regards whether a keyframe is relevant as a latent random variable. 
        - An EM-style process is then adopted to estimate the variables.
* Li et al. [21] 
    - employed a multi-instance learning approach to accomplish the video tag localization, 
        - in which video and shot are regarded as bag and instance, respectively.
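
To give a feel for the cohesion and separation measures mentioned for [41], here is a minimal sketch. The definitions used are the common clustering ones (mean distance to the tagged-set centroid, and distance between the tagged-set centroid and the whole-collection centroid); they are an assumption for illustration and may differ from the exact formulation in the paper.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def cohesion(tagged):
    """Mean distance from each tagged image to the tagged-set centroid.
    Lower values mean the tagged images look more alike."""
    center = centroid(tagged)
    return sum(math.dist(p, center) for p in tagged) / len(tagged)


def separation(tagged, collection):
    """Distance between the tagged-set centroid and the collection centroid.
    Higher values mean the tagged images stand out from the collection."""
    return math.dist(centroid(tagged), centroid(collection))


# Toy 2-D visual features: the whole collection and the subset with the tag.
collection = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
tagged = [(0.0, 0.0), (0.0, 2.0)]

tag_cohesion = cohesion(tagged)
tag_separation = separation(tagged, collection)
```

A tag whose images are tightly grouped (low cohesion distance) and far from the collection average (high separation) would be considered visually representative under this reading.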

# 4. Joint Text and Visual Content Mining
* 4.1 Visual Re-ranking

<font color="red">The integration of text and visual content has been found to be more effective than exploiting purely text or visual content separately.</font>

* The joint text and content mining in multimedia retrieval often comes down to finding effective mechanisms for <font color="red">fusing multi-modality information</font> from textual metadata and visual content.
* Existing research efforts can generally be categorized into four paradigms: 
    - (a) linear fusion; 
    - (b) latent-space-based fusion; 
    - (c) graph-based fusion; and 
    - (d) visual re-ranking 
        - that exploits visual information to refine text-based retrieval results.

### Linear fusion 
* Linear fusion combines the retrieval results from various modalities linearly [18][4][31].
* In [18], 
    - visual content and text are combined in both online learning stage with relevance feedback and offline keyword propagation. 
* In [31], 
    - linear, max, and average fusion strategies are employed to aggregate the search results from visual and textual modalities. 
* Chang et al. [4] 
    - adopted a query-class-dependent fusion approach. 
* <font color="red">The critical task in linear fusion is the estimation of fusion weights of different modalities.</font>
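
The linear, max, and average fusion strategies mentioned for [31] can be sketched in a few lines. The scores, the item names, and the `text_weight` parameter are hypothetical; in practice the fusion weight would have to be estimated, which is the critical task noted above.

```python
def fuse(text_score, visual_score, strategy="linear", text_weight=0.5):
    """Combine one text score and one visual score into a single score."""
    if strategy == "linear":
        return text_weight * text_score + (1.0 - text_weight) * visual_score
    if strategy == "max":
        return max(text_score, visual_score)
    if strategy == "average":
        return (text_score + visual_score) / 2.0
    raise ValueError(f"unknown strategy: {strategy}")


def rank(items, **kwargs):
    """items maps an id to (text_score, visual_score); returns ids, best first."""
    return sorted(
        items,
        key=lambda name: fuse(*items[name], **kwargs),
        reverse=True,
    )


# Toy per-modality scores for three retrieved images.
results = {"a": (0.9, 0.2), "b": (0.5, 0.7), "c": (0.1, 0.8)}

linear_order = rank(results, strategy="linear", text_weight=0.8)
max_order = rank(results, strategy="max")
average_order = rank(results, strategy="average")
```

The three strategies produce three different orderings on the same scores, which shows how sensitive the final ranking is to the choice of fusion rule and weights.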

### The latent space based fusion 
* The latent space based fusion assumes that
    - <font color="red">there is a latent space shared by different modalities</font>
    - and thus unifies different modalities by transferring the features of these modalities into the shared latent space [63][62]. 
* Zhao et al. [63] 
    - adopted the Latent Semantic Indexing (LSI) method to fuse text and visual content. 
* Zhang et al. [62] 
    - proposed a probabilistic context model to explicitly exploit the synergy between text and visual content. 
    - The synergy is represented as a hidden layer between the image and text modalities.

### Graph based approach 
* Graph based approach [49] 
    - first builds the relations between different modalities, 
        - such as relations between images and text using the Web page structure. 
    - The relations are then utilized to iteratively update the similarity graphs computed from different modalities. 
* <font color="red">The difficulty of creating similarity graphs for billions of images on the Web makes this approach insufficiently scalable.</font>

## 4.1 Visual Re-ranking

Visual re-ranking is emerging as one of the promising techniques for automated boosting of retrieval precision [42] [30] [55].

* In particular, 
    - given a textual query, 
    - an <font color="red">initial list of multimedia entities</font> is returned <font color="red">using the text-based retrieval scheme</font>.
    - Subsequently, the most relevant results are moved to the top of the result list while the less relevant ones are <font color="red">reordered</font> to the lower ranks. 
    - As such, the overall search precision at the top ranks can be enhanced dramatically.

* According to the statistical analysis model used, the <font color="red">existing re-ranking approaches</font> can roughly be categorized into <font color="red">three categories</font> including 
    - the clustering based, 
    - classification based and 
    - graph based methods.

### clustering based
* Cluster analysis is very useful to <font color="red">estimate the inter-entity similarity</font>.
* The clustering based re-ranking methods stem from the key observation that <font color="red">a lot of visual characteristics can be shared by relevant images or video clips</font>
* mean-shift, K-means, K-medoids, etc.
* Hsu et al. [16]
    - One good example of clustering based re-ranking algorithms is an Information Bottleneck based scheme developed by Hsu et al. [16].
    - Its main objective is to identify optimal clusters of images that can minimize the loss of mutual information.
* In [19]
    - a fast and accurate scheme is proposed for grouping Web image search results into semantic clusters.
    - For a given query, a few related semantic clusters are identified in the first step. 
    - Then, the cluster names relating to query are derived and used as text keywords for querying image search engine. 
    - The empirical results from a set of user studies demonstrate an improvement in performance over Google image search results
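
A very reduced sketch of clustering based re-ranking is given below. It assumes the visual clusters are already available (for example from K-means) and uses a simple rule that is not the Information Bottleneck method itself: clusters are ordered by their mean text-based score, and items inside a cluster keep their text-based order. All names and values are hypothetical.

```python
from collections import defaultdict


def cluster_rerank(initial_scores, cluster_of):
    """Order clusters by mean initial score, then items within each cluster
    by their own initial score."""
    members = defaultdict(list)
    for item in initial_scores:
        members[cluster_of[item]].append(item)

    def mean_score(cluster):
        items = members[cluster]
        return sum(initial_scores[i] for i in items) / len(items)

    ordered = []
    for cluster in sorted(members, key=mean_score, reverse=True):
        ordered.extend(
            sorted(members[cluster], key=initial_scores.get, reverse=True)
        )
    return ordered


# Text-based scores and precomputed visual cluster labels (toy data).
initial_scores = {"img1": 0.9, "img2": 0.85, "img3": 0.7, "img4": 0.6}
cluster_of = {"img1": "A", "img2": "B", "img3": "B", "img4": "A"}

reranked = cluster_rerank(initial_scores, cluster_of)
```

In this toy case `img1` has the best text score but sits in the weaker cluster, so the two visually consistent images of cluster B move ahead of it.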

### classification based
* In the classification based methods, visual re-ranking is formulated <font color="red">as a binary classification problem</font> aiming to identify <font color="red">whether each search result is relevant or not</font>.
* The major process for result list reordering consists of three major steps: 
    - (a) the selection of pseudo-positive and pseudo-negative samples; 
    - (b) use the samples obtained in step (a) to train a classification scheme; and 
    - (c) reorder the samples according to their relevance scores given by the trained classifier.
    - <font color="red">pseudo relevance feedback (PRF)</font>
        - For existing classification methods, pseudo relevance feedback (PRF) is applied to select the training examples. 
        - It assumes that: 
            - (a) a limited number of top-ranked entities in the initial retrieval results are highly relevant to the search queries; and 
            - (b) automatic local analysis over the entities can be very helpful to refine query representation.
* In [54], 
    - the query images or video clip examples are used as the pseudo-positive samples. 
    - The pseudo-negative samples are selected from either the least relevant samples in the initial result list or the databases that contain fewer samples related to the query. 
    - The second step of the classification based methods aims to train classifiers, and a wide range of statistical classifiers can be adopted. 
        - They include the Support Vector Machine (SVM) [54], Boosting [53] and ListNet [57].
* <font color="red">However, in many real scenarios, the training examples obtained via PRF are very noisy and might not be adequate for training effective classifier.</font>
* Fergus et al. [11] 
    - Fergus et al. [11] used RANSAC to sample a training subset with a high percentage of relevant images.
    - A generative constellation model is learned for the query category while a background model is learned from the query “things”. 
    - Images are re-ranked based on their likelihood ratio.
* Schroff et al. [35] 
    - first learned a query independent text based re-ranker. 
    - The top ranked results from the text based re-ranking are then selected as positive training examples. 
    - Negative training examples are picked randomly from the other queries. 
    - A binary SVM classifier is then used to re-rank the results on the basis of visual features.
* Wang et al. [44] 
    - learned a generative text model from the query’s Wikipedia page and a discriminative image model from the Caltech [15] and Flickr data sets. 
    - Search results are then re-ranked on the basis of these learned probability models. Some user interactions are required to disambiguate the query.
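
The three steps of classification based re-ranking with pseudo relevance feedback can be sketched as follows. To keep the example dependency-free, a nearest-centroid scorer stands in for the SVM, Boosting, or ListNet classifiers named above; the function name, the feature vectors, and the choice of two pseudo-positive and two pseudo-negative samples are assumptions for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    dims = len(vectors[0])
    return tuple(sum(v[d] for v in vectors) / len(vectors) for d in range(dims))


def prf_rerank(ranked_ids, features, n_pos=2, n_neg=2):
    """ranked_ids: text-based ranking, best first.
    features: id -> visual feature vector."""
    # (a) Pseudo-positives come from the top of the initial list,
    #     pseudo-negatives from the bottom.
    pos = centroid([features[i] for i in ranked_ids[:n_pos]])
    neg = centroid([features[i] for i in ranked_ids[-n_neg:]])

    # (b) "Training" here is just computing the two centroids above;
    #     a real system would fit a classifier on these samples.
    def relevance(item):
        vector = features[item]
        return math.dist(vector, neg) - math.dist(vector, pos)

    # (c) Reorder all results by the classifier's relevance score.
    return sorted(ranked_ids, key=relevance, reverse=True)


features = {
    "r1": (1.0, 1.0), "r2": (1.2, 0.8), "r3": (9.0, 9.0),
    "r4": (0.9, 1.1), "r5": (8.0, 9.5), "r6": (9.5, 8.5),
}
initial = ["r1", "r2", "r3", "r4", "r5", "r6"]
reranked = prf_rerank(initial, features)
```

In this toy run `r3` is ranked third by text but looks like the pseudo-negative samples, so it drops below `r4`, which resembles the top-ranked results.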

### graph based
* Graphs provide a natural and comprehensive way to <font color="red">explore complex relations between data at different levels</font> and have been applied to a wide range of applications [59][46][47][60].
* With the graph based re-ranking methods, 
    - the multimedia entities in top ranks and their associations/dependencies can be represented as a collection of nodes (vertices) and edges.
* In [16], Hsu et al. 
    - modeled the re-ranking process as a random walk over the context graph. 
    - In order to effectively leverage the retrieved results from text search, 
    - each sample corresponds to a “dongle” node containing ranking score based on text. 
    - For the framework, edges between “dongle” nodes are weighted with multi-modal similarities.
* <font color="red">In many cases, the structure of large scale graphs can be very complex and this easily makes related analysis process very expensive in terms of computational cost.</font>
* Jing and Baluja [17]
    - proposed a <font color="red">VisualRank</font> framework to efficiently model the similarity of Google image search results with a graph. 
    - The framework 
        - casts the re-ranking problem as random walk on an affinity graph and 
        - reorders images according to the visual similarities. 
        - The final result list is generated via sorting the images based on graph nodes’ weights.
* In [42], Tian et al. 
    - presented a Bayesian video search re-ranking framework 
        - formulating the re-ranking process as an energy minimization problem. 
    - The main design goal is to 
        - optimize the consistency of ranking scores over visually similar videos and 
        - minimize the disagreement between the optimal list and the initial list.
* <font color="red">Indeed, graph analysis has been shown to be a very powerful tool for analyzing and identifying salient structure and useful patterns inside the visual search results.</font>
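
The random walk on an affinity graph used in VisualRank-style re-ranking can be sketched as a PageRank-like power iteration. This is a simplified illustration rather than the published algorithm: the similarity matrix is toy data, and the damping factor of 0.85 and the fixed number of iterations are assumptions.

```python
def visual_rank(similarity, damping=0.85, iterations=50):
    """Power iteration on a column-normalised visual similarity matrix."""
    n = len(similarity)
    col_sums = [sum(similarity[i][j] for i in range(n)) for j in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [
            (1.0 - damping) / n
            + damping * sum(
                similarity[i][j] / col_sums[j] * rank[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            for i in range(n)
        ]
    return rank


# Pairwise visual similarities: images 0-2 resemble each other,
# image 3 is an outlier.
similarity = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.1],
    [0.8, 0.7, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
scores = visual_rank(similarity)
order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
```

Images that are similar to many other results accumulate weight during the walk, so the visually isolated image ends up last in `order`.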

# 5. Cross Text and Visual Content Mining

* However, <font color="red">in some real world applications, images may not always have associated text</font>. 
    - For example, most surveillance images/videos in in-house repositories are not accompanied by any text. 
    - Even on social media Websites such as Flickr, there exist a substantial number of images without any tags.
* In such cases, <font color="red">joint text and visual content mining cannot be applied due to missing text modality.</font>
* Recently, <font color="red">cross text and visual content mining</font> has been studied in the context of transfer learning techniques. This class of techniques emphasizes the <font color="red">transferring of knowledge across different domains or tasks</font> [32].

Cross text and visual content mining does not require that a test image has an associated text modality, and is thus <font color="red">beneficial to dealing with the images without any text by propagating the semantic knowledge from text to images</font>.
* It is also motivated by two observations. 
    - First, visual content of images is much more complicated than the text feature. 
        - While the textual words are easier to interpret, there exists a tremendous semantic gap between visual content and high-level semantics. 
    - Second, image understanding becomes particularly challenging when only a few labeled images are available for training.

However, it is not trivial to transfer knowledge between various domains/tasks due to the following <font color="red">challenges</font>:
* The target data may be drawn from a <font color="red">distribution different from the source data</font>.
* The target and source data may be in <font color="red">different feature spaces</font> (e.g., image and text) and there may be <font color="red">no correspondence</font> between instances in these spaces.
* The target and source tasks may have <font color="red">different output spaces</font>.

Figure 11.6 from [56] presents an intuitive illustration of four learning paradigms, including 
* traditional machine learning, 
* transfer learning across different distributions, 
* multi-view learning and 
* heterogeneous transfer learning. 

<img src="figures/cap11.6.png" width=600 />

* As we can see, heterogeneous transfer learning is usually much more challenging due to the unknown correspondence across the distinct feature spaces. 
* In order to learn the underlying correspondence for knowledge transformation, a “<font color="red">semantic bridge</font>” is required.

### Works
* Most existing works exploit the tag information that provides text-to-image linking information.
* Dai et al. [7] 
    - showed that such information can be effectively leveraged for transferring knowledge between text and images. 
    - The key idea of [7] is to construct a correspondence between the images and the auxiliary text data with the use of tags.
    - Probabilistic latent semantic analysis (PLSA) model is employed to construct a latent semantic space which can be used for transferring knowledge.
* Chen et al. [56] 
    - proposed the concept of heterogeneous transfer learning and applied it to improve image clustering by leveraging auxiliary text data. 
    - They collected annotated images from the social web, 
    - and used them to construct a text to image mapping. 
    - The algorithm is referred to as aPLSA (Annotated Probabilistic Latent Semantic Analysis). 
    - The key idea is to unify two different kinds of latent semantic analysis in order to create a bridge between the text and images.
        - The first kind of technique performs PLSA analysis on the target images, which are converted to an image instance-to-feature co-occurrence matrix.
        - The second kind of PLSA is applied to the annotated image data from social Web, which is converted into a text-to-image feature co-occurrence matrix.
    - In order to unify those two separate PLSA models, these two steps are done simultaneously with common latent variables used as a bridge linking them.
* Qi et al. [33] 
    - proposed to learn a “translator” which can directly establish the semantic correspondence between text and images 
        - even if they are new instances of the image data with unknown correspondence to the text articles.
    - This capability increases the flexibility of the approach and makes it more widely applicable. 
        - Specifically, they created a new topic space into which both the text and images are mapped. 
    - A translator is then learned to link the instances across heterogeneous text and image spaces. 
    - With the resultant translator, the semantic labels can be propagated from any labeled text corpus to any new image by a process of cross-domain label propagation.
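
The cross-domain label propagation step can be illustrated with a toy sketch. It assumes that text documents and images have already been mapped into a shared topic space, and it replaces the learned translator with a fixed inner product; both simplifications, as well as the function names and the data, are assumptions made for the example and do not reproduce the method of [33].

```python
def translator(text_topics, image_topics):
    """Hypothetical translator: inner product in a shared topic space.
    In the cited work the translator is learned, not fixed."""
    return sum(t * v for t, v in zip(text_topics, image_topics))


def propagate_label(image_topics, labeled_texts):
    """labeled_texts: list of (topic_vector, label) with label in {+1, -1}.
    Each labeled text votes for its label, weighted by the translator."""
    score = sum(
        label * translator(topics, image_topics)
        for topics, label in labeled_texts
    )
    return 1 if score >= 0 else -1


# Labeled text corpus in a 2-topic space: (beach-like, city-like).
labeled_texts = [
    ((0.9, 0.1), +1),
    ((0.8, 0.2), +1),
    ((0.1, 0.9), -1),
]
beach_image = (0.7, 0.3)
city_image = (0.1, 0.9)

beach_label = propagate_label(beach_image, labeled_texts)
city_label = propagate_label(city_image, labeled_texts)
```

Neither image has any associated text, yet each receives a label from the text corpus because the translator links the two feature spaces.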

# 6. Summary and Open Issues
* Joint text and visual content multimedia ranking
* Scalable text mining for large-scale multimedia management
* Multimedia social network mining

Although research efforts in this field have made great progress in various aspects, there are still many open research issues that need to be explored.

## Joint text and visual content multimedia ranking

* Despite the success of visual re-ranking in multimedia retrieval, visual re-ranking <font color="red">only employs the visual content to refine text-based retrieval results</font>; 
* visual content has not been used to assist in learning the ranking model of search engine, and sometimes it is only able to bring in limited performance improvements.
    -  In particular, if text-based ranking model is biased or over-fitted, re-ranking step will suffer from the error that is propagated from the initial results, and thus the performance improvement will be negatively impacted.

## Scalable text mining for large-scale multimedia management

Despite the success of existing text mining in multimedia, most existing techniques <font color="red">suffer from difficulties in handling large-scale multimedia data</font>.

## Multimedia social network mining

* Multimedia social networking is becoming an important part of media consumption for Internet users. 
* It brings in new and rich metadata, such as user preferences, interests, behaviors, social relationships, and social network structure etc.
* Numerous research topics can be explored, including 
    - (a) the combination of conventional techniques with information derived from social network communities; 
    - (b) fusion analysis of content, text, and social network data; and 
    - (c) personalized multimedia analysis in social networking environments.

# References
* Numbers in [] are the reference numbers used in book (1)
* (1) Mining Text Data - http://link.springer.com/book/10.1007/978-1-4614-3223-4/page/1