# 8. Probabilistic Models for Text Mining

* 바벨피쉬 / 바벨피쉬Py: Part 5 - Text Mining [1]
* 김무성

# Contents

1. Introduction 
2. Mixture Models 
3. Stochastic Processes in Bayesian Nonparametric Models 
4. Graphical Models 
5. Probabilistic Models with Constraints 
6. Parallel Learning Algorithms 
7. Conclusions

# 1. Introduction 

The major probabilistic models covered in this chapter include:


* Mixture Models
    - PLSA
    - LDA
* Bayesian Nonparametric Models
    - Dirichlet process
* Bayesian Networks
* Hidden Markov Model
    - part-of-speech tagging in NLP
* Markov Random Fields
* Conditional Random Fields
    - Named entity recognition

# 2. Mixture Models 

* 2.1 General Mixture Model Framework 
* 2.2 Variations and Applications 
* 2.3 The Learning Algorithms 

<img src="http://www.vtkjournal.org/download/logopublication/4876/big" />

## 2.1 General Mixture Model Framework 

<img src="http://cfile8.uf.tistory.com/image/257DDF35514C692E232BE5" />

<img src="figures/cap8.1.png" />

<img src="figures/cap8.2.png" />

<img src="figures/cap8.3.png" width=600 />

From the point of view of the generative process, each observed data point xi is generated as follows:

<img src="figures/cap8.4.png" width=600 />

### Example: Mixture of Unigrams

<img src="http://yosinski.com/mlss12/media/slides/MLSS-2012-Domingos-Statistical-Relational-Learning_082.png" />

<img src="https://reference.wolfram.com/language/ref/Files/MultinomialDistribution.en/O_1.gif" />

<img src="http://openeco.eu/images/home/text-mining/bag-of-words.png" />

<img src="http://www.xperseverance.net/blogs/wp-content/uploads/image/LDA_mixture_of_unigram.png" />

A document di is represented as a bag of words wi = (ci,1, ci,2, ..., ci,m), where m is the size of the vocabulary and ci,j is the number of occurrences of term wj in document di. Each document is modeled as a draw from a mixture of unigram language models. That is, each component is a multinomial distribution over terms with parameters βk,j, the probability of term wj in cluster k, i.e., p(wj|βk), for k = 1,...,K and j = 1,...,m.

The joint probability of observing the whole document collection is then:

<img src="figures/cap8.5.png" width=600 />

* where πk is the proportion weight for cluster k
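The joint probability above can be evaluated numerically. The following is a minimal sketch, assuming the standard mixture-of-unigrams form in which each document contributes Σk πk Πj βk,j^ci,j (the multinomial coefficient is dropped because it does not depend on the parameters). The function names and the toy numbers are illustrative and not part of the original notes.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mixture_of_unigrams_log_likelihood(counts, pi, beta):
    """Log of the joint probability of all documents.

    counts[i][j]: count c_{i,j} of term j in document i
    pi[k]:        mixing weight of cluster k
    beta[k][j]:   probability of term j in cluster k (assumed > 0 where counts > 0)
    """
    total = 0.0
    for c_i in counts:
        per_cluster = []
        for pi_k, beta_k in zip(pi, beta):
            log_p = math.log(pi_k)
            for c_ij, b_kj in zip(c_i, beta_k):
                if c_ij > 0:
                    log_p += c_ij * math.log(b_kj)
            per_cluster.append(log_p)
        # Sum over clusters inside the log, product over documents outside.
        total += log_sum_exp(per_cluster)
    return total

# Toy corpus: two documents over a two-term vocabulary.
counts = [[2, 0], [0, 1]]
pi = [0.5, 0.5]
beta = [[0.9, 0.1], [0.2, 0.8]]
log_lik = mixture_of_unigrams_log_likelihood(counts, pi, beta)
```

Working in log space avoids underflow when documents are long, since the product of many small term probabilities quickly falls below floating-point range.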

## 2.2 Variations and Applications 

* 2.2.1 Topic Models
* 2.2.2 Other Applications

### 2.2.1 Topic Models

#### PLSA

* Probabilistic latent semantic analysis (PLSA)

<img src="http://parkcu.com/blog/wp-content/uploads/2013/06/word-topic-example.png" />

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Plsi_1.svg/300px-Plsi_1.svg.png" />

The probability of observing term wj in di is then defined by the following mixture:

<img src="figures/cap8.6.png" width=600 />

* where p(k|di) = p(zi,j = k) is the mixing proportion of different topics for di, βk is the parameter set for multinomial distribution over terms for topic k, and p(wj|βk) = βk,j. 

The joint probability of observing all the terms in document di is:

<img src="figures/cap8.7.png" width=600 />

* where wi is defined as in the mixture of unigrams and p(di) is the probability of generating di.
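As a rough illustration of the two PLSA quantities above, the sketch below computes p(wj|di) = Σk p(k|di) βk,j and the log of the joint probability p(wi, di) = p(di) Πj p(wj|di)^ci,j. This assumes the usual PLSA factorization; the names `theta_i` (for the vector of p(k|di)) and `p_doc` (for p(di)) are hypothetical and chosen only for this example.

```python
import math

def plsa_term_prob(theta_i, beta, j):
    """p(w_j | d_i): mix the topic-term probabilities with the document's topic weights."""
    return sum(theta_ik * beta_k[j] for theta_ik, beta_k in zip(theta_i, beta))

def plsa_doc_log_prob(counts_i, theta_i, beta, p_doc):
    """log p(w_i, d_i) = log p(d_i) + sum_j c_{i,j} * log p(w_j | d_i)."""
    log_p = math.log(p_doc)
    for j, c_ij in enumerate(counts_i):
        if c_ij > 0:
            log_p += c_ij * math.log(plsa_term_prob(theta_i, beta, j))
    return log_p

# Toy example: two topics, two terms, one document with counts (2, 1).
theta_i = [0.5, 0.5]
beta = [[0.9, 0.1], [0.2, 0.8]]
p_w0 = plsa_term_prob(theta_i, beta, 0)
doc_log_prob = plsa_doc_log_prob([2, 1], theta_i, beta, p_doc=0.5)
```

Unlike the mixture of unigrams, the mixture here is taken per term, so different terms in the same document can come from different topics.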

#### LDA

* Latent Dirichlet allocation (LDA) extends PLSA by placing priors on the parameters.

<img src="http://yosinski.com/mlss12/media/slides/MLSS-2012-Blei-Probabilistic-Topic-Models_020.png" />

<img src="http://salsahpc.indiana.edu/b649proj/images/proj3_LDA%20structure.png" />

<img src="http://image.slidesharecdn.com/nonparametricbayes-130329120556-phpapp01/95/bayesian-nonparametrics-models-based-on-the-dirichlet-process-28-638.jpg?cb=1364559167" />

The probability of observing all the terms in document di is then:

<img src="figures/cap8.8.png" width=600 />

<img src="figures/cap8.9.png" width=600 />
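The generative story behind LDA can be sketched in a few lines. This is a minimal sketch under stated assumptions: the document's topic proportions are drawn from a Dirichlet prior with parameter `alpha`, the topic-term distributions `beta` are treated as given (in full LDA they may carry their own Dirichlet prior), and the document length is fixed in advance. All names are illustrative.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha, beta, rng):
    """Generate one document: theta ~ Dir(alpha); for each word, z ~ Mult(theta), w ~ Mult(beta[z])."""
    theta = sample_dirichlet(alpha, rng)
    topics = range(len(beta))
    vocab = range(len(beta[0]))
    words = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]    # topic assignment for this word
        w = rng.choices(vocab, weights=beta[z])[0]   # term drawn from the chosen topic
        words.append(w)
    return theta, words

rng = random.Random(0)
beta = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]  # two topics over a three-term vocabulary
theta, words = generate_document(20, alpha=[0.5, 0.5], beta=beta, rng=rng)
```

The only structural difference from PLSA in this sketch is the first line of `generate_document`: the per-document topic proportions are themselves random rather than free parameters.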

### 2.2.2 Other Applications

Now, we briefly introduce some other applications of mixture models in text mining.

* Comparative text mining (CTM)
    - Given a set of comparable text collections (e.g., the reviews for different brands of laptops), the task of comparative text mining is to discover any latent common themes across all collections as well as special themes within one collection.

<img src="http://i0.wp.com/statistical-research.com/wp-content/uploads/2012/10/Rplot.png" />

* Contextual text mining (CtxTM)
    - Extracts topic models from a collection of text with context information (e.g., time and location) and models how topics vary across contexts.

<img src="https://i.ytimg.com/vi/7KLSo9d4Xzg/hqdefault.jpg" />

<img src="https://i.ytimg.com/vi/t1LBEjtC6gc/hqdefault.jpg" />

<img src="https://i.ytimg.com/vi/esGqS120klg/mqdefault.jpg" />

* Topic Sentiment Mixture (TSM)
    - Aims at modeling facets and opinions in weblogs.

<img src="http://www2007.org/htmlpapers/paper680/fp680-mei-img3.png" />

<img src="http://image.slidesharecdn.com/opinion-analysis1-150123152807-conversion-gate02/95/statistical-methods-for-integration-and-analysis-of-online-opinionated-text-data-30-638.jpg?cb=1422028206" />

<img src="http://www2007.org/htmlpapers/paper680/fp680-mei-img30.png" />

## 2.3 The Learning Algorithms 

* 2.3.1 Overview
* 2.3.2 EM Algorithm
* 2.3.3 Gibbs Sampling

### 2.3.1 Overview

The general idea of learning parameters in mixture models (and other probabilistic models) is to find a set of “good” parameters θ that maximizes the probability of generating the observed data.

Two estimation criteria are frequently used:

* maximum-likelihood estimation (MLE)

<img src="http://www.ncbi.nlm.nih.gov/Class/NAWBIS/Modules/Phylogenetics/images/phylonv48.gif" />

<img src="http://doc.openturns.org/openturns-0.13.2/doc/html/ReferenceGuide/output/OpenTURNS_ReferenceGuide197x.png" />

<img src="http://statgen.iop.kcl.ac.uk/media/ml1.gif" />

* maximum-a-posteriori-probability (MAP)

<img src="http://www.lancaster.ac.uk/pg/jamest/Group/images/bayesthm.JPG" />

<img src="http://i1.wp.com/statistical-research.com/wp-content/uploads/2013/09/prior-likelihood-posterior1.png" />

<img src="http://images.slideplayer.com/17/5277039/slides/slide_4.jpg" />

The likelihood (or likelihood function) of a set of parameters given the observed data is defined as the probability of all the observations under those parameter values. 

<img src="figures/cap8.10.png" width=600 />

Most of the time, log-likelihood is optimized instead, as it converts products into summations and makes the computation easier:

<img src="figures/cap8.11.png" width=600 />

When priors are incorporated into the mixture model (as in LDA), MAP estimation is used instead. It finds the set of parameters θ that maximizes the posterior density of θ given the observed data:

<img src="figures/cap8.12.png" width=600 />
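To make the difference between the two criteria concrete, here is a small sketch for a single multinomial over terms. It assumes a Dirichlet prior with parameters `alpha` (all greater than 1, so the posterior mode lies in the interior); under that assumption the posterior is Dirichlet(c + alpha) and its mode is (cj + αj - 1) / Σj (cj + αj - 1). The function names are illustrative.

```python
def mle_multinomial(counts):
    """Maximum-likelihood estimate: relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts]

def map_multinomial(counts, alpha):
    """MAP estimate under a Dirichlet(alpha) prior (assumes every alpha_j > 1)."""
    numer = [c + a - 1.0 for c, a in zip(counts, alpha)]
    total = sum(numer)
    return [x / total for x in numer]

counts = [3, 1, 0]  # term counts observed in some text
mle = mle_multinomial(counts)
map_est = map_multinomial(counts, alpha=[2.0, 2.0, 2.0])
```

The MLE assigns probability zero to the unseen third term, while the MAP estimate keeps it positive; this smoothing effect is one practical reason for adding priors.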

### 2.3.2 EM Algorithm

<img src="http://cse-wiki.unl.edu/wiki/images/thumb/a/a3/EM.jpg/400px-EM.jpg" />

<img src="http://i.stack.imgur.com/mj0nb.gif" />

<img src="https://upload.wikimedia.org/wikipedia/commons/6/69/EM_Clustering_of_Old_Faithful_data.gif" />

For mixture models, the likelihood function can be viewed as the complete likelihood, which involves the hidden variables, marginalized over those hidden variables:


<img src="figures/cap8.13.png" width=600 />

The log-likelihood function is then:

<img src="figures/cap8.14.png" width=600 />

#### E-step (Expectation step)

<img src="figures/cap8.15.png" width=600 />

#### M-step (Maximization-step)

<img src="figures/cap8.16.png" width=600 />
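The two steps can be sketched for the mixture of unigrams from Section 2.1. This is a minimal sketch, not necessarily the exact update shown in the figures: the E-step computes the posterior responsibility of each cluster for each document, and the M-step re-estimates πk and βk,j from responsibility-weighted counts. The small `smoothing` constant is an assumption added here so that no term probability becomes exactly zero; the code also assumes every cluster keeps a nonzero weight.

```python
import math

def e_step(counts, pi, beta):
    """Responsibilities gamma[i][k] = p(z_i = k | w_i, pi, beta)."""
    gamma = []
    for c_i in counts:
        log_post = [
            math.log(pi_k) + sum(c * math.log(b) for c, b in zip(c_i, beta_k) if c > 0)
            for pi_k, beta_k in zip(pi, beta)
        ]
        m = max(log_post)  # subtract the max before exponentiating for stability
        unnorm = [math.exp(v - m) for v in log_post]
        z = sum(unnorm)
        gamma.append([u / z for u in unnorm])
    return gamma

def m_step(counts, gamma, smoothing=1e-3):
    """Re-estimate pi and beta from responsibility-weighted counts."""
    n_docs, n_clusters, vocab = len(counts), len(gamma[0]), len(counts[0])
    pi = [sum(g[k] for g in gamma) / n_docs for k in range(n_clusters)]
    beta = []
    for k in range(n_clusters):
        weighted = [
            smoothing + sum(gamma[i][k] * counts[i][j] for i in range(n_docs))
            for j in range(vocab)
        ]
        total = sum(weighted)
        beta.append([w / total for w in weighted])
    return pi, beta

def run_em(counts, pi, beta, n_iter=50):
    """Alternate E and M steps for a fixed number of iterations."""
    gamma = e_step(counts, pi, beta)
    for _ in range(n_iter):
        gamma = e_step(counts, pi, beta)
        pi, beta = m_step(counts, gamma)
    return pi, beta, gamma

# Toy corpus: documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
counts = [[4, 1, 0, 0], [5, 0, 0, 0], [0, 0, 3, 2], [0, 0, 0, 5]]
pi, beta, gamma = run_em(
    counts,
    pi=[0.5, 0.5],
    beta=[[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]],
    n_iter=30,
)
```

The starting parameters are passed in explicitly so the run is deterministic; in practice EM is usually started from several random initializations because it only finds a local optimum.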

Several variants of the EM algorithm exist for cases where the original algorithm is difficult to compute. Some of them are listed below:

* Generalized EM

* Variational EM

<img src="http://parkcu.com/blog/wp-content/uploads/2013/07/variational-inference.png" />

### 2.3.3 Gibbs Sampling

#### Markov chain Monte Carlo (MCMC)

<img src="http://parkcu.com/blog/wp-content/uploads/2013/08/monte_carlo_integration.gif" />

<img src="http://cs.brown.edu/courses/cs242/lectures/images/mcmc.png" />

<img src="http://fedc.wiwi.hu-berlin.de/xplore/ebooks/html/csa/img1321.gif" />

The hidden cluster zi,j for term wi,j (the term wj in document di) is sampled from its conditional distribution given all the observed terms in the corpus and the hidden cluster labels of every term except wi,j:


<img src="figures/cap8.17.png" width=600 />
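A sampler of this kind can be sketched for LDA. The sketch below assumes the widely used collapsed form with symmetric hyperparameters, where the conditional for a token is proportional to (n_dk + alpha) * (n_kw + eta) / (n_k + V * eta), with all counts excluding the current token. The hyperparameter names `alpha` and `eta`, the function name, and the representation of documents as lists of term ids are assumptions of this example and may differ from the notation in the figure.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, eta=0.01, n_sweeps=200, seed=0):
    """Collapsed Gibbs sampling for LDA over documents given as lists of term ids."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]                # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(n_topics)]   # term counts per topic
    n_k = [0] * n_topics                                 # total tokens per topic
    z = []
    # Random initialization of the hidden topic of every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)
    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for pos, w in enumerate(doc):
                # Remove the current token from the counts.
                k = z[d][pos]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Conditional distribution over topics given everything else.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + eta) / (n_k[t] + vocab_size * eta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Add the token back under its newly sampled topic.
                z[d][pos] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return z, n_dk, n_kw

docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 2, 3]]
z, n_dk, n_kw = gibbs_lda(docs, n_topics=2, vocab_size=4, n_sweeps=20)
```

After burn-in, the count tables `n_dk` and `n_kw` can be normalized (with the same hyperparameters added) to estimate the document-topic and topic-term distributions.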

# 3. Stochastic Processes in Bayesian Nonparametric Models 

* 3.1 Chinese Restaurant Process 
* 3.2 Dirichlet Process 
* 3.3 Pitman-Yor Process 
* 3.4 Others 

#### References
* [12] Stat 547Q : Statistical Modeling with Stochastic Processes - http://www.stat.ubc.ca/~bouchard/courses/stat547-sp2011/

Bayesian nonparametric models 

<img src="https://raw.githubusercontent.com/psygrammer/coco/d31b61b25bbf42f869621fe02104df4a9e1c413e/part2/compsy/study04/figures/fig9.1.png" width=600 />

## 3.1 Chinese Restaurant Process 

#### References
* [4] Chinese Restaurant Process - http://www.slideshare.net/MohitdeepSingh/chinese-restaurant-process

<img src="https://raw.githubusercontent.com/psygrammer/coco/d31b61b25bbf42f869621fe02104df4a9e1c413e/part2/compsy/study04/figures/fig9.2.png" width=600 />

<img src="figures/cap8.18.png" width=600 />

<img src="figures/cap8.19.png" width=600 />
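The seating rule of the Chinese restaurant process is easy to simulate. The sketch below assumes the standard formulation: with n customers already seated, the next customer joins existing table k with probability nk / (n + α) and opens a new table with probability α / (n + α), where α is the concentration parameter. The function names are illustrative.

```python
import random

def crp_seating_probs(table_sizes, alpha):
    """Probabilities of joining each existing table, followed by the probability of a new table."""
    n = sum(table_sizes)
    probs = [size / (n + alpha) for size in table_sizes]
    probs.append(alpha / (n + alpha))
    return probs

def simulate_crp(n_customers, alpha, seed=0):
    """Seat customers one at a time; return each customer's table and the final table sizes."""
    rng = random.Random(seed)
    table_sizes = []
    assignments = []
    for _ in range(n_customers):
        probs = crp_seating_probs(table_sizes, alpha)
        table = rng.choices(range(len(probs)), weights=probs)[0]
        if table == len(table_sizes):
            table_sizes.append(1)      # the last index means "open a new table"
        else:
            table_sizes[table] += 1
        assignments.append(table)
    return assignments, table_sizes

probs = crp_seating_probs([2, 1], alpha=1.0)
assignments, table_sizes = simulate_crp(50, alpha=1.0)
```

Larger values of `alpha` make new tables more likely, so the number of occupied tables grows faster with the number of customers.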

## 3.2 Dirichlet Process 

* 3.2.1 Overview of Dirichlet Process
* 3.2.2 Dirichlet Process Mixture Model
* 3.2.3 The Learning Algorithms
* 3.2.4 Applications in Text Mining

#### References
* [5] Digging into the Dirichlet Distribution - http://www.slideshare.net/g33ktalk/machine-learning-meetup-12182013
* [6] Bayesian Nonparametrics: Models Based on the Dirichlet Process - http://www.slideshare.net/AlessandroPanella1/nonparametric-bayes
* [7] Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes - http://www.slideshare.net/NoSyu/bayesian-nonparametric-topic-modeling-hierarchical-dirichlet-processes

### 3.2.1 Overview of Dirichlet Process

<img src="figures/cap8.20.png" width=600 />

<img src="figures/cap8.21.png" width=600 />

<img src="figures/cap8.22.png" width=600 />

### 3.2.2 Dirichlet Process Mixture Model

<img src="figures/cap8.23.png"  />

<img src="figures/cap8.24.png" width=600 />

### 3.2.3 The Learning Algorithms

<img src="figures/cap8.25.png" />

### 3.2.4 Applications in Text Mining

## 3.3 Pitman-Yor Process 

#### References
* [8] A Hierarchical Nonparametric Bayesian Approach to Statistical Language Model Domain Adaptation - https://docs.google.com/viewer?url=http%3A%2F%2Fece.duke.edu%2F~lcarin%2FMingyuan12.18.09.ppt
* [9] Hierarchical Bayesian Models of Language and Text - http://www.stats.ox.ac.uk/~teh/research/compling/bayeslm.pdf
* [10] Dirichlet Process and Stick-Breaking - http://web.cse.ohio-state.edu/~kulis/teaching/788_sp12/scribe_notes/lecture14.pdf
* [11] Lecture 10: More on hierarchical models and PY. Infinite HMM, Beta process - http://www.stat.ubc.ca/~bouchard/courses/stat547-sp2011/lecture10.pdf

<img src="figures/cap8.26.png" width=600 />

## 3.4 Others 

# 4. Graphical Models 

* 4.1 Bayesian Networks 
* 4.2 Hidden Markov Models 
* 4.3 Markov Random Fields 
* 4.4 Conditional Random Fields 
* 4.5 Other Models 

#### References
* [13] Probabilistic Graphical Models - https://www.coursera.org/course/pgm
* [14] stanford-pgm/slides/Section-1-Introduction - http://spark-university.s3.amazonaws.com/stanford-pgm/slides/Section-1-Introduction-Combined.pdf

## 4.1 Bayesian Networks 

* 4.1.1 Overview
* 4.1.2 The Learning Algorithms
* 4.1.3 Applications in Text Mining

#### References
* [15] Bayesian Network Fundamentals - http://spark-university.s3.amazonaws.com/stanford-pgm/slides/Section-2-Representation-Bayes-Nets.pdf

### 4.1.1 Overview

#### Conditional Independence

#### Factorization Definition

<img src="figures/cap8.27.png" width=600 />

<img src="figures/cap8.28.png" width=600 />

<img src="figures/cap8.29.png" width=600 />

### 4.1.2 The Learning Algorithms

### 4.1.3 Applications in Text Mining

## 4.2 Hidden Markov Models 

* 4.2.1 Overview
* 4.2.2 The Learning Algorithms
* 4.2.3 Applications in Text Mining

#### References
* [16] Template Models - http://spark-university.s3.amazonaws.com/stanford-pgm/slides/Section-2-Representation-Template-Models.pdf
* [17] Hidden Markov models, graphical models - https://www.cs.berkeley.edu/~jordan/courses/294-fall09/lectures/hmm/slides.ppt
* [19] Graphical models and Hidden Markov Models - http://www.asl.ethz.ch/education/master/info-process-rob/graph_HMM.pdf

<img src="figures/cap8.31.png" width=600 />

<img src="figures/cap8.30.png" width=600 />

### 4.2.1 Overview

<img src="figures/cap8.32.png" width=600 />

### 4.2.2 The Learning Algorithms

<img src="figures/cap8.33.png" width=600 />

<img src="figures/cap8.34.png" />

<img src="figures/cap8.35.png" />

#### Baum-Welch algorithm

### 4.2.3 Applications in Text Mining

## 4.3 Markov Random Fields 

* 4.3.1 Overview
* 4.3.2 The Learning Algorithms
* 4.3.3 Applications in Text Mining

#### References
* [18] Markov Network Fundamentals - http://spark-university.s3.amazonaws.com/stanford-pgm/slides/Section-2-Representation-Markov-Nets.pdf

### 4.3.1 Overview

#### Conditional Independence

#### Clique Factorization

<img src="figures/cap8.36.png" width=600 />

<img src="figures/cap8.37.png" width=600 />

<img src="figures/cap8.38.png" width=600 />

### 4.3.2 The Learning Algorithms

<img src="figures/cap8.40.png" width=600 />

<img src="figures/cap8.41.png" />

### 4.3.3 Applications in Text Mining

## 4.4 Conditional Random Fields 

* 4.4.1 Overview
* 4.4.2 The Learning Algorithms
* 4.4.3 Applications in Text Mining

#### References
* [20] Conditional Random Fields and beyond …  - http://web.engr.illinois.edu/~khashab2/files/2013_crf.pptx
* [21] An Introduction to Conditional Random Field - http://archer.ee.nctu.edu.tw/powerpoint/CRF_2.pptx
* [22] Conditional Random Fields -  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap13/Ch13.5-ConditionalRandomFields.pdf
* [23] Conditional Random Fields - Stanford NLP Group - http://nlp.stanford.edu/software/je

### 4.4.1 Overview

<img src="figures/cap8.42.png" width=600 />

<img src="figures/cap8.39.png" width=600 />

### 4.4.2 The Learning Algorithms

### 4.4.3 Applications in Text Mining

## 4.5 Other Models 

# 5. Probabilistic Models with Constraints 

# 6. Parallel Learning Algorithms 

# 7. Conclusions

# References

* [1] Mining Text Data - http://link.springer.com/book/10.1007/978-1-4614-3223-4/page/1
* [2] EM - http://www.cs.tut.fi/kurssit/TLT-5906/EM_presentation_2013.pdf
* [3] Variational Inference (LDA) - http://parkcu.com/blog/latent-dirichlet-allocation/
* [4] Chinese Restaurant Process - http://www.slideshare.net/MohitdeepSingh/chinese-restaurant-process
* [5] Digging into the Dirichlet Distribution - http://www.slideshare.net/g33ktalk/machine-learning-meetup-12182013
* [6] Bayesian Nonparametrics: Models Based on the Dirichlet Process - http://www.slideshare.net/AlessandroPanella1/nonparametric-bayes
* [7] Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes - http://www.slideshare.net/NoSyu/bayesian-nonparametric-topic-modeling-hierarchical-dirichlet-processes
* [8] A Hierarchical Nonparametric Bayesian Approach to Statistical Language Model Domain Adaptation - https://docs.google.com/viewer?url=http%3A%2F%2Fece.duke.edu%2F~lcarin%2FMingyuan12.18.09.ppt
* [9] Hierarchical Bayesian Models of Language and Text - http://www.stats.ox.ac.uk/~teh/research/compling/bayeslm.pdf
* [10] Dirichlet Process and Stick-Breaking - http://web.cse.ohio-state.edu/~kulis/teaching/788_sp12/scribe_notes/lecture14.pdf
* [11] Lecture 10: More on hierarchical models and PY. Infinite HMM, Beta process - http://www.stat.ubc.ca/~bouchard/courses/stat547-sp2011/lecture10.pdf
* [12] Stat 547Q : Statistical Modeling with Stochastic Processes - http://www.stat.ubc.ca/~bouchard/courses/stat547-sp2011/
* [13] Probabilistic Graphical Models - https://www.coursera.org/course/pgm
* [14] stanford-pgm/slides/Section-1-Introduction - http://spark-university.s3.amazonaws.com/stanford-pgm/slides/Section-1-Introduction-Combined.pdf
* [15] Bayesian Network Fundamentals - http://spark-university.s3.amazonaws.com/stanford-pgm/slides/Section-2-Representation-Bayes-Nets.pdf
* [16] Template Models - http://spark-university.s3.amazonaws.com/stanford-pgm/slides/Section-2-Representation-Template-Models.pdf
* [17] Hidden Markov models, graphical models - https://www.cs.berkeley.edu/~jordan/courses/294-fall09/lectures/hmm/slides.ppt
* [18] Markov Network Fundamentals - http://spark-university.s3.amazonaws.com/stanford-pgm/slides/Section-2-Representation-Markov-Nets.pdf
* [19] Graphical models and Hidden Markov Models - http://www.asl.ethz.ch/education/master/info-process-rob/graph_HMM.pdf
* [20] Conditional Random Fields and beyond …  - http://web.engr.illinois.edu/~khashab2/files/2013_crf.pptx
* [21] An Introduction to Conditional Random Field - http://archer.ee.nctu.edu.tw/powerpoint/CRF_2.pptx
* [22] Conditional Random Fields -  http://www.cedar.buffalo.edu/~srihari/CSE574/Chap13/Ch13.5-ConditionalRandomFields.pdf
* [23] Conditional Random Fields - Stanford NLP Group - http://nlp.stanford.edu/software/jenny-ner-2007.ppt
