In this lab, we shall learn about the unsupervised training of HMMs.

Let's review the code for the supervised training of an HMM from the previous lab.

In [None]:
import nltk
from nltk.probability import LaplaceProbDist
from nltk.corpus import treebank
from nltk.tag import hmm


trainData2 = treebank.tagged_sents()[:2000]

# Extract the distinct tags and words (vocabulary) from our dataset
allStates2=set() # tags in our case
observationSymbols2=set() # vocabulary in our case
for t in trainData2:
    for (word,tag) in t:
        allStates2.add(tag)
        observationSymbols2.add(word)

allStates2=list(allStates2)
observationSymbols2=list(observationSymbols2) 
        
print ("Total States (tags): ",len(allStates2))
print("Total Observation Symbols (Vocab): ",len(observationSymbols2))

# Laplace probability distribution for smoothing the probabilities
smoothingFunction = lambda fdist, bins: LaplaceProbDist(fdist,bins)

trainer = hmm.HiddenMarkovModelTrainer(states=allStates2,
                                       symbols=observationSymbols2)
supervisedHmm = trainer.train_supervised(trainData2,estimator=smoothingFunction)

#print(tagger)

#test="Cigarette has caused a high percentage of cancer."
test="Israel television rejected a skit by comedian Tuvia Tzafir that attacked public apathy by depicting an Israeli family watching TV while a fire raged outside ."
print ("\n**** Test: Output of HMM tagger based on Viterbi algorithm ****")
print (supervisedHmm.tag(nltk.word_tokenize(test)))


Suppose we have additional data beyond the tagged Treebank data, and this new data is not tagged with parts of speech. We have already built a supervised HMM on the tagged data above, and we wish to update the trained model using the new unlabeled data. The data is in the file sents-01.txt. Open it in Notepad or any other text editor to see its contents.
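
NLTK's unsupervised trainer expects each untagged sentence as a list of `(word, None)` pairs, where `None` stands in for the missing tag. A minimal sketch of that shape follows. It assumes a plain whitespace split as a stand-in for `nltk.word_tokenize`, and the helper name `to_unlabeled_sequence` is hypothetical, not part of NLTK.

```python
def to_unlabeled_sequence(sentence):
    """Wrap each token as (word, None); None marks the missing tag."""
    # Assumption: whitespace splitting keeps this sketch dependency-free.
    # The lab code uses nltk.word_tokenize instead.
    return [(word, None) for word in sentence.split()]

example = to_unlabeled_sequence("Rome is in Lazio")
print(example)
# prints [('Rome', None), ('is', None), ('in', None), ('Lazio', None)]
```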

In [None]:
# Get states and symbols from the previously trained model
theStates=supervisedHmm._states
theSymbols=set(supervisedHmm._symbols)

# Load the training sentences from the file (one sentence per line)
trainData=[]
with open("sents-01.txt") as file:  # the with-block closes the file for us
    for line in file:
        trainData.append(line)

# We have to convert the training sentences into tokenized words of the
# following form: [ [(w1,None)...(wk,None)],[(w1,None)...(wk,None)]]
# As there is no tag associated with the words, NLTK requires us to
# put the None keyword in place of the tag beside each word. The following
# code performs this operation. In addition, the code updates the symbols
# (words) of the previous model with the new symbols (words). In other
# words, we need to update the vocabulary in advance to tell the HMM
# which new words it will see during training.
tokenSeq=[]
for sent in trainData:
    words=nltk.word_tokenize(sent)
    wordTagList=[]
    for word in words:
        theSymbols.add(word) ## we have to add new vocabulary to our symbols list
        wordTagList.append((word,None))
    tokenSeq.append(wordTagList)

# Since an HMM is nothing but a bunch of matrices, we need to extract all
# the matrices from the previous model and initialize a new model with our
# updated symbols. The transition matrix represents the transition
# probabilities from one state (tag) to another. What do the output matrix
# and the priors represent? I leave it to you to work this out from the lecture.
initModel=hmm.HiddenMarkovModelTagger(states=theStates,
                                        symbols=list(theSymbols),
                                       transitions=supervisedHmm._transitions, 
                                        outputs=supervisedHmm._outputs,
                                        priors=supervisedHmm._priors)
print (initModel)
# Next we need to initialize the trainer, just as for the supervised HMM.
# (Note that initializing things this way is a requirement of NLTK; other
# libraries may have different steps, and you can create your own library
# with different steps.)
unSuperTrainer= hmm.HiddenMarkovModelTrainer(states=theStates,
                                        symbols=list(theSymbols))

# Finally, we use Baum-Welch to train on the new unlabeled data,
# using our initial supervised model as the starting point.
unsupervisedHmm=unSuperTrainer.train_unsupervised(unlabeled_sequences=tokenSeq,
                                         max_iterations=50,model=initModel)

print(unsupervisedHmm)
print ("Training finished.....")






Let's compare the results of the supervised and unsupervised HMM models on the same sentence.

In [None]:

test="Israel television rejected a skit by comedian Tuvia Tzafir that attacked public apathy by depicting an Israeli family watching TV while a fire raged outside ."
print ("\n**** Test: Output of HMM tagger based on Viterbi algorithm ****")
print ("Unsupervised Result: ",unsupervisedHmm.tag(nltk.word_tokenize(test)))
print ("Supervised Result:  ",supervisedHmm.tag(nltk.word_tokenize(test)))

# Exercise 4.1

The unsupervised HMM does not seem to improve much on the supervised HMM. There are several likely reasons: there is not enough data to train an unsupervised HMM well, and training without tags discards information. Nonetheless, your task is to pick 5 different sentences from the treebank corpus, predict their tags with both the unsupervised and the supervised HMM models trained above, and then compare the results to determine which model performs better.
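
One way to make the comparison concrete is to count how many tokens each model tags correctly. The sketch below is only a suggestion, not a prescribed method: `tag_accuracy` is a hypothetical helper, and the toy tag sequences are made up for illustration rather than taken from either model.

```python
def tag_accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for (_, p), (_, g) in zip(predicted, gold) if p == g)
    return correct / len(gold)

# Toy data (made up for illustration, not real tagger output)
gold = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")]
predicted = [("the", "DT"), ("dog", "NN"), ("barks", "NNS"), (".", ".")]
print(tag_accuracy(predicted, gold))  # prints 0.75
```

With treebank data, `gold` can be one sentence from `treebank.tagged_sents()` and the prediction can come from `supervisedHmm.tag([w for w, _ in gold])`, which reuses the corpus tokens and so keeps both sequences the same length. Choosing sentences outside the first 2000 (the training slice) keeps the comparison from being flattered by sentences the supervised model has already seen.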

##############################################################

A better use of an unsupervised HMM is classification or anomaly detection. In the following example, we shall train two HMMs, one on the training sentences of sents-01.txt (unlabeled data) and one on those of sents-02.txt (unlabeled data). We shall then use log probabilities to predict whether an unknown sentence belongs to one of the training datasets.

In [None]:
import numpy as np
import nltk
from nltk.tag import hmm
# All symbols
allSymbols=set()

# Function to convert a file to training sentences
def fileToTrainData(fileName):
    trainSent=[]
    with open(fileName) as file:  # the with-block closes the file for us
        for line in file:
            trainSent.append(line)
    return trainSent

# Function to convert training sentences to token sequences
def getTokenSeq(trainSents):
    tokenSeq=[]
    for sent in trainSents:
        words=nltk.word_tokenize(sent)
        wordTagList=[]
        for word in words:
            allSymbols.add(word)## creating a vocabulary set
            wordTagList.append((word,None))
        tokenSeq.append(wordTagList)
    return tokenSeq


# Convert training files to token sequences and determine total vocabulary
# (symbols) as well (see above functions)
tokenSeq01=getTokenSeq(fileToTrainData("sents-01.txt"))
tokenSeq02=getTokenSeq(fileToTrainData("sents-02.txt"))

# We arbitrarily pick 10 states for our experiments
totStates=10
states=list(range(totStates))
totSymbols=len(allSymbols)

print ("Total symbols: ",totSymbols)
print ("States: ",states)

# We need to initialize the HMM with random values for the
# transition matrix (A), the output matrix (B) and the prior probability vector (pi).
# See the lecture notes to understand their dimensions. Feel free to
# print the matrices to understand them. Note that each row of each
# matrix must sum to one, because every row is a probability distribution.

# Matrix A
# create a matrix with random values
randArr=np.random.rand(totStates,totStates) 
# normalize each row to sum to 1 by dividing it by its row sum
A=randArr/randArr.sum(axis=1, keepdims=True) 

#Matrix B
randArr=np.random.rand(totStates,totSymbols)
B=randArr/randArr.sum(axis=1, keepdims=True)

# Vector Pi (one row only)
randVector=np.random.rand(totStates)
pi=randVector/randVector.sum(axis=0, keepdims=True)

## Validate that the sum is one
print(pi)
print (pi.sum(0))

#print (A)
#print (B)



We shall now train the unsupervised HMM on the training sequences extracted in the previous step. The last line of the following example prints the log probability of a test sentence.
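
The log probability of a sentence is the logarithm of the probability the model assigns to the whole word sequence, summed over every possible hidden state path. The sketch below shows one standard way to compute it, the scaled forward algorithm, on a made-up two-state model. It is an illustration and not NLTK's implementation: the function name `forward_log_probability` is hypothetical, observations are given as symbol indices, and it uses natural logarithms. As far as I know NLTK works in base 2, so treat the base as an assumption; it rescales the values without changing which model scores higher.

```python
import math

def forward_log_probability(obs, A, B, pi):
    """Natural-log probability of an observation index sequence under an HMM.

    A[i][j]: P(state j at t+1 | state i at t)
    B[i][k]: P(symbol k | state i)
    pi[i]:   P(state i at t=0)
    """
    n = len(pi)
    # Initialisation: alpha_0(i) = pi_i * B_i(o_0)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * A[i][j] * B[j][o_t]
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                for j in range(n)
            ]
        # Rescale to avoid underflow and accumulate the log of the scale factor
        scale = sum(alpha)
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_prob

# Toy two-state, two-symbol model (values made up for illustration)
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
result = forward_log_probability([0, 1], A, B, pi)
print(result)
```

Longer sentences multiply more probabilities together, so their log probabilities are more negative. This is worth keeping in mind when comparing scores across sentences of different lengths.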

In [None]:

hTagger01=hmm._create_hmm_tagger(states,list(allSymbols) , A, B, pi)
trainer01 = hmm.HiddenMarkovModelTrainer(states, list(allSymbols))
trainer01Hmm = trainer01.train_unsupervised(tokenSeq01, model=hTagger01,
                                            max_iterations=50)

test="Israel television rejected a skit by comedian Tuvia Tzafir."
print ("\n**** Test: Log Probability ****")
tokensTest=[(w,None) for w in nltk.word_tokenize(test)]
print (trainer01Hmm.log_probability(tokensTest))


# Exercise 4.2

We have trained the first HMM on the training data (sents-01.txt). Repeat the process on the sents-02.txt data, following the same steps as above. Once the second HMM is trained, measure the log probabilities of the following sentences using both HMM models (the one trained on sents-01.txt and the one trained on sents-02.txt).
<code>
test1="An art exhibit in Arab east Jerusalem was a series of portraits"
test2="Rome is in Lazio province."
test3="W. Dale Nelson covers the White House for The Associated Press ."
test4="The news agency in Umbria"
</code>

The test sentence is assigned to the dataset whose model generates the larger log probability (note that for negative numbers, -4 > -10).
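
The decision rule can be sketched as picking the key with the largest value. The helper name `closest_dataset` is hypothetical, and the scores are made up for illustration; real values would come from calling `log_probability` on each trained model.

```python
def closest_dataset(log_probs):
    """Return the dataset name whose model gave the largest log probability."""
    return max(log_probs, key=log_probs.get)

# Made-up scores for illustration only
scores = {"sents-01": -120.5, "sents-02": -98.2}
print(closest_dataset(scores))  # prints sents-02
```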

# Exercise 4.3
In the previous exercise (4.2), we initialized the matrices randomly before training the two HMMs. Repeat the experiments 5 times, each time with a different random initialization, and compare the results on the four test sentences of Exercise 4.2.
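
To make each of the repetitions reproducible, the random initialization can be wrapped in a function that takes a seed. This is a sketch under assumptions: `random_hmm_parameters` is a hypothetical helper, the vocabulary size of 50 is a placeholder, and it uses NumPy's `default_rng` generator rather than the `np.random.rand` calls used earlier in this lab.

```python
import numpy as np

def random_hmm_parameters(n_states, n_symbols, seed):
    """Return row-normalized random A, B and pi for a given seed."""
    rng = np.random.default_rng(seed)
    A = rng.random((n_states, n_states))
    A /= A.sum(axis=1, keepdims=True)   # each row of A sums to 1
    B = rng.random((n_states, n_symbols))
    B /= B.sum(axis=1, keepdims=True)   # each row of B sums to 1
    pi = rng.random(n_states)
    pi /= pi.sum()                      # pi sums to 1
    return A, B, pi

# Five different initializations, one per seed (50 symbols is a placeholder)
for seed in range(5):
    A, B, pi = random_hmm_parameters(10, 50, seed)
    print(seed, A.shape, B.shape, round(float(pi.sum()), 6))
```

Recording the seed next to each result makes it possible to rerun exactly the initialization that produced an unusual score.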

# Exercise 4.4
Repeat the experiments of Exercise 4.2, changing the number of states to 5, 15 and 20. For each number of states, test on the four test sentences of Exercise 4.2 and compare the results.
