# Text generation using a Hidden Markov Model

For this project, I will generate new text and perform text prediction using a Hidden Markov Model built from ABC News headlines.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
headline_df = pd.read_csv("../data/abcnews-date-text.csv") 

## Data cleaning 

The data set has over a million rows, which makes running and training the model too slow, so I decided to sample 99999 rows from it.

In [3]:
headline_df = headline_df[['headline_text']]
headline_df = headline_df.sample(99999)
headline_df

|  | headline_text |
| --- | --- |
| 217838 | mp calls for desal plant funds to be diverted |
| 103594 | williams davenport reach stanford semi finals |
| 406293 | call for regional cancer treatment centres |
| 247914 | fortescue to launch joint iron ore venure |
| 795082 | qld government seeks foreign interest in ecoto... |
| ... | ... |
| 460487 | us fears over pakistan weapons |
| 6575 | us air assault division crosses into iraq |
| 952709 | poison chemical found on russian who died in u... |
| 157323 | who recalls flu virus sent to labs worldwide |


## Formulate ideas on how ML can be used to learn word correlations and distributions

Idea 1: One Machine Learning algorithm that could be used to learn word correlations and distributions is K-means. With K-means, we can group words and text that are correlated with each other into clusters, and then base word prediction on which cluster a given word belongs to. (Just an idea.)

Idea 2: Another idea is to use a Markov Model. As words and text are sequential data, representing their correlations and distributions with a Markov Model is intuitive. However, the states we want to understand are often hidden, such as part-of-speech tags when modeling text data. By including hidden states (hence the Hidden Markov Model), we can use both observed and hidden states as factors when determining the probability of the next generated word. This is what we will be building for this project.
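
To make the Markov idea concrete, here is a minimal sketch on three made-up headlines. The `toy_headlines` list and the helper name `bigram_probabilities` are illustrative assumptions, not part of the dataset or of the code used later in this notebook. The sketch counts which word follows which and turns the counts into conditional probabilities.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for illustration only.
toy_headlines = [
    "police arrest man",
    "police arrest suspect",
    "police seek man",
]

def bigram_probabilities(headlines):
    """Estimate P(next word | current word) from raw bigram counts."""
    counts = defaultdict(Counter)
    for headline in headlines:
        tokens = headline.split()
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    # Normalize each word's counts so they sum to 1.
    return {
        word: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for word, followers in counts.items()
    }

probs = bigram_probabilities(toy_headlines)
print(probs["police"])  # arrest: 2/3, seek: 1/3
print(probs["arrest"])  # man: 1/2, suspect: 1/2
```

A plain Markov chain like this only conditions on the observed previous word; the hidden states described above add a second, unobserved factor on top of it.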

## Building the Hidden Markov Model

### 1. Collecting all the different words from the dataset

In [4]:
words = []
headlines = headline_df['headline_text']

for headline in headlines:
    headline = headline.split()
    for word in headline:
        words.append(word)
        
distinct_words = list(set(words))
distinct_words.append(None)  # Null State
distinct_words_count = len(distinct_words)
word_dict = {word: i for i, word in enumerate(distinct_words)}

### 2. Initializing and defining the transition matrix

In [5]:
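# matrix_1[a][b] counts how often word b directly follows word a;
# matrix_2[a][b] counts how often word b appears two positions after word a.
matrix_1 = np.zeros((distinct_words_count, distinct_words_count))
matrix_2 = np.zeros((distinct_words_count, distinct_words_count))

for headline in headlines:
    data = headline.split()
    for i in range(len(data)):
        if i < len(data) - 1:
            matrix_1[word_dict[data[i]]][word_dict[data[i + 1]]] += 1
        else:
            # Last word of the headline: transition to the null state.
            matrix_1[word_dict[data[i]]][distinct_words_count - 1] += 1

        if i < len(data) - 2:
            matrix_2[word_dict[data[i]]][word_dict[data[i + 2]]] += 1
        else:
            # No word two positions ahead: transition to the null state.
            matrix_2[word_dict[data[i]]][distinct_words_count - 1] += 1

# The null state only transitions to itself.
matrix_1[distinct_words_count - 1][distinct_words_count - 1] = 1
matrix_2[distinct_words_count - 1][distinct_words_count - 1] = 1

# Normalize each row so that it becomes a probability distribution.
for i in range(len(matrix_1)):
    matrix_1[i] = matrix_1[i] / matrix_1[i].sum()
    matrix_2[i] = matrix_2[i] / matrix_2[i].sum()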
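
The dense matrices above need one cell per pair of vocabulary words, so their size grows quadratically with the vocabulary. A possible alternative, sketched here under the assumption that most word pairs never occur together, is to store only the observed pairs in nested counters. The names `build_sparse_transitions` and `END` and the toy headlines are hypothetical; the counting rules mirror the loop above (the next word for lag 1, the word after next for lag 2, and an end state when the headline runs out).

```python
from collections import Counter, defaultdict

END = None  # plays the role of the null state

def build_sparse_transitions(headlines, lag):
    """Row-normalized transition probabilities for a given lag (1 or 2)."""
    counts = defaultdict(Counter)
    for headline in headlines:
        tokens = headline.split()
        for i, word in enumerate(tokens):
            # Past the end of the headline, transition to the end state.
            target = tokens[i + lag] if i + lag < len(tokens) else END
            counts[word][target] += 1
    return {
        word: {t: c / sum(row.values()) for t, c in row.items()}
        for word, row in counts.items()
    }

# Made-up headlines, for illustration only.
toy = ["rain hits coast", "rain hits city"]
lag_1 = build_sparse_transitions(toy, lag=1)
lag_2 = build_sparse_transitions(toy, lag=2)
print(lag_1["hits"])  # coast and city, 0.5 each
print(lag_2["rain"])  # coast and city, 0.5 each
print(lag_2["hits"])  # only the end state
```

The trade-off is that sampling then needs the keys and values of a row rather than a ready-made probability vector, so the rest of the notebook would have to be adapted accordingly.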

### 3. Implementing the hidden states

In the research papers I read, most authors use part-of-speech tags as the hidden states when building a Hidden Markov Model for text generation. However, I am unsure how to implement part-of-speech tags as hidden states, as they are not labeled in the dataset I chose.

Therefore, I will approach it differently. For the purpose of this project, the hidden state will be whether a word starts with a vowel or a non-vowel. As there are only 5 vowels in the alphabet, I would assume the probability of a word being followed by a word that starts with a vowel is far lower than the probability of it being followed by a word that starts with a non-vowel.

In [6]:
hidden_dict = {'vowel': 0, 'non-vowel': 1}
hidden_states = ['vowel','non-vowel']
hidden_matrix = [[.8, .2],[.9, .1]]

VOWELS = ('a', 'e', 'i', 'o', 'u')

def emission(probability, state):
    # Reweight each word's probability by the current hidden state:
    # words whose first letter matches the state are doubled, the others are halved.
    # The last entry (the None null state) is left unchanged.
    for i in range(len(probability) - 1):
        starts_with_vowel = distinct_words[i][0] in VOWELS
        if (state == 'vowel') == starts_with_vowel:
            probability[i] *= 2
        else:
            probability[i] /= 2
    # Renormalize the whole vector so that it sums to 1.
    probability /= probability.sum()
    return probability


def chooseHiddenState(hidden_state):
    new_hidden_state = np.random.choice(hidden_states, size = 1, p = hidden_matrix[hidden_dict[hidden_state]])
    return new_hidden_state[0]
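
To see what `hidden_matrix` implies over many steps, the long-run share of time spent in each hidden state can be approximated by applying the matrix repeatedly. This is a small sketch, assuming rows and columns are ordered as in `hidden_dict` (index 0 is `'vowel'`, index 1 is `'non-vowel'`); the helper `long_run_distribution` is a hypothetical name.

```python
def long_run_distribution(matrix, steps=100):
    """Approximate the stationary distribution of a 2-state chain by iteration."""
    dist = [0.5, 0.5]  # arbitrary starting distribution
    for _ in range(steps):
        dist = [
            dist[0] * matrix[0][0] + dist[1] * matrix[1][0],
            dist[0] * matrix[0][1] + dist[1] * matrix[1][1],
        ]
    return dist

hidden_matrix = [[.8, .2], [.9, .1]]  # same values as in the cell above
vowel_share, non_vowel_share = long_run_distribution(hidden_matrix)
print(round(vowel_share, 3), round(non_vowel_share, 3))  # 0.818 0.182
```

Under that ordering, the chain sits in the `'vowel'` state roughly 82% of the time, so the emission step boosts vowel-initial words on most steps.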

### 4. Sampling matrix to generate text

In [7]:
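def generateText(init_word, text_arr):
    # text_arr holds the words generated so far; pass "" as init_word to start from a random word.
    hidden_state = "vowel"
    if init_word == "":
        init_word = np.random.choice(distinct_words)
        text_arr.append(init_word)
    # Sample the next word from the first-order transition row of the last word.
    following_word = np.random.choice(distinct_words, p = matrix_1[word_dict[text_arr[-1]]])
    
    for i in range(8):
        text_arr.append(following_word)
        # Combine the lag-1 row of the last word with the lag-2 row of the word before it.
        probability = matrix_1[word_dict[text_arr[-1]]]/4 + matrix_1[word_dict[text_arr[-1]]] * matrix_2[word_dict[text_arr[-2]]]
        # Reweight by the current hidden state, then renormalize before sampling.
        probability = emission(probability, hidden_state)
        probability /= probability.sum()    
        following_word = np.random.choice(distinct_words, p = probability)
        hidden_state = chooseHiddenState(hidden_state)
        
    return text_arr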
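
The weighting inside the loop can be traced by hand on a three-entry vocabulary. The sketch below uses made-up probability rows (`row_1` standing in for a row of `matrix_1`, `row_2` for a row of `matrix_2`) and plain lists rather than NumPy arrays; the helper name `combine` is hypothetical, and only the arithmetic follows the function above.

```python
def combine(row_1, row_2):
    """Mix lag-1 and lag-2 rows as in generateText, then normalize."""
    mixed = [p1 / 4 + p1 * p2 for p1, p2 in zip(row_1, row_2)]
    total = sum(mixed)
    return [p / total for p in mixed]

# Hypothetical rows over the vocabulary ['arrest', 'seek', None].
row_1 = [0.5, 0.5, 0.0]  # P(next word | last word)
row_2 = [0.2, 0.8, 0.0]  # P(word two positions on | second-to-last word)

combined = combine(row_1, row_2)
print([round(p, 3) for p in combined])  # [0.3, 0.7, 0.0]
```

A candidate that the lag-1 row rules out (probability 0) stays at 0, so the lag-2 row only rescales words that the lag-1 row already allows.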

### 5. Generating new text

In [11]:
texts = generateText("", [])
output = ""
for text in texts:
    if text is not None:
        output = output + text + " "
print(output)

fires burn off for good samaritan killed in arizona 


### 6. Predicting text given sequence of words

In [13]:
texts = generateText("israel arrest", ["israel", "arrest"])
output = ""
for text in texts:
    if text is not None:
        output = output + text + " "
print(output)

israel arrest of al qaeda linked to close 


## Conclusion

As we can see, the text generated and predicted by the Hidden Markov Model shows some relevance. Overall, this has been a tough project, but I am satisfied with the results of the Machine Learning algorithm.