## EE 502 P: Analytical Methods for Electrical Engineering
    
# Homework 10: Review
## Due 15 December, 2019 at 11:59 PM

Copyright &copy; 2019, University of Washington

### <span style="color: red">Ryan Maroney</span>

<hr>

**Instructions**: Choose **<u>one</u>** of the following problems. Solve the problem and then write up your solution in a standalone Jupyter Notebook. Your notebook should have the following elements:

- Problem statement
- Mathematical description of the solution
- Executable, clear, commented code

You will be graded on how well your notebook reads like a nicely formatted, well-written report. You must:

- Write mathematical descriptions using complete sentences, paragraphs, and LaTeX formulas. 
- Comment your code as necessary, with a description of what each function does and all major steps.
- Label plot axes, use legends, and use plot titles. 

### Problem 3 Chosen: Hallucinating the Constitution

Consider the constitution of the United States:

> https://www.usconstitution.net/const.txt .

This document contains upper- and lower-case letters, numbers, and basic punctuation. 

**One letter prediction:**

1. Find the set of all characters used in the document. Call the number of characters $n$. 
2. Create an $n \times n$ matrix whose $i,j$ entry is the probability that the next character is $j$ given that the current character is $i$. Estimate this probability by looking at all occurrences of character $i$ in the document and the number of times character $j$ immediately follows it. 
3. Simulate this system as a Markov chain that starts with an arbitrary capital letter and continues until it gets to a space. Produce $100$ random "words" this way. How many of them are actual words? Use a [Scrabble dictionary](https://scrabble.hasbro.com/en-us/tools#dictionary) if you are not certain whether a given sequence is a word. 

**Two letter prediction:**

1. Create an $n \times n \times n$ tensor whose $i,j,k$ entry is the probability that the next character is $k$ given that the current character is $j$ and the previous character is $i$. Use the document to empirically find these probabilities. 
2. Use this model to construct random words. 

**Sentence prediction:**

Do a one word prediction, but use all the unique *words* in the document. Hallucinate sentences. Consider a punctuation mark as a word. 

**Notes:** Use `open` and `file.read` to read in the file as a string. For the sentence prediction, use `replace` to add a space before punctuation and then `split()` to turn the string into a list. Use a `DiGraph` from the `networkx` library to store the data. Note that you can make weighted edges by adding data to the edges, as in [this document](https://networkx.github.io/documentation/stable/auto_examples/drawing/plot_weighted_graph.html).

In [1]:
import nltk # can use to check for words
nltk.download('words')

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\rumar\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [2]:
"vine" in nltk.corpus.words.words()

True

In [3]:
from urllib.request import urlopen
import networkx as nx
import numpy as np
import random
import sympy as sm
from time import time
import matplotlib.pyplot as plt
%matplotlib inline

In this project I use the text of the United States Constitution to randomly generate words and sentences with a Markov process, where the transition probabilities are estimated from how often specific combinations of consecutive letters or words occur in the Constitution. A Markov process generates a sequence by moving from state to state, where the probability of each possible next state depends only on the current state.

The provided text file contains a title and a note before the actual text of the Constitution starts. I exclude these lines from the data used to estimate the probabilities because they are not part of the Constitution.

In [4]:
def read_file(url):
    '''
    Function:     read_file
    
    Description:  Reads the text for the constitution from a url link. The portion of text before the
                  actual constitution is cut out because it's a separate note.
    
    Parameters:   url - Link to the text file of the constitution.
    
    Returned:     The text of the constitution as a single string.
    '''

    with urlopen(url) as f:
        
        # Remove note in file before actual constitution.
        for i in range(7):
            f.readline()
        
        text = f.read()
    
    return text.decode('UTF-8')

In [5]:
url = "https://www.usconstitution.net/const.txt"
text = read_file(url)
# text =  text[:1000]

I will treat punctuation marks as individual words to distinguish them from actual words, so I need to add spaces between letters and punctuation. I will also replace all newlines with spaces to make the whitespace consistent. Finally, I make sure the text ends with a whitespace character, so that every non-whitespace character has a character following it; the last character in the text has no successor and therefore cannot contribute data for the probability of the next letter.

In [6]:
if text[len(text)-1] not in [" ", "\n"]:
    text = text + " "          # Add a space at the end so the last character has a space following it.
    
# Treat newlines as spaces.
text = text.replace("\n", " ")

# Add spaces around punctuation to treat each mark as a separate word.
text = text.replace(",", " ,")
text = text.replace(".", " .")
text = text.replace(";", " ;")
text = text.replace(":", " :")
text = text.replace(")", " )")
text = text.replace("(", "( ")
text = text.replace(" \"", " \" ")
text = text.replace("\" ", " \" ")

For the first part I will generate words using a Markov process to predict the letters in the word. The current state is the current letter and the next state is the next letter to be chosen. Each transition probability is the probability of a particular letter following the current letter. Let's use the first few words in the text as an example to demonstrate this.

"We the People "

The set of characters in this string is

"W", "e", " ", "t", "h", "P", "o", "p", "l"

(Note that lowercase and uppercase letters are distinct.)

Now we find the probability of each character following each other character. To start, we count the number of occurrences of each letter following each letter. Here are the counts for the string above, where the row is the current character and the column is the next character.

| Current Character | "W" | "e" | " " | "t" | "h" | "P" | "o" | "p" | "l" |
|---|---|---|---|---|---|---|---|---|---|  
| "W" | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  
| "e" | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 0 |
| " " | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| "t" | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| "h" | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| "P" | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| "o" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| "p" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| "l" | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

From the table we can read that there are 3 occurrences where a space follows an "e". There is 1 occurrence where a "p" follows an "o".

Then I normalize occurrences based on the total number of occurrences of the letter in the row to find the probability.

| Current Character | "W" | "e" | " " | "t" | "h" | "P" | "o" | "p" | "l" |
|---|---|---|---|---|---|---|---|---|---|  
| "W" | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  
| "e" | 0 | 0 | 3/4 | 0 | 0 | 0 | 1/4 | 0 | 0 |
| " " | 0 | 0 | 0 | 1/2 | 0 | 1/2 | 0 | 0 | 0 |
| "t" | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| "h" | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| "P" | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| "o" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| "p" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| "l" | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

So, if the current letter is "e" then there is a 0.75 probability that the next character is a space and a 0.25 probability that the next character is an "o". If the current letter is "P" then the probability that the next letter will be "e" is 1. If the only occurrence of a character is at the end of the text, so that no character follows it, there is no data, and I set the distribution of the next character to uniform. In most cases this should not occur because I have added a space to the end of the text.
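The counting and normalization steps described above can be sketched compactly with the standard library. This is only an illustration under stated assumptions, not the implementation used in this notebook: the name `transition_probs` is hypothetical, the result is a nested dictionary instead of a full matrix, and characters that are never followed by anything are simply left out instead of receiving a uniform row.

```python
from collections import Counter, defaultdict
from fractions import Fraction

def transition_probs(seq):
    """Estimate P(next | current) from consecutive pairs in seq (hypothetical helper)."""
    counts = defaultdict(Counter)
    # Count how often each item is immediately followed by each other item.
    for cur, nxt in zip(seq, seq[1:]):
        counts[cur][nxt] += 1

    probs = {}
    for cur, followers in counts.items():
        total = sum(followers.values())  # number of times `cur` has a successor
        probs[cur] = {nxt: Fraction(c, total) for nxt, c in followers.items()}
    return probs

sample_probs = transition_probs("We the People ")
print(sample_probs["e"])
```

For the sample string, the entry for "e" holds 3/4 for a space and 1/4 for "o", matching the normalized table.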

In [7]:
def prob_from_previous(text):
    '''
    Function:     prob_from_previous
    
    Description:  Calculates the probability of a letter or word, given the previous letter or word.
                  This works for both a string of characters or a list of words.
    
    Parameters:   text - A string of characters or a list of words.
    
    Returned:     states - The set of letters or words in text.
                  probs  - 2D array where the row represents the current letter or word
                           and the column represents the next letter or word. Values of the array
                           represent the probability of the next letter or word given the current one.
    '''

    # Find set of characters or words in the text
    states = list(set(text))
    
    probs = [[0 for i in range(len(states))] for j in range(len(states))]
    
    # Count each occurrence of consecutive letters or words.
    for i, state in enumerate(text):
        if not i+1 == len(text):
            probs[states.index(state)][states.index(text[i+1])] += 1
    
    # Normalize each row by the number of times the current state is followed by anything.
    # The row sum is computed once per row instead of rescanning the text for every entry.
    for i in range(len(probs)):
        total = sum(probs[i])
        for j in range(len(probs)):
            if 0 == total:
                probs[i][j] = sm.Rational(1,len(states))
            else:
                probs[i][j] = sm.Rational(probs[i][j],total)
    
    return states, probs

In [8]:
letters, oneLetterProbs = prob_from_previous(text)

Now that I have found the probability of each character occurring given the previous character, I can generate words using a Markov chain.

First, I randomly select a capital letter that is in the set of characters, because capital letters indicate the beginning of a word. Then, given the arbitrary starting letter, I randomly select the next letter based on the probabilities of the other characters following that character. Then I repeat this process using each newly selected letter to determine the probabilities to use to generate the next letter. This is the Markov chain process and I will walk through an example using the information from the example above.

Recall that the sample text is "We the People ". There are two capital letters that can be chosen to start with, "W" and "P". Say that "P" is selected as the starting letter. Then, from the probability matrix above, there is a probability of 1 that the next letter is "e". Now our word is "Pe" and the current letter is "e". With "e" as the current letter, there are two possible characters that can come next: there is a 0.75 probability that the next character will be a space and a 0.25 probability that the next character will be an "o". Say a space is randomly selected, since that is more probable. Since we have reached a space, this is the end of the word. The generated word is "Pe". According to the Scrabble dictionary, "Pe" is a word: it is a Hebrew letter. 

This process doesn't guarantee that an actual word will be generated, but it does make common letter combinations more likely to be generated.
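The walk-through above can be reproduced on the toy string with a few lines of code. This is a minimal sketch, not the notebook's generator: `build_counts` and `generate_word` are hypothetical helpers, the weights are raw counts passed to `random.Random.choices`, and it assumes every non-space character in the sample has at least one successor, so no uniform fallback is needed.

```python
import random
from collections import Counter, defaultdict

def build_counts(text):
    """Count how often each character is followed by each other character."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        counts[cur][nxt] += 1
    return counts

def generate_word(counts, start, rng):
    """Walk the chain from `start` until a space is drawn."""
    word = ""
    current = start
    while current != " ":
        word += current
        followers = counts[current]
        # Draw the next character in proportion to its observed count.
        current = rng.choices(list(followers), weights=list(followers.values()))[0]
    return word

rng = random.Random(0)  # local generator so the sketch does not touch global state
counts = build_counts("We the People ")
toy_words = [generate_word(counts, rng.choice("WP"), rng) for _ in range(5)]
print(toy_words)
```

Every generated word starts with "W" or "P" followed by "e", because "e" is the only character that follows either capital in the sample.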

In [9]:
def gen_one_letter_prediction(letters, letterProbs, seed=None, numWords=100):
    '''
    Function:     gen_one_letter_prediction
    
    Description:  Generates words using the Markov chain technique by randomly generating letters
                  based on the probabilities of a letter following the current letter.
    
    Parameters:   letters     - Set of letters to be used.
                  letterProbs - 2D array of probabilities where the row represents the current letter
                                and the column represents the next letter and the i,j entry represents
                                the probability that the next letter will be j given the current letter
                                is i.
                  seed        - Seed to generate the same random order each time.
                  numWords    - Number of words to randomly generate.
    
    Returned:     randWords - List of randomly generated words.
    '''
    
    random.seed(seed)
    np.random.seed(seed)

    capitals = []
    randWords = []

    # Find all capital letters in the set.
    for x in letters:
        if x >= 'A' and x <= 'Z':
            capitals.append(x)

    for i in range(numWords):
        # Select a random capital letter to start the word.
        current = capitals[random.randint(0,len(capitals)-1)]
        word = ''
        
        while current not in [' ', '\n']:
            word += current
            
            # Generate the next letter based on the calculated probabilities.
            current = np.random.choice(letters, p=letterProbs[letters.index(current)])
        
        randWords.append(word)

    return randWords

In [10]:
randWords = gen_one_letter_prediction(letters, oneLetterProbs, 0)

for word in randWords:
    print(word)

Jund
Lam
Bisce
Mashersubin
Int
Nobe
Jof
Cor
Numin
Vinondexe
Kicioor
Que
Imemecofofowes
Torelllour
Coter
Tit
Grd
Pr
Mofr
El
Spr
Pr
Thic
Co
Ge
Oa
Apen
Win
Yeremed
No
Excempof
Goff
Vof
Liofontsud
Yon
Pofifrentererales
Dor
Quof
Exce
Noolin
Hothe
In
Meron
Busit
Ell
Un
Ambesentintinecthe
Of
Jedgr
St
We
Deomes
Unshe-Prery
Pl
Nonomiatitisumbof
Yo
Fowe
Of
Yeal
Sthajusuthe
Aroprntheghe
Qunexthe
Kine
Fofongne
Fe
The
Elde
Honiny
Am
Amexesof
Yes
In
Nand
Ged
Cofices
Ea
Casur
Sthe
Gerie
Ere
Yeriatot
Ely
Quthexth
Paich
Exil
Kit
Con
Honen
Admshe
Peisseithit
Joppre
Ye
Kicofe
Figis
Cosichetexctis
Reds
Qute
Re
Be
Prmesshe


After generating the words, I checked how many of them are real words by looking them up in the Scrabble dictionary (https://scrabble.hasbro.com/en-us/tools#dictionary). Every time the kernel is restarted the generated words change, even with a fixed seed. The most likely cause is that the list of states is built with `list(set(text))`, whose ordering can differ between Python sessions, so the same random draws map to different letters. Because of this, I used the commented-out cell below to calculate statistics for one set of randomly generated words that I manually checked against the Scrabble dictionary, and that set is not the one printed above.
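One possible way to make the generation repeatable across kernel restarts is to put the states in a fixed order before sampling. The sketch below is an assumption about a fix, not something this notebook does: `capital_states` and `pick_start` are hypothetical names, and sorting is just one convenient way to obtain an order that does not depend on how a `set` happens to iterate.

```python
import random

def capital_states(text):
    """Return the capital letters of `text` in a fixed (sorted) order."""
    states = sorted(set(text))  # sorted() removes the dependence on set iteration order
    return [s for s in states if "A" <= s <= "Z"]

def pick_start(text, seed):
    """Pick a starting capital with a local, seeded generator."""
    rng = random.Random(seed)
    return rng.choice(capital_states(text))

first = pick_start("We the People ", seed=0)
second = pick_start("We the People ", seed=0)
print(first, second)
```

With a fixed state order and a fixed seed, repeated calls select the same starting letter.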

In [11]:
# # List of boolean values stating if a randomly generated word is real or not.
# isWord = [
#     0,1,0,0,1,0,1,0,0,0,
#     0,0,1,1,0,0,1,1,0,1,
#     0,0,1,0,0,0,0,0,0,0,
#     0,0,0,0,0,0,0,1,0,0,
#     0,0,0,1,0,0,0,0,1,1,
#     0,0,0,0,0,1,0,1,0,0,
#     1,0,0,0,0,0,0,0,0,0,
#     0,0,1,0,0,0,0,0,0,0,
#     0,0,0,0,0,1,0,0,0,0,
#     0,0,1,1,1,1,0,0,0,0
# ]


# print("Words\t\t\tNot Words\n---------------------------------------------------")

# # Print words in columns of words and not words
# for i in range(len(randWords)):
#     if isWord[i]:
#         print(randWords[i])
#     else:
#         print("\t\t\t" + randWords[i])



# # Generate some interesting statistics for the words.
# numReal = isWord.count(1)
# numNotReal = isWord.count(0)

# print(str(numReal) + " of the " + str(len(randWords)) + " generated words are real.")
# print(str(numNotReal) + " of the " + str(len(randWords)) + " generated words are not real.")

# realWords = 0
# notWords = 0

# # Determine average length of real and non real words
# for i in range(len(randWords)):
#     if isWord[i]:
#         realWords += len(randWords[i])
#     else:
#         notWords += len(randWords[i])
        
# print("\nThe average length of the real words is:\t", round(realWords/numReal,3))
# print("The average length of the not real words is:\t", round(notWords/numNotReal,3))

For the set of words that I manually checked, 22 of the 100 generated words were valid Scrabble words, so one-letter prediction produced valid words at a rate of about 22% in this sample. Another interesting observation is that the valid words have an average length of about 2.8 letters, while the invalid words have an average length of about 6.7 letters, more than twice as long. Some of the invalid words are quite long. It appears that a generated word sometimes starts on the path to one real word, then diverges onto a different path when a different letter is selected, and does so repeatedly, producing a long jumble of letters that looks like two or more words run together.
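The same statistics could also be computed automatically once a word list is available. The sketch below assumes a hypothetical helper `word_stats` and a tiny stand-in dictionary; it is not the Scrabble dictionary and makes no claim about which generated words are valid. Building a `set` once from whichever word list is used keeps each membership test fast.

```python
def word_stats(words, dictionary):
    """Split words into real / not real and summarise each group (hypothetical helper)."""
    real = [w for w in words if w.lower() in dictionary]
    fake = [w for w in words if w.lower() not in dictionary]

    def mean_len(group):
        # Guard against an empty group to avoid dividing by zero.
        return sum(len(w) for w in group) / len(group) if group else 0.0

    return {
        "fraction_real": len(real) / len(words),
        "mean_len_real": mean_len(real),
        "mean_len_fake": mean_len(fake),
    }

# Stand-in dictionary for illustration only; a real check would load a full word list.
toy_dictionary = {"the", "in", "no", "we", "of"}
stats = word_stats(["Jund", "The", "Bisce", "In", "No"], toy_dictionary)
print(stats)
```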

The second method for generating random words is similar, but instead of conditioning on a single character it conditions on each pair of consecutive characters. The probability of the next character is estimated from the number of times each character appears after the pair formed by the current character and the character directly before it. For example, in the text " Order ", if the current character is "d", then the previous character is "r" and the next character is "e". In this string there is only one occurrence of "rd", so there is a probability of 1 that the next character is "e".

Since three characters factor into each probability, I need a 3D array to store them. Each grid represents a different previous character, each row represents a different current character, and each column represents a different next character. I have added a space before and after the text so that the first and last letters of the text also contribute data. Here are the numbers of occurrences of each combination of letters in this string.

Previous Character - "O":

| Current Character | "O" | "r" | "d" | "e" | " " |
|---|---|---|---|---|---|
| "O" | 0 | 0 | 0 | 0 | 0 |
| "r" | 0 | 0 | 1 | 0 | 0 |
| "d" | 0 | 0 | 0 | 0 | 0 |
| "e" | 0 | 0 | 0 | 0 | 0 |
| " " | 0 | 0 | 0 | 0 | 0 |

Previous Character - "r":

| Current Character | "O" | "r" | "d" | "e" | " " |
|---|---|---|---|---|---|
| "O" | 0 | 0 | 0 | 0 | 0 |
| "r" | 0 | 0 | 0 | 0 | 0 |
| "d" | 0 | 0 | 0 | 1 | 0 |
| "e" | 0 | 0 | 0 | 0 | 0 |
| " " | 0 | 0 | 0 | 0 | 0 |

Previous Character - "d":

| Current Character | "O" | "r" | "d" | "e" | " " |
|---|---|---|---|---|---|
| "O" | 0 | 0 | 0 | 0 | 0 |
| "r" | 0 | 0 | 0 | 0 | 0 |
| "d" | 0 | 0 | 0 | 0 | 0 |
| "e" | 0 | 1 | 0 | 0 | 0 |
| " " | 0 | 0 | 0 | 0 | 0 |

Previous Character - "e":

| Current Character | "O" | "r" | "d" | "e" | " " |
|---|---|---|---|---|---|
| "O" | 0 | 0 | 0 | 0 | 0 |
| "r" | 0 | 0 | 0 | 0 | 1 |
| "d" | 0 | 0 | 0 | 0 | 0 |
| "e" | 0 | 0 | 0 | 0 | 0 |
| " " | 0 | 0 | 0 | 0 | 0 |

Previous Character - " ":

| Current Character | "O" | "r" | "d" | "e" | " " |
|---|---|---|---|---|---|
| "O" | 0 | 1 | 0 | 0 | 0 |
| "r" | 0 | 0 | 0 | 0 | 0 |
| "d" | 0 | 0 | 0 | 0 | 0 |
| "e" | 0 | 0 | 0 | 0 | 0 |
| " " | 0 | 0 | 0 | 0 | 0 |

You may notice that some combinations of characters do not occur. Since there is no data for these pairs of previous and current characters, I use a uniform distribution in these cases so that a following character can still be selected. During generation, however, this is only likely to matter for a pair at the end of the text, where no character follows the pair. Here are the resulting probabilities.

Previous Character - "O":

| Current Character | "O" | "r" | "d" | "e" | " " |
|---|---|---|---|---|---|
| "O" | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |
| "r" | 0 | 0 | 1 | 0 | 0 |
| "d" | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |
| "e" | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |
| " " | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |

Previous Character - "r":

| Current Character | "O" | "r" | "d" | "e" | " " |
|---|---|---|---|---|---|
| "O" | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |
| "r" | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |
| "d" | 0 | 0 | 0 | 1 | 0 |
| "e" | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |
| " " | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |

Previous Character - "d":

| Current Character | "O" | "r" | "d" | "e" | " " |
|---|---|---|---|---|---|
| "O" | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |
| "r" | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |
| "d" | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |
| "e" | 0 | 1 | 0 | 0 | 0 |
| " " | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |

Previous Character - "e":

| Current Character | "O" | "r" | "d" | "e" | " " |
|---|---|---|---|---|---|
| "O" | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |
| "r" | 0 | 0 | 0 | 0 | 1 |
| "d" | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |
| "e" | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |
| " " | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |

Previous Character - " ":

| Current Character | "O" | "r" | "d" | "e" | " " |
|---|---|---|---|---|---|
| "O" | 0 | 1 | 0 | 0 | 0 |
| "r" | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |
| "d" | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |
| "e" | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |
| " " | 1/5 | 1/5 | 1/5 | 1/5 | 1/5 |

There is one occurrence of "de" in the text, and it is followed by "r", so the probability of "r" following "de" is 1. Since there is no occurrence of "re" in the text, the character following that pair has a uniform distribution.
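Because most pairs never occur, the same information can be held in a sparse structure keyed by the (previous, current) pair instead of a dense 3D array. This is a sketch under that assumption: `pair_transition_probs` is a hypothetical name, and unseen pairs are simply absent from the result, so a caller would have to apply the uniform fallback itself.

```python
from collections import Counter, defaultdict
from fractions import Fraction

def pair_transition_probs(seq):
    """Estimate P(next | previous, current) from consecutive triples in seq."""
    counts = defaultdict(Counter)
    # Each window of three items contributes one observation for its leading pair.
    for prev, cur, nxt in zip(seq, seq[1:], seq[2:]):
        counts[(prev, cur)][nxt] += 1

    return {
        pair: {nxt: Fraction(c, sum(f.values())) for nxt, c in f.items()}
        for pair, f in counts.items()
    }

order_probs = pair_transition_probs(" Order ")
print(order_probs[("d", "e")])
```

For " Order " only five pairs have data, and the pair ("r", "e") is missing, which corresponds to a uniform row in the tables above.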

In [12]:
def prob_from_previous_two(text):
    '''
    Function:     prob_from_previous_two
    
    Description:  Calculates the probabilities of a letter or word, given the current and previous letters or words.
                  This works for both a string of characters or a list of words.
    
    Parameters:   text - A string of characters or a list of words.
    
    Returned:     states - The set of letters or words in text.
                  probs  - 3D array where the row represents the previous letter or word,
                           the column represents the current letter or word, and the depth represents the 
                           next letter or word. Values of the array represent the probability of the next
                           letter or word given the previous two.
    '''
    
    # Find set of characters or words in the text.
    states = list(set(text))
    
    probs = [[[0 for i in range(len(states))] for j in range(len(states))] for k in range(len(states))]
    
    # Count each occurrence of three consecutive letters or words.
    for i, state in enumerate(text):
        if not (i == 0 or i+1 == len(text)):
            probs[states.index(text[i-1])][states.index(state)][states.index(text[i+1])] += 1
    
    # Normalize by the number of times each (previous, current) pair is followed by anything.
    # Summing the counts also works for lists of words and for overlapping pairs such as "   ".
    for i in range(len(probs)):
        for j in range(len(probs)):
            total = sum(probs[i][j])
            for k in range(len(probs)):
                if 0 == total:
                    probs[i][j][k] = sm.Rational(1,len(states))
                else:
                    probs[i][j][k] = sm.Rational(probs[i][j][k],total)
    
    return states, probs

In [13]:
# Add a space before the text to use the first character of the text as part of the data
text = " " + text
letters, twoLetterProbs = prob_from_previous_two(text)

Again, now that I have found the probabilities of each character occurring after each pair of letters, I can generate words in a similar manner to the one-letter prediction technique. First I select a capital letter uniformly at random. I then prepend a space to the word, because this technique needs both the current and the previous character in order to generate the second letter. Given the current letter and the previous letter, I randomly select the next letter based on the probabilities of each letter following that pair. Then I repeat this process, appending new letters to the end of the word until a space is selected, indicating the end of the word.

To demonstrate, I'll walk through the example from above, using the string " Order ". There is only one capital letter in this set, so I will start with "O". Then I prepend a space and randomly select the next letter based on the probabilities of each letter following " O". Looking at the probability tables above, there is a probability of 1 that the next character will be "r". Now the current letter is "r" and the previous letter is "O", so the process is repeated using the probabilities of each letter following the string "Or". From the tables above, there is a probability of 1 that the next letter is "d". Now the current letter is "d" and the previous letter is "r". Following this same process, the next characters are "e", "r", and a space, indicating the end of the word. The word that has been produced is "Order". This makes sense because the only word in the sample text is "Order". A more extensive sample text, like the Constitution, will produce more varied results.
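The " Order " walk-through can be checked with a short, self-contained sketch. The helper names `build_pair_counts` and `generate_word_from_pairs` are hypothetical, and the uniform fallback over a sorted alphabet is an assumption that mirrors the description above; on this sample the fallback is never reached because every pair along the path has exactly one successor.

```python
import random
from collections import Counter, defaultdict

def build_pair_counts(text):
    """Count the successors of every (previous, current) pair of characters."""
    counts = defaultdict(Counter)
    for prev, cur, nxt in zip(text, text[1:], text[2:]):
        counts[(prev, cur)][nxt] += 1
    return counts

def generate_word_from_pairs(counts, start, alphabet, rng):
    """Grow a word from a leading space plus `start` until a space is drawn."""
    word = " " + start
    while word[-1] != " ":
        followers = counts.get((word[-2], word[-1]))
        if followers:
            nxt = rng.choices(list(followers), weights=list(followers.values()))[0]
        else:
            nxt = rng.choice(alphabet)  # uniform fallback for unseen pairs
        word += nxt
    return word.strip()  # drop the leading and trailing spaces

sample = " Order "
pair_counts = build_pair_counts(sample)
result = generate_word_from_pairs(pair_counts, "O", sorted(set(sample)), random.Random(0))
print(result)  # → Order
```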

In [14]:
def gen_two_letter_prediction(letters, letterProbs, seed=None, numWords=100):
    '''
    Function:     gen_two_letter_prediction
    
    Description:  Generates words using the Markov chain technique by randomly generating letters
                  based on the probabilities of a letter given the previous and current letters.
    
    Parameters:   letters     - Set of letters to be used.
                  letterProbs - 3D array of probabilities where the row represents the previous letter,
                                the column represents the current letter, and the depth represents the next letter.
                                The values of the array are the probability of the next letter given the
                                previous and current letters.
                  seed        - Seed to generate the same random order each time.
                  numWords    - Number of words to randomly generate.
    
    Returned:     randWords - List of randomly generated words.
    '''
    
    random.seed(seed)
    np.random.seed(seed)

    capitals = []
    randWords = []

    # Find all capital letters in the set.
    for x in letters:
        if (x >= 'A' and x <= 'Z'):
            capitals.append(x)

    for i in range(numWords):
        # Select a random capital letter to start the word.
        current = capitals[random.randint(0,len(capitals)-1)]
        word = ' ' + current
        
        while current not in [' ', '\n']:
            # Generate the next letter based on the calculated probabilities.
            current = np.random.choice(letters, p=letterProbs[letters.index(word[len(word)-2])][letters.index(current)])
            word += current

        randWords.append(word[1:len(word)-1])

    return randWords

In [15]:
randWords = gen_two_letter_prediction(letters, twoLetterProbs, 0)

for word in randWords:
    print(word)

Judin
Laws
But
Majork
Inficeprecut
No
Juden
Cong
New
Vicle
Kin
Quorce
In
Trescrial
Claticer
Trect
Grated
Pred
Mon
Elex
Shices
Pres
Timbece
Cass
Geof
Offled
Ame
Whe
Yeactionsuprinits
Numbe
Exectinify
Geof
Vot
Legishall
Yeaction
Pres
Depred
Quortiteside
Ele
No
Henationgrance
I
Mem
Butine
Ele
Unite
Affich
Offievot
Jur
Stain
Wilis
Dan
Union
Piess
No
Yeary
Fitiver
Offedentel
Yeacce
Stated
Amement
Quall
Kin
Felfte
Few
Tres
Ever
Hamposeve
Appong
Attaing
Yeaccuted
Impoicaterepars
New
Gory
Clate
Eresise
Constionve
Statent
Geor
Expinationsts
Yeart
End
Quall
Pars
Enuande
Kine
Cons
Housed
Aut
Pregistint
Juress
Yeakenecluresiont
Kin
For
Cony
Repardin
Quor
Rect
Buse
Pin


For the same reasons as for the one-letter prediction method, I manually checked one set of randomly generated words against the Scrabble dictionary and calculated some interesting statistics using the commented-out cell below.

In [16]:
# # List of boolean values stating if a randomly generated word is real or not.
# isWord = [
#     0,0,0,1,0,1,0,0,0,1,
#     1,0,0,0,0,1,0,0,0,0,
#     0,1,0,0,0,0,0,0,0,0,
#     1,0,0,0,0,1,0,1,0,0,
#     0,1,1,0,0,1,1,0,0,1,
#     1,1,1,0,0,0,0,0,0,0,
#     0,0,0,0,0,0,0,0,1,0,
#     1,1,0,1,0,0,0,0,0,1,
#     1,0,0,0,1,1,0,1,0,1,
#     0,0,0,0,0,0,0,0,0,0
# ]

# print("Words\t\t\tNot Words\n---------------------------------------------------")

# # Print words in columns of words and not words
# for i in range(len(randWords)):
# #     if isWord[i]:
#     if randWords[i].lower() in nltk.corpus.words.words():
#         print(randWords[i])
#     else:
#         print("\t\t\t" + randWords[i])

# numReal = isWord.count(1)
# numNotReal = isWord.count(0)

# print(str(numReal) + " of the " + str(len(randWords)) + " generated words are real.")
# print(str(numNotReal) + " of the " + str(len(randWords)) + " generated words are not real.")

# realWords = 0
# notWords = 0

# # Determine average length of real and non real words
# for i in range(len(randWords)):
#     if isWord[i]:
#         realWords += len(randWords[i])
#     else:
#         notWords += len(randWords[i])
        
# print("\nThe average length of the real words is:\t", round(realWords/numReal,3))
# print("The average length of the not real words is:\t", round(notWords/numNotReal,3))

For the set of words generated by the two-letter prediction method that I checked, 27 of the 100 words were valid Scrabble words, a rate of about 27%, which is 5 percentage points higher than for one-letter prediction. This makes sense because the method conditions on the previous two letters instead of just one, and fewer distinct characters follow a specific pair of characters than follow a single character.

It's also interesting that the average word length is longer with two-letter prediction for both valid and invalid words, about 3.7 and 7.0 letters respectively. The difference is most notable for valid words, whose average is almost a letter longer than with one-letter prediction.

With some of the invalid words from two-letter prediction it is fairly easy to see which words were combined to create them. For example, "Offich" combines "Office" and "which", and "Houslave" starts with "House" and ends with "slave". There are also instances where one word was started but a shorter word contained in it cuts the word short. For example, "Legis" is the beginning of "Legislature", but "is" is a common, short word, so a space is more likely to follow "is" than an "l".

The final method uses Markov chains with words to randomly generate sentences. Since a list of characters and a list of words are both just sequences in a particular order, the same method can be used to generate the probabilities for one-word prediction as for one-letter prediction. Instead of examining each instance of a character and recording the character that follows it, each instance of a word and the word that follows it is recorded. Remember, I am treating punctuation marks as individual words rather than appending them to the end of words.

I'll use the text "We the People of the United States." as an example. First, here is a table showing the number of occurrences of each word appearing after each word. The current word is in the row on the left, and the next word is in the column on top.

|Current Word | "We" | "the" | "People" | "of" | "United" | "States" | "." |
|--------|----|-----|--------|----|--------|--------|---|
| "We"     | 0  | 1   | 0      | 0  | 0      | 0      | 0 |
| "the"    | 0  | 0   | 1      | 0  | 1      | 0      | 0 |
| "People" | 0  | 0   | 0      | 1  | 0      | 0      | 0 |
| "of"     | 0  | 1   | 0      | 0  | 0      | 0      | 0 |
| "United" | 0  | 0   | 0      | 0  | 0      | 1      | 0 |
| "States" | 0  | 0   | 0      | 0  | 0      | 0      | 1 |
| "."      | 0  | 0   | 0      | 0  | 0      | 0      | 0 |

There is one instance where "People" follows "the" and one instance where "United" follows "the". To find the probabilities for each word, each row is normalized to sum to 1. Words that are never followed by another word ("." in this example) are given a uniform distribution. Below are the resulting probabilities.

| Current Word | "We" | "the" | "People" | "of" | "United" | "States" | "." |
|--------|-----|-----|--------|-----|--------|--------|-----|
| "We"     | 0   | 1   | 0      | 0   | 0      | 0      | 0   |
| "the"    | 0   | 0   | 1/2    | 0   | 1/2    | 0      | 0   |
| "People" | 0   | 0   | 0      | 1   | 0      | 0      | 0   |
| "of"     | 0   | 1   | 0      | 0   | 0      | 0      | 0   |
| "United" | 0   | 0   | 0      | 0   | 0      | 1      | 0   |
| "States" | 0   | 0   | 0      | 0   | 0      | 0      | 1   |
| "."      | 1/7 | 1/7 | 1/7    | 1/7 | 1/7    | 1/7    | 1/7 |

There is a 50% chance that "People" will follow the word "the", and a 50% chance that "United" will.
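The counting and normalization steps can be sketched in a few lines of NumPy. This is only a minimal illustration on the sample text: `transition_table` is a hypothetical helper written for this example and is not necessarily how `prob_from_previous` is implemented.

```python
import numpy as np

def transition_table(tokens):
    """Hypothetical helper: estimate next-token probabilities from a token list."""
    # Unique tokens in order of first appearance.
    vocab = list(dict.fromkeys(tokens))
    index = {token: i for i, token in enumerate(vocab)}

    # counts[i, j] = number of times token j immediately follows token i.
    counts = np.zeros((len(vocab), len(vocab)))
    for current, following in zip(tokens, tokens[1:]):
        counts[index[current], index[following]] += 1

    # Normalize each row to sum to 1. Rows with no observed successor
    # fall back to a uniform distribution over the whole vocabulary.
    row_sums = counts.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    probs = np.where(row_sums > 0, counts / safe_sums, 1.0 / len(vocab))
    return vocab, probs

# Punctuation is treated as its own token, as in the rest of this notebook.
sample = ["We", "the", "People", "of", "the", "United", "States", "."]
vocab, probs = transition_table(sample)
```

Every row of `probs` sums to 1, which is what `np.random.choice` requires of its `p` argument.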

In [17]:
words = text.split()
wordSet, wordProbs = prob_from_previous(words)

Generating sentences using the one word prediction also follows similar logic to the one letter prediction to generate words. Instead of selecting a random capital letter, a random word is selected from the set of words that follow a period (words that start a sentence). And instead of a space marking the end of a word, a period is used to mark the end of the sentence.

Here is an example using the same sample text from above, "We the People of the United States." The only starting word is "We", so that is the word selected. Since "the" is the only word that follows "We", there is a probability of 1 that the next word is "the". Now the current word is "the", and there is a probability of 1/2 that the next word is "People" and a probability of 1/2 that the next word is "United". Say that "United" gets randomly selected. Now "United" is the current word and there is a probability of 1 that the next word is "States". And there is a probability of 1 that the next word after "States" is ".". This marks the end of the sentence, and the resulting sentence is "We the United States."
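The walk described above can be sketched on its own, using transition probabilities read directly off the sample sentence. The `transitions` dictionary and the `walk` function are names made up for this illustration; the notebook's own generator works from `wordSet` and `wordProbs` instead.

```python
import random

# Next-word probabilities for "We the People of the United States .",
# written out by hand for this illustration.
transitions = {
    "We":     {"the": 1.0},
    "the":    {"People": 0.5, "United": 0.5},
    "People": {"of": 1.0},
    "of":     {"the": 1.0},
    "United": {"States": 1.0},
    "States": {".": 1.0},
}

def walk(start, rng):
    """Follow the chain from `start` until a period ends the sentence."""
    sentence = [start]
    while sentence[-1] != ".":
        options = transitions[sentence[-1]]
        # Draw the next word with the listed probabilities.
        following = rng.choices(list(options), weights=list(options.values()))[0]
        sentence.append(following)
    return " ".join(sentence)

rng = random.Random(0)   # fixed seed so the run is repeatable
samples = [walk("We", rng) for _ in range(5)]
for s in samples:
    print(s)
```

Because "of" leads back to "the", the chain can loop: "We the United States ." occurs with probability 1/2, "We the People of the United States ." with probability 1/4, and each additional "People of the" halves the probability again.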

In [18]:
def gen_one_word_prediction(words, wordSet, wordProbs, seed=None, numSentences=100):
    '''
    Function:     gen_one_word_prediction
    
    Description:  Generates sentences using the Markov chain technique by randomly generating words
                  based on the probabilities of the next word given the current word.
    
    Parameters:   words        - The list of words in the text. 
                  wordSet      - Set of words to be used, in the same order as the rows of wordProbs.
                  wordProbs    - 2D array of probabilities where the row represents the current word,
                                   the column represents the next word. The values of the array are the 
                                   probability of the next word given the current word.
                  seed         - Seed to generate the same random order each time.
                  numSentences - Number of sentences to randomly generate.
    
    Returned:     randSentences - List of randomly generated sentences.
    '''
    
    random.seed(seed)
    np.random.seed(seed)

    startWords = [words[0]]
    randSentences = []
    
    # Collect the words that can start a sentence: the first word of the text,
    # plus every non-punctuation word that follows a period.
    for i, word in enumerate(words):
        if word not in startWords:
            if word not in [".", "\"", ","]:        # Do not include punctuation in the list of starting words.
                if "." == words[i-1]:               # Starting words are words that follow a period.
                    startWords.append(word)

    print("Set of words that start a sentence:\n\n", startWords, "\n\n\n")
    
    for i in range(numSentences):
        # Select a random starting word.
        current = startWords[random.randint(0, len(startWords)-1)]
        sentence = current + " "
        
        # Keep drawing the next word until a period ends the sentence.
        while current != ".":
            current = np.random.choice(wordSet, p=wordProbs[wordSet.index(current)])
            sentence += current + " "

        randSentences.append(sentence)

    return randSentences

In [19]:
sentences = gen_one_word_prediction(words, wordSet, wordProbs, 21321)

print("Randomly Generated Sentences: \n\n")

for sentence in sentences:
    print(sentence + "\n\n")

Set of words that start a sentence:

 ['We', 'Article', 'Section', 'No', 'Representatives', 'The', 'When', 'Immediately', 'Judgment', 'Each', 'Neither', 'They', 'Every', 'If', 'But', 'He', 'And', 'In', 'Before', 'A', 'All', 'This', 'Done', 'George', 'North', 'Amendment', '2', 'Congress', '3', '4', '5', 'After', 'Sections', '6', 'Whenever', 'Thereafter', 'Thereupon'] 



Randomly Generated Sentences: 


Whenever there should remain two thirds vote shall , the United States to which some other State shall be entitled to Controversies to Grant Reprieves and Representatives and other needful Rules for President of President elect nor in the other person shall act accordingly until a Vice President . 


He shall be Party , open all Places as Congress shall not be elected by appropriate legislation . 


Neither House in levying War in suppressing insurrection or abridged , promote the Senate and Representatives shall , Returns and been chosen every State , William Blount , without regard to 

The sentences generated by the cell above don't really make much sense, but some appear to be somewhat grammatically coherent, at least to the extent that there aren't several nouns in a row or several verbs in a row. So some of the sentences can be read smoothly, even if the content is meaningless. For example, one sentence states, "Immediately after it shall abridge the President during the Congress."

There is a lot of variance in the length of the sentences. Some are only a few words long, while others contain many commas that draw the sentence out to the length of a whole paragraph.

While these methods for generating words and sentences may not be practical, with a success rate for generating valid words that is under 30%, they are interesting. A particularly interesting takeaway is that the words and sentences tend to diverge when they reach a common character or word. For instance, many different words can follow a comma or a small word like "is" or "or". So it is unlikely that the one that makes the most sense will be selected, and the sentence trails off on a tangent partway through.