## EE 502 P: Analytical Methods for Electrical Engineering
    
# Homework 10: Review
## Due 15 December, 2019 at 11:59 PM
### <span style="color: red">Miller Sakmar (msakmar)</span>

Copyright &copy; 2019, University of Washington

In [1]:
import re
import urllib.request 
import numpy as np
import collections
import fractions
import string
np.random.seed(19680801) #Fixing the random seed for reproducibility
ConstitutionURL = "https://www.usconstitution.net/const.txt"
#Source of ScrabbleDictionary: https://drive.google.com/file/d/1oGDf1wjWp5RF_X9C7HoedhIWMh5uJs8s/view
FullScrabbleFilePath = "./Collins_Scrabble_Words_2019.txt"

# Hallucinating the Constitution

Consider the constitution of the United States:

> https://www.usconstitution.net/const.txt .

This document contains upper- and lower-case letters, numbers, and basic punctuation. 

# Introduction:
In this paper, we will be utilizing Markov models to perform one-letter, two-letter, and one-word prediction based off of a dictionary created from the Constitution of the United States.  Markov models follow the Markov property, in that future states, in this case letters or words, only depend on the current state and not the previous states.  The mathematical notation for a process that is Markov is as follows:
$$
P[X_{n+1} = j \;|\; X_n = i,X_{n-1} = i_{n-1}, ..., X_0 = i_0 ] = P[X_{n+1} = j \;|\; X_n = i] = P_{i,j}
$$
where $X$ is a random variable and, in our case, represents a letter or a word.  More specifically, $P_{i,j}$ is the probability that the next character is $j$ given that the current character is $i$.

Both the one-letter and one-word prediction models can be described as 1st order Markov processes.  The two-letter prediction model, however, is a 2nd order Markov process because the next state relies on the current state *and* the previous state
$$
P[X_{n+1} = k \;|\; X_n = j,X_{n-1} = i, X_{n-2} = i_{n-2}, ..., X_0 = i_0 ] = P[X_{n+1} = k \;|\; X_n = j,X_{n-1} = i] = P_{i,j,k}.
$$

We can then organize these probabilities together into a transition probability matrix P, where
$$
\displaystyle
P = 
\begin{pmatrix}
    P_{0,0} & P_{0,1} & \dots & P_{0,j} & \dots & P_{0,n-2} & P_{0,n-1}\\
    P_{1,0} & P_{1,1} & \dots & P_{1,j} & \dots & P_{1,n-2} & P_{1,n-1} \\
    \vdots  & \vdots  & \ddots & \vdots  & \ddots & \vdots  & \vdots  \\
    P_{i,0} & P_{i,1} & \dots & P_{i,j} & \dots & P_{i,n-2} & P_{i,n-1} \\
    \vdots  & \vdots  & \ddots & \vdots  & \ddots & \vdots  & \vdots  \\
    P_{n-1,0} & P_{n-1,1} & \dots & P_{n-1,j} & \dots & P_{n-1,n-2} & P_{n-1,n-1}
\end{pmatrix},
$$

and each row sums to either 0 or 1.  A row sums to 0 when its starting character or word (or, for the two-letter model, its starting pair) never occurs in the source text, or when it occurs only at the very end of the text and so has nothing following it.  We will cover how these situations were handled later in the paper.
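As a minimal sketch of the idea, assuming a tiny made-up input string rather than the Constitution, the counts of adjacent character pairs can be normalized row by row.  The function name `transition_matrix` and the alphabetical row order are illustrative assumptions and are not the functions used later in this paper.

```python
import numpy as np

def transition_matrix(text):
    """Estimate a first-order transition matrix from adjacent character pairs."""
    alphabet = sorted(set(text))                      # fixed row/column order
    index = {ch: i for i, ch in enumerate(alphabet)}  # character -> row/column
    counts = np.zeros((len(alphabet), len(alphabet)))
    for current, following in zip(text, text[1:]):    # every adjacent pair
        counts[index[current], index[following]] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # Divide each row by its total; rows with no outgoing pair stay all zero
    P = np.divide(counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0)
    return alphabet, P

alphabet, P = transition_matrix("abac")
print(alphabet)       # ['a', 'b', 'c']
print(P.sum(axis=1))  # rows 'a' and 'b' sum to 1; 'c' only ends the string, so its row sums to 0
```

Here 'c' appears only as the last character, so its row is the sum-to-0 case described above.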

# A Small Example:

For example, the transition probability matrix for the first few words of the Constitution, **"We the People of the United States, "** would be constructed by following the steps below:

1. Find all of the unique characters in the string, "We the People of the United States, ".  This resulted in the array: **[W, e, ' ', t, h, P, o, p, l, f, U, n, i, d, S, a, s,',']**.

     1. Note how spaces, **' '**, punctuations, '**,**', and capital *and* lower-case characters are included in this array.
     
     
2. Count the total number of occurrences of each character and character-pair, then divide the count of each unique character-pair by the total number of character-pairs that start with the same character.  For example, the character-pairs beginning with the character 'e' are: **['e ': 4, 'eo': 1, 'ed': 1, 'es': 1]**, which add to $7$ total occurrences of character-pairs starting with the letter 'e'.  The probability vector of character-pairs beginning with 'e' is then 

| Starting/Ending Letter | W | e   | ' ' | t   | h   | P   | o   | p   | l | f   | U   | n | i | d   | S   | a   | s   | , |
|------------------------|---|-----|-----|-----|-----|-----|-----|-----|---|-----|-----|---|---|-----|-----|-----|-----|---|
| e                      | 0 | 0   | $\frac{4}{7}$ | 0   | 0   | 0   | $\frac{1}{7}$ | 0   | 0 | 0   | 0   | 0 | 0 | $\frac{1}{7}$ | 0   | 0   | $\frac{1}{7}$ | 0 |


3. Combine each probability vector together into a matrix (printed as a table for easier viewing)

| Starting/Ending Letter | W | e   | ' ' | t   | h   | P   | o   | p   | l | f   | U   | n | i | d   | S   | a   | s   | , |
|------------------------|---|-----|-----|-----|-----|-----|-----|-----|---|-----|-----|---|---|-----|-----|-----|-----|---|
| W                      | 0 | 1   | 0   | 0   | 0   | 0   | 0   | 0   | 0 | 0   | 0   | 0 | 0 | 0   | 0   | 0   | 0   | 0 |
| e                      | 0 | 0   | $\frac{4}{7}$ | 0   | 0   | 0   | $\frac{1}{7}$ | 0   | 0 | 0   | 0   | 0 | 0 | $\frac{1}{7}$ | 0   | 0   | $\frac{1}{7}$ | 0 |
| ' '                    | 0 | 0   | $\frac{1}{7}$ | $\frac{2}{7}$ | 0   | $\frac{1}{7}$ | $\frac{1}{7}$ | 0   | 0 | 0   | $\frac{1}{7}$ | 0 | 0 | 0   | $\frac{1}{7}$ | 0   | 0   | 0 |
| t                      | 0 | $\frac{2}{5}$ | 0   | 0   | $\frac{2}{5}$ | 0   | 0   | 0   | 0 | 0   | 0   | 0 | 0 | 0   | 0   | $\frac{1}{5}$ | 0   | 0 |
| h                      | 0 | 1   | 0   | 0   | 0   | 0   | 0   | 0   | 0 | 0   | 0   | 0 | 0 | 0   | 0   | 0   | 0   | 0 |
| P                      | 0 | 1   | 0   | 0   | 0   | 0   | 0   | 0   | 0 | 0   | 0   | 0 | 0 | 0   | 0   | 0   | 0   | 0 |
| o                      | 0 | 0   | 0   | 0   | 0   | 0   | 0   | $\frac{1}{2}$ | 0 | $\frac{1}{2}$ | 0   | 0 | 0 | 0   | 0   | 0   | 0   | 0 |
| p                      | 0 | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 1 | 0   | 0   | 0 | 0 | 0   | 0   | 0   | 0   | 0 |
| l                      | 0 | 1   | 0   | 0   | 0   | 0   | 0   | 0   | 0 | 0   | 0   | 0 | 0 | 0   | 0   | 0   | 0   | 0 |
| f                      | 0 | 0   | 1   | 0   | 0   | 0   | 0   | 0   | 0 | 0   | 0   | 0 | 0 | 0   | 0   | 0   | 0   | 0 |
| U                      | 0 | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 0 | 0   | 0   | 1 | 0 | 0   | 0   | 0   | 0   | 0 |
| n                      | 0 | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 0 | 0   | 0   | 0 | 1 | 0   | 0   | 0   | 0   | 0 |
| i                      | 0 | 0   | 0   | 1   | 0   | 0   | 0   | 0   | 0 | 0   | 0   | 0 | 0 | 0   | 0   | 0   | 0   | 0 |
| d                      | 0 | 0   | 1   | 0   | 0   | 0   | 0   | 0   | 0 | 0   | 0   | 0 | 0 | 0   | 0   | 0   | 0   | 0 |
| S                      | 0 | 0   | 0   | 1   | 0   | 0   | 0   | 0   | 0 | 0   | 0   | 0 | 0 | 0   | 0   | 0   | 0   | 0 |
| a                      | 0 | 0   | 0   | 1   | 0   | 0   | 0   | 0   | 0 | 0   | 0   | 0 | 0 | 0   | 0   | 0   | 0   | 0 |
| s                      | 0 | 0   | 0   | 0   | 0   | 0   | 0   | 0   | 0 | 0   | 0   | 0 | 0 | 0   | 0   | 0   | 0   | 1 |
| ,                      | 0 | 0   | 1   | 0   | 0   | 0   | 0   | 0   | 0 | 0   | 0   | 0 | 0 | 0   | 0   | 0   | 0   | 0 |
    
   1. Note how the character-pair, 'We', and the first character, 'W', only occur once in the string, "We the People of the United States, ".  Therefore if the first character of a character-pair is 'W', there is a 100% probability of transitioning to the letter 'e'.
   2. Also note that to move along the transition probability matrix, we start with a letter, match it to a letter along the left-side of the matrix (the rows), then move across the columns to find the desired character-pair.
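The 'e' row from step 2 can be double-checked with exact fractions.  This is only a quick sketch using `zip` to form the pairs; the variable names are illustrative and the paper's own functions follow in the next section.

```python
import collections
import fractions

example = "We the People of the United States, "
# Count every overlapping character-pair in the string
pairs = collections.Counter(a + b for a, b in zip(example, example[1:]))
# Keep only the pairs that start with 'e'
e_pairs = {pair: count for pair, count in pairs.items() if pair[0] == "e"}
total = sum(e_pairs.values())
# Probability of each following character, kept as an exact fraction
e_row = {pair[1]: fractions.Fraction(count, total) for pair, count in e_pairs.items()}
print(total)       # 7
print(e_row[" "])  # 4/7
```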

# Coding Up the Small Example, Above:

### 1. Find all of the unique characters in the string, "We the People of the United States, ".

In the cell below, we define a function, UniqueCharacterSetCounter, that utilizes the regular expression (re) and collections libraries.  This function splits our string into overlapping substrings of a fixed length by using a regular expression.  We then pass these substrings into a collections.Counter, which creates a dictionary keyed by substring and counts the occurrences of each repeated substring.

In [2]:
def UniqueCharacterSetCounter(_FileString,_CharacterSetNumber):
    """Count the number of occurrences of each overlapping substring of length _CharacterSetNumber in the string _FileString""" 
    matches = re.finditer(r'(?=(.{' +str(_CharacterSetNumber) + '}))',_FileString)
    _Counter = collections.Counter([match.group(1) for match in matches])
    return _Counter
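The lookahead, `(?=...)`, is what makes the substrings overlap.  The short sketch below compares it with a plain pattern on a sample string of our choosing.

```python
import re

sample = "We the"
# A plain pattern consumes two characters per match, so the pairs do not overlap
non_overlapping = re.findall(r'.{2}', sample)
# A zero-width lookahead consumes nothing, so a match can start at every position
overlapping = [m.group(1) for m in re.finditer(r'(?=(.{2}))', sample)]
print(non_overlapping)  # ['We', ' t', 'he']
print(overlapping)      # ['We', 'e ', ' t', 'th', 'he']
```

Without the lookahead, pairs such as 'e ' and 'th' would never be counted.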

Now that we have our function, we can pass our function our small example string, "We the People of the United States, ".  The function outputs a counter dictionary that can be converted to a list to print out our desired unique character array.

In [3]:
#1. Find all of the unique characters in the string, "We the People of the United States, ".
ExampleString = "We the People of the United States, "
ExampleOneCharacterCounter = UniqueCharacterSetCounter(ExampleString,1)
ExampleOneCharacterList = list(ExampleOneCharacterCounter.items())
ExampleOneCharacterListCharacters = [index[0] for index in ExampleOneCharacterList]
print("List of Unique Characters: ",ExampleOneCharacterListCharacters)
print("")

List of Unique Characters:  ['W', 'e', ' ', 't', 'h', 'P', 'o', 'p', 'l', 'f', 'U', 'n', 'i', 'd', 'S', 'a', 's', ',']



### 2. Count the total number of occurrences for each character and character-pair

Because we used a counter to track our unique characters and character-pairs, we can print out the whole counter dictionary to see how many times each character and character-pair occurs in our small example string.

In [4]:
#2. Count the total number of occurrences for each character and character-pair
ExampleTwoCharactersCounter = UniqueCharacterSetCounter(ExampleString,2)
print("Total Occurences for Each Unique Character: ",list(ExampleOneCharacterCounter.items()))
print("")
print("Total Occurences for Each Unique Character-Pair: ",list(ExampleTwoCharactersCounter.items()))
print("")

Total Occurences for Each Unique Character:  [('W', 1), ('e', 7), (' ', 7), ('t', 5), ('h', 2), ('P', 1), ('o', 2), ('p', 1), ('l', 1), ('f', 1), ('U', 1), ('n', 1), ('i', 1), ('d', 1), ('S', 1), ('a', 1), ('s', 1), (',', 1)]

Total Occurences for Each Unique Character-Pair:  [('We', 1), ('e ', 4), (' t', 2), ('th', 2), ('he', 2), (' P', 1), ('Pe', 1), ('eo', 1), ('op', 1), ('pl', 1), ('le', 1), (' o', 1), ('of', 1), ('f ', 1), (' U', 1), ('Un', 1), ('ni', 1), ('it', 1), ('te', 2), ('ed', 1), ('d ', 1), (' S', 1), ('St', 1), ('ta', 1), ('at', 1), ('es', 1), ('s,', 1), (', ', 1)]



In the small walk-through above, we showed the occurrences and probabilities for character-pairs whose first letter is 'e'.  To accomplish this in code, we use the function below, MatchingCharacterSets, which uses a dictionary comprehension to loop through each key, or character-pair, in the ExampleTwoCharactersCounter dictionary and keeps only the pairs that begin with 'e' in a separate dictionary.  The function then prints the keys of the full dictionary, the keys of the matching sub-dictionary, and the total number of occurrences of character-pairs starting with 'e', which is $7$.

In [5]:
def MatchingCharacterSets(_InputCounter,_SearchKey):
    
    print("The original dictionary is : " + str(list(_InputCounter.keys()))) 
    print("")
    
    # Key starts with search_key in dictionary 
    _MatchingCharacterKeys = {key:val for key, val in _InputCounter.items() if (key.startswith(_SearchKey))}

    # printing result  
    print("The matching dictionary is: " + str(list(_MatchingCharacterKeys)))
    print("")
    
    #Get total count of two character pairs that start with search_key
    print("Sum of character sets that start with {}: {}".format(_SearchKey,np.sum(list(_MatchingCharacterKeys.values()))))
    print("")
    return _MatchingCharacterKeys

In [6]:
print("For example, the character-pairs starting with the letter 'e': ")
ExampleMatchingCounter = MatchingCharacterSets(ExampleTwoCharactersCounter,'e')

For example, the character-pairs starting with the letter 'e': 
The original dictionary is : ['We', 'e ', ' t', 'th', 'he', ' P', 'Pe', 'eo', 'op', 'pl', 'le', ' o', 'of', 'f ', ' U', 'Un', 'ni', 'it', 'te', 'ed', 'd ', ' S', 'St', 'ta', 'at', 'es', 's,', ', ']

The matching dictionary is: ['e ', 'eo', 'ed', 'es']

Sum of character sets that start with e: 7



In order to calculate the individual probabilities for every character-pair that starts with 'e', we write another function, CreateCharacterPVector, that returns the characters' calculated transition probability vector.  This function accomplishes its task by simply looping through the input CharacterList, our matching character-pairs, and dividing each count by the total.

In [7]:
#Given a CharacterList, return the selected characters and a Pvector for those characters
#The selection can be restricted to capital letters, to letters only, or left unrestricted
def CreateCharacterPVector(_CharacterList,_LettersOnly=True,_CapitalLettersOnly=True):
    """Given a CharacterList of (character, count) tuples, return a list of the selected characters and a transition probability vector, Pvector, for those characters"""
    _Characters = []
    OccurenceSum = 0
    _PVector = []
    
    for index in _CharacterList:
        if(_CapitalLettersOnly):
            if(index[0].isupper()):
                _Characters.append(index[0])
                OccurenceSum += index[1]
                _PVector.append(index[1])
        elif(_LettersOnly):
            if(index[0].isalpha()):
                _Characters.append(index[0])
                OccurenceSum += index[1]
                _PVector.append(index[1])
        else:
            _Characters.append(index[0])
            OccurenceSum += index[1]
            _PVector.append(index[1])
    for index in range(len(_Characters)):
        _PVector[index] = _PVector[index]/OccurenceSum
    return _Characters, _PVector

In [8]:
ExampleMatchingList = list(ExampleMatchingCounter.items())
Characters, CharactersPVector = CreateCharacterPVector(ExampleMatchingList,_LettersOnly=False,_CapitalLettersOnly=False)
print("Character array : ",Characters)
print("")
print("Transition Probability Vector: ",CharactersPVector)

Character array :  ['e ', 'eo', 'ed', 'es']

Transition Probability Vector:  [0.5714285714285714, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285]


We can see from the above probability vector that if the current character is 'e', the probability of the next character being a space, ' ', is ~57% $\displaystyle (\frac{4}{7})$, whereas the probability of the next character being an 'o', 'd', or 's' is only ~14% $\displaystyle (\frac{1}{7})$ for each character.  

### 3. Combine each probability vector together into a matrix

Now that we understand how a probability vector is constructed, we can carry the same principles over to making a transition probability matrix.  To accomplish this new task, we write another function, Create2DPMatrix.  This function uses two for-loops to iterate through an $n \times n$ transition probability matrix, where $n$ is the total number of unique characters in our input dictionary.

The function fills each $P_{i,j}$ with the appropriate probability of the character $j$ occurring after the character $i$.

We run into one edge-case this way, in that the last character in our input string does not have another character to pair to.  We handle this by manually pairing the last character to another character in our dictionary.  We could also choose to ignore it, but by pairing it with another character, we allow our simulated Markov chain to continue instead of ending when it hits the last character.

In [9]:
#Create a 2DPMatrix
#Pair the last character to space to complete the pair and make the probabilities of the last character's row sum to 1

def Create2DPMatrix(_OneCharacterList,_OneCharacterCounter,_TwoCharactersCounter,_InputFileString):
    """Given a list of (character, count) tuples, _OneCharacterList, and two counters of unique
    single characters and unique character-pairs, _OneCharacterCounter and _TwoCharactersCounter,
    return a transition probability matrix, _P."""
    n = len(_OneCharacterList)
    _P = np.zeros((n,n))
    _P[CharacterIndexSearch(_OneCharacterList,LastCharacterOfFileString(_InputFileString,1)),CharacterIndexSearch(_OneCharacterList, ' ')] = 1/_OneCharacterCounter.get(LastCharacterOfFileString(_InputFileString,1))
    for row in range(np.shape(_P)[0]):
        for col in range(np.shape(_P)[1]):
            #print("_OneCharacterList[row][0]: ",_OneCharacterList[row][0])
            #print("_OneCharacterList[col][0]: ",_OneCharacterList[col][0])
            if(_TwoCharactersCounter.get(_OneCharacterList[row][0] + _OneCharacterList[col][0]) != None):
                #print(sm.Rational(_TwoCharactersCounter.get(_OneCharacterList[row][0] + _OneCharacterList[col][0]),_OneCharacterList[row][1]))
                _P[row,col] += _TwoCharactersCounter.get(_OneCharacterList[row][0] + _OneCharacterList[col][0])/_OneCharacterList[row][1]
    return _P
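To see why the last character needs special handling, the sketch below (illustrative variable names, exact fractions) computes the row sum of the final character in our small example before and after it is paired with a space.

```python
import collections
import fractions

example = "We the People of the United States, "
singles = collections.Counter(example)
pairs = collections.Counter(a + b for a, b in zip(example, example[1:]))

last = example[-1]  # the final character, a space in this example
# Occurrences of the last character that are followed by something
outgoing = sum(count for pair, count in pairs.items() if pair[0] == last)
row_sum_before = fractions.Fraction(outgoing, singles[last])
# Pairing the dangling last character with a space adds one more count
row_sum_after = row_sum_before + fractions.Fraction(1, singles[last])
print(row_sum_before)  # 6/7
print(row_sum_after)   # 1
```

The space occurs 7 times but only 6 of those occurrences have a following character, so its row would sum to 6/7 without the manual pairing.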

Now that we have made the transition probability matrix, we need to verify that it is valid.  This is our sanity check that the matrix was constructed properly.  Because the last character of the input string was manually paired with a space, every row, including that character's row, should sum to exactly 1; any other sum signals a construction error.

In [10]:
def VerifyValid2DPMatrix(_P):
    """Given a Markov probability matrix, _P, verify that every row sums to 1 (rounded to 10 decimal places)."""
    _TrackingBool = True
    for index in np.arange(_P.shape[0]):
        #if((np.round(np.sum(Pmatrix[index,:]),10)) == 0):
        #        continue
        if( (np.round(np.sum(_P[index,:]),10)) != 1):
            print("Sum of row, {} does not equal 1!".format(index))
            print(_P[index,:])
            _TrackingBool = False
    return _TrackingBool
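The same check can also be written without an explicit loop.  The sketch below is one possible vectorized variant, not the function used in this paper; the name `rows_are_stochastic` and the `allow_zero_rows` flag are assumptions made for illustration.

```python
import numpy as np

def rows_are_stochastic(P, allow_zero_rows=False):
    """Return True if every row of P sums to 1 (optionally also accepting all-zero rows)."""
    row_sums = np.asarray(P, dtype=float).sum(axis=1)
    ok = np.isclose(row_sums, 1.0)        # tolerant floating-point comparison
    if allow_zero_rows:
        ok |= np.isclose(row_sums, 0.0)   # a row with no outgoing pairs
    return bool(ok.all())

print(rows_are_stochastic([[0.5, 0.5], [1.0, 0.0]]))                        # True
print(rows_are_stochastic([[0.0, 0.0], [1.0, 0.0]]))                        # False
print(rows_are_stochastic([[0.0, 0.0], [1.0, 0.0]], allow_zero_rows=True))  # True
```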

In order to handle the last character in our input string, we first have to be able to know what it is.  Thus, we make the function LastCharacterOfFileString, which simply uses slicing of the input string.  We mainly created this function to help clarify the Create2DPMatrix function's code.

In [11]:
def LastCharacterOfFileString(_InputFileString,_NumOfCharacters):
    """Given a string of characters, return its last _NumOfCharacters characters.""" 
    return _InputFileString[-_NumOfCharacters:]

As you may have noticed by now, our probability vectors, and therefore our probability matrix, are not labeled with the characters that their rows and columns represent.  Therefore, we need a function, CharacterIndexSearch, that translates a character to a matrix index.

This enables us to find where the last character in the input string lies in the character list, which then maps to what row to use in the probability matrix.

In [12]:
def CharacterIndexSearch(_CharacterList,_CharacterSearch):
    """Given a _CharacterList of (character, count) tuples and a character to search for, _CharacterSearch,
    return the index of the character in _CharacterList, or -1 if it could not be found."""
    #Initialize before the loop so that an empty _CharacterList also returns -1
    _CharacterSearchIndex = -1
    for counter, index in enumerate(_CharacterList):
        if index[0] == _CharacterSearch:
            _CharacterSearchIndex = counter
            break
    return _CharacterSearchIndex
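For larger character lists, a dictionary built once can replace the linear search.  This is an optional alternative sketched under the assumption that the list holds (character, count) tuples, as it does in this paper; the sample list and the name `char_to_index` are illustrative.

```python
# A short sample list of (character, count) tuples
character_list = [("e", 7), (" ", 7), ("t", 5)]
# Map each character to its position in the list, built once up front
char_to_index = {ch: i for i, (ch, _) in enumerate(character_list)}
print(char_to_index.get("t", -1))  # 2
print(char_to_index.get("z", -1))  # -1, mirroring the not-found value above
```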

In [13]:
#3. Combine each probability vector together into a matrix
ExamplePMatrix = Create2DPMatrix(ExampleOneCharacterList,ExampleOneCharacterCounter,ExampleTwoCharactersCounter,ExampleString)

#We use the fractions library to convert our floats to fractions
np.set_printoptions(formatter={'all':lambda ExamplePMatrix: str(fractions.Fraction(ExamplePMatrix).limit_denominator())})
print("The resulting transition probability matrix: ")
print(ExamplePMatrix)

The resulting transition probability matrix: 
[[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 4/7 0 0 0 1/7 0 0 0 0 0 0 1/7 0 0 1/7 0]
 [0 0 1/7 2/7 0 1/7 1/7 0 0 0 1/7 0 0 0 1/7 0 0 0]
 [0 2/5 0 0 2/5 0 0 0 0 0 0 0 0 0 0 1/5 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1/2 0 1/2 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


### 4. Creating random words

Now that we have our transition probability matrix, we can actually simulate the Markov process by picking a random first **capital** letter and following the matrix.  To do this, we make another function, GenerateWords, which takes a first letter, a character list, a number for how many random words to craft, and a transition probability matrix, P.

In [14]:
def GenerateWords(_FirstLetter,_CharacterList,_NumberOfWords,_P):
    """Given a letter, _FirstLetter, a list of characters, _CharacterList, the number of words to create, _NumberOfWords,
    and a transition probability matrix, _P, return a set of up to _NumberOfWords unique random words."""
    CharacterListLetters = [index[0] for index in _CharacterList]
    LoopLimit = 0
    WordArray = []
    WordSet = set()
    #generate unique words until looplimit
    while ( (len(WordSet) != _NumberOfWords) and (LoopLimit < 1000) ):
        CurrentCharacter = _FirstLetter
        CurrentCharacterIndex = CharacterIndexSearch(_CharacterList, _FirstLetter)
        Word = CurrentCharacter
        while (CurrentCharacter != " "): 
            CurrentCharacter = list(np.random.choice(CharacterListLetters, 1, p=(np.array(_P[CurrentCharacterIndex,:]).astype(np.float64)).flatten()))[0]
            CurrentCharacterIndex = CharacterIndexSearch(_CharacterList, CurrentCharacter)

            Word = Word + CurrentCharacter
        #Condition the word so that it can be a *real* word without punctuation
        Word = (Word.translate(str.maketrans('', '', string.punctuation))).replace(' ','')
        WordArray.append(Word)
        WordSet = set(WordArray)
        LoopLimit = LoopLimit + 1
    return WordSet
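The core of the walk is small enough to sketch on its own.  The version below is a simplified illustration, not a replacement for GenerateWords: it uses NumPy's `default_rng` generator, adds a hypothetical `max_length` guard, and skips the punctuation stripping.  The three-character alphabet and matrix are made up so that the result is deterministic.

```python
import numpy as np

def sample_word(first_letter, alphabet, P, rng, max_length=50):
    """Follow the transition matrix P from first_letter until a space is drawn."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    word = first_letter
    current = first_letter
    while len(word) < max_length:
        # Draw the next character using the current character's row of P
        current = str(rng.choice(alphabet, p=P[index[current]]))
        if current == " ":
            break
        word += current
    return word

alphabet = ["a", "b", " "]
P = np.array([[0.0, 1.0, 0.0],   # 'a' is always followed by 'b'
              [0.0, 0.0, 1.0],   # 'b' is always followed by a space
              [0.0, 0.0, 1.0]])
rng = np.random.default_rng(0)
print(sample_word("a", alphabet, P, rng))  # ab
```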

But before we can start making random words, we first have to choose our first capital letter.  Luckily, we already have a function that can calculate a probability vector from a character list, CreateCharacterPVector!  We can then feed this probability vector into NumPy's np.random.choice function, which takes an array to choose from, our available capital letters, and a same-sized array of non-uniform probabilities, our probability vector.

In [15]:
#Selecting an arbitrary first capital letter
AvailableCapitalLetters, AvailableCapitalLettersPVector = CreateCharacterPVector(ExampleOneCharacterList, _CapitalLettersOnly=True)
ArbitraryCapitalLetter = np.random.choice(AvailableCapitalLetters, 1, p=(np.array(AvailableCapitalLettersPVector).astype(np.float64)).flatten())[0]
print("Arbitrary Capital Letter Selected: ",ArbitraryCapitalLetter)

Arbitrary Capital Letter Selected:  U
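As a rough illustration of what this selection does, the sketch below weights each capital letter by how often it occurs in the example string.  It uses `np.random.default_rng` and illustrative variable names instead of the functions above, so the letter it draws will not necessarily match the output shown.

```python
import collections
import numpy as np

text = "We the People of the United States, "
counts = collections.Counter(text)
# Keep only the capital letters together with their counts
capitals = [(ch, c) for ch, c in counts.most_common() if ch.isupper()]
letters = [ch for ch, _ in capitals]
weights = np.array([c for _, c in capitals], dtype=float)
weights /= weights.sum()  # normalize the counts into probabilities
rng = np.random.default_rng(0)
first = str(rng.choice(letters, p=weights))
print(letters)  # ['W', 'P', 'U', 'S']
print(weights)  # each capital occurs once, so each weight is 0.25
```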


Now we can feed this capital letter into GenerateWords! 

In [16]:
GeneratedRandomWords = GenerateWords(ArbitraryCapitalLetter,ExampleOneCharacterList,10,ExamplePMatrix)
print("Random Words Array: ",list(GeneratedRandomWords))

Random Words Array:  ['Unite', 'Uniteof', 'Unitate', 'Unithe', 'Unithed', 'Unitathe', 'Unitheopled', 'Unites', 'Unithes', 'United']


### 5. Checking random words for if they are *real* words

At last, the final stretch for this example is figuring out whether our generated words are actually real words.  One way to accomplish this is by referencing the Scrabble text file with Scrabble-approved words.  

Our first task, then, is to read our Scrabble file into a list of words and then pass that list into a counter dictionary, similar to what we did for each character earlier.  To do this, we create another function, ReadAndAppendFileIntoWordArray, that reads in our Scrabble file, parses it a bit, and splits it into a list of words.

In [17]:
def ReadAndAppendFileIntoWordArray(_InputFilePath):
    """Read in a file at _InputFilePath, count punctuation as words, and return an array of words.""" 
    #The with-statement closes the file automatically
    with open(_InputFilePath, mode='r') as file:
        filestring = file.read() # save the data as a string
    #Pad punctuation with spaces so that split() returns each mark as its own word
    _OutputString = filestring.replace(".", " . ") \
           .replace(",", " , ") \
           .replace("-", " - ") \
           .replace(";", " ; ") \
           .replace(":", " : ").split() # split on whitespace, dropping the excess
    return _OutputString

We then create another function, UniqueWordSetCounter, that converts our list of words into a counter dictionary.

In [18]:
def UniqueWordSetCounter(_InputWordArray):
    """Count the number of occurrences of each word in an array of words""" 
    _Counter = collections.Counter(_InputWordArray)
    return _Counter

Now that we have our two functions, we can make our Scrabble dictionary counter!  The FullScrabbleFilePath variable is in one of the first cells in this paper; the local path just has to be updated accordingly.

In [19]:
#Create a dictionary of Scrabble words
ScrabbleWords = ReadAndAppendFileIntoWordArray(FullScrabbleFilePath)
ScrabbleCounter = UniqueWordSetCounter(ScrabbleWords)

One last step is to make a function, LookupWordsInScrabbleDict, that takes our Scrabble dictionary and our generated words and prints out if our generated words are found in the Scrabble dictionary.

In [20]:
def LookupWordsInScrabbleDict(ScrabbleDict, WordSet):
    """Given a dictionary, ScrabbleDict, and a set of words to look for in the dictionary, WordSet, print out each word in WordSet and bold the word if it is found in the dictionary."""
    print("\033[1mBolded\033[0m words are in the Scrabble Dictionary")
    WordCount = 0
    for word in WordSet:
        if(ScrabbleDict.get(word.upper().replace(' ','')) == None):
            print("{}".format(word))
        else:
            print("\033[1m{}\033[0m".format(word))
            WordCount = WordCount + 1
    print("\033[1mReal Word Count: \033[0m",WordCount)
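Since we only need to know whether a word is present, a set works just as well as a counter for the lookup.  The sketch below uses a tiny in-memory word list as a hypothetical stand-in for the Scrabble file, and the helper name `real_words` is illustrative.

```python
def real_words(candidates, dictionary_words):
    """Return the candidates that appear in dictionary_words, ignoring case."""
    lookup = {w.upper() for w in dictionary_words}  # set membership is O(1) on average
    return sorted(w for w in candidates if w.upper() in lookup)

sample_dictionary = ["UNITE", "UNITED", "UNITES"]  # hypothetical stand-in for the Scrabble file
generated = {"Unite", "Uniteof", "United", "Unithe"}
print(real_words(generated, sample_dictionary))  # ['Unite', 'United']
```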

In [21]:
LookupWordsInScrabbleDict(ScrabbleCounter,GeneratedRandomWords)

**Bolded** words are in the Scrabble Dictionary
**Unite**
Uniteof
Unitate
Unithe
Unithed
Unitathe
Unitheopled
**Unites**
Unithes
**United**
**Real Word Count:** 3


Awesome!  We now have a process for creating random words based off of an input dictionary and by using single-character prediction.  

# Moving Onto One-Letter Prediction:

Now onto the first challenge in this paper: one-letter prediction for the entire Constitution.  Not surprisingly, none of the underlying code from our simple example above changes much.  All that changes is our input, the full Constitution of the United States.

First, we need to read in the entire Constitution as a string.  We can accomplish this by using the urllib.request library to read in the const.txt file from our provided URL, https://www.usconstitution.net/const.txt, as a bytes object, then decoding those bytes into a large string.  

In [22]:
def ReadAndAppendConstitutionURLIntoString(url):
    """Provided a valid URL, url, read in the file located at the url, decode it into a string, strip out unnecessary characters, then return the resulting string."""
    #Python 3 reads the response body as bytes
    file = urllib.request.urlopen(url)
    filebytes = file.read()

    #Decode bytes to string (assuming utf8 encoding)
    rawfilestring = filebytes.decode("utf8")
    file.close()

    #print(rawfilestring)
    FullConstitutionString = rawfilestring[rawfilestring.index("We the People"):].replace('\n', ' ').replace('\r', ' ')
    FullConstitutionString = ' '.join(FullConstitutionString.split())
    
    return FullConstitutionString
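The clean-up step can be tried offline on a short made-up sample.  The `raw` string below is an assumption standing in for the downloaded text; only the trimming and whitespace handling mirror the function above.

```python
# Hypothetical sample standing in for the downloaded file contents
raw = "Preamble header\r\nWe the People\r\n  of the United   States,\n"
# Drop everything before the first words of the Constitution
body = raw[raw.index("We the People"):]
# split() breaks on any run of whitespace, so join() leaves single spaces only
normalized = " ".join(body.split())
print(normalized)  # We the People of the United States,
```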

In [23]:
FileString = ReadAndAppendConstitutionURLIntoString(ConstitutionURL)

### 1. Find the set of all unique characters used in the Constitution of the United States and call the size of that set $n$.

As we did in the example, we need to first find all of the unique characters in the Constitution.  We can use the same function for this, UniqueCharacterSetCounter.  

In [24]:
#Create a counter dictionary of unique characters in the file
OneCharacterCounter = UniqueCharacterSetCounter(FileString,1)

#Number of unique characters
n = len(OneCharacterCounter) 
print("Total Number of Unique Characters: ",n)
print("")

Total Number of Unique Characters:  69



### 2. Create a transition probability matrix, $P$, with shape $n \times n$ that contains the probabilities where $P_{i,j}$ represents the probability that the next character is $j$, given the current character is $i$.

Now that we know the total number of unique characters, and their occurrences, in our OneCharacterCounter counter dictionary, we can start to build our transition probability matrix.

We begin by counting and figuring out the unique character-pairs in our input string, the Constitution, and make a TwoCharactersCounter counter dictionary.

In [25]:
#Create a counter dictionary of character pairs to help create our transition probability matrix
TwoCharactersCounter = UniqueCharacterSetCounter(FileString,2)

In order to keep everything consistent from run to run, we also sort the OneCharacterList from the most common to the least common character.

In [26]:
#Create a sorted character list based off of the OneCharacterCounter counter dictionary
OneCharacterList = OneCharacterCounter.most_common()

We now have everything to plug into our Create2DPMatrix function in order to create our transition probability matrix, which we can then verify using VerifyValid2DPMatrix.

In [27]:
OneCharacterPMatrix = Create2DPMatrix(OneCharacterList,OneCharacterCounter,TwoCharactersCounter,FileString)

In [28]:
VerifyValid2DPMatrix(OneCharacterPMatrix)

True

### 3. Simulate the Markov process by starting with an arbitrary capital letter from our dictionary.  Make 100 random *words* and see how many are actually words.

Since we have our transition probability matrix, we can now move onto making random words.  We can use the same method as before, which was to select a random first capital letter from our OneCharacterList by creating its probability vector and passing that vector to np.random.choice.  

In [29]:
#Selecting an arbitrary first capital letter
AvailableCapitalLetters, AvailableCapitalLettersPVector = CreateCharacterPVector(OneCharacterList, _CapitalLettersOnly=True)
ArbitraryCapitalLetter = np.random.choice(AvailableCapitalLetters, 1, p=(np.array(AvailableCapitalLettersPVector).astype(np.float64)).flatten())[0]
print("Arbitrary Capital Letter Selected: ",ArbitraryCapitalLetter)

Arbitrary Capital Letter Selected:  T


Luckily our GenerateWords function still works for this situation, as well, so all we need to do is plug in our updated ArbitraryCapitalLetter, OneCharacterList, desired number of random words, and OneCharacterPMatrix.

In [30]:
GeneratedRandomWords = GenerateWords(ArbitraryCapitalLetter,OneCharacterList,100,OneCharacterPMatrix)

We can use the same method as before for looking up these new random words in the Scrabble dictionary, too!

In [31]:
#Create a dictionary of Scrabble words
ScrabbleWords = ReadAndAppendFileIntoWordArray(FullScrabbleFilePath)
ScrabbleCounter = UniqueWordSetCounter(ScrabbleWords)

LookupWordsInScrabbleDict(ScrabbleCounter,GeneratedRandomWords)

[1mBolded[0m words are in the Scrabble Dictionary
Thace
Thevenan
Tericetero
Trthait
Thericthe
Thidmect
Trtio
Tenceror
Toua
Thelon
Thentesh
Thevenigion
Thalicendsh
[1mTor[0m
Tourechely
Thinden
Thapr
Thal
Touprichas
Thenerist
Trereiaxte
[1mTry[0m
Tof
Theilisutor
Thasunetty
Theren
Thame
Thenovorrme
Tred
Thovecopawiore
Tontre
Thesof
[1mThe[0m
Torcr
Thre
Tiory
Thtache
Thallofinda
Trernduchereimall
[1mTire[0m
Trofoff
Thompredeale
Tateris
[1mTe[0m
Tibostiomomes
Thay
[1mTo[0m
Thelepus
Thigall
Thume
[1mTot[0m
Thate
Trangalisin
Thefome
Trng
Trentes
Ther
Tre
Tren
Thallll
Tidexevoumongir
**Too**
Tipur
**Thale**
Thed
Tononicteiathery
Tidssorof
Trmenthoritaprrtes
Theputhappullliolichapes
Thand
**Tore**
Toredende
Thatecllideve
Theshfucthawh
Trte
Tr
Thes
Thacofrig
Thashe
Theng
Thanus
Thecesilo
Trser
Tobe
Tanmmbes
**Thaw**
**Torous**
Thashateed
Thesicenvan
Troor
Toin
**Trem**
Thecathere
Talleatide
Tollallll
Thereexcon
Tol
Ticot
Tw
Th
**Real Word Count:** 13


# Moving Onto Two-Letter Prediction

Now we can move on to the two-letter prediction problem.  This problem differs in that the next state depends on the current state *and* the previous state, which makes it a 2nd order Markov process.  To review, we can describe this behavior the following way:
$$
P[X_{n+1} = k \;|\; X_n = j,X_{n-1} = i, X_{n-2} = i_{n-2}, ..., X_0 = i_0 ] = P[X_{n+1} = k \;|\; X_n = j,X_{n-1} = i] = P_{i,j,k},
$$
where $X$ is a random variable representing an ASCII character.
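
To make $P_{i,j,k}$ concrete, here is a minimal sketch that estimates it from raw counts on a toy string.  The string and the function name `second_order_probability` are hypothetical, and the denominator here counts only pairs that are actually followed by a character, which is a slightly different convention from the one our Create3DPMatrix uses for the last pair of the file.

```python
import collections

# Toy text (assumption for illustration, not the Constitution)
toy_text = "the then there "

# Count every overlapping three-character window
triple_counts = collections.Counter(
    toy_text[i:i + 3] for i in range(len(toy_text) - 2)
)

def second_order_probability(i, j, k):
    """Estimate P[next = k | current = j, previous = i] from trigram counts."""
    # Number of times the pair (i, j) is followed by any character
    followed = sum(cnt for tri, cnt in triple_counts.items() if tri[:2] == i + j)
    if followed == 0:
        return 0.0   # the pair never occurs with a successor
    return triple_counts[i + j + k] / followed

print(second_order_probability("t", "h", "e"))
print(second_order_probability("h", "e", "n"))
```

In this toy text, "th" is always followed by "e", while "he" is followed by a space, "n", or "r" equally often, so the two calls illustrate a certain transition and a three-way split.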

### 1. Find the set of all unique characters used in the Constitution of the United States and call the size of that set $n$.

Since we are using the same dictionary, we can use the same $n$ as before.

In [32]:
#Create a counter dictionary of unique characters in the file
OneCharacterCounter = UniqueCharacterSetCounter(FileString,1)

#Number of unique characters
n = len(OneCharacterCounter) 
print("Total number of unique characters: ",n)
print("")

Total number of unique characters:  69



### 2. Create a transition probability matrix, $P$, with shape $n \times n \times n$ that contains the probabilities where $P_{i,j,k}$ represents the probability that the next character is $k$, given the current character is $j$ and the previous character was $i$.

Since we are now working with an $n \times n \times n$ matrix, we need to modify our Create2DPMatrix function a bit to make it Create3DPMatrix.  This simply involves adding a third for loop inside our existing row and column for loops.  This way, we can iterate through every element, $P_{i,j,k}$, in the transition probability matrix, $P$.

In [33]:
def Create3DPMatrix(_OneCharacterList,_TwoCharactersCounter,_ThreeCharactersCounter,_FileString):
    """Given a character list, _OneCharacterList, two counters, _TwoCharactersCounter and _ThreeCharactersCounter, and a long file string, _FileString, return a transition probability matrix."""
    n = len(_OneCharacterList)
    _P = np.zeros((n,n,n))

    #Link the last two characters of the file to the most common character, ' ', so that the final pair, which has no successor in the file, still has a row that sums to 1
    LastTwoCharactersOfFile = LastCharacterOfFileString(_FileString,2)
    _P[CharacterIndexSearch(_OneCharacterList,LastTwoCharactersOfFile[0]),CharacterIndexSearch(_OneCharacterList,LastTwoCharactersOfFile[1]),CharacterIndexSearch(_OneCharacterList,' ')] = 1/_TwoCharactersCounter.get(LastTwoCharactersOfFile)

    for row in range(np.shape(_P)[0]):
        for col in range(np.shape(_P)[1]):
            for index in range(np.shape(_P)[2]):
                Pair = _OneCharacterList[row][0] + _OneCharacterList[col][0]
                #Use the counter passed in as an argument rather than the global ThreeCharactersCounter
                TripleCount = _ThreeCharactersCounter.get(Pair + _OneCharacterList[index][0])
                if TripleCount is not None:
                    #P[i,j,k] = count(ijk)/count(ij)
                    _P[row,col,index] += TripleCount/_TwoCharactersCounter.get(Pair)
    return _P
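
The triple loop above visits all $n^3$ cells even though most three-character combinations never occur.  As an alternative sketch, the counts can be filled in with a single pass over the text and then normalized.  The function name `build_second_order_matrix` and the toy string are assumptions for this example, and unlike Create3DPMatrix it does not add an extra link for the last pair of the file.

```python
import numpy as np

def build_second_order_matrix(text):
    """Return (alphabet, P) where P[i, j, k] estimates P[next=k | current=j, previous=i]."""
    alphabet = sorted(set(text))
    index_of = {ch: idx for idx, ch in enumerate(alphabet)}
    n = len(alphabet)
    counts = np.zeros((n, n, n))
    # One pass over the text: each overlapping trigram increments one cell
    for a, b, c in zip(text, text[1:], text[2:]):
        counts[index_of[a], index_of[b], index_of[c]] += 1
    # Normalize along the last axis; pairs that are never followed stay all-zero
    totals = counts.sum(axis=2, keepdims=True)
    P = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    return alphabet, P

alphabet, P = build_second_order_matrix("abcabdabc")
a, b, c = (alphabet.index(ch) for ch in "abc")
print(P[a, b, c])   # share of "ab" occurrences that are followed by "c"
```

The work here grows with the length of the text rather than with $n^3$, at the cost of handling the end of the file differently.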

Another addition we have to incorporate is a ThreeCharactersCounter, which counts how often each unique character-pair is followed by a given third character in our dictionary.

In [34]:
TwoCharactersCounter = UniqueCharacterSetCounter(FileString,2)
ThreeCharactersCounter = UniqueCharacterSetCounter(FileString,3)

#Make all of our counter dictionaries into sorted lists, from most common to least common unique combinations
OneCharacterList = OneCharacterCounter.most_common()
TwoCharactersList = TwoCharactersCounter.most_common()
ThreeCharactersList = ThreeCharactersCounter.most_common()

The final change is to our transition probability matrix verification: since we added another dimension to our matrix, we need to update the function accordingly.  As mentioned early on in the paper, our transition probability matrix this time will have some rows that sum to 0.  This is because we build the $P$ matrix from all $n$ unique characters, which yields $n \times n$ possible character-pairs, some of which never occur in our dictionary.  So we change our VerifyValid3DPMatrix function to simply skip this case and fail only rows whose sum is neither 0 nor 1.

In [35]:
def VerifyValid3DPMatrix(Pmatrix):
    """Given a transition probability matrix, Pmatrix, verify each row in the matrix sums to 1."""
    TrackingBool = True
    for row in np.arange(Pmatrix.shape[0]):
        for col in np.arange(Pmatrix.shape[1]):
            if((np.round(np.sum(Pmatrix[row,col,:]),10)) == 0):
                continue
            if( (np.round(np.sum(Pmatrix[row,col,:]),10)) != 1):
                print("Sum of row,{}, col, {}, does not equal 1!".format(row,col))
                print(Pmatrix[row,col,:])
                TrackingBool = False
    return TrackingBool
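
The same check can also be written without explicit loops.  The sketch below is one possible vectorized variant; the function name `rows_are_valid`, the tolerance, and the two small test matrices are assumptions for illustration.

```python
import numpy as np

def rows_are_valid(P, tolerance=1e-10):
    """Return True if every row along the last axis sums to 0 or to 1."""
    row_sums = P.sum(axis=-1)
    # rtol=0 so that only the absolute tolerance decides what counts as equal
    is_zero = np.isclose(row_sums, 0.0, rtol=0.0, atol=tolerance)
    is_one = np.isclose(row_sums, 1.0, rtol=0.0, atol=tolerance)
    return bool(np.all(is_zero | is_one))

# A valid matrix: two populated rows summing to 1, the rest all zero
good = np.zeros((2, 2, 2))
good[0, 1] = [0.25, 0.75]
good[1, 0] = [1.0, 0.0]

# An invalid matrix: one row sums to 0.9
bad = good.copy()
bad[1, 1] = [0.5, 0.4]

print(rows_are_valid(good), rows_are_valid(bad))
```

This variant only reports pass or fail; the loop version above has the advantage of printing exactly which row failed.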

In [36]:
ThreeCharacterPMatrix = Create3DPMatrix(OneCharacterList,TwoCharactersCounter,ThreeCharactersCounter,FileString)
VerifyValid3DPMatrix(ThreeCharacterPMatrix)

True

### 3. Simulate the Markov process by starting with an arbitrary capital letter from our dictionary.  Make 100 random *words* and see how many are actually words.

In order to simulate our Markov process this time, we need to select the first *two* letters to begin our process.  Fortunately, we can use the same CreateCharacterPVector to make a transition probability vector of all of the character-pairs that contain capital letters.  We can then use the same np.random.choice method for choosing one of these character-pairs based off of their probability vector.

In [37]:
#Selecting an arbitrary first capital letter-pair
AvailableCapitalLetters, AvailableCapitalLettersPVector = CreateCharacterPVector(TwoCharactersList, _CapitalLettersOnly=True)
ArbitraryCapitalLetterPair = np.random.choice(AvailableCapitalLetters, 1, p=(np.array(AvailableCapitalLettersPVector).astype(np.float64)).flatten())[0]
print("Arbitrary Capital Letter-Pair Selected: ",ArbitraryCapitalLetterPair)

Arbitrary Capital Letter-Pair Selected:   C


Since we are now creating random words based off of two characters, we have to update our GenerateWords function to GenerateWordsTwoLetter.  Our new function just has to be able to track probability vectors based off of the currently indexed character and the previously indexed character.

In [38]:
def GenerateWordsTwoLetter(FirstTwoLetters,OneCharacterList,Pmatrix,NumberOfWords):
    """Given the first two letters, FirstTwoLetters, a list of characters, OneCharacterList, a transition probability matrix, Pmatrix, and the number of words to create, NumberOfWords,
    return a set, WordSet, of up to NumberOfWords unique random words."""
    CharacterListLetters = [index[0] for index in OneCharacterList]
    LoopLimit = 0
    WordArray = []
    WordSet = set()
    #Generate unique words until we have NumberOfWords of them or reach the loop limit
    while ( (len(WordSet) != NumberOfWords) and (LoopLimit < 1000) ):
        CurrentCharacter = FirstTwoLetters[1]
        CurrentFirstCharacterIndex = CharacterIndexSearch(OneCharacterList, FirstTwoLetters[0])
        CurrentSecondCharacterIndex = CharacterIndexSearch(OneCharacterList, FirstTwoLetters[1])
        Word = FirstTwoLetters
        while (CurrentCharacter != " "): 
            CurrentCharacter = list(np.random.choice(CharacterListLetters, 1, p=(np.array(Pmatrix[CurrentFirstCharacterIndex,CurrentSecondCharacterIndex,:]).astype(np.float64)).flatten()))[0]
            CurrentFirstCharacterIndex = CurrentSecondCharacterIndex
            CurrentSecondCharacterIndex = CharacterIndexSearch(OneCharacterList, CurrentCharacter)
            Word = Word + CurrentCharacter
        #Condition the word so that it can be a *real* word without punctuation
        Word = (Word.translate(str.maketrans('', '', string.punctuation))).replace(' ','')
        WordArray.append(Word)
        WordSet = set(WordArray)
        LoopLimit = LoopLimit + 1
    return WordSet
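
The core of this function is the sliding two-character window: the last two characters of the word pick the row to sample from.  The sketch below shows that idea in isolation on a toy text, using a dictionary of successor counts instead of our $n \times n \times n$ matrix.  The names `generate_word` and `successors`, the toy text, and the length cap are all assumptions for this example.

```python
import collections
import random

def generate_word(first_two, successors, rng, max_length=30):
    """Extend first_two one character at a time until a space is drawn."""
    word = first_two
    while not word.endswith(" ") and len(word) < max_length:
        options = successors.get(word[-2:])
        if not options:          # dead end: this pair was never followed by anything
            break
        letters = list(options)
        weights = [options[ch] for ch in letters]
        word += rng.choices(letters, weights=weights)[0]
    return word.strip()

# Toy text (assumption for illustration, not the Constitution)
toy_text = "the then there them these "

# successors["he"] counts which characters follow the pair "he", and so on
successors = collections.defaultdict(collections.Counter)
for a, b, c in zip(toy_text, toy_text[1:], toy_text[2:]):
    successors[a + b][c] += 1

rng = random.Random(0)           # hypothetical seed, only for repeatability
words = {generate_word("th", successors, rng) for _ in range(20)}
print(sorted(words))
```

Because every pair in this toy text has an unambiguous continuation after the first branch at "he", the generated words can only be words from the text itself; on the Constitution the branching is much richer, which is where the invented words come from.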

In [39]:
RandomWords = GenerateWordsTwoLetter(ArbitraryCapitalLetterPair,OneCharacterList,ThreeCharacterPMatrix,100)
print("Random Letter: ",ArbitraryCapitalLetterPair)
#print("Random Letter Array: ",RandomWords)
LookupWordsInScrabbleDict(ScrabbleCounter,RandomWords)

Random Letter:   C
**Bolded** words are in the Scrabble Dictionary
Conmentat
Cas
Cary
**Clan**
Choody
Carthe
Coneclaction
Convilithe
**Cits**
**Cons**
Conatinfectined
Chold
Constationerate
Congreizenater
**Clays**
Conspell
Coneedution
Capprovidennes
Consuchooside
Cesident
Couse
Casucholecuthe
**Cited**
Councity
**Case**
Const
Consted
Conspers
Courislareens
Cappoing
Consentrumearand
Com
**Cars**
Consen
Conste
Citle
**Coney**
Carlegis
Citesidensed
Cass
Cason
**Cor**
**Cot**
Cont
Cassurposend
Congre
Conse
Cith
**Class**
Crithoundenforess
Conmee
**Conto**
Congaint
**Cone**
**Cite**
Cong
Commited
**Cart**
Citned
Cout
Confecom
Consylacies
**Con**
Constionst
Conitte
Casideby
Conckned
Casom
Constivalis
Cartion
Carly
Cresident
Caratimented
**Cases**
Clars
**Constates**
Carged
Crition
Crithe
**Cond**
**Cites**
Consappromeentany
Cought
Citne
**Chime**
Casent
Cousention
**Couth**
Critembe
Casuragary
Conside

## Now Let Us Try One-Word Prediction:

So we can currently create random words from one-letter and two-letter prediction.  But how difficult is it to create random *sentences* from one-word prediction?  Surprisingly, or perhaps not surprisingly, it is not too different from one-letter prediction; we are only switching a character for a word.

That does not mean we can just plug in words instead of one or two characters into our functions, however.  We will have to make some minor changes.

The first change is how we interpret the Constitution.  To break it up into words, we treat each punctuation mark as a word of its own and use white-space as the delimiter.  Therefore, we intentionally add white space around existing punctuation and then invoke the split() function, which splits the Constitution variable, FileString, into an array of word strings, FileWordArray.

In [40]:
FileWordArray = (FileString.replace(",", " , ").replace(".", " . ").replace("-", " - ").replace(";", " ; ").replace(":", " : ")).split()
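
For comparison, the same tokenization can be sketched with a single regular expression.  The function name `tokenize` and the sample sentence are assumptions for this example; the pattern is meant to isolate the same five punctuation marks as the chained replace() calls.

```python
import re

def tokenize(text):
    """Split text into words and stand-alone punctuation tokens."""
    # Either a run of characters that are neither whitespace nor , . - ; :
    # or exactly one of those five punctuation marks
    return re.findall(r"[^\s,.\-;:]+|[,.\-;:]", text)

sample = "We the People, in Order to form a more perfect Union; do ordain two-thirds."
tokens = tokenize(sample)
print(tokens)
```

Any other punctuation, such as parentheses or quotation marks, stays attached to its word in both versions.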

Now that we have a word array, we can use the same function we used for making a counter dictionary of words for the Scrabble dictionary, UniqueWordSetCounter.

### 1. Find the set of all unique words used in the Constitution of the United States and call the size of that set $m$.

Since we can reuse the existing UniqueWordSetCounter function, we can easily get the unique words and their respective counts.

In [41]:
OneWordCounter = UniqueWordSetCounter(FileWordArray)

#Number of unique words
m = len(OneWordCounter) 
print("Total Number of Unique Words: ",m)
print("")

Total Number of Unique Words:  1352



### 2. Create a transition probability matrix, $P$, with shape $m \times m$ that contains the probabilities where $P_{i,j}$ represents the probability that the next word is $j$, given the current word is $i$.

In order to make a two-word counter dictionary, we tweak UniqueWordSetCounter a bit, below, so that it iterates through the entire input FileWordArray and counts overlapping pairs of words.  We will need this later when creating our transition probability matrix.

In [42]:
def UniqueTwoWordSetCounter(_WordArray):
    """Count the number of occurrences of each overlapping pair of adjacent words in _WordArray"""
    _WordPairs = collections.Counter([_WordArray[i] + " " + _WordArray[i+1] for i in range(0,len(_WordArray)-1,1)])
    return _WordPairs
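
An equivalent way to form the overlapping pairs is to zip the word array with itself shifted by one.  The sketch below keys the counter by tuples rather than by space-joined strings; the function name `count_word_pairs` and the sample words are assumptions for this example.

```python
import collections

def count_word_pairs(words):
    """Count each overlapping (current word, next word) pair."""
    # zip stops at the shorter input, so the last word has no pair of its own
    return collections.Counter(zip(words, words[1:]))

words = "the President shall be the President".split()
pair_counts = count_word_pairs(words)
print(pair_counts[("the", "President")])
```

A list of $N$ words always yields $N-1$ pairs, which is why the last word of the file needs special handling when the rows of the matrix are normalized.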

In [43]:
#Create a counter dictionary of word pairs to help create our transition probability matrix
TwoWordsCounter = UniqueTwoWordSetCounter(FileWordArray)

OneWordList = list(OneWordCounter.most_common())

We now *almost* have everything we need to create our transition probability matrix, which we can then verify using our verification function.  We just need to make slight modifications to our existing probability matrix function and adjust our helper functions so that they work with words.

Our first small change is just renaming our LastCharacterOfFileString function to LastWordOfWordArray.  This is just to make it easier to read what our Create2DWordPMatrix is doing.

In [44]:
def LastWordOfWordArray(_WordArray):
    """Return the last word of the input _WordArray"""
    return _WordArray[-1]

Our next change is renaming CharacterIndexSearch to WordIndexSearch.  Again, this is just a renaming for clarity's sake.

In [45]:
def WordIndexSearch(WordList,WordSearch):
    """Given a WordList and a word, WordSearch, to search for in the WordList,
    return either the index of the found word in the WordList or -1 if WordSearch could not be found."""
    #Default to -1 before the loop so that an empty WordList also returns -1
    WordSearchIndex = -1
    counter = 0
    for index in WordList:
        if index[0] == WordSearch:
            WordSearchIndex = counter
            break
        counter = counter + 1
    return WordSearchIndex
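
Since this search is called for every generated word, a dictionary built once from the word list is a possible alternative to the linear scan.  This is only a sketch; the names `index_of` and `word_index` and the toy sentence are assumptions for this example.

```python
import collections

# Toy word list in the same (word, count) format as OneWordList
word_counter = collections.Counter("to be or not to be".split())
word_list = word_counter.most_common()

# Build the lookup table once: word -> position in word_list
index_of = {word: idx for idx, (word, _count) in enumerate(word_list)}

def word_index(word):
    """Return the position of word in word_list, or -1 if it is absent."""
    return index_of.get(word, -1)

print(word_index("be"), word_index("missing"))
```

Each lookup then takes roughly constant time instead of scanning up to all $m$ entries.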

Finally, getting to our Create2DWordPMatrix function, we mainly change our CharacterList and CharacterCounters to WordLists and WordCounters.  We also handle the last word in the FileWordArray so that it is not ignored and instead is reconnected to 'the'.  This is just to make the row for the last word sum to 1.

In [46]:
def Create2DWordPMatrix(_OneWordList,_OneWordCounter,_TwoWordsCounter,_InputWordArray):
    """Given a WordList, _OneWordList, of single unique words, two counters of unique single-words and unique word-pairs, _OneWordCounter and _TwoWordsCounter, and the array of words they were counted from, _InputWordArray,
    return a transition probability matrix, _P."""
    n = len(_OneWordList)
    _P = np.zeros((n,n))
    _P[WordIndexSearch(_OneWordList,LastWordOfWordArray(_InputWordArray)),WordIndexSearch(_OneWordList, 'the')] = 1/_OneWordCounter.get(LastWordOfWordArray(_InputWordArray))
    for row in range(np.shape(_P)[0]):
        for col in range(np.shape(_P)[1]):
            if(_TwoWordsCounter.get(_OneWordList[row][0] + " " + _OneWordList[col][0]) is not None):
                _P[row,col] += _TwoWordsCounter.get(_OneWordList[row][0] + " " + _OneWordList[col][0])/_OneWordList[row][1]
    return _P
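
To see why the last word needs this extra link, consider a tiny example.  The sketch below builds a word transition matrix with and without the link and prints the row sums; the function name `transition_matrix`, its `wrap_to` parameter, and the four-word input are assumptions for illustration.

```python
import numpy as np

def transition_matrix(words, wrap_to=None):
    """Row-normalize word-pair counts by how often each word occurs."""
    vocab = sorted(set(words))
    index_of = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for current, following in zip(words, words[1:]):
        counts[index_of[current], index_of[following]] += 1
    if wrap_to is not None:
        # Give the final word one artificial successor
        counts[index_of[words[-1]], index_of[wrap_to]] += 1
    totals = np.array([words.count(w) for w in vocab], dtype=np.float64)
    return counts / totals[:, None]

words = "a b a b".split()
without_link = transition_matrix(words)
with_link = transition_matrix(words, wrap_to="a")
print(without_link.sum(axis=1))
print(with_link.sum(axis=1))
```

Here "b" occurs twice but is followed by another word only once, so without the link its row sums to 0.5 and the verification would fail; with the link every row sums to 1.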

In [47]:
OneWordPMatrix = Create2DWordPMatrix(OneWordList,OneWordCounter,TwoWordsCounter,FileWordArray)

VerifyValid2DPMatrix(OneWordPMatrix)

True

### 3. Simulate the Markov process by starting with an arbitary capital word from our dictionary.  Make ~10 random *sentences*.

Just as when we were simulating the Markov process for one-character prediction, we utilize a function like CreateCharacterPVector, renamed to CreateWordPVector.  The notable changes are the handling of words instead of characters and the check for capitalized *words* using the .istitle() function.

In [48]:
def CreateWordPVector(_WordList,_LettersOnly=True,_CapitalLettersOnly=True):
    """Given a _WordList, return a list of words and a probability vector, _PVector, for those words (capitalized words only by default)"""
    _Words = []
    OccurenceSum = 0
    _PVector = []
    
    for index in _WordList:
        if(_CapitalLettersOnly):
            if(index[0].istitle()):
                _Words.append(index[0])
                OccurenceSum += index[1]
                _PVector.append(index[1])
        elif(_LettersOnly):
            if(index[0].isalpha()):
                _Words.append(index[0])
                OccurenceSum += index[1]
                _PVector.append(index[1])
        else:
            _Words.append(index[0])
            OccurenceSum += index[1]
            _PVector.append(index[1])
    for index in range(len(_Words)):
        _PVector[index] = _PVector[index]/OccurenceSum
    return _Words, _PVector

Now we can get our available first words to choose from and their respective probabilities.

In [49]:
AvailableCapitalWords, AvailableCapitalWordsPVector = CreateWordPVector(OneWordList,_CapitalLettersOnly=True)

Our first major divergence from our one-character prediction setup happens here.  Before, we simply selected a single arbitrary capital letter and then generated words with that same capital letter.  However, since we have relatively few capitalized words, and many of them have only one transition, to the '.' *word* that ends our sentences, we need to select more arbitrary first words.

Therefore, we can use the while loop below to keep sampling from AvailableCapitalWords, weighted by AvailableCapitalWordsPVector, until we have 100 distinct ArbitraryWords.

In [50]:
ArbitraryWords = []
while(len(ArbitraryWords) != 100):
    NewRandomWord = np.random.choice(AvailableCapitalWords, 1, p=(np.array(AvailableCapitalWordsPVector).astype(np.float64)).flatten())[0]
    #Only keep words we have not already selected
    if NewRandomWord not in ArbitraryWords:
        ArbitraryWords.append(NewRandomWord)
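
Drawing until enough distinct words are collected can also be expressed as a single weighted draw without replacement.  The sketch below uses NumPy's Generator API for this; the start words, their counts, and the seed are made up for illustration.

```python
import numpy as np

# Hypothetical start words and occurrence counts (not taken from the Constitution)
start_words = ["The", "No", "Congress", "Each", "Every"]
weights = np.array([40, 25, 20, 10, 5], dtype=np.float64)
probabilities = weights / weights.sum()

rng = np.random.default_rng(0)   # hypothetical seed, only for repeatability
# replace=False guarantees that no word is returned twice
distinct_starts = rng.choice(start_words, size=3, replace=False, p=probabilities)
print(list(distinct_starts))
```

One practical difference: asking for more distinct words than there are candidates raises an error here, whereas the while loop above would never terminate in that case.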

We can now feed these ArbitraryWords into an updated GenerateWords function, GenerateSentences.  GenerateSentences is different in that it accepts an array of \_FirstWords, and each iteration of the while loop starts from the next word in \_FirstWords.  It also conditions the output sentences so that punctuation is visually correct.

In [51]:
def GenerateSentences(_FirstWords,_WordList,_NumberOfSentences,_Pmatrix):
    """Given a list of starting words, _FirstWords, a list of words, _WordList, the number of sentences to create, _NumberOfSentences, and a transition probability matrix, _Pmatrix,
    return a set, _SentenceSet, of up to _NumberOfSentences unique random sentences."""
    WordListWords = [index[0] for index in _WordList]
    LoopLimit = 0
    SentenceArray = []
    _SentenceSet = set()
    #Generate unique sentences until we have enough, reach the loop limit, or run out of first words
    while ( (_NumberOfSentences != len(_SentenceSet)) and (LoopLimit < 1000) and (LoopLimit < len(_FirstWords))):
        CurrentWord = _FirstWords[LoopLimit]
        CurrentWordIndex = WordIndexSearch(_WordList, _FirstWords[LoopLimit])
        Sentence = CurrentWord
        while (CurrentWord != "."): 
            CurrentWord = list(np.random.choice(WordListWords, 1, p=(np.array(_Pmatrix[CurrentWordIndex,:]).astype(np.float64)).flatten()))[0]
            CurrentWordIndex = WordIndexSearch(_WordList, CurrentWord)
            Sentence = Sentence + " " + CurrentWord
        #Condition the sentence so that the spacing around punctuation is visually correct
        SentenceArray.append(Sentence.replace(" , ",", ").replace(" . ",".").replace(" .",".").replace(" - ","-").replace(" ; ","; ").replace(" : ",": "))
        _SentenceSet = set(SentenceArray)
        LoopLimit = LoopLimit + 1
    return _SentenceSet
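
The punctuation clean-up at the end of the loop can also be sketched with two regular-expression substitutions instead of a chain of replace() calls.  The function name `detokenize` and the sample tokens are assumptions for this example.

```python
import re

def detokenize(tokens):
    """Join tokens with spaces, then tighten the spacing around punctuation."""
    sentence = " ".join(tokens)
    sentence = re.sub(r"\s+([,.;:])", r"\1", sentence)   # no space before , . ; :
    sentence = re.sub(r"\s*-\s*", "-", sentence)          # hyphens join their neighbours
    return sentence

tokens = ["two", "-", "thirds", "of", "the", "Senate", ",", "shall", "concur", "."]
print(detokenize(tokens))
```

This undoes the white space that was added around punctuation when the Constitution was split into words.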

Finally, we can plug our necessary inputs into GenerateSentences to make 10 random sentences!

In [52]:
RandomSentences = GenerateSentences(ArbitraryWords,OneWordList,10,OneWordPMatrix)
for RandomSentence in RandomSentences:
    print("Random Sentence: ",RandomSentence,"")

Random Sentence:  United States, shall be quartered in Cases, both Houses that purpose shall act accordingly, open Court, and other State on the legislature, and Felonies committed on the office or other Property belonging to ourselves and a member of the United States, Charles Cotesworth Pinckney, and naval forces, or naturalized in a quorum for the Laws of the several States; To declare the President, the Government, and eight, Thomas Fitzsimons, the fourth day. 
Random Sentence:  Rhode Island and enjoy any State. 
Random Sentence:  To provide for the United States; a Resident within, or possession of honor, giving them, when elected, and he shall take Care that State in December, or when the first Meeting shall then the several State, and for the land or members from the United States, to the list, one, without the Appointment of the government of the Receipts and Fact, both Houses shall flee from Office on Imports or other Officers; to enter, Imposts and all other Mode of Adjournme