# Problem Text: Hallucinating the Constitution

Consider the constitution of the United States:

> https://www.usconstitution.net/const.txt .

This document contains upper- and lower-case letters, numbers, and basic punctuation. 

**One letter prediction:**

1. Find the set of all characters used in the document. Call the number of characters $n$. 
2. Create an $n \times n$ matrix whose $i,j$ entry is the probability that the next character is $j$ given that the current character is $i$. Estimate this probability by looking at all occurrences of character $i$ in the document and the number of times character $j$ immediately follows it. 
3. Simulate this system as a Markov chain that starts with an arbitrary capital letter and continues until it gets to a space. Produce $100$ random "words" this way. How many of them are actual words? Use a [Scrabble dictionary](https://scrabble.hasbro.com/en-us/tools#dictionary) if you are not certain whether a given sequence is a word. 

**Two letter prediction:**

1. Create an $n \times n \times n$ tensor whose $i,j,k$ entry is the probability that the next character is $k$ given that the current character is $j$ and the previous character is $i$. Use the document to empirically find these probabilities. 
2. Use this model to construct random words. 

**Sentence prediction:**

Do a one word prediction, but use all the unique *words* in the document. Hallucinate sentences. Consider a punctuation mark as a word. 

# Mathematical Descriptions

For the rest of this document, I intersperse mathematical descriptions of each part of each solution between the blocks of code, along with justification for the specific code used. The mathematical descriptions are limited: most of the math I used is in the area of probability and matrices/tensors, and the rest is regular coding. 

In this first section, I import the tools needed to run the code. The urllib library allows me to read in a text file from a URL. The numpy library contains mathematical data structures and functions, the re library allows me to search through strings using "regular expressions", and the collections library provides Counter, a type of dictionary that tallies how many times each item appears in a list. The torch library allows me to create the tensor for Part 2. 

I also download the text file and convert it from bytes to a string. I then clean up the string in three steps. First, I replace all the newline symbols with spaces so that my code never reads a newline symbol. Second, I cut off the header of the text file, which is not part of the actual Constitution. Third, I use the title function to make the beginning of every word a capital letter. This way, when my code needs to choose a letter to start a word with, the sample of data it pulls from is much larger than if I had left the original document as it was, with only a few capital letters for proper nouns and the beginning of sentences.

# Part 1

In [0]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.error import HTTPError
import numpy as np
import re
from collections import Counter

import torch
import torch.nn as nn
import torch.nn.functional as F

#Extract text file to bytes and then bytes to string
Const = urlopen("https://www.usconstitution.net/const.txt").read()
encoding = 'utf-8'
ConStr = Const.decode(encoding)
#Cleans up string so it is just the Constitution, and capitalizes the beginning
#of each word
ConLine = ConStr.replace("\n", " ")
#The first 278 characters are the file header, not the Constitution itself
ConLine = ConLine[278:]
ConLine = ConLine.title()

For the purposes of creating words in Parts 1 and 2, punctuation is unnecessary, so the translate and replace functions below remove the punctuation or turn it into the space character. I then create an empty list and run a for loop through the punctuation-free string, adding a character to the list only if it has not been encountered previously. For ease of use, the list is sorted. Sorting orders the characters by character code, which puts the space character at the first index, followed by digits, then uppercase letters, then lowercase letters (any leftover symbols fall wherever their character codes place them). Lastly, the number of unique characters in the document is found by taking the length of the list.

In [0]:
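#Removes unnecessary punctuation for creating words
#(the backslash is written as \\ so Python does not treat \/ as an escape sequence)
NoPunc = ConLine.translate({ord(i): None for i in ',"[]\\/:();'})
NoPunc = NoPunc.replace(".", " ")
#Creates set of unique characters
unique = []
for char in NoPunc:
    if char not in unique:
        unique.append(char)
#Sorts by character code, so the space character ends up at index 0
unique.sort()
n = len(unique)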
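
The loop above can also be expressed with a set. The following is only a minimal sketch of that idea: the helper name `unique_characters` and the sample string are illustrative and do not come from the notebook.

```python
# Hypothetical helper, not part of the original notebook.
def unique_characters(text):
    """Return the distinct characters of text, sorted by character code."""
    return sorted(set(text))

sample = "We The People Of 1787"   # made-up sample, not the real document
chars = unique_characters(sample)
print(chars[0] == " ")   # the space (code 32) sorts before digits and letters
print(len(chars))        # plays the role of n in the notebook
```

Because `sorted` uses the same ordering as `list.sort`, the result should match what the explicit loop produces for the same input.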

For the next section, we need to understand the creation of a word using one letter prediction as a Markov process. 
First, a process is **Markov** if 
$$
P[X_{k+1} = x_{k+1} \;|\; X_k = x_k, \ldots, X_0 = x_0 ] = P[X_{k+1} = x_{k+1} \;|\; X_k = x_k ],
$$
meaning that the probability that the next state takes some value depends only on the current state. Because we use only the current letter to predict the next letter, we can treat our predictions as being Markov. 

Because the process is Markov, the collection of probabilities that one state will go to each other state can be represented by a probability (stochastic) matrix. 
First we define a **stochastic matrix** $Q$ as a real-valued $n \times n$ matrix such that

- $Q_{i,j} \in [0,1]$
- The sum of each row is one. 

The probability matrix $Q$ is constructed by placing the probability that the trajectory will go from state $i$ to state $j$ at $Q_{i,j}$, where $i$ is the row and $j$ is the column.
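
Written as a formula, the estimate from the problem statement looks as follows. Here $N_{i,j}$ is notation introduced only for this note (it does not appear in the code) and stands for the number of times character $j$ immediately follows character $i$ in the text:
$$
Q_{i,j} = \frac{N_{i,j}}{\sum_{k=1}^{n} N_{i,k}}.
$$
Each row with at least one observed successor therefore sums to one. As a small check, in the toy string "aab" the character a is followed once by a and once by b, so $Q_{a,a} = Q_{a,b} = 1/2$.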

In our specific $Q$, the states are the unique characters, so $Q$ is a square matrix whose size is the number of unique characters, and each state has the same index in $Q$ as it has in the unique character array. Next, to ensure that no further words are created once a word is complete (i.e. the next predicted character is a space), the probability that the character after a space is another space is set to 1, meaning that it is the only outcome. In Markov chain terms, the space is an absorbing state. This differs from the probability for the character after a space in the document itself, but that isn't relevant to us. Since the space character is at index 0, $p[0,0]=1$.

To populate the rest of the probabilities, the code first uses regular expressions to find every occurrence of a given character in the Constitution string and stores the index of each occurrence in an array. It then adds 1 to every value in that array to get the index of each character that directly follows the initial character, looks up the character at each of those indices in the main string, and adds it to another array. The Counter function then tallies how many times each second character appears. A for loop records the index of each second character in the unique characters array, and a final while loop sets the entry of $Q$ in the row corresponding to the first character and the column corresponding to the second character to the estimated probability, which is the number of occurrences of that specific second character divided by the total number of second characters. An outer while loop performs this for every character after the "space" character, whose probabilities we set manually. 

In [0]:
#Creates blank probability matrix
p = np.zeros((n,n))
p[0,0] = 1
count = 1
while count < len(unique): 
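  #Creates an index array for each occurrence of the character, and index for the
  #following character; re.escape keeps a character such as "?" from being read
  #as regular expression syntax
  FirstLet = [m.start() for m in re.finditer(re.escape(unique[count]), NoPunc)]
  SecondLet = [x+1 for x in FirstLet]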
  chars = []
  i = 0 
  #Adds each second letter to an array, and creates a dictionary of the amount of 
  #iterations for each of those letters
  while i < len(SecondLet):
    let = NoPunc[SecondLet[i]]
    chars.append(let)
    i = i + 1
  total = len(chars)
  instances = Counter(chars)
  keys = list(instances.keys())
  keyindex = []
  #Creates a list of which letter each key corresponds to, which acts as the columns
  #for the probability matrix
  for x in keys:
    if x in unique:
      keyindex.append(unique.index(x))
  d = 0
  #Places the probability of each letter at the right index in the p matrix
  while d < len(keyindex):
    p[count,keyindex[d]] = instances[keys[d]]/total
    d = d+1  
  count = count + 1
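
The same counts can be gathered in a single pass over adjacent character pairs. This is a hedged sketch rather than the notebook's method: the function name `transition_probabilities`, the `stop` parameter, and the toy string are all assumptions made for illustration, and it stores rows in dictionaries instead of a numpy matrix.

```python
from collections import Counter, defaultdict

# Hypothetical helper, not part of the original notebook.
def transition_probabilities(text, stop=" "):
    """Estimate P(next char | current char) from adjacent character pairs."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probs[current] = {ch: c / total for ch, c in followers.items()}
    # Make the stop character absorbing, mirroring p[0,0] = 1 in the notebook.
    probs[stop] = {stop: 1.0}
    return probs

probs = transition_probabilities("The Then ")   # toy text, not the Constitution
print(probs["e"])   # "e" is followed once by a space and once by "n"
```

Pairing `text` with `text[1:]` stops one character before the end, so the last character never needs a successor lookup.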

Now that we have the probability matrix, we determine the trajectory, i.e. the list of states the process goes through, by multiplying probability vectors by $Q$. We set our initial probability vector by choosing the index of a random capital letter. That index is the first value in the trajectory. To move through the Markov process, we need the probabilities of the possible next letters. Mathematically, these are found by multiplying the current probability vector (as a row vector) by the matrix. In this case, since we know our starting point, the probability vector is $0$ everywhere except at the index of the first letter, where it is $1$. Multiplying that vector by $Q$ gives exactly the row of $Q$ with that index, so the code simply takes that row as an array.
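
The claim that a one-hot vector times the matrix picks out a single row can be checked on a small example. The 3-state matrix below is made up for illustration, and `vector_times_matrix` is a hypothetical helper written with plain lists.

```python
# Toy 3-state stochastic matrix (values are made up for illustration).
Q = [
    [1.0, 0.0, 0.0],   # state 0 is absorbing, like the space character
    [0.2, 0.5, 0.3],
    [0.6, 0.1, 0.3],
]

def vector_times_matrix(v, M):
    """Multiply the row vector v by the matrix M, using plain lists."""
    return [sum(v[i] * M[i][j] for i in range(len(M))) for j in range(len(M[0]))]

one_hot = [0.0, 1.0, 0.0]   # the chain is in state 1 with certainty
print(vector_times_matrix(one_hot, Q) == Q[1])
```

This is why indexing a row is enough whenever the current state is known exactly.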

The random.choice function then randomly chooses an integer from $0$ to $n-1$, where $n$ is the number of unique characters, so essentially the random.choice function is choosing an index. The choice is random, but the probability of each index being chosen is given by the row of $Q$ that was just extracted. This index is then added to the trajectory, and the process repeats, with the new character index acting as the current state. The new probability vector is again $0$ everywhere except for a $1$ at the index of the current state, so we can again simply extract that row of the probability matrix. A while loop repeats this step until the new character is a space, i.e. it has index 0, after which the loop ends, signifying the completion of a word. 

Lastly, the characters corresponding to each of the indices contained in the trajectory list are added in order to an empty array, and then those characters are joined together to form a string. That string is added to an array of words, and this process repeats 100 times to form 100 words.
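
As a compact illustration of the walk just described, the sketch below samples one word from a small hand-made transition table. The names `sample_word` and `toy`, the dictionary layout, and the `max_len` safety cap are assumptions for this example and are not in the notebook, which uses numpy arrays and indices instead.

```python
import random

# Hypothetical helper, not part of the original notebook.
def sample_word(probs, start, stop=" ", rng=random, max_len=50):
    """Walk the chain from start until stop is drawn (or max_len is reached)."""
    word = [start]
    while word[-1] != stop and len(word) < max_len:
        row = probs[word[-1]]   # distribution over next characters
        nxt = rng.choices(list(row), weights=list(row.values()))[0]
        word.append(nxt)
    return "".join(word)

# Tiny hand-made transition table, for illustration only.
toy = {
    "T": {"h": 1.0},
    "h": {"e": 1.0},
    "e": {" ": 0.5, "n": 0.5},
    "n": {" ": 1.0},
}
print(sample_word(toy, "T", rng=random.Random(0)))
```

With this table the only possible results are "The " and "Then ", each ending in the stop character, just as every word in the notebook's output ends in a space.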

In [0]:
wordlist = []
w = 100
j = 0
while j < w:
  #Starts the word at a random capital letter index; randint(12,36) draws from
  #12 to 35, the capital letter indices in the sorted unique list used here
  x = np.random.randint(12,36)
  trajectory = [x]
  nextLet = [x]
  #Randomly chooses the next letter index, based on the probability of the
  #previous letter, until it reaches the "space" index, and adds it to an array
  while nextLet[0] != 0:
    start = p[nextLet[0],:]
    nextLet = np.random.choice(n, 1, p=start)
    trajectory.append(nextLet[0])
  f = []
  #Takes the indices that were chosen and turns them into their corresponding 
  #letters, and then combines those letters into a string 
  for x in trajectory:
    f.append(unique[x])
  str1 = ''.join(f)
  wordlist.append(str1)
  j = j + 1
wordlist


These are the 100 words my function produced.

['Go ', 'Grgar ', 'Gisonraks ', 'Atusis ', 'Kenirtallespo ', 'Mallecie ', 'Tomitatorsore ', 'Unciebmis ', 'Fiteng ', 'Hatatingretallay ', 'Ondmeateviony ', 'Foiesthajon ', 'Coratotityirseppatin ', 'By ', 'Me ', 'Onticters ', 'He ', 'Mirechr ', 'She ', 'Norctss ', 'Ma ', 'Novenhe ', 'Ges ', 'Quiaslall ', 'Qurersigrt ', 'Butand ', 'For ', 'Har ', 'Fir ', 'Inkeid ', 'Unsd ', 'Co ', 'Yed ', 'Qusamor ', 'Frt ', 'Rel ', 'Lerus ', 'Lecus ', 'Un ', 'Pr ', 'May ', 'Juced ', 'Jucerim ', 'Hal ', 'Yen ', 'Nores ', 'Higroume ', 'Rentidy ', 'Thames ', 'Mat ', 'Ke ', 'Wited ', 'Ye ', 'No ', 'Male ', 'Ald ', 'Coriors ', 'Ans ', 'By ', 'Literiore ', 'It ', 'Putindsioncke ', 'Viowictibe ', 'Than ', 'Iteawer ', 'Hongsanenay ', 'Fot ', 'Kiche ', 'Thepouse ', 'Carar ', 'Prwes ', 'Th ', 'Und ', 'Pred ', 'Fithe ', 'Ans ', 'Be ', 'Jud ', 'Vat ', 'Gome ', 'Grorul ', 'Ing ', 'Hatecor ', 'On ', 'Ins ', 'Jucivequre ', 'Nomes ', 'Unthootississpry ', 'Re ', 'Matiginal ', 'Toy ', 'Rer ', 'Nomicepid ', 'Comer ',
 'Surtidin ', 'No ', 'Cos ', 'Ye ', 'Bert ', 'Ex ']

To determine whether they were actual words, I checked them against the Constitution text file. If a word did not appear in the Constitution text file, I did not treat it as an actual word. The code runs a for loop over every "hallucinated" word. It checks whether the word occurs within the Constitution string, and if it does, it adds that word to an array and counts it. Note that this is a substring test, so a short sequence such as 'Th ' is counted if those characters appear anywhere in the string, for example at the end of a longer token. By this test, 14 of the 100 hallucinated words were actual words (counting repeats), and they are as follows. 
 
['By ', 'He ', 'For ', 'May ', 'No ', 'Male ', 'By ', 'It ', 'Than ', 'Th ', 'Be ', 'On ', 'No ', 'Ex ']

In [0]:
amount = 0
actualWords = []
#Check the words in the set of words to see if they are in the Constitution string
for x in wordlist:
  if x in NoPunc:
    amount = amount + 1
    actualWords.append(x)
amount, actualWords

(14,
 ['By ',
  'He ',
  'For ',
  'May ',
  'No ',
  'Male ',
  'By ',
  'It ',
  'Than ',
  'Th ',
  'Be ',
  'On ',
  'No ',
  'Ex '])
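
Because `x in NoPunc` tests for a substring, a stricter alternative is to compare each candidate against the set of whole tokens. The sketch below is one possible way to do that, not what the notebook does; `real_words` and the toy text are hypothetical and chosen only to show the difference.

```python
# Hypothetical helper, not part of the original notebook.
def real_words(candidates, text):
    """Keep candidates that match a whole whitespace-delimited token of text."""
    vocabulary = set(text.split())
    return [w for w in candidates if w.strip() in vocabulary]

toy_text = "Amendment 16Th Ratified By The States"   # made-up title-cased sample
print("Th " in toy_text)                              # substring test accepts "Th "
print(real_words(["By ", "Th ", "Sta "], toy_text))   # whole-token test does not
```

On the real data this stricter test could only lower the count, since every whole-token match is also a substring match.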

# Part 2

Because this section requires the same information, i.e. just letters and spaces, and no punctuation, the initial strings and unique character arrays are the same in this part as in Part 1.

We still need to justify treating the creation of a word using two letter prediction as a Markov process. 
As a reminder, a process is **Markov** if 
$$
P[X_{k+1} = x_{k+1} \;|\; X_k = x_k, \ldots, X_0 = x_0 ] = P[X_{k+1} = x_{k+1} \;|\; X_k = x_k ],
$$
meaning that the probability that the next state takes some value depends only on the current state. Here we treat the current state as having two components, the two preceding letters. Those two components still act together as one state, which is used to predict what the next letter should be, and thus our predictions still follow a Markov process. 

Each probability now depends on 3 indices, so the probabilities must be stored in a tensor, which can be understood as a stack of matrices. The probability that the next character will be $k$, given a current state with components $i$ and $j$ ($i$ being the first character and $j$ being the second), is stored in $Q_{i,j,k}$, where $i$ is the matrix index, $j$ is the row in that matrix, and $k$ is the column in that row. 
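
Written as a formula, and as a sketch rather than something taken from the code, let $N_{i,j,k}$ (a symbol introduced only for this note) be the number of times the characters $i$, $j$, $k$ appear consecutively in the text. The empirical estimate is then
$$
Q_{i,j,k} = \frac{N_{i,j,k}}{\sum_{l=1}^{n} N_{i,j,l}},
$$
which is defined whenever the pair $i,j$ is followed by at least one character somewhere in the text.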

Given that we start with a capital letter, the "space" character must still be absorbing, i.e. followed by another space with probability 1. Otherwise, if the second character is a space and the third character is not a space, a new word is started, which defeats the purpose of creating a single word. Thus, for any first letter, if the second character is a space, the third character is a space with probability 1, which is summarized by $p2[:,0,0] = 1$.

This code begins similarly to the code in Part 1: the second characters are stored in an array, and the Counter function produces a dictionary that gives the number of occurrences of each second character. However, it then goes a level deeper. For each second character, the code creates an array of the indices where that character directly follows the first character, and an array of the indices of the character that directly follows it, the third character. The indices in that third letter array are then converted into their corresponding characters, and those third characters are tallied in another Counter to determine how many times each one occurs. The index of each third character in the unique characters array is then placed in another array, and another while loop stores each probability in the tensor, in the matrix for the current first letter and the row for the current second letter. The probability is the number of occurrences of that particular third character divided by the total number of third characters for that particular pair of first and second characters. 

Using this methodology, some rows within the matrices do not sum to 1 and are instead all zeroes. This is okay because those matrix/row pairs correspond to first and second letter pairs that do not occur in the Constitution string. Each character the generator draws comes from a sequence that was observed in the text, so it should never reach such a pair. The first and second characters "Bz", for example, will not produce a word, and thus no probability of a third character needs to be calculated, as that probability will never be used. 

In [0]:
#Creates a blank tensor
p2 = torch.zeros([n, n, n], dtype=torch.float64)
p2[:,0,0] = 1
count1 = 1
while count1 < n: 
  #Creates an index array for each occurrence of the character, and index of following
  #character; re.escape keeps a character such as "?" from being read as regex syntax
  FL = [m.start() for m in re.finditer(re.escape(unique[count1]), NoPunc)]
  SL = [x+1 for x in FL]
  chars = []
  i = 0 
  #Adds each second letter to an array, and creates a dictionary of the amount of 
  #iterations for each of those letters
  while i < len(SL):
    let = NoPunc[SL[i]]
    chars.append(let)
    i = i + 1
  instances = Counter(chars)
  keys = list(instances.keys())
  keys.sort()
  total = len(chars)
  count2 = 0
  #For each second character, collects the indices where it directly follows the
  #first character, and the indices of the third character that follows it
  while count2 < len(keys):
    z = keys[count2]
    z1 = unique.index(z)
    #Keeps only the second-letter positions that actually hold the character z
    SL3 = [x for x in SL if NoPunc[x] == z]
    TL = [x+1 for x in SL3]
    chars2 = []
    j = 0 
    #Adds each third letter to an array, and creates a dictionary of the amount of 
    #iterations for each of those letters
    while j < len(TL):
      let = NoPunc[TL[j]]
      chars2.append(let)
      j = j + 1
    total2 = len(chars2)
    instances2 = Counter(chars2)
    keys2 = list(instances2.keys())
    keys2.sort()
    keyindex2 = []
    #Creates a list of which letter each key corresponds to, which acts as the columns
    #for the probability matrix
    for x in keys2:
      if x in unique:
        keyindex2.append(unique.index(x))
    d = 0
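    #Places the probability of each letter at the right index in the p2 tensor
    while d < len(keyindex2):
      p2[count1, z1, keyindex2[d]] = instances2[keys2[d]]/total2
      d = d+1  
    count2 = count2 + 1
  count1 = count1 + 1
#The loops above also write into the rows where the second character is a space,
#so those rows are reset here to keep the space absorbing
p2[:,0,:] = 0
p2[:,0,0] = 1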
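
The nested loops can also be viewed as counting consecutive triples of characters. The following is a minimal sketch under that view, with a dictionary keyed by character pairs standing in for the tensor. The name `pair_transition_probabilities` and the toy string are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical helper, not part of the original notebook.
def pair_transition_probabilities(text):
    """Estimate P(third char | first char, second char) from adjacent triples."""
    counts = defaultdict(Counter)
    for first, second, third in zip(text, text[1:], text[2:]):
        counts[(first, second)][third] += 1
    return {
        pair: {ch: c / sum(followers.values()) for ch, c in followers.items()}
        for pair, followers in counts.items()
    }

probs2 = pair_transition_probabilities("The Then ")   # toy text
print(probs2[("h", "e")])   # "he" is followed once by a space and once by "n"
```

Pairs that never occur in the text are simply missing from the dictionary, which corresponds to the all-zero rows of the tensor.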

The code for creating a word starts very similarly to the code in Part 1: an index for a random capital letter is chosen, and the code then uses the one letter prediction model to find the index of the second letter of the word. Both indices are appended to the trajectory array, and they are also stored in a two-element holding array that represents the current state. Using the same methodology as before, selecting the correct row to find the probabilities given the current state, the code pulls the row of the probability tensor for the given first and second letter indices. The code then moves the second letter index into the first position and randomly chooses a new second letter index according to the probabilities in that row. The new second character index is added to the trajectory as a third character, and the while loop continues this process until the new second character index is the space index, signaling the end of a word. The collected indices are then converted to their character counterparts, and those characters are joined together to form a string. That string is then added to an array of words. An outer while loop performs this process 100 times for 100 words. 
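
The sliding two-character state can be illustrated on toy tables. This sketch assumes dictionary-based tables and a hypothetical helper `sample_word_order2` with a `max_len` safety cap, none of which appear in the notebook.

```python
import random

# Hypothetical helper, not part of the original notebook.
def sample_word_order2(first_probs, pair_probs, start, stop=" ", rng=random, max_len=50):
    """Draw the second character from the one-letter table, the rest from pairs."""
    row = first_probs[start]
    word = [start, rng.choices(list(row), weights=list(row.values()))[0]]
    while word[-1] != stop and len(word) < max_len:
        # The current state is the last two characters of the word so far.
        row = pair_probs[(word[-2], word[-1])]
        word.append(rng.choices(list(row), weights=list(row.values()))[0])
    return "".join(word)

# Hand-made tables, for illustration only.
first = {"T": {"h": 1.0}}
pairs = {
    ("T", "h"): {"e": 1.0},
    ("h", "e"): {" ": 0.5, "n": 0.5},
    ("e", "n"): {" ": 1.0},
}
print(sample_word_order2(first, pairs, "T", rng=random.Random(0)))
```

Reading the state from the last two characters of the word plays the same role as shifting the second index into the first slot of the holding array.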

In [0]:
wordlist2 = []
w = 100
j = 0
while j < w:
  #Chooses a random capital index, and chooses a second letter index randomly 
  #based on the probabilities in part 1
  x = np.random.randint(12,36)
  trajectory2 = [x]
  start = p[x,:]
  nextLet = np.random.choice(n, 1, p=start)
  trajectory2.append(nextLet[0])
  nextLet2 = [trajectory2[0], trajectory2[1]]
  #Randomly chooses the next letter index, based on the probability of the
  #previous 2 letter indices, until it reaches the "space" index, and adds it to
  #the array with the previous indices. 
  while nextLet2[1] != 0:
    start1 = p2[nextLet2[0],nextLet2[1],:]
    #Flattens the selected row to length n rather than a hard-coded 62
    start2 = start1.numpy().reshape(n)
    nextLet2[0] = nextLet2[1]
    nextLet2[1] = np.random.choice(n, 1, p=start2)
    trajectory2.append(nextLet2[1][0])
  f = []
  #Takes the indices that were chosen and turns them into their corresponding 
  #letters, and then combines those letters into a string 
  for x in trajectory2:
    f.append(unique[x])
  str2 = ''.join(f)
  wordlist2.append(str2)
  j = j + 1
wordlist2


The 100 words the 2-letter prediction code "hallucinated" were: 

['Chost ',
 'Year ',
 'Of ',
 'Live ',
 'Reprem ',
 'No ',
 'Con ',
 'Hisding ',
 'Price ',
 'Legis ',
 'Geof ',
 'Namentator ',
 'Relicurtiont ',
 'For ',
 'Boted ',
 'Severes ',
 'House ',
 'Presingres ',
 'Inhall ',
 'Comminaturnatess ',
 'Offer ',
 'Ove ',
 'Grall ',
 'The ',
 'Whold ',
 'Sam ',
 'Havess ',
 'Exes ',
 'Yeasese ',
 'Holl ',
 'Gresent ',
 'Jurnme ',
 'Cent ',
 'Govid ',
 'Prece ',
 'Ball ',
 'Judent ',
 'Yeasen ',
 'Jent ',
 'Coinside ',
 'If ',
 'Vice ',
 'The ',
 'Nothe ',
 'Have ',
 'Hels ',
 'Menstiestiongst ',
 'Fority ',
 'Higisions ',
 'The ',
 'For ',
 'Decto ',
 'Lainey ',
 'Greetes ',
 'He ',
 'A ',
 'Repres ',
 'Mores ',
 'Legaing ',
 'They ',
 'Kings ',
 'Law ',
 'On ',
 'Forson ',
 'Jureentice ',
 'Lains ',
 'Res ',
 'Yeartion ',
 'Force ',
 'Jus ',
 'Keent ',
 'Make ',
 'Yeary ',
 'Thall ',
 'The ',
 'Shate ',
 'Yeased ',
 'Cited ',
 'Keeturnme ',
 'On ',
 'Viclediniall ',
 'Geof ',
 'Of ',
 'Quor ',
 'The ',
 'Wity ',
 'Kinate ',
 'Juns ',
 'Whoose ',
 'Kin ',
 'Cit ',
 'Hous ',
 'Leguls ',
 'Wholver ',
 'Ques ',
 'Jount ',
 'Quall ',
 'Gil ',
 'Unlesisers ',
 'Wribines ']

In the same way as in Part 1, I search for each word in the Constitution string, and add the word to an array and count it if it is found. The 2-letter prediction produced the following 22 actual words (counting repeats), compared with 14 for the 1-letter prediction, so in this run 2-letter prediction was better at creating real words than 1-letter prediction.

  ['Year ',
  'Of ',
  'No ',
  'For ',
  'House ',
  'The ',
  'If ',
  'Vice ',
  'The ',
  'Have ',
  'The ',
  'For ',
  'He ',
  'A ',
  'They ',
  'Law ',
  'On ',
  'Make ',
  'The ',
  'On ',
  'Of ',
  'The ']

In [13]:
amount2 = 0
actualWords2 = []
#Check the words in the set of words to see if they are in the Constitution string
for x in wordlist2:
  if x in NoPunc:
    amount2 = amount2 + 1
    actualWords2.append(x)
amount2, actualWords2

(22,
 ['Year ',
  'Of ',
  'No ',
  'For ',
  'House ',
  'The ',
  'If ',
  'Vice ',
  'The ',
  'Have ',
  'The ',
  'For ',
  'He ',
  'A ',
  'They ',
  'Law ',
  'On ',
  'Make ',
  'The ',
  'On ',
  'Of ',
  'The '])

# Part 3

Because the goal is to "hallucinate" sentences instead of words, and punctuation counts as words, the Constitution string must be manipulated in a different way. Instead of getting rid of all the punctuation, the code below replaces each punctuation mark with the same mark surrounded by spaces. This is done because the split function separates the string into a list of substrings, splitting wherever there is whitespace between them. To create a list of the unique words in the file, the list of all the words in the document is converted into a set, which automatically gets rid of all duplicates. This set is then converted back into a list so it can be indexed. For ease of use, the unique words are sorted alphabetically (with punctuation coming first), and the total number of unique words is found by taking the length of the unique words list. 

In [0]:
#Makes sure punctuation is treated as a word when the document is split by spaces
PuncSep = ConLine.replace(".", " . ")
PuncSep = PuncSep.replace(",", " , ")
PuncSep = PuncSep.replace(";", " ; ")
PuncSep = PuncSep.replace(":", " : ")
PuncSep = PuncSep.replace(")", " ) ")
PuncSep = PuncSep.replace("(", " ( ")
PuncSep = PuncSep.replace('"', ' " ')
PuncSep = PuncSep.replace('-', ' - ')
#Puts all the words into an array, and then makes set of 1 iteration of each word
allWords = PuncSep.split()
UW = list(set(allWords))
UW.sort()
num = len(UW)
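
A regular expression can do the same kind of separation in one step. This is a hedged alternative, not the notebook's approach: `tokenize` is a hypothetical helper, and unlike the replace calls above it splits off every punctuation mark (including apostrophes), not only the listed ones.

```python
import re

# Hypothetical helper, not part of the original notebook.
def tokenize(text):
    """Split text into words and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither of those nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("We The People, In Order To Form A More Perfect Union.")
vocabulary = sorted(set(tokens))   # plays the role of UW
print(tokens[:4])
print(vocabulary[:2])              # punctuation sorts before letters
```

Sorting the set puts punctuation ahead of the letters, which matches the ordering described above for the unique words list.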

In terms of creating the probability matrix, this code is very similar to the code in Part 1. A blank probability matrix is created whose rows and columns correspond to the indices of the unique words list. Then a loop searches the list of all words for each occurrence of a word and adds those indices to an array, along with the index of the word that follows each occurrence. The words at those second word indices are then put into an array. The only exception is the word/punctuation ".", which is the last "word" in the document and thus has no second word following its final occurrence. In this case, the word following every "." except the last "." is added to the array. The Counter function then creates a dictionary recording how many times each second word occurs after the first word. The index of each second word is determined by using the index function on the unique words list, and these indices are added to an index array. Lastly, the code goes through each of those indices and places the probability for that word at that index, in the row corresponding to the original first word. The probability is the number of occurrences of the second word divided by the total number of second words. 

In [0]:
p3 = np.zeros((num,num))
count3 = 0
while count3 < num: 
  FW = [i for i, e in enumerate(allWords) if e == UW[count3]]
  #Drops any index that would run past the end of the document, which only
  #happens for the final "." since it has no word after it
  SW = [x+1 for x in FW if x+1 < len(allWords)]
  nextWords = []
  i = 0 
  #Adds each second word to an array, and creates a dictionary of the amount of 
  #occurrences for each of those words
  while i < len(SW):
    w = allWords[SW[i]]
    nextWords.append(w)
    i = i + 1
  total3 = len(nextWords)
  instances3 = Counter(nextWords)
  keys3 = list(instances3.keys())
  keyindex3 = []
  #Creates a list of which word each key corresponds to, which acts as the columns
  #for the probability matrix
  for x in keys3:
    if x in UW:
      keyindex3.append(UW.index(x))
  d = 0
  #Places the probability of each word at the right index in the p3 matrix
  while d < len(keyindex3):
    p3[count3,keyindex3[d]] = instances3[keys3[d]]/total3
    d = d+1
  count3 = count3 + 1
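
At the word level the same table can be built by pairing each token with its successor. The sketch below is illustrative only: `word_transitions` and the toy token list are assumptions, and rows are stored as dictionaries instead of matrix rows.

```python
from collections import Counter, defaultdict

# Hypothetical helper, not part of the original notebook.
def word_transitions(tokens):
    """Estimate P(next token | current token) from adjacent token pairs."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return {
        word: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for word, followers in counts.items()
    }

toy_tokens = "The President Shall . The Senate Shall .".split()   # made-up sample
table = word_transitions(toy_tokens)
print(table["The"])
print(table["."])   # only the first "." has a successor
```

Pairing the list with its own tail skips the final token automatically, so the last "." needs no special case here.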

The methodology for constructing a sentence also bears a lot of similarity to the methodology in Part 1. The unique words index of a random word (not punctuation or a number) is chosen to start the sentence. I originally had each sentence start with a word chosen according to the probability of which words were most likely to start a sentence (i.e. follow a period), but for some reason that tended to generate very long or very short sentences. The random word index is then added to the trajectory, and the row of the probability matrix corresponding to that word index is selected. A second word index is then randomly chosen according to the probabilities in that row. The second word index becomes the new current state, and the process repeats until the new word generated is a period, which signals the end of a sentence. Those word indices are then converted into their actual word counterparts in the unique words list and added to a new array. The substrings in that new array are then joined into one big string, with a space between each pair of substrings. That larger string is then added to an array of sentences. The while loop that surrounds this code makes sure that the process is performed 5 times for 5 sentences. 

The sentences the code produced were: 

**['Disapproved By The Vice President , Expel A Smaller Number Of Any State , He Shall Be The Government Of The Removal Of The Authority Over Those Of The Authority Of Another : But If No Capitation , Except In The Right To Choose Their Journal .', \\
 'Persons Voted For The Vice President .', \\
 'Terms Of Two Or Affirmation , Open Court , Are Eighteen Years A Resident Within The Common Defence .', \\
 'Seventeenth Day To Raise And Inferior Courts , Determines By The Seat Of President Pro Tempore Of A President Pro Tempore , Punish Its Jurisdiction Thereof , The Vice - President And Consuls ; And Fact , Become President .', \\
 'Thereby , The Constitution Of The Executive Power To The Vice President ; And Nays Of Senators And For More Than One Supreme Court ; He Was Elected By A President Of March Next Session .']**

These sentences don't make much sense, but they aren't complete gibberish either. One issue that I noticed was that reading punctuation as a word led to run-on sentences. Additionally, a wider variety of words followed punctuation than followed normal words, so sentences made less sense: adding punctuation was essentially like pressing a restart button on the sentence. 

In [0]:
sentList = []
w = 5
j = 0
begin = p3[5,:]
while j < w:
  #Starts the sentence at a random word (non-punctuation) index; randint(38,1174)
  #draws from 38 to 1173, the range of such words in UW for this document
  x = np.random.randint(38,1174)
  trajectory3 = [x]
  nWord = [x]
  #Randomly chooses the next word index, based on the probability of the
  #previous word, until it reaches the "period" index (5), and adds it to an array
  while nWord[0] != 5:
    start3 = p3[nWord[0],:].reshape(num)
    nWord = np.random.choice(num, 1, p=start3)
    trajectory3.append(nWord[0])
  f = []
  #Takes the indices that were chosen and turns them into their corresponding 
  #words, and then combines those words into a string 
  for x in trajectory3:
    f.append(UW[x])
  str1 = ' '.join(f)
  sentList.append(str1)
  j = j + 1
sentList

['Disapproved By The Vice President , Expel A Smaller Number Of Any State , He Shall Be The Government Of The Removal Of The Authority Over Those Of The Authority Of Another : But If No Capitation , Except In The Right To Choose Their Journal .',
 'Persons Voted For The Vice President .',
 'Terms Of Two Or Affirmation , Open Court , Are Eighteen Years A Resident Within The Common Defence .',
 'Seventeenth Day To Raise And Inferior Courts , Determines By The Seat Of President Pro Tempore Of A President Pro Tempore , Punish Its Jurisdiction Thereof , The Vice - President And Consuls ; And Fact , Become President .',
 'Thereby , The Constitution Of The Executive Power To The Vice President ; And Nays Of Senators And For More Than One Supreme Court ; He Was Elected By A President Of March Next Session .']
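
For comparison, the sentence walk can be sketched on a small hand-made table. Everything here is illustrative: `sample_sentence`, the toy table, and the `max_words` cap (added so that a walk cannot run indefinitely) are assumptions and not part of the notebook.

```python
import random

# Hypothetical helper, not part of the original notebook.
def sample_sentence(table, start, end=".", rng=random, max_words=60):
    """Walk word to word from start until end is drawn (or max_words is reached)."""
    words = [start]
    while words[-1] != end and len(words) < max_words:
        row = table[words[-1]]   # distribution over the next token
        words.append(rng.choices(list(row), weights=list(row.values()))[0])
    return " ".join(words)

# Hand-made table, for illustration only.
toy_table = {
    "The": {"President": 0.5, "Senate": 0.5},
    "President": {"Shall": 1.0},
    "Senate": {"Shall": 1.0},
    "Shall": {".": 1.0},
}
print(sample_sentence(toy_table, "The", rng=random.Random(0)))
```

A length cap like `max_words` is one possible way to limit the run-on sentences noted above, at the cost of sometimes ending a sentence without a period.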