## EE 502 P: Analytical Methods for Electrical Engineering
    
# Homework 10: Review
## Due 15 December, 2019 at 11:59 PM
## Allysa Nguyen

Copyright &copy; 2019, University of Washington

<hr>

**Instructions**: Choose **<u>one</u>** of the following problems. Solve the problem and then write up your solution in a standalone Jupyter Notebook. Your notebook should have the following elements:

- Problem statement
- Mathematical description of the solution
- Executable code, commented, clear code

You will be graded on how well your notebook reads like a nicely formatted, well-written report. You must:

- Write mathematical descriptions using complete sentences, paragraphs, and LaTeX formulas. 
- Comment your code as necessary, with a description of what each function does and all major steps.
- Label plots axes, use legends, and use plot titles. 

# Hallucinating the Constitution

Consider the constitution of the United States:

> https://www.usconstitution.net/const.txt .

This document contains upper- and lower-case letters, numbers, and basic punctuation. 

**One letter prediction:**

1. Find the set of all characters used in the document. Call the number of characters $n$. 
2. Create an $n \times n$ matrix whose $i,j$ entry is the probability that the next character is $j$ given that the current character is $i$. Estimate this probability by looking at all occurrences of character $i$ in the document and the number of times character $j$ immediately follows it. 
3. Simulate this system as a Markov chain that starts with an arbitrary capital letter and continues until it gets to a space. Produce $100$ random "words" this way. How many of them are actual words? Use a [Scrabble dictionary](https://scrabble.hasbro.com/en-us/tools#dictionary) if you are not certain whether a given sequence is a word. 

**Two letter prediction:**

1. Create an $n \times n \times n$ tensor whose $i,j,k$ entry is the probability that the next character is $k$ given that the current character is $j$ and the previous character is $i$. Use the document to empirically find these probabilities. 
2. Use this model to construct random words. 

**Sentence prediction:**

Do a one word prediction, but use all the unique *words* in the document. Hallucinate sentences. Consider a punctuation mark as a word. 

**Notes:** Use `open` and `file.read` to read in the file as a string. For the sentence prediction, use `replace` to add a space before punctuation and then `split()` to turn the string into a list. Use a `DiGraph` from the `networkx` library to store the data. Note that you can make weighted edges by adding data to the edges, as in [this document](https://networkx.github.io/documentation/stable/auto_examples/drawing/plot_weighted_graph.html).

In [1]:
import scipy # Has linear algebra
import scipy.ndimage
import scipy.linalg as linalg
import numpy as np
import string
import random
import matplotlib.pyplot as plt

import networkx as nx

import torch
import torch.nn as nn
import torch.nn.functional as F

from nltk.corpus import words #Used to check for valid words

### One Letter Prediction

In [2]:
'''
Helpers that return the list of indexes where `chars` occurs in `string`.
'''
#Use this method if chars is a single character, or if string is a list of words
def findOccurrences(string, chars):
    return [i for i, letter in enumerate(string) if letter == chars]

#Use this method if chars is more than one character
def findOccurrences2(string, chars):
    indexes = []
    n = 0
    index = 0
    #Finds all indexes by restarting the search one past the previous match.
    while index != -1:
        index = string.find(chars, n)
        indexes.append(index)
        n = index + 1
    return indexes[:-1] #deletes last index which will be -1



1. Find the set of all characters used in the document. Call the number of characters $n$. 

To find the set of all unique characters, I used the Python datatype `set`, which removes duplicates from a sequence. After reading the file into a string, converting that string into a `set` gives every character used in the Constitution.


In [3]:
#open and read constitution text file
file = open("constitution.txt","r")
contents = file.read()
file.close()

# clean up text file & assume newlines are treated as spaces
contents = contents.replace("\n\n", " ")
contents = contents.replace("\n", " ")
# pad each punctuation mark with a space on both sides
contents = contents.replace(".", " . ") \
           .replace(",", " , ") \
           .replace("-", " - ") \
           .replace(";", " ; ") \
           .replace(":", " : ")

#print(contents)

#get set of all unique characters
uniqueSet = set(contents)
n = len(uniqueSet)
print("Set of characters:", uniqueSet)
print("number of characters:", n)

Set of characters: {'g', 'C', 'Q', 'W', 'q', '9', 'a', 'U', 'b', 'M', 'D', '0', 'm', 's', ',', '5', 'x', 'G', '7', 'o', 'H', 'j', 'B', 'V', 'J', '6', ':', ')', 'y', '4', '3', 'f', 'c', 'F', 'e', 'i', 'd', 'r', 'O', 'K', '-', 'I', 'z', '8', 'A', ';', 'P', '.', 'u', 'l', 'T', 'E', 'n', 'w', 'k', 'R', 'N', '"', ' ', 'v', 'S', 'h', 't', 'L', '1', '2', '(', 'p', 'Y'}
number of characters: 69


2. Create an $n \times n$ matrix whose $i,j$ entry is the probability that the next character is $j$ given that the current character is $i$. Estimate this probability by looking at all occurrences of character $i$ in the document and the number of times character $j$ immediately follows it. 

To create the desired matrix, I started with an $n \times n$ matrix of zeros and then filled in the probabilities where necessary. To keep track of the indices, I used a dictionary that gives each unique character (dictionary key) an index (dictionary value).

After setting up this mapping, I looped through each character, `char`, in the dictionary, and in each iteration I called the Python function `findOccurrences`, which scans through the Constitution and returns all the indices where that character appears. Once all the occurrences were found, I used the list of indices to get the letter following each one and stored those letters in a new list, `nextLetters`.

Finally, I iterated through the set of `nextLetters`, counted how many times each character appears in `nextLetters`, and divided that count by the total number of characters in `nextLetters`. This gives the probability that the letter comes next given that the current character is `char` in the dictionary `d`.

The output below shows the dictionary and also the probabilities for the letter after *a*.
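The estimation procedure described above can also be sketched as a single pass over consecutive character pairs. The snippet below is a minimal illustration on a toy string, not the implementation used in this notebook; the names `toy_text`, `char_index`, and `transition` are hypothetical, and sorting the characters is an assumption made only so the indices are reproducible.

```python
import numpy as np

toy_text = "abracadabra"  # hypothetical stand-in for the document text

# Map each unique character to a row/column index (sorted for reproducibility).
char_index = {c: k for k, c in enumerate(sorted(set(toy_text)))}
size = len(char_index)

# Count how often character j immediately follows character i.
counts = np.zeros((size, size))
for current, following in zip(toy_text, toy_text[1:]):
    counts[char_index[current], char_index[following]] += 1

# Normalize each row to probabilities; a row with no successors stays all zero.
row_totals = counts.sum(axis=1, keepdims=True)
transition = np.divide(counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0)

print(transition[char_index["a"]])
```

In the toy string, *a* is followed by *b* twice and by *c* and *d* once each, so the row for *a* puts probability 0.5 on *b* and 0.25 on each of *c* and *d*.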

In [4]:
#initialize an nxn matrix, this will store the given probabilities
matrix = np.zeros((n,n))

#initialize a dictionary that will map the set of characters
#to an index, which will be used to iterate the matrix
d = {}
count = 0
for i in uniqueSet:
    d[i] = count
    count += 1
print(d)


for char in d:
    nextLetters = [] 
    #print(char,".")
    #get instances where character shows up in string
    occurrences = findOccurrences(contents, char)
    for occurrence in occurrences:
        #last character in string EOL, stop iterating
        if occurrence == len(contents)-1:
            break
        #gets the following letter and adds it to list
        nextLetter = contents[occurrence + 1]
        nextLetters.append(nextLetter)
    #print(nextLetters)
    
    #Now go through all following letters and store the probability that the next letter
    #is j (letter) given i (char). Counts all instances of 'j' and divides by total
    #of all following letters
    for letter in set(nextLetters):
        matrix[d[char]][d[letter]] = nextLetters.count(letter)/len(nextLetters)

print("\n Probalilities of next characters given current is \"a\"\n", matrix[d["a"]])


{'g': 0, 'C': 1, 'Q': 2, 'W': 3, 'q': 4, '9': 5, 'a': 6, 'U': 7, 'b': 8, 'M': 9, 'D': 10, '0': 11, 'm': 12, 's': 13, ',': 14, '5': 15, 'x': 16, 'G': 17, '7': 18, 'o': 19, 'H': 20, 'j': 21, 'B': 22, 'V': 23, 'J': 24, '6': 25, ':': 26, ')': 27, 'y': 28, '4': 29, '3': 30, 'f': 31, 'c': 32, 'F': 33, 'e': 34, 'i': 35, 'd': 36, 'r': 37, 'O': 38, 'K': 39, '-': 40, 'I': 41, 'z': 42, '8': 43, 'A': 44, ';': 45, 'P': 46, '.': 47, 'u': 48, 'l': 49, 'T': 50, 'E': 51, 'n': 52, 'w': 53, 'k': 54, 'R': 55, 'N': 56, '"': 57, ' ': 58, 'v': 59, 'S': 60, 'h': 61, 't': 62, 'L': 63, '1': 64, '2': 65, '(': 66, 'p': 67, 'Y': 68}

 Probalilities of next characters given current is "a"
 [0.01097609 0.         0.         0.         0.         0.
 0.         0.         0.01607213 0.         0.         0.
 0.01920815 0.06154449 0.         0.         0.00392003 0.
 0.         0.000392   0.         0.00548804 0.         0.
 0.         0.         0.         0.         0.03057624 0.
 0.         0.00588005 0.03880831 0.

As seen from the row of the probability matrix above for the letter *a*, the letter *t* is very likely to appear next, with a probability of $18.6\%$.
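Reading the most likely successors off a row requires mapping indices back to characters. The helper below is a small sketch of that lookup; `top_successors`, `toy_index`, and `toy_row` are hypothetical names, and the probabilities are made up for illustration rather than taken from the Constitution.

```python
def top_successors(row, index_of, k=3):
    """Return the k most probable next characters as (character, probability) pairs."""
    char_at = {idx: ch for ch, idx in index_of.items()}  # invert character -> index
    ranked = sorted(range(len(row)), key=lambda idx: row[idx], reverse=True)
    return [(char_at[idx], row[idx]) for idx in ranked[:k]]

# Hypothetical three-character example, not values from the Constitution.
toy_index = {"t": 0, "n": 1, " ": 2}
toy_row = [0.5, 0.3, 0.2]
print(top_successors(toy_row, toy_index, k=2))
```

Assuming the notebook's `matrix` and `d` are in scope, a call such as `top_successors(matrix[d["a"]], d)` should list the likeliest characters after *a*.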

3. Simulate this system as a Markov chain that starts with an arbitrary capital letter and continues until it gets to a space. Produce $100$ random "words" this way. How many of them are actual words? Use a [Scrabble dictionary](https://scrabble.hasbro.com/en-us/tools#dictionary) if you are not certain whether a given sequence is a word. 

To produce $100$ random "words" starting with an arbitrary capital letter, I first used the Python libraries `string` and `random` to select a random capital letter to begin the word with. To make sure the capital letter is part of the text, I used a while loop to generate a new random capital letter until one is found that also exists in the set of characters.

Then, I used another while loop to continue choosing random letters until a space is encountered, which indicates the end of a word. To select the next character, I used `random.choices`, which takes two arguments: the list of values to choose from and the weights (probabilities) with which each value is selected. In this case the list of values is the set of characters and the probabilities are the row of the probability matrix for the current character. For instance, to get the probabilities of the next letter after *c*, you would use `matrix[d["c"]]`.

Finally, to check which words are actually real words I used `words` from `nltk.corpus`, which contains a dictionary of English words to check against. (I also noticed that it is not a complete dictionary and doesn't catch all valid words, but it does catch most of them.)

In the case that I ran I got:

`  ['Ex', 'Fory', 'Leg', 'Noby', 'Variced', 'Inerepraimend', 'Nofrtanceminachorof', 'Bive', 'Dontssicof', 'Un', 'Mit', 'Ge', 'Oavestof', 'Consprdervety', 'Con', 'Le', 'Ins', 'Imonensing', 'Renowhite', 'Sesherd', 'Kiollll', 'Nutarean', 'Steisheindintctothata', 'Hof', 'Br', 'Nel', 'Expe', 'Yed', 'Wing', 'Kittie', 'Fis', 'Allvembul', 'Mis', 'Me', 'Al', 'Repus', 'Ifondme', 'Re', 'Nof', 'Kild', 'Pre', 'Noutincuanse', 'Porof', 'Dat', 'Er', 'Vorqur', 'Pr', 'Hof', 'Oforthe', 'Lixeshe', 'Trand', 'Jony', 'St', 'Qusss', 'Unced', 'Nepo', 'Res', 'Oriofely', 'Mor', 'Bacenghesclacchar', 'Ofon', 'Unt', 'Noffor', 'Unul', 'Nofore', 'In', 'Fomatiche', 'Und', 'Wite', 'Uny', 'Prsigretict', 'Vionderthand', 'Fengr', 'Maber', 'Ama', 'Prme', 'Serolig', 'Stithid', 'Stes', 'Race', 'Gratiorencthace', 'Gonde', 'Eavestheme', 'Wif', 'Gol', 'Athermeed', 'I', 'Fothe', 'Fe', 'Gertovonsathatan', 'Ifolll', 'Five', 'Stisuaves', 'Depr', 'Fe', 'Qung', 'Nena', 'Vivicc', 'Lallecititathed', 'Que']`

Where the real words generated were:

` ['Ex', 'Leg', 'Un', 'Ge', 'Con', 'Yed', 'Wing', 'Me', 'Al', 'Re', 'Er', 'St', 'Mor', 'In', 'Wite', 'Ama', 'Race', 'Gol', 'I', 'Fe', 'Five', 'Fe']  `

Overall, the program was able to generate 22 real words out of 100 words, which is $22\%$ accuracy. It can also be seen that most of the real words are only 2 to 4 characters long.
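The sampling loop described above can be sketched on a toy chain. This is an illustration under stated assumptions rather than the notebook's code: `sample_word`, `symbols`, and `rows` are hypothetical names, the transition probabilities are invented, and the `max_len` guard is an addition so that a chain which rarely emits a space cannot run indefinitely.

```python
import random

def sample_word(start, symbols, rows, rng, max_len=30):
    """Walk the chain from `start` until a space is drawn or max_len is reached."""
    word = start
    current = start
    while len(word) < max_len:
        # rows[current] holds the probability of each symbol following `current`.
        current = rng.choices(symbols, weights=rows[current])[0]
        if current == " ":
            break
        word += current
    return word

# Hypothetical toy chain over three symbols.
symbols = ["a", "b", " "]
rows = {
    "A": [0.5, 0.5, 0.0],
    "a": [0.0, 0.5, 0.5],
    "b": [0.5, 0.0, 0.5],
}
rng = random.Random(0)  # seeded so repeated runs give the same words
toy_words = [sample_word("A", symbols, rows, rng) for _ in range(5)]
print(toy_words)
```

Stopping on the space before appending it avoids having to strip a trailing space afterwards.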

In [5]:
generatedWords = []
realWords = []

#build the NLTK word set once instead of rebuilding it for every generated word
englishWords = set(words.words())

for i in range(100):
    randLetter = random.choice(string.ascii_uppercase)
    #If random capital letter not in the set of characters, continue until so
    while randLetter not in uniqueSet:
        randLetter = random.choice(string.ascii_uppercase)
    word = randLetter
    
    while randLetter != " ":
        #random.choices takes two args, list of choices and list of probabilities
        #Use this to pick the next letter based on probabilities in the nxn matrix.
        randLetter = random.choices(list(uniqueSet), matrix[d[randLetter]])[0]
        word = word + randLetter
    
    #add generated word to list (remove space at end)
    word = word[:-1]
    generatedWords.append(word)
    
    #uses NLTK's word library to determine if generated word is a real word
    if word.lower() in englishWords:
        realWords.append(word)
    
print("All generated words:\n", generatedWords)
print("\nReal words:\n", realWords, "\n")
print(len(realWords),"real words out of", len(generatedWords))

All generated words:
 ['Wheothal', 'Jusicusst', 'Windereral', 'Quponsshe', 'Amby', 'Ime', 'De', 'Lareiothece', 'Vo', 'Lille', 'Bese', 'Busof', 'Repourng', 'Se', 'Pazin', 'Thintivensifed', 'Tatite', 'Kien', 'Hotioinl', 'Dandive', 'Nures', 'Pur', 'Rer', 'Imedustagurteses', 'Ne', 'Whedind', 'Jotedived', 'Dur', 'Couint', 'Wintiagh', 'Hourerdes', 'Obt', 'Nof', 'Efiverinthesor', 'Maknt', 'Fate', 'Me', 'Jur', 'Untaves', 'Un', 'Adgre', 'Unsuby', 'Authes', 'Vith', 'Vivorssore', 'Ne', 'Unde', 'Lapere', 'Uns', 'Red', 'Ye', 'Kiof', 'Dathat', 'Vinthal', 'Wepoudera', 'Seashes', 'Ellas', 'Thexity', 'Prshen', 'Winy', 'Stendengit', 'Che', 'Lanchaisiaure', 'Of', 'Prepr', 'Lend', 'Tont', 'Kices', 'Noff', 'Las', 'Ind', 'Thes', 'Go', 'Diesel', 'Fale', 'Thict', 'Honoren', 'Steds', 'El', 'De', 'Obe', 'Gume', 'Hof', 'Excins', 'Else', 'Durg', 'Fost', 'Hero', 'Sthe', 'Kitheisshemeny', 'Horre', 'Yesmm', 'Sanvangibeve', 'Gor', 'De', 'Lal', 'Qudend', 'Undieperthesathmbeastind', 'Victhar', 'Quchedirinore']

Real wo

### Two Letter Prediction

1. Create an $n \times n \times n$ tensor whose $i,j,k$ entry is the probability that the next character is $k$ given that the current character is $j$ and the previous character is $i$. Use the document to empirically find these probabilities. 

To create the desired $n \times n \times n$ tensor, I took a similar approach to the one-letter prediction, except that another layer was added. To account for the previous character, a second for loop, over `j`, iterates through the set of characters again. Then, as in the one-letter prediction method, the character following each two-letter combination is added to the list `nextLetters`, and the probabilities are calculated from the counts.

For instance, these are the probabilities for the letter combination *th*, where characters not listed have a probability of 0. From the table below it can be concluded that the letter *e* is very likely to be chosen after *th*.

| i    | e    | s    | r   | o    | space| f    | a    |
|------|------|------|-----|------|------|------|------|
|0.070 | 0.791| 0.006|0.013|0.032 |0.054 |0.002 |0.033 |
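The same tensor can also be estimated in one pass over consecutive character triples instead of searching the text for every pair. The sketch below is an alternative illustration on a toy string, not the approach taken in this notebook; it uses a NumPy array rather than a `torch` tensor, and `toy_text`, `char_index`, and `tensor` are hypothetical names.

```python
import numpy as np

toy_text = "the then them"  # hypothetical stand-in for the document text
char_index = {c: k for k, c in enumerate(sorted(set(toy_text)))}
size = len(char_index)

# counts[i, j, k]: number of times character k follows the pair (i, j).
counts = np.zeros((size, size, size))
for prev, cur, nxt in zip(toy_text, toy_text[1:], toy_text[2:]):
    counts[char_index[prev], char_index[cur], char_index[nxt]] += 1

# Normalize over the last axis; pairs that never occur keep an all-zero row.
totals = counts.sum(axis=2, keepdims=True)
tensor = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

print(tensor[char_index["t"], char_index["h"], char_index["e"]])
```

In the toy string every *th* is followed by *e*, so that entry is 1, while the pair *he* is followed by a space, *n*, and *m* once each.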

In [6]:
matrix3D = np.zeros((n,n,n))

tensor3D = torch.tensor(matrix3D)

print(d)

for i in d:     #first letter  "i"
    for j in d: #second letter "j"
        nextLetters = []
        #print("search for",i+j)
        #get instances where 2-letter character combination shows up in string
        occurrences = findOccurrences2(contents, i+j)
        #print("found at:",occurrences)
        
        #iterate through all occurrences of the 2-letter combo
        for occurrence in occurrences:
            #last possible character to check in string EOL, stop iterating
            if occurrence == len(contents)-2:
                break
            #gets following letter and adds to list
            nextLetter = contents[occurrence + 2]
            nextLetters.append(nextLetter)
        #print("following letter",nextLetters)
    
        for letter in set(nextLetters):
            #print(letter,"count", nextLetters.count(letter))
            #print(letter,"total", len(nextLetters))
            tensor3D[d[i]][d[j]][d[letter]] = nextLetters.count(letter)/len(nextLetters)

#print(matrix)
print("\n Probalilities of next characters given current is \"th\"\n", tensor3D[d["t"]][d["h"]])

{'g': 0, 'C': 1, 'Q': 2, 'W': 3, 'q': 4, '9': 5, 'a': 6, 'U': 7, 'b': 8, 'M': 9, 'D': 10, '0': 11, 'm': 12, 's': 13, ',': 14, '5': 15, 'x': 16, 'G': 17, '7': 18, 'o': 19, 'H': 20, 'j': 21, 'B': 22, 'V': 23, 'J': 24, '6': 25, ':': 26, ')': 27, 'y': 28, '4': 29, '3': 30, 'f': 31, 'c': 32, 'F': 33, 'e': 34, 'i': 35, 'd': 36, 'r': 37, 'O': 38, 'K': 39, '-': 40, 'I': 41, 'z': 42, '8': 43, 'A': 44, ';': 45, 'P': 46, '.': 47, 'u': 48, 'l': 49, 'T': 50, 'E': 51, 'n': 52, 'w': 53, 'k': 54, 'R': 55, 'N': 56, '"': 57, ' ': 58, 'v': 59, 'S': 60, 'h': 61, 't': 62, 'L': 63, '1': 64, '2': 65, '(': 66, 'p': 67, 'Y': 68}

 Probalilities of next characters given current is "th"
 tensor([0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0334, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0057, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0315, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
        0.0000, 0.0000, 0.0000, 0.0000, 0.0019, 0.0000, 0.0000, 0.7908, 0.0697,
        0.

2. Use this model to construct random words. 

Since the tensor only contains probabilities for 2-letter combinations, I used the 1-letter probability matrix to get the second character. The first capital letter was generated as in problem 1.

Then, I used another while loop to continue choosing random letters until a space is encountered, which indicates the end of a word. To select the next character, I used `random.choices`, which takes two arguments: the list of values to choose from and the weights (probabilities) with which each value is selected. This is also similar to the method in problem 1. In this case the list of values is still the set of characters, but the probabilities are the row of the tensor for the current character in combination with the previous letter. For instance, to get the probabilities of the next letter after *th*, you would use `tensor3D[d["t"]][d["h"]]`, as seen above.

Finally, to check which words are actually real words I used `words` from `nltk.corpus`, which contains a dictionary of English words to check against. (I also noticed that it is not a complete dictionary and doesn't catch all valid words, but it does catch most of them.)

In the case that I ran I got:

` ['Rept', 'Adjorin', 'Mand', 'Kin', 'Vict', 'The', 'Deby', 'Legisdislated', 'Nams', 'Represe', 'Unity', 'Choosess', 'Mand', 'Gove', 'Congs', 'John', 'Inatiest', 'Ado', 'John', 'Officers', 'Mem', 'Yeatimentation', 'Nam', 'Trates', 'Own', 'Amend', 'Ing', 'Sublisidelese', 'Quor', 'Unint', 'Quarly', 'The', 'Represt', 'Judend', 'Islatie', 'John', 'Ords', 'Offess', 'Actior', 'Unifed', 'Quary', 'Viclestiory', 'Fely', 'Powers', 'Trest', 'Quorseque', 'Eacand', 'Law', 'Each', 'Law', 'Stat', 'Forms', 'Kindiressident', 'Quary', 'List', 'Hour', 'Vicass', 'Gract', 'Yeased', 'Quall', 'Station', 'Con', 'The', 'For', 'Unity', 'Forusent', 'Infenition', 'Houndmessary', 'Del', 'Peal', 'Exce', 'Righom', 'Few', 'Infousartion', 'Yeasolvervin', 'Res', 'Powelfaisidecto', 'Vicestich', 'For', 'Quallot', 'Statempoin', 'Kintatemposentes', 'Unity', 'Raties', 'Judisidenate', 'Statesis', 'Con', 'Vicer', 'To', 'Represident', 'Lime', 'Unit', 'Geof', 'For', 'Years', 'Eve', 'King', 'Com', 'Numen', 'Legislas']`

Where the real words generated were:

` ['Mand', 'Kin', 'The', 'Unity', 'Mand', 'Gove', 'Ado', 'Mem', 'Nam', 'Own', 'Amend', 'Ing', 'The', 'Trest', 'Law', 'Each', 'Law', 'List', 'Hour', 'Station', 'Con', 'The', 'For', 'Unity', 'Peal', 'Few', 'For', 'Unity', 'Con', 'To', 'Lime', 'Unit', 'For', 'Eve', 'King', 'Numen'] `

Overall, the program was able to generate 36 real words out of 100 words, which is better than the 22 produced by the 1-letter predictions. The generated words also tend to be longer than those from the 1-letter prediction method, and the real words are longer as well, with the longest being *station*. Even the generated words that aren't true English words look closer to English based on their letter patterns.
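The real-word count reported above is a membership test against a vocabulary followed by a ratio. The sketch below shows that arithmetic on a tiny example; `real_word_fraction` is a hypothetical helper, and the three-word `vocabulary` is an assumption standing in for the NLTK word list used in this notebook.

```python
def real_word_fraction(generated, vocabulary):
    """Return the real words and their share of all generated words (case-insensitive)."""
    real = [w for w in generated if w.lower() in vocabulary]
    return real, len(real) / len(generated)

# Hypothetical mini vocabulary; the notebook checks against NLTK's word list instead.
vocabulary = {"the", "law", "unity"}
generated = ["The", "Rept", "Law", "Quary"]
real, fraction = real_word_fraction(generated, vocabulary)
print(real, fraction)
```

Building the vocabulary as a `set` once and reusing it keeps each lookup cheap, which matters when the word list is large.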


In [7]:
generatedWords = []
realWords = []

#build the NLTK word set once instead of rebuilding it for every generated word
englishWords = set(words.words())

for i in range(100):        
    randLetter = random.choice(string.ascii_uppercase)
    #If random capital letter not in the set of characters, continue until so
    while randLetter not in uniqueSet:
        randLetter = random.choice(string.ascii_uppercase)
    word = randLetter
    #second letter comes from the 1-letter matrix; storing it in randLetter lets
    #the loop below stop right away if that letter is already a space
    randLetter = random.choices(list(uniqueSet), matrix[d[randLetter]])[0]
    word = word + randLetter

    while randLetter != " ":
        #random.choices takes two args, list of choices and list of probabilities
        prevLetter = d[word[-2]]
        currentLetter = d[word[-1]]
        #tolist() turns the tensor row into plain floats for random.choices
        randLetter = random.choices(list(uniqueSet), tensor3D[prevLetter][currentLetter].tolist())[0]
        word = word + randLetter
        
    #add generated word to list (remove space at end)
    word = word[:-1]
    generatedWords.append(word)
    
    #uses NLTK's word library to determine if generated word is a real word
    if word.lower() in englishWords:
        realWords.append(word)
    
print("All generated words:\n", generatedWords)
print("\nReal words:\n", realWords, "\n")
print(len(realWords),"real words out of", len(generatedWords))

All generated words:
 ['Graw', 'Gunto', 'Roat', 'Major', 'Preirds', 'Sufarly', 'Geopertiffermin', 'Excle', 'Kingres', 'Grall', 'Repropres', 'Act', 'Memost', 'Officlates', 'Wilite', 'Yeass', 'Gor', 'Con', 'Posing', 'Numbeend', 'Cits', 'Hourisinumbe', 'Wria', 'Quall', 'Quall', 'Yeass', 'Cons', 'Geof', 'Yeasesidgmes', 'Elecriteleent', 'Preme', 'Amendirminse', 'Yeard', 'Quall', 'Cervident', 'Atted', 'Jr', 'Forand', 'New', 'Govided', 'Kin', 'Ambe', 'Quall', 'Dept', 'For', 'Geoprebe', 'Reptut', 'The', 'Law', 'Numbe', 'Law', 'Quall', 'Nation', 'Vothe', 'Jousint', 'Yorstivers', 'Leginsevident', 'Ind', 'Govilistice', 'Staxese', 'Day', 'Inhal', 'Quor', 'Vice', 'Yor', 'Year', 'Active', 'Numbe', 'Mong', 'Whe', 'The', 'Offic', 'But', 'Yeactioneres', 'Send', 'We', 'Own', 'Impeenablight', 'Kinge', 'Deprohisce', 'Active', 'Yeas', 'Offind', 'Viced', 'Thicle', 'Units', 'Vot', 'Mand', 'Grany', 'Will', 'Houses', 'Immon', 'Invidgeties', 'Opir', 'Fracte', 'New', 'Uningrelved', 'Objectoblegive', 'Lislaws', '

### Sentence Prediction

Do a one word prediction, but use all the unique *words* in the document. Hallucinate sentences. Consider a punctuation mark as a word. 

To generate sentences, the Constitution text file was re-read into a string and then split into a list in which each element is a word or a punctuation mark.
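The tokenization step can be sketched as a small function. This is a minimal illustration, with `tokenize` as a hypothetical helper name; the default punctuation set mirrors the five marks that this notebook pads with spaces.

```python
def tokenize(text, punctuation=".,-;:"):
    """Split text into words, treating each listed punctuation mark as its own token."""
    for mark in punctuation:
        text = text.replace(mark, f" {mark} ")
    return text.split()

tokens = tokenize("We the People, in Order to form a more perfect Union.")
print(tokens)
```

Because `split()` with no arguments splits on any run of whitespace, newlines in the file need no separate handling here.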

In [8]:
'''
This function returns an array of probabilities for a given word 
by getting the weights on edges
G = directed graph
word = string
words = set of words
'''
def getProbs(G, word, words):
    probsList = []
    for i in words:
        try:
            #gets weight of edge if it exists
            probsList.append(G[word][i]['weight'])
        except KeyError:
            #else, no probability exist for the next word
            probsList.append(0)
    return probsList

# print(getProbs(G1, "I", list(set(data))))
# print(getProbs(G1, "I", set(data)))

In [9]:
file = open('constitution.txt',mode='r') # file data
text = file.read() # save the data as a string
file.close()

print(set(text))
# pad punctuation with spaces so each mark becomes its own token, then split on whitespace
data = text.replace(".", " . ") \
           .replace(",", " , ") \
           .replace("-", " - ") \
           .replace(";", " ; ") \
           .replace(":", " : ").split()

data

{'g', 'C', 'Q', 'W', 'q', '9', 'a', 'U', 'b', 'M', 'D', '0', 'm', 's', ',', '\n', '5', 'x', 'G', '7', 'o', 'H', 'j', 'B', 'V', 'J', '6', ':', ')', 'y', '4', '3', 'f', 'c', 'F', 'e', 'i', 'd', 'r', 'O', 'K', '-', 'I', 'z', '8', 'A', ';', 'P', '.', 'u', 'l', 'T', 'E', 'n', 'w', 'k', 'R', 'N', '"', ' ', 'v', 'S', 'h', 't', 'L', '1', '2', '(', 'p', 'Y'}


['We',
 'the',
 'People',
 'of',
 'the',
 'United',
 'States',
 ',',
 'in',
 'Order',
 'to',
 'form',
 'a',
 'more',
 'perfect',
 'Union',
 ',',
 'establish',
 'Justice',
 ',',
 'insure',
 'domestic',
 'Tranquility',
 ',',
 'provide',
 'for',
 'the',
 'common',
 'defence',
 ',',
 'promote',
 'the',
 'general',
 'Welfare',
 ',',
 'and',
 'secure',
 'the',
 'Blessings',
 'of',
 'Liberty',
 'to',
 'ourselves',
 'and',
 'our',
 'Posterity',
 ',',
 'do',
 'ordain',
 'and',
 'establish',
 'this',
 'Constitution',
 'for',
 'the',
 'United',
 'States',
 'of',
 'America',
 '.',
 'Article',
 '1',
 '.',
 'Section',
 '1',
 'All',
 'legislative',
 'Powers',
 'herein',
 'granted',
 'shall',
 'be',
 'vested',
 'in',
 'a',
 'Congress',
 'of',
 'the',
 'United',
 'States',
 ',',
 'which',
 'shall',
 'consist',
 'of',
 'a',
 'Senate',
 'and',
 'House',
 'of',
 'Representatives',
 '.',
 'Section',
 '2',
 'The',
 'House',
 'of',
 'Representatives',
 'shall',
 'be',
 'composed',
 'of',
 'Members',
 'chosen

To create the directed graph, a new dictionary was created to map each unique word to an index. I then iterated through each word `i` in the dictionary `d` and, as in the letter predictions, collected all of the words that follow `i` into a list. Then, to create the graph, an edge was added from `i` to each word found following `i`, with the weight of the edge set to the estimated probability of that word following `i`.
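The weighted-edge idea and a walk over it can be sketched without any dependencies. In the illustration below, plain nested dictionaries stand in for the `DiGraph`, so `graph[a][b]` plays the role that `G1[a][b]['weight']` plays in this notebook. The names `build_successors` and `sample_sentence` are hypothetical, the token list is just the opening of the preamble, and stopping at a period or after `max_tokens` tokens is an assumption about when a hallucinated sentence should end.

```python
import random
from collections import Counter, defaultdict

def build_successors(tokens):
    """Map each token to {next token: probability}, like weighted edges of a digraph."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return {
        token: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for token, followers in counts.items()
    }

def sample_sentence(successors, start, rng, max_tokens=25):
    """Follow weighted edges from `start` until a period or a dead end is reached."""
    sentence = [start]
    while sentence[-1] != "." and len(sentence) < max_tokens:
        options = successors.get(sentence[-1])
        if not options:  # token only appears at the very end of the text
            break
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        sentence.append(nxt)
    return " ".join(sentence)

tokens = "We the People of the United States , in Order to form a more perfect Union .".split()
graph = build_successors(tokens)
print(graph["the"])
sentence = sample_sentence(graph, "We", random.Random(1))
print(sentence)
```

In this toy text *the* is followed once by *People* and once by *United*, so each edge out of *the* carries weight 0.5.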

In [10]:
G1 = nx.DiGraph()
totalChars = len(contents)

#create dictionary to map each unique word to an index
d = {}
count = 0
for i in set(data):
    d[i] = count
    count += 1
    
print(d)

n = len(d)
matrix = np.zeros((n,n))

for i in d:
    nextWords = []
    #get the positions where word i shows up in the list of words
    print(i,".")
    occurrences = findOccurrences(data, i)
    #iterate through all occurrences of the word
    for occurrence in occurrences:
        #last word in the list has no following word, stop iterating
        if occurrence == len(data)-1:
            break
        nextWord = data[occurrence + 1]
        nextWords.append(nextWord)
    print(nextWords)
    
    for word in set(nextWords):
        #print(letter,"count", nextLetters.count(letter))
        #print(letter,"total", len(nextLetters))
        #matrix[d[i]][d[word]] = nextWords.count(word)/len(nextWords)
        G1.add_edge(i, word, weight=nextWords.count(word)/len(nextWords))


#basic_digraph,ax = plt.subplots(1,1)
#nx.draw(G1,ax=ax,pos=nx.kamada_kawai_layout(G1),with_labels=True, node_color='#444444',font_color="red")
#edge_labels=nx.draw_networkx_edge_labels(G1,pos=nx.kamada_kawai_layout(G1))

{'New': 0, '20': 1, 'person': 2, 'Office': 3, 'nature': 4, 'concur': 5, 'Affirmation': 6, 'prohibited': 7, 'employed': 8, 'hereunto': 9, 'prescribe': 10, '19': 11, 'Measures': 12, 'Sections': 13, 'supported': 14, 'disability': 15, 'published': 16, 'land': 17, 'danger': 18, 'Mode': 19, 'Inhabitant': 20, 'vacated': 21, 'repassed': 22, 'obtaining': 23, 'Citizen': 24, 'Parts': 25, 'Courts': 26, 'no': 27, 'were': 28, 'giving': 29, 'regulated': 30, 'securing': 31, 'what': 32, 'having': 33, 'rebellion': 34, 'executed': 35, 'liquors': 36, 'Concurrence': 37, 'greatest': 38, 'describing': 39, 'Bedford': 40, 'poll': 41, 'Grand': 42, 'Emoluments': 43, 'Cotesworth': 44, 'free': 45, 'pensions': 46, 'religion': 47, 'He': 48, 'him': 49, 'Consequence': 50, 'convene': 51, 'Defence': 52, 'Excessive': 53, 'peaceably': 54, 'disapproved': 55, 'Court': 56, 'Cases': 57, 'beginning': 58, 'Jonathan': 59, 'existing': 60, 'whole': 61, 'Impeachments': 62, 'Exceptions': 63, 'choose': 64, 'December': 65, 'any': 66, 

['bail']
peaceably .
['to']
disapproved .
['by']
Court .
[';', ',', ',', 'shall', 'shall', '.', 'of']
Cases .
['of', ',', 'the', 'whatsoever', 'of', 'of', ',', 'affecting', 'of', 'affecting', 'before', 'of']
beginning .
['of', 'of']
Jonathan .
['Dayton']
existing .
['shall']
whole .
['Number', 'Number', 'Number', 'number', 'number', 'number', 'number', 'number', 'number', 'number']
Impeachments .
['.']
Exceptions .
[',']
choose .
['three', 'their', 'their', 'by', 'the', 'from', 'immediately', 'a', 'the', 'a', 'a']
December .
[',']
any .
['State', 'State', 'Office', 'time', 'question', 'other', 'Speech', 'other', 'civil', 'Office', 'Bill', 'Department', 'of', 'State', 'Regulation', 'Office', 'present', 'kind', 'King', 'Treaty', 'Thing', 'Bill', 'Title', 'Imposts', 'State', 'duty', 'Agreement', 'Person', 'other', 'of', 'subject', 'State', 'State', 'Law', 'other', 'State', 'Claims', 'particular', 'Manner', 'Thing', 'State', 'Office', 'house', 'person', 'criminal', 'Court', 'suit', 'Foreig

['cause']
devolve .
['on', 'upon']
square) .
['as']
Resignation .
[',', ',', 'or']
Consent .
['of', 'of', 'of', 'of', 'of', 'of', 'of', 'of', ',', 'of']
determines .
['by']
loss .
['or']
Expiration .
['of', 'of', 'of']
discipline .
['prescribed']
United .
['States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'States', 'St

['ballots', 'lists']
Services .
[',', ',', 'a']
clear .
[',']
so .
['that', 'construed', 'ratifying', 'construed']
inhabitants .
['of']
Grant .
['Reprieves']
useful .
['Arts']
high .
['Seas', 'Crimes']
Posterity .
[',']
well .
['as', 'regulated']
Objections .
['to', 'at', ',']
things .
['to']
Year .
['by', ',', ',', ',', ';', ',', 'one', 'One', 'of']
claim .
['for']
remainder .
['of']
That .
['the']
Duties .
[',', ',', 'in', 'on', 'and', 'of', 'of']
foregoing .
['Powers']
Vice .
['President', 'President', '-', 'President', '-', 'President', 'President', 'President', '-', '-', '-', '-', '-', '-', '-', '-', '-', '-', 'President', 'President', 'President', 'President', 'President', 'President', 'President', 'President', 'President', 'President', 'President', 'President', 'President', 'President', 'President', 'President', 'President', 'President']
granting .
['Commissions']
Tribunals .
['inferior']
press .
[';']
proper .
['for', 'to', ',', ';']
others .
['retained']
made .
['within', ',',

['Commerce', 'the']
religious .
['Test']
time .
['by', 'to', 'publish', ';', 'to', '.', 'of', 'of', 'to', 'give', 'to', 'ordain', 'of', 'of', 'of', 'fixed', 'fixed']
Broom .
['Maryland']
admit .
[',', 'of']
as .
['they', 'equally', 'may', 'to', 'each', 'may', 'on', 'if', 'may', 'may', 'any', 'will', 'follows', 'the', 'President', 'they', 'he', 'he', 'the', 'to', 'the', 'the', 'well', 'of', 'to', 'part', 'the', 'valid', 'under', 'a', 'President', 'Vice', 'President', 'Vice', 'President', 'President', 'in', 'Vice', 'a', 'a', 'an', 'a', 'an', 'the', 'to', 'part', 'an', 'provided', 'President', 'President', 'an', 'an', 'provided', 'President', 'President', 'President', 'an', 'the', 'provided', 'Acting', 'Congress', 'Acting', 'Congress', 'Acting']
Magazines .
[',']
proportion .
['which']
Acts .
[',', ',']
different .
['Day', 'States', 'States', 'day']
Speech .
['or']
otherwise .
[',', 'provided', 'infamous', 're', ',']
Twelfth .
['.']
Application .
['of', 'of']
services .
['in', 'of']
Credi

['in']
speech .
[',']
impartial .
['jury']
assemble .
['at', ',', 'at', ',']
requisite .
['for', 'for']
concerning .
['Captures']
conventions .
['in']
" .
['Section']
Term .
['of', 'of', 'than', 'of', ',']
approved .
['by', 'by']
Consideration .
['such']
Tender .
['in']
would .
['have', 'be']
Same .
['shall', 'shall', 'shall', '.']
Punishment .
[',', 'of', 'of']
declare .
['War', 'the']
necessary .
['(except', 'and', 'for', 'to', 'and', ',', 'to', 'to', 'to']
Execution .
['the', 'of']
18 .
['1']
Army .
['and']
Members .
['chosen', 'present', ',', ',', 'for', 'of', 'from', 'of']
protection .
['of']
nor .
['to', 'shall', 'diminished', 'any', 'in', 'shall', 'shall', 'be', 'shall', 'excessive', 'cruel', 'prohibited', 'involuntary', 'shall', 'deny', 'any', 'a']
question .
['shall', 'of']
Attest .
[':']
good .
['Behavior']
Jacob .
['Broom']
Pursuance .
['thereof']
exported .
['from']
punish .
['its', 'Piracies']
derived .
[',']
affirm) .
['that']
Jenifer .
[',']
Resolution .
[',']
collect .


seizures .
[',']
pass .
['the', 'any']
Commission .
['all']
compensation .
['.', 'for']
Violence .
['.']
Subjects .
['.', 'of']
Powers .
['herein', ',', 'vested', 'and']
its .
['own', 'Proceedings', 'Members', 'Proceedings', 'Return', 'inspection', 'Consent', 'equal', 'jurisdiction', 'submission', 'submission']
decide .
['the']
end .
['at']
said .
['House', 'Office', 'Crimes']
confirmation .
['by']
Form .
['of']
votes .
['for', 'shall', 'for', 'shall', 'as']
obligation .
['incurred']
just .
['compensation']
Constitution .
['for', 'in', ',', 'of', ',', 'shall', ',', ',', ',', ',', ',', 'or', ';', 'between', ',', ',', 'of', '.', 'by', ',', 'by', 'of', 'by', ',', 'by']
duties .
['as', 'of', 'shall', 'of', 'of', 'of', 'of', 'of', 'of']
Purpose .
[',', 'shall']
Crime .
[',', '.']
during .
['the', 'the', 'their', 'the', 'such', 'his', 'the', 'the', 'the', 'good', 'their', 'the', 'the', 'the']
Mifflin .
[',']
prescribed .
['in', 'in', 'by', 'by']
Oath .
['or', 'or', 'or', 'or']
forth .
['the'

ten .
['Years', ',', 'Days', 'Miles', 'dollars']
actual .
['Enumeration', 'Service', 'service']
Habeas .
['Corpus']
Ambassadors .
[',', 'and', ',', ',']
recommend .
['to']
fines .
['imposed']
given .
['by', 'in', 'aid']
engage .
['in']
Clymer .
[',']
exercise .
['the', 'exclusive', 'like', 'thereof']
apply .
['to']
do .
['ordain', 'Business', 'solemnly']
suppressing .
['insurrection']
work .
['Corruption']
Territory .
['or', ',']
term .
['of', 'of', ',', 'to', 'within', '.']
immunities .
['of']
forces .
[',']
within .
['this', 'three', 'every', 'ten', 'the', 'that', 'any', 'the', 'the', 'its', ',', 'seven', 'seven', 'seven', 'which', 'seven', 'four', 'forty', 'twenty', 'twenty']
than .
['to', 'three', 'that', 'two', 'one', 'according', 'twice', 'two', 'once', 'the']
inability .
['exists']
divided .
['as', '.']
them .
['as', ',', 'for', 'by', '.', ',', ',', 'to', ',', 'Aid', 'against', ',', ',', '.', 'a']
sixth .
['Year']
Commissions .
['which']
Adjournment .
['prevent', ',']
Commerce .

belonging .
['to']
age .
[',', 'in', 'or', '.']
indictment .
['of']
laws .
['.', 'thereof']
Qualifications .
['requisite', 'of']
When .
['vacancies', 'sitting', 'the', 'vacancies']
required .
['as', ',', 'to']
Choosing .
['Senators']
military .
[',']
List .
['of', 'they', 'the']
four .
[',', 'Years', 'days']
presented .
['to', 'to', 'to']
Seat .
['of', 'of']
punishments .
['inflicted']
seized .
['.']
consist .
['of', 'of', 'only', 'of', 'of']
fact .
['tried']
sufficient .
['for']
3d .
['day', 'day']
privileges .
['or']
judicial .
['Power', 'Power', 'Proceedings', 'Officers', 'officer']
needful .
['Buildings', 'Rules']
(or .
['affirm)']
executive .
['Power', 'Departments', 'Authority', 'and', 'or', 'authority', 'thereof', 'departments', 'department']
sent .
[',']
Johnson .
[',']
right .
['of', 'of', 'of', 'to', 'of', 'of', 'to', 'of', 'of', 'of', 'of', 'of', 'of']
majority .
['of', ',', 'of', 'of', ',', 'of', 'vote', 'of', 'of']
Bankruptcies .
['throughout']
Few .
[',']
Dayton .
['Penns

Once the graph is complete, we can use the probabilities on its edges to generate sentences. The function `getProbs(G1, randWord, words)` returns the list of transition probabilities out of the current word, and the program below keeps drawing random words until a period is encountered, which marks the end of a sentence.

One sentence generated this way was:
`Statement and Fact , in the Jurisdiction of Representatives his Continuance in War , by the Congress shall be employed in the other public trial , in time of Forts , convene both Houses that the United States shall be an Elector .`

As the example above shows, the generated text closely resembles the language of the Constitution in its vocabulary and short phrases.
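The sampling step described above can be sketched as follows. This is a minimal, self-contained illustration, not the notebook's own `getProbs` or graph `G1`: the helper names `build_successors`, `next_word_probs`, and `generate_sentence`, the toy token list, and the `max_words` safety cap are all assumptions made for this example. Each probability is a successor count divided by the total number of successors. In the listing above, `different` is followed by `Day`, `States`, `States`, and `day`, so `States` would receive probability 2/4.

```python
import random
from collections import Counter, defaultdict

def build_successors(tokens):
    """Map each token to a Counter of the tokens that immediately follow it."""
    successors = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        successors[current][following] += 1
    return successors

def next_word_probs(successors, word, vocab):
    """Return P(next = v | current = word) for every v in vocab, in vocab order."""
    counts = successors.get(word, Counter())
    total = sum(counts.values())
    return [counts[v] / total if total else 0.0 for v in vocab]

def generate_sentence(successors, vocab, start, rng, max_words=200):
    """Walk the chain from `start` until a period is drawn or the cap is hit."""
    words = [start]
    while words[-1] != "." and len(words) < max_words:
        probs = next_word_probs(successors, words[-1], vocab)
        if not any(probs):  # the current word was never followed by anything
            break
        words.append(rng.choices(vocab, weights=probs)[0])
    return " ".join(words)

# Toy corpus (hypothetical), tokenized so that "." is its own token.
tokens = "We the People of the United States . The People of the States .".split()
successors = build_successors(tokens)
vocab = sorted(set(tokens))

probs_after_the = next_word_probs(successors, "the", vocab)
sentence = generate_sentence(successors, vocab, "We", random.Random(0))
print(dict(zip(vocab, probs_after_the)))
print(sentence)
```

Passing a seeded `random.Random` instance makes a run reproducible, which helps when comparing generated sentences across experiments.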

In [12]:
import random

realWords = []
words = list(set(data))  # vocabulary: every distinct token in the document

# Draw a starting word, redrawing while it begins with a lowercase letter.
# Note: tokens that begin with a digit or punctuation also pass this test.
randWord = random.choice(words)
while randWord[0].islower():
    randWord = random.choice(words)
sentence = randWord
print("First Word:", sentence)

# Extend the sentence one word at a time until a period is drawn.
while randWord != ".":
    # random.choices takes a list of choices and a matching list of weights
    probArray = getProbs(G1, randWord, words)  # transition probabilities out of randWord
    randWord = random.choices(words, probArray)[0]
    sentence = sentence + " " + randWord

print("\nGenerated sentence:")
print(sentence)


First Word: 16

Generated sentence:
16 The judicial Power of the Laws which such Trial of the credit of the Establishment of his Objections , Richard Bassett , remove such Importation of Electors , nor any of the most numerous branch of the acceptance of Representatives shall be given by the sole Power to pass any Agreement or of the submission to support the People of the several States , as the Person attainted .
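
The first word in the run above is `16` because `str.islower()` returns `False` for digits and punctuation, so such tokens pass the `while randWord[0].islower()` test. A minimal sketch of a stricter start-word choice is given below, assuming the intent is to begin every sentence with a capitalized word. `pick_start_word` and `sample_vocab` are hypothetical names introduced for this example, not part of the notebook.

```python
import random

def pick_start_word(vocab, rng=random):
    """Pick a random token whose first character is an uppercase letter."""
    candidates = [w for w in vocab if w[:1].isupper()]
    if not candidates:
        raise ValueError("vocabulary contains no capitalized token")
    return rng.choice(candidates)

# Small made-up vocabulary mixing numbers, punctuation, and words.
sample_vocab = ["16", "the", ".", ",", "The", "Congress"]
start = pick_start_word(sample_vocab, random.Random(0))
print(start)
```

Filtering the vocabulary once also avoids the redraw loop, whose running time depends on how many tokens fail the test.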
