## Homework 10: Final

### Hallucinating the Constitution by *Michael Chamerski*

Consider the constitution of the United States:

> https://www.usconstitution.net/const.txt

This document contains upper- and lower-case letters, numbers, and basic punctuation. 

**One letter prediction:**

1. Find the set of all characters used in the document. Call the number of characters $n$. 
2. Create an $n \times n$ matrix whose $i,j$ entry is the probability that the next character is $j$ given that the current character is $i$. Estimate this probability by looking at all occurrences of character $i$ in the document and the number of times character $j$ immediately follows it. 
3. Simulate this system as a Markov chain that starts with an arbitrary capital letter and continues until it gets to a space. Produce $100$ random "words" this way. How many of them are actual words? Use a [Scrabble dictionary](https://scrabble.hasbro.com/en-us/tools#dictionary) if you are not certain whether a given sequence is a word. 
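
In symbols, the estimate described in step 2 can be written as follows. Here $N_{ij}$ is notation introduced for illustration (it is not part of the assignment text) and denotes the number of times character $j$ immediately follows character $i$ in the document:

$$P_{ij} = \frac{N_{ij}}{\sum_{k=1}^{n} N_{ik}}$$

Each row of $P$ then sums to $1$, provided character $i$ is followed by at least one character somewhere in the document.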

**Two letter prediction:**

1. Create an $n \times n \times n$ tensor whose $i,j,k$ entry is the probability that the next character is $k$ given that the current character is $j$ and the previous character is $i$. Use the document to empirically find these probabilities. 
2. Use this model to construct random words. 
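
The empirical probabilities in step 1 can be written in the same way. The symbol $N_{ijk}$ is introduced here for illustration and denotes the number of times the three-character sequence $i, j, k$ occurs in the document:

$$P_{ijk} = \frac{N_{ijk}}{\sum_{l=1}^{n} N_{ijl}}$$

The estimate is only defined for pairs $i, j$ that actually occur in the document and are followed by at least one character.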

**Sentence prediction:**

Do a one word prediction, but use all the unique *words* in the document. Hallucinate sentences. Consider a punctuation mark as a word. 

#### Due 15 December, 2019 at 11:59 PM

In [1]:
# IMPORTS
import urllib3, random, enchant, re, numpy as np
from collections import Counter

# SUPPRESS WARNINGS; PROBABLY NOT RECOMMENDED, BUT USEFUL FOR THIS CASE
import warnings
warnings.filterwarnings("ignore")

<hr>

#### Using the one-letter prediction method:

We are able to hallucinate words using the Constitution as an input by taking the following steps:

1. The Constitution must be retrieved from a source and loaded into Python as text. In this case, we will use https://www.usconstitution.net/const.txt. We then clean up the response by removing all line breaks and extra spaces. All of this is done by the `constitution()` function.
2. We then use the text to generate a list of all characters in the string. This is done simply by converting the string to a list. `split_string()` performs this task and returns a list of all characters and a list of unique characters.
3. Next, we compare the list of all characters against the list of unique characters to determine the probability of a particular character appearing after another. For each character in the unique set, we search the text for the characters that immediately follow it and build a list, which is then transformed into a `Counter` object containing the number of occurrences. This is accomplished with the `generate_counter_object()` function.
4. The `Counter` object for each character in the unique set must then be normalized so that its values sum to 1, which lets us build a probability matrix. We feed each `Counter` object into `normalize_counter_object()`, which returns a dictionary of normalized values.
5. Using the sorted unique set as the index for $i$ and $j$, we iterate through the dictionary of normalized values and store the probabilities in the probability matrix.
6. We then pass each row of the probability matrix as the `p` argument of `numpy`'s `random.choice()` function to build a word, simulating the system as a Markov chain.
7. The list of words is then compared against a U.S. English dictionary using the `pyenchant` library.
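
Steps 2 through 6 can be sketched compactly on a toy string. This is a minimal illustration, not the code used in the cells below: the names `transition_matrix`, `sample_word`, and `toy_text` are hypothetical, the counts are stored directly in a dense `numpy` array rather than in `Counter` objects, and the walk stops only at a space.

```python
import numpy as np


def transition_matrix(text):
    """Return (symbols, P) where P[i, j] estimates the probability
    that symbols[j] immediately follows symbols[i] in `text`."""
    symbols = sorted(set(text))
    position = {s: i for i, s in enumerate(symbols)}
    counts = np.zeros((len(symbols), len(symbols)))
    # A single pass over adjacent pairs fills every row at once.
    for current, following in zip(text, text[1:]):
        counts[position[current], position[following]] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # Rows with no observed successor stay all-zero instead of dividing by zero.
    P = np.divide(counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0)
    return symbols, P


def sample_word(symbols, P, start, rng, stop=" ", max_len=50):
    """Walk the chain from `start` until `stop` is drawn or max_len is reached."""
    word, current = start, start
    while len(word) < max_len:
        row = P[symbols.index(current)]
        if row.sum() == 0:  # dead end: this symbol was never followed by anything
            break
        current = str(rng.choice(symbols, p=row))
        if current == stop:
            break
        word += current
    return word


toy_text = "We the People of the United States "  # stand-in for the full document
symbols, P = transition_matrix(toy_text)
rng = np.random.default_rng(0)
print(sample_word(symbols, P, "W", rng))
```

Counting adjacent pairs in one pass avoids rescanning the whole text once per unique character, which matters more as the alphabet grows.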

In [2]:
def constitution():
    '''Uses urllib3's PoolManager() to request the text
    format of the Constitution. Additionally, removes 
    line breaks and extra spaces.'''
    response = urllib3.PoolManager().request('GET', 'https://www.usconstitution.net/const.txt')
    data = (response.data.decode()).partition('We the People')
    return ((data[1] + data[2]).replace("\n", " ")).replace("  ", " ")[:-1]

def split_string(string):
    '''Takes an input string and returns a tuple containing the
    list of all characters in the string, and a list of all
    unique characters in the string.'''
    listed_string = list(string)
    return listed_string, list(set(listed_string))

def generate_counter_object(raw_list, raw_set):
    '''Takes an input list of all characters, including
    duplicates, and compares against the input list of all
    unique characters to generate a Counter object for each
    character.'''
    char_set_dict = {}
    for char in raw_set:
        char_set = []
        for char_ in range(len(raw_list) - 1):
            if char == raw_list[char_]:
                char_set.append(raw_list[char_ + 1])
        char_set_dict[char] = Counter(char_set)
    return char_set_dict

def normalize_counter_object(counter_object):
    '''Takes Counter object input and returns a
    dictionary with normalized values.'''
    normalized_counter_object = {}
    total = sum(counter_object.values())
    for char in counter_object:
        normalized_counter_object[char] = counter_object[char]/total 
    return normalized_counter_object

def generate_word(index, probability_matrix):
    '''Generate a word given an index and probability matrix.
    Begins with a capital letter and ends when a space, hyphen,
    comma, or period is drawn. Uses the probability matrix to
    simulate the system as a Markov chain.'''
    word = ''
    # PICK A RANDOM STARTING CHARACTER UNTIL IT IS A CAPITAL LETTER
    while not word.isupper():
        word = random.choice(index)
    current_letter = word
    while current_letter not in [' ', '-', ',', '.']:
        # USE THE ARGUMENTS RATHER THAN THE GLOBAL char_index AND probability_set
        k = np.random.choice(index, p=probability_matrix[current_letter])
        word += k; current_letter = k
    return word[:-1] # DROP THE TERMINATING CHARACTER

def generate_words(index, probability_matrix, num):
    '''Uses generate_word function to generate a given number of words.
    Takes the index, probabilities, and number of words as input. Returns
    a list of generated "words".'''
    return [generate_word(index, probability_matrix) for _ in range(num)]

def check_dictionary(word_list):
    '''To check if a particular word is an English word, 
    we can use the en_US dictionary in the PyEnchant module.'''
    d = enchant.Dict("en_US")
    return [word for word in word_list if d.check(word)]

In [3]:
# GENERATE LIST OF CHARACTERS FROM THE CONSTITUTION
raw_text_list, raw_text_set = split_string(constitution())
print(raw_text_set)

['J', 'V', 's', 'C', 'r', '1', '7', '.', '8', ':', 'e', ' ', 'B', 'Q', '"', '2', 'p', 'y', 'k', ';', 'c', 'h', 'x', '3', 'a', 'b', 'I', '4', '5', '0', 'P', 'j', 'u', 'S', 't', 'd', 'g', ',', 'O', 'R', 'M', 'f', 'N', 'E', 'D', 'G', '6', 'q', 'F', '-', 'Y', 'w', 'A', 'U', 'K', 'o', 'v', 'i', 'm', '(', 'n', 'T', 'W', ')', '9', 'l', 'L', 'z', 'H']


In [4]:
# GENERATE COUNTER OBJECT FOR HOW MANY TIMES A LETTER OCCURS AFTER A PARTICULAR CHARACTER
char_set_dict = generate_counter_object(raw_text_list, raw_text_set)
print(char_set_dict['A'])

Counter({'m': 38, 'r': 18, 'u': 8, 'p': 8, 'd': 7, 'c': 7, 'l': 6, 'n': 6, 't': 6, 'g': 4, 'f': 4, ' ': 3, 'b': 2, 'i': 1, 's': 1})


In [5]:
# INITIALIZE NORMAL DICT
normalized_char_set_dict = {}

# FOR EACH CHAR IN DICT, NORMALIZE COUNTER & STORE
for char_ in char_set_dict:
    normalized_set_ = normalize_counter_object(char_set_dict[char_])
    normalized_char_set_dict[char_] = normalized_set_
    
# EXAMPLE FOR CHARACTERS APPEARING AFTER 'l'
normalized_char_set_dict['l']

{' ': 0.30655391120507397,
 ',': 0.014799154334038054,
 '.': 0.0014094432699083862,
 ';': 0.0007047216349541931,
 'a': 0.08386187455954898,
 'd': 0.016913319238900635,
 'e': 0.13742071881606766,
 'f': 0.0035236081747709656,
 'i': 0.07963354474982381,
 'l': 0.29034531360112753,
 'm': 0.0007047216349541931,
 'o': 0.01620859760394644,
 's': 0.008456659619450317,
 't': 0.0028188865398167725,
 'u': 0.01127554615926709,
 'v': 0.007047216349541931,
 'y': 0.018322762508809022}

In [6]:
# DEFINE STANDARDIZED CHARACTER INDEX
char_index = sorted(raw_text_set)
print(char_index)

[' ', '"', '(', ')', ',', '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [7]:
# BUILD N-by-N MATRIX USING THE COUNTER DICTIONARY
probability_set = {}
for i in char_index:
    probability_set_char = []
    normalized_ = normalized_char_set_dict[i]
    for j in char_index:
        try:
            probability_set_char += [normalized_[j]] # IF THERE IS A PROBABILITY, SET IT.
        except KeyError:
            probability_set_char += [0] # IF THERE IS NO KEY, THERE IS NO PROBABILITY. SET IT TO ZERO.
    probability_set[i] = probability_set_char # SAVE INTO A MATRIX.

In [8]:
# GENERATE WORDS & COMPARE THEM AGAINST A DICTIONARY
r = generate_words(char_index, probability_set, 100)
r_en_US = check_dictionary(r)
print(("---------- 100 EXAMPLE WORDS ----------\n{}\n\n"
       "---------- ENGLISH EXAMPLE WORDS ----------\n{}"
       "\n----------\n{} word(s) of the {} word(s) "
       "generated are proper English words, using the " 
       "one-letter prediction method.").format(r, r_en_US, len(r_en_US), len(r)))

---------- 100 EXAMPLE WORDS ----------
['Qut', 'Fan', 'Authas', 'Yetorior', 'Un', 'Kided', 'Cony', 'Horviesind', 'Remerd', 'Qutide', 'Nof', 'Hof', 'Gersitothe', 'Seroumpeduche', 'Tave', 'Renge', 'Ames', 'Ampe', 'For', 'Gene', 'Brte', 'Aum', 'Yern', 'Kig', 'Mos', 'On', 'Fan', 'Ge;', 'Go', 'Unga', 'Re', 'Did', 'Tosont', 'Qurthedes', 'Menther', 'Imelllle', 'Qunfithon', 'Makesirqunt', 'Kice', 'Goron', 'Re', 'Reerseclesony', 'At', 'Prthesit', 'Qutys', 'Fong', 'Gre', 'Got', 'Jut', 'Un', 'Nesen', 'Dacowhthar', 'Brme', 'Unemse', 'Kis', 'Cof', 'Fesshal', 'Entoffthitoversweshershexis', 'If', 'Kiote', 'Recce', 'Ele', 'Hothe', 'Bes', 'Re', 'Th', 'If', 'Grecon', 'Dainthonthetaleciofray', 'Hof', 'Qurend', 'Auans', 'Res', 'Hoffonungas', 'Kit', 'Pr', 'Ges', 'Pr', 'Wit', 'Prthier', 'Ofike', 'Ariongrtons', 'Unor', 'Yente', 'Oaprshand', 'Atavionates', 'Donar', 'In', 'Lasithovegre', 'Exces', 'Calle', 'Jo', 'We', 'Coticke', 'Vinche', 'Fasun', 'Qupanuse', 'Jal', 'Hofotisivy', 'Saichirthirs']

---------- EN

<hr>

#### Using the two-letter prediction method:

The two-letter prediction follows the same process as the one-letter prediction, except that each `Counter` object is now keyed by a pair of characters. We build a nested dictionary whose first level is the previous character $i$, whose second level is the current character $j$, and whose `Counter` object holds the number of times each character $k$ follows that pair. After normalization, this nested dictionary acts as the 3-dimensional probability tensor $(i, j, k)$, which is then used to generate words. Because no pair is available at the first step, the second character of each word is drawn using the one-letter probabilities.
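
As an alternative to nested dictionaries, the same probabilities can be held in a dense `numpy` array. The following is a minimal sketch under that assumption, not the code used in the cells below; `transition_tensor` and `toy_text` are hypothetical names chosen for illustration.

```python
import numpy as np


def transition_tensor(text):
    """Return (symbols, T) where T[i, j, k] estimates the probability that
    symbols[k] follows the two-character context symbols[i] + symbols[j]."""
    symbols = sorted(set(text))
    position = {s: i for i, s in enumerate(symbols)}
    n = len(symbols)
    counts = np.zeros((n, n, n))
    # Slide a window of three characters over the text.
    for previous, current, following in zip(text, text[1:], text[2:]):
        counts[position[previous], position[current], position[following]] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Contexts that never occur keep an all-zero slice instead of dividing by zero.
    T = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    return symbols, T


toy_text = "the then there these "  # stand-in for the full document
symbols, T = transition_tensor(toy_text)
i, j = symbols.index("t"), symbols.index("h")
# Print every character observed after the context "th" with its probability.
for k in np.flatnonzero(T[i, j]):
    print(f"P({symbols[k]!r} | 'th') = {T[i, j, k]:.2f}")
```

With the 69 characters found in this document the array has $69^3 = 328509$ entries, most of them zero, so the nested-dictionary representation is the sparser of the two.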

In [9]:
def generate_counter_object_two(raw_list, raw_set):
    '''Takes an input list of all characters, including
    duplicates, and the input list of all unique characters.
    Returns a nested dictionary in which [char1][char2] is a
    Counter of the characters that follow the pair char1, char2.'''
    char1_set_dict = {}
    for char1 in raw_set:
        char2_set_dict = {}
        for char2 in raw_set:
            char_set = []
            for char_ in range(len(raw_list) - 2):
                if char1 == raw_list[char_] and char2 == raw_list[char_ + 1]:
                    char_set.append(raw_list[char_ + 2])
            char2_set_dict[char2] = Counter(char_set)
        char1_set_dict[char1] = char2_set_dict
    return char1_set_dict

def normalize_counter_object_two(counter_object):
    '''Takes Counter object input and returns a
    dictionary with normalized values.'''
    normalized_counter_object = {}
    total = sum(counter_object.values())
    for char in counter_object:
        normalized_counter_object[char] = counter_object[char]/total 
    return normalized_counter_object

def generate_word_two(index, probability_set_1, probability_set_2):
    '''Generate a word given an index and two probability sets.
    Begins with a capital letter and ends when a space, period,
    comma, or hyphen is drawn. Simulates the system as a Markov chain:
    probability_set_1 holds the one-letter probabilities and is used
    only to draw the second character, while probability_set_2 holds
    the two-letter probabilities (previous character, current character)
    and is used for every character after that.'''
    word = ''
    while not word.isupper():
        word = random.choice(index)
    mid_letter = word
    m = np.random.choice(index, p=probability_set_1[mid_letter])
    word += m
    if m in [' ', '.', ',', '-']:
        return word[:-1]
    selection = probability_set_2[mid_letter][m]
    n = np.random.choice(list(selection.keys()), p=list(selection.values()))
    word += n; starting_letter = m; mid_letter = n
    if n in [' ', '.', ',', '-']:
        return word[:-1]
    while True:
        selection = probability_set_2[starting_letter][mid_letter]
        n = np.random.choice(list(selection.keys()), p=list(selection.values()))
        word += n
        if n in [' ', '.', ',', '-']:
            break
        else:
            starting_letter = mid_letter
            mid_letter = n
    return word[:-1]

def generate_words_two(index, probability_set_1, probability_set_2, num):
    '''Uses the generate_word_two function to generate a given number
    of words. Takes the index, both probability sets, and the number of
    words as input. Returns a list of generated "words".'''
    return [generate_word_two(index, probability_set_1, probability_set_2) for _ in range(num)]

In [10]:
# GENERATE COUNTER OBJECTS FOR HOW MANY TIMES A CHARACTER OCCURS AFTER EACH PAIR OF CHARACTERS
char_set_dict_two = generate_counter_object_two(raw_text_list, raw_text_set)

In [11]:
# INITIALIZE NORMAL DICT
normalized_char_set_dict_two = {}

# ITERATE THROUGH EACH COUNTER OBJECT IN EACH DICTIONARY & NORMALIZE
for char_set_ in char_set_dict_two:
    normalized_char_set_dict_two[char_set_] = {}
    for char_ in char_set_dict_two[char_set_]:
        normalized_set_ = normalize_counter_object_two(char_set_dict_two[char_set_][char_])
        normalized_char_set_dict_two[char_set_][char_] = normalized_set_
        
# EXAMPLE: PROBABILITY OF OCCURRENCES OF "ab_"
normalized_char_set_dict_two['a']['b']

{'e': 0.024390243902439025,
 'i': 0.2682926829268293,
 'l': 0.3902439024390244,
 'o': 0.07317073170731707,
 'r': 0.17073170731707318,
 's': 0.07317073170731707}

In [12]:
# GENERATE WORDS & COMPARE THEM AGAINST A DICTIONARY
r_two = generate_words_two(char_index, probability_set, normalized_char_set_dict_two, 100)
r_two_en_US = check_dictionary(r_two)
print(("---------- 100 EXAMPLE WORDS ----------\n{}"
       "\n\n---------- ENGLISH EXAMPLE WORDS ----------"
       "\n{}\n----------\n{} word(s) of the {} word(s) "
       "generated are proper English words, using the " 
       "two-letter prediction method.").format(r_two, r_two_en_US, len(r_two_en_US), len(r_two)))

---------- 100 EXAMPLE WORDS ----------
['Fiterales', 'Quall', 'Quars', 'Willes', 'In', 'Donve', 'Quall', 'Stat', 'Artion', 'Dutior', 'Rept', 'Debaboutionstionse', 'Exce', 'Quars', 'Ambe', 'The', 'Saftess', 'No', 'Dan', 'Kincuthosideng', 'Ele', 'New', 'Legibe', 'Adjousers', 'Quall', 'Treas', 'Fited', 'Offies', 'Depre', 'Nationgreof', 'Arty', 'Mancurroviclaccermesiden', 'Hourposess', 'Sen', 'Law', 'Ador', 'Press', 'Vicur', 'Lislaward', 'Welecution', 'Quall', 'Diss', 'Bil', 'Janto', 'Offired', 'Stated', 'Wele', 'Yeace', 'Jurs', 'For', 'Was', 'Govacassen', 'Amend', 'Cas', 'He', 'Dearty', 'Habity', 'Band', 'Vicess', 'Jund', 'Opirds', 'Mandmento', 'Laws', 'Biliall', 'Kin', 'Laws:', 'Serallonvach', 'Constationsmins', 'Jams', 'Dist', 'Jurt', 'To', 'Regin', 'Quall', 'For', 'For', 'Ind', 'Thelite', 'Viction', 'Houtivices', 'Ball', 'Deptutiturts', 'Bralls', 'Ballarited', 'Wil', 'Oatives;', 'Jr', 'Felegislaw;', 'Jury', 'Repropecut', 'Repatures', 'Qual', 'Rull', 'Preswo', 'Ameem', 'Unitivideprequa

<hr>

#### Using the one-word prediction method:

The one-word prediction method uses the exact same procedure as the one-letter method, except that a list of words is generated from the raw Constitution text, rather than a list of characters.
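
The idea can be sketched on a toy passage as follows. This is a minimal illustration rather than the code used in the cells below: `word_transitions`, `sample_sentence`, and `toy_text` are hypothetical names, and raw counts are passed as sampling weights instead of being normalized first.

```python
import random
import re
from collections import Counter, defaultdict

PUNCTUATION = {".", ",", ";", "!", "?"}


def word_transitions(text):
    """Map each token to a Counter of the tokens that follow it.
    Punctuation marks are treated as tokens of their own."""
    tokens = re.findall(r"[\w']+|[.,!?;]", text)
    followers = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        followers[current][following] += 1
    return followers


def sample_sentence(followers, start, rng, max_tokens=40):
    """Walk the word chain from `start` until a period is drawn
    or max_tokens tokens have been added."""
    sentence, current = start, start
    for _ in range(max_tokens):
        options = followers.get(current)
        if not options:  # this token never had a successor
            break
        # random.Random.choices accepts raw counts as relative weights.
        current = rng.choices(list(options), weights=list(options.values()))[0]
        # Attach punctuation directly; separate ordinary words with a space.
        sentence += current if current in PUNCTUATION else " " + current
        if current == ".":
            break
    return sentence


toy_text = "The Congress shall have Power. The President shall be Commander in Chief."
followers = word_transitions(toy_text)
print(sample_sentence(followers, "The", random.Random(0)))
```

The `max_tokens` cap is a safeguard added for the sketch, since a chain built from a long document can wander for a long time before drawing a period.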

In [13]:
def split_into_words(string):
    '''Uses the regular expressions library. Takes 
    an input string and extracts each word and each
    punctuation mark (. , ! ? ;) as a separate token.
    Returns a list of all tokens and a list of unique tokens.'''
    r = re.findall(r"[\w']+|[.,!?;]", string)
    return r, list(set(r))

def generate_counter_object_word(raw_list, raw_set):
    '''Takes an input list of all words, including
    duplicates, and compares against the input list of all
    unique words to generate a Counter object of following
    words for each word.'''
    word_set_dict = {}
    for word in raw_set:
        word_set = []
        for word_ in range(len(raw_list) - 1):
            if word == raw_list[word_]:
                word_set.append(raw_list[word_ + 1])
        word_set_dict[word] = Counter(word_set)
    return word_set_dict

def generate_sentence(index, probability_set):
    '''Generate a sentence given an index and probability matrix of a particular
    word appearing after a given word. Begins with a word containing a capital 
    letter and ends when a period occurs. Uses the probability matrix to 
    simulate the system as a Markov chain.'''
    sentence = ''
    while not sentence.istitle():
        sentence = random.choice(index)
    current_word = sentence
    while current_word != '.':
        selection = probability_set[current_word]
        p = np.random.choice(list(selection.keys()), p=list(selection.values()))
        if p in ['.', ',', ';']:
            sentence += p
        else:
            sentence += ' ' + p
        current_word = p    
    return sentence

In [14]:
# GENERATE LIST OF WORDS FROM THE CONSTITUTION
raw_word_list, raw_word_set = split_into_words(constitution())
raw_word_list[:10]

['We', 'the', 'People', 'of', 'the', 'United', 'States', ',', 'in', 'Order']

In [15]:
# GENERATE COUNTER OBJECT FOR HOW MANY TIMES A WORD OCCURS AFTER A PARTICULAR WORD
word_set_dict = generate_counter_object_word(raw_word_list, raw_word_set)

In [16]:
# INITIALIZE NORMAL DICT
normalized_word_set_dict = {}

# FOR EACH WORD IN DICT, NORMALIZE COUNTER & STORE
for word_ in word_set_dict:
    normalized_set_ = normalize_counter_object(word_set_dict[word_])
    normalized_word_set_dict[word_] = normalized_set_
    
# EXAMPLE OF WORDS APPEARING AFTER 'Constitution'
normalized_word_set_dict['Constitution']

{',': 0.44,
 '.': 0.04,
 ';': 0.04,
 'between': 0.04,
 'by': 0.16,
 'for': 0.04,
 'in': 0.04,
 'of': 0.12,
 'or': 0.04,
 'shall': 0.04}

In [22]:
# HALLUCINATE A SENTENCE
print("------ EXAMPLE SENTENCE ------")
generate_sentence(raw_word_set, normalized_word_set_dict)

------ EXAMPLE SENTENCE ------


'Person holding the Senate shall then act as on Imports or which the right of the Vice President elect shall immediately assume the Senate, and until he fled, to support this Constitution, whose Appointments until the Emoluments whereof the sole Power, which some other public Trust or other needful Rules and cause of a question of its submission hereof to time by the Debts and make any State shall be presented to the government of the Same shall have been presented to assemble, during their Votes; and Engagements entered into, the End of the House of trial, or which the other Mode of the list, shall meet in Case, or Labour, until an amendment to dispose of his Objections, of Impeachment, or erected within, Abraham Baldwin Attest William Few, or more who shall be admitted by Oath or as the Congress; To lay and Navy; and of all Treaties made, shall be found in their Services, or as they shall judge necessary to their successors shall be an uniform throughout the Same shall take Care that

<hr>