## Problem 0: Wordle

### Introduction

First learn about the new word game **Wordle** that is taking the internet by storm.

Read about the phenomenon [here.](https://ktla.com/morning-news/technology/what-is-wordle-game-everyone-playing-explained-tips-to-win/)

Play a game [here.](https://www.powerlanguage.co.uk/wordle/)

Here's a quick summary of the idea.

   1. The game consists of one puzzle a day where you have six chances to guess a five-letter word. Let's call that the target. Each of your guesses must be an English word. Sounds pretty hard so far. There are lots of 5-letter words!

   2. But here's the thing.  After you type in your guess, its letters are highlighted: green means a letter is in the right spot; yellow means the letter is in the target, but it’s not in the right spot. The remaining letters turn black, which means the letter does not occur in the target.
   
So as you work through your six guesses, you acquire information
about the target.  Say your guess produces a green *n* in the second
position. You know the target has an *n* in the second position (so, based on what we know about English spelling, it likely has a vowel or an *s* in the first position).  Say you also have a yellow *r* in the fifth position; then you know there's an *r* somewhere in the word, but not in fifth position, and not in first position, because English words can't start with *rn*, and not in second position, because that's filled.  So in fact you know the *r* is in third or fourth position.  Suppose the other three letters in your guess turned black.  You file away the information that none of these letters should show up in your future guesses. And so on, combining simple logic with knowledge of facts 
about English.  Fun game.
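
The logical part of the reasoning in the example above can be written down as a handful of explicit tests. The sketch below is only an illustration: the guess *under*, the small candidate list, and the helper name `fits_example_clues` are all made up for this example, and the tests deliberately use only the color clues, not facts about English spelling.

```python
def fits_example_clues(word):
    """Check a candidate against the clues from a hypothetical guess 'under':
    green 'n' in position 2, yellow 'r' in position 5, black 'u', 'd' and 'e'."""
    return (
        len(word) == 5
        and word[1] == "n"               # green: 'n' is in the second position
        and "r" in word                  # yellow: 'r' occurs somewhere ...
        and word[4] != "r"               # ... but not in the fifth position
        and not set(word) & set("ude")   # black: none of these letters occur
    )

# A made-up candidate list, just for illustration
candidates = ["snarl", "snort", "inert", "under", "knurl", "snore"]
print([w for w in candidates if fits_example_clues(w)])
```

Only *snarl* and *snort* survive: every other candidate contains one of the black letters.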


<figure>
<center>
<img src='https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/_static/sample_wordle_game.png ' />
<figcaption>A sample Wordle Game (not particularly well-played)</figcaption></center>
</figure>

Here's another slightly more interesting game we'll use below.  Think about whether that last guess
is incredibly lucky, or just represents a good play.  We'll try to answer that question below.

<figure>
<center>
<img src='https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/_static/sample_wordle_game_2.png ' />
<figcaption>A better Wordle Game</figcaption></center>
</figure>

### Problem statement

Write a function **color_guess** that takes a Wordle target and a Wordle guess as inputs (so both are 5-letter words), and returns the **coloring** for the guess.

A coloring is a sequence of colors (represented as the characters 'g', 'y' or 'k' [for black]) that contains the correct colors for the letters of the guess, where correct means correct according to the coloring rules of Wordle. 

The first question you should answer is what the data type of a coloring is.

The definition of the function should be just a few lines of code.

### Extra credit

For extra credit use an English word list to return the set of words compatible with the coloring and the guess. For example you can iterate through
`nltk.corpus.words.words()`.

```
from nltk.corpus import words

def find_compatible_words(guess, coloring):
    result = []
    for wd in words.words():
        # <do some cool stuff>: append wd to result if it is compatible
        pass
    return result
```

Needless to say you should just return 5-letter words and don't forget to use negative information (for most colorings, you know some letters that can't be in the target).

Note you should try this function out on the second-to-last guess in the second Wordle 
game above, to answer the question whether the last guess is lucky (there are many
open possibilities) or insightful (there are few). 

Also, to sharpen the answer your function gives you, modify `find_compatible_words`  by adding another (optional) argument for excluded letters.  The meaning is that the excluded letters
shouldn't occur in any of the words returned by the function.  Then assuming `coloring0` is the right coloring
for the second-to-last guess, the words left as possibilities after that guess 
would be given by:

```
find_compatible_words('pinch', coloring0, excluded_letters = "stealnh")
```

because the letters in "stealnh" are the ones eliminated earlier in the game.
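
To make the meaning of the excluded letters concrete, here is a minimal sketch of that one test on its own. The helper name `has_no_excluded_letters` and the sample word list are assumptions made up for illustration; how the test is combined with the coloring is left to `find_compatible_words`.

```python
def has_no_excluded_letters(word, excluded_letters=""):
    """Return True if none of the excluded letters occurs in word."""
    return not set(word) & set(excluded_letters)

# A made-up sample, just for illustration
sample_words = ["pinch", "vivid", "comic", "stale", "buddy"]
print([w for w in sample_words if has_no_excluded_letters(w, "stealnh")])
```

With the default empty string nothing is excluded, so every word passes, which is what an optional argument should do when it is left out.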

### Extra special extra credit

For extra special extra credit,  redo everything that we've thought about so far
to use wordle **information states**.  A wordle information state represents
all the information from all the guesses so far. 

You would write two functions.  One called **update_information_state** that inputs an information state (the current information state), a target word, and a guess, and returns the updated information state.  Another called **find_compatible_words** that takes an information state and returns the 5-letter words compatible with it.

Note that a wordle information state **can't** just be a guess and a coloring.  After one unlucky
guess we might know exactly 5 letters that aren't in the target.  After
two unlucky guesses we might know 10, and that can't be represented by
one guess and one coloring sequence with only 5 positions.  So the first thing you need
to think about, before writing any code, is how to represent an information
state.  If you find a good answer, it should be obvious how to represent the
initial information state (before any guesses have been made).  Note that the result
of applying **find_compatible_words** to the initial information state should be the set of all
English 5-letter words.  And the result of applying **find_compatible_words** to the final information state of a winning game (all green letters) should be a set containing a single 5-letter word. 

Important point:  Don't try to make the information states reflect facts about
English. For instance, knowing there's an *n* in second position tells you a lot about
what letters can occur in first position, **if you know some facts about
English words**.  Don't get into that.  It's a rabbit hole from which there is no return.
Just try to represent the fact that you know the information that comes from
the sequence of  colored guesses, e.g., there's an *n* in second position, there is no *i*,
there is an *e* somewhere, but not in fourth position: the **logical** information you know.

Second important point:  Making the information state be elegant in the sense that no piece
of information is represented twice is not essential. For example, you might choose to represent
information states in such a way that the following can happen:  One part of
one of your information states captures the fact that there is an *e* in the target, and
another part captures the fact that an *e* occurs in second position of the target.
The first fact follows logically from the second. That doesn't mean your way of representing information states is incorrect.

The notion of an information state
has a rich history in computation and in computational psychology. Solving this
problem does not require any acquaintance with that history.  But it will probably be
easier to solve it in an elegant way for those who've had some experience thinking
about information states. 

Summing up: Our notion of information state is a representation of what our guesses thus far
have told us.  It should determine the set
of English words compatible with the guesses and their colorings. 



## Problem 1:  Bigram counting

### Introduction: Bigrams

A bigram is a pair of words occurring together in running text.  For example, in the sentence

```
the old man loved the old woman
```

the bigrams are

```
[('the', 'old'), ('old', 'man'), ('man', 'loved'), ('loved', 'the'), ('the', 'old'), ('old', 'woman')]
```

Bigrams arise in the study of the sequential statistical properties of language,
in other words, when we are trying to answer questions like, "Given that the current word is 
*the*, what is the probability that the next word is *old*?"  So we might
also write down the bigrams of the example sentence as

```
[('the', 'old'), ('old', 'man'), ('man', 'loved'), ('loved', 'the'), ('the', 'old'), ('old', 'woman'),
('woman', 'END')]
```

in order to model the probability that there is no next word, that
is, that the current word is the last word in a sentence. We might even write the bigrams as:

```
[('START', 'the'), ('the', 'old'), ('old', 'man'), ('man', 'loved'), ('loved', 'the'), ('the', 'old'), ('old', 'woman'),  ('woman', 'END')]
```

in order to get counts for being a first word as well.  In this way, just by counting
bigrams, we would discover that *the* is a very common way of starting a sentence.
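
Bigram counts are what make the question about *the* and *old* answerable. As a rough sketch, the padded bigram list above can be turned into an estimate by dividing a bigram's count by the number of bigrams that begin with the same word. The function name `next_word_probability` is made up for this illustration, and the standard-library `Counter` is used for the counting.

```python
from collections import Counter

# The padded bigram list from the example sentence, copied from above
bigrams = [('START', 'the'), ('the', 'old'), ('old', 'man'), ('man', 'loved'),
           ('loved', 'the'), ('the', 'old'), ('old', 'woman'), ('woman', 'END')]

bigram_counts = Counter(bigrams)
# How often each word occurs as the first member of a bigram
first_word_counts = Counter(first for (first, second) in bigrams)

def next_word_probability(current, following):
    """Estimate P(following | current) as count(current, following) / count(current, anything)."""
    return bigram_counts[(current, following)] / first_word_counts[current]

print(next_word_probability('the', 'old'))    # 2 / 2
print(next_word_probability('old', 'man'))    # 1 / 2
print(next_word_probability('woman', 'END'))  # 1 / 1
```

In this tiny sample *the* is always followed by *old*, so the estimate is 1.0. The sketch divides by zero for a word that never starts a bigram, so real code would need to handle unseen words.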

It's useful to know that bigrams are a special case of the general concept of **ngrams**. An
ngram is a sequence of *n* words occurring in running text. The unigrams (or 1-grams) in a text
are just the words, so the unigrams in the text above are

```
['the', 'old', 'man', 'loved', 'the', 'old', 'woman']
```

while the trigrams are:

```
[('START1', 'START2','the'), ('START2', 'the', 'old'), ('the', 'old','man'), ('old', 'man','loved'), ('man', 'loved','the'), ('loved', 'the','old'), ('the', 'old','woman'), ('old', 'woman', 'END1'), ('woman', 'END1', 'END2')]
```
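
One way to see where those padded trigrams come from is to slide a window of width three over the padded word list. This is a minimal sketch, not the only way to do it; the function name `list_trigrams` is made up for illustration, and the input is assumed to be a whitespace-separated string.

```python
def list_trigrams(text):
    """List the trigrams of a whitespace-separated string,
    padded with two start symbols and two end symbols."""
    words = ['START1', 'START2'] + text.split() + ['END1', 'END2']
    # Each trigram starts at position i and covers words i, i+1 and i+2
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

trigrams = list_trigrams('the old man loved the old woman')
print(len(trigrams))
print(trigrams[0], trigrams[-1])
```

The seven-word sentence yields nine trigrams, matching the list above.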

We count ngrams the same way we count words. The bigram ('old', 'woman') occurs once
in our example sentence.  The bigram ('the', 'old') occurs twice.  In this problem we are
going to write some of the utility code for counting bigrams.  First step: list them.

###  Problem statement

Write a function `find_bigrams` that finds the **bigrams** in a string of words.  You can include `'START'`
and `'END'` symbols or not.  Or you can write your function with optional Boolean parameters `start` and `end`,
which give the user the option of including `'START'` and `'END'` symbols.  Your choice.

Write another function `get_bigram_counts` that takes a string of words as its argument, calls `find_bigrams`,
and returns a `FreqDist` with the bigram counts from the input string.  This should be a one-liner.

The functions should work like this.

In [118]:
bgrs = find_bigrams('the old man loved the old woman')
print(bgrs)

[('the', 'old'), ('old', 'man'), ('man', 'loved'), ('loved', 'the'), ('the', 'old'), ('old', 'woman')]


As observed above, the bigram *the old* occurs twice in this sentence, while the bigram *old man* occurs only once. We can capture this fact using a `Counter` (a specialized dictionary for keeping counts of the elements
in a sequence). 

The standard Python module `collections` provides a class called `Counter` for this purpose, but we get a counter with a few more bells and whistles if we import the class `FreqDist` from  the `nltk` (Natural Language Tool Kit) module, and pass it the sequence we want to get counts for.  

For example

In [122]:
from nltk import FreqDist

text = 'the rain in spain falls mainly on the plain'
print(text)
FreqDist(text)


the rain in spain falls mainly on the plain


FreqDist({' ': 8, 'n': 6, 'a': 5, 'i': 5, 'l': 4, 't': 2, 'h': 2, 'e': 2, 's': 2, 'p': 2, ...})

A string is a sequence of characters, so given a string, a `FreqDist` counts characters.

If we want to count words, we need to pass in a sequence of words:

In [123]:
print(text.split())
FreqDist(text.split())

['the', 'rain', 'in', 'spain', 'falls', 'mainly', 'on', 'the', 'plain']


FreqDist({'the': 2, 'rain': 1, 'in': 1, 'spain': 1, 'falls': 1, 'mainly': 1, 'on': 1, 'plain': 1})

Now we see that our `find_bigrams`  function returns just the right thing for counting bigrams.

In [119]:
from nltk import FreqDist

bgrs = find_bigrams('the old man loved the old woman')
print(bgrs)
FreqDist(bgrs)

FreqDist({('the', 'old'): 2, ('old', 'man'): 1, ('man', 'loved'): 1, ('loved', 'the'): 1, ('old', 'woman'): 1})

We count bigrams just like we count words.   

### Implementation details

Now there's a fine point here to pay attention to when you implement
`find_bigrams`.  In order for `FreqDist` to work with your bigrams, the bigrams
have to be **hashable**, which for Python's built-in sequence types means they have to be **immutable**.  That's why the bigrams are tuples,
not lists.

Watch what happens if a bigram is represented as a **list** of two words instead of as a **tuple**
of two words.  

In [129]:
bad_bgrs = bad_find_bigrams ('the old man loved the old woman')
print(bad_bgrs)
FreqDist(bad_bgrs)

[['the', 'old'], ['old', 'man'], ['man', 'loved'], ['loved', 'the'], ['the', 'old'], ['old', 'woman']]


TypeError: unhashable type: 'list'

Here we called `bad_find_bigrams`, a version of `find_bigrams` which represented each bigram as a two-word list.  That caused the error.  Pay attention to the **last line** of the Error (good practice in general, not just here).

```
TypeError: unhashable type: 'list'
```

A `FreqDist` is a `Counter`, which in turn is just a specialized Python dictionary.  Our
ngrams are the keys in the dictionary, and 
the keys in a dictionary must be **hashable**.  For Python's built-in container types, this means they must be immutable.  Tuples
are immutable and lists are not.  Hence when we represent our bigrams as 2-element lists
we get this **unhashable type** error.

Summing up: bigrams must be hashable in order to be keys in a count dictionary.  Hence bigrams
should be tuples, not lists, and bigram sequences should be lists of tuples.  Bear this in mind
when you write `find_bigrams`.
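
The difference is easy to check directly with Python's built-in `hash`. The short sketch below uses the standard-library `Counter` rather than `FreqDist` so that it runs without `nltk`; the variable names are just for illustration.

```python
from collections import Counter

tuple_bigram = ('the', 'old')
list_bigram = ['the', 'old']

# Equal tuples have equal hashes, so they land on the same dictionary key
print(hash(tuple_bigram) == hash(('the', 'old')))

try:
    hash(list_bigram)
    list_is_hashable = True
except TypeError as err:
    list_is_hashable = False
    print('TypeError:', err)

# Converting each 2-element list to a tuple makes the sequence countable
list_bigrams = [['the', 'old'], ['old', 'man'], ['the', 'old']]
counts = Counter(tuple(bg) for bg in list_bigrams)
print(counts[('the', 'old')])
```

The `try` block reproduces the same `unhashable type: 'list'` message seen above, and the last line shows that a list of lists can be rescued by converting each inner list to a tuple.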

For a discussion of **why** hashable types must be immutable look [here.](https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/Python_introduction/mutability.html)

## Problem 2:  Finding Rhymes

### Introduction 

Consider a description of a Python program used to construct the Jan 12, 2022
NYT crossword puzzle, as written by its constructor, Adam Aaronson:

"Then I made a Python program that mined the CMU Pronouncing Dictionary for every word that rhymes with a single digit and searched my word list for every entry consisting of two of those words back to back."

Hmmm. Interesting piece of programming.  Let's consider writing a function 
that finds English rhymes (as part of, let's say, 
our **Poet's Workbench**, a software package we hope to make
millions on).

Let's make one thing clear.  The use of a pronunciation dictionary is critical for
finding English rhymes.  We are stuck dealing with gazillions of special
cases if we have to deal with English orthography.  Think of bite
and flight, fix and ticks, load and code and owed and furloughed;
think of non-rhymes like cough and tough and bough and dough.

So let's start by bringing in the CMU pronouncing dictionary, and looking at a word's pronunciations:

In [12]:
from nltk.corpus import cmudict
pron_dict = cmudict.dict()
pron_dict['actuary']

[['AE1', 'K', 'CH', 'AH0', 'W', 'EH2', 'R', 'IY0']]

A **pronunciation** in the  CMU pronunciation dictionary is a **list of sounds**. Since words can have more than one pronunciation, a word is associated with a **list** of pronunciations.

The word *actuary* has one pronunciation; hence the dictionary gives it a list with one element, and that element is a list of 8 sounds (or phonemes or phones). Next let's look at *human*, a word with two pronunciations. See if you can say them aloud.  In trying to read the pronunciation representation, it's helpful to know that the numbers represent stress information.

In [63]:
pron_dict['human']

[['HH', 'Y', 'UW1', 'M', 'AH0', 'N'], ['Y', 'UW1', 'M', 'AH0', 'N']]

For the purposes of this problem you don't really need to think about what each sound symbol
means, but if you're interested, there's a set of examples illustrating each
of the phonemes [here.](http://www.speech.cs.cmu.edu/cgi-bin/cmudict?stress=-s&in=SYLLABLE)
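
The nesting is worth getting comfortable with: a word maps to a list of pronunciations, and each pronunciation is itself a list of sounds. The sketch below walks that structure using a tiny hand-copied dictionary (the name `sample_prons` is made up for illustration) so that it runs without loading the full CMU dictionary.

```python
# Entries copied from the outputs above
sample_prons = {
    'actuary': [['AE1', 'K', 'CH', 'AH0', 'W', 'EH2', 'R', 'IY0']],
    'human': [['HH', 'Y', 'UW1', 'M', 'AH0', 'N'], ['Y', 'UW1', 'M', 'AH0', 'N']],
}

for word, prons in sample_prons.items():
    first = prons[0]  # the first pronunciation listed for the word
    print(word, '-', len(prons), 'pronunciation(s); the first has', len(first), 'sounds')
```

The same loop works unchanged on `pron_dict`, since it has the same shape.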

Let's look at some rhyming words.

In [45]:
pron_dict['spam'], pron_dict['ham'], pron_dict['sham'], pron_dict['clam']

([['S', 'P', 'AE1', 'M']],
 [['HH', 'AE1', 'M']],
 [['SH', 'AE1', 'M']],
 [['K', 'L', 'AE1', 'M']])

So clearly, the last part of the pronunciation has to match.

Let's try this.  Grab the first, preferred pronunciation of a target word and look for end-matches with that.

We'll do the word *ham*.

In [47]:
tgt = pron_dict['ham'][0]
for (wd,ps) in pron_dict.items():
    for p in ps:
       if p[1:] == tgt[1:]: 
          print(wd, p)

bahm ['B', 'AE1', 'M']
bam ['B', 'AE1', 'M']
cam ['K', 'AE1', 'M']
camm ['K', 'AE1', 'M']
cham ['CH', 'AE1', 'M']
dahm ['D', 'AE1', 'M']
dam ['D', 'AE1', 'M']
damm ['D', 'AE1', 'M']
damme ['D', 'AE1', 'M']
damn ['D', 'AE1', 'M']
gahm ['G', 'AE1', 'M']
gamm ['G', 'AE1', 'M']
hahm ['HH', 'AE1', 'M']
ham ['HH', 'AE1', 'M']
hamm ['HH', 'AE1', 'M']
hamme ['HH', 'AE1', 'M']
jam ['JH', 'AE1', 'M']
jamb ['JH', 'AE1', 'M']
kam ['K', 'AE1', 'M']
kamm ['K', 'AE1', 'M']
lahm ['L', 'AE1', 'M']
lam ['L', 'AE1', 'M']
lamb ['L', 'AE1', 'M']
lambe ['L', 'AE1', 'M']
lamm ['L', 'AE1', 'M']
lamme ['L', 'AE1', 'M']
ma'am ['M', 'AE1', 'M']
nahm ['N', 'AE1', 'M']
nam ['N', 'AE1', 'M']
pam ['P', 'AE1', 'M']
pham ['F', 'AE1', 'M']
rahm ['R', 'AE1', 'M']
ram ['R', 'AE1', 'M']
ramm ['R', 'AE1', 'M']
sahm ['S', 'AE1', 'M']
sam ['S', 'AE1', 'M']
sham ['SH', 'AE1', 'M']
tam ['T', 'AE1', 'M']
tamm ['T', 'AE1', 'M']
tham ['TH', 'AE1', 'M']
wham ['W', 'AE1', 'M']
yam ['Y', 'AE1', 'M']
zahm ['Z', 'AE1', 'M']


There are lots of non-words in CMU, in part because this was built for use in early speech recognition
systems, and coverage of proper names was desirable.  But we do see real words like *yam, dam, lamb*, and
*damn*, some of which depart from a simple orthographic match, so this is a good start.

So something roughly along these lines will work a lot of the time.

Let's call this version of the program **version A**.

Here's another example that works, using a two syllable rhyme.

In [16]:
tgt = pron_dict['rabbit'][0]
for (wd,ps) in pron_dict.items():
    for p in ps:
       if p[1:] == tgt[1:]: 
          print(wd, p)

cabot ['K', 'AE1', 'B', 'AH0', 'T']
habit ['HH', 'AE1', 'B', 'AH0', 'T']
kabat ['K', 'AE1', 'B', 'AH0', 'T']
rabbit ['R', 'AE1', 'B', 'AH0', 'T']


There are problems with version A, 
illustrated by the target word *kicked*.

In [20]:
tgt = pron_dict['kicked'][0]
for (wd,ps) in pron_dict.items():
    for p in ps:
       if p[1:] == tgt[1:]: 
          print(wd, p)

kicked ['K', 'IH1', 'K', 'T']
licht ['L', 'IH1', 'K', 'T']
licked ['L', 'IH1', 'K', 'T']
nicked ['N', 'IH1', 'K', 'T']
picht ['P', 'IH1', 'K', 'T']
picked ['P', 'IH1', 'K', 'T']
ticked ['T', 'IH1', 'K', 'T']


An impressive list but notice some missing words.

```
clicked
tricked
strict
```

In [22]:
print(pron_dict['clicked'][0])
print(pron_dict['tricked'][0])
print(pron_dict['strict'][0])

['K', 'L', 'IH1', 'K', 'T']
['T', 'R', 'IH1', 'K', 'T']
['S', 'T', 'R', 'IH1', 'K', 'T']


These words are in fact in the dictionary and they are easily discoverable rhymes.
The problem is that our current code for finding rhymes doesn't catch these cases.

### Problem statement

The sample code above prints out rhymes.  Instead of writing
code that **prints out** the rhymes, write a **function** called `find_rhymes`
that takes one argument, an English word, and uses the CMU pronunciation
dictionary to return the set of all its
rhymes.  It should work roughly like this:

```
>>> find_rhymes('kicked')
{'kicked', 'licht', 'licked', 'nicked', 'picht', 'picked', 'ticked','clicked', 'tricked', 'strict'}
```

Note: your program may not return these results in this order and may find other rhymes as well,
but it should return at least these rhymes for *kicked*.   Also to make life a tad
easier, just find rhymes for the preferred pronunciation of the target
word (The first pronunciation in the pronunciation list of a word is its **preferred**
pronunciation).

Let's spell this out in a list of program **specs**.

1.  Figure out why the last three rhymes for *kicked* are missed by version A.

    The function `find_rhymes` should fix the problem of the three missing rhymes for *kicked*.  So getting 'kicked' right means finding all the rhymes for *kicked* found by version A of the program **plus** at least the three new ones listed above, *clicked*, *tricked* and *strict*.

2.  Make sure that if A is a rhyme of B, B is also a rhyme of A.

    So we also want the following behavior:
    
    ```
    >>> find_rhymes('strict')
    {'kicked', 'licht', 'licked', 'nicked', 'picht', 'picked', 'ticked','clicked', 'tricked', 'strict'}
    ```
    
3.  Getting it right **also** means not getting the following false rhymes.

    ```
    bellowed billowed
    bellowed furloughed
    bellowed vetoed
    ```
    In brief, the only rhyme for *bellowed* should be *mellowed*.

4.  It will be useful to write an auxiliary function `xx`, which is called by `find_rhymes`. The function `xx` inputs a word pronunciation and returns the part of the pronunciation which has to be matched by a rhyme.

5.  The output from the examples above does have what might be called a bug.  A word is considered a rhyme of itself.  You do not have to worry about fixing this.

6.  In order to solve this,  it will be useful to distinguish between consonants and vowels.  The CMU pronunciation dictionary represents stress information in the form of numbers attached to the vowels. Since only vowels come with stress numbers, the presence of a stress number is a  reliable test for vowelhood:


In [73]:
# Collect all the sound representations used in the dictionary
sounds = {s for ps in pron_dict.values() for p in ps for s in p}
print(f'There are {len(sounds)} sounds')
print(sounds)
vowels = {s for s in sounds if s[-1] in '012'}
print('vowels')
print(vowels)
consonants = sounds - vowels
print('consonants')
print(consonants)

There are 70 sounds
{'AH1', 'K', 'IY0', 'UW1', 'AA2', 'IH0', 'N', 'OY0', 'AY1', 'EH1', 'AO2', 'SH', 'IY2', 'IY1', 'B', 'UH1', 'OW0', 'F', 'OW2', 'DH', 'AW1', 'W', 'EY2', 'ZH', 'EH0', 'ER2', 'IH2', 'ER1', 'AE2', 'P', 'HH', 'AY2', 'EY1', 'AH0', 'EH2', 'M', 'V', 'AE1', 'AO0', 'AA1', 'ER0', 'UW2', 'UH2', 'L', 'NG', 'R', 'T', 'AW2', 'Y', 'CH', 'OY1', 'EY0', 'AY0', 'G', 'OW1', 'AO1', 'TH', 'UH0', 'Z', 'IH1', 'UW', 'S', 'JH', 'OY2', 'AH2', 'AW0', 'UW0', 'AA0', 'AE0', 'D'}
vowels
{'AH1', 'IY0', 'UW1', 'AA2', 'IH0', 'OY0', 'AY1', 'EH1', 'AO2', 'IY1', 'IY2', 'UH1', 'OW0', 'OW2', 'AW1', 'EY2', 'ER2', 'EH0', 'IH2', 'ER1', 'AE2', 'AY2', 'EY1', 'AH0', 'EH2', 'AE1', 'AO0', 'AA1', 'ER0', 'UW2', 'UH2', 'AW2', 'OY1', 'EY0', 'AY0', 'OW1', 'AO1', 'UH0', 'IH1', 'OY2', 'AH2', 'AW0', 'UW0', 'AA0', 'AE0'}
consonants
{'K', 'N', 'SH', 'B', 'F', 'DH', 'W', 'ZH', 'P', 'HH', 'M', 'V', 'L', 'NG', 'R', 'CH', 'T', 'Y', 'G', 'TH', 'Z', 'UW', 'S', 'JH', 'D'}


NB:  For nitpickers.  There appears to be a bug because 'UW', which phonetically is the vowel
in `food`, shows up among the consonants.  

Bottom line:  Ability to bear stress is a very useful operational definition of vowel, arguably the right one for the present problem.  Don't worry about this "exception"-al consonant.  There is exactly one occurrence of the symbol `UW` (as opposed to `UW0`, `UW1` and `UW2`) in the entire pronunciation dictionary, and it's a very peculiar case
which you can find yourself (by looping through all pronunciations of all words), if interested.
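
The stress-digit test can be packaged as a one-line helper. This is just a sketch of the operational definition described above; the name `is_vowel` is made up for illustration, and the pronunciation of *kicked* is copied from the earlier output.

```python
def is_vowel(sound):
    """A sound counts as a vowel here if it ends in a stress digit (0, 1 or 2)."""
    return sound[-1] in '012'

kicked = ['K', 'IH1', 'K', 'T']  # copied from the output above
print([(s, is_vowel(s)) for s in kicked])

# The bare symbol 'UW' has no stress digit, so this test treats it as a consonant
print(is_vowel('UW1'), is_vowel('UW'))
```

Under this definition *kicked* has exactly one vowel, `IH1`, and the stray `UW` is classified as a consonant, as in the output above.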

### Extra credit problem statement

The remarks above lead in a fruitful direction but they do not really solve the problem of rhyme.

Here are some examples which are harder to get and require going a little deeper into the
linguistics.  Arguably the following are rhymes and merely implementing the ideas above will not
find them.

```
incredible inedible
astounding  resounding
```

To get these right, you need an additional idea, stress.  The key idea is
stated in the Wikipedia definition of rhyme:

  **A rhyme is a repetition of similar sounds (usually, exactly the same sound) in the final stressed syllables and any following syllables of two or more words.** 

So there has to be a match on the stressed syllable and all that follows, but that allows 
mismatches **before** the first stressed syllable, as with *astounding* and *resounding*.
The number 1 on a vowel means it receives primary stress, 0 means unstressed, and 2 means
secondary stress (which falls between primary and unstressed in prominence).

For extra credit, write the code so that it finds *inedible* and *incredible* as
rhymes, as well as *astounding* and *resounding*.  Make sure you keep getting all the
previous examples right, including avoiding false rhymes.

In [87]:
pron_dict['facilitation'],pron_dict['incredible'],pron_dict['inedible'],\
  pron_dict['astounding'],pron_dict['resounding']

([['F', 'AH0', 'S', 'IH2', 'L', 'AH0', 'T', 'EY1', 'SH', 'AH0', 'N']],
 [['IH0', 'N', 'K', 'R', 'EH1', 'D', 'AH0', 'B', 'AH0', 'L']],
 [['IH0', 'N', 'EH1', 'D', 'AH0', 'B', 'AH0', 'L']],
 [['AH0', 'S', 'T', 'AW1', 'N', 'D', 'IH0', 'NG']],
 [['R', 'IY0', 'S', 'AW1', 'N', 'D', 'IH0', 'NG']])
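
To read the stress information in these pronunciations at a glance, it can help to pull out just the stress digits of the vowels. The sketch below does only that; the function name `stress_pattern` is made up for illustration, and the pronunciations are copied from the output above.

```python
def stress_pattern(pron):
    """Return the stress digits of a pronunciation's vowels, in order, as a string."""
    return ''.join(sound[-1] for sound in pron if sound[-1] in '012')

# Pronunciations copied from the output above
samples = {
    'facilitation': ['F', 'AH0', 'S', 'IH2', 'L', 'AH0', 'T', 'EY1', 'SH', 'AH0', 'N'],
    'incredible': ['IH0', 'N', 'K', 'R', 'EH1', 'D', 'AH0', 'B', 'AH0', 'L'],
    'inedible': ['IH0', 'N', 'EH1', 'D', 'AH0', 'B', 'AH0', 'L'],
    'astounding': ['AH0', 'S', 'T', 'AW1', 'N', 'D', 'IH0', 'NG'],
    'resounding': ['R', 'IY0', 'S', 'AW1', 'N', 'D', 'IH0', 'NG'],
}

for word, pron in samples.items():
    print(word, stress_pattern(pron))
```

The two rhyming pairs share a stress pattern (`0100` and `010`), while *facilitation* comes out as `02010`, with a secondary stress before the primary one.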