In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw05.ipynb")

# Homework 5 – Dictionaries

## Data 6, Summer 2022

In this homework assignment, you will solve problems involving dictionaries, a key data structure you'll need to be familiar with moving forward. You'll also gain some experience with reading in real data.

This homework is due on **Thursday, August 11th at 11:00PM**. You must submit the assignment to Gradescope. Submission instructions can be found at the bottom of this notebook. See the [syllabus](https://data6.org/su22/syllabus/) for our late submission policy.

**Note:** In this homework, we've tried our best to add helpful comments to the test cases that you will be able to see. If you fail a test case, read the error message to look for our comments, as they may point you in the right direction!

In [None]:
# Just run this cell to load in the relevant dependencies

from datascience import *
from data6_utils import *
import numpy as np
import plotly.express as px
from ipywidgets import interact, widgets
from IPython.display import HTML, display, clear_output, Image
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

<hr style="border: 5px solid #003262;" />
<hr style="border: 1px solid #fdb515;" />

## Part 1: Dictionary Fundamentals

<br></br>
<hr style="border: 1px solid #fdb515;" />

## Question 1 – Syntax

In this question, you will solidify your understanding of the syntax necessary for working with dictionaries. You'll also learn how to read in data from external files.

---
## Question 1a

Below, we create a dictionary that contains modern-day slang acronyms and their corresponding full forms.

In [None]:
# DO NOT EDIT THIS CELL – just run it!

more_slang = {
    'haha': 'that was not funny',
    'smh': 'shake my head',
    'lol': 'laugh out loud',
    'GOAT': 'greatest of all time'
}

In the cell below, add a new key-value pair to `more_slang`, corresponding to the abbreviation `'ofr'`. The value can be any three-word string whose words start with `'o'`, `'f'`, and `'r'`, in that order. You should not change the cell above.
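
As a reminder of the syntax, here is a minimal sketch using a made-up dictionary called `airport_codes` (it is hypothetical and has nothing to do with `more_slang`). It shows that assigning to a key that isn't in the dictionary adds a new pair, while assigning to a key that already exists overwrites its value.

```python
# Hypothetical dictionary, unrelated to the assignment's data.
airport_codes = {
    'SFO': 'San Francisco',
    'OAK': 'Oakland',
}

# Assigning to a key that doesn't exist yet adds a new key-value pair.
airport_codes['SJC'] = 'San Jose'

# Assigning to an existing key overwrites its value instead.
airport_codes['OAK'] = 'Oakland International'

print(airport_codes)
```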

<!--
BEGIN QUESTION
name: q1a
points: 1
-->

In [None]:
...

In [None]:
grader.check("q1a")

---
## Question 1b

In the cell below, we've created a new dictionary `even_more_slang` which is a copy of your `more_slang` from 1a. (We did this in order to make the autograder work correctly.)

**Task:** Your job is to add another key-value pair to `even_more_slang`. The key should be the string `'explicit'`, and the value should be another dictionary. In this nested dictionary, the two keys should be the strings `'lmao'` and `'fml'`, and the values should be four-word and three-word strings that abbreviate to `'lmao'` and `'fml'`, respectively. Don't use any swear words – we don't want to lose our jobs! 😅

That is, after running your cell, `even_more_slang['explicit']['fml']` should be a string consisting of three words.

*Reminder:* The keys of a dictionary can be strings, numbers, bools, or even `None` – just not a list or other dictionary. On the other hand, values in a dictionary can be anything!
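
The reminder above can be checked with a small sketch. The dictionary `mixed_keys` is a made-up example, not something used elsewhere in this homework; it shows several valid key types, a nested dictionary as a value, and the `TypeError` that Python raises when a list is used as a key.

```python
# Hypothetical example: keys of several hashable types in one dictionary.
mixed_keys = {
    'name': 'string key',
    42: 'integer key',
    3.5: 'float key',
    True: 'bool key',
    None: 'None key',
}

# A value can itself be a dictionary, which gives a nested structure.
mixed_keys['nested'] = {'inner': 'value inside a nested dictionary'}
inner_value = mixed_keys['nested']['inner']

# A list is mutable and unhashable, so using one as a key raises TypeError.
try:
    mixed_keys[['a', 'list']] = 'this will fail'
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

print(inner_value)
print(list_key_allowed)  # False
```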

<!--
BEGIN QUESTION
name: q1b
points: 2
-->

In [None]:
even_more_slang = more_slang.copy() # Don't change this

explicit_dict = {
    ...
}

...

In [None]:
grader.check("q1b")

---
## Question 1c

We can also read and convert JSON files into Python dictionaries. That's what you'll do in this question.
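
To get a feel for how JSON maps onto Python dictionaries, here is a rough sketch using Python's built-in `json` module on a made-up JSON string. This is only an illustration: the `read_json` helper used later in this question comes from `data6_utils`, and we are not showing (or assuming) how it is implemented.

```python
import json

# Hypothetical JSON text; the real files live in the `data` folder.
json_text = '{"name": "Example Cafe", "rating": 4.5, "tags": ["coffee", "wifi"]}'

# json.loads parses a JSON string; json.load does the same for an open file.
parsed = json.loads(json_text)

print(type(parsed))      # <class 'dict'>
print(parsed['rating'])  # 4.5
print(parsed['tags'])    # ['coffee', 'wifi']
```

JSON objects become dictionaries, JSON arrays become lists, and JSON strings and numbers become Python strings and numbers.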

Before following these instructions, make sure to save your notebook (which you should be doing frequently anyway)!

1. Right click the Jupyter logo in the top left of your screen, and click "Open Link in New Tab" (it may appear as Open...)
2. Click the `data` folder.
3. Identify the name of the `.json` file that contains Google Maps data. You may have to open both `.json` files to determine which one it is; you can open files by clicking on them.
4. Set the string `maps_path` below equal to the path to that file. `maps_path` should start with `'data/'` and end with `'.json'`.

<!--
BEGIN QUESTION
name: q1c
points: 1
-->

In [None]:
maps_path = ...

In [None]:
grader.check("q1c")

If you answered the previous part correctly, you should be able to run the following cell:

In [None]:
maps_data = read_json(maps_path)
maps_data

---
## Question 1d

The dictionary above is quite unwieldy, and contains many nested dictionaries! Let's try to extract some data from it programmatically (that is, using code).

**Task**: Assign `maps_data_keys` to the `dict_keys` object of all of `maps_data`'s keys. Don't just manually type in all of the keys.

_Hint_: `len(maps_data_keys)` will tell you that there are 6 keys. `'long_name'` is not a key of `maps_data`.
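
The second half of the hint is worth seeing in action: a dictionary's keys only include the top level, not the keys of any dictionaries nested inside it. Here is a small sketch with a made-up dictionary `book` (hypothetical, unrelated to `maps_data`).

```python
# Hypothetical dictionary, unrelated to `maps_data`.
book = {
    'title': 'Example Title',
    'pages': 120,
    'chapters': {'intro': 1},
}

book_keys = book.keys()
print(book_keys)       # dict_keys(['title', 'pages', 'chapters'])
print(len(book_keys))  # 3

# Only top-level keys are included; keys of nested dictionaries are not.
print('intro' in book_keys)  # False
```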

<!--
BEGIN QUESTION
name: q1d
points: 1
-->

In [None]:
maps_data_keys = ...
maps_data_keys

In [None]:
grader.check("q1d")

---
## Question 1e

Finally, assign `key_1`, `key_2`, and `key_3` below so that `maps_data[key_1][key_2][key_3]` evaluates to the latitude of the location whose data is stored in `maps_data`. We've done `key_2` for you.

_Hint_: Work one step at a time. You know that `key_1` must be one of the six keys in `maps_data_keys`, which you found above. Then, given what we've set `key_2` to, what must `key_3` be?
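
Here is a sketch of what "one step at a time" can look like, using a made-up nested dictionary `weather` (hypothetical; its keys are not the ones in `maps_data`). At each step we index with one key and then look at which keys are available next.

```python
# Hypothetical nested dictionary, unrelated to `maps_data`.
weather = {
    'city': 'Exampleville',
    'today': {
        'temperature': {'high': 75, 'low': 58},
        'forecast': 'sunny',
    },
}

# Peel off one layer at a time and inspect the keys available at each step.
step_1 = weather['today']
print(step_1.keys())   # dict_keys(['temperature', 'forecast'])

step_2 = step_1['temperature']
print(step_2.keys())   # dict_keys(['high', 'low'])

# Chaining the same three keys gives the same result in one expression.
high = weather['today']['temperature']['high']
print(high)            # 75
```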

<!--
BEGIN QUESTION
name: q1e
points: 2
-->

In [None]:
key_1 = ...
key_2 = 'location'
key_3 = ...

maps_data[key_1][key_2][key_3]

In [None]:
grader.check("q1e")

By the way, `maps_data` contains location information for Bonchon, a Korean fried chicken restaurant in Downtown Berkeley. It's quite good, you should try it!

<br></br>
<hr style="border: 1px solid #fdb515;" />

## Question 2 – Emojify 

The default keyboard on iOS suggests emojis for you to use in place of boring, ordinary words.

<img src = "https://support.apple.com/library/content/dam/edam/applecare/images/en_US/iOS/ios12-iphone-x-messages-replace-words-with-emoji.jpg" width=200>

In this question, you will replicate some of that behavior using dictionaries!


### Emojis in Python
In Python, emojis can be included as part of a string. For example:

In [None]:
'🤤'

If you remove the quotes from the emoji above, you will see a `SyntaxError` (for example, `SyntaxError: invalid character in identifier`; the exact message depends on your Python version). **Make sure that throughout this question, your emojis are contained within strings!** (Fun fact: emojis cannot currently be used as variable names. Try it and see what error you get!)
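
Once an emoji is inside a string, it behaves like any other character. The short sketch below uses made-up variable names (`drool` and `message`) to show concatenation, membership testing, and replacement.

```python
# Emojis behave like any other character once they are inside a string.
drool = '🤤'
message = 'That pizza looks amazing ' + drool

print(message)
print(drool in message)             # True
print(message.replace(drool, '!'))  # That pizza looks amazing !
```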

---
## Question 2a

In the cell below, define a dictionary `fav_emojis` that has the following **five** keys:
- `'happy'`
- `'annoyed'`
- `'tired'`
- `'love'`
- `'food'`

Each of the values corresponding to these five keys must be an emoji. [getemoji.com](https://getemoji.com) allows you to copy and paste emojis. To select an emoji, double click it to highlight it. You may choose any emojis you would like **as long as**:
>- it is copied from the site above
>- it is not in the "New Emojis" category at the bottom

Have fun with it! We've chosen an emoji for `'happy'` for you, but feel free to change it.

**Some troubleshooting tips:**
- After defining your dictionary, you may see some emojis displayed with `'\U001...'` instead of their actual graphic. **If this happens, pick different emojis**.
- If you fail the test case that says your emojis are invalid, and you're certain you correctly defined your dictionary, consider choosing more generic emojis, which are more likely to be recognized by our autograder. This most likely won't be a problem.

<!--
BEGIN QUESTION
name: q2a
points: 2
-->

In [None]:
fav_emojis = {
    'happy': '😀',
    ...
}

fav_emojis

In [None]:
grader.check("q2a")

---
## Question 2b

Now, complete the implementation of the function `emojify`, which takes in a string `message` and returns a new string with all instances of any of the keys in `fav_emojis` replaced with their corresponding emoji value. Example behavior is shown below, though the emojis will be different, depending on what you put in `fav_emojis`. If you passed the previous question, you don't need to change your emojis!

```py
>>> emojify('Filing taxes makes me tired and want food.')
'filing taxes makes me 😵 and want 🌽.'

>>> emojify('I LOVE love life right now. I am so happy – why do you look so annoyed?!')
'i 💋 💋 life right now. i am so 😀 – why do you look so 💀?!'

>>> emojify("It's not you, it's me... I don't make you haPPy, I make you tired.")
"it's not you, it's me... i don't make you 😀, i make you 😵."
```

*Hint*: You may have seen a similar exercise in lecture.
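
Two string methods do most of the work here: `lower` and `replace`. The sketch below uses a made-up string (not `fav_emojis`) to show that both methods return new strings rather than modifying the original, and that `replace` swaps out every occurrence, not just the first.

```python
# Hypothetical example, separate from `fav_emojis`.
original = 'Hello WORLD, hello again'

# Strings are immutable: lower() and replace() return new strings.
lowered = original.lower()
replaced = lowered.replace('hello', '👋')

print(lowered)   # hello world, hello again
print(replaced)  # 👋 world, 👋 again
print(original)  # Hello WORLD, hello again
```

Because these methods return new strings, you need to assign the result back to a variable if you want to keep it.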

<!--
BEGIN QUESTION
name: q2b
points: 3
-->

In [None]:
def emojify(message):
    # This line ensures your code replaces correctly if any of
    # the keys in fav_emojis appears in uppercase in the message
    message = message.lower()

    ...
    
    # Don't change this
    return message

In [None]:
grader.check("q2b")

### Fun Demo

Run the cell below to produce a text box (don't worry about the code itself). Type text in the text box and watch it get emojified live!

In [None]:
def emojify_live(type_here):
    display(HTML('<h2>' + emojify(type_here) + '</h2>'))
interact(emojify_live, type_here="I LOVE food");

<br></br>
<hr style="border: 1px solid #fdb515;" />

## Question 3 – Value Counts and State of the Union

In this question you will implement one function, `value_counts`, and we will use it for a particularly interesting application – analyzing State of the Union speeches from recent US Presidents.

<br></br>

---
## Question 3a

`value_counts` will take in an array `vals` and return a dictionary describing the number of times each element in `vals` appeared. Specifically, the keys of the returned dictionary should be the elements of `vals`, and the values of the returned dictionary should be the number of times the keys occurred in `vals`.

Example behavior is shown below.

```py
>>> value_counts(make_array('the', 'dog', 'jumped', 'over', 'the', 'fence', 'over', 'there'))
{'fence': 1, 'there': 1, 'jumped': 1, 'the': 2, 'dog': 1, 'over': 2}

>>> value_counts(make_array('just dance', 'down', 'dancing queen', 'dancing queen', 'dancing queen', 'just dance'))
{'dancing queen': 3, 'down': 1, 'just dance': 2}

>>> value_counts(make_array(4, 5, 1, 9, 9, 3, 2, 4, 9, 9, 5))
{1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 9: 4}
```

In the second example, the value for `'dancing queen'` is 3 because `'dancing queen'` appeared three times in the array `make_array('just dance', 'down', 'dancing queen', 'dancing queen', 'dancing queen', 'just dance')`. Note, the order in which the keys are displayed in your dictionary is not important.

**To help you,** we want to point you to two functions/methods. One is `np.unique`, for which the documentation can be found [here](https://numpy.org/doc/stable/reference/generated/numpy.unique.html). 

In [None]:
np.unique(make_array(4, 5, 1, 9, 9, 3, 2, 4, 9, 9, 5))

Another useful function is `np.count_nonzero`, which, with a little ingenuity, can be used to determine the number of appearances of a particular value in an array; its documentation can be found [here](https://numpy.org/doc/stable/reference/generated/numpy.count_nonzero.html).

Below, we create an array of `True` and `False` values corresponding to whether or not each element within the array is equal to 9. Then, we count how many nonzero elements (i.e. `True` elements) there are. This number represents the number of 9's that appeared in our original array.

In [None]:
np.count_nonzero(make_array(4, 5, 1, 9, 9, 3, 2, 4, 9, 9, 5) == 9)

You should use both of the above techniques (`np.unique` and `np.count_nonzero`) in your implementation of `value_counts`.
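
If it helps to see the intermediate steps separately, here is a small sketch on a made-up array of strings. We use `np.array` directly instead of `make_array` so that the example only depends on `numpy`; the array `pets` is hypothetical and not part of the assignment.

```python
import numpy as np

# Hypothetical example data, not part of the assignment.
pets = np.array(['cat', 'dog', 'cat', 'bird', 'cat'])

# Comparing an array to a single value produces one bool per element.
is_cat = pets == 'cat'
print(is_cat)  # [ True False  True False  True]

# True counts as nonzero, so this is the number of 'cat' entries.
num_cats = np.count_nonzero(is_cat)
print(num_cats)  # 3

# np.unique returns the distinct values in sorted order.
distinct_pets = np.unique(pets)
print(distinct_pets)  # ['bird' 'cat' 'dog']
```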

**Task**: In the cell below, implement `value_counts` so that it matches the behavior described above.

<!--
BEGIN QUESTION
name: q3a
points: 3
-->

In [None]:
def value_counts(vals):
    unique_vals = ...
    frequencies = ...
    for val in unique_vals:
        ...
    return frequencies

In [None]:
grader.check("q3a")

<br></br>
<hr style="border: 1px solid #fdb515;" />

## State of the Union Addresses

**Note:** The rest of the cells in this question rely on your `value_counts` function being completed and correct. You will not have to write code in the rest of this question, but you'll need to answer a written question.

Each year, the sitting US President delivers a [State of the Union](https://en.wikipedia.org/wiki/State_of_the_Union) (SOTU) address detailing the "current condition of the nation". In the remainder of this question, we will use your `value_counts` function to visualize the most common words in speeches by Presidents George W. Bush, Barack Obama, and Donald Trump.

We will load in three arrays:
- `bush_sotu`, which contains all of the words used by George Bush in his **eight** SOTU speeches
- `obama_sotu`, which contains all of the words used by Barack Obama in his **eight** SOTU speeches
- `trump_sotu`, which contains all of the words used by Donald Trump in his **2017** SOTU speech only

Run the cell below to read this information.

In [None]:
# Run this cell.

bush_sotu = load_clean_split('data/bush-sotu.txt')
print('words by Bush: ', len(bush_sotu))

obama_sotu = load_clean_split('data/obama-sotu.txt')
print('words by Obama: ', len(obama_sotu))

trump_sotu = load_clean_split('data/trump-sotu.txt')
print('words by Trump: ', len(trump_sotu))

Let's take a look at `bush_sotu`:

In [None]:
bush_sotu

There are lots of words here! As we mentioned before, we want to determine the number of occurrences of each word. Run the cell below to compute the frequency of each word in the speeches of Bush, Obama, and Trump. This cell may take up to 30 seconds to run.

In [None]:
# Run this cell. Be patient!
bush_dict = value_counts(bush_sotu)
print('unique words by Bush: ', len(bush_dict))

obama_dict = value_counts(obama_sotu)
print('unique words by Obama: ', len(obama_dict))

trump_dict = value_counts(trump_sotu)
print('unique words by Trump: ', len(trump_dict))

Note, there are far fewer unique words by Trump since we're only using the text of one of his speeches, while we're using the text of all eight speeches for each of Bush and Obama.

Let's look at `bush_dict`:

In [None]:
bush_dict

Dictionaries aren't sorted by their values. However, we can extract two arrays from the above dictionary, corresponding to words and their frequencies in sorted order. After doing so, we can visualize the frequencies of the most common words used by each President.
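
The cells below use a helper called `sort_by_value` from `data6_utils` to do this. Its implementation isn't shown in this notebook, so the following is only a sketch of the general idea, under two assumptions: that we want the largest counts first, and that plain lists are good enough (the real helper may return arrays and may behave differently). The dictionary `counts` is made up.

```python
# Hypothetical word counts, unrelated to the speech data.
counts = {'apple': 3, 'kiwi': 7, 'pear': 1}

# Sort the (word, count) pairs by count, largest first.
pairs = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

# Split the sorted pairs into two parallel lists.
words = [word for word, count in pairs]
frequencies = [count for word, count in pairs]

print(words)        # ['kiwi', 'apple', 'pear']
print(frequencies)  # [7, 3, 1]
```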

Note, we're going to ignore some of the more common words like "the", "that", and "its" because they won't really tell us anything about the content of a President's speeches.

Run the cell below to get things set up; don't worry about the code itself.

In [None]:
# Just run this cell. 
words_to_ignore = make_array('the', 'and', 'to', 'of', 'in', 'we', 'a', 'our', 'is', 'that', 'will', 'for',
                   'are', 'have', 'this', 'i', 'on', 'with', 'by', 'their', 'it', 'you', 'be', 'they',
                   'not', 'from', 'must', 'all', 'has', 'so', 'as', 'can', 's', 'us', 'who', 'or', 'at',
                   'them', 'these', 'an', 'new', 'he', 'him', 'his', 'she', 'her', 'hers', 
                   'but', 'was', 'my', 've', 'do', 'than', 'its', 't', 'nt', 're', 'if', 'also')

def create_freq_df(words, counts):
    df = Table().with_columns('word', words, 'count', counts).to_df()
    return df[~df['word'].isin(words_to_ignore)].head(50)

bush_table = create_freq_df(*sort_by_value(bush_dict))
obama_table = create_freq_df(*sort_by_value(obama_dict))
trump_table = create_freq_df(*sort_by_value(trump_dict))

def plot_frequency(name):
    name_to_table = {
        'George Bush': bush_table,
        'Barack Obama': obama_table,
        'Donald Trump': trump_table
    }
    fig = px.bar(name_to_table[name], x = 'word', y = 'count', title = 'Frequency of words used by ' + name)
    fig.show()
    
DEFAULT = 'George Bush'

dropdown_pres = widgets.Dropdown(options = ['George Bush', 'Barack Obama', 'Donald Trump'],
                                   value = DEFAULT)

def dropdown_pres_eventhandler(change):
    if change['name'] == 'value' and (change['new'] != change['old']):
        clear_output()
        display(dropdown_pres)
        plot_frequency(change['new'])

Finally, run this cell to produce an interactive visualization of word frequencies. You'll be able to change the President whose word counts we're visualizing via a dropdown menu.

In [None]:
display(dropdown_pres)

plot_frequency(DEFAULT)
    
dropdown_pres.observe(dropdown_pres_eventhandler)

<!-- BEGIN QUESTION -->

---
## Question 3b

Look at the word frequencies for all three US Presidents in the plot above. Did you notice anything interesting? Try and find a word or two that one President used very frequently that other Presidents did not use very often. In the cell below, tell us which President and word(s) you found, and try and give a reason why.

*Note*: there's no right answer here – we're just looking to make sure you played around with the plots above and thought about your answer. Have fun with it!

<!--
BEGIN QUESTION
name: q3b
points: 2
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



# Done!

Congratulations, you've finished your last homework assignment in Data 6! To submit your work, follow the steps outlined on the [Homework Submission Page](https://data6.org/su22/submissions/) of the course website.

The point breakdown for this assignment is given in the table below:

| **Category** | Points |
| --- | --- |
| Autograder | 15 |
| Written (Question 3b) | 2 |
| **Total** | 17 |

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()