# Dictionaries
Dictionaries are like sets, except that the "elements" of the dictionary are treated as keys, and a value is associated with each key. As in sets, the keys of a dictionary must be hashable objects. The values, however, can be anything, including mutable objects. The value associated with a key can also be changed after the fact.
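As a quick illustration of the hashability rule, here is a minimal sketch. The names `prices` and `bad_key` are made up for this example; it only shows that hashable objects work as keys, that values can be mutated in place, and that a list is rejected as a key.

```python
# Hypothetical example data; the names are for illustration only.
prices = {}

# Strings and tuples of hashable items are hashable, so they work as keys.
prices['apple'] = [4, 5.5]
prices[('apple', 'organic')] = [6]

# Values may be mutable: the list stored under 'apple' is changed in place.
prices['apple'].append(3)

# A list is mutable and therefore unhashable, so it cannot be used as a key.
bad_key = ['apple', 'organic']
try:
    prices[bad_key] = [6]
    key_error = None
except TypeError as err:
    key_error = str(err)

print(prices)
print("error:", key_error)
```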

In [1]:
# initialize sets
obj1 = set()
obj2 = {'hi'}

# initialize dictionaries
obj3 = {}
obj4 = dict()
obj5 = {'hi': 'hello'}

print ('type of obj1 = ', type(obj1))
print ('type of obj2 = ', type(obj2))
print ('type of obj3 = ', type(obj3))
print ('type of obj4 = ', type(obj4))
print ('type of obj5 = ', type(obj5))

type of obj1 =  <class 'set'>
type of obj2 =  <class 'set'>
type of obj3 =  <class 'dict'>
type of obj4 =  <class 'dict'>
type of obj5 =  <class 'dict'>


In [2]:
fruits = {}
fruits['apple'] = 4
fruits['pear'] = 6
fruits['orange'] = 4
print(fruits)

{'apple': 4, 'pear': 6, 'orange': 4}


In [3]:
fruits = {}
fruits['apple'] = [4, 5.5, 3]
fruits["pear"] = [6, 6.5, 7]
print(fruits)

{'apple': [4, 5.5, 3], 'pear': [6, 6.5, 7]}


In [4]:
# Is a key in the dictionary? Use the 'in' operator.
if 'pear' in fruits:
    print ('They have pears!')
    
if 'orange' not in fruits:
    print ("They don't have oranges!")

They have pears!
They don't have oranges!


In [5]:
fruits['orange'] = [5.4, 4.3, 5.5]
print("fruits:", fruits)

fruits: {'apple': [4, 5.5, 3], 'pear': [6, 6.5, 7], 'orange': [5.4, 4.3, 5.5]}


In [6]:
# Iterate over keys in a dictionary. Since Python 3.7, keys come back in insertion order.
for key in fruits:
    print("key:", key, "\tval:", fruits[key])

key: apple 	val: [4, 5.5, 3]
key: pear 	val: [6, 6.5, 7]
key: orange 	val: [5.4, 4.3, 5.5]


In [7]:
# We can also iterate over (key, val) items in a dictionary, again in insertion order.
for key, val in fruits.items():
    print("key:", key, "\tval:", val)

key: apple 	val: [4, 5.5, 3]
key: pear 	val: [6, 6.5, 7]
key: orange 	val: [5.4, 4.3, 5.5]


In [8]:
print ("keys = \t\t", list(fruits.keys()))
print ("values = \t", list(fruits.values()))

keys = 		 ['apple', 'pear', 'orange']
values = 	 [[4, 5.5, 3], [6, 6.5, 7], [5.4, 4.3, 5.5]]


In [9]:
# Remove element from a dictionary
del fruits['pear']  # raises KeyError if the key is not in the dictionary
print("fruits:", fruits)

fruits: {'apple': [4, 5.5, 3], 'orange': [5.4, 4.3, 5.5]}
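Since `del` raises an exception for a missing key, it can help to know the non-raising alternatives. The sketch below rebuilds a small `fruits` dictionary and contrasts `del` with `dict.pop(key, default)` and `dict.get(key, default)`. The variable names other than `fruits` are illustrative.

```python
fruits = {'apple': [4, 5.5, 3], 'orange': [5.4, 4.3, 5.5]}

# del raises KeyError when the key is absent.
try:
    del fruits['pear']
    outcome = "deleted"
except KeyError:
    outcome = "KeyError"

# pop with a default removes the key if present and returns the default otherwise.
removed_pear = fruits.pop('pear', None)
removed_apple = fruits.pop('apple', None)

# get reads a value without raising for a missing key.
orange_prices = fruits.get('orange', [])
kiwi_prices = fruits.get('kiwi', [])

print(outcome, removed_pear, removed_apple)
print(orange_prices, kiwi_prices)
```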


In [10]:
# Dictionary comprehensions also possible
{n: n**3 for n in range(8)}

{0: 0, 1: 1, 2: 8, 3: 27, 4: 64, 5: 125, 6: 216, 7: 343}

In [11]:
my_fruits = {'pear': 4, 'apple': 5, 'orange': 3}
new_fruits = {'grape': 4, 'lemon': 8}
print ("my_fruits =\t", my_fruits)
print ("new_fruits =\t", new_fruits)

my_fruits =	 {'pear': 4, 'apple': 5, 'orange': 3}
new_fruits =	 {'grape': 4, 'lemon': 8}


In [12]:
my_fruits.update(new_fruits)
print ("my_fruits =\t", my_fruits)

my_fruits =	 {'pear': 4, 'apple': 5, 'orange': 3, 'grape': 4, 'lemon': 8}


In [13]:
my_fruits = {'pear': 4, 'apple': 5, 'orange': 3}
new_fruits = {'apple': 8}
print ("my_fruits = \t", my_fruits)
print ("new_fruits = \t", new_fruits)

my_fruits = 	 {'pear': 4, 'apple': 5, 'orange': 3}
new_fruits = 	 {'apple': 8}


In [14]:
my_fruits.update(new_fruits)
print ("my_fruits =\t", my_fruits)

my_fruits =	 {'pear': 4, 'apple': 8, 'orange': 3}
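Note that `update` modifies `my_fruits` in place. If the original dictionary should stay untouched, one option is to build a merged copy instead. This is a small sketch using dictionary unpacking; the name `merged` is an assumption for the example.

```python
my_fruits = {'pear': 4, 'apple': 5, 'orange': 3}
new_fruits = {'apple': 8, 'grape': 4}

# Build a new dictionary; on duplicate keys the right-most value wins.
# (In Python 3.9+ the same merge can be written as my_fruits | new_fruits.)
merged = {**my_fruits, **new_fruits}

print("merged =\t", merged)
print("my_fruits =\t", my_fruits)  # unchanged
```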


In [15]:
fruits = {}
fruits['apple'] = [4, 5.5, 3]
fruits["pear"] = [6, 6.5, 7]
print(fruits)

{'apple': [4, 5.5, 3], 'pear': [6, 6.5, 7]}


In [16]:
def add_fruit_price(fruit, price):
#     fruits.setdefault(fruit, []).append(price)
    if fruit in fruits:
        fruits[fruit].append(price)
    else:
        fruits[fruit] = [price]

In [17]:
add_fruit_price('guava', 8)
print (fruits)

{'apple': [4, 5.5, 3], 'pear': [6, 6.5, 7], 'guava': [8]}
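The "append to the list, creating it if needed" pattern is common enough that the standard library offers shortcuts for it. The sketch below shows two of them: `dict.setdefault` on a plain dictionary and `collections.defaultdict`. The names `plain` and `auto` are made up for the example.

```python
from collections import defaultdict

# Option 1: setdefault returns the existing list, or stores and returns a new one.
plain = {'apple': [4, 5.5, 3]}
plain.setdefault('guava', []).append(8)
plain.setdefault('apple', []).append(4.5)

# Option 2: a defaultdict creates the empty list automatically on first access.
auto = defaultdict(list)
auto['apple'].extend([4, 5.5, 3])
auto['guava'].append(8)
auto['apple'].append(4.5)

print(plain)
print(dict(auto))
```

Both approaches avoid the explicit `if fruit in fruits` test. A `defaultdict` also creates an entry whenever a missing key is merely read with `[]`, which may or may not be wanted.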


## Problem:

Return a new list in which the 2nd instance of the first repeated element of the input list has been removed. The rest of the list should remain unchanged (the same as the input).

## Goal:

Our goal is to use this problem to build some familiarity with sets and dictionaries, so the code versions below are written to use sets and/or dictionaries. This is definitely not the only way to solve the problem, but it also happens to be the most efficient way. (At the very end, we'll also show a more "direct", but less efficient, solution which does not use sets or dictionaries.)

In [18]:
inp = [0, 12, 12, 0, 12, 12, 34, 56, 23, 11, 45, 2, 3, 4, 7, 10, 12]

inp_str = ['zero', 'twelve', 'twelve', 'zero', 'twelve', 'twelve',
          'thirty four', 'fifty six', 'twenty three', 'eleven',
          'forty five', 'two', 'three', 'four', 'seven', 'ten', 'twelve']

### Version 1

In [19]:
def remove_2nd_instance_v1(data):
    print("data:", data)
    
    # create a dictionary with frequencies
    freqd = {}
    for x in data:
        freqd[x] = freqd.get(x, 0) + 1
        
    # look for the first element in the list that is repeated somewhere later
    repeated = None
    for x in data:
        if freqd[x] >= 2:
            repeated = x
            break
    print("first repeated:", repeated)
    
    # look through the list to find where the 2nd instance of the repeat is
    index = len(data)
    count = 0
    for i in range(len(data)):
        if data[i] == repeated:
            count += 1
        if count == 2:
            index = i
            break
    print("index:", index)
    
    output = data[:]
    # remove this 2nd instance, if it exists, from a copy of the input
    if index < len(output):
        output.pop(index)
    return output

In [20]:
remove_2nd_instance_v1(inp)

data: [0, 12, 12, 0, 12, 12, 34, 56, 23, 11, 45, 2, 3, 4, 7, 10, 12]
first repeated: 0
index: 3


[0, 12, 12, 12, 12, 34, 56, 23, 11, 45, 2, 3, 4, 7, 10, 12]

In [22]:
output = remove_2nd_instance_v1(inp_str)
print(output)

data: ['zero', 'twelve', 'twelve', 'zero', 'twelve', 'twelve', 'thirty four', 'fifty six', 'twenty three', 'eleven', 'forty five', 'two', 'three', 'four', 'seven', 'ten', 'twelve']
first repeated: zero
index: 3
['zero', 'twelve', 'twelve', 'twelve', 'twelve', 'thirty four', 'fifty six', 'twenty three', 'eleven', 'forty five', 'two', 'three', 'four', 'seven', 'ten', 'twelve']


## Revised Spec

Our procedure removed 0 ('zero') from the list at index 3, since that is the first repeat of the "earliest" element that has a later repeat. Is that what we want? Our specification is perhaps ambiguous. Arguably, we might want to remove the 12 at index 2, since 12 appeared twice before 0 appeared twice. Let's clarify our spec: by "first repeated element" we mean the element whose second occurrence comes first in the list.

### Version 2

In [26]:
def remove_2nd_instance_v2(data):
    print("data:", data)
    
    freqd = {}
    for i in range(len(data)):
        x = data[i]
        if x in freqd:
            freqd[x][0] += 1
            if freqd[x][0] == 2:
                freqd[x][1] = i
        else:
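            # entry is [count, index of 2nd occurrence]; the index is only a
            # placeholder until the count reaches 2
            freqd[x] = [1, 1]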
    print("freqd =", freqd)
    
    index = len(data)
    for x in data:
        entry = freqd[x]
        if entry[0] >= 2:
            index = min(index, entry[1])
    print("index:", index)
    
    output = data[:]
    if index < len(output):
        output.pop(index)
    return output

In [27]:
remove_2nd_instance_v2(inp)

data: [0, 12, 12, 0, 12, 12, 34, 56, 23, 11, 45, 2, 3, 4, 7, 10, 12]
freqd = {0: [2, 3], 12: [5, 2], 34: [1, 1], 56: [1, 1], 23: [1, 1], 11: [1, 1], 45: [1, 1], 2: [1, 1], 3: [1, 1], 4: [1, 1], 7: [1, 1], 10: [1, 1]}
index: 2


[0, 12, 0, 12, 12, 34, 56, 23, 11, 45, 2, 3, 4, 7, 10, 12]

### Version 3

In [28]:
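def remove_2nd_instance_v3(data):
    print("data:", data)
    
    # index stays at len(data) if no element is ever seen twice,
    # so the slicing below leaves the list unchanged in that case
    index = len(data)
    seen = set()
    for i, x in enumerate(data):
        if x in seen:
            # x has appeared before: this is the first 2nd occurrence
            index = i
            break
        seen.add(x)
    print("freq_set:", seen)
    print("index:", index)
    
    output = data[:index] + data[index+1:]
    return output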

In [29]:
remove_2nd_instance_v3(inp)

data: [0, 12, 12, 0, 12, 12, 34, 56, 23, 11, 45, 2, 3, 4, 7, 10, 12]
freq_set: {0, 12}
index: 2


[0, 12, 0, 12, 12, 34, 56, 23, 11, 45, 2, 3, 4, 7, 10, 12]
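Under the revised spec it is worth checking a few edge cases, in particular inputs with no repeated element, where the list should come back unchanged. Below is a compact sketch of the set-based approach without the debugging prints. The function name `remove_second_occurrence` is hypothetical and not part of the notebook.

```python
def remove_second_occurrence(data):
    """Return a copy of data without the first element that is a 2nd occurrence."""
    seen = set()
    for i, x in enumerate(data):
        if x in seen:
            # First position at which some value shows up for the second time.
            return data[:i] + data[i + 1:]
        seen.add(x)
    # No value occurs twice: return an unchanged copy.
    return data[:]

original = [0, 12, 12, 0]
print(remove_second_occurrence(original))    # the 12 at index 2 is removed
print(remove_second_occurrence([1, 2, 3]))   # no repeats
print(remove_second_occurrence([]))          # empty input
```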

### Version 4

In [32]:
def remove_2nd_instance_v4(data):
    print("data:", data)
    
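    # For each element in order, look for a later occurrence of the same value.
    # This follows the original spec (like v1): the first element that has any
    # later repeat wins. list.index() rescans the list, so the worst case is O(n^2).
    for i in range(len(data)):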
        try:
            ii = data.index(data[i], i+1)
            print("ii:", ii)
            return data[:ii] + data[ii+1:]
        except ValueError:
            continue
    return data[:]

In [33]:
remove_2nd_instance_v4(inp)

data: [0, 12, 12, 0, 12, 12, 34, 56, 23, 11, 45, 2, 3, 4, 7, 10, 12]
ii: 3


[0, 12, 12, 12, 12, 34, 56, 23, 11, 45, 2, 3, 4, 7, 10, 12]

## Wordplay

Suppose we have two books - maybe we'd like to see if they were written by the same author, or are otherwise similar. One approach to this is to evaluate the word use frequency in both texts, and then compute a 'similarity' or 'distance' measure between those two word frequencies. A related approach is to evaluate the frequency of one word being followed by another word (a 'word pair'), and see the similarity in use of word pairs by the two texts. Or maybe we're interested in seeing the set of all words that come after a given word in that text.

Here we'll get some practice using dictionaries and sets, with such wordplay as our motivating example.

Some data to play with...

In [34]:
word_string1 = 'it was the best of times it was the worst of times '
word_string2 = 'it was the age of wisdom it was the age of foolishness'
word_string = word_string1 + word_string2

words1 = word_string1.split()
words2 = word_string2.split()
words = word_string.split()
print("words:", words)

words: ['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times', 'it', 'was', 'the', 'age', 'of', 'wisdom', 'it', 'was', 'the', 'age', 'of', 'foolishness']


Create some interesting dictionaries from these...

In [35]:
# word frequencies
word_freq = {}
for w in words:
    word_freq[w] = word_freq.get(w, 0) + 1
print("Word frequencies:", word_freq)

Word frequencies: {'it': 4, 'was': 4, 'the': 4, 'best': 1, 'of': 4, 'times': 2, 'worst': 1, 'age': 2, 'wisdom': 1, 'foolishness': 1}


In [36]:
# word pair frequencies
word_pairs = {}
prev = words[0]
for w in words[1:]:
    pair = (prev, w)
    word_pairs[pair] = word_pairs.get(pair, 0) + 1
    prev = w
print("pair frequencies:", word_pairs)

pair frequencies: {('it', 'was'): 4, ('was', 'the'): 4, ('the', 'best'): 1, ('best', 'of'): 1, ('of', 'times'): 2, ('times', 'it'): 2, ('the', 'worst'): 1, ('worst', 'of'): 1, ('the', 'age'): 2, ('age', 'of'): 2, ('of', 'wisdom'): 1, ('wisdom', 'it'): 1, ('of', 'foolishness'): 1}
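The counting loops above can also be expressed with `collections.Counter` from the standard library, which is a dictionary subclass built for tallying. The sketch below rebuilds the same `words` list and checks that the results agree with the manual loop. The names `counter_freq` and `counter_pairs` are assumptions for the example.

```python
from collections import Counter

words = ('it was the best of times it was the worst of times '
         'it was the age of wisdom it was the age of foolishness').split()

# Manual tally, as in the cells above.
word_freq = {}
for w in words:
    word_freq[w] = word_freq.get(w, 0) + 1

# Counter does the same tally in one call.
counter_freq = Counter(words)
# zip(words, words[1:]) yields each adjacent (word, next_word) pair.
counter_pairs = Counter(zip(words, words[1:]))

print(dict(counter_freq) == word_freq)
print(counter_pairs[('it', 'was')], counter_pairs[('of', 'times')])
```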


In [40]:
# what words follow each word?
word_after = {}
for w1, w2 in zip(words, words[1:]):
    word_after.setdefault(w1, set()).add(w2)
print("words followed by\n" + str(word_after) + "\n")
    

words followed by
{'it': {'was'}, 'was': {'the'}, 'the': {'best', 'age', 'worst'}, 'best': {'of'}, 'of': {'wisdom', 'times', 'foolishness'}, 'times': {'it'}, 'worst': {'of'}, 'age': {'of'}, 'wisdom': {'it'}}



Rewriting as functions so we can use them later:

In [41]:
def get_freq(words):
    word_freq = {}
    for w in words:
        word_freq[w] = word_freq.get(w, 0) + 1
    return word_freq

In [46]:
def get_pair_freq(words):
    word_pairs = {}
    for pair in zip(words, words[1:]):
        word_pairs[pair] = word_pairs.get(pair, 0) + 1
    return word_pairs

Suppose we want to identify the high-frequency words (i.e., sort words by frequency from high to low).

In [47]:
def sort_freq_dict(freq):
    return sorted([(freq[key], key) for key in freq], reverse=True)

In [49]:
words_by_frequency = sort_freq_dict(word_freq)
print(words_by_frequency)

[(4, 'was'), (4, 'the'), (4, 'of'), (4, 'it'), (2, 'times'), (2, 'age'), (1, 'worst'), (1, 'wisdom'), (1, 'foolishness'), (1, 'best')]
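Because `sort_freq_dict` sorts `(count, word)` tuples in reverse, words with equal counts come out in reverse alphabetical order. If alphabetical order within ties is preferred, one possible sketch uses a sort key instead. The function name `sort_by_count_then_word` is hypothetical, and it returns `(word, count)` pairs rather than `(count, word)`.

```python
word_freq = {'it': 4, 'was': 4, 'the': 4, 'best': 1, 'of': 4, 'times': 2,
             'worst': 1, 'age': 2, 'wisdom': 1, 'foolishness': 1}

def sort_by_count_then_word(freq):
    # Negating the count sorts high-to-low; the word itself breaks ties A-to-Z.
    return sorted(freq.items(), key=lambda item: (-item[1], item[0]))

ranked = sort_by_count_then_word(word_freq)
print(ranked[:5])
```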


Next, we can build a similarity measure between two word-frequency dictionaries. We'll use a typical "geometric" notion of similarity based on the cosine of the angle between two vectors. We build this from two vector operations on word frequencies, the "norm" and the "dot product", and then convert the resulting angle into a similarity score normalized to lie between 0 and 1.
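Written out, the quantities computed by the functions in the following cells are roughly as below. The notation is an assumption for this summary: $f(w)$ is the count of word $w$ in a frequency dictionary $f$, and $W_1$, $W_2$ are the key sets of $f_1$ and $f_2$.

$$
\|f\| = \sqrt{\sum_{w} f(w)^2},
\qquad
f_1 \cdot f_2 = \sum_{w \in W_1 \cap W_2} f_1(w)\, f_2(w)
$$

$$
\cos\theta = \frac{f_1 \cdot f_2}{\|f_1\|\,\|f_2\|},
\qquad
\text{similarity}(f_1, f_2) = 1 - \frac{\theta}{\pi/2}
$$

Since all counts are non-negative, $\theta$ lies between $0$ and $\pi/2$, so the similarity is 1 for identical frequency profiles and 0 when the two texts share no words.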

In [50]:
def freq_norm(freq):
    return sum(num**2 for num in freq.values())**0.5

In [51]:
def freq_dot(freq1, freq2):
    return sum(freq1[w] * freq2[w] for w in set(freq1) & set(freq2))

In [53]:
import math

In [52]:
def freq_similarity(freq1, freq2):
    d = freq_dot(freq1, freq2) / (freq_norm(freq1) * freq_norm(freq2))
    ang = math.acos(min(1.0, d))
    return 1 - ang / (math.pi/2)

In [54]:
# some quick tests / examples
x = {'a': 40, 'b': 2}
y = {'c': 3, 'a': 30}
print("Combined words:", set(x) | set(y))
print("freq_norm of", x, ":", freq_norm(x))
print("freq_norm of", y, ":", freq_norm(y))
print("freq_dot of", x, "and", y, ":", freq_dot(x, y))
print("freq_similarity:", freq_similarity(x, y))

Combined words: {'b', 'c', 'a'}
freq_norm of {'a': 40, 'b': 2} : 40.049968789001575
freq_norm of {'c': 3, 'a': 30} : 30.14962686336267
freq_dot of {'a': 40, 'b': 2} and {'c': 3, 'a': 30} : 1200
freq_similarity: 0.9290478472408689


In [55]:
# try it out with our short phrases
words3 = "this is a random sentence good enough for any random day".split()
print("words1:", words1, "\nwords2:", words2, "\nwords3:", words3, "\n")

# build word and pair frequency dictionaries, and calculate some similarities
freq1 = get_freq(words1)
freq2 = get_freq(words2)
freq3 = get_freq(words3)
print("words1 vs.words2 -- word use similarity: ", freq_similarity(freq1, freq2))
print("words1 vs.words3 -- word use similarity: ", freq_similarity(freq1, freq3))
print("words3 vs.words3 -- word use similarity: ", freq_similarity(freq3, freq3))

words1: ['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times'] 
words2: ['it', 'was', 'the', 'age', 'of', 'wisdom', 'it', 'was', 'the', 'age', 'of', 'foolishness'] 
words3: ['this', 'is', 'a', 'random', 'sentence', 'good', 'enough', 'for', 'any', 'random', 'day'] 

words1 vs.words2 -- word use similarity:  0.5184249085864179
words1 vs.words3 -- word use similarity:  0.0
words3 vs.words3 -- word use similarity:  1.0


In [56]:
# try that for similarity of WORD PAIR use...
pair1 = get_pair_freq(words1)
pair2 = get_pair_freq(words2)
pair3 = get_pair_freq(words3)
print("words1 vs. words2 -- pair use similarity: ", freq_similarity(pair1, pair2))
print("words1 vs. words3 -- pair use similarity: ", freq_similarity(pair1, pair3))

words1 vs. words2 -- pair use similarity:  0.29368642735616235
words1 vs. words3 -- pair use similarity:  0.0


### Now let's do it with some actual books!

In [58]:
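# Note: str.split() already splits on newlines, so the replace() calls are not needed.
# As written, replace('\n', '') fuses the last word of each line with the first word
# of the next line (unless the line ends with a space); the counts shown below were
# produced with this behavior.
with open('hamlet.txt') as f: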
    hamlet = f.read().replace('\n', '').lower()
hamlet_words = hamlet.split()

with open('macbeth.txt') as f:
    macbeth = f.read().replace('\n', '').lower()
macbeth_words = macbeth.split()

with open('alice_in_wonderland.txt') as f:
    alice = f.read().replace('\n', '').lower()
alice_words = alice.split()

print(len(hamlet_words), len(macbeth_words), len(alice_words))

28218 18154 15189
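One caveat with a plain `split()` is that punctuation stays attached to words, so "be," and "be:" are counted as different words. A possible refinement is sketched below using a regular expression. The `tokenize` helper and the sample sentence are assumptions for illustration, and applying it to the books would change the counts and similarities reported in this notebook.

```python
import re

def tokenize(text):
    """Hypothetical helper: lowercase, then keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

sample = "To be, or not to be:\nthat is the question."

plain_words = sample.lower().split()
clean_words = tokenize(sample)

print(plain_words)
print(clean_words)
```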


With the text from those books in hand, let's look at similarities...

In [59]:
hamlet_freq = get_freq(hamlet_words)
macbeth_freq = get_freq(macbeth_words)
alice_freq = get_freq(alice_words)
print("similarity of word freq between hamlet & macbeth:",
      freq_similarity(hamlet_freq, macbeth_freq))
print("similarity of word freq between alice & macbeth:",
      freq_similarity(alice_freq, macbeth_freq))

hamlet_pair = get_pair_freq(hamlet_words)
macbeth_pair = get_pair_freq(macbeth_words)
alice_pair = get_pair_freq(alice_words)
print("\nsimilarity of word pairs between hamlet & macbeth:",
      freq_similarity(hamlet_pair, macbeth_pair))
print("similarity of word pairs between alice & macbeth:",
      freq_similarity(alice_pair, macbeth_pair))

similarity of word freq between hamlet & macbeth: 0.8234901643970373
similarity of word freq between alice & macbeth: 0.7242123439984074

similarity of word pairs between hamlet & macbeth: 0.3195599532068242
similarity of word pairs between alice & macbeth: 0.23412902997911367
