In [None]:
import numpy as np

Recall from lecture and readings that Tversky defined the similarity between objects $a$ and $b$ as:

$$
S(a,b) = \theta\,f(A \cap B) - \alpha\,f(A - B) - \beta\,f(B-A)
$$

Here, $A$ is the set of features of $a$, $B$ is the set of features of $b$, $f$ is an additive function from sets to numbers, and $\theta, \alpha, \beta$ are free parameters all $\ge 0$.
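
As a rough illustration of the formula, here is a minimal sketch that uses plain Python sets of feature names rather than the boolean arrays used in the rest of this problem. It assumes that $f$ simply counts the elements of a set (one possible additive function). The name `contrast_model_sets` and the two feature sets are hypothetical and are not part of the assignment.

```python
def contrast_model_sets(A, B, theta=1.0, alpha=0.5, beta=0.5):
    """Contrast model on Python sets, assuming f(X) = len(X)."""
    common = len(A & B)   # f(A ∩ B): features in both sets
    a_only = len(A - B)   # f(A - B): features only in A
    b_only = len(B - A)   # f(B - A): features only in B
    return theta * common - alpha * a_only - beta * b_only

# hypothetical feature sets, for illustration only
A = {"sweet", "sour", "seeds"}
B = {"sour", "bitter", "seeds"}

sim = contrast_model_sets(A, B)
print(sim)  # 1.0, i.e. 1.0*2 - 0.5*1 - 0.5*1
```

Raising `alpha` or `beta` in this sketch penalizes the distinctive features more heavily, while raising `theta` rewards the shared ones.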

In this problem you will write code to implement Tversky's contrast model. 

## Part A (0.5 points)

We will compute each part of the similarity function in turn. First, we need to compute $f(A\cap B)$, which is essentially the number of features that $A$ and $B$ have in common. As an example, consider the following fruits:

|            | Sweet | Sour  | Bitter | Salty | Seeds |
|:-----------|:-----:|:-----:|:------:|:-----:|:-----:|
| Orange     | 1     | 1     | 0      | 0     | 1     |
| Lemon      | 0     | 1     | 1      | 0     | 1     |

As NumPy arrays, these would look like:

In [None]:
orange_features = np.array([True,  True,  False, False, True ])
lemon_features  = np.array([False, True,  True,  False, True ])

How many features do the orange and lemon have in common? A feature only counts as being "in common" if *both* feature vectors have it, so in this case, the only common features between orange and lemon are "sour" and "seeds" (not "salty", because neither of them has this feature). Thus, orange and lemon have two features in common.
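
As a reminder of how boolean arrays behave in NumPy, the sketch below shows two building blocks: summing a boolean array counts its `True` entries, and logical operators act element by element. It deliberately uses element-wise OR (features that *at least one* fruit has) rather than the operation Part A asks for; the variable names `n_orange`, `either`, and `n_either` are only for illustration.

```python
import numpy as np

orange_features = np.array([True,  True,  False, False, True ])
lemon_features  = np.array([False, True,  True,  False, True ])

# summing a boolean array counts how many entries are True
n_orange = int(np.sum(orange_features))
print(n_orange)  # 3

# `|` is element-wise OR: True wherever at least one fruit has the feature
either = orange_features | lemon_features
n_either = int(np.sum(either))
print(n_either)  # 4
```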

<div class="alert alert-success">Complete the function `common_features` so that it takes two binary feature vectors of length $n$ (`a` and `b`) as arguments and returns the total number of features in common between `a` and `b`. Here, we define the "number of common features" as the number of locations that are `True` in both `a` and `b`.</div>

In [None]:
def common_features(a, b):
    """
    Compute the number of common features between a and b. Features 
    count as being shared between the vectors if they are present in
    both vectors (i.e., they are a 1 in both). In other words, you should
    compute the intersection of features between a and b.
    
    Hint: your solution can be done in a single line of code, including
    the return statement.
    
    Parameters
    ----------
    a, b : boolean numpy array with shape (n,)
    
    Returns
    -------
    number of common features between a and b
    
    """
    # YOUR CODE HERE
    raise NotImplementedError()

Test out your `common_features` function on the orange and lemon features, to see if it does in fact return 2:

In [None]:
common_features(orange_features, lemon_features)

In [None]:
# add your own test cases here!


In [None]:
"""Check that common_features is correct."""
from nose.tools import assert_equal
assert_equal(common_features(np.array([True, True, False, False]), np.array([True, True, False, False])), 2)
assert_equal(common_features(np.array([True, False, False, False]), np.array([True, True, False, False])), 1)
assert_equal(common_features(np.array([True, False, True, False]), np.array([True, True, False, False])), 1)
assert_equal(common_features(np.array([False, False, False, False]), np.array([False, False, False, False])), 0)
assert_equal(common_features(np.array([True, True, True, True]), np.array([True, True, True, True])), 4)

print("Success!")

---

## Part B (0.5 points)

In the next two terms of the equation, we need to compute $f(A-B)$ and $f(B-A)$. This can be done using the same operation: computing the number of features that are in one vector, but not the other. As an example, let's take a look at some more fruits:

|            | Sweet | Sour  | Bitter | Salty | Seeds |
|:-----------|:-----:|:-----:|:------:|:-----:|:-----:|
| Grapefruit | 1     | 1     | 1      | 0     | 1     |
| Banana     | 1     | 0     | 0      | 0     | 0     |

If we wanted to compute $f(\textbf{grapefruit}-\textbf{banana})$, we want to see what features grapefruit has that banana does not. In this case, there are three features matching this description: "sour", "bitter", and "seeds". Similarly, to compute $f(\textbf{banana}-\textbf{grapefruit})$, we want to look at features that are in banana but not in grapefruit. Here, there are actually *no* features that the banana has that the grapefruit does not.
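
The same reasoning can be checked with plain Python sets, where `-` is set difference. This is only a sketch of the idea using feature names; your `differences` function should work on boolean NumPy arrays instead. The set names below are hypothetical.

```python
# feature sets read off the table above (illustration only)
grapefruit = {"sweet", "sour", "bitter", "seeds"}
banana = {"sweet"}

only_grapefruit = grapefruit - banana   # features grapefruit has that banana lacks
only_banana = banana - grapefruit       # features banana has that grapefruit lacks

print(sorted(only_grapefruit))  # ['bitter', 'seeds', 'sour']
print(len(only_grapefruit))     # 3
print(len(only_banana))         # 0
```

Note that set difference is not symmetric: swapping the operands changes the answer, which is why $f(A-B)$ and $f(B-A)$ appear as separate terms in the similarity function.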

<div class="alert alert-success">Complete the function `differences` so that it takes two binary feature vectors of length $n$ (`a` and `b`) as arguments and returns the total number of features in `a` that are not contained in `b`. This is defined as the number of locations that are `True` in `a` and `False` in `b`.</div>

In [None]:
def differences(a, b):
    """
    Compute the number of features that belong to a, but not b. Features 
    count as being in a but not b if the feature is 1 in a, and 0 in b.
    
    Hint: your solution can be done in a single line of code, including
    the return statement.
    
    Parameters
    ----------
    a, b : boolean numpy array with shape (n,)
    
    Returns
    -------
    number of differences between a and b
    
    """
    # YOUR CODE HERE
    raise NotImplementedError()

Test your `differences` function on the grapefruit and banana feature vectors to see if it works!

In [None]:
# define the feature vectors
grapefruit_features = np.array([True,  True,  True,  False, True ])
banana_features     = np.array([True,  False, False, False, False])

print("f(grapefruit - banana) = " + str(differences(grapefruit_features, banana_features)))
print("f(banana - grapefruit) = " + str(differences(banana_features, grapefruit_features)))

In [None]:
# add your own test cases here!


In [None]:
"""Check that differences is correct."""
assert_equal(differences(np.array([True, True, False, False]), np.array([True, True, False, False])), 0)
assert_equal(differences(np.array([True, False, False, False]), np.array([True, True, False, False])), 0)
assert_equal(differences(np.array([True, False, True, False]), np.array([True, True, False, False])), 1)
assert_equal(differences(np.array([True, True, True, True]), np.array([False, False, False, False])), 4)
assert_equal(differences(np.array([True, True, True, True]), np.array([False, False, False, True])), 3)
assert_equal(differences(np.array([False, False, False, False]), np.array([False, False, False, False])), 0)
assert_equal(differences(np.array([True, True, True, True]), np.array([True, True, True, True])), 0)

print("Success!")

---

## Part C (1 point)

<div class="alert alert-success">Now, using your completed functions `common_features` and `differences`, compute Tversky's similarity function in `tversky_sim` below.</div>

In [None]:
def tversky_sim(a, b, theta=1.0, alpha=0.5, beta=0.5):
    """
    Compute Tversky's similarity function for two vectors a and b:
    
    S(a, b) = theta*f(a ∩ b) - alpha*f(a - b) - beta*f(b - a)
    
    Hint: your solution can be done in 4 lines of code (or fewer), including
    the return statement.
    
    Parameters
    ----------
    a, b : boolean numpy array with shape (n,)
    theta, alpha, beta : parameters of the similarity function
    
    Returns
    -------
    the similarity between a and b
    
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# add your own test cases here!


In [None]:
x = np.array([True, True, True, False, False, False])
y = np.array([False, True, True, True, False, True])

# check some explicit values
assert_equal(tversky_sim(x, y), 0.5)
assert_equal(tversky_sim(y, x), 0.5)
assert_equal(tversky_sim(x, y, theta=2.0), 2.5)
assert_equal(tversky_sim(y, x, theta=2.0), 2.5)
assert_equal(tversky_sim(x, y, alpha=1.0), 0.0)
assert_equal(tversky_sim(y, x, alpha=1.0), -0.5)
assert_equal(tversky_sim(x, y, beta=1.5), -1.5)
assert_equal(tversky_sim(y, x, beta=1.5), -0.5)

# check that it uses common_features
old_common_features = common_features
del common_features
try:
    tversky_sim(x, y)
except NameError:
    pass
else:
    raise AssertionError("tversky_sim does not use common_features")
finally:
    common_features = old_common_features
    del old_common_features

# check that it uses differences
old_differences = differences
del differences
try:
    tversky_sim(x, y)
except NameError:
    pass
else:
    raise AssertionError("tversky_sim does not use differences")
finally:
    differences = old_differences
    del old_differences

print("Success!")

---

## Part D (1 point)

Now that we have a way of quantifying similarity, let's apply it to some real data. Here we have provided you with a dataset that includes 50 animals and 80 features, and specifies which animals have which features.

First, let's load our data in. There are three arrays in the data: `feature_names`, `animal_names`, and `animal_features`:

In [None]:
data = np.load("data/50animals.npz")
data.keys()

The `animal_features` array corresponds to a $50\times 80$ boolean array of features, where each row corresponds to a different animal, and each column corresponds to a different feature:

In [None]:
animal_features = data['animal_features']
animal_features

And `animal_names` corresponds to a vector of length 50 of the animal names. For convenience, we're going to convert this to a list (and only show the first 10 animal names, since the list itself is quite long):

In [None]:
animal_names = list(data['animal_names'])
animal_names[:10]

Similarly, the `feature_names` array is a vector of the feature names, with one name for each column of `animal_features`. We actually won't need it for this problem, though, so we won't create a new variable for it.
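
To make the layout concrete, here is a toy stand-in for the data with three animals and four features. The names `toy_names` and `toy_features` and their contents are made up for illustration; the real arrays come from the `.npz` file loaded above.

```python
import numpy as np

# made-up data: one row per animal, one column per feature
toy_names = ["cat", "dog", "fish"]
toy_features = np.array([
    [True,  True,  False, False],
    [True,  True,  True,  False],
    [False, False, False, True ],
])

print(toy_features.shape)  # (3, 4)

# look up an animal's feature vector by finding its position in the name list
row = toy_features[toy_names.index("dog")]
print(row.tolist())  # [True, True, True, False]
```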

<div class="alert alert-success">Complete the function `find_similar_animals` to take the name of an animal and find the **5 most similar animals** to that animal, using your function `tversky_sim`. You should return the animals in order of most similar to least similar, and you should *not* return the name of the animal that was passed in.</div>

Note: the `np.argsort()` function will come in handy here (take a look at Problem Set 0 if you forget how it's used). To keep ties in their original order, make sure to use mergesort (which is [stable](http://programmers.stackexchange.com/a/247441)), like so:

```
indices = np.argsort(array, kind='mergesort')
```
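
To see what a stable sort does with ties, consider this small sketch with made-up scores (the array `scores` and the names `order` and `descending` are hypothetical). Indices 0 and 2 are tied, and the stable sort keeps them in their original relative order; reversing the result then flips that order.

```python
import numpy as np

scores = np.array([2.0, 1.0, 2.0, 3.0])   # indices 0 and 2 are tied

# ascending order; the stable sort lists index 0 before index 2
order = np.argsort(scores, kind='mergesort')
print(order.tolist())  # [1, 0, 2, 3]

# reversing gives descending order, and the tied indices now appear as 2, then 0
descending = order[::-1]
print(descending.tolist())  # [3, 2, 0, 1]
```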

In [None]:
def find_similar_animals(name, features, animal_names):
    """
    Finds the five most similar animals to the given animal. You should return the
    animals in order from most similar to least similar to the given animal. In
    addition, you should NOT include the given animal in the list of animals you
    return. 
    
    If two animals have the same similarity score, find_similar_animals 
    should break ties in the REVERSE of the order they appear in animal_names 
    (e.g., if the first two entries in animal_names are A and B, and both animals 
    A and B have the same similarity to target animal C, find_similar_animals should 
    place B BEFORE A when ranking them in terms of their similarity to C.)
    
    Hint: your solution can be done in 4 lines of code, including the return
    statement.
    
    Parameters
    ----------
    name : string
        the name of an animal
    features : boolean numpy array
        animals by features, with shape (n, m)
    animal_names : list of strings
        list of animal names with length n
    
    Returns
    -------
    a list of five animal names
    
    """
    # YOUR CODE HERE
    raise NotImplementedError()

Use your function to find out what animals are most similar to a mouse:

In [None]:
# should print ['rat', 'rabbit', 'weasel', 'hamster', 'squirrel']
find_similar_animals('mouse', animal_features, animal_names)

In [None]:
# add your own test cases here!


In [None]:
"""Check that find_similar_animals is correct"""
from nose.tools import assert_equal

def assert_one_equal(arr, *others):
    for other in others:
        if arr == other:
            return
    assert_equal(arr, others[0])

# load the animal data
data = np.load("data/50animals.npz")
af = data['animal_features']
an = list(data['animal_names'])
data.close()

# try finding animals similar to mouse
assert_equal(
    find_similar_animals('mouse', af, an),
    [u'rat', u'rabbit', u'weasel', u'hamster', u'squirrel'])

# try finding animals similar to grizzly bear
assert_equal(
    find_similar_animals('grizzly bear', af, an),
    ['bobcat', 'polar bear', 'raccoon', 'lion', 'gorilla'])

# try finding animals similar to grizzly bear with different features
assert_equal(
    find_similar_animals('grizzly bear', ~af, an),
    ['polar bear', 'gorilla', 'german shepherd', 'bobcat', 'raccoon'])

# try finding animals similar to grizzly bear with different names
assert_equal(
    find_similar_animals('grizzly bear', af, an[::-1]),
    ['weasel', 'beaver', 'buffalo', 'tiger', 'collie'])

# try finding animals similar to grizzly bear with both different names and features
assert_equal(
    find_similar_animals('grizzly bear', ~af, an[::-1]),
    ['spider monkey', 'seal', 'weasel', 'dalmatian', 'giraffe'])

# check that it uses tversky_sim
old_tversky_sim = tversky_sim
del tversky_sim
try:
    find_similar_animals('mouse', af, an)
except NameError:
    pass
else:
    raise AssertionError("find_similar_animals does not use tversky_sim")
finally:
    tversky_sim = old_tversky_sim
    del old_tversky_sim

print("Success!")

---

## Part E (1 point)

Run your function `find_similar_animals` for the input 'giant panda':

In [None]:
find_similar_animals('giant panda', animal_features, animal_names)

<div class="alert alert-success">Which five animals does it return as most similar to the panda? Do they match your intuitions (that is, if you were to intuitively pick out the five most similar animals to a panda, would you pick those five in that order)? (**0.5 points**)</div>

YOUR ANSWER HERE

<div class="alert alert-success"> If yes, why do you think Tversky's contrast model does a good job at capturing your intuitions? If no, what aspect of Tversky's contrast model leads to the contradiction with your intuitions? (**0.5 points**)</div>

YOUR ANSWER HERE

---

## Part F (1 point)

Tversky's contrast model takes optional parameters, $\theta$, $\alpha$, and $\beta$, which bias the similarity more or less towards shared features versus feature differences. Recall from lecture that Tversky's notion of similarity says that the similarity of the *variant* to the *prototype* should be greater than the similarity of the *prototype* to the *variant*. More formally, if $a$ is the prototype and $b$ is the variant, then:

$$
S(b,a) - S(a,b) = (\alpha - \beta)\,[f(A - B) - f(B - A)]
$$

Given this equation, $S(b,a)>S(a,b)$ when $\alpha>\beta$ *and* $f(A-B)>f(B-A)$; that is, when the prototype has more distinctive or heavily weighted features than the variant. Previously, we used the same value for both $\alpha$ and $\beta$, meaning that $S(b,a)=S(a,b)$. However, if we give $\alpha$ and $\beta$ different values, then we can get asymmetric similarities.
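
The identity above can be checked numerically. The sketch below works directly with made-up values for $f(A \cap B)$, $f(A-B)$, and $f(B-A)$ rather than with feature vectors; the helper `sim_from_counts` and all of the numbers are hypothetical and are not part of the assignment.

```python
def sim_from_counts(common, a_only, b_only, theta, alpha, beta):
    """Contrast model written in terms of f(A ∩ B), f(A - B), f(B - A)."""
    return theta * common - alpha * a_only - beta * b_only

theta, alpha, beta = 1.0, 2.5, 0.5
common, a_only, b_only = 3, 4, 1   # made-up feature counts

s_ab = sim_from_counts(common, a_only, b_only, theta, alpha, beta)  # S(a, b)
s_ba = sim_from_counts(common, b_only, a_only, theta, alpha, beta)  # S(b, a): roles swapped

lhs = s_ba - s_ab
rhs = (alpha - beta) * (a_only - b_only)

print(s_ab, s_ba)  # -7.5 -1.5
print(lhs, rhs)    # 6.0 6.0
```

With $\alpha > \beta$ and more distinctive features on the $a$ side, the difference is positive, so $S(b,a) > S(a,b)$ in this example.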

<div class="alert alert-success">Let's explore this idea a little further, in the case of the leopard and the tiger. First, which animal would you intuitively say is more prototypical: the leopard or the tiger? Why?  (**0.1 points**)</div>

YOUR ANSWER HERE

Now, run the cell below to find out what Tversky's similarity metric says about the similarity:

In [None]:
leopard = animal_features[animal_names.index('leopard')]
tiger = animal_features[animal_names.index('tiger')]
print("S(leopard, tiger) = {}".format(tversky_sim(leopard, tiger, theta=1.0, alpha=2.5, beta=0.5)))
print("S(tiger, leopard) = {}".format(tversky_sim(tiger, leopard, theta=1.0, alpha=2.5, beta=0.5)))

<div class="alert alert-success">Does Tversky's similarity metric say that a leopard is more similar to a tiger, or vice versa? Does this mean that Tversky's similarity metric says (in this case) that the prototype is more similar to the variant, or that the variant is more similar to the prototype? (Hint: What animal category are leopards and tigers members of? Which do you think is a more "prototypical" example of this category?) (**0.4 points**)</div>

YOUR ANSWER HERE

<div class="alert alert-success">Can Tversky's leopard vs. tiger similarity results be interpreted as counter-evidence for his notion that $S$(variant, prototype) $>$ $S$(prototype, variant)?  (**0.5 points**)</div>

YOUR ANSWER HERE