In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# special imports for computing mds and dendrograms
from representations import mds, plot_dendrogram

In this problem, we will explore the MDS (Multi-Dimensional Scaling) algorithm, which is a procedure for transforming an array of pairwise distances back into the points that generated them. MDS has additionally been used by cognitive scientists to create spatial representations of the similarities among a set of stimuli.
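
To make the input to MDS concrete, here is a minimal sketch of the forward direction: computing the array of pairwise distances from a set of points. The triangle coordinates and the helper name `pairwise_distances` are illustrative assumptions and are not part of the provided `representations` module. The sketch also rotates the points to show that the distances do not change, which suggests that a distance array can only determine a configuration up to rotation, reflection, and translation.

```python
import numpy as np

def pairwise_distances(points):
    """Return the (n, n) array of Euclidean distances between the rows of points."""
    # broadcasting gives an (n, n, d) array of coordinate differences
    diffs = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

# hypothetical example: the vertices of a 3-4-5 right triangle
triangle = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
triangle_distances = pairwise_distances(triangle)

# rotate the triangle by 30 degrees and recompute the distances
theta = np.pi / 6
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
rotated_distances = pairwise_distances(np.dot(triangle, rotation.T))
```

Comparing `triangle_distances` and `rotated_distances` with `np.allclose` is one way to confirm that the two configurations are indistinguishable from their distances alone.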

---
## Part A (1 point)

Alyssa P. Hacker and Ben Bitdiddle are playing a game. Ben thinks of a shape, and then finds the distances between all the vertices of the shape. He then tells Alyssa what the distances between the vertices are, and Alyssa has to guess what the shape is. To begin, Ben gives Alyssa the following pairwise distances:

In [None]:
shape_data = np.load("data/shape.npy")
shape_data

Alyssa knows that just by looking at the distances between the points, she can't tell what the shape originally was. However, Alyssa took CogSci 131 and therefore also knows that she can use the MDS (Multi-Dimensional Scaling) algorithm to transform these distances into points.

We have provided you with a function, `mds`, which performs the MDS algorithm. Look at the documentation for the function to figure out how to call it:

In [None]:
mds?

<div class="alert alert-success">Then, in the following cell, write code to perform MDS on the shape data that Ben gave Alyssa. Save the output of the MDS algorithm into a variable called <code>shape_points</code>, and then plot the points as black dots. Include a title for your plot and make sure that the dimensions are scaled properly.</div> 

<div class="alert alert-warning">
<b>Warning:</b> A common error people make when making MDS plots is allowing one dimension to get stretched out by the default axis scaling. You can see the tragic consequences of this mistake in the plot of a circle below. Fortunately, fixing it is as easy as a single call to <code>axis.set_aspect</code>. Check the documentation for that method to find out how to fix the circle below, then make sure you use the same trick for all your MDS plots!
</div>

In [None]:
# Example of axis scaling gone wrong
x = np.linspace(0, 2 * np.pi, 100)
fig, axis = plt.subplots()
axis.plot(3 * np.cos(x), 3 * np.sin(x))
axis.set_title('An unfortunate plot of a circle')

_Hint:_ For a quick refresher on constructing plots with matplotlib, take a look at the Problem Set 0 notebook "Manipulating and Plotting Data" or the tutorial [here](http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-4-Matplotlib.ipynb).

In [None]:
# load the data
shape_data = np.load("data/shape.npy")

# create the figure
fig, axis = plt.subplots()

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""Check that MDS was correctly used on the shape data and that the points were plotted."""
from numpy.testing import assert_array_equal, assert_almost_equal
from nose.tools import assert_equal, assert_not_equal
from plotchecker import ScatterPlotChecker

# check that shape_data hasn't changed
assert_array_equal(shape_data, np.load("data/shape.npy"))

# check that shape points has the correct shape and type
assert_equal(shape_points.shape, (10, 2), "incorrect shape of shape_points")
assert_equal(shape_points.dtype, np.float64, "incorrect data type of shape_points")

# check the correct data was plotted
pc = ScatterPlotChecker(axis)
pc.assert_x_data_equal(shape_points[:, 0])
pc.assert_y_data_equal(shape_points[:, 1])

# check that the plotted data has the correct values
vals = [(1, 0.60353738), (4, 0.4613193), (-2, -0.24098396), (-6, 0.10159001), (-10, 0.10731944)]
for (idx, val) in vals:
    assert_almost_equal(shape_points.flatten()[idx], val, 6, "incorrect entry within shape_points")

# check that black circles were used
pc.assert_colors_equal('k')

# check that a title was included
pc.assert_title_exists()

# check that dimensions are not distorted
# (newer matplotlib versions report an 'equal' aspect as 1.0)
assert axis.get_aspect() in ('equal', 1.0)

print("Success!")

<div class="alert alert-success">What is the shape that Ben was thinking of?</div>

YOUR ANSWER HERE

<div class="alert alert-success">Give a brief explanation of what the MDS algorithm does. That is, what is it doing when it goes from an $n\times n$ array of distances to an array of $n$ 2D points?</div>

YOUR ANSWER HERE

---
## Part B (1 point)

Satisfied that she was able to correctly guess Ben's shape, Alyssa starts thinking about other cool ways that the MDS algorithm could be used. She remembers talking about various notions of *similarity* from CogSci 131, and wonders if the MDS algorithm could be used as a way to represent psychological similarity.

Being the aspiring cognitive scientist that she is, Alyssa goes ahead and collects some similarity judgments about different musical genres. That is, she asks several people to rate the similarity of (for example) jazz piano and heavy metal rock, averages all the responses, and then scales the data to lie between 0 and 1. She saves her similarity data in the file `music_similarities.npz` and the names of the musical genres in `music_list.npz`.

In [None]:
music_similarities = np.load("data/music_similarities.npz")['data']
music_names = np.load("data/music_list.npz")['data']

Two arrays are loaded above. One, `music_names`, contains the names of the musical genres Alyssa used in her experiment on musical similarity:

In [None]:
music_names

The other, `music_similarities`, is a $12\times 12$ matrix of similarities where index $(i,j)$ gives the similarity between the genre in `music_names[i]` and the genre in `music_names[j]`:

In [None]:
music_similarities

Alyssa remembers that the MDS algorithm takes pairwise *distances*; however, her behavioral data is of pairwise *similarities*. Thus, Alyssa must transform her data into *dissimilarities*. Because the data lies between 0 and 1, a simple way to do this is just to subtract the similarities from 1.
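
As a small illustration of this transformation, the sketch below applies it to a made-up $3\times 3$ similarity matrix (the values and the `toy_` names are hypothetical and unrelated to Alyssa's data). Note that an item is maximally similar to itself, so the diagonal of the result is zero, just as the distance from a point to itself is zero.

```python
import numpy as np

# hypothetical similarity matrix, scaled to lie between 0 and 1
toy_similarities = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

# high similarity becomes low dissimilarity, and vice versa
toy_dissimilarities = 1 - toy_similarities
```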

<div class="alert alert-success">Help Alyssa visualize her musical similarity data by writing code to compute the 2D points from the similarity data, and plot the points along with labels stating which points correspond to which music genres. Store the output of the <code>mds</code> function in a variable called <code>music_points</code>. Don't forget to include a title for your plot!</div>

*Hint*: you can add a text label using the `axis.text` command. You may also want to prepend some spaces to the beginning of each label so that they don't overlap with the points!
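
For reference, here is a minimal sketch of labeling points with `axis.text`, using made-up points and labels (all of the `toy_` names are hypothetical and are not the music data). Each label is padded with two leading spaces so that it sits just to the right of its marker.

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical points and labels, for illustration only
toy_points = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])
toy_labels = ["first", "second", "third"]

toy_fig, toy_axis = plt.subplots()
toy_axis.plot(toy_points[:, 0], toy_points[:, 1], 'ko')

# place each label at the coordinates of its point
for (x, y), label in zip(toy_points, toy_labels):
    toy_axis.text(x, y, "  " + label)
```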

In [None]:
# get a handle to an axis object, then close the plot
axis = plt.gca()
plt.close()

# look up documentation on axis.text
axis.text?

In [None]:
music_names = np.load("data/music_list.npz")['data']
music_similarities = np.load("data/music_similarities.npz")['data']

# create the figure
fig, axis = plt.subplots()

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""Check that MDS was correctly used on the music data and that the points were plotted."""
from numpy.testing import assert_array_equal, assert_almost_equal
from nose.tools import assert_equal, assert_not_equal
from plotchecker import ScatterPlotChecker

names = np.load("data/music_list.npz")['data']
similarities = np.load("data/music_similarities.npz")['data']

assert_equal(list(music_names), list(names), "music_names array has changed")
assert_array_equal(music_similarities, similarities, "music_similarities array has changed")

# check that music_points has the correct shape and type
assert_equal(music_points.shape, (12, 2), "incorrect shape of music_points")
assert_equal(music_points.dtype, np.float64, "incorrect data type of music_points")

# check the correct data was plotted
pc = ScatterPlotChecker(axis)
pc.assert_x_data_equal(music_points[:, 0])
pc.assert_y_data_equal(music_points[:, 1])

# check that the plotted data has the correct values
vals = [(0, -0.59924511), (1, 0.19729266), (2, -0.21593404), (3, 0.09087466), (4, 0.66645917)]
for (idx, val) in vals:
    assert_almost_equal(music_points.flatten()[idx], val, 6, "incorrect entry within music_points")

# check that black circles were used
pc.assert_colors_equal('k')

# check that a title was included
pc.assert_title_exists()

# check that the labels are correct
pc.assert_textlabels_equal(music_names)
pc.assert_textpoints_equal(music_points)

# check that dimensions are not distorted
# (newer matplotlib versions report an 'equal' aspect as 1.0)
assert axis.get_aspect() in ('equal', 1.0)

print("Success!")

---

## Part C (0.5 points)

<div class="alert alert-success">By looking at the graph from Part B, can you identify any <b>pairs</b> of points that have similar distances on the 2D plane, but which have different relationships conceptually?</div>

YOUR ANSWER HERE

---

## Part D (0.5 points)

<div class="alert alert-success">Overall, how well does the spatial representation produced by the MDS algorithm capture your intuitions about the similarities between these musical categories? Justify your answer.</div>

YOUR ANSWER HERE

---

## Part E (1 point)

Being a polymath with a wide range of interests, and being especially interested in ways of visualizing psychological similarity data, Alyssa obtained similarity judgments about kinship relations. For example, how similar is an aunt to a nephew? And how similar is a daughter to a grandmother? Her similarity data and the list of kinship category names she used in her new experiment are conveniently stored in an npz file (which is loaded below).

She decides to represent her kinship data differently from the way she represented the musical categories. She has heard of a special type of plot called a "dendrogram", which shows a hierarchical clustering based on the similarities (as opposed to a spatial layout).
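
To give a feel for what hierarchical clustering computes, here is a minimal sketch using SciPy on a made-up dissimilarity matrix. This is an illustration under stated assumptions: the four items and their values are hypothetical, and average linkage is just one common choice, so the provided `plot_dendrogram` function may work differently.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# hypothetical dissimilarities: "a" and "b" are close, as are "c" and "d"
toy_names = ["a", "b", "c", "d"]
toy_dissimilarities = np.array([
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.9, 0.8],
    [0.8, 0.9, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
])

# linkage expects the condensed (upper-triangle) form of the matrix
condensed = squareform(toy_dissimilarities)

# each row of merges records one merge: the two clusters joined,
# the dissimilarity at which they were joined, and the new cluster size
merges = linkage(condensed, method="average")

# compute the dendrogram layout without drawing it
tree = dendrogram(merges, labels=toy_names, no_plot=True)
```

Here the closest pairs are merged first, and the two resulting clusters are joined last at a much larger dissimilarity. The height of each join in a dendrogram encodes that merge dissimilarity.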

We have provided you with a `plot_dendrogram` function that performs hierarchical clustering and then creates a dendrogram plot:

In [None]:
# The kinship similarity data
kinship_data = np.load("data/kinship.npz")

# the list of names
kinship_names = list(kinship_data['names'])
print(kinship_names)

# The similarity matrix
kinship_similarities = kinship_data['similarities']

In [None]:
plot_dendrogram??

<div class="alert alert-success">From the documentation, figure out how to call the <code>plot_dendrogram</code> function with the kinship data, and then call the function in the following cell. Save the output of <code>plot_dendrogram</code> into a variable called <code>kinship_dissimilarities</code>. Make sure you add a title to your plot as well. Your solution can be done in 2 lines of code.</div>

In [None]:
# load the kinship data
kinship_data = np.load("data/kinship.npz")
kinship_names = list(kinship_data['names'])
kinship_similarities = kinship_data['similarities']

# create the figure
fig, axis = plt.subplots()

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""Check that the dendrogram function was correctly used for the kinship data."""
from numpy.testing import assert_array_equal
from nose.tools import assert_equal, assert_not_equal
from plotchecker import PlotChecker

# check that the kinship data hasn't changed
kd = np.load("data/kinship.npz")
assert_equal(kinship_names, list(kd['names']), "kinship_names array has changed")
assert_array_equal(kinship_similarities, kd['similarities'], "kinship_similarities array has changed")

# check that a title was included
pc = PlotChecker(axis)
pc.assert_title_exists()

# check that the labels are correct
labels = ["aunt", "uncle", "nephew", "niece", "father", "mother", "daughter", 
          "son", "grandfather", "grandmother", "granddaughter", "grandson"]
pc.assert_xticklabels_equal(labels)

# check that the dissimilarities are correct
assert_array_equal(kinship_dissimilarities, 1 - kinship_similarities, "kinship dissimilarities are incorrect")

print("Success!")

<div class="alert alert-success">How well do the results of the hierarchical clustering capture your intuitions about the similarity of the kinship terms? Justify your answer.</div>

YOUR ANSWER HERE