## Manipulating and Plotting Data in the Notebook (part 2)

organized by *Todd Gureckis, Brenden Lake, Alex Rich*  
class webpage: https://brendenlake.github.io/CCM-site/  
direct email to course instructors: instructors-ccm-spring2018@nyuccl.org

<div class="alert alert-danger" role="alert">
  This homework is due before midnight on Feb 6, 2018. 
</div>

---

<div class="alert alert-info">
This introduction to Jupyter Notebook is based on tutorials developed by <a href="http://www.jesshamrick.com">Jessica Hamrick</a>.
</div>

One of the coolest things about the notebook is the ability to display data and plots inline. To take a working example, let's say we've run an experiment on psychological similarity, collecting similarity ratings from participants on color and kinship relations.

(Note: data for this notebook has been adapted from Michael Lee's repository of similarity datasets: http://faculty.sites.uci.edu/mdlee/similarity-data/)

---

## Importing Libraries

Before we get into loading and plotting data, we'll import the libraries we'll be working with. It is good practice to import all the libraries you need *first*, at the top of the notebook (or file), so that you can easily see what has been imported and where a given function comes from. You may need to install *plotly* before you can import the following packages.

In [None]:
# special magic function that sets up matplotlib for the notebook
%matplotlib inline 

# "pd" is the standard abbreviation for "pandas", "plt" is the standard
# abbreviation for "matplotlib" (pyplot), and "np" is the standard abbreviation for "numpy"
import pandas as pd               
import matplotlib.pyplot as plt
import numpy as np

# some helper functions we'll use later
from util import mds

---

## Loading Data

The first thing we need to do is to actually load our data from somewhere. Using the [pandas library](http://pandas.pydata.org/), we can trivially load CSV files.

In [None]:
colors = pd.read_csv("data/color_similarities.csv")

When we read in the CSV file with pandas, it creates what is known as a `DataFrame` object. This dataframe contains tabular data with labeled rows and columns, similar to how you would use a spreadsheet. We can visualize what's actually in the dataframe by printing it out:

In [None]:
colors.head()

We can see from this that there are three columns, with the first two corresponding to the pair of wavelengths that are being compared, and the third corresponding to the similarity rating (with 1 being the highest, and 0 being the lowest).
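If you want to experiment with this layout without the data file, here is a minimal sketch that reads a small CSV from an in-memory string. The column names match the ones used later in this notebook (`wavelength1`, `wavelength2`, `rating`), but the numeric values are made up for illustration and are not taken from the real dataset.

```python
import io
import pandas as pd

# Made-up CSV text in the same three-column layout (values are illustrative only)
csv_text = """wavelength1,wavelength2,rating
434,434,1.0
434,445,0.86
445,434,0.86
445,445,1.0
"""

# pd.read_csv accepts a file path or any file-like object
toy_colors = pd.read_csv(io.StringIO(csv_text))

print(toy_colors.shape)    # (number of rows, number of columns)
print(toy_colors.dtypes)   # data type of each column
toy_colors.head(2)         # first two rows only
```

Passing a number to `.head()` controls how many rows are shown.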

The `.head()` method used above just shows the first few rows. If we wanted to display the whole dataframe, we could just put `colors` on its own line. However, note that pandas automatically truncates the output to avoid it getting too long:

In [None]:
colors
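The truncation is controlled by a pandas display option. As a small sketch (using a made-up dataframe rather than `colors`), you can inspect the current limit and change it temporarily with `pd.option_context`:

```python
import pandas as pd

# A made-up dataframe that is long enough to trigger truncation
long_df = pd.DataFrame({"value": range(100)})

# Number of rows pandas is currently willing to show before truncating
default_max_rows = pd.get_option("display.max_rows")
print(default_max_rows)

# Temporarily show at most 10 rows; the setting is restored after the block
with pd.option_context("display.max_rows", 10):
    print(long_df)
    rows_inside = pd.get_option("display.max_rows")
```

Using a `with` block keeps the change local, so later cells still see the original setting.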

### Converting a DataFrame to an Array

Dataframes are really easy to work with, and make data analysis much easier. However, for now we're just going to reshape the data a bit, and come back to the more advanced use cases later.

Currently, the ratings are effectively just a vector of numbers, one per row. It would be more useful to have a matrix in which each entry is the similarity of a different pair of colors. Pandas makes this easy with the `.pivot()` method. Here, the keyword arguments indicate which column should supply the row labels (`index`), which column should supply the column labels (`columns`), and which column should supply the data (`values`):

In [None]:
pivoted_colors = colors.pivot(index="wavelength1", columns="wavelength2", values="rating")
pivoted_colors
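To make the reshaping concrete, here is a minimal sketch on a tiny made-up table (the ratings are invented, not real data). It also shows `.to_numpy()`, which is one way to get a plain NumPy array out of the pivoted dataframe if you need one.

```python
import pandas as pd

# Made-up long-format ratings: one row per ordered pair (values are illustrative)
toy = pd.DataFrame({
    "wavelength1": [434, 434, 445, 445],
    "wavelength2": [434, 445, 434, 445],
    "rating":      [1.0, 0.86, 0.86, 1.0],
})

# Rows come from wavelength1, columns from wavelength2, cells from rating
toy_matrix = toy.pivot(index="wavelength1", columns="wavelength2", values="rating")
print(toy_matrix)

# .to_numpy() drops the labels and returns a plain 2-D NumPy array
toy_array = toy_matrix.to_numpy()
print(toy_array.shape)
```

Four rows of pairs become a 2 x 2 matrix, with the wavelengths as row and column labels.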

---

## Plotting in the Notebook

We are now ready to create our first plot! Since our data is now in an $N\times N$ array format, an easy first plot is the `matshow` plot type, which displays a heatmap of matrix values:

In [None]:
plt.matshow(pivoted_colors)

Note how the plot is displayed inline with the rest of the notebook. This is a really cool feature of the notebook, because it means you can always figure out how a plot was generated: simply look at the cell above it!
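A heatmap is easier to read with a colorbar and labeled axes. The following is a minimal sketch on a made-up 3 x 3 similarity matrix (the wavelengths and ratings are illustrative assumptions, not the real data); the same calls should work with the values of `pivoted_colors`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up symmetric similarity matrix for three hypothetical wavelengths
labels = [434, 445, 465]
sim = np.array([
    [1.00, 0.86, 0.42],
    [0.86, 1.00, 0.50],
    [0.42, 0.50, 1.00],
])

fig, ax = plt.subplots()
image = ax.matshow(sim)            # heatmap of the matrix values
fig.colorbar(image, ax=ax)         # legend mapping colors to ratings

# Label each row and column with its wavelength
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
```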

However, this is not a very useful visual representation of similarity judgments. To get a better visualization, we can reduce the data down to two dimensions using the multidimensional scaling (MDS) technique, a classic computational modeling technique from Roger Shepard:

* Shepard, R. N. (1980). Multidimensional Scaling, Tree-Fitting, and Clustering. *Science, 210*(4468), 390–398.

I have provided a function here (which uses the [scikit-learn library](http://scikit-learn.org/) under the hood) that computes the MDS solution:

In [None]:
mds_colors = mds(pivoted_colors)
mds_colors

If you want to see what the `mds` function is doing, remember that you can look at the source using double question marks:

In [None]:
mds??
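If you do not have the course's `util.py` at hand, the sketch below shows roughly how a helper of this kind could be written with scikit-learn. This is not the course's implementation: the function name `mds_sketch`, the conversion of similarities to dissimilarities as `1 - similarity`, and the toy matrix are all assumptions made for illustration. The output columns (`label`, `x`, `y`) mirror the ones used in the plotting cells of this notebook. The `dissimilarity="precomputed"` argument is the long-standing spelling; check the documentation for your installed scikit-learn version if it raises a warning.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import MDS

def mds_sketch(similarities, random_state=0):
    """Hypothetical stand-in for util.mds: embed a similarity matrix in 2-D."""
    # Assumption: similarities lie in [0, 1], so 1 - similarity is a dissimilarity
    dissimilarities = 1.0 - similarities.to_numpy()
    model = MDS(n_components=2, dissimilarity="precomputed", random_state=random_state)
    coords = model.fit_transform(dissimilarities)
    return pd.DataFrame({
        "label": list(similarities.index),
        "x": coords[:, 0],
        "y": coords[:, 1],
    })

# Made-up symmetric similarity matrix with ones on the diagonal
labels = [434, 445, 465]
toy_similarities = pd.DataFrame(
    [[1.00, 0.86, 0.42],
     [0.86, 1.00, 0.50],
     [0.42, 0.50, 1.00]],
    index=labels, columns=labels,
)
toy_mds = mds_sketch(toy_similarities)
print(toy_mds)
```

Items with high similarity should end up close together in the resulting 2-D coordinates. The exact coordinates depend on the random initialization, which is why the sketch fixes `random_state`.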

We can now plot the MDS solution as a regular scatter plot:

In [None]:
plt.plot(mds_colors["x"], mds_colors["y"], "o")

That's nice, but doesn't tell us a whole lot since we can't tell which point is which color. One option is to add text next to each point indicating the wavelength of the color, using the `plt.text` command:

In [None]:
plt.plot(mds_colors["x"], mds_colors["y"], "o")
for _, row in mds_colors.iterrows():
    plt.text(row["x"] + 0.01, row["y"] + 0.01, int(row["label"]))

An even cooler option would be to actually color the points according to which color they represent. I have provided another dataset that converts the wavelengths to RGB values:

In [None]:
rgba = pd.read_csv("data/color_rgba.csv", index_col="wavelength")
rgba

In [None]:
for _, row in mds_colors.iterrows():
    plt.plot(row["x"], row["y"], "o", color=rgba.loc[row["label"]])
    plt.text(row["x"] + 0.01, row["y"] + 0.01, int(row["label"]))
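As an alternative to calling `plt.plot` once per point, the sketch below draws all points in a single `scatter` call with one color per point. Everything in it is made up for illustration: the coordinates, the RGBA values (which are not real wavelength conversions), and the color column names `r`, `g`, `b`, `a`, which are an assumption and may differ from the columns in `data/color_rgba.csv`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up 2-D coordinates with the same columns as the MDS output (label, x, y)
toy_points = pd.DataFrame({
    "label": [434, 445, 465],
    "x": [-0.4, -0.1, 0.5],
    "y": [0.2, -0.3, 0.1],
})

# Made-up RGBA colors indexed by wavelength (column names are hypothetical)
toy_rgba = pd.DataFrame(
    {"r": [0.3, 0.1, 0.0], "g": [0.0, 0.0, 0.6], "b": [0.8, 1.0, 1.0], "a": [1.0, 1.0, 1.0]},
    index=pd.Index([434, 445, 465], name="wavelength"),
)

# Look up one RGBA row per point, in the same order as the points
point_colors = toy_rgba.loc[toy_points["label"]].to_numpy()

fig, ax = plt.subplots()
ax.scatter(toy_points["x"], toy_points["y"], c=point_colors)

# Add a text label slightly offset from each point
for row in toy_points.itertuples():
    ax.text(row.x + 0.01, row.y + 0.01, str(row.label))
```

Passing an N x 4 array as `c` tells matplotlib to treat each row as the RGBA color of the corresponding point.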

---

## Exercise: Plotting Kinship Relations

<div class="alert alert-success">
I have also provided a dataset of similarities between kinship relations, located in `data/kinship_similarities.csv`. Try loading this file and creating an MDS plot similar to the one above, but for these kinship relations.
</div>

In [None]:
# Enter your code here

## Turning in homeworks

When you are finished with this notebook, save your work in order to turn it in. To do this, select *File*->*Download As...*->*HTML*.

<img src="images/save-pdf.png" width="300">

You can turn in your assignments using the NYU Classes webpage for the course (available on https://home.nyu.edu).

## Next steps...

So far, so good... Now continue to [Complexity and Emergence](Homework1b-Complexity.ipynb), the next homework notebook.