In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import make_classification

In [6]:
%matplotlib notebook

# Salience: Highlighting the Most Important Features in the Data

**Version 0.1**

***
By AA Miller 11 June 2019

As we saw during the lecture, a nearly limitless number of parameters can be adjusted when developing visuals for scientific communication. From something as small as the thickness of the axes to something as critical as the choice of color (**or** the choice to avoid color), each of these choices will eventually affect the final interpretation of the data.

As you construct visualizations today, there are a few points from the lecture that I especially want to highlight:

  - *Salience* –– make specific choices to highlight the most important features of the visualization  

  - Storytelling –– figure out the story you want to tell with the data
  
Alternatively, ask yourself, "what would the newspaper headline be for this figure/presentation?"

## Problem 1) Simple Synthetic Data

We will use the [make_classification](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification) function from scikit-learn to generate some data in a low dimensional data space.

**Problem 1a**

Create 225 sources that live in 4 dimensions, where each source belongs to one of two classes.

*Hint* –– execute the cell below.

In [22]:
np.random.seed(23)
X, y = make_classification(n_samples=225, n_classes=2, 
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62,0.38])

**Problem 1b**

Using the defaults in `matplotlib`, make a scatter plot of the data showing feature 1 vs. feature 2. Use different colors for the two classes (again with the `matplotlib` defaults).

*Hint* –– recall that `scikit-learn` organizes feature data in a two-dimensional array, where every row corresponds to a single source and every column corresponds to a single feature.
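To make the array layout concrete, here is a minimal sketch that repeats the Problem 1a call (so it runs on its own) and inspects the shapes. The variable names `feature_1` and `first_source` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same call as the Problem 1a cell, repeated so this snippet is self-contained
np.random.seed(23)
X, y = make_classification(n_samples=225, n_classes=2,
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62, 0.38])

print(X.shape)  # (number of sources, number of features)
print(y.shape)  # one class label per source

feature_1 = X[:, 0]    # first column: feature 1 for every source
first_source = X[0]    # first row: all features of a single source
```

Selecting a column with `X[:, j]` is therefore what you want when plotting one feature against another.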

In [23]:
fig, ax = plt.subplots()
ax.scatter(X[:,0], X[:,1], c=y)
ax.set_xlabel('X0')
ax.set_ylabel('X1')
fig.tight_layout()

Now that we are familiar with the "defaults", we will apply several of the lessons from the lecture to create more salient visualizations.

Note –– many of the following questions are a little open ended. Be sure you are happy with your results, but I suggest that you do not dwell on any single question for too long ($\gtrsim$15 min).

## Problem 2) Salience –– Plotting Symbols

**Problem 2a** 

Replot the data using symbols that provide strong visual boundaries between the two classes. 

*Hint* –– make a choice that highlights the most important feature in the data (this will be subjective).

In [None]:
fig, ax = plt.subplots()
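One possible starting point, offered as a sketch and not as the intended answer: plot each class with its own marker shape and color, using open symbols for one class so overlapping points stay distinguishable. The specific markers, colors, and labels are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Regenerate the Problem 1a data so the snippet runs on its own
np.random.seed(23)
X, y = make_classification(n_samples=225, n_classes=2,
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62, 0.38])

fig, ax = plt.subplots()
# Illustrative choice: open blue circles vs. filled orange triangles
ax.scatter(X[y == 0, 0], X[y == 0, 1], marker='o',
           facecolors='none', edgecolors='C0', label='class 0')
ax.scatter(X[y == 1, 0], X[y == 1, 1], marker='^',
           color='C1', label='class 1')
ax.set_xlabel('X0')
ax.set_ylabel('X1')
ax.legend()
fig.tight_layout()
```

Calling `scatter` once per class is what allows a different marker for each, since a single `scatter` call accepts only one marker style.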

**Problem 2b**

Replot the data, again with strong visual boundaries, but this time do not use color (if you did not use color in **2a**, then use color for this problem).

In [None]:
fig, ax = plt.subplots()
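Assuming color was used in **2a**, a color-free version might rely on shape and fill alone. The sketch below uses open circles and plus signs, all in black; these particular symbols and sizes are one choice among many.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Regenerate the Problem 1a data so the snippet runs on its own
np.random.seed(23)
X, y = make_classification(n_samples=225, n_classes=2,
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62, 0.38])

fig, ax = plt.subplots()
# Both classes are black; only shape and fill separate them
ax.scatter(X[y == 0, 0], X[y == 0, 1], marker='o', s=30,
           facecolors='none', edgecolors='k', label='class 0')
ax.scatter(X[y == 1, 0], X[y == 1, 1], marker='+', s=60,
           color='k', label='class 1')
ax.set_xlabel('X0')
ax.set_ylabel('X1')
ax.legend()
fig.tight_layout()
```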

**Problem 2c**

Replot the data, again with strong visual boundaries, varying some new aspect of the plotting symbols to distinguish the two classes. 

*Hint* –– recall that you have many options at your disposal (e.g., symbol, color, size, orientation, shape, motion, etc.).

In [None]:
fig, ax = plt.subplots()
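As one illustration of varying a different property, the sketch below changes symbol size and opacity: the less common class is drawn large and dark, the more common class small and faint. Treating the minority class as the one to emphasize is an assumption made here for the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Regenerate the Problem 1a data so the snippet runs on its own
np.random.seed(23)
X, y = make_classification(n_samples=225, n_classes=2,
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62, 0.38])

# Pick the class with fewer members as the one to emphasize (assumption)
minority = y == np.bincount(y).argmin()

fig, ax = plt.subplots()
# Common class: small, gray, semi-transparent
ax.scatter(X[~minority, 0], X[~minority, 1], s=10, color='0.5',
           alpha=0.5, label='more common class')
# Less common class: large, black, opaque
ax.scatter(X[minority, 0], X[minority, 1], s=60, color='k',
           label='less common class')
ax.set_xlabel('X0')
ax.set_ylabel('X1')
ax.legend()
fig.tight_layout()
```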

**Problem 2d**

Use [www.color-blindness.com](https://www.color-blindness.com/coblis-color-blindness-simulator/) to examine how each of your choices above would appear to someone who is color blind. How do they appear in black and white?

After this examination, do you want to alter any of the previous plots?

In [None]:
fig, ax = plt.subplots()
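For the black-and-white question, a rough numerical check can complement the visual one. The sketch below estimates how bright a color appears in grayscale using the Rec. 601 luma weights; the helper name `luma` is hypothetical, and this is only an approximation of grayscale rendering, not a color-blindness simulation.

```python
from matplotlib.colors import to_rgb

def luma(color):
    """Approximate grayscale brightness (0 = black, 1 = white)
    using the Rec. 601 weights for red, green, and blue."""
    r, g, b = to_rgb(color)
    return 0.299 * r + 0.587 * g + 0.114 * b

# The first two colors of matplotlib's default color cycle
for name in ['#1f77b4', '#ff7f0e']:
    print(name, round(luma(name), 2))
```

If two colors have nearly the same value, they will be hard to tell apart once the figure is printed in grayscale, which suggests also varying marker shape or fill.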

## Problem 3) Salience –– Grids

**Problem 3a**

Make a bar graph showing the relative number of sources in each class.

In [None]:
fig, ax = plt.subplots()
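A minimal sketch of one way to do this: count the labels with `np.bincount` and plot the fraction of sources in each class. The tick labels and axis label are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Regenerate the Problem 1a data so the snippet runs on its own
np.random.seed(23)
X, y = make_classification(n_samples=225, n_classes=2,
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62, 0.38])

counts = np.bincount(y)            # number of sources per class
fractions = counts / counts.sum()  # relative number per class

fig, ax = plt.subplots()
ax.bar(['class 0', 'class 1'], fractions)
ax.set_ylabel('fraction of sources')
fig.tight_layout()
```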

**Problem 3b** 

Plot the same bar graph with a background grid that makes it easy to rapidly judge the relative magnitude of each class (i.e., remove the y-axis labels and rely on the grid instead).

*Note* –– beware of introducing judgement error.

In [None]:
fig, ax = plt.subplots()
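One way to sketch this, under the assumption that evenly spaced horizontal grid lines are an acceptable stand-in for numeric labels: draw a y-axis grid behind the bars, hide the y tick labels, and keep the axis anchored at zero so bar heights stay proportional. The grid spacing of 0.1 is an illustrative choice.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from sklearn.datasets import make_classification

# Regenerate the Problem 1a data so the snippet runs on its own
np.random.seed(23)
X, y = make_classification(n_samples=225, n_classes=2,
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62, 0.38])
counts = np.bincount(y)
fractions = counts / counts.sum()

fig, ax = plt.subplots()
ax.bar(['class 0', 'class 1'], fractions)
ax.set_ylim(bottom=0)                             # bars must start at zero
ax.yaxis.set_major_locator(MultipleLocator(0.1))  # evenly spaced grid lines
ax.yaxis.grid(True)                               # horizontal grid only
ax.set_axisbelow(True)                            # draw the grid behind the bars
ax.tick_params(axis='y', labelleft=False, length=0)  # hide y labels and ticks
fig.tight_layout()
```

Keeping the baseline at zero matters here: with the labels removed, a truncated axis would exaggerate the difference between the classes.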

**Problem 3c**

Can you adjust the grid to improve the salience of the bar graph? What thickness are you using for the grid lines? How does this compare to the thickness of the axis lines? What line style? What opacity?

In [None]:
fig, ax = plt.subplots()
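The questions above map directly onto keyword arguments of `ax.grid`. The sketch below sets the grid's line width, style, color, and opacity explicitly and prints the width of the axis lines for comparison; the particular values are assumptions to experiment with, not recommendations.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Regenerate the Problem 1a data so the snippet runs on its own
np.random.seed(23)
X, y = make_classification(n_samples=225, n_classes=2,
                           n_features=4, n_redundant=0, n_informative=4,
                           flip_y=0.04, weights=[0.62, 0.38])
counts = np.bincount(y)
fractions = counts / counts.sum()

grid_lw = 0.5  # illustrative value: thinner than the axis lines

fig, ax = plt.subplots()
ax.bar(['class 0', 'class 1'], fractions)
ax.grid(axis='y', color='0.4', linewidth=grid_lw,
        linestyle=':', alpha=0.7)
ax.set_axisbelow(True)  # keep the grid behind the bars

# Compare against the thickness of the axis lines (the spines)
spine_lw = ax.spines['left'].get_linewidth()
print(f"grid line width: {grid_lw}, axis line width: {spine_lw}")
fig.tight_layout()
```

A grid that is thinner and fainter than the axis lines tends to guide the eye without competing with the bars themselves.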