# Week 4 (Saturday) - More Visualization and Detective Work

**Objectives**: Today we are going to review visualization approaches and solve a mystery using data science. Specifically, we will cover the following:
  
* Different visualization tools
* Review Python's Matplotlib in more detail
* Conduct an analysis as part of the Data Science Detective Agency
* Present your analysis to the class

## Data Visualization Tools

Visualization is an indispensable tool in data science. While most patterns and results can be described mathematically, visuals are often best for developing an intuitive understanding of the meaning behind the data. When presenting data science results, visuals are often the only tool that works with a general audience that lacks an analytical background.

Most data scientists use a variety of visualization approaches and tools. Depending on the task at hand, some are better tools than others. The following represent a sample of some of the more common tools.

**Commercial Visualization Products**

These products are easier-to-use tools that require little programming experience but still offer robust visualization.

* **[Tableau](http://www.tableau.com/products/desktop):** Tableau is a popular choice for data exploration and analysis.
* **[Spotfire](http://spotfire.tibco.com/products/spotfire-desktop):** Spotfire is similar to Tableau.

**Commercial Analytics Products**

Most statistics and technical computing products also have robust visualization.

* **[SPSS](http://www-01.ibm.com/software/analytics/spss/):** SPSS is a popular statistics and modeling software.
* **[SAS](http://www.sas.com/en_us/software/sas9.html):** SAS is another popular analytics package that is newly positioned as a tool for data scientists.
* **[Matlab](http://www.mathworks.com/products/matlab/):** Matlab is a popular tool in engineering and the sciences.
* **[Mathematica](https://www.wolfram.com/mathematica/):** Mathematica is another technical computing solution used in engineering and the sciences.

**Web Technologies**

* **[D3](http://d3js.org/):** D3 is a JavaScript library that enables web-based visualizations using web standards such as HTML, SVG, and CSS.

**Programmatic Visualization APIs**

* **[Plotly](https://plot.ly/):** Plotly is a web-based visualization solution that recently open-sourced its JavaScript library (used by its Python API). Plotly focuses on interactive plots, unlike Matplotlib, ggplot2 (R plotting), and Matlab, which tend to be more static.

**Programming Libraries**

* **[Matplotlib](http://matplotlib.org/):** Modeled after Matlab, Matplotlib is the standard for Python scientific visualizations.
* **[ggplot2](http://docs.ggplot2.org/current/):** ggplot2 is the standard plotting library for R.

## Matplotlib

In today's lab, we are going to review Matplotlib in more detail. There are a variety of open tutorials including:

* http://www.labri.fr/perso/nrougier/teaching/matplotlib/
* http://jakevdp.github.io/mpl_tutorial/tutorial_pages/tut1.html

**Plotting Environment**

When using Jupyter/IPython, there are several parameters that establish where and how your plots will display before you even begin to plot data.

Matplotlib has two general modes within Jupyter/IPython notebooks and consoles:

* Inline plots: display the plots under the code you are running within the notebook or console
* Interactive windows: display the plots in a separate Matplotlib backend window

You select a mode with the Matplotlib magic. Calling it without a parameter selects interactive windows:

<code>%matplotlib</code> (interactive)

Passing <code>inline</code> selects inline plots:

<code>%matplotlib inline</code>

It is possible to switch back and forth between modes in notebooks and consoles, but occasionally it may be necessary to restart the kernel, so it is best to stick with one or the other. I prefer interactive mode when creating plots for anything outside a notebook as it is easy to resize and save the image for import into presentations and the like.

There is a whole range of default settings for Matplotlib contained within the <code>matplotlibrc</code> file. When you import Matplotlib, you can access these settings via the dictionary-like attribute <code>matplotlib.rcParams</code>.
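
As a rough sketch of how <code>rcParams</code> can be read and temporarily overridden, the snippet below looks up one setting, changes it inside a <code>with</code> block using <code>rc_context</code>, and filters the keys by group. The <code>Agg</code> backend line is an assumption made only so the sketch runs outside a notebook; inside Jupyter you would rely on the <code>%matplotlib</code> magic instead.

```python
import matplotlib as mpl

# Assumption: a non-interactive backend so this runs as a plain script
mpl.use("Agg")

# rcParams behaves like a dictionary keyed by setting name
default_size = list(mpl.rcParams["figure.figsize"])

# Temporarily override a setting; the previous value is restored on exit
with mpl.rc_context({"figure.figsize": (10.0, 10.0)}):
    inside_size = list(mpl.rcParams["figure.figsize"])

after_size = list(mpl.rcParams["figure.figsize"])

# Collect only the settings that belong to the "figure" group
figure_keys = [key for key in mpl.rcParams if key.startswith("figure.")]

print("default:", default_size)
print("inside rc_context:", inside_size)
print("after rc_context:", after_size)
print("number of figure.* settings:", len(figure_keys))
```

Assigning directly to <code>mpl.rcParams[...]</code>, as this notebook does, changes the setting for the rest of the session, while <code>rc_context</code> limits the change to one block.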

For this notebook, we will be plotting inline, so we set that up in the next codeblock along with a default figure size in inches.

In [None]:
import matplotlib as mpl
%matplotlib inline
mpl.rcParams['figure.figsize'] = (10.0, 10.0)

Try accessing the <code>rcParams</code> attribute to see the default settings in the next codeblock.

In [None]:
# Print rcParams


Next, let's import <code>pyplot</code>, which is the state-machine interface to the Matplotlib library. The state machine allows us to build a plot iteratively through successive function calls that define different elements of the plot.

We will also import <code>pandas</code> and read in the iris dataset.

In [None]:
from matplotlib import pyplot as plt
import pandas as pd

# DataFrame.from_csv was removed from pandas; read_csv is the replacement
df = pd.read_csv('datasets/iris_data.csv', index_col=4)
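
To illustrate the state-machine style mentioned above, here is a minimal sketch in which each <code>plt</code> call modifies the current figure and axes. The data values and the figure name are made up for the illustration, and the <code>Agg</code> backend is an assumption so the sketch runs outside a notebook.

```python
import matplotlib as mpl

# Assumption: a non-interactive backend so this runs as a plain script
mpl.use("Agg")
from matplotlib import pyplot as plt

# Made-up data for the illustration
x = [0, 1, 2, 3, 4]
y = [value ** 2 for value in x]

plt.figure("state machine demo")  # this figure becomes the "current" figure
plt.plot(x, y, label="y = x^2")   # draws on the current axes
plt.title("Squares")              # each later call adds to the same axes
plt.xlabel("x")
plt.ylabel("y")
plt.legend()

# gca() returns the axes that all of the calls above were modifying
ax = plt.gca()
print(ax.get_title(), ax.get_xlabel(), ax.get_ylabel())
```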

<code>pandas</code> has several built-in plotting functions that let you call plots directly on the dataframe rather than passing the dataframe as a parameter to <code>pyplot</code>. The following codeblocks demonstrate both approaches.

In [None]:
# Plot using pandas dataframe function

df.plot()

In [None]:
# Plot using pyplot

plt.plot(df)

While the plots use the same data, the <code>pandas</code> function conveniently adds additional elements to the plot based on the dataframe structure, including the column names and index. Both methods use Matplotlib on the backend. Links to both sets of documentation are below:

* **pandas plotting:** http://pandas.pydata.org/pandas-docs/stable/visualization.html
* **Pyplot plotting:** http://matplotlib.org/users/pyplot_tutorial.html
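
The difference between the two approaches can be checked programmatically. The sketch below uses a tiny made-up dataframe (not the iris data) and inspects the axes each call produces: the <code>pandas</code> call builds a legend from the column names, while the bare <code>pyplot</code> call does not. The <code>Agg</code> backend is an assumption so the sketch runs outside a notebook.

```python
import matplotlib as mpl

# Assumption: a non-interactive backend so this runs as a plain script
mpl.use("Agg")
from matplotlib import pyplot as plt
import pandas as pd

# Tiny made-up dataframe used only for this comparison
demo = pd.DataFrame({"a": [1, 2, 3], "b": [3, 2, 1]}, index=[10, 20, 30])

# pandas style: returns the Axes it drew on, with a legend from column names
pandas_ax = demo.plot()
pandas_labels = [text.get_text() for text in pandas_ax.get_legend().get_texts()]

# pyplot style: one line per column, but no legend unless you add one
plt.figure()
pyplot_lines = plt.plot(demo)
pyplot_ax = plt.gca()

print("pandas legend labels:", pandas_labels)
print("pyplot lines drawn:", len(pyplot_lines))
print("pyplot legend present:", pyplot_ax.get_legend() is not None)
```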

Next, we will go through several different examples. Keep in mind that because the <code>pandas</code> approach uses Matplotlib, the two approaches can work together.

In [None]:
# Creates a figure and plots a scatter plot on x and y inputs
df.plot(kind='scatter', x='sepal_length', y='sepal_width', color='DarkBlue')

# Provides title for plot
plt.title('Sepal Length vs. Sepal Width') 

The last example plotted all the data points regardless of their index group. To plot each group in a different color, first create an <code>ax</code> object, which is the container for all the plot details. This method uses a dataframe slice to send only the relevant data to each plot call, and all of the calls draw onto the same axes. Each call defines its own label and color, and every call after the first passes the existing <code>ax</code> object so that the new points are added to it.

To make this clearer, we are going to switch to interactive mode. Restart the kernel, read the data back into a dataframe, run each of the following code blocks, and observe the plot output.

In [None]:
# Switch to interactive plotting
%matplotlib

In [None]:
# Set up figure object and plot first data group
ax = df.loc['Iris-setosa'].plot(kind='scatter', x='sepal_length', y='sepal_width', color='DarkBlue', label='Iris-setosa')

In [None]:
# Add second group subplot to ax object and plot
df.loc['Iris-virginica'].plot(kind='scatter', x='sepal_length', y='sepal_width', color='Green', label='Iris-virginica', ax=ax)

In [None]:
# Add third group subplot to ax object and plot
df.loc['Iris-versicolor'].plot(kind='scatter', x='sepal_length', y='sepal_width', color='Red', label='Iris-versicolor', ax=ax)

In [None]:
# Provides title for plot
plt.title('Sepal Length vs. Sepal Width')
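
The same layering can be written as a loop. This is only a sketch of one possible alternative, assuming the species name is the dataframe index as in the cells above; the six rows and the <code>colors</code> mapping are illustrative stand-ins rather than the real dataset, and the <code>Agg</code> backend is there so the sketch runs outside a notebook.

```python
import matplotlib as mpl

# Assumption: a non-interactive backend so this runs as a plain script
mpl.use("Agg")
import pandas as pd

# Illustrative stand-in rows, indexed by species like the notebook's df
demo = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
        "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    },
    index=[
        "Iris-setosa", "Iris-setosa",
        "Iris-versicolor", "Iris-versicolor",
        "Iris-virginica", "Iris-virginica",
    ],
)

# Hypothetical color choice per group
colors = {
    "Iris-setosa": "darkblue",
    "Iris-versicolor": "red",
    "Iris-virginica": "green",
}

ax = None  # the first call creates the axes; later calls reuse it
for species, group in demo.groupby(level=0):
    ax = group.plot(
        kind="scatter",
        x="sepal_length",
        y="sepal_width",
        color=colors[species],
        label=species,
        ax=ax,
    )

ax.set_title("Sepal Length vs. Sepal Width")
print("scatter layers on the axes:", len(ax.collections))
```

Passing <code>ax=None</code> on the first pass lets <code>pandas</code> create the axes, which mirrors the first cell above; every later pass reuses the returned object.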

Next, let's use the <code>plt</code> style plotting to construct a histogram of our data. Histograms are a useful way to view the dispersion of a distribution of datapoints.

In [None]:
plt.figure('hist #1')
plt.title('Petal Length with 50 Bins')
plt.hist(df['petal_length'], bins=50)

Change the code in the codeblock above to create a plot of petal width and rerun it. Note that the value of <code>bins</code> strongly affects how the histogram appears. Histograms have been criticized as a method because larger bins might obscure the underlying distribution.
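
To see the effect of <code>bins</code> in numbers rather than pictures, the following sketch bins the same sample two ways with <code>numpy.histogram</code>. The sample is synthetic (two made-up clusters), not the iris measurements.

```python
import numpy as np

# Synthetic two-cluster sample standing in for a bimodal measurement
rng = np.random.default_rng(0)
values = np.concatenate([
    rng.normal(1.5, 0.2, 50),
    rng.normal(5.0, 0.8, 100),
])

# Same data, two different bin counts
coarse_counts, coarse_edges = np.histogram(values, bins=3)
fine_counts, fine_edges = np.histogram(values, bins=50)

print("3 bins: ", coarse_counts)
print("50 bins:", fine_counts)
```

Both results account for every data point, but the coarse version compresses the shape into three numbers, which is how a gap between clusters can disappear from view.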

Kernel Density Estimation is an alternative approach to visualizing distributions.

* https://en.wikipedia.org/wiki/Kernel_density_estimation

<code>pandas</code> has a built-in method for this as well, but the scikit-learn package is much more robust. The following link details this package and some of the problems with histograms.

* http://scikit-learn.org/stable/modules/density.html
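
For intuition about what a kernel density estimate computes, here is a minimal hand-written sketch of a one-dimensional Gaussian KDE in NumPy. It is not how <code>pandas</code> or scikit-learn implement it; the function name <code>gaussian_kde_1d</code>, the sample values, and the bandwidth of 0.5 are all assumptions chosen for the illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Standard normal kernel centered on each sample
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    # Average the kernels and rescale by the bandwidth
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Hypothetical measurements, not taken from the iris file
samples = np.array([1.4, 1.5, 1.3, 4.7, 5.1, 4.5, 5.9, 6.0])
grid = np.linspace(-2.0, 10.0, 601)
density = gaussian_kde_1d(samples, grid, bandwidth=0.5)

# A density should integrate to roughly 1 over a wide enough grid
area = float(np.sum(density) * (grid[1] - grid[0]))
print("approximate area under the curve:", round(area, 3))
```

The bandwidth plays a role similar to bin width in a histogram: a larger value smooths the curve more, a smaller one follows individual points more closely.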

In [52]:
df['petal_length'].plot(kind='kde')


As you might guess, the <code>kind</code> parameter in the <code>pandas</code> style plotting allows you to select many other plot types, including line, bar, and pie.

For multidimensional data like the iris dataset, a scatter matrix is an excellent way to visualize the correlation between variables. <code>pandas</code> has a built-in function that is easy to apply to our dataset. Run the code in the next codeblock.

In [53]:
# scatter_matrix now lives in pandas.plotting (pandas.tools.plotting was removed)
from pandas.plotting import scatter_matrix

scatter_matrix(df)
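
A numerical companion to the scatter matrix is the correlation matrix from <code>DataFrame.corr()</code>. The sketch below uses synthetic columns (the names <code>length</code>, <code>width</code>, and <code>noise</code> are made up) so that one pair is clearly related and another is not.

```python
import numpy as np
import pandas as pd

# Synthetic columns: "width" is built from "length", "noise" is unrelated
rng = np.random.default_rng(1)
length = rng.normal(5.0, 1.0, 200)
demo = pd.DataFrame({
    "length": length,
    "width": 0.4 * length + rng.normal(0.0, 0.1, 200),
    "noise": rng.normal(0.0, 1.0, 200),
})

# Pairwise Pearson correlations, one row and column per variable
corr = demo.corr()
print(corr.round(2))
```

Pairs that form a tight diagonal band in the scatter matrix show up here as values near 1 or -1, while shapeless clouds give values near 0.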

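In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(style="white")

# Reload the iris data with the species column (position 4) as the index
df = pd.read_csv('datasets/iris_data.csv', index_col=4)

# Joint KDE plots: a 2-D density in the center with 1-D densities on the margins.
# Recent seaborn versions expect x/y as keywords and use height instead of size.
sepal_plot = sns.jointplot(x='sepal_length', y='sepal_width', data=df, kind='kde', height=7, space=0)
petal_plot = sns.jointplot(x='petal_length', y='petal_width', data=df, kind='kde', height=7, space=0)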

