In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
sns.set()
matplotlib.rcParams['figure.dpi'] = 144

In [None]:
import altair as alt
alt.renderers.enable('notebook')

# Exploratory Visualization
<!-- requirement: small_data/goog.json -->
<!-- requirement: small_data/temperatures.csv -->


One purpose of visualization is to help the analyst understand and model the data at hand. Exploratory visualization prioritizes speed over style, in contrast to explanatory visualization.

A few popular Python packages are presented below, along with some boilerplate code and external references.

The rest of the notebook focuses on interactive data exploration inside a Jupyter Notebook.

## Python visualization tools


* matplotlib (a [thorough rundown](http://www.randalolson.com/2014/06/28/how-to-make-beautiful-data-visualizations-in-python-with-matplotlib/) of its potential)
* [pandas](http://pandas.pydata.org/pandas-docs/stable/visualization.html) has its own useful plotting interface around matplotlib
* [Seaborn](https://github.com/mwaskom/seaborn) (focuses on statistics, easier to customize than matplotlib)
* [Altair](https://altair-viz.github.io/) (focus on interactivity, browser delivery)


### Domains for exploratory and explanatory visualization

In contrast to techniques using D3, which tend to focus on polished, *explanatory* visualizations for the end-user, Jupyter notebooks tend to be more useful for *exploration*. This is the time when running code interactively is useful because you're interested in the effect of changing a few lines of code without rerunning the entire script. We already discussed the rationale behind including visualization as part of your data exploration, and interactivity is a powerful tool to accomplish that objective, whether it's just you, or your team huddled around your computer.

## Describing a distribution

* Mean
* Median
* Variance
* Standard Deviation

Statistical parameters often provide important insight into the data and can reveal information that is not visually obvious. However, it's important to consider their limitations as well and to think about what is gained by visual exploration.
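As a rough illustration of those limitations, the sketch below computes the four statistics listed above with NumPy, once on a small sample and once after appending a single extreme value. The numbers are invented for this example, and the helper name `summarize` is hypothetical.

```python
import numpy as np

# Hypothetical sample values, invented for illustration only
clean = np.array([2.0, 3.0, 3.0, 4.0, 5.0, 5.0, 6.0])
with_outlier = np.append(clean, 60.0)  # same data plus one extreme value

def summarize(values):
    """Return the four summary statistics listed above."""
    return {
        "mean": np.mean(values),
        "median": np.median(values),
        "variance": np.var(values, ddof=1),  # sample variance
        "std": np.std(values, ddof=1),       # sample standard deviation
    }

summary_clean = summarize(clean)
summary_outlier = summarize(with_outlier)

# Print each statistic before and after adding the extreme value
for name in summary_clean:
    print("{:>8}: {:8.2f} -> {:8.2f}".format(
        name, summary_clean[name], summary_outlier[name]))
```

In this toy sample the median barely moves, while the mean and the standard deviation shift substantially, which is the kind of effect a quick plot would make obvious.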

Outliers are a good place to start: they are easy to spot visually, but they can have a deceptive influence on statistical metrics. Consider [Anscombe's Quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet), a set of four data sets with nearly identical aggregate properties:

In [None]:
aq = sns.load_dataset("anscombe")

The *x* and *y* components of each set have similar means and standard deviations.

In [None]:
for ds in ('I', 'II', 'III', 'IV'):
    print('Dataset ' + ds)
    print(str(aq[aq['dataset'] == ds].describe())+ "\n")

For each data set, the *x* and *y* components have nearly identical correlations.

In [None]:
for ds in ('I', 'II', 'III', 'IV'):
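    # restrict to the numeric columns; recent pandas versions raise on the string 'dataset' column
    print('Dataset {}: {}'.format(ds, aq.loc[aq['dataset'] == ds, ['x', 'y']].corr().loc['x', 'y']))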

So do the data sets represent the same relation?

In [None]:
sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=aq,
           col_wrap=2, ci=None, palette="muted", height=4,
           scatter_kws={"s": 50, "alpha": 1})

*Question*: When can the situation in data set IV arise in real life? What can be done to identify this kind of situation in the feature space?

## Histograms

In [None]:
tips = sns.load_dataset("tips")

Histograms are indispensable tools for visualizing distributions in your data. When working with DataFrames, `pandas` makes it easy to quickly get histograms of your features. 

In [None]:
print(tips.shape) # (rows, columns)
tips.head()

In [None]:
tips_hist = tips.hist()    #pandas hist() function. tips is a DataFrame

In [None]:
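# To save the plot, first get the Figure from one of the Axes returned by hist():
# tips_fig = tips_hist.flat[0].get_figure()
# tips_fig.savefig('tiphist.png')
# tips_fig.savefig('tiphist.pdf')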

For more information on kernel density estimation, check out these blog posts:
* [Michael Lerner's motivation of KDE based on histograms](http://www.mglerner.com/blog/?p=28)
* [A comparison of KDE methods in Python](https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/)

## Box plots and Violin Plots

In [None]:
sns.boxplot(x='total_bill', data=tips)

In [None]:
sns.violinplot(x="day", y="total_bill", hue="smoker", data=tips, palette="Set2")

You can see more examples in the API: 
http://seaborn.pydata.org/generated/seaborn.violinplot.html#seaborn.violinplot

## Relationships between variables


### Scatter Matrices

In [None]:
import pandas as pd
pd.plotting.scatter_matrix(tips, alpha=0.2, figsize=(6, 6), diagonal='kde')
# available in seaborn as pairplot()



### Linear correlation

The most common metric is [Pearson's](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) correlation coefficient (the covariance normalized by the product of the standard deviations). It ranges from -1 (total negative linear correlation) to 1 (total positive linear correlation), with 0 indicating no linear correlation.
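The definition in parentheses can be checked numerically. The following is a minimal sketch on synthetic data (the arrays `x` and `y` are generated here and are not part of the notebook's data sets), comparing a hand-computed coefficient with NumPy's `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, hypothetical data: y depends linearly on x plus noise
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(scale=0.5, size=500)

# Covariance normalized by the product of the standard deviations
# (ddof=1 is used consistently for both, so the normalization cancels)
cov_xy = np.cov(x, y, ddof=1)[0, 1]
pearson_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# NumPy's built-in estimate for comparison
pearson_numpy = np.corrcoef(x, y)[0, 1]

print(pearson_manual, pearson_numpy)
```

Both values should agree up to floating-point error.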

In [None]:
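# select the numeric columns explicitly; recent pandas versions raise on categorical columns
tips[['total_bill', 'tip', 'size']].corr(method='pearson')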

In [None]:
sns.jointplot(x='total_bill', y='tip', data=tips)

*Question*: Imagine you're trying to do some feature selection using Pearson's coefficient. What are two situations where this metric can be misleading?

### Indirect Influence / constraints

- e.g. speed is highly correlated with accidents only if driving on the highway
- This mostly boils down to intelligently looking at subsets of the data, edge cases, etc.
- Leave one out for predictive models
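The first bullet can be sketched with synthetic data. Everything below is invented for illustration (the column names `road`, `speed`, and `accidents` are assumptions, not columns from any data set used in this notebook): the correlation is computed once overall and once per subset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Synthetic data, invented for illustration: on the highway the accident
# rate rises with speed; in the city it does not depend on speed.
highway = pd.DataFrame({
    "road": "highway",
    "speed": rng.uniform(50, 80, n),
})
highway["accidents"] = 0.1 * highway["speed"] + rng.normal(0, 0.3, n)

city = pd.DataFrame({
    "road": "city",
    "speed": rng.uniform(20, 40, n),
})
city["accidents"] = rng.normal(3, 0.3, n)

df_roads = pd.concat([highway, city], ignore_index=True)

# Correlation over all rows versus correlation within each subset
overall_corr = df_roads["speed"].corr(df_roads["accidents"])
corr_by_road = {
    road: group["speed"].corr(group["accidents"])
    for road, group in df_roads.groupby("road")
}
print("overall:", overall_corr)
print("by road:", corr_by_road)
```

Comparing the overall number with the per-subset numbers is a quick way to notice that a relationship only holds under certain conditions.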

In [None]:
print(tips['tip'].mean())
print(tips[tips['size'] > 1]['tip'].mean())
print(tips[tips['size'] == 1]['tip'].mean())

*Question*: How meaningful is the above? What else do we need to consider?

In [None]:
sns.lmplot(x='total_bill', y='tip', hue='time', data=tips, palette="Set2")

## Non-obvious patterns in the data

### Autocorrelation

Note: here, we focus on plotting this kind of data. Details on time-series analysis are provided in module 3.
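For intuition about what an autocorrelation plot displays, here is a minimal sketch using `pandas.Series.autocorr` on a synthetic hourly signal with a 24-hour cycle. The signal is generated for this example only, and the estimator used by the plotting helper may differ slightly in its normalization.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Synthetic hourly signal (hypothetical): a 24-hour cycle plus small noise
hours = np.arange(240)
signal = pd.Series(np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.1, hours.size))

# Correlation of the series with a copy of itself shifted by `lag` samples
for lag in (1, 12, 24):
    print("lag {:2d}: {:.2f}".format(lag, signal.autocorr(lag=lag)))

acf_full_cycle = signal.autocorr(lag=24)
acf_half_cycle = signal.autocorr(lag=12)
```

A lag of one full cycle gives a strongly positive value and a lag of half a cycle gives a strongly negative one, which is the oscillation pattern to look for in a seasonal series.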

In [None]:
from pandas.plotting import autocorrelation_plot, lag_plot

In [None]:
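# Get temperature data
temps_df = pd.read_csv("small_data/temperatures.csv",
                       index_col=0,
                       names=["Temperature"])
# Parse the index as timestamps (pd.datetime and the date_parser argument
# are deprecated or removed in recent pandas releases)
temps_df.index = pd.to_datetime(temps_df.index, format="%Y-%m-%d %H:%M:%S")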

# get GOOG data
import simplejson as json

with open('small_data/goog.json') as raw_f:
    raw_data = raw_f.read()
    json_data = json.loads(raw_data)

goog_df = pd.DataFrame(json_data['data'], columns=json_data['column_names'])

goog_open = goog_df['Open']

In [None]:
print(goog_df.columns)
goog_df.head()

In [None]:
goog_df['Open'].plot()

In [None]:
# autocorrelation is near 1 for short lags
autocorrelation_plot(goog_open)
matplotlib.pyplot.xlabel('Lag (days)')

In [None]:
temps_df[-4:]

In [None]:
temps_df[:2]

In [None]:
# seasonality is apparent in the temperature data
# this represents 13 years of temperatures, note 13 oscillations
autocorrelation_plot(temps_df)
matplotlib.pyplot.xlabel('Lag (hours)')

### Using FFT to tease out trends in a time series


**Background notes on Fourier Analysis:**
Any periodic signal can be represented as the sum of a number of sine waves with varying amplitude, phase, and frequency.

A time series can be converted into its frequency components with the mathematical tool known as the Fourier transform. As we are dealing with sampled data, we must use the discrete version. The common algorithm for computing discrete transforms is the fast Fourier transform, usually abbreviated FFT.

The **output of an FFT can be thought of as a representation of all the frequency components of your data**. In some sense it is a histogram with each “frequency bin” corresponding to a particular frequency in your signal.
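The "frequency bin" picture can be made concrete with a short sketch. It uses NumPy's `np.fft.rfft` and `np.fft.rfftfreq`; the sampling rate and sample count are assumptions chosen so that both test tones land exactly on a bin.

```python
import numpy as np

# Assumed sampling setup (chosen so both tones fall exactly on a frequency bin)
sample_rate = 800.0          # samples per second
n_samples = 800              # one second of data -> 1 Hz bin spacing
t = np.arange(n_samples) / sample_rate

# Two sine waves: 50 Hz with amplitude 1.0 and 80 Hz with amplitude 0.5
y_demo = np.sin(2 * np.pi * 50.0 * t) + 0.5 * np.sin(2 * np.pi * 80.0 * t)

# Real-input FFT and the frequency (in Hz) belonging to each output bin
spectrum = np.fft.rfft(y_demo)
freqs = np.fft.rfftfreq(n_samples, d=1.0 / sample_rate)
amplitudes = 2.0 / n_samples * np.abs(spectrum)

# The two largest bins should sit at the two input frequencies
top_two = np.sort(freqs[np.argsort(amplitudes)[-2:]])
print(top_two)
```

Each entry of `amplitudes` plays the role of one bar in the histogram analogy: its position is a frequency and its height is how strongly that frequency is present.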

In [None]:
# here is a simple example to illustrate FFT
import numpy as np
import matplotlib.pyplot as plt
import scipy.fftpack

In [None]:
# the data
N = 600
T = 1.0 / 800.0

# endpoint=False keeps the spacing between samples exactly T
x = np.linspace(0.0, N*T, N, endpoint=False)
y = np.sin(50.0 * 2.0 * np.pi * x) + 0.5*np.sin(80.0 * 2.0 * np.pi * x)

plt.plot(x, y)
plt.title("The data");

In [None]:
y.mean()

In [None]:
# FFT of the data
# (usually, you subtract the mean before performing FFT...)
yf = scipy.fftpack.fft(y)
# endpoint=False makes each value the exact frequency of its FFT bin, k / (N*T)
xf = np.linspace(0.0, 1.0/(2.0*T), N//2, endpoint=False)

fig, ax = plt.subplots()
ax.plot(xf, 2.0/N * np.abs(yf[:N//2]))
plt.title("FFT of the data")
plt.show();

(In words, explain how the FFT plot relates to the plot of the data.)

## Interactivity in visualizations

While it is possible to include additional axes of data using color, thickness, texture, etc., it can be beneficial to use interactivity for this purpose.
* Gives the end user control, gets them engaged with the data.
* Allows for the base graph to be less cluttered and send a clearer message.
* "Feels" impressive. Interactive graphs make you feel like there's a vast amount of data which you're tapping into.

Most web-deployed interactive plots (or dashboards, as people like to call them) run on D3. You give these tools a large data source server-side, and the JavaScript can rapidly render the desired slice of the data. For polished tools, look at Bokeh server, `Pyxley`, or `highcharts`. 

Even in exploratory visualization, interactivity can enable you (or your team!) to more quickly identify important characteristics of your data and ascertain relationships between features. `Altair` is an easy-to-use Python package that enables a broad range of interactivity in just a few lines of code. It is an ideal candidate for interactivity in exploratory visualizations and still powerful enough for more polished visualizations as well.

In [None]:
import altair as alt

Let's start by making a simple scatter plot -- we'll add interactivity in the next step.

Once again we'll use the `tips` DataFrame. We'll tell Altair to make **point** marks, and give it an **encoding** specifying which feature should be shown on each axis, as well as which feature should control the color of the points.

### The `Chart()` object

The primary class used in Altair is `Chart()`, upon which marks, encodings, and interactivity can be applied.

Note that in Altair, calling a method on a `Chart` object returns another `Chart` object, meaning that methods can be "chained" together.

In [None]:
scatter=alt.Chart(tips).mark_point().encode(
    x='total_bill',
    y='tip',
    color='smoker'
)

scatter

In general, creating a chart in Altair involves the following steps:
   * specify a data source (a pandas DataFrame)
   * choose a type of mark (e.g. lines, points)
   * specify an *encoding* (set axes, visual cues)
   * define interactivity
   
   
We've done the first three steps already. We can add some basic pan/zoom interactivity to Altair by adding `.interactive()` to our chart. 

Altair really begins to shine when we combine multiple, interactive views of our data. Let's add a histogram for the `size` column, and add a `selection` to it so that we can get conditional scatter plots based on the value of size.

We're using a `selection_multi()` instance in Altair, so we can select multiple values in the histogram with `Shift+Click`. We initialize the selection with the argument `encodings=['x']`, which tells Altair that the selections should refer only to the `x` value of the selected entries.

* Multiple charts can be displayed using `chart1 & chart2` (vertical) or `chart1 | chart2` (horizontal)

In [None]:
size_selector = alt.selection_multi(encodings=['x'])


scatter = alt.Chart(tips, width=500).mark_point().encode(
    x="total_bill",
    y="tip",
    #selected sizes are colored according to the "smoker" column, others are rendered in white
    color = alt.condition( size_selector , "smoker", alt.value("white"))
)
    

size_hist = alt.Chart(tips, width=500, height=200).mark_bar().encode(
    x = "size:N",
    y = "count()",
    color = alt.condition(size_selector, alt.value("blue"), alt.value("lightgray"))
).add_selection(size_selector)

scatter & size_hist

We can combine multiple selections and bindings to get even more powerful interactions!

Below, we've added selection tools to the scatter plot as well, and the size histogram is altered to only consider points lying in the selection.

 * Note the use of `transform_filter` in the specification of `size_hist`, which tells Altair to only include points in the selection.

In [None]:
size_selector = alt.selection_multi(encodings=['x'], empty='all')
scatter_selector = alt.selection(type='interval', encodings=['x','y'], empty='all')


scatter = alt.Chart(tips, width = 500).mark_point().encode(
    x = "total_bill",
    y = "tip",
    color = alt.condition(size_selector | scatter_selector, "smoker", alt.value("white"))
).add_selection(scatter_selector)
    
size_hist = alt.Chart(tips, width=500, height=200).mark_bar().encode(
    x = "size:N",
    y = "count()",
    color = alt.condition(size_selector , alt.value("blue"), alt.value("lightgray"))
).transform_filter(
    scatter_selector
).add_selection(size_selector)

scatter & size_hist

### Line Plot

For a given stock (Google, in this example), let's say we want to visualize changes in opening price, closing price, daily high/low prices, and trade volume over time. We can use Altair to create an interactive line plot that allows us to select which feature to plot as the y-value.

In [None]:
import pandas as pd
import numpy as np
import ujson as json
import altair as alt

In [None]:
with open('small_data/goog.json') as raw_f:
    raw_data = raw_f.read()
    json_data = json.loads(raw_data)
    
df = pd.DataFrame(json_data['data'], columns=json_data['column_names'])
df.set_index(pd.DatetimeIndex(df['Date']), inplace=True)

Let's take a look at this DataFrame:

In [None]:
df.head()

### A basic line chart in Altair

We start with a basic line chart, which will serve as a template to build our interactive chart.
* Note the use of `mark_line()` instead of `mark_point()`. 
* Altair's logical separation of **marks** and **encodings** makes it very fast to switch between multiple types of charts. What happens if we add `.mark_point()` to the bottom line?

In [None]:
import altair as alt
base = alt.Chart(df).mark_line().encode(
    alt.X('Date:T'),
    alt.Y('Close')
)
base

Note the `:T` next to the word Date above. This tells Altair what kind of data the `Date` column contains. Altair uses the following specifications:
  * `:N` Nominal (Categorical)
  * `:O` Ordinal
  * `:Q` Quantitative (Interval, Ratio)
  * `:T` Time (Special formatting for dates and times)

### Interactively select a y-variable

We can extend the functionality of this chart further by modifying it for some interactivity. Let's add a drop down menu that allows for selecting a column to plot as a y-value. 

To do so, we'll first have to change the shape of our DataFrame from its current *wide* format to a more Altair-friendly *long* format. 

In [None]:
df_long = df.melt(id_vars="Date", value_vars=['Open', 'High', 'Low', 'Close', 'Volume'])
df_long.sample(10, random_state=1)
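To see what the wide-to-long reshaping does on a smaller scale, here is a sketch with a tiny table of made-up prices (the values and the names `wide` and `long` are for illustration only).

```python
import pandas as pd

# Tiny wide-format table with made-up prices, for illustration only
wide = pd.DataFrame({
    "Date": ["2015-01-02", "2015-01-05"],
    "Open": [100.0, 101.0],
    "Close": [100.5, 99.5],
})

# One row per (Date, column) pair; the former column names land in "variable"
long = wide.melt(id_vars="Date", value_vars=["Open", "Close"])
print(long)
```

The long table has one `value` column plus a `variable` column naming which original column each value came from, which is what lets a single chart filter on `variable`.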

Now we specify a **binding**, which supplies the values offered in the dropdown (here, the values in the `variable` column), and a **selection**, which is added to the chart to enable the interactivity.

In [None]:
base = alt.Chart(df_long).mark_line().encode(
    alt.X('Date:T'),
    alt.Y('value', title=" ")
).properties(
    height=240,
    width = 600
)


# A dropdown filter
columns=['Open', 'High', 'Low', 'Close', 'Volume']
column_dropdown = alt.binding_select(options=columns)
column_select = alt.selection_single(
    fields=['variable'],
    on='doubleclick',
    clear=False, 
    bind=column_dropdown, 
    name="y",
    init={'variable': "Close"}
)


filter_columns = base.add_selection(
    column_select
).transform_filter(
    column_select
)

filter_columns

### Selecting a date range

From the plot above, it looks like both price and volume see a major uptick around July/August. What if we want to "zoom in" on those days interactively?

We've seen how Altair enables a default pan/zoom interaction by simply calling the `.interactive()` method on a chart. Alternatively, let's say we want to specify a date range by selecting a time window on the entire graph, and obtain a view of just that range. Altair also makes this easy compared to other plotting packages.

In [None]:
# Define a new selection object.
brush = alt.selection(type='interval', encodings=['x'], clear=False)



#Link the domain selected by brush to the X range of the chart.
base = alt.Chart(df_long, width=400, height=400).mark_line().encode(
    alt.X('Date:T', scale=alt.Scale(domain=brush)),
    alt.Y('value', title=" ")
)

columns=['Open', 'High', 'Low', 'Close', 'Volume']

# A dropdown filter
column_dropdown = alt.binding_select(options=columns)
column_select = alt.selection_single(
    fields=['variable'],
    on='doubleclick',
    clear=False, 
    bind=column_dropdown, 
    name="y_value",
    init={'variable': "Close"}
)


#Specify the top chart as a modification of the base chart
filter_columns = base.add_selection(
    column_select
).transform_filter(
    column_select
).properties(
    height=240,
    width = 600
)


#Specify the lower chart as a modification of the top chart (filter_columns)
lower = filter_columns.add_selection(
    column_select
).transform_filter(
    column_select
).properties(
    height=60,
    width = 600
).add_selection(brush)



filter_columns & lower

*Copyright &copy; 2019 The Data Incubator.  All rights reserved.*