In [11]:
from IPython.display import HTML, IFrame, Image

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline

# Visualizing Big Datasets: Tools, Pitfalls, Experimental Example

## Importance of Data Visualization
We'll use part of the well-studied NYC Taxi trip database.

![Image](./images/nyc_pickups_vs_dropoffs.jpg)

## Plotting very large datasets meaningfully

- provides a clearer understanding
- aids decision making

When working with large datasets, visualizations are often the only way available to understand the properties of that dataset -- there are simply too many data points to examine each one!  Thus it is very important to be aware of some common plotting problems that are minor inconveniences with small datasets but very serious problems with larger ones.

![Image](./images/6-blind-men-hans.jpg)

https://python-graph-gallery.com/

## Presentation Outline:
- Tools introduction
- Ratcave VR Acuity introduction
- Pitfalls of Large Dataset Visualization with a Real Data Example

## Tools: Pandas, Seaborn, Datashader

Remarks: create something in pandas -> visualize it with seaborn and/or datashader

### Pandas
![Image](./images/pandas.png)

In [7]:

import pandas as pd

Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools built on its powerful data structures. The name Pandas is derived from "panel data", an econometrics term for multidimensional data sets.

Key Features of Pandas

- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns from a data structure can be deleted or inserted.
- Group by data for aggregation and transformations.
- High performance merging and joining of data.
- Time Series functionality.


In [15]:
## EXAMPLES
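
A minimal sketch of a few of the features listed above (label-based selection, missing-data handling, group-by aggregation, inserting and deleting columns). The DataFrame and its column names (`session`, `x`, `y`, `r`) are made up for illustration and are not the presentation's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two sessions of x/y positions, with one missing x value.
df = pd.DataFrame(
    {
        "session": ["a", "a", "b", "b", "b"],
        "x": [0.1, 0.4, np.nan, 0.8, 0.5],
        "y": [1.0, 1.2, 0.9, 1.1, 1.3],
    }
)

# Label-based selection: rows of session "b", columns x and y only.
session_b = df.loc[df["session"] == "b", ["x", "y"]]

# Integrated missing-data handling: mean() skips NaN by default.
mean_x = df["x"].mean()

# Group-by aggregation: per-session mean of each numeric column.
per_session = df.groupby("session")[["x", "y"]].mean()

# Columns can be inserted and deleted.
df["r"] = np.hypot(df["x"], df["y"])
df = df.drop(columns=["r"])

print(per_session)
```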

### Seaborn
![Image](./images/seaborn.png)

Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

Some of the features that seaborn offers are

- Several built-in themes for styling matplotlib graphics
- Tools for choosing color palettes to make beautiful plots that reveal patterns in your data
- Functions for visualizing univariate and bivariate distributions or for comparing them between subsets of data
- Tools that fit and visualize linear regression models for different kinds of independent and dependent variables
- Functions that visualize matrices of data and use clustering algorithms to discover structure in those matrices
- A function to plot statistical timeseries data with flexible estimation and representation of uncertainty around the estimate
- High-level abstractions for structuring grids of plots that let you easily build complex visualizations
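
As a rough illustration of the styling and subset-comparison features above, the sketch below draws two synthetic groups with `sns.scatterplot`. The data, the column names, and the choice of the `"whitegrid"` style are assumptions made for the example, not part of the presentation's dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: two groups of 2D points with shifted centers.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame(
    {
        "x": np.concatenate([rng.normal(0, 1, n), rng.normal(2, 1, n)]),
        "y": np.concatenate([rng.normal(0, 1, n), rng.normal(2, 1, n)]),
        "group": ["A"] * n + ["B"] * n,
    }
)

# One of seaborn's built-in themes for styling matplotlib graphics.
sns.set_style("whitegrid")

# Compare the two subsets in one plot by mapping the group to color.
fig, ax = plt.subplots(figsize=(5, 5))
sns.scatterplot(data=df, x="x", y="y", hue="group", s=10, alpha=0.5, ax=ax)
ax.set_title("Two synthetic groups")
```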


### Datashader
![Image](./images/datashader.png)

Datashader is a data rasterization pipeline for automating the process of creating meaningful representations of large amounts of data. Datashader breaks the creation of images of data into 3 main steps:

1. Projection - Each record is projected into zero or more bins of a nominal plotting grid shape, based on a specified glyph.

2. Aggregation - Reductions are computed for each bin, compressing the potentially large dataset into a much smaller aggregate array.

3. Transformation - These aggregates are then further processed, eventually creating an image.

Using this very general pipeline, many interesting data visualizations can be created in a performant and scalable way. Datashader contains tools for easily creating these pipelines in a composable manner, using only a few lines of code. Datashader can be used on its own, but it is also designed to work as a pre-processing stage in a plotting library, allowing that library to work with much larger datasets than it would otherwise.
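
The three steps can be sketched with plain NumPy. This is a stand-in to show the idea, not Datashader's own API: `np.histogram2d` plays the role of projection plus aggregation for point glyphs, and a log scaling plays the role of the transformation. The synthetic data, grid size, and ranges are assumptions.

```python
import numpy as np

# Synthetic "large" dataset: one million 2D points.
rng = np.random.default_rng(42)
n_points = 1_000_000
x = rng.normal(0.0, 1.0, n_points)
y = rng.normal(0.0, 1.0, n_points)

# Nominal plotting grid (assumed size and data ranges).
width, height = 300, 300
x_range = (-4.0, 4.0)
y_range = (-4.0, 4.0)

# Steps 1 and 2, projection and aggregation: each point falls into at most
# one bin of the grid, and the reduction computed per bin is a count.
counts, _, _ = np.histogram2d(
    x, y, bins=(width, height), range=(x_range, y_range)
)

# Step 3, transformation: compress the dynamic range with a log scale and
# map it to 8-bit grayscale. The transpose puts y on the row axis; row 0 is
# the lowest y, so display it with origin="lower".
scaled = np.log1p(counts)
image = (255 * scaled / scaled.max()).astype(np.uint8).T

print(image.shape, image.dtype)
```

The aggregate array has a fixed size (here 300 x 300) no matter how many points go in, which is what makes the approach scale.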

## Ratcave Virtual Reality: Acuity Measurement Project
![Image](./images/rat1.gif)

# Dataset explanation: position and orientation
<img src='./images/position.png'  width="500" height="500"/> |  <img src="./images/spherical_coordinates.png"  width="500" height="500"/>

## What are we looking for? Stimuli Related Behavior
<img src='./images/ratSRB.gif'  width="500" height="500"/>

## Pitfalls of Large Dataset Visualization

In [2]:
## Load The Data

### Overplotting / Overdrawing

Let's consider plotting some 2D data points that come from two separate categories, here plotted as blue and red in **A** and **B** below.  When the two categories are overlaid, the appearance of the result can be very different depending on which one is plotted first:



Plots **C** and **D** show the same distribution of points, yet they give a very different impression of which category is more common, which can lead to incorrect decisions based on this data.  Of course, both are equally common in this case.  The cause of this problem is simply occlusion.

Occlusion of data by other data is called **overplotting** or **overdrawing**, and it occurs whenever a datapoint or curve is plotted on top of another datapoint or curve, obscuring it.  It's thus a problem not just for scatterplots, as here, but for curve plots, 3D surface plots, 3D bar graphs, and any other plot type where data can be obscured.
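
The effect of drawing order can be reproduced without a plotting library. The sketch below paints two equally large synthetic categories onto a small pixel canvas, where a later point overwrites an earlier one, and counts how many pixels of each category remain visible. The canvas size, point counts, and distribution are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
SIZE = 100        # canvas is SIZE x SIZE pixels
N = 20_000        # points per category: both are equally common
BLUE, RED = 1, 2  # pixel labels; 0 means empty


def random_pixels(count):
    """Draw integer (row, col) pixel coordinates from one shared distribution."""
    coords = rng.normal(loc=SIZE / 2, scale=SIZE / 8, size=(count, 2))
    return np.clip(coords, 0, SIZE - 1).astype(int)


def paint(layers):
    """Paint (label, pixels) layers in order; later layers occlude earlier ones."""
    canvas = np.zeros((SIZE, SIZE), dtype=np.uint8)
    for label, pixels in layers:
        canvas[pixels[:, 0], pixels[:, 1]] = label
    return canvas


def visible(canvas):
    """Count the pixels that still show each category."""
    return {
        "blue": int(np.count_nonzero(canvas == BLUE)),
        "red": int(np.count_nonzero(canvas == RED)),
    }


blue, red = random_pixels(N), random_pixels(N)
blue_then_red = paint([(BLUE, blue), (RED, red)])
red_then_blue = paint([(RED, red), (BLUE, blue)])

print("red drawn last: ", visible(blue_then_red))
print("blue drawn last:", visible(red_then_blue))
```

Both canvases cover exactly the same pixels; only the category that is drawn last appears to dominate.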

### Saturation (and Undersaturation)

### Color Palette
https://seaborn.pydata.org/tutorial/color_palettes.html

## Ratcave Virtual Reality: Acuity Data

1) Bad examples first: just load the dataset and show it
* plus show them how nicely you can plot with seaborn
2) Try changing the alpha and size parameters -> then realize that, oops, it's not really working
3) Create heatmaps
4) Then explain the creation of the fancy distribution plot
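
One way to see why tuning alpha in step 2 tends to fail: under standard "over" blending (an assumption about how the renderer composites points), `n` overlapping points with opacity `alpha` produce a combined opacity of `1 - (1 - alpha)**n`. The helper name `stacked_opacity` is hypothetical.

```python
import numpy as np


def stacked_opacity(alpha, n_overlapping):
    """Combined opacity of n identical points drawn on top of each other."""
    n = np.asarray(n_overlapping, dtype=float)
    return 1.0 - (1.0 - alpha) ** n


# How opaque does a pixel get for different overlap counts?
overlap_counts = np.array([1, 10, 50, 500])
for alpha in (0.5, 0.1, 0.01):
    print(f"alpha={alpha}:", np.round(stacked_opacity(alpha, overlap_counts), 3))
```

A large alpha saturates dense regions after a handful of points, while a small alpha leaves sparse regions nearly invisible, so no single value suits a dataset whose density varies widely. Binning the points into counts, as in the heatmaps of step 3, avoids the choice entirely.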


# Thank you for your attention!