# Visualization in Python

## Background

Why visualize?
- Discovery
- Inference
- Communication


Terminology
- Representation
  - Environment for visualization (e.g., 2d, 3d, sound)
- Idiom
  - Constructs used (e.g., bar plot, area plot)
- Task
  - What the user is trying to do (e.g., compare, predict, find relationships)
- Design
  - Choice of the representation(s) and idiom(s) to perform the task


## Software Engineering & Visualization

There are many Python packages for visualization.
- pandas – Visualization of pandas objects
- matplotlib – MATLAB-style plotting in Python
- seaborn – Statistical visualizations
- bokeh – Interactive visualization using the browser
- HoloViews – Simplified visualization of engineering/scientific data
- VisPy – Fast, scalable, simple interactive scientific visualization
- Altair – Declarative statistical visualization


We'll begin with visualization in pandas and then focus on matplotlib. Both packages are well documented.
The case study analyzes the flow of bicycles out of stations in the Pronto trip data.
In this section, we'll discuss:
- the structure of a matplotlib plot
- different plot idioms
- doing multiple plots
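
As a preview of the first item above, here is a minimal sketch of how a matplotlib plot is organized: a Figure holds one or more Axes, and each Axes holds the drawn artists, such as bars. The bar heights are made up for illustration, and the Agg backend is selected only so that the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # Headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))  # One Figure containing one Axes
bars = ax.bar([0, 1, 2], [5, 3, 8])     # Made-up heights, for illustration only
ax.set_xlabel("Station")
ax.set_ylabel("Counts")
ax.set_title("Structure demo")

print(type(fig).__name__)  # Class name of the top-level container
print(len(fig.axes))       # Number of Axes held by the Figure
print(len(bars))           # Number of bar rectangles drawn on the Axes
```

The `plt.xlabel`, `plt.title`, and similar calls used later in this section act on the current Axes, so they reach the same objects through a shorter interface.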

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
# The following ensures that the plots are in the notebook
%matplotlib inline
# We'll also use capabilities in numpy
import numpy as np

Analysis questions
- Which stations have the biggest difference between in-flow and out-flow of bikes?
- Where can we localize the movement of bicycles between stations that are in close proximity?

## Preparing Data For Plotting

In [None]:
df = pd.read_csv("2015_trip_data.csv")
df.head()

Now let's consider the flow of bicycles from and to stations.

In [None]:
# Series.value_counts() is the method form of the older pd.value_counts(series)
from_counts = df.from_station_id.value_counts()
to_counts = df.to_station_id.value_counts()

## Simple Plots for Series

Let's address the question "Which stations have the biggest difference between the in-flow and out-flow of bicycles?"

What kind of objects are returned from pd.value_counts? Are these plottable? How do we figure this out?
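
One way to explore these questions is to inspect the object directly. The sketch below uses a small made-up Series of station IDs (not the Pronto data) and checks the type of the result and whether it exposes a `plot` accessor.

```python
import pandas as pd

# Made-up station IDs standing in for df.from_station_id
station_ids = pd.Series(["A", "B", "A", "C", "A", "B"])
counts = station_ids.value_counts()

print(type(counts))             # The class of the returned object
print(counts.index.tolist())    # The station IDs become the index
print(hasattr(counts, "plot"))  # True if the object has a plot accessor
```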

In [None]:
from_counts.plot.bar()

But this plot doesn't tell us about the *difference* between "from" and "to" counts. We want to subtract to_counts from from_counts. Will this difference be plottable?

In [None]:
(from_counts-to_counts).plot.bar()
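
Subtracting two Series aligns them on their index labels rather than on position, which is why the difference is itself a plottable Series. A minimal sketch with made-up counts shows this, including what happens when a label appears in only one of the two Series.

```python
import pandas as pd

# Made-up counts; note the different label order and the extra label "C"
out_counts = pd.Series({"A": 10, "B": 4, "C": 7})
in_counts = pd.Series({"B": 6, "A": 3})

diff = out_counts - in_counts  # Aligned by label, not by position
print(diff)

# A label present in only one Series yields NaN; Series.sub can fill the gap
diff_filled = out_counts.sub(in_counts, fill_value=0)
print(diff_filled)
```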

Some issues:
- Bogus value 'Pronto shop'
- Difficult to read the labels on the x-axis
- The x and y axes aren't labelled
- Lost information about "from" and "to"

## Writing a Data Cleansing Function

We want to get rid of the row 'Pronto shop' in both from_counts and to_counts.

In [None]:
# Selecting a row
from_counts[from_counts.index == 'Pronto shop']

In [None]:
# Deleting a row
new_from_counts = from_counts[from_counts.index != 'Pronto shop']
new_from_counts.plot.bar()

In [None]:
def clean_rows(df, indexes):
    """
    Removes from df all rows with the specified indexes
    :param pd.DataFrame or pd.Series df:
    :param list-of-str indexes: index labels of the rows to remove
    :return pd.DataFrame or pd.Series:
    """
    for idx in indexes:
        df = df[df.index != idx]
    return df

Does clean_rows need to return df to effect the change in df?
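
A small experiment can help answer this. In the sketch below, `drop_label` is a hypothetical stand-in for clean_rows, and the Series is made up. The filter expression builds a new object and rebinds the local name, so the caller's object stays untouched unless the result is returned and reassigned.

```python
import pandas as pd

def drop_label(ser, label):
    """Hypothetical stand-in for clean_rows that handles a single label."""
    ser = ser[ser.index != label]  # Rebinds the local name to a new Series
    return ser

counts = pd.Series({"A": 3, "B": 2, "Pronto shop": 1})
result = drop_label(counts, "Pronto shop")

print(len(counts))  # The caller's Series still has all of its rows
print(len(result))  # The returned Series has one row fewer
```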

In [None]:
to_counts = clean_rows(to_counts, ['Pronto shop'])
to_counts.plot.bar()

In [None]:
from_counts = clean_rows(from_counts, ['Pronto shop'])
from_counts.plot.bar()

## Getting More Control Over Plots

*Let's take a more detailed approach to plotting so we can better control what gets rendered.*

In this section, we show how to control various elements of plots to produce a desired visualization. We'll use matplotlib, a Python package modelled after MATLAB-style plotting.

Make a dataframe out of the count data.

In [None]:
df_counts = pd.DataFrame({'From': from_counts.sort_index(), 'To': to_counts.sort_index()})

The counts need to be aligned by station. Does the code above do this?

In [None]:
df_counts.head()
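
When a DataFrame is built from a dict of Series, pandas matches rows on the index labels, so the counts are aligned by station even if the two Series list stations in different orders. A minimal sketch with made-up counts:

```python
import pandas as pd

# Made-up counts with labels in different orders; "C" has no "To" count
from_s = pd.Series({"B": 2, "A": 5, "C": 1})
to_s = pd.Series({"A": 4, "B": 3})

df_demo = pd.DataFrame({"From": from_s, "To": to_s})
print(df_demo)  # Rows are matched by label; missing entries become NaN
```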

In [None]:
"""
Basic bar chart using matplotlib
"""
n_groups = len(df_counts.index)
index = np.arange(n_groups)  # The "raw" x-axis of the bar plot

fig = plt.figure(figsize=(12, 8))  # Controls global properties of the bar plot
rects1 = plt.bar(index, df_counts.From)
plt.xlabel('Station')
plt.ylabel('Counts')
plt.xticks(index, df_counts.index)  # Convert "raw" x-axis into labels
_, labels = plt.xticks()  # Get the new labels of the plot
plt.setp(labels, rotation=90)  # Rotate labels to make them readable
plt.title('Station Counts')
plt.show()

Issue - much more code, which will tend to be copied and pasted. 

Solution - **MAKE A FUNCTION NOW!!!**

In [None]:
def plot_bar1(df, column, opts):
    """
    Does a bar plot for a single column.
    :param pd.DataFrame df:
    :param str column: name of the column to plot
    :param dict opts: key is plot attribute
    """
    n_groups = len(df.index)
    index = np.arange(n_groups)  # The "raw" x-axis of the bar plot
    plt.bar(index, df[column])
    # dict.has_key() was removed in Python 3; use the "in" operator instead
    if 'xlabel' in opts:
        plt.xlabel(opts['xlabel'])
    if 'ylabel' in opts:
        plt.ylabel(opts['ylabel'])
    if 'xticks' in opts and opts['xticks']:
        plt.xticks(index, df.index)  # Convert "raw" x-axis into labels
        _, labels = plt.xticks()  # Get the new labels of the plot
        plt.setp(labels, rotation=90)  # Rotate labels to make them readable
    else:
        labels = ['' for x in df.index]  # Blank labels hide the tick text
        plt.xticks(index, labels)
    if 'ylim' in opts:
        plt.ylim(opts['ylim'])
    if 'title' in opts:
        plt.title(opts['title'])

In [None]:
fig = plt.figure(figsize=(12, 8))  # Controls global properties of the bar plot
opts = {'xlabel': 'Stations', 'ylabel': 'Counts', 'xticks': True}
plot_bar1(df_counts, 'To', opts)

**QUESTIONS** 
- Why is there no title for this plot? 
- How should we call plot_bar1 to get a title?

### Comparisons Using Subplots

We want to encapsulate the plotting of N variables into a function. We could rewrite plot_bar1, but other plots already use it, and it handles a single plot well. So instead we call plot_bar1 from a new function.

In [None]:
def plot_barN(df, columns, opts):
    """
    Does a bar plot for each of several columns, stacked as subplots.
    :param pd.DataFrame df:
    :param list-of-str columns: names of the columns to plot
    :param dict opts: key is plot attribute
    """
    num_columns = len(columns)
    local_opts = dict(opts)  # Make a shallow copy so the caller's dict is not modified
    idx = 0
    for column in columns:
        idx += 1
        local_opts['xticks'] = False
        local_opts['xlabel'] = ''
        if idx == num_columns:
            local_opts['xticks'] = True
            local_opts['xlabel'] = opts.get('xlabel', '')  # Tolerate a missing 'xlabel'
        plt.subplot(num_columns, 1, idx)
        plot_bar1(df, column, local_opts)
    

**QUESTIONS**:
- Why is a new variable local_opts used in plot_barN instead of just changing opts?
- Why does the loop manipulate local_opts['xticks']? local_opts['xlabel']?
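
The copy made by `dict(opts)` is what lets the loop change keys without touching the caller's dict. A minimal sketch with made-up options shows this, and also that the copy is shallow: nested mutable values such as the `ylim` list are still shared unless `copy.deepcopy` is used.

```python
import copy

opts = {"xlabel": "Stations", "ylim": [0, 8000]}  # Made-up options

local_opts = dict(opts)      # Shallow copy: new dict, same value objects
local_opts["xlabel"] = ""    # Rebinding a key does not affect opts
print(opts["xlabel"])        # The original label is still there

local_opts["ylim"][1] = 100  # Mutating the shared list does affect opts
print(opts["ylim"])          # The change is visible through opts

deep_opts = copy.deepcopy(opts)  # Fully independent copy
deep_opts["ylim"][1] = 5
print(opts["ylim"])              # Unchanged by the deep copy's mutation
```

Since plot_barN only rebinds top-level keys, a shallow copy is enough there.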

In [None]:
fig = plt.figure(figsize=(12, 8))  # Controls global properties of the bar plot
opts = {'xlabel': 'Stations', 'ylabel': 'Counts', 'ylim': [0, 8000]}
plot_barN(df_counts, ['To', 'From'], opts)

Issue - the x-axis label of the top plot overlaps the title of the second plot.

Solution - eliminate the x-axis label and tick labels on the top plot.
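
Another way to sketch this fix, outside of plot_barN, is to let matplotlib share the x-axis between stacked subplots so that tick labels are drawn only on the bottom plot. The data is made up, and the Agg backend is chosen only so that the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # Headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

stations = ["A", "B", "C"]  # Made-up station labels
counts = {"To": [5, 3, 8], "From": [4, 6, 2]}  # Made-up counts
index = np.arange(len(stations))

# sharex=True hides the x tick labels on every subplot except the bottom one
fig, axes = plt.subplots(len(counts), 1, sharex=True, figsize=(6, 4))
for ax, (name, values) in zip(axes, counts.items()):
    ax.bar(index, values)
    ax.set_ylabel("Counts")
    ax.set_title(name)
axes[-1].set_xticks(index)
axes[-1].set_xticklabels(stations, rotation=90)
axes[-1].set_xlabel("Stations")
fig.tight_layout()  # Adds padding so titles and labels do not collide
```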

### Comparisons Using Multiple Bars In a Single Plot

To compare 'from' and 'to', we want:
- bars of different colors
- a legend

Unfortunately, we can't use plot_bar1 because it only accepts a single column as input.

In [None]:
"""
Plotting two variables as a bar chart
"""
n_groups = len(df_counts.index)
index = np.arange(n_groups)  # The "raw" x-axis of the bar plot
fig = plt.figure(figsize=(12, 8))  # Controls global properties of the bar plot

#VVVV Changed to do two plots
bar_width = 0.35  # Width of the bars
opacity = 0.6  # How transparent the bars are
rects1 = plt.bar(index, df_counts.From, bar_width,
                 alpha=opacity,
                 color='b',
                 label='From')
rects2 = plt.bar(index + bar_width, df_counts.To, bar_width,
                 alpha=opacity,
                 color='r',
                 label='To')
plt.xticks(index + bar_width / 2, df_counts.index)
_, labels = plt.xticks()  # Get the new labels of the plot
plt.setp(labels, rotation=90)  # Rotate labels to make them readable
plt.legend()
#^^^^ Changed to do two plots

plt.xlabel('Station')
plt.ylabel('Counts')
plt.title('Station Counts')
plt.show()
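
For comparison, pandas can draw the same grouped layout directly from a DataFrame, choosing the bar offsets, colors, and legend itself. This is a minimal sketch with made-up counts rather than the Pronto data; it trades the fine control shown above for brevity. The Agg backend is selected only so that the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # Headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts standing in for df_counts
df_demo = pd.DataFrame({"From": [5, 3, 8], "To": [4, 6, 2]},
                       index=["A", "B", "C"])

# One group of bars per row, one bar per column, with a legend from the column names
ax = df_demo.plot.bar(figsize=(6, 4), alpha=0.6, rot=90)
ax.set_xlabel("Station")
ax.set_ylabel("Counts")
ax.set_title("Station Counts")
```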

## Including Error Bars in a Bar Chart

To make decisions about the truck trips required to adjust bikes at stations, we need to know the variations by day.

We want a bar plot of the average daily "to" and "from" counts, with their standard deviations shown as error bars.

### Data Preparation

Need to:
- Create a date column for 'from' and 'to'
- Compute counts by date
- Compute the mean and standard deviation of the counts by date

(Assumes that a station has at least one rental every day.)
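
If that assumption does not hold, days with no rentals are simply absent from the grouped counts, which inflates the mean. Here is a minimal sketch of one way to handle this, using made-up trips and the same two-column grouping developed later in this section: reshape the counts into a station-by-day table and fill the missing days with zero before averaging.

```python
import pandas as pd

# Made-up trips: station "B" has no rentals on day 2
trips = pd.DataFrame({
    "from_station_id": ["A", "A", "A", "B", "B"],
    "startday":        [1,   1,   2,   1,   1],
})

daily = trips.groupby(["from_station_id", "startday"]).size()
naive_mean = daily.groupby(level=0).mean()  # Ignores the missing day for "B"

table = daily.unstack(fill_value=0)  # Stations x days; gaps become 0
filled_mean = table.mean(axis=1)     # Averages over every day

print(naive_mean)
print(filled_mean)
```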

Let's start with the values for starttime. What type are these?

In [None]:
print(df.starttime[0])
print(type(df.starttime[0]))

Question: How do we extract the day from a string?

YOU DON'T!!! You convert it to a datetime object.

In [None]:
this_datetime = pd.to_datetime(df.starttime[0])
print(this_datetime)
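
Once a value is a Timestamp, its date parts are available as attributes. The sketch below uses made-up timestamp strings (the format in the Pronto file may differ) and also shows the vectorized form, which converts a whole column at once and reads the parts through the `.dt` accessor.

```python
import pandas as pd

ts = pd.to_datetime("2015-03-01 08:30")  # Made-up timestamp
print(ts.dayofyear)   # Day number within the year
print(ts.day_name())  # Name of the weekday

# Vectorized: convert a whole column, then read parts through .dt
times = pd.Series(["2015-01-01 09:00", "2015-12-31 17:45"])
days = pd.to_datetime(times).dt.dayofyear
print(days.tolist())
```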

In [None]:
# Convert each column once, then read the day of year through the .dt accessor
start_day = pd.to_datetime(df.starttime).dt.dayofyear
stop_day = pd.to_datetime(df.stoptime).dt.dayofyear

In [None]:
df['startday'] = start_day  # Creates a new column named 'startday'
df['stopday'] = stop_day

In [None]:
df.head()

In [None]:
groupby_day_from = df.groupby(['from_station_id', 'startday']).size()
groupby_day_from.head()

Now we need to compute the average value and its standard deviation across the days for each station.
The groupby produced a MultiIndex. So, further operations on the result must take this into account.

In [None]:
h_index = groupby_day_from.index
h_index.levshape  # Size of the components of the MultiIndex

In [None]:
from_means = groupby_day_from.groupby(level=[0]).mean()  # Computes the mean of counts by day
from_stds = groupby_day_from.groupby(level=[0]).std()   # Computes the standard deviation
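
To make the MultiIndex handling concrete, here is a minimal sketch with made-up trips. Grouping by two columns yields a Series indexed by (station, day); grouping that result by level 0 then aggregates across days within each station.

```python
import pandas as pd

# Made-up trips: station "A" has 3 rentals on day 1 and 1 on day 2
trips = pd.DataFrame({
    "from_station_id": ["A", "A", "A", "A", "B", "B"],
    "startday":        [1,   1,   1,   2,   1,   2],
})

per_day = trips.groupby(["from_station_id", "startday"]).size()
print(per_day.index.nlevels)  # Number of levels in the MultiIndex

means = per_day.groupby(level=0).mean()  # Mean daily count per station
stds = per_day.groupby(level=0).std()    # Sample standard deviation per station
print(means)
print(stds)
```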


In [None]:
groupby_day_to = df.groupby(['to_station_id', 'stopday']).size()  # Arrivals are dated by the day the trip stopped
to_means = groupby_day_to.groupby(level=[0]).mean()  # Computes the mean of counts by day
to_stds = groupby_day_to.groupby(level=[0]).std()   # Computes the standard deviation

In [None]:
df_day_counts = pd.DataFrame({'from_mean': from_means, 'from_std': from_stds, 'to_mean': to_means, 'to_std': to_stds})
df_day_counts.head()

### Plotting with Error Bars

In [None]:
"""
Plotting two variables as a bar chart with error bars
"""
n_groups = len(df_day_counts.index)
index = np.arange(n_groups)  # The "raw" x-axis of the bar plot
fig = plt.figure(figsize=(12, 8))  # Controls global properties of the bar plot
bar_width = 0.35  # Width of the bars
opacity = 0.6  # How transparent the bars are

#VVVV Changed to do two plots with error bars
error_config = {'ecolor': '0.3'}
rects1 = plt.bar(index, df_day_counts.from_mean, bar_width,
                 alpha=opacity,
                 color='b',
                 yerr=df_day_counts.from_std,
                 error_kw=error_config,
                 label='From')
rects2 = plt.bar(index + bar_width, df_day_counts.to_mean, bar_width,
                 alpha=opacity,
                 color='r',
                 yerr=df_day_counts.to_std,
                 error_kw=error_config,
                 label='To')
#^^^^ Changed to do two plots with error bars

plt.xticks(index + bar_width / 2, df_day_counts.index)  # Labels must come from the frame being plotted
_, labels = plt.xticks()  # Get the new labels of the plot
plt.setp(labels, rotation=90)  # Rotate labels to make them readable
plt.legend()

plt.xlabel('Station')
plt.ylabel('Counts')
plt.title('Station Counts')
plt.show()

## In-class exercise
Change the above script for plotting with error bars into a function and verify that you can call this function and get the same plot as the one above.
* What are the inputs to your function and why?
* How would you change plot_barN to use this function?