# Data Science Course Week 2 - Data Visualisation

## We will be exploring datasets with Python visualisation libraries

For more information, refer to the [Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/visualization.html) and the [Plotly Documentation](https://plot.ly/python/offline/).


### 1. Matplotlib
The easiest way to visualise Pandas DataFrames

In [4]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot') # This styles the graphs in a nicer format

In [2]:
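# read drinks.csv into a DataFrame called 'drinks' (two equivalent ways; either line is enough)
drinks = pd.read_table('drinks.csv', sep=',')   # read_table defaults to tab-separated, so pass sep
drinks = pd.read_csv('drinks.csv')              # read_csv assumes the separator is a comma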

In [None]:
# bar plot of number of countries in each continent
drinks.continent.value_counts().plot(kind='bar', title='Countries per Continent')
plt.xlabel('Continent')
plt.ylabel('Count')
#plt.show()                                  # show plot window (if it doesn't automatically appear)
plt.savefig('countries_per_continent.png')  # save plot to file

In [None]:
# bar plot of average number of beer servings (per adult per year) by continent
drinks.groupby('continent').beer_servings.mean().plot(kind='bar', title='Beer servings per year, per Continent')
plt.ylabel('Average Number of Beer Servings Per Year')

In [None]:
# histogram of beer servings (shows the distribution of a numeric column)
drinks.beer_servings.hist(bins=15)
plt.xlabel('Beer Servings')
plt.ylabel('Frequency')
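
To make the `bins` argument concrete, here is a minimal sketch of equal-width binning in plain Python. The `histogram_counts` helper and the `servings` values are hypothetical illustrations, not part of pandas or of `drinks.csv`; pandas and matplotlib do this work internally and may treat edge cases differently.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # all values identical: a single bin holds everything
        return [lo, hi], [len(values)]
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # the maximum value belongs to the last bin, so clamp the index
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

# hypothetical beer servings, made up for illustration
servings = [0, 5, 20, 25, 60, 90, 130, 190, 245, 250, 300, 376]
edges, counts = histogram_counts(servings, bins=4)
print(edges)
print(counts)
```

Changing `bins` changes the bar heights but not the underlying data, which is why it is worth trying a few values before reading too much into the shape.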

In [None]:
# density plot of beer servings (smooth version of a histogram)
drinks.beer_servings.plot(kind='density', xlim=(0,500))
plt.xlabel('Beer Servings')

In [None]:
# grouped histogram of beer servings (shows the distribution for each group)
drinks.beer_servings.hist(by=drinks.continent, sharex=True, sharey=True)

In [None]:
# boxplot of beer servings by continent (shows five-number summary and outliers)
drinks.boxplot(column='beer_servings', by='continent')
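
As a rough sketch of what the box plot summarises, the five-number summary and the outliers can be computed by hand. The `servings` list is hypothetical (it is not taken from `drinks.csv`), and the 1.5 × IQR rule used here is matplotlib's default whisker setting; other tools may use a different convention.

```python
import statistics

# hypothetical servings for one group, made up for illustration
servings = [0, 1, 2, 3, 4, 5, 6, 7, 8, 50]

# quartiles, using linear interpolation between data points
q1, median, q3 = statistics.quantiles(servings, n=4, method='inclusive')
five_number = (min(servings), q1, median, q3, max(servings))

# points further than 1.5 * IQR from the box are drawn individually as outliers
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in servings if v < lower_fence or v > upper_fence]

print(five_number)
print(outliers)
```

Note that when outliers are present, the whiskers in the plot stop at the most extreme point inside the fences rather than at the true minimum or maximum.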

In [None]:
# scatterplot of beer servings versus wine servings
drinks.plot(kind='scatter', x='beer_servings', y='wine_servings', alpha=0.3)

In [None]:
# same scatterplot, except point color varies by 'spirit_servings'
drinks.plot(kind='scatter', x='beer_servings', y='wine_servings', c='spirit_servings', colormap='Blues')

In [None]:
# same scatterplot, except all European countries are colored red
colors = np.where(drinks.continent=='EU', 'r', 'b')
drinks.plot(kind='scatter', x='beer_servings', y='wine_servings', c=colors)

In [None]:
# scatterplot matrix of all numerical columns
# (pd.scatter_matrix was removed from newer pandas versions; it now lives in pd.plotting)
pd.plotting.scatter_matrix(drinks, figsize=(15,15))

### 2. ggplot
This is a Python port of ggplot2, a very popular R package for plotting. The benefit of learning this package is that you will be able to think the same way about plotting in both Python and R.
https://github.com/yhat/ggplot

Rough syntax:

- For common plots, establish the data source and data mapping with ggplot and aes:
    - ggplot(data, aes(x='var', y='var2'))

- Then add geometries (plot objects) that depend on the data mappings:
    - geom_histogram()
    - geom_point()

A good recent post about the plotting libraries available in Python and how their syntax differs: https://dansaber.wordpress.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/

The 'R Graphics Cookbook' by Winston Chang is a great book on how to achieve what you want with ggplot().



In [None]:
!pip install -U ggplot
#!conda install -c conda-forge ggplot

In [None]:
from ggplot import *

#### A Single Variable: Shape and Distribution

In [None]:
# data set mpg - included in ggplot package - car fuel mileage
mpg.head()

In [None]:
# Single variable analysis: histogram
ggplot(mpg, aes(x='cty')) + geom_histogram()

##### Kernel Density Estimates
To form a KDE, we place a kernel—that is, a smooth, strongly peaked function—at the position of each data point. We then add up the contributions from all kernels to obtain a smooth curve, which we can evaluate at any point along the x axis.

Example Kernels:
(Gaussian is the most commonly used)

![KDE](KDE_kernals.png)
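
The description above can be sketched in plain Python: one Gaussian kernel per data point, summed and normalised. The `city_mpg` values and the `bandwidth` of 2.0 are assumptions chosen for illustration; they are not taken from the `mpg` data set, and `geom_density()` chooses its own bandwidth.

```python
import math

def gaussian_kernel(u):
    """Standard normal density: a smooth, strongly peaked function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(points, x, bandwidth):
    """Evaluate the kernel density estimate at position x."""
    # add up the contribution of the kernel centred on each data point
    total = sum(gaussian_kernel((x - p) / bandwidth) for p in points)
    # normalise so that the whole curve integrates to 1
    return total / (len(points) * bandwidth)

city_mpg = [14, 15, 16, 18, 18, 19, 21, 21, 24, 28]  # hypothetical values
bandwidth = 2.0                                       # assumed smoothing width

# evaluate the curve on a grid from 5 to 40 in steps of 0.1
step = 0.1
grid = [5 + step * i for i in range(351)]
density = [kde(city_mpg, x, bandwidth) for x in grid]

# the area under the curve should be close to 1
area = sum(density) * step
print(round(area, 3))
```

A larger bandwidth gives a smoother curve that hides detail; a smaller one follows the individual data points more closely.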

In [None]:
# Single variable analysis: kernel density estimate
ggplot(mpg, aes(x='cty')) + geom_density()


In [None]:
# Single variable analysis: kernel density estimate with plots split by drivetrain variable
ggplot(mpg, aes(x='cty', colour='drv')) + geom_density()


In [None]:
# Single variable analysis: kernel density estimate with plots split by drivetrain variable, faceted
ggplot(mpg, aes(x='cty', colour='drv')) + geom_density() + facet_grid('drv')

##### Box plot and Violin plot

In [None]:
# Single variable analysis: Box Plot
ggplot(mpg, aes(x='cyl', y='cty')) + geom_boxplot()

In [None]:
# Single variable analysis: Violin Plot - applies the kernel density transformation per category
ggplot(mpg, aes(x='cyl', y='cty')) + geom_violin()

#### Two Variables: establishing a relationship

In [None]:
# data set diamonds - included in ggplot package
diamonds.head()

In [None]:
# scatter plot
# a backslash indicates the statement continues on the next line. Make sure no characters (not even spaces) follow the backslash on that line.
ggplot(diamonds, aes('carat','price', colour='cut')) + \
    geom_point(alpha=0.7) + \
    ggtitle('Diamond Price vs Carat')
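
An alternative to backslashes is to wrap the whole expression in parentheses, since Python continues a statement until the brackets are closed. The sketch below uses plain numbers instead of ggplot objects so that it runs without the package; the same layout applies to a chain of `ggplot(...) + geom_point(...)` terms.

```python
# continuation with backslashes: nothing may follow the backslash on a line
total_backslash = 1 + \
    2 + \
    3

# continuation with parentheses: trailing comments and spaces are allowed
total_parens = (
    1
    + 2  # a comment here is fine
    + 3
)

print(total_backslash == total_parens)
```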

### 3. Plotly
A nice open source library for interactive visualisations

In [None]:
# To run any command at the system shell, simply prefix it with !
# pip won't work from inside python without it
!pip install plotly --upgrade

In [None]:
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot

print(__version__) # requires version >= 1.9.0

In [None]:
init_notebook_mode() # run at the start of every ipython notebook to use plotly.offline
                     # this injects the plotly.js source files into the notebook

In [None]:
from plotly.graph_objs import *
import numpy as np

In [None]:
# Scatter plot with heatmap
x = np.random.randn(2000)
y = np.random.randn(2000)
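# plain dicts are used for the nested properties; the older Contours/Marker helper classes are deprecated
iplot([Histogram2dContour(x=x, y=y, contours=dict(coloring='heatmap')),
       Scatter(x=x, y=y, mode='markers', marker=dict(color='white', size=3, opacity=0.3))], show_link=False)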

In [None]:

df_airports = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2011_february_us_airport_traffic.csv')
df_airports.head()

df_flight_paths = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2011_february_aa_flight_paths.csv')
df_flight_paths.head()

airports = [ dict(
        type = 'scattergeo',
        locationmode = 'USA-states',
        lon = df_airports['long'],
        lat = df_airports['lat'],
        hoverinfo = 'text',
        text = df_airports['airport'],
        mode = 'markers',
        marker = dict(
            size=2,
            color='rgb(255, 0, 0)',
            line = dict(
                width=3,
                color='rgba(68, 68, 68, 0)'
            )
        ))]

flight_paths = []
for i in range( len( df_flight_paths ) ):
    flight_paths.append(
        dict(
            type = 'scattergeo',
            locationmode = 'USA-states',
            lon = [ df_flight_paths['start_lon'][i], df_flight_paths['end_lon'][i] ],
            lat = [ df_flight_paths['start_lat'][i], df_flight_paths['end_lat'][i] ],
            mode = 'lines',
            line = dict(
                width = 1,
                color = 'red',
            ),
            opacity = float(df_flight_paths['cnt'][i])/float(df_flight_paths['cnt'].max()),
        )
    )

layout = dict(
        title = 'Feb. 2011 American Airline flight paths<br>(Hover for airport names)',
        showlegend = False,
        height = 800,
        geo = dict(
            scope='north america',
            projection=dict( type='azimuthal equal area' ),
            showland = True,
            landcolor = 'rgb(243, 243, 243)',
            countrycolor = 'rgb(204, 204, 204)',
        ),
    )

fig = dict( data=flight_paths + airports, layout=layout )

iplot(fig)