# Visualization

<img src='https://az712634.vo.msecnd.net/notebooks/python_course/v1/desktop_pink.png' alt="Smiley face" width="52" align="left">
## Learning Objectives
* Get a taste of plotting with `matplotlib` (the *de facto* standard plotting library in Python)
* Take `ggplot` (grammar of graphics plotting) out for a spin

In [None]:
# familiar imports at top
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
# matplotlib and ggplot can emit a lot of warnings, so silence them
import warnings
warnings.filterwarnings('ignore')

## Matplotlib

**A simple line plot**

In [None]:
# our data
x = np.linspace(0, 2, 10)
x

In [None]:
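# call the plot function; with a single array, the values go on the
# y-axis and their positions (0, 1, 2, ...) are used for the x-axis
plt.plot(x, 'o-') # 'o-' is a line with circle markers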

In [None]:
# try changing the plot to use triangle markers (uncomment below and add data from above or your own)
# plt.plot(___, '-v')

**A fancier line plot**

In [None]:
# two lines on a plot

# first call to plot
plt.plot(x, x, 'o-', label='linear') # line with circle markers

# second call to plot draws on the same axes
plt.plot(x, x ** 2, 'x-', label='quadratic') # line with cross markers

What happens if you call `plot` a third time?  Try it!

**Add helpful info to same plot** (legends, labels, etc.)

In [None]:
# first call to plot
plt.plot(x, x, 'o-', label='linear')

# second call to plot
plt.plot(x, x ** 2, 'x-', label='quadratic')

# add some helpful info to the plot
plt.legend(loc='best')
plt.title('Linear vs Quadratic progression')
plt.xlabel('Input')
plt.ylabel('Output')

**A simple scatter plot** (with helpful info)

In [None]:
# data
x = np.arange(10)
y = x ** 2

# plot!
plt.plot(x, y, '-o') # line w/ circle markers

# helpful info
plt.xlabel('x')
plt.ylabel('y')
plt.legend(['hello data'], loc = 'best')

# there's also a scatter function, but plot has more general options
# plt.scatter(x, y)
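
Where `plt.scatter` does have an edge is when each point needs its own size or color, which a single `plt.plot` call cannot do. A minimal sketch, reusing the data above; the `sizes` and `colors` arrays are made up purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# same data as in the cell above
x = np.arange(10)
y = x ** 2

# hypothetical per-point styling: marker area grows with x, color follows y
sizes = 20 * (x + 1)
colors = y

points = plt.scatter(x, y, s=sizes, c=colors, cmap='viridis', alpha=0.8)
plt.colorbar(points, label='y value')
plt.xlabel('x')
plt.ylabel('y')
```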

**A simple histogram**

In [None]:
# data for hist
samples = np.random.normal(loc=1.0, scale=0.5, size=1000)

In [None]:
_ = plt.hist(samples, bins=50)

EXERCISE 1:  Modify this histogram
* Change the color
* Change the opacity
* Add a title and a y-axis label

HINT: Hit Shift-Tab after the first open paren in `plt.hist()` to get help, or just run `help(plt.hist)`

In [None]:
# Code up your solution here...

**A fancier histogram**

In [None]:
# data (from normal and Student's t distributions)
samples_1 = np.random.normal(loc=1, scale=.5, size=10000)
samples_2 = np.random.standard_t(df=10, size=10000)

In [None]:
# set up hist and plot it
bins = np.linspace(-3, 3, 50)
plt.hist(samples_1, bins=bins, alpha=0.5, label='samples 1') # opacity at 50%
plt.hist(samples_2, bins=bins, alpha=0.5, label='samples 2')

# legend
plt.legend(loc='upper left')

**Another scatter plot**

In [None]:
# data (from normal and Student's t distributions)
samples_1 = np.random.normal(loc=1, scale=.5, size=10000)
samples_2 = np.random.standard_t(df=10, size=10000)

# plot!
plt.scatter(samples_1, samples_2, alpha=0.1)

**Aside: making subplots with `figure` and `subplot`**

In [None]:
# data
x = np.arange(100)
y = np.cos(x)
z = np.cos(np.pi * x)

In [None]:
# figure and subplots
fig = plt.figure()

# first subplot
ax1 = fig.add_subplot(2,1,1) # two rows, one column, first plot
ax1.plot(x, y)
ax1.set_ylim([-2, 2])

# another subplot
ax2 = fig.add_subplot(212) # alternate notation
ax2.plot(x, z, color = 'y') # add color as well
ax2.set_ylim([-2, 2])

EXERCISE 2:  Add the following data to the code above and create another plot on `ax1` or the first subplot of the `figure`

```python
w = 2 * np.cos(x / np.pi)
```

FYI: You will have to copy the code from above into your solution cell.

In [None]:
# Code up your solution here...
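
The two-panel layout from the aside above can also be built in one call with `plt.subplots`, which returns the figure and its axes together. This is a sketch of an alternative, not a replacement for the `add_subplot` approach; `sharex=True` is an optional extra that was not used above.

```python
import numpy as np
import matplotlib.pyplot as plt

# same data as in the aside
x = np.arange(100)
y = np.cos(x)
z = np.cos(np.pi * x)

# two rows, one column; both panels share the x-axis
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, sharex=True)

ax1.plot(x, y)
ax1.set_ylim([-2, 2])

ax2.plot(x, z, color='y')
ax2.set_ylim([-2, 2])
```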

**Back to scatter plots**

In [None]:
# more data
samples_3 = np.random.normal(loc=2, scale=.5, size=10000)

In [None]:
# using figure trick for subplotting
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.scatter(samples_1, samples_2, alpha=0.1, c='b', marker="s", label='first')
ax1.scatter(samples_3, samples_2, alpha=0.1, c='r', marker="o", label='second')

# show the labels given above
ax1.legend(loc='upper left')

## `ggplot` 
* **grammar of graphics for Python, finally!**

In [None]:
%matplotlib inline
from ggplot import *
import numpy as np

### Easiest to see how `ggplot` works in terms of layers

#### Built-in dataset `mtcars` (yes, just like `ggplot2` in R)
* `mtcars` is a `pandas` DataFrame, FYI

In [None]:
mtcars.head()

**mtcars format**

A data frame with 32 observations on 11 variables.

[, 1]	mpg	Miles/(US) gallon<br>
[, 2]	cyl	Number of cylinders<br>
[, 3]	disp	Displacement (cu.in.)<br>
[, 4]	hp	Gross horsepower<br>
[, 5]	drat	Rear axle ratio<br>
[, 6]	wt	Weight (1000 lbs)<br>
[, 7]	qsec	1/4 mile time<br>
[, 8]	vs	V/S<br>
[, 9]	am	Transmission (0 = automatic, 1 = manual)<br>
[,10]	gear	Number of forward gears<br>
[,11]	carb	Number of carburetors<br>

Source: Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.

### Preprocessing (continuous to discrete)
* Create a couple new columns with some numerical values converted to strings to indicate they should be interpreted as discrete by `ggplot`

In [None]:
mtcars['am_str'] = ['automatic' if x == 0 else 'manual' for x in mtcars['am']]
mtcars['cyl_str'] = ['4' if x == 4.0 else '6' if x == 6.0 else '8' for x in mtcars['cyl']]
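
The same conversion can be written with `pandas` methods instead of list comprehensions. The sketch below uses a small stand-in DataFrame (`cars`, hypothetical, so that it runs without `ggplot` installed) rather than `mtcars` itself:

```python
import pandas as pd

# hypothetical stand-in for a few rows of mtcars
cars = pd.DataFrame({'am': [1, 0, 1], 'cyl': [6.0, 8.0, 4.0]})

# map the numeric codes to descriptive labels
cars['am_str'] = cars['am'].map({0: 'automatic', 1: 'manual'})

# drop the decimal part, then store as a string
cars['cyl_str'] = cars['cyl'].astype(int).astype(str)

print(cars)
```

Either style gives string columns, which is what signals "discrete" to `ggplot`.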

### `ggplot` example 1: simple x-y plot

<b>Create the "canvas" and attach data to it</b> (again, if you use R, you will think this is the same...it is)

In [None]:
p = ggplot(aes(x='wt', y='mpg'), data=mtcars)
p

<b>Add some points</b>

In [None]:
p = p + geom_point()
p

<b>Add a trendline</b>
This is done by
```python
stat_smooth(color="blue")
```

In [None]:
# add the trendline.  Fill in the blanks.
# p = ___ + ___
# p

In [None]:
# add the trendline.  Filled in blanks.
p = p + stat_smooth(color = 'blue')
p

<b>Let's be a good data scientist and add units (might as well give the plot a title)</b>

In [None]:
p = ggplot(aes(x='wt', y='mpg'), data=mtcars) +\
    geom_point() +\
    stat_smooth(color="blue") +\
    xlab("Weight (in 1000 lbs)") + ylab("MPG") +\
    ggtitle("1974 Motor Trend Data for 32 Cars (1973–74 models)")

p

### `ggplot` example 2: x-y plot with colors and shapes

It'll look like:


```python
ggplot(data, aes(x='x', y='y', color='var1', shape = 'var2')) ...
```

Here **`color`** will automatically pick a color for each level of the categorical variable `var1`, and **`shape`** will automatically pick a marker shape for each level of the categorical variable `var2`.
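
For comparison, here is roughly what the `color` aesthetic saves you from doing by hand in `matplotlib`: one `scatter` call per category. The DataFrame `df` and its columns `x`, `y` and `var1` are hypothetical placeholders matching the template above, not a real dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical data with one categorical column
df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6],
    'y': [2, 4, 3, 8, 7, 9],
    'var1': ['a', 'b', 'a', 'b', 'c', 'c'],
})

fig, ax = plt.subplots()

# one scatter call per category; each call takes the next color in the cycle
for name, group in df.groupby('var1'):
    ax.scatter(group['x'], group['y'], label=name)

ax.legend(title='var1')
```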

In [None]:
# peek at data
diamonds.head()

In [None]:
# summary stats - remember this from DataAnalysis?  Fill in blank.
# diamonds.___(include='all')

In [None]:
# summary stats - remember this from DataAnalysis?  Filled in blank.
diamonds.describe(include='all')

In [None]:
# log-log scale for x and y
diamonds['carat_log'] = np.log(diamonds['carat'])
diamonds['price_log'] = np.log(diamonds['price'])

p = ggplot(diamonds, aes(x='carat_log', y='price_log', color='color', shape = 'cut')) +\
    geom_point() +\
    xlab("Log Carat") + ylab("Log Price") +\
    ggtitle("Diamonds")

p

### `ggplot` example 3: lineplot with colors and variable widths

In [None]:
import seaborn as sbn
flights = sbn.load_dataset("flights")

flights.head()

In [None]:
from datetime import datetime

dates = flights.apply(lambda x:'%s %s %s' % (x['year'], x['month'], 1), axis = 1)

dates = dates.apply(lambda x: datetime.strptime(x, '%Y %B %d'))

flights['date'] = dates

dates.head()
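
`pandas` can do the same parsing in a vectorized way with `pd.to_datetime`. A minimal sketch on a hypothetical three-row frame (`df`), assuming the month column holds full English month names, as the `%B` format above expects:

```python
import pandas as pd

# hypothetical stand-in for the year and month columns of flights
df = pd.DataFrame({'year': [1949, 1949, 1950],
                   'month': ['January', 'February', 'March']})

# build strings like "1949 January", then parse; the day defaults to 1
date_strings = df['year'].astype(str) + ' ' + df['month'].astype(str)
df['date'] = pd.to_datetime(date_strings, format='%Y %B')

print(df)
```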

In [None]:
# create a line plot.  Fill in the blank.
# p = ggplot(flights, aes('date', 'passengers', color='month')) + ___()
# p

In [None]:
import pandas as pd
mymonths = ['January', 'April', 'July', 'October']
seasons = flights[flights.month.isin(mymonths)].copy()

# reset the categories of the month column: the filtered "seasons" copy
#   still carries every month category from flights, not just the four kept
seasons.month = pd.Categorical(seasons.month, categories = mymonths)
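
To see why the recategorizing step matters, the sketch below filters a small hypothetical categorical Series (`months`) and shows that unused categories linger until they are reset. `cat.remove_unused_categories()` is shown as an alternative to rebuilding the `Categorical` by hand.

```python
import pandas as pd

# hypothetical categorical column with four known categories
months = pd.Series(pd.Categorical(
    ['January', 'February', 'April'],
    categories=['January', 'February', 'March', 'April']))

kept = months[months.isin(['January', 'April'])].copy()

# filtering rows does not shrink the list of categories
print(list(kept.cat.categories))

# either rebuild the categorical explicitly ...
rebuilt = pd.Categorical(kept, categories=['January', 'April'])
# ... or drop whatever is no longer used
trimmed = kept.cat.remove_unused_categories()

print(list(rebuilt.categories))
print(list(trimmed.cat.categories))
```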

In [None]:
p = ggplot(seasons, aes(x = 'date', y = 'passengers', color = 'month')) + geom_line()
p

In [None]:
p = ggplot(seasons, aes(x = 'date', y = 'passengers', color = 'month')) + geom_line() \
    + scale_color_manual(values = ['lightblue', 'magenta', 'orange', 'darkgreen'])
p

### `ggplot` example 4: faceting

In [None]:
import os
import pandas as pd
from urllib.request import urlopen

# titanic dataset
url = 'https://raw.githubusercontent.com/ogrisel/parallel_ml_tutorial/master/notebooks/titanic_train.csv'
titanic = pd.read_csv(urlopen(url))

EXERCISE 3: Explore, convert continuous to discrete, and dummy
1.  Check the shape and look at the first few rows of the data
2.  Use `astype` to convert the 'Survived' column to categorical
3.  Dummy the 'Sex' column with the `pandas` `get_dummies` function like:

```python
df = pd.get_dummies(data, columns = ['colname'])
```
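
As a quick illustration of both calls on made-up data (a hypothetical `passengers` frame, not the Titanic file), `astype('category')` and `get_dummies` behave like this:

```python
import pandas as pd

# hypothetical three-row frame with the same column names
passengers = pd.DataFrame({'Survived': [0, 1, 1],
                           'Sex': ['male', 'female', 'female']})

# integers become a categorical (discrete) column
passengers['Survived'] = passengers['Survived'].astype('category')

# 'Sex' is replaced by one indicator column per level
dummied = pd.get_dummies(passengers, columns=['Sex'])
print(dummied.columns.tolist())
```

The new columns are named after the original column and its levels, which is where a name like `Sex_female` comes from.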

Plot the survival classes by gender.  Fill in the blank:

```python
p = ggplot(titanic, aes(x = ___))
p + geom_histogram() + \
    facet_wrap("Sex_female")
```

In [None]:
# Code up your solution here...

## Additional Resources
* More on python's `ggplot` [here](http://ggplot.yhathq.com/) and in this blog [here](http://blog.yhat.com/posts/ggplot-for-python.html)
* [`matplotlib` "artists tutorial"](http://matplotlib.org/users/artists.html)
* Check out some D3-esque plotting with the amazing [plotly](https://plot.ly/python/)

NB:  Some of the `matplotlib` material is adapted from Olivier Grisel's 2015 PyCon tutorial:
[Olivier's 2015 PyCon sklearn tutorial](https://github.com/ogrisel/parallel_ml_tutorial)
(Olivier Grisel: [@ogrisel](https://twitter.com/ogrisel) | http://ogrisel.com)


Created by a Microsoft Employee.
	
The MIT License (MIT)<br>
Copyright (c) 2016