# 5. Plotting with Pandas

### Objectives
1. Plot directly with a DataFrame or Series object with the **`plot`** method


### Resources
1. [Pandas Visualization documentation](http://pandas.pydata.org/pandas-docs/stable/visualization.html)

# Plotting in Pandas
Pandas makes plotting easy by automating much of the procedure for you. All pandas plotting passes through Python's main visualization library, **matplotlib**, and is accessed through the `DataFrame.plot` or `Series.plot` method. We say that the pandas `plot` method is a 'wrapper' for matplotlib.

For plots to be embedded in the notebook, you must run the magic command **`%matplotlib inline`**.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

In [None]:
url = 'https://api.iextrading.com/1.0/stock/AMZN/chart/5y'
amzn = pd.read_json(url)
amzn.head()

In [None]:
amzn = amzn.set_index('date')
amzn_close = amzn['close']
amzn_close.head()

## Plotting a Series
Pandas uses the Series index as the x-values and the values as y-values. By default, Pandas creates a line plot. Let's plot Amazon's closing price for the last 5 years.

In [None]:
amzn_close.plot()
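
To make the index-to-x and values-to-y rule concrete, here is a minimal sketch on a tiny made-up Series (the name `toy` and its numbers are hypothetical, not part of the stock data). It reads the coordinates back from the matplotlib line that pandas drew.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy Series: the index holds the x-values, the data holds the y-values
toy = pd.Series([10.0, 12.5, 11.0, 15.0], index=[1, 2, 3, 4], name="toy_close")

ax = toy.plot()            # default kind is a line plot
line = ax.get_lines()[0]   # the single line pandas drew on the Axes

x_drawn = list(line.get_xdata())
y_drawn = list(line.get_ydata())
print(x_drawn, y_drawn)
```

The printed x-values are the index labels and the y-values are the Series values, which is exactly what the Amazon plot above does with dates and closing prices.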

Get the same 5 years of closing prices for Apple, Facebook, Schlumberger and Tesla.

In [None]:
symbols = ['AAPL', 'FB', 'SLB', 'TSLA']

In [None]:
stock_dict = {'AMZN': amzn_close}
for symbol in symbols:
    url = f'https://api.iextrading.com/1.0/stock/{symbol}/chart/5y'
    stock = pd.read_json(url).set_index('date')
    stock_dict[symbol] = stock['close']

In [None]:
df_stocks = pd.DataFrame(stock_dict)
df_stocks.head()

## Plot all Series one at a time
All calls to plot that happen in the same cell will be drawn on the same Axes unless otherwise specified.

In [None]:
df_stocks['AMZN'].plot()
df_stocks['AAPL'].plot()
df_stocks['FB'].plot()
df_stocks['SLB'].plot()
df_stocks['TSLA'].plot()
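
One way to "otherwise specify" where a plot lands is the `ax` parameter of `plot`. The sketch below assumes a small made-up DataFrame (`toy` is hypothetical) and creates two Axes with `plt.subplots`, then sends each Series to the Axes it names.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy prices, only for illustration
toy = pd.DataFrame({"A": [1.0, 2.0, 3.0],
                    "B": [3.0, 2.0, 1.0],
                    "C": [2.0, 2.5, 2.0]})

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# Both calls name the left Axes, so the two lines share it
toy["A"].plot(ax=ax_left)
toy["B"].plot(ax=ax_left)

# The third Series is drawn on the right Axes instead
toy["C"].plot(ax=ax_right)

n_left = len(ax_left.get_lines())
n_right = len(ax_right.get_lines())
print(n_left, n_right)
```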

## Plot all at once from the DataFrame
Instead of individually plotting Series, we can plot each column in the DataFrame at once with its **`plot`** method.

In [None]:
df_stocks.plot()

# Plotting in Pandas is Column based
The most important thing to know about plotting in pandas is that it is **column based**. Pandas plots each column, one at a time. It uses the index as the x-values for each column and the values of each column as the y-values. The column names will be put in the **legend**.
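
A quick way to check the column-based behavior is to plot a tiny made-up DataFrame (the name `toy` and its values are hypothetical) and inspect the resulting Axes: there should be one line per column, and the legend entries should be the column names.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy DataFrame with two columns and a numeric index
toy = pd.DataFrame({"low": [1, 2, 3], "high": [4, 5, 6]},
                   index=[2015, 2016, 2017])

ax = toy.plot()

n_lines = len(ax.get_lines())                                  # one line per column
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
x_values = list(ax.get_lines()[0].get_xdata())                 # taken from the index
print(n_lines, legend_labels, x_values)
```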

## Choosing other types of plots
Pandas directly uses Matplotlib for all of its plotting. Pandas does not have any plotting capabilities on its own. Pandas is simply calling Matplotlib's plotting functions and supplying the arguments for you. The types of available plots may be seen in the [visualization section of the docs][1]. Use the **`kind`** parameter to set the type of plot.

* ‘line’ : line plot (default)
* ‘bar’ : vertical bar plot
* ‘barh’ : horizontal bar plot
* ‘hist’ : histogram
* ‘box’ : boxplot
* ‘kde’ : Kernel Density Estimation plot
* ‘density’ : same as ‘kde’
* ‘area’ : area plot
* ‘pie’ : pie plot

[1]: http://pandas.pydata.org/pandas-docs/stable/visualization.html#other-plots
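
Each `kind` value also has a matching method on the `plot` accessor, so `plot(kind='barh')` and `plot.barh()` are two spellings of the same request. The sketch below, on a made-up Series named `counts` (hypothetical), draws both and compares the bar widths.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy counts
counts = pd.Series([5, 3, 8], index=["a", "b", "c"])

fig, (ax_kind, ax_method) = plt.subplots(1, 2, figsize=(10, 4))

counts.plot(kind="barh", ax=ax_kind)   # plot type chosen with the kind parameter
counts.plot.barh(ax=ax_method)         # same plot type chosen with the accessor method

# For a horizontal bar plot, each bar's width is the plotted value
widths_kind = [p.get_width() for p in ax_kind.patches]
widths_method = [p.get_width() for p in ax_method.patches]
print(widths_kind, widths_method)
```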

### Histogram of the closing prices of Apple

In [None]:
# Let's create a histogram of a Series
aapl = df_stocks['AAPL']
aapl.plot(kind='hist')

### Kernel Density Estimate
A KDE plot is very similar to a histogram. It draws a smooth curve, and the area under the curve over an interval approximates the probability of a value falling in that interval.

In [None]:
# Series.plot expects keyword arguments, so pass the plot type as kind=
aapl.plot(kind='kde');

## Additional Plotting Arguments
To modify plots to your liking, matplotlib gives you lots of power. The most commonly used arguments are listed below, but there are [lots more](http://matplotlib.org/api/pyplot_api.html).

* **`linestyle`** (ls) - Pass a string of one of the following ['--', '-.', '-', ':']
* **`color`** (c) - Can take a string of a named color, a string of the hexadecimal characters or an RGB tuple with each number between 0 and 1. [Check out this really good stackoverflow post to see the colors](http://stackoverflow.com/questions/22408237/named-colors-in-matplotlib)
* **`linewidth`** (lw) - controls thickness of line. Default is 1.5 in matplotlib 2.0 and later
* **`alpha`** - controls opacity with a number between 0 and 1
* **`figsize`** - a tuple used to control the size of the plot. (width, height) 
* **`legend`** - boolean to control legend

In [None]:
# Use several of the additional plotting arguments
aapl.plot(color="darkblue", 
          linestyle='--', 
          figsize=(16, 8), 
          linewidth=5, 
          alpha=.7, 
          legend=True,
          title="AAPL Stock Price - Last 5 Years");

# Plots still Ugly?
If you can't get a plot to look how you would like, you can freely choose from several predefined layouts. These layouts can instantly make your plots more attractive. You set these styles in matplotlib.

In [None]:
# Let's look at some styles we can choose from
print(plt.style.available)

In [None]:
# Let's use a popular style - ggplot
plt.style.use('ggplot')

In [None]:
df_stocks.plot(figsize=(12, 6))
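
`plt.style.use` changes the style for the rest of the session. If you only want a style for a single plot, matplotlib also provides the `plt.style.context` context manager. The sketch below uses a made-up DataFrame (`toy` is hypothetical) and records the Axes background setting before, inside and after the `with` block to show that the change is temporary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"A": [1, 3, 2], "B": [2, 1, 3]})

default_facecolor = plt.rcParams["axes.facecolor"]

# The ggplot style is active only inside this block
with plt.style.context("ggplot"):
    styled_facecolor = plt.rcParams["axes.facecolor"]
    ax = toy.plot(figsize=(6, 3))

restored_facecolor = plt.rcParams["axes.facecolor"]
print(default_facecolor, styled_facecolor, restored_facecolor)
```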

## New Dataset
A popular intro dataset for the famous **`ggplot2`** package in R is the diamonds dataset with [description here.](http://docs.ggplot2.org/0.9.3.1/diamonds.html)

In [None]:
diamonds = pd.read_csv('../data/diamonds.csv')
diamonds.head(10)

## Changing the defaults for a scatterplot

The default plot is a line plot and uses the index as the x-axis. Each column of the frame becomes the y-values. This worked well for stock price data where the date was in the index and ordered. For many datasets, you will have to explicitly set the x and y axis variables. Below is a scatterplot comparison of carat vs price.

In [None]:
diamonds.plot('carat', 'price', kind='scatter', figsize=(12, 6));

In [None]:
diamonds.shape

## Sample the data when too many points

In [None]:
diamonds.sample(frac=.02).plot('carat', 'price', kind='scatter', figsize=(12, 6));

# If you have tidy data, use `groupby/pivot_table`, then make a bar plot
If your data is tidy like it is with this diamonds dataset, you will likely need to aggregate it with either a `groupby` or a `pivot_table` to make it work with a bar plot.
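
As a minimal sketch of that aggregate-then-plot pattern, the made-up tidy frame below (the name `tidy` and its values are hypothetical, not the real diamonds data) is reduced to one average price per cut with `groupby` and then drawn as a bar plot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical tidy data: one row per observation
tidy = pd.DataFrame({
    "cut": ["Ideal", "Good", "Ideal", "Good", "Ideal"],
    "price": [300, 200, 500, 400, 400],
})

# Aggregate first: one value per category, with the category in the index
avg_price = tidy.groupby("cut")["price"].mean()

# Then plot: one bar per index label
ax = avg_price.plot(kind="bar")

tick_labels = [t.get_text() for t in ax.get_xticklabels()]
bar_heights = [p.get_height() for p in ax.patches]
print(tick_labels, bar_heights)
```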

### The index becomes the tick labels for String Indexes
Pandas nicely integrates the index into plotting by using it as the tick mark labels for many plots.

In [None]:
cut_count = diamonds['cut'].value_counts()
cut_count

In [None]:
cut_count.plot(kind='bar')

### More than one grouping column in the index

In [None]:
# bar plot with more than one category
cut_color_count = diamonds.groupby(['cut', 'color']).size()
cut_color_count.head(10)

In [None]:
cut_color_count.plot(kind='bar')

## That's quite ugly
Let's reshape and plot again.

In [None]:
cut_color_pivot = diamonds.pivot_table(index='cut', columns='color', aggfunc='size')
cut_color_pivot

Plot the whole DataFrame. The index always goes on the x-axis. Each column value is the y-value and the column names are used as labels in the legend.

In [None]:
cut_color_pivot.plot(kind='bar')
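
Another route to the same wide shape, shown here as a sketch on made-up data (`toy` is hypothetical), is to keep the `groupby(...).size()` result and move the inner index level into the columns with `unstack`. Each column then becomes its own set of bars and its own legend entry.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical tidy data with two categorical columns
toy = pd.DataFrame({
    "cut": ["Good", "Good", "Ideal", "Ideal", "Ideal", "Good"],
    "color": ["D", "E", "D", "D", "E", "E"],
})

# Long form: one count per (cut, color) pair in a two-level index
counts = toy.groupby(["cut", "color"]).size()

# Wide form: cut stays in the index, color moves to the columns
wide = counts.unstack("color")

ax = wide.plot(kind="bar")
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(wide)
print(legend_labels)
```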

## Pandas plots return matplotlib objects
After making a plot with pandas, you will see some text output immediately under the cell that was just executed. Pandas is returning to us the matplotlib Axes object. You can assign the result of the **`plot`** method to a variable.

In [None]:
ax = cut_color_pivot.plot(kind='bar')

In [None]:
type(ax)

Get the figure as an attribute of the Axes

In [None]:
fig = ax.figure

In [None]:
type(fig)

# We can use the figure and axes as normal

In [None]:
ax.set_title('My new title on a Pandas plot')
fig

### Problem 1
<span  style="color:green; font-size:16px">In this problem we will test whether daily returns from stocks are normally distributed. Complete the following tasks:
* Take the `df_stocks` DataFrame and call the **`pct_change`** method to get the daily return percentage and assign it to a variable. 
* Assign the mean and standard deviation of each column (these will return Series) to separate variables. 
* Standardize your columns by subtracting the mean and dividing by the standard deviation. You have now produced a **z-score** for each daily return. 
* Add a column to this DataFrame called **`noise`** by calling **`np.random.randn`** which creates random normal variables.
* Plot the KDE for each column in your DataFrame. If the stock returns are normal, then the shapes of the curves will all look the same.
* Limit the x-axis to be between -3 and 3.
* Are stock returns normally distributed?</span>

### Problem 2
<span  style="color:green; font-size:16px">Use Pandas to plot a horizontal bar plot of diamond cuts.</span>

### Problem 3
<span  style="color:green; font-size:16px">Make a visualization that easily shows the differences in average salary by gender for each department of the employee dataset.</span>

### Problem 4
<span  style="color:green; font-size:16px">Split the employee data into two separate DataFrames. Those who have a hire date after the year 2000 and those who have one before. Make the same plot as above for each group.</span>

### Problem 5
<span  style="color:green; font-size:16px">Use the **`flights`** data set. Plot the counts of the number of flights per day of week.</span>

### Problem 6
<span  style="color:green; font-size:16px">Plot the average arrival delay per day of week.</span>

### Problem 7
<span  style="color:green; font-size:16px">Plot the average arrival delay per day of week per airline.</span>

# Extra

## Scatterplot color based on a column - (unfortunately more difficult than it needs to be)
It is possible to use the value of a different column to change colors of the points. If you have a numeric column, then this is easy. Here, we create a numeric column with random integers from 0 to 99. We also pass a color map to the `cmap` parameter.

In [None]:
# randomly sample
dia_samp = diamonds.sample(frac=.1)
dia_samp['some numeric col'] = np.random.randint(0, 100, len(dia_samp))

In [None]:
dia_samp.plot(x='carat',
              y='price',
              kind='scatter',
              title='Carat vs Price',
              c='some numeric col',
              cmap='plasma',
              figsize=(14, 6));

### Coloring with string columns - convert to 'category' data type
Working with strings isn't nearly as easy. One way to do this is to first convert the column to a Pandas category.

In [None]:
dia_samp['clarity'] = dia_samp['clarity'].astype('category')

Let's verify the data types:

In [None]:
dia_samp.dtypes

## The `cat` accessor
Pandas has a `cat` accessor for categorical columns. The `cat` accessor works just like `str` and `dt`. It gives you access to special categorical-only attributes and methods. One of these is `codes`. Each unique string is mapped to an integer.

In [None]:
clarity_codes = dia_samp['clarity'].cat.codes
clarity_codes.head()
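
To see how the string-to-integer mapping works, the sketch below uses a tiny made-up Series (`clarity` here is a hypothetical stand-in, not the diamonds column). By default the categories are the sorted unique values, and each code is the position of the value in that list.

```python
import pandas as pd

# Hypothetical toy column of clarity grades
clarity = pd.Series(["SI1", "VS2", "SI1", "IF"]).astype("category")

categories = list(clarity.cat.categories)   # sorted unique strings
codes = clarity.cat.codes.tolist()          # position of each value in categories

# Dictionary for translating a code back to its label, e.g. for a legend
lookup = dict(enumerate(clarity.cat.categories))
print(categories, codes, lookup)
```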

In [None]:
dia_samp.plot(x='carat',
              y='price',
              kind='scatter',
              title='Carat vs Price',
              c=clarity_codes,
              cmap='plasma',
              figsize=(14, 6));

### Alternative method - Make a new column of string color values with the `map` Series method
The **`map`** Series method iterates over a column of data and returns a single value for each cell.  **`map`** can accept a function or a dictionary. If a dictionary is passed then a simple key lookup is used to return the value.
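
The sketch below illustrates both forms of `map` on a tiny made-up Series (the names `letters` and `lookup` are hypothetical). With a dictionary, any value that is not a key becomes missing, which is why a `fillna` call is a common follow-up. A function can handle the fallback itself.

```python
import pandas as pd

# Hypothetical toy column and lookup table
letters = pd.Series(["D", "E", "Z"])
lookup = {"D": "maroon", "E": "aqua"}

# Dictionary form: 'Z' is not a key, so its result is missing
mapped = letters.map(lookup)

# Replace the missing results with a default color
filled = mapped.fillna("red")

# Function form: the function is called once per value
via_function = letters.map(lambda letter: lookup.get(letter, "red"))
print(filled.tolist(), via_function.tolist())
```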

In [None]:
color_map = {'E': 'aqua',
             'I': 'green',
             'J': 'black',
             'H': 'cadetblue',
             'F': 'darksalmon',
             'G': 'lavender',
             'D': 'maroon'}

In [None]:
dia_samp['color_map'] = dia_samp['color'].map(color_map).fillna('red')
dia_samp['color_map'].head()

Pass this Series to the **`c`** parameter.

In [None]:
dia_samp.plot(x='carat',
              y='price',
              kind='scatter',
              title='Carat vs Price',
              c=dia_samp['color_map'],
              figsize=(14, 6));