In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("discussion.ipynb")

# Discussion 3

### Due Saturday October 16th, 11:59:59PM


# Plotting: Pandas, Seaborn, Matplotlib

* **Matplotlib:** Low-level plotting library built on NumPy; use it directly only when you need to!
* **Pandas (plotting):** `.plot` DataFrame/Series methods conveniently plot tabular data (they call Matplotlib).
* **Seaborn:** A Python plotting library similar to R's ggplot; makes common statistical plots easy.

In [1]:
# magic command for displaying plots in notebook
%matplotlib inline

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

In [3]:
from discussion import *

## Plotting in `pandas` is as easy as `.plot()`

* `Series.plot()` plots a column.

In [5]:
data = pd.read_csv('data/data.csv')

In [6]:
data.head()

In [7]:
# select a column from data
z0 = data['z0']
z0.head()

* Use a line plot to plot numeric data.
* `data.plot()` draws a line plot by default.
    - The x-axis is the index by default.
    - A different column can be used for the x-axis with the keyword argument `x`.

In [8]:
# index is [0...1000]
z0.plot()

In [9]:
# set index to plot correct x-axis
z0 = data.set_index('x').loc[:, 'z0']
z0.head()

In [10]:
z0.plot()

In [11]:
# set x-axis using a keyword argument
data.plot(x='x', y='z0')

### Plotting (quantitative) empirical distributions in Pandas

* Use the key-word argument `kind`
```
kind : str
    - 'hist' : histogram
    - 'box' : boxplot
    - 'kde' : Kernel Density Estimation plot
    ...
```
* With `kind='hist'`, the plot uses 10 bins by default and shows the *count* of observations within each bin.
    - use `density=True` to draw a histogram whose total area is normalized to 1.
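
To make the effect of `density=True` concrete, here is a minimal sketch that calls `np.histogram` directly on synthetic data. The array `values` is an assumption standing in for a numeric column such as `z0`; it is not the notebook's data.

```python
import numpy as np

# synthetic stand-in for a numeric column (an assumption for illustration)
rng = np.random.default_rng(0)
values = rng.normal(size=1000)

# default behaviour: raw counts of observations per bin
counts, edges = np.histogram(values, bins=10)

# density=True: bar heights are rescaled so that the total bar area is 1
heights, _ = np.histogram(values, bins=10, density=True)
widths = np.diff(edges)
total_area = (heights * widths).sum()

print(counts.sum())   # the number of observations
print(total_area)     # approximately 1.0
```

The bar heights of a density histogram do not sum to 1; it is the heights multiplied by the bin widths that do.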

In [12]:
# histogram of z0 values; 
# 25 bins.
# density = normalized histogram

z0.plot(kind='hist', bins=25, density=True)

In [13]:
# kernel density estimate of the distribution
# smooth approximation of the empirical distribution

z0.plot(kind='kde')

In [14]:
z0.plot(kind='box')

### Plotting (categorical) empirical distributions in Pandas

* Create a distribution from categorical columns using `value_counts`.
* Categorical columns should use *bar charts*.
* Use the key-word argument `kind`
```
kind : str
    - 'bar' : vertical bar plot
    - 'barh' : horizontal bar plot
    ...
```
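
As a small sketch of the steps above, the example below builds an empirical distribution from a made-up categorical Series and draws it as a bar chart. The Series `colors` is hypothetical and only serves as an illustration.

```python
import pandas as pd

# hypothetical categorical column (not the notebook's data)
colors = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])

# number of rows per category
counts = colors.value_counts()

# proportion of rows per category; the proportions sum to 1
props = colors.value_counts(normalize=True)

# one bar per category
ax = props.plot(kind='bar')
```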


In [15]:
empdistr = data['id'].value_counts(normalize=True)
empdistr

In [16]:
# nominal column
empdistr.plot(kind='bar')

In [17]:
# ordinal column: the x-axis has a meaningful order
empdistr.sort_index().plot(kind='bar')

In [18]:
# horizontal bar chart
empdistr.sort_index().plot(kind='barh')

### Plotting `pandas` DataFrames
* `DataFrame.plot()` plots the columns of a dataframe.
* Want multiple plots on the same axes? Get the data into the columns of a DataFrame!
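
When the data arrives in "long" format (one row per observation), one way to get it into columns is `DataFrame.pivot`. The sketch below uses a made-up table; the names `long_df`, `series`, and `value` are assumptions for illustration only.

```python
import pandas as pd

# hypothetical long-format data: one row per (x, series) pair
long_df = pd.DataFrame({
    'x':      [0, 1, 2, 0, 1, 2],
    'series': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value':  [1.0, 2.0, 4.0, 3.0, 2.5, 1.0],
})

# reshape so that each series becomes its own column, indexed by x
wide_df = long_df.pivot(index='x', columns='series', values='value')

# a single call draws one line per column on a shared Axes
ax = wide_df.plot()
```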

In [19]:
data.set_index('x').head()

In [20]:
# plot columns 'z0' and 'z1' with 'x' used as the x-axis
data.set_index('x')[['z0', 'z1']].plot()

In [21]:
# plot columns 'z0' and 'z1' with 'x' used as the x-axis on separate plots
data.set_index('x')[['z0', 'z1']].plot(subplots=True);

In [22]:
# plot all columns using 'x' as x-axis; elongate plots with 'figsize' keyword
data.set_index('x').plot(subplots=True, figsize=(12,8));

### Scatter-plots with Pandas
* You can create scatter plots with `DataFrame.plot` by passing `kind='scatter'`. A scatter plot requires numeric columns for the `x` and `y` axes.
    * Specify these columns with the `x` and `y` keywords.
* To plot multiple column groups on a single Axes, call the plot method again and pass the target Axes with the `ax` keyword. Specifying the `color` and `label` keywords helps distinguish the groups.
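
The second bullet can be sketched as follows, using random data in a DataFrame named `demo`. The DataFrame, its column names, and the color choices are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# synthetic data; the column names are illustrative assumptions
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((50, 4)), columns=['a', 'b', 'c', 'd'])

# first group: this call creates the Axes
ax = demo.plot(kind='scatter', x='a', y='b', color='tab:blue', label='a vs b')

# second group: reuse the same Axes via the `ax` keyword
demo.plot(kind='scatter', x='c', y='d', color='tab:orange', label='c vs d', ax=ax)
```

Because both calls draw on the same Axes, the axis labels reflect the most recent call; set them explicitly with `ax.set_xlabel` and `ax.set_ylabel` if that is misleading.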

In [23]:
data.plot(kind='scatter', x='z0', y='z1')

In [24]:
# plot all the histograms and scatterplots in one plot!
# univariate + bivariate analysis
pd.plotting.scatter_matrix(data.drop(['id', 'x'], axis=1));

In [25]:
df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])

df.plot(kind='scatter', x='a', y='b');

There are other keywords that can be used with scatter. The keyword `c` may be given as the name of a column to provide colors for each point:

In [26]:
samp = data.sample(100)

In [27]:
samp.plot(kind='scatter', x='z0', y='z1', c='z3', s=50);

You can pass other keywords supported by matplotlib `scatter`. The example below shows a bubble chart that uses the values of a DataFrame column as the bubble size.

In [28]:
samp.plot(kind='scatter', x='z0', y='z1', s=samp['x']);

### Seaborn: pretty plotting made easy

To install `seaborn`, open a terminal and enter: 

On your laptop:
* `pip install seaborn==0.9`

or, if you are on a shared server (e.g. on `datahub.ucsd.edu`):

* `pip install --user seaborn==0.9`

The `seaborn` documentation has a *great* series of tutorials: https://seaborn.pydata.org/tutorial.html


In [29]:
import seaborn as sns
sns.__version__

#### `sns.scatterplot`
* The relationship between `x` and `y` can be shown for different subsets of the data using the `hue`, `size`, and `style` parameters. 
* These parameters control what visual semantics are used to identify the different subsets. 
* It is possible to show up to three dimensions independently by using all three semantic types, but this style of plot can be hard to interpret and is often ineffective. 
    * Using redundant semantics (i.e. both `hue` and `style` for the same variable) can be helpful for making graphics more accessible.
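
As a rough illustration of redundant semantics, the sketch below maps one categorical variable to both `hue` and `style`. The DataFrame `demo` and its columns `u`, `v`, and `group` are made up for this example.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# synthetic data with one categorical column (an assumption for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'u': rng.normal(size=60),
    'v': rng.normal(size=60),
    'group': np.repeat(['g1', 'g2', 'g3'], 20),
})

# hue and style both map to `group`, so groups differ by color and by marker shape
ax = sns.scatterplot(data=demo, x='u', y='v', hue='group', style='group')
```

Readers who cannot distinguish the colors can still tell the groups apart by marker shape.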

Show a quantitative variable by using continuous colors:

In [30]:
sns.scatterplot(data=data, x='z0', y='z1', hue='id')

Also show a quantitative variable by varying the size of the points:

In [31]:
sns.scatterplot(data=data, x='z0', y='z1', size='id')

#### `sns.lmplot`

Plot a simple linear relationship between two variables:

In [32]:
# plot a line of best fit
sns.lmplot(data=data, x='z0', y='z2');

#### `sns.distplot`

Plot the distribution with a histogram, kernel density estimate, and rug plot:

In [33]:
z3 = data.sample(50)['z3']
sns.distplot(z3, hist=True, kde=True, rug=True)

#### `sns.boxplot`

Draw a vertical boxplot grouped by a categorical variable:

In [34]:
sns.boxplot(data=data, x='id', y='z2')

## Custom plots with `matplotlib`

* There are other great resources for learning the matplotlib API, for example, [this tutorial](https://www.southampton.ac.uk/~fangohr/training/python/notebooks/Matplotlib.html)

In [35]:
import matplotlib.pyplot as plt

### Matplotlib `axes` objects and Pandas plots

* An 'Axes' object contains the elements of a single plot.
    - contains a coordinate system (axis elements), 
    - the plot elements (e.g. line, bar), 
    - labels, 
    - tick-marks, etc.
    
* A `DataFrame.plot()` method call returns an `Axes` object.

In [36]:
# notice the <matplotlib.axes._subplots.AxesSubplot at 0x1a21f7bcf8>
data.set_index('x')['z0'].plot()

In [37]:
# save the plot as a variable
ax = data.set_index('x')['z0'].plot()

In [38]:
# get name of x-axis
ax.get_xlabel()

In [39]:
# get y-axis tick-labels
list(ax.get_yaxis().get_majorticklabels())

In [40]:
ax = data.set_index('x')['z0'].plot()
ax.set_xlabel('hi, this is my new axis label!')
ax.set_title('hi this is my new title!');

#### You can add elements to an Axes object

* The Pandas `.plot` method can add a plot to an existing Axes object using the `ax` keyword

In [41]:
ax = data['z0'].plot()

# add z1 to Axes
data['z1'].plot(ax=ax)

# add a vertical line using matplotlib
plt.plot([40,40],[-400, 300])

# add a point using matplotlib
plt.plot(15,-200, marker='x', markersize=10, color='red')

#### You can add a scatterplot to an existing scatterplot

In [42]:
ax = data.plot(kind='scatter', x='z0', y='z1', alpha=0.3)

# the 'ax' keyword in Pandas plot method attaches the new plot to an existing Axes object
data.plot(kind='scatter', x='z0', y='z3', ax=ax, c='g', alpha=0.3)

### Matplotlib `figure` and adding to empty subplots

* A 'Figure' object is a top-level container for all plotting objects.
    - controls overall size, title, fonts, coordination between different elements of subplots.

<img src="https://i.stack.imgur.com/HZWkV.png" width="25%">  

* Instantiate an empty figure containing multiple plots with `plt.subplots`
    - `fig, axes = plt.subplots(R, C)` returns a figure `fig` and an array of `axes`.
    - `axes` has `R` rows and `C` columns corresponding to the subplots laid out on a grid.
    - The `axes` are initially empty; they need to be given data to plot.
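
The shape of the returned `axes` array depends on `R` and `C`, which matters when indexing into it. A minimal sketch (the variable names are illustrative):

```python
import matplotlib.pyplot as plt

# a 2x3 grid: axes is a 2-D NumPy array indexed as axes[row, col]
fig, axes = plt.subplots(2, 3)
grid_shape = axes.shape          # (2, 3)

# with a single row (or a single column), the array is squeezed to 1-D
fig_row, axes_row = plt.subplots(1, 2)
row_shape = axes_row.shape       # (2,)

# squeeze=False keeps the 2-D layout regardless of R and C
fig_sq, axes_sq = plt.subplots(1, 2, squeeze=False)
sq_shape = axes_sq.shape         # (1, 2)

# the Figure controls figure-level properties such as the overall title
fig.suptitle('2 x 3 grid of empty Axes')
```

Passing `squeeze=False` can simplify code that loops over rows and columns, since `axes[i, j]` then works for any grid size.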
   

In [43]:
fig, axes = plt.subplots(1, 2)

In [44]:
len(axes), type(axes), type(axes[0])

In [45]:
fig, axes = plt.subplots(1, 2, figsize=(12,4))

df = data.set_index('x')
df['z0'].plot(ax=axes[0], title='z0')
df['z1'].plot(ax=axes[1], title='z1')

In [46]:
fig, axes = plt.subplots(2, 1, sharex=True)

df = data.set_index('x')
df['z0'].plot(ax=axes[0], title='z0')
df['z1'].plot(ax=axes[1], title='z1')

### Practice: plots and groupby

* Can we plot histograms of `z2` for each value of `id`?

In [47]:
# Hard to understand!
data.drop('x', axis=1).groupby('id')['z2'].plot(kind='hist', alpha=0.3);

In [48]:
data['id'].nunique()

In [49]:
grps = data.groupby('id')
for k, gp in grps:
    print('**** ' + str(k) + ' ****', gp.head().to_string(), sep='\n', end='\n\n')

In [50]:
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)

for k, gp in data.groupby('id'):
    x_idx = k // 2
    y_idx = k % 2
    ax = axes[x_idx, y_idx]
    title = 'id = %d' % k
    gp['z2'].plot(kind='hist', density=True, ax=ax, title=title)
    
fig.suptitle('Distribution of z2 by id-number');


**Question (Optional)**: Can you plot the histograms of each column by `id`? Each row should contain the histograms by `id` of a single variable (there should be 3 rows and 4 columns). Write this generally enough to handle an arbitrary number of variables and values of `id`.

### Practice problems

* Below is a dataset in the seaborn package that contains data on restaurant bills and (service) tips.
* Try to understand the dataset via plotting using the examples in the notebook.
    - Plot histograms and boxplots for quantitative columns
    - Plot counts of categorical values using bar plots
    - Plot a scatter plot of `tip` vs `total_bill` -- is the relationship linear?

In [51]:
tips = sns.load_dataset('tips')

In [52]:
tips.head()

**Question 1**

Plot the counts of meals in `tips` by day. Your plotting function, `plot_meal_by_day`, should return a `matplotlib.axes._subplots.AxesSubplot` object; your plot should look like the plot below.

<img src="imgs/barh.png" width="50%"/>

In [54]:
# don't change this cell -- it is needed for the tests to work
tips = sns.load_dataset('tips')
q1_fig = plot_meal_by_day(tips)

In [None]:
grader.check("q1")

**Question 2**

Plot a seaborn scatterplot using the `tips` data by day. Your plotting function, `plot_bill_by_tip`, should return a `matplotlib.axes._subplots.AxesSubplot` object; your plot should look like the plot below.
* `tip` is on the x-axis.
* `total_bill` is on the y-axis.
* The color of the dots is given by `day`.
* The size of the dots is given by the `size` of the table.

<img src="imgs/scatter.png" width="50%"/>

In [59]:
# don't change this cell -- it is needed for the tests to work
tips = sns.load_dataset('tips')
q2_fig = plot_bill_by_tip(tips)

In [None]:
grader.check("q2")

**Question 3**

Plot a figure with two subplots side-by-side. The left plot should contain the **counts** of tips *as a percentage of the total bill*. The right plot should contain the **density plot** of tips as a percentage of the total bill. Your plotting function, `plot_tip_percentages`, should return a `matplotlib.figure.Figure` object; your plot should look like the plot below (use 10 bins).

<img src="imgs/hist.png" width="50%"/>

In [64]:
fig, axes = plt.subplots(1, 2)

# plot axes[0]
...
# plot axes[1]
...
# add the title to fig
...

In [65]:
# don't change this cell -- it is needed for the tests to work
tips = sns.load_dataset('tips')
q3_fig = plot_tip_percentages(tips)

In [None]:
grader.check("q3")

## Congratulations! You're done!

* Submit your `.py` file to Gradescope. Note that you only need to submit the `.py` file; this notebook should not be uploaded. Make sure that all of your work is in the `.py` file and not here by running the doctests: `python -m doctest discussion.py`.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()