# Data Visualization and Pipeline

© Data Trainers LLC. GPL v 3.0.

Author: Axel Sirota

## Learning Objectives

- **Practice** using different types of plots.
- **Use** Pandas methods for plotting.
- **Create** line plots, bar plots, histograms, and box plots.
- **Know** when to use Seaborn or advanced Matplotlib.

## Let's review the data science lifecycle

<img src="https://www.dropbox.com/scl/fi/avrlrc68dbqifb1yi7cmw/lifecycle.png?rlkey=13w9biwqwsiit9zxjyaay44bb&raw=1"  align="center"/>

## Data Science is an iterative process

<img src="https://www.dropbox.com/scl/fi/sfph7r6nd30icnf7emaer/iterate.png?rlkey=wq2aowuqvjwajty23loj9z6ju&raw=1"  align="center"/>


## Data Visualization

## Lesson Guide

- [Line Plots](#line-plots)
- [Bar Plots](#bar-plots)
- [Histograms](#histograms)
    - [Grouped Histograms](#grouped-histograms)
- [Box Plots](#box-plots)
    - [Grouped Box Plots](#grouped-box-plots)
- [Scatter Plots](#scatter-plots)
- [Using Seaborn](#using-seaborn)
- [OPTIONAL: Understanding Matplotlib (Figures, Subplots, and Axes)](#matplotlib)
- [OPTIONAL: Additional Topics](#additional-topics)
- [Summary](#summary)

### Why Use Data Visualization?

---

Because of the way the human brain processes information, charts or graphs that visualize large amounts of complex data are easier to understand than spreadsheets or reports.

Data visualization is a quick, easy way to convey concepts in a universal manner, and you can experiment with different scenarios by making slight adjustments.

### Pandas Plotting Documentation

[Link to Documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html)

In [None]:
from IPython.display import HTML

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
%matplotlib inline

# Increase default figure and font sizes for easier viewing.
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

### Choosing the right type of visualization

The choice of visualization should depend on what you are trying to show. Here is a helpful flowchart that you can use to determine the best type of visualization.

![Chart Suggestions](https://www.dropbox.com/scl/fi/cozurwr8iiv3kbssqoixk/chart_suggestions.jpg?rlkey=nkm4l7a0zv8vy7mnskpmjos2s&raw=1)


### The Importance of Visualization

- Given the same data, different visualization styles can convey different messages
- Examples here:

https://medium.economist.com/mistakes-weve-drawn-a-few-8cdd8a42d368

### Load in data sets for visualization examples.

The Boston data dictionary can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names).

In [None]:
%%writefile get_data.sh
mkdir -p data
if [ ! -f data/drinks.csv ]; then
  wget -O data/drinks.csv https://www.dropbox.com/scl/fi/tkfdy0mq30g2t424hmn5o/drinks.csv?rlkey=jl8r4aw1o7y7b5au8icub20pn&dl=0
fi

if [ ! -f data/ufo.csv ]; then
  wget -O data/ufo.csv https://www.dropbox.com/scl/fi/jfdtcoxw3iujoarrqn4uk/ufo.csv?rlkey=rc55ogsir1dpd9h6kmvecpkif&dl=0
fi

if [ ! -f data/boston_housing_data.csv ]; then
  wget -O data/boston_housing_data.csv https://www.dropbox.com/scl/fi/uy9r66ukt34unnhhzak2r/boston_housing_data.csv?rlkey=hzad8f0ot9wyvnqsxrney2ll7&dl=0
fi

if [ ! -f data/sales_info.csv ]; then
  wget -O data/sales_info.csv https://www.dropbox.com/scl/fi/kw6ieq5mh4bus01v2edg5/sales_info.csv?rlkey=vpgbwktfa590onanqq2eu8pwb&dl=0
fi

if [ ! -f data/train.csv ]; then
  wget -O data/train.csv https://www.dropbox.com/scl/fi/a5roxr4r1bk8dvemt64x2/train.csv?rlkey=e6ewu6f0lvtoax3dpjauuhn3f&dl=0
fi

In [None]:
!bash get_data.sh

In [None]:
# Read in the Boston housing data.
housing_csv = 'data/boston_housing_data.csv'
housing = pd.read_csv(housing_csv)

# Read in the drinks data.
drink_cols = ['country', 'beer', 'spirit', 'wine', 'liters', 'continent']
url = 'data/drinks.csv'
drinks = pd.read_csv(url, header=0, names=drink_cols, na_filter=False)

# Read in the ufo data.
ufo = pd.read_csv('data/ufo.csv')
ufo['Time'] = pd.to_datetime(ufo.Time)
ufo['Year'] = ufo.Time.dt.year

## Understand each dataset well before trying to visualize it

In [None]:
housing.head()

<img src="https://www.dropbox.com/scl/fi/gaiu5a6miovn687yxqx18/boston.jpg?rlkey=qxslngj79lavdm9ce5h1lgmki&raw=1"  align="center"/>

In [None]:
drinks.head()

In [None]:
ufo.head()

<a id="line-plots"></a>
## Line plots: Show the trend of a numerical variable over time
---

- **Objective:** **Use** Pandas methods for plotting.
- **Objective:** **Create** line plots, bar plots, histograms, and box plots.

In [None]:
# Count the number of ufo reports each year (and sort by year).
ufo.Year.value_counts().sort_index()

In [None]:
# Compare with line plot -- UFO sightings by year. (Ordering by year makes sense.)
ufo.Year.value_counts().sort_index().plot();

In [None]:
drinks.continent.value_counts()

In [None]:
# COMMON MISTAKE: Don't use a line plot when the x-axis cannot be ordered sensibly!

# For example, ordering by continent below shows a trend where none exists ...
#    it would be just as valid to plot the continents in any order.

# So, a line plot is the wrong type of plot for this data.
# Always think about what you're plotting and if it makes sense.

drinks.continent.value_counts().plot();

**Important:** A line plot is the wrong type of plot for this data. Any set of unordered categories can be rearranged misleadingly to illustrate a negative trend, as we did here with the continents. Because of this, it is more appropriate to represent this data with a bar plot, which does not imply a trend based on order.

In [None]:
# Plot the same data as a (vertical) bar plot -- a much better choice!
drinks.continent.value_counts().plot(kind='bar');
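
The downward slope in the line plot above comes from how `value_counts()` orders its result, not from the data itself. The sketch below uses a small made-up set of category labels (not the drinks data) to show that the same counts produce a different "shape" once they are reordered.

```python
import pandas as pd

# Hypothetical category labels, used only for illustration.
labels = pd.Series(['AF', 'AF', 'AF', 'EU', 'EU', 'AS'])

counts = labels.value_counts()    # sorted by count, largest first
by_name = counts.sort_index()     # same numbers, alphabetical order

print(counts.tolist())            # [3, 2, 1] -> always slopes downward
print(by_name.tolist())           # [3, 1, 2] -> no consistent slope
```

Because the order of unordered categories is arbitrary, any apparent trend in a line drawn through them is an artifact of the sort.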

### Line Plot With a `DataFrame`

In [None]:
df = pd.DataFrame(np.random.randn(10, 4),
                  columns=['col1', 'col2', 'col3', 'col4'],
                  index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])

In [None]:
df

In [None]:
# With no `kind` argument, a DataFrame plots one line per column.
df.plot();

### How to change the size of a plot

In [None]:
# Technically the figsize is 10 "inches" (width) by 6 "inches" (height)
#   The figure is specified in inches for printing -- you set a dpi (dots/pixels per inch) elsewhere
df.plot(figsize=(10,6)); # width, height
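
As a rough illustration of how inches and dpi combine, the sketch below creates an empty Matplotlib figure with an explicit `dpi` and computes its size in pixels. The value `dpi=100` is an assumption chosen to make the arithmetic easy; your notebook's default may differ.

```python
import matplotlib.pyplot as plt

# figsize is in inches; dpi is dots (pixels) per inch.
fig = plt.figure(figsize=(10, 6), dpi=100)

# Pixel dimensions are simply inches multiplied by dpi.
width_px, height_px = fig.get_size_inches() * fig.dpi
print(width_px, height_px)

plt.close(fig)  # discard the empty figure
```

With these settings the figure should come out to 1000 by 600 pixels.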

### How to change the color of a plot

In [None]:
df['col1'].plot(color='y', figsize=(10,6), kind='bar');

### How to change the style of individual lines

#### Cheat Sheet of Styles
https://raw.githubusercontent.com/rougier/matplotlib-cheatsheet/master/matplotlib-cheatsheet.png


https://matplotlib.org/2.0.2/api/colors_api.html

In [None]:
# : - dotted line, -- - dashed line
# r - red, y - yellow
df[['col1', 'col4']].plot(figsize=(12,6), style={'col1': ':r', 'col4': '--y'});


### Challenge: Create a line plot of `RM` and `MEDV` in the housing data.

- For `RM`, use a solid green line. For `MEDV`, use a blue dashed line.
- Change the figure size to a width of 12 and height of 8.
- Change the style sheet to something you find [here](https://tonysyu.github.io/raw_content/matplotlib-style-gallery/gallery.html).

<img src="https://www.dropbox.com/scl/fi/vw9e0pzpy09bybx2aqbyy/hands_on.jpg?rlkey=nj06wyg85chydzua5xsu0a0ih&raw=1" width="100" height="100" align="right"/>

In [None]:
### Type your code here ....


## Let's normalize each column as a percentage of its median

In [None]:
housing['price_percent']=(housing['MEDV']/housing.MEDV.median())*100
housing['rm_percent']=(housing['RM']/housing.RM.median())*100
housing.head(5)

In [None]:
### Type your code here to plot both percents....


In [None]:
housing.drop(['price_percent','rm_percent'], axis=1, inplace=True)

### Some styles you can play with

In [None]:
%matplotlib inline
#plt.style.use('dark_background')
plt.style.use('classic')

<a id="bar-plots"></a>
## Bar Plots: Show a numerical comparison across different categories
---

In [None]:
# Count the number of countries in each continent.
drinks.continent.value_counts()

In [None]:
drinks.continent.value_counts().plot(kind='bar')

In [None]:
plt.style.use('fivethirtyeight')
drinks.continent.value_counts().plot(kind='bar')

In [None]:
# Compare with bar plot.
drinks.continent.value_counts().plot(kind='pie');

In [None]:
# Calculate the mean alcohol amounts for each continent.
# numeric_only=True skips the text column (country), which recent Pandas versions refuse to average.
drinks.groupby('continent').mean(numeric_only=True)

In [None]:
# Side-by-side bar plots
drinks.groupby('continent').mean(numeric_only=True).plot(kind='bar');

In [None]:
# Sort the continent x-axis by a particular column.
drinks.groupby('continent').mean(numeric_only=True).sort_values('beer').plot(kind='bar');

In [None]:
drinks.groupby('continent').mean(numeric_only=True).drop('liters', axis=1).plot(kind='bar');

In [None]:
# Stacked bar plot (with the liters comparison removed!)
drinks.groupby('continent').mean(numeric_only=True).drop('liters', axis=1).plot(kind='bar', stacked=True);

### Using a `DataFrame` and Matplotlib commands, we can get fancy.

In [None]:
df.head()

In [None]:
ax = df.plot(kind='bar', figsize=(15,10))

# Set the title.
ax.set_title('Some Kinda Plot Thingy', fontsize=21, y=1.01)

# Move the legend.
ax.legend(loc=9)

# y-axis label
ax.set_ylabel('Important y-axis info', fontsize=16)

# x-axis label
ax.set_xlabel('Meaningless x-axis info', fontsize=16)


### Challenge: Create a bar chart using `col1` and `col2`.

- Give the plot a large title of your choosing.
- Move the legend to the lower-left corner.

- Do the same thing but with horizontal bars.
- Move the legend to the upper-right corner.

### Stacked works on horizontal bar charts.

In [None]:
df.plot(kind='barh', stacked=True, figsize=(16,8))

<a id="histograms"></a>
## Histograms: Show the distribution of a numerical variable
---


In [None]:
# Sort the beer column and mentally split it into three groups.
drinks.beer.sort_values().values

In [None]:
drinks.shape

In [None]:
# Compare the above with histogram.
# About how many of the points above fall into each bin?
# Here bins = int(ln(193)) = 5, so the range is split into five equal-width groups.
n = np.log(193)
print(n)
drinks.beer.plot(kind='hist', bins=int(n));
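
The natural-log bin count used above is one heuristic among several. As a point of comparison, the sketch below computes it next to two other common rules of thumb, Sturges' rule and the square-root rule, for the same 193 rows. None of these is presented as the "correct" choice; they are starting points to adjust by eye.

```python
import math

n_rows = 193  # number of rows, as hard-coded in the cell above

log_bins = int(math.log(n_rows))                  # natural log, as used above
sturges_bins = math.ceil(math.log2(n_rows)) + 1   # Sturges' rule
sqrt_bins = math.ceil(math.sqrt(n_rows))          # square-root rule

print(log_bins, sturges_bins, sqrt_bins)          # 5 9 14
```

The three rules disagree by almost a factor of three here, which is a good reason to try a few bin counts before settling on one.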

In [None]:
# Try more bins: Pandas takes the range of the data and divides it into 8 evenly spaced bins.
ax = drinks.beer.plot(kind='hist', bins=8);
ax.set_xlabel('Beer Servings');
ax.set_ylabel('Frequency');

In [None]:
# Compare with density plot (smooth version of a histogram).
drinks.beer.plot(kind='density', xlim=(0, 500));

In [None]:
# Making histograms of DataFrames — histogram of random data
df.hist(figsize=(16,8));
# df.plot(kind='hist')

### Single Histogram

In [None]:
norm = np.random.standard_normal(50000)

In [None]:
pd.Series(norm).hist(figsize=(16,4), bins=50);

### Another bins example: Sometimes the binning makes the data look different or misleading.

In [None]:
pd.Series(norm).hist(figsize=(16,4), bins=11);

In [None]:
np.log(50000)
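
To make the effect of binning concrete without a plot, the sketch below counts the same made-up sample with two different bin settings using `np.histogram`. The data is invented for illustration: two clusters separated by a gap.

```python
import numpy as np

# Made-up sample with two clusters and a gap between them.
data = np.array([1, 1, 2, 2, 3, 7, 8, 8, 9, 9])

# Two wide bins hide the gap: the data looks evenly spread.
coarse_counts, coarse_edges = np.histogram(data, bins=2)
print(coarse_counts)   # [5 5]

# Four narrower bins reveal the empty region in the middle.
fine_counts, fine_edges = np.histogram(data, bins=4)
print(fine_counts)     # [4 1 0 5]
```

Neither view is wrong, but they suggest different stories, so it is worth checking more than one bin count.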

### Challenge: Create a histogram of `MEDV` in the housing data using Pandas.
- Set the bins to 20.

<img src="https://www.dropbox.com/scl/fi/vw9e0pzpy09bybx2aqbyy/hands_on.jpg?rlkey=nj06wyg85chydzua5xsu0a0ih&raw=1" width="100" height="100" align="right"/>

In [None]:
# Do a hist for housing MEDV column


<a id="grouped-histograms"></a>
### Grouped histograms: Show one histogram for each group.

In [None]:
# Reminder: Overall histogram of beer servings
drinks.beer.plot(kind='hist');

In [None]:
# Histogram of beer servings grouped by continent -- how might these graphs be misleading?
drinks.hist(column='beer', by='continent');

In [None]:
# Share the x- and y-axes.
drinks.hist(column='beer', by='continent', sharex=True, sharey=True, layout=(2, 3));
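
One way grouped histograms can mislead is that each panel picks its own axis limits unless told otherwise. The sketch below uses plain Matplotlib and two made-up groups (one ten times larger than the other) to compare the y-axis limits with and without `sharey=True`.

```python
import matplotlib.pyplot as plt

small = [1, 2, 2, 3]         # made-up group with few observations
large = [1, 2, 2, 3] * 10    # same shape, ten times as many observations

# Independent axes: each panel scales to its own tallest bar.
fig, (left, right) = plt.subplots(1, 2)
left.hist(small, bins=3)
right.hist(large, bins=3)
independent = (left.get_ylim(), right.get_ylim())

# Shared y-axis: both panels use the same scale.
fig_shared, (left_s, right_s) = plt.subplots(1, 2, sharey=True)
left_s.hist(small, bins=3)
right_s.hist(large, bins=3)
shared = (left_s.get_ylim(), right_s.get_ylim())

print(independent[0] == independent[1], shared[0] == shared[1])
plt.close('all')
```

With independent axes the two panels look identical even though one group is ten times larger; sharing the axis makes the difference in size visible.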

<a id="box-plots"></a>
## Box Plots: Show quartiles (and outliers) for one or more numerical variables
---

We can use boxplots to quickly summarize distributions.

**Five-number summary:**

- min = minimum value
- 25% = first quartile (Q1) = median of the lower half of the data
- 50% = second quartile (Q2) = median of the data
- 75% = third quartile (Q3) = median of the upper half of the data
- max = maximum value

(It's more useful than mean and standard deviation for describing skewed distributions.)

**Interquartile Range (IQR)** = Q3 - Q1

**Outliers:**

- below Q1 - 1.5 * IQR
- above Q3 + 1.5 * IQR
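
The definitions above can be worked through on a small made-up sample. This sketch uses Python's `statistics.quantiles` with `method='inclusive'`, which interpolates linearly between data points; quartile conventions differ slightly between tools, so the "median of each half" shortcut can give marginally different values for Q1 and Q3.

```python
import statistics

# Made-up sample; 50 is deliberately far from the rest.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

q1, q2, q3 = statistics.quantiles(data, n=4, method='inclusive')
iqr = q3 - q1

# Fences: points outside these limits are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

five_number = (min(data), q1, q2, q3, max(data))
print(five_number)                 # (1, 3.25, 5.5, 7.75, 50)
print(lower_fence, upper_fence)    # -3.5 14.5
print(outliers)                    # [50]
```

In a box plot of this sample, the box would span 3.25 to 7.75, and 50 would be drawn as a separate point beyond the upper whisker.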

In [None]:
df.boxplot(column='col1');

### Let's see how box plots are generated so we can best interpret them.

In [None]:
# Sort the spirit column.
drinks.spirit.sort_values().values

In [None]:
# Show "five-number summary" for spirit.
drinks.spirit.describe()

In [None]:
# Compare with box plot.
drinks.spirit.plot(kind='box');

In [None]:
drinks.spirit.hist()

In [None]:
drinks.spirit.plot(kind='hist');

In [None]:
# Include multiple variables.
drinks.drop('liters', axis=1).plot(kind='box');

### How to use a box plot to preview the distributions in the housing data

In [None]:
housing.boxplot();

<a id="grouped-box-plots"></a>
### Grouped box plots: Show one box plot for each group.

In [None]:
# Reminder: box plot of beer servings
drinks.beer.plot(kind='box');

In [None]:
# Box plot of beer servings grouped by continent
drinks.boxplot(column='beer', by='continent');

In [None]:
# Box plot of all numeric columns grouped by continent
drinks.boxplot(by='continent');

<a id="scatter-plots"></a>
## Scatter plots: Show the relationship between two numerical variables
---


In [None]:
# Select the beer and wine columns and sort by beer.
drinks[['beer', 'wine']].sort_values('beer').values

In [None]:
# Compare with scatter plot.
drinks.plot(kind='scatter', x='beer', y='wine');

In [None]:
# Add transparency (great for plotting several graphs on top of each other, or for illustrating density!).
drinks.plot(kind='scatter', x='beer', y='wine', alpha=0.3);

### Challenge: Create a scatter plot to view the association between the variables `ZN` and `INDUS`.

<img src="https://www.dropbox.com/scl/fi/vw9e0pzpy09bybx2aqbyy/hands_on.jpg?rlkey=nj06wyg85chydzua5xsu0a0ih&raw=1" width="100" height="100" align="right"/>


In [None]:
# type answer


<a id="matplotlib"></a>
## OPTIONAL: Understanding Matplotlib (Figures, Subplots, and Axes)

---

Matplotlib uses a blank canvas called a figure.

In [None]:
# plt.subplots returns both the figure and its axes, so unpack the pair.
fig, ax = plt.subplots(1,1, figsize=(16,8));

Within this canvas, we can contain smaller objects called axes.

In [None]:
fig, axes = plt.subplots(2,3, figsize=(16,8));

Pandas allows us to plot to a specified axes if we pass the object to the ax parameter.

In [None]:
fig, axes = plt.subplots(2,3, figsize=(16,8))
df.plot(ax=axes[0][0]);
df['col1'].plot(ax=axes[0][1]);
df['col2'].plot(ax=axes[1][1]);
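
When every panel should get its own column, indexing each axes by hand becomes repetitive. The sketch below shows one possible pattern: `plt.subplots` returns a NumPy array of axes, and its `.flat` iterator walks the grid row by row. The `demo` frame is a stand-in built the same way as the random `df` in this lesson.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the random `df` used in this lesson.
demo = pd.DataFrame(np.random.randn(10, 4),
                    columns=['col1', 'col2', 'col3', 'col4'])

fig, axes = plt.subplots(2, 2, figsize=(12, 6))
print(axes.shape)   # (2, 2)

# Pair each axes with one column and draw that column on it.
for ax, name in zip(axes.flat, demo.columns):
    demo[name].plot(ax=ax, title=name)

fig.tight_layout()  # reduce overlap between panel titles and tick labels
```

The same loop works for any grid size as long as there are at least as many axes as columns.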

## Let's use a bit more customization.
---

In [None]:
fig, axes = plt.subplots(2,2, figsize=(16,8))

# We can change the ticks' size.
df['col2'].plot(figsize=(16,4), color='purple', fontsize=21, ax=axes[0][0])

# We can also change which ticks are visible.
# Let's show only the even ticks. ('idx % 2 == 0' only if 'idx' is even.)
ticks_to_show = [idx for idx, _ in enumerate(df['col2'].index) if idx % 2 == 0]
df['col2'].plot(figsize=(16,4), color='purple', xticks=ticks_to_show, fontsize=16, ax=axes[0][1])

# We can change the label rotation.
df.plot(figsize=(15,7), title='Big Rotated Labels - Tiny Title',
        fontsize=20, rot=-50, ax=axes[1][0])

# We have to use ".set_title()" to fix title size.
df.plot(figsize=(16,8), fontsize=20, rot=-50, ax=axes[1][1])\
       .set_title('Better-Sized Title', fontsize=21, y=1.01);

<a id="additional-topics"></a>
## OPTIONAL: Additional Topics

In [None]:
# Saving a plot to a file
drinks.beer.plot(kind='hist', bins=20, title='Histogram of Beer Servings');
plt.xlabel('Beer Servings');
plt.ylabel('Frequency');
plt.savefig('beer_histogram.png');    # Save to file!
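
The cell above saves through the `plt` interface, which acts on whichever figure is current. A variation, sketched below with made-up data, keeps an explicit figure object and saves through it, which can be easier to follow when several figures are open. The output path, `dpi=150`, and `bbox_inches='tight'` are illustrative choices, not requirements.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 6))
ax.hist([1, 2, 2, 3, 3, 3], bins=3)   # made-up data
ax.set_xlabel('Value')
ax.set_ylabel('Frequency')

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), 'example_histogram.png')

# dpi controls the resolution; bbox_inches='tight' trims surrounding whitespace.
fig.savefig(out_path, dpi=150, bbox_inches='tight')
plt.close(fig)
```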

In [None]:
# List available plot styles
plt.style.available

In [None]:
# Change to a different style.
plt.style.use('ggplot')

<a id="summary"></a>
### Summary

In this lesson, we showed examples of how to create a variety of plots using Pandas and Matplotlib. We also showed how to use each plot to effectively display data.

Do not be concerned if you do not remember everything — this will come with practice! Although there are many plot styles, many similarities exist between how each plot is drawn. For example, they have most parameters in common, and the same Matplotlib functions are used to modify the plot area.

We looked at:
- Line plots
- Bar plots
- Histograms
- Box plots
- Scatter plots
- Special seaborn plots
- How Matplotlib works