# Data Visualization in Python
Take data and turn it into something colorful, graphical and meaningful :) 


## Intro
Data visualization allows data scientists to graphically represent data to extract and understand trends, outliers, patterns and further insights in the data.

Python has many, **many** graphing libraries with different features, and it can be daunting to know which one to use.  This intro tutorial focuses on a few popular plotting libraries:
* **Pandas** - built on Matplotlib and easy to use with Pandas DataFrames
* **Matplotlib** - massive library with lots of flexibility (Stack Overflow will be your friend!)
* **Seaborn** - statistical visualization with default themes and beautiful styles

For interactive plots (and possibly a future tutorial), consider:
* Plotly
* D3
* Bokeh


This tutorial compares Matplotlib, Pandas and Seaborn for the following visualizations:
* Scatter Plots 
* Line Charts 
* Histograms 
* Bar Charts 
* Box Plots 
* Pie Charts 
* Heatmaps 
* Faceting
* Pairplots



### Setup 
Personally, I prefer to use a Conda environment and/or Docker container (if you are interested in knowing more about these setups let me know).  The README includes the specific packages you will need for this tutorial.  Please make sure you can run the following cell without any errors.

In [1]:
## Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine, load_iris

%matplotlib inline

### Note:
You may have to install or update the library `jinja2` for certain highlighting
options to work (depending on the bug correction schedules for the `jinja2` or
`pandas` development teams).

## Data 
Scikit-learn includes out-of-the-box [datasets](https://scikit-learn.org/stable/datasets/index.html), which are great for practicing your data visualization skills.  We will use the classic wine and iris datasets - imported above.

Since most of the plotting libraries play very nicely with Pandas DataFrames, we will format the *wine* and *iris* datasets with Pandas.


In [None]:
# Load the wine data set into a Pandas dataframe and view the first 5 rows
wine = load_wine()
wine_df = pd.DataFrame(data= np.c_[wine.data, wine.target],
                 columns= list(wine.feature_names) + ['target'])
# Add class label
wine_df['class'] = pd.Categorical.from_codes(wine.target, wine.target_names)

# Show the top 5 rows of Dataframe
wine_df.head(5)

In [None]:
# Load the iris data set into a Pandas dataframe and view the first 5 rows
iris = load_iris()
iris_df =  pd.DataFrame(data= np.c_[iris.data, iris.target],
                 columns= list(iris.feature_names) + ['target'])

# Add class label
iris_df['class'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Show the top 5 rows of Dataframe
iris_df.head(5)

### Pandas DataFrame Descriptive Statistics
Don't forget to understand the statistics of your data!  You can also style your DataFrame much as you might a spreadsheet, which adds visual impact to quantitative data.

If you need a brief statistical refresher especially in the context of Python please see [Python Statistics Fundamentals: How to Describe Your Data](https://realpython.com/python-statistics/).

In [4]:
# Generate descriptive statistics of your DataFrame
iris_df.describe()

In [5]:
# Example using a color gradient to display data values in the DataFrame
# For max/min use highlight_max()/highlight_min()
iris_df.describe().style.background_gradient(cmap='BuGn')

In [None]:
(iris_df.loc[:, 'sepal length (cm)':'petal width (cm)']
    .head(10)
    .style
    .background_gradient(subset=['sepal length (cm)', 'petal length (cm)'], cmap='BuGn')
    .background_gradient(subset=['sepal width (cm)', 'petal width (cm)'], cmap='PuRd')
    .highlight_max(color='yellow'))

# Scatter Plots
The x-y plot, or scatter plot, shows pairs of values from two variables as points.

### Matplotlib
Use **scatter** to draw the points.  Creating the figure and axis with **plt.subplots** lets us style the plot with a title and labels.

In [None]:
# create a figure and axis
fig, ax = plt.subplots()

# scatter the sepal length against the sepal width
ax.scatter(iris_df['sepal length (cm)'], iris_df['sepal width (cm)'])
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal length (cm)')
ax.set_ylabel('sepal width (cm)')

The plot would have more meaning if the data points are colored by class.  In Matplotlib we can create a color dictionary and then assign a color for each class.

In [None]:
# create color dictionary
colors = {'setosa':'r', 'versicolor':'g', 'virginica':'b'}
# create a figure and axis
fig, ax = plt.subplots()
# plot each data-point
for c in np.unique(iris_df['class']):
    ax.scatter(iris_df['sepal length (cm)'].loc[iris_df['class']==c], 
               iris_df['sepal width (cm)'].loc[iris_df['class']==c],
               s=10*iris_df['petal length (cm)'].loc[iris_df['class']==c],
               alpha=0.5,
               color=colors[c],
               label = c)

# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal length (cm)')
ax.set_ylabel('sepal width (cm)')
ax.legend(loc='best')

### Pandas 
To create a scatter plot in Pandas, call **plot.scatter()** on the DataFrame. The axis labels are created automatically from the DataFrame column names.

In [9]:
# create Pandas scatter plot
iris_df.plot.scatter(x='sepal length (cm)', y='sepal width (cm)', title='Iris Dataset')

In [None]:
# Add coloring to plot by class 
# Note here we use the numeric target and not the class names
iris_df.plot.scatter(x='sepal length (cm)', 
                     y='sepal width (cm)', 
                     title='Iris Dataset', 
                     c='target',
                     colormap='viridis')

### Seaborn
Seaborn has a **sns.scatterplot** function that creates a scatter plot much like Pandas does.

In [None]:
# create a seaborn scatter plot
sns.scatterplot(x='sepal length (cm)', y='sepal width (cm)', data=iris_df)
plt.title('Iris Dataset')

In [None]:
# add style to scatterplot by coloring by class
sns.scatterplot(x='sepal length (cm)', 
                y='sepal width (cm)',
                data=iris_df,
                hue = 'class',
                size='petal length (cm)')
plt.title('Iris Dataset')
#plt.legend(loc='upper center', bbox_to_anchor=(1.3, 1.0),
#          fancybox=True, shadow=True, ncol=1)
plt.tight_layout()

# Line Charts
One of the most fundamental plots: a line chart displays a series of data points connected by line segments.

### Matplotlib
Create a line chart with the **plot** method.  Matplotlib requires a loop to plot multiple columns in one graph.

In [None]:
# get columns to plot
columns = iris_df.columns.drop(['target', 'class'])
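# create x data
x_data = range(0, iris_df.shape[0])
# create figure and axis
fig, ax = plt.subplots()
# plot each column
for column in columns:
    ax.plot(x_data, iris_df[column], label=column)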
# set title and legend
ax.set_title('Iris Dataset')
ax.legend()

### Pandas
Create line-chart with **.plot.line()** without any loops :) 

In [14]:
# pandas line chart
iris_df.drop(['target', 'class'], axis=1).plot.line(title='Iris Dataset')

### Seaborn
Create a line chart in Seaborn with **sns.lineplot**.

In [15]:
# seaborn line chart
sns.lineplot(data=iris_df.drop(['target', 'class'], axis=1))

# Histograms
Histograms are useful when a variable has a large number of unique values.  The values are sorted into intervals, called bins, and the height of each bar shows how many values fall into that bin.  From a histogram we can understand the distribution of the data.
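
To make the binning step concrete, here is a minimal sketch that uses NumPy's `np.histogram` to count values per bin without drawing anything.  The sample values and the choice of three bins are made up for illustration and are not part of the wine or iris data.

```python
import numpy as np

# Hypothetical sample, made up for illustration
values = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])

# Split the value range into 3 equal-width bins and count the values in each
counts, bin_edges = np.histogram(values, bins=3)

print(bin_edges)  # boundaries of the bins
print(counts)     # number of values that fall in each bin
```

The plotting libraries below do this same counting for you and then draw one bar per bin.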

### Matplotlib
Use method **hist**.

In [None]:
# create figure and axis
fig, ax = plt.subplots()
# plot histogram
ax.hist(wine_df['alcohol'])
# set title and labels
ax.set_title('Wine Alcohol')
ax.set_xlabel('Alcohol')
ax.set_ylabel('Frequency')

### Pandas
Create a histogram in Pandas with **plot.hist**; no parameters are required.

In [17]:
# histogram using Pandas
wine_df['alcohol'].plot.hist()

In [None]:
# Also easily create multiple plots
iris_df.drop(['target', 'class'], axis=1).plot.hist(subplots=True, layout=(2,2), figsize=(10, 10), bins=20)


### Seaborn 
Seaborn uses **sns.histplot**, which accepts many additional parameters (the older **sns.distplot** is deprecated).

In [None]:
# sns histogram
sns.histplot(wine_df['alcohol'])
plt.ylabel('Frequency')
plt.title('Wine Alcohol')


In [None]:
# sns histogram with kernel density estimate
sns.histplot(wine_df['alcohol'], kde=True, stat='density')
plt.ylabel('Distribution')
plt.title('Wine Alcohol')


# Bar Charts
Bar charts illustrate data corresponding to given labels or discrete numeric values (like pie charts).  These charts work well when there is low cardinality (not a lot of categories).

### Matplotlib
Use the **bar** method, with the caveat that you need to calculate the frequencies of the categories yourself, which can be done using **value_counts**.

In [None]:
# create a figure and axis 
fig, ax = plt.subplots() 
# count the occurrence of each class 
data = wine_df['class'].value_counts() 
# get x and y data
classes = data.index.tolist()
frequency = data.values
# create bar chart
ax.bar(classes, frequency)
# set title and labels 
ax.set_title('Wine Classes') 
ax.set_xlabel('Class') 
ax.set_ylabel('Count')

### Pandas
Use the **plot.bar()** method.  As with Matplotlib, we first need to count the occurrences using **value_counts** and sort them using **sort_index**.

In [None]:
wine_df['class'].value_counts().sort_index().plot.bar()
plt.title('Wine Classes')
plt.xlabel('Class')
plt.ylabel('Count')


In [None]:
wine_df['class'].value_counts().sort_index().plot.barh(color=['red', 'blue', 'green'])
plt.title('Wine Classes')
plt.ylabel('Class')
plt.xlabel('Count')

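In [None]:
# average alcohol per class, highest first
wine_df.groupby('class', observed=True)['alcohol'].mean().sort_values(ascending=False).plot.bar()
plt.title('Wine Classes with the Highest Alcohol (on Average)')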
plt.xlabel('Class')
plt.ylabel('Average Alcohol')

### Seaborn
Use **sns.countplot** to create a bar chart, no need to do any data manipulation :) 

In [None]:
sns.countplot(x='class', data=wine_df)
plt.title('Wine Class')

# Box Plots
Box plots (and violin plots) are excellent for visualizing descriptive statistics of a dataset since they show the range, interquartile range, median, quartiles and outliers.  A box plot does not show the mode; a violin plot additionally shows the shape of the distribution.
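
As a rough sketch of what a box plot summarizes, the snippet below computes the quartiles, the interquartile range and the outliers by hand with NumPy.  The sample is made up for illustration, and the 1.5 x IQR rule is assumed because it is Matplotlib's default whisker setting.

```python
import numpy as np

# Hypothetical sample with one obvious outlier
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Quartiles: the box spans q1 to q3, with a line at the median
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(q1, median, q3, iqr)
print(outliers)
```

The plotting functions below compute these values internally, so this is only meant to show where the box, whiskers and outlier markers come from.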

### Matplotlib 
Use the **boxplot()** method.  To display multiple columns in one figure, we need to collect the columns into a list.

In [None]:
# get columns to plot
columns = iris_df.columns.drop(['target', 'class'])
data = []
for column in columns:
    data.append(iris_df[column])
# create figure and axis
fig, ax = plt.subplots()
# plot each column
ax.boxplot(data)

# set title and legend
ax.set_title('Iris Dataset')
ax.set_xticklabels(columns, rotation=45)
ax.set_xlabel('Feature')
ax.set_ylabel('cm')

### Pandas
Use **.boxplot()** to create a box plot from a Pandas DataFrame.

In [None]:
iris_df.drop(['target', 'class'], axis=1).boxplot()
plt.title('Iris Dataset')
plt.xlabel('Feature')
plt.ylabel('cm')

### Seaborn
All you need is the **sns.boxplot()** function to create amazing box plots.

In [None]:
sns.boxplot(data=iris_df.drop(['target', 'class'], axis=1))
plt.title('Iris Dataset')
plt.xlabel('Feature')
plt.ylabel('cm')

In [None]:
# Boxplots are also useful for viewing the different statistics of a feature broken down by class
sns.boxplot(x='class', y='sepal length (cm)', data=iris_df)
plt.title('Iris Dataset')

In [None]:
# Alternative to box plot is a violin plot showing 
# the kernel density estimation underlying the distribution
sns.violinplot(x="class", y="sepal length (cm)", data=iris_df)
plt.title('Iris Dataset')

# Pie Charts
Represent data with a small number of labels and given relative frequencies.  

### Matplotlib
Use **plt.pie** method to create a pie chart.

In [None]:
#first create data arrays 
data = wine_df['class'].value_counts()
labels = data.index.tolist()
sizes = data.values
colors = ['yellowgreen', 'lightcoral', 'lightskyblue']
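explode = (0.1, 0, 0)  # explode 1st slice

# plot the pie chart with percentage labels
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%')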
plt.tight_layout()
plt.axis('equal')
plt.show()

### Pandas
Use the **.plot.pie** method on a Series, or on a DataFrame by passing a column name as `y`.
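
A minimal sketch of the Pandas approach is shown here.  To keep it self-contained it uses a small made-up Series of class labels as a stand-in for `wine_df['class']`; the label counts are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical class labels standing in for wine_df['class']
labels = pd.Series(["class_0"] * 3 + ["class_1"] * 5 + ["class_2"] * 2, name="class")

# Count the occurrences of each class, then plot the counts as a pie chart
counts = labels.value_counts().sort_index()
ax = counts.plot.pie(autopct="%1.1f%%")

ax.set_ylabel("")  # hide the Series name that Pandas uses as the y label
ax.set_title("Wine Classes")
```

With the real data, the same chain should work as `wine_df['class'].value_counts().sort_index().plot.pie()`.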

### Seaborn 
Sorry but seaborn does not have a piechart method :( 

# Heatmaps
Heatmaps visually represent a matrix: the colors encode the values of its elements.  They are often used to show covariance and correlation matrices. Pandas **.corr()** lets you easily create a correlation matrix, or you can use NumPy's **np.corrcoef** function.
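
The two routes give the same Pearson correlation matrix.  The sketch below checks this on a small random DataFrame; the column names `a`, `b`, `c` and the data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "c" is built to be strongly correlated with "a"
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)})
df["c"] = 2 * df["a"] + rng.normal(scale=0.1, size=50)

# Pandas: returns a labeled DataFrame
corr_pandas = df.corr()

# NumPy: rowvar=False treats each column as a variable, returns a plain array
corr_numpy = np.corrcoef(df.to_numpy(), rowvar=False)

print(np.allclose(corr_pandas.to_numpy(), corr_numpy))
```

The Pandas version is usually more convenient for plotting because it keeps the column names, which can be reused as tick labels.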

### Matplotlib
Use Pandas **.corr()** to compute the correlation between the features in a DataFrame; the result is a correlation matrix, which Matplotlib can draw with **imshow**.

In [None]:
# get correlation matrix
corr = iris_df.corr(numeric_only=True)
fig, ax = plt.subplots()
# create heatmap
im = ax.imshow(corr.values)
# set labels
ax.set_xticks(np.arange(len(corr.columns)))
ax.set_yticks(np.arange(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticklabels(corr.columns)

# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# Loop over data dimensions and create text annotations.
for i in range(len(corr.columns)):
    for j in range(len(corr.columns)):
        text = ax.text(j, i, np.around(corr.iloc[i, j], decimals=2),
                       ha="center", va="center", color="black")

### Pandas 
Although Pandas lets you quickly calculate the correlation matrix, you will need Matplotlib or Seaborn to plot the heatmap.

### Seaborn
The best and easiest way to create a heatmap!! 

In [33]:
# Add a mask so you don't show redundant information
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True)

# Faceting 
Faceting allows you to break the data up across multiple subplots and
combine them into a single figure.  This lets you quickly explore and
visualize your datasets. Faceting is most easily accomplished in Seaborn.

### Seaborn
Faceting in Seaborn uses a **FacetGrid**.  You first define the **FacetGrid**, passing it the data and the column you want to split your data by.  Then use the **map** method on the FacetGrid object to draw the subplots, one for each slice of the data.

In [34]:
# Create FacetGrid
g = sns.FacetGrid(iris_df, col='class')
# use map function with a histogram
g = g.map(plt.hist, 'sepal length (cm)')

# Pairwise Data Comparison
Pandas and Seaborn both have methods to plot pairwise relationships in your dataset - which can be extremely useful - just be careful with large datasets. 

### Pandas 
Pandas has a **scatter_matrix** function, which makes it easy to plot every pairwise relationship between the numeric columns.

In [None]:
fig, ax = plt.subplots(figsize=(12,12))
img = pd.plotting.scatter_matrix(iris_df.drop(['target', 'class'], axis=1), alpha=0.8, ax=ax)

### Seaborn 
Seaborn also has a very useful **pairplot** function that automatically plots a grid of pairwise relationships in the dataset.