ARTI308 - Machine Learning
# Seaborn Overview

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.




## Distribution Plots

Let's discuss some plots that allow us to visualize the distribution of a data set. These plots are:

* distplot
* jointplot
* pairplot
* rugplot
* kdeplot

## Imports

In [17]:
import seaborn as sns  # Import the Seaborn library and alias it as 'sns' for convenience
# Magic command to display plots directly inside the Jupyter notebook
# (kept on its own line because the magic does not accept a trailing comment)
%matplotlib inline

## Data
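Seaborn comes with built-in example data sets! Note that `load_dataset` fetches them from seaborn's online repository, so it needs an internet connection the first time a data set is loaded.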

In [18]:
tips = sns.load_dataset('tips')  # Load the built-in 'tips' dataset from Seaborn into a pandas DataFrame

In [19]:
tips.head()  # Display the first 5 rows of the tips DataFrame to preview the data

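|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |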


## distplot

The distplot shows the distribution of a univariate set of observations.

In [None]:
sns.distplot(tips['total_bill'])  # Plot distribution of 'total_bill' column — shows histogram + KDE (kernel density) curve
# distplot is deprecated since seaborn 0.11, so recent versions emit a warning here; it is safe to ignore

To remove the kde layer and just have the histogram use:

In [None]:
sns.distplot(tips['total_bill'],kde=False,bins=30)  # Plot histogram only (kde=False removes the KDE curve), with 30 bins for finer detail

## jointplot

jointplot() lets you combine two univariate distribution plots with a bivariate plot of the same two variables. The **kind** parameter selects how the two variables are compared:
* “scatter” 
* “reg” 
* “resid” 
* “kde” 
* “hex”

In [None]:
sns.jointplot(x='total_bill',y='tip',data=tips,kind='scatter')  # Scatter joint plot: shows relationship between total_bill and tip, with marginal distributions on axes

In [None]:
sns.jointplot(x='total_bill',y='tip',data=tips,kind='hex')  # Hexbin joint plot: uses hexagonal bins to show density of points (darker = more points)

In [None]:
sns.jointplot(x='total_bill',y='tip',data=tips,kind='reg')  # Regression joint plot: adds a linear regression line and KDE marginal distributions
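
To make the link between the joint view and the marginal plots concrete, here is a small NumPy sketch (it is not seaborn's implementation). It bins hypothetical `x` and `y` values on a two-dimensional grid and checks that summing the joint counts over one variable recovers the one-dimensional histogram of the other, which is what the plots on the margins show.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(20, 8, size=300)             # hypothetical stand-in for total_bill
y = 0.15 * x + rng.normal(0, 1, size=300)   # hypothetical stand-in for tip

# Joint counts on a 10 x 10 grid of rectangular bins
joint, x_edges, y_edges = np.histogram2d(x, y, bins=10)

# Marginal counts computed directly with the same bin edges
x_marginal, _ = np.histogram(x, bins=x_edges)
y_marginal, _ = np.histogram(y, bins=y_edges)

# Summing the joint counts over the other variable gives the marginals
x_matches = np.array_equal(joint.sum(axis=1), x_marginal)
y_matches = np.array_equal(joint.sum(axis=0), y_marginal)
print(x_matches, y_matches)
```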

## pairplot

pairplot will plot pairwise relationships across an entire dataframe (for the numerical columns) and supports a color hue argument (for categorical columns). 

In [None]:
sns.pairplot(tips)  # Plot pairwise relationships for all numerical columns; diagonal shows histograms, off-diagonal shows scatter plots

In [None]:
sns.pairplot(tips,hue='sex',palette='coolwarm')  # Pairplot colored by 'sex' category using 'coolwarm' palette; diagonal shows KDE instead of histograms

## rugplot

A rugplot is a very simple concept: it draws a dash mark for every point of a univariate distribution. Rug plots are the building block of a KDE plot:

In [None]:
sns.rugplot(tips['total_bill'])  # Draw a small vertical tick mark at each data point along the x-axis — shows data density

## kdeplot

kdeplots are [Kernel Density Estimation plots](http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth). A KDE plot replaces every single observation with a Gaussian (Normal) distribution centered on that value and then adds those curves together into one smooth density estimate. For example:

In [None]:
# Don't worry about understanding this code!
# It's just for the diagram below — demonstrates how KDE works step by step
import numpy as np                  # Import NumPy for numerical operations
import matplotlib.pyplot as plt     # Import Matplotlib for plotting
from scipy import stats             # Import scipy.stats for statistical distributions

# Create dataset
dataset = np.random.randn(25)       # Generate 25 random numbers from a standard normal distribution

# Create another rugplot
sns.rugplot(dataset);                # Draw tick marks along x-axis for each data point; semicolon suppresses output

# Set up the x-axis for the plot
x_min = dataset.min() - 2           # Set x-axis minimum: smallest data point minus 2 (for padding)
x_max = dataset.max() + 2           # Set x-axis maximum: largest data point plus 2 (for padding)

# 100 equally spaced points from x_min to x_max
x_axis = np.linspace(x_min,x_max,100)  # Create 100 evenly spaced values for smooth curve plotting

# Set up the bandwidth, for info on this:
url = 'http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth'

bandwidth = ((4*dataset.std()**5)/(3*len(dataset)))**.2  # Calculate optimal bandwidth using Silverman's rule of thumb


# Create an empty kernel list
kernel_list = []                     # Will store individual Gaussian kernels for each data point

# Plot each basis function
for data_point in dataset:           # Loop through each data point in the dataset
    
    # Create a kernel for each point and append to list
    kernel = stats.norm(data_point,bandwidth).pdf(x_axis)  # Create a Gaussian (normal) PDF centered at this data point
    kernel_list.append(kernel)       # Store the kernel values for later summation
    
    # Scale for plotting
    kernel = kernel / kernel.max()   # Normalize kernel so its peak equals 1
    kernel = kernel * .4             # Scale down to 0.4 so individual curves don't dominate the plot
    plt.plot(x_axis,kernel,color = 'grey',alpha=0.5)  # Plot the scaled kernel as a semi-transparent grey curve

plt.ylim(0,1)                        # Set y-axis range from 0 to 1

In [None]:
# To get the kde plot we can sum these basis functions.

# Plot the sum of the basis function
sum_of_kde = np.sum(kernel_list,axis=0)  # Sum all individual kernels element-wise to get the final KDE curve

# Plot figure
fig = plt.plot(x_axis,sum_of_kde,color='indianred')  # Plot the summed KDE curve in indianred color

# Add the initial rugplot
sns.rugplot(dataset,color='indianred')  # Overlay rugplot tick marks in matching indianred color ('color' works across seaborn versions, the short alias 'c' may not)

# Get rid of y-tick marks
plt.yticks([])  # Remove y-axis tick labels (KDE density values aren't meaningful on their own)

# Set title
plt.suptitle("Sum of the Basis Functions")  # Add a title above the plot
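
The summed curve above has the right shape but not the right scale. As a rough sketch under the same bandwidth rule, dividing the sum by the number of observations gives a curve whose area is about 1, and that curve can be cross-checked against `scipy.stats.gaussian_kde`. The names `data`, `grid`, and `density` are introduced here only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(size=25)               # hypothetical sample, like `dataset` above

# Same rule-of-thumb bandwidth as in the diagram code
bandwidth = ((4 * data.std() ** 5) / (3 * len(data))) ** 0.2

grid = np.linspace(data.min() - 2, data.max() + 2, 200)

# One Gaussian kernel per observation, evaluated on the grid
kernels = np.array([stats.norm(point, bandwidth).pdf(grid) for point in data])

# Averaging (sum / n) turns the summed kernels into a proper density
density = kernels.mean(axis=0)

# Riemann-sum approximation of the area under the curve
area = density.sum() * (grid[1] - grid[0])

# Cross-check with SciPy: a scalar bw_method multiplies the sample std (ddof=1)
kde = stats.gaussian_kde(data, bw_method=bandwidth / data.std(ddof=1))
print(round(area, 3), np.allclose(density, kde(grid)))
```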

So with our tips dataset:

In [None]:
sns.kdeplot(tips['total_bill'])   # Plot the KDE (smooth density curve) for the 'total_bill' column
sns.rugplot(tips['total_bill'])   # Overlay rugplot tick marks to show individual data points along the x-axis

In [None]:
sns.kdeplot(tips['tip'])   # Plot the KDE (smooth density curve) for the 'tip' column
sns.rugplot(tips['tip'])   # Overlay rugplot tick marks showing each tip value as a tick on the x-axis

# Categorical Data Plots

Now let's discuss using seaborn to plot categorical data! There are a few main plot types for this:

* catplot (called factorplot in older seaborn releases)
* boxplot
* violinplot
* stripplot
* swarmplot
* barplot
* countplot

Let's go through examples of each!

In [None]:
import seaborn as sns  # Re-import Seaborn (needed if running this section independently)
# Ensure plots display inline in the notebook (comment kept on its own line because the magic does not accept a trailing comment)
%matplotlib inline

In [None]:
tips = sns.load_dataset('tips')  # Reload the 'tips' dataset into a DataFrame
tips.head()                      # Preview the first 5 rows of the dataset

## barplot and countplot

These two very similar plots aggregate data by a categorical feature. **barplot** is a general plot that aggregates a numeric variable within each category using some function, by default the mean:

In [None]:
sns.barplot(x='sex',y='total_bill',data=tips)  # Bar plot showing mean total_bill for each sex category (with confidence interval error bars)

In [None]:
import numpy as np  # Import NumPy for numerical functions like np.std (standard deviation)

You can change the estimator to any function that converts a vector to a scalar:

In [None]:
sns.barplot(x='sex',y='total_bill',data=tips,estimator=np.std)  # Bar plot using standard deviation as the estimator instead of the default mean
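
The height of each bar is simply the estimator applied to the values in that category. A quick way to sanity-check a bar plot is to compute the same aggregates with pandas. The sketch below uses a hypothetical five-row frame `df` rather than the real tips data.

```python
import pandas as pd

# Hypothetical miniature version of the tips data
df = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male"],
    "total_bill": [10.0, 20.0, 30.0, 20.0, 40.0],
})

# One aggregate per category: these are the heights the bars would have
bar_heights = df.groupby("sex")["total_bill"].agg(["mean", "median"])
print(bar_heights)
```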

### countplot

This is essentially the same as barplot, except that the estimator explicitly counts the number of occurrences, which is why we only pass the x value:

In [None]:
sns.countplot(x='sex',data=tips)  # Count plot: shows the number of occurrences (count) for each sex category
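
The numbers behind a count plot can be reproduced with `value_counts`. This is a small sketch on a hypothetical column `sex`, not on the real tips data.

```python
import pandas as pd

# Hypothetical categorical column, standing in for tips['sex']
sex = pd.Series(["Female", "Male", "Male", "Female", "Male"], name="sex")

# Number of rows per category: the bar heights a count plot would show
counts = sex.value_counts()
print(counts)
```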

## boxplot and violinplot

boxplots and violinplots are used to show the distribution of quantitative data across categories. A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.

In [None]:
sns.boxplot(x="day", y="total_bill", data=tips,palette='rainbow')  # Box plot of total_bill grouped by day, with rainbow color palette; shows median, quartiles, and outliers

In [None]:
# Can do entire dataframe with orient='h'
sns.boxplot(data=tips,palette='rainbow',orient='h')  # Horizontal box plot for all numerical columns in the DataFrame at once

In [None]:
sns.boxplot(x="day", y="total_bill", hue="smoker",data=tips, palette="coolwarm")  # Box plot grouped by day AND split by smoker status using coolwarm colors
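
The outlier rule mentioned at the start of this section can be worked through by hand. The sketch below assumes the default whisker setting (`whis=1.5`) and a hypothetical sample `values`; it computes the quartiles, the fences at 1.5 times the inter-quartile range, and the points that fall outside them.

```python
import numpy as np

# Hypothetical sample with one unusually large value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers end at the most extreme observations still inside the fences
whisker_low, whisker_high = inside.min(), inside.max()
print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Here only the value 50 lies beyond the upper fence, so it would be drawn as an individual outlier point while the whiskers stop at 1 and 9.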

### violinplot
A violin plot plays a similar role as a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.

In [None]:
sns.violinplot(x="day", y="total_bill", data=tips,palette='rainbow')  # Violin plot: shows distribution shape (KDE) of total_bill for each day with rainbow colors

In [None]:
sns.violinplot(x="day", y="total_bill", data=tips,hue='sex',palette='Set1')  # Violin plot grouped by day, with separate violins for each sex using 'Set1' palette

In [None]:
sns.violinplot(x="day", y="total_bill", data=tips,hue='sex',split=True,palette='Set1')  # Split violin: each violin is split in half — one side per sex for direct comparison

## stripplot and swarmplot
The stripplot will draw a scatterplot where one variable is categorical. A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.

The swarmplot is similar to stripplot(), but the points are adjusted (only along the categorical axis) so that they don’t overlap. This gives a better representation of the distribution of values, although it does not scale as well to large numbers of observations (both in terms of the ability to show all the points and in terms of the computation needed to arrange them).

In [None]:
sns.stripplot(x="day", y="total_bill", data=tips, palette='rainbow')  # Strip plot: scatter plot of total_bill for each day category with rainbow colors

In [None]:
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True, palette='rainbow')  # Strip plot with jitter=True: adds random horizontal noise so overlapping points are visible

In [None]:
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True,hue='sex',palette='Set1')  # Strip plot with jitter, colored by sex using 'Set1' palette

In [None]:
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True,hue='sex',palette='Set1',dodge=True)  # dodge=True separates the sex groups side by side instead of overlapping

In [None]:
sns.swarmplot(x="day", y="total_bill", data=tips)  # Swarm plot: like stripplot but points are adjusted so they don't overlap — better shows distribution

### Combining Categorical Plots

In [None]:
sns.violinplot(x="tip", y="day", data=tips,palette='rainbow')           # Draw violin plot of tip by day with rainbow colors as the background
sns.swarmplot(x="tip", y="day", data=tips,color='black',size=3)         # Overlay swarm plot with small black dots to show individual data points on top

## catplot

catplot (called factorplot in older seaborn releases) is the most general form of a categorical plot. It takes a kind parameter to select the plot type:

In [None]:
sns.catplot(x='sex',y='total_bill',data=tips,kind='bar')  # catplot: general categorical plot; kind='bar' creates a bar plot (can also use 'box', 'violin', 'strip', etc.)

# Matrix Plots

Matrix plots allow you to plot data as color-encoded matrices and can also be used to indicate clusters within the data (later in the machine learning section we will learn how to formally cluster data).

Let's begin by exploring seaborn's heatmap and clustermap:

In [None]:
import seaborn as sns  # Re-import Seaborn for this section on matrix plots
# Ensure plots render inline in the notebook (comment kept on its own line because the magic does not accept a trailing comment)
%matplotlib inline

In [None]:
flights = sns.load_dataset('flights')  # Load the built-in 'flights' dataset — contains monthly passenger counts by year

In [None]:
tips = sns.load_dataset('tips')  # Reload the 'tips' dataset for use in this section

In [None]:
tips.head()  # Preview the first 5 rows of the tips dataset

In [None]:
flights.head()  # Preview the first 5 rows of the flights dataset (year, month, passengers)

## Heatmap

In order for a heatmap to work properly, your data should already be in matrix form; the sns.heatmap function basically just colors it in for you. For example:

In [None]:
tips.head()  # Display the first 5 rows to see the non-matrix form of the data

In [None]:
# Matrix form for correlation data
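tips.corr(numeric_only=True)  # Compute the pairwise correlation matrix of the numerical columns; returns a square DataFrame (numeric_only needs pandas 1.5+, and pandas 2.x raises an error on the categorical columns without it)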

In [None]:
sns.heatmap(tips.corr(numeric_only=True))  # Visualize the correlation matrix of the numerical columns as a color-encoded heatmap (darker/lighter = stronger correlation)

In [None]:
sns.heatmap(tips.corr(numeric_only=True),cmap='coolwarm',annot=True)  # Heatmap with 'coolwarm' colormap and annot=True to display correlation values in each cell
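
Since a heatmap needs a numeric matrix, it can help to look at how the correlation matrix is built before plotting it. The sketch below uses a hypothetical mixed-type frame `df`; selecting the numeric columns first is one way to get a square matrix that works regardless of how a given pandas version treats non-numeric columns in `corr`.

```python
import pandas as pd

# Hypothetical mixed-type frame, similar in spirit to tips
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.5, 3.0, 4.5, 6.0],
    "day": ["Thur", "Fri", "Sat", "Sun"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = df.select_dtypes("number").corr()
print(corr)
```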

Or for the flights data:

In [None]:
flights.pivot_table(values='passengers',index='month',columns='year')  # Reshape flights data into a matrix: rows=months, columns=years, values=passenger counts

In [None]:
pvflights = flights.pivot_table(values='passengers',index='month',columns='year')  # Create pivot table and store in variable
sns.heatmap(pvflights)  # Display the pivot table as a heatmap — color intensity represents passenger count

In [None]:
sns.heatmap(pvflights,cmap='magma',linecolor='white',linewidths=1)  # Heatmap with 'magma' colormap, white gridlines (1px wide) separating each cell
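
The reshaping step is what turns long-form records into the matrix a heatmap needs. Here is a minimal sketch of that step on a hypothetical four-row frame `long_form`, with month numbers used in place of month names.

```python
import pandas as pd

# Hypothetical long-form records: one row per (year, month) observation
long_form = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": [1, 2, 1, 2],
    "passengers": [112, 118, 115, 126],
})

# Wide matrix: one row per month, one column per year
matrix = long_form.pivot_table(values="passengers", index="month", columns="year")
print(matrix)
```

Each cell of `matrix` is the value that the heatmap would color, with the row and column labels becoming the axis tick labels.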

## clustermap

The clustermap uses hierarchical clustering to produce a clustered version of the heatmap. For example:

In [None]:
sns.clustermap(pvflights)  # Clustered heatmap: reorders rows and columns by hierarchical clustering to group similar values together

Notice that the years and months are no longer in order; instead they are grouped by similarity in value (passenger count). That means we can begin to infer things from this plot, such as August and July being similar (which makes sense, since both are summer travel months).
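
The reordering comes from hierarchical clustering, which clustermap delegates to SciPy. The sketch below runs the same kind of clustering directly on a hypothetical four-row matrix `rows`, using average linkage and Euclidean distance (clustermap's documented defaults), to show how similar rows end up next to each other.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, leaves_list

# Hypothetical matrix: rows 0-1 are alike, rows 2-3 are alike
rows = np.array([
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [10.0, 20.0, 30.0],
    [10.5, 19.0, 31.0],
])

# Average-linkage clustering on Euclidean distances between rows
tree = linkage(rows, method="average", metric="euclidean")

# Cut the tree into two groups and read off the leaf order of the dendrogram
labels = fcluster(tree, t=2, criterion="maxclust")
order = leaves_list(tree)
print(labels, order)
```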

In [None]:
# More options to get the information a little clearer like normalization
sns.clustermap(pvflights,standard_scale=1)  # Clustered heatmap with column-wise normalization (standard_scale=1) for better comparison across years
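
According to the seaborn documentation, `standard_scale=1` rescales each column by subtracting its minimum and dividing by its maximum. A rough pandas equivalent, shown on a hypothetical matrix and not taken from seaborn's source, looks like this:

```python
import pandas as pd

# Hypothetical month-by-year matrix
matrix = pd.DataFrame(
    {1949: [112.0, 118.0, 132.0], 1950: [115.0, 126.0, 141.0]},
    index=["Jan", "Feb", "Mar"],
)

# Column-wise rescaling: subtract each column's minimum, then divide by the new maximum
shifted = matrix - matrix.min()
scaled = shifted / shifted.max()
print(scaled)
```

After rescaling, every column runs from 0 to 1, so years with very different passenger totals become comparable.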

# Regression Plots

Seaborn has many built-in capabilities for regression plots. However, we won't really discuss regression until the machine learning section of the course, so we will only cover the **lmplot()** function for now.

**lmplot** displays linear model fits, and it also conveniently lets you split those plots into separate panels, or color them by hue, based on features of the data.

Let's explore how this works:

In [None]:
import seaborn as sns  # Re-import Seaborn for the regression plots section
# Ensure plots render inline in the notebook (comment kept on its own line because the magic does not accept a trailing comment)
%matplotlib inline

In [None]:
tips = sns.load_dataset('tips')  # Reload the tips dataset for regression plot examples

In [None]:
tips.head()  # Preview the first 5 rows of the tips dataset

## lmplot()

In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips)  # Linear model plot: scatter plot of total_bill vs tip with a fitted regression line and confidence interval
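
By default the line drawn by lmplot is an ordinary least squares fit. To see the numbers behind such a line, a degree-1 `np.polyfit` gives the slope and intercept. The values below are hypothetical and lie exactly on a known line, so the fit is easy to check.

```python
import numpy as np

# Hypothetical bills and tips lying exactly on the line tip = 0.15 * bill + 0.5
total_bill = np.array([10.0, 20.0, 30.0, 40.0])
tip = 0.15 * total_bill + 0.5

# A degree-1 polynomial fit is an ordinary least squares line
slope, intercept = np.polyfit(total_bill, tip, deg=1)

# Predicted tip for a bill of 25
predicted = slope * 25.0 + intercept
print(slope, intercept, predicted)
```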

In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips,hue='sex')  # lmplot with separate regression lines colored by sex — shows different trends per group

In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips,hue='sex',palette='coolwarm')  # Same as above but with 'coolwarm' color palette for better visual distinction

## Using a Grid

We can add more variable separation through columns and rows with the use of a grid. Just indicate this with the col or row arguments:

In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips,col='sex')  # Create separate subplots (columns) for each sex — one regression plot per category

In [None]:
sns.lmplot(x="total_bill", y="tip", row="sex", col="time",data=tips)  # Grid of plots: rows split by sex, columns split by time (Lunch/Dinner) — 2x2 grid

In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips,col='day',hue='sex',palette='coolwarm')  # One subplot per day (4 columns), with separate regression lines for each sex in coolwarm colors

## Aspect and Size

Seaborn figures can have their size and aspect ratio adjusted with the **height** and **aspect** parameters:

In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips,col='day',hue='sex',palette='coolwarm',
          aspect=0.6,height=8)  # Same plot with custom sizing: height=8 inches tall, aspect=0.6 makes each subplot narrower (width = 0.6 * height)
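
The sizing arithmetic for the call above can be written out explicitly. This is only a back-of-the-envelope sketch: the overall figure size is approximate because the legend sits outside the facets.

```python
# Values taken from the lmplot call above
height = 8        # height of each facet, in inches
aspect = 0.6      # width of each facet divided by its height
n_columns = 4     # one facet per day in the tips data

facet_width = aspect * height

# Rough overall size; the legend placed outside the facets adds some extra width
approx_figure_size = (n_columns * facet_width, height)
print(facet_width, approx_figure_size)
```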

### Reference:

* https://seaborn.pydata.org/ - Seaborn: statistical data visualization


* https://seaborn.pydata.org/tutorial/color_palettes.html - Color palettes