# Statistical Data Graphics with Seaborn

In this notebook, we use Python's Seaborn and Matplotlib libraries, along with pandas' built-in plotting methods, to create statistical plots. These plots help us visualize variable distributions, identify outliers, and discover relationships between variables.

## Importing Libraries

We will import standard libraries such as `numpy`, `pandas`, `matplotlib`, and `seaborn`. Additionally, we set plotting parameters for a consistent style.

In [None]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from pandas.plotting import scatter_matrix

import matplotlib.pyplot as plt
from matplotlib import rcParams  # same rcParams object that pylab re-exports
import seaborn as sns

In [None]:
%matplotlib inline
rcParams['figure.figsize'] = 5, 6  # default figure size as (width, height) in inches
sns.set_style('whitegrid')  # Seaborn style options include: darkgrid, whitegrid, dark, white, ticks

## Loading the Dataset

We will use the `mtcars` dataset. Adjust the file path below so that it points to `mtcars.csv` in your own environment. We will also set the car names as the index so that rows can be looked up by name.

In [None]:
address = '/workspaces/python-for-data-science-and-machine-learning-essential-training-part-1-3006708/data/mtcars.csv'

cars = pd.read_csv(address)

# Rename columns for easier access
cars.columns = ['car_names','mpg','cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb']
cars.index = cars.car_names

cars.head()
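To illustrate why a name-based index is convenient, here is a minimal sketch on a tiny stand-in DataFrame. The rows, car labels, and values below are made up for illustration and are not taken from `mtcars.csv`; only the indexing pattern mirrors the cell above.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real file; all values are made up.
toy = pd.DataFrame({
    "car_names": ["Car A", "Car B", "Car C"],
    "mpg": [21.0, 18.5, 30.2],
    "hp": [110, 175, 66],
})

# Same pattern as the cell above: reuse the name column as the row index.
toy.index = toy.car_names

# Label-based lookup of a single value with .loc
car_b_mpg = toy.loc["Car B", "mpg"]

# Boolean filtering keeps the car names as row labels
efficient = toy[toy["mpg"] > 20].index.tolist()

print(car_b_mpg)
print(efficient)
```

With the names as row labels, filtered or sorted results stay readable because each row still carries its car name.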

## Eyeballing Dataset Distributions with Histograms

Histograms help us visualize the distribution of a single variable. We'll start by looking at the `mpg` (miles per gallon) variable.

In [None]:
# Isolate the mpg variable
mpg = cars['mpg']

# Method 1: Using pandas plot
mpg.plot(kind='hist', title='MPG Distribution (Pandas)')
plt.xlabel('Miles per Gallon')
plt.show()

In [None]:
# Method 2: Using matplotlib directly
plt.hist(mpg, bins=10, color='skyblue', edgecolor='black')
plt.title('MPG Distribution (Matplotlib)')
plt.xlabel('Miles per Gallon')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Method 3: Using Seaborn
sns.displot(mpg, kde=True, color='green')  # kde=True adds a smooth density curve
plt.title('MPG Distribution (Seaborn)')
plt.show()
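Behind each of these plots is the same computation: values are sorted into bins and counted. The sketch below shows that step with `np.histogram` on a handful of made-up numbers (not the real `mpg` column); the bin edges are an arbitrary choice for illustration.

```python
import numpy as np

# Made-up values for illustration only
values = np.array([10, 12, 15, 15, 18, 21, 21, 22, 25, 33])

# Explicit bin edges; each bin is half-open [left, right),
# except the last one, which also includes its right edge.
edges = [10, 15, 20, 25, 30, 35]

counts, bin_edges = np.histogram(values, bins=edges)

# Each count is the height of one histogram bar
for left, right, n in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left} to {right}: {n}")
```

Changing the number or width of the bins changes the counts, which is why the same data can look different under different `bins` settings.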

## Seeing Scatterplots in Action

Scatter plots are useful for visualizing the relationship between two continuous variables. We will look at `hp` (horsepower) vs `mpg`.

In [None]:
# Scatter plot using pandas
cars.plot(kind='scatter', x='hp', y='mpg', c='darkgray', s=150, title='HP vs MPG')
plt.xlabel('Horsepower')
plt.ylabel('Miles per Gallon')
plt.show()

In [None]:
# Scatter plot using Seaborn; regplot also draws a fitted linear regression line with a confidence band
sns.regplot(x='hp', y='mpg', data=cars, scatter=True, color='red')
plt.title('HP vs MPG (Seaborn)')
plt.show()
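The line that `regplot` overlays is, by default, an ordinary least-squares straight line. As a rough numeric companion, the sketch below fits the same kind of line with `np.polyfit` on made-up horsepower and mileage values (not the real dataset), so the slope and intercept can be read off directly.

```python
import numpy as np

# Made-up values for illustration only
hp = np.array([50, 100, 150, 200, 250], dtype=float)
mpg = np.array([34, 30, 26, 19, 16], dtype=float)

# Degree-1 polynomial fit = least-squares straight line: mpg ~ slope * hp + intercept
slope, intercept = np.polyfit(hp, mpg, deg=1)

# Use the fitted line to estimate mpg at a horsepower value of 120
predicted_at_120 = slope * 120 + intercept

print(f"slope={slope:.3f}, intercept={intercept:.1f}")
print(f"predicted mpg at 120 hp: {predicted_at_120:.2f}")
```

A negative slope here corresponds to the downward-sloping line in the plot: in this toy data, higher horsepower goes with lower mileage.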

## Generating a Scatter Plot Matrix

A scatter plot matrix (pair plot) shows the pairwise relationships among several variables at once, which is useful for detecting correlations or other patterns.

In [None]:
# Using all numeric variables (pairplot skips non-numeric columns such as car_names)
sns.pairplot(cars)
plt.show()

In [None]:
# Using a subset of variables for clarity
cars_subset = cars[['mpg', 'disp', 'hp', 'wt']]
sns.pairplot(cars_subset)
plt.show()
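A pair plot is often read alongside a correlation matrix, which puts one number on each panel's linear relationship. The sketch below uses `DataFrame.corr()` on a small made-up table; the column names echo the subset above, but the values are invented for illustration.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "mpg": [34, 30, 26, 19, 16],
    "hp": [50, 100, 150, 200, 250],
    "wt": [1.8, 2.2, 2.9, 3.4, 3.9],
})

# Pearson correlation between every pair of columns
corr = toy.corr()

print(corr.round(2))
```

Each off-diagonal entry matches one scatter panel in the pair plot: values near -1 or 1 suggest a strong linear trend, and values near 0 suggest little linear relationship.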

## Building Boxplots

Boxplots provide a visual summary of the distribution, highlighting the median, quartiles, and potential outliers. We can compare distributions for different categories, e.g., automatic vs manual transmission (`am`).

In [None]:
# Boxplots using pandas
cars.boxplot(column='mpg', by='am', grid=False, figsize=(6,4), patch_artist=True)
plt.title('MPG by Transmission (0=Automatic, 1=Manual)')
plt.suptitle('')  # Remove default pandas title
plt.xlabel('Transmission')
plt.ylabel('Miles per Gallon')
plt.show()

In [None]:
# Boxplots using Seaborn
sns.boxplot(x='am', y='mpg', data=cars, hue='am', palette='hls')
plt.title('MPG by Transmission (Seaborn)')
plt.xlabel('Transmission (0=Automatic, 1=Manual)')
plt.ylabel('Miles per Gallon')
plt.show()
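To make the boxplot elements concrete, the sketch below computes the median, quartiles, and the 1.5 × IQR fences that Matplotlib and Seaborn use by default to decide which points are drawn as outliers. The values are made up for illustration and are not from the dataset.

```python
import numpy as np

# Made-up values for illustration only; 40 is deliberately far from the rest
values = np.array([10, 15, 16, 18, 19, 20, 21, 22, 24, 40], dtype=float)

# Box edges and center line
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Default whisker rule: points beyond 1.5 * IQR from the box are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print(f"outliers: {outliers.tolist()}")
```

The whiskers themselves extend to the most extreme data points that still lie inside the fences, so they do not necessarily reach the fence values.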