# 7. Multivariate EDA - Categorical vs Categorical
All of the previous EDA focused on analyzing single columns of data independently of the others. The same kind of analysis extends naturally to multiple columns at once.

With two kinds of variables, there are three possible bivariate combinations:

* categorical vs categorical
* categorical vs continuous
* continuous vs continuous

| Univariate             | Graphical                               | Non-Graphical                     | 
|-------------|-----------------------------------------|-----------------------------------|
| Categorical | Bar chart of frequencies (count/percent) | `value_counts` (count/percent) |
| Continuous  | Histogram/KDE, box/violin  | Central tendency (mean/median/mode), variance, std, skew, IQR  |

| Multivariate            | Graphical                               | Non-Graphical                     | 
|-------------|-----------------------------------------|-----------------------------------|
| Categorical vs Categorical | heat map, mosaic plot | Cross tabulation (count/percent) |
| Continuous vs Continuous  | all pairwise scatterplots, kde, heatmaps |  all pairwise correlation/regression   |
| Categorical vs Continuous  | All seaborn "categorical" plots | Summary statistics for each group |

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

diamonds = pd.read_csv('../data/diamonds.csv')

new_order = ['cut', 'color', 'clarity', 'carat', 'price', 'x', 'y', 'z', 'depth', 'table']
diamonds = diamonds[new_order]

order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
diamonds['cut'] = pd.Categorical(diamonds['cut'], ordered=True, categories=order)

order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
diamonds['color'] = pd.Categorical(diamonds['color'], ordered=True, categories=order)

order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
diamonds['clarity'] = pd.Categorical(diamonds['clarity'], ordered=True, categories=order)

## Non-Graphical Categorical vs Categorical
A cross-tabulation counts how many rows fall into each combination of two categorical variables. Here, `pivot_table` with `aggfunc='size'` produces those counts.

In [None]:
col_clar_ct = diamonds.pivot_table(index='clarity', columns='color', aggfunc='size')
col_clar_ct

In [None]:
cut_color_ct = diamonds.pivot_table(index='cut', columns='color', aggfunc='size')
cut_color_ct
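The summary table at the top lists cross-tabulation as "count/percent", but the cells above only show counts. As a minimal sketch, pandas' `pd.crosstab` can produce both. The `toy` DataFrame below is a hypothetical stand-in for `diamonds`, used only so that the example runs on its own.

```python
import pandas as pd

# Hypothetical toy data standing in for the diamonds DataFrame
toy = pd.DataFrame({
    'cut':   ['Ideal', 'Ideal', 'Good', 'Good', 'Good', 'Fair'],
    'color': ['D', 'E', 'D', 'D', 'E', 'E'],
})

# Raw counts for each (cut, color) combination
counts = pd.crosstab(index=toy['cut'], columns=toy['color'])

# Row proportions: each row sums to 1, so cuts with few rows stay comparable
row_pct = pd.crosstab(index=toy['cut'], columns=toy['color'], normalize='index')

# margins=True appends an 'All' row and column holding the totals
with_totals = pd.crosstab(index=toy['cut'], columns=toy['color'], margins=True)

print(counts)
print(row_pct)
print(with_totals)
```

Passing `normalize='columns'` or `normalize='all'` instead gives column proportions or proportions of the grand total.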

## Graphical Categorical vs Categorical
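Calling `plot(kind='bar')` on the cross-tabulation draws one group of bars per clarity level (the index), with one bar per color (the columns).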

In [None]:
col_clar_ct.plot(kind='bar', figsize=(12, 4))

Seaborn's `countplot` produces a similar chart directly from the raw data, without building the cross-tabulation first.

In [None]:
ax = sns.countplot(x='clarity', hue='color', data=diamonds)
ax.figure.set_size_inches(12, 4)

### Make heatmaps
A heatmap uses the values of the table to color each cell.

In [None]:
# bulk of the data is in the middle
sns.heatmap(col_clar_ct)

In [None]:
sns.heatmap(cut_color_ct)
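When one category is much more common than the others, raw counts can make the heatmap mostly reflect group size. One possible variation, sketched here on a small hypothetical count table (`toy_ct`, not taken from the diamonds data), is to annotate the cells and to plot row shares next to the counts.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical cross-tabulation of counts (rows: cut, columns: color)
toy_ct = pd.DataFrame(
    [[5, 20, 10], [30, 120, 60], [8, 25, 12]],
    index=['Fair', 'Good', 'Ideal'],
    columns=['D', 'E', 'F'],
)

# Divide each row by its total so every row sums to 1
row_share = toy_ct.div(toy_ct.sum(axis=1), axis=0)

fig, (ax_counts, ax_share) = plt.subplots(1, 2, figsize=(12, 4))

# annot=True writes the value in each cell; fmt controls the number format
sns.heatmap(toy_ct, annot=True, fmt='d', cmap='Blues', ax=ax_counts)
ax_counts.set_title('Counts')

sns.heatmap(row_share, annot=True, fmt='.2f', cmap='Blues', ax=ax_share)
ax_share.set_title('Share within each row')
```

The same two calls should work on `col_clar_ct` or `cut_color_ct` in place of `toy_ct`, assuming those tables contain no missing cells.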

# Exercise
Replicate this analysis on your own dataset.