# 4. Graphical Univariate Analysis

### Introduction
In this notebook, we explore individual columns of our data graphically. The table of analysis tools available for each type of data is reprinted below:




| Univariate             | Graphical                               | Non-Graphical                     | 
|-------------|-----------------------------------------|-----------------------------------|
| Categorical | Bar chart of frequencies (count/percent) | `value_counts` (count/percent) |
| Continuous  | Histogram/KDE, box/violin  | Central tendency (mean/median/mode), variance, std, skew, IQR  |

| Multivariate            | Graphical                               | Non-Graphical                     | 
|-------------|-----------------------------------------|-----------------------------------|
| Categorical vs Categorical | heat map, mosaic plot | Cross tabulation (count/percent) |
| Continuous vs Continuous  | all pairwise scatterplots, kde, heatmaps |  all pairwise correlation/regression   |
| Categorical vs Continuous  | All seaborn "categorical" plots | Summary statistics for each group |

The following cell reruns the setup from the earlier analysis.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pd.options.display.max_colwidth = 120

diamonds = pd.read_csv('../data/diamonds.csv')

new_order = ['cut', 'color', 'clarity','carat', 'price', 'x', 'y','z','depth', 'table']
diamonds = diamonds[new_order]

order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
diamonds['cut'] = pd.Categorical(diamonds['cut'], ordered=True, categories=order)

order = ['J', 'I', 'H', 'G', 'F', 'E', 'D']
diamonds['color'] = pd.Categorical(diamonds['color'], ordered=True, categories=order)

order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
diamonds['clarity'] = pd.Categorical(diamonds['clarity'], ordered=True, categories=order)

## Graphical univariate analysis on categorical variables
A bar plot of the frequencies is one of the few plots available for a single categorical variable.

In [None]:
diamonds['color'].value_counts().sort_index().plot(kind='bar')
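
The table at the top lists both counts and percents for categorical bar charts. A minimal sketch of the percent version follows. It uses a small made-up Series in place of `diamonds['color']`, and the helper name `percent_frequencies` is an illustrative assumption, not part of pandas.

```python
import pandas as pd

def percent_frequencies(series):
    """Return the share of each category as a percentage, in category order."""
    # normalize=True turns raw counts into proportions that sum to 1
    return series.value_counts(normalize=True).sort_index() * 100

# Made-up sample standing in for diamonds['color']
colors = pd.Series(pd.Categorical(
    ['J', 'G', 'G', 'E', 'D', 'G', 'E', 'J'],
    ordered=True, categories=['J', 'I', 'H', 'G', 'F', 'E', 'D']))

pct = percent_frequencies(colors)
print(pct)
```

In the notebook, `percent_frequencies(diamonds['color']).plot(kind='bar')` should draw the same bars with percentages on the y-axis.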

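### Seaborn orders the axis automatically
Conveniently, seaborn's `countplot` counts the values for us and orders the axis by the categories of an ordered categorical column, so no call to `value_counts` or `sort_index` is needed.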

In [None]:
sns.countplot(x='color', data=diamonds)

In [None]:
sns.countplot(x='cut', data=diamonds)

In [None]:
sns.countplot(x='clarity', data=diamonds)
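
The bars above follow the category order declared with `pd.Categorical`, not alphabetical order or the order in which values appear. The sketch below, on a small made-up column, shows where that order comes from. For a plain string column, `countplot` also accepts an explicit `order=` list.

```python
import pandas as pd

# Made-up values standing in for diamonds['cut']
raw = pd.Series(['Ideal', 'Fair', 'Premium', 'Good', 'Ideal', 'Very Good'])

order = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
cut = pd.Series(pd.Categorical(raw, ordered=True, categories=order))

# The declared order is stored on the column itself
print(list(cut.cat.categories))

# Sorting the categorical column uses the declared order ...
print(list(cut.sort_values()))

# ... while sorting the raw strings is alphabetical
print(sorted(raw))
```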

## Pie charts
Lots of data visualization experts say to [avoid pie charts](https://www.quora.com/How-and-why-are-pie-charts-considered-evil-by-data-visualization-experts), but they are possible with Pandas.

In [None]:
diamonds['color'].value_counts().plot(kind='pie')

## Graphical univariate analysis on continuous variables
There are many more options for graphical analysis of continuous variables. The plots typically show how the data is distributed. Let's plot the distribution of the `carat` column with a KDE plot. Notice the sharp increases in density around whole and half numbers. This might indicate that people place a higher value on diamonds that cross some threshold.

For instance, it's possible that a 1-carat diamond is preferred over one that is 0.99 carats, even though the difference in size is not readily distinguishable with the naked eye.

In [None]:
diamonds['carat'].plot(kind='kde', figsize=(10, 5), xlim=(0, 3))
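
One way to probe the threshold idea is to count the diamonds just below and exactly at a round carat weight. This is a minimal sketch with made-up carat values and a hypothetical helper, `counts_around`. It assumes the weights are recorded to two decimal places.

```python
import pandas as pd

def counts_around(carat, threshold, step=0.01):
    """Count values just below and exactly at a threshold (2-decimal data)."""
    rounded = carat.round(2)            # guard against floating-point noise
    below = round(threshold - step, 2)  # e.g. 0.99 for a threshold of 1.0
    return {
        below: int((rounded == below).sum()),
        threshold: int((rounded == round(threshold, 2)).sum()),
    }

# Made-up carat weights for illustration only
sample = pd.Series([0.90, 0.99, 1.00, 1.00, 1.01, 1.00, 1.49, 1.50, 1.50])

print(counts_around(sample, 1.0))
print(counts_around(sample, 1.5))
```

On the real data, `counts_around(diamonds['carat'], 1.0)` would show whether the jump seen in the plot is also present in the raw counts.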

Histograms are also a good choice for getting an idea of the distribution. By default, a histogram shows the raw count in each bin. We can clearly see spikes at carat sizes of 1, 1.5, and 2.

In [None]:
diamonds['carat'].plot(kind='hist', figsize=(10, 5), xlim=(0, 3), bins=100)
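
The numbers behind a histogram can be inspected with `np.histogram`. It also shows the difference between raw counts (the default) and a density whose total area is 1. The values below are made up for illustration.

```python
import numpy as np

# Made-up carat weights for illustration only
values = np.array([0.3, 0.31, 0.5, 0.7, 1.0, 1.0, 1.01, 1.5, 1.5, 2.0])

# Raw counts per bin: six equal-width bins over the range 0 to 3
counts, edges = np.histogram(values, bins=6, range=(0, 3))
print(edges)
print(counts)

# density=True rescales the bars so that their total area is 1
density, _ = np.histogram(values, bins=6, range=(0, 3), density=True)
bin_width = edges[1] - edges[0]
print((density * bin_width).sum())
```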

Seaborn's `distplot` plots a histogram and a KDE on the same axes. Note that `distplot` is deprecated as of seaborn 0.11 in favor of `histplot` and `displot`.

In [None]:
ax = sns.distplot(diamonds['carat'])
ax.figure.set_size_inches(12, 6)

Next, plot the distribution of the price.

In [None]:
ax = sns.distplot(diamonds['price'])
ax.figure.set_size_inches(12, 6)

We can also plot just the cumulative distribution.

In [None]:
sns.distplot(diamonds['carat'], kde_kws={'cumulative': True}, hist=False)
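
The cell above draws a smoothed, KDE-based cumulative curve. For comparison, the empirical cumulative distribution needs no smoothing: sort the values and step up by 1/n at each one. This is a minimal sketch on made-up values, with `ecdf` as a hypothetical helper name.

```python
import numpy as np

def ecdf(values):
    """Return sorted values and the fraction of data at or below each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Made-up carat weights for illustration only
x, y = ecdf([1.0, 0.3, 0.7, 1.5, 0.5])
print(x)
print(y)
```

In the notebook, `plt.step(x, y, where='post')` would draw the result as a step function.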

### Box plots
Box plots are useful for discovering outliers. The line in the middle of the box is the median of the distribution, and the sides of the box are the first and third quartiles, so the box spans the middle 50% of the data. Its width is known as the interquartile range (IQR). The bars outside the box are the **whiskers**, which by default extend to the most extreme data points within 1.5 × IQR of the box edges. All data falling outside the whiskers is plotted individually.

In [None]:
sns.boxplot(x='carat', data=diamonds)

In [None]:
sns.boxplot(x='price', data=diamonds)
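
The quantities that a box plot draws can also be computed directly. This sketch uses made-up numbers and a hypothetical helper, `box_stats`. It assumes the common 1.5 × IQR rule for the whiskers and pandas' default linear interpolation for the quartiles.

```python
import pandas as pd

def box_stats(series, whis=1.5):
    """Compute quartiles, whisker ends, and outliers for a box plot."""
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = series[(series >= lower_fence) & (series <= upper_fence)]
    return {
        'q1': q1, 'median': median, 'q3': q3, 'iqr': iqr,
        # Whiskers stop at the most extreme points still inside the fences
        'whiskers': (inside.min(), inside.max()),
        # Everything beyond the fences is drawn as an individual point
        'outliers': series[(series < lower_fence) | (series > upper_fence)].tolist(),
    }

# Made-up values for illustration only
stats = box_stats(pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50]))
print(stats)
```

On the real data, `box_stats(diamonds['price'])['outliers']` would list the prices plotted as individual points.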

# Exercise
Complete these steps on your own dataset.